Tuning VAEs
Exploring how beta-VAE disentangles the latent dimensions
Colab: https://colab.research.google.com/drive/1wvQAfGVnJlMbc3HAYUpOcit17TkSOYA_#scrollTo=SckrR0Okh0Q1
How VAEs work
An auto-encoder is a neural network with a layer containing fewer nodes than its input, creating a "bottleneck" that forces the network to squeeze information into fewer dimensions by learning a more efficient representation.
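As a rough illustration (not the exact model from the Colab), here is a minimal sketch of a plain auto-encoder with a two-node bottleneck, assuming flattened 28x28 MNIST inputs and a Keras implementation:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

# Minimal auto-encoder sketch: 784 -> 128 -> 2 -> 128 -> 784.
inputs = layers.Input(shape=(784,))
encoded = layers.Dense(128, activation="relu")(inputs)
bottleneck = layers.Dense(2, activation="relu")(encoded)   # the "bottleneck": only 2 nodes
decoded = layers.Dense(128, activation="relu")(bottleneck)
outputs = layers.Dense(784, activation="sigmoid")(decoded)

autoencoder = Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="binary_crossentropy")
```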

A VAE (Variational Auto-encoder) is an auto-encoder with a probabilistic latent space. Rather than mapping each input to a single point in the latent space, the encoder maps each input to a probability distribution, typically a Gaussian parameterized by a mean and a variance.
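The probabilistic latent layer is usually implemented with the reparameterization trick: the encoder outputs a mean and a log-variance, and a latent point is sampled from that Gaussian. A minimal sketch, again assuming Keras:

```python
import tensorflow as tf
from tensorflow.keras import layers

latent_dim = 2  # illustrative value

class Sampling(layers.Layer):
    """Samples z = mean + sigma * epsilon, with epsilon ~ N(0, 1)."""
    def call(self, inputs):
        z_mean, z_log_var = inputs
        epsilon = tf.random.normal(shape=tf.shape(z_mean))
        return z_mean + tf.exp(0.5 * z_log_var) * epsilon

encoder_inputs = layers.Input(shape=(784,))
h = layers.Dense(128, activation="relu")(encoder_inputs)
z_mean = layers.Dense(latent_dim)(h)      # mean of the latent distribution
z_log_var = layers.Dense(latent_dim)(h)   # log-variance of the latent distribution
z = Sampling()([z_mean, z_log_var])       # sampled latent point fed to the decoder
```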
Initial Exploration
To illustrate how the VAE works and produce some initial results, I performed a single training run and plotted the embedding (latent) space, shown in the top left. Alongside it, the decoded/generated digits, sampled at uniform points from the embedding space, are shown in the top right.
As our model fits the data, the embedding space increasingly separates the different digits. We see this reflected in the decoded output: digits in the middle are blurry and illegible at first but slowly take form as the model trains.
On the bottom is a table displaying some reconstructions of our original data.
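For reference, a latent-space plot like the one above could be produced roughly like this (a sketch assuming a 2-D latent space, a trained `encoder` model that outputs `(z_mean, z_log_var, z)`, and MNIST test arrays `x_test`, `y_test`):

```python
import matplotlib.pyplot as plt

# Encode the test set and scatter-plot the 2-D latent means,
# colored by digit label so the clusters are visible.
z_mean, _, _ = encoder.predict(x_test)  # assumes the encoder returns (z_mean, z_log_var, z)
plt.figure(figsize=(6, 6))
plt.scatter(z_mean[:, 0], z_mean[:, 1], c=y_test, cmap="tab10", s=2)
plt.colorbar(label="digit")
plt.xlabel("z[0]")
plt.ylabel("z[1]")
plt.show()
```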
Optimizing Hyper-parameters
One simple hyper-parameter we can set is the latent dimension: how many dimensions our embedding space should have. We also introduce a new hyper-parameter, the beta value, which changes our loss function by weighting how much the regularizing term, the KL divergence, contributes to it.
vae_loss = reconstruction_loss + beta*kl_loss
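With a Gaussian latent space, the KL term has a closed form, so the beta-weighted loss can be computed roughly as below (a sketch, assuming flattened 784-pixel images and the encoder's `z_mean` and `z_log_var` tensors; the exact reconstruction loss in the Colab may differ):

```python
import tensorflow as tf

def beta_vae_loss(inputs, reconstructions, z_mean, z_log_var, beta=0.5):
    """Sketch of the beta-weighted VAE loss for flattened 784-pixel images."""
    # Reconstruction term: binary cross-entropy averages over pixels, so multiply
    # by the pixel count to get a per-image sum, then average over the batch.
    bce = tf.keras.losses.binary_crossentropy(inputs, reconstructions)  # shape: (batch,)
    reconstruction_loss = 784.0 * tf.reduce_mean(bce)

    # Regularizing term: closed-form KL divergence between the encoder's
    # Gaussian N(z_mean, exp(z_log_var)) and a standard normal prior.
    kl_loss = tf.reduce_mean(
        -0.5 * tf.reduce_sum(1.0 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var), axis=-1)
    )

    return reconstruction_loss + beta * kl_loss
```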
[Sweep panel: yn5ujapv]
From the sweeps we performed, we can see that, generally, beta values that are too high (such as 3) perform poorly, but beta values that are too low also aren't ideal. The best beta values seem to hover in the range of 0.5-1.5. The latent-dimension hyper-parameter, however, tends to yield lower loss the larger it is. This makes sense: with more dimensions, the model has more room in the latent space to fit the training data.
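For reference, a sweep over these two hyper-parameters can be configured roughly like this (a sketch of a W&B sweep config; the value grids, metric name, and project name here are illustrative, not the exact settings of the sweep above):

```python
import wandb

# Sketch of a sweep over beta and the latent dimension.
sweep_config = {
    "method": "grid",
    "metric": {"name": "val_loss", "goal": "minimize"},
    "parameters": {
        "beta": {"values": [0.25, 0.5, 1.0, 1.5, 2.0, 3.0]},
        "latent_dim": {"values": [2, 5, 10]},
    },
}

sweep_id = wandb.sweep(sweep_config, project="vae-tuning")  # project name is illustrative
# wandb.agent(sweep_id, function=train)  # `train` would build and fit the VAE from wandb.config
```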
Original Hyperparameters (Latent Dimension = 2, Beta = 2)
Tuned Hyperparameters (Latent Dimension = 10, Beta = 0.5)