
Variational Autoencoder (VAE)


Theory

Here, we summarize the basics of the variational autoencoder (VAE).
Let's assume we want to model a complicated distribution $p_{\theta}(\mathbf{x})$. To do that we can resort to a latent variable model, which allows us to use simple distributions $p_{\theta}(\mathbf{z})$ and $p_{\theta}(\mathbf{x} \mid \mathbf{z})$ such that:

$$p_{\theta}(\mathbf{x}) = \int p_{\theta}(\mathbf{x}, \mathbf{z}) \, d\mathbf{z} = \int p_{\theta}(\mathbf{x} \mid \mathbf{z}) \, p_{\theta}(\mathbf{z}) \, d\mathbf{z}.$$

In a machine learning setting we are given a dataset $\{\mathbf{x}_i\}_{i=1}^N$ and we want to maximize the average log-likelihood with respect to the parameters $\theta$:

$$\max_{\theta} \; \frac{1}{N} \sum_{i=1}^{N} \log p_{\theta}(\mathbf{x}_i).$$

However, optimizing this objective requires evaluating the integral above, as well as its gradient, which is in general intractable.

The idea is that instead of maximizing $\log p_{\theta}(\mathbf{x})$ directly, we look for a lower bound that is easier to optimize. Introducing a variational distribution $q_{\phi}(\mathbf{z})$ and applying Jensen's inequality, $\log p_{\theta}(\mathbf{x}) = \log \mathbb{E}_{\mathbf{z} \sim q_{\phi}(\mathbf{z})}\!\left[\frac{p_{\theta}(\mathbf{x}, \mathbf{z})}{q_{\phi}(\mathbf{z})}\right] \geq \mathbb{E}_{\mathbf{z} \sim q_{\phi}(\mathbf{z})}\!\left[\log \frac{p_{\theta}(\mathbf{x}, \mathbf{z})}{q_{\phi}(\mathbf{z})}\right]$, we obtain the evidence lower bound (ELBO):

$$\begin{aligned} \text{ELBO} &= \mathbb{E}_{\mathbf{z} \sim q_{\phi}(\mathbf{z})} \left[ \log p_{\theta}(\mathbf{x}, \mathbf{z}) - \log q_{\phi}(\mathbf{z}) \right] \\ &= \mathbb{E}_{\mathbf{z} \sim q_{\phi}(\mathbf{z})} \left[ \log p_{\theta}(\mathbf{x} \mid \mathbf{z}) + \log p_{\theta}(\mathbf{z}) - \log q_{\phi}(\mathbf{z}) \right] \\ &= \mathbb{E}_{\mathbf{z} \sim q_{\phi}(\mathbf{z})} \left[ \log p_{\theta}(\mathbf{x} \mid \mathbf{z}) \right] - \text{KL}\!\left( q_{\phi}(\mathbf{z}) \,\|\, p_{\theta}(\mathbf{z}) \right), \end{aligned}$$
where $q_{\phi}$ is the so-called variational distribution. The gap between $\log p_{\theta}(\mathbf{x})$ and the ELBO is exactly $\text{KL}\!\left( q_{\phi}(\mathbf{z}) \,\|\, p_{\theta}(\mathbf{z} \mid \mathbf{x}) \right)$, so maximizing the ELBO also pushes $q_{\phi}$ towards the true posterior.

Now we have to make some choices:
  • we choose a simple prior for $\mathbf{z}$, that is $p_{\theta}(\mathbf{z}) = p(\mathbf{z}) = \mathcal{N}(\mathbf{0}, \mathbf{I})$
  • for $p_{\theta}(\mathbf{x} \mid \mathbf{z})$ we pick a parametric distribution whose parameters are produced by a neural network (the decoder of our VAE)
  • for example, for a binary input $\mathbf{x}$ we can let the decoder output the parameters of a Bernoulli distribution
  • in other cases (e.g. continuous inputs) we can choose $p_{\theta}(\mathbf{x} \mid \mathbf{z})$ to be a Normal distribution whose mean is produced by the decoder, with diagonal covariance

Additionally, we assume that there are no dependencies between the latent variables $\mathbf{z}^{(i)}$ that correspond to the different observations $\mathbf{x}^{(i)}$, for $i = 1, \dots, N$ (the mean-field assumption).
Therefore we have $q_{\phi^{(i)}}(\mathbf{z}^{(i)}) = \mathcal{N}(\boldsymbol{\mu}^{(i)}, \boldsymbol{\Sigma}^{(i)})$. A nice property of this choice is that $\text{KL}\!\left( q_{\phi}(\mathbf{z}) \,\|\, p(\mathbf{z}) \right)$ can be computed in closed form, since both distributions are multivariate Normals.
The parameters $\phi^{(i)}$ for each sample are also learned with a neural network (the encoder), which takes the observation $\mathbf{x}^{(i)}$ as input and outputs the parameters $\boldsymbol{\mu}^{(i)}, \boldsymbol{\Sigma}^{(i)}$ of the Normal distribution.
Furthermore, we restrict the covariance to be diagonal, $\boldsymbol{\Sigma}^{(i)} = \text{diag}\!\left(\boldsymbol{\sigma}^{2,(i)}\right)$.

To optimize with respect to the parameters of the encoder and decoder neural networks, we cannot backpropagate through the sampling operation directly, so we resort to the reparametrization trick. In particular, to optimize the ELBO we draw samples $\mathbf{z}$ as follows (the short sketch after this list illustrates that gradients indeed flow through $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}$):
  • sample $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$,
  • set $\mathbf{z} = \boldsymbol{\epsilon} \odot \boldsymbol{\sigma} + \boldsymbol{\mu}$.
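As a minimal, stand-alone illustration (not part of the VAE class implemented below), the reparametrized sample is a deterministic function of $\boldsymbol{\mu}$ and $\log \boldsymbol{\sigma}$, so gradients of any downstream loss reach them; the toy tensors here are arbitrary:

import torch

# Toy parameters; in the VAE these would be produced by the encoder.
mu = torch.zeros(3, requires_grad=True)
logsigma = torch.zeros(3, requires_grad=True)

eps = torch.randn(3)                        # noise sample, no gradient required
z = eps * torch.exp(logsigma) + mu          # reparametrized sample
loss = (z ** 2).sum()                       # any downstream scalar loss
loss.backward()                             # gradients reach mu and logsigma
print(mu.grad, logsigma.grad)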


Coding the VAE

Here, we consider a simple example with the MNIST dataset, and we implement the VAE class step by step. The methods below refer to linear layers such as self.linear1 and self.linear21; one possible constructor defining them is sketched next.
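The following skeleton is only a sketch: the hidden size (256) and latent size (32) are illustrative assumptions, not values taken from the original implementation.

import torch
import torch.nn as nn

# Global device used by the sampling code below (an assumption about the original setup).
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

class VAE(nn.Module):
    def __init__(self, obs_dim=784, latent_dim=32, hidden_dim=256):
        super().__init__()
        self.latent_dim = latent_dim
        # Encoder layers
        self.linear1 = nn.Linear(obs_dim, hidden_dim)
        self.linear21 = nn.Linear(hidden_dim, latent_dim)   # produces mu
        self.linear22 = nn.Linear(hidden_dim, latent_dim)   # produces logsigma
        # Decoder layers
        self.linear3 = nn.Linear(latent_dim, hidden_dim)
        self.linear4 = nn.Linear(hidden_dim, obs_dim)       # produces the Bernoulli parameters theta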

Encoder

We implement an encoder that outputs the parameters $\boldsymbol{\mu}^{(i)}$ and $\log \boldsymbol{\sigma}^{(i)}$ for a given datapoint $\mathbf{x}^{(i)}$. Predicting $\log \boldsymbol{\sigma}^{(i)}$ instead of $\boldsymbol{\sigma}^{(i)}$ guarantees a positive standard deviation after exponentiation.
Encoder structure.

def encoder(self, x):
    """Obtain the parameters of q(z) for a batch of data points.

    Args:
        x: Batch of data points, shape [batch_size, obs_dim]

    Returns:
        mu: Means of q(z), shape [batch_size, latent_dim]
        logsigma: Log-sigmas of q(z), shape [batch_size, latent_dim]
    """
    h_relu = torch.relu(self.linear1(x))
    mu = self.linear21(h_relu)
    logsigma = self.linear22(h_relu)
    return mu, logsigma
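A quick usage sketch, assuming an instance of the hypothetical skeleton above (the batch and shapes are for illustration only):

vae = VAE(obs_dim=784, latent_dim=32)
x_batch = torch.rand(16, 784)              # a dummy batch of 16 flattened images
mu, logsigma = vae.encoder(x_batch)        # each of shape [16, 32]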


Sampling with reparametrization

The next step is to implement sampling with reparametrization, to obtain the latent variable $\mathbf{z}$.
def sample_with_reparam(self, mu, logsigma):
    """Draw samples from q(z) with reparametrization.

    We draw a single sample z_i for each data point x_i.

    Args:
        mu: Means of q(z) for the batch, shape [batch_size, latent_dim]
        logsigma: Log-sigmas of q(z) for the batch, shape [batch_size, latent_dim]

    Returns:
        z: Latent variables sampled from q(z), shape [batch_size, latent_dim]
    """
    batch_size, latent_dim = mu.shape
    eps = torch.normal(0, 1, size=(batch_size, latent_dim)).to(device)
    sigma = torch.exp(logsigma)
    z = sigma * eps + mu
    return z


Decoder

The decoder takes the samples $\mathbf{z}^{(i)}$ and produces the parameters $\boldsymbol{\theta}^{(i)} \in \mathbb{R}^D$ of the data likelihood $p_{\boldsymbol{\theta}^{(i)}}(\mathbf{x}^{(i)} \mid \mathbf{z}^{(i)})$.
Our data $\mathbf{x}^{(i)} \in \{0, 1\}^D$ are binary, so we use a Bernoulli likelihood:

$$p_{\boldsymbol{\theta}^{(i)}}(\mathbf{x}^{(i)} \mid \mathbf{z}^{(i)}) = \prod_{j=1}^{D} \left(\theta_{j}^{(i)}\right)^{x_{j}^{(i)}} \left(1 - \theta_{j}^{(i)}\right)^{1 - x_{j}^{(i)}}.$$

The parameters $\theta_{j}^{(i)}$ must lie in the interval $(0, 1)$; therefore, we use a sigmoid activation function in the last layer of the decoder.
The decoder has the following structure:
Decoder structure.
def decoder(self, z):
    """Convert sampled latent variables z into observations x.

    Args:
        z: Sampled latent variables, shape [batch_size, latent_dim]

    Returns:
        theta: Parameters of the conditional likelihood, shape [batch_size, obs_dim]
    """
    h_relu = torch.relu(self.linear3(z))
    theta = torch.sigmoid(self.linear4(h_relu))
    return theta


KL divergence

To compute the ELBO, we need the KL divergence $\text{KL}\!\left(q_{\boldsymbol{\phi}^{(i)}}(\mathbf{z}^{(i)}) \,\|\, p(\mathbf{z}^{(i)})\right)$, where $p(\mathbf{z}^{(i)})$ is the standard multivariate Normal distribution (zero mean, identity covariance).
The KL divergence between two Gaussians can be computed in closed form.
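Concretely, for $q_{\boldsymbol{\phi}^{(i)}} = \mathcal{N}\!\left(\boldsymbol{\mu}^{(i)}, \text{diag}\!\left(\boldsymbol{\sigma}^{2,(i)}\right)\right)$ and $p = \mathcal{N}(\mathbf{0}, \mathbf{I})$, the expression implemented below is

$$\text{KL}\!\left(q_{\boldsymbol{\phi}^{(i)}} \,\|\, p\right) = \frac{1}{2} \sum_{j} \left( \sigma_j^{2,(i)} + \left(\mu_j^{(i)}\right)^2 - 1 - 2 \log \sigma_j^{(i)} \right),$$

where the sum runs over the latent dimensions.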
def kl_divergence(self, mu, logsigma):
    """Compute the KL divergence KL(q_i(z) || p(z)) for each q_i in the batch.

    Args:
        mu: Means of the q_i distributions, shape [batch_size, latent_dim]
        logsigma: Logarithm of the standard deviations of the q_i distributions, shape [batch_size, latent_dim]

    Returns:
        kl: KL divergence for each of the q_i distributions, shape [batch_size]
    """
    sigma = torch.exp(logsigma)
    pre_kl = sigma ** 2 + mu ** 2 - 2 * logsigma - 1
    kl = 0.5 * torch.sum(pre_kl, dim=1)
    return kl


ELBO

Finally, we can compute the ELBO using all the methods that we implemented above.
The ELBO for a single sample $\mathbf{x}^{(i)} \in \{0, 1\}^D$ reads:

$$\mathcal{L}_i(\boldsymbol{\psi}, \boldsymbol{\lambda}) = \mathbb{E}_{\mathbf{z}^{(i)} \sim q_{\boldsymbol{\phi}^{(i)}}(\mathbf{z}^{(i)})}\!\left[\log p_{\boldsymbol{\theta}^{(i)}}\!\left(\mathbf{x}^{(i)} \mid \mathbf{z}^{(i)}\right)\right] - \text{KL}\!\left(q_{\boldsymbol{\phi}^{(i)}}(\mathbf{z}^{(i)}) \,\|\, p(\mathbf{z})\right),$$

where $\boldsymbol{\psi}$ and $\boldsymbol{\lambda}$ are the parameters of the encoder and the decoder neural networks, respectively.
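We approximate the expectation with the single reparametrized sample $\mathbf{z}^{(i)}$ drawn above. Taking the logarithm of the Bernoulli likelihood, the reconstruction term becomes

$$\log p_{\boldsymbol{\theta}^{(i)}}\!\left(\mathbf{x}^{(i)} \mid \mathbf{z}^{(i)}\right) = \sum_{j=1}^{D} \left[ x_j^{(i)} \log \theta_j^{(i)} + \left(1 - x_j^{(i)}\right) \log\!\left(1 - \theta_j^{(i)}\right) \right],$$

which is exactly what the log_px_ifz line below computes.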
def elbo(self, x):
    """Estimate the ELBO for a mini-batch of data.

    Args:
        x: Mini-batch of observations, shape [batch_size, obs_dim]

    Returns:
        elbo_mc: MC estimate of the ELBO for each sample in the mini-batch, shape [batch_size]
    """
    mu, logsigma = self.encoder(x)
    z = self.sample_with_reparam(mu, logsigma)
    theta = self.decoder(z)
    kl = self.kl_divergence(mu, logsigma)
    # Bernoulli log-likelihood of x given z (reconstruction term)
    log_px_ifz = torch.sum(x * torch.log(theta) + (1 - x) * torch.log(1 - theta), dim=1)
    elbo_mc = log_px_ifz - kl
    return elbo_mc


Generating new data

We can then implement a method for generating new data points by sampling from the prior and passing the samples through the decoder network.
def sample(self, num_samples):
    """Generate samples from the model.

    Args:
        num_samples: Number of samples to generate.

    Returns:
        x: Samples generated by the model, shape [num_samples, obs_dim]
    """
    zp = torch.normal(0, 1, size=(num_samples, self.latent_dim)).to(device)
    theta = self.decoder(zp)
    x = torch.bernoulli(theta)
    return x
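Once the model is trained (see below), generating and visualizing new digits could look like the following sketch (the reshape assumes MNIST-sized inputs with obs_dim = 784):

x_new = vae.sample(num_samples=16)          # shape [16, 784]
images = x_new.view(16, 28, 28).cpu()       # reshape for visualization as 28x28 images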

Now we can train our model using the negative ELBO as the loss function, that is
loss = -vae.elbo(x).mean(-1)
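As a rough sketch of the training loop (the optimizer, learning rate, binarization step, and the train_loader DataLoader are assumptions, not details taken from the original run):

vae = VAE().to(device)
optimizer = torch.optim.Adam(vae.parameters(), lr=1e-3)

for epoch in range(5):
    for x, _ in train_loader:                     # batches of (image, label) pairs
        x = x.view(x.shape[0], -1).to(device)     # flatten 28x28 images to [batch_size, 784]
        x = (x > 0.5).float()                     # binarize pixels, since we model x as Bernoulli
        optimizer.zero_grad()
        loss = -vae.elbo(x).mean(-1)              # negative ELBO averaged over the batch
        loss.backward()
        optimizer.step()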


Results

After training for five epochs we observe the following results.
  • Sampling new data points using the sample method:



  • Visualizing the embeddings by taking the means at the encoder output and running the t-SNE algorithm.

[Panel from run olive-snowflake-1: t-SNE of the latent means]

We can observe that the encoder has learned to assign similar means to images belonging to the same class; these images are therefore close to each other in the latent space.