Inject Noise to Remove Noise: A Deep Dive into Score-Based Generative Modeling Techniques
A look at the recent Score-Based Generative Modeling through Stochastic Differential Equations paper by Yang Song et al.
Contents
- Why Score Based?
- Score Estimation and Score Matching
- Sample Generation and Langevin Dynamics
- Gaussian Perturbations of data
- Score-Based Generative Modeling through Stochastic Differential Equations (ICLR 2021)
- Experiments
- Conclusions
- Appendix
1. Why Score Based?
What is a Score?
Before we get into score-based modeling, it's important to first understand what a "score" is in this context. Given a probability density function $p(\mathbf{x})$, the 'score' is defined as
$$\nabla_{\mathbf{x}} \log p(\mathbf{x}),$$
i.e. the gradient of the log-likelihood of the object $\mathbf{x}$, taken with respect to the input dimensions of $\mathbf{x}$.
Instead of working with the probability density function (pdf) itself, we work with its gradients, and the gradients here are with respect to the input dimensions, notably not w.r.t. the model parameters.
In this report, we will assume that the pdfs are those of continuous random variables.
The score is a vector field: at every point $\mathbf{x}$ it gives the gradient $\nabla_{\mathbf{x}} \log p(\mathbf{x})$, which tells us the direction to move in if we want to increase the likelihood as much as possible.
Score-based generative models are trained to estimate $\nabla_{\mathbf{x}} \log p(\mathbf{x})$.
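To make the definition concrete, here is a minimal sketch (assuming PyTorch; the Gaussian parameters are arbitrary toy values) comparing the autograd-computed score of an isotropic Gaussian with its closed form $\nabla_{\mathbf{x}} \log \mathcal{N}(\mathbf{x}; \mu, \sigma^2 I) = (\mu - \mathbf{x}) / \sigma^2$. Note that dropping the normalization constant does not change the score.

```python
import torch

# Toy example: score of an isotropic Gaussian N(mu, sigma^2 I).
mu, sigma = torch.tensor([1.0, -2.0]), 0.5

def log_p(x):
    # log-density up to an additive constant; the constant does not affect the score
    return -((x - mu) ** 2).sum() / (2 * sigma ** 2)

x = torch.tensor([0.3, 0.7], requires_grad=True)
score_autograd = torch.autograd.grad(log_p(x), x)[0]  # grad_x log p(x) via autograd
score_closed_form = (mu - x.detach()) / sigma ** 2    # analytic score

print(score_autograd, score_closed_form)  # the two should match
```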
Why use Scores?
Unlike likelihood-based models such as normalizing flows or autoregressive models, score-based models are easier to parameterize because they do not need to be normalized.
Moreover, $\nabla_{\mathbf{x}} \log p(\mathbf{x})$ behaves like an unconstrained function, so modeling it is easier.
Success of Score-Based techniques with Deep Energy-Based Models
In EBMs, we treat the density as a function
$$p_\theta(\mathbf{x}) = \frac{e^{-E_\theta(\mathbf{x})}}{Z_\theta},$$
where $E_\theta(\mathbf{x})$ is an energy function parameterized by $\theta$ and $Z_\theta = \int e^{-E_\theta(\mathbf{x})}\,\mathrm{d}\mathbf{x}$ is the normalization constant, a.k.a. the partition function, which is introduced so that the density integrates to one.
Learning Parameters via Maximum Likelihood Estimation
$$\log p_\theta(\mathbf{x}) = -E_\theta(\mathbf{x}) - \log Z_\theta,$$
where $\log Z_\theta$ is intractable.
The score does not depend on the partition function:
$$\nabla_{\mathbf{x}} \log p_\theta(\mathbf{x}) = -\nabla_{\mathbf{x}} E_\theta(\mathbf{x}) - \underbrace{\nabla_{\mathbf{x}} \log Z_\theta}_{= 0} = -\nabla_{\mathbf{x}} E_\theta(\mathbf{x}).$$
Idea: Learn $\theta$ by fitting $\nabla_{\mathbf{x}} \log p_\theta(\mathbf{x})$ to $\nabla_{\mathbf{x}} \log p_{\text{data}}(\mathbf{x})$.
This is also called score estimation.
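As a hedged illustration (PyTorch; the energy network is an arbitrary toy architecture), the score of an EBM can be computed directly with autograd as $-\nabla_{\mathbf{x}} E_\theta(\mathbf{x})$, and the intractable $Z_\theta$ never appears:

```python
import torch
import torch.nn as nn

D = 2  # toy data dimensionality
energy_net = nn.Sequential(nn.Linear(D, 64), nn.SiLU(), nn.Linear(64, 1))  # E_theta: R^D -> R

def ebm_score(x):
    # grad_x log p_theta(x) = -grad_x E_theta(x); Z_theta is constant in x and drops out.
    x = x.clone().requires_grad_(True)
    energy = energy_net(x).sum()
    return -torch.autograd.grad(energy, x, create_graph=True)[0]

x = torch.randn(8, D)
print(ebm_score(x).shape)  # (8, 2): one score vector per input point
```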


2. Score Estimation and Score Matching

Left: data scores $\nabla_{\mathbf{x}} \log p_{\text{data}}(\mathbf{x})$, Right: estimated scores $\mathbf{s}_\theta(\mathbf{x})$. The data density is encoded using an orange colormap: darker colour implies higher density. Red rectangles highlight regions where model scores are close to the data scores. Source: https://arxiv.org/abs/1907.05600
The caveat to all of this is: we don't have access to $p_{\text{data}}(\mathbf{x})$. In this article, we assume that we only observe i.i.d. samples $\{\mathbf{x}_1, \dots, \mathbf{x}_N\} \sim p_{\text{data}}(\mathbf{x})$.
Task: Estimate the score $\nabla_{\mathbf{x}} \log p_{\text{data}}(\mathbf{x})$.
Score Model: A vector-valued function $\mathbf{s}_\theta(\mathbf{x}) : \mathbb{R}^D \to \mathbb{R}^D$.
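In its simplest form, such a score model can be a small network mapping $\mathbb{R}^D$ to $\mathbb{R}^D$; a minimal sketch (PyTorch, with an arbitrary toy architecture) is:

```python
import torch.nn as nn

D = 2  # toy data dimensionality

# s_theta: R^D -> R^D, outputs one score vector per input point.
score_model = nn.Sequential(
    nn.Linear(D, 128), nn.SiLU(),
    nn.Linear(128, 128), nn.SiLU(),
    nn.Linear(128, D),
)
```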
Objective: How to compare two vector fields of scores?
Score Matching
Average the squared Euclidean distance between the two vector fields over the whole space:
$$\frac{1}{2}\,\mathbb{E}_{p_{\text{data}}(\mathbf{x})}\Big[\big\|\nabla_{\mathbf{x}} \log p_{\text{data}}(\mathbf{x}) - \mathbf{s}_\theta(\mathbf{x})\big\|_2^2\Big],$$
which is also referred to as the Fisher divergence.
This expression can be simplified using integration by parts, which gives us
$$\mathbb{E}_{p_{\text{data}}(\mathbf{x})}\Big[\frac{1}{2}\big\|\mathbf{s}_\theta(\mathbf{x})\big\|_2^2 + \operatorname{tr}\big(\nabla_{\mathbf{x}} \mathbf{s}_\theta(\mathbf{x})\big)\Big] + \text{const}.$$
This identity was derived by Aapo Hyvärinen in 2005. Replacing the expectation with an empirical mean over the training samples, we obtain
$$\frac{1}{N}\sum_{i=1}^{N}\Big[\frac{1}{2}\big\|\mathbf{s}_\theta(\mathbf{x}_i)\big\|_2^2 + \operatorname{tr}\big(\nabla_{\mathbf{x}} \mathbf{s}_\theta(\mathbf{x}_i)\big)\Big].$$
Some takeaways:
- The trace of the Jacobian of the score pushes each training point towards being a local maximum of the estimated density.
- The score matching objective effectively says: "Choose $\theta$ such that every data point in the training set becomes a local maximum of our estimated density".
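Below is a hedged sketch of this score matching loss (PyTorch; toy model and data). The trace of the Jacobian is computed naively with one backward pass per input dimension, which already hints at the scalability issue discussed in the next section.

```python
import torch
import torch.nn as nn

D = 2
score_model = nn.Sequential(nn.Linear(D, 64), nn.SiLU(), nn.Linear(64, D))

def score_matching_loss(x):
    # E[ 0.5 * ||s_theta(x)||^2 + tr(grad_x s_theta(x)) ]
    x = x.clone().requires_grad_(True)
    s = score_model(x)                         # (N, D) score estimates
    norm_term = 0.5 * (s ** 2).sum(dim=1)      # 0.5 * ||s_theta(x)||^2
    trace = torch.zeros(x.shape[0])
    for i in range(D):                         # one backprop per dimension -> O(D) cost
        grad_i = torch.autograd.grad(s[:, i].sum(), x, create_graph=True)[0]
        trace = trace + grad_i[:, i]           # accumulate d s_i / d x_i
    return (norm_term + trace).mean()

x = torch.randn(16, D)
print(score_matching_loss(x))
```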
3. Sample Generation and Langevin Dynamics
Vanilla Score Matching has its drawbacks:
- "Not Scalable"
Let's take the example of a simple score model that computes $\mathbf{s}_\theta(\mathbf{x})$ in one forward pass.

Score Matching is not Scalable
The second term of our score matching objective, the trace of the Jacobian, requires $O(D)$ backpropagations to compute, which is expensive in practice for high-dimensional data.
A remedy that the community has come up with: Sliced Score Matching
- Project the score vectors onto random directions, and take an expectation over these directions in the original objective.
This is cheaper in practice than the vanilla formulation. The sliced Fisher divergence is as follows:
$$\frac{1}{2}\,\mathbb{E}_{p_{\mathbf{v}}}\,\mathbb{E}_{p_{\text{data}}(\mathbf{x})}\Big[\big(\mathbf{v}^\top \nabla_{\mathbf{x}} \log p_{\text{data}}(\mathbf{x}) - \mathbf{v}^\top \mathbf{s}_\theta(\mathbf{x})\big)^2\Big],$$
where $\mathbf{v}$ is a random direction drawn from some distribution $p_{\mathbf{v}}$. It has been found that the multivariate Rademacher distribution is a good candidate for this distribution, besides the multivariate standard normal.
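A hedged sketch of the corresponding sliced score matching loss (PyTorch; toy model, Rademacher projection directions), which needs only a single backward pass regardless of $D$:

```python
import torch
import torch.nn as nn

D = 2
score_model = nn.Sequential(nn.Linear(D, 64), nn.SiLU(), nn.Linear(64, D))

def sliced_score_matching_loss(x):
    # E_v E_x [ v^T (grad_x s_theta(x)) v + 0.5 * (v^T s_theta(x))^2 ]
    x = x.clone().requires_grad_(True)
    v = torch.randint(0, 2, x.shape).float() * 2 - 1            # Rademacher directions (+/- 1)
    s = score_model(x)                                           # (N, D)
    sv = (s * v).sum()                                           # sum over the batch of v^T s_theta(x)
    grad_sv = torch.autograd.grad(sv, x, create_graph=True)[0]   # grad_x (v^T s_theta(x)), one backprop
    jvp_term = (grad_sv * v).sum(dim=1)                          # v^T grad_x (v^T s_theta(x))
    norm_term = 0.5 * (s * v).sum(dim=1) ** 2                    # 0.5 * (v^T s_theta(x))^2
    return (jvp_term + norm_term).mean()

x = torch.randn(16, D)
print(sliced_score_matching_loss(x))
```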
Langevin Dynamics
In computational statistics and recently in generative modeling, Langevin sampling has had great success. Langevin Monte Carlo is a Markov chain Monte Carlo (MCMC) method for obtaining random samples from probability distributions for which direct sampling is difficult. The goal is to "follow the gradient, but add a bit of noise" so as not to get stuck in local optima, which lets us explore the distribution and sample from it.
Procedure
- Sample from $p(\mathbf{x})$ using only the score $\nabla_{\mathbf{x}} \log p(\mathbf{x})$:
- Initialize $\mathbf{x}_0 \sim \pi(\mathbf{x})$ from an arbitrary prior.
- Repeat for $t = 1, 2, \dots, T$:
- $\mathbf{z}_t \sim \mathcal{N}(0, I)$
- $\mathbf{x}_t = \mathbf{x}_{t-1} + \dfrac{\epsilon}{2}\,\underbrace{\nabla_{\mathbf{x}} \log p(\mathbf{x}_{t-1})}_{\text{gradient}} + \sqrt{\epsilon}\,\underbrace{\mathbf{z}_t}_{\text{gaussian noise}}$

When $\epsilon \to 0$ and $T \to \infty$, $\mathbf{x}_T$ converges to a sample from $p(\mathbf{x})$.
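A minimal sketch of this Langevin loop (PyTorch; the score function, step size, and step count are placeholders, and the example uses the known score of a standard normal, $\nabla_{\mathbf{x}} \log p(\mathbf{x}) = -\mathbf{x}$, as a sanity check):

```python
import torch

def langevin_sampling(score_fn, x0, eps=1e-2, n_steps=2000):
    # x_t = x_{t-1} + eps/2 * score(x_{t-1}) + sqrt(eps) * z_t, with z_t ~ N(0, I)
    x = x0.clone()
    for _ in range(n_steps):
        z = torch.randn_like(x)
        x = x + 0.5 * eps * score_fn(x) + eps ** 0.5 * z
    return x

# Sanity check against a standard normal target, whose score is -x.
samples = langevin_sampling(lambda x: -x, x0=5.0 * torch.randn(1000, 2))
print(samples.mean(0), samples.std(0))  # should be close to mean 0 and std 1
```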
Score Based Workflow

Created by Sayantan Das
4. Gaussian Perturbations of data
Score Matching has various pitfalls:
Pitfall #1: Manifold Hypothesis
This hypothesis states that high-dimensional data often tends to lie on a low-dimensional manifold, which makes $\nabla_{\mathbf{x}} \log p(\mathbf{x})$ undefined in some regions of the ambient space.
Pitfall #2: Inaccurate Score Estimation in Low Data-Density regions
Since most samples lie around the modes of the distribution, the estimated vector field can be misleading in low-density regions, where there are not enough samples to guide the score vectors in the correct direction.
Pitfall #3: Slow Mixing of Langevin Dynamics between data modes
When the distribution has disconnected support, e.g. it is a mixture of two disjoint components with a weighting coefficient $\pi$, the score cannot recover this coefficient: it is invariant to the mode weights. This is a failure mode of Langevin dynamics, as the density is not supported over the whole space and the chain mixes very slowly between the modes.
Solution - Multiple levels of Gaussian perturbations of the data
This is also known as "Annealed Langevin Dynamics"
- Perturb the data with a sequence of decreasing noise levels $\sigma_1 > \sigma_2 > \dots > \sigma_L$, and sample from the perturbed distributions $p_{\sigma_1}(\mathbf{x}), p_{\sigma_2}(\mathbf{x}), \dots$ sequentially with Langevin dynamics (LD), where,
- Run LD with $p_{\sigma_1}(\mathbf{x})$,
- Run LD with $p_{\sigma_2}(\mathbf{x})$, initializing with the particles from the previous LD run, i.e. $\mathbf{x}_0^{(\sigma_2)} = \mathbf{x}_T^{(\sigma_1)}$,
- and so on, down to the smallest noise level $\sigma_L$.
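A hedged sketch of annealed Langevin dynamics layered on top of the loop above (PyTorch; the noise levels, per-level step-size rule, and the noise-conditional score function `score_fn(x, sigma)` are placeholders):

```python
import torch

def annealed_langevin_sampling(score_fn, x0, sigmas, eps=2e-5, n_steps=100):
    # sigmas: decreasing noise levels sigma_1 > sigma_2 > ... > sigma_L.
    # score_fn(x, sigma) approximates grad_x log p_sigma(x) at that noise level.
    x = x0.clone()
    for sigma in sigmas:
        alpha = eps * (sigma / sigmas[-1]) ** 2     # step size scaled per noise level
        for _ in range(n_steps):
            z = torch.randn_like(x)
            x = x + 0.5 * alpha * score_fn(x, sigma) + alpha ** 0.5 * z
        # The particles at this noise level initialize the run at the next, smaller level.
    return x
```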
5. Score-Based Generative Modeling through Stochastic Differential Equations
Introduction
As we saw in the previous section, Gaussian perturbations corrupt the data into random noise. In order to generate samples with score-based models, we need to consider a diffusion process; scores arise when we reverse this diffusion process, which leads to sample generation. Let $\{\mathbf{x}(t)\}_{t \in [0, T]}$ be a diffusion process, indexed by a continuous time variable $t \in [0, T]$. A diffusion process is governed by a stochastic differential equation (SDE) of the following form:
$$\mathrm{d}\mathbf{x} = \mathbf{f}(\mathbf{x}, t)\,\mathrm{d}t + g(t)\,\mathrm{d}\mathbf{w},$$
where $\mathbf{f}(\mathbf{x}, t)$ is the drift coefficient of the SDE, $g(t)$ is the diffusion coefficient, and $\mathbf{w}$ represents standard Brownian motion. It is this $\mathbf{w}$ that makes SDEs the stochastic generalization of ordinary differential equations (ODEs): the particles not only follow the deterministic drift guided by $\mathbf{f}(\mathbf{x}, t)$, but are also affected by the random noise coming from $g(t)\,\mathrm{d}\mathbf{w}$.
Reversing the SDE
For score-based generative modeling, the diffusion process needs to satisfy the boundary conditions $\mathbf{x}(0) \sim p_0$ and $\mathbf{x}(T) \sim p_T$, where $p_t$ is used to denote the distribution of $\mathbf{x}(t)$. Here $p_0$ is the data distribution, from which we have a dataset of i.i.d. samples, and $p_T$ is the prior distribution, which has a tractable form and is easy to sample from. The noise perturbation applied by the diffusion process is large enough to ensure that $p_T$ does not depend on $p_0$.
By starting out with a sample from $p_T$ and reversing this diffusion SDE, we can obtain samples from our data distribution $p_0$. The reverse-time SDE is given as:
$$\mathrm{d}\mathbf{x} = \big[\mathbf{f}(\mathbf{x}, t) - g(t)^2\,\nabla_{\mathbf{x}} \log p_t(\mathbf{x})\big]\,\mathrm{d}t + g(t)\,\mathrm{d}\bar{\mathbf{w}},$$
where $\bar{\mathbf{w}}$ is Brownian motion in the reverse time direction and $\mathrm{d}t$ is an infinitesimal negative time step.

Extracted from the paper.
Score Estimation
A time-dependent score model $\mathbf{s}_\theta(\mathbf{x}, t)$ is required to approximate $\nabla_{\mathbf{x}} \log p_t(\mathbf{x})$ in order to numerically solve the reverse-time SDE. This score model is trained using a denoising objective:
$$\min_\theta \; \mathbb{E}_{t \sim \mathcal{U}(0, T)}\,\mathbb{E}_{p_0(\mathbf{x}(0))}\,\mathbb{E}_{p_{0t}(\mathbf{x}(t) \mid \mathbf{x}(0))}\Big[\lambda(t)\,\big\|\mathbf{s}_\theta(\mathbf{x}(t), t) - \nabla_{\mathbf{x}(t)} \log p_{0t}(\mathbf{x}(t) \mid \mathbf{x}(0))\big\|_2^2\Big],$$
where $\mathcal{U}(0, T)$ is a uniform distribution over $[0, T]$, $p_{0t}(\mathbf{x}(t) \mid \mathbf{x}(0))$ denotes the transition probability from $\mathbf{x}(0)$ to $\mathbf{x}(t)$, and $\lambda(t)$ denotes a positive weighting function.
In the objective, the expectation over $\mathbf{x}(0)$ can be estimated with empirical means over data samples from $p_0$. The expectation over $\mathbf{x}(t)$ can be estimated by sampling from $p_{0t}(\mathbf{x}(t) \mid \mathbf{x}(0))$, which is efficient when the drift coefficient $\mathbf{f}(\mathbf{x}, t)$ is affine. The weighting function $\lambda(t)$ is typically chosen to be inversely proportional to $\mathbb{E}\big[\|\nabla_{\mathbf{x}(t)} \log p_{0t}(\mathbf{x}(t) \mid \mathbf{x}(0))\|_2^2\big]$.
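As a hedged sketch, here is what this denoising objective looks like for the simple variance-exploding style perturbation kernel $p_{0t}(\mathbf{x}(t) \mid \mathbf{x}(0)) = \mathcal{N}\big(\mathbf{x}(0), \sigma(t)^2 I\big)$ with $\lambda(t) = \sigma(t)^2$ (PyTorch; the score model signature `score_model(x, t)` and the geometric noise schedule are placeholders):

```python
import torch

def denoising_score_matching_loss(score_model, x0, sigma_min=0.01, sigma_max=50.0):
    # Perturbation kernel: p_0t(x(t) | x(0)) = N(x(0), sigma(t)^2 I), weighting lambda(t) = sigma(t)^2.
    n = x0.shape[0]
    t = torch.rand(n)                                       # t sampled uniformly in [0, 1)
    sigma_t = sigma_min * (sigma_max / sigma_min) ** t      # illustrative geometric noise schedule
    sigma_t = sigma_t.view(n, *([1] * (x0.dim() - 1)))      # broadcast over data dimensions

    z = torch.randn_like(x0)
    xt = x0 + sigma_t * z                                   # sample x(t) given x(0)

    s = score_model(xt, t)                                  # s_theta(x(t), t)
    # Target: grad_{x(t)} log p_0t(x(t) | x(0)) = -z / sigma(t); with lambda(t) = sigma(t)^2
    # the weighted squared error simplifies to ||sigma(t) * s_theta + z||^2.
    return ((sigma_t * s + z) ** 2).flatten(1).sum(dim=1).mean()
```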
Sampling Procedures in the paper
The authors discuss various sampling methods in their paper which are as follows:
- Sampling using Numerical Solvers -- such as the Euler-Maruyama approach, where the infinitesimal $\mathrm{d}t$ is approximated by a small finite step $\Delta t$ (a minimal sketch of this approach appears after this list).
- Sampling using Predictor-Corrector Methods: these methods use one step of a numerical solver like the one discussed above to obtain an estimate $\mathbf{x}(t - \Delta t)$ from $\mathbf{x}(t)$, which is called the 'predictor' step. This is followed by applying several steps of Langevin MCMC to refine $\mathbf{x}(t - \Delta t)$ such that it becomes a more accurate sample from $p_{t - \Delta t}(\mathbf{x})$. This is the 'corrector' step, as the MCMC helps reduce the error of the numerical SDE solver.
The reason this is popular is that score-based MCMC approaches can produce samples from an underlying distribution once $\nabla_{\mathbf{x}} \log p_t(\mathbf{x})$ is known, which in our case is approximated well by the time-dependent score model.
- Sampling with Numerical ODE Solvers: for any SDE of the form
$$\mathrm{d}\mathbf{x} = \mathbf{f}(\mathbf{x}, t)\,\mathrm{d}t + g(t)\,\mathrm{d}\mathbf{w},$$
there exists an associated ordinary differential equation (ODE)
$$\mathrm{d}\mathbf{x} = \Big[\mathbf{f}(\mathbf{x}, t) - \frac{1}{2} g(t)^2\,\nabla_{\mathbf{x}} \log p_t(\mathbf{x})\Big]\,\mathrm{d}t,$$
such that their trajectories have the same marginal probability densities $p_t(\mathbf{x})$. Therefore, by solving this ODE in the reverse time direction, we can sample from the same distribution as solving the reverse-time SDE. We call this ODE the probability flow ODE.
This approach is also very popular because once the formulation turns into an ODE, it becomes possible to estimate the exact likelihood of the data, in our case $p_0(\mathbf{x})$, using the change-of-variables formula. More on this in the ICLR 2021 paper.
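For concreteness, here is a hedged sketch of the first option applied to the reverse-time SDE: an Euler-Maruyama discretization for the same variance-exploding style process used above (PyTorch; the score model, noise schedule, prior, and step count are placeholders). A predictor-corrector sampler would interleave a few Langevin corrector steps after each update below.

```python
import math
import torch

def reverse_sde_sampler(score_model, shape, n_steps=1000, sigma_min=0.01, sigma_max=50.0):
    # Euler-Maruyama discretization of dx = -g(t)^2 * score(x, t) dt + g(t) dw_bar,
    # integrated from t = 1 down to t = 0 (drift f = 0 for the VE-style SDE).
    log_ratio = math.log(sigma_max / sigma_min)
    dt = 1.0 / n_steps

    x = sigma_max * torch.randn(shape)        # sample from the (approximate) prior p_T = N(0, sigma_max^2 I)
    for i in range(n_steps, 0, -1):
        t = torch.full((shape[0],), i * dt)
        sigma_t = sigma_min * (sigma_max / sigma_min) ** t
        g = (sigma_t * math.sqrt(2.0 * log_ratio)).view(shape[0], *([1] * (len(shape) - 1)))

        score = score_model(x, t)             # s_theta(x, t) approximates grad_x log p_t(x)
        z = torch.randn_like(x)
        x = x + (g ** 2) * score * dt + g * math.sqrt(dt) * z   # one reverse-time Euler-Maruyama step
    return x
```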
6. Experiments
7. Conclusion
This report presents and summarizes the latest developments in score-based generative models, with the goal of enabling a better understanding of existing approaches as well as of the new sampling algorithms, exact likelihood computation, and conditional generation abilities that this framework brings to the family of score-based generative models.
8. Appendix
https://github.com/ucalyptus