An introduction to diffusion models :: Draft

(Work in progress)

In this post, we’ll explore how diffusion models work. This post will follow the presentation in “Denoising Diffusion Probabilistic Models”.

How do diffusion models work? #

Diffusion models are Markov chains trained to produce samples matching the data after finite time. We start with a diffusion process that destroys the data: we apply multiple Markov chain transitions until a datapoint $\mathbf{x}_0$ gets gradually converted to noise. Diffusion models learn transitions that reverse this process.

Formally, diffusion models are latent variables of the form: $$p_{\theta}(\mathbf{x}_0) = \int p_{\theta}(\mathbf{x}_{0:T}) d \mathbf{x_{1:T}}$$

Here, $\mathbf{x}_1, \dots, \mathbf{x}_T$ are latent variable models which are the same dimensionality as the data $\mathbf{x}_0 \sim q(\mathbf{x}_0)$ .

We call $p_{\theta} (\mathbf{x}_{0:T})$ the reverse process, which is defined as follows:

$$p_{\theta}(x_{0:T}) = p(\mathbf{x}_T) \prod_{t=1}^{T} p_{\theta} (\mathbf{x}_{t-1} | \mathbf{x}_t),$$ where the conditional probabilities are modeled as Gaussians: $$ p_{\theta}(\mathbf{x}_{t-1} | \mathbf{x}_t) = \mathcal{N} (\mathbf{x}_{t-1}; \mathbf{u}_{\theta}(\mathbf{x}_t, t), \Sigma_{\theta}(\mathbf{x}_t, t)). $$

The diffusion process is defined by a Markov chain that gradually adds Gaussian noise to the data according to a variance schedule $\beta_1, \dots, \beta_T$ .

$$ q(\mathbf{x}_{1:T} | \mathbf{x}_0) = \prod_{t=1}^{T} q(\mathbf{x}_t | \mathbf{x}_{t-1}), $$ where

$$ q\left(\mathbf{x}_t \mid \mathbf{x}_{t-1}\right):=\mathcal{N}\left(\mathbf{x}_t ; \sqrt{1-\beta_t} \mathbf{x}_{t-1}, \beta_t \mathbf{I}\right). $$

The usual approach would be to optimize this lower bound of the negative log likelihood:

$$ \mathbb{E}\left[-\log p_\theta\left(\mathbf{x}_0\right)\right] \leq \mathbb{E}_q\left[-\log \frac{p_\theta\left(\mathbf{x}_{0: T}\right)}{q\left(\mathbf{x}_{1: T} \mid \mathbf{x}_0\right)}\right]=\mathbb{E}_q\left[-\log p\left(\mathbf{x}_T\right)-\sum_{t \geq 1} \log \frac{p_\theta\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t\right)}{q\left(\mathbf{x}_t \mid \mathbf{x}_{t-1}\right)}\right]=: L $$