Diffusion Models

May 4, 2026

Estimated Reading Time: 18 mins

Before diffusion models, image generation was largely dominated by GANs, despite the training instability caused by their two competing networks. Diffusion models emerged as a more stable alternative that produces highly realistic images, and they are now the most popular model family for image generation. But what are diffusion models, and how do they work?

What are diffusion models?

Diffusion models are a type of generative model that learn how to transform noise into real images. They are trained by adding various amounts of random noise to images, and learning how to undo the added noise. After training, a diffusion model can transform a pure noise input into a realistic image.

Diffusion models were introduced by Sohl-Dickstein et al. (2015) and popularized by Ho et al.'s (2020) "Denoising Diffusion Probabilistic Models" (DDPMs). DDPMs have two key components: the forward process and the reverse process.

Forward Process

The forward (diffusion) process gradually adds Gaussian noise to an input image $x_0$ over a total of $T$ timesteps. The distribution of the noised image $x_t$ at timestep $t$, given the clean image $x_0$, is:

$$q(\mathbf{x}_t \mid \mathbf{x}_0) = \mathcal{N}\left(\mathbf{x}_t; \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0, (1 - \bar{\alpha}_t)\mathbf{I}\right) \tag{1}$$

$\alpha_t = 1 - \beta_t$ represents the fraction of signal, or non-noise, kept at timestep $t$, and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$ is the cumulative signal remaining after $t$ steps. The noise intensity $\beta_t$ at each timestep is set by a noise schedule. Ho et al. used a linear schedule, which creates $T$ equally spaced values of $\beta_t$ from 1e-4 to 0.02. Larger timesteps have larger $\beta_t$ values, which increasingly corrupt the input image. However, a linear schedule destroys most of the signal well before the final timestep, so a large portion of the noised images are almost pure noise and less useful to train with. Nichol & Dhariwal (2021) later proposed a cosine schedule that reduces the signal more gradually to provide more informative training samples, consequently improving generation quality.

Linear vs. cosine scheduler
Figure 1: The cosine schedule preserves signal longer, yielding more informative noisy samples than a purely linear schedule. Image source: Nichol & Dhariwal (2021)
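
To make the comparison concrete, here is a minimal sketch of both schedules in plain Python. The function names are made up for this example, and the offset `s = 0.008` is the value reported by Nichol & Dhariwal; a real implementation would normally use an array library and also clip the resulting $\beta_t$ values.

```python
import math

def linear_betas(T, beta_start=1e-4, beta_end=0.02):
    """T equally spaced beta values, as in the linear schedule described above."""
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + i * step for i in range(T)]

def alpha_bars_from_betas(betas):
    """alpha_bar_t is the running product of (1 - beta_s) for s <= t."""
    out, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        out.append(running)
    return out

def cosine_alpha_bars(T, s=0.008):
    """Cumulative signal alpha_bar_t for t = 1..T under the cosine schedule.

    f(t) = cos^2(((t / T + s) / (1 + s)) * pi / 2), alpha_bar_t = f(t) / f(0).
    """
    def f(t):
        return math.cos(((t / T + s) / (1 + s)) * math.pi / 2) ** 2
    return [f(t) / f(0) for t in range(1, T + 1)]

T = 1000
linear_bar = alpha_bars_from_betas(linear_betas(T))
cosine_bar = cosine_alpha_bars(T)

# Compare how much signal remains halfway through the forward process.
print(f"linear alpha_bar at t=500: {linear_bar[499]:.4f}")
print(f"cosine alpha_bar at t=500: {cosine_bar[499]:.4f}")
```

Halfway through the process the cosine schedule retains noticeably more signal than the linear one, which is the behavior Figure 1 illustrates.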

Rather than computing each noisy step sequentially, the forward process can be reparameterized so we can jump directly to any timestep. A Gaussian variable $X \sim \mathcal{N}(\mu, \sigma^2)$ can be written as $X = \mu + \sigma\epsilon$ with $\epsilon \sim \mathcal{N}(0, 1)$. From the forward distribution above, $\mu = \sqrt{\bar{\alpha}_t}\, x_0$ and $\sigma^2 = 1 - \bar{\alpha}_t$, so $\sigma = \sqrt{1 - \bar{\alpha}_t}$, where $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$ is the cumulative product of the per-step signal $\alpha_s = 1 - \beta_s$. Plugging into $\mu + \sigma\epsilon$, we get

$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon \tag{2}$$

where $\epsilon$ is sampled from a standard Gaussian. The model then attempts to predict the added noise $\epsilon$ at the current timestep. To learn this, the model is trained with the loss term:

$$L_{\text{simple}}(\theta) := \mathbb{E}_{t, \mathbf{x}_0, \epsilon} \left[ \left\|\epsilon - \epsilon_\theta\left(\sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\,\epsilon, t\right)\right\|^2 \right] \tag{3}$$

which is essentially MSE loss between the true noise and the predicted noise. Interestingly, this loss term is actually a simplified version of the original term:

$$\mathbb{E}_{\mathbf{x}_0, \epsilon} \left[ \frac{\beta_t^2}{2\sigma_t^2 \alpha_t(1 - \bar{\alpha}_t)} \left\|\epsilon - \epsilon_\theta\left(\sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\,\epsilon, t\right)\right\|^2 \right] \tag{4}$$

The weight in front of the original objective is largest at early timesteps (lightly noised images), so it emphasizes denoising tasks that are comparatively easy. Dropping the weight, as in equation (3), down-weights those terms so the model can focus on the more difficult denoising tasks at later, noisier timesteps. Ho et al. found that the simplified loss produces better-quality samples.
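
As a rough illustration of equations (2) and (3), the sketch below noises a toy four-pixel "image" in one jump and scores a noise prediction. The names `q_sample` and `simple_loss` are invented for this example, and the loss is averaged over pixels rather than summed, which differs from the squared norm only by a constant factor.

```python
import math
import random

def q_sample(x0, alpha_bar_t, eps):
    """Equation (2): x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    a = math.sqrt(alpha_bar_t)
    b = math.sqrt(1.0 - alpha_bar_t)
    return [a * x + b * e for x, e in zip(x0, eps)]

def simple_loss(eps, eps_pred):
    """Equation (3) for one sample, as a mean squared error over pixels."""
    return sum((e - p) ** 2 for e, p in zip(eps, eps_pred)) / len(eps)

rng = random.Random(0)
x0 = [0.5, -0.25, 1.0, 0.0]               # a toy "image" of four pixels
eps = [rng.gauss(0.0, 1.0) for _ in x0]   # the noise the model must predict

x_early = q_sample(x0, alpha_bar_t=0.99, eps=eps)  # mostly signal
x_late = q_sample(x0, alpha_bar_t=0.01, eps=eps)   # mostly noise

# A perfect prediction gives zero loss; predicting all zeros does not.
print(simple_loss(eps, eps))
print(simple_loss(eps, [0.0] * len(eps)))
```

No loop over intermediate timesteps is needed: the only schedule-dependent input is the cumulative signal $\bar{\alpha}_t$ for the chosen timestep.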

Reverse Process

The reverse process undoes the forward process through a Markov chain, meaning that an image's next, slightly less noisy state depends only on its current state. At each timestep, reverse diffusion removes a small amount of noise until a clean image is produced. After the network predicts $\epsilon$ from the image $x_t$, we can compute the mean of the distribution of $x_{t-1}$ with

$$\boldsymbol{\mu}_\theta(\mathbf{x}_t, t) = \tilde{\boldsymbol{\mu}}_t\left(\mathbf{x}_t, \frac{1}{\sqrt{\bar{\alpha}_t}}\left(\mathbf{x}_t - \sqrt{1 - \bar{\alpha}_t}\,\epsilon_\theta(\mathbf{x}_t)\right)\right) = \frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\,\epsilon_\theta(\mathbf{x}_t, t)\right) \tag{5}$$

We can then find $x_{t-1}$ with

$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon_\theta(x_t, t) \right) + \sigma_t z \tag{6}$$

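where $z \sim \mathcal{N}(0, I)$ if $t > 1$, else $z = 0$, and $\sigma_t$ is the standard deviation of the reverse step (Ho et al. report that $\sigma_t^2 = \beta_t$ works well). The very last step of the reverse process is $t = 1$: no noise is added, and the result $x_0$ is the noiseless image.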
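
A single reverse step from equation (6) can be sketched as follows. This is an illustrative version in plain Python, assuming $\sigma_t^2 = \beta_t$ and using the true noise in place of a trained network; `ddpm_step` and the two-step toy schedule are made up for the example.

```python
import math
import random

def ddpm_step(x_t, eps_pred, t, betas, alpha_bars, rng):
    """One reverse step from x_t to x_{t-1}, following equation (6).

    betas[t - 1] and alpha_bars[t - 1] hold beta_t and alpha_bar_t for
    t = 1..T. The choice sigma_t^2 = beta_t is an assumption of this sketch.
    """
    beta_t = betas[t - 1]
    alpha_t = 1.0 - beta_t
    coef = beta_t / math.sqrt(1.0 - alpha_bars[t - 1])
    sigma_t = math.sqrt(beta_t)
    out = []
    for x, e in zip(x_t, eps_pred):
        mean = (x - coef * e) / math.sqrt(alpha_t)
        z = rng.gauss(0.0, 1.0) if t > 1 else 0.0  # no noise on the final step
        out.append(mean + sigma_t * z)
    return out

betas = [0.1, 0.2]               # toy two-step schedule
alpha_bars = [0.9, 0.9 * 0.8]    # running product of (1 - beta)
rng = random.Random(0)

x0 = [0.5, -0.25, 1.0]
eps = [rng.gauss(0.0, 1.0) for _ in x0]
a, b = math.sqrt(alpha_bars[0]), math.sqrt(1.0 - alpha_bars[0])
x1 = [a * x + b * e for x, e in zip(x0, eps)]

# With the true noise as the "prediction", the final step returns x0.
recovered = ddpm_step(x1, eps, 1, betas, alpha_bars, rng)
print(recovered)
```

Feeding the true noise at $t = 1$ recovers the clean input up to floating-point error, which is a handy sanity check when wiring up a sampler.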

Model Architectures

To perform the noise prediction, Ho et al. (2020) used a modified U-Net (Ronneberger et al., 2015). A U-Net consists of an encoder, a bottleneck, and a decoder, with skip connections between layers of the same resolution in the encoder and decoder.

UNet diagram
Figure 2: Diagram of the UNet architecture. Image source: Ronneberger et al. (2015)

The U-Net is time-conditioned on the current timestep so the model can learn that larger timesteps correspond to noisier images. However, the timestep $t$ is not passed to the model as a raw integer, because a single scalar is not an informative input for the network. Instead, timesteps are embedded using the Transformer sinusoidal positional embeddings (Vaswani et al., 2017), which turn the integer into a vector of size $d_{\text{model}}$ computed with:

$$PE_{(pos, 2i)} = \sin\left(pos / 10000^{2i/d_{\text{model}}}\right) \tag{7}$$

$$PE_{(pos, 2i+1)} = \cos\left(pos / 10000^{2i/d_{\text{model}}}\right) \tag{8}$$

Sine and cosine are used because, for any fixed offset $k$, the embedding of timestep $t + k$ is a linear transformation of the embedding of timestep $t$, which makes it easy for the model to learn relative positions between timesteps. Neither sine nor cosine can be used alone: each sine-cosine pair is a point on the 2D unit circle, and shifting by $k$ timesteps rotates that point, which requires both coordinates. Additionally, sinusoidal embeddings are deterministic, so no training is required to produce them.
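
The rotation argument can be checked numerically. The sketch below computes the embedding from equations (7) and (8) and then shifts it by $k$ timesteps using only a fixed per-pair rotation; both function names are invented for this example, and `d_model` is assumed to be even.

```python
import math

def timestep_embedding(t, d_model):
    """Sinusoidal embedding of an integer timestep t, per equations (7) and (8)."""
    emb = [0.0] * d_model
    for i in range(d_model // 2):
        angle = t / (10000 ** (2 * i / d_model))
        emb[2 * i] = math.sin(angle)
        emb[2 * i + 1] = math.cos(angle)
    return emb

def shift(emb, k, d_model):
    """Map the embedding of t to the embedding of t + k by rotating each pair.

    The rotation depends only on k and the pair's frequency, not on t,
    so it is a linear transformation of the input embedding.
    """
    out = [0.0] * d_model
    for i in range(d_model // 2):
        w = 1.0 / (10000 ** (2 * i / d_model))
        s, c = emb[2 * i], emb[2 * i + 1]
        out[2 * i] = s * math.cos(w * k) + c * math.sin(w * k)
        out[2 * i + 1] = c * math.cos(w * k) - s * math.sin(w * k)
    return out

d_model = 8
direct = timestep_embedding(15, d_model)
rotated = shift(timestep_embedding(10, d_model), 5, d_model)

# The two vectors agree up to floating-point error.
print(max(abs(a - b) for a, b in zip(direct, rotated)))
```

The `shift` function needs both the sine and the cosine entry of each pair to compute either output, which is why one of them alone would not be enough.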

The U-Net uses two convolutional residual blocks (He et al., 2015) per image resolution. It also includes self-attention blocks (Vaswani et al., 2017) at the bottleneck and at the 16x16 resolution. These self-attention blocks pass feature maps through three linear projections ($W_Q$, $W_K$, and $W_V$) to produce queries ($Q$), keys ($K$), and values ($V$). The attention output is then computed as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

While it is unclear exactly what relationships the self-attention blocks learn between positions of a feature map, they improve denoising because they let the model relate any two regions of a feature map directly. Convolutional blocks on their own mainly capture local relationships because of their small kernel sizes.
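
To show what "any region can relate to any other" means in practice, here is a small single-head self-attention sketch over a flattened 2x2 feature map. It uses plain Python lists and identity projection matrices purely for illustration; the helper names are hypothetical, and a real U-Net would use learned weights and an array library.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x, w_q, w_k, w_v):
    """x has one row per spatial position and one column per channel."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_t)]
    weights = [softmax(row) for row in scores]
    return matmul(weights, v), weights

# A 2x2 feature map with 2 channels, flattened to 4 positions.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
identity = [[1.0, 0.0], [0.0, 1.0]]
out, weights = self_attention(x, identity, identity, identity)

# Every position has a nonzero weight on every other position.
for row in weights:
    print([round(wt, 3) for wt in row])
```

Each row of `weights` sums to one and has no zero entries, so every output position is a mixture of all input positions, regardless of how far apart they are in the feature map.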

An alternative to the U-Net emerged with Diffusion Transformers (DiTs) (Peebles & Xie, 2022), where a vision transformer (ViT; Dosovitskiy et al., 2020) replaces the U-Net. The flow of a DiT is: compress an input image into a latent representation $z$ with a VAE encoder, add noise to $z$ with the forward process, and split the noised latent into patches that the transformer processes as a sequence of tokens to predict the noise. The benefit of using a transformer instead of an attention-based U-Net is scalability: Peebles & Xie (2022) report that sample quality improves steadily as the transformer's compute increases, and that their largest models outperform earlier U-Net-based diffusion models on class-conditional ImageNet.

Guided diffusion

As is, diffusion models produce diverse but uncontrollable outputs. Their generations are inherently probabilistic because inference begins with random noise. Several methods can guide the output toward a specific class or result. Basic conditioning simply provides the condition $y$ (a condition image or text label) to the model alongside the noisy input, either by concatenation or through cross-attention. For text conditioning, the text is first converted to embeddings by a text encoder, such as CLIP's text encoder.

Dhariwal & Nichol (2021) introduced classifier-guided diffusion, where a classifier is trained on noisy images from forward diffusion, and its gradients are used to steer the noise prediction toward the target class $y$.

Ho & Salimans (2022) introduced classifier-free diffusion guidance, where instead of training a separate classifier, a single diffusion model is trained to be both conditional and unconditional. During training, the condition $c$ is randomly dropped. During inference, both conditioned and unconditioned predictions are obtained for an input sample, and the final prediction is computed using:

$$\tilde{\epsilon}_\theta(\mathbf{z}_\lambda, \mathbf{c}) = (1 + w)\,\epsilon_\theta(\mathbf{z}_\lambda, \mathbf{c}) - w\,\epsilon_\theta(\mathbf{z}_\lambda) \tag{9}$$

where $w$ is a guidance weight that, when increased, pushes the generation toward the conditioned prediction while decreasing sample diversity.
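
Equation (9) and the training-time condition dropout are both short enough to sketch directly. The function names, the `None` placeholder for a dropped condition, and the dropout probability `p_uncond = 0.1` are assumptions made for this example rather than values taken from the text.

```python
import random

def guided_eps(eps_cond, eps_uncond, w):
    """Equation (9): (1 + w) * eps_cond - w * eps_uncond, applied per element."""
    return [(1.0 + w) * c - w * u for c, u in zip(eps_cond, eps_uncond)]

def maybe_drop_condition(cond, rng, p_uncond=0.1):
    """During training, replace the condition with None with probability p_uncond."""
    return None if rng.random() < p_uncond else cond

eps_cond = [1.0, 2.0]     # prediction with the condition
eps_uncond = [0.5, 0.0]   # prediction without it

print(guided_eps(eps_cond, eps_uncond, w=0.0))  # plain conditional prediction
print(guided_eps(eps_cond, eps_uncond, w=1.0))  # pushed away from the unconditional one

rng = random.Random(0)
dropped = sum(maybe_drop_condition("cat", rng) is None for _ in range(10000))
print(dropped / 10000)    # close to p_uncond
```

With $w = 0$ the formula reduces to the ordinary conditional prediction; larger $w$ extrapolates further along the direction from the unconditional to the conditional prediction.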

Training

The flow of training a diffusion model is as follows for an input training sample $x_0$:

  1. Randomly select a timestep $t$ between 1 and $T$. $T = 1000$ is a common choice.
  2. Sample noise $\epsilon$ and add it to the input with equation (2) to obtain $x_t$.
  3. Pass the noised image $x_t$, the timestep, and the condition (if using a conditional diffusion model) into the model, which returns the predicted noise $\epsilon_\theta$.
  4. Pass the predicted noise $\epsilon_\theta$ and the actual noise $\epsilon$ to the loss function.
  5. Perform backpropagation and update the model weights.
DDPM training process
Figure 3: DDPM training pipeline. Image source: Ho et al. (2020)
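
The five steps above can be sketched end to end with a deliberately tiny stand-in for the U-Net. Everything here is a toy assumption for illustration: the data are scalars, the schedule has only ten steps, and the "model" is one slope and offset per timestep, trained with a hand-written gradient instead of an autograd framework.

```python
import math
import random

T = 10
betas = [0.05 * (i + 1) for i in range(T)]   # toy schedule, not the one from the paper
alpha_bars, running = [], 1.0
for beta in betas:
    running *= 1.0 - beta
    alpha_bars.append(running)

# Hypothetical stand-in for the U-Net: eps_pred = a[t] * x_t + b[t].
params = [[0.0, 0.0] for _ in range(T)]

def predict(x_t, t):
    a, b = params[t - 1]
    return a * x_t + b

def noised(x0, t, eps):
    ab = alpha_bars[t - 1]
    return math.sqrt(ab) * x0 + math.sqrt(1.0 - ab) * eps  # equation (2)

def train_step(x0, rng, lr=0.02):
    t = rng.randint(1, T)                 # step 1: random timestep in 1..T
    eps = rng.gauss(0.0, 1.0)
    x_t = noised(x0, t, eps)              # step 2: add noise
    eps_pred = predict(x_t, t)            # step 3: predict the noise
    loss = (eps_pred - eps) ** 2          # step 4: squared error
    grad = 2.0 * (eps_pred - eps)         # step 5: gradient and weight update
    params[t - 1][0] -= lr * grad * x_t
    params[t - 1][1] -= lr * grad
    return loss

def eval_loss(data, rng, n=2000):
    total = 0.0
    for _ in range(n):
        t, eps = rng.randint(1, T), rng.gauss(0.0, 1.0)
        total += (predict(noised(rng.choice(data), t, eps), t) - eps) ** 2
    return total / n

data = [-1.0, 1.0]                        # toy dataset of scalar "images"
rng = random.Random(0)
loss_before = eval_loss(data, rng)
for _ in range(5000):
    train_step(rng.choice(data), rng)
loss_after = eval_loss(data, rng)
print(f"loss before: {loss_before:.3f}, after: {loss_after:.3f}")
```

A real training loop has the same shape, with a batched neural network in place of `predict` and an optimizer in place of the manual update.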

Inference: DDPM vs. DDIM

After the diffusion model has been trained, we can input an image of pure noise (alongside any conditioning) to the model, perform reverse diffusion, and produce realistic data samples. DDPM inference, however, is extremely time-consuming and compute-intensive: every timestep requires its own forward pass through the model. For 1000 timesteps, that means 1000 consecutive denoising operations.

DDPM inference process
Figure 4: DDPM inference starts from pure Gaussian noise and iteratively denoises over T timesteps to reconstruct a realistic sample. Image source: Ho et al. (2020)

Song et al. (2020) introduced denoising diffusion implicit models (DDIMs), which generalize the Markovian forward process of DDPMs to a non-Markovian one that keeps the same training objective. The reverse process is represented as:

$$\mathbf{x}_{t-1} = \sqrt{\bar{\alpha}_{t-1}} \underbrace{\left(\frac{\mathbf{x}_t - \sqrt{1 - \bar{\alpha}_t}\,\epsilon_\theta^{(t)}(\mathbf{x}_t)}{\sqrt{\bar{\alpha}_t}}\right)}_{\text{"predicted } \mathbf{x}_0\text{"}} + \underbrace{\sqrt{1 - \bar{\alpha}_{t-1} - \sigma_t^2} \cdot \epsilon_\theta^{(t)}(\mathbf{x}_t)}_{\text{"direction pointing to } \mathbf{x}_t\text{"}} + \underbrace{\sigma_t \epsilon_t}_{\text{random noise}} \tag{10}$$

Here $\bar{\alpha}_t$ is the cumulative signal from equation (2) (Song et al. write it as $\alpha_t$), and $\sigma_t$ controls how much fresh noise is added at each step; setting $\sigma_t = 0$ makes sampling deterministic.

At each inference step, we directly predict the clean image $x_0$ from the current image and the predicted $\epsilon$, then move to the next, less noisy timestep. Because this update does not require visiting every intermediate timestep, inference can be accelerated by taking large timestep jumps with little loss in generation quality. DDIM can reduce the 1000 steps of DDPM inference to roughly 50. Because DDIM uses the same noise-prediction objective as DDPM, DDIM inference can be applied to DDPM-trained models without retraining.
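
Here is a minimal sketch of strided DDIM sampling with $\sigma_t = 0$, written in plain Python. The function names are invented for this example, `eps_model` is a hypothetical callable standing in for a trained network, and the demo replaces it with an oracle that knows the clean image so the result can be checked.

```python
import math

def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev):
    """Deterministic DDIM update (sigma_t = 0) from x_t to an earlier timestep."""
    out = []
    for x, e in zip(x_t, eps_pred):
        x0_pred = (x - math.sqrt(1.0 - alpha_bar_t) * e) / math.sqrt(alpha_bar_t)
        out.append(math.sqrt(alpha_bar_prev) * x0_pred
                   + math.sqrt(1.0 - alpha_bar_prev) * e)
    return out

def ddim_sample(x_T, eps_model, alpha_bars, num_steps):
    """Run DDIM on an evenly strided subset of the T training timesteps.

    alpha_bars[t - 1] holds the cumulative signal for t = 1..T, and the
    cumulative signal at t = 0 is taken to be 1 (a clean image).
    """
    T = len(alpha_bars)
    stride = T // num_steps
    timesteps = list(range(T, 0, -stride))   # for T=1000, 50 steps: 1000, 980, ..., 20
    x = x_T
    for i, t in enumerate(timesteps):
        t_prev = timesteps[i + 1] if i + 1 < len(timesteps) else 0
        ab_prev = alpha_bars[t_prev - 1] if t_prev > 0 else 1.0
        x = ddim_step(x, eps_model(x, t), alpha_bars[t - 1], ab_prev)
    return x

T = 1000
alpha_bars, running = [], 1.0
for i in range(T):
    running *= 1.0 - (1e-4 + (0.02 - 1e-4) * i / (T - 1))  # linear schedule
    alpha_bars.append(running)

x0 = [0.5, -0.25, 1.0]
true_eps = [0.3, -1.2, 0.7]

def oracle(x, t):
    """Stand-in for a trained model: returns the exact noise for the known x0."""
    ab = alpha_bars[t - 1]
    return [(xi - math.sqrt(ab) * x0i) / math.sqrt(1.0 - ab) for xi, x0i in zip(x, x0)]

x_T = [math.sqrt(alpha_bars[-1]) * a + math.sqrt(1.0 - alpha_bars[-1]) * e
       for a, e in zip(x0, true_eps)]
sample = ddim_sample(x_T, oracle, alpha_bars, num_steps=50)
print(sample)
```

With a perfect noise predictor, 50 strided steps land back on the clean image. With a trained model the prediction is imperfect, which is why the number of steps trades speed against sample quality.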

References

  1. Sohl-Dickstein et al. (2015). Deep unsupervised learning using nonequilibrium thermodynamics. arXiv

  2. Ho et al. (2020). Denoising diffusion probabilistic models (DDPM). arXiv

  3. Nichol & Dhariwal (2021). Improved denoising diffusion probabilistic models. arXiv

  4. Ronneberger et al. (2015). U-Net: Convolutional networks for biomedical image segmentation. arXiv

  5. Vaswani et al. (2017). Attention is all you need. arXiv

  6. He et al. (2015). Deep residual learning for image recognition (ResNet). arXiv

  7. Dosovitskiy et al. (2020). An image is worth 16×16 words (Vision Transformer). arXiv

  8. Peebles & Xie (2022). Scalable diffusion models with transformers (DiT). arXiv

  9. Dhariwal & Nichol (2021). Diffusion models beat GANs on image synthesis (ADM). arXiv

  10. Ho & Salimans (2022). Classifier-free diffusion guidance. arXiv

  11. Song et al. (2020). Denoising diffusion implicit models (DDIM). arXiv