My takeaway series follows a Q&A format to explain AI concepts at three levels:
- For anyone with general knowledge, so the core idea is understandable without background.
- For anyone who wants to dive into the code implementation details of the concept.
- For anyone who wants to understand the mathematics behind the technique.
A diffusion model is a type of generative model that learns to generate data by simulating a diffusion process. It is still a neural network, but with an added process of gradually denoising data, where each denoising step is learned by the network.
Input:
- Fixed-length random noise with the shape of the target data (like an image), typically drawn from a Gaussian distribution.
Output:
- A generated data sample that resembles the training data, such as an image in the same shape as the input.
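As a concrete sketch of these shapes (using PyTorch; the 1x3x64x64 image shape is just an illustrative assumption):

```python
import torch

# Input: fixed-shape Gaussian noise, e.g. a batch of one 3-channel 64x64 image.
x_T = torch.randn(1, 3, 64, 64)

# Output: the generated sample keeps exactly this shape; the denoising
# procedure sketched later in this post maps noise -> data.
print(x_T.shape)  # torch.Size([1, 3, 64, 64])
```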

Here the forward process refers to gradually corrupting data with noise, and it is what generates the training examples for the diffusion model. A small amount of Gaussian noise is added to the sample at each step, until after many steps it is indistinguishable from pure Gaussian noise.
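As a minimal sketch of this noising process, assuming the standard DDPM formulation with a linear beta schedule (the schedule values and the `q_sample` name are common conventions, not fixed by the method):

```python
import torch

# Linear noise schedule; these values follow common DDPM defaults,
# but the exact schedule is a design choice.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

def q_sample(x0, t, noise):
    """Jump directly from the clean sample x0 to the noisy x_t:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise."""
    a = alpha_bars[t].sqrt()
    b = (1.0 - alpha_bars[t]).sqrt()
    return a * x0 + b * noise

x0 = torch.randn(1, 3, 64, 64)          # stand-in for a clean training image
noise = torch.randn_like(x0)
x_t = q_sample(x0, t=500, noise=noise)  # roughly halfway to pure noise
```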
This process is typically modeled as a Markov chain: a mathematical system that transitions from one state to another within a state space, where each transition depends only on the current state. Here the state space is the data space, and the learned transition from one state to the next is a step of the denoising neural network.
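A sketch of walking that Markov chain at generation time, assuming a trained `model(x, t)` that predicts the noise present in `x` at step `t` (a simplified DDPM-style sampler; the variance choice `sigma_t^2 = beta_t` is one common convention):

```python
import torch

@torch.no_grad()
def p_sample_loop(model, shape, betas):
    """Walk the Markov chain backwards: start from pure noise x_T and
    apply one learned denoising transition per step until x_0."""
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)                       # state x_T: pure Gaussian noise
    for t in reversed(range(len(betas))):
        eps = model(x, t)                        # predicted noise at step t
        coef = betas[t] / (1.0 - alpha_bars[t]).sqrt()
        x = (x - coef * eps) / alphas[t].sqrt()  # mean of the x_{t-1} transition
        if t > 0:
            x = x + betas[t].sqrt() * torch.randn_like(x)  # transition noise
    return x
```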
The core architecture of a diffusion model is the denoising neural network. It can be any neural network that maps a sample to an output of the same shape. The most commonly used architecture is the U-Net, a convolutional neural network (CNN) with an encoder-decoder structure and skip connections.
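A deliberately tiny U-Net sketch in PyTorch to make the encoder-decoder-with-skip idea concrete; real diffusion U-Nets add many more resolution levels, normalization, attention, and timestep embeddings:

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Minimal encoder-decoder with one skip connection;
    output shape always matches input shape."""
    def __init__(self, ch=3, hidden=64):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(ch, hidden, 3, padding=1), nn.ReLU())
        self.down = nn.Sequential(
            nn.Conv2d(hidden, hidden * 2, 3, stride=2, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(hidden * 2, hidden, 4, stride=2, padding=1)
        # Skip connection: concatenate encoder features with upsampled features.
        self.dec = nn.Conv2d(hidden * 2, ch, 3, padding=1)

    def forward(self, x, t=None):  # t: step index, ignored in this toy version
        e = self.enc(x)
        mid = self.down(e)
        u = self.up(mid)
        return self.dec(torch.cat([u, e], dim=1))

net = TinyUNet()
x = torch.randn(1, 3, 64, 64)
assert net(x).shape == x.shape  # same shape in, same shape out
```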
The denoising neural network shares the same architecture and weights across all steps; the step index t is passed in as an extra input so the network knows how much noise remains to be removed.
Diffusion models are trained with supervised learning on each step of the denoising process.
To construct the training data, we start with a clean data sample, pick a random step t, and add the corresponding amount of Gaussian noise; the noise we added becomes the supervision target.
The model is indeed trained to predict the random noise added at each step, but it does not learn to generate randomly: the prediction is conditioned on the noisy input and the step index, so the noise that produced that particular sample is a deterministic supervised target.
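Putting these pieces together, a sketch of one training step under the assumptions above (the network predicts the added noise and is fit with a simple MSE loss; `alpha_bars` and `TinyUNet` reuse the earlier sketches):

```python
import torch
import torch.nn.functional as F

def training_step(model, x0, alpha_bars, optimizer):
    """One DDPM-style supervised step: corrupt x0 with known noise,
    then train the network to recover exactly that noise."""
    t = torch.randint(0, len(alpha_bars), (1,)).item()  # random step
    noise = torch.randn_like(x0)                        # the supervised target
    a = alpha_bars[t].sqrt()
    b = (1.0 - alpha_bars[t]).sqrt()
    x_t = a * x0 + b * noise                            # forward noising
    pred = model(x_t, t)                                # predict the noise
    loss = F.mse_loss(pred, noise)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example usage, reusing the earlier sketches:
# opt = torch.optim.Adam(net.parameters(), lr=1e-4)
# loss = training_step(net, x0, alpha_bars, opt)
```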