Conceptual Level
At the conceptual level, I focus on a basic understanding of the concepts, without delving into the details of the mathematics and implementation, so that anyone with a general knowledge of AI can understand the content.
Q: What is batch normalisation?
A: Batch normalisation is a training technique that can accelerate and stabilise the training of deep neural networks. It appears as a layer in the neural network.
Q: Who proposed batch normalisation? What is the background?
A: Batch normalisation was proposed by Sergey Ioffe and Christian Szegedy in 2015 in an ICML paper (Ioffe and Szegedy 2015). At that stage of deep learning, increasingly deep neural networks were developing rapidly and brought significant challenges in training. Batch normalisation was proposed to help train models faster.
Q: Why is it called batch normalisation?
A: The term batch normalisation quite literally describes its function: it’s a layer that performs a normalisation operation on a batch of input data. Normalisation is a mathematical transformation that rescales a collection of data points to a standard range without distorting the relative relationships between them.
Conceptually, normalisation requires multiple data points to compute the necessary statistics (like mean and variance). In deep learning, it’s often not feasible to load the entire training dataset into memory at once due to its size. Therefore, the data is typically divided into smaller mini-batches. Batch normalisation applies its normalisation procedure to each of these mini-batches individually as they are processed during training, hence the name ‘batch normalisation’.
Q: Why does batch normalisation work?
A: The original batch normalisation paper (Ioffe and Szegedy 2015) explains its effectiveness through the concept of reducing internal covariate shift. Internal covariate shift refers to the change in the distribution of layer inputs during training, caused by updates in the parameters of preceding layers.
As illustrated in Figure 1, neural networks are modularised as layers. While the input distribution to the first layer is fixed by the training data, the input distribution to every subsequent layer depends on the parameters of all preceding layers, so it keeps shifting as those parameters are updated. Each layer therefore has to keep adapting to a moving target, which slows training down.

However, a later paper (Santurkar et al. 2018) questioned whether reducing internal covariate shift is truly the key reason batch normalisation is effective. They showed in both theory and experiments that the benefit of batch normalisation is more accurately attributed to its ability to smooth and reshape the loss landscape, making gradients more predictable and training more stable. This effect leads to faster convergence and improved generalisation.
Implementation Level
At the implementation level, I go into the details of the mathematics and implementation, for anyone who is working on AI projects and wants to know exactly how the technique works.
Q: What are the input and output of batch normalisation?
A: Batch normalisation works as a layer in a neural network, and like other layers, it has both inputs and outputs. Since it involves statistical normalisation, the input must be a batch of data.
Inputs:
- Input feature, either the pre-activation $z$ or the post-activation $a$. It must consist of a batch of data with batch size $m > 1$ (see the shape sketch after this list).
Outputs:
- Processed feature $\hat{z}$ or $\hat{a}$, with the same shape as the input. $\hat{z}$ is passed to the activation function; $\hat{a}$ is passed to the next layer in the neural network.
Note that, because the layer operates on the whole batch at once, it can be imagined as part of a network whose input is batch-size times larger.
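To make the input/output contract concrete, here is a minimal shape sketch using PyTorch’s `nn.BatchNorm1d` (the layer size of 4 features and the batch size of 8 are arbitrary choices for illustration):

```python
import torch
import torch.nn as nn

# A batch of 8 samples, each with 4 features (pre- or post-activation values).
x = torch.randn(8, 4)

# Batch normalisation layer for a 4-feature (4-neuron) layer.
bn = nn.BatchNorm1d(num_features=4)

# The output has exactly the same shape as the input batch.
y = bn(x)
print(x.shape, y.shape)  # torch.Size([8, 4]) torch.Size([8, 4])
```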
Q: Is batch normalisation a parameterised layer? What are the parameters?
A: Yes, batch normalisation is a parameterised layer. It has learnable parameters that are optimised during training, similar to other layers in the network.
Although its mechanism may seem to involve only normalisation, which is purely based on statistics (mean and variance) calculated from the input batch, batch normalisation also introduces rescaling and shifting after the normalisation. This introduces two learnable parameters:
- Rescaling parameter $\gamma$ and shifting parameter $\beta$. They have the same dimension as the number of neurons (features) in the layer, because the rescaling and shifting are applied neuron-wise (see the sketch after this list).
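As an illustration, assuming PyTorch’s `nn.BatchNorm1d`, the rescaling parameter is exposed as `weight` and the shifting parameter as `bias`, each with one entry per feature:

```python
import torch.nn as nn

# A batch normalisation layer for a layer with 4 neurons (features).
bn = nn.BatchNorm1d(num_features=4)

# `weight` is the rescaling parameter (gamma), initialised to ones;
# `bias` is the shifting parameter (beta), initialised to zeros.
print(bn.weight.shape)  # torch.Size([4])
print(bn.bias.shape)    # torch.Size([4])

# Both are registered as learnable parameters and updated by the optimiser.
print([name for name, _ in bn.named_parameters()])  # ['weight', 'bias']
```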
Q: How is batch normalisation calculated?
A: It involves two parts: normalisation and rescaling/shifting.
- Normalisation: calculate the per-feature mean $\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i$ and variance $\sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu_B)^2$ of the input feature across the batch, then normalise the input feature: $\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$, where $\epsilon$ is a small constant for numerical stability.
- Rescaling and shifting: scale and shift the normalised feature using the learnable parameters $\gamma$ and $\beta$: $y_i = \gamma\,\hat{x}_i + \beta$ (see the code sketch after this list).
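Below is a minimal NumPy sketch of the two steps in training mode, following the formulas above (the function name and the value $\epsilon = 10^{-5}$ are my own choices for illustration):

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """x: (batch_size, num_features); gamma, beta: (num_features,)."""
    # 1. Normalisation: per-feature mean and variance across the batch.
    mu = x.mean(axis=0)                    # mu_B
    var = x.var(axis=0)                    # sigma_B^2
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalised feature

    # 2. Rescaling and shifting with the learnable parameters.
    return gamma * x_hat + beta

x = np.random.randn(8, 4) * 3.0 + 5.0
y = batch_norm_train(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0).round(6))  # ~0 per feature
print(y.std(axis=0).round(3))   # ~1 per feature
```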
Q: How big should the batch size be for batch normalisation?
A: As the formulation shows, batch normalisation relies on computing the mean and variance across a batch of data. If the batch size is too small, the estimated statistics can become noisy and unreliable, which may lead to unstable training and reduced model performance.
Therefore, it is generally recommended to use a moderately large batch size (e.g., 32 or more) to ensure stable and accurate normalisation.
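A quick numerical sketch of why small batches are problematic (the data distribution and batch sizes below are arbitrary choices for illustration): estimate the per-feature mean from batches of different sizes and compare how much the estimates fluctuate.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=3.0, size=100_000)  # true mean 2.0

for batch_size in (2, 8, 32, 256):
    # Estimate the mean from 1000 random batches of this size.
    means = [data[rng.choice(data.size, batch_size)].mean() for _ in range(1000)]
    print(batch_size, round(float(np.std(means)), 3))

# Smaller batches give much noisier mean (and variance) estimates,
# which translates into noisier normalisation during training.
```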
Q: Why rescaling and shifting? Doesn’t that just cancel out the normalisation?
A: First, rescaling and shifting do not exactly recover the input before normalisation, because the learned scale $\gamma$ and shift $\beta$ are not necessarily equal to the original standard deviation and mean.
Second, the purpose of introducing these parameters is to allow the model to learn the optimal scale and offset for each feature, instead of forcing every feature to have zero mean and unit variance.
Third, even in the extreme case where the identity mapping is optimal, the network can learn $\gamma = \sqrt{\sigma_B^2 + \epsilon}$ and $\beta = \mu_B$ to recover the original activations, so introducing these parameters does not reduce the expressive power of the layer.
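The third point can be checked directly with a small sketch based on the formulas above: choosing $\gamma = \sqrt{\sigma_B^2 + \epsilon}$ and $\beta = \mu_B$ makes the rescaling and shifting undo the normalisation exactly.

```python
import numpy as np

eps = 1e-5
x = np.random.randn(8, 4) * 3.0 + 5.0

# Normalisation step.
mu = x.mean(axis=0)
var = x.var(axis=0)
x_hat = (x - mu) / np.sqrt(var + eps)

# Choose gamma and beta so that rescaling/shifting undoes the normalisation.
gamma = np.sqrt(var + eps)
beta = mu
y = gamma * x_hat + beta

print(np.allclose(y, x))  # True: the layer can represent the identity mapping
```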
Q: Batch normalisation works as a layer in a neural network and is applied to a batch of data during training. But what happens at test (inference) time, when we often feed in a single data point instead of a batch?
A: One simple option would be to skip the batch normalisation layer at test time, but that makes training and testing behave differently. Since the batch is only needed to calculate the mean and standard-deviation statistics for normalisation, the typical approach is to use empirical statistics gathered during the training stage.
For example, PyTorch computes a moving average (accumulated mean and variance) during training,

$$\mu_{\text{running}} \leftarrow (1 - \alpha)\,\mu_{\text{running}} + \alpha\,\mu_B, \qquad \sigma^2_{\text{running}} \leftarrow (1 - \alpha)\,\sigma^2_{\text{running}} + \alpha\,\sigma_B^2,$$

where $\alpha$ is the momentum (0.1 by default), and uses these statistics to normalise the data during inference:

$$\hat{x} = \frac{x - \mu_{\text{running}}}{\sqrt{\sigma^2_{\text{running}} + \epsilon}}, \qquad y = \gamma\,\hat{x} + \beta.$$
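A small PyTorch sketch of this behaviour (the layer size and the synthetic data are arbitrary choices for illustration): in training mode the layer updates `running_mean` and `running_var` from each batch; in evaluation mode it normalises with these running statistics, so even a single data point can be processed.

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm1d(num_features=4)

# Training mode: statistics come from each batch, and the running
# (moving-average) statistics are updated with momentum 0.1 by default.
bn.train()
for _ in range(100):
    batch = torch.randn(32, 4) * 2.0 + 1.0
    bn(batch)

print(bn.running_mean)  # accumulated mean, close to 1.0 per feature
print(bn.running_var)   # accumulated variance, close to 4.0 per feature

# Evaluation (inference) mode: a single data point is normalised with the
# running statistics instead of batch statistics, so no batch is required.
bn.eval()
single = torch.randn(1, 4)
out = bn(single)
print(out.shape)  # torch.Size([1, 4])
```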