My takeaway series follows a Q&A format to explain AI concepts at three levels:
- Level 1: for anyone with general knowledge to understand the concept.
- Level 2: for anyone who wants to dive into the code implementation details of the concept.
- Level 3: for anyone who wants to understand the mathematics behind the technique.
Backpropagation is the core procedure behind the optimization algorithms used to train neural networks. The optimization algorithm repeatedly updates the parameters of the network to minimize the difference between the predicted output and the actual target values; backpropagation computes the gradients that the optimization algorithm uses for these updates.
In each training step of a neural network, a forward and a backward pass are performed. The backward pass is the backpropagation process.
Input:
- The loss $L$ between the predicted output $\hat{y}$ and the target value $y$, computed from the forward pass. The forward pass takes the input data $x$ and passes it through the network to produce the predicted output $\hat{y}$, as well as the loss $L$.

Output:
- The gradients of the loss with respect to the parameters in the network, $\frac{\partial L}{\partial \theta_i}$.
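To make the forward pass concrete, here is a minimal sketch in NumPy, assuming a one-hidden-layer regression network with a squared-error loss; the names (`W1`, `b1`, etc.), shapes, and the tanh activation are illustrative choices, not specifics from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4,))        # input data
y = np.array(1.5)                # target value

# Illustrative parameters of a tiny one-hidden-layer network.
W1, b1 = rng.normal(size=(3, 4)), np.zeros(3)
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)

# Forward pass: input -> hidden layer -> predicted output -> loss.
z1 = W1 @ x + b1                 # pre-activation of the hidden layer
a1 = np.tanh(z1)                 # hidden activation
y_hat = (W2 @ a1 + b2)[0]        # predicted output
loss = (y_hat - y) ** 2          # squared-error loss
print(loss)
```

The backward pass would then start from this `loss` value and work back through `y_hat`, `a1`, and `z1` to the parameters.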
In math, the gradient is a vector that contains all the partial derivatives of a multivariable function with respect to each variable. It points in the direction of the steepest increase of the function. This information is crucial for optimization algorithms to update the parameters effectively.
The loss function in neural network training is a multivariable function of the parameters $\theta_1, \theta_2, \dots, \theta_n$, so its gradient is the vector $\nabla L = \left( \frac{\partial L}{\partial \theta_1}, \dots, \frac{\partial L}{\partial \theta_n} \right)$.

Strictly speaking, a gradient is a vector, not a single value. However, we often refer to the single partial derivative for each parameter, e.g., $\frac{\partial L}{\partial \theta_i}$, as the "gradient" of that parameter.
The gradients are the core information used by the optimization algorithm to update the parameters of the network. The gradients computed by backpropagation are first-order gradients.
First-order gradients are used by first-order optimization algorithms like gradient descent, while second-order gradients are used by second-order optimization algorithms like Newton's method. In deep learning, the optimization algorithms are usually simple first-order methods, namely gradient descent and its variants (e.g., SGD, Adam), because second-order algorithms are too computationally expensive at this scale. Taking gradient descent as an example, the parameters are updated as follows:

$$\theta_i \leftarrow \theta_i - \eta \frac{\partial L}{\partial \theta_i}$$

where $\eta$ is the learning rate.
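A minimal numeric sketch of the gradient descent update rule, run on a toy quadratic loss $L(\theta) = (\theta - 3)^2$; the learning rate and starting point are illustrative choices:

```python
# Gradient descent on L(theta) = (theta - 3)^2, whose minimum is at theta = 3.
# The first-order gradient is dL/dtheta = 2 * (theta - 3).
theta = 0.0
eta = 0.1                        # learning rate (illustrative)
for _ in range(100):
    grad = 2 * (theta - 3)       # gradient of the loss at the current theta
    theta = theta - eta * grad   # the gradient descent update
print(theta)                     # converges toward the minimum at 3
```

In a real network, `grad` is exactly what backpropagation supplies for each parameter.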
Backpropagation computes the gradients of the loss, where the loss is a composite function of the parameters (composed of the loss function and the layer and activation functions of the network), so computing its gradients relies on the chain rule.
The chain rule for a multivariable composite function is as follows: it sums the derivatives along all the paths from the dependent variable to the independent variable. If $z = f(u_1, u_2, \dots, u_m)$ and each $u_j$ is a function of $x$, then

$$\frac{dz}{dx} = \sum_{j=1}^{m} \frac{\partial z}{\partial u_j} \frac{du_j}{dx}$$
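The chain rule can be checked numerically on a toy composite function; this sketch uses $z = u_1 u_2$ with $u_1 = x^2$ and $u_2 = \sin x$ (an illustrative choice) and compares the chain-rule result against a finite difference:

```python
import math

# z = u1 * u2 with u1 = x**2 and u2 = sin(x), so by the chain rule:
# dz/dx = (dz/du1)(du1/dx) + (dz/du2)(du2/dx) = u2 * 2x + u1 * cos(x)
x = 0.7
u1, u2 = x ** 2, math.sin(x)
analytic = u2 * 2 * x + u1 * math.cos(x)   # sum over both paths

def z(x):
    return x ** 2 * math.sin(x)

# Central finite difference as an independent check.
eps = 1e-6
numeric = (z(x + eps) - z(x - eps)) / (2 * eps)
print(analytic, numeric)
```

The two values agree to several decimal places, confirming the "sum over all paths" rule.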
The composition relation between the variables in the function can be illustrated as a graph, which is called a computation graph. For the loss function in neural network training (suppose a feed-forward network for a regression problem), the computation graph chains the layers together, for example:

$$z_1 = W_1 x + b_1, \quad a_1 = \sigma(z_1), \quad \hat{y} = W_2 a_1 + b_2, \quad L = (\hat{y} - y)^2$$

All the weights $W_1, W_2$ and biases $b_1, b_2$ are parameters of the network.

According to the chain rule,

$$\frac{\partial L}{\partial W_2} = \frac{\partial L}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial W_2}, \qquad \frac{\partial L}{\partial W_1} = \frac{\partial L}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial a_1} \frac{\partial a_1}{\partial z_1} \frac{\partial z_1}{\partial W_1}$$

Notice that the factor $\frac{\partial L}{\partial \hat{y}}$ appears in both expressions, and the deeper gradient reuses everything computed for the shallower one.
We can see from the above example that the most efficient way to compute all the gradients is to start from the loss and compute the gradients layer by layer backwards. The gradient of a layer is computed from the gradient of the next layer, effectively propagating the gradient back through the network. This is why it is called backpropagation.
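The layer-by-layer backward pass can be sketched by hand for a small two-layer regression network and checked against a finite difference; the shapes and the tanh activation below are illustrative assumptions:

```python
import numpy as np

# Network (illustrative): y_hat = W2 @ tanh(W1 @ x + b1) + b2, L = (y_hat - y)**2
rng = np.random.default_rng(0)
x, y = rng.normal(size=4), 1.5
W1, b1 = rng.normal(size=(3, 4)), np.zeros(3)
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)

# Forward pass: cache the intermediate values the backward pass will need.
z1 = W1 @ x + b1
a1 = np.tanh(z1)
y_hat = (W2 @ a1 + b2)[0]
loss = (y_hat - y) ** 2

# Backward pass: apply the chain rule layer by layer, starting from the loss.
dy_hat = 2 * (y_hat - y)          # dL/dy_hat
dW2 = dy_hat * a1[None, :]        # dL/dW2
db2 = np.array([dy_hat])          # dL/db2
da1 = dy_hat * W2[0]              # dL/da1: gradient coming from the next layer
dz1 = da1 * (1 - a1 ** 2)         # dL/dz1, using tanh'(z) = 1 - tanh(z)**2
dW1 = np.outer(dz1, x)            # dL/dW1
db1 = dz1                         # dL/db1

# Sanity-check one entry of dW1 against a central finite difference.
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
lp = ((W2 @ np.tanh(W1p @ x + b1) + b2)[0] - y) ** 2
W1m = W1.copy(); W1m[0, 0] -= eps
lm = ((W2 @ np.tanh(W1m @ x + b1) + b2)[0] - y) ** 2
numeric = (lp - lm) / (2 * eps)
print(dW1[0, 0], numeric)
```

Note how `da1` (the gradient arriving from the output layer) is computed before `dW1`: each layer's gradient is built from the gradient of the layer after it, exactly the backward propagation described above.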