My takeaway series follows a Q&A format to explain AI concepts at three levels:
- For anyone with general knowledge who wants to understand the concept.
- For anyone who wants to dive into the code implementation details of the concept.
- For anyone who wants to understand the mathematics behind the technique.
A Recurrent Neural Network (RNN) is a type of neural network designed for processing sequential data. Unlike conventional neural networks that map a single input to a single output, RNNs have connections that loop back on themselves (hence "recurrent"), allowing them to accumulate information over the sequence iteratively.
Because RNNs can naturally handle variable-length sequential data. If we feed sequential data into conventional neural networks, we have to fix the length of the input sequence, either by truncating longer sequences or padding shorter ones, which can lose information or introduce noise.
The RNN processes sequences of data.
Inputs:
- A sequence of input data $x_1, x_2, \dots, x_T$ of any length $T$, where $T$ is the length of the sequence. The index $t$ is called the time step in the sequence.

Outputs:
- A sequence of outputs $y_1, y_2, \dots, y_T$ of the same length as the input sequence, where each output $y_t$ corresponds to the input $x_t$ at time step $t$.

Please note that an RNN can receive a sequence of any length: the length $T$ does not need to be fixed in advance and can vary from sequence to sequence.
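As a concrete illustration of this input/output interface, here is a tiny NumPy sketch with two sequences of different lengths; the dimensions are arbitrary illustrative values of mine, not from any particular dataset:

```python
import numpy as np

input_dim = 4                          # size of each input vector x_t (illustrative)
x_a = np.random.randn(5, input_dim)    # a sequence with T = 5 time steps
x_b = np.random.randn(3, input_dim)    # another sequence with T = 3 time steps

# The same RNN can consume both sequences; it produces one output vector per
# time step, so the output shapes would be (5, output_dim) and (3, output_dim).
```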
The vanilla RNN architecture is illustrated as follows:

The blue and yellow arrows are the connections between the input and hidden states:

$$h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h) \tag{1}$$

The red arrow is the connection between the hidden state and output:

$$y_t = W_{hy} h_t + b_y \tag{2}$$
These weights are shared across all time steps, which is a key feature of RNNs that allows them to generalise across different sequence lengths.
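To make this concrete, here is a minimal NumPy sketch of Equations 1 and 2, assuming the common tanh activation for the hidden state; the weight names $W_{xh}$, $W_{hh}$, $W_{hy}$ match the equations above and the layer sizes are arbitrary illustrative values:

```python
import numpy as np

input_dim, hidden_dim, output_dim = 4, 8, 3                   # illustrative sizes
rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))    # blue arrow: input -> hidden
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))   # yellow arrow: hidden -> hidden
W_hy = rng.normal(scale=0.1, size=(output_dim, hidden_dim))   # red arrow: hidden -> output
b_h, b_y = np.zeros(hidden_dim), np.zeros(output_dim)

def rnn_step(x_t, h_prev):
    """One time step: Equation 1 (new hidden state) and Equation 2 (output)."""
    h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)   # Equation 1
    y_t = W_hy @ h_t + b_y                            # Equation 2
    return h_t, y_t
```

The same `W_xh`, `W_hh`, `W_hy` are used at every call of `rnn_step`, which is exactly the weight sharing described above.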
The forward pass of an RNN involves iterating through each time step of the input sequence and updating the hidden state and output at each step, instead of processing the entire sequence at once.
- Hidden state initialization: Set the initial hidden state $h_0$ to a vector of zeros or random values.
- For each time step $t$ from $1$ to $T$:
  - Compute the new hidden state $h_t$ using the current input $x_t$ and the previous hidden state $h_{t-1}$, as in Equation 1.
  - Compute the output $y_t$ using the current hidden state $h_t$, as in Equation 2.
The hidden state $h_t$ acts as the memory of the network: it summarizes the information accumulated from all inputs up to time step $t$.
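Putting these steps together, here is a minimal NumPy sketch of the full forward pass; the function and variable names are my own, and the weights are random placeholders rather than trained parameters:

```python
import numpy as np

def rnn_forward(x_seq, W_xh, W_hh, W_hy, b_h, b_y):
    """Forward pass over a whole sequence x_seq of shape (T, input_dim)."""
    h = np.zeros(W_hh.shape[0])                      # hidden state initialization, h_0 = 0
    hs, ys = [], []
    for x_t in x_seq:                                # iterate over time steps 1..T
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)     # Equation 1: new hidden state
        y = W_hy @ h + b_y                           # Equation 2: output
        hs.append(h)
        ys.append(y)
    return np.stack(hs), np.stack(ys)

# Illustrative usage with random (untrained) weights and a length-5 sequence.
rng = np.random.default_rng(0)
input_dim, hidden_dim, output_dim = 4, 8, 3
W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
W_hy = rng.normal(scale=0.1, size=(output_dim, hidden_dim))
b_h, b_y = np.zeros(hidden_dim), np.zeros(output_dim)

x_seq = rng.normal(size=(5, input_dim))
hs, ys = rnn_forward(x_seq, W_xh, W_hh, W_hy, b_h, b_y)
print(hs.shape, ys.shape)                            # (5, 8) (5, 3)
```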
The loss function of an RNN is the average of the loss at each time step:

$$L = \frac{1}{T} \sum_{t=1}^{T} L_t$$

where $L_t$ is the loss computed from the output $y_t$ at time step $t$.
This loss is not computed and backpropagated all at once, but step by step through time, which is called Backpropagation Through Time (BPTT):
- For each time step $t$ from $1$ to $T$:
  - The forward pass computes the loss $L_t$ with respect to the output $y_t$ at time step $t$.
  - The backward pass computes the gradients $\frac{\partial L_t}{\partial W_{hy}}$, $\frac{\partial L_t}{\partial W_{hh}}$ and $\frac{\partial L_t}{\partial W_{xh}}$ with respect to the model parameters (biases are ignored for simplicity) using the chain rule:
    - $\frac{\partial L_t}{\partial W_{hy}}$ is first computed from differentiating Equation 2.
    - $\frac{\partial L_t}{\partial W_{hh}}$ is then computed from differentiating Equation 1. Because this part of the forward pass involves the chain $h_1 \to h_2 \to \dots \to h_t$, the gradient is backpropagated through all the hidden states from time step $t$ back to $1$.
    - $\frac{\partial L_t}{\partial W_{xh}}$ is also computed from differentiating Equation 1, after backpropagating through all the hidden states from time step $t$ back to $1$.
- The total gradient is the sum of the gradients from each time step (with a $\frac{1}{T}$ factor because the loss is the average of the per-step losses):

$$\frac{\partial L}{\partial W} = \frac{1}{T} \sum_{t=1}^{T} \frac{\partial L_t}{\partial W}, \quad W \in \{W_{xh}, W_{hh}, W_{hy}\}$$
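Below is a minimal NumPy sketch of BPTT for this vanilla RNN, assuming (my choice, not stated above) a per-step squared-error loss $L_t = \frac{1}{2}\lVert y_t - \text{target}_t \rVert^2$ and the averaged loss above; the function and variable names are illustrative, not the only way to organise the computation:

```python
import numpy as np

def rnn_bptt(x_seq, targets, W_xh, W_hh, W_hy, b_h, b_y):
    """Forward pass plus Backpropagation Through Time for a vanilla RNN.

    Assumes L_t = 0.5 * ||y_t - target_t||^2 and L = average of L_t over t.
    Returns the loss and the gradients of L w.r.t. W_xh, W_hh, W_hy
    (bias gradients are omitted for simplicity, as in the text).
    """
    T = x_seq.shape[0]
    hs = [np.zeros(W_hh.shape[0])]                 # hs[0] is h_0
    ys = []
    for t in range(T):                             # forward pass
        hs.append(np.tanh(W_xh @ x_seq[t] + W_hh @ hs[-1] + b_h))  # Equation 1
        ys.append(W_hy @ hs[-1] + b_y)                             # Equation 2
    loss = sum(0.5 * np.sum((y - tgt) ** 2) for y, tgt in zip(ys, targets)) / T

    dW_xh, dW_hh, dW_hy = (np.zeros_like(W) for W in (W_xh, W_hh, W_hy))
    dh_next = np.zeros(W_hh.shape[0])              # gradient arriving from step t+1
    for t in reversed(range(T)):                   # backward pass through time
        dy = (ys[t] - targets[t]) / T              # dL/dy_t (1/T from the average)
        dW_hy += np.outer(dy, hs[t + 1])           # from differentiating Equation 2
        dh = W_hy.T @ dy + dh_next                 # total gradient reaching h_t
        dz = (1.0 - hs[t + 1] ** 2) * dh           # through the tanh of Equation 1
        dW_xh += np.outer(dz, x_seq[t])            # from differentiating Equation 1
        dW_hh += np.outer(dz, hs[t])               # hs[t] is h_{t-1}
        dh_next = W_hh.T @ dz                      # propagate to earlier time steps
    return loss, dW_xh, dW_hh, dW_hy
```

A quick sanity check for such an implementation is to compare the returned gradients against finite differences of the loss with respect to a few individual weights.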
RNNs are commonly used for tasks involving sequential data, such as text, speech, and time series. Note that when we say sequential data, we usually mean variable-length sequential data, because fixed-length sequences can essentially be treated as conventional (fixed-size) inputs.
There are many tasks on various kinds of sequential data. RNNs can generally be used for three types of tasks, each with many applications:
- Many to many: The input is a sequence, and the output is a sequence of equal length.
- POS tagging: Given a sequence of words (a sentence), predict the part-of-speech tag (noun, verb, adjective, etc.) for each word in the sequence.
- Machine translation: Given a sequence of words (a sentence) in one language, predict the sequence of words in another language. Machine translation is a sequence-to-sequence task where the input and output sequences can have different lengths, so it is not a good example of a many-to-many task. Please check autoregressive prediction below or my article about the encoder-decoder architecture.
- Many to one: The input is a sequence, and the output is a single value. In an RNN, we can simply take the last output $y_T$ as the single output (see the sketch after this list).
- Sentiment classification: Given a sequence of words (a sentence), predict the sentiment (positive, negative, neutral) for the whole sequence.
- Speech recognition: Given a sequence of audio features, predict the spoken word or phrase.
- Autoregressive prediction: The input is a sequence, and the output is a single value. The output is then appended to the input, and the model is used to predict the next value in the sequence. This can be repeated to generate a sequence of outputs (see the sketch after this list).
- Stock price prediction: Given a sequence of historical stock prices, predict the stock price for the time steps in the future. This is a regression task for sequential data.
- Machine translation can also be designed as an autoregressive prediction task.
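To make the three task types concrete, below is a small illustrative sketch using a univariate time series (so each $x_t$ and $y_t$ is a single number); the weights are random placeholders and the names are my own, but the three usage patterns are the ones described above:

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim, output_dim = 1, 8, 1      # a univariate time series
W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
W_hy = rng.normal(scale=0.1, size=(output_dim, hidden_dim))

def rnn_forward(x_seq):
    """Return one output vector per time step (Equations 1 and 2, biases omitted)."""
    h, ys = np.zeros(hidden_dim), []
    for x_t in x_seq:
        h = np.tanh(W_xh @ x_t + W_hh @ h)
        ys.append(W_hy @ h)
    return np.stack(ys)

x_seq = rng.normal(size=(10, input_dim))         # an input sequence of length 10

# Many to many: keep the whole output sequence, one prediction per time step.
y_seq = rnn_forward(x_seq)                       # shape (10, 1)

# Many to one: keep only the last output y_T as the sequence-level prediction.
y_last = y_seq[-1]                               # shape (1,)

# Autoregressive prediction: append each prediction to the input and repeat.
seq = list(x_seq)
for _ in range(3):                               # generate 3 future time steps
    next_value = rnn_forward(np.stack(seq))[-1]  # predict the next value
    seq.append(next_value)                       # feed it back in as input
```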
Vanilla RNNs have several limitations, including:
- Vanishing / exploding gradients: During BPTT, if the sequence is long, the gradients backpropagated through many time steps can become very small (vanish) or very large (explode), making it difficult to train the model effectively (a small numerical illustration follows this list).
- Short-range dependencies: The hidden state has limited capacity to store information as it is overwritten at each time step, making it difficult to capture long-range dependencies in the sequence.
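The vanishing / exploding gradient effect from the first point can be seen with a toy calculation: during BPTT the gradient flowing backwards is repeatedly multiplied by $W_{hh}^\top$ (scaled by the tanh derivative, which is at most 1), so its norm shrinks or grows roughly geometrically with the number of time steps. The sketch below uses deliberately simple recurrent weights to make the effect obvious:

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim, T = 8, 100                           # a fairly long sequence
grad = rng.normal(size=hidden_dim)               # gradient at the last time step

for scale in (0.5, 2.0):                         # "small" vs "large" recurrent weights
    W_hh = scale * np.eye(hidden_dim)            # toy recurrent weight matrix
    g = grad.copy()
    for _ in range(T):                           # repeated multiplication during BPTT
        g = W_hh.T @ g                           # (tanh derivative <= 1 omitted here)
    print(scale, np.linalg.norm(g))              # ~1e-30 (vanishes) vs ~1e+30 (explodes)
```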
These limitations have led to the development of more advanced RNN architectures, such as Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU), which are designed to address these issues.