My takeaway series follows a Q&A format to explain AI concepts at three levels:

- For anyone with general knowledge.
- For anyone who wants to dive into the code implementation details of the concept.
- For anyone who wants to understand the mathematics behind the technique.
Attention is a type of neural network layer that allows neural networks to focus on specific parts of the input sequence when making predictions.
Attention was first introduced in 2014 (Bahdanau, Cho, and Bengio 2014) for machine translation tasks.
The background was the need to improve the performance of sequence-to-sequence models, particularly for tasks like machine translation. Traditional RNNs struggled with long-range dependencies and often produced poor translations for long sentences. The attention mechanism allowed the model to focus on specific parts of the input sequence when generating each word in the output sequence, significantly improving translation quality.
Later, the self-attention mechanism was proposed in the Transformer architecture (Vaswani et al. 2017) in 2017, which further advanced the use of attention in deep learning. Please refer to AI Concept Takeaway: Transformer for more details.
We first discuss the attention mechanism in general, then the attention layer in neural networks.
Given a set of vectors, we want to know which of them another vector should pay attention to, and how much attention should be paid to each. This is what the attention mechanism does.
Input:

- Query: a vector $q \in \mathbb{R}^{d_k}$ that represents the element we want to compute attention for.
- Targets: a dictionary of key-value vectors:
  - Key: a vector $k_i \in \mathbb{R}^{d_k}$ that represents the identity of each element in the dictionary. The keys are what the query will be compared against. They can be represented as a matrix $K \in \mathbb{R}^{n \times d_k}$. Note that the dimension of the key vectors is the same as that of the query vector, $d_k$.
  - Value: a vector $v_i \in \mathbb{R}^{d_v}$ that represents the information content of each element in the dictionary. The values can be represented as a matrix $V \in \mathbb{R}^{n \times d_v}$. Note that the dimension of the value vectors, $d_v$, can be different from that of the query and key vectors, $d_k$.
Output:

- Attention scores: a vector $\alpha \in \mathbb{R}^n$ that indicates how much attention should be paid to each target element. Each score $\alpha_i$ is a value between 0 and 1, and the sum of all scores is 1.
- Weighted sum value: a vector $o = \sum_{i=1}^{n} \alpha_i v_i \in \mathbb{R}^{d_v}$ that is the weighted sum of the values, where the weights are given by the attention scores.

There can be multiple query vectors, in which case they are stacked into a matrix $Q \in \mathbb{R}^{m \times d_k}$ and the attention computation is carried out for each query in parallel.
The attention scores are computed from the similarity between the query $q$ and each key $k_i$, typically measured by the dot product $q \cdot k_i$.

The attention scores are each a value between 0 and 1, and the sum of all scores is 1. This can be achieved by applying the softmax function to the similarity scores. But before applying the softmax function, the similarity scores are often scaled by the square root of the dimension of the key vectors, $\sqrt{d_k}$, so that large dot products do not push the softmax into regions with extremely small gradients.

This is known as scaled dot-product attention:

$$\alpha = \mathrm{softmax}\left(\frac{K q}{\sqrt{d_k}}\right) \in \mathbb{R}^{n}$$

The matrix form of the attention computation, with the queries stacked into $Q$, is:

$$A = \mathrm{softmax}\left(\frac{Q K^\top}{\sqrt{d_k}}\right), \qquad O = A V$$

It is often seen as a single attention function:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$$
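To make the matrix form concrete, here is a minimal PyTorch sketch of the attention function above; the function name and the example shapes are my own choices for illustration:

```python
import torch

def scaled_dot_product_attention(Q, K, V):
    """Scaled dot-product attention.

    Q: (m, d_k) queries, K: (n, d_k) keys, V: (n, d_v) values.
    Returns the (m, d_v) output and the (m, n) attention scores.
    """
    d_k = Q.shape[-1]
    # Similarity between each query and each key, scaled by sqrt(d_k).
    scores = Q @ K.transpose(-2, -1) / d_k**0.5  # (m, n)
    # Softmax over the keys turns similarities into attention weights
    # that are between 0 and 1 and sum to 1 for each query.
    attn = torch.softmax(scores, dim=-1)         # (m, n)
    # Weighted sum of the values.
    return attn @ V, attn
```

A quick usage example with made-up dimensions:

```python
Q = torch.randn(2, 8)   # 2 queries, d_k = 8
K = torch.randn(5, 8)   # 5 keys
V = torch.randn(5, 16)  # 5 values, d_v = 16
out, attn = scaled_dot_product_attention(Q, K, V)
print(out.shape, attn.shape)  # torch.Size([2, 16]) torch.Size([2, 5])
```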
In neural networks, the attention mechanism is typically implemented as an attention layer. The input to the attention layer is a set of query, key, and value vectors, which are usually obtained by applying linear transformations to the input data: $Q = X_q W_Q$, $K = X_{kv} W_K$, and $V = X_{kv} W_V$, where $X_q$ is the input producing the queries and $X_{kv}$ is the input producing the keys and values.
The attention layer is a parameterised layer. The weights $W_Q$, $W_K$, and $W_V$ of the linear transformations are learned during training, allowing the model to adaptively learn how to compute attention based on the input data. These weights are the parameters of the attention layer.
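Putting the pieces together, a minimal single-head attention layer might look like the following PyTorch sketch; the class name and dimension arguments are illustrative, not a standard API:

```python
import torch
import torch.nn as nn

class AttentionLayer(nn.Module):
    """A single-head attention layer with learned linear projections."""

    def __init__(self, d_q_in, d_kv_in, d_k, d_v):
        super().__init__()
        # The learned parameters of the layer: W_Q, W_K, W_V.
        self.W_Q = nn.Linear(d_q_in, d_k, bias=False)
        self.W_K = nn.Linear(d_kv_in, d_k, bias=False)
        self.W_V = nn.Linear(d_kv_in, d_v, bias=False)

    def forward(self, X_q, X_kv):
        # Project the inputs into query, key, and value spaces.
        Q = self.W_Q(X_q)   # (m, d_k)
        K = self.W_K(X_kv)  # (n, d_k)
        V = self.W_V(X_kv)  # (n, d_v)
        # Scaled dot-product attention, as defined above.
        scores = Q @ K.transpose(-2, -1) / K.shape[-1] ** 0.5
        attn = torch.softmax(scores, dim=-1)
        return attn @ V
```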
The earliest use of attention in neural networks was in sequence-to-sequence models for machine translation (Bahdanau, Cho, and Bengio 2014). At the time, RNN-based encoder-decoder architectures were widely used for sequence-to-sequence tasks. Here the query vector is (a linear transformation of) the decoder hidden state at the current time step, and the key and value vectors are (linear transformations of) the encoder hidden states at all time steps. The attention output is a context vector that summarizes the relevant information from the input sequence based on the current decoder state's attention. The prediction is then made from this context vector, which is fed to the output layer.
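Using the illustrative AttentionLayer sketch above, the encoder-decoder attention setup could look like this (the dimensions are hypothetical):

```python
# Hypothetical dimensions: hidden size 32, 10 source time steps.
encoder_states = torch.randn(10, 32)  # encoder hidden states at all time steps
decoder_state = torch.randn(1, 32)    # decoder hidden state at the current step

layer = AttentionLayer(d_q_in=32, d_kv_in=32, d_k=16, d_v=32)
context = layer(decoder_state, encoder_states)  # (1, 32) context vector
```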
Later, the self-attention mechanism was proposed in the Transformer architecture (Vaswani et al. 2017), which further advanced the use of attention in deep learning. In self-attention, the query, key, and value vectors are all derived from the same input sequence $X$: $Q = X W_Q$, $K = X W_K$, and $V = X W_V$, so every element of the sequence attends to every other element.
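In code, self-attention is then just the special case where the same sequence is passed as both inputs; again using the illustrative AttentionLayer sketch from above:

```python
X = torch.randn(7, 32)  # a sequence of 7 elements with dimension 32
self_attn = AttentionLayer(d_q_in=32, d_kv_in=32, d_k=16, d_v=32)
out = self_attn(X, X)   # queries, keys, and values all come from X
print(out.shape)        # torch.Size([7, 32])
```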