A Mathematical Dive into Bahdanau Attention
Mathematical breakdown of Bahdanau Attention: how neural networks learn to align and translate with context.
Introduction
Before the transformer era revolutionized sequence modelling, the concept of attention was pioneered by Bahdanau et al., enabling neural networks to softly search for relevant parts of the input sequence when predicting the next word. This blog builds upon our previous two posts:
In this part, we explore the mathematical formulation behind Bahdanau’s attention mechanism. If you haven’t read the earlier parts, we recommend checking them out for essential background.
Quick Recap
In Neural Machine Translation (NMT), the RNN encoder-decoder architecture was once a dominant approach. In this architecture:
The encoder processes the source sentence and encodes it into a fixed-length context vector.
The decoder generates the target sentence word-by-word as a function of:
the current input (or the output from the previous time step),
the fixed context vector from the encoder,
and the current hidden state of the decoder.
This continues until the End-of-Sequence (<EOS>) token is generated.
Mathematically, the decoder step can be written as:
The same context vector c is used at every decoding step. This poses a challenge, particularly when translating long sentences, as the single fixed-length vector may not adequately capture all relevant information.
RNN Attention Decoder
To overcome this limitation, Bahdanau et al. introduced an attention mechanism, allowing the decoder to compute a unique, variable-length context vector for each decoding step. This dynamic context vector reflects the relevance of each source word in generating the current output word.
Put simply, the attention mechanism enables the decoder to attend to different parts of the input sentence at each time step. This dynamic approach is known as the RNN Attention Decoder.
Variable Context Vector
Let’s denote:
Then, the context vector for the i-th target word is computed as a weighted sum of all encoder hidden states:
Here, alpha is the attention weight representing how much the i-th output word attends to the j-th input word. These weights reflect the relevance of each source token to the current prediction and are learned dynamically during training.
Computing the Attention Weights
The attention weight alpha are derived by applying a softmax function over a set of alignment scores or energy values
Where, e-ij represents the alignment energy of j-th input token to compute i-th output word. This value is computed for all input words such that, the softmaxed value of alignment energy represents the percentage contribution of j-th input token for predicting i-th output word.
Computing the Alignment Energy
The alignment model is used to compute the alignment energies. The model can be implemented as a simple feedforward neural network with a single hidden layer. One common formulation is:
Where,
Wrapping Up the Attention Computation
To summarize the complete attention mechanism:
For each decoding step i, compute alignment energy between input words and previous hidden state of decoder.
Then, normalize these scores using softmax function to obtain attention weights. These values represents percentage of contribution of input words to predicting next output word.
Compute the context vector as a weighted sum of encoder hidden states, using the attention weights.
This context vector is then used, alongside the decoder’s hidden state, to generate the next word in the output sequence.
Limitations
While attention mechanisms significantly improved performance, RNN-based architectures still suffered from a critical limitation: sequential dependency. Each hidden state computation depends on the previous one, making the model inherently non-parallelizable during training. This results in slower training times and inefficiencies on modern hardware such as GPUs.
This limitation ultimately inspired the development of the Transformer architecture, which relies entirely on self-attention and removes recurrence altogether-enabling full parallelization and scaling to much larger datasets.
In our next post, we’ll introduce the Transformer architecture, the model that redefined sequence modeling with self-attention and parallelization. Stay connected.







