How Neural Networks Learned to Translate - Even Before Attention Was Born
Before Transformers, there was Attention—discover the breakthrough that changed machine translation forever.
Today’s AI models like ChatGPT or Google Translate handle entire paragraphs effortlessly—across languages, images, and speech. But this wasn't always the case. Before the rise of Transformers and multi-modal models, researchers struggled with how to make machines understand sequences like sentences.
This blog discusses one of the most important stepping stones toward today's models: Bahdanau et al.'s 2015 paper, "Neural Machine Translation by Jointly Learning to Align and Translate", which introduced attention into neural machine translation.
Early Days of Language Translation
Early neural machine translation systems were built on the encoder-decoder architecture. As the name suggests, an encoder RNN (Recurrent Neural Network) encodes the input sentence into some representation. Then, using this representation of the input sentence, the decoder generates the words of the translated sentence.
RNNs can handle input sequences of variable length. This characteristic is really important for NLP tasks because sentences do not always have a fixed number of words or characters.
Translating with the RNN Encoder-Decoder Architecture
A neural network only understands numbers, so each word of a sentence is first converted into a vector representation using methods like Word2Vec. This vector captures the meaning of that particular word.
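As a toy illustration, here is a minimal sketch of this lookup step. The vocabulary and embedding matrix below are made up and randomly initialized; a real system would load pretrained Word2Vec (or similar) vectors instead.

```python
import numpy as np

# Toy vocabulary and randomly initialized embedding matrix; a real system
# would load pretrained Word2Vec (or similar) vectors here instead.
vocab = {"<eos>": 0, "hello": 1, "world": 2}
embedding_dim = 4
embeddings = np.random.randn(len(vocab), embedding_dim)

def embed(sentence):
    """Map a list of words to their vector representations."""
    return [embeddings[vocab[word]] for word in sentence]

word_vectors = embed(["hello", "world", "<eos>"])
print(len(word_vectors), word_vectors[0].shape)  # 3 vectors, each of size 4
```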
An RNN can be thought of as an assembly line that receives these word vectors one at a time. The assembly line maintains its own vector, called the hidden state, which stores the context of the sentence read so far.
Whenever the RNN receives an input vector, it updates this hidden state so that the context reflects the new word. The hidden state at the end of the input sentence therefore represents the semantics of the whole sentence. This part of the architecture is known as the encoder, because it encodes a variable-length sentence into a fixed-size vector called the context.
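Here is a rough numpy sketch of such an encoder, assuming a vanilla RNN with randomly initialized weight matrices (the names W_xh and W_hh are illustrative). The actual paper uses a gated, bidirectional recurrent unit, but the loop below captures the core idea of repeatedly folding each word vector into a running hidden state.

```python
import numpy as np

embedding_dim, hidden_dim = 4, 8
rng = np.random.default_rng(0)

# Randomly initialized encoder parameters (names are illustrative).
W_xh = rng.standard_normal((hidden_dim, embedding_dim)) * 0.1  # input word -> hidden
W_hh = rng.standard_normal((hidden_dim, hidden_dim)) * 0.1     # previous hidden -> hidden

def encode(word_vectors):
    """Fold each word vector into a running hidden state; return the final state."""
    h = np.zeros(hidden_dim)              # hidden state starts empty
    for x in word_vectors:                # words arrive one at a time
        h = np.tanh(W_xh @ x + W_hh @ h)  # update the running context
    return h                              # fixed-size "context" vector

sentence = [rng.standard_normal(embedding_dim) for _ in range(3)]
context = encode(sentence)
print(context.shape)  # (8,) regardless of sentence length
```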
Then comes the decoder, which initializes its hidden state with the context of the input sentence. At each decoding step, the next output word is computed from the previously generated word and the current hidden state of the decoder RNN, and the hidden state is updated in turn so that further output words can be generated.
When decoding longer output sentences, these repeated updates to the hidden state cause the model to gradually forget older context. This problem was addressed by feeding the context of the input sentence back in at every decoding step, as sketched below.
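A minimal sketch of such a decoder, assuming hypothetical weight matrices and a greedy argmax over a toy output vocabulary, might look like this. Note how the same context vector is re-injected at every step.

```python
import numpy as np

embedding_dim, hidden_dim, vocab_size = 4, 8, 10
rng = np.random.default_rng(1)

# Hypothetical decoder parameters (names are illustrative, not from the paper).
U_xh = rng.standard_normal((hidden_dim, embedding_dim)) * 0.1  # previous word -> hidden
U_hh = rng.standard_normal((hidden_dim, hidden_dim)) * 0.1     # previous hidden -> hidden
U_ch = rng.standard_normal((hidden_dim, hidden_dim)) * 0.1     # context -> hidden
W_out = rng.standard_normal((vocab_size, hidden_dim)) * 0.1    # hidden -> word scores

def decode(context, target_embeddings, max_len=5, eos_id=0):
    h = context.copy()             # hidden state starts as the encoder context
    y = target_embeddings[eos_id]  # seeded with a special start token
    output_ids = []
    for _ in range(max_len):
        # The same input-sentence context is re-injected at every step, so the
        # decoder does not have to remember it in its hidden state alone.
        h = np.tanh(U_xh @ y + U_hh @ h + U_ch @ context)
        next_id = int(np.argmax(W_out @ h))  # greedily pick the next word
        output_ids.append(next_id)
        if next_id == eos_id:                # stop at the end-of-sentence token
            break
        y = target_embeddings[next_id]
    return output_ids

target_embeddings = rng.standard_normal((vocab_size, embedding_dim))
print(decode(rng.standard_normal(hidden_dim), target_embeddings))
```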
However, a single fixed-length context vector must encode the entire input, even though sentences vary enormously in length, from "Hello world" to long paragraphs of stories. In practice, this bottleneck makes the model scale poorly to longer sentences.
Soft-Search: Precursor to Attention
Bahdanau et al. tackle this issue by giving the model the ability to softly search the source text for the parts that are relevant for predicting each new word. Unlike the earlier RNN encoder-decoder architecture, where only the hidden state after the final word was kept, here the hidden state of every input word is stored. At each decoding step, instead of relying on a single fixed context, the model computes a unique context for the next word from the relevant hidden states of the input words. This is done using an alignment model.
This lets the model learn and adapt to long-term dependencies much better, since it can dynamically search for and choose which source words to align to.
Then, using this per-step context together with the current hidden state and the previously generated word, the model computes the next hidden state and the next output word.
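A rough sketch of building such a per-step context is below. For brevity, the relevance score here is a plain dot product between the decoder state and each encoder state; the paper's actual alignment model is a small feed-forward network, spelled out in the next section.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_context(decoder_hidden, encoder_states):
    """Build a fresh context vector for one decoding step.

    encoder_states: (source_len, hidden_dim), one stored hidden state per input word.
    """
    scores = encoder_states @ decoder_hidden  # relevance of each source word (dot product here)
    weights = softmax(scores)                 # fraction of attention given to each word
    return weights @ encoder_states           # weighted sum of the stored hidden states

rng = np.random.default_rng(2)
encoder_states = rng.standard_normal((6, 8))  # 6 source words, hidden size 8
decoder_hidden = rng.standard_normal(8)
print(attention_context(decoder_hidden, encoder_states).shape)  # (8,)
```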
Alignment Model in Depth
The alignment model takes the current hidden state of the decoder and the hidden state of each input word and computes an energy score: how strongly the next output word is associated with each input word. A softmax over these energies then gives the percentage contribution of each input word to the next output word. The new context is formed by taking the weighted sum of the input words' hidden states, with these contribution factors as the weights.
Thus, each output word is computed with its own unique context, which stores the information from the input words relevant to that output.
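Here is a minimal numpy sketch of that computation, with randomly initialized matrices W_a, U_a and vector v_a standing in for the learned alignment parameters.

```python
import numpy as np

hidden_dim, align_dim = 8, 6
rng = np.random.default_rng(3)

# Learned alignment parameters, randomly initialized here for illustration.
W_a = rng.standard_normal((align_dim, hidden_dim)) * 0.1  # projects the decoder state
U_a = rng.standard_normal((align_dim, hidden_dim)) * 0.1  # projects each encoder state
v_a = rng.standard_normal(align_dim) * 0.1                # scores the combination

def alignment_context(decoder_hidden, encoder_states):
    # Energy: how strongly the next output word is associated with each input word.
    energies = np.array([v_a @ np.tanh(W_a @ decoder_hidden + U_a @ h_j)
                         for h_j in encoder_states])
    # Softmax turns the energies into contribution percentages that sum to 1.
    weights = np.exp(energies - energies.max())
    weights /= weights.sum()
    # The new context is the weighted sum of the stored encoder hidden states.
    return weights @ encoder_states, weights

encoder_states = rng.standard_normal((5, hidden_dim))  # 5 source words
context, weights = alignment_context(rng.standard_normal(hidden_dim), encoder_states)
print(weights.round(2), context.shape)  # weights sum to 1; context is (8,)
```

Because the weights sum to one, each can be read directly as the percentage of attention the model pays to a source word while producing the current output word.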
This intuitive approach achieved state-of-the-art performance in language translation at the time, and it laid part of the foundation for the attention mechanisms at the heart of today's models.
How do you think a neural network knows when to stop generating words? (Hint: it involves a special end-of-sentence token!)
For more technical details, including the mathematics involved, from computing the alignment factors to decoding each word, stay tuned.