Neural Machine Translation by Jointly Learning to Align and Translate
Part II of our mini-series on attention.
Introduction
Our previous report was intended as an appetizer for the vast topic we were about to cover. We spoke about the journey of NLP from word2vec to attention, and tried to build a basic intuition for this ingenious concept. Now that we have the initial formalities out of the way, it's time to understand the first known conceptualization of attention, published as Neural Machine Translation by Jointly Learning to Align and Translate by Bahdanau et al. The original neural machine translation architecture was retained, but the adjustments made to it made a world of difference: instead of statically feeding the output of the encoder to the decoder, a dynamic relation was derived. Let us understand it in detail.
Paper | Repository
Mathematical Intuition
In neural machine translation we have an encoder and a decoder. The encoder extracts information from the entire source sentence, while the decoder decodes that information and produces the target sentence. The best way to derive information from a sequence of words is to work with recurrent neural architectures (RNN, LSTM, etc.), and both the encoder and the decoder rely on such architectures for their tasks.
While the information that the encoder extracts is quite useful, it does not capture some very important aspects of the source sentence. With a static, fixed representation we are constrained to use the same summary of the source for each decoded word. Yet we know that when translating from a source sentence to a target sentence, certain selective words in the source lead to a certain word in the target. With a fixed representation we were incapable of modeling this.
With attention now in the picture, the model was free to absorb or discard information from the source at will. This helped it build a better and more general understanding of the language.
Encoder
A recurrent neural network takes as input the present input $x_t$ and the previous hidden state $h_{t-1}$ to output the present hidden state $h_t = f(x_t, h_{t-1})$. This is how, at each step, the RNN takes the entire chain of previous inputs into consideration.
For the encoder the authors suggest a bidirectional RNN that traverses the source sentence not only in the forward direction but also in the backward direction. This way the RNN provides two sets of hidden states, the forward $\overrightarrow{h_j}$ and the backward $\overleftarrow{h_j}$. The authors suggest that concatenating the two states gives a richer and better representation $h_j = \left[\overrightarrow{h_j}; \overleftarrow{h_j}\right]$. The hidden state so formed is also called the annotation of the $j$-th source word.
📌 Note: Due to the tendency of RNNs to better represent recent inputs, the annotation $h_j$ will be focused on the words around $x_j$. This sequence of annotations is used by the decoder and the alignment model later to compute the context vector.
Decoder
Without attention: the encoder encodes the entire source sentence into its final hidden state, and that single vector is fed to the decoder at every step of translation. The decoder uses a recurrent neural net architecture whose inputs are the present input and the encoder information, with which it outputs the probable translated word. In other words, it models the conditional probability

$$p(y_i \mid y_1, \ldots, y_{i-1}, \mathbf{x}) = g(y_{i-1}, s_i, c)$$

where:
- $s_i$ is the hidden state of the decoder at step $i$
- $c$ is the context vector, which in turn is the encoder's final hidden state
We just have to maximize this conditional probability to get a good model for neural machine translation.
With attention: the only difference is that the context vector is dynamic. There is a different context vector $c_i$ responsible for each decoded word $y_i$.
This is where the game begins. Now that we understand the concept of attention, it will be even more interesting for us to dive into the math and figure out how to achieve the specified behavior.
We have the entire set of annotations $(h_1, \ldots, h_{T_x})$ from the encoder. We need a mechanism to evaluate the importance of each annotation for a decoded word. We can either formulate this behavior with hard-coded equations or let another model figure it out. Being of the lazy kind, we delegate the entire workload to another model, which turns out to be the attention layer.
The attention layer is a multilayer perceptron that takes a tuple as input. The tuple consists of an annotation $h_j$ and the previous decoder hidden state $s_{i-1}$. This provides an unnormalized importance score for each annotation. Intuitively, it answers the question "How important is $h_j$ for decoding $y_i$?". Knowing that neural networks do great with well-defined distributions, we apply a softmax on the unnormalized importances to obtain a well-defined, normalized distribution of importance.
The unnormalized importances are called the energies (the energy matrix).
The normalized importances are the attention weights.
Now that we have the attention weights in hand, we multiply them pointwise with the annotations, so that important annotations are retained while unimportant ones are suppressed. We then sum the weighted annotations to obtain a single feature-rich vector.
We build this context vector at each step of the decoder, attending to specific annotations every time.
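Putting the last few paragraphs into symbols, these are the alignment and attention equations from the paper: the energy $e_{ij}$ scores how important annotation $h_j$ is for decoding step $i$, the softmax turns the energies into attention weights $\alpha_{ij}$, and the weighted sum gives the context vector $c_i$.

$$e_{ij} = v_a^\top \tanh\left(W_a s_{i-1} + U_a h_j\right)$$

$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}$$

$$c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j$$

Here $W_a$, $U_a$, and $v_a$ are the weights of the small feed-forward alignment network, learned jointly with the rest of the model.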
The GIF below 👇 shows the entire attention mechanism at work with the encoder and decoder models.

Bahdanau's attention model.
Code
GitHub Repository
In this part, we shall go through the salient parts of the code. For our model, we start by understanding the encoder architecture. We choose bidirectional GRUs as our sequence units. Notice how we have to initialize two hidden states at the start of a step because of the bidirectional nature of our encoder. There are several ways to combine the bidirectional hidden states; we choose to concatenate them, producing a hidden-state output of double the length. Intuitively, one can see this in the initialize_hidden_state function, where we return two tf.zeros variables, one for each direction.
```python
import tensorflow as tf
from tensorflow.keras import layers as L


class Encoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, enc_units, batch_sz):
        super(Encoder, self).__init__()
        self.batch_sz = batch_sz
        self.enc_units = enc_units
        self.embedding = L.Embedding(vocab_size, embedding_dim)
        self.gru = L.GRU(self.enc_units,
                         return_sequences=True,
                         return_state=True,
                         recurrent_initializer='glorot_uniform')
        # a bidirectional wrapper that concatenates the forward and backward states
        self.bidirection = L.Bidirectional(self.gru, merge_mode='concat')

    def call(self, x, hidden_fd, hidden_bd):
        x = self.embedding(x)
        enc_hidden_states, fd_state, bd_state = self.bidirection(
            x, initial_state=[hidden_fd, hidden_bd])
        return enc_hidden_states, fd_state, bd_state

    def initialize_hidden_state(self):
        # one zero state each for the forward and backward GRUs
        return [tf.zeros((self.batch_sz, self.enc_units)) for i in range(2)]
```
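To get a feel for the tensors involved, here is a quick sanity check of the encoder. The hyperparameters below (vocabulary size, embedding size, units, batch size, sentence length) are made up purely for illustration; the actual values live in the repository.

```python
# Hypothetical hyperparameters, chosen only for this shape check
VOCAB_SIZE, EMBEDDING_DIM, ENC_UNITS, BATCH_SIZE, MAX_LEN = 5000, 256, 512, 64, 20

encoder = Encoder(VOCAB_SIZE, EMBEDDING_DIM, ENC_UNITS, BATCH_SIZE)
hidden_fd, hidden_bd = encoder.initialize_hidden_state()

# A dummy batch of tokenized source sentences
dummy_batch = tf.random.uniform((BATCH_SIZE, MAX_LEN), maxval=VOCAB_SIZE, dtype=tf.int32)

annotations, fd_state, bd_state = encoder(dummy_batch, hidden_fd, hidden_bd)
print(annotations.shape)  # (64, 20, 1024): forward and backward states concatenated
print(fd_state.shape)     # (64, 512)
print(bd_state.shape)     # (64, 512)
```

Notice that the annotations are twice as wide as the encoder units because of the concatenation.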
Now, we shall add the Bahdanau attention module. The attention module takes in two parameters, the decoder hidden state and the annotations, the latter being nothing but the output from the encoder. The decoder hidden state refers to the current hidden state of the decoder, i.e. the state against which the attention scores will be computed.
```python
class BahdanauAttention(tf.keras.layers.Layer):
    def __init__(self, units):
        super(BahdanauAttention, self).__init__()
        self.W1 = L.Dense(units)
        self.W2 = L.Dense(units)
        self.V = L.Dense(1)

    def call(self, dec_hidden_state, annotations):
        dec_hidden_state_time = tf.expand_dims(dec_hidden_state, 1)
        score = self.V(tf.nn.tanh(
            self.W1(dec_hidden_state_time) + self.W2(annotations)))

        # attention_weights shape == (batch_size, max_length, 1)
        attention_weights = tf.nn.softmax(score, axis=1)

        # context_vector shape after sum == (batch_size, hidden_size)
        context_vector = attention_weights * annotations
        context_vector = tf.reduce_sum(context_vector, axis=1)

        return context_vector, attention_weights
```
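Continuing the running example, we can pass the encoder outputs through the attention layer. The forward encoder state stands in for the decoder hidden state here, just as it does at the first decoding step of the training loop further down.

```python
attention_layer = BahdanauAttention(units=10)  # the number of units here is arbitrary

context_vector, attention_weights = attention_layer(fd_state, annotations)
print(context_vector.shape)     # (64, 1024): a weighted sum over the annotations
print(attention_weights.shape)  # (64, 20, 1): one weight per source position
```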
Finally, we shall look into the decoder part of our architecture. Just like in the encoder, we use GRUs as our sequence units. Following the code, the decoder first runs our custom Bahdanau attention layer to obtain the context vector, concatenates it with the embedded input, passes the result through the GRU, and finally through a fully connected layer to give us our output prediction.
```python
class Decoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, dec_units, batch_sz):
        super(Decoder, self).__init__()
        self.batch_sz = batch_sz
        self.dec_units = dec_units
        self.embedding = L.Embedding(vocab_size, embedding_dim)
        self.gru = L.GRU(self.dec_units,
                         return_sequences=True,
                         return_state=True,
                         recurrent_initializer='glorot_uniform')
        self.fc = L.Dense(vocab_size)

        # used for attention
        self.attention = BahdanauAttention(self.dec_units)

    def call(self, x, dec_hidden_state, annotations):
        context_vector, attention_weights = self.attention(dec_hidden_state,
                                                           annotations)

        x = self.embedding(x)
        x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)

        output, state = self.gru(x)
        output = tf.reshape(output, (-1, output.shape[2]))
        x = self.fc(output)

        return x, state, attention_weights
```
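A single decoding step then looks like the sketch below. Using the same toy vocabulary for both languages and setting the decoder units equal to the encoder units are assumptions made just to keep the example short.

```python
decoder = Decoder(VOCAB_SIZE, EMBEDDING_DIM, ENC_UNITS, BATCH_SIZE)

# A batch of single tokens (e.g. the <start> token) as the first decoder input
dec_input = tf.ones((BATCH_SIZE, 1), dtype=tf.int32)

predictions, dec_hidden, att_weights = decoder(dec_input, fd_state, annotations)
print(predictions.shape)  # (64, 5000): logits over the target vocabulary
print(dec_hidden.shape)   # (64, 512): the decoder state carried to the next step
```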
For easier training (read: blatant abuse of model.fit), we coalesce all our separate modules under one umbrella. We define a train_step and a test_step. The former wraps the forward pass in tf.GradientTape, which records the operations needed to backpropagate and update the trainable weights of our architecture. The test_step is used as an inference step to assess how well our model has learned.
```python
class NMT(tf.keras.Model):
    def __init__(self, encoder, decoder):
        super(NMT, self).__init__()
        self.encoder = encoder
        self.decoder = decoder

    def train_step(self, data):
        # Every sentence is different. We would not want the memory state
        # to flow from one sentence to another.
        enc_hidden_fd, enc_hidden_bd = self.encoder.initialize_hidden_state()
        inp, targ = data
        loss = 0

        with tf.GradientTape() as tape:
            annotations, enc_hidden_fd, _ = self.encoder(inp, enc_hidden_fd, enc_hidden_bd)
            dec_hidden = enc_hidden_fd
            dec_input = tf.expand_dims(
                [targ_lang.word_index['<start>']] * BATCH_SIZE, 1)

            # Teacher forcing - feeding the target as the next input
            for t in range(1, targ.shape[1]):
                # passing the annotations (encoder output) to the decoder
                predictions, dec_hidden, att_weights = self.decoder(
                    dec_input, dec_hidden, annotations)
                loss += self.compiled_loss(targ[:, t], predictions)
                # using teacher forcing
                dec_input = tf.expand_dims(targ[:, t], 1)

        batch_loss = (loss / int(targ.shape[1]))
        variables = self.encoder.trainable_variables + self.decoder.trainable_variables
        gradients = tape.gradient(loss, variables)
        self.optimizer.apply_gradients(zip(gradients, variables))

        return {"custom_loss": batch_loss}

    def test_step(self, data):
        enc_hidden_fd, enc_hidden_bd = self.encoder.initialize_hidden_state()
        inp, targ = data
        loss = 0

        annotations, enc_hidden_fd, _ = self.encoder(inp, enc_hidden_fd, enc_hidden_bd)
        dec_hidden = enc_hidden_fd
        dec_input = tf.expand_dims(
            [targ_lang.word_index['<start>']] * BATCH_SIZE, 1)

        # Teacher forcing - feeding the target as the next input
        for t in range(1, targ.shape[1]):
            predictions, dec_hidden, att_weights = self.decoder(
                dec_input, dec_hidden, annotations)
            loss += self.compiled_loss(targ[:, t], predictions)
            dec_input = tf.expand_dims(targ[:, t], 1)

        batch_loss = (loss / int(targ.shape[1]))
        return {"custom_loss": batch_loss}
```
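Here is a minimal sketch of how the pieces could be wired together and trained with model.fit. The dataset objects, the vocabulary sizes, and the plain sparse cross-entropy loss are stand-ins for the repository's actual preprocessing and loss, so treat this as a template rather than a drop-in script.

```python
# Assumed to be defined by the data pipeline: vocab_inp_size, vocab_tar_size,
# embedding_dim, units, BATCH_SIZE, train_dataset, val_dataset, targ_lang
encoder = Encoder(vocab_inp_size, embedding_dim, units, BATCH_SIZE)
decoder = Decoder(vocab_tar_size, embedding_dim, units, BATCH_SIZE)

nmt = NMT(encoder, decoder)
nmt.compile(
    optimizer=tf.keras.optimizers.Adam(),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)

# the custom train_step / test_step above take over from here
nmt.fit(train_dataset, validation_data=val_dataset, epochs=10)
```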
Visualizations
Loss plot
Custom attention weights
Hover over the words to see which word is attended to the most. This custom visualization is a great tool to understand how the model does.
The attention heatmaps
Please click on the images for a better view.
Conclusion
Upon closer inspection, we can see that a few words are predicted incorrectly for a given input, especially when the input sentences are fairly long. This leads to a simple conclusion: even though this architecture took NLP to new strides, there was still scope for improvement. Since the bare-bones idea was already settled by Bahdanau, it was clear that someone would eventually come up with a way to improve it with some more tinkering. That someone turned out to be Minh-Thang Luong, who, with a slight architectural change, managed to push neural machine translation even further. We talk about his contributions in detail in our next report, bidding adieu to Bahdanau's work, which started it all.