Neural Machine Translation by Jointly Learning to Align and Translate
Part II of our mini-series on attention.
Introduction
Our previous report was intended as an appetizer for the vast topic we were about to cover. We spoke about the journey of NLP from word2vec to attention, and tried to build a basic intuition for this ingenious concept. Now that we have the initial formalities out of the way, it's time to understand the first known conceptualization of attention, published as Neural Machine Translation by Jointly Learning to Align and Translate by Bahdanau et al. The original neural machine translation architecture was retained, but the adjustments made to it made a world of difference: instead of statically feeding the output of the encoder to the decoder, a dynamic relation was derived. Let us understand it in detail.
Paper | Repository
Mathematical Intuition
In neural machine translation we have an encoder and a decoder. The encoder extracts information from the entire source sentence, while the decoder decodes that information and produces the target sentence. The best way to derive information from a sequence of words is to work with recurrent neural architectures (RNN, LSTM, etc.), and both the encoder and the decoder rely on such architectures for their tasks.
While the information that the encoder extracts is quite useful, it does not capture some very important aspects of the source sentence. With a static, fixed representation we are constrained to use the same summary of the source for each decoded word. Yet we know that when translating from a source sentence to a target sentence, certain selective words in the source lead to a certain word in the target. With a fixed representation we were incapable of modeling this.
With attention now in the picture, the model was free to absorb or discard information from the source at will. This helped it build a better and more general understanding of the language.
Encoder
A recurrent neural network takes as input the present input $x_t$ and the previous hidden state $h_{t-1}$ to output the present hidden state $h_t = f(x_t, h_{t-1})$. This is how, at each step, the RNN takes the entire chain of previous inputs into consideration.
For the encoder the authors suggest a bidirectional RNN that traverses the source sentence not only in the forward direction but also in the backward direction. This way the RNN provides two sets of hidden states, the forward $\overrightarrow{h_j}$ and the backward $\overleftarrow{h_j}$. The authors suggest that concatenating the two states gives a richer and better representation $h_j = \left[\overrightarrow{h_j}; \overleftarrow{h_j}\right]$. The hidden state so formed is also called the annotation of the $j$-th source word.
📌 Note: Due to the tendency of RNNs to better represent recent inputs, the annotation $h_j$ will be focused on the words around $x_j$. This sequence of annotations is used by the decoder and the alignment model later to compute the context vector.
Decoder
Without attention: the encoder encodes the entire source sentence into its final hidden state, and that single vector is fed to the decoder at every step of translation. The decoder uses a recurrent neural net architecture whose inputs are the present input and the encoder information, with which it outputs the probable translated word. In other words, it models the conditional probability

$$p(y_i \mid y_1, \ldots, y_{i-1}, \mathbf{x}) = g(y_{i-1}, s_i, c)$$

where:
- $s_i$ is the hidden state of the decoder at step $i$
- $c$ is the context vector, which in turn is the encoder's final hidden state
We just have to maximize this conditional probability to get a good model for neural machine translation.
With attention: the only difference is that the context vector is dynamic. There is a different context vector $c_i$ responsible for each decoded word $y_i$.
This is where the game begins. Now that we understand the concept of attention, it will be even more interesting for us to dive into the math and figure out how to achieve the specified behavior.
We have the entire set of annotations $(h_1, \ldots, h_{T_x})$ from the encoder. We need a mechanism to evaluate the importance of each annotation for a decoded word. We can either formulate this behavior with hard-coded equations or let another model figure it out. Being of the lazy kind, we delegate the entire workload to another model, which turns out to be the attention layer.
The attention layer is a multilayer perceptron that takes a tuple as input. The tuple consists of an annotation $h_j$ and the previous decoder hidden state $s_{i-1}$. This provides an unnormalized importance score for each annotation. Intuitively, it answers the question "How important is $h_j$ for decoding $y_i$?". Knowing that neural networks do great with well-defined distributions, we apply a softmax on the unnormalized importances to obtain a well-defined, normalized distribution of importance.
The unnormalized importances are called the energies (the energy matrix).
The normalized importances are the attention weights.
Now that we have the attention weights in hand, we multiply them pointwise with the annotations, so that important annotations are retained while unimportant ones are suppressed. We then sum the weighted annotations to obtain a single feature-rich vector.
We build this context vector at each step of the decoder, attending to specific annotations every time.
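Putting the last few paragraphs into symbols, these are the alignment and attention equations from the paper: the energy $e_{ij}$ scores how important annotation $h_j$ is for decoding step $i$, the softmax turns the energies into attention weights $\alpha_{ij}$, and the weighted sum gives the context vector $c_i$.

$$e_{ij} = v_a^\top \tanh\left(W_a s_{i-1} + U_a h_j\right)$$

$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}$$

$$c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j$$

Here $W_a$, $U_a$, and $v_a$ are the weights of the small feed-forward alignment network, learned jointly with the rest of the model.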
The GIF below 👇 shows the entire attention mechanism at work with the encoder and decoder models.

Bahdanau's attention model.
Code
GitHub Repository
In this part, we shall go through the salient parts of the code. For our model, we start by understanding the encoder architecture. We choose bidirectional GRUs as our sequence units. Notice how we have to initialize two hidden states at the start of a step because of the bidirectional nature of our encoder. There are several ways to combine the bidirectional hidden states; we choose to concatenate them, producing a hidden-state output of double the length. Intuitively, one can see this in the initialize_hidden_state function, where we return two tf.zeros variables, one for each direction.
```python
import tensorflow as tf
from tensorflow.keras import layers as L


class Encoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, enc_units, batch_sz):
        super(Encoder, self).__init__()
        self.batch_sz = batch_sz
        self.enc_units = enc_units
        self.embedding = L.Embedding(vocab_size, embedding_dim)
        self.gru = L.GRU(self.enc_units,
                         return_sequences=True,
                         return_state=True,
                         recurrent_initializer='glorot_uniform')
        # a bidirectional wrapper that concatenates the forward and backward states
        self.bidirection = L.Bidirectional(self.gru, merge_mode='concat')

    def call(self, x, hidden_fd, hidden_bd):
        x = self.embedding(x)
        enc_hidden_states, fd_state, bd_state = self.bidirection(
            x, initial_state=[hidden_fd, hidden_bd])
        return enc_hidden_states, fd_state, bd_state

    def initialize_hidden_state(self):
        # one zero state each for the forward and backward GRUs
        return [tf.zeros((self.batch_sz, self.enc_units)) for i in range(2)]
```
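To get a feel for the tensors involved, here is a quick sanity check of the encoder. The hyperparameters below (vocabulary size, embedding size, units, batch size, sentence length) are made up purely for illustration; the actual values live in the repository.

```python
# Hypothetical hyperparameters, chosen only for this shape check
VOCAB_SIZE, EMBEDDING_DIM, ENC_UNITS, BATCH_SIZE, MAX_LEN = 5000, 256, 512, 64, 20

encoder = Encoder(VOCAB_SIZE, EMBEDDING_DIM, ENC_UNITS, BATCH_SIZE)
hidden_fd, hidden_bd = encoder.initialize_hidden_state()

# A dummy batch of tokenized source sentences
dummy_batch = tf.random.uniform((BATCH_SIZE, MAX_LEN), maxval=VOCAB_SIZE, dtype=tf.int32)

annotations, fd_state, bd_state = encoder(dummy_batch, hidden_fd, hidden_bd)
print(annotations.shape)  # (64, 20, 1024): forward and backward states concatenated
print(fd_state.shape)     # (64, 512)
print(bd_state.shape)     # (64, 512)
```

Notice that the annotations are twice as wide as the encoder units because of the concatenation.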
Now, we shall add the Bahdanau attention module. The attention module takes in two parameters, the decoder hidden state and the annotations, the latter being nothing but the output from the encoder. The decoder hidden state refers to the current hidden state of the decoder, i.e. the state against which the attention scores will be computed.
```python
class BahdanauAttention(tf.keras.layers.Layer):
    def __init__(self, units):
        super(BahdanauAttention, self).__init__()
        self.W1 = L.Dense(units)
        self.W2 = L.Dense(units)
        self.V = L.Dense(1)

    def call(self, dec_hidden_state, annotations):
        dec_hidden_state_time = tf.expand_dims(dec_hidden_state, 1)
        score = self.V(tf.nn.tanh(
            self.W1(dec_hidden_state_time) + self.W2(annotations)))

        # attention_weights shape == (batch_size, max_length, 1)
        attention_weights = tf.nn.softmax(score, axis=1)

        # context_vector shape after sum == (batch_size, hidden_size)
        context_vector = attention_weights * annotations
        context_vector = tf.reduce_sum(context_vector, axis=1)

        return context_vector, attention_weights
```
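Continuing the running example, we can pass the encoder outputs through the attention layer. The forward encoder state stands in for the decoder hidden state here, just as it does at the first decoding step of the training loop further down.

```python
attention_layer = BahdanauAttention(units=10)  # the number of units here is arbitrary

context_vector, attention_weights = attention_layer(fd_state, annotations)
print(context_vector.shape)     # (64, 1024): a weighted sum over the annotations
print(attention_weights.shape)  # (64, 20, 1): one weight per source position
```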
Finally, we shall look into the decoder part of our architecture. Just like in the encoder, we use GRUs as our sequence units. Following the code, the decoder first runs our custom Bahdanau attention layer to obtain the context vector, concatenates it with the embedded input, passes the result through the GRU, and finally through a fully connected layer to give us our output prediction.
```python
class Decoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, dec_units, batch_sz):
        super(Decoder, self).__init__()
        self.batch_sz = batch_sz
        self.dec_units = dec_units
        self.embedding = L.Embedding(vocab_size, embedding_dim)
        self.gru = L.GRU(self.dec_units,
                         return_sequences=True,
                         return_state=True,
                         recurrent_initializer='glorot_uniform')
        self.fc = L.Dense(vocab_size)

        # used for attention
        self.attention = BahdanauAttention(self.dec_units)

    def call(self, x, dec_hidden_state, annotations):
        context_vector, attention_weights = self.attention(dec_hidden_state,
                                                           annotations)

        x = self.embedding(x)
        x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)

        output, state = self.gru(x)
        output = tf.reshape(output, (-1, output.shape[2]))
        x = self.fc(output)

        return x, state, attention_weights
```
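A single decoding step then looks like the sketch below. Using the same toy vocabulary for both languages and setting the decoder units equal to the encoder units are assumptions made just to keep the example short.

```python
decoder = Decoder(VOCAB_SIZE, EMBEDDING_DIM, ENC_UNITS, BATCH_SIZE)

# A batch of single tokens (e.g. the <start> token) as the first decoder input
dec_input = tf.ones((BATCH_SIZE, 1), dtype=tf.int32)

predictions, dec_hidden, att_weights = decoder(dec_input, fd_state, annotations)
print(predictions.shape)  # (64, 5000): logits over the target vocabulary
print(dec_hidden.shape)   # (64, 512): the decoder state carried to the next step
```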
For easier training (read: blatant abuse of model.fit), we coalesce all our separate modules under one umbrella. We define a train_step and a test_step. The former wraps the forward pass in tf.GradientTape, which records the operations needed to backpropagate and update the trainable weights of our architecture. The test_step is used as an inference step to assess how well our model has learned.
```python
class NMT(tf.keras.Model):
    def __init__(self, encoder, decoder):
        super(NMT, self).__init__()
        self.encoder = encoder
        self.decoder = decoder

    def train_step(self, data):
        # Every sentence is different. We would not want the memory state
        # to flow from one sentence to another.
        enc_hidden_fd, enc_hidden_bd = self.encoder.initialize_hidden_state()
        inp, targ = data
        loss = 0

        with tf.GradientTape() as tape:
            annotations, enc_hidden_fd, _ = self.encoder(inp, enc_hidden_fd, enc_hidden_bd)
            dec_hidden = enc_hidden_fd
            dec_input = tf.expand_dims(
                [targ_lang.word_index['<start>']] * BATCH_SIZE, 1)

            # Teacher forcing - feeding the target as the next input
            for t in range(1, targ.shape[1]):
                # passing the annotations (encoder output) to the decoder
                predictions, dec_hidden, att_weights = self.decoder(
                    dec_input, dec_hidden, annotations)
                loss += self.compiled_loss(targ[:, t], predictions)
                # using teacher forcing
                dec_input = tf.expand_dims(targ[:, t], 1)

        batch_loss = (loss / int(targ.shape[1]))
        variables = self.encoder.trainable_variables + self.decoder.trainable_variables
        gradients = tape.gradient(loss, variables)
        self.optimizer.apply_gradients(zip(gradients, variables))

        return {"custom_loss": batch_loss}

    def test_step(self, data):
        enc_hidden_fd, enc_hidden_bd = self.encoder.initialize_hidden_state()
        inp, targ = data
        loss = 0

        annotations, enc_hidden_fd, _ = self.encoder(inp, enc_hidden_fd, enc_hidden_bd)
        dec_hidden = enc_hidden_fd
        dec_input = tf.expand_dims(
            [targ_lang.word_index['<start>']] * BATCH_SIZE, 1)

        # Teacher forcing - feeding the target as the next input
        for t in range(1, targ.shape[1]):
            predictions, dec_hidden, att_weights = self.decoder(
                dec_input, dec_hidden, annotations)
            loss += self.compiled_loss(targ[:, t], predictions)
            dec_input = tf.expand_dims(targ[:, t], 1)

        batch_loss = (loss / int(targ.shape[1]))
        return {"custom_loss": batch_loss}
```
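Here is a minimal sketch of how the pieces could be wired together and trained with model.fit. The dataset objects, the vocabulary sizes, and the plain sparse cross-entropy loss are stand-ins for the repository's actual preprocessing and loss, so treat this as a template rather than a drop-in script.

```python
# Assumed to be defined by the data pipeline: vocab_inp_size, vocab_tar_size,
# embedding_dim, units, BATCH_SIZE, train_dataset, val_dataset, targ_lang
encoder = Encoder(vocab_inp_size, embedding_dim, units, BATCH_SIZE)
decoder = Decoder(vocab_tar_size, embedding_dim, units, BATCH_SIZE)

nmt = NMT(encoder, decoder)
nmt.compile(
    optimizer=tf.keras.optimizers.Adam(),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)

# the custom train_step / test_step above take over from here
nmt.fit(train_dataset, validation_data=val_dataset, epochs=10)
```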
Visualizations
Loss plot
Custom attention weights
Hover over the words to see which word is attended to the most. This custom visualization is a great tool to understand how the model does.
The attention heatmaps
Please click on the images for a better view.
Conclusion
Upon closer inspection, we can see that a few words are predicted incorrectly for a given input, especially when the input sentences are fairly long. This leads to a simple conclusion: even though this architecture took NLP to new strides, there was still scope for improvement. Since the bare-bones idea was already settled by Bahdanau, it was clear that someone would eventually come up with a way to improve it with some more tinkering. That someone turned out to be Minh-Thang Luong, who, with a slight architectural change, managed to push neural machine translation even further. We talk about his contributions in detail in our next report, bidding adieu to Bahdanau's work, which started it all.