
Ablations on NMT with attention.

Part IV of our mini-series on attention.
Created on March 15|Last edited on May 20

Introduction

Our previous reports built up a basic understanding of attention. We first explained the internal clockwork of Bahdanau's mechanism, and then showed how Luong improved on Bahdanau's vision. In this report we discuss various ways to ablate these two methods. We aim to show what happens when we change certain parameters while training these architectures on the same data and in the same environment.

Bahdanau

Baseline

In the proposed baseline model, the encoder is a bi-directional RNN whose forward and backward annotations are concatenated (a minimal sketch of this baseline encoder follows the list). We will ablate the following setups:
  • Use a uni-directional RNN in the encoder.
  • Use a bi-directional RNN in the encoder where the forward and backward annotations are summed together.
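
For reference, here is a minimal sketch of such a baseline encoder. It assumes the same interface used in the ablation code below, with L aliasing tf.keras.layers; the class name BaselineEncoder is just for illustration.

import tensorflow as tf
from tensorflow.keras import layers as L  # assumed alias, as used throughout this report

class BaselineEncoder(tf.keras.Model):
    """Bi-directional GRU encoder that concatenates the forward and
    backward annotations (merge_mode='concat'), as in the baseline."""
    def __init__(self, vocab_size, embedding_dim, enc_units, batch_sz):
        super(BaselineEncoder, self).__init__()
        self.batch_sz = batch_sz
        self.enc_units = enc_units
        self.embedding = L.Embedding(vocab_size, embedding_dim)
        gru = L.GRU(enc_units,
                    return_sequences=True,
                    return_state=True,
                    recurrent_initializer='glorot_uniform')
        # merge_mode='concat' makes the annotations 2 * enc_units wide
        self.bidirection = L.Bidirectional(gru, merge_mode='concat')

    def call(self, x, hidden_fd, hidden_bd):
        x = self.embedding(x)
        output, fd_state, bd_state = self.bidirection(
            x, initial_state=[hidden_fd, hidden_bd])
        return output, fd_state, bd_state

    def initialize_hidden_state(self):
        return [tf.zeros((self.batch_sz, self.enc_units)) for _ in range(2)]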

Uni-directional

In this architecture we change only the encoder, replacing the bi-directional RNN with a uni-directional one. The code changes are minor, as can be seen below.
import tensorflow as tf
from tensorflow.keras import layers as L

class Encoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, enc_units, batch_sz):
        super(Encoder, self).__init__()
        self.batch_sz = batch_sz
        self.enc_units = enc_units
        self.embedding = L.Embedding(vocab_size, embedding_dim)
        # Using a uni-directional GRU (no Bidirectional wrapper)
        self.gru = L.GRU(self.enc_units,
                         return_sequences=True,
                         return_state=True,
                         recurrent_initializer='glorot_uniform')

    def call(self, x, hidden_fd):
        x = self.embedding(x)
        # Only a single (forward) initial state and final state now
        output, fd_state = self.gru(x, initial_state=[hidden_fd])
        return output, fd_state

    def initialize_hidden_state(self):
        return tf.zeros((self.batch_sz, self.enc_units))
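
As a quick sanity check, this is how the uni-directional encoder above might be driven; the vocabulary size, dimensions, and batch size below are placeholder values for illustration only.

# Illustrative hyperparameters; the real ones come from the training setup.
encoder = Encoder(vocab_size=5000, embedding_dim=256, enc_units=512, batch_sz=64)

example_batch = tf.zeros((64, 20), dtype=tf.int32)  # (batch_sz, max_length_inp)
hidden = encoder.initialize_hidden_state()           # a single state now, not two

enc_output, enc_state = encoder(example_batch, hidden)
# enc_output: (64, 20, 512) -- one annotation per source position
# enc_state:  (64, 512)     -- final forward state only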
The loss plot is shown below. We observe that with the uni-directional RNN in place the model seems to overfit quite quickly. This can happen because the uni-directional encoder does not provide as rich a representation of the source: the annotations are shallower, and the attention layer cannot exploit them as effectively.

[Loss plots: Run set (4 runs)]

[Translation and attention-weight panels: Bahdanau - Baseline (3 runs), Bahdanau - UniDirectional (1 run)]

While neither translation is up to the mark, we can see that the bi-directional model works better than the uni-directional one. We also see that the attention weights are better in the case of the bi-RNN. With the uni-RNN, the attention weights are spread out over all the source words, which reflects the model's uncertainty: it is not sure which parts to attend to and, as a result, spreads its attention weight across the entire source sentence. This helps explain the worse translation quality.

Bi-directional with Sum

Here we change the merge mode of the forward and backward annotations of the bi-RNN.
class Encoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, enc_units, batch_sz):
        super(Encoder, self).__init__()
        self.batch_sz = batch_sz
        self.enc_units = enc_units
        self.embedding = L.Embedding(vocab_size, embedding_dim)
        self.gru = L.GRU(self.enc_units,
                         return_sequences=True,
                         return_state=True,
                         recurrent_initializer='glorot_uniform')
        # Using SUM: forward and backward annotations are added element-wise,
        # so the annotations stay enc_units wide (concat would double them)
        self.bidirection = L.Bidirectional(self.gru, merge_mode='sum')

    def call(self, x, hidden_fd, hidden_bd):
        x = self.embedding(x)
        output, fd_state, bd_state = self.bidirection(
            x, initial_state=[hidden_fd, hidden_bd])
        return output, fd_state, bd_state

    def initialize_hidden_state(self):
        return [tf.zeros((self.batch_sz, self.enc_units)) for _ in range(2)]
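
One practical consequence of the merge mode is the width of the annotations the attention layer sees: 'sum' keeps them at enc_units, while the baseline's 'concat' doubles them. The small sketch below (with illustrative dimensions) makes this concrete.

import tensorflow as tf
from tensorflow.keras import layers as L

x = tf.zeros((64, 20, 256))  # (batch_sz, max_length_inp, embedding_dim) -- illustrative

sum_out, _, _ = L.Bidirectional(
    L.GRU(512, return_sequences=True, return_state=True),
    merge_mode='sum')(x)
print(sum_out.shape)     # (64, 20, 512)  -- same width as a uni-directional encoder

concat_out, _, _ = L.Bidirectional(
    L.GRU(512, return_sequences=True, return_state=True),
    merge_mode='concat')(x)
print(concat_out.shape)  # (64, 20, 1024) -- the baseline's concat doubles the width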
In this model, the only change is how the forward and backward annotations of the bi-RNN are merged. Looking at the loss plot, we see an immediate improvement over the uni-RNN, but a worse saturation point than with the concat merge mode.

[Loss plots: Run set (5 runs)]

[Translation and attention-weight panels: Bahdanau - Baseline (3 runs), Bahdanau - SUM (1 run)]

We see a similar trend here as well: the SUM merge mode spreads its attention across all the source words, showing a lack of confidence in particular words for the decoder.

Luong

Baseline

In the paper Effective Approaches to Attention-based Neural Machine Translation by Luong et al., the authors suggest two kinds of attention.
  1. The global attention: This is the baseline model proposed by the authors. The architecture was studied in detail in the previous report, where we also noticed a better loss plot than for Bahdanau's baseline architecture.
  2. The local attention: The local attention model is ingenious in its own way. Before diving deep into the math of the local model, let me ask you a question.
If I constrict your vision, would you attend to the most important parts of the scene better?
This is the question that the authors try to answer with the local attention mechanism. They constrict the attention to a short window of the source sentence. The model decides the window position, but does not have the liberty to look at the entire source sentence while building the attention weights. It turns out that local attention works wonders.
In this method, we pick an index $p_t$ in the source sentence and a window size $D$. At each decoding step, the model is only allowed to attend to the following window of the source sentence:
$(p_t - D) \leftarrow p_t \rightarrow (p_t + D)$

Here $D$ is a hyperparameter that decides how wide the window is.
Based on the choice of $p_t$, two types of local attention are proposed:
  1. Local-monotonic: here we assume that the decoded word is aligned with the source word at the same index, so we simply set $p_t = t$.
  2. Local-predictive: here we predict the centre index $p_t$ at each decoding step with the help of a small multi-layer perceptron (the exact formula is given below). A simple idea, but one that makes all the difference.
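
For reference, the predictive centre in Luong et al. is computed from the current decoder state $h_t$ as

$p_t = S \cdot \operatorname{sigmoid}\left(v_p^{\top} \tanh(W_p h_t)\right)$

where $S$ is the source sentence length and $W_p$, $v_p$ are learned parameters. This is exactly the small network that W5 and V2 implement in the local-predictive code later in this report.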
The modified equation for our attention weights becomes

$a_t(s) = \operatorname{align}(h_t, \tilde{s}_t)\exp\left(-\frac{(s-p_t)^2}{2\sigma^2}\right)$

for local-predictive Luong; for local-monotonic, $s$ is taken to be 0 and $p_t = t$, as reflected in the code below. To understand this equation, let's visualize it first: the attention weights are computed as before, but the Gaussian term forces them to concentrate on the designated window of source indices.
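
To make the effect of the Gaussian term concrete, here is a tiny sketch with made-up values ($p_t = 5$, $D = 2$, and $\sigma = D/2$ as suggested in the paper) showing how the multiplier decays for source positions away from the window centre.

import numpy as np

p_t, D = 5, 2
sigma = D / 2  # the paper empirically sets sigma = D / 2

for s in range(3, 9):
    gauss = np.exp(-((s - p_t) ** 2) / (2 * sigma ** 2))
    print(f"s = {s}: attention multiplier = {gauss:.3f}")

# s = 5 (the centre) keeps its full weight (1.000); s = 7 is already
# down to exp(-2) ~ 0.135, so positions outside the window are all but ignored.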

Local-monotonic

The point to remember with local-monotonic attention is that the window stays fixed relative to the decoding step: we always take $p_t = t$, and the model does not learn to place the window by itself.
# sigma is assumed to be defined globally (e.g. sigma = D / 2, following the paper).
class LuongAttention(tf.keras.layers.Layer):
    def __init__(self, units):
        super(LuongAttention, self).__init__()
        self.W1 = L.Dense(units)
        self.W2 = L.Dense(units)
        self.W3 = L.Dense(units)
        self.W4 = L.Dense(units)
        self.W5 = L.Dense(units)  # unused here; needed for the predictive variant
        self.V = L.Dense(1)
        self.V2 = L.Dense(1)      # unused here; needed for the predictive variant

    def call(self, dec_hidden_state, annotations, decode_time):
        dec_hidden_state_time = tf.expand_dims(dec_hidden_state, 1)

        # Alignment scores between the decoder state and every annotation
        score = self.V(tf.nn.tanh(
            self.W1(dec_hidden_state_time) + self.W2(annotations)))

        # Monotonic: p_t == t, so the Gaussian term depends only on the decoding step
        decode_time = tf.cast(decode_time, tf.float32)  # p_t == t
        gaus = tf.math.exp(
            tf.math.negative(
                tf.math.divide(
                    tf.math.square(decode_time),
                    2 * tf.math.square(sigma)
                )
            )
        )
        attention_weights = tf.nn.softmax(score, axis=1) * gaus

        # Context vector: weighted sum of the annotations
        context_vector = attention_weights * annotations
        context_vector = tf.reduce_sum(context_vector, axis=1)

        # Attentional hidden state
        mod_hidden = tf.nn.tanh(
            self.W3(context_vector) + self.W4(dec_hidden_state)
        )

        return mod_hidden, attention_weights
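
A hedged usage sketch of the layer above: it expects the current decoder state, the encoder annotations, and the decoding time step, and it assumes sigma is available as a global. All values and shapes below are illustrative only.

sigma = 1.0  # e.g. D / 2 with D = 2; must be defined before the layer is called

attention = LuongAttention(units=512)

dec_hidden_state = tf.zeros((64, 512))   # (batch_sz, dec_units)
annotations = tf.zeros((64, 20, 512))    # (batch_sz, max_length_inp, enc_units)
decode_time = 3                          # current decoding step t

mod_hidden, attention_weights = attention(dec_hidden_state, annotations, decode_time)
# mod_hidden:        (64, 512)   -- attentional hidden state fed to the output layer
# attention_weights: (64, 20, 1) -- one (Gaussian-scaled) weight per source position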

[Loss plots: Run set (3 runs)]

Notice how the local-monotonic loss has a much steeper decline than the baseline Luong loss. However, one can argue that the baseline loss saturates at a better value.

[Attention-weight panels: Luong - Baseline (3 runs), Luong - Monotonic (1 run)]

The effect on the attention weights can be seen in these custom charts. The weights produced by Luong-Monotonic are much less scattered and are concentrated on particular parts of the source.

Local-predictive

To understand this architecture, let's visualize it first. The position of the sliding window is now learned: the model predicts the centre $p_t$ at every decoding step instead of tying it to $t$. This way, we reduce the search span even further.
# sigma and max_length_inp (the maximum source length) are assumed to be defined globally.
class LuongAttention(tf.keras.layers.Layer):
    def __init__(self, units):
        super(LuongAttention, self).__init__()
        self.W1 = L.Dense(units)
        self.W2 = L.Dense(units)
        self.W3 = L.Dense(units)
        self.W4 = L.Dense(units)
        self.W5 = L.Dense(units)  # predicts the window centre together with V2
        self.V = L.Dense(1)
        self.V2 = L.Dense(1)

    def call(self, dec_hidden_state, annotations, decode_time):
        dec_hidden_state_time = tf.expand_dims(dec_hidden_state, 1)

        # Alignment scores between the decoder state and every annotation
        score = self.V(tf.nn.tanh(
            self.W1(dec_hidden_state_time) + self.W2(annotations)))

        # Predictive: p_t = S * sigmoid(V2(tanh(W5 h_t))), with S = max_length_inp
        p_t = max_length_inp * tf.nn.sigmoid(
            self.V2(
                tf.nn.tanh(
                    self.W5(dec_hidden_state_time)
                )
            )
        )
        decode_time = tf.cast(decode_time, tf.float32)
        gaus = tf.math.exp(tf.math.negative(
            tf.math.divide(
                tf.math.square(decode_time - p_t),
                2 * tf.math.square(sigma)
            )
        ))
        attention_weights = tf.nn.softmax(score, axis=1) * gaus

        # context_vector shape after sum == (batch_size, hidden_size)
        context_vector = attention_weights * annotations
        context_vector = tf.reduce_sum(context_vector, axis=1)

        # Attentional hidden state
        mod_hidden = tf.nn.tanh(
            self.W3(context_vector) + self.W4(dec_hidden_state)
        )

        return mod_hidden, attention_weights
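
Note that the line p_t = max_length_inp * sigmoid(...) is the code-level counterpart of the $p_t = S \cdot \operatorname{sigmoid}(v_p^{\top}\tanh(W_p h_t))$ equation given earlier: squashing through a sigmoid and scaling by the source length guarantees that the predicted centre always falls inside the source sentence.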

[Loss plots: Run set (4 runs)]

Once again, the local-predictive loss shows a steeper decline than the Luong baseline; it also saturates faster, albeit at a value similar to the baseline's.

[Attention-weight panels: Luong - Baseline (2 runs), Luong - Predictive (1 run)]

Once again, we see that Luong's predictive architecture concentrates its attention weights on a few select input words, compared with the Luong baseline model. Thus, we can conclude that reducing the search span helps the architecture compute attention weights far less chaotically.

Conclusion

Our initial study of attention concludes with this report. We started our journey as novices, learning concepts like Word2vec and GloVe. Once we reached the almighty concept of attention, we had no idea how deep the rabbit hole would go. Seeing it through turned out to be one of the better decisions we made, since we got to learn and understand some beautiful ideas. From Bahdanau to Luong, the concept of attention has evolved a lot. At their core, these ideas are as clear and simple as water, but envisioning them in the first place deserves the highest commendation. Thank you for being a part of this little journey; we hope you have picked up some valuable insights to boost your understanding.