An Introduction to Attention

Part I in a series on attention. In this installment, we look at its origins and predecessors, and provide a brief example of what's to come.
Aritra Roy Gosthipaty
When there are so many things to do, where do you put your attention?

Preface

When we wrote our first report on the topic of Natural Language Processing, it was supposed to be a one-time affair. We were on what I like to call a "Computer Vision high," learning to apply cool concepts to deconstructed images and getting visible results. In no world would any NLP concept deter us from our interest in CV.
We were wrong.
As we went through Efficient Estimation of Word Representations in Vector Space by Mikolov et al., we realised how beautiful the craft is: the craft of teaching a man-made construct, with all its abstract intricacies, to a computer. Word2Vec gave rise to many subsequent concepts that helped further the original objective.
The absolute breakthrough, in both of our opinions, occurred when the concept of attention was made known to the world. In this report, we are going to start with some basic concepts before covering more complex ones as the series continues. Let's dig in!

Introduction

To understand the attention mechanism, we need to understand the problems that led to its invention. At the time, Neural Machine Translation was done end to end with an encoder-decoder system. The encoder compressed the source sentence into a fixed-size latent representation, which the decoder decoded into the target sentence. That was a good approach, but it yielded mediocre results. The problem was the fixed-size latent representation, and swapping it for something dynamic is, in large part, why attention was pursued.
Earlier attempts at translating one language into another involved lots of tedious and cumbersome computations, and ultimately those methods were infeasible. We were dealing with the mammoth task of making a computer not only understand two languages, but also correlate the two.
One of the earlier, primitive techniques was to translate one word at a time. This failed miserably at capturing the essence of language as a beautifully constructed entity.
Let's understand this with an example:
"I will go home now" is an English sentence, whose French translation will be "je vais rentrer à la maison maintenant."
However, if we translate it word by word, the resultant French sentence will be "je volonté aller domicile à présent."
The point being made here is that the sequence of words in one language has an impact on the formation of the sentence in another language. Things like tenses, parts of speech, and several other grammatical concepts come into the fray. Making a computer understand all of this is both very difficult and a main objective of this work.
Neural Machine Translation introduced the idea of retaining the information of complete sentences (and not just singular words), albeit in an encoded and compressed form. This encoded information is fed to a decoder, which uses it to predict the output sentence. This was groundbreaking at the time of its conception, since it could correctly translate small sentences from one language to another. It came to be known as the RNN Encoder-Decoder architecture.
But a key factor was overlooked: often, some words from the input-language sentence hold more weight in the formation of the output-language sentence than others. When the compressed information from the encoder is fed to the decoder, we are not explicitly linking the generation of the output sequence of words to any particular input word.
This is where the world of NLP was forever changed, with the introduction of Attention. A simple yet ingenious (yes, this word is becoming a theme in our reports) concept which, in a manner of speaking, assigns weights to specific words in the input sentence. We are essentially allowing the computer to assess on its own the importance of the input words for the formation of the output words.

Neural Machine Translation (NMT)

Let's first recap how Sequence to Sequence Learning with Neural Networks works.
In this section we do not go too deep into the workings. One can read the in-depth review of Sequence to Sequence Learning with Neural Networks and come back to this section.
In neural machine translation, we fit a parameterized model to maximize the conditional probability of target sentences given source sentences over a corpus of sentence pairs. Once we have a trained model, we can generate a translation by searching for the sentence that maximizes this conditional probability.
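Written out, with \theta denoting the model parameters and D a parallel corpus of sentence pairs (symbols we introduce here only for concreteness), training and inference amount to
\theta^{*}=\arg\max_{\theta}\sum_{(x,y)\in D}\log p_{\theta}(y|x)\\ \hat{y}=\arg\max_{y}p_{\theta}(y|x)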
The use of neural networks to directly learn this conditional distribution has given promising results. The most appreciated architecture for the task typically consisted of two components: an encoder and a decoder. The encoder reads the source sentence x and outputs a fixed-length vector, which the decoder decodes into the target sentence y. You can check out a previous report of Aritra's where he delves deep into this architecture for neural machine translation.
Despite being a promising approach to neural machine translation, it did have its flaws. The encoder's output, a fixed-length vector, could not hold the meaning well when the sentences grew in size. This led researchers to look for a better approach to translation.

RNN Encoder-Decoder

Before talking about attention, it is worth our time to recap the RNN-based encoder-decoder framework. An encoder reads the input sentence x=(x_1,...,x_{T_{x}}) into a vector c. Here an RNN is used such that
h_{t}=f(x_t,h_{t-1})\\ c=q(\{h_1,...,h_{T_{x}}\})
where h_t\in\Re^n is the hidden state at time t, and c is a fixed-length vector generated from the sequence of hidden states. f and q are nonlinear functions; in previous papers, authors have used LSTMs and RNNs as f and simply taken the last hidden state, q(\{h_1,...,h_{T_{x}}\})=h_{T_{x}}.
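As a minimal sketch of the recurrence above (NumPy only, with toy randomly initialised weights; names such as W_xh and hidden_states are our own placeholders, not from any paper), the encoder could be written as:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes: embedding dimension d, hidden dimension n, sentence length T_x.
d, n, T_x = 8, 16, 5

# Randomly initialised weights of a vanilla RNN cell (untrained placeholders).
W_xh = rng.normal(scale=0.1, size=(n, d))
W_hh = rng.normal(scale=0.1, size=(n, n))
b_h = np.zeros(n)

def f(x_t, h_prev):
    """One encoder step: h_t = f(x_t, h_{t-1})."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

# A toy source sentence: T_x word embeddings x_1, ..., x_{T_x}.
x = rng.normal(size=(T_x, d))

h = np.zeros(n)
hidden_states = []
for x_t in x:
    h = f(x_t, h)
    hidden_states.append(h)

# q simply picks the last hidden state: c = q({h_1, ..., h_{T_x}}) = h_{T_x}.
c = hidden_states[-1]
print(c.shape)  # (16,)
```

Everything the decoder will ever see about the source sentence is squeezed into this single vector c, which is exactly the bottleneck discussed above.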
The decoder is then trained to predict the target sentence y by decomposing its joint probability into ordered conditionals:
p(y)=\prod^{T}_{t=1} p(y_{t}|\{y_1,...,y_{t-1}\},x)
With an RNN, each of the conditional probabilities is modeled as
p(y_t|\{y_1,...,y_{t-1}\},x)=g(y_{t-1},s_t,c)
where g is a potentially multi-layered, nonlinear function that outputs the probability of y_{t}, and s_{t} is the hidden state of the decoder RNN. Concretely,
s_{t}=f_{dec}(y_{t-1},s_{t-1})\\ g(y_{t-1},s_t,c)=\sigma(W_{y}y_{t-1}+W_ss_t+W_cc)
where \sigma stands for the softmax function.
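Continuing the same toy sketch (NumPy with randomly initialised weights; f_dec, W_ys, and the other names simply mirror the symbols above and are ours), a single decoder step could look like:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy sizes: hidden dimension n, target vocabulary size V, target embedding size d_y.
n, V, d_y = 16, 10, 8
c = rng.normal(size=n)          # stands in for the encoder's fixed summary vector

# Decoder RNN weights and output-layer weights (untrained placeholders).
W_ys = rng.normal(scale=0.1, size=(n, d_y))
W_ss = rng.normal(scale=0.1, size=(n, n))
W_y = rng.normal(scale=0.1, size=(V, d_y))
W_s = rng.normal(scale=0.1, size=(V, n))
W_c = rng.normal(scale=0.1, size=(V, n))

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def f_dec(y_prev, s_prev):
    """Decoder recurrence: s_t = f_dec(y_{t-1}, s_{t-1})."""
    return np.tanh(W_ys @ y_prev + W_ss @ s_prev)

def g(y_prev, s_t, c):
    """g(y_{t-1}, s_t, c) = softmax(W_y y_{t-1} + W_s s_t + W_c c)."""
    return softmax(W_y @ y_prev + W_s @ s_t + W_c @ c)

# One decoding step with toy inputs.
y_prev = rng.normal(size=d_y)   # embedding of the previously emitted word
s_prev = np.zeros(n)
s_t = f_dec(y_prev, s_prev)
p_t = g(y_prev, s_t, c)
print(p_t.sum())                # ~1.0, a valid distribution over the vocabulary
```

Note that c is the same fixed vector at every decoding step; the attention mechanism below replaces it with a step-dependent c_{t}.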

Attention

NMT:
p(y_{t}|\{y_{1},...,y_{t-1}\},x)=g(y_{t-1},s_{t},c)
NMT with attention:
p(y_{t}|\{y_{1},...,y_{t-1}\},x)=g(y_{t-1},s_{t},\boxed{c_{t}})
We have boxed the symbol that makes all the difference.
Attention is essentially the addition of a context vector that changes dynamically with the word that is being generated. Before going into the mathematical complications of this new entity, we want you to understand it intuitively. Once the encoder's output is fed to the decoder, the decoder starts making predictions of the output sentence one word at a time. For each output prediction, the model determines which input words hold more importance for predicting that single output word.
Attention in the Encoder and Decoder setup
The diagram above is a step-by-step GIF that shows how attention works.
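To make the boxed c_{t} concrete, here is a minimal NumPy sketch of one way such a context vector can be computed, in the spirit of additive (Bahdanau-style) attention; all weights are toy, randomly initialised placeholders rather than anything trained:

```python
import numpy as np

rng = np.random.default_rng(2)

n, T_x = 16, 5                      # hidden dimension, source sentence length

# Encoder hidden states h_1, ..., h_{T_x} and the previous decoder state s_{t-1}.
H = rng.normal(size=(T_x, n))       # one row per source word
s_prev = rng.normal(size=n)

# A small alignment model: a(s_{t-1}, h_j) = v_a^T tanh(W_a s_{t-1} + U_a h_j).
W_a = rng.normal(scale=0.1, size=(n, n))
U_a = rng.normal(scale=0.1, size=(n, n))
v_a = rng.normal(scale=0.1, size=n)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# 1. Score every source position against the current decoder state.
scores = np.array([v_a @ np.tanh(W_a @ s_prev + U_a @ h_j) for h_j in H])

# 2. Normalise the scores into attention weights alpha_{tj}.
alpha = softmax(scores)

# 3. The context vector c_t is the attention-weighted sum of the encoder states.
c_t = alpha @ H                     # shape (n,)

print(alpha.round(3), c_t.shape)
```

Because the weights alpha are recomputed at every decoder step, c_{t} changes dynamically with the word being generated, which is precisely the difference boxed in the equation above.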

Visualizations

The most important part of our series is the visualizations. We understand that visualizations are a great tool for building intuition, hence the series will be filled with images, GIFs, and custom visualizations for the reader to enjoy.
Some examples of the visualizations to come in later parts of the series are as follows.

Conclusion

The point of this report was to give our readers a first taste of this elegant concept. Attention not only helped further the world of NLP by leaps and bounds, but also found its way into other domains like CV.
As we read through the different ways of creating and adding Attention to our NLP architectures, the concept slowly revealed its magic and we reveled in its ingenuity. We plan to address these in our next reports, where we talk about how Attention evolved from Bahdanau's vision to Luong's simpler but decisive contribution. There is huge scope for ablative studies when it comes to Attention, and we have tried to talk about as many of them as we can. I hope you enjoy this little journey we have planned for you!
The authors:
Name Twitter GitHub
Devjyoti Chakrobarty @Cr0wley_zz @cr0wley-zz
Aritra Roy Gosthipaty @ariG23498 @ariG23498