Sequence to Sequence Learning with Neural Networks

Seq2Seq with tf.keras. Made by Aritra Roy Gosthipaty using Weights & Biases
Aritra Roy Gosthipaty


In the age of attention and transformers, I thought a simple report on sequence to sequence modelling would be a good starting point for a lot of people. In this article I will try to demystify the paper Sequence to Sequence Learning with Neural Networks by Ilya Sutskever et al. In this paper, the authors present an end-to-end learning system that translates one language into another. The idea of a latent space for language is mesmerising in itself, and we will dive deep into the intuition behind it. We will also look into the objective function that drives the proposed end-to-end solution.



The task is to emulate a translator: given a sentence in one language, we need to translate it into another.

Prior to this paper, this task was viewed as a mapping task, where systems had to map contextually similar phrases between different languages. Framed this way, the problem was too hard for a learning system: the model not only needs to understand the contextual meaning of the phrases, but also to map the two languages to each other.

In this paper, the authors suggest an autoencoder-like system. In an autoencoder setup there are two distinct models, an encoder and a decoder. The encoder compresses a data distribution into a latent space, and the decoder tries to reconstruct the input from that encoding. In this way the model is forced to build a richer and richer latent representation.

One might ask how an autoencoder setup fits our needs here. We cannot just encode English and output English sentences with the motive of building a great latent space. This is where their approach makes a lot of sense: how about encoding English and decoding another language? The model then not only learns the underlying patterns of a language, but also maps the two languages to each other at the same time. If you find this intriguing, the same concept has been proven to work wonders with images and text as the input and output: to build an image captioning system, you encode an image and decode it into raw text.

Side note: For an interested reader, here is my article on a vanilla image caption generator.


For the task, we need some data; any decent set of translations would do. After searching Kaggle with all my might I did not find what I wanted, but then I came across this treasure trove of Tab-delimited Bilingual Sentence Pairs. For this task, we will move ahead with the English-French pair.

There was a relatively small amount of processing required but I went ahead and hosted a clean txt file as a Kaggle dataset for the reader to try modeling on.

Check out the Kaggle Dataset

The only processing required was a regex clean-up so that only the tab-separated translations remain.

import re

# Here we get the fra-eng dataset
! wget

# Unzip the archive into the fra-eng directory
! unzip fra-eng.zip -d fra-eng

# Use regex to strip the attribution column, keeping only the translations
with open('./fra-eng/fra.txt', 'r') as f:
    string = f.read()
    pattern = re.compile(r'\s*CC(?:.*?)\n')
    m1_string = re.sub(pattern, r'\n', string)

with open('eng_fra.txt', 'w') as f:
    f.write(m1_string)
The hosted dataset is nothing but this eng_fra.txt file. The same recipe also lets readers try their hands on any other language pair that the website provides.

After we have all of our data, it is time to convert it into a form that the deep learning model can be trained on.

Here we use tf.keras.preprocessing.text.Tokenizer to help us with the pre-processing steps. This is a great API that takes away a lot of the boilerplate of cleaning text data.

from tensorflow.keras.preprocessing.text import Tokenizer

top_k = 10000

# oov_token and filters here are typical choices; the filters drop
# < and > from the defaults so the <start>/<end> markers survive
eng_tokenizer = Tokenizer(num_words=top_k,
                          oov_token='<unk>',
                          filters='!"#$%&()*+,-./:;=?@[\\]^_`{|}~\t\n')

After we have the tokenizer, we need to apply it to the text data so that we have token sequences we can train on.
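The application step can be sketched as follows; the sample sentences here are hypothetical stand-ins for the real corpus, and the filters argument simply keeps the `<start>`/`<end>` markers from being stripped:

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

top_k = 10000
# Drop < and > from the default filters so the markers survive tokenization
eng_tokenizer = Tokenizer(num_words=top_k, oov_token='<unk>',
                          filters='!"#$%&()*+,-./:;=?@[\\]^_`{|}~\t\n')

# Hypothetical sample sentences wrapped in start/end markers
eng_sentences = ['<start> hello how are you <end>',
                 '<start> i love deep learning <end>']

# fit_on_texts builds the vocabulary from the corpus
eng_tokenizer.fit_on_texts(eng_sentences)
# texts_to_sequences maps each sentence to a list of integer tokens
eng_seqs = eng_tokenizer.texts_to_sequences(eng_sentences)
# Pad to a common length so the sentences can be batched together
eng_seqs = pad_sequences(eng_seqs, padding='post')
```

The same two calls, `fit_on_texts` and `texts_to_sequences`, are applied to the French sentences with a second tokenizer.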



Check out the Kaggle Kernel

For the training step, I would like to first define the approach and then walk you through the code.


As we are dealing with text data, we are better off using a recurrent architecture to model it. For both the encoder and the decoder we will build individual GRU models.


Before getting into the encoder, a quick recap of recurrent architectures is in order. A recurrent cell takes two things as input: the present input and the past state. In return, the cell gives us a new state, which is used in the next time step. This mechanism loosely means that a recurrent architecture models the present input as well as all of the past inputs up to now. A recurrent cell is shown in the figure below.


$ h_t = \tanh\begin{pmatrix} x_t \\ h_{t-1} \end{pmatrix} $
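To make the recurrence concrete, here is a minimal NumPy sketch; the weight matrices W_x and W_h are illustrative stand-ins for however the cell parametrises the mix of input and past state:

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim, timesteps = 4, 8, 5

# Hypothetical weights: one matrix for the input, one for the past state
W_x = rng.normal(size=(input_dim, hidden_dim)) * 0.1
W_h = rng.normal(size=(hidden_dim, hidden_dim)) * 0.1

x = rng.normal(size=(timesteps, input_dim))  # a toy input sequence
h = np.zeros(hidden_dim)                     # initial state: all zeros

# Each step mixes the present input x_t with the past state h_{t-1}
for t in range(timesteps):
    h = np.tanh(x[t] @ W_x + h @ W_h)

# After the loop, h is the final hidden state
```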

The encoder here will be given English words and sentences. At the end of each sentence, the encoder essentially gives us a state $h_{end}$ which is of importance to us. This end hidden state is the latent space representation of the entire sentence. This is the meaning of the entire English sentence.


The decoder is similar to the encoder in this case. It too is a recurrent architecture, modeled on the French counterpart of the English sentence. The only difference between the encoder and the decoder is the initial hidden state each receives. For the encoder, the initial hidden state is all zeros (or any other initialization), whereas the initial state of the decoder is the latent representation of the English sentence, which is nothing but the last hidden state of the encoder. This means that the decoder models not only the French statements but also the meaning of the English statements.

The objective of the decoder is to predict the next word when provided with the present word and the past words. This objective is the negative log-likelihood of the next word. We choose this objective because we need to generate words one at a time, conditioned on the past words.
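In tf.keras this next-word objective is typically expressed as sparse categorical cross-entropy over logits, which is exactly the negative log-likelihood of the correct next token; the shapes below are illustrative:

```python
import tensorflow as tf

VOCAB_SIZE = 10000

# The model emits one logit vector per time step: (batch, timesteps, vocab)
logits = tf.random.normal((2, 5, VOCAB_SIZE))
# The targets are the next-word token ids at each time step
targets = tf.random.uniform((2, 5), maxval=VOCAB_SIZE, dtype=tf.int32)

# from_logits=True because the output layer applies no softmax
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
loss = loss_fn(targets, logits)  # average negative log-likelihood per token
```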


Here we have used the simple functional API of tf.keras for building our model. The most important part of this architecture is encoder_state. This is the latent representation of the encoded statements.

import tensorflow as tf
from tensorflow.keras.layers import Input, Embedding, GRU, Dense

encoder_input = Input(shape=(None,))
encoder_embedded = Embedding(input_dim=VOCAB_SIZE, output_dim=64)(encoder_input)

# Return states in addition to output
output = GRU(UNITS_RNN, return_sequences=True)(encoder_embedded)
output = GRU(UNITS_RNN, return_sequences=True)(output)
_, state_h = GRU(UNITS_RNN, return_state=True)(output)

encoder_state = [state_h]

decoder_input = Input(shape=(None,))
decoder_embedded = Embedding(input_dim=VOCAB_SIZE, output_dim=64)(decoder_input)

# Pass the state to a new GRU layer, as initial state
decoder_output = GRU(UNITS_RNN, return_sequences=True)(decoder_embedded, initial_state=encoder_state)
decoder_output = GRU(UNITS_RNN, return_sequences=True)(decoder_output, initial_state=encoder_state)
decoder_output = GRU(UNITS_RNN, return_sequences=True)(decoder_output, initial_state=encoder_state)

output = Dense(VOCAB_SIZE)(decoder_output)

model = tf.keras.Model([encoder_input, decoder_input], output)
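A minimal compile-and-train sketch of this setup, rebuilt in miniature with a single GRU per side and toy sizes. The padded token matrices here are random placeholders for the real tokenizer output; the decoder input is the French sequence and the target is the same sequence shifted one step left (teacher forcing):

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Input, Embedding, GRU, Dense

VOCAB_SIZE, UNITS_RNN = 100, 32  # toy sizes for the sketch

# Single-layer miniature of the encoder-decoder architecture
encoder_input = Input(shape=(None,))
x = Embedding(VOCAB_SIZE, 16)(encoder_input)
_, state_h = GRU(UNITS_RNN, return_state=True)(x)

decoder_input = Input(shape=(None,))
y = Embedding(VOCAB_SIZE, 16)(decoder_input)
y = GRU(UNITS_RNN, return_sequences=True)(y, initial_state=[state_h])
output = Dense(VOCAB_SIZE)(y)
model = tf.keras.Model([encoder_input, decoder_input], output)

# Hypothetical padded token matrices: (num_pairs, timesteps)
eng_seqs = np.random.randint(1, VOCAB_SIZE, size=(64, 10))
fre_seqs = np.random.randint(1, VOCAB_SIZE, size=(64, 11))

dec_in = fre_seqs[:, :-1]     # <start> ... last-but-one token
dec_target = fre_seqs[:, 1:]  # second token ... <end> (shifted left)

model.compile(optimizer='adam',
              # from_logits=True: the Dense layer emits raw logits
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
history = model.fit([eng_seqs, dec_in], dec_target,
                    batch_size=16, epochs=1, verbose=0)
```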



Check out the Kaggle Kernel

In this stage, we will look into the code that generates the French translation of an English sentence.

import numpy as np

# Tokenize the English source sentence
eng = ["<start> hello how are you <end>"]
eng = np.array(eng_tokenizer.texts_to_sequences(eng))

# Seed the decoder with the <start> token
fre_words = ["<start>"]
fre = np.array(fre_tokenizer.texts_to_sequences(fre_words))
word = ""
# Generate one word at a time until the model emits <end>
# (the length cap guards against the loop never terminating)
while word != "<end>" and fre.shape[1] < 50:
    prediction = model(inputs=[eng, fre])
    # Pick the most likely word from the last time step's logits
    word = fre_tokenizer.index_word[tf.math.argmax(prediction[0, -1]).numpy()]
    fre_words[0] = fre_words[0] + " " + word
    fre = np.array(fre_tokenizer.texts_to_sequences(fre_words))

Here we append each generated word and try to predict the next word given all of the past generated words. I will not deny that the inference translations are a bit weird and not up to the mark. This is because it is a very simple model with no attention in it. Furthermore, the discrepancy between the embeddings can also introduce a lot of randomness into the model.


We can safely conclude that the encoder-decoder model has proven to be a powerful paradigm in the deep learning community. This paper is the gateway to the works on attention. Let me know in the comments if you would like a write-up on attention.

Get in touch with me @ariG23498