Show and Tell

This is the TensorFlow implementation of Show and Tell. Made by Aritra Roy Gosthipaty using Weights & Biases

Introduction

Check out the Kaggle Notebook

A few years ago, if someone had claimed that we would have virtual assistants able to correctly describe a scene presented to them, people would have laughed it off. As machine learning slowly ventured into deep learning, opening up endless possibilities, ideas we could never have dreamt of started seeming possible. One of those ideas is depicted in Show and Tell: A Neural Image Caption Generator by Vinyals et al. In the paper, the authors propose an end-to-end solution for image caption generation. Prior to this paper, approaches to the task optimized the vision and natural language parts independently and then stitched them together with hand-engineered rules.

This paper takes its inspiration from Neural Machine Translation, where an encoder trains on a sequence in a given language and produces a fixed-length representation for a decoder, which then generates a sequence in another language. Stemming from this idea, the authors use a vision feature extractor as the encoder and a sequence model as the decoder.

📜Reading the paper

It is quite fascinating to get hold of an academic paper that, upon reading, convinces you that you yourself could have come up with the proposed idea. At that very moment, you realize how simple yet powerful an idea can be. Show and Tell: A Neural Image Caption Generator is one such amazing paper, and it opened the gates of deep learning research for image caption generation.

The authors claim to be highly inspired by machine translation. This led us to break down our article in the form of a road-map. Upon following the road-map, the reader should feel the same excitement as we did while reading the paper.

The road-map:

👨‍🏫Task

Given an image, we need a caption describing that image. This is not a mere classification problem, where an artificial agent decides which category the image belongs to. Nor is it a detection task, where an artificial agent draws bounding boxes around the objects it categorizes. Here we need to decipher the contents of the image and then form a sequence of words that captures the relationship between those contents.

The simplest idea that comes to mind is dividing the problem into two distinct tasks.

  1. 👁️ Computer vision: This part deals with the image provided. It tries to extract features from the image, build concepts from the hierarchical features, and model the data distribution. A simple Convolutional Neural Network serves fine for this purpose. Given an input image, the CNN kernels pick up features in a hierarchical fashion. These features are a compressed representation of the content of the image presented.

-> Deep Learning by Goodfellow et al. <-

  2. 🗣️ Word generation: In this task, we are provided with an image and its caption to train on. The caption, a sequence of words describing the image, needs to be modeled. A natural language model is needed to model the caption data distribution: it needs to understand the word distribution as well as the context of the words. Here we can use a simple recurrent architecture that models the captions and generates words closely sampled from the provided caption data space.

-> Source <-

The tricky part is stitching the two realms together. We not only need the natural language model to generate words from the caption data distribution, but we also want it to take the image into consideration. The image features are an important factor in the image caption generation problem. The caption generator needs to pick up the image features and then, with that context, sample words from the caption space to provide a description of the image. The stitching of the two realms is what makes this task so intriguing.

-> Show and Tell: a Neural Image Caption Generator <-

A little insight that I found highly interesting is the usage of numbers. We humans have come up with a beautiful language of communication called mathematics, in which we can depict concepts, ideas, and much more with numbers and symbols. Let us concentrate on numbers for the time being. A computer vision model extracts valuable features from images, and these features are essentially numbers (the activations produced by the model's weights and biases). Similarly, language and words can be depicted with numbers too (word embeddings). This is the idea that, when harnessed, can solve our problem of stitching together the diverse realms of the task: we need to feed the numbers from the vision model into the language model in a way that makes image caption generation succeed.

Data

For our experiments, we chose the Flickr30k dataset, which houses 30,000 images with multiple unique captions for each image. The data is preprocessed and hosted on Kaggle, which eases our use case, so we went ahead with the Kaggle version of Flickr30k.

The dataset houses a CSV that links each image to its respective captions.

A peek into the dataset is as follows:
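Since the notebook works with a DataFrame named df that has image_name and comment columns (the same names used in the code later in this report), a minimal sketch of taking this peek might look like the following; the file name results.csv is an assumption and may differ in your copy of the Kaggle dataset.

import pandas as pd

# Load the caption CSV; the file name here is an assumption
df = pd.read_csv("results.csv")

# Each record links an image file to one of its captions
print(df[["image_name", "comment"]].head())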

Before moving ahead, we would like to point out to the reader the usage of the <start> and <end> tokens. They are particularly important for letting the model know about the start and end of a caption. They may not seem necessary during training, but at test time we need to feed the <start> token for the model to generate the first word of the caption, and the model needs to stop generating words after it produces the <end> token.
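The captions in the preprocessed Kaggle CSV already carry these tokens. If you were preparing raw captions yourself, a minimal sketch of adding them (assuming the comment column used later in this report) could be:

# Wrap every caption with the special start and end tokens
df['comment'] = df['comment'].apply(lambda caption: '<start> ' + caption.strip() + ' <end>')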


🏋️Models

We now have a fair bit of understanding about the task and also about the models we need to work with. The insight on numbers will be very helpful in this section. Let us start with the architecture proposed by the authors and then dive into how it works.

-> The architecture proposed <-

Here we have two distinct models for the two distinct tasks at hand. On the left is a vision model, and on the right is a recurrent model used for word generation. The idea is to feed the image features to the recurrent model as if they were just another word. The features extracted from the image are a collection of numbers (a vector); if we plot that vector in the embedding space of the captions, we effectively obtain the representation of a word. This image feature turned into a word is the beauty of the whole process. The image features plotted in the embedding space might not correspond to an actual word from the thesaurus, but they are just enough for the recurrent model to learn from. This so-called image-word is the initial input to the recurrent model. Upon a little introspection, the reader is bound to find how simple yet effective this idea is. Two pictures with similar features lie close in the embedding space, so when these features are provided to the recurrent model, it will generate similar captions for both images. One can also think of the image-word idea as important because it lets us compute vector algebra in the embedding space: new concepts can be learned by merely adding two existing concepts.

-> Example of image-word <-

Following the architecture, we used a CNN as the encoder and a stack of GRU layers as the decoder. We use Gated Recurrent Units instead of LSTMs because GRUs are more compute-efficient than LSTMs while being more effective than simple RNNs. Keeping in mind that we were dealing with a huge dataset, we chose GRUs to boost our pipeline's efficiency, in return for losing a little effectiveness on long sequences (GRUs handle the vanishing-gradient problem slightly less well than LSTMs).

👀Show

This part of our model acts as our information encoder. We use a ResNet50 model pre-trained on ImageNet. Here we omit the classification head and extract the output of the last convolutional block. On top of the ResNet output, we stack a Global Average Pooling layer and a Dense layer. With the GAP we average the spatial feature maps, and with the Dense layer we mold the image features into the same shape as the word embeddings.
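As a quick sanity check of the shapes involved (an illustrative sketch, not part of the original notebook): ResNet50 without its top on a 224x224 input yields a 7x7x2048 feature map, the GAP reduces it to a 2048-dimensional vector, and the Dense layer projects it to the embedding size, here assumed to be 128.

import tensorflow as tf

# Illustrative shape check only, so we skip downloading the ImageNet weights
resnet = tf.keras.applications.ResNet50(include_top=False, weights=None)
dummy = tf.zeros((1, 224, 224, 3))
features = resnet(dummy)                                      # (1, 7, 7, 2048)
pooled = tf.keras.layers.GlobalAveragePooling2D()(features)   # (1, 2048)
embedded = tf.keras.layers.Dense(128)(pooled)                 # (1, 128)
print(features.shape, pooled.shape, embedded.shape)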

🗣️Tell

This part of our model acts as our information decoder. To explain this, we have to look at the roots of an RNN model. For those of you who need a refresher on recurrent architectures, Under the Hood of RNN would be a good place to start. The output of a single cell $y_{t}$ is determined by its current input $x_{t}$ and the hidden state activation from the previous cell, $h_{t-1}$. Now, if our first RNN cell takes the encoded image as its input, the hidden state it generates is carried over to the next cell. That hidden state in turn links the next cell's input and output, and this repeats for every cell in the sequence. To sum up, the effect of the encoded image is passed throughout the sequence of cells, so that each word is predicted keeping the given image in mind.

-> Depiction of the recurrent formula <-
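Written out generically (a sketch of the standard recurrent equations, with $W$, $U$, $V$ denoting learned weight matrices and $b$, $c$ biases):

$$h_{t} = f(W x_{t} + U h_{t-1} + b)$$

$$y_{t} = g(V h_{t} + c)$$

Here $f$ is the cell's nonlinearity (the gating machinery of a GRU in our case) and $g$ produces the output distribution over words. With $x_{1}$ set to the encoded image, $h_{1}$, and therefore every later $h_{t}$, carries information about the image.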

🖥️Code

Check out the Kaggle Notebook

Here we will go through the salient parts of our code.

Text Handling

In the following code block, we split the data into training, validation, and test sets, do basic text handling, and create our vocabulary from the available set of captions. We also manually add a <pad> token so that we can later make all the sentences the same length.

train_df = df.iloc[:train_size,:]
val_df = df.iloc[train_size+1:train_size+val_size,:]
test_df = df.iloc[train_size+val_size+1:,:]

# Choose the top 10,000 words from the vocabulary
top_k = 10000
tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=top_k,
                                                  oov_token="<unk>",
                                                  filters='!"#$%&()*+.,-/:;=?@[\]^_`{|}~')

# build the vocabulary
tokenizer.fit_on_texts(train_df['comment'])

tokenizer.word_index['<pad>'] = 0
tokenizer.index_word[0] = '<pad>'

To check if our basic vocabulary creation is done properly, we create a helper function

# This is a sanity check function
def check_vocab(word):
    i = tokenizer.word_index[word]
    print(f"The index of the word: {i}")
    print(f"Index {i} is word {tokenizer.index_word[i]}")
    
check_vocab("pajama")

The output shows the index of the word and confirms that looking that index back up returns the same word.

Moving on, we have to create our training data in accordance with the vocabulary generated by the tokenizer. We pad all sentences to the same length and then integrate our data into the tf.data pipeline.

# Create the tokenized vectors
train_seqs = tokenizer.texts_to_sequences(train_df['comment'])
val_seqs = tokenizer.texts_to_sequences(val_df['comment'])
test_seqs = tokenizer.texts_to_sequences(test_df['comment'])

# If you do not provide a max_length value, pad_sequences calculates it automatically
train_cap_vector = tf.keras.preprocessing.sequence.pad_sequences(train_seqs, padding='post')
val_cap_vector = tf.keras.preprocessing.sequence.pad_sequences(val_seqs, padding='post')
test_cap_vector = tf.keras.preprocessing.sequence.pad_sequences(test_seqs, padding='post')

train_cap_ds = tf.data.Dataset.from_tensor_slices(train_cap_vector)
val_cap_ds = tf.data.Dataset.from_tensor_slices(val_cap_vector)
test_cap_ds = tf.data.Dataset.from_tensor_slices(test_cap_vector)

Image Handling

Now we create a tf.data pipeline for the images in the Flickr30k dataset. We perform basic operations: loading the image, decoding it, converting its datatype, and resizing it.

@tf.function
def load_img(image_path):
    img = tf.io.read_file(image_path)
    img = tf.image.decode_jpeg(img)
    img = tf.image.convert_image_dtype(img, tf.float32)
    img = tf.image.resize(img, (224, 224))
    return img

train_img_name = train_df['image_name'].values
val_img_name = val_df['image_name'].values
test_img_name = test_df['image_name'].values

train_img_ds = tf.data.Dataset.from_tensor_slices(train_img_name).map(load_img)
val_img_ds = tf.data.Dataset.from_tensor_slices(val_img_name).map(load_img)
test_img_ds = tf.data.Dataset.from_tensor_slices(test_img_name).map(load_img)

Joining the Data

Our intention is to merge the two data pipelines so that we can feed them to our networks together. We take the data in batches since the dataset as a whole is very large.

# prefecth and batch the dataset
AUTOTUNE = tf.data.experimental.AUTOTUNE
BATCH_SIZE = 512

train_ds = tf.data.Dataset.zip((train_img_ds, train_cap_ds)).shuffle(42).batch(BATCH_SIZE).prefetch(buffer_size=AUTOTUNE)
val_ds = tf.data.Dataset.zip((val_img_ds, val_cap_ds)).shuffle(42).batch(BATCH_SIZE).prefetch(buffer_size=AUTOTUNE)
test_ds = tf.data.Dataset.zip((test_img_ds, test_cap_ds)).shuffle(42).batch(BATCH_SIZE).prefetch(buffer_size=AUTOTUNE)
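As a quick sanity check (illustrative, not part of the original notebook), we can pull a single batch and inspect its shapes; the caption length is whatever maximum length pad_sequences found.

# Peek at one batch of (image, caption) pairs
for img_batch, cap_batch in train_ds.take(1):
    print(img_batch.shape)   # (BATCH_SIZE, 224, 224, 3)
    print(cap_batch.shape)   # (BATCH_SIZE, max_caption_length)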

Model

Show

As mentioned earlier, Show refers to the encoder part of the architecture, which compresses the image. A ResNet50 model trained on ImageNet acts as the feature extractor, followed by a GAP layer. Finally, we round off this part of the architecture with a fully connected layer.

class CNN_Encoder(tf.keras.Model):
    
    def __init__(self, embedding_dim):
        super(CNN_Encoder, self).__init__()
        self.embedding_dim = embedding_dim
        
    def build(self, input_shape):
        # pre-trained ResNet50 backbone without its classification head
        self.resnet = tf.keras.applications.ResNet50(include_top=False,
                                                     weights='imagenet')
        self.resnet.trainable = False
        # average the spatial feature maps and project to the embedding size
        self.gap = GlobalAveragePooling2D()
        self.fc = Dense(units=self.embedding_dim,
                        activation='sigmoid')
        
    def call(self, x):
        x = self.resnet(x)  # (batch, 7, 7, 2048)
        x = self.gap(x)     # (batch, 2048)
        x = self.fc(x)      # (batch, embedding_dim)
        return x

Tell

Housing GRU cells, Tell refers to the decoder, which uses the information from the encoder to establish the link between the learned captions and the input image.

class RNN_Decoder(tf.keras.Model):
    def __init__(self, embedding_dim, units, vocab_size):
        super(RNN_Decoder, self).__init__()
        self.units = units
        self.embedding_dim = embedding_dim
        self.vocab_size = vocab_size
        self.embedding = Embedding(input_dim=self.vocab_size,
                                   output_dim=self.embedding_dim)
    
    def build(self, input_shape):
        self.gru1 = GRU(units=self.units,
                       return_sequences=True,
                       return_state=True)
        self.gru2 = GRU(units=self.units,
                       return_sequences=True,
                       return_state=True)
        self.gru3 = GRU(units=self.units,
                       return_sequences=True,
                       return_state=True)
        self.gru4 = GRU(units=self.units,
                       return_sequences=True,
                       return_state=True)
        self.fc1 = Dense(self.units)
        self.fc2 = Dense(self.vocab_size)

    def call(self, x, initial_zero=False):
        # x: embedded input sequence, (batch, 1 + caption_length, embedding_dim)
        # the first time step is the image-word produced by the encoder
        if initial_zero:
            # start every GRU layer from an all-zero hidden state
            initial_state = self.reset_state(batch_size=x.shape[0])
            output, state = self.gru1(inputs=x,
                                      initial_state=initial_state)
            output, state = self.gru2(inputs=output,
                                      initial_state=initial_state)
            output, state = self.gru3(inputs=output,
                                      initial_state=initial_state)
            output, state = self.gru4(inputs=output,
                                      initial_state=initial_state)
        else:
            output, state = self.gru1(inputs=x)
            output, state = self.gru2(inputs=output)
            output, state = self.gru3(inputs=output)
            output, state = self.gru4(inputs=output)
        # output: (batch, 1 + caption_length, units)
        x = self.fc1(output)
        x = self.fc2(x)
        
        return x, state
    
    def embed(self, x):
        return self.embedding(x)
    
    def reset_state(self, batch_size):
        return tf.zeros((batch_size, self.units))

Training

We create the class objects and choose Adam as our optimizer. The loss is sparse categorical cross-entropy, because it would be inefficient to use one-hot encodings as the ground truth here. We also apply a mask on the <pad> token so that the sequence model does not learn to overfit on it.

encoder = CNN_Encoder(EMBEDDIN_DIM)
decoder = RNN_Decoder(embedding_dim=EMBEDDIN_DIM,
                      units=UNITS_RNN,
                      vocab_size=VOCAB_SIZE)

optimizer = tf.keras.optimizers.Adam()
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True, reduction='none')

def loss_function(real, pred):
    mask = tf.math.logical_not(tf.math.equal(real, 0))
    loss_ = loss_object(real, pred)

    mask = tf.cast(mask, dtype=loss_.dtype)
    loss_ *= mask

    return tf.reduce_mean(loss_)

Next, we write our train step function, which computes the loss, calculates the gradients through backpropagation, and applies them with the optimizer.

@tf.function
def train_step(img_tensor, target):
    # img_tensor (batch, 224,224,3)
    # target     (batch, 80)
    loss = 0
    with tf.GradientTape() as tape:
        features = tf.expand_dims(encoder(img_tensor),1) # (batch, 1, 128)
        em_words = decoder.embed(target)
        x = tf.concat([features,em_words],axis=1)
        predictions, _ = decoder(x, True)

        # predictions[:, 0] comes from the image-word and the final step has no
        # target, so predictions[:, 1:-1] lines up with target[:, 1:]
        loss = loss_function(target[:,1:], predictions[:,1:-1,:])

    trainable_variables = encoder.trainable_variables + decoder.trainable_variables

    gradients = tape.gradient(loss, trainable_variables)

    optimizer.apply_gradients(zip(gradients, trainable_variables))

    return loss

@tf.function
def val_step(img_tensor, target):
    # img_tensor (batch, 224,224,3)
    # target     (batch, 80)
    loss = 0
    features = tf.expand_dims(encoder(img_tensor),1) # (batch, 1, 128)
    em_words = decoder.embed(target)
    x = tf.concat([features,em_words],axis=1)
    predictions, _ = decoder(x, True)
    loss = loss_function(target[:,1:], predictions[:,1:-1,:])
    return loss
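These two steps are wired into a standard epoch loop. A minimal sketch of such a loop follows; the number of epochs is an assumption, and any logging (for instance to Weights & Biases) is omitted.

EPOCHS = 30  # assumed value

for epoch in range(EPOCHS):
    train_loss, val_loss = 0.0, 0.0

    # One pass over the training data
    for step, (img_tensor, target) in enumerate(train_ds):
        train_loss += train_step(img_tensor, target)

    # One pass over the validation data
    for val_batch, (img_tensor, target) in enumerate(val_ds):
        val_loss += val_step(img_tensor, target)

    avg_train = float(train_loss) / (step + 1)
    avg_val = float(val_loss) / (val_batch + 1)
    print(f"Epoch {epoch + 1}: train loss {avg_train:.4f}, val loss {avg_val:.4f}")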

📉Loss and Results

Check out the Kaggle Notebook

The objective function is the negative log-likelihood of the words generated. To make this a little more intuitive, let us walk through a feed-forward pass of the architecture. An image, when fed into the CNN, provides the image features. These features are then encoded into a tensor of the same shape as the word embeddings. The image feature is fed into the GRU at the first time step. The model then produces a softmax over the entire vocabulary. The objective is to increase the likelihood of the word that most closely describes the image. We take the negative log-likelihood so that we can minimize this quantity and train our model. In the following time steps, we provide the words of the caption as input and try to maximize the probability of the immediate next word.
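Written out (following the formulation in the paper, with $I$ the image, $S = (S_{0}, \dots, S_{N})$ the caption, and $\theta$ the model parameters), the loss minimized for a single training example is:

$$L(I, S) = - \sum_{t=1}^{N} \log p_{\theta}(S_{t} \mid I, S_{0}, \dots, S_{t-1})$$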


Conclusion

Using simple concepts learnt from machine translation, the authors brought about a brilliant way to automatically generate captions for a given image. This paper was the inspiration for several subsequent papers in the automatic caption generation domain, most notably Show, Attend and Tell, which, as its name suggests, employs attention along with the concepts spoken about in this report.

Talk to the authors:

Name Twitter GitHub
Devjyoti Chakrobarty @Cr0wley_zz @cr0wley-zz
Aritra Roy Gosthipaty @ariG23498 @ariG23498