Show and Tell

This is a TensorFlow implementation of Show and Tell, made by Aritra Roy Gosthipaty using Weights & Biases.


Check out the Kaggle Notebook

A few years ago, if someone had claimed that we would have virtual assistants able to correctly describe a scene presented to them, people would have laughed it off. As machine learning slowly ventured into deep learning, opening up endless possibilities, ideas we could never have dreamt of started to seem possible. One of those ideas is depicted in Show and Tell: A Neural Image Caption Generator by Vinyals et al. In the paper, the authors propose an end-to-end solution to image caption generation. Prior to this paper, proposals for the task involved optimizing independent sub-tasks (vision and natural language) and then hand-engineering a way to stitch them together.

This paper takes its inspiration from Neural Machine Translation, where an encoder trains on a sequence in one language and produces a fixed-length representation for a decoder, which then emits a sequence in another language. Stemming from this idea, the authors used a vision feature extractor as the encoder and a sequence model as the decoder.

📜Reading the paper

It is quite fascinating to read an academic paper and feel certain that you yourself could have come up with the proposed idea. At that moment, you realize how simple yet powerful an idea can be. Show and Tell: A Neural Image Caption Generator is one such amazing paper, and it opened the gates of deep learning research for image caption generation.

The authors claim to have been highly inspired by machine translation. This led us to structure this article as a road-map. Upon following it, the reader should feel the same excitement as we did while reading the paper.

Given an image, we need a caption describing that image. This is not a mere classification problem, where an artificial agent decides which category the image belongs to. Nor is it a detection task, where an agent draws bounding boxes around the objects it categorizes. Here we need to decipher the contents of the image and then form a sequence of words that depicts the relationships between those contents.

The simplest idea that comes to mind is to divide the task into two distinct sub-tasks.

  1. 👁️ Computer vision: This part deals with the image provided. It extracts features from the image, builds concepts from the hierarchical features, and models the data distribution. A simple Convolutional Neural Network serves this purpose well. Given an input image, the CNN kernels pick up features in a hierarchical fashion. These features are a compressed representation of the content of the image.

-> Deep Learning by Goodfellow et al. <-

  2. 🗣️ Word generation: In this task, we are provided with an image along with a caption to train on. The caption, a sequence of words describing the image, needs to be modeled. A natural language model is needed to fit the caption data distribution; it must understand both the distribution of words and their context. Here we can use a simple recurrent architecture that models the captions and generates words closely sampled from the provided caption data space.

-> Source <-

The tricky part is stitching the two realms together. We not only need the language model to generate words from the caption data distribution but also want it to take the image into consideration. The image features are a key factor in the caption generation problem: the generator needs to pick up the image features and, with that context, sample words from the caption space to describe the image. This stitching of the two realms is what makes the task so intriguing.

-> Show and Tell: a Neural Image Caption Generator <-

A little insight that I found highly interesting is the usage of numbers. Humans have come up with a beautiful language of communication called mathematics, in which concepts, ideas, and much more can be depicted with numbers and symbols. Let us concentrate on numbers for the time being. A computer vision model extracts valuable features from images, and these features are essentially numbers (vectors of activations). Similarly, language and words can be depicted with numbers too (word embeddings). This is the idea that, when harnessed, can stitch together the diverse realms of the task: we feed the numbers from the vision model to the language model in a way that makes image caption generation succeed.


For our experiments, we chose the Flickr30k dataset, which contains roughly 30,000 images with multiple unique captions per image. The data has been preprocessed and hosted on Kaggle to ease our use case, so we went ahead with the Kaggle version of Flickr30k.

The dataset contains a CSV file that links each image to its respective captions.

A peek into the dataset is as follows:

Before moving ahead, we would like to point the reader to the usage of the <start> and <end> tokens. These let the model know where a caption begins and ends. They may not seem necessary during training, but at test time we need to feed the <start> token for the model to generate the first word of the caption, and the model must stop generating words once it produces the <end> token.
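The wrapping itself is trivial; a hypothetical helper (not part of the original pipeline, the names are purely illustrative) makes the idea concrete:

```python
# Illustrative helper: wrap a raw caption with the special boundary tokens.
# In the preprocessed Kaggle CSV the captions already carry these markers.
def wrap_caption(caption):
    return f"<start> {caption} <end>"

wrapped = wrap_caption("a dog runs across the field")
# The model treats "<start>" and "<end>" as ordinary vocabulary entries.
```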

⚙️The Architecture


We now have a fair understanding of the task and of the models we need. The insight about numbers will be very helpful in this section. Let us start with the architecture proposed by the authors and then dive deep into how it works.

-> The architecture proposed <-

Here we have two distinct models for the two distinct tasks at hand. On the left is a vision model; on the right is a recurrent model used for word generation. The idea is to feed the image features to the recurrent model as if they were just another word. The features extracted from the image are a collection of numbers (a vector); if we plot this vector in the embedding space of the captions, we effectively have the representation of a word. This image feature turned into a word is the beauty of the whole process.

The image features plotted in the embedding space might not represent an actual word from the thesaurus, but that is just enough for the recurrent model to learn on. This so-called image-word is the initial input to the recurrent model. Upon a little introspection, the reader is bound to find how simple yet effective this idea is: two pictures with similar features lie close together in the embedding space, so upon providing these features to the recurrent model, it will generate similar captions for both images. The image-word idea is also important because we can now compute vector algebra in the embedding space; new concepts can be learned by merely adding two concepts.

-> Example of image-word <-
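The image-word idea can be sketched numerically: project the image feature vector down to the embedding dimension and prepend it to the caption's word embeddings. A NumPy sketch, with made-up dimensions and random stand-ins for the learned weights:

```python
import numpy as np

rng = np.random.default_rng(42)
embedding_dim = 8

image_feature = rng.standard_normal(2048)            # CNN output (illustrative size)
W_proj = rng.standard_normal((embedding_dim, 2048))  # stands in for the learned Dense projection
image_word = W_proj @ image_feature                  # the "image-word": same shape as a word embedding

word_embeddings = rng.standard_normal((5, embedding_dim))  # embeddings of a 5-word caption
sequence = np.vstack([image_word, word_embeddings])        # image-word prepended as the first token
```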

Following the architecture, we used a CNN as the encoder and stacked GRU layers as the decoder. We use Gated Recurrent Units instead of LSTMs because GRUs are more compute-efficient than LSTMs while remaining more effective than simple RNNs. Keeping in mind that we were dealing with a huge dataset, we chose GRUs to boost our pipeline's efficiency, trading away some modeling power (GRUs can be somewhat more susceptible to the vanishing-gradient problem than LSTMs).


This part of our model acts as our information encoder. We use a ResNet50 model pre-trained on ImageNet. We omit the last layer of the model and extract the output of the penultimate layer. On top of the ResNet output, we stack a Global Average Pooling layer and a Dense layer. With the GAP we average the penultimate feature maps, and with the Dense layer we mold the image features into the same shape as the word embeddings.


This part of our model acts as our information decoder. To explain it, we have to look at the roots of an RNN model. For those who need a refresher on recurrent architectures, Under the Hood of RNN is a good place to start. The output of a single cell $y_{t}$ is determined by its current input $x_{t}$ and the hidden state activation $h_{t-1}$ from the previous cell. Now, if our first RNN cell takes the encoded image as its input, the hidden state it generates will be carried over to the next cells. This hidden state links each cell's input to its output, and the process repeats for all cells in the sequence. To sum up, the effect of the encoded image is passed throughout the sequence of cells, so each word is predicted keeping the given image in mind.

-> Depiction of the recurrent formula <-
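The recurrence described above can be sketched with a vanilla RNN cell in NumPy (the report itself uses GRUs; the dimensions and weights here are made up for illustration):

```python
import numpy as np

# Toy vanilla-RNN cell: h_t = tanh(W x_t + U h_{t-1} + b)
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))  # input-to-hidden weights
U = rng.standard_normal((4, 4))  # hidden-to-hidden weights
b = np.zeros(4)

def rnn_step(x_t, h_prev):
    return np.tanh(W @ x_t + U @ h_prev + b)

h = np.zeros(4)                          # initial hidden state
image_feature = rng.standard_normal(3)   # stands in for the encoded image
h = rnn_step(image_feature, h)           # step 0: the image enters as "just another word"
for _ in range(3):                       # subsequent caption words
    x_t = rng.standard_normal(3)
    h = rnn_step(x_t, h)                 # h still carries the image's influence forward
```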


Check out the Kaggle Notebook

Here we will go through the salient parts of our code.

Text Handling

In the following code block, we do basic text handling and create our vocabulary from the available set of captions. We also manually add a <pad> token so that we can later pad all sentences to the same length.

train_df = df.iloc[:train_size, :]
val_df = df.iloc[train_size:train_size + val_size, :]   # iloc stops are exclusive, so no +1 offset is needed
test_df = df.iloc[train_size + val_size:, :]

# Choose the top 10,000 words for the vocabulary
top_k = 10000
tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=top_k,
                                                  oov_token='<unk>')  # assumed completion of the truncated call

# Build the vocabulary
tokenizer.fit_on_texts(train_df['comment'])

tokenizer.word_index['<pad>'] = 0
tokenizer.index_word[0] = '<pad>'

To check that our vocabulary was created properly, we write a helper function:

# This is a sanity check function
def check_vocab(word):
    i = tokenizer.word_index[word]
    print(f"The index of the word: {i}")
    print(f"Index {i} is word {tokenizer.index_word[i]}")

Calling the helper with any known word prints the word's index and maps that index back to the same word, confirming that `word_index` and `index_word` agree.


Moving on, we have to create our training data in accordance with the vocabulary generated by the tokenizer. We manually pad all sentences to the same length and then integrate the data into the pipeline.

# Create the tokenized vectors
train_seqs = tokenizer.texts_to_sequences(train_df['comment'])
val_seqs = tokenizer.texts_to_sequences(val_df['comment'])
test_seqs = tokenizer.texts_to_sequences(test_df['comment'])

# If you do not provide a max_length value, pad_sequences calculates it automatically
train_cap_vector = tf.keras.preprocessing.sequence.pad_sequences(train_seqs, padding='post')
val_cap_vector = tf.keras.preprocessing.sequence.pad_sequences(val_seqs, padding='post')
test_cap_vector = tf.keras.preprocessing.sequence.pad_sequences(test_seqs, padding='post')

# Wrap the padded caption vectors into tf.data pipelines
train_cap_ds =
val_cap_ds =
test_cap_ds =

Image Handling

Now we shall create a pipeline for the images in the Flickr30k dataset. We do basic operations: loading the image, decoding it, converting its datatype, and resizing it.

def load_img(image_path):
    img =
    img = tf.image.decode_jpeg(img, channels=3)
    img = tf.image.convert_image_dtype(img, tf.float32)
    img = tf.image.resize(img, (224, 224))
    return img

train_img_name = train_df['image_name'].values
val_img_name = val_df['image_name'].values
test_img_name = test_df['image_name'].values

# Map the file names through load_img (assumes image_name holds the full path to each image)
train_img_ds =
val_img_ds =
test_img_ds =

Joining the Data

Our intention is to merge the two data pipelines so that we can feed them to our networks together. We take the data in batches since the dataset as a whole is very large.

# Zip, shuffle, batch, and prefetch the dataset

train_ds =, train_cap_ds)).shuffle(42).batch(BATCH_SIZE).prefetch(buffer_size=AUTOTUNE)
val_ds =, val_cap_ds)).shuffle(42).batch(BATCH_SIZE).prefetch(buffer_size=AUTOTUNE)
test_ds =, test_cap_ds)).shuffle(42).batch(BATCH_SIZE).prefetch(buffer_size=AUTOTUNE)



As mentioned earlier, Show refers to the encoder part of the architecture which compresses the image. A ResNet50 model trained on ImageNet acts as the feature extractor, followed by a GAP. Finally, we round up this part of our architecture with a fully connected layer.

class CNN_Encoder(tf.keras.Model):
    def __init__(self, embedding_dim):
        super(CNN_Encoder, self).__init__()
        self.embedding_dim = embedding_dim

    def build(self, input_shape):
        self.resnet = tf.keras.applications.ResNet50(include_top=False,
                                                     weights='imagenet')
        self.resnet.trainable = False = GlobalAveragePooling2D()
        self.fc = Dense(units=self.embedding_dim)

    def call(self, x):
        x = self.resnet(x)
        x =
        x = self.fc(x)
        return x


Housing GRU cells, Tell refers to the decoder which uses information from the encoder to establish a link between the learned captions and the original input.

class RNN_Decoder(tf.keras.Model):
    def __init__(self, embedding_dim, units, vocab_size):
        super(RNN_Decoder, self).__init__()
        self.units = units
        self.embedding_dim = embedding_dim
        self.vocab_size = vocab_size
        self.embedding = Embedding(input_dim=self.vocab_size,
                                   output_dim=self.embedding_dim)

    def build(self, input_shape):
        self.gru1 = GRU(units=self.units,
                        return_sequences=True,
                        return_state=True)
        self.gru2 = GRU(units=self.units,
                        return_sequences=True,
                        return_state=True)
        self.gru3 = GRU(units=self.units,
                        return_sequences=True,
                        return_state=True)
        self.gru4 = GRU(units=self.units,
                        return_sequences=True,
                        return_state=True)
        self.fc1 = Dense(self.units)
        self.fc2 = Dense(self.vocab_size)

    def call(self, x, initial_zero=False):
        # x, (batch, seq, embedding_dim)
        if initial_zero:
            initial_state = self.reset_state(batch_size=x.shape[0])
            output, state = self.gru1(inputs=x, initial_state=initial_state)
            output, state = self.gru2(inputs=output, initial_state=initial_state)
            output, state = self.gru3(inputs=output, initial_state=initial_state)
            output, state = self.gru4(inputs=output, initial_state=initial_state)
        else:
            output, state = self.gru1(inputs=x)
            output, state = self.gru2(inputs=output)
            output, state = self.gru3(inputs=output)
            output, state = self.gru4(inputs=output)
        # output, (batch, seq, units)
        x = self.fc1(output)
        x = self.fc2(x)
        return x, state

    def embed(self, x):
        return self.embedding(x)

    def reset_state(self, batch_size):
        return tf.zeros((batch_size, self.units))


We create the class objects and use Adam as our optimizer. The loss is sparse categorical cross-entropy, because it would be inefficient to use one-hot encodings as the ground truth here. We also apply a mask over the <pad> token so that the sequence model does not learn to overfit on it.

encoder = CNN_Encoder(EMBEDDING_DIM)
decoder = RNN_Decoder(embedding_dim=EMBEDDING_DIM,
                      units=UNITS,           # assumed completion of the truncated call
                      vocab_size=top_k + 1)  # assumed completion; +1 accounts for the <pad> index

optimizer = tf.keras.optimizers.Adam()
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True, reduction='none')

def loss_function(real, pred):
    mask = tf.math.logical_not(tf.math.equal(real, 0))
    loss_ = loss_object(real, pred)

    mask = tf.cast(mask, dtype=loss_.dtype)
    loss_ *= mask

    return tf.reduce_mean(loss_)
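To see what the mask does, here is a NumPy sketch of the same logic with made-up per-token loss values (in the real code, `loss_object` produces these from the logits):

```python
import numpy as np

# Positions where the target is the <pad> index (0) contribute nothing to the loss.
def masked_nll(real, per_token_loss):
    mask = (real != 0).astype(float)       # 1.0 for real tokens, 0.0 for padding
    return float(np.mean(per_token_loss * mask))

real = np.array([5, 9, 2, 0, 0])                 # last two tokens are padding
per_token_loss = np.array([1.0, 2.0, 3.0, 4.0, 4.0])
print(masked_nll(real, per_token_loss))          # → 1.2, the padding losses are zeroed out
```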

Next, we write our train step function which will calculate the gradients through backpropagation.

def train_step(img_tensor, target):
    # img_tensor (batch, 224,224,3)
    # target     (batch, 80)
    loss = 0
    with tf.GradientTape() as tape:
        features = tf.expand_dims(encoder(img_tensor),1) # (batch, 1, 128)
        em_words = decoder.embed(target)
        x = tf.concat([features,em_words],axis=1)
        predictions, _ = decoder(x, True)

        loss = loss_function(target[:,1:], predictions[:,1:-1,:])

    trainable_variables = encoder.trainable_variables + decoder.trainable_variables

    gradients = tape.gradient(loss, trainable_variables)

    optimizer.apply_gradients(zip(gradients, trainable_variables))

    return loss

def val_step(img_tensor, target):
    # img_tensor (batch, 224,224,3)
    # target     (batch, 80)
    loss = 0
    features = tf.expand_dims(encoder(img_tensor),1) # (batch, 1, 128)
    em_words = decoder.embed(target)
    x = tf.concat([features,em_words],axis=1)
    predictions, _ = decoder(x, True)
    loss = loss_function(target[:,1:], predictions[:,1:-1,:])
    return loss

📉Loss and Results

Check out the Kaggle Notebook

The objective function is the negative log-likelihood of the generated words. To make this a little more intuitive, let us do a feed-forward run through the architecture. An image fed into the CNN yields the image features. These features are then encoded into a tensor of the same shape as the word embeddings. The image feature is fed into the GRU at the first time step, and the cell produces a softmax over the entire word vocabulary. The objective of our task is to increase the likelihood of the word that closely describes the image; we take the negative log-likelihood so that we can minimize this quantity and train our model. In the subsequent time steps, we provide the words of the caption as input and try to maximize the probability of the immediate next word.
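Concretely, for an image $I$ and a caption $S = (S_1, \ldots, S_N)$, the training objective described above can be written (as in the paper) as:

```latex
\mathcal{L}(I, S) = -\sum_{t=1}^{N} \log p(S_t \mid I, S_1, \ldots, S_{t-1})
```

Minimizing this loss is equivalent to maximizing the likelihood of the correct caption given the image.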

🎬Conclusion


Using simple concepts learnt from machine translation, the authors brought about a brilliant way to generate captions automatically from an image. This paper inspired several subsequent works in the automatic caption generation domain, most notably Show, Attend and Tell, which, as its name suggests, employs attention along with the concepts spoken about in this report.

Talk to the authors:

Name                    Twitter       GitHub
Devjyoti Chakrobarty    @Cr0wley_zz   @cr0wley-zz
Aritra Roy Gosthipaty   @ariG23498    @ariG23498