Generate Meaningful Captions for Images With Attention Models
In this article, we give a brief overview of the milestones in image captioning with attention models and work through a practical tutorial.

Image captioning is the task of generating a description for a given image.
Caption generation involves two tasks.
- Understanding the content of the image.
- Turning this understanding into a meaningful sentence describing the image.
Hence, it requires techniques from both computer vision and natural language processing. Image captioning has many use cases, including generating captions for Google image search, annotating live video surveillance, and helping visually impaired people get information about their surroundings. Watch this wonderful video by Microsoft here.
Let us dig deeper into the different techniques to perform image captioning.
Table of Contents
- History and Evolution of Image Captioning
  - 1. Every Picture Tells a Story: Generating Sentences from Images
  - 2. From Captions to Visual Concepts and Back
- Show and Tell: A Neural Image Caption Generator
- Show, Attend and Tell: Attention Mechanism and Image Captioning
  - Encoder Block: Convolution Feature Maps
  - Decoder Block: LSTM/GRU as a Generator Network
  - Soft Attention: Neural Machine Translation by Jointly Learning to Align and Translate
  - Model Prediction and Attention Visualization
History and Evolution of Image Captioning
1. Every Picture Tells a Story: Generating Sentences from Images


Ali Farhadi et al.'s 'Every Picture Tells a Story: Generating Sentences from Images' projects the image and text into a triplet representation space of <object, action, scene>. They call this space 'the meaning space'. Once the images and texts are projected into the meaning space, the results are compared. Similar meanings result in a high score.
The problem is formulated as a Markov Random Field (MRF) whose nodes represent the object, action, and scene. The node potentials are computed as a linear combination of scores from several detectors and classifiers, while the edge potentials are estimated from co-occurrence frequencies. Learning involves setting the weights on the node and edge potentials; given the potentials, the goal of inference is to find the best triplet.
A Markov Random Field (MRF) is a probability distribution $P(x)$ over variables $x_1, \dots, x_n$ that form the nodes of an undirected graph $G$. The joint probability distribution is given by

$$P(x) = \frac{1}{Z} \prod_{c \in C} \phi_c(x_c)$$

where $C$ denotes the set of cliques of the graph $G$, each factor $\phi_c$ is a non-negative function over the variables in its clique, and $Z$ is the normalizing constant.
In this MRF, there is a node for:
- Objects, which can take a value from a set of 23 nouns
- Actions, with 16 possible values
- Scenes, with 29 possible values
Now, the mapping from images to the meaning space reduces to predicting the triplet <object, action, scene> for an image by maximizing the joint distribution over the graph $G$ of Object, Action, and Scene, which factorizes over node and edge potentials as

$$P(O, A, S \mid \text{image}) \propto \psi_O(O)\,\psi_A(A)\,\psi_S(S)\,\psi(O, A)\,\psi(A, S)\,\psi(O, S).$$

Inference uses a greedy method that maximizes this joint distribution based on the learned unary (node) and binary (edge) potentials.
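To make the inference step concrete, here is a minimal sketch of the kind of greedy maximization involved, on a toy <object, action, scene> model. All label sets and potential values below are hypothetical; in the paper the node potentials come from detector/classifier scores and the edge potentials from co-occurrence frequencies.

```python
# Toy label sets (the paper uses 23 objects, 16 actions, and 29 scenes)
objects = ["dog", "bike", "person"]
actions = ["run", "ride", "sit"]
scenes = ["park", "street", "room"]

# Hypothetical node potentials (stand-ins for detector/classifier scores)
phi_obj = {"dog": 1.2, "bike": 0.4, "person": 0.9}
phi_act = {"run": 0.8, "ride": 0.3, "sit": 0.5}
phi_scene = {"park": 1.0, "street": 0.6, "room": 0.2}

# Hypothetical edge potentials (stand-ins for co-occurrence frequencies)
compatible = {("dog", "run"), ("run", "park"), ("dog", "park"),
              ("person", "ride"), ("ride", "street")}

def psi(a, b):
    return 0.7 if (a, b) in compatible else 0.1

def score(o, a, s):
    # Sum of node potentials plus edge potentials over the three edges of the MRF
    return (phi_obj[o] + phi_act[a] + phi_scene[s]
            + psi(o, a) + psi(a, s) + psi(o, s))

# Greedy coordinate ascent: repeatedly update one node given the other two
o, a, s = objects[0], actions[0], scenes[0]
for _ in range(5):  # a few sweeps suffice on this toy problem
    o = max(objects, key=lambda x: score(x, a, s))
    a = max(actions, key=lambda x: score(o, x, s))
    s = max(scenes, key=lambda x: score(o, a, x))

print((o, a, s))  # -> ('dog', 'run', 'park') for these toy numbers
```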
The projection of textual sentences into the latent meaning space is done using dependency parsing.
2. From Captions to Visual Concepts and Back

In 2015, 'From Captions to Visual Concepts and Back' presented a multi-modal approach to the captioning problem using multiple instance learning. It was state of the art on the Microsoft COCO benchmark at the time.
The main concept revolves around three primary steps:

Detecting words in the images, given a fixed vocabulary of words $\mathcal{V}$ taken from the training captions, using multiple instance learning.
For each word $w \in \mathcal{V}$, a set of positive and negative bags of bounding boxes is given to the model, where a bag $b_i$ (the set of regions of image $i$) is positive if the word $w$ appears in the caption for that image.
The final probability that a bag $b_i$ contains the word $w$ is given by the noisy-OR combination

$$P(w \mid b_i) = 1 - \prod_{j \in b_i} \left(1 - p_{ij}^{w}\right),$$

where $p_{ij}^{w}$ is the probability that image region $j$ of image $i$ contains the word $w$. This probability is estimated using a CNN with a sigmoid loss function.
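For intuition, here is a minimal numeric sketch of the noisy-OR combination above. The per-region probabilities are made up; in the paper they come from a CNN with a sigmoid output.

```python
import numpy as np

# Hypothetical probabilities p_ij that the regions of one image contain the word "dog"
region_probs = np.array([0.05, 0.10, 0.85, 0.20])

# Noisy-OR: the image (bag) contains the word if at least one region does
bag_prob = 1.0 - np.prod(1.0 - region_probs)
print(round(bag_prob, 3))  # 0.897: a single confident region dominates the bag probability
```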
Once the likely words are known for every image, the next step is to generate the most probable sentence, for which a maximum entropy language model is used in a generative framework.
Show and Tell: A Neural Image Caption Generator

'Show and Tell: A Neural Image Caption Generator' proved to be path-breaking in the field of image captioning. Inspired by the success of sequence-to-sequence learning in machine translation, the authors used an encoder-decoder framework to create a generative learning scenario.

Encoders & decoders in sequence-to-sequence learning
- The encoder network processes each word in the input sequence and compiles the information into a context vector $C$.
- The context vector $C$ is passed to the decoder network.
- The decoder is a generative network that maintains a hidden state, passed from one time step to the next, and generates the translated word (in a translation scenario) given the previously generated words.
Ilya Sutskever et al.'s 'Sequence to Sequence Learning with Neural Networks' created an end-to-end training setup in an encoder-decoder framework to solve machine translation from a sequence generation perspective. The central part of the research involved training a deep LSTM on many sentence pairs by maximizing the log probability of a correct translation $T$ given the source sentence $S$. The objective is defined as:

$$\frac{1}{|\mathcal{S}|} \sum_{(T, S) \in \mathcal{S}} \log p(T \mid S)$$

where $\mathcal{S}$ is the training set.
In 'Show and Tell: A Neural Image Caption Generator', the authors encode the image with a deep convolutional neural network and pass the encoded representation to a decoder network, a recurrent neural model, which generates the sequence just as above. The primary objective is to maximize the likelihood of the target sentence given the input image:

$$\theta^{*} = \arg\max_{\theta} \sum_{(I, S)} \log p(S \mid I; \theta)$$

where $\theta$ are the parameters of the model, $I$ is an image, and $S$ is its correct transcription. In more detail, if $I$ is the image and $S = (S_0, \dots, S_N)$ is the true sentence describing it, the primary equations are:

$$x_{-1} = \mathrm{CNN}(I)$$
$$x_t = W_e S_t, \quad t \in \{0, \dots, N-1\}$$
$$p_{t+1} = \mathrm{LSTM}(x_t), \quad t \in \{0, \dots, N-1\}$$

where $W_e$ denotes the word embedding matrix. Each word $S_t$ is represented as a one-hot vector of dimension equal to the size of the dictionary, and the final loss is given by:

$$L(I, S) = -\sum_{t=1}^{N} \log p_t(S_t)$$
To summarize, the convolutional neural network (CNN) takes an input image and generates a feature representation, which is then fed to the decoder LSTM to generate the output sentence. This gave state-of-the-art results at the time, but the approach had some shortcomings.
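To make the loss above concrete, here is a small numeric sketch of the per-timestep negative log-likelihood. The decoder output probabilities and target word indices are made up.

```python
import numpy as np

# Hypothetical decoder distributions p_t over a 5-word vocabulary for a 3-word sentence
p = np.array([
    [0.70, 0.10, 0.10, 0.05, 0.05],  # p_1
    [0.05, 0.80, 0.05, 0.05, 0.05],  # p_2
    [0.10, 0.10, 0.10, 0.10, 0.60],  # p_3
])
targets = np.array([0, 1, 4])  # indices of the true words S_1, S_2, S_3

# L(I, S) = -sum_t log p_t(S_t)
loss = -np.sum(np.log(p[np.arange(len(targets)), targets]))
print(round(loss, 3))  # 1.091
```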

Problems: One significant limitation of a standard seq2seq model is that the decoder is conditioned on a single fixed-length context vector. In the captioning case, the entire input image is compressed into one latent representation that must carry the dense information of the whole image.
However, while generating the word at time step $t$, the entire representation is probably not needed. Instead, a sub-representation of the image can be much more relevant for the generation process, i.e., a representation of a portion of the input image is more useful than the representation of the entire image when generating that word.
This means that for generating the word boy in the sequence A boy playing the guitar, the LSTM model should focus more on the image region where the boy is present rather than on the entire image.
'Neural Machine Translation by Jointly Learning to Align and Translate' and 'Effective Approaches to Attention-based Neural Machine Translation' introduced the attention mechanism in a sequence-to-sequence learning setting. This opened up a new path: attending to relevant portions of the input sequence/image and providing the generative model with a more informative, time-step-specific encoder representation.
Show, Attend and Tell: Attention Mechanism and Image Captioning

Kelvin Xu et al.'s paper 'Show, Attend and Tell: Neural Image Caption Generation with Visual Attention' proposes an attention-based encoder-decoder framework which gives importance to relevant portions of the input image in the encoder network for generating each word in the decoder network.

Given an input image, the goal of the image captioning task is to generate a caption $y$ encoded as a sequence of one-hot encoded words,

$$y = \{y_1, \dots, y_C\}, \quad y_i \in \mathbb{R}^{K}$$

where $K$ is the size of the vocabulary and $C$ is the maximum sequence length.
Encoder Block: Convolution Feature Maps
To efficiently encode the input image and represent it in a latent space, features from a convolutional neural network are used.
Unlike previous works, which mostly use the flattened fully connected representation of the CNN, the researchers here use features from the lower convolutional layers to retain the correspondence between the features and 2D image locations.
This also allows the decoder network to selectively focus on certain parts of an image by selecting a subset of all the feature vectors.
The output from the lower layers of the CNN has the form $W \times H \times D$, where $W \times H$ is the spatial size of the feature maps and $D$ is the number of convolutional filters (for InceptionV3 in the code below, $8 \times 8 \times 2048$).
We then reshape this feature tensor to $(W \cdot H) \times D$ and, for simplicity, write $L = W \cdot H$, so the extracted feature map has dimension $L \times D$.
This can be thought of as the convolutional feature extractor producing $L$ annotation vectors, each of dimension $D$ and each corresponding to a portion of the input image:

$$a = \{a_1, \dots, a_L\}, \quad a_i \in \mathbb{R}^{D}$$

where $a$ is the final output of the encoder network.
Code for the Encoder
```python
import tensorflow as tf

class CNN_Encoder(tf.keras.Model):
    # This encoder extracts InceptionV3 convolutional features and
    # passes them through a fully connected layer
    def __init__(self, embedding_dim):
        super(CNN_Encoder, self).__init__()
        # InceptionV3 model without the classification head
        self.image_model = tf.keras.applications.InceptionV3(include_top=False,
                                                             weights='imagenet')
        self.image_features_extract_model = tf.keras.Model(self.image_model.input,
                                                           self.image_model.layers[-1].output)
        # shape after fc == (batch_size, 64, embedding_dim)
        self.fc = tf.keras.layers.Dense(embedding_dim)

    def call(self, x):
        x = self.image_features_extract_model(x)        # shape: batch_sz * 8 * 8 * 2048
        x = tf.reshape(x, (x.shape[0], -1, x.shape[3]))  # shape: batch_sz * 64 * 2048
        x = self.fc(x)                                   # shape: batch_sz * 64 * embedding_dim
        x = tf.nn.relu(x)                                # shape: batch_sz * 64 * embedding_dim
        return x
```
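As a quick shape check, continuing from the block above (a sketch; InceptionV3 expects 299×299 RGB inputs):

```python
encoder = CNN_Encoder(embedding_dim=256)

# Dummy batch of 4 images; in practice these would be preprocessed with
# tf.keras.applications.inception_v3.preprocess_input
images = tf.random.uniform((4, 299, 299, 3))
features = encoder(images)
print(features.shape)  # (4, 64, 256): 64 spatial locations, each projected to embedding_dim
```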
Decoder Block: LSTM/GRU as a Generator Network
The decoder LSTM network is used as a generative network that produces one word at a time, conditioned on the previous hidden state, the previously generated words, and the attention-weighted encoder features:

$$\begin{pmatrix} i_t \\ f_t \\ o_t \\ g_t \end{pmatrix} = \begin{pmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{pmatrix} T_{D+m+n,\,n} \begin{pmatrix} E y_{t-1} \\ h_{t-1} \\ \hat{z}_t \end{pmatrix}$$

$$c_t = f_t \odot c_{t-1} + i_t \odot g_t$$
$$h_t = o_t \odot \tanh(c_t)$$

where $i_t$, $f_t$, $c_t$, $o_t$, and $h_t$ represent the input gate, forget gate, memory cell, output gate, and hidden state of the LSTM, respectively.
$\hat{z}_t \in \mathbb{R}^{D}$ represents the attention-weighted contextual encoder vector.
$E \in \mathbb{R}^{m \times K}$ is the embedding matrix, which projects the one-hot word vectors from $\mathbb{R}^{K}$ to $\mathbb{R}^{m}$.
$m$ and $n$ denote the embedding and LSTM dimensionality, respectively.
$\sigma$ and $\odot$ denote the logistic sigmoid activation and element-wise multiplication, respectively.
In the decoding stage, the one-hot representation of the input word, a $K$-dimensional vector, is projected by the embedding matrix $E$ into an $m$-dimensional space. The resulting vector in $\mathbb{R}^{m}$ is concatenated with the attention-weighted encoder vector $\hat{z}_t$ and the previous LSTM hidden state $h_{t-1}$ and fed as input to the LSTM network to generate the next word from the vocabulary of $K$ words. The matrix $T_{D+m+n,\,n}$ projects the $(D+m+n)$-dimensional input to $\mathbb{R}^{n}$, followed by the standard sigmoid and $\tanh$ activations of the LSTM.
The conditional probability of generating each word is then given by:

$$p(y_t \mid a, y_1, \dots, y_{t-1}) \propto \exp\left(L_o\left(E y_{t-1} + L_h h_t + L_z \hat{z}_t\right)\right)$$
Code for the Decoder
```python
class RNN_Decoder(tf.keras.Model):
    def __init__(self, embedding_dim, units, vocab_size):
        super(RNN_Decoder, self).__init__()
        self.units = units
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(self.units,
                                       return_sequences=True,
                                       return_state=True,
                                       recurrent_initializer='glorot_uniform')
        self.fc1 = tf.keras.layers.Dense(self.units)
        self.fc2 = tf.keras.layers.Dense(vocab_size)
        self.attention = BahdanauAttention(self.units)

    def call(self, x, features, hidden):
        # defining attention as a separate model
        context_vector, attention_weights = self.attention(features, hidden)
        # x shape after passing through embedding == (batch_size, 1, embedding_dim)
        x = self.embedding(x)
        # x shape after concatenation == (batch_size, 1, embedding_dim + hidden_size)
        x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)
        # passing the concatenated vector to the GRU
        output, state = self.gru(x)
        # shape == (batch_size, max_length, hidden_size)
        x = self.fc1(output)
        # x shape == (batch_size * max_length, hidden_size)
        x = tf.reshape(x, (-1, x.shape[2]))
        # output shape == (batch_size * max_length, vocab)
        x = self.fc2(x)
        return x, state, attention_weights

    def reset_state(self, batch_size):
        return tf.zeros((batch_size, self.units))
```
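A single decoding step might look like the sketch below (assuming the BahdanauAttention class defined in the next section; the start-token id and dimensions are placeholders):

```python
decoder = RNN_Decoder(embedding_dim=256, units=512, vocab_size=5000)

# Dummy encoder output: 4 images, 64 spatial locations, embedding_dim features
features = tf.random.uniform((4, 64, 256))
hidden = decoder.reset_state(batch_size=4)           # (4, 512)
start_token_id = 1                                   # hypothetical <start> id from the tokenizer
dec_input = tf.expand_dims([start_token_id] * 4, 1)  # (4, 1)

predictions, hidden, attention_weights = decoder(dec_input, features, hidden)
print(predictions.shape)        # (4, 5000): logits over the vocabulary
print(attention_weights.shape)  # (4, 64, 1): one weight per spatial location
```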
Soft Attention: Neural Machine Translation by Jointly Learning to Align and Translate
Bahdanau's paper on the attention mechanism gave a new direction to sequence-to-sequence learning by introducing the concept of an attention-weighted encoder representation.
Previously, the common approach was to give every encoder position equal importance, which, as discussed above, is not ideal when generating a particular word. Bahdanau's approach solves this problem by learning how much weight to give each encoder position.
As shown above, the conditional probability of generating the word $y_t$ is given by

$$p(y_t \mid a, y_1, \dots, y_{t-1}) \propto \exp\left(L_o\left(E y_{t-1} + L_h h_t + L_z \hat{z}_t\right)\right)$$

The most interesting aspect is that the probability for each word is conditioned on a distinct context vector $\hat{z}_t$ rather than a single generic representation shared by all the words.
The attention scores for generating a word are obtained by learning an additive association between each encoder vector $a_i$, where $i \in \{1, \dots, L\}$, and the previous hidden state $h_{t-1}$. Since the two vectors have different dimensions, they are first projected by their respective matrices into a common space $\mathbb{R}^{p'}$: the matrix $W_a$ projects $a_i$ to $\mathbb{R}^{p'}$ and $U_a$ projects $h_{t-1}$ to $\mathbb{R}^{p'}$ (these correspond to W1 and W2 in the code below),

$$\tilde{a}_i = W_a a_i, \qquad \tilde{h}_{t-1} = U_a h_{t-1},$$

where $\tilde{a}_i$ and $\tilde{h}_{t-1}$ represent the projected encoder and LSTM representations, respectively.
The score is

$$e_{ti} = v^{\top} \tanh\left(W_a a_i + U_a h_{t-1}\right)$$

and the scores are scaled with a softmax so that they sum to 1, giving the relative importance of each encoder representation $a_i$:

$$\alpha_{ti} = \frac{\exp(e_{ti})}{\sum_{k=1}^{L} \exp(e_{tk})}$$
So the scores differ from word to word, which is achieved by giving importance to the relevant portion of the encoder representations, i.e., weighting a particular image region for each generated word.
Finally, the context vector $\hat{z}_t$ used to generate the word $y_t$ is given by the attention-weighted combination of the encoder vectors:

$$\hat{z}_t = \sum_{i=1}^{L} \alpha_{ti}\, a_i$$
Code for Attention Mechanism
```python
class BahdanauAttention(tf.keras.Model):
    def __init__(self, units):
        super(BahdanauAttention, self).__init__()
        self.W1 = tf.keras.layers.Dense(units)
        self.W2 = tf.keras.layers.Dense(units)
        self.V = tf.keras.layers.Dense(1)

    def call(self, features, hidden):
        # features (CNN_Encoder output) shape == (batch_size, 64, embedding_dim)
        # hidden shape == (batch_size, hidden_size)
        # hidden_with_time_axis shape == (batch_size, 1, hidden_size)
        hidden_with_time_axis = tf.expand_dims(hidden, 1)
        # score shape == (batch_size, 64, units)
        score = tf.nn.tanh(self.W1(features) + self.W2(hidden_with_time_axis))
        # attention_weights shape == (batch_size, 64, 1)
        # you get 1 at the last axis because you are applying score to self.V
        attention_weights = tf.nn.softmax(self.V(score), axis=1)
        # context_vector shape after sum == (batch_size, embedding_dim)
        context_vector = attention_weights * features
        context_vector = tf.reduce_sum(context_vector, axis=1)
        return context_vector, attention_weights
```
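To show how the encoder, decoder, and attention module fit together, below is a rough sketch of a training step with teacher forcing. The optimizer choice, the padding-token id of 0, and the `<start>` token id of 1 are assumptions; this mirrors the typical TensorFlow image-captioning tutorial setup rather than one fixed recipe.

```python
encoder = CNN_Encoder(embedding_dim=256)
decoder = RNN_Decoder(embedding_dim=256, units=512, vocab_size=5000)
optimizer = tf.keras.optimizers.Adam()
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True,
                                                            reduction='none')

def loss_function(real, pred):
    # Ignore padding positions (assumed token id 0) when averaging the loss
    mask = tf.cast(tf.not_equal(real, 0), pred.dtype)
    return tf.reduce_mean(loss_object(real, pred) * mask)

@tf.function
def train_step(img_tensor, target, start_token_id=1):
    # img_tensor: (batch, 299, 299, 3) preprocessed images
    # target:     (batch, max_length) integer-encoded captions
    loss = 0.0
    batch_size = target.shape[0]
    hidden = decoder.reset_state(batch_size=batch_size)
    dec_input = tf.expand_dims([start_token_id] * batch_size, 1)

    with tf.GradientTape() as tape:
        # In practice the InceptionV3 features are often precomputed and cached
        features = encoder(img_tensor)
        # Teacher forcing: feed the ground-truth word as the next decoder input
        for t in range(1, target.shape[1]):
            predictions, hidden, _ = decoder(dec_input, features, hidden)
            loss += loss_function(target[:, t], predictions)
            dec_input = tf.expand_dims(target[:, t], 1)

    variables = encoder.trainable_variables + decoder.trainable_variables
    gradients = tape.gradient(loss, variables)
    optimizer.apply_gradients(zip(gradients, variables))
    return loss / int(target.shape[1])
```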
Model Prediction and Attention Visualization
In the images below, one can see how the model's gaze shifts across the image when generating different words in the caption.
[Run set panel (10 runs): attention visualizations]
We encourage you to experiment with image captioning on your own.
You can find the full code here.
The "gaze" of the network in the "Model Prediction and Attention Visualization" section is not very intuitive. When it generates the word "surfboard" the corresponding image looks to have attention everywhere except for the subject.
I checked the code for the run and I seem to be correct in my assumption that the white sections correspond to higher attention.
Would this imply that the attention could be further improved? Or is it that attention doesn't work as intuitively as I think?