
Generate Meaningful Captions for Images With Attention Models

In this article, we follow a practical tutorial and give a brief overview of the milestones in the field of image captioning with attention models.


Image captioning is the task of generating a description for a given image. Caption generation involves two tasks.
  1. Understanding the content of the image.
  2. Turning this understanding into a meaningful sentence describing the image.
Hence, it requires techniques from both computer vision and natural language processing. Image captioning has many use cases that include generating captions for Google image search and live video surveillance as well as helping visually impaired people to get information about their surroundings. Watch this wonderful video by Microsoft here.
Let us dig deeper into the different techniques to perform image captioning.

History and Evolution of Image Captioning

1. Every Picture Tells a Story: Generating Sentences from Images







Ali Farhadi et al.'s 'Every Picture Tells a Story: Generating Sentences from Images' projects the image and text into a triplet representation space of <object, action, scene>. They call this space 'the meaning space'. Once the images and texts are projected into the meaning space, the results are compared. Similar meanings result in a high score.
The problem is formulated as a Markov Random Field (MRF). Each node represents an object, an action, or a scene. The node potentials are computed as a linear combination of scores from several detectors and classifiers, and the edge potentials are estimated from co-occurrence frequencies. Learning amounts to setting the weights on the node and edge potentials; given the potentials, the goal is to find the best-scoring triplet.
A Markov Random Field (MRF) is a probability distribution P(x) over variables x_1, x_2, \ldots, x_n defined on an undirected graph G, where x_1, x_2, \ldots, x_n are the nodes of G. The joint probability distribution is given by
P(x_1, x_2, \ldots, x_n) = \frac{1}{Z} \prod_{c \in C} \phi_c(x_c)
Z = \sum_{x_1, x_2, \ldots, x_n} \prod_{c \in C} \phi_c(x_c)
where C denotes the set of cliques of the graph G, each factor \phi_c is a non-negative function over the variables in the clique c, and Z is the normalizing constant.
In the represented MRF, there is a node for
  • Objects which can take a value from a possible set of 23 nouns
  • Actions with 16 different values
  • Scenes with 29 different values
Now, the mapping from images to the meaning space reduces to predicting a triplet for each image by maximizing the joint distribution over the graph G of Object, Action, and Scene, i.e., \max P(O, A, S), which factorizes as
P(O, A, S) = \phi(O, A) \cdot \phi(A, S) \cdot \phi(O, S)
Inference uses a greedy method that maximizes this joint distribution over the unary and binary potentials learned during training. The projection of textual sentences into the latent meaning space is done using dependency parsing.
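To make the factorized inference concrete, below is a minimal sketch (not the authors' code) of scoring and searching triplets in the meaning space. The pairwise potential tables phi_oa, phi_as, and phi_os are hypothetical placeholders for the learned potentials; with only 23 x 16 x 29 possible triplets, even exhaustive search over the meaning space is cheap.

import numpy as np

# Hypothetical pairwise potentials (non-negative scores); in the paper these
# come from detector/classifier outputs and co-occurrence frequencies.
n_objects, n_actions, n_scenes = 23, 16, 29
rng = np.random.default_rng(0)
phi_oa = rng.random((n_objects, n_actions))   # phi(O, A)
phi_as = rng.random((n_actions, n_scenes))    # phi(A, S)
phi_os = rng.random((n_objects, n_scenes))    # phi(O, S)

# Unnormalized P(O, A, S) = phi(O, A) * phi(A, S) * phi(O, S) for every triplet
scores = phi_oa[:, :, None] * phi_as[None, :, :] * phi_os[:, None, :]

# Pick the best-scoring <object, action, scene> triplet
best_o, best_a, best_s = np.unravel_index(scores.argmax(), scores.shape)
print(best_o, best_a, best_s, scores[best_o, best_a, best_s])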

2. From Captions to Visual Concepts and Back




In 2015, 'From Captions to Visual Concepts and Back' presented a multi-modal approach to the captioning problem based on multiple instance learning. It was the state of the art on the Microsoft COCO benchmark at the time.
The main approach revolves around the following primary steps:


Detecting words in the image: a fixed vocabulary V of words is built from the training captions, and word detectors are trained with multiple instance learning.
For each word w ∈ V, the model is given positive and negative bags of bounding boxes, where a bag b_i (the set of candidate regions of image i) is positive if the word w appears in a caption for that image.
The probability that a bag b_i contains the word w is then given by
1 - \prod_{j \in b_i} (1 - p_{ij}^w)
where p_{ij}^w is the probability that image region j in image i contains the word w. These region probabilities are estimated with a CNN trained with a sigmoid loss.
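A minimal sketch of this noisy-OR combination in TensorFlow, assuming region_probs holds the per-region probabilities p_{ij}^w produced by the CNN for a single image and a single word (the names here are illustrative, not from the paper):

import tensorflow as tf

def bag_probability(region_probs):
    # Noisy-OR over a bag of image regions:
    # the bag contains the word if at least one region does.
    # region_probs: tensor of shape (num_regions,) with p_ij^w for one image/word.
    return 1.0 - tf.reduce_prod(1.0 - region_probs)

# Example: three candidate regions with weak evidence for the word.
p = tf.constant([0.1, 0.3, 0.05])
print(bag_probability(p).numpy())  # ~0.40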
Once the word distribution is known for every image, the next step is generating the most likely sentence. A maximum entropy language model is used in a generative framework, trained by maximizing
L = \sum_{s=1}^{S} \sum_{l=1}^{\#(s)} \log \Pr(w_l^s \mid w_{l-1}^s, \ldots, w_1^s, V_{l-1}^s)

Show and Tell: A Neural Image Caption Generator




'Show and Tell: A Neural Image Caption Generator' proved to be path-breaking in the field of image captioning. Inspired by the success of sequence-to-sequence learning in machine translation, the authors used an encoder-decoder framework to create a generative learning scenario.


Encoders & decoders in sequence-to-sequence learning

  • The encoder network E processes each word w in the input sequence S_i, and the compiled information is put into a context vector C
  • The context vector C is passed to the decoder network D
  • The decoder D is a generative network that maintains a hidden state, passes it from one time step to the next, and generates the translated word w_t (in a translation scenario) given the previously generated words.
Ilya Sutskever et al.'s research 'Sequence to Sequence Learning with Neural Networks' created an end-to-end training setup in an encoder-decoder framework to solve the machine translation problem from a sequence-generation perspective. The central part of the research involved training a deep LSTM on many sentence pairs by maximizing the log probability of a correct translation T given the source sentence S. The objective and decoding rule are:
\sum_{(T,S) \in \mathcal{S}} \log p(T \mid S)
T' = \arg\max_{T} p(T \mid S)
In 'Show and Tell: A Neural Image Caption Generator', the authors encode the image using a deep convolutional neural network and pass the encoding to the decoder network D, primarily a recurrent neural model, which generates the sequence in the same way as above. The primary objective is to maximize the likelihood of the target sentence given the input image:
\theta^{*} = \arg\max_{\theta} \sum_{(I,S)} \log p(S \mid I; \theta)
where \theta are the parameters of the model, I is an image, and S its correct transcription. In more detail, if I represents the image and S = (S_0, S_1, \ldots, S_N) is the true sentence describing it, then the primary equations are:
x_{-1} = \mathrm{CNN}(I)
x_t = W_e S_t, \quad t \in \{0, 1, \ldots, N-1\}, where W_e is the word-embedding matrix
p_{t+1} = \mathrm{LSTM}(x_t), \quad t \in \{0, 1, \ldots, N-1\}
Here each word is represented as a one-hot vector S_t of dimension equal to the size of the dictionary, and the final loss is given by: L(I, S) = -\sum_{t=1}^{N} \log p_t(S_t)
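Below is a minimal TensorFlow sketch of this training objective, assuming precomputed CNN image features and integer caption tokens; the layer sizes and names here are illustrative, not the paper's exact configuration.

import tensorflow as tf

vocab_size, embed_dim, lstm_units = 5000, 256, 512

# Illustrative components: a projection of CNN features into the embedding space (x_{-1}),
# the shared word embedding W_e, an LSTM, and a linear map to vocabulary logits.
image_fc = tf.keras.layers.Dense(embed_dim)
word_emb = tf.keras.layers.Embedding(vocab_size, embed_dim)
lstm = tf.keras.layers.LSTM(lstm_units, return_sequences=True)
to_vocab = tf.keras.layers.Dense(vocab_size)

def caption_logits(cnn_features, caption_tokens):
    # cnn_features: (batch, feat_dim); caption_tokens: (batch, N) ids S_0 ... S_{N-1}
    x_img = image_fc(cnn_features)[:, None, :]    # x_{-1}: the image is fed once, at the first step
    x_words = word_emb(caption_tokens)            # x_t = W_e S_t
    inputs = tf.concat([x_img, x_words], axis=1)  # image first, then the words
    hidden = lstm(inputs)
    return to_vocab(hidden[:, 1:, :])             # predictions p_1 ... p_N

def caption_loss(cnn_features, caption_tokens, targets):
    # targets: (batch, N) ids S_1 ... S_N; padding masks omitted for brevity
    logits = caption_logits(cnn_features, caption_tokens)
    ce = tf.keras.losses.sparse_categorical_crossentropy(targets, logits, from_logits=True)
    return tf.reduce_sum(ce, axis=-1)             # L = -sum_t log p_t(S_t) per example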



To summarize, the convolutional neural network (CNN) takes an input image and generates a feature representation, which is then fed to the decoder LSTM to generate the output sentence. This gave state-of-the-art results at the time, but there were particular concerns with the approach.


Problems: One of the significant problems with a standard seq2seq model is its inability to accurately handle long input sequences, since only a single context vector from the encoder is passed to the decoder. Here, the entire input image I is projected into one latent representation CNN(I) that densely encodes the whole image.
However, while generating the word w_t at the t-th time step, the entire CNN(I) representation is probably not needed. Instead, a sub-representation of the image can be far more relevant for the generation process, i.e., a representation F' \in R^d of a portion I' of the input image I is more relevant than the representation F \in R^d of the entire input image I for generating the word w_t.
This means that for generating the word boy in the sequence A boy playing the guitar, the LSTM model should focus more on the image area I' where the boy is present rather than on the entire image I.
'Neural Machine Translation by Jointly Learning to Align and Translate' and 'Effective Approaches to Attention-based Neural Machine Translation' introduced the attention mechanism in the sequence-to-sequence setting. This concept opened up a new path of attending to relevant portions of the input sequence/image and generating more informative and relevant encoder representations for the generative model at each time step.

Show, Attend and Tell: Attention Mechanism and Image Captioning




Kelvin Xu et al.'s paper 'Show, Attend and Tell: Neural Image Caption Generation with Visual Attention' proposes an attention-based encoder-decoder framework which gives importance to relevant portions of the input image in the encoder network for generating each word in the decoder network.



The goal of the image captioning task is, given an input image I, to generate a caption y encoded as a sequence of 1-of-K words,
y = \{y_1, y_2, \ldots, y_C\}, \quad y_i \in R^K
where K is the size of the vocabulary and C is the maximum sequence length.

Encoder Block: Convolution Feature Maps

To efficiently encode the input image and represent it in a latent space, features from a convolutional neural network are used.
Unlike previous work, which mostly used the flattened fully connected representation of the CNN, the researchers here use features from the lower convolutional layers to retain the correspondence between the features and the 2D image.
This also allows the decoder network to selectively focus on certain parts of an image by selecting a subset of all the feature vectors.
The output from the lower layers of the CNN has the form k \times k \times D, where k \times k is the spatial size of the feature maps and D is the number of convolutional filters.
We reshape this feature tensor to k^2 \times D and, for simplicity, let k^2 = L, so the extracted feature map has dimension L \times D.
This can be thought of as the convolutional feature extractor producing L vectors, each of dimension D, with each vector representing a portion of the image I.
a = \{a_1, a_2, a_3, \ldots, a_L\}, \quad a_i \in R^D, where a is the final output from the encoder network

Code for the Encoder

import tensorflow as tf

class CNN_Encoder(tf.keras.Model):
    # Extracts convolutional features with InceptionV3 and passes them
    # through a fully connected layer to project them into the embedding space.
    def __init__(self, embedding_dim):
        super(CNN_Encoder, self).__init__()

        # InceptionV3 backbone without the classification head
        self.image_model = tf.keras.applications.InceptionV3(include_top=False, weights='imagenet')
        self.image_features_extract_model = tf.keras.Model(self.image_model.input, self.image_model.layers[-1].output)
        # shape after fc == (batch_size, 64, embedding_dim)
        self.fc = tf.keras.layers.Dense(embedding_dim)

    def call(self, x):
        x = self.image_features_extract_model(x)         # shape: (batch_size, 8, 8, 2048)
        x = tf.reshape(x, (x.shape[0], -1, x.shape[3]))   # shape: (batch_size, 64, 2048)
        x = self.fc(x)                                    # shape: (batch_size, 64, embedding_dim)
        x = tf.nn.relu(x)                                 # shape: (batch_size, 64, embedding_dim)
        return x

Decoder Block: LSTM/GRU as a Generator Network

The decoder LSTM/GRU network is used here as a generative network that produces one word at a time, conditioned on the previous hidden state, the previously generated words, and the attention-weighted encoder features.
\begin{pmatrix} i_t \\ f_t \\ o_t \\ g_t \end{pmatrix} = \begin{pmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{pmatrix} T_{D+m+n,\,n} \begin{pmatrix} E y_{t-1} \\ h_{t-1} \\ z'_{t-1} \end{pmatrix}
where i_t, f_t, and o_t are the input, forget, and output gates of the LSTM, and g_t is the candidate memory update;
z'_{t-1} \in R^D is the attention-weighted contextual encoder vector;
c_t = f_t \odot c_{t-1} + i_t \odot g_t
h_t = o_t \odot \tanh(c_t)
T_{s,t} : R^s \rightarrow R^t denotes a learned affine transformation;
E \in R^{m \times K} is the embedding matrix, which projects the one-hot word vectors from R^K to R^m;
m and n denote the embedding and LSTM dimensionality, respectively;
\sigma and \odot denote the logistic sigmoid activation and element-wise multiplication, respectively.
In the decoding stage, the one-hot representation of the input word, a K-dimensional vector, is projected by the embedding matrix into the m-dimensional space. The resulting vector E y_{t-1} \in R^m is concatenated with the attention-weighted encoder vector z'_{t-1} \in R^D and the previous LSTM hidden state h_{t-1} \in R^n, and fed as input to the LSTM network to generate the word w_t from the vocabulary of K words. The matrix T projects this (D+m+n)-dimensional vector to R^n, followed by the standard sigmoid and \tanh activations of the LSTM network.
The conditional probability of generating the t-th word is modified as shown below:
P(y_t \mid y_{t-1}, y_{t-2}, \ldots, y_1, x) \rightarrow P(y_t \mid E y_{t-1}, h_{t-1}, z'_{t-1})

Code for the Decoder

class RNN_Decoder(tf.keras.Model):

    def __init__(self, embedding_dim, units, vocab_size):
        super(RNN_Decoder, self).__init__()
        self.units = units

        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(self.units,
                                       return_sequences=True,
                                       return_state=True,
                                       recurrent_initializer='glorot_uniform')
        self.fc1 = tf.keras.layers.Dense(self.units)
        self.fc2 = tf.keras.layers.Dense(vocab_size)

        self.attention = BahdanauAttention(self.units)

    def call(self, x, features, hidden):
        # defining attention as a separate model
        context_vector, attention_weights = self.attention(features, hidden)

        # x shape after passing through embedding == (batch_size, 1, embedding_dim)
        x = self.embedding(x)

        # x shape after concatenation == (batch_size, 1, embedding_dim + hidden_size)
        x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)

        # passing the concatenated vector to the GRU
        output, state = self.gru(x)

        # shape == (batch_size, max_length, hidden_size)
        x = self.fc1(output)

        # x shape == (batch_size * max_length, hidden_size)
        x = tf.reshape(x, (-1, x.shape[2]))

        # output shape == (batch_size * max_length, vocab)
        x = self.fc2(x)

        return x, state, attention_weights

    def reset_state(self, batch_size):
        return tf.zeros((batch_size, self.units))

Soft Attention: Neural Machine Translation by Jointly Learning to Align and Translate

Bahdanau et al.'s paper on the attention mechanism gave a new direction to sequence-to-sequence learning by introducing the attention-weighted encoder representation.
The most common approach before this was to give equal importance to every position in the encoder, which, as discussed above, is not ideal when generating a particular word. Bahdanau's approach solves this problem by weighting the encoder representations by their relevance to the word being generated.
As shown above, the conditional probability of generating the t-th word is given by
P(y_t \mid E y_{t-1}, h_{t-1}, z'_{t-1})
The most exciting aspect to note is that the probability for each word is conditioned on a distinct encoder vector z'_{t-1}, not on a generic representation shared by all the words.
The attention score for generating a word w_t is computed by learning an additive association between the encoder vectors a_i \in R^D and the previous hidden state h_{t-1} \in R^n. Since the two vectors have different dimensions, they are first projected into a common space of dimension p': the matrix W_1 \in R^{D \times p'} projects a_i into R^{p'}, and W_2 \in R^{n \times p'} projects h_{t-1} into R^{p'}.
a' = a \times W_1 and h'_{t-1} = h_{t-1} \times W_2, where a' and h'_{t-1} represent the projected encoder and LSTM representations, respectively.
The score s_i = \tanh(h'_{t-1} + a'_i) for every i \in L (in the code below, a learned vector V further projects this to a scalar), and the scores are normalized with a softmax so that they sum to 1, giving the relative importance of each encoder representation a'_i:
s'_i = \dfrac{\exp(s_i)}{\sum_{j \in L} \exp(s_j)}
So the scores differ for each generated word, which is achieved by giving importance to the relevant portion of the encoder representations, i.e., weighting a particular image section for each word.
Finally, the context encoder vector z'_{t-1} used to generate the word w_t is given by the attention-weighted combination of the encoder vectors:
z'_{t-1} = \sum_{i \in L} a_i \times s'_i

Code for Attention Mechanism

class BahdanauAttention(tf.keras.Model):
    def __init__(self, units):
        super(BahdanauAttention, self).__init__()
        self.W1 = tf.keras.layers.Dense(units)
        self.W2 = tf.keras.layers.Dense(units)
        self.V = tf.keras.layers.Dense(1)

    def call(self, features, hidden):
        # features (CNN_Encoder output) shape == (batch_size, 64, embedding_dim)

        # hidden shape == (batch_size, hidden_size)
        # hidden_with_time_axis shape == (batch_size, 1, hidden_size)
        hidden_with_time_axis = tf.expand_dims(hidden, 1)

        # score shape == (batch_size, 64, units)
        score = tf.nn.tanh(self.W1(features) + self.W2(hidden_with_time_axis))

        # attention_weights shape == (batch_size, 64, 1)
        # the last axis is 1 because self.V projects the score to a scalar per region
        attention_weights = tf.nn.softmax(self.V(score), axis=1)

        # context_vector shape after sum == (batch_size, embedding_dim)
        context_vector = attention_weights * features
        context_vector = tf.reduce_sum(context_vector, axis=1)

        return context_vector, attention_weights


Model Prediction and Attention Visualization

In the images below, one can see how the model's gaze shifts across the image as it generates each word of the caption.

Run set (10 runs): generated captions with per-word attention visualizations
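If you want to reproduce these visualizations, here is a minimal sketch, assuming a result list of generated words and an attention_plot array of shape (len(result), 64) collected from the decoder's attention weights, as in the linked tutorial; the names are illustrative.

import numpy as np
import matplotlib.pyplot as plt
from PIL import Image

def plot_attention(image_path, result, attention_plot):
    # Overlay the 8x8 attention map for each generated word on the input image.
    temp_image = np.array(Image.open(image_path))

    fig = plt.figure(figsize=(10, 10))
    for i, word in enumerate(result):
        temp_att = np.resize(attention_plot[i], (8, 8))  # 64 weights -> 8x8 grid
        ax = fig.add_subplot(len(result) // 2 + 1, 2, i + 1)
        ax.set_title(word)
        img = ax.imshow(temp_image)
        # brighter regions indicate higher attention for this word
        ax.imshow(temp_att, cmap='gray', alpha=0.6, extent=img.get_extent())

    plt.tight_layout()
    plt.show()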


We encourage you to experiment with image captioning on your own.

You can find the full code here.


Luke Floden
Luke Floden •  
The "gaze" of the network in the "Model Prediction and Attention Visualization" section is not very intuitive. When it generates the word "surfboard" the corresponding image looks to have attention everywhere except for the subject. I checked the code for the run and I seem to be correct in my assumption that the white sections correspond to higher attention. Would this imply that the attention could be further improved? Or is it that attention doesn't work as intuitively as I think?
1 reply