## Introduction

Image captioning is the task of generating a description for a given image. Caption generation involves two tasks.

1. Understanding the content of the image.
2. Turning this understanding into a meaningful sentence describing the image.

Hence, it requires techniques from both computer vision and natural language processing. Image captioning has many use cases that include generating captions for Google image search and live video surveillance as well as helping visually impaired people to get information about their surroundings. Watch this wonderful video by Microsoft here.

### Full code →

Let us dig deeper into the different techniques to perform image captioning.

## History and Evolution of Image Captioning

### 1. Every Picture Tells a Story: Generating Sentences from Images

Ali Farhadi et al.'s 'Every Picture Tells a Story: Generating Sentences from Images' **projects the image and text into a triplet representation space of <object, action, scene>, which the authors call 'the meaning space'.** Once an image and a sentence are projected into the meaning space, the two projections are compared; similar meanings result in a high score.

The problem is formulated as a Markov Random Field (MRF) with one node each for the object, the action, and the scene. The node potentials are computed as a linear combination of scores from several detectors and classifiers, and the edge potentials are estimated from frequencies. Learning involves setting the weights on the node and edge potentials. Given the potentials, the goal is to find the best triplets.

A Markov Random Field (MRF) is a probability distribution $P(x)$ over variables $x_{1}, x_{2}, \ldots, x_{n}$ defined by an undirected graph $G$ whose nodes are the variables $x_{1}, x_{2}, \ldots, x_{n}$. The joint probability distribution is given by

$P(x_{1}, x_{2}, \ldots, x_{n}) = \frac{1}{Z} \prod_{c \in C} \phi_{c}(x_{c})$

$Z = \sum_{x_{1}, x_{2}, \ldots, x_{n}} \prod_{c \in C} \phi_{c}(x_{c})$

where $C$ denotes the set of cliques of $G$, each factor $\phi_{c}$ is a non-negative function over the variables $x_{c}$ in clique $c$, and $Z$ is the normalizing constant.

In this MRF, there is one node each for

• Objects, which can take a value from a set of 23 nouns
• Actions, which can take one of 16 different values
• Scenes, which can take one of 29 different values

The mapping from images to the meaning space is thus reduced to predicting a triplet for each image by maximizing the joint distribution over the graph $G$ of Object, Action, and Scene, i.e., $\max P(O, A, S)$, which factorizes (up to the normalizing constant $Z$) as

$P(O, A, S) \propto \phi(O, A) \cdot \phi(A, S) \cdot \phi(O, S)$

Inference uses a greedy method that maximizes the joint distribution based on the learned unary and binary potentials. Textual sentences are projected into the latent meaning space using dependency parsing.
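As a rough illustration of the pairwise factorization, the sketch below scores every object-action-scene triplet with three potential tables and returns the highest-scoring one. The label counts (23, 16, 29) come from the text; the random potential values, the helper names `random_table` and `best_triplet`, and the exhaustive search (used here instead of the paper's greedy inference because the space is small) are assumptions made for illustration only.

```python
import itertools
import random

N_OBJECTS, N_ACTIONS, N_SCENES = 23, 16, 29  # label counts from the text


def random_table(rows, cols, rng):
    """Hypothetical non-negative pairwise potential table."""
    return [[rng.random() for _ in range(cols)] for _ in range(rows)]


def best_triplet(phi_oa, phi_as, phi_os):
    """Return the (o, a, s) indices maximizing phi(O,A) * phi(A,S) * phi(O,S)."""
    def score(triplet):
        o, a, s = triplet
        return phi_oa[o][a] * phi_as[a][s] * phi_os[o][s]

    triplets = itertools.product(
        range(len(phi_oa)), range(len(phi_as)), range(len(phi_as[0]))
    )
    return max(triplets, key=score)


rng = random.Random(0)
phi_oa = random_table(N_OBJECTS, N_ACTIONS, rng)
phi_as = random_table(N_ACTIONS, N_SCENES, rng)
phi_os = random_table(N_OBJECTS, N_SCENES, rng)
o, a, s = best_triplet(phi_oa, phi_as, phi_os)
print(f"best triplet indices: object={o}, action={a}, scene={s}")
```

The normalizing constant is ignored because it does not change which triplet attains the maximum.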

### 2. From Captions to Visual Concepts and Back

In 2015, 'From Captions to Visual Concepts and Back' presented a multi-modal perspective of solving the captioning problem using multiple instance learning. It was state of the art at that time on the Microsoft COCO benchmark.

The approach revolves around three primary steps:

The first step detects words in the image, using a fixed vocabulary of words $V$ taken from the training set and multiple instance learning.

For each word $w \in V$, the model is given a set of positive and negative bags of bounding boxes, where a bag $b_{i}$ (the bounding boxes of image $i$) is positive if the word $w$ appears in the caption for that image and negative otherwise.

The final probability that a bag $b_{i}$ contains the word $w$ is given by

$1 - \prod_{j \in b_{i}} (1 - p_{ij}^{w})$,

where $p_{ij}^{w}$ is the probability that region $j$ of image $i$ corresponds to the word $w$. This probability is estimated using a CNN model with a sigmoid on its output.
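The bag-level formula is a noisy-OR: the bag is positive unless every region fails to show the word. A minimal sketch, in which the per-region probabilities are made-up values and `bag_probability` is a hypothetical helper name:

```python
import math


def bag_probability(region_probs):
    """Noisy-OR: probability that at least one region in the bag shows the word."""
    return 1.0 - math.prod(1.0 - p for p in region_probs)


# Hypothetical per-region probabilities for one image and one word.
region_probs = [0.1, 0.7, 0.05]
print(bag_probability(region_probs))
```

A single confident region is enough to push the bag probability close to 1, which is what lets the model learn from image-level captions without region-level labels.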

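Once the word probabilities are known for an image, the next step is to generate the most likely sentence. A maximum entropy language model in a generative framework is used for this, trained by maximizing the log-likelihood

$L = \sum_{s=1}^{S} \sum_{l=1}^{\#(s)} \log \Pr(w_{l}^{s} \mid w_{l-1}^{s}, \ldots, w_{1}^{s}, V_{l-1}^{s})$

where $s$ indexes the training sentences, $\#(s)$ is the length of sentence $s$, and $V_{l-1}^{s}$ is the set of detected words not yet used in the sentence.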

## Show and Tell: A Neural Image Caption Generator

'Show and Tell: A Neural Image Caption Generator' proved to be path-breaking in the field of image captioning. Inspired by the success of sequence-to-sequence learning in machine translation, the authors used an encoder-decoder framework to create a generative learning scenario.

Encoders & decoders in sequence-to-sequence learning

• The encoder network $E$ processes each word $w$ in the input sequence $S_{i}$ and compiles the information into a context vector $C$
• The context vector $C$ is passed to the decoder network $D$
• The decoder $D$ is a generative network that maintains a hidden state, which it passes from one time step to the next, and generates the next word $w_t$ (the translated word, in a translation scenario) given the previous words.

Ilya Sutskever et al.'s 'Sequence to Sequence Learning with Neural Networks' introduced end-to-end training in an encoder-decoder framework, treating machine translation as a sequence generation problem. The central part of the work involved training a deep LSTM on many sentence pairs by maximizing the log probability of a correct translation $T$ given the source sentence $S$. With $\mathcal{S}$ denoting the training set, the training objective is

$\sum_{(T,S) \in \mathcal{S}} \log p(T \mid S)$

and, once training is complete, a translation is produced by finding the most likely sequence under the model:

$T' = \arg\max_{T} p(T \mid S)$
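Searching over all possible output sequences is intractable, so the arg max is approximated in practice. The sketch below uses greedy decoding, the simplest such approximation (a simplification chosen here for brevity, not necessarily what the paper uses): at each step it picks the most probable next word and accumulates the log probability of the sequence. The toy `TABLE`, the `<s>` and `</s>` tokens, and the function name `greedy_decode` are hypothetical.

```python
import math


def greedy_decode(next_word_probs, start, end, max_len=10):
    """Pick the most probable next word at each step until `end` or max_len."""
    sentence, log_prob = [start], 0.0
    while sentence[-1] != end and len(sentence) < max_len:
        probs = next_word_probs(sentence)
        word = max(probs, key=probs.get)
        log_prob += math.log(probs[word])  # chain rule: sum of per-step log-probs
        sentence.append(word)
    return sentence, log_prob


# Hypothetical toy model: the distribution depends only on the previous word.
TABLE = {
    "<s>": {"a": 0.6, "the": 0.4},
    "a": {"boy": 0.7, "guitar": 0.3},
    "the": {"boy": 0.5, "guitar": 0.5},
    "boy": {"</s>": 0.9, "a": 0.1},
    "guitar": {"</s>": 1.0},
}
sentence, log_prob = greedy_decode(lambda prefix: TABLE[prefix[-1]], "<s>", "</s>")
print(sentence)  # ['<s>', 'a', 'boy', '</s>']
```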

In 'Show and Tell: A Neural Image Caption Generator', the authors encode the image using a deep convolutional neural network and pass the output of this encoder $E$ to the decoder network $D$, a recurrent neural model, which generates the word sequence as described above. The objective is to maximize the likelihood of the target sentence given the input image.

$\theta^{*} = \arg\max_{\theta} \sum_{(I,S)} \log p(S \mid I; \theta)$ is the objective function, where $\theta$ are the parameters of the model, $I$ is an image, and $S$ is its correct transcription. In more detail, if $I$ represents the image and $S = (S_{0}, S_{1}, \ldots, S_{N})$ is the true sentence describing the image, then the primary equations are:

$x_{-1} = CNN(I)$

$x_{t} = W_{e} S_{t}, \; t \in \{0, 1, \ldots, N-1\}$, where $W_{e}$ is the word embedding matrix

$p_{t+1} = LSTM(x_{t}), \; t \in \{0, 1, \ldots, N-1\}$

Here each word is represented as a one-hot vector $S_{t}$ of dimension equal to the size of the dictionary, and the final loss is given by $L = -\sum_{t=1}^{N} \log p_{t}(S_{t})$.
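Because each target word is one-hot, the loss reduces to summing the negative log-probability that the model assigns to the correct word at each step. A minimal sketch, assuming a toy 4-word vocabulary and made-up per-step distributions; `caption_nll` is a hypothetical helper name:

```python
import math


def caption_nll(step_probs, target_ids):
    """Sum of negative log-probabilities of the correct word at each step."""
    return -sum(math.log(p[t]) for p, t in zip(step_probs, target_ids))


# Hypothetical predicted distributions p_1, p_2, p_3 over a 4-word vocabulary.
step_probs = [
    [0.7, 0.1, 0.1, 0.1],
    [0.2, 0.6, 0.1, 0.1],
    [0.25, 0.25, 0.25, 0.25],
]
target_ids = [0, 1, 3]  # index of the correct word at each step
loss = caption_nll(step_probs, target_ids)
print(loss)
```

The uniform third distribution contributes the most to the loss, since the model is least certain about that word.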

To summarize, a CNN takes an input image and generates a feature representation, which is then fed to the decoder LSTM to generate the output sentence. The approach gave state-of-the-art results at the time, but it had some shortcomings.

Problems: One of the significant problems in a standard seq2seq model is its inability to accurately process long input sequences, since only the last hidden state of the encoder RNN is used as the context vector for the decoder. The image captioning model has an analogous bottleneck: the entire input image $I$ is projected into a single latent representation $CNN(I)$, which densely packs information about the whole image.

However, while generating the word $w_{t}$ at time step $t$, the entire $CNN(I)$ representation is probably not needed. Instead, a representation of part of the image can be much more relevant to the generation process, i.e., a representation $F' \in R^d$ of a portion $I' \subset I$ of the input image is more relevant than the representation $F \in R^d$ of the entire image $I$ for generating the word $w_{t}$.

This means that for generating the word *boy* in the sequence *A boy playing the guitar*, the LSTM model should focus more on the image area ($I'$) where the boy is present rather than on the entire image ($I$).

'Neural Machine Translation by Jointly Learning to Align and Translate' and 'Effective Approaches to Attention-based Neural Machine Translation' introduced the attention mechanism for sequence-to-sequence learning. This concept opened up a new path: attending to the relevant portions of the input sequence or image and producing a more informative, relevant encoder representation for the generative model at each time step.

## Show, Attend and Tell: Attention Mechanism and Image Captioning

Kelvin Xu et al.'s paper 'Show, Attend and Tell: Neural Image Caption Generation with Visual Attention' proposes an attention-based encoder-decoder framework which gives importance to relevant portions of the input image in the encoder network for generating each word in the decoder network.

Given an input image $I$, the goal of the image captioning task is to generate a caption $y$, encoded as a sequence of 1-of-$K$ encoded words,

$y = \{y_{1}, y_{2}, \ldots, y_{C}\}, \; y_{i} \in R^{K}$

where $K$ is the size of the vocabulary and $C$ is the maximum sequence length.

### Encoder Block: Convolution Feature Maps

To efficiently encode the input image and represent it in a latent space, features from a convolutional neural network are used.

Unlike earlier work, which mostly uses the flattened fully connected representation of the CNN, the researchers use features from a lower convolutional layer to retain the correspondence between the features and the 2D image.

This also allows the decoder network to selectively focus on certain parts of an image by selecting a subset of all the feature vectors.

The output from the lower layers of the CNN has the form $k \times k \times D$, where $k \times k$ is the size of the feature maps and $D$ is the number of convolutional filters.

We reshape the feature tensor to $k^2 \times D$. Writing $L = k^2$ for simplicity, the dimension of the extracted feature map is $L \times D$.

In other words, the convolutional feature extractor produces $L$ vectors, each of dimension $D$ and each representing a portion of the image $I$:

$a = \{a_{1}, a_{2}, \ldots, a_{L}\}$, where $a_{i} \in R^{D}$ and $a$ is the final output of the encoder network.
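The reshape can be illustrated with plain nested lists. This is only a sketch with toy sizes (the encoder in this article works with $k = 8$ and $D = 2048$), and `flatten_feature_map` is a hypothetical helper name. It shows that each of the $L$ output vectors still maps back to one spatial location of the grid.

```python
def flatten_feature_map(fmap):
    """Reshape a k x k x D nested list into L = k*k vectors of dimension D (row-major)."""
    return [cell for row in fmap for cell in row]


k, D = 3, 4  # toy sizes chosen for readability
# Each location (r, c) holds a D-dimensional vector tagged with its position.
fmap = [[[float(r * k + c)] * D for c in range(k)] for r in range(k)]
a = flatten_feature_map(fmap)
L = len(a)
print(L, len(a[0]))  # 9 4
```

With row-major flattening, vector `a[i]` corresponds to grid location `(i // k, i % k)`, which is what allows attention weights over `a` to be drawn back onto the image.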


#### Code for the Encoder

    import tensorflow as tf


    class CNN_Encoder(tf.keras.Model):
        # Extracts InceptionV3 feature maps from the input images and
        # passes them through a fully connected layer.
        def __init__(self, embedding_dim):
            super(CNN_Encoder, self).__init__()

            # InceptionV3 without its classification head, pre-trained on ImageNet
            self.image_model = tf.keras.applications.InceptionV3(include_top=False, weights='imagenet')
            self.image_features_extract_model = tf.keras.Model(self.image_model.input, self.image_model.layers[-1].output)

            # shape after fc == (batch_size, 64, embedding_dim)
            self.fc = tf.keras.layers.Dense(embedding_dim)

        def call(self, x):
            # shape: batch_sz * 8 * 8 * 2048 (for 299 x 299 input images)
            x = self.image_features_extract_model(x)
            # tf.shape keeps the reshape valid when the static batch size is unknown
            x = tf.reshape(x, (tf.shape(x)[0], -1, x.shape[3]))  # shape: batch_sz * 64 * 2048
            x = self.fc(x)     # shape: batch_sz * 64 * embedding_dim
            x = tf.nn.relu(x)  # same shape, after ReLU
            return x


### Decoder Block: LSTM/GRU as a Generator Network

The decoder LSTM network is used here as a generative network that generates one word at a time, conditioned on the previous hidden state, the previously generated words, and the encoder feature vector/tensor.

$\left( \begin{array}{c} i_{t} \\ f_{t} \\ o_{t} \\ g_{t} \end{array} \right) = \left( \begin{array}{c} \sigma \\ \sigma \\ \sigma \\ \tanh \end{array} \right) T_{D+m+n,\,n} \left( \begin{array}{c} E y_{t-1} \\ h_{t-1} \\ z'_{t-1} \end{array} \right)$

where $i_{t}$, $f_{t}$, $o_{t}$ and $g_{t}$ are the input gate, forget gate, output gate and candidate memory of the LSTM, respectively, and $c_{t}$ and $h_{t}$ are its memory and hidden state.

$z' \in R^{D}$ is the attention-weighted context vector computed from the encoder output.

$c_{t} = f_{t} \odot c_{t-1} + i_{t} \odot g_{t}$

$h_{t} = o_{t} \odot \tanh(c_{t})$

$T_{s,t} : R^{s} \to R^{t}$ is a learned affine transformation.

$E \in R^{m \times K}$ is the embedding matrix, which projects the one-hot word vectors from $R^{K}$ to $R^{m}$.

$m$ and $n$ denote the embedding and LSTM dimensionality, respectively.

$\sigma$ and $\odot$ denote the logistic sigmoid activation and element-wise multiplication, respectively.

In the decoding stage, the one-hot representation of the input word, a $K$-dimensional vector, is projected by the embedding matrix into the $m$-dimensional space. The resulting vector $Ey_{t-1} \in R^m$ is concatenated with the attention-weighted encoder vector $z' \in R^D$ and the LSTM hidden state $h_{t-1} \in R^n$, and the result is fed to the LSTM network to generate the word $w_{t}$ from the vocabulary of $K$ words. The matrix $T$ projects the $(D+m+n)$-dimensional vector to $R^n$, followed by the standard sigmoid and $\tanh$ activations of the LSTM network.
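One decoding step can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the paper's implementation: `T` is taken to have $4n$ rows (one block of $n$ rows per gate), bias terms are omitted, the weights are random, and the toy sizes and the function name `lstm_step` are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(T, Ey_prev, h_prev, z_prev, c_prev):
    """One LSTM step on the concatenation [Ey_{t-1}; h_{t-1}; z'_{t-1}]."""
    n = len(h_prev)
    x = Ey_prev + h_prev + z_prev  # list concatenation, length m + n + D
    pre = [sum(w * v for w, v in zip(row, x)) for row in T]  # length 4n
    i = [sigmoid(p) for p in pre[0:n]]            # input gate
    f = [sigmoid(p) for p in pre[n:2 * n]]        # forget gate
    o = [sigmoid(p) for p in pre[2 * n:3 * n]]    # output gate
    g = [math.tanh(p) for p in pre[3 * n:4 * n]]  # candidate memory
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c


D, m, n = 3, 2, 4  # toy sizes (assumptions)
rng = random.Random(0)
T = [[rng.uniform(-0.5, 0.5) for _ in range(D + m + n)] for _ in range(4 * n)]
Ey_prev = [rng.uniform(-1, 1) for _ in range(m)]
z_prev = [rng.uniform(-1, 1) for _ in range(D)]
h_prev, c_prev = [0.0] * n, [0.0] * n
h, c = lstm_step(T, Ey_prev, h_prev, z_prev, c_prev)
print(h)
```

The new hidden state `h` would then be fed to an output layer over the vocabulary and used as $h_{t-1}$ at the next step.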

The conditional probability of generating the $t^{th}$ word is modified as shown below:

$P(y_{t} \mid y_{t-1}, y_{t-2}, \ldots, y_{1}, x) \rightarrow P(y_{t} \mid Ey_{t-1}, h_{t-1}, z'_{t-1})$


#### Code for the Decoder

    import tensorflow as tf


    class RNN_Decoder(tf.keras.Model):
        # Relies on the BahdanauAttention class defined in the attention section.
        def __init__(self, embedding_dim, units, vocab_size):
            super(RNN_Decoder, self).__init__()
            self.units = units

            self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
            self.gru = tf.keras.layers.GRU(self.units,
                                           return_sequences=True,
                                           return_state=True,
                                           recurrent_initializer='glorot_uniform')
            self.fc1 = tf.keras.layers.Dense(self.units)
            self.fc2 = tf.keras.layers.Dense(vocab_size)

            self.attention = BahdanauAttention(self.units)

        def call(self, x, features, hidden):
            # defining attention as a separate model
            context_vector, attention_weights = self.attention(features, hidden)

            # x shape after passing through embedding == (batch_size, 1, embedding_dim)
            x = self.embedding(x)

            # x shape after concatenation == (batch_size, 1, embedding_dim + hidden_size)
            x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)

            # passing the concatenated vector to the GRU
            output, state = self.gru(x)

            # shape == (batch_size, max_length, hidden_size)
            x = self.fc1(output)

            # x shape == (batch_size * max_length, hidden_size)
            x = tf.reshape(x, (-1, x.shape[2]))

            # output shape == (batch_size * max_length, vocab)
            x = self.fc2(x)

            return x, state, attention_weights

        def reset_state(self, batch_size):
            # initial hidden state: all zeros
            return tf.zeros((batch_size, self.units))


### Soft Attention: Neural Machine Translation by Jointly Learning to Align and Translate

Bahdanau et al.'s paper on the attention mechanism opened a new direction in sequence-to-sequence learning by introducing attention-weighted encoder representations.

The most common approach previously was to give equal importance to every position of the encoder output, even though, as discussed above, not all of them are relevant for generating a given word. Bahdanau's approach addresses this by weighting each position according to its relevance.

As shown above, the conditional probability of generating the $t^{th}$ word is given by

$P(y_{t}|Ey_{t-1},h_{t-1},z'_{t-1})$

**The most exciting aspect to note is that the probability of each word is conditioned on a distinct encoder vector $z'_{t-1}$ and not on a generic representation shared by all the words.**

The attention scores for generating a word $w_{t}$ are computed by learning an additive association between the encoder vectors $a_{i} \in R^{D}$ and the previous hidden state vector $h_{t-1} \in R^{n}$. Since the two vectors have different dimensions, they are projected by separate matrices into a common space of dimension $p'$. The matrix $W_{1} \in R^{D \times p'}$ projects $a_{i}$ to $R^{p'}$ and $W_{2} \in R^{n \times p'}$ projects $h_{t-1}$ to $R^{p'}$:

$a' = a \times W_{1}$ and $h'_{t-1} = h_{t-1} \times W_{2}$, where $a'$ and $h'_{t-1}$ are the projected encoder and LSTM representations, respectively.

The score for each location $i \in \{1, \ldots, L\}$ is $s_{i} = v^{\top} \tanh(h'_{t-1} + a'_{i})$, where $v \in R^{p'}$ is a learned vector that reduces the $\tanh$ output to a scalar (the `V` layer in the attention code in this article). The scores are normalized with a softmax so that they sum to 1 and express the relative importance of each encoder representation $a'_{i}$:

$s'_{i} = \dfrac{\exp(s_{i})}{\sum_{j=1}^{L} \exp(s_{j})}$

The scores therefore differ from word to word, giving weight to the relevant portion of the encoder representations, i.e., to a particular image region for each word.

Finally, the context vector $z'_{t-1}$ for generating the word $w_{t}$ is the sum of the encoder vectors weighted by the attention scores:

$z'_{t-1} = \sum_{i=1}^{L} s'_{i} \, a_{i}$
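The whole computation (project, add, apply tanh, score, softmax, weighted sum) fits in a few lines of plain Python. The sketch below is illustrative only: the toy sizes and weight values are assumptions, `matvec` and `soft_attention` are hypothetical helper names, and the vector `v` that reduces each tanh output to a scalar score mirrors the `V` layer in the attention code of this article.

```python
import math


def matvec(M, vec):
    """Multiply matrix M (a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in M]


def soft_attention(a, h_prev, W1, W2, v):
    """Return (context vector, attention weights) for encoder vectors `a`.

    W1 (p x D) and W2 (p x n) are stored row-wise, i.e. as the transposes
    of the projection matrices written in the text.
    """
    h_proj = matvec(W2, h_prev)  # projected hidden state
    scores = []
    for a_i in a:
        a_proj = matvec(W1, a_i)  # projected encoder vector
        hidden = [math.tanh(x + y) for x, y in zip(a_proj, h_proj)]
        scores.append(sum(vk * hk for vk, hk in zip(v, hidden)))
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    context = [sum(w * a_i[d] for w, a_i in zip(weights, a)) for d in range(len(a[0]))]
    return context, weights


# Toy sizes (assumptions): L = 3 locations, D = 2, n = 2, p = 2.
a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
h_prev = [0.5, -0.5]
W1 = [[1.0, 0.0], [0.0, 1.0]]
W2 = [[0.5, 0.5], [-0.5, 0.5]]
v = [1.0, -1.0]
context, weights = soft_attention(a, h_prev, W1, W2, v)
print(weights, context)
```

Because the weights depend on `h_prev`, a different hidden state at the next step yields a different context vector, which is how the model shifts its focus from word to word.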


#### Code for Attention Mechanism

    import tensorflow as tf


    class BahdanauAttention(tf.keras.Model):
        def __init__(self, units):
            super(BahdanauAttention, self).__init__()
            self.W1 = tf.keras.layers.Dense(units)
            self.W2 = tf.keras.layers.Dense(units)
            self.V = tf.keras.layers.Dense(1)

        def call(self, features, hidden):
            # features (CNN_Encoder output) shape == (batch_size, 64, embedding_dim)

            # hidden shape == (batch_size, hidden_size)
            # hidden_with_time_axis shape == (batch_size, 1, hidden_size)
            hidden_with_time_axis = tf.expand_dims(hidden, 1)

            # score shape == (batch_size, 64, units)
            score = tf.nn.tanh(self.W1(features) + self.W2(hidden_with_time_axis))

            # attention_weights shape == (batch_size, 64, 1)
            # the last axis is 1 because self.V maps each score vector to a scalar
            attention_weights = tf.nn.softmax(self.V(score), axis=1)

            # context_vector shape after sum == (batch_size, embedding_dim)
            context_vector = attention_weights * features
            context_vector = tf.reduce_sum(context_vector, axis=1)

            return context_vector, attention_weights



## Model Prediction and Attention Visualization

The images below show how the model's gaze shifts across the image as it generates different words in the caption.

## Section 4

We encourage you to experiment with image captioning on your own.

You can find the full code here. →