Model

Description of Model Used

Overview

Image caption generation is the problem of generating a descriptive sentence for an image. The fact that humans (e.g., you) can do this with remarkable ease makes it a very interesting and challenging problem for AI, combining aspects of computer vision (in particular scene understanding) and natural language processing.
In this work, we introduce an attention-based framework for image caption generation. Much in the same way human vision fixates on salient regions when perceiving the visual world, the model learns to "attend" to selective regions of the image while generating a description. Furthermore, we explore and compare two variants of this model: a deterministic version trainable with standard backpropagation, and a stochastic variant trained by maximizing a variational lower bound.

How does it work?

The model brings together convolutional neural networks, recurrent neural networks and work in modeling attention mechanisms.
At a high level, the model uses a convolutional neural network as a feature extractor, then a recurrent neural network with attention to generate the sentence.
If you are not familiar with these components, you can think of the convolutional network as a function encoding the image ('encoding' = f(image)), the attention mechanism as grabbing a portion of the image ('context' = g(encoding)), and the recurrent network as a word generator that receives a context at every point in time ('word' = l(context)).
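To make that composition concrete, here is a minimal sketch of the encode-attend-decode loop. The names f, g, and l stand in for the CNN encoder, the attention mechanism, and the recurrent word generator; the function signature and the end-token handling are placeholders, not part of the paper.

```python
# Minimal sketch of the encode -> attend -> decode loop described above.
# f, g, and l are placeholders for the CNN encoder, the attention mechanism,
# and the recurrent word generator, respectively.

def generate_caption(image, f, g, l, max_len=20, end_token="<end>"):
    encoding = f(image)           # 'encoding' = f(image)
    caption = []
    for _ in range(max_len):
        context = g(encoding)     # attention grabs a portion of the image
        word = l(context)         # the recurrent network emits the next word
        if word == end_token:
            break
        caption.append(word)
    return caption
```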

The Model in Action

Example captions generated by the model:
"A woman standing in a crowd holding a rainbow umbrella."
"A man holding a tennis racket in his hand."

Description

The basis of the model is an encoder-decoder architecture in which a convolutional neural network pre-trained on the ImageNet dataset is used as the encoder to produce a fixed-length vector representation of the image. An LSTM decoder then generates the caption one word at a time.
Architecture used by Vinyals et al. in their paper Show and Tell
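A minimal PyTorch sketch of this baseline encoder-decoder is shown below. The layer sizes and the choice of ResNet-50 are illustrative, not the settings used by Vinyals et al.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class Encoder(nn.Module):
    """CNN pre-trained on ImageNet, producing a fixed-length image vector."""
    def __init__(self, embed_dim=256):
        super().__init__()
        resnet = models.resnet50(weights="IMAGENET1K_V1")
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])  # drop the classifier
        self.fc = nn.Linear(resnet.fc.in_features, embed_dim)

    def forward(self, images):                    # images: (B, 3, H, W)
        feats = self.backbone(images).flatten(1)  # (B, 2048)
        return self.fc(feats)                     # (B, embed_dim)

class Decoder(nn.Module):
    """LSTM that generates the caption one word at a time."""
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, image_vec, captions):       # captions: (B, T) token ids
        # Condition the LSTM by feeding the image vector as the first input step.
        inputs = torch.cat([image_vec.unsqueeze(1), self.embed(captions)], dim=1)
        hidden, _ = self.lstm(inputs)
        return self.out(hidden)                   # per-step scores over the vocabulary
```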
The model uses the sum of the negative log-likelihoods of the correct word at each step as its loss function.
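Written out (the notation here is ours), for an image I and caption S = (S_1, ..., S_N) the loss is:

```latex
\mathcal{L}(I, S) = -\sum_{t=1}^{N} \log p_{\theta}\!\left(S_t \mid I, S_1, \ldots, S_{t-1}\right)
```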
Because this model conditions the generation of every word on the entire image representation, it cannot focus on different parts of the image when generating different words.

Same architecture as before, but with an attention layer added
The model looks at the “relevant” part of these images to generate the underlined words.
The architecture is similar to that of the classical model, but with a new attention layer. The attention mechanism looks at the feature map produced by the CNN and decides what is relevant for the LSTM decoder at each step.
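Below is a minimal sketch of such a soft (deterministic) attention layer, assuming the CNN feature map has been flattened into L spatial locations; the dimensions are illustrative, not those of the paper.

```python
import torch
import torch.nn as nn

class SoftAttention(nn.Module):
    """Scores each spatial location against the decoder state and returns a
    weighted sum of the features (the context vector) plus the weights."""
    def __init__(self, feat_dim=2048, hidden_dim=512, attn_dim=512):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.hidden_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, features, hidden):
        # features: (B, L, feat_dim) -- L spatial locations of the CNN feature map
        # hidden:   (B, hidden_dim)  -- current decoder state
        e = self.score(torch.tanh(self.feat_proj(features)
                                  + self.hidden_proj(hidden).unsqueeze(1)))  # (B, L, 1)
        alpha = torch.softmax(e.squeeze(-1), dim=1)            # attention weights over locations
        context = (alpha.unsqueeze(-1) * features).sum(dim=1)  # (B, feat_dim) context vector
        return context, alpha
```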


At a particular timestep, the decoder GRU considers the hidden state from the previous timestep, the context vector produced by the attention mechanism, and the output from the previous step. It combines them to update the hidden state.
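A minimal sketch of one such decoder step is given below, assuming the previous output is fed back in as a token id; layer sizes and the vocabulary size are illustrative.

```python
import torch
import torch.nn as nn

class DecoderStep(nn.Module):
    """One decoding step: combine the previous word, the attention context,
    and the previous hidden state to update the hidden state and score the
    next word."""
    def __init__(self, embed_dim=256, context_dim=2048, hidden_dim=512, vocab_size=10000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.gru_cell = nn.GRUCell(embed_dim + context_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, prev_word, context, prev_hidden):
        # prev_word: (B,) token ids, context: (B, context_dim), prev_hidden: (B, hidden_dim)
        gru_input = torch.cat([self.embed(prev_word), context], dim=1)
        hidden = self.gru_cell(gru_input, prev_hidden)  # updated hidden state
        return self.out(hidden), hidden                 # scores for the next word, new state
```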