
Instructions

Explanation of how the project was completed

Context

Given an image, your goal is to generate a caption such as "a surfer riding on a wave".
To accomplish this, you'll use an attention-based model, which lets you see which parts of the image the model focuses on as it generates each word of the caption.

In this example, you will train a model on a relatively small amount of data—the first 30,000 captions for about 20,000 images (because there are multiple captions per image in the dataset).

Instructions

  • Download the MS-COCO dataset
  • Cache image features for a subset of images using InceptionV3 (a feature-extraction sketch follows this list)
    • Resize each image to 299px by 299px
    • Preprocess the images with the preprocess_input method so that pixel values fall in the range -1 to 1, matching the format of the images used to train InceptionV3
    • Create a tf.keras model where the output layer is the last convolutional layer in the InceptionV3 architecture. The shape of the output of this layer is 8x8x2048. You use the last convolutional layer because you are using attention in this example. You run this feature extraction once and cache the results to disk, rather than doing it during training, where it would become a bottleneck.
    • Extract the features from this last convolutional layer, giving a tensor of shape (8, 8, 2048) per image.
    • Reshape that to (64, 2048).
    • Save the features to .npy files
  • Preprocess the captions (a tokenization sketch follows this list)
    • Transform the text captions into integer sequences using the TextVectorization layer
      • Use adapt to iterate over all captions, split the captions into words, and compute a vocabulary of the top 5,000 words (to save memory).
      • Tokenize all captions by mapping each word to its index in the vocabulary. All output sequences will be padded to length 50.
      • Create word-to-index and index-to-word mappings to display results.
  • Split the data into training and test sets
  • Train an encoder-decoder model (a training-step sketch follows this list)
    • The decoder is identical to the one in the example for Neural Machine Translation with Attention
    • The model architecture is inspired by the Show, Attend and Tell paper.
    • The extracted image feature vector is then passed through the CNN encoder (which consists of a single fully connected layer).
    • The RNN (here GRU) attends over the image to predict the next word.
    • You extract the features stored in the respective .npy files and then pass those features through the encoder.
    • The encoder output, hidden state (initialized to 0), and the decoder input (which is the start token) are passed to the decoder.
    • The decoder returns the predictions and the decoder's hidden state.
    • The decoder hidden state is then passed back into the model and the predictions are used to calculate the loss.
    • Use teacher forcing to decide the next input to the decoder.
      • Teacher forcing is the technique where the target word is passed as the next input to the decoder.
    • The final step is to compute the gradients through backpropagation and apply them with the optimizer to update the model's weights.
  • Generate captions for new images using the trained model (an inference sketch follows this list)
    • The evaluate function is similar to the training loop, except you don't use teacher forcing here. The input to the decoder at each time step is its previous predictions along with the hidden state and the encoder output.
    • Stop predicting when the model predicts the end token.
    • Store the attention weights for every time step.
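
The sketches below illustrate the steps above; they are not this project's exact code, so treat names such as image_paths, train_captions, encoder, and decoder as assumptions. First, a minimal sketch of the feature-caching step: resize each image to 299x299, normalize with preprocess_input, run it through InceptionV3 up to the last convolutional layer, reshape the (8, 8, 2048) output to (64, 2048), and save it to a .npy file.

```python
import numpy as np
import tensorflow as tf

# Feature extractor: InceptionV3 up to its last convolutional layer,
# which outputs an 8x8x2048 tensor per image.
image_model = tf.keras.applications.InceptionV3(include_top=False,
                                                weights='imagenet')
image_features_extract_model = tf.keras.Model(image_model.input,
                                              image_model.layers[-1].output)

def load_image(image_path):
    # Resize to 299x299 and normalize pixels to [-1, 1],
    # the format InceptionV3 was trained on.
    img = tf.io.read_file(image_path)
    img = tf.io.decode_jpeg(img, channels=3)
    img = tf.image.resize(img, (299, 299))
    img = tf.keras.applications.inception_v3.preprocess_input(img)
    return img, image_path

# image_paths (assumed) holds the local paths of the downloaded MS-COCO images.
image_dataset = (tf.data.Dataset.from_tensor_slices(image_paths)
                 .map(load_image, num_parallel_calls=tf.data.AUTOTUNE)
                 .batch(16))

for img, path in image_dataset:
    batch_features = image_features_extract_model(img)            # (batch, 8, 8, 2048)
    batch_features = tf.reshape(
        batch_features,
        (batch_features.shape[0], -1, batch_features.shape[3]))   # (batch, 64, 2048)
    for bf, p in zip(batch_features, path):
        # np.save appends the .npy extension to the image path.
        np.save(p.numpy().decode('utf-8'), bf.numpy())
```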
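
Next, a sketch of the caption preprocessing with TextVectorization, assuming train_captions is the list of raw caption strings; the standardization step (lowercasing, stripping punctuation, adding start/end tokens) is one reasonable choice rather than a confirmed detail of this project.

```python
import tensorflow as tf

MAX_LENGTH = 50     # pad every caption to 50 tokens
VOCAB_SIZE = 5000   # keep only the top 5,000 words to save memory

def standardize(inputs):
    # Lowercase, strip punctuation, and wrap each caption in start/end tokens.
    inputs = tf.strings.lower(inputs)
    inputs = tf.strings.regex_replace(inputs, r"[!\"#$%&()*+.,\-/:;=?@\[\]^_`{|}~]", "")
    return tf.strings.join(["<start>", inputs, "<end>"], separator=" ")

tokenizer = tf.keras.layers.TextVectorization(
    max_tokens=VOCAB_SIZE,
    standardize=standardize,
    output_sequence_length=MAX_LENGTH)

# train_captions (assumed) is the list of raw caption strings.
tokenizer.adapt(train_captions)          # builds the top-5,000-word vocabulary
cap_vector = tokenizer(train_captions)   # integer sequences padded to length 50

# Word-to-index and index-to-word mappings for displaying results later.
word_to_index = tf.keras.layers.StringLookup(
    mask_token="", vocabulary=tokenizer.get_vocabulary())
index_to_word = tf.keras.layers.StringLookup(
    mask_token="", vocabulary=tokenizer.get_vocabulary(), invert=True)
```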
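
A sketch of a single training step with teacher forcing. Here encoder (the single fully connected layer over the (64, 2048) features) and decoder (a GRU with attention, as in the Neural Machine Translation with Attention example) are assumed to be defined elsewhere, with the decoder exposing a reset_state method and returning predictions, its hidden state, and attention weights.

```python
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam()
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction='none')

def loss_function(real, pred):
    # Mask out padded positions so they do not contribute to the loss.
    mask = tf.math.logical_not(tf.math.equal(real, 0))
    loss_ = loss_object(real, pred)
    loss_ *= tf.cast(mask, dtype=loss_.dtype)
    return tf.reduce_mean(loss_)

@tf.function
def train_step(img_tensor, target):
    # encoder, decoder, and word_to_index are assumed to exist (see the lead-in).
    loss = 0
    hidden = decoder.reset_state(batch_size=target.shape[0])   # hidden state initialized to 0
    dec_input = tf.expand_dims(
        [word_to_index('<start>')] * target.shape[0], 1)       # start token for every example

    with tf.GradientTape() as tape:
        features = encoder(img_tensor)                          # cached .npy features -> encoder
        for i in range(1, target.shape[1]):
            predictions, hidden, _ = decoder(dec_input, features, hidden)
            loss += loss_function(target[:, i], predictions)
            # Teacher forcing: the ground-truth word becomes the next decoder input.
            dec_input = tf.expand_dims(target[:, i], 1)

    total_loss = loss / int(target.shape[1])
    trainable_variables = encoder.trainable_variables + decoder.trainable_variables
    gradients = tape.gradient(loss, trainable_variables)        # backpropagation
    optimizer.apply_gradients(zip(gradients, trainable_variables))
    return loss, total_loss
```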
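
Finally, a sketch of caption generation for a new image, reusing load_image, image_features_extract_model, encoder, decoder, word_to_index, and index_to_word from the sketches above: the decoder's own prediction feeds the next step (no teacher forcing), decoding stops at the end token, and the attention weights are stored for every time step.

```python
import numpy as np
import tensorflow as tf

def evaluate(image_path, max_length=50, attention_features_shape=64):
    # Reuses load_image, image_features_extract_model, encoder, decoder,
    # word_to_index, and index_to_word from the sketches above.
    attention_plot = np.zeros((max_length, attention_features_shape))
    hidden = decoder.reset_state(batch_size=1)

    temp_input = tf.expand_dims(load_image(image_path)[0], 0)
    img_tensor_val = image_features_extract_model(temp_input)
    img_tensor_val = tf.reshape(
        img_tensor_val,
        (img_tensor_val.shape[0], -1, img_tensor_val.shape[3]))  # (1, 64, 2048)

    features = encoder(img_tensor_val)
    dec_input = tf.expand_dims([word_to_index('<start>')], 0)
    result = []

    for i in range(max_length):
        predictions, hidden, attention_weights = decoder(dec_input, features, hidden)
        attention_plot[i] = tf.reshape(attention_weights, (-1,)).numpy()

        # No teacher forcing: sample a word and feed it back as the next input.
        predicted_id = tf.random.categorical(predictions, 1)[0][0].numpy()
        predicted_word = tf.compat.as_text(index_to_word(predicted_id).numpy())
        result.append(predicted_word)

        if predicted_word == '<end>':
            return result, attention_plot

        dec_input = tf.expand_dims([predicted_id], 0)

    attention_plot = attention_plot[:len(result), :]
    return result, attention_plot
```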