
Answer-Me: Multi-Task Open-Vocabulary Visual Question Answering (In Layspeak)

On May 2, 2022, Google Research submitted the Answer-Me paper to arXiv. Let's take a look at this new effort to advance multi-modal question answering.
A very interesting paper titled "Answer-Me: Multi-Task Open-Vocabulary Visual Question Answering" was submitted to arXiv on May 2, 2022 by Google Research. The team of AJ Piergiovanni, Wei Li, Weicheng Kuo, Mohammad Saffar, Fred Bertsch and Anelia Angelova developed the Answer-Me framework to advance the methods used for question answering on images.
Specifically, the framework "... unifies a variety of question answering tasks, such as, visual question answering, visual entailment, visual reasoning."


Let's first look at the problem before digging into their solution.

The Problem With Current Models

Traditional models rely on pre-training on large datasets and then fine-tuning for individual tasks.
In the paper's introduction, the authors describe the problem:
"... the intent of the input text is not captured by many pretrained models, as they are trained on datasets for image captioning or with weak image-language correlations [51,25], which might not learn the right interactions between image and language to capture the goal of the question. For example, pretraining with image captioning does not train a text encoder, so the model will not be able to understand questions. Instead, we need a model which is able to take advantage of the language-image training and learn the corresponding interactions which reflect the intent of the question or task."
In short, much of the pretraining may not assist with the required task.
A couple of additional goals:
  • They wanted the model to be able to generalize to other tasks with natural questions and answers not seen previously. Basically, they wanted the system to work in a zero-shot scenario on unseen tasks.
  • They wanted the model to be relatively easy to train and support.

Why Do We Need Answer-Me?

It's not hard to think of why they might take on this task. Empowering a model to understand images deeply enough to answer questions about them underpins much of what's being worked on in the field today. Let's just consider the following questions:
  • How valuable might it be for advertisers on the Google Ads system to be able to advertise their products after a user asks, "Where can I buy brown boots like that?" while looking at a picture?
  • How important might it be for an autonomous vehicle to be able to answer to itself the question, "How many humans are on the road?" or "Is that a child or a dog?"
  • Or just casual questions one might ask such as the examples given in the paper, "What is the child holding?" or the potentially far more difficult, "Why is the person smiling?"
Advancements in this area will have immediate applications for online shopping and all manner of mobile uses, especially as augmented reality adoption increases (think "What building am I looking at?" or "How much is that TV?" when your phone is simply pointed towards the object).

Additionally, many of the techniques and results from this framework can be combined with other systems to improve them further.
Empowering autonomous cars to better understand their environment is a good example, as is helping game designers create characters that perceive the game world and respond to players more realistically.

How The Answer-Me Model Works

Before we launch into the technical bit (though I'll be keeping it light), let's take a look at how the authors describe some of the mechanics:
"Each of the image-language question-answering tasks contain a specific intent in the question, for example, counting the number of objects, answering a visual entailment question, or reasoning about an image. Answer-Me combines the tasks in joint training using a mixture, enabling the model to acquire new capabilities, and generalize to other classes of questions without losing accuracy on tasks."
In short, during training the model is given a mixture of tasks over the same kinds of inputs, and learning them jointly makes it more flexible at acquiring new capabilities and handling new classes of questions.
Additionally, Answer-Me was trained on multiple datasets with different question intents to increase generalization and avoid overfitting. The datasets were VQA/VQA2.0 (Visual Question Answering), NLVR2 (Natural Language for Visual Reasoning), SNLI-VE (Visual Entailment), and GQA.
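To make the mixture idea concrete, here is a tiny Python sketch of sampling one training batch across the four datasets. The dataset names match the paper, but the toy examples and the sampler are illustrative assumptions, not the authors' pipeline.

```python
import random

# Hypothetical stand-ins for the real datasets; each entry is (question, answer).
# In the real setup every example also carries an image.
task_datasets = {
    "VQA2.0":  [("What is the child holding?", "a ball")],
    "NLVR2":   [("The left image contains two dogs.", "false")],
    "SNLI-VE": [("A person is outdoors.", "entailment")],
    "GQA":     [("Is the cup to the left of the plate?", "yes")],
}

def sample_mixed_batch(batch_size=4):
    """Draw a batch in which each example may come from a different task."""
    batch = []
    for _ in range(batch_size):
        task = random.choice(list(task_datasets))  # pick a task at random
        batch.append((task, random.choice(task_datasets[task])))
    return batch

print(sample_mixed_batch())
```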

Under The Answer-Me Hood

The Answer-Me model consists of an image encoder (ResNet) and text encoder (T5) which are easily and independently scalable. Their experiments are, "based on a ResNet-50 and T5-base model, and [they] scale it 3x by using ResNet101 and T5-large."
The features are then provided to the fusion layer, where the image and language features need to be combined. This is done with concatenation. They write:
"We use the ResNet feature map to get H∗W features and concatenate with the L (number of tokens) text features. We apply a relative position bias to both image and text separately. We then apply the Transformer on the concatenated features, combining both sources. While existing works have proposed similar fusion methods, the pretraining method and generalization abilities without forgetting are new, further only using raw images instead of region proposals."
Here's what that looks like in their Figure 2:
Fig. 2. Model overview. The model processes an image and text, fuses them together, and uses transformer layers to generate the output text. The pretraining task mix, shown in different colors at the bottom right allows all parts of the model to be well-trained, and is better suited for the subsequent multi-task training.
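To make the fusion step a bit more concrete, here is a minimal PyTorch sketch of concatenating an image feature map with text token features and running a Transformer over the result. It omits the relative position bias and the decoder, and the layer sizes and names are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class FusionSketch(nn.Module):
    def __init__(self, d_model=512, n_heads=8, n_layers=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, n_layers)

    def forward(self, image_feats, text_feats):
        # image_feats: (B, C, H, W) feature map from the image encoder (e.g. a ResNet)
        # text_feats:  (B, L, C) token features from the text encoder (e.g. T5)
        B, C, H, W = image_feats.shape
        img_tokens = image_feats.flatten(2).transpose(1, 2)  # (B, H*W, C)
        fused = torch.cat([img_tokens, text_feats], dim=1)   # (B, H*W + L, C)
        return self.fusion(fused)                            # fused representation

# Toy usage: a 7x7 feature map and 16 text tokens, both 512-dimensional.
model = FusionSketch()
out = model(torch.randn(2, 512, 7, 7), torch.randn(2, 16, 512))
print(out.shape)  # torch.Size([2, 65, 512])
```

In the full model, this fused representation is then passed to the text decoder to generate the output text.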

How The Answer-Me Model Is Trained

One of the unique characteristics of this model is that:
  • The image encoder, text encoder, image-text fusion module and text decoder are all trained together.
  • They use a mix of captioning, caption completion, and matching tasks to exercise various pathways during pretraining, improving its performance in question-answering and descriptive tasks.
  • Taken together, the tasks allowed the encoder and decoder to see all, part of, or none of the caption, and experimentally they found this pretraining method results in a stronger model than any individual pretraining task.
They designed four tasks for training. They were:
  1. Image Captioning (IC). Here, the input text is ‘caption the image’ and the target text is the caption. This task mostly trains the text decoder and fusion layers.
  2. Caption Completion (CMP). Here the input is 10-40% of the caption text and the target text is the remaining caption. This encourages training of the entire model.
  3. Text MLM. Here the input is the caption with 25% of the words masked out, the target text is the missing words. This trains the entire model.
  4. Image Text Matching (ITM). Here the input is either the image caption or a random caption and the target text is ‘true’ or ‘false’ depending on if the caption matches the image or not. This primarily trains the encoder and fusion layers.
The advantage is that the encoder and decoder are trained on all, part, or none of the caption across tasks, preparing the model to address a large variety of image-text tasks; a rough sketch of how such examples might be constructed follows below.
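Here is a rough Python sketch of how the four (input text, target text) pairs could be built from a single caption. The prompt string, mask token, and sampling ratios follow the paper's description, but the function itself is a hypothetical illustration, not the authors' data pipeline.

```python
import random

def build_pretraining_examples(caption, random_caption):
    """Return (input_text, target_text) pairs for the four pretraining tasks."""
    words = caption.split()
    examples = {}

    # 1. Image Captioning (IC): a fixed prompt in, the full caption out.
    examples["IC"] = ("caption the image", caption)

    # 2. Caption Completion (CMP): 10-40% of the caption in, the remainder out.
    cut = max(1, int(len(words) * random.uniform(0.1, 0.4)))
    examples["CMP"] = (" ".join(words[:cut]), " ".join(words[cut:]))

    # 3. Text MLM: mask roughly 25% of the words, predict the masked words.
    masked, targets = [], []
    for word in words:
        if random.random() < 0.25:
            masked.append("<mask>")
            targets.append(word)
        else:
            masked.append(word)
    examples["MLM"] = (" ".join(masked), " ".join(targets))

    # 4. Image Text Matching (ITM): the real caption or a random one, answer true/false.
    if random.random() < 0.5:
        examples["ITM"] = (caption, "true")
    else:
        examples["ITM"] = (random_caption, "false")

    return examples

print(build_pretraining_examples("a child holding a red ball", "a dog running on a beach"))
```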

The Results

So how does the model perform?

As one might expect, the more tasks the model is trained on, the better it performs on a new task. One of the hallmarks of this model is that scaling training in this way is not onerous.
The 3x in the last line references 3x the number of parameters, which also improved performance.
The data that illustrates what I believe is the highlight of the work is Table 3 from the paper:

What we can see is that when the model is fine-tuned on a single task, it does well on that task and fails at the others; however, when it is trained on all tasks, it performs well on all of them (though not quite as well as if it had specialized).
One might think this is obvious; however, many models suffer from "catastrophic forgetting" when trained on multiple tasks. Basically, they lose track of how to do the first task they were trained on when they are taught a new one. The authors point out something I personally found humorous: "... additional finetuning on the first task actually makes the performance on the second one even worse than zero-shot." In other words, a standard model does better on a task it has never been trained on than on that same task after being fine-tuned on a different one first.
And how does it stack up against the state-of-the-art?

The authors describe the results here well (though the bold makes it pretty obvious). They wrote:
"Experiments comparing to SOTA: specialized models (top section), and multitask models, including ours (middle large section). Answer-Me largely outperforms other multi-task models despite working in the open-vocabulary generative setting. For reference we include pre-trained and fine-tuned models which are further advantaged by fine-tuning to each individual dataset (bottom section). As seen, Answer-Me still outperforms these on GQA and VizWiz and it even outperforms the Large version of SimVLM model and it is close to its SimVLM-Huge on SNLI-VE. The best results among the pre-trained fine-tuned models are marked in italics."

Conclusion

What we need to remember is that while some models do outperform Answer-Me, it's far more versatile and easily adapted to new tasks, something the models that beat it are not. This makes it highly suitable for environments where the model may be directed to perform a new task.
Additionally, this is a new method and one that is sure to be advanced.
At present the authors have created a model that offers the following contributions (drawn from the paper):
  • A multi-task task-aware general framework which can serve multiple diverse Visual Question Answering tasks and is able to generalize to many tasks. It uses a simple, easily scalable and extensible architecture and considers the VQA tasks in the more-challenging open-vocabulary generative setting.
  • A pretraining method to train the entire encoder-decoder vision-language model simultaneously using only noisy captioning data that results in a strong pretrained model.
  • Answer-Me is able to perform well on zero-shot (novel task) generalization settings, is also robust to overfitting and forgetting, and performs well on a new ‘detection’ task which requires semantic object recognition.