DeepMind Flamingo: A Visual Language Model for Few-Shot Learning
In this article, we'll explore Flamingo, a single open-ended visual language model (VLM) for multimodal machine learning research developed by DeepMind.
Flamingo is a new visual language model (VLM) capable of multimodal tasks like captioning, visual dialogue, classification, and visual question answering. As you can see, it works rather well:
Single Image samples: Gray boxes are user input and the pink boxes are Flamingo output.
This article will walk you through this new research, its architecture and training data, and finally take it for a test run to see how it works.
Table of Contents
- An Introduction to Flamingo
- What is Flamingo?
- Challenges of Multimodal Generative Modeling
- Key Ideas of Flamingo
- The Flamingo Model
- Training Data
- Flamingo Training Details
- Task Adaptation With Few-Shot In-Context Learning
- Flamingo Models
- Qualitative Results
- Conclusion
- References
An Introduction to Flamingo
Our experience of the world is multimodal — we see objects, hear sounds, feel textures, smell odors, and taste flavors. For AI to make progress in understanding the world around us, it needs to be able to interpret such multimodal signals together.
One key aspect of intelligence is the ability to quickly learn how to perform a new task given a short instruction. However, in computer vision, the most widely used paradigm still consists of pre-training a model on a large volume of data and then fine-tuning it on the task of interest. Fine-tuning requires thousands of annotated data points to be effective. Furthermore, it requires per-task hyperparameter tuning and is resource-intensive, even after the pretraining process.
Recently, multimodal vision-language models trained with a contrastive objective (such as CLIP) have enabled zero-shot adaptation to new tasks without the need for fine-tuning. But there's a caveat: since these models are trained to provide a similarity score between an image and a text, they can only tackle tasks such as classification, where a finite set of outcomes is provided beforehand. Because of the way they are trained, these models cannot generate language, which makes them less suitable for open-ended tasks such as captioning or visual question answering.
Prior knowledge of CLIP is recommended to better understand the Flamingo architecture. To learn more about CLIP, take a look at the report below from Jonathan Whitaker.
💡
The tables and figures are taken from the original Flamingo paper with a few minor changes to the image captions and descriptions.
💡
What is Flamingo?
Flamingo is a family of visual language models (VLMs) introduced in the paper Flamingo: A Visual Language Model for Few-Shot Learning by DeepMind. These models set a new state of the art in few-shot learning on a wide range of open-ended vision and language tasks.
Flamingo surpasses the fine-tuned state of the art on 6 of the 16 tasks the authors considered. It takes inspiration from large-scale generative language models (LMs), which are good few-shot learners, as demonstrated by GPT-3. A few examples are provided to the LM as a prompt, along with a query input, and the model generates a continuation to produce a predicted output for the task on the query.
However, such LMs only work with text data. They can be extended to visual tasks such as image classification, captioning, or question answering by conditioning on visual input. Specifically, the model must be able to take a multimodal prompt (images + text) as input. The authors of the paper describe Flamingo as follows:
Flamingo is a visually-conditioned autoregressive text generation model able to ingest a sequence of text tokens interleaved with images and/or videos, and produce text as output.
Flamingo models fuse (combine) large language models with powerful visual embeddings, each separately pre-trained and frozen, by adding novel architecture components in between.
Before diving deep into the research, let's look at a few examples of the kind of tasks the Flamingo model can perform.
💡
We've seen qualitative results above; let's also look at the quantitative results before moving forward.
💡
This article was written as a Weights & Biases Report, which is a project management and collaboration tool for machine learning projects. Reports let you organize and embed visualizations, describe your findings, share updates with collaborators, and more. To learn more about reports, check out Collaborative Reports.
💡
Let's first examine the difficulties of multimodal generative modeling and how the authors overcame them using Flamingo before moving on to the model architecture.
Challenges of Multimodal Generative Modeling
Here, let's look at some challenges and how Flamingo proposes we solve them:
Unifying Strong Single-Modal Models
The challenges:
- Training large language models is extremely computationally expensive.
- The authors save compute by starting from a pretrained language model.
- However, a text-only model has no built-in way to incorporate data from other modalities.
- The authors wanted to enable this while retaining the knowledge of the original language model.
Proposed approach:
Interleave cross-attention layers with the regular language-only self-attention layers, which are kept frozen during training. The authors also introduce a specific gating mechanism to minimize the effect of the newly added layers at initialization, which greatly improves stability and final performance.
Supporting Both Images and Videos
The challenges:
- Images and videos (of even modest resolution) are high dimensional.
- Flattening them to 1D sequences (as used in unimodal text generation) is costly as the computation scales quadratically with the sequence length.
- The authors also wanted to treat images and videos in a unified manner which is not straightforward.
Proposed approach: Use a Perceiver-based architecture that can produce a small fixed number of visual tokens (around a hundred) per image/video, given a large varying number of visual input features (up to several thousand).
Heterogeneous Training Data
The challenges:
- Large models require huge datasets.
- Large internet-scale text-only datasets exist, but there is no equivalent for multimodal data.
- One approach is to scrape webpages with interleaved images and text. Despite the generality of the data, the images and text are often weakly related.
Proposed approach: combine the interleaved dataset (which the authors created) with standard paired image/text and video/text datasets, where the visual and language data are typically more strongly related.
Let's run down the primary and innovative ideas of Flamingo as outlined by the authors in the paper before delving into each component in more detail.
Key Ideas of Flamingo
Flamingo by DeepMind can perform various multimodal tasks (such as captioning, visual dialogue, classification or visual question answering) from only a few input/output examples. This is enabled by the following key ideas:
- A novel architecture for accepting arbitrarily interleaved visual and text data as input and generating output text in an open-ended manner.
- Architectural innovations and training strategies that effectively leverage large pre-trained vision-only and language-only models, saving tons of compute and preserving the benefits of these initial models while efficiently fusing the modalities. Specifically, the authors used Chinchilla, a 70B state-of-the-art LM (which is frozen in Flamingo) and trained Flamingo, an 80B parameter VLM.
- Efficient ways to adapt to visual inputs of varying sizes, making Flamingo applicable to images and videos.
The authors also quantitatively evaluate Flamingo on a variety of few-shot learning tasks. They reserved a large set of held-out benchmarks that were not used to validate any design decisions or hyperparameters of the approach, which allowed them to estimate unbiased few-shot performance.
Flamingo sets a new state of the art in few-shot learning on a wide array of 16 multimodal language and image/video understanding tasks. On 6 of these 16 tasks, Flamingo also outperforms the fine-tuned state of the art, despite using only 32 task-specific examples, around 1,000 times less task-specific training data than the current state of the art.
The Flamingo Model
Approach
Flamingo accepts text interleaved with images/videos and outputs free-form text. It can handle both open-ended tasks, such as visual question answering or captioning, and closed-ended tasks, such as classification.
- The first goal of the authors is to leverage pre-trained language models without spending compute on training them from scratch. Specifically, they used a model called Chinchilla, which was recently introduced by DeepMind. This gives the Flamingo model strong generative language abilities and access to the large amount of knowledge stored in the LM weights.
- On the vision side, the authors pretrain a vision encoder with a contrastive text-image approach similar to CLIP. The role of this model is to extract rich semantic spatial features from the given images/videos.
- The second goal was to bridge these two models harmoniously. For that, the authors freeze the weights of both models and link them via two learnable architecture components:
- The Perceiver Resampler receives spatiotemporal features from the Vision Encoder (obtained from a variable number of images or videos) and outputs a fixed-size set of visual tokens.
- The visual tokens are then used to condition the frozen LM using freshly initialised cross-attention layers that are interleaved (or inserted) between the pre-trained LM layers. These layers offer the LM a way to incorporate visual information for the next-token prediction task.
The model is trained by maximizing the likelihood of Equation (1) on a diverse mixture of datasets.
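For reference, the per-example likelihood that the paper refers to as Equation (1) models the probability of the text y conditioned on the interleaved visual inputs x autoregressively. The formula below is reconstructed from the paper's description (the conditioning sets are defined in the per-image/video attention masking section further down):

```latex
p(y \mid x) = \prod_{\ell=1}^{L} p\left(y_\ell \mid y_{<\ell},\, x_{\leq \ell}\right)
```

Here y_ℓ is the ℓ-th language token, y_{<ℓ} is the set of preceding text tokens, and x_{≤ℓ} is the set of images/videos preceding token y_ℓ in the interleaved sequence. The overall training objective is then a weighted sum of the expected negative log-likelihoods of this form, one term per dataset in the mixture.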
Let's now deep dive into each component of the Flamingo Model.
💡
Vision Encoder: From Pixels to Features
- The authors used an NFNet-F6 (Normalizer-Free ResNet), as it gives an excellent trade-off between performance and efficiency given the available hardware.
- The vision encoder is pretrained as a dual encoder using the contrastive objective employed by CLIP.
- Contrastive similarities are computed as the dot product of the mean-pooled output of the image encoder and the mean-pooled output of a BERT model.
- In contrast to CLIP, global average pooling is used to produce the visual embedding (rather than the global attention pooling) for simplicity.
- An image resolution of 288 × 288 pixels is used, and the joint embedding space has a size of 1,376.
- The final output is a 2D spatial grid of features, which is further flattened to 1D as shown in Figure 4.
- For video inputs, the frames are sampled at 1 FPS and are encoded independently to obtain a sequence of feature maps which are then concatenated.
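To make the contrastive pretraining above concrete, here is a minimal PyTorch sketch of a CLIP-style two-term contrastive loss over mean-pooled image and text embeddings. This is a sketch under stated assumptions, not DeepMind's implementation: the function name, shapes, and the temperature value are illustrative.

```python
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(image_feats, text_feats, temperature=0.07):
    """Two-term (image-to-text and text-to-image) contrastive loss in the
    spirit of CLIP. `image_feats` are mean-pooled vision encoder outputs and
    `text_feats` are mean-pooled BERT outputs, both assumed to be projected
    into the shared joint embedding space. Shapes: [batch, dim]."""
    # Normalize so the dot product becomes a cosine similarity.
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)

    # Pairwise similarities for the batch; matched pairs lie on the diagonal.
    logits = image_feats @ text_feats.t() / temperature  # [batch, batch]
    targets = torch.arange(logits.size(0), device=logits.device)

    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)
```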
Vision Encoder Details
Perceiver Resampler: From Varying-Size Large Feature Maps to Few Visual Tokens
- The Perceiver Resampler is based on DeepMind's paper Perceiver: General Perception with Iterative Attention.
- This module takes as input a variable number of image or video features from the vision encoder and outputs a fixed number of visual tokens.
- The visual inputs are re-sampled to a fixed and small number (64 in practice) of outputs to significantly reduce the computational complexity of vision-text cross-attention.
- The visual features fed to the Perceiver Resampler are obtained by first adding a learnt temporal position embedding to each spatial grid of features corresponding to a given frame of the video, as seen in Figure 4.
- The authors only used temporal encodings and no spatial grid position encodings, as the latter didn't bring improvements.
- These visual features are then flattened into a single 1D sequence.
- A fixed, small set of learnt latent queries is fed to a transformer stack and cross-attends to the flattened visual features.
- The keys and values computed from the learnt latents are concatenated to the keys and values obtained from the flattened visual features (see the sketch below).
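Here is a simplified PyTorch sketch of the resampling mechanism described above: a fixed set of learnt latent queries cross-attends to the flattened visual features, with the keys and values computed from the concatenation of the visual features and the latents. It is a rough approximation for illustration, not the authors' code; the dimensions, layer count, and the omission of layer norms are simplifying assumptions.

```python
import torch
import torch.nn as nn

class PerceiverResamplerSketch(nn.Module):
    """Maps a variable number of flattened visual features to a fixed
    number of visual tokens via learnt latent queries."""

    def __init__(self, dim=1024, num_latents=64, num_layers=6, num_heads=8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim))
        self.layers = nn.ModuleList([
            nn.ModuleDict({
                "attn": nn.MultiheadAttention(dim, num_heads, batch_first=True),
                "ffw": nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                     nn.Linear(4 * dim, dim)),
            })
            for _ in range(num_layers)
        ])

    def forward(self, x_f):
        # x_f: flattened visual features, shape [batch, T * S, dim].
        batch = x_f.size(0)
        x = self.latents.unsqueeze(0).expand(batch, -1, -1)  # [batch, num_latents, dim]
        for layer in self.layers:
            # Keys/values come from the visual features concatenated with the latents.
            kv = torch.cat([x_f, x], dim=1)
            attn_out, _ = layer["attn"](query=x, key=kv, value=kv)
            x = x + attn_out
            x = x + layer["ffw"](x)
        return x  # fixed-size set of visual tokens: [batch, num_latents, dim]
```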
Interleaving New GATED XATTN-DENSE Layers Within a Frozen Pretrained LM
- Text generation is performed by a Transformer decoder, conditioned on the visual representations produced by the Perceiver Resampler.
- The authors used a 70B parameter Chinchilla model for the largest Flamingo model as the language model.
- The pretrained blocks are frozen during the training of Flamingo to preserve the information and text generation abilities in the text-only language model.
- In order to condition the LM on the visual inputs, the authors inserted gated cross-attention dense (GATED XATTN-DENSE illustrated in Figure 5) blocks in between the original self-attention layers. Note that the original self-attention layers are frozen during the training of Flamingo while the newly inserted cross-attention layers are trained from scratch.
- The authors also added a tanh gating mechanism to preserve the original language model behaviour at initialisation and avoid catastrophically changing the features learned by the LM.
- The outputs of the newly added cross-attention layers are multiplied by tanh(α), where α is a layer-specific learnable scalar initialized at 0.
- During training, the model smoothly transitions from a fully trained text-only model to a visual language model, thanks to the gating mechanism.
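Below is a minimal PyTorch sketch of the gating idea: the outputs of the newly added cross-attention and feed-forward layers are scaled by tanh of a learnable scalar initialised at 0, so the block contributes nothing at the start of training and the frozen LM's behaviour is preserved. This is an illustrative simplification; layer norms, the attention masking, and other details of the actual GATED XATTN-DENSE block are omitted, and the dimensions are placeholders.

```python
import torch
import torch.nn as nn

class GatedCrossAttentionSketch(nn.Module):
    """Cross-attention + feed-forward block with tanh gates initialised at 0,
    inserted between frozen self-attention layers of the language model."""

    def __init__(self, dim=1024, num_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffw = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        # Layer-specific learnable gating scalars, initialised at zero.
        self.attn_gate = nn.Parameter(torch.zeros(1))
        self.ffw_gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_tokens, visual_tokens):
        # text_tokens: [batch, L, dim]; visual_tokens: [batch, V, dim]
        attn_out, _ = self.cross_attn(query=text_tokens,
                                      key=visual_tokens, value=visual_tokens)
        # tanh(0) = 0, so initially the frozen LM sees its original activations.
        text_tokens = text_tokens + torch.tanh(self.attn_gate) * attn_out
        text_tokens = text_tokens + torch.tanh(self.ffw_gate) * self.ffw(text_tokens)
        return text_tokens  # passed on to the next frozen LM self-attention layer
```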
Per-Image/Video Attention Masking
- The training data also consists of interleaved sequences scraped and processed from webpages.
- Each interleaved example consists of a sequence of text y, a sequence of images (or videos) x, and the sequence of positions of the images in the text.
- Based on the visual data positions, the authors define a function φ that assigns to each text position the index of the last image or video appearing before this position (or 0 if no visual data appears before it).
- The function φ defines which visual inputs are considered usable to predict token y_ℓ in Equation (1): the set of preceding text tokens y_{<ℓ} and the set of preceding images or videos x_{≤ℓ}.
- Multi-image attention is implemented in the GATED XATTN-DENSE layers by masking which visual tokens from the Perceiver Resampler each text token can cross-attend to.
- By default, each token is only allowed to attend to the visual tokens of the image that appeared immediately before it (this restriction improved performance).
- Although direct attention is over a single image, there is still a causal dependency on previous images (due to causal self-attention in the text decoder).
- Furthermore, experiments show that although the model is trained with at most 5 images per sequence, it generalizes to sequences of up to 32 images or videos at evaluation time.
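To illustrate the masking scheme, here is a small sketch of the function φ and the resulting cross-attention mask. The helper names are made up for illustration, and the 64 visual tokens per image simply mirror the Perceiver Resampler output described earlier.

```python
import torch

def last_preceding_image_index(text_len, image_positions):
    """The role of phi: for each text position, return the 1-based index of
    the last image appearing before it in the interleaved sequence
    (0 if no image appears before it)."""
    phi = torch.zeros(text_len, dtype=torch.long)
    for img_idx, pos in enumerate(sorted(image_positions), start=1):
        phi[pos:] = img_idx
    return phi

def per_image_cross_attention_mask(phi, num_images, tokens_per_image=64):
    """Boolean mask of shape [text_len, num_images * tokens_per_image]:
    a text token may only attend to the visual tokens of the image that
    appeared immediately before it (rows where phi == 0 attend to nothing)."""
    image_ids = torch.arange(1, num_images + 1).repeat_interleave(tokens_per_image)
    return phi.unsqueeze(1) == image_ids.unsqueeze(0)

# Example: 10 text tokens, with images placed before positions 0 and 6.
phi = last_preceding_image_index(10, image_positions=[0, 6])
mask = per_image_cross_attention_mask(phi, num_images=2)
```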
This marks the end of the model architecture. Let's move on to the training data, training details and the qualitative results of the model.
💡
Training Data
Flamingo Training Details
Task Adaptation With Few-Shot In-Context Learning
- The authors evaluate the ability of the model to rapidly adapt to new tasks using in-context learning, popularised by GPT-3.
- The model is given a set of support examples in the form of (image, text) or (video, text) pairs, where the image or video is the visual input and the text is the expected response, along with a single visual query for which the model needs to make a prediction.
- Given a few support examples, the authors build a multimodal prompt by concatenating the support examples followed by the visual query, as illustrated in Figure 8 (a rough sketch of this prompt construction is shown below).
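As a rough illustration of how such a multimodal prompt could be assembled, here is a hypothetical sketch. The "<image>" placeholder token and the "Output:" formatting are assumptions made for illustration, not necessarily the exact prompt format used in the paper.

```python
def build_few_shot_prompt(support_examples, query_visual, image_token="<image>"):
    """Hypothetical sketch: concatenate (visual, text) support examples, then
    append the query visual with its text left open for the model to complete.
    Returns the text prompt and the ordered list of visual inputs."""
    segments, visuals = [], []
    for visual, text in support_examples:
        visuals.append(visual)
        segments.append(f"{image_token} Output: {text}")
    visuals.append(query_visual)
    segments.append(f"{image_token} Output:")  # the model completes this part
    return " ".join(segments), visuals

# Example usage with dummy placeholders standing in for the actual visuals.
prompt, visuals = build_few_shot_prompt(
    [("support_image_1.jpg", "A cat lying on a sofa."),
     ("support_image_2.jpg", "A dog catching a frisbee.")],
    query_visual="query_image.jpg",
)
```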
Flamingo Models
Here's the parameter count of each component of the model.
Qualitative Results
The panels below demonstrate how Flamingo performs in a wide range of tasks using the examples the authors presented in the paper. To view more samples of the same type, drag the blue slider (Step) towards the right.
💡
Here comes the most exciting part of this report: we'll see how the model performs across a wide variety of tasks. Note that all the results are taken from the paper, as the model is not open-sourced.
The results below show the simplest form of interaction, where a single image is provided followed by a text prompt, either in the form of a question or the start of a caption. Even though the model is not trained in a Q&A format, the pretrained language model's capabilities allow this adaptation.
Single Image samples: Gray boxes are user input and the pink boxes are Flamingo output.
Since the Flamingo model can accept inputs in the form of arbitrary sequences of visuals and language, the authors tested its ability to hold an extended dialogue with interleaved images and text. What's interesting is that, even after several rounds of interaction, Flamingo can still successfully attend to the image and reply to questions. As you can see in some of the examples below, Flamingo also demonstrates solid OCR abilities, robustness to distribution shifts, and complex reasoning.
Dialogue samples. Gray boxes are user input and the pink boxes are Flamingo output.
Flamingo can also integrate information from multiple frames (e.g. videos scanning through a scene or text) and respond to requests involving temporal understanding as seen in the results below.
Video Samples. These are all of the frames the model sees.
Conclusion
Flamingo is a "general-purpose" family of models that can be applied to image and video understanding tasks with minimal task-specific training data. Given only a few examples, a single Flamingo model can achieve state-of-the-art results on a wide array of tasks, often competitive with approaches requiring fine-tuning on orders of magnitude more examples. The model does have some weaknesses: it performs worse on classification tasks than contrastive models, it directly inherits the biases, toxicity, and other weaknesses of the underlying language model, and it sometimes hallucinates or makes ungrounded guesses in open-ended visual question-answering tasks (more details in Section 6 of the paper).
However, this model points to a promising future for multimodal systems that can perform a wide array of tasks without being explicitly trained for them. It will also be quite interesting to see how other modalities, such as audio, can be integrated into these systems in the future.
References