
BLIP-2: A new Visual Language Model by Salesforce

In this article, we'll explore BLIP-2, a new Vision Language Model by Salesforce which beats the previous state-of-the-art Flamingo on various multimodal tasks.
BLIP-2 is a new visual language model (VLM) capable of multimodal tasks like captioning, visual dialogue, and visual question answering. As you can see below, it works rather well:

Generated caption: "merlion fountain at night with marina bay skyline in the background"
This article will walk you through this new research and its architecture, and then take the model for a test run to see how it performs on various tasks.
Note: You can try out the model now via a Gradio demo hosted on HuggingFace Spaces here.
💡
Here's what we'll be covering:
  • What Is BLIP-2?
  • Overcoming Vision-Language Pretraining Obstacles with BLIP-2
  • The Key Ideas of BLIP-2
  • BLIP-2 Details
  • Taking BLIP-2 for a Ride 🚀
  • Results
  • Limitations of BLIP-2
  • Conclusion

Let's dive in!

What Is BLIP-2?

BLIP-2 is a novel pre-training strategy that leverages off-the-shelf frozen, pre-trained image encoders and language models, bridging the gap between modalities with a lightweight Querying Transformer. This was developed in response to the explosion in the cost of vision-and-language pre-training.
BLIP-2 achieves state-of-the-art performance on a variety of vision-language tasks, surpassing existing methods with significantly fewer trainable parameters. Notably, the model outperforms Flamingo80B by 8.7% on zero-shot VQAv2, with 54 times fewer trainable parameters. The paper also showcases the emerging capabilities of BLIP-2 in zero-shot image-to-text generation, where the model can follow natural language instructions. In this report, we will dive deeper into BLIP-2 and explore its potential applications in the field of vision and language research.

To gain a better understanding of the Vision-Language paradigm and Flamingo, I highly recommend reading the report I recently authored on the topic. This comprehensive report delves deep into the intricacies of Flamingo and provides valuable insights on the Vision-Language paradigm.
💡


Overcoming Vision-Language Pretraining Obstacles with BLIP-2

Vision-language pre-training (VLP) has advanced quickly in recent years, with larger models continuously improving on various downstream tasks. However, these state-of-the-art models require high computational costs during pre-training. To address this, the authors propose a more compute-efficient VLP method by leveraging off-the-shelf pre-trained vision and language models that remain frozen during pre-training. The pre-trained vision models offer a high-quality visual representation of the image, and the pre-trained large language models (LLMs) offer strong language generation and zero-shot transfer abilities.
However, using pre-trained unimodal models for VLP requires aligning visual and textual features, which can be challenging. This is particularly true for frozen unimodal models (like LLMs), which have never seen any images during their pre-training. To address this problem, the authors propose a lightweight module called the Querying Transformer (Q-Former) that bridges the modality gap between the frozen image encoder and the frozen LLM.
Specifically, Q-Former is a lightweight transformer that uses learnable query vectors to extract visual features from the frozen image encoder. It acts as an information bottleneck between the frozen image encoder and the frozen LLM, feeding the most useful visual features to the LLM so it can output the desired text. In the first pre-training stage, the researchers perform vision-language representation learning, which forces the Q-Former to learn the visual representation most relevant to the text. In the second pre-training stage, they perform vision-to-language generative learning by connecting the output of the Q-Former to a frozen LLM and training the Q-Former so that its output visual representation can be interpreted by the LLM. We'll look at both stages in detail in the following sections.

The Key Ideas of BLIP-2

Figure-1: Overview of BLIP-2's framework. [Source: Figure-1 from the paper]
The key ideas of BLIP-2 can be summarized as:
  • BLIP-2 is a powerful approach that effectively combines frozen pre-trained image models and language models to achieve outstanding performance on various vision-language tasks, including visual question answering, image captioning, and image-text retrieval. To bridge the modality gap, BLIP-2 employs a Q-Former model that is pre-trained in two stages: first for representation learning, and then for generative learning.
  • Thanks to its use of large language models (LLMs) such as OPT and FlanT5, BLIP-2 can even perform zero-shot image-to-text generation following natural language instructions, enabling new capabilities such as visual knowledge reasoning and visual conversation.
  • Moreover, BLIP-2 stands out for its compute efficiency, thanks to its use of frozen unimodal models and a lightweight Q-Former. In fact, it outperforms the existing state of the art, Flamingo, by 8.7% on zero-shot VQAv2 while using 54 times fewer trainable parameters.

BLIP-2 Details

In this section, we will explore the model architecture and pretraining strategies employed in BLIP-2, a powerful VLP (Vision-Language Pre-training) framework. BLIP-2 stands for Bootstrapping Language-Image Pre-training with frozen unimodal models.

Q-Former

Figure-2: Model architecture of Q-Former and BLIP-2’s first-stage vision-language representation learning objectives. [Source: Figure-2 (Left) from the paper]
  • Q-Former is proposed as a trainable module to connect a frozen image encoder and a frozen LLM.
  • It extracts a fixed number of output features from the image encoder, regardless of input image resolution.
  • It consists of two transformer submodules that share the same self-attention layers: an image transformer that interacts with the frozen image encoder for visual feature extraction, and a text transformer that can function as both a text encoder and a text decoder.
  • The authors create a set number of learnable query embeddings for the image transformer. The queries interact with each other through self-attention layers, and with frozen image features through cross-attention layers (inserted every other transformer block). Additionally, queries can interact with the text through the same self-attention layers.
  • Depending on the pre-training task, the authors use different self-attention masks to control the interaction between queries and text.
  • The Q-Former is initialized with the pre-trained weights of bert-base, while the cross-attention layers are randomly initialized.
  • Q-Former has only 188M trainable parameters, significantly fewer than the end-to-end fine-tuning approaches used by prior methods.
  • The authors used 32 queries, each with a dimension of 768 (the same as the hidden dimension of the Q-Former).
  • Z is used to denote the output query representation in Figure-2, which is much smaller in size (32 x 768) than the frozen image features (e.g., 257 x 1024 for ViT-L/14); see the shape sketch after this list.
  • This bottleneck architecture, together with the pre-training objectives, forces the queries to extract the visual information which is most relevant to the text.
Note: Bootstrapping refers to using existing resources and capabilities (in this case, pretrained vision and language models).
💡
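
To make the bottleneck concrete, here's a minimal, shape-level sketch in plain PyTorch. This is not the actual LAVIS or HuggingFace implementation (a single nn.MultiheadAttention stands in for the cross-attention layers inside Q-Former); it simply shows how 32 learnable queries attend to frozen ViT-L/14 features and come out as a 32 x 768 tensor Z.

import torch
import torch.nn as nn

# Shape-level sketch only: one cross-attention layer standing in for Q-Former.
num_queries, hidden_dim = 32, 768        # 32 learnable queries, BERT-base width
num_patches, image_feat_dim = 257, 1024  # e.g. frozen ViT-L/14 output

# Learnable query embeddings, shared across all images
query_tokens = nn.Parameter(torch.zeros(1, num_queries, hidden_dim))

# Cross-attention lets the queries read from the frozen image features
cross_attn = nn.MultiheadAttention(embed_dim=hidden_dim, num_heads=12,
                                   kdim=image_feat_dim, vdim=image_feat_dim,
                                   batch_first=True)

frozen_image_feats = torch.randn(1, num_patches, image_feat_dim)  # from the frozen ViT
Z, _ = cross_attn(query_tokens, frozen_image_feats, frozen_image_feats)
print(Z.shape)  # torch.Size([1, 32, 768]) -- much smaller than 1 x 257 x 1024

Whatever the image looks like, the LLM only ever receives these 32 vectors, which is what keeps the interface between the two frozen models small and cheap to train.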

Bootstrap Vision-Language Representation Learning from a Frozen Image Encoder

In the representation learning stage, Q-Former is used in conjunction with a frozen image encoder to perform pretraining on image-text pairs. The main aim of Q-Former is to learn to extract the visual representation that is most informative of the text. To do so, the authors jointly optimize three pre-training objectives. Note that across all three objectives, the input format and model parameters are shared; the objectives differ in the masking strategy applied between queries and text to control their interaction, as shown in Figure-3.
Figure-3: The self-attention masking strategy for each objective to control query-text interaction. [Source: Figure-2 (right) from the paper]

Image-Text Contrastive Learning (ITC)

This is a typical contrastive learning pipeline where the model learns to align image and text representations such that their mutual information is maximized. To accomplish this, the authors contrast the similarity of a positive image-text pair against that of negative pairs. The output query representation Z from the image transformer is aligned with the text representation t from the text transformer. Since Z contains one output embedding per query, the similarity between each query output and the text embedding is computed, and the highest one is taken as the image-text similarity. A unimodal self-attention mask is used to prevent the queries and the text from seeing each other (to prevent information leakage).
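
Below is a rough sketch of what this objective could look like in PyTorch. It assumes Z (batch x 32 x 768) comes from the image transformer and t (batch x 768) is the text transformer's [CLS] embedding; the temperature value is illustrative rather than taken from the paper.

import torch
import torch.nn.functional as F

def itc_loss(Z, t, temperature=0.07):
    """Illustrative image-text contrastive loss.
    Z: (B, 32, 768) output query embeddings, t: (B, 768) text embeddings."""
    Z = F.normalize(Z, dim=-1)
    t = F.normalize(t, dim=-1)
    # similarity of every image's queries with every text: (B_img, B_txt, num_queries)
    sim = torch.einsum("iqd,jd->ijq", Z, t)
    # keep the highest-scoring query per image-text pair
    sim, _ = sim.max(dim=-1)                      # (B_img, B_txt)
    targets = torch.arange(Z.size(0), device=Z.device)
    loss_i2t = F.cross_entropy(sim / temperature, targets)
    loss_t2i = F.cross_entropy(sim.t() / temperature, targets)
    return (loss_i2t + loss_t2i) / 2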

Image-grounded Text Generation (ITG)

The ITG loss is utilized to train the Q-Former model to generate text based on given input images. However, the Q-Former architecture lacks a direct communication pathway between the frozen image encoder and text tokens. To generate text, the necessary information must first be extracted by the queries and then relayed to the text tokens through self-attention layers. Consequently, queries are forced to extract visual features that encompass all the text's information. To regulate query-text interaction, the authors employ a multimodal causal self-attention mask. While queries can interact with one another, they are unable to attend to text tokens. On the other hand, each text token can attend to all queries and its previous text tokens.
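
As a small illustration (my own sketch, not code from the paper), here's how such a multimodal causal mask could be constructed, where True means "may attend":

import torch

def multimodal_causal_mask(num_queries: int, num_text_tokens: int) -> torch.Tensor:
    """Illustrative attention mask for the ITG objective."""
    n = num_queries + num_text_tokens
    mask = torch.zeros(n, n, dtype=torch.bool)
    # queries attend only to other queries
    mask[:num_queries, :num_queries] = True
    # each text token attends to all queries ...
    mask[num_queries:, :num_queries] = True
    # ... and to itself plus the text tokens before it (causal)
    mask[num_queries:, num_queries:] = torch.ones(num_text_tokens, num_text_tokens).tril().bool()
    return mask

print(multimodal_causal_mask(2, 3).int())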

Image-Text Matching (ITM)

The objective of Image-Text Matching (ITM) is to learn a fine-grained alignment between the image and text representations. The ITM task is a binary classification in which the model has to predict whether an image-text pair is positive (matched) or negative (unmatched). To achieve this, the authors utilize a bi-directional self-attention mask that allows all queries and text tokens to attend to each other, so the output query embeddings Z capture multimodal information. Each of these embeddings is fed into a two-class linear classifier to obtain a logit, and the output matching score is obtained by averaging the logits across all queries.
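
A minimal sketch of that head might look like this (illustrative only; itm_head is a stand-in name, and in the real model the classifier sits on top of the shared Q-Former layers):

import torch
import torch.nn as nn

itm_head = nn.Linear(768, 2)         # two-class linear classifier

Z = torch.randn(4, 32, 768)          # output query embeddings for a batch of 4 image-text pairs
logits = itm_head(Z).mean(dim=1)     # average the per-query logits -> (4, 2)
match_prob = logits.softmax(dim=-1)[:, 1]  # probability that each pair is matched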

Bootstrap Vision-to-Language Generative Learning from a Frozen LLM

The generative pre-training phase leverages the generative language capability of a frozen LLM by connecting it to Q-Former (with a frozen image encoder). As shown in Figure-4, a fully-connected (FC) layer is used to linearly project the output query embeddings Z to the same dimension as the text embeddings of the LLM. These projected embeddings serve as soft visual prompts that enable the LLM to incorporate the visual information extracted by Q-Former. Since Q-Former is already pre-trained to extract language-informative visual representations, it functions as an information bottleneck that supplies relevant information to the LLM while filtering out irrelevant visual details. This reduces the burden on the LLM to learn vision-language alignment and thereby helps alleviate the catastrophic forgetting problem.
Figure-4: BLIP-2’s second-stage vision-to-language generative pre-training, which bootstraps from frozen large language models (LLMs). [Source: Figure-3 from the paper]
The authors experimented with two types of LLMs: decoder-based LLMs and encoder-decoder-based LLMs. For decoder-based LLMs, they pre-train with the language modeling loss, where the frozen LLM is tasked with generating text conditioned on the visual representation from Q-Former. For encoder-decoder-based LLMs, they pre-train with the prefix language modeling loss: the text is split into two parts, the prefix text is concatenated with the visual representation as input to the LLM's encoder, and the suffix text is used as the generation target for the LLM's decoder.
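
To make the wiring explicit, here's a rough sketch of the second stage, assuming a decoder-only LLM with an embedding width of 2560 (the size used by OPT-2.7B). Only the projection layer and Q-Former would be trained; the LLM stays frozen.

import torch
import torch.nn as nn

llm_hidden = 2560                     # assumed LLM embedding width (e.g. OPT-2.7B)
proj = nn.Linear(768, llm_hidden)     # the FC layer that maps Z into the LLM's embedding space

Z = torch.randn(1, 32, 768)           # output of the (already pre-trained) Q-Former
visual_prompts = proj(Z)              # (1, 32, llm_hidden) soft visual prompts

text_embeds = torch.randn(1, 12, llm_hidden)           # embeddings of the (prefix) text tokens
llm_inputs = torch.cat([visual_prompts, text_embeds], dim=1)
# llm_inputs would be fed to the frozen LLM (e.g. via `inputs_embeds`), and the
# language-modeling loss is computed only on the text (or suffix) tokens.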
Now that we've seen how BLIP-2 works, it's time to actually use it and see how it performs!

Taking BLIP-2 for a Ride 🚀

The best thing about BLIP-2 is that it's completely open-source. It was originally released as part of Salesforce's LAVIS library; you can see the project page of BLIP-2 here. The model was recently ported to HuggingFace Transformers and can be used like any other HuggingFace model. This report will focus on using the HuggingFace implementation to execute a variety of tasks with BLIP-2.

Install

First, we need to install HuggingFace Transformers (🤗). The BLIP-2 implementation is brand new and not yet part of a stable release at the time of writing this report, so we install it from source.
pip install git+https://github.com/huggingface/transformers

Load Image

Next, we load an image. You can load any arbitrary image you want.
import requests
from PIL import Image

url = 'https://storage.googleapis.com/sfr-vision-language-research/LAVIS/assets/merlion.png'
image = Image.open(requests.get(url, stream=True).raw).convert('RGB')
display(image.resize((596, 437)))  # display() is available in Jupyter/IPython notebooks; use image.show() in a plain script


Load Model and Processor

Next, we load the model and processor. Here we load a BLIP-2 checkpoint that leverages the pre-trained OPT model by Meta AI, a 2.7 billion parameter LLM. Furthermore, we load the model in float16 to save memory. You can choose any of the available checkpoints here. However, keep in mind that the size of the checkpoint you select directly impacts the memory requirements for running inference, so choose one that fits your needs (and hardware).
import torch
from transformers import AutoProcessor, Blip2ForConditionalGeneration
device = "cuda" if torch.cuda.is_available() else "cpu"

MODEL_ID = "Salesforce/blip2-opt-2.7b"
processor = AutoProcessor.from_pretrained(MODEL_ID)
# by default `from_pretrained` loads the weights in float32
# we load in float16 instead to save memory
model = Blip2ForConditionalGeneration.from_pretrained(MODEL_ID, torch_dtype=torch.float16)
model.to(device)
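
If you're short on GPU memory, you can alternatively try loading the checkpoint in 8-bit. This follows the general 🤗 Transformers pattern for int8 loading (it requires the accelerate and bitsandbytes packages) and is entirely optional:

# Optional alternative: load the weights in 8-bit to reduce GPU memory usage
# (requires `pip install accelerate bitsandbytes`)
model_8bit = Blip2ForConditionalGeneration.from_pretrained(
    MODEL_ID, device_map="auto", load_in_8bit=True
)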

We can use the above-loaded model in various ways as described below👇

Image Captioning

If no text prompt is provided, the model starts generating text from the BOS (beginning-of-sequence) token. In other words, it generates a caption for the image.
inputs = processor(image, return_tensors="pt").to(device, torch.float16)


generated_ids = model.generate(**inputs, max_new_tokens=20)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
print(generated_text)
Output: singapore merlion fountain

Prompted image captioning

We can also provide a short prompt, and the model will continue it with contextually relevant text that describes the content of the image in greater detail.
prompt = "this is a picture of"

inputs = processor(image, text=prompt, return_tensors="pt").to(device, torch.float16)

generated_ids = model.generate(**inputs, max_new_tokens=20)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
print(generated_text)
Output: a statue in front of a water fountain

Visual-Question Answering (VQA)

We can even ask questions about the image, and the model will answer them.
prompt = "Question: which city is this? Answer:"

inputs = processor(image, text=prompt, return_tensors="pt").to(device, torch.float16)

generated_ids = model.generate(**inputs, max_new_tokens=10)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
print(generated_text)
Output: singapore

Chat-based prompting

This is one of the coolest applications of BLIP-2. We can create a ChatGPT-like interface by concatenating each generated response to the ongoing conversation. First, we provide the model with a prompt such as "Which city is this?" and it generates an answer such as "Singapore."
We then concatenate this question-answer pair to the conversation, ask a follow-up question such as "Why is Singapore an important city?", and feed the whole conversation back into the model. Keep in mind, however, that the context length for models like OPT and T5 (which are used in BLIP-2) is limited to 512 tokens, so the conversation context cannot grow too long.
Here's a code example doing exactly that.
context = [
("which city is this?", "singapore"),
("why?", "it has a statue of a merlion"),
]
question = "where is the name merlion coming from?"
template = "Question: {} Answer: {}."

prompt = " ".join([template.format(context[i][0], context[i][1]) for i in range(len(context))]) + " Question: " + question + " Answer:"

print(prompt)
Output: Question: which city is this? Answer: singapore. Question: why? Answer: it has a statue of a merlion. Question: where is the name merlion coming from? Answer:
inputs = processor(image, text=prompt, return_tensors="pt").to(device, torch.float16)

generated_ids = model.generate(**inputs, max_new_tokens=10)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
print(generated_text)
Output: merlion is a mythical creature from malays
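
If you want to take this further, the pattern above is easy to wrap into a small helper. The ask function below is hypothetical (it's not part of the HuggingFace API); it simply rebuilds the prompt from the running context before every generation call, reusing the template, processor, model, and device defined earlier.

def ask(image, question, context=None, max_new_tokens=20):
    """Hypothetical helper: build the running prompt, generate an answer,
    and return it together with the updated conversation context."""
    context = context or []
    history = " ".join(template.format(q, a) for q, a in context)
    prompt = (history + " " if history else "") + "Question: " + question + " Answer:"
    inputs = processor(image, text=prompt, return_tensors="pt").to(device, torch.float16)
    generated_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    answer = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
    return answer, context + [(question, answer)]

answer, context = ask(image, "which city is this?")
answer, context = ask(image, "why?", context)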

To try the chat-like interface, you can use the Gradio demo hosted on HuggingFace Spaces. The demo uses the Salesforce/blip2-flan-t5-xxl checkpoint, which is their largest and best-performing checkpoint.
💡

Results

The following results were generated with the Salesforce/blip2-opt-2.7b checkpoint from the HuggingFace model hub, using an A5000 GPU from jarvislabs.ai. You can reproduce them from this GitHub repository.
💡

Image Captioning


Notes:
  • 1st picture: The model successfully identified that it was nighttime and that the Marina Bay area was visible in the distant background.
  • 4th picture: BLIP-2 accurately detected the reflection of the Taj Mahal in the water, showcasing its advanced image recognition capabilities.
  • 5th picture: Impressively, the model was able to recognize a scene from a movie, demonstrating its proficiency in identifying complex visual stimuli.
  • 6th picture: The model displayed its ability to recognize famous personalities, showcasing its impressive capacity to identify specific individuals within images.

Prompted image captioning




Based on the above examples, it is evident that the model can be prompted to provide specific information relevant to the picture at hand.

Visual-Question Answering



Notes:
  • 1st picture: The model can even tell what food is in the picture and what its ingredients are.
  • 3rd picture: The model displays its ability to identify and describe unusual photographs.
  • 4th picture: The model impresses with its capability to provide insightful information about historical places without explicitly mentioning their names.
  • 5th picture: The model can comprehend high-action images.
  • 6th picture: The model can identify a city based on its famous buildings, statues, or landmarks.
  • 7th picture: The model correctly identifies the car as a Tesla and recognizes that it is an electric vehicle.

Limitations of BLIP-2

Although recent LLMs can perform in-context learning given few-shot examples, experiments with BLIP-2 did not show improved VQA performance when the LLM was given in-context VQA examples. The authors attribute this lack of in-context learning capability to the pretraining dataset used. A similar observation was made in the Flamingo paper, which uses a closed-source interleaved image-and-text dataset with multiple image-text pairs per sequence; the authors aim to create a similar dataset in future work.
BLIP-2's image-to-text generation can also produce unsatisfactory results for various reasons, including inaccurate knowledge from the LLM, activating an incorrect reasoning path, or not having up-to-date information about new image content. Additionally, the authors note the risks associated with LLMs, such as outputting offensive language, propagating social bias, or leaking private information, and suggest remediation approaches such as using instructions to guide the model's generation or training on a filtered dataset with harmful content removed.

Conclusion

BLIP-2 is an innovative and resource-efficient approach to vision-language pre-training that utilizes frozen pre-trained image encoders and LLMs. With minimal trainable parameters during pre-training, BLIP-2 delivers outstanding results on a range of vision-language tasks. It also shows promising capability in zero-shot, instruction-following image-to-text generation. BLIP-2 is an important step toward building a multimodal conversational AI agent.

References

  • Li, Junnan, Dongxu Li, Silvio Savarese, and Steven Hoi. "BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models." arXiv preprint arXiv:2301.12597, 2023.
  • BLIP-2 project page in Salesforce's LAVIS library: https://github.com/salesforce/LAVIS/tree/main/projects/blip2
  • HuggingFace Transformers documentation for BLIP-2: https://huggingface.co/docs/transformers/model_doc/blip-2