
MiniGPT-4

An open source alternative to GPT-4
Created on April 19|Last edited on April 19
GPT-4 was released as a closed-source model, but that hasn't stopped researchers from recreating some of its functionality, in particular the multimodal capabilities GPT-4 possesses.
Extending transformers' reasoning ability to the vision modality could have a massive impact in areas like robotics and self-driving cars, so it's exciting to see growing interest in multimodal models. Researchers from King Abdullah University of Science and Technology recently introduced MiniGPT-4, which, like GPT-4, can generate detailed descriptions of images and even create websites from handwritten text instructions.

Architecture

The model aligns a frozen language model (Vicuna) with a frozen visual encoder (ViT-G/14 from EVA-CLIP, paired with a Q-Former) using a single trainable linear projection layer. The architecture can be seen below:


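The core of the alignment is simple: the frozen vision stack emits a fixed set of visual tokens, and one linear layer maps them into the language model's embedding space so they can be fed to Vicuna alongside text tokens. Here is a minimal numpy sketch of that projection step; the dimensions and the random stand-in features are illustrative assumptions, not the paper's exact values.

```python
import numpy as np

# Illustrative dimensions, not the paper's exact values.
NUM_QUERY_TOKENS = 32   # the Q-Former outputs a fixed set of query tokens
QFORMER_DIM = 768       # hypothetical Q-Former hidden size
LLM_DIM = 4096          # hypothetical Vicuna embedding size

rng = np.random.default_rng(0)

# Stand-in for the frozen vision stack: in MiniGPT-4 these features come
# from ViT-G/14 (EVA-CLIP) followed by a Q-Former; here they are random.
visual_tokens = rng.standard_normal((NUM_QUERY_TOKENS, QFORMER_DIM))

# The single trainable linear projection that aligns vision with language.
W = rng.standard_normal((QFORMER_DIM, LLM_DIM)) * 0.02
b = np.zeros(LLM_DIM)

# Project the visual tokens into the language model's embedding space;
# the result is prepended to the text-token embeddings before decoding.
llm_visual_embeds = visual_tokens @ W + b
print(llm_visual_embeds.shape)  # (32, 4096)
```

Because only this one matrix (plus bias) is trained, the alignment step is cheap compared to training either the vision encoder or the language model.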
The linear projection layer in MiniGPT-4 plays a crucial role in aligning visual information with the language model. Interestingly, this has parallels with how the human visual system processes information: it takes in massive amounts of data and encodes it down to a much more compact representation. The model was trained in two stages.
In the first stage, it was trained on a large dataset of image-text pairs to acquire knowledge about how language and vision are related. In the second stage, the model was fine-tuned on a smaller but higher-quality image-text dataset to enhance its generation reliability and usability.
After this second stage, the model produced more natural and reliable responses, such as writing stories and poems inspired by given images, providing solutions to problems shown in images, and explaining how to cook a dish based on a photo of the food.
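Notably, both stages update the same small set of weights and differ mainly in their data. The sketch below makes that split explicit; the parameter-group names and stage descriptions are hypothetical labels for illustration, not identifiers from the MiniGPT-4 codebase.

```python
# Hypothetical names for MiniGPT-4's parameter groups; in both training
# stages the vision encoder, Q-Former, and Vicuna stay frozen, and only
# the linear projection layer is updated.
PARAM_GROUPS = ["vit_eva_clip", "q_former", "projection_layer", "vicuna_llm"]

# Illustrative stage descriptions: stage 1 uses a large, noisy corpus of
# image-text pairs; stage 2 a small, curated set of detailed descriptions.
STAGE_DATA = {
    1: "large image-text pair corpus (alignment pretraining)",
    2: "small curated image-description set (fine-tuning)",
}

def trainable_params(stage: int) -> dict:
    """Map each parameter group to whether it is updated in a given stage."""
    if stage not in STAGE_DATA:
        raise ValueError("MiniGPT-4 is trained in exactly two stages")
    return {name: (name == "projection_layer") for name in PARAM_GROUPS}

for stage in (1, 2):
    print(stage, STAGE_DATA[stage], trainable_params(stage))
```

Keeping the trainable surface this small is what makes the second-stage fine-tune practical on a modest, high-quality dataset.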

Open Source Progress

It is exciting to see more focus on multimodal models in the research community, and MiniGPT-4 is a significant step in that direction. It is also encouraging to see open-source researchers closing the gap with OpenAI's models.
Competition in this space will drive progress and help ensure that the future of AI is safer and more beneficial for everyone. With MiniGPT-4 as an example of what is possible, there is plenty of room for new developments in multimodal models.

Tags: ML News