
TinyGPT-V: A New Lightweight Multimodal LLM

A new hyper-efficient multi-modal LLM
Developed by researchers from Anhui Polytechnic University, Nanyang Technological University, and Lehigh University, TinyGPT-V is designed to bridge the gap between high performance and computational accessibility in the realm of multimodal learning, which combines language and visual elements.
TinyGPT-V stands out because it can be trained on a 24 GB GPU and deployed on an 8 GB GPU or a CPU, making it significantly more accessible for a range of applications. It is built on the Phi-2 language model, known for its efficiency and performance, and integrates a pre-trained ViT as the vision module. The model has 2.8 billion parameters in total and undergoes a quantization process that makes it suitable for local deployment and inference on a variety of devices.
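To make the deployment claim concrete, here is a minimal sketch of loading the Phi-2 backbone in 8-bit precision with Hugging Face transformers and bitsandbytes. This is not TinyGPT-V's official inference code; it only illustrates the kind of quantized loading that makes inference on an 8 GB GPU feasible, and it uses the public microsoft/phi-2 checkpoint rather than the TinyGPT-V weights.

```python
# Illustrative only: load the Phi-2 backbone in 8-bit precision so it fits in
# ~8 GB of GPU memory. This uses the public microsoft/phi-2 checkpoint, not
# the TinyGPT-V weights, and is not the project's official inference code.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "microsoft/phi-2"  # Phi-2 is the language backbone TinyGPT-V builds on

quant_config = BitsAndBytesConfig(load_in_8bit=True)  # 8-bit weights via bitsandbytes

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",  # place layers across available GPU/CPU memory
)

prompt = "Describe what a vision-language model does in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```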

Architecture

TinyGPT-V's architecture combines a visual encoder, linear projection layers, and a large language model backbone. The visual encoder is EVA, a ViT variant, and it stays frozen throughout training. The model processes images at 224x224 resolution in the early training stages and at 448x448 in the final stage. The first projection step uses the Q-Former from BLIP-2 to embed visual features for the language model while keeping the number of trainable parameters small. A second linear projection layer, initialized from a Gaussian distribution, bridges the gap between the Q-Former output and the language model's embedding space.

The language model backbone is Phi-2, a 2.7-billion-parameter model with strong reasoning and language comprehension. To stabilize training, TinyGPT-V applies normalization techniques and LoRA (Low-Rank Adaptation), which help prevent computational issues such as NaN values, and it adds Query-Key Normalization for efficiency in low-resource settings. Together, these choices let TinyGPT-V perform multimodal understanding efficiently despite its reduced parameter count.
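As a rough illustration of this data flow, the sketch below shows how frozen visual features could be projected into the language model's embedding space. It is not the official implementation: the module objects are stand-ins, and the hidden sizes (768 for the Q-Former output, 2560 for Phi-2) are assumptions based on typical configurations.

```python
# Illustrative sketch (not the official implementation) of the forward path:
# a frozen ViT produces patch features, a BLIP-2-style Q-Former compresses
# them into query tokens, and two linear projections map those tokens into
# Phi-2's embedding space. The hidden sizes below are assumptions.
import torch
import torch.nn as nn

class ProjectionBridge(nn.Module):
    """Maps Q-Former output tokens into the language model's embedding space."""
    def __init__(self, qformer_dim=768, lm_dim=2560):
        super().__init__()
        self.proj1 = nn.Linear(qformer_dim, lm_dim)  # first linear projection layer
        self.proj2 = nn.Linear(lm_dim, lm_dim)       # second layer, Gaussian-initialized
        nn.init.normal_(self.proj2.weight, std=0.02)
        nn.init.zeros_(self.proj2.bias)

    def forward(self, visual_tokens):                 # (B, num_queries, qformer_dim)
        return self.proj2(self.proj1(visual_tokens))  # (B, num_queries, lm_dim)

def build_multimodal_inputs(vit, qformer, bridge, lm_embed, images, text_ids):
    """Concatenate projected visual tokens with text token embeddings for the LM."""
    with torch.no_grad():                  # the visual encoder stays frozen in training
        patch_feats = vit(images)          # (B, num_patches, vit_dim)
        vis_tokens = qformer(patch_feats)  # (B, num_queries, qformer_dim)
    vis_embeds = bridge(vis_tokens)        # only the projections (and LoRA) are trained
    txt_embeds = lm_embed(text_ids)        # Phi-2's token embedding layer
    return torch.cat([vis_embeds, txt_embeds], dim=1)  # fed to the LoRA-adapted Phi-2
```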


Training

TinyGPT-V is trained in four stages, each with its own learning rate schedule and loss profile, which gives the model robust performance across visual question answering and referring expression comprehension tasks.

In the first stage, the model learns basic vision-language alignment from roughly 5 million image-text pairs drawn from the Conceptual Captions, SBU, and LAION datasets. The second stage re-uses these datasets to train the LoRA module, equipping the language backbone to handle image inputs. The third stage fine-tunes the model on instruction-style image-text pairs from MiniGPT-4 or LLaVA to produce more human-like responses. The fourth stage focuses on multi-task learning, improving conversation, image captioning, and object parsing skills with multimodal instruction datasets such as LLaVA, Flickr30k, a mixed multi-task dataset, and Unnatural Instructions.
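The sketch below shows one way such a staged schedule could be organized: a list of stage configurations naming the datasets described above and the parameter groups that might be unfrozen at each step. The dataset names follow the text; the group names and the choice of what is trainable per stage are illustrative assumptions, not the authors' exact recipe.

```python
# Illustrative stage schedule: which datasets feed each of the four stages and
# which parameter groups might be unfrozen. Dataset names follow the text above;
# the group names and unfreezing choices are assumptions, not the exact recipe.
STAGES = [
    {"name": "stage1_vl_pretraining",
     "datasets": ["conceptual_captions", "sbu", "laion"],   # ~5M image-text pairs
     "trainable": ["projection"]},
    {"name": "stage2_lora_training",
     "datasets": ["conceptual_captions", "sbu", "laion"],
     "trainable": ["projection", "lora"]},
    {"name": "stage3_instruction_tuning",
     "datasets": ["minigpt4_or_llava_instructions"],
     "trainable": ["projection", "lora"]},
    {"name": "stage4_multitask_learning",
     "datasets": ["llava", "flickr30k", "mixed_multitask", "unnatural_instructions"],
     "trainable": ["projection", "lora"]},
]

def set_trainable(model, groups):
    """Freeze every parameter, then unfreeze only the named groups by name match."""
    for param in model.parameters():
        param.requires_grad = False
    for name, param in model.named_parameters():
        if any(group in name for group in groups):
            param.requires_grad = True
```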

Results

TinyGPT-V, with 2.8 billion parameters, delivers competitive results across multiple visual question answering benchmarks, rivaling models with nearly 13 billion parameters. It leads the VSR zero-shot task with a score of 53.2% and performs strongly on the IconVQ challenge at 43.3%. On GQA it scores 33.6%, trailing InstructBLIP's 49.5%; on VizWiz it reaches 24.8%; and on the Hateful Memes dataset it scores 53.2%, closely matching InstructBLIP's top score. These results highlight TinyGPT-V's efficiency and its ability to compete with much larger models.

In conclusion, TinyGPT-V represents a significant step toward efficient, cost-effective, and capable multimodal large language models (MLLMs). It opens up new possibilities for practical real-world applications, particularly in scenarios where computational resources are limited. The researchers have made their code and training weights publicly available, contributing to the broader AI and machine learning community.




Tags: ML News