
Belebele & Qwen-VL

Belebele, a 122-language parallel evaluation benchmark for NLU, and Qwen-VL, a vision-language model fine-tuned on instructions!

Belebele

Meta AI released Belebele, a Natural Language Understanding (NLU) dataset spanning 122 language variants; the name means "big, fat, large, great" in Bambara. Beyond its breadth of language coverage, the authors argue that Belebele's carefully crafted passages, questions, and multiple-choice answers let researchers benchmark their models even on low-resource languages, and that the dataset is designed to distinguish models that exploit biases from ones that generalize. Among common NLP task formats such as summarization and natural language inference (NLI), the authors chose multiple-choice questions (MCQs) because they are the most objective to evaluate: a correct answer is not subject to interpretation, which makes evaluation across many languages fairer.
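To make the MCQ evaluation concrete, here is a minimal sketch of how a Belebele-style item could be scored: each item pairs a passage and question with four choices, and a prediction is simply right or wrong, so the benchmark metric is plain accuracy. The field names (`passage`, `question`, `choices`, `answer_idx`) are illustrative placeholders, not the dataset's actual schema.

```python
# Hypothetical scoring loop for passage + question + 4-choice items.
def score_mcq(items, predict):
    """Accuracy over multiple-choice items: each prediction is either right or wrong."""
    correct = 0
    for item in items:
        pred_idx = predict(item["passage"], item["question"], item["choices"])
        correct += int(pred_idx == item["answer_idx"])
    return correct / len(items)

# Example with a trivial "model" that always picks the first choice.
items = [
    {
        "passage": "A FLORES-200 passage ...",
        "question": "According to the passage, what ...?",
        "choices": ["option A", "option B", "option C", "option D"],
        "answer_idx": 2,
    }
]
print(score_mcq(items, lambda passage, question, choices: 0))  # 0.0
```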

In a nutshell, the diagram above summarizes how the Belebele corpus was created. What should you take away from it? For one, each box has many outgoing arrows, and Belebele stems from FLORES-200, though some passages were handled differently along the way. Expert annotators created candidate questions, which went through both manual and automatic quality checks. Questions that passed were translated and proofread again before becoming part of the Belebele corpus.

Qwen-VL



Researchers at Alibaba Group have released Qwen-VL, a Large Vision-Language Model (LVLM). Qwen-VL is backed by Qwen-7B and a 1.9B-parameter ViT initialized with OpenCLIP ViT-bigG weights. The parameter counts are listed below.

They also introduce a VL Adapter, which is essentially a single cross-attention layer in which learnable query vectors attend to the image features from the vision encoder as keys, with positional encodings included. Why use a VL Adapter? The authors argue it helps with long image feature sequences: by compressing them to a fixed length of 256, the adapter lets the downstream LLM process visual input much more efficiently.
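Here is a minimal PyTorch sketch of that idea: one cross-attention layer whose learnable queries compress a variable-length patch sequence down to 256 tokens. The dimensions (1664 for ViT-bigG features, 4096 for the LM hidden size) are assumptions for illustration, and the 2D positional encodings used in the paper are omitted, so treat this as a sketch of the mechanism rather than the exact module.

```python
import torch
import torch.nn as nn

class VLAdapter(nn.Module):
    def __init__(self, vis_dim=1664, lm_dim=4096, num_queries=256, num_heads=16):
        super().__init__()
        # Learnable query vectors; their count fixes the output length.
        self.queries = nn.Parameter(torch.randn(num_queries, lm_dim))
        self.proj = nn.Linear(vis_dim, lm_dim)  # map ViT features into the query space
        self.attn = nn.MultiheadAttention(lm_dim, num_heads, batch_first=True)

    def forward(self, image_feats):              # (batch, n_patches, vis_dim)
        kv = self.proj(image_feats)              # keys/values come from the vision encoder
        q = self.queries.unsqueeze(0).expand(image_feats.size(0), -1, -1)
        out, _ = self.attn(q, kv, kv)            # queries cross-attend to image features
        return out                               # (batch, 256, lm_dim) regardless of n_patches

adapter = VLAdapter()
feats = torch.randn(2, 1024, 1664)               # e.g. a 448x448 image -> 1024 patches
print(adapter(feats).shape)                      # torch.Size([2, 256, 4096])
```

However long the incoming patch sequence is, the LLM only ever sees 256 visual tokens, which is what keeps higher-resolution inputs affordable.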

Training is broken into three stages: pretraining, multi-task training, and supervised fine-tuning. In stage 1, the LM is frozen and only the ViT and adapter are trained, optimizing the cross-entropy of the output text tokens. In stage 2, they train on higher-quality image data with the LM unfrozen, and the image resolution jumps from 224x224 to 448x448. In the final stage, they perform instruction fine-tuning.
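A rough sketch (not the authors' code) of how that schedule translates into which parameters receive gradients is below; `vit`, `adapter`, and `llm` stand in for the vision encoder, VL adapter, and language model, and the stage-3 pattern (vision encoder frozen during instruction fine-tuning) reflects the paper rather than anything stated above.

```python
import torch.nn as nn

def set_trainable(module: nn.Module, trainable: bool):
    for p in module.parameters():
        p.requires_grad = trainable

def configure_stage(stage: int, vit: nn.Module, adapter: nn.Module, llm: nn.Module):
    if stage == 1:    # pretraining: LM frozen, ViT + adapter learn, 224x224 inputs
        set_trainable(vit, True); set_trainable(adapter, True); set_trainable(llm, False)
    elif stage == 2:  # multi-task training: everything trainable, 448x448 inputs
        set_trainable(vit, True); set_trainable(adapter, True); set_trainable(llm, True)
    else:             # stage 3, instruction fine-tuning: ViT frozen (per the paper)
        set_trainable(vit, False); set_trainable(adapter, True); set_trainable(llm, True)

# Example usage with stand-in modules:
vit, adapter, llm = nn.Linear(8, 8), nn.Linear(8, 8), nn.Linear(8, 8)
configure_stage(1, vit, adapter, llm)
print(any(p.requires_grad for p in llm.parameters()))  # False: LM frozen in stage 1
```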
