Microsoft Presents: Language Is Not All You Need
Microsoft uses images as an additional input modality for LLMs
Microsoft has just introduced a new multimodal large language model (MLLM) called Kosmos-1, a transformer that can learn from additional modalities besides text, such as images. The researchers studied the effect of using image data to train large language models and found that including images as extra context during training allowed the model to learn common-sense knowledge beyond what text descriptions provide, while also opening up potential uses in other applications like robotics. The added modality also let the authors evaluate the model on non-verbal reasoning tests, such as the Raven IQ test.
How it Works
At a high level, the authors encode each image into a vector embedding, feed those vectors into the transformer alongside the text tokens, and use special tokens like <image> to mark where image data begins in the sequence. They chose the Magneto architecture as the backbone model due to its robust performance across multiple modalities, and used xPos relative position encoding, which generalizes well to different sequence lengths. The training task is next-token prediction, where the model's goal is to predict the next token in the sequence (the image embeddings themselves are not predicted). Overall, the model contains roughly 1.6 billion parameters.
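To make that concrete, here is a minimal sketch of how an interleaved image-and-text sequence could be embedded and fed to a decoder-style transformer for next-token prediction. Everything in it is an illustrative assumption: the module sizes, the `img_proj` projection, and the way the <image> slot is spliced in are placeholders, not Kosmos-1's actual Magneto/xPos implementation.

```python
# A minimal sketch (not the authors' code) of feeding an interleaved
# image/text sequence to a causal transformer for next-token prediction.
import torch
import torch.nn as nn

VOCAB_SIZE = 32000      # assumed vocabulary size
D_MODEL = 512           # assumed embedding width (Kosmos-1 is much larger)
IMG_FEAT_DIM = 768      # assumed output size of a separate image encoder

text_embed = nn.Embedding(VOCAB_SIZE, D_MODEL)
# Project image features (e.g. from a frozen vision encoder) into the same
# embedding space as the text tokens.
img_proj = nn.Linear(IMG_FEAT_DIM, D_MODEL)
decoder = nn.TransformerEncoder(  # a causal mask makes this a decoder-style LM
    nn.TransformerEncoderLayer(D_MODEL, nhead=8, batch_first=True),
    num_layers=4,
)
lm_head = nn.Linear(D_MODEL, VOCAB_SIZE)

def forward(text_ids, image_feats, image_slot):
    """Embed the text tokens, splice the projected image embedding in at the
    position of the <image> tag, and predict the next token at each step."""
    tok = text_embed(text_ids)                        # (B, T, D_MODEL)
    img = img_proj(image_feats).unsqueeze(1)          # (B, 1, D_MODEL)
    seq = torch.cat([tok[:, :image_slot], img, tok[:, image_slot:]], dim=1)
    # Standard causal mask so each position only attends to earlier positions.
    T = seq.size(1)
    causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
    hidden = decoder(seq, mask=causal)
    return lm_head(hidden)                            # logits over the vocabulary

# Toy usage: one sequence of 10 text tokens with an image placeholder at position 3.
logits = forward(torch.randint(0, VOCAB_SIZE, (1, 10)),
                 torch.randn(1, IMG_FEAT_DIM),
                 image_slot=3)
print(logits.shape)  # torch.Size([1, 11, 32000])
```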
The Results
The model obtained a 4% increase in accuracy on the Raven IQ test in a zero-shot setting. The added image modality also led to an improvement of 5.8 points when combined with chain-of-thought prompting. The authors applied the model to zero-shot image classification, where text prompts were provided alongside the images as input, and achieved an improvement in performance over models like CLIP. The model was also evaluated on zero-shot image classification both with and without category descriptions in the prompt, and the performance gap between the two settings was dramatic, showing that the model was able to effectively combine the image and language modalities to make predictions.
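As a rough illustration of that prompted zero-shot classification setup, the sketch below scores one prompt per candidate label for the same image and picks the label the model scores highest. The prompt templates and the `score_prompt` callback are hypothetical stand-ins for the model's likelihood of the text given the image; they are not the paper's exact prompts or API.

```python
# A hedged sketch of prompted zero-shot image classification, with and
# without per-category descriptions in the prompt. All names are assumptions.
from typing import Callable, Dict, Optional

LABELS = ["dog", "cat", "horse"]  # hypothetical candidate categories

def build_prompt(label: str, description: Optional[str] = None) -> str:
    # With descriptions, a short verbal definition of the category is included
    # in the prompt; without one, only the bare label name is used.
    if description:
        return f"<image> {label}: {description}. Question: what is in the photo? Answer: {label}"
    return f"<image> Question: what is in the photo? Answer: {label}"

def classify(image,
             score_prompt: Callable[[object, str], float],
             descriptions: Optional[Dict[str, str]] = None) -> str:
    """Return the label whose prompt the model scores highest for this image."""
    best_label, best_score = LABELS[0], float("-inf")
    for label in LABELS:
        desc = descriptions.get(label) if descriptions else None
        score = score_prompt(image, build_prompt(label, desc))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy usage with a dummy scorer (real use would query the model's likelihood).
print(classify(image=None,
               score_prompt=lambda img, txt: len(txt),
               descriptions={"dog": "a domesticated canine", "cat": "a small feline"}))
```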
Conclusion
Overall, including images in the training process of large language models yields performance gains, and it also appears to be a viable path towards integrating these systems with robotics and other systems that rely heavily on visual inputs. It will be interesting to see how LLMs affect the development of autonomous systems, which currently struggle in situations where human-like reasoning is required; this work could be a first step towards solving that issue.
The paper