
Microsoft Introduces Visual ChatGPT

Microsoft demonstrates how to extend ChatGPT into multiple modalities
Created on March 10 | Last edited on March 10

Text is Not All You Need

Rumors have been floating around about GPT-4 being multimodal. Whether that is the case remains to be seen, but Microsoft has just introduced some extremely interesting work that extends the existing ChatGPT model to operate on multimodal inputs, such as images.
In their paper “Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models”, Microsoft extended ChatGPT’s functionality without retraining the underlying language model on new inputs, building a system that delegates tasks to other vision models. In this system, ChatGPT ultimately serves as a guide that decides how the additional models are used to carry out tasks.
Visual ChatGPT relies on what the authors call a “prompt manager,” which handles much of the visual data used in the system.
At a high level, the prompt manager recursively uses ChatGPT to decide how to convert complex queries into instructions it can use to accomplish visual tasks. It also has access to several Visual Foundation Models (VFMs) and invokes them to accomplish the tasks specified by the user. Overall, the prompt manager can be thought of as an automated prompt engineer that is also capable of using VFMs to answer queries about images.
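To make that loop concrete, here is a minimal Python sketch of what a prompt-manager-style dispatcher could look like. This is not the paper's actual implementation: `call_chatgpt` and the `Tool` wrapper below are hypothetical stand-ins for the ChatGPT API and the Visual Foundation Models.

```python
# Minimal sketch of a prompt-manager-style loop (not the paper's actual code).
# `call_chatgpt` and the Tool wrapper are hypothetical placeholders standing in
# for the ChatGPT API and the Visual Foundation Models (VFMs).

from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Tool:
    name: str
    description: str           # injected into the prompt so ChatGPT knows when to use it
    run: Callable[[str], str]   # takes a text/image-path argument, returns a result string


def call_chatgpt(prompt: str) -> str:
    """Placeholder for an actual ChatGPT API call."""
    raise NotImplementedError


def prompt_manager(user_query: str, image_path: str, tools: Dict[str, Tool]) -> str:
    """Repeatedly ask ChatGPT which VFM to invoke next until it answers directly."""
    tool_list = "\n".join(f"- {t.name}: {t.description}" for t in tools.values())
    history = f"Image: {image_path}\nUser: {user_query}"
    while True:
        decision = call_chatgpt(
            f"You can use these visual tools:\n{tool_list}\n\n"
            f"{history}\n"
            "Reply either 'TOOL <name> <input>' or 'ANSWER <text>'."
        )
        if decision.startswith("ANSWER"):
            return decision.removeprefix("ANSWER").strip()
        _, name, tool_input = decision.split(maxsplit=2)
        result = tools[name].run(tool_input)        # run the chosen VFM
        history += f"\nTool {name} returned: {result}"
```

The key idea the sketch tries to capture is that the language model never touches pixels directly: it only sees tool descriptions, the chat history, and the filenames or text that each VFM returns, and from that it decides which model to call next.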

Example

The paper gives an excellent example of how this would work for a realistic query:
“A user uploads an image of a yellow flower and enters a complex language instruction ‘please generate a red flower conditioned on the predicted depth of this image and then make it like a cartoon, step by step’. With the help of Prompt Manager, Visual ChatGPT starts a chain of execution of related Visual Foundation Models. In this case, it first applies the depth estimation model to detect the depth information, then utilizes the depth-to-image model to generate a figure of a red flower with the depth information, and finally leverages the style transfer VFM based on the Stable Diffusion model to change the style of this image into a cartoon” [1].
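For a rough sense of what that chain looks like in code, here is a hedged sketch using the Hugging Face diffusers library as a stand-in for the paper's VFMs. The checkpoints and file names below are illustrative assumptions, not what the paper itself uses, and the depth2img pipeline folds the depth-estimation and depth-to-image steps into a single call.

```python
# Hedged sketch of the three-step chain from the quoted example, with diffusers
# pipelines standing in for the paper's VFMs (checkpoints and filenames are
# assumptions for illustration only).

import torch
from diffusers import StableDiffusionDepth2ImgPipeline, StableDiffusionImg2ImgPipeline
from diffusers.utils import load_image

device = "cuda" if torch.cuda.is_available() else "cpu"

# Steps 1 + 2: depth-conditioned generation. This pipeline estimates the depth
# map internally, then generates a new image conditioned on it.
depth2img = StableDiffusionDepth2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-depth"
).to(device)
flower = load_image("yellow_flower.png")  # hypothetical input image
red_flower = depth2img(prompt="a red flower", image=flower, strength=0.8).images[0]

# Step 3: restyle the result as a cartoon with an img2img pass.
img2img = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5"
).to(device)
cartoon = img2img(
    prompt="a cartoon drawing of a red flower", image=red_flower, strength=0.6
).images[0]
cartoon.save("red_flower_cartoon.png")
```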

What's Next

What makes this work so impressive is its potential scalability, not only to more language models or VFMs, but also to more modalities, such as audio.
Being able to reuse large models has huge upside for innovation, as the time and capital required to develop new systems is dramatically lowered. In addition, this work should serve as a proof of concept for many new products that rely on visual inputs or require advanced prompt engineering!
The paper: [1] Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models (Wu et al., 2023), https://arxiv.org/abs/2303.04671
Tags: ML News