
PaLM-E: An Embodied Visual-Language Model

PaLM but now with the ability to reason and understand images!
Created on March 7 | Last edited on March 8

TL;DR


PaLM-E is PaLM but Embodied! That is to say, PaLM-E is a Large Language Model (LLM) trained on multimodal data by interleaving text prompt tokens with continuous embeddings computed from images and state observations. These embeddings are produced by encoders the authors selected and trained.
The authors describe PaLM-E as a high-level policy that sequences low-level policies to carry out decisions. Their argument is that a single visual-language model can be trained to perform embodied reasoning (decision-making grounded in the real world).

What is PaLM?

PaLM (or Pathways Language Model) is a Large Language Model (LLM) made by Google. It's 540 billion parameters! That's a little over 3 times as large as GPT-3 (175 billion parameters), the model family behind ChatGPT. This enormous model is strong at lots of tasks like question answering, language understanding, translation, summarization, simple reasoning, arithmetic, and more.

What is PaLM-E?

PaLM-E expands upon PaLM by granting it multimodal, embodied reasoning.
In other words, they found ways to encode observations from embodied tasks (robot states, images) into embedding vectors that live in the same space as the usual text embeddings.
With access to embeddings from these different modalities, the model can understand an image, generate text about it, and instruct a robot to act.
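To make that concrete, here is a minimal sketch (in PyTorch, not the authors' code) of the core idea: embeddings from non-text encoders are spliced into the sequence of text token embeddings before it enters the language model. The dimensions, module names, and the simple "prepend" layout are all illustrative assumptions.
```python
# A minimal sketch (not the authors' code) of the core PaLM-E idea:
# continuous embeddings from non-text encoders are joined with the
# sequence of text token embeddings before it enters the language model.
# All module names and shapes here are illustrative assumptions.

import torch
import torch.nn as nn

D_MODEL = 512  # assumed LLM embedding width (PaLM-E's is much larger)

text_embedder = nn.Embedding(32_000, D_MODEL)       # stand-in for the LLM's token embedder
image_projector = nn.Linear(1024, D_MODEL)          # stand-in projector for ViT features
state_encoder = nn.Sequential(                      # simple MLP for robot state vectors
    nn.Linear(7, D_MODEL), nn.ReLU(), nn.Linear(D_MODEL, D_MODEL)
)

def build_input_sequence(token_ids, vit_features, robot_state):
    """Interleave text embeddings with image and state embeddings.

    Here the multimodal embeddings are simply prepended; in the paper they
    can appear at arbitrary placeholder positions inside the prompt.
    """
    text = text_embedder(token_ids)                  # (T, D_MODEL)
    image = image_projector(vit_features)            # (N_patches, D_MODEL)
    state = state_encoder(robot_state).unsqueeze(0)  # (1, D_MODEL)
    return torch.cat([image, state, text], dim=0)    # one sequence for the LLM

seq = build_input_sequence(
    token_ids=torch.randint(0, 32_000, (12,)),
    vit_features=torch.randn(16, 1024),
    robot_state=torch.randn(7),
)
print(seq.shape)  # torch.Size([29, 512]) -> fed to the (frozen or fine-tuned) LLM
```
In the full model, the LLM can either be kept frozen (training only the encoders) or fine-tuned end to end on these mixed sequences.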
There are a few nuances in how they encode the non-text modalities into embeddings:
  • a simple MLP to map states/observations to embeddings
  • a 22 billion parameter ViT to generate image embeddings
    • to inject the idea that an image is a collection of object instances rather than a static grid (an object-centric representation), they decompose the ViT image features using ground-truth masks of the objects in the image
  • an Object Scene Representation Transformer (OSRT) generates 3D-centric neural scene representations, which an MLP then maps into the embedding space fed to PaLM-E
  • entity referral is handled by passing in special tokens such as <obj_1> to denote a particular object in an image
Notice that each of these modalities has an encoder or projector that maps that type of data into an embedding vector, which is then combined with the text embedding vectors and fed into PaLM-E.
More details can be found in their paper, linked below.
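Entity referral fits into the same picture. Below is a hedged sketch (again, an assumption-laden illustration rather than the paper's implementation) of how projected object-slot features, e.g. from OSRT or from mask-decomposed ViT features, could replace the embeddings of placeholder tokens like <obj_1> inside the prompt.
```python
# A hedged sketch of entity referral with object-centric inputs, not the
# paper's implementation. Assumption: each object slot is projected to the
# LLM width and inserted at the position of its placeholder token (<obj_1>).

import torch
import torch.nn as nn

D_MODEL, SLOT_DIM = 512, 256
slot_projector = nn.Sequential(                  # MLP from slot features to LLM space
    nn.Linear(SLOT_DIM, D_MODEL), nn.ReLU(), nn.Linear(D_MODEL, D_MODEL)
)

def splice_object_slots(text_embeds, placeholder_positions, object_slots):
    """Replace placeholder-token embeddings (e.g. <obj_1>, <obj_2>) with
    projected object-slot embeddings, keeping the rest of the prompt intact."""
    out = text_embeds.clone()
    projected = slot_projector(object_slots)     # (num_objects, D_MODEL)
    for pos, slot in zip(placeholder_positions, projected):
        out[pos] = slot
    return out

prompt_embeds = torch.randn(20, D_MODEL)         # embedded prompt containing <obj_1>, <obj_2>
spliced = splice_object_slots(prompt_embeds, [5, 9], torch.randn(2, SLOT_DIM))
print(spliced.shape)  # torch.Size([20, 512])
```
Once the slots occupy those positions, an instruction like "pick up <obj_1>" can refer to a specific object in the scene without describing it in words.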

References

Tags: ML News