Google Combines RL and LLMs to Train Robots
Using visual-language models as a form of supervision, Google researchers have trained agents to achieve tasks within a simulated environment.
Leveraging recent advances in large language models, Google researchers have developed a method that merges the strengths of Visual-Language Models (VLMs) and Large Language Models (LLMs). This integration allows robots to draw on the existing reasoning abilities of LLMs to interact with their environments and ultimately learn faster.
The Foundation
The vision driving this work is to use Foundation Models: models pre-trained on massive datasets that provide a solid knowledge base for an agent that learns through reinforcement learning (RL).
The Architecture
The architecture splits into three interconnected components:
LLM (FLAN-T5): FLAN-T5 takes on the role of deconstructing high-level task instructions into a series of more manageable sub-goals. It creates a sequence of text-based sub-goals, which the robotic agent will use to accomplish the task.

VLM (CLIP): Following FLAN-T5, we have CLIP, a vital intermediary that bridges the gap between visual inputs and language. It offers a mechanism to relate the visual data the robot collects to textual descriptions. Incorporating an image encoder and a text encoder, it produces 128-dimensional embedding vectors trained so that matching image-text pairs have high cosine similarity. This makes it possible to judge whether a textual description of a goal matches the robot's current visual observations. In essence, CLIP acts as a reward function, determining if the current state of the environment, captured through a third-person view, aligns with the sub-goal generated by the FLAN-T5 LLM - for instance, stacking a blue block on a red one. (A brief sketch of sub-goal generation and this reward computation follows the component list.)

Language-Conditioned Policy Network: The last piece of the puzzle is the Language-Conditioned Policy Network. This component takes the sub-goals devised by FLAN-T5 and the current state of the environment and converts them into actions for the robot to perform. It effectively "grounds" the language instructions into tangible, executable actions, ensuring that the model's commands are translated into the physical world. It uses the VLM as supervision, gaining feedback when its actions accomplish the sub-goals generated by FLAN-T5.

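To make the first two components concrete, here is a minimal sketch of how an LLM-based sub-goal generator and a CLIP-based reward could be wired together with off-the-shelf checkpoints. This is an illustration, not the authors' implementation: the Hugging Face model names ("google/flan-t5-base", "openai/clip-vit-base-patch32"), the prompt, and the similarity threshold are assumptions, and the embedding size of these public checkpoints differs from the 128-dimensional embeddings mentioned above.

```python
# Sketch only: hypothetical prompt and threshold; public checkpoints stand in for the
# models used in the paper.
import torch
from PIL import Image
from transformers import (
    T5ForConditionalGeneration, T5Tokenizer,
    CLIPModel, CLIPProcessor,
)

# --- 1) LLM: decompose a high-level instruction into text sub-goals ----------------
t5_tok = T5Tokenizer.from_pretrained("google/flan-t5-base")
t5 = T5ForConditionalGeneration.from_pretrained("google/flan-t5-base")

def generate_subgoals(instruction: str) -> list[str]:
    """Ask FLAN-T5 for a newline-separated list of sub-goals (hypothetical prompt)."""
    prompt = f"Break the task into short sub-goals, one per line: {instruction}"
    ids = t5_tok(prompt, return_tensors="pt").input_ids
    out = t5.generate(ids, max_new_tokens=64)
    text = t5_tok.decode(out[0], skip_special_tokens=True)
    return [line.strip() for line in text.split("\n") if line.strip()]

# --- 2) VLM: score how well the current camera image matches a sub-goal ------------
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_reward(image: Image.Image, subgoal: str, threshold: float = 0.28) -> float:
    """Sparse reward: 1.0 if image/text cosine similarity exceeds a tuned threshold."""
    inputs = clip_proc(text=[subgoal], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = clip(**inputs)
    img_emb = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt_emb = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    similarity = (img_emb @ txt_emb.T).item()
    return 1.0 if similarity > threshold else 0.0
```

In practice the threshold would need tuning so the reward only fires when a sub-goal is genuinely satisfied; set too low, it would reward unrelated states.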
How does it learn?
Complementing these elements, the authors have introduced a 'Collect & Infer' Learning Paradigm to manage the agent's learning mechanism. During the 'Collect' phase, the agent interacts with the environment and gathers data. Once a reward is encountered, the episode's data is stored in an experience buffer. Then, during the 'Infer' phase, the agent's policy network is trained on this buffered experience. This approach empowers the agent to learn from its environment and update its action policies dynamically.
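A rough sketch of what that loop could look like is below. The environment and policy interfaces (env.reset, env.step, env.render_camera, policy.act, policy.update) are hypothetical placeholders rather than the paper's code, and clip_reward refers to the VLM reward sketched in the architecture section above.

```python
# Sketch of the Collect & Infer loop under the assumptions stated above.
import random
from collections import deque

buffer = deque(maxlen=100_000)  # experience buffer of (observation, sub-goal, action) tuples

def collect(env, policy, subgoals, episodes=10):
    """Collect phase: roll out the current policy and store rewarded experience."""
    for _ in range(episodes):
        obs = env.reset()
        episode, goal_idx, done = [], 0, False
        while not done:
            subgoal = subgoals[goal_idx]
            action = policy.act(obs, subgoal)              # language-conditioned policy
            episode.append((obs, subgoal, action))
            obs, done = env.step(action)
            if clip_reward(env.render_camera(), subgoal):  # VLM judges the sub-goal achieved
                buffer.extend(episode)                     # keep the successful trajectory
                episode = []
                goal_idx = min(goal_idx + 1, len(subgoals) - 1)

def infer(policy, batch_size=256, steps=1_000):
    """Infer phase: train the policy network on buffered experience."""
    for _ in range(steps):
        if not buffer:
            break
        batch = random.sample(list(buffer), min(batch_size, len(buffer)))
        policy.update(batch)  # e.g. a behavioral-cloning or RL update on the sampled tuples
```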
Results
The researchers evaluated their method on a range of robotic stacking tasks and compared it against a baseline RL agent that learns only from environmental rewards. The results demonstrated the efficiency of the new method, with the learning curves showing that the language-centric agents were considerably more successful, especially on the more complex tasks. Importantly, the number of interaction steps the agent needed to learn a task grew more slowly than the sparsity of the task's reward, indicating that the method could scale to even more challenging tasks.

In sum, by blending language and vision models, researchers have taken a substantial step toward enabling robots to handle increasingly complex tasks more efficiently. This new language-centric approach highlights the promising potential for further development in the realm of robotic learning.