Robotics Transformer: Google's Open-Source Architecture For Robotics Generalization
Google and Everyday Robots have come out with a new framework for robotics AI generalization, involving a new fast-inference transformer architecture.
Google researchers, in collaboration with Everyday Robots, have recently been working to improve the state of AI robotics. Their latest project, called Robotics Transformer (RT-1), takes inspiration from the generalization capabilities of large-scale models trained on massive quantities of data.
How Robotics Transformer works
To reach the goal of a robotics AI model that can generalize its knowledge to new tasks, the team developed two things: a new, large, and varied dataset of robotics tasks, and a new model architecture that can take advantage of it.
The dataset
Collected over the span of 17 months, the dataset contains over 130k episodes and covers over 700 tasks. It consists of human-provided demonstrations, with humans controlling robots that feature a seven-joint arm, a two-finger gripper, and a mobile base.
Each episode includes video from the robot's perspective, annotated with a short text description of what is taking place and the corresponding instruction the robot would be given to accomplish the task. Data was collected in an office kitchen setting and covers many tasks such as moving objects and opening drawers.
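To make that structure concrete, each episode can be thought of as a bundle of camera frames, one instruction string, and the demonstrated per-step commands. The sketch below is purely illustrative; the field names and types are assumptions, not the dataset's actual schema.

```python
from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class Episode:
    """Rough shape of one demonstration episode (field names are assumptions)."""
    frames: List[np.ndarray]    # robot-camera images captured over the episode
    instruction: str            # short text description, e.g. "pick up the apple"
    actions: List[np.ndarray]   # per-step arm, gripper, and base commands from the human demonstrator
```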

The model
RT-1 takes in a set of images alongside a natural-language text instruction. This input is fed into the first section of the model (16 million parameters), which tokenizes it into a small set of combined text-visual tokens. The second section (19 million parameters), the core transformer network of RT-1, determines the final action.
Tokenizing the input data with a modified FiLM-EfficientNet:
RT-1 starts by tokenizing the input data. The tokenization process is built around an ImageNet-pretrained EfficientNet convolutional neural net. By itself, an EfficientNet transforms an input image into a feature map, but for RT-1 it has been modified in a few ways.
First, a number of FiLM layers have been inserted into the EfficientNet to allow conditioning on the text instruction. These FiLM layers let the text instruction (embedded by a Universal Sentence Encoder) influence the convolutional features.
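At its core, a FiLM layer just predicts a per-channel scale and shift from the instruction embedding and applies them to a convolutional feature map. Below is a minimal PyTorch sketch, assuming a generic embedding size and channel count rather than RT-1's exact configuration.

```python
import torch
import torch.nn as nn


class FiLMLayer(nn.Module):
    """Feature-wise Linear Modulation: scale and shift a conv feature map
    using parameters predicted from a sentence embedding."""

    def __init__(self, embed_dim: int, num_channels: int):
        super().__init__()
        self.to_gamma = nn.Linear(embed_dim, num_channels)
        self.to_beta = nn.Linear(embed_dim, num_channels)

    def forward(self, feature_map, text_embedding):
        # feature_map: (batch, channels, height, width) from an EfficientNet block
        # text_embedding: (batch, embed_dim), e.g. a Universal Sentence Encoder vector
        gamma = self.to_gamma(text_embedding)[:, :, None, None]
        beta = self.to_beta(text_embedding)[:, :, None, None]
        return (1 + gamma) * feature_map + beta  # identity when gamma = beta = 0


film = FiLMLayer(embed_dim=512, num_channels=48)
features = torch.randn(1, 48, 9, 9)       # a hypothetical intermediate feature map
instruction = torch.randn(1, 512)         # the embedded instruction
print(film(features, instruction).shape)  # torch.Size([1, 48, 9, 9])
```

Because the modulation reduces to the identity when the predicted scale and shift are zero, layers like this can be slotted into a pretrained EfficientNet without immediately disrupting its ImageNet weights.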
Each of the six input images is sent through the modified EfficientNet individually, all conditioned on the same text instruction, and each produces 81 text-visual tokens. These are further compressed to 8 tokens per image using a TokenLearner module, then concatenated for a final total of 48 tokens, which are sent to the next section.
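TokenLearner's job here is purely to shrink the token count so the downstream transformer stays fast. The rough idea, sketched below with assumed dimensions, is to learn a handful of attention maps over the 81 tokens from each image and use them to pool those tokens down to 8. The actual TokenLearner module is somewhat more elaborate, so treat this as the general shape rather than the exact computation.

```python
import torch
import torch.nn as nn


class TokenCompressor(nn.Module):
    """TokenLearner-style compression sketch: learn K attention maps over the
    N input tokens and use them to pool N tokens down to K."""

    def __init__(self, dim: int, num_output_tokens: int = 8):
        super().__init__()
        self.attn = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, num_output_tokens),
        )

    def forward(self, tokens):
        # tokens: (batch, num_tokens, dim), e.g. 81 text-visual tokens per image
        weights = self.attn(tokens).softmax(dim=1)             # (batch, 81, 8)
        return torch.einsum("bnk,bnd->bkd", weights, tokens)   # (batch, 8, dim)


compress = TokenCompressor(dim=512, num_output_tokens=8)
per_image_tokens = torch.randn(6, 81, 512)    # tokens for each of the 6 frames
compressed = compress(per_image_tokens)       # (6, 8, 512)
print(compressed.reshape(1, 48, 512).shape)   # the 48 tokens fed to the transformer
```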
Determining action with a decoder transformer network:
The second and final part of RT-1 is the transformer network that gives it its name. To make sure RT-1 can run inference quickly and remain responsive to human users, the transformer network has to stay fairly small, which is why it only takes in 48 tokens as input.
Other than that, it's a fairly typical decoder-only transformer network. On the output end, it decides what action the robot takes, including parameters for arm control, positioning of the mobile base, and mode switching.
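Put together, the decision stage is small by transformer standards. The sketch below shows the general shape, with heavy caveats: it uses a plain self-attention stack and a single pooled action head, whereas RT-1's actual decoder is causally masked and discretizes each action dimension into bins; the layer count, 11 action dimensions, and 256 bins here are illustrative assumptions for the sketch.

```python
import torch
import torch.nn as nn


class ActionDecoder(nn.Module):
    """Simplified sketch of RT-1's decision stage: a small self-attention stack
    over the 48 input tokens, followed by a head that treats each action
    dimension as a classification over discrete bins."""

    def __init__(self, dim=512, layers=8, heads=8, action_dims=11, bins=256):
        super().__init__()
        block = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim, batch_first=True
        )
        self.backbone = nn.TransformerEncoder(block, num_layers=layers)
        self.action_head = nn.Linear(dim, action_dims * bins)
        self.action_dims, self.bins = action_dims, bins

    def forward(self, tokens):
        # tokens: (batch, 48, dim) from the FiLM-EfficientNet + TokenLearner stage
        pooled = self.backbone(tokens).mean(dim=1)    # (batch, dim)
        logits = self.action_head(pooled)             # (batch, action_dims * bins)
        return logits.view(-1, self.action_dims, self.bins)


decoder = ActionDecoder()
logits = decoder(torch.randn(1, 48, 512))
action_bins = logits.argmax(dim=-1)   # one discrete bin per action dimension
print(action_bins.shape)              # torch.Size([1, 11])
```

Each predicted bin would then be mapped back to a continuous value (or a discrete mode) for the corresponding arm, base, or mode-switch command.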

RT-1's performance and generalization ability
RT-1 was compared against three other models: Gato, BC-Z, and BC-Z XL (a larger BC-Z with a parameter count matching RT-1). The tests covered seen tasks, unseen tasks, robustness, and long-horizon tasks (i.e., SayCan-style scenarios).
For the first three tests, RT-1 shows clear improvements across the board.

For the long-horizon tests, which involve multi-step instructions in the SayCan format, there were two scenarios: kitchen 1 and kitchen 2. Kitchen 1 was based on the training scenes present in the dataset mentioned above, whereas kitchen 2 was new. In kitchen 1, RT-1 showed performance gains consistent with the other tests. In kitchen 2, RT-1 maintained its performance where the other models failed spectacularly in the unseen environment.

RT-1's performance on long-horizon tasks in unfamiliar environments is a testament to its generalization capabilities.
To demonstrate RT-1's generalization ability even further, after being trained on the core dataset it was trained further on a new dataset of bin-picking tasks collected on an entirely different robot (a Kuka arm, whereas the core data comes from Everyday Robots robots). When trained on this mixed data, RT-1 greatly improves bin-picking performance with negligible loss in general performance. When trained purely on the Kuka dataset alone, however, it completely fails.

A similar pattern emerges when mixing simulated training data with real training data: performance on objects seen only in simulation (and then tested on in real life) improves immensely, with negligible loss in general performance.
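In both of these co-training experiments the underlying mechanism is the same: training batches are drawn from a mixture of the original demonstrations and the added data source, so the model picks up the new skill without drowning out the original distribution. Below is a toy sketch of that kind of weighted mixing; the 30% fraction and batch size are arbitrary assumptions, not the ratios used in the paper.

```python
import random


def mixed_batches(original_episodes, added_episodes, added_fraction=0.3, batch_size=8):
    """Yield batches that mix a fixed fraction of an added dataset (e.g. Kuka
    bin-picking or simulation episodes) with the original demonstrations."""
    n_added = int(batch_size * added_fraction)
    while True:
        batch = random.sample(added_episodes, n_added) + \
                random.sample(original_episodes, batch_size - n_added)
        random.shuffle(batch)
        yield batch
```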
Find out more
The RT-1 models and code, including pre-trained checkpoints, are all open-source and available in this GitHub repo. Be sure to read the research paper for all the details on RT-1, and read Google's blog post for some easier-to-digest content. Head to RT-1's web page for lots of examples and other links, including a video overview.