DeepMind Partners with Research Labs to Develop New Robotics Dataset
DeepMind makes a huge advancement for robotics research!
DeepMind's orchestration of a collaboration across 33 academic labs to create the Open X-Embodiment Repository represents a milestone in overcoming one of the most pressing challenges in robotics research: the difficulty of generalizing robotic skills across varied tasks, environments, and hardware setups. The problem is akin to a hypothetical world in which every new image recognition problem required a dataset collected from one specific camera, which would significantly hamper progress. Before the release of this repository, the robotics community faced a similar constraint: every new task or robotic configuration often required a new, specialized dataset for pre-training models. By pooling resources and knowledge from diverse research labs, DeepMind's initiative has produced a standardized yet flexible dataset.
A Collaborative Milestone
Traditionally, training robotic models has been an arduous, isolated process, with each lab gathering its own data for specific robots and tasks. The Open X-Embodiment dataset aims to remedy this by aggregating data from 22 different robot types across more than 20 institutions. This comprehensive collection includes over 1 million episodes, featuring more than 500 skills and 150,000 tasks, making it one of the most extensive robotics datasets to date.

Samples from the dataset
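For readers who want to poke around, the data is distributed in RLDS format, which is compatible with TensorFlow Datasets. The snippet below is a minimal sketch of loading a single constituent dataset; the bucket path, dataset name, and field names are illustrative assumptions and may not match the exact public layout.

```python
# Minimal sketch of loading one Open X-Embodiment constituent dataset.
# Assumes the RLDS/TFDS release format; the bucket path, dataset name, and
# field names below are illustrative and may differ from the public layout.
import tensorflow_datasets as tfds

# Hypothetical path to one per-robot dataset in the release.
builder = tfds.builder_from_directory(
    builder_dir="gs://gresearch/robotics/fractal20220817_data/0.1.0"
)
ds = builder.as_dataset(split="train[:10]")

for episode in ds:
    for step in episode["steps"]:
        image = step["observation"]["image"]        # camera frame
        instruction = step["observation"].get(
            "natural_language_instruction")         # task description, if present
        action = step["action"]                     # robot-specific action
        # ... feed (image, instruction) -> action pairs into training
```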
Cross-Training and Skill Transfer
The Open X-Embodiment Repository introduces a transformative approach to robotics research, particularly through its use of "coarse alignment" of action and observation spaces across various robotic systems. Typically, robotics research has been stymied by the fragmented nature of datasets and models, each tuned to specific robot configurations and sensory setups. Coarse alignment overcomes this bottleneck by providing a unified framework where actions and observations are mapped to a standardized 7-dimensional action vector. This means that models trained on this dataset can control different kinds of robots—ranging from single arms to complex quadrupeds—by interpreting this standardized action vector differently depending on the specific robot being used. This eliminates the need for retraining models from scratch for each new robotic system or task, thereby saving significant time and computational resources. It also fosters a collaborative research environment, as researchers can now more easily share, compare, and build upon each other's work, accelerating the pace of innovation in the field.
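To make coarse alignment concrete, here is a small sketch of how a single standardized 7-dimensional action vector could be reinterpreted by different embodiments. The arm layout (translation, rotation, gripper) follows the common end-effector convention; the quadruped mapping and the function names are hypothetical illustrations, not the project's actual code.

```python
# Illustrative sketch of "coarse alignment": the policy always emits a 7-D vector,
# and each embodiment interprets that vector in its own way. The mappings below
# are hypothetical; real per-robot conventions are defined by each lab's setup.
import numpy as np

def as_arm_command(action7: np.ndarray) -> dict:
    """Interpret the 7-D vector as an end-effector delta for a single-arm manipulator."""
    return {
        "delta_xyz": action7[0:3],   # translation of the end effector
        "delta_rpy": action7[3:6],   # rotation (roll, pitch, yaw)
        "gripper": action7[6],       # open/close command
    }

def as_quadruped_command(action7: np.ndarray) -> dict:
    """Reinterpret the same 7-D vector for a quadruped (hypothetical mapping)."""
    return {
        "base_velocity": action7[0:3],    # forward/lateral/vertical velocity
        "base_turn_rates": action7[3:6],  # body orientation rates
        "mode": action7[6],               # e.g. gait or stop flag
    }

# The same policy output can drive either robot, with no retraining:
policy_output = np.array([0.02, 0.0, -0.01, 0.0, 0.0, 0.1, 1.0])
print(as_arm_command(policy_output))
print(as_quadruped_command(policy_output))
```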
Versatile Robotics Transformers
The study also considers two model architectures for robotic control: RT-1 and RT-2. RT-1 is a 35M parameter network based on a Transformer architecture designed for robotic control tasks. It processes a sequence of 15 images and natural language instructions, combining visual and language representations via FiLM layers before generating tokenized actions with a decoder-only Transformer. RT-2, by contrast, belongs to a family of large vision-language-action (VLA) models trained on Internet-scale vision and language data in addition to robotic control data. It represents tokenized actions as text tokens, allowing a pre-trained vision-language model to be fine-tuned directly for robotic control. While both models take in visual inputs and natural language instructions and output actions, RT-2's approach allows for broader generalization thanks to its Internet-scale training data.
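The trick that lets RT-2 piggyback on a pre-trained vision-language model is representing actions as discrete tokens. The sketch below illustrates the general idea of discretizing each of the 7 action dimensions into 256 uniform bins, as described in the RT papers; the normalization range and helper names here are simplifying assumptions.

```python
# Illustrative sketch of action tokenization: each of the 7 continuous action
# dimensions is discretized into 256 uniform bins so a Transformer can predict
# actions as a short sequence of discrete tokens. Values are assumed to be
# normalized to [-1, 1]; a real pipeline would use per-dimension statistics.
import numpy as np

NUM_BINS = 256
LOW, HIGH = -1.0, 1.0

def tokenize_action(action7: np.ndarray) -> np.ndarray:
    """Map a 7-D continuous action to 7 integer tokens in [0, NUM_BINS - 1]."""
    clipped = np.clip(action7, LOW, HIGH)
    bins = (clipped - LOW) / (HIGH - LOW) * (NUM_BINS - 1)
    return np.round(bins).astype(np.int32)

def detokenize_action(tokens: np.ndarray) -> np.ndarray:
    """Invert the discretization back to (approximate) continuous values."""
    return tokens.astype(np.float32) / (NUM_BINS - 1) * (HIGH - LOW) + LOW

action = np.array([0.02, 0.0, -0.01, 0.0, 0.0, 0.1, 1.0])
tokens = tokenize_action(action)       # e.g. [130, 128, 126, 128, 128, 140, 255]
recovered = detokenize_action(tokens)  # close to the original action
print(tokens, recovered)
```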
Results
A total of 3600 evaluation trials were carried out across six different robots to assess performance.
In-Distribution Performance: The study assessed the model's performance on both small-scale and large-scale datasets. In small-scale dataset domains like Kitchen Manipulation and NYU Door Opening, the RT-1-X model outperformed the Original Method in 4 out of 5 datasets, signifying that co-training on the X-embodiment data substantially improves performance in scenarios where data is limited. For large-scale datasets, the RT-1-X model did not outperform the RT-1 baseline, indicating underfitting. However, the RT-2-X model outperformed both the Original Method and RT-1, suggesting that high-capacity architectures can benefit from X-embodiment training even in data-rich domains.
Generalization to Out-of-Distribution Settings: Experiments using the RT-2-X model showed that it performed roughly on par with the RT-2 in manipulating unseen objects in unseen environments. This is attributed to the already good generalization properties of the RT-2 model.
Impact of Design Decisions: Ablation studies were conducted to understand the influence of various factors like model size and history length on performance. One key insight was that the RT-2-X model, when trained with data from other robotic platforms, performed about 3 times better than the RT-2 model in emergent skills evaluation. This indicates that co-training with different platforms can significantly extend a robot's skill set, even if it already has abundant data available.
Overall, the experiments demonstrate the benefits of training a single model using data from multiple robotic embodiments. Not only does this result in higher performance, but it also fosters new capabilities that individual robots didn't initially possess. For instance, RT-2-X could successfully execute tasks involving spatial relationships between objects, a skill it was not initially trained for.
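As a rough picture of what co-training across embodiments looks like in practice, the sketch below interleaves per-robot datasets into a single training stream using sampling weights. The weights, dataset handles, and toy data are hypothetical stand-ins, not the mixture used in the paper.

```python
# Hypothetical sketch of co-training: interleave per-embodiment datasets into a
# single stream with sampling weights, so one policy sees all robots during training.
from typing import Dict
import tensorflow as tf

def make_mixture(datasets: Dict[str, tf.data.Dataset],
                 weights: Dict[str, float]) -> tf.data.Dataset:
    names = list(datasets.keys())
    total = sum(weights[n] for n in names)
    return tf.data.Dataset.sample_from_datasets(
        [datasets[n].repeat() for n in names],
        weights=[weights[n] / total for n in names],
    )

# Toy stand-ins for per-robot datasets (real ones would yield image/instruction/action steps).
toy = {
    "single_arm": tf.data.Dataset.from_tensor_slices(tf.zeros([100, 7])),
    "quadruped": tf.data.Dataset.from_tensor_slices(tf.ones([100, 7])),
}
mixture = make_mixture(toy, {"single_arm": 0.7, "quadruped": 0.3})
for batch in mixture.batch(4).take(2):
    print(batch.shape)
```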
What's Next?
In summary, the Open X-Embodiment dataset and the new RT-X models are significant steps toward developing robots that are not just specialists but also generalists, adaptable across a range of tasks and embodiments.
Tags: ML News