
Teaching A Robot Dog New Tricks (Safely)

Google AI has revealed a new framework for reinforcement learning that helps robots learn tasks safely, without risking expensive equipment damage or requiring constant human intervention.
A new paper by Yang et al. describes a method for safely training robots in real-world environments. These Google researchers observe that training robots to complete tasks with reinforcement learning comes with many pitfalls, and they set out to find a way to train them more consistently and with better results.
The goal set out in the paper is to have a four-legged robot learn three separate locomotion skills without constant human supervision and without critical failures like accidentally flipping itself upside down.
A post about the paper has already been released on the Google AI blog, viewable here: https://ai.googleblog.com/2022/05/learning-locomotion-skills-safely-in.html
The full paper is available here: https://arxiv.org/abs/2203.02638

The difficulties with reinforcement learning in robotics

The crux of the issue with training robots via reinforcement learning in the real world is the failure state. In a computer simulation, we can easily reset the environment to its initial state when the model fails, but in a real-life training scenario we can't just warp time and space like that. Additionally, depending on the training setup and goals, a robot failing its task could damage it.
Failure is fundamental to reinforcement learning (and machine learning in general). A model learns through exploration of its environment, and when a model is brand new it will inevitably fail over and over again. That is not acceptable for fragile and expensive robots.
Doing early training in simulation is one common way to bypass the bulk of early training, where failure is abundant; however, this has its own limitations in the transfer from sim to real life. Physics is too complicated to simulate perfectly, and real-world hardware introduces noise into measurements.

Using a safety policy to train robots safely

The paper features a four-legged robot being taught a number of different locomotion tasks, including balancing on two legs.
The researchers' goal was to have the robot train on these tasks without the need for human intervention, whether to reset the robot to its start position or to save it after it has accidentally flipped itself upside down.
By using two policies for training, one an exploration policy that improves its ability to perform the tasks and the other a safety policy that keeps it from damaging itself, the robot is able to learn while avoiding critical failures. Additionally, a sequence that returns the robot to its initial position was programmed in, so it can repeat the training process over and over by itself.
The robot starts under the exploration policy to learn, while internal programs track its balance through on-board sensors. An internal algorithm predicts future safety levels, and if it deems that the robot is headed toward an unrecoverable state, the safety policy is immediately switched on. The safety policy has the sole purpose of keeping the robot upright and safe from damage. From there the robot can walk back to the initial position and try again; a rough sketch of this switching logic is shown below.
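To make the idea concrete, here is a minimal Python sketch of that hand-off. Everything here is a hypothetical stand-in: the paper uses a learned safety estimator rather than fixed thresholds, and the names `predict_future_attitude`, `MAX_ROLL`, and `MAX_PITCH` are illustrative, not the authors' API.

```python
# Hypothetical roll/pitch limits (radians) beyond which the robot is
# assumed to be headed for an unrecoverable fall.
MAX_ROLL = 0.6
MAX_PITCH = 0.6


def predicted_unsafe(state, predict_future_attitude, horizon=10):
    """Return True if the predicted future attitude exceeds the safe limits.

    `predict_future_attitude` stands in for the paper's learned safety
    estimator: any model that rolls the current sensor reading forward
    `horizon` steps and returns a (roll, pitch) prediction.
    """
    roll, pitch = predict_future_attitude(state, horizon)
    return abs(roll) > MAX_ROLL or abs(pitch) > MAX_PITCH


def select_action(state, exploration_policy, safety_policy, predict_future_attitude):
    """Use the exploration policy until the monitor predicts danger,
    then hand control to the safety (recovery) policy."""
    if predicted_unsafe(state, predict_future_attitude):
        return safety_policy(state), "safety"
    return exploration_policy(state), "exploration"
```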

The safety policy acts as training wheels for the exploration policy: no matter what chaos the exploration policy produces, the safety policy is there to keep the robot from damaging itself.
Here you can see what training looks like without the safety policy:

And here it is with the safety policy:

You can clearly see the safety policy activate in the second gif. The internal algorithms for predicting future safety decide that whatever's about to happen is bad, and the robot goes from bouncing on two legs straight into a wide recovery stance. From there, the robot goes through its process of returning to the initial state, where it can repeat the process.
This loop of initial -> exploration -> safety -> reset allows the robot to learn without human observation. With the protection of the safety policy and the utility of the reset program, the robot can be left alone to learn for as long as needed; a sketch of the full loop follows below.
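Here is a hedged Python sketch of that autonomous loop, reusing the `select_action` sketch from above. The environment and helper names (`env.observe`, `env.step`, `update_policy`, `reset_to_start`) are placeholders for illustration, not the paper's actual implementation.

```python
def autonomous_training_loop(env, exploration_policy, safety_policy,
                             predict_future_attitude, update_policy,
                             reset_to_start, num_episodes=1000):
    """Repeat explore -> (safety recovery if needed) -> reset, with no
    human in the loop. All names here are illustrative placeholders."""
    for _ in range(num_episodes):
        state = env.observe()
        trajectory = []
        done = False
        while not done:
            action, mode = select_action(state, exploration_policy,
                                         safety_policy, predict_future_attitude)
            next_state, reward, done = env.step(action)
            if mode == "exploration":
                # Only exploration experience is used for learning.
                trajectory.append((state, action, reward, next_state))
            else:
                # The safety policy just brings the robot to a stable stance;
                # end the episode early and go reset.
                done = True
            state = next_state
        update_policy(exploration_policy, trajectory)
        # Programmed recovery sequence: walk back to the start position.
        reset_to_start(env)
```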

Find out more

Tags: ML News