
Robotic Telekinesis in the Wild

In this article, we look at how a robotic hand imitator can be learned by watching humans on YouTube, enabling any operator in the wild to teleoperate a robot with only a single uncalibrated color camera.
Building robots that mimic human behavior has been a central component of robotics research for decades. This particular paradigm, known as teleoperation, has historically been used to enable robots to perform tasks that were unsafe or impossible for humans to perform, such as handling nuclear materials or deactivating explosives (or building an Iron Man suit 😉).
More recently, teleoperation has been used to enable the robotic automation of tasks that are easy for humans to demonstrate but difficult to program. In industrial robotics, for example, teleoperation can be used to demonstrate a single trajectory (e.g., picking a box from a conveyor belt) that the robot overfits to and repeats verbatim for months or years thereafter. Teleoperation can alternatively be used as a means to collect a large dataset of demonstrations, which can then be used to learn a policy that generalizes to new tasks in unseen environments.
However, if we want to build a real-life Iron Man suit or Atom from Real Steel, we have to study the problem of teleoperation for dexterous robotic manipulation. Many promising techniques already exist for this.
Yet in spite of their impressive results, all of these existing techniques suffer from shortcomings that have precluded their ubiquitous adoption. Their setups typically involve expensive hardware and specialized engineering, expert operators, or an apparatus that impedes the natural, fluid motion of the demonstrator's hand.
Therefore, the question that we ask ourselves today is:
Is it possible to build a low-cost teleoperation system that enables dexterous robotic manipulation for any untrained operator in the wild, using only a single uncalibrated color camera?
This is the question the authors of the paper Robotic Telekinesis: Learning a Robotic Hand Imitator by Watching Humans on YouTube attempt to answer. The authors have built a system called Robotic Telekinesis that enables humans to control a robot's hand and arm simply by demonstrating motions with their own hand. The robot observes the human operator via a single RGB camera and imitates their actions in real time. In other words, you could look into the monocular camera of your phone or tablet and control a robot, without relying on bulky motion capture or multi-camera rigs for accurate 3D estimation in the wild!

Demonstration of Robotic Telekinesis in the Wild

This article was written as a Weights & Biases Report, which is a project management and collaboration tool for machine learning projects. Reports let you organize and embed visualizations, describe your findings, share updates with collaborators, and more. To learn more about reports, check out Collaborative Reports.






The Chicken-and-Egg Problem

As we discussed earlier, Robotic Telekinesis is a teleoperation system that enables dexterous manipulation with a robotic hand-arm system, in the wild, by an untrained operator using nothing but a single uncalibrated color camera. However, controlling a 3D robotic hand from 2D camera input is a severely under-constrained problem. The authors resolve these ambiguities by extracting prior experience from passive data on the web to learn a deep neural network-based imitator.
However, there is a bigger underlying problem that is posed by developing such a system:
In order to train a teleoperation system that can work in the wild, we need a rich and diverse dataset of paired human-robot pose correspondences, but to collect this kind of data, we need an in-the-wild teleoperation system.
Even though this seems like a near-impossible problem, the authors tackle this lack of paired data by leveraging a massive unlabeled corpus of internet human videos at training time. These videos capture many different people from many different viewpoints doing many different tasks in many different environments.
The authors propose a method that uses this data, in combination with the latest advancements in 3D human pose estimation and novel retargeting algorithms, to train a system that estimates human hand-body motion and retargets the human actions into robot hand-arm actions. During training, the method only uses passive data readily available online and does not require any active fine-tuning on the robot in the lab setup. By design, this ensures that the system works out of the box for any operator in any environmental setting. The proposed method is discussed in the next section.



Robotic Telekinesis: In Detail

The proposed Robotic Telekinesis system consists of an xArm6 robot arm, a 16-DoF Allegro robot hand, and a single RGB camera that captures a stream of images of the human operator. The camera can be placed anywhere as long as the operator is within the camera's field of view. The robot should be visible to the operator (either in real life or through a video conferencing screen).
The problem of remote teleoperation from a single camera is severely under-constrained for two reasons:
  • The input images are in 2D while the robot has to be controlled in 3D; mapping from 2D to 3D is an ill-defined problem.
  • Further ambiguity is caused by the difference in morphology between humans and robots: there is no unique mapping between human poses and robot poses.
To address both of these problems, the authors use deep neural networks to learn priors from passively-collected internet-scale human datasets in order to enable powerful human pose estimation and human-to-robot transfer.

This figure gives a graphical overview of the proposed visual teleoperation pipeline. A color camera captures an image of the operator. To command the robot hand, a crop of the operator's hand is passed to a hand pose estimator, and the hand retargeting network maps the estimated human hand pose to a robot hand pose. To command the robot arm, a crop of the operator's body is passed to a body pose estimator, and cross-body correspondences are used to determine the desired pose of the robot's end-effector from the estimated body pose. Commands are sent to both the robot hand and arm.
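To make the dataflow concrete, here is a minimal Python sketch of one step of this two-branch pipeline. Every helper function in it (`detect_hand_crop`, `estimate_mano_hand`, `retarget_hand`, `detect_body_crop`, `estimate_body_pose`, `wrist_to_torso_transform`, `solve_arm_ik`, and the command-sending calls) is a hypothetical placeholder for the corresponding component in the figure, not a real FrankMocap, OpenPose, or xArm API.

```python
import numpy as np

def teleop_step(frame: np.ndarray) -> None:
    # --- Hand branch: image crop -> 3D human hand pose -> 16 Allegro joint angles ---
    hand_crop = detect_hand_crop(frame)               # OpenPose-derived hand detector
    human_hand_pose = estimate_mano_hand(hand_crop)   # FrankMocap-style MANO estimator
    allegro_angles = retarget_hand(human_hand_pose)   # learned Retargeter network
    send_hand_command(allegro_angles)                 # 16 joint targets for the Allegro hand

    # --- Arm branch: image crop -> body pose -> wrist-in-torso transform -> arm IK ---
    body_crop = detect_body_crop(frame)               # OpenPose-derived body detector
    body_pose = estimate_body_pose(body_crop)         # SMPL-X body pose estimator
    wrist_in_torso = wrist_to_torso_transform(body_pose)
    arm_angles = solve_arm_ik(wrist_in_torso)         # inverse kinematics for the xArm6
    send_arm_command(arm_angles)
```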


Hand Teleoperation: Human Hand to Robot Hand Pose

The problem of retargeting 2D human images to robot hand control commands is broken into two sub-problems. The first is to estimate the 3D pose of the human hand from a 2D image, and the second is to map the extracted 3D human hand parameters to robot joint control commands.

3D Human Hand Pose Estimation from 2D Images

The first step in hand retargeting is to detect the operator's hand in a 2D image and infer its 3D pose. While the problem of inferring a 3D hand pose from a 2D image is inherently under-constrained, the authors leverage priors from offline data to accurately estimate physically plausible hand poses. This is made tractable by several paired 2D/3D datasets and by methods that use these datasets to train high-quality 3D human pose estimators operating on 2D images. Let's go over the steps involved in solving this sub-problem:
A crop around the operator's hand is first computed using a bounding box from an off-the-shelf detector derived from OpenPose. The resulting image crop is passed to a pose estimator from FrankMocap to obtain the shape and pose parameters of a 3D MANO model of the operator's right hand.
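For reference, MANO parameterizes the right hand with a 3-dimensional global wrist rotation, 45 axis-angle articulation parameters (15 finger joints × 3), and 10 shape coefficients. The sketch below only documents the interface assumed by the other code sketches in this article; `estimate_hand_pose` is a hypothetical wrapper around the FrankMocap-style estimator, not its real API.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class ManoHand:
    """Estimated MANO parameters of the operator's right hand."""
    global_orient: np.ndarray  # (3,)  axis-angle rotation of the wrist/root
    hand_pose: np.ndarray      # (45,) axis-angle rotations of the 15 finger joints
    betas: np.ndarray          # (10,) shape coefficients of the operator's hand

def estimate_hand_pose(hand_crop: np.ndarray) -> ManoHand:
    """Hypothetical wrapper around the pretrained hand pose estimator.

    The real output comes from a neural network; the zeros here are only a
    stub so the interface is concrete and the sketch runs.
    """
    return ManoHand(
        global_orient=np.zeros(3),
        hand_pose=np.zeros(45),
        betas=np.zeros(10),
    )
```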
One key point to note is that the human hand pose estimation module works for any human operator, with any uncalibrated camera, in any environment. This is achieved by utilizing pre-trained state-of-the-art neural network detectors and pose estimators, which indirectly enables the authors to leverage the millions of images these models were trained on. These images depict human hands in many poses against many backgrounds, and as a result, the system can be used out of the box by anyone.

3D Human Hand-to-Robot Hand Control

The next step is to retarget the estimated 3D human hand pose to a vector of 16 Allegro joint angles that place the robot’s hand in an analogous hand pose. This has three challenges:
  • Under-constrained: The Allegro hand and the human hand have very different embodiments and differ greatly in shape, size, and joint structure. This means that there could be multiple robot poses that can correspond to a certain human pose and vice-versa.
  • Robustness: The proposed teleoperation solution must work for any human operator trying to perform any kind of task in any environment. Notably, this means we cannot bias solutions toward any particular type of motion or hand type.
  • Efficiency: The proposed solution needs to run in real time without any lag. This means that the robot must be able to follow the human at a rate of at least 15 Hz to solve tasks successfully via teleoperation.
A natural way to address these three challenges would be to train a model on a diverse dataset of paired human-robot hand pose examples. However, as mentioned before, due to the chicken-and-egg nature of this solution, it's not practical. The authors get around this issue by training a deep human-to-robot hand Retargeter network in a way that uses just the human data itself and does not need any supervision for the target robot pose. The key idea is to formulate the human-to-robot mapping problem using a feasibility objective rather than a regression objective, because the latter relies on ground-truth target robot poses, which are not available. Instead, an energy optimization procedure is defined that provides loose constraints on the robot hand poses the Retargeter network should output for a given human hand pose.
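To make the idea concrete, here is a minimal PyTorch sketch of such label-free Retargeter training. It assumes a differentiable `human_keypoints` function mapping MANO parameters to selected 3D hand keypoints and a differentiable Allegro forward-kinematics function `allegro_fk`; the simple vector-matching energy, the 48-dimensional input, and the network sizes shown here are illustrative stand-ins, since the paper's actual energy terms and architecture differ in their details.

```python
import torch
import torch.nn as nn

# MANO pose parameters in, 16 Allegro joint angles out (sizes are assumptions).
retargeter = nn.Sequential(
    nn.Linear(48, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 16),
)
optimizer = torch.optim.Adam(retargeter.parameters(), lr=1e-4)

def energy(human_pose: torch.Tensor, robot_angles: torch.Tensor) -> torch.Tensor:
    """Feasibility energy: match keypoint-to-keypoint vectors, not labeled poses."""
    h = human_keypoints(human_pose)   # (B, K, 3) selected human hand keypoints
    r = allegro_fk(robot_angles)      # (B, K, 3) corresponding Allegro keypoints
    h_vec = h[:, 1:] - h[:, :1]       # vectors from the palm to the other keypoints
    r_vec = r[:, 1:] - r[:, :1]
    return ((h_vec - r_vec) ** 2).sum(dim=(1, 2)).mean()

# human_pose_loader yields batches of MANO poses pseudo-labeled from web videos.
for human_pose in human_pose_loader:
    robot_angles = retargeter(human_pose)
    loss = energy(human_pose, robot_angles)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Because the energy only constrains relative geometry that both hands can express, the network is free to output whichever robot pose is feasible for the Allegro hand, without ever needing a ground-truth robot label.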

This is a demonstration of human-to-robot translations: the inputs and outputs of the hand retargeting network. Each pair depicts a human hand pose and the retargeted Allegro hand pose.


Dataset of YouTube Videos of Human Interaction

The authors leverage a massive internet-scale dataset of human hand images and videos. The following are the steps used for creating the dataset, with a brief code sketch after the list:
  1. About 20 million images are gathered from the Epic Kitchens Dataset, which captures ego-centric videos of humans performing daily household tasks, and the 100 Days of Hands Dataset, which is a collection of YouTube videos depicting a wide variety of human hand activities.
  2. The hand pose estimator from FrankMocap is used to estimate a human hand pose for each image frame in these videos.
  3. This massive noisy dataset of estimated human hand poses is augmented with the small and clean FreiHAND dataset, which contains ground-truth human hand poses for a diverse collection of realistic hand configurations.
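A minimal Python sketch of these dataset-construction steps follows. The helpers `frame_iterator`, `estimate_mano_hand`, and `load_freihand_poses` are hypothetical stand-ins for the video decoding, the FrankMocap hand estimator, and the FreiHAND annotations, respectively.

```python
def build_hand_pose_dataset(video_paths, freihand_root):
    """Assemble the training set of (mostly pseudo-labeled) human hand poses."""
    poses = []
    # Steps 1 and 2: pseudo-label every frame of the in-the-wild videos.
    for path in video_paths:
        for frame in frame_iterator(path):
            hand = estimate_mano_hand(frame)   # noisy estimated MANO parameters
            if hand is not None:               # skip frames where no hand is detected
                poses.append(hand)
    # Step 3: mix in the small, clean set of ground-truth FreiHAND poses.
    poses.extend(load_freihand_poses(freihand_root))
    return poses
```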

Arm Teleoperation: Human Body to Robot Arm Poses

A hand that can flex its fingers but does not have the mobility of an arm will not be able to solve many useful tasks. Therefore, the second branch of the retargeting pipeline focuses on computing the correct pose for the robot arm from images of the human operator.
Since we desire a system that operates from a single color camera, there are two main problems that arise:
  1. Without a depth sensor or camera intrinsics, the distance of the human's wrist from the camera cannot be accurately estimated.
  2. Without camera extrinsics, there is no known transformation between the camera, robot, and human.
To circumvent these issues, at each time step, the authors estimate the relative transformation between the human wrist and an anchor point on the human body. The authors define the human torso as the origin of an anchor coordinate frame and choose a suitable point to serve as the robot’s torso. The authors suggest that the relative transformation between the human’s right-hand wrist and torso should be the same as the relative transformation between the robot’s wrist link and the robot’s torso.
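In code, this amounts to expressing the wrist pose relative to the torso pose, both taken in whatever arbitrary, uncalibrated camera frame the estimator works in. The sketch below assumes 4×4 homogeneous transforms for the two joints.

```python
import numpy as np

def wrist_in_torso_frame(T_cam_torso: np.ndarray, T_cam_wrist: np.ndarray) -> np.ndarray:
    """Express the wrist pose in the torso (anchor) frame.

    T_cam_torso and T_cam_wrist are 4x4 homogeneous transforms of the torso
    and right wrist joints in the arbitrary camera frame, e.g. read off the
    estimated SMPL-X kinematic chain.
    """
    return np.linalg.inv(T_cam_torso) @ T_cam_wrist

# The same relative transform is then imposed between the robot's chosen
# "torso" frame and its wrist (end-effector) link.
```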
Concretely, at each timestep, upon capturing an image of the operator, a crop of the operator's body is first computed using a bounding box detector derived from OpenPose and then passed to the body pose estimator from FrankMocap. The human body is modeled using the parametric SMPL-X model, and the body pose estimator predicts the 3D positions of the joints on the human kinematic chain. By traversing the kinematic chain from the torso joint to the right-hand wrist joint, the authors compute the relative position and orientation between the human's right-hand wrist and torso.
The authors then use an inverse kinematics solver to compute arm joint angles that place the robot's end-effector at the desired transformation relative to the coordinate frame of the robot's torso. In order to handle minor errors in human body pose estimation and ensure smooth motion, outliers are rejected and a low-pass filter is applied to the stream of estimated wrist poses. Finally, the smoothed target joint angles are sent to the xArm controller.
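The following sketch illustrates one simple way such filtering and command generation could look. The exponential low-pass filter, the distance-based outlier check, and the `ik_solver` / `send_to_xarm` calls are hypothetical stand-ins for the actual implementation, not APIs from the paper's codebase.

```python
import numpy as np

class WristSmoother:
    """Outlier rejection plus an exponential low-pass filter on wrist positions."""

    def __init__(self, alpha: float = 0.2, max_jump: float = 0.15):
        self.alpha = alpha        # low-pass filter coefficient (0 < alpha <= 1)
        self.max_jump = max_jump  # reject wrist jumps larger than this (meters)
        self.filtered = None

    def update(self, wrist_pos: np.ndarray) -> np.ndarray:
        if self.filtered is None:
            self.filtered = wrist_pos
        elif np.linalg.norm(wrist_pos - self.filtered) < self.max_jump:  # outlier check
            self.filtered = self.alpha * wrist_pos + (1 - self.alpha) * self.filtered
        return self.filtered

smoother = WristSmoother()

def arm_step(wrist_in_torso: np.ndarray) -> None:
    target_pos = smoother.update(wrist_in_torso[:3, 3])      # filtered wrist position
    joint_angles = ik_solver.solve(position=target_pos,
                                   orientation=wrist_in_torso[:3, :3])
    send_to_xarm(joint_angles)                               # hypothetical controller call
```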

Demonstrations of the Performance of the Proposed Robotic Telekinesis System



