AMA with Alex Paino
Alex answers questions about his projects at OpenAI.
Questions
Charles:
Great work and really excellent talk, Alex! To what extent do you need to ensure that the total amount of force/torque imposed by the robot on objects is limited? Or is this something that emerges "for free" from training? Do you think the policy would generalize to a Rubik's cube made of a fragile material, like glass?
Alex:
In training, we randomize the maximum amount of force each actuator is able to exert, which in turn limits the force exerted on objects being manipulated. We don't explicitly limit the force on objects, though. In the real world, we have a limit on actuator force which we've tuned a bit to be reasonable (e.g. reduce breakages). Since we're not explicitly limiting force exerted on objects, and since we never model fragile materials like glass in training, my guess is that it might not generalize to a fragile cube made of glass (i.e. there's a decent chance it would be broken). It'd be fun to try, though!
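To make this kind of randomization concrete, here is a minimal sketch in the spirit of a MuJoCo-style setup; the function name, scaling bounds, and the assumption of a mujoco-py style `sim.model.actuator_forcerange` array are illustrative guesses, not the project's actual code or values.

```python
import numpy as np

def randomize_actuator_forces(sim, base_force_ranges, scale_low=0.7, scale_high=1.3):
    """Rescale each actuator's allowed force range at episode start.

    Assumes a mujoco-py style `sim.model.actuator_forcerange` array of
    (low, high) pairs; the 0.7-1.3 bounds are illustrative, not the values
    used in the project.
    """
    scales = np.random.uniform(scale_low, scale_high, size=len(base_force_ranges))
    for i, (low, high) in enumerate(base_force_ranges):
        sim.model.actuator_forcerange[i, 0] = low * scales[i]
        sim.model.actuator_forcerange[i, 1] = high * scales[i]
```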
Sayak Paul:
Hi Alex. Thank you so much for joining in. I am Sayak, currently working as a Deep Learning Associate at PyImageSearch.
1. I am deeply interested in studying the field of self-supervised learning and the promise it brings. To what extent do you think it will be able to solve typical problems like catastrophic forgetting, manifold collapse, etc., and learn broadly useful representations in general?
2. Most of the SoTA self-supervised learning algorithms rely on a pretext task, even though PIRL (Pretext-Task Invariant Representations) exists. Yet we have seen simpler methods like SimCLR surpass them and push the SoTA. So I am interested in how one should choose the pretext task in self-supervised learning, and whether there could be better loss functions for evaluating its effectiveness than the NT-Xent loss.
Alex:
Hi Sayak, I'm not an expert in self-supervised learning so I don't think I have much to add here (it'd just be thoughts based on reading papers, not real experience).
Lukas Biewald:
Hi Alex! One question I had (if I’m reading the report right): it seemed like the end-to-end training ultimately got a slightly higher reward than the behavioral-cloning-trained model. Is this right? Do you think the end-to-end trained policy is ultimately a little more effective than the one with the explicit vision model?
Alex:
Hi Lukas, it is correct that the end-to-end training got to a higher mean reward at the end of training. However, the evaluator performance (i.e. performance on a difficult, fixed "hold out" env) is roughly perfect for both experiments at the end of training, and this is the main metric we care about. We didn't compare these specific models in real rollouts, though, so it's hard to say for sure which is "better" – I do think there's a possibility that an RL-trained policy would end up with higher performance than a cloned one, since the behavior of an optimal end-to-end policy may differ from the behavior of an optimal state-based policy.
Jack Morris:
Hi Alex Paino. Just out of curiosity, how long did it take to get the hand to actually seem like it was "working"? I figured, if it were me, there would be a long period before I knew that this problem was actually learnable. It must have been a crazy moment when you realized that it might actually work.
Alex:
Hi Jack, I joined in March of last year, at which point we already had the hand working for the block reorientation task, so I'll answer from that perspective. When I joined we were still pretty far away from solving the Rubik's cube – IIRC we couldn't effectively rotate faces of a real cube much at all. It wasn't until later last spring, when everything kind of came together at once, that we went relatively quickly from being barely able to manipulate the cube to being able to solve it. So in total, we were about 9 months into the Rubik's cube project (since the previous release) before we really knew we could solve it in the real world. Solving it in simulation was a different story, however; we actually had that done well before I joined, and I believe it didn't take more than a few weeks. So from that perspective, we never really had much doubt that it was possible to learn this in a simulator.
Kyle Goyette:
Hi Alex! You mention in the report that you believe that learning separate embeddings for the policy and value networks for end-to-end policies is faster because the value network pushes the embedded output from the vision sub-model to zero. Could you explain how the model does this? Does it learn to disregard the embedding because it has a negative impact on performance?
Alex:
Hi Kyle, the hypothesis we have for why the value network pushes the output of the vision model to ~0 is that the value function does not need any info from it, since it is still given "privileged" state-based information during training (we can do this since we don't need to use the value function in real-world rollouts).
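To make the asymmetry concrete, here is a rough PyTorch-style sketch of the kind of setup being described; the module names, layer sizes, and action dimension are made up for illustration, not taken from the actual system.

```python
import torch
import torch.nn as nn

class AsymmetricActorCritic(nn.Module):
    """Policy sees only the vision embedding; the value head also receives
    privileged simulator state (available in training, not in real rollouts)."""

    def __init__(self, embed_dim=512, state_dim=60, hidden=1024, action_dim=20):
        super().__init__()
        self.policy = nn.Sequential(
            nn.Linear(embed_dim, hidden), nn.ReLU(), nn.Linear(hidden, action_dim))
        # Because the value head gets the full state, it has little incentive
        # to rely on the vision embedding, and can drive its contribution to ~0.
        self.value = nn.Sequential(
            nn.Linear(embed_dim + state_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, vision_embed, privileged_state):
        action_logits = self.policy(vision_embed)
        value = self.value(torch.cat([vision_embed, privileged_state], dim=-1))
        return action_logits, value
```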
Ayush Thakur:
Hi Alex, I was going through the report. It's really amazing. Can you clarify this statement: "If yes to 2 and 3, how much more 'expensive' is RL compared to BC?" What do you mean by "expensive"? Also, how does the expensiveness of a method determine the team's desire to test it?
Alex:
Hi Ayush, here "expensive" is in terms of compute or actual $ cost of training (we're pretty aware of the price of CPU and GPU resources for our team). I'd say in general we try to make our experiments as low-cost as possible, but that we also aren't afraid to run expensive experiments if we think they're promising.
Ayush Thakur:
Ah, thanks. Just to know: how do you determine beforehand that an experiment is promising? Experiments tend to fail. What process does your team go through to identify promising experiments beforehand? Thanks in advance.
Alex:
We don't really have a formal process. I think generally, though, the way we proceed is by first prototyping any new idea on a problem which requires less compute prior to then trying it in a bigger setting. It's obviously not perfect, since there are ideas which only work at large scale – sometimes we have to take a bit of a risk to try out such ideas.
Hamel Husain:
Hi Alex Paino, what is your opinion on GPT-3 for program synthesis? Do you plan on scaling GPT-3 up further, to an order of magnitude (or more) greater number of parameters? Are there any concerns about scaling models up to such a large size that their compute requirements become too onerous for industrial applications, or is that a complementary path of research somehow?
Alex:
Hi Hamel, I'm sorry but since I did not work on GPT-3 I'm not going to comment on it.
BorisDayma:
Hi Alex, 1/ What are the minimum compute resources required at the moment to do RL in robotics? 2/ Do you foresee generic pre-trained RL networks that will be able to obtain great results after fine-tuning on smaller machines (like we have in computer vision or NLP with transformers)?
Alex:
Hi Boris, 1. I think this is a very tough question. It certainly depends a lot on what problem you're trying to solve – for us, the Rubik's cube release required a lot more compute than the previous block reorientation release. It's also pretty dependent on the approach taken; I think sim2real approaches in general require more compute than approaches relying entirely on real-world data. Sorry, I don't think I have a concrete answer here.
2. I think it's really interesting to think about whether there could be a single useful pretraining task for RL+Robotics, as there is in vision and NLP. One obvious difference is that in RL, the inputs and outputs to the model vary tremendously across problems in a way that you don't see for NLP/vision. The inputs can probably be "frozen" (e.g. you could standardize on image inputs), but the outputs are much more difficult since they commonly map directly to the controls for a specific robot. I think it will probably be tough to pretrain a single RL policy which could generalize to different robots.
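As a purely illustrative sketch of the "standardized inputs, robot-specific outputs" point, one could imagine a shared image encoder feeding per-robot action heads; everything below (class names, layer sizes, the idea of per-robot heads) is a hypothetical illustration, not anything proposed in the report.

```python
import torch.nn as nn

class SharedVisionEncoder(nn.Module):
    """Hypothetical standardized input side: one image encoder shared across robots."""
    def __init__(self, embed_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(), nn.LazyLinear(embed_dim), nn.ReLU())

    def forward(self, image):
        return self.net(image)

class RobotSpecificHead(nn.Module):
    """The hard-to-standardize output side: one head per robot's action space."""
    def __init__(self, embed_dim=256, action_dim=7):
        super().__init__()
        self.head = nn.Linear(embed_dim, action_dim)

    def forward(self, embedding):
        return self.head(embedding)
```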
Max Green:
What do you currently see as the barriers to productionizing RL in the real world? What do the failure modes tend to look like?
Alex:
Hi Matt, I think the biggest issue for widespread deployment of RL in the physical world is safety – i.e., preventing the RL policy from doing something very unexpected and harmful. This is less of an issue in warehouse/manufacturing settings, since there is little to no interaction between the robots and humans (this is partly why it is maybe the first place we've seen RL deployed on real-world robots, e.g. Covariant). But ultimately we want robots that interact more closely with humans, so we'll need new solutions here.
Elena Khachatryan:
Hi Alex, thank you for joining us today. I’m wondering how you guys settled on solving the Rubik’s cube as the benchmark problem.
Alex:
Hi Elena, honestly, this decision was made before I joined, so I don't have all the details. My understanding is that we chose it because it was the "hardest single-hand manipulation problem we could think of".
Lavanya:
Alex, thanks for coming! My first question is – you mentioned that your team used domain randomization to help you transfer your models to the real world. But intuitively, how does training the hand with different gravitational constants and friction coefficients help it do better in the real world where these values are constant?
Alex:
Hi Lavanya, the general motivation for domain randomization is that we can't model the real world perfectly, so we instead model a large number of possible real worlds (through varying the simulation parameters). I'm not sure that randomizing gravity is too useful, since it is very well known and does not vary at all in our lab. However, things like friction coefficients are very hard to measure accurately, and can vary quite a bit if we decide to slightly change the objects being manipulated.
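One way to picture "modeling many possible real worlds" is to resample a handful of simulator parameters at the start of each episode; the parameter names and ranges below are illustrative guesses, not the actual randomization set from the report.

```python
import numpy as np

# Illustrative parameter ranges; not the actual randomization set from the report.
RANDOMIZATION_RANGES = {
    "friction_scale": (0.7, 1.3),    # hard to measure accurately, so randomized broadly
    "object_mass_scale": (0.9, 1.1),
    "gravity_scale": (0.99, 1.01),   # well known and constant, so little benefit
}

def sample_episode_params(rng=None):
    """Draw one 'possible real world' by scaling nominal simulator parameters."""
    rng = rng or np.random.default_rng()
    return {name: rng.uniform(low, high)
            for name, (low, high) in RANDOMIZATION_RANGES.items()}
```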
Lavanya:
Do you have intuition around why a larger batch size of 1024 was the most computationally efficient? Have you found larger batch sizes to perform generally better across all your experiments?
Alex:
I'm not entirely sure. It's possible 1024 is closer to the true "critical batch size" for the behavioral cloning problem, but I didn't study this in too much detail. In general larger batches tend to help in terms of training time, but may not always result in greater sample efficiency.
Lavanya:
Lastly, if the hobbyist robotics enthusiasts in our community want to work on applying deep learning to robotics (maybe on your team), what would be your advice for them? Is your team hiring?
Alex:
My team is not actively hiring at the moment, but these things can change quickly, so I recommend keeping us in mind in the future. Regarding working on deep learning and robotics in general, I think it depends a bit on which aspect you're interested in – i.e., hardware, software stack, RL, computer vision. I recommend picking one to start with; I think each individually can be studied by a hobbyist, although I recognize tying them all together into a single system is quite a challenge for a single person.
Eliza Szczechla:
Hello Piero, are there any plans to integrate reinforcement learning algorithms into Ludwig?
Alex:
Hi Eliza, sorry for the late response. I think RL+NLP is an interesting direction but I unfortunately don't have much to add here (I'm not sure I'm even up on the latest research here).