Offline Reinforcement Learning with Collaborative Datasets
highlights: data visualization; easy collaboration; reproducible research
I. Introduction
Reinforcement Learning (RL) is one of the most promising research domains in machine learning (ML) and has shown exciting progress towards building capable artificial intelligence (AI) systems: recent advances in RL span tasks that require sequential decision making, such as Atari games, robotic control, and StarCraft. However, much remains to be done for RL methods to reach the next level of capability and to transfer their success from toy environments to more complex real-world scenarios. To move towards a more powerful setup and avoid some fundamental limitations of the standard RL problem formulation, the field has recently turned its attention to a new direction: Offline Reinforcement Learning. In this blog, we'll introduce this setup and its motivation, and briefly cover the key challenges that prevent prior RL successes from transferring directly into this new setting. Along the way, we'll use Weights & Biases to log, visualize, and analyze the offline RL workflow, and showcase how the W&B toolkit accelerates research collaboration and helps produce cleaner, more reproducible offline RL methods.

An illustration of how prior reinforcement learning (RL) differs from the offline RL setup: the offline agent is fed only data samples and does not interact with the environment as part of its learning process. Figure source: https://bair.berkeley.edu/blog/2020/12/07/offline/
II. Background & Motivation
How is reinforcement learning different from supervised learning?
Unlike supervised learning, which trains a model on a fixed, pre-collected dataset, a standard RL agent must repeatedly interact with its environment to collect the very data it learns from. Offline Reinforcement Learning alleviates this need for online data collection: the agent learns purely from a fixed dataset of previously collected experience.
First, let's look at how a standard online RL agent is trained.
We plot the learning curves logged from training a Walker-v3 agent with Soft Actor-Critic (SAC), a state-of-the-art RL algorithm, and show how its performance (measured by episode reward) improves over the course of training:
[Panel: Teacher Agent Training (1 run)]
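For reference, producing these curves only requires logging a couple of metrics inside the training loop. Below is a minimal sketch, assuming a SpinningUp-style SAC loop where test_agent() is a hypothetical helper that evaluates the current policy and returns its average test episode reward:

import wandb

run = wandb.init(project="spinup", name="sac_walker_teacher")

num_epochs = 250
for epoch in range(num_epochs):
    # ... one epoch of environment interaction and SAC updates ...
    avg_test_reward = test_agent()  # hypothetical evaluation helper
    wandb.log({"Epoch": epoch, "AverageTestEpRet": avg_test_reward})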
Let's see how the same algorithm fares without online interaction:
That is, we train a SAC "student" purely on the offline data: the replay buffer of (state, action, reward, next state) tuples saved from the previous teacher run. In the student reward plot, we see a clear drop in performance compared to the teacher's, and a large standard error when the same student training setup is repeated for 5 random seeds. To get a peek at what's going on under the hood, we can also plot the learned Q-values throughout student training: a clear over-estimation appears very early, within the first few epochs, and if we take the teacher's Q-values as the "ground truth", this over-estimation persists until the end of student training (at Epoch 250, the student estimates ~400 vs. the teacher's ~300).
[Panel: Student Agent Training (6 runs)]
[Panel: Teacher Online Training (1 run)]
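The Q-value curves above can be produced in a similar way: periodically evaluate the learned critic on a batch of buffer samples and log the average estimate. A minimal sketch, assuming a PyTorch critic q_net(obs, act) and a replay_buffer.sample_batch() helper that returns a dict of tensors (both are placeholders for whatever your SAC implementation exposes):

import torch
import wandb

def log_q_estimate(q_net, replay_buffer, epoch, batch_size=1024):
    # Average critic estimate over a random batch of saved teacher transitions
    batch = replay_buffer.sample_batch(batch_size)
    with torch.no_grad():
        q_values = q_net(batch["obs"], batch["act"])
    wandb.log({"Epoch": epoch, "AverageQ": q_values.mean().item()})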
III. How to Offline RL: Challenges and Datasets
The performance gap and Q-value overestimation we see above illustrate the key challenges we currently face in offline reinforcement learning. Intuitively, whenever an online RL algorithm starts leaning towards a suboptimal policy, the agent acts on it, collects low-reward data, and learns to shift away from those behaviors. This need for a "self-correcting" phase means that standard RL algorithms rely heavily on online data collection and, as we've seen, fail quickly in the offline setting. To mitigate this, several potential solutions have been explored in recent work. For a more detailed overview of both the theoretical background and recent advances in this topic, see Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems.
As you start a research project in this setting, the first and foremost step is to collect an offline dataset, i.e. the replay buffer saved from an online 'teacher'. To log and version it, and to enable analysis and visualization later, W&B Artifacts provide a handy workflow, which we show using the same example as above: we save the replay buffer from the SAC Walker agent as an artifact, along with any relevant metrics logged during the training run. This lets you download the same dataset into any other workspace (e.g. a different machine, another collaborator's local directory, a Google Colab notebook, etc.). Then, we can load the saved data back and use W&B plots and Tables to create more visualizations.
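On the saving side, the pattern looks roughly like the following. This is a minimal sketch, assuming the replay buffer and per-epoch metrics have already been written to disk; the project name and file paths are illustrative, not the exact ones used for these runs:

import wandb

run = wandb.init(project="spinup")

# Package the serialized replay buffer (plus any logged metrics) as a versioned artifact
teacher_data = wandb.Artifact("walker_sac_teacher", type="teacher")
teacher_data.add_dir("data/sac_walker/buffers")        # saved replay buffer files
teacher_data.add_file("data/sac_walker/progress.txt")  # per-epoch training metrics
run.log_artifact(teacher_data)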
Visualize saved offline dataset with Tables
Each row below records one epoch's statistics from the teacher's training run. The 'Video' column is created by loading the saved replay buffer artifact, grouping the data tuples back into their original episodes, and rendering a subset of them into videos. The rows are then split on 'Epoch <= 100', which roughly separates the "early" vs. "late" stages of training: the early epochs show videos of the walker agent failing quickly, while the later epochs show the desired walking behaviors emerging.
import wandb
from glob import glob
from os.path import join

data = wandb.Artifact(args.env_algo, type='teacher')  # artifact created on the saving side
artifact = run.use_artifact('mandizhao/spinup/{}_sac_alpha0-2_fix_alpha:latest'.format(env_name), type='teacher')
artifact_dir = artifact.download()  # download the saved teacher replay buffer
buffer_dir = glob(join(artifact_dir, seed, 'buffers/*'))[-1]  # latest buffer file inside the artifact
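From there, the Table itself can be built along these lines. A minimal sketch that reuses run and buffer_dir from the snippet above, and assumes a hypothetical render_episode() helper that re-renders one stored episode into a uint8 array of shape (time, channel, height, width), plus an epoch_stats list of per-epoch summaries:

import wandb

# One row per teacher epoch; the 'Video' column holds a re-rendered episode
table = wandb.Table(columns=["Epoch", "AverageEpRet", "Video"])
for epoch, stats in enumerate(epoch_stats):       # epoch_stats: assumed per-epoch summaries
    frames = render_episode(buffer_dir, epoch)    # hypothetical rendering helper
    table.add_data(epoch, stats["AverageEpRet"], wandb.Video(frames, fps=30, format="gif"))
run.log({"teacher_buffer_table": table})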
[Table panel: Visualize Offline RL Artifact]
[Table panel: Another Teacher Agent: SAC Hopper]

Early-stage teacher agent: it struggles to produce any useful behavior.

Late-stage teacher agent: the walking behavior is emerging, and we see a clearly non-random policy being learned.
IV. Data Curriculum for Offline RL
With the offline dataset cleaned up, analyzed, visualized, and ready to use, we can start exploring ways to mitigate the over-estimation issue we saw in the example student training above. One simple yet surprisingly effective method is a Data Curriculum: instead of providing the entire teacher replay buffer to the student at once, it creates a curriculum of data during student training by sequentially making larger subsets of the teacher data visible. At each epoch of student training, we expose only the teacher data collected up until that same epoch: using the time-step as an indicator, the curriculum effectively re-creates the data "experience" of the teacher's training run.
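Concretely, the curriculum only changes how training batches are sampled from the teacher buffer. Below is a minimal sketch, assuming the buffer stores transitions in the order the teacher collected them and exposes its current size and a get() accessor (all names are illustrative, not the exact implementation behind these runs):

import numpy as np

def sample_curriculum_batch(buffer, student_epoch, steps_per_epoch, batch_size=256):
    # Only transitions the teacher had collected by this epoch are visible to the student
    visible = min((student_epoch + 1) * steps_per_epoch, buffer.size)
    idxs = np.random.randint(0, visible, size=batch_size)
    return buffer.get(idxs)  # assumed accessor returning a batch of (state, action, reward, next state) tuples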
Train a student with the same SAC algorithm but a curriculum of data:
Below we compare the two methods, each averaged across multiple random seeds. When the data curriculum is added, we see a clear performance gain over the no-curriculum baseline. The Q-value estimates also look greatly tamed, staying much closer to the teacher's.
[Panel: Time Data Curriculum (5 runs)]
[Panel: No Curriculum (5 runs)]
References
- S. Lange, T. Gabel, and M. Riedmiller. Batch Reinforcement Learning. In Reinforcement Learning: State-of-the-Art, Adaptation, Learning, and Optimization, 2012.
- S. Fujimoto, D. Meger, and D. Precup. Off-Policy Deep Reinforcement Learning without Exploration. In International Conference on Machine Learning (ICML), 2019.
- A. Kumar, J. Fu, G. Tucker, and S. Levine. Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction. In Neural Information Processing Systems (NeurIPS), 2019.
- A. Kumar, A. Zhou, G. Tucker, and S. Levine. Conservative Q-Learning for Offline Reinforcement Learning. In Neural Information Processing Systems (NeurIPS), 2020.
- S. Levine, A. Kumar, G. Tucker, and J. Fu. Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems. arXiv preprint arXiv:2005.01643, 2020.
- Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum Learning. In International Conference on Machine Learning (ICML), 2009.
- T. Matiisen, A. Oliver, T. Cohen, and J. Schulman. Teacher-Student Curriculum Learning. arXiv preprint arXiv:1707.00183, 2017.
- S. Sukhbaatar, Z. Lin, I. Kostrikov, G. Synnaeve, A. Szlam, and R. Fergus. Intrinsic Motivation and Automatic Curricula via Asymmetric Self-Play. In International Conference on Learning Representations (ICLR), 2018.
- D. Seita, C. Tang, R. Rao, D. Chan, M. Zhao, and J. Canny. ZPD Teaching Strategies for Deep Reinforcement Learning from Demonstrations. arXiv preprint arXiv:1910.12154, 2019.
- Y. Zhang, P. Abbeel, and L. Pinto. Automatic Curriculum Learning through Value Disagreement. In Neural Information Processing Systems (NeurIPS), 2020.
- T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. In International Conference on Machine Learning (ICML), 2018.