
Measuring Safety in Reinforcement Learning

This article introduces the SafeLife benchmark, launched in collaboration with Partnership on AI, and explains its importance for the future development of ethical AI.
TL;DR: Our benchmark with Partnership on AI measures safety in reinforcement learning; join us as a human or an AI.
This project is a collaboration with Carroll Wainwright and Peter Eckersley.

Introducing the SafeLife Benchmark

We’re excited to announce our latest benchmark in collaboration with Partnership on AI: explore safety in reinforcement learning via a powerful and adjustable game-playing environment. SafeLife v1.2 procedurally generates a series of puzzle levels with three possible tasks (build, destroy, navigate) and Conway’s Game of Life dynamics—fun and challenging for both human and machine players.
A fixed set of 100 benchmark puzzles quantifies the tradeoff between agent performance (solution speed and accuracy) and induced side effects, or interference with the environment’s shifting patterns. You can read more in the PAI announcement.
Example task: remove only the red cells and find the exit
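
If you'd like to try the environment yourself, interaction follows the standard Gym loop. Here's a minimal sketch; the import path and constructor arguments below are assumptions, so check the SafeLife repository for the exact API and for how to select levels.

```python
# Minimal sketch of playing a SafeLife level with a random policy.
# The import path and constructor below are assumptions; consult
# github.com/PartnershipOnAI/safelife for the actual API.
from safelife.safelife_env import SafeLifeEnv  # assumed import path

env = SafeLifeEnv()  # may require a level iterator or level name in practice

obs = env.reset()
done = False
total_reward = 0.0

while not done:
    action = env.action_space.sample()  # random placeholder for a trained agent
    obs, reward, done, info = env.step(action)
    total_reward += reward

print("Episode reward:", total_reward)
```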


The Challenge

The goal of this project is to measure and improve safety in reinforcement learning (RL): quantify the negative side effects of RL agents in the SafeLife environment (an artificial but expressive and highly general game) and figure out how to train RL agents to reliably minimize such side effects.
Side effect reduction is a core unsolved problem in AI safety. Since we teach RL agents to optimize an explicit objective function, anything not covered by that function is implicitly treated as unimportant (or at least as fair game). A classic example: a cleaning robot maximizing efficiency might smash an expensive vase if it blocks the shortest path across a room.
We might then encode the vase, and possibly all objects in the room, as protected and not to be disturbed, but what about the cat or a new guest? We’d love for an agent to accomplish the task “while respecting common sense” or “while minimizing the influence on the environment”, but the former is notoriously difficult to encode and the latter might prevent the agent from doing anything at all.
How can we train an agent to avoid negative side effects in general, without enumerating them, especially when we can’t possibly know all of them in advance? In this benchmark, we approximate the amount of agent-caused change in the dynamic puzzle environment as the side effects and train on a variety of task configurations, exploring the tradeoffs between safety and performance across different approaches.
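
To make "agent-caused change" concrete: one natural measure compares the board the agent actually left behind with the counterfactual board that would have evolved had the agent never acted. The sketch below illustrates that idea; `evolve_without_agent` is a hypothetical helper, and this conveys the general principle rather than SafeLife's exact scoring code.

```python
import numpy as np

def side_effect_score(initial_board: np.ndarray,
                      final_board: np.ndarray,
                      evolve_without_agent,
                      n_steps: int) -> float:
    """Fraction of cells that differ from how the board would have evolved
    on its own, with no agent present.

    `evolve_without_agent` is a hypothetical function that applies the
    environment's dynamics (here, Game of Life rules) for n_steps steps.
    """
    counterfactual = evolve_without_agent(initial_board, n_steps)
    return float(np.mean(counterfactual != final_board))

# A shaped objective might then trade task reward against this penalty:
#   shaped_reward = task_reward - penalty_coef * side_effect_score(...)
```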
By tracking experiments in a public Weights & Biases benchmark, we aim to encourage research and collaboration on this important problem. Through automated logging of code, config, and result metrics to W&B, we can compare training strategies, explore new approaches more easily, and verify how much information about side effects an RL agent receives. As a bonus, we can track and visualize the same metrics for artificial and human players, and we hope this inspires more transfer learning between the two groups.
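
Submitting runs only requires the standard wandb calls. A minimal, self-contained logging skeleton might look like this (the project name, config keys, and metric names are illustrative, not the benchmark's required schema):

```python
import random
import wandb

# Illustrative project, config, and metric names; see the benchmark page
# for the exact settings an official submission expects.
wandb.init(project="safelife-benchmark", config={
    "algorithm": "ppo",
    "side_effect_penalty": 0.3,
    "total_steps": 5_000_000,
})

for step in range(1000):
    # Stand-in for one iteration of a real training loop.
    episode_reward = random.gauss(0.0, 1.0)  # replace with your agent's return
    side_effects = random.random()           # replace with the measured side effects
    wandb.log({
        "episode_reward": episode_reward,
        "side_effects": side_effects,
    }, step=step)

wandb.finish()
```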

Sample training runs


Long-Term Impact

SafeLife is a general environment for benchmarking safety in reinforcement learning, offering many directions for further research:
  • explore hyperparameters and improve the baseline model (reducing side effects is far from a solved problem!)
  • implement different RL algorithms and training strategies
  • quantify and penalize side effects robustly across different tasks (e.g. via attainable utility preservation and relative reachability; see the sketch after this list)
  • precisely tune the difficulty, safety constraints, and dynamics of the game-playing environment
  • incorporate supervision from human players
  • build more powerful visualizations for model analysis
  • generalize across tasks and distributional shift
  • allow for safe exploration, extend to multi-agent systems, and more.
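
As one concrete example of the penalty methods listed above, an attainable-utility-preservation-style penalty charges the agent for actions that change how well it could achieve a set of auxiliary goals, relative to doing nothing. The sketch below assumes you already have Q-value estimates for those auxiliary reward functions; it illustrates the idea and is not SafeLife's built-in implementation.

```python
import numpy as np

def aup_penalty(q_aux: np.ndarray, action: int, noop: int, scale: float = 1.0) -> float:
    """Attainable-utility-preservation-style penalty.

    q_aux: array of shape (n_aux_rewards, n_actions) with Q-value estimates
           for a set of auxiliary reward functions in the current state.
    The penalty is the average absolute change in attainable auxiliary value
    from taking `action` instead of the no-op.
    """
    diffs = np.abs(q_aux[:, action] - q_aux[:, noop])
    return scale * float(diffs.mean())

# shaped_reward = task_reward - aup_penalty(q_aux, chosen_action, noop_action)
```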
While SafeLife itself is a game, it can deepen and validate our theoretical understanding of safety in reinforcement learning agents across critical applications, from robotics to healthcare to financial markets to personalized recommendation systems. Conversely, if we can't reliably teach a goal-driven agent on a small 2D grid to avoid destroying its environment, how can we hope to do the same in the vast, dynamic complexity of the real world?
At Weights & Biases, we build tools to support the ethical and effective development of machine learning models. Benchmarks enable us to foster and accelerate open collaboration on meaningful research problems. We are actively seeking more benchmark projects, especially ones relevant to climate change and long-term value alignment. If you have an application, dataset, or baseline model you’d like to contribute or recommend, please reach out to stacey@wandb.com.

Additional resources

