Melting Pot initial trials
Work in progress - detailed report coming soon
Created on August 24 | Last edited on January 26
Intro
Multi-Agent Reinforcement Learning (MARL) has recently become an active area of research into giving agents the social capabilities humans have. Following its work on Sequential Social Dilemmas (SSDs), DeepMind has released the Melting Pot testing suite. It aims to cover a broad range of social interactions, such as cooperation and competition, to assess how well agents can generalize in these Atari-like game-theoretic scenarios.
The gist of the project is to:
- Find better metrics to assess the performance of these methods in the MARL setting (Part 2)
- Understand / replicate MARL methods running on these SSD environments (Part 3)
Part 1: Vanilla method - PPO on Melting Pot

What Melting Pot is about - screenshot from the talk on 'Social-Cognitive Capacities, Representations, and Motivations' given by one of its authors, Joel Leibo
Melting Pot can be thought of as a testing ground for MARL algorithms, much as the ImageNet challenge enabled breakthroughs in Computer Vision. Many SSD-like games such as Clean Up, Harvest, and Overcooked-style cooking can be used to test the performance of a population of agents under different game-theoretic scenarios like stag hunt and prisoner's dilemma, played in an iterated fashion. Mixed-motive games are interesting because they capture many real-world problems that are neither purely cooperative nor purely competitive. E.g. problems in tackling climate change, where there is a limited resource pool and failure to share resources sustainably can lead to outcomes like the 'tragedy of the commons'.
I ran the tried-and-true Proximal Policy Optimization method (Schulman et al., 2017) to establish a baseline and get a sense of how algorithms should be implemented on Melting Pot. The PPO implementation is adapted from CleanRL; credit to that project for its clear documentation and code samples highlighting all the nuances in implementation details.
I adopted the PettingZoo interface via the wrapper provided in Melting Pot to standardize interaction with the multi-agent environment, making it similar to implementing RL methods against OpenAI's Gym in the single-agent setting.
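For orientation, here is a minimal sketch of the resulting interaction loop in the PettingZoo Parallel API style. The `make_meltingpot_parallel_env` factory is a placeholder for however the Melting Pot wrapper is constructed in your setup, and the 4-tuple `step` return follows the PettingZoo API of the time (newer versions split dones into terminations and truncations):

```python
# Sketch of a PettingZoo Parallel API loop over a Melting Pot substrate.
# `make_meltingpot_parallel_env` is a hypothetical factory, not the actual
# wrapper name; the point is that interaction looks like single-agent Gym,
# just with per-agent dicts instead of single arrays.
env = make_meltingpot_parallel_env("commons_harvest_open")
observations = env.reset()

for _ in range(1000):
    # One action per live agent, keyed by agent id.
    actions = {agent: env.action_space(agent).sample() for agent in env.agents}
    observations, rewards, dones, infos = env.step(actions)
    if all(dones.values()):
        observations = env.reset()
env.close()
```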
A naive measurement of performance, together with a video of the gameplay in the Commons_Harvest_Open substrate:
Run: commons_harvest_open__ppo__1__1661323746
For more details on the substrates, i.e. the different combinations of environment scenario and agent population mix, please refer to the appendix of the project paper and the GitHub docs: https://github.com/deepmind/meltingpot/blob/main/docs/substrate_scenario_details.md
Part 2: Social Outcome Metrics
I implemented the 4 Social Outcome Metrics (Perolat et al., 2017) as a better way to track the state of MARL mixed-motive games than simply using the average return received across all agents.
The 4 metrics are:
- Utilitarian (U), aka Efficiency - sum total of all rewards obtained by all agents
- Equality (E) - using Gini coefficient
- Sustainability (S) - average time at which the rewards are collected
- Peace (P) - average number of untagged agent steps

Formula of the 4 metrics from the paper.
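As a rough, self-contained sketch, per-episode versions of these formulas can be computed from per-agent reward and tagging records along the following lines; the function name, array shapes, and bookkeeping here are my own assumptions, not the wrapper's actual API:

```python
import numpy as np

def social_outcome_metrics(rewards, tagged_steps, T):
    """rewards: array of shape (num_agents, T) with each agent's per-step reward.
    tagged_steps: total number of agent-steps spent tagged (zapped) out.
    T: episode length. Shapes and names are illustrative assumptions."""
    returns = rewards.sum(axis=1)                 # per-agent episode return R_i
    N = len(returns)

    # Utilitarian / Efficiency: total reward collected per unit time.
    U = returns.sum() / T

    # Equality: 1 - Gini coefficient over per-agent returns.
    # Caveat (see note in Part 3): the Gini coefficient assumes non-negative
    # returns, so this needs adjusting for substrates with negative rewards.
    diffs = np.abs(returns[:, None] - returns[None, :]).sum()
    E = 1.0 - diffs / (2 * N * returns.sum() + 1e-8)

    # Sustainability: average timestep at which positive rewards are collected.
    times = [np.nonzero(rewards[i] > 0)[0].mean()
             for i in range(N) if (rewards[i] > 0).any()]
    S = float(np.mean(times)) if times else 0.0

    # Peace: average number of untagged agent-steps per unit time.
    P = (N * T - tagged_steps) / T
    return U, E, S, P
```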
[Implementation Detail]: All metrics can be implemented directly, with the exception of Peace, which requires the tag (aka Zap) action. That action is not exposed through the environment interface but is instead treated as an internal world observation.
After some digging, the only substrate in Melting Pot that currently exposes these 'World Observations' is the 'Allelopathic Harvest' game, which happens to be exactly the game the CNM paper experiments on.
An additional enhancement was made to my implementation of the RecordMultiagentEpisodeStatistics wrapper to expose those World Observations to each individual agent, since originally each agent only receives its own observation and no information about the global state (e.g. public sanctions).
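A minimal sketch of just that world-observation part of the wrapper idea, assuming the substrate surfaces global state under 'WORLD.'-prefixed observation keys (the exact key names vary by substrate and are an assumption here; this mirrors the intent of my wrapper, not its exact code):

```python
class ExposeWorldObservations:
    """Copy substrate-level 'WORLD.*' entries (e.g. public sanction events)
    into every agent's info dict so the learning code can see them."""

    def __init__(self, env, world_prefix="WORLD."):
        self.env = env
        self.world_prefix = world_prefix

    def step(self, actions):
        obs, rewards, dones, infos = self.env.step(actions)
        # Gather any world-level entries from the raw per-agent observations.
        world = {}
        for agent_obs in obs.values():
            if isinstance(agent_obs, dict):
                world.update({k: v for k, v in agent_obs.items()
                              if k.startswith(self.world_prefix)})
        # Make the shared world state available to every agent via `infos`.
        for agent in infos:
            infos[agent]["world"] = world
        return obs, rewards, dones, infos
```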
Part 3: Implementing Classifier Norm Model (CNM)
Evaluating the Classifier Norm Model (CNM) from the paper "A learning agent that acquires social norms from public sanctions in decentralized multi-agent settings" (Vinitsky et al., 2021) on Melting Pot substrates.
Why have I picked this paper?
Having read 'Social influence as intrinsic motivation for multi-agent deep reinforcement learning' (Jaques et al., 2019), I have been following the co-authors' research on Sequential Social Dilemmas and how agents can learn to interact with each other instead of acting purely on their own.
Papers such as 'Inequity aversion improves cooperation in intertemporal social dilemmas' (Hughes et al., 2018) and 'Joint Attention for Multi-Agent Coordination and Social Learning' (Lee et al., 2021) have provided unique insights into encouraging cooperation among agents to achieve higher reward, along with game-theoretic analysis to better understand the effectiveness of these methods in mixed-motive games.
In particular, the paper 'Open Problems in Cooperative AI' (Dafoe et al., 2021) and the talk on 'Social-Cognitive Capacities, Representations, and Motivations' pointed out that social norms could be an effective mechanism for promoting cooperation. This is an interesting idea: norms are one of the complex social behaviours that emerge in a society, and they are external to the capacities of each individual agent (e.g. theory of mind, communication), whereas enhancing individual capacities is what the general research direction typically focuses on. If norms can be established, they can guide new agents joining an environment toward the desired behaviour of the whole group (akin to some form of cultural transmission?). Following this line of thought led me to the CNM paper.
What is the paper about?
"A learning agent that acquires social norms from public sanctions in decentralized multi-agent settings" paper looked at using public sanctions as a means to construct the learning dynamics for social norms to emerge. With a population of agents learning the pattern of sanctioning it can prevent miscoordination and free-riding which are common in SSDs. Using public sacntion to construct social norms is possible as often this is the only public information availble to all agents. We should not expect agents to share their internal policies & rewards and use that as a means to construct MARL algorithms as some papers do as often those information are private. The paper captures the idea much better and I encourage readers to check it out.
Experiment setup / general implementation details
- In addition to the Gym wrapper RecordEpisodeStatistics, I have implemented a RecordMultiagentEpisodeStatistics wrapper for calculating the 4 social outcome metrics
- In addition to PettingZoo, the library / interface also comes with a very useful utility, SuperSuit, for adding extra preprocessing to the environments. Additional wrapper / preprocessing steps include (see the sketch after this list):
- frame stacking of 2 frames, useful for the CNM classifier as it needs the T-1 frame to output a sanction prediction and learns by comparing it against the T frame, i.e. whether a sanction actually happened
- converting the multi-agent gameplay into vectorized environments
- to manage the added complexity of training a multi-agent system, it is better to separate the agents into their own parallel envs for the RL algorithm to train on; e.g. a 16-player game is split into 16 vectorized envs, each representing the observation / action / reward of one agent
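Put together, the preprocessing chain might look roughly like this with SuperSuit; `make_meltingpot_parallel_env` is again a placeholder for the Melting Pot PettingZoo wrapper, and the exact wrapper versions may differ from what is actually used in the runs:

```python
import supersuit as ss

def make_training_env(num_frames=2):
    # Placeholder for however the Melting Pot PettingZoo parallel env is built.
    env = make_meltingpot_parallel_env("allelopathic_harvest")

    # Stack 2 frames so the CNM classifier sees both the T-1 and T observations.
    env = ss.frame_stack_v1(env, num_frames)

    # Turn the N-player parallel env into N vectorized single-agent envs,
    # one per player, then concatenate them into one batched env for PPO.
    env = ss.pettingzoo_env_to_vec_env_v1(env)
    env = ss.concat_vec_envs_v1(env, num_vec_envs=1, num_cpus=0, base_class="gym")
    return env
```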
Ablation Studies
Vanilla PPO
Implementation details - things that differ from the authors' implementation:
In this project I'm not using A3C + the contrastive loss but opted for PPO instead, because the former:
- adds a lot of complexity without clearly demonstrated benefit in the paper
- is not the core idea of the paper
Gameplay of PPO on Allelopathic Harvest
Run set: PPO (2 runs)
PPO + LSTM
The implementation here is also adapted from the CleanRL project; for more information on the implementation details of the variations in algorithm / mode of training, I recommend reading 'The 37 Implementation Details of Proximal Policy Optimization'.
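For reference, the recurrent variant follows the same pattern as CleanRL's ppo_atari_lstm agent: a CNN encoder feeding an LSTM whose hidden state is carried across timesteps and reset at episode boundaries. A minimal sketch, with layer sizes and the 6-channel input (2 stacked RGB frames) being illustrative assumptions rather than the exact configuration used here:

```python
import torch
import torch.nn as nn

class RecurrentAgent(nn.Module):
    """Sketch in the spirit of CleanRL's ppo_atari_lstm agent."""

    def __init__(self, num_actions, hidden=128):
        super().__init__()
        self.encoder = nn.Sequential(            # CNN over the stacked RGB frames
            nn.Conv2d(6, 32, 8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Flatten(),
            nn.LazyLinear(512), nn.ReLU(),
        )
        self.lstm = nn.LSTM(512, hidden)
        self.actor = nn.Linear(hidden, num_actions)
        self.critic = nn.Linear(hidden, 1)

    def forward(self, obs, lstm_state, done):
        # obs: (batch, C, H, W); done: float flags that reset the LSTM state
        # at episode boundaries. Processes a single timestep per call.
        feat = self.encoder(obs / 255.0).unsqueeze(0)      # (1, batch, 512)
        reset = (1.0 - done).view(1, -1, 1)
        lstm_state = (lstm_state[0] * reset, lstm_state[1] * reset)
        out, lstm_state = self.lstm(feat, lstm_state)
        out = out.squeeze(0)
        return self.actor(out), self.critic(out), lstm_state
```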
Here we can see significant improvements in terms of the social outcome metrics.
Note that the Equality metric will need adjusting: as described in the paper it is based on the Gini coefficient, which is not applicable to envs with negative rewards.
Run set (5 runs)
PPO + LSTM + CNM

Illustration of the Classifier Norm Model from the paper, where the classifier is a separate neural network from the policy network.
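Based only on the description above (a prediction made from the T-1 frame, compared against whether a sanction actually happened at T), a separate sanction classifier could be sketched as follows; the names, shapes, and label-extraction step are assumptions on my part rather than the paper's exact architecture:

```python
import torch
import torch.nn as nn

class SanctionClassifier(nn.Module):
    """Separate network from the policy: given the previous (T-1) frame,
    predict whether a public sanction event occurs at step T."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Flatten(),
            nn.LazyLinear(256), nn.ReLU(),
            nn.Linear(256, 1),               # logit for "sanction occurs at T"
        )

    def forward(self, prev_frame):
        return self.net(prev_frame / 255.0)

# Training signal: compare the prediction made from the T-1 frame against the
# observed sanction label at T (e.g. extracted from the world observations).
bce = nn.BCEWithLogitsLoss()
# loss = bce(classifier(frames_t_minus_1), sanction_occurred_t.float())
```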
Run set (8 runs)