Ripple
Rolv's Imitating Proactive Policy Learning from Experts
Created on January 26 | Last edited on March 10
Ripple is a Rocket League bot being trained with a combination of reinforcement learning and imitation learning.
The purpose of this project is to train Ripple in two stages: first teaching it to play in a human-like way through imitation learning, then pushing it to be the best player it can be through reinforcement learning.
Imitation
My goal with the imitation learning stage was to train Ripple to take human-like actions in the game before moving on to reinforcement learning. There is an online database of human game replays at ballchasing.com that I wanted to leverage for this, but unfortunately, Rocket League replays don't include the actions players took at each frame. To overcome this problem I used my previous project, inspired by this OpenAI paper, where I trained an Inverse Dynamics Model (IDM) to predict the action that led from one frame to the next, given physics data on either side of the action.
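To make the idea concrete, here is a minimal PyTorch sketch of what such an IDM can look like; the network shape, window size, and 90-way action head are illustrative assumptions, not the model actually used for Ripple.

```python
import torch
import torch.nn as nn

class IDMNet(nn.Module):
    """Illustrative inverse dynamics model: given physics snapshots from a
    window of frames around a transition, predict the discrete action taken."""
    def __init__(self, frame_dim: int, window: int, n_actions: int, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(frame_dim * window, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, frames):              # frames: (batch, window, frame_dim)
        return self.net(frames.flatten(1))  # action logits

# Training reduces to ordinary classification against known actions
# (available when generating data with bots, unlike human replays).
idm = IDMNet(frame_dim=30, window=11, n_actions=90)
loss_fn = nn.CrossEntropyLoss()
frames = torch.randn(32, 11, 30)           # dummy batch of physics windows
actions = torch.randint(0, 90, (32,))      # dummy ground-truth action indices
loss = loss_fn(idm(frames), actions)
```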
I was able to use the IDM to predict player actions in around 37 000 hours of SSL/pro gameplay - in the 1v1, 2v2, and 3v3 ranked playlists, as well as the RLCS - which I then used to train the initial stage of Ripple to play Rocket League. Below you can see graphs of the loss and accuracy of the bot during this stage of training. Different models have been used, but v1.0 used the bcm-6M-normalized_dot model as the starting point, and v1.1 started from the slightly smaller (and thus faster) bcm-1M-low_decay3.
Reinforcement
After the imitation learning stage, I transferred Ripple to a distributed reinforcement learning algorithm using PPO where it is learning to play the game by playing against itself and its prior versions. To ensure Ripple doesn't entirely lose the human-like strategy it gained from the imitation learning stage, I included an extra term in the PPO loss function that penalizes KL divergence from the version of Ripple that came from the imitation learning stage, as was also done for AlphaStar and VPT. Doing this encourages Ripple to continue taking human-like actions while it improves at the game. Additionally, the coefficient of the KL term in the loss is decayed, meaning it has less of an effect over time. These graphs show the KL divergence between Ripple and its starting point, and the coefficient over time. In v1.1 I also changed to a periodic coefficient during later stages, to hopefully get the best of both exploration and exploitation.
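As a rough illustration, the extra term can be folded into the PPO policy loss along these lines; this is a sketch, with variable names and a simple sample-based KL estimate as assumptions rather than the project's actual implementation.

```python
import torch

def ppo_loss_with_bc_kl(logp_new, logp_old, advantages,
                        logp_bc, kl_coef, clip_eps=0.2):
    """PPO clipped surrogate plus a penalty on divergence from the frozen
    imitation-learned (behavioral cloning) policy.

    logp_new/logp_old: log-probs of the taken actions under the current and
    rollout policies; logp_bc: log-probs under the frozen imitation policy.
    kl_coef is decayed (or cycled) over the course of training.
    """
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    surrogate = torch.min(ratio * advantages, clipped * advantages).mean()

    # Simple sample-based estimate of the KL divergence from the BC policy,
    # computed on the actions visited during the rollout.
    kl_to_bc = (logp_new - logp_bc).mean()

    # Negated because optimizers minimize; a lower loss means a higher
    # surrogate objective and a smaller divergence from the BC policy.
    return -(surrogate - kl_coef * kl_to_bc)
```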
The most important way to measure Ripple's skill while it trains is the TrueSkill system, which is a common way to measure the skill of players in video games (MMR in Rocket League uses a similar system). Different iterations of the agent are pitted against each other in a normal Rocket League match, and wins/losses are recorded to calculate their ratings. This rating is also used for matching the latest version with previous iterations during training.
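For reference, the public trueskill Python package implements this kind of rating update; whether Ripple's rating code uses that exact package is an assumption, but the mechanics of a 1v1 update look like this.

```python
import trueskill

# Two snapshots of the bot start from the default rating (mu=25, sigma=25/3);
# the older snapshot here is assumed to have climbed slightly already.
latest, older = trueskill.Rating(), trueskill.Rating(mu=27.0)

# The latest version beats the older snapshot; both ratings are updated.
latest, older = trueskill.rate_1vs1(latest, older)

# quality_1vs1 estimates how evenly matched two versions are,
# which is the kind of signal used when picking training opponents.
print(latest, older, trueskill.quality_1vs1(latest, older))
```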
Below you can see a plot of Ripple v1.1's TrueSkill during training.
Here are some game stats throughout training, measured for Ripple v1.1 and v1.0, and Tecko (the failed successor to Necto and Nexto) as an example of a bot starting from scratch.
Details
When it comes to making Rocket League bots, the details can make a big difference. I will describe some of them, and the reasoning behind them, here.
Action delay
A common complaint from people playing against the best bots is their unnatural playstyle and reaction time. Nexto is the prototypical example of this. It has a dribble-heavy playstyle with crazy flick power and accuracy, partially made possible by its ability to quickly correct any deviation of the ball on its hood. In defense, it tends to sit in net and wait for the opponent to flick, since it learned that this is an effective strategy while training against itself. For humans this is unrealistic; we simply can't react fast enough to get to where the flick is going to go. As both attackers and defenders, we also don't have perfect information about the location, velocity, etc. of other players like bots do; the ball obstructs our view and our depth perception is not perfect. To even the playing field somewhat, Ripple has a built-in delay of about 200 ms - a typical human reaction time - implemented by queuing actions for execution. This should make dribbling slightly less overpowered, since challenges and fake challenges will work better, and it means fakes and other hard-to-predict plays should be more viable. Of course, Ripple may also learn to utilize these tactics while playing against itself, making for an overall experience more akin to playing against humans.
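A simple way to implement such a delay is a fixed-length action queue. The sketch below assumes the policy acts at roughly 15 Hz, so a queue of 3 actions gives about 200 ms of latency; the exact tick rate and queue length are assumptions, not Ripple's actual settings.

```python
from collections import deque

class DelayedActionQueue:
    """Hold each chosen action for `delay_steps` policy ticks before execution.
    At ~15 actions per second, delay_steps=3 is roughly a 200 ms delay."""
    def __init__(self, delay_steps: int, noop_action):
        self.queue = deque([noop_action] * delay_steps, maxlen=delay_steps)

    def step(self, new_action):
        executed = self.queue.popleft()   # action chosen delay_steps ticks ago
        self.queue.append(new_action)     # newly chosen action enters the queue
        return executed

    def contents(self):
        """Pending actions, which can be appended to the observation."""
        return list(self.queue)
```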
Starting states
When training a bot, one very important thing is to expose it to many diverse scenarios, particularly those where there is a lot to learn. If the bot always starts in a dribbling position, it might become really good at dribbling. If it always starts with an open net, it might become really consistent at scoring really fast. Ripple uses states from SSL replays, and weights their sampling probabilities using a combination of the following two signals (a small sampling sketch follows the list):
- Aerial potential - combining ball height/velocity and player heights/velocities/boost amounts to estimate air time.
- Goal volatility - estimated by using a next goal predictor, generously supplied by twobackfromtheend from his situational player value experiment. States are weighted by how much the prediction changes in the near future. This should mean actual outplays and saves are weighted highly, while the ball rolling into an open net is not weighted highly.
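A sketch of what the weighted sampling could look like, with the two signals above left as placeholder scoring functions; the blend weight and function names are assumptions.

```python
import random

def sample_start_state(states, aerial_score, volatility_score, mix=0.5):
    """Pick a replay state with probability proportional to a blend of the
    two heuristics above. aerial_score and volatility_score are placeholder
    callables returning non-negative floats; `mix` is an assumed blend weight."""
    weights = [
        mix * aerial_score(s) + (1 - mix) * volatility_score(s)
        for s in states
    ]
    return random.choices(states, weights=weights, k=1)[0]
```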
Truncation
NOTE: Not in effect in latest versions of Ripple, as it didn't seem to have as much of a positive effect as I wanted. I still think it could be a good idea to prioritize obtaining more experience near replay states.
While good starting states are undoubtedly useful, if you let the bot play it will quickly diverge from them, reducing their effectiveness. Gravity especially is a large hurdle: unless a bot is trained very specifically to stay in the air, it will tend to fall down, and the rest of the episode takes place in a grounded state. To allow Ripple to learn from the starting states, episodes are quite heavily time-limited. When an episode is cut off like this, we use the critic - a neural network that predicts future rewards - in place of the actual future rewards.
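Concretely, the difference from a real terminal state is only in how the return is bootstrapped. A minimal sketch, assuming a plain discounted-return computation rather than the project's actual PPO/GAE code:

```python
def discounted_returns(rewards, terminated, last_value, gamma=0.99):
    """Discounted returns for one episode fragment.

    If the fragment ended because of the time limit rather than a real
    terminal state (terminated=False), bootstrap from the critic's value
    estimate of the final state instead of treating the future as zero.
    """
    g = 0.0 if terminated else last_value
    returns = []
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return list(reversed(returns))
```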
Observation
The way the bot perceives the world is quite important. While simply plugging the raw physics data into a network can work, feature engineering can speed up the process significantly by giving the bot similar observations for similar situations, letting it generalize better. Ripple receives its own raw physics and car data, so it knows where it is on the field and what it can do. The ball and other cars, however, have their physics data transformed into positions and velocities relative to Ripple's point of view. In addition, since the field is symmetric across the goal-goal axis, Ripple's position is normalized so it always thinks it's on the left side of the field, though the observation also includes a left/right indicator to resolve situations like kickoffs where it may be unclear who should go (left goes is the typical rule of thumb for humans). The aforementioned action queue is also included, so it knows what it's about to do in the next few frames. Ball predictions, simulated in RocketSim, are included as well, showing slices of the ball's trajectory without any car interference (one slice every second, up to 5 seconds into the future) to make reads easier and take that load off the neural network.
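The relative-frame and mirroring ideas might look roughly like this in numpy. The axis conventions are assumptions, and only positions are shown (velocities and rotations need the same treatment); this is not Ripple's actual observation builder.

```python
import numpy as np

def relative_to(car_pos, car_rot_matrix, other_pos, other_vel):
    """Express another entity's position/velocity in the observing car's frame."""
    rel_pos = car_rot_matrix.T @ (other_pos - car_pos)
    rel_vel = car_rot_matrix.T @ other_vel
    return rel_pos, rel_vel

def mirror_if_needed(positions, car_x):
    """Flip the field across the goal-goal axis so the agent always 'sees'
    itself on the left; also return a left/right indicator for the observation."""
    on_right = car_x > 0                   # assumed sign convention for the x axis
    if on_right:
        positions = positions * np.array([-1.0, 1.0, 1.0])
    return positions, float(on_right)
```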
Architecture
Ripple uses a custom attention architecture, similar to PerceiverIO. This allows it to support any number of entities (balls/players/boosts) and to selectively use information from each entity by iteratively querying based on its own current state. Embeddings for the action queue and ball slices are produced using small feed-forward networks, while the embedding for the boosts is produced via max pooling. After producing an embedding for the state, instead of a standard classifier, another network generates embeddings for the available actions. The action logits are then produced by taking the dot product of each available action embedding with the state embedding.
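The dot-product action head could be sketched like this in PyTorch; the module names, shapes, and placeholder action features are assumptions, not the actual architecture.

```python
import torch
import torch.nn as nn

class DotProductActionHead(nn.Module):
    """Score actions by the dot product between a state embedding and learned
    per-action embeddings, instead of a fixed linear classifier."""
    def __init__(self, n_actions: int, action_feat_dim: int, embed_dim: int):
        super().__init__()
        # Static per-action features (in practice, e.g. the controller values
        # each discrete action maps to); random values used as a placeholder.
        self.register_buffer("action_features", torch.randn(n_actions, action_feat_dim))
        self.action_encoder = nn.Sequential(
            nn.Linear(action_feat_dim, embed_dim), nn.ReLU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, state_embedding):                                # (batch, embed_dim)
        action_embeddings = self.action_encoder(self.action_features)  # (n_actions, embed_dim)
        return state_embedding @ action_embeddings.T                   # (batch, n_actions) logits
```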
Actions
In theory, Rocket League has an infinitely large action space. There are 5 continuous inputs (throttle, steer, pitch, yaw and roll) and 3 discrete inputs (jump, boost and handbrake). However, we know that using keyboard and mouse is not an obstacle for making it to a super high level mechanically, as evidenced by multiple pros and freestylers. KBM discretizes the continuous inputs into 3 options (-1, 0, 1), giving 3^5 × 2^3 = 1944 options. This is much better, but with some clever optimizations like realizing yaw can be equal to steer, no reversing while boosting, etc. we can bring that number down to just 90 unique actions. Specifics can be found here. Ripple uses an expanded set with more bins for a total of 276 actions.
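A sketch of the counting argument: enumerate the full discretized grid and apply a couple of the pruning rules mentioned above. The full rule set that reaches exactly 90 actions (or Ripple's expanded 276) is not reproduced here.

```python
from itertools import product

# Full discretized grid: 3 options for each of the 5 continuous inputs and
# 2 for each of the 3 buttons -> 3**5 * 2**3 = 1944 combinations.
continuous = list(product([-1, 0, 1], repeat=5))   # throttle, steer, pitch, yaw, roll
buttons = list(product([0, 1], repeat=3))          # jump, boost, handbrake

actions = []
for throttle, steer, pitch, yaw, roll in continuous:
    for jump, boost, handbrake in buttons:
        if boost and throttle < 0:
            continue                               # no reversing while boosting
        if yaw != steer:
            continue                               # tie yaw to steer, as mentioned above
        actions.append((throttle, steer, pitch, yaw, roll, jump, boost, handbrake))

print(len(actions))  # already far fewer than 1944; further rules (not shown)
                     # bring the standard set down to 90 unique actions
```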
Rewards
Rewards are perhaps the most important part of making an RL bot. While I won't go into specifics, I can give an outline of the rewards for Ripple. The first thing I realized is that because the starting point is not totally incompetent, the rewards can be much sparser than is typical; other bots often need a lot of hand-holding to even start approaching the ball. Thus, the starting point for Ripple is touching the ball, which is rewarded in three ways:
- A base reward per touch, to promote dribbling (touching the ball every step would maximize this reward).
- A reward for the change in the ball's velocity, encouraging hard hits and other large changes in direction/speed.
- A reward for touches high in the air, with some extra logic to promote more advanced aerial plays like flip resets, air dribbles, and double taps.
Beyond touches, demos give rewards to the attacker and punishments to the victim. Holding boost is rewarded continuously, weighted so that more boost is better but with diminishing importance the more the bot already has. Lastly, goals give a very high reward, with small bonuses for goal speed and punishments for distance from the ball at the moment the goal happens. In addition, there is a system of reward distribution similar to the team spirit system in OpenAI Five: rewards are shared between teammates, and a positive reward for one team gives an equivalent negative reward for the other team, making it zero-sum.
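A sketch of how that distribution can work for equal-sized teams, with a hypothetical team-spirit coefficient tau; the value of tau and the exact blending are assumptions, not Ripple's actual reward code.

```python
import numpy as np

def distribute_rewards(blue_rewards, orange_rewards, tau=0.3):
    """Blend individual and team-average rewards (OpenAI Five-style team spirit),
    then subtract the opposing team's mean so the totals are zero-sum
    (exactly zero-sum when team sizes are equal)."""
    blue = np.asarray(blue_rewards, dtype=float)
    orange = np.asarray(orange_rewards, dtype=float)

    blue_mixed = (1 - tau) * blue + tau * blue.mean()
    orange_mixed = (1 - tau) * orange + tau * orange.mean()

    blue_final = blue_mixed - orange_mixed.mean()
    orange_final = orange_mixed - blue_mixed.mean()
    return blue_final, orange_final
```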