
Scaling Out Motion Prediction for Autonomous Vehicles with L5Kit, Ray, and W&B

In this tutorial, we'll show you how we easily organized and instrumented a prediction model for autonomous vehicle motion with W&B and scaled it out with Ray.

Introduction

We take the true complexity of driving for granted. When we drive, we're making dozens of small decisions every minute, from signaling turns to checking our rearview mirrors to braking so we're a little further from the car in front of us.
Self-driving models need to do the exact same things, and that means far more than one model and far more than one engineer. As the old adage says: "It takes a village." In this case, it takes a "village" of interdependent, data-adjacent practitioners all working together to make models that:
  1. are iterated on frequently, ensuring we have an effective model that performs well (no crashes)
  2. are tracked and traced for strong auditability, making the process of "debugging" models feasible
  3. can be utilized by downstream teams with the best-available context for them to understand and use the model we trained
Using Weights & Biases alongside Ray, we'll learn to scale our experimentation process to capture all of the related personas, traces, and results for our AV organization, using a real dataset released by Lyft Level 5 (Lyft's self-driving arm).

First Things First: What is Motion Prediction? 🤔

Motion prediction is the machine learning task concerned with anticipating how cars, cyclists, and pedestrians move in an autonomous vehicle's environment. This is different from object detection, where models are used to identify objects rather than track their movement.
Note also that the process of decision-making for our vehicle, called planning, is different but related to our prediction task.
There are three main components of the AV stack: Perception (What is around the car?), Prediction (What will happen next?), and Planning (What should the car do?).
Example situation: for the self-driving car to perform an unprotected left turn, it needs to know whether the oncoming vehicle will turn right or go straight and interfere with the AV’s left turn.

Motion Prediction 🤝 Lyft's Open L5 Dataset


Snapshot of the Level 5 Prediction Dataset, which contains 1,000 hours of driving collected by Lyft's AV fleet in Palo Alto, CA.
The prediction dataset registers the world around the car at a series of timestamps. Two concepts are central:
  1. A frame is a record of the vehicle itself at a given timestamp. It contains the vehicle's location and direction, as well as a list of all the agents and traffic lights detected around it at that moment.
  2. An agent is a movable entity in the world. Agents are labeled with a class (car, pedestrian, etc.) and their position information. Agents also have unique IDs, which are tracked between consecutive frames.
A common choice when working with AV data is to use Bird’s-Eye View (BEV) rasterization for the system’s input, which consists of top-down views of a scene. This simplifies building our models because the coordinate spaces of the input and output are the same.
Rasterization is the data preprocessing technique of creating images from other data sources, in this case the sensor data.
💡
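To make this concrete, here's a minimal sketch of how a scene can be loaded and rasterized with L5Kit. The config file name, environment variable value, and data key below are assumptions that follow the layout used in the public L5Kit tutorials; point them at your own copy of the dataset.

import os
from l5kit.configs import load_config_data
from l5kit.data import ChunkedDataset, LocalDataManager
from l5kit.dataset import EgoDataset
from l5kit.rasterization import build_rasterizer

# 🫵🏽 Assumed paths and config names -- adjust to your setup
os.environ["L5KIT_DATA_FOLDER"] = "./lyft-l5-data"
cfg = load_config_data("./visualisation_config.yaml")

# Open the zarr-backed scenes and build the BEV rasterizer described above
dm = LocalDataManager(None)
zarr_dataset = ChunkedDataset(dm.require(cfg["val_data_loader"]["key"])).open()
rasterizer = build_rasterizer(cfg, dm)

# EgoDataset yields one BEV raster + target trajectory per frame of the AV itself
dataset = EgoDataset(cfg, zarr_dataset, rasterizer)
sample = dataset[0]
print(sample["image"].shape)             # (channels, H, W): the BEV raster
print(sample["target_positions"].shape)  # (future_num_frames, 2): future XY targets

# Render the raster to RGB for quick EDA of a driving scene
rgb = rasterizer.to_rgb(sample["image"].transpose(1, 2, 0))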

Examples of driving scenes from the dataset, capturing the positions of other agents around the AV.

Input Data

For Semantic and Satellite View:
🍀 (Green) = Autonomous Vehicle (Ego)
🐷 (Pink) = Path of AV (Trajectory)
🔵 (Blue) = Other Agents (cars, bikes, etc.)
For Bokeh View:
🐙 (Red) = Ego
🔵 (Blue) = Other Agents
Click the icons in the Table below to get the best view of the various animations of our vehicle driving around. We provide not only GIFs 🎥 but also an interactive tool 🤳🏽 to scrub through a driving scene via Bokeh.
💡
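To show how an animation like the ones below can be produced and logged, here is a hedged sketch that renders a handful of frames with the rasterizer from the earlier loading snippet and logs them to W&B as a GIF. The frame indices, project name, and the uint8 cast are illustrative assumptions.

import numpy as np
import wandb # 🪄🐝

# 🫵🏽 Assumes `dataset` and `rasterizer` from the loading sketch above
frames = []
for idx in range(0, 100, 2):  # a subset of frames, purely illustrative
    sample = dataset[idx]
    rgb = rasterizer.to_rgb(sample["image"].transpose(1, 2, 0))  # (H, W, 3)
    rgb = (rgb * 255).astype(np.uint8) if rgb.dtype != np.uint8 else rgb  # defensive cast
    frames.append(rgb.transpose(2, 0, 1))  # wandb.Video expects channels-first frames

with wandb.init(project="l5kit-motion-prediction", job_type="eda") as run:
    run.log({"scene_animation": wandb.Video(np.stack(frames), fps=10, format="gif")})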

[Interactive W&B Table: columns scene_idx, semantic_view, satellite_view for 10 driving scenes, paginated 1 of 10]

How can we use deep learning to solve motion prediction?

First, we can create a baseline by adapting a standard CNN architecture (e.g., ResNet50) to our needs. We'll use ResNet50, as it is a common backbone for many computer vision tasks. While we can leave the central part of the network as is, we’ll need to change its input and output layers to match our setting.
To do this, we'll match the number of channels in the first convolutional layer to the number of channels produced by the BEV rasterizer. A three-channel convolutional layer is not enough, because different semantic information is rasterized into different layers; in our case, we must account for both the image raster and the movements of the car, which requires more than 3 channels.
Next, let's make sure the number of outputs matches our future prediction horizon multiplied by the number of elements per time step (XY displacements in the example below). For a horizon of 50 steps, we’ll need a total of 100 neurons in the last layer of our network.
Essentially, we input the image raster (3 channels) alongside the past XY positions of the AV, and the model outputs the predicted future XY positions. We can adjust how many previous steps are fed in alongside the image, giving a total input channel count of 3 + (2 * number_of_previous_steps), where the 3 comes from the channels of the input raster and the 2 comes from the X and Y positions. Similarly, the output size is set by how many steps into the future we want to predict: 2 * number_of_future_steps.
💡


The L5Kit Model in PyTorch



import torch # 🔦
from typing import Dict
from torch import nn
from torchvision.models.resnet import resnet50

# 🔦
def build_model(config: Dict) -> torch.nn.Module:
    # load pre-trained Conv2D model
    model = resnet50(pretrained=True)

    # change input channels number to match the rasterizer's output
    num_history_channels = (config["model_params"]["history_num_frames"] + 1) * 2
    num_in_channels = 3 + num_history_channels
    model.conv1 = nn.Conv2d(
        num_in_channels,
        model.conv1.out_channels,
        kernel_size=model.conv1.kernel_size,
        stride=model.conv1.stride,
        padding=model.conv1.padding,
        bias=False,
    )

    # change output size to (X, Y) * number of future states
    num_targets = 2 * config["model_params"]["future_num_frames"]
    model.fc = nn.Linear(in_features=2048, out_features=num_targets)

    return model
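To sanity-check the input and output shapes described above, here's a quick forward pass through this model with a dummy raster. The history and future lengths below are illustrative; use the values from your own config.

import torch

# 🫵🏽 Illustrative config values -- match these to your rasterizer settings
config = {"model_params": {"history_num_frames": 10, "future_num_frames": 50}}
model = build_model(config)

num_in_channels = 3 + (config["model_params"]["history_num_frames"] + 1) * 2  # 25 channels
dummy_raster = torch.randn(1, num_in_channels, 224, 224)

with torch.no_grad():
    out = model(dummy_raster)
print(out.shape)  # torch.Size([1, 100]) == 2 * future_num_frames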

Training an AV for Motion Prediction 🚖

Our training process will incorporate 5 steps to match what one may expect from a normal ML pipeline.

  1. Ingest and transform the data
  2. Perform EDA on the data, which allows us to understand the different driving scenes in a human-readable form
  3. Build and run multiple motion prediction models on the data
  4. Evaluate the trained candidate models on a validation dataset
  5. Based on the best evaluation metric, retrieve and promote the winning model to the W&B Model Registry (a sketch of this promotion step follows the registry view below)
In fact, below you can see details about the current best model in our Model Registry, ready for review:
Click through the different tabs! You can see every detail about the model with linking to all relevant experimentation steps 🤩
💡

[W&B Model Registry entry prediction-model, with a direct lineage view: model Artifact checkpoint_TorchTrainer_2289b1f2:v6, produced by Run TorchTrainer_2289b1f2 and consumed by the evaluate Run evaluate-latest-models]
You can explore the main operational dashboard for deeper analysis!
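For step 5 of the pipeline, here is a minimal sketch of how a winning checkpoint can be linked into the Model Registry with the W&B Python client. The project name, artifact name, and aliases are placeholders rather than the exact values used for this report.

import wandb # 🪄🐝

# 🫵🏽 Placeholder names -- substitute your own project, artifact, and registered model
with wandb.init(project="l5kit-motion-prediction", job_type="promote-model") as run:
    # Pull in the checkpoint artifact produced by the best training run
    best_ckpt = run.use_artifact("checkpoint_TorchTrainer_2289b1f2:v6", type="model")
    # Link it into the Model Registry collection and tag it for review
    run.link_artifact(best_ckpt, "model-registry/prediction-model", aliases=["best", "staging"])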


Maximizing Hardware Utilization with Ray



Modern workloads like deep learning and hyperparameter tuning are compute-intensive and require distributed or parallel execution. Ray makes it effortless to parallelize single-machine code: go from a single CPU to multi-core, multi-GPU, or multi-node with minimal code changes. In other words, Ray helps us effortlessly scale our most complex workloads.
Adding in Ray is as simple as:
import torch # 🔦
import multiprocessing # 🔎
from ray import train
from ray.air import session, Checkpoint # 💾
from ray.train.torch import TorchTrainer # 🔦💨⚡️
from ray.air.config import ScalingConfig

# 🔦+🔎 Get all your useful hardware information
USE_GPU = torch.cuda.is_available()
NUM_GPUS = torch.cuda.device_count()
NUM_CPUS = multiprocessing.cpu_count()

# 🧮 Do some quick math to efficiently spread the workload to each GPU
if USE_GPU:
    num_actors = NUM_GPUS
    num_data_workers = NUM_CPUS // num_actors
else:
    num_data_workers = 4 if NUM_CPUS >= 4 else NUM_CPUS
    ideal_num_actors = NUM_CPUS // num_data_workers
    num_actors = ideal_num_actors if ideal_num_actors else 1

# 🔦 Define the details of a training run
def train_model(config):
    train_dataloader = load_data(config) # 🫵🏽 You implement this
    # ⚡️ Prepares Data for DDP
    train_dataloader = train.torch.prepare_data_loader(train_dataloader)

    model, criterion, optimizer = build_model(config) # 🔦+🫵🏽 and this...
    # ⚡️ Prepares Model for DDP
    model = train.torch.prepare_model(model)

    steps_before_checkpointing = config.get("steps_before_checkpointing", 100)
    max_epochs = config.get("max_epochs", 100)
    for epoch in range(max_epochs):
        model.train()
        torch.set_grad_enabled(True)
        metrics, model_outputs = train_model_epoch(train_dataloader,
                                                   model, criterion,
                                                   optimizer) # 🫵🏽 AND this!
        # 🪄🐝 Report your training metrics to your logger of choice
        # 💾 This includes the model checkpoints serialized by Ray
        if (epoch % steps_before_checkpointing == 0) or (epoch == max_epochs - 1):
            session.report(
                metrics=metrics,
                checkpoint=Checkpoint.from_dict(dict(epoch=epoch, model=model)))
        else:
            session.report(
                metrics=metrics
            )

# 🔦💨⚡️ Ray deals with the rest of the hardware management for you SUPAFAST
trainer = TorchTrainer(
    train_loop_per_worker=train_model,
    scaling_config=ScalingConfig(num_workers=num_actors, use_gpu=USE_GPU),
)

The lines of code above will:
  • 🔦+🔎 Check the number of CPUs and GPUs available on your system, assuming all of them can be used for training
  • 🧮 Calculate the most efficient split of GPUs and CPUs for distributed training
  • 🔦 Create a training function that can run a full model training loop
    • ⚡️ The training run also uses Ray to properly apply DDP to our DataLoaders and model
    • 🪄🐝 We can tell the Ray session to report metrics and our model checkpoints to our favorite ML system of record (spoiler alert: Weights & Biases)
  • 🔦💨⚡️ Create a ray.train TorchTrainer with minimal additional code and reap the parallelized speed benefits
The best part? This workflow is really reusable (and easily so!).
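If you want to launch the distributed training job on its own before layering on tuning, it's one more line:

# 🏃🏽‍♂️ Run a single distributed training job and inspect its reported metrics
result = trainer.fit()
print(result.metrics)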
Want MLOps, early stopping, and hyperparameter optimization? Here we go:
from ray import tune
from ray.air.config import RunConfig
from ray.tune.tuner import Tuner # 📻♻️
from ray.air.callbacks.wandb import WandbLoggerCallback # 🪄🐝
from ray.tune.stopper import ExperimentPlateauStopper # 🛑
from ray.tune.search.optuna import OptunaSearch # 🕵🏽

project_name = "your_cool_project_name"
n_search_attempts = 25

# 🕵🏽 Define your hyperparameter optimization search strategy
# You can define which config values we want to search over and how
config["max_epochs"] = tune.qrandint(1000, 5000, 250)  # integer-valued search space
optuna_search = OptunaSearch()

# 📻♻️ Define details of the ray.tune job by providing:
tuner = Tuner(
    trainer, # 🔦💨⚡️ The trainer from above...
    tune_config=tune.TuneConfig(
        metric="your_important_metric",
        mode="min",
        search_alg=optuna_search, # 🕵🏽 How we want to search.
        num_samples=n_search_attempts,
    ),
    param_space={
        "train_loop_config": config # 🔦💨⚡️ ...with the config we want to pass to the trainer!
    },
    run_config=RunConfig(
        # 🛑 How to stop the experiment based on a provided metric.
        stop=ExperimentPlateauStopper("another_important_metric"),
        # 🪄🐝 Your ML System of Record (Tracked Experiments & Checkpoint Artifacts).
        callbacks=[WandbLoggerCallback(project=f"{project_name}-trials",
                                       save_checkpoints=True),]))
# 🏃🏽‍♂️ Run the tune job with
# - Distributed Training
# - Hyperparameter Search
# - Experiment Stopping
# - MLOps
analysis = tuner.fit()
  • 🕵🏽 We can utilize any of the built-in search strategies for hyperparameter optimization defined by ray.tune
    • This assumes we prepare the values we want to search over in the proper format within the config passed to ray.tune (and, more precisely, to the ray.train TorchTrainer)
  • 📻♻️ Create a ray.tune Tuner that is given the details of how it should tune the experiments and under what circumstances we monitor each of these runs
    • 🔦💨⚡️ The TorchTrainer receives the config provided to the Tuner, with the hyperparameters automatically adjusted by ray.tune and useful run configurations like callbacks managed automatically
    • 🕵🏽 The hyperparameter search agent is Optuna with its default sampler, TPESampler (Tree-structured Parzen Estimator); see the sketch after this list for configuring it explicitly
      • On each trial, for each parameter, TPE fits one Gaussian Mixture Model (GMM) l(x) to the set of parameter values associated with the best objective values, and another GMM g(x) to the remaining parameter values. It then chooses the parameter value x that maximizes the ratio l(x)/g(x).
    • 🛑 The early stopping criterion, ExperimentPlateauStopper, stops the entire experiment when the metric has plateaued for more than the number of iterations specified in the patience parameter.
    • 🪄🐝 The WandbLoggerCallback provided by ray.air is the single one-line addition needed to quickly add experiment tracking and model artifact logging, all organized in one ML system of record
  • 🏃🏽‍♂️ Running hundreds of optimized experiments on optimized hardware then becomes as simple as tuner.fit()!
  • 🫵🏽 <Broken Record>: The best part? This workflow is really reusable. Like, really easily.
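As mentioned above, here is a small sketch of pinning the Optuna sampler explicitly instead of relying on the default. The sampler settings are illustrative, and the sampler argument of OptunaSearch should be checked against the Ray and Optuna versions you're running.

import optuna # 🕵🏽
from ray.tune.search.optuna import OptunaSearch

# 🫵🏽 Illustrative sampler settings -- TPESampler is already Optuna's default
tpe_sampler = optuna.samplers.TPESampler(
    n_startup_trials=10,  # random exploration before TPE kicks in
    seed=42,              # reproducible suggestions
)
optuna_search = OptunaSearch(sampler=tpe_sampler)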

Training Outputs!

Training


[W&B run set: 56 training runs, shown in a parallel coordinates plot of hyperparameters vs. avg_loss]

Key insights
  • In the parallel coordinates plot above, we see that a majority of our models cluster around a low avg_loss. This is what we want! We use the hyperparameter optimization tool Optuna, which iteratively chooses the next best set of hyperparameters to try.
  • Lower learning rates (<0.0009) combined with a higher number of training steps yield lower average loss.
  • Higher batch sizes in combination with lower learning rates yield lower average loss.
  • There is very little correlation between the type of raster (semantic vs. satellite) and the average loss.

Evaluation

🦋 (Cyan) = Predicted Motion Path of AV from the trained model
If the 🐷 (Pink) path of the AV (trajectory) does not match the 🦋 (Cyan) predicted motion path from the trained model, then our model is performing poorly.
Looking at the animations presented in our Table, we can directly see how our model would have directed our AV in a real-world setting on a map. Our model tends to crash a lot... but at least with Ray and W&B, the code to train a better model won't 🫰🏽
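To turn that visual comparison into a number, a common pair of metrics is Average and Final Displacement Error (ADE/FDE) between the predicted and ground-truth XY paths. This is a generic NumPy sketch rather than the exact evaluation code behind the runs in this report (L5Kit also ships its own metrics in l5kit.evaluation).

import numpy as np

def displacement_errors(pred_xy: np.ndarray, true_xy: np.ndarray):
    """ADE/FDE for one scene, given arrays of shape (future_num_frames, 2)."""
    # Per-timestep Euclidean distance between predicted and ground-truth positions
    dists = np.linalg.norm(pred_xy - true_xy, axis=-1)
    return dists.mean(), dists[-1]  # average over the horizon, error at the final step

# 🫵🏽 Toy example: a 50-step straight-line path with a constant 0.5 m lateral offset
true_path = np.stack([np.linspace(0, 25, 50), np.zeros(50)], axis=-1)
pred_path = true_path + np.array([0.0, 0.5])
print(displacement_errors(pred_path, true_path))  # (0.5, 0.5)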

[Interactive W&B Table: columns semantic_predictions, satellite_predictions, semantic_animation, satellite_animation, run_url, run_name for the evaluation scenes]

Conclusion

Looking at the results of our trained model, I would say this model probably wouldn't make it to production, given some of our hilariously blatant errors 🤷🏽‍♂️. With W&B, not only were we able to easily find these errors, but we also have a running record of every single step that may have led to them, with insight from every member of our team.
