
The Waymo Open Dataset

The Waymo Open Dataset is a perception and motion planning video dataset for self-driving cars. It’s composed of the perception and motion planning datasets.

What Is the Waymo Open Dataset?

The Waymo Open Dataset is a perception and motion planning video dataset for self-driving cars that comprises high-resolution sensor data. It is composed of two separate datasets: the perception dataset and the motion planning dataset.
The perception dataset consists of 1,150 scenes, each spanning 20 seconds of well-synchronized and calibrated, high-quality LiDAR and camera data captured across a range of urban and suburban geographies. It includes high-resolution sensor data and labels for 2,030 segments, along with key point labels, 2D-to-3D association labels, 3D semantic segmentation labels, and 2D video panoptic segmentation labels.
The motion planning dataset contains object trajectories and corresponding 3D maps for 103,354 segments. With over 100,000 scenes, each 20 seconds long at 10 Hz, the dataset contains more than 570 hours of unique data over 1,750 km of roadways and captures interesting relationships between vehicles, pedestrians, and cyclists across different geographic conditions.

General Info About the Waymo Open Dataset

Dataset Structure

The dataset contains two separate datasets for Motion Prediction and Perception.
Motion Prediction: The motion dataset is provided as sharded TFRecord files containing protocol buffer data. It is composed of 103,354 segments, each containing 20 seconds of object tracks at 10 Hz plus map data for the area covered by the segment. These segments are further broken into 9-second windows (1 second of history and 8 seconds of future data) with 5 seconds of overlap.
Each record in the dataset contains the following fields:
• scenario_id - A unique string identifier for this scenario.
• timestamps_seconds - Repeated field containing timestamps for each step in the Scenario, starting at zero.
• tracks - Repeated field containing tracks for each object.
  • id - A unique numeric ID for each object.
  • object_type - The type of object for this track (vehicle, pedestrian, or cyclist).
  • states - Repeated field containing the state of the object at each time step: its 3D position, velocity, heading, dimensions, and a valid flag.
• dynamic_map_states - Repeated field containing traffic signal states across time steps, such that dynamic_map_states[i] occurs at timestamps_seconds[i].
  • lane_states - Repeated field containing the set of traffic signal states and the IDs of the lanes they control (indices into the map_features field) for a given time step.
• map_features - Repeated field containing the set of map data for the scenario. This includes lane centers, lane boundaries, road boundaries, crosswalks, speed bumps, and stop signs. Map features are defined as 3D polylines or polygons; see the map proto definitions for full details.
• sdc_track_index - The track index of the autonomous vehicle in the scene.
• objects_of_interest - Repeated field containing indices into the tracks field of objects determined to have behavior that may be useful for research training.
• tracks_to_predict - Repeated field containing a set of indices into the tracks field indicating which objects must be predicted. This field is provided in the training and validation sets only. These are selected to include interesting behavior and a balance of object types.
• current_time_index - The index into timestamps_seconds for the current time. All steps before this index are history data and all steps after it are future data. Predictions are to be made at the current time.

More details related to the motion prediction dataset can be found on Waymo's website.
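To make the record structure above concrete, here is a minimal sketch of reading one shard of the motion dataset with TensorFlow and the Scenario proto from the waymo-open-dataset package. The shard filename is a hypothetical placeholder for wherever you downloaded the data:

```python
# Minimal sketch: iterate over one motion-dataset shard and inspect the fields
# described above. Assumes the `waymo-open-dataset` pip package is installed
# and that SHARD_PATH points to a locally downloaded file (hypothetical path).
import tensorflow as tf
from waymo_open_dataset.protos import scenario_pb2

SHARD_PATH = "motion/training/training.tfrecord-00000-of-01000"  # hypothetical

dataset = tf.data.TFRecordDataset(SHARD_PATH)
for raw_record in dataset.take(1):
    scenario = scenario_pb2.Scenario.FromString(raw_record.numpy())

    print("scenario_id:", scenario.scenario_id)
    print("steps:", len(scenario.timestamps_seconds))
    print("tracks:", len(scenario.tracks))
    print("map features:", len(scenario.map_features))

    # The autonomous vehicle's own track and its state at the current step.
    sdc_track = scenario.tracks[scenario.sdc_track_index]
    state = sdc_track.states[scenario.current_time_index]
    if state.valid:
        print("SDC position:", state.center_x, state.center_y, state.center_z)
```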
Perception: The perception dataset contains images, videos, and LiDAR data with annotations for 3D Bounding Boxes, 2D Bounding Boxes, Key Points, 2D-to-3D Correspondence, 3D Semantic Segmentation, and 2D Video Panoptic Segmentation. The following objects have 3D labels: vehicles, pedestrians, cyclists, and signs.
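As with the motion dataset, each perception segment is a TFRecord of serialized protos, one Frame per timestep. A minimal sketch of opening one segment and inspecting a single frame, again assuming the waymo-open-dataset package and a locally downloaded segment file (the filename below is a placeholder):

```python
# Minimal sketch: read one perception-dataset segment and print some basic
# contents of its first frame. SEGMENT_PATH is a hypothetical local path.
import tensorflow as tf
from waymo_open_dataset import dataset_pb2

SEGMENT_PATH = "perception/training/segment-XXXX_with_camera_labels.tfrecord"  # hypothetical

dataset = tf.data.TFRecordDataset(SEGMENT_PATH, compression_type="")
for raw_record in dataset.take(1):
    frame = dataset_pb2.Frame()
    frame.ParseFromString(raw_record.numpy())

    print("segment:", frame.context.name)
    print("camera images in this frame:", len(frame.images))
    print("3D (laser) labels:", len(frame.laser_labels))
    print("cameras with 2D labels:", len(frame.camera_labels))
```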

Supported Tasks of the Waymo Open Dataset

Here are the tasks supported by the Waymo Open Dataset:

Motion Prediction

In Motion Prediction, we are given the 1-second history of all agents on a corresponding map and are asked to predict the positions of up to 8 agents for 8 seconds into the future. The predictions are evaluated using soft mean average precision (Soft mAP).
All metrics are computed by first bucketing objects by type and then computing the metrics per type. The metrics for each object type (Minimum Average Displacement Error, Minimum Final Displacement Error, Miss Rate, Overlap Rate, and mAP) are all computed at the 3-, 5-, and 8-second timestamps. Further details related to the task can be found here.
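As a rough illustration of the displacement metrics, here is a small NumPy sketch of minimum ADE and minimum FDE. The array shapes (K candidate trajectories over T future steps, x/y only) are assumptions for illustration, not the challenge's exact evaluation code:

```python
# Sketch of the displacement metrics named above. preds has shape (K, T, 2):
# K candidate trajectories over T future steps; gt has shape (T, 2).
import numpy as np

def min_ade(preds: np.ndarray, gt: np.ndarray) -> float:
    """Minimum Average Displacement Error over K predicted trajectories."""
    dists = np.linalg.norm(preds - gt[None], axis=-1)  # (K, T) L2 distances
    return float(dists.mean(axis=1).min())

def min_fde(preds: np.ndarray, gt: np.ndarray) -> float:
    """Minimum Final Displacement Error (last step only)."""
    final_dists = np.linalg.norm(preds[:, -1] - gt[-1], axis=-1)  # (K,)
    return float(final_dists.min())

# Example: 6 candidates, 80 future steps (8 s at 10 Hz).
preds = np.random.randn(6, 80, 2)
gt = np.random.randn(80, 2)
print(min_ade(preds, gt), min_fde(preds, gt))
```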

Interaction Prediction

Interaction Prediction is the task of forecasting the interactions among agents. Interaction Prediction is crucial for forecasting and anticipating the behavior of surrounding agents and understanding the context of a given scene.
In the Waymo Open Dataset, we are given 1-second tracks and asked to predict the joint future positions of pairs of interacting agents for 8 seconds into the future. 

Occupancy and Flow Prediction

Occupancy and Flow Prediction is a novel and effective representation for motion prediction. It consists of future occupancy grid maps warped by backward motion flow, forming a spatio-temporal set of grids accompanied by the corresponding flow fields.
Predicting occupancy flow fields captures rich, uncertainty-aware distributions over traffic participants' future motion while maintaining traceability for every participant through the predicted flow. We are again given the one-second history of a number of agents in a scene and tasked with predicting the future occupancy and flow (motion) of vehicles only, over 8 seconds into the future. All predictions are dense grids in bird's-eye view (BEV).
The Waymo Open Dataset contains data for the following three connected sub-tasks:
  • Predict future occupancy of all vehicles that are present at the current timestep t, for 8 seconds into the future.
  • Predict future occupancy of all vehicles that are not present at the current timestep t, for 8 seconds into the future.
  • Predict future flow of all vehicles, observed or occluded in the current timestep t, for 8 seconds into the future.
Find more details on this here.
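To make the grid representation concrete, here is a toy sketch that rasterizes agent positions into a BEV occupancy grid. The grid size, resolution, and input format are illustrative assumptions rather than the challenge's specification:

```python
# Illustrative sketch of a BEV occupancy grid: mark the cells containing
# agent center positions. Grid size and resolution are made-up defaults.
import numpy as np

def rasterize_occupancy(xy: np.ndarray,
                        grid_size: int = 256,
                        meters_per_pixel: float = 0.4) -> np.ndarray:
    """xy: (N, 2) agent positions in meters, ego-centered. Returns an (H, W) grid."""
    grid = np.zeros((grid_size, grid_size), dtype=np.float32)
    half = grid_size // 2
    cols = (xy[:, 0] / meters_per_pixel + half).astype(int)
    rows = (xy[:, 1] / meters_per_pixel + half).astype(int)
    inside = (rows >= 0) & (rows < grid_size) & (cols >= 0) & (cols < grid_size)
    grid[rows[inside], cols[inside]] = 1.0
    return grid

# One grid per future timestep gives the spatio-temporal grid set described
# above; a flow field would add an (H, W, 2) backward-motion vector per cell.
print(rasterize_occupancy(np.array([[1.0, -2.0], [30.0, 5.0]])).sum())
```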

3D Semantic Segmentation

3D semantic segmentation is a fundamental machine learning task for a range of applications, including autonomous vehicles. It aims to assign a semantic class to every point in a scene, delineating homogeneous objects from LiDAR point clouds or other 3D sensor data.
In the Waymo Open Dataset, we are given one or more LiDAR range images and the associated camera images and tasked with producing a semantic class label for each LiDAR point.
The following 23 classes are provided in the annotations:
1: Car
2: Truck
3: Bus
4: Motorcyclist
5: Bicyclist
6: Pedestrian
7: Sign
8: Traffic Light
9: Pole
10: Construction Cone
11: Bicycle
12: Motorcycle
13: Building
14: Vegetation
15: Tree Trunk
16: Curb
17: Road
18: Lane Marker
19: Walkable
20: Sidewalk
21: Other Ground
22: Other Vehicle
23: Undefined
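Per-point predictions like these are commonly scored with per-class intersection-over-union (IoU) and its mean over classes (mIoU). A minimal NumPy sketch, assuming integer class labels per point:

```python
# Minimal sketch of per-class IoU / mIoU for per-point semantic labels,
# assuming `pred` and `gt` are integer arrays of shape (num_points,).
import numpy as np

def per_class_iou(pred: np.ndarray, gt: np.ndarray, num_classes: int) -> np.ndarray:
    ious = np.full(num_classes, np.nan)
    for c in range(num_classes):
        inter = np.sum((pred == c) & (gt == c))
        union = np.sum((pred == c) | (gt == c))
        if union > 0:
            ious[c] = inter / union
    return ious

pred = np.random.randint(0, 23, size=10_000)
gt = np.random.randint(0, 23, size=10_000)
ious = per_class_iou(pred, gt, num_classes=23)
print("mIoU:", np.nanmean(ious))  # NaN classes (never present) are ignored
```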


3D Object Detection

This variant of the 3D object detection task targets cases where no LiDAR data is available. We are given one or more images of a scene from multiple cameras and tasked with predicting 3D bounding boxes for objects in the scene.
The bounding boxes are available for calibrated camera images for each frame of a scene. We are free to use all previous frames when predicting the bounding boxes for a given frame.

Real-Time 3D Detection

Real-Time 3D Detection is a variant of Object Detection which focuses on lowering the latency of models in Autonomous Vehicles. We can make use of LiDAR data and camera images to predict 3D bounding boxes for objects in a given scene.
However, to qualify for the real-time prediction challenge, a model must run at under 70 ms per frame on an Nvidia Tesla V100 GPU.
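As a rough illustration of checking such a latency budget, here is a toy timing loop. The model and inputs are stand-ins, and this is not the challenge's official measurement procedure:

```python
# Illustrative sketch: time a model's forward pass over many frames and
# report the average latency in ms/frame. The model and frames are dummies.
import time
import numpy as np

def average_latency_ms(predict_fn, frames, warmup: int = 10) -> float:
    for frame in frames[:warmup]:          # warm-up runs are excluded
        predict_fn(frame)
    start = time.perf_counter()
    for frame in frames[warmup:]:
        predict_fn(frame)
    elapsed = time.perf_counter() - start
    return 1000.0 * elapsed / max(len(frames) - warmup, 1)

dummy_frames = [np.zeros((64, 2650), dtype=np.float32) for _ in range(50)]
dummy_predict = lambda frame: frame.mean()  # stand-in for a 3D detector
latency = average_latency_ms(dummy_predict, dummy_frames)
print(f"{latency:.2f} ms/frame; within budget: {latency < 70.0}")
```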

3D Tracking

3D Tracking is a multi-object tracking task. In this task, we are given temporal sequences of LiDAR and camera data and are asked to produce 3D bounding boxes for objects and their relationships across the frames of a particular scene.
Specific annotations are provided for vehicles, pedestrians, and cyclists, while other objects are labeled as all_ns. Multi-Object Tracking Accuracy (MOTA) and Multi-Object Tracking Precision (MOTP) are among the metrics used to evaluate models on this task.
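For reference, MOTA combines misses, false positives, and identity switches into a single score. A minimal sketch following the standard CLEAR MOT definition, with made-up per-frame counts:

```python
# Sketch of Multi-Object Tracking Accuracy (MOTA) from per-frame error counts,
# following the standard CLEAR MOT definition. The counts are illustrative.
def mota(false_negatives, false_positives, id_switches, num_gt_objects):
    errors = sum(false_negatives) + sum(false_positives) + sum(id_switches)
    return 1.0 - errors / float(sum(num_gt_objects))

# Three frames of hypothetical counts:
print(mota(false_negatives=[1, 0, 2],
           false_positives=[0, 1, 0],
           id_switches=[0, 0, 1],
           num_gt_objects=[10, 12, 11]))
```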

2D Detection

2D Detection is perhaps one of the most canonical machine learning tasks in Autonomous Driving. In this task, we are given a set of 2D camera images and are tasked with predicting the bounding boxes for different objects in the scene.
Annotations are available for the following object types:
1: Vehicle
2: Pedestrian
3: Cyclist
4: Sign
You can evaluate your models on this dataset by submitting them to the leaderboard, where they are evaluated on Average Precision.
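Average Precision for detection rests on matching predicted boxes to ground truth by intersection-over-union (IoU). A minimal sketch of that matching test, with the 0.7 threshold shown as an example value:

```python
# Minimal sketch of the IoU test behind Average Precision for 2D detection.
# Boxes are (x_min, y_min, x_max, y_max); a prediction counts as a true
# positive if its IoU with a same-class ground-truth box exceeds a threshold.
def box_iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

pred = (100, 100, 200, 220)
gt = (110, 105, 205, 230)
print(box_iou(pred, gt), box_iou(pred, gt) >= 0.7)  # example threshold
```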

2D Tracking

In 2D Tracking, we are given a temporal sequence of camera images and tasked with predicting 2D bounding boxes for objects in a scene. Again, annotations are available for the Vehicle, Pedestrian, Cyclist, and Sign classes, as in the task above.
The leaderboard for the challenge can be found here.


