Video to 3D: Depth Perception for Self-Driving Cars
Unsupervised learning of depth perception from dashboard cameras.
Overview
Predict relative distance and motion from unlabeled video
This report explores Tinghui Zhou et al.'s Unsupervised Learning of Depth and Ego-Motion from Video (CVPR 2017). We show how to train and test the provided models, visualize and analyze the results, and explore new variants with the help of Weights & Biases.
Depth perception from a single photo

Model
This is an unsupervised framework for learning depth perception from a single camera feed, such as unstructured dashboard-camera driving video. Two networks are trained in parallel: one predicts depth from a single frame, and one predicts the current camera view from several frames (e.g., with a sequence length of three frames, predicting the contents of the current frame from the preceding and following frames).
During training, this view synthesis serves as the supervisory signal. At testing time, the networks can be decoupled so the model can predict depth from a single image. Here is an illustration of the process:

Run set (35 runs)
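To make the view-synthesis supervision more concrete, here is a minimal NumPy sketch of the core projection step: a pixel in the target frame is back-projected using its predicted depth, moved into the source camera's frame using the predicted relative pose, and re-projected onto the source image; the photometric difference between the sampled source pixel and the target pixel is what the networks are trained to minimize. The function names and the nearest-neighbor sampling are my own simplifications; the actual model uses differentiable bilinear sampling so gradients can flow into both networks.

```python
import numpy as np

def project_to_source(u, v, depth, K, T_target_to_source):
    """Reproject target-frame pixel (u, v) with predicted depth into the source frame.

    K is the 3x3 camera intrinsics matrix; T_target_to_source is a 4x4 rigid
    transform built from the predicted relative camera pose.
    """
    # Back-project the pixel to a 3D point in the target camera's coordinates.
    p_target = depth * np.linalg.inv(K) @ np.array([u, v, 1.0])
    # Move the point into the source camera's coordinates (rotation + translation).
    p_source = T_target_to_source[:3, :3] @ p_target + T_target_to_source[:3, 3]
    # Project back onto the source image plane.
    uvw = K @ p_source
    if uvw[2] <= 1e-6:  # point ended up behind the source camera
        return None
    return uvw[0] / uvw[2], uvw[1] / uvw[2]

def photometric_error(target_img, source_img, depth_map, K, T):
    """Mean absolute photometric error between the target frame and the source
    frame warped by the predicted depth and pose (grayscale images,
    nearest-neighbor sampling for simplicity)."""
    h, w = depth_map.shape
    errors = []
    for v in range(h):
        for u in range(w):
            projected = project_to_source(u, v, depth_map[v, u], K, T)
            if projected is None:
                continue
            u_s, v_s = int(round(projected[0])), int(round(projected[1]))
            if 0 <= u_s < w and 0 <= v_s < h:  # ignore pixels that leave the source view
                errors.append(abs(float(target_img[v, u]) - float(source_img[v_s, u_s])))
    return float(np.mean(errors)) if errors else 0.0
```

In training, this error is minimized jointly with respect to both the depth network and the pose network, which is what lets the model learn depth without any ground-truth labels.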
Visualizing depth predictions: More training sharpens details
I compare three snapshots of the model, all sharing the baseline architecture and hyperparameters, trained on KITTI for varying durations:
- 120K iterations
- 190K iterations (baseline): restored from the checkpoint provided in the repo rather than trained here
- 230K iterations
As we train the model for more iterations, the depth map gains detail. For a given validation frame in the leftmost column, you can see the three models' predictions; all share the same baseline architecture and are saved at different points during training.
The 190K (baseline) model is downloaded directly from the original repo.
The 120K-iteration model loosely captures the foreground and background. The baseline does pretty well but misses some details. At 230K iterations, in the rightmost column, you can see crisper details, down to individual tree trunks and poles.
The model can handle complex lighting conditions (a mix of bright light and dark shadows) and even silhouette individual cars parked in a row. You can scroll in both panels independently to compare across a row.
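If you want to reproduce this kind of side-by-side comparison, one approach is to log each checkpoint's predictions on the same validation frames as image panels. This is only a sketch: the project and run names are placeholders, and `predict_depth`, `checkpoints`, and `validation_frames` are hypothetical helpers/collections standing in for the repo's single-image depth network and your own data loading.

```python
import wandb

run = wandb.init(project="sfmlearner-depth", name="checkpoint-comparison")

# checkpoints: e.g. {120_000: model_120k, 190_000: model_190k, 230_000: model_230k}
for step, model in sorted(checkpoints.items()):
    panels = []
    for frame in validation_frames:
        depth_map = predict_depth(model, frame)  # hypothetical inference wrapper
        panels.append(wandb.Image(depth_map, caption=f"{step} iterations"))
    # Log all validation frames for this checkpoint under one key.
    wandb.log({"depth_predictions": panels}, step=step)

run.finish()
```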
Same architecture, increasing training iterations (3 runs)
Useful visualization types
TensorBoard is fully integrated with W&B
If you right-click a model name to open a training run in a new window, you can click the TensorFlow icon in the left sidebar to load all of the TensorBoard logs and plots for that run. These persist in the cloud alongside your experiments when you pass sync_tensorboard=True to wandb.init(). Note that if the TensorBoard view hasn't been loaded in a while, you may need to hit the refresh arrow in the top right corner to load all the data.
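In the training script, the integration is essentially a single call before TensorFlow's summary writer is created (the project and run names below are placeholders):

```python
import wandb

# Initialize W&B before the TensorFlow summary writer is created so the
# TensorBoard event files are mirrored to the cloud alongside the run.
wandb.init(project="sfmlearner", name="baseline-190k", sync_tensorboard=True)

# ...the rest of the SfMLearner training loop writes tf.summary events as usual;
# W&B picks them up automatically.
```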
Visual evidence for explainability, view synthesis, and projection error
The original repository logs several image types to facilitate model explainability and evaluation:
- the source and target images for pose estimation/view synthesis
- the projected image and the error in the projection for view synthesis
- the explainability mask and the image disparity
These are logged at each scale (how closely we zoom into the training image) and automatically stored in W&B. Some examples are below. You can see these for a given training run by opening it in a new window and navigating to the "MEDIA" tab. Here is the relevant workspace for an early run reproducing the baseline model.
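Because these images flow through TensorBoard summaries, they reach W&B automatically via the sync above, but the same visuals could also be logged directly. A minimal sketch, where the per-scale arrays are placeholders for the tensors the training loop already evaluates:

```python
import wandb

# tgt_image, proj_image, proj_error, exp_mask, and disparity are placeholder
# lists indexed by scale, standing in for the tensors the repo already computes.
for scale in range(4):
    wandb.log({
        f"scale_{scale}/target": wandb.Image(tgt_image[scale]),
        f"scale_{scale}/projected": wandb.Image(proj_image[scale]),
        f"scale_{scale}/projection_error": wandb.Image(proj_error[scale]),
        f"scale_{scale}/explainability_mask": wandb.Image(exp_mask[scale]),
        f"scale_{scale}/disparity": wandb.Image(disparity[scale]),
    })
```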


Training loss comparisons
Adjusting relative importance of loss components
The charts below show the total loss and the three component losses in more detail.
The original model from the paper is shown in red as "baseline" (with a burgundy "baseline 2" showing a second run with identical settings to give a sense of the stability/variance of the results). The total training loss for this model is a weighted sum of
- the view synthesis objective (pixel loss, top right)
- depth smoothness loss (smooth_loss, bottom left)
- explainability regularization loss (exp_loss, bottom right)
The smoothness and explainability losses are weighted via the "smooth_weight" and "explain_reg_weight" constants. I explored a few model variants by changing these relative weights, along with the learning rate and the number of depth scales (the number of zoom levels at which the model processes frames for pose estimation).
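As a rough sketch of how these pieces combine (following the repo's flag names, though the exact per-scale weighting in the code may differ slightly):

```python
def total_loss(pixel_loss, smooth_loss, exp_loss,
               smooth_weight, explain_reg_weight, num_scales=4):
    """Illustrative composition of the total training loss, summed over scales.

    pixel_loss, smooth_loss, and exp_loss are per-scale component losses;
    smooth_weight and explain_reg_weight mirror the repo's hyperparameter flags.
    """
    total = 0.0
    for s in range(num_scales):
        total += pixel_loss[s]                                # view synthesis objective
        total += (smooth_weight / (2 ** s)) * smooth_loss[s]  # depth smoothness, downweighted at coarser scales
        total += explain_reg_weight * exp_loss[s]             # explainability regularization
    return total
```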
Total loss seems to decrease in the following experiments:
- increasing the weight of the explainability regularization loss (set to zero in the baseline model)
- increasing the weight of the smoothness loss
- increasing the learning rate
Notes
- Charts are zoomed into the most relevant regions. For example, the bottom left "smooth_loss" zooms in on the first 400 steps and shows that setting smooth_weight around 0.5 results in less stable initial training with a spike around step 100.
- The view synthesis objective chart shows all the pixel_losses smoothed, sorted, and stacked rather than plotted on top of each other. This makes the relative ordering and size of pixel_loss across experiments clearer. The increased spikiness is an artifact of this plot style: each successive loss curve is at least as noisy as the layer below it.
- The final value for total loss sometimes increases slightly as the model trains past 100K to 300K iterations, likely because of the inherent noisiness of the loss. Perhaps early stopping or interleaving validation stages with visual confirmation would help.
- I noticed that the number of trainable parameters in the model didn't change when I tried to change the number of depth scales from 4 to 3 or 5; it turns out this hyperparameter is fixed to 4 deeper in the code.
Model variants (8 runs)
More examples: Follow the cars as they enter and exit the view
Same architecture, increasing training iterations (3 runs)
Visualizing depth: Edge cases
The model performs remarkably well in complicated lighting conditions.
Overall, training for longer seems to improve the model and add to the level of detail, although some false positives worsen (such as perceiving certain textures/shapes as dark/far away "holes"). Scroll through the two panels side-by-side to see more examples.
Interesting edge cases
- first row: glare from car perceived as a nearby vertical object
- first row, third row: asphalt texture and potholes perceived as "holes" or very far away patches of background instead of foreground
- sixth row: patch of bright sun perceived as "hole"
- third to last row: occluding bus is basically invisible
- last row: the strange shape of the car, perhaps inferred from its long shadow?
Same architecture, increasing training iterations (2 runs)
Future directions
Related Reading
The Semantic KITTI Dataset
Semantic-KITTI is a large semantic segmentation and scene understanding dataset developed for LiDAR-based autonomous driving. But what is it, and what is it for?
The Many Datasets of Autonomous Driving
Below we'll explore the datasets used to train autonomous driving systems to perform the various tasks required of them.
Semantic Segmentation: The View from the Driver's Seat
This article explores semantic segmentation for scene parsing on Berkeley Deep Drive 100K (BDD100K) including how to distinguish people from vehicles.
A System of Record for Autonomous Driving Machine Learning Models
A look at the most useful Weights & Biases features for autonomous vehicle companies