Overview

Predict relative distance and motion from unlabeled video

This report explores Tinghui Zhou et al.'s work on Unsupervised Learning of Depth and Ego-Motion from Video from CVPR 2017. We show how to train and test the provided models, visualize and analyze the results, and explore new variants with the help of Weights & Biases.

Depth perception from a single photo

prediction sample

Model

This is an unsupervised framework for learning depth perception from a single camera feed, e.g. unstructured car dashboard driving video. Two networks are trained in parallel: one to predict depth from a single frame, and one to predict the current camera view from several frames (e.g. with a sequence length of three frames, predict the contents of the current frame from the preceding frame and the following frame). During training, this view synthesis serves as the supervisory signal. At test time, the networks can be decoupled so the model can predict depth from a single image. Here is an illustration of the process:

model structure
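
To make the supervisory signal concrete, here is a minimal sketch of one training step in plain Python. All of the names (depth_net, pose_net, warp_fn) are hypothetical stand-ins; the actual SfMLearner implementation is in TensorFlow and differs in detail:

```python
import numpy as np

def photometric_loss(synthesized, target):
    # L1 photometric error between the warped source view and the target frame.
    return np.abs(synthesized - target).mean()

def training_step(prev_f, target_f, next_f, depth_net, pose_net, warp_fn):
    """One step of view-synthesis supervision over a 3-frame sequence.

    depth_net: predicts per-pixel depth from the single target frame.
    pose_net:  predicts relative camera motion (ego-motion) between two frames.
    warp_fn:   differentiably reprojects a source frame into the target view,
               given the predicted depth and relative pose.
    """
    depth = depth_net(target_f)
    loss = 0.0
    for source in (prev_f, next_f):
        pose = pose_net(source, target_f)
        synthesized = warp_fn(source, depth, pose)
        loss += photometric_loss(synthesized, target_f)
    # Minimizing this loss trains both networks jointly; at test time,
    # depth_net can be used on its own to predict depth from one image.
    return loss
```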

Relevant Datasets

Resources

Visualizing depth predictions: More training sharpens details

I compare three snapshots of the model, all sharing the baseline architecture and hyperparameters, trained on KITTI for varying durations: 120K, 190K (the released baseline), and 230K iterations.

As we train the baseline model for more iterations, the depth map increases in detail. For a given validation frame in the leftmost column, you can see three models' predictions. All of these models share the same baseline architecture and are saved at different points during training. The 190K (baseline) model is downloaded directly from the original repo.

At 120K iterations, the model loosely captures the foreground and background. The baseline does pretty well but misses some details. At 230K iterations, in the rightmost column, you can see crisper details, down to individual tree trunks and poles. The model can handle complex lighting conditions (a mix of bright light and dark shadows) and even silhouette individual cars parked in a row. You can scroll each panel independently to compare across a row.

Useful visualization types

TensorBoard is fully integrated with W&B

If you right-click on a model name to open a training run in a new window, you can click on the TensorFlow icon in the left sidebar to load all the TensorBoard logs and plots for that run. These persist in the cloud alongside your experiments when you pass sync_tensorboard=True to wandb.init. Note that if the TensorBoard view hasn't been loaded in a while, you may need to hit the refresh arrow in the top right corner to load all the data.
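
For example, assuming your training script already writes TensorBoard summaries, enabling the sync takes one extra argument at init (the project name here is just illustrative):

```python
import wandb

# Mirror all TensorBoard logs and plots to W&B for this run.
wandb.init(project="sfmlearner", sync_tensorboard=True)

# ...then launch training as usual; TensorBoard summaries sync automatically.
```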

Visual evidence for explainability, view synthesis, and projection error

The original repository logs several image types to facilitate model explainability and evaluation: the target frame, the synthesized (projected) views, the explainability masks, and the projection error maps.

These are logged for each scale (how closely we zoom into the training image) and automatically stored in W&B. Some examples are below. You can see these for a given training run by opening it in a new window and navigating to the "MEDIA" tab. [Here is the relevant workspace](https://app.wandb.ai/stacey/sfmlearner/runs/34pwfi1i?workspace=user-stacey) for an early run reproducing the baseline model.

different viz types part 2

Training loss comparisons

Adjusting relative importance of loss components

The charts below show the total loss and the three component losses in more detail. The original model from the paper is shown in red as "baseline" (with a burgundy "baseline 2" showing a second run with identical settings to give a sense of the stability/variance of the results). The total training loss for this model is a weighted sum of three terms: the view-synthesis (photometric) loss, the depth smoothness loss, and the explainability regularization loss.

The smoothness and explainability losses are weighted via "explain_reg_weight" and "smooth_weight" constants. I explored a few model variants by changing these relative weights, along with learning rate and the number of depth scales (number of zoom levels at which the model processes frames for pose estimation).
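
For reference, the paper's total objective, summed over image scales $l$ and source frames $s$, has roughly this form, with $\lambda_s$ and $\lambda_e$ corresponding to the smooth_weight and explain_reg_weight constants:

```latex
\mathcal{L}_{\text{final}} =
  \sum_{l} \left( \mathcal{L}_{vs}^{l}
    + \lambda_{s}\, \mathcal{L}_{\text{smooth}}^{l}
    + \lambda_{e} \sum_{s} \mathcal{L}_{\text{reg}}\!\left(\hat{E}_{s}^{l}\right) \right)
```

Here $\mathcal{L}_{vs}$ is the view-synthesis loss, $\mathcal{L}_{\text{smooth}}$ the depth smoothness loss, and $\mathcal{L}_{\text{reg}}$ the regularization that keeps the explainability masks $\hat{E}_{s}$ from collapsing to zero.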

Total loss seems to decrease in the following experiments:

Notes

More examples: Follow the cars as they enter and exit the view

Visualizing depth: Edge cases

The model performs remarkably well in complicated lighting conditions. Overall, training for longer seems to improve the model and add to the level of detail, although some false positives worsen (such as perceiving certain textures/shapes as dark/far away "holes"). Scroll through the two panels side-by-side to see more examples.

Interesting edge cases

Future directions

The SfMLearner repository provides convenient scripts for training, evaluation, and further exploration.
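
As a rough sketch, launching a training run from Python might look like the following. The script name and flags are based on the repo's README and may have changed, so treat them as assumptions and verify against the current interface:

```python
import subprocess

# Hypothetical invocation of the SfMLearner training script; paths are
# placeholders and flag names should be checked against the repo README.
subprocess.run(
    [
        "python", "train.py",
        "--dataset_dir=/path/to/formatted/kitti/",
        "--checkpoint_dir=./checkpoints/",
        "--img_width=416",
        "--img_height=128",
        "--batch_size=4",
    ],
    check=True,
)
```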

Potential next steps

bimodal

Automatically track and visualize histograms of every parameter and gradient. This is an example of a bimodal "rx" distribution.
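
With this TensorFlow codebase, the parameter and gradient histograms arrive via the TensorBoard sync described above; you can also log a histogram manually with the wandb.Histogram API. A minimal sketch with placeholder values:

```python
import numpy as np
import wandb

wandb.init(project="sfmlearner")

# Placeholder stand-in for a parameter distribution such as "rx";
# two Gaussians give a bimodal shape like the plot above.
rx_values = np.concatenate([
    np.random.normal(-1.0, 0.2, 500),
    np.random.normal(1.0, 0.2, 500),
])
wandb.log({"rx": wandb.Histogram(rx_values)})
```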

Baseline (190K): weak detection of biker and van

missing biker

50K more iterations: black holes

230K

Weak positive for distant car (L: 190K baseline, R: 230K iterations)

weak car

Failure to detect distant cars and nearby poles (230K iterations)

cars far away