
Two Shots to Green Screen: Collage With Deep Learning

In this article, we take a look at how to train a deep net to extract foreground and background in natural images and videos.


Take Two Photos or Videos, Substitute Any Background You Like

In Background Matting: The World is Your Green Screen, Sengupta et al. train a machine learning model to extract figures in the foreground of photos and videos and collage them onto new backgrounds. Traditional methods for this kind of "background matting" require a green screen or a handmade trimap to build the matte, a per-pixel annotation of foreground color and alpha (opacity).
This new model requires two versions of the source photo or video: one with the person/subject in the foreground, and one without, showing just the background. Below I show some examples of how this works and how Weights & Biases can help analyze results and compare different models on this task.
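Under the hood, the matte is a per-pixel alpha (opacity) that drives the standard compositing equation I = alpha * F + (1 - alpha) * B. Here is a minimal NumPy sketch of that final compositing step, with hypothetical file names:

```python
import numpy as np
from PIL import Image

# Hypothetical file names; any same-size RGB images plus a grayscale matte will do.
foreground = np.asarray(Image.open("foreground.png").convert("RGB"), dtype=np.float32) / 255.0
background = np.asarray(Image.open("new_background.png").convert("RGB"), dtype=np.float32) / 255.0
alpha = np.asarray(Image.open("alpha_matte.png").convert("L"), dtype=np.float32) / 255.0
alpha = alpha[..., None]  # add a channel axis so it broadcasts over RGB

# Standard compositing equation: I = alpha * F + (1 - alpha) * B
composite = alpha * foreground + (1.0 - alpha) * background
Image.fromarray((composite * 255).astype(np.uint8)).save("composite.png")
```

The model's job is everything upstream of this step: estimating the foreground colors and the alpha matte from the pair of captured images.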

Pretrained Model Inference on Sample Fixed-Camera Videos



Example videos

I loaded a saved model, ran it on the sample videos (fixed camera), and logged to Weights & Biases with wandb.Video(). The pretrained model is very impressive and doesn't get confused by the other humans or similar colors in the new substitute background for the video. Below I do the same for photos, logging every stage of the process, which can be very useful for debugging.
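The logging itself is only a few lines; here is a rough sketch with hypothetical file paths (the matting inference comes from the authors' scripts and notebook):

```python
import wandb

run = wandb.init(project="background-matting", job_type="inference")  # hypothetical project name

# Log the rendered result videos; wandb.Video accepts a file path or a numpy array.
wandb.log({
    "input_video": wandb.Video("sample_video/input.mp4", format="mp4"),
    "matted_video": wandb.Video("sample_video/composite.mp4", format="mp4"),
})

# For photos, every intermediate stage can be logged as an image for debugging.
wandb.log({
    "stages": [
        wandb.Image("sample_photo/input.png", caption="original photo"),
        wandb.Image("sample_photo/background.png", caption="captured background"),
        wandb.Image("sample_photo/segmentation.png", caption="soft segmentation"),
        wandb.Image("sample_photo/alpha.png", caption="predicted alpha matte"),
        wandb.Image("sample_photo/composite.png", caption="composite on new background"),
    ]
})
run.finish()
```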

Model Inference on Sample Fixed-Camera Photos



Example images



Subtask Overview: Train and Test on Photos and Videos

The existing code for this project enables several tasks:
  • training models for photo matting
  • testing new photos on a trained model
  • training models for video matting
  • testing new videos on a trained model
All of these could be interesting to explore in Weights & Biases. The core photo-matting model requires the Adobe synthetic-composite dataset of 45.5K train and 1K test images extracted from a simple background and composited onto a new one, with accompanying alpha masks. This dataset is not immediately available for download, though there is a contact email. Fortunately, the authors provide a download link for the saved model, plus a notebook for you to run the model on your own images.
The video-matting model is a self-supervised generative adversarial network trained on frames extracted from real, unlabeled videos (a new dataset provided by the authors). This training finetunes a previously trained network: the photo-matting model trained on the static Adobe images.
With each image, the video-matting model also takes in an automatically-generated soft semantic segmentation map as an initial estimate of the foreground (these masks are precomputed prior to training). Below I finetune a few different versions of the video-matting GAN to see the effect of various hyperparameters and explore the performance on sample photos.
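As a concrete (if approximate) example of that precomputation, any off-the-shelf person-segmentation network can supply the soft mask. The sketch below uses torchvision's DeepLabV3 as a stand-in, which is not necessarily the authors' exact model, and the file paths are hypothetical:

```python
import torch
from torchvision import models, transforms
from PIL import Image

# Stand-in segmentation model for generating a soft foreground estimate per frame.
model = models.segmentation.deeplabv3_resnet101(pretrained=True).eval()
preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

img = Image.open("frame_0001.png").convert("RGB")  # hypothetical frame path
with torch.no_grad():
    logits = model(preprocess(img).unsqueeze(0))["out"][0]  # [21, H, W] class scores
    soft_mask = torch.softmax(logits, dim=0)[15]            # class 15 = person in the VOC label set

# Save the soft (0-1) person mask as an 8-bit image next to the frame.
Image.fromarray((soft_mask.numpy() * 255).astype("uint8")).save("frame_0001_mask.png")
```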

Casual Test: The Importance of Preprocessing and Fixed vs Hand-Held Models


The top row in both panels isn't a perfect matte, but it looks very reasonable aside from slight noise around hair and shadows (and the light wood paneling at the bottom of the image being parsed as foreground). After many rounds of testing, I had assumed that the indoor setting and bright lighting conditions fell outside the range of reasonable inputs for the model, until I confirmed the following:
  • fixed camera model gives far superior results: the fixed-camera model substantially outperforms the hand-held camera model, since it doesn't need to account for pixel drift. Compare the first and second rows in both panels: the second row parses a lot more background as foreground.
  • preprocessing is context-dependent: in my initial tests, the background image was rendered unrecognizable (see the third row in the panel). Eventually, I tried overwriting the file with the original background after the mask-generation step (see the sketch after this list).
  • did preprocessing degrade photo quality?: the "original photo" shown in the first column of these media panels is much lower quality (grainier) than the starting photo file, perhaps because of the preprocessing.
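For the second point above, the workaround is a simple file restore after mask generation. A sketch, assuming the preprocessing step overwrites the background captures in place (the directory names and the "_back" filename suffix are my own conventions, not necessarily the repo's):

```python
import shutil
from pathlib import Path

# Hypothetical layout: raw captures in originals/, preprocessed copies in processed/.
# After the mask-generation step, copy each original background back over the
# preprocessed version so the matting model sees the real captured background.
for original_back in Path("originals").glob("*_back.png"):
    shutil.copy(original_back, Path("processed") / original_back.name)
```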

Real photo examples



Visualizing Training


Example training runs



Training Video Models

The video-matting GAN is well-tuned, showing fast-dropping loss curves that are well-balanced between the generator and discriminator. The experiments below train on a random subsample of the full dataset of 13,000+ video frames for a few epochs to explore the effect of different hyperparameters. You can zoom into a subregion of the x-axis in the charts below for more detail by clicking on its left endpoint and dragging to the right.
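Each experiment records its data subsample and hyperparameters in wandb.config so the runs in the charts are comparable. A minimal sketch of that setup, with illustrative names and values rather than the authors' exact script:

```python
import random
from pathlib import Path
import wandb

# Illustrative values, not the repo defaults; "video_frames/" is a hypothetical directory.
config = {
    "epochs": 3,
    "subsample_frames": 2000,          # random subset of the 13,000+ extracted frames
    "g_updates_per_d_update": 5,
    "d_weight_decay_epochs": 2,
}
run = wandb.init(project="background-matting-video", config=config)

all_frames = sorted(Path("video_frames").glob("*.png"))
frames = random.sample(all_frames, k=min(config["subsample_frames"], len(all_frames)))
# ...build the dataset/dataloader from `frames` and launch finetuning from here...
```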

Observations

  • generator loss and discriminator loss match closely: as you can see in the first chart below, the generator loss and discriminator loss match closely and converge quickly for each model color. The first runs show slightly more noise because the data wasn't randomized initially. The left end of the loss curves flattens as I double the amount of training data or the number of training epochs. After initial proofs of concept, I lowered the epoch count and training data size to focus on edge cases and faster experimentation.
  • real loss initially larger than fake loss on the discriminator, but the two tend to converge within 500 steps
  • highest generator loss on foreground, then composite, then alpha: this relationship stays fairly consistent across training and may be explained by the inherent dimensionality/complexity of these outputs and how much it is possible to learn from each (the foreground output has full color only in foreground regions, the composite is a full-color image, and alpha is a single-channel opacity mask). When the loss bumps up, it generally does so across all three loss components (the logging pattern behind these charts is sketched below).
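The per-component curves come from logging each loss term under its own key at every step. A hedged sketch of that logging pattern; the key and argument names are illustrative, not the repo's variable names:

```python
import wandb

def log_losses(step, loss_alpha, loss_fg, loss_comp, loss_d_real, loss_d_fake):
    """Log each loss term under its own key so it appears as a separate line in the charts."""
    wandb.log({
        "generator/alpha_loss": loss_alpha,
        "generator/foreground_loss": loss_fg,
        "generator/composite_loss": loss_comp,
        "discriminator/real_loss": loss_d_real,
        "discriminator/fake_loss": loss_d_fake,
    }, step=step)
```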

Finetuning from Adobe Synthetic Model



Video Model Hyperparameters

I tried changing some of the hyperparameters of the GAN to see how much they would affect convergence. The baseline is shown in black below, training on the full dataset, and you can select one or both tabs to compare different model variants. I'm zooming in on the very start of training because it converges very quickly.

Tuning the Adversarial Relationship

  • number of Generator updates per Discriminator update (select the "G updates per D updates" tab below): the starting code set 5 generator updates per 1 discriminator update. Setting this parameter lower increases the generator loss, as expected. Setting it to 2 leads to the closest balance between fake and real generator loss.
  • weight decay (select the "Weight decay" tab below): the starting code set this to 2, such that the relative weight of the discriminator loss was halved every 2 epochs, putting less emphasis on the pseudo-supervision. If this weight decay is lowered (set below 2), the generator loss dominates the discriminator loss. Setting it to 3 inverts the order, such that the discriminator loss dominates. A simplified sketch of how these two settings enter the update loop follows this list.
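To make these two knobs concrete, here is a simplified, self-contained sketch of the update schedule as I read the description above; the networks, losses, and data are tiny stand-ins, not the actual matting models:

```python
import torch
import torch.nn as nn

# Tiny stand-ins for the generator and discriminator, just to illustrate the schedule.
G, D = nn.Linear(8, 8), nn.Linear(8, 1)
opt_G = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

G_UPDATES_PER_D_UPDATE = 5   # baseline from the starting code; I also tried 1 and 2
WT_DECAY_EPOCHS = 2          # halve the discriminator-loss weight every 2 epochs

d_weight = 1.0
for epoch in range(4):
    if epoch > 0 and epoch % WT_DECAY_EPOCHS == 0:
        d_weight *= 0.5      # de-emphasize the adversarial term as training proceeds

    for step in range(100):
        x = torch.randn(16, 8)  # stand-in batch
        fake = G(x)

        # Generator step on every iteration: reconstruction term plus weighted adversarial term.
        loss_G = (fake - x).abs().mean() + d_weight * bce(D(fake), torch.ones(16, 1))
        opt_G.zero_grad()
        loss_G.backward()
        opt_G.step()

        # Discriminator step only once per N generator updates.
        if step % G_UPDATES_PER_D_UPDATE == 0:
            loss_D = bce(D(x), torch.ones(16, 1)) + bce(D(fake.detach()), torch.zeros(16, 1))
            opt_D.zero_grad()
            loss_D.backward()
            opt_D.step()
```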


G updates per D updates
Weight decay
