In Background Matting: The World is Your Green Screen, Sengupta et al. train a machine learning model to extract figures in the foreground of photos and videos and composite them onto new backgrounds. Traditional methods for this kind of "background matting" require a green screen or a hand-drawn trimap to build the matte: a per-pixel annotation of foreground color and alpha (opacity). This new model instead requires two versions of the source photo or video: one with the person or subject in the foreground, and one without, showing just the background. Below I show some examples of how this works and how wandb can help analyze results and compare different models on this task.
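Matting rests on the standard compositing equation, I = αF + (1 − α)B: each pixel of the final image I blends the foreground F and background B according to the per-pixel opacity α. A minimal numpy sketch with toy arrays (the arrays and values here are illustrative, not from the authors' code):

```python
import numpy as np

# Toy 2x2 RGB "images": an extracted foreground, a new background,
# and a per-pixel alpha matte (1 = fully foreground, 0 = fully background).
foreground = np.full((2, 2, 3), 0.9)   # bright subject
background = np.full((2, 2, 3), 0.1)   # dark new backdrop
alpha = np.array([[1.0, 0.5],
                  [0.5, 0.0]])[..., None]  # broadcast over the RGB channels

# Standard compositing equation: I = alpha * F + (1 - alpha) * B
composite = alpha * foreground + (1 - alpha) * background

print(composite[0, 0])  # fully-foreground pixel -> [0.9 0.9 0.9]
print(composite[1, 1])  # fully-background pixel -> [0.1 0.1 0.1]
```

Estimating that α (and the foreground colors F) from an image without a green screen is exactly what the model learns to do.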
The existing code for this project enables several tasks, all of which could be interesting to explore in Weights & Biases. The core photo-matting model requires the Adobe synthetic-composite dataset of 45.5K training and 1K test images, each a foreground extracted from a simple background and composited onto a new one, with accompanying alpha masks. This dataset is not immediately available for download, though there is a contact email. Fortunately, the authors provide a download link for the saved model, plus a notebook for running the model on your own images.
The video-matting model is a self-supervised generative adversarial network trained on frames extracted from real, unlabeled videos (a new dataset provided by the authors). Training finetunes a previously trained network: the photo-matting model trained on the static Adobe images. Along with each frame, the video-matting model also takes an automatically generated soft semantic segmentation map as an initial estimate of the foreground (these maps are precomputed before training). Below I finetune several versions of the video-matting GAN to see the effect of various hyperparameters and explore performance on sample photos.
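To build intuition for what a "soft" segmentation map is, here is a rough illustration: softening a hard 0/1 person mask into fractional foreground estimates with a simple box blur. This is only a toy stand-in for intuition; the paper's pipeline uses a pretrained semantic segmentation network to produce these maps, not a blur.

```python
import numpy as np

def soften_mask(mask: np.ndarray, radius: int = 1) -> np.ndarray:
    """Turn a hard 0/1 mask into a soft [0, 1] estimate with a box blur.

    Illustration only: the actual precomputed maps come from a pretrained
    semantic segmentation network, not from blurring a binary mask.
    """
    h, w = mask.shape
    padded = np.pad(mask.astype(float), radius, mode="edge")
    soft = np.zeros_like(mask, dtype=float)
    # Average the (2*radius + 1)^2 neighborhood around each pixel.
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            soft += padded[radius + dy : radius + dy + h,
                           radius + dx : radius + dx + w]
    return soft / (2 * radius + 1) ** 2

# Hard mask: a 2x2 "person" block in a 4x4 frame.
hard = np.zeros((4, 4), dtype=int)
hard[1:3, 1:3] = 1
soft = soften_mask(hard)
print(soft)  # boundary pixels now take fractional values between 0 and 1
```

The soft values near the subject's boundary give the GAN an uncertain-but-informative starting estimate, which it refines into the final alpha matte.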
The video-matting GAN is well-tuned, showing fast-dropping loss curves that stay balanced between generator and discriminator. The experiments below train on a random subsample of the full dataset of 13,000+ video frames for a few epochs to explore the effect of different hyperparameters. You can zoom into subregions of the x-axis in the charts below for more detail: click on the left endpoint of the region you want and drag right to select it.
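Subsampling the frames for these quick experiments can be as simple as the following sketch (the file names and subsample size here are hypothetical, not the repo's actual layout):

```python
import random

# Hypothetical list standing in for the 13,000+ extracted video frames.
all_frames = [f"frame_{i:05d}.png" for i in range(13000)]

random.seed(0)  # fix the seed so different hyperparameter runs see the same frames
subsample = random.sample(all_frames, k=2000)  # train on ~15% of the frames

print(len(subsample))       # 2000
print(len(set(subsample)))  # 2000 -- sampled without replacement
```

Fixing the seed matters here: it keeps the training subset identical across runs, so any difference in the loss curves is attributable to the hyperparameters rather than the data draw.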
I tried changing some of the GAN's hyperparameters to see how much they affect convergence. The baseline, trained on the full dataset, is shown in black below, and you can select one or both tabs to compare different model variants. I zoom in on the very start of training because the model converges very quickly.
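One convenient way to organize this kind of hyperparameter comparison in W&B is a sweep configuration. A hedged sketch, where the program name and parameter names are illustrative placeholders, not the repo's actual script or flags:

```yaml
# Hypothetical wandb sweep config; replace program and parameter names
# with the Background Matting repo's actual training script and arguments.
program: train.py
method: grid
parameters:
  learning_rate:
    values: [0.0001, 0.0005]
  batch_size:
    values: [2, 4]
```

Launching this with `wandb sweep` runs one training job per grid point, and each variant then shows up alongside the baseline in the charts for comparison.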