Omnimatte: How to Detect Objects and Their Effects

Automatically capture the shadows and reflections of objects in videos. Made by Scott Condron using Weights & Biases

Introduction

Is it possible to automatically detect and remove an object from a video? On first thought, you'd imagine image segmentation models might be able to create a pixel-wise mask of the object, which could later be filled in with some estimate of the background. Although this is a good start, you would likely be left with the effects of that object in the scene, like reflections and shadows, which you would have to remove by some other means.
Rather than using traditional segmentation masks of just the object itself, one idea is to try to predict a mask of the object and its effects by using correlations between the object and its environment. This is exactly the concern of the paper we'll be going through in this report, in which the authors detect the effects of each of a given set of objects in a video.

What's an Omnimatte?

In image processing, a matte defines the different foreground or background areas of an image. This paper introduces the concept of an Omnimatte, a matte for each object and its effects in the scene. It may be easier to understand with a few examples:
The input video with a rough input mask, followed by the alpha channel of the Omnimatte, then the full RGB video + alpha channel of the Omnimatte, and finally the predicted background with the Omnimatte removed.

Use Cases

It would be a shame if the authors of a model that can easily enable flashy video effects didn't provide some examples upon the release of their paper. Thankfully for us, they didn't disappoint. Here are a few examples of the model in action.
They show how you can use the Omnimatte to desaturate all other objects in the scene, adding a "color pop" to your object. They also show background replacement, where the original shadow remains. Interestingly, as the video below shows, the shadow surface should be consistent, otherwise the shadow will look strange. Finally, they create a stroboscopic effect showing multiple versions of the same object across time.

Video: CVPR Presentation on Omnimatte

Method

In this section, we'll go over the inputs, outputs, and training procedure that produce these amazing results.
The basic idea is that the model is trained per video to reconstruct the original scene by using a composition of layers for each foreground object, and one layer for the background. There are also many other clever ideas which the authors use to improve their results, many of which are described below.

Input

The inputs to the model are the original video along with one or more rough segmentation masks. The output of the model is an Omnimatte for the object given in each input segmentation mask.
Other inputs are used to help the model in various ways. To give information about frame-to-frame motion, the authors use RAFT to pre-compute optical flow and feed this into the model. They also provide information about the camera movement: they compute camera homographies from the video and keep these fixed during training. Finally, they represent the background as a static "canvas" which is sampled to produce the on-screen background for each frame.
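To make the background model concrete, here's a rough PyTorch sketch of how such homography-based sampling could look. This is an illustration under my own assumptions (a single canvas tensor and a pre-computed 3×3 homography per frame), not the authors' implementation:

```python
import torch
import torch.nn.functional as F

def sample_background(canvas, homography, out_h, out_w):
    """Sample a per-frame background from a static canvas using the
    pre-computed camera homography for that frame.

    canvas:     (1, 3, Hc, Wc) static background "canvas".
    homography: (3, 3) float tensor mapping frame pixel coords -> canvas pixel coords.
    """
    _, _, Hc, Wc = canvas.shape
    ys, xs = torch.meshgrid(torch.arange(out_h), torch.arange(out_w), indexing="ij")
    ones = torch.ones_like(xs)
    pix = torch.stack([xs, ys, ones], dim=-1).reshape(-1, 3).float()  # (H*W, 3) homogeneous pixels
    warped = pix @ homography.T                                       # map into canvas space
    warped = warped[:, :2] / warped[:, 2:3]                           # perspective divide
    # Normalise canvas coordinates to [-1, 1] as required by grid_sample.
    gx = 2.0 * warped[:, 0] / (Wc - 1) - 1.0
    gy = 2.0 * warped[:, 1] / (Hc - 1) - 1.0
    grid = torch.stack([gx, gy], dim=-1).reshape(1, out_h, out_w, 2)
    return F.grid_sample(canvas, grid, align_corners=True)
```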

Output

The outputs of the model are the predicted opacity mattes and the color information for each layer. The model also outputs a predicted optical flow, which is used as an additional signal to capture motion during training.

Training

Interestingly, rather than relying on human labels, the model is trained using self-supervised training, a method of training neural networks which relies only on the input data. The paper we're discussing today shares a lot of authors with the paper Layered Neural Rendering for Retiming People in Video and it builds upon that work.
For each video, the model is trained to reconstruct the input as a composition of each layer. That is, the model predicts the RGBA information for each given segmentation mask along with a background, and during training the model is encouraged to accurately reconstruct the input frame as a composition of each layer.
This reconstruction loss is calculated as the L1 loss (mean absolute error) between each input frame and the output composited frame.
\mathbf{E}_\text{rgb-recon} = \frac{1}{T}\sum_t \|I_t - \textit{Comp}(\mathcal{L}_t, o_t)\|_1 where \mathcal{L}_t = \{\alpha^i_t, C_t^i\}_{i=1}^N are the predicted layers for frame t, and o_t is the compositing order.
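To make the compositing step concrete, here's a minimal PyTorch sketch of back-to-front "over" compositing and the L1 reconstruction loss. The function names are mine, not the official repository's:

```python
import torch

def composite_layers(background, alphas, colors):
    """Back-to-front "over" compositing of the predicted layers onto the background.

    background: (B, 3, H, W) background layer.
    alphas:     list of (B, 1, H, W) opacity maps, ordered back-to-front.
    colors:     list of (B, 3, H, W) RGB layers, in the same order.
    """
    comp = background
    for alpha, color in zip(alphas, colors):
        comp = alpha * color + (1.0 - alpha) * comp
    return comp

def rgb_reconstruction_loss(frames, backgrounds, alphas, colors):
    """L1 (mean absolute error) between the input frames and the composite."""
    recon = composite_layers(backgrounds, alphas, colors)
    return torch.mean(torch.abs(frames - recon))
```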
Below, you can see the target images beside the reconstructed images. The earlier "steps" in the visualization are from earlier in training and so the reconstruction is worse. Use the slider to see how the reconstruction improves during training.
With just this loss, it's possible that the model would output the entire reconstruction on a single layer. To overcome this, the authors add a regularization loss to encourage each layer to be spatially sparse.
\mathbf{E}_\text{reg} = \frac{1}{T}\frac{1}{N}\sum_t \sum_i \left( \gamma \left \| \alpha_t^i \right \|_1 + \Phi_0(\alpha_t^i) \right) where \Phi_0(x) = 2\cdot \mathtt{Sigmoid}(5x) - 1 smoothly penalizes non-zero values of the output opacity map, and \gamma controls the relative weight between the two terms.
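As a rough sketch (assuming alpha values in [0, 1] and treating γ as a plain hyperparameter), the regularizer could be implemented like this:

```python
import torch

def alpha_regularization_loss(alphas, gamma):
    """Encourage each opacity map to be spatially sparse by combining an L1 term
    with Phi_0(x) = 2 * sigmoid(5x) - 1, a smooth approximation of the L0 norm.

    alphas: list of (B, 1, H, W) opacity maps with values in [0, 1].
    gamma:  relative weight between the two terms.
    """
    loss = 0.0
    for alpha in alphas:
        l1_term = alpha.abs().mean()
        l0_term = (2.0 * torch.sigmoid(5.0 * alpha) - 1.0).mean()
        loss = loss + gamma * l1_term + l0_term
    return loss / len(alphas)
```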
Another loss is used to "bootstrap" the alpha masks to match the input masks at the beginning of training. This is removed after its value reaches a given threshold.
\mathbf{E}_\text{mask} = \frac{1}{T}\frac{1}{N}\sum_t \sum_i \left \| d_t^i \odot (M_t^i - \alpha_t^i) \right \|_2 where d_t^i = 1 - \mathtt{dilate}(M_t^i) + M_t^i is a boundary erosion mask to turn off the loss near the mask boundary, and \odot is element-wise product.
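One way to implement this boundary-aware bootstrap loss is to dilate the binary mask with a max pool and zero out the loss in the resulting boundary band. This is a sketch under my own assumptions (the dilation kernel size here is illustrative):

```python
import torch
import torch.nn.functional as F

def bootstrap_mask_loss(alphas, masks, kernel_size=5):
    """Pull each predicted alpha towards its rough input segmentation mask,
    while ignoring pixels in a thin band around the mask boundary.

    alphas, masks: lists of (B, 1, H, W) tensors with values in [0, 1].
    kernel_size:   size of the dilation structuring element (illustrative).
    """
    loss = 0.0
    for alpha, mask in zip(alphas, masks):
        # Binary dilation via max pooling with stride 1.
        dilated = F.max_pool2d(mask, kernel_size, stride=1, padding=kernel_size // 2)
        # d = 1 - dilate(M) + M: zero just outside the mask boundary, one elsewhere.
        d = 1.0 - dilated + mask
        loss = loss + torch.norm(d * (mask - alpha), p=2)
    return loss / len(alphas)
```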
As mentioned above, there's an additional "flow loss" that is used to add information about motion to the training process. This is the same L1 loss used for the image reconstruction, but it's calculated on the input and output flow for each frame for each layer.
\mathbf{E}_\text{flow-recon} = \frac{1}{T}\sum_t W_t \cdot \|F_t - \textit{Comp}(\mathcal{F}_t, o_t)\|_1 where \mathcal{F}_t = \{\hat{F}_i^t\} is the set of predicted flow layers, F_t is the original, pre-computed flow, and W_t is a spatial weighting map that lowers the impact of pixels with inaccurate flow. W_t is computed based on standard forward-backward flow consistency error and photometric warping error (see paper and supplementary materials for more details).
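As with the RGB term, this is a weighted L1 between the pre-computed flow and the composited predicted flow. A minimal sketch, assuming the confidence map W_t and the composited flow have already been computed:

```python
import torch

def flow_reconstruction_loss(flow_gt, flow_comp, weight):
    """Weighted L1 between the pre-computed (RAFT) flow and the composited
    predicted flow. `weight` down-weights pixels whose input flow is unreliable.

    flow_gt, flow_comp: (B, 2, H, W) optical flow fields.
    weight:             (B, 1, H, W) confidence map in [0, 1].
    """
    return torch.mean(weight * torch.abs(flow_gt - flow_comp))
```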
Finally, to prevent inconsistencies from frame to frame, the authors encourage temporal consistency by applying an alpha warping loss.
\mathbf{E}_\text{alpha-warp} = \frac{1}{T}\frac{1}{N}\sum_t \sum_i \|\alpha_t^i - \alpha_{wt}^i\|_1 where \alpha_{wt}^i = \textit{Warp}(\alpha_{t+1}^i, \mathcal{F}_t^i) is the alpha for layer i at time t+1 warped to time t using the predicted flow.
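The warping itself can be done with a flow-displaced sampling grid. Here's a rough sketch (my own helper names, not the authors' code):

```python
import torch
import torch.nn.functional as F

def warp_with_flow(image, flow):
    """Backward-warp `image` (here: an alpha map at time t+1) to time t
    using the predicted flow for that layer.

    image: (B, C, H, W) tensor.
    flow:  (B, 2, H, W) flow field in pixels, mapping t -> t+1.
    """
    B, _, H, W = image.shape
    # Base sampling grid in pixel coordinates.
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid = torch.stack([xs, ys], dim=0).float().to(image.device)  # (2, H, W)
    coords = grid.unsqueeze(0) + flow                             # (B, 2, H, W)
    # Normalise to [-1, 1] for grid_sample (x coordinate first, then y).
    coords_x = 2.0 * coords[:, 0] / (W - 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / (H - 1) - 1.0
    sample_grid = torch.stack([coords_x, coords_y], dim=-1)       # (B, H, W, 2)
    return F.grid_sample(image, sample_grid, align_corners=True)

def alpha_warp_loss(alpha_t, alpha_t_plus_1, flow_t):
    """L1 between alpha at time t and alpha at t+1 warped back to time t."""
    warped = warp_with_flow(alpha_t_plus_1, flow_t)
    return torch.mean(torch.abs(alpha_t - warped))
```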
The total loss is:
\mathbf{E}_\text{rgb-recon} + \lambda_\text{r} \mathbf{E}_\text{reg} + \lambda_\text{m} \mathbf{E}_\text{mask} + \mathbf{E}_\text{flow-recon} + \lambda_\text{w} \mathbf{E}_\text{alpha-warp}
where \lambda_\text{r}, \lambda_\text{m} and \lambda_\text{w} are weighting coefficients.
Below is a sketch of the backward pass, showing all the loss functions being combined.
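This is a minimal sketch using the helper names from the snippets above, not the authors' actual training code (which is linked in the resources at the end):

```python
def training_step(optimizer, losses, lambda_r, lambda_m, lambda_w, use_mask_loss=True):
    """Combine the individual loss terms and take one optimization step.

    losses: dict of the terms computed for the current batch, e.g.
            {"rgb_recon": ..., "reg": ..., "mask": ..., "flow_recon": ..., "alpha_warp": ...}.
    """
    total = (losses["rgb_recon"]
             + lambda_r * losses["reg"]
             + losses["flow_recon"]
             + lambda_w * losses["alpha_warp"])
    # The mask bootstrap term is only used early in training.
    if use_mask_loss:
        total = total + lambda_m * losses["mask"]

    optimizer.zero_grad()
    total.backward()
    optimizer.step()
    return total.item()
```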

Conclusion

In this post, we've looked at how the authors of Omnimatte: Associating Objects and Their Effects in Video used self-supervision to train a model to capture the effects of objects in videos.
The paper goes into greater detail about the motivations for each loss, discusses where the approach fails, and shows ablation studies on many aspects of the final implementation. They also compare their results to other papers that attempt similar tasks. This paper is a great example of how extra resources released alongside a paper can help the reader understand the impact and usefulness of the findings. I highly recommend checking out the original paper along with the additional resources they provide (linked below).
Lastly, if you enjoy this type of research, you'll likely enjoy this work from one of the same research groups, the Visual Geometry Group at the University of Oxford: Self-supervised Video Object Segmentation by Motion Grouping. Thanks for reading!
Resources: Website | Code | Paper | Supplementary Material
