An Intro to Retiming Instances in a Video
This article explores the method proposed in Layered Neural Rendering for Retiming People in Video. Using it, we can retime, manipulate, and edit people's motions in video, and more.
The ability to manipulate the timing of people's movements in a video can help video editors create exciting effects and even change the perception of an event. Video manipulation is widely used in films to alter time by speeding up or slowing down people's motions. How often have we seen villains around our favorite action movie star freeze in time in an action-packed scene?
This report will explore a deep learning-based method that can take in any natural video with multiple people moving. The output is a realistic re-rendering of the video, where the timing of people's movements is modified.
Paper | Project Website
Introduction to Retiming Motion in Video
Why Is This a Hard Problem?
Retiming the motion of people in video is hard. As mentioned, retiming is widely used in films, but so far it has been studied mostly in the context of character animation. There, the challenge is to retime the motion of a set of joints whose spatiotemporal correlations are known, and ground-truth 3D models for such characters usually exist as well.
Now imagine manipulating the timing of a person in a natural video. It is not just about moving the joints but the entire body. On top of that, to generate photorealistic, high-quality retiming effects, the scene elements that are correlated with the person's motion, like their shadow, reflections, or splashing water, need to be retimed correctly as well.
A video is also not confined to one person: interactions between subjects need to be handled, and small errors such as minor misalignments between frames can show up as visual artifacts.
How Is the Proposed Method Promising?
- The method is optimized on a per-video basis to decompose every frame into a set of layers (RGBA color images, more on this later). The optimization process ensures that each RGBA layer over time is associated with a specific person or group of people in the video. This grouping can be predefined, so the optimization is bound to this constraint. One group per layer ensures robust control for retiming effects.
- Using the rough parameterization of people obtained with existing tools, the model automatically learns to group people with their correlated elements in the scene like shadows, reflections, etc.
- Since a layer exists for each group, retiming effects can be produced by simple operations on the layers, like removing, copying, or interpolating specific layers, without any additional training or processing. This is because the video's original frames can be reconstructed using standard back-to-front compositing of the estimated layers (see the sketch after this list).
- The method is applicable to ordinary natural video, so many editing applications can benefit from it.
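To make this concrete, here is a minimal NumPy sketch (not the authors' code) of how retiming reduces to simple layer operations once per-frame RGBA layers are available. The `layers` array, its shape, and the per-layer frame offsets are assumptions for illustration only.

```python
# A minimal sketch of retiming as a layer operation. Assumes `layers` is a
# hypothetical array of shape (num_layers, num_frames, H, W, 4) holding the
# estimated RGBA layers in back-to-front order, with values in [0, 1].
import numpy as np

def composite(frame_layers):
    """Standard back-to-front 'over' compositing of a list of RGBA images."""
    out = np.zeros_like(frame_layers[0][..., :3])
    for layer in frame_layers:                       # back to front
        rgb, alpha = layer[..., :3], layer[..., 3:]
        out = alpha * rgb + (1.0 - alpha) * out
    return out

def retime(layers, offsets):
    """Shift each layer's timeline by a per-layer frame offset, e.g. to
    delay one person while the others keep moving."""
    num_layers, num_frames = layers.shape[:2]
    frames = []
    for t in range(num_frames):
        per_frame = [layers[i, np.clip(t + offsets[i], 0, num_frames - 1)]
                     for i in range(num_layers)]
        frames.append(composite(per_frame))
    return np.stack(frames)

# Example: keep the background (layer 0) in sync, delay person 2 by 10 frames.
# video = retime(layers, offsets=[0, 0, -10])
```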
Check out the video linked by the authors of this paper. 👇
Overview of the Proposed Method
The proposed method is based on a novel deep neural network that learns a layered decomposition of the input video. Thus, the model disentangles people's motions in different layers along with the correlated elements like shadows, reflection, etc.
The proposed model is trained on a per-video basis in a self-supervised manner. The big-picture task is to reconstruct the original video by predicting the layers. This is clever if you think about it. Let us see how:
- Each person in the video is first parameterized with a rough representation (a sampled texture map, described in detail below). This representation is passed to a neural renderer. The input to this neural renderer includes only the people (we will see how below) and a static background. The task of this renderer is to generate layers that reconstruct the full input video. Isn't it brilliant? With these estimated layers, one can use any simple editing technique and achieve retiming. The added benefit is that the learned layers capture the correlated elements as well.
Let us get into the nitty-gritty of the proposed method.
Problem Formulation
Given an input video with frames $I_t$, the goal is to decompose each frame into a set of RGBA (color channels + opacity) layers. This is represented by: $\{L_t^i\} = \{C_t^i, \alpha_t^i\}$.
Here, $C_t^i$ is a color image, and $\alpha_t^i$ is an opacity map. Each layer $L_t^i$ is associated with a person/group in the video. Also, $L_t^0$ is the background layer. Using the estimated layers and a back-to-front ordering of the layers, each video frame can be rendered using the standard "over" operator. This operation is denoted by: $I_t \approx \mathrm{Comp}\left(\{L_t^i\}, o_t\right)$.
Here, $o_t$ is the ordering of the estimated layers. Equality in the above equation holds if the rendered frame is exactly the same as the input frame.
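To spell out the "over" operator, assume the layers are already indexed back to front according to $o_t$. The recursion below is the standard alpha compositing rule written in the notation above, not a formula copied from the paper:

$\hat{I}_t^{(0)} = C_t^0, \qquad \hat{I}_t^{(i)} = \alpha_t^{i}\, C_t^{i} + \left(1 - \alpha_t^{i}\right) \hat{I}_t^{(i-1)} \;\; \text{for } i = 1, \dots, N, \qquad \mathrm{Comp}\left(\{L_t^i\}, o_t\right) = \hat{I}_t^{(N)}$

Each layer's opacity decides how much it covers everything composited behind it, which is why editing a single layer cleanly edits one person (and their correlated effects) in the final frame.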
The authors have cleverly used self-supervised learning to decompose each frame into sets of layers. We will see how in the training section.
Layered Neural Renderer
A deep neural network-based architecture dubbed layered neural renderer is used to decompose a frame into a set of layers.
A real-world video can be decomposed in numerous possible ways. For example, a single layer that contains the entire frame reconstructs the video perfectly but is not useful. The authors therefore constrain the neural renderer to steer the solution towards the desired person-specific decomposition.
The input to the renderer is constructed as follows:
- Person Representation: Each person $i$ in the video is parameterized with a single human texture atlas $T^i$ and a per-frame UV-coordinate map $UV_t^i$, which maps each pixel in the human region of frame $I_t$ to the texture atlas. To represent person $i$ at time $t$, the deep texture map $T^i$ is sampled using $UV_t^i$, obtaining the sampled texture $T_t^i$.
- Background Representation: The background is represented with a single texture map $T^0$ for the entire video, which is used to learn the necessary colors. Sampling from the background texture is performed according to a UV map $UV_t^0$. The background's UV map is placed behind each person's UV map to provide background context for the renderer (a sampling sketch follows this list).
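As a concrete (hypothetical) example of the sampling step, a deep texture atlas can be sampled with a per-frame UV map using bilinear sampling. The tensor shapes and names below are assumptions, not the paper's code:

```python
# Sampling a learnable deep texture atlas with a per-pixel UV map.
# `atlas` has shape (1, C, H_tex, W_tex); `uv` holds texture coordinates in
# [-1, 1] with shape (1, H, W, 2), as could be derived from DensePose output.
import torch
import torch.nn.functional as F

atlas = torch.randn(1, 16, 256, 256, requires_grad=True)   # deep texture, learned
uv = torch.rand(1, 128, 128, 2) * 2 - 1                     # placeholder UV map

# Bilinear sampling of the atlas at the UV coordinates gives the per-frame,
# per-person "sampled texture" that is fed to the neural renderer.
sampled = F.grid_sample(atlas, uv, mode="bilinear", align_corners=True)
print(sampled.shape)  # torch.Size([1, 16, 128, 128])
```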
The neural renderer predicts the layers in separate feed-forward passes. The input and output of this renderer are:
- Input: The background's UV map is placed behind each person's UV map to provide background context. Thus, the input for layer $i$ at time $t$ is the sampled deep texture map, which consists of person $i$'s sampled texture $T_t^i$ placed over the sampled background texture $T_t^0$.
- Output: The output of the renderer is the layer $L_t^i = \{C_t^i, \alpha_t^i\}$. Here, $C_t^i$ is the time-varying color image and $\alpha_t^i$ is the opacity map.
The renderer aims to reconstruct the original frames from the predicted layers (its output).
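A minimal PyTorch sketch of the per-layer feed-forward passes might look like the following. `TinyRenderer`, the channel counts, and the placeholder person masks are all hypothetical stand-ins for the actual layered neural renderer:

```python
import torch
import torch.nn as nn

class TinyRenderer(nn.Module):
    """Stand-in for the layered neural renderer (a much deeper network in practice)."""
    def __init__(self, in_channels=16):
        super().__init__()
        self.net = nn.Conv2d(in_channels, 4, kernel_size=3, padding=1)

    def forward(self, x):
        out = self.net(x)
        color = torch.sigmoid(out[:, :3])   # C_t^i: color image
        alpha = torch.sigmoid(out[:, 3:])   # alpha_t^i: opacity map
        return color, alpha

renderer = TinyRenderer()
background_tex = torch.randn(1, 16, 128, 128)                        # sampled background texture
person_textures = [torch.randn(1, 16, 128, 128) for _ in range(2)]   # one sampled texture per person
masks = [torch.zeros(1, 1, 128, 128) for _ in range(2)]              # placeholder person regions
masks[0][..., 30:90, 20:60] = 1
masks[1][..., 40:100, 70:110] = 1

layers = []
for person_tex, mask in zip(person_textures, masks):
    # Person's sampled texture placed over the sampled background texture.
    x = mask * person_tex + (1 - mask) * background_tex
    layers.append(renderer(x))   # one feed-forward pass per layer
```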
Training
The model is trained per-video to find the optimal parameters $\theta$. The authors have used three loss functions:
- One obvious loss function is the $L_1$ loss between the input frame and the rendered frame, since the task is formulated as a reconstruction problem. Formally,
- $E_{\text{recon}} = \sum_t \left\lVert I_t - \mathrm{Comp}\left(\{L_t^i\}, o_t\right) \right\rVert_1$
- This alone is not sufficient to make the optimization converge from random initialization. The authors cleverly encourage the learned alpha maps $\alpha_t^i$ to match the people segments $m_t^i$ that are associated with layer $i$. This is used just to bootstrap the model and is turned off as the optimization progresses.
- $E_{\text{mask}} = \sum_{t,i} d\left(m_t^i, \alpha_t^i\right)$
- Here, $m_t^i$ is a trimap derived from the UV maps $UV_t^i$, and $d(\cdot,\cdot)$ is a distance function.
- The authors have also used a regularization loss on the opacities to encourage them to be spatially sparse. Formally,
- $E_{\text{reg}} = \sum_{t,i} \gamma \left\lVert \alpha_t^i \right\rVert_1 + \Phi_0\left(\alpha_t^i\right)$
- Here, $\Phi_0$ smoothly penalizes non-zero values of the alpha map, and $\gamma$ balances the two terms.
The total loss is given by $E_{\text{total}} = E_{\text{recon}} + \gamma_m E_{\text{mask}} + \gamma_r E_{\text{reg}}$, where $\gamma_m$ and $\gamma_r$ are relative weights.
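Putting the three terms together, a rough PyTorch sketch of the objective could look like this. The variable names, the plain $L_1$ stand-in for the trimap distance $d$, the explicit form of the smooth penalty, and the loss weights are all assumptions for illustration, not values from the paper:

```python
import torch
import torch.nn.functional as F

def total_loss(target_frame, rendered_frame, alphas, trimaps,
               gamma_l1=1.0, gamma_mask=50.0, gamma_reg=0.005, bootstrap=True):
    # E_recon: L1 reconstruction error between the input and the composited frame.
    e_recon = F.l1_loss(rendered_frame, target_frame)

    # E_mask: encourage each alpha map to match its person trimap; a plain L1
    # distance stands in for d here. Only used to bootstrap, then turned off.
    e_mask = F.l1_loss(alphas, trimaps) if bootstrap else torch.tensor(0.0)

    # E_reg: spatial sparsity on opacities, an L1 term plus one possible choice
    # of smooth penalty on non-zero alpha values (assumed form).
    e_reg = gamma_l1 * alphas.abs().mean() + (2 * torch.sigmoid(5 * alphas) - 1).mean()

    return e_recon + gamma_mask * e_mask + gamma_reg * e_reg
```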
Conclusion
This article gives you the gist of the proposed method. The paper is full of exciting details, and I highly encourage you to give it a read. With the learned neural renderer, the input video can be decomposed into layers, and with the predicted layers in hand, various retiming and editing effects can be produced via simple operations on those layers.
The neural renderer represents each person along with all space-time visual effects correlated with them, including the movement of the individual's clothing and even challenging semi-transparent effects such as shadows and reflections. This is an interesting area of study, and the authors have used several clever tricks in the paper.
I hope this report gives you a sense of what the authors are trying to achieve. Leave your thoughts in the comments down below.