An Intro to Retiming Instances in a Video

This article explores the method introduced in the paper Layered Neural Rendering for Retiming People in Video, which makes it possible to retime, manipulate, and edit people's motions in a video.
Ayush Thakur
Created on November 24|Last edited on November 28
The ability to manipulate the timing of people's movements in a video can help video editors create exciting effects and even change the perception of an event. Video manipulation is widely used in films to alter time by speeding up or slowing down people's motions. How often have we seen villains around our favorite action movie star freeze in time in an action-packed scene? 
This report will explore a deep learning-based method that can take in any natural video with multiple people moving. The output is a realistic re-rendering of the video, where the timing of people's movements is modified. 
﻿Paper | Project Website﻿﻿
Table of Contents
Introduction to Retiming Motion in Video
Overview of the Proposed Method
Conclusion
Introduction to Retiming Motion in Video
Why Is This a Hard Problem?
Retiming the motion of people in a video is hard. As mentioned, retiming is widely used in films, but so far it has been studied mostly in the context of character animation. There, the challenge is to retime the motion of a set of joints while respecting the spatiotemporal correlations that exist between them. Ground truth 3D models for such characters usually exist as well.
Now imagine manipulating the timing of a person in a natural video. It is not just the joints that move but the entire body. On top of that, various elements in the scene, like the person's shadow, reflections, and splashing water, are correlated with the person's motion and must be retimed correctly to produce photorealistic, high-quality retiming effects.
A video is not confined to one person. The interaction of subjects needs to be addressed. Small errors such as minor misalignment between frames can show up as visual artifacts. 
How Is the Proposed Method Promising?
The method is optimized on a per-video basis to decompose every frame into a set of layers (RGBA color images, more on this later). The optimization ensures that each RGBA layer is associated over time with a specific person or group of people in the video. This grouping can be predefined, and the optimization is then bound to this constraint. Having one group per layer gives robust control over retiming effects.
Using the rough parameterization of people obtained using existing tools, the model automatically learns to group people with their correlated elements in the scene like shadows, reflection, etc.
Since a layer exists for each group, retiming effects can be produced by simple operations on layers, like removing, copying, or interpolating specific layers, without any additional training or processing. This works because the video's original frames can be reconstructed from the estimated layers using standard back-to-front compositing.
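The layer operations mentioned above can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes each group's layer is already available as a list with one entry per frame, and the helper names `shift_layer` and `freeze_layer` are hypothetical.

```python
def shift_layer(frames, delay):
    """Delay one layer by `delay` frames (negative values advance it).

    Frames that would fall outside the video are clamped to the first/last frame.
    """
    last = len(frames) - 1
    return [frames[min(max(t - delay, 0), last)] for t in range(len(frames))]


def freeze_layer(frames, at):
    """Hold the layer still from frame index `at` onward."""
    return [frames[min(t, at)] for t in range(len(frames))]


# Placeholder layers: strings stand in for the RGBA image of each frame.
layers = {
    "person_a": ["a0", "a1", "a2", "a3"],
    "person_b": ["b0", "b1", "b2", "b3"],
}

retimed = {
    "person_a": shift_layer(layers["person_a"], delay=1),  # starts one frame late
    "person_b": freeze_layer(layers["person_b"], at=1),    # frozen after frame 1
}

# Removing a person is just dropping their layer before compositing.
without_b = {name: frames for name, frames in layers.items() if name != "person_b"}
```

Each edited frame would then be rendered by compositing the retimed layers back to front, with no retraining involved.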
The method is applicable to ordinary natural video. Many applications can benefit from this method. 
Check out the video linked by the authors of this paper.
Overview of the Proposed Method
The proposed method is based on a novel deep neural network that learns a layered decomposition of the input video. The model thus disentangles people's motions into different layers, along with correlated elements like shadows and reflections.
The proposed model is trained on a per-video basis in a self-supervised manner. The big-picture task is to reconstruct the original video by predicting the layers. This is clever if you think about it. Let us see how.
First, the authors have used off-the-shelf methods like AlphaPose and DensePose in combination with their techniques to represent each person in each frame. 
This representation is passed to a neural renderer. The input to this neural renderer includes only the people (more on how below) and a static background. The task of the renderer is to generate layers that reconstruct the full input video. Isn't it brilliant? With these estimated layers, one can use simple editing techniques to achieve retiming. An added benefit is that the learned layers capture the correlated elements as well.
Figure 1: Summary of the proposed method. Source
Let us get into the nitty-gritty of the proposed method. 
Problem Formulation
Given an input video $V$, the goal is to decompose each frame $I_t \in V$ into a set of RGBA (color channels + opacity) layers. This is represented by:
$$A_t = \{L_t^i\}_{i=1}^N = \{ C_t^i, \alpha_t^i \}_{i=1}^N$$
Here, $C_t^i$ is a color image, and $\alpha_t^i$ is an opacity map. The $i^{th}$ layer is associated with the $i^{th}$ person/group in the video. Also, $L_t^0$ is the background layer. Using the estimated layers and a back-to-front ordering of the layers, each video frame can be rendered using the standard "over" operator. This operation is denoted by:
$$\hat I_t = \mathrm{Comp}(A_t, o_t)$$
Here, $o_t$ is the ordering of the estimated layers. The decomposition is exact when the rendered frame $\hat I_t$ is identical to the input frame $I_t$.
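To illustrate what back-to-front "over" compositing does, here is a minimal sketch in plain Python. It is not the authors' implementation: the function name `composite`, the flat-list image layout, and the choice to treat the rearmost layer as fully opaque are assumptions made for the example.

```python
def composite(layers, order):
    """Back-to-front "over" compositing of RGBA layers.

    layers: list of (colors, alphas) pairs; `colors` is a list of (r, g, b)
            pixels and `alphas` a list of opacities in [0, 1], one per pixel.
    order:  layer indices listed from back to front.
    """
    # Assumption: the rearmost layer (the background) is fully opaque.
    back_colors, _ = layers[order[0]]
    frame = [tuple(pixel) for pixel in back_colors]

    for i in order[1:]:
        colors, alphas = layers[i]
        # "over": the new layer covers what is underneath in proportion to alpha.
        frame = [
            tuple(a * c + (1.0 - a) * under_c for c, under_c in zip(color, under))
            for color, a, under in zip(colors, alphas, frame)
        ]
    return frame


# Two-pixel example: a red background and a blue layer that is opaque at
# pixel 0 and half transparent at pixel 1.
background = ([(1.0, 0.0, 0.0), (1.0, 0.0, 0.0)], [1.0, 1.0])
person = ([(0.0, 0.0, 1.0), (0.0, 0.0, 1.0)], [1.0, 0.5])

rendered = composite([background, person], order=[0, 1])
# rendered == [(0.0, 0.0, 1.0), (0.5, 0.0, 0.5)]
```

Changing `order` changes which group occludes which, which is why the ordering $o_t$ is part of the compositing operation.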
The authors have cleverly used self-supervised learning to decompose each frame into sets of layers. We will see how in the training section.
Layered Neural Renderer
A deep neural network-based architecture, dubbed the layered neural renderer, is used to decompose a frame into a set of layers.
A real-world video can be decomposed in numerous possible ways. For example, a single layer that contains the entire frame reconstructs the video perfectly. This is not useful, so the authors constrained the neural renderer to steer the solution towards the desired person-specific decomposition.
The input to the renderer is constructed as follows:
Person Representation: Each person in the video is parameterized with a single human texture atlas $T^i$ and a per-frame UV-coordinate map $UV_t^i$, which maps each pixel in the human region in frame $I_t$ to the texture atlas. To represent person $i$ at time $t$, the deep texture map $T^i$ is sampled using $UV_t^i$, obtaining $T_t^i$.
Background Representation: The background is represented with a single texture map $T^0$ for the entire video. This is used to learn the necessary colors. Sampling from the background is performed according to a UV map $UV_t^0$. The background's UV map is placed behind each person's UV map to provide background context for the renderer.
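Sampling a texture with a UV map can be sketched as a simple lookup. The following is an illustration under stated assumptions rather than the paper's procedure: it uses nearest-neighbor lookup, assumes UV coordinates normalized to [0, 1] with `u` indexing columns and `v` indexing rows, and marks pixels outside the person region with `None`. The function name `sample_texture` is hypothetical.

```python
def sample_texture(texture, uv_map):
    """Look up one texture entry per image pixel (nearest neighbor).

    texture: 2D grid (list of rows) of texture entries.
    uv_map:  2D grid with a (u, v) pair in [0, 1] for each image pixel,
             or None where the pixel is not covered by this UV map.
    """
    height, width = len(texture), len(texture[0])
    sampled = []
    for row in uv_map:
        sampled_row = []
        for uv in row:
            if uv is None:
                sampled_row.append(None)  # nothing to sample for this pixel
                continue
            u, v = uv
            x = min(int(u * width), width - 1)    # clamp u == 1.0 to the last column
            y = min(int(v * height), height - 1)  # clamp v == 1.0 to the last row
            sampled_row.append(texture[y][x])
        sampled.append(sampled_row)
    return sampled


# A 2x2 texture whose entries are labels, and a 2x2 image-sized UV map.
texture = [["a", "b"],
           ["c", "d"]]
uv_map = [[(0.0, 0.0), (0.9, 0.0)],
          [None,       (0.9, 0.9)]]

sampled = sample_texture(texture, uv_map)
# sampled == [["a", "b"], [None, "d"]]
```

In a trainable setting the texture entries would be learned feature vectors and the lookup would need to be differentiable; the sketch only shows the indexing idea.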
The neural renderer predicts the layers in separate feed-forward passes. The input-output of this renderer is:
Input: The background's UV map is placed behind each person's UV map to provide background context. Thus, the input for layer $i$ at time $t$ is the sampled deep texture map $T_t^i$, which consists of person $i$'s sampled texture placed over the sampled background texture.
Output: The output of the renderer is $L_t^i = \{C_t^i, \alpha_t^i\}$. Here, $C_t^i$ is the time-varying color image and $\alpha_t^i$ is the opacity map.
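The input construction described above, one person's sampled texture placed over the sampled background, can be sketched as follows. This is an assumption-laden illustration: the name `build_layer_input` is hypothetical, and `None` is used to mark pixels that fall outside the person region.

```python
def build_layer_input(person_sample, background_sample):
    """Place person i's sampled texture over the sampled background.

    Both arguments are image-sized 2D grids. `person_sample` holds None
    wherever the pixel is outside the person region, so the background
    sample shows through there.
    """
    return [
        [bg if person is None else person for person, bg in zip(person_row, bg_row)]
        for person_row, bg_row in zip(person_sample, background_sample)
    ]


person_sample = [[None, "p"],
                 [None, None]]
background_sample = [["b00", "b01"],
                     ["b10", "b11"]]

layer_input = build_layer_input(person_sample, background_sample)
# layer_input == [["b00", "p"], ["b10", "b11"]]
```

One such input is built per layer, which matches the idea of predicting each layer in its own feed-forward pass.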
The renderer aims to reconstruct the original frames from the predicted layers (its output).
Training
The model is trained per-video to find the optimal parameters, $\theta$. The authors have used three loss functions:
One obvious loss function is the $L_1$ loss between the input frame and the rendered frame, since the task is formulated as a reconstruction problem. Formally,
  $$E_{recon} = \frac{1}{K} \sum_{t} ||I_t - \mathrm{Comp}(A_t, o_t)||_1$$
  Here, $K$ is the number of frames.
This alone is not sufficient to make the optimization converge from random initialization. The authors cleverly encouraged the learned alpha maps $\alpha_t^i$ to match the people segments that are associated with layer $i$. This is used just to bootstrap the model and is turned off as the optimization progresses.
  $$E_{mask} = \frac{1}{K} \frac{1}{N} \sum_t \sum_i D(m_t^i, \alpha_t^i)$$
  Here, $m_t^i$ is a trimap derived from the UV maps $UV_t^i$, and $D$ is a distance function.
The authors have also applied a regularization loss to the opacities $\alpha_t^i$ to encourage them to be spatially sparse. Formally,
  $$E_{reg} = \frac{1}{K} \frac{1}{N} \sum_t \sum_i \gamma ||\alpha_t^i||_1 + \Phi_0(\alpha_t^i)$$
  Here, $\Phi_0(x) = 2 \cdot \mathrm{Sigmoid}(5x) - 1$ smoothly penalizes non-zero values of the alpha map.
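The reconstruction and regularization terms can be written out numerically as a small sketch. Several details here are assumptions for illustration and are not stated in the article: images and alpha maps are flat lists of floats, $\Phi_0$ is summed over the pixels of each alpha map, and the function names are hypothetical. The mask term is left out because the distance function $D$ is not specified above.

```python
import math


def phi0(x):
    """Phi_0(x) = 2 * sigmoid(5x) - 1: zero at x = 0, close to 1 for x near 1."""
    return 2.0 / (1.0 + math.exp(-5.0 * x)) - 1.0


def l1_distance(a, b):
    """L1 norm of the difference between two flat images."""
    return sum(abs(x - y) for x, y in zip(a, b))


def recon_loss(frames, rendered):
    """E_recon: L1 distance between input and rendered frames, averaged over frames."""
    return sum(l1_distance(f, r) for f, r in zip(frames, rendered)) / len(frames)


def reg_loss(alpha_maps, gamma):
    """E_reg: sparsity penalty averaged over all (frame, layer) alpha maps.

    alpha_maps: list of flat alpha maps, one per (t, i) pair.
    Assumption: Phi_0 is applied per pixel and summed over each map.
    """
    total = 0.0
    for alpha in alpha_maps:
        total += gamma * sum(abs(a) for a in alpha) + sum(phi0(a) for a in alpha)
    return total / len(alpha_maps)


frames = [[0.0, 1.0], [1.0, 1.0]]
rendered = [[0.0, 0.5], [1.0, 0.0]]
print(recon_loss(frames, rendered))  # 0.75
```

Note how `phi0` rises steeply near zero and then saturates, so it mostly counts how many pixels are non-zero, while the `gamma`-weighted L1 part penalizes how large the opacities are.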
The total loss is given by:
$$E_{total} = E_{recon} + \gamma_m E_{mask} + \beta E_{reg}$$
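As a rough sketch of how the weighted sum and the bootstrap behavior of the mask term might fit together, consider the function below. The hard cutoff at `bootstrap_steps`, the default weights, and the function name are all placeholders chosen for illustration; the article only says that the mask term is turned off as the optimization progresses.

```python
def total_loss(e_recon, e_mask, e_reg, step,
               gamma_m=1.0, beta=1.0, bootstrap_steps=1000):
    """E_total = E_recon + gamma_m * E_mask + beta * E_reg.

    Assumption: the mask term is dropped with a hard cutoff once `step`
    reaches `bootstrap_steps`. The actual schedule is not given here.
    """
    mask_weight = gamma_m if step < bootstrap_steps else 0.0
    return e_recon + mask_weight * e_mask + beta * e_reg


# Early in training the mask term contributes; later it is switched off.
early = total_loss(1.0, 2.0, 3.0, step=0, gamma_m=0.5, beta=0.1, bootstrap_steps=10)
late = total_loss(1.0, 2.0, 3.0, step=10, gamma_m=0.5, beta=0.1, bootstrap_steps=10)
```

With these example values, `early` includes the mask contribution and `late` does not, so `late` is smaller by `gamma_m * e_mask`.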
Conclusion
This article is written to give you the gist of the proposed method. The paper is full of exciting details, and I highly encourage you to give it a read. With the learned neural renderer, the input video can be decomposed into layers. With the predicted layers in hand, various retiming and editing effects can be produced via simple operations on the layers.
The neural renderer represents the person and all space-time visual effects correlated with them, including the movement of the individual's clothing and even challenging semi-transparent effects such as shadows and reflections. This is an interesting area of study, and the authors have used clever bits in the paper.
I hope you get a sense of what the authors are trying to achieve through this report. Leave your thoughts in the comments down below.