Retiming Instances in a Video

This report explores an interesting method discussed in Layered Neural Rendering for Retiming People in Video. Using this method, one can retime people in a video, manipulate and edit the timing of different motions, freeze people, and even erase them.
Ayush Thakur

The ability to manipulate the timing of people's movements in a video can help video editors create exciting effects and even change the perception of an event. Video manipulation is widely used in films to alter time by speeding up or slowing down people's motions. How often have we seen villains around our favorite action movie star freeze in time in an action-packed scene?

This report will explore a deep learning-based method that can take in any natural video with multiple people moving. The output is a realistic re-rendering of the video, where the timing of people's movements is modified.

Paper | Project Website


Why is this a hard problem?

Retiming the motion of people in a video is hard. As mentioned, retiming is widely used in films, but so far it has been studied mostly in the context of character animation. There, the challenge is to retime the motion of a set of joints with known spatiotemporal correlations between them, and ground-truth 3D models of the characters usually exist as well.

Imagine manipulating the timing of a person in a natural video. It is not just about moving the joints but the entire body. On top of that, to generate photorealistic, high-quality retiming effects, the various scene elements that are correlated with the person's motion, such as their shadow, reflections, and splashing water, need to be retimed correctly as well.

A video is also not confined to one person: interactions between subjects need to be handled, and even small errors, such as minor misalignments between frames, show up as visual artifacts.

How is the proposed method promising?

Check out the video linked by the authors of this paper.


Overview of the Proposed Method

The proposed method is based on a novel deep neural network that learns a layered decomposition of the input video. The model thus disentangles people's motions into different layers, along with correlated elements like shadows and reflections.

The proposed model is trained on a per-video basis in a self-supervised manner. The big-picture task is to reconstruct the original video by predicting the layers. This is clever if you think about it. Let us see how.


-> Figure 1: Summary of the proposed method. Source <-

Let us get into the nitty-gritty of the proposed method.

Problem Formulation

Given an input video $V$, the goal is to decompose each frame $I_t \in V$ into a set of RGBA (color + opacity) layers. This is represented by:

$A_t = \{L_t^i\}_{i=1}^N = \{(C_t^i, \alpha_t^i)\}_{i=1}^N$

Here, $C_t^i$ is a color image, and $\alpha_t^i$ is an opacity map. Each $i^{th}$ layer is associated with the $i^{th}$ person/group in a video. Also, $L_t^0$ is the background layer. Using the estimated layers and a back-to-front ordering of the layers, each video frame can be rendered using the standard "over" operator. This operation is denoted by:

$\hat I_t = \mathrm{Comp}(A_t, o_t)$

Here, $o_t$ is the ordering of the estimated layers. When the decomposition is accurate, the rendered frame $\hat I_t$ matches the input frame $I_t$ exactly.
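The "over" operator itself is standard alpha compositing applied back-to-front. Here is a minimal NumPy sketch of such a compositing step; the function name, shapes, and float-RGBA convention are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def composite(layers, order):
    """Back-to-front "over" compositing of RGBA layers.

    layers: list of (color, alpha) pairs, where color has shape (H, W, 3)
            and alpha has shape (H, W, 1), both floats in [0, 1].
    order:  indices into `layers`, back-to-front (background first).
    """
    h, w, _ = layers[order[0]][0].shape
    out = np.zeros((h, w, 3))
    for i in order:
        color, alpha = layers[i]
        # "Over" operator: the new layer occludes what is behind it
        # in proportion to its opacity.
        out = alpha * color + (1.0 - alpha) * out
    return out
```

Changing `order` changes which person occludes which, which is why the per-frame ordering $o_t$ matters for rendering.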

The authors have cleverly used self-supervised learning to decompose each frame into sets of layers. We will see how in the training section.

Layered Neural Renderer

A deep neural network-based architecture dubbed layered neural renderer is used to decompose a frame into a set of layers.

A real-world video can be decomposed into layers in numerous ways. A single layer containing the entire frame would reconstruct the video perfectly, but such a decomposition is not useful. The authors therefore constrain the neural renderer to steer the solution toward the desired person-specific decomposition.

The neural renderer predicts each layer in a separate feed-forward pass, taking a per-layer input representation of the frame and producing that layer's color image and opacity map. The renderer aims to reconstruct the original frames from the predicted layers.


The model is trained per-video to find the optimal parameters, $\theta$. The authors use three loss terms: a reconstruction loss $E_{recon}$, a mask loss $E_{mask}$, and a regularization term $E_{reg}$.

The total loss is given by,

$E_{total} = E_{recon} + \gamma_m E_{mask} + \beta E_{reg}$
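A sketch of how such a weighted objective might be combined; the individual terms here (L1 reconstruction, L1 mask supervision, an L1 sparsity penalty on the alphas) and the default weights are illustrative placeholders, not the paper's exact losses:

```python
import numpy as np

def total_loss(pred_frame, true_frame, pred_alpha, mask,
               gamma_m=1.0, beta=0.005):
    """Illustrative combined objective (placeholder terms and weights)."""
    # Reconstruction: the re-rendered composite should match the input frame.
    e_recon = np.abs(pred_frame - true_frame).mean()
    # Mask term: alpha maps should roughly cover each person's region
    # (here supervised against a precomputed person mask).
    e_mask = np.abs(pred_alpha - mask).mean()
    # Regularization: a simple sparsity penalty on the opacities.
    e_reg = np.abs(pred_alpha).mean()
    return e_recon + gamma_m * e_mask + beta * e_reg
```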


This report is written to give you the gist of the proposed method. The paper is full of exciting details, and I highly encourage you to give it a read. With the learned neural renderer, the input video can be decomposed into layers, and with the predicted layers in hand, various retiming and editing effects can be produced via simple operations on the layers.
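As an illustration of such a layer-level edit, here is a hypothetical "freeze" operation that reuses one person's layer from a single frame across the whole video before recompositing; the function name and data layout are assumptions for the sketch:

```python
def freeze_person(layers_per_frame, person_idx, freeze_frame):
    """Freeze one person by reusing their layer from `freeze_frame`.

    layers_per_frame: list over time; each entry is a list of
                      per-layer data (background first).
    person_idx:       index of the layer to freeze.
    freeze_frame:     time index whose layer is reused everywhere.
    """
    frozen = layers_per_frame[freeze_frame][person_idx]
    edited = []
    for layers in layers_per_frame:
        new_layers = list(layers)
        # Replace this person's layer with the frozen copy; all other
        # layers keep their original timing.
        new_layers[person_idx] = frozen
        edited.append(new_layers)
    return edited
```

Speed-ups, slow-downs, and removals work the same way: resample or drop a person's layer sequence in time, then recomposite.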

The neural renderer represents the person and all space-time visual effects correlated with them, including the movement of the individual's clothing and even challenging semi-transparent effects such as shadows and reflections. This is an interesting area of study, and the paper is full of clever ideas.

I hope this report gives you a sense of what the authors are trying to achieve. Leave your thoughts in the comments down below.