# Retiming Instances in a Video

This report explores the method introduced in *Layered Neural Rendering for Retiming People in Video*. Using this method, one can retime people in a video: manipulate and edit the timing of their motions, freeze people, and even erase them.
Ayush Thakur

The ability to manipulate the timing of people's movements in a video can help video editors create exciting effects and even change the perception of an event. Video manipulation is widely used in films to alter time by speeding up or slowing down people's motions. How often have we seen villains around our favorite action movie star freeze in time in an action-packed scene?

This report will explore a deep learning-based method that can take in any natural video with multiple people moving. The output is a realistic re-rendering of the video, where the timing of people's movements is modified.

## Introduction

#### Why is this a hard problem?

Retiming the motion of people in a video is hard. As mentioned, retiming is widely used in films, but it has so far been studied mostly in the context of character animation. There, the challenge is to retime the motion of a set of joints with known spatiotemporal correlations, and ground-truth 3D models of the characters are usually available.

Imagine manipulating the timing of a person in a natural video. It is not just about moving the joints but the entire body. On top of that, to generate photorealistic, high-quality retiming effects, the various scene elements that are correlated with the person's motion, like their shadow, reflections, or splashing water, need to be retimed correctly as well.

A video is also not confined to one person: the interactions between subjects need to be handled, and small errors such as minor misalignments between frames show up as visual artifacts.

#### How is the proposed method promising?

• The method is optimized on a per-video basis to decompose every frame into a set of layers (RGBA color images; more on this later). The optimization ensures that each RGBA layer is associated over time with a specific person or group of people in the video. This grouping can be predefined, and the optimization is bound by this constraint. One group per layer ensures robust control for retiming effects.

• Using rough parameterizations of people obtained with existing tools, the model automatically learns to group each person with their correlated scene elements, such as shadows and reflections.

• Since a layer exists for each group, retiming effects can be produced by simple operations on the layers, like removing, copying, or interpolating specific layers, without any additional training or processing. This is possible because the video's original frames can be reconstructed from the estimated layers using standard back-to-front compositing.

• The method is applicable to ordinary natural videos, so many applications can benefit from it.
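To make these layer operations concrete, here is a minimal NumPy sketch (with hypothetical shapes and random data, not the paper's code) of freezing and erasing a person once per-frame RGBA layers are in hand:

```python
import numpy as np

# Hypothetical layer tensor: T frames, N layers, H x W pixels, RGBA channels.
# In the paper each layer is an RGBA image per frame; here we use random data.
T, N, H, W = 6, 3, 4, 4
layers = np.random.rand(T, N, H, W, 4)

def freeze_layer(layers, layer_idx, frame_idx):
    """Freeze one person's layer: reuse its appearance from a single frame."""
    out = layers.copy()
    out[:, layer_idx] = layers[frame_idx, layer_idx]
    return out

def remove_layer(layers, layer_idx):
    """Erase a person by making their layer fully transparent."""
    out = layers.copy()
    out[:, layer_idx, ..., 3] = 0.0  # zero the alpha channel
    return out

frozen = freeze_layer(layers, layer_idx=1, frame_idx=0)
erased = remove_layer(layers, layer_idx=2)
```

Time-remapping (speeding up or slowing down one person) works the same way: re-index a single layer along the time axis before compositing, leaving the other layers untouched.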

Check out the video linked by the authors of this paper. :point_down:

## Overview of the Proposed Method

The proposed method is based on a novel deep neural network that learns a layered decomposition of the input video. Thus, the model disentangles people's motions in different layers along with the correlated elements like shadows, reflection, etc.

The proposed model is trained on a per-video basis in a self-supervised manner. The big-picture task is to reconstruct the original video by predicting the layers. This is clever if you think about it. Let us see how.

• First, the authors use off-the-shelf methods like AlphaPose and DensePose, in combination with their own techniques, to represent each person in each frame.

• This representation is passed to a neural renderer. The input to this neural renderer includes only the people (we will see how) and a static background. The task of this renderer is to generate layers that reconstruct the full input video. Isn't it brilliant? With the estimated layers, one can use simple editing techniques to achieve retiming. The added benefit is that the learned layers capture the correlated elements as well.

*Figure 1: Summary of the proposed method. (Source)*

Let us get into the nitty-gritty of the proposed method.

### Problem Formulation

Given an input video $V$, the goal is to decompose each frame $I_t \in V$ into a set of RGBA (color channels + opacity) layers. This is represented by:

$A_t = \{L_t^i\}_{i=1}^N = \{C_t^i, \alpha_t^i\}_{i=1}^N$

Here, $C_t^i$ is a color image, and $\alpha_t^i$ is an opacity map. The $i^{th}$ layer is associated with the $i^{th}$ person/group in the video, and $L_t^0$ is the background layer. Using the estimated layers and a back-to-front ordering of the layers, each video frame can be rendered with the standard "over" operator. This operation is denoted by:

$\hat{I}_t = \text{Comp}(A_t, o_t)$

Here, $o_t$ is the ordering of the estimated layers. The rendered frame $\hat{I}_t$ equals the input frame $I_t$ when the reconstruction is perfect.
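To make $\text{Comp}$ concrete, here is a minimal NumPy sketch of back-to-front compositing with the standard "over" operator (straight, non-premultiplied alpha is assumed, and the shapes and test data are hypothetical, not from the paper):

```python
import numpy as np

def over(fg, bg):
    """Standard 'over' operator on straight-alpha RGBA images of shape (H, W, 4)."""
    a_fg, a_bg = fg[..., 3:4], bg[..., 3:4]
    a_out = a_fg + a_bg * (1.0 - a_fg)
    rgb = fg[..., :3] * a_fg + bg[..., :3] * a_bg * (1.0 - a_fg)
    # Un-premultiply; avoid division by zero where the result is fully transparent.
    rgb = np.where(a_out > 0, rgb / np.maximum(a_out, 1e-8), 0.0)
    return np.concatenate([rgb, a_out], axis=-1)

def comp(layers, order):
    """Back-to-front composite: layers[order[0]] is the rearmost layer."""
    frame = layers[order[0]]
    for idx in order[1:]:
        frame = over(layers[idx], frame)
    return frame

# Opaque gray background (layer 0) plus one person layer with a hard matte.
H, W = 8, 8
bg = np.concatenate([np.full((H, W, 3), 0.5), np.ones((H, W, 1))], axis=-1)
person = np.zeros((H, W, 4))
person[2:6, 2:6] = [1.0, 0.0, 0.0, 1.0]  # opaque red square
frame = comp(np.stack([bg, person]), order=[0, 1])
```

Inside the matte the composite shows the person's color; everywhere else the background shows through, which is exactly why transparent (erased) layers simply disappear from the render.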

The authors have cleverly used self-supervised learning to decompose each frame into sets of layers. We will see how in the training section.

### Layered Neural Renderer

A deep neural network-based architecture dubbed layered neural renderer is used to decompose a frame into a set of layers.

A real-world video can be decomposed in numerous ways. For example, a single layer containing the entire frame reconstructs the video perfectly but is useless for editing. The authors therefore constrain the neural renderer to steer the solution towards the desired person-specific decomposition.

The input to the renderer is constructed as follows:

• Person Representation: Each person in the video is parameterized with a single human texture atlas $T^i$ and a per-frame UV-coordinate map $UV_t^i$, which maps each pixel in the human region in frame $I_t$ to the texture atlas. To represent person $i$ at time $t$, the deep texture map $T^i$ is sampled using $UV_t^i$ obtaining $T_t^i$.

• Background Representation: The background is represented with a single texture map $T^0$ for the entire video. This is used to learn the necessary colors. Sampling from the background is performed according to a UV map $UV_t^0$. The background's UV map is placed behind each person's UV map to provide background context for the renderer.

The neural renderer predicts the layers in separate feed-forward passes. The input-output of this renderer is:

• Input: The input for layer $i$ at time $t$ is the sampled deep texture map $T_t^i$, which consists of person $i$'s sampled texture placed over the sampled background texture.

• Output: The output of the renderer is $L_t^i = \{C_t^i, \alpha_t^i\}$. Here, $C_t^i$ is the time-varying color image and $\alpha_t^i$ is the opacity map.

The renderer aims to reconstruct the original frames from the predicted layers(output).
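For intuition about the person representation, here is a minimal sketch of UV-based texture sampling. The paper samples a learned *deep* texture atlas (typically with bilinear interpolation); this hypothetical version uses plain nearest-neighbour lookup on a tiny numeric atlas just to show the mechanics:

```python
import numpy as np

def sample_texture(atlas, uv):
    """Nearest-neighbour lookup of a texture atlas via a per-pixel UV map.

    atlas: (Ha, Wa, C) texture; uv: (H, W, 2) with (u, v) coordinates in [0, 1].
    """
    h = np.clip((uv[..., 1] * (atlas.shape[0] - 1)).round().astype(int),
                0, atlas.shape[0] - 1)
    w = np.clip((uv[..., 0] * (atlas.shape[1] - 1)).round().astype(int),
                0, atlas.shape[1] - 1)
    return atlas[h, w]

atlas = np.arange(16, dtype=float).reshape(4, 4, 1)  # tiny 4x4, 1-channel atlas
uv = np.zeros((2, 2, 2))
uv[1, 1] = [1.0, 1.0]  # this output pixel reads the atlas corner
sampled = sample_texture(atlas, uv)
```

Each person gets their own atlas $T^i$ and per-frame UV map $UV_t^i$, so the same lookup yields $T_t^i$ for every frame.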

### Training

The model is trained per video to find the optimal parameters $\theta$. The authors use three loss functions:

• One obvious loss function is the $L_1$ loss between the input frame and the rendered frame since the task is formulated as a reconstruction problem. Formally,

$E_{recon} = \frac{1}{K} \sum_{t}||I_t - \text{Comp}(A_t, o_t)||_1$

• This alone is not sufficient for the optimization to converge from random initialization. The authors therefore encourage the learned alpha maps $\alpha_t^i$ to match the person segments associated with layer $i$. This loss is used just to bootstrap the model and is turned off as the optimization progresses.

$E_{mask} = \frac{1}{K} \frac{1}{N} \sum_t \sum_i D(m_t^i, \alpha_t^i)$

Here, $m_t^i$ is a trimap derived from the UV maps $UV_t^i$, and $D$ is a distance function.

• The authors also apply a regularization loss to the opacities $\alpha_t^i$ to encourage them to be spatially sparse. Formally,

$E_{reg} = \frac{1}{K} \frac{1}{N} \sum_t \sum_i \gamma||\alpha_t^i||_1 + \Phi_0(\alpha_t^i)$

Here, $\Phi_0(x) = 2 \cdot \text{Sigmoid}(5x) - 1$ smoothly penalizes non-zero values of the alpha map.

The total loss is given by,

$E_{total} = E_{recon} + \gamma_m E_{mask} + \beta E_{reg}$
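Putting the three terms together, here is a rough NumPy sketch of the objective. The distance $D$ is taken to be L1 here, the per-element means stand in for the explicit $\frac{1}{K}\frac{1}{N}$ sums, and the weights are illustrative placeholders rather than the paper's values:

```python
import numpy as np

def phi0(x):
    """Soft sparsity penalty 2*sigmoid(5x) - 1: zero at x=0, approaches 1 for large x."""
    return 2.0 / (1.0 + np.exp(-5.0 * x)) - 1.0

def total_loss(frames, recon, alphas, masks, gamma=0.01, gamma_m=1.0, beta=0.001):
    """Sketch of E_total = E_recon + gamma_m * E_mask + beta * E_reg.

    frames, recon: (K, H, W, 3) input and rendered frames;
    alphas, masks: (K, N, H, W) opacity maps and bootstrap trimaps.
    """
    e_recon = np.abs(frames - recon).mean()                 # L1 reconstruction
    e_mask = np.abs(masks - alphas).mean()                  # bootstrap term, L1 as D
    e_reg = (gamma * np.abs(alphas) + phi0(alphas)).mean()  # sparsity regularizer
    return e_recon + gamma_m * e_mask + beta * e_reg
```

In training, $\gamma_m$ would be annealed toward zero so the mask term only bootstraps the decomposition, as described above.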

## Conclusion

This report is written to give you the gist of the proposed method. The paper is full of exciting details, and I highly encourage you to give it a read. With the learned neural renderer, the input video can be decomposed into layers. With the predicted layers in hand, various retiming and editing effects can be produced via simple operations on the layers.

The neural renderer represents the person and all space-time visual effects correlated with them, including the movement of the individual’s clothing and even challenging semi-transparent effects such as shadows and reflections. This is an interesting area of study, and the authors have used clever bits in the paper.

I hope this report gives you a sense of what the authors are trying to achieve. Leave your thoughts in the comments down below.