Automating Animations with the Help of Robust In-Betweening

Using an adversarial neural network to automate animation. Made by Ayush Thakur using Weights & Biases

Introduction

There’s a process in traditional animation called in-betweening. Essentially, it works like this: a principal animator draws the so-called key frames of a sequence (say, a frame of a man planting his feet, then one of him bending his knees, then one of him jumping, and so on), and then different artists fill in the frames between those key moments.
These in-between frames are crucial for creating fluid, cohesive character movements, and they’re generally far more numerous than the key frames themselves. Drawing them can also be quite a tedious, time-consuming process.
Here’s a quick example of a man wagging his finger, with the key frames being 1 and 6 and the in-between frames being the remainder:
The paper we’re going to cover today proposes a method to automate these frames.
"Robust Motion In-betweening" (2020) by Félix G. Harvey, Mike Yurick, Derek Nowrouzezahrai, and Christopher Pal uses high-quality Motion Capture (MOCAP) data and adversarial recurrent neural networks to generate quality transitions between key frames. In other words, given a frame of a man crouching and another of him jumping, can we animate the in-between frames automatically?
We’ll briefly cover some of the most interesting ideas from this paper, but we absolutely encourage you to check out the full work if this post piques your interest. Both the paper and the authors’ blog post are linked below, along with a great overview video they’ve provided:

Paper | Blog Post

Overview of the Proposed Method

Dataset and Representation

First, let’s talk about the dataset. Here, the authors used a subset of the Human3.6M dataset as well as LaFAN1, a novel, high-quality motion dataset ideal for motion prediction.
They then used a humanoid skeleton model with j=28 joints for the Human3.6M dataset and j=22 joints for the LaFAN1 dataset. The authors represent humanoid motion with each joint’s local rotation stored as a quaternion, along with the global position of the root joint.
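To make that representation concrete, here is a minimal sketch (in Python/NumPy) of how a single pose and a motion clip could be stored under that scheme. The field names and layout are ours for illustration, not the paper’s exact feature ordering.

```python
import numpy as np

J = 22  # joints in the LaFAN1 skeleton (28 for Human3.6M)

# One pose: a unit quaternion per joint (local rotation w.r.t. its parent)
# plus the global 3D position of the root joint. Illustrative layout only.
pose = {
    "joint_rotations": np.zeros((J, 4)),  # (w, x, y, z) unit quaternions
    "root_position": np.zeros(3),         # global (x, y, z) of the root/hips
}
pose["joint_rotations"][:, 0] = 1.0       # initialize to identity rotations

# A motion clip is simply a sequence of such poses, one per frame
# (the same pose object is repeated here purely for illustration).
clip = [pose for _ in range(30)]
```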

Top Level View of the Architecture

The system takes up to 10 seed frames as past context and a target keyframe as its inputs. It produces the desired number of transition frames to link the past context and the target. In other words, the past context might be several frames of a man crawling while our target keyframe is one of him standing; the model's goal here would be to animate the in-between motions of that man rising to his feet.
When generating multiple transitions from several keyframes, the model is simply applied in series, using its last generated frames as a past context for the next sequence. To continue the example above, if we wanted the now-standing man to start running, the standing frame would be past context for the target of a running frame.
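As a rough sketch of that chaining step, the loop below applies a transition generator keyframe by keyframe, reusing the tail of the generated sequence as the next past context. `generate_transition` is a hypothetical stand-in for the trained model, not an API from the paper's code.

```python
def stitch_keyframes(generate_transition, seed_frames, keyframes, n_inbetween):
    """Chain a transition generator across several keyframes.

    generate_transition(past, target, n) stands in for the trained model:
    given up to 10 past-context frames, a target keyframe, and a transition
    length n, it returns the n generated in-between frames.
    """
    sequence = list(seed_frames)
    for target in keyframes:
        past = sequence[-10:]                       # last frames as past context
        transition = generate_transition(past, target, n_inbetween)
        sequence.extend(transition)
        sequence.append(target)                     # land on the keyframe itself
    return sequence
```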

Architecture - Transition Generator

The architecture is based on Recurrent Transition Networks (RTN). As seen in Figure 3, the generator has three different encoders: a state encoder, an offset encoder, and a target encoder.
The encoders are all fully-connected Feed-Forward Networks (FFN) with a hidden layer of 512 units and an output layer of 256 units. All layers use a Piecewise Linear Activation (PLU), which performs slightly better than Rectified Linear Units (ReLU). The resulting embeddings are h_t^{state}, h_t^{offset}, and h^{target}.
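Here is a minimal PyTorch sketch of those encoders. The hidden and output sizes follow the description above; the input dimensions are placeholders, and plain ReLU stands in for the piecewise linear activation the authors use.

```python
import torch.nn as nn

def make_encoder(in_dim: int) -> nn.Sequential:
    # Fully connected encoder: 512-unit hidden layer, 256-unit output layer.
    # The paper uses a piecewise linear activation (PLU); ReLU is a stand-in.
    return nn.Sequential(
        nn.Linear(in_dim, 512),
        nn.ReLU(),
        nn.Linear(512, 256),
        nn.ReLU(),
    )

# One encoder per input stream (input dimensions are placeholders):
state_encoder = make_encoder(in_dim=132)   # current character state -> h_t^{state}
offset_encoder = make_encoder(in_dim=91)   # offset to the target    -> h_t^{offset}
target_encoder = make_encoder(in_dim=88)   # target keyframe         -> h^{target}
```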
In the original RTN architecture, the resulting embeddings for each of those inputs get passed directly to a Long Short-Term Memory (LSTM) recurrent layer responsible for modeling the motion's temporal dynamics. However, to make the architecture robust to variable transition lengths and to enforce diversity in the generated transitions, the authors propose two modifications. We'll get into those right after this diagram.

Time-to-arrival embeddings (z_{tta})

Simply adding conditioning information about the target keyframe is insufficient since the recurrent layer must be aware of the number of frames left until the target must be reached. This is essential to produce a smooth transition without teleportation or stalling.
The time-to-arrival embedding latent modifier is borrowed from Transformer networks. The authors use the same mathematical formulation as positional encodings, but indexed by time-to-arrival instead of token position. This time-to-arrival represents the number of frames left to generate before reaching the target keyframe. Functionally, this pushes the input embeddings into different regions of the manifold depending on the time-to-arrival.
Time-to-arrival embeddings provide continuous codes that will shift input representations in the latent space smoothly and uniquely for each transition step due to the phase and frequency shifts of the sinusoidal waves on each dimension.
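A minimal sketch of such an embedding is below, using the standard sinusoidal positional-encoding formulation but indexed by the number of frames remaining. The base constant of 10000 is the usual Transformer choice and is assumed here; the paper may use different constants.

```python
import numpy as np

def time_to_arrival_embedding(tta: int, dim: int = 256) -> np.ndarray:
    """Sinusoidal embedding indexed by time-to-arrival (frames left until
    the target keyframe), in the style of Transformer positional encodings."""
    z = np.zeros(dim)
    for i in range(0, dim, 2):
        freq = 1.0 / (10000 ** (i / dim))
        z[i] = np.sin(tta * freq)
        if i + 1 < dim:
            z[i + 1] = np.cos(tta * freq)
    return z

# The embedding is added to the input embeddings before the recurrent layer,
# shifting them in latent space depending on how many frames remain:
# h_state = h_state + time_to_arrival_embedding(tta, dim=256)
```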

Scheduled target noise (z_{target})

To improve robustness to keyframe modifications and enable sampling capabilities for the network, the authors employ a second kind of additive latent modifier called scheduled target noise. It is applied to the target and offset embeddings only. The noise z_{target} is scaled by a scalar λ_{target} that linearly decreases during the transition and reaches zero five frames before the target.
This has the effect of distorting the perceived target and offset early in the transition, while the embeddings gradually become noise-free as the motion is carried out. Since the target and offset embeddings are shifted, this noise directly impacts the generated transition. It allows animators to easily control the level of stochasticity of the model by specifying the value of σ_{target} (the spread of the sampled noise) before launching the generation.
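A sketch of how this scheduled noise might be computed is below. The sampling distribution and the exact linear schedule (reaching zero when five frames remain) are assumptions made for illustration; see the paper for the authors' formulation.

```python
import numpy as np

def scheduled_target_noise(tta, transition_length, embed_dim=256, sigma_target=0.5):
    """Additive noise for the target and offset embeddings.

    sigma_target controls how much stochasticity the animator wants.
    lambda_target decreases linearly over the transition and is exactly
    zero once fewer than 5 frames remain, so the last frames are generated
    from a noise-free view of the target.
    """
    z_target = np.random.normal(0.0, sigma_target, size=embed_dim)
    lambda_target = np.clip((tta - 5) / max(transition_length - 5, 1), 0.0, 1.0)
    return lambda_target * z_target
```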

Losses

The authors combine multiple loss functions as complementary soft constraints, including reconstruction losses on the generated poses and adversarial losses provided by the discriminators described below.

Training

A curriculum learning strategy over transition lengths is used to accelerate training. Training starts with P_{min} = \tilde{P}_{max} = 5, where P_{min} and \tilde{P}_{max} are the minimal and current maximal transition lengths. During training, \tilde{P}_{max} is increased until it reaches the true maximum transition length P_{max}.
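A minimal sketch of such a curriculum is below. The linear ramp and the value of P_{max} are placeholders for illustration; the paper specifies its own progression.

```python
def current_max_transition_length(step, total_steps, p_min=5, p_max=30):
    """Curriculum over transition lengths: start by training on transitions
    of at most p_min frames and grow the current maximum until it reaches
    the true maximum p_max. The linear ramp here is an assumed schedule."""
    progress = min(step / total_steps, 1.0)
    return int(p_min + progress * (p_max - p_min))

# At each training iteration, a transition length would then be sampled
# between p_min and the current maximum returned above.
```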
The discriminators are implemented as 1D temporal convolutions, with strides of 1, without padding, and with receptive fields of 1 in the last 2 layers, yielding parallel feed-forward networks for each motion window in the sequence.
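Here's a rough PyTorch sketch of such a temporal convolutional discriminator. Channel widths and the window length are placeholders; only the stride-1, no-padding, kernel-size-1-in-the-last-two-layers structure follows the description above.

```python
import torch
import torch.nn as nn

class TemporalDiscriminator(nn.Module):
    """Scores every window of `window` consecutive frames in parallel."""

    def __init__(self, in_features: int, window: int = 10, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_features, hidden, kernel_size=window, stride=1),  # no padding
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=1, stride=1),  # receptive field 1
            nn.ReLU(),
            nn.Conv1d(hidden, 1, kernel_size=1, stride=1),       # one score per window
        )

    def forward(self, motion: torch.Tensor) -> torch.Tensor:
        # motion: (batch, features, time) -> scores: (batch, 1, time - window + 1)
        return self.net(motion)
```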

Conclusion

This is a high-level look at this paper that hopefully makes the research a bit more accessible for our readers. That said, in certain key places, we've quoted directly from the paper to avoid any confusion.
The authors' original research is well worth your time, especially if you're interested in the implementation details, and the video they've created to show some of the results is both really informative and full of some pretty wonderful sequences of the model attempting to dance.