
Automating Animations with the Help of Robust In-Betweening

Using an adversarial neural network to automate animation

Introduction

There’s a process in traditional animation called in-betweening. Essentially, it works like this: a principal animator draws the so-called key frames of a sequence (say, a frame of a man planting his feet, then one of him bending his knees, then one of him jumping, and so on), and then other artists fill in the frames between those key moments.
These in-between frames are crucial for creating fluid, cohesive character movements, and they’re generally far more numerous than the key frames themselves. Producing them is also tedious and time-consuming.
Here’s a quick example of a man wagging his finger, with the key frames being 1 and 6 and the in-between frames being the remainder:
Figure 1: Stereotypical image of a keyframe sketch.
The paper we’re going to cover today proposes a method to automate these frames.
"Robust Motion In-betweening" (2020) by Félix G. Harvey, Mike Yurick, Derek Nowrouzezahrai, and Christopher Pal uses high-quality Motion Capture (MOCAP) data and adversarial recurrent neural networks to generate quality transitions between key frames. In other words, given a frame of a man crouching and another of him jumping, can we animate the in-between frames automatically?
We’ll briefly cover some of the most interesting ideas from this paper but absolutely encourage you to check out the full work if this post piques your interest. We’re linking both below and starting with a great overview video they’ve provided as well:

Paper | Blog Post




Overview of the Proposed Method

Dataset and Representation

First, let’s talk about the dataset. Here, the authors used a subset of the Human3.6M dataset as well as LaFAN1, a novel, high-quality motion dataset ideal for motion prediction.
Then, they used a humanoid skeleton model with $j=28$ joints for the Human3.6M dataset and $j=22$ joints for the LaFAN1 dataset. The authors represent humanoid motion by (a minimal code sketch follows the list):
  • a local quaternion vector $q_t$ of $j \times 4$ dimensions (a quaternion is a four-element vector that can encode any rotation in a 3D coordinate system),
  • a 3-dimensional global root velocity vector $\dot{r}_t$ at each time step,
  • contact information extracted from the data, based on toe and foot velocities, as a 4-dimensional binary vector $c_t$ (used with LaFAN1).
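To make the representation concrete, here's a minimal sketch in PyTorch of how a single pose could be assembled from these components for the LaFAN1 skeleton ($j=22$). The tensors are random placeholders rather than real MOCAP data:

```python
import torch

j = 22                                       # joints in the LaFAN1 skeleton
q_t = torch.randn(j, 4)                      # local quaternions, one 4D rotation per joint
q_t = q_t / q_t.norm(dim=-1, keepdim=True)   # quaternions are unit-norm
r_dot_t = torch.randn(3)                     # global root velocity at this time step
c_t = torch.randint(0, 2, (4,)).float()      # binary toe/foot contact flags

# One flattened state vector for a single frame: j*4 + 3 + 4 dimensions
x_t = torch.cat([q_t.flatten(), r_dot_t, c_t])
print(x_t.shape)                             # torch.Size([95]) for j = 22
```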

Top Level View of the Architecture

The system takes up to 10 seed frames as past context and a target keyframe as its inputs. It produces the desired number of transition frames to link the past context and the target. In other words, the past context might be several frames of a man crawling while our target keyframe is one of him standing; the model's goal would be to animate the in-between motion of the man rising to his feet.
When generating multiple transitions from several keyframes, the model is simply applied in series, using its last generated frames as a past context for the next sequence. To continue the example above, if we wanted the now-standing man to start running, the standing frame would be past context for the target of a running frame.
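Here's a hypothetical sketch of that chaining logic in plain Python. The `generate_transition` callable is a stand-in for the trained model (its name and signature are our assumption), and we simply reuse the last frames of the growing sequence as the next past context:

```python
def generate_sequence(seed_frames, keyframes, frames_per_transition, generate_transition):
    """Chain transitions between successive keyframes.

    `generate_transition(past_context, target, n)` is assumed to return a list
    of `n` in-between frames linking the past context to the target keyframe.
    """
    past_context = list(seed_frames)            # up to 10 seed frames
    full_motion = list(seed_frames)
    for target in keyframes:
        transition = generate_transition(past_context, target, frames_per_transition)
        full_motion.extend(transition)
        full_motion.append(target)
        # The last generated frames (plus the reached keyframe) become the
        # past context for the next transition.
        past_context = full_motion[-10:]
    return full_motion
```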

Figure 2: Visual overview of the proposed architecture. (Source)


Architecture - Transition Generator

The architecture is based on Recurrent Transition Networks (RTN). As seen in Figure 3, the generator has three different encoders:
  • State encoder - The state encoder's input is the current character pose, expressed as a concatenation of the root velocity ($\dot{r}_t$), joint-local quaternions ($q_t$), and feet-contact binary values ($c_t$).
  • Offset encoder - The offset encoder's input is the current offset from the target keyframe to the current pose, expressed as a concatenation of linear differences between root positions and orientations and between joint-local quaternions.
    • The offset vectors $o_t^r$ and $o_t^q$ contain, respectively, the global root position's offset and the local quaternions' offsets from the target keyframe at time $t$. The quaternion offset is expressed using simpler element-wise linear differences, which simplifies learning.
  • Target encoder - The target encoder takes as input the target pose, expressed as a concatenation of root orientation and joint-local quaternions.
The encoders are all fully-connected Feed-Forward Networks (FFN) with a hidden layer of 512 units and an output layer of 256 units. All layers use a Piecewise Linear Activation (PLU), which performs slightly better than Rectified Linear Units (ReLU). The resulting embeddings are $h_t^{state}$, $h_t^{offset}$, and $h^{target}$.
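As a rough illustration, here's what one of these encoders might look like in PyTorch. The PLU formulation and the exact input dimensions below are our assumptions; the paper specifies only the 512/256 layer sizes and the use of a piecewise linear activation:

```python
import torch
import torch.nn as nn

def plu(x, alpha=0.1, c=1.0):
    # A common piecewise linear unit: identity on [-c, c], slope `alpha` outside.
    return torch.maximum(alpha * (x + c) - c,
                         torch.minimum(alpha * (x - c) + c, x))

class Encoder(nn.Module):
    """Fully connected encoder: input -> 512 hidden units -> 256 outputs."""
    def __init__(self, in_dim):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, 512)
        self.fc2 = nn.Linear(512, 256)

    def forward(self, x):
        return plu(self.fc2(plu(self.fc1(x))))

# Three encoders with illustrative input sizes for the LaFAN1 skeleton (j = 22):
state_enc  = Encoder(22 * 4 + 3 + 4)  # quaternions + root velocity + contacts
offset_enc = Encoder(22 * 4 + 3)      # quaternion offsets + root-position offset
target_enc = Encoder(22 * 4)          # target joint-local quaternions
```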
In the original RTN architecture, the resulting embeddings for each of those inputs are passed directly to a Long Short-Term Memory (LSTM) recurrent layer responsible for modeling the motion's temporal dynamics. However, to make the architecture robust to variable transition lengths and to enforce diversity in the generated transitions, the authors propose two modifications. We'll get into those right after this diagram.

Figure 3: The transition generator and its three encoders. (Source)


Time-to-arrival embeddings ($z_{tta}$)

Simply adding conditioning information about the target keyframe is insufficient since the recurrent layer must be aware of the number of frames left until the target must be reached. This is essential to produce a smooth transition without teleportation or stalling.
The time-to-arrival embedding is a latent modifier borrowed from Transformer networks. The authors use the same mathematical formulation as positional encodings, but indexed by time-to-arrival instead of token position. The time-to-arrival is the number of frames left to generate before reaching the target keyframe. Functionally, this pushes the input embeddings into different regions of the manifold depending on the time-to-arrival.
Time-to-arrival embeddings provide continuous codes that will shift input representations in the latent space smoothly and uniquely for each transition step due to the phase and frequency shifts of the sinusoidal waves on each dimension.
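For intuition, here's a small sketch of such an embedding: the standard Transformer sinusoidal formula, but indexed by the number of frames left rather than by token position. The dimensionality and the basis constant are illustrative choices:

```python
import math
import torch

def tta_embedding(tta, dim=256, basis=10000.0):
    # Sinusoidal embedding indexed by time-to-arrival (frames left until the
    # target keyframe), following the Transformer positional-encoding formula.
    half = dim // 2
    freqs = torch.exp(-math.log(basis) * torch.arange(half).float() / half)
    angles = tta * freqs
    return torch.cat([torch.sin(angles), torch.cos(angles)])

# Added to the encoder outputs before the recurrent layer, this shifts the
# hidden representations differently at every step of the transition.
z_tta = tta_embedding(tta=12.0)   # 12 frames left before the target keyframe
```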

Scheduled target noise ($z_{target}$)

To improve robustness to keyframe modifications and enable sampling capabilities for the network, the authors employ a second kind of additive latent modifier called scheduled target noise, applied to the target and offset embeddings only. The noise vector $z_{target}$ is scaled by a scalar $\lambda_{target}$ that decreases linearly during the transition and reaches zero five frames before the target.
This has the effect of distorting the perceived target and offset early in the transition, while the embeddings gradually become noise-free as the motion is carried out. Since the target and offset embeddings are shifted, this noise directly impacts the generated transition. It allows animators to easily control the level of stochasticity of the model by specifying the value of $\sigma_{target}$ before launching the generation.
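Here's a minimal sketch of how such scheduled noise could be produced. The linear decay reaching zero five frames before the target follows the description above, while the 30-frame normalizer and the Gaussian noise source are our assumptions:

```python
import torch

def scheduled_target_noise(dim, frames_left, sigma_target, fade_frames=5, ramp=30.0):
    # Noise scale decays linearly and hits zero `fade_frames` frames before the
    # target, so early predictions see a distorted target while the final
    # frames see the true one.
    lam = min(max(frames_left - fade_frames, 0) / ramp, 1.0)
    return lam * sigma_target * torch.randn(dim)

# Added to the target and offset embeddings only:
z_target = scheduled_target_noise(dim=256, frames_left=20, sigma_target=0.5)
```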

Losses

The authors use multiple loss functions as complementary soft constraints:
  • Reconstruction Losses: The reconstruction losses for a predicted sequence $\tilde{X}$ given its ground truth $X$ are computed with the L1 norm. The authors introduce multiple reconstruction losses: an angular Quaternion Loss computed on the root and joint-local quaternions, a Position Loss computed on the global position of each joint retrieved through Forward Kinematics (FK), and a Foot Contact Loss based on contact predictions.
  • Adversarial Loss: The Adversarial Loss is obtained by training two additional feed-forward discriminator networks (or critics) to differentiate real motion segments from generated ones. The critics $C_1$ and $C_2$ are simple feed-forward architectures with three fully connected layers. $C_1$ is a long-term critic that looks at sliding windows of 10 consecutive frames of motion, while $C_2$ is a short-term critic that looks at windows of instant motion over two frames.
  • The critics are trained with the Least-Squares GAN formulation, and their scores for all segments are averaged to get the final loss (a minimal sketch of this objective follows below).
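Here's a minimal sketch of the Least-Squares GAN objective mentioned above, written for per-window critic scores; the score tensors are placeholders:

```python
import torch

def lsgan_losses(real_scores, fake_scores):
    # Least-squares GAN objectives: the critic pushes real windows toward 1 and
    # generated windows toward 0; the generator pushes its windows toward 1.
    critic_loss = 0.5 * ((real_scores - 1.0) ** 2).mean() + 0.5 * (fake_scores ** 2).mean()
    gen_loss = 0.5 * ((fake_scores - 1.0) ** 2).mean()
    return critic_loss, gen_loss

# Example with per-window scores from one of the critics:
real = torch.rand(16)   # scores of real motion windows
fake = torch.rand(16)   # scores of generated motion windows
d_loss, g_loss = lsgan_losses(real, fake)
```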

Training

A curriculum learning strategy over transition lengths is used to accelerate training. Training starts with $P_{min} = \tilde{P}_{max} = 5$, where $P_{min}$ and $\tilde{P}_{max}$ are the minimal and current maximal transition lengths. As training progresses, $\tilde{P}_{max}$ is increased until it reaches the true maximum transition length $P_{max}$.
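A toy version of such a curriculum might look like the following; the ramp schedule and iteration counts are illustrative rather than the paper's exact settings:

```python
import random

def sample_transition_length(iteration, p_min=5, p_max=30, ramp_iters=10_000):
    # The current maximum transition length grows linearly from p_min to the
    # true maximum p_max as training progresses, then stays there.
    current_max = min(p_max, p_min + (p_max - p_min) * iteration // ramp_iters)
    return random.randint(p_min, current_max)

lengths = [sample_transition_length(i) for i in (0, 5_000, 20_000)]
```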
The discriminators are implemented as 1D temporal convolutions, with strides of 1, without padding, and with receptive fields of 1 in the last 2 layers, yielding parallel feed-forward networks for each motion window in the sequence.
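As a sketch, a critic of this shape could be built in PyTorch as below. The kernel sizes follow the description above (a window-sized first layer, then two layers with receptive field 1), while the channel widths and ReLU activations are our choices:

```python
import torch
import torch.nn as nn

def make_critic(in_channels, window):
    # Three-layer 1D convolutional critic: the first layer spans `window`
    # consecutive frames (10 for the long-term critic, 2 for the short-term
    # one); the last two layers use kernel size 1, so every motion window in
    # the sequence is scored independently in a single forward pass.
    return nn.Sequential(
        nn.Conv1d(in_channels, 512, kernel_size=window, stride=1),
        nn.ReLU(),
        nn.Conv1d(512, 256, kernel_size=1),
        nn.ReLU(),
        nn.Conv1d(256, 1, kernel_size=1),
    )

long_term_critic = make_critic(in_channels=95, window=10)  # 95: per-frame features from our earlier sketch
scores = long_term_critic(torch.randn(8, 95, 40))          # (batch, channels, frames)
print(scores.shape)                                        # torch.Size([8, 1, 31]): one score per 10-frame window
```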

Conclusion

This is a high-level look at this paper that hopefully makes the research a bit more accessible for our readers. That said, in certain key places, we've quoted directly from the paper to avoid any confusion.
The authors' original research is well worth your time, especially if you're interested in the implementation details, and the video they've created to show some of the results is really informative and contains some pretty wonderful sequences of the model attempting to dance.


