Using an adversarial neural network to automate animation. Made by Ayush Thakur using Weights & Biases

There’s a process in traditional animation called in-betweening. Essentially, it works like this: a principal animator draws the so-called key frames of a sequence–say, a frame of a man planting his feet, then one of him bending his knees, then one of him jumping, etc.–and then different artists fill in the frames between those key moments.

These in-between frames are crucial for creating fluid, cohesive character movement, and they're generally far more numerous than the key frames themselves. Drawing them is also tedious and time-consuming.

Here’s a quick example of a man wagging his finger, with the key frames being 1 and 6 and the in-between frames being the remainder:

The paper we’re going to cover today proposes a method to automate these frames.

"Robust Motion In-betweening" (2020) by Félix G. Harvey, Mike Yurick, Derek Nowrouzezahrai, and Christopher Pal uses high-quality Motion Capture (MOCAP) data and adversarial recurrent neural networks to generate quality transitions between key frames. In other words, given a frame of a man crouching and another of him jumping, can we animate the in-between frames automatically?

We’ll briefly cover some of the most interesting ideas from this paper but absolutely encourage you to check out the full work if this post piques your interest. We’re linking both below and starting with a great overview video they’ve provided as well:

First, let’s talk about the dataset. Here, the authors used a subset of the Human3.6M dataset as well as LaFAN1, a novel, high-quality motion dataset ideal for motion prediction.

They then used a humanoid skeleton model with j = 28 joints for the Human3.6M dataset and j = 22 for LaFAN1. The authors represent humanoid motion with:

- a local quaternion vector q_t of j * 4 dimensions (a quaternion is a four-element vector that can encode any rotation in a 3D coordinate system);
- a 3-dimensional global root velocity vector \dot{r}_t at each time step;
- a 4-dimensional binary contact vector c_t, extracted from the data based on toe and foot velocities (used with LaFAN1 only).
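To make the representation concrete, here is a minimal sketch of assembling one frame's state vector for the LaFAN1 skeleton. The function name and layout are illustrative assumptions; only the components and their sizes come from the paper.

```python
import numpy as np

J = 22  # joints in the LaFAN1 skeleton (28 for Human3.6M)

def make_state(quats, root_vel, contacts):
    """Concatenate one frame's pose into a single state vector (hypothetical helper).

    quats:    (J, 4) unit quaternions, one local rotation per joint (q_t)
    root_vel: (3,)   global root velocity \\dot{r}_t
    contacts: (4,)   binary toe/foot contact flags c_t (LaFAN1 only)
    """
    # Quaternions must be unit-norm to represent pure rotations.
    quats = quats / np.linalg.norm(quats, axis=-1, keepdims=True)
    return np.concatenate([quats.reshape(-1), root_vel, contacts])

# One dummy frame: identity rotations, small forward velocity, both toes down.
q = np.tile([1.0, 0.0, 0.0, 0.0], (J, 1))
state = make_state(q, np.array([0.0, 0.0, 0.1]), np.array([1.0, 1.0, 0.0, 0.0]))
print(state.shape)  # (95,) = 22*4 + 3 + 4
```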

The system takes up to 10 seed frames as past context and a target keyframe as its inputs. It produces the desired number of transition frames to link the past context and the target. In other words, the past context might be several frames of a man crawling while our target keyframe is one of him standing–the model's goal here would be to animate the in-between motions of that man rising to his feet.

When generating multiple transitions from several keyframes, the model is simply applied in series, using its last generated frames as a past context for the next sequence. To continue the example above, if we wanted the now-standing man to start running, the standing frame would be past context for the target of a running frame.
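The chaining described above can be sketched as a simple loop. `generate_transition` below is a stand-in for the trained network (here it just interpolates linearly so the example runs); the context length of 10 frames matches the paper, while everything else is illustrative.

```python
import numpy as np

def generate_transition(context, target, n_frames):
    """Placeholder for the model: maps (K, D) past frames + (D,) target
    to (n_frames, D) in-between frames. Linear interpolation stands in
    for the real recurrent generator."""
    last = context[-1]
    steps = np.linspace(0.0, 1.0, n_frames + 2)[1:-1]  # exclude both endpoints
    return np.stack([(1 - s) * last + s * target for s in steps])

def inbetween_sequence(seed, keyframes, n_frames=15, context_len=10):
    """Apply the model in series: each keyframe's generated frames become
    the past context for the next transition."""
    frames = list(seed)
    for key in keyframes:
        context = np.stack(frames[-context_len:])  # last frames as new context
        frames.extend(generate_transition(context, key, n_frames))
        frames.append(key)                         # the keyframe itself is kept
    return np.stack(frames)

seed = np.zeros((10, 95))                 # 10 seed frames of past context
keys = [np.ones(95), 2 * np.ones(95)]     # e.g. crouching -> standing -> running
clip = inbetween_sequence(seed, keys)
print(clip.shape)  # (42, 95): 10 seed + 2 * (15 in-betweens + 1 keyframe)
```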

The architecture is based on Recurrent Transition Networks (RTN). As seen in Figure 3, the generator has three different encoders:

- State encoder - The state encoder's input is the current character pose, expressed as a concatenation of the root velocity (\dot{r}_t), joint-local quaternions (q_t), and feet-contact binary values (c_t).
- Offset encoder - The offset encoder's input is the current offset from the target keyframe to the current pose, expressed as a concatenation of linear differences between root positions and orientations and between joint-local quaternions. The offset vectors o_t^r and o_t^q contain, respectively, the global root position's offset and the local quaternions' offsets from the target keyframe at time t. The quaternion offset is expressed using simpler element-wise linear differences, which simplifies learning.
- Target encoder - The target encoder takes as input the target pose, expressed as a concatenation of root orientation and joint-local quaternions.

The encoders are all fully-connected feed-forward networks (FFNs) with a hidden layer of 512 units and an output layer of 256 units. All layers use the Piecewise Linear Unit (PLU) activation, which the authors found performs slightly better than Rectified Linear Units (ReLU). The resulting embeddings are h_t^{state}, h_t^{offset}, and h^{target}.
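One such encoder can be sketched in a few lines. The PLU formula below is the common formulation (identity on a central band, small slope outside); whether it matches the paper's exact variant and hyper-parameters is an assumption, as are the weight initializations.

```python
import numpy as np

def plu(x, alpha=0.1, c=1.0):
    """Piecewise Linear Unit: identity on [-c, c], slope `alpha` outside.
    Exact variant/hyper-parameters used by the authors are an assumption."""
    return np.maximum(alpha * (x + c) - c, np.minimum(alpha * (x - c) + c, x))

def encoder(x, w1, b1, w2, b2):
    """Two-layer FFN encoder: input -> 512 hidden -> 256 embedding,
    PLU on every layer as stated in the post."""
    return plu(plu(x @ w1 + b1) @ w2 + b2)

rng = np.random.default_rng(0)
d_in = 95  # state-encoder input size for LaFAN1 (22*4 quats + 3 vel + 4 contacts)
w1, b1 = rng.normal(scale=0.01, size=(d_in, 512)), np.zeros(512)
w2, b2 = rng.normal(scale=0.01, size=(512, 256)), np.zeros(256)

h_state = encoder(rng.normal(size=d_in), w1, b1, w2, b2)
print(h_state.shape)  # (256,)
```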

In the original RTN architecture, the resulting embeddings for each of those inputs get passed directly to a Long Short-Term Memory (LSTM) recurrent layer responsible for modeling the motion's temporal dynamics. However, to make the architecture robust to variable in-between lengths and to enforce diversity in the generated transitions, the authors propose two modifications. We'll get into those right after this diagram.

Simply adding conditioning information about the target keyframe is insufficient since the recurrent layer must be aware of the number of frames left until the target must be reached. This is essential to produce a smooth transition without teleportation or stalling.

The first, the time-to-arrival latent modifier, is borrowed from Transformer networks. The authors use the same mathematical formulation as positional encodings, but indexed by time-to-arrival instead of token position. The time-to-arrival represents the number of frames left to generate before reaching the target keyframe. Functionally, this pushes the input embeddings into different regions of the manifold depending on the time-to-arrival.

Time-to-arrival embeddings provide continuous codes that will shift input representations in the latent space smoothly and uniquely for each transition step due to the phase and frequency shifts of the sinusoidal waves on each dimension.
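A sketch of these embeddings, using the standard sinusoidal positional-encoding formula with time-to-arrival in place of the token position. The dimensionality and period constant here are illustrative assumptions.

```python
import numpy as np

def tta_embedding(tta, d_model=256, max_period=10000.0):
    """Sinusoidal embedding of time-to-arrival: the Transformer positional
    encoding evaluated at `tta` (frames left until the target keyframe)."""
    i = np.arange(d_model // 2)
    freqs = 1.0 / max_period ** (2 * i / d_model)  # one frequency per dim pair
    emb = np.empty(d_model)
    emb[0::2] = np.sin(tta * freqs)
    emb[1::2] = np.cos(tta * freqs)
    return emb

# The embedding is added to each encoder output at every generation step,
# shifting the latent codes smoothly as the deadline approaches:
h_state = np.zeros(256)               # stand-in for a state embedding
for tta in (30, 29, 28):              # 30 frames to go, then 29, ...
    shifted = h_state + tta_embedding(tta)
```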

To improve robustness to keyframe modifications and enable sampling capabilities for the network, the authors employ a second kind of additive latent modifier, called scheduled target noise, applied to the target and offset embeddings only. The noise vector z_{target} is scaled by a scalar λ_{target} that linearly decreases during the transition and reaches zero five frames before the target.

This has the effect of distorting the perceived target and offset early in the transition while the embeddings gradually become noise-free as the motion is carried out. Since the target and offset embeddings are shifted, this noise directly impacts the generated transition. It allows animators to easily control the level of stochasticity of the model by specifying the value of σ_{target} before launching the generation.
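The schedule might look like the sketch below. The paper only states that λ_{target} decreases linearly and hits zero five frames before the target; the ramp length and embedding size here are assumptions.

```python
import numpy as np

def target_noise_scale(tta, zero_before=5, ramp=25):
    """Linear schedule for the scheduled-target-noise gain lambda_target:
    full noise while the target is far away, a linear ramp down as it
    approaches, exactly zero for the last `zero_before` frames.
    The ramp length is an assumption."""
    return float(np.clip((tta - zero_before) / ramp, 0.0, 1.0))

sigma_target = 0.5                    # animator-chosen stochasticity level
rng = np.random.default_rng(0)
z_target = rng.normal(scale=sigma_target, size=256)  # one sample per transition

for tta in (40, 20, 5, 3):
    noisy_shift = target_noise_scale(tta) * z_target  # added to target/offset embeddings
    print(tta, target_noise_scale(tta))
# 40 -> 1.0, 20 -> 0.6, 5 -> 0.0, 3 -> 0.0
```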

The authors have used multiple loss functions as complementary soft constraints:

- Reconstruction Losses: The reconstruction losses for a predicted sequence \tilde{X} given its ground truth X are computed with the L1 norm. The authors introduce multiple reconstruction losses: an angular quaternion loss computed on the root and joint-local quaternions, a position loss computed on the global position of each joint (retrieved through Forward Kinematics (FK)), and a foot contact loss based on the contact predictions.
- Adversarial Loss: The adversarial loss is obtained by training two additional feed-forward discriminator networks (or critics) to differentiate real motion segments from generated ones. The critics C_1 and C_2 are simple feed-forward architectures with three fully connected layers. C_1 is a long-term critic that looks at sliding windows of 10 consecutive frames of motion, while C_2 is a short-term critic that looks at windows of instantaneous motion over two frames.
- The critics are trained with the Least-Square GAN formulation, and their scores for all segments are averaged to get the final loss.
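The two loss families can be written down in a few lines. This is a generic sketch: the real terms operate on quaternions, FK positions, and contacts rather than one flat array, and the weighting between terms is not shown.

```python
import numpy as np

def l1_reconstruction(pred, true):
    """L1 reconstruction loss, the form used for the quaternion, FK-position
    and foot-contact terms (here on a generic (T, D) sequence)."""
    return np.abs(pred - true).mean()

def lsgan_critic_loss(real_scores, fake_scores):
    """Least-Squares GAN objective for a critic: push real scores to 1
    and fake (generated) scores to 0."""
    return ((real_scores - 1.0) ** 2).mean() + (fake_scores ** 2).mean()

def lsgan_generator_loss(fake_scores):
    """Generator's adversarial term: push the critic's fake scores toward 1."""
    return ((fake_scores - 1.0) ** 2).mean()

# A perfect critic (real -> 1, fake -> 0) incurs zero loss:
print(lsgan_critic_loss(np.ones(8), np.zeros(8)))  # 0.0
```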

A curriculum learning strategy over transition lengths is used to accelerate training. Each training run starts with P_{min} = \tilde{P}_{max} = 5, where P_{min} and \tilde{P}_{max} are the minimal and current maximal transition lengths. During training, \tilde{P}_{max} is increased until it reaches the true maximum transition length P_{max}.
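A minimal sketch of that curriculum, assuming a linear growth of \tilde{P}_{max} with the training step; only the endpoints (start at 5, grow to P_{max}) come from the paper, and the ramp rate and P_{max} = 30 here are illustrative.

```python
import random

def max_transition_length(step, p_min=5, p_max=30, ramp_steps=1000):
    """Current maximal transition length P~max: starts at p_min and grows
    linearly until it reaches the true maximum p_max. Ramp rate is an
    illustrative assumption."""
    grown = p_min + (p_max - p_min) * min(step / ramp_steps, 1.0)
    return int(grown)

def sample_transition_length(step, p_min=5):
    """Sample a training transition length within the current curriculum."""
    return random.randint(p_min, max_transition_length(step))

print(max_transition_length(0), max_transition_length(500), max_transition_length(2000))
# 5 17 30
```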

The discriminators are implemented as 1D temporal convolutions, with strides of 1, without padding, and with receptive fields of 1 in the last 2 layers, yielding parallel feed-forward networks for each motion window in the sequence.
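That convolutional trick can be sketched as follows: a first layer whose kernel spans the whole window, then receptive-field-1 layers, so every temporal position yields an independent feed-forward score for one window. Channel widths here are illustrative assumptions.

```python
import numpy as np

def conv1d_valid(x, kernels):
    """'Valid' temporal convolution with stride 1 and no padding.
    x: (T, D) motion features; kernels: (k, D, D_out) -> (T - k + 1, D_out)."""
    k = kernels.shape[0]
    T = x.shape[0] - k + 1
    return np.stack([np.einsum('kd,kdo->o', x[t:t + k], kernels) for t in range(T)])

rng = np.random.default_rng(0)
x = rng.normal(size=(40, 95))  # 40 frames of motion features

# Long-term critic: first layer sees 10 consecutive frames; the last two
# layers have receptive field 1, so each output scores one 10-frame window.
h = conv1d_valid(x, rng.normal(scale=0.1, size=(10, 95, 32)))            # (31, 32)
h = np.maximum(conv1d_valid(h, rng.normal(scale=0.1, size=(1, 32, 32))), 0)
scores = conv1d_valid(h, rng.normal(scale=0.1, size=(1, 32, 1)))
print(scores.shape)  # (31, 1): one score per sliding window, averaged for the loss
```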

This is a high-level look at this paper that hopefully makes the research a bit more accessible for our readers. That said, in certain key places, we've quoted directly from the paper to avoid any confusion.

The authors' original research is well worth your time, especially if you're interested in the implementation details, and the video they've created to show some of the results is really informative and contains some pretty wonderful sequences of the model attempting to dance.