First Order Motion Model for Image Animation
This article gives a summary of the NeurIPS 2019 paper by Aliaksandr Siarohin, Stéphane Lathuilière, Sergey Tulyakov, Elisa Ricci and Nicu Sebe.
In this article, we take a look at the task of animating an object depicted in a source image (S) based on the motion of a similar object in a driving video (D).

Training Pipeline
The model is trained on a large collection of video sequences containing objects of the same category. It learns to reconstruct the training videos by combining a single frame with a learned latent representation of the motion in the video, encoding that motion as a combination of motion-specific keypoint displacements and local affine transformations.
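Below is a minimal PyTorch sketch of this self-supervised setup, under stated assumptions: the helper names are hypothetical, the `generator`/`motion` modules in the final comment are placeholders, and a plain L1 pixel loss stands in for the multi-scale perceptual loss used in the paper.

```python
import torch
import torch.nn.functional as F

def sample_training_pair(video):
    """Sample a (source, driving) frame pair from one training clip.

    video: tensor of shape (T, 3, H, W). Both frames come from the same
    video, so the driving frame doubles as the reconstruction target.
    """
    t = video.shape[0]
    idx = torch.randint(0, t, (2,))
    return video[idx[0]], video[idx[1]]

def reconstruction_loss(generated, driving):
    # The paper trains with a multi-scale perceptual (VGG-19) reconstruction
    # loss; an L1 pixel loss is shown here as a simplified stand-in.
    return F.l1_loss(generated, driving)

# Example: one training video of 32 frames at 64x64 resolution.
video = torch.rand(32, 3, 64, 64)
source, driving = sample_training_pair(video)
# generated = generator(source, motion(source, driving))  # hypothetical modules
# loss = reconstruction_loss(generated, driving)
```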
The Approach
- The framework is composed of two main modules: the motion estimation module and the image generation module.
- The purpose of the motion estimation module is to predict a dense motion field from a frame D ∈ R^{3×H×W} of the driving video to the source frame S ∈ R^{3×H×W}.
- The dense motion field is later used to align the feature maps computed from S with the object pose in D.
- It is assumed that there exists an abstract reference frame R. Two transformations are estimated independently: from R to S (T_{S←R}) and from R to D (T_{D←R}). The reference frame is an abstract concept that cancels out in the derivation (see the expansion below), and this choice allows D and S to be processed independently.
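Composing the two estimated transformations, T_{S←D} = T_{S←R} ∘ T_{D←R}^{-1}, and expanding to first order around each keypoint p_k gives the paper's local approximation (notation lightly adapted here):

```latex
\mathcal{T}_{S \leftarrow D}(z) \approx \mathcal{T}_{S \leftarrow R}(p_k)
  + J_k \left( z - \mathcal{T}_{D \leftarrow R}(p_k) \right),
\qquad
J_k = \left( \frac{d}{dp}\,\mathcal{T}_{S \leftarrow R}(p)\Big|_{p = p_k} \right)
      \left( \frac{d}{dp}\,\mathcal{T}_{D \leftarrow R}(p)\Big|_{p = p_k} \right)^{-1}
```

In practice, the keypoint detector predicts, for S and D separately, the keypoint locations T_{S←R}(p_k) and T_{D←R}(p_k) together with the 2×2 Jacobians entering J_k, so the reference frame R never has to be computed explicitly.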

The 2-step motion estimation process
- In the first step, a keypoint detector predicts, for both S and D, a set of keypoints together with local affine transformations that approximate the motion in the neighborhood of each keypoint.
- In the second step, a dense motion network combines these local approximations to obtain the resulting dense motion field T̂_{S←D}. In addition, this network outputs an occlusion mask that indicates which parts of D can be reconstructed by warping the source image and which parts should be inpainted, i.e., inferred from the context.
- Finally, the generation module renders an image of the source object moving as in the driving video: a generator network G warps the source image according to T̂_{S←D} and inpaints the parts that are occluded in the source image (a rough sketch of these operations follows below).
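The following is a minimal PyTorch sketch of those two operations under assumed tensor layouts; the function names are illustrative, not the authors' API. The soft masks predicted by the dense motion network blend the per-keypoint flows into a single field, and the occlusion mask then gates the warped source features so the decoder knows where to inpaint.

```python
import torch
import torch.nn.functional as F

def combine_local_flows(local_flows, masks):
    """Blend the K+1 local flow fields (one per keypoint plus a background
    identity flow) into one dense motion field.

    local_flows: (B, K+1, H, W, 2) sampling grids with values in [-1, 1].
    masks:       (B, K+1, H, W) softmax weights that sum to 1 at every pixel.
    """
    return (masks.unsqueeze(-1) * local_flows).sum(dim=1)  # (B, H, W, 2)

def occlusion_aware_warp(source_features, dense_flow, occlusion_mask):
    """Warp source feature maps with the dense flow and gate them with the
    occlusion mask; low-valued regions cannot be explained by warping the
    source, so the generator's decoder must inpaint them from context.

    source_features: (B, C, H, W)
    dense_flow:      (B, H, W, 2) sampling grid in [-1, 1]
    occlusion_mask:  (B, 1, H, W) with values in [0, 1]
    """
    warped = F.grid_sample(source_features, dense_flow, align_corners=True)
    return warped * occlusion_mask

# Example shapes: batch of 2, 10 keypoints, 64x64 feature maps with 32 channels.
B, K, H, W, C = 2, 10, 64, 64, 32
flows = torch.rand(B, K + 1, H, W, 2) * 2 - 1
masks = torch.softmax(torch.rand(B, K + 1, H, W), dim=1)
dense_flow = combine_local_flows(flows, masks)
out = occlusion_aware_warp(torch.rand(B, C, H, W), dense_flow, torch.rand(B, 1, H, W))
```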
The Results


Website | Paper
The Reading Group
We host reading groups with the authors of interesting deep learning papers, like this one, in our Slack community of more than 3,000 ML engineers. Join the conversation in the #ml-papers channel.
Join us →