First Order Motion Model for Image Animation
This article gives a summary of the NeurIPS 2019 paper by Aliaksandr Siarohin, Stéphane Lathuilière, Sergey Tulyakov, Elisa Ricci and Nicu Sebe.
In this article, we take a look at the task of animating an object depicted in a source image (S) based on the motion of a similar object in a driving video (D).

Training Pipeline
The model is trained on a large collection of video sequences containing objects of the same category. It learns to reconstruct the training videos by combining a single frame with a learned latent representation of the motion in the video, encoding that motion as a combination of motion-specific keypoint displacements and local affine transformations.
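Below is a minimal PyTorch sketch of this self-supervised setup, under stated assumptions: the helper names are hypothetical, the `generator`/`motion` modules in the final comment are placeholders, and a plain L1 pixel loss stands in for the multi-scale perceptual loss used in the paper.

```python
import torch
import torch.nn.functional as F

def sample_training_pair(video):
    """Sample a (source, driving) frame pair from one training clip.

    video: tensor of shape (T, 3, H, W). Both frames come from the same
    video, so the driving frame doubles as the reconstruction target.
    """
    t = video.shape[0]
    idx = torch.randint(0, t, (2,))
    return video[idx[0]], video[idx[1]]

def reconstruction_loss(generated, driving):
    # The paper trains with a multi-scale perceptual (VGG-19) reconstruction
    # loss; an L1 pixel loss is shown here as a simplified stand-in.
    return F.l1_loss(generated, driving)

# Example: one training video of 32 frames at 64x64 resolution.
video = torch.rand(32, 3, 64, 64)
source, driving = sample_training_pair(video)
# generated = generator(source, motion(source, driving))  # hypothetical modules
# loss = reconstruction_loss(generated, driving)
```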
The Approach
- The framework is composed of two main modules: the motion estimation module and the image generation module.
- The purpose of the motion estimation module is to predict a dense motion field from a frame D ∈ R^{3×H×W} of the driving video to the source frame S ∈ R^{3×H×W}.
- The dense motion field is later used to align the feature maps computed from S with the object pose in D.
- It is assumed that there exists an abstract reference frame R. Two transformations are estimated independently: from R to S (T_{S←R}) and from R to D (T_{D←R}). The reference frame is an abstract concept that cancels out in the derivation (see the expansion below), and this choice allows D and S to be processed independently.
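Composing the two estimated transformations, T_{S←D} = T_{S←R} ∘ T_{D←R}^{-1}, and expanding to first order around each keypoint p_k gives the paper's local approximation (notation lightly adapted here):

```latex
\mathcal{T}_{S \leftarrow D}(z) \approx \mathcal{T}_{S \leftarrow R}(p_k)
  + J_k \left( z - \mathcal{T}_{D \leftarrow R}(p_k) \right),
\qquad
J_k = \left( \frac{d}{dp}\,\mathcal{T}_{S \leftarrow R}(p)\Big|_{p = p_k} \right)
      \left( \frac{d}{dp}\,\mathcal{T}_{D \leftarrow R}(p)\Big|_{p = p_k} \right)^{-1}
```

In practice, the keypoint detector predicts, for S and D separately, the keypoint locations T_{S←R}(p_k) and T_{D←R}(p_k) together with the 2×2 Jacobians entering J_k, so the reference frame R never has to be computed explicitly.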

The 2-step motion estimation process
- In the first step, a keypoint detector predicts, for both S and D, a set of keypoints together with local affine transformations that approximate the motion in the neighborhood of each keypoint.
- In the second step, a dense motion network combines these local approximations to obtain the resulting dense motion field T̂_{S←D}. In addition, this network outputs an occlusion mask that indicates which parts of D can be reconstructed by warping the source image and which parts should be inpainted, i.e., inferred from the context.
- Finally, the generation module renders an image of the source object moving as in the driving video: a generator network G warps the source image according to T̂_{S←D} and inpaints the parts that are occluded in the source image (a rough sketch of these operations follows below).
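The following is a minimal PyTorch sketch of those two operations under assumed tensor layouts; the function names are illustrative, not the authors' API. The soft masks predicted by the dense motion network blend the per-keypoint flows into a single field, and the occlusion mask then gates the warped source features so the decoder knows where to inpaint.

```python
import torch
import torch.nn.functional as F

def combine_local_flows(local_flows, masks):
    """Blend the K+1 local flow fields (one per keypoint plus a background
    identity flow) into one dense motion field.

    local_flows: (B, K+1, H, W, 2) sampling grids with values in [-1, 1].
    masks:       (B, K+1, H, W) softmax weights that sum to 1 at every pixel.
    """
    return (masks.unsqueeze(-1) * local_flows).sum(dim=1)  # (B, H, W, 2)

def occlusion_aware_warp(source_features, dense_flow, occlusion_mask):
    """Warp source feature maps with the dense flow and gate them with the
    occlusion mask; low-valued regions cannot be explained by warping the
    source, so the generator's decoder must inpaint them from context.

    source_features: (B, C, H, W)
    dense_flow:      (B, H, W, 2) sampling grid in [-1, 1]
    occlusion_mask:  (B, 1, H, W) with values in [0, 1]
    """
    warped = F.grid_sample(source_features, dense_flow, align_corners=True)
    return warped * occlusion_mask

# Example shapes: batch of 2, 10 keypoints, 64x64 feature maps with 32 channels.
B, K, H, W, C = 2, 10, 64, 64, 32
flows = torch.rand(B, K + 1, H, W, 2) * 2 - 1
masks = torch.softmax(torch.rand(B, K + 1, H, W), dim=1)
dense_flow = combine_local_flows(flows, masks)
out = occlusion_aware_warp(torch.rand(B, C, H, W), dense_flow, torch.rand(B, 1, H, W))
```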
The Results


Website | Paper
The Reading Group
We host reading groups with the authors of interesting deep learning papers, like this one, in our Slack community of more than 3,000 ML engineers. Join the conversation in the #ml-papers channel.
Join us →