Talking-Face Generation
Comparison and evaluation of three state-of-the-art methods for talking-face generation
1. Neural Head Reenactment with Latent Pose Descriptors (CVPR 2020)
Burkov et al., June 2020

Architecture of Neural Head Reenactment with Latent Pose Descriptors
To generate a talking face from a video or an image with NeuralHead, the first step is learning to reconstruct pose and identity: 9 random frames are sampled from a video, 8 for identity and 1 for pose, as in the sketch below.
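As a rough illustration of this sampling scheme, here is a minimal PyTorch sketch of one reconstruction step. The module names (identity_encoder, pose_encoder, generator) and the plain L1 reconstruction loss are placeholder assumptions for illustration, not the losses or API of the paper's repository.

```python
import random
import torch

def sample_training_batch(video_frames, k_identity=8):
    """Pick 8 random frames for identity and 1 for pose from a single clip.

    video_frames: list of frame tensors of shape (C, H, W).
    """
    frames = random.sample(video_frames, k_identity + 1)
    identity_frames = torch.stack(frames[:k_identity])  # (8, C, H, W)
    pose_frame = frames[k_identity]                      # (C, H, W)
    return identity_frames, pose_frame

def reconstruction_step(identity_encoder, pose_encoder, generator, video_frames):
    identity_frames, pose_frame = sample_training_batch(video_frames)

    # The identity embedding is averaged over the 8 identity frames;
    # the latent pose descriptor comes from the single held-out frame.
    id_embedding = identity_encoder(identity_frames).mean(dim=0, keepdim=True)
    pose_descriptor = pose_encoder(pose_frame.unsqueeze(0))

    # The generator must reconstruct the pose frame from both codes.
    # Because all frames share the same identity, the pose descriptor
    # is pushed to carry only head pose and expression.
    reconstruction = generator(id_embedding, pose_descriptor)
    loss = torch.nn.functional.l1_loss(reconstruction, pose_frame.unsqueeze(0))
    return loss
```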
We first need to fine-tune the meta-model to the face we want to reenact. The model is pre-trained on VoxCeleb2.
On the left, we can see the generator output and the segmentation improving over time.
The panels on the left are interactive: you can tweak the "step" parameter by clicking the gear icon to step through training.
Once fine-tuning is done, we have a .pth checkpoint of the model adapted to the new identity. We can then use another video to drive (puppeteer) that identity, as sketched below.
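Inference then amounts to reusing a fixed identity embedding and re-encoding only the pose for each driving frame. The sketch below keeps the same placeholder modules as above; it is a rough outline, not the repository's driver script.

```python
import torch

@torch.no_grad()
def puppeteer(identity_encoder, pose_encoder, generator,
              identity_frames, driving_frames):
    """Drive a fine-tuned identity with the pose of another video.

    identity_frames: (N, C, H, W) frames of the fine-tuned identity.
    driving_frames:  (T, C, H, W) frames of the driving (pose) video.
    Returns a (T, C, H, W) tensor of generated frames.
    """
    # The identity embedding is computed once and kept fixed for the
    # whole output video; only the pose changes from frame to frame.
    id_embedding = identity_encoder(identity_frames).mean(dim=0, keepdim=True)

    outputs = []
    for frame in driving_frames:
        pose_descriptor = pose_encoder(frame.unsqueeze(0))
        outputs.append(generator(id_embedding, pose_descriptor))
    return torch.cat(outputs, dim=0)

# Loading the fine-tuned weights would look roughly like this
# ("finetuned_identity.pth" is a hypothetical file name):
#   state = torch.load("finetuned_identity.pth", map_location="cpu")
#   generator.load_state_dict(state["generator"])
```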
Output example

Video puppeteering: the driving video on the left, the generated video on the right
2. Pose-Controllable Talking Face Generation by Implicitly Modularized Audio-Visual Representation (CVPR 2021)
Zhou et al., April 2021

Architecture of PC-AVS Framework
With this architecture, there is no fine-tuning for a specific identity. The inputs are one video or one image for the identity, one video for the pose source, and one audio clip.
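To make that input/output contract concrete, here is a hedged sketch of how such a pipeline can be driven at inference time. All module names (identity_encoder, pose_encoder, audio_encoder, generator) and the per-frame audio-feature format are assumptions for illustration; the actual PC-AVS code organizes this differently.

```python
import torch

@torch.no_grad()
def generate_pc_avs_style(identity_encoder, pose_encoder, audio_encoder,
                          generator, identity_image, pose_frames, audio_features):
    """Sketch of pose-controllable talking-face synthesis.

    identity_image: (1, C, H, W) still image providing the identity.
    pose_frames:    (T, C, H, W) video providing the head pose.
    audio_features: (T, D) per-frame audio features (assumed format).
    Returns a (T, C, H, W) tensor of generated frames.
    """
    # The identity is encoded once; pose and speech content are encoded
    # per frame in separate, implicitly modularized latent spaces.
    id_code = identity_encoder(identity_image)

    frames = []
    for pose_frame, audio_chunk in zip(pose_frames, audio_features):
        pose_code = pose_encoder(pose_frame.unsqueeze(0))
        speech_code = audio_encoder(audio_chunk.unsqueeze(0))
        frames.append(generator(id_code, pose_code, speech_code))
    return torch.cat(frames, dim=0)
```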
Output example

Output of PC-AVS; from left to right: the identity still image, the generated video, the pose source, and the audio source
3. First Order Motion Model for Image Animation (NeurIPS 2019)
Siarohin et al., October 2020

First Order Model pipeline
For training, the method uses a large collection of video sequences containing objects of the same category. The model is trained to reconstruct the training videos by combining a single frame with a learned latent representation of the motion in the video. By observing frame pairs (source and driving) extracted from the same video, it learns to encode motion as a combination of motion-specific keypoint displacements and local affine transformations. At test time, the model is applied to pairs made of the source image and each frame of the driving video, animating the source object, as sketched below.
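The test-time animation loop can be summarized roughly as follows. This is a simplified sketch with placeholder modules (keypoint_detector, dense_motion_network, generator) standing in for the paper's actual components and signatures.

```python
import torch

@torch.no_grad()
def animate(keypoint_detector, dense_motion_network, generator,
            source_image, driving_frames):
    """First-Order-Motion-style animation loop (simplified sketch).

    source_image:   (1, C, H, W) image to animate.
    driving_frames: (T, C, H, W) frames of the driving video.
    Returns a (T, C, H, W) tensor of generated frames.
    """
    kp_source = keypoint_detector(source_image)
    # Motion is expressed relative to the first driving frame, so only
    # the change in pose is transferred, not the driver's geometry.
    kp_driving_initial = keypoint_detector(driving_frames[:1])

    outputs = []
    for frame in driving_frames:
        kp_driving = keypoint_detector(frame.unsqueeze(0))
        # A dense motion field is predicted from keypoint displacements
        # and local affine transformations around each keypoint.
        motion = dense_motion_network(source_image, kp_source,
                                      kp_driving, kp_driving_initial)
        outputs.append(generator(source_image, motion))
    return torch.cat(outputs, dim=0)
```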
4. Comparison on the same inputs
Generated outputs are on the right and poses are on the left.
Extreme pose with identity and pose from VoxCeleb

Original identity

NeuralHead

PC-AVS

First Order Model
Extreme pose with identity not from VoxCeleb

Original identity

NeuralHead

PC-AVS

First Order Model
Driven by itself

Original identity

NeuralHead

PC-AVS

First Order Model
Both videos not from VoxCeleb

Original identity

NeuralHead

PC-AVS

First Order Model
Qualitative evaluation
Because NeuralHead is first fine-tuned on the target identity, it gets noticeably better results on inputs that are not from VoxCeleb, especially with extreme poses.
With inputs from VoxCeleb, however, both achieve good results, although facial movements such as the eyes and eyebrows remain more accurate with NeuralHead.
NeuralHead also seems to depend heavily on the quality of the segmentation.
Other generated examples
NeuralHead


PC-AVS


Video Credits
I used two videos that are not from VoxCeleb for evaluation. Both contain a single speaker facing the camera, with good lighting and moderate head movements.