
One-Shot Free-View Neural Talking-Head Synthesis for Video Conferencing

In this report, we will look at recent work published at CVPR 2021 in the domain of one-shot talking-head synthesis.
For reasons we won't belabor, video conferencing has gained a tremendous user base in the last year. But despite its rise, it's not accessible to many because of the high network bandwidth required to carry both video and speech in real-time. Deep learning techniques (especially GANs) can deliver high-quality video via image compression at lower bit rates.
But before we dig into this research on a really interesting application of deep learning called talking head synthesis, we recommend checking out the brief video below. It'll help anchor the research as we dig a bit deeper into "One-Shot Free-View Neural Talking-Head Synthesis for Video Conferencing" (Wang et al., 2020).

Project Page | Paper | Online Demo




Overview of the Proposed Method

Let's first level-set on the notations that we'll be using and get a little clarity on the goal of this research. From the paper:
Let $s$ be an image of a person, referred to as the source image. Let $\{d_1, d_2, ..., d_N\}$ be a talking-head video, called the driving video, where the $d_i$'s are the individual frames, and $N$ is the total number of frames.
Our goal is to generate an output video $\{y_1, y_2, ..., y_N\}$, where the identity in the $y_i$'s is inherited from $s$ and the motions are derived from the $d_i$'s.
Depending on $s$ (i.e., the image of the person), the goal maps to one of two broader deep learning tasks:
  • If the person in the source image ($s$) is the same as in the driving video ($d_i$), then it's a video reconstruction task. The generated output video ($y_i$) still takes the identity information from $s$ and the motion information from $d_i$.
  • If the person in $s$ is not the same as in $d_i$, then it's a motion transfer task.
To inherit the features from the source image and control the novel synthesis of the talking head, the authors devised an unsupervised approach for learning a set of 3D keypoints and their decomposition.
The proposed method can be divided into three major steps:
  • Source image feature extraction
  • Driving video feature extraction
  • Video generation
The beauty of the proposed solution is that the networks in all three stages are trained jointly. We will look into the training details in a moment, but let's first quickly look at the architectural design for source image feature extraction.

Source Image Feature Extraction

Figure: Different features extracted from the source image. (Source)
Four separate neural networks (well, three distinct architectures, since $H$ and $\triangle$ share a backbone) are used to extract identity-specific information. Digging in a bit:

3D Appearance Feature Extraction ($F$):

Using a neural network $F$, the source image $s$ is mapped to a 3D appearance feature volume $f_s$. The network $F$ consists of multiple downsampling blocks followed by a number of 3D residual blocks that compute the 3D feature volume $f_s$.
Figure: Architectural design of $F$. (Source)
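To make the structure of $F$ a bit more concrete, here is a minimal PyTorch sketch of a 2D-downsampling-then-3D-residual pipeline. The layer counts, channel sizes, and depth of the feature volume are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class AppearanceFeatureExtractor(nn.Module):
    """Minimal sketch of F: 2D downsampling followed by 3D residual processing.
    Layer sizes are illustrative, not the paper's exact configuration."""
    def __init__(self, depth=16, channels=32):
        super().__init__()
        self.depth, self.channels = depth, channels
        # 2D downsampling: 256x256x3 -> 64x64x(depth*channels)
        self.down = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=1, padding=3), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, depth * channels, 3, stride=2, padding=1), nn.ReLU(),
        )
        # 3D residual processing on the reshaped feature volume
        self.res3d = nn.Sequential(
            nn.Conv3d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv3d(channels, channels, 3, padding=1),
        )

    def forward(self, src_img):                              # (B, 3, 256, 256)
        h = self.down(src_img)                               # (B, D*C, 64, 64)
        b, _, hh, ww = h.shape
        vol = h.view(b, self.channels, self.depth, hh, ww)   # (B, C, D, H, W)
        return vol + self.res3d(vol)                         # 3D appearance volume f_s

f_s = AppearanceFeatureExtractor()(torch.randn(1, 3, 256, 256))
print(f_s.shape)  # torch.Size([1, 32, 16, 64, 64])
```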

$K$ 3D Canonical Keypoint Extraction ($L$):

Using a canonical 3D keypoint detection network $L$, a set of $K$ canonical 3D keypoints $x_{c,k} \in \mathbb{R}^3$ and their Jacobians $J_{c,k} \in \mathbb{R}^{3 \times 3}$ are extracted from $s$.
The Jacobians represent how a local patch around the keypoint can be transformed into a patch in another image via an affine transformation.
The authors have used a U-Net style encoder-decoder to extract canonical keypoints.
Figure: Architectural design of $L$. (Source)
Our canonical keypoints are formulated to be independent of the pose and expression change. They should only contain a person’s geometry signature, such as the shapes of face, nose, and eyes.
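For intuition, below is a hedged sketch of how 3D keypoints can be read off a U-Net's output heatmaps via a soft-argmax, a common choice in first-order-motion-style models; the paper's exact keypoint head (and how it predicts the Jacobians) may differ.

```python
import torch

def keypoints_from_heatmaps(heatmaps):
    """Soft-argmax over K predicted 3D heatmaps -> K 3D keypoints in [-1, 1].
    A common way to extract keypoints from a U-Net's output; the paper's
    exact head may differ."""
    b, k, d, h, w = heatmaps.shape
    probs = heatmaps.view(b, k, -1).softmax(dim=-1).view(b, k, d, h, w)
    # Normalised coordinate grids for each axis
    zs = torch.linspace(-1, 1, d).view(1, 1, d, 1, 1)
    ys = torch.linspace(-1, 1, h).view(1, 1, 1, h, 1)
    xs = torch.linspace(-1, 1, w).view(1, 1, 1, 1, w)
    # Expected coordinate under the heatmap distribution
    kp = torch.stack([(probs * a).sum(dim=(2, 3, 4)) for a in (xs, ys, zs)], dim=-1)
    return kp  # (B, K, 3)

kp = keypoints_from_heatmaps(torch.randn(1, 20, 16, 64, 64))
print(kp.shape)  # torch.Size([1, 20, 3])
```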

Head Pose ($H$) and Expression Extraction ($\triangle$):

A pose estimation network $H$ is used to estimate the head pose of the person in $s$. The pose is parameterized by a rotation matrix $R_s \in \mathbb{R}^{3 \times 3}$ and a translation vector $t_s \in \mathbb{R}^3$. In practice, the rotation matrix $R_s$ is composed of three rotation matrices: yaw, pitch, and roll.
The expression deformation estimation network $\triangle$ is used to estimate the deformation of the keypoints from the neutral expression. Thus there are $K$ 3D deformations $\delta_{s,k}$.
Note that the authors use a common backbone with shared weights for both $H$ and $\triangle$, as is evident from the proposed architecture below.
Figure: Architectural design of $H$ and $\triangle$. (Source)
Note: The same architecture is used to extract motion-related information from the driving video.
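Here is a minimal sketch of what a shared backbone with separate heads for rotation, translation, and the $K$ deformations could look like. The backbone layers, the direct regression of angles, and $K = 20$ are assumptions for illustration; the paper's actual pose head may be implemented differently.

```python
import torch
import torch.nn as nn

class PoseExpressionNet(nn.Module):
    """Sketch of a shared backbone feeding separate heads for head pose (H)
    and expression deformations (Δ). Layer sizes and K=20 are illustrative."""
    def __init__(self, num_kp=20):
        super().__init__()
        self.num_kp = num_kp
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=2, padding=3), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.rot_head = nn.Linear(128, 3)               # yaw, pitch, roll angles
        self.trans_head = nn.Linear(128, 3)             # translation t in R^3
        self.deform_head = nn.Linear(128, num_kp * 3)   # K deformations, each in R^3

    def forward(self, img):
        feat = self.backbone(img)                       # shared features
        angles = self.rot_head(feat)                    # used to build R
        t = self.trans_head(feat)
        delta = self.deform_head(feat).view(-1, self.num_kp, 3)
        return angles, t, delta

angles, t, delta = PoseExpressionNet()(torch.randn(1, 3, 256, 256))
```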
Using the information from all three architectures, the authors propose a transformation $T$ to obtain the final 3D keypoints $x_{s,k}$ and their Jacobians $J_{s,k}$ for the source image. $T_x$ is applied to the keypoints and $T_j$ to the Jacobians such that:
$x_{s,k} = T_x(x_{c,k}, R_s, t_s, \delta_{s,k}) \equiv R_s x_{c,k} + t_s + \delta_{s,k}$
$J_{s,k} = T_j(J_{c,k}, R_s) \equiv R_s J_{c,k}$
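In code, these two transformations are just a rotation, a translation, and an additive deformation. A small sketch, with shapes assumed as $K$ keypoints in $\mathbb{R}^3$ and $3 \times 3$ Jacobians:

```python
import torch

def transform_keypoints(x_c, J_c, R, t, delta):
    """T_x and T_j as written above: x_k = R x_{c,k} + t + δ_k and J_k = R J_{c,k}.
    Assumed shapes: x_c (K, 3), J_c (K, 3, 3), R (3, 3), t (3,), delta (K, 3)."""
    x = x_c @ R.T + t + delta     # rotate, translate, then add the expression deformation
    J = R @ J_c                   # (3, 3) @ (K, 3, 3) broadcasts over the K Jacobians
    return x, J

# Toy check with an identity pose: the keypoints reduce to x_c + delta
K = 20
x_c, J_c = torch.randn(K, 3), torch.randn(K, 3, 3)
x_s, J_s = transform_keypoints(x_c, J_c, torch.eye(3), torch.zeros(3), torch.zeros(K, 3))
```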

Driving Video Feature Extraction

Figure: Different features extracted from the driving video. (Source)
The driving video is used to extract motion-related information. To this end, the head pose estimation network $H$ and the expression deformation estimation network $\triangle$ are used. Note that the 3D appearance feature extractor ($F$) and the canonical keypoint detection network ($L$) are not used. This is in line with the formulated goal. From the paper:
Instead of extracting canonical 3D keypoints from the driving image $d$ using $L$, we reuse $x_{c,k}$ and $J_{c,k}$, which were extracted from the source image $s$. This is because the face in the output image must have the same identity as the one in the source image $s$. There is no need to compute them again.
Using the identity-specific information ($x_{c,k}$ and $J_{c,k}$) and the motion-related information, the final 3D keypoints $x_{d,k}$ and their Jacobians $J_{d,k}$ are computed for the driving video. The same transformations $T_x$ and $T_j$ are used such that:
$x_{d,k} = T_x(x_{c,k}, R_d, t_d, \delta_{d,k}) \equiv R_d x_{c,k} + t_d + \delta_{d,k}$
$J_{d,k} = T_j(J_{c,k}, R_d) \equiv R_d J_{c,k}$
These 3D keypoints and their Jacobians are derived for every frame of the driving video. Because the head pose enters this computation as an explicit rotation and translation, we can provide a user-specified rotation matrix and translation vector to change the person's head pose.
Our approach allows manual changes to the 3D head pose during synthesis. Let $R_u$ and $t_u$ be user-specified rotation and translation, respectively. The final head pose in the output image is given by $R_d \leftarrow R_u R_d$ and $t_d \leftarrow t_u + t_d$. In video conferencing, we can change a person’s head pose in the video stream freely despite the original view angle.
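As a sketch of this free-view control, the user rotation $R_u$ can be built from yaw/pitch/roll angles and composed with the driving pose. The specific angle convention below is an illustrative assumption.

```python
import math
import torch

def apply_user_view(R_d, t_d, yaw=0.0, pitch=0.0, roll=0.0, shift=(0.0, 0.0, 0.0)):
    """Free-view control: R_d <- R_u R_d and t_d <- t_u + t_d, with R_u built
    from user-chosen yaw/pitch/roll angles in radians (angle convention is an
    illustrative assumption)."""
    cy, sy = math.cos(yaw), math.sin(yaw)
    cp, sp = math.cos(pitch), math.sin(pitch)
    cr, sr = math.cos(roll), math.sin(roll)
    R_yaw = torch.tensor([[cy, 0.0, sy], [0.0, 1.0, 0.0], [-sy, 0.0, cy]])
    R_pitch = torch.tensor([[1.0, 0.0, 0.0], [0.0, cp, -sp], [0.0, sp, cp]])
    R_roll = torch.tensor([[cr, -sr, 0.0], [sr, cr, 0.0], [0.0, 0.0, 1.0]])
    R_u = R_yaw @ R_pitch @ R_roll
    return R_u @ R_d, torch.tensor(shift) + t_d

# Example: turn the synthesized head 30 degrees without changing the driving motion
R_d, t_d = torch.eye(3), torch.zeros(3)
R_new, t_new = apply_user_view(R_d, t_d, yaw=math.pi / 6)
```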

Video Synthesis

Figure: Video synthesis pipeline. (Source)
The 3D keypoints and Jacobians extracted from the source image and the driving video frame are used to estimate warping flow maps. The flow map $w_k$ is generated from the $k^{th}$ keypoint using a first-order approximation and is used to warp the source feature, giving $w_k(f_s)$, where $k \in \{1, 2, ..., K\}$. First Order Motion Model for Image Animation by Siarohin et al. might be a useful read; you can check out this paper's summary by Lavanya Shukla here.
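Following the first-order motion model formulation referenced above, the flow induced by keypoint $k$ maps a point $z$ in the driving frame's space to $x_{s,k} + J_{s,k} J_{d,k}^{-1}(z - x_{d,k})$. A small sketch of that mapping (how it is evaluated over the full feature grid is an assumption here):

```python
import torch

def first_order_flow(z, x_s, J_s, x_d, J_d):
    """Warp induced by a single keypoint k, following the first-order motion
    model formulation: w_k(z) ≈ x_{s,k} + J_{s,k} J_{d,k}^{-1} (z - x_{d,k}).
    z: (N, 3) grid points in driving space; x_s, x_d: (3,); J_s, J_d: (3, 3)."""
    A = J_s @ torch.linalg.inv(J_d)        # local affine map around keypoint k
    return (z - x_d) @ A.T + x_s           # corresponding source-space locations

z = torch.randn(5, 3)                      # a few sample grid points
w_k = first_order_flow(z, torch.zeros(3), torch.eye(3), torch.zeros(3), torch.eye(3))
```

Each such flow $w_k$ warps $f_s$ into $w_k(f_s)$; these warped features are what the motion field estimator described next consumes.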
These warped features are fed to a motion field estimator network $M$, a 3D U-Net style network. As shown in the figure, two outputs are estimated by this network: a mask $m$ and an occlusion map $o$.
Figure: Architectural design of $M$. (Source)
  • Softmax activation is used to obtain the flow composition mask $m$, which consists of $K$ 3D masks. These are combined with the $K$ warping flow maps $w_k$ to obtain the final composite flow field $w$, which is then used to obtain the warped source feature $w(f_s)$.
  • Warping leads to occlusions, so a 2D occlusion mask $o$ is also predicted and fed to the generator.
The generator network $G$ takes the warped 3D source feature $w(f_s)$ and first projects it back to a 2D feature map. This is multiplied with the occlusion mask $o$ and then passed through a series of 2D residual blocks and upsampling layers to obtain the final image.
Figure: Architectural design of $G$. (Source)
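Putting the pieces together, here is a hedged sketch of the composition-and-warping step; the tensor shapes and the reshape-based 3D-to-2D projection are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def compose_and_warp(f_s, flows, mask_logits, occlusion):
    """Minimal sketch of the synthesis step: softmax over the K masks, combine
    the K per-keypoint flows into one field w, warp the 3D source feature f_s,
    project to 2D, and apply the occlusion mask.
    Assumed shapes: f_s (B, C, D, H, W), flows (B, K, D, H, W, 3),
    mask_logits (B, K, D, H, W), occlusion (B, 1, H, W)."""
    m = mask_logits.softmax(dim=1)                          # flow-composition mask
    w = (m.unsqueeze(-1) * flows).sum(dim=1)                # composite flow, (B, D, H, W, 3)
    warped = F.grid_sample(f_s, w, align_corners=True)      # w(f_s), same shape as f_s
    b, c, d, h, wdt = warped.shape
    feat2d = warped.view(b, c * d, h, wdt)                  # project 3D -> 2D by reshape
    return feat2d * occlusion                               # mask out occluded regions

out = compose_and_warp(torch.randn(1, 32, 16, 64, 64),
                       torch.rand(1, 20, 16, 64, 64, 3) * 2 - 1,
                       torch.randn(1, 20, 16, 64, 64),
                       torch.sigmoid(torch.randn(1, 1, 64, 64)))
print(out.shape)  # torch.Size([1, 512, 64, 64])
```

The resulting 2D feature map is what the generator $G$ turns into the final frame.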
To summarize so far: we have a source image $s$ and a driving video $d$. The task is to generate an output video $y$ that carries the identity-specific information from $s$ and the motion-specific information from $d$. Separate neural networks extract the identity-specific and motion-specific information, which is then used to obtain $K$ 3D keypoints and Jacobians for both $s$ and $d$.
These keypoints and Jacobians are then used to warp the source appearance feature $f_s$ extracted from $s$, from which the final output image is generated using the generator network $G$.
So how did the authors train this system? We'll cover the procedure, the dataset they used, and their losses.

Training the Models

The authors used a dataset of talking-head videos to train their models. They mention the use of VoxCeleb2 and TalkingHead-1KH for evaluation, though it is a bit unclear which dataset they used for training.
For each video, two frames were sampled:
  • one would act as the source image $s$,
  • and the other as a frame $d$ from the driving video.
The networks $F$, $L$, $H$, $\triangle$, $M$, and $G$ are trained together by minimizing the following loss:
$\mathcal{L} = \mathcal{L}_P(d, y) + \mathcal{L}_G(d, y) + \mathcal{L}_E(\{x_{d,k}\}, \{J_{d,k}\}) + \mathcal{L}_L(\{x_{d,k}\}) + \mathcal{L}_H(R_d, \bar{R}_d) + \mathcal{L}_{\triangle}(\{\delta_{d,k}\})$
Let's go through each term one-by-one:
  • Perceptual Loss ($\mathcal{L}_P$): Perceptual loss is commonly used in image reconstruction tasks. Here's a nice description of this loss function. In short, a pre-trained VGG network is used to extract features from both the ground-truth image and the reconstructed image, and the $L_1$ distance is computed between the features. The features are extracted from multiple hidden layers at varying resolutions. Besides the regular VGG network (trained on ImageNet), the authors also used a pre-trained face VGG network for obvious reasons.
  • GAN Loss ($\mathcal{L}_G$): The authors use a patch GAN discriminator along with the hinge loss. Check out this quick summary here.
  • Equivariance Loss ($\mathcal{L}_E$): This loss ensures the consistency of the estimated keypoints. Let $x_d$ be the detected keypoints for the input image $d$. When a known transformation $T$ is applied to the image (giving $T(d)$), the detected keypoints should transform in the same way. The $L_1$ distance $||x_d - T^{-1}(x_{T(d)})||_1$ is minimized, where $T^{-1}$ is the inverse of the known transform. The same logic applies to the Jacobians of the keypoints.
  • Keypoint Prior Loss ($\mathcal{L}_L$): This loss encourages the estimated image-specific keypoints $x_{d,k}$ to spread out across the face region instead of crowding into a small neighborhood. The distance between each pair of keypoints is computed and penalized if it falls below a threshold (see the sketch after this list).
  • Head Pose Loss ($\mathcal{L}_H$): The $L_1$ distance is computed between the estimated head pose $R_d$ and the one predicted by a pre-trained estimator, $\bar{R}_d$. This supervision is only as good as the pre-trained head pose estimator.
  • Deformation Prior Loss ($\mathcal{L}_{\triangle}$): This loss is simply the $L_1$ norm of the deformations: $\mathcal{L}_{\triangle} = ||\delta_{d,k}||_1$.
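As a concrete example of the two prior terms, here is a minimal sketch of the keypoint prior and deformation prior losses; the threshold value is an illustrative assumption, and only the spread term described above is implemented.

```python
import torch

def keypoint_prior_loss(kp, threshold=0.1):
    """Sketch of L_L: penalise keypoint pairs closer than a threshold so that
    the K keypoints spread over the face. Threshold value is illustrative."""
    dist = torch.cdist(kp, kp)                         # (B, K, K) pairwise distances
    penalty = torch.clamp(threshold - dist, min=0.0)
    # zero out the diagonal (distance of each keypoint to itself)
    penalty = penalty - torch.diag_embed(torch.diagonal(penalty, dim1=1, dim2=2))
    return penalty.sum(dim=(1, 2)).mean()

def deformation_prior_loss(delta):
    """L_Δ: L1 norm of the expression deformations, keeping them small."""
    return delta.abs().sum(dim=-1).mean()

kp = torch.rand(2, 20, 3)                 # B=2 sets of K=20 keypoints
delta = 0.01 * torch.randn(2, 20, 3)
print(keypoint_prior_loss(kp).item(), deformation_prior_loss(delta).item())
```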
The models are trained in a coarse-to-fine manner: the Adam optimizer with a learning rate of 0.0002 is used to train on 256x256 resolution images for 100 epochs, and the result is then fine-tuned on 512x512 resolution images for 10 epochs.
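For orientation, a hypothetical wiring of that schedule might look like the following; `networks` and the inner loop body are placeholders, and the batch sampling and loss computation are omitted.

```python
import torch

# Hypothetical wiring of the schedule described above: Adam with lr 2e-4,
# 100 epochs at 256x256, then 10 fine-tuning epochs at 512x512. `networks`
# stands in for F, L, H, Δ, M and G, which are not defined here.
networks = torch.nn.ModuleList([torch.nn.Linear(8, 8)])   # placeholder modules
optimizer = torch.optim.Adam(networks.parameters(), lr=2e-4)

for resolution, num_epochs in [(256, 100), (512, 10)]:
    for epoch in range(num_epochs):
        # sample (s, d) pairs at `resolution`, compute the total loss L,
        # then: optimizer.zero_grad(); L.backward(); optimizer.step()
        pass
```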

Conclusion

The results of this research are really quite promising. The techniques we talked about here resulted in a 10X bandwidth reduction and, if you had a chance to look at the video in our introduction, you can see that the video quality is incredibly high considering that reduction. Models like this could make it possible to democratize access to video conferencing and reduce strain on networks, especially in residential and rural areas where bandwidth is already harder to come by.
The paper is packed with implementation details, failure modes, and other nitty-gritty. I highly recommend going through the paper, especially the appendix.
I would also like to thank Justin Tenuto for his edits.