Overview: Egocentric Videoconferencing

This report explores a method for egocentric video conferencing that enables hands-free video calls. Made by Ayush Thakur using Weights & Biases.

Video conferencing has become an essential part of our day-to-day lives. For all the good it has to offer, it is limited by internet bandwidth, device and location constraints, and the need for a front-facing camera that provides good facial coverage.

Video conferencing on the move is challenging but certainly convenient. This report explores a method for egocentric video conferencing that enables hands-free video calls. I highly encourage you to go through the project website linked below.

Project Website | Paper

Introduction

Video conferencing is valuable because it conveys a wide range of communication signals, such as facial expressions and eye gaze. Video calls require a front-facing camera to allow good facial coverage. This is feasible in a controlled and static indoor environment like your work desk. However, it can be challenging in everyday scenarios where people use hand-held mobile devices, and even more so when walking in outdoor environments.

The existing techniques that could enable standard video conferencing from an egocentric input view can be broadly divided into a few categories.

egocentric.png

-> Figure 1: Egocentric view to frontal view using the proposed deep learning based video-to-video translation technique. (Source) <-

The authors of Egocentric Videoconferencing propose a learning-based method that converts the egocentric camera view into a frontal facial view suitable for video calls. Subtle expressions like tongue movement, eye movements, and eye blinks are faithfully translated to the frontal facial view. The algorithm, at its core, is a video-to-video translation technique.

Overview of the Proposed Method

archiego.png

-> Figure 2: Simplistic overview of the proposed architecture. $E_i$ is used for conditioning, rendered neutral face images are the input, and the synthesized frontal images are the output. (Source) <-

The proposed method is a video-to-video translation technique and uses a conditional GAN (more on this later). The cGAN is conditioned on the egocentric facial view ($E_i$) of an individual such that the learned generator ($G$) generates the frontal facial view of the same individual. Since this is meant for video conferencing, the authors have trained the architecture using a sequence of $N=11$ frames instead of single images.

The cGAN is mathematically given by $G(X|Y)$. In this case, $Y$ is the egocentric view $E_i$, and $X$ needs to be an image-like input since we want to translate from one frame to another; the authors use renderings of the neutral face model ($C_i$) for $X$. Let us look at each of the components separately, but before that, we will quickly go through the data collection process.
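
Below is a minimal PyTorch sketch (not the authors' network) of how such conditioning could be wired: the $N=11$ rendered neutral frames and the $N=11$ egocentric frames are stacked along the channel dimension and fed to a toy generator. The layer sizes, resolution, and concatenation strategy are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Minimal sketch of G(X | Y): X = N rendered neutral face frames (C_i),
# Y = N egocentric frames (E_i). Shapes and the concatenation strategy are
# illustrative assumptions, not the authors' exact architecture.
N, H, W = 11, 256, 256  # sequence length and an assumed frame resolution

class ConditionalGenerator(nn.Module):
    def __init__(self, in_channels=3 * N, cond_channels=3 * N):
        super().__init__()
        # A toy encoder-decoder; the paper uses a far more capable generator.
        self.net = nn.Sequential(
            nn.Conv2d(in_channels + cond_channels, 64, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 3, 3, padding=1),
            nn.Tanh(),  # output one synthesized frontal RGB frame
        )

    def forward(self, x, y):
        # Condition simply by channel-wise concatenation of X and Y.
        return self.net(torch.cat([x, y], dim=1))

# X: rendered neutral faces, Y: egocentric frames, both stacked over N frames.
x = torch.randn(1, 3 * N, H, W)  # C_i sequence
y = torch.randn(1, 3 * N, H, W)  # E_i sequence
frontal = ConditionalGenerator()(x, y)  # (1, 3, H, W)
```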

Data Collection

egocentric-setup.png

-> Figure 3: The data collection setup and the cameras used. (Source) <-

The Architecture

The proposed method, at its core, is a video-to-video translation technique. Given the success of GANs at image-to-image translation, the authors have used a conditional GAN.

Conditional GAN 101

Open In Colab

If you are familiar with GANs, you might have heard of conditional GANs. If not, here is a quick rundown. A conventional GAN's generator can generate images from a latent vector (random noise). However, you have no control over the generated image. A conditional GAN (cGAN) is a simple yet effective modification to your regular GAN, as shown in Figure 4.

image.png

-> Figure 4: Conditional GAN architecture. (Source) <-

Thus a conditional GAN is formed by conditioning both the generator ($G$) and the discriminator ($D$) on some extra information, such as class labels $y$.

You can learn more about cGANs in this excellent blog post. Try out the linked [colab notebook](https://colab.research.google.com/drive/1VxEGx_G4nuSoeAzNqLFRuRy84yGbX9O3?usp=sharing) to experiment with a simple conditional GAN. The batch-wise generator and discriminator losses are shown in the media panel below. Each column of generated images in the media panel belongs to a single class; thus the images were conditionally generated.
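
For a concrete picture of the conditioning trick, here is a minimal, self-contained PyTorch sketch of a label-conditioned GAN (a toy, not the notebook's exact code): both networks receive an embedding of the class label alongside their usual input.

```python
import torch
import torch.nn as nn

# Minimal label-conditioned GAN sketch (MNIST-sized, 28x28 grayscale images).
# This is an illustrative toy, not the code from the linked notebook.
NUM_CLASSES, LATENT_DIM, IMG_DIM = 10, 100, 28 * 28

class Generator(nn.Module):
    def __init__(self):
        super().__init__()
        self.label_emb = nn.Embedding(NUM_CLASSES, NUM_CLASSES)
        self.net = nn.Sequential(
            nn.Linear(LATENT_DIM + NUM_CLASSES, 256),
            nn.ReLU(inplace=True),
            nn.Linear(256, IMG_DIM),
            nn.Tanh(),
        )

    def forward(self, z, labels):
        # Condition by concatenating the noise with a label embedding.
        return self.net(torch.cat([z, self.label_emb(labels)], dim=1))

class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.label_emb = nn.Embedding(NUM_CLASSES, NUM_CLASSES)
        self.net = nn.Sequential(
            nn.Linear(IMG_DIM + NUM_CLASSES, 256),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Linear(256, 1),
            nn.Sigmoid(),
        )

    def forward(self, img, labels):
        # The discriminator also sees the label it should verify.
        return self.net(torch.cat([img, self.label_emb(labels)], dim=1))

z = torch.randn(16, LATENT_DIM)
labels = torch.randint(0, NUM_CLASSES, (16,))
fake = Generator()(z, labels)           # (16, 784), one image per label
score = Discriminator()(fake, labels)   # (16, 1)
```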

With this simple demonstration, it is not hard to see the importance of cGANs for generating photo-realistic video frames from the egocentric view of the face: the generated frontal frame must be conditioned on the expressions, eye blinks, eye gaze, etc. captured by the egocentric frame, as well as on the head pose.

In the proposed architectural design, the cGAN is conditioned on the egocentric frames ($E_i$) and takes rendered neutral face images ($C_i$) as input; the main components are described below.

Rendered Neutral Faces ($C_i$)

renderedneutral.png

-> Figure 5: Overview of synthetic neutral face rendering for input to cGAN for pose conditioning. (Source) <-

As shown in Figure 2, the cGAN takes rendered neutral face images as input, which enables control of head movement in the target view. These renderings are obtained via monocular face reconstruction using Face2Face, whose inputs are the images from the front-perspective camera (shown in the Data Collection section).

A 3D Morphable Face Model (3DMM) is used because of its ability to model intrinsic properties of 3D faces, such as shape, skin texture, illumination, and expression. However, only the geometry and reflectance components are used here, as expression, pose, and related properties are learned from the egocentric view. Learn more about 3DMMs in this detailed survey.
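
For intuition, here is a small NumPy sketch of the linear 3DMM idea: geometry is a mean shape plus linear combinations of identity and expression bases, and a neutral face simply uses zero expression coefficients. The basis sizes and vertex count are illustrative assumptions, not the paper's model.

```python
import numpy as np

# Minimal sketch of a linear 3D Morphable Model (3DMM). Dimensions are
# illustrative assumptions, not the paper's actual model sizes.
NUM_VERTICES = 5000          # vertices in the face mesh
ID_DIM, EXPR_DIM = 80, 64    # identity and expression basis sizes

mean_shape = np.zeros(3 * NUM_VERTICES)                    # average face geometry
id_basis = np.random.randn(3 * NUM_VERTICES, ID_DIM)       # identity (shape) basis
expr_basis = np.random.randn(3 * NUM_VERTICES, EXPR_DIM)   # expression basis

def face_geometry(alpha, delta):
    """Reconstruct per-vertex geometry from identity and expression coefficients."""
    return mean_shape + id_basis @ alpha + expr_basis @ delta

alpha = np.random.randn(ID_DIM) * 0.1   # person-specific identity coefficients
delta = np.zeros(EXPR_DIM)              # zero expression -> neutral face

# The rendered neutral faces C_i keep the person's identity but use zero
# expression coefficients, since expressions come from the egocentric view.
neutral_face = face_geometry(alpha, delta)
```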

Background Removal

The proposed method does not handle dynamic backgrounds, so the background is removed from both the egocentric and frontal camera video frames. The authors use a scene segmentation architecture called BiSeNet: each frame is segmented, and the background pixels are set to black.
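
Conceptually, this preprocessing step looks like the sketch below. Note the stand-in: a torchvision DeepLabV3 model is used here instead of BiSeNet, and the person-class masking logic is an illustrative assumption.

```python
import torch
from torchvision.models.segmentation import deeplabv3_resnet50, DeepLabV3_ResNet50_Weights

# Sketch of the background-removal step. The paper uses BiSeNet; here a
# torchvision DeepLabV3 model stands in as the segmentation network.
weights = DeepLabV3_ResNet50_Weights.DEFAULT
model = deeplabv3_resnet50(weights=weights).eval()
preprocess = weights.transforms()

PERSON_CLASS = 15  # "person" class index in the PASCAL VOC label set

def remove_background(frame):
    """frame: float tensor (3, H, W) in [0, 1]. Returns frame with background set to black."""
    with torch.no_grad():
        logits = model(preprocess(frame).unsqueeze(0))["out"][0]  # (21, h, w)
    mask = (logits.argmax(dim=0) == PERSON_CLASS).float()
    # Upsample the mask back to the frame resolution and zero out the background.
    mask = torch.nn.functional.interpolate(
        mask[None, None], size=frame.shape[-2:], mode="nearest"
    )[0]
    return frame * mask

frame = torch.rand(3, 480, 640)       # a dummy RGB frame
foreground_only = remove_background(frame)
```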

Training

The model is trained just like any GAN: the generator minimizes the adversarial loss to achieve a high level of video-realism, while the discriminator maximizes its accuracy at classifying real versus fake videos. In addition to the adversarial loss, the authors employ a content loss and a perceptual loss.
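
The sketch below shows one plausible way to combine these three terms for the generator; the loss weights and the VGG layer choice for the perceptual term are assumptions, not the paper's exact values.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16, VGG16_Weights

# Sketch of the generator's combined objective: adversarial + content (L1) +
# perceptual (VGG feature) loss. Loss weights are illustrative assumptions.
vgg_features = vgg16(weights=VGG16_Weights.DEFAULT).features[:16].eval()
for p in vgg_features.parameters():
    p.requires_grad_(False)

bce = nn.BCEWithLogitsLoss()
l1 = nn.L1Loss()

def generator_loss(disc_logits_fake, fake_frame, real_frame,
                   w_adv=1.0, w_content=10.0, w_perc=10.0):
    # Adversarial term: fool the discriminator into labeling fakes as real.
    adv = bce(disc_logits_fake, torch.ones_like(disc_logits_fake))
    # Content term: pixel-wise L1 between synthesized and ground-truth frames.
    content = l1(fake_frame, real_frame)
    # Perceptual term: L1 distance in a pretrained VGG feature space.
    perc = l1(vgg_features(fake_frame), vgg_features(real_frame))
    return w_adv * adv + w_content * content + w_perc * perc

fake = torch.rand(1, 3, 256, 256)
real = torch.rand(1, 3, 256, 256)
logits = torch.randn(1, 1)
loss = generator_loss(logits, fake, real)
```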

Results

image.png

-> Figure 6: The predicted frontal view from the egocentric view of the face. (Source) <-

Limitations and Conclusion

This work is a nice step forward towards real-time, hands-free egocentric video conferencing for mobile eyewear devices. The results show much promise, and in the words of Károly from Two Minute Papers, two more papers down the line and we will see considerable progress.

The proposed method has a few novel bits, and the authors did a fabulous job of pointing out the limitations of their work. To list a few of them: the method does not handle dynamic backgrounds (they are segmented out and set to black), and since the generator learns to map an individual's egocentric view to that same individual's frontal view, it is person-specific and needs to be retrained for new subjects.

I hope you have enjoyed this paper breakdown. I would love to know what you think about the proposed method. I found it to be an interesting and clever use case of conditional GANs.