An Introduction to Egocentric Videoconferencing
This article explores a method for egocentric video conferencing that enables hands-free video calls, making it easier to participate in calls while on the move.
Video conferencing has become an essential part of our day-to-day lives. For all the good it has to offer, it is limited by internet bandwidth, constrained by device and location, and requires a front-facing camera for good facial coverage.
Video conferencing on the move is challenging, but it would certainly be convenient. This article explores a method for egocentric video conferencing that enables hands-free video calls. I highly encourage you to go through the project website linked below.
Project Website | Paper
Table of Contents
- Background to Egocentric Videoconferencing
- Overview of the Proposed Method
- Results
- Limitations and Conclusion
Background to Egocentric Videoconferencing
Video conferencing is useful because it conveys a wide range of communication signals such as facial expressions and eye gaze. Video calls require a front-facing camera for good facial coverage. This is feasible in a controlled, static indoor environment like your work desk. However, it becomes challenging in everyday scenarios where people use hand-held mobile devices, and even more so when walking in outdoor environments.
Existing techniques that could enable conventional video conferencing from an egocentric input view can be broadly divided into:
- frontalisation-based: transforming face poses in a camera view where large parts of the face are occluded into complete, frontal views of the face. Existing methods generate noticeable artifacts and deformations in the facial structure.
- reenactment-based: capturing the facial expressions of a source actor in a video and transferring them to a video of a different target face.
Figure 1: Egocentric view to frontal view using the proposed deep learning based video-to-video translation technique. (Source)
The authors of Egocentric Videoconferencing proposed:
- a low-cost wearable egocentric camera setup and
- a deep learning framework to translate egocentric facial views (video) into the frontal facial views (video) common in videoconferencing. This is shown in figure 1.
Subtle expressions like tongue movement, eye movement, and eye blinking are translated to the frontal facial view as well. At its core, the algorithm is a video-to-video translation technique.
Overview of the Proposed Method

Figure 2: Simplified overview of the proposed architecture. The egocentric video is used for conditioning, rendered neutral face images are the input, and the synthesized frontal images are the output. (Source)
The proposed method is a video-to-video translation technique and uses a conditional GAN (more on this later). The cGAN is conditioned on the egocentric facial view (video) of an individual such that the learned generator (G) generates the frontal facial view of the same individual. Since this is meant for video conferencing, the authors trained the architecture on sequences of frames instead of single images.
The cGAN learns a mapping G: {x, z} → y. In this case, the conditioning input x is the egocentric view, and z needs an image-like input since we want to translate from one frame to another; the authors use renderings of the neutral face model as z. Let us look at each of the components separately, but before that, we will quickly go through the data collection process.
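To make the shapes concrete, here is a minimal sketch of how the generator's inputs could be assembled for one output frame. The temporal window length T and the channel layout are illustrative assumptions; only the 256x256 crop resolution comes from the data collection described below.

```python
import torch

# Illustrative shapes only; T (the temporal window of egocentric frames) is an assumption.
T = 7            # number of egocentric frames used as conditioning for one output frame
H = W = 256      # crop resolution used in the paper

# x: conditioning egocentric frames, z: rendered neutral face (controls head pose)
x = torch.randn(1, T * 3, H, W)   # T RGB egocentric frames stacked along the channel axis
z = torch.randn(1, 3, H, W)       # one RGB rendering of the neutral face model

# The generator G maps {x, z} -> y, a frontal RGB frame of the same person.
generator_input = torch.cat([x, z], dim=1)    # shape: (1, T*3 + 3, 256, 256)
print(generator_input.shape)
```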
Data Collection

- The training data consists of paired egocentric and front-view videos recorded using two different RGB cameras. They were synchronized using a simple calibration stage.
- There are two recording setups: one for the static indoor scenario and another for the dynamic outdoor environment.
- The egocentric camera is a low-cost RGB fisheye camera that can be attached to the frame of a pair of eyeglasses. The setup is a bit bulky, but further improvements in design and technology will address this.
- The frontal camera is a commodity HD camera for indoor scenarios, while for outdoor scenarios, a commodity mobile phone camera is used.
- The authors collected 27 sequences, each around 14,000 frames long on average, from 13 individuals, with frames extracted at 24 frames/second. They manually took a tight crop around the face for both the egocentric and frontal-view videos and resized the cropped frames to 256x256 resolution.
- Out of the ~14,000 frames per sequence, 7,500 were used for training, 2,500 for validation, and the rest for testing. Thus there is a unique model for each individual. A rough sketch of this preprocessing appears below.
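As a rough illustration of the crop, resize, and per-sequence split described above, the sketch below uses OpenCV; the crop box is assumed to be chosen manually, as in the paper.

```python
import cv2  # OpenCV: pip install opencv-python

def preprocess_frame(frame, crop_box):
    """Crop a tight, manually chosen box around the face and resize to 256x256."""
    x, y, w, h = crop_box
    face = frame[y:y + h, x:x + w]
    return cv2.resize(face, (256, 256))

def split_sequence(frames):
    """Per-sequence split reported in the article: 7,500 train / 2,500 val / rest test."""
    return frames[:7500], frames[7500:10000], frames[10000:]
```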
The Architecture
The proposed method, at its core, is a video-to-video translation technique. Given the success of GANs for image-to-image translation techniques, the authors have used a conditional GAN.
Conditional GAN 101
If you are familiar with GANs, you might have heard of the conditional GAN. If not, here is a quick rundown. A conventional GAN's generator produces images from a latent vector (random noise). However, you have no control over what image is generated. The conditional GAN (cGAN) is a simple yet effective modification to the regular GAN, as shown in the figure below.

A conditional GAN is obtained by conditioning both the generator (G) and the discriminator (D) on some extra information, such as class labels (y).
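The core change from a vanilla GAN is that both networks receive the condition as an extra input. Here is a minimal sketch of a class-conditioned GAN step in the spirit of the linked colab demo (this is not the paper's video model; layer sizes are arbitrary):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

n_classes, latent_dim, img_dim = 10, 100, 28 * 28

# The condition (a one-hot class label) is concatenated to the inputs of both G and D.
G = nn.Sequential(nn.Linear(latent_dim + n_classes, 256), nn.ReLU(),
                  nn.Linear(256, img_dim), nn.Tanh())
D = nn.Sequential(nn.Linear(img_dim + n_classes, 256), nn.LeakyReLU(0.2),
                  nn.Linear(256, 1))
bce = nn.BCEWithLogitsLoss()

def cgan_step(real_imgs, labels):
    """One batch of losses for a class-conditioned GAN (flattened images)."""
    y = F.one_hot(labels, n_classes).float()
    z = torch.randn(real_imgs.size(0), latent_dim)
    fake = G(torch.cat([z, y], dim=1))

    # Discriminator: (real, condition) -> 1, (fake, condition) -> 0
    d_loss = bce(D(torch.cat([real_imgs, y], 1)), torch.ones(len(y), 1)) + \
             bce(D(torch.cat([fake.detach(), y], 1)), torch.zeros(len(y), 1))
    # Generator: fool the discriminator under the same condition
    g_loss = bce(D(torch.cat([fake, y], 1)), torch.ones(len(y), 1))
    return d_loss, g_loss
```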
You can learn more about cGANs in this excellent blog post. Try out the linked colab notebook to experiment with a simple conditional GAN. The batch-wise generator and discriminator losses are shown in the media panel below. Every column of generated images in the media panel belongs to a single class; thus the images were conditionally generated.
With this simple demonstration, it is not hard to see why a cGAN matters for generating photo-realistic video frames from the egocentric view of the face: the generated frontal frame must be conditioned on the expressions, eye blinks, eye gaze, etc. captured by the egocentric frame, and it must also be conditioned on the head pose.
In the proposed architectural design,
- The generator network (G) is a U-Net-style convolutional neural network. The proposed U-Net consists of 7 down- and 7 up-convolutional layers with skip connections, and the decoder is symmetric to the encoder. All layers use a kernel size of 4x4 with a stride of 2.
- The discriminator is a patch-based convolutional neural network similar to pix2pix, conditioned on the input egocentric frames. A rough sketch of both networks is shown below.
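Below is a minimal sketch of such a generator/discriminator pair. The layer widths, normalization, and input channel counts are assumptions; only the 7 down-/up-convolutions, 4x4 kernels, stride 2, skip connections, and the patch-based discriminator come from the description above.

```python
import torch
import torch.nn as nn

class UNetGenerator(nn.Module):
    """U-Net with 7 down- and 7 up-convolutions (4x4 kernels, stride 2) and skip connections."""
    def __init__(self, in_ch=24, out_ch=3, base=64):   # in_ch = T*3 + 3 with the assumed T = 7
        super().__init__()
        widths = [base, base * 2, base * 4, base * 8, base * 8, base * 8, base * 8]
        self.downs, self.ups = nn.ModuleList(), nn.ModuleList()
        prev = in_ch
        for w in widths:                                # encoder: 256x256 -> 2x2
            self.downs.append(nn.Sequential(
                nn.Conv2d(prev, w, 4, stride=2, padding=1),
                nn.InstanceNorm2d(w), nn.LeakyReLU(0.2)))
            prev = w
        for i, w in enumerate(reversed(widths[:-1])):   # decoder mirrors the encoder
            in_up = prev if i == 0 else prev * 2        # concatenated skip doubles the channels
            self.ups.append(nn.Sequential(
                nn.ConvTranspose2d(in_up, w, 4, stride=2, padding=1),
                nn.InstanceNorm2d(w), nn.ReLU()))
            prev = w
        self.final = nn.Sequential(                     # 7th up-convolution back to 256x256
            nn.ConvTranspose2d(prev * 2, out_ch, 4, stride=2, padding=1), nn.Tanh())

    def forward(self, x):
        skips = []
        for down in self.downs:
            x = down(x)
            skips.append(x)
        skips = skips[:-1][::-1]                        # drop the bottleneck, deepest skip first
        for i, up in enumerate(self.ups):
            x = up(x) if i == 0 else up(torch.cat([x, skips[i - 1]], dim=1))
        return self.final(torch.cat([x, skips[-1]], dim=1))

class PatchDiscriminator(nn.Module):
    """pix2pix-style patch discriminator, conditioned on the egocentric input frames."""
    def __init__(self, in_ch=24 + 3, base=64):          # conditioning channels + generated/real frame
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, base, 4, 2, 1), nn.LeakyReLU(0.2),
            nn.Conv2d(base, base * 2, 4, 2, 1), nn.LeakyReLU(0.2),
            nn.Conv2d(base * 2, base * 4, 4, 2, 1), nn.LeakyReLU(0.2),
            nn.Conv2d(base * 4, 1, 4, 1, 1))            # one realism logit per image patch

    def forward(self, cond, img):
        return self.net(torch.cat([cond, img], dim=1))
```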
Rendered Neutral Faces (z)

Figure 5: Overview of the synthetic neutral face rendering used as input to the cGAN for pose conditioning. (Source)
As shown in figure 2, the cGAN takes rendered neutral face images as input. This enables control of the head movement in the target view. The renderings are obtained by first performing monocular face reconstruction using Face2Face; the inputs to Face2Face are the images from the front-perspective camera (shown in the Data Collection section).
A 3D Morphable Face Model (3DMM) is used because of its ability to model intrinsic properties of 3D faces, such as shape, skin texture, illumination, and expression. However, only the geometry and reflectance properties are used here, since expression, pose, etc. are learned from the egocentric view. Learn more about 3DMMs in this detailed survey.
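As a rough illustration of the 3DMM idea (a generic linear model, not the specific one used in the paper), the face geometry is a linear combination of identity and expression bases; rendering the "neutral" face amounts to keeping the person's identity coefficients and zeroing the expression coefficients:

```python
import numpy as np

# Hypothetical basis sizes; a real 3DMM ships its own mean shape and PCA bases.
n_verts, n_id, n_expr = 5000, 80, 64
mean_shape = np.zeros(3 * n_verts)
id_basis   = 0.01 * np.random.randn(3 * n_verts, n_id)
expr_basis = 0.01 * np.random.randn(3 * n_verts, n_expr)

def face_geometry(id_coeffs, expr_coeffs):
    """3DMM geometry: mean shape + identity deformation + expression deformation."""
    return mean_shape + id_basis @ id_coeffs + expr_basis @ expr_coeffs

# Neutral face used as the cGAN input: keep the person's identity, zero the expression.
id_coeffs = np.random.randn(n_id)
neutral_vertices = face_geometry(id_coeffs, np.zeros(n_expr)).reshape(n_verts, 3)
```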
Background Removal
The proposed method does not handle dynamic backgrounds, so the background is removed from both the egocentric and frontal camera video frames. The authors use a scene segmentation architecture called BiSeNet: each frame is segmented, and the background is set to black.
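A minimal sketch of the masking step, assuming a person/background mask has already been produced by a segmentation network such as BiSeNet (the mask here is a placeholder, not BiSeNet's actual output or API):

```python
import numpy as np

def remove_background(frame_rgb, person_mask):
    """Set background pixels to black given a binary person mask.

    frame_rgb:   HxWx3 uint8 frame.
    person_mask: HxW array, 1 where the segmentation network labels 'person', else 0.
    """
    return frame_rgb * person_mask[..., None].astype(frame_rgb.dtype)

# Placeholder mask; in the paper this comes from BiSeNet scene segmentation.
frame = np.random.randint(0, 255, (256, 256, 3), dtype=np.uint8)
mask = np.ones((256, 256), dtype=np.uint8)
clean = remove_background(frame, mask)
```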
Training
The model is trained like any GAN: the generator minimizes the adversarial loss to achieve a high level of video realism, while the discriminator maximizes its classification accuracy on real and fake videos. In addition to the adversarial loss, the authors employ a content loss and a perceptual loss.
- Each sequence contains around 14,000 frames, of which 7,500 are used for training.
- Each sequence is trained for 100 epochs.
- A learning rate of 0.0002, first momentum of 0.5, and a batch size of 12 were used.
- The content loss is a simple ℓ1 loss that enforces the generated images to resemble the ground-truth frames.
- The perceptual loss is computed using a pre-trained VGG-Face network: the distance between the generated and ground-truth frames is measured at the outputs of intermediate convolutional layers of the VGG network. A sketch combining these losses is shown below.
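Putting the pieces together, one generator/discriminator update could combine the three losses roughly as follows. The loss weights and the feature extractor are assumptions for illustration (vgg_features stands in for a pre-trained VGG-Face network returning intermediate activations), and Adam is assumed from the reported learning rate and first momentum.

```python
import torch
import torch.nn as nn

l1 = nn.L1Loss()
bce = nn.BCEWithLogitsLoss()
W_CONTENT, W_PERCEPTUAL = 100.0, 10.0   # illustrative weights, not the paper's values

def generator_loss(D, cond, fake, real, vgg_features):
    """Adversarial + content + perceptual terms for one batch."""
    logits = D(cond, fake)
    adv = bce(logits, torch.ones_like(logits))                # fool the patch discriminator
    content = l1(fake, real)                                  # output should resemble ground truth
    perceptual = sum(l1(f, r) for f, r in                     # distances between VGG activations
                     zip(vgg_features(fake), vgg_features(real)))
    return adv + W_CONTENT * content + W_PERCEPTUAL * perceptual

def discriminator_loss(D, cond, fake, real):
    real_logits, fake_logits = D(cond, real), D(cond, fake.detach())
    return bce(real_logits, torch.ones_like(real_logits)) + \
           bce(fake_logits, torch.zeros_like(fake_logits))

# Reported settings: learning rate 0.0002, first momentum 0.5, batch size 12
# (Adam with beta1 = 0.5 is assumed here).
# opt_g = torch.optim.Adam(G.parameters(), lr=2e-4, betas=(0.5, 0.999))
# opt_d = torch.optim.Adam(D.parameters(), lr=2e-4, betas=(0.5, 0.999))
```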
Results

Limitations and Conclusion
This work is a nice step forward towards real-time, hands-free egocentric video conferencing for mobile eyewear devices. The results show much promise, and in the words of Károly from Two Minute Papers, two more papers down the line and we will see considerable progress.
The proposed method has a few novel bits, and the authors did a fabulous job of pointing out the limitations of their work. To list a few:
- The solution is person-specific: the cGAN is trained on sequences from a single individual. It is also limited to the expressions seen at training time.
- Testing on unseen individuals hallucinates incorrect renderings with strong visual artifacts.
- The egocentric camera setup is bulky.
- The method removes the dynamic background.
- The method can struggle with scenes shot in very dark illumination, leading to artifacts.
I hope you have enjoyed this paper breakdown. I would love to know what you think about the proposed method; I found it a clever and interesting use case of conditional GANs.