Improving Deepfake Performance with Data

Boris Dayma

In this project, I use the faceswap repo to swap faces between Lukas and Chris, who were kind enough to let me play with their pictures for this experiment. I’ll focus in particular on the effect of data in improving the model.

The challenge is clear: we are trying to swap faces between people with different skin tones and facial hair.

Usually, you would want to find people with similar features in order to improve the success of your face swapping. In many cases, you swap only the inner part of the face. Some projects even focus only on the mouth section to create fake speeches. Deepfakes also handles the "stitching and blending" part of generated images with more traditional algorithms.

In this post, we will only focus on the Deep Learning part related to generating faces.

To illustrate the difficulty of the challenge, here is an example of what we would achieve with default settings (the red box shows which part of the image is generated).

We can notice a few issues:

In addition, the difference between long and short hair makes it hard to define the contour of the face to reproduce.

Those are often solved (at least partly) by:

In this case, we are going to keep trying to improve the results with the two people selected.

The first step that comes to mind is to crop around a larger section of the face.

Here is what I obtained after a few hours of training.

The image spaces to consider are:

The results are shown in groups of 3 images (from left to right):

Looking at those images (I stopped training only after a few hours) I quickly noticed a few points:

I had to go back to the data collected to understand this imbalance in quality of results.

Initial data was collected from W&B tutorials.

Lukas has a richer data-set, with a larger diversity of backgrounds, orientations, light exposure, etc. in his image space (even though he really likes his blue and grey shirts for his tutorials). On the other hand, Chris's data-set is based on a single video in which he is always turned towards the same side. This could explain the imbalance in the quality of the generated pictures.

I also looked at the implementation in more detail, especially the structure of the encoders and decoders as well as the loss definitions. I found the guide from torzdf very useful for better understanding the main principles, in particular the following diagram.

We have 2 sets of images, each associated with a different person, sharing the same encoder. We then use a decoder specific to each data-set to try to recreate the original image.

The job of the encoder is to capture unique details of a specific image while the decoder needs to recreate an image of a person (already known) through the features that have been provided by the encoder.

We hope that, by using a shared encoder between the 2 faces, the representation in the latent space (after the encoder) will consistently capture relevant details (expression, mouth shape, eyes open/closed) that can transfer to a different decoder (associated with another face).

Ideally, features unique to a person such as hair style, eye color and skin tone would live only in the decoder, while features relevant to everybody such as facial expression or smile would be provided by the encoder.
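To make the wiring concrete, here is a minimal sketch of one shared encoder feeding two person-specific decoders. It is not the faceswap implementation: the class name `FaceSwapSketch`, the single linear layers, the flattened-image input and the mean absolute error loss are all assumptions chosen to keep the example short and dependency-free.

```python
import random


def make_matrix(rows, cols, rng):
    """Small random weight matrix stored as a list of rows."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class FaceSwapSketch:
    """Hypothetical toy model: one shared encoder, one decoder per person."""

    def __init__(self, image_size, latent_size, seed=0):
        rng = random.Random(seed)
        # The encoder is shared: it sees faces from both people.
        self.encoder = make_matrix(latent_size, image_size, rng)
        # Each decoder only ever learns to draw one person.
        self.decoders = {
            "A": make_matrix(image_size, latent_size, rng),
            "B": make_matrix(image_size, latent_size, rng),
        }

    def encode(self, image):
        return matvec(self.encoder, image)

    def decode(self, latent, identity):
        return matvec(self.decoders[identity], latent)

    def reconstruct(self, image, identity):
        """Training path: an image of a person goes through that person's decoder."""
        return self.decode(self.encode(image), identity)

    def swap(self, image, target_identity):
        """Swap path: same encoder, but the other person's decoder."""
        return self.decode(self.encode(image), target_identity)


def reconstruction_loss(original, generated):
    """Mean absolute error (an assumed loss, used here only for illustration)."""
    return sum(abs(o - g) for o, g in zip(original, generated)) / len(original)


model = FaceSwapSketch(image_size=16, latent_size=4)
face_a = [0.5] * 16  # stand-in for a flattened picture of person A
loss_a = reconstruction_loss(face_a, model.reconstruct(face_a, "A"))
swapped = model.swap(face_a, "B")  # A's expression drawn by B's decoder
```

The only difference between training and swapping is which decoder receives the latent vector, which is why anything the encoder fails to keep identity-neutral ends up leaking into the swapped face.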

One risk I would anticipate is that the encoder could recognize the face and generate different features in the latent space based on who it’s looking at, maybe because one face produces more complex eyebrow shapes while the other has more complex mouth expressions (so the encoder/decoder stack would follow different strategies to optimize the loss for each face).

I didn’t investigate this theory but a solution would be to create a general purpose encoder (for any face) and discard the decoder part. It may however need a larger latent space.

In order to improve my results, I always evaluate whether I should focus on data vs model. In this particular instance:

This clearly showed me that I should try to gather more data, which was very easy on my part as I just had to ask for it!

Chris gave me 5 short clips in diverse environments and poses, leading to a much richer data set (even though they were all taken the same day, so the haircut and facial hair are exactly the same).

The training graph shows 2 gaps which correspond to when I added more data (once on “A” and once on “B”). The loss on “B” is higher probably due to the difficulty of generating realistic hair and beard.

This led to much more interesting generated samples.

Only once I reached those results did I decide to try to improve the results through the neural network architecture.

Using a more powerful stack of encoder/decoder (and slower training), the quality of generated samples improved (going from “original” implementation to “dfaker”) and it became much more fun to go through the pictures!

I actually let the training run for several days to see when it would start over-fitting. I expected generated samples to start becoming worse. Even though I had a relatively small data-set, it never happened, which suggests the model is more robust than I expected.
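In this experiment over-fitting was judged by eye from the generated samples. A numeric complement, sketched below under the assumption that a loss is also logged on a few held-out images (something the run described here did not do), is to look for the point after which that loss stops improving. The function name `overfit_onset` and the `patience` parameter are hypothetical.

```python
def overfit_onset(val_losses, patience=3):
    """Return the index of the best held-out loss once it has failed to
    improve for `patience` consecutive checks, or None if that never happens."""
    best_index = 0
    for index, loss in enumerate(val_losses):
        if loss < val_losses[best_index]:
            best_index = index  # still improving
        elif index - best_index >= patience:
            return best_index  # no improvement for `patience` checks
    return None


# Held-out loss that bottoms out at the third check, then creeps back up.
rising = [1.0, 0.8, 0.7, 0.72, 0.75, 0.8]
# Held-out loss that keeps improving, as the samples here appeared to.
improving = [1.0, 0.9, 0.8, 0.7, 0.6]

print(overfit_onset(rising))     # 2
print(overfit_onset(improving))  # None
```

A result of `None` over a multi-day run would match what I observed visually: no point at which the outputs started getting worse.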

While not perfect, the results are quite impressive considering the fact that the data set is made up of just a few tutorial videos for Lukas and 5 short clips of 20-30 seconds for Chris.

It is interesting to note the limitations of this model. In particular it would be easy for a human to know which images have been generated:

Feel free to visualize my final run in more detail or experiment from our fork.
