
Improving Deepfake Performance with Data

Generating high-quality faceswaps with better data
In this project, I use the faceswap repo to swap faces between Lukas and Chris, who were kind enough to let me play with their pictures for the purpose of this experiment. I’ll focus in particular on the effect of data on improving the model.

The challenge is clear: we are trying to swap faces between two people with different skin tones and facial hair.
Usually, you would want to find people with similar features in order to improve the success of your face swap. In many cases, you swap only the inner part of the face. Some projects even focus only on the mouth section to create fake speeches. The faceswap repo also handles the "stitching and blending" of generated images with more traditional algorithms.
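For intuition only (since we won’t cover it further here), below is a minimal sketch of what such a blending step could look like, assuming OpenCV’s Poisson blending; the helper, elliptical mask, and bounding-box placement are my own illustration, not the repo’s converter code.

```python
# Hypothetical sketch of the "stitching and blending" step, assuming OpenCV.
# The faceswap repo uses its own converter plugins; this is not its actual code.
import cv2
import numpy as np

def blend_generated_face(frame, generated_face, face_box):
    """Paste a generated face crop back into the original frame with Poisson blending."""
    x, y, w, h = face_box                      # bounding box of the face in the frame
    face = cv2.resize(generated_face, (w, h))  # match the crop size

    # Simple elliptical mask covering the inner part of the face
    mask = np.zeros((h, w), dtype=np.uint8)
    cv2.ellipse(mask, (w // 2, h // 2), (w // 2 - 2, h // 2 - 2), 0, 0, 360, 255, -1)

    center = (x + w // 2, y + h // 2)
    # Poisson (seamless) cloning smooths color and lighting differences at the boundary
    return cv2.seamlessClone(face, frame, mask, center, cv2.NORMAL_CLONE)
```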
In this post, we will only focus on the Deep Learning part related to generating faces.
To illustrate the difficulty of the challenge, here is an example of what we would achieve with default settings (the red box shows which part of the image is generated).

We can notice a few issues:
  • skin tone is blended fairly well, but zooming in, we can clearly see the differences
  • hair and beard are extremely difficult to reconstruct, even more so because the cropped section of the face may include only one side
In addition, the contrast between long and short hair makes it harder to properly define the contour of the face to reproduce.
These issues are often solved (at least partly) by:
  • using people with the same skin tone
  • using people with similar facial hair
  • limiting the samples to front faces only (or only slight angles to the sides)
  • when working with people with long hair, choosing footage where the face is clear (no hair falling over the front of the face)
In this case, we are going to keep trying to improve the results with the two people selected.
The first step that comes to mind is to crop a larger section around the face.
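As a rough sketch of the idea (my own code, not the repo’s implementation), widening a centered crop of an aligned face could look like this:

```python
# Hypothetical helper illustrating a wider crop around an aligned face.
# `coverage` is the fraction of the aligned image kept in the training crop.
import numpy as np

def crop_face(aligned_face: np.ndarray, coverage: float = 0.85) -> np.ndarray:
    """Return a centered square crop covering `coverage` of the aligned face image."""
    size = aligned_face.shape[0]          # aligned faces are square
    crop = int(size * coverage) // 2 * 2  # keep an even crop size
    offset = (size - crop) // 2
    return aligned_face[offset:offset + crop, offset:offset + crop]

# A larger coverage keeps more hair, jawline and background in the training crop,
# so the model has to learn to generate those regions as well.
```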
Here is what I obtained after a few hours of training.


The image spaces to consider are:
  • “A” space (or “Original”) for Lukas (the non-bearded guy)
  • “B” space (or “Swap”) for Chris (the bearded guy)
The results are shown in groups of 3 images (from left to right):
  • Input image (whether from “A” on the left set or “B” on the right set)
  • Generated image staying in the same image space
  • Generated image going from one image space to the other
Looking at those images (I stopped training after only a few hours), I quickly noticed a few points:
  • It took longer to generate good images than previously, which is understandable since we need to recreate a larger part of the image
  • The image set on the right (“Swap > Original”) seems to be better generated than the one on the left (“Original > Swap”) => going from “B” to “A” seems to be easier than the opposite
  • When generating images of “B” (see “Original > Swap”), the background was always poorly drawn => the background in “B” space is not well identified and is drawn as hair
  • In “B” space (Chris), the right side of the face is always better drawn than the left side
I had to go back to the data collected to understand this imbalance in quality of results.



Initial data was collected from W&B tutorials.
Lukas has a richer data set, with a larger diversity of backgrounds, orientations, light exposure, etc. in his image space (even though he really likes his blue and grey shirts for his tutorials). On the other hand, Chris’s data set is based on one single video in which he is always turned towards the same side. This could explain the imbalance in the quality of the generated pictures.
I also looked at the implementation in more detail, especially at the structure of the encoders & decoders as well as the loss definitions. I found the guide from torzdf very useful for better understanding the main principles, in particular the following diagram.


We have 2 sets of images associated with different people sharing the same encoder. We then use a decoder specific to each data set to try to recreate the original image.
The job of the encoder is to capture the unique details of a specific image, while the decoder needs to recreate an image of a person (already known) from the features provided by the encoder.
We hope that, by using a shared encoder between the 2 faces, the representation in the latent space (after the encoder) will consistently capture relevant details (expression, mouth shape, eyes open/closed) that can transfer to a different decoder (associated with another face).
Ideally, features unique to a person, such as hair style, eye color, or skin tone, would live only in the decoder, while features relevant to everybody, such as facial expression or smile, would come from the encoder.
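To make that structure concrete, here is a minimal PyTorch sketch of a shared encoder with one decoder per identity; the layer sizes, 64x64 input, and reconstruction losses are my own simplifications, not the repo’s actual architecture.

```python
# Simplified sketch of the shared-encoder / two-decoder autoencoder idea.
# Layer sizes are illustrative and do not match the faceswap repo's models.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, latent_dim: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 5, stride=2, padding=2), nn.LeakyReLU(0.1),
            nn.Conv2d(64, 128, 5, stride=2, padding=2), nn.LeakyReLU(0.1),
            nn.Conv2d(128, 256, 5, stride=2, padding=2), nn.LeakyReLU(0.1),
            nn.Flatten(),
            nn.Linear(256 * 8 * 8, latent_dim),  # assumes 64x64 input faces
        )

    def forward(self, x):
        return self.net(x)

class Decoder(nn.Module):
    def __init__(self, latent_dim: int = 512):
        super().__init__()
        self.fc = nn.Linear(latent_dim, 256 * 8 * 8)
        self.net = nn.Sequential(
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.1),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.1),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, z):
        x = self.fc(z).view(-1, 256, 8, 8)
        return self.net(x)

encoder = Encoder()
decoder_a = Decoder()  # Lukas ("A" / "Original")
decoder_b = Decoder()  # Chris ("B" / "Swap")

# Training: each decoder reconstructs its own identity from the shared latent code, e.g.
#   loss_a = l1(decoder_a(encoder(batch_a)), batch_a)
#   loss_b = l1(decoder_b(encoder(batch_b)), batch_b)
# Swapping at inference time: decoder_b(encoder(frame_of_a))
```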
One risk I would anticipate is that the encoder could recognize the face and generate different features in the latent space depending on who it’s looking at, perhaps because a particular face produces more complex eyebrow shapes while another has more complex mouth expressions (leading to different strategies in terms of loss optimization for the encoder/decoder stack).
I didn’t investigate this theory, but a solution would be to train a general-purpose encoder (for any face) and discard the decoder part. It may, however, need a larger latent space.
In order to improve my results, I always evaluate whether I should focus on data vs model. In this particular instance:
  • training takes 1-2 days to get decent results
  • it is hard to evaluate the quality of a generative model from metrics alone (results are typically subjective or need to be evaluated by a panel of several participants)
  • this repo is popular, so we can expect that the model architectures are pretty good
  • I only used one short video for Chris, so I had a very limited amount of data
This clearly showed me that I should try to gather more data, which was very easy on my part as I just had to ask for it!
Chris gave me 5 short clips in diverse environments and poses, leading to a much richer data set (even though they were taken on the same day, leading to the exact same haircut and facial hair).

The training graph shows 2 gaps, which correspond to when I added more data (once for “A” and once for “B”). The loss on “B” is higher, probably due to the difficulty of generating realistic hair and beard.
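For reference, tracking the two losses shown in this graph only takes a couple of wandb.log calls; the sketch below uses placeholder loss values and metric names of my own choosing rather than the repo’s built-in logging.

```python
# Minimal sketch of logging per-identity losses to Weights & Biases.
# The project name and metric names are my own, not the repo's.
import random
import wandb

wandb.init(project="faceswap-lukas-chris")

for step in range(1000):
    # In the real training loop these would be the "A" and "B" reconstruction losses;
    # random values stand in here so the sketch runs on its own.
    loss_a = random.random()
    loss_b = random.random()
    wandb.log({"loss_A": loss_a, "loss_B": loss_b}, step=step)
```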


This led to much more interesting generated samples.



Only once I reached those results did I decide to try to improve them further through the neural network architecture.
Using a more powerful encoder/decoder stack (with slower training), the quality of generated samples improved when going from the “original” implementation to “dfaker”, and it became much more fun to go through the pictures!
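As an illustration of what “more powerful” can mean here, such decoders often swap plain upsampling for sub-pixel (PixelShuffle) upscale blocks that produce larger, sharper outputs; the block below is my own sketch of that idea, not dfaker’s actual code.

```python
# Hedged sketch of a sub-pixel "upscale" block; my own simplification, not dfaker's code.
import torch
import torch.nn as nn

class UpscaleBlock(nn.Module):
    """Conv2d producing 4x the channels, then PixelShuffle to double the resolution."""
    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels * 4, kernel_size=3, padding=1)
        self.shuffle = nn.PixelShuffle(2)  # rearranges channels into a 2x larger feature map
        self.act = nn.LeakyReLU(0.1)

    def forward(self, x):
        return self.act(self.shuffle(self.conv(x)))

# Example: a 256-channel 8x8 feature map becomes a 128-channel 16x16 one
x = torch.randn(1, 256, 8, 8)
print(UpscaleBlock(256, 128)(x).shape)  # torch.Size([1, 128, 16, 16])
```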





I actually let the training run for several days to see when it would start over-fitting, expecting the generated samples to start becoming worse. Despite my relatively small data set, it never happened, which suggests the model is more robust than I expected.
While not perfect, the results are quite impressive considering the fact that the data set is made up of just a few tutorial videos for Lukas and 5 short clips of 20-30 seconds for Chris.
It is interesting to note the limitations of this model. In particular it would be easy for a human to know which images have been generated:
  • Hair style and facial hair are particularly affected, showing they are encoded as features by the encoder when they should belong to the person-specific decoder instead. Actually, after several days, the model seems to start learning that Lukas is supposed to have short hair while Chris should have long hair (though it has more trouble generating it, and it remains blurry).
  • The background becomes blurry when changing space, showing that the model has not completely learnt that the background is independent from the generated face (it would need more pictures with diverse backgrounds)
Feel free to explore my final run in more detail or experiment from our fork.




