Pivotal Tuning for Latent-Based Editing of Real Images

A recent method edits facial attributes in real photographs without losing identity-defining details such as tattoos and pimples. This report is a walkthrough of that paper.
Sayantan Das

Contents

  1. Introduction
  2. The PTI Method
  3. Experiments
  4. Additional Observations
  5. Conclusion

1. Introduction

Welcome to my report on "Pivotal Tuning for Latent-Based Editing of Real Images (PTI)," which is part one of my series on StyleGAN Inversion! Below, you'll find a quick intro, a description of the method, and a lot of tangible, interactive examples of the technique. Without further ado, let's get going!
The paper we'll be talking about came out on arXiv in June this year and it's generated a lot of interest in certain Twitter circles. For example, here's one by popular AI practitioner Ahsen Khaliq.
At a high level: when we use an inversion method such as the StyleGANv2 projector, the input image X is fed into a network that maps it to a latent vector representation z.
We then feed that latent vector into the original generator (StyleGANv2 in our case) to obtain an image.
We will discuss more of this in the section below.
Demonstrating the stochasticity of StyleGAN inversion and re-generation with a handy gif (Made by Sayantan Das)
So what do we want here? We want the vector z to give us back the image X every time we feed it into the generator. However, owing to the stochasticity of the generator, it is not possible to establish a bijection between z and its corresponding X. This is where PTI comes in: it ensures that z reproduces the original X and keeps the image editable, without losing any high-fidelity facial features.
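To make the pipeline concrete, here is a minimal conceptual sketch in PyTorch. `E` (an encoder or projector mapping an image to a latent) and `G` (a pretrained StyleGANv2-style generator) are placeholder names for illustration, not an actual library API:

```python
import torch

# Conceptual sketch of the inversion round trip described above.
# `E` and `G` are assumed callables (placeholders, not real library APIs).
@torch.no_grad()
def invert_then_regenerate(E, G, x):
    w = E(x)      # image -> latent code (GAN inversion)
    x_hat = G(w)  # latent code -> reconstructed image
    # x_hat is usually close to x, but fine details such as tattoos
    # or pimples are often lost in the round trip.
    return w, x_hat
```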

2. The PTI Method

The goal of this report is to make attribute edits, such as head pose, smile, and age transformations, to real images.
We could use GAN Inversion techniques to obtain a latent vector, but traversing that vector in specific directions will not, by itself, provide the desired manipulation control.
The goal of PTI is to support both high-fidelity controllable edits and realistic image generation.
The methodology is simple and intuitive (a sketch of the inversion step follows this list):
1. Perform GAN Inversion on the input image X to obtain the latent vector w_{pivot}.
2. Feed w_{pivot} into the generator G to obtain a new image X^{pivot}.
3. Tune the weights of G so that w_{pivot} generates X rather than X^{pivot}.
4. Guide the Pivotal Tuning feedback with two losses: LPIPS and the L2 distance.
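Here is a hedged sketch of the inversion step (step 1), assuming we directly optimize a single latent w starting from an average latent `w_avg`; `G` is a placeholder for the frozen, pretrained generator, and the real StyleGANv2 projector adds further tricks (e.g. noise regularization) that are omitted here:

```python
import torch
import lpips  # perceptual loss, explained below

def find_pivot(G, x, w_avg, num_steps=450, lr=5e-3):
    lpips_fn = lpips.LPIPS(net='alex')
    w = w_avg.clone().requires_grad_(True)   # only the latent is optimized
    opt = torch.optim.Adam([w], lr=lr)       # G stays frozen in this step
    for _ in range(num_steps):
        x_hat = G(w)
        loss = lpips_fn(x, x_hat).mean() + torch.nn.functional.mse_loss(x, x_hat)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return w.detach()                        # this is w_pivot
```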
What is LPIPS? The Learned Perceptual Image Patch Similarity (LPIPS) metric measures the distance between two images in a learned feature space. Both images in LPIPS(I_1, I_2) are fed through a pretrained network (AlexNet, VGG16, or SqueezeNet) to obtain embedding vectors v_1 and v_2, over which an L2 distance is computed. A PyTorch implementation of LPIPS can be found here.
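For reference, the `lpips` PyTorch package linked above can be used as follows; the image sizes here are arbitrary dummy values for illustration:

```python
import torch
import lpips

# LPIPS expects (N, 3, H, W) tensors scaled to [-1, 1].
loss_fn = lpips.LPIPS(net='alex')          # backbone: 'alex', 'vgg', or 'squeeze'
img0 = torch.rand(1, 3, 256, 256) * 2 - 1  # two dummy images
img1 = torch.rand(1, 3, 256, 256) * 2 - 1
distance = loss_fn(img0, img1)             # perceptual distance between img0 and img1
print(distance.item())
```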
PTI backpropagates these losses to tune the generator G, yielding G_{tuned} and making our equations look something like this:
G^{-1}(X) = w_{pivot}
G_{tuned}(w_{pivot}) = X
The equations above are the idealized, closed-form statement; in practice this is realized as an optimization. A generator undergoing the tuning procedure, which yields G^*, follows these equations:
G(w_{pivot}) = X^{pivot}
\mathrm{L}_{pt} = \mathrm{LPIPS}(X, X^{pivot}) + \lambda_{reg} \cdot \mathrm{L2}(X, X^{pivot})
\theta^{*} = \theta - \eta \, \frac{\partial \mathrm{L}_{pt}}{\partial \theta}
where \theta denotes the weights of G and \eta is the learning rate; the updated weights \theta^{*} give us the tuned generator G^*.
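Below is a hedged sketch of what this optimization might look like in PyTorch. `G` is again a placeholder for the pretrained generator being fine-tuned, `w_pivot` comes from the inversion step and stays fixed, and the hyperparameters are illustrative rather than the authors' exact values:

```python
import torch
import lpips

def pivotal_tuning(G, w_pivot, x, num_steps=350, lr=3e-4, lambda_reg=1.0):
    lpips_fn = lpips.LPIPS(net='alex')
    opt = torch.optim.Adam(G.parameters(), lr=lr)  # now G's weights are updated
    w_pivot = w_pivot.detach()                     # the pivot itself stays fixed
    for _ in range(num_steps):
        x_pivot = G(w_pivot)                       # current reconstruction
        loss = (lpips_fn(x, x_pivot).mean()
                + lambda_reg * torch.nn.functional.mse_loss(x, x_pivot))
        opt.zero_grad()
        loss.backward()
        opt.step()
    return G                                       # the tuned generator G*
```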

Note

Here, G^{-1} denotes a GAN Inversion method; it does not mean the generator function is invertible.
Now, let's see this in action.

3. Experiments

Below, use the slider on the right (the one that says "Index") to move through age, rotation, and facial expression. There's also a short gif at the bottom of this panel that cycles through each change.

StyleCLIP based directional editing

StyleCLIP has become a very popular approach to controllable image generation. It works by providing a text prompt that guides the latent traversal in StyleGAN's W-space. Like so:
Collected from https://github.com/orpatashnik/StyleCLIP
The authors of PTI provide directions for only a few attributes; the images generated with them are shown below:
Report Gallery
Please check out the above-linked report on similar work in editable image generation. They use a StyleGAN Encoder -- similar to what we employ for GAN Inversion.
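For completeness, here is a small hedged sketch of how an edit direction (for example one produced by StyleCLIP's global-directions approach) could be applied on top of the tuned generator. `G_tuned`, `w_pivot`, and the direction tensor `d` are placeholders for the objects produced earlier in this report:

```python
import torch

@torch.no_grad()
def apply_edit(G_tuned, w_pivot, d, alpha=3.0):
    # Move along the direction in W-space; alpha controls the edit strength.
    w_edit = w_pivot + alpha * d
    return G_tuned(w_edit)  # edited image that still preserves identity details
```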

4. Additional Observations

Several caveats can be observed while experimenting with PTI.

Me when I was a kid
Me Originally
Me when I wanted to smile less

5. Conclusion

PTI preserves identity very well compared to the various other methods that have come out since early 2020. However, future methods that build on this approach should focus on fully disentangling all combinations of attribute factors in the image. Ethnicity preservation is also needed, which is likewise good practice toward more ethical AI systems.

Contact me on GitHub or Twitter with any questions regarding this report.
https://github.com/ucalyptus
https://twitter.com/sayantandas_