
Pivotal Tuning for Latent-Based Editing of Real Images

A recently proposed method edits facial attributes without losing fine identity details such as tattoos and pimples. This report is a journey through the paper.
Created on August 5|Last edited on October 10
Made by Sayantan Das


1. Introduction

Welcome to my report on "Pivotal Tuning for Latent-Based Editing of Real Images (PTI)," which is part one of my series on StyleGAN inversion! Below, you'll find a quick intro, a description of the method, and a lot of tangible, interactive examples of the technique. Without further ado, let's get going!
The paper we'll be talking about came out on arXiv in June this year, and it has generated a lot of interest in certain Twitter circles. For example, here is one post by popular AI practitioner Ahsen Khaliq.

At a high level: what you see above is that when we use an inversion method such as the StyleGANv2 projector, the input image $X$ is mapped into a latent vector representation $z$.
We then feed this latent vector into the original generator (StyleGANv2 in our case) to obtain an image.
We will discuss more of this in the section below.
Demonstrating the stochasticity of StyleGAN inversion and re-generation with a handy gif (Made by Sayantan Das)
So what do we want here?

We want the vector $z$ to give us back the image $X$ every time we feed it into the generator. However, owing to the stochasticity of the inversion and re-generation process, there is no exact bijection between $z$ and its corresponding $X$, and vice versa.

This is where PTI comes in: it ensures that $z$ gives us back the original $X$ while also making the image editable without losing any high-fidelity facial features.

2. The PTI Method

The goal of this report is to make attribute edits, such as head pose, smile, and age transformations, to real images.
We could use an existing GAN inversion technique to obtain a latent vector, but traversing the latent space in specific directions from that vector will not provide the desired manipulation control.
The goal of PTI is to support both high-fidelity controllable edits and realistic image generation.
The methodology is simple and intuitive:
1. Perform GAN inversion on the input image $X$ to obtain the latent vector $w_{pivot}$ (a minimal sketch of this step follows the list).
2. Feed $w_{pivot}$ into the generator $G$ to obtain a new image, called $X_{pivot}$.
3. Tune the generator $G$'s weights such that $w_{pivot}$ generates $X$ rather than $X_{pivot}$.
4. The pivotal tuning feedback is guided by two losses: LPIPS and L2 distance.
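Step 1 is a standard latent-optimization inversion. Below is a minimal, illustrative sketch (not the authors' released code) of what it might look like, assuming a pretrained StyleGAN generator `G` that maps a W-space latent to an image and its average latent `w_avg`; the LPIPS loss it uses is explained right after.

```python
import torch
import torch.nn.functional as F
import lpips  # pip install lpips

def invert(G, w_avg, X, steps=500, lr=0.01):
    """Sketch of step 1: optimize a W-space latent so that G(w) reconstructs X."""
    percep = lpips.LPIPS(net='vgg')                    # perceptual loss (see the LPIPS section below)
    w = w_avg.detach().clone().requires_grad_(True)    # start the search from the average latent
    opt = torch.optim.Adam([w], lr=lr)
    for _ in range(steps):
        X_hat = G(w)                                   # current reconstruction
        loss = percep(X_hat, X).mean() + F.mse_loss(X_hat, X)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return w.detach()                                  # this frozen latent becomes w_pivot
```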
What is LPIPS?

The Learned Perceptual Image Patch Similarity (LPIPS) metric measures the distance between two images in a learned feature space. Both images in $\mathrm{LPIPS}(I_1, I_2)$ are fed through a pretrained network (AlexNet, VGG16, or SqueezeNet) and their embedding vectors $v_1, v_2$ are extracted, over which an L2 distance is computed.

An implementation of LPIPS for PyTorch can be found here.
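As a quick, hedged illustration of how the metric is typically used, here is a minimal sketch with the `lpips` PyTorch package linked above; the images here are dummy tensors, assumed to be in NCHW format and scaled to [-1, 1].

```python
import torch
import lpips  # pip install lpips

# Build the metric on top of a pretrained backbone: 'alex', 'vgg', or 'squeeze'.
loss_fn = lpips.LPIPS(net='vgg')

# Two dummy RGB images in NCHW format, scaled to the expected [-1, 1] range.
I1 = torch.rand(1, 3, 256, 256) * 2 - 1
I2 = torch.rand(1, 3, 256, 256) * 2 - 1

# The perceptual distance: an L2-style comparison of deep-feature embeddings.
distance = loss_fn(I1, I2)
print(distance.item())
```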
PTI backpropagates these losses to tune the generator $G$ and obtain $G_{tuned}$, making our equations look something like this:
$$G^{-1}(X) = w_{pivot}$$

$$G_{tuned}(w_{pivot}) = X$$

The equations above describe the desired end state in closed form; in practice, it is reached through optimization. A generator undergoing the tuning procedure follows the equations below to arrive at $G^*$:
$$G(w_{pivot}) = X_{pivot}$$

$$\mathcal{L}_{pt} = \mathrm{LPIPS}(X, X_{pivot}) + \lambda_{reg} \cdot \mathrm{L2}(X, X_{pivot})$$

$$\theta^* = \theta - \eta \, \frac{\partial \mathcal{L}_{pt}}{\partial \theta}$$

where $\theta$ denotes the generator's weights (with $w_{pivot}$ kept fixed) and $\eta$ is the learning rate,

which gives us the tuned generator $G^*$.

Note

$G^{-1}$ here denotes a GAN inversion method; it does not mean that the generator function is invertible.
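To make the tuning step concrete, here is a minimal, hedged sketch of what the pivotal tuning loop could look like in PyTorch. It is an approximation under stated assumptions rather than the authors' released implementation: `G` is assumed to be a pretrained StyleGAN generator callable on a W-space latent, `w_pivot` comes from the inversion step above, and `X` is the real image.

```python
import copy
import torch
import torch.nn.functional as F
import lpips  # pip install lpips

def pivotal_tune(G, w_pivot, X, steps=350, lr=3e-4, lambda_reg=1.0):
    """Sketch of pivotal tuning: freeze w_pivot and optimize the generator weights."""
    percep = lpips.LPIPS(net='vgg')
    G_tuned = copy.deepcopy(G)                       # keep the original generator untouched
    opt = torch.optim.Adam(G_tuned.parameters(), lr=lr)
    for _ in range(steps):
        X_pivot = G_tuned(w_pivot)                   # reconstruction from the fixed pivot
        loss = percep(X_pivot, X).mean() + lambda_reg * F.mse_loss(X_pivot, X)
        opt.zero_grad()
        loss.backward()                              # gradients update the generator weights,
        opt.step()                                   # not w_pivot, which stays frozen
    return G_tuned                                   # this plays the role of G_tuned / G^*
```

After this loop, feeding $w_{pivot}$ into the tuned generator should reproduce $X$ almost exactly, while the latent space around $w_{pivot}$ remains editable.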

Now, let's see this in action.

3. Experiments

Below, use the slider on the right (the one that says "Index") to move through age, rotation, and facial expression. There's also a short gif at the bottom of this panel that cycles through each change.

[Interactive W&B panel: run set of 20 runs]


StyleCLIP-based directional editing

StyleCLIP has become a very popular approach to controllable image generation. It works by providing a text prompt that guides the latent traversal in StyleGAN's W-space. Like so:
The authors of PTI could only provide us with the edit directions for a few attributes; the images generated with them are shown below:

[Interactive W&B panel: run set of 1 run]
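To show how such an edit direction is applied once the tuned generator is available, here is a short, hedged sketch; `direction` (a precomputed W-space edit direction, e.g., one obtained via StyleCLIP) and the strength `alpha` are assumed inputs for illustration, not part of the original report.

```python
import torch

def edit(G_tuned, w_pivot, direction, alpha=2.0):
    """Sketch: traverse W-space along a precomputed edit direction and re-synthesize."""
    w_edit = w_pivot + alpha * direction   # move the pivot latent along the edit direction
    return G_tuned(w_edit)                 # identity details are preserved by pivotal tuning
```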


Please check out the above-linked report on similar work in editable image generation. They use a StyleGAN encoder, similar to what we employ for GAN inversion.

4. Additional Observations

Several caveats can be observed while experimenting with PTI.

  • Loss of ethnicity (image panels: "Me when I was a kid", "Me originally", "Me when I wanted to smile less")
  • Entangled attributes: while editing an image along a single attribute, various other attributes (often correlated because of the nature of the data distribution StyleGAN was trained on, FFHQ in our case) are also affected.
  • For example: imagine you wanted to smile more, but that also led to the loss of the beard from your face.
    My beard usually stays on my face while I smile. YMMV

5. Conclusion

PTI preserves identity very well compared to the various other methods that have come out since early 2020. However, future methods that build on its success should focus on fully disentangling all possible combinations of attribute factors in an image. Moreover, ethnicity preservation is needed, which is also good practice toward more ethical AI systems.



Contact me on GitHub or Twitter with any questions regarding this report.