Pivotal Tuning for Latent-Based Editing of Real Images

A recent method edits facial attributes in real images without losing fine details such as tattoos and pimples. This report walks through the paper.
Sayantan Das
Created on August 5|Last edited on October 10
Contents
Introduction
PTI Method
Experiments
Additional Observations
Conclusion
1. Introduction
Welcome to my report on "Pivotal Tuning for Latent-based Editing of Real Images" (PTI), which is part one of my series on StyleGAN inversion! Below, you'll find a quick intro, a description of the method, and plenty of interactive examples of the technique. Without further ado, let's get going!
The paper we'll be talking about came out on arXiv in June this year and has generated a lot of interest in certain Twitter circles. For example, here is one tweet by popular AI practitioner Ahsen Khaliq.
﻿
At a high level: when we use an inversion method such as the StyleGAN2 projector, the input image $X$ is mapped to a latent vector representation $z$.
We then feed that latent vector into the original generator (StyleGAN2 in our case) to obtain an image.
We will discuss this in more detail in the section below.
Demonstrating the stochasticity of StyleGAN inversion and re-generation with a handy gif (Made by Sayantan Das)
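The inversion and re-generation round trip can be sketched on a toy problem. This is only an illustration under loud assumptions: the "generator" is a fixed random linear map (not StyleGAN2), and the names `generate` and `invert` are made up for this sketch. It shows that an image the generator can already produce is recovered almost exactly, while an arbitrary "real" image is not.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "generator": a fixed linear map from a 4-d latent to a 16-d "image".
# This stands in for StyleGAN2 and is an assumption made for illustration.
A = rng.normal(size=(16, 4))

def generate(z):
    return A @ z

def invert(x, steps=2000, lr=0.01):
    """Find z minimising ||generate(z) - x||^2 by gradient descent."""
    z = np.zeros(4)
    for _ in range(steps):
        residual = generate(z) - x
        z -= lr * 2.0 * A.T @ residual  # gradient of the squared error
    return z

# An image the generator can produce exactly, and one it cannot.
x_in_range = generate(rng.normal(size=4))
x_real = rng.normal(size=16)

err_in_range = np.linalg.norm(generate(invert(x_in_range)) - x_in_range)
err_real = np.linalg.norm(generate(invert(x_real)) - x_real)
print(f"in-range error: {err_in_range:.6f}, real-image error: {err_real:.4f}")
```

The gap between the two errors is the reconstruction problem that the rest of this report is about.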
So what do we want here?
﻿
We want the vector $z$ to give us the image $X$ every time we feed it into the generator. However, owing to the stochasticity of the generator, it is not possible to establish a bijection between $z$ and its corresponding $X$.

This is where PTI comes in: it ensures that $z$ gives us back the original $X$ and keeps it editable, without losing any high-fidelity facial features.
2. The PTI Method
The goal of this report is to make attribute edits such as head pose, smile, and age transformations to real images.
We could apply a GAN inversion technique to obtain the latent vector, but traversing from that vector in specific directions does not by itself give the desired control over the manipulation.
The goal of PTI is to support both high-fidelity controllable edits and realistic image generation.
The methodology is simple and intuitive:
	1. Perform GAN inversion on the input image $X$ to obtain the latent vector $w_{pivot}$.
	2. Feed $w_{pivot}$ into the generator $G$ to obtain a new image, $X_{pivot}$.
	3. Tune the weights of $G$ so that $w_{pivot}$ generates $X$ rather than $X_{pivot}$.
	4. The pivotal tuning feedback is guided by two losses: LPIPS and L2 distance.
What is LPIPS?
﻿
The Learned Perceptual Image Patch Similarity (LPIPS) metric measures the distance between two images in the feature space of a pretrained network. Both images in $\mathrm{LPIPS}(I_1, I_2)$ are fed into a pretrained network (AlexNet, VGG16, or SqueezeNet) to obtain their embedding vectors $v_1, v_2$, over which an L2 measure is calculated.

An implementation of LPIPS for PyTorch can be found here.
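To make the idea of a feature-space distance concrete, here is a minimal sketch. It is not LPIPS itself: the two random layers below are an assumed stand-in for a pretrained network, there are no learned weights, and `features`, `unit_normalize`, and `perceptual_distance` are hypothetical names. It only illustrates the pattern of extracting features, normalising them, and summing squared differences over layers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pretrained feature extractor: two fixed random layers.
# Real LPIPS uses a pretrained AlexNet, VGG16 or SqueezeNet instead.
W1 = rng.normal(size=(32, 64))   # layer 1: 64-d input -> 32 channels
W2 = rng.normal(size=(16, 32))   # layer 2: 32 channels -> 16 channels

def features(image):
    """Return the activations of each toy layer for a flattened image."""
    h1 = np.maximum(W1 @ image, 0.0)   # ReLU
    h2 = np.maximum(W2 @ h1, 0.0)
    return [h1, h2]

def unit_normalize(v, eps=1e-10):
    return v / (np.linalg.norm(v) + eps)

def perceptual_distance(img_a, img_b):
    """Sum over layers of the squared distance between normalised features."""
    total = 0.0
    for fa, fb in zip(features(img_a), features(img_b)):
        diff = unit_normalize(fa) - unit_normalize(fb)
        total += float(np.sum(diff ** 2))
    return total

img = rng.normal(size=64)
noisy = img + 0.1 * rng.normal(size=64)
other = rng.normal(size=64)

print(perceptual_distance(img, img))      # identical images
print(perceptual_distance(img, noisy))    # slightly perturbed image
print(perceptual_distance(img, other))    # unrelated image
```

A slightly perturbed image should score much closer to the original than an unrelated one, which is the behaviour we want from a perceptual loss.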
The PTI network backpropagates on these losses to tune the generator $G$ and obtain $G_{tuned}$, making our equations look something like this:
$$G^{-1}(X) = w_{pivot}$$
$$G_{tuned}(w_{pivot}) = X$$
The equations above state the goal in closed form; in practice it is reached by optimization. A generator $G$ undergoing the tuning procedure goes through the following steps:
$$G(w_{pivot}) = X_{pivot}$$
$$\mathrm{L}_{pt} = \mathrm{LPIPS}(X, X_{pivot}) + \lambda_{reg} \cdot \mathrm{L2}(X, X_{pivot})$$
$$\theta^* = \theta - \eta \frac{\partial \mathrm{L}_{pt}}{\partial \theta}$$
where $\theta$ denotes the weights of $G$ and $\eta$ is the learning rate. Repeating this update gives us the tuned generator $G^*$, i.e. $G_{tuned}$.
Note: $G^{-1}$ denotes a GAN inversion method, not an invertible generator function.
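The tuning procedure can be sketched end to end on a toy problem. Everything here is an assumption for illustration: the generator is a linear map with weights `theta`, the fixed matrix `F` stands in for the LPIPS feature extractor, inversion is done by least squares, and the step size is scaled so this toy converges quickly. The point is only the structure: the pivot latent stays fixed while the generator weights are updated to reduce $\mathrm{L}_{pt}$.

```python
import numpy as np

rng = np.random.default_rng(0)

theta = rng.normal(size=(16, 4))      # toy generator weights (assumption)
F = rng.normal(size=(8, 16)) / 4.0    # fixed feature map standing in for LPIPS
lambda_reg = 1.0                      # weight of the L2 term (arbitrary choice)

def generate(w, theta):
    return theta @ w

def pt_loss(x, x_pivot):
    """Feature-space distance plus lambda_reg times the pixel-space L2 distance."""
    diff = x_pivot - x
    return float(np.sum((F @ diff) ** 2) + lambda_reg * np.sum(diff ** 2))

x = rng.normal(size=16)                                # the "real" image
w_pivot, *_ = np.linalg.lstsq(theta, x, rcond=None)    # inversion step

loss_before = pt_loss(x, generate(w_pivot, theta))

# Pivotal tuning: w_pivot stays fixed, only the generator weights move.
M = F.T @ F + lambda_reg * np.eye(16)
lr = 0.1 / (w_pivot @ w_pivot)        # step size scaled for this toy problem
theta_tuned = theta.copy()
for _ in range(200):
    diff = generate(w_pivot, theta_tuned) - x
    grad = 2.0 * np.outer(M @ diff, w_pivot)   # gradient of pt_loss w.r.t. theta
    theta_tuned -= lr * grad

loss_after = pt_loss(x, generate(w_pivot, theta_tuned))
print(f"L_pt before tuning: {loss_before:.4f}, after tuning: {loss_after:.2e}")
```

After tuning, the same latent code reproduces the target image, which is the behaviour $G_{tuned}(w_{pivot}) = X$ asks for.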
﻿
Now let's see this in action.
3. Experiments
Below, use the slider on the right (the one that says "Index") to move through age, rotation, and facial expression. There's also a short gif at the bottom of this panel that cycles through each change.
﻿
﻿
StyleCLIP-based directional editing
StyleCLIP has become a very popular approach to controllable image generation. It takes a text prompt that guides the latent traversal in StyleGAN's W-space, like so:
Collected from https://github.com/orpatashnik/StyleCLIP
The authors of PTI provided directions for only a few attributes; the images generated with them are shown below:
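Once a direction is available, the edit itself is a small step in latent space. The sketch below assumes a toy linear generator and a random unit vector in place of a real StyleCLIP direction; `theta_tuned`, `w_pivot`, `direction`, and `edit` are illustrative names, not part of the PTI or StyleCLIP code.

```python
import numpy as np

rng = np.random.default_rng(0)

theta_tuned = rng.normal(size=(16, 4))   # stand-in for the tuned generator
w_pivot = rng.normal(size=4)             # stand-in for the inverted latent code

# Hypothetical edit direction (e.g. "smile"), normalised to unit length.
direction = rng.normal(size=4)
direction /= np.linalg.norm(direction)

def edit(w, direction, alpha):
    """Move the latent code by alpha along the edit direction."""
    return w + alpha * direction

def generate(w):
    return theta_tuned @ w

# Sweep the edit strength; alpha = 0 reproduces the unedited image.
alphas = [-2.0, -1.0, 0.0, 1.0, 2.0]
frames = [generate(edit(w_pivot, direction, a)) for a in alphas]

for a, frame in zip(alphas, frames):
    change = np.linalg.norm(frame - generate(w_pivot))
    print(f"alpha={a:+.1f}  change from original={change:.3f}")
```

Sweeping `alpha` like this is what the "Index" slider in the panels corresponds to conceptually: each index is one edit strength.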
﻿
﻿
﻿
Please check out the above-linked report on similar work in editable image generation. They use a StyleGAN encoder, similar to what we employ for GAN inversion.
4. Additional Observations
Various caveats could be observed while experimenting with PTI.
Loss of Ethnicity
Me when I was a kid
Me originally
Me when I wanted to smile less
Entangled attributes: When editing an image along a single attribute, other attributes are often affected as well. These attributes tend to be correlated in the data distribution used to train StyleGAN (FFHQ in our case).
For example, imagine you wanted to smile more, but the edit also removed the beard from your face.
My beard usually stays on my face while I smile. YMMV.
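A toy example shows how correlation in the data can leak into an edit direction. The numbers are invented: each sample is just a (smile, beard) pair drawn with an assumed negative correlation, and the direction is computed as a difference of class means, which is one simple recipe and not necessarily how the directions in this report were obtained.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: each row is (smile score, beard score). The negative covariance
# is an assumed dataset bias: smiling faces tend to have less beard.
cov = np.array([[1.0, -0.6],
                [-0.6, 1.0]])
codes = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=5000)

# One simple recipe for an edit direction: difference of class means.
smiling = codes[:, 0] > 0
direction = codes[smiling].mean(axis=0) - codes[~smiling].mean(axis=0)
direction /= np.linalg.norm(direction)

face = np.array([0.0, 1.5])          # neutral expression, clear beard
edited = face + 2.0 * direction      # "smile more"

print("direction (smile, beard):", np.round(direction, 3))
print("before:", face, "after:", np.round(edited, 3))
```

Because the "smile" direction picks up a negative beard component from the data, pushing a face towards smiling also pushes the beard score down.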
5. Conclusion
PTI preserves identity very well compared to other methods that have come out since early 2020. However, future methods that build on the successes of this approach should focus on fully disentangling the attribute factors in the image. Moreover, ethnicity preservation is needed, which is also good practice on the way to more ethical AI systems.
﻿
﻿
Contact me on GitHub and Twitter with any questions about the report.
https://github.com/ucalyptus
https://twitter.com/sayantandas_
﻿