In-Domain GAN Inversion for Real Image Editing

Explore a SOTA GAN Inversion technique proposed by the authors of In-Domain GAN Inversion. Made by Ayush Thakur using Weights & Biases

GANs map the latent space (random noise) to the data space (images) by playing a min-max adversarial game, but they don't provide an "inverse" model, i.e., a mapping from the data space back to the latent space. This report explores the state of the art in GAN inversion, which can be used for image editing. We will see some impressive GAN-based editing results.

Paper → | Video →

Run the experiments in Colab →

Introduction

GANs learn a deep generative model that can synthesize novel, high-dimensional data samples. The latent space is known to encode rich semantic information, and varying the latent code leads to manipulating the corresponding attributes occurring in the generated image. You can learn more about generative models and latent space in this Towards Deep Generative Modeling with W&B report.

How about applying specific manipulations to your face image? Maybe you want to add sunglasses to your input image. GANs cannot take a particular image as input to infer its latent code.

The GAN inversion technique overcomes this. It allows you to input a real image (e.g., a face), which is mapped to a latent code. The aim is to find the most accurate latent code for recovering the input image using the generator. This is done either by learning an extra encoder beyond the GAN or by directly optimizing the latent code for an individual image. Techniques like in-domain GAN inversion combine these two ideas by using the learned encoder to generate an initialization for the optimization.
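Conceptually, the hybrid scheme can be sketched in a few lines of Python. Here `E`, `G`, and `optimize` are placeholder names for the encoder, the generator, and a per-image refinement routine, not the authors' actual API:

```python
def hybrid_inversion(x, E, G, optimize):
    """Two-stage GAN inversion (conceptual sketch)."""
    z_init = E(x)                   # stage 1: encoder gives a fast initialization
    z_inv = optimize(z_init, x, G)  # stage 2: per-image optimization refines the code
    return G(z_inv)                 # reconstruction of the input image
```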

What is In-Domain?

Given a pre-trained GAN model, GAN inversion aims at finding the most accurate latent code from which the input image can be reconstructed. The generator of the GAN does this reconstruction. To this end, existing inversion methods focus on rebuilding the target image at the pixel level without considering the inverted latent code's semantic information. If the process does not use this semantic information, it will fail to do high-quality semantic editing. We certainly don't want our edited face to look like something else.

On this note, a suitable GAN inversion method should reconstruct the target image at the pixel level and align the inverted latent code with the semantic information encoded in the latent space. The authors have named their GAN inversion method in-domain GAN inversion because it uses in-domain code, which is semantically meaningful.

Overview of the Proposed Method

I. The Problem Statement

The GAN model consists of a generator network, $G(\cdot)$, which maps the latent space to the data space, $Z \rightarrow X$, and a discriminator network, $D(\cdot)$, which distinguishes real data, $x^{real}$, from synthesized data, $x^{syn}$.

The problem is to devise a method that learns to reverse this mapping, $X^{real} \rightarrow Z^{inv}$, so that we can recover $x^{real}$ from $z^{inv}$ (lowercase $x$ and $z$ denote a real image and a latent code drawn from the real image space, $X$, and the latent space, $Z$). While doing so, $z^{inv}$ should also align with the semantic space $S$ learned by the pre-trained GAN model.
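In its most generic form, GAN inversion can be written as the following per-image optimization, where $\ell(\cdot, \cdot)$ is some reconstruction distance (pixel-wise, perceptual, or both). This is the standard formulation rather than the paper's exact objective:

$$z^{inv} = \arg\min_{z \in Z} \ \ell\big(G(z),\, x^{real}\big)$$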

To do so, the authors propose to first train a domain-guided encoder and then use this encoder as a regularizer for the subsequent domain-regularized optimization.

Side Note

Usually, a GAN samples the latent code $z$ from a distribution $Z$, such as a normal distribution. The recent StyleGAN paper proposes to first map the initial latent space $Z$ to a second latent space $W$ with a multi-layer perceptron (MLP). The code $w$ is sampled from $W$ and fed to the generator.
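For reference, here is a minimal PyTorch sketch of such a mapping network. StyleGAN's actual mapping network is an 8-layer MLP; the width and activation details here are simplified:

```python
import torch
import torch.nn as nn

class MappingNetwork(nn.Module):
    """StyleGAN-style mapping network F: Z -> W (simplified sketch)."""

    def __init__(self, latent_dim=512, num_layers=8):
        super().__init__()
        layers = []
        for _ in range(num_layers):
            layers += [nn.Linear(latent_dim, latent_dim), nn.LeakyReLU(0.2)]
        self.net = nn.Sequential(*layers)

    def forward(self, z):
        return self.net(z)  # the w code that is fed to the synthesis network

z = torch.randn(4, 512)   # z sampled from a normal distribution
w = MappingNetwork()(z)   # w lives in the intermediate latent space W
```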


-> Figure 1: The architecture of the Adversarial Latent Autoencoder, which is based on StyleGAN. Notice that $Z$ is mapped to $W$ using an MLP, $F$. Read more about it here. <-

We will use $z$ to denote $w$ from here on.

II. Domain Guided Encoder


-> Figure 2: The first row is the conventional encoder learned for GAN inversion. The second row is the **domain-guided encoder**. Source <-

Existing methods that learn an encoder to map the data space to the latent space have done so with no regard to whether the codes produced by the encoder align with the codes the generator was trained on. In figure 2, the first row is the conventional encoder learned for GAN inversion. In a single forward pass, a batch of latent codes $z^{sam}$ is randomly sampled from a known distribution $Z$ and fed into $G(\cdot)$ to get a batch of synthetic images, $x^{syn}$. These are then fed to an encoder, $E(\cdot)$, to get $z^{enc}$ (in the GAN inversion literature, this is $z^{inv}$). The _mean squared error_ (MSE) is computed between $z^{sam}$ and $z^{enc}$. But this is not sufficient to utilize the rich semantic information encoded in the latent space.

To this end, the authors propose a domain-guided encoder. In figure 2, the second row is the domain-guided encoder. The main differences in its design compared to the conventional encoder are listed below, followed by a short training sketch:

  1. The output of the encoder is fed into the generator. Thus the latent space is shared between the encoder and the generator. Let's call this $z^{enc}$.

  2. The MSE is applied in the image space instead of latent space. This way, the encoder's output code is guaranteed to align with the semantic domain of the generator.

  3. The encoder is trained with real images instead of synthetic images. This way, the encoder is more applicable to real applications.

  4. The discriminator is used to compete with the encoder in an adversarial manner. This ensures that reconstructed images are realistic enough.

  5. The authors have also used perceptual loss by utilizing the features extracted by VGG. This helps to deal with high-level differences like style and content.
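Putting these five ingredients together, a single encoder update might look like the following PyTorch sketch. `G`, `D`, and `vgg` are assumed to be a frozen pre-trained generator, the discriminator, and a VGG feature extractor; the loss weights are illustrative placeholders, not the paper's exact values:

```python
import torch
import torch.nn.functional as F

def encoder_step(E, G, D, vgg, x_real, lam_vgg=5e-5, lam_adv=0.1):
    """One domain-guided encoder update (sketch).

    Only E receives gradients here; G is frozen, and D is updated in a
    separate, alternating adversarial step (omitted for brevity).
    """
    z_enc = E(x_real)  # (3) encode *real* images
    x_rec = G(z_enc)   # (1) decode with the shared generator

    pixel_loss = F.mse_loss(x_rec, x_real)             # (2) MSE in image space
    percep_loss = F.mse_loss(vgg(x_rec), vgg(x_real))  # (5) VGG perceptual loss
    adv_loss = -D(x_rec).mean()                        # (4) fool the discriminator

    loss = pixel_loss + lam_vgg * percep_loss + lam_adv * adv_loss
    loss.backward()  # caller applies the optimizer step
    return loss
```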

Note:

Click on the :gear: icon and move the slider to get the original image. The encoder image is the reconstruction of the original image made by the generator using the latent code output by the domain-guided encoder.


III. Domain-Regularized Optimization


-> Figure 3: The first row is the conventional optimization. The second row is the proposed domain-regularized optimization. (Source) <-

At this point, we have a trained domain-guided encoder that can reconstruct the input image based on the pre-trained generator, and the code generating this reconstruction will be semantically meaningful. However, to better fit the target image at the pixel level, we need to refine this code.

This is where domain-regularized optimization comes into the picture. GAN inversion is more of an instance-level (per-image) task that aims to best reconstruct a given input image.

Previous methods have relied on gradient descent-based optimization, where the latent code is optimized "freely" based on the generator only. There is thus no constraint on the latent code, and it can very likely produce an out-of-domain inversion. This is shown in the first row of figure 3.

Domain-regularized optimization relies on the domain-guided encoder, with two improvements (a sketch follows the list):

  1. The output of the domain-guided encoder provides a starting point for the optimization.

  2. The domain-guided encoder acts as a regularizer to enforce that the refined latent code is within the generator's semantic domain.
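A minimal sketch of this refinement step is shown below. It reuses the trained encoder `E`, the frozen generator `G`, and a VGG extractor `vgg`; the domain regularizer penalizes $\|z - E(G(z))\|$ so the refined code stays in the encoder's (and hence the generator's) semantic domain. All hyperparameter values are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def invert(x, E, G, vgg, steps=100, lr=0.01, lam_vgg=5e-5, lam_dom=2.0):
    """Domain-regularized optimization for a single image (sketch)."""
    z = E(x).detach().clone().requires_grad_(True)  # (1) encoder output as init
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        x_rec = G(z)
        loss = (F.mse_loss(x_rec, x)                        # pixel reconstruction
                + lam_vgg * F.mse_loss(vgg(x_rec), vgg(x))  # perceptual term
                + lam_dom * F.mse_loss(z, E(x_rec)))        # (2) domain regularizer
        opt.zero_grad()
        loss.backward()
        opt.step()
    return z.detach()
```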

Click on the :gear: icon and move the slider to get the original image. The inverted image is the reconstruction of the original image made by the generator using the refined latent code produced by domain-regularized optimization.


Results

In this section, we will look at the results produced by applying in-domain GAN inversion to real image editing tasks. The tasks that we will look into are semantic diffusion, semantic manipulation, and semantic interpolation.

To reproduce the results, check out the Colab notebook linked here.

Semantic Diffusion

Semantic diffusion aims at diffusing a particular part (usually the most representative part) of the target image into the context of another image. We want the fused result to keep the characteristics of the target image while adapting to the context information.

The media panel shown below shows the result of the diffusion task. The first image is the original image (target image), followed by the context image. The most representative patch from the target image is stitched onto the context image, which is then reconstructed using the in-domain GAN inversion method. Scroll through the examples below.
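Under the hood, the diffusion step can be sketched as a copy-paste followed by inversion. Here `mask` marks the pasted patch and `invert` is assumed to be the domain-regularized optimization from above; restricting the reconstruction loss to the masked region (an assumption about the implementation detail) lets the optimization blend the surroundings semantically:

```python
def semantic_diffusion(target, context, mask, E, G, vgg, invert):
    """Diffuse the masked patch of `target` into `context` (sketch)."""
    composite = mask * target + (1 - mask) * context  # naive copy-paste
    z = invert(composite, E, G, vgg)  # inversion smooths the seam semantically
    return G(z)
```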


Semantic Manipulation

Image manipulation is another way to examine whether the embedded latent codes align with the semantic knowledge learned by GANs.

Click on the :gear: icon and move the slider to visualize the manipulation of the expression. Index 0 is the original image, followed by the inverted image at index 1. Moving the slider further, the faces light up with a smile.
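The manipulation itself is a linear move in latent space: take the inverted code and step along a pre-computed semantic direction (e.g., a "smile" boundary obtained with a method such as InterFaceGAN; where the direction comes from is an assumption of this sketch):

```python
def manipulate(z_inv, direction, alphas=(-3.0, -1.0, 0.0, 1.0, 3.0)):
    """Edit an inverted code along a semantic direction (sketch)."""
    direction = direction / direction.norm()        # unit-length edit direction
    return [z_inv + a * direction for a in alphas]  # decode each code with G
```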


Semantic Interpolation

Image interpolation aims at semantically interpolating two images, which is suitable for investigating the semantics contained in the inverted codes.

Click on the :gear: icon and move the slider to visualize the interpolation of the original image towards the target image. Thus index 1 is the original image while index 6 is the target image.
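Interpolation is simply a convex combination of the two inverted codes, each decoded by the generator (a standard linear-interpolation sketch):

```python
import torch

def interpolate(z_a, z_b, num=6):
    """Linearly interpolate between two inverted codes (sketch)."""
    lams = torch.linspace(0.0, 1.0, num)
    return [(1 - l) * z_a + l * z_b for l in lams]  # decode each code with G
```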


Conclusion

This report summarizes the paper, making it more accessible to readers. I have used lines from the paper where that was the best way to convey the information.


I hope you like this report. GAN inversion is indeed an interesting topic. I would love to get your feedback. You can leave it in the comments section or reach out to me at @ayushthakur0.