Introducing the pixel2style2pixel (pSp) Framework with W&B

In this report, we'll be discussing our work 'Encoding in Style: a StyleGAN Encoder for Image-to-Image Translation.' We'll start with some background on StyleGAN and the versatility it offers. We'll then dive into our pixel2style2pixel (pSp) framework and see how seamless it is to track your pSp experiments with W&B.
Yuval Alaluf
The proposed pixel2style2pixel (pSp) framework can be used to solve a wide variety of image-to-image translation tasks. Here we show results of pSp on StyleGAN inversion, multi-modal conditional image synthesis, facial frontalization, inpainting, and super-resolution.

The pixel2style2pixel Repository

Before getting started, I invite you all to check out our official implementation of pSp. There you can find pre-trained models and a Google Colab to help you get started. In addition, head over to our project page for more details and resources.

Introduction

StyleGAN

Figure 1. The original StyleGAN architecture. Introduced by Karras et al. [2019] in "A Style-Based Generator Architecture for Generative Adversarial Networks".
In recent years, there has been some amazing progress in Generative Adversarial Networks (GANs), which aim to synthesize artificial images indistinguishable from authentic images. Starting from the seminal work of Goodfellow et al. in 2014, the visual quality and fidelity of these GANs have grown tremendously, paving the way for StyleGAN to be established as the current state-of-the-art model for image synthesis.
Here, instead of deep diving into StyleGAN's design and architecture, we'll focus on understanding what StyleGAN has to offer beyond its unprecedented image quality. I'll begin by highlighting some key aspects of the StyleGAN family that really make it special compared to previous generators:
  1. In addition to the standard \mathcal{Z} space, StyleGAN has an intermediate latent space called \mathcal{W}, which is learned via a mapping network. Since this space is learned, the distribution of \mathcal{W} is able to better fit the true distribution of the real data StyleGAN is trained on. As we'll soon see, this \mathcal{W} space encodes special properties that make StyleGAN incredibly powerful.
  2. Having obtained a latent code w\in\mathcal{W}, StyleGAN injects this code at several inputs along the network via style modulation layers (or AdaIN layers, as was done in the original StyleGAN). In doing so, we'll see that each latent code, or style code, can control a different level of detail in the generated image (see the sketch below).
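To make these two points concrete, here is a minimal PyTorch sketch of the mapping-network idea. The 8-layer, 512-dimensional configuration follows the original paper, but this is an illustration rather than the official implementation:

import torch
import torch.nn as nn

# A toy version of StyleGAN's mapping network: an MLP that maps a random
# latent z in Z to an intermediate latent w in W (both 512-dimensional).
class MappingNetwork(nn.Module):
    def __init__(self, dim=512, n_layers=8):
        super().__init__()
        layers = []
        for _ in range(n_layers):
            layers += [nn.Linear(dim, dim), nn.LeakyReLU(0.2)]
        self.net = nn.Sequential(*layers)

    def forward(self, z):
        return self.net(z)

mapping = MappingNetwork()
z = torch.randn(1, 512)    # a sample from the standard Z space
w = mapping(z)             # the corresponding learned W space code
# Inside the generator, w is then injected at every synthesis layer
# (via AdaIN in the original StyleGAN or style modulation in StyleGAN2).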

StyleGAN's Disentanglement

Figure 2. The disentanglement of StyleGAN. By traversing along various directions in StyleGAN's latent space, one can independently control different image properties such as age and pose.
As we mentioned earlier, what makes StyleGAN truly special is its ability to encode semantic information in a disentangled fashion where one can separately control different factors of variation of the generated image. Leveraging this, some recent methods such as GANSpace [Härkönen et al. 2020], InterFaceGAN [Shen et al. 2020], SeFa [Shen et al. 2021], and StyleFlow [Abdal et al. 2020] demonstrate how different directions within StyleGAN's latent space correspond to different factors of variation. They perform a so-called latent space traversal to control these factors.
To better illustrate what I mean, let's consider Figure 2. Let's assume we have some style code w\in\mathcal{W}. We can move along some direction within this latent space, say a direction controlling pose or age, to generate a similar-looking image with the desired attribute altered accordingly. In other words, by leveraging StyleGAN's rich latent space, we can naturally perform image editing with ease.
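In equation form, if \textbf{n} denotes a latent direction associated with some attribute (say, age), such an edit is simply a step along that direction:
\textbf{w}_{\text{edit}} = \textbf{w} + \alpha \cdot \textbf{n},
where \alpha controls the strength (and sign) of the edit and the direction \textbf{n} itself is found using one of the methods mentioned above.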

StyleGAN Inversion

This idea of disentanglement and the resulting support for image editing naturally raises the question of how one can leverage these capabilities for editing real images. To do so, we must find the image's corresponding latent code in a process called GAN Inversion. In other words, for a given image we want to find a latent code w such that passing w to StyleGAN returns the original input image. So how do we do this?
Well, it turns out that if we try to encode an image into \mathcal{W}, it's quite hard to reconstruct the original image. That is, trying to encode any real image into a 512-dimensional vector (the dimension of \mathcal{W}) is quite challenging. Therefore, it has instead become common practice to encode the real image into an extended space called \mathcal{W}+, composed of 18 different style codes, one for each input of StyleGAN.
Two different methodologies have recently been established as the go-to for performing such an inversion into \mathcal{W}+. First, many works approach this task by performing per-image latent vector optimization and directly optimize the latent vector to minimize the reconstruction error for a given image. While optimization-based inversion results in high reconstruction quality, it typically requires several minutes per image. More recently, some works have designed an encoder to learn a direct mapping from a given image to its corresponding latent code. Although substantially more efficient than optimization, there remains a significant gap between the reconstruction quality of learning-based and optimization-based inversions. As a result, most works still resort to using a costly per-image optimization process for inversion.
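To give a sense of why optimization-based inversion is so costly, here is a rough sketch of a per-image optimization loop. The generator, the target image, the average latent code w_avg, the learning rate, and the loss weights are all assumptions for illustration; actual methods differ in their exact recipes:

import torch
import lpips  # perceptual (LPIPS) loss package

# Assumed given: `generator` maps an (N, 18, 512) W+ code to an image,
# `target` is the real image to invert (1, 3, H, W), and `w_avg` is
# StyleGAN's average latent code of shape (512,).
percep = lpips.LPIPS(net='alex').cuda()

# Initialize the W+ code at the average latent and optimize it directly.
w_plus = w_avg.repeat(18, 1).unsqueeze(0).cuda().requires_grad_(True)
optimizer = torch.optim.Adam([w_plus], lr=0.01)

for step in range(1000):  # hundreds to thousands of steps *per image*
    optimizer.zero_grad()
    reconstruction = generator(w_plus)
    loss = percep(reconstruction, target).mean() + 0.1 * ((reconstruction - target) ** 2).mean()
    loss.backward()
    optimizer.step()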

The pixel2style2pixel Framework

And this is where our pixel2style2pixel (pSp) framework comes into play. We'll first describe how pSp can be used to solve the StyleGAN inversion task using an encoder-based inversion scheme while outperforming existing learning-based approaches. We'll then show how one can utilize the powerful StyleGAN generator and extend the pSp encoder to solve a wide range of image-to-image translation tasks, all using the same architecture.

The pSp Encoder

Architecture

Figure 3. The pixel2style2pixel (pSp) architecture. Our FPN-based encoder extracts the intermediate style representation of a given real image at three different spatial scales, corresponding to the coarse, medium, and fine style groups of StyleGAN. Observe that during training, the pre-trained StyleGAN generator remains fixed, with only the encoder trained using our curated set of loss functions.
Recognizing the reconstruction gap between optimization-based inversion and encoder-based inversion, most of our focus in this work revolved around designing an encoder-based inversion scheme able to efficiently and accurately encode real images into StyleGAN's latent space. Recall that in the inversion task, we are tasked with embedding a real image into a series of 18 style vectors representing the extended \mathcal{W}+ latent space.
While we explored various architectures, our final encoder is based on a Feature Pyramid Network (FPN) with a ResNet-based backbone. This hierarchical architecture is motivated by the hierarchical nature of StyleGAN in which different input layers correspond to different levels of details. Specifically, given an input image, we begin by extracting feature maps at three different spatial scales. Given these feature maps, we then introduce simple intermediate convolutional networks named map2style blocks to extract the 18 different style vectors corresponding to the \mathcal{W}+ style representation of the input image. Finally, by feeding this learned intermediate representation to StyleGAN we obtain the reconstructed image. We provide an overview of this architecture in Figure 3.
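To make the map2style idea a bit more concrete, here is a simplified sketch of such a block. The channel counts, the number of downsampling steps, and the final stacking are illustrative rather than the exact official configuration (in particular, the official blocks read from three different FPN scales, not one):

import math
import torch
import torch.nn as nn

# A simplified map2style block: a small convolutional network that gradually
# downsamples a feature map into a single 512-dimensional style vector.
class Map2Style(nn.Module):
    def __init__(self, in_channels=512, spatial=16):
        super().__init__()
        num_pools = int(math.log2(spatial))   # e.g. a 16x16 map needs 4 stride-2 convs
        layers = []
        channels = in_channels
        for _ in range(num_pools):
            layers += [nn.Conv2d(channels, 512, kernel_size=3, stride=2, padding=1),
                       nn.LeakyReLU(0.2)]
            channels = 512
        self.convs = nn.Sequential(*layers)
        self.linear = nn.Linear(512, 512)

    def forward(self, x):
        x = self.convs(x)                      # (B, 512, 1, 1)
        return self.linear(x.flatten(1))       # (B, 512) style vector

# 18 such blocks produce the 18 style vectors of the W+ representation.
styles = [Map2Style(spatial=16)(torch.randn(1, 512, 16, 16)) for _ in range(18)]
w_plus = torch.stack(styles, dim=1)            # (B, 18, 512)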
In a sense, this scheme performs a pixel-to-style-to-pixel translation where every image is first encoded into an intermediate style representation and then into a corresponding image, inspiring the name pSp. What's special here is that, during inference, pSp performs its inversion in a fraction of a second compared to several minutes per image when inverting using optimization techniques.

Loss Functions

While the above architecture is a core part of pSp, the choice of loss functions is also crucial for an accurate inversion. Given an input image \textbf{x}, the output of pSp is given by:
pSp(\textbf{x}) := G(E(\textbf{x}) + \overline{\textbf{w}})
where E(\cdot) and G(\cdot) denote the encoder and the pre-trained StyleGAN generator, respectively, and \overline{\textbf{w}} represents the average latent code of StyleGAN. Observe that during training, only the encoder network is trained, with the StyleGAN generator remaining fixed.
Like in previous inversion schemes, we employ the L2 and LPIPS losses to learn both pixel-wise and perceptual similarities between the input image and its reconstruction:
\mathcal{L}_{2}\left ( \textbf{x} \right ) = || \textbf{x} - pSp(\textbf{x}) ||_2.
\mathcal{L}_{\text{LPIPS}}\left ( \textbf{x} \right ) = || F(\textbf{x}) - F(pSp(\textbf{x}))||_2,
where F denotes the perceptual feature extractor.
Additionally, we employ a regularization loss on the encoded style code:
\mathcal{L}_{\text{reg}}\left ( \textbf{x} \right ) = || E(\textbf{x}) - \overline{\textbf{w}} ||_2.
Here, we encourage the encoder to output latent codes closer to the average latent code of StyleGAN. Similar to the truncation trick introduced in StyleGAN, we find that adding this regularization when training our encoder improves image quality without harming the fidelity of our outputs, especially in some of the more ambiguous tasks, such as the conditional image synthesis we'll explore below.
A core challenge when encoding real face images into a series of style vectors is the preservation of the input identity. After all, preserving fine details using a set of vectors is incredibly challenging. To address this, we introduce a dedicated recognition loss measuring the cosine similarity between the output image and its source,
\mathcal{L}_{\text{ID}}\left (\textbf{x} \right ) = 1-\left \langle R(\textbf{x}),R(pSp(\textbf{x})) \right \rangle ,
where R is a pre-trained ArcFace network trained for facial recognition.
Given these loss functions, we can train our pSp encoder in a supervised fashion, where the generated output is compared to the original input image. By computing the above losses and updating the encoder weights using backpropagation, we are able to learn an accurate, yet efficient encoding of real images into StyleGAN.
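To summarize the training procedure, a single optimization step conceptually looks like the sketch below. The loss weights (lambda_*) and reduction choices are placeholders, and encoder, generator, arcface, lpips_loss, and w_avg are assumed to be given; see the paper and the official repository for the exact configuration:

import torch
import torch.nn.functional as F

def training_step(x, encoder, generator, w_avg, lpips_loss, arcface, optimizer,
                  lambda_l2=1.0, lambda_lpips=0.8, lambda_reg=0.005, lambda_id=0.1):
    codes = encoder(x)                          # E(x), an (N, 18, 512) W+ code
    y_hat = generator(codes + w_avg)            # pSp(x) = G(E(x) + w_bar); G stays frozen

    loss = lambda_l2 * ((x - y_hat) ** 2).mean()                 # L_2
    loss = loss + lambda_lpips * lpips_loss(y_hat, x).mean()     # L_LPIPS
    loss = loss + lambda_reg * ((codes - w_avg) ** 2).mean()     # L_reg
    cos = F.cosine_similarity(arcface(y_hat), arcface(x))        # L_ID: 1 - <R(x), R(pSp(x))>
    loss = loss + lambda_id * (1 - cos).mean()

    optimizer.zero_grad()     # the optimizer holds only the encoder's parameters
    loss.backward()
    optimizer.step()
    return loss.item()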

pSp for Image-to-Image Translation

Figure 4. pSp for image-to-image translation. By directly encoding a given sketch image into the desired, transformed latent code, pairing a pSp encoder with a pre-trained StyleGAN allows one to generate realistic images, even when the input image does not reside in the StyleGAN domain.
So with pSp, we have obtained a pretty powerful encoder able to directly and accurately encode real images into the StyleGAN latent space. But what can I do with this encoded representation? For example, say I have an image of an individual and want to visualize them as a child. How can we do this with pSp?
As hinted earlier, many works leverage the rich semantic space of StyleGAN for image manipulation by traversing the latent space along a path controlling some target attribute (in this case, age). So what we can do is the following (sketched in code after this list):
  1. Invert the image into StyleGAN's latent space using pSp.
  2. Edit the obtained latent code in a semantically meaningful manner using latent space traversal.
  3. Pass the edited code to StyleGAN to generate the transformed image.
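A rough sketch of these three steps is shown below. Here, net is a trained pSp inversion model, age_direction is a hypothetical (18, 512) latent direction (obtained, for example, with InterFaceGAN), and generate_from_wplus is a hypothetical helper that decodes a W+ code with the frozen StyleGAN generator; the return_latents call mirrors the style-mixing snippet shown later in this report:

import torch

with torch.no_grad():
    # 1. Invert: encode the real image into its W+ latent code with pSp.
    _, w_plus = net(real_image.unsqueeze(0).to("cuda").float(), return_latents=True)

    # 2. Edit: step along a semantic direction controlling the target attribute (age).
    alpha = 3.0                               # edit strength (illustrative)
    w_edited = w_plus + alpha * age_direction

    # 3. Generate: decode the edited code with the frozen StyleGAN generator.
    edited_image = generate_from_wplus(net.decoder, w_edited)  # hypothetical helper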
And it turns out that although such an "invert first, edit later" approach provides users with some diverse and realistic-looking edits, it is inherently limited.
This is because to perform such edits, the input image must be invertible --- there must exist a latent code that reconstructs the image. However, this requirement is a severe limitation when we consider tasks such as conditional image generation where the input image (e.g., a sketch) does not reside in the StyleGAN domain.
Recognizing this limitation, our key insight is that pSp can be applied to more general image-to-image translation tasks by directly encoding the input image into the latent code corresponding to the desired output image. This allows one to manipulate images even when the input image cannot be encoded into the latent space of StyleGAN. And since we utilize a fixed, pre-trained StyleGAN, we're able to easily leverage its state-of-the-art synthesis quality. We visualize this direct translation concept in Figure 4.
And what's special here is that in the above formulation, pSp forms a generic framework able to solve a wide range of tasks all using the same architecture and similar training scheme. This is in contrast to many previous works that design a dedicated architecture for solving a single task.

Multi-Modal Synthesis with pSp

Figure 5. Style mixing with pSp. By randomly sampling latent codes and performing style mixing on the fine-level styles, pSp can generate an endless collection of realistic images for a given sketch image or segmentation map.
Translating between images through the style domain differentiates pSp from many standard image-to-image translation frameworks, as it allows us to leverage the powerful properties of StyleGAN. More specifically, we discussed above the idea of StyleGAN's disentanglement, where different input layers control different levels of detail.
This ability to independently manipulate semantic attributes leads to another useful property: the support for multi-modal synthesis. If we consider translation tasks such as sketch-to-image where a single input sketch may correspond to several outputs, it is desirable to be able to model these possible outputs. While previous image-to-image approaches require specialized changes to their architecture to support multi-modal synthesis, our pSp framework inherently supports this by leveraging the robustness of StyleGAN.
As shown in Figure 5, this can be done by simply sampling multiple style vectors and performing style mixing on the fine-level latent vectors of the encoded image. This allows one to generate an endless collection of output images for a given input sketch or segmentation map, for example.
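At the latent level, this mixing amounts to something like the sketch below. The layer indices and the mixing coefficient are illustrative; in our implementation the set of mixed layers is specified via a latent mask, as shown in the style-mixing snippet later in this report:

# w_enc:  (18, 512) W+ code obtained by encoding the input sketch / segmentation map
# w_rand: (18, 512) W+ code obtained from a randomly sampled latent
alpha = 1.0                    # 1.0 fully replaces the fine styles with the random ones
w_mixed = w_enc.clone()
w_mixed[8:] = alpha * w_rand[8:] + (1 - alpha) * w_enc[8:]   # mix only the fine-level styles
# Decoding w_mixed preserves the coarse structure of the input while varying the fine details.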

Pairing pSp with W&B

Let's now move on to a hands-on experiment with pSp and how Weights & Biases (W&B) can be used to help us visualize our data and track our experiments. For this report, I'll be exploring the sketch-to-image task where we wish to transform sketch images into realistic face images.
Observe that integrating W&B with your pSp experiments is as simple as adding an additional flag, --use_wandb, when training. That's it! By adding this flag, you'll be able to track all of your pSp experiments with ease.

Visualizing the Data

Before I get started, I always like to visualize my data. With W&B this can be done in only a few lines of code:
def log_dataset_wandb(dataset, dataset_name, n_images=16):
    idxs = np.random.choice(a=range(len(dataset)), size=n_images, replace=False)
    data = [wandb.Image(dataset.source_paths[idx]) for idx in idxs]
    wandb.log({f"{dataset_name} Data Samples": data})

self.wb_logger = WBLogger(self.opts)
self.wb_logger.log_dataset_wandb(train_dataset, dataset_name="Train")
self.wb_logger.log_dataset_wandb(test_dataset, dataset_name="Test")
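The snippet above assumes a small helper class, WBLogger, that owns all of the W&B logging. As a minimal sketch of what such a class might look like (the project name and the use of vars(opts) as the run config are assumptions for illustration):

import wandb

class WBLogger:
    # Minimal sketch of a W&B logging helper; the logging functions shown in
    # this report (e.g., log_dataset_wandb) can live on this class.
    def __init__(self, opts):
        # Start a W&B run and record all experiment options as the run config.
        wandb.init(project="pixel2style2pixel", config=vars(opts))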
As shown above, this logging is handled by a class called WBLogger, which is responsible for all logging to W&B. With it in place, we can visualize the train and test data as follows:

Visualizing the Loss Functions

One of the most important parts of training is being able to easily follow the loss values during training. With W&B this can be done with only a few lines of code:
def log(prefix, metrics_dict, global_step):
    log_dict = {f'{prefix}_{key}': value for key, value in metrics_dict.items()}
    log_dict["global_step"] = global_step
    wandb.log(log_dict)
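For example, a call from the training loop might look like the following (the loss names and values are placeholders, and self.wb_logger / self.global_step follow the pattern of the data-logging snippet above):

loss_dict = {'loss_l2': 0.032, 'loss_lpips': 0.210, 'loss_id': 0.081, 'loss': 0.323}  # placeholder values
self.wb_logger.log(prefix='train', metrics_dict=loss_dict, global_step=self.global_step)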
Here, metrics_dict holds the loss values we want to plot at the current global_step with prefix denoting whether we are plotting the losses on the train or test set. We'll start by looking at the training loss curves:
You may be curious why the train_loss_w_norm increases during training. If we think about the role of this regularization loss, this makes sense. Recall that the w-regularization loss encourages the encoder to output latent codes that are close to the average latent code. During training, however, the encoder outputs latent codes that gradually move away from this average code in order to better reconstruct each individual input, resulting in the increase we see here.
And now, visualizing the test losses during training:

Visualizing Intermediate Training Results

Everything looks great! It looks like training is just about converging after about a day of training. Before moving on to exploring our trained model, I want to share something that I find very useful to do when training --- visualizing the intermediate results. Although this sounds trivial, I often see people draw conclusions by looking only at the loss curves during training. However, this can often be misleading. Therefore, I like to visualize the model's outputs throughout training to make sure the model is doing what I expect it to do.
In our case, what we can do is visualize the model's outputs for a given test batch at various steps during training:
def log_images_to_wandb(x, y, y_hat, prefix, step, opts):
    im_data = []
    column_names = ["Source", "Target", "Output"]
    for i in range(len(x)):
        cur_im_data = [
            wandb.Image(common.log_input_image(x[i], opts)),
            wandb.Image(common.tensor2im(y[i])),
            wandb.Image(common.tensor2im(y_hat[i])),
        ]
        im_data.append(cur_im_data)
    outputs_table = wandb.Table(data=im_data, columns=column_names)
    wandb.log({f"{prefix.title()} Step {step} Output Samples": outputs_table})
Let's begin by visualizing the translation results after 5,000 training steps.
We can see that after only 5,000 steps, the encoder has already learned to capture the rough pose of the input sketches. Some details have been lost, of course, but overall, the translations are able to capture what we expected.
And if we visualize some test samples after 40,000 training steps we get the following:
Overall, the two sets of results are quite similar, giving another indication that our model has converged to the desired result.
Awesome! Having trained our sketch-to-image model, let's see what we can do with it.

Performing Style Mixing with pSp

We've discussed how leveraging StyleGAN allows pSp to generate multiple outputs for a single input sketch using style mixing on the fine-level styles. Let's see this in action using the model we trained above. At a high level, style mixing can be performed as follows:
# generate random vectors to inject into input image
vecs_to_inject = np.random.randn(opts.n_outputs_to_generate, 512).astype('float32')
multi_modal_outputs = []
for vec_to_inject in vecs_to_inject:
    cur_vec = torch.from_numpy(vec_to_inject).unsqueeze(0).to("cuda")
    # get latent vector to inject into our input image
    _, latent_to_inject = net(cur_vec, input_code=True, return_latents=True)
    # get output image with injected style vector
    res = net(input_image.unsqueeze(0).to("cuda").float(),
              latent_mask=latent_mask,
              inject_latent=latent_to_inject,
              alpha=opts.mix_alpha,
              resize=opts.resize_outputs)
    multi_modal_outputs.append(res[0])
Here, we are given a single input image (denoted by input_image) and perform style mixing using multiple randomly sampled style codes (denoted by vecs_to_inject). For the complete style mixing script, check out our official implementation.
Let's visualize some results!
As we can see, by using style mixing, we're able to generate a diverse set of realistic outputs for each input sketch, giving users a lot of flexibility in the translation task.

Conclusion

That's all for now! I hope you can appreciate the power of StyleGAN and the amazing things we can do with it. While I hope that I was able to share the flexibility pSp has to offer, we have just scratched the surface of what these powerful GAN encoders can truly bring to the table.
My team and I have been hard at work extending the ideas you've seen here, and if you're interested in learning more, I invite you to check out some of our recent works:
Please feel free to reach out if you have any questions or wish to discuss any of these ideas further!