This report showcases StarGAN v2: Diverse Image Synthesis for Multiple Domains, incredible work by Yunjey Choi, Youngjung Uh, Jaejun Yoo, and Jung-Woo Ha, presented at CVPR 2020.
StarGAN v2 generates higher-quality and more diverse images with more flexibility across datasets and domains (e.g. whether the training data or output images belong to the category of "cats", "dogs", or "wildlife"). Here I explore their approach, share some interesting examples, and suggest future applications. I would love to hear your questions or suggestions via comments at the end of this report.
In the gifs and image grids above, and throughout this report, the first (header) row and the first (header) column show images loaded into StarGAN v2 at inference time. The output image they produce appears at the intersection of each header row and header column, so every image in the grid outside the first row and first column was generated by StarGAN v2.
The top row shows the source, content, or identity image. This image guides the identity—or at least pose, expression, and overall facial structure—of the generated output. This identity is shared by all the images you see as you scan down a column from a given content image. Perhaps you'll notice the shared resemblance—though crucially, not an identical match—in each column of human faces as you do this.
The first, leftmost column shows the reference or style image. This image guides the color and texture of the generated output (e.g. the fur pattern of the cats, the hair style and makeup of the women). This style is shared by all the images you see as you scan across a row for a given style image. As you do this for animal faces, you may notice they look like different individuals or life stages of the same breed.
These additional examples illustrate the high level of visual quality and realism (scroll down inside the panel to view many more). Again the first row shows the content or identity images, and the first column (duplicated as the third column) shows the reference or style images.
The pose and expression of the two dogs are transferred very consistently. Interestingly, despite the strong visual similarity across each row, the animals in the rightmost column all have narrower features and look more juvenile, especially in the cat section, than their matches in the second column—could this be because the black-and-white content dog happens to be a puppy while the brown dog happens to be an adult?
StarGAN v2 achieves impressive quality, diversity, and realism within a single visual domain ("cats" and "female-presenting celebrities"). The cross-domain synthesis results are even more exciting. A "domain" here is a type or category of images commonly sharing certain visual attributes or features (such as "cats" versus "dogs" versus "wildlife").
To generate a new output image, StarGAN v2 requires two inputs: a source (content) image, which supplies the pose and overall structure, and a style, provided either as a reference image or sampled as a latent code.
A third implicit requirement is a target domain: one of the N domains on which the network was trained. N is fixed before training, and in the paper it refers to real-world conceptual categories: male/female (N=2) for the CelebA dataset/human face generation network, and cat/dog/wildlife (N=3) for the Animal Faces High Quality (AFHQ) dataset/animal face generation network. This requirement is implicit because the paper generally matches the target domain to the domain of the reference image.
StarGAN v2 is essentially domain-agnostic: no module requires a specific domain as an input value, and every module except the generator (which produces a single output image) has multiple parallel output branches, one for each of the N domains. During training on a specific domain, the branch of each module corresponding to that domain is selected consistently. During inference, however, we can select a different output branch to target a different domain—or see all N possibilities in parallel.
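As a rough illustration, the per-domain branching can be sketched with a toy numpy model (the weights, dimensions, and the `style_encode` helper below are all hypothetical stand-ins for the paper's actual convolutional modules):

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_DOMAINS = 3        # e.g. cat / dog / wildlife for AFHQ
FEAT_DIM, STYLE_DIM = 16, 8

# One shared trunk, then one output head per domain
# (toy dense weights standing in for the real conv layers).
W_shared = rng.normal(size=(FEAT_DIM, FEAT_DIM))
W_heads = rng.normal(size=(NUM_DOMAINS, FEAT_DIM, STYLE_DIM))

def style_encode(image_feats, target_domain):
    """Run the shared trunk, then select the branch for the target domain."""
    h = np.tanh(image_feats @ W_shared)   # shared layers
    return h @ W_heads[target_domain]     # domain-specific output branch

x = rng.normal(size=FEAT_DIM)
# During training, the branch matching the sample's domain is used;
# at inference we are free to query any branch, or all of them:
styles = [style_encode(x, d) for d in range(NUM_DOMAINS)]
```

The key design point is that the domain only ever selects an output branch; it is never fed into the computation itself, which is what makes cross-domain mixing at inference time possible.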
Below, the first row of each grid shows the sample content images from StarGAN v2, in three blocks of "cats" (5 columns), "dogs" (6 columns), and "wildlife" (5 columns). All of these content images are crossed with a single style image (the dog in the first column) into the same three target domains: "cats" (row 2), "dogs" (row 3), and "wildlife" (row 4). The paper focuses on results where the target domain matches the reference image: in this case, the third row, for "dog", has the most realistic results. However, the other two rows also produce interesting new animals. These can help disentangle the visual properties of the conceptual domain (a label imposed externally by the world/researchers when designing the model) from the visual style (a representation/encoding learned by the model independently of domain).
To see more detail, hover over the top right corner of the panel below and click on the empty square to bring up a large overlay version.
The effect is easiest to observe when
Looking carefully at these regions, you may notice the generated images picking up the stylistic aspects most independent of the domains of the source and reference images.
The StarGAN v2 repository is clear, well-organized, and easy to follow. You can download the datasets and pretrained models for generating human celebrity and animal faces. You can also train from scratch and set up logging and visualization to W&B with this fork of the repository. Here are some visualizations I found useful, especially given the long training time for these GANs (2-3 days for 100,000 iterations).
Visualizing model performance at a glance shows the network quickly capturing the basic shapes of the content image, then refining details toward greater realism: first matching the general domain of the reference image, then increasingly its specific details. You can use the slider below the image grids to highlight a subregion of time, or hover over the top right corner and click on the empty square to see a larger version.
I plot the various losses for the discriminator and generator components of a StarGAN v2 model trained on CelebA-HQ for close to 50 hours. In all plots, the latent/noise image losses are shown in more saturated hues (red, magenta, dark blue) than the reference image losses (pinks, light blue). The latent loss on real images has the greatest impact on the discriminator (top chart), and the adversarial loss on the generator is the dominant component overall (bottom left). The other components of the generator loss are shown in more detail in the bottom right plot (zoomed in from bottom left plot):
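The full generator objective in the paper combines four terms: the adversarial loss, a style reconstruction loss, a diversity-sensitive loss (which is maximized, so it enters with a minus sign), and a cycle-consistency loss. A minimal sketch of that weighted sum (the function name and default λ values here are illustrative; check the repository configs for the actual per-dataset weights):

```python
def stargan_v2_generator_loss(adv, sty, ds, cyc,
                              lambda_sty=1.0, lambda_ds=1.0, lambda_cyc=1.0):
    """Weighted sum of StarGAN v2 generator loss terms.

    adv: adversarial loss, sty: style reconstruction loss,
    ds: diversity-sensitive loss (maximized, hence subtracted),
    cyc: cycle-consistency loss.
    """
    return adv + lambda_sty * sty - lambda_ds * ds + lambda_cyc * cyc
```

This is why the diversity term trends in the opposite direction from the others in the charts: improving it makes the total generator loss smaller even as that individual curve grows.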
The paper leverages two evaluation metrics to show the superior quality of StarGAN v2 for image generation:
The charts below show FID (left) and LPIPS (right) scores for all six domain translations and the mean score in black for a StarGAN v2 variant. Though this version of the model was only trained for 3000 iterations, longer training runs show a similar relative ranking and trends. Target domains are reds for wildlife, blues for dogs, and violets for cats.
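For reference, FID compares Gaussians fitted to Inception-network activations of real versus generated images; lower is better. A minimal numpy sketch, assuming the feature means and covariances have already been computed upstream:

```python
import numpy as np

def fid(mu1, sigma1, mu2, sigma2):
    """Frechet Inception Distance between two Gaussians
    (mu, sigma) fitted to Inception feature activations.
    Identical distributions score zero."""
    diff = mu1 - mu2
    # Tr(sqrtm(sigma1 @ sigma2)) via the eigenvalues of the product,
    # which are real and non-negative for PSD covariances.
    eigvals = np.linalg.eigvals(sigma1 @ sigma2)
    tr_covmean = np.sum(np.sqrt(np.abs(eigvals.real)))
    return float(diff @ diff + np.trace(sigma1) + np.trace(sigma2)
                 - 2.0 * tr_covmean)

# Identical distributions score (approximately) zero:
score = fid(np.zeros(3), np.eye(3), np.zeros(3), np.eye(3))
```

LPIPS, by contrast, is a learned perceptual distance between image pairs, so higher LPIPS across outputs for the same content image indicates greater diversity.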
Although this is a small set of metrics from which to draw conclusions, it can inform future exploration: how to expand the dataset, how to define additional domains, how to adjust the relative weighting of different learning signals, etc.
This is but a small selection of the possibilities. Image quality and resolution don't have a huge effect, but a center crop of the face with the animal looking at the camera yields the best results.
You can try it easily on your own photos in this Colab notebook. Pet enthusiasts from the machine learning community have generously provided some examples for us to use. Thank you all!
Please share any of your cute or interesting results :)
While I have chosen to focus on furry creatures in this report, improving the quality, diversity, and scalability of GAN-driven image synthesis can help in much more critical applications like healthcare and scientific research:
The functional separation in this work between learning a broad and diverse spectrum of visual style (through latent representations and encodings of reference images) versus learning a narrow, strict, often-socially-constructed domain label (like "gender") is very encouraging. The more our models can learn a nuanced and semantically rich representation of the world, the less likely we are to find our future selves disappointed—or worse, oppressed—by their rigid definitions and insufficient understanding.