Cute Animals and Post-Modern Style Transfer: StarGAN v2 for Multi-Domain Image Synthesis
This article explains how to diversify and streamline image generation across visual domains using StarGAN v2 and Weights & Biases.
This article showcases StarGAN v2: Diverse Image Synthesis for Multiple Domains, incredible work by Yunjey Choi, Youngjung Uh, Jaejun Yoo, and Jung-Woo Ha from CVPR 2020.
StarGAN v2 generates higher-quality and more diverse images with more flexibility across datasets and domains (e.g. whether the training data or output images belong to the category of "cats", "dogs", or "wildlife"). Here I explore their approach, share some interesting examples, and suggest future applications. I would love to hear your questions or suggestions via comments at the end of this report.
Table of Contents
- Detail View: Diverse Styles of One Identity
- Cross-Domain Synthesis
- Model Training and Visualization
- Future Work
Demo examples
How To Read the Image Grids
In the GIFs and image grids above, and throughout this report, the first (header) row and the first (header) column show images loaded into StarGAN v2 at inference time. The output image produced from a given header-row/header-column pair appears at their intersection, so every image in the grid outside the first row and first column was generated by StarGAN v2.
First Row: Source/Content
The top row shows the source, content, or identity image. This image guides the identity—or at least pose, expression, and overall facial structure—of the generated output. This identity is shared by all the images you see as you scan down a column from a given content image. Perhaps you'll notice the shared resemblance—though crucially, not an identical match—in each column of human faces as you do this.
First Column: Reference/Style
The first, leftmost column shows the reference or style image. This image guides the color and texture of the generated output (e.g. the fur pattern of the cats, the hair style and makeup of the women). This style is shared by all the images you see as you scan across a row for a given style image. As you do this for animal faces, you may notice they look like different individuals or life stages of the same breed.
Detail View: Diverse Styles of One Identity
These additional examples illustrate the high level of visual quality and realism (scroll down inside the panel to view many more). Again the first row shows the content or identity images, and the first column (duplicated as the third column) shows the reference or style images.
The pose and expression of the two dogs are transferred very consistently. Interestingly, despite the strong visual similarity across each row, the animals in the rightmost column all have narrower features and look more juvenile than their matches in the second column, especially in the cat section. Could this be because the black-and-white content dog happens to be a puppy while the brown dog happens to be an adult?
Source example
Cross-Domain Synthesis
StarGAN v2 achieves impressive quality, diversity, and realism within a single visual domain (e.g. "cats" or "female-presenting celebrities"). The cross-domain synthesis results are even more exciting. A "domain" here is a type or category of images sharing certain visual attributes or features (such as "cats" versus "dogs" versus "wildlife").
To generate a new output image, StarGAN v2 requires two inputs:
- the source/content image for the target identity (pose, facial structure, expression, etc)
- the reference/style image for the target visual style/appearance (colors and textures like hairstyle or fur pattern, beards, makeup, whiskers, etc).
A third, implicit requirement is a target domain: one of the N domains on which the network was trained. N is fixed before training, and in the paper it refers to real-world conceptual categories: male/female (N=2) for the CelebA-HQ dataset and human face generation network, and cat/dog/wildlife (N=3) for the Animal Faces High Quality (AFHQ) dataset and animal face generation network. This requirement is implicit because the paper generally matches the target domain to the domain of the reference image.
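To make these inputs concrete, here is a minimal inference sketch using the module names from the official repository (Generator and StyleEncoder in core/model.py). The constructor arguments and domain indices below are simplified assumptions rather than the exact configuration of any released checkpoint:

```python
import torch
from core.model import Generator, StyleEncoder  # modules from the official stargan-v2 repo

# Assumed configuration: 256x256 images, 64-d style codes, 3 animal domains,
# and w_hpf=0 (no face-mask high-pass filter), as in the AFHQ setup.
generator = Generator(img_size=256, style_dim=64, w_hpf=0)
style_encoder = StyleEncoder(img_size=256, style_dim=64, num_domains=3)

x_src = torch.randn(1, 3, 256, 256)  # source/content image: identity, pose, expression
x_ref = torch.randn(1, 3, 256, 256)  # reference/style image: colors, textures
y_trg = torch.tensor([2])            # target-domain index, e.g. 0=cat, 1=dog, 2=wildlife

with torch.no_grad():
    # The style encoder has one output branch per domain; y_trg picks the branch.
    # Nothing forces y_trg to match the reference image's actual domain.
    s_trg = style_encoder(x_ref, y_trg)
    x_fake = generator(x_src, s_trg)  # x_src's identity rendered in x_ref's style
```

Styles can also be sampled instead of extracted from a reference image: the repository's MappingNetwork maps a random latent vector plus a target-domain index to a style code in the same way.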
Separating Visual Style From Conceptual Domain
StarGAN v2 is essentially agnostic to what each domain means: no module encodes anything specific about a domain's content, and all modules except the generator (which produces a single output image) have multiple parallel output branches, one for each of the N domains, with the domain index simply selecting a branch. During training, the branch of each module corresponding to the example's domain is selected consistently. However, during inference we can select a different output branch to target a different domain—or see all N possibilities in parallel.
Below, the first row of each grid shows the sample content images from StarGAN v2, in three blocks of "cats" (5 columns), "dogs" (6 columns), and "wildlife" (5 columns). All of these content images are crossed with a single style image (the dog in the first column) into the same three target domains: "cats" (row 2), "dogs" (row 3), and "wildlife" (row 4). The paper focuses on results where the target domain matches the reference image: in this case, the third row, for "dog", has the most realistic results. However, the other two rows also produce interesting new animals. These can help disentangle the visual properties of the conceptual domain (a label imposed externally by the world/researchers when designing the model) from the visual style (a representation/encoding learned by the model independently of domain).
To see more detail, hover over the top right corner of the panel below and click on the empty square to bring up a large overlay version.
Reference example
Target Domain Is Independent of Style Image
The effect is easiest to observe when
- the target domain doesn't match the reference image (rows 2 and 4)
- additionally, the target domain doesn't match the source image. For each grid, these are the first 5 generated images in the bottom row (cat2dog AS wildlife: source is cat, reference is dog, target domain is wildlife) and the last 5 images in the second row (wildlife2dog AS cat: source is wildlife, reference is dog, target domain is cat).
Looking carefully at these regions, you may notice the generated images picking up the stylistic aspects most independent of the domains of the source and reference images.
Some Concrete Observations
- the ear shape: the cat and wildlife rows in both grids have distinctly pointy, straight ears, while the dog rows track more closely with a recognizable dog breed, especially in the wildlife2dog section, where we see more floppy ears
- the coloration: the cat and wildlife rows in both grids take on the fur texture/pattern of the reference image but map it onto a more realistic version for the particular domain. For example, in the cat2wildlife section of the top grid, the generated animals look most like real wolves or foxes, with an even distribution of brown/black/gray in the fur, including around the nose, rather than the distinct patches of color in the dog reference image. In contrast, the entire wildlife row in the second grid (last row) has a black nose on a white snout, with much darker fur around the white snout. These shifts are more consistent with the domain (wolves/foxes) than with the reference style (a real dog whose style the model has to reinterpret and encode as representative of the "cat" or "wildlife" domain).
- the background/color temperature of the reference image: all the generated images in the bottom grid take on the gray background and darker/colder color palette of the reference image, while in the top grid the background is consistently bright and faded. This is a mild but consistent effect with this model, and likely inevitable without explicit separation of foreground and background during training.
Model Training and Visualization
The StarGAN v2 repository is clear, well-organized, and easy to follow. You can download the datasets and pretrained models for generating human celebrity and animal faces. You can also train from scratch and set up logging and visualization to W&B with this fork of the repository. Here are some visualizations I found useful, especially given the long training time for these GANs (2-3 days for 100,000 iterations).
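If you are training from scratch, the kind of logging the fork adds can be sketched roughly like this (train_step below is a hypothetical stand-in for the repository's solver step, not its actual API):

```python
import numpy as np
import wandb

def train_step():
    # Hypothetical stand-in for one optimization step of the StarGAN v2 solver:
    # returns scalar losses and a small batch of generated sample images.
    d_loss, g_loss = 0.0, 0.0
    x_fake = np.zeros((8, 256, 256, 3))  # placeholder sample images
    return d_loss, g_loss, x_fake

wandb.init(project="stargan-v2", config={"img_size": 256, "num_domains": 3})

total_steps = 100_000
for step in range(total_steps):
    d_loss, g_loss, x_fake = train_step()

    # Log scalar losses every step; log image grids only occasionally,
    # since images are the heavy part of a 2-3 day run.
    wandb.log({"D/total": d_loss, "G/total": g_loss}, step=step)
    if step % 5000 == 0:
        wandb.log({"samples": [wandb.Image(img) for img in x_fake[:8]]}, step=step)
```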
Check on Image Quality During Training
Visualizing model performance at a glance shows the network quickly capturing the basic shapes of the content image and then refining the details towards more realism, first towards the general domain of the reference image and then focusing increasingly on the details of the reference image. You can use the slider below the image grids to highlight a subregion of time, or hover over the top right corner and click on the empty square to see a larger version.
Quality over time examples
Visualize and Analyze Loss Components
I plot the various losses for the discriminator and generator components of a StarGAN v2 model trained on CelebA-HQ for close to 50 hours. In all plots, the latent/noise image losses are shown in more saturated hues (red, magenta, dark blue) than the reference image losses (pinks, light blue). The latent loss on real images has the greatest impact on the discriminator (top chart), and the adversarial loss on the generator is the dominant component overall (bottom left). The other components of the generator loss are shown in more detail in the bottom right plot (zoomed in from the bottom left plot):
- in early training, the style reconstruction loss dominates
- the style diversification loss increases and becomes the strongest learning component 5-10 hours in—the authors mention linearly decaying the weight on this style diversification loss to 0 over time to stabilize training
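The decay the authors describe is straightforward; here is a minimal sketch of what that schedule might look like inside a training loop (the variable names and initial value are assumptions, not the repository's exact configuration):

```python
lambda_ds_init = 1.0      # initial weight on the style diversification loss (the paper uses values on the order of 1-2)
ds_decay_steps = 100_000  # iterations over which the weight decays linearly to zero

def lambda_ds(step):
    """Linearly decay the diversification weight toward 0 to stabilize late training."""
    return max(0.0, lambda_ds_init * (1.0 - step / ds_decay_steps))

# Schematically, the diversification term is *maximized* (so it enters the total
# generator loss with a negative sign):
#   g_loss = adv_loss + lambda_sty * sty_loss - lambda_ds(step) * ds_loss + lambda_cyc * cyc_loss
```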
Full training run on CelebA-HQ
Evaluate Model Performance
The paper leverages two evaluation metrics to show the superior quality of StarGAN v2 for image generation:
- Frechet Inception Distance or FID: measures the difference between two sets of images (real examples from one domain and ones generated by the model) based on their high-level representations from an Inception V3 network. Lower is better, meaning the generated images are more similar to real ones (a sketch of the computation follows this list).
- Learned Perceptual Image Patch Similarity or LPIPS: measures the distance between image patches on two sets of images. Higher is better for this paper, as it indicates more diversity in the various images generated for a single target domain.
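For reference, FID reduces to a closed-form distance between two Gaussians fit to Inception features. Here is a minimal sketch, assuming you have already extracted the feature vectors (the helper name is mine, not from the paper's evaluation code):

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_real, feats_fake):
    """Frechet distance between two sets of Inception V3 feature vectors.

    feats_real, feats_fake: arrays of shape (num_images, feature_dim),
    e.g. 2048-d pooled features from a pretrained Inception V3.
    """
    mu_r, cov_r = feats_real.mean(axis=0), np.cov(feats_real, rowvar=False)
    mu_f, cov_f = feats_fake.mean(axis=0), np.cov(feats_fake, rowvar=False)

    # Matrix square root of the covariance product; small imaginary parts
    # from numerical error are discarded.
    covmean = linalg.sqrtm(cov_r @ cov_f)
    if np.iscomplexobj(covmean):
        covmean = covmean.real

    diff = mu_r - mu_f
    return diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean)
```

For the diversity score, the pip-installable lpips package exposes the learned perceptual metric (e.g. lpips.LPIPS(net='alex') applied to pairs of generated images scaled to [-1, 1]); whether that matches the exact backbone in the paper's evaluation code is worth verifying against the repository.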
The charts below show FID (left) and LPIPS (right) scores for all six domain translations and the mean score in black for a StarGAN v2 variant. Though this version of the model was only trained for 3000 iterations, longer training runs show a similar relative ranking and trends. Target domains are reds for wildlife, blues for dogs, and violets for cats.
Eval metrics examples
Get Ideas for Next Experiments
Although this is a small set of metrics from which to draw conclusions, it can inform future exploration: how to expand the dataset, how to define additional domains, how to adjust the relative weighting of different learning signals, etc.
Initial Observations
- wildlife is the hardest domain in which to generate images: it shows the least improvement in FID, the highest FID overall, and the lowest LPIPS (indicating a lack of diversity in the generated images). This is likely because the "wildlife" category as the authors define it has the most inherent genetic/evolutionary diversity, ranging from leopards and lions to wolves and foxes, as opposed to the comparatively few breeds of domestic cats
- generated dogs are the most diverse domain, perhaps because intra-domain diversity is so high across dog breeds. Dogs are also very slightly easier to model than wildlife, and substantially harder to model than cats.
- cats are the easiest domain to realistically synthesize, but generated cats are slightly less diverse than dogs: more evidence of the mysterious and special ties between computers and cats
Future Work
Synthesize Your Own Cute Animals
This is but a small selection of the possibilities. Image quality and resolution don't have a huge effect, but a center crop of the face with the animal looking at the camera yields the best results.
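As a small illustration of that preprocessing advice, here is a minimal sketch (the helper name is mine, not from the Colab) that center-crops a photo to a square and resizes it to the 256x256 resolution the pretrained models expect:

```python
from PIL import Image

def prepare_photo(path, size=256):
    """Center-crop an image to a square around the middle and resize it.

    Works best when the animal's face fills the frame and looks at the camera.
    """
    img = Image.open(path).convert("RGB")
    w, h = img.size
    side = min(w, h)
    left, top = (w - side) // 2, (h - side) // 2
    img = img.crop((left, top, left + side, top + side))
    return img.resize((size, size), Image.LANCZOS)
```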
You can try it easily on your own photos in this Colab notebook. Pet enthusiasts from the machine learning community have generously provided some examples for us to use. Thank you all!
Please share any of your cute or interesting results :)
Meaningful Applications of GANs
While I have chosen to focus on furry creatures in this report, improving the quality, diversity, and scalability of GAN-driven image synthesis can help in much more critical applications like healthcare and scientific research.
From Discrete Labels to Spectral Understanding
The functional separation in this work between learning a broad and diverse spectrum of visual style (through latent representations and encodings of reference images) versus learning a narrow, strict, often-socially-constructed domain label (like "gender") is very encouraging. The more our models can learn a nuanced and semantically rich representation of the world, the less likely we are to find our future selves disappointed—or worse, oppressed—by their rigid definitions and insufficient understanding.
Generated examples