PoE-GAN: Generating Images from Multi-Modal Inputs
PoE-GAN is a recent, fascinating paper where the authors generate images from multiple inputs like text, style, segmentation, and sketch. We dig into the architecture, the underlying math, and of course, generate some images along the way.
Introduction
We’ve seen models that can generate images from individual modalities like text, sketches, and masks. The question we’ll be asking today is whether you can generate images not just from a single modality, but from a combination of several. Essentially:
Is it possible to generate an image from inputs in multiple modalities such that the generated image satisfies all the given input conditions?
This is the question that the authors of the paper Multimodal Conditional Image Synthesis with Product-of-Experts GANs attempt to answer. Existing conditional image synthesis frameworks generate images based on user input in a single modality, such as Pix2Pix (accepts an image as input), GauGAN (accepts a semantic mask as input), and StackGAN (accepts text as input). But their inability to handle multi-modal inputs often limits their practical usage. The authors of the paper we're looking at today address this limitation by proposing the Product-of-Experts GAN framework (or PoE-GAN for short).
The TLDR here: PoE-GAN is an image synthesis framework that can generate diverse and high-quality images based on any number of modalities present in the user input. Let's dig in:
What Does PoE-GAN Generate?
Conditional image synthesis allows users to use creative inputs to control the output of image synthesis methods. This has found applications in many content creation tools, such as GauGAN by NVIDIA, a model that turns doodles into stunning, photorealistic landscapes. Such tools have historically depended on models that have been trained to convert single modality inputs to images.
However, different input modalities are best suited for conveying different types of conditioning information, as seen in the results produced by PoE-GAN. And the results are impressive.
A quick note about what each modality here is generally best suited for:
- A segmentation mask makes it easy to define the coarse layout of semantic classes in an image—the relative locations and sizes of sky, cloud, mountain, and water regions.
- Sketch, meanwhile, allows us to specify the structure and details within the same semantic region, such as individual mountain ridges.
- Text is well-suited for modifying and describing objects or regions in the image, which cannot be achieved by using segmentation or sketch, e.g. ‘frozen lake’ and ‘pink clouds’.
But really, we know you're here for the images. In the left column below, you'll see various images of mountains generated by single modalities and, of course, combinations of two or three:
The Big Ideas Behind PoE-GAN
A Product of Experts
First, let's formalize the learning objective for the task at hand:
Given a dataset of images paired with M different input modalities, the goal is to train a single generative model that learns to capture the image distribution conditioned on an arbitrary subset of possible modalities.
For the scope of this paper, the authors consider four different modalities: text, semantic segmentation, sketch, and style reference. Notably, each input modality adds a constraint the synthesized image must satisfy. The set of images that satisfy all constraints is the intersection of the sets that each satisfy one individual constraint. As illustrated in the figure below, the authors model this by assuming that the jointly conditioned probability distribution is proportional to the product of the singly conditioned probability distributions. Under this setting, for the product distribution to have high density in a region, each of the individual distributions needs to have high density in that region, thereby satisfying each constraint. This modeling approach is known as product-of-experts.

The product of distributions is analogous to the intersection of the sets. It has a high density where both distributions have relatively high densities. Source: Figure 2 from the paper (https://arxiv.org/pdf/2112.05130)
Products of Experts is a method of combining multiple probabilistic models of the same data by multiplying their probabilities together and then renormalizing. This is a very efficient way of modeling high-dimensional data (the distribution of images in this case) that simultaneously satisfies many different low-dimensional constraints (the multiple modalities of conditional input in this case).
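To make the intersection idea concrete, here is a tiny 1-D illustration (not from the paper, with arbitrary means and variances): two Gaussian "experts" are multiplied and renormalized, and the resulting product only has high density where both experts do.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

xs = np.linspace(-5, 5, 1001)
dx = xs[1] - xs[0]

# Two 1-D "experts" with arbitrary, purely illustrative parameters.
expert_a = gaussian_pdf(xs, mu=-0.5, sigma=1.0)
expert_b = gaussian_pdf(xs, mu=1.0, sigma=0.7)

# Product of experts: multiply the densities, then renormalize.
product = expert_a * expert_b
product = product / (product.sum() * dx)

# The product only has high density where BOTH experts do -- the
# probabilistic analogue of intersecting the two constraint sets.
print("mode of expert A:", xs[expert_a.argmax()])
print("mode of expert B:", xs[expert_b.argmax()])
print("mode of product :", xs[product.argmax()])
```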
💡
For more details on Product-of-Experts modeling, refer to Hinton's original formulation in "Training Products of Experts by Minimizing Contrastive Divergence".
Multi-scale and Hierarchical Latent Space
Some of the modalities considered by the authors of PoE-GAN are two-dimensional and naturally contain information at multiple scales (e.g., sketch, segmentation). In order to better preserve these high-resolution control signals, the authors design a hierarchical latent space with latent variables at different resolutions. This allows the model to pass information directly from each resolution of the spatial encoder to the corresponding resolution of the latent space. The latent code is partitioned into groups, where each group is a feature map and the resolution increases from group to group. (For further details, refer to Section 3.2 of the PoE-GAN paper.)
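As a rough mental model (shapes and channel counts below are illustrative, not the paper's exact configuration), the hierarchical latent code can be thought of as a list of feature maps, one per resolution:

```python
import torch

# Illustrative sketch of a hierarchical latent code: one latent group per
# resolution, from a global 1x1 code up to a 64x64 feature map.
batch, channels = 4, 16
resolutions = [1, 4, 8, 16, 32, 64]
latent_groups = [torch.randn(batch, channels, r, r) for r in resolutions]

for z in latent_groups:
    print(tuple(z.shape))  # (4, 16, 1, 1), (4, 16, 4, 4), ..., (4, 16, 64, 64)
```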
Generator Architecture
The PoE-GAN generator is trained to map a latent code $z$ to an image $x$. Since the output image is uniquely determined by the latent code, the problem of estimating $p(x \mid y_1, \ldots, y_M)$ is equivalent to that of estimating $p(z \mid y_1, \ldots, y_M)$. In this case, Product-of-Experts is used to model the conditional latent distribution:

$$p(z \mid y_1, \ldots, y_M) \propto p(z) \prod_{i=1}^{M} q_i(z \mid y_i)$$

...where:
- $p(z)$ is the prior distribution
- each distribution $q_i(z \mid y_i)$ is predicted by the encoder of that single modality.
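Assuming each expert is a diagonal Gaussian $q_i = \mathcal{N}(\mu_i, \sigma_i^2)$, as is typical in product-of-experts formulations (stated here as an assumption rather than a quote from the paper), the product has a convenient closed form: precisions add, and the means are precision-weighted:

$$\sigma_{\mathrm{poe}}^2 = \left(\sum_i \frac{1}{\sigma_i^2}\right)^{-1}, \qquad \mu_{\mathrm{poe}} = \sigma_{\mathrm{poe}}^2 \sum_i \frac{\mu_i}{\sigma_i^2}$$

Here the prior $p(z)$ is simply included as one more expert in the sums.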

Encoder
- The generator encodes each modality into a feature vector, which is then aggregated in the Global PoE-Net.
- Convolutional networks with input skip connections are used to encode segmentation masks and sketches.
- A residual network is used to encode style images.
- In the Global PoE-Net, a latent feature vector is sampled using product-of-experts and then processed by an MLP to output a feature vector (see the sketch below).
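Here is a minimal sketch of how such a fusion could be wired up, assuming Gaussian experts; poe_fuse, GlobalPoENet, and all layer sizes are hypothetical names and values, not the paper's reference implementation:

```python
import torch
import torch.nn as nn

def poe_fuse(mus, logvars):
    """Fuse Gaussian experts with a product of experts (precisions add)."""
    precisions = [torch.exp(-lv) for lv in logvars]
    total_prec = torch.stack(precisions).sum(dim=0)
    fused_var = 1.0 / total_prec
    fused_mu = fused_var * torch.stack(
        [m * p for m, p in zip(mus, precisions)]).sum(dim=0)
    return fused_mu, fused_var

class GlobalPoENet(nn.Module):
    """Hypothetical sketch of the Global PoE-Net: each present modality
    contributes a Gaussian expert over the global latent; the prior N(0, I)
    is always included. Layer sizes are illustrative only."""
    def __init__(self, dim=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, expert_mus, expert_logvars):
        b, d = expert_mus[0].shape
        prior_mu = torch.zeros(b, d)
        prior_logvar = torch.zeros(b, d)          # log(1) = 0 -> unit variance
        mu, var = poe_fuse([prior_mu] + expert_mus,
                           [prior_logvar] + expert_logvars)
        z = mu + var.sqrt() * torch.randn_like(mu)  # reparameterized sample
        return self.mlp(z)

# Usage: two modalities present (e.g. text and style), each encoded to (mu, logvar).
feats = [(torch.randn(4, 256), torch.zeros(4, 256)) for _ in range(2)]
net = GlobalPoENet()
w = net([m for m, _ in feats], [lv for _, lv in feats])
print(w.shape)  # torch.Size([4, 256])
```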

Decoder
The decoder generates the image using the output of the Global PoE-Net and skip connections from the segmentation and sketch encoders. At a given resolution $k$, the Local PoE-Net samples the latent feature map $z_k$, where the prior expert is computed from the outputs of the previous layer and each modality expert is computed by concatenating the outputs of the previous layer with the skip connection from the corresponding modality encoder. Note that only the modalities with skip connections (i.e., segmentation mask and sketch) contribute to this computation in the Local PoE-Net; the other modalities (i.e., text and style reference) only provide global information, not local details. The latent feature map produced by the Local PoE-Net and the feature vector produced by the Global PoE-Net are fed to the Local-Global Adaptive Instance Normalization (LG-AdaIN) layer, which is given by

$$\text{LG-AdaIN}(h, w, z_k) = \gamma_z(z_k) \odot \left( \gamma_w(w)\,\frac{h - \mu(h)}{\sigma(h)} + \beta_w(w) \right) + \beta_z(z_k)$$

...where:
- $h$ is a feature map in the residual branch after convolution
- $\mu(h)$ and $\sigma(h)$ are its channel-wise mean and standard deviation
- $w$ is the feature vector output from the Global PoE-Net
- $z_k$ is the latent feature map produced by the Local PoE-Net
- $\gamma_w$ and $\beta_w$ are feature vectors computed from $w$
- $\gamma_z$ and $\beta_z$ are feature maps computed from $z_k$

The LG-AdaIN layer can be viewed as a combination of AdaIN (Adaptive Instance Normalization) and SPADE (Spatially Adaptive Denormalization), taking both a global feature vector and a spatially varying feature map to modulate the activations.
💡
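To ground the description above, here is a hedged sketch of what an LG-AdaIN-style layer could look like; the module name, layer sizes, and the exact composition of the global and spatial modulations are illustrative assumptions, not the authors' code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LGAdaIN(nn.Module):
    """Sketch of an LG-AdaIN-style layer: instance-normalize the feature map,
    modulate it globally with (gamma_w, beta_w) computed from the global vector w
    (AdaIN-style), then spatially with (gamma_z, beta_z) computed from the
    latent feature map z_k (SPADE-style)."""
    def __init__(self, channels, w_dim, z_channels):
        super().__init__()
        self.norm = nn.InstanceNorm2d(channels, affine=False)
        self.to_global = nn.Linear(w_dim, 2 * channels)        # -> gamma_w, beta_w
        self.to_spatial = nn.Conv2d(z_channels, 2 * channels, 3, padding=1)

    def forward(self, h, w, z_k):
        gamma_w, beta_w = self.to_global(w).chunk(2, dim=1)
        gamma_w = gamma_w[..., None, None]
        beta_w = beta_w[..., None, None]
        z_k = F.interpolate(z_k, size=h.shape[-2:], mode="nearest")
        gamma_z, beta_z = self.to_spatial(z_k).chunk(2, dim=1)
        out = gamma_w * self.norm(h) + beta_w        # global (AdaIN-like) modulation
        return gamma_z * out + beta_z                # spatial (SPADE-like) modulation

# Usage with illustrative shapes
layer = LGAdaIN(channels=64, w_dim=256, z_channels=16)
h = torch.randn(4, 64, 32, 32)
w = torch.randn(4, 256)
z_k = torch.randn(4, 16, 32, 32)
print(layer(h, w, z_k).shape)  # torch.Size([4, 64, 32, 32])
```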
Discriminator Architecture
The authors of PoE-GAN propose a multimodal projection discriminator that generalizes the projection discriminator to handle multiple conditional inputs. Unlike the standard projection discriminator, which computes a single inner product between the image embedding and the conditional embedding, PoE-GAN's Multimodal Projection Discriminator (MPD) computes one inner product per input modality and adds them together to obtain the final loss. The MPD can be visualized as:

Comparison between the standard projection discriminator (left) and the multimodal projection discriminator (right). Source: Figure 6 from the paper (https://arxiv.org/abs/2112.05130)
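A minimal sketch of the projection idea with multiple conditions; the encoders, dimensions, and module names below are hypothetical, and the real discriminator includes image and per-modality backbones not shown here:

```python
import torch
import torch.nn as nn

class MultimodalProjectionHead(nn.Module):
    """Sketch of the multimodal projection: an unconditional score plus one
    inner product between the image embedding and each conditional embedding,
    with the terms summed. Missing modalities simply contribute nothing."""
    def __init__(self, img_dim=256, cond_dims=(256, 256)):
        super().__init__()
        self.uncond = nn.Linear(img_dim, 1)
        self.cond_proj = nn.ModuleList(nn.Linear(d, img_dim) for d in cond_dims)

    def forward(self, img_emb, cond_embs):
        score = self.uncond(img_emb).squeeze(-1)
        for proj, c in zip(self.cond_proj, cond_embs):
            if c is not None:                       # skip absent modalities
                score = score + (img_emb * proj(c)).sum(dim=-1)
        return score                                # higher = more real and better aligned

# Usage: image embedding plus two conditional embeddings (e.g. text + sketch)
head = MultimodalProjectionHead()
score = head(torch.randn(4, 256), [torch.randn(4, 256), None])
print(score.shape)  # torch.Size([4])
```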
For spatial modalities such as segmentation and sketch, it is more effective to enforce their alignment with the image at multiple scales. Hence, the authors of PoE-GAN devise a multi-scale MPD, where the image and the spatial modalities are encoded into feature maps at different resolutions and the MPD loss is computed at each resolution. Having computed a loss value at each location and resolution, the final loss is obtained by averaging first across locations and then across resolutions.

Multi-scale Multi-modal Projection Discriminator. Source: Figure 7 from the paper (https://arxiv.org/abs/2112.05130)
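The aggregation itself is easy to sketch. Assuming we already have per-location MPD scores at a few resolutions (the logistic real-sample loss below is an illustrative stand-in, not the paper's exact loss):

```python
import torch
import torch.nn.functional as F

def multiscale_mpd_loss(score_maps):
    """Average per-location losses across spatial positions first,
    then across resolutions, as described above."""
    per_resolution = []
    for s in score_maps:                      # s: (batch, H_k, W_k) raw scores
        loss_map = F.softplus(-s)             # logistic loss for real samples (illustrative)
        per_resolution.append(loss_map.mean(dim=(1, 2)))   # average over locations
    return torch.stack(per_resolution, dim=0).mean(dim=0)  # then over resolutions

# Usage with three illustrative resolutions
scores = [torch.randn(4, r, r) for r in (8, 16, 32)]
print(multiscale_mpd_loss(scores).shape)  # torch.Size([4])
```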
Loss Functions
Latent Regularization
In order for the marginalized conditional latent distribution to match the unconditional prior distribution under the Product-of-Experts assumption, the authors minimize the Kullback-Leibler divergence from the prior distribution to the conditional latent distribution at every resolution. The KL-divergence loss also reduces conditional mode collapse, since it encourages the conditional latent distribution to stay close to the prior and therefore have high entropy. It also encourages each modality to provide only the minimum information necessary to specify the conditional image distribution. The KL-divergence loss is given by

$$\mathcal{L}_{\mathrm{KL}} = \sum_{i=1}^{M} \lambda_i \sum_{k} \alpha_k \, \mathrm{KL}\!\left( q(z_k \mid z_{<k}, y_i) \,\|\, p(z_k \mid z_{<k}) \right)$$

where:
- $\alpha_k$ is a resolution-dependent rebalancing weight
- $\lambda_i$ is a modality-specific loss weight
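Below is a hedged sketch of computing this regularizer for diagonal Gaussian posteriors and priors; shapes, weight values, and function names are illustrative only, and the per-modality weight is omitted for brevity:

```python
import torch

def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    """KL( N(mu_q, var_q) || N(mu_p, var_p) ) for diagonal Gaussians, per sample."""
    var_q, var_p = logvar_q.exp(), logvar_p.exp()
    kl = 0.5 * (logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)
    return kl.flatten(1).sum(dim=1)

def latent_regularization(posteriors, priors, alphas):
    """Sum the KL between conditional latent distribution and prior at every
    resolution, rebalanced by a per-resolution weight alpha_k."""
    loss = 0.0
    for (mq, lq), (mp, lp), a in zip(posteriors, priors, alphas):
        loss = loss + a * kl_diag_gaussians(mq, lq, mp, lp)
    return loss

# Usage: two resolutions with illustrative shapes and weights
post = [(torch.randn(4, 16, r, r), torch.zeros(4, 16, r, r)) for r in (1, 4)]
prior = [(torch.zeros(4, 16, r, r), torch.zeros(4, 16, r, r)) for r in (1, 4)]
print(latent_regularization(post, prior, alphas=[1.0, 0.25]).shape)  # torch.Size([4])
```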
The Contrastive Losses
A contrastive loss takes the output of the network for a positive example and computes its distance to examples of the same class, contrasting it with the distance to negative examples. To put it simply, the loss is low if positive samples are encoded to similar (closer) representations and negative examples are encoded to different (farther) representations. For more information on contrastive losses, refer to this article: Contrastive Loss Explained.
Given a batch of paired vectors $\{(u_i, v_i)\}_{i=1}^{N}$, the symmetric cross-entropy loss maximizes the similarity of the vectors in a pair while keeping non-paired vectors apart:

$$\ell(u, v) = -\sum_{i=1}^{N} \log \frac{\exp(\mathrm{sim}(u_i, v_i)/\tau)}{\sum_{j=1}^{N} \exp(\mathrm{sim}(u_i, v_j)/\tau)} - \sum_{i=1}^{N} \log \frac{\exp(\mathrm{sim}(u_i, v_i)/\tau)}{\sum_{j=1}^{N} \exp(\mathrm{sim}(u_j, v_i)/\tau)}$$

where $\mathrm{sim}(\cdot,\cdot)$ denotes cosine similarity and $\tau$ is a temperature hyper-parameter.
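A compact sketch of this symmetric contrastive loss; the use of cosine similarity and the temperature value below are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def symmetric_contrastive_loss(u, v, temperature=0.1):
    """Symmetric cross-entropy (InfoNCE-style) loss: paired rows of u and v
    are pulled together, non-paired rows pushed apart."""
    u = F.normalize(u, dim=-1)
    v = F.normalize(v, dim=-1)
    logits = u @ v.t() / temperature              # (N, N) cosine-similarity matrix
    targets = torch.arange(u.shape[0])            # the i-th u pairs with the i-th v
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Usage with illustrative embeddings
print(symmetric_contrastive_loss(torch.randn(8, 128), torch.randn(8, 128)))
```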
The authors of PoE-GAN propose two kinds of contrastive losses: the Image Contrastive Loss and the Conditional Contrastive Loss.
The Image Contrastive Loss maximizes the similarity between a real image $x$ and a random fake image $\hat{x}$ synthesized from the corresponding conditional inputs. It is given by $\mathcal{L}_{\mathrm{C}}^{\mathrm{image}} = \ell(\phi(x), \phi(\hat{x}))$, where $\phi$ is a pre-trained VGG encoder. Note that this loss is very similar to the perceptual loss widely used in conditional synthesis, but performs better in comparison.
The Conditional Contrastive Loss aims to better align images with the corresponding conditions. Specifically, the discriminator is trained to maximize the similarity between its embedding of a real image $x$ and its embedding of the conditional input $y$. It is given by $\mathcal{L}_{\mathrm{C}}^{\mathrm{cond},D} = \ell(f(x), g(y))$, where $f$ and $g$ are two modules in the discriminator that extract features from $x$ and $y$, respectively.
The generator is trained with the same loss, but using the generated image $\hat{x}$ instead of the real image to compute the discriminator embedding. The generator loss is given by $\mathcal{L}_{\mathrm{C}}^{\mathrm{cond},G} = \ell(f(\hat{x}), g(y))$.
The training objective for PoE-GAN can be summarized as

$$\min_D \; \mathcal{L}_{\mathrm{GAN}}^{D} + \lambda_{\mathrm{GP}}\,\mathcal{L}_{\mathrm{GP}} + \lambda_{\mathrm{C}}\,\mathcal{L}_{\mathrm{C}}^{\mathrm{cond},D}, \qquad \min_G \; \mathcal{L}_{\mathrm{GAN}}^{G} + \lambda_{\mathrm{KL}}\,\mathcal{L}_{\mathrm{KL}} + \lambda_{\mathrm{C}}\left(\mathcal{L}_{\mathrm{C}}^{\mathrm{cond},G} + \mathcal{L}_{\mathrm{C}}^{\mathrm{image}}\right)$$

...where:
- $\mathcal{L}_{\mathrm{GAN}}^{G}$ and $\mathcal{L}_{\mathrm{GAN}}^{D}$ are non-saturating GAN losses
- $\mathcal{L}_{\mathrm{GP}}$ is the gradient penalty loss
- $\lambda_{\mathrm{KL}}$, $\lambda_{\mathrm{C}}$, and $\lambda_{\mathrm{GP}}$ are weights associated with the loss terms
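Putting the pieces together, here is a toy sketch of how the terms could be combined in a single training step; every loss value and weight below is a placeholder, not the paper's hyper-parameters:

```python
import torch

# Illustrative weights only.
lambda_kl, lambda_c, lambda_gp = 0.01, 0.3, 1.0

# Pretend these scalars were computed by the pieces sketched earlier in this post.
d_gan, g_gan = torch.tensor(0.9), torch.tensor(1.1)
gp, kl = torch.tensor(0.05), torch.tensor(4.2)
cond_c_d, cond_c_g, img_c = torch.tensor(2.0), torch.tensor(2.1), torch.tensor(1.7)

d_loss = d_gan + lambda_gp * gp + lambda_c * cond_c_d
g_loss = g_gan + lambda_kl * kl + lambda_c * (cond_c_g + img_c)
print(d_loss.item(), g_loss.item())
```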
Results of PoE-GAN
Single-Modal Results
When tested using a single input modality, PoE-GAN outperforms previous state-of-the-art approaches specifically designed for that modality, such as the segmentation-to-image methods (SPADE, OASIS) and the text-to-image synthesis methods (DF-GAN, DM-GAN + CL).
Segmentation-to-Image Synthesis
Text-to-Image Synthesis
Results on Arbitrary Subsets of Modalities
PoE-GAN can produce diverse output images when conditioned on an arbitrary subset of modalities. Below we show some random samples from PoE-GAN conditioned on pairs of modalities (text + segmentation, text + sketch, segmentation + sketch, and combinations with a style reference) on a dataset of landscape images.
Text + Segmentation Input
Text + Sketch Input
Segmentation + Sketch Input
Segmentation + Style Reference
Sketch + Style Reference
Text + Style Reference
Results on Unconditional Inputs
PoE-GAN becomes an unconditional generative model when given no input modalities. Below are un-curated samples generated unconditionally by PoE-GAN.
Unconditional Results on the 256x256 MS-COCO Dataset
Unconditional Results on the 1024x1024 Landscape Dataset
Unconditional Results on the 1024x1024 MM-CelebA-HQ Dataset
Performance of PoE-GAN
Model Size Comparison
In the charts below, we compare the number of parameters used in PoE-GAN and the baselines used by the authors. We can see that PoE-GAN does not use significantly more parameters; in fact, it uses fewer parameters than some of the single-modal baselines, in spite of being trained for a much more challenging task. This shows that the improvements in PoE-GAN's results do not come simply from using a larger model.
Comparison on MM-CelebA-HQ (256x256)
In the charts below, we compare PoE-GAN trained on the 256x256 MM-CelebA-HQ dataset with the text-to-image baselines used by the authors. PoE-GAN achieves a significantly lower FID and a slightly higher LPIPS than previous methods.
Note that a lower FID corresponds to higher quality of the synthesized images, while a higher LPIPS corresponds to greater diversity among them.
💡
Limitations of PoE-GAN
The authors' investigation reveals one important limitation of PoE-GAN: it doesn't work well when conditioned on contradictory multimodal inputs.
For example, when the segmentation and text inputs contradict each other, the text input is usually ignored. This is because, in the product-of-experts formulation, an expert with a larger variance has a smaller influence on the product distribution, and in this case the variance of the text expert is usually larger than that of the segmentation expert.
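A quick way to see this is with the Gaussian product-of-experts closed form from earlier (a 1-D illustration, not a calculation from the paper):

$$\mu_{\mathrm{poe}} = \frac{\mu_{\mathrm{seg}}/\sigma_{\mathrm{seg}}^2 + \mu_{\mathrm{text}}/\sigma_{\mathrm{text}}^2}{1/\sigma_{\mathrm{seg}}^2 + 1/\sigma_{\mathrm{text}}^2}$$

If, say, $\sigma_{\mathrm{text}}^2 = 10\,\sigma_{\mathrm{seg}}^2$, the segmentation expert's mean receives roughly ten times the weight of the text expert's mean, so the contradictory text input is largely washed out.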
Potential for Negative Societal Impacts
Image synthesis networks can help people express themselves and help artists create digital content, but they can undeniably also be misused for visual misinformation (as discussed in the article The Emergence of Deepfake Technology: A Review). Since PoE-GAN enables users to synthesize images using multiple modalities, it becomes even easier to create a desired fake image, which increases its potential for being used as a tool for spreading visual misinformation. The authors of PoE-GAN encourage research that helps detect or prevent such misuse.
Conclusion
In this post, we have seen how the authors of Multimodal Conditional Image Synthesis with Product-of-Experts GANs introduce a multimodal conditional image synthesis model based on product-of-experts and show its effectiveness at converting an arbitrary subset of input modalities into an image satisfying all conditions. We also discussed the novel generator and discriminator architectures, the loss functions, and the formulation of the training objective as part of this image synthesis framework.
We saw from the results that PoE-GAN is not only empirically superior to prior multimodal synthesis work, but also outperforms state-of-the-art unimodal conditional image synthesis approaches when conditioned on a single modality. Furthermore, we discussed the limitations of PoE-GAN and the authors' concerns regarding its potential negative societal impact in spreading visual misinformation.