Barbershop: Hair Transfer with GAN-Based Image Compositing Using Segmentation Masks
A novel GAN-based optimization method for photo-realistic hairstyle transfer
Introduction
Recently, image editing applications powered by Generative Adversarial Networks (GANs) have become popular among both professional and casual users, especially tools for editing or creating images of human beings.
And even though recent work on GANs enables us to synthesize realistic hair and faces, it remains difficult to combine them into a single, coherent, and plausible image rather than a disjointed set of image patches. This often limits the practical adoption of these tools.
This is the exact problem that the authors of the paper Barbershop: GAN-based Image Compositing using Segmentation Masks attempt to solve. The authors of this paper present a novel solution to image blending, particularly for the problem of hairstyle transfer, based on GAN inversion. In this report, we will take a look at how this novel technique works and what makes it stand apart from the existing approaches.
But this is a GAN piece so we know what you want to see before we get started:
Sample results of Hair-transfer by Barbershop
The official implementation of Barbershop can be found at ZPdesu/Barbershop. There's also a fork of the official implementation at soumik12345/Barbershop which you can use to run Barbershop and visualize the results as a Weights & Biases Table.
Failure of Existing Approaches
- Despite the recent success of face editing based on latent space manipulation (think Image2StyleGAN and Image2StyleGAN++), most editing tasks operate on an image by changing global attributes such as pose, expression, gender, or age instead of tackling hair editing.
- Another approach to image editing is to select features from reference images and mix them together to form a single, composite image. Examples of composite image editing that have seen recent progress are hair transfer and face-swapping. These tasks are extremely difficult, mainly because the visual properties of different parts of an image are not independent of each other. For example:
- The visual qualities of hair are heavily influenced by ambient and reflected light as well as transmitted colors from the underlying face, clothing, and background.
- The pose of a head influences the appearance of the nose, eyes, and mouth, and the geometry of a person’s head and shoulders influences shadows and the geometry of their hair.
- Other challenges include disocclusion of the background (i.e., when a previously occluded region becomes visible), which happens when the hair region shrinks with respect to the background.
- Disocclusion of the face region can expose new parts of the face, such as the ears, forehead, or jawline.
- The shape of the hair is influenced by pose and also by the camera intrinsic parameters, and so the pose might have to change to adapt to the hair.
- Failure to account for the global consistency of an image leads to noticeable artifacts, with different regions of the image appearing disjointed even if each part is synthesized with a high level of realism.
- Previous GAN-based methods of hair transfer either use a complex pipeline of conditional generators in which each condition module is specialized to represent, process, and convert reference inputs with different visual attributes (such as MichiGAN), or make use of latent space optimization with carefully designed losses and gradient orthogonalization to explicitly disentangle hair attributes (such as LOHO). While both of these methods show very promising initial results, the authors find that they can be greatly improved. For example:
- Both methods need pre-trained inpainting networks to fill holes left over by misaligned hair masks, which may lead to blurry artifacts and unnatural boundaries.
- The authors state that better results can be achieved without an auxiliary inpainting network to fill the holes, as transitions between regions have higher quality if they are synthesized by a single GAN.
- The previous methods do not make use of a semantic alignment step to merge semantic regions from different reference images in latent space, e.g. to align a hair region and a face region from different images.
The Novelty of Barbershop
- The authors aim to produce a single coherent composite image that balances the fidelity of each region to its corresponding reference image while also being convincing and highly realistic overall.
- The authors provide precise and formal definitions for the concepts of identity, shape, structure, and appearance:
- The shape of the hair is the binary segmentation region
- The identity of an image of a head encompasses all the features one would need to identify an individual
- The appearance broadly refers to the fine details (such as hair color)
- The structure refers to coarser features (such as the form of locks of hair)
- The authors propose a novel latent space, called the FS space, for representing images. This new space is better at preserving details and is more capable of encoding spatial information.
- The authors also propose a new GAN-embedding algorithm for aligned embedding. Similar to previous work, the algorithm can embed an image to be similar to an input image. In addition, the image is slightly modified to conform to a new segmentation mask.
- The authors further propose a novel image compositing algorithm that can blend multiple images encoded in this new latent space to yield high-quality results.
- The authors achieve a significant improvement in hair transfer, with the proposed approach being preferred over existing state-of-the-art approaches by over 95% of participants in a user study.
Barbershop: How Does the Shop Operate???
Overview
Barbershop creates composite images by selecting semantic regions (such as hair, or facial features) from reference images and seamlessly blending them together.
In order to achieve this, the authors employ an automatic segmentation of the reference images and make use of a target semantic segmentation mask M. To perform hairstyle transfer, one can copy the hairstyle from one image and use another image for all other semantic categories (a minimal code sketch of this mask composition appears after the diagram below).
More generally, a set of reference images I_k, for k = 1, ..., K, are each aligned to the target mask M and then blended to form a novel image. The output of Barbershop is a composite image in which the region of semantic category k has the style of reference image I_k. A schematic overview of Barbershop is shown in the following diagram:

- (a) reference images for the face (top) and hair (bottom) features
- (b) reconstructed images using the W+ latent space
- (c) a target mask with hair region (magenta) from the hair image and all other regions from the face image
- (d) alignment in W+ space
- (e) a close-up view of the face (top) and hair (bottom) in W+ space
- (f) close-up views after details are transferred
- (g) an entire image with details transferred
- (h) the structure tensor F is transferred into the blended image, but the appearance code is taken from the face image
- (i) the appearance code is optimized
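To make the mask-composition step concrete, here is a minimal sketch of how a target segmentation mask like the one in (c) could be assembled from two parsed reference images. It assumes the masks are integer label maps produced by an off-the-shelf face parser; the HAIR_LABEL constant and the file names are illustrative placeholders and not part of the official implementation.

```python
import numpy as np
from PIL import Image

# Placeholder label id for the hair class; real face parsers (e.g. BiSeNet)
# use their own label conventions, so adjust accordingly.
HAIR_LABEL = 13

# Integer label maps for the face reference and the hair reference,
# assumed to be saved as single-channel PNGs by a face-parsing network.
face_mask = np.array(Image.open("face_reference_mask.png"))
hair_mask = np.array(Image.open("hair_reference_mask.png"))

# Start from the face image's segmentation, then overwrite the hair region
# with the hair reference's segmentation, as in panel (c) above.
target_mask = face_mask.copy()
target_mask[face_mask == HAIR_LABEL] = 0           # clear the old hair region (0 = background)
target_mask[hair_mask == HAIR_LABEL] = HAIR_LABEL  # paste the new hair region

Image.fromarray(target_mask.astype(np.uint8)).save("target_mask.png")
```

Note that this naive copy-and-paste leaves disoccluded regions (labelled as background above) unresolved; Barbershop handles them through the aligned embedding step rather than through a separate inpainting network.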
Generator Architecture
The authors' approach to image blending finds a latent code for the blended image, which has the benefit of avoiding many of the traditional artifacts of image blending, particularly at the boundaries of the blended regions.
In particular, the authors build on the StyleGAN2 architecture and extend the II2S embedding algorithm. II2S uses the inputs of the 18 affine style blocks of StyleGAN2 as a single latent code in the so-called W+ space. This allows the input of each block to vary separately, but II2S is biased towards latent codes that have a higher probability according to the StyleGAN2 training set, which has the potential to suppress or reduce the prominence of less-common features in the training data.
In order to increase the capacity of the embedding and capture image details, the authors embed images using a latent code comprised of a structure tensor F, which replaces the output of the style block at a fixed layer m of the StyleGAN2 image synthesis network, and an appearance code S, which is used as input to the remaining style blocks. This proposed extension of conventional GAN embedding, called the FS space, provides more degrees of freedom to capture individual facial details such as moles. However, it also requires a careful design of latent code manipulations, because it is easier to create artifacts.
The relationship between the W+ and FS latent spaces is shown in the following diagram. The first m blocks of the W+ code are replaced by the output of style block m to form the structure tensor F, and the remaining parts of the W+ code are used as the appearance code S.

Source: Figure 2 from the paper
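As a rough illustration of the FS representation, the snippet below shows the shapes involved when splitting a W+ code into a structure tensor F and an appearance code S. The layer index m = 8 and the 32×32×512 shape of F reflect my reading of the paper, and synthesis_up_to is a hypothetical stand-in for running the first m style blocks of a pretrained StyleGAN2; it is not a function from the official repository.

```python
import torch

M = 8  # number of leading style blocks whose output becomes F (assumed value)

# A W+ embedding of a single 1024x1024 image: 18 style vectors of 512 dims each.
w_plus = torch.randn(1, 18, 512)

def synthesis_up_to(w: torch.Tensor, num_blocks: int) -> torch.Tensor:
    """Hypothetical stand-in: run the first `num_blocks` style blocks of a
    pretrained StyleGAN2 synthesis network and return their output activations.
    Here it simply returns a tensor of the expected shape."""
    return torch.randn(w.shape[0], 512, 32, 32)

# Structure tensor: the spatial activations after the first M style blocks.
F = synthesis_up_to(w_plus[:, :M], num_blocks=M)  # shape (1, 512, 32, 32)

# Appearance code: the style vectors fed to the remaining 18 - M blocks.
S = w_plus[:, M:]                                  # shape (1, 10, 512)

C = (F, S)  # an FS-space latent code
print(F.shape, S.shape)
```

Because F is a spatial tensor rather than a stack of per-layer vectors, it can retain localized details such as moles or wisps of hair that a pure W+ code tends to smooth away, which is exactly the trade-off discussed above.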
Main Steps in the Barbershop Pipeline
- Reference images are segmented and a target segmentation is generated automatically, or optionally the target segmentation is manually edited.
- Embed each input reference image I_k to find a latent code C_k = (F_k, S_k) that reconstructs it.
- Find aligned latent codes such that the generated images match the target segmentation M while also remaining similar to the input images I_k.
- A combined structure tensor is formed by copying region k of the aligned structure tensor F_k for each semantic category k.
- Blending weights for the appearance codes are found so that the blended appearance code is a mixture of the appearances of the aligned images. The mixture weights are found using a novel masked appearance loss function (a simplified sketch of these last two steps is shown right after this list).
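The sketch below illustrates the last two steps of the pipeline in a deliberately simplified form: copying each semantic region of the aligned structure tensors into one blended tensor, and mixing the appearance codes with a set of weights. The shapes, the downsampling of the mask, and the uniform weights are illustrative assumptions; in the paper the mixture weights are optimized with a masked appearance loss rather than fixed by hand.

```python
import torch
import torch.nn.functional as nnf

K = 2  # e.g. one face reference and one hair reference

# Aligned FS-space codes for the K reference images (random placeholders here).
F_aligned = [torch.randn(1, 512, 32, 32) for _ in range(K)]
S_aligned = [torch.randn(1, 10, 512) for _ in range(K)]

# Target segmentation: an integer label map saying which reference image
# each pixel should take its style from (a simplification of the real mask).
target_mask = torch.randint(0, K, (1, 1024, 1024))

# Structure blending: copy region k of F_k into the combined structure tensor.
F_blend = torch.zeros_like(F_aligned[0])
for k in range(K):
    region = (target_mask == k).float().unsqueeze(1)             # (1, 1, 1024, 1024)
    region_32 = nnf.interpolate(region, size=(32, 32), mode="area")
    F_blend = F_blend + region_32 * F_aligned[k]

# Appearance blending: a convex mixture of the appearance codes.
weights = torch.softmax(torch.zeros(K), dim=0)  # uniform weights, for illustration only
S_blend = sum(w * s for w, s in zip(weights, S_aligned))

print(F_blend.shape, S_blend.shape)
```

Feeding the blended code (F_blend, S_blend) back through the remaining StyleGAN2 layers produces the final composite image, which is why the transitions between regions come out coherent rather than looking stitched together.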
Results
The authors use a set of 120 high-resolution (1024×1024) images from the dataset used in the II2S paper for their experiments. From these images, 198 pairs were selected for the hairstyle transfer experiments based on the variety of appearances and hair shapes. Images are segmented and the target segmentation masks are generated automatically. The following Weights & Biases table showcases some of the results obtained by Barbershop.
Hair-style Gallery
Reconstruction Results in Different Spaces
The following table shows reconstruction results in different spaces:
- In the top row, we can see that, in the W+ space, the structure of the subject's curly hair on the left of the image is lost, and a wisp of hair on her forehead, as well as her necklace, is removed; these details are preserved in the FS space.
- In the middle row, the hair and brow-furrow details are important to the expression of the subject; they are not preserved in the W+ space, but they are in the FS space.
- In the bottom row, the ground-truth image has freckles; without noise optimization these are not captured in the W+ space, but they are preserved in the FS space.
Reconstruction Results in Different Spaces
Comparison of Barbershop with Previous SoTA Methods
Qualitative Comparison
We can observe from the following table that Barbershop produces smoother transitions between hair and other regions, fewer disocclusion artifacts, and more consistent handling of global aspects such as lighting.
Qualitative Comparison of Barbershop with Previous SoTA Methods
Quantitative Comparison
The authors compare Barbershop with a baseline model that does not use the FS space, as well as with MichiGAN and LOHO, on the metrics shown in the table below:
Quantitative Comparison of Barbershop with SoTA Methods
Limitations of Barbershop
- Even though the capacity of the latent space has been increased in Barbershop, it remains difficult to reconstruct features that are underrepresented in the training data, such as jewelry.
- Issues such as occlusion can produce confusing results. For example, thin wisps of hair which also partially reveal the underlying face are difficult to capture.
- Many details such as the hair structure are difficult to preserve when aligning embeddings, and when the reference and target segmentation masks do not overlap perfectly the method may fall back to a smoother structure.
- While Barbershop is tolerant of some errors in the segmentation mask input, large geometric distortions cannot be compensated for.
Conclusion
- Barbershop enables a user to interact with images by manipulating segmentation masks and copying content from different reference images.
- The authors propose a new latent space that combines the commonly used style code with a structure tensor. The use of the structure tensor makes the latent code more spatially aware and helps preserve more facial details during editing.
- The authors also propose a new GAN-embedding algorithm for aligned embedding. Similar to previous work, the algorithm can embed an image to be similar to an input image. In addition, the image can be slightly modified to conform to a new segmentation mask.
- The authors also propose a novel image compositing algorithm that can blend multiple images encoded in the new latent space to yield a high-quality result.
- The results produced by Barbershop show significant improvements over the current state of the art; in a user study, its results were preferred over those of previous methods more than 95% of the time.
- Finally, we discussed the limitations of Barbershop.