In this report, we will explore the key ideas presented in DeepFaceDrawing: Deep Generation of Face Images from Sketches by Shu-Yu Chen, Wanchao Su, Lin Gao, Shihong Xia, and Hongbo Fu.
Image-to-image translation is a class of computer vision deep learning tasks where the goal is to learn the mapping between an input image and an output image. By analogy to automatic language translation (e.g., English to French), **image-to-image translation is the task of translating one possible representation of a scene into another representation. At a high level, it is predicting pixels from pixels.** Some example tasks are:
Deep Image Inpainting - The input image is corrupted with missing pixel values, and the task is to fill in those patches in the output. Here is an introduction to image inpainting with deep learning.
Image Colorization - We feed in black & white images and want to get them colorized as realistically as possible. Here is a report by Boris Dayma covering DeOldify.
Sketch to Image - We feed in an object's sketch and want to get a realistic image as output. The paper covered in this report is an example of sketch-to-image translation. Let us dive into it.
-> Figure 1: Examples of image-to-image translation tasks. (Source) <-
Recent developments in image-to-image translation allow fast generation of face images from freehand sketches. However, these techniques quickly overfit to the input sketches and therefore require professionally drawn sketches, which limits the number of people who can use applications built on them. These deep-learning-based solutions take an input sketch as a hard constraint and try to infer the missing texture and shading information, so the task is formulated as a reconstruction problem. Further, these models are trained on pairs of realistic images and their corresponding edge maps, which is why test sketches are expected to have a quality similar to edge maps.
The authors of DeepFaceDrawing propose a novel sketch-to-image synthesis framework to tackle overfitting and the need for professionally drawn sketches. The following summarizes the contributions of this work:
Input sketches as a soft constraint instead of a hard constraint to guide image synthesis - the key idea is to implicitly learn a space of plausible face sketches from real face sketch images and find the closest point in this space (using manifold projection) to approximate an input sketch. This enables the proposed method to produce high-quality face images even from rough or incomplete sketches.
Local-to-global approach - learning a space of plausible face sketches globally is not feasible due to limited training data. The authors instead propose to learn feature embeddings of key face components such as the eyes, nose, and mouth, and to push the corresponding components in the input sketch towards the learned component manifolds.
A novel deep neural network that maps the embedded component features to realistic images, using multi-channel feature maps as intermediate results to improve information flow (more in the next section).
Before we go through the proposed deep learning framework, here is a video by TwoMinutePapers. This will help build more intuition about this particular deep learning task.
This section briefly discusses the model framework that enables high-quality sketch-to-image translation. The framework consists of three main modules - CE (Component Embedding), FM (Feature Mapping), and IS (Image Synthesis) - and is trained in two stages. Let us get into the nitty-gritty of this framework, shall we?
-> Figure 2: Overview of DeepFaceDrawing architecture <-
Figure 2 shows that the framework takes a sketch (a one-channel image) and generates a high-quality 512x512 facial image. The framework is thus capable of high-resolution sketch-to-image synthesis.
The first sub-network is the CE module, which is responsible for learning feature embeddings of individual face components using separate autoencoder networks. This module thus turns component sketches into semantically meaningful feature vectors. You can learn more about autoencoders from the Towards Deep Generative Modeling with W&B report.
Since human faces have a clear structure, the face sketch is decomposed into five components - "left-eye", "right-eye", "nose", "mouth", and "remainder" (the sketch after removing the first four components). Notice that "left-eye" and "right-eye" are treated separately to best explore the flexibility of the generated faces. Given five components, five separate autoencoder networks learn a feature embedding for each component.
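To make this concrete, here is a minimal PyTorch sketch of what one such component autoencoder could look like. The layer counts, channel widths, patch size, and latent dimension are illustrative assumptions, not the exact values from the official Jittor implementation.

```python
import torch
import torch.nn as nn

class ComponentAutoencoder(nn.Module):
    """Autoencoder for a single face component sketch (e.g. "left-eye").

    A minimal sketch: strided convolutions encode the single-channel
    component patch into a latent vector, and mirrored transposed
    convolutions reconstruct it.
    """

    def __init__(self, patch_size=128, latent_dim=512):
        super().__init__()
        self.patch_size = patch_size
        # Encoder: 1 x H x W sketch patch -> latent vector
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, 4, stride=2, padding=1), nn.ReLU(),    # H/2
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),   # H/4
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),  # H/8
            nn.Conv2d(128, 256, 4, stride=2, padding=1), nn.ReLU(), # H/16
            nn.Flatten(),
            nn.Linear(256 * (patch_size // 16) ** 2, latent_dim),
        )
        # Decoder: latent vector -> reconstructed 1 x H x W patch
        self.decoder_fc = nn.Linear(latent_dim, 256 * (patch_size // 16) ** 2)
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def encode(self, x):
        return self.encoder(x)

    def decode(self, z):
        s = self.patch_size // 16
        h = self.decoder_fc(z).view(-1, 256, s, s)
        return self.decoder(h)

    def forward(self, x):
        return self.decode(self.encode(x))


# One autoencoder per face component, as in the CE module.
COMPONENTS = ["left_eye", "right_eye", "nose", "mouth", "remainder"]
autoencoders = {name: ComponentAutoencoder() for name in COMPONENTS}
```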
-> Figure 3: Result of the ablation study for the number of feature dimensions in the CE module. (Source) <-
-> Figure 4: The architecture of the Component Embedding Module. (Source) <-
DeepFaceDrawing is trained in two stages. In the first stage, we are interested in learning the component embeddings, so only the CE module is trained: the component sketches are used to train the five individual autoencoders for feature embeddings. The training is self-supervised, with a mean squared error (MSE) loss between the input sketch component and the reconstructed one.
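A hedged sketch of what stage-one training could look like, reusing the `autoencoders` dict from above: each component autoencoder is trained independently to reconstruct its own sketch patches with an MSE loss. The dataloaders, optimizer settings, and epoch count are placeholders, not the paper's hyperparameters.

```python
import torch
import torch.nn.functional as F

def train_ce_stage(autoencoders, component_loaders, epochs=10, lr=1e-4, device="cuda"):
    """Stage 1: train each component autoencoder with an MSE reconstruction loss.

    `component_loaders` is assumed to map a component name to a DataLoader
    yielding single-channel sketch patches for that component.
    """
    for name, model in autoencoders.items():
        model = model.to(device)
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        for epoch in range(epochs):
            for patches in component_loaders[name]:
                patches = patches.to(device)
                recon = model(patches)
                # Self-supervised objective: reconstruct the input patch.
                loss = F.mse_loss(recon, patches)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            print(f"[{name}] epoch {epoch}: loss={loss.item():.4f}")
```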
The second sub-network consists of two modules: Feature Mapping and Image Synthesis. This section will discuss the FM module and how it is trained in the second stage.
FM turns the component feature vectors (bottleneck vectors) learned in the first training stage into corresponding feature maps.
First, the input component sketch is encoded with the frozen CE module's encoder, and the resulting component vector is projected onto its corresponding component manifold.
A vector is then sampled from this manifold and mapped to a multi-channel feature map with 32 channels. Note that the input component sketch is single-channel; however, the authors chose to use multi-channel feature maps in the second stage of training.
This mapping is done using five separate decoding models. Each decoding model consists of a fully connected layer and five decoding layers. Each feature map has 32 channels and is of the same spatial size as the corresponding component in the sketch domain.
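Here is a minimal sketch of this projection-plus-mapping step, under some assumptions: the manifold projection is approximated by interpolating among the K nearest component feature vectors from the training sketches (the inverse-distance weights below are an illustrative stand-in for the interpolation weights the paper computes), and the hypothetical `FeatureMappingDecoder` is a fully connected layer followed by decoding (upsampling) layers that end in a 32-channel map.

```python
import torch
import torch.nn as nn

def project_to_manifold(z, bank, k=10):
    """Approximate manifold projection of a component feature vector `z`.

    `bank` is an (N, D) tensor of component feature vectors computed from the
    training sketches. We find the K nearest neighbors of `z` and return a
    weighted combination of them (illustrative weighting, not the paper's exact scheme).
    """
    dists = torch.cdist(z.unsqueeze(0), bank).squeeze(0)   # (N,) distances to the bank
    knn_dists, idx = dists.topk(k, largest=False)          # K nearest neighbors
    weights = 1.0 / (knn_dists + 1e-8)
    weights = weights / weights.sum()
    return (weights.unsqueeze(1) * bank[idx]).sum(dim=0)   # (D,) projected vector


class FeatureMappingDecoder(nn.Module):
    """FM decoder for one component: latent vector -> 32-channel feature map.

    A fully connected layer reshapes the vector into a small spatial grid,
    then five decoding layers grow it to the component's spatial size.
    Channel widths are assumptions for illustration.
    """

    def __init__(self, latent_dim=512, out_size=128):
        super().__init__()
        self.start = out_size // 32  # five 2x upsamplings reach out_size
        self.fc = nn.Linear(latent_dim, 256 * self.start * self.start)
        self.decode = nn.Sequential(
            nn.ConvTranspose2d(256, 256, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 32, 4, stride=2, padding=1),
        )

    def forward(self, z):
        h = self.fc(z).view(-1, 256, self.start, self.start)
        return self.decode(h)  # (B, 32, out_size, out_size)
```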
-> Figure 5: The architecture of the Feature Mapping Module. (Source) <-
In the second stage, we freeze the trained CE encoders and train the rest of the network end to end. The full network consists of the non-trainable CE encoders, the FM module, and the IS module. We will get into IS next.
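In code, freezing the trained CE encoders before stage two amounts to turning off their gradients, for example (using the `autoencoders` dict sketched earlier):

```python
# Stage 2: freeze the trained CE encoders; only FM and IS receive gradients.
for model in autoencoders.values():
    for param in model.encoder.parameters():
        param.requires_grad = False
    model.encoder.eval()
```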
Given the combined feature maps, the IS module converts them into a realistic face image. This module is a conditional GAN in which the feature maps are fed to the generator and a discriminator guides the generation. Check out this blog post on Conditional GAN.
Figures 6 and 7 summarize the architectural design of the generator and the discriminator.
-> Figure 6: The architecture of the Generator of Image Synthesis Module. (Source) <-
-> Figure 7: The architecture of Discriminator of Image Synthesis Module. (Source) <-
The GAN in the IS module is trained with a GAN (adversarial) loss. An L1 loss is also used to ensure the pixel-wise quality of the generated images, and a perceptual loss compares high-level differences between real and generated images.
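The following is a hedged sketch of how these three terms could be combined for the generator update. The loss weights, the binary cross-entropy adversarial formulation, and the VGG19-based perceptual features are common choices used here for illustration, not the paper's exact setup.

```python
import torch
import torch.nn.functional as F
import torchvision

# Pretrained VGG19 features for the perceptual loss (a common choice; the paper
# may use a different network or layer set).
vgg_features = torchvision.models.vgg19(weights="IMAGENET1K_V1").features[:16].eval()
for p in vgg_features.parameters():
    p.requires_grad = False

def generator_loss(discriminator, fake, real, lambda_l1=100.0, lambda_perc=10.0):
    """GAN + L1 + perceptual loss for the IS generator (illustrative weights)."""
    # Adversarial term: the generator wants the discriminator to score fakes as real.
    pred_fake = discriminator(fake)
    adv = F.binary_cross_entropy_with_logits(pred_fake, torch.ones_like(pred_fake))
    # Pixel-wise term: keep generated pixels close to the ground-truth photo.
    l1 = F.l1_loss(fake, real)
    # Perceptual term: compare high-level VGG features of generated and real images.
    perc = F.l1_loss(vgg_features(fake), vgg_features(real))
    return adv + lambda_l1 * l1 + lambda_perc * perc
```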
Both qualitative and quantitative evaluations show the superior generation ability of this system compared to existing and alternative solutions.
-> Figure 8: Examples of sketch-to-image synthesis using DeepFaceDrawing. (Source) <-
The authors have not open-sourced the training script. The code provided in their GitHub repo is exposed as a service where you can draw a sketch and the model will produce a realistic image from it. You can also play around with different parameters.
The code uses the Jittor deep learning framework, and a PyTorch version will be released soon.
-> Figure 9: Interactive drawing tool. (Source) <-
I hope this report comes in handy for understanding this framework. If you find the report useful, I would love to hear from you. Also, please feel free to let me know if you have any improvement pointers to share. As a parting note, here are two other reports that I wrote for Two Minute Papers.