In this report, we will explore the key ideas presented in DeepFaceDrawing: Deep Generation of Face Images from Sketches by Shu-Yu Chen, Wanchao Su, Lin Gao, Shihong Xia, and Hongbo Fu.

Paper | Code $\rightarrow$



Image-to-image translation is a class of deep learning tasks in computer vision where the goal is to learn the mapping between an input image and an output image. In an analogy to automatic language translation (e.g., English to French), **image-to-image translation is the task of translating one possible representation of a scene into another. At a high level, it is predicting pixels from pixels.** Some example tasks are shown below:


-> Figure 1: Examples of image-to-image translation tasks. (Source) <-

Recent developments in image-to-image translation allow fast generation of face images from freehand sketches. However, these techniques tend to overfit to the input sketches. As a result, they require professionally drawn sketches, which limits the number of people who can use applications built on them. These deep learning based solutions take an input sketch as a hard constraint and try to infer the missing texture and shading information, so the task is formulated as a reconstruction problem. Further, these models are trained on pairs of realistic images and their corresponding edge maps, which is why sketches used at test time are required to have a quality similar to that of edge maps.

The authors of DeepFaceDrawing propose a novel sketch-to-image synthesis framework that tackles both the overfitting problem and the need for professionally drawn sketches. The following summarizes the contributions of this work:

Before we go through the proposed deep learning framework, here is a video by TwoMinutePapers. This will help build more intuition about this particular deep learning task.

Overview of the Method Proposed

This section briefly discusses the model framework, which enables high-quality sketch-to-image translation. The framework consists of three main modules: Component Embedding (CE), Feature Mapping (FM), and Image Synthesis (IS). It is trained in two stages. Let us get into the nitty-gritty of this framework, shall we?


-> Figure 2: Overview of DeepFaceDrawing architecture <-

Figure 2 shows that the framework takes a sketch (a one-channel image) and generates a high-quality facial image at a resolution of 512×512. The framework is thus capable of high-resolution sketch-to-image synthesis.

Component Embedding (CE) Module and Stage $I$ Training

Architectural Design

The first sub-network is the CE module responsible for learning feature embeddings of individual face components using separate autoencoder networks. Thus this module turns component sketches into semantically meaningful feature vectors. You can learn more about autoencoders from the Towards Deep Generative Modeling with W&B report.

Since human faces have a clear structure, the face sketch is decomposed into five components: "left-eye", "right-eye", "nose", "mouth", and "remainder" (the sketch after removing the first four components). Notice that "left-eye" and "right-eye" are treated separately, which allows more flexibility in the generated faces. Given five components, we have five separate autoencoder networks, each learning a feature embedding for one component.
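The decomposition step can be sketched with simple array slicing. This is a minimal illustration, not the authors' code: the window coordinates and sizes in `COMPONENT_WINDOWS`, the function name `decompose_sketch`, and the choice to blank removed regions with zeros are all assumptions made for this example.

```python
import numpy as np

# Hypothetical (x, y, size) crop windows on a 512x512 sketch.
# These values are illustrative only and are not taken from the paper.
COMPONENT_WINDOWS = {
    "left-eye": (100, 150, 128),
    "right-eye": (280, 150, 128),
    "nose": (180, 230, 160),
    "mouth": (160, 300, 192),
}

def decompose_sketch(sketch):
    """Split a single-channel face sketch into five component sketches."""
    components = {}
    remainder = sketch.copy()
    for name, (x, y, size) in COMPONENT_WINDOWS.items():
        # Crop the component from the original sketch.
        components[name] = sketch[y:y + size, x:x + size].copy()
        # Blank the cropped region in the remainder (assumed fill value: 0).
        remainder[y:y + size, x:x + size] = 0
    components["remainder"] = remainder
    return components

# A random array stands in for a real one-channel sketch.
sketch = np.random.default_rng(0).random((512, 512)).astype(np.float32)
parts = decompose_sketch(sketch)
```

Each entry of `parts` would then be fed to its own autoencoder.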


-> Figure 3: Result of the ablation study for the number of feature dimensions in the CE module. (Source) <-


-> Figure 4: The architecture of the Component Embedding Module. (Source) <-

Stage $I$ Training

DeepFaceDrawing is trained in two stages. The first stage learns the component embeddings: only the CE module is trained, using component sketches to fit five individual autoencoders. The training is self-supervised, using the mean squared error (MSE) loss between the input component sketch and its reconstruction.
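As a small illustration of the reconstruction objective, the MSE between a component sketch and an autoencoder's output can be computed as follows. The function name `mse_loss` and the toy 2×2 arrays are assumptions for this example; a real training loop would compute this loss inside a deep learning framework.

```python
import numpy as np

def mse_loss(sketch, reconstruction):
    """Mean squared error between an input sketch and its reconstruction."""
    sketch = np.asarray(sketch, dtype=np.float64)
    reconstruction = np.asarray(reconstruction, dtype=np.float64)
    return float(np.mean((sketch - reconstruction) ** 2))

# Toy 2x2 "component sketch" and a hypothetical autoencoder output.
component = np.array([[0.0, 1.0], [1.0, 0.0]])
reconstructed = np.array([[0.0, 0.5], [1.0, 0.5]])

# Squared errors are 0, 0.25, 0, 0.25, so the mean is 0.125.
loss = mse_loss(component, reconstructed)
```

Because the target is the input itself, no labels are needed, which is what makes this stage self-supervised.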

Feature Mapping (FM) Module - Stage $II$ Training

Architectural design

The second sub-network consists of two modules: Feature Mapping and Image Synthesis. This section will discuss the FM module and how it is trained in the second stage.

The FM module turns the component feature vectors (the bottleneck vectors learned in the first stage of training) into corresponding feature maps.


-> Figure 5: The architecture of the Feature Mapping Module. (Source) <-
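To build intuition for what "vector to feature map" means, here is a toy sketch. It assumes a single linear projection as a stand-in for the learned FM decoder and assumes that component maps are merged by overwriting a window of the "remainder" map. All sizes, positions, and function names are hypothetical and much smaller than anything used in practice.

```python
import numpy as np

def vector_to_feature_map(vector, weights, channels, size):
    """Stand-in for a learned FM decoder: linear projection, then reshape."""
    return (weights @ vector).reshape(channels, size, size)

def merge_feature_maps(canvas, component_map, x, y):
    """Overwrite a window of the remainder feature map with a component map."""
    merged = canvas.copy()
    _, height, width = component_map.shape
    merged[:, y:y + height, x:x + width] = component_map
    return merged

rng = np.random.default_rng(0)
LATENT_DIM, CHANNELS, SIZE = 8, 4, 4  # toy sizes, chosen for readability

# Random weights play the role of trained decoder parameters.
weights = rng.standard_normal((CHANNELS * SIZE * SIZE, LATENT_DIM))
eye_vector = rng.standard_normal(LATENT_DIM)  # bottleneck vector from CE

eye_map = vector_to_feature_map(eye_vector, weights, CHANNELS, SIZE)
remainder_map = np.zeros((CHANNELS, 16, 16))
combined = merge_feature_maps(remainder_map, eye_map, x=3, y=5)
```

The `combined` array plays the role of the combined feature maps that are passed on to the next module.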

Stage $II$ FM training

We first freeze the trained CE encoder and then train the rest of the network end to end. The entire network consists of the non-trainable CE encoder, the FM module, and the IS module. We will get into the IS module next.
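The effect of freezing can be shown with a toy update step in which only some parameter groups receive gradient updates. The module names, the `sgd_step` helper, and the constant gradients below are assumptions for illustration; in a real framework the same effect is achieved by disabling gradients for the encoder's parameters.

```python
import numpy as np

def sgd_step(params, grads, trainable, lr=0.1):
    """Apply one SGD update, skipping any module not listed in `trainable`."""
    return {
        name: (weights - lr * grads[name]) if name in trainable else weights
        for name, weights in params.items()
    }

# Toy parameters for the three parts of the stage-II network.
params = {
    "ce_encoder": np.ones(3),    # trained in stage I, now frozen
    "fm": np.ones(3),            # trained in stage II
    "is_generator": np.ones(3),  # trained in stage II
}
grads = {name: np.full(3, 0.5) for name in params}

updated = sgd_step(params, grads, trainable={"fm", "is_generator"})
```

After the step, `updated["ce_encoder"]` is unchanged while the other two groups have moved.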

Image Synthesis (IS) Module

Architectural design

Given the combined feature maps, the IS module converts them to a realistic face image. This module is a conditional GAN architecture that takes in the feature maps as input to the generator, and a discriminator guides the generation. Check out this blog post on Conditional GAN.

Figures 6 and 7 summarize the architectural design of the generator and the discriminator.


-> Figure 6: The architecture of the Generator of Image Synthesis Module. (Source) <-


-> Figure 7: The architecture of Discriminator of Image Synthesis Module. (Source) <-

Stage $II$ IS Training

The GAN in the IS module is trained with a GAN loss. An L1 loss is also used to ensure the pixel-wise quality of the generated images, and a perceptual loss is used to compare the high-level differences between real and generated images.
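The three terms can be written as a single objective. This is only a sketch of the general form: the weights $\lambda_{1}$ and $\lambda_{p}$, the feature extractor $\phi_k$, the choice of norm in the last term, and the symbols $G$, $f$, and $I$ are notation assumed here, not values or names taken from the paper.

$$
\mathcal{L}_{IS} = \mathcal{L}_{GAN} + \lambda_{1} \, \lVert G(f) - I \rVert_{1} + \lambda_{p} \sum_{k} \lVert \phi_k(G(f)) - \phi_k(I) \rVert_{1}
$$

Here $G(f)$ is the image generated from the combined feature maps $f$, $I$ is the corresponding real image, and $\phi_k$ denotes the activations of the $k$-th layer of a fixed, pretrained network.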


Both qualitative and quantitative evaluations show that this system generates better images than existing and alternative solutions.


-> Figure 8: Examples of sketch-to-image synthesis using DeepFaceDrawing. (Source) <-

Additional Resources

The authors of this paper have not open-sourced the training script. The code provided in their GitHub repo is exposed as a service where you can draw a sketch and the model will generate a realistic image from it. You can also play around with different parameters.

The code uses the Jittor deep learning framework, and the authors plan to release a PyTorch version soon.

Check out this interactive platform $\rightarrow$


-> Figure 9: Interactive drawing tool. (Source) <-


I hope this report comes in handy in understanding this framework. If you find the report useful, I would love to hear from you. Also, please feel free to let me know if you have any suggestions for improvement. As a parting note, here are two other reports that I wrote for Two Minute Papers.