An Overview of Instance Aware Image Colorization
This article explores an interesting learning-based image colorization technique that produces stunning colored images.
Created on October 30|Last edited on November 21
Image colorization is an ill-posed problem, i.e., there are multiple plausible ways to colorize an object: a black-and-white car could be red, blue, or gray. This report explores an interesting deep learning framework for instance-aware colorization.
Paper | Code | Google Colab
Upload black and white image(s) and download the colored images in the linked Colab notebook.
Table of Contents
- An Introduction to Image Colorization
- Overview of the Proposed Method
- Results
- Training Procedure
- Ending Note
An Introduction to Image Colorization
Image colorization is a fascinating deep learning task in which the missing color channels of a single-channel grayscale image are predicted automatically. There exist many plausible ways to color a grayscale image, which makes this a challenging problem. It is also a prevalent pretext task for image representation learning. Learn more about it in the Unsupervised Visual Representation Learning with SwAV report.
Some of the existing image colorization techniques include:
- Scribble-based colorization: The colorization relies on high-level user scribbles, such as color strokes, to guide the process. These hints reduce the space of plausible colorizations and help generate convincing results, but at the cost of intensive manual labor to provide careful guidance. Imagine applying this technique, frame by frame, to an old classic black-and-white movie.
- Example-based colorization: To reduce human effort, this technique colorizes images by transferring color statistics from a reference image. The quality of the result depends on the semantic similarity between the reference image and the grayscale input. This process is still not entirely automatic, since a suitable reference must be chosen.
- Learning-based colorization: Deep convolutional neural network-based frameworks have automated the colorization process in a true sense. Most existing architectures address two key requirements for convincing colorization - semantics and multi-modality. That is, the framework should keep the semantics of the image intact while being able to produce a wide variety of colors for the same object.
- Instance-aware colorization: Learning-based colorization has achieved impressive performance but suffers from visual artifacts when processing images with multiple foreground objects or objects in a cluttered background. This brings us to this paper, whose key insight is that a clear object-background separation dramatically improves colorization performance.
Figure 1: A clear separation of the object (orange) and background leads to a more convincing colorization. (Source)
Try out the colab notebook to colorize your grayscale images.
Overview of the Proposed Method
Why?
The proposed instance-aware colorization method is different from existing learning-based methods and is more effective due to the following reasons:
- Instead of learning to colorize the entire image, it learns to colorize individual instances (objects) in an image. This is a substantially easier task since the network is not confused by the background.
- Instance-based colorization allows the instance network to learn object-level representations, enabling a better color distribution for each object.
Instance-level colorization must be tied to full-image colorization. Let us look at the architectural design of this framework.
How?

The network architecture consists of three components:
- Off-the-shelf pre-trained model to detect object instances and produce cropped object images.
- Two backbone networks, trained end to end, for instance and full-image colorization, respectively. They are identical networks with different weights and can be any existing colorization architecture, such as DeOldify.
- A fusion module to selectively blend features extracted from layers of the two colorization networks.
Object Detection

Figure 3: Object detection pipeline of the InstColorization framework.
The framework takes in a grayscale image as input and predicts the two remaining color channels in the CIE Lab color space. This color space describes all the colors visible to the human eye and was created to serve as a device-independent reference model. Thus, the InstColorization framework is device-agnostic.
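The Lab decomposition is straightforward to reproduce. Below is a compact numpy sketch of the standard sRGB-to-Lab conversion (D65 white point); in practice a library routine such as skimage.color.rgb2lab does the same job:

```python
import numpy as np

def rgb_to_lab(rgb):
    """Convert an sRGB image with values in [0, 1] to CIE Lab (D65).

    The L channel is the grayscale input to the colorization network;
    the a/b channels are what the network learns to predict.
    """
    # sRGB -> linear RGB (undo gamma)
    lin = np.where(rgb > 0.04045, ((rgb + 0.055) / 1.055) ** 2.4, rgb / 12.92)
    # linear RGB -> XYZ
    M = np.array([[0.4124564, 0.3575761, 0.1804375],
                  [0.2126729, 0.7151522, 0.0721750],
                  [0.0193339, 0.1191920, 0.9503041]])
    xyz = lin @ M.T
    # Normalize by the D65 white point, then apply the Lab nonlinearity.
    xyz = xyz / np.array([0.95047, 1.0, 1.08883])
    f = np.where(xyz > 0.008856, np.cbrt(xyz), 7.787 * xyz + 16 / 116)
    L = 116 * f[..., 1] - 16
    a = 500 * (f[..., 0] - f[..., 1])
    b = 200 * (f[..., 1] - f[..., 2])
    return np.stack([L, a, b], axis=-1)
```

For a pure white pixel this yields L close to 100 with a and b close to 0, confirming that the chromatic channels carry no information for achromatic inputs.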
As shown in figure 3, an off-the-shelf pre-trained Mask R-CNN object detector is used to detect the instances in the image. The instances are cropped using the obtained bounding box coordinates and then resized to 256 x 256.
For training, a colored image dataset is used. The images are converted to the CIE Lab color space, and only the lightness channel (L) is kept as network input; the two color channels are discarded as input and used as the ground truth. The object detector is run on this single-channel image, and the resulting bounding box coordinates are used to crop instances from both the single-channel image and its colored counterpart.
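The crop-and-resize step can be sketched as follows. The boxes here are hypothetical (x1, y1, x2, y2) pixel coordinates from any off-the-shelf detector such as Mask R-CNN, and the nearest-neighbour resize is a stand-in for whatever resampling the actual implementation uses:

```python
import numpy as np

def crop_and_resize(channel, boxes, size=256):
    """Crop each detected instance from a single-channel image and
    resize it to size x size with nearest-neighbour sampling.

    `boxes` holds (x1, y1, x2, y2) pixel coordinates, e.g. from a
    pre-trained Mask R-CNN run on the L channel.
    """
    crops = []
    for x1, y1, x2, y2 in boxes:
        patch = channel[y1:y2, x1:x2]
        h, w = patch.shape
        # Index maps for nearest-neighbour resampling to (size, size).
        ys = (np.arange(size) * h / size).astype(int)
        xs = (np.arange(size) * w / size).astype(int)
        crops.append(patch[np.ix_(ys, xs)])
    return np.stack(crops)
```

The same boxes are applied to the colored image to produce matching ground-truth crops for the instance network.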
Image Colorization Backbone

Figure 4: Network architecture contains two branches of colorization networks, one for colorizing the instance images and the other for colorizing the full image.
The cropped instance images and the full grayscale input image are fed to the instance colorization network and the full-image colorization network, respectively. Both networks share the same architecture but have different weights.
The two networks are kept identical so that they have the same number of layers, which facilitates feature fusion. The authors have used the main colorization network introduced in Real-Time User-Guided Image Colorization with Learned Deep Priors.
For training, the full-colorization network is trained first. The trained weights are transferred to initialize the weights of the instance-colorization network.
Fusion Module

To produce accurate and coherent colorization, the authors propose a fusion module that blends the intermediate feature maps from the instance and full-image networks. One could naively overlay the feature maps to blend them, but that leads to visible artifacts due to inconsistencies at overlapping pixels.
The fusion takes place at multiple layers of the colorization network. For simplicity, let us discuss the module for a single layer; the same operation is applied at every fused layer. As a reminder, the feature maps from both networks at a given layer have the same shape.
Figure 5 summarizes the fusion module.
- The module takes two inputs: the feature map from the full-image network at that layer, and a set of per-instance feature maps with their corresponding bounding boxes.
- For both kinds of features, a small neural network with three convolutional layers is learned to predict a full-image weight map and per-instance weight maps. These weight maps are used to fuse the features.
- To fuse the instance features into the full-image feature, the input bounding boxes are used, since they determine the size and location of each instance. The per-instance feature maps and weight maps are resized (and padded) to match the size of the full image.
- The weight maps are stacked, a softmax is applied at each pixel, and the fused feature is obtained as a weighted sum.
- Concretely, the fused feature is the full-image feature weighted by its softmaxed weight map, plus the sum over the N instances of each resized instance feature weighted by its softmaxed weight map. Here, N is the number of instances per image.
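The fusion steps above can be sketched in PyTorch. This is a simplified illustration of the fusion idea, not the paper's implementation: the small convolutional networks that predict the weight maps are omitted, and the weight maps are taken as given inputs instead.

```python
import torch
import torch.nn.functional as F

def fuse(full_feat, full_w, inst_feats, inst_ws, boxes):
    """Blend instance features into a full-image feature map.

    full_feat: (C, H, W) full-image features; full_w: (1, H, W) weight logits.
    inst_feats[i]: (C, h, w); inst_ws[i]: (1, h, w); boxes[i] = (x1, y1, x2, y2).
    """
    feats, weights = [full_feat], [full_w]
    for f, w, (x1, y1, x2, y2) in zip(inst_feats, inst_ws, boxes):
        fh, fw = y2 - y1, x2 - x1
        # Resize the instance feature/weight map to its bounding box...
        f_r = F.interpolate(f[None], size=(fh, fw), mode="bilinear",
                            align_corners=False)[0]
        w_r = F.interpolate(w[None], size=(fh, fw), mode="bilinear",
                            align_corners=False)[0]
        # ...then pad to full-image resolution. Outside the box the weight
        # is -inf so the instance contributes nothing after the softmax.
        f_pad = torch.zeros_like(full_feat)
        f_pad[:, y1:y2, x1:x2] = f_r
        w_pad = torch.full_like(full_w, float("-inf"))
        w_pad[:, y1:y2, x1:x2] = w_r
        feats.append(f_pad)
        weights.append(w_pad)
    # Per-pixel softmax over the (N + 1) stacked weight maps,
    # then a weighted sum of the corresponding features.
    w_stack = torch.softmax(torch.stack(weights), dim=0)  # (N+1, 1, H, W)
    f_stack = torch.stack(feats)                          # (N+1, C, H, W)
    return (w_stack * f_stack).sum(dim=0)
```

Outside every bounding box the softmax assigns all weight to the full-image branch, so the background colorization passes through unchanged.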
Results
Before we get into the training details, let's look at some of the results.
Training Procedure
Training Dataset
The authors have used ImageNet and COCO-Stuff datasets to train and evaluate the model. They have additionally used the Places205 dataset to evaluate the model on out-of-distribution data samples.
Evaluation Metrics
The authors have used PSNR and SSIM to quantify colorization quality, along with the perceptual metric LPIPS.
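As an illustration, PSNR is simple enough to compute by hand; SSIM and LPIPS are best taken from library implementations (e.g. scikit-image and the lpips package):

```python
import numpy as np

def psnr(pred, target, max_val=255.0):
    """Peak signal-to-noise ratio between two images, in dB.

    Higher is better; identical images give infinite PSNR.
    """
    mse = np.mean((pred.astype(np.float64) - target.astype(np.float64)) ** 2)
    return 10 * np.log10(max_val ** 2 / mse)
```

For two images differing by exactly one gray level everywhere, this gives 20 * log10(255), roughly 48.1 dB.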
Training Details
The whole network is trained sequentially in a three-step training process:
- First, the full-colorization network, a backbone network from an existing image colorization architecture like that of DeOldify, is trained. The authors have used the network proposed in Real-Time User-Guided Image Colorization with Learned Deep Priors. They have used the pre-trained weights to initialize the training process. They have trained the model for two epochs with a learning rate of 1e-5.
- They then train the instance-aware network, initialized with the trained weights of the full-colorization network from the first step. This is more of a fine-tuning step, run on the extracted instances instead of the entire image. They have trained the model for five epochs with a learning rate of 5e-5.
- Finally, the fusion module is trained. The weights in both full and instance colorization networks are frozen.
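The three-step schedule can be sketched in PyTorch with toy stand-in modules. The single-conv modules below are placeholders, not the actual backbones, and the step-3 learning rate is left at the optimizer default since it is not stated here:

```python
import torch
import torch.nn as nn
import torch.optim as optim

# Tiny stand-ins for the three trainable components (sketch only).
full_net = nn.Conv2d(1, 2, 3, padding=1)      # full-image colorization
instance_net = nn.Conv2d(1, 2, 3, padding=1)  # instance colorization
fusion_module = nn.Conv2d(2, 2, 1)            # fusion layers

# Step 1: train the full-image network (two epochs, lr 1e-5 in the paper).
opt_full = optim.Adam(full_net.parameters(), lr=1e-5)

# Step 2: initialize the instance network from the trained full-image
# weights, then fine-tune on cropped instances (five epochs, lr 5e-5).
instance_net.load_state_dict(full_net.state_dict())
opt_inst = optim.Adam(instance_net.parameters(), lr=5e-5)

# Step 3: freeze both colorization branches; only the fusion module trains.
for p in list(full_net.parameters()) + list(instance_net.parameters()):
    p.requires_grad = False
opt_fuse = optim.Adam(fusion_module.parameters())
```

Freezing the two backbones in step 3 ensures the fusion module learns to blend already-trained features rather than shifting them.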
Ending Note
The field of image colorization is exciting and challenging. We have seen much progress in recent times, and it will get better with new papers.
This work by the authors is promising. The insight that a clear figure-ground separation can dramatically improve colorization performance can be seen in the results. This work leverages an off-the-shelf object detection model. Thus better detectors will improve colorizations.
In my opinion, the novelty of instance-aware colorization lies in the fact that it can use any existing learning-based colorization architecture as the backbone for both the full-image and instance colorization networks.
I hope you found this summary insightful and that it encourages you to read the paper. Leave your thoughts in the comments below; if you have any questions, I would love to address them.