Introduction

What is image inpainting?

Image inpainting is the art of reconstructing damaged or missing parts of an image, and the idea extends naturally to videos. A plethora of use cases have been made possible by image inpainting.

(Image inpainting results gathered from NVIDIA’s web playground)

Imagine having a favorite old photograph of you and your grandparents from when you were a child, only to find that some portions of it have been corrupted. That is the last thing you would want, given how special the photograph is to you. Image inpainting can be a lifesaver here.

Image inpainting can be immensely useful for museums that might not have the budget to hire a skilled artist to restore deteriorated paintings.

Now, think about your favorite photo editor. Having an image inpainting function in there would be kind of cool, wouldn't it?

Image inpainting can also be extended to videos (videos are a series of image frames, after all). Due to over-compression, certain parts of a video can sometimes get corrupted. Modern image inpainting techniques are capable of handling this gracefully as well.

Producing images where the missing parts have been filled in a visually and semantically plausible way is the main objective of an artificial image inpainter. It is safe to say that this is a challenging task.

Now that we have some sense of what image inpainting means (we will go through a more formal definition later) and some of its use cases, let's switch gears and discuss some common techniques used to inpaint images (spoiler alert: classical computer vision).

Doing image inpainting: The traditional way

There is an entire world of computer vision without deep learning. Before Single Shot Detectors (SSDs) came into existence, object detection was still possible (although the precision was nowhere near what SSDs are capable of). Similarly, there are a handful of classical computer vision techniques for image inpainting. In this section, we are going to discuss two of them. First, let's introduce ourselves to the central themes these techniques are based on: texture synthesis and patch synthesis.

To inpaint a particular missing region in an image, these techniques borrow pixels from surrounding regions of the same image that are not missing. It's worth noting that they are good at inpainting backgrounds but fail to generalize to cases where the filled-in region needs to be semantically consistent with the rest of the image, or where the missing region covers objects rather than background texture.

For the latter case, traditional systems can still produce good results when the objects have a repetitive structure. But when those objects are non-repetitive in structure, it again becomes difficult for the inpainting system to infer the missing content.

If we think about it, at a very granular level, image inpainting is nothing but the restoration of missing pixel values. So, we might ask ourselves - why can't we just treat it as another missing value imputation problem? Well, images are not just any random collection of pixel values, they are a spatial collection of pixel values. So, treating the task of image inpainting as a mere missing value imputation problem is misguided. We will answer the following question in a moment - why not simply use a CNN for predicting the missing pixels?

Now, coming to the two techniques -

(Figure: a summary of the two classical inpainting techniques)

To have a taste of the results that these two methods can produce, refer to this article. Now that we have familiarized ourselves with the traditional ways of doing image inpainting, let's see how to do it the modern way, i.e. with deep learning.
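If you want to get a hands-on feel for the classical approach, OpenCV ships two such inpainting algorithms behind a single function, cv2.inpaint. Below is a minimal sketch; the file names are placeholders, and the mask is a single-channel image whose non-zero pixels mark the region to be filled.

import cv2

# "damaged.png" and "mask.png" are placeholder file names
image = cv2.imread("damaged.png")
mask = cv2.imread("mask.png", cv2.IMREAD_GRAYSCALE)

# Fast Marching Method (Telea)
telea_result = cv2.inpaint(image, mask, 3, cv2.INPAINT_TELEA)

# Navier-Stokes based inpainting
ns_result = cv2.inpaint(image, mask, 3, cv2.INPAINT_NS)

cv2.imwrite("telea_result.png", telea_result)
cv2.imwrite("ns_result.png", ns_result)

The third argument is the inpainting radius, i.e. the neighborhood around each hole pixel that the algorithm considers.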

Doing image inpainting: The modern way

In this approach, we train a neural network to predict missing parts of an image such that the predictions are both visually and semantically consistent. Let's take a step back and think about how we (humans) would do image inpainting. This will help us formulate the basis of a deep learning-based approach and also help us form the problem statement for the task of image inpainting.

When trying to reconstruct a missing part in an image, we make use of our understanding of the world and incorporate the context that is needed to do the task. This is one example where we elegantly marry a certain context with a global understanding. So, could we instill this in a deep learning model? We will see.

We humans rely on the knowledge base (an understanding of the world) that we have acquired over time. Current deep learning approaches are far from harnessing such a knowledge base in any sense. But we sure can capture spatial context in an image using deep learning. A convolutional neural network, or CNN, is a specialized neural network for processing data that has a known grid-like topology – for example, an image can be thought of as a 2D grid of pixels. Ours will be a learning-based approach in which we train a deep CNN-based architecture to predict missing pixels.

A simple image inpainting model with the CIFAR-10 dataset

ML/DL concepts are best understood by actually implementing them. In this section, we will walk you through the implementation of deep image inpainting while discussing its key components. We first require a dataset and, most importantly, need to prepare it to suit the task. A spoiler before we discuss the architecture: this DL task is set up as self-supervised learning.

Why choose a simple dataset?

Since inpainting is a process of reconstructing lost or deteriorated parts of images, we can take any image dataset and add artificial deterioration to it. For this specific DL task, we have a plethora of datasets to work with. Having said that, most real-life applications of image inpainting work on high-resolution images (e.g. 512 × 512 pixels). But according to [this paper](http://openaccess.thecvf.com/content_cvpr_2018/papers/Yu_Generative_Image_Inpainting_CVPR_2018_paper.pdf), to allow a pixel to be influenced by content 64 pixels away requires at least 6 layers of 3×3 convolutions with dilation factor 2.

Thus, using such high-resolution images does not fit the purpose here. It's a general practice to apply ML/DL concepts to toy datasets first. To keep the computational requirements low and the implementation quick, we will use the CIFAR-10 dataset.
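As a quick sketch, CIFAR-10 can be loaded with the standard Keras dataset helper; the class labels are discarded because, in our self-supervised setup, the target is the image itself.

import numpy as np
from tensorflow.keras.datasets import cifar10

# Load CIFAR-10; the class labels are not needed for inpainting
(x_train, _), (x_test, _) = cifar10.load_data()
print(x_train.shape, x_test.shape)  # (50000, 32, 32, 3) (10000, 32, 32, 3)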

Data Preparation

Certainly, the entry step to any DL task is data preparation. In our case, as mentioned, we need to add artificial deterioration to our images. This can be done using the standard image-processing idea of masking an image. Since we are in a self-supervised learning setting, we need (X, y) pairs to train our model. Here X will be batches of masked images, while y will be the corresponding original (ground truth) images.

(Figure: masked images (X) alongside their original, unmasked counterparts (y))

To simplify masking, we first assumed that the missing section is a square hole. To prevent the model from overfitting to such an artifact, we randomized both the position of the square and its dimensions.

Using these square holes significantly limits the utility of the model in real applications, because real-world deterioration is rarely just a square blob. Thus, inspired by this paper, we implemented irregular holes as masks: we simply drew lines of random length and thickness using OpenCV.

We will implement a Keras data generator to do this. It will be responsible for creating random batches of X and y pairs of the desired batch size, applying the mask to X, and making the batches available on the fly. For high-resolution images, using a data generator is the only cost-effective option. Our data generator createAugment is inspired by this amazing blog. Please give it a read.

import numpy as np
from tensorflow import keras

class createAugment(keras.utils.Sequence):
    # Generates batches of (masked image, mask) inputs and target images for training
    def __init__(self, X, y, batch_size=32, dim=(32, 32), n_channels=3, shuffle=True):
        # Initialize the constructor
        self.batch_size = batch_size
        self.X = X
        self.y = y
        self.dim = dim
        self.n_channels = n_channels
        self.shuffle = shuffle
        self.on_epoch_end()

    def __len__(self):
        # Denotes the number of batches per epoch
        return int(np.floor(len(self.X) / self.batch_size))

    def __getitem__(self, index):
        # Generate the indexes for one batch of data
        indexes = self.indexes[index*self.batch_size:(index+1)*self.batch_size]
        # Generate data
        X_inputs, y_output = self.__data_generation(indexes)
        return X_inputs, y_output

    def on_epoch_end(self):
        # Update (and optionally shuffle) indexes after each epoch
        self.indexes = np.arange(len(self.X))
        if self.shuffle:
            np.random.shuffle(self.indexes)

The methods in the code block above are self-explanatory. Let's talk about the __data_generation and __createMask methods implemented specifically for our use case. As its name suggests, __createMask is the private method responsible for generating a binary mask for each image in a batch. It draws black lines of random length and thickness on a white background. You may notice that it returns the mask along with the masked image. Why do we need the mask itself? We will see soon.
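The exact masking code lives in the accompanying repository; here is a minimal sketch of what __createMask does, assuming 32 × 32 images, `import cv2`, and the same `numpy` import as above. Details such as the number of lines or the fill value for masked pixels may differ in the repository version.

def __createMask(self, img):
    ## Illustrative sketch: start from an all-white (255) mask
    ## and draw a few black (0) lines of random length and thickness on it
    mask = np.full((32, 32, 3), 255, np.uint8)
    for _ in range(np.random.randint(1, 10)):
        x1, x2 = np.random.randint(1, 32), np.random.randint(1, 32)
        y1, y2 = np.random.randint(1, 32), np.random.randint(1, 32)
        thickness = np.random.randint(1, 3)
        cv2.line(mask, (x1, y1), (x2, y2), (0, 0, 0), thickness)

    ## Knock out the masked pixels in a copy of the image
    masked_image = img.copy()
    masked_image[mask == 0] = 255

    return masked_image, mask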

Keras's model.fit requires input and target data, for which it calls __getitem__ under the hood. If traingen is an instance of createAugment, then traingen[i] is roughly equivalent to traingen.__getitem__(i), where i ranges from 0 to len(traingen) - 1. This special method internally calls __data_generation, which is responsible for preparing batches of Masked_images, Mask_batch, and y_batch.

def __data_generation(self, idxs):
	# Masked_images is a matrix of masked images used as input
	Masked_images = np.empty((self.batch_size, self.dim[0], self.dim[1], self.n_channels)) # Masked image

	# Mask_batch is a matrix of binary masks used as input
	Mask_batch = np.empty((self.batch_size, self.dim[0], self.dim[1], self.n_channels)) # Binary Masks
	
	# y_batch is a matrix of original images used for computing error from reconstructed image
	y_batch = np.empty((self.batch_size, self.dim[0], self.dim[1], self.n_channels)) # Original image
 
	## Iterate through random indexes
	for i, idx in enumerate(idxs):
		image_copy = self.X[idx].copy()
		## Get mask associated to that image
		masked_image, mask = self.__createMask(image_copy)

		## Append and scale down.
		Masked_images[i,] = masked_image/255
		Mask_batch[i,] = mask/255
		y_batch[i] = self.y[idx]/255

	return [Masked_images, Mask_batch], y_batch
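Putting this together, the generators can be instantiated and sanity-checked roughly as follows. This is a sketch that assumes the x_train/x_test arrays from the CIFAR-10 loading snippet above; testgen is the instance later used by the prediction-logging callback, and shuffle=False for it is our assumption.

## The clean images serve both as the source for masking (X) and as the target (y)
traingen = createAugment(x_train, x_train, batch_size=32, dim=(32, 32))
testgen = createAugment(x_test, x_test, batch_size=32, dim=(32, 32), shuffle=False)

# Fetch one batch to sanity-check the shapes
[masked_images, masks], targets = traingen[0]
print(masked_images.shape, masks.shape, targets.shape)  # (32, 32, 32, 3) each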

Architecture

Inpainting is part of a large set of image generation problems. The goal of inpainting is to fill in the missing pixels, and it can be seen as creating or modifying pixels – a family that also includes tasks like deblurring, denoising, and artifact removal, to name a few. Methods for solving these problems usually rely on an Autoencoder – a neural network that is trained to copy its input to its output. It is comprised of an encoder, which learns a code to describe the input, h = f(x), and a decoder that produces the reconstruction, r = g(h) = g(f(x)).

Vanilla Convolutional Autoencoder

An Autoencoder is trained to reconstruct its input, i.e. g(f(x)) = x, but exact copying is not what we are after. We hope that training the Autoencoder will result in h taking on discriminative features. It has been observed that if an Autoencoder is not trained carefully, it tends to memorize the data rather than learn any useful salient features.

Rather than limiting the capacity of the encoder and decoder (a shallow network), regularized Autoencoders are used. Usually, a loss function is chosen that encourages the model to learn properties beyond the mere ability to copy the input. These properties can include sparsity of the representation, or robustness to noise or to missing inputs. This is where image inpainting can benefit from an Autoencoder-based architecture. Let's build one.

(Figure: the Autoencoder-based inpainting architecture)

To set a baseline, we will build an Autoencoder using a vanilla CNN. It's always a good practice to first build a simple model to set a benchmark and then make incremental improvements. If you want to refresh your concepts on Autoencoders, this article by PyImageSearch is a good starting point. As stated previously, the aim is not to master copying, so we design the loss function such that the model learns to fill in the missing points. We use mean_squared_error as the loss to start with and the dice coefficient as the evaluation metric, defined below.
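The exact baseline architecture is in the accompanying repository; the following is only a minimal sketch of what such a vanilla convolutional Autoencoder for 32 × 32 × 3 inputs could look like (layer widths are illustrative). Whether the baseline consumes the mask at all is a design choice; here it is simply concatenated onto the image channels so the same data generator can be reused.

from tensorflow import keras

def build_vanilla_autoencoder(input_shape=(32, 32, 3)):
    input_image = keras.layers.Input(input_shape)
    input_mask = keras.layers.Input(input_shape)

    ## Stack the mask onto the image channels and downsample (encoder)
    x = keras.layers.Concatenate(axis=-1)([input_image, input_mask])
    x = keras.layers.Conv2D(32, (3, 3), strides=2, padding='same', activation='relu')(x)
    x = keras.layers.Conv2D(64, (3, 3), strides=2, padding='same', activation='relu')(x)
    x = keras.layers.Conv2D(128, (3, 3), strides=2, padding='same', activation='relu')(x)

    ## Upsample back to the original resolution (decoder)
    x = keras.layers.Conv2DTranspose(64, (3, 3), strides=2, padding='same', activation='relu')(x)
    x = keras.layers.Conv2DTranspose(32, (3, 3), strides=2, padding='same', activation='relu')(x)
    outputs = keras.layers.Conv2DTranspose(3, (3, 3), strides=2, padding='same', activation='sigmoid')(x)

    return keras.models.Model(inputs=[input_image, input_mask], outputs=outputs)

vanilla_model = build_vanilla_autoencoder()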

def dice_coef(y_true, y_pred):
	# Dice coefficient: 2 * |intersection| / (|y_true| + |y_pred|), over flattened pixel values
	y_true_f = keras.backend.flatten(y_true)
	y_pred_f = keras.backend.flatten(y_pred)
	intersection = keras.backend.sum(y_true_f * y_pred_f)
	return (2. * intersection) / (keras.backend.sum(y_true_f + y_pred_f))

For tasks like image segmentation and image inpainting, pixel-wise accuracy is not a good metric because of the heavy class imbalance between masked and unmasked pixels. Though it's easy to interpret, the accuracy score is often misleading. Two commonly used alternatives are IoU (Intersection over Union) and the Dice coefficient. Both reward overlap between the prediction and the ground truth: IoU divides the area of overlap by the area of their union, while Dice divides twice the overlap by the total number of pixels in both. You can check out this amazing explanation here.
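For completeness, an IoU metric in the same style as dice_coef above could look like this (purely illustrative; we tracked the dice coefficient in our runs). A small epsilon guards against division by zero.

def iou_coef(y_true, y_pred, eps=1e-7):
	# Intersection over Union on flattened pixel values
	y_true_f = keras.backend.flatten(y_true)
	y_pred_f = keras.backend.flatten(y_pred)
	intersection = keras.backend.sum(y_true_f * y_pred_f)
	union = keras.backend.sum(y_true_f) + keras.backend.sum(y_pred_f) - intersection
	return (intersection + eps) / (union + eps)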

Wouldn’t it be interesting to see how the model is learning to fill the missing holes over multiple epochs or steps?

We implemented a simple demo PredictionLogger callback that, after each epoch completes, calls model.predict() on the same test batch of size 32. Using wandb.log(), we can easily log the masked images, masks, predictions, and ground-truth images. Fig 1 is the result of this callback. Here's the full callback that implements this -

import numpy as np
import tensorflow as tf
import wandb

class PredictionLogger(tf.keras.callbacks.Callback):
	def __init__(self):
		super(PredictionLogger, self).__init__()

	# The callback is executed after an epoch is completed
	def on_epoch_end(self, epoch, logs=None):
		# Pick a fixed batch and sample the masked images, masks, and the labels
		sample_idx = 54
		[masked_images, masks], sample_labels = testgen[sample_idx]

		# Initialize empty lists to store intermediate results
		m_images = []
		binary_masks = []
		predictions = []
		labels = []

		# Iterate over the batch
		for i in range(32):
			# Our inpainting model accepts a masked image and a mask as its inputs,
			# so we add a batch dimension to each before running inference
			inputs = [masked_images[i][np.newaxis, ...], masks[i][np.newaxis, ...]]
			inpainted_image = self.model.predict(inputs)

			# Append the results to the respective lists
			m_images.append(masked_images[i])
			binary_masks.append(masks[i])
			predictions.append(inpainted_image.reshape(inpainted_image.shape[1:]))
			labels.append(sample_labels[i])

		# Log the results on wandb run page and voila!
		wandb.log({"masked_images": [wandb.Image(m_image)
		                     for m_image in m_images]})
		wandb.log({"masks": [wandb.Image(mask)
		                     for mask in binary_masks]})
		wandb.log({"predictions": [wandb.Image(inpainted_image)
		                     for inpainted_image in predictions]})
		wandb.log({"labels": [wandb.Image(label)
		                     for label in labels]})


Partial Convolutions

We will now talk about Image Inpainting for Irregular Holes Using Partial Convolutions as a strong alternative to the vanilla CNN. Partial convolution was proposed to handle missing data, such as holes in images. The original formulation is as follows – suppose X contains the feature values for the current sliding (convolution) window and M is the corresponding binary mask, with holes denoted by 0 and non-holes by 1. Mathematically, partial convolution can be expressed as,

x' = Wᵀ(X ⊙ M) · sum(1)/sum(M) + b,   if sum(M) > 0
x' = 0,                               otherwise

where W and b are the convolution filter weights and bias, ⊙ denotes element-wise multiplication, and sum(1) is the sum over an all-ones matrix of the same shape as M.

The scaling factor, sum(1)/sum(M), applies appropriate scaling to adjust for the varying amount of valid (unmasked) inputs. After each partial convolution operation, we update our mask as follows: if the convolution was able to condition its output on at least one valid input (feature) value, then we mark that location to be valid. It can be expressed as,

m' = 1,   if sum(M) > 0
m' = 0,   otherwise

With enough successive layers of partial convolutions, any mask will eventually become all ones, provided the input contained at least some valid pixels. In order to replace the vanilla CNN with a partial convolution layer in our image inpainting task, we need an implementation of this layer.
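To make the formulation concrete, here is a tiny NumPy sketch of a single partial-convolution window. It is purely illustrative; the actual PConv2D layer discussed below slides this computation over the whole feature map and over all filters.

import numpy as np

def partial_conv_window(X, M, W, b):
    ## X: feature values in the current window, M: binary mask (1 = valid, 0 = hole)
    valid = M.sum()
    if valid > 0:
        # Convolve only the valid pixels and re-scale by sum(1)/sum(M)
        x_out = (W * X * M).sum() * (M.size / valid) + b
        m_out = 1  # at least one valid input -> mark this location as valid
    else:
        x_out = 0
        m_out = 0  # no valid inputs -> the hole remains
    return x_out, m_out

# Example: a 3x3 window where only the top row is valid
X = np.arange(9, dtype=float).reshape(3, 3)
M = np.array([[1, 1, 1], [0, 0, 0], [0, 0, 0]], dtype=float)
W = np.ones((3, 3)) / 9.0
print(partial_conv_window(X, M, W, b=0.0))  # -> (1.0, 1)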

Unfortunately, since there is no official implementation in TensorFlow or PyTorch, we have to implement this custom layer ourselves. This TensorFlow tutorial on how to build a custom layer is a good starting point. Luckily, I could find a Keras implementation of partial convolution here. That codebase used TF 1.x as the Keras backend, which I upgraded to TF 2.x. We have provided this upgraded implementation along with the GitHub repo for this blog post. Find the PConv2D layer here.

Let's implement the model in code and train it on the CIFAR-10 dataset. We implemented a class, inpaintingModel. To build the model, you need to call its prepare_model() method.

def prepare_model(self, input_size=(32,32,3)):
	input_image = keras.layers.Input(input_size)
	input_mask = keras.layers.Input(input_size)

	conv1, mask1, conv2, mask2 = self.__encoder_layer(32, input_image, input_mask)
	conv3, mask3, conv4, mask4 = self.__encoder_layer(64, conv2, mask2)
	conv5, mask5, conv6, mask6 = self.__encoder_layer(128, conv4, mask4)
	conv7, mask7, conv8, mask8 = self.__encoder_layer(256, conv6, mask6)

	conv9, mask9, conv10, mask10 = self.__decoder_layer(256, 128, conv8, mask8, conv7, mask7)
	conv11, mask11, conv12, mask12 = self.__decoder_layer(128, 64, conv10, mask10, conv5, mask5)
	conv13, mask13, conv14, mask14 = self.__decoder_layer(64, 32, conv12, mask12, conv3, mask3)
	conv15, mask15, conv16, mask16 = self.__decoder_layer(32, 3, conv14, mask14, conv1, mask1)

	outputs = keras.layers.Conv2D(3, (3, 3), activation='sigmoid', padding='same')(conv16)

	return keras.models.Model(inputs=[input_image, input_mask], outputs=[outputs])

As it's an Autoencoder, this architecture has two components – an encoder and a decoder, which we have discussed already. In order to reuse the encoder and decoder conv blocks, we built two simple utility methods, __encoder_layer and __decoder_layer:

def __encoder_layer(self, filters, in_layer, in_mask):
	conv1, mask1 = PConv2D(filters, (3,3), strides=1, padding='same')([in_layer, in_mask])
	conv1 = keras.activations.relu(conv1)

	conv2, mask2 = PConv2D(filters, (3,3), strides=2, padding='same')([conv1, mask1])
	conv2 = keras.layers.BatchNormalization()(conv2, training=True)
	conv2 = keras.activations.relu(conv2)

	return conv1, mask1, conv2, mask2

def __decoder_layer(self, filter1, filter2, in_img, in_mask, share_img, share_mask):
	up_img = keras.layers.UpSampling2D(size=(2,2))(in_img)
	up_mask = keras.layers.UpSampling2D(size=(2,2))(in_mask)
	concat_img = keras.layers.Concatenate(axis=3)([share_img, up_img])
	concat_mask = keras.layers.Concatenate(axis=3)([share_mask, up_mask])

	conv1, mask1 = PConv2D(filter1, (3,3), padding='same')([concat_img, concat_mask])
	conv1 = keras.activations.relu(conv1)

	conv2, mask2 = PConv2D(filter2, (3,3), padding='same')([conv1, mask1])
	conv2 = keras.layers.BatchNormalization()(conv2)
	conv2 = keras.activations.relu(conv2)

	return conv1, mask1, conv2, mask2

The essence of the decoder lies in the UpSampling2D and Concatenate layers: upsampling brings the feature maps back up to the input resolution, while the concatenations act as skip connections to the corresponding encoder blocks. An alternative is to use a Conv2DTranspose layer for the upsampling.
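If you prefer learnable upsampling, the image path in __decoder_layer could be swapped out roughly as sketched below (we kept UpSampling2D in our runs). The mask path is left as nearest-neighbour upsampling so the mask stays close to binary.

## Hypothetical drop-in replacement for the upsampling step in __decoder_layer
up_img = keras.layers.Conv2DTranspose(filter1, (3, 3), strides=2, padding='same')(in_img)
up_mask = keras.layers.UpSampling2D(size=(2, 2))(in_mask)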

We compiled the model with the Adam optimizer (default parameters), mean_squared_error as the loss, and dice_coef as the metric. We then trained the model with model.fit(), logging the results with the WandbCallback and PredictionLogger callbacks.
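Assuming the traingen/testgen generators and the callbacks defined earlier, and that wandb.init() has already been called, the training step looks roughly like this; the epoch count is illustrative.

from wandb.keras import WandbCallback

model = inpaintingModel().prepare_model()
model.compile(optimizer='adam', loss='mean_squared_error', metrics=[dice_coef])

model.fit(traingen,
          validation_data=testgen,
          epochs=20,  # illustrative
          callbacks=[WandbCallback(), PredictionLogger()])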


Conclusion

Let’s conclude with some additional pointers on the topic, including how it relates to self-supervised learning and some recent approaches for doing image inpainting.

A very interesting property of an image inpainting model is that it is capable of understanding an image to some extent. This is much like in NLP, where we use embeddings to capture the semantic relationships between words and then reuse those embeddings for downstream tasks like text classification.

The premise here is that when you start to fill in the missing pieces of an image in a visually and semantically plausible way, you start to understand the image. This is along the lines of self-supervised learning, where you take advantage of the implicit labels present in your input data when you do not have any explicit labels.

This is particularly interesting because we can reuse the representations learned by an image inpainting model in a downstream computer vision task, just as we would use embeddings for an NLP task. For learning more about this, I highly recommend this excellent article by Jeremy Howard.

So far, we have only used a pixel-wise comparison as our loss function. This often forces the network to learn very rigid and not-so-rich feature representations. A very interesting yet simple idea, approximate exact matching, was presented by Charles et al. in this report. According to their study, shifting the pixel values of an image by a small constant does not make the image look very different from its original form. So, they added an additional term to the pixel-wise comparison loss to incorporate this idea.

Another interesting tweak to our network would be to enable it to attend to related feature patches at distant spatial locations in an image. In the paper Generative Image Inpainting with Contextual Attention, Yu et al. introduced the idea of contextual attention, which allows the network to explicitly use surrounding image features as references while filling in the missing region.

Thanks for reading this article until the end. Image inpainting is a very interesting computer vision task and we hope this article gave you a fair introduction to the topic. Please feel free to let us know about any feedback you might have on the article via Twitter (Ayush and Sayak). We would really appreciate it :)