# 3D Image Inpainting

A novel way to convert a single RGB-D image into a 3D image. Made by Ayush Thakur using Weights & Biases

In this report, we introduce some key components of the [3D Photography using Context-aware Layered Depth Inpainting](https://shihmengli.github.io/3D-Photo-Inpainting/) paper and look at intermediate results alongside stunning 3D images. Since we can see the Neowise comet this month, the report is space-themed and tries to bring images from space to life.

#### Reproduce results in this colab $\rightarrow$

Alternatively, you can use the forked repo which lets you visualize your model predictions in Weights & Biases here $\rightarrow$

## Introduction

3D pictures can take your photography to a whole new dimension. However, creating such parallax effects with classical reconstruction and rendering techniques requires elaborate setup and specialized hardware, which is not always feasible.
Depth is the most important aspect of 3D photography. A 3D image can be created by taking two shots of the same scene, where one is slightly offset from the other. This slight difference is enough to trick your brain into thinking you are looking at an image with depth. Recent advancements in cell phone cameras, like dual-lens cameras, enable capturing depth information. The resulting image is an RGB-D (color and depth) image. To generate a lifelike view from this RGB-D image, the occlusions revealed by parallax must be filled in.
In this paper, the authors propose a method for converting a single image into a 3D photo. They use a Layered Depth Image (LDI) as the underlying representation and present a learning-based inpainting model that can synthesize new color and depth in the occluded regions.

### The Paper $\rightarrow$

Here is a short video from the authors showcasing their stunning results.

## Overview of the Paper

Some prerequisites that the proposed method relies on are:

• Layered Depth Image (LDI) Representation: In simple terms, an LDI is a view of the scene from a single input camera view, but with multiple pixels along each line of sight. Thus, an LDI can contain multiple depth pixels per pixel location. This is an intermediate representation where the depth information can either come from a dual-camera cell phone or be estimated from a single RGB image. The authors use the LDI representation because it naturally handles multiple layers, i.e., it can handle depth-complex scenes, and it is memory efficient. An LDI is similar to a regular 4-connected image (shown in figure 1), except that every position in the pixel lattice can hold any number of pixels, from zero to many. Each LDI pixel stores a color and a depth value. The original paper on LDI can give readers the necessary background. More on 4-connected image here.

• Learning-Based Inpainting: Inpainting is the task of filling missing regions of an image with plausible content. In the context of 3D images, the missing content can be color or depth. CNN-based methods have received considerable attention due to their ability to predict semantically meaningful content that is not available in the known regions. Earlier learning-based methods use “rigid” layer structures, i.e., every pixel in the image has the same number of layers. At every pixel, they store the nearest surface in the first layer, the second-nearest in the next layer, and so on. This is problematic because, across depth discontinuities, the content within a layer changes abruptly, destroying locality in the receptive fields of convolution kernels. To learn more about image inpainting, check out an introduction to image inpainting using deep learning.
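To make the LDI idea concrete, here is a minimal Python sketch of the data structure as described above: every lattice position holds zero or more pixels, each with a color and a depth. The class and method names (`LDIPixel`, `LayeredDepthImage`, `add`, `layers_at`) are made up for illustration and are not taken from the paper or its code.

```python
from dataclasses import dataclass, field


@dataclass
class LDIPixel:
    color: tuple   # (r, g, b)
    depth: float   # distance along the line of sight


@dataclass
class LayeredDepthImage:
    width: int
    height: int
    # Maps (x, y) -> list of pixels along that line of sight, nearest first.
    pixels: dict = field(default_factory=dict)

    def add(self, x, y, pixel):
        """Insert a pixel at (x, y), keeping the list sorted by depth."""
        layers = self.pixels.setdefault((x, y), [])
        layers.append(pixel)
        layers.sort(key=lambda p: p.depth)

    def layers_at(self, x, y):
        """Return all pixels stored at (x, y); the list may be empty."""
        return self.pixels.get((x, y), [])


ldi = LayeredDepthImage(width=2, height=1)
ldi.add(0, 0, LDIPixel((30, 30, 200), depth=4.0))   # background hidden behind...
ldi.add(0, 0, LDIPixel((200, 30, 30), depth=1.0))   # ...this foreground pixel
ldi.add(1, 0, LDIPixel((30, 30, 200), depth=4.0))   # background only
```

Unlike a "rigid" layered structure, position `(0, 0)` holds two pixels while `(1, 0)` holds one, so memory is only spent where the scene actually has depth complexity.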

With some context, we can now look into some details of this paper. This paper is rich with techniques that require close attention, but in my opinion, the best way to summarize it is to go through the proposed method and look at the individual components along the way.

### Method Overview

• Input: The input to this method is a single RGB-D image. The depth information can come from a mobile dual camera, as mentioned earlier, or can be estimated from a single image. For images with only color information, the authors use a pretrained depth estimation model to compute depth from the input image. Thus, the proposed method applies to any image.

• Initialization: The input image (RGB-D, or RGB after depth estimation) is lifted onto an LDI. Initially, the LDI is created with a single layer everywhere, and every LDI pixel is connected to its four cardinal neighbors. Unlike the original LDI discussed in the prerequisites, each pixel in the authors' LDI representation stores pointers to zero or at most one direct neighbor in each of the four cardinal directions (left, right, top, bottom). LDI pixels are 4-connected like normal image pixels within smooth regions but do not have neighbors across depth discontinuities. This takes us to the next step.

• Pre-processing: Since LDI pixels do not have neighbors across depth discontinuities, we need to find these discontinuity edges. Occlusions occur at these edges, so the inpainting operation needs to extend the content of these regions. However, the depth map from a dual camera or from depth estimation has blurry discontinuities spread across multiple pixels. To sharpen them, the authors use a bilateral median filter, which makes the edges easy to localize. There are a few more sub-steps, like thresholding and the removal of short edges (<10 pixels).

In the example below, the input RGB image is to the right, followed by the depth map estimated by the pre-trained depth estimation model.

Notice the blurred edges. A bilateral median filter is used to sharpen them, followed by a few more preprocessing steps.
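The initialization and pre-processing steps can be sketched in plain Python. This is a simplified illustration, not the authors' implementation: it assumes the depth map has already been sharpened (the bilateral median filter is not shown), uses a plain absolute-difference threshold, and groups edge pixels with 8-connectivity. The function names and the `threshold` value are assumptions made for this example.

```python
from collections import deque

# Offsets for the four cardinal directions.
DIRECTIONS = {"left": (-1, 0), "right": (1, 0), "top": (0, -1), "bottom": (0, 1)}


def build_neighbors(depth, threshold):
    """Link each pixel to a cardinal neighbor unless the depth jump exceeds threshold.

    Returns {(x, y): {direction: (nx, ny) or None}}.
    """
    height, width = len(depth), len(depth[0])
    links = {}
    for y in range(height):
        for x in range(width):
            links[(x, y)] = {}
            for name, (dx, dy) in DIRECTIONS.items():
                nx, ny = x + dx, y + dy
                inside = 0 <= nx < width and 0 <= ny < height
                smooth = inside and abs(depth[y][x] - depth[ny][nx]) <= threshold
                links[(x, y)][name] = (nx, ny) if smooth else None
    return links


def depth_edge_pixels(depth, threshold):
    """Pixels with at least one in-image cardinal neighbor across a depth jump."""
    height, width = len(depth), len(depth[0])
    edges = set()
    for y in range(height):
        for x in range(width):
            for dx, dy in DIRECTIONS.values():
                nx, ny = x + dx, y + dy
                if 0 <= nx < width and 0 <= ny < height:
                    if abs(depth[y][x] - depth[ny][nx]) > threshold:
                        edges.add((x, y))
    return edges


def drop_short_edges(edges, min_length=10):
    """Group edge pixels into 8-connected components; keep those with >= min_length pixels."""
    remaining, kept = set(edges), set()
    while remaining:
        start = remaining.pop()
        component, queue = {start}, deque([start])
        while queue:
            x, y = queue.popleft()
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    nxt = (x + dx, y + dy)
                    if nxt in remaining:
                        remaining.remove(nxt)
                        component.add(nxt)
                        queue.append(nxt)
        if len(component) >= min_length:
            kept |= component
    return kept


# A 12-row depth map: near surface on the left, far surface on the right.
depth = [[1.0, 1.0, 1.0, 5.0, 5.0, 5.0] for _ in range(12)]
links = build_neighbors(depth, threshold=1.0)
edges = drop_short_edges(depth_edge_pixels(depth, threshold=1.0))
```

In this toy example, the pixels in columns 2 and 3 lose their link across the depth jump, and together they form one long edge that survives the short-edge filter.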

• Iteratively: From the computed and pre-processed depth edges, one depth edge is selected for inpainting at a time. For every selected depth edge, color and depth content is synthesized. First, the LDI pixels across the edge are disconnected (remember that each pixel was connected to its nearest neighbor in the four cardinal directions). The pixels that became disconnected (i.e., are now missing a neighbor) are silhouette pixels. Naturally, there will be a foreground silhouette and a background silhouette. We require inpainting only for the background silhouette. A local context region is extracted from the known side of the edge, and a synthesis region is generated on the unknown (occluded) side of the edge. This synthesis region is simply a contiguous region of new pixels. The color and depth values in the synthesis region are first initialized using a simple iterative flood fill algorithm. Check section 3.2 of the paper for the detailed implementation of this step. These context and synthesis regions are then used to synthesize color and depth values using a learning-based technique.

Figure 1: Steps showing one iteration of inpainting. (Source)
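The silhouette and region-growing idea can be illustrated with a small Python sketch. It is a heavily simplified assumption-based version, not the procedure from section 3.2 of the paper: it works on 2D coordinates of a plain depth map, treats the nearer side of each depth jump as the foreground silhouette, and grows both regions with a fixed-step flood fill that never crosses a depth jump. All function names and parameters here are hypothetical.

```python
DIRECTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]


def silhouettes(depth, threshold):
    """Split pixels on depth discontinuities into foreground (near) and background (far)."""
    height, width = len(depth), len(depth[0])
    foreground, background = set(), set()
    for y in range(height):
        for x in range(width):
            for dx, dy in DIRECTIONS:
                nx, ny = x + dx, y + dy
                if 0 <= nx < width and 0 <= ny < height:
                    jump = depth[ny][nx] - depth[y][x]
                    if jump > threshold:       # neighbor is farther: this pixel is in front
                        foreground.add((x, y))
                    elif jump < -threshold:    # neighbor is nearer: this pixel is behind
                        background.add((x, y))
    return foreground, background


def grow(seeds, depth, threshold, iterations):
    """Flood fill from seeds for a fixed number of steps without crossing depth jumps."""
    height, width = len(depth), len(depth[0])
    region, frontier = set(seeds), set(seeds)
    for _ in range(iterations):
        next_frontier = set()
        for x, y in frontier:
            for dx, dy in DIRECTIONS:
                nx, ny = x + dx, y + dy
                if not (0 <= nx < width and 0 <= ny < height):
                    continue
                if (nx, ny) in region:
                    continue
                if abs(depth[ny][nx] - depth[y][x]) <= threshold:
                    next_frontier.add((nx, ny))
        region |= next_frontier
        frontier = next_frontier
    return region


def context_and_synthesis(depth, threshold, iterations):
    """Context grows on the known background side; synthesis grows under the foreground.

    Stepping one pixel across the edge from a background silhouette pixel lands on the
    2D position of a foreground silhouette pixel, so those positions seed the synthesis
    region (the new pixels would live there, hidden behind the foreground).
    """
    foreground, background = silhouettes(depth, threshold)
    context = grow(background, depth, threshold, iterations)
    synthesis = grow(foreground, depth, threshold, iterations)
    return context, synthesis


# Near surface in columns 0-3, far surface in columns 4-7.
depth = [[1.0] * 4 + [5.0] * 4 for _ in range(3)]
context, synthesis = context_and_synthesis(depth, threshold=1.0, iterations=2)
```

Here the context region covers background columns 4 to 6 and the synthesis region covers the 2D positions of columns 1 to 3, where new background pixels would be hallucinated behind the foreground.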

• Learning-Based Inpainting: As mentioned in the prerequisites, earlier methods used a "rigid" layer representation, so a standard CNN could be applied to each layer. In this method, however, the topology of the LDI is more complex, and a standard CNN cannot be applied directly. The authors therefore break the inpainting problem into local inpainting sub-problems: depth edges are computed, and for each depth edge, context and synthesis regions are first initialized with the flood fill algorithm. These local regions have image topology, so a standard CNN can be applied to them. The authors split this local inpainting task across three sub-networks.

• Edge Inpainting Network: This network is used to predict the depth edges in the synthesis regions, thus producing an inpainted edge. Doing this first constrains the content prediction.

• Color Inpainting network: The inputs to this network are inpainted edges and context color. The output is the inpainted color for the synthesis region.

• Depth inpainting network: The inputs to this network are inpainted edges and context depth. The output is the inpainted depth for the synthesis region.

Figure 2: The inpainting network. (Source)
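The order in which the three sub-networks are applied can be sketched as a plain Python function. This only illustrates the data flow described above; the networks are passed in as arbitrary callables, and the names `Region` and `inpaint_region` are hypothetical, not from the authors' code.

```python
from typing import NamedTuple


class Region(NamedTuple):
    """Inputs extracted for one depth edge (placeholder containers)."""
    context_edges: list
    context_color: list
    context_depth: list


def inpaint_region(region, edge_net, color_net, depth_net):
    """Run the edge network first, then condition color and depth on its output."""
    inpainted_edges = edge_net(region.context_edges)
    inpainted_color = color_net(inpainted_edges, region.context_color)
    inpainted_depth = depth_net(inpainted_edges, region.context_depth)
    return inpainted_edges, inpainted_color, inpainted_depth


# Stand-in "networks" that only show which inputs each stage receives.
region = Region(context_edges=["e"], context_color=["c"], context_depth=["d"])
result = inpaint_region(
    region,
    edge_net=lambda edges: edges + ["e_new"],
    color_net=lambda edges, color: (edges, color),
    depth_net=lambda edges, depth: (edges, depth),
)
```

Because the inpainted edges are computed once and shared, the color and depth predictions are constrained by the same predicted structure.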

Now let us dive into the exciting part.

## Results

#### Reproduce results in this colab $\rightarrow$

Alternatively, you can use the forked repo, which lets you visualize your model predictions in Weights & Biases as we have done here $\rightarrow$

After successful 3D image inpainting in the Google Colab, you will find an image_name.ply file in the mesh directory. It is the inpainted 3D mesh generated by integrating all the inpainted depth and color values back into the original LDI. I was curious to look at this mesh. With some investigation, I realized that it can be loaded as a point cloud. Being new to this, the easiest way for me to visualize it was to log the point cloud to the wandb dashboard using wandb.log. Learn more about doing this here. I used Open3D to load the .ply file. The point cloud object has depth and color information. They are shown below. However, I do not know how to interpret them. Nevertheless, it was a curious endeavor.
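A minimal sketch of this logging step could look like the following. It assumes the .ply file can be read with Open3D's `read_point_cloud`, and that the colors come back as floats in [0, 1], which `wandb.Object3D` expects as 0 to 255 values in an `(N, 6)` array. The helper names, the project name, and the logging key are made up for this example.

```python
def merge_xyz_rgb(points, colors):
    """Build [x, y, z, r, g, b] rows, scaling colors from [0, 1] to [0, 255]."""
    return [
        [x, y, z, round(r * 255), round(g * 255), round(b * 255)]
        for (x, y, z), (r, g, b) in zip(points, colors)
    ]


def log_point_cloud(ply_path, project="3d-photo-inpainting"):
    """Load a .ply file with Open3D and log it to Weights & Biases as a 3D object."""
    # Imported lazily so the helper above stays usable without these packages.
    import numpy as np
    import open3d as o3d
    import wandb

    cloud = o3d.io.read_point_cloud(ply_path)
    points = np.asarray(cloud.points).tolist()   # (N, 3) xyz coordinates
    colors = np.asarray(cloud.colors).tolist()   # (N, 3) RGB in [0, 1]

    run = wandb.init(project=project)            # hypothetical project name
    run.log({"point_cloud": wandb.Object3D(np.array(merge_xyz_rgb(points, colors)))})
    run.finish()


rows = merge_xyz_rgb([(0.0, 1.0, 2.0)], [(1.0, 0.0, 1.0)])
```

Calling `log_point_cloud("mesh/image_name.ply")` after the Colab run would then add an interactive 3D view to the run page.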

## How Were the Models Trained?

#### Training Data

The authors generated training data from the MS COCO dataset. They first applied the pre-trained depth estimation model to the COCO images to obtain depth maps. They then extracted context/synthesis regions (briefly explained above) to form a pool of these regions. These regions were randomly sampled and placed on different images in the MS COCO dataset. Out of at most three such regions per image, one was selected during training.

#### Model

Check out the supplementary material of the paper for more information on training details.

It is time for some beautiful 3D images. :fire:

## What's Next and Conclusion

The goal of this report is to summarize the paper, making it more accessible for the readers. I have used lines from the paper at places because that was the best way to convey the information.

A few of the things that excited me the most about the proposed method are as follows:

• We can generate high-quality 3D images with cheap computation. This can be used as a web service.

• What excited me the most is the idea of using this technique as a data augmentation pipeline with a few modifications. I got this idea while watching the interview of Sayak Paul at the Machine Learning Street Talk. In my opinion, viewing an object in an image from different camera angles is a more natural way of generating a larger dataset. This can be useful for object detection tasks on smaller datasets.

Thank you for your time. For constructive feedback on summarizing this paper, reach out to me on Twitter, @ayushthakur0.