In this report, we introduce some key components of the [3D Photography using Context-aware Layered Depth Inpainting](https://shihmengli.github.io/3D-Photo-Inpainting/) paper and look at intermediate results alongside stunning 3D images. Since we can see the Neowise comet this month, the report is space-themed and tries to bring images from space to life.
Alternatively, you can use the forked repo which lets you visualize your model predictions in Weights & Biases here $\rightarrow$
Some prerequisites that the proposed method relies on are:
Layered Depth Image (LDI) Representation: In simple terms, an LDI is a view of the scene from a single input camera view, but with multiple pixels along each line of sight. An LDI can therefore hold more than one depth pixel at each pixel location. This is an intermediate representation where the depth information can either come from a dual-camera cell phone or be estimated from a single RGB image. The authors use the LDI representation because it naturally handles multiple layers, i.e., it can handle depth-complex scenarios, and it is memory efficient. An LDI is similar to a regular 4-connected image (shown in figure 1), except that every position in the pixel lattice can hold any number of pixels, from zero to many. Each LDI pixel stores a color and a depth value. The original paper on LDI can give the necessary background to the readers. More on 4-connected images here.
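To make the representation concrete, here is a minimal sketch of an LDI as a plain Python data structure. This is purely illustrative and not the paper's implementation; it only shows the key idea that a single pixel location can hold several (color, depth) samples along one line of sight.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class LDIPixel:
    color: Tuple[int, int, int]  # RGB value of this sample
    depth: float                 # distance along the camera ray

# A toy LDI: every (row, col) position of the pixel lattice holds a list of
# zero or more depth samples along the same line of sight.
height, width = 2, 2
ldi: List[List[List[LDIPixel]]] = [[[] for _ in range(width)] for _ in range(height)]

# A position covered by a single surface stores one sample ...
ldi[0][0].append(LDIPixel(color=(255, 0, 0), depth=1.2))
# ... while a position with an occluded background surface stores several.
ldi[0][1].append(LDIPixel(color=(0, 255, 0), depth=1.1))   # visible foreground
ldi[0][1].append(LDIPixel(color=(30, 30, 30), depth=4.8))  # hidden background

print(len(ldi[0][1]))  # -> 2 samples at one pixel location
```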
Learning-Based Inpainting: Inpainting is the task of filling missing content in an image with plausible content. In the context of 3D images, the occlusions can be in color or depth. CNN-based methods have received considerable attention due to their ability to predict semantically meaningful content that is not available in the known regions. Earlier learning-based methods use “rigid” layer structures, i.e., every pixel in the image has the same number of layers. At every pixel, they store the nearest surface in the first layer, the second-nearest in the next layer, and so on. This is problematic because, across depth discontinuities, the content within a layer changes abruptly, destroying locality in the receptive fields of convolution kernels. To learn more about image inpainting, check out an introduction to image inpainting using deep learning.
With some context, we can now look into the details of this paper. The paper is rich with techniques that each deserve close attention, but in my opinion, the best way to summarize it is to walk through the proposed method and look at the individual components along the way.
Input: The input to this method is a single RGB-D image. The depth information can come from a dual-camera phone, as mentioned earlier, or can be estimated from a single image. The authors use a pretrained depth estimation model to compute depth when the input image has only color information. Thus the proposed method applies to any image.
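The released implementation bundles a pretrained monocular depth model for exactly this purpose. As an illustration (the model choice and file name below are my own, not necessarily the authors' exact checkpoint), a dense depth map can be estimated from a single RGB image with MiDaS via torch.hub:

```python
import cv2
import torch

# Load a pretrained monocular depth estimation model (MiDaS) from torch.hub,
# together with its matching input transform.
midas = torch.hub.load("intel-isl/MiDaS", "MiDaS_small")
midas.eval()
transforms = torch.hub.load("intel-isl/MiDaS", "transforms")
transform = transforms.small_transform

# "comet.jpg" is just a placeholder file name.
img = cv2.cvtColor(cv2.imread("comet.jpg"), cv2.COLOR_BGR2RGB)

with torch.no_grad():
    prediction = midas(transform(img))
    # Resize the predicted (inverse) depth map back to the input resolution.
    depth = torch.nn.functional.interpolate(
        prediction.unsqueeze(1), size=img.shape[:2],
        mode="bicubic", align_corners=False,
    ).squeeze().cpu().numpy()

print(depth.shape)  # same height and width as the input image
```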
Initialization: The input image (RGB-D or after depth estimation) is lifted onto an LDI. Initially, the LDI is created with a single layer everywhere, and every LDI pixel is connected to its four cardinal neighbors. Unlike the original LDI discussed in the prerequisites, the authors ensure that each pixel stores pointers to either zero or at most one direct neighbor in each of the four cardinal directions (left, right, top, bottom). LDI pixels are therefore 4-connected like normal image pixels within smooth regions but do not have neighbors across depth discontinuities. This takes us to the next step.
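In code terms, each pixel of this modified LDI can be thought of as a small record holding a color, a depth, and optional links to its four cardinal neighbors, where a link is created only when the two depths are close. The sketch below is my own illustration; the threshold rule is made up and not the paper's criterion.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

@dataclass
class LDIPixel:
    color: Tuple[int, int, int]
    depth: float
    # Zero or at most one direct neighbor per cardinal direction.
    neighbors: Dict[str, Optional["LDIPixel"]] = field(
        default_factory=lambda: {"left": None, "right": None, "top": None, "bottom": None}
    )

def maybe_connect(a: "LDIPixel", b: "LDIPixel", direction: str, opposite: str,
                  threshold: float = 0.05) -> None:
    """Link two adjacent pixels only if they belong to the same smooth surface,
    i.e. their depths differ by less than a discontinuity threshold."""
    if abs(a.depth - b.depth) < threshold:
        a.neighbors[direction] = b
        b.neighbors[opposite] = a

foreground = LDIPixel(color=(200, 50, 50), depth=1.0)
background = LDIPixel(color=(20, 20, 80), depth=5.0)
maybe_connect(foreground, background, "right", "left")
print(foreground.neighbors["right"])  # None -> no link across the depth discontinuity
```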
Pre-processing: Since an LDI does not have neighbors across depth discontinuities, we need to find these discontinuity edges; the occlusions occur at these edges, and the inpainting operation needs to extend content into these regions. However, the depth map from a dual camera or from depth estimation has blurry discontinuities spread across multiple pixels. To sharpen them, the authors use a bilateral median filter, which makes the edges easy to localize. There are a few more sub-steps, like thresholding and the removal of short edges (< 10 pixels).
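A rough, simplified sketch of depth-edge localization is shown below. The real pipeline first sharpens the depth map with a bilateral median filter and uses its own thresholds, so treat this as an approximation of the idea (threshold on disparity jumps, then drop short segments), not the authors' code.

```python
import numpy as np
from scipy import ndimage

def find_depth_edges(depth: np.ndarray, threshold: float = 0.04,
                     min_length: int = 10) -> np.ndarray:
    """Mark pixels whose normalized disparity jumps sharply relative to a
    neighbor, then drop connected edge segments shorter than `min_length`."""
    disparity = 1.0 / np.maximum(depth, 1e-6)
    disparity = (disparity - disparity.min()) / (disparity.max() - disparity.min() + 1e-6)

    edges = np.zeros(disparity.shape, dtype=bool)
    # Compare each pixel with its right and bottom neighbors.
    edges[:, :-1] |= np.abs(np.diff(disparity, axis=1)) > threshold
    edges[:-1, :] |= np.abs(np.diff(disparity, axis=0)) > threshold

    # Remove short edges: label connected edge segments and keep long ones only.
    labels, num = ndimage.label(edges, structure=np.ones((3, 3)))
    sizes = ndimage.sum(edges, labels, index=np.arange(1, num + 1))
    long_labels = np.flatnonzero(sizes >= min_length) + 1
    return np.isin(labels, long_labels)
```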
In the example below, the input RGB image is shown alongside the depth map estimated by the pre-trained depth estimation model.
Notice the blurred edges. A bilateral median filter is used to sharpen them, followed by a few more preprocessing steps.
Figure 1: Steps showing one iteration of inpainting. (Source)
Learning-Based Inpainting: As mentioned in the prerequisites, earlier methods used a "rigid" layer representation, so a standard CNN could be applied to each layer. In this method, however, the tensor (LDI) topology is more complex, and a standard CNN cannot be applied directly. The authors therefore broke the problem of inpainting into local inpainting sub-problems: depth edges are computed, and for each depth edge, context and synthesis regions are first initialized with a flood-fill algorithm. These local regions have image topology, so standard CNNs can be applied. The authors split this local inpainting task across three sub-networks; a schematic sketch of how they fit together follows the figure below.
Edge inpainting network: This network predicts the depth edges in the synthesis region, producing an inpainted edge. Doing this first constrains the content prediction.
Color inpainting network: The inputs to this network are the inpainted edges and the context color. The output is the inpainted color for the synthesis region.
Depth inpainting network: The inputs to this network are inpainted edges and context depth. The output is the inpainted depth for the synthesis region.
Figure 2: The inpainting network. (Source)
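To make the data flow between the three sub-networks explicit, here is a purely schematic sketch of one local inpainting step. The function and network names are hypothetical placeholders, not the paper's actual API.

```python
# Schematic only: edge_net, color_net, and depth_net stand for the three
# pretrained sub-networks described above; the signatures are hypothetical.
def inpaint_local_region(context_rgb, context_depth, context_edges,
                         synthesis_mask, edge_net, color_net, depth_net):
    # 1. Extend the depth edges from the context into the synthesis region.
    inpainted_edges = edge_net(context_edges, context_depth, synthesis_mask)

    # 2. Predict color for the synthesis region, guided by the inpainted edges.
    inpainted_color = color_net(context_rgb, inpainted_edges, synthesis_mask)

    # 3. Predict depth for the synthesis region, guided by the same edges.
    inpainted_depth = depth_net(context_depth, inpainted_edges, synthesis_mask)

    return inpainted_color, inpainted_depth, inpainted_edges
```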
Now let us dive into the exciting part.
Alternatively, you can use the forked repo, which lets you visualize your model predictions in Weights & Biases, as we have done here $\rightarrow$
After successful 3D image inpainting, in the Google Colab you will find an `image_name.ply` file in the `mesh` directory. It is the inpainted 3D mesh generated by integrating all the inpainted depth and color values back into the original LDI. I was curious to look at this mesh. With some investigation, I realized that it is a point cloud. Being new to this, the easiest way for me to visualize it was to log the point cloud to the wandb dashboard using `wandb.log`. Learn more about doing this here. I used Open3D to load the `.ply` file. The point cloud object has depth and color information, shown below; however, I do not know how to interpret them. Nevertheless, it was a curious endeavor.
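For reference, here is roughly how that logging can be done with Open3D and `wandb.Object3D`; the file path and project name below are placeholders for whatever your Colab run produced.

```python
import numpy as np
import open3d as o3d
import wandb

wandb.init(project="3d-photo-inpainting")  # project name is a placeholder

# Load the inpainted result produced by the Colab run and treat it as a point cloud.
pcd = o3d.io.read_point_cloud("mesh/image_name.ply")
points = np.asarray(pcd.points)          # (N, 3) xyz coordinates
colors = np.asarray(pcd.colors) * 255.0  # Open3D stores colors in [0, 1]

# wandb.Object3D accepts an (N, 6) array of [x, y, z, r, g, b] rows.
wandb.log({"point_cloud": wandb.Object3D(np.concatenate([points, colors], axis=1))})
```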
The authors generated training data from the MS COCO dataset. They first applied the pre-trained depth estimation model to the COCO images to obtain depth maps. They then extracted context/synthesis regions (briefly explained above) to form a pool of such regions, which were randomly sampled and placed on different images in the MS COCO dataset. Out of at most three such regions per image, one was selected during training.
Edge inpainting model: The architecture was based on this paper, "EdgeConnect: Generative Image Inpainting with Adversarial Edge Learning". Check out the GitHub repo here.
Color and depth inpainting: For these networks, the authors used a standard U-Net architecture with partial convolution. In an introduction to image inpainting using deep learning, Sayak Paul and I have tried to explain the implementation details of this architecture. Check out the implementation in Keras here.
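As a quick reminder of the mechanism, here is a minimal partial-convolution layer written as a PyTorch-style sketch. The linked report and repo use Keras; this is my own simplified illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PartialConv2d(nn.Module):
    """Minimal partial-convolution layer (a sketch, not the authors' code).

    The convolution aggregates only valid (mask == 1) pixels, renormalizes the
    result by how many valid pixels fall inside each window, and updates the
    mask so that any window containing at least one valid pixel becomes valid.
    """

    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1, padding=1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, stride, padding, bias=False)
        # Fixed all-ones kernel used only to count valid pixels per window.
        self.register_buffer("ones", torch.ones(1, 1, kernel_size, kernel_size))
        self.stride, self.padding = stride, padding
        self.window_size = kernel_size * kernel_size

    def forward(self, x, mask):
        # x: (N, C, H, W) features, mask: (N, 1, H, W) with 1 = known, 0 = hole.
        out = self.conv(x * mask)  # zero out the holes before convolving
        with torch.no_grad():
            valid = F.conv2d(mask, self.ones, stride=self.stride, padding=self.padding)
        # Renormalize by the number of valid pixels in each receptive field.
        out = out * (self.window_size / valid.clamp(min=1.0)) * (valid > 0).float()
        new_mask = (valid > 0).float()
        return out, new_mask

# Tiny usage example: an 8x8 feature map with the right half masked out.
x = torch.randn(1, 3, 8, 8)
mask = torch.ones(1, 1, 8, 8)
mask[..., 4:] = 0.0
layer = PartialConv2d(3, 16)
out, new_mask = layer(x, mask)
print(out.shape, new_mask.sum().item())
```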
Check out the supplementary material of the paper for more information on training details.
It is time for some beautiful 3D images. :fire:
The goal of this report is to summarize the paper and make it more accessible for readers. I have used lines from the paper in places because that was the best way to convey the information.
A few of the things that excited me the most about the proposed method are as follows:
We can generate high-quality 3D images with cheap computation. This can be used as a web service.
What excited me the most is the possibility of using this technique, with a few modifications, as a data augmentation pipeline. I got this idea while watching Sayak Paul's interview on Machine Learning Street Talk. In my opinion, an object in an image, viewed from different camera angles, is a more natural way of generating a larger dataset. This can be useful for object detection tasks on smaller datasets.
Thank you for your time. For constructive feedback on summarizing this paper, reach out to me on Twitter, @ayushthakur0.