Paper Summary: One Shot 3D Photography

This report explores a novel 3D photography method that works from a single 2D image. Made by Ayush Thakur using Weights & Biases.

In this report, we will look at the key components of One Shot 3D Photography by Johannes Kopf et al. Using the linked Colab notebook, we will generate stunning 3D photographs and look at the intermediate results along the way.

Reproduce results in this colab $\rightarrow$

Note that the official GitHub repository for this paper covers just one of the four stages described in the paper. The authors suggest using the code base of 3D Image Inpainting to generate 3D images from the result produced by the first component.


Introduction

3D photography is a new way to bring captured moments, stored as 2D images, back to life. The authors refer to a 3D photo as one that displays parallax induced by moving the viewpoint. Such viewpoints can be rendered on "flat" mobile or desktop screens. Image sharing and viewing apps like Facebook, Instagram, etc., can offer 3D photos as an additional feature. Thus, the authors of this paper have come up with a novel 3D photography method that is optimized for mobile.

Having said that, creating and displaying 3D photos poses challenges:

The proposed system in this paper provides a practical approach to 3D photography that addresses the above-mentioned challenges. Some of the salient features of this method are:

Paper | Project Website

Here is a short video from the authors showcasing their stunning results.

video_link

Overview of the Proposed Method

image.png

-> Figure 1: The proposed four-stage method for 3D photography <-

The proposed method requires only a single image to generate a 3D photo. The input image can be captured with a mobile device, but is not limited to it (any image will work). The method involves four stages of processing and runs end-to-end on the mobile capture device itself. The processing times on an iPhone 11 Pro are shown in figure 1.

Depth Estimation

The first step is to estimate a dense depth map from the input image. Conventional methods achieve high-quality results but have a large memory footprint, which is not ideal for mobile use. The authors propose a new architecture, called Tiefenrausch, that is optimized to consume considerably fewer resources in terms of inference latency, peak memory consumption, and model size, while still performing competitively with the state of the art.

image.png

-> Figure 2: Depth estimation network schematic <-

The depth estimation model that achieves the mentioned optimizations is a well-known U-Net architecture with down-/up-sampling blocks and skip connections. The improvements, however, were achieved by combining three techniques:

The media panel below shows the result of the proposed depth estimation network.
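To make the backbone description more concrete, below is a minimal, hypothetical PyTorch sketch of a U-Net-style encoder-decoder with skip connections that outputs a single-channel depth map. It only illustrates the down-/up-sampling block structure mentioned above; the channel widths and block design are assumptions and not the actual searched Tiefenrausch architecture.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # Simple conv -> ReLU block; the paper's searched blocks are more elaborate.
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True))

class TinyDepthUNet(nn.Module):
    """Minimal U-Net-style encoder-decoder with skip connections (illustrative only)."""
    def __init__(self):
        super().__init__()
        self.enc1 = conv_block(3, 32)
        self.enc2 = conv_block(32, 64)
        self.down = nn.MaxPool2d(2)
        self.bottleneck = conv_block(64, 128)
        self.up2 = nn.ConvTranspose2d(128, 64, 2, stride=2)
        self.dec2 = conv_block(128, 64)   # 64 (skip) + 64 (upsampled) channels in
        self.up1 = nn.ConvTranspose2d(64, 32, 2, stride=2)
        self.dec1 = conv_block(64, 32)    # 32 (skip) + 32 (upsampled) channels in
        self.head = nn.Conv2d(32, 1, 1)   # single-channel dense depth/disparity map

    def forward(self, x):
        s1 = self.enc1(x)
        s2 = self.enc2(self.down(s1))
        b = self.bottleneck(self.down(s2))
        d2 = self.dec2(torch.cat([self.up2(b), s2], dim=1))
        d1 = self.dec1(torch.cat([self.up1(d2), s1], dim=1))
        return self.head(d1)

depth = TinyDepthUNet()(torch.randn(1, 3, 256, 384))  # -> (1, 1, 256, 384) depth map
```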


Performance of Tiefenrausch

image.png

-> Figure 4: Quantitative evaluation of the proposed depth estimation model Tiefenrausch against several state-of-the-art baseline methods. <-

The authors have evaluated four versions of their depth model against several SOTA baseline methods:

Lifting to Layered Depth Image

3D photography requires a geometric representation of the scene. The authors use the Layered Depth Image (LDI) representation. It consists of a regular rectangular lattice with integer coordinates, just like a normal image, but every position can hold zero, one, or more pixels. This representation is preferred because LDIs are sparse (they only contain depth and color values), have an image-like local topology, support a high level of detail, and can be easily converted to textured meshes.
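The paper does not publish a reference implementation of its LDI container, but as a minimal, hypothetical sketch of the idea (a regular pixel lattice where every position holds a variable-length list of depth/color samples, each linked to its neighbors), it could look like this:

```python
from dataclasses import dataclass, field

@dataclass
class LDISample:
    depth: float   # depth/disparity value of this sample
    color: tuple   # (r, g, b)
    # Links to neighboring samples; None means disconnected (e.g. across a depth edge).
    neighbors: dict = field(
        default_factory=lambda: {"left": None, "right": None, "up": None, "down": None}
    )

class LayeredDepthImage:
    """Regular (H x W) lattice; each position holds zero, one, or more LDISamples."""
    def __init__(self, height, width):
        self.height, self.width = height, width
        self.samples = [[[] for _ in range(width)] for _ in range(height)]

    def add_sample(self, y, x, depth, color):
        sample = LDISample(depth, color)
        self.samples[y][x].append(sample)
        return sample

# Lifting a dense depth map + image into a single-layer LDI is then a simple loop:
# for every (y, x): ldi.add_sample(y, x, depth[y, x], image[y, x])
```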

After the depth map is obtained, the input image is lifted to the mentioned LDI representation. Before doing so, however, the dense depth map goes through a pre-processing step. The raw depth map is over-smoothed due to the regularization used during training, and this over-smoothness "washes out" depth discontinuities. The pre-processing step de-clutters depth discontinuities and sharpens them. As shown in figure 5, the raw depth image is first filtered using a weighted median filter with a 5x5 kernel, and then cleaned by performing a connected component analysis.

one-shot.png

-> Figure 5: Pre-processing step for the dense depth map <-
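As a rough, hedged illustration of this pre-processing step, the sketch below approximates it with OpenCV: a plain 5x5 median filter stands in for the paper's weighted median filter, and small connected components are removed with a crude fill. The edge threshold, minimum region size, and fill strategy are assumptions, not the paper's exact procedure.

```python
import cv2
import numpy as np

def preprocess_depth(depth, min_region=100):
    """Approximate pre-processing: median filtering + connected component cleanup."""
    # Stand-in for the paper's 5x5 *weighted* median filter: a plain 5x5 median filter.
    filtered = cv2.medianBlur(depth.astype(np.float32), 5)

    # Split the map at strong depth discontinuities (threshold depends on depth scale),
    # then label the resulting connected components.
    edges = (np.abs(cv2.Laplacian(filtered, cv2.CV_32F)) > 0.1).astype(np.uint8)
    num_labels, labels = cv2.connectedComponents(1 - edges)

    # Remove tiny spurious components by overwriting them with a fallback value.
    cleaned = filtered.copy()
    for label in range(1, num_labels):
        mask = labels == label
        if mask.sum() < min_region:
            cleaned[mask] = np.median(filtered[~mask])  # crude fill, illustrative only
    return cleaned
```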

After pre-processing of the depth map, new geometry in occluded parts of the scene is "hallucinated". First, the depth image is lifted onto an LDI to represent multiple layers of the scene. Initially, this LDI has a single layer everywhere, and all pixels are fully connected to their neighbors, except across discontinuities where the disparity difference exceeds a threshold.
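A minimal sketch of that connectivity rule, assuming a NumPy disparity map and a hypothetical threshold value, could look like this:

```python
import numpy as np

def horizontal_connectivity(disparity, threshold=0.05):
    """True where a pixel stays connected to its right neighbor, False across a discontinuity."""
    # Disparity difference between each pixel and the one to its right.
    diff = np.abs(disparity[:, 1:] - disparity[:, :-1])
    return diff <= threshold  # shape (H, W-1); an analogous test handles vertical neighbors

# Example: pixels on either side of a strong depth edge become disconnected,
# which is where the LDI later receives extra (hallucinated) layers.
disp = np.array([[0.10, 0.11, 0.55, 0.56]])
print(horizontal_connectivity(disp))  # [[ True False  True]]
```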

LDI Inpainting

At this point, we have an LDI with multiple layers around the depth discontinuities, but it still lacks color values in the parallax (occluded) regions. These need to be inpainted with plausible colors so that viewing the 3D photo appears seamless and realistic.

A naive approach would be to inpaint in screen space; however, filling each view at runtime is slow. The better approach is to inpaint on the LDI structure directly, so the inpainting only needs to be performed once. However, it is not easy to process an LDI with a neural network due to its irregular connectivity structure. The solution uses the insight that the LDI is locally structured like a regular image. Thus, a convolutional neural network architecture can be trained entirely in 2D, and the pretrained weights can then be used for LDI inpainting without ever training on LDIs.

Model architecture for LDI inpainting

image.png

-> Figure 6: The proposed Farbrausch architecture for LDI inpainting. <-

The authors propose a new architecture called Farbrausch that enables high-quality inpainting of parallax regions on the LDI and is optimized for mobile devices. They started with a traditional 2D Partial Convolution based U-Net with 5 stages of downsampling. This network is then converted to take in the LDI representation by replacing every PConv layer with an LDIPConv layer, which accepts an LDI and a mask. Check out Introduction to image inpainting with deep learning to learn more about Partial Convolution.
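For a concrete picture of what a partial convolution does (re-weighting the convolution by the fraction of valid pixels under the kernel and updating the mask), here is a minimal, hypothetical 2D PConv layer in PyTorch. The actual LDIPConv in Farbrausch additionally handles the irregular LDI connectivity, which this sketch does not attempt to reproduce.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PartialConv2d(nn.Module):
    """Minimal 2D partial convolution: only valid (mask == 1) pixels contribute."""
    def __init__(self, in_ch, out_ch, kernel_size=3, padding=1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, padding=padding, bias=False)
        # Fixed all-ones kernel used to count valid pixels under each window.
        self.register_buffer("ones", torch.ones(1, 1, kernel_size, kernel_size))
        self.padding = padding

    def forward(self, x, mask):
        # Convolve the masked input, then re-normalize by the number of valid pixels.
        out = self.conv(x * mask)
        valid = F.conv2d(mask, self.ones, padding=self.padding)
        out = out * (self.ones.numel() / valid.clamp(min=1))
        # A window becomes valid as soon as it covers at least one valid pixel.
        new_mask = (valid > 0).float()
        return out * new_mask, new_mask

x = torch.randn(1, 3, 64, 64)
mask = torch.ones(1, 1, 64, 64)
mask[:, :, 20:40, 20:40] = 0              # hole to be inpainted
out, new_mask = PartialConv2d(3, 16)(x, mask)  # hole shrinks by one kernel radius per layer
```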

Neural architecture search using the Chameleon method is used here as well, to identify the best set of hyperparameters encoding the number of output channels for each stage of the encoder. The FLOP count is traded off against the Partial Conv inpainting loss on a validation set.
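The search objective is not spelled out in code in the paper, but purely as an illustration of trading FLOP count against validation loss, a hypothetical scoring function for candidate channel configurations might look like this; the candidates, losses, FLOP counts, and trade-off weight below are all made-up placeholders, not results from the paper.

```python
def candidate_score(val_loss, flops, flop_weight=1e-10):
    """Hypothetical scalarized objective: lower is better.

    val_loss    -- inpainting loss of the candidate on a validation set
    flops       -- estimated FLOP count of the candidate network
    flop_weight -- trade-off coefficient (illustrative value only)
    """
    return val_loss + flop_weight * flops

# Pick the best among hypothetical candidates: (channels per encoder stage, loss, FLOPs).
candidates = [
    ((32, 64, 128, 256, 512), 0.081, 4.1e9),
    ((16, 32, 64, 128, 256),  0.089, 1.2e9),
    ((24, 48, 96, 192, 384),  0.084, 2.4e9),
]
best = min(candidates, key=lambda c: candidate_score(c[1], c[2]))
print(best[0])  # channel configuration with the best loss/FLOPs trade-off
```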

Conversion to Mesh Representation

We started with a single image, estimated its depth, used the depth to build the LDI representation, and trained a model to inpaint the parts occluded by the parallax introduced when the viewpoint changes. This multi-layered, inpainted LDI is converted into a textured mesh, which is the final representation. This is done in two parts:
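Whatever the exact split into sub-steps, the core operation is turning per-pixel depth samples into vertices and triangles. Below is a minimal, hypothetical NumPy sketch for a single-layer depth map; the pinhole focal length and the simple quad triangulation are assumptions, and the paper's version additionally has to respect multiple LDI layers and the disconnections across depth edges.

```python
import numpy as np

def depth_to_mesh(depth, fx=500.0, fy=500.0):
    """Back-project a dense depth map into vertices and connect them into triangles.

    Simplified single-layer version: real LDI meshing must also skip triangles
    that would span a disconnected depth discontinuity.
    """
    h, w = depth.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Pinhole back-projection with an assumed (hypothetical) focal length.
    X = (xs - w / 2) * depth / fx
    Y = (ys - h / 2) * depth / fy
    vertices = np.stack([X, Y, depth], axis=-1).reshape(-1, 3)

    # Two triangles per 2x2 pixel quad.
    idx = np.arange(h * w).reshape(h, w)
    quads = np.stack(
        [idx[:-1, :-1], idx[:-1, 1:], idx[1:, :-1], idx[1:, 1:]], axis=-1
    ).reshape(-1, 4)
    triangles = np.concatenate([quads[:, [0, 1, 2]], quads[:, [1, 3, 2]]], axis=0)
    return vertices, triangles

verts, tris = depth_to_mesh(np.ones((4, 6), dtype=np.float32))
print(verts.shape, tris.shape)  # (24, 3) (30, 3)
```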


Results
