In this report, we will look at the key components of One Shot 3D Photography by Johannes Kopf et al. Using the linked Colab notebook, we will generate 3D photographs and examine the results produced along the way.
Note that the official GitHub repository for this paper covers only one of the four components of the method. The authors suggest using the 3D Image Inpainting code base to generate 3D images from the output of the first component.
3D photography is a new way to bring captured moments, stored as 2D images, back to life. The authors refer to a 3D photo as one that displays parallax induced by moving the viewpoint. Such viewpoints can be rendered on "flat" mobile or desktop screens. Image-sharing and viewing apps like Facebook and Instagram could offer 3D photos as an additional feature. The authors of this paper have therefore come up with a novel 3D photography method that is optimized for mobile devices.
Having said that, creating and displaying 3D photos poses challenges:
The proposed system in this paper provides a practical approach to 3D photography that addresses the above-mentioned challenges. Some of the salient features of this method are:
Effort: The method requires only a single image to produce a 3D photo, hence the name "One Shot".
Accessibility: The method works on any mobile device, even devices with a single-lens camera.
Speed: The trained models are optimized for mobile devices using quantization-aware training. All processing steps take only a few seconds on a mobile device, with limited resource requirements.
Interactivity: Interaction with a 3D photo happens in real time, and the resulting 3D image is easy to share.
Here is a short video from the authors showcasing their stunning results.
-> Figure 1: The proposed 4 stage method for 3D image photography <-
The proposed method requires a single image to generate a 3D image. The input image can be captured with a mobile device, but any image will work. The method involves four stages of processing and runs end-to-end on the mobile capture device itself. The processing time on an iPhone 11 Pro is shown in Figure 1.
The first step is to estimate a dense depth map from the input image. Conventional methods achieve high-quality results but have a large memory footprint, which is not ideal for mobile use. The authors propose a new architecture, called Tiefenrausch, that is optimized to consume considerably fewer resources in terms of inference latency, peak memory consumption, and model size, while still performing competitively with state-of-the-art methods.
-> Figure 2: Depth estimation network schematic <-
The depth estimation model that achieves these optimizations is a well-known U-Net architecture with down-/up-sampling blocks and skip connections. The mentioned improvements were achieved by combining three techniques:
Efficient block structure: This block, designed for up-/down-sampling, is optimized for fast inference on mobile devices. It contains a sequence of a point-wise (1x1) convolution, a KxK depthwise convolution (where K is the kernel size), and another point-wise (1x1) convolution. The block is shown in Figure 3.
-> Figure 3: Efficient block structure. $e$ is the multiplicative factor to expand channels. $s_u$ and $s_d$ refer to up and down sampling scale factors respectively. <-
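The point-wise / depthwise / point-wise sequence can be sketched as a toy forward pass in NumPy. This is an illustration of the block's structure, not the authors' implementation; the channel counts, weight scaling, and ReLU placement are assumptions:

```python
import numpy as np

def pointwise_conv(x, w):
    # x: (C_in, H, W), w: (C_out, C_in) -- a 1x1 convolution mixes channels only
    return np.einsum('oc,chw->ohw', w, x)

def depthwise_conv(x, w):
    # x: (C, H, W), w: (C, K, K) -- each channel is convolved independently
    C, H, W = x.shape
    K = w.shape[1]
    p = K // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    out = np.zeros((C, H, W))
    for i in range(K):
        for j in range(K):
            out += w[:, i:i + 1, j:j + 1] * xp[:, i:i + H, j:j + W]
    return out

def efficient_block(x, w_expand, w_dw, w_project):
    # expand channels (1x1) -> depthwise KxK -> project back down (1x1)
    h = np.maximum(pointwise_conv(x, w_expand), 0)   # ReLU (an assumption)
    h = np.maximum(depthwise_conv(h, w_dw), 0)
    return pointwise_conv(h, w_project)

# toy example: 4 input channels expanded to 16 (e = 4), projected to 8
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 6, 6))
y = efficient_block(x,
                    rng.standard_normal((16, 4)) * 0.1,
                    rng.standard_normal((16, 3, 3)) * 0.1,
                    rng.standard_normal((8, 16)) * 0.1)
```

The appeal of this factorization is that a KxK depthwise convolution costs only C·K·K multiplies per pixel instead of C_in·C_out·K·K for a full convolution, which is what makes the block cheap on mobile hardware.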
Neural architecture search: The authors used the Chameleon method to find an optimal design given an architecture search space. The method iteratively samples points from the search space to train an accuracy predictor. This accuracy predictor is used to accelerate a genetic search to find a model that maximizes predicted accuracy while satisfying specified resource constraints. The resulting network design achieves a more favorable trade-off between accuracy, latency, and model size. The total time to search was approximately three days using 800 Tesla V100 GPUs.
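The predictor-guided genetic search can be sketched in pure Python. This is a generic, heavily simplified stand-in (the Chameleon paper's actual operators, search space, and predictor are not described here); the `predict_accuracy` and `flop_cost` functions below are toy placeholders:

```python
import random

def genetic_search(predict_accuracy, flop_cost, flop_budget,
                   pop_size=32, generations=20, seed=0):
    """Toy predictor-guided genetic search over per-stage channel widths.

    predict_accuracy: stand-in for the trained accuracy predictor.
    flop_cost:        resource model; designs over budget are rejected.
    """
    rng = random.Random(seed)

    def sample():  # a "design" is a tuple of channel widths, one per stage
        return tuple(rng.choice([16, 32, 64, 128]) for _ in range(5))

    def mutate(d):
        d = list(d)
        d[rng.randrange(len(d))] = rng.choice([16, 32, 64, 128])
        return tuple(d)

    pop = [sample() for _ in range(pop_size)]
    for _ in range(generations):
        feasible = [d for d in pop if flop_cost(d) <= flop_budget]
        if not feasible:
            pop = [sample() for _ in range(pop_size)]
            continue
        # keep the best half by predicted accuracy, refill with mutations
        feasible.sort(key=predict_accuracy, reverse=True)
        parents = feasible[:max(1, len(feasible) // 2)]
        pop = parents + [mutate(rng.choice(parents))
                         for _ in range(pop_size - len(parents))]
    return max((d for d in pop if flop_cost(d) <= flop_budget),
               key=predict_accuracy)

# toy stand-ins: "accuracy" grows with width, FLOPs grow quadratically
acc = lambda d: sum(d)
flops = lambda d: sum(c * c for c in d)
best = genetic_search(acc, flops, flop_budget=46080)
```

The key idea the sketch preserves is that the expensive step (training candidate networks) is replaced by a cheap learned predictor, so the genetic loop can evaluate thousands of designs quickly.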
8-bit quantization: The result of neural architecture search is an optimized model with a reduced FLOP count and fewer parameters, which is further optimized using Quantization-Aware Training (QAT). The resulting low-precision (8-bit integer) model achieves a 4x model-size reduction as well as reduced inference latency, with only a minor drop in accuracy. Learn more about QAT in A Tale of Model Quantization by Sayak Paul.
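The core trick in QAT is to simulate 8-bit rounding in the forward pass during training, so the network learns weights that survive quantization. A minimal sketch of this "fake quantization" (quantize-then-dequantize) step, assuming simple per-tensor affine min/max quantization, which is a simplification of what real QAT toolchains do:

```python
import numpy as np

def fake_quantize(x, num_bits=8):
    """Simulated quantization used during QAT: round values to the
    num_bits integer grid, then map them back to floats. Weights stay
    in full precision; only the forward pass sees quantized values."""
    qmin, qmax = 0, 2 ** num_bits - 1
    scale = (x.max() - x.min()) / (qmax - qmin)
    if scale == 0:
        return x.copy()
    zero_point = np.clip(np.round(qmin - x.min() / scale), qmin, qmax)
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax)
    return (q - zero_point) * scale

w = np.linspace(-1.0, 1.0, 11)   # toy weight tensor
w_q = fake_quantize(w)           # what the forward pass would see
```

After training this way, the weights can be stored as actual int8 values (hence the roughly 4x size reduction versus float32) with little extra accuracy loss, because the network has already adapted to the rounding error.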
The media panel below shows the result of the proposed depth estimation network.
-> Figure 4: Quantitative evaluation of the proposed depth estimation model Tiefenrausch against several state-of-the-art baseline methods. <-
The authors evaluated four versions of their depth model against several SOTA baseline methods:
3D photography requires a geometric representation of the scene. The authors use the Layered Depth Image (LDI) representation. It consists of a regular rectangular lattice with integer coordinates, just like a normal image, but every position can hold zero, one, or more pixels. This representation is preferred because it is sparse (it contains only depth and color values), locally image-like in topology, supports a high level of detail, and can easily be converted to textured meshes.
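A minimal sketch of the LDI data structure in plain Python may make this concrete. The class and field names are hypothetical, not from the paper's code:

```python
from dataclasses import dataclass, field

@dataclass
class LDIPixel:
    depth: float
    color: tuple            # (r, g, b)
    # links to neighboring LDI pixels (absent where disconnected,
    # e.g. across a depth discontinuity)
    neighbors: dict = field(default_factory=dict)

class LDI:
    """Minimal Layered Depth Image: a regular lattice where each
    (x, y) position holds zero, one, or more depth/color samples."""
    def __init__(self, width, height):
        self.width, self.height = width, height
        self.layers = [[[] for _ in range(width)] for _ in range(height)]

    def add(self, x, y, depth, color):
        p = LDIPixel(depth, color)
        self.layers[y][x].append(p)
        return p

    def pixels_at(self, x, y):
        return self.layers[y][x]

ldi = LDI(4, 4)
front = ldi.add(1, 1, depth=2.0, color=(255, 0, 0))   # visible surface
back = ldi.add(1, 1, depth=5.0, color=(0, 255, 0))    # occluded layer behind it
```

Note how position (1, 1) holds two samples at different depths, which is exactly what lets an LDI store content that becomes visible only when the viewpoint moves.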
After the depth map is obtained, the input image is lifted to the LDI representation. Before doing so, the dense depth map goes through a pre-processing step: the raw depth map is over-smoothed due to regularization during training, and this over-smoothness "washes out" depth discontinuities. The pre-processing step de-clutters and sharpens depth discontinuities. As shown in Figure 5, the raw depth image is first filtered using a weighted median filter with a 5x5 kernel, and then cleaned by performing a connected-component analysis.
-> Figure 5: Pre-processing step for the dense depth map <-
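The two pre-processing operations can be approximated in NumPy. This sketch uses a plain (unweighted) median filter where the paper uses a weighted one, and absorbs tiny components into the global median, a cruder cleanup than the authors describe:

```python
import numpy as np

def median_filter(depth, k=5):
    """Plain 5x5 median filter (the paper uses a *weighted* median;
    this unweighted version is a simplification)."""
    p = k // 2
    dp = np.pad(depth, p, mode='edge')
    H, W = depth.shape
    stack = np.stack([dp[i:i + H, j:j + W]
                      for i in range(k) for j in range(k)])
    return np.median(stack, axis=0)

def remove_small_components(disparity, threshold, min_size):
    """Flood-fill connected-component analysis: regions isolated by
    disparity jumps > threshold and smaller than min_size are merged
    into the surrounding depth (here crudely, via the global median)."""
    H, W = disparity.shape
    out = disparity.copy()
    seen = np.zeros((H, W), bool)
    for sy in range(H):
        for sx in range(W):
            if seen[sy, sx]:
                continue
            comp, stack = [], [(sy, sx)]
            seen[sy, sx] = True
            while stack:
                y, x = stack.pop()
                comp.append((y, x))
                for ny, nx in ((y-1, x), (y+1, x), (y, x-1), (y, x+1)):
                    if 0 <= ny < H and 0 <= nx < W and not seen[ny, nx] \
                       and abs(out[ny, nx] - out[y, x]) <= threshold:
                        seen[ny, nx] = True
                        stack.append((ny, nx))
            if len(comp) < min_size:
                for y, x in comp:
                    out[y, x] = np.median(out)
    return out

noisy = np.ones((6, 6)); noisy[3, 3] = 10.0     # a depth spike
smooth = median_filter(noisy)                   # spike removed

disp = np.ones((5, 5)); disp[0, 0] = 5.0        # tiny isolated region
clean = remove_small_components(disp, threshold=1.0, min_size=2)
```

The combined effect is the one described above: the median filter snaps blurry depth edges to crisp steps, and the component analysis discards speckles too small to be real surfaces.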
After pre-processing of the depth map, new geometry in occluded parts of the scene is "hallucinated". First, the depth image is lifted onto an LDI to represent multiple layers of the scene. Initially, this LDI has a single layer everywhere, and all pixels are fully connected to their neighbors, except across discontinuities where the disparity difference exceeds a threshold.
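Detecting where neighbor links must be cut is a simple thresholded difference on the disparity map. A sketch for horizontal neighbors (the threshold value is an arbitrary illustration, not the paper's):

```python
import numpy as np

def horizontal_cuts(disparity, tau):
    """Boolean mask of broken links between horizontal neighbors:
    True where |d(x) - d(x+1)| exceeds the disparity threshold tau."""
    return np.abs(np.diff(disparity, axis=1)) > tau

disp = np.array([[0.9, 0.9, 0.2, 0.2]])   # foreground | background
cuts = horizontal_cuts(disp, tau=0.3)     # link broken at the boundary
```

Everywhere such a cut occurs, the lifted LDI gets a foreground and a background layer, and it is exactly these newly exposed background pixels that the next stage must fill with color.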
At this point, we have an LDI with multiple layers around the depth discontinuities, but it still lacks color values in the parallax (occluded) regions. These must be inpainted with plausible colors so that viewing the 3D photo appears seamless and realistic.
A naive approach would be to inpaint in screen space; however, filling each view at runtime is slow. A better approach is to inpaint on the LDI structure, so the inpainting needs to be performed only once. However, it is not easy to process an LDI with a neural network due to its irregular connectivity structure. The solution uses the insight that the LDI is locally structured like a regular image. Thus, a convolutional architecture can be trained entirely in 2D, and the pretrained weights can then be used for LDI inpainting without any training on LDIs.
-> Figure 6: The proposed Farbrausch architecture for LDI inpainting. <-
The authors propose a new architecture called Farbrausch that enables high-quality inpainting of parallax regions on the LDI and is optimized for mobile devices. They started with a traditional 2D partial-convolution-based U-Net with 5 stages of downsampling. This network is then converted to take the LDI representation by replacing every PConv layer with an LDIPConv layer, which accepts an LDI and a mask. Check out An Introduction to Image Inpainting with Deep Learning to learn more about partial convolutions.
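For intuition, here is a single-channel partial convolution in NumPy, in the style of Liu et al.'s inpainting paper: the convolution only sees valid pixels and renormalizes by how many were valid under the window. This is the standard 2D PConv, not the authors' LDIPConv variant:

```python
import numpy as np

def partial_conv(img, mask, w, eps=1e-8):
    """Single-channel partial convolution: invalid pixels (mask == 0)
    are zeroed out, the response is renormalized by the fraction of
    valid pixels under the window, and the mask grows by one ring.
    Returns (output, updated_mask)."""
    K = w.shape[0]
    p = K // 2
    H, W = img.shape
    xp = np.pad(img * mask, p)
    mp = np.pad(mask, p)
    out = np.zeros((H, W))
    new_mask = np.zeros((H, W))
    for y in range(H):
        for x in range(W):
            win = xp[y:y + K, x:x + K]
            mwin = mp[y:y + K, x:x + K]
            valid = mwin.sum()
            if valid > 0:
                # renormalize: scale by (window size / valid count)
                out[y, x] = (win * w).sum() * (K * K) / (valid + eps)
                new_mask[y, x] = 1.0
    return out, new_mask

img = np.ones((5, 5))
mask = np.ones((5, 5)); mask[2, 2] = 0      # one-pixel hole to inpaint
w = np.full((3, 3), 1.0 / 9.0)              # box kernel for illustration
out, grown = partial_conv(img, mask, w)
```

With a box kernel, the hole is filled with the average of its valid neighbors and the mask at the hole becomes valid, which is how stacked PConv layers progressively shrink the unknown region.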
Neural architecture search using the Chameleon method is used here as well, to identify the best set of hyperparameters encoding the number of output channels for each stage of the encoder. The FLOP count is traded off against the partial-convolution inpainting loss on the validation set.
We started with a single image, estimated its depth, used the depth to build the LDI representation, and trained a model to inpaint the parts occluded due to the parallax introduced by changing viewpoints. This multi-layered, inpainted LDI is converted into a textured mesh, which is the final representation. This is done in two parts: