# [Overview] X-Fields: Implicit Neural View-, Light- and Time-Image Interpolation

This report briefly examines this exciting paper, proposing a novel method to seamlessly interpolate time, light, and view of a 2D image using X-Field.
Ayush Thakur


The paper proposes a method to seamlessly interpolate the time, light, and view of a 2D image from sparse data, called an X-Field. The X-Field is represented by training a neural network to map time, light, or view coordinates to 2D images.

## Introduction

New sensors capture images of a scene at different points in time (video), from different angles (light field), or under varying illumination (reflectance field). One can use this diverse information to improve the experience of Virtual Reality (VR). With this information, new views, illumination conditions, and so on can be interpolated to generate a seamless transition from one scene to another.

However, seamless interpolation requires dense sampling, leading to excessive storage, capture, and processing requirements. Sparse sampling is an alternative, but it in turn requires accurate interpolation across time, light, and view.

An X-Field is a set of 2D images taken across different views, times, or illumination conditions, i.e., a video, light field, reflectance field, or a combination thereof. The authors propose a neural network based architecture that can represent these high-dimensional X-Fields.

The crux of the paper can be understood using Figure 1 shown above: from sparse image observations (time in this case) with varying conditions and coordinates, a neural network (mapping) is trained such that, when provided a space, time, or light coordinate as input, it generates the observed sample image as output. For a non-observed coordinate, the output is faithfully interpolated (shown as a GIF).

Check out the official YouTube video by the authors.

## Overview of the Proposed Method

The proposed approach is motivated by two key observations:

• Deep representations help interpolation: Representing information using neural networks leads to better interpolation.

• This is true as long as every unit is differentiable: The above observation holds for any architecture as long as all units are differentiable.

The X-field is represented as a non-linear function:

$L_{out}^{\theta}(x) \in \chi \to \mathbb{R}^{3 \times n_p}$

with trainable parameters $\theta$ to map from an $n_d$-dimensional X-Field coordinate $x \in \chi \subset \mathbb{R}^{n_d}$ to 2D RGB images with $n_p$ pixels. The dimension of the X-Field ($\chi$) depends on the capture modality, e.g., 1D for video interpolation.

Consider the X-Field to be a high-dimensional continuous space. We have a finite, rather sparse set of input images. These sparsely observed X-Field coordinates can be represented as $Y \subset \chi$, where an image $L_{in}(y)$ was captured at each known coordinate $y \in Y$. $|Y|$ is small, e.g., $3 \times 3$ or $2 \times 3$.

For example, given an array of $3 \times 5$ light field images, the input is a 2D coordinate $(s,t)$ with $s \in \{0,1,2\}$ and $t \in \{0,1,2,3,4\}$. During test time, we can give any continuous value between 0 and 2 for $s$ and between 0 and 4 for $t$. The learned neural network architecture will faithfully interpolate in the given range.
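The coordinate set in this example can be written down directly. The snippet below is a small NumPy sketch, not code from the paper: the names `Y`, `x_query`, and `in_interpolation_range` are hypothetical, and the range check simply assumes that valid test coordinates lie inside the bounding box of the observed ones.

```python
import numpy as np

# Observed coordinates Y for a 3 x 5 light field: every integer (s, t) pair.
s_values = np.arange(3)          # s in {0, 1, 2}
t_values = np.arange(5)          # t in {0, 1, 2, 3, 4}
Y = np.array([(s, t) for s in s_values for t in t_values], dtype=float)


def in_interpolation_range(x, Y):
    """Return True if x lies inside the bounding box of the observed coordinates.

    Hypothetical helper: the bounding-box test is an assumption made for
    illustration, matching the "0 to 2" and "0 to 4" ranges in the text.
    """
    x = np.asarray(x, dtype=float)
    lo, hi = Y.min(axis=0), Y.max(axis=0)
    return bool(np.all((x >= lo) & (x <= hi)))


# A continuous, unobserved coordinate that a trained model could be queried at.
x_query = np.array([0.5, 2.25])
print(Y.shape, in_interpolation_range(x_query, Y))
```

Only the 15 rows of `Y` have a captured image; a coordinate such as `x_query` is what the trained network is asked to fill in.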

The images shown below are the sparse input ($Y$), which belongs to the X-Field ($\chi$). The capture modality, in this instance, is suited for light (illumination) interpolation. Observe the shadow of the white angel.


To summarize, an architecture $L_{out}$ is trained to map vectors $y$ to captured images $L_{in}(y)$ in the hope of also getting plausible images $L_{out}(x)$ for unobserved vectors $x$. This is in line with the first key observation mentioned above.
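Written as an optimization problem, this summary could look as follows. This is a sketch that assumes the L1 loss reported in the Training section is the training objective and that the error is summed over the observed coordinates only:

$\theta^{*} = \arg\min_{\theta} \sum_{y \in Y} \left| L_{out}^{\theta}(y) - L_{in}(y) \right|_1$

The sum runs over $Y$ alone because captured images exist only at those coordinates.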

During test time, interpolation is expected but is bounded by $Y$. Training never evaluates an X-Field coordinate $x$ that is not in $Y$, as we would not know what the image $L_{in}(x)$ at that coordinate would be.

## Architecture Design

The architecture design is the novel bit of the paper. $L_{out}$ is modeled using three main ideas.

• Appearance is a combination of appearance in observed images ($L_{in}(y)$).

• Appearance is assumed to be a product of _shading_ and _albedo_.

• The unobserved shading and albedo at $x$ are considered warped versions of the observed shading and albedo at $y$.

These assumptions need not hold, but in that case, the neural network will have a more challenging time capturing the relationship between coordinates and images.

The proposed architecture is implemented in four steps:

• Decoupling shading and albedo: Shading refers to the depiction of depth perception in 3D models (within the field of 3D computer graphics) or illustrations (within the area of 2D computer graphics) by varying the level of darkness. Albedo is the proportion of incident light that is reflected away from a surface. In other words, it is the overall brightness of an object.

• Interpolating images as a weighted combination of warped images.

• Representing "flow" using a neural network.

• Resolving inconsistencies.

We will go through each of them one by one.

### Decouple Shading and Albedo (De-lighting)

De-lighting splits appearance into a combination of shading, which moves in one way in response to changes in X-Field coordinates, and albedo, which is attached to the surface and moves with the geometry. Every observed image is decomposed as:

$L_{in}(y) = E(y) \odot A(y)$

where $E$ is the shading image, $A$ is the albedo image, and $\odot$ is the pointwise product. During test time, shading and albedo are interpolated independently and recombined into new radiance at an unobserved location $x$ by multiplication.
The output is mathematically given as:

$L_{out}(x) = \mathrm{int}(A(L_{in}(y)), y \to x) \odot \mathrm{int}(E(L_{in}(y)), y \to x)$

where $\mathrm{int}$ is an operator that will be described shortly.

### Interpolation and Warping

Warping deforms an observed image into an unobserved image, conditioned on the observed and the unobserved X-Field coordinates:

$\mathrm{warp}(I, y \to x) \in I \times \chi \times Y \to I$

A spatial transformer (STN) with bilinear filtering is used to compute the pixels in one image by reading them from another image according to a given "flow" map.

Interpolation warps all observed images and merges the individual results. Both warp and merge are performed identically for shading ($E$) and albedo ($A$), so $I$ stands for either of them below. The interpolation operator $\mathrm{int}$ (also used above) is given by:

$\mathrm{int}(I, y \to x) = \sum_{y \in Y} (\mathrm{cons}(y \to x) \odot \mathrm{warp}(I(y), y \to x))$

The critical question is: from which position "q" should a pixel at position "p" read when the image at $x$ is reconstructed from the one at $y$?

The first half of the answer is to use Jacobians of the mapping from X-Field coordinates to pixel positions. The Jacobian captures, for example, how a pixel moves in a certain view and light if time is changed. Mathematically, for a given pixel "p", it is a partial derivative given as:

$\mathrm{flow}_{\partial}(x)[p] = \frac{\partial p(x)}{\partial x} \in \chi \to \mathbb{R}^{2 \times n_d}$

Here $[p]$ is indexing into the discrete pixel array. The above formula specifies how pixels move for an infinitesimal change of X-Field coordinates. The Jacobian matrix holds all partial derivatives of the two pixel coordinates with respect to all $n_d$ X-Field coordinates. However, this is not the finite value "q". To find "q", the change in X-Field coordinate $y \to x$ is projected to 2D pixel motion using finite differences:

$\mathrm{flow}_{\Delta}(y \to x)[p] = p + \Delta(y \to x) \, \mathrm{flow}_{\partial}(x)[p] = q$

This equation gives a finite pixel motion for a finite change of X-Field coordinates.
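The finite-difference step and the bilinear read it feeds can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `finite_flow` and `bilinear_warp` are hypothetical, pixels are stored as (row, column), the Jacobian is laid out as an array of shape `(H, W, 2, n_d)`, $\Delta(y \to x)$ is taken to be $y - x$, and reads outside the image are clamped to the border. The paper uses a spatial transformer for the sampling step; plain NumPy indexing stands in for it here.

```python
import numpy as np


def finite_flow(jacobian, x, y):
    """Turn a per-pixel Jacobian into read positions q = p + J (y - x).

    jacobian: array (H, W, 2, n_d), derivative of pixel position w.r.t. x.
    x, y:     X-Field coordinates of length n_d (unobserved and observed).
    Returns an array (H, W, 2) of (row, col) positions to read from.
    The sign convention delta = y - x is an assumption of this sketch.
    """
    H, W = jacobian.shape[:2]
    rows, cols = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    p = np.stack([rows, cols], axis=-1).astype(float)
    delta = np.asarray(y, dtype=float) - np.asarray(x, dtype=float)
    return p + jacobian @ delta  # (H, W, 2, n_d) @ (n_d,) -> (H, W, 2)


def bilinear_warp(image, q):
    """Read image (H, W, C) at fractional positions q (H, W, 2), bilinearly."""
    H, W = image.shape[:2]
    r = np.clip(q[..., 0], 0, H - 1)   # clamp reads to the image border
    c = np.clip(q[..., 1], 0, W - 1)
    r0, c0 = np.floor(r).astype(int), np.floor(c).astype(int)
    r1, c1 = np.minimum(r0 + 1, H - 1), np.minimum(c0 + 1, W - 1)
    fr, fc = (r - r0)[..., None], (c - c0)[..., None]
    top = image[r0, c0] * (1 - fc) + image[r0, c1] * fc
    bottom = image[r1, c0] * (1 - fc) + image[r1, c1] * fc
    return top * (1 - fr) + bottom * fr


# Usage: a 1D X-Field (n_d = 1) whose Jacobian moves every pixel one column
# per unit change of the coordinate.
image = np.arange(4 * 5 * 3, dtype=float).reshape(4, 5, 3)
jacobian = np.zeros((4, 5, 2, 1))
jacobian[..., 1, 0] = 1.0
q = finite_flow(jacobian, x=[0.0], y=[1.0])
warped = bilinear_warp(image, q)
```

With this constant Jacobian, every pixel of `warped` reads from its right-hand neighbour in `image`, and the last column repeats because of the border clamp.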
### Flow

Input to the flow computation is the X-Field coordinate $x$, and the output is the Jacobian. This is implemented using a Convolutional Neural Network (CNN). The architecture starts with a fully connected layer that takes in the coordinate $x$, whose output is then reshaped into a 2D image with 128 channels. A CoordConv layer is added at this stage. This is followed by multiple upsampling steps to reach the output resolution while the number of channels is reduced to $n_d$ output channels.

### Consistency

To combine all observed images warped to the unobserved X-Field coordinate, each image pixel is weighted by its flow consistency. For a pixel "q" to contribute to the image at "p", the flow at "q" has to map back to "p". The consistency of one pixel "p" when warped to coordinate $x$ from $y$ is the partition of unity of a weight function:

$\mathrm{cons}(y \to x)[p] = w(y \to x)[p] \left( \sum_{y' \in Y} w(y' \to x)[p] \right)^{-1}$

The weights $w$ are smoothly decreasing functions of the 1-norm of the difference between the pixel position "p" and the backward flow at the position "q" where "p" was warped to:

$w(y \to x)[p] = \exp\left( -\sigma \left| p - \mathrm{flow}_{\Delta}(x \to y)[q] \right|_1 \right)$

Here, $\sigma = 10$ is a bandwidth parameter.

## Training

### Check out the Colab Notebook to reproduce results $\rightarrow$

The official GitHub repo is available here. I have instrumented it with Weights and Biases, and you can find that repo here.

The linked Colab Notebook will let you play with the available datasets. Pick a dataset and select the appropriate Python command to train that scene's model. It might take some time, depending on the dataset.

The average L1 loss for the different models is shown in the media panel below.