[Overview] X-Fields: Implicit Neural View-, Light- and Time-Image Interpolation

This report briefly examines this exciting paper, which proposes a novel method to seamlessly interpolate the time, light, and view of a scene using an X-Field.
Ayush Thakur


This report briefly examines this exciting paper, which proposes a novel method to seamlessly interpolate the time, light, and view of a scene from sparse 2D image data, the X-Field. The X-Field is represented by training a neural network to map view, time, or light coordinates to 2D images.

Paper | GitHub | Colab Notebook

Introduction

New sensors capture images of a scene at different points in time (video), from different angles (light field), or under varying illumination (reflectance field). This diverse information can be used to improve the Virtual Reality (VR) experience: new views, time instants, or illumination conditions can be interpolated to generate a seamless transition from one scene to another.

However, seamless interpolation requires dense sampling, which leads to excessive storage, capture, and processing requirements. Sparse sampling is an alternative, but it demands accurate interpolation across time, light, and view, which is far from trivial.

An X-Field is a set of 2D images taken across different views, times, or illumination conditions, i.e., a video, light field, reflectance field, or a combination thereof. The authors propose a neural-network-based architecture that can represent such high-dimensional X-Fields.

The crux of the paper can be understood from figure 1 shown above: from sparse image observations (varying in time in this case) with known capture coordinates, a neural network (the mapping) is trained such that, when provided a view, time, or light coordinate as input, it generates the observed sample image as output. For a non-observed coordinate, the output is faithfully interpolated (shown as a GIF).

Check out the official YouTube video by the authors.


Overview of the Proposed Method

The proposed approach is motivated by two key observations:

The X-Field is represented as a non-linear function:

$L_{out}^{\theta}(x) \in \chi \to \mathbb{R}^{3 \times n_p}$

with trainable parameters $\theta$, mapping an $n_d$-dimensional X-Field coordinate $x \in \chi \subset \mathbb{R}^{n_d}$ to a 2D RGB image with $n_p$ pixels. The dimension of the X-Field ($\chi$) depends on the capture modality, e.g., 1D for video interpolation.

Consider the X-Field to be a high-dimensional continuous space of which we only have a finite, rather sparse set of input images. These sparsely observed X-Field coordinates are denoted $Y \subset \chi$, where an image $L_{in}(y)$ was captured at each known coordinate $y$. $|Y|$ is small, e.g., a $3 \times 3$ or $2 \times 3$ grid.

For example, given an array of $3 \times 5$ light field images, the input is a 2D coordinate $(s,t)$ with $s \in \{0,1,2\}$ and $t \in \{0,1,2,3,4\}$. At test time, we can provide any continuous value between 0 and 2 for $s$ and between 0 and 4 for $t$, and the learned network will faithfully interpolate within that range.
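
To make the coordinate space concrete, here is a minimal NumPy sketch of the observed $3 \times 5$ light-field grid and a continuous query coordinate (the array layout and variable names are illustrative, not taken from the official code):

```python
import numpy as np

# Observed X-Field coordinates Y for a 3x5 light-field capture:
# every integer (s, t) pair on the grid is backed by a captured image.
s_vals, t_vals = np.arange(3), np.arange(5)
Y = np.stack(np.meshgrid(s_vals, t_vals, indexing="ij"), axis=-1).reshape(-1, 2)
print(Y.shape)  # (15, 2): 15 sparse observations, n_d = 2

# At test time the trained network can be queried at any continuous
# coordinate inside the observed range, e.g. between captured views.
x_query = np.array([1.25, 3.7])  # s in [0, 2], t in [0, 4]
```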

The images shown below are the sparse input ($Y$), which belongs to the X-Field ($\chi$). The capture modality in this instance is suited for light (illumination) interpolation. Observe the shadow of the white angel.

[Media panel: sparse input images captured under varying illumination]

To summarize, an architecture $L_{out}$ is trained to map vectors $y$ to captured images $L_{in}(y)$, in the hope of also getting plausible images $L_{out}(x)$ for unobserved vectors $x$. This is in line with the first key observation mentioned above.

At test time, interpolation is expected but is bounded by $Y$. Hence, training never evaluates any X-Field coordinate $x$ that is not in $Y$, as we would not know what the image $L_{in}(x)$ at that coordinate should look like.
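
A schematic training loop reflecting this constraint might look like the sketch below: only observed coordinates $y \in Y$ are ever fed to the network, and the prediction is compared to the captured image with an L1 loss. The stand-in network, grid size, and optimizer settings are assumptions for illustration, not the paper's architecture (which is described in the following sections).

```python
import torch
import torch.nn as nn

# Toy stand-in for the full L_out architecture described below.
x_field_net = nn.Sequential(nn.Linear(2, 3 * 64 * 64), nn.Unflatten(1, (3, 64, 64)))

# Observed coordinates Y and their captured images L_in(y): a toy 3x3 grid.
coords = torch.stack(torch.meshgrid(torch.arange(3.0), torch.arange(3.0),
                                    indexing="ij"), dim=-1).reshape(-1, 2)
images = torch.rand(len(coords), 3, 64, 64)       # placeholder captures

opt = torch.optim.Adam(x_field_net.parameters(), lr=1e-4)
for step in range(100):
    i = torch.randint(len(coords), (1,))          # sample one observed y
    pred = x_field_net(coords[i])                 # L_out(y)
    loss = (pred - images[i]).abs().mean()        # L1 loss against L_in(y)
    opt.zero_grad()
    loss.backward()
    opt.step()
```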

Architecture Design

The architecture design is the novel bit of the paper. $L_{out}$ is modeled using three main ideas: appearance is decomposed into shading and albedo (de-lighting), changes of appearance across X-Field coordinates are explained by flow-based warping, and the warped observations are merged using consistency weights.

These assumptions need not hold exactly, but in that case the neural network will have a more challenging time capturing the relationship between coordinates and images.

The proposed architecture is implemented in four steps:

- De-lighting: decoupling shading and albedo
- Interpolation and warping of the observed images
- Flow computation with a CNN
- Consistency-weighted merging

We will go through each of them one by one.


Decouple Shading and Albedo (De-lighting)

De-lighting splits appearance into a combination of shading, which moves in one way in response to changes of the X-Field coordinates, and albedo, which moves in another.

Every observed image is decomposed as:

$L_{in}(y) = E(y) \odot A(y)$

$E$ is the shading image, $A$ is the albedo image, and $\odot$ is the point-wise product.

During test time, both Shading and Albedo are interpolated independently and recombined into new radiance at an unobserved location $x$ by multiplication. The output is mathematically given as,

$L_{out}(x) = int(A(L_{in}(y)), y \to x) \odot int(E(L_{in}(y)), y \to x)$

$int$ is an operator that will be described shortly.
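
As a toy illustration of this decomposition (shapes and values are placeholders, not the paper's implementation), shading and albedo recombine through a point-wise product:

```python
import numpy as np

H, W = 64, 64

# Hypothetical shading and albedo images for one observed coordinate y.
E = np.random.rand(H, W, 3)   # shading: shadows, highlights, light transport
A = np.random.rand(H, W, 3)   # albedo: the "intrinsic" surface colors

# The observed image is modelled as their point-wise product.
L_in = E * A

# At an unobserved x, shading and albedo would each be interpolated
# (warped and merged) separately, then recombined the same way:
# L_out(x) = int(A, y -> x) * int(E, y -> x)
```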


Interpolation and Warping

Warping deforms an observed image into an unobserved one, conditioned on the observed and the unobserved X-Field coordinates:

$warp(I, y \to x) \in \mathcal{I} \times \chi \times Y \to \mathcal{I}$

A spatial transformer network (STN) with bilinear filtering is used to compute the pixels in one image by reading them from another image according to a given "flow" map.
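
A minimal sketch of such a bilinear warp, here using `torch.nn.functional.grid_sample` as the spatial-transformer-style sampler (the constant flow and the pixel-unit convention are illustrative assumptions, not the official implementation):

```python
import torch
import torch.nn.functional as F

def warp(image, flow):
    """Read each output pixel p from position q = p + flow[p] in `image`.

    image: (1, 3, H, W), flow: (1, H, W, 2) in pixel units (dx, dy).
    """
    _, _, H, W = image.shape
    # Base pixel grid p.
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    base = torch.stack([xs, ys], dim=-1).float()[None]      # (1, H, W, 2)
    q = base + flow                                          # sampling positions
    # Normalize q to [-1, 1] as required by grid_sample.
    q_norm = torch.empty_like(q)
    q_norm[..., 0] = 2 * q[..., 0] / (W - 1) - 1
    q_norm[..., 1] = 2 * q[..., 1] / (H - 1) - 1
    return F.grid_sample(image, q_norm, mode="bilinear", align_corners=True)

# Toy usage: a constant flow that reads each pixel from 2 pixels to its right.
img = torch.rand(1, 3, 32, 32)
flow = torch.zeros(1, 32, 32, 2)
flow[..., 0] = 2.0
warped = warp(img, flow)
```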

Interpolation warps all observed images and merges the individual results. Both warping and merging are performed identically for shading ($E$) and albedo ($A$). This operation, denoted $int$ (also used above), is given by

$int(I, y \to x) = \sum_{y \in Y} (cons(y \to x) \odot warp(I(y), y \to x))$

The critical question is: from which position $q$ should a pixel at position $p$ read when the image at $x$ is reconstructed from the one at $y$?

The first half of the answer is to use Jacobians of the mapping from X-Field coordinates to pixel positions. The Jacobian captures, for example, how a pixel moves in a certain view and light if time is changed. Mathematically, for a given pixel $p$, it is the partial derivative

$flow_{\delta}(x)[p] = \frac{\partial p(x)}{\partial x} \in \chi \to \mathbb{R}^{2 \times n_d}$

Here, $[p]$ denotes indexing into the discrete pixel array. The formula above specifies how pixels move for an infinitesimal change of the X-Field coordinates: the Jacobian matrix holds all partial derivatives of the two pixel coordinates with respect to the $n_d$ X-Field coordinates. However, this is not yet the finite position $q$. To find $q$, the change in X-Field coordinate $y \to x$ is projected to 2D pixel motion using finite differences:

$flow_{\Delta}(y \to x)[p] = p + \Delta(y \to x)\, flow_{\delta}(x)[p] = q$

This equation gives a finite pixel motion for a finite change of X-Field coordinates.
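
In code, projecting a finite coordinate change onto the Jacobian is a small per-pixel matrix product. In the NumPy sketch below, the finite difference $\Delta(y \to x)$ is taken to be the coordinate difference $y - x$, which is an assumption for illustration (as are all the toy shapes and values):

```python
import numpy as np

H, W, n_d = 32, 32, 2          # toy resolution, 2-D X-Field (e.g. light field)

# Per-pixel Jacobian flow_delta(x): how each pixel moves for an
# infinitesimal change of each X-Field coordinate.  Shape: (H, W, 2, n_d).
jacobian = np.random.randn(H, W, 2, n_d) * 0.1

x = np.array([1.25, 3.7])      # unobserved query coordinate
y = np.array([1.0, 4.0])       # observed coordinate

delta = y - x                  # finite change of X-Field coordinates
                               # (sign convention here is illustrative)

# Base pixel grid p and the resulting read positions q = p + delta * J.
ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
p = np.stack([xs, ys], axis=-1).astype(float)          # (H, W, 2)
q = p + np.einsum("hwcd,d->hwc", jacobian, delta)      # finite pixel motion
```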

Flow

The input to the flow computation is the X-Field coordinate $x$, and the output is the Jacobian. This is implemented using a Convolutional Neural Network (CNN). The architecture starts with a fully connected layer that takes in the coordinate $x$; its output is then reshaped into a 2D image with 128 channels. A CoordConv layer is added at this stage. This is followed by multiple upsampling steps to reach the output resolution, while the number of channels is reduced to $n_d$ output channels.
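
A hedged PyTorch sketch of such a decoder-style flow network is given below. The layer sizes, the normalized-coordinate stand-in for CoordConv, and the output head (here $2 \cdot n_d$ channels, matching the $\mathbb{R}^{2 \times n_d}$ Jacobian defined above) are assumptions for illustration and differ in detail from the official TensorFlow implementation:

```python
import torch
import torch.nn as nn

class FlowNet(nn.Module):
    """Map an X-Field coordinate x (n_d values) to a per-pixel Jacobian map."""

    def __init__(self, n_d=2, base_res=8, out_res=64, channels=128):
        super().__init__()
        self.base_res, self.channels = base_res, channels
        # Fully connected layer: coordinate -> low-resolution feature image.
        self.fc = nn.Linear(n_d, channels * base_res * base_res)
        # Upsampling blocks double the resolution until out_res is reached,
        # halving the channel count; +2 inputs on the first block are the
        # CoordConv-style pixel coordinates appended in forward().
        blocks, ch, res = [], channels, base_res
        while res < out_res:
            in_ch = ch + 2 if res == base_res else ch
            out_ch = max(ch // 2, 8)
            blocks += [nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
                       nn.Conv2d(in_ch, out_ch, 3, padding=1),
                       nn.LeakyReLU(0.2)]
            ch, res = out_ch, res * 2
        self.blocks = nn.Sequential(*blocks)
        # Final projection: one 2-D motion vector per X-Field dimension.
        self.head = nn.Conv2d(ch, 2 * n_d, 3, padding=1)

    def forward(self, x):
        b = x.shape[0]
        h = self.fc(x).view(b, self.channels, self.base_res, self.base_res)
        # CoordConv-style step: append normalized pixel coordinates as channels.
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, self.base_res),
                                torch.linspace(-1, 1, self.base_res), indexing="ij")
        coords = torch.stack([xs, ys]).expand(b, -1, -1, -1)
        return self.head(self.blocks(torch.cat([h, coords], dim=1)))

jacobian_map = FlowNet()(torch.tensor([[1.25, 3.7]]))   # shape (1, 2*n_d, 64, 64)
```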


Consistency

To combine all observed images warped to the unobserved X-Field coordinate, each image pixel is weighted by its flow consistency. For a pixel $q$ to contribute to the pixel $p$ in the image at $x$, the flow at $q$ has to map back to $p$. The consistency of one pixel $p$ when warped to coordinate $x$ from $y$ is a partition of unity of a weight function:

$cons(y \to x)[p] = w(y \to x)[p](\sum_{y' \in Y} w(y' \to x)[p])^{-1}$

The weights $w$ are smoothly decreasing functions of the 1-norm of the difference between the pixel position $p$ and the backward flow at the position $q$ to which $p$ was warped:

$w(y \to x)[p] = \exp(-\sigma \, |p - flow_{\Delta}(x \to y)[q]|_1)$

Here, $\sigma = 10$ is a bandwidth parameter.
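
A NumPy sketch of the consistency weighting and the partition-of-unity normalization (the per-pixel 1-norm errors and the warped images are random placeholders here):

```python
import numpy as np

sigma = 10.0                    # bandwidth parameter from the paper
H, W, n_obs = 32, 32, 4         # toy resolution and number of observed images |Y|

# Per observed image y: the warped RGB result and the 1-norm between each
# pixel p and the backward flow evaluated at the read position q.
warped = np.random.rand(n_obs, H, W, 3)
l1_err = np.abs(np.random.randn(n_obs, H, W))   # |p - flow(x -> y)[q]|_1

# Smoothly decreasing weights, normalized into a partition of unity over Y.
w = np.exp(-sigma * l1_err)                     # (n_obs, H, W)
cons = w / w.sum(axis=0, keepdims=True)         # weights sum to 1 per pixel

# Consistency-weighted merge of the warped images (the int operator).
merged = (cons[..., None] * warped).sum(axis=0)  # (H, W, 3)
```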

Training

Check out the Colab Notebook to reproduce results $\rightarrow$

The official GitHub repo is available here. I have instrumented the same code with Weights & Biases, and you can find that repo here.

The linked Colab Notebook lets you play with the available datasets. Choose the dataset of your choice and run the appropriate Python command to train a model for that scene. Training might take some time, depending on the dataset.

The average L1 loss for the different models is shown in the media panel below.

[Media panel: average L1 loss for the different models]

Results

[Media panel: results]