
X-Fields: Implicit Neural View-, Light- and Time-Image Interpolation

This article briefly examines the X-Fields paper, which proposes a novel method to seamlessly interpolate the time, light, and view of a 2D image using an X-Field.


Figure 1: Time interpolation from sparse observations. The captured frames at Time 0 (IMG_1889.JPG), Time 1 (IMG_1891.JPG), and Time 3 (IMG_1893.JPG) are the sparse inputs; time_rendered.gif shows the interpolated result.

This article briefly examines the exciting X-Fields paper, which proposes a novel method to seamlessly interpolate the time, light, and view of a 2D image from sparse data, called an X-Field. The X-Field is represented by training a neural network to map time, light, or view coordinates to 2D images.

Paper | GitHub | Colab Notebook




Introduction to X-Fields

New sensors capture images of a scene at different points in time (video), from different viewing angles (light field), or under varying illumination (reflectance field). This diverse information can be used to improve the Virtual Reality (VR) experience: new views, illumination conditions, and moments in time can be interpolated to generate a seamless transition from one scene to another.
However, seamless interpolation requires dense sampling, leading to excessive storage, capture, and processing requirements. Sparse sampling is an alternative, but it requires accurate interpolation across time, light, and view, which is far from trivial.
An X-Field is a set of 2D images taken across different views, times, or illumination conditions, i.e., a video, light field, reflectance field, or combination thereof. The authors propose a neural-network-based architecture that can represent such high-dimensional X-Fields.
The crux of the paper can be understood using Figure 1 shown above: from sparse image observations (time, in this case) with varying conditions and coordinates, a neural network (mapping) is trained such that, when provided a space, time, or light coordinate as input, it generates the observed sample image as output. For a non-observed coordinate, the output is faithfully interpolated (shown as the GIF).
Check out the official YouTube video by the authors.




Overview of the Proposed Method

The proposed approach is motivated by two key observations:
  • Deep representations help interpolation: Representing information using neural networks leads to better interpolation.
  • This is true as long as every unit is differentiable: The above observation holds for any architecture as long as all units are differentiable.
The X-field is represented as a non-linear function:
$$L_{out}^{\theta}(x) \in \chi \to \mathbb{R}^{3 \times n_p}$$
with trainable parameters $\theta$ to map from an $n_d$-dimensional X-Field coordinate $x \in \chi \subset \mathbb{R}^{n_d}$ to 2D RGB images with $n_p$ pixels. The dimension of the X-Field ($\chi$) depends on the capture modality, e.g., 1D for video interpolation.
Consider the X-Field to be a high-dimensional continuous space, of which we only have finite, rather sparse input images. These sparsely observed X-Field coordinates can be represented as $Y \subset \chi$, where an image $L_{in}(y)$ was captured at each known coordinate $y$. $|Y|$ is sparse, i.e., small, like $3 \times 3$, $2 \times 3$, etc.
For example, given an array of $3 \times 5$ light field images, the input is a 2D coordinate $(s,t)$ with $s \in \{0,1,2\}$ and $t \in \{0,1,2,3,4\}$. At test time, we can give any continuous value between 0 and 2 for $s$ and 0 to 4 for $t$, and the learned neural network will faithfully interpolate within the given range.
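To make the coordinate convention concrete, here is a minimal Python sketch of the observed coordinate grid and a continuous test query for such a $3 \times 5$ light field (the names and values are illustrative, not from the official code):

```python
import numpy as np

# Observed X-Field coordinates Y for a hypothetical 3 x 5 light-field capture:
# the integer grid points, each paired with a captured image L_in(y).
S, T = 3, 5
observed_coords = np.array([(s, t) for s in range(S) for t in range(T)], dtype=np.float32)
print(observed_coords.shape)  # (15, 2) -> |Y| = 15 sparse observations

# At test time, any continuous coordinate inside the observed range is a valid query,
# e.g. a view halfway between s=1 and s=2 and a quarter of the way from t=2 to t=3.
query = np.array([1.5, 2.25], dtype=np.float32)
assert 0.0 <= query[0] <= S - 1 and 0.0 <= query[1] <= T - 1
```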
The images shown below are the sparse input ($Y$), which belongs to the X-Field ($\chi$). The capture modality in this instance is suited for light (illumination) interpolation. Observe the shadow of the white angel.




To summarize, an architecture $L_{out}$ is trained to map vectors $y$ to captured images $L_{in}(y)$, in the hope of also getting plausible images $L_{out}(x)$ for unobserved vectors $x$. This is in line with the first key observation mentioned above.
At test time, interpolation is expected, but it is bounded by $Y$. Training never evaluates any X-Field coordinate $x$ that is not in $Y$, as we would not know what image $L_{in}(x)$ at that coordinate should be.
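A minimal TensorFlow sketch of this training idea (assuming a `model` that maps a coordinate to an image; this is not the authors' exact code):

```python
import tensorflow as tf

def train_step(model, optimizer, y, image_y):
    """One optimization step: reproduce the captured image L_in(y) at an observed
    coordinate y with an L1 loss. Unobserved coordinates x are only queried at test time."""
    with tf.GradientTape() as tape:
        prediction = model(y[tf.newaxis, :])                 # L_out(y), batch of one coordinate
        loss = tf.reduce_mean(tf.abs(prediction - image_y))  # L1 reconstruction loss
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```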



Architecture Design

The architecture design is the novel bit of the paper. $L_{out}$ is modeled using three main ideas:
  • Appearance is a combination of the appearance in the observed images ($L_{in}(y)$).
  • Appearance is assumed to be a product of shading and albedo.
  • The unobserved shading and albedo at $x$ are considered warped versions of the observed shading and albedo at $y$.
These assumptions need not hold, but in that case, the neural network will have a harder time capturing the relationship between coordinates and images.
The proposed architecture is implemented in four steps:
  • Decoupling shading and albedo: Shading refers to the depiction of depth perception in 3D models (within 3D computer graphics) or illustrations (within 2D computer graphics) by varying the level of darkness. Albedo is the proportion of incident light that is reflected away from a surface; in other words, it is the overall brightness of an object.
  • Interpolating images as a weighted combination of warped images.
  • Representing "flow" using a neural network.
  • Resolving inconsistencies.
We will go through each of them one by one.



Decouple Shading and Albedo (De-lighting)

De-lighting splits appearance into a combination of shading, which moves in one way in response to changes in the X-Field coordinates, and albedo, which moves in a different way.
Every observed image is decomposed as:
$$L_{in}(y) = E(y) \odot A(y)$$
$E$ is the shading image, $A$ is the albedo image, and $\odot$ is the pointwise product.
At test time, both shading and albedo are interpolated independently and recombined into new radiance at an unobserved location $x$ by multiplication. The output is mathematically given as
$$L_{out}(x) = int(A(L_{in}(y)), y \to x) \odot int(E(L_{in}(y)), y \to x)$$
$int$ is an operator that will be described shortly.
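A minimal NumPy sketch of the decomposition and recombination (the albedo estimate here is an assumed input; this is not necessarily how the authors parameterize it):

```python
import numpy as np

EPS = 1e-6  # avoid division by zero

def delight(image_y, albedo_y):
    """Pointwise decomposition L_in(y) = E(y) * A(y)  =>  shading E(y) = L_in(y) / A(y).
    `albedo_y` is assumed to come from some estimate of the albedo at coordinate y."""
    return image_y / (albedo_y + EPS)

def recombine(albedo_at_x, shading_at_x):
    """L_out(x): pointwise product of the independently interpolated albedo and shading."""
    return albedo_at_x * shading_at_x
```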





Interpolation and Warping

Warping deforms an observed image into an unobserved one, conditioned on the observed and the unobserved X-Field coordinates:
$$warp(I, y \to x) \in I \times \chi \times Y \to I$$
A spatial transformer (STN) with bilinear filtering is used to compute the pixels in one image by reading them from another image according to a given "flow" map.
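As a rough illustration of such a bilinear warp (a NumPy sketch, not the spatial transformer used in the official TensorFlow code), each output pixel reads from a generally non-integer source position given by the flow map:

```python
import numpy as np

def bilinear_warp(image, flow):
    """Warp `image` (H, W, C) by reading each output pixel from the source position
    flow[p] = (row, col) with bilinear filtering."""
    H, W, _ = image.shape
    qy, qx = flow[..., 0], flow[..., 1]
    y0, x0 = np.floor(qy).astype(int), np.floor(qx).astype(int)
    y1, x1 = y0 + 1, x0 + 1
    wy, wx = (qy - y0)[..., None], (qx - x0)[..., None]      # fractional offsets
    y0, y1 = np.clip(y0, 0, H - 1), np.clip(y1, 0, H - 1)
    x0, x1 = np.clip(x0, 0, W - 1), np.clip(x1, 0, W - 1)
    top = (1 - wx) * image[y0, x0] + wx * image[y0, x1]
    bottom = (1 - wx) * image[y1, x0] + wx * image[y1, x1]
    return (1 - wy) * top + wy * bottom
```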
Interpolation warps all observed images and merges the individual results. Both the warp and the merge are performed identically for shading ($E$) and albedo ($A$). This operation is the $int$ operator used above and is given by
$$int(I, y \to x) = \sum_{y \in Y} (cons(y \to x) \odot warp(I(y), y \to x))$$
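The following NumPy sketch shows this weighted merge, assuming a warp function like the one above and a `cons` function returning per-pixel consistency weights (both assumed inputs):

```python
import numpy as np

def interpolate(images, x, flows, warp, cons):
    """Blend all observed images warped to the unobserved coordinate x.
    `images` maps each observed coordinate y to its image, `flows[y]` is the flow map
    for y -> x, `warp` is a bilinear warp (as sketched above), and `cons(y, x)` returns
    per-pixel consistency weights of shape (H, W, 1)."""
    out = np.zeros_like(next(iter(images.values())), dtype=np.float32)
    for y, image_y in images.items():
        out += cons(y, x) * warp(image_y, flows[y])
    return out
```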
The critical question is: from which position $q$ should a pixel at position $p$ read when the image at $x$ is reconstructed from the one at $y$?
The first half of the answer is to use Jacobians of the mapping from X-Field coordinates to pixel positions. The Jacobian captures, for example, how a pixel moves in a certain view and light if time is changed. Mathematically, for a given pixel $p$, it is the partial derivative
$$flow_{\delta}(x)[p] = \frac{\partial p(x)}{\partial x} \in \chi \to \mathbb{R}^{2 \times n_d}$$
Here, $[p]$ indexes into the discrete pixel array. The formula above specifies how pixels move for an infinitesimal change of the X-Field coordinates: the Jacobian matrix holds all partial derivatives of the two pixel coordinates with respect to all $n_d$ X-Field dimensions. However, this is not yet the finite value $q$. To find $q$, the change in X-Field coordinate $y \to x$ is projected to 2D pixel motion using finite differences:
$$flow_{\Delta}(y \to x)[p] = p + \Delta(y \to x)\, flow_{\delta}(x)[p] = q$$
This equation gives a finite pixel motion for a finite change of X-Field coordinates.
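A small NumPy sketch of this projection, assuming $\Delta(y \to x)$ is simply the coordinate difference $x - y$ (shapes are illustrative):

```python
import numpy as np

def finite_flow(p, jacobian_p, y, x):
    """Project the finite change in X-Field coordinates onto 2D pixel motion:
    q = p + Delta(y -> x) applied to the per-pixel Jacobian.
    `p` is a pixel position of shape (2,), `jacobian_p` has shape (2, n_d)."""
    delta = np.asarray(x, dtype=np.float32) - np.asarray(y, dtype=np.float32)  # Delta(y -> x), (n_d,)
    return np.asarray(p, dtype=np.float32) + jacobian_p @ delta                # q, (2,)
```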

Flow

The input to the flow computation is the X-Field coordinate $x$, and the output is the Jacobian. This is implemented using a Convolutional Neural Network (CNN). The architecture starts with a fully connected layer that takes in the coordinate $x$; its output is reshaped into a 2D image with 128 channels, and a CoordConv layer is added at this stage. This is followed by multiple upsampling steps to reach the output resolution, while the number of channels is reduced to $n_d$ output channels.
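Below is a hypothetical tf.keras sketch of this structure. The filter counts, resolutions, and activations are placeholders rather than the official implementation, and the head here outputs $2 \times n_d$ channels to match the Jacobian's $2 \times n_d$ entries per pixel:

```python
import math
import tensorflow as tf

def add_coords(feat):
    """CoordConv-style step: append normalized pixel-coordinate channels."""
    h, w = feat.shape[1], feat.shape[2]
    ys, xs = tf.linspace(-1.0, 1.0, h), tf.linspace(-1.0, 1.0, w)
    yy, xx = tf.meshgrid(ys, xs, indexing="ij")
    grid = tf.stack([yy, xx], axis=-1)[tf.newaxis]                   # (1, h, w, 2)
    grid = tf.tile(grid, tf.stack([tf.shape(feat)[0], 1, 1, 1]))     # repeat over the batch
    return tf.concat([feat, grid], axis=-1)

def build_flow_net(n_d, out_size=256, base=8, channels=128):
    """Coordinate -> dense -> reshape to a small 2D feature map -> CoordConv ->
    repeated upsampling convolutions -> per-pixel Jacobian."""
    coord = tf.keras.Input(shape=(n_d,))
    h = tf.keras.layers.Dense(base * base * channels, activation="relu")(coord)
    h = tf.keras.layers.Reshape((base, base, channels))(h)
    h = tf.keras.layers.Lambda(add_coords)(h)
    for _ in range(int(math.log2(out_size // base))):                # upsample to output resolution
        h = tf.keras.layers.UpSampling2D()(h)
        h = tf.keras.layers.Conv2D(channels, 3, padding="same", activation="relu")(h)
        channels = max(channels // 2, 2 * n_d)                       # gradually shrink channel count
    jac = tf.keras.layers.Conv2D(2 * n_d, 3, padding="same")(h)      # Jacobian entries per pixel
    return tf.keras.Model(coord, jac)
```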





Consistency

To combine all observed images warped to the unobserved X-Field coordinate, each image pixel is weighted by its flow consistency. For a pixel $q$ to contribute to the image at $p$, the flow at $q$ has to map back to $p$. The consistency of one pixel $p$ when warped to coordinate $x$ from $y$ is a partition of unity of a weight function:
$$cons(y \to x)[p] = w(y \to x)[p] \left( \sum_{y' \in Y} w(y' \to x)[p] \right)^{-1}$$
The weights $w$ are smoothly decreasing functions of the 1-norm of the difference between the pixel position $p$ and the backward flow at the position $q$ to which $p$ was warped:
$$w(y \to x)[p] = \exp(-\sigma \, |p - flow_{\Delta}(x \to y)[q]|_1)$$
Here, $\sigma = 10$ is a bandwidth parameter.
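A NumPy sketch of these two formulas, assuming the backward-flow positions have already been gathered into an (H, W, 2) array per observed image (the function names are illustrative):

```python
import numpy as np

def weight(p_grid, backflow, sigma=10.0):
    """Unnormalized consistency weight w(y -> x) per pixel: large only where the
    backward flow maps back close to the original pixel position p."""
    dist = np.abs(p_grid - backflow).sum(axis=-1)        # 1-norm per pixel, shape (H, W)
    return np.exp(-sigma * dist)

def consistency(weights_per_y):
    """Partition of unity: normalize the weights of all observed images to sum to 1."""
    stacked = np.stack(weights_per_y, axis=0)            # (|Y|, H, W)
    return stacked / (stacked.sum(axis=0, keepdims=True) + 1e-8)
```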

Training

Check out the Colab Notebook to reproduce the results →

The official GitHub repo is available here. I have instrumented it with Weights & Biases, and you can find that repo here.
The linked Colab Notebook will let you play with the available datasets. Choose a dataset and run the corresponding Python command to train that scene's model. It might take some time, depending on the dataset.
The average L1 loss for the different models is shown in the media panel below.
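For reference, logging the average L1 loss to Weights & Biases boils down to a few lines like the following minimal sketch (the project name, run name, and loss values are placeholders, not the instrumented repo's exact code):

```python
import wandb

# Log the per-step average L1 loss so it shows up as a line plot in the report.
wandb.init(project="x-fields", name="light-interpolation-demo")
for step, avg_l1 in enumerate([0.21, 0.14, 0.09]):   # placeholder loss values from a training loop
    wandb.log({"average L1 loss": avg_l1}, step=step)
wandb.finish()
```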



Results


