X-Fields: Implicit Neural View-, Light- and Time-Image Interpolation
This article briefly examines the X-Fields paper, which proposes a novel method to seamlessly interpolate the time, light, and view of 2D images using an X-Field representation.
[Figure 1: Interpolation across time, showing frames at Time 0, Time 1, and Time 3 alongside the interpolated result]
This article briefly examines the exciting X-Fields paper, which proposes a novel method to seamlessly interpolate the time, light, and view of 2D images from sparse data, called an X-Field. The X-Field is represented by training a neural network to map time, light, or view coordinates to 2D images.
Paper | GitHub | Colab Notebook
Introduction to X-Fields
New sensors capture images of a scene at different points in time (video), from different angles (light field), or under varying illumination (reflectance field). This diverse information can be used to improve the Virtual Reality (VR) experience: new views, illumination, and moments in time can be interpolated to generate a seamless transition from one scene to another.
However, seamless interpolation requires dense sampling, which leads to excessive storage, capture, and processing requirements. Sparse sampling is an alternative, but it requires accurate interpolation across time, light, and view, which is far from trivial.
An X-Field is a set of 2D images taken across different views, times, or illumination conditions, i.e., a video, light field, reflectance field, or a combination thereof. The authors propose a neural network-based architecture that can represent these high-dimensional X-Fields.
The crux of the paper can be understood using figure 1 shown above: from sparse image observations (time in this case) with varying conditions and coordinates, a neural network (mapping) is trained such that, when provided a space, time, or light coordinate as input, it generates the observed sample image as output. For a non-observed coordinate, the output is faithfully interpolated (shown as a GIF).
Check out the official YouTube video by the authors.
Overview of the Proposed Method
The proposed approach is motivated by two key observations:
- Deep representations help interpolation: Representing information using neural networks leads to better interpolation.
- This is true as long as every unit is differentiable: The above observation holds for any architecture as long as all units are differentiable.
The X-Field is represented as a non-linear function
$$L_{out}^{\theta}(x): \mathcal{X} \subset \mathbb{R}^{n_d} \to \mathbb{R}^{3 \times n_p}$$
with trainable parameters $\theta$, mapping an $n_d$-dimensional X-Field coordinate $x \in \mathcal{X}$ to a 2D RGB image with $n_p$ pixels. The dimension of the X-Field ($n_d$) depends on the capture modality, e.g., 1D for video interpolation.
Consider the X-Field $\mathcal{X}$ to be a high-dimensional continuous space, of which we only have a finite, rather sparse set of input images. The sparse observed X-Field coordinates can be denoted $\mathcal{Y} \subset \mathcal{X}$: for each $y \in \mathcal{Y}$, an image $L_{in}(y)$ was captured at the known coordinate $y$. $\mathcal{Y}$ is sparse, i.e., small (only a handful of images per scene).
For example, given a small 2D array of light field images, the input is a 2D coordinate $(u, v)$, with the observed images at $u \in \{0, 1, 2\}$ and $v \in \{0, 1, 2, 3, 4\}$. During test time, we can give any continuous value between 0 and 2 for $u$ and 0 to 4 for $v$. The learned neural network architecture will faithfully interpolate in the given range.
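As a rough illustration (not the authors' code), the sketch below shows what querying such a trained mapping could look like for this light-field example: the observed coordinates are the integer grid points, and any continuous coordinate inside the grid is a valid test-time query. The `model` object and its call signature are hypothetical placeholders.

```python
import numpy as np

# Hypothetical trained X-Field mapping: a 2D coordinate (u, v) -> H x W x 3 image.
# In the official implementation this is a TensorFlow network; here we only
# illustrate the interface.
def l_out(coord, model):
    """coord: array of shape (2,), with u in [0, 2] and v in [0, 4]."""
    return model(coord[None, :])[0]  # -> (H, W, 3) RGB image

# Observed (training) coordinates: the integer grid of captured images.
observed = np.array([(u, v) for u in range(3) for v in range(5)], dtype=np.float32)

# Unobserved (test) coordinate: any continuous point inside the grid.
query = np.array([1.25, 3.7], dtype=np.float32)
# image = l_out(query, model)  # interpolated view between the captured images
```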
The images shown below are the sparse input ($L_{in}(y)$), which belong to the observed X-Field coordinates ($\mathcal{Y}$). The capture modality in this instance is suited for light (illumination) interpolation. Observe the shadow of the white angel.
[Media panel: the sparse input images captured under different illumination conditions]
To summarize, an architecture is trained to map vectors $y \in \mathcal{Y}$ to the captured images $L_{in}(y)$, in the hope of also getting plausible images $L_{out}(x)$ for unobserved vectors $x \in \mathcal{X}$. This is in line with the first key observation mentioned above.
During test time, interpolation is expected but is bounded by $\mathcal{Y}$. Thus training never evaluates any X-Field coordinate that is not in $\mathcal{Y}$, as we would not know what the image $L_{in}(x)$ at that coordinate would be.
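A minimal sketch of this training setup, assuming a TensorFlow model and placeholder tensors `observed_coords` / `observed_images` (not the authors' variable names): the network is only ever evaluated at observed coordinates $y$, and an L1 loss compares its output against the captured image $L_{in}(y)$.

```python
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(1e-4)

def train_step(model, observed_coords, observed_images):
    # Pick one observed X-Field coordinate y at random.
    idx = tf.random.shuffle(tf.range(tf.shape(observed_coords)[0]))[0]
    coord, target = observed_coords[idx], observed_images[idx]
    with tf.GradientTape() as tape:
        prediction = model(coord[None, ...])                # L_out(y)
        loss = tf.reduce_mean(tf.abs(prediction - target))  # L1 against L_in(y)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```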
Architecture Design
The architecture design is the novel bit of the paper. $L_{out}^{\theta}$ is modeled using three main ideas:
- Appearance is a combination of the appearance in the observed images ($L_{in}(y)$).
- Appearance is assumed to be a product of shading and albedo.
- The unobserved shading and albedo at $x$ are considered warped versions of the observed shading and albedo at $y$.
These assumptions need not hold, but in that case, the neural network will have a more challenging time capturing the relationship between coordinates and images.
The proposed architecture is implemented in four steps:
- Decoupling shading and albedo: Shading refers to the depiction of depth perception in 3D models (within the field of 3D computer graphics) or illustrations (within the area of 2D computer graphics) by varying the level of darkness. Albedo is the proportion of incident light that is reflected away from a surface; in other words, it is the overall brightness of an object.
- Interpolating images as a weighted combination of warped images.
- Representing "flow" using a neural network.
- Resolving inconsistencies.
We will go through each of them one by one.

Decouple Shading and Albedo (De-lighting)
De-lighting splits appearance into a combination of shading and albedo, each of which moves in its own way in response to changes in the X-Field coordinates.
Every observed image is decomposed as
$$L_{in}(y) = E(y) \odot A(y),$$
where $E(y)$ is the shading image, $A(y)$ is the albedo image, and $\odot$ is the point-wise product.
During test time, both shading and albedo are interpolated independently and recombined into new radiance at an unobserved location by multiplication. The output is mathematically given as
$$L_{out}(x) = \mathrm{int}(E, x) \odot \mathrm{int}(A, x),$$
where $\mathrm{int}$ is an interpolation operator that will be described shortly.
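The decomposition and recombination themselves are just element-wise products; the short NumPy sketch below (illustrative, not the official code) makes that explicit. The `recombine` helper and the random arrays are assumptions for the example.

```python
import numpy as np

def recombine(shading_interp, albedo_interp):
    """Both inputs are H x W x 3 arrays produced by interpolating shading and
    albedo separately; the output is their point-wise (Hadamard) product."""
    return shading_interp * albedo_interp

# Sanity check on an observed image: L_in = E * A, so recombining E and A
# recovers L_in exactly.
E = np.random.rand(4, 4, 3)   # shading
A = np.random.rand(4, 4, 3)   # albedo
L_in = E * A
assert np.allclose(recombine(E, A), L_in)
```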
Interpolation and Warping
Warping deforms an observed image into an unobserved one, conditioned on the observed and the unobserved X-Field coordinates, denoted here as $\mathrm{warp}(L_{in}(y), y \to x)$.
A spatial transformer network (STN) with bilinear filtering is used to compute the pixels in one image by reading them from another image according to a given "flow" map.
Interpolation warps all observed images and merges the individual results. Both warp and merge are performed identically for shading ($E$) and albedo ($A$). This operation is denoted by $\mathrm{int}$ (also used above) and is given by
$$\mathrm{int}(I, x) = \sum_{y \in \mathcal{Y}} \mathrm{cons}_y \odot \mathrm{warp}(I(y), y \to x),$$
where $I$ stands for either $E$ or $A$ and $\mathrm{cons}_y$ is the per-pixel consistency weight map described below.
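To make the warp step concrete, here is a minimal NumPy sketch of bilinear warping, assuming a per-pixel flow map that tells each target pixel $p$ which source position $q$ to read from. The official implementation uses a spatial transformer with bilinear filtering in TensorFlow; this standalone version only mirrors the idea.

```python
import numpy as np

def bilinear_warp(src, flow):
    """src: H x W x 3 source image; flow: H x W x 2 read positions (row, col)."""
    H, W, _ = src.shape
    r = np.clip(flow[..., 0], 0, H - 1)
    c = np.clip(flow[..., 1], 0, W - 1)
    r0, c0 = np.floor(r).astype(int), np.floor(c).astype(int)
    r1, c1 = np.minimum(r0 + 1, H - 1), np.minimum(c0 + 1, W - 1)
    wr, wc = (r - r0)[..., None], (c - c0)[..., None]
    # Blend the four neighbouring source pixels (bilinear filtering).
    return ((1 - wr) * (1 - wc) * src[r0, c0] +
            (1 - wr) * wc       * src[r0, c1] +
            wr       * (1 - wc) * src[r1, c0] +
            wr       * wc       * src[r1, c1])
```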
The critical question is: from which position $q$ should a pixel at position $p$ read when the image at $x$ is reconstructed from the one at $y$?
The first half of the answer is to use Jacobians of the mapping from X-Field coordinates to pixel positions. The Jacobian captures, for example, how a pixel moves in a certain view and light if time is changed. Mathematically, for a given pixel $p$ it is the partial derivative
$$J(x)[p] = \frac{\partial p(x)}{\partial x} \in \mathbb{R}^{2 \times n_d}.$$
Here $[p]$ denotes indexing into the discrete pixel array. The above formula specifies how pixels move for an infinitesimal change of X-Field coordinates: the Jacobian matrix holds all partial derivatives of the two pixel coordinates with respect to all $n_d$ X-Field dimensions. However, this is not yet the finite value $q$. To find $q$, the change in X-Field coordinates is projected to 2D pixel motion using finite differences:
$$q = p + J(x)[p]\,(y - x).$$
This equation gives a finite pixel motion for a finite change of X-Field coordinates.
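The sketch below shows this projection for a single pixel, assuming the Jacobian $J(x)[p]$ is available as a $2 \times n_d$ array (the numbers are made up for illustration).

```python
import numpy as np

def read_position(p, jacobian, x, y):
    """p: pixel position (2,); jacobian: (2, n_d) partial derivatives;
    x: unobserved coordinate (n_d,); y: observed coordinate (n_d,).
    Returns the source position q to read from."""
    return p + jacobian @ (y - x)    # finite difference: delta_q = J * delta_x

p = np.array([120.0, 64.0])          # pixel in the target image
J = np.array([[3.5, 0.0],            # rows move 3.5 px per unit of the first coordinate
              [0.0, -2.0]])          # columns move -2 px per unit of the second
x = np.array([1.25, 3.7])            # unobserved X-Field coordinate
y = np.array([1.0, 4.0])             # observed X-Field coordinate
q = read_position(p, J, x, y)        # -> array([119.125, 63.4])
```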
Flow
The input to the flow computation is the X-Field coordinate $x$, and the output is the Jacobian. This is implemented using a Convolutional Neural Network (CNN). The architecture starts with a fully connected layer that takes in the coordinate; its output is then reshaped into a small 2D image with 128 channels, and a CoordConv layer is added at this stage. This is followed by multiple upsampling steps to reach the output resolution, while the number of channels is reduced to the number of output channels.
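A hedged Keras sketch of that shape is shown below; the layer sizes, the number of upsampling stages, and the omission of the CoordConv concatenation are simplifications, not the exact official configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_flow_net(n_d, out_channels, base=(8, 8)):
    coord = layers.Input(shape=(n_d,))
    x = layers.Dense(base[0] * base[1] * 128, activation="relu")(coord)
    x = layers.Reshape((base[0], base[1], 128))(x)
    # The official model concatenates pixel coordinates here (CoordConv);
    # omitted in this sketch for brevity.
    for ch in (128, 64, 32):
        x = layers.UpSampling2D()(x)
        x = layers.Conv2D(ch, 3, padding="same", activation="relu")(x)
    flow = layers.Conv2D(out_channels, 3, padding="same")(x)  # e.g. 2 * n_d channels
    return tf.keras.Model(coord, flow)

# flow_net = build_flow_net(n_d=2, out_channels=4)  # Jacobian: 2 pixel dims x 2 coords
```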
Consistency
To combine all observed images warped to the unobserved X-Field coordinate, each image pixel is weighted by its flow consistency. For a pixel $q$ to contribute to the image at $p$, the flow at $q$ has to map back to $p$. The consistency of one pixel $p$ when warped to coordinate $x$ from $y$ is the partition of unity of a weight function $w$:
$$\mathrm{cons}_y(p) = \frac{w_y(p)}{\sum_{y' \in \mathcal{Y}} w_{y'}(p)}.$$
The weights $w$ are smoothly decreasing functions of the 1-norm of the difference between the pixel position $p$ and the backward flow evaluated at the position $q$ that $p$ was warped to, with a bandwidth parameter $\sigma$ controlling how quickly the weight falls off.
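The following NumPy sketch illustrates the weighting, assuming the per-image 1-norm mismatches have already been computed; the exponential kernel is an illustrative choice of "smoothly decreasing function", not necessarily the exact one used in the paper.

```python
import numpy as np

def consistency_weights(deltas, sigma=10.0):
    """deltas: list of H x W arrays, each holding the 1-norm flow mismatch
    for one observed image y.  sigma is the bandwidth parameter."""
    weights = [np.exp(-sigma * d) for d in deltas]   # smoothly decreasing in the mismatch
    total = np.sum(weights, axis=0) + 1e-8           # avoid division by zero
    return [w / total for w in weights]              # partition of unity: sums to 1 per pixel
```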
Training
Check out the Colab Notebook to reproduce the results.
The official GitHub repo is available here. I have instrumented it with Weights & Biases, and you can find my version of the repo here.
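If you want to log your own runs, a minimal (illustrative) W&B snippet looks like the one below; the project name and logged keys are placeholders, and the linked repo contains the actual instrumentation.

```python
import wandb

wandb.init(project="x-fields", config={"dataset": "example_scene"})

# Inside the training loop:
# wandb.log({"l1_loss": float(loss), "step": step})
```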
The linked Colab Notebook will let you play with the available dataset. Choose the dataset of your choice and select the appropriate Python command to train that scene's model. It might take some time, depending on the dataset.
The average L1 loss for the different models is shown in the media panel below.
[Media panel: average L1 training loss for each model]
Results
[Media panel: interpolation results for the trained models]