X-Fields: Implicit Neural View-, Light- and Time-Image Interpolation

This article briefly examines the X-Fields paper, which proposes a novel method to seamlessly interpolate the time, light, and view of 2D images using an X-Field.
Ayush Thakur
Created on January 19|Last edited on December 4
﻿
﻿
[Figure 1: Interpolation panel with frames labeled Time 0, Time 1, and Time 3 (run set 0)]
﻿
This article briefly examines the exciting X-Fields paper, which proposes a novel method to seamlessly interpolate across time, light, and view from a sparse set of 2D images, called an X-Field. The X-Field is represented by training a neural network to map time, light, or view coordinates to 2D images.
﻿Paper | GitHub | Colab Notebook﻿﻿﻿
Table of Contents
Introduction to X-Fields
Overview of the Proposed Method
Architecture Design
Training
Results
﻿
Introduction to X-Fields
New sensors capture images of a scene at different points in time (video), from different angles (light field), or under varying illumination (reflectance field). This diverse information can improve the experience of Virtual Reality (VR): new views, illumination, and so on can be interpolated to generate a seamless transition from one scene to another.
However, seamless interpolation requires dense sampling, leading to excessive storage, capture, and processing requirements. Sparse sampling is an alternative, but it then requires accurate interpolation across time, light, and view.
An X-Field is a set of 2D images taken across different views, times, or illumination conditions, i.e., a video, light field, reflectance field, or a combination thereof. The authors propose a neural network-based architecture that can represent these high-dimensional X-Fields.
The crux of the paper can be understood from figure 1 shown above. From sparse image observations (across time in this case) with varying conditions and coordinates, a neural network (the mapping) is trained so that, given a space, time, or light coordinate as input, it generates the observed sample image as output. For a non-observed coordinate, the output is faithfully interpolated (shown as a GIF).
Check out the official YouTube video by the authors. 
﻿
﻿
﻿
Overview of the Proposed Method
The proposed approach is motivated by two key observations:
Deep representations help interpolation: Representing information using neural networks leads to better interpolation.
This is true as long as every unit is differentiable: The first observation holds for any architecture, provided all of its units are differentiable.
The X-Field is represented as a non-linear function:
$$L_{out}^{\theta}(x) \in \chi \to \mathbb{R}^{3 \times n_p}$$
with trainable parameters $\theta$ that maps an $n_d$-dimensional X-Field coordinate $x \in \chi \subset \mathbb{R}^{n_d}$ to a 2D RGB image with $n_p$ pixels. The dimension of the X-Field ($\chi$) depends on the capture modality, e.g., 1D for video interpolation.
Consider the X-Field to be a high-dimensional continuous space. We only have a finite, rather sparse set of input images. These sparsely observed X-Field coordinates form a set $Y \subset \chi$, where an image $L_{in}(y)$ was captured at each known coordinate $y$. $|Y|$ is small, e.g., $3 \times 3$ or $2 \times 3$.
For example, given a $3 \times 5$ array of light field images, the input is a 2D coordinate $(s, t)$ with $s \in \{0, 1, 2\}$ and $t \in \{0, 1, 2, 3, 4\}$. At test time, we can give any continuous value between 0 and 2 for $s$ and between 0 and 4 for $t$. The learned neural network will faithfully interpolate in the given range.
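The coordinate setup in this example can be sketched in a few lines of NumPy. This is only an illustration: the names `observed` and `in_interpolation_range` are hypothetical and do not come from the paper or the official repo.

```python
import itertools

import numpy as np

# Observed coordinates Y for a 3 x 5 light field: every integer (s, t) pair.
observed = np.array(
    list(itertools.product(range(3), range(5))), dtype=np.float32
)  # shape (15, 2)


def in_interpolation_range(x, observed_coords):
    """Return True if x lies inside the bounding box of the observed coordinates."""
    x = np.asarray(x, dtype=np.float32)
    lo = observed_coords.min(axis=0)
    hi = observed_coords.max(axis=0)
    return bool(np.all(x >= lo) and np.all(x <= hi))


# A continuous test-time coordinate between the captured views.
query = (1.5, 3.25)
query_is_valid = in_interpolation_range(query, observed)
```

A query such as `(2.5, 1.0)` falls outside the captured range for $s$, so it would ask the network to extrapolate rather than interpolate.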
The images shown below are the sparse input ($Y$), which belongs to the X-Field ($\chi$). The capture modality, in this instance, is suited for light (illumination) interpolation. Observe the shadow of the white angel.
﻿
﻿
﻿
[Media panel: sparse input images (run set 7)]
﻿
To summarize, an architecture $L_{out}$ is trained to map vectors $y$ to captured images $L_{in}(y)$ in the hope of also getting plausible images $L_{out}(x)$ for unobserved vectors $x$. This is in line with the first key observation mentioned above.
At test time, interpolation is expected, but only within the range spanned by $Y$. Training never evaluates an X-Field coordinate $x$ that is not in $Y$, because we would not know what the image $L_{in}(x)$ at that coordinate should be.
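A minimal sketch of this training objective follows. It assumes an L1 reconstruction loss (the article reports an average L1 loss later on) and treats the model as an arbitrary callable. The function name `l1_training_loss` and the dictionary layout are illustrative assumptions, not the authors' code.

```python
import numpy as np


def l1_training_loss(model, observed):
    """Mean L1 error between model outputs and captured images.

    observed: dict mapping coordinate tuples y (members of Y) to images L_in(y).
    Only observed coordinates are evaluated, since no ground truth exists elsewhere.
    """
    losses = [
        np.abs(model(np.asarray(y, dtype=np.float32)) - image).mean()
        for y, image in observed.items()
    ]
    return float(np.mean(losses))


# Toy stand-in for L_out: ignores the coordinate and returns a mid-gray image.
def constant_model(y):
    return np.full((4, 4, 3), 0.5)


toy_observed = {
    (0.0,): np.zeros((4, 4, 3)),
    (1.0,): np.ones((4, 4, 3)),
}
toy_loss = l1_training_loss(constant_model, toy_observed)
```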
﻿
﻿
Architecture Design
The architecture design is the novel bit of the paper. $L_{out}$ is modeled using three main ideas.
Appearance is a combination of the appearance in the observed images ($L_{in}(y)$).
Appearance is assumed to be a product of shading and albedo.
The unobserved shading and albedo at $x$ are considered warped versions of the observed shading and albedo at $y$.
These assumptions need not hold, but when they do not, the neural network will have a harder time capturing the relationship between coordinates and images.
The proposed architecture is implemented in four steps:
Decoupling shading and albedo: Shading refers to the depiction of depth in 3D models (in 3D computer graphics) or illustrations (in 2D computer graphics) by varying the level of darkness. Albedo is the proportion of incident light that is reflected away from a surface. In other words, it is the overall brightness of an object.
Interpolating images as a weighted combination of warped images.
Representing "flow" using a neural network.
Resolving inconsistencies.
We will go through each of them one by one.
﻿
﻿
Decouple Shading and Albedo (De-lighting)
De-lighting splits appearance into a combination of shading, which moves in one way in response to changes in X-Field coordinates, and albedo, which is attached to the surface and moves with the geometry.
Every observed image is decomposed as:
$$L_{in}(y) = E(y) \odot A(y)$$
where $E$ is the shading image, $A$ is the albedo image, and $\odot$ is the pointwise product.
At test time, shading and albedo are interpolated independently and recombined into new radiance at an unobserved location $x$ by multiplication. The output is given by
$$L_{out}(x) = \mathrm{int}(A(L_{in}(y)), y \to x) \odot \mathrm{int}(E(L_{in}(y)), y \to x)$$
where $\mathrm{int}$ is an operator that will be described shortly.
﻿
﻿
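The de-lighting decomposition is a per-pixel product, which the sketch below illustrates on random data. The helper `albedo_from`, which recovers albedo by dividing the image by the shading, is an assumption made for illustration only. The article does not say how the split is obtained in practice.

```python
import numpy as np


def recombine(shading, albedo):
    """Pointwise product of shading E and albedo A, giving the radiance image."""
    return shading * albedo


def albedo_from(image, shading, eps=1e-6):
    """Assumed inverse of the product: divide the image by the (clamped) shading."""
    return image / np.maximum(shading, eps)


rng = np.random.default_rng(0)
shading = rng.uniform(0.2, 1.0, size=(4, 4, 3))  # E, kept away from zero
albedo = rng.uniform(0.0, 1.0, size=(4, 4, 3))   # A
image = recombine(shading, albedo)               # L_in = E * A
recovered_albedo = albedo_from(image, shading)
```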
﻿
﻿
Interpolation and Warping
Warping deforms an observed image into an unobserved image, conditioned on the observed and the unobserved X-Field coordinates:
$$\mathrm{warp}(I, y \to x) \in I \times \chi \times Y \to I$$
A spatial transformer (STN) with bilinear filtering is used to compute the pixels in one image by reading them from another image according to a given "flow" map.
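The reading step can be sketched as a plain NumPy bilinear sampler. This is not the spatial transformer implementation used by the authors. It only shows the idea of filling each output pixel "p" by reading from a fractional source position "q". The function name `bilinear_warp`, the (row, column) ordering, and the border clamping are assumptions.

```python
import numpy as np


def bilinear_warp(image, positions):
    """Read image at fractional positions with bilinear filtering.

    image:     (H, W, C) source image.
    positions: (H, W, 2) absolute (row, col) read position q for each output pixel p.
    Positions outside the image are clamped to the border.
    """
    h, w = image.shape[:2]
    r = np.clip(positions[..., 0], 0, h - 1)
    c = np.clip(positions[..., 1], 0, w - 1)
    r0 = np.floor(r).astype(int)
    c0 = np.floor(c).astype(int)
    r1 = np.minimum(r0 + 1, h - 1)
    c1 = np.minimum(c0 + 1, w - 1)
    fr = (r - r0)[..., None]  # fractional row offset
    fc = (c - c0)[..., None]  # fractional column offset
    top = image[r0, c0] * (1 - fc) + image[r0, c1] * fc
    bottom = image[r1, c0] * (1 - fc) + image[r1, c1] * fc
    return top * (1 - fr) + bottom * fr


source = np.arange(12, dtype=np.float64).reshape(3, 4, 1)
rows, cols = np.meshgrid(np.arange(3), np.arange(4), indexing="ij")
identity = np.stack([rows, cols], axis=-1).astype(np.float64)
half_shift = identity + np.array([0.0, 0.5])  # read half a pixel to the right
shifted = bilinear_warp(source, half_shift)
```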
Interpolation warps all observed images and merges the individual results. Both warping and merging are performed identically for shading ($E$) and albedo ($A$). This operation is denoted by $\mathrm{int}$ (also used above) and is given by
$$\mathrm{int}(I, y \to x) = \sum_{y \in Y} \left( \mathrm{cons}(y \to x) \odot \mathrm{warp}(I(y), y \to x) \right)$$
The critical question is: from which position "q" should a pixel at position "p" read when the image at $x$ is reconstructed from the one at $y$?
The first half of the answer is to use Jacobians of the mapping from X-Field coordinates to pixel positions. The Jacobian captures, for example, how a pixel moves in a certain view and light if time is changed. Mathematically, for a given pixel "p", it is the partial derivative
$$\mathrm{flow}_{\delta}(x)[p] = \frac{\delta p(x)}{\delta x} \in \chi \to \mathbb{R}^{2 \times n_d}$$
Here $[p]$ indexes into the discrete pixel array. The above formula specifies how pixels move for an infinitesimal change of X-Field coordinates. The Jacobian matrix holds all partial derivatives of the two pixel coordinates with respect to all $n_d$ dimensions of the X-Field coordinate. However, this is not the finite position "q". To find "q", the change in X-Field coordinate $y \to x$ is projected to 2D pixel motion using finite differences:
$$\mathrm{flow}_{\Delta}(y \to x)[p] = p + \Delta(y \to x)\, \mathrm{flow}_{\delta}(x)[p] = q$$
This equation gives a finite pixel motion for a finite change of X-Field coordinates. 
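The projection from a per-pixel Jacobian to a finite read position can be sketched as a matrix-vector product per pixel. The sketch assumes that $\Delta(y \to x)$ is the coordinate difference $y - x$ and that the Jacobian is stored as an array of shape (H, W, 2, n_d) in (row, column) order. Both are assumptions for illustration, and `finite_flow` is a hypothetical name.

```python
import numpy as np


def finite_flow(jacobian, x, y):
    """Return q = p + J(x)[p] @ (y - x) for every pixel p.

    jacobian: (H, W, 2, n_d) partial derivatives of pixel position w.r.t. coordinates.
    x, y:     X-Field coordinates of length n_d (unobserved and observed).
    """
    h, w = jacobian.shape[:2]
    rows, cols = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    p = np.stack([rows, cols], axis=-1).astype(np.float64)  # (H, W, 2)
    # Assumed sign convention: Delta(y -> x) = y - x.
    delta = np.asarray(y, dtype=np.float64) - np.asarray(x, dtype=np.float64)
    return p + jacobian @ delta  # (H, W, 2)


# Toy 1D X-Field (time only): every pixel moves 2 columns per unit of time.
toy_jacobian = np.zeros((3, 4, 2, 1))
toy_jacobian[..., 1, 0] = 2.0
q = finite_flow(toy_jacobian, x=[0.5], y=[0.0])
```

In this toy case the coordinate difference is -0.5, so every pixel reads from one column to its left.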
Flow
The input to the flow computation is the X-Field coordinate $x$, and the output is the Jacobian. This is implemented using a Convolutional Neural Network (CNN). The architecture starts with a fully connected layer that takes in the coordinate $x$, and its output is reshaped into a 2D image with 128 channels. A CoordConv layer is added at this stage. This is followed by multiple upsampling steps to reach the output resolution while the number of channels is reduced to $n_d$ output channels.
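The CoordConv idea is to append pixel-coordinate channels to a feature map before convolving, so that later layers know where each feature sits. The sketch below shows only that channel-appending step in NumPy. The 4 x 4 spatial size and the [-1, 1] normalization are illustrative assumptions, and `add_coord_channels` is a hypothetical helper rather than the layer from the official repo.

```python
import numpy as np


def add_coord_channels(features):
    """Append normalized row and column coordinate channels to a feature map.

    features: (H, W, C) array. Returns an (H, W, C + 2) array whose last two
    channels hold row and column positions scaled to [-1, 1].
    """
    h, w = features.shape[:2]
    rows = np.linspace(-1.0, 1.0, h)
    cols = np.linspace(-1.0, 1.0, w)
    rr, cc = np.meshgrid(rows, cols, indexing="ij")
    return np.concatenate([features, rr[..., None], cc[..., None]], axis=-1)


# Stand-in for the reshaped fully connected output with 128 channels.
feature_map = np.zeros((4, 4, 128))
augmented = add_coord_channels(feature_map)
```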
﻿
﻿
﻿
﻿
Consistency
To combine all observed images warped to the unobserved X-Field coordinate, each image pixel is weighted by its flow consistency. For a pixel "q" to contribute to the image at "p", the flow at "q" has to map back to "p". The consistency of one pixel "p" when warped to coordinate $x$ from $y$ is the partition of unity of a weight function:
$$\mathrm{cons}(y \to x)[p] = w(y \to x)[p] \left( \sum_{y' \in Y} w(y' \to x)[p] \right)^{-1}$$
The weights $w$ are smoothly decreasing functions of the 1-norm of the difference between the pixel position "p" and the backward flow at the position "q" to which "p" was warped:
$$w(y \to x)[p] = \exp\left( -\sigma \left| p - \mathrm{flow}_{\Delta}(x \to y)[q] \right|_1 \right)$$
Here, $\sigma = 10$ is a bandwidth parameter.
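The weighting and the final merge can be sketched together. The sketch assumes the round-trip positions (forward flow followed by backward flow) have already been computed for each observed image and stacked into one array. The function names, the array layout, and the small epsilon guarding the division are assumptions for illustration.

```python
import numpy as np


def consistency(round_trip, sigma=10.0):
    """Normalized consistency weights for n observed images.

    round_trip: (n, H, W, 2) position each pixel p returns to after being
                warped forward and then mapped back by the backward flow.
    Returns an (n, H, W) array that sums to one over the first axis.
    """
    h, w = round_trip.shape[1:3]
    rows, cols = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    p = np.stack([rows, cols], axis=-1)
    error = np.abs(round_trip - p).sum(axis=-1)  # 1-norm per pixel
    weights = np.exp(-sigma * error)
    # Epsilon avoids division by zero if every weight underflows.
    return weights / (weights.sum(axis=0, keepdims=True) + 1e-12)


def merge(warped, cons):
    """Consistency-weighted sum of warped images: (n, H, W, C) -> (H, W, C)."""
    return (cons[..., None] * warped).sum(axis=0)


h, w = 2, 3
rows, cols = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
p = np.stack([rows, cols], axis=-1).astype(np.float64)
# Image 0 maps back exactly; image 1 is off by one pixel everywhere.
round_trip = np.stack([p, p + np.array([0.0, 1.0])])
cons = consistency(round_trip)
warped = np.stack([np.ones((h, w, 3)), np.zeros((h, w, 3))])
merged = merge(warped, cons)
```

With a bandwidth of 10, a one-pixel round-trip error already reduces an image's weight to almost nothing, so the merged result is dominated by the consistent image.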
Training
Check out the Colab Notebook to reproduce results.
The official GitHub repo is available here. I have instrumented it with Weights & Biases, and you can find that repo here.
The linked Colab Notebook lets you play with the available datasets. Pick a dataset and select the appropriate Python command to train the model for that scene. Training might take some time, depending on the dataset.
The average L1 loss for the different models is shown in the media panel below. 
﻿
[Media panel: average L1 loss for the different models (run set 7)]
﻿
Results﻿
[Media panel: results (run set 7)]
﻿
﻿