Overview: Neural Scene Flow Fields (NSFF) for Space-Time View Synthesis of Dynamic Scenes

This report summarizes a proposed approach for novel view and time synthesis of dynamic scenes, using only a monocular video with known camera poses as input.
Ayush Thakur
In the paper Neural Scene Flow Fields for Space-Time View Synthesis of Dynamic Scenes by Li et al., the authors have proposed a method to synthesize new viewpoints in both space and time from a single monocular video of a scene.

Project Page | Paper | GitHub

For a little background, we recommend going through the paper summary of Neural Radiance Fields (NeRF) and 3D Photo Inpainting.
For a quick video summary of the NSFF paper check out the video below.
With all that out of the way, let's dig in:

Introduction

View synthesis is a fascinating area with applications in video editing like bullet-time effects, stabilization, object insertion, and more. Recently, this field saw a lot of progress courtesy of neural networks. However, prior works like 3D Photo Inpainting or NeRF assume the scene is "static".
Let me quickly show you a novel view synthesis of a "static" 2D image. 3D Photo Inpainting is used to generate the results shown below.

Reproduce results on Colab Notebook 👇

For any view synthesis, the scene (images) needs to be represented using some sort of data representation. Li et al. introduced Neural Scene Flow Fields (NSFF), as a new representation that models the dynamic scene as a time-variant continuous function of appearance, geometry, and 3D scene motion.
This representation is parameterized using a deep neural network (MLP in this case), while the rendering of scenes (interpolation) is done using volume tracing.
The authors have also introduced a new loss function that enforces that the new scene representation is temporally consistent with the input views.
Overall, the approach significantly outperforms prior monocular view synthesis methods, as evidenced by the table shown below.
Figure 1: Quantitative evaluation of novel view synthesis on the Dynamic Scenes dataset. (Source)

Overview of the Proposed Method

The proposed method is built upon NeRF, to which the authors have added the notion of time. NeRF represents the static scene as a radiance field defined over a bounded 3D volume.
The radiance field is represented by a multi-layer perceptron (MLP) F_Θ whose input is a single continuous 5D coordinate, the spatial location \bold{x} = (x, y, z) and viewing direction d = (θ, φ), and whose output is the volume density (σ) and RGB color (c):
(c, σ) = F_Θ(\bold{x}, d) (1)
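To make Eq. 1 concrete, here is a minimal PyTorch sketch of such a static radiance-field MLP. The class name, layer widths, and the omission of positional encoding are my simplifications, not the authors' exact architecture:

```python
import torch
import torch.nn as nn

class StaticNeRF(nn.Module):
    """Minimal sketch of F_theta: (x, d) -> (c, sigma). Positional encoding omitted for brevity."""
    def __init__(self, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.sigma_head = nn.Linear(hidden, 1)           # volume density
        self.color_head = nn.Sequential(                 # view-dependent RGB
            nn.Linear(hidden + 3, hidden // 2), nn.ReLU(),
            nn.Linear(hidden // 2, 3), nn.Sigmoid(),
        )

    def forward(self, x, d):
        h = self.trunk(x)                                # features of the 3D point
        sigma = torch.relu(self.sigma_head(h))           # density is non-negative
        c = self.color_head(torch.cat([h, d], dim=-1))   # color also depends on view direction
        return c, sigma
```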

Extending NeRF

NeRF is based on the assumption that the scene is static. However, most of the videos shared online do not fit this restriction and have diverse dynamic content (e.g., humans, animals, vehicles, etc.), recorded by a single camera.
To capture scene dynamics, the authors extend the static scenario described in Eq. 1 by including time in the domain and explicitly modeling 3D motion as dense scene flow fields.
Thus for a given 3D point x and time i, the model not only predicts the reflectance and opacity but also forward and backward 3D scene flow F_i = (f_{i \rightarrow i+1}, f_{i \rightarrow i-1}), which denote 3D offset vectors that point to the position of x at times i+1 and i-1 respectively. This is based on the assumption that the movement between observed time instances is linear. Disocclusion weights W_i = (w_{i\rightarrow i+1}, w_{i\rightarrow i-1}) are also predicted to handle disocclusions in 3D space.
Overall the extended NeRF model is given by,
(c_i, σ_i, F_i, W_i) = F_Θ^{dy}(\bold{x}, d, i) (2)
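Continuing the sketch above, here is one hedged way the dynamic model could be wired up, with extra heads for the scene flow and disocclusion weights. The names, widths, and the raw time input (no positional encoding) are illustrative assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class DynamicNSFF(nn.Module):
    """Sketch of F_theta^{dy}: (x, d, i) -> (c_i, sigma_i, scene flows, disocclusion weights)."""
    def __init__(self, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(3 + 1, hidden), nn.ReLU(),          # 3D location plus the time index i
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.sigma_head = nn.Linear(hidden, 1)
        self.color_head = nn.Sequential(
            nn.Linear(hidden + 3, hidden // 2), nn.ReLU(),
            nn.Linear(hidden // 2, 3), nn.Sigmoid(),
        )
        self.flow_head = nn.Linear(hidden, 6)             # forward and backward 3D offsets
        self.weight_head = nn.Linear(hidden, 2)           # disocclusion confidences in [0, 1]

    def forward(self, x, d, i):
        h = self.trunk(torch.cat([x, i], dim=-1))
        sigma = torch.relu(self.sigma_head(h))
        c = self.color_head(torch.cat([h, d], dim=-1))
        f_fwd, f_bwd = self.flow_head(h).chunk(2, dim=-1)               # f_{i->i+1}, f_{i->i-1}
        w_fwd, w_bwd = torch.sigmoid(self.weight_head(h)).chunk(2, dim=-1)
        return c, sigma, (f_fwd, f_bwd), (w_fwd, w_bwd)
```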

Optimization

In the NeRF paper, F_Θ is optimized to reconstruct the input views. The loss function used is given by,
\mathcal{L}_{static} = \sum_r||\hat{C}(r) - C(r)||_2^2
where r is the camera ray emitted from the center of projection through a pixel on the image plane. \hat{C} is the reconstructed color and C is the ground truth color.
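In code, this reconstruction loss is just a squared error over sampled rays. A minimal sketch, assuming rendered_rgb and gt_rgb are (N_rays, 3) PyTorch tensors produced by volume rendering (averaging over the ray batch instead of summing is a common implementation choice):

```python
def static_reconstruction_loss(rendered_rgb, gt_rgb):
    """L_static: squared error between rendered and ground-truth ray colors."""
    return ((rendered_rgb - gt_rgb) ** 2).sum(dim=-1).mean()
```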
However, to effectively optimize the time-variant scene representation on the input views, the authors have introduced a new loss function called Temporal Photometric Consistency. In this section, we will look at all the different loss functions used by the authors.
Temporal Photometric Consistency (\mathcal{L}_{pho}): This loss enforces that the scene at time i should be consistent with the scene at neighboring times j \in \mathcal{N}(i), when accounting for motion that occurs due to 3D scene flow.
Figure 2: Warping strategy of scene flow fields. (Source)
The scene at time i is rendered with volume tracing, as mentioned earlier, from the perspective of the camera at time i, with the scene warped from time j to i.
Note that since the authors have assumed that the motion between two scenes is linear, warping the scene from time j to i undoes the motion that occurred between i and j. So how is this "consistency" achieved? From the paper:
As shown in figure 2, we achieve this by warping each 3D sampled point location \bold{x}_i along a ray r_i during volume tracing using the predicted scene flow fields \mathcal{F}_i to look up the RGB color c_j and opacity σ_j from neighboring time j. This yields a rendered image, denoted \hat{C}_{j\rightarrow i}, of the scene at time j with both camera and scene motion warped to time i.
Once we have the rendered image, the fancy loss function under the hood is a Mean Squared Error (MSE) between the warped rendered image and the ground truth view given as:
\mathcal{L}_{pho} = \sum_{r_i}\sum_{j\in \mathcal{N}(i)} ||\hat{C}_{j\rightarrow i}(r_i) - C_i(r_i)||_2^2
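A hedged sketch of how \mathcal{L}_{pho} could be computed. The render_warped helper is hypothetical; it stands in for the warp-then-volume-render step described in the quote above:

```python
def temporal_photometric_loss(model, rays_i, gt_rgb_i, i, neighbors, render_warped):
    """L_pho: compare views warped from each neighboring time j back to time i."""
    loss = 0.0
    for j in neighbors:                                  # e.g. j in {i - 1, i + 1}
        c_j_to_i = render_warped(model, rays_i, i, j)    # rendered image \hat{C}_{j->i}
        loss = loss + ((c_j_to_i - gt_rgb_i) ** 2).sum(dim=-1).mean()
    return loss
```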
3D Scene Flow Cycle Consistency Loss (\mathcal{L}_{cyc}): If you are familiar with CycleGAN you might have heard about consistency loss. In the context of scene flow fields, this term encourages the predicted forward scene flow f_{i \rightarrow j} to be consistent with the backward flow f_{j \rightarrow i} at the corresponding location \bold{x}_i sampled at time j.
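As a rough sketch, cycle consistency can be penalized by following the forward flow to time j and asking the backward flow queried there to bring the point back; the L1 penalty and argument names here are assumptions:

```python
def scene_flow_cycle_loss(f_i_to_j, f_j_to_i_at_xj):
    """L_cyc: forward flow and the backward flow at the displaced point should cancel out."""
    # x_j = x_i + f_{i->j}; ideally f_{j->i}(x_j) == -f_{i->j}(x_i)
    return (f_i_to_j + f_j_to_i_at_xj).abs().sum(dim=-1).mean()
```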
Low-level regularization terms (\mathcal{L}_{reg}): The authors have additionally used a few regularization terms based on prior works. \mathcal{L}_{reg} consists of three terms with equal weights, \mathcal{L}_{reg} = \mathcal{L}_{sp}+\mathcal{L}_{temp}+\mathcal{L}_{min}: a spatial smoothness term \mathcal{L}_{sp} that encourages the scene flow at nearby 3D points to be similar, a temporal smoothness term \mathcal{L}_{temp} that encourages 3D point trajectories to be piece-wise linear (the forward and backward flows should roughly cancel), and \mathcal{L}_{min} that encourages the predicted scene flow to be small, since most of a typical scene is static.
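Here is a very rough, equal-weighted sketch of what these priors could look like on flow samples taken along a single ray; the exact weighting and neighborhoods used in the paper differ, so treat this purely as an illustration:

```python
def scene_flow_regularizers(flow_fwd, flow_bwd):
    """Sketch of L_sp + L_temp + L_min on per-ray flow samples of shape (N_samples, 3)."""
    # L_sp: neighboring samples along the ray should have similar flow
    l_sp = (flow_fwd[1:] - flow_fwd[:-1]).abs().mean() + (flow_bwd[1:] - flow_bwd[:-1]).abs().mean()
    # L_temp: forward and backward flows should roughly cancel (piece-wise linear trajectories)
    l_temp = ((flow_fwd + flow_bwd) ** 2).sum(dim=-1).mean()
    # L_min: prefer small scene flow, since most of the scene is static
    l_min = flow_fwd.abs().mean() + flow_bwd.abs().mean()
    return l_sp + l_temp + l_min
```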

Two Caveats

3D disocclusion regions caused by motion: Every novel view synthesis method has to deal with the disocclusion caused by the motion of the object(s) in question. These artifacts usually occur at the boundaries of moving objects. Here's a quick example of disocclusion near the boundary of the moving object.
The temporal photometric consistency loss introduced above is not valid in these disoccluded regions, since the corresponding content simply does not exist at the neighboring time. From the paper:
To mitigate errors due to this ambiguity, we predict two extra continuous disocclusion weight fields w_{i\rightarrow i+1} and w_{i\rightarrow i-1} \in [0,1], corresponding to f_{i\rightarrow i+1} and f_{i\rightarrow i-1} respectively. These weights serve as an unsupervised confidence of where the temporal photoconsistency loss should be applied; ideally they should be low at disocclusions and close to 1 everywhere else.
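A hedged sketch of how these weights could gate the photometric term, with an extra penalty that keeps them close to 1 so the network cannot simply switch the loss off everywhere (the beta hyperparameter and per-ray weighting are my assumptions):

```python
def weighted_photometric_loss(c_warped, gt_rgb, w, beta=0.1):
    """Down-weight the photometric error by the predicted disocclusion confidence w in [0, 1]."""
    photo = (w * ((c_warped - gt_rgb) ** 2).sum(dim=-1, keepdim=True)).mean()
    # Regularize the weights toward 1 so the trivial solution w == 0 is penalized
    return photo + beta * (1.0 - w).abs().mean()
```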
Proper Initialization: Novel view and time synthesis from monocular video input with known camera poses is a highly ill-posed problem. A problem is ill-posed if it admits multiple solutions; here, many different scene configurations can explain the same observed image sequence. The losses described so far can on occasion converge to sub-optimal local minima when randomly initialized.
The authors have thus introduced two data-driven losses: a geometric consistency prior and a single-view depth prior. This is given by, \mathcal{L}_{data} = \mathcal{L}_{geo} + β_z\mathcal{L}_{z} where β_z=2 (used in the paper).
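Roughly speaking, the geometric consistency prior asks the projection of the flow-displaced 3D points to agree with pre-computed 2D optical flow, and the single-view depth prior asks the rendered depth to agree with a monocular depth prediction up to its scale and shift ambiguity. The sketch below is only meant to convey that idea; the project helper and the normalization scheme are assumptions:

```python
def geometric_consistency_loss(x_displaced, flow_2d, pixels_i, project):
    """L_geo: project x_i + f_{i->j} into frame j and compare with pre-computed optical flow."""
    expected_pixels_j = pixels_i + flow_2d                  # where 2D optical flow says the pixel moved
    return (project(x_displaced) - expected_pixels_j).abs().sum(dim=-1).mean()

def depth_prior_loss(rendered_depth, mono_depth):
    """L_z: compare rendered depth with single-view depth after removing scale/shift ambiguity."""
    def normalize(d):
        d = d - d.median()
        return d / (d.abs().mean() + 1e-8)
    return (normalize(rendered_depth) - normalize(mono_depth)).abs().mean()
```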

Benefitting From Time-Dependent Representation

The issue with the temporal photometric consistency loss is that it can only be used in a local temporal neighborhood \mathcal{N}(i).
To capture information across frames and over larger temporal gaps, the authors propose to combine their time-dependent scene representation with a time-independent one. The combination is formulated so that the resulting volume (representation) faithfully reconstructs the input frames. Two separate MLPs are used to learn the two representations.
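One plausible way to blend the per-sample outputs of the two MLPs during volume rendering is a density-weighted linear combination driven by a learned blending weight; this is just a sketch of the idea, not the paper's exact blending equation:

```python
def blend_static_dynamic(c_st, sigma_st, c_dy, sigma_dy, v):
    """Blend static and dynamic predictions at a sample with a learned weight v in [0, 1]."""
    sigma = v * sigma_st + (1.0 - v) * sigma_dy                               # combined density
    c = (v * sigma_st * c_st + (1.0 - v) * sigma_dy * c_dy) / (sigma + 1e-8)  # density-weighted color
    return c, sigma
```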

Let's Synthesize Space-Time Views

The straightforward way to synthesize novel space-time views is to simply volume render each pixel using only the dynamic representation, or the dynamic and static representations combined. However, doing so only produces good results at times corresponding to the input views; the representation does not allow interpolating time-variant geometry at intermediate times between two "seen" input scenes. Bummer!
Figure 3: Splatting-based approach to perform space-time interpolation.
To tackle the synthesis of novel space-time views in between input time indices, the authors have adopted a splatting-based plane-sweep volume tracing approach. From the paper,
To render an image at intermediate time i+\delta_i, \delta_i \in (0,1) at a specified target viewpoint, we sweep a plane over every ray emitted from the target viewpoint from front to back. At each sampled step t along the ray, we query point information through our model at both times i and i+1, and displace all 3D points at time i by the scaled scene flow \bold{x}_i+\delta_i f_{i\rightarrow i+1}(\bold{x}_i), and similarly for time i+1. We then splat the 3D displaced points onto a (c, \alpha) accumulation buffer at the target viewpoint, and blend splats from time i and i+1 with linear weights 1-\delta_i, \delta_i. The final rendered view is obtained by volume rendering the accumulation buffer.
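A very rough sketch of the displace-and-blend step described in the quote, leaving out the actual splatting onto the (c, \alpha) accumulation buffer and the final volume rendering; all names are illustrative:

```python
def displace_for_interpolation(x_i, flow_i_fwd, x_ip1, flow_ip1_bwd, delta):
    """Move samples from times i and i+1 toward the intermediate time i + delta."""
    x_from_i = x_i + delta * flow_i_fwd                  # time-i points pushed forward by delta
    x_from_ip1 = x_ip1 + (1.0 - delta) * flow_ip1_bwd    # time-(i+1) points pulled backward
    w_i, w_ip1 = 1.0 - delta, delta                      # linear blend weights for the two splat sets
    return (x_from_i, w_i), (x_from_ip1, w_ip1)
```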

Conclusion and Limitations

This research opens the door to better video editing features that are less time-consuming yet still leave room for creativity. The paper introduces an approach to learning a new representation from monocular videos in the wild. This representation implicitly models the scene's time-variant reflectance, geometry, and 3D motion, and can then be used to generate compelling space-time view synthesis results.
Despite being state-of-the-art on this niche task, there are a few limitations: the method can struggle when object motion is large relative to camera motion, it relies on pre-computed single-view depth and optical flow for supervision, and rendering remains computationally expensive.
Nevertheless, this is an exciting new frontier that the authors explored, and as Károly always mentions: two more papers down the line and we would have made significant progress. I would also like to thank Justin Tenuto for all the edits.