Overview: Neural Scene Flow Fields (NSFF) for Space-Time View Synthesis of Dynamic Scenes

This report summarizes a proposed approach for novel view and time synthesis of dynamic scenes, using only a monocular video with known camera poses as input.
Ayush Thakur
In the paper Neural Scene Flow Fields for Space-Time View Synthesis of Dynamic Scenes by Li et al., the authors have proposed a method to synthesize new viewpoints in both space and time from a single monocular video of a scene.

Project Page | Paper | GitHub

For a little background, we recommend going through the paper summary of Neural Radiance Fields (NeRF) and 3D Photo Inpainting.
For a quick video summary of the NSFF paper check out the video below.
With all that out of the way, let's dig in:

Introduction

View synthesis is a fascinating area with applications in video editing like bullet-time effects, stabilization, object insertion, and more. Recently, this field saw a lot of progress courtesy of neural networks. However, prior works like 3D Photo Inpainting or NeRF assume the scene is "static".
Let me quickly show you a novel view synthesis of a "static" 2D image. 3D Photo Inpainting is used to generate the results shown below.

Reproduce results on Colab Notebook 👇

For any view synthesis, the scene (images) needs to be represented using some sort of data representation. Li et al. introduced Neural Scene Flow Fields (NSFF), as a new representation that models the dynamic scene as a time-variant continuous function of appearance, geometry, and 3D scene motion.
This representation is parameterized using a deep neural network (MLP in this case), while the rendering of scenes (interpolation) is done using volume tracing.
The authors have also introduced a new loss function that enforces that the new scene representation is temporally consistent with the input views.
Overall, the approach significantly outperforms prior monocular view synthesis methods, as evidenced by the table shown below.
Figure 1: Quantitative evaluation of novel view synthesis on the Dynamic Scenes dataset. (Source)

Overview of the Proposed Method

The proposed method is built upon NeRF, to which the authors have added the notion of time. NeRF represents the static scene as a radiance field defined over a bounded 3D volume.
The radiance field is represented by a multi-layer perceptron (MLP) F_Θ whose input is a single continuous 5D coordinate, the spatial location \bold{x} = (x, y, z) and viewing direction d = (θ, φ), and whose output is the volume density (σ) and RGB color (c):
(c, σ) = F_Θ(\bold{x}, d) (1)
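To make Eq. 1 concrete, here is a minimal PyTorch sketch of such a static radiance-field MLP. The class name, layer widths, and the omission of positional encoding are my simplifications, not the authors' exact architecture:

```python
import torch
import torch.nn as nn

class StaticNeRF(nn.Module):
    """Minimal sketch of F_theta: (x, d) -> (c, sigma). Positional encoding omitted for brevity."""
    def __init__(self, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.sigma_head = nn.Linear(hidden, 1)           # volume density
        self.color_head = nn.Sequential(                 # view-dependent RGB
            nn.Linear(hidden + 3, hidden // 2), nn.ReLU(),
            nn.Linear(hidden // 2, 3), nn.Sigmoid(),
        )

    def forward(self, x, d):
        h = self.trunk(x)                                # features of the 3D point
        sigma = torch.relu(self.sigma_head(h))           # density is non-negative
        c = self.color_head(torch.cat([h, d], dim=-1))   # color also depends on view direction
        return c, sigma
```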

Extending NeRF

NeRF is based on the assumption that the scene is static. However, most of the videos shared online do not fit this restriction and have diverse dynamic content (e.g., humans, animals, vehicles, etc.), recorded by a single camera.
To capture scene dynamics, the authors extend the static scenario described in Eq. 1 by including time in the domain and explicitly modeling 3D motion as dense scene flow fields.
Thus for a given 3D point x and time i, the model not only predicts the reflectance and opacity but also forward and backward 3D scene flow F_i = (f_{i \rightarrow i+1}, f_{i \rightarrow i-1}), which denote 3D offset vectors that point to the position of x at times i+1 and i-1 respectively. This is based on the assumption that the movement between observed time instances is linear. Disocclusion weights W_i = (w_{i\rightarrow i+1}, w_{i\rightarrow i-1}) are also predicted to handle disocclusions in 3D space.
Overall the extended NeRF model is given by,
(c_i, σ_i, F_i, W_i) = F_Θ^{dy}(\bold{x}, d, i) (2)
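Continuing the sketch above, here is one hedged way the dynamic model could be wired up, with extra heads for the scene flow and disocclusion weights. The names, widths, and the raw time input (no positional encoding) are illustrative assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class DynamicNSFF(nn.Module):
    """Sketch of F_theta^{dy}: (x, d, i) -> (c_i, sigma_i, scene flows, disocclusion weights)."""
    def __init__(self, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(3 + 1, hidden), nn.ReLU(),          # 3D location plus the time index i
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.sigma_head = nn.Linear(hidden, 1)
        self.color_head = nn.Sequential(
            nn.Linear(hidden + 3, hidden // 2), nn.ReLU(),
            nn.Linear(hidden // 2, 3), nn.Sigmoid(),
        )
        self.flow_head = nn.Linear(hidden, 6)             # forward and backward 3D offsets
        self.weight_head = nn.Linear(hidden, 2)           # disocclusion confidences in [0, 1]

    def forward(self, x, d, i):
        h = self.trunk(torch.cat([x, i], dim=-1))
        sigma = torch.relu(self.sigma_head(h))
        c = self.color_head(torch.cat([h, d], dim=-1))
        f_fwd, f_bwd = self.flow_head(h).chunk(2, dim=-1)               # f_{i->i+1}, f_{i->i-1}
        w_fwd, w_bwd = torch.sigmoid(self.weight_head(h)).chunk(2, dim=-1)
        return c, sigma, (f_fwd, f_bwd), (w_fwd, w_bwd)
```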

Optimization

In the NeRF paper, F_Θ is optimized to reconstruct the input views. The loss function used is given by,
\mathcal{L}_{static} = \sum_r||\hat{C}(r) - C(r)||_2^2
where r is the camera ray emitted from the center of projection through a pixel on the image plane. \hat{C} is the reconstructed color and C is the ground truth color.
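In code, this reconstruction loss is just a squared error over sampled rays. A minimal sketch, assuming rendered_rgb and gt_rgb are (N_rays, 3) PyTorch tensors produced by volume rendering (averaging over the ray batch instead of summing is a common implementation choice):

```python
def static_reconstruction_loss(rendered_rgb, gt_rgb):
    """L_static: squared error between rendered and ground-truth ray colors."""
    return ((rendered_rgb - gt_rgb) ** 2).sum(dim=-1).mean()
```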
However, to effectively optimize the time-variant scene representation on the input views, the authors have introduced a new loss function called Temporal Photometric Consistency. In this section, we will look at all the different loss functions used by the authors.
Temporal Photometric Consistency (\mathcal{L}_{pho}): This loss enforces that the scene at time i should be consistent with the scene at neighboring times j \in \mathcal{N}(i), when accounting for motion that occurs due to 3D scene flow.
Figure 2: Warping strategy of scene flow fields. (Source)
The scene at time i is rendered with volume tracing, as mentioned earlier, from the perspective of the camera at time i, with the scene warped from time j to i.
Note that since the authors have assumed that the motion between two scenes is linear, warping the scene from time j to i undoes the motion that occurred between i and j. So how is this "consistency" achieved? From the paper:
As shown in figure 2, we achieve this by warping each 3D sampled point location \bold{x}_i along a ray r_i during volume tracing using the predicted scene flow fields \mathcal{F}_i to look up the RGB color c_j and opacity σ_j from neighboring time j. This yields a rendered image, denoted \hat{C}_{j\rightarrow i}, of the scene at time j with both camera and scene motion warped to time i.
Once we have the rendered image, the fancy loss function under the hood is a Mean Squared Error (MSE) between the warped rendered image and the ground truth view given as:
\mathcal{L}_{pho} = \sum_{r_i}\sum_{j\in \mathcal{N}(i)} ||\hat{C}_{j\rightarrow i}(r_i) - C_i(r_i)||_2^2
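A hedged sketch of how \mathcal{L}_{pho} could be computed. The render_warped helper is hypothetical; it stands in for the warp-then-volume-render step described in the quote above:

```python
def temporal_photometric_loss(model, rays_i, gt_rgb_i, i, neighbors, render_warped):
    """L_pho: compare views warped from each neighboring time j back to time i."""
    loss = 0.0
    for j in neighbors:                                  # e.g. j in {i - 1, i + 1}
        c_j_to_i = render_warped(model, rays_i, i, j)    # rendered image \hat{C}_{j->i}
        loss = loss + ((c_j_to_i - gt_rgb_i) ** 2).sum(dim=-1).mean()
    return loss
```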
3D Scene Flow Cycle Consistency Loss (\mathcal{L}_{cyc}): If you are familiar with CycleGAN you might have heard about consistency loss. In the context of scene flow fields, this term encourages the predicted forward scene flow f_{i \rightarrow j} to be consistent with the backward flow f_{j \rightarrow i} at the corresponding location \bold{x}_i sampled at time j.
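As a rough sketch, cycle consistency can be penalized by following the forward flow to time j and asking the backward flow queried there to bring the point back; the L1 penalty and argument names here are assumptions:

```python
def scene_flow_cycle_loss(f_i_to_j, f_j_to_i_at_xj):
    """L_cyc: forward flow and the backward flow at the displaced point should cancel out."""
    # x_j = x_i + f_{i->j}; ideally f_{j->i}(x_j) == -f_{i->j}(x_i)
    return (f_i_to_j + f_j_to_i_at_xj).abs().sum(dim=-1).mean()
```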
Low-level regularization terms (\mathcal{L}_{reg}): The authors have additionally used a few regularization terms based on prior works. \mathcal{L}_{reg} consists of three terms with equal weights, \mathcal{L}_{reg} = \mathcal{L}_{sp}+\mathcal{L}_{temp}+\mathcal{L}_{min}: a spatial smoothness term \mathcal{L}_{sp} that encourages the scene flow at nearby 3D points to be similar, a temporal smoothness term \mathcal{L}_{temp} that encourages 3D point trajectories to be piece-wise linear (the forward and backward flows should roughly cancel), and \mathcal{L}_{min} that encourages the predicted scene flow to be small, since most of a typical scene is static.
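Here is a very rough, equal-weighted sketch of what these priors could look like on flow samples taken along a single ray; the exact weighting and neighborhoods used in the paper differ, so treat this purely as an illustration:

```python
def scene_flow_regularizers(flow_fwd, flow_bwd):
    """Sketch of L_sp + L_temp + L_min on per-ray flow samples of shape (N_samples, 3)."""
    # L_sp: neighboring samples along the ray should have similar flow
    l_sp = (flow_fwd[1:] - flow_fwd[:-1]).abs().mean() + (flow_bwd[1:] - flow_bwd[:-1]).abs().mean()
    # L_temp: forward and backward flows should roughly cancel (piece-wise linear trajectories)
    l_temp = ((flow_fwd + flow_bwd) ** 2).sum(dim=-1).mean()
    # L_min: prefer small scene flow, since most of the scene is static
    l_min = flow_fwd.abs().mean() + flow_bwd.abs().mean()
    return l_sp + l_temp + l_min
```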

Two Caveats

3D disocclusion regions caused by motion: Every novel view synthesis method has to deal with the disocclusion caused by the motion of the object(s) in question. These artifacts usually occur at the boundaries of moving objects. Here's a quick example of disocclusion near the boundary of the moving object.
The temporal photometric consistency loss introduced above is not valid in these disoccluded regions, since the corresponding content simply does not exist at the neighboring time. From the paper:
To mitigate errors due to this ambiguity, we predict two extra continuous disocclusion weight fields w_{i\rightarrow i+1} and w_{i\rightarrow i-1} \in [0,1], corresponding to f_{i\rightarrow i+1} and f_{i\rightarrow i-1} respectively. These weights serve as an unsupervised confidence of where the temporal photoconsistency loss should be applied; ideally they should be low at disocclusions and close to 1 everywhere else.
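A hedged sketch of how these weights could gate the photometric term, with an extra penalty that keeps them close to 1 so the network cannot simply switch the loss off everywhere (the beta hyperparameter and per-ray weighting are my assumptions):

```python
def weighted_photometric_loss(c_warped, gt_rgb, w, beta=0.1):
    """Down-weight the photometric error by the predicted disocclusion confidence w in [0, 1]."""
    photo = (w * ((c_warped - gt_rgb) ** 2).sum(dim=-1, keepdim=True)).mean()
    # Regularize the weights toward 1 so the trivial solution w == 0 is penalized
    return photo + beta * (1.0 - w).abs().mean()
```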
Proper Initialization: Novel view and time synthesis from monocular video input with known camera poses is a highly ill-posed problem. A problem is ill-posed if it admits multiple solutions; here, many different scene configurations can explain the same observed image sequence. The losses described so far can on occasion converge to sub-optimal local minima when randomly initialized.
The authors have thus introduced two data-driven losses: a geometric consistency prior and a single-view depth prior. This is given by, \mathcal{L}_{data} = \mathcal{L}_{geo} + β_z\mathcal{L}_{z} where β_z=2 (used in the paper).
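Roughly speaking, the geometric consistency prior asks the projection of the flow-displaced 3D points to agree with pre-computed 2D optical flow, and the single-view depth prior asks the rendered depth to agree with a monocular depth prediction up to its scale and shift ambiguity. The sketch below is only meant to convey that idea; the project helper and the normalization scheme are assumptions:

```python
def geometric_consistency_loss(x_displaced, flow_2d, pixels_i, project):
    """L_geo: project x_i + f_{i->j} into frame j and compare with pre-computed optical flow."""
    expected_pixels_j = pixels_i + flow_2d                  # where 2D optical flow says the pixel moved
    return (project(x_displaced) - expected_pixels_j).abs().sum(dim=-1).mean()

def depth_prior_loss(rendered_depth, mono_depth):
    """L_z: compare rendered depth with single-view depth after removing scale/shift ambiguity."""
    def normalize(d):
        d = d - d.median()
        return d / (d.abs().mean() + 1e-8)
    return (normalize(rendered_depth) - normalize(mono_depth)).abs().mean()
```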

Benefitting From Time-Dependent Representation

The issue with the temporal photometric consistency loss is that it can only be used in a local temporal neighborhood \mathcal{N}(i).
To capture information across frames and over larger temporal gaps, the authors propose to combine their time-dependent scene representation with a time-independent one. The combination is formulated so that the resulting volume (representation) faithfully reconstructs the input frames. Two separate MLPs are used to learn the two representations.
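One plausible way to blend the per-sample outputs of the two MLPs during volume rendering is a density-weighted linear combination driven by a learned blending weight; this is just a sketch of the idea, not the paper's exact blending equation:

```python
def blend_static_dynamic(c_st, sigma_st, c_dy, sigma_dy, v):
    """Blend static and dynamic predictions at a sample with a learned weight v in [0, 1]."""
    sigma = v * sigma_st + (1.0 - v) * sigma_dy                               # combined density
    c = (v * sigma_st * c_st + (1.0 - v) * sigma_dy * c_dy) / (sigma + 1e-8)  # density-weighted color
    return c, sigma
```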

Let's Synthesize Space-Time Views

The straightforward way to synthesize novel space-time views is to simply volume render each pixel using only the dynamic representation, or the dynamic and static representations combined. However, doing so only produces good results at times corresponding to the input views; the representation does not allow interpolating time-variant geometry at intermediate times between two "seen" input scenes. Bummer!
Figure 3: Splatting-based approach to perform space-time interpolation.
To tackle the synthesis of novel space-time views in between input time indices, the authors have adopted a splatting-based plane-sweep volume tracing approach. From the paper,
To render an image at intermediate time i+\delta_i, \delta_i \in (0,1) at a specified target viewpoint, we sweep a plane over every ray emitted from the target viewpoint from front to back. At each sampled step t along the ray, we query point information through our model at both times i and i+1, and displace all 3D points at time i by the scaled scene flow \bold{x}_i+\delta_i f_{i\rightarrow i+1}(\bold{x}_i), and similarly for time i+1. We then splat the 3D displaced points onto a (c, \alpha) accumulation buffer at the target viewpoint, and blend splats from time i and i+1 with linear weights 1-\delta_i, \delta_i. The final rendered view is obtained by volume rendering the accumulation buffer.
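A very rough sketch of the displace-and-blend step described in the quote, leaving out the actual splatting onto the (c, \alpha) accumulation buffer and the final volume rendering; all names are illustrative:

```python
def displace_for_interpolation(x_i, flow_i_fwd, x_ip1, flow_ip1_bwd, delta):
    """Move samples from times i and i+1 toward the intermediate time i + delta."""
    x_from_i = x_i + delta * flow_i_fwd                  # time-i points pushed forward by delta
    x_from_ip1 = x_ip1 + (1.0 - delta) * flow_ip1_bwd    # time-(i+1) points pulled backward
    w_i, w_ip1 = 1.0 - delta, delta                      # linear blend weights for the two splat sets
    return (x_from_i, w_i), (x_from_ip1, w_ip1)
```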

Conclusion and Limitations

This research opens the door to better video editing features that are less time-consuming yet still leave room for creativity. The paper introduces an approach to learning a new representation from monocular videos in the wild. This representation implicitly models the scene's time-variant reflectance, geometry, and 3D motion, and can then be used to generate compelling space-time view synthesis results.
Despite being state-of-the-art on this niche task, there are a few limitations: the method can struggle when object motion is large relative to camera motion, it relies on pre-computed single-view depth and optical flow for supervision, and rendering remains computationally expensive.
Nevertheless, this is an exciting new frontier that the authors explored, and as Károly always mentions: two more papers down the line and we would have made significant progress. I would also like to thank Justin Tenuto for all the edits.