NeRF – Representing Scenes as Neural Radiance Fields for View Synthesis

Made by Lavanya Shukla using Weights & Biases

Introduction

Code | Paper →

In the Representing Scenes as Neural Radiance Fields for View Synthesis paper, the authors present a method that achieves state-of-the-art results for synthesizing novel views of complex scenes by optimizing an underlying continuous volumetric scene function using a sparse set of input views.

Their algorithm represents a scene using a fully-connected (non-convolutional) deep network, whose input is a single continuous 5D coordinate (spatial location (x, y, z) and viewing direction (θ, φ)) and whose output is the volume density and view-dependent emitted radiance at that spatial location.

They synthesize views by querying 5D coordinates along camera rays and use classic volume rendering techniques to project the output colors and densities into an image. Because volume rendering is naturally differentiable, the only input required to optimize the representation is a set of images with known camera poses. They describe how to effectively optimize neural radiance fields to render photorealistic novel views of scenes with complicated geometry and appearance, and demonstrate results that outperform prior work on neural rendering and view synthesis.
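To make the idea concrete, here's a minimal sketch of such a fully-connected network in TensorFlow/Keras. It assumes the simplified tiny-NeRF setup used in the accompanying colab, which drops the viewing direction and maps a positionally-encoded 3D location to an RGB color plus a volume density; it is not the authors' full architecture.

```python
import tensorflow as tf

def init_nerf_model(num_layers=8, hidden_size=256, embed_size=6):
    # Each of the 3 coordinates is kept raw and also expanded into
    # sin/cos features at `embed_size` frequencies (positional encoding).
    input_dim = 3 + 3 * 2 * embed_size
    inputs = tf.keras.Input(shape=(input_dim,))
    x = inputs
    for _ in range(num_layers):
        x = tf.keras.layers.Dense(hidden_size, activation="relu")(x)
    # 4 outputs per sampled point: RGB color and volume density (sigma).
    outputs = tf.keras.layers.Dense(4)(x)
    return tf.keras.Model(inputs=inputs, outputs=outputs)
```

Rendering then samples points along each camera ray, queries this network at every point, and alpha-composites the predicted colors and densities into a pixel value.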

If you'd like to learn more about the paper, check out this Two Minute Papers video.

Baseline Model

We've created a colab notebook complete with a hyperparameter sweep, so you can reproduce this analysis in a colab. See if you can improve on the results by tweaking the hyperparameters.

Try this in a colab →

First, let's train a baseline model, and log our model's predictions in wandb. This lets us observe in real time how the model learns the representation of the underlying scene at each iteration.
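A rough sketch of what that logging loop can look like is below; train_step and render_test_view are hypothetical stand-ins for the corresponding functions in the colab.

```python
import numpy as np
import wandb

wandb.init(project="nerf")  # hypothetical project name

for i in range(10000):
    loss = train_step()                       # hypothetical: one optimization step
    if i % 25 == 0:
        rgb = render_test_view()              # hypothetical: render a held-out pose
        psnr = -10.0 * np.log10(float(loss))  # PSNR computed from the MSE loss
        wandb.log({"loss": float(loss),
                   "psnr": psnr,
                   "rendered_view": wandb.Image(rgb)},
                  step=i)
```

Logging a rendered frame every few iterations lets you scrub through the run and watch the scene emerge as training progresses.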

Rendering New Views From a Learned Neural Representation For a Single Scene

Let's try changing the various hyperparameters of our model and see how that changes its performance. Below we can see that the longer we train, the better our model gets at rendering novel views from its learned neural representation of a single scene.

We stop here at 10,000 iterations, but I encourage you to try training the model for longer. Below is a video the authors of the paper rendered after training for 200,000 iterations – as you can see, the results are remarkably realistic.

After training for 200,000 iterations

Effect of Changing Learning Rates

In this section, we vary the learning rate while keeping all other hyperparameters the same. Let's pick a reasonable number of epochs to train our model for, say 1000. Here we can compare the loss curves and see that the ideal learning_rate lies between 3e-4 and 7e-4.

A learning rate of 5e-3 was too high, whereas 5e-5 and 5e-6 were too low. If you want to improve your model's performance, I'd recommend trying more values in the range [5e-4, 7e-4].

We can also see this pattern reflected in the rendered videos. Our model starts out by failing to learn the underlying structure at 5e-3, then gets good at learning these 3D representations between 7e-4 and 3e-4, after which its performance slowly degrades until it can no longer capture the neural representation of the scene at 5e-6. Keep in mind that we only trained our model for 1000 epochs; if we trained for longer with those smaller learning rates, we might start to see really good performance.
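If you'd like to reproduce this comparison, the sketch below shows one way to launch a separate W&B run per learning rate. It assumes the init_nerf_model sketch from earlier and a hypothetical train function standing in for the colab's training loop.

```python
import tensorflow as tf
import wandb

for lr in (5e-3, 7e-4, 5e-4, 3e-4, 5e-5, 5e-6):
    run = wandb.init(project="nerf", config={"learning_rate": lr}, reinit=True)
    model = init_nerf_model()                     # sketch from earlier
    optimizer = tf.keras.optimizers.Adam(learning_rate=lr)
    train(model, optimizer, epochs=1000)          # hypothetical training loop
    run.finish()
```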

Effect of Changing The Embedding Size

In the next experiment, we take this learning rate of 5e-4 and train all our models for 1000 epochs as before. This time, we vary the embedding size and find ourselves in a Goldilocks scenario once again.

We can observe from the loss and PSNR plots, as well as the videos rendered from the learned neural representations of the scene, that an embedding size of 2 is too small to capture the complexity of the underlying scene, whereas 10 is simply too large. I would encourage you to explore more embedding sizes near 6 to see if you can improve on our model's performance.
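For context, the "embedding size" here is the number of positional-encoding frequencies: each input coordinate is mapped to sin/cos features at increasing frequencies before it reaches the network, which controls how much high-frequency detail the network can represent. A minimal sketch of that encoding, following the tiny-NeRF colab (exact details may differ):

```python
import tensorflow as tf

def posenc(x, embed_size):
    # x has shape (..., 3); keep the raw coordinates and append
    # sin/cos features at frequencies 2^0, 2^1, ..., 2^(embed_size - 1).
    feats = [x]
    for i in range(embed_size):
        for fn in (tf.sin, tf.cos):
            feats.append(fn(2.0 ** i * x))
    return tf.concat(feats, axis=-1)

# embed_size=2 -> 15 features, 6 -> 39, 10 -> 63 per 3D point.
```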

Effect of Adding More Layers

Next up, we add more hidden layers, while keeping our learning rate at 5e-4 and embedding size at 6, and training all our models for 1000 epochs. Interestingly, we observe that adding more layers doesn't always mean more performance. A network with 6 hidden layers outperformed both the 4-layer and 8-layer networks, although it was very close. For this experiment, we can stick with 6 layers and concentrate our efforts on tweaking other hyperparameters that have a bigger impact on the loss and PSNR.

Effect of Adding More Neurons

Finally, we keep all the hyperparameters the same as above (learning rate = 5e-4, embedding size = 6, epochs = 1000, layer_size = 6) and tweak the size of the dense layers. We see a clear pattern: the loss consistently decreases as the dense layer size increases.

My GPU ran out of memory at a dense layer size of 512, but if you have a bigger GPU, I'd encourage you to try sizes greater than 256 to see if the model's performance continues to improve.
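As a quick way to see how width grows the model, you can compare parameter counts using the model sketch from earlier. Keep in mind this is only a rough proxy: during training, much of the GPU memory goes to the activations stored for the large batches of sampled ray points, which also scale with layer width.

```python
# Assumes the init_nerf_model sketch defined earlier in this report.
for hidden_size in (64, 128, 256, 512):
    model = init_nerf_model(num_layers=6, hidden_size=hidden_size, embed_size=6)
    print(f"dense layer size {hidden_size}: {model.count_params():,} parameters")
```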

Running A Hyperparameter Sweep To Find The Best Model

Finally, we run a hyperparameter sweep to test the learning rate, embedding size, and other hyperparameters in combination, exploring the hyperparameter space more thoroughly to find the best-performing model. With W&B, you can run a hyperparameter sweep easily by specifying the parameters you'd like to try and the search strategy in a .yaml file.

See how you can launch a hyperparameter sweep in 5 mins →
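The same sweep can also be defined directly in Python as a dictionary instead of a .yaml file. Here's a rough sketch: the parameter names and values are assumptions based on the experiments above, and train_sweep is a hypothetical function that reads wandb.config and trains one model.

```python
import wandb

sweep_config = {
    "method": "random",  # or "grid" / "bayes"
    "metric": {"name": "psnr", "goal": "maximize"},
    "parameters": {
        "learning_rate": {"values": [7e-4, 5e-4, 3e-4]},
        "embed_size": {"values": [4, 6, 8]},
        "num_layers": {"values": [4, 6, 8]},
        "dense_layer_size": {"values": [128, 256]},
    },
}

sweep_id = wandb.sweep(sweep_config, project="nerf")     # hypothetical project name
wandb.agent(sweep_id, function=train_sweep, count=20)    # hypothetical train_sweep
```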

Try it out yourself

As a reminder, we've created a colab notebook complete with a hyperparameter sweep, so you can reproduce this analysis yourself. See if you can improve on these results by tweaking the hyperparameters.

Try this in a colab →