
Variable Bitrate Neural Fields: Create Fast Approximations of 3D Scenes

This article explores creating accurate, fast approximations of complex 3D scenes with a low memory footprint, as outlined in 'Variable Bitrate Neural Fields'.


Neural approximations of scalar and vector fields, such as signed distance functions and radiance fields, have emerged as accurate, high-quality representations in recent years. State-of-the-art results can be obtained by conditioning a neural approximation with a lookup from trainable feature grids that take on part of the learning task and allow for smaller, more efficient neural networks. However, these feature grids usually come at the cost of significantly increased memory consumption compared to stand-alone neural network models. The question we want to answer today:
Is it possible to create a fast neural approximation model with comparatively low memory consumption for storage and inference, one that can learn accurate, high-quality representations of complex 3D scenes?
This is the problem that the authors of the paper Variable Bitrate Neural Fields attempt to solve. They present a dictionary method for compressing such feature grids, reducing their memory consumption by up to 100 times, and permitting a multi-resolution representation that can be useful for out-of-core streaming. The dictionary optimization is formulated as a vector-quantized auto-decoder (VQAD) problem which lets us learn end-to-end discrete neural representations in a space where no direct supervision is available and with dynamic topology and structure.
Here's a bit of what we'll be reproducing today:

An end-to-end discrete neural representation of a V8 Model Engine learnt by a Vector-Quantized Auto Decoder (VQAD)

This article was written as a Weights & Biases Report, a project management and collaboration tool for machine learning projects. Reports let you organize and embed visualizations, describe your findings, share updates with collaborators, and more. To learn more, check out Collaborative Reports.
💡
Let's dive in.



A Brief Overview of Existing Approaches

In recent years, coordinate-based multi-layer perceptrons (MLPs) have emerged as a promising tool for computer graphics, powering tasks such as novel view synthesis with radiance fields, shape representation with signed distance functions, and radiance caching.
Whereas discrete signal representations like pixel images or voxels approximate continuous signals with regularly spaced samples of the signal, these neural fields approximate the continuous signal directly with a continuous, parametric function, i.e., an MLP which takes in coordinates as input and outputs a vector (such as color or occupancy).
Feature grid methods are a special class of neural fields that have enabled state-of-the-art signal reconstruction quality while being able to render and train at interactive rates. These methods embed coordinates into a high-dimensional space with a lookup from a parametric embedding (the feature grid), in contrast to non-feature-grid methods, which embed coordinates with a fixed function such as a positional Fourier embedding. This allows them to move the complexity of the signal representation away from the MLP and into the feature grid, which might be a spatial data structure such as a sparse voxel grid (Neural Sparse Voxel Fields) or a hash table (Instant Neural Graphics Primitives). However, this approach has some drawbacks:
  • Feature grid methods require high-resolution feature grids to achieve good quality. This makes them less practical for graphics systems operating within tight memory, storage, and bandwidth budgets.
  • It is also desirable for a shape representation to dynamically adapt to the spatially varying complexity of the data, the available bandwidth, and desired level of detail, which this approach fails to address.
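To make the contrast above concrete, here is a small PyTorch sketch (not taken from the paper or from Kaolin Wisp) of the two embedding styles: a fixed positional Fourier embedding versus a lookup from a trainable feature grid via trilinear interpolation. The grid resolution, feature dimension, and function names are assumptions chosen only for illustration.

```python
import torch
import torch.nn.functional as F

# Fixed positional Fourier embedding: coordinates are lifted into a
# higher-dimensional space by a fixed function (sines/cosines at several
# frequencies) with no trainable parameters.
def fourier_embed(x, num_freqs=6):
    # x: (N, 3) coordinates in [-1, 1]
    freqs = 2.0 ** torch.arange(num_freqs)                    # (num_freqs,)
    angles = torch.pi * x[..., None] * freqs                  # (N, 3, num_freqs)
    return torch.cat([angles.sin(), angles.cos()], dim=-1).flatten(1)

# Trainable feature grid: the embedding is looked up (and trilinearly
# interpolated) from a grid of learned feature vectors instead of being
# computed by a fixed function. grid: (1, k, D, H, W), x: (N, 3) in [-1, 1].
def grid_embed(grid, x):
    coords = x.view(1, -1, 1, 1, 3)                           # grid_sample layout
    feats = F.grid_sample(grid, coords, align_corners=True)   # (1, k, N, 1, 1)
    return feats.view(grid.shape[1], -1).t()                  # (N, k)

grid = torch.nn.Parameter(torch.randn(1, 16, 32, 32, 32) * 0.01)
x = torch.rand(4, 3) * 2 - 1
print(fourier_embed(x).shape)     # torch.Size([4, 36])
print(grid_embed(grid, x).shape)  # torch.Size([4, 16])
```

Either embedding is then fed to an MLP; the feature-grid variant lets a much smaller MLP reach the same quality, at the cost of storing the grid itself.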



Main Contributions of VQAD

The authors of the paper Variable Bitrate Neural Fields propose the vector-quantized auto-decoder (VQAD) method to directly learn compressed feature grids for signals without direct supervision. This representation enables progressive, variable bitrate streaming of data by being able to scale the quality according to the available bandwidth or desired level of detail.
In this figure, two example neural radiance fields are shown after streaming from 5 to 8 levels of their underlying octrees. The sizes shown are the total bytes streamed; that is, the finer LODs include the cost of the coarser ones. Source: Figure 1 from the paper.
VQAD also enables end-to-end compression-aware optimization, which yields significantly better results than typical vector quantization methods for discrete signal compression. The authors evaluate the proposed method by compressing feature grids that represent neural radiance fields and show that it reduces the required storage by two orders of magnitude with relatively little loss in visual quality, even without entropy encoding.
The top-left slice of the figure shows a baseline neural radiance field whose uncompressed feature grid weighs 15,207 kB. VQAD, shown bottom right, compresses this by a factor of 60 with minimal visual impact. In a streaming setting, a coarse LOD can be displayed after receiving only the first 10 kB of data. All sizes are without any additional entropy encoding of the bit-stream.
The auto-decoder framework was initially proposed in the paper DeepSDF: Learning Continuous Signed Distance Functions for Shape Representation.
💡



A Brief Overview of the Proposed Approach

The proposed method uses the auto-decoder framework with an extra focus on learning compressed representations. The key idea is to replace bulky feature vectors with indices in a learned codebook. These indices, the codebook, and a decoder MLP network are all trained jointly.
By eschewing the encoder function typically used in transform coding, the method can learn compressed representations with respect to arbitrary domains, such as the continuous signal that a coordinate-network MLP encodes, even under indirect supervision (such as training a neural radiance field from images with a volumetric renderer).

An overview of VQAD


Compressed Auto-Decoder

To effectively apply discrete signal compression to feature grids, the authors leverage the auto-decoder framework, in which only the decoder $f_\gamma^{-1}$ is explicitly constructed. Performing the forward transform then amounts to solving the optimization problem below through stochastic gradient descent. A strength of the auto-decoder is that it can recover transform coefficients even when supervision is available only in a domain different from the signal we wish to reconstruct; to this end, the authors define a differentiable forward map, an operator $F$ that lifts a signal onto another domain.

The Optimization Objective for Compressed Auto-Decoder
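To make the auto-decoder idea more tangible, here is a minimal, self-contained PyTorch sketch of the pattern described above: there is no encoder, the feature grid itself is a free parameter, and it is optimized by stochastic gradient descent through a differentiable forward map. The toy forward map, the nearest-cell lookup, and all sizes are assumptions for illustration, not the paper's implementation.

```python
import torch

# Assumed sizes: m grid entries with k-dimensional features.
m, k = 4096, 16
Z = torch.nn.Parameter(torch.randn(m, k) * 0.01)   # the "codes": free parameters, no encoder
decoder = torch.nn.Sequential(                     # small decoder MLP
    torch.nn.Linear(k + 3, 64), torch.nn.ReLU(), torch.nn.Linear(64, 4)
)

def field_fn(x):
    # Toy lookup into Z (a real implementation interpolates a spatial grid).
    idx = (torch.abs(x).sum(-1) * (m - 1) / 3).long().clamp(0, m - 1)
    return decoder(torch.cat([Z[idx], x], dim=-1))

def forward_map(field_fn, rays):
    # Stand-in for a differentiable forward map F, e.g. a volumetric renderer
    # that turns the field into pixel colors: sample a few points per ray,
    # evaluate the field, and average (a toy "renderer").
    pts = rays[:, None, :] * torch.linspace(0.1, 1.0, 8)[None, :, None]
    rgba = field_fn(pts.reshape(-1, 3)).reshape(rays.shape[0], 8, 4)
    return rgba.mean(dim=1)[:, :3]

# Indirect supervision: we only observe pixels, never Z itself.
rays, target_pixels = torch.randn(128, 3), torch.rand(128, 3)
opt = torch.optim.Adam([Z, *decoder.parameters()], lr=1e-3)
for _ in range(100):
    loss = torch.mean((forward_map(field_fn, rays) - target_pixels) ** 2)
    opt.zero_grad()
    loss.backward()
    opt.step()
```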


Feature-Grid Compression

The feature grid is a matrix $Z \in \mathbb{R}^{m \times k}$, where $m$ is the size of the grid and $k$ is the feature vector dimension. Local embeddings are queried from the feature grid with interpolation at a coordinate $x$ and fed to an MLP $\psi$ to reconstruct continuous signals.

Feature Grid Optimization Equation
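As a rough sketch of what "interpolated lookup into $Z$, then decode with $\psi$" could look like, the snippet below performs explicit trilinear interpolation over the eight surrounding corners of a dense grid; real implementations such as Kaolin Wisp use sparse octrees and multiple levels of detail. The resolution, feature dimension, and output (a single signed-distance-like value) are assumed for illustration.

```python
import torch

R, k = 32, 16                                       # assumed grid resolution and feature dim
m = R ** 3                                          # number of grid corners
Z = torch.nn.Parameter(torch.randn(m, k) * 0.01)    # feature grid Z in R^{m x k}
psi = torch.nn.Sequential(torch.nn.Linear(k, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1))

def interp(Z, x):
    """Trilinearly interpolate corner features of a dense R^3 grid at x in [0, 1]^3."""
    x = x.clamp(0, 1) * (R - 1)
    lo, frac = x.floor().long(), x - x.floor()
    feats = 0.0
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                corner = (lo + torch.tensor([dx, dy, dz])).clamp(max=R - 1)
                flat = corner[:, 0] * R * R + corner[:, 1] * R + corner[:, 2]
                w = ((frac[:, 0] if dx else 1 - frac[:, 0])
                     * (frac[:, 1] if dy else 1 - frac[:, 1])
                     * (frac[:, 2] if dz else 1 - frac[:, 2]))
                feats = feats + w[:, None] * Z[flat]
    return feats                                    # (N, k)

x = torch.rand(8, 3)
values = psi(interp(Z, x))                          # e.g. one signed-distance value per coordinate
```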


Vector Quantization

Let us now see how vector quantization can be incorporated into the compressed auto-decoder framework.

Vector Quantization for the Compressed Auto-Decoder Framework
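Since the figure for this step does not reproduce well in text, here is a simplified PyTorch sketch of the core idea as described above: each grid corner stores a (soft) index into a small learned codebook rather than a full feature vector, so the codebook and the indices can be trained end to end, and only $b$ bits per corner need to be stored. The sizes are assumptions, and the paper's exact formulation (for example, its straight-through-style hardening of the softmax) differs in detail.

```python
import torch
import torch.nn.functional as F

m, k, b = 4096, 16, 6                              # corners, feature dim, index bit-width (assumed)
codebook = torch.nn.Parameter(torch.randn(2 ** b, k) * 0.01)   # 2^b learnable code vectors
logits = torch.nn.Parameter(torch.zeros(m, 2 ** b))            # soft index, one row per corner

def corner_features(training=True):
    if training:
        # Soft lookup: a convex combination of code vectors. This keeps everything
        # differentiable, so indices, codebook, and decoder MLP can train jointly.
        return F.softmax(logits, dim=-1) @ codebook            # (m, k)
    # For storage and inference, only the b-bit argmax index per corner is kept.
    idx = logits.argmax(dim=-1)                                # (m,) ints in [0, 2^b)
    return codebook[idx]                                       # (m, k)

print(corner_features(True).shape, corner_features(False).shape)
```

The payoff is in the stored representation: a corner goes from $k$ floats (64 bytes for $k = 16$ in fp32) to $b$ bits, plus the small shared codebook.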


Drawbacks of VQAD

One of the major drawbacks of the proposed method is its memory and compute footprint at training time, which requires allocating a matrix of size $m \times 2^b$ to hold the softmax coefficients before they are converted into indices for inference and storage. The authors believe this could be addressed via a hybrid approach between random and learned indices: instead of storing a softened version of the indices, a parametric function of the coordinates would be learned to predict softened indices on the fly.
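To get a feel for why this matters, here is a quick back-of-the-envelope calculation of the soft-index matrix's size; the particular values of $m$ and $b$ are assumptions chosen only for illustration.

```python
# Rough estimate of the training-time soft-index matrix (m x 2^b float32 values).
# m and b below are assumed values chosen only for illustration.
m = 2_000_000          # e.g. corners at a fine octree level (assumed)
b = 12                 # codebook index bit-width (assumed)
print(m * 2 ** b * 4 / 1e9, "GB during training")              # ~32.8 GB of softmax coefficients
print(m * b / 8 / 1e6, "MB once hardened to b-bit indices")    # ~3.0 MB at storage time
```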




Running VQAD using Kaolin-Wisp

The authors have open-sourced their work as part of NVIDIA Kaolin Wisp, a PyTorch library powered by NVIDIA Kaolin Core for working with neural fields (including NeRFs, NGLOD, instant-ngp, and VQAD). Kaolin Wisp also comes with a built-in Weights & Biases integration.
To track training and validation metrics, render interactive 3D plots, and reproduce your configurations and results in your Weights & Biases workspace, simply add the flag --wandb_project <your-project-name> when launching the training script. The complete list of features supported by the Weights & Biases integration for Kaolin Wisp includes:
  • Log training and validation metrics in real-time.
  • Log system metrics in real-time.
  • Log RGB, RGBA, Depth renderings, etc. during training.
  • Log interactive 360-degree renderings post-training at all levels of detail.
  • Log model checkpoints as Weights & Biases artifacts.
  • Sync experiment configs for reproducibility.
  • Host a TensorBoard instance inside your Weights & Biases run.
You can run the following Colab notebook to train your own neural approximation of any 3D scene and visualize the view-synthesis results interactively on Weights & Biases.
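For reference, a training invocation could look roughly like the command below. The script and config paths are assumptions that may differ across Kaolin Wisp versions (check the repository's README), while the --wandb_project flag is the one described above.

```bash
# Assumed script/config paths -- check your Kaolin Wisp checkout for the exact ones.
# The --wandb_project flag is what enables the Weights & Biases logging described above.
python3 app/main.py \
    --config configs/vqad_nerf.yaml \
    --dataset-path /path/to/rtmv/V8 \
    --wandb_project my-vqad-experiments
```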





Running VQAD on the RTMV Dataset

The RTMV Dataset is a large-scale synthetic dataset for novel view synthesis consisting of roughly 300,000 images rendered from nearly 2,000 complex scenes using high-quality ray tracing at high resolution (1600 × 1600 pixels). The dataset is orders of magnitude larger than existing synthetic datasets for novel view synthesis, providing a large, unified benchmark for both training and evaluation. Drawing on four distinct sources of high-quality 3D meshes, the scenes were composed to exhibit challenging variations in camera views, lighting, shape, materials, and textures. The dataset was generated by a Python-based ray-tracing renderer designed to be simple for non-experts to use and share, flexible and powerful through its use of scripting, and able to create high-quality, physically based rendered images.
The RTMV Dataset was created by the authors of the paper RTMV: A Ray-Traced Multi-View Synthetic Dataset for Novel View Synthesis. It consists of scenes from four different environments, namely Google Scanned Objects, ABC, Bricks, and Amazon Berkeley. We will now train VQAD on some of these scenes from each category in the dataset.

Results on Lego Bricks

This subset of the RTMV dataset contains 1,027 scenes, with a single Lego bricks model per scene and hemisphere camera views. The camera is aimed at random locations within 1/10 of the unit volume used to scale the object, producing images that are not centered on the model. Each scene is illuminated by a white dome light and a warm sun placed randomly on the horizon.

Results on Lego Bricks from the RTMV Dataset


Results on Google Scanned Objects

This subset of the RTMV dataset contains 300 scenes, with 20 random objects per scene. Cameras are placed on a hemisphere around the scene and are pointed toward its center. The scenes are lit by a single white dome light.

Results on Google Scanned Objects from the RTMV Dataset


Results on the ABC Subset

The ABC subset of the RTMV dataset contains 300 scenes, with 50 random objects per scene and random camera views. Objects have randomly selected colors and materials. The scenes are lit by a uniform dome light and an additional bright point light to produce hard shadows.

Results on the ABC subset from the RTMV Dataset


Results on the Amazon Berkeley Subset

The Amazon Berkeley subset of the RTMV dataset contains 300 scenes, with 40 random objects per scene. Similar to ABC, cameras are placed randomly within a unit cube and aimed at any object. The scenes are lit with a full HDRI map, and a random texture is applied to the floor. This is the most challenging environment in the dataset.

Results on the Amazon Berkeley subset from the RTMV Dataset




Conclusion

  • In this report, we take a look at the paper Variable Bitrate Neural Fields for learning fast neural approximations of complex 3D scenes with a low memory footprint for storage and inference.
  • We briefly explore some of the existing approaches to and applications of neural approximations, such as view synthesis and radiance caching.
  • We take a deep dive into the vector-quantized auto-decoder or VQAD, the proposed framework, and examine the main contributions of the authors and the overall process of training and inference.
  • We explore the strengths and drawbacks of this approach and also how these drawbacks could be addressed in future works.
  • We train a few neural approximations of complex 3D scenes from the RTMV Dataset using NVIDIA Kaolin Wisp, which is a PyTorch library powered by NVIDIA Kaolin Core to work with neural fields. We also explore the results interactively using Weights & Biases.
  • The authors express their belief that neural rendering and neural fields will become more integrated into next-generation graphics pipelines. As such, it is important to design neural representations that can perform the same signal processing operations currently possible with other representations like meshes and voxels.
  • The authors also express their belief that the vector-quantized auto-decoder is a step forward in that direction, as they demonstrate that the method can learn a streamable, compressive representation with minimal loss in visual quality.
  • The authors note that the proposed approach is also directly compatible with highly efficient frameworks like instant neural graphics primitives and express their belief that the synthesis of these techniques is a very exciting research direction.
  • If you wish to learn the fundamentals behind Neural Radiance Fields, we recommend checking out the following reports.
