
LOLNeRF: 3D Reconstruction From Learning On Single-View Images Exclusively

Google researchers have developed LOLNeRF, a model that can reconstruct 3D geometry despite having been trained on single-view images alone.
With two eyes, people can perceive the 3D environment around them, yet we can also intuit a 3D scene from a single, flat image. This is because we have lived experience of how forms typically relate to each other in 3D space, like how a person's nose generally sticks out from their face.
NeRF is a machine learning algorithm commonly known for its ability to reconstruct 3D geometry from images. However, its standard training process fundamentally depends on already knowing the 3D properties of the training data: it requires many views of the same scene captured from known camera positions.
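To make the core idea concrete, here is a minimal sketch of the operation at the heart of NeRF: compositing the densities and colors predicted along each camera ray into a final pixel color. The function and variable names are illustrative, not taken from any particular NeRF codebase:

```python
import torch

def volume_render(densities, colors, deltas):
    """Composite per-sample predictions along each ray into pixel colors.

    densities: (num_rays, num_samples)    sigma predicted by the NeRF MLP
    colors:    (num_rays, num_samples, 3) RGB predicted by the NeRF MLP
    deltas:    (num_rays, num_samples)    spacing between adjacent samples
    """
    # Opacity of each sample: alpha_i = 1 - exp(-sigma_i * delta_i)
    alphas = 1.0 - torch.exp(-densities * deltas)
    # Transmittance: the chance the ray reaches sample i unoccluded
    ones = torch.ones_like(alphas[:, :1])
    trans = torch.cumprod(
        torch.cat([ones, 1.0 - alphas + 1e-10], dim=-1), dim=-1)[:, :-1]
    weights = alphas * trans
    return (weights[..., None] * colors).sum(dim=-2)  # (num_rays, 3)
```

Training adjusts the network so these rendered pixels match the photographs, which is why multiple posed views of a scene are normally needed.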

Google researchers wanted to train a new NeRF model to reconstruct 3D geometry without it ever seeing more than single-perspective, flat images. By combining several tools and models, they produced LOLNeRF.

How does LOLNeRF work?

The LOLNeRF (Learn from One Look NeRF) model is built from two components: a standard NeRF model for 3D reconstruction and a Generative Latent Optimization (GLO) model for generalized understanding of form. Used together, they allow LOLNeRF to estimate the 3D properties of a single-view image.
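The key structural change is that the NeRF network is conditioned on a per-image latent code, so a single network can represent every face in the dataset. Here is a hedged sketch of what such a conditional NeRF could look like; the class name, layer sizes, and positional-encoding dimension are assumptions for illustration, not the paper's architecture:

```python
import torch
import torch.nn as nn

class ConditionalNeRF(nn.Module):
    """A NeRF-style MLP conditioned on a per-image latent code."""

    def __init__(self, latent_dim=128, pos_dim=63, hidden=256):
        super().__init__()
        # pos_dim=63 assumes xyz positional encoding with 10 frequencies.
        self.net = nn.Sequential(
            nn.Linear(pos_dim + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),  # outputs (sigma, r, g, b)
        )

    def forward(self, encoded_xyz, latent):
        # Broadcast the image's latent code to every sample point.
        latent = latent.expand(encoded_xyz.shape[0], -1)
        out = self.net(torch.cat([encoded_xyz, latent], dim=-1))
        sigma = torch.relu(out[..., :1])   # density must be non-negative
        rgb = torch.sigmoid(out[..., 1:])  # color in [0, 1]
        return sigma, rgb
```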
GLO models are similar to GANs in that both are generative: during training they learn the defining features of their training data so that they can not only recreate the inputs they're fed but also create novel examples, like celebrities or cats that don't exist in reality. Unlike a GAN, though, GLO has no discriminator; it directly optimizes one latent code per training image alongside the weights of the decoder. For LOLNeRF, this shared latent space is also how the model can tell the nose of a cat from its ears.
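In code, the GLO recipe amounts to a learnable embedding table optimized jointly with the decoder under a reconstruction loss. The sketch below uses a toy decoder and random data in place of the real renderer and face dataset; every name and shape here is an assumption for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

num_images, latent_dim = 1000, 64
# One learnable latent code per training image: the defining GLO idea.
# There is no encoder and no discriminator; the codes are free parameters.
latents = nn.Embedding(num_images, latent_dim)
decoder = nn.Sequential(  # toy decoder standing in for the NeRF renderer
    nn.Linear(latent_dim, 256), nn.ReLU(),
    nn.Linear(256, 32 * 32 * 3), nn.Sigmoid(),
)

# Toy stand-ins for a dataset of flattened 32x32 RGB images.
images = torch.rand(num_images, 32 * 32 * 3)
image_ids = torch.arange(num_images)

optimizer = torch.optim.Adam(
    list(decoder.parameters()) + list(latents.parameters()), lr=1e-3)

for _ in range(100):
    codes = latents(image_ids)        # look up each image's latent code
    recon = decoder(codes)
    loss = F.mse_loss(recon, images)  # plain reconstruction loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```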
Knowing the structure of a cat is all well and good, but that alone doesn't tell the model how a cat should be represented in 3D space. To inject the 3D component that the NeRF portion of the model requires, the researchers used MediaPipe Face Mesh to automatically assign depth values to landmarks on a face, like the tip of the nose sitting forward and the corners of the eyes sitting farther back.
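MediaPipe Face Mesh is a publicly available library, and extracting those landmarks looks roughly like the following. The image path and the nose-tip landmark index (commonly cited as 1 in the Face Mesh topology) are assumptions here:

```python
import cv2
import mediapipe as mp

face_mesh = mp.solutions.face_mesh.FaceMesh(
    static_image_mode=True, max_num_faces=1, refine_landmarks=True)

image = cv2.imread("face.jpg")  # any single-view photo (path assumed)
results = face_mesh.process(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))

if results.multi_face_landmarks:
    landmarks = results.multi_face_landmarks[0].landmark
    # Each landmark carries normalized x, y plus a relative depth z;
    # smaller z means closer to the camera, so the nose tip ends up
    # in front of the eye corners.
    nose_tip = landmarks[1]
    print(nose_tip.x, nose_tip.y, nose_tip.z)
```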

Again, the image dataset that LOLNeRF is trained on consists only of single-view images, without any sort of 3D data. MediaPipe Face Mesh is used to estimate 3D landmarks and camera pose, while the GLO model is trained to understand how forms go together.
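The paper has its own pose-fitting procedure, but a standard way to recover a camera pose from detected 2D landmarks and their canonical 3D positions is a perspective-n-point (PnP) solve, shown here as a stand-in for that step. The file names, image size, and intrinsics are placeholders:

```python
import numpy as np
import cv2

# canonical_3d: (N, 3) landmark positions in a canonical head frame
# detected_2d:  (N, 2) the same landmarks found in this photo (pixels)
canonical_3d = np.load("canonical_landmarks.npy").astype(np.float32)
detected_2d = np.load("detected_landmarks.npy").astype(np.float32)

h, w = 256, 256  # image size (assumed)
camera_matrix = np.array([[w, 0, w / 2],
                          [0, w, h / 2],
                          [0, 0, 1]], dtype=np.float32)  # crude intrinsics

ok, rvec, tvec = cv2.solvePnP(
    canonical_3d, detected_2d, camera_matrix, distCoeffs=None)
rotation, _ = cv2.Rodrigues(rvec)  # 3x3 camera rotation matrix
print(rotation, tvec)              # the pose used to cast rays for NeRF
```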
In the end, LOLNeRF can reconstruct 3D views from images despite only ever having seen single-view, flat images. It can additionally use the GLO part of its model to generate novel images.
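Reconstructing a brand-new photo then amounts to freezing the trained network and optimizing only a fresh latent code until the rendering matches the image. A hedged sketch, reusing the toy decoder and latent_dim from the GLO example above:

```python
import torch
import torch.nn.functional as F

new_image = torch.rand(1, 32 * 32 * 3)  # stand-in for a new single photo

for p in decoder.parameters():          # freeze the trained model
    p.requires_grad_(False)

code = torch.zeros(1, latent_dim, requires_grad=True)
optimizer = torch.optim.Adam([code], lr=1e-2)

for _ in range(500):                    # fit only the latent code
    loss = F.mse_loss(decoder(code), new_image)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# `code` now encodes the photo; rendering it from new camera poses gives
# the 3D reconstruction, and sampling fresh codes yields novel identities.
```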


Code for training your own LOLNeRF model will be available soon; however, pre-trained model weights will not be provided.

Find out more
