
New 3D Reconstruction Method Utilizing Diffusion!

It's all about the prior.
The success of LLMs can largely be attributed to strong utilization of internet-scale text data. Given this large amount of data, LLMs have been able to learn general relationships between words and, ultimately, a broad reasoning ability. Perhaps the closest equivalent for computer vision models are diffusion models, which learn a relationship between text and images from large-scale collections of internet images with corresponding captions. One major area of research has focused on generating 3D models of the world given 2D images of a scene. This task has been incredibly challenging, as scene reconstruction from 2D images requires strong prior information about the structure of objects, which is difficult to achieve. Humans are incredibly good at leveraging thousands of hours of observing 3D objects to predict the structure of an object from a single 2D perspective, and this new research from Columbia University similarly leverages diffusion models, which contain a large-scale understanding of images from different perspectives.

The Objective

The overall goal of the new model is to synthesize an image of a scene from a different perspective than the original image given to the model. Essentially, the model leverages a pre-trained diffusion model, which contains a learned understanding of geometric features from image data, to reconstruct the same scene under a specified camera transformation. Ground-truth perspective data comes from the Objaverse dataset, which contains thousands of 3D objects, and the model is trained to predict 2D views of an object from unseen perspectives. Since the Objaverse dataset consists of 3D meshes, each object can be viewed from any perspective, and large numbers of new viewpoints can be generated in graphics rendering software. Despite this dataset being synthetic, the model learns enough from it to achieve amazing performance on real-life images.
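To make the training setup concrete, here is a minimal PyTorch sketch of a view-conditioned denoising step. This is an illustrative toy, not the paper's actual architecture (the real model fine-tunes a large pre-trained diffusion model): the `ViewConditionedDenoiser` module, the pose parameterization, the noise schedule, and all shapes and hyperparameters are assumptions made purely for the example.

```python
import torch
import torch.nn as nn

class ViewConditionedDenoiser(nn.Module):
    """Toy stand-in for a view-conditioned diffusion denoiser."""
    def __init__(self, img_dim=64 * 64 * 3, pose_dim=4, hidden=512):
        super().__init__()
        # Conditioning: flattened noisy target view + flattened source view
        # + relative camera pose + noise timestep.
        self.net = nn.Sequential(
            nn.Linear(img_dim * 2 + pose_dim + 1, hidden),
            nn.ReLU(),
            nn.Linear(hidden, img_dim),
        )

    def forward(self, noisy_target, input_view, rel_pose, t):
        x = torch.cat([noisy_target.flatten(1), input_view.flatten(1),
                       rel_pose, t[:, None]], dim=1)
        return self.net(x).view_as(noisy_target)

def training_step(model, input_view, target_view, rel_pose, optimizer):
    """One denoising step: predict the noise added to the target view,
    conditioned on the source view and the relative camera transform."""
    noise = torch.randn_like(target_view)
    t = torch.rand(target_view.shape[0])       # continuous timestep in [0, 1]
    alpha = (1 - t).view(-1, 1, 1, 1)          # simple linear noise schedule
    noisy = alpha.sqrt() * target_view + (1 - alpha).sqrt() * noise
    pred = model(noisy, input_view, rel_pose, t)
    loss = nn.functional.mse_loss(pred, noise) # standard epsilon-prediction loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Hypothetical shapes: a batch of 8 pairs of 64x64 RGB renders.
model = ViewConditionedDenoiser()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
src = torch.randn(8, 3, 64, 64)                # conditioning view
tgt = torch.randn(8, 3, 64, 64)                # view after the camera transform
pose = torch.randn(8, 4)                       # relative (elev, sin az, cos az, radius)
print(training_step(model, src, tgt, pose, opt))
```

The key idea the sketch preserves is that the denoiser always sees both the source view and the relative camera transform, so minimizing the denoising loss forces it to learn how an object's appearance changes with viewpoint.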

2.5D Reconstruction?

It's important to note that this model does not output 3D meshes, but rather images from different, user-specified perspectives. However, these images can then be used for scene reconstruction with other methods: the authors use Multiview Compressive Coding, a neural-field-based approach that synthesizes 3D scenes from multiple 2D perspectives of the scene (a rough sketch of this two-stage pipeline is shown below).

Overall, it's extremely exciting to see this shift in AI, where large-scale data can be used to build a general understanding of the world, and that representation can then be adapted to smaller tasks to achieve human-like performance. As the scale of models and data increases, we will likely see continuing gains in performance, quite possibly exceeding the progress we have seen with Moore's law and semiconductors.
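As a rough sketch of how the two stages fit together, the snippet below samples a view-conditioned model at several camera poses around the object and hands the results to a reconstruction backend. Both `dummy_diffusion` and `dummy_mcc` are hypothetical stand-ins (the real pipeline would use the trained diffusion model and Multiview Compressive Coding); only the shape of the pipeline is meant to be informative.

```python
import math
import torch

def generate_novel_views(diffusion_model, input_image, num_views=8):
    """Sample the view-conditioned model at camera poses spaced around
    the object. `diffusion_model` is a hypothetical callable mapping
    (image, relative_pose) -> image."""
    views = []
    for k in range(num_views):
        azimuth = 2 * math.pi * k / num_views
        rel_pose = torch.tensor([0.0, math.sin(azimuth), math.cos(azimuth), 1.0])
        views.append(diffusion_model(input_image, rel_pose))
    return views

def reconstruct_3d(mcc_model, views):
    """Feed the synthesized views to a reconstruction backend, here a
    hypothetical stand-in for Multiview Compressive Coding."""
    return mcc_model(torch.stack(views))

# Stand-in models so the pipeline runs end to end.
dummy_diffusion = lambda img, pose: img + 0.01 * torch.randn_like(img)
dummy_mcc = lambda views: views.mean(dim=0)    # placeholder "reconstruction"

image = torch.randn(3, 64, 64)                 # single input photograph
views = generate_novel_views(dummy_diffusion, image)
scene = reconstruct_3d(dummy_mcc, views)
print(scene.shape)
```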



The Paper:

Tags: ML News