
Point-E: OpenAI's Open-Source Text-To-3D & Image-To-3D Diffusion Model

OpenAI researchers have open-sourced a highly efficient text-to-3D model called Point-E, which combines text-to-image and image-to-3D models to great effect.
Created on December 20 | Last edited on December 21
3D object generation is a natural step forward from 2D image generation. However, where current 2D image generation models can be run in just seconds on individual consumer GPUs, current research into 3D object generation struggles to match that efficiency.
To make 3D object generation more accessible to the average person, OpenAI researchers have developed a model called Point-E. The model generates 3D objects as point clouds from a user's text or image input. While it doesn't match the coherence of state-of-the-art 3D object generation models, its focus is on being able to run quickly with modest compute.

While the paper was released a few days ago, the code and model checkpoints were uploaded today. The project is open source, so anyone can download the model weights and try it out for themselves. There is also a Gradio demo available on Hugging Face, showcasing the smallest model modified for quick text-to-3D generation.


How does Point-E work?

Where many text-to-3D models convert text prompts into 3D objects directly, Point-E opts to split the process into two steps: text-to-image and image-to-3D. This split architecture was chosen because it is more scalable, less reliant on hard-to-come-by 3D object datasets, and takes advantage of the already-established flexibility of cheap-but-powerful text-to-image generation models.
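To make the hand-off concrete, here is a minimal runnable sketch of the two-stage flow. None of the function names below come from the point_e package; they are placeholders for the two stages, and the point counts mirror those described later in this article (a real usage example with the actual package appears in the "Using Point-E" section).
```python
import numpy as np

def generate_synthetic_view(prompt: str) -> np.ndarray:
    """Stand-in for the fine-tuned GLIDE text-to-image stage."""
    return np.zeros((64, 64, 3), dtype=np.uint8)  # dummy RGB render

def sample_point_cloud(view: np.ndarray, n_points: int) -> np.ndarray:
    """Stand-in for the image-conditioned point cloud diffusion stage."""
    return np.zeros((n_points, 6))  # (x, y, z, r, g, b) per point

def upsample_point_cloud(view: np.ndarray, coarse: np.ndarray, n_points: int) -> np.ndarray:
    """Stand-in for the smaller upsampler model."""
    return np.zeros((n_points, 6))

def text_to_3d(prompt: str) -> np.ndarray:
    view = generate_synthetic_view(prompt)            # stage 1: text -> synthetic render
    coarse = sample_point_cloud(view, n_points=1024)  # stage 2: render -> coarse cloud
    return upsample_point_cloud(view, coarse, n_points=4096)

cloud = text_to_3d("a red motorcycle")
print(cloud.shape)  # (4096, 6)
```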


Point-E's training dataset

To make a model, you need a dataset. For Point-E, this means a dataset built from 3D models. Though this is a model for 3D object generation, the final dataset actually only contains 2D renders and 3D point cloud data.
First, they collected several million 3D models, which varied wildly in quality and format. These models were fed through an image rendering pipeline carefully constructed to make the output images as consistent as possible (in size, lighting, etc.), and in the end, 20 renders were produced for each model from various angles.
Next, 3D point cloud data was inferred from the 2D image renders for each 3D model. The 2D images were used instead of the 3D model data because using the 3D models directly would pose many issues, such as handling many different file formats or coping with oddly intersecting geometry.
This image-and-points dataset was first filtered to remove any flat objects and then clustered by CLIP features for further analysis. Clusters of low-quality data were discarded, leaving only the data deemed high enough quality to include in the final dataset.
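The sketch below shows the general shape of CLIP-feature clustering for quality filtering. This is not the authors' exact pipeline; the CLIP checkpoint, cluster count, and which clusters get dropped are illustrative assumptions, and the solid-color images stand in for real renders.
```python
import torch
from PIL import Image
from sklearn.cluster import KMeans
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Stand-ins for one render per 3D model.
renders = [Image.new("RGB", (224, 224), c) for c in ("red", "blue", "green", "white")]

with torch.no_grad():
    inputs = processor(images=renders, return_tensors="pt")
    feats = model.get_image_features(**inputs)
feats = torch.nn.functional.normalize(feats, dim=-1).numpy()

# Cluster the embeddings, inspect samples from each cluster by hand, and drop
# the clusters dominated by low-quality or degenerate objects.
labels = KMeans(n_clusters=2, n_init=10).fit_predict(feats)
bad_clusters = {1}  # chosen by manual inspection in this illustration
kept = [i for i, label in enumerate(labels) if label not in bad_clusters]
print(f"keeping {len(kept)} of {len(renders)} objects")
```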

Point-E's text-to-image internal model

The text-to-image portion of Point-E is a fine-tuned GLIDE model. It was fine-tuned on the new 3D render data mixed with the original GLIDE training dataset at a 5%/95% ratio. The 3D render data makes up only 5% of the fine-tuning mix because it is comparatively small.
During fine-tuning, the text prompts associated with the 3D render images had a special token appended to mark them as 3D renders. The same token is appended when sampling the final model to ensure the text-to-image model's output is composed properly for the image-to-3D model.
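A minimal illustration of the special-token idea follows. The literal token string Point-E uses is an internal detail of its GLIDE fine-tune; the one below is a stand-in.
```python
RENDER_TOKEN = "<3d-render>"  # placeholder token, not the actual string

def tag_prompt(prompt: str, is_3d_render: bool) -> str:
    """Append the render token to prompts paired with synthetic 3D renders."""
    return f"{prompt} {RENDER_TOKEN}" if is_3d_render else prompt

# At sampling time, the same token is appended so the text-to-image model
# produces a render-style image suitable for the image-to-3D stage.
print(tag_prompt("a red motorcycle", is_3d_render=True))
```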

Point-E's image-to-3D internal model

The image-to-3D portion is the most important part of Point-E, and it's a fairly simple transformer-based diffusion model.
Like all diffusion models, it takes in a noise vector and attempts to denoise it. The output is fed straight back in as input for a set number of denoising steps, with a timestep token fed in alongside to track progress. In Point-E's case, the denoised output defines a 3D point cloud: each point's location (xyz) and color (rgb).
This portion of the model, being an image-to-3D model, additionally takes image input as conditioning. The image output from the text-to-image portion is embedded through a pre-trained CLIP model before insertion.
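Here is a toy picture of what that stage iterates over: a (K, 6) tensor of noisy points (xyz plus rgb) that is repeatedly denoised, conditioned on a CLIP embedding of the synthetic view and a timestep token. The denoise_step below is a no-op stand-in rather than the point_e transformer, and the embedding size and step count are assumptions.
```python
import torch

K = 1024                            # points in the coarse cloud
x_t = torch.randn(K, 6)             # start from pure noise: (x, y, z, r, g, b)
image_embedding = torch.randn(768)  # CLIP embedding of the synthetic view (size assumed)

def denoise_step(x_t: torch.Tensor, t: int, cond: torch.Tensor) -> torch.Tensor:
    # Stand-in for the transformer's prediction of a less-noisy point cloud,
    # given the current points, the timestep, and the image conditioning.
    return x_t

for t in reversed(range(64)):       # a fixed number of denoising steps (count assumed)
    x_t = denoise_step(x_t, t, image_embedding)

point_cloud = x_t                   # final (K, 6) point positions and colors
```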

Also, like other diffusion models, Point-E first constructs a low-resolution (or in this case, low-point-count) output, which is subsequently upscaled by a smaller upsampler model of similar architecture. The base model produces 1,024 points, which the upsampler increases to 4,096; the usage example in the next section shows how the two models are composed.

Using Point-E

Point-E is open source, with instructions to run it and downloadable model checkpoints on the GitHub repository. There is also a Gradio demo on Hugging Face Spaces showcasing one of its smaller offshoot models if you want to jump right into playing with text-to-3D.
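The snippet below is a condensed text-to-3D example adapted from the example notebooks in the Point-E repository, using the small text-conditioned model (described under "Model sizes" below) plus the shared upsampler. Treat it as a sketch and check the repository for the current API, since names and defaults may have changed.
```python
import torch
from tqdm.auto import tqdm

from point_e.diffusion.configs import DIFFUSION_CONFIGS, diffusion_from_config
from point_e.diffusion.sampler import PointCloudSampler
from point_e.models.configs import MODEL_CONFIGS, model_from_config
from point_e.models.download import load_checkpoint
from point_e.util.plotting import plot_point_cloud

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Base model (40M, text-conditioned) plus the shared upsampler.
base_name = "base40M-textvec"
base_model = model_from_config(MODEL_CONFIGS[base_name], device)
base_model.eval()
base_model.load_state_dict(load_checkpoint(base_name, device))
base_diffusion = diffusion_from_config(DIFFUSION_CONFIGS[base_name])

upsampler_model = model_from_config(MODEL_CONFIGS["upsample"], device)
upsampler_model.eval()
upsampler_model.load_state_dict(load_checkpoint("upsample", device))
upsampler_diffusion = diffusion_from_config(DIFFUSION_CONFIGS["upsample"])

# The base model samples 1,024 points; the upsampler fills in the rest up to 4,096.
sampler = PointCloudSampler(
    device=device,
    models=[base_model, upsampler_model],
    diffusions=[base_diffusion, upsampler_diffusion],
    num_points=[1024, 4096 - 1024],
    aux_channels=["R", "G", "B"],
    guidance_scale=[3.0, 0.0],
    model_kwargs_key_filter=("texts", ""),  # the upsampler is left unconditioned
)

prompt = "a red motorcycle"
samples = None
for x in tqdm(sampler.sample_batch_progressive(batch_size=1, model_kwargs=dict(texts=[prompt]))):
    samples = x

pc = sampler.output_to_point_clouds(samples)[0]
plot_point_cloud(pc, grid_size=3)  # visualize the cloud from a grid of viewpoints
```
As in the repository's example, the upsampler here is left unconditioned, and the resulting point cloud is visualized from several fixed viewpoints.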

Model sizes

Point-E comes in a variety of sizes (the snippet after this list maps them to the checkpoint names used in the repository):
  • 1 billion parameters. The largest base model.
  • 300 million parameters. The middle-size base model.
  • 40 million parameters. The smallest base model.
    • Unconditioned. The 40M model without text or image conditioning steps.
    • Text vector. The 40M model modified to take just text as conditioning instead of an image, skipping the text-to-image step entirely and presenting a true text-to-3D model. It's limited to the textual content in the dataset made for Point-E, so anything the other model versions know from GLIDE is not present here. This is the model featured in the Gradio demo on Hugging Face.
    • Image vector. The 40M model modified to take an image directly as conditioning, in a different way than the base models do.
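The descriptions below are my reading of the list above; the config names are taken from the repository and should be verified against MODEL_CONFIGS in point_e.models.configs, since they may change.
```python
from point_e.models.configs import MODEL_CONFIGS

DESCRIPTIONS = {
    "base1B": "1B-parameter base model (image-conditioned)",
    "base300M": "300M-parameter base model (image-conditioned)",
    "base40M": "40M-parameter base model (image-conditioned)",
    "base40M-uncond": "40M model with no conditioning",
    "base40M-textvec": "40M model conditioned directly on text (Gradio demo)",
    "base40M-imagevec": "40M model conditioned on a CLIP image vector",
    "upsample": "upsampler shared by the base models",
}

# List whatever configs the installed point_e package actually ships with.
for name in MODEL_CONFIGS:
    print(f"{name:18s} {DESCRIPTIONS.get(name, '(see repository docs)')}")
```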

Find out more

Visit the Point-E GitHub repository or gradio demo page to try it out yourself.
Read the full paper for all the details on Point-E.
Tags: ML News