OpenAI Introduces Shap·E
OpenAI isn't just building chatbots.
Despite the popularity and success of LLMs, OpenAI is still hard at work solving problems in other domains of AI.
While the performance of generative models for 3D assets may currently lag behind that of models for NLP tasks, this is a vibrant area of research with significant efforts being dedicated to its advancement. The goal of these models is to automate and streamline the generation of 3D assets, which are increasingly in demand in industries such as gaming and virtual reality.
INRs
Implicit Neural Representations (INRs) are a popular means of encoding 3D assets. They offer flexibility and expressiveness and are able to capture the complex data of a 3D model within a functional framework. However, the process of gathering these INRs for every sample in a dataset can be both expensive and time-consuming.
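To make that concrete, here is a minimal sketch of an INR in PyTorch. It is illustrative only, not Shap·E's actual network: a small MLP whose weights encode the asset, mapping a 3D coordinate to a density and an RGB color, as in a NeRF.

```python
import torch
import torch.nn as nn

class ImplicitField(nn.Module):
    """A toy INR: a coordinate MLP mapping (x, y, z) -> (density, RGB).

    The 3D asset is stored entirely in the MLP's weights; rendering
    means querying the network at many points in space.
    """

    def __init__(self, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),  # 1 density channel + 3 color channels
        )

    def forward(self, xyz: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        out = self.mlp(xyz)
        density = torch.relu(out[..., :1])   # non-negative density
        color = torch.sigmoid(out[..., 1:])  # RGB in [0, 1]
        return density, color

# Querying the field at a batch of 3D points:
field = ImplicitField()
points = torch.rand(1024, 3)  # random coordinates in [0, 1)^3
density, color = field(points)
```

The key point is that the asset lives entirely in the network weights, which is also why collecting an INR for every sample in a large dataset requires a separate, costly fitting or encoding step per asset.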
Building on previous approaches, OpenAI has introduced Shap·E, a scalable and versatile conditional generative model for complex 3D implicit representations. It pairs a Transformer-based encoder with a diffusion model to generate the INR parameters of 3D assets.
What sets Shap·E apart is its ability to produce INRs that represent both NeRFs and meshes simultaneously. This allows the generated 3D assets to be rendered in multiple ways or easily incorporated into various 3D applications. When trained on a dataset of several million 3D assets, Shap·E has shown an impressive ability to produce diverse and recognizable samples conditioned on text prompts.
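The two-stage design and the dual read-out can be sketched as below. Every function name here is a hypothetical placeholder with a dummy body (none of these names come from the Shap·E codebase); the sketch only shows the flow: an encoder maps a 3D asset to a latent that is interpreted as INR parameters, a diffusion model learns to sample such latents from text, and one sampled latent can be queried through both a NeRF head and an SDF/texture head.

```python
import torch

LATENT_DIM = 1024  # illustrative; not the real latent size

def encode_asset(asset_views: torch.Tensor) -> torch.Tensor:
    """Stage 1 (placeholder): a Transformer encoder maps a 3D asset
    to a latent that *is* the parameters of an implicit function."""
    return torch.zeros(LATENT_DIM)

def diffusion_sample(text: str) -> torch.Tensor:
    """Stage 2 (placeholder): a diffusion model, trained on the
    encoder's latents, samples new latents conditioned on text."""
    return torch.randn(LATENT_DIM)

def query_inr(latent: torch.Tensor, xyz: torch.Tensor):
    """Placeholder read-out: the same latent parameterizes both a
    NeRF head (density, color) and an SDF/texture head, so the asset
    can be volume-rendered or meshed with marching cubes."""
    n = xyz.shape[0]
    density, color = torch.rand(n, 1), torch.rand(n, 3)
    sdf, tex = torch.rand(n, 1) - 0.5, torch.rand(n, 3)
    return density, color, sdf, tex

latent = diffusion_sample("a chair that looks like an avocado")
points = torch.rand(4096, 3)
density, color, sdf, tex = query_inr(latent, points)
```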
In fact, compared to Point·E, an explicit generative model over point clouds, Shap·E converges faster and achieves comparable or superior results, while using the same model architecture, datasets, and conditioning mechanisms.

[Image: Model Architecture]
Impressive Textual Understanding
Interestingly, despite the different approaches to output representation, Shap·E and Point·E share similar success and failure cases when conditioned on images. This suggests that the choice of output representation may not significantly impact the behavior of the model. However, distinct differences do emerge between the two models, especially when conditioning on text captions.
While Shap·E’s sample quality isn't quite on par with optimization-based approaches for text-conditional 3D generation, it greatly outpaces these methods in inference speed. This efficiency offers a potentially favorable trade-off, highlighting Shap·E as a promising new direction in the world of 3D asset generation.
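For readers who want to try it, OpenAI has open-sourced the model at github.com/openai/shap-e. The snippet below is adapted from the repository's text-to-3D example notebook at the time of release; exact function signatures may change between versions, so treat it as a sketch rather than canonical usage.

```python
import torch
from shap_e.diffusion.sample import sample_latents
from shap_e.diffusion.gaussian_diffusion import diffusion_from_config
from shap_e.models.download import load_model, load_config
from shap_e.util.notebooks import create_pan_cameras, decode_latent_images

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Pretrained components: the latent-to-INR decoder ("transmitter")
# and the text-conditional diffusion model.
xm = load_model('transmitter', device=device)
model = load_model('text300M', device=device)
diffusion = diffusion_from_config(load_config('diffusion'))

# Sample INR latents conditioned on a text prompt.
latents = sample_latents(
    batch_size=1,
    model=model,
    diffusion=diffusion,
    guidance_scale=15.0,
    model_kwargs=dict(texts=["a chair that looks like an avocado"]),
    progress=True,
    clip_denoised=True,
    use_fp16=True,
    use_karras=True,
    karras_steps=64,
    sigma_min=1e-3,
    sigma_max=160,
    s_churn=0,
)

# Render the generated asset; 'nerf' volume-renders it, while
# 'stf' renders the mesh-oriented representation instead.
cameras = create_pan_cameras(64, device)
images = decode_latent_images(xm, latents[0], cameras, rendering_mode='nerf')
```

Switching the rendering mode from 'nerf' to 'stf' exercises the mesh-oriented read-out of the same latent, which is exactly the dual-representation property described above.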
[Image: Examples of Results Conditioned on Text]
Meta Future?
The advancement of 3D generative models like Shap·E opens up a new world of possibilities in AI. While text generative models such as GPT-4 have demonstrated remarkable capability in understanding and generating human-like text, 3D generative models can revolutionize how we interact with and utilize digital spaces.