
Text2CAD: Generating CAD Models from Text Descriptions

A new model for generating 3D Objects!
Created on October 4 | Last edited on October 4
The "Text2CAD" framework is an AI model designed to transform natural language descriptions into parametric CAD (Computer-Aided Design) models. The system uses a specialized dataset, a transformer-based architecture, and a structured training approach to convert text-based inputs into detailed CAD models. This approach makes CAD design accessible to users of all skill levels, who can create models through text prompts rather than building them manually through complex software operations. This article explores how the DeepCAD dataset was used to train the model, explains the inputs and outputs, and details the training objectives for achieving accurate text-to-CAD generation.

The DeepCAD Dataset

The DeepCAD dataset serves as the core training source for Text2CAD. It contains CAD models that are documented through detailed sequences of construction steps, such as sketching, extruding, and applying Boolean operations. Each CAD construction sequence represents a comprehensive step-by-step process of how a particular model is built from scratch. To make this dataset usable for the Text2CAD framework, each CAD sequence was annotated with corresponding text descriptions, providing natural language instructions that explain each step. These annotations were generated using a two-stage data pipeline that employed Vision Language Models and Large Language Models.

Generating Descriptions for DeepCAD

Vision Language Models, like LLaVA-NeXT, were used to generate abstract visual descriptions of each CAD model by analyzing multi-view images of the CAD designs. These abstract descriptions captured the shape and geometric properties of the models in human-readable language. Subsequently, these abstract descriptions were further enriched using LLMs such as Mixtral-50B, which created more detailed, step-by-step text prompts describing the construction sequences. This process produced a comprehensive set of natural language instructions ranging from high-level overviews to detailed parametric descriptions suitable for advanced users.
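The two-stage pipeline can be sketched as follows. This is an illustrative mock, not the paper's code: `vlm_describe` and `llm_expand` are hypothetical stand-ins for calls to the VLM (LLaVA-NeXT) and the LLM, and the hard-coded description stands in for real model output.

```python
def vlm_describe(multi_view_images):
    """Stage 1: a VLM produces an abstract shape description from renders.
    (Stand-in for a real LLaVA-NeXT call.)"""
    return "A rectangular plate with four corner holes."

def llm_expand(abstract_description, cad_sequence):
    """Stage 2: an LLM combines the abstract description with the parametric
    construction sequence to produce a detailed, step-by-step prompt.
    (Stand-in for a real LLM call.)"""
    steps = [f"Step {i + 1}: {op}" for i, op in enumerate(cad_sequence)]
    return abstract_description + " " + " ".join(steps)

# A toy construction sequence standing in for a DeepCAD entry.
cad_sequence = ["sketch a rectangle", "extrude 5 mm", "cut four holes"]
prompt = llm_expand(vlm_describe([]), cad_sequence)
print(prompt)
```

The key design point is that stage one only sees images (so it captures overall shape), while stage two also sees the ground-truth parametric sequence, which is what lets it produce step-by-step prompts at varying levels of detail.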

Connecting Text Descriptions with CAD Sequences

The DeepCAD dataset, enhanced with these text annotations, forms the basis for training the Text2CAD model. Each text description corresponds to a sequence of CAD operations, guiding the model on how to convert language into a sequence of structured CAD steps. The input to the Text2CAD model is a natural language text prompt, which can describe a variety of design intentions, from basic shapes to complex, multi-step constructions. The output of the model is a sequence of CAD operations, represented in a format that CAD software can use to build the specified model. These operations follow DeepCAD's sketch-and-extrude paradigm: 2D sketch commands (lines, arcs, circles) followed by 3D extrusions that turn the sketches into solids. Each step in the sequence is defined by parameters such as coordinates, lengths, angles, and extrusion depths.
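To make such a sequence predictable by a transformer, each command and each continuous parameter must become a discrete token. The sketch below shows one minimal, hypothetical serialization in the spirit of DeepCAD-style command sequences; the vocabulary, bin count, and coordinate range are illustrative choices, not the paper's exact values.

```python
# A toy sketch-and-extrude sequence: a 10x10 square extruded 5 units.
SEQ = [
    ("LINE",    {"x": 0,  "y": 0}),
    ("LINE",    {"x": 10, "y": 0}),
    ("LINE",    {"x": 10, "y": 10}),
    ("LINE",    {"x": 0,  "y": 10}),
    ("EXTRUDE", {"depth": 5}),
]

VOCAB = {"LINE": 0, "ARC": 1, "CIRCLE": 2, "EXTRUDE": 3}

def to_tokens(seq, n_bins=256, lo=-50.0, hi=50.0):
    """Quantize continuous parameters into n_bins buckets so the whole
    sequence becomes a flat list of integers a transformer can predict."""
    tokens = []
    for cmd, params in seq:
        tokens.append(VOCAB[cmd])
        for v in params.values():
            bucket = int((v - lo) / (hi - lo) * (n_bins - 1))
            tokens.append(len(VOCAB) + bucket)  # offset past command ids
    return tokens

print(to_tokens(SEQ))
```

Quantizing parameters this way turns regression over continuous geometry into classification over a fixed vocabulary, which is what allows a standard autoregressive decoder and cross-entropy loss to be used downstream.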

Model Architecture and Workflow

Text2CAD’s transformer-based architecture includes a text encoder, a CAD sequence embedder, and a cross-attention mechanism that integrates information between the input text and the CAD construction sequence. The text encoder, based on a pre-trained BERT model, converts the natural language input into a series of contextual embeddings that capture the semantic meaning of the description. These embeddings are then used to generate the CAD sequence, with the help of a CAD-specific sequence embedder that represents each CAD operation as a unique token. The cross-attention mechanism plays a critical role in aligning the textual instructions with the current state of the CAD sequence. It allows the model to focus on the relevant parts of the text that correspond to the next operation in the sequence, ensuring that each generated step accurately follows the input prompt.
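A minimal PyTorch sketch of this encoder/decoder shape is below. It is not the paper's exact architecture: the plain `nn.Embedding` stands in for the pre-trained BERT encoder, and the vocabulary sizes, width, and depth are arbitrary placeholder values. What it does show is the structural point from the paragraph: CAD tokens are decoded causally while cross-attending to the text embeddings.

```python
import torch
import torch.nn as nn

class Text2CADSketch(nn.Module):
    def __init__(self, text_vocab=30522, cad_vocab=260, d=128, heads=4):
        super().__init__()
        self.text_emb = nn.Embedding(text_vocab, d)  # stand-in for BERT
        self.cad_emb = nn.Embedding(cad_vocab, d)    # CAD-sequence embedder
        layer = nn.TransformerDecoderLayer(d, heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.head = nn.Linear(d, cad_vocab)          # next-token logits

    def forward(self, text_ids, cad_ids):
        memory = self.text_emb(text_ids)             # text embeddings (encoder side)
        tgt = self.cad_emb(cad_ids)                  # CAD token embeddings
        causal = nn.Transformer.generate_square_subsequent_mask(cad_ids.size(1))
        # Each decoder layer self-attends over past CAD tokens and
        # cross-attends to the text memory.
        h = self.decoder(tgt, memory, tgt_mask=causal)
        return self.head(h)

model = Text2CADSketch()
logits = model(torch.randint(0, 30522, (2, 16)), torch.randint(0, 260, (2, 10)))
print(logits.shape)  # one logit vector over the CAD vocabulary per position
```

The causal mask ensures each position only sees earlier CAD tokens, while cross-attention lets every position consult the full text prompt, which is the alignment mechanism the paragraph describes.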


Training Objectives and Optimization

During training, the model’s objective is to minimize the error between its predicted CAD sequence and the ground truth sequence from the DeepCAD dataset. This is done using a cross-entropy loss function, which measures the discrepancy between the predicted and actual sequences at each step. The model learns to predict the next CAD operation by considering the current state of the model, the text prompt, and the previous CAD operations. The goal is to ensure that the generated sequence is logically consistent and geometrically accurate relative to the input text. The training also emphasizes parametric accuracy, requiring the model to produce precise values for dimensions and geometric constraints to match the specifications described in the text prompt.
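The teacher-forced objective described above can be written in a few lines. This is a generic next-token cross-entropy sketch, assuming a tokenized CAD vocabulary; the random tensors stand in for real model logits and ground-truth sequences.

```python
import torch
import torch.nn.functional as F

cad_vocab = 260
gt = torch.randint(0, cad_vocab, (2, 10))   # ground-truth CAD token sequences
logits = torch.randn(2, 10, cad_vocab)      # dummy stand-in for model output

# Shift by one: the prediction at position t is scored against token t+1,
# so the model learns to emit the next operation given text + history.
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, cad_vocab),
    gt[:, 1:].reshape(-1),
)
print(float(loss))
```

Because the quantized parameter values are themselves tokens, this single loss covers both the choice of operation and its parametric values, pushing the model toward the dimensional precision the text demands.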

Geometric and Parametric Alignment

The model’s output is evaluated based on its ability to produce CAD sequences that visually and geometrically align with the described design. Metrics such as chamfer distance and F1 scores for geometric primitives are used to assess how well the generated model corresponds to the ground truth.
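Chamfer distance, one common formulation of it, is the symmetric mean of nearest-neighbor squared distances between points sampled from the generated and ground-truth surfaces. A minimal NumPy version:

```python
import numpy as np

def chamfer(a, b):
    """Symmetric chamfer distance between point clouds a (N,3) and b (M,3)."""
    # Pairwise squared distances via broadcasting: shape (N, M).
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1) ** 2
    # For each point, squared distance to its nearest neighbor in the other cloud.
    return d.min(axis=1).mean() + d.min(axis=0).mean()

a = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])  # toy "ground truth"
b = np.array([[0.0, 0.0, 0.0], [1.0, 0.1, 0.0]])  # toy "generated" cloud
print(chamfer(a, b))
```

A low chamfer distance means the generated geometry lies close to the reference surface, complementing the F1 scores on geometric primitives, which instead check whether the right sketch entities (lines, arcs, circles) were produced.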

Final Output and Use Cases

Once trained, Text2CAD can generate complete CAD models from a variety of text descriptions. The output sequence can be directly used in CAD software to create the specified model, allowing designers to create and modify models through text commands. This capability makes Text2CAD suitable for a range of applications, from assisting novice designers in quickly creating basic models to automating repetitive design tasks for advanced users. The framework significantly lowers the barrier to entry for CAD modeling and paves the way for integrating AI into the design process, making CAD more accessible, efficient, and versatile.
Tags: ML News