
Building reliable apps with GPT-4o and structured outputs

Learn how to enforce consistency in GPT-4o outputs and build reliable generative AI applications.
The world is awash with unstructured data, from free-form text and images to complex interactions in voice and video. However, to unlock the true potential of this information, it needs to be structured and organized.
Once structured, this data can be harnessed in a multitude of ways, powering a range of use cases from retrieval-augmented generation (RAG) systems to recommendation engines and code generation tools. In this article, we’ll explore the concept of structured outputs, discuss their advantages, and build a few fun projects that showcase their impact on AI-driven solutions.



What do we mean by structured outputs?

Structured outputs refer to the organized and formatted data generated by AI models that adhere to predefined schemas. This ensures that the generated information is both consistent and easily consumable by various systems, enhancing its overall readability and usability.
In traditional AI applications, ensuring that outputs conform to a desired structure has often relied on prompt engineering or post-processing layers. This approach can be error-prone and difficult to maintain, especially when working with complex data structures. Structured outputs, as implemented in the OpenAI API, solve these challenges by allowing developers to define schemas directly within the API, using formats such as JSON Schema. This guarantees that the model's response will match the expected format, minimizing errors and reducing the need for validation or reformatting.
To use structured outputs in the OpenAI API, include the schema definition in the response_format parameter when making a request. This constrains the model to generate responses that match the given format, eliminating the need for manual validation or prompt engineering tricks. If the model refuses to produce a conforming response, you can handle that programmatically, for example by logging the refusal message or applying conditional logic.
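To make this concrete, here is a minimal sketch of such a request. The schema name and the sentiment field are illustrative, and the snippet assumes an OPENAI_API_KEY environment variable:

import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative schema: force the reply to be one of three sentiment labels
schema = {
    "type": "json_schema",
    "json_schema": {
        "name": "sentiment_response",
        "strict": True,
        "schema": {
            "type": "object",
            "properties": {
                "sentiment": {
                    "type": "string",
                    "enum": ["positive", "negative", "neutral"]
                }
            },
            "required": ["sentiment"],
            "additionalProperties": False
        }
    }
}

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "I loved this product!"}],
    response_format=schema
)

message = response.choices[0].message
if message.refusal:
    # The model declined rather than producing schema-conformant output
    print(f"Refusal: {message.refusal}")
else:
    print(json.loads(message.content)["sentiment"])

The Python SDK also offers a higher-level client.beta.chat.completions.parse helper that accepts a Pydantic model as the response_format and returns parsed objects directly, if you prefer not to call json.loads yourself.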
There are two primary methods for applying structured outputs: JSON Schema and function calling. JSON Schema is ideal when you want the model's response to strictly follow a specific structure, such as when formatting data for storage or display. Function calling, on the other hand, is used when you want the model to interact with tools or external systems (see the sketch below). We'll focus on JSON Schema in this tutorial.
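For contrast, here is a minimal function-calling sketch. The get_weather tool and its parameters are hypothetical, but the request shape shows the same JSON Schema idea applied to a tool's arguments rather than to the whole reply:

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical tool: the schema describes the arguments the model may pass to it
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
            "additionalProperties": False
        },
        "strict": True
    }
}]

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools
)

# The model may answer directly or emit a tool call with schema-conformant arguments
if response.choices[0].message.tool_calls:
    call = response.choices[0].message.tool_calls[0]
    print(call.function.name, call.function.arguments)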
By leveraging structured outputs, you minimize issues like missing fields or invalid data types, ensuring your application can seamlessly integrate with the model’s output without additional error handling or complex post-processing steps.

Integration of structured outputs in the OpenAI API

The integration of structured outputs in the OpenAI API is designed to provide robust control over the model's responses. By defining a schema that the model adheres to, developers can ensure that critical elements—such as required fields or valid enumerated values—are always present. This approach differs from earlier methods where developers might have had to write extensive prompts to coax the model into producing the right format or manually verify the output afterwards.
Structured outputs provide several key benefits that enhance the overall efficiency and reliability of AI-driven applications. Improved data handling is achieved through adherence to predefined schemas, ensuring that all generated responses conform to the expected format. This eliminates the need for complex error handling or validation processes, saving developers time and effort. As a result, applications can function more predictably, reducing bugs and inconsistencies.
Additionally, the better system integration offered by structured outputs enables smooth data exchange between systems. With consistent formatting, there is less chance of data mismatches or integration issues, making it easier to pass structured information across various applications, databases, or APIs.
Finally, enhanced user experience is another advantage. Structured outputs improve the readability of responses and allow developers to design intuitive interfaces that can present different parts of the model's output in distinct ways. This leads to clearer communication of information and a more engaging interaction for users.

Practical examples

In this section, we’ll explore three primary use cases that illustrate the power and versatility of structured outputs in real-world applications. Each use case will highlight how structured outputs can be leveraged to enforce data consistency, making it easier to work with structured responses in a variety of scenarios.

Categorization: Creating a research paper index

We’ll demonstrate how to use structured outputs to categorize research papers into a broad set of predefined categories, effectively building an index of research papers. Using a structured JSON Schema format, we ensure that each document is classified consistently, with outputs adhering to categories such as "Supervised Learning," "Reinforcement Learning," or "Natural Language Processing."
This structured categorization enables organized storage and retrieval of research documents, making it easier to search and analyze the indexed content. The index can then serve as the backbone of a larger knowledge management system, providing a solid foundation for further analysis and applications.
Here’s some code that loads AI research papers and then classifies them using OpenAI structured outputs:
import os
import json
import arxiv
import shutil
from PyPDF2 import PdfReader
from openai import OpenAI
import weave

# Initialize Weave and OpenAI
weave.init("paper_classification")
api_key = os.getenv('OPENAI_API_KEY')

model = "gpt-4o-mini"
client = OpenAI(api_key=api_key)

# Directory to download and categorize papers
download_dir = "./arxiv_papers"
if not os.path.exists(download_dir):
    os.makedirs(download_dir)

# List of machine learning categories
categories = [
    "Supervised Learning", "Unsupervised Learning", "Reinforcement Learning", "Deep Learning",
    "Natural Language Processing", "Computer Vision", "Graph Neural Networks", "Transfer Learning",
    "Meta-Learning", "Few-Shot Learning", "Self-Supervised Learning", "Representation Learning",
    "Multi-Modal Learning", "Generative Adversarial Networks (GANs)", "Bayesian Methods",
    "Probabilistic Models", "Federated Learning", "Privacy-Preserving ML", "Fairness and Bias in ML",
    "Explainable AI", "Optimization Algorithms", "Adversarial Robustness", "Causal Inference",
    "Anomaly Detection", "Time Series Analysis", "Graph-Based Learning", "Knowledge Graphs",
    "Ontology Learning", "Recommender Systems", "Information Retrieval", "Domain Adaptation",
    "Semi-Supervised Learning", "Data Augmentation Techniques", "Multi-Agent Systems",
    "Human-in-the-Loop Learning", "Curriculum Learning", "Active Learning", "Imitation Learning",
    "Inverse Reinforcement Learning", "Policy Optimization", "Robustness to Distribution Shifts",
    "Neural Architecture Search (NAS)", "Hyperparameter Optimization", "Neurosymbolic AI",
    "Neural Ordinary Differential Equations", "Memory-Augmented Networks", "Recurrent Neural Networks (RNNs)",
    "Long Short-Term Memory (LSTM)", "Transformer Models", "Attention Mechanisms",
    "Pre-trained Language Models (e.g., BERT, GPT)", "Contrastive Learning", "Energy-Based Models",
    "Neural Style Transfer", "Object Detection", "Segmentation Models", "Image Generation", "3D Vision",
    "Motion Prediction", "Speech Recognition", "Speech Synthesis", "Emotion Recognition",
    "Text Generation", "Summarization", "Machine Translation", "Question Answering", "Dialogue Systems",
    "Conversational AI", "Autonomous Systems", "Robotics and Control", "Game Theory in ML",
    "Synthetic Data Generation", "Biomedical Data Analysis", "Bioinformatics", "Healthcare Applications of ML",
    "Drug Discovery", "Predictive Maintenance", "Financial Modeling", "Climate Modeling",
    "Physics-Informed Learning", "Chemistry Applications", "Material Science Applications",
    "Social Network Analysis", "Sentiment Analysis", "Text Mining", "Data Mining", "Complex Systems",
    "Ensemble Methods", "Evolutionary Algorithms", "Quantum Machine Learning", "ML System Performance Optimization",
    "ML in Edge Computing", "ML for Internet of Things (IoT)", "Multi-Task Learning", "Continual Learning",
    "Neural-Symbolic Learning", "Vision-Language Models", "Zero-Shot Learning", "Learning from Demonstration",
    "Neural Network Pruning"
]

# Read up to the first 1,000 characters of a PDF
def read_pdf_first_1000_chars(pdf_path):
    try:
        with open(pdf_path, 'rb') as file:
            reader = PdfReader(file)
            text = ""
            for page in reader.pages:
                text += page.extract_text() or ""
                if len(text) >= 1000:
                    break
            return text[:1000]
    except Exception as e:
        print(f"Failed to read {pdf_path}: {e}")
        return ""

# Categorize a paper based on its content using structured outputs
@weave.op
def categorize_paper(text):
    # JSON schema for structured output with enum categories (the enum is not required, but helpful)
    category_schema = {
        "type": "json_schema",
        "json_schema": {
            "name": "paper_category_response",
            "strict": True,  # "strict" belongs at this level, alongside "name" and "schema"
            "schema": {
                "type": "object",
                "properties": {
                    "category": {
                        "type": "string",
                        "enum": categories,  # Use the list of categories as enum options
                        "description": "The category of the research paper"
                    }
                },
                "required": ["category"],  # Ensure that the response contains a category
                "additionalProperties": False
            }
        }
    }

    # Create the prompt for categorizing the text
    prompt = f"""
    Based on the following text from a research paper, categorize it into one of the following machine learning topics: {', '.join(categories)}.
    Please respond with a JSON object in the format: {{"category": "Category Name"}}.

    Research Paper Content:
    {text}
    """
    # Make the API request to categorize the text using structured output
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "You are a categorization assistant."},
            {"role": "user", "content": prompt}
        ],
        response_format=category_schema,  # Use structured output format with enum
        max_tokens=50,
        temperature=0.3
    )
    # Parse the model's response to extract the category
    result = response.choices[0].message.content.strip()
    try:
        result_json = json.loads(result)
        category = result_json.get("category", "Uncategorized")
    except json.JSONDecodeError:
        category = "Uncategorized"

    return category

# Move the PDF into the folder for its assigned category
def move_pdf_to_category(pdf_path, category):
    category_dir = os.path.join(download_dir, category.replace(" ", "_"))
    if not os.path.exists(category_dir):
        os.makedirs(category_dir)
    shutil.move(pdf_path, os.path.join(category_dir, os.path.basename(pdf_path)))
    print(f"Moved {pdf_path} to {category_dir}")

# Download recent papers from arXiv
query = "machine learning"
max_results = 1  # Change to a larger number as needed
search = arxiv.Search(
    query=query,
    max_results=max_results,
    sort_by=arxiv.SortCriterion.SubmittedDate
)

# Iterate through each result and categorize the paper
for result in search.results():
    print(f"Downloading: {result.title}")
    paper_id = result.entry_id.split('/')[-1]
    filename = f"{paper_id}.pdf"
    result.download_pdf(dirpath=download_dir, filename=filename)
    # Read the first 1,000 characters of the downloaded PDF
    pdf_path = os.path.join(download_dir, filename)
    text_snippet = read_pdf_first_1000_chars(pdf_path)
    if text_snippet:
        print(f"Categorizing paper: {filename}")
        # Use the categorize_paper function to get the category
        category = categorize_paper(text_snippet)
        print(f"Assigned Category: {category}")
        # Move the PDF to the appropriate category folder
        move_pdf_to_category(pdf_path, category)
    else:
        print(f"Failed to extract text from {filename}")

The script uses structured outputs to categorize research papers based on predefined machine learning categories. It leverages W&B Weave to log and track various inputs and outputs of the categorization function, making it easier to monitor and debug the model's predictions.
When a research paper is downloaded, a snippet of its content is extracted using PyPDF2 and passed through the categorize_paper function. This function sets up a JSON schema with enumerated categories, ensuring that the model output adheres to the defined structure. Using OpenAI's API, the model generates a JSON response that is parsed to determine the paper's category. Running the script creates a subdirectory for each category and moves each paper into the correct folder. Weave tracks these results, ensuring that every inference and categorization can be revisited or shared, providing transparency and ease of analysis. This is particularly useful for building large-scale, organized databases of research papers.

Use in a retrieval-augmented generation system: Building a structured database

In a previous project, I built a RAG-based restaurant menu search that lets users find menu items using natural language queries. To accomplish this, I first had to create a structured list of items in the form of a JSON object.
That required writing a very complex prompt to coax an LLM into outputting a structured JSON object from an unstructured body of text, in this case a menu PDF. Here, we will build that same structured database, which will later be used in a RAG system, without the complex prompt, by leveraging structured outputs instead. This step involves organizing menu items into a structured JSON format, ensuring that the data is clean, well organized, and easy to work with.
Once structured, this large JSONL object (essentially a list of JSON objects) will be ready for vectorization, a major step in building a RAG system. By ensuring that the data adheres to a predefined schema before vectorization, we maintain consistency and improve the overall efficiency of the retrieval and generation processes that RAG systems rely on. The script below uses structured outputs to ensure that the menu data is cleanly formatted before being stored; this database will later serve as a core part of a RAG system.
import os
import json
import PyPDF2
from openai import OpenAI
import weave

# This is for a menu RAG system that allows users to search for menu items
# using natural language.
# Initialize Weave
weave.init("menu_standardization")

# Read the OpenAI API key from the environment
api_key = os.getenv('OPENAI_API_KEY')
client = OpenAI(api_key=api_key)

# Split the text into manageable, overlapping chunks
def split_text(text, chunk_size=4000, overlap=500):
    chunks = []
    start = 0
    while start < len(text):
        if start + chunk_size > len(text):
            chunks.append(text[start:])
        else:
            end = start + chunk_size
            chunks.append(text[start:end + overlap])
        start += chunk_size
    return chunks

# Read and process the menu PDF
def read_and_process_menu(pdf_path):
    menu_data = []

    # Structured output schema expecting an array of objects
    menu_items_schema = {
        "type": "json_schema",
        "json_schema": {
            "name": "menu_items_response",
            "schema": {
                "type": "object",
                "properties": {
                    "items": {  # The root property is an array named "items"
                        "type": "array",
                        "items": {
                            "type": "object",
                            "properties": {
                                "title": {
                                    "type": "string",
                                    "description": "The name of the menu item"
                                },
                                "description": {
                                    "type": "string",
                                    "description": "A detailed description of the menu item"
                                },
                                "keywords": {
                                    "type": "string",
                                    "description": "Comma-separated keywords for the menu item (e.g., 'dessert, chicken, side')"
                                }
                            },
                            "required": ["title", "description", "keywords"],  # All three fields are required
                            "additionalProperties": False
                        }
                    }
                },
                "required": ["items"],  # Ensure that "items" is included in the response
                "additionalProperties": False
            }
        }
    }

    # Read the PDF and extract text
    with open(pdf_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        for page_num in range(len(reader.pages)):
            page_text = reader.pages[page_num].extract_text()
            text_chunks = split_text(page_text)

            for chunk in text_chunks:
                # Create the prompt for each chunk
                prompt_text = "Convert the following menu text into a JSON array with each item containing 'title', 'description', and 'keywords'."

                # Make the API request using the structured output format
                response = client.chat.completions.create(
                    model="gpt-4o-mini",
                    messages=[
                        {"role": "system", "content": "You are a helpful assistant that structures menu items into JSON array format."},
                        {"role": "user", "content": prompt_text + " Here is the text: " + chunk}
                    ],
                    response_format=menu_items_schema  # Use structured output schema
                )

                # Parse the model's response to extract the structured output
                result = response.choices[0].message.content.strip()
                try:
                    # Parse the structured JSON response
                    parsed_response = json.loads(result)

                    # Expecting an "items" key containing the list of menu items
                    menu_items = parsed_response.get("items", [])
                    # Add the page number to each menu item
                    for item in menu_items:
                        item['page'] = page_num + 1

                    # Append the parsed items to the menu data
                    menu_data.extend(menu_items)

                except json.JSONDecodeError as e:
                    print(f"JSON parsing failed for page {page_num + 1}: {e}")
                    print(f"Response content: {result}")

    # Remove duplicates based on title
    unique_menu_data = {item['title']: item for item in menu_data}
    menu_data = list(unique_menu_data.values())

    return menu_data

# Replace 'menu.pdf' with the path to your PDF file
pdf_path = './menu.pdf'
menu_items = read_and_process_menu(pdf_path)

# Save the results to a JSON file
with open('gpt4_menu_data.json', 'w') as json_file:
    json.dump(menu_items, json_file, indent=4)

print("Menu items successfully processed and saved to gpt4_menu_data.json")

The read_and_process_menu function reads a menu PDF and extracts text using PyPDF2, splitting it into chunks to fit within the model’s token limit. Each chunk is processed with a prompt that instructs the model to return the menu data as a structured JSON array, adhering to a schema defined using OpenAI's structured outputs. By enforcing a schema that includes fields like title, description, and keywords, the data is guaranteed to be consistently structured, which is crucial before vectorizing it for a RAG system.
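To show where this is headed, here is a minimal sketch of the vectorization step, assuming the gpt4_menu_data.json file produced above and OpenAI's text-embedding-3-small model (any embedding model and vector store would work just as well):

import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Load the structured menu data produced by the script above
with open('gpt4_menu_data.json') as f:
    menu_items = json.load(f)

# Because every item is guaranteed to have 'title', 'description', and 'keywords',
# we can build the embedding inputs without defensive checks
texts = [f"{item['title']}: {item['description']} ({item['keywords']})" for item in menu_items]

# Embed all items in a single batch request
embeddings = client.embeddings.create(
    model="text-embedding-3-small",
    input=texts
)

# Pair each item with its vector, ready to load into a vector store
vectors = [(item['title'], e.embedding) for item, e in zip(menu_items, embeddings.data)]
print(f"Embedded {len(vectors)} menu items; vector size: {len(vectors[0][1])}")

This is exactly where the schema enforcement pays off: since every record has the same fields, the embedding inputs are uniform and the downstream retrieval step stays simple.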
Since we call weave.init() and use the OpenAI API, all API calls are automatically logged to Weave, and we can later view the inputs and outputs of our function.
Here's what it looks like inside Weave after running the script:



Code generation from voice commands: Converting audio instructions to structured JSON for a form builder

We will build a system that captures voice commands and turns them into structured form schemas, which can then be rendered as interactive HTML forms. Using a combination of OpenAI's Whisper API for audio transcription and the OpenAI model for generating structured outputs, this system allows users to speak form details aloud and have them automatically converted into a functioning web form.
Once the form schema is created, it's passed to a Flask app that dynamically generates the form fields and renders them in a web interface. The Flask app reads the JSON object and converts it into HTML components—like text fields, dropdowns, and radio buttons—based on the type of each field. The form can then be filled out and submitted through the web interface, making it easy to interact with the generated schema.
For this project, there are two main components: the voice-to-text processing module and the form generator, each working in tandem to transform voice commands into structured outputs effortlessly. This enables a new way of interacting with software using voice-based inputs for rapid prototyping and UI design.
Here's the code:
import os
import json
import tempfile

import sounddevice as sd
import scipy.io.wavfile as wav
from flask import Flask, render_template_string, request
from openai import OpenAI
import weave

# Initialize Weave so @weave.op calls are logged (the project name is arbitrary)
weave.init("voice_form_builder")

# Read the OpenAI API key from the environment
api_key = os.getenv('OPENAI_API_KEY')
client = OpenAI(api_key=api_key)

# Parameters for audio recording
SAMPLE_RATE = 24000   # Sample rate for recording
RECORD_DURATION = 10  # Maximum duration to record, in seconds

app = Flask(__name__)

# The generated form schema is stored in a global so the Flask routes can see it
form_schema = None

def generate_form_from_json(json_input):
    """Generates an HTML form based on the given JSON input."""
    form_html = '<form method="POST" action="/submit">\n'
    for field in json_input["fields"]:
        field_type = field.get("type", "text")
        field_label = field.get("label", "")
        field_name = field.get("name", "")

        if field_type == "text":
            form_html += f'<label>{field_label}</label><br>\n'
            form_html += f'<input type="text" name="{field_name}" required><br><br>\n'
        elif field_type == "dropdown":
            form_html += f'<label>{field_label}</label><br>\n'
            form_html += f'<select name="{field_name}">\n'
            for option in field.get("dropdown_options") or []:
                form_html += f'<option value="{option}">{option}</option>\n'
            form_html += '</select><br><br>\n'
        elif field_type == "multiple_choice":
            form_html += f'<label>{field_label}</label><br>\n'
            for option in field.get("dropdown_options") or []:
                form_html += f'<input type="radio" name="{field_name}" value="{option}" required>{option}<br>\n'
            form_html += '<br>\n'
        elif field_type == "yes_no":
            form_html += f'<label>{field_label}</label><br>\n'
            form_html += f'<input type="radio" name="{field_name}" value="Yes" required> Yes\n'
            form_html += f'<input type="radio" name="{field_name}" value="No" required> No<br><br>\n'
    form_html += '<input type="submit" value="Submit">\n'
    form_html += '</form>'
    return form_html

@app.route('/form', methods=['GET'])
def form():
    """Render the form based on the global JSON schema."""
    global form_schema
    if form_schema is None:
        return "No form schema provided."
    form_html = generate_form_from_json(form_schema)
    return render_template_string(form_html)

@app.route('/submit', methods=['POST'])
def submit():
    form_data = request.form.to_dict()
    return f"Form submitted successfully with data: {form_data}"

def run_form_app(json_input, host='127.0.0.1', port=5000):
    """Runs the Flask app with a provided JSON schema."""
    global form_schema
    form_schema = json_input
    app.run(host=host, port=port)

# Record audio using sounddevice and save it as a temporary .wav file
def record_audio():
    print("Press Enter to begin recording...")
    input()  # Wait for the Enter key to start recording
    print("Recording... Speak into the microphone.")

    # Record audio using sounddevice directly in PCM16 format
    audio_data = sd.rec(int(RECORD_DURATION * SAMPLE_RATE), samplerate=SAMPLE_RATE, channels=1, dtype='int16')
    sd.wait()  # Wait until recording is finished

    # Save audio to a temporary .wav file
    temp_wav_file = tempfile.NamedTemporaryFile(suffix=".wav", delete=False)
    wav.write(temp_wav_file.name, SAMPLE_RATE, audio_data)
    print(f"Audio recording saved to: {temp_wav_file.name}")

    return temp_wav_file.name  # Return the path of the recorded file

# Send the audio file to the Whisper API for transcription
def transcribe_audio(file_path):
    with open(file_path, 'rb') as audio_file:
        # Call the Whisper API for transcription
        transcription = client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file
        )
    return transcription.text  # Return the transcription text

# Run model inference to generate the form schema from the transcribed prompt
@weave.op
def generate_form_schema(user_prompt):
    # Make the API call to generate the form schema using the transcribed prompt
    response = client.chat.completions.create(
        model="gpt-4o-2024-08-06",
        messages=[
            system_message,
            {"role": "user", "content": user_prompt}  # Use transcribed text as user prompt
        ],
        response_format=form_response_schema  # Structured output schema (defined below)
    )

    # Extract the parsed form schema from the response
    response_content = response.choices[0].message.content

    # Parse the JSON content into a Python dictionary
    generated_form_schema = json.loads(response_content)

    return generated_form_schema

# Set up the system message to guide the model
system_message = {
    "role": "system",
    "content": """
You are a helpful assistant that generates form schemas in JSON format. The form schema should follow these rules:

1. Use 'type', 'label', and 'name' for all fields.
2. For dropdown fields, include a 'dropdown_options' property with a list of choices.
3. Example structure:
{
  "form": {
    "fields": [
      {
        "type": "text",
        "label": "Player's Name",
        "name": "player_name"
      },
      {
        "type": "dropdown",
        "label": "Position",
        "name": "position",
        "dropdown_options": ["Forward", "Midfielder", "Defender", "Goalkeeper"]
      },
      {
        "type": "yes_no",
        "label": "Previous Experience",
        "name": "previous_experience"
      }
    ]
  }
}
"""
}

# JSON schema for structured outputs with a root object that contains a 'form' property.
# Named form_response_schema to avoid clashing with the Flask app's form_schema global.
form_response_schema = {
    "type": "json_schema",
    "json_schema": {
        "name": "form_response",
        "strict": True,  # "strict" belongs at this level, alongside "name" and "schema"
        "schema": {
            "type": "object",
            "properties": {
                "form": {
                    "type": "object",
                    "properties": {
                        "fields": {
                            "type": "array",
                            "items": {
                                "type": "object",
                                "properties": {
                                    "type": {
                                        "type": "string",
                                        "enum": ["text", "dropdown", "multiple_choice", "yes_no"]
                                    },
                                    "label": {"type": "string"},
                                    "name": {"type": "string"},
                                    "dropdown_options": {
                                        "type": ["array", "null"],
                                        "items": {"type": "string"},
                                        "description": "Options for dropdown fields, if applicable"
                                    }
                                },
                                # Strict mode requires every property to be listed as required;
                                # dropdown_options can still be null for non-dropdown fields
                                "required": ["type", "label", "name", "dropdown_options"],
                                "additionalProperties": False
                            }
                        }
                    },
                    "required": ["fields"],
                    "additionalProperties": False
                }
            },
            "required": ["form"],
            "additionalProperties": False
        }
    }
}

# Main entry point: record audio, transcribe it, and generate the form schema
if __name__ == "__main__":
    # Record audio and save it to a temporary file
    recorded_audio_path = record_audio()

    # Transcribe the recorded audio to text
    user_prompt = transcribe_audio(recorded_audio_path)
    print(f"Transcribed user prompt: {user_prompt}")

    # Generate the form schema using the transcribed text
    generated_form_schema = generate_form_schema(user_prompt)

    # Display the generated schema to the user
    print("Generated form schema:")
    print(json.dumps(generated_form_schema['form'], indent=2))  # Pretty print the dictionary

    # Launch the form builder with the generated schema
    run_form_app(generated_form_schema['form'])
We chose the Whisper API for transcribing voice commands, as it provides a straightforward and reliable way to turn spoken language into text. The transcribed text (essentially a prompt from the user) is then used to generate form schemas that conform to a predefined structure and can be easily integrated into existing workflows.
Although OpenAI now offers a fully multimodal option through the Realtime API, which handles text and audio in real time within a single session, it is still in beta and requires a more complicated configuration than using Whisper in a two-stage system (in my humble opinion). The Realtime API is better suited to applications that need multiple rounds of spoken dialogue between the model and the user; here, we only need a single audio prompt, so I took advantage of Whisper's dedicated focus on audio-to-text transcription.
By leveraging structured outputs, we ensure that the generated form JSON object follows a predefined schema, making it easy to integrate into existing development workflows. This setup showcases how combining voice-to-text transcription with structured outputs can enable developers to generate code through spoken instructions, providing precise control over the format and structure without manual coding.
Weave is also used to log inputs and outputs at various stages of the form generation process, providing an easy way to track, analyze, and visualize the model's performance. By integrating Weave using the @weave.op decorator, we can capture the transcribed audio, the generated form schema, and the model's responses at different stages. This makes it easier to monitor the flow of data, debug issues, and share results with collaborators. Additionally, it offers transparency by allowing developers to revisit the transformation process from voice commands to a rendered form, enhancing the reliability of the application.
Here's a screenshot of a form that was generated purely from audio:


Conclusion

Structured outputs help simplify working with AI. Instead of dealing with messy or unpredictable responses, they let you enforce specific formats, making the data easier to use and reducing the headaches caused by post-processing. Whether you're categorizing papers, organizing menu items, or converting voice commands to code, structured outputs take away a lot of the tedious work. It’s a straightforward way to get reliable results and focus on what matters without wasting time fixing errors. In the end, it's just a practical tool that makes things run smoother—no hype, just less hassle. Thanks for reading.



