
How to fine-tune GPT-4o Vision on a custom dataset

Created on October 6 | Last edited on October 7
In this tutorial, we’ll explore how to fine-tune an OpenAI model using the OpenAI API and evaluate its performance with Weave’s evaluation tools. Our focus will be on a practical use case involving a multimodal dataset called ChartQA, which contains question-answer pairs based on visual data from charts and tables. By leveraging OpenAI’s capabilities, we can create a custom model that is fine-tuned to handle this specific data, allowing for more precise and context-aware responses.



Fine-Tuning with OpenAI Model Inference API

Once a model is fine-tuned, you are billed for the tokens it processes, with costs calculated separately for input, output, and training tokens. Fine-tuning is currently free for GPT-4o and GPT-4o mini up to a daily token limit through October 31, 2024: each organization receives one million complimentary training tokens per day for GPT-4o and two million per day for GPT-4o mini, so during this period you will typically only pay for the tokens used during inference and evaluation. Any overage is charged at the standard rates.

OpenAI Model Inference Pricing

For GPT-4o-2024-08-06, the cost is $3.75 per 1M input tokens, $15.00 per 1M output tokens, and $25.00 per 1M training tokens. Batch API pricing is discounted to $1.87 per 1M input tokens and $7.50 per 1M output tokens. For GPT-4o-mini-2024-07-18, pricing is $0.30 per 1M input tokens, $1.20 per 1M output tokens, and $3.00 per 1M training tokens; Batch API pricing drops further to $0.15 per 1M input tokens and $0.60 per 1M output tokens.
Fine-tuning with OpenAI provides a cost-effective way to improve model performance, especially with discounted rates for batch processing and complimentary daily training tokens.
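As a rough sanity check on the training portion of the bill, you can multiply the number of tokens in your training file by the number of epochs and the per-million-token training price. The sketch below uses made-up token and epoch counts; substitute your own numbers.

# Rough fine-tuning cost estimate: training tokens x epochs x price per 1M training tokens.
# The token count and epoch count below are hypothetical example values.
PRICE_PER_1M_TRAINING_TOKENS = 25.00  # GPT-4o-2024-08-06, USD

training_tokens = 500_000  # tokens in your training file (example value)
epochs = 3                 # number of training epochs (example value)

estimated_cost = training_tokens * epochs * PRICE_PER_1M_TRAINING_TOKENS / 1_000_000
print(f"Estimated training cost: ${estimated_cost:.2f}")  # -> $37.50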


Environment Setup

Setting up your OpenAI and Weights & Biases (W&B) integration is crucial for managing fine-tuning and tracking model performance effectively. First, obtain your OpenAI API key, which is required for authenticating requests to OpenAI's services. Ensure that this key is securely stored and set as an environment variable, such as `OPENAI_API_KEY`, in your development environment. This will allow you to perform tasks like model fine-tuning and inference.
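For example, a script can read the key from the environment and fail fast if it is missing (a minimal sketch; adapt it to however you manage secrets):

import os

# Read the API key from the environment rather than hard-coding it in source files
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
    raise RuntimeError("OPENAI_API_KEY is not set. Export it before running fine-tuning or inference.")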
Next, navigate to the OpenAI dashboard and access your organization settings. You will see the Organization ID, which you can copy for reference if needed. The Organization ID is sometimes used in API requests or when integrating additional services. After locating your Organization ID, scroll down to the Integrations section, where you will find an option to add your Weights & Biases API key. Paste your W&B API key in the provided field and click on the Update button. This step enables the integration between OpenAI and Weights & Biases, allowing you to track and visualize your fine-tuning jobs in your specified W&B project.

Ensure that the integration is enabled and visible in the OpenAI dashboard; you should see a confirmation indicating that it is active. After successfully adding the W&B API key, log in to W&B using the CLI command `wandb login`. This ensures that your environment has the necessary credentials to run fine-tuning jobs and log data efficiently.
With these configurations in place, your OpenAI and Weights & Biases accounts are now linked, making it easier to run and monitor fine-tuning jobs. This integration allows you to track metrics and performance in real-time, making it a powerful setup for model evaluation and iteration.
Before running fine-tuning and evaluation, make sure you have all the necessary libraries installed. You will need Weave, OpenAI, Pillow, and Datasets. These libraries provide essential tools for managing and processing datasets, running fine-tuning jobs, and evaluating models. Open a terminal and run the following command:
pip install weave openai pillow datasets
Weave is used for tracking, visualizing, and evaluating your models, while the OpenAI package is necessary for interacting with OpenAI's API for tasks like fine-tuning and inference. Pillow is required for image processing, such as converting and encoding images into formats compatible with the API. The Datasets library by Hugging Face is useful for loading and manipulating the ChartQA dataset or any other dataset you want to use for fine-tuning your model.

Fine-tuning GPT-4o Vision

Fine-tuning a model allows you to adapt its behavior based on your unique dataset. This process involves providing labeled training data so the model can learn specific patterns, contexts, and nuances that are not fully captured by the base model. Fine-tuning is especially useful when dealing with specialized tasks, such as interpreting visual data, domain-specific language, or any context-specific scenarios.
To start off, we need to prepare our dataset and upload it to OpenAI's servers. The fine-tuning call itself is relatively concise, so I will include the data preparation and fine-tuning code in a single script.
Now we will write a script to fine-tune an OpenAI model using the ChartQA dataset, which is designed to train models for visual question answering based on charts and tables. The dataset provides question-answer pairs that are linked to specific charts, allowing the model to learn how to interpret and respond to questions derived from visual data.
import base64
import json
import os
from io import BytesIO

from datasets import load_dataset
from openai import OpenAI
from PIL import Image

api_key = os.getenv("OPENAI_API_KEY")
# Initialize the OpenAI client with your API key
client = OpenAI(api_key=api_key)

# Load the ChartQA dataset
dataset = load_dataset("TeeA/ChartQA", split="train", cache_dir="./cache")

# Helper to convert a PIL image to a base64-encoded PNG string
def encode_pil_image_to_base64(pil_image):
    buffered = BytesIO()
    pil_image.save(buffered, format="PNG")  # Save the image as PNG in memory
    return base64.b64encode(buffered.getvalue()).decode("utf-8")

# Build one JSONL entry: system prompt, user question plus chart image, expected answer
def build_entry(example):
    image = example["image"]
    # The 'image' column may contain file paths or PIL images; handle both
    if isinstance(image, str):
        image = Image.open(image)
    elif not isinstance(image, Image.Image):
        raise ValueError("Unsupported image type in dataset.")
    encoded_image = encode_pil_image_to_base64(image)

    return {
        "messages": [
            {"role": "system", "content": "You are an assistant that processes charts and tables."},
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": example["qa"][0]["query"]},  # Question from the 'qa' field
                    {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{encoded_image}"}},
                ],
            },
            {"role": "assistant", "content": example["qa"][0]["label"]},  # Expected answer from the 'qa' field
        ]
    }

# Write a subset of the dataset to a JSONL file and upload it to OpenAI
def prepare_and_upload(examples, filename):
    with open(filename, "w") as f:
        for example in examples:
            f.write(json.dumps(build_entry(example)) + "\n")
    with open(filename, "rb") as f:
        uploaded = client.files.create(file=f, purpose="fine-tune")
    return uploaded.id

# Select the first 10 examples for training and the next 10 for validation
training_file_id = prepare_and_upload(dataset.select(range(10)), "chartqa_training_data.jsonl")
validation_file_id = prepare_and_upload(dataset.select(range(10, 20)), "chartqa_validation_data.jsonl")

# Create a fine-tuning job with the uploaded training and validation files,
# logging progress to Weights & Biases via the integration configured earlier
response = client.fine_tuning.jobs.create(
    training_file=training_file_id,
    validation_file=validation_file_id,
    model="gpt-4o-2024-08-06",
    integrations=[
        {
            "type": "wandb",
            "wandb": {
                "project": "chartqa-finetuning",
                "tags": ["charts", "qa", "base64-images"],
            },
        }
    ],
)
print(f"Created fine-tuning job: {response.id}")


The script begins by loading the ChartQA dataset using the datasets library, selecting a subset for training and validation. Each entry in the dataset contains a question, a corresponding chart or table image, and an expected answer. The script processes each example by converting the images into base64 format and structuring them into a JSONL format that is compatible with OpenAI’s API. This JSONL format includes a system message to set the context, a user message that combines the question and the encoded image, and an assistant message with the expected answer. This format ensures that the model has all the necessary context for generating the desired output during fine-tuning.
After preparing the dataset, the script uploads the training and validation files to OpenAI. A fine-tuning job is then created using these files, specifying the gpt-4o-2024-08-06 model. The script also integrates with Weights & Biases to log and track the fine-tuning progress, making it easier to visualize the training metrics and monitor performance.
By running this script, you can fine-tune the base model on the ChartQA dataset, enabling it to better handle complex queries related to visual data. This process not only helps the model generate more accurate responses but also ensures that it performs effectively on tasks involving the interpretation of charts and tables.
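While the job runs, you can also poll its status directly through the API rather than waiting for the dashboard to update. The snippet below is a small sketch; replace the placeholder job ID with the one printed when the job was created.

from openai import OpenAI

client = OpenAI()

job_id = "ftjob-your-job-id"  # placeholder: use the ID printed by the fine-tuning script

# Check the overall job status and, once it finishes, the name of the fine-tuned model
job = client.fine_tuning.jobs.retrieve(job_id)
print(job.status)            # e.g. "running" or "succeeded"
print(job.fine_tuned_model)  # populated once the job has completed

# Print the most recent training events for a quick progress check
for event in client.fine_tuning.jobs.list_events(fine_tuning_job_id=job_id, limit=10):
    print(event.message)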
Once the fine-tuning job is initiated, its progress and results will automatically be tracked and visualized in your Weights & Biases workspace. The integration with W&B provides a clear view of several metrics, including training loss, validation loss, and token accuracy over time.
In the W&B dashboard, each run is represented as a distinct entry under the Runs section. You can see the detailed logs and metrics in the form of charts and graphs that update in real-time as the fine-tuning progresses. For example, you can observe how the training loss decreases as the model learns from the training data and how validation loss behaves on the validation set. Here are the charts for my run:


This real-time tracking and visualization enable you to understand how the model is evolving and make adjustments as needed. With the integration in place, W&B becomes a powerful tool for monitoring and improving your model's fine-tuning process, providing insights that are crucial for optimizing performance.



We would now like to compare the performance of our fine-tuned model against the base model to evaluate the improvements made during fine-tuning. The script below helps us achieve this by defining two separate classes—one for the fine-tuned model and one for the base model—and running an evaluation on both using the ChartQA dataset.

The script first prepares a subset of evaluation data by selecting the last 20 examples from the shuffled dataset and encoding each chart image in base64 format. This ensures that both models are tested on the same set of inputs for consistency. The `OpenAIModelFT` class represents the fine-tuned model, while the `OpenAIModelStock` class represents the base model without any additional training. Each class includes a `predict` method that sends a query and corresponding chart image to the specified model through the OpenAI API, returning the model’s response.

The `run_model_evaluation` function is used to evaluate each model on the same dataset. It instantiates the specified model class, passes it to a `weave.Evaluation` instance, and runs the evaluation using the model's `predict` method. This allows us to measure the performance of both models using a custom scoring function, `substring_match`, which checks if the expected answer is a substring of the model’s output.
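Since the evaluation script itself is not reproduced here, the sketch below shows one way it could be structured, following the class and function names described above. The fine-tuned model ID and project name are placeholders, and the exact scorer signature may differ between Weave versions (newer releases pass the model output as `output` rather than `model_output`), so treat this as a starting point rather than a drop-in script.

import asyncio
import base64
import os
from io import BytesIO

import weave
from datasets import load_dataset
from openai import OpenAI
from PIL import Image

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
weave.init("chartqa-finetuning")  # log evaluations to this Weave project

def encode_pil_image_to_base64(pil_image):
    buffered = BytesIO()
    pil_image.save(buffered, format="PNG")
    return base64.b64encode(buffered.getvalue()).decode("utf-8")

# Build the evaluation set: the last 20 examples of the shuffled split, with images encoded as base64
dataset = load_dataset("TeeA/ChartQA", split="train", cache_dir="./cache").shuffle(seed=42)
eval_rows = []
for example in dataset.select(range(len(dataset) - 20, len(dataset))):
    image = example["image"]
    if isinstance(image, str):  # the column may hold file paths instead of PIL images
        image = Image.open(image)
    eval_rows.append({
        "query": example["qa"][0]["query"],
        "expected": example["qa"][0]["label"],
        "image_b64": encode_pil_image_to_base64(image),
    })

class OpenAIModelFT(weave.Model):
    # Placeholder fine-tuned model ID; copy the real one from the completed fine-tuning job
    model_name: str = "ft:gpt-4o-2024-08-06:your-org:chartqa:placeholder"

    @weave.op()
    def predict(self, query: str, image_b64: str) -> str:
        # Send the question and chart image to the configured model and return its answer
        response = client.chat.completions.create(
            model=self.model_name,
            messages=[
                {"role": "system", "content": "You are an assistant that processes charts and tables."},
                {
                    "role": "user",
                    "content": [
                        {"type": "text", "text": query},
                        {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
                    ],
                },
            ],
        )
        return response.choices[0].message.content

class OpenAIModelStock(OpenAIModelFT):
    # Same predict logic, but pointed at the untouched base model
    model_name: str = "gpt-4o-2024-08-06"

@weave.op()
def substring_match(expected: str, model_output: str) -> dict:
    # Score as correct if the expected answer appears anywhere in the model's output
    return {"correct": expected.lower() in (model_output or "").lower()}

def run_model_evaluation(model_cls):
    evaluation = weave.Evaluation(dataset=eval_rows, scorers=[substring_match])
    return asyncio.run(evaluation.evaluate(model_cls()))

run_model_evaluation(OpenAIModelFT)
run_model_evaluation(OpenAIModelStock)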

After running this script, you will be able to visualize the results in the Weave comparisons dashboard in Weights & Biases. The dashboard provides a side-by-side comparison of key metrics, such as the substring match score and total token usage, for both models, making it easy to analyze their performance. By selecting the evaluation runs in your project and clicking the 'Compare' button, you can generate detailed visualizations that highlight the differences between the fine-tuned and base models, helping you see where the fine-tuned model performs better and where further improvements can be made.


The comparison dashboard shows the evaluation results for our fine-tuned model and the base model on the same set of 10 examples from the ChartQA dataset. Because we only fine-tuned on a small number of examples, the overall performance metrics between the two models are relatively similar. This indicates that the fine-tuning process has not yet captured the full potential improvement that could be achieved with a larger training set.

Despite the small dataset, we can see that the fine-tuned model was able to reduce total token usage compared to the base model while maintaining similar performance. This is particularly important for high-volume generative AI applications, where inference costs can quickly add up. By optimizing token usage, we can reduce overall inference costs, making the model more cost-effective for production use cases. With additional training data, we would expect the fine-tuned model to show greater performance gains over the base model, demonstrating more accurate responses and better handling of the complex visual data in the ChartQA dataset.



Run: ftjob-3WPbX3DTY1L5I8wHpSKzG9BQ