
Getting started fine-tuning with the Mistral API

How to fine-tune Mistral-Small using the Mistral API and W&B Weave.
Fine-tuning language models is key to enhancing their performance and relevance for specific tasks. The Mistral API provides an efficient and flexible platform to fine-tune models, allowing developers to leverage powerful AI capabilities tailored to their unique requirements.
In this tutorial, we'll focus on training a Mistral model to translate English to Thai, covering everything from preparing the dataset to launching the fine-tuning job and evaluating the results with Weave.



What is fine-tuning?

Fine-tuning involves taking a pre-trained language model and training it further on a specific dataset. This process customizes the model for particular tasks, such as language translation, sentiment analysis, or customer support automation. Fine-tuning can significantly improve the model's accuracy, relevance, and efficiency, making it more suitable for the intended applications.
Today, we'll train our model to translate from English to Thai, and respond directly in Thai.

API vs. local fine-tuning

When it comes to fine-tuning, developers can choose between using an API and training an open-source model directly on their own GPUs with a library like Hugging Face. An API simplifies the process by abstracting the underlying complexities and providing ready-to-use endpoints. This approach saves time and reduces the potential for errors. Direct fine-tuning offers more control and customization, allowing developers to tweak the process more precisely according to their specific needs.
Each method has its own advantages, and the choice depends on the developer's familiarity with the models and the complexity of the tasks at hand. For most applications, I personally find the API the most practical approach, as it circumvents the need to source GPUs for training and inference, which can be both time-consuming and costly.
However, if you are operating at a large scale, fine-tuning the open-weight models yourself might make more sense, as it will likely reduce your inference costs at that volume.

Setting up your environment to fine-tune with the Mistral API

Begin by setting up your development environment so the fine-tuning process goes smoothly. You'll need a handful of Python libraries, which you can install using pip. The essential libraries include the following:
pandas
json
wandb
mistralai
numpy
transformers
weave
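If you're installing from scratch, a single pip command covers everything except json, which ships with Python's standard library. Note that the snippets in this tutorial use the MistralClient interface from the mistralai package, so if that import fails for you, check which version of the package you have installed:

pip install pandas wandb mistralai numpy transformers weave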
You will also need to obtain an API key. To do so, sign up or log in to your account at Mistral AI and navigate to the API section to generate and copy your API key. Adding a valid payment method is required to use Mistral services: visit the billing page at Mistral AI and enter your payment details. This ensures uninterrupted access to the paid features of the Mistral API.

Generating a dataset

Creating a robust dataset is a fundamental step in fine-tuning language models. For this tutorial, we are using the SCB Machine Translation English-Thai 2020 dataset, which is a large parallel corpus curated from various sources like news articles, Wikipedia, SMS messages, and government documents. The primary objective is to construct an English-Thai dataset for machine translation, ensuring the data is clean and adequately aligned. The dataset can be found on Kaggle here and you can download it to your local system.
To begin, you'll need to prepare the dataset in a JSONL format where each line contains a JSON object representing a training example. Here is a Python script that takes an input CSV file and generates two JSONL files: one for training with prompts and another with just English-Thai pairs.

import pandas as pd
import json

def create_datasets(input_path, output_path_prompt, output_path_pairs):
    df = pd.read_csv(input_path)

    # Dataset with prompt
    df_prompt = [
        {
            "messages": [
                {"role": "user", "content": f"Translate the following text to Thai language: {row['en']}"},
                {"role": "assistant", "content": row["th"]},
            ]
        }
        for index, row in df.iterrows()
    ]

    # Save dataset with prompt
    with open(output_path_prompt, "w") as f:
        for line in df_prompt:
            json.dump(line, f)
            f.write("\n")

    # Dataset with just English/Thai pairs
    df_pairs = [
        {
            "messages": [
                {"role": "user", "content": row["en"]},
                {"role": "assistant", "content": row["th"]},
            ]
        }
        for index, row in df.iterrows()
    ]

    # Save dataset with pairs
    with open(output_path_pairs, "w") as f:
        for line in df_pairs:
            json.dump(line, f)
            f.write("\n")

# Paths to your CSV files
train_csv_path = "/Users/brettyoung/Desktop/dev_24/tutorials/mistral_thai/scb_mt_enth_2020_train.csv"
test_csv_path = "/Users/brettyoung/Desktop/dev_24/tutorials/mistral_thai/scb_mt_enth_2020_test.csv"

# Paths for the output JSONL files
output_train_prompt = "/Users/brettyoung/Desktop/dev_24/tutorials/mistral_thai/scb_mt_enth_2020_train_prompt.jsonl"
output_train_pairs = "/Users/brettyoung/Desktop/dev_24/tutorials/mistral_thai/scb_mt_enth_2020_train_pairs.jsonl"
output_test_prompt = "/Users/brettyoung/Desktop/dev_24/tutorials/mistral_thai/scb_mt_enth_2020_test_prompt.jsonl"
output_test_pairs = "/Users/brettyoung/Desktop/dev_24/tutorials/mistral_thai/scb_mt_enth_2020_test_pairs.jsonl"

# Create the datasets
create_datasets(train_csv_path, output_train_prompt, output_train_pairs)
create_datasets(test_csv_path, output_test_prompt, output_test_pairs)

print("Train and Test datasets created successfully.")

After generating the initial datasets, I reduced the dataset size to about 1 million tokens for practical purposes.
The following script helps achieve this by reading the JSONL files and creating shorter versions that fit within the specified token limits. Note that while this script uses the AutoTokenizer from the transformers library, the exact tokenizer used for the API model might differ slightly. Nonetheless, this serves as a very close approximation for limiting the number of tokens in the dataset.
import json
from transformers import AutoTokenizer

# Initialize the tokenizer
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

# Paths to the JSONL files
train_pairs_path = "/Users/brettyoung/Desktop/dev_24/tutorials/mistral_thai/scb_mt_enth_2020_train_prompt.jsonl"
short_train_pairs_path = "/Users/brettyoung/Desktop/dev_24/tutorials/mistral_thai/scb_mt_enth_2020_short_train_pairs.jsonl"
test_pairs_path = "/Users/brettyoung/Desktop/dev_24/tutorials/mistral_thai/scb_mt_enth_2020_test_prompt.jsonl"
short_test_pairs_path = "/Users/brettyoung/Desktop/dev_24/tutorials/mistral_thai/scb_mt_enth_2020_short_test_pairs.jsonl"

# Token thresholds
max_input_tokens = 3_000_000
max_output_tokens = 1_000_000
max_test_output_tokens = 50_000

def count_tokens(text):
    return tokenizer(text, return_tensors="pt")["input_ids"].shape[1]

def create_short_dataset(input_path, output_path, max_input_tokens, max_output_tokens):
    total_input_tokens = 0
    total_output_tokens = 0
    examples = []

    with open(input_path, "r") as f:
        for line in f:
            data = json.loads(line)
            input_tokens = 0
            output_tokens = 0

            for message in data["messages"]:
                if message["role"] == "user":
                    input_tokens += count_tokens(message["content"])
                elif message["role"] == "assistant":
                    output_tokens += count_tokens(message["content"])

            if total_input_tokens + input_tokens > max_input_tokens or total_output_tokens + output_tokens > max_output_tokens:
                break

            total_input_tokens += input_tokens
            total_output_tokens += output_tokens
            examples.append(data)

    with open(output_path, "w") as f:
        for example in examples:
            json.dump(example, f)
            f.write("\n")

    print(f"Created {output_path}")
    print(f"Total input tokens: {total_input_tokens}")
    print(f"Total output tokens: {total_output_tokens}")

def create_short_test_dataset(input_path, output_path, max_output_tokens):
    total_output_tokens = 0
    examples = []

    with open(input_path, "r") as f:
        for line in f:
            data = json.loads(line)
            output_tokens = 0

            for message in data["messages"]:
                if message["role"] == "assistant":
                    output_tokens += count_tokens(message["content"])

            if total_output_tokens + output_tokens > max_output_tokens:
                break

            total_output_tokens += output_tokens
            examples.append(data)

    with open(output_path, "w") as f:
        for example in examples:
            json.dump(example, f)
            f.write("\n")

    print(f"Created {output_path}")
    print(f"Total output tokens: {total_output_tokens}")

# Create the shorter training dataset
create_short_dataset(train_pairs_path, short_train_pairs_path, max_input_tokens, max_output_tokens)

# Create the shorter test dataset
create_short_test_dataset(test_pairs_path, short_test_pairs_path, max_test_output_tokens)
This script initializes the tokenizer and defines functions to count tokens in the text. It creates shorter datasets by ensuring the total number of tokens does not exceed the specified limits. By running these scripts, you will have well-prepared, manageable datasets for fine-tuning your Mistral models using their API.
Although this tokenizer is a good approximation for our purposes, it may not exactly match the tokenizer the API model uses, which is worth remembering in a production setting. Mistral also recommends a validation set of about 5 percent of the training set size, with a maximum file size of 1MB, so factor that in when sizing your validation set.
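As a quick sanity check before uploading, you can confirm the shortened files respect those guidelines. This is a minimal sketch using the shortened JSONL paths from above (adjust them to match your own setup):

import os

short_train_pairs_path = "/Users/brettyoung/Desktop/dev_24/tutorials/mistral_thai/scb_mt_enth_2020_short_train_pairs.jsonl"
short_test_pairs_path = "/Users/brettyoung/Desktop/dev_24/tutorials/mistral_thai/scb_mt_enth_2020_short_test_pairs.jsonl"

def summarize_jsonl(path):
    # Each line of a JSONL file is one training example
    with open(path) as f:
        n_examples = sum(1 for _ in f)
    size_mb = os.path.getsize(path) / 1_000_000
    return n_examples, size_mb

train_n, train_mb = summarize_jsonl(short_train_pairs_path)
val_n, val_mb = summarize_jsonl(short_test_pairs_path)

print(f"Training set: {train_n} examples, {train_mb:.2f} MB")
print(f"Validation set: {val_n} examples, {val_mb:.2f} MB")
print(f"Validation set is {100 * val_n / train_n:.1f}% of the training set")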

Choosing the right Mistral model

Selecting the right Mistral model for fine-tuning using their API is crucial for achieving optimal performance and cost-efficiency in machine translation tasks. One effective approach involves testing multiple models on a subset of the training data to assess their performance before committing to fine-tuning. This initial evaluation helps identify which model is most promising for further optimization.
In practical terms there are several methods to evaluate model performance. Metrics like BLEU scores are commonly used to measure translation quality, although their reliability may vary, especially for languages like Thai. Through my experiments, I've found that these metrics may not always provide a complete picture, particularly in scenarios where direct comparisons with human translation are essential.
Additionally, leveraging zero-shot testing can be valuable in assessing a model's accuracy before fine-tuning. Zero-shot testing involves evaluating how well a model performs on unseen data without any prior training specific to that data. This approach helps gauge the model's ability to generalize across different contexts and tasks.
Therefore, while quantitative metrics are useful for initial assessments, it's equally important to incorporate qualitative human evaluation. Human assessment provides valuable insights into the naturalness and accuracy of translations, ensuring the model meets real-world standards and expectations.
Essentially, combining small-sample testing with metrics and human evaluation offers a comprehensive approach to selecting and optimizing machine translation models effectively. I decided to go with the Mistral-small model for this tutorial.
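As a rough illustration of that small-sample, zero-shot approach, the sketch below translates 50 test sentences with two candidate models and scores them with sacreBLEU (an extra dependency, installable with pip install sacrebleu). The character-level tokenizer and the sample size are arbitrary choices here, and as noted above, BLEU on Thai is only a directional signal:

import pandas as pd
import sacrebleu  # extra dependency: pip install sacrebleu
from mistralai.client import MistralClient
from mistralai.models.chat_completion import ChatMessage

client = MistralClient(api_key="your_mistral_key")
test_df = pd.read_csv("/Users/brettyoung/Desktop/dev_24/tutorials/mistral_thai/scb_mt_enth_2020_test.csv").head(50)

def translate(model, text):
    # Zero-shot translation request, no fine-tuning involved
    msg = f"Translate the following text to Thai language: {text}"
    response = client.chat(model=model, messages=[ChatMessage(role="user", content=msg)])
    return response.choices[0].message.content

references = test_df["th"].tolist()
for model in ["open-mistral-7b", "mistral-small-latest"]:
    hypotheses = [translate(model, text) for text in test_df["en"]]
    # Character-level tokenization avoids depending on a Thai word segmenter
    score = sacrebleu.corpus_bleu(hypotheses, [references], tokenize="char")
    print(f"{model}: BLEU = {score.score:.2f}")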

Before starting, ensure you have prepaid credits in the Mistral billing console to use the fine-tuning services:


Using the Mistral API to fine-tune the model

Fine-tuning a model using the Mistral API involves several steps, from uploading datasets to creating and managing fine-tuning jobs. Here is the full code for fine-tuning your model:
import os
import json
from mistralai.client import MistralClient
from mistralai.models.jobs import TrainingParameters, WandbIntegrationIn

# Set up API keys
mistral_api_key = "your_mistral_key"
wandb_api_key = "your_wandb_key"
model_name = "mistral-small-latest"
run_name = "test_fine_tuning_run" + "_" + model_name
client = MistralClient(api_key=mistral_api_key)

# Paths to the JSONL files
train_pairs_path = "/Users/brettyoung/Desktop/dev_24/tutorials/mistral_thai/scb_mt_enth_2020_short_train_pairs.jsonl"
test_pairs_path = "/Users/brettyoung/Desktop/dev_24/tutorials/mistral_thai/scb_mt_enth_2020_short_test_pairs.jsonl"

# Step 1: Upload the dataset to the Mistral client
with open(train_pairs_path, "rb") as f:
    training_data = client.files.create(file=(train_pairs_path, f))

with open(test_pairs_path, "rb") as f:
    validation_data = client.files.create(file=(test_pairs_path, f))

# Step 2: Create a fine-tuning job with W&B integration
created_jobs = client.jobs.create(
    model=model_name,  # Specify the model to fine-tune
    training_files=[training_data.id],  # Training file IDs
    validation_files=[validation_data.id],  # Validation file IDs
    hyperparameters=TrainingParameters(
        training_steps=30,  # Number of training steps
        learning_rate=0.0001,  # Learning rate
    ),
    integrations=[
        WandbIntegrationIn(
            project="mistral_finetune",  # W&B project name
            run_name=run_name,  # W&B run name
            api_key=wandb_api_key,  # W&B API key
        ).dict()
    ],
)
print("Created fine-tuning job:", created_jobs)

# Step 3: List, retrieve, and manage fine-tuning jobs
# List jobs
jobs = client.jobs.list()
print("Jobs list:", jobs)

# Retrieve a job
retrieved_job = client.jobs.retrieve(created_jobs.id)
print("Retrieved job:", retrieved_job)


Initializing Weights & Biases and Mistral

First, the script initializes the Mistral client using the API key, enabling access to Mistral services. It also sets up the necessary paths for the training and validation datasets, which were prepared earlier.
You'll also need your Weights & Biases API key, which you can get here.
from mistralai.client import MistralClient
from mistralai.models.jobs import TrainingParameters, WandbIntegrationIn

mistral_api_key = "your_mistral_key"
wandb_api_key = "your_wandb_key"
model_name = "open-mistral-7b"
run_name = "test_fine_tuning_run" + "_" + model_name
client = MistralClient(api_key=mistral_api_key)

train_pairs_path = "/Users/brettyoung/Desktop/dev_24/tutorials/mistral_thai/scb_mt_enth_2020_short_train_pairs.jsonl"
test_pairs_path = "/Users/brettyoung/Desktop/dev_24/tutorials/mistral_thai/scb_mt_enth_2020_short_test_pairs.jsonl"

Upload the dataset to the Mistral Client

The next step is to upload the training and validation datasets to the Mistral client. This is done by opening each dataset file and using the `client.files.create` method to upload them.
with open(train_pairs_path, "rb") as f:
    training_data = client.files.create(file=(train_pairs_path, f))

with open(test_pairs_path, "rb") as f:
    validation_data = client.files.create(file=(test_pairs_path, f))

Create a fine-tuning job

Next, the script creates a fine-tuning job. It specifies the model we'd like to fine-tune (mistral-small-latest), the uploaded training and validation file IDs, and the hyperparameters for the training process, such as the number of training steps and the learning rate. Additionally, it integrates with Weights & Biases for tracking the fine-tuning process.
created_jobs = client.jobs.create(
    model=model_name,
    training_files=[training_data.id],
    validation_files=[validation_data.id],
    hyperparameters=TrainingParameters(
        training_steps=30,
        learning_rate=0.0001,
    ),
    integrations=[
        WandbIntegrationIn(
            project="mistral_finetune",
            run_name=run_name,
            api_key=wandb_api_key,
        ).dict()
    ],
)
print("Created fine-tuning job:", created_jobs)
print("Created fine-tuning job:", created_jobs)

Listing fine-tuning jobs

After creating the fine-tuning job, the script includes commands to list all jobs, retrieve a specific job, and manage jobs as needed. This helps in monitoring the status and details of the fine-tuning jobs.
jobs = client.jobs.list()
print("Jobs list:", jobs)

retrieved_job = client.jobs.retrieve(created_jobs.id)
print("Retrieved job:", retrieved_job)
By following these steps, the script demonstrates how to use the Mistral API to fine-tune a model, from uploading datasets to managing the fine-tuning process. This method leverages the capabilities of the Mistral API and integrates with W&B to ensure a streamlined and tracked fine-tuning workflow. You can monitor the progress of your model in Weights & Biases, where you'll see a chart dedicated to the percentage of the training run completed, as shown below.

Run: 77fc4baa-988a-4d63-bc6e-4c6adb54d7d8
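Fine-tuning jobs run asynchronously, so it can be handy to poll the job until it finishes before moving on to inference. Here is a minimal sketch, assuming the client and created_jobs objects from the script above; the exact status strings are an assumption, so confirm them against what retrieved_job.status actually returns:

import time

# Poll the job until it reaches a terminal state.
# The status names below are assumptions; print retrieved_job.status once to confirm.
terminal_statuses = {"SUCCESS", "FAILED", "CANCELLED"}

while True:
    retrieved_job = client.jobs.retrieve(created_jobs.id)
    print("Job status:", retrieved_job.status)
    if retrieved_job.status in terminal_statuses:
        break
    time.sleep(60)  # check once a minute

print("Fine-tuned model name:", retrieved_job.fine_tuned_model)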


Running inference

To track our model's performance on the held-out test set, we integrated W&B Weave for efficient monitoring and logging of model outputs. Below is the inference script with the Weave integration:

import os
import json
import pandas as pd
import weave  # Import weave for logging

from mistralai.client import MistralClient
from mistralai.models.chat_completion import ChatMessage
from transformers import AutoTokenizer
import random
import numpy as np
import time

# Set random seed for reproducibility
seed = 42
random.seed(seed)
np.random.seed(seed)

# Initialize Weave so calls to @weave.op() functions are logged (project name is an example)
weave.init("mistral_finetune")

# Load the test dataset
test_df = pd.read_csv("/Users/brettyoung/Desktop/dev_24/tutorials/mistral_thai/scb_mt_enth_2020_test.csv")

# Set up the Mistral client and tokenizer
mistral_api_key = "api-key"
client = MistralClient(api_key=mistral_api_key)
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

# Retrieve the fine-tuned job
fine_tuned_job_id = "your_id"  # Replace with your actual fine-tuned job ID
retrieved_job = client.jobs.retrieve(fine_tuned_job_id)
fine_tuned_model = retrieved_job.fine_tuned_model

# Define run inference function using weave
@weave.op()
def run_inference(model, msg, max_tokens):
    msg = "Translate the following text to Thai language: " + msg
    return client.chat(
        model=model,
        max_tokens=max_tokens,
        messages=[ChatMessage(role="user", content=msg)],
    ).choices[0].message.content


for index, row in test_df.iterrows():
    if index >= 100:  # Limit to 100 examples
        break

    english_text = row["en"]
    reference_translation = row["th"]
    ground_truth_tokens = tokenizer(reference_translation, return_tensors="pt")["input_ids"].shape[1]
    max_tokens = ground_truth_tokens * 2

    # Translate using the fine-tuned model
    translation = run_inference(fine_tuned_model, english_text, max_tokens)

    time.sleep(0.1)

# Finish weave logging
weave.finish()

We have integrated W&B Weave into our inference script, which logs the inputs and outputs of our model. Weave is a lightweight toolkit by Weights & Biases for tracking and evaluating language model applications. By decorating Python functions with @weave.op(), Weave helps log and debug model inputs and outputs and organize information gathered in production.
Using Weave, you can later analyze the logged data to identify weaknesses in your model and make necessary adjustments. This approach ensures continuous improvement and robustness of your fine-tuned model in production.
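If you also want the reference translation visible next to each prediction in the Weave traces, one option is to log both from a single op. Below is a sketch along those lines; compare_translation is a hypothetical helper that reuses the client and ChatMessage objects from the inference script above and assumes weave.init has already been called:

@weave.op()
def compare_translation(model, english_text, reference_translation, max_tokens):
    # Returning the prediction and the reference together makes
    # side-by-side review in the Weave UI straightforward.
    prompt = "Translate the following text to Thai language: " + english_text
    prediction = client.chat(
        model=model,
        max_tokens=max_tokens,
        messages=[ChatMessage(role="user", content=prompt)],
    ).choices[0].message.content
    return {"prediction": prediction, "reference": reference_translation}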
Here is a screenshot of what gets logged in Weave:


More applications

Fine-tuned models have a wide range of applications across various industries. Using the Mistral API, these models can be integrated into numerous domains to enhance performance and relevance.
In the field of customer support, fine-tuned models can be used to automate responses to common inquiries, improving response times and customer satisfaction. In language translation, models fine-tuned on specific language pairs can provide more accurate translations tailored to particular dialects or industries. For content generation, fine-tuned models can assist in creating customized marketing materials, blog posts, or social media content that aligns with a brand's voice and style.
Additionally, fine-tuned models are valuable in sentiment analysis, where they can be used to monitor social media and customer reviews, providing insights into public perception and enabling companies to respond proactively. In healthcare, these models can assist in medical documentation and automated analysis of patient records, ensuring more efficient and accurate data management.

Conclusion

Fine-tuning models using the Mistral API enhances their performance and relevance for specific tasks. The process involves setting up your environment, generating and preparing datasets, choosing the right model, and using the Mistral API to manage the fine-tuning process. By integrating tools like W&B Weave, developers can monitor the performance of their models in production and make necessary adjustments based on logged data.
In this project, we successfully trained a model on Thai language data for translation. This involved creating a robust dataset with English-Thai sentence pairs, fine-tuning the model using the Mistral API, and evaluating the model's performance using Weave. Fine-tuning significantly improves model accuracy and efficiency for targeted applications.
The Mistral API simplifies the fine-tuning process and offers robust features for model management. Evaluating models before fine-tuning ensures the best balance of performance and cost. W&B Weave provides valuable tools for monitoring and improving models in production.











