
Working with Pixtral Large for visual chart understanding

A battle between the open-source Pixtral Large and closed-source foundation models like Claude 3.5 Sonnet and GPT-4o Vision
Created on November 19|Last edited on November 19
The race for multimodal supremacy has a new contender. Mistral AI's recent release of Pixtral Large, a 124-billion-parameter model (a 123B language decoder paired with a 1B vision encoder), has made waves in the AI community with bold claims of outperforming industry leaders across several benchmarks. Notably, Pixtral reports 88.1% accuracy on ChartQA, which is competitive with both Claude 3.5 Sonnet (89.1%) and GPT-4o (85.2%).
In this article, we'll set up Pixtral Large for visual chart understanding and evaluate it against those formidable competitors (OpenAI's GPT-4o Vision and Anthropic's Claude 3.5 Sonnet) using the ChartQA dataset.


What is chart understanding?

Chart understanding represents a critical frontier in multimodal AI, enabling models to interpret and analyze data visualizations such as graphs, bar charts, and loss curves. This capability bridges the gap between textual and visual information, allowing for a seamless understanding of structured data.
But what exactly is chart understanding? At its core, it involves the ability to:
  • Parse visual elements like axes, legends, and data points.
  • Recognize relationships and trends within the data.
  • Contextualize insights derived from the chart with accompanying textual or numerical information.
In real-world applications, this ability translates to breakthroughs in fields such as:
  • Business Intelligence: Automating the interpretation of dashboards for decision-makers.
  • Scientific Research: Analyzing complex data visualizations to generate insights.
  • Data Analysis: Streamlining the processing of charts from financial reports, academic papers, or public datasets.
These use cases highlight why benchmarks like ChartQA are pivotal for assessing the performance of multimodal models. ChartQA evaluates how well models can extract, interpret, and reason over chart-based information. Strong performance on this benchmark, as demonstrated by Pixtral Large, reflects a model's ability to solve complex, real-world tasks.
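To make this concrete, each ChartQA record pairs a rendered chart image with one or more question-answer pairs. Here's a minimal sketch that peeks at a single record, assuming the TeeA/ChartQA mirror on Hugging Face (the same one used in the code later in this article), where each example exposes an image field plus a qa list with query and label entries:
from datasets import load_dataset

# Peek at one record from the ChartQA mirror used later in this article (assumption: TeeA/ChartQA)
dataset = load_dataset("TeeA/ChartQA", split="train")
sample = dataset[0]

print(sample["qa"][0]["query"])   # the question asked about the chart
print(sample["qa"][0]["label"])   # the ground-truth answer
print(sample["image"])            # the chart itself, stored as a PIL image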
The chart below highlights Pixtral Large's performance on ChartQA compared to other leading multimodal models:


Setting up Pixtral Large

To begin working with Pixtral Large, you need access to the Mistral API. First, create a Mistral AI account and generate a valid API key. Then initialize the Mistral client in your environment, which involves installing the mistralai library and setting up the client with your key (you'll see that in the code below).
Before diving into the code, make sure you have Python installed, along with the necessary dependencies: Pillow for image processing, datasets for working with benchmark datasets like ChartQA, and Weave for tracking and logging your model's inputs and outputs.
You can install the necessary Python libraries with the following command:
pip install -U mistralai weave pillow datasets openai anthropic
And here’s some basic code for running inference with Pixtral Large:
import base64
from io import BytesIO
from PIL import Image
from datasets import load_dataset
from mistralai import Mistral
import weave

# Initialize Weave
weave.init("pixtral_inference")

# Set up the Pixtral API client
API_KEY = "your api key" # Replace with your API key
client = Mistral(api_key=API_KEY)

# Helper function to encode an image to base64
def encode_image(image_obj):
    if isinstance(image_obj, Image.Image):  # Check if it's already a PIL Image
        img = image_obj
    else:  # Otherwise, try opening it as a path
        img = Image.open(image_obj)
    buffered = BytesIO()
    img.save(buffered, format="PNG")
    return base64.b64encode(buffered.getvalue()).decode("utf-8")

# Define a function to perform inference
@weave.op
def run_inf(image_obj, prompt):
    # Encode the image to base64
    image_base64 = encode_image(image_obj)

    # Prepare input for the Pixtral API
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_base64}"}}
            ]
        }
    ]

    # Perform inference
    response = client.chat.complete(
        model="pixtral-large-latest",
        messages=messages,
        max_tokens=300
    )
    # Return the model's output
    return response.choices[0].message.content

# Example usage with the dataset
if __name__ == "__main__":
    # Load the ChartQA dataset
    dataset = load_dataset("TeeA/ChartQA", split="train")

    # Select one sample for inference
    sample_data = dataset[0]  # Get the first data point

    # Extract the image and prompt
    image_obj = sample_data["image"]  # This should be a PIL Image or file path
    prompt = sample_data["qa"][0]["query"]  # Get the query for the first QA pair

    # Perform inference
    result = run_inf(image_obj, prompt)
    # Print the result
    print(f"Question: {prompt}")
    print(f"Model Prediction: {result}")

Monitoring model behavior with Weave

Weave's integration provides an efficient way to track and visualize the inputs and outputs of your Pixtral-powered application. By wrapping the inference function with @weave.op, you'll log the arguments and outputs of the function—including the image, prompt, and model's response. This makes it straightforward to monitor the model's real-time performance and identify potential bottlenecks or areas for improvement.
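As a rough sketch of how this looks in practice (assuming a recent Weave SDK; weave.attributes is an assumption about the version you have installed), any function decorated with @weave.op is traced automatically, and you can optionally attach metadata to a group of calls so related traces are easy to filter in the UI:
import weave

weave.init("pixtral_inference")

# Any function wrapped with @weave.op gets its inputs and outputs logged as a trace.
# This toy op stands in for the run_inf function from the script above.
@weave.op
def summarize_answer(question: str, answer: str) -> str:
    return f"Q: {question} | A: {answer[:80]}"

# Optionally tag a batch of calls with metadata so they are easy to filter later in the Weave UI.
with weave.attributes({"experiment": "chartqa-smoke-test"}):
    print(summarize_answer("What is the peak value?", "The peak value is 42, reached in 2021."))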

Evaluation of Pixtral against leading multimodal models

To better understand the performance of Pixtral Large compared to other leading models such as GPT-4o Vision and Claude, we'll conduct a small-scale evaluation using the ChartQA dataset.
The goal is not necessarily to analyze the overall performance, but rather to analyze the qualitative differences in how each model interprets and responds to chart-based queries. This kind of analysis offers a window into how these models process multimodal inputs differently, highlighting their strengths and weaknesses. Using Weave’s evaluation dashboard, we gain access to an interactive, granular view of the models' predictions.
For this evaluation, I created a small helper class, called EzModel (short for easy model), that simplifies the API calls for the various models and platforms. The class abstracts away the differences between the Pixtral, GPT, and Claude APIs, unifying them behind a single interface so we can focus entirely on the evaluation without worrying about the specific quirks of each model's API. It handles image encoding and each provider's message-formatting requirements, which makes it easy to iterate through different models and systematically compare their outputs on the same set of questions and images.
Here is the class definition for EzModel. It manages inference for Pixtral, GPT-4o, and Claude by automatically initializing the right client, encoding images, and dispatching requests. Note that to use this class, you will need to set an API key for each model as an environment variable (CLAUDE_API_KEY, GPT_API_KEY, and MISTRAL_API_KEY).
Save this code in a file called ez_model.py if you want to use it.
Here's the code:
import os
import base64
from io import BytesIO
from PIL import Image

class EzModel:
    def __init__(self, model_name, max_tokens=1024, temperature=0.0):
        self.model_name = model_name.lower()
        self.api_key = os.getenv(f"{model_name.upper()}_API_KEY")
        self.max_tokens = max_tokens
        self.temperature = temperature
        if not self.api_key:
            raise ValueError(f"API key for {model_name} not found in environment variables.")
        self.client = self._initialize_client()

    def _initialize_client(self):
        if self.model_name == "claude":
            from anthropic import Anthropic
            return Anthropic(api_key=self.api_key)
        elif self.model_name == "gpt":
            from openai import OpenAI
            return OpenAI(api_key=self.api_key)
        elif self.model_name == "mistral":
            from mistralai import Mistral
            return Mistral(api_key=self.api_key)
        else:
            raise ValueError(f"Unsupported model name: {self.model_name}")

    def _encode_image(self, image):
        """Convert PIL image to base64 with correct format per API"""
        if image.mode == "RGBA":
            image = image.convert("RGB")
        buffered = BytesIO()
        image.save(buffered, format="JPEG")
        return base64.b64encode(buffered.getvalue()).decode("utf-8")

    def __call__(self, prompt, pillage=None, base64Img=None, max_tokens=None, temperature=None):
        # Handle image input (pillage is expected to be a PIL Image)
        image_base64 = None
        if pillage:
            if isinstance(pillage, Image.Image):
                image_base64 = self._encode_image(pillage)
            else:
                raise ValueError("pillage must be a PIL Image object")
        elif base64Img:
            image_base64 = base64Img

        max_tokens = max_tokens if max_tokens is not None else self.max_tokens
        temperature = temperature if temperature is not None else self.temperature

        if self.model_name == "claude":
            return self._infer_claude(image_base64, prompt, max_tokens, temperature)
        elif self.model_name == "gpt":
            return self._infer_gpt(image_base64, prompt, max_tokens, temperature)
        elif self.model_name == "mistral":
            return self._infer_mistral(image_base64, prompt, max_tokens, temperature)
        else:
            raise ValueError(f"Unsupported model: {self.model_name}")

    def _infer_claude(self, image_base64, prompt, max_tokens, temperature):
        messages = [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt}
                ]
            }
        ]
        if image_base64:
            messages[0]["content"].insert(0, {
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": "image/jpeg",
                    "data": image_base64
                }
            })
        return self.client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=max_tokens,
            temperature=temperature,
            messages=messages
        ).content[0].text

    def _infer_gpt(self, image_base64, prompt, max_tokens, temperature):
        if image_base64:
            content = [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_base64}"}}
            ]
        else:
            content = prompt
        return self.client.chat.completions.create(
            model="gpt-4o-2024-08-06",
            messages=[{"role": "user", "content": content}],
            max_tokens=max_tokens,
            temperature=temperature
        ).choices[0].message.content

    def _infer_mistral(self, image_base64, prompt, max_tokens, temperature):
        messages = [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt}
                ]
            }
        ]
        if image_base64:
            messages[0]["content"].append({
                "type": "image_url",
                "image_url": {"url": f"data:image/jpeg;base64,{image_base64}"}
            })
        return self.client.chat.complete(
            model="pixtral-large-latest",
            messages=messages,
            max_tokens=max_tokens,
            temperature=temperature
        ).choices[0].message.content
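
Since EzModel reads its keys from environment variables named <MODEL>_API_KEY, here's a small, hypothetical smoke test showing one way to wire it up before running the full evaluation (the module name ez_model and the image path are assumptions from this article's setup):
import os
from PIL import Image
from ez_model import EzModel

# EzModel looks up os.getenv(f"{model_name.upper()}_API_KEY"), so set these keys first.
# They are shown inline for clarity; in practice, export them in your shell instead.
os.environ["MISTRAL_API_KEY"] = "your-mistral-key"
os.environ["GPT_API_KEY"] = "your-openai-key"
os.environ["CLAUDE_API_KEY"] = "your-anthropic-key"

# Quick smoke test: ask Pixtral Large a question about a local chart image (hypothetical file).
pixtral = EzModel("mistral")
chart = Image.open("example_chart.png")
print(pixtral("What is the highest value shown in this chart?", pillage=chart))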

Evaluating Pixtral Large with W&B Weave

Now that the helper is ready, we move to the evaluation script itself. Using the helper, I set up three model wrappers for Pixtral, GPT-4o, and Claude. The evaluation process begins with preparing a subset of the ChartQA dataset. The dataset consists of chart images paired with specific question-answer pairs. The images are preprocessed to ensure compatibility with all the models, and the questions are reformatted for consistency across evaluations.
Here's the code for our evaluation:
import os
import json
import asyncio
from datasets import load_dataset
from ez_model import EzModel
from weave import Evaluation, Model
import weave
from PIL import Image
from io import BytesIO
import base64

# Initialize Weave
weave.init("pixtral_vs_others_eval")

def get_pil_image(image):
    """Convert input to a PIL Image."""
    if isinstance(image, str) and image.startswith("data:image"):  # base64 data URL
        # Strip the data URL prefix if present
        base64_data = image.split(",")[1] if "," in image else image
        image_bytes = base64.b64decode(base64_data)
        return Image.open(BytesIO(image_bytes))
    elif isinstance(image, str):  # File path
        return Image.open(image)
    elif isinstance(image, Image.Image):  # Already a PIL image
        return image
    elif isinstance(image, bytes):  # Raw image bytes
        return Image.open(BytesIO(image))
    else:
        raise ValueError("Unsupported image type")

# Instantiate shared model instances
model_instances = {
    "claude": EzModel("claude"),
    "gpt": EzModel("gpt"),
    "mistral": EzModel("mistral"),
}

class ClaudeModel(Model):
    @weave.op()
    def predict(self, question: str, image):
        # Hardcoded model instance for "claude"
        model = model_instances["claude"]
        # Convert image to PIL and pass as pillage
        pil_image = get_pil_image(image)
        response = model(prompt=question, pillage=pil_image)
        return {"model_output": response}

class GPTModel(Model):
    @weave.op()
    def predict(self, question: str, image):
        # Hardcoded model instance for "gpt"
        model = model_instances["gpt"]
        # Convert image to PIL and pass as pillage
        pil_image = get_pil_image(image)
        response = model(prompt=question, pillage=pil_image)
        return {"model_output": response}

class MistralModel(Model):
    @weave.op()
    def predict(self, question: str, image):
        # Hardcoded model instance for "mistral"
        model = model_instances["mistral"]
        # Convert image to PIL and pass as pillage
        pil_image = get_pil_image(image)
        response = model(prompt=question, pillage=pil_image)
        return {"model_output": response}

@weave.op()
def llm_judge_scorer(ground_truth: str, model_output: dict) -> dict:
    """Query multiple LLMs to judge correctness and use Claude to determine the majority vote."""
    if not model_output or "model_output" not in model_output:
        return {"llm_judge_score": 0.0}

    # Gather individual judgments from all models
    judgments = {}
    for judge_name, judge_model in model_instances.items():
        prompt = (
            f"You are an LLM judge evaluating correctness on ChartQA.\n"
            f"Ground Truth: {ground_truth}\n"
            f"Predicted Response: {model_output['model_output']}\n\n"
            "Does the predicted response contain the ground truth answer? "
            "The answer does not need to be verbatim but must contain the correct answer. "
            "Respond with 'true' or 'false'."
        )
        response = judge_model(prompt=prompt)
        judgments[judge_name] = "true" in response.lower().strip()

    # Ask Claude to determine the majority vote, with ties resolved as 'false'
    majority_prompt = (
        f"You are tasked with determining the majority vote from multiple LLM judgments.\n"
        f"The judgments from the LLMs are as follows:\n{json.dumps(judgments)}\n\n"
        f"Determine the majority vote based on the judgments. "
        f"If there is a tie, the decision is 'false'. "
        f"Respond with 'true' or 'false'."
    )
    final_decision = model_instances["claude"](prompt=majority_prompt)

    return {"llm_judge_score": 1.0 if "true" in final_decision.lower().strip() else 0.0}


def create_evaluation_dataset(dataset_name: str, split: str = "test", eval_size: int = 3):
    """Prepare an evaluation dataset from the ChartQA dataset."""
    dataset = load_dataset(dataset_name, split=split, cache_dir="./cache").shuffle(seed=42)
    # Take the last eval_size examples
    start_idx = max(0, len(dataset) - eval_size)
    eval_data = dataset.select(range(start_idx, len(dataset)))

    evaluation_dataset = []
    for example in eval_data:
        question = example["qa"][0]["query"]
        ground_truth = example["qa"][0]["label"]
        image = example["image"]

        # Convert to a PIL Image if needed
        if isinstance(image, str):  # If it's a file path
            pil_image = Image.open(image)
            if pil_image.mode == "RGBA":
                pil_image = pil_image.convert("RGB")
        elif isinstance(image, Image.Image):
            pil_image = image
        else:
            raise ValueError(f"Unexpected image type: {type(image)}")

        evaluation_dataset.append({
            "question": "Based on the image, answer the following question: " + question,
            "ground_truth": ground_truth,
            "image": pil_image
        })

    # Ensure at least one example was processed
    if not evaluation_dataset:
        raise ValueError("No examples were processed from the dataset")
    return evaluation_dataset

async def run_evaluations():
    """Run evaluations for all models."""
    eval_dataset = create_evaluation_dataset("TeeA/ChartQA")

    # Initialize models
    models = {
        "claude": ClaudeModel(),
        "gpt": GPTModel(),
        "mistral": MistralModel(),
    }

    # Define scorers
    scorers = [llm_judge_scorer]

    # Run evaluations
    results = {}
    for model_name, model in models.items():
        print(f"\nEvaluating {model_name}...")
        evaluation = Evaluation(
            dataset=eval_dataset,
            scorers=scorers,
            name=model_name + " Evaluation"
        )
        results[model_name] = await evaluation.evaluate(model)

    # Print and save results
    print("\nEvaluation Results:")
    for model_name, result in results.items():
        print(f"\n{model_name} Results:")
        print(json.dumps(result, indent=2))

    output_file = "image_model_evaluation_results.json"
    with open(output_file, "w") as f:
        json.dump(results, f, indent=2)
    print(f"\nResults saved to {output_file}")

    return results

if __name__ == "__main__":
    asyncio.run(run_evaluations())


The evaluation framework tasks each model with generating predictions for the prepared dataset. These predictions are tracked and logged using Weave, which captures both the inputs (the questions and the corresponding images) and the outputs. By logging every prediction, the framework provides a detailed record of how each model performs, enabling in-depth analysis of specific cases where a model might excel or falter. The logged data also makes direct comparisons easy, since the exact inputs and outputs for each model can be traced side by side.
A notable component of this evaluation is the scoring mechanism: an LLM judge scorer. This scorer is an LLM ensemble that evaluates the correctness of model responses using additional language models as judges. Rather than relying solely on basic string-matching metrics, it introduces a layer of reasoning and judgment into the evaluation process. For every response generated by a model, the scorer presents the response alongside the ground truth to a panel of language model judges, including Pixtral, Claude, and GPT-4o. These judges evaluate whether the response aligns with the ground truth, considering both factual correctness and semantic relevance.
The scorer explicitly instructs the language model judges to focus on the substance of the response rather than its exact wording. This allows the evaluation to reflect the models’ understanding and interpretative capabilities, rather than penalizing them for stylistic differences.
Once the individual judgments are collected, they are aggregated into a majority decision: if most judges agree that a response is correct, it is marked as such; otherwise, it is marked incorrect. In this implementation, the aggregation itself is delegated to Claude, which is given the full set of judgments and asked to return a final binary verdict, with ties resolved as 'false'.
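For readers who prefer a deterministic aggregation step (or want to sanity-check the extra Claude call), the same rule can be expressed directly in Python. This is a sketch of an alternative, not what the scorer above actually does:
from collections import Counter

# Apply the same rule the scorer hands to Claude: majority wins, and a tie counts as false.
# The judgments dict mirrors the one built inside llm_judge_scorer above.
def majority_vote(judgments: dict) -> bool:
    counts = Counter(judgments.values())
    return counts[True] > counts[False]

print(majority_vote({"claude": True, "gpt": True, "mistral": False}))   # True
print(majority_vote({"claude": True, "gpt": False, "mistral": False}))  # False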
The results of this evaluation are stored and visualized through Weave, offering both quantitative scores and qualitative insights. By combining structured scoring with detailed logging, the framework provides a comprehensive view of model performance. It not only measures how accurate the models are but also uncovers patterns in their responses, such as recurring strengths in specific question types or consistent errors in particular visual scenarios. Through this process, Pixtral’s capabilities can be assessed in detail, and its performance compared effectively against competitors like GPT-4o and Claude in the context of multimodal tasks.
Here's a screenshot of what it looks like inside Weave Evaluations:


On this small subset of the ChartQA dataset, the performance of Pixtral Large was comparable to that of GPT-4o and Claude. All three models demonstrated strong capabilities in interpreting chart-based data and providing meaningful responses, with no significant differences in their ability to handle the tasks. These results reinforce the impression that Pixtral is a competitive player in the multimodal AI space, capable of holding its own alongside established proprietary models.
A notable distinction, however, was observed in the style and structure of the responses. Pixtral consistently produced longer and more detailed answers than GPT-4o and Claude. While GPT-4o and Claude generally opted for concise responses that directly addressed the questions with minimal elaboration, Pixtral often articulated its reasoning process step by step. This behavior resembled a "chain of thought" approach, where the model explicitly outlined intermediate steps, such as extracting specific elements from the chart, explaining its interpretation, and deriving the final result. This built-in tendency toward step-by-step reasoning may be part of what drives Pixtral's strong ChartQA numbers.
This was brought to my attention thanks to the awesome comparison view inside Weave Evaluations. Here's a screenshot of the comparisons view:


Conclusion

Although we did not evaluate the models on the full ChartQA test set, Mistral's comprehensive evaluation of Pixtral Large on the entire benchmark offers a reliable picture of the model's true capabilities, with published results indicating strong performance across a much broader range of examples. Our small subset simply confirms that Pixtral is in line with other leading models in terms of accuracy and reasoning, and it also hints that the model's chain-of-thought style of answering may contribute to its high reported performance.
Pixtral's open-source release makes these results all the more impressive. While GPT-4o and Claude are proprietary models backed by large-scale commercial resources, Pixtral's open availability allows researchers and developers to explore, fine-tune, and adapt it for a variety of use cases. Coming this close to market-leading proprietary systems while remaining openly available is a significant achievement, and it highlights Pixtral's strength and potential in the competitive AI landscape.
In sum, Pixtral’s competitive performance on this subset and its accessibility make it an excellent option for researchers and practitioners alike. Its demonstrated capabilities ensure it will remain a significant force in advancing multimodal AI applications.
