
Working with Pixtral Large for visual chart understanding

A battle between the open-source Pixtral Large and closed-source foundation models like Claude 3.5 Sonnet and GPT-4o Vision
Created on November 19|Last edited on November 19
The race for multimodal supremacy has a new contender. Mistral AI's recent release of Pixtral Large, a 124-billion-parameter model (a 123B language decoder paired with a 1B vision encoder), has made waves in the AI community with bold claims of outperforming industry leaders across several benchmarks. Notably, Pixtral reports 88.1% accuracy on ChartQA, which is competitive with both Claude 3.5 Sonnet (89.1%) and GPT-4o (85.2%).
In this article, we'll set up Pixtral Large for visual chart understanding and evaluate it against those formidable competitors (OpenAI's GPT-4o Vision and Anthropic's Claude 3.5 Sonnet) using the ChartQA dataset.


What is chart understanding?

Chart understanding represents a critical frontier in multimodal AI, enabling models to interpret and analyze data visualizations such as graphs, bar charts, and loss curves. This capability bridges the gap between textual and visual information, allowing for a seamless understanding of structured data.
But what exactly is chart understanding? At its core, it involves the ability to:
  • Parse visual elements like axes, legends, and data points.
  • Recognize relationships and trends within the data.
  • Contextualize insights derived from the chart with accompanying textual or numerical information.
In real-world applications, this ability translates to breakthroughs in fields such as:
  • Business Intelligence: Automating the interpretation of dashboards for decision-makers.
  • Scientific Research: Analyzing complex data visualizations to generate insights.
  • Data Analysis: Streamlining the processing of charts from financial reports, academic papers, or public datasets.
These use cases highlight why benchmarks like ChartQA are pivotal for assessing the performance of multimodal models. ChartQA evaluates how well models can extract, interpret, and reason over chart-based information. Strong performance on this benchmark, as demonstrated by Pixtral Large, reflects a model's ability to solve complex, real-world tasks.
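To make this concrete, each ChartQA record pairs a rendered chart image with one or more question-answer pairs. Here's a minimal sketch that peeks at a single record, assuming the TeeA/ChartQA mirror on Hugging Face (the same one used in the code later in this article), where each example exposes an image field plus a qa list with query and label entries:
from datasets import load_dataset

# Peek at one record from the ChartQA mirror used later in this article (assumption: TeeA/ChartQA)
dataset = load_dataset("TeeA/ChartQA", split="train")
sample = dataset[0]

print(sample["qa"][0]["query"])   # the question asked about the chart
print(sample["qa"][0]["label"])   # the ground-truth answer
print(sample["image"])            # the chart itself, stored as a PIL image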
The chart below highlights Pixtral Large's performance on ChartQA compared to other leading multimodal models:


Setting up Pixtral Large

To begin working with Pixtral Large, you need access to the Mistral API. First, create a Mistral AI account and generate a valid API key. Then initialize the Mistral client in your environment, which involves installing the mistralai library and setting up the client with your key (you'll see that in the code below).
Before diving into the code, make sure you have Python installed, along with the necessary dependencies: Pillow for image processing, datasets for working with benchmark datasets like ChartQA, and Weave for tracking and logging your model's inputs and outputs.
You can install the necessary Python libraries with the following command:
pip install -U mistralai weave pillow datasets openai anthropic
And here’s some basic code for running inference with Pixtral Large:
import base64
from io import BytesIO
from PIL import Image
from datasets import load_dataset
from mistralai import Mistral
import weave

# Initialize Weave
weave.init("pixtral_inference")

# Set up the Pixtral API client
API_KEY = "your api key" # Replace with your API key
client = Mistral(api_key=API_KEY)

# Helper function to encode an image to base64
def encode_image(image_obj):
    if isinstance(image_obj, Image.Image):  # Check if it's already a PIL Image
        img = image_obj
    else:  # Otherwise, try opening it as a path
        img = Image.open(image_obj)
    buffered = BytesIO()
    img.save(buffered, format="PNG")
    return base64.b64encode(buffered.getvalue()).decode("utf-8")

# Define a function to perform inference
@weave.op
def run_inf(image_obj, prompt):
    # Encode the image to base64
    image_base64 = encode_image(image_obj)

    # Prepare input for the Pixtral API
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_base64}"}}
            ]
        }
    ]

    # Perform inference
    response = client.chat.complete(
        model="pixtral-large-latest",
        messages=messages,
        max_tokens=300
    )
    # Return the model's output
    return response.choices[0].message.content

# Example usage with the dataset
if __name__ == "__main__":
    # Load the ChartQA dataset
    dataset = load_dataset("TeeA/ChartQA", split="train")

    # Select one sample for inference
    sample_data = dataset[0]  # Get the first data point

    # Extract the image and prompt
    image_obj = sample_data["image"]  # This should be a PIL Image or file path
    prompt = sample_data["qa"][0]["query"]  # Get the query for the first QA pair

    # Perform inference
    result = run_inf(image_obj, prompt)
    # Print the result
    print(f"Question: {prompt}")
    print(f"Model Prediction: {result}")

Monitoring model behavior with Weave

Weave's integration provides an efficient way to track and visualize the inputs and outputs of your Pixtral-powered application. By wrapping the inference function with @weave.op, you'll log the arguments and outputs of the function—including the image, prompt, and model's response. This makes it straightforward to monitor the model's real-time performance and identify potential bottlenecks or areas for improvement.
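As a rough sketch of how this looks in practice (assuming a recent Weave SDK; weave.attributes is an assumption about the version you have installed), any function decorated with @weave.op is traced automatically, and you can optionally attach metadata to a group of calls so related traces are easy to filter in the UI:
import weave

weave.init("pixtral_inference")

# Any function wrapped with @weave.op gets its inputs and outputs logged as a trace.
# This toy op stands in for the run_inf function from the script above.
@weave.op
def summarize_answer(question: str, answer: str) -> str:
    return f"Q: {question} | A: {answer[:80]}"

# Optionally tag a batch of calls with metadata so they are easy to filter later in the Weave UI.
with weave.attributes({"experiment": "chartqa-smoke-test"}):
    print(summarize_answer("What is the peak value?", "The peak value is 42, reached in 2021."))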

Evaluation of Pixtral against leading multimodal models

To better understand the performance of Pixtral Large compared to other leading models such as GPT-4o Vision and Claude, we'll conduct a small-scale evaluation using the ChartQA dataset.
The goal is not necessarily to analyze the overall performance, but rather to analyze the qualitative differences in how each model interprets and responds to chart-based queries. This kind of analysis offers a window into how these models process multimodal inputs differently, highlighting their strengths and weaknesses. Using Weave’s evaluation dashboard, we gain access to an interactive, granular view of the models' predictions.
For this evaluation, I created a small helper class, called EzModel (short for easy model), that simplifies the API calls for the various models and platforms. The class abstracts away the differences between the Pixtral, GPT, and Claude APIs, unifying them behind a single interface so we can focus entirely on the evaluation without worrying about the specific quirks of each model's API. It handles image encoding and each provider's message-formatting requirements, which makes it easy to iterate through different models and systematically compare their outputs on the same set of questions and images.
Here is the class definition for EzModel. It manages inference for Pixtral, GPT-4o, and Claude by automatically initializing the right client, encoding images, and dispatching requests. Note that to use this class, you will need to set an API key for each model as an environment variable (CLAUDE_API_KEY, GPT_API_KEY, and MISTRAL_API_KEY).
Save this code in a file called ez_model.py if you want to use it.
Here's the code:
import os
import base64
from io import BytesIO
from PIL import Image

class EzModel:
    def __init__(self, model_name, max_tokens=1024, temperature=0.0):
        self.model_name = model_name.lower()
        self.api_key = os.getenv(f"{model_name.upper()}_API_KEY")
        self.max_tokens = max_tokens
        self.temperature = temperature
        if not self.api_key:
            raise ValueError(f"API key for {model_name} not found in environment variables.")
        self.client = self._initialize_client()

    def _initialize_client(self):
        if self.model_name == "claude":
            from anthropic import Anthropic
            return Anthropic(api_key=self.api_key)
        elif self.model_name == "gpt":
            from openai import OpenAI
            return OpenAI(api_key=self.api_key)
        elif self.model_name == "mistral":
            from mistralai import Mistral
            return Mistral(api_key=self.api_key)
        else:
            raise ValueError(f"Unsupported model name: {self.model_name}")

    def _encode_image(self, image):
        """Convert PIL image to base64 with correct format per API"""
        if image.mode == "RGBA":
            image = image.convert("RGB")
        buffered = BytesIO()
        image.save(buffered, format="JPEG")
        return base64.b64encode(buffered.getvalue()).decode("utf-8")

    def __call__(self, prompt, pillage=None, base64Img=None, max_tokens=None, temperature=None):
        # Handle image input (pillage is expected to be a PIL Image)
        image_base64 = None
        if pillage:
            if isinstance(pillage, Image.Image):
                image_base64 = self._encode_image(pillage)
            else:
                raise ValueError("pillage must be a PIL Image object")
        elif base64Img:
            image_base64 = base64Img

        max_tokens = max_tokens if max_tokens is not None else self.max_tokens
        temperature = temperature if temperature is not None else self.temperature

        if self.model_name == "claude":
            return self._infer_claude(image_base64, prompt, max_tokens, temperature)
        elif self.model_name == "gpt":
            return self._infer_gpt(image_base64, prompt, max_tokens, temperature)
        elif self.model_name == "mistral":
            return self._infer_mistral(image_base64, prompt, max_tokens, temperature)
        else:
            raise ValueError(f"Unsupported model: {self.model_name}")

    def _infer_claude(self, image_base64, prompt, max_tokens, temperature):
        messages = [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt}
                ]
            }
        ]
        if image_base64:
            messages[0]["content"].insert(0, {
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": "image/jpeg",
                    "data": image_base64
                }
            })
        return self.client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=max_tokens,
            temperature=temperature,
            messages=messages
        ).content[0].text

    def _infer_gpt(self, image_base64, prompt, max_tokens, temperature):
        if image_base64:
            content = [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_base64}"}}
            ]
        else:
            content = prompt
        return self.client.chat.completions.create(
            model="gpt-4o-2024-08-06",
            messages=[{"role": "user", "content": content}],
            max_tokens=max_tokens,
            temperature=temperature
        ).choices[0].message.content

    def _infer_mistral(self, image_base64, prompt, max_tokens, temperature):
        messages = [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt}
                ]
            }
        ]
        if image_base64:
            messages[0]["content"].append({
                "type": "image_url",
                "image_url": {"url": f"data:image/jpeg;base64,{image_base64}"}
            })
        return self.client.chat.complete(
            model="pixtral-large-latest",
            messages=messages,
            max_tokens=max_tokens,
            temperature=temperature
        ).choices[0].message.content
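
Since EzModel reads its keys from environment variables named <MODEL>_API_KEY, here's a small, hypothetical smoke test showing one way to wire it up before running the full evaluation (the module name ez_model and the image path are assumptions from this article's setup):
import os
from PIL import Image
from ez_model import EzModel

# EzModel looks up os.getenv(f"{model_name.upper()}_API_KEY"), so set these keys first.
# They are shown inline for clarity; in practice, export them in your shell instead.
os.environ["MISTRAL_API_KEY"] = "your-mistral-key"
os.environ["GPT_API_KEY"] = "your-openai-key"
os.environ["CLAUDE_API_KEY"] = "your-anthropic-key"

# Quick smoke test: ask Pixtral Large a question about a local chart image (hypothetical file).
pixtral = EzModel("mistral")
chart = Image.open("example_chart.png")
print(pixtral("What is the highest value shown in this chart?", pillage=chart))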

Evaluating Pixtral Large with W&B Weave

Now that the helper is ready, we move to the evaluation script itself. Using the helper, I set up three model wrappers for Pixtral, GPT-4o, and Claude. The evaluation process begins with preparing a subset of the ChartQA dataset. The dataset consists of chart images paired with specific question-answer pairs. The images are preprocessed to ensure compatibility with all the models, and the questions are reformatted for consistency across evaluations.
Here's the code for our evaluation:
import os
import json
import asyncio
from datasets import load_dataset
from ez_model import EzModel
from weave import Evaluation, Model
import weave
from PIL import Image
from io import BytesIO
import base64

# Initialize Weave
weave.init("pixtral_vs_others_eval")

def get_pil_image(image):
    """Convert input to a PIL Image."""
    if isinstance(image, str) and image.startswith("data:image"):  # base64 data URL
        # Strip the data URL prefix if present
        base64_data = image.split(",")[1] if "," in image else image
        image_bytes = base64.b64decode(base64_data)
        return Image.open(BytesIO(image_bytes))
    elif isinstance(image, str):  # File path
        return Image.open(image)
    elif isinstance(image, Image.Image):  # Already a PIL image
        return image
    elif isinstance(image, bytes):  # Raw image bytes
        return Image.open(BytesIO(image))
    else:
        raise ValueError("Unsupported image type")

# Instantiate shared model instances
model_instances = {
    "claude": EzModel("claude"),
    "gpt": EzModel("gpt"),
    "mistral": EzModel("mistral"),
}

class ClaudeModel(Model):
    @weave.op()
    def predict(self, question: str, image):
        # Hardcoded model instance for "claude"
        model = model_instances["claude"]
        # Convert image to PIL and pass as pillage
        pil_image = get_pil_image(image)
        response = model(prompt=question, pillage=pil_image)
        return {"model_output": response}

class GPTModel(Model):
    @weave.op()
    def predict(self, question: str, image):
        # Hardcoded model instance for "gpt"
        model = model_instances["gpt"]
        # Convert image to PIL and pass as pillage
        pil_image = get_pil_image(image)
        response = model(prompt=question, pillage=pil_image)
        return {"model_output": response}

class MistralModel(Model):
    @weave.op()
    def predict(self, question: str, image):
        # Hardcoded model instance for "mistral"
        model = model_instances["mistral"]
        # Convert image to PIL and pass as pillage
        pil_image = get_pil_image(image)
        response = model(prompt=question, pillage=pil_image)
        return {"model_output": response}

@weave.op()
def llm_judge_scorer(ground_truth: str, model_output: dict) -> dict:
    """Query multiple LLMs to judge correctness and use Claude to determine the majority vote."""
    if not model_output or "model_output" not in model_output:
        return {"llm_judge_score": 0.0}

    # Gather individual judgments from all models
    judgments = {}
    for judge_name, judge_model in model_instances.items():
        prompt = (
            f"You are an LLM judge evaluating correctness on ChartQA.\n"
            f"Ground Truth: {ground_truth}\n"
            f"Predicted Response: {model_output['model_output']}\n\n"
            "Does the predicted response contain the ground truth answer? "
            "The answer does not need to be verbatim but must contain the correct answer. "
            "Respond with 'true' or 'false'."
        )
        response = judge_model(prompt=prompt)
        judgments[judge_name] = "true" in response.lower().strip()

    # Ask Claude to determine the majority vote, with ties resolved as 'false'
    majority_prompt = (
        f"You are tasked with determining the majority vote from multiple LLM judgments.\n"
        f"The judgments from the LLMs are as follows:\n{json.dumps(judgments)}\n\n"
        f"Determine the majority vote based on the judgments. "
        f"If there is a tie, the decision is 'false'. "
        f"Respond with 'true' or 'false'."
    )
    final_decision = model_instances["claude"](prompt=majority_prompt)

    return {"llm_judge_score": 1.0 if "true" in final_decision.lower().strip() else 0.0}


def create_evaluation_dataset(dataset_name: str, split: str = "test", eval_size: int = 3):
    """Prepare an evaluation dataset from the ChartQA dataset."""
    dataset = load_dataset(dataset_name, split=split, cache_dir="./cache").shuffle(seed=42)
    # Take the last eval_size examples
    start_idx = max(0, len(dataset) - eval_size)
    eval_data = dataset.select(range(start_idx, len(dataset)))

    evaluation_dataset = []
    for example in eval_data:
        question = example["qa"][0]["query"]
        ground_truth = example["qa"][0]["label"]
        image = example["image"]

        # Convert to a PIL Image if needed
        if isinstance(image, str):  # If it's a file path
            pil_image = Image.open(image)
            if pil_image.mode == "RGBA":
                pil_image = pil_image.convert("RGB")
        elif isinstance(image, Image.Image):
            pil_image = image
        else:
            raise ValueError(f"Unexpected image type: {type(image)}")

        evaluation_dataset.append({
            "question": "Based on the image, answer the following question: " + question,
            "ground_truth": ground_truth,
            "image": pil_image
        })

    # Ensure at least one example was processed
    if not evaluation_dataset:
        raise ValueError("No examples were processed from the dataset")
    return evaluation_dataset

async def run_evaluations():
    """Run evaluations for all models."""
    eval_dataset = create_evaluation_dataset("TeeA/ChartQA")

    # Initialize models
    models = {
        "claude": ClaudeModel(),
        "gpt": GPTModel(),
        "mistral": MistralModel(),
    }

    # Define scorers
    scorers = [llm_judge_scorer]

    # Run evaluations
    results = {}
    for model_name, model in models.items():
        print(f"\nEvaluating {model_name}...")
        evaluation = Evaluation(
            dataset=eval_dataset,
            scorers=scorers,
            name=model_name + " Evaluation"
        )
        results[model_name] = await evaluation.evaluate(model)

    # Print and save results
    print("\nEvaluation Results:")
    for model_name, result in results.items():
        print(f"\n{model_name} Results:")
        print(json.dumps(result, indent=2))

    output_file = "image_model_evaluation_results.json"
    with open(output_file, "w") as f:
        json.dump(results, f, indent=2)
    print(f"\nResults saved to {output_file}")

    return results

if __name__ == "__main__":
    asyncio.run(run_evaluations())


The evaluation framework tasks each model with generating predictions for the prepared dataset. These predictions are tracked and logged using Weave, which captures both the inputs (the questions and the corresponding images) and the outputs. By logging every prediction, the framework provides a detailed record of how each model performs, enabling in-depth analysis of specific cases where a model might excel or falter. The logged data also makes direct comparisons easy, since the exact inputs and outputs for each model can be traced side by side.
A notable component of this evaluation is the scoring mechanism: an LLM judge scorer. This scorer is an LLM ensemble that evaluates the correctness of model responses using additional language models as judges. Rather than relying solely on basic string-matching metrics, it introduces a layer of reasoning and judgment into the evaluation process. For every response generated by a model, the scorer presents the response alongside the ground truth to a panel of language model judges, including Pixtral, Claude, and GPT-4o. These judges evaluate whether the response aligns with the ground truth, considering both factual correctness and semantic relevance.
The scorer explicitly instructs the language model judges to focus on the substance of the response rather than its exact wording. This allows the evaluation to reflect the models’ understanding and interpretative capabilities, rather than penalizing them for stylistic differences.
Once the individual judgments are collected, they are aggregated into a majority decision: if most judges agree that a response is correct, it is marked as such; otherwise, it is marked incorrect. In this implementation, the aggregation itself is delegated to Claude, which is given the full set of judgments and asked to return a final binary verdict, with ties resolved as 'false'.
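For readers who prefer a deterministic aggregation step (or want to sanity-check the extra Claude call), the same rule can be expressed directly in Python. This is a sketch of an alternative, not what the scorer above actually does:
from collections import Counter

# Apply the same rule the scorer hands to Claude: majority wins, and a tie counts as false.
# The judgments dict mirrors the one built inside llm_judge_scorer above.
def majority_vote(judgments: dict) -> bool:
    counts = Counter(judgments.values())
    return counts[True] > counts[False]

print(majority_vote({"claude": True, "gpt": True, "mistral": False}))   # True
print(majority_vote({"claude": True, "gpt": False, "mistral": False}))  # False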
The results of this evaluation are stored and visualized through Weave, offering both quantitative scores and qualitative insights. By combining structured scoring with detailed logging, the framework provides a comprehensive view of model performance. It not only measures how accurate the models are but also uncovers patterns in their responses, such as recurring strengths in specific question types or consistent errors in particular visual scenarios. Through this process, Pixtral’s capabilities can be assessed in detail, and its performance compared effectively against competitors like GPT-4o and Claude in the context of multimodal tasks.
Here's a screenshot of what it looks like inside Weave Evaluations:


On this small subset of the ChartQA dataset, the performance of Pixtral Large was comparable to that of GPT-4o and Claude. All three models demonstrated strong capabilities in interpreting chart-based data and providing meaningful responses, with no significant differences in their ability to handle the tasks. These results reinforce the impression that Pixtral is a competitive player in the multimodal AI space, capable of holding its own alongside established proprietary models.
A notable distinction, however, was observed in the style and structure of the responses. Pixtral consistently produced longer and more detailed answers than GPT-4o and Claude. While GPT-4o and Claude generally opted for concise responses that directly addressed the questions with minimal elaboration, Pixtral often articulated its reasoning process step by step. This behavior resembled a "chain of thought" approach, where the model explicitly outlined intermediate steps, such as extracting specific elements from the chart, explaining its interpretation, and deriving the final result. This built-in tendency toward step-by-step reasoning may be part of what drives Pixtral's strong ChartQA numbers.
This was brought to my attention thanks to the awesome comparison view inside Weave Evaluations. Here's a screenshot of the comparisons view:


Conclusion

Although we did not evaluate the models on the full ChartQA test set, Mistral's comprehensive evaluation of Pixtral Large on the entire benchmark offers a reliable picture of the model's true capabilities, with published results indicating strong performance across a much broader range of examples. Our small subset simply confirms that Pixtral is in line with other leading models in terms of accuracy and reasoning, and it also hints that the model's chain-of-thought style of answering may contribute to its high reported performance.
Pixtral's open-source release makes these results all the more impressive. While GPT-4o and Claude are proprietary models backed by large-scale commercial resources, Pixtral's open availability allows researchers and developers to explore, fine-tune, and adapt it for a variety of use cases. Coming this close to market-leading proprietary systems while remaining openly available is a significant achievement, and it highlights Pixtral's strength and potential in the competitive AI landscape.
In sum, Pixtral’s competitive performance on this subset and its accessibility make it an excellent option for researchers and practitioners alike. Its demonstrated capabilities ensure it will remain a significant force in advancing multimodal AI applications.
