LLaVA-o1: Advancing structured reasoning in vision-language models
Discover how LLaVA-o1 tackles reasoning challenges in multimodal AI with structured problem-solving. Learn about its dataset, capabilities, and performance analysis using W&B Weave.
Created on November 25 | Last edited on December 3
The field of vision-language models (VLMs) has seen significant progress in enabling systems to understand and interpret multimodal inputs. Despite these advancements, many existing VLMs face considerable challenges when tasked with systematic, structured reasoning. While models like OpenAI's GPT-4o Vision and similar systems perform well at generating direct answers, they often fall short on complex tasks that demand logical thinking, detailed analysis, or mathematical reasoning.
These limitations can result in errors, particularly in reasoning-intensive scenarios.
In response to these challenges, researchers developed LLaVA-o1, a model explicitly trained to respond in a structured, multi-stage reasoning format in order to address the shortcomings of traditional multimodal systems. In this article, we will experiment with the model using Weave Evaluations to gain deeper insight into how LLaVA-o1 performs.

Table of contents
- How LLaVA-o1 works as a vision-language model
- The LLaVA-o1 dataset
- Stage-level beam search in LLaVA-o1
- Comparing LLaVA-o1 with its base model
- Limitations of LLaVA-o1
- Running inference with LLaVA-o1
- Analyzing LLaVA-o1 with Weave Evaluations
- Analyzing your results with Weave
- Catching bugs with Weave
- Conclusion: LLaVA is a step forward for vision-language models
How LLaVA-o1 works as a vision-language model
During training, LLaVA-o1 learns to organize its responses into four distinct stages, ensuring clarity and systematic problem-solving.
The four stages are:
- The first stage (Summary) outlines the task and sets the context for the problem, providing a high-level understanding of what needs to be solved.
- Next, in the Caption stage, the model analyzes the image, identifying relevant visual elements and offering a detailed description of the components critical to the task.
- The Reasoning stage follows, where the model conducts a step-by-step logical analysis to derive intermediate insights, ensuring the response is grounded in methodical thinking.
- Finally, in the Conclusion stage, the model synthesizes the findings from the previous stages into a coherent and precise answer, directly addressing the task or query.
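To make this concrete, here is a short, hypothetical example of what a response in this format looks like, using the stage tags from LLaVA-o1's training data (the question and the content are invented for illustration):

<SUMMARY> To answer how much revenue grew between 2022 and 2023, I will read the values for both years from the chart and compute their difference. </SUMMARY>
<CAPTION> The image shows a bar chart of annual revenue, with the 2022 bar labeled 3.1 and the 2023 bar labeled 4.2. </CAPTION>
<REASONING> The 2022 value is 3.1 and the 2023 value is 4.2. Subtracting the former from the latter gives 4.2 - 3.1 = 1.1. </REASONING>
<CONCLUSION> 1.1 </CONCLUSION>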
Here's a screenshot from the paper showing the responses from both the base model and the LLaVA-o1 model:

The LLaVA-o1 dataset
LLaVA-o1 is trained using the LLaVA-o1-100k dataset, a carefully designed multimodal dataset tailored to enhance reasoning-intensive tasks. According to the paper, the dataset was constructed by aggregating and annotating examples from a diverse set of challenging visual question-answering (VQA) benchmarks, including datasets such as MMStar, MathVista, MMVet, and others. These datasets were chosen for their focus on reasoning-driven questions that require models to interpret visual content systematically, analyze trends, and perform logical problem-solving.
To create the LLaVA-o1-100k dataset, the authors utilized GPT-4o to generate detailed, stage-by-stage annotations for each task. This structured annotation process produced reasoning chains broken into four distinct stages—Summary, Caption, Reasoning, and Conclusion—to mirror the reasoning framework used in LLaVA-o1. These annotations provide a clear and systematic problem-solving template for the model to learn during training.
Here is the prompt that the authors used to construct the dataset:
"I have an image and a question that I want you to answer. I need you to strictly follow the format with fourspecific sections: SUMMARY, CAPTION, REASONING, and CONCLUSION. It is crucial that you adhere to this structure exactlyas outlined and that the final answer in the CONCLUSION matches the standard correct answer precisely.To explain further: In SUMMARY, briefly explain what steps you'll take to solve the problem. In CAPTION,describe the contents of the image, specifically focusing on details relevant to the question.In REASONING, outline a step-by-step thought process you would use to solve the problem based on the image.In CONCLUSION, give the final answer in a direct format, and it must match the correct answer exactly.If it's a multiple choice question, the conclusion should only include the option without repeating what the option is.Here's how the format should look: <SUMMARY> [Summarize how you will approach the problem and explain the steps you willtake to reach the answer.] </SUMMARY> <CAPTION> [Provide a detailed description of the image, particularly emphasizingthe aspects related to the question.] </CAPTION> <REASONING> [Provide a chain-of-thought, logical explanation of the problem.This should outline step-by-step reasoning.] </REASONING> <CONCLUSION> [State the final answer in a clear and direct format.It must match the correct answer exactly.] </CONCLUSION> (Do not forget </CONCLUSION>!) Please apply this format meticulouslyto analyze the given image and answer the related question, ensuring that the answer matches the standard one perfectly."
The dataset combines examples from general-purpose VQA benchmarks, such as ShareGPT4V and A-OKVQA, with science-targeted datasets like ScienceQA and AI2D. Additionally, the dataset includes reasoning-heavy tasks from specialized datasets like ChartQA and CLEVR-Math, ensuring broad coverage of reasoning scenarios. By integrating tasks across diverse domains, the LLaVA-o1-100k dataset provides a robust training ground, enabling the model to handle complex, real-world multimodal reasoning challenges effectively.
This carefully curated dataset forms the foundation of LLaVA-o1’s ability to excel in systematic and structured reasoning tasks, setting it apart from models trained on traditional VQA datasets.
Stage-level beam search in LLaVA-o1
In addition to its training innovations, LLaVA-o1 introduces a novel stage-level beam search method during inference. This technique generates and evaluates multiple reasoning paths at each stage of the problem-solving process, selecting the most coherent and accurate one. While this method theoretically improves reliability and accuracy by refining responses, we chose not to use it in our evaluation.
Based on my analysis and supported by findings in the paper, the majority of LLaVA-o1’s performance gains appear to come from the LLaVA-o1-100k dataset and the structured reasoning process, rather than the stage-level beam search. For instance, the paper reports that the model achieved a 6.9% average improvement across benchmarks with its structured reasoning framework alone, while the introduction of stage-level beam search added only a marginal 1-2% improvement on reasoning-intensive tasks like MMVet and MathVista.
Furthermore, stage-level beam search increases computational overhead significantly, making it less practical for real-time applications. Given that the dataset and reasoning framework account for the majority of performance improvements, and the marginal gains from beam search do not justify the added complexity, we opted to focus our evaluation on LLaVA-o1’s structured reasoning and dataset-driven training.
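For intuition, here is a highly simplified sketch of the stage-level beam search idea: sample several candidate continuations for each reasoning stage, keep the best one, and move on to the next stage. The generate_stage and score_candidate helpers are hypothetical placeholders (in the paper, the model itself is used to judge candidates), so this illustrates the control flow rather than the authors' implementation.

STAGES = ["SUMMARY", "CAPTION", "REASONING", "CONCLUSION"]

def stage_level_beam_search(generate_stage, score_candidate, question, num_candidates=4):
    """Greedy stage-by-stage selection: sample N candidates per stage and keep the best one.

    generate_stage(question, context, stage) -> str: samples one candidate for the given stage (hypothetical)
    score_candidate(question, context, candidate) -> float: rates a candidate, e.g. via the model itself (hypothetical)
    """
    context = ""
    for stage in STAGES:
        # Sample several candidate completions for the current stage
        candidates = [generate_stage(question, context, stage) for _ in range(num_candidates)]
        # Keep the highest-scoring candidate and append it to the running context
        best = max(candidates, key=lambda c: score_candidate(question, context, c))
        context += f"<{stage}> {best} </{stage}>\n"
    return context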
Comparing LLaVA-o1 with its base model
LLaVA-o1 is built on the Llama-3.2-11B-Vision-Instruct model, which serves as its foundational base. This base model is a robust vision-language system, designed to process multimodal inputs and provide direct answers. However, while effective for general tasks, it struggles with reasoning-intensive scenarios due to its reliance on straightforward prediction techniques.
To evaluate the true impact of LLaVA-o1, the authors established a comparison not only with the original base model but also with a baseline fine-tuned model, which is trained on the same LLaVA-o1-100k dataset but without structured reasoning stages—focusing instead on direct question-and-answer responses.
This fine-tuned baseline model is necessary for fair evaluation, as it isolates the effect of the reasoning framework introduced in LLaVA-o1.
💡
While the baseline fine-tuned model benefits from the additional data, it lacks the structured reasoning process that defines LLaVA-o1. By comparison, LLaVA-o1 employs a multi-stage reasoning framework—comprising summary, caption, reasoning, and conclusion stages—that allows it to systematically process complex problems in a way the baseline model cannot.
Through evaluations on benchmarks like MMStar, MathVista, MMVet, and AI2D, LLaVA-o1 demonstrated substantial improvements over both the original base model and the fine-tuned Q/A baseline. These benchmarks reveal its superior capabilities in logical reasoning, mathematical problem-solving, and scientific analysis, driven by its ability to produce structured, interpretable outputs.
Limitations of LLaVA-o1
However, while LLaVA-o1 demonstrated strong performance on reasoning-intensive tasks, it was less effective on simpler tasks where structured reasoning may not be necessary. For instance, in general visual question-answering tasks like those found in the MMBench dataset, which involve straightforward queries, LLaVA-o1’s structured reasoning approach provided only marginal improvements and, in some cases, led to slight declines in performance.
Similarly, in the AI2D benchmark, which focuses on interpreting basic diagrams, the model underperformed compared to direct training methods. The multi-stage reasoning framework appeared to overcomplicate these simpler problems, suggesting that the additional reasoning layers may not always be advantageous for tasks that demand direct answers. These challenges underscore the importance of tailoring vision-language models to the specific complexities of the problems they are designed to address, showing that while structured reasoning excels in complex scenarios, it might hinder performance on straightforward tasks.
Running inference with LLaVA-o1
To run inference with LLaVA-o1, start by installing the necessary Python libraries (this tutorial uses Python 3.10):
pip install torch==2.4.0 transformers==4.45.0 pillow==10.4.0 datasets==3.1.0 weave==0.51.18 openai==1.55.1
I like to provide the exact versions of the packages to ensure that the code runs smoothly on your system. However, if you are reading this well after the tutorial was created, you may need to upgrade the packages to newer versions.
💡
Now, we can use the following script to run inference with LLaVA-o1:
import torch
from transformers import MllamaForConditionalGeneration, AutoProcessor
from PIL import Image
from datasets import load_dataset
import weave

weave.init("llava_single_inference")

@weave.op()
def predict(question: str, image):
    """Run inference with LLaVA on a single sample."""
    # Load model and processor inside the function
    MODEL_ID = "Xkev/Llama-3.2V-11B-cot"
    model = MllamaForConditionalGeneration.from_pretrained(
        MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
    )
    processor = AutoProcessor.from_pretrained(MODEL_ID)
    pil_image = Image.open(image) if isinstance(image, str) else image

    # Prepare inputs
    messages = [
        {"role": "user", "content": [
            {"type": "image"},
            {"type": "text", "text": question}
        ]}
    ]
    input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
    inputs = processor(
        pil_image,
        input_text,
        add_special_tokens=False,
        return_tensors="pt"
    ).to(model.device)

    # Generate output
    output = model.generate(
        **inputs,
        max_new_tokens=2048,
        return_dict_in_generate=True,
        output_scores=True
    )

    # Decode output
    output_text = processor.decode(output.sequences[0], skip_special_tokens=True)
    return {"model_output": output_text}


def create_single_sample(dataset_name: str, split: str = "test"):
    """Load a single sample from the dataset."""
    dataset = load_dataset(dataset_name, split=split, cache_dir="./cache")
    example = dataset[0]
    question = example["qa"][0]["query"]
    image = example["image"]
    pil_image = Image.open(image) if isinstance(image, str) else image
    return {
        "question": f"Based on the image, answer the following question: {question}",
        "image": pil_image
    }


if __name__ == "__main__":
    # Load a single sample and perform inference
    sample = create_single_sample("TeeA/ChartQA")
    result = predict(sample["question"], sample["image"])

    # Print the result
    print(f"Question: {sample['question']}")
    print(f"Model Output: {result['model_output']}")
In this example, we use the MllamaForConditionalGeneration class from transformers to load the LLaVA-o1 model, which is fine-tuned for reasoning tasks that require both visual and textual input.
The AutoProcessor class is used to prepare inputs for the model. It tokenizes text, processes images into compatible formats, and generates model-ready inputs, ensuring seamless integration between raw data and the model's requirements. This script also uses the datasets library, which simplifies loading and managing datasets for machine learning tasks, allowing for access to benchmarks such as ChartQA.
W&B Weave is integrated into the workflow to track and analyze the inference process. The @weave.op decorator wraps the predict function, enabling automatic logging of inputs and outputs during inference. This logging is useful for debugging, model evaluation, and deeper analysis of the model's behavior on specific tasks. After running the script, you will see a new trace inside weave:

This trace is helpful for visualizing how your models perform in production, as well as for catching bugs that might otherwise go unnoticed without a logging mechanism. In the trace, you will also see the reasoning process that LLaVA-o1 uses to generate the answer.
Analyzing LLaVA-o1 with Weave Evaluations
I wanted to explore the qualitative responses between two models, so I conducted a comprehensive evaluation comparing LLaVA-o1 and its base model, Llama-3.2-11B-Vision-Instruct, using the full MMVet dataset. The objective was to analyze how LLaVA-o1’s structured reasoning framework impacts its performance on reasoning-intensive multimodal tasks compared to the direct-response approach of the base model.
While an ideal comparison would involve a model fine-tuned on the MMVet dataset without the structured reasoning stages—focusing solely on direct Q/A pairs—such a model has not been released. Therefore, this evaluation centers on understanding performance differences between LLaVA-o1 and its base model.
It is important to note that this comparison isn’t entirely fair. The base model has not been exposed to any tasks or Q/A pairs from the MMVet dataset, giving LLaVA-o1 an advantage in leveraging both the structured reasoning framework and potential dataset-specific optimizations. Despite these limitations, this evaluation provides meaningful insights into how structured reasoning affects model performance and highlights specific areas of success or failure.
💡
For this evaluation, I used the full MMVet dataset, which assesses six core multimodal capabilities:
- recognition,
- OCR,
- knowledge reasoning,
- language generation,
- spatial reasoning, and
- mathematics.
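If you want to peek at what an MMVet example contains before running the full evaluation, here is a quick sketch (assuming the same dataset name, split, and field names used in the evaluation script later in this section):

from datasets import load_dataset

# Load the MMVet benchmark from the Hugging Face Hub
mmvet = load_dataset("lmms-lab/MMVet", split="test", cache_dir="./cache")

example = mmvet[0]
print(example["question"])  # the text query
print(example["answer"])    # the ground-truth answer
example["image"].show()     # the associated PIL image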
The evaluation was conducted using Weave Evaluations, which facilitated a side-by-side comparison of inputs, outputs, and performance metrics. Weave’s detailed comparison view proved invaluable in identifying how LLaVA-o1’s structured reasoning framework contributed to solving specific tasks and where the base model’s simpler direct-response approach led to gaps.
The evaluation tasked both models with generating predictions across the full set of MMVet’s image-question pairs, requiring them to demonstrate a variety of reasoning capabilities. Using Weave, I logged each prediction alongside the corresponding input and ground truth. The process was fully automated via the Weave Evaluations framework, requiring only the integration of the models and dataset into the pipeline.
Weave’s built-in tools ensured every input, output, and prediction was meticulously tracked. This setup allowed for both quantitative analysis and qualitative exploration of model behavior. Through Weave’s comparison view, I was able to identify cases where LLaVA-o1’s structured reasoning approach excelled and where the base model struggled due to its direct-response nature. The full MMVet dataset provided a rich, diverse testbed to evaluate how both models handled multimodal reasoning challenges.
Both LLaVA-o1 and its base model, Llama-3.2-11B-Vision-Instruct, were sourced from their respective implementations on HuggingFace, with configurations and weights specified in their papers to ensure consistency. MMVet, as a benchmark designed for complex, integrated vision-language reasoning, offered a robust framework for testing the structured reasoning of LLaVA-o1 against the base model’s simpler approach. This setup, combined with the automated evaluation capabilities of Weave, provided a fair and insightful comparison of the two models.
import os
import json
import asyncio
import logging
from datasets import load_dataset
from weave import Evaluation, Model
import weave
from PIL import Image
from io import BytesIO
import base64
import torch
from transformers import MllamaForConditionalGeneration, AutoProcessor
from typing import Dict, Any, ClassVar
from openai import OpenAI
import time

# Set up logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

weave.init("llava_model_eval")


def get_pil_image(image):
    """Convert various image formats to PIL Image"""
    try:
        if isinstance(image, str):
            return Image.open(image)
        elif isinstance(image, Image.Image):
            return image
        elif isinstance(image, bytes):
            return Image.open(BytesIO(image))
        elif isinstance(image, str) and image.startswith('data:image'):
            base64_data = image.split(',')[1] if ',' in image else image
            image_bytes = base64.b64decode(base64_data)
            return Image.open(BytesIO(image_bytes))
        else:
            raise ValueError(f"Unsupported image type: {type(image)}")
    except Exception as e:
        logger.error(f"Error processing image: {e}")
        raise


def evaluate_chart_answer(question: str, ground_truth: str, predicted_answer: str):
    """Use GPT-4 to evaluate model answers"""
    client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

    # # Quick check for exact match in prediction
    # if str(ground_truth).lower() in str(predicted_answer).lower():
    #     return {"correct": 1}

    messages = [{
        "role": "user",
        "content": f"""You are a judge evaluating answers to chart questions. Return ONLY a JSON object with no additional text.

Question: {question}

Ground Truth Answer: {ground_truth} ###

Predicted Answer: {predicted_answer}

Compare these answers and return a JSON object in this exact format:
{{"correct": <0 or 1>}}

Rules for scoring:
- Score 1 if the answer is correct (or correct within reasonable error margin)
- Score 1 if the correct answer is contained somewhere in the model response
- Score 1 if the answer is right but the formatting is not exactly the same
- Score 0 if the answer is clearly incorrect"""
    }]

    retries = 0
    max_retries = 10
    base_delay = 10

    while retries < max_retries:
        try:
            response = client.chat.completions.create(
                model="gpt-4o-2024-08-06",
                messages=messages,
                max_tokens=1024,
                temperature=0
            )
            response_text = response.choices[0].message.content.strip()
            try:
                return json.loads(response_text)
            except json.JSONDecodeError:
                import re
                json_match = re.search(r'\{.*\}', response_text, re.DOTALL)
                if json_match:
                    try:
                        return json.loads(json_match.group())
                    except:
                        logger.warning("Failed to parse JSON match")
                logger.warning("Failed to parse response as JSON")
                return {"correct": 0}
        except Exception as e:
            logger.error(f"Error during evaluation (attempt {retries + 1}): {e}")
            retries += 1
            if retries < max_retries:
                delay = base_delay * (2 ** (retries - 1))
                logger.info(f"Retrying in {delay} seconds...")
                time.sleep(delay)
            else:
                logger.error("Max retries reached")
                return {"correct": 0}


@weave.op
def chart_qa_scorer(question: str, ground_truth: str, model_output: dict) -> Dict[str, Any]:
    """Weave operation for scoring model outputs"""
    try:
        if not isinstance(model_output, dict) or 'model_output' not in model_output:
            logger.error("Invalid model output format")
            return {'score': 0}
        result = evaluate_chart_answer(
            question=question,
            ground_truth=ground_truth,
            predicted_answer=model_output['model_output']
        )
        return {'score': result.get('correct', 0)}
    except Exception as e:
        logger.error(f"Scoring error: {e}")
        return {'score': 0}


def get_model_prediction(model, question: str, image):
    """Get prediction from model for given question and image"""
    try:
        if not model.model or not model.processor:
            raise ValueError("Model or processor not initialized")
        pil_image = get_pil_image(image)
        messages = [
            {"role": "user", "content": [
                {"type": "image"},
                {"type": "text", "text": question}
            ]}
        ]
        try:
            input_text = model.processor.apply_chat_template(messages, add_generation_prompt=True)
        except Exception as e:
            logger.error(f"Error applying chat template: {e}")
            raise
        inputs = model.processor(
            pil_image,
            input_text,
            add_special_tokens=False,
            return_tensors="pt"
        ).to(model.model.device)
        input_tokens = len(inputs.input_ids[0])
        output = model.model.generate(
            **inputs,
            max_new_tokens=2048,
            return_dict_in_generate=True,
            output_scores=True
        )
        output_tokens = len(output.sequences[0]) - input_tokens
        output_text = model.processor.decode(output.sequences[0], skip_special_tokens=True)
        return {
            'model_output': str(output_text),
            'model': model.__class__.__name__,
            'usage': {
                'prompt_tokens': input_tokens,
                'completion_tokens': output_tokens,
                'total_tokens': input_tokens + output_tokens
            }
        }
    except Exception as e:
        logger.error(f"Error during prediction: {e}")
        return {
            'model_output': f"Error during prediction: {str(e)}",
            'model': model.__class__.__name__,
            'usage': {
                'prompt_tokens': 0,
                'completion_tokens': 0,
                'total_tokens': 0
            }
        }


class BaseModel(Model):
    model: ClassVar[MllamaForConditionalGeneration] = None
    processor: ClassVar[AutoProcessor] = None

    @staticmethod
    def load_model():
        logger.info("Loading BaseModel...")
        model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"
        try:
            BaseModel.model = MllamaForConditionalGeneration.from_pretrained(
                model_id,
                torch_dtype=torch.bfloat16,
                device_map="auto",
            )
            BaseModel.processor = AutoProcessor.from_pretrained(model_id)
            if not BaseModel.model or not BaseModel.processor:
                raise ValueError("Failed to load model or processor")
            logger.info("BaseModel loaded successfully")
        except Exception as e:
            logger.error(f"Error loading BaseModel: {e}")
            raise

    @staticmethod
    def clear_model():
        logger.info("Clearing BaseModel from memory...")
        BaseModel.model = None
        BaseModel.processor = None
        torch.cuda.empty_cache()

    @weave.op
    def predict(self, question: str, image):
        if not BaseModel.model or not BaseModel.processor:
            raise ValueError("BaseModel is not loaded.")
        out = get_model_prediction(BaseModel, question, image)
        try:
            if "assistant" in str(out['model_output']):
                out['model_output'] = str(out['model_output']).split("assistant")[1].strip()
        except Exception as e:
            logger.warning(f"Error splitting output on 'assistant': {e}")
        return out


class LlavaModel(Model):
    model: ClassVar[MllamaForConditionalGeneration] = None
    processor: ClassVar[AutoProcessor] = None

    @staticmethod
    def load_model():
        logger.info("Loading LlavaModel...")
        model_id = "Xkev/Llama-3.2V-11B-cot"
        try:
            LlavaModel.model = MllamaForConditionalGeneration.from_pretrained(
                model_id,
                torch_dtype=torch.bfloat16,
                device_map="auto",
            )
            LlavaModel.processor = AutoProcessor.from_pretrained(model_id)
            if not LlavaModel.model or not LlavaModel.processor:
                raise ValueError("Failed to load model or processor")
            logger.info("LlavaModel loaded successfully")
        except Exception as e:
            logger.error(f"Error loading LlavaModel: {e}")
            raise

    @staticmethod
    def clear_model():
        logger.info("Clearing LlavaModel from memory...")
        LlavaModel.model = None
        LlavaModel.processor = None
        torch.cuda.empty_cache()

    @weave.op
    def predict(self, question: str, image):
        if not LlavaModel.model or not LlavaModel.processor:
            raise ValueError("LlavaModel is not loaded.")
        out = get_model_prediction(LlavaModel, question, image)
        try:
            if "<CONCLUSION>" in str(out['model_output']):
                out['model_output'] = str(out['model_output']).split("<CONCLUSION>")[1].strip()
        except Exception as e:
            logger.warning(f"Error splitting output on '<CONCLUSION>': {e}")
        return out


def create_evaluation_dataset(dataset_name: str, split: str = "test", eval_size: int = 100):
    """Create evaluation dataset from the HuggingFace MMVet dataset."""
    try:
        # Load and shuffle dataset
        dataset = load_dataset(dataset_name, split=split, cache_dir="./cache").shuffle(seed=42)

        # Select the last `eval_size` examples
        start_idx = max(0, len(dataset) - eval_size)
        eval_data = dataset.select(range(0, len(dataset)))

        evaluation_dataset = []
        for example in eval_data:
            question = example["question"]
            ground_truth = example["answer"]
            image = example["image"]

            # Ensure image is in PIL format
            if isinstance(image, str):
                pil_image = Image.open(image)
                if pil_image.mode == "RGBA":
                    pil_image = pil_image.convert("RGB")
            elif isinstance(image, Image.Image):
                pil_image = image
            else:
                raise ValueError(f"Unexpected image type: {type(image)}")

            # Append formatted example to evaluation dataset
            evaluation_dataset.append({
                "question": "Based on the image, answer the following question: " + question,
                "ground_truth": ground_truth,
                "image": pil_image
            })

        if not evaluation_dataset:
            raise ValueError("No examples were processed from the dataset")

        logger.info(f"Created evaluation dataset with {len(evaluation_dataset)} examples")
        return evaluation_dataset
    except Exception as e:
        logger.error(f"Error creating evaluation dataset: {e}")
        raise


async def run_evaluations():
    """Run evaluations on both models"""
    try:
        eval_dataset = create_evaluation_dataset("lmms-lab/MMVet")
        results = {}

        # Evaluate BaseModel
        logger.info("\nEvaluating BaseModel...")
        base_model = BaseModel()
        base_model.load_model()
        evaluation = Evaluation(
            dataset=eval_dataset,
            scorers=[chart_qa_scorer],
            name="BaseModel Evaluation"
        )
        results["base_model"] = await evaluation.evaluate(base_model)
        base_model.clear_model()

        # Evaluate LlavaModel
        logger.info("\nEvaluating LlavaModel...")
        llava_model = LlavaModel()
        llava_model.load_model()
        evaluation = Evaluation(
            dataset=eval_dataset,
            scorers=[chart_qa_scorer],
            name="LlavaModel Evaluation"
        )
        results["llava_model"] = await evaluation.evaluate(llava_model)
        llava_model.clear_model()

        return results
    except Exception as e:
        logger.error(f"Error during evaluation: {e}")
        raise


if __name__ == "__main__":
    try:
        results = asyncio.run(run_evaluations())
        logger.info("Evaluation completed successfully")
        logger.info(f"Results: {results}")
    except Exception as e:
        logger.error(f"Failed to run evaluations: {e}")
To conduct this evaluation, I prepared the full MMVet dataset, ensuring that tasks spanning its six core multimodal capabilities—recognition, OCR, knowledge reasoning, language generation, spatial reasoning, and mathematics—were included. The evaluation framework employed custom wrappers for both LLaVA-o1 and the base model, which ensured predictions were generated and logged in a consistent format. These wrappers handled preprocessing, converting images to PIL format when needed, and tokenizing questions to maintain compatibility with the models.
A custom scoring mechanism was developed using GPT-4o, which evaluated the correctness of model outputs against ground truth answers. This scorer implemented a structured set of rules to handle edge cases, such as partially correct answers, formatting inconsistencies, and answers within an acceptable margin of error. GPT-4o’s evaluation returned a JSON object for each prediction, indicating whether the response was correct or incorrect. This automated scoring pipeline not only ensured consistency but also accounted for nuanced reasoning and logical coherence in model outputs.
The results were logged and visualized in Weave, a tool that facilitated side-by-side comparisons of inputs, predictions, and scores. Weave’s powerful comparison views allowed for a detailed qualitative and quantitative analysis of model performance, highlighting areas where LLaVA-o1’s structured reasoning framework excelled and where the base model’s direct-response approach was either effective or lacking. This comprehensive setup provides insights into the interplay between structured reasoning and computational efficiency, offering a nuanced perspective on model behavior across diverse multimodal challenges.
Analyzing your results with Weave
By leveraging Weave Evaluations, I gained a clear understanding of where and why each model succeeded or struggled, offering a nuanced view of how structured reasoning impacts model performance. For this experiment, the LLaVA-o1 model performed slightly better than the base model, which did not include any fine-tuning on the LLaVA-o1-100k dataset. Overall, I think this technique could potentially boost the performance of your models; however, I highly recommend a thorough evaluation to confirm that performance is truly better than a baseline version of the model. Here is a screenshot of the Weave dashboard, where I can view the results for both models:


Weave Evaluations lets you dive into the exact responses given by each model and quickly compare them to the ground truth. This allows you to analyze not only the correctness of the answers but also the reasoning pathways the models take to arrive at them. For example, you can identify patterns in LLaVA-o1's structured reasoning that lead to correct responses in complex multimodal scenarios, such as tasks involving spatial understanding or mathematical reasoning, while also observing where this framework might overcomplicate simpler tasks.
Weave's detailed visualization tools make it easy to spot discrepancies between the predicted answers and the ground truth, enabling you to pinpoint failure cases. You can assess whether errors stem from a lack of contextual understanding, limitations in reasoning depth, or even issues with the model's ability to parse the image or text inputs correctly. Additionally, by exploring token usage and latency metrics, you can evaluate the trade-offs between accuracy and computational efficiency for both models.
Catching bugs with Weave
While evaluating the performance of LLaVA-o1 and its base model, I used a large language model (LLM) as the judge to score the predictions. However, during the analysis in Weave, I noticed an issue where the base model’s correct predictions were being mislabeled as incorrect. This became apparent when the base model correctly identified the answer, but the scorer flagged it as wrong, as seen in the visualization below:

The problem stemmed from the scoring logic of the LLM judge. Instead of accurately recognizing correct predictions, it was overly sensitive to minor differences in phrasing or formatting that didn’t affect the validity of the answer. For example, numerical answers like "2004" might have been judged incorrectly if the model included additional context or phrased the response differently, despite being technically correct.
Weave's side-by-side comparison of inputs, outputs, and scoring metrics made this issue obvious. Noticing this kind of error—where the judge mislabels a correct prediction—is like finding a "needle in a haystack" when working with hundreds of examples. It could have easily gone unnoticed in a standard evaluation pipeline. However, Weave acted like a magnifying glass, allowing me to clearly inspect the predictions, the ground truth, and the judge's scores in a single interface. By surfacing these discrepancies visually, Weave made it immediately apparent that the problem lay with the evaluation mechanism rather than the model’s outputs. This clarity was instrumental in identifying and resolving the issue, ensuring the evaluation process became both accurate and fair.
To resolve the issue, I refined the evaluation logic by adjusting the judge's prompt to accept answers that were phrased or formatted differently but still technically correct. This straightforward adjustment significantly improved the overall accuracy of the scoring process. By combining the refined scoring logic with Weave's visualization capabilities, I was able to improve the evaluation pipeline and ensure fair and reliable benchmarking of the models.
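If you want to make the judge even less sensitive to phrasing, one low-tech safeguard is to short-circuit with a simple containment check before deferring to the LLM judge. The sketch below is a hypothetical helper built around the evaluate_chart_answer function from the evaluation script, not the exact change used here:

def score_with_precheck(question: str, ground_truth: str, predicted_answer: str) -> dict:
    """Accept obvious matches without an LLM call; otherwise fall back to the GPT-4o judge."""
    # If the ground-truth string appears verbatim in the prediction, count it as correct
    if str(ground_truth).strip().lower() in str(predicted_answer).lower():
        return {"correct": 1}
    # Otherwise defer to the LLM judge (evaluate_chart_answer) defined in the evaluation script
    return evaluate_chart_answer(question, ground_truth, predicted_answer)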
Conclusion: LLaVA is a step forward for vision-language models
The development of LLaVA-o1 is another step along the long path toward truly useful vision-language models. By explicitly incorporating structured reasoning into its design, LLaVA-o1 addresses key limitations of traditional models, bridging the gap between simple question-answering tasks and complex, reasoning-intensive challenges. Its ability to systematically decompose problems into distinct stages demonstrates how structured thought processes can enhance AI systems’ capabilities in logic, mathematics, and scientific analysis.
LLaVA-o1’s structured reasoning framework is an interesting approach, but its limitations on simpler tasks remind us that no single approach is universally optimal. By leveraging insights from experiments like this, researchers can refine their approaches and push the boundaries of what is possible in AI-driven reasoning.