
Running inference and evaluating Llama 4 in Python

Deploy Llama 4 locally or via API with Python scripts. We test multimodal performance against GPT-4o on ChartQA and show how to debug and compare results using Weave.
Created on April 7 | Last edited on April 7
Llama 4 is Meta’s latest open-weight model and a major upgrade in both scale and capability. It introduces long context windows—up to 10 million tokens—so it can handle massive inputs like books, transcripts, research papers, or full codebases without dropping earlier information. That enables more complex reasoning, deeper analysis, and extended interactions that weren’t possible with shorter contexts. It also supports multimodal input, letting it process images and text together when needed. Llama 4 isn’t just faster or more powerful—it’s built to handle real workloads with fewer compromises.
In this article, we’ll break down what makes Llama 4 different, how its Scout and Maverick variants compare to other top models, and what kind of performance you can expect in real-world workloads, including testing Llama 4's multimodal reasoning on ChartQA using Weave.
If you want to dive straight into using Llama 4, the link below will take you right to the code:
Jump to the Llama 4 tutorials


If you want to know what you'll be working with, read on.


Key advancements and features of Llama 4

Llama 4 introduces major architectural and performance improvements that move it well beyond previous Llama releases and many closed models. It supports context windows up to 10 million tokens, processes text and images natively, and was designed from the ground up for efficient deployment and strong real-world performance.

Model variants

There are currently three core Llama 4 models: Scout, Maverick, and Behemoth (Behemoth is yet to be released).
Scout uses 17B active parameters with 16 experts and is built for efficiency. It runs on a single H100 GPU with Int4 quantization and delivers top-tier results for its size. It beats models like Mistral 3.1, Gemini 2.0 Flash-Lite, and Gemma 3 across reasoning, coding, and image benchmarks.
Maverick also uses 17B active parameters but with 128 experts and 400B total parameters. It outperforms GPT-4o and Gemini 2.0 Flash on coding, multilingual, vision-language, and reasoning tasks. On LMArena, its experimental chat variant scored 1417 ELO. It also rivals DeepSeek v3.1 in quality—despite using far fewer active parameters.
A diagram of Llama's MoE architecture
Both models were distilled from Llama 4 Behemoth, a 288B active parameter, 2T total parameter model. Behemoth leads STEM-heavy benchmarks, achieving 95.0 accuracy on MATH-500, 73.7 on GPQA Diamond, and 82.2 on MMLU Pro—beating GPT-4.5, Claude 3.7, and Gemini 2.0 Pro.
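To make the "active parameters vs. experts" distinction concrete, here is a minimal, illustrative top-1 routing layer in PyTorch. The dimensions and routing scheme are hypothetical and far simpler than Llama 4's actual MoE implementation, but they show why only a fraction of the total parameters is active for any given token:

import torch
import torch.nn as nn

class TinyMoELayer(nn.Module):
    """Illustrative mixture-of-experts layer: a router picks one expert per token,
    so only that expert's parameters (not all of them) are active for the token."""
    def __init__(self, d_model=64, d_ff=256, num_experts=16):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):  # x: (num_tokens, d_model)
        weights = self.router(x).softmax(dim=-1)   # routing probabilities per token
        top_w, top_idx = weights.max(dim=-1)       # top-1 expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = top_idx == i
            if mask.any():
                out[mask] = top_w[mask].unsqueeze(-1) * expert(x[mask])
        return out

layer = TinyMoELayer()
tokens = torch.randn(8, 64)
print(layer(tokens).shape)  # torch.Size([8, 64])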


Quantization

Llama 4 is built to run efficiently under quantization. Scout supports Int4 quantization and can run on a single GPU without noticeable quality loss. Maverick supports both single-host and distributed inference and benefits from mixed precision, including Int4 and bfloat16, for reduced latency and better throughput.
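As a rough sketch of what single-GPU quantized loading could look like with Hugging Face Transformers, the snippet below uses 4-bit quantization via bitsandbytes. This is one common route rather than Meta's official Int4 recipe, and the actual memory footprint and quality will depend on your hardware and the released checkpoints:

import torch
from transformers import Llama4ForConditionalGeneration, BitsAndBytesConfig

model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"

# 4-bit NF4 quantization: weights are stored in 4 bits, compute runs in bfloat16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = Llama4ForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # spread layers across available GPUs if one isn't enough
)

Generation then proceeds exactly as in the inference script later in this article.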

Training methods and data


Llama 4 was trained on over 30 trillion tokens across 200+ languages and multiple modalities, including image and video. It uses a mixture-of-experts architecture with both dense and MoE layers, activating only a few experts per token to reduce compute. Post-training used lightweight SFT, curriculum-tuned online RL, and adaptive DPO to boost reasoning, code quality, and multimodal performance without hurting general capabilities. The model handled up to 48 images during pretraining and was tested with up to eight afterward.
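Meta hasn't published the details of its adaptive DPO variant, but for orientation, here is the standard DPO objective it builds on, as a minimal PyTorch sketch operating on sequence-level log-probabilities:

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO: push the policy to prefer chosen over rejected responses,
    regularized toward the reference model by the temperature beta."""
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Toy batch of two preference pairs (sequence-level log-probabilities)
pc = torch.tensor([-12.3, -8.1]); pr = torch.tensor([-14.0, -9.5])
rc = torch.tensor([-12.0, -8.0]); rr = torch.tensor([-13.0, -9.0])
print(dpo_loss(pc, pr, rc, rr))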

Model comparisons

Llama 4 Maverick is competitive with or ahead of previous open models in reasoning, coding, vision, and long-context tasks.
Maverick outperforms GPT-4o and Gemini 2.0 Flash, gets close to DeepSeek v3.1 on harder tasks, and is cheaper to run, making it one of the most efficient high performers available. Behemoth, not yet released, already tops Claude 3.7, GPT-4.5, and Gemini 2.0 Pro on STEM benchmarks and is the base model from which both Scout and Maverick were distilled.
Llama 4 isn't just stronger; it's built to scale, run cheaper, and handle long-context and multimodal work.

Enhancing performance with native multimodality and long context windows

Llama 4's integration of native multimodality and exceptionally long context windows significantly enhances its practical application performance. Using an early fusion approach, the model directly integrates text and image tokens at the processing start, creating a shared attention space for immediate alignment between visual and textual elements. This makes Llama 4 particularly effective for tasks like interpreting diagrams, understanding visual instructions, and analyzing screenshots.
Moreover, Llama 4 was trained on a diverse range of image and video frame stills to provide broad visual understanding, including temporal activities and related images. This training enables seamless interaction with multi-image inputs alongside text prompts for comprehensive visual reasoning and understanding tasks. Pre-training included up to 48 images, with successful post-training tests involving up to eight images.
The model's 10 million token context window further expands these capabilities, allowing it to manage extensive multimodal inputs without RAG or external memory aids. This enables the processing of lengthy documents with integrated visual data, maintaining coherence and detailed cross-references throughout extended interactions. Together, these features enhance usability and unlock more advanced, intuitive AI applications.
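To illustrate what interleaved multimodal input looks like in practice, here is the OpenAI-compatible chat format (the same format used in the evaluation code later in this article) with multiple images and text in a single user message; the URLs are placeholders:

# Hypothetical multi-image request; the image URLs are placeholders.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Compare the trends in these two charts."},
            {"type": "image_url", "image_url": {"url": "https://example.com/chart_q1.png"}},
            {"type": "image_url", "image_url": {"url": "https://example.com/chart_q2.png"}},
            {"type": "text", "text": "Which one shows faster growth?"},
        ],
    }
]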

Cost implications and performance benchmarks of Llama 4 Maverick

Llama 4 Maverick offers a cost-effective solution with superior performance benchmarks in its class. Its efficient resource utilization and competitive pricing make it an attractive option for businesses seeking advanced AI capabilities. Compared to models like Gemini 2.0 Flash, DeepSeek v3.1, and GPT-4o, Llama 4 Maverick stands out due to its balance between performance and cost. I’ll share some of the benchmarks below:


For Llama 4 Maverick, Meta estimates an inference cost of $0.19 per million tokens for distributed inference and $0.30–$0.49 per million tokens when running on a single host (assuming a 3:1 ratio of input to output tokens). In comparison, Gemini 2.0 Flash is slightly cheaper at $0.17 per million tokens, and DeepSeek v3.1 costs $0.48 per million tokens. GPT-4o is considerably more expensive at $4.38 per million tokens (e.g., 750,000 input tokens and 250,000 output tokens).
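To show where blended per-token figures like the $4.38 number come from, here is the arithmetic behind the 3:1 input-to-output assumption; the GPT-4o list prices below are assumed for illustration:

def blended_cost_per_million(input_price, output_price, input_ratio=0.75):
    """Blended $/1M tokens assuming a 3:1 ratio of input to output tokens."""
    return input_ratio * input_price + (1 - input_ratio) * output_price

# Assumed GPT-4o list prices: $2.50 per 1M input tokens, $10.00 per 1M output tokens.
print(blended_cost_per_million(2.50, 10.00))  # 4.375 -> roughly $4.38 per 1M tokens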
While Maverick delivers high performance with a 400B total parameter count, Llama 4 Scout uses the same 17 billion active parameters but significantly fewer total parameters at 109 billion. This reduced scale means Scout's pricing will likely be even lower due to decreased memory requirements and easier deployment with less infrastructure complexity. Consequently, Scout can offer competitive pricing advantages over models with larger total parameter counts, making it even more appealing for cost-sensitive applications while still providing robust performance.
Overall, Llama 4 Maverick’s balance of high-performance capabilities, large context window, and cost-effectiveness positions it favorably against competing models, offering a robust option for technologically advanced enterprises seeking to maximize AI utility within budgetary constraints. The inclusion of Scout further broadens pricing flexibility, catering to diverse budget preferences and application needs.

Running inference with Llama 4

To start, I will write a simple script that checks whether an OpenRouter API key is available. If it is, the script will use OpenRouter to query the Meta Llama 4 Scout model. If not, it will fall back to using the HuggingFace Transformers version of the same model.
This setup lets us flexibly switch between remote API inference and local model execution depending on what's available.
import os
import torch
from openai import OpenAI
from transformers import AutoTokenizer, Llama4ForConditionalGeneration, FbgemmFp8Config
import weave; weave.init("llama4")

OPENROUTER_API_KEY = ""
USE_OPENROUTER = bool(OPENROUTER_API_KEY)

def infer_with_openrouter(prompt):
    client = OpenAI(
        base_url="https://openrouter.ai/api/v1",
        api_key=OPENROUTER_API_KEY
    )
    messages = [{"role": "user", "content": prompt}]
    response = client.chat.completions.create(
        model="meta-llama/llama-4-scout",
        messages=messages,
        max_tokens=300
    )
    return response.choices[0].message.content

def infer_with_huggingface(prompt):
    model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    if not torch.cuda.is_available():
        print("WARNING: CUDA not available. Inference will be extremely slow or may crash.")

    model = Llama4ForConditionalGeneration.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,
        quantization_config=FbgemmFp8Config(),
        device_map="auto" if torch.cuda.is_available() else None
    )

    messages = [{"role": "user", "content": prompt}]
    # apply_chat_template with return_tensors="pt" returns a tensor of input IDs
    inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
    outputs = model.generate(inputs, max_new_tokens=300)
    # Decode only the newly generated tokens, skipping the prompt
    return tokenizer.decode(outputs[0, inputs.shape[-1]:], skip_special_tokens=True)

@weave.op
def infer(prompt):
    if USE_OPENROUTER:
        print("[ using OpenRouter ]")
        return infer_with_openrouter(prompt)
    else:
        print("[ using HuggingFace ]")
        return infer_with_huggingface(prompt)

# example
if __name__ == "__main__":
    output = infer("What is the capital of France?")
    print("\n" + output)


At the time of writing, OpenRouter was the most convenient option for accessing Llama 4; I expect the model will soon be easily accessible through multiple cloud providers. The HuggingFace path gives full control, assuming you have the hardware to handle it.
I tested local inference using 8-bit quantization on a single H100, but it still ran out of memory. Realistically, running the model locally without constant crashes will require 3–4 H100s or an equivalent amount of VRAM. I used Weave to track inputs and outputs for each inference call, which makes debugging and evaluation much easier as you scale up experiments or run comparisons between models.

Evaluating Llama 4 on ChartQA

I wanted to compare the performance of Llama 4 models (Scout and Maverick) against GPT-4o on multimodal tasks, so I conducted a comprehensive evaluation using the ChartQA dataset. The objective was to analyze how Llama 4’s advancements, particularly its long context windows and native multimodality, translate into improved reasoning and answer generation for chart-based question-answering tasks.
While the ideal evaluation would involve testing the models on the full ChartQA dataset to fully capture their performance across all possible scenarios, this experiment uses a subset of 100 samples from the dataset. This selection provides a balanced and manageable evaluation set, enabling insights into the models' generalization abilities and performance on multimodal reasoning tasks involving charts and visual data without requiring exhaustive testing on the entire dataset.
The evaluation was conducted using Weave Evaluations, which facilitated side-by-side comparisons of inputs, outputs, and performance metrics for each model. Weave’s detailed comparison view was critical in analyzing how Llama 4’s architectural improvements, such as long context windows and its interleaved multimodal reasoning, contributed to solving tasks designed around charts, graphs, and data-heavy questions. GPT-4o, Scout, and Maverick were tested on identical inputs, ensuring the evaluations remained consistent.
The Llama 4 Scout and Llama 4 Maverick models were accessed via OpenRouter, while GPT-4o was accessed through the OpenAI API. The ChartQA dataset, designed specifically for evaluating reasoning over chart-based data, provided a robust framework to test the multimodal capabilities of these models and compare their performance.
Here's the code for the evaluation:
import base64
import random
import asyncio
import os
from datasets import load_dataset
from io import BytesIO
from PIL import Image
from openai import OpenAI
import weave
from weave import Evaluation, Model
import json
from litellm import completion
import time

weave.init("meta_llama_eval")

OPENAI_API_KEY = "yourapikey"  # Replace with your actual OpenAI API key
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY  # litellm reads the key from the environment
meta_client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="your_openrouter_apikey",
)


SEED = 3
random.seed(SEED)

# Function to perform inference using litellm for the scorer
def run_inference_openai(prompt, model_id="gpt-4o-2024-08-06"):
    try:
        response = completion(
            model=model_id,
            temperature=0.0,
            messages=[
                {"role": "user", "content": prompt}
            ]
        )
        # Extract content from litellm response
        if response and hasattr(response, 'choices') and len(response.choices) > 0:
            content = response.choices[0].message.content
            return content
        else:
            print("No content found in response")
            return None
    except Exception as e:
        print(f"Failed to get response: {e}")
        return None

# Function to encode PIL image to base64
def encode_pil_image_to_base64(pil_image):
    buffered = BytesIO()
    pil_image.save(buffered, format="PNG")
    return base64.b64encode(buffered.getvalue()).decode("utf-8")

@weave.op
async def gpt4o_scorer(expected: str, model_output: str) -> dict:
    """Score the model's output by comparing it with the ground truth."""
    query = f"""
YOU ARE AN LLM JUDGE DETERMINING IF THE FOLLOWING MODEL-GENERATED ANSWER IS THE SAME AS THE CORRECT ANSWER FOR THE CHART-QA DATASET.
It's ok if the predicted/ground truth answers aren't formatted the same or are worded slightly differently.

Model's Answer: {str(model_output)}
Correct Answer: {expected}
Your task:
1. State the model's predicted answer (answer only).
2. State the ground truth (answer only).
3. Determine if the model's final answer is correct (ignore formatting differences, etc.). RESPOND with the predicted and ground truth answer, followed with a JSON object containing the correctness encapsulated within the following delimiters:
```json
{{ "correctness": true/false }}
```
"""
    # Perform inference using litellm
    response = run_inference_openai(query, "gpt-4o-2024-08-06")
    if response is None:
        return {"correctness": False, "reasoning": "Inference failed."}
    try:
        # Extract correctness JSON object from the response
        json_start = response.index("```json") + 7
        json_end = response.index("```", json_start)
        correctness = json.loads(response[json_start:json_end].strip()).get("correctness", False)
    except (ValueError, IndexError):
        correctness = False

    return {"correctness": correctness, "reasoning": response}

# Load dataset and preprocess
def load_ds():
    dataset = load_dataset("TeeA/ChartQA", split="test", cache_dir="./cache")
    dataset = dataset.shuffle(seed=SEED)
    eval_data = dataset.select(range(len(dataset) - 100, len(dataset)))

    def sample_image_and_query(example):
        image = example["image"]
        img = Image.open(image) if isinstance(image, str) else image
        return {
            "question": example["qa"][0]["query"],
            "image": img,  # Pass the PIL image directly
            "expected": example["qa"][0]["label"]
        }

    return [sample_image_and_query(example) for example in eval_data]

# Scoring function
@weave.op()
def substring_match(expected: str, model_output: dict) -> dict:
    match = expected.lower() in model_output['output'].lower()
    return {"substring_match": match}

# Model classes - converting PIL images to base64 inside the predict methods
class MetaLLaMA4Scout(Model):
    @weave.op()
    def predict(self, question: str, image: Image.Image):
        # Convert PIL image to base64 inside the predict method
        image_base64 = f"data:image/png;base64,{encode_pil_image_to_base64(image)}"
        time.sleep(2)  # delay in seconds to avoid rate limits

        messages = [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {"type": "image_url", "image_url": {"url": image_base64}}
                ]
            }
        ]
        response = meta_client.chat.completions.create(
            model="meta-llama/llama-4-scout",
            messages=messages,
            max_tokens=300,
            temperature=0.0
        )
        return {"output": response.choices[0].message.content}

class MetaLLaMA4Maverick(Model):
    @weave.op()
    def predict(self, question: str, image: Image.Image):
        # Convert PIL image to base64 inside the predict method
        image_base64 = f"data:image/png;base64,{encode_pil_image_to_base64(image)}"
        time.sleep(2)  # delay in seconds to avoid rate limits

        messages = [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {"type": "image_url", "image_url": {"url": image_base64}}
                ]
            }
        ]
        response = meta_client.chat.completions.create(
            model="meta-llama/llama-4-maverick:free",
            messages=messages,
            max_tokens=300,
            temperature=0.0
        )
        return {"output": response.choices[0].message.content}

class GPT4oModel(Model):
    @weave.op()
    def predict(self, question: str, image: Image.Image):
        # Convert PIL image to base64 inside the predict method
        image_base64 = f"data:image/png;base64,{encode_pil_image_to_base64(image)}"
        time.sleep(2)
        messages = [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {"type": "image_url", "image_url": {"url": image_base64}}
                ]
            }
        ]
        try:
            response = completion(
                model="openai/gpt-4o-2024-08-06",
                messages=messages,
                temperature=0
            )
            return {"output": response.choices[0].message.content}
        except Exception as e:
            return {"output": f"Error: {e}"}

# Main eval loop
async def run_evaluations():
    print("Loading dataset...")
    dataset = load_ds()
    print(f"Loaded {len(dataset)} problems")

    print("Initializing models...")
    models = {
        "llama-4-scout": MetaLLaMA4Scout(),
        "llama-4-maverick": MetaLLaMA4Maverick(),
        "gpt-4o": GPT4oModel()
    }

    print("Preparing dataset for evaluation...")
    dataset_prepared = dataset

    print("Running evaluations...")
    scorers = [gpt4o_scorer]

    for model_name, model in models.items():
        print(f"\n\n=== EVALUATING {model_name.upper()} ===")
        evaluation = Evaluation(
            dataset=dataset_prepared,
            scorers=scorers,
            name=f"{model_name} Evaluation"
        )
        results = await evaluation.evaluate(model)
        print(f"Results for {model_name}: {results}")

if __name__ == "__main__":
    asyncio.run(run_evaluations())
The code evaluates three multimodal models (Llama 4 Scout, Llama 4 Maverick, and GPT-4o) on a subset of 100 ChartQA samples. We first load and preprocess the evaluation data using Hugging Face Datasets, extracting chart images, questions, and expected answers. Each model class defines its own predict method that encodes images to base64 and constructs multimodal prompts. The Llama models are accessed via OpenRouter's API, while GPT-4o uses OpenAI's API through LiteLLM; each call includes a 2-second delay to avoid rate limiting. A GPT-4o-based scorer evaluates model answers against the ground truth, judging correctness on semantic agreement rather than formatting. The evaluation pipeline is automated through Weave, which systematically feeds the dataset samples to each model, captures outputs, and scores predictions. The resulting accuracy and latency metrics help identify the relative strengths and weaknesses of these models on chart-based multimodal reasoning tasks.
By leveraging Weave Evaluations, a detailed side-by-side analysis of each model’s strengths and weaknesses became possible. Each response logged by Weave provided insight into how models approached reasoning tasks, particularly multimodal ones. This allowed me to examine not only the correctness of their outputs but also the reasoning pathways they followed to arrive at their final answers.
In this evaluation, Llama 4 Maverick achieved the highest accuracy (correctness score: 0.85), outperforming GPT-4o and Llama 4 Scout, both of which earned a correctness score of 0.77. Maverick’s improvement in accuracy likely stems from its advanced multimodal reasoning capabilities and longer context window, enabling better performance on complex chart-based question-answering tasks.


The Weave comparison view

Weave's comparison view is particularly valuable for visualizing differences in reasoning across multiple multimodal models. By displaying each model's outputs side by side, it lets you immediately identify discrepancies in correctness, logical consistency, and handling of visual inputs such as charts or graphs. Through this interface, you can quickly pinpoint why certain models fail while others succeed.

Such insights make it easier to analyze and optimize multimodal reasoning performance effectively. By highlighting not only final predictions, but also underlying steps and reasoning paths, Weave’s comparison view simplifies evaluating and improving each model's multimodal reasoning capabilities.

Conclusion

In conclusion, Llama 4 represents a leap in open-weight foundation models—not just in scale and architecture, but in practical performance across a wide range of demanding tasks. From its ability to process up to 10 million tokens—allowing for coherent reasoning over entire books, codebases, and multimodal documents—to its strong multimodal capabilities via native image-text integration, Llama 4 demonstrates a clear focus on robust, real-world applicability rather than benchmark chasing alone.
The Scout and Maverick variants show how this architecture can be deployed flexibly: Scout delivers exceptional efficiency and competitive accuracy even on a single GPU setup, while Maverick pushes the bounds in terms of reasoning depth, language breadth, and image comprehension—rivaling and even surpassing closed models like GPT-4o on specific tasks. Cost-effectiveness is another distinguishing strength, with both variants offering a range of deployment options at a fraction of the inference costs of closed alternatives.
Empirical evaluation further reinforces Llama 4’s performance edge. In a controlled multimodal test using ChartQA and Weave Evaluations, Maverick outperformed both Scout and GPT-4o in correctness on visual question-answering tasks—demonstrating not just more accurate outputs but also more coherent reasoning traces. This level of granular insight into how and why a model reaches its conclusions reflects the kind of interpretability and transparency that matter in real-world deployment scenarios.
Ultimately, Llama 4 delivers more than just another high-performance LLM; it brings together architectural innovation, long-context stability, native multimodality, and accessibility in one unified framework. For researchers, developers, and enterprises alike, it offers a scalable, open foundation for building dependable and intelligent AI solutions—now with fewer trade-offs than ever before.









Iterate on AI agents and models faster. Try Weights & Biases today.