
Tutorial: Running inference with Llama 4 Scout using W&B Inference

Getting set up and running Llama 4 Scout, Meta's advanced long-context language model, in Python using W&B Inference.
Running inference with Llama 4 Scout through W&B Inference, powered by CoreWeave, is surprisingly straightforward. In this tutorial, we'll get you running inference quickly and then explore the model's advanced capabilities in depth.
Whether you're running inference on massive documents, building multilingual applications, or tackling complex reasoning tasks, this guide provides everything you need to run inference with Llama 4 Scout effectively.


What is Llama 4 Scout?

Llama 4 Scout is a large language model developed by Meta, designed for multimodal reasoning and processing of massive context. What sets it apart:
📖 Extended context window: Handles up to 10 million tokens (limited to 64k tokens using W&B Inference), making it possible to process entire books, long codebases, or large multimodal datasets in one run.
🌐 Multilingual support: Trained on 200+ languages with strong results in 12 core ones such as English, Arabic, French, German, Hindi, and Spanish, enabling use in global and cross-cultural contexts.
🖼️ Multimodal capability: Natively understands both text and images, excelling at visual reasoning, chart interpretation, and document analysis.
⚙️ Efficient architecture: Built on a mixture-of-experts system with 17B active parameters out of 109B total, delivering high performance while reducing compute requirements compared to dense models.
🧠 Reasoning and knowledge: Scores strongly on benchmarks like MMLU Pro and GPQA Diamond, making it effective for complex analysis, advanced reasoning, and assistant-style tasks.
For detailed technical specifications and performance benchmarks, visit the Llama 4 model documentation.

W&B Weave

W&B Weave simplifies the process of tracking and analyzing model outputs in your project. To get started with Weave, you'll first import it and initialize it with your project name.
One of its standout features is the @weave.op decorator. In Python, a decorator is a powerful tool that extends the behavior of a function. By placing @weave.op above any function in your code, you're telling Weave to log that function's inputs and outputs automatically. This makes it incredibly easy to keep track of what data goes in and what comes out.
Once your code runs, these logs appear in the Weave dashboard, where you'll find detailed visualizations and traces of the function calls. This not only aids in debugging but also helps structure your experimental data, making it much easier to develop and fine-tune models like Llama 4 Scout.
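As a quick illustration, here's a minimal sketch of the pattern. The project name and the summarize function are just placeholders for this example, not part of the tutorial code that follows:

import weave

weave.init("wandb_inference")  # replace with your project name

@weave.op()
def summarize(text: str) -> str:
    # Weave automatically logs this call's input text and returned string
    return text.split(".")[0] + "."

print(summarize("Weave records this call. Its inputs and outputs appear as a trace in the dashboard."))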

Tutorial: Running inference with Llama 4 Scout using W&B Inference

Let's jump right in. This tutorial assumes you're working in a Jupyter notebook (as you'll notice in some of the screenshots), but the code will, of course, work in other environments as well.
We will be running inference with the meta-llama/Llama-4-Scout-17B-16E-Instruct model specifically.
If you're not familiar with Jupyter Notebooks, you can get set up in about 5 minutes. I walk you through it in this tutorial.

Prerequisites

Before starting, ensure you have:
  • A Weights & Biases account (you can sign up free here)
  • Python 3.7 or higher installed
  • Basic familiarity with Python and API usage
  • Understanding of your use case requirements (document analysis, code review, multilingual tasks, etc.)

Step 1: Installation & setup

1. Install required packages

To get started running inference with Llama 4 Scout, all you need to install are the openai, wandb, and weave packages. Along the way, we'll use W&B Weave to simplify reviewing multiple outputs, which makes the process much more efficient.
The code to do this is:
pip install openai wandb weave
Run this in your terminal or in a Jupyter cell.
When you execute the cell, you'll notice an asterisk ([*]) appear between the brackets [ ]. This indicates that the cell is running, and you'll need to wait until the asterisk turns into a number before proceeding.

2. Get your W&B API key

  1. Visit https://wandb.ai/authorize and copy your API key
  2. Keep it handy for the next step
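Optionally, you can check that the key works before moving on. A minimal sketch, assuming you've installed the wandb package from Step 1:

import wandb

# Prompts for your API key if one isn't configured yet; returns True on success
wandb.login()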

Step 2: Environment configuration

Setting up your environment variables is crucial for secure and seamless operation. You'll need your W&B API key from the previous step.

Option 1: In a Jupyter Notebook

# Set environment variables in your notebook
%env WANDB_API_KEY=your-wandb-api-key-here

Option 2: In Terminal/Shell

export WANDB_API_KEY="your-wandb-api-key-here"

Option 3: In Python script

import os
# Set environment variables programmatically
os.environ["WANDB_API_KEY"] = "your-wandb-api-key-here"

Step 3: Running basic inference with Llama 4 Scout

Hopefully, this hasn't been too painful because now we're at the fun part.
Here's a complete example to start running inference with Llama 4 Scout:
import os
import openai
import weave

PROJECT = "wandb_inference"
weave.init(PROJECT)

client = openai.OpenAI(
    base_url="https://api.inference.wandb.ai/v1",
    api_key=os.getenv("WANDB_API_KEY"),
    project=PROJECT,
    default_headers={
        "OpenAI-Project": "wandb_fc/quickstart_playground"  # replace with your actual team/project
    }
)

resp = client.chat.completions.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant specialized in clear explanations."},
        {"role": "user", "content": "Explain quantum computing in simple terms."}
    ],
    temperature=0.7,
    max_tokens=1000,
)

print(resp.choices[0].message.content)
You'll find the inputs and outputs recorded to your Weave dashboard with the parameters automatically included:


Step 4: Advanced Llama 4 Scout inference configuration

Understanding inference parameters

Adjust Llama 4 Scout's response behavior with these key inference parameters (feel free to play around with them and compare the outputs in Weave!).
import os
import openai
import weave

PROJECT = "wandb_inference"
weave.init(PROJECT)

client = openai.OpenAI(
    base_url="https://api.inference.wandb.ai/v1",
    api_key=os.getenv("WANDB_API_KEY"),
    project=PROJECT,
    default_headers={
        "OpenAI-Project": "wandb_fc/quickstart_playground"  # replace with your actual team/project
    }
)

resp = client.chat.completions.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    messages=[
        {"role": "system", "content": "You are a creative writing assistant."},
        {"role": "user", "content": "Write a short story about time travel."}
    ],
    temperature=0.8,
    top_p=0.9,
    max_tokens=2000,
)

print(resp.choices[0].message.content)

Parameter Guidelines:
  • Temperature: Use 0.1-0.3 for analytical tasks, 0.7-0.9 for creative work
  • Top_p: Combine with temperature; 0.9 works well for most applications
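To see the effect of these settings directly, you can send the same prompt at two temperatures and compare the resulting traces side by side in Weave. A minimal sketch that reuses the client from the snippet above (the prompt is purely illustrative):

# Assumes `client` is the OpenAI client configured in the previous snippet
prompt = "Describe a sunrise in one paragraph."
for temp in (0.2, 0.9):
    resp = client.chat.completions.create(
        model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
        messages=[{"role": "user", "content": prompt}],
        temperature=temp,
        max_tokens=200,
    )
    print(f"--- temperature={temp} ---")
    print(resp.choices[0].message.content)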
This gives us added flexibility to influence our model output. These parameters are also automatically logged to W&B Weave for observability:


Streaming inference responses

For real-time output and better user experience:
import os
import sys
import openai
import weave

PROJECT = "wandb_inference"
weave.init(PROJECT)

client = openai.OpenAI(
    base_url="https://api.inference.wandb.ai/v1",
    api_key=os.getenv("WANDB_API_KEY"),
    project=PROJECT,
    default_headers={
        "OpenAI-Project": "wandb_fc/quickstart_playground"  # replace with your actual team/project
    }
)

stream = client.chat.completions.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me a comprehensive story about space exploration."}
    ],
    stream=True,
    temperature=0.7,
)

sys.stdout.write("Response: ")
for chunk in stream:
    delta = chunk.choices[0].delta
    if delta and delta.content:
        sys.stdout.write(delta.content)
        sys.stdout.flush()
print()

We got a streaming response:

With the metrics logged to Weave:

As well as the full output:


Step 5: Running inference with Llama 4 Scout's unique capabilities

This is where running inference with Llama 4 Scout really shines. Let's explore what makes it special.

Long context inference

Llama 4 Scout excels at running inference on extensive documents. Here's a practical example:
import os
import io
import requests
import openai
import weave
from pypdf import PdfReader

PROJECT = "wandb_inference"
weave.init(PROJECT)

PDF_URL = "https://docs.aws.amazon.com/pdfs/bedrock-agentcore/latest/devguide/bedrock-agentcore-dg.pdf"
QUESTION = "How does AgentCore memory work?"

client = openai.OpenAI(
    base_url="https://api.inference.wandb.ai/v1",
    api_key=os.getenv("WANDB_API_KEY"),
    project=PROJECT,
    default_headers={
        "OpenAI-Project": "wandb_fc/quickstart_playground"  # replace with your actual team/project
    }
)

r = requests.get(PDF_URL, timeout=60)
r.raise_for_status()

reader = PdfReader(io.BytesIO(r.content))
# Take the first 100 pages to stay within the 64K-token context limit on W&B Inference
pages = reader.pages[:100]
text = "\n\n".join(page.extract_text() or "" for page in pages)

doc_snippet = text

prompt = (
    "You analyze AWS Bedrock AgentCore docs and answer using only the provided text. "
    "If something is not in the text, say you cannot find it.\n\n"
    f"Document:\n{doc_snippet}\n\nQuestion: {QUESTION}\n"
    "Cite exact phrases where possible."
)

resp = client.chat.completions.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    messages=[
        {"role": "system", "content": "You are an expert on AWS Bedrock AgentCore."},
        {"role": "user", "content": prompt}
    ],
    temperature=0.2,
    max_tokens=1500,
)

print(resp.choices[0].message.content)
Which outputs to Weave:


Multilingual inference

Leverage Llama 4 Scout's multilingual inference capabilities for international development:
import os
import openai
import weave

PROJECT = "wandb_inference"
weave.init(PROJECT)

client = openai.OpenAI(
    base_url="https://api.inference.wandb.ai/v1",
    api_key=os.getenv("WANDB_API_KEY"),
    project=PROJECT,
    default_headers={
        "OpenAI-Project": "wandb_fc/quickstart_playground"  # replace with your actual team/project
    }
)

code_snippet = """
// Chinese identifiers with English comments
function 计算总价(商品列表, 折扣率) {
    let 总价 = 0;
    for (const 商品 of 商品列表) {
        总价 += 商品.价格 * 商品.数量
    }
    const 折扣金额 = 总价 * 折扣率
    return 总价 - 折扣金额
}

# Python with Chinese docstring
def validate_用户输入(user_data):
    '''
    验证用户输入数据的完整性和有效性
    Validates user input data for completeness and validity
    '''
    required_fields = ['name', 'email', '年龄']
    for field in required_fields:
        if field not in user_data:
            raise ValueError(f"Missing required field: {field}")
    return True
"""

task = (
    "Explain in English what the code does and provide a concise Chinese explanation. "
    "Then suggest improvements for naming consistency and error handling. "
    "Provide a refactored version using one language for identifiers."
)

resp = client.chat.completions.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    messages=[
        {"role": "system", "content": "You are a senior engineer fluent in English and Chinese."},
        {"role": "user", "content": f"{task}\n\nCode:\n{code_snippet}"}
    ],
    temperature=0.2,
    max_tokens=1200,
)

print(resp.choices[0].message.content)

Which logs to Weave as:


Complex multi-step reasoning inference with Llama 4 Scout

Use Llama 4 Scout's reasoning capabilities for complex, multi-step problem solving:
import os
import openai
import weave

PROJECT = "wandb_inference"
weave.init(PROJECT)

client = openai.OpenAI(
    base_url="https://api.inference.wandb.ai/v1",
    api_key=os.getenv("WANDB_API_KEY"),
    project=PROJECT,
    default_headers={
        "OpenAI-Project": "wandb_fc/quickstart_playground"  # replace with your actual team/project
    }
)

resp = client.chat.completions.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant specialized in SaaS growth strategy."},
        {"role": "user", "content": """
Our SaaS company is experiencing a 15% monthly churn rate. Our customer acquisition cost (CAC) is $150,
average customer lifetime value (CLV) is $800, and monthly recurring revenue per customer is $50.
We have 2,000 active customers and are spending $60,000/month on marketing.
Please analyze this situation and provide a comprehensive strategy to improve our metrics,
including specific actions, expected timelines, and success metrics.
"""}
    ],
    temperature=0.7,
    max_tokens=1000,
)

print(resp.choices[0].message.content)
Which you'll see in the dashboard:


Monitoring Llama 4 Scout inference with W&B Weave

From the final cell, you can access the inference output and copy it from there. If you want to dig deeper into what's going on or review past inference requests (and I recommend you do), visit your Weights & Biases dashboard or click the links printed with the response.
With Weave initialized using your environment variable, all your inference API calls are automatically tracked. Here's what gets logged and how to use it effectively:

What Weave tracks automatically

  • Request details: Model used, parameters, token counts
  • Response data: Content, processing time, success/failure status
  • Usage metrics: Token consumption, API costs, rate limit status
  • Performance: Response latency, throughput patterns

Accessing your logs

  • Visit your W&B project dashboard at: https://wandb.ai/[your-username]/[your-project]
  • Navigate to the "Weave" section
  • View detailed logs, filter by date/model/status
  • Analyze usage patterns and optimize accordingly

Custom Weave annotations

Add custom metadata and organize your API calls:
import os
import openai
import weave

PROJECT = "wandb_inference"
weave.init(PROJECT)

client = openai.OpenAI(
    base_url="https://api.inference.wandb.ai/v1",
    api_key=os.getenv("WANDB_API_KEY"),
    project=PROJECT,
    default_headers={
        "OpenAI-Project": "wandb_fc/quickstart_playground"  # replace with your actual team/project
    }
)

@weave.op()
def analyze_customer_feedback(feedback_text, sentiment_threshold=0.5):
    """
    Analyze feedback and return a sentiment summary.
    Decorating with weave.op logs this function's inputs and outputs
    as a single trace alongside the underlying model call.
    """
    resp = client.chat.completions.create(
        model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
        messages=[
            {"role": "system", "content": "You score sentiment from -1 to 1 and list key topics."},
            {"role": "user", "content": f"Feedback: {feedback_text}\nThreshold: {sentiment_threshold}"}
        ],
        temperature=0.1,
        max_tokens=500,
    )
    return resp.choices[0].message.content

if __name__ == "__main__":
    out = analyze_customer_feedback(
        "The new update is confusing and slow. I cannot find the features I used daily.",
        sentiment_threshold=0.3,
    )
    print(out)
Which would appear as:


Best practices

Here are some best practices to follow when testing and/or deploying Llama 4 Scout, or any other model for that matter.

Security and configuration

  • Environment variables: Always store API keys in environment variables, never hardcode them
  • Project organization: Use clear, descriptive project names following the "team/project" format
  • Access control: Limit API key permissions to necessary scopes only

Prompt engineering for Llama 4 Scout

  • Leverage long context: Don't hesitate to provide extensive context; Llama 4 Scout handles it well
  • Clear instructions: Be specific about the desired output format and style
  • System messages: Use detailed system prompts to establish expertise and context
  • Temperature selection: Lower values (0.1-0.3) for analytical tasks, higher (0.7-0.9) for creative work

Performance optimization

  • Streaming: Use streaming for longer responses to improve user experience
  • Batch processing: Group similar requests when possible to improve efficiency
  • Token management: Monitor token usage to optimize costs and stay within limits (see the sketch after this list)
  • Caching: Implement response caching for frequently requested analyses
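For the token-management point above, a minimal sketch that reads the usage block returned with each completion (assuming resp is a response like the ones created earlier):

# Assumes `resp` is a chat completion response from one of the snippets above
usage = resp.usage
print(f"Prompt tokens:     {usage.prompt_tokens}")
print(f"Completion tokens: {usage.completion_tokens}")
print(f"Total tokens:      {usage.total_tokens}")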

Monitoring and debugging

  • Weave integration: Use Weave's automatic logging for all production calls
  • Custom annotations: Add meaningful metadata to track different use cases
  • Error analysis: Regularly review failed requests to identify patterns (a minimal error-handling sketch follows this list)
  • Performance tracking: Monitor response times and adjust parameters accordingly
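For the error-analysis point, one approach is to wrap calls with the exception classes the openai v1 client exposes so failures are caught, logged, and retried. The helper below is an illustrative sketch, not code from the tutorial above, and the backoff policy is an arbitrary choice:

import time
import openai

def complete_with_retries(client, messages, retries=3):
    # Retry transient failures with simple exponential backoff
    for attempt in range(retries):
        try:
            return client.chat.completions.create(
                model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
                messages=messages,
                max_tokens=500,
            )
        except openai.RateLimitError:
            time.sleep(2 ** attempt)  # back off, then retry
        except openai.APIError as e:
            print(f"API error on attempt {attempt + 1}: {e}")
            time.sleep(2 ** attempt)
    raise RuntimeError("All retry attempts failed")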

Next steps

Now that you're equipped with comprehensive Llama 4 Scout knowledge:
🔗 Explore Advanced Features
📊 Optimize Your Workflow
  • Set up automated monitoring dashboards for your specific use cases
  • Implement A/B testing between different prompting strategies
  • Create custom evaluation metrics for your domain-specific tasks
🚀 Scale Your Implementation
  • Build production pipelines with proper error handling and monitoring
  • Implement cost optimization strategies based on usage patterns
  • Explore integration with other W&B tools for end-to-end ML workflows
📚 Dive Deeper into Llama 4 Scout
  • Visit the Llama 4 Scout model card for detailed capability information
  • Explore community examples and use cases
  • Stay updated with model improvements and new features
With this comprehensive setup, you're ready to harness Llama 4 Scout's advanced capabilities while maintaining professional-grade monitoring, logging, and error handling through W&B Inference and Weave.
Iterate on AI agents and models faster. Try Weights & Biases today.