Tutorial: Running inference with Llama 3.1 8B using W&B Inference
Getting set up and running Llama 3.1 8B, Meta's advanced long-context language model, in Python using W&B Inference.
Running inference with Llama 3.1 8B through W&B Inference, powered by CoreWeave, is surprisingly straightforward, and it offers far more flexibility than you might get through other interfaces. In this tutorial, we'll have you running inference and exploring its advanced capabilities in depth.
Whether you're running inference on massive documents, building multilingual applications, or tackling complex reasoning tasks, this guide provides everything you need to run inference with Llama 3.1 8B effectively.
Table of contents
- What is Llama 3.1 8B?
- W&B Weave
- Tutorial: Running inference with Llama 3.1 8B using W&B Inference
- Prerequisites
- Step 1: Installation & setup
- Step 2: Environment configuration
- Step 3: Running basic inference with Llama 3.1 8B
- Step 4: Advanced Llama 3.1 8B inference configuration
- Running inference with Llama 3.1 8B's unique capabilities
- Monitoring Llama 3.1 8B inference with W&B Weave
- Best Practices
- 🔐 Security and configuration
- ✍️ Prompt engineering for Llama 3.1 8B
- ⚡ Performance optimization
- 📊 Monitoring and debugging
- Next steps
What is Llama 3.1 8B?
Llama 3.1 8B is a mid-sized language model developed by Meta, designed to balance performance across reasoning, coding, math, and multilingual tasks while staying lightweight. What sets it apart:
📝 On general knowledge tasks like MMLU it scores 73.0, stronger than Mistral 7B at 60.5 and even above GPT-3.5 Turbo at 69.8, while edging out Gemma 2 9B IT at 72.3
💻 On coding benchmarks like HumanEval it reaches 72.6, well ahead of Mistral 7B at 40.2 and Gemma 2 9B IT at 54.3
🧑‍💻 On MBPP EvalPlus it scores 72.8, outperforming Gemma 2 9B IT at 71.7 and Mistral 7B at 49.5
➗ On math reasoning GSM8K it delivers 84.5, stronger than Gemma 2 9B IT at 76.7 and GPT-3.5 Turbo at 81.6
🧠 On reasoning ARC Challenge it hits 83.4, comparable to GPT-3.5 Turbo at 83.7 and just behind Gemma 2 9B IT at 87.6
⚙️ On long-context tasks it shows strong numbers with 81.0 on ZeroScrolls and 98.8 on NIH/Multi-needle, highlighting its robustness in extended context handling
🌍 On multilingual evaluation MGSM it scores 68.9, well above Gemma 2 9B IT at 53.2, Mistral 7B at 29.9, and GPT-3.5 Turbo at 51.4

For detailed technical specifications and performance benchmarks, visit the Llama 3.1 8B model documentation.
W&B Weave
W&B Weave goes beyond basic logging, giving you a structured way to track and analyze model outputs. Getting started is simple—just import the library and initialize it with your project name.
One of its most useful features is the @weave.op decorator. A standard Python function runs without recording anything, but once you add @weave.op, every call is automatically logged with its inputs and outputs. This removes the need for manual print statements or custom logging code.
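Here's a minimal sketch of that pattern (the project name and the summarize function below are placeholders for illustration, not part of this tutorial's project):
import weave

weave.init("weave-demo")  # placeholder project name

@weave.op()
def summarize(text: str) -> str:
    # Once decorated, every call to this function is logged to Weave
    # with its input text and returned value.
    return text[:100]

summarize("Weave records this call, its input, and its output automatically.")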
All of this information flows into the Weave dashboard, where you can explore interactive visualizations, timelines, and traces for every function call. Instead of scrolling through scattered log lines, you get a connected view of how data moves through your models. You can drill down into details, compare different runs, and trace results back to their inputs.
The result is a more powerful workflow for model development. Weave doesn’t just capture outputs—it organizes your experimental data so you can debug faster, reproduce results more reliably, and refine models like Llama 3.1 8B with less effort.
Tutorial: Running inference with Llama 3.1 8B using W&B Inference
This tutorial assumes you're working in a Jupyter notebook (as you'll notice in some screenshots), but of course, the code will work in other environments.
We will be running inference with the meta-llama/Llama-3.1-8B-Instruct model specifically.
If you're not familiar with Jupyter Notebooks, you can get set up in about 5 minutes. I walk you through it in this tutorial.
Prerequisites
Before starting, ensure you have:
- Python 3.7 or higher installed
- Basic familiarity with Python and API usage
- Understanding of your use case requirements (document analysis, code review, multilingual tasks, etc.)
Step 1: Installation & setup
1. Install required packages
To get started running inference with Llama 3.1 8B, all you need to install is the OpenAI client and Weave (along with wandb). We'll use W&B Weave to simplify reviewing multiple outputs, which makes the process much more efficient.
The code to do this is:
pip install openai wandb weave
Run this command in your terminal or in a Jupyter cell.
When you execute the cell, you'll notice an asterisk ([*]) appear between the brackets [ ]. This indicates that the cell is running, and you'll need to wait until the asterisk turns into a number before proceeding.
2. Get your W&B API key
- Log in to W&B and copy your API key from https://wandb.ai/authorize
- Keep it handy for the next step
Step 2: Environment configuration
Setting up your environment variables is crucial for secure and seamless operation. Here, you'll need your W&B API key.
Option 1: In a Jupyter Notebook
# Set environment variables in your notebook
%env WANDB_API_KEY=your-wandb-api-key-here
Option 2: In Terminal/Shell
export WANDB_API_KEY="your-wandb-api-key-here"
Option 3: In Python script
import os

# Set environment variables programmatically
os.environ["WANDB_API_KEY"] = "your-wandb-api-key-here"
Step 3: Running basic inference with Llama 3.1 8B
Hopefully, this hasn't been too painful because now we're at the fun part.
Here's a complete example to start running inference with Llama 3.1 8B:
import os
import openai
import weave

PROJECT = "wandb_inference"
weave.init(PROJECT)

client = openai.OpenAI(
    base_url="https://api.inference.wandb.ai/v1",
    api_key=os.getenv("WANDB_API_KEY"),
    project=PROJECT,
    default_headers={
        "OpenAI-Project": "wandb_fc/quickstart_playground"  # replace with your actual team/project
    }
)

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are an educational assistant who excels at breaking down complex topics."},
        {"role": "user", "content": "Can you describe how quantum computing works using everyday language?"}
    ],
    temperature=0.7,
    max_tokens=1000,
)

print(resp.choices[0].message.content)
You'll find the inputs and outputs recorded to your Weave dashboard with the parameters automatically included:

Step 4: Advanced Llama 3.1 8B inference configuration
Understanding inference parameters
Adjust Llama 3.1 8B's response behavior with these key inference parameters (feel free to play around with them and compare the outputs in Weave!).
import os
import openai
import weave

PROJECT = "wandb_inference"
weave.init(PROJECT)

client = openai.OpenAI(
    base_url="https://api.inference.wandb.ai/v1",
    api_key=os.getenv("WANDB_API_KEY"),
    project=PROJECT,
    default_headers={
        "OpenAI-Project": "wandb_fc/quickstart_playground"  # replace with your actual team/project
    }
)

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are an imaginative storytelling assistant."},
        {"role": "user", "content": "Create a brief narrative about traveling through time."}
    ],
    temperature=0.8,
    top_p=0.9,
    max_tokens=2000,
)

print(resp.choices[0].message.content)
Parameter Guidelines:
- Temperature: Use 0.1-0.3 for analytical tasks, 0.7-0.9 for creative work
- Top_p: Combine with temperature; 0.9 works well for most applications
This gives us added flexibility to influence our model output. These parameters are also automatically logged to W&B Weave for observability:

Streaming inference responses
For real-time output and better user experience:
import os
import sys
import openai
import weave

PROJECT = "wandb_inference"
weave.init(PROJECT)

client = openai.OpenAI(
    base_url="https://api.inference.wandb.ai/v1",
    api_key=os.getenv("WANDB_API_KEY"),
    project=PROJECT,
    default_headers={
        "OpenAI-Project": "wandb_fc/quickstart_playground"  # replace with your actual team/project
    }
)

stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a knowledgeable assistant."},
        {"role": "user", "content": "Share an engaging narrative about humanity's journey into space."}
    ],
    stream=True,
    temperature=0.7,
)

sys.stdout.write("Response: ")
for chunk in stream:
    delta = chunk.choices[0].delta
    if delta and delta.content:
        sys.stdout.write(delta.content)
        sys.stdout.flush()
print()
We got a streaming response:

With the metrics logged to Weave:

As well as the full output:

Running inference with Llama 3.1 8B's unique capabilities
This is where running inference with Llama 3.1 8B really shines. Let's explore what makes it special.
Long context inference
Llama 3.1 8B excels at running inference on extensive documents. Here's a practical example:
import os
import io
import requests
import openai
import weave
from pypdf import PdfReader

PROJECT = "wandb_inference"
weave.init(PROJECT)

PDF_URL = "https://docs.aws.amazon.com/pdfs/bedrock-agentcore/latest/devguide/bedrock-agentcore-dg.pdf"
QUESTION = "What is the functionality of AgentCore's memory system?"

client = openai.OpenAI(
    base_url="https://api.inference.wandb.ai/v1",
    api_key=os.getenv("WANDB_API_KEY"),
    project=PROJECT,
    default_headers={
        "OpenAI-Project": "wandb_fc/quickstart_playground"  # replace with your actual team/project
    }
)

r = requests.get(PDF_URL, timeout=60)
r.raise_for_status()

reader = PdfReader(io.BytesIO(r.content))
pages = reader.pages[:50]
text = "\n\n".join(page.extract_text() or "" for page in pages)
doc_snippet = text

prompt = (
    "Please review the AWS Bedrock AgentCore documentation and respond based exclusively on the information provided. "
    "If the information isn't available in the text, please indicate that you cannot locate it.\n\n"
    f"Documentation:\n{doc_snippet}\n\nQuery: {QUESTION}\n"
    "Please reference specific text passages when possible."
)

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a specialist in AWS Bedrock AgentCore technologies."},
        {"role": "user", "content": prompt}
    ],
    temperature=0.2,
    max_tokens=1500,
)

print(resp.choices[0].message.content)
Which outputs to Weave:

Multilingual inference
Leverage Llama 3.1 8B's multilingual inference capabilities for international development:
import os
import openai
import weave

PROJECT = "wandb_inference"
weave.init(PROJECT)

client = openai.OpenAI(
    base_url="https://api.inference.wandb.ai/v1",
    api_key=os.getenv("WANDB_API_KEY"),
    project=PROJECT,
    default_headers={
        "OpenAI-Project": "wandb_fc/quickstart_playground"  # replace with your actual team/project
    }
)

code_snippet = """
// Chinese identifiers with English comments
function 计算总价(商品列表, 折扣率) {
    let 总价 = 0;
    for (const 商品 of 商品列表) {
        总价 += 商品.价格 * 商品.数量
    }
    const 折扣金额 = 总价 * 折扣率
    return 总价 - 折扣金额
}

# Python with Chinese docstring
def validate_用户输入(user_data):
    '''
    验证用户输入数据的完整性和有效性
    Validates user input data for completeness and validity
    '''
    required_fields = ['name', 'email', '年龄']
    for field in required_fields:
        if field not in user_data:
            raise ValueError(f"Missing required field: {field}")
    return True
"""

task = (
    "Please describe in English what this code accomplishes and provide a brief Chinese summary. "
    "Then recommend enhancements for variable naming consistency and exception handling. "
    "Create a refactored version that uses consistent language for all identifiers."
)

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are an experienced software developer proficient in both English and Chinese."},
        {"role": "user", "content": f"{task}\n\nCode:\n{code_snippet}"}
    ],
    temperature=0.2,
    max_tokens=1200,
)

print(resp.choices[0].message.content)
Which logs to Weave as:

Complex multi-step reasoning inference with Llama 3.1 8B
Utilize Llama 3.1 8B's inference reasoning capabilities for complex problem solving:
import os
import openai
import weave

PROJECT = "wandb_inference"
weave.init(PROJECT)

client = openai.OpenAI(
    base_url="https://api.inference.wandb.ai/v1",
    api_key=os.getenv("WANDB_API_KEY"),
    project=PROJECT,
    default_headers={
        "OpenAI-Project": "wandb_fc/quickstart_playground"  # replace with your actual team/project
    }
)

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are an expert consultant focused on SaaS business optimization."},
        {"role": "user", "content": """Our software-as-a-service business faces a 15% monthly customer churn rate. Customer acquisition costs $150 per customer,
while lifetime value averages $800, and we generate $50 in monthly recurring revenue per user.
Our customer base includes 2,000 active subscribers, and marketing expenses total $60,000 monthly.

Evaluate this business scenario and recommend a detailed improvement strategy,
including actionable steps, projected timelines, and key performance indicators."""}
    ],
    temperature=0.7,
    max_tokens=1000,
)

print(resp.choices[0].message.content)
Which you'll see in the dashboard:

Monitoring Llama 3.1 8B inference with W&B Weave
From the final cell, you can view the inference output and copy it as needed. To explore further or review past inference requests, open your Weave dashboard or follow the links included with the response.
With Weave initialized via weave.init, all inference API calls are automatically tracked. Here's what gets logged and how you can make the most of it:
What Weave tracks automatically
- Request details: Model used, parameters, token counts
- Response data: Content, processing time, success/failure status
- Usage metrics: Token consumption, API costs, rate limit status
- Performance: Response latency, throughput patterns
Accessing your logs
- Visit your W&B project dashboard at: https://wandb.ai/[your-username]/[your-project]
- Navigate to the "Weave" section
- View detailed logs, filter by date/model/status
- Analyze usage patterns and optimize accordingly
Custom Weave annotations
Add custom metadata and organize your API calls:
import os
import openai
import weave

PROJECT = "wandb_inference"
weave.init(PROJECT)

client = openai.OpenAI(
    base_url="https://api.inference.wandb.ai/v1",
    api_key=os.getenv("WANDB_API_KEY"),
    project=PROJECT,
    default_headers={
        "OpenAI-Project": "wandb_fc/quickstart_playground"  # replace with your actual team/project
    }
)

@weave.op()
def analyze_customer_feedback(feedback_text, sentiment_threshold=0.5):
    """Analyze feedback and return a sentiment summary.

    Wrapping the call in a weave.op groups it under a named operation
    and records the custom inputs (feedback_text, sentiment_threshold)
    alongside the underlying model call.
    """
    resp = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[
            {"role": "system", "content": "You evaluate sentiment on a scale of -1 to 1 and identify important themes."},
            {"role": "user", "content": f"Customer feedback: {feedback_text}\nSentiment threshold: {sentiment_threshold}"}
        ],
        temperature=0.1,
        max_tokens=500,
    )
    return resp.choices[0].message.content

if __name__ == "__main__":
    out = analyze_customer_feedback(
        "This latest software release is difficult to navigate and performs poorly. The features I rely on daily have become hard to locate.",
        sentiment_threshold=0.3,
    )
    print(out)
Which would appear as:

Best Practices
When testing or deploying Llama 3.1 8B, or any large model, following good practices will save time, improve reliability, and reduce costs.
🔐 Security and configuration
- Environment variables: Always store API keys in environment variables; never hardcode them.
- Project organization: Use clear, descriptive names in the team/project format.
- Access control: Restrict API key permissions to only what’s necessary.
✍️ Prompt engineering for Llama 3.1 8B
- Leverage long context: Don’t hesitate to supply rich context—Llama 3.1 8B is built to handle it.
- Clear instructions: Be specific about the desired format, tone, or style.
- System messages: Use detailed system prompts to establish expertise and guide responses.
- Temperature tuning: Lower values (0.1–0.3) for analytical tasks; higher (0.7–0.9) for creative work.
⚡ Performance optimization
- Streaming: Use streaming for long responses to improve interactivity.
- Batch processing: Group similar requests when possible to boost efficiency.
- Token management: Monitor token usage to control costs and stay within limits.
- Caching: Cache frequent queries to reduce redundant calls (a minimal sketch covering caching and token tracking follows below).
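As a rough illustration of the token-management and caching points above, here's a minimal sketch. The cached_completion helper and the in-memory dictionary are illustrative assumptions rather than part of W&B Inference, and the token printout assumes the endpoint returns the standard OpenAI usage field:
import os
import hashlib
import openai

client = openai.OpenAI(
    base_url="https://api.inference.wandb.ai/v1",
    api_key=os.getenv("WANDB_API_KEY"),
    default_headers={"OpenAI-Project": "wandb_fc/quickstart_playground"}  # replace with your team/project
)

_cache = {}  # naive in-memory cache; swap for Redis or disk storage in production

def cached_completion(prompt, temperature=0.2, max_tokens=500):
    # Key the cache on everything that affects the output
    key = hashlib.sha256(f"{prompt}|{temperature}|{max_tokens}".encode()).hexdigest()
    if key in _cache:
        return _cache[key]
    resp = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
        max_tokens=max_tokens,
    )
    # Print token consumption so you can watch costs as you iterate
    if resp.usage:
        print(f"Tokens used: {resp.usage.total_tokens}")
    _cache[key] = resp.choices[0].message.content
    return _cache[key]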
📊 Monitoring and debugging
- Weave integration: Enable automatic logging for all production calls.
- Custom annotations: Add metadata to differentiate experiments and use cases.
- Error analysis: Regularly review failed or slow requests to spot patterns.
- Performance tracking: Keep an eye on latency and throughput to guide optimizations.
Next steps
Now that you’re equipped with a solid foundation for working with Llama 3.1 8B, here’s how you can extend your workflow:
🔗 Explore advanced features → Review the W&B Inference docs for advanced configuration, and check out Weave’s evaluation tools for systematic testing.
📊 Optimize your workflow → Build automated dashboards, experiment with A/B testing for prompt strategies, and design domain-specific evaluation metrics.
🚀 Scale your deployment → Develop production pipelines with robust error handling, apply cost optimization strategies, and integrate with other W&B tools for a complete ML workflow.
📚 Go deeper into the model → Read the official Llama 3.1 8B model card, explore real-world examples from the community, and stay updated as new capabilities roll out.