
Tutorials: GPT-5 evaluation across multiple tasks

These tutorials cover how to evaluate GPT-5's image description and generation, coding performance, and automated debugging using W&B Weave.
The moment GPT-5 launched, we knew we had to dive in and play with some of the new API capabilities.
In the tutorials below, we'll explore some of GPT-5’s most powerful features. We start by experimenting with its image description and generation capabilities, then compare its performance against other models on real programming tasks, and finally see how it can drive an automated Python debugging agent that generates concise code-fix summaries. Let's hop in!


Tutorial: Evaluating GPT-5 multimodal I/O

We'll start by evaluating multimodal I/O using GPT-5.
In this tutorial, GPT-5 receives an image URL along with a description request (the PIL image is passed in as well so Weave can log it). It returns a natural-language description, which we then feed back into GPT-5’s image generation tool to create a brand-new image. Both functions are wrapped with @weave.op, so each run is logged in W&B Weave, including the original image, prompt, generated description, and new image output.
In the Weave UI, you can click into a single run and see the whole sequence visually (you'll see that below this code block). This is perfect for tracking creative experiments, debugging unexpected outputs, or simply showing before and after transformations.

import requests
from io import BytesIO
from PIL import Image
import base64
from openai import OpenAI
import weave; weave.init("gptV_gen_and_desc")


OPENAI_API_KEY = ""  # set your OpenAI API key here
client = OpenAI(api_key=OPENAI_API_KEY)

@weave.op
def gptV_describe_image_with_url(
    pil_img: Image.Image,
    img_url: str,
    prompt: str = "Describe what is in this image."
) -> str:
    """
    Describe an image from its URL; the PIL image is passed in so Weave logs it.
    """
    inp = {
        "role": "user",
        "content": [
            {"type": "input_text", "text": prompt},
            {"type": "input_image", "image_url": img_url}
        ]
    }
    resp = client.responses.create(
        model="gpt-5",
        input=[inp]
    )
    return resp.output_text


@weave.op
def gpt_generate_image(
    prompt: str,
    size: str = "1024x1024"  # reserved; not currently passed to the tool
) -> Image.Image:
    """
    Generate an image from a prompt using GPT-5's image generation tool (PIL image output).
    """
    prompt = f"Generate an image given the following description: {prompt}"
    print(f"[DEBUG] Generating image with prompt: {prompt}")

    try:
        response = client.responses.create(
            model="gpt-5",
            input=prompt,
            tools=[{"type": "image_generation"}],  # no tool_choice
        )
        print(f"[DEBUG] Raw response received: {response}")
    except Exception as e:
        print(f"[ERROR] Failed to create response: {e}")
        return None

    try:
        image_data = [
            output.result
            for output in response.output
            if output.type == "image_generation_call"
        ]
        print(f"[DEBUG] Extracted image data: {image_data}")
    except Exception as e:
        print(f"[ERROR] Failed to extract image data: {e}")
        return None

    if image_data:
        try:
            image_base64 = image_data[0]
            filename = "generated_image.png"
            with open(filename, "wb") as f:
                f.write(base64.b64decode(image_base64))
            print(f"[DEBUG] Image saved to {filename}")
            pil_img = Image.open(BytesIO(base64.b64decode(image_base64)))
            return pil_img
        except Exception as e:
            print(f"[ERROR] Failed to save image: {e}")
    return None


# --- Main example usage ---
if __name__ == "__main__":
    img_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/3/3f/Fronalpstock_big.jpg/800px-Fronalpstock_big.jpg"
    headers = {"User-Agent": "Mozilla/5.0 (compatible; gpt5-weave-demo/1.0)"}
    response = requests.get(img_url, headers=headers)
    response.raise_for_status()
    pil_img = Image.open(BytesIO(response.content))

    # 1. DESCRIBE
    desc = gptV_describe_image_with_url(
        pil_img=pil_img,
        img_url=img_url,
        prompt="Describe what is in this image."
    )
    print("\nGPT-5 description:")
    print(desc)

    # 2. GENERATE NEW IMAGE FROM DESCRIPTION
    gen_img = gpt_generate_image(desc)
    if gen_img is not None:
        gen_img.save("gptV_generated.png")
        print("\nGenerated image saved as gptV_generated.png")
After running the script, you will have both the original and the generated image saved locally, and thanks to Weave, you will also have a full visual record of the process in your dashboard. You can filter by prompt text, compare multiple runs side-by-side, and even share a link to a specific run with a teammate.
Here's what it looks like inside Weave after running your script:
First, we describe the image
Next, we attempt to generate a similar image!
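One variation worth noting: the script above passes the image to GPT-5 by URL. If you want to describe a local image instead, the input_image payload can also take a base64 data URL. Here's a minimal sketch (the helper name is ours, and it assumes the same PIL image as above):
import base64
from io import BytesIO
from PIL import Image

def pil_to_data_url(pil_img: Image.Image) -> str:
    # Serialize the PIL image to PNG bytes and wrap them in a base64 data URL
    buf = BytesIO()
    pil_img.save(buf, format="PNG")
    encoded = base64.b64encode(buf.getvalue()).decode("utf-8")
    return f"data:image/png;base64,{encoded}"

# Use the data URL in place of the plain URL in the request payload:
# {"type": "input_image", "image_url": pil_to_data_url(pil_img)}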

Tutorial: Evaluating GPT-5 coding capabilities using Weave Evals

Next, we move into a coding evaluation of GPT-5. The model lets you control how much reasoning effort it spends on a task, so you can balance speed against accuracy. For this run, I evaluated three models on 30 examples from the DeepMind Code Contests dataset, each producing Python functions for competitive programming problems. GPT-OSS-120B was tested with its high reasoning mode enabled, GPT-5 was run with a low reasoning effort to see how it performed under tighter constraints, and Claude 4.1 Opus was run with a 4k token thinking budget, which is relatively low for its capabilities.
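For reference, here's a minimal sketch of how that reasoning effort knob is passed to the Responses API (the evaluation script below does the same thing inside its generate_completion router). It assumes OPENAI_API_KEY is set in your environment:
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
resp = client.responses.create(
    model="gpt-5",
    reasoning={"effort": "low"},  # "low" keeps responses fast; "high" spends more time thinking
    input=[{"role": "user", "content": "Write a Python function that checks whether a number is prime."}],
)
print(resp.output_text)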
Using the EvaluationLogger means we can run evaluations in any style we choose. Here, GPT-5, Claude 4.1 Opus, and GPT-OSS-120B each generate Python code for every problem, and we actually execute that code in a controlled environment. It is run against a set of predefined test cases, with the outputs compared to the expected results to determine correctness. We also record how long each run takes.
Every part of this process is logged in Weave, so for each problem, you can inspect the original prompt, the generated code, the exact test inputs, the outputs from execution, and whether they passed. This makes it easy to compare reasoning effort levels, low versus high, by looking at both the accuracy metrics and the specific differences in the code that led to those results. Here’s the code:
import os
import sys
import re
import time
import json
import random
import subprocess

import requests
import numpy as np
from datasets import load_dataset
from litellm import completion as oai_completion
from anthropic import Anthropic
from google import genai
from google.genai import types
from openai import OpenAI

import weave
from weave.flow.eval_imperative import EvaluationLogger


weave.init("codecontests_eval")


# API keys
# os.environ["OPENAI_API_KEY"] = os.getenv("OPENAI_API_KEY", "sk-...")
CLAUDE_KEY = os.getenv("CLAUDE_API_KEY", "")
GENAI_KEY = os.getenv("GOOGLE_API_KEY", "")
OPENROUTER_KEY = os.getenv("OPENROUTER_API_KEY", "")


# Clients
anthropic_client = Anthropic(api_key=CLAUDE_KEY)
gemini_client = genai.Client(api_key=GENAI_KEY)
openrouter_client = OpenAI(
base_url="https://openrouter.ai/api/v1",
api_key=OPENROUTER_KEY,
)
client = OpenAI(api_key="")



def clean_llm_code_block(text):
import re


cleaned_text = text.replace("```python", "").replace("```", "").strip()
code_blocks = re.findall(r"(def solve\(.*?)(?=^def |\Z)", cleaned_text, re.DOTALL | re.MULTILINE)
source_text = code_blocks[-1] if code_blocks else cleaned_text


prompt = (
"Given the following response from a language model, extract ONLY the valid Python code for the function. "
"Do not include any explanations, text, or formatting fences. Only the code.\n\n"
f"Response:\n{source_text}\n\n"
"Return ONLY the Python code, including any necessary imports:"
)


response = oai_completion(
model="openai/gpt-4o-2024-08-06",
messages=[{"role": "user", "content": prompt}],
temperature=0.0,
)


gpt4o_code = response["choices"][0]["message"]["content"]
gpt4o_code = gpt4o_code.replace("```python", "").replace("```", "").strip()
return gpt4o_code


@weave.op()
def generate_completion(model: str, prompt: str, effort: str="low") -> str:
# if model.startswith("openai/"):
# response = oai_completion(
# model=model,
# messages=[{"role": "user", "content": prompt}],
# reasoning_effort="low",
# )
# return response["choices"][0]["message"]["content"].strip()
if model.startswith("openai/"):
response = client.responses.create(
model=model.replace("openai/", ""),
reasoning={"effort": effort},
input=[
{"role": "user", "content": prompt}
]
)
return response.output_text.strip()


elif model.startswith("anthropic/"):
response = anthropic_client.messages.create(
model=model.replace("anthropic/", ""),
max_tokens=8000,
thinking={"type": "enabled", "budget_tokens": 4000},
messages=[{"role": "user", "content": prompt}],
)
for block in response.content:
if block.type == "text":
return block.text.strip()
return "[No Claude response]"


elif model.startswith("gemini/"):
result = gemini_client.models.generate_content(
model=model.replace("gemini/", ""),
config=types.GenerateContentConfig(
thinking_config=types.ThinkingConfig(thinking_budget=4000)
),
contents=[prompt]
)
return result.text.strip() if result.text else "[No Gemini response]"


elif model.startswith("openrouter/"):

url = "https://openrouter.ai/api/v1/chat/completions"
headers = {
"Authorization": f"Bearer {OPENROUTER_KEY}",
"Content-Type": "application/json"
}
payload = {
"model": model.replace("openrouter/", ""),
"messages": [
{"role": "system", "content": "Reasoning: high"},
{"role": "user", "content": prompt}
],
}
response = requests.post(url, headers=headers, data=json.dumps(payload))
resp_json = response.json()
if 'choices' in resp_json and resp_json['choices']:
# To get the reasoning, use: resp_json['choices'][0]['message'].get('reasoning')
return resp_json['choices'][0]['message'].get('content', '[No answer found]')
else:
return "[No choices found in OSS response]"

else:
raise ValueError(f"Unsupported model: {model}")




def ask_llm_for_function_implementation(description: str, model: str, effort: str | None = None) -> str:
prompt = (
f"Write a Python 3 function named `solve` with typed input arguments for this problem -- the solve function should take arguments so it can handle different test cases:\n\n"
f"{description.strip()}\n\n"
"Return only a valid Python function -- no special packages that aren't commonly used, and NO main function or if __name__ == '__main__' block; JUST write the function that returns the result. No comments, no explanations. "
"HOWEVER, you still need to include the necessary imports for any libraries you use. "
"IF you do not include the right imports, the code will not be executable, and your response will be judged as incorrect!"
)
# Pass effort only to OpenAI via generate_completion when provided
if effort is not None and model.startswith("openai/"):
return clean_llm_code_block(generate_completion(model, prompt, effort=effort))
else:
return clean_llm_code_block(generate_completion(model, prompt))




@weave.op
def ask_llm_for_function_call(code: str, raw_input: str, model: str) -> str:


prompt = (
"You're given a Python function and a single input string. "
"Format it into a valid Python function call using only standard types.\n\n"
f"Function:\n{code}\n\n"
f"Input:\n{raw_input.strip()}\n\n"
"Return ONLY a valid function call (e.g., solve(3, 5)) WITH NO 'def' "
)


# Always use GPT-4o for this inference, regardless of the `model` argument.
response = oai_completion(
model="openai/gpt-4o-2024-08-06",
messages=[{"role": "user", "content": prompt}],
temperature=0.0,
)
# The LLM may return markdown code blocks; strip them just in case.
content = response["choices"][0]["message"]["content"]
content = content.replace("```python", "").replace("```", "").strip()
return content


def compare_output_with_llm(expected: str, actual: str, model: str) -> bool:
prompt = (
f"Expected output: {expected.strip()}\n"
f"Actual output: {actual.strip()}\n\n"
"Are these outputs equivalent? Eg ignore minor formatting errors etc, we are just looking for overall correctness in the output Reply YES or NO."
)
# response = generate_completion(model, prompt)


response = oai_completion(
model="openai/gpt-4o-2024-08-06",
messages=[{"role": "user", "content": prompt}],
temperature=0.0,
)
# The LLM may return markdown code blocks; strip them just in case.
res = 'YES' in str(response["choices"][0]["message"]["content"]).upper()
return res



def run_code_and_call_function(code: str, function_call: str, timeout=10):
full_code = code + f"\n\nprint({function_call})"
try:
start = time.time()
result = subprocess.run(
[sys.executable, "-c", full_code],
capture_output=True,
text=True,
timeout=timeout
)
latency = time.time() - start
return result.stdout.strip(), result.stderr.strip(), latency
except subprocess.TimeoutExpired:
return "", "Execution timed out.", timeout
except Exception as e:
return "", str(e), 0.0




def ask_model_for_pip_command(error_msg):
prompt = (
"Given this Python error:\n\n"
+ error_msg +
"\n\nWrite the pip install command needed to fix it. Only return the command, e.g.:\n"
"pip install requests"
)
return generate_completion("openai/gpt-4o-2024-08-06", prompt)



def run_pip_install(pip_command):
print(f"Running: {pip_command}")
try:
result = subprocess.run(
pip_command.split(),
capture_output=True,
text=True,
timeout=180
)
print(result.stdout.strip())
if result.stderr:
print(result.stderr.strip())
except Exception as e:
print(f"pip install failed: {e}")



def evaluate_model_on_code_contests(model_name: str, reasoning_effort: str | None = None):
print(f"\n\nRunning evaluation for model: {model_name}\n")

random.seed(42)
np.random.seed(42)
ds = load_dataset("deepmind/code_contests", split="test", streaming=True)
ds = list(ds.take(31))


# Build sanitized model identifier for Weave, including reasoning effort if provided
model_id = model_name.replace("-", "_").replace("/", "_").replace(".", "_")
if reasoning_effort:
effort_id = str(reasoning_effort).replace("-", "_").replace("/", "_").replace(".", "_")
model_id = f"{model_id}__{effort_id}"

eval_logger = EvaluationLogger(
model=model_id,
dataset="code_contests_test"
)
all_latencies = []


for i in range(30):
row = ds[i]
description = row["description"]
raw_inputs = row["public_tests"]["input"]
expected_outputs = row["public_tests"]["output"]


try:
# Forward reasoning_effort only to OpenAI generate_completion
code = ask_llm_for_function_implementation(
description,
model=model_name,
effort=reasoning_effort if (reasoning_effort and model_name.startswith("openai/")) else None,
)
print(f"\n=== Task {row['name']} ===", flush=True)
# print("Generated code:\n", code)


all_passed = True
task_latencies = []
results_lst, expected_lst = [], []


for j, raw_input in enumerate(raw_inputs):
expected = expected_outputs[j] if j < len(expected_outputs) else ""


try:
function_call = ask_llm_for_function_call(code, raw_input, model=model_name)
result, error, latency = run_code_and_call_function(code, function_call)
if latency < 99:
task_latencies.append(latency)


if error:
print(f"[{j}] Runtime error: {error}")
if "ModuleNotFoundError" in error:
pip_cmd = ask_model_for_pip_command(error)
run_pip_install(pip_cmd)
# Re-run once after pip install
result, error, latency = run_code_and_call_function(code, function_call)
task_latencies.append(latency)
if error:
print(f"[{j}] Retry failed: {error}")
all_passed = False
continue
else:
all_passed = False
continue


is_correct = compare_output_with_llm(expected, result, model="openai/gpt-4o-2024-08-06")
results_lst.append(result)
expected_lst.append(expected)
if not is_correct:
all_passed = False
print(f"[{j}] input: {raw_input.strip()} → output: {result} | expected: {expected.strip()} | PASS: {is_correct} | latency: {latency:.2f}s")


except Exception as inner:
print(f"[{j}] Inner error: {repr(inner)}")
all_passed = False
task_avg_latency = sum(task_latencies) / len(task_latencies) if len(task_latencies) > 0 else 0.0
all_latencies.extend(task_latencies)


prediction_log = eval_logger.log_prediction(
inputs={"description": description},
output={'code': code, 'execution_result': results_lst, 'expected_execution_result': expected_lst}
)
prediction_log.log_score("correctness", all_passed)
prediction_log.log_score("code_latency", task_avg_latency)
prediction_log.finish()


except Exception as e:
print(f"[{i}] Top-level failure: {repr(e)}")
prediction_log = eval_logger.log_prediction(
inputs={"description": description},
output=str(e)
)
prediction_log.log_score("correctness", False)
prediction_log.finish()


avg_latency = sum(all_latencies) / len(all_latencies) if all_latencies else 0.0
eval_logger.log_summary({"avg_code_latency": avg_latency})
print(f"Evaluation complete for {model_name}. View in Weave UI.")




# Run for all models


evaluate_model_on_code_contests("openrouter/openai/gpt-oss-120b")
evaluate_model_on_code_contests("anthropic/claude-opus-4-1-20250805")
evaluate_model_on_code_contests("openai/gpt-5", reasoning_effort='low')

The code initializes the Weave project to log everything, then streams a small slice of the DeepMind Code Contests test set. For each problem, it asks a chosen model to write a Python function named solve. The generate_completion router supports several providers, and if you use an OpenAI model, you can pass a reasoning effort setting. Because models love to wrap answers in markdown, clean_llm_code_block strips fences and keeps only runnable code.
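If you want to see what one of these problems looks like before running the full evaluation, you can stream a single row and inspect the fields the script relies on. A quick sketch, assuming the datasets library is installed:
from datasets import load_dataset

# Stream the test split so nothing large is downloaded up front
ds = load_dataset("deepmind/code_contests", split="test", streaming=True)
row = next(iter(ds))

print(row["name"])                       # task identifier used in the logs
print(row["description"][:300])          # the problem statement sent to the model
print(row["public_tests"]["input"][0])   # first public test input
print(row["public_tests"]["output"][0])  # expected output for that input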
Next, for each public test input, the script asks GPT-4o to turn the raw example into a concrete function call, runs the candidate code in a subprocess, and measures latency. If execution errors with a missing package, it asks a model for the precise pip install command, installs it, and retries once. Outputs are compared to the expected answers using a tolerant GPT-4o check so trivial formatting differences do not tank correctness.
Every task is logged to Weave’s EvaluationLogger with inputs, the generated code, per-test outputs, correctness, and timing. At the end, it logs a summary that includes average execution latency. The result is a reproducible evaluation you can explore in Weave, drilling into each task to see the prompt, code, calls, outputs, and pass or fail status across different models and effort levels.
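Stripped of the code-generation and execution machinery, the logging pattern itself is small. Here's a minimal sketch of the EvaluationLogger calls the script makes for each task (the values are placeholders):
import weave
from weave.flow.eval_imperative import EvaluationLogger

weave.init("codecontests_eval")
eval_logger = EvaluationLogger(model="my_model", dataset="code_contests_test")

# One prediction per task: log inputs and outputs, then attach scores
pred = eval_logger.log_prediction(
    inputs={"description": "problem statement..."},
    output={"code": "def solve(...): ...", "execution_result": ["42"]},
)
pred.log_score("correctness", True)
pred.log_score("code_latency", 0.85)
pred.finish()

# Aggregate metrics for the whole run
eval_logger.log_summary({"avg_code_latency": 0.85})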
Once this evaluation is complete, you can open the Weave UI to explore it in detail. Weave’s evaluation viewer not only shows you aggregate statistics like accuracy but also lets you click into each individual example. You can see the prompt, the model’s full response, the score, and any metrics you logged alongside it. If you ran multiple models or settings, Weave will align them so you can directly compare outputs. This is invaluable when deciding whether the extra latency of higher reasoning effort is worth the improvement in correctness for your use case. Here are the results for my eval:

From the chart, GPT-5 (low thinking budget) had the highest correctness score at 0.733. GPT-OSS-120B (high reasoning) was close behind at 0.633. Claude 4.1 Opus (4k token thinking budget) lagged with only 0.333 correctness and a mid-range average code latency of 0.620 seconds.

Tutorial: Building a GPT-5 Python debugger agent with Weave

Finally, we explore GPT-5 as a debugging assistant that can take an error log, enrich it with external context, and propose an actionable fix. Every key step is wrapped in @weave.op, so you can replay runs, inspect intermediate results, and compare outputs later. Once set up, you’ll run: agentpython myscript.py
This executes your script and triggers the agent automatically if an error occurs. Here's what's happening as it runs:

1. Log capture and cleanup

The agent starts by loading an error log saved via a shell alias or function that pipes stderr to a file. If it detects a Python traceback, GPT-5 strips away volatile details like file paths, line numbers, memory addresses, and run-specific shapes, keeping only the essential error type, message, and any clear library names.

2. GitHub search with OCR

Using this cleaned query, the script calls the GitHub API for matching issues, opens each result in a headless browser, captures full-page screenshots, and runs OCR to extract the entire discussion, including code blocks or image-only replies. GPT-5 then summarizes each thread in the context of the original error so you can see likely causes and fixes immediately.

3. Web search and static analysis

If web search is enabled, GPT-5 queries the live internet, merges relevant results with the error context, and returns an actionable answer with citations. Alternatively (or in parallel), the agent can run a pure offline static analysis, reading the implicated source file and traceback to propose specific edits, quick test snippets, and relevant install or version commands.

4. Final HTML report

The process ends with a single HTML report containing the raw error log, any related source snippet, GPT-5’s tool recommendations, static analysis results, GitHub issues with links, screenshots, and summaries, plus web search findings with citations. The final recommendation appears at the bottom so you can act immediately. The report opens automatically in Chrome, and because Weave logs every input and output, you can revisit, tweak prompts, or re-run parts without repeating expensive searches or OCR work.
Here's the full code:
import os
import sys
import re
import requests
import tempfile
import webbrowser
import html
import asyncio
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
from playwright.async_api import async_playwright
from PIL import Image
import pytesseract

from openai import OpenAI
import weave; weave.init("gpt5_agent")


# -------- CONFIG --------
LOGFILE = sys.argv[1] if len(sys.argv) > 1 else "/tmp/agentpython-stderr.log"
OUTPUT_DIR = "github_screenshots"
PARALLEL_PAGE_LOADS = 6 # how many pages to screenshot at once
OCR_WORKERS = min(8, (os.cpu_count() or 4))
# ------------------------

def verbose_print(msg):
print(f"\033[95m[LOG] {msg}\033[0m", flush=True)

def read_log(logfile):
verbose_print(f"Reading from log file: {logfile}")
if not os.path.exists(logfile) or os.path.getsize(logfile) == 0:
print("[LOG] Log file empty or not found. No action needed.", flush=True)
sys.exit(0)
with open(logfile) as f:
content = f.read()
print(f"\n--- Log Content ---\n{content}\n{'-'*40}", flush=True)
return content

def is_python_error(txt):
if "Traceback (most recent call last):" in txt or "Exception" in txt or "Error" in txt:
verbose_print("Looks like a Python error.")
return True
verbose_print("Not detected as Python error (using fallback toolchain).")
return False

@weave.op
def generate_search_query_openai(error_str):
verbose_print("Generating generalized search query using gpt-5 (OpenAI)...")
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
gpt_response = client.responses.create(
model="gpt-5",
reasoning={"effort": "low"},
input=[
{
"role": "user",
"content": (
"You are generating a GitHub search query from an error message. "
"Your goal is to create a generic query that will return relevant results from GitHub issues across many repositories. "
"Do NOT include overly specific details that would narrow results too much, such as:\n"
"- File paths\n"
"- Line numbers\n"
"- Exact tensor shapes, array sizes, or specific numeric values in parentheses\n"
"- Memory addresses\n"
"- Random seeds or run-specific values\n\n"
"Instead:\n"
"- Keep only the key error type and descriptive text\n"
"- Include the relevant library name if obvious (e.g., torch, numpy, pandas)\n"
"- Use quotes for the core error message if helpful\n\n"
"Output only the final search query string. No explanation, no extra words.\n\n"
f"Error:\n{error_str}"
)
}
]
)
query = (gpt_response.output_text or "").strip()
print("Generated search query:", repr(query), flush=True)
return query

async def _screenshot_one(page, url, path):
try:
await page.goto(url, timeout=20000)
await page.set_viewport_size({"width": 1920, "height": 1080})
await page.screenshot(path=path, full_page=True)
verbose_print(f"[+] Screenshot saved: {path}")
return True
except Exception as e:
verbose_print(f"[!] Failed screenshot for {url}: {e}")
return False

async def capture_screenshots_parallel(urls, out_dir, concurrency=6):
os.makedirs(out_dir, exist_ok=True)
results = [None] * len(urls)

async with async_playwright() as p:
browser = await p.chromium.launch(headless=True)
context = await browser.new_context()

sem = asyncio.Semaphore(concurrency)
async def worker(i, url):
path = os.path.join(out_dir, f"issue_{i+1}.png")
async with sem:
page = await context.new_page()
ok = await _screenshot_one(page, url, path)
await page.close()
results[i] = path if ok else None

tasks = [asyncio.create_task(worker(i, url)) for i, url in enumerate(urls)]
await asyncio.gather(*tasks)
await browser.close()

return results # list of file paths (or None)

def run_ocr(image_path):
if not image_path or not os.path.exists(image_path):
return ""
try:
img = Image.open(image_path)
text = pytesseract.image_to_string(img)
# save alongside
txt_path = image_path.rsplit(".", 1)[0] + ".txt"
with open(txt_path, "w", encoding="utf-8") as f:
f.write(text)
return text
except Exception as e:
verbose_print(f"[!] OCR failed for {image_path}: {e}")
return ""
@weave.op
def summarize_with_gpt5(error_text, github_text):
if not github_text.strip():
return "[No OCR text to summarize]"
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
resp = client.responses.create(
model="gpt-5",
reasoning={"effort": "low"},
input=[
{
"role": "user",
"content": (
"You are assisting in debugging. The following is a Python error message, "
"and then OCR-extracted text from a GitHub issue discussing it. "
"Summarize the most likely cause and solution in a few sentences. "
"Only include relevant fix instructions. Be concise.\n\n"
f"Error:\n{error_text}\n\nGitHub Issue Content:\n{github_text}"
)
}
]
)
return (resp.output_text or "").strip()


@weave.op
def search_github(query, github_token=None, owner=None, repo=None, error_text=None):
verbose_print(f"Searching GitHub issues for: {query!r}")
url = 'https://api.github.com/search/issues'
headers = {'Accept': 'application/vnd.github.v3+json'}
if github_token:
headers['Authorization'] = f'token {github_token}'
if owner and repo:
gh_query = f'repo:{owner}/{repo} is:issue {query}'
else:
gh_query = query
params = {'q': gh_query, 'per_page': 5}
resp = requests.get(url, headers=headers, params=params)
if resp.status_code != 200:
print(f"[GitHub] Search failed: {resp.status_code} {resp.text}", flush=True)
return []

items = resp.json().get('items', [])
if not items:
print("[GitHub] No results found.", flush=True)
return []

issue_urls = [it.get('html_url', '') for it in items]
# Parallel screenshots
verbose_print("Capturing GitHub issues as screenshots in parallel...")
screenshots = asyncio.run(capture_screenshots_parallel(issue_urls, OUTPUT_DIR, PARALLEL_PAGE_LOADS))

# Parallel OCR
verbose_print("Running OCR on screenshots in parallel...")
ocr_texts = [""] * len(screenshots)
with ThreadPoolExecutor(max_workers=OCR_WORKERS) as ex:
futures = {ex.submit(run_ocr, path): i for i, path in enumerate(screenshots)}
for fut in as_completed(futures):
i = futures[fut]
try:
ocr_texts[i] = fut.result() or ""
except Exception as e:
verbose_print(f"[!] OCR worker error for index {i}: {e}")
ocr_texts[i] = ""

# Summarize in parallel
gh_results = []
summaries = [""] * len(items)

def _summarize_idx(i: int) -> str:
return summarize_with_gpt5(error_text or query, ocr_texts[i])

max_workers = min(8, len(items)) if items else 0
if max_workers > 0:
with ThreadPoolExecutor(max_workers=max_workers) as ex:
future_map = {ex.submit(_summarize_idx, i): i for i in range(len(items))}
for fut in as_completed(future_map):
i = future_map[fut]
try:
summaries[i] = fut.result() or ""
except Exception as e:
summaries[i] = f"[summarize error: {e}]"

for idx, item in enumerate(items):
summary = summaries[idx]
issue_info = {
"number": item.get("number", "?"),
"title": item.get("title", ""),
"url": item.get("html_url", ""),
"body": (item.get("body", "") or "")[:600] + ("..." if item.get("body") and len(item["body"]) > 600 else ""),
"ocr_summary": summary,
"screenshot": screenshots[idx] or ""
}
gh_results.append(issue_info)
print("=" * 60, flush=True)
print(f"Issue #{issue_info['number']}: {issue_info['title']}", flush=True)
print(f"URL: {issue_info['url']}", flush=True)
print(f"Screenshot: {issue_info['screenshot']}", flush=True)
print(f"Solution Summary: {summary}", flush=True)
print("=" * 60, flush=True)

return gh_results


@weave.op
def openai_web_search(query):
verbose_print(f"Querying OpenAI gpt-5 web search for: {query!r}")
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
search_response = client.responses.create(
model="gpt-5",
tools=[{"type": "web_search_preview"}],
reasoning={"effort": "low"},
input=query
)
print("\n=== [OpenAI] Web Search AI Answer ===", flush=True)
print(search_response.output_text, flush=True)
links = re.findall(r'\[([^\]]+)\]\((https?://[^\)]+)\)', search_response.output_text or "")
link_objs = []
if links:
for title, url in links:
link_objs.append({'title': title, 'url': url})
else:
print("No citations found in output_text.", flush=True)
return {'output_text': search_response.output_text, 'citations': link_objs}


@weave.op
def write_html_report(
log,
file_snippet,
tools,
gh_results,
web_results,
static_result=None,
out_path=None
):
"""Write the HTML debug report and return both the path and the raw HTML."""
verbose_print("Writing HTML report ...")
out_path = out_path or os.path.join(tempfile.gettempdir(), 'dbg_report.html')
css = """
body { font-family: 'Segoe UI', sans-serif; background: #f5f7fa; color: #333; margin: 0; padding: 0; }
header { background: #1e293b; color: white; padding: 20px; text-align: center; font-size: 1.5em; }
section { padding: 20px; margin: 20px; background: white; border-radius: 8px; box-shadow: 0 2px 6px rgba(0,0,0,0.1); }
h2 { border-bottom: 2px solid #e5e7eb; padding-bottom: 5px; margin-bottom: 10px; color: #1f2937; }
pre { background: #0f172a; color: #e2e8f0; padding: 15px; border-radius: 6px; overflow-x: auto; font-size: 0.9em; }
a { color: #2563eb; text-decoration: none; }
a:hover { text-decoration: underline; }
.gh-issue { border: 1px solid #e5e7eb; padding: 10px; border-radius: 6px; margin-bottom: 16px; background: #f9fafb; }
.shot { margin: 8px 0; display: block; max-width: 100%; border: 1px solid #e5e7eb; border-radius: 6px; }
.label { font-weight: 600; color: #111827; }
"""
html_parts = []
html_parts.append(f"<html><head><meta charset='utf-8'><title>Debug Results</title><style>{css}</style></head><body>\n")
html_parts.append("<header>Debugging Session Report</header>\n")
html_parts.append("<section><h2>Error Log</h2>")
html_parts.append(f"<pre>{html.escape(log or 'None')}</pre></section>")
if file_snippet:
html_parts.append("<section><h2>Relevant Source Snippet</h2>")
html_parts.append(f"<pre>{html.escape(file_snippet)}</pre></section>")
if tools:
html_parts.append("<section><h2>LLM Tool Recommendations</h2>")
html_parts.append(f"<pre>{html.escape(str(tools))}</pre></section>")
if static_result:
html_parts.append("<section><h2>Static Analysis</h2>")
diag = static_result.get("diagnosis", "")
fixes = "\n".join(static_result.get("fixes", []) or [])
patch = static_result.get("patch", "")
test_snip = static_result.get("test_snippet", "")
notes = static_result.get("notes", "")
html_parts.append(f"<div class='label'>Diagnosis</div><pre>{html.escape(diag)}</pre>")
if fixes:
html_parts.append(f"<div class='label'>Proposed Fixes</div><pre>{html.escape(fixes)}</pre>")
if patch:
html_parts.append(f"<div class='label'>Proposed Patch</div><pre>{html.escape(patch)}</pre>")
if test_snip:
html_parts.append(f"<div class='label'>Quick Test</div><pre>{html.escape(test_snip)}</pre>")
if notes:
html_parts.append(f"<div class='label'>Notes</div><pre>{html.escape(notes)}</pre>")
html_parts.append("</section>")
if gh_results:
html_parts.append("<section><h2>GitHub Related Issues</h2>")
for res in gh_results:
html_parts.append(f"<div class='gh-issue'><div class='label'>#{res['number']}: {html.escape(res['title'])}</div>")
html_parts.append(f"<a href='{res['url']}'>{res['url']}</a><br>")
html_parts.append(f"<div class='label'>Issue Preview</div><pre>{html.escape(res['body'])}</pre>")
html_parts.append(f"<div class='label'>Solution Summary</div><pre>{html.escape(res.get('ocr_summary',''))}</pre></div>")
html_parts.append("</section>")
if web_results:
html_parts.append("<section><h2>Web Search AI Answer</h2>")
html_parts.append(f"<pre>{html.escape(web_results.get('output_text', ''))}</pre>")
if web_results.get('citations'):
html_parts.append("<ul>")
for c in web_results['citations']:
html_parts.append(f"<li><a href='{c['url']}'>{html.escape(c['title'])}</a></li>")
html_parts.append("</ul>")
html_parts.append("</section>")
html_parts.append("</body></html>")
raw_html = ''.join(html_parts)
with open(out_path, "w", encoding="utf-8") as f:
f.write(raw_html)
verbose_print(f"HTML written at: {out_path}")
return out_path, raw_html

def open_html_in_chrome(path):
verbose_print(f"Opening HTML report in browser ...")
url = Path(path).resolve().as_uri()
if sys.platform == 'darwin':
chrome = '/Applications/Google Chrome.app/Contents/MacOS/Google Chrome'
if os.path.exists(chrome):
os.system(f'open -a "{chrome}" "{url}"')
else:
webbrowser.open(url)
elif sys.platform == 'win32':
import subprocess
try:
subprocess.Popen(['start', 'chrome', url], shell=True)
except Exception:
webbrowser.open(url)
else:
try:
os.system(f'google-chrome "{url}"')
except Exception:
webbrowser.open(url)

def find_files_from_log_gpt(log_content):
verbose_print("Invoking LLM to identify implicated files from the log...")
user_prompt = (
"Given this error message or traceback, list all file paths (and, if available, line numbers) "
"involved in the error. Output one JSON per line, as:\n"
'{"file": "path/to/file.py", "line": 123}\n'
'If line is not found, use null.\n'
f"\nError:\n{log_content}"
)
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
llm_resp = client.responses.create(
model="gpt-5",
reasoning={"effort": "low"},
input=[{"role": "user", "content": user_prompt}]
)
output = llm_resp.output_text or ""
results = []
for l in output.splitlines():
l = l.strip()
if not l:
continue
try:
results.append(eval(l, {"null": None}))
except Exception as exc:
verbose_print(f"[File Extraction Skipped Line]: {l!r} ({exc})")
verbose_print(f"LLM File Extraction Result: {results}")
return results

def get_file_snippet(file_path, n_lines=20, line=None):
if not os.path.exists(file_path):
verbose_print(f"[WARN] File not found: {file_path}")
return None
code = []
with open(file_path, "r") as f:
lines = f.readlines()
if line and 1 <= line <= len(lines):
s = max(0, line-6)
e = min(len(lines), line+5)
code = lines[s:e]
else:
code = lines[:n_lines]
return "".join(code)


@weave.op
def suggest_tools(error_message, code_snippet):
import ast, json
verbose_print("Asking LLM: Based on the error and file, which tool to use next?")
prompt = (
"You are an AI debugging orchestrator. The following is a Python error message and a snippet of code "
"from a file involved in the error. Based on this, choose which tools should be used next, and explain why. "
"Possible tools: github_issue_search, web_search, static_analysis. "
"Output a single python dictionary (not JSON, not explanation). Example: "
"{'recommendations':['web_search', 'github_issue_search'], 'justification': 'Searching the web and GitHub can help resolve import errors quickly.'}\n"
"Error:\n" + error_message +
"\n\nFile snippet:\n" + code_snippet +
"\n\nOutput only the dictionary. No preamble or explanation."
"alwqays use the github tool man"
)
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
resp = client.responses.create(
model="gpt-5",
reasoning={"effort": "low"},
input=[{"role": "user", "content": prompt}]
)
output = (resp.output_text or "").strip()
try:
if output.startswith("```") and output.endswith("```"):
output = output[3:-3].strip()
obj = ast.literal_eval(output)
if isinstance(obj, dict):
verbose_print(f"LLM Tool Suggestion: {obj}")
return obj
except Exception:
pass
m = re.search(r'\{.*\}', output, re.DOTALL)
if m:
try:
obj = ast.literal_eval(m.group(0))
if isinstance(obj, dict):
verbose_print(f"LLM Tool Suggestion: {obj}")
return obj
except Exception:
pass
verbose_print(f"LLM Suggestion RAW output (not parsable): {output!r}")
return {"recommendations": [], "justification": 'Could not parse LLM response'}



@weave.op
def final_recommendation_with_gpt5(
error_text: str,
code_snippet: str | None,
tool_suggestion: dict | None,
gh: list | None,
web: dict | None,
query: str,
) -> str:
"""Synthesize a concise, actionable plan from all gathered signals."""
from openai import OpenAI
import json, os

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

gh_brief = []
if gh:
for item in gh[:5]:
gh_brief.append({
"title": item.get("title", ""),
"url": item.get("url", ""),
"summary": item.get("ocr_summary", "")
})

web_brief = {
"answer": (web or {}).get("output_text") if web else None,
"citations": (web or {}).get("citations") if web else None
}

payload = {
"error_text": error_text,
"code_snippet": code_snippet,
"tool_suggestion": tool_suggestion,
"search_query": query,
"github_findings": gh_brief,
"web_findings": web_brief
}

prompt = (
"You are a debugging assistant. Based on the following data, produce a short, actionable plan.\n"
"Include:\n"
"1. Likely root cause in one or two sentences.\n"
"2. Concrete next steps that can be executed now.\n"
"3. If shapes or types are mismatched, propose exact code edits.\n"
"4. If library problems are implicated, propose install or version pin commands.\n"
"5. If no external search is needed, say so and outline local static checks.\n\n"
f"DATA:\n{json.dumps(payload, ensure_ascii=False, indent=2)}\n\n"
"Return a concise plan. No preamble."
)

resp = client.responses.create(
model="gpt-5",
reasoning={"effort": "low"},
input=[{"role": "user", "content": prompt}]
)
return (resp.output_text or "").strip()




@weave.op
def static_analysis_gpt5(error_text: str, code_snippet: str | None) -> dict:
"""
Pure GPT-5 static analysis. No web or GitHub.
Returns a dict with fields: diagnosis, fixes, patch, test_snippet, notes.
"""
from openai import OpenAI
import os, json

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

system = (
"You are a Python static analyzer. Read the error and the code snippet. "
"Find the root cause and propose concrete code edits. "
"If there is a tensor shape mismatch, compute the exact shapes and provide the corrected operation. "
"Return strict JSON with keys: diagnosis, fixes, patch, test_snippet, notes."
)

user = {
"error_text": error_text,
"code_snippet": code_snippet or ""
}

prompt = (
"Analyze the following and return strict JSON only. "
"Do not include commentary outside JSON.\n\n"
f"{json.dumps(user, ensure_ascii=False, indent=2)}\n\n"
"{ \"diagnosis\": \"...\", "
"\"fixes\": [\"...\"], "
"\"patch\": \"diff or edited code\", "
"\"test_snippet\": \"python code to quickly sanity check\", "
"\"notes\": \"short notes\" }"
)

resp = client.responses.create(
model="gpt-5",
reasoning={"effort": "low"},
input=[
{"role": "system", "content": system},
{"role": "user", "content": prompt}
]
)

raw = (resp.output_text or "").strip()
try:
data = json.loads(raw)
except Exception:
data = {
"diagnosis": "Could not parse JSON from model",
"fixes": [],
"patch": "",
"test_snippet": "",
"notes": raw[:500]
}
return data

@weave.op
def main(force_use_all_tools: bool = True):
import os
GITHUB_TOKEN = ""
os.environ.setdefault('OPENAI_API_KEY', "") # set your key here if it is not already set in the environment


error_content = read_log(LOGFILE)
search_query = generate_search_query_openai(error_content) if is_python_error(error_content) \
else error_content.strip().replace("\n", " ")

files_info, snippet, tools = None, None, None
try:
files_info = find_files_from_log_gpt(error_content)
if files_info:
file_to_examine, line_hint = files_info[0].get("file"), files_info[0].get("line")
verbose_print(f"Selected file: {file_to_examine}, line: {line_hint}")
snippet = get_file_snippet(file_to_examine, line=line_hint)
if snippet:
print("\n--- Snippet from implicated file ---\n", flush=True)
print(snippet, flush=True)
print("-" * 60, flush=True)
tools = suggest_tools(error_content, snippet)
print("\n[TOOL RECOMMENDATION]:", tools, flush=True)
else:
verbose_print(f"Could not get snippet from file {file_to_examine}")
else:
verbose_print("Did not find any file to examine in the error.")
except Exception as e:
verbose_print(f"[WARN] File inference failed: {e}")

gh_results = []
web_results = None
static_result = None

# run static analysis
if force_use_all_tools or (tools and "static_analysis" in tools.get("recommendations", [])):
static_result = static_analysis_gpt5(error_content, snippet)

# run github search
if force_use_all_tools or (tools and "github_issue_search" in tools.get("recommendations", [])):
gh_results = search_github(
search_query,
github_token=GITHUB_TOKEN,
error_text=error_content
)

# run web search
if force_use_all_tools or (tools and "web_search" in tools.get("recommendations", [])):
try:
web_results = openai_web_search(search_query)
except Exception as ex:
print(f"[OpenAI] Search failed: {ex}", flush=True)

final_plan = final_recommendation_with_gpt5(
error_text=error_content,
code_snippet=snippet,
tool_suggestion=tools,
gh=gh_results,
web=web_results,
query=search_query
)
print("\n=== FINAL RECOMMENDATION ===\n", final_plan, "\n", flush=True)

html_path, raw_html = write_html_report(
log=error_content,
file_snippet=snippet,
tools=tools,
gh_results=gh_results,
web_results=web_results,
static_result=static_result # pass it so the section renders
)

appended = raw_html.replace(
"</body></html>",
f"<section><h2>Final Recommendation</h2><pre>{html.escape(final_plan or '')}</pre></section></body></html>"
)
with open(html_path, "w", encoding="utf-8") as f:
f.write(appended)

open_html_in_chrome(html_path)
verbose_print("Searches complete. Examine the HTML report in Chrome for summary and results.\n")



if __name__ == "__main__":
# Fix for Windows event loop policy (Playwright + asyncio)
if sys.platform.startswith("win"):
try:
asyncio.set_event_loop_policy(asyncio.WindowsProactorEventLoopPolicy()) # type: ignore[attr-defined]
except Exception:
pass
main(force_use_all_tools=True)


The script starts by reading a Python error log from disk (more on this later). If it looks like a real traceback, GPT-5 turns the noisy text into a short, generalized search query. The prompt removes volatile details such as file paths, line numbers, memory addresses, and run-specific tensor shapes. What remains is the core error phrase and any clear library names, the kind of query that consistently surfaces useful matches.
To use the debugging assistant script in a real development loop, you can wrap your Python command in a small shell function so that it automatically triggers the debugger whenever an error occurs.
For example, you can add the following function to your shell profile (such as .zshrc or .bashrc):
agentpython() {
logfile="/tmp/agentpython-stderr.log"
python "$@" 2> >(tee "$logfile" >&2)
if [[ -s "$logfile" ]]; then
# If logfile is NOT empty, run check script
python /Users/...FULL_PATH_TO/your_debug.py "$logfile"
else
# If logfile is empty, clear it (truncate to zero length)
> "$logfile"
fi
}
After you add the function to your .zshrc or .bashrc, you can load it into your current terminal session without restarting by running: . ~/.zshrc or . ~/.bashrc depending on your system.
Here is how it works:
  • logfile points to a temporary file where all stderr from your Python script will be captured.
  • You run your Python script normally by calling agentpython myscript.py
  • The stderr output is both printed to the terminal and saved into logfile with tee.
  • If the logfile is not empty, meaning an error occurred, it immediately calls the debugging assistant script (your_debug.py in the function above) with the path to the error log.
  • The assistant script then runs the GPT-5 + Weave pipeline, generating the search query, fetching and OCR-ing GitHub issues, summarizing solutions, and building the HTML report.
  • If the logfile is empty (no errors), it is just cleared.
This lets you integrate the GPT-5 debugging flow directly into your normal development process. You run agentpython instead of python, and whenever something breaks, the debugger automatically kicks in, fetches related issues, logs all inputs and outputs to Weave, and opens a report you can immediately use to investigate the problem.
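If you prefer not to touch your shell profile (or you're on Windows, where the .zshrc/.bashrc approach doesn't apply), a small Python wrapper can do the same job. This is a rough sketch under the same assumptions as the shell function above: it runs your target script, saves stderr to the log file, and invokes the debugging assistant (saved here as your_debug.py) only if an error was captured:
import os
import subprocess
import sys
import tempfile

LOGFILE = os.path.join(tempfile.gettempdir(), "agentpython-stderr.log")
DEBUGGER = "your_debug.py"  # path to the debugging assistant script

def main():
    # Run the target script; stdout streams normally, stderr is captured
    proc = subprocess.run(
        [sys.executable, *sys.argv[1:]],
        stderr=subprocess.PIPE,
        text=True,
    )
    stderr = proc.stderr or ""
    sys.stderr.write(stderr)  # still show the error in the terminal (after the run, unlike tee)
    with open(LOGFILE, "w") as f:
        f.write(stderr)
    if stderr.strip():
        # An error occurred: hand the log to the GPT-5 debugging agent
        subprocess.run([sys.executable, DEBUGGER, LOGFILE])

if __name__ == "__main__":
    main()  # usage: python agent_wrapper.py myscript.py [args...]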
To demo the agent, I wrote a small script containing an intentional error:
import torch

a = torch.randn(3, 4) # 3x4
b = torch.randn(5, 6) # 5x6

result = torch.matmul(a, b)

print(result)
and ran the command:
agentpython bad_code.py
This triggered an error, which subsequently triggered our agent:

With the query in hand, the script can branch into multiple paths. It calls the GitHub API for top issues, spinning up a headless Chromium session with Playwright to capture full-page screenshots. These are processed through Tesseract OCR, so even long, image-only threads can be read. GPT-5 then summarizes each OCR extract in the context of the original error, returning a concise cause and fix you can act on immediately.
Here the code searches GitHub for related issues
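For clarity, here's a simplified, synchronous sketch of the screenshot-plus-OCR step for a single issue URL (the agent above does this in parallel with asyncio and a semaphore). It assumes Playwright's Chromium browser and the Tesseract binary are installed:
from playwright.sync_api import sync_playwright
from PIL import Image
import pytesseract

def screenshot_and_ocr(url: str, path: str = "issue.png") -> str:
    # Render the page headlessly and capture a full-page screenshot
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page(viewport={"width": 1920, "height": 1080})
        page.goto(url, timeout=20000)
        page.screenshot(path=path, full_page=True)
        browser.close()
    # Extract the visible text, including text that only appears in code blocks or images
    return pytesseract.image_to_string(Image.open(path))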
If web search is recommended by the tool suggestion model call, GPT-5 queries the live internet as well. It reads relevant pages, merges them with the error context, and produces a direct, actionable answer along with citation links to the sources it relied on. This step expands the scope beyond GitHub, catching solutions from documentation, blog posts, and Q&A forums.
Here, the agent searches the web for related issues.
There is also a pure static analysis path. When selected, GPT-5 reads the traceback and the implicated code snippet entirely offline, then returns strict JSON containing a diagnosis, targeted fixes, a proposed patch, a small test snippet to verify the change, and any supporting notes. This path is well-suited for local problems like tensor shape mismatches or misuse of a library API, where the fix is already in your code.
Every key function is wrapped in @weave.op, so Weave records inputs and outputs for log reading, query generation, GitHub scraping, OCR, web search, summarization, static analysis, and final plan synthesis. You can step through the run in the Weave UI, see exactly how each result was produced, and compare outputs across sessions.



At the end, the script builds a single HTML report and opens it in Chrome. It includes the raw error log, a relevant source snippet if found, GPT-5’s tool recommendations, any static analysis results, GitHub issues with links, screenshots, and summaries, a web search section with AI answers and citations, and finally a short, consolidated recommendation from GPT-5 so you can execute a fix without delay.




Because Weave is logging everything, you can later revisit a debugging run and tweak any part, for example, trying a different GPT-5 reasoning effort level when summarizing fixes, or adjusting the search query prompt, and directly compare results without re-running the entire browser and OCR pipeline. Over time, this builds a library of debugging patterns tied to real model behaviors, which can help refine the process for future issues.

Conclusion

GPT-5 opens up a range of possibilities that go far beyond simple text generation.
In this tutorial, we walked through three very different use cases and showed how they can all benefit from being instrumented with Weave.
  • We saw GPT-5 working with images to describe and create visual content,
  • adjusting its reasoning depth to tackle code challenges, and
  • driving a full debugging pipeline that ties together log parsing, static analysis, GitHub search with OCR, and live web results.
Weave acted as the connective tissue in every step, logging model inputs, outputs, and intermediate artifacts so the entire process is transparent and reproducible. Whether you are exploring creative workflows, running model evaluations, or troubleshooting complex errors, having a full visual history in Weave means you can understand why a result came out the way it did, compare alternative runs, and iterate faster.
By combining GPT-5’s multimodal capabilities with Weave’s observability, you not only get powerful automation but also a permanent, inspectable record of how your tools behave. This makes each run a learning resource you can return to, refine, and reuse, turning experiments and debugging sessions into a growing knowledge base for future work.
Iterate on AI agents and models faster. Try Weights & Biases today.