Tutorial: Evaluating GPT-5 Across Multiple Tasks

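These tutorials show how to use W&B Weave to evaluate GPT-5 on image generation, coding, and automated debugging. This article is a machine translation. If you spot a possible mistranslation, please report it in the comments.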
Created on August 27|Last edited on September 3
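The moment GPT-5 was released, we knew we needed to get hands-on with the new API features and test them.
The tutorials below explore several of GPT-5's more powerful capabilities. We start by experimenting with image description and generation, then compare its performance against other models on real programming problems, and finally see how it can drive an automated Python debugging agent that produces concise summaries of code fixes. Let's get started.

In this article:
Tutorial: Evaluating GPT-5's multimodal inputs and outputs
Tutorial: Evaluating GPT-5's coding ability with Weave Evals
Tutorial: Building a Python debugger agent with GPT-5 in Weave
Conclusion
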
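Tutorial: Evaluating GPT-5's multimodal inputs and outputs
We start by using GPT-5 to evaluate multimodal inputs and outputs.
In this tutorial, we pass an image to GPT-5 along with a request to describe it. The script below sends the image by URL and also hands the loaded PIL image to the function so that Weave logs it. GPT-5 returns a natural-language description, which is fed directly into GPT-5's image generation tool to produce a new image. Both functions are wrapped with @weave.op, so every run is recorded in W&B Weave, including the original image, the prompt, the generated description, and the new output image.
In the Weave UI, you can click any run to inspect the whole flow visually (shown below the code block). This is useful for tracking creative experiments, debugging unexpected outputs, or showing before-and-after transformations.
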
import os
import base64
from io import BytesIO

import requests
from PIL import Image
from openai import OpenAI
import weave

weave.init("gptV_gen_and_desc")

# Read the API key from the environment instead of hardcoding it
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
client = OpenAI(api_key=OPENAI_API_KEY)

@weave.op
def gptV_describe_image_with_url(
    pil_img: Image.Image,
    img_url: str,
    prompt: str = "Describe what is in this image."
) -> str:
    """
    Describe an image that the model fetches from its URL.
    `pil_img` is not sent to the model; it is accepted as an argument
    so that Weave logs the PIL image alongside the call.
    """
    # Build the user message: the text prompt plus the image URL
    inp = {
        "role": "user",
        "content": [
            {"type": "input_text", "text": prompt},
            {"type": "input_image", "image_url": img_url}
        ]
    }
    resp = client.responses.create(
        model="gpt-5",
        input=[inp]
    )
    return resp.output_text


@weave.op
def gpt_generate_image(
    prompt: str,
    size: str = "1024x1024"
) -> Image.Image | None:
    """
    Generate an image from a prompt using GPT-5's image generation tool.
    Returns a PIL image, or None if generation fails.
    Note: `size` is accepted but not currently used.
    """
    prompt = f"Generate an image given the following description: {prompt}"
    print(f"[DEBUG] Generating image with prompt: {prompt}")

    try:
        response = client.responses.create(
            model="gpt-5",
            input=prompt,
            tools=[{"type": "image_generation"}],  # no tool_choice
        )
    except Exception as e:
        print(f"[ERROR] Failed to create response: {e}")
        return None

    try:
        # Each image_generation_call output carries the image as base64 text
        image_data = [
            output.result
            for output in response.output
            if output.type == "image_generation_call"
        ]
        print(f"[DEBUG] Extracted {len(image_data)} image(s)")
    except Exception as e:
        print(f"[ERROR] Failed to extract image data: {e}")
        return None

    if not image_data:
        print("[ERROR] No image_generation_call output in the response")
        return None

    try:
        # Decode once, save a copy to disk, and return a PIL image
        image_bytes = base64.b64decode(image_data[0])
        filename = "generated_image.png"
        with open(filename, "wb") as f:
            f.write(image_bytes)
        print(f"[DEBUG] Image saved to {filename}")
        return Image.open(BytesIO(image_bytes))
    except Exception as e:
        print(f"[ERROR] Failed to save image: {e}")
        return None


# --- Main Example usage ---
if __name__ == "__main__":
    img_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/3/3f/Fronalpstock_big.jpg/800px-Fronalpstock_big.jpg"
    headers = {"User-Agent": "Mozilla/5.0 (compatible; GeminiScript/1.0)"}
    response = requests.get(img_url, headers=headers)
    response.raise_for_status()
    pil_img = Image.open(BytesIO(response.content))

    # 1. DESCRIBE
    desc = gptV_describe_image_with_url(
        pil_img=pil_img,
        img_url=img_url,
        prompt="Describe what is in this image."
    )
    print("\nGPT-V description:")
    print(desc)

    # 2. GENERATE NEW IMAGE FROM DESCRIPTION
    gen_img = gpt_generate_image(desc)
    if gen_img is None:
        # gpt_generate_image returns None on any failure
        raise SystemExit("Image generation failed; see the [ERROR] output above.")
    gen_img.save("gptV_generated.png")
    print("\nGenerated image saved as gptV_generated.png")
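Running the script saves the generated image locally, and Weave keeps a visual record of the whole process on the dashboard. You can filter by prompt text, compare several runs side by side, and share a link to a specific run with teammates.
Here is what it looks like in Weave after running the script.
First, the image is described.
Next, we try generating a similar image.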
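The describe function in the script sends the image by URL. When an image is only available locally, the Responses API also accepts a base64 data URL in the same image_url field. The following is a minimal sketch of building such a message. The helper names to_data_url and build_image_message and the placeholder bytes are illustrative and are not part of the original script.

```python
import base64


def to_data_url(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Encode raw image bytes as a data URL (hypothetical helper)."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def build_image_message(prompt: str, image_bytes: bytes, mime_type: str = "image/png") -> dict:
    """Build a user message in the same shape the describe function uses."""
    return {
        "role": "user",
        "content": [
            {"type": "input_text", "text": prompt},
            {"type": "input_image", "image_url": to_data_url(image_bytes, mime_type)},
        ],
    }


if __name__ == "__main__":
    # Placeholder bytes stand in for real image data read from disk.
    message = build_image_message("Describe what is in this image.", b"abc")
    print(message["content"][1]["image_url"])
```

The resulting dictionary could be passed as an element of the input list in place of the URL-based message, assuming the bytes really are an image of the stated MIME type.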
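Tutorial: Evaluating GPT-5's coding ability with Weave Evals
Next, we move on to evaluating GPT-5 on coding. The model lets you set how much thinking it spends on a task (its reasoning effort), so you can trade speed against accuracy. In this run, we evaluated three models on 30 problems from the DeepMind Code Contests dataset, asking each to write a Python function for a competitive programming problem. GPT-OSS-120B was tested with high reasoning effort. GPT-5 ran with low reasoning effort to see how it performs under tight constraints. Claude 4.1 Opus ran with a 4k-token thinking budget, which is relatively low for its capabilities.
Using EvaluationLogger lets you design the evaluation flow however you like. Here, GPT-5, Claude 4.1 Opus, and GPT-OSS-120B each generate Python code for every problem, and that code is then actually executed in a subprocess. It is run against the problem's public test cases, and the output is compared with the expected result to decide correctness. The time each run takes is recorded as well.
All of this is logged to Weave, so for each problem you can inspect the original prompt, the generated code, the exact test inputs, the runtime output, and the pass/fail result. That makes it easy to compare low and high reasoning-effort settings by looking at both the accuracy metrics and the concrete code differences behind them. Here is the code: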
import os
import sys
import re
import json
import time
import random
import subprocess

import numpy as np
import requests
from datasets import load_dataset
from litellm import completion as oai_completion
from anthropic import Anthropic
from google import genai
from google.genai import types
from openai import OpenAI
import weave
from weave.flow.eval_imperative import EvaluationLogger


weave.init("codecontests_eval")
﻿
﻿
# API keys
# os.environ["OPENAI_API_KEY"] = os.getenv("OPENAI_API_KEY", "sk-...")
CLAUDE_KEY = os.getenv("CLAUDE_API_KEY", "")
GENAI_KEY = os.getenv("GOOGLE_API_KEY", "")
OPENROUTER_KEY = os.getenv("OPENROUTER_API_KEY", "")
﻿
﻿
# Clients
anthropic_client = Anthropic(api_key=CLAUDE_KEY)
gemini_client = genai.Client(api_key=GENAI_KEY)
openrouter_client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=OPENROUTER_KEY,
)
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))  # read the key from the environment
﻿
﻿
﻿
def clean_llm_code_block(text):
    import re
﻿
﻿
    cleaned_text = text.replace("```python", "").replace("```", "").strip()
    code_blocks = re.findall(r"(def solve\(.*?)(?=^def |\Z)", cleaned_text, re.DOTALL | re.MULTILINE)
    source_text = code_blocks[-1] if code_blocks else cleaned_text
﻿
﻿
    prompt = (
        "Given the following response from a language model, extract ONLY the valid Python code for the function. "
        "Do not include any explanations, text, or formatting fences. Only the code.\n\n"
        f"Response:\n{source_text}\n\n"
        "Return ONLY the Python code, including any necessary imports:"
    )
﻿
﻿
    response = oai_completion(
        model="openai/gpt-4o-2024-08-06",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
﻿
﻿
    gpt4o_code = response["choices"][0]["message"]["content"]
    gpt4o_code = gpt4o_code.replace("```python", "").replace("```", "").strip()
    return gpt4o_code
﻿
﻿
@weave.op()
def generate_completion(model: str, prompt: str, effort: str = "low") -> str:
    if model.startswith("openai/"):
        response = client.responses.create(
            model=model.replace("openai/", ""),
            reasoning={"effort": effort},
            input=[
                {"role": "user", "content": prompt}
            ]
        )
        return response.output_text.strip()
﻿
﻿
    elif model.startswith("anthropic/"):
        response = anthropic_client.messages.create(
            model=model.replace("anthropic/", ""),
            max_tokens=8000,
            thinking={"type": "enabled", "budget_tokens": 4000},
            messages=[{"role": "user", "content": prompt}],
        )
        for block in response.content:
            if block.type == "text":
                return block.text.strip()
        return "[No Claude response]"
﻿
﻿
    elif model.startswith("gemini/"):
        result = gemini_client.models.generate_content(
            model=model.replace("gemini/", ""),
            config=types.GenerateContentConfig(
                thinking_config=types.ThinkingConfig(thinking_budget=4000)
            ),
            contents=[prompt]
        )
        return result.text.strip() if result.text else "[No Gemini response]"
﻿
﻿
    elif model.startswith("openrouter/"):
﻿
        url = "https://openrouter.ai/api/v1/chat/completions"
        headers = {
            "Authorization": f"Bearer {OPENROUTER_KEY}",
            "Content-Type": "application/json"
        }
        payload = {
            "model": model.replace("openrouter/", ""),
            "messages": [
                {"role": "system", "content": "Reasoning: high"},
                {"role": "user", "content": prompt}
            ],
        }
        response = requests.post(url, headers=headers, data=json.dumps(payload))
        resp_json = response.json()
        if 'choices' in resp_json and resp_json['choices']:
            # To get the reasoning, use: resp_json['choices'][0]['message'].get('reasoning')
            return resp_json['choices'][0]['message'].get('content', '[No answer found]')
        else:
            return "[No choices found in OSS response]"
﻿
    else:
        raise ValueError(f"Unsupported model: {model}")
    
﻿
﻿
﻿
﻿
def ask_llm_for_function_implementation(description: str, model: str, effort: str | None = None) -> str:
    prompt = (
        f"Write a Python3 function named `solve` with typed input arguments for this problem -- eg the solve function should take arguments to handle different test cases:\n\n"
        f"{description.strip()}\n\n"
        "Return only a valid Python function -- no special packages that arent commonly used and NO MAIN function, no  if __name__ == __main__....., JUST write the function --  that returns the result. No comments, no explanations."
        f"HOWEVER, you still need to include necessary imports for libraries"
        f"IF you do not include the right imports, the code will not be executable, and your response will be judged as incorrect!"
    )
    # Pass effort only to OpenAI via generate_completion when provided
    if effort is not None and model.startswith("openai/"):
        return clean_llm_code_block(generate_completion(model, prompt, effort=effort))
    else:
        return clean_llm_code_block(generate_completion(model, prompt))
﻿
﻿
﻿
﻿
@weave.op
def ask_llm_for_function_call(code: str, raw_input: str, model: str) -> str:
﻿
﻿
    prompt = (
        "You're given a Python function and a single input string. "
        "Format it into a valid Python function call using only standard types.\n\n"
        f"Function:\n{code}\n\n"
        f"Input:\n{raw_input.strip()}\n\n"
        "Return ONLY a valid function call (e.g., solve(3, 5)) WITH NO 'def' "
    )
﻿
﻿
    # Always use GPT-4o for this inference, regardless of the `model` argument.
    response = oai_completion(
        model="openai/gpt-4o-2024-08-06",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    # The LLM may return markdown code blocks; strip them just in case.
    content = response["choices"][0]["message"]["content"]
    content = content.replace("```python", "").replace("```", "").strip()
    return content
﻿
﻿
def compare_output_with_llm(expected: str, actual: str, model: str) -> bool:
    prompt = (
        f"Expected output: {expected.strip()}\n"
        f"Actual output: {actual.strip()}\n\n"
        "Are these outputs equivalent? Eg ignore minor formatting errors etc, we are just looking for overall correctness in the output Reply YES or NO."
    )
    # response = generate_completion(model, prompt)
﻿
﻿
    response = oai_completion(
        model="openai/gpt-4o-2024-08-06",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    # The LLM may return markdown code blocks; strip them just in case.
    res = 'YES' in str(response["choices"][0]["message"]["content"]).upper()
    return res 
﻿
﻿
﻿
def run_code_and_call_function(code: str, function_call: str, timeout=10):
    full_code = code + f"\n\nprint({function_call})"
    try:
        start = time.time()
        result = subprocess.run(
            [sys.executable, "-c", full_code],
            capture_output=True,
            text=True,
            timeout=timeout
        )
        latency = time.time() - start
        return result.stdout.strip(), result.stderr.strip(), latency
    except subprocess.TimeoutExpired:
        return "", "Execution timed out.", timeout
    except Exception as e:
        return "", str(e), 0.0
﻿
﻿
﻿
﻿
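def ask_model_for_pip_command(error_msg):
    prompt = (
        "Given this Python error:\n\n"
        + error_msg +
        "\n\nWrite the pip install command needed to fix it. Only return the command, e.g.:\n"
        "pip install requests"
    )
    # Call GPT-4o through litellm, as the other helpers do. Routing through
    # generate_completion would attach a reasoning-effort setting, which is
    # intended for reasoning models such as GPT-5.
    response = oai_completion(
        model="openai/gpt-4o-2024-08-06",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    return response["choices"][0]["message"]["content"].strip()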
﻿
﻿
﻿
def run_pip_install(pip_command):
    print(f"Running: {pip_command}")
    try:
        result = subprocess.run(
            pip_command.split(),
            capture_output=True,
            text=True,
            timeout=180
        )
        print(result.stdout.strip())
        if result.stderr:
            print(result.stderr.strip())
    except Exception as e:
        print(f"pip install failed: {e}")
﻿
﻿
﻿
def evaluate_model_on_code_contests(model_name: str, reasoning_effort: str | None = None):
    print(f"\n\nRunning evaluation for model: {model_name}\n")
﻿
    random.seed(42)
    np.random.seed(42)
    ds = load_dataset("deepmind/code_contests", split="test", streaming=True)
    ds = list(ds.take(31))
﻿
﻿
    # Build sanitized model identifier for Weave, including reasoning effort if provided
    model_id = model_name.replace("-", "_").replace("/", "_").replace(".", "_")
    if reasoning_effort:
        effort_id = str(reasoning_effort).replace("-", "_").replace("/", "_").replace(".", "_")
        model_id = f"{model_id}__{effort_id}"
﻿
    eval_logger = EvaluationLogger(
        model=model_id,
        dataset="code_contests_test"
    )
    all_latencies = []
﻿
﻿
    for i in range(30):
        row = ds[i]
        description = row["description"]
        raw_inputs = row["public_tests"]["input"]
        expected_outputs = row["public_tests"]["output"]
﻿
﻿
        try:
            # Forward reasoning_effort only to OpenAI generate_completion
            code = ask_llm_for_function_implementation(
                description,
                model=model_name,
                effort=reasoning_effort if (reasoning_effort and model_name.startswith("openai/")) else None,
            )
            print(f"\n=== Task {row['name']} ===", flush=True)
            # print("Generated code:\n", code)
﻿
﻿
            all_passed = True
            task_latencies = []
            results_lst, expected_lst = [], []
﻿
﻿
            for j, raw_input in enumerate(raw_inputs):
                expected = expected_outputs[j] if j < len(expected_outputs) else ""
﻿
﻿
                try:
                    function_call = ask_llm_for_function_call(code, raw_input, model=model_name)
                    result, error, latency = run_code_and_call_function(code, function_call)
                    if latency < 99:
                        task_latencies.append(latency)
﻿
﻿
                
                    if error:
                        print(f"[{j}] Runtime error: {error}")
                        if "ModuleNotFoundError" in error:
                            pip_cmd = ask_model_for_pip_command(error)
                            run_pip_install(pip_cmd)
                            # Re-run once after pip install
                            result, error, latency = run_code_and_call_function(code, function_call)
                            
                            task_latencies.append(latency)
                            if error:
                                print(f"[{j}] Retry failed: {error}")
                                all_passed = False
                                continue
                        else:
                            all_passed = False
                            continue
﻿
﻿
                    is_correct = compare_output_with_llm(expected, result, model="openai/gpt-4o-2024-08-06")
                    results_lst.append(result)
                    expected_lst.append(expected)
                    if not is_correct:
                        all_passed = False
                    print(f"[{j}] input: {raw_input.strip()} → output: {result} | expected: {expected.strip()} | PASS: {is_correct} | latency: {latency:.2f}s")
﻿
﻿
                except Exception as inner:
                    print(f"[{j}] Inner error: {repr(inner)}")
                    all_passed = False
            
            task_avg_latency = sum(task_latencies) / len(task_latencies) if len(task_latencies) > 0 else 0.0
            all_latencies.extend(task_latencies)
﻿
﻿
            prediction_log = eval_logger.log_prediction(
                inputs={"description": description},
                output={'code': code, 'execution_result': results_lst, 'expected_execution_result': expected_lst}
            )
            prediction_log.log_score("correctness", all_passed)
            prediction_log.log_score("code_latency", task_avg_latency)
            prediction_log.finish()
﻿
﻿
        except Exception as e:
            print(f"[{i}] Top-level failure: {repr(e)}")
            prediction_log = eval_logger.log_prediction(
                inputs={"description": description},
                output=str(e)
            )
            prediction_log.log_score("correctness", False)
            prediction_log.finish()
﻿
﻿
    avg_latency = sum(all_latencies) / len(all_latencies) if all_latencies else 0.0
    eval_logger.log_summary({"avg_code_latency": avg_latency})
    print(f"Evaluation complete for {model_name}. View in Weave UI.")
﻿
﻿
﻿
﻿
# Run for all models
﻿
﻿
evaluate_model_on_code_contests("openrouter/openai/gpt-oss-120b")
evaluate_model_on_code_contests("anthropic/claude-opus-4-1-20250805")
evaluate_model_on_code_contests("openai/gpt-5", reasoning_effort='low')
﻿
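This code initializes a Weave project to log everything and streams a subset of the DeepMind Code Contests test set. For each problem, it asks the selected model to write a Python function named solve. The generate_completion router supports multiple providers, and for OpenAI models you can specify the reasoning effort. Because models tend to wrap their answers in Markdown, clean_llm_code_block strips the fences and then asks GPT-4o to return only the executable code.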
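The fence-stripping part of that cleanup can be sketched without a second model call. The version below is a simplified, regex-only stand-in and not the clean_llm_code_block function used in the evaluation: it returns the body of the last fenced block if one exists and otherwise returns the text unchanged. The name extract_code is hypothetical.

```python
import re

# Built at runtime so this example can itself sit inside a fenced block.
FENCE = "`" * 3

# Opening fence with an optional language tag, then capture up to the closing fence.
_BLOCK_RE = re.compile(FENCE + r"[a-zA-Z0-9_+-]*[ \t]*\n(.*?)" + FENCE, re.DOTALL)


def extract_code(text: str) -> str:
    """Return the body of the last fenced block, or the stripped text if none is found."""
    blocks = _BLOCK_RE.findall(text)
    return (blocks[-1] if blocks else text).strip()


if __name__ == "__main__":
    reply = (
        "Here is the solution:\n"
        + FENCE + "python\n"
        + "def solve(a: int, b: int) -> int:\n"
        + "    return a + b\n"
        + FENCE + "\nHope this helps."
    )
    print(extract_code(reply))
```

A regex like this is cheap and deterministic, but it cannot repair replies that mix prose and code outside fences, which is presumably why the original script adds a model-based extraction pass.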
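Next, for each public test input, the script asks GPT-4o to turn the raw example into a concrete function call, runs the candidate code in a subprocess, and measures the latency. If execution fails because a package is missing, it asks a model for the exact pip install command, installs the package, and retries once. Outputs are compared with the expected answer using a lenient GPT-4o check, so that minor formatting differences do not unfairly lower the accuracy.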
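The script relies on GPT-4o alone to judge equivalence. One way to reduce judge calls, which the original evaluation does not do, is to try a deterministic whitespace-insensitive comparison first and fall back to the judge only on a mismatch. A minimal sketch follows; outputs_match and the llm_judge callback are hypothetical names.

```python
from typing import Callable, Optional


def normalize(output: str) -> list[str]:
    """Split output into whitespace-separated tokens so spacing and line breaks are ignored."""
    return output.split()


def outputs_match(
    expected: str,
    actual: str,
    llm_judge: Optional[Callable[[str, str], bool]] = None,
) -> bool:
    """Cheap token-for-token check first; defer to an optional judge only on mismatch."""
    if normalize(expected) == normalize(actual):
        return True
    if llm_judge is not None:
        return llm_judge(expected, actual)
    return False


if __name__ == "__main__":
    # Same tokens, different spacing and trailing newline
    print(outputs_match("1 2\n3\n", "1  2 \n3"))
    # Differs in letter case and no judge is supplied
    print(outputs_match("YES", "yes"))
```

In the evaluation script, a function such as compare_output_with_llm could be passed as llm_judge (wrapped to fix its model argument), so the lenient behavior is preserved for outputs that are not token-identical.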
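Each task is logged to Weave's EvaluationLogger with its inputs, generated code, per-test outputs, correctness, and timing. At the end, a summary that includes the average execution latency is logged as well. This gives you a reproducible evaluation in Weave where you can drill into each task and inspect the prompt, code, calls, outputs, and pass/fail status across models and reasoning-effort settings.
Once the evaluation finishes, open the Weave UI to look at the details. The evaluation viewer shows aggregate statistics such as accuracy and also lets you click into individual samples to see the prompt, the model's full response, the scores, and any other metrics you logged. If you ran several models or configurations, Weave lines the results up so you can compare outputs directly. That is very useful for deciding whether the extra latency from higher reasoning effort is worth the accuracy gain for your use case. Here are my evaluation results.
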
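In the chart, GPT-5 (low reasoning effort) scored highest with a correctness of 0.733. GPT-OSS-120B (high reasoning effort) was not far behind at 0.633. Claude 4.1 Opus (4k-token thinking budget) reached only 0.333, with a moderate code latency of 0.620.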
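Because correctness is logged as one boolean per task, the reported fractions can be mapped back to approximate task counts. The sketch below assumes all 30 tasks were scored for each model. That is an assumption: the counts are inferred from the rounded fractions, not read from the logs.

```python
TOTAL_TASKS = 30  # number of problems per model, as stated in the article

# Correctness fractions reported in the results paragraph above
reported = {
    "GPT-5 (low effort)": 0.733,
    "GPT-OSS-120B (high effort)": 0.633,
    "Claude 4.1 Opus (4k budget)": 0.333,
}


def implied_passes(fraction: float, total: int = TOTAL_TASKS) -> int:
    """Nearest whole number of passing tasks consistent with a rounded fraction."""
    return round(fraction * total)


if __name__ == "__main__":
    for name, fraction in reported.items():
        passes = implied_passes(fraction)
        print(f"{name}: about {passes}/{TOTAL_TASKS} tasks = {passes / TOTAL_TASKS:.3f}")
```

With only 30 tasks, one task is worth about 0.033 of correctness, so small gaps between models should be read with that granularity in mind.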
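Tutorial: Building a Python debugger agent with GPT-5 in Weave
Finally, we look at GPT-5 as a debugging assistant that ingests error logs, cross-references external context, and proposes actionable fixes. Every step is wrapped with @weave.op so you can replay the run history, inspect intermediate results, and compare outputs later. Once setup is complete, run:
agentpython myscript.py
This runs the script and, if an error occurs, automatically launches the agent. During the run, the following happens.
1. Log capture and cleanup. The agent first reads the error log saved by a shell alias or function that pipes stderr to a file. When it detects a Python traceback, GPT-5 strips volatile details such as file paths, line numbers, memory addresses, and run-specific shapes, keeping only the essentials: the error type, the message, and any clearly identifiable library name.
2. GitHub search with OCR. Using the cleaned-up query, the script searches the GitHub API for matching issues, opens each result in a headless browser, takes a full-page screenshot, and runs OCR to extract the entire discussion, including code blocks and image-only replies. GPT-5 then summarizes each thread in the context of the original error, so you can see likely causes and fixes right away.
3. Web search and static analysis. When web search is enabled, GPT-5 searches the internet for up-to-date information, combines the error context with the relevant results, and returns an actionable answer with citations. Alternatively (or in parallel), the agent can run a fully offline static analysis that reads the offending source file and the traceback and proposes specific fixes, a quick test snippet, and any relevant install or version-pinning commands.
4. Final HTML report. At the end, an HTML report is generated that brings together the original error log, relevant source excerpts, GPT-5's tool recommendations, the static analysis results, linked GitHub issues with screenshots and summaries, and web search results with citations. The final recommendations are placed at the end so you can act on them immediately. The report opens automatically in Chrome, and because Weave records every input and output, you can come back later to adjust prompts or rerun only some steps without repeating expensive searches or OCR.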
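Step 1 asks GPT-5 to strip run-specific details from the error before searching. As a rough offline approximation of that cleanup (a regex sketch, not the prompt-based method the agent uses), the function below keeps the final exception line of a traceback and masks memory addresses, file paths, line numbers, and parenthesized numeric shapes. The name generalize_error and the sample traceback are illustrative.

```python
import re


def generalize_error(log_text: str) -> str:
    """Reduce a traceback to its final exception line with volatile details removed."""
    lines = [ln.strip() for ln in log_text.strip().splitlines() if ln.strip()]
    if not lines:
        return ""
    query = lines[-1]  # the last non-empty line holds "ErrorType: message"
    query = re.sub(r"0x[0-9a-fA-F]+", "", query)  # memory addresses
    query = re.sub(r"(?:[A-Za-z]:)?(?:[\\/][\w.\-]+)+", "", query)  # file paths
    query = re.sub(r"\bline \d+\b", "", query)  # line numbers
    query = re.sub(r"\(\s*\d+(?:\s*[,x]\s*\d+)*\s*,?\s*\)", "", query)  # shapes like (3, 4)
    return re.sub(r"\s+", " ", query).strip()  # collapse leftover whitespace


if __name__ == "__main__":
    sample = (
        "Traceback (most recent call last):\n"
        '  File "/home/u/train.py", line 42, in <module>\n'
        "    out = a + b\n"
        "ValueError: operands could not be broadcast together with shapes (3,4) (5,6) \n"
    )
    print(generalize_error(sample))
```

Fixed patterns like these miss many cases that a model handles, such as deciding which library name to add, so this is better suited as a fallback when no API key is available than as a replacement for the GPT-5 step.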
import os
import sys
import re
import requests
import tempfile
import webbrowser
import html
import asyncio
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
from playwright.async_api import async_playwright
from PIL import Image
import pytesseract
﻿
from openai import OpenAI
import weave; weave.init("gpt5_agent")
﻿
﻿
# -------- CONFIG --------
LOGFILE = sys.argv[1] if len(sys.argv) > 1 else "/tmp/agentpython-stderr.log"
OUTPUT_DIR = "github_screenshots"
PARALLEL_PAGE_LOADS = 6   # how many pages to screenshot at once
OCR_WORKERS = min(8, (os.cpu_count() or 4))
# ------------------------
﻿
def verbose_print(msg):
    print(f"\033[95m[LOG] {msg}\033[0m", flush=True)
﻿
def read_log(logfile):
    verbose_print(f"Reading from log file: {logfile}")
    if not os.path.exists(logfile) or os.path.getsize(logfile) == 0:
        print("[LOG] Log file empty or not found. No action needed.", flush=True)
        sys.exit(0)
    with open(logfile) as f:
        content = f.read()
    print(f"\n--- Log Content ---\n{content}\n{'-'*40}", flush=True)
    return content
﻿
def is_python_error(txt):
    if "Traceback (most recent call last):" in txt or "Exception" in txt or "Error" in txt:
        verbose_print("Looks like a Python error.")
        return True
    verbose_print("Not detected as Python error (using fallback toolchain).")
    return False
﻿
@weave.op 
def generate_search_query_openai(error_str):
    verbose_print("Generating generalized search query using gpt-5 (OpenAI)...")
    client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
    gpt_response = client.responses.create(
        model="gpt-5",
        reasoning={"effort": "low"},
        input=[
            {
                "role": "user",
                "content": (
                    "You are generating a GitHub search query from an error message. "
                    "Your goal is to create a generic query that will return relevant results from GitHub issues across many repositories. "
                    "Do NOT include overly specific details that would narrow results too much, such as:\n"
                    "- File paths\n"
                    "- Line numbers\n"
                    "- Exact tensor shapes, array sizes, or specific numeric values in parentheses\n"
                    "- Memory addresses\n"
                    "- Random seeds or run-specific values\n\n"
                    "Instead:\n"
                    "- Keep only the key error type and descriptive text\n"
                    "- Include the relevant library name if obvious (e.g., torch, numpy, pandas)\n"
                    "- Use quotes for the core error message if helpful\n\n"
                    "Output only the final search query string. No explanation, no extra words.\n\n"
                    f"Error:\n{error_str}"
                )
            }
        ]
    )
    query = (gpt_response.output_text or "").strip()
    print("Generated search query:", repr(query), flush=True)
    return query
﻿
async def _screenshot_one(page, url, path):
    try:
        await page.goto(url, timeout=20000)
        await page.set_viewport_size({"width": 1920, "height": 1080})
        await page.screenshot(path=path, full_page=True)
        verbose_print(f"[+] Screenshot saved: {path}")
        return True
    except Exception as e:
        verbose_print(f"[!] Failed screenshot for {url}: {e}")
        return False
﻿
async def capture_screenshots_parallel(urls, out_dir, concurrency=6):
    os.makedirs(out_dir, exist_ok=True)
    results = [None] * len(urls)
﻿
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context()
﻿
        sem = asyncio.Semaphore(concurrency)
        async def worker(i, url):
            path = os.path.join(out_dir, f"issue_{i+1}.png")
            async with sem:
                page = await context.new_page()
                ok = await _screenshot_one(page, url, path)
                await page.close()
                results[i] = path if ok else None
﻿
        tasks = [asyncio.create_task(worker(i, url)) for i, url in enumerate(urls)]
        await asyncio.gather(*tasks)
        await browser.close()
﻿
    return results  # list of file paths (or None)
﻿
def run_ocr(image_path):
    if not image_path or not os.path.exists(image_path):
        return ""
    try:
        img = Image.open(image_path)
        text = pytesseract.image_to_string(img)
        # save alongside
        txt_path = image_path.rsplit(".", 1)[0] + ".txt"
        with open(txt_path, "w", encoding="utf-8") as f:
            f.write(text)
        return text
    except Exception as e:
        verbose_print(f"[!] OCR failed for {image_path}: {e}")
        return ""
@weave.op 
def summarize_with_gpt5(error_text, github_text):
    if not github_text.strip():
        return "[No OCR text to summarize]"
    client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
    resp = client.responses.create(
        model="gpt-5",
        reasoning={"effort": "low"},
        input=[
            {
                "role": "user",
                "content": (
                    "You are assisting in debugging. The following is a Python error message, "
                    "and then OCR-extracted text from a GitHub issue discussing it. "
                    "Summarize the most likely cause and solution in a few sentences. "
                    "Only include relevant fix instructions. Be concise.\n\n"
                    f"Error:\n{error_text}\n\nGitHub Issue Content:\n{github_text}"
                )
            }
        ]
    )
    return (resp.output_text or "").strip()
﻿
﻿
@weave.op 
def search_github(query, github_token=None, owner=None, repo=None, error_text=None):
    verbose_print(f"Searching GitHub issues for: {query!r}")
    url = 'https://api.github.com/search/issues'
    headers = {'Accept': 'application/vnd.github.v3+json'}
    if github_token:
        headers['Authorization'] = f'token {github_token}'
    if owner and repo:
        gh_query = f'repo:{owner}/{repo} is:issue {query}'
    else:
        gh_query = query
    params = {'q': gh_query, 'per_page': 5}
    resp = requests.get(url, headers=headers, params=params, timeout=30)  # avoid hanging on a stalled connection
    if resp.status_code != 200:
        print(f"[GitHub] Search failed: {resp.status_code} {resp.text}", flush=True)
        return []
﻿
    items = resp.json().get('items', [])
    if not items:
        print("[GitHub] No results found.", flush=True)
        return []
﻿
    issue_urls = [it.get('html_url', '') for it in items]
    # Parallel screenshots
    verbose_print("Capturing GitHub issues as screenshots in parallel...")
    screenshots = asyncio.run(capture_screenshots_parallel(issue_urls, OUTPUT_DIR, PARALLEL_PAGE_LOADS))
﻿
    # Parallel OCR
    verbose_print("Running OCR on screenshots in parallel...")
    ocr_texts = [""] * len(screenshots)
    with ThreadPoolExecutor(max_workers=OCR_WORKERS) as ex:
        futures = {ex.submit(run_ocr, path): i for i, path in enumerate(screenshots)}
        for fut in as_completed(futures):
            i = futures[fut]
            try:
                ocr_texts[i] = fut.result() or ""
            except Exception as e:
                verbose_print(f"[!] OCR worker error for index {i}: {e}")
                ocr_texts[i] = ""
﻿
    # Summarize in parallel
    gh_results = []
    summaries = [""] * len(items)
﻿
    def _summarize_idx(i: int) -> str:
        return summarize_with_gpt5(error_text or query, ocr_texts[i])
﻿
    max_workers = min(8, len(items)) if items else 0
    if max_workers > 0:
        with ThreadPoolExecutor(max_workers=max_workers) as ex:
            future_map = {ex.submit(_summarize_idx, i): i for i in range(len(items))}
            for fut in as_completed(future_map):
                i = future_map[fut]
                try:
                    summaries[i] = fut.result() or ""
                except Exception as e:
                    summaries[i] = f"[summarize error: {e}]"
﻿
    for idx, item in enumerate(items):
        summary = summaries[idx]
        issue_info = {
            "number": item.get("number", "?"),
            "title": item.get("title", ""),
            "url": item.get("html_url", ""),
            "body": (item.get("body", "") or "")[:600] + ("..." if item.get("body") and len(item["body"]) > 600 else ""),
            "ocr_summary": summary,
            "screenshot": screenshots[idx] or ""
        }
        gh_results.append(issue_info)
        print("=" * 60, flush=True)
        print(f"Issue #{issue_info['number']}: {issue_info['title']}", flush=True)
        print(f"URL: {issue_info['url']}", flush=True)
        print(f"Screenshot: {issue_info['screenshot']}", flush=True)
        print(f"Solution Summary: {summary}", flush=True)
    print("=" * 60, flush=True)
﻿
    return gh_results
﻿
﻿
@weave.op 
def openai_web_search(query):
    verbose_print(f"Querying OpenAI gpt-5 web search for: {query!r}")
    client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
    search_response = client.responses.create(
        model="gpt-5",
        tools=[{"type": "web_search_preview"}],
        reasoning={"effort": "low"},
        input=query
    )
    print("\n=== [OpenAI] Web Search AI Answer ===", flush=True)
    print(search_response.output_text, flush=True)
    links = re.findall(r'\[([^\]]+)\]\((https?://[^\)]+)\)', search_response.output_text or "")
    link_objs = []
    if links:
        for title, url in links:
            link_objs.append({'title': title, 'url': url})
    else:
        print("No citations found in output_text.", flush=True)
    return {'output_text': search_response.output_text, 'citations': link_objs}
﻿
﻿
@weave.op
def write_html_report(
    log,
    file_snippet,
    tools,
    gh_results,
    web_results,
    static_result=None,
    out_path=None
):
    """Write the HTML debug report and return both the path and the raw HTML."""
    verbose_print("Writing HTML report ...")
    out_path = out_path or os.path.join(tempfile.gettempdir(), 'dbg_report.html')
    css = """
    body { font-family: 'Segoe UI', sans-serif; background: #f5f7fa; color: #333; margin: 0; padding: 0; }
    header { background: #1e293b; color: white; padding: 20px; text-align: center; font-size: 1.5em; }
    section { padding: 20px; margin: 20px; background: white; border-radius: 8px; box-shadow: 0 2px 6px rgba(0,0,0,0.1); }
    h2 { border-bottom: 2px solid #e5e7eb; padding-bottom: 5px; margin-bottom: 10px; color: #1f2937; }
    pre { background: #0f172a; color: #e2e8f0; padding: 15px; border-radius: 6px; overflow-x: auto; font-size: 0.9em; }
    a { color: #2563eb; text-decoration: none; }
    a:hover { text-decoration: underline; }
    .gh-issue { border: 1px solid #e5e7eb; padding: 10px; border-radius: 6px; margin-bottom: 16px; background: #f9fafb; }
    .shot { margin: 8px 0; display: block; max-width: 100%; border: 1px solid #e5e7eb; border-radius: 6px; }
    .label { font-weight: 600; color: #111827; }
    """
    html_parts = []
    html_parts.append(f"<html><head><meta charset='utf-8'><title>Debug Results</title><style>{css}</style></head><body>\n")
    html_parts.append("<header>Debugging Session Report</header>\n")
    html_parts.append("<section><h2>Error Log</h2>")
    html_parts.append(f"<pre>{html.escape(log or 'None')}</pre></section>")
    if file_snippet:
        html_parts.append("<section><h2>Relevant Source Snippet</h2>")
        html_parts.append(f"<pre>{html.escape(file_snippet)}</pre></section>")
    if tools:
        html_parts.append("<section><h2>LLM Tool Recommendations</h2>")
        html_parts.append(f"<pre>{html.escape(str(tools))}</pre></section>")
    if static_result:
        html_parts.append("<section><h2>Static Analysis</h2>")
        diag = static_result.get("diagnosis", "")
        fixes = "\n".join(static_result.get("fixes", []) or [])
        patch = static_result.get("patch", "")
        test_snip = static_result.get("test_snippet", "")
        notes = static_result.get("notes", "")
        html_parts.append(f"<div class='label'>Diagnosis</div><pre>{html.escape(diag)}</pre>")
        if fixes:
            html_parts.append(f"<div class='label'>Proposed Fixes</div><pre>{html.escape(fixes)}</pre>")
        if patch:
            html_parts.append(f"<div class='label'>Proposed Patch</div><pre>{html.escape(patch)}</pre>")
        if test_snip:
            html_parts.append(f"<div class='label'>Quick Test</div><pre>{html.escape(test_snip)}</pre>")
        if notes:
            html_parts.append(f"<div class='label'>Notes</div><pre>{html.escape(notes)}</pre>")
        html_parts.append("</section>")
    if gh_results:
        html_parts.append("<section><h2>GitHub Related Issues</h2>")
        for res in gh_results:
            html_parts.append(f"<div class='gh-issue'><div class='label'>#{res['number']}: {html.escape(res['title'])}</div>")
            # Escape the URL too, and tolerate issues whose body or summary is missing
            issue_url = html.escape(res['url'], quote=True)
            html_parts.append(f"<a href='{issue_url}'>{issue_url}</a><br>")
            html_parts.append(f"<div class='label'>Issue Preview</div><pre>{html.escape(res.get('body') or '')}</pre>")
            html_parts.append(f"<div class='label'>Solution Summary</div><pre>{html.escape(res.get('ocr_summary') or '')}</pre></div>")
        html_parts.append("</section>")
    if web_results:
        html_parts.append("<section><h2>Web Search AI Answer</h2>")
        html_parts.append(f"<pre>{html.escape(web_results.get('output_text', ''))}</pre>")
        if web_results.get('citations'):
            html_parts.append("<ul>")
            for c in web_results['citations']:
                html_parts.append(f"<li><a href='{html.escape(c['url'], quote=True)}'>{html.escape(c['title'])}</a></li>")
            html_parts.append("</ul>")
        html_parts.append("</section>")
    html_parts.append("</body></html>")
    raw_html = ''.join(html_parts)
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(raw_html)
    verbose_print(f"HTML written at: {out_path}")
    return out_path, raw_html
﻿
def open_html_in_chrome(path):
    """Open the report in Chrome when available, otherwise in the default browser."""
    import subprocess
    verbose_print("Opening HTML report in browser ...")
    url = Path(path).resolve().as_uri()
    try:
        if sys.platform == 'darwin':
            # `open -a` expects an application name or .app bundle, not the inner binary
            subprocess.run(['open', '-a', 'Google Chrome', url], check=True)
        elif sys.platform == 'win32':
            # `start` is a cmd.exe builtin, so it needs the shell; "" is its window-title argument
            subprocess.run(f'start "" chrome "{url}"', shell=True, check=True)
        else:
            # Popen does not block while Chrome is running
            subprocess.Popen(['google-chrome', url])
    except (OSError, subprocess.CalledProcessError):
        # Chrome is missing or failed to launch: fall back to the default browser
        webbrowser.open(url)
﻿
def find_files_from_log_gpt(log_content):
    """Ask GPT-5 which files (and line numbers) a traceback implicates."""
    import json
    verbose_print("Invoking LLM to identify implicated files from the log...")
    user_prompt = (
        "Given this error message or traceback, list all file paths (and, if available, line numbers) "
        "involved in the error. Output one JSON per line, as:\n"
        '{"file": "path/to/file.py", "line": 123}\n'
        'If line is not found, use null.\n'
        f"\nError:\n{log_content}"
    )
    client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
    llm_resp = client.responses.create(
        model="gpt-5",
        reasoning={"effort": "low"},
        input=[{"role": "user", "content": user_prompt}]
    )
    output = llm_resp.output_text or ""
    results = []
    for entry in output.splitlines():
        entry = entry.strip()
        if not entry:
            continue
        try:
            # Parse each line as JSON; never eval() text produced by a model
            obj = json.loads(entry)
        except json.JSONDecodeError as exc:
            verbose_print(f"[File Extraction Skipped Line]: {entry!r} ({exc})")
            continue
        if isinstance(obj, dict):
            results.append(obj)
    verbose_print(f"LLM File Extraction Result: {results}")
    return results
﻿
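def get_file_snippet(file_path, n_lines=20, line=None):
    """Return the lines around `line` (1-based), or the first n_lines if no valid line is given."""
    if not file_path or not os.path.exists(file_path):
        verbose_print(f"[WARN] File not found: {file_path}")
        return None
    with open(file_path, "r", encoding="utf-8", errors="replace") as f:
        lines = f.readlines()
    if line and 1 <= line <= len(lines):
        # Window of 5 lines before and 5 lines after the implicated line
        s = max(0, line - 6)
        e = min(len(lines), line + 5)
        return "".join(lines[s:e])
    return "".join(lines[:n_lines])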
﻿
﻿
@weave.op 
def suggest_tools(error_message, code_snippet):
    import ast, json
    verbose_print("Asking LLM: Based on the error and file, which tool to use next?")
    prompt = (
        "You are an AI debugging orchestrator. The following is a Python error message and a snippet of code "
        "from a file involved in the error. Based on this, choose which tools should be used next, and explain why. "
        "Possible tools: github_issue_search, web_search, static_analysis. "
        "Output a single python dictionary (not JSON, not explanation). Example: "
        "{'recommendations':['web_search', 'github_issue_search'], 'justification': 'Searching the web and GitHub can help resolve import errors quickly.'}\n"
        "Error:\n" + error_message +
        "\n\nFile snippet:\n" + code_snippet +
        "\n\nOutput only the dictionary. No preamble or explanation. "
        "Always include github_issue_search in the recommendations."
    )
    client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
    resp = client.responses.create(
        model="gpt-5",
        reasoning={"effort": "low"},
        input=[{"role": "user", "content": prompt}]
    )
    output = (resp.output_text or "").strip()
    try:
        if output.startswith("```") and output.endswith("```"):
            output = output[3:-3].strip()
        obj = ast.literal_eval(output)
        if isinstance(obj, dict):
            verbose_print(f"LLM Tool Suggestion: {obj}")
            return obj
    except Exception:
        pass
    m = re.search(r'\{.*\}', output, re.DOTALL)
    if m:
        try:
            obj = ast.literal_eval(m.group(0))
            if isinstance(obj, dict):
                verbose_print(f"LLM Tool Suggestion: {obj}")
                return obj
        except Exception:
            pass
    verbose_print(f"LLM Suggestion RAW output (not parsable): {output!r}")
    return {"recommendations": [], "justification": 'Could not parse LLM response'}
﻿
﻿
﻿
@weave.op
def final_recommendation_with_gpt5(
    error_text: str,
    code_snippet: str | None,
    tool_suggestion: dict | None,
    gh: list | None,
    web: dict | None,
    query: str,
) -> str:
    """Synthesize a concise, actionable plan from all gathered signals."""
    from openai import OpenAI
    import json, os
﻿
    client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
﻿
    gh_brief = []
    if gh:
        for item in gh[:5]:
            gh_brief.append({
                "title": item.get("title", ""),
                "url": item.get("url", ""),
                "summary": item.get("ocr_summary", "")
            })
﻿
    web_brief = {
        "answer": (web or {}).get("output_text"),
        "citations": (web or {}).get("citations")
    }
﻿
    payload = {
        "error_text": error_text,
        "code_snippet": code_snippet,
        "tool_suggestion": tool_suggestion,
        "search_query": query,
        "github_findings": gh_brief,
        "web_findings": web_brief
    }
﻿
    prompt = (
        "You are a debugging assistant. Based on the following data, produce a short, actionable plan.\n"
        "Include:\n"
        "1. Likely root cause in one or two sentences.\n"
        "2. Concrete next steps that can be executed now.\n"
        "3. If shapes or types are mismatched, propose exact code edits.\n"
        "4. If library problems are implicated, propose install or version pin commands.\n"
        "5. If no external search is needed, say so and outline local static checks.\n\n"
        f"DATA:\n{json.dumps(payload, ensure_ascii=False, indent=2)}\n\n"
        "Return a concise plan. No preamble."
    )
﻿
    resp = client.responses.create(
        model="gpt-5",
        reasoning={"effort": "low"},
        input=[{"role": "user", "content": prompt}]
    )
    return (resp.output_text or "").strip()
﻿
﻿
﻿
﻿
@weave.op
def static_analysis_gpt5(error_text: str, code_snippet: str | None) -> dict:
    """
    Pure GPT-5 static analysis. No web or GitHub.
    Returns a dict with fields: diagnosis, fixes, patch, test_snippet, notes.
    """
    from openai import OpenAI
    import os, json
﻿
    client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
﻿
    system = (
        "You are a Python static analyzer. Read the error and the code snippet. "
        "Find the root cause and propose concrete code edits. "
        "If there is a tensor shape mismatch, compute the exact shapes and provide the corrected operation. "
        "Return strict JSON with keys: diagnosis, fixes, patch, test_snippet, notes."
    )
﻿
    user = {
        "error_text": error_text,
        "code_snippet": code_snippet or ""
    }
﻿
    prompt = (
        "Analyze the following and return strict JSON only. "
        "Do not include commentary outside JSON.\n\n"
        f"{json.dumps(user, ensure_ascii=False, indent=2)}\n\n"
        "{ \"diagnosis\": \"...\", "
        "\"fixes\": [\"...\"], "
        "\"patch\": \"diff or edited code\", "
        "\"test_snippet\": \"python code to quickly sanity check\", "
        "\"notes\": \"short notes\" }"
    )
﻿
    resp = client.responses.create(
        model="gpt-5",
        reasoning={"effort": "low"},
        input=[
            {"role": "system", "content": system},
            {"role": "user", "content": prompt}
        ]
    )
﻿
    raw = (resp.output_text or "").strip()
    try:
        data = json.loads(raw)
    except Exception:
        data = {
            "diagnosis": "Could not parse JSON from model",
            "fixes": [],
            "patch": "",
            "test_snippet": "",
            "notes": raw[:500]
        }
    return data
﻿
@weave.op
def main(force_use_all_tools: bool = True):
    import os
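    # Prefer credentials from the environment; paste values here only for quick local tests
    GITHUB_TOKEN = os.getenv("GITHUB_TOKEN", "")
    os.environ.setdefault("OPENAI_API_KEY", "")  # keeps an exported key instead of overwriting it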
﻿
﻿
    error_content = read_log(LOGFILE)
    search_query = generate_search_query_openai(error_content) if is_python_error(error_content) \
                   else error_content.strip().replace("\n", " ")
﻿
    files_info, snippet, tools = None, None, None
    try:
        files_info = find_files_from_log_gpt(error_content)
        if files_info:
            file_to_examine, line_hint = files_info[0].get("file"), files_info[0].get("line")
            verbose_print(f"Selected file: {file_to_examine}, line: {line_hint}")
            snippet = get_file_snippet(file_to_examine, line=line_hint)
            if snippet:
                print("\n--- Snippet from implicated file ---\n", flush=True)
                print(snippet, flush=True)
                print("-" * 60, flush=True)
                tools = suggest_tools(error_content, snippet)
                print("\n[TOOL RECOMMENDATION]:", tools, flush=True)
            else:
                verbose_print(f"Could not get snippet from file {file_to_examine}")
        else:
            verbose_print("Did not find any file to examine in the error.")
    except Exception as e:
        verbose_print(f"[WARN] File inference failed: {e}")
﻿
    gh_results = []
    web_results = None
    static_result = None
﻿
    # run static analysis
    if force_use_all_tools or (tools and "static_analysis" in tools.get("recommendations", [])):
        static_result = static_analysis_gpt5(error_content, snippet)
﻿
    # run github search
    if force_use_all_tools or (tools and "github_issue_search" in tools.get("recommendations", [])):
        gh_results = search_github(
            search_query,
            github_token=GITHUB_TOKEN,
            error_text=error_content
        )
﻿
    # run web search
    if force_use_all_tools or (tools and "web_search" in tools.get("recommendations", [])):
        try:
            web_results = openai_web_search(search_query)
        except Exception as ex:
            print(f"[OpenAI] Search failed: {ex}", flush=True)
﻿
    final_plan = final_recommendation_with_gpt5(
        error_text=error_content,
        code_snippet=snippet,
        tool_suggestion=tools,
        gh=gh_results,
        web=web_results,
        query=search_query
    )
    print("\n=== FINAL RECOMMENDATION ===\n", final_plan, "\n", flush=True)
﻿
    html_path, raw_html = write_html_report(
        log=error_content,
        file_snippet=snippet,
        tools=tools,
        gh_results=gh_results,
        web_results=web_results,
        static_result=static_result   # pass it so the section renders
    )
﻿
    appended = raw_html.replace(
        "</body></html>",
        f"<section><h2>Final Recommendation</h2><pre>{html.escape(final_plan or '')}</pre></section></body></html>"
    )
    with open(html_path, "w", encoding="utf-8") as f:
        f.write(appended)
﻿
    open_html_in_chrome(html_path)
    verbose_print("Searches complete. Examine the HTML report in Chrome for summary and results.\n")
﻿
﻿
﻿
if __name__ == "__main__":
    # Fix for Windows event loop policy (Playwright + asyncio)
    if sys.platform.startswith("win"):
        try:
            asyncio.set_event_loop_policy(asyncio.WindowsProactorEventLoopPolicy())  # type: ignore[attr-defined]
        except Exception:
            pass
    main(force_use_all_tools=True)
﻿
﻿
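The script first loads a Python error log from disk (details below). If the log looks like a real traceback, GPT-5 converts the noisy text into a short, generalized search query. The prompt tells the model to strip volatile details such as file paths, line numbers, memory addresses, and run-dependent tensor shapes. What remains is the core error phrase and unambiguous library names, which is the kind of query that reliably surfaces useful matches.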
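The script delegates this cleanup to GPT-5 through a prompt. As a rough illustration of what "volatile details" means in practice, the sketch below performs a comparable cleanup with regular expressions. The function name `normalize_error_line` and the specific patterns are assumptions made for this example; they are not part of the original script, and the sample message is only of the kind PyTorch raises for a shape mismatch.

```python
import re

def normalize_error_line(line: str) -> str:
    """Strip run-specific details from an error line (illustrative only)."""
    # Memory addresses such as 0x7f3a2c1b
    line = re.sub(r"0x[0-9a-fA-F]+", "", line)
    # Traceback frame headers: File "...", line N, in func
    line = re.sub(r'File "[^"]+", line \d+(?:, in \S+)?', "", line)
    # Bare POSIX-style paths ending in .py
    line = re.sub(r"(?:/[\w.\-]+)+\.py", "", line)
    # Parenthesized groups that contain shapes like 3x4
    line = re.sub(r"\([^()]*\b\d+x\d+\b[^()]*\)", "", line)
    # Bracketed size lists like [3, 4]
    line = re.sub(r"\[\d+(?:,\s*\d+)*\]", "", line)
    # Collapse the whitespace left behind
    return re.sub(r"\s+", " ", line).strip()

msg = "RuntimeError: mat1 and mat2 shapes cannot be multiplied (3x4 and 5x6)"
query = normalize_error_line(msg)
print(query)
```

A regex pass like this is brittle compared with the prompt-based approach, but it makes the goal concrete: keep the error phrase, drop anything that changes from run to run.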
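To use the debugging helper in a real development loop, it is convenient to wrap the python command in a small shell function so that the debugger starts automatically whenever an error occurs.
For example, you can add the following function to your shell profile (such as .zshrc or .bashrc):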
agentpython() {
    logfile="/tmp/agentpython-stderr.log"
    python "$@" 2> >(tee "$logfile" >&2)
    if [[ -s "$logfile" ]]; then
        # If logfile is NOT empty, run check script
        python /Users/...FULL_PATH_TO/your_debug.py "$logfile"
    else
        # If logfile is empty, clear it (truncate to zero length)
        > "$logfile"
    fi
}
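After adding the function to .zshrc or .bashrc, run . ~/.zshrc or . ~/.bashrc (depending on your system) to load it into the current session without restarting the terminal.
Here is how it works:
logfile points to a temporary file that captures everything the Python script writes to standard error (stderr).
You run your Python script as usual, but invoke it as agentpython myscript.py.
Its stderr output is shown in the terminal and saved to the log file at the same time via tee.
If logfile is not empty, which the function takes as a sign that an error occurred, it immediately calls the debugging helper script debug.py with the path to the error log.
The helper script then runs the GPT-5 and Weave pipeline: it generates a search query, fetches GitHub issues and runs OCR on them, summarizes the fixes, and builds an HTML report.
If logfile is empty (no error), the function simply clears it.
This builds the GPT-5 debugging flow directly into your normal development process. Run agentpython instead of python, and whenever something breaks, the debugger starts automatically, fetches related issues, logs every input and output to Weave, and opens a report you can use right away to investigate the problem.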
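The same idea can be sketched in Python for readers who would rather not depend on shell process substitution. This variant also checks the child's exit code, since warnings written to stderr do not always indicate a failure. The function name `run_and_capture` is hypothetical, and unlike tee this sketch shows stderr only after the child process has finished.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

def run_and_capture(args: list[str], logfile: Path) -> bool:
    """Run Python with the given args, mirror stderr to the terminal and to logfile.

    Returns True when the run looks like a failure: non-zero exit code and stderr output.
    """
    proc = subprocess.run([sys.executable, *args], stderr=subprocess.PIPE, text=True)
    sys.stderr.write(proc.stderr)  # show stderr to the user, as tee would
    logfile.write_text(proc.stderr, encoding="utf-8")
    return proc.returncode != 0 and bool(proc.stderr.strip())

log = Path(tempfile.gettempdir()) / "agentpython-stderr.log"
failed = run_and_capture(["-c", "raise ValueError('boom')"], log)
# On failure, the debug helper would be launched here with str(log) as its argument.
```

Checking the exit code is a design choice rather than a requirement: the shell function above triggers on any stderr output, which is simpler but also fires on warnings.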
To demo the script, we wrote a script that deliberately contains an error:
import torch
﻿
a = torch.randn(3, 4)  # 3x4
b = torch.randn(5, 6)  # 5x6
﻿
result = torch.matmul(a, b)
﻿
print(result)
We then ran the following command:
agentpython bad_code.py
This raised an error, which in turn launched the agent.
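The demo fails because, for two 2-D inputs, a matrix product needs the number of columns in the first matrix to equal the number of rows in the second, and here 4 does not equal 5. The pure-Python sketch below states that rule without requiring PyTorch; `matmul_shape` is a hypothetical helper written for this illustration.

```python
def matmul_shape(a: tuple[int, int], b: tuple[int, int]) -> tuple[int, int]:
    """Return the result shape of a 2-D matrix product, or raise if the inner dims differ."""
    if a[1] != b[0]:
        raise ValueError(f"inner dimensions differ: {a[1]} != {b[0]}")
    return (a[0], b[1])

ok_shape = matmul_shape((3, 4), (4, 6))  # compatible: inner dims are both 4

try:
    matmul_shape((3, 4), (5, 6))  # the shapes used in the demo script
    demo_error = None
except ValueError as exc:
    demo_error = str(exc)
```

One way to make the demo script run would be to give b four rows, for example torch.randn(4, 6).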
﻿
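Once the query is ready, the script branches into several paths. First, it calls the GitHub API to fetch the top issues, launches a headless Chromium session with Playwright, and takes full-page screenshots. These are run through Tesseract OCR, so even long, image-only threads become readable. GPT-5 then summarizes each OCR extract in the context of the original error and returns a concise cause and fix that you can act on immediately.
Here, the agent searches GitHub for related issues.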
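The body of search_github is not shown in this part of the script, so the following is only a sketch of how the request and response handling could look with GitHub's REST issue-search endpoint. The helper names `build_issue_search_url` and `parse_issue_items` are assumptions, and the sketch deliberately makes no network call; the sample payload is invented for illustration.

```python
from urllib.parse import urlencode

GITHUB_SEARCH_URL = "https://api.github.com/search/issues"

def build_issue_search_url(query: str, per_page: int = 5) -> str:
    """Build a search URL restricted to issues (hypothetical helper)."""
    params = {"q": f"{query} is:issue", "per_page": per_page}
    return f"{GITHUB_SEARCH_URL}?{urlencode(params)}"

def parse_issue_items(payload: dict) -> list[dict]:
    """Keep only the fields the HTML report uses from a search response."""
    return [
        {
            "number": item["number"],
            "title": item["title"],
            "url": item["html_url"],
            "body": item.get("body") or "",  # the body can be null for empty issues
        }
        for item in payload.get("items", [])
    ]

url = build_issue_search_url("torch matmul")
sample = {"items": [{"number": 7, "title": "Shape error", "html_url": "https://github.com/o/r/issues/7", "body": None}]}
issues = parse_issue_items(sample)
```

In a real run the URL would be fetched with an HTTP client and a token in the Authorization header, and the JSON response passed to the parser.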
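If the tool-suggestion model recommends a web search (or force_use_all_tools is True, as in the script above), GPT-5 also queries up-to-date information on the internet. It reads relevant pages, combines them with the error context, and produces a direct, actionable answer with links to supporting sources. This widens coverage beyond GitHub to solutions from documentation, blog posts, and Q&A forums.
Here, the agent searches the web for related issues.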
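The script pulls citations out of the answer by matching Markdown links. The sketch below reuses the same regular expression as openai_web_search and adds de-duplication by URL, which is an addition for this example and not something the original script does. The sample answer text is invented.

```python
import re

# Same pattern the script uses: [title](http(s)://url)
LINK_RE = re.compile(r'\[([^\]]+)\]\((https?://[^\)]+)\)')

def extract_citations(text: str) -> list[dict]:
    """Return unique Markdown links from the answer text, in order of appearance."""
    seen, citations = set(), []
    for title, url in LINK_RE.findall(text or ""):
        if url not in seen:
            seen.add(url)
            citations.append({"title": title, "url": url})
    return citations

answer = (
    "See [PyTorch docs](https://pytorch.org/docs) and again "
    "[PyTorch docs](https://pytorch.org/docs), plus [forum](http://example.com/t/1)."
)
citations = extract_citations(answer)
```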
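There is also a path that relies on static analysis alone. When it is selected, GPT-5 reads the traceback and the relevant code snippet without any web or GitHub lookups and returns strict JSON containing a diagnosis, targeted fixes, a proposed patch, a small test snippet to verify the change, and supplementary notes. This path suits local problems whose fix lives entirely in the code at hand, such as tensor shape mismatches or misuse of a library API.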
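Models sometimes wrap JSON in a Markdown code fence or omit a key even when asked for strict JSON. The sketch below shows one way the parsing step could be made more tolerant; `parse_static_result` is a hypothetical helper, not the script's own code, which calls json.loads directly and falls back to a placeholder dict on failure.

```python
import json

FENCE = "`" * 3  # Markdown code-fence marker
DEFAULTS = {"diagnosis": "", "fixes": [], "patch": "", "test_snippet": "", "notes": ""}

def parse_static_result(raw: str) -> dict:
    """Parse model output into the five expected fields, tolerating fences and missing keys."""
    text = raw.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line (it may carry a language tag) and the closing fence
        text = text.split("\n", 1)[1] if "\n" in text else ""
        text = text.rsplit(FENCE, 1)[0]
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        data = {}
    if not isinstance(data, dict):
        data = {}
    return {key: data.get(key, default) for key, default in DEFAULTS.items()}

fenced = FENCE + 'json\n{"diagnosis": "shape mismatch", "fixes": ["transpose b"]}\n' + FENCE
result = parse_static_result(fenced)
```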
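The main functions are all wrapped with @weave.op, so Weave records the inputs and outputs of each step: log reading, query generation, GitHub issue retrieval, OCR, web search, summarization, static analysis, and the final plan synthesis. In the Weave UI you can step through a run, trace exactly how each result was produced, and compare outputs across sessions.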
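To make concrete what "recording the inputs and outputs of each step" means, here is a toy stand-in written for this example. It is not Weave and does not reflect how Weave is implemented; the `traced` decorator, `TRACE_LOG`, and the demo function are all hypothetical, and the real script simply applies @weave.op.

```python
import functools

TRACE_LOG: list[dict] = []  # in-memory record of calls (Weave stores these remotely)

def traced(fn):
    """Toy tracing decorator: records each call's inputs and output."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        result = fn(*args, **kwargs)
        TRACE_LOG.append({"op": fn.__name__, "args": args, "kwargs": kwargs, "output": result})
        return result
    return wrapper

@traced
def looks_like_traceback(text: str) -> bool:
    return "Traceback (most recent call last)" in text

looks_like_traceback("Traceback (most recent call last):\n  ...")
last_call = TRACE_LOG[-1]
```

Each decorated step leaves behind a record that can be inspected later, which is the property the debugging pipeline relies on.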
﻿
﻿
﻿
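Finally, the script generates a single HTML report and opens it in Chrome. The report includes the raw error log, the relevant source snippet if one was found, GPT-5's tool recommendations, the static analysis results, linked GitHub issues with previews and OCR-based solution summaries, and a web search section with the AI answer and its sources. It closes with a short consolidated recommendation from GPT-5 so you can apply a fix right away.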
﻿
﻿
﻿
﻿
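Because Weave records everything, you can revisit a debugging run later and adjust any part of it. For example, you can try a different reasoning effort level for GPT-5 when it summarizes fixes, or tweak the search-query prompt, and then compare the results directly without rerunning the whole browser and OCR pipeline. Over time this builds a library of debugging patterns tied to real model behavior, which helps refine the process for future issues.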
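A comparison across reasoning effort levels could be sketched as follows. The call shape mirrors the client.responses.create calls in the script, but the helper `compare_reasoning_efforts` and the effort values other than "low" are assumptions. To keep the sketch runnable without an API key, it takes the client as a parameter and is exercised here with a fake client invented for illustration.

```python
from types import SimpleNamespace

def compare_reasoning_efforts(client, prompt: str, efforts=("low", "medium", "high")) -> dict:
    """Send the same prompt once per reasoning effort and collect the text outputs."""
    results = {}
    for effort in efforts:
        resp = client.responses.create(
            model="gpt-5",
            reasoning={"effort": effort},
            input=[{"role": "user", "content": prompt}],
        )
        results[effort] = (resp.output_text or "").strip()
    return results

class _FakeResponses:
    """Offline stand-in for the real client (hypothetical, for illustration only)."""
    def create(self, **kwargs):
        return SimpleNamespace(output_text=f"plan at effort={kwargs['reasoning']['effort']}")

fake_client = SimpleNamespace(responses=_FakeResponses())
outputs = compare_reasoning_efforts(fake_client, "Summarize the fix for a matmul shape mismatch.")
```

With a real OpenAI client passed in and the function wrapped in @weave.op, each effort level would show up as a separate recorded call that can be compared in the Weave UI.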
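Conclusion
GPT-5 opens up a broad range of possibilities that go well beyond text generation.
This tutorial covered three very different use cases and showed that each one benefits greatly from being instrumented and logged with Weave.
We saw GPT-5 work with images, describing their content and generating new visuals, adjust its reasoning depth to tackle coding tasks, and drive a complete debugging pipeline that ties together log parsing, static analysis, OCR-backed GitHub search, and live web results.
Weave acts as the connective tissue for every step, recording all model inputs, outputs, and intermediate artifacts so that the whole process is visible and reproducible. Whether you are exploring creative workflows, running model evaluations, or troubleshooting complex errors, a complete visual history in Weave lets you understand why a result turned out the way it did, compare it with other runs, and iterate faster.
Combining GPT-5's multimodal capabilities with Weave's observability gives you powerful automation along with a persistent, inspectable record of how your tools behaved. Each run becomes a learning resource you can revisit, improve, and reuse, turning experimentation and debugging sessions into a knowledge base for future work.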
﻿
Tags: Articles, GPT, OpenAI, LLM, Evaluations, Agents