Tutorial: Evaluating GPT-5 across a range of tasks

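This tutorial shows how to use W&B Weave to evaluate GPT-5 on image generation, coding, and automated debugging. This article is an AI translation. If you spot a suspected mistranslation, please let us know in the comments.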
Created on September 12|Last edited on September 12
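The moment GPT-5 launched, we knew we had to dig into some of its new API features hands-on.
In the tutorial below, we look at some of GPT-5's most powerful capabilities. We first experiment with image description and generation, then compare its performance against other models on real programming tasks, and finally see how it can drive an automated Python debugging agent that produces concise summaries of code fixes. Let's get started!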
﻿
What we'll cover
Tutorial: Evaluating GPT-5 multimodal I/O
Tutorial: Evaluating GPT-5's coding ability with Weave Evals
Tutorial: Building a GPT-5 Python debugger agent with Weave
Conclusion
﻿
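Tutorial: Evaluating GPT-5 multimodal I/O

We'll start by evaluating multimodal I/O with GPT-5.
In this tutorial, we pass GPT-5 an image URL together with a description request; the function also takes the PIL image so that Weave logs it with the call. GPT-5 returns a natural-language description, and we feed that description back into GPT-5's image generation tool to create an entirely new image. Both functions are wrapped with @weave.op, so every run is logged to W&B Weave, including the original image, the prompt, the generated description, and the new image output.
In the Weave UI, clicking a single run shows the whole process visually (you can see it beneath the code block below). This is well suited to tracking creative experiments, debugging unexpected outputs, or showing before-and-after transformations.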
﻿
import os
import base64
from io import BytesIO

import requests
from PIL import Image
from openai import OpenAI
import weave

weave.init("gptV_gen_and_desc")

# Read the API key from the environment instead of hardcoding it.
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

@weave.op
def gptV_describe_image_with_url(
    pil_img: Image.Image,
    img_url: str,
    prompt: str = "Describe what is in this image."
) -> str:
    """
    Describe an image by sending its URL to the model.
    `pil_img` is not sent to the API; it is accepted so Weave logs the image.
    """
    # Build a single user message with the text prompt and the image URL
    inp = {
        "role": "user",
        "content": [
            {"type": "input_text", "text": prompt},
            {"type": "input_image", "image_url": img_url}
        ]
    }
    resp = client.responses.create(
        model="gpt-5",
        input=[inp]
    )
    return resp.output_text


@weave.op
def gpt_generate_image(
    prompt: str,
    size: str = "1024x1024"
) -> Image.Image | None:
    """
    Generate an image from a prompt with GPT-5's image_generation tool.
    Returns a PIL image, or None if the request or decoding fails.
    Note: `size` is not used in the request; it is kept for the call signature.
    """
    prompt = f"Generate an image given the following description: {prompt}"
    print(f"[DEBUG] Generating image with prompt: {prompt}")

    try:
        response = client.responses.create(
            model="gpt-5",
            input=prompt,
            tools=[{"type": "image_generation"}],  # no tool_choice
        )
    except Exception as e:
        print(f"[ERROR] Failed to create response: {e}")
        return None

    # Collect the base64 payload of every image_generation_call output item
    image_data = [
        output.result
        for output in response.output
        if output.type == "image_generation_call"
    ]
    if not image_data:
        print("[ERROR] The response contained no generated image.")
        return None

    try:
        # Decode once, write the bytes to disk, and return them as a PIL image
        image_bytes = base64.b64decode(image_data[0])
        filename = "generated_image.png"
        with open(filename, "wb") as f:
            f.write(image_bytes)
        print(f"[DEBUG] Image saved to {filename}")
        return Image.open(BytesIO(image_bytes))
    except Exception as e:
        print(f"[ERROR] Failed to save image: {e}")
        return None


# --- Main Example usage ---
if __name__ == "__main__":
    img_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/3/3f/Fronalpstock_big.jpg/800px-Fronalpstock_big.jpg"
    headers = {"User-Agent": "Mozilla/5.0 (compatible; GeminiScript/1.0)"}
    response = requests.get(img_url, headers=headers, timeout=30)
    response.raise_for_status()
    pil_img = Image.open(BytesIO(response.content))

    # 1. DESCRIBE
    desc = gptV_describe_image_with_url(
        pil_img=pil_img,
        img_url=img_url,
        prompt="Describe what is in this image."
    )
    print("\nGPT-V description:")
    print(desc)

    # 2. GENERATE NEW IMAGE FROM DESCRIPTION
    gen_img = gpt_generate_image(desc)
    if gen_img is None:
        # gpt_generate_image returns None on failure, so stop before .save()
        raise SystemExit("Image generation failed; see the [ERROR] output above.")
    gen_img.save("gptV_generated.png")
    print("\nGenerated image saved as gptV_generated.png")
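When you run the script, the generated image is saved locally, and thanks to Weave you also get a visual record of the whole process in the dashboard. You can filter by prompt text, compare runs side by side, and share a link to a specific run with teammates.
After running the script, this is what you see in Weave:
First, it describes the image
Next, let's generate a similar image!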
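The description step above sends only the image URL. For a local file with no public URL, one option is to inline the bytes as a base64 data URL in the same `image_url` field. The sketch below is a minimal illustration under that assumption; `to_data_url` and `build_image_message` are hypothetical helper names that are not part of the original script, and no API call is made.

```python
import base64


def to_data_url(image_bytes: bytes, mime_type: str = "image/jpeg") -> str:
    """Encode raw image bytes as a base64 data URL."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def build_image_message(prompt: str, image_bytes: bytes, mime_type: str = "image/jpeg") -> dict:
    """Build a user message in the same shape the describe function uses."""
    return {
        "role": "user",
        "content": [
            {"type": "input_text", "text": prompt},
            {"type": "input_image", "image_url": to_data_url(image_bytes, mime_type)},
        ],
    }


if __name__ == "__main__":
    placeholder = b"\xff\xd8\xff\xe0 not a real image"  # stand-in bytes, not a valid JPEG
    msg = build_image_message("Describe what is in this image.", placeholder)
    print(msg["content"][1]["image_url"][:40])
```

The resulting dictionary could be passed as an element of `input` in place of the URL-based message, assuming the model accepts data URLs for the image type in question.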
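Tutorial: Evaluating GPT-5's coding ability with Weave Evals

Next we move on to a coding evaluation of GPT-5. The model lets you set how much thinking time it spends on a task, so you can balance speed against accuracy. In this run we evaluated three models on 30 examples from the DeepMind Code Contests dataset, with each model generating a Python function that solves an algorithmic problem. GPT-OSS-120B was tested with its high reasoning mode enabled. GPT-5 was run with a low thinking budget to see how it performs under tighter constraints. Claude 4.1 Opus was run with a 4k-token thinking budget, which is relatively low for that model.
Using EvaluationLogger means we can run the evaluation however we like. Here GPT-5, Claude 4.1 Opus, and GPT-OSS-120B each generate Python code for every problem, and that code is actually executed in a controlled environment (a subprocess with a 10-second timeout). It runs against a predefined set of test cases, and the output is compared with the expected value to decide whether it is correct. The time each run takes is recorded as well.
Every step of this process is logged to Weave, so for each problem you can see the original prompt, the generated code, the exact test inputs, the execution output, and whether it passed. That makes it easy to compare low and high thinking effort by looking at accuracy metrics alongside the specific code differences that affected the result. Here is the code: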
import os
import sys
import re
import json
import time
import random
import subprocess

import numpy as np
import requests
import weave
from anthropic import Anthropic
from datasets import load_dataset
from google import genai
from google.genai import types
from litellm import completion as oai_completion
from openai import OpenAI
from weave.flow.eval_imperative import EvaluationLogger

weave.init("codecontests_eval")


# API keys are read from the environment. The OpenAI client below and the
# LiteLLM calls both use OPENAI_API_KEY.
CLAUDE_KEY = os.getenv("CLAUDE_API_KEY", "")
GENAI_KEY = os.getenv("GOOGLE_API_KEY", "")
OPENROUTER_KEY = os.getenv("OPENROUTER_API_KEY", "")

# Clients
anthropic_client = Anthropic(api_key=CLAUDE_KEY)
gemini_client = genai.Client(api_key=GENAI_KEY)
openrouter_client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=OPENROUTER_KEY,
)
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))


def clean_llm_code_block(text):
    import re
﻿
﻿
    cleaned_text = text.replace("```python", "").replace("```", "").strip()
    code_blocks = re.findall(r"(def solve\(.*?)(?=^def |\Z)", cleaned_text, re.DOTALL | re.MULTILINE)
    source_text = code_blocks[-1] if code_blocks else cleaned_text
﻿
﻿
    prompt = (
        "Given the following response from a language model, extract ONLY the valid Python code for the function. "
        "Do not include any explanations, text, or formatting fences. Only the code.\n\n"
        f"Response:\n{source_text}\n\n"
        "Return ONLY the Python code, including any necessary imports:"
    )
﻿
﻿
    response = oai_completion(
        model="openai/gpt-4o-2024-08-06",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
﻿
﻿
    gpt4o_code = response["choices"][0]["message"]["content"]
    gpt4o_code = gpt4o_code.replace("```python", "").replace("```", "").strip()
    return gpt4o_code
﻿
﻿
@weave.op()
def generate_completion(model: str, prompt: str, effort: str="low") -> str:
    # if model.startswith("openai/"):
    #     response = oai_completion(
    #         model=model,
    #         messages=[{"role": "user", "content": prompt}],
    #         reasoning_effort="low",
    #     )
    #     return response["choices"][0]["message"]["content"].strip()
    if model.startswith("openai/"):
        response = client.responses.create(
            model=model.replace("openai/", ""),
            reasoning={"effort": effort},
            input=[
                {"role": "user", "content": prompt}
            ]
        )
        return response.output_text.strip()
﻿
﻿
    elif model.startswith("anthropic/"):
        response = anthropic_client.messages.create(
            model=model.replace("anthropic/", ""),
            max_tokens=8000,
            thinking={"type": "enabled", "budget_tokens": 4000},
            messages=[{"role": "user", "content": prompt}],
        )
        for block in response.content:
            if block.type == "text":
                return block.text.strip()
        return "[No Claude response]"
﻿
﻿
    elif model.startswith("gemini/"):
        result = gemini_client.models.generate_content(
            model=model.replace("gemini/", ""),
            config=types.GenerateContentConfig(
                thinking_config=types.ThinkingConfig(thinking_budget=4000)
            ),
            contents=[prompt]
        )
        return result.text.strip() if result.text else "[No Gemini response]"
﻿
﻿
    elif model.startswith("openrouter/"):
﻿
        url = "https://openrouter.ai/api/v1/chat/completions"
        headers = {
            "Authorization": f"Bearer {OPENROUTER_KEY}",
            "Content-Type": "application/json"
        }
        payload = {
            "model": model.replace("openrouter/", ""),
            "messages": [
                {"role": "system", "content": "Reasoning: high"},
                {"role": "user", "content": prompt}
            ],
        }
        response = requests.post(url, headers=headers, data=json.dumps(payload))
        resp_json = response.json()
        if 'choices' in resp_json and resp_json['choices']:
            # To get the reasoning, use: resp_json['choices'][0]['message'].get('reasoning')
            return resp_json['choices'][0]['message'].get('content', '[No answer found]')
        else:
            return "[No choices found in OSS response]"
﻿
    else:
        raise ValueError(f"Unsupported model: {model}")
    
﻿
﻿
﻿
﻿
def ask_llm_for_function_implementation(description: str, model: str, effort: str | None = None) -> str:
    prompt = (
        f"Write a Python3 function named `solve` with typed input arguments for this problem -- eg the solve function should take arguments to handle different test cases:\n\n"
        f"{description.strip()}\n\n"
        "Return only a valid Python function -- no special packages that arent commonly used and NO MAIN function, no  if __name__ == __main__....., JUST write the function --  that returns the result. No comments, no explanations."
        f"HOWEVER, you still need to include necessary imports for libraries"
        f"IF you do not include the right imports, the code will not be executable, and your response will be judged as incorrect!"
    )
    # Pass effort only to OpenAI via generate_completion when provided
    if effort is not None and model.startswith("openai/"):
        return clean_llm_code_block(generate_completion(model, prompt, effort=effort))
    else:
        return clean_llm_code_block(generate_completion(model, prompt))
﻿
﻿
﻿
﻿
@weave.op
def ask_llm_for_function_call(code: str, raw_input: str, model: str) -> str:
﻿
﻿
    prompt = (
        "You're given a Python function and a single input string. "
        "Format it into a valid Python function call using only standard types.\n\n"
        f"Function:\n{code}\n\n"
        f"Input:\n{raw_input.strip()}\n\n"
        "Return ONLY a valid function call (e.g., solve(3, 5)) WITH NO 'def' "
    )
﻿
﻿
    # Always use GPT-4o for this inference, regardless of the `model` argument.
    response = oai_completion(
        model="openai/gpt-4o-2024-08-06",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    # The LLM may return markdown code blocks; strip them just in case.
    content = response["choices"][0]["message"]["content"]
    content = content.replace("```python", "").replace("```", "").strip()
    return content
﻿
﻿
def compare_output_with_llm(expected: str, actual: str, model: str) -> bool:
    prompt = (
        f"Expected output: {expected.strip()}\n"
        f"Actual output: {actual.strip()}\n\n"
        "Are these outputs equivalent? Eg ignore minor formatting errors etc, we are just looking for overall correctness in the output Reply YES or NO."
    )
    # response = generate_completion(model, prompt)
﻿
﻿
    response = oai_completion(
        model="openai/gpt-4o-2024-08-06",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    # The LLM may return markdown code blocks; strip them just in case.
    res = 'YES' in str(response["choices"][0]["message"]["content"]).upper()
    return res 
﻿
﻿
﻿
def run_code_and_call_function(code: str, function_call: str, timeout=10):
    full_code = code + f"\n\nprint({function_call})"
    try:
        start = time.time()
        result = subprocess.run(
            [sys.executable, "-c", full_code],
            capture_output=True,
            text=True,
            timeout=timeout
        )
        latency = time.time() - start
        return result.stdout.strip(), result.stderr.strip(), latency
    except subprocess.TimeoutExpired:
        return "", "Execution timed out.", timeout
    except Exception as e:
        return "", str(e), 0.0
﻿
﻿
﻿
﻿
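def ask_model_for_pip_command(error_msg):
    prompt = (
        "Given this Python error:\n\n"
        + error_msg +
        "\n\nWrite the pip install command needed to fix it. Only return the command, e.g.:\n"
        "pip install requests"
    )
    # Call GPT-4o through LiteLLM, as the other helper functions do.
    # generate_completion() attaches a `reasoning` setting to OpenAI requests,
    # which is meant for reasoning models such as GPT-5, not GPT-4o.
    response = oai_completion(
        model="openai/gpt-4o-2024-08-06",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    return response["choices"][0]["message"]["content"].strip()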
﻿
﻿
﻿
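def run_pip_install(pip_command):
    # The command is model-generated, so accept only a plain `pip install ...`
    # and run it through the current interpreter's pip.
    parts = pip_command.strip().split()
    if len(parts) < 3 or parts[0] not in ("pip", "pip3") or parts[1] != "install":
        print(f"Refusing to run unexpected command: {pip_command!r}")
        return
    print(f"Running: {pip_command}")
    try:
        result = subprocess.run(
            [sys.executable, "-m", "pip", "install", *parts[2:]],
            capture_output=True,
            text=True,
            timeout=180
        )
        print(result.stdout.strip())
        if result.stderr:
            print(result.stderr.strip())
    except Exception as e:
        print(f"pip install failed: {e}")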
﻿
﻿
﻿
def evaluate_model_on_code_contests(model_name: str, reasoning_effort: str | None = None):
    print(f"\n\nRunning evaluation for model: {model_name}\n")
﻿
    random.seed(42)
    np.random.seed(42)
    ds = load_dataset("deepmind/code_contests", split="test", streaming=True)
    ds = list(ds.take(30))  # the loop below evaluates exactly 30 problems
﻿
﻿
    # Build sanitized model identifier for Weave, including reasoning effort if provided
    model_id = model_name.replace("-", "_").replace("/", "_").replace(".", "_")
    if reasoning_effort:
        effort_id = str(reasoning_effort).replace("-", "_").replace("/", "_").replace(".", "_")
        model_id = f"{model_id}__{effort_id}"
﻿
    eval_logger = EvaluationLogger(
        model=model_id,
        dataset="code_contests_test"
    )
    all_latencies = []
﻿
﻿
    for i in range(30):
        row = ds[i]
        description = row["description"]
        raw_inputs = row["public_tests"]["input"]
        expected_outputs = row["public_tests"]["output"]
﻿
﻿
        try:
            # Forward reasoning_effort only to OpenAI generate_completion
            code = ask_llm_for_function_implementation(
                description,
                model=model_name,
                effort=reasoning_effort if (reasoning_effort and model_name.startswith("openai/")) else None,
            )
            print(f"\n=== Task {row['name']} ===", flush=True)
            # print("Generated code:\n", code)
﻿
﻿
            all_passed = True
            task_latencies = []
            results_lst, expected_lst = [], []
﻿
﻿
            for j, raw_input in enumerate(raw_inputs):
                expected = expected_outputs[j] if j < len(expected_outputs) else ""
﻿
﻿
                try:
                    function_call = ask_llm_for_function_call(code, raw_input, model=model_name)
                    result, error, latency = run_code_and_call_function(code, function_call)

                    if error:
                        print(f"[{j}] Runtime error: {error}")
                        if "ModuleNotFoundError" in error:
                            pip_cmd = ask_model_for_pip_command(error)
                            run_pip_install(pip_cmd)
                            # Re-run once after pip install and keep the retry's timing
                            result, error, latency = run_code_and_call_function(code, function_call)
                            if error:
                                print(f"[{j}] Retry failed: {error}")

                    # Record one latency per test case, so a retry is not counted twice
                    task_latencies.append(latency)
                    if error:
                        all_passed = False
                        continue

                    is_correct = compare_output_with_llm(expected, result, model="openai/gpt-4o-2024-08-06")###### 
                    results_lst.append(result)
                    expected_lst.append(expected)
                    if not is_correct:
                        all_passed = False
                    print(f"[{j}] input: {raw_input.strip()} → output: {result} | expected: {expected.strip()} | PASS: {is_correct} | latency: {latency:.2f}s")
﻿
﻿
                except Exception as inner:
                    print(f"[{j}] Inner error: {repr(inner)}")
                    all_passed = False
            
            task_avg_latency = sum(task_latencies) / len(task_latencies) if len(task_latencies) > 0 else 0.0
            all_latencies.extend(task_latencies)
﻿
﻿
            prediction_log = eval_logger.log_prediction(
                inputs={"description": description},
                output={'code': code, 'execution_result': results_lst, 'expected_execution_result': expected_lst}
            )
            prediction_log.log_score("correctness", all_passed)
            prediction_log.log_score("code_latency", task_avg_latency)
            prediction_log.finish()
﻿
﻿
        except Exception as e:
            print(f"[{i}] Top-level failure: {repr(e)}")
            prediction_log = eval_logger.log_prediction(
                inputs={"description": description},
                output=str(e)
            )
            prediction_log.log_score("correctness", False)
            prediction_log.finish()
﻿
﻿
    avg_latency = sum(all_latencies) / len(all_latencies) if all_latencies else 0.0
    eval_logger.log_summary({"avg_code_latency": avg_latency})
    print(f"Evaluation complete for {model_name}. View in Weave UI.")
﻿
﻿
﻿
﻿
# Run for all models
﻿
﻿
evaluate_model_on_code_contests("openrouter/openai/gpt-oss-120b")
evaluate_model_on_code_contests("anthropic/claude-opus-4-1-20250805")
evaluate_model_on_code_contests("openai/gpt-5", reasoning_effort='low')
﻿
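The code initializes a Weave project to log everything, then streams a small slice of the DeepMind Code Contests test set. For each problem it asks the chosen model to write a Python function named solve. The generate_completion router supports several providers and, for OpenAI models, lets you pass a reasoning effort setting. Because models tend to wrap their answers in Markdown, clean_llm_code_block strips the fences and then asks GPT-4o to return only the runnable code.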
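A deterministic first pass can handle the common case before any LLM cleanup is needed. The sketch below is an illustration under that assumption, not the article's `clean_llm_code_block`: `extract_python_code` is a hypothetical helper that pulls the last fenced block out of a response with a regular expression and falls back to the raw text when no fence is found.

```python
import re

FENCE = "`" * 3  # three backticks, built this way so the example is easy to embed

# Opening fence, optional language tag, newline, then the lazily captured body.
_FENCED_BLOCK = re.compile(
    FENCE + r"[ \t]*(?:python|py)?[ \t]*\n(.*?)" + FENCE,
    re.DOTALL | re.IGNORECASE,
)


def extract_python_code(response_text: str) -> str:
    """Return the last fenced code block, or the stripped text if none is found."""
    blocks = _FENCED_BLOCK.findall(response_text)
    return (blocks[-1] if blocks else response_text).strip()


if __name__ == "__main__":
    reply = (
        "Here you go:\n"
        + FENCE + "python\ndef solve(a, b):\n    return a + b\n" + FENCE
        + "\nHope it helps."
    )
    print(extract_python_code(reply))
```

Taking the last block mirrors the original function's preference for the final `solve` definition; a response with unusual formatting would still need the LLM-based cleanup.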
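Next, for each public test input, the script asks GPT-4o to turn the raw example into a concrete function call, runs the candidate code in a subprocess, and measures the latency. If execution fails because a package is missing, it asks the model for the exact pip install command, installs the package, and retries once. The output is compared with the expected answer through a lenient GPT-4o check, so minor formatting differences do not lower the accuracy.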
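The execution step can be sketched in a few lines. The version below assumes the candidate defines `solve` and that the call string is already known; `run_candidate` and `outputs_match` are hypothetical names. It swaps the article's GPT-4o comparison for a whitespace-insensitive string check, which is cheaper but stricter, so it is better suited as a pre-check in front of an LLM judge than as a replacement for it.

```python
import subprocess
import sys
import time


def run_candidate(code: str, function_call: str, timeout: float = 10.0):
    """Run candidate code in a fresh interpreter and print the call's result.

    Returns (stdout, stderr, seconds). A timeout is reported through stderr.
    """
    program = f"{code}\n\nprint({function_call})"
    start = time.perf_counter()
    try:
        proc = subprocess.run(
            [sys.executable, "-c", program],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return "", "Execution timed out.", timeout
    return proc.stdout.strip(), proc.stderr.strip(), time.perf_counter() - start


def outputs_match(expected: str, actual: str) -> bool:
    """Compare outputs token by token, ignoring whitespace differences."""
    return expected.split() == actual.split()


if __name__ == "__main__":
    candidate = "def solve(a: int, b: int) -> int:\n    return a + b"
    out, err, seconds = run_candidate(candidate, "solve(2, 3)")
    print(out, outputs_match("5\n", out), f"{seconds:.3f}s")
```

Running each candidate in its own interpreter keeps a crash or infinite loop from taking down the evaluation loop, though it is not a security sandbox.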
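Every task is logged to Weave's EvaluationLogger with its inputs, generated code, per-test outputs, correctness, and execution time. At the end, a summary with the average execution latency is logged as well. The result is a reproducible evaluation you can browse in Weave: open any task to see the prompt, code, calls, outputs, and pass/fail status across models and effort levels.
Once the evaluation finishes, you can open the Weave UI and explore it in detail. Weave's evaluation viewer shows aggregate statistics such as accuracy and also lets you click into each individual example, where you can see the prompt, the model's full response, the score, and every metric you logged with it. If you ran several models or settings, Weave lines up the results so you can compare outputs directly. This is very useful when deciding whether the extra latency of higher reasoning effort is worth the accuracy gain for your use case. Here are my evaluation results: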
﻿
According to the chart, GPT-5 (low thinking budget) had the highest accuracy at 0.733. GPT-OSS-120B (high reasoning) followed at 0.633. Claude 4.1 Opus (4k-token budget) trailed at 0.333, with a mid-range code latency of 0.620.
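To make the scoring concrete: the script marks a task correct only if every public test passes, and accuracy is then the fraction of correct tasks. The sketch below assumes the chart's accuracy is that plain fraction over the 30 tasks, in which case 0.733 would correspond to 22 of 30. The per-test results are made-up placeholders, and `task_passed` and `accuracy` are hypothetical helper names.

```python
def task_passed(test_results: list[bool]) -> bool:
    """A task counts as correct only if none of its tests failed."""
    return all(test_results)


def accuracy(task_outcomes: list[bool]) -> float:
    """Fraction of tasks marked correct (0.0 for an empty list)."""
    return sum(task_outcomes) / len(task_outcomes) if task_outcomes else 0.0


if __name__ == "__main__":
    # Placeholder data: 22 tasks pass every test, 8 fail at least one.
    outcomes = [task_passed([True, True])] * 22 + [task_passed([True, False])] * 8
    print(round(accuracy(outcomes), 3))
```

Because a single failing test fails the whole task, this metric is stricter than a per-test pass rate.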
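Tutorial: Building a GPT-5 Python debugger agent with Weave

Finally, we look at GPT-5 as a debugging assistant that takes an error log, places it in outside context, and proposes actionable fixes. Every step is wrapped with @weave.op and logged in full, so you can replay runs later, inspect intermediate results, and compare outputs. Once everything is set up, you run:

agentpython myscript.py

This command runs the script and automatically triggers the agent if an error occurs. While it runs, the following happens: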
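1. Log capture and cleanup: The agent first loads the error log, which a shell alias or function has saved by piping stderr to a file. When it detects a Python traceback, GPT-5 strips volatile details such as file paths, line numbers, memory addresses, and run-specific tensor shapes, keeping only the core error type, the message, and any obvious library name.
2. GitHub search with OCR: Using the cleaned query, the script calls the GitHub API to find matching issues, opens each result in a headless browser, captures a full-page screenshot, and runs OCR to extract the whole discussion, including code blocks and image-only comments. GPT-5 then summarizes each thread in the context of the original error, so likely causes and fixes are visible at a glance.
3. Web search and static analysis: When web search is enabled, GPT-5 queries the live internet, merges relevant results with the error context, and returns an actionable answer with citations. Alternatively, or in parallel, the agent can run a fully offline static analysis that reads the relevant source file and traceback and proposes concrete fixes, a quick test snippet, and any relevant install or version commands.
4. Final HTML report: The process ends by collecting the raw error log, the relevant source snippet, GPT-5's tool recommendations, the static analysis results, the linked GitHub issues with their screenshots and summaries, and the cited web search results into a single HTML report. The final recommendation appears at the bottom, ready to act on. The report opens automatically in Chrome, and because Weave logs every input and output, you can revisit a run, adjust prompts, or re-run parts of it without repeating expensive search or OCR work.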
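Step 1 delegates the cleanup to GPT-5. For comparison, a rough regex-only version of the same idea is sketched below. It is an illustrative assumption rather than the agent's method, `normalize_error` is a hypothetical name, and the patterns cover only the simplest cases (hex addresses, `.py` paths, line numbers, and parenthesized numeric shapes).

```python
import re


def normalize_error(traceback_text: str) -> str:
    """Reduce a traceback to its final error line with run-specific details removed."""
    lines = [ln.strip() for ln in traceback_text.splitlines() if ln.strip()]
    if not lines:
        return ""
    msg = lines[-1]  # a Python traceback ends with "ErrorType: message"
    msg = re.sub(r"0x[0-9a-fA-F]+", "", msg)                             # memory addresses
    msg = re.sub(r"(?:/[\w.\-]+)+\.py", "", msg)                         # POSIX-style .py paths
    msg = re.sub(r"\bline \d+\b", "", msg)                               # line numbers
    msg = re.sub(r"[\(\[]\s*\d+(?:\s*,\s*\d+)*\s*,?\s*[\)\]]", "", msg)  # shapes like (3, 4)
    return re.sub(r"\s+", " ", msg).strip()


if __name__ == "__main__":
    tb = (
        "Traceback (most recent call last):\n"
        '  File "/home/user/train.py", line 12, in <module>\n'
        "    c = a + b\n"
        "ValueError: operands could not be broadcast together with shapes (3,4) (5,)\n"
    )
    print(normalize_error(tb))
```

A fixed set of patterns like this is predictable and free to run, but it cannot judge which parts of a message are generic, which is the part the article hands to the model.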
import os
import sys
import re
import requests
import tempfile
import webbrowser
import html
import asyncio
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
from playwright.async_api import async_playwright
from PIL import Image
import pytesseract
﻿
from openai import OpenAI
import weave; weave.init("gpt5_agent")
﻿
﻿
# -------- CONFIG --------
LOGFILE = sys.argv[1] if len(sys.argv) > 1 else "/tmp/agentpython-stderr.log"
OUTPUT_DIR = "github_screenshots"
PARALLEL_PAGE_LOADS = 6   # how many pages to screenshot at once
OCR_WORKERS = min(8, (os.cpu_count() or 4))
# ------------------------
﻿
def verbose_print(msg):
    print(f"\033[95m[LOG] {msg}\033[0m", flush=True)
﻿
def read_log(logfile):
    verbose_print(f"Reading from log file: {logfile}")
    if not os.path.exists(logfile) or os.path.getsize(logfile) == 0:
        print("[LOG] Log file empty or not found. No action needed.", flush=True)
        sys.exit(0)
    with open(logfile) as f:
        content = f.read()
    print(f"\n--- Log Content ---\n{content}\n{'-'*40}", flush=True)
    return content
﻿
def is_python_error(txt):
    if "Traceback (most recent call last):" in txt or "Exception" in txt or "Error" in txt:
        verbose_print("Looks like a Python error.")
        return True
    verbose_print("Not detected as Python error (using fallback toolchain).")
    return False
﻿
@weave.op 
def generate_search_query_openai(error_str):
    verbose_print("Generating generalized search query using gpt-5 (OpenAI)...")
    client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
    gpt_response = client.responses.create(
        model="gpt-5",
        reasoning={"effort": "low"},
        input=[
            {
                "role": "user",
                "content": (
                    "You are generating a GitHub search query from an error message. "
                    "Your goal is to create a generic query that will return relevant results from GitHub issues across many repositories. "
                    "Do NOT include overly specific details that would narrow results too much, such as:\n"
                    "- File paths\n"
                    "- Line numbers\n"
                    "- Exact tensor shapes, array sizes, or specific numeric values in parentheses\n"
                    "- Memory addresses\n"
                    "- Random seeds or run-specific values\n\n"
                    "Instead:\n"
                    "- Keep only the key error type and descriptive text\n"
                    "- Include the relevant library name if obvious (e.g., torch, numpy, pandas)\n"
                    "- Use quotes for the core error message if helpful\n\n"
                    "Output only the final search query string. No explanation, no extra words.\n\n"
                    f"Error:\n{error_str}"
                )
            }
        ]
    )
    query = (gpt_response.output_text or "").strip()
    print("Generated search query:", repr(query), flush=True)
    return query
﻿
async def _screenshot_one(page, url, path):
    try:
        await page.goto(url, timeout=20000)
        await page.set_viewport_size({"width": 1920, "height": 1080})
        await page.screenshot(path=path, full_page=True)
        verbose_print(f"[+] Screenshot saved: {path}")
        return True
    except Exception as e:
        verbose_print(f"[!] Failed screenshot for {url}: {e}")
        return False
﻿
async def capture_screenshots_parallel(urls, out_dir, concurrency=6):
    os.makedirs(out_dir, exist_ok=True)
    results = [None] * len(urls)
﻿
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context()
﻿
        sem = asyncio.Semaphore(concurrency)
        async def worker(i, url):
            path = os.path.join(out_dir, f"issue_{i+1}.png")
            async with sem:
                page = await context.new_page()
                ok = await _screenshot_one(page, url, path)
                await page.close()
                results[i] = path if ok else None
﻿
        tasks = [asyncio.create_task(worker(i, url)) for i, url in enumerate(urls)]
        await asyncio.gather(*tasks)
        await browser.close()
﻿
    return results  # list of file paths (or None)
﻿
def run_ocr(image_path):
    if not image_path or not os.path.exists(image_path):
        return ""
    try:
        img = Image.open(image_path)
        text = pytesseract.image_to_string(img)
        # save alongside
        txt_path = image_path.rsplit(".", 1)[0] + ".txt"
        with open(txt_path, "w", encoding="utf-8") as f:
            f.write(text)
        return text
    except Exception as e:
        verbose_print(f"[!] OCR failed for {image_path}: {e}")
        return ""
@weave.op 
def summarize_with_gpt5(error_text, github_text):
    if not github_text.strip():
        return "[No OCR text to summarize]"
    client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
    resp = client.responses.create(
        model="gpt-5",
        reasoning={"effort": "low"},
        input=[
            {
                "role": "user",
                "content": (
                    "You are assisting in debugging. The following is a Python error message, "
                    "and then OCR-extracted text from a GitHub issue discussing it. "
                    "Summarize the most likely cause and solution in a few sentences. "
                    "Only include relevant fix instructions. Be concise.\n\n"
                    f"Error:\n{error_text}\n\nGitHub Issue Content:\n{github_text}"
                )
            }
        ]
    )
    return (resp.output_text or "").strip()
﻿
﻿
@weave.op 
def search_github(query, github_token=None, owner=None, repo=None, error_text=None):
    verbose_print(f"Searching GitHub issues for: {query!r}")
    url = 'https://api.github.com/search/issues'
    headers = {'Accept': 'application/vnd.github.v3+json'}
    if github_token:
        headers['Authorization'] = f'token {github_token}'
    if owner and repo:
        gh_query = f'repo:{owner}/{repo} is:issue {query}'
    else:
        gh_query = query
    params = {'q': gh_query, 'per_page': 5}
    # A timeout keeps the agent from hanging if the GitHub API does not respond
    resp = requests.get(url, headers=headers, params=params, timeout=30)
    if resp.status_code != 200:
        print(f"[GitHub] Search failed: {resp.status_code} {resp.text}", flush=True)
        return []
﻿
    items = resp.json().get('items', [])
    if not items:
        print("[GitHub] No results found.", flush=True)
        return []
﻿
    issue_urls = [it.get('html_url', '') for it in items]
    # Parallel screenshots
    verbose_print("Capturing GitHub issues as screenshots in parallel...")
    screenshots = asyncio.run(capture_screenshots_parallel(issue_urls, OUTPUT_DIR, PARALLEL_PAGE_LOADS))
﻿
    # Parallel OCR
    verbose_print("Running OCR on screenshots in parallel...")
    ocr_texts = [""] * len(screenshots)
    with ThreadPoolExecutor(max_workers=OCR_WORKERS) as ex:
        futures = {ex.submit(run_ocr, path): i for i, path in enumerate(screenshots)}
        for fut in as_completed(futures):
            i = futures[fut]
            try:
                ocr_texts[i] = fut.result() or ""
            except Exception as e:
                verbose_print(f"[!] OCR worker error for index {i}: {e}")
                ocr_texts[i] = ""
﻿
    # Summarize in parallel
    gh_results = []
    summaries = [""] * len(items)
﻿
    def _summarize_idx(i: int) -> str:
        return summarize_with_gpt5(error_text or query, ocr_texts[i])
﻿
    max_workers = min(8, len(items)) if items else 0
    if max_workers > 0:
        with ThreadPoolExecutor(max_workers=max_workers) as ex:
            future_map = {ex.submit(_summarize_idx, i): i for i in range(len(items))}
            for fut in as_completed(future_map):
                i = future_map[fut]
                try:
                    summaries[i] = fut.result() or ""
                except Exception as e:
                    summaries[i] = f"[summarize error: {e}]"
﻿
    for idx, item in enumerate(items):
        summary = summaries[idx]
        issue_info = {
            "number": item.get("number", "?"),
            "title": item.get("title", ""),
            "url": item.get("html_url", ""),
            "body": (item.get("body", "") or "")[:600] + ("..." if item.get("body") and len(item["body"]) > 600 else ""),
            "ocr_summary": summary,
            "screenshot": screenshots[idx] or ""
        }
        gh_results.append(issue_info)
        print("=" * 60, flush=True)
        print(f"Issue #{issue_info['number']}: {issue_info['title']}", flush=True)
        print(f"URL: {issue_info['url']}", flush=True)
        print(f"Screenshot: {issue_info['screenshot']}", flush=True)
        print(f"Solution Summary: {summary}", flush=True)
    print("=" * 60, flush=True)
﻿
    return gh_results
﻿
﻿
@weave.op 
def openai_web_search(query):
    verbose_print(f"Querying OpenAI gpt-5 web search for: {query!r}")
    client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
    search_response = client.responses.create(
        model="gpt-5",
        tools=[{"type": "web_search_preview"}],
        reasoning={"effort": "low"},
        input=query
    )
    print("\n=== [OpenAI] Web Search AI Answer ===", flush=True)
    print(search_response.output_text, flush=True)
    links = re.findall(r'\[([^\]]+)\]\((https?://[^\)]+)\)', search_response.output_text or "")
    link_objs = []
    if links:
        for title, url in links:
            link_objs.append({'title': title, 'url': url})
    else:
        print("No citations found in output_text.", flush=True)
    return {'output_text': search_response.output_text, 'citations': link_objs}
﻿
﻿
@weave.op
def write_html_report(
    log,
    file_snippet,
    tools,
    gh_results,
    web_results,
    static_result=None,
    out_path=None
):
    """Write the HTML debug report and return both the path and the raw HTML."""
    verbose_print("Writing HTML report ...")
    out_path = out_path or os.path.join(tempfile.gettempdir(), 'dbg_report.html')
    css = """
    body { font-family: 'Segoe UI', sans-serif; background: #f5f7fa; color: #333; margin: 0; padding: 0; }
    header { background: #1e293b; color: white; padding: 20px; text-align: center; font-size: 1.5em; }
    section { padding: 20px; margin: 20px; background: white; border-radius: 8px; box-shadow: 0 2px 6px rgba(0,0,0,0.1); }
    h2 { border-bottom: 2px solid #e5e7eb; padding-bottom: 5px; margin-bottom: 10px; color: #1f2937; }
    pre { background: #0f172a; color: #e2e8f0; padding: 15px; border-radius: 6px; overflow-x: auto; font-size: 0.9em; }
    a { color: #2563eb; text-decoration: none; }
    a:hover { text-decoration: underline; }
    .gh-issue { border: 1px solid #e5e7eb; padding: 10px; border-radius: 6px; margin-bottom: 16px; background: #f9fafb; }
    .shot { margin: 8px 0; display: block; max-width: 100%; border: 1px solid #e5e7eb; border-radius: 6px; }
    .label { font-weight: 600; color: #111827; }
    """
    html_parts = []
    html_parts.append(f"<html><head><meta charset='utf-8'><title>Debug Results</title><style>{css}</style></head><body>\n")
    html_parts.append("<header>Debugging Session Report</header>\n")
    html_parts.append("<section><h2>Error Log</h2>")
    html_parts.append(f"<pre>{html.escape(log or 'None')}</pre></section>")
    if file_snippet:
        html_parts.append("<section><h2>Relevant Source Snippet</h2>")
        html_parts.append(f"<pre>{html.escape(file_snippet)}</pre></section>")
    if tools:
        html_parts.append("<section><h2>LLM Tool Recommendations</h2>")
        html_parts.append(f"<pre>{html.escape(str(tools))}</pre></section>")
    if static_result:
        html_parts.append("<section><h2>Static Analysis</h2>")
        diag = static_result.get("diagnosis", "")
        fixes = "\n".join(static_result.get("fixes", []) or [])
        patch = static_result.get("patch", "")
        test_snip = static_result.get("test_snippet", "")
        notes = static_result.get("notes", "")
        html_parts.append(f"<div class='label'>Diagnosis</div><pre>{html.escape(diag)}</pre>")
        if fixes:
            html_parts.append(f"<div class='label'>Proposed Fixes</div><pre>{html.escape(fixes)}</pre>")
        if patch:
            html_parts.append(f"<div class='label'>Proposed Patch</div><pre>{html.escape(patch)}</pre>")
        if test_snip:
            html_parts.append(f"<div class='label'>Quick Test</div><pre>{html.escape(test_snip)}</pre>")
        if notes:
            html_parts.append(f"<div class='label'>Notes</div><pre>{html.escape(notes)}</pre>")
        html_parts.append("</section>")
    if gh_results:
        html_parts.append("<section><h2>GitHub Related Issues</h2>")
        for res in gh_results:
            html_parts.append(f"<div class='gh-issue'><div class='label'>#{res['number']}: {html.escape(res['title'])}</div>")
            html_parts.append(f"<a href='{html.escape(res['url'])}'>{html.escape(res['url'])}</a><br>")
            html_parts.append(f"<div class='label'>Issue Preview</div><pre>{html.escape(res['body'])}</pre>")
            html_parts.append(f"<div class='label'>Solution Summary</div><pre>{html.escape(res.get('ocr_summary',''))}</pre></div>")
        html_parts.append("</section>")
    if web_results:
        html_parts.append("<section><h2>Web Search AI Answer</h2>")
        html_parts.append(f"<pre>{html.escape(web_results.get('output_text', ''))}</pre>")
        if web_results.get('citations'):
            html_parts.append("<ul>")
            for c in web_results['citations']:
                html_parts.append(f"<li><a href='{html.escape(c['url'])}'>{html.escape(c['title'])}</a></li>")
            html_parts.append("</ul>")
        html_parts.append("</section>")
    html_parts.append("</body></html>")
    raw_html = ''.join(html_parts)
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(raw_html)
    verbose_print(f"HTML written at: {out_path}")
    return out_path, raw_html
﻿
def open_html_in_chrome(path):
    """Open the report in Chrome when available, else in the default browser."""
    import shutil, subprocess
    verbose_print("Opening HTML report in browser ...")
    url = Path(path).resolve().as_uri()
    try:
        if sys.platform == 'darwin':
            chrome_app = '/Applications/Google Chrome.app'
            if os.path.exists(chrome_app):
                # Pass the .app bundle to `open -a`
                subprocess.run(['open', '-a', chrome_app, url], check=True)
                return
        elif sys.platform == 'win32':
            # `start` is a cmd.exe builtin, so it needs shell=True
            subprocess.run(['start', 'chrome', url], shell=True, check=True)
            return
        else:
            chrome = shutil.which('google-chrome')
            if chrome:
                subprocess.Popen([chrome, url])
                return
    except Exception:
        pass  # fall through to the default browser
    webbrowser.open(url)
﻿
def find_files_from_log_gpt(log_content):
    verbose_print("Invoking LLM to identify implicated files from the log...")
    user_prompt = (
        "Given this error message or traceback, list all file paths (and, if available, line numbers) "
        "involved in the error. Output one JSON per line, as:\n"
        '{"file": "path/to/file.py", "line": 123}\n'
        'If line is not found, use null.\n'
        f"\nError:\n{log_content}"
    )
    client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
    llm_resp = client.responses.create(
        model="gpt-5",
        reasoning={"effort": "low"},
        input=[{"role": "user", "content": user_prompt}]
    )
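    import json  # local import, matching the other helpers in this script
    output = llm_resp.output_text or ""
    results = []
    for l in output.splitlines():
        l = l.strip()
        if not l:
            continue
        try:
            # Parse with json.loads instead of eval: never execute model output
            obj = json.loads(l)
            if isinstance(obj, dict):
                results.append(obj)
        except Exception as exc:
            verbose_print(f"[File Extraction Skipped Line]: {l!r} ({exc})")
    verbose_print(f"LLM File Extraction Result: {results}")
    return results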
﻿
def get_file_snippet(file_path, n_lines=20, line=None):
    if not os.path.exists(file_path):
        verbose_print(f"[WARN] File not found: {file_path}")
        return None
    code = []
    with open(file_path, "r", encoding="utf-8", errors="replace") as f:
        lines = f.readlines()
        if line and 1 <= line <= len(lines):
            s = max(0, line-6)
            e = min(len(lines), line+5)
            code = lines[s:e]
        else:
            code = lines[:n_lines]
    return "".join(code)
﻿
﻿
@weave.op 
def suggest_tools(error_message, code_snippet):
    import ast, json
    verbose_print("Asking LLM: Based on the error and file, which tool to use next?")
    prompt = (
        "You are an AI debugging orchestrator. The following is a Python error message and a snippet of code "
        "from a file involved in the error. Based on this, choose which tools should be used next, and explain why. "
        "Possible tools: github_issue_search, web_search, static_analysis. "
        "Output a single python dictionary (not JSON, not explanation). Example: "
        "{'recommendations':['web_search', 'github_issue_search'], 'justification': 'Searching the web and GitHub can help resolve import errors quickly.'}\n"
        "Error:\n" + error_message +
        "\n\nFile snippet:\n" + code_snippet +
        "\n\nOutput only the dictionary. No preamble or explanation. "
        "Always include github_issue_search in the recommendations."
    )
    client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
    resp = client.responses.create(
        model="gpt-5",
        reasoning={"effort": "low"},
        input=[{"role": "user", "content": prompt}]
    )
    output = (resp.output_text or "").strip()
    try:
        if output.startswith("```") and output.endswith("```"):
            output = output[3:-3].strip()
        obj = ast.literal_eval(output)
        if isinstance(obj, dict):
            verbose_print(f"LLM Tool Suggestion: {obj}")
            return obj
    except Exception:
        pass
    m = re.search(r'\{.*\}', output, re.DOTALL)
    if m:
        try:
            obj = ast.literal_eval(m.group(0))
            if isinstance(obj, dict):
                verbose_print(f"LLM Tool Suggestion: {obj}")
                return obj
        except Exception:
            pass
    verbose_print(f"LLM Suggestion RAW output (not parsable): {output!r}")
    return {"recommendations": [], "justification": 'Could not parse LLM response'}
﻿
﻿
﻿
@weave.op
def final_recommendation_with_gpt5(
    error_text: str,
    code_snippet: str | None,
    tool_suggestion: dict | None,
    gh: list | None,
    web: dict | None,
    query: str,
) -> str:
    """Synthesize a concise, actionable plan from all gathered signals."""
    from openai import OpenAI
    import json, os
﻿
    client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
﻿
    gh_brief = []
    if gh:
        for item in gh[:5]:
            gh_brief.append({
                "title": item.get("title", ""),
                "url": item.get("url", ""),
                "summary": item.get("ocr_summary", "")
            })
﻿
    web_brief = {
        "answer": (web or {}).get("output_text"),
        "citations": (web or {}).get("citations")
    }
﻿
    payload = {
        "error_text": error_text,
        "code_snippet": code_snippet,
        "tool_suggestion": tool_suggestion,
        "search_query": query,
        "github_findings": gh_brief,
        "web_findings": web_brief
    }
﻿
    prompt = (
        "You are a debugging assistant. Based on the following data, produce a short, actionable plan.\n"
        "Include:\n"
        "1. Likely root cause in one or two sentences.\n"
        "2. Concrete next steps that can be executed now.\n"
        "3. If shapes or types are mismatched, propose exact code edits.\n"
        "4. If library problems are implicated, propose install or version pin commands.\n"
        "5. If no external search is needed, say so and outline local static checks.\n\n"
        f"DATA:\n{json.dumps(payload, ensure_ascii=False, indent=2)}\n\n"
        "Return a concise plan. No preamble."
    )
﻿
    resp = client.responses.create(
        model="gpt-5",
        reasoning={"effort": "low"},
        input=[{"role": "user", "content": prompt}]
    )
    return (resp.output_text or "").strip()
﻿
﻿
﻿
﻿
@weave.op
def static_analysis_gpt5(error_text: str, code_snippet: str | None) -> dict:
    """
    Pure GPT-5 static analysis. No web or GitHub.
    Returns a dict with fields: diagnosis, fixes, patch, test_snippet, notes.
    """
    from openai import OpenAI
    import os, json
﻿
    client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
﻿
    system = (
        "You are a Python static analyzer. Read the error and the code snippet. "
        "Find the root cause and propose concrete code edits. "
        "If there is a tensor shape mismatch, compute the exact shapes and provide the corrected operation. "
        "Return strict JSON with keys: diagnosis, fixes, patch, test_snippet, notes."
    )
﻿
    user = {
        "error_text": error_text,
        "code_snippet": code_snippet or ""
    }
﻿
    prompt = (
        "Analyze the following and return strict JSON only. "
        "Do not include commentary outside JSON.\n\n"
        f"{json.dumps(user, ensure_ascii=False, indent=2)}\n\n"
        "{ \"diagnosis\": \"...\", "
        "\"fixes\": [\"...\"], "
        "\"patch\": \"diff or edited code\", "
        "\"test_snippet\": \"python code to quickly sanity check\", "
        "\"notes\": \"short notes\" }"
    )
﻿
    resp = client.responses.create(
        model="gpt-5",
        reasoning={"effort": "low"},
        input=[
            {"role": "system", "content": system},
            {"role": "user", "content": prompt}
        ]
    )
﻿
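    raw = (resp.output_text or "").strip()
    try:
        data = json.loads(raw)
        # The report code calls .get() on this value, so require a JSON object
        if not isinstance(data, dict):
            raise ValueError("expected a JSON object")
    except Exception:
        data = {
            "diagnosis": "Could not parse JSON from model",
            "fixes": [],
            "patch": "",
            "test_snippet": "",
            "notes": raw[:500]
        }
    return data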
﻿
@weave.op
def main(force_use_all_tools: bool = True):
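    import os
    # Read credentials from the environment rather than hardcoding them;
    # assigning "" to OPENAI_API_KEY here would overwrite a key that is already set.
    GITHUB_TOKEN = os.getenv("GITHUB_TOKEN", "")
    if not os.getenv("OPENAI_API_KEY"):
        raise RuntimeError("Set the OPENAI_API_KEY environment variable before running.")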
﻿
﻿
    error_content = read_log(LOGFILE)
    search_query = generate_search_query_openai(error_content) if is_python_error(error_content) \
                   else error_content.strip().replace("\n", " ")
﻿
    files_info, snippet, tools = None, None, None
    try:
        files_info = find_files_from_log_gpt(error_content)
        if files_info:
            file_to_examine, line_hint = files_info[0].get("file"), files_info[0].get("line")
            verbose_print(f"Selected file: {file_to_examine}, line: {line_hint}")
            snippet = get_file_snippet(file_to_examine, line=line_hint)
            if snippet:
                print("\n--- Snippet from implicated file ---\n", flush=True)
                print(snippet, flush=True)
                print("-" * 60, flush=True)
                tools = suggest_tools(error_content, snippet)
                print("\n[TOOL RECOMMENDATION]:", tools, flush=True)
            else:
                verbose_print(f"Could not get snippet from file {file_to_examine}")
        else:
            verbose_print("Did not find any file to examine in the error.")
    except Exception as e:
        verbose_print(f"[WARN] File inference failed: {e}")
﻿
    gh_results = []
    web_results = None
    static_result = None
﻿
    # run static analysis
    if force_use_all_tools or (tools and "static_analysis" in tools.get("recommendations", [])):
        static_result = static_analysis_gpt5(error_content, snippet)
﻿
    # run github search
    if force_use_all_tools or (tools and "github_issue_search" in tools.get("recommendations", [])):
        gh_results = search_github(
            search_query,
            github_token=GITHUB_TOKEN,
            error_text=error_content
        )
﻿
    # run web search
    if force_use_all_tools or (tools and "web_search" in tools.get("recommendations", [])):
        try:
            web_results = openai_web_search(search_query)
        except Exception as ex:
            print(f"[OpenAI] Search failed: {ex}", flush=True)
﻿
    final_plan = final_recommendation_with_gpt5(
        error_text=error_content,
        code_snippet=snippet,
        tool_suggestion=tools,
        gh=gh_results,
        web=web_results,
        query=search_query
    )
    print("\n=== FINAL RECOMMENDATION ===\n", final_plan, "\n", flush=True)
﻿
    html_path, raw_html = write_html_report(
        log=error_content,
        file_snippet=snippet,
        tools=tools,
        gh_results=gh_results,
        web_results=web_results,
        static_result=static_result   # pass it so the section renders
    )
﻿
    appended = raw_html.replace(
        "</body></html>",
        f"<section><h2>Final Recommendation</h2><pre>{html.escape(final_plan or '')}</pre></section></body></html>"
    )
    with open(html_path, "w", encoding="utf-8") as f:
        f.write(appended)
﻿
    open_html_in_chrome(html_path)
    verbose_print("Searches complete. Examine the HTML report in Chrome for summary and results.\n")
﻿
﻿
﻿
if __name__ == "__main__":
    # Fix for Windows event loop policy (Playwright + asyncio)
    if sys.platform.startswith("win"):
        try:
            asyncio.set_event_loop_policy(asyncio.WindowsProactorEventLoopPolicy())  # type: ignore[attr-defined]
        except Exception:
            pass
    main(force_use_all_tools=True)
﻿
﻿
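The script first reads a Python error log from disk (more on this below). If the log looks like a real traceback, GPT-5 condenses the noisy text into a short, generalized search query. The prompt strips volatile details such as file paths, line numbers, memory addresses, and run-specific tensor shapes. What remains is the core error phrase plus any clearly identified library name, the kind of query that consistently surfaces useful matches.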
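As a rough illustration of the cleanup that prompt asks GPT-5 to perform, the sketch below applies a few regular expressions to a single error line. The helper name `normalize_error_for_search` and its patterns are assumptions made for this example; the tutorial's script delegates this step to GPT-5 rather than to regexes.

```python
import re

def normalize_error_for_search(error_line: str) -> str:
    """Strip run-specific details from one error line (hypothetical helper)."""
    text = error_line
    # Traceback frame references such as: File "/tmp/x.py", line 6, in <module>
    text = re.sub(r'File "[^"]+", line \d+(?:, in \S+)?', "", text)
    # Memory addresses such as 0x7f3a2c10
    text = re.sub(r"0x[0-9a-fA-F]+", "", text)
    # Shape pairs as printed in torch.matmul errors, e.g. (3x4 and 5x6)
    text = re.sub(r"\(\d+x\d+ and \d+x\d+\)", "", text)
    # Collapse the whitespace left behind by the removals
    return re.sub(r"\s+", " ", text).strip()

print(normalize_error_for_search(
    "RuntimeError: mat1 and mat2 shapes cannot be multiplied (3x4 and 5x6)"
))
```

A deterministic filter like this could serve as a cheap fallback when the model call fails, though it will miss many of the cases a language model handles.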
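To use the debugging helper in a real development loop, wrap the Python command in a small shell function so the debugger runs automatically whenever an error occurs.
For example, add the following function to your shell profile (e.g. .zshrc or .bashrc):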
agentpython() {
    logfile="/tmp/agentpython-stderr.log"
    python "$@" 2> >(tee "$logfile" >&2)
    if [[ -s "$logfile" ]]; then
        # If logfile is NOT empty, run check script
        python /Users/...FULL_PATH_TO/your_debug.py "$logfile"
    else
        # If logfile is empty, clear it (truncate to zero length)
        > "$logfile"
    fi
}
💡 After adding the function to .zshrc or .bashrc, you can load it into the current terminal session without restarting the terminal by running . ~/.zshrc or . ~/.bashrc, depending on your shell.
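Here is how it works:
logfile points to a temporary file that captures everything the Python script writes to stderr.
You run your Python script as usual, but invoke it as agentpython myscript.py.
The stderr output is shown in the terminal and simultaneously saved to the log file via tee.
If the logfile is not empty, the function treats that as a sign that an error occurred (warnings written to stderr trigger it too) and immediately calls the debugging helper script (your_debug.py) with the path to the error log.
The helper then runs the GPT-5 + Weave pipeline: it generates a search query, fetches GitHub issues and runs OCR on them, summarizes the fixes, and writes the HTML report.
If the logfile is empty (no errors), it is simply reset.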
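This lets you integrate the GPT-5 debugging flow directly into your normal development process. Run agentpython instead of python, and the debugger kicks in automatically whenever something goes wrong: it pulls related issues, logs every input and output to Weave, and opens a report you can use to investigate right away.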
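For readers who prefer not to edit a shell profile, the same idea can be sketched in Python. This is an illustrative alternative, not part of the original tutorial: the function name `run_and_capture_stderr` is hypothetical, and unlike tee it prints stderr only after the child process has exited.

```python
import subprocess
import sys
from pathlib import Path

def run_and_capture_stderr(args, logfile):
    """Run `python <args>`, echo its stderr, and save that stderr to logfile.

    Returns a (returncode, has_stderr) tuple.
    """
    proc = subprocess.run([sys.executable, *args], stderr=subprocess.PIPE, text=True)
    sys.stderr.write(proc.stderr)  # keep the error visible in the terminal
    Path(logfile).write_text(proc.stderr, encoding="utf-8")
    return proc.returncode, bool(proc.stderr.strip())
```

A caller could launch the debugging helper only when the return code is non-zero, which is a stricter signal than "stderr is not empty" because warnings alone would no longer trigger a run.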
To demonstrate the helper, we wrote a small script that contains a bug and saved it as bad_code.py:
import torch
﻿
a = torch.randn(3, 4)  # 3x4
b = torch.randn(5, 6)  # 5x6
﻿
result = torch.matmul(a, b)
﻿
print(result)
Then we ran the following command:
agentpython bad_code.py
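This raised an error, because torch.matmul requires the inner dimensions to match and here they are 4 and 5, which in turn triggered our agent.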
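The rule that bad_code.py violates can be written down without PyTorch. The following is a minimal sketch for 2-D shapes only; `matmul_result_shape` is a hypothetical helper and does not model broadcasting or batched inputs.

```python
def matmul_result_shape(a_shape, b_shape):
    """Return the shape of a 2-D matrix product, or raise ValueError if invalid."""
    (m, k_a), (k_b, n) = a_shape, b_shape
    if k_a != k_b:
        raise ValueError(
            f"inner dimensions differ: {a_shape} @ {b_shape} ({k_a} != {k_b})"
        )
    return (m, n)

print(matmul_result_shape((3, 4), (4, 6)))  # valid: both inner dimensions are 4
try:
    matmul_result_shape((3, 4), (5, 6))     # the shapes used in bad_code.py
except ValueError as exc:
    print(exc)
```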
﻿
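Once it has a query, the script can branch into several paths. It calls the GitHub API to fetch the top issues, then launches a headless Chromium session with Playwright to capture full-page screenshots. Those screenshots are run through Tesseract OCR so that even long, image-only threads become readable. GPT-5 then summarizes each OCR extract in the context of the original error and returns a concise cause and fix you can act on immediately.
Here, the code searches GitHub for related issues.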
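The first hop of that path, the issue search itself, might look roughly like the sketch below. It only builds the request against GitHub's issue-search endpoint and does not send it; `build_issue_search_request` is a hypothetical name, and the tutorial's own search_github function (defined earlier in the script) may be structured differently.

```python
import urllib.parse
import urllib.request

GITHUB_SEARCH_URL = "https://api.github.com/search/issues"

def build_issue_search_request(query, token="", per_page=5):
    """Build (but do not send) a GitHub issue-search request."""
    params = urllib.parse.urlencode({"q": f"{query} is:issue", "per_page": per_page})
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        # Authenticated requests get a higher rate limit
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(f"{GITHUB_SEARCH_URL}?{params}", headers=headers)
```

Passing the result to urllib.request.urlopen would return JSON whose items list carries fields such as number, title, html_url, and body for each match.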
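If the tool-recommendation model call suggests web search, GPT-5 also queries the live internet. (As written, main runs with force_use_all_tools=True, so every tool runs regardless of the recommendation.) It reads relevant pages, combines them with the error context, and returns a directly actionable answer along with links to the sources it relied on. This step extends coverage beyond GitHub to documentation, blog posts, and Q&A forums.
Here, the agent searches the web for related issues.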
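The script collects those source links by scanning the answer text for Markdown links. The sketch below isolates that step so its behavior is easy to see; the sample text and its example.com URL are made up for illustration.

```python
import re

# Same pattern the script uses: [title](http(s)://url)
MARKDOWN_LINK = re.compile(r'\[([^\]]+)\]\((https?://[^\)]+)\)')

def extract_citations(output_text):
    """Return [{'title': ..., 'url': ...}] for each Markdown link in the text."""
    return [{"title": t, "url": u} for t, u in MARKDOWN_LINK.findall(output_text or "")]

sample = (
    "Check the inner dimensions before multiplying "
    "([matmul notes](https://example.com/matmul))."
)
print(extract_citations(sample))
```

One limitation of this pattern is that a URL containing a closing parenthesis is cut off at that character, so some links may be captured only partially.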
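A pure static-analysis path is also available. When selected, GPT-5 reads the traceback and the relevant code snippet without any web or GitHub lookups (the call to the GPT-5 API itself still needs network access), then returns strict JSON containing a diagnosis, targeted fixes, a proposed patch, a small test snippet for verifying the change, and supplementary notes. This is particularly well suited to local problems such as tensor shape mismatches or misuse of a library API, where the fix lies in the code itself.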
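Models do not always honor a "strict JSON" instruction, and a common deviation is wrapping the reply in a Markdown code fence. The following sketch shows one way to tolerate that and to guarantee the five expected keys; `parse_static_result` is a hypothetical helper, not a function from the tutorial's script.

```python
import json
import re

def _defaults():
    # Fresh dict per call so the "fixes" list is never shared between results
    return {"diagnosis": "", "fixes": [], "patch": "", "test_snippet": "", "notes": ""}

def parse_static_result(raw):
    """Parse the model's JSON reply, tolerating Markdown code fences."""
    text = (raw or "").strip()
    # Drop a leading ```json / ``` marker and a trailing ``` fence if present
    text = re.sub(r"^```[a-zA-Z]*\s*|\s*```$", "", text)
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        data = None
    result = _defaults()
    if not isinstance(data, dict):
        result["diagnosis"] = "Could not parse JSON from model"
        result["notes"] = text[:500]
        return result
    # Keep only the expected keys; missing ones retain their defaults
    result.update({k: v for k, v in data.items() if k in result})
    return result
```

Returning a dict with every key present means the report writer can call .get() or index directly without extra checks.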
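The core functions are wrapped in @weave.op, so Weave records the inputs and outputs for log reading, query generation, GitHub scraping, OCR, web search, summarization, static analysis, and final plan synthesis. In the Weave UI you can step through a run, see exactly how each result was produced, and compare outputs across sessions.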
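To make concrete what "records the inputs and outputs" means, here is a toy stand-in decorator that stores each call in a list. It is explicitly not Weave and not how Weave is implemented; `record_calls`, `CALL_LOG`, and `is_python_error_demo` are hypothetical names used only to illustrate the idea of call tracing.

```python
import functools

CALL_LOG = []  # stand-in for a trace store; Weave sends this data to its UI instead

def record_calls(fn):
    """Record each call's name, inputs, and output (illustration only, not Weave)."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        result = fn(*args, **kwargs)
        CALL_LOG.append(
            {"op": fn.__name__, "args": args, "kwargs": kwargs, "output": result}
        )
        return result
    return wrapper

@record_calls
def is_python_error_demo(text):
    return "Traceback (most recent call last)" in text

is_python_error_demo("Traceback (most recent call last):\n  ...")
print(CALL_LOG[0]["op"], CALL_LOG[0]["output"])
```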
﻿
﻿
﻿
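In the final step, the script generates a single HTML report and opens it in Chrome. The report contains the original error log, the relevant source snippet if one was found, GPT-5's tool recommendations, the static analysis results, GitHub issues with links, issue previews, and solution summaries, a web search section with the AI answer and its sources, and a concise, consolidated final recommendation from GPT-5 so you can apply a fix without delay.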
﻿
﻿
﻿
﻿
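Because Weave logs everything, you can reopen a debugging run later and adjust any part of it. For example, you can try a different GPT-5 reasoning effort level when summarizing fixes, or tweak the search query prompt, and compare results directly without rerunning the full browser and OCR pipeline. Over time this builds a library of debugging patterns tied to actual model behavior, which helps refine the process for future issues.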
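One way to set up such a comparison is to prepare one request per effort level and send each through the same client.responses.create call the script already uses. The sketch below only builds the keyword arguments; `build_effort_variants` is a hypothetical helper, and the list of effort values is an assumption that should be checked against the current API reference.

```python
# Assumed effort levels; verify against the API reference before relying on them
EFFORT_LEVELS = ("minimal", "low", "medium", "high")

def build_effort_variants(prompt, model="gpt-5", efforts=EFFORT_LEVELS):
    """Return one kwargs dict per effort level for client.responses.create."""
    return [
        {
            "model": model,
            "reasoning": {"effort": effort},
            "input": [{"role": "user", "content": prompt}],
        }
        for effort in efforts
    ]

for kwargs in build_effort_variants("Summarize the fix for this error: ..."):
    print(kwargs["reasoning"]["effort"])
```

Calling client.responses.create(**kwargs) for each variant inside a function decorated with @weave.op would place the runs side by side in the Weave UI.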
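Conclusion
GPT-5 opens up a range of possibilities that goes well beyond simple text generation.
This tutorial walked through three very different use cases and showed what each gains from being instrumented with Weave. We saw GPT-5:
Work with images, describing and generating visual content.
Adjust its reasoning depth to solve coding tasks.
Drive a full debugging pipeline that ties together log parsing, static analysis, GitHub search with OCR, and live web results.
Weave served as the connective tissue at every step, recording model inputs, outputs, and intermediate artifacts so the whole process stayed transparent and reproducible. Whether you are exploring creative workflows, running model evaluations, or troubleshooting complex errors, having the full visual history in Weave helps you understand why a result came out the way it did, compare alternative runs, and iterate faster.
Combining GPT-5's multimodal capabilities with Weave's observability gives you not only powerful automation but also a permanent, inspectable record of how your tools behaved. Each run becomes a learning resource you can revisit, refine, and reuse, and your experiments and debugging sessions steadily accumulate into a knowledge base for future work.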
﻿
This article was translated by AI. If you suspect a mistranslation, please let us know in the comments. The original report is available at the following link: View the original report
﻿