Code generation performance compared: evaluating o4-mini, Claude 3.7, and Gemini 2.5 Pro

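A hands-on, head-to-head test of Gemini 2.5 Pro, o4-mini, and Claude 3.7 Sonnet on competitive programming problems, built on a custom execution framework with Weave integration to track correctness, catch bugs, and cut through benchmark hype. This article is an AI translation. If you spot a mistranslation, feel free to point it out in the comments.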
Brett Young
Created on September 12|Last edited on September 12
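Google's new Gemini 2.5 Pro Preview (I/O Edition) has been released, promising stronger code generation and smarter reasoning across a range of development tasks. Claims alone are not enough. The field is moving very quickly, and Gemini now stands alongside top-tier models such as OpenAI's o4-mini and Anthropic's Claude 3.7 Sonnet, competing to be every developer's favorite assistant.
To see which model actually holds up, we put Gemini 2.5 Pro through the same practical programming tasks as its strongest peers: OpenAI's o4-mini and Anthropic's Claude 3.7 Sonnet. There is no cherry-picking and there are no shortcuts. Each model receives the same coding problems from a trusted benchmark set, and every solution is executed programmatically and scored on correctness.
This is not a story about leaderboard marketing. The goal is to see which model produces code that runs and returns the right answer. We want to know whether Gemini 2.5 Pro lives up to expectations, not only on its own but in direct comparison with other frontier models. This head-to-head should make Gemini's current standing clear and show developers what to expect when they hand these models hard, unfiltered coding work.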
Table of contents
Our benchmark: CodeContests
Building a lightweight code execution framework
Evaluating Gemini 2.5 Pro Preview, o4-mini, and Claude 3.7 Sonnet
Head-to-head results: Gemini 2.5 Pro vs o4-mini vs Claude 3.7 Sonnet
Weave comparison view
Catching bugs with Weave
Conclusion
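Our benchmark: CodeContests
A growing problem in LLM evaluation is that models appear to "overfit" to benchmark test sets even when they were never explicitly trained on that data. This is not overfitting in the traditional sense, where a model memorizes answers it saw during training. It is closer to data leakage at scale, a byproduct of how LLMs are pretrained.
Popular benchmarks such as GSM8K, HumanEval, and MMLU have circulated widely online for years. They are easy to find on GitHub, in academic papers, blog posts, and tutorials. When an LLM is pretrained on a large crawl of the public internet, it has very likely seen parts of these benchmarks, or complete copies. At evaluation time, the model may therefore already "know" the task even if the benchmark was excluded from fine-tuning.
This creates an illusion of strong generalization. The model scores well not because it truly solves the problem but because it recognizes the format, recalls similar phrasing, or has memorized the answer distribution. That is how a model can perform well on a test set without ever being separately trained on it: it has encountered very similar (or identical) problems many times in the wild, so it only appears to generalize.
This makes benchmark scores hard to trust, especially when several models are trained on similar corpora. The line between pretraining and evaluation is too blurry, and a benchmark loses value the more it is reused. Any major claim about reasoning or generalization should be treated with caution unless it is backed by a private evaluation that has never been exposed. For the same reason, we cannot guarantee that the benchmarks we run below reflect true generalization: public benchmarks are increasingly saturated, and models have likely seen them during pretraining even when they were not explicitly used for fine-tuning.
None of this means benchmarks are useless. They provide a common reference point for comparing models, tracking change over time, and stress-testing specific capabilities. But a benchmark should be treated as a diagnostic tool, not as ground truth. A high score does not mean the model understands the task; it may simply be repeating what it has seen before.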
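We will run Gemini 2.5 Pro, o4-mini, and Claude 3.7 Sonnet on CodeContests to see which one best solves real competitive programming problems. The collection gathers more than 10,000 real competitive programming problems from platforms such as Codeforces, AtCoder, and CodeChef. Each problem includes a natural-language description, multiple public and private test cases, and metadata such as tags and difficulty.
CodeContests is widely used as a benchmark for code generation models thanks to its variety of problem types and its strict evaluation. The problems were not designed with LLMs in mind, and they demand clear reasoning, precise implementation, and actual correct answers on the test cases rather than plausible-looking code.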
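For a fair head-to-head comparison we use the public test split, with the aim that every model faces the same real problems and has not seen them before. This benchmark gives a direct, realistic measure of how well Gemini 2.5 Pro Preview, o4-mini, and Claude 3.7 Sonnet handle real-world coding tasks.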
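To make the problem structure concrete, here is a minimal sketch of the fields an evaluation loop would read from one CodeContests row. The field names `name`, `description`, and `public_tests` are the ones the evaluation script in this article reads; `private_tests` and `cf_tags` are assumptions about the dataset schema, `summarize_problem` is a hypothetical helper, and the example row is hand-written for illustration rather than taken from the dataset.

```python
from typing import Any, Dict


def summarize_problem(row: Dict[str, Any]) -> Dict[str, Any]:
    """Collect the fields an evaluation loop typically needs from one problem row."""
    empty_tests = {"input": [], "output": []}
    public = row.get("public_tests", empty_tests)
    private = row.get("private_tests", empty_tests)  # assumed field name
    return {
        "name": row.get("name", ""),
        "description_chars": len(row.get("description", "")),
        "n_public_tests": len(public["input"]),
        "n_private_tests": len(private["input"]),
        "tags": list(row.get("cf_tags", [])),  # assumed field name
    }


# Hypothetical row, hand-written for illustration (not real dataset content).
example_row = {
    "name": "0000_A. Example Sum",
    "description": "Given two integers a and b, print a + b.",
    "public_tests": {"input": ["1 2\n"], "output": ["3\n"]},
    "private_tests": {"input": ["5 7\n", "0 0\n"], "output": ["12\n", "0\n"]},
    "cf_tags": ["math"],
}

summary = summarize_problem(example_row)
print(summary)
```

Each test input and output is a raw string, which is why the framework needs an extra step to turn an input such as "1 2\n" into arguments for a Python function.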
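Building a lightweight code execution framework
We will use a custom lightweight framework to automate the entire evaluation. For each problem, every model sees only the natural-language problem description and is asked to produce a Python function as its solution. The framework standardizes the calls to Gemini 2.5 Pro, o4-mini, and Claude 3.7 Sonnet so that every line of code and every test is compared fairly.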
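For efficiency and flexibility, we use an LLM to handle the auxiliary work, such as constructing the right function call and formatting the example inputs for execution. After running each model's code against the public test cases, we also use an LLM to compare the model's output with the expected answer. This handles the edge cases that commonly come up when validating code output, such as subtle formatting differences and floating-point tolerance.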
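The framework in this article delegates output comparison to an LLM judge. As a point of contrast, here is a minimal sketch of a deterministic comparison that covers the two edge cases named above, whitespace differences and small floating-point error. It is not part of the article's framework; `outputs_match` and its tolerance defaults are hypothetical choices for illustration. Integers are compared exactly so that answers such as modular results are never accepted as "close enough".

```python
import math


def _tokens_equal(exp: str, act: str, rel_tol: float, abs_tol: float) -> bool:
    """Compare two whitespace-delimited tokens."""
    if exp == act:
        return True
    # Integers must match exactly (this also treats "007" and "7" as equal).
    try:
        return int(exp) == int(act)
    except ValueError:
        pass
    # Otherwise fall back to a tolerant float comparison.
    try:
        return math.isclose(float(exp), float(act), rel_tol=rel_tol, abs_tol=abs_tol)
    except ValueError:
        return False


def outputs_match(expected: str, actual: str,
                  rel_tol: float = 1e-6, abs_tol: float = 1e-9) -> bool:
    """Token-wise comparison that ignores whitespace layout and tiny float error."""
    exp_tokens = expected.split()
    act_tokens = actual.split()
    if len(exp_tokens) != len(act_tokens):
        return False
    return all(
        _tokens_equal(e, a, rel_tol, abs_tol)
        for e, a in zip(exp_tokens, act_tokens)
    )


print(outputs_match("665496236\n", "  665496236 "))
print(outputs_match("0.5", "0.5000001"))
print(outputs_match("665496236", "665496237"))
```

A check like this could run before the LLM judge, so that the judge is only consulted when the deterministic comparison fails. That is a design option, not something the article's script does.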
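Automating function calls and answer checking with an LLM gives us a standardized, repeatable evaluation process while keeping the focus where it matters: whether the models can actually produce working code for real problems.
Here is a short script that shows some of the core components of our code execution framework: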
import subprocess
import sys

from litellm import completion


def clean_llm_code_block(text: str) -> str:
    # Strip the markdown code fences that LLMs often wrap around code.
    return text.replace("```python", "").replace("```", "").strip()


def ask_llm_for_function_call(code: str, raw_input: str) -> str:
    # Ask a helper LLM to turn one raw test input into a call such as solve(3, 5).
    prompt = (
        "You're given a Python function and a single input string. "
        "Format it into a valid Python function call using only standard types.\n\n"
        f"Function:\n{code}\n\n"
        f"Input:\n{raw_input.strip()}\n\n"
        "Return ONLY a valid function call (e.g., solve(3, 5))."
    )
    response = completion(
        model="openai/gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        max_tokens=100,
    )
    raw = response["choices"][0]["message"]["content"]
    return clean_llm_code_block(raw)


def compare_output_with_llm(expected: str, actual: str) -> bool:
    # Use a helper LLM as a lenient judge of whether two outputs are equivalent.
    prompt = (
        f"Expected output: {expected.strip()}\n"
        f"Actual output: {actual.strip()}\n\n"
        "Are these outputs semantically equivalent? Reply YES or NO."
    )
    response = completion(
        model="openai/gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        max_tokens=10,
    )
    answer = response["choices"][0]["message"]["content"].strip().upper()
    # Accept "YES" as well as variants such as "YES." that an exact match would reject.
    return answer.startswith("YES")


def run_code_and_call_function(code: str, function_call: str, timeout=10):
    # Run the candidate code plus one call in a fresh Python subprocess.
    full_code = code + f"\n\nprint({function_call})"
    try:
        result = subprocess.run(
            [sys.executable, "-c", full_code],
            capture_output=True,
            text=True,
            timeout=timeout
        )
        return result.stdout.strip(), result.stderr.strip()
    except subprocess.TimeoutExpired:
        return "", "Execution timed out."
    except Exception as e:
        return "", str(e)


# Example LLM-generated code (you'd swap this in dynamically).
# It is only a stand-in and does not solve the test cases below.
# Note: pow(a, -1, mod) needs Python 3.8 or newer.
code_from_llm = """
def solve(a, b, mod):
    return pow(a, -1, mod)
"""

# Example test case block
test_case = {
    "input": [
        "3 2 998244353\n",
        "9 3 998244353\n",
        "3 1 998244353\n",
        "9 4 998244353\n"
    ],
    "output": [
        "665496236\n",
        "449209967\n",
        "499122178\n",
        "665496237\n"
    ]
}

# Run test cases
for i, (input_line, expected) in enumerate(zip(test_case["input"], test_case["output"])):
    try:
        call = ask_llm_for_function_call(code_from_llm, input_line)
        result, error = run_code_and_call_function(code_from_llm, call)

        if error:
            print(f"[{i}] ERROR during execution: {error}")
            continue

        is_correct = compare_output_with_llm(expected, result)
        print(f"[{i}] input: {input_line.strip()} → output: {result} | expected: {expected.strip()} | PASS: {is_correct}")

    except Exception as e:
        print(f"[{i}] Failed: {repr(e)}")
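When designing a complex evaluation, I find it helpful to first build a simple single-example version that validates the core tasks the evaluation will involve. After that, adding more models, metrics, and scoring functions becomes much easier.
Evaluating Gemini 2.5 Pro Preview, o4-mini, and Claude 3.7 Sonnet
For a fair one-to-one comparison, we run each LLM on a batch of 30 CodeContests problems drawn from the same test set and automate the whole process, from solution generation to test-case evaluation. For each problem the model receives only the task description and must output a standalone, executable Python function named solve. A code extraction routine (with a backup LLM parser for robustness) then enforces output consistency so that only valid, importable code reaches the execution stage.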
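Next, for each test input, another LLM prompt automatically generates the required function call. The candidate code is then run in a separate Python subprocess (isolated from the evaluation process, though not a full sandbox), and its output is compared with the expected result.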
import os
import re
import subprocess
import sys
import time

import weave
from anthropic import Anthropic
from datasets import load_dataset
from google import genai
from google.genai import types
from litellm import completion as oai_completion
from weave.flow.eval_imperative import EvaluationLogger

weave.init("codecontests_eval")

# API keys ("sk-..." is a placeholder; set the real keys in your environment)
os.environ["OPENAI_API_KEY"] = os.getenv("OPENAI_API_KEY", "sk-...")
CLAUDE_KEY = os.getenv("CLAUDE_API_KEY", "sk-...")
GENAI_KEY = os.getenv("GOOGLE_API_KEY", "sk-...")

# Clients
anthropic_client = Anthropic(api_key=CLAUDE_KEY)
gemini_client = genai.Client(api_key=GENAI_KEY)


def clean_llm_code_block(text):
    # First pass: drop markdown fences and keep the last `def solve(` block, if any.
    cleaned_text = text.replace("```python", "").replace("```", "").strip()
    code_blocks = re.findall(r"(def solve\(.*?)(?=^def |\Z)", cleaned_text, re.DOTALL | re.MULTILINE)
    source_text = code_blocks[-1] if code_blocks else cleaned_text

    # Second pass: have GPT-4o extract clean code, including any imports it needs.
    prompt = (
        "Given the following response from a language model, extract ONLY the valid Python code for the function. "
        "Do not include any explanations, text, or formatting fences. Only the code.\n\n"
        f"Response:\n{source_text}\n\n"
        "Return ONLY the Python code, including any necessary imports:"
    )

    response = oai_completion(
        model="openai/gpt-4o-2024-08-06",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )

    gpt4o_code = response["choices"][0]["message"]["content"]
    gpt4o_code = gpt4o_code.replace("```python", "").replace("```", "").strip()
    return gpt4o_code


@weave.op()
def generate_completion(model: str, prompt: str) -> str:
    # Route the prompt to the model under test, with a reduced thinking budget.
    if model.startswith("openai/"):
        response = oai_completion(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            reasoning_effort="low",
        )
        return response["choices"][0]["message"]["content"].strip()

    elif model.startswith("anthropic/"):
        response = anthropic_client.messages.create(
            model=model.replace("anthropic/", ""),
            max_tokens=8000,
            thinking={"type": "enabled", "budget_tokens": 4000},
            messages=[{"role": "user", "content": prompt}],
        )
        # Skip thinking blocks and return the first text block.
        for block in response.content:
            if block.type == "text":
                return block.text.strip()
        return "[No Claude response]"

    elif model.startswith("gemini/"):
        result = gemini_client.models.generate_content(
            model=model.replace("gemini/", ""),
            config=types.GenerateContentConfig(
                thinking_config=types.ThinkingConfig(thinking_budget=4000)
            ),
            contents=[prompt]
        )
        return result.text.strip() if result.text else "[No Gemini response]"

    else:
        raise ValueError(f"Unsupported model: {model}")


def ask_llm_for_function_implementation(description: str, model: str) -> str:
    # Trailing spaces keep the concatenated sentences from running together.
    prompt = (
        "Write a Python3 function named `solve` with typed input arguments for this problem -- eg the solve function should take arguments to handle different test cases:\n\n"
        f"{description.strip()}\n\n"
        "Return only a valid Python function -- no special packages that arent commonly used and NO MAIN function, no  if __name__ == __main__....., JUST write the function --  that returns the result. No comments, no explanations. "
        "HOWEVER, you still need to include necessary imports for libraries. "
        "IF you do not include the right imports, the code will not be executable, and your response will be judged as incorrect!"
    )
    return clean_llm_code_block(generate_completion(model, prompt))


@weave.op()
def ask_llm_for_function_call(code: str, raw_input: str, model: str) -> str:
    prompt = (
        "You're given a Python function and a single input string. "
        "Format it into a valid Python function call using only standard types.\n\n"
        f"Function:\n{code}\n\n"
        f"Input:\n{raw_input.strip()}\n\n"
        "Return ONLY a valid function call (e.g., solve(3, 5)) WITH NO 'def' "
    )

    # Always use GPT-4o for this inference, regardless of the `model` argument.
    response = oai_completion(
        model="openai/gpt-4o-2024-08-06",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    # The LLM may return markdown code blocks; strip them just in case.
    content = response["choices"][0]["message"]["content"]
    content = content.replace("```python", "").replace("```", "").strip()
    return content


def compare_output_with_llm(expected: str, actual: str, model: str) -> bool:
    prompt = (
        f"Expected output: {expected.strip()}\n"
        f"Actual output: {actual.strip()}\n\n"
        "Are these outputs equivalent? Eg ignore minor formatting errors etc, we are just looking for overall correctness in the output. Reply YES or NO."
    )

    # Always judged by GPT-4o, regardless of the `model` argument.
    response = oai_completion(
        model="openai/gpt-4o-2024-08-06",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    # Lenient check: any reply containing "YES" counts as a pass.
    return "YES" in str(response["choices"][0]["message"]["content"]).upper()


def run_code_and_call_function(code: str, function_call: str, timeout=10):
    # Run the candidate code plus one call in a fresh Python subprocess and time it.
    full_code = code + f"\n\nprint({function_call})"
    try:
        start = time.time()
        result = subprocess.run(
            [sys.executable, "-c", full_code],
            capture_output=True,
            text=True,
            timeout=timeout
        )
        latency = time.time() - start
        return result.stdout.strip(), result.stderr.strip(), latency
    except subprocess.TimeoutExpired:
        return "", "Execution timed out.", timeout
    except Exception as e:
        return "", str(e), 0.0


def ask_model_for_pip_command(error_msg):
    prompt = (
        "Given this Python error:\n\n"
        + error_msg +
        "\n\nWrite the pip install command needed to fix it. Only return the command, e.g.:\n"
        "pip install requests"
    )
    # Call the helper model directly; generate_completion adds reasoning_effort,
    # which is intended for the reasoning models under test.
    response = oai_completion(
        model="openai/gpt-4o-2024-08-06",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    return response["choices"][0]["message"]["content"].strip()


def run_pip_install(pip_command):
    # Only accept a plain "pip install <packages>" command from the helper model.
    parts = pip_command.replace("`", "").split()
    if len(parts) < 3 or parts[:2] != ["pip", "install"]:
        print(f"Skipping unexpected pip command: {pip_command!r}")
        return
    print(f"Running: {pip_command}")
    try:
        # Install into the same interpreter that runs the candidate code.
        result = subprocess.run(
            [sys.executable, "-m", "pip", *parts[1:]],
            capture_output=True,
            text=True,
            timeout=180
        )
        print(result.stdout.strip())
        if result.stderr:
            print(result.stderr.strip())
    except Exception as e:
        print(f"pip install failed: {e}")


def evaluate_model_on_code_contests(model_name: str):
    print(f"\n\nRunning evaluation for model: {model_name}\n")
    ds = load_dataset("deepmind/code_contests", split="test", streaming=True)
    ds = list(ds.take(30))

    eval_logger = EvaluationLogger(
        model=model_name.replace("-", "_").replace("/", "_").replace(".", "_"),
        dataset="code_contests_test"
    )
    all_latencies = []

    for i, row in enumerate(ds):
        description = row["description"]
        raw_inputs = row["public_tests"]["input"]
        expected_outputs = row["public_tests"]["output"]

        try:
            code = ask_llm_for_function_implementation(description, model=model_name)
            print(f"\n=== Task {row['name']} ===", flush=True)

            all_passed = True
            task_latencies = []
            results_lst, expected_lst = [], []

            for j, raw_input in enumerate(raw_inputs):
                expected = expected_outputs[j] if j < len(expected_outputs) else ""

                try:
                    function_call = ask_llm_for_function_call(code, raw_input, model=model_name)
                    result, error, latency = run_code_and_call_function(code, function_call)

                    if error and "ModuleNotFoundError" in error:
                        # Install the missing package and re-run once.
                        print(f"[{j}] Runtime error: {error}")
                        pip_cmd = ask_model_for_pip_command(error)
                        run_pip_install(pip_cmd)
                        result, error, latency = run_code_and_call_function(code, function_call)

                    # Record the latency of the final run only.
                    task_latencies.append(latency)

                    if error:
                        print(f"[{j}] Runtime error: {error}")
                        all_passed = False
                        continue

                    is_correct = compare_output_with_llm(expected, result, model="openai/gpt-4o-2024-08-06")
                    results_lst.append(result)
                    expected_lst.append(expected)
                    if not is_correct:
                        all_passed = False
                    print(f"[{j}] input: {raw_input.strip()} → output: {result} | expected: {expected.strip()} | PASS: {is_correct} | latency: {latency:.2f}s")

                except Exception as inner:
                    print(f"[{j}] Inner error: {repr(inner)}")
                    all_passed = False

            task_avg_latency = sum(task_latencies) / len(task_latencies) if task_latencies else 0.0
            all_latencies.extend(task_latencies)

            # A task counts as correct only if every public test passed.
            prediction_log = eval_logger.log_prediction(
                inputs={"description": description},
                output={'code': code, 'execution_result': results_lst, 'expected_execution_result': expected_lst}
            )
            prediction_log.log_score("correctness", all_passed)
            prediction_log.log_score("code_latency", task_avg_latency)
            prediction_log.finish()

        except Exception as e:
            print(f"[{i}] Top-level failure: {repr(e)}")
            prediction_log = eval_logger.log_prediction(
                inputs={"description": description},
                output=str(e)
            )
            prediction_log.log_score("correctness", False)
            prediction_log.finish()

    avg_latency = sum(all_latencies) / len(all_latencies) if all_latencies else 0.0
    eval_logger.log_summary({"avg_code_latency": avg_latency})
    print(f"Evaluation complete for {model_name}. View in Weave UI.")


# Run for all models
evaluate_model_on_code_contests("gemini/gemini-2.5-pro-preview-05-06")
evaluate_model_on_code_contests("anthropic/claude-3-7-sonnet-20250219")
evaluate_model_on_code_contests("openai/o4-mini")
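The script above places whatever call string the helper LLM returns inside print(...) and executes it. One possible hardening step, sketched here under my own assumptions, is to confirm that the string is a single call to solve with literal arguments before running it. The function name `is_safe_solve_call` is hypothetical and not part of the article's framework; the check relies only on Python's standard `ast` module.

```python
import ast


def is_safe_solve_call(call_src: str, func_name: str = "solve") -> bool:
    """Return True only if call_src is one call to func_name with literal arguments."""
    try:
        tree = ast.parse(call_src.strip(), mode="eval")
    except (SyntaxError, ValueError):
        return False  # not a single expression (e.g. a def, or two statements)

    call = tree.body
    if not isinstance(call, ast.Call):
        return False
    if not (isinstance(call.func, ast.Name) and call.func.id == func_name):
        return False

    # Every positional and keyword argument must be a plain literal
    # (numbers, strings, lists, tuples, dicts, sets, booleans, None).
    values = list(call.args) + [kw.value for kw in call.keywords]
    for node in values:
        try:
            ast.literal_eval(node)
        except (ValueError, TypeError, SyntaxError):
            return False
    return True


print(is_safe_solve_call("solve(3, 2, 998244353)"))
print(is_safe_solve_call("__import__('os').system('ls')"))
```

A call that fails this check could be logged as a formatting failure of the helper model instead of being counted against the model under test.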
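To keep everything reproducible and to track detailed logs, we use Weave to record our evaluation metrics, capturing the raw model code, test inputs, outputs, errors, and metrics such as average latency. This lets us inspect failures, spot trends across models, and keep every result and bug traceable as models and benchmarks evolve. For this we used Weave's new EvaluationLogger, which offers a more flexible way to instrument evaluations.
With EvaluationLogger you are not tied to a rigid format. Instead, you iterate over the dataset manually, call the model however you like, and log predictions and scores as they are produced. That makes it easy to slot Weave into an existing pipeline or custom evaluation loop without rewriting your code.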
Head-to-head results: Gemini 2.5 Pro vs o4-mini vs Claude 3.7 Sonnet
Here are the results of my evaluation:
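Thinking mode is expensive to run, so I tested only 30 samples. I used the "low" reasoning-effort setting for OpenAI's o4-mini and tried to keep the thinking budget at roughly 4,000 tokens for all models. Even so, Gemini frequently exceeded it, using about 25,000 tokens per response on average (I'm not sure whether this was a bug or a mistake on my part). In my experience that kind of verbosity usually improves performance, yet Gemini 2.5 Pro's accuracy was still 0.333, just behind Claude 3.7 Sonnet at 0.367 and well short of OpenAI's o4-mini at 0.5.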
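To put these accuracies in context, the sketch below converts them back into solved counts and attaches a rough uncertainty range. The counts 15, 11, and 10 are inferred from the reported accuracies under the assumption that each model was scored on exactly 30 problems. The Wilson score interval is a standard statistical technique that I am adding here for illustration; it is not part of the original evaluation.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


N_PROBLEMS = 30
# Solved counts inferred from the reported accuracies (assumption: 30 scored problems each).
solved = {"o4-mini": 15, "Claude 3.7 Sonnet": 11, "Gemini 2.5 Pro": 10}

for model, k in solved.items():
    low, high = wilson_interval(k, N_PROBLEMS)
    print(f"{model}: {k}/{N_PROBLEMS} = {k / N_PROBLEMS:.3f}, ~95% interval [{low:.2f}, {high:.2f}]")
```

With only 30 problems the intervals are wide and overlap, so the ordering of the three models is better read as suggestive than as conclusive.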
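Weave comparison view
Weave's comparison view is especially useful for visualizing how the reasoning of several code generation models differs. It shows each model's output side by side, so you can immediately see differences in correctness, logical consistency, and how visual inputs such as charts or graphs are handled. This intuitive interface makes it quick to pinpoint why one model fails where another succeeds.
These insights help you analyze and optimize coding performance effectively. By surfacing not only results but also the underlying coding style, Weave's comparison view simplifies the process of evaluating and improving each model's reasoning ability.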
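Catching bugs with Weave
After kicking off one of the evaluations, I looked at a few of the models' initial responses and found that they included non-code explanations alongside the code. That was clearly a problem, and on further investigation it turned out to originate in my clean_llm_code_block function. At the time I was using only a regex to extract code, and it apparently was not robust enough to handle the full distribution of LLM outputs (or maybe I'm just bad at writing regex filters). So I took the easy route and used a separate LLM to parse out the code, which gets close to 100% reliability. Thanks to Weave, I caught this bug early!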
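For readers who want to keep a deterministic first pass before falling back to an LLM parser, here is a minimal sketch of a more tolerant extraction step. It is not the regex the article originally used (that version is not shown), and `extract_python_code` is a hypothetical helper. It returns None when it cannot find usable code, which is the signal to hand the response to the LLM parser. The fence marker is built with string multiplication only to keep literal backtick runs out of the example.

```python
import re
from typing import Optional

FENCE = "`" * 3  # a markdown code fence marker
FENCE_RE = re.compile(
    FENCE + r"(?:python|py)?[ \t]*\n(.*?)" + FENCE,
    re.DOTALL | re.IGNORECASE,
)


def extract_python_code(response: str, func_name: str = "solve") -> Optional[str]:
    """Return the last fenced block that defines func_name, or None if none is usable."""
    blocks = [block.strip() for block in FENCE_RE.findall(response)]
    pattern = re.compile(rf"^def {re.escape(func_name)}\(", re.MULTILINE)
    candidates = [block for block in blocks if pattern.search(block)]
    if candidates:
        return candidates[-1]

    # No fenced block: accept the raw text only if it already looks like bare code.
    stripped = response.strip()
    if stripped.startswith(("def ", "import ", "from ")) and pattern.search(stripped):
        return stripped
    return None


reply = (
    "Here is the solution:\n"
    + FENCE + "python\nimport math\n\ndef solve(a, b):\n    return a + b\n" + FENCE
    + "\nThis adds the two numbers."
)
print(extract_python_code(reply))
print(extract_python_code("I cannot solve this."))
```

Unlike a pattern that starts matching at the def line, this keeps any import statements that appear inside the fenced block before the function definition.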
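Conclusion
Our journey benchmarking Gemini 2.5 Pro (I/O Edition), o4-mini, and Claude 3.7 Sonnet shows that the field of AI coding assistants is evolving quickly and is fiercely competitive. These models can solve a meaningful share of complex, real programming problems (between a third and a half in our 30-problem sample), but headline metrics need to be interpreted with care. As we stressed, public benchmarks such as CodeContests remain useful diagnostic tools, but their value is limited by potential data leakage and by test-set saturation across the public web. Genuine advances in reasoning and generalization remain hard to verify and call for private or novel evaluation methods.
Despite these caveats, our evaluation, built on an automated, LLM-assisted execution framework, offers a fair, reproducible, and transparent way to compare what frontier models actually deliver on coding tasks they have, as far as we can tell, not seen. Tools like Weave helped us catch reliability problems early and underlined once again how important robust evaluation infrastructure is.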
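In the end, even though benchmarks are imperfect, stress-testing the latest top models on well-chosen tasks is still essential for understanding their real strengths and limits. Developers considering tools like Gemini 2.5 Pro can be encouraged by its raw performance, but they should stay clear-eyed about the gap between leaderboard results and robustness in the real world. As the field moves forward, the race to build code assistants that are not only smarter but also more rigorously evaluated is far from over.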