Retrieval Augmented Thinking(RAT)은 무엇이며 어떻게 작동하나요?

Retrieval Augmented Thinking(RAT)은 AI의 추론을 응답 생성과 분리하여, 구조화된 사고를 담당하는 모델과 최종 출력을 담당하는 다른 모델을 함께 사용함으로써 효율성, 해석 가능성, 맞춤화를 개선합니다. 이 글은 AI 번역본입니다. 오역이 의심되는 부분이 있으면 댓글로 알려주세요.
Brett Young
Created on September 12|Last edited on September 12
Comment
Retrieval Augmented Thinking(RAT) AI의 추론을 응답 생성과 분리하여, 구조화된 사고는 한 모델이, 최종 출력은 다른 모델이 담당합니다. 이를 통해 효율성, 해석 가능성, 그리고 맞춤성이 향상됩니다.
이 접근 방식은 다음과 같은 잠재적 이점이 있습니다. 예를 들어 오픈 소스 추론 모델과의 페어링을 포함해 DeepSeek-R1 최종 답변은 별도의 모델이 담당하므로 비용과 성능을 보다 세밀하게 제어할 수 있습니다. 비추론 모델은 대개 다음과 같은 이유로 더 쉽게 미세 조정 추론 모델보다 비추론 모델을 다루기가 더 쉬우므로, RAT은 핵심 추론 과정을 변경하지 않고도 출력 형식을 자유롭게 맞춤화할 수 있게 해줍니다. 또한 여러 개의 추론 모델을 조합해 한 모델을 “생성” 역할로 사용하고, 다른 추론 모델이 오류를 점검하도록 구성하는 것도 가능하므로 정확도를 높일 수 있습니다.
이 글에서는 RAT의 이점을 살펴보고, DeepSeek-R1 같은 모델을 활용해 RAT를 구현하는 방법도 함께 다룹니다. Claude 3.7, 그리고 OpenAI의 o3-mini.
﻿
목차목차Retrieval Augmented Thinking이란 무엇인가?튜토리얼: Retrieval Augmented Thinking 구현하기 여러 개의 추론 모델 앙상블 Weave 비교 대시보드 결론 
﻿
Retrieval Augmented Thinking이란 무엇인가?Retrieval Augmented Thinking(RAT)은 하나의 AI 모델이 구조화된 추론을 생성하고, 다른 모델이 그 추론을 활용해 최종 응답을 만들어 내는 접근 방식입니다. 하나의 모델이 동시에 생각하고 답하도록 맡기는 대신, RAT은 추론 과정을 두 단계로 분리하여 효율성, 해석 가능성, 그리고 제어 가능성을 한층 더 최적화할 수 있게 해줍니다.
DeepSeek-R1와 Claude 같은 모델은 내부 추론 과정을 공개해 RAT에 유용합니다. 응답을 생성하기 전에 추론을 먼저 포착함으로써, RAT은 더 구조적이고 신중한 답변을 가능하게 합니다. 구현 방식에 따라, 이 방법은 여러 가지 이점을 제공할 수 있습니다:
비용이 낮은 추론 모델을 비공개 모델과 짝지어 사용하기: DeepSeek 같은 오픈 모델은 구조화된 추론을 처리할 수 있고, 반면 클로즈드 모델은 GPT-4o 다듬어진 응답을 생성할 수 있습니다. 이는 높은 품질을 유지하면서 비용을 절감합니다.
비공개 추론 모델과 비교했을 때 더 높은 해석 가능성: 오픈소스 모델에서 추론을 추출하면, 이를 클로즈드소스 모델에 넘기기 전에 의사결정 과정을 더 투명하게 만들 수 있습니다.
최적화된 토큰 사용과 잠재적인 지연 시간 감소:" 추론에 사용하는 토큰 수를 제한하고 응답 생성은 더 작고 빠른 모델로 오프로드하면, RAT은 두 작업을 모두 하나의 대형 모델로 처리하는 방식에 비해 효율을 높이고 지연 시간을 줄일 수 있습니다.
최종 출력의 맞춤화 확대: 비추론 모델은 보통 파인튜닝이 더 쉽기 때문에, RAT을 사용하면 추론 과정을 변경하지 않고도 사용자 응답을 자유롭게 맞춤화할 수 있습니다.
검증을 위한 다중 추론 모델 결합: RAT은 추론 단계에서 다중 모델을 활용할 수도 있습니다. 한 가지 방법은 DeepSeek 같은 “오픈 트레이스” 모델로 구조화된 추론 과정을 생성하고, 그 단계들을 O3-mini 또는 Gemini Flash Thinking 같은 두 번째 추론 모델에 전달하여 검증하는 것입니다. 이렇게 하면 추론 과정에 대한 완전한 가시성을 유지하면서 두 모델의 지능을 모두 활용할 수 있습니다. 오픈 트레이스 모델은 투명성을 보장하고, 클로즈드소스 모델은 단계를 정제하거나 비판하여 정확도를 높일 수 있습니다.
구조화된 사고와 유연한 응답 생성을 결합함으로써, RAT은 다양한 요구와 제약에 맞게 적용할 수 있는 모듈형 AI 추론 접근법을 제공합니다. 이러한 방식으로 다중 추론 모델을 활용하면 강력한 비공개 모델의 장점을 포기하지 않으면서도 신뢰성을 높이고, 오류를 탐지하며, 해석 가능성을 크게 향상할 수 있습니다.
튜토리얼: Retrieval Augmented Thinking 구현하기 Retrieval Augmented Thinking은 추론 과정을 추적해 가시화할 수 있는 어떤 추론 모델과도 함께 사용할 수 있습니다. 현재 이 기능을 수행할 수 있는 주요 모델은 DeepSeek-R1과 Claude 3.7입니다. 목표는 구조화된 추론 과정을 최종 답변 생성과 분리하여 효율성과 해석 가능성을 높이는 것입니다.
이 구현에서는 모델이 응답을 반환하기 전에 생각할 수 있는 토큰 수 또는 시간 한도를 설정합니다. 이렇게 하면 비용을 통제하고, 추론을 간결하게 유지하며, 불필요한 지연을 방지할 수 있습니다. DeepSeek-R1은 이러한 제약 안에서 구조화된 사고 과정을 생성하고, 해당 추론은 이후 응답 모델로 전달됩니다. 응답 모델은 추론 트레이스를 문맥으로 사용해 최종 답변을 생성하는 비공개 모델인 GPT-4o-mini이거나, DeepSeek-chat일 수 있습니다.
이 접근법은 유연한 응답 생성을 허용하면서도 AI의 추론 과정에 대한 가시성을 유지하는 방법을 제공합니다. 추론 과정을 제한하고 응답 생성을 다른 모델로 오프로딩함으로써, 이 구성은 해석 가능성, 효율성, 비용 통제를 균형 있게 달성합니다. 또한 여러 추론 모델이 서로를 검증하는 하이브리드 방식도 가능하게 합니다.
코드는 다음과 같습니다: 
import os
from litellm import completion
from openai import OpenAI
import weave
import time
from transformers import AutoTokenizer
﻿
import weave; weave.init("rat")
﻿
﻿
# Model Constants
DEEPSEEK_MODEL = "deepseek-reasoner"
GPT_MODEL = "gpt-4o-mini"
﻿
# Initialize DeepSeek client and tokenizer
weave.init("deepseek_r1_api_streaming")
r1_api_key = "your deepseek api key"  # Replace with your API key
deepseek_client = OpenAI(api_key=r1_api_key, base_url="https://api.deepseek.com")
﻿
# Initialize DeepSeek tokenizer
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-V2", trust_remote_code=True)
﻿
﻿
@weave.op
def get_deepseek_reasoning(prompt, max_think_tokens=1000, max_think_time=60):
﻿
    start_time = time.time()
    
    print("\nReasoning Process:")
    
    modified_prompt = f"{prompt}\n\nPlease limit your response to approximately {max_think_tokens} tokens."
    
    # Use direct OpenAI client for DeepSeek
    response = deepseek_client.chat.completions.create(
        model=DEEPSEEK_MODEL,
        max_tokens=1,  # This triggers reasoner mode
        messages=[{"role": "user", "content": modified_prompt}],
        stream=True
    )
﻿
    reasoning_content = ""
    final_content = ""
    token_count = 0
    limit_reached = False
    limit_reason = ""
﻿
    try:
        # Process chunks with timeout and token limit
        for chunk in response:
            # Check if we've exceeded the max thinking time
            current_time = time.time()
            elapsed = current_time - start_time
            
            if elapsed > max_think_time and not limit_reached:
                limit_reached = True
                limit_reason = "TIME"
                print("\n\n[TIME LIMIT REACHED]")
                # Explicitly break out of the loop on time limit
                break
                
            # Process the chunk
            if chunk.choices[0].delta.reasoning_content:
                reasoning_piece = chunk.choices[0].delta.reasoning_content
                
                # Count tokens in the new piece and check token limit
                new_tokens = len(tokenizer.encode(reasoning_piece))
                token_count += new_tokens
                
                if token_count > max_think_tokens and not limit_reached:
                    limit_reached = True
                    limit_reason = "TOKEN"
                    print("\n\n[TOKEN LIMIT REACHED]")
                    # Explicitly break out of the loop on token limit
                    break
                    
                reasoning_content += reasoning_piece
                print(reasoning_piece, end="", flush=True)
            elif chunk.choices[0].delta.content:
                final_content += chunk.choices[0].delta.content
    
    finally:
        # Properly terminate the stream
        try:
            # This closes the HTTP connection for the stream
            response.close()
            print("[Stream terminated]")
        except Exception as e:
            print(f"Error closing stream: {e}")
﻿
    # Calculate and display elapsed time
    elapsed_time = time.time() - start_time
    if elapsed_time >= 60:
        time_str = f"{elapsed_time/60:.1f} minutes"
    else:
        time_str = f"{elapsed_time:.1f} seconds"
    
    print(f"\n\nThought for {time_str} ({token_count} tokens)")
    if limit_reached:
        print(f"Stopped due to {limit_reason} limit")
    print("\n")
    
    return reasoning_content
﻿
﻿
@weave.op
def get_gpt_response(prompt, reasoning):
    """Get response from GPT model via LiteLLM"""
    combined_prompt = (
        f"<question>{prompt}</question>\n\n"
        f"<thinking>{reasoning}</thinking>\n\n"
    )
    
    print(f"\n{GPT_MODEL}:")
    
    try:
        # Using LiteLLM for GPT
        completion_response = completion(
            model=GPT_MODEL,
            messages=[{"role": "user", "content": combined_prompt}],
            stream=True
        )
        
        full_response = ""
        for chunk in completion_response:
            try:
                if hasattr(chunk, 'choices') and len(chunk.choices) > 0:
                    delta = chunk.choices[0].delta
                    if hasattr(delta, 'content') and delta.content:
                        content_piece = delta.content
                        full_response += content_piece
                        print(content_piece, end="", flush=True)
            except Exception as e:
                print(f"\nError processing chunk: {str(e)}")
                continue
                
    except Exception as e:
        print(f"\nError in streaming response: {str(e)}")
        return "Error occurred while streaming response"
    
    print("\n")
    return full_response
﻿
﻿
@weave.op
def get_deepseek_final_answer(prompt, reasoning):
    """
    Get a final answer from DeepSeek using the reasoning
    """
    combined_prompt = (
        f"<question>{prompt}</question>\n\n"
        f"<thinking>{reasoning}</thinking>\n\n"
        f"Based on the above reasoning, provide a clear and concise answer."
    )
    
    print(f"\nDeepSeek Final Answer:")
    
    try:
        # Use DeepSeek for the final answer
        response = deepseek_client.chat.completions.create(
            model="deepseek-chat",  # Using the chat model for the final answer
            messages=[{"role": "user", "content": combined_prompt}],
            temperature=0.7,
            max_tokens=500,
            stream=True
        )
        
        full_response = ""
        for chunk in response:
            try:
                if hasattr(chunk.choices[0].delta, 'content') and chunk.choices[0].delta.content:
                    content_piece = chunk.choices[0].delta.content
                    full_response += content_piece
                    print(content_piece, end="", flush=True)
            except Exception as e:
                print(f"\nError processing chunk: {str(e)}")
                continue
                
    except Exception as e:
        print(f"\nError in streaming response: {str(e)}")
        return "Error occurred while streaming response"
    
    print("\n")
    return full_response
﻿
def main():
    # Hardcoded prompt
    prompt = "Explain the concept of quantum computing in simple terms."
    
    # Process with both token and time limits (whichever comes first)
    reasoning = get_deepseek_reasoning(
        prompt, 
        max_think_tokens=500,  # Limit to 500 tokens
        max_think_time=500      # Limit to 30 seconds
    )
    
    # Choose either GPT or DeepSeek for the final answer
    use_gpt = False
    
    if use_gpt:
        # Option 1: Use GPT for the final answer
        final_answer = get_gpt_response(prompt, reasoning)
    else:
        # Option 2: Use DeepSeek for the final answer
        final_answer = get_deepseek_final_answer(prompt, reasoning)
﻿
if __name__ == "__main__":
    main()
추론은 생성되는 즉시 스트리밍되어, 필요할 경우 곧바로 중단할 수 있습니다. 전체 추론이 끝날 때까지 기다렸다가 조치를 취하는 대신, 이 방식은 AI가 과도하게 길거나 불필요한 추론을 만들어 내기 시작하면 실시간으로 멈출 수 있게 합니다. 시간 한도에 도달하면 추론은 즉시 잘리고, 지연 없이 바로 응답 생성 단계로 넘어가 불필요한 대기를 방지합니다. 추론 단계가 완료되면, 생성된 사고 과정은 두 번째 모델에 문맥으로 전달됩니다. 
﻿Weave 는 추론 및 응답 생성 과정 전반에서 입력, 출력, 중간 단계를 추적하는 데 사용됩니다. 다음과 같은 핵심 함수를 래핑하여 get_deepseek_reasoning 그리고 get_deepseek_final_answer 와 @weave.op이를 통해 시스템이 다양한 프롬프트를 어떻게 처리하는지 로깅하고 분석할 수 있습니다. 프로덕션 환경에서는 시간 경과에 따른 모델 성능을 추적하고 이상 징후를 감지하며, 실제 사용 패턴에 맞춰 토큰 또는 시간 한도를 최적화하는 데 유용합니다. 
RAT은 여러 모델이 상호작용하는 방식이므로, Weave는 모델들 사이에서 추론이 어떻게 흐르는지 추적하고, 출력의 불일치를 모니터링하며, 더 나은 디버깅과 평가를 위해 중간 단계를 로그로 남길 수 있는 방법을 제공합니다. 이를 통해 각기 다른 모델이 최종 응답에 어떻게 기여하는지에 대한 투명성을 높이고, 시간이 지남에 따라 그 상호작용을 최적화하는 데 도움이 됩니다.
여러 개의 추론 모델 앙상블 Retrieval Augmented Thinking은 단일 추론 모델에 의존할 필요가 없습니다대신 여러 모델을 함께 사용하여 추론 품질을 높이고, 검증을 개선하며, 해석 가능성을 강화할 수 있습니다. 이 구현에서는 Claude 3.7을 기본 추론 모델로, o3-mini를 보조 검증자로 짝지어 생성된 추론을 비판하고 정제합니다. 이 접근 방식은 solve-critique-fix 루프를 따르며, Claude가 초기 추론 기록을 생성하고 o3-mini가 오류를 점검한 뒤, 유효한 최종 답이 나올 때까지(오류를 찾지 못하거나 최대 반복 횟수에 도달할 때까지) 과정을 반복합니다.
여러 개의 추론 모델을 함께 사용하면 오픈소스와 클로즈드소스 모델 각각의 강점을 활용하면서도 추론 과정에 대한 가시성을 유지할 수 있습니다. Claude 3.7은 명시적인 추론 기록을 제공하므로 RAT에 적합합니다. o3-mini는 추론 검증자로서 불일치나 오류를 식별합니다. 여러 단계에 걸쳐 반복함으로써, 시스템은 최종 답변의 신뢰성을 더욱 높입니다.
코드는 다음과 같습니다:
import asyncio
import time
from anthropic import Anthropic
from openai import OpenAI
import weave
import json
from datasets import load_dataset
﻿
# Initialize Weave
weave.init("aime_evaluation")
﻿
# Claude 3.7 configuration
CLAUDE_API_KEY = "your claude api key"  # Replace with your actual API key
CLAUDE_MODEL = "claude-3-7-sonnet-20250219"
﻿
THINKING_BUDGET = 4000
MAX_TOKENS = THINKING_BUDGET + 4000
﻿
﻿
# Initialize Anthropic client for Claude 3.7
claude_client = Anthropic(api_key=CLAUDE_API_KEY)
﻿
# Initialize OpenAI client for O3-mini (mistake detection)
openai_client = OpenAI(
    api_key="your OpenAI api key"  # Replace with your actual API key
)
﻿
# # Initialize OpenAI client for GPT-4o (scoring)
# gpt4o_client = OpenAI(
#     api_key="your_gpt4o_api_key",  # Replace with your GPT-4o API key
# )
﻿
# System prompt for Claude
SYSTEM_MESSAGE = "Solve the following problem. Put your final answer within \\boxed{}: "
﻿
class ClaudeO3Verifier(weave.Model):
    """
    A simplified model that combines Claude 3.7 for reasoning with O3-mini for verification
    and error correction, following the "solve-critique-fix" loop.
    """
    
    @staticmethod
    async def get_claude_reasoning(prompt: str, critic=None, verbose=False) -> dict:
        """
        Get reasoning from Claude 3.7 with optional critic feedback.
        """
        try:
            start_time = time.time()
            
            # Prepare full prompt with critic feedback if available
            full_prompt = prompt
            if critic and critic != "Correct Reasoning" and critic != "":
                full_prompt = f"""Problem: {prompt}
﻿
A review of your previous solution found this error:
{critic}
﻿
Please solve this problem again with correct reasoning."""
            
            # Using streaming mode with Claude 3.7
            full_response = ""
            thinking_content = ""
            
            if verbose:
                print(f"\nStreaming Claude 3.7 Reasoning for problem: {prompt[:50]}...")
                print("-" * 50)
            
            # Stream Claude's response
            with claude_client.messages.stream(
                model=CLAUDE_MODEL,
                max_tokens=MAX_TOKENS,
                thinking={"type": "enabled", "budget_tokens": THINKING_BUDGET},
                messages=[{"role": "user", "content": SYSTEM_MESSAGE + full_prompt}]
            ) as stream:
                current_block_type = None
                for event in stream:
                    if event.type == "content_block_start":
                        current_block_type = event.content_block.type
                        if verbose:
                            print(f"\n--- Starting {current_block_type} block ---")
                    elif event.type == "content_block_delta":
                        if event.delta.type == "thinking_delta":
                            thinking_content += event.delta.thinking
                            if verbose and len(thinking_content) % 100 == 0:
                                print(".", end="", flush=True)
                        elif event.delta.type == "text_delta":
                            full_response += event.delta.text
                            if verbose and len(full_response) % 100 == 0:
                                print("*", end="", flush=True)
                    elif event.type == "content_block_stop":
                        if verbose:
                            print(f"\n--- End {current_block_type} block ---")
                    elif event.type == "message_stop":
                        if verbose:
                            print("\n--- Message complete ---")
            
            end_time = time.time()
            if verbose:
                print("\n" + "-" * 50)
                print(f"Total time: {end_time - start_time:.2f} seconds")
                print(f"Total thinking length: {len(thinking_content)} characters")
                print(f"Total response length: {len(full_response)} characters")
            
            return {
                "thinking": thinking_content,
                "response": full_response
            }
            
        except Exception as e:
            end_time = time.time()
            print(f"\nError after {end_time - start_time:.2f} seconds: {str(e)}")
            return {"thinking": "", "response": f"Error: {str(e)}"}
    @weave.op
    @staticmethod
    async def check_reasoning_correctness(problem: str, reasoning: dict) -> str:
        """
        Check the full reasoning for correctness using O3-mini.
        Returns "Correct Reasoning" or "Error - {description}".
        """
        if not reasoning or (not reasoning.get("thinking") and not reasoning.get("response")):
            return "Error - Empty reasoning provided"
        
        # Prioritize thinking content, but fall back to response if thinking is empty
        reasoning_text = reasoning.get("thinking") or reasoning.get("response")
        
        # Clean reasoning if needed
        clean_reasoning = reasoning_text.strip()
        
        validation_prompt = "\n".join([
            "Examine this complete mathematical reasoning trace for any logical mistakes, calculation errors, or incorrect assumptions:",
            "",
            "the problem: ", 
            problem, 
            "reasoning: ",
            clean_reasoning,
            "",
            "Focus on mathematical correctness. Pay special attention to:",
            "1. Arithmetic errors",
            "2. Algebraic manipulation errors",
            "3. Logical flaws in the reasoning",
            "4. Incorrect application of formulas",
            "5. Unwarranted assumptions",
            "",
            "Respond with EXACTLY:",
            "- \"Correct Reasoning\" if no mistakes are found",
            "- \"Error - [a explanation of which step the error occurs, detailed explanation of the specific error, along with an explanation of what to do instead]\" if you find any mistakes"
        ])
﻿
﻿
        try:
            response = openai_client.chat.completions.create(
                model="o3-mini",
                messages=[{"role": "user", "content": validation_prompt}]
            )
            result = response.choices[0].message.content.strip()
            
            if result.lower().startswith("correct reasoning"):
                return "Correct Reasoning"
            elif result.lower().startswith("error"):
                return result
            else:
                # Force result into expected format
                if "error" in result.lower():
                    return f"Error - DO NOT REPEAT THIS MISTAKE AGAIN - {result}"
                else:
                    return "Correct Reasoning"  # Default to correct if unclear
                    
        except Exception as e:
            print(f"Failed to validate reasoning: {e}")
            return f"Error - Validation failed: {str(e)}"
﻿
    @weave.op
    async def predict(self, text: str) -> str:
        """
        Run the complete inference pipeline using iterative solve-critique-fix loop.
        
        Args:
            text: The mathematical problem to solve
            
        Returns:
            str: The final solution to the problem
        """
        try:
            print(f"Problem: {text[:100]}...")
            
            # Initialize loop variables
            max_iterations = 5  # Maximum number of attempts
            current_iteration = 0
            critic = ""
            done = False
            final_reasoning = None
            partial_critic = None 
            
            # Iterative solve-critique-fix loop
            while not done and current_iteration < max_iterations:
                current_iteration += 1
                print(f"\nIteration {current_iteration}/{max_iterations}")
                
                # Get reasoning from Claude 3.7
                print("Getting reasoning from Claude 3.7...")
                reasoning = await self.get_claude_reasoning(text, critic, verbose=False)
                final_reasoning = reasoning
                
                if not reasoning.get("response"):
                    return "Failed to get reasoning"
                
                # Check reasoning with O3-mini
                print("Checking reasoning with O3-mini...")
                partial_critic = await ClaudeO3Verifier.check_reasoning_correctness(text, reasoning)
                
                print(f"Critic: {critic[:100]}...")
                
                # Check if we're done
                if partial_critic == "Correct Reasoning" or current_iteration == max_iterations:
                    done = True
            
                critic += "Mistake: " + partial_critic
            # Return the final response for evaluation
            return final_reasoning.get("response", "No response generated")
            
        except Exception as e:
            print(f"Critical error in predict method: {str(e)}")
            return f"Error processing problem. Details: {str(e)}"
﻿
@weave.op
async def gpt4o_scorer(label: str, model_output: str) -> dict:
    """Score the model's output by comparing it with the ground truth."""
    query = f"""
    YOU ARE A LLM JUDGE DETERMINING IF THE FOLLOWING MODEL GENERATED ANSWER IS THE SAME AS THE CORRECT ANSWER 
    I WILL GIVE YOU THE LAST 100 CHARS OF THE MODEL'S REASONING PATH, WHICH WILL CONTAIN THE FINAL ANSWER -> 
﻿
    Model's Answer (last 100 chars): {str(model_output)[-100:]}
    Correct Answer: {label}
    
    Your task:
    1. State the model's predicted answer (answer only).
    2. State the ground truth (answer only).
    3. Determine if the model's final answer is correct (ignore formatting differences, etc.). RESPOND with the predicted and ground truth answer, followed with a JSON object containing the correctness encapsulated within the following delimiters:
       ```json
       {{ "correctness": true/false }}
       ```
    """
    
    try:
        # Perform inference using GPT-4o client
        response = openai_client.chat.completions.create(
            model="gpt-4o-2024-08-06",
            messages=[{"role": "user", "content": query}]
        )
        
        result = response.choices[0].message.content
        
        # Extract correctness JSON object from the response
        try:
            json_start = result.index("```json") + 7
            json_end = result.index("```", json_start)
            correctness = json.loads(result[json_start:json_end].strip()).get("correctness", False)
        except (ValueError, IndexError, json.JSONDecodeError):
            correctness = False
            print("Failed to parse correctness JSON")
﻿
        return {"correctness": correctness, "reasoning": result}
    
    except Exception as e:
        print(f"Scoring error: {e}")
        return {"correctness": False, "reasoning": f"Scoring error: {str(e)}"}
﻿
# Load and preprocess dataset
def load_ds():
    print("Loading AIME dataset...")
    dataset = load_dataset("Maxwell-Jia/AIME_2024")["train"]  # no test set here 
    return [{"text": row["Problem"], "label": row["Answer"]} for row in dataset]
﻿
async def run_evaluations():
    print("Loading dataset...")
    dataset = load_ds()
    
    print(f"Dataset loaded with {len(dataset)} problems")
    
    # For testing purposes, you might want to use a smaller subset initially
    # Uncomment the next line to use a smaller subset
    # dataset = dataset[:5]  # Use only 5 problems for testing
    
    print("Initializing model...")
    model = ClaudeO3Verifier()
    
    print("Preparing dataset for evaluation...")
    dataset_prepared = [{"text": row["text"], "label": row["label"]} for row in dataset]
    
    print("Running evaluation...")
    scorers = [gpt4o_scorer]
    
    evaluation = weave.Evaluation(
        dataset=dataset_prepared,
        scorers=scorers,
        name="Claude 3.7 AIME Evaluation"
    )
    
    print("Starting evaluation...")
    results = await evaluation.evaluate(model)
    print(f"Evaluation complete!")
    print(f"Results: {results}")
    
    return results
﻿
if __name__ == "__main__":
    asyncio.run(run_evaluations())
﻿Weave 평가 여러 단계의 추론 전반에 걸친 모델 성능을 추적하는 데 사용됩니다. 평가를 구조화하면 정확도 추세를 측정하고, o3-mini가 Claude의 추론에서 오류를 얼마나 자주 발견하는지 평가하며, 시간이 지남에 따라 검증 프로세스를 개선하기가 쉬워집니다. 아래는 제 평가 결과입니다: 
﻿
Claude + o3-mini 검증 조합은 일반 Claude 모델과 비교해 더 높은 정답률(0.433 대 0.367)을 달성했지만, 지연 시간은 증가했습니다. 생성 추론 모델은 응답당 사고 토큰을 4,000개로 제한했지만, 오류가 발생할 때마다 추가로 4,000개의 사고 토큰을 배정하므로, 적어도 한 번이라도 실수한 샘플에서는 총 토큰 사용량이 기술적으로 더 많아집니다. 이는 잠재적 교란 요인일 수 있지만, 이 접근 방식이 정확도에서 뚜렷한 개선을 보이므로 결과는 여전히 의미 있어 보입니다.
Weave 비교 대시보드 평가가 완료되면 Weave는 결과를 대화형 대시보드로 정리합니다. 이 강력한 도구를 통해 모델 출력물을 나란히 비교할 수 있습니다. 이러한 시각적 접근은 디버깅을 단순화하고 모델 성능에 대한 깊은 인사이트를 제공하여, Weave를 추적과 개선을 위한 필수 도구로 만들어 줍니다. 대규모 언어 모델.
DeepSeek-R1이 다루는 것처럼 추론 중심의 작업에는, 비교 뷰가 각 모델의 의사 결정 과정을 단계별 추적 형태로 제공합니다. 이러한 세분화된 출력물을 분석하면, 모델이 특정 작업에서 성공하거나 실패하는 이유를 더 잘 이해할 수 있습니다. 이 통찰은 문제를 진단하고, 모델을 개선하며, 복잡한 추론 시나리오에 맞춰 능력을 조정하는 데 매우 유용합니다.
비교 뷰가 어떻게 보이는지 보여 주는 스크린샷입니다: 
﻿
결론 Retrieval Augmented Thinking(RAT)은 추론을 응답 생성과 분리하는 새로운 방식을 도입하여 효율성, 해석 가능성, 맞춤화 측면에서 이점을 제공합니다. 사고 과정을 공개하는 모델을 활용함으로써 RAT은 보다 구조화된 추론을 가능하게 하면서도 최종 출력의 유연성을 유지합니다.
DeepSeek-R1, Claude 등 다양한 모델을 활용한 구현 사례는, 추론을 최적화하고 검증하며, 여러 차례의 반복을 통해 개선할 수 있음을 보여 줍니다.
Weave와 같은 도구는 투명성을 한층 높여 주어, 모델 성능과 상호작용에 대한 유용한 인사이트를 제공합니다. AI가 계속 발전함에 따라, RAT은 복잡한 추론 작업에서 비용, 정확도, 제어의 균형을 맞추는 유망한 접근법을 제시합니다.
﻿
﻿
﻿
﻿
﻿
 이 글은 AI 번역본입니다. 오역이 의심되는 부분이 있다면 댓글로 알려 주세요. 원문 보고서는 아래 링크에서 확인하실 수 있습니다: 원문 보고서 보기﻿
﻿
Add a comment