Skip to main content

What is Retrieval Augmented Thinking (RAT) and how does it work?

Retrieval Augmented Thinking (RAT) separates AI reasoning from response generation, improving efficiency, interpretability, and customization by using one model for structured thought and another for the final output.
Created on March 13|Last edited on April 8
Retrieval Augmented Thinking (RAT) separates AI reasoning from response generation, using one model for structured thought and another for the final output. This improves efficiency, interpretability, and customization.
This approach has several potential benefits, including pairing an open-source reasoning model like DeepSeek-R1 with a seperate model for the final answer, allowing for more control over cost and performance. Since non-reasoning models are often easier to fine-tune than reasoning models, RAT makes it possible to customize outputs without altering the core reasoning process. Additionally, it's even possible to combine multiple reasoning models, where one acts as the "generation" model and the other reasoning model checks for mistakes, improving accuracy.
In this article, we will explore the benefits of RAT, and also implement RAT using models like Deepseek R1, Claude 3.7, and OpenAI's o3-mini.


Table of contents



What is Retrieval Augmented Thinking?

Retrieval Augmented Thinking (RAT) is an approach where one AI model generates structured reasoning, and another model uses that reasoning to produce the final response. Instead of relying on a single model to think and answer simultaneously, RAT allows for a breaking up the inference process into two stages, which allows for further optimization of efficiency, interpretability, and control.
Models like DeepSeek-R1 and Claude provide access to their internal thought processes, making them useful for RAT. By capturing reasoning before generating a response, RAT allows for more structured, deliberate answers. Depending on the implementation, this method can offer several advantages:
  • Pairing a cheaper reasoning model with a closed-source model: Open models like DeepSeek can handle structured reasoning, while closed models like GPT-4o can generate polished responses. This reduces costs while maintaining high-quality outputs.
  • More interpretability compared to closed-source reasoning models: Extracting reasoning from an open-source model makes the decision-making process more transparent before passing it to a closed-source model.
  • Optimized token usage and potential latency reduction: By limiting the number of tokens spent on reasoning and offloading response generation to a smaller, faster model, RAT can improve efficiency and reduce latency compared to using a single large model for both tasks.
  • More customization in final outputs: Since non-reasoning models are often easier to fine-tune, RAT allows users to customize responses without altering the reasoning process.
  • Combining multiple reasoning models for verification: RAT can also leverage multiple models in the reasoning stage. One approach is to use an "open trace" model (like DeepSeek) to generate a structured reasoning process while passing those steps to a second reasoning model—such as O3-mini or Gemini Flash Thinking—for verification. This allows us to benefit from the intelligence of both models while maintaining full visibility into the reasoning process. The open trace model ensures transparency, while the closed-source model can refine or critique the steps to improve accuracy.
By combining structured thinking with flexible response generation, RAT offers a modular approach to AI reasoning that can be adapted to different needs and constraints. Using multiple reasoning models in this way ensures higher reliability, error detection, and greater interpretability without sacrificing the advantages of powerful closed-source systems.

Tutorial: Implementing Retrieval Augmented Thinking

Retrieval Augmented Thinking can be used with any reasoning model that makes its reasoning trace visible. Right now, the main models capable of this are DeepSeek-R1 and Claude 3.7. The goal is to separate the structured reasoning process from the final answer generation, improving efficiency and interpretability.
In this implementation, we limit the number of tokens or the amount of time the model is allowed to think before returning a response. This helps control costs, ensures reasoning stays concise, and prevents unnecessary delays. DeepSeek-R1 generates structured thought processes under these constraints, and its reasoning is then passed to a response model. The response model can be either GPT-4o-mini, a closed-source model that generates a final response using the reasoning trace as context, or DeepSeek-chat.
This approach provides a way to maintain visibility into the AI’s reasoning while allowing for flexible response generation. By limiting the reasoning process and offloading response generation to another model, this setup balances interpretability, efficiency, and cost control while also enabling hybrid approaches where multiple reasoning models verify each other.
Here’s the code:
import os
from litellm import completion
from openai import OpenAI
import weave
import time
from transformers import AutoTokenizer

import weave; weave.init("rat")


# Model Constants
DEEPSEEK_MODEL = "deepseek-reasoner"
GPT_MODEL = "gpt-4o-mini"

# Initialize DeepSeek client and tokenizer
weave.init("deepseek_r1_api_streaming")
r1_api_key = "your deepseek api key" # Replace with your API key
deepseek_client = OpenAI(api_key=r1_api_key, base_url="https://api.deepseek.com")

# Initialize DeepSeek tokenizer
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-V2", trust_remote_code=True)


@weave.op
def get_deepseek_reasoning(prompt, max_think_tokens=1000, max_think_time=60):

start_time = time.time()
print("\nReasoning Process:")
modified_prompt = f"{prompt}\n\nPlease limit your response to approximately {max_think_tokens} tokens."
# Use direct OpenAI client for DeepSeek
response = deepseek_client.chat.completions.create(
model=DEEPSEEK_MODEL,
max_tokens=1, # This triggers reasoner mode
messages=[{"role": "user", "content": modified_prompt}],
stream=True
)

reasoning_content = ""
final_content = ""
token_count = 0
limit_reached = False
limit_reason = ""

try:
# Process chunks with timeout and token limit
for chunk in response:
# Check if we've exceeded the max thinking time
current_time = time.time()
elapsed = current_time - start_time
if elapsed > max_think_time and not limit_reached:
limit_reached = True
limit_reason = "TIME"
print("\n\n[TIME LIMIT REACHED]")
# Explicitly break out of the loop on time limit
break
# Process the chunk
if chunk.choices[0].delta.reasoning_content:
reasoning_piece = chunk.choices[0].delta.reasoning_content
# Count tokens in the new piece and check token limit
new_tokens = len(tokenizer.encode(reasoning_piece))
token_count += new_tokens
if token_count > max_think_tokens and not limit_reached:
limit_reached = True
limit_reason = "TOKEN"
print("\n\n[TOKEN LIMIT REACHED]")
# Explicitly break out of the loop on token limit
break
reasoning_content += reasoning_piece
print(reasoning_piece, end="", flush=True)
elif chunk.choices[0].delta.content:
final_content += chunk.choices[0].delta.content
finally:
# Properly terminate the stream
try:
# This closes the HTTP connection for the stream
response.close()
print("[Stream terminated]")
except Exception as e:
print(f"Error closing stream: {e}")

# Calculate and display elapsed time
elapsed_time = time.time() - start_time
if elapsed_time >= 60:
time_str = f"{elapsed_time/60:.1f} minutes"
else:
time_str = f"{elapsed_time:.1f} seconds"
print(f"\n\nThought for {time_str} ({token_count} tokens)")
if limit_reached:
print(f"Stopped due to {limit_reason} limit")
print("\n")
return reasoning_content


@weave.op
def get_gpt_response(prompt, reasoning):
"""Get response from GPT model via LiteLLM"""
combined_prompt = (
f"<question>{prompt}</question>\n\n"
f"<thinking>{reasoning}</thinking>\n\n"
)
print(f"\n{GPT_MODEL}:")
try:
# Using LiteLLM for GPT
completion_response = completion(
model=GPT_MODEL,
messages=[{"role": "user", "content": combined_prompt}],
stream=True
)
full_response = ""
for chunk in completion_response:
try:
if hasattr(chunk, 'choices') and len(chunk.choices) > 0:
delta = chunk.choices[0].delta
if hasattr(delta, 'content') and delta.content:
content_piece = delta.content
full_response += content_piece
print(content_piece, end="", flush=True)
except Exception as e:
print(f"\nError processing chunk: {str(e)}")
continue
except Exception as e:
print(f"\nError in streaming response: {str(e)}")
return "Error occurred while streaming response"
print("\n")
return full_response


@weave.op
def get_deepseek_final_answer(prompt, reasoning):
"""
Get a final answer from DeepSeek using the reasoning
"""
combined_prompt = (
f"<question>{prompt}</question>\n\n"
f"<thinking>{reasoning}</thinking>\n\n"
f"Based on the above reasoning, provide a clear and concise answer."
)
print(f"\nDeepSeek Final Answer:")
try:
# Use DeepSeek for the final answer
response = deepseek_client.chat.completions.create(
model="deepseek-chat", # Using the chat model for the final answer
messages=[{"role": "user", "content": combined_prompt}],
temperature=0.7,
max_tokens=500,
stream=True
)
full_response = ""
for chunk in response:
try:
if hasattr(chunk.choices[0].delta, 'content') and chunk.choices[0].delta.content:
content_piece = chunk.choices[0].delta.content
full_response += content_piece
print(content_piece, end="", flush=True)
except Exception as e:
print(f"\nError processing chunk: {str(e)}")
continue
except Exception as e:
print(f"\nError in streaming response: {str(e)}")
return "Error occurred while streaming response"
print("\n")
return full_response

def main():
# Hardcoded prompt
prompt = "Explain the concept of quantum computing in simple terms."
# Process with both token and time limits (whichever comes first)
reasoning = get_deepseek_reasoning(
prompt,
max_think_tokens=500, # Limit to 500 tokens
max_think_time=500 # Limit to 30 seconds
)
# Choose either GPT or DeepSeek for the final answer
use_gpt = False
if use_gpt:
# Option 1: Use GPT for the final answer
final_answer = get_gpt_response(prompt, reasoning)
else:
# Option 2: Use DeepSeek for the final answer
final_answer = get_deepseek_final_answer(prompt, reasoning)

if __name__ == "__main__":
main()
The reasoning is streamed as it’s being generated, allowing for immediate interruption if needed. Instead of waiting for the full reasoning process to complete before taking action, this method ensures that if the AI starts generating excessively long or unnecessary reasoning, it can be stopped in real time. If it hits the time limit, the reasoning is cut off, and the response generation step begins immediately, avoiding unnecessary delays. Once the reasoning step is complete, the generated thought process is passed as context to the second model.
Weave is used to track inputs, outputs, and intermediate steps throughout the reasoning and response generation process. By wrapping key functions like get_deepseek_reasoning and get_deepseek_final_answer with @weave.op, it enables logging and analysis of how the system processes different prompts. In a production setting, this can be useful for tracking model performance over time, detecting anomalies, and optimizing token or time limits based on real-world usage patterns.
Since RAT involves multiple models interacting, Weave provides a way to track how reasoning flows between them, monitor discrepancies in outputs, and log intermediate steps for better debugging and evaluation. This ensures greater transparency in how different models contribute to the final response and helps optimize their interaction over time.

Ensembling multiple reasoning models

Retrieval Augmented Thinking doesn’t have to rely on a single reasoning model. Instead, multiple models can be used together to enhance reasoning quality, improve verification, and increase interpretability. This implementation pairs Claude 3.7 as the primary reasoning model with o3-mini as a secondary verifier to critique and refine the generated reasoning. The approach follows a solve-critique-fix loop, where Claude generates an initial reasoning trace, o3-mini checks for errors, and the process repeats until a valid final answer is produced (o3-mini is unable to find errors, or a maximum number or iterations is reached).
Using multiple reasoning models allows us to leverage the strengths of both open and closed-source models while maintaining visibility into the thought process. Claude 3.7 is used because it provides an explicit reasoning trace, making it useful for RAT. o3-mini acts as a reasoning validator, identifying inconsistencies or mistakes. By iterating over multiple steps, the system ensures higher reliability in its final answers.
Here’s the code:
import asyncio
import time
from anthropic import Anthropic
from openai import OpenAI
import weave
import json
from datasets import load_dataset

# Initialize Weave
weave.init("aime_evaluation")

# Claude 3.7 configuration
CLAUDE_API_KEY = "your claude api key" # Replace with your actual API key
CLAUDE_MODEL = "claude-3-7-sonnet-20250219"

THINKING_BUDGET = 4000
MAX_TOKENS = THINKING_BUDGET + 4000


# Initialize Anthropic client for Claude 3.7
claude_client = Anthropic(api_key=CLAUDE_API_KEY)

# Initialize OpenAI client for O3-mini (mistake detection)
openai_client = OpenAI(
api_key="your OpenAI api key" # Replace with your actual API key
)

# # Initialize OpenAI client for GPT-4o (scoring)
# gpt4o_client = OpenAI(
# api_key="your_gpt4o_api_key", # Replace with your GPT-4o API key
# )

# System prompt for Claude
SYSTEM_MESSAGE = "Solve the following problem. Put your final answer within \\boxed{}: "

class ClaudeO3Verifier(weave.Model):
"""
A simplified model that combines Claude 3.7 for reasoning with O3-mini for verification
and error correction, following the "solve-critique-fix" loop.
"""
@staticmethod
async def get_claude_reasoning(prompt: str, critic=None, verbose=False) -> dict:
"""
Get reasoning from Claude 3.7 with optional critic feedback.
"""
try:
start_time = time.time()
# Prepare full prompt with critic feedback if available
full_prompt = prompt
if critic and critic != "Correct Reasoning" and critic != "":
full_prompt = f"""Problem: {prompt}

A review of your previous solution found this error:
{critic}

Please solve this problem again with correct reasoning."""
# Using streaming mode with Claude 3.7
full_response = ""
thinking_content = ""
if verbose:
print(f"\nStreaming Claude 3.7 Reasoning for problem: {prompt[:50]}...")
print("-" * 50)
# Stream Claude's response
with claude_client.messages.stream(
model=CLAUDE_MODEL,
max_tokens=MAX_TOKENS,
thinking={"type": "enabled", "budget_tokens": THINKING_BUDGET},
messages=[{"role": "user", "content": SYSTEM_MESSAGE + full_prompt}]
) as stream:
current_block_type = None
for event in stream:
if event.type == "content_block_start":
current_block_type = event.content_block.type
if verbose:
print(f"\n--- Starting {current_block_type} block ---")
elif event.type == "content_block_delta":
if event.delta.type == "thinking_delta":
thinking_content += event.delta.thinking
if verbose and len(thinking_content) % 100 == 0:
print(".", end="", flush=True)
elif event.delta.type == "text_delta":
full_response += event.delta.text
if verbose and len(full_response) % 100 == 0:
print("*", end="", flush=True)
elif event.type == "content_block_stop":
if verbose:
print(f"\n--- End {current_block_type} block ---")
elif event.type == "message_stop":
if verbose:
print("\n--- Message complete ---")
end_time = time.time()
if verbose:
print("\n" + "-" * 50)
print(f"Total time: {end_time - start_time:.2f} seconds")
print(f"Total thinking length: {len(thinking_content)} characters")
print(f"Total response length: {len(full_response)} characters")
return {
"thinking": thinking_content,
"response": full_response
}
except Exception as e:
end_time = time.time()
print(f"\nError after {end_time - start_time:.2f} seconds: {str(e)}")
return {"thinking": "", "response": f"Error: {str(e)}"}
@weave.op
@staticmethod
async def check_reasoning_correctness(problem: str, reasoning: dict) -> str:
"""
Check the full reasoning for correctness using O3-mini.
Returns "Correct Reasoning" or "Error - {description}".
"""
if not reasoning or (not reasoning.get("thinking") and not reasoning.get("response")):
return "Error - Empty reasoning provided"
# Prioritize thinking content, but fall back to response if thinking is empty
reasoning_text = reasoning.get("thinking") or reasoning.get("response")
# Clean reasoning if needed
clean_reasoning = reasoning_text.strip()
validation_prompt = "\n".join([
"Examine this complete mathematical reasoning trace for any logical mistakes, calculation errors, or incorrect assumptions:",
"",
"the problem: ",
problem,
"reasoning: ",
clean_reasoning,
"",
"Focus on mathematical correctness. Pay special attention to:",
"1. Arithmetic errors",
"2. Algebraic manipulation errors",
"3. Logical flaws in the reasoning",
"4. Incorrect application of formulas",
"5. Unwarranted assumptions",
"",
"Respond with EXACTLY:",
"- \"Correct Reasoning\" if no mistakes are found",
"- \"Error - [a explanation of which step the error occurs, detailed explanation of the specific error, along with an explanation of what to do instead]\" if you find any mistakes"
])


try:
response = openai_client.chat.completions.create(
model="o3-mini",
messages=[{"role": "user", "content": validation_prompt}]
)
result = response.choices[0].message.content.strip()
if result.lower().startswith("correct reasoning"):
return "Correct Reasoning"
elif result.lower().startswith("error"):
return result
else:
# Force result into expected format
if "error" in result.lower():
return f"Error - DO NOT REPEAT THIS MISTAKE AGAIN - {result}"
else:
return "Correct Reasoning" # Default to correct if unclear
except Exception as e:
print(f"Failed to validate reasoning: {e}")
return f"Error - Validation failed: {str(e)}"

@weave.op
async def predict(self, text: str) -> str:
"""
Run the complete inference pipeline using iterative solve-critique-fix loop.
Args:
text: The mathematical problem to solve
Returns:
str: The final solution to the problem
"""
try:
print(f"Problem: {text[:100]}...")
# Initialize loop variables
max_iterations = 5 # Maximum number of attempts
current_iteration = 0
critic = ""
done = False
final_reasoning = None
partial_critic = None
# Iterative solve-critique-fix loop
while not done and current_iteration < max_iterations:
current_iteration += 1
print(f"\nIteration {current_iteration}/{max_iterations}")
# Get reasoning from Claude 3.7
print("Getting reasoning from Claude 3.7...")
reasoning = await self.get_claude_reasoning(text, critic, verbose=False)
final_reasoning = reasoning
if not reasoning.get("response"):
return "Failed to get reasoning"
# Check reasoning with O3-mini
print("Checking reasoning with O3-mini...")
partial_critic = await ClaudeO3Verifier.check_reasoning_correctness(text, reasoning)
print(f"Critic: {critic[:100]}...")
# Check if we're done
if partial_critic == "Correct Reasoning" or current_iteration == max_iterations:
done = True
critic += "Mistake: " + partial_critic
# Return the final response for evaluation
return final_reasoning.get("response", "No response generated")
except Exception as e:
print(f"Critical error in predict method: {str(e)}")
return f"Error processing problem. Details: {str(e)}"

@weave.op
async def gpt4o_scorer(label: str, model_output: str) -> dict:
"""Score the model's output by comparing it with the ground truth."""
query = f"""
YOU ARE A LLM JUDGE DETERMINING IF THE FOLLOWING MODEL GENERATED ANSWER IS THE SAME AS THE CORRECT ANSWER
I WILL GIVE YOU THE LAST 100 CHARS OF THE MODEL'S REASONING PATH, WHICH WILL CONTAIN THE FINAL ANSWER ->

Model's Answer (last 100 chars): {str(model_output)[-100:]}
Correct Answer: {label}
Your task:
1. State the model's predicted answer (answer only).
2. State the ground truth (answer only).
3. Determine if the model's final answer is correct (ignore formatting differences, etc.). RESPOND with the predicted and ground truth answer, followed with a JSON object containing the correctness encapsulated within the following delimiters:
```json
{{ "correctness": true/false }}
```
"""
try:
# Perform inference using GPT-4o client
response = openai_client.chat.completions.create(
model="gpt-4o-2024-08-06",
messages=[{"role": "user", "content": query}]
)
result = response.choices[0].message.content
# Extract correctness JSON object from the response
try:
json_start = result.index("```json") + 7
json_end = result.index("```", json_start)
correctness = json.loads(result[json_start:json_end].strip()).get("correctness", False)
except (ValueError, IndexError, json.JSONDecodeError):
correctness = False
print("Failed to parse correctness JSON")

return {"correctness": correctness, "reasoning": result}
except Exception as e:
print(f"Scoring error: {e}")
return {"correctness": False, "reasoning": f"Scoring error: {str(e)}"}

# Load and preprocess dataset
def load_ds():
print("Loading AIME dataset...")
dataset = load_dataset("Maxwell-Jia/AIME_2024")["train"] # no test set here
return [{"text": row["Problem"], "label": row["Answer"]} for row in dataset]

async def run_evaluations():
print("Loading dataset...")
dataset = load_ds()
print(f"Dataset loaded with {len(dataset)} problems")
# For testing purposes, you might want to use a smaller subset initially
# Uncomment the next line to use a smaller subset
# dataset = dataset[:5] # Use only 5 problems for testing
print("Initializing model...")
model = ClaudeO3Verifier()
print("Preparing dataset for evaluation...")
dataset_prepared = [{"text": row["text"], "label": row["label"]} for row in dataset]
print("Running evaluation...")
scorers = [gpt4o_scorer]
evaluation = weave.Evaluation(
dataset=dataset_prepared,
scorers=scorers,
name="Claude 3.7 AIME Evaluation"
)
print("Starting evaluation...")
results = await evaluation.evaluate(model)
print(f"Evaluation complete!")
print(f"Results: {results}")
return results

if __name__ == "__main__":
asyncio.run(run_evaluations())
Weave Evaluations is used to track model performance across multiple reasoning steps. By structuring evaluations, it becomes easier to measure accuracy trends, assess how often o3-mini detects errors in Claude’s reasoning, and refine the verification process over time. Here are the results for my evaluation:

The Claude + o3-mini verifier achieves higher correctness (0.433 vs. 0.367) compared to the regular Claude model but at the cost of increased latency. The generation reasoning model was limited to 4K thinking tokens per response, but since we allocate another 4K thinking tokens for every mistake it makes, total token usage is technically higher for any sample where the model makes at least one mistake. While this could be a possible confounder, the results still seem meaningful since this approach does show a clear improvement in accuracy.

The Weave comparisons dashboard

Once evaluations are complete, Weave organizes results into an interactive dashboard. This powerful tool enables you to compare model outputs side by side. This visual approach simplifies debugging and provides deep insights into model performance, making Weave an indispensable tool for tracking and refining large language models.
For reasoning-focused tasks like those tackled by DeepSeek-R1, the comparisons view offers a step-by-step trace of each model’s decision-making process. By analyzing these granular outputs, you can better understand why models succeed or fail on specific tasks. This insight is invaluable for diagnosing issues, improving models, and tailoring their capabilities for complex reasoning scenarios.
Here’s a screenshot of what the comparison view looks like:


Conclusion

Retrieval Augmented Thinking (RAT) introduces a new way to separate reasoning from response generation, offering advantages in efficiency, interpretability, and customization. By leveraging models that expose their thought processes, RAT allows for more structured reasoning while maintaining flexibility in final outputs.
Implementations using DeepSeek-R1, Claude, and other models demonstrate how reasoning can be optimized, verified, and improved through multiple iterations.
Tools like Weave further enhance transparency, providing valuable insights into model performance and interactions. As AI continues to evolve, RAT presents a promising approach to balancing cost, accuracy, and control in complex reasoning tasks.




Iterate on AI agents and models faster. Try Weights & Biases today.