What is Retrieval Augmented Thinking (RAT) and how does it work?
Retrieval Augmented Thinking (RAT) separates AI reasoning from response generation, improving efficiency, interpretability, and customization by using one model for structured thought and another for the final output.
Created on March 13|Last edited on April 8
Comment
Retrieval Augmented Thinking (RAT) separates AI reasoning from response generation, using one model for structured thought and another for the final output. This improves efficiency, interpretability, and customization.
This approach has several potential benefits, including pairing an open-source reasoning model like DeepSeek-R1 with a seperate model for the final answer, allowing for more control over cost and performance. Since non-reasoning models are often easier to fine-tune than reasoning models, RAT makes it possible to customize outputs without altering the core reasoning process. Additionally, it's even possible to combine multiple reasoning models, where one acts as the "generation" model and the other reasoning model checks for mistakes, improving accuracy.
In this article, we will explore the benefits of RAT, and also implement RAT using models like Deepseek R1, Claude 3.7, and OpenAI's o3-mini.

Table of contents
Table of contentsWhat is Retrieval Augmented Thinking?Tutorial: Implementing Retrieval Augmented Thinking Ensembling multiple reasoning models The Weave comparisons dashboard Conclusion
What is Retrieval Augmented Thinking?
Retrieval Augmented Thinking (RAT) is an approach where one AI model generates structured reasoning, and another model uses that reasoning to produce the final response. Instead of relying on a single model to think and answer simultaneously, RAT allows for a breaking up the inference process into two stages, which allows for further optimization of efficiency, interpretability, and control.
Models like DeepSeek-R1 and Claude provide access to their internal thought processes, making them useful for RAT. By capturing reasoning before generating a response, RAT allows for more structured, deliberate answers. Depending on the implementation, this method can offer several advantages:
- Pairing a cheaper reasoning model with a closed-source model: Open models like DeepSeek can handle structured reasoning, while closed models like GPT-4o can generate polished responses. This reduces costs while maintaining high-quality outputs.
- More interpretability compared to closed-source reasoning models: Extracting reasoning from an open-source model makes the decision-making process more transparent before passing it to a closed-source model.
- Optimized token usage and potential latency reduction: By limiting the number of tokens spent on reasoning and offloading response generation to a smaller, faster model, RAT can improve efficiency and reduce latency compared to using a single large model for both tasks.
- More customization in final outputs: Since non-reasoning models are often easier to fine-tune, RAT allows users to customize responses without altering the reasoning process.
- Combining multiple reasoning models for verification: RAT can also leverage multiple models in the reasoning stage. One approach is to use an "open trace" model (like DeepSeek) to generate a structured reasoning process while passing those steps to a second reasoning model—such as O3-mini or Gemini Flash Thinking—for verification. This allows us to benefit from the intelligence of both models while maintaining full visibility into the reasoning process. The open trace model ensures transparency, while the closed-source model can refine or critique the steps to improve accuracy.
By combining structured thinking with flexible response generation, RAT offers a modular approach to AI reasoning that can be adapted to different needs and constraints. Using multiple reasoning models in this way ensures higher reliability, error detection, and greater interpretability without sacrificing the advantages of powerful closed-source systems.
Tutorial: Implementing Retrieval Augmented Thinking
Retrieval Augmented Thinking can be used with any reasoning model that makes its reasoning trace visible. Right now, the main models capable of this are DeepSeek-R1 and Claude 3.7. The goal is to separate the structured reasoning process from the final answer generation, improving efficiency and interpretability.
In this implementation, we limit the number of tokens or the amount of time the model is allowed to think before returning a response. This helps control costs, ensures reasoning stays concise, and prevents unnecessary delays. DeepSeek-R1 generates structured thought processes under these constraints, and its reasoning is then passed to a response model. The response model can be either GPT-4o-mini, a closed-source model that generates a final response using the reasoning trace as context, or DeepSeek-chat.
This approach provides a way to maintain visibility into the AI’s reasoning while allowing for flexible response generation. By limiting the reasoning process and offloading response generation to another model, this setup balances interpretability, efficiency, and cost control while also enabling hybrid approaches where multiple reasoning models verify each other.
Here’s the code:
import osfrom litellm import completionfrom openai import OpenAIimport weaveimport timefrom transformers import AutoTokenizerimport weave; weave.init("rat")# Model ConstantsDEEPSEEK_MODEL = "deepseek-reasoner"GPT_MODEL = "gpt-4o-mini"# Initialize DeepSeek client and tokenizerweave.init("deepseek_r1_api_streaming")r1_api_key = "your deepseek api key" # Replace with your API keydeepseek_client = OpenAI(api_key=r1_api_key, base_url="https://api.deepseek.com")# Initialize DeepSeek tokenizertokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-V2", trust_remote_code=True)@weave.opdef get_deepseek_reasoning(prompt, max_think_tokens=1000, max_think_time=60):start_time = time.time()print("\nReasoning Process:")modified_prompt = f"{prompt}\n\nPlease limit your response to approximately {max_think_tokens} tokens."# Use direct OpenAI client for DeepSeekresponse = deepseek_client.chat.completions.create(model=DEEPSEEK_MODEL,max_tokens=1, # This triggers reasoner modemessages=[{"role": "user", "content": modified_prompt}],stream=True)reasoning_content = ""final_content = ""token_count = 0limit_reached = Falselimit_reason = ""try:# Process chunks with timeout and token limitfor chunk in response:# Check if we've exceeded the max thinking timecurrent_time = time.time()elapsed = current_time - start_timeif elapsed > max_think_time and not limit_reached:limit_reached = Truelimit_reason = "TIME"print("\n\n[TIME LIMIT REACHED]")# Explicitly break out of the loop on time limitbreak# Process the chunkif chunk.choices[0].delta.reasoning_content:reasoning_piece = chunk.choices[0].delta.reasoning_content# Count tokens in the new piece and check token limitnew_tokens = len(tokenizer.encode(reasoning_piece))token_count += new_tokensif token_count > max_think_tokens and not limit_reached:limit_reached = Truelimit_reason = "TOKEN"print("\n\n[TOKEN LIMIT REACHED]")# Explicitly break out of the loop on token limitbreakreasoning_content += reasoning_pieceprint(reasoning_piece, end="", flush=True)elif chunk.choices[0].delta.content:final_content += chunk.choices[0].delta.contentfinally:# Properly terminate the streamtry:# This closes the HTTP connection for the streamresponse.close()print("[Stream terminated]")except Exception as e:print(f"Error closing stream: {e}")# Calculate and display elapsed timeelapsed_time = time.time() - start_timeif elapsed_time >= 60:time_str = f"{elapsed_time/60:.1f} minutes"else:time_str = f"{elapsed_time:.1f} seconds"print(f"\n\nThought for {time_str} ({token_count} tokens)")if limit_reached:print(f"Stopped due to {limit_reason} limit")print("\n")return reasoning_content@weave.opdef get_gpt_response(prompt, reasoning):"""Get response from GPT model via LiteLLM"""combined_prompt = (f"<question>{prompt}</question>\n\n"f"<thinking>{reasoning}</thinking>\n\n")print(f"\n{GPT_MODEL}:")try:# Using LiteLLM for GPTcompletion_response = completion(model=GPT_MODEL,messages=[{"role": "user", "content": combined_prompt}],stream=True)full_response = ""for chunk in completion_response:try:if hasattr(chunk, 'choices') and len(chunk.choices) > 0:delta = chunk.choices[0].deltaif hasattr(delta, 'content') and delta.content:content_piece = delta.contentfull_response += content_pieceprint(content_piece, end="", flush=True)except Exception as e:print(f"\nError processing chunk: {str(e)}")continueexcept Exception as e:print(f"\nError in streaming response: {str(e)}")return "Error occurred while streaming response"print("\n")return full_response@weave.opdef get_deepseek_final_answer(prompt, reasoning):"""Get a final answer from DeepSeek using the reasoning"""combined_prompt = (f"<question>{prompt}</question>\n\n"f"<thinking>{reasoning}</thinking>\n\n"f"Based on the above reasoning, provide a clear and concise answer.")print(f"\nDeepSeek Final Answer:")try:# Use DeepSeek for the final answerresponse = deepseek_client.chat.completions.create(model="deepseek-chat", # Using the chat model for the final answermessages=[{"role": "user", "content": combined_prompt}],temperature=0.7,max_tokens=500,stream=True)full_response = ""for chunk in response:try:if hasattr(chunk.choices[0].delta, 'content') and chunk.choices[0].delta.content:content_piece = chunk.choices[0].delta.contentfull_response += content_pieceprint(content_piece, end="", flush=True)except Exception as e:print(f"\nError processing chunk: {str(e)}")continueexcept Exception as e:print(f"\nError in streaming response: {str(e)}")return "Error occurred while streaming response"print("\n")return full_responsedef main():# Hardcoded promptprompt = "Explain the concept of quantum computing in simple terms."# Process with both token and time limits (whichever comes first)reasoning = get_deepseek_reasoning(prompt,max_think_tokens=500, # Limit to 500 tokensmax_think_time=500 # Limit to 30 seconds)# Choose either GPT or DeepSeek for the final answeruse_gpt = Falseif use_gpt:# Option 1: Use GPT for the final answerfinal_answer = get_gpt_response(prompt, reasoning)else:# Option 2: Use DeepSeek for the final answerfinal_answer = get_deepseek_final_answer(prompt, reasoning)if __name__ == "__main__":main()
The reasoning is streamed as it’s being generated, allowing for immediate interruption if needed. Instead of waiting for the full reasoning process to complete before taking action, this method ensures that if the AI starts generating excessively long or unnecessary reasoning, it can be stopped in real time. If it hits the time limit, the reasoning is cut off, and the response generation step begins immediately, avoiding unnecessary delays. Once the reasoning step is complete, the generated thought process is passed as context to the second model.
Weave is used to track inputs, outputs, and intermediate steps throughout the reasoning and response generation process. By wrapping key functions like get_deepseek_reasoning and get_deepseek_final_answer with @weave.op, it enables logging and analysis of how the system processes different prompts. In a production setting, this can be useful for tracking model performance over time, detecting anomalies, and optimizing token or time limits based on real-world usage patterns.
Since RAT involves multiple models interacting, Weave provides a way to track how reasoning flows between them, monitor discrepancies in outputs, and log intermediate steps for better debugging and evaluation. This ensures greater transparency in how different models contribute to the final response and helps optimize their interaction over time.
Ensembling multiple reasoning models
Retrieval Augmented Thinking doesn’t have to rely on a single reasoning model. Instead, multiple models can be used together to enhance reasoning quality, improve verification, and increase interpretability. This implementation pairs Claude 3.7 as the primary reasoning model with o3-mini as a secondary verifier to critique and refine the generated reasoning. The approach follows a solve-critique-fix loop, where Claude generates an initial reasoning trace, o3-mini checks for errors, and the process repeats until a valid final answer is produced (o3-mini is unable to find errors, or a maximum number or iterations is reached).
Using multiple reasoning models allows us to leverage the strengths of both open and closed-source models while maintaining visibility into the thought process. Claude 3.7 is used because it provides an explicit reasoning trace, making it useful for RAT. o3-mini acts as a reasoning validator, identifying inconsistencies or mistakes. By iterating over multiple steps, the system ensures higher reliability in its final answers.
Here’s the code:
import asyncioimport timefrom anthropic import Anthropicfrom openai import OpenAIimport weaveimport jsonfrom datasets import load_dataset# Initialize Weaveweave.init("aime_evaluation")# Claude 3.7 configurationCLAUDE_API_KEY = "your claude api key" # Replace with your actual API keyCLAUDE_MODEL = "claude-3-7-sonnet-20250219"THINKING_BUDGET = 4000MAX_TOKENS = THINKING_BUDGET + 4000# Initialize Anthropic client for Claude 3.7claude_client = Anthropic(api_key=CLAUDE_API_KEY)# Initialize OpenAI client for O3-mini (mistake detection)openai_client = OpenAI(api_key="your OpenAI api key" # Replace with your actual API key)# # Initialize OpenAI client for GPT-4o (scoring)# gpt4o_client = OpenAI(# api_key="your_gpt4o_api_key", # Replace with your GPT-4o API key# )# System prompt for ClaudeSYSTEM_MESSAGE = "Solve the following problem. Put your final answer within \\boxed{}: "class ClaudeO3Verifier(weave.Model):"""A simplified model that combines Claude 3.7 for reasoning with O3-mini for verificationand error correction, following the "solve-critique-fix" loop."""@staticmethodasync def get_claude_reasoning(prompt: str, critic=None, verbose=False) -> dict:"""Get reasoning from Claude 3.7 with optional critic feedback."""try:start_time = time.time()# Prepare full prompt with critic feedback if availablefull_prompt = promptif critic and critic != "Correct Reasoning" and critic != "":full_prompt = f"""Problem: {prompt}A review of your previous solution found this error:{critic}Please solve this problem again with correct reasoning."""# Using streaming mode with Claude 3.7full_response = ""thinking_content = ""if verbose:print(f"\nStreaming Claude 3.7 Reasoning for problem: {prompt[:50]}...")print("-" * 50)# Stream Claude's responsewith claude_client.messages.stream(model=CLAUDE_MODEL,max_tokens=MAX_TOKENS,thinking={"type": "enabled", "budget_tokens": THINKING_BUDGET},messages=[{"role": "user", "content": SYSTEM_MESSAGE + full_prompt}]) as stream:current_block_type = Nonefor event in stream:if event.type == "content_block_start":current_block_type = event.content_block.typeif verbose:print(f"\n--- Starting {current_block_type} block ---")elif event.type == "content_block_delta":if event.delta.type == "thinking_delta":thinking_content += event.delta.thinkingif verbose and len(thinking_content) % 100 == 0:print(".", end="", flush=True)elif event.delta.type == "text_delta":full_response += event.delta.textif verbose and len(full_response) % 100 == 0:print("*", end="", flush=True)elif event.type == "content_block_stop":if verbose:print(f"\n--- End {current_block_type} block ---")elif event.type == "message_stop":if verbose:print("\n--- Message complete ---")end_time = time.time()if verbose:print("\n" + "-" * 50)print(f"Total time: {end_time - start_time:.2f} seconds")print(f"Total thinking length: {len(thinking_content)} characters")print(f"Total response length: {len(full_response)} characters")return {"thinking": thinking_content,"response": full_response}except Exception as e:end_time = time.time()print(f"\nError after {end_time - start_time:.2f} seconds: {str(e)}")return {"thinking": "", "response": f"Error: {str(e)}"}@weave.op@staticmethodasync def check_reasoning_correctness(problem: str, reasoning: dict) -> str:"""Check the full reasoning for correctness using O3-mini.Returns "Correct Reasoning" or "Error - {description}"."""if not reasoning or (not reasoning.get("thinking") and not reasoning.get("response")):return "Error - Empty reasoning provided"# Prioritize thinking content, but fall back to response if thinking is emptyreasoning_text = reasoning.get("thinking") or reasoning.get("response")# Clean reasoning if neededclean_reasoning = reasoning_text.strip()validation_prompt = "\n".join(["Examine this complete mathematical reasoning trace for any logical mistakes, calculation errors, or incorrect assumptions:","","the problem: ",problem,"reasoning: ",clean_reasoning,"","Focus on mathematical correctness. Pay special attention to:","1. Arithmetic errors","2. Algebraic manipulation errors","3. Logical flaws in the reasoning","4. Incorrect application of formulas","5. Unwarranted assumptions","","Respond with EXACTLY:","- \"Correct Reasoning\" if no mistakes are found","- \"Error - [a explanation of which step the error occurs, detailed explanation of the specific error, along with an explanation of what to do instead]\" if you find any mistakes"])try:response = openai_client.chat.completions.create(model="o3-mini",messages=[{"role": "user", "content": validation_prompt}])result = response.choices[0].message.content.strip()if result.lower().startswith("correct reasoning"):return "Correct Reasoning"elif result.lower().startswith("error"):return resultelse:# Force result into expected formatif "error" in result.lower():return f"Error - DO NOT REPEAT THIS MISTAKE AGAIN - {result}"else:return "Correct Reasoning" # Default to correct if unclearexcept Exception as e:print(f"Failed to validate reasoning: {e}")return f"Error - Validation failed: {str(e)}"@weave.opasync def predict(self, text: str) -> str:"""Run the complete inference pipeline using iterative solve-critique-fix loop.Args:text: The mathematical problem to solveReturns:str: The final solution to the problem"""try:print(f"Problem: {text[:100]}...")# Initialize loop variablesmax_iterations = 5 # Maximum number of attemptscurrent_iteration = 0critic = ""done = Falsefinal_reasoning = Nonepartial_critic = None# Iterative solve-critique-fix loopwhile not done and current_iteration < max_iterations:current_iteration += 1print(f"\nIteration {current_iteration}/{max_iterations}")# Get reasoning from Claude 3.7print("Getting reasoning from Claude 3.7...")reasoning = await self.get_claude_reasoning(text, critic, verbose=False)final_reasoning = reasoningif not reasoning.get("response"):return "Failed to get reasoning"# Check reasoning with O3-miniprint("Checking reasoning with O3-mini...")partial_critic = await ClaudeO3Verifier.check_reasoning_correctness(text, reasoning)print(f"Critic: {critic[:100]}...")# Check if we're doneif partial_critic == "Correct Reasoning" or current_iteration == max_iterations:done = Truecritic += "Mistake: " + partial_critic# Return the final response for evaluationreturn final_reasoning.get("response", "No response generated")except Exception as e:print(f"Critical error in predict method: {str(e)}")return f"Error processing problem. Details: {str(e)}"@weave.opasync def gpt4o_scorer(label: str, model_output: str) -> dict:"""Score the model's output by comparing it with the ground truth."""query = f"""YOU ARE A LLM JUDGE DETERMINING IF THE FOLLOWING MODEL GENERATED ANSWER IS THE SAME AS THE CORRECT ANSWERI WILL GIVE YOU THE LAST 100 CHARS OF THE MODEL'S REASONING PATH, WHICH WILL CONTAIN THE FINAL ANSWER ->Model's Answer (last 100 chars): {str(model_output)[-100:]}Correct Answer: {label}Your task:1. State the model's predicted answer (answer only).2. State the ground truth (answer only).3. Determine if the model's final answer is correct (ignore formatting differences, etc.). RESPOND with the predicted and ground truth answer, followed with a JSON object containing the correctness encapsulated within the following delimiters:```json{{ "correctness": true/false }}```"""try:# Perform inference using GPT-4o clientresponse = openai_client.chat.completions.create(model="gpt-4o-2024-08-06",messages=[{"role": "user", "content": query}])result = response.choices[0].message.content# Extract correctness JSON object from the responsetry:json_start = result.index("```json") + 7json_end = result.index("```", json_start)correctness = json.loads(result[json_start:json_end].strip()).get("correctness", False)except (ValueError, IndexError, json.JSONDecodeError):correctness = Falseprint("Failed to parse correctness JSON")return {"correctness": correctness, "reasoning": result}except Exception as e:print(f"Scoring error: {e}")return {"correctness": False, "reasoning": f"Scoring error: {str(e)}"}# Load and preprocess datasetdef load_ds():print("Loading AIME dataset...")dataset = load_dataset("Maxwell-Jia/AIME_2024")["train"] # no test set herereturn [{"text": row["Problem"], "label": row["Answer"]} for row in dataset]async def run_evaluations():print("Loading dataset...")dataset = load_ds()print(f"Dataset loaded with {len(dataset)} problems")# For testing purposes, you might want to use a smaller subset initially# Uncomment the next line to use a smaller subset# dataset = dataset[:5] # Use only 5 problems for testingprint("Initializing model...")model = ClaudeO3Verifier()print("Preparing dataset for evaluation...")dataset_prepared = [{"text": row["text"], "label": row["label"]} for row in dataset]print("Running evaluation...")scorers = [gpt4o_scorer]evaluation = weave.Evaluation(dataset=dataset_prepared,scorers=scorers,name="Claude 3.7 AIME Evaluation")print("Starting evaluation...")results = await evaluation.evaluate(model)print(f"Evaluation complete!")print(f"Results: {results}")return resultsif __name__ == "__main__":asyncio.run(run_evaluations())
Weave Evaluations is used to track model performance across multiple reasoning steps. By structuring evaluations, it becomes easier to measure accuracy trends, assess how often o3-mini detects errors in Claude’s reasoning, and refine the verification process over time. Here are the results for my evaluation:

The Claude + o3-mini verifier achieves higher correctness (0.433 vs. 0.367) compared to the regular Claude model but at the cost of increased latency. The generation reasoning model was limited to 4K thinking tokens per response, but since we allocate another 4K thinking tokens for every mistake it makes, total token usage is technically higher for any sample where the model makes at least one mistake. While this could be a possible confounder, the results still seem meaningful since this approach does show a clear improvement in accuracy.
The Weave comparisons dashboard
Once evaluations are complete, Weave organizes results into an interactive dashboard. This powerful tool enables you to compare model outputs side by side. This visual approach simplifies debugging and provides deep insights into model performance, making Weave an indispensable tool for tracking and refining large language models.
For reasoning-focused tasks like those tackled by DeepSeek-R1, the comparisons view offers a step-by-step trace of each model’s decision-making process. By analyzing these granular outputs, you can better understand why models succeed or fail on specific tasks. This insight is invaluable for diagnosing issues, improving models, and tailoring their capabilities for complex reasoning scenarios.
Here’s a screenshot of what the comparison view looks like:

Conclusion
Retrieval Augmented Thinking (RAT) introduces a new way to separate reasoning from response generation, offering advantages in efficiency, interpretability, and customization. By leveraging models that expose their thought processes, RAT allows for more structured reasoning while maintaining flexibility in final outputs.
Implementations using DeepSeek-R1, Claude, and other models demonstrate how reasoning can be optimized, verified, and improved through multiple iterations.
Tools like Weave further enhance transparency, providing valuable insights into model performance and interactions. As AI continues to evolve, RAT presents a promising approach to balancing cost, accuracy, and control in complex reasoning tasks.
Add a comment
Iterate on AI agents and models faster. Try Weights & Biases today.