
Agentic RAG: Enhancing retrieval-augmented generation with AI agents

This article explores how agentic RAG enhances retrieval-augmented generation by using AI agents to dynamically refine search strategies, coordinate multiple data sources, and improve response accuracy.
Retrieval-augmented generation (RAG) enhances AI systems by combining pre-trained knowledge with external information retrieval, allowing models to generate more accurate and contextually relevant responses. Traditional RAG operates in two steps: it retrieves relevant documents based on a query and integrates them into a language model’s response. While effective for straightforward lookups, this static retrieval approach struggles with complex reasoning, multi-step queries, and ambiguous requests, limiting its ability to handle more sophisticated tasks.
Agentic RAG addresses these challenges by introducing intelligent agents that actively manage retrieval, refine search strategies, and coordinate multiple information sources. Instead of a fixed retrieval process, these agents dynamically adapt based on the query, improving precision and scalability. By leveraging multi-agent architectures, real-time tool use, and monitoring frameworks like Weave, Agentic RAG enables AI systems to process complex queries more efficiently.
This article explores Agentic RAG, comparing single-agent and multi-agent architectures and analyzing their strengths, trade-offs, and practical applications in enhancing AI-driven retrieval.
Monitoring agent tool use with W&B Weave.


What is retrieval-augmented generation?

Vanilla RAG works in two steps: retrieval and generation. When a user submits a query, it is converted into an embedding (a numerical representation of its meaning). This embedding is compared against a vector database of preprocessed document embeddings to find the most relevant matches. The system retrieves these documents using similarity scores.
The retrieved documents are then passed to the generation module (an LLM), which incorporates them into its response. This allows the model to generate answers that blend its pre-trained knowledge with up-to-date external information.
Vanilla RAG performs this retrieval once per query, meaning it does not refine or iterate on its results. While effective for simple lookups, it struggles with complex reasoning or multi-step queries, which is where Agentic RAG improves by adding iterative retrieval and autonomous query handling.
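To make the two-step flow concrete, here is a minimal sketch of vanilla RAG using the same libraries we use later in this tutorial (LangChain's Chroma wrapper and LiteLLM). The database path and query are placeholders:
```python
# A single retrieval pass followed by generation: the whole of "vanilla" RAG.
# Assumes a Chroma store already exists at db/wandb_docs (built as shown later).
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from litellm import completion

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
db = Chroma(persist_directory="db/wandb_docs", embedding_function=embeddings)

query = "How do I log metrics with wandb?"
docs = db.similarity_search(query, k=3)  # Step 1: retrieve once
context = "\n\n".join(doc.page_content for doc in docs)

response = completion(  # Step 2: generate with the retrieved context
    model="openai/gpt-4o",
    messages=[
        {"role": "system", "content": f"Answer using this context:\n{context}"},
        {"role": "user", "content": query},
    ],
)
print(response.choices[0].message.content)
```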

Traditional RAG vs. Agentic RAG

As we noted above, traditional RAG retrieves information once per query from a static, preprocessed database and incorporates it directly into the response. This approach, while effective for simple lookups, struggles when queries are complex, ambiguous, or require multiple steps to fully resolve. It cannot adapt its retrieval strategy, refine its results, or integrate information from multiple diverse sources.
Agentic RAG addresses these limitations by introducing intelligent agents that actively manage and enhance the retrieval process. Rather than relying solely on pre-indexed data, these agents can dynamically interact with various tools and databases—including real-time sources like websites, structured databases accessed via advanced queries (e.g., SQL), and external APIs. The agents adapt their retrieval strategy based on the nature of the query, performing iterative retrieval, refining their searches through self-reflection, and adjusting their approach as needed. This ability to reason autonomously, coordinate across multiple information sources, and continuously refine both the queries and retrieval methods significantly improves accuracy, reduces hallucinations, and ensures highly relevant and contextually informed responses.
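To make the contrast concrete, here is an illustrative sketch of the agentic loop described above. The helper functions are simple stand-ins for the routing, retrieval, and reflection components built later in this article, not a specific library API:
```python
# Illustrative only: the control flow that separates agentic RAG from a single
# retrieval pass. route(), retrieve(), looks_sufficient(), and refine() are
# stand-ins for the components implemented later in this article.
def route(query):
    return ["docs"] if "how do i" in query.lower() else ["docs", "issues"]

def retrieve(query, sources):
    return [f"[{source}] result for '{query}'" for source in sources]

def looks_sufficient(context):
    return len(context) >= 3  # A real agent would ask the LLM to judge coverage

def refine(query, attempt):
    return f"{query} (reformulated, attempt {attempt + 1})"

def agentic_answer(query, max_rounds=3):
    context, search_query = [], query
    for attempt in range(max_rounds):
        context += retrieve(search_query, route(search_query))  # adaptive retrieval
        if looks_sufficient(context):  # self-reflection on what was found
            break
        search_query = refine(query, attempt)  # adjust the strategy and retry
    return context  # in practice this context is handed to the LLM for generation

print(agentic_answer("wandb run crashes when logging images"))
```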

Agentic RAG architecture: Single-agent vs. multi-agent systems

Agentic RAG architectures rely on LLM-powered agents to retrieve and process information from diverse sources. These systems can be structured as single-agent or multi-agent, each offering trade-offs in efficiency, modularity, and scalability.
This section explores single-agent and multi-agent architectures, detailing their capabilities, trade-offs, and use cases for retrieval-augmented generation (RAG) systems.

Single-agent systems: Simplicity with limitations


In a single-agent system, one agent manages the entire retrieval pipeline. It:
  • Processes the query
  • Determines relevant data sources
  • Retrieves necessary information
  • Integrates results before passing them to the LLM
This approach is straightforward and efficient but has limitations as retrieval tasks become more complex. A single agent must handle:
  • Multiple retrieval strategies (e.g., vector search, keyword matching)
  • Diverse query formats (e.g., structured vs. unstructured data)
  • Different ranking mechanisms to prioritize relevant information
As retrieval complexity grows, a single agent can become a bottleneck, making the system harder to scale and maintain.

Single-agent systems for query routing

Despite these challenges, single-agent architectures can still be highly effective for structured and relatively simple retrieval tasks. They function as centralized query routers, efficiently managing query processing and response generation.
Key capabilities of a single-agent RAG system include:
  • Query pre-processing: Refining and restructuring queries before retrieval
  • Multi-step retrieval: Iteratively refining searches and synthesizing multiple retrieval passes
  • Validation & ranking: Filtering out low-confidence sources and cross-referencing retrieved data
  • External tool integration: Connecting with vector databases, APIs, web search modules, and structured knowledge bases
When to use single-agent RAG:
  • Tasks with consistent query structures (e.g., structured database lookups)
  • Low-latency applications where fast, centralized decision-making is needed
  • Systems that don’t require multi-step reasoning or complex cross-referencing
However, for more complex retrieval needs, a multi-agent system offers a scalable alternative.

Multi-agent systems: Specialization and scalability



A multi-agent system overcomes single-agent limitations by distributing retrieval tasks across specialized agents, each optimized for a specific retrieval method or data type. Instead of a single agent managing all retrieval logic, tasks are delegated, enabling a modular and scalable retrieval framework.
For example:
  • Vector search agents retrieve information from vector databases using semantic search
  • Web search agents fetch real-time data from external sources
  • Structured data agents query databases, enterprise records, and email systems
Advantages of multi-agent systems
  • Improved scalability: New retrieval agents can be added without disrupting the existing system
  • Better optimization: Each agent is tailored to specific data types, improving retrieval accuracy
  • Easier debugging & development: Changes can be made to individual agents without affecting the entire pipeline
However, multi-agent systems also introduce challenges:
  • Coordination overhead: Managing multiple agents requires sophisticated query routing & aggregation techniques
  • Higher computational costs: Multiple agent calls increase LLM processing overhead, impacting performance

Multi-agent systems for advanced reasoning and coordination

Multi-agent architectures excel in handling complex retrieval workflows, particularly when dealing with:
  • Cross-referencing multiple data sources for more informed responses
  • Multi-step reasoning tasks, where retrieved information must be iteratively refined
  • Adaptive retrieval, where agents adjust their retrieval strategies based on query complexity
By collaborating in a structured manner, these agents refine search strategies iteratively, ensuring that responses are more accurate, contextually informed, and explainable.
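The sketch below illustrates this coordination pattern. The three "agents" are stubs; in a real system each would wrap a vector store, a web-search API, or a SQL client, and the coordinator would merge and rerank their results:
```python
# A minimal coordination sketch: each "agent" specializes in one retrieval
# method, and a coordinator fans the query out in parallel and merges results.
from concurrent.futures import ThreadPoolExecutor

def vector_search_agent(query):
    return [f"[vector] semantic matches for: {query}"]

def web_search_agent(query):
    return [f"[web] live results for: {query}"]

def structured_data_agent(query):
    return [f"[sql] rows matching: {query}"]

AGENTS = {
    "vector": vector_search_agent,
    "web": web_search_agent,
    "structured": structured_data_agent,
}

def coordinate(query, selected=("vector", "web")):
    """Run the selected agents in parallel and merge their results."""
    with ThreadPoolExecutor(max_workers=len(selected)) as pool:
        futures = [pool.submit(AGENTS[name], query) for name in selected]
        return [doc for future in futures for doc in future.result()]

print(coordinate("lawsuits settled above a $10M threshold in 2023",
                 selected=("vector", "web", "structured")))
```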

When to use multi-agent RAG:

  • High-stakes retrieval tasks (e.g., legal research, financial analytics)
  • Dynamic queries requiring real-time cross-referencing
  • Knowledge-intensive applications needing multi-source verification
While multi-agent systems increase complexity, their ability to process and refine information dynamically makes them powerful for advanced reasoning tasks.

Beyond single vs. multi-agent: Other RAG architectures

Beyond these architectures, other variations of RAG exist, such as:
  • GraphRAG, which leverages knowledge graphs for structured, relationship-driven retrieval
  • Modular RAG, where retrieval components are designed to be plug-and-play for different data sources
For a more comprehensive overview of these architectures, refer to Agentic Retrieval-Augmented Generation: A Survey on Agentic RAG by Aditi Singh et al. This paper provides detailed insights and diagrams contextualizing the broader RAG landscape.
I would like to personally thank the authors of the paper for granting me permission to use their excellent diagrams in the next few sections.

Step-by-step tutorial: Building a single-agent RAG system

In this tutorial, we will implement a single-agent retrieval-augmented generation (RAG) system designed to search across repositories for two key tools from Weights & Biases (W&B):
  • W&B Experiment Tracking: A framework for tracking, visualizing, and managing machine learning experiments. It helps practitioners monitor model performance, manage hyperparameters, and organize datasets throughout the development lifecycle.
  • W&B Weave: A toolkit for developers building applications powered by large language models (LLMs). It provides programmatic tools to track, evaluate, and debug LLM prompts and interactions, making it easier to refine workflows.
By integrating retrieval across both tools, our system will provide relevant information for:
  • General ML experimentation (via W&B documentation and issue tracking)
  • LLM development and debugging (via Weave documentation and discussions)

Understanding user queries in software engineering

User queries typically fall into two categories:
  1. "How-to" and API usage questions: Users need guidance on using a feature, function, or API endpoint
  2. Troubleshooting and issue resolution: Users report errors, unexpected behaviors, or bugs
To address these queries, our single-agent RAG system will:
  • Classify whether the query relates to API usage or debugging
  • Search documentation for instructional queries
  • Retrieve relevant GitHub issues for troubleshooting queries
  • Generate responses using GPT-4o, incorporating additional GitHub discussions if needed

Overview of the implementation

Our system will follow these four key steps:

1. Collect and process data

  • Clone the latest versions of the W&B and Weave repositories
  • Extract documentation (Markdown files) and clean them for indexing
  • Fetch GitHub issues, pulling titles, descriptions, and metadata (excluding comments for now)

2. Convert text into embeddings

  • Use OpenAI’s text-embedding-3-small model to generate vector embeddings for all documents
  • Store the embeddings in ChromaDB, a high-performance vector database
  • Differentiate between documentation and issue reports to improve search relevance

3. Implement query classification and retrieval

  • Determine whether a user query is related to API documentation or debugging
  • If API-related, search documentation embeddings for relevant results
  • If related to issues, retrieve and rank GitHub issues
  • Fetch additional GitHub comments if further context is required

4. Generate responses with GPT-4o

  • Use the retrieved results as context for GPT-4o to generate a response
  • Ensure that responses reference the original documentation or issue discussions to maintain transparency

Next step: Writing the script

In the next section, we will write the actual implementation, starting with data collection and vectorization:
import os
import glob
import re
import requests
import time
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

# GitHub repository details
REPOS = {
    "weave": "wandb/weave",
    "wandb": "wandb/docs",  # For documentation
}

# GitHub repositories for issues
ISSUE_REPOS = {
    "weave": "wandb/weave",
    "wandb": "wandb/wandb",  # Correct repo for WandB issues
}

# GitHub issues URL
ISSUES_ENDPOINT = "/issues?state=all&per_page=100"

# Initialize OpenAI Embeddings
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Function to download repo docs using manual git commands
def download_docs(repo_name, repo_url, path):
    if os.path.exists(path):
        print(f"{repo_name} documentation already exists, skipping download.")
        return
    print(f"Downloading {repo_name} documentation...")
    # Simple git clone without sparse-checkout (more reliable)
    clone_cmd = f"git clone --depth 1 https://github.com/{repo_url}.git {path}"
    result = os.system(clone_cmd)
    if result != 0 or not os.path.exists(path):
        print(f"❌ Failed to clone {repo_name}. Check if the repository URL is correct.")
        return
    print(f"✅ {repo_name} documentation downloaded successfully.")

# Function to clean markdown content
def clean_markdown(text):
    if not text:
        return ""
    text = re.sub(r'!\[.*?\]\(.*?\)', '', text)  # Remove images
    text = re.sub(r'\[([^\]]+)\]\((.*?)\)', r'\1', text)  # Keep link text but remove URLs
    return text.strip()

# Function to fetch GitHub issues without comments
def fetch_issues(repo_name, repo_url, max_pages=5):
    all_issues = []
    page = 1
    # Paginate through the results
    while page <= max_pages:
        url = f"https://api.github.com/repos/{repo_url}{ISSUES_ENDPOINT}&page={page}"
        headers = {"Accept": "application/vnd.github.v3+json"}
        try:
            print(f"Fetching page {page} of issues from {repo_name}...")
            response = requests.get(url, headers=headers)
            if response.status_code == 200:
                issues = response.json()
                # If no more issues are returned, break the loop
                if not issues:
                    break
                # Process each issue without fetching comments
                for issue in issues:
                    # Add basic issue data to the list
                    all_issues.append({
                        "id": issue["id"],
                        "number": issue["number"],
                        "title": issue["title"],
                        "body": issue.get("body", ""),
                        "state": issue["state"],
                        "created_at": issue["created_at"],
                        "comments_count": issue.get("comments", 0),
                        "github_url": issue["html_url"],  # URL to the GitHub issue page
                        "comments_url": issue["comments_url"],  # Direct API URL for comments
                        "repo": repo_url  # Store the repo info for later reference
                    })
                print(f"Fetched {len(issues)} issues from {repo_name} (page {page})")
                page += 1
                time.sleep(1)  # Small delay between pages
            elif response.status_code == 403 and "rate limit exceeded" in response.text.lower():
                print("⚠️ GitHub API rate limit exceeded. Waiting for 60 seconds...")
                time.sleep(60)  # Wait for rate limit to reset
                continue  # Retry the same page
            else:
                print(f"Error fetching issues from {repo_name}: {response.status_code}")
                break
        except Exception as e:
            print(f"Exception fetching issues for {repo_name}: {e}")
            break
    print(f"Total issues fetched from {repo_name}: {len(all_issues)}")
    return all_issues

# Function to create documents from issues (without comments)
def create_issue_documents(issues):
    docs = []
    for issue in issues:
        # Create a document for the main issue
        issue_text = f"Title: {issue['title']}\nIssue #{issue['number']}\nState: {issue['state']}\nComments: {issue['comments_count']}\n"
        # Add a note about comments and where to find them
        if issue['comments_count'] > 0:
            issue_text += f"\nThis issue has {issue['comments_count']} comments. View at: {issue['github_url']}\n"
        issue_text += f"\nBody:\n{issue['body']}"
        # Create the document object with rich metadata
        docs.append(Document(
            page_content=clean_markdown(issue_text),
            metadata={
                "id": issue["id"],
                "number": issue["number"],
                "created_at": issue["created_at"],
                "state": issue["state"],
                "comments_count": issue["comments_count"],
                "github_url": issue["github_url"],
                "comments_url": issue["comments_url"],
                "repo": issue["repo"],
                "type": "issue"
            }
        ))
    return docs

# Function to process Markdown docs
def create_markdown_documents(repo_name, doc_path):
    if not os.path.exists(doc_path):
        print(f"⚠️ {doc_path} does not exist, skipping doc processing.")
        return []
    # Find markdown files recursively
    md_files = glob.glob(f'{doc_path}/**/*.md', recursive=True)
    if not md_files:
        print(f"No markdown files found in {doc_path}.")
        return []
    docs = []
    for md_file in md_files:
        try:
            with open(md_file, 'r', encoding='utf-8') as f:
                content = clean_markdown(f.read())
            # Create a proper Langchain Document object
            docs.append(Document(
                page_content=content,
                metadata={
                    "file": md_file,
                    "type": "documentation"
                }
            ))
        except Exception as e:
            print(f"Error processing {md_file}: {e}")
    return docs

# Function to query the vector store
def search_docs(query, db_directory, k=5):
    """Search a Chroma vector store for documents similar to the query."""
    try:
        # Load the Chroma database
        db = Chroma(persist_directory=db_directory, embedding_function=embeddings)
        # Search for similar documents
        results = db.similarity_search_with_score(query, k=k)
        if not results:
            print(f"No results found for '{query}'")
            return []
        print(f"Found {len(results)} results for '{query}':")
        for i, (doc, score) in enumerate(results):
            print(f"\nResult {i+1}:")
            print(f"Score: {score}")
            # Print metadata
            print("Metadata:")
            for key, value in doc.metadata.items():
                if key in ["id", "number", "state", "comments_count", "type"]:
                    print(f"  {key}: {value}")
            # For issues with comments, show the GitHub URL
            if doc.metadata.get("type") == "issue" and doc.metadata.get("comments_count", 0) > 0:
                print(f"  GitHub URL: {doc.metadata.get('github_url')}")
            # Print content (truncated if too long)
            content = doc.page_content
            if len(content) > 500:
                content = content[:500] + "... [truncated]"
            print(f"\nContent: {content}\n")
            print("-" * 40)
        return results
    except Exception as e:
        print(f"Error searching database: {e}")
        return []

# Main execution
if __name__ == "__main__":
    # Create directory for vector stores
    os.makedirs("db", exist_ok=True)
    for repo_name, repo_url in REPOS.items():
        print(f"\nProcessing {repo_name.upper()}...")

        # Fetch issues from the correct repository
        issue_repo_url = ISSUE_REPOS[repo_name]
        print(f"Fetching issues from {issue_repo_url}...")
        issues = fetch_issues(repo_name, issue_repo_url)

        # Download docs
        doc_path = f"{repo_name}_docs"
        download_docs(repo_name, repo_url, doc_path)

        # Process issues
        if issues:
            print(f"Creating vector store for {repo_name} issues...")
            issue_docs = create_issue_documents(issues)
            # Create Chroma vector store for issues
            issues_db = Chroma.from_documents(
                documents=issue_docs,
                embedding=embeddings,
                persist_directory=f"db/{repo_name}_issues"
            )
            print(f"✅ Saved {repo_name} issues vector store ({len(issue_docs)} documents)")
        else:
            print("No issues to process.")

        # Process docs
        print(f"Creating vector store for {repo_name} documentation...")
        md_docs = create_markdown_documents(repo_name, doc_path)
        if md_docs:
            # Create Chroma vector store for documentation
            docs_db = Chroma.from_documents(
                documents=md_docs,
                embedding=embeddings,
                persist_directory=f"db/{repo_name}_docs"
            )
            print(f"✅ Saved {repo_name} documentation vector store ({len(md_docs)} documents)")
        else:
            print("No documentation to process.")

    print("\n✅ Data ingestion completed successfully.")
    # Example search to verify everything works
    print("\nTesting search functionality...")
    search_docs("authentication issues", "db/weave_issues", k=2)
With the vector database in place, the system will be ready to process user queries, retrieving relevant documentation or past discussions from GitHub issues. The next step will involve implementing the retrieval mechanism that searches the database based on similarity to user queries.

Building the query routing agent

A query-routing agent is essential for optimizing retrieval in our RAG pipeline. Rather than searching all available sources indiscriminately, the routing agent classifies user queries and directs them to the most relevant knowledge bases, improving efficiency and accuracy.

How the query router works

The routing agent performs two key classification tasks:
  1. Identifying the relevant tool: Does the query relate to W&B Experiment Tracking or Weave?
  2. Determining query type: Is the user asking for documentation or troubleshooting an issue?
  • Documentation queries typically involve user guides, tutorials, and API references
  • Troubleshooting queries are best answered by retrieving GitHub discussions, bug reports, and issue resolutions
If a query is ambiguous, the agent returns multiple knowledge bases instead of forcing a single, potentially incorrect choice.

Enhancing accuracy with code-based signals

To improve classification, the agent leverages syntax and code snippets commonly associated with each tool:
  • W&B-related queries often contain wandb.init() or references to logging metrics
  • Weave-related queries may include weave.op() or mention function tracking
When these patterns appear in a query, the agent uses them as additional routing signals to refine search accuracy.
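A small, hypothetical helper like the one below could extract these signals before the query reaches the router; the regular expressions are illustrative, not exhaustive:
```python
# Hypothetical helper: detect tool-specific syntax in the query and use it as
# an extra routing signal alongside the LLM router's own classification.
import re

CODE_SIGNALS = {
    "wandb": [r"\bwandb\.init\(", r"\bwandb\.log\(", r"\bwandb\.config\b"],
    "weave": [r"\bweave\.init\(", r"weave\.op\(", r"@weave\.op"],
}

def detect_code_signals(query: str) -> list[str]:
    """Return the tools whose characteristic syntax appears in the query."""
    return [
        tool
        for tool, patterns in CODE_SIGNALS.items()
        if any(re.search(pattern, query) for pattern in patterns)
    ]

print(detect_code_signals("Why does wandb.log() drop metrics inside my loop?"))  # ['wandb']
```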

Handling ambiguous queries

The agent is designed to intelligently manage vague queries.
For example, if a user asks:
"How do I debug errors in my AI workflow?"
The system cannot determine with certainty whether the issue relates to W&B or Weave. In such cases, the agent returns both relevant knowledge bases to ensure comprehensive retrieval.

Implementation approach

To implement this logic, we define a system prompt that instructs the agent to:
  • Analyze the user’s query and classify it accordingly
  • Return structured JSON output specifying which knowledge bases to search
  • Support multi-source retrieval for ambiguous or overlapping queries
By implementing this routing mechanism, we:
  • Reduce unnecessary searches
  • Improve response relevance
  • Create a structured, efficient retrieval process

Next step: Writing the query routing agent in code

Now we will implement the query-routing agent and integrate it into our retrieval pipeline. The routing logic lives in the system prompt below; save it as router_prompt.txt, since the retrieval script in the next section loads it from that file.
You are a query-routing assistant that determines whether a user's question is related to Weights & Biases (wandb) or Weave and directs them to the most relevant database. Users may ask about documentation, troubleshooting issues, or implementation details for either tool.

Your task is to analyze the query and route it to all relevant sources if uncertainty exists. If a query could apply to multiple areas, return multiple results rather than choosing just one.

The available knowledge bases are:
1. "Weave Issues" - Contains GitHub issues from the wandb/weave repository
2. "Weave Documentation" - Contains documentation for Weave
3. "WandB Issues" - Contains GitHub issues from the wandb/wandb repository
4. "WandB Documentation" - Contains documentation for Weights & Biases

Return a JSON object with the following format:
{
"knowledge_bases": ["name_of_kb1", "name_of_kb2"]
}

## Library Descriptions for Context

### Weights & Biases (wandb) – Experiment Tracking & ML Workflow Management
Weights & Biases (wandb) is a tool for tracking and visualizing machine learning experiments. It helps researchers and engineers:
* Log and monitor metrics like loss, accuracy, and learning rate.
* Track hyperparameters and model configurations for experiment reproducibility.
* Store and visualize model artifacts and training progress.
* Integrate seamlessly with deep learning frameworks like PyTorch, TensorFlow, and Hugging Face Transformers.
* Collaborate via dashboards that display real-time and historical training runs.
It is widely used in research and production to optimize models, compare runs, and fine-tune hyperparameters efficiently.

### W&B Weave – Generative AI & LLM Debugging Toolkit
Weave is a toolkit designed for developing, debugging, and evaluating Generative AI applications, particularly those using Large Language Models (LLMs). It enables users to:
* Log inputs and outputs from LLMs for better debugging.
* Build evaluations to compare different model responses.
* Organize information across LLM workflows from experimentation to production.
* Track interactions within multi-step AI pipelines to analyze decision-making processes.
Weave is essential for understanding and improving the performance of AI agents, LLM-based chatbots, and retrieval-augmented generation (RAG) systems.

## Routing Logic
* If the user asks about experiment tracking, ML metrics, logging, hyperparameters, dashboards, or integrations, route to W&B Docs.
* If they report an error, unexpected behavior, or API issue related to W&B, route to W&B Issues.
* If the user asks about LLM evaluation, function tracking, Weave UI development, or Weave's role in ML workflows, route to Weave Docs.
* If they report errors related to Weave tracking or data storage, route to Weave Issues.
* If uncertain, include all possibly relevant sources instead of picking only one.

## Code Samples for Context
If the query is vague or unclear, match it against the functionality of each tool using these sample code snippets:

### W&B Experiment Tracking Example (wandb Docs/Issues)
```python
import wandb
wandb.init(project="my_project", config={"learning_rate": 0.001, "epochs": 10})
for epoch in range(wandb.config.epochs):
train_loss = 0.01 * (wandb.config.epochs - epoch)
train_accuracy = 0.1 * epoch
wandb.log({'epoch': epoch, 'train_loss': train_loss, 'train_accuracy': train_accuracy})
```
If the query involves logging, training loops, metrics, dashboards, or hyperparameters, it should likely be routed to W&B Docs or W&B Issues if errors are involved.

### Weave Function Tracking Example (Weave Docs/Issues)
```python
import weave
weave.init(project_name="my_weave_project")
@weave.op()
def add_numbers(a: int, b: int) -> int:
return a + b
result = add_numbers(5, 7)
print(f'Result: {result}')
```
If the query involves tracking function calls, logging LLM interactions, debugging AI workflows, or analyzing execution steps, route it to Weave Docs or Weave Issues if errors occur.


Focus on understanding:
- Is the query about Weave or WandB?
- Is the user looking for documentation or issues/bugs?
- If the query mentions "bug" or "issue" or describes a problem, prioritize the Issues knowledge base
- If the query is asking how to use a feature or understand concepts, prioritize Documentation
- If unclear whether the query is about Weave or WandB, include knowledge bases for both

Examples:
1. For "How do I track experiments in wandb?"
- Knowledge bases: ["WandB Documentation"]

2. For "Weave authentication failing"
- Knowledge bases: ["Weave Issues"]

3. For "How to visualize neural networks in Weights & Biases"
- Knowledge bases: ["WandB Documentation"]

4. For "LLM output tracking shows errors in production"
- Knowledge bases: ["Weave Issues", "Weave Documentation"]

Always include at least one knowledge base in your response.

Return a JSON object with the following format:
{
"knowledge_bases": ["name_of_kb1", "name_of_kb2"]
}

Querying our databases

Now, we will implement query retrieval using our vector database and query-routing agent. This process consists of three key steps:
  1. Query classification: The AI-powered router determines which knowledge bases should be searched
  2. Vector search: The system retrieves the most relevant results from our Chroma database, ranking them by semantic similarity
  3. Response generation: GPT-4o synthesizes an answer, ensuring transparency by including references to the retrieved sources

Step 1: Query classification

The query-routing agent classifies user queries and returns a structured JSON response specifying:
  • Relevant knowledge bases (e.g., W&B documentation, Weave docs, GitHub issues)
  • Multi-source retrieval if the query is ambiguous, maximizing relevance
Rather than searching all sources blindly, this approach improves efficiency and accuracy.

Step 2: Searching the vector database

Once the routing decision is made, the system performs a semantic search using ChromaDB, which contains:
  • Embeddings of W&B and Weave documentation
  • Historical GitHub issues for troubleshooting queries
If a retrieved document is a GitHub issue, the system performs additional processing to fetch relevant comments and discussions, providing more context and potential solutions.

Step 3: Generating a response

The retrieved content is passed to GPT-4o, which:
  • Synthesizes a response based on the most relevant documents
  • Includes references to documentation and GitHub issues to ensure transparency
  • Handles edge cases – If no relevant results are found, the system:
    • Prompts the user for clarification.
    • Suggests an alternative query to refine the search.
By integrating query classification, semantic search, and intelligent response generation, this pipeline ensures accurate, transparent, and contextually relevant answers.

Next step: Writing the code

Now, let's implement the querying logic with the following script:
import sys
import os
import json
import time
import requests
from litellm import completion
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
import weave; weave.init("agentic_rag")

# Create cache directory if it doesn't exist
os.makedirs("comment_cache", exist_ok=True)

# Initialize OpenAI Embeddings
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Configure which databases to search (can be modified to include/exclude databases)
DATABASES_TO_SEARCH = [
    {"name": "Weave Issues", "path": "db/weave_issues"},
    {"name": "Weave Documentation", "path": "db/weave_docs"},
    {"name": "WandB Issues", "path": "db/wandb_issues"},
    {"name": "WandB Documentation", "path": "db/wandb_docs"}
]

def load_router_prompt():
    """Load the router prompt from router_prompt.txt"""
    try:
        with open("router_prompt.txt", "r") as f:
            return f.read()
    except FileNotFoundError:
        print("❌ Error: router_prompt.txt not found.")
        print("Please create this file with your router prompt before running the script.")
        sys.exit(1)

def search_vector_store(query, db_directory, k=5):
    """Search a vector store and return results."""
    try:
        # Load the Chroma database
        db = Chroma(persist_directory=db_directory, embedding_function=embeddings)
        # Search for similar documents
        results = db.similarity_search_with_score(query, k=k)
        return results
    except Exception as e:
        print(f"❌ Error searching database at {db_directory}: {e}")
        return []

def extract_json_from_response(response_text):
    """Extract JSON from response text, handling code blocks."""
    # Try to find JSON in code blocks
    import re
    json_pattern = r'```(?:json)?\s*(\{.*?\})\s*```'
    match = re.search(json_pattern, response_text, re.DOTALL)
    if match:
        # Extract JSON from code block
        json_str = match.group(1)
    else:
        # Assume the entire response is JSON
        json_str = response_text
    # Remove any non-JSON content
    json_str = json_str.strip()
    return json.loads(json_str)

def route_query(user_query):
    """Use LiteLLM to route the query to the appropriate databases."""
    router_prompt = load_router_prompt()
    try:
        # Call LiteLLM with the router prompt
        response = completion(
            model="openai/gpt-4o",
            messages=[
                {"role": "system", "content": router_prompt},
                {"role": "user", "content": user_query}
            ]
        )
        # Extract the content from the response
        router_response = response.choices[0].message.content
        # Parse the JSON response
        try:
            parsed_response = extract_json_from_response(router_response)
            return parsed_response
        except json.JSONDecodeError:
            print("❌ Error: Router didn't return valid JSON. Using all knowledge bases.")
            print(f"Router response: {router_response}")
            return {
                "knowledge_bases": [db["name"] for db in DATABASES_TO_SEARCH],
                "search_query": user_query,
                "reasoning": "Fallback due to JSON parsing error"
            }
    except Exception as e:
        print(f"❌ Error calling LiteLLM: {e}")
        return {
            "knowledge_bases": [db["name"] for db in DATABASES_TO_SEARCH],
            "search_query": user_query,
            "reasoning": f"Fallback due to error: {str(e)}"
        }

def get_kb_by_name(name):
    """Get the knowledge base by name."""
    for db in DATABASES_TO_SEARCH:
        if db["name"] == name:
            return db
    return None

def fetch_comments_for_issue(repo, issue_number):
    """Fetch comments for a GitHub issue, using a local cache when available."""
    # Check cache first
    cache_file = f"comment_cache/{repo.replace('/', '_')}_issue_{issue_number}_comments.json"
    if os.path.exists(cache_file):
        try:
            with open(cache_file, 'r') as f:
                comments = json.load(f)
            print(f"Loaded {len(comments)} comments for issue #{issue_number} from cache")
            return comments
        except Exception as e:
            print(f"Error reading cache: {e}")
            # Continue to fetch if cache read fails
    # Fetch from GitHub API
    url = f"https://api.github.com/repos/{repo}/issues/{issue_number}/comments"
    headers = {"Accept": "application/vnd.github.v3+json"}
    # Add GitHub token if available
    if os.environ.get("GITHUB_TOKEN"):
        headers["Authorization"] = f"token {os.environ.get('GITHUB_TOKEN')}"
    try:
        print(f"Fetching comments for issue #{issue_number} from {repo}...")
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            comments = response.json()
            # Cache the comments
            with open(cache_file, 'w') as f:
                json.dump(comments, f, indent=2)
            print(f"✅ Fetched and cached {len(comments)} comments for issue #{issue_number}")
            return comments
        elif response.status_code == 403 and "rate limit exceeded" in response.text.lower():
            print("⚠️ GitHub API rate limit exceeded. Waiting 60 seconds...")
            time.sleep(60)
            return fetch_comments_for_issue(repo, issue_number)  # Retry after waiting
        else:
            print(f"Error fetching comments: {response.status_code}")
            return []
    except Exception as e:
        print(f"Exception fetching comments: {e}")
        return []

def format_comments_for_context(comments):
    """Format GitHub comments into a text block for the LLM context."""
    if not comments:
        return ""
    comments_text = "\n\nCOMMENTS:\n"
    for i, comment in enumerate(comments):
        user = comment.get('user', {}).get('login', 'anonymous')
        created_at = comment.get('created_at', 'unknown date')
        body = comment.get('body', '').strip()
        comments_text += f"\n[Comment #{i+1} by {user} on {created_at}]\n{body}\n"
    return comments_text

def generate_response(user_query, relevant_docs, query_analysis):
    """Generate a response using LiteLLM based on the retrieved documents."""
    # Prepare context from documents
    context = ""
    github_links = []
    for i, (doc, score) in enumerate(relevant_docs):
        # Add the document to context
        context += f"--- Document {i+1} (Relevance: {score:.4f}) ---\n"
        # Add source information
        if doc.metadata.get('type') == 'issue':
            issue_num = doc.metadata.get('number', 'Unknown')
            repo = doc.metadata.get('repo', 'Unknown')
            context += f"Source: GitHub Issue #{issue_num} in {repo}\n"
            # If issue has comments, fetch and add them to context
            if doc.metadata.get('comments_count', 0) > 0:
                github_url = doc.metadata.get('github_url')
                if github_url:
                    github_links.append(github_url)
                # Fetch and append comments
                print(f"Fetching comments for issue #{issue_num}...")
                comments = fetch_comments_for_issue(repo, issue_num)
                if comments:
                    comments_text = format_comments_for_context(comments)
                    context += doc.page_content + comments_text + "\n\n"
                    continue  # Skip the standard content addition below
        else:
            context += f"Source: {doc.metadata.get('file', 'Unknown')}\n"
        # Add content (only reached if the issue doesn't have comments or isn't an issue)
        context += doc.page_content
        context += "\n\n"
    # Create the prompt for the response
    response_prompt = f"""You are an AI assistant for Weights & Biases (wandb) and Weave.
You've been given several documents retrieved from a search based on the user's query.
Use the information in these documents to answer the user's question.

If the documents don't contain the necessary information to answer the question, admit that you don't know
rather than making up an answer. If appropriate, suggest what the user might search for instead.

User Query: {user_query}

Retrieved Documents:
{context}

Based on these documents, provide a helpful response to the user's query.
Pay special attention to issue comments as they often contain solutions to the problems described in the issues.
"""

    try:
        # Call LiteLLM with the response prompt
        response = completion(
            model="openai/gpt-4o",
            messages=[
                {"role": "system", "content": response_prompt},
                {"role": "user", "content": user_query}
            ]
        )
        # Extract the content from the response
        return response.choices[0].message.content
    except Exception as e:
        print(f"❌ Error generating response: {e}")
        return f"I encountered an error while generating a response: {str(e)}"

def agentic_rag(user_query):
    """Main function for the agentic RAG system."""
    print(f"\n🔍 Processing query: '{user_query}'")
    print("-" * 80)
    # Step 1: Route the query
    query_analysis = route_query(user_query)
    print("✅ Query Analysis:")
    print(f"- Selected Knowledge Bases: {', '.join(query_analysis['knowledge_bases'])}")
    # Step 2: Search the selected knowledge bases
    all_results = []
    for kb_name in query_analysis["knowledge_bases"]:
        kb = get_kb_by_name(kb_name)
        if kb and os.path.exists(kb["path"]):
            results = search_vector_store(user_query, kb["path"], k=3)
            if results:
                print(f"Found {len(results)} results in {kb['name']}")
                # Print out GitHub URLs for issues with comments
                for doc, _ in results:
                    if doc.metadata.get('type') == 'issue' and doc.metadata.get('comments_count', 0) > 0:
                        github_url = doc.metadata.get('github_url')
                        if github_url:
                            print(f"  - Issue with comments: {github_url}")
                all_results.extend(results)
            else:
                print(f"No results found in {kb['name']}")
        else:
            print(f"⚠️ Knowledge base '{kb_name}' not found or not available.")
    # Step 3: Sort results by score (ascending since lower score = more relevant in Chroma)
    all_results.sort(key=lambda x: x[1])
    # Take top results (up to 5)
    top_results = all_results[:5]
    if not top_results:
        print("❌ No relevant documents found across any knowledge base.")
        return "I couldn't find any relevant information to answer your question. Could you please rephrase or provide more details?"
    # Step 4: Generate a response
    print(f"\n💬 Generating response based on {len(top_results)} documents...")
    response = generate_response(user_query, top_results, query_analysis)
    return response

def interactive_mode():
    """Run the RAG system in interactive mode."""
    print("\n" + "=" * 80)
    print("AGENTIC RAG SYSTEM")
    print("=" * 80)
    print("Type 'exit' to quit.")
    print("To ask a question, type your question (can be multiple lines)")
    print("When finished, press Enter on an empty line to submit.")
    while True:
        print("\nYour question (multi-line, empty line to submit):")
        lines = []
        while True:
            line = input().strip()
            # Check for exit command on single line
            if not lines and line.lower() in ['exit', 'quit']:
                return
            # Empty line finishes input if we already have some content
            if not line and lines:
                break
            # Otherwise add the line to our input
            if line:
                lines.append(line)
        # Combine all lines into a single query
        user_query = "\n".join(lines)
        if not user_query:
            print("Please enter a question.")
            continue
        print("\n" + "-" * 80)
        print("Processing your query...")
        response = agentic_rag(user_query)
        print("\n" + "=" * 80)
        print("ANSWER:")
        print(response)
        print("=" * 80)

if __name__ == "__main__":
    # Check for API key
    if not os.environ.get("OPENAI_API_KEY"):
        api_key = input("Please enter your OpenAI API key: ").strip()
        if api_key:
            os.environ["OPENAI_API_KEY"] = api_key
        else:
            print("❌ No API key provided. Exiting.")
            sys.exit(1)
    interactive_mode()
Weave is used to monitor system behavior and track interactions, allowing us to analyze query patterns, response quality, and retrieval efficiency. This provides insight into how well the system is performing and where improvements can be made. By logging queries, retrieved results, and generated responses, Weave helps us refine the RAG pipeline and optimize retrieval strategies. It also allows us to debug potential issues by tracing how queries are routed and ensuring that the system selects the most relevant knowledge sources.
This single-agent design ensures efficient query handling while maintaining a simple, centralized retrieval workflow. It sets the foundation for later expansion into a multi-agent system where different agents handle structured data, real-time search, or iterative refinement.
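As a minimal example, decorating the entry point with @weave.op (as the Claude-based version later in this article does for its search functions) records each query and its final answer as a single trace in Weave. This sketch assumes the agentic_rag() function from the script above:
```python
import weave
weave.init("agentic_rag")

@weave.op
def traced_agentic_rag(user_query):
    # Wraps the agentic_rag() function from the script above so each call is
    # recorded as one trace: inputs, outputs, nested LLM calls, and latency.
    return agentic_rag(user_query)

# traced_agentic_rag("How do I resume a crashed wandb run?")
```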

Future areas of improvement

Integrating Claude 3.7 tool use

Our current system, while functional, is somewhat manual and rigid. It requires careful prompt tuning, explicit database configurations, and a fair amount of human intervention whenever we want to modify how it retrieves information. Adding a new data source—whether it's an additional documentation set or another GitHub repository—means manually updating the vector database, modifying retrieval logic, and ensuring everything integrates correctly. While this works, it’s not scalable.
To make the system more flexible and autonomous, we can integrate Claude 3.7 with tool use, allowing it to reason about the retrieval process and dynamically query our knowledge sources. Instead of relying on a predefined set of rules to determine which database to search, Claude can analyze the user’s query, decide which sources are most relevant, and execute searches in real time. This approach follows the ReAct paradigm, where Claude is not just retrieving information passively but actively reasoning about how to retrieve the most relevant results.
This strategy has several advantages:
  • Less manual intervention: Instead of constantly refining prompts and retrieval rules, we can let Claude make retrieval decisions dynamically.
  • Easier scaling: Adding a new data source won’t require deep modifications to the system. Claude can recognize new tools and use them as needed.
  • Better query handling: If a query is ambiguous, Claude can reformulate it or run multiple searches instead of just returning poor results.
Weave plays a key role in this by tracking interactions, helping us visualize which retrieval strategies work best, and refining how queries are processed over time.
Below, we implement this by giving Claude direct access to our vector database through tool use, allowing it to search documentation, GitHub issues, and user discussions in real time.
import os
import json
import sys
import requests
from anthropic import Anthropic
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
import weave; weave.init("claude_agentic_rag")

# Create cache directory if it doesn't exist
os.makedirs("comment_cache", exist_ok=True)

# Initialize OpenAI Embeddings for vector search
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Configure knowledge bases
KNOWLEDGE_BASES = [
    {"name": "weave_issues", "path": "db/weave_issues", "description": "GitHub issues related to Weave, containing bug reports and feature requests."},
    {"name": "weave_docs", "path": "db/weave_docs", "description": "Documentation for Weave, explaining its features, APIs, and usage."},
    {"name": "wandb_issues", "path": "db/wandb_issues", "description": "GitHub issues related to Weights & Biases (wandb), containing bug reports and feature requests."},
    {"name": "wandb_docs", "path": "db/wandb_docs", "description": "Documentation for Weights & Biases (wandb), explaining its features, APIs, and usage."}
]

# Initialize Anthropic client (created only if the key is already set, so the
# script can still prompt for a key in __main__)
CLAUDE_API_KEY = os.environ.get("ANTHROPIC_API_KEY")
MODEL = "claude-3-7-sonnet-20250219"
client = Anthropic(api_key=CLAUDE_API_KEY) if CLAUDE_API_KEY else None

# Define search tools
TOOLS = [
    {
        "name": "search_weave_issues",
        "description": "Search for GitHub issues related to Weave. Use this for debugging, error messages, or when the user mentions bugs or problems with Weave.",
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "The search query to find relevant Weave issues."}
            },
            "required": ["query"]
        }
    },
    {
        "name": "search_weave_docs",
        "description": "Search Weave documentation. Use this for understanding Weave features, APIs, usage examples, or 'how-to' questions about Weave.",
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "The search query to find relevant Weave documentation."}
            },
            "required": ["query"]
        }
    },
    {
        "name": "search_wandb_issues",
        "description": "Search for GitHub issues related to Weights & Biases (wandb). Use this for debugging, error messages, or when the user mentions bugs or problems with wandb.",
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "The search query to find relevant Weights & Biases issues."}
            },
            "required": ["query"]
        }
    },
    {
        "name": "search_wandb_docs",
        "description": "Search Weights & Biases documentation. Use this for understanding wandb features, APIs, usage examples, or 'how-to' questions about wandb.",
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "The search query to find relevant Weights & Biases documentation."}
            },
            "required": ["query"]
        }
    }
]

@weave.op
def search_vector_store(kb_name, query, k=3):
    """Search a vector store and return results, automatically fetching comments for issues."""
    kb = next((kb for kb in KNOWLEDGE_BASES if kb["name"] == kb_name), None)
    if not kb:
        return {"error": f"Knowledge base '{kb_name}' not found"}
    db_path = kb["path"]
    if not os.path.exists(db_path):
        return {"error": f"Knowledge base at '{db_path}' does not exist"}
    try:
        # Load the Chroma database
        db = Chroma(persist_directory=db_path, embedding_function=embeddings)
        # Search for similar documents
        results = db.similarity_search_with_score(query, k=k)
        # Format results
        formatted_results = []
        for i, (doc, score) in enumerate(results):
            # Format document
            result = {
                "document_id": i + 1,
                "relevance_score": float(score),
                "content": doc.page_content,
                "metadata": doc.metadata
            }
            # Add special handling for issues
            if doc.metadata.get('type') == 'issue':
                issue_number = doc.metadata.get('number')
                repo = doc.metadata.get('repo')
                result["issue_number"] = issue_number
                result["repo"] = repo
                result["github_url"] = doc.metadata.get('github_url')
                result["comments_count"] = doc.metadata.get('comments_count', 0)
                # Automatically fetch comments if available
                if doc.metadata.get('comments_count', 0) > 0 and repo and issue_number:
                    print(f"  Fetching {doc.metadata.get('comments_count')} comments for {repo}#{issue_number}")
                    comments_result = fetch_github_comments(repo, issue_number)
                    result["comments"] = comments_result
            formatted_results.append(result)
        return {
            "results_count": len(formatted_results),
            "results": formatted_results
        }
    except Exception as e:
        return {"error": f"Error searching knowledge base: {str(e)}"}

def fetch_github_comments(repo, issue_number):
    """Fetch comments for a GitHub issue."""
    # Check cache first
    cache_file = f"comment_cache/{repo.replace('/', '_')}_issue_{issue_number}_comments.json"
    if os.path.exists(cache_file):
        try:
            with open(cache_file, 'r') as f:
                comments = json.load(f)
            return {
                "source": "cache",
                "comments_count": len(comments),
                "comments": comments
            }
        except Exception:
            # Continue to fetch if cache read fails
            pass
    # Fetch from GitHub API
    url = f"https://api.github.com/repos/{repo}/issues/{issue_number}/comments"
    headers = {"Accept": "application/vnd.github.v3+json"}
    # Add GitHub token if available
    if os.environ.get("GITHUB_TOKEN"):
        headers["Authorization"] = f"token {os.environ.get('GITHUB_TOKEN')}"
    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            comments = response.json()
            # Cache the comments
            with open(cache_file, 'w') as f:
                json.dump(comments, f, indent=2)
            # Format the comments
            formatted_comments = []
            for comment in comments:
                formatted_comments.append({
                    "id": comment.get("id"),
                    "user": comment.get("user", {}).get("login", "anonymous"),
                    "created_at": comment.get("created_at"),
                    "body": comment.get("body", "")
                })
            return {
                "source": "github_api",
                "comments_count": len(formatted_comments),
                "comments": formatted_comments
            }
        elif response.status_code == 403 and "rate limit exceeded" in response.text.lower():
            return {
                "error": "GitHub API rate limit exceeded",
                "suggestion": "Set a GITHUB_TOKEN environment variable to increase rate limits"
            }
        else:
            return {
                "error": f"Error fetching comments: HTTP {response.status_code}",
                "response": response.text
            }
    except Exception as e:
        return {"error": f"Exception fetching comments: {str(e)}"}

def execute_tool(tool_name, tool_input):
    """Execute the appropriate tool based on the name and input."""
    if tool_name == "search_weave_issues":
        return search_vector_store("weave_issues", tool_input["query"])
    elif tool_name == "search_weave_docs":
        return search_vector_store("weave_docs", tool_input["query"])
    elif tool_name == "search_wandb_issues":
        return search_vector_store("wandb_issues", tool_input["query"])
    elif tool_name == "search_wandb_docs":
        return search_vector_store("wandb_docs", tool_input["query"])
    else:
        return {"error": "Unknown tool requested"}

def get_system_prompt():
    """Get the system prompt for Claude."""
    return """You are an AI assistant for Weights & Biases (wandb) and Weave.

You have access to several searchable knowledge bases:
1. Weave Issues - Contains GitHub issues related to Weave
2. Weave Documentation - Contains documentation for Weave
3. WandB Issues - Contains GitHub issues related to Weights & Biases
4. WandB Documentation - Contains documentation for Weights & Biases

To assist the user effectively:
1. Analyze their question to determine if it's about Weave or WandB (or both)
2. Determine if they're asking about documentation or having an issue
3. Search the appropriate knowledge base(s) using the search tools
4. If an issue looks relevant but lacks context, fetch its GitHub comments
5. Synthesize the retrieved information to provide a detailed answer
6. Make sure to provide the full links to the github issue so the user can investigate further
7. For doc links- DO NOT hallucinate links ---> just provide a verbatim repeat of the docs you think could be helpful

For code-related queries or error messages, search the issues database first. For "how to" questions, search the documentation database first.

Example workflows:
- For "How do I track experiments in wandb?": Search wandb_docs
- For "Weave authentication failing": Search weave_issues, then fetch comments if needed
- For a query mentioning an error message: Search the relevant issues database with the error message

Aim to provide comprehensive answers based on the most relevant retrieved documents. If you don't find relevant information in your first search, try different search queries or additional knowledge bases.
"""

@weave.op
def claude_agentic_rag(user_query):
    """
    Run the agentic RAG system using Claude with tools.
    Args:
        user_query (str): The user's query
    Returns:
        dict: The complete response, including search results and Claude's answer
    """
    # Check for API key
    if not CLAUDE_API_KEY or client is None:
        raise ValueError("ANTHROPIC_API_KEY environment variable is not set")
    system_prompt = get_system_prompt()
    print(f"\n🔍 Processing query: '{user_query}'")
    print("-" * 80)
    # Initial request with the user's prompt
    response = client.messages.create(
        model=MODEL,
        max_tokens=4000,
        system=system_prompt,
        thinking={"type": "enabled", "budget_tokens": 2000},
        tools=TOOLS,
        messages=[{"role": "user", "content": user_query}]
    )
    # Display thinking
    thinking_blocks = [b for b in response.content if b.type == "thinking"]
    for block in thinking_blocks:
        print("\n🧠 THINKING:")
        print(block.thinking[:300] + "..." if len(block.thinking) > 300 else block.thinking)
    # Process tool use if needed
    conversation = [{"role": "user", "content": user_query}]
    search_results = []
    # We might need multiple tool calls, so loop until we get a final answer
    while response.stop_reason == "tool_use":
        tool_block = next((b for b in response.content if b.type == "tool_use"), None)
        if tool_block:
            # Show which tool was selected
            print(f"\n🔧 USING TOOL: {tool_block.name}")
            print(f"Tool input: {json.dumps(tool_block.input, indent=2)}")
            # Execute the appropriate tool
            tool_result = execute_tool(tool_block.name, tool_block.input)
            print(f"Tool found {tool_result.get('results_count', 0)} results" if 'results_count' in tool_result else "Tool completed")
            # Save search results for return value
            search_results.append({
                "tool": tool_block.name,
                "input": tool_block.input,
                "result": tool_result
            })
            # Save assistant's response (thinking + tool use)
            assistant_blocks = thinking_blocks + [tool_block]
            conversation.append({"role": "assistant", "content": assistant_blocks})
            # Add tool result to conversation
            conversation.append({
                "role": "user",
                "content": [{
                    "type": "tool_result",
                    "tool_use_id": tool_block.id,
                    "content": json.dumps(tool_result)
                }]
            })
            # Get next response
            response = client.messages.create(
                model=MODEL,
                max_tokens=4000,
                system=system_prompt,
                thinking={"type": "enabled", "budget_tokens": 2000},
                tools=TOOLS,
                messages=conversation
            )
            # Update thinking blocks for next iteration
            thinking_blocks = [b for b in response.content if b.type == "thinking"]
            for block in thinking_blocks:
                print("\n🧠 ADDITIONAL THINKING:")
                print(block.thinking[:300] + "..." if len(block.thinking) > 300 else block.thinking)
    # Get final text response
    final_text = ""
    for block in response.content:
        if block.type == "text":
            final_text += block.text
    print("\n" + "=" * 80)
    print("ANSWER:")
    print(final_text)
    print("=" * 80)
    # Return the complete result with all context
    return {
        "query": user_query,
        "answer": final_text,
        "search_results": search_results,
        "conversation_history": conversation
    }

def interactive_mode():
    """Run the Claude agentic RAG system in interactive mode."""
    print("\n" + "=" * 80)
    print("CLAUDE 3.7 AGENTIC RAG SYSTEM")
    print("=" * 80)
    print("Type 'exit' or 'quit' to exit.")
    while True:
        print("\nYour question: ", end="")
        user_query = input().strip()
        # Check for exit command
        if user_query.lower() in ['exit', 'quit']:
            return
        # Skip empty queries
        if not user_query:
            print("Please enter a question.")
            continue
        print("\n" + "-" * 80)
        print("Processing your query...")
        try:
            # Process the query using Claude
            claude_agentic_rag(user_query)
        except Exception as e:
            print(f"❌ Error: {str(e)}")

if __name__ == "__main__":
    # Check for API keys
    if not os.environ.get("ANTHROPIC_API_KEY"):
        api_key = input("Please enter your Anthropic API key: ").strip()
        if api_key:
            os.environ["ANTHROPIC_API_KEY"] = api_key
            CLAUDE_API_KEY = api_key
            client = Anthropic(api_key=api_key)  # Create the client with the provided key
        else:
            print("❌ No Anthropic API key provided. Exiting.")
            sys.exit(1)
    # Check GitHub token
    if not os.environ.get("GITHUB_TOKEN"):
        print("\n⚠️ No GitHub token found. You may hit rate limits when fetching comments.")
        print("To set a token: export GITHUB_TOKEN=your_token_here\n")
    # Run the interactive mode
    interactive_mode()
Our implementation utilizing Claude 3.7 with tool use enables the system to dynamically decide how to retrieve relevant information from multiple knowledge bases without requiring manual tuning. By integrating Weave, we can also monitor how Claude selects retrieval tools, track its reasoning process, and improve system performance iteratively. Here’s a breakdown of the core components of our system.

Vector search and tool-based retrieval

At the heart of the system is the vector search implementation, which allows Claude to retrieve relevant documents using semantic search rather than simple keyword matching. We use ChromaDB for efficient vector storage and OpenAI's text-embedding-3-small model to generate vector representations. The function search_vector_store handles queries by:
  1. Identifying the relevant knowledge base (e.g., Weave issues, Weave docs, Weights & Biases issues, Weights & Biases docs).
  2. Performing a similarity search against the stored vector database.
  3. Returning the top results, including metadata like GitHub issue numbers, associated repositories, and comments.
If a search result is a GitHub issue, we automatically fetch comments using the fetch_github_comments function, allowing the system to pull additional context that might contain solutions or relevant discussions.

Claude 3.7 tool selection

Claude simplifies our system because it can dynamically choose the right tool based on the user's query. The system is given access to four retrieval tools:
  • search_weave_issues: Queries Weave-related GitHub issues.
  • search_weave_docs: Searches Weave documentation.
  • search_wandb_issues: Retrieves issues related to Weights & Biases.
  • search_wandb_docs: Looks up relevant sections of the WandB documentation.
The execute_tool function runs the correct retrieval function based on Claude’s decision. This removes the need for static retrieval logic, allowing Claude to reason about which knowledge base to search.

Weave for monitoring and optimization

One challenge with tool-augmented retrieval systems is understanding how they make decisions and whether those decisions lead to optimal results. We use Weave to monitor which tools Claude selects for each query, what reasoning it provides for its selections, and how retrieved results influence the final response.
Each function involved in retrieval is wrapped in @weave.op, meaning all interactions can be logged, visualized, and analyzed. This is critical for debugging and improving system behavior—if Claude consistently selects the wrong tool for certain types of queries, we can adjust the system prompt or modify retrieval heuristics. Here's a screenshot of what we see inside Weave after running our script:


Dynamic query processing and iteration

Unlike traditional RAG systems, which retrieve documents in a single-step process, Claude can:
  1. Analyze a query and determine if additional context is needed.
  2. Search multiple knowledge bases if uncertainty exists.
  3. Refine search queries or ask follow-up questions.
Weave logs these multi-step interactions, allowing us to see when query reformulation is necessary or when search strategies need improvement.

Real-world use cases and alternative applications

Agentic RAG provides significant advantages over traditional retrieval systems, especially in knowledge-intensive fields such as legal research and financial analysis. Unlike static retrieval methods that rely on pre-indexed vector databases, agentic systems actively query real-time sources—including websites, structured databases, and non-vectorized documents. This reduces manual updates and ensures responses reflect the most current available information.

Legal research: More accurate and adaptive case law retrieval

Agentic RAG enables comprehensive legal research by:
  • Querying external databases for case law, statutes, and regulatory documents that may not yet be indexed.
  • Handling complex, criteria-based searches (e.g., lawsuits involving specific financial thresholds or settlements).
  • Refining results through iterative query reformulation, reducing irrelevant results and minimizing hallucinations.
For legal professionals, this means higher accuracy, greater adaptability, and better decision-making when researching complex legal matters.

Financial analysis: Real-time market insights

Agentic RAG enhances financial analysis by:
  • Accessing real-time data from external financial sources, earnings reports, and economic indicators.
  • Dynamically adjusting queries to capture nuanced numerical parameters (e.g., market events, large transactions).
  • Reducing inaccuracies through self-correction and adaptive retrieval strategies.
This allows analysts to make better-informed, real-time financial decisions in fast-changing environments.

Challenges and mitigation strategies in agentic RAG systems

1. Increased latency due to complex retrieval

Since agentic RAG retrieves from multiple sources, response times can be slower than a static LLM lookup.
Mitigation strategies:
  • Implement caching mechanisms for frequently queried data.
  • Optimize parallel processing to run multiple retrievals simultaneously.
  • Prioritize faster sources before expanding the search scope.
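For example, the first two of these mitigations can be layered on the search_vector_store() helper from the earlier script: an in-memory cache for repeated queries and a parallel fan-out across knowledge bases. The cache size and worker count below are arbitrary placeholders:
```python
# Sketch only: caching plus parallel retrieval, assuming search_vector_store()
# from the earlier querying script is defined in the same module.
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache

@lru_cache(maxsize=256)
def cached_search(query, db_path, k=3):
    # lru_cache needs hashable arguments, so results are returned as a tuple.
    return tuple(search_vector_store(query, db_path, k=k))

def parallel_search(query, db_paths, k=3):
    """Query every knowledge base concurrently instead of one after another."""
    with ThreadPoolExecutor(max_workers=len(db_paths)) as pool:
        futures = [pool.submit(cached_search, query, path, k) for path in db_paths]
        return [hit for future in futures for hit in future.result()]
```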

2. Reliability in tool selection

Selecting the wrong retrieval tool can lead to irrelevant or incomplete responses.
Mitigation strategies:
  • Monitor tool selection behavior using Weave to identify systematic errors and make real-time adjustments.
  • Introduce confidence thresholds to expand searches if low-relevance results are detected.
  • Suggest alternative queries to guide users toward more precise retrieval.
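One way to implement a confidence threshold is sketched below, again assuming the search_vector_store() helper and Chroma's distance scores, where lower means more relevant (as noted in the retrieval code above); the 0.6 cutoff is an arbitrary placeholder to tune empirically:
```python
# Sketch only: widen the search when the routed knowledge bases return only
# weak matches. Assumes search_vector_store() from the earlier querying script.
def search_with_fallback(query, routed_paths, all_paths, k=3, max_distance=0.6):
    results = [hit for path in routed_paths
               for hit in search_vector_store(query, path, k=k)]
    best = min((score for _, score in results), default=float("inf"))
    if best > max_distance:  # Low confidence: expand to every knowledge base
        remaining = [path for path in all_paths if path not in routed_paths]
        results += [hit for path in remaining
                    for hit in search_vector_store(query, path, k=k)]
    return sorted(results, key=lambda pair: pair[1])[:k]
```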

3. Ethical concerns: Privacy and bias risks

Interacting with real-world databases introduces risks of data exposure and bias amplification.
Mitigation strategies:
  • Introduce a privacy agent to enforce strict access controls, flag sensitive information before processing, and ensure compliance with privacy regulations (e.g., GDPR, HIPAA)
  • Cross-reference multiple sources to reduce misinformation risks
  • Maintain audit logs for transparency and accountability
By embedding privacy-focused safeguards, agentic RAG can ensure security and trustworthiness in automated retrieval.

Conclusion

Retrieval-augmented generation has evolved from a simple static retrieval process into a more dynamic and adaptive system through agentic RAG architectures. While traditional RAG provides a single-step lookup mechanism, agentic RAG systems introduce multi-step reasoning, adaptive query decomposition, and modular retrieval agents that improve response accuracy and scalability.
Single-agent systems provide efficiency and simplicity for structured retrieval, while multi-agent architectures offer greater flexibility for complex queries. By integrating tools like Claude 3.7 and monitoring frameworks such as Weave, retrieval systems can refine searches, expand knowledge sources, and improve accuracy in real time. As AI-driven retrieval advances, agentic RAG systems will enhance industries like legal research and finance by streamlining information access.
Iterate on AI agents and models faster. Try Weights & Biases today.