Agentic RAG: Enhancing retrieval-augmented generation with AI agents
This article explores how agentic RAG enhances retrieval-augmented generation by using AI agents to dynamically refine search strategies, coordinate multiple data sources, and improve response accuracy.
Retrieval-augmented generation (RAG) enhances AI systems by combining pre-trained knowledge with external information retrieval, allowing models to generate more accurate and contextually relevant responses. Traditional RAG operates in two steps: it retrieves relevant documents based on a query and integrates them into a language model’s response. While effective for straightforward lookups, this static retrieval approach struggles with complex reasoning, multi-step queries, and ambiguous requests, limiting its ability to handle more sophisticated tasks.
Agentic RAG addresses these challenges by introducing intelligent agents that actively manage retrieval, refine search strategies, and coordinate multiple information sources. Instead of a fixed retrieval process, these agents dynamically adapt based on the query, improving precision and scalability. By leveraging multi-agent architectures, real-time tool use, and monitoring frameworks like Weave, Agentic RAG enables AI systems to process complex queries more efficiently.
This article explores Agentic RAG, comparing single-agent and multi-agent architectures and analyzing their strengths, trade-offs, and practical applications in enhancing AI-driven retrieval.

Monitoring agent tool use with W&B Weave.
What is retrieval-augmented generation?
Vanilla RAG works in two steps: retrieval and generation. When a user submits a query, it is converted into an embedding (a numerical representation of its meaning). This embedding is compared against a vector database of preprocessed document embeddings to find the most relevant matches. The system retrieves these documents using similarity scores.
The retrieved documents are then passed to the generation module (an LLM), which incorporates them into its response. This allows the model to generate answers that blend its pre-trained knowledge with up-to-date external information.
Vanilla RAG performs this retrieval once per query, meaning it does not refine or iterate on its results. While effective for simple lookups, it struggles with complex reasoning or multi-step queries, which is where Agentic RAG improves by adding iterative retrieval and autonomous query handling.
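To make the two-step flow concrete, here is a minimal sketch of a single-pass RAG loop using the same stack the tutorial below relies on (OpenAI embeddings, Chroma, and GPT-4o via LiteLLM). The documents and query are illustrative placeholders, and the snippet assumes an OPENAI_API_KEY is set.
```python
# Minimal sketch of vanilla (single-pass) RAG: retrieve once, then generate.
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from litellm import completion

# Placeholder corpus standing in for real documentation.
docs = [
    Document(page_content="wandb.log() records metrics such as loss and accuracy per step."),
    Document(page_content="weave.op() decorates a function so its calls are traced in Weave."),
]

# Step 1: embed the documents and index them in a vector store.
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
db = Chroma.from_documents(docs, embedding=embeddings)

# Step 2: retrieve the closest matches for the query, then generate with that context.
query = "How do I log metrics during training?"
retrieved = db.similarity_search(query, k=2)
context = "\n".join(d.page_content for d in retrieved)

response = completion(
    model="openai/gpt-4o",
    messages=[
        {"role": "system", "content": f"Answer using this context:\n{context}"},
        {"role": "user", "content": query},
    ],
)
print(response.choices[0].message.content)
```
Note that retrieval happens exactly once here; nothing in the loop re-examines or refines the results, which is the limitation agentic RAG targets.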
Traditional RAG vs. Agentic RAG
As we noted above, traditional RAG retrieves information once per query from a static, preprocessed database and incorporates it directly into the response. This approach, while effective for simple lookups, struggles when queries are complex, ambiguous, or require multiple steps to fully resolve. It cannot adapt its retrieval strategy, refine its results, or integrate information from multiple diverse sources.
Agentic RAG addresses these limitations by introducing intelligent agents that actively manage and enhance the retrieval process. Rather than relying solely on pre-indexed data, these agents can dynamically interact with various tools and databases—including real-time sources like websites, structured databases accessed via advanced queries (e.g., SQL), and external APIs. The agents adapt their retrieval strategy based on the nature of the query, performing iterative retrieval, refining their searches through self-reflection, and adjusting their approach as needed. This ability to reason autonomously, coordinate across multiple information sources, and continuously refine both the queries and retrieval methods significantly improves accuracy, reduces hallucinations, and ensures highly relevant and contextually informed responses.
Agentic RAG architecture: Single-agent vs. multi-agent systems
Agentic RAG architectures rely on LLM-powered agents to retrieve and process information from diverse sources. These systems can be structured as single-agent or multi-agent, each offering trade-offs in efficiency, modularity, and scalability.
This section explores single-agent and multi-agent architectures, detailing their capabilities, trade-offs, and use cases for retrieval-augmented generation (RAG) systems.
Single-agent systems: Simplicity with limitations

In a single-agent system, one agent manages the entire retrieval pipeline. It:
- Processes the query
- Determines relevant data sources
- Retrieves necessary information
- Integrates results before passing them to the LLM
This approach is straightforward and efficient but has limitations as retrieval tasks become more complex. A single agent must handle:
- Multiple retrieval strategies (e.g., vector search, keyword matching)
- Diverse query formats (e.g., structured vs. unstructured data)
- Different ranking mechanisms to prioritize relevant information
As retrieval complexity grows, a single agent can become a bottleneck, making the system harder to scale and maintain.
Single-agent systems for query routing
Despite these challenges, single-agent architectures can still be highly effective for structured and relatively simple retrieval tasks. They function as centralized query routers, efficiently managing query processing and response generation.
Key capabilities of a single-agent RAG system include:
- Query pre-processing: Refining and restructuring queries before retrieval
- Multi-step retrieval: Iteratively refining searches and synthesizing multiple retrieval passes
- Validation & ranking: Filtering out low-confidence sources and cross-referencing retrieved data
- External tool integration: Connecting with vector databases, APIs, web search modules, and structured knowledge bases
When to use single-agent RAG:
- Tasks with consistent query structures (e.g., structured database lookups)
- Low-latency applications where fast, centralized decision-making is needed
- Systems that don’t require multi-step reasoning or complex cross-referencing
For more complex retrieval needs, however, a multi-agent system offers a scalable alternative.
Multi-agent systems: Specialization and scalability

A multi-agent system overcomes single-agent limitations by distributing retrieval tasks across specialized agents, each optimized for a specific retrieval method or data type. Instead of a single agent managing all retrieval logic, tasks are delegated, enabling a modular and scalable retrieval framework.
For example:
- Vector search agents retrieve information from vector databases using semantic search
- Web search agents fetch real-time data from external sources
- Structured data agents query databases, enterprise records, and email systems
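As a rough illustration of this division of labor, a multi-agent setup can be reduced to a coordinator that fans a query out to specialized retrieval agents and merges their results. The agent functions below are hypothetical stubs, not part of the tutorial that follows.
```python
# Hypothetical sketch of a multi-agent retrieval coordinator.
# Each "agent" is a stub standing in for a specialized retriever.
from concurrent.futures import ThreadPoolExecutor

def vector_search_agent(query: str) -> list[str]:
    return [f"[vector] semantic match for '{query}'"]

def web_search_agent(query: str) -> list[str]:
    return [f"[web] real-time result for '{query}'"]

def structured_data_agent(query: str) -> list[str]:
    return [f"[sql] table rows relevant to '{query}'"]

AGENTS = [vector_search_agent, web_search_agent, structured_data_agent]

def coordinate(query: str) -> list[str]:
    """Fan the query out to every specialized agent in parallel and merge their results."""
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(lambda agent: agent(query), AGENTS))
    merged = [item for agent_results in results for item in agent_results]
    # A real coordinator would rank and deduplicate here before handing context to the LLM.
    return merged

print(coordinate("lawsuits with settlements above $10M in 2023"))
```
New agents can be appended to the list without touching the coordinator, which is the modularity advantage discussed next.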
Advantages of multi-agent systems
- Improved scalability: New retrieval agents can be added without disrupting the existing system
- Better optimization: Each agent is tailored to specific data types, improving retrieval accuracy
- Easier debugging & development: Changes can be made to individual agents without affecting the entire pipeline
However, multi-agent systems also introduce challenges:
- Coordination overhead: Managing multiple agents requires sophisticated query routing & aggregation techniques
- Higher computational costs: Multiple agent calls increase LLM processing overhead, impacting performance
Multi-agent systems for advanced reasoning and coordination
Multi-agent architectures excel in handling complex retrieval workflows, particularly when dealing with:
- Cross-referencing multiple data sources for more informed responses
- Multi-step reasoning tasks, where retrieved information must be iteratively refined
- Adaptive retrieval, where agents adjust their retrieval strategies based on query complexity
By collaborating in a structured manner, these agents refine search strategies iteratively, ensuring that responses are more accurate, contextually informed, and explainable.
When to use multi-agent RAG:
- High-stakes retrieval tasks (e.g., legal research, financial analytics)
- Dynamic queries requiring real-time cross-referencing
- Knowledge-intensive applications needing multi-source verification
While multi-agent systems increase complexity, their ability to process and refine information dynamically makes them powerful for advanced reasoning tasks.
Beyond single vs. multi-agent: Other RAG architectures
Beyond these architectures, other variations of RAG exist, such as:
- Modular RAG, where retrieval components are designed to be plug-and-play for different data sources
For a more comprehensive overview of these architectures, refer to Agentic Retrieval-Augmented Generation: A Survey on Agentic RAG by Aditi Singh et al. This paper provides detailed insights and diagrams contextualizing the broader RAG landscape.
I would like to personally thank the authors of the paper for granting me permission to use their excellent diagrams in the next few sections.
Step-by-step tutorial: Building a single-agent RAG system
In this tutorial, we will implement a single-agent retrieval-augmented generation (RAG) system designed to search across repositories for two key tools from Weights & Biases (W&B):
- W&B Experiment Tracking: A framework for tracking, visualizing, and managing machine learning experiments. It helps practitioners monitor model performance, manage hyperparameters, and organize datasets throughout the development lifecycle.
- W&B Weave: A toolkit for developers building applications powered by large language models (LLMs). It provides programmatic tools to track, evaluate, and debug LLM prompts and interactions, making it easier to refine workflows.
By integrating retrieval across both tools, our system will provide relevant information for:
- General ML experimentation (via W&B documentation and issue tracking)
- LLM development and debugging (via Weave documentation and discussions)
Understanding user queries in software engineering
User queries typically fall into two categories:
- "How-to" and API usage questions: Users need guidance on using a feature, function, or API endpoint
- Troubleshooting and issue resolution: Users report errors, unexpected behaviors, or bugs
To address these queries, our single-agent RAG system will:
- Classify whether the query relates to API usage or debugging
- Search documentation for instructional queries
- Retrieve relevant GitHub issues for troubleshooting queries
- Generate responses using GPT-4o, incorporating additional GitHub discussions if needed
Overview of the implementation
Our system will follow these four key steps:
1. Collect and process data
- Clone the latest versions of the W&B and Weave repositories
- Extract documentation (Markdown files) and clean them for indexing
- Fetch GitHub issues, pulling titles, descriptions, and metadata (excluding comments for now)
2. Convert text into embeddings
- Use OpenAI’s text-embedding-3-small model to generate vector embeddings for all documents
- Store the embeddings in ChromaDB, a high-performance vector database
- Differentiate between documentation and issue reports to improve search relevance
3. Implement query classification and retrieval
- Determine whether a user query is related to API documentation or debugging
- If API-related, search documentation embeddings for relevant results
- If related to issues, retrieve and rank GitHub issues
- Fetch additional GitHub comments if further context is required
4. Generate responses with GPT-4o
- Use the retrieved results as context for GPT-4o to generate a response
- Ensure that responses reference the original documentation or issue discussions to maintain transparency
Next step: Writing the script
In the next section, we write the actual implementation, starting with data collection and vectorization:
import os
import glob
import re
import requests
import time
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

# GitHub repository details
REPOS = {
    "weave": "wandb/weave",
    "wandb": "wandb/docs",  # For documentation
}

# GitHub repositories for issues
ISSUE_REPOS = {
    "weave": "wandb/weave",
    "wandb": "wandb/wandb",  # Correct repo for WandB issues
}

# GitHub issues URL
ISSUES_ENDPOINT = "/issues?state=all&per_page=100"

# Initialize OpenAI Embeddings
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Function to download repo docs using manual git commands
def download_docs(repo_name, repo_url, path):
    if os.path.exists(path):
        print(f"{repo_name} documentation already exists, skipping download.")
        return
    print(f"Downloading {repo_name} documentation...")
    # Simple git clone without sparse-checkout (more reliable)
    clone_cmd = f"git clone --depth 1 https://github.com/{repo_url}.git {path}"
    result = os.system(clone_cmd)
    if result != 0 or not os.path.exists(path):
        print(f"❌ Failed to clone {repo_name}. Check if the repository URL is correct.")
        return
    print(f"✅ {repo_name} documentation downloaded successfully.")

# Function to clean markdown content
def clean_markdown(text):
    if not text:
        return ""
    text = re.sub(r'!\[.*?\]\(.*?\)', '', text)  # Remove images
    text = re.sub(r'\[([^\]]+)\]\((.*?)\)', r'\1', text)  # Keep link text but remove URLs
    return text.strip()

# Function to fetch GitHub issues without comments
def fetch_issues(repo_name, repo_url, max_pages=5):
    all_issues = []
    page = 1
    # Paginate through the results
    while page <= max_pages:
        url = f"https://api.github.com/repos/{repo_url}{ISSUES_ENDPOINT}&page={page}"
        headers = {"Accept": "application/vnd.github.v3+json"}
        try:
            print(f"Fetching page {page} of issues from {repo_name}...")
            response = requests.get(url, headers=headers)
            if response.status_code == 200:
                issues = response.json()
                # If no more issues are returned, break the loop
                if not issues:
                    break
                # Process each issue without fetching comments
                for issue in issues:
                    # Add basic issue data to the list
                    all_issues.append({
                        "id": issue["id"],
                        "number": issue["number"],
                        "title": issue["title"],
                        "body": issue.get("body", ""),
                        "state": issue["state"],
                        "created_at": issue["created_at"],
                        "comments_count": issue.get("comments", 0),
                        "github_url": issue["html_url"],  # URL to the GitHub issue page
                        "comments_url": issue["comments_url"],  # Direct API URL for comments
                        "repo": repo_url  # Store the repo info for later reference
                    })
                print(f"Fetched {len(issues)} issues from {repo_name} (page {page})")
                page += 1
                time.sleep(1)  # Small delay between pages
            elif response.status_code == 403 and "rate limit exceeded" in response.text.lower():
                print("⚠️ GitHub API rate limit exceeded. Waiting for 60 seconds...")
                time.sleep(60)  # Wait for rate limit to reset
                continue  # Retry the same page
            else:
                print(f"Error fetching issues from {repo_name}: {response.status_code}")
                break
        except Exception as e:
            print(f"Exception fetching issues for {repo_name}: {e}")
            break
    print(f"Total issues fetched from {repo_name}: {len(all_issues)}")
    return all_issues

# Function to create documents from issues (without comments)
def create_issue_documents(issues):
    docs = []
    for issue in issues:
        # Create a document for the main issue
        issue_text = f"Title: {issue['title']}\nIssue #{issue['number']}\nState: {issue['state']}\nComments: {issue['comments_count']}\n"
        # Add a note about comments and where to find them
        if issue['comments_count'] > 0:
            issue_text += f"\nThis issue has {issue['comments_count']} comments. View at: {issue['github_url']}\n"
        issue_text += f"\nBody:\n{issue['body']}"
        # Create the document object with rich metadata
        docs.append(Document(
            page_content=clean_markdown(issue_text),
            metadata={
                "id": issue["id"],
                "number": issue["number"],
                "created_at": issue["created_at"],
                "state": issue["state"],
                "comments_count": issue["comments_count"],
                "github_url": issue["github_url"],
                "comments_url": issue["comments_url"],
                "repo": issue["repo"],
                "type": "issue"
            }
        ))
    return docs

# Function to process Markdown docs
def create_markdown_documents(repo_name, doc_path):
    if not os.path.exists(doc_path):
        print(f"⚠️ {doc_path} does not exist, skipping doc processing.")
        return []
    # Find markdown files recursively
    md_files = glob.glob(f'{doc_path}/**/*.md', recursive=True)
    if not md_files:
        print(f"No markdown files found in {doc_path}.")
        return []
    docs = []
    for md_file in md_files:
        try:
            with open(md_file, 'r', encoding='utf-8') as f:
                content = clean_markdown(f.read())
            # Create a proper Langchain Document object
            docs.append(Document(
                page_content=content,
                metadata={
                    "file": md_file,
                    "type": "documentation"
                }
            ))
        except Exception as e:
            print(f"Error processing {md_file}: {e}")
    return docs

# Function to query the vector store
def search_docs(query, db_directory, k=5):
    """Search a Chroma vector store for documents similar to the query."""
    try:
        # Load the Chroma database
        db = Chroma(persist_directory=db_directory, embedding_function=embeddings)
        # Search for similar documents
        results = db.similarity_search_with_score(query, k=k)
        if not results:
            print(f"No results found for '{query}'")
            return []
        print(f"Found {len(results)} results for '{query}':")
        for i, (doc, score) in enumerate(results):
            print(f"\nResult {i+1}:")
            print(f"Score: {score}")
            # Print metadata
            print("Metadata:")
            for key, value in doc.metadata.items():
                if key in ["id", "number", "state", "comments_count", "type"]:
                    print(f"  {key}: {value}")
            # For issues with comments, show the GitHub URL
            if doc.metadata.get("type") == "issue" and doc.metadata.get("comments_count", 0) > 0:
                print(f"  GitHub URL: {doc.metadata.get('github_url')}")
            # Print content (truncated if too long)
            content = doc.page_content
            if len(content) > 500:
                content = content[:500] + "... [truncated]"
            print(f"\nContent: {content}\n")
            print("-" * 40)
        return results
    except Exception as e:
        print(f"Error searching database: {e}")
        return []

# Main execution
if __name__ == "__main__":
    # Create directory for vector stores
    os.makedirs("db", exist_ok=True)
    for repo_name, repo_url in REPOS.items():
        print(f"\nProcessing {repo_name.upper()}...")
        # Fetch issues from the correct repository
        issue_repo_url = ISSUE_REPOS[repo_name]
        print(f"Fetching issues from {issue_repo_url}...")
        issues = fetch_issues(repo_name, issue_repo_url)
        # Download docs
        doc_path = f"{repo_name}_docs"
        download_docs(repo_name, repo_url, doc_path)
        # Process issues
        if issues:
            print(f"Creating vector store for {repo_name} issues...")
            issue_docs = create_issue_documents(issues)
            # Create Chroma vector store for issues
            issues_db = Chroma.from_documents(
                documents=issue_docs,
                embedding=embeddings,
                persist_directory=f"db/{repo_name}_issues"
            )
            print(f"✅ Saved {repo_name} issues vector store ({len(issue_docs)} documents)")
        else:
            print("No issues to process.")
        # Process docs
        print(f"Creating vector store for {repo_name} documentation...")
        md_docs = create_markdown_documents(repo_name, doc_path)
        if md_docs:
            # Create Chroma vector store for documentation
            docs_db = Chroma.from_documents(
                documents=md_docs,
                embedding=embeddings,
                persist_directory=f"db/{repo_name}_docs"
            )
            print(f"✅ Saved {repo_name} documentation vector store ({len(md_docs)} documents)")
        else:
            print("No documentation to process.")
    print("\n✅ Data ingestion completed successfully.")
    # Example search to verify everything works
    print("\nTesting search functionality...")
    search_docs("authentication issues", "db/weave_issues", k=2)
With the vector database in place, the system will be ready to process user queries, retrieving relevant documentation or past discussions from GitHub issues. The next step will involve implementing the retrieval mechanism that searches the database based on similarity to user queries.
Building the query routing agent
A query-routing agent is essential for optimizing retrieval in our RAG pipeline. Rather than searching all available sources indiscriminately, the routing agent classifies user queries and directs them to the most relevant knowledge bases, improving efficiency and accuracy.
How the query router works
The routing agent performs two key classification tasks:
- Identifying the relevant tool: Does the query relate to W&B Experiment Tracking or Weave?
- Determining query type: Is the user asking for documentation or troubleshooting an issue?
- Documentation queries typically involve user guides, tutorials, and API references
- Troubleshooting queries are best answered by retrieving GitHub discussions, bug reports, and issue resolutions
If a query is ambiguous, the agent returns multiple knowledge bases instead of forcing a single, potentially incorrect choice.
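Concretely, the router emits a small JSON object naming the knowledge bases to search, in the format defined by the router prompt shown later. An ambiguous debugging question that could concern either tool would produce something like:
```json
{"knowledge_bases": ["WandB Issues", "Weave Issues"]}
```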
Enhancing accuracy with code-based signals
To improve classification, the agent leverages syntax and code snippets commonly associated with each tool:
- W&B-related queries often contain wandb.init() or references to logging metrics
- Weave-related queries may include weave.op() or mention function tracking
When these patterns appear in a query, the agent uses them as additional routing signals to refine search accuracy.
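One lightweight way to surface these signals before calling the router LLM is a simple pattern check on the query text. The helper below is a hypothetical pre-processing step, not part of the router prompt itself; the pattern lists are illustrative.
```python
import re

# Hypothetical pre-processing helper: detect library-specific syntax in a query
# and pass the hits to the router as extra routing hints.
SIGNALS = {
    "wandb": [r"\bwandb\.init\(", r"\bwandb\.log\(", r"\bwandb\.config\b"],
    "weave": [r"\bweave\.op\(", r"\bweave\.init\(", r"@weave\.op"],
}

def detect_code_signals(query: str) -> list[str]:
    """Return the tools whose characteristic syntax appears in the query text."""
    hits = []
    for tool, patterns in SIGNALS.items():
        if any(re.search(pattern, query) for pattern in patterns):
            hits.append(tool)
    return hits

query = "Why does wandb.log() stop recording after I call wandb.init() twice?"
print(detect_code_signals(query))  # ['wandb']
```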
Handling ambiguous queries
The agent is designed to intelligently manage vague queries.
For example, if a user asks:
"How do I debug errors in my AI workflow?"
The system cannot determine with certainty whether the issue relates to W&B or Weave. In such cases, the agent returns both relevant knowledge bases to ensure comprehensive retrieval.
Implementation approach
To implement this logic, we define a system prompt that instructs the agent to:
- Analyze the user’s query and classify it accordingly
- Return structured JSON output specifying which knowledge bases to search
- Support multi-source retrieval for ambiguous or overlapping queries
By implementing this routing mechanism, we:
- Reduce unnecessary searches
- Improve response relevance
- Create a structured, efficient retrieval process
Next step: Writing the query routing agent in code
Now, we will implement the query-routing agent in code and integrate it into our retrieval pipeline.
You are a query-routing assistant that determines whether a user's question is related to Weights & Biases (wandb) or Weave and directs them to the most relevant database. Users may ask about documentation, troubleshooting issues, or implementation details for either tool.

Your task is to analyze the query and route it to all relevant sources if uncertainty exists. If a query could apply to multiple areas, return multiple results rather than choosing just one.

The available knowledge bases are:
1. "Weave Issues" - Contains GitHub issues from the wandb/weave repository
2. "Weave Documentation" - Contains documentation for Weave
3. "WandB Issues" - Contains GitHub issues from the wandb/wandb repository
4. "WandB Documentation" - Contains documentation for Weights & Biases

Return a JSON object with the following format:
{"knowledge_bases": ["name_of_kb1", "name_of_kb2"]}

## Library Descriptions for Context

### Weights & Biases (wandb) – Experiment Tracking & ML Workflow Management
Weights & Biases (wandb) is a tool for tracking and visualizing machine learning experiments. It helps researchers and engineers:
* Log and monitor metrics like loss, accuracy, and learning rate.
* Track hyperparameters and model configurations for experiment reproducibility.
* Store and visualize model artifacts and training progress.
* Integrate seamlessly with deep learning frameworks like PyTorch, TensorFlow, and Hugging Face Transformers.
* Collaborate via dashboards that display real-time and historical training runs.
It is widely used in research and production to optimize models, compare runs, and fine-tune hyperparameters efficiently.

### W&B Weave – Generative AI & LLM Debugging Toolkit
Weave is a toolkit designed for developing, debugging, and evaluating Generative AI applications, particularly those using Large Language Models (LLMs). It enables users to:
* Log inputs and outputs from LLMs for better debugging.
* Build evaluations to compare different model responses.
* Organize information across LLM workflows from experimentation to production.
* Track interactions within multi-step AI pipelines to analyze decision-making processes.
Weave is essential for understanding and improving the performance of AI agents, LLM-based chatbots, and retrieval-augmented generation (RAG) systems.

## Routing Logic
* If the user asks about experiment tracking, ML metrics, logging, hyperparameters, dashboards, or integrations, route to W&B Docs.
* If they report an error, unexpected behavior, or API issue related to W&B, route to W&B Issues.
* If the user asks about LLM evaluation, function tracking, Weave UI development, or Weave's role in ML workflows, route to Weave Docs.
* If they report errors related to Weave tracking or data storage, route to Weave Issues.
* If uncertain, include all possibly relevant sources instead of picking only one.

## Code Samples for Context
If the query is vague or unclear, match it against the functionality of each tool using these sample code snippets:

### W&B Experiment Tracking Example (wandb Docs/Issues)
```python
import wandb

wandb.init(project="my_project", config={"learning_rate": 0.001, "epochs": 10})
for epoch in range(wandb.config.epochs):
    train_loss = 0.01 * (wandb.config.epochs - epoch)
    train_accuracy = 0.1 * epoch
    wandb.log({'epoch': epoch, 'train_loss': train_loss, 'train_accuracy': train_accuracy})
```
If the query involves logging, training loops, metrics, dashboards, or hyperparameters, it should likely be routed to W&B Docs, or W&B Issues if errors are involved.

### Weave Function Tracking Example (Weave Docs/Issues)
```python
import weave

weave.init(project_name="my_weave_project")

@weave.op()
def add_numbers(a: int, b: int) -> int:
    return a + b

result = add_numbers(5, 7)
print(f'Result: {result}')
```
If the query involves tracking function calls, logging LLM interactions, debugging AI workflows, or analyzing execution steps, route it to Weave Docs, or Weave Issues if errors occur.

Focus on understanding:
- Is the query about Weave or WandB?
- Is the user looking for documentation or issues/bugs?
- If the query mentions "bug" or "issue" or describes a problem, prioritize the Issues knowledge base
- If the query is asking how to use a feature or understand concepts, prioritize Documentation
- If unclear whether the query is about Weave or WandB, include knowledge bases for both

Examples:
1. For "How do I track experiments in wandb?"
   - Knowledge bases: ["WandB Documentation"]
2. For "Weave authentication failing"
   - Knowledge bases: ["Weave Issues"]
3. For "How to visualize neural networks in Weights & Biases"
   - Knowledge bases: ["WandB Documentation"]
4. For "LLM output tracking shows errors in production"
   - Knowledge bases: ["Weave Issues", "Weave Documentation"]

Always include at least one knowledge base in your response.

Return a JSON object with the following format:
{"knowledge_bases": ["name_of_kb1", "name_of_kb2"]}
Querying our databases
Now, we will implement query retrieval using our vector database and query-routing agent. This process consists of three key steps:
- Query classification: The AI-powered router determines which knowledge bases should be searched
- Vector search: The system retrieves the most relevant results from our Chroma database, ranking them by semantic similarity
- Response generation: GPT-4o synthesizes an answer, ensuring transparency by including references to the retrieved sources
Step 1: Query classification
The query-routing agent classifies user queries and returns a structured JSON response specifying:
- Which knowledge bases to search (W&B or Weave, documentation or issues)
- Multi-source retrieval if the query is ambiguous, maximizing relevance
Rather than searching all sources blindly, this approach improves efficiency and accuracy.
Step 2: Searching the vector database
Once the routing decision is made, the system performs a semantic search using ChromaDB, which contains:
- Embeddings of W&B and Weave documentation
- Historical GitHub issues for troubleshooting queries
If a retrieved document is a GitHub issue, the system performs additional processing to fetch relevant comments and discussions, providing more context and potential solutions.
Step 3: Generating a response
The retrieved content is passed to GPT-4o, which:
- Synthesizes a response based on the most relevant documents
- Includes references to documentation and GitHub issues to ensure transparency
- Handles edge cases: if no relevant results are found, the system:
- Prompts the user for clarification
- Suggests an alternative query to refine the search
By integrating query classification, semantic search, and intelligent response generation, this pipeline ensures accurate, transparent, and contextually relevant answers.
Next step: Writing the code
Now, let's implement the querying logic with the following script:
import sys
import os
import json
import time
import requests
from litellm import completion
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
import weave; weave.init("agentic_rag")

# Create cache directory if it doesn't exist
os.makedirs("comment_cache", exist_ok=True)

# Initialize OpenAI Embeddings
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Configure which databases to search (can be modified to include/exclude databases)
DATABASES_TO_SEARCH = [
    {"name": "Weave Issues", "path": "db/weave_issues"},
    {"name": "Weave Documentation", "path": "db/weave_docs"},
    {"name": "WandB Issues", "path": "db/wandb_issues"},
    {"name": "WandB Documentation", "path": "db/wandb_docs"}
]

def load_router_prompt():
    """Load the router prompt from router_prompt.txt"""
    try:
        with open("router_prompt.txt", "r") as f:
            return f.read()
    except FileNotFoundError:
        print("❌ Error: router_prompt.txt not found.")
        print("Please create this file with your router prompt before running the script.")
        sys.exit(1)

def search_vector_store(query, db_directory, k=5):
    """Search a vector store and return results."""
    try:
        # Load the Chroma database
        db = Chroma(persist_directory=db_directory, embedding_function=embeddings)
        # Search for similar documents
        results = db.similarity_search_with_score(query, k=k)
        return results
    except Exception as e:
        print(f"❌ Error searching database at {db_directory}: {e}")
        return []

def extract_json_from_response(response_text):
    """Extract JSON from response text, handling code blocks."""
    # Try to find JSON in code blocks
    import re
    json_pattern = r'```(?:json)?\s*(\{.*?\})\s*```'
    match = re.search(json_pattern, response_text, re.DOTALL)
    if match:
        # Extract JSON from code block
        json_str = match.group(1)
    else:
        # Assume the entire response is JSON
        json_str = response_text
    # Remove any non-JSON content
    json_str = json_str.strip()
    return json.loads(json_str)

def route_query(user_query):
    """Use LiteLLM to route the query to the appropriate databases."""
    router_prompt = load_router_prompt()
    try:
        # Call LiteLLM with the router prompt
        response = completion(
            model="openai/gpt-4o",
            messages=[
                {"role": "system", "content": router_prompt},
                {"role": "user", "content": user_query}
            ]
        )
        # Extract the content from the response
        router_response = response.choices[0].message.content
        # Parse the JSON response
        try:
            # parsed_response = json.loads(router_response)
            parsed_response = extract_json_from_response(router_response)
            return parsed_response
        except json.JSONDecodeError:
            print("❌ Error: Router didn't return valid JSON. Using all knowledge bases.")
            print(f"Router response: {router_response}")
            return {
                "knowledge_bases": [db["name"] for db in DATABASES_TO_SEARCH],
                "search_query": user_query,
                "reasoning": "Fallback due to JSON parsing error"
            }
    except Exception as e:
        print(f"❌ Error calling LiteLLM: {e}")
        return {
            "knowledge_bases": [db["name"] for db in DATABASES_TO_SEARCH],
            "search_query": user_query,
            "reasoning": f"Fallback due to error: {str(e)}"
        }

def get_kb_by_name(name):
    """Get the knowledge base by name."""
    for db in DATABASES_TO_SEARCH:
        if db["name"] == name:
            return db
    return None

def fetch_comments_for_issue(repo, issue_number):
    # Check cache first
    cache_file = f"comment_cache/{repo.replace('/', '_')}_issue_{issue_number}_comments.json"
    if os.path.exists(cache_file):
        try:
            with open(cache_file, 'r') as f:
                comments = json.load(f)
            print(f"Loaded {len(comments)} comments for issue #{issue_number} from cache")
            return comments
        except Exception as e:
            print(f"Error reading cache: {e}")
            # Continue to fetch if cache read fails
    # Fetch from GitHub API
    url = f"https://api.github.com/repos/{repo}/issues/{issue_number}/comments"
    headers = {"Accept": "application/vnd.github.v3+json"}
    # Add GitHub token if available
    if os.environ.get("GITHUB_TOKEN"):
        headers["Authorization"] = f"token {os.environ.get('GITHUB_TOKEN')}"
    try:
        print(f"Fetching comments for issue #{issue_number} from {repo}...")
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            comments = response.json()
            # Cache the comments
            with open(cache_file, 'w') as f:
                json.dump(comments, f, indent=2)
            print(f"✅ Fetched and cached {len(comments)} comments for issue #{issue_number}")
            return comments
        elif response.status_code == 403 and "rate limit exceeded" in response.text.lower():
            print("⚠️ GitHub API rate limit exceeded. Waiting 60 seconds...")
            time.sleep(60)
            return fetch_comments_for_issue(repo, issue_number)  # Retry after waiting
        else:
            print(f"Error fetching comments: {response.status_code}")
            return []
    except Exception as e:
        print(f"Exception fetching comments: {e}")
        return []

def format_comments_for_context(comments):
    if not comments:
        return ""
    comments_text = "\n\nCOMMENTS:\n"
    for i, comment in enumerate(comments):
        user = comment.get('user', {}).get('login', 'anonymous')
        created_at = comment.get('created_at', 'unknown date')
        body = comment.get('body', '').strip()
        comments_text += f"\n[Comment #{i+1} by {user} on {created_at}]\n{body}\n"
    return comments_text

def generate_response(user_query, relevant_docs, query_analysis):
    """Generate a response using LiteLLM based on the retrieved documents."""
    # Prepare context from documents
    context = ""
    github_links = []
    for i, (doc, score) in enumerate(relevant_docs):
        # Add the document to context
        context += f"--- Document {i+1} (Relevance: {score:.4f}) ---\n"
        # Add source information
        if doc.metadata.get('type') == 'issue':
            issue_num = doc.metadata.get('number', 'Unknown')
            repo = doc.metadata.get('repo', 'Unknown')
            context += f"Source: GitHub Issue #{issue_num} in {repo}\n"
            # If issue has comments, fetch and add them to context
            if doc.metadata.get('comments_count', 0) > 0:
                github_url = doc.metadata.get('github_url')
                if github_url:
                    github_links.append(github_url)
                # Fetch and append comments
                print(f"Fetching comments for issue #{issue_num}...")
                comments = fetch_comments_for_issue(repo, issue_num)
                if comments:
                    comments_text = format_comments_for_context(comments)
                    context += doc.page_content + comments_text + "\n\n"
                    continue  # Skip the standard content addition below
        else:
            context += f"Source: {doc.metadata.get('file', 'Unknown')}\n"
        # Add content (only reached if the issue doesn't have comments or isn't an issue)
        context += doc.page_content
        context += "\n\n"
    # Create the prompt for the response
    response_prompt = f"""You are an AI assistant for Weights & Biases (wandb) and Weave.
You've been given several documents retrieved from a search based on the user's query.
Use the information in these documents to answer the user's question.
If the documents don't contain the necessary information to answer the question, admit that you don't know
rather than making up an answer. If appropriate, suggest what the user might search for instead.

User Query: {user_query}

Retrieved Documents:
{context}

Based on these documents, provide a helpful response to the user's query.
Pay special attention to issue comments as they often contain solutions to the problems described in the issues."""
    try:
        # Call LiteLLM with the response prompt
        response = completion(
            model="openai/gpt-4o",
            messages=[
                {"role": "system", "content": response_prompt},
                {"role": "user", "content": user_query}
            ]
        )
        # Extract the content from the response
        return response.choices[0].message.content
    except Exception as e:
        print(f"❌ Error generating response: {e}")
        return f"I encountered an error while generating a response: {str(e)}"

def agentic_rag(user_query):
    """Main function for the agentic RAG system."""
    print(f"\n🔍 Processing query: '{user_query}'")
    print("-" * 80)
    # Step 1: Route the query
    query_analysis = route_query(user_query)
    print("✅ Query Analysis:")
    print(f"- Selected Knowledge Bases: {', '.join(query_analysis['knowledge_bases'])}")
    # print(f"- Refined Search Query: '{query_analysis['search_query']}'")
    # print(f"- Reasoning: {query_analysis['reasoning']}")
    # Step 2: Search the selected knowledge bases
    all_results = []
    for kb_name in query_analysis["knowledge_bases"]:
        kb = get_kb_by_name(kb_name)
        if kb and os.path.exists(kb["path"]):
            # print(f"\n🔍 Searching '{kb['name']}' for: '{query_analysis['search_query']}'")
            results = search_vector_store(user_query, kb["path"], k=3)
            if results:
                print(f"Found {len(results)} results in {kb['name']}")
                # Print out GitHub URLs for issues with comments
                for doc, _ in results:
                    if doc.metadata.get('type') == 'issue' and doc.metadata.get('comments_count', 0) > 0:
                        github_url = doc.metadata.get('github_url')
                        if github_url:
                            print(f"  - Issue with comments: {github_url}")
                all_results.extend(results)
            else:
                print(f"No results found in {kb['name']}")
        else:
            print(f"⚠️ Knowledge base '{kb_name}' not found or not available.")
    # Step 3: Sort results by score (ascending since lower score = more relevant in Chroma)
    all_results.sort(key=lambda x: x[1])
    # Take top results (up to 5)
    top_results = all_results[:5]
    if not top_results:
        print("❌ No relevant documents found across any knowledge base.")
        return "I couldn't find any relevant information to answer your question. Could you please rephrase or provide more details?"
    # Step 4: Generate a response
    print(f"\n💬 Generating response based on {len(top_results)} documents...")
    response = generate_response(user_query, top_results, query_analysis)
    return response

def interactive_mode():
    """Run the RAG system in interactive mode."""
    print("\n" + "=" * 80)
    print("AGENTIC RAG SYSTEM")
    print("=" * 80)
    print("Type 'exit' to quit.")
    print("To ask a question, type your question (can be multiple lines)")
    print("When finished, press Enter on an empty line to submit.")
    while True:
        print("\nYour question (multi-line, empty line to submit):")
        lines = []
        while True:
            line = input().strip()
            # Check for exit command on single line
            if not lines and line.lower() in ['exit', 'quit']:
                return
            # Empty line finishes input if we already have some content
            if not line and lines:
                break
            # Otherwise add the line to our input
            if line:
                lines.append(line)
        # Combine all lines into a single query
        user_query = "\n".join(lines)
        if not user_query:
            print("Please enter a question.")
            continue
        print("\n" + "-" * 80)
        print("Processing your query...")
        response = agentic_rag(user_query)
        print("\n" + "=" * 80)
        print("ANSWER:")
        print(response)
        print("=" * 80)

if __name__ == "__main__":
    # Check for API key
    if not os.environ.get("OPENAI_API_KEY"):
        api_key = input("Please enter your OpenAI API key: ").strip()
        if api_key:
            os.environ["OPENAI_API_KEY"] = api_key
        else:
            print("❌ No API key provided. Exiting.")
            sys.exit(1)
    interactive_mode()
Weave is used to monitor system behavior and track interactions, allowing us to analyze query patterns, response quality, and retrieval efficiency. This provides insight into how well the system is performing and where improvements can be made. By logging queries, retrieved results, and generated responses, Weave helps us refine the RAG pipeline and optimize retrieval strategies. It also allows us to debug potential issues by tracing how queries are routed and ensuring that the system selects the most relevant knowledge sources.
This single-agent design ensures efficient query handling while maintaining a simple, centralized retrieval workflow. It sets the foundation for later expansion into a multi-agent system where different agents handle structured data, real-time search, or iterative refinement.
Future areas of improvement
Integrating Claude 3.7 tool-use
Our current system, while functional, is somewhat manual and rigid. It requires careful prompt tuning, explicit database configurations, and a fair amount of human intervention whenever we want to modify how it retrieves information. Adding a new data source—whether it's an additional documentation set or another GitHub repository—means manually updating the vector database, modifying retrieval logic, and ensuring everything integrates correctly. While this works, it’s not scalable.
To make the system more flexible and autonomous, we can integrate Claude 3.7 with tool use, allowing it to reason about the retrieval process and dynamically query our knowledge sources. Instead of relying on a predefined set of rules to determine which database to search, Claude can analyze the user’s query, decide which sources are most relevant, and execute searches in real time. This approach follows the ReAct paradigm, where Claude is not just retrieving information passively but actively reasoning about how to retrieve the most relevant results.
This strategy has several advantages:
- Less manual intervention: Instead of constantly refining prompts and retrieval rules, we can let Claude make retrieval decisions dynamically.
- Easier scaling: Adding a new data source won’t require deep modifications to the system. Claude can recognize new tools and use them as needed.
- Better query handling: If a query is ambiguous, Claude can reformulate it or run multiple searches instead of just returning poor results.
Weave plays a key role in this by tracking interactions, helping us visualize which retrieval strategies work best, and refining how queries are processed over time.
Below, we implement this by giving Claude direct access to our vector database through tool use, allowing it to search documentation, GitHub issues, and user discussions in real time.
import os
import json
import sys
import requests
import time
from anthropic import Anthropic
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
import weave; weave.init("claude_agentic_rag")

# Create cache directory if it doesn't exist
os.makedirs("comment_cache", exist_ok=True)

# Initialize OpenAI Embeddings for vector search
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Configure knowledge bases
KNOWLEDGE_BASES = [
    {"name": "weave_issues", "path": "db/weave_issues", "description": "GitHub issues related to Weave, containing bug reports and feature requests."},
    {"name": "weave_docs", "path": "db/weave_docs", "description": "Documentation for Weave, explaining its features, APIs, and usage."},
    {"name": "wandb_issues", "path": "db/wandb_issues", "description": "GitHub issues related to Weights & Biases (wandb), containing bug reports and feature requests."},
    {"name": "wandb_docs", "path": "db/wandb_docs", "description": "Documentation for Weights & Biases (wandb), explaining its features, APIs, and usage."}
]

# Initialize Anthropic client
CLAUDE_API_KEY = os.environ.get("ANTHROPIC_API_KEY")
MODEL = "claude-3-7-sonnet-20250219"
client = Anthropic(api_key=CLAUDE_API_KEY)

# Define search tools
TOOLS = [
    {
        "name": "search_weave_issues",
        "description": "Search for GitHub issues related to Weave. Use this for debugging, error messages, or when the user mentions bugs or problems with Weave.",
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "The search query to find relevant Weave issues."}
            },
            "required": ["query"]
        }
    },
    {
        "name": "search_weave_docs",
        "description": "Search Weave documentation. Use this for understanding Weave features, APIs, usage examples, or 'how-to' questions about Weave.",
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "The search query to find relevant Weave documentation."}
            },
            "required": ["query"]
        }
    },
    {
        "name": "search_wandb_issues",
        "description": "Search for GitHub issues related to Weights & Biases (wandb). Use this for debugging, error messages, or when the user mentions bugs or problems with wandb.",
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "The search query to find relevant Weights & Biases issues."}
            },
            "required": ["query"]
        }
    },
    {
        "name": "search_wandb_docs",
        "description": "Search Weights & Biases documentation. Use this for understanding wandb features, APIs, usage examples, or 'how-to' questions about wandb.",
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "The search query to find relevant Weights & Biases documentation."}
            },
            "required": ["query"]
        }
    }
]

@weave.op
def search_vector_store(kb_name, query, k=3):
    """Search a vector store and return results, automatically fetching comments for issues."""
    kb = next((kb for kb in KNOWLEDGE_BASES if kb["name"] == kb_name), None)
    if not kb:
        return {"error": f"Knowledge base '{kb_name}' not found"}
    db_path = kb["path"]
    if not os.path.exists(db_path):
        return {"error": f"Knowledge base at '{db_path}' does not exist"}
    try:
        # Load the Chroma database
        db = Chroma(persist_directory=db_path, embedding_function=embeddings)
        # Search for similar documents
        results = db.similarity_search_with_score(query, k=k)
        # Format results
        formatted_results = []
        for i, (doc, score) in enumerate(results):
            # Format document
            result = {
                "document_id": i + 1,
                "relevance_score": float(score),
                "content": doc.page_content,
                "metadata": doc.metadata
            }
            # Add special handling for issues
            if doc.metadata.get('type') == 'issue':
                issue_number = doc.metadata.get('number')
                repo = doc.metadata.get('repo')
                result["issue_number"] = issue_number
                result["repo"] = repo
                result["github_url"] = doc.metadata.get('github_url')
                result["comments_count"] = doc.metadata.get('comments_count', 0)
                # Automatically fetch comments if available
                if doc.metadata.get('comments_count', 0) > 0 and repo and issue_number:
                    print(f"  Fetching {doc.metadata.get('comments_count')} comments for {repo}#{issue_number}")
                    comments_result = fetch_github_comments(repo, issue_number)
                    result["comments"] = comments_result
            formatted_results.append(result)
        return {
            "results_count": len(formatted_results),
            "results": formatted_results
        }
    except Exception as e:
        return {"error": f"Error searching knowledge base: {str(e)}"}

def fetch_github_comments(repo, issue_number):
    """Fetch comments for a GitHub issue."""
    # Check cache first
    cache_file = f"comment_cache/{repo.replace('/', '_')}_issue_{issue_number}_comments.json"
    if os.path.exists(cache_file):
        try:
            with open(cache_file, 'r') as f:
                comments = json.load(f)
            return {
                "source": "cache",
                "comments_count": len(comments),
                "comments": comments
            }
        except Exception:
            # Continue to fetch if cache read fails
            pass
    # Fetch from GitHub API
    url = f"https://api.github.com/repos/{repo}/issues/{issue_number}/comments"
    headers = {"Accept": "application/vnd.github.v3+json"}
    # Add GitHub token if available
    if os.environ.get("GITHUB_TOKEN"):
        headers["Authorization"] = f"token {os.environ.get('GITHUB_TOKEN')}"
    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            comments = response.json()
            # Cache the comments
            with open(cache_file, 'w') as f:
                json.dump(comments, f, indent=2)
            # Format the comments
            formatted_comments = []
            for comment in comments:
                formatted_comments.append({
                    "id": comment.get("id"),
                    "user": comment.get("user", {}).get("login", "anonymous"),
                    "created_at": comment.get("created_at"),
                    "body": comment.get("body", "")
                })
            return {
                "source": "github_api",
                "comments_count": len(formatted_comments),
                "comments": formatted_comments
            }
        elif response.status_code == 403 and "rate limit exceeded" in response.text.lower():
            return {
                "error": "GitHub API rate limit exceeded",
                "suggestion": "Set a GITHUB_TOKEN environment variable to increase rate limits"
            }
        else:
            return {
                "error": f"Error fetching comments: HTTP {response.status_code}",
                "response": response.text
            }
    except Exception as e:
        return {"error": f"Exception fetching comments: {str(e)}"}

def execute_tool(tool_name, tool_input):
    """Execute the appropriate tool based on the name and input."""
    if tool_name == "search_weave_issues":
        return search_vector_store("weave_issues", tool_input["query"])
    elif tool_name == "search_weave_docs":
        return search_vector_store("weave_docs", tool_input["query"])
    elif tool_name == "search_wandb_issues":
        return search_vector_store("wandb_issues", tool_input["query"])
    elif tool_name == "search_wandb_docs":
        return search_vector_store("wandb_docs", tool_input["query"])
    else:
        return {"error": "Unknown tool requested"}

def get_system_prompt():
    """Get the system prompt for Claude."""
    return """You are an AI assistant for Weights & Biases (wandb) and Weave.

You have access to several searchable knowledge bases:
1. Weave Issues - Contains GitHub issues related to Weave
2. Weave Documentation - Contains documentation for Weave
3. WandB Issues - Contains GitHub issues related to Weights & Biases
4. WandB Documentation - Contains documentation for Weights & Biases

To assist the user effectively:
1. Analyze their question to determine if it's about Weave or WandB (or both)
2. Determine if they're asking about documentation or having an issue
3. Search the appropriate knowledge base(s) using the search tools
4. If an issue looks relevant but lacks context, fetch its GitHub comments
5. Synthesize the retrieved information to provide a detailed answer
6. Make sure to provide the full links to the github issue so the user can investigate further
7. For doc links - DO NOT hallucinate links ---> just provide a verbatim repeat of the docs you think could be helpful

For code-related queries or error messages, search the issues database first. For "how to" questions, search the documentation database first.

Example workflows:
- For "How do I track experiments in wandb?": Search wandb_docs
- For "Weave authentication failing": Search weave_issues, then fetch comments if needed
- For a query mentioning an error message: Search the relevant issues database with the error message

Aim to provide comprehensive answers based on the most relevant retrieved documents. If you don't find relevant information in your first search, try different search queries or additional knowledge bases."""

@weave.op
def claude_agentic_rag(user_query):
    """Run the agentic RAG system using Claude with tools.

    Args:
        user_query (str): The user's query

    Returns:
        dict: The complete response, including search results and Claude's answer
    """
    # Check for API key
    if not CLAUDE_API_KEY:
        raise ValueError("ANTHROPIC_API_KEY environment variable is not set")
    system_prompt = get_system_prompt()
    print(f"\n🔍 Processing query: '{user_query}'")
    print("-" * 80)
    # Initial request with the user's prompt
    response = client.messages.create(
        model=MODEL,
        max_tokens=4000,
        system=system_prompt,
        thinking={"type": "enabled", "budget_tokens": 2000},
        tools=TOOLS,
        messages=[{"role": "user", "content": user_query}]
    )
    # Display thinking
    thinking_blocks = [b for b in response.content if b.type == "thinking"]
    for block in thinking_blocks:
        print("\n🧠 THINKING:")
        print(block.thinking[:300] + "..." if len(block.thinking) > 300 else block.thinking)
    # Process tool use if needed
    conversation = [{"role": "user", "content": user_query}]
    search_results = []
    # We might need multiple tool calls, so loop until we get a final answer
    while response.stop_reason == "tool_use":
        tool_block = next((b for b in response.content if b.type == "tool_use"), None)
        if tool_block:
            # Show which tool was selected
            print(f"\n🔧 USING TOOL: {tool_block.name}")
            print(f"Tool input: {json.dumps(tool_block.input, indent=2)}")
            # Execute the appropriate tool
            tool_result = execute_tool(tool_block.name, tool_block.input)
            print(f"Tool found {tool_result.get('results_count', 0)} results" if 'results_count' in tool_result else "Tool completed")
            # Save search results for return value
            search_results.append({
                "tool": tool_block.name,
                "input": tool_block.input,
                "result": tool_result
            })
            # Save assistant's response (thinking + tool use)
            assistant_blocks = thinking_blocks + [tool_block]
            conversation.append({"role": "assistant", "content": assistant_blocks})
            # Add tool result to conversation
            conversation.append({
                "role": "user",
                "content": [{
                    "type": "tool_result",
                    "tool_use_id": tool_block.id,
                    "content": json.dumps(tool_result)
                }]
            })
            # Get next response
            response = client.messages.create(
                model=MODEL,
                max_tokens=4000,
                system=system_prompt,
                thinking={"type": "enabled", "budget_tokens": 2000},
                tools=TOOLS,
                messages=conversation
            )
            # Update thinking blocks for next iteration
            thinking_blocks = [b for b in response.content if b.type == "thinking"]
            for block in thinking_blocks:
                print("\n🧠 ADDITIONAL THINKING:")
                print(block.thinking[:300] + "..." if len(block.thinking) > 300 else block.thinking)
    # Get final text response
    final_text = ""
    for block in response.content:
        if block.type == "text":
            final_text += block.text
    print("\n" + "=" * 80)
    print("ANSWER:")
    print(final_text)
    print("=" * 80)
    # Return the complete result with all context
    return {
        "query": user_query,
        "answer": final_text,
        "search_results": search_results,
        "conversation_history": conversation
    }

def interactive_mode():
    """Run the Claude agentic RAG system in interactive mode."""
    print("\n" + "=" * 80)
    print("CLAUDE 3.7 AGENTIC RAG SYSTEM")
    print("=" * 80)
    print("Type 'exit' or 'quit' to exit.")
    while True:
        print("\nYour question: ", end="")
        user_query = input().strip()
        # Check for exit command
        if user_query.lower() in ['exit', 'quit']:
            return
        # Skip empty queries
        if not user_query:
            print("Please enter a question.")
            continue
        print("\n" + "-" * 80)
        print("Processing your query...")
        try:
            # Process the query using Claude
            claude_agentic_rag(user_query)
        except Exception as e:
            print(f"❌ Error: {str(e)}")

if __name__ == "__main__":
    # Check for API keys
    if not os.environ.get("ANTHROPIC_API_KEY"):
        api_key = input("Please enter your Anthropic API key: ").strip()
        if api_key:
            os.environ["ANTHROPIC_API_KEY"] = api_key
            CLAUDE_API_KEY = api_key
        else:
            print("❌ No Anthropic API key provided. Exiting.")
            sys.exit(1)
    # Check GitHub token
    if not os.environ.get("GITHUB_TOKEN"):
        print("\n⚠️ No GitHub token found. You may hit rate limits when fetching comments.")
        print("To set a token: export GITHUB_TOKEN=your_token_here\n")
    # Run the interactive mode
    interactive_mode()
Our implementation utilizing Claude 3.7 with tool use enables the system to dynamically decide how to retrieve relevant information from multiple knowledge bases without requiring manual tuning. By integrating Weave, we can also monitor how Claude selects retrieval tools, track its reasoning process, and improve system performance iteratively. Here’s a breakdown of the core components of our system.
Vector search and tool-based retrieval
At the heart of the system is the vector search implementation, which allows Claude to retrieve relevant documents using semantic search rather than simple keyword matching. We use ChromaDB for efficient vector storage and OpenAI's text-embedding-3-small model to generate vector representations. The function search_vector_store handles queries by:
- Identifying the relevant knowledge base (e.g., Weave issues, Weave docs, Weights & Biases issues, Weights & Biases docs).
- Performing a similarity search against the stored vector database.
- Returning the top results, including metadata like GitHub issue numbers, associated repositories, and comments.
If a search result is a GitHub issue, we automatically fetch comments using the fetch_github_comments function, allowing the system to pull additional context that might contain solutions or relevant discussions.
Claude 3.7 tool selection
Claude simplifies our system because it can dynamically choose the right tool based on the user's query. The system is given access to four retrieval tools:
- search_weave_issues: Queries Weave-related GitHub issues.
- search_weave_docs: Searches Weave documentation.
- search_wandb_issues: Retrieves issues related to Weights & Biases.
- search_wandb_docs: Looks up relevant sections of the WandB documentation.
The execute_tool function runs the correct retrieval function based on Claude’s decision. This removes the need for static retrieval logic, allowing Claude to reason about which knowledge base to search.
Weave for monitoring and optimization
One challenge with tool-augmented retrieval systems is understanding how they make decisions and whether those decisions lead to optimal results. We use Weave to monitor which tools Claude selects for each query, what reasoning it provides for its selections, and how retrieved results influence the final response.
Each function involved in retrieval is wrapped in @weave.op, meaning all interactions can be logged, visualized, and analyzed. This is critical for debugging and improving system behavior—if Claude consistently selects the wrong tool for certain types of queries, we can adjust the system prompt or modify retrieval heuristics. Here's a screenshot of what we see inside Weave after running our script:

Dynamic query processing and iteration
Unlike traditional RAG systems, which retrieve documents in a single-step process, Claude can:
- Analyze a query and determine if additional context is needed.
- Search multiple knowledge bases if uncertainty exists.
- Refine search queries or ask follow-up questions.
Weave logs these multi-step interactions, allowing us to see when query reformulation is necessary or when search strategies need improvement.
Real-world use cases and alternative applications
Agentic RAG provides significant advantages over traditional retrieval systems, especially in knowledge-intensive fields such as legal research and financial analysis. Unlike static retrieval methods that rely on pre-indexed vector databases, agentic systems actively query real-time sources—including websites, structured databases, and non-vectorized documents. This reduces manual updates and ensures responses reflect the most current available information.
Legal research: More accurate and adaptive case law retrieval
Agentic RAG enables comprehensive legal research by:
- Querying external databases for case law, statutes, and regulatory documents that may not yet be indexed.
- Handling complex, criteria-based searches (e.g., lawsuits involving specific financial thresholds or settlements).
- Refining results through iterative query reformulation, reducing irrelevant results and minimizing hallucinations.
For legal professionals, this means higher accuracy, greater adaptability, and better decision-making when researching complex legal matters.
Financial analysis: Real-time market insights
Agentic RAG enhances financial analysis by:
- Accessing real-time data from external financial sources, earnings reports, and economic indicators.
- Dynamically adjusting queries to capture nuanced numerical parameters (e.g., market events, large transactions).
- Reducing inaccuracies through self-correction and adaptive retrieval strategies.
This allows analysts to make better-informed, real-time financial decisions in fast-changing environments.
Challenges and mitigation strategies in agentic RAG systems
1. Increased latency due to complex retrieval
Since agentic RAG retrieves from multiple sources, response times can be slower than a static LLM lookup.
Mitigation strategies:
- Implement caching mechanisms for frequently queried data (see the sketch after this list).
- Optimize parallel processing to run multiple retrievals simultaneously.
- Prioritize faster sources before expanding the search scope.
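As a sketch of the caching mitigation, retrieval results can be memoized keyed on the normalized query text, so repeated questions skip the expensive retrieval step. The cache directory and helper below are illustrative and would sit in front of whatever search function the system already uses.
```python
import hashlib
import json
import os

# Illustrative query-level cache for retrieval results (directory name is assumed).
CACHE_DIR = "retrieval_cache"
os.makedirs(CACHE_DIR, exist_ok=True)

def cached_search(query: str, search_fn) -> list:
    """Reuse results for queries we have already answered; otherwise run the real search."""
    key = hashlib.sha256(query.strip().lower().encode()).hexdigest()
    cache_file = os.path.join(CACHE_DIR, f"{key}.json")
    if os.path.exists(cache_file):
        with open(cache_file) as f:
            return json.load(f)  # Cache hit: skip the slow retrieval call entirely
    results = search_fn(query)  # Cache miss: run the real retrieval
    with open(cache_file, "w") as f:
        json.dump(results, f)
    return results

# Usage with any retrieval function that returns JSON-serializable results:
hits = cached_search("weave authentication failing", lambda q: [f"doc about {q}"])
print(hits)
```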
2. Reliability in tool selection
Selecting the wrong retrieval tool can lead to irrelevant or incomplete responses.
Mitigation strategies:
- Monitor tool selection behavior using Weave to identify systematic errors and make real-time adjustments.
- Introduce confidence thresholds to expand searches if low-relevance results are detected (sketched after this list).
- Suggest alternative queries to guide users toward more precise retrieval.
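A minimal sketch of the confidence-threshold idea, assuming results in the (document, score) format returned by Chroma's similarity_search_with_score, where lower scores mean closer matches as noted in the scripts above. The cutoff value and fallback functions are illustrative and would need tuning per embedding model and corpus.
```python
# Illustrative threshold check that widens the search when top matches look weak.
RELEVANCE_THRESHOLD = 0.8  # assumed cutoff; tune per embedding model and corpus

def needs_wider_search(results: list[tuple[object, float]]) -> bool:
    """Return True when every retrieved document scores worse than the cutoff."""
    return all(score > RELEVANCE_THRESHOLD for _, score in results)

def search_with_fallback(query: str, primary_search, fallback_searches: list) -> list:
    """Try the primary knowledge base first; expand to others only on weak results."""
    results = primary_search(query)
    if not results or needs_wider_search(results):
        # Expand to additional knowledge bases instead of answering from weak matches.
        for search in fallback_searches:
            results.extend(search(query))
    # Keep the five best-scoring documents overall (lower score = more relevant).
    return sorted(results, key=lambda pair: pair[1])[:5]
```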
3. Ethical concerns: Privacy and bias risks
Interacting with real-world databases introduces risks of data exposure and bias amplification.
Mitigation strategies:
Introduce a privacy agent to:
- Enforce strict access controls to prevent unauthorized data retrieval.
- Flag sensitive information before processing.
- Ensure compliance with privacy regulations (e.g., GDPR, HIPAA).
- Cross-reference multiple sources to reduce misinformation risks.
- Maintain audit logs for transparency and accountability.
By embedding privacy-focused safeguards, agentic RAG can ensure security and trustworthiness in automated retrieval.
Conclusion
Retrieval-augmented generation has evolved from a simple static retrieval process into a more dynamic and adaptive system through agentic RAG architectures. While traditional RAG provides a single-step lookup mechanism, agentic RAG systems introduce multi-step reasoning, adaptive query decomposition, and modular retrieval agents that improve response accuracy and scalability.
Single-agent systems provide efficiency and simplicity for structured retrieval, while multi-agent architectures offer greater flexibility for complex queries. By integrating tools like Claude 3.7 and monitoring frameworks such as Weave, retrieval systems can refine searches, expand knowledge sources, and improve accuracy in real time. As AI-driven retrieval advances, agentic RAG systems will enhance industries like legal research and finance by streamlining information access.