Building a GitHub Repo Summarizer with CrewAI
Multi-agent systems—where multiple AI agents work together on specialized tasks—are becoming a core approach for automating complex workflows. CrewAI is a platform built for designing these systems, letting developers assign tasks, define agent roles, and connect tools to handle real-world projects. When combined with Weave, a tool for tracking and debugging every agent’s behavior, the result is full transparency and control over the entire workflow.
In this walkthrough, we’ll build a multi-agent system that automatically documents a GitHub repository. Agents will explore the repo, summarize code, write a usage guide, generate an architecture diagram, and package everything into an HTML page. We’ll use CrewAI to manage the flow and Weave to monitor how every decision is made.

Contents:
- What is an Agent?
- Understanding Multi-Agent Systems
- Key Features of CrewAI's Platform
- Tutorial: Building a GitHub Repo Documenter Multi-Agent System with CrewAI and Weave
- Integrating Weave for Enhanced Observability and Debugging
- Conclusion
What is an Agent?
An AI agent is an intelligent system designed to achieve specific goals by planning, using external tools, retaining memory, and adapting over time. Unlike traditional automation, which follows rigid, predefined rules, AI agents dynamically process information, make decisions, and refine their approach based on feedback.
While chatbots primarily engage in conversations and require user input at every step, AI agents operate independently. They don't just generate responses; they take action, interact with external systems, and manage multi-step workflows without constant supervision.
Some key components of AI agents include:
- Tools: Connect to APIs, databases, and software to extend functionality.
- Memory: Store information across tasks to improve consistency and recall.
- Continual learning: Adapt and refine strategies based on past performance.
- Orchestration: Manage multi-step processes, break down tasks, and coordinate with other agents.
Understanding Multi-Agent Systems
At its core, a multi-agent system is a network of autonomous AI agents that communicate and coordinate to accomplish complex tasks more efficiently than any single agent could. Each agent is assigned specific roles and responsibilities, enabling specialization and, when appropriate, parallel processing. For example, in a news aggregation system, one agent might gather the latest headlines while another deep-dives into detailed analysis. Effective multi-agent design hinges on modularity, clear task delegation, and seamless data exchange, resulting in scalable and adaptable AI solutions that mirror human team dynamics.
Utilizing multiple specialized agents allows developers to break down large, complex problems into smaller, well-defined subtasks. Instead of designing one monolithic system to handle every aspect of a workflow, developers can assign each component or phase to an agent uniquely equipped for that responsibility—such as summarizing text, generating code snippets, analyzing sentiment, or formatting output. This division of labor not only makes the system easier to reason about but also enhances maintainability and robustness: individual agents can be updated, replaced, or scaled without disrupting the whole pipeline.
It also makes it easier to manage model context limits, swap in different models for specific tasks, and fine-tune or control prompt usage across the system—helping optimize resource usage and accuracy at each stage.
Key Features of CrewAI's Platform
CrewAI’s platform is designed to facilitate the construction of scalable multi-agent workflows by providing structured abstractions and flexible orchestration mechanisms:
- Automation Framework: At the heart of CrewAI is a clear separation between agents, tasks, and tools. Agents represent autonomous entities with defined roles and goals, while tasks are the discrete units of work assigned to them. Tools are external APIs, search engines, or custom functions that agents utilize to perform or enhance their tasks. This triad—agents, tasks, and tools—forms the building blocks for modular and maintainable AI workflows.
- Crews and Flows: CrewAI uses Crews to orchestrate multiple agents and their tasks. Crews control the overall execution flow and data exchange between agents. The platform supports different workflow processing strategies via the Process class (see the minimal sketch after this list):
  - Sequential: Tasks are executed in sequence, one after another, with outputs from earlier tasks feeding subsequent tasks.
  - Hierarchical: Tasks can be organized into nested or dependent structures, enabling complex coordination and delegation across agents.
- Flexible Deployment: CrewAI supports both inline Python scripting for agile development and YAML configuration files for greater structure and scalability. This flexibility allows developers to prototype quickly and then scale seamlessly to handle more complex multi-agent systems.
- Integration with External Tools: CrewAI agents can be extended by connecting them to various tools, which act as interfaces to external data and capabilities such as web search APIs, databases, or other services. By leveraging tools, agents gain the ability to retrieve up-to-date information, perform specialized computations, or interact with third-party systems—significantly expanding the scope of automated workflows.
- Explicit Task Assignments: CrewAI requires explicit definition and assignment of tasks to agents. Rather than autonomous self-monitoring or delegation, developers specify each task’s responsibilities and which agents are responsible. This explicitness improves transparency, predictability, and ease of debugging complex multi-agent interactions.
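To make these building blocks concrete, here is a minimal, illustrative sketch of the agent/task/crew pattern. The roles, goals, and prompt strings below are hypothetical (they are not part of the tutorial that follows), and the sketch assumes an LLM provider such as OpenAI is already configured via environment variables:

from crewai import Agent, Task, Crew, Process

# Two illustrative agents with narrow roles (hypothetical example)
researcher = Agent(
    role="Researcher",
    goal="Collect key facts about a topic.",
    backstory="Diligent analyst.",
    allow_delegation=False,
    verbose=True,
)
writer = Agent(
    role="Writer",
    goal="Turn research notes into a short summary.",
    backstory="Concise technical writer.",
    allow_delegation=False,
    verbose=True,
)

# Explicit tasks, each assigned to one agent
research_task = Task(
    description="List five key facts about multi-agent systems.",
    expected_output="A bullet list of facts.",
    agent=researcher,
)
write_task = Task(
    description="Write a one-paragraph summary from the research notes.",
    expected_output="A short paragraph.",
    agent=writer,
    context=[research_task],  # the first task's output feeds the second
)

# A Crew wires agents and tasks together; Process.sequential runs tasks in order
crew = Crew(
    agents=[researcher, writer],
    tasks=[research_task, write_task],
    process=Process.sequential,
    verbose=True,
)

if __name__ == "__main__":
    print(crew.kickoff())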
Tutorial: Building a GitHub Repo Documenter Multi-Agent System with CrewAI and Weave
In this tutorial, we demonstrate how to build a multi-agent AI system using CrewAI that automatically explores, understands, documents, and visualizes a GitHub repository. The goal is to create a fully automated pipeline that transforms a raw codebase into a comprehensive usage guide and architecture diagram—packaged as a clean HTML document.
Before creating our AI agents, we first need to build a set of specialized tools that give the agents the capabilities they need to interact with the GitHub repository. Tools act as external interfaces or utilities that agents can call on to perform specific functions like searching files, reading content, or summarizing code. In this tutorial, we implement three core tools:
- Repo Explorer Tool: Takes a user's query, searches the repository's file structure, and returns the file paths most relevant to the query, helping agents identify where to focus their analysis.
- View File Tool: Reads and returns the full content of a specified file path, giving agents access to raw source code or documentation.
- Summarize File Tool: Uses a language model to digest a file's content and generate a concise summary, letting agents grasp complex code or documentation without reading it line by line.
The first two tools are shown in the snippet below; the Summarize File Tool follows the same pattern and is broken out right after it.
import os
import subprocess
from typing import Type

from pydantic import BaseModel, Field
from crewai.tools import BaseTool
from langchain_openai import ChatOpenAI

# Clone the repo if not already cloned
REPO_URL = "https://github.com/karpathy/nanogpt.git"
REPO_DIR = "./nanogpt"

if not os.path.exists(REPO_DIR):
    subprocess.run(["git", "clone", REPO_URL])

# --- Define the Repo Explorer Tool ---
class RepoQueryInput(BaseModel):
    query: str = Field(..., description="Describe what you are looking for in the codebase.")

class RepoExplorerTool(BaseTool):
    name: str = "Repo Explorer"
    description: str = "Given a query, find relevant file paths from the nanogpt repo."
    args_schema: Type[BaseModel] = RepoQueryInput

    def _run(self, query: str) -> str:
        # Walk the repo and collect every file path
        file_list = []
        for root, _, files in os.walk(REPO_DIR):
            for file in files:
                full_path = os.path.join(root, file)
                file_list.append(full_path)
        # Ask the LLM to pick the paths relevant to the query
        llm = ChatOpenAI(model_name="gpt-4o-mini", temperature=0)
        prompt = f"""You are a codebase expert. Here is a list of files:\n{file_list}\n\nBased on the user query "{query}", return ONLY the full file paths that are most relevant.
Respond with one file path per line, no extra text."""
        response = llm.invoke(prompt)
        return response.content.strip()

# --- Define the View File Content Tool ---
class ViewFileInput(BaseModel):
    filepath: str = Field(..., description="Full path to the file you want to read.")

class ViewFileTool(BaseTool):
    name: str = "View File Content"
    description: str = "Reads the full content of a given file path from the nanogpt repo."
    args_schema: Type[BaseModel] = ViewFileInput

    def _run(self, filepath: str) -> str:
        filepath = filepath.strip()  # strip whitespace so LLM-produced paths resolve correctly
        try:
            with open(filepath, "r", encoding="utf-8") as f:
                content = f.read()
            return content
        except Exception as e:
            return f"Error reading file: {str(e)}"

# --- Instantiate the tools ---
repo_explorer_tool = RepoExplorerTool()
view_file_tool = ViewFileTool()

# --- Example usage ---
if __name__ == "__main__":
    # Step 1: Find files related to the training loop
    query = "Where is the training loop defined?"
    result_paths = repo_explorer_tool._run(query=query)
    print("Relevant Files:\n", result_paths)

    # Step 2: Read the first file found
    first_file = result_paths.splitlines()[0]
    file_content = view_file_tool._run(filepath=first_file)
    print("\nContent of the first relevant file:\n")
    print(file_content[:3000])  # only print the first 3000 chars to keep output readable
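The third tool, the Summarize File Tool, follows the same BaseTool pattern and appears in the full script later in this post. For reference, here it is broken out on its own, reusing the imports and the ViewFileInput schema from the snippet above:

# --- Define the Summarize File Tool ---
class SummarizeFileTool(BaseTool):
    name: str = "Summarize File Content"
    description: str = "Summarizes a given file's content into a short description."
    args_schema: Type[BaseModel] = ViewFileInput  # reuses the filepath input schema

    def _run(self, filepath: str) -> str:
        filepath = filepath.strip()
        try:
            with open(filepath, "r", encoding="utf-8") as f:
                content = f.read()
            # Have the LLM compress the file into a short description
            llm = ChatOpenAI(model_name="gpt-4o-mini", temperature=0)
            prompt = f"Summarize the following code file in 5 sentences:\n\n{content}"
            response = llm.invoke(prompt)
            return response.content.strip()
        except Exception as e:
            return f"Error summarizing file: {str(e)}"

summarize_file_tool = SummarizeFileTool()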
Now we are ready to define our AI agents and the tasks they will perform, leveraging the tools we have built.
Our system consists of multiple specialized agents collaborating sequentially to deliver the final documentation output:
- Demo Generator: Crafts an initial, beginner-friendly usage guide for the repository, including a minimal example command.
- Follow-Up Analyst: Reviews the generated usage guide, identifies gaps, asks clarifying questions, and poses concrete suggestions for improvement.
- Repo Summarizer: Scans and summarizes major files in the repository, providing concise descriptions to aid developer understanding.
- Diagram Generator: Creates a textual architecture diagram (e.g., PlantUML or ASCII) based on file summaries, giving a high-level repo structure overview.
- Shell Script Automation Engineer: Generates a comprehensive, idempotent Bash script that automates all repository setup steps—including dependency installation, environment setup, and running the demo—so a developer can get started with a single command.
- HTML Coder: Combines the improved usage guide, the architecture diagram, and the generated setup script into a clean, styled HTML documentation page for easy consumption.
Here’s the code!
import os
import subprocess
from typing import Type

from pydantic import BaseModel, Field
from crewai import Agent, Task, Crew, Process
from crewai.tools import BaseTool
from langchain_openai import ChatOpenAI

import weave; weave.init("crewai_git_documenter")

# --- Repo Setup ---
REPO_URL = os.getenv('REPO_URL', 'https://github.com/LiveCodeBench/LiveCodeBench')
REPO_NAME = REPO_URL.split("/")[-1].replace(".git", "")
REPO_DIR = f"./{REPO_NAME}"

if not os.path.exists(REPO_DIR):
    subprocess.run(["git", "clone", REPO_URL])

# --- Tools ---
class RepoQueryInput(BaseModel):
    query: str = Field(..., description="Describe what you are looking for in the codebase.")

class RepoExplorerTool(BaseTool):
    name: str = "Repo Explorer"
    description: str = "Given a query, find relevant file paths from the repo."
    args_schema: Type[BaseModel] = RepoQueryInput

    def _run(self, query: str) -> str:
        file_list = []
        for root, _, files in os.walk(REPO_DIR):
            for file in files:
                full_path = os.path.join(root, file)
                file_list.append(full_path)
        llm = ChatOpenAI(model_name="gpt-4o-mini", temperature=0)
        prompt = f"""Here is a list of files:\n{file_list}\n\nUser query: "{query}"
Return ONLY full file paths that match. One per line."""
        response = llm.invoke(prompt)
        return response.content.strip()

class ViewFileInput(BaseModel):
    filepath: str = Field(..., description="Full path to the file you want to read.")

class ViewFileTool(BaseTool):
    name: str = "View File Content"
    description: str = "Reads the full content of a given file path from the repo."
    args_schema: Type[BaseModel] = ViewFileInput

    def _run(self, filepath: str) -> str:
        filepath = filepath.strip()
        try:
            with open(filepath, "r", encoding="utf-8") as f:
                return f.read()
        except Exception as e:
            return f"Error reading file: {str(e)}"

class SummarizeFileTool(BaseTool):
    name: str = "Summarize File Content"
    description: str = "Summarizes a given file's content into a short description."
    args_schema: Type[BaseModel] = ViewFileInput

    def _run(self, filepath: str) -> str:
        filepath = filepath.strip()
        try:
            with open(filepath, "r", encoding="utf-8") as f:
                content = f.read()
            llm = ChatOpenAI(model_name="gpt-4o-mini", temperature=0)
            prompt = f"Summarize the following code file in 5 sentences:\n\n{content}"
            response = llm.invoke(prompt)
            return response.content.strip()
        except Exception as e:
            return f"Error summarizing file: {str(e)}"

# --- Instantiate tools ---
repo_explorer_tool = RepoExplorerTool()
view_file_tool = ViewFileTool()
summarize_file_tool = SummarizeFileTool()

# --- Agents ---
demo_generator_agent = Agent(
    role="Demo Generator",
    goal="Write and expand a basic usage example for the repo.",
    backstory="Expert technical writer.",
    tools=[repo_explorer_tool, view_file_tool],
    allow_delegation=False,
    verbose=True,
    llm=ChatOpenAI(model_name="gpt-4o-mini", temperature=0.5)
)

followup_analyst_agent = Agent(
    role="Follow-Up Analyst",
    goal="Analyze the usage guide, find missing info, suggest improvements.",
    backstory="Developer experience expert.",
    allow_delegation=False,
    verbose=True,
    llm=ChatOpenAI(model_name="gpt-4o-mini", temperature=0.3)
)

html_coder_agent = Agent(
    role="HTML Coder",
    goal="Turn the improved usage guide into a clean HTML page.",
    backstory="Frontend dev specializing in docs websites. Do NOT generate or embed images, e.g. no src='data:image/png...' data URIs.",
    allow_delegation=False,
    verbose=True,
    llm=ChatOpenAI(model_name="gpt-4o-2024-08-06", temperature=0.4)
)

repo_summarizer_agent = Agent(
    role="Repo Summarizer",
    goal="Summarize all important files in the repo for easier understanding.",
    backstory="Documentation architect.",
    tools=[repo_explorer_tool, view_file_tool, summarize_file_tool],
    allow_delegation=False,
    verbose=True,
    llm=ChatOpenAI(model_name="gpt-4o-mini", temperature=0.4)
)

diagram_generator_agent = Agent(
    role="Diagram Generator",
    goal="Using file summaries, generate a full architecture diagram of the repo.",
    backstory="System architect skilled in explaining large codebases.",
    allow_delegation=False,
    verbose=True,
    llm=ChatOpenAI(model_name="gpt-4o-mini", temperature=0.4)
)

shell_script_agent = Agent(
    role="Shell Script Automation Engineer",
    goal="Create a working shell script that sets up all requirements and runs the repo's demo or basic usage.",
    backstory="Expert in bash scripting, devops, and repo onboarding automations.",
    tools=[repo_explorer_tool, view_file_tool, summarize_file_tool],  # optional, useful for reading files
    allow_delegation=False,
    verbose=True,
    llm=ChatOpenAI(model_name="gpt-4o-mini", temperature=0.3)
)

# --- Tasks ---
demo_generation_task = Task(
    description=f"Write a beginner-friendly usage guide explaining how to use {REPO_NAME}. Include a minimal example command.",
    expected_output="Markdown snippet with instructions.",
    agent=demo_generator_agent,
    output_file="basic_usage.md"
)

followup_task = Task(
    description="Read 'basic_usage.md', find confusing parts, generate 5+ questions and improvements.",
    expected_output="List of questions + feedback report.",
    agent=followup_analyst_agent,
    context=[demo_generation_task],
    output_file="feedback_report.md"
)

improvement_task = Task(
    description="Use 'feedback_report.md' to expand and improve the usage guide.",
    expected_output="New improved guide.",
    agent=demo_generator_agent,
    context=[followup_task],
    output_file="improved_usage.md"
)

repo_summarization_task = Task(
    description="Scan the repo, summarize the major files (training, models, data utils, etc.).",
    expected_output="List of file summaries.",
    agent=repo_summarizer_agent,
    output_file="repo_summary.md"
)

diagram_generation_task = Task(
    description="Using 'repo_summary.md', create an overall architecture diagram of the repo in text/plantuml format.",
    expected_output="Architecture diagram text.",
    agent=diagram_generator_agent,
    context=[repo_summarization_task],
    tools=[repo_explorer_tool, view_file_tool, summarize_file_tool],
    output_file="architecture_diagram.md"
)

shell_script_task = Task(
    description=(
        "Read the improved usage guide and repo summary. Write a BASH shell script that fully sets up the repo from scratch, "
        "including dependency installation, environment setup, any required downloads, and running the minimal example demo. "
        "The script should be idempotent (safe to rerun), and **explain each step with comments**. Output the script in a codeblock."
    ),
    expected_output="A complete shell script (setup_and_run_demo.sh) with comments.",
    agent=shell_script_agent,
    context=[improvement_task, repo_summarization_task],  # uses the improved usage guide and file summaries
    output_file="setup_and_run_demo.sh"
)

html_generation_task = Task(
    description="Combine 'improved_usage.md' and 'architecture_diagram.md' into a final HTML file. Make sure to append the full setup_and_run_demo.sh script at the end.",
    expected_output="Final styled HTML guide. Follow styling similar to the W&B website.",
    agent=html_coder_agent,
    context=[improvement_task, diagram_generation_task, shell_script_task],
    output_file="final_guide.html"
)

# --- Run Crew ---
crew = Crew(
    agents=[
        demo_generator_agent,
        followup_analyst_agent,
        repo_summarizer_agent,
        diagram_generator_agent,
        html_coder_agent,
        shell_script_agent,
    ],
    tasks=[
        demo_generation_task,
        followup_task,
        improvement_task,
        repo_summarization_task,
        diagram_generation_task,
        shell_script_task,  # runs before html_generation_task so the script is available in context
        html_generation_task,
    ],
    process=Process.sequential,
    verbose=True
)

if __name__ == "__main__":
    result = crew.kickoff()
    print("\n\nFinal HTML Guide and Repo Diagram Created:\n")
    print(result)
We orchestrate the entire workflow using CrewAI's Agent, Task, and Crew classes. Each agent is assigned a specialized task, and all tasks are executed sequentially, so the output of one stage becomes the input for the next. This ensures a stepwise build-up of information, from the initial usage guide and file summaries through diagram and shell-script generation, culminating in a comprehensive HTML documentation page that consolidates all results.
After running the script, you will see a new file called final_guide.html, which contains a helpful guide for using the repo. Here are a few screenshots of what it looks like:


Integrating Weave for Enhanced Observability and Debugging
To monitor and debug this multi-agent system efficiently, we integrate Weave, initializing it at the start of the script with import weave; weave.init("crewai_git_documenter"). This setup captures every call each agent makes to the OpenAI models, logging prompts, responses, and relevant context in one place. With Weave's visualization dashboard, developers can inspect the detailed request-response cycle of each agent's LLM calls, trace the flow of information as agents retrieve, analyze, and generate content, quickly identify bottlenecks, unexpected outputs, or errors in the chain, and tune prompt instructions and parameters based on real usage data. Here's a screenshot of Weave showing how we can visualize every single LLM call our agents make:

This granular visibility into agent interactions is especially valuable for multi-agent workflows like ours, where multiple models contribute specialized knowledge. Instead of sifting through disjointed logs in local files, Weave provides a web-based, queryable history of the entire pipeline, enabling more effective iterative development and maintenance.
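Weave's tracing is also not limited to the automatically captured LLM calls. If you want your own tool logic (file reads, repo walks, post-processing) to show up as traced operations in the same project, you can wrap those functions with the weave.op decorator. Here is a minimal sketch; the helper function and its name are hypothetical and just illustrate the pattern:

import os
import weave

weave.init("crewai_git_documenter")  # same project as the main script

# Hypothetical helper: wrapping it in @weave.op records its inputs,
# outputs, and timing alongside the captured LLM calls.
@weave.op()
def list_repo_files(repo_dir: str):
    paths = []
    for root, _, files in os.walk(repo_dir):
        for name in files:
            paths.append(os.path.join(root, name))
    return paths

if __name__ == "__main__":
    print(len(list_repo_files("./LiveCodeBench")))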
By dividing responsibilities across specialized agents and tracking every step with Weave, this CrewAI-powered multi-agent system offers a scalable, maintainable, and transparent approach to automated software documentation.
Conclusion
Multi-agent systems are transforming automation by enabling AI agents to work together on complex tasks. CrewAI is built for this, letting developers define agents with specific roles and connect them to tools and tasks. It supports both code and config-based workflows, with clear orchestration that makes debugging and scaling easier. When combined with Weave, every model call is tracked and visualized, giving developers a full view into how agents think, act, and collaborate.
In the GitHub documentation system, each agent handles a specific job—summarizing code, generating diagrams, writing usage guides, or building setup scripts. The system runs sequentially, with each step feeding the next. With Weave integrated, developers can monitor prompt-response pairs, trace logic, fix errors, and improve output quality. CrewAI handles the automation, Weave reveals the internals, and together they create maintainable, transparent multi-agent systems.