Exploring multi-agent AI systems
This project explores multi-agent AI systems, examining how multiple specialized agents collaborate to enhance decision-making, problem-solving, and automation across various domains.
Created on February 4 | Last edited on February 27
AI systems are evolving beyond single models handling tasks in isolation. Instead, intelligent multi-agent AI systems are emerging as a way to distribute work across multiple specialized agents, allowing for collaboration, iterative refinement, and more dynamic decision-making. Rather than relying on a single model to generate and verify outputs on its own, multi-agent systems introduce structured interactions where agents communicate, challenge assumptions, and improve results through collective reasoning.
This article will explore how multi-agent AI systems work, their key differences from single-agent models, and the benefits they offer. We'll also cover different types of multi-agent systems, real-world applications, and the challenges that come with building and deploying them.

Table of contents
- What are agentic AI systems?
- How multi-agent systems differ from single-agent systems
- Key benefits of intelligent multi-agent AI systems
- Types of multi-agent systems
- Autonomous multi-agent systems
- Human-guided multi-agent systems
- Hybrid autonomous systems
- Andrew Ng's thoughts on multi-agent systems
- Multi-agent AI systems in the real world
- Research on multi-agent systems
- How Moody's is using multi-agent systems
- Multi-agent systems in customer service
- Building a multi-agent system
- Conclusion
What are agentic AI systems?
An AI agent is an intelligent system designed to achieve specific goals through planning, tool use, memory retention, and adaptation over time. Unlike traditional automation, which follows rigid, predefined rules, AI agents dynamically process information, make decisions, and refine their approach based on feedback.
While chatbots primarily engage in conversations and require user input at every step, AI agents operate independently. They don't just generate responses; they take action, interact with external systems, and manage multi-step workflows without constant supervision.
Some key components of AI agents include:
- Tools: Connect to APIs, databases, and software to extend functionality.
- Memory: Store information across tasks to improve consistency and recall.
- Continual learning: Adapt and refine strategies based on past performance.
- Orchestration: Manage multi-step processes, break down tasks, and coordinate with other agents.
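To make these components concrete, here is a minimal, illustrative sketch. The "tool" is a plain Python function standing in for a real API, and there is no actual model behind the agent; the names are hypothetical.

```python
from dataclasses import dataclass, field

def word_count_tool(text: str) -> str:
    """A 'tool': any external function the agent can call."""
    return str(len(text.split()))

@dataclass
class Agent:
    name: str
    tools: dict = field(default_factory=dict)   # Tools: external capabilities
    memory: list = field(default_factory=list)  # Memory: record of past steps

    def act(self, task: str) -> str:
        # Orchestration: pick a tool, run it, and remember the outcome.
        result = self.tools["word_count"](task)
        self.memory.append((task, result))
        return result

agent = Agent(name="counter", tools={"word_count": word_count_tool})
print(agent.act("count the words in this sentence"))  # → 6
```

In a real agent the orchestration step would be driven by a model choosing among many tools, and memory would persist across sessions rather than living in a list.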
How multi-agent systems differ from single-agent systems
The core difference between single-agent and multi-agent AI systems is that a single-agent system operates under one "identity"—a single model processes input, generates output, and refines its own responses internally. It relies solely on its own mechanisms for reasoning and correction, without structured external feedback.
In contrast, multi-agent systems distribute work across multiple agents, each with a specialized role. These agents engage in structured interactions, questioning assumptions, verifying results, and iterating on solutions. Instead of a single perspective, they incorporate multiple viewpoints, reducing the risk of unchecked errors and improving adaptability.
Multi-agent systems can be:
- Fully autonomous, operating without human intervention.
- Human-guided, where humans oversee and direct their interactions.
- Hybrid, allowing AI agents to work independently but requiring human validation at key stages.
As AI continues to evolve, multi-agent frameworks are being explored as a promising approach for handling complex workflows, reasoning across multiple perspectives, and refining outputs through collaboration.
Key benefits of intelligent multi-agent AI systems
Multi-agent AI systems improve performance, efficiency, and adaptability by distributing tasks among specialized agents rather than relying on a single model to handle everything. This approach mirrors human teamwork, where breaking down complex projects into focused roles leads to better outcomes.
One major benefit is improved task decomposition. LLMs struggle with long, complex instructions, but dividing a problem into smaller, well-defined steps allows each agent to focus on a specific aspect. This leads to better execution, whether in software development, research, or content creation.
Another advantage is enhanced accuracy and reliability. When multiple agents verify, critique, or refine each other’s work, the chance of errors or hallucinations decreases. Systems like ChatDev and MetaGPT introduce structured communication methods, enabling agents to question assumptions, iterate on ideas, and generate higher-quality outputs.
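The verify-and-critique pattern can be sketched in a few lines. Here both "agents" are stand-in functions with canned outputs; in a real system each would be a separately prompted LLM call, and the loop would pass the critique back into the writer's prompt.

```python
def writer(task, feedback=None):
    # First draft contains a deliberate error; the revision fixes it.
    return "2 + 2 = 5" if feedback is None else "2 + 2 = 4"

def critic(draft):
    # Returns a critique string, or None if the draft is acceptable.
    return None if draft.endswith("= 4") else "arithmetic error: recheck the sum"

def refine(task, max_rounds=3):
    """Loop writer and critic until the critic passes or rounds run out."""
    feedback = None
    for _ in range(max_rounds):
        draft = writer(task, feedback)
        feedback = critic(draft)
        if feedback is None:
            return draft
    return draft

print(refine("add 2 and 2"))  # → 2 + 2 = 4
```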
Scalability is another key factor. A single AI model has limitations in terms of memory and processing capacity, but a network of agents can manage multiple tasks in parallel, making them more suitable for large-scale automation. Research automation, for example, benefits from AI-driven scientists generating hypotheses, running experiments, and analyzing results concurrently.
Multi-agent systems also allow for dynamic adaptability. Agents can recall past interactions, reflect on their own performance, and modify their approach based on feedback. This is particularly useful in applications where ongoing learning and refinement are needed, such as AI-driven software development or long-term project management.
Types of multi-agent systems
After diving into much of the latest research and attempting to build a multi-agent system myself, the largest differentiator between current AI agent systems seems to be the amount of human intervention involved. From my perspective, there are three main categories of multi-agent systems: autonomous, human-guided, and hybrid autonomous. Note that I arrived at these categories through my own research and experimentation; the world of agents is moving at a blistering pace, so they are subject to change and debate.
While full autonomy is often presented as the ultimate goal, I think it's a bit of a gimmick right now. AI models struggle with consistency, long-term reasoning, long-term memory retrieval, and self-correction without human intervention. However, this could rapidly change as base models improve. For now, semi-autonomous approaches offer a more practical balance between automation and usability.
Autonomous multi-agent systems
These systems operate entirely without human intervention. AI agents generate ideas, critique each other, and refine their outputs iteratively. The system determines its own workflow, deciding what needs to be done and how to approach each step. These approaches are promising for large-scale research automation, but they still suffer from hallucinations, lack of true long-term memory, and difficulties with self-correction.
Some implementations focus on AI-driven research, where agents propose hypotheses, run code, analyze results, and refine conclusions. Others use adversarial setups where agents challenge each other’s findings to improve reliability. While these methods can accelerate discovery, they often produce ungrounded results or reinforce errors without human oversight.
Fully autonomous systems are still inconsistent in complex problem-solving, but as AI improves, they could take on more aspects of research, engineering, and decision-making with minimal intervention.
Human-guided multi-agent systems
These systems rely on humans to define and trigger each step, with AI agents executing specific tasks but not taking initiative beyond their assigned roles. Every phase is broken into discrete steps, with humans overseeing and initiating each one. Here, the AI essentially functions as a set of highly capable tools rather than an independent research entity.
Humans predefine each step and activate AI agents to complete individual tasks. For example, a researcher might prompt an AI to generate a literature review, then instruct another agent to summarize findings, and later request an AI-driven analysis. AI does not determine the research flow but executes predefined roles within a strict pipeline, where scientists manually assign tasks like hypothesis generation, data processing, and result verification. AI agents operate as isolated specialists handling tasks such as code generation or statistical analysis, but each step requires human intervention before moving forward, ensuring full control over the process.
Hybrid autonomous systems
Hybrid systems give AI more flexibility while keeping human oversight at key points. Rather than being restricted to predefined tasks, AI agents operate within broader roles, making independent decisions and performing multi-step actions before requiring human input. They can refine their own work, collaborate dynamically, and leverage tools before seeking validation.
AI agents take multiple sequential actions before pausing for human review. An AI scientist might generate a hypothesis, analyze prior research, and design an experiment before requesting feedback. Instead of being confined to single tasks, agents function within open-ended roles, identifying research gaps, drafting questions, and suggesting methodologies. AI also iterates with human guidance, refining models, adjusting parameters, and optimizing performance before further input.
This hybrid approach balances automation with usability. AI can manage complex workflows, while human oversight ensures relevance, accuracy, and ethical considerations. Full autonomy remains a distant goal, but hybrid systems are already proving effective in real-world research.
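The hybrid pattern can be sketched as an agent that runs several steps autonomously, then pauses at named checkpoints for human approval. The step names and the checkpoint choice here are illustrative placeholders, not a real research pipeline.

```python
def run_with_checkpoints(steps, checkpoints, approve):
    """Run steps in order; before each checkpointed step, ask for approval."""
    completed = []
    for step in steps:
        if step in checkpoints and not approve(step):
            break  # human rejected: stop and hand control back
        completed.append(step)
    return completed

steps = ["generate hypothesis", "analyze prior work",
         "design experiment", "run experiment"]

# Only the expensive final step requires human sign-off in this sketch.
done = run_with_checkpoints(steps, checkpoints={"run experiment"},
                            approve=lambda s: False)
print(done)  # the first three steps complete; "run experiment" awaits approval
```

In practice `approve` would surface the agent's intermediate work to a reviewer rather than returning a hard-coded answer.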
Andrew Ng's thoughts on multi-agent systems
Andrew Ng has acknowledged multi-agent collaboration as a powerful design pattern for AI systems, particularly for complex tasks like software development.
Instead of relying on a single LLM to handle everything, breaking tasks into specialized roles—such as software engineer, product manager, and QA tester—improves efficiency and accuracy. Even if all agents stem from the same LLM, prompting them separately allows for better task decomposition and execution.
One key advantage of this approach is that while modern LLMs can process large input contexts, their ability to understand and act on long, complex instructions remains inconsistent. By assigning focused roles, each agent can optimize its part of the task, leading to better overall performance. This mirrors real-world project management, where managers divide work among specialists rather than expecting a single person to do everything.
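Role decomposition with a single underlying model can be sketched as a relay of role-specific prompts. `call_llm` below is a placeholder stub, not a real model API, and the role names and prompts are illustrative.

```python
ROLES = [
    ("product_manager", "You write concise requirements for new features."),
    ("engineer", "You implement requirements in clean Python."),
    ("qa_tester", "You write tests that check the implementation."),
]

def call_llm(role, system_prompt, user_input):
    # Stand-in for a chat-completion call with a role-specific system prompt.
    return f"[{role}] output for: {user_input[:40]}"

def run_pipeline(task):
    """Each role consumes the previous role's output, like a relay."""
    output, transcript = task, []
    for role, system_prompt in ROLES:
        output = call_llm(role, system_prompt, output)
        transcript.append((role, output))
    return transcript

for role, text in run_pipeline("add a dark mode toggle"):
    print(role, "->", text)
```

The point is that the same model, prompted three different ways, sees three narrower instructions instead of one sprawling one.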
Ng points to emerging frameworks like AutoGen, Crew AI, LangGraph, and ChatDev as examples of multi-agent systems in action. ChatDev, in particular, models a virtual software company where agents collaborate using structured workflows. While these systems aren’t always perfect, they often produce surprisingly effective results.
He also notes that multi-agent collaboration introduces unpredictability, especially when agents interact freely. More structured patterns like Reflection and Tool Use tend to be more reliable. Despite these challenges, Ng sees multi-agent AI as a promising direction for improving complex problem-solving.
Multi-agent AI systems in the real world
From software development and scientific research to finance, healthcare, and customer service, multi-agent setups enable AI to work in structured teams, optimizing workflows and reducing errors. While challenges like unpredictability and coordination remain, emerging frameworks demonstrate that multi-agent AI can enhance automation, decision-making, and problem-solving in real-world applications. This section explores various implementations and the impact of multi-agent AI across industries.
Research on multi-agent systems
Research on multi-agent AI explores how AI systems collaborate, adapt, and solve complex problems. While often studied in simulated environments, these systems have potential for real-world applications. This section covers key research in the field, from generative agents to AI-driven scientific discovery, highlighting how multi-agent AI can enhance automation, decision-making, and teamwork.
Generative agents: Interactive simulacra of human behavior
AI systems that simulate human behavior are often built for entertainment or research, but could they also be useful in real-world work environments?
A recent study explored this question by creating generative agents—AI-powered characters capable of recalling past events, forming relationships, and making plans. While largely an exploratory effort focused on social behavior, the results offer insights for more productivity-oriented AI systems.

The researchers placed 25 agents in a simulated town where they lived out their daily routines. The agents ran businesses, conducted research, and even planned a Valentine’s Day party—entirely on their own. One agent decided to host the party, invited friends, and spread the word. Others coordinated decorations, asked each other on dates, and showed up—without explicit instructions. Meanwhile, agents worked as shopkeepers, students, and artists, sticking to schedules and responding dynamically when things changed.
What made this possible? Memory, reflection, and planning. Agents stored past experiences, recalled relevant details, and adjusted their actions accordingly.
Findings showed that memory retrieval failures led to inconsistencies, reflection helped agents form realistic long-term behavior, and planning allowed for multi-step work. Even though this was a sandbox experiment, it suggests that similar AI architectures could be useful in workplace automation, collaborative AI teams, and long-term AI assistants that need to remember and adapt over time. AI that can recall, reflect, and plan isn’t just interactive—it might be truly useful.
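The paper's retrieval idea, scoring each memory by recency, importance, and relevance and surfacing the top results, can be sketched as below. The keyword-overlap relevance function is a crude stand-in for the embedding similarity the actual system uses, and the memories are invented examples.

```python
import time

def recency(last_access, now, decay=0.995):
    hours = (now - last_access) / 3600
    return decay ** hours  # exponentially decaying recency score

def relevance(query, text):
    # Crude keyword overlap standing in for embedding similarity.
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / max(len(q), 1)

def retrieve(memories, query, now, k=2):
    """Return the k memories with the highest combined score."""
    scored = sorted(
        memories,
        key=lambda m: recency(m["time"], now) + m["importance"]
                      + relevance(query, m["text"]),
        reverse=True,
    )
    return [m["text"] for m in scored[:k]]

now = time.time()
memories = [
    {"text": "planned the valentines party", "importance": 0.8, "time": now - 3600},
    {"text": "ate breakfast", "importance": 0.1, "time": now - 600},
    {"text": "invited friends to the party", "importance": 0.6, "time": now - 7200},
]
print(retrieve(memories, "who is coming to the party", now))
```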
The AI Scientist
The AI Scientist is an early attempt at automating the research process, using AI to generate ideas, write code, run experiments, analyze results, and draft scientific papers. Given a broad research direction and a simple code template, it explores possible improvements, tests them, and documents its findings—all with minimal human input. It even evaluates its own work through an automated review system, mimicking aspects of peer review.

In one experiment, it proposed a new method for diffusion models, modified the code to implement it, ran the tests, and wrote a full research paper detailing its results. While it occasionally made errors in reasoning or execution, it consistently produced papers at a cost of around $15 per study, demonstrating its ability to conduct independent research.
What makes this possible is its ability to remember, reflect, and iterate. It recalls past discoveries, revises flawed approaches, and refines its ideas over multiple attempts. While still early in development, this kind of system could eventually assist human researchers by automating repetitive tasks, generating hypotheses, and accelerating scientific progress.
ChatDev
ChatDev is a software development framework where multiple LLM-powered agents work together through structured conversations. Instead of handling software development in isolated phases, ChatDev uses a "chat chain" where specialized agents communicate in natural language and code to design, develop, and test software collaboratively.

Each agent has a defined role, such as CEO, CTO, programmer, reviewer, or tester. They interact through multi-turn dialogues, breaking the process into smaller subtasks. The design phase focuses on system planning in natural language, while the coding phase involves writing and refining code through iterative exchanges. The testing phase includes automated debugging, with agents providing structured feedback to fix errors and improve code quality.
To reduce coding hallucinations—where models generate incorrect or incomplete code—ChatDev introduces "communicative dehallucination." Before responding, agents proactively request clarifications, leading to more accurate outputs. Memory mechanisms manage short-term dialogue context and long-term project history to maintain coherence across development stages.
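The clarification handshake can be sketched as follows. Both agents are stand-in functions rather than model calls, and the required fields are an invented example of what a coder agent might need pinned down before writing code.

```python
REQUIRED_FIELDS = ["language", "function_name"]

def coder_agent(spec):
    missing = [f for f in REQUIRED_FIELDS if f not in spec]
    if missing:
        # Proactively request clarification instead of guessing details.
        return {"clarify": missing}
    return {"code": f"def {spec['function_name']}(): ...  # {spec['language']}"}

def instructor_agent(spec, questions):
    # Supplies the missing details the coder asked about.
    defaults = {"language": "python", "function_name": "main"}
    return {**spec, **{f: defaults[f] for f in questions}}

spec = {"task": "write an entry point"}
reply = coder_agent(spec)
if "clarify" in reply:
    spec = instructor_agent(spec, reply["clarify"])
    reply = coder_agent(spec)
print(reply["code"])
```

The round trip is the point: the coder refuses to invent unspecified details, which is the behavior ChatDev's dehallucination step is designed to encourage.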
Compared to single-agent approaches like GPT-Engineer, ChatDev outperforms in completeness, executability, and consistency. While it requires more computational resources, its structured communication model results in higher-quality software. The approach highlights how linguistic coordination between LLM agents can drive automated, multi-agent software development.
MetaGPT
MetaGPT is a multi-agent software development framework that follows Standardized Operating Procedures (SOPs) to structure collaboration between LLM-powered agents. Instead of relying on unstructured dialogue, as in ChatDev (Qian et al., 2023), agents in MetaGPT communicate through documents and diagrams. These structured outputs ensure that all necessary information is retained, reducing errors caused by irrelevant or missing content.
Each agent in MetaGPT has a specialized role, such as Product Manager, Architect, Engineer, or QA. The framework organizes tasks into a structured workflow, where agents follow predefined procedures to analyze requirements, design systems, generate code, and test implementations. A shared message pool allows agents to publish and subscribe to relevant updates, ensuring efficient information flow without unnecessary exchanges.

MetaGPT also introduces an "executable feedback" mechanism. After generating code, agents run tests and refine their outputs based on real execution results. This iterative debugging process reduces hallucinations and improves software reliability without requiring human intervention.
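The executable-feedback loop can be sketched as generate, run, and regenerate with the error in hand. `generate_code` is a canned stand-in for a model; a real system would include the failure message in the next prompt rather than branching on it.

```python
def generate_code(task, error=None):
    if error is None:
        return "def add(a, b):\n    return a - b"   # first attempt has a bug
    return "def add(a, b):\n    return a + b"       # revised after feedback

def run_tests(code):
    """Execute the generated code and return an error message, or None."""
    namespace = {}
    exec(code, namespace)
    try:
        assert namespace["add"](2, 3) == 5
        return None
    except AssertionError:
        return "add(2, 3) returned the wrong value"

error = None
for _ in range(3):
    code = generate_code("implement add", error)
    error = run_tests(code)
    if error is None:
        break
print("tests pass" if error is None else "still failing")  # → tests pass
```

Real execution results, rather than the model's own opinion of its code, are what make this feedback grounded.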
By enforcing structured communication and integrating real-time code testing, MetaGPT produces more accurate and coherent software solutions while minimizing inefficiencies in multi-agent collaboration.
Enhancing diagnostic accuracy through multi-agent conversations
Multi-agent AI systems are being explored as a way to improve diagnostic accuracy in medicine, particularly by addressing cognitive biases that can lead to misdiagnoses. Recent research tested an AI-driven multi-agent conversation framework in a clinical setting, where LLM-powered agents simulated the decision-making process of a medical team. Each agent played a specific role—one proposed an initial diagnosis, another acted as a critical challenger to question assumptions, a third facilitated discussion to prevent premature conclusions, and a fourth summarized findings.
The study evaluated 16 clinical cases where human doctors had previously made diagnostic errors due to cognitive biases. In 80 simulated diagnostic scenarios, the AI agents initially made incorrect diagnoses in every case. However, after multi-agent discussion, the accuracy of the top differential diagnosis increased to 71.3%, and the accuracy for the final two differential diagnoses reached 80.0%. These results suggest that multi-agent AI discussions could help refine medical decision-making, particularly in complex cases where initial diagnoses are misleading.
While still in the research phase, this approach shows promise for assisting clinicians by providing structured, multi-perspective analysis of patient cases. Future developments could integrate these AI agents into electronic medical records, offering real-time diagnostic support and helping doctors recognize and correct biases before they impact patient outcomes. However, challenges remain, including the need for rigorous validation in real-world clinical settings, ethical considerations, and ensuring that AI serves as an aid rather than a replacement for human expertise.
How Moody's is using multi-agent systems
Moody’s, a leading provider of financial analysis and credit ratings, assesses risk for businesses, governments, and investors worldwide. To enhance its research and evaluation processes, Moody’s is using multi-agent AI systems to automate financial research, credit risk assessment, and market analysis. Their system consists of a network of AI agents assigned specific roles, such as retrieving financial reports, analyzing creditworthiness, identifying emerging risks, and generating insights for investment decisions. These agents collaborate by cross-verifying data, refining outputs, and improving the accuracy of financial evaluations.
A key aspect of Moody's multi-agent approach is the use of LLMs as evaluators, where AI models assess the relevance and reliability of retrieved information. This helps ensure that the financial insights generated by AI agents align with industry standards and market realities. Additionally, Moody's employs a mixture-of-experts framework, where different AI agents specialize in specific tasks—such as corporate bond analysis, ESG risk assessment, or regulatory compliance—allowing for more precise and well-rounded evaluations.
By leveraging multi-agent AI, Moody's streamlines traditionally manual financial research processes, improving efficiency and consistency in credit ratings, risk analysis, and investment intelligence. This system allows for more scalable and automated financial decision-making while reducing human workload in data-heavy evaluations.
Multi-agent systems in customer service
Customer service is an ideal use case for multi-agent AI systems because it involves a mix of high-volume repetitive inquiries and complex information retrieval. Many customer interactions follow predictable patterns—such as tracking orders, processing refunds, or troubleshooting common issues—but still require pulling data from different sources and coordinating across multiple departments. Multi-agent systems are well-suited for this environment because they allow different agents to specialize in distinct tasks while working together to deliver seamless support.
Instead of relying on a single chatbot to handle everything, a multi-agent system breaks customer service into specialized roles. For example, in e-commerce and logistics, a system might include:
- A customer inquiry agent that manages initial interactions and routes requests.
- A payment processing agent that handles billing issues and transaction verification.
- A shipping agent that retrieves real-time tracking updates and resolves delivery problems.
- A returns & refunds agent that checks eligibility, processes returns, and updates inventory.
Each department can focus on developing and refining its own agent, making the system more scalable than a single-agent approach. Instead of building one AI model that must handle every type of customer request, companies can improve individual agents separately—upgrading the shipping agent with better tracking integrations or enhancing the payment processing agent to detect fraudulent transactions more effectively. This modular approach also makes maintenance easier, as specific agents can be updated or replaced without disrupting the entire system.
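A minimal routing layer over such specialist agents might look like the sketch below. Keyword matching stands in for an LLM classifier, and the handler functions are placeholders for the department-owned agents described above.

```python
HANDLERS = {
    "shipping": lambda msg: "shipping agent: here is your tracking update",
    "billing": lambda msg: "payment agent: reviewing the charge",
    "returns": lambda msg: "returns agent: checking refund eligibility",
}

ROUTES = {
    "track": "shipping", "deliver": "shipping",
    "charge": "billing", "payment": "billing",
    "refund": "returns", "return": "returns",
}

def route(message):
    """Dispatch an inquiry to the first matching specialist agent."""
    for keyword, agent in ROUTES.items():
        if keyword in message.lower():
            return HANDLERS[agent](message)
    return "inquiry agent: routing to a human representative"

print(route("Where is my delivery?"))
print(route("I want a refund"))
```

Because each handler is independent, a department can swap in a better shipping agent without touching billing or returns, which is exactly the modularity argument above.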
For more advanced setups, AI-powered escalation agents monitor frustration levels and determine when a human should intervene. Some companies also deploy AI supervisor agents that oversee multiple interactions, optimize workflows, and detect inefficiencies in real time.
By leveraging multi-agent AI, customer service operations can automate routine tasks, retrieve information more efficiently, and scale AI development across departments. This not only improves response times and accuracy but also allows human agents to focus on complex cases, enhancing overall customer satisfaction.
Building a multi-agent system
Now we will build a multi-agent system designed to help authors write articles. This system falls into the category of a "human-guided multi-agent system" and relies on the author to define the specific role of each agent, as well as to orchestrate when each agent is used during the writing process.
Below, we will write a Python app that creates a writing editor with a few features that enable integration with different agents. The editor itself contains basic text editing functionality along with functionality for saving and loading different articles. Additionally, it includes functionality for utilizing agents that can do specific tasks. For example, an author can create a "conclusion agent" that would generate a conclusion based on the body of the article.
Similarly, an author could create an "introduction agent" that generates an introduction based on the content of the article. The system would allow authors to easily switch between agents, integrate their outputs seamlessly into the article, and manage the workflow of their writing process. The agents themselves could be implemented using various AI models, allowing authors to choose the best model for each task.
Below is a Python implementation using Tkinter for the writing editor GUI and agent integration setup:
```python
import tkinter as tk
from tkinter import messagebox
from tkinter import scrolledtext
import os
import subprocess

SAVE_DIR = "notes"
AGENT_INSTRUCTIONS_DIR = "agent_instructions"

# Create necessary directories
for directory in (SAVE_DIR, AGENT_INSTRUCTIONS_DIR):
    os.makedirs(directory, exist_ok=True)


class MultilineDialog:
    def __init__(self, parent, title, prompt):
        self.result = None
        # Create dialog window
        self.dialog = tk.Toplevel(parent)
        self.dialog.title(title)
        self.dialog.transient(parent)
        self.dialog.grab_set()
        # Configure dialog
        self.dialog.geometry("400x300")
        # Add prompt label
        tk.Label(self.dialog, text=prompt, wraplength=380).pack(pady=5, padx=10)
        # Add text area
        self.text_area = scrolledtext.ScrolledText(
            self.dialog, wrap=tk.WORD, width=40, height=10, font=("Arial", 10)
        )
        self.text_area.pack(expand=True, fill='both', padx=10, pady=5)
        # Add buttons
        button_frame = tk.Frame(self.dialog)
        button_frame.pack(fill='x', padx=10, pady=5)
        self.ok_button = tk.Button(button_frame, text="OK", command=self._on_ok)
        self.ok_button.pack(side='right', padx=5)
        tk.Button(button_frame, text="Cancel", command=self._on_cancel).pack(side='right')
        # Center dialog on parent
        self.dialog.geometry("+%d+%d" % (
            parent.winfo_rootx() + 50, parent.winfo_rooty() + 50))
        # Bind Enter key to OK button
        self.dialog.bind('<Return>', lambda e: self._on_ok())
        # Set focus
        self.text_area.focus_set()
        # Wait for dialog
        self.dialog.wait_window()

    def _on_ok(self):
        self.result = self.text_area.get("1.0", "end-1c").strip()
        self.dialog.destroy()

    def _on_cancel(self):
        self.result = None  # Explicitly set result to None on cancel
        self.dialog.destroy()


class TextEditor:
    def __init__(self, parent):
        self.frame = tk.Frame(parent)
        self.frame.pack(side="left", expand=True, fill="both")
        # Text widget with efficient configuration
        self.text_widget = tk.Text(
            self.frame, wrap="word", font=("Arial", 17),
            undo=True, autoseparators=True, maxundo=-1
        )
        self.text_widget.pack(expand=True, fill="both")
        # Optimized event bindings
        self._setup_bindings()

    def _setup_bindings(self):
        """Set up optimized keyboard bindings."""
        for sequence, callback in [
            ('<Command-z>', self.undo),
            ('<Control-z>', self.undo),
            ('<Command-Shift-Z>', self.redo),
            ('<Control-y>', self.redo),
        ]:
            self.text_widget.bind_all(sequence, callback)

    def get_text(self):
        return self.text_widget.get("1.0", "end-1c")

    def set_text(self, text):
        self.text_widget.delete("1.0", tk.END)
        self.text_widget.insert("1.0", text)

    def get_selection(self):
        try:
            return self.text_widget.get(tk.SEL_FIRST, tk.SEL_LAST)
        except tk.TclError:
            return None

    def undo(self, event=None):
        try:
            self.text_widget.edit_undo()
        except tk.TclError:
            pass
        return "break"

    def redo(self, event=None):
        try:
            self.text_widget.edit_redo()
        except tk.TclError:
            pass
        return "break"


class WritingApp:
    def __init__(self, root):
        self.root = root
        self.root.title("Writing App")
        # Configure root for better performance
        self.root.update_idletasks()
        # Main layout
        self.frame = tk.Frame(root)
        self.frame.pack(fill="both", expand=True)
        self._init_components()
        self._load_initial_data()

    def _init_components(self):
        """Initialize all UI components efficiently."""
        self._init_notes_pane()
        self.editor = TextEditor(self.frame)
        self._init_quick_insert_pane()
        self._init_buttons()
        self._init_bindings()
        # Track current note
        self.current_note = None

    def _init_notes_pane(self):
        self.notes_frame = tk.Frame(self.frame, width=200, bg="#f0f0f0")
        self.notes_frame.pack(side="left", fill="y")
        tk.Label(self.notes_frame, text="Notes", font=("Arial", 12, "bold"),
                 bg="#000000").pack(pady=5)
        self.notes_listbox = tk.Listbox(self.notes_frame)
        self.notes_listbox.pack(fill="both", expand=True, padx=5, pady=5)
        # Optimize listbox selection handling
        self.notes_listbox.bind("<<ListboxSelect>>",
                                lambda e: self.root.after(50, self.switch_note))
        tk.Button(self.notes_frame, text="New Note",
                  command=lambda: self.root.after(1, self.create_new_note)).pack(pady=5)

    def _init_quick_insert_pane(self):
        self.quick_insert_frame = tk.Frame(self.frame, width=200, bg="#e0e0e0")
        self.quick_insert_frame.pack(side="right", fill="y")
        tk.Label(self.quick_insert_frame, text="Agents",
                 font=("Arial", 12, "bold"), bg="#000000").pack(pady=5)
        self.quick_insert_listbox = tk.Listbox(self.quick_insert_frame)
        self.quick_insert_listbox.pack(fill="both", expand=True, padx=5, pady=5)
        # Optimize quick insert handling
        self.quick_insert_listbox.bind(
            "<<ListboxSelect>>",
            lambda e: self.root.after(50, self.launch_chatbot_with_prompt))

    def _init_buttons(self):
        """Initialize buttons with optimized commands."""
        self.button_frame = tk.Frame(self.root)
        self.button_frame.pack(pady=5)
        buttons = [
            ("Chat with Selection", self.chat_with_selection),
            ("Save (Cmd+S / Ctrl+S)", self.save_current_note),
        ]
        for text, command in buttons:
            tk.Button(self.button_frame, text=text,
                      command=lambda cmd=command: self.root.after(1, cmd)
                      ).pack(side=tk.LEFT, padx=5)

    def _init_bindings(self):
        """Set up optimized keyboard bindings."""
        for sequence in ("<Command-s>", "<Control-s>"):
            self.root.bind(sequence,
                           lambda e: self.root.after(1, self.save_current_note))

    def _load_initial_data(self):
        """Load initial data efficiently."""
        self.root.after(100, self.refresh_notes_list)
        self.root.after(100, self.load_agents)

    def create_new_note(self):
        dialog = MultilineDialog(self.root, "New Note", "Enter note title:")
        if dialog.result:
            file_path = os.path.join(SAVE_DIR, f"{dialog.result}.txt")
            if not os.path.exists(file_path):
                with open(file_path, "w", encoding="utf-8"):
                    pass
            self.refresh_notes_list()

    def switch_note(self):
        selected_indices = self.notes_listbox.curselection()
        if not selected_indices:
            return
        selected_note = self.notes_listbox.get(selected_indices[0])
        file_path = os.path.join(SAVE_DIR, f"{selected_note}.txt")
        if os.path.exists(file_path):
            self.load_note(file_path, selected_note)

    def load_note(self, file_path, note_name):
        try:
            with open(file_path, "r", encoding="utf-8") as f:
                note_text = f.read()
            self.editor.set_text(note_text)
            self.current_note = note_name
        except Exception as e:
            messagebox.showerror("Error", f"Failed to load note: {e}")

    def save_current_note(self, event=None):
        if not self.current_note:
            return
        file_path = os.path.join(SAVE_DIR, f"{self.current_note}.txt")
        try:
            with open(file_path, "w", encoding="utf-8") as f:
                f.write(self.editor.get_text())
            messagebox.showinfo("Save", f"'{self.current_note}' saved successfully!")
        except Exception as e:
            messagebox.showerror("Error", f"Failed to save note: {e}")

    def refresh_notes_list(self):
        self.notes_listbox.delete(0, tk.END)
        notes = sorted(f.replace(".txt", "") for f in os.listdir(SAVE_DIR)
                       if f.endswith(".txt"))
        for note in notes:
            self.notes_listbox.insert(tk.END, note)

    def load_agents(self):
        self.quick_insert_listbox.delete(0, tk.END)
        inserts = sorted(f.replace(".txt", "")
                         for f in os.listdir(AGENT_INSTRUCTIONS_DIR)
                         if f.endswith(".txt"))
        for insert in inserts:
            self.quick_insert_listbox.insert(tk.END, insert)

    def chat_with_selection(self):
        selection = self.editor.get_selection()
        if not selection:
            messagebox.showwarning("Selection Error",
                                   "Please select some text before starting a chat.")
            return
        dialog = MultilineDialog(self.root, "Chat Prompt",
                                 "Enter any additional context or questions:")
        if dialog.result is not None:  # Only proceed if not canceled
            additional_note = dialog.result
            full_prompt = f"{selection}\n\n{additional_note}".strip()
            self.root.after(1, lambda: subprocess.Popen(
                ["python", "chat_app.py", full_prompt]))

    def launch_chatbot_with_prompt(self):
        selected_indices = self.quick_insert_listbox.curselection()
        if not selected_indices:
            return
        selected_text_key = self.quick_insert_listbox.get(selected_indices[0])
        file_path = os.path.join(AGENT_INSTRUCTIONS_DIR, f"{selected_text_key}.txt")
        try:
            with open(file_path, "r", encoding="utf-8") as f:
                quick_insert_text = f"##### {f.read().strip()} --------->"
            current_note_text = self.editor.get_text()
            dialog = MultilineDialog(self.root, "Additional Note",
                                     "Enter additional details (optional):")
            if dialog.result is not None:  # Proceed only if not canceled
                additional_note = dialog.result
                full_prompt = (f"{quick_insert_text}\n\n{current_note_text}"
                               f"\n\n{additional_note}").strip()
                # Launch chat_app.py with the prompt and agent ID
                self.root.after(1, lambda: subprocess.Popen([
                    "python",
                    "chat_app.py",
                    full_prompt,
                    selected_text_key  # Pass the agent ID as an additional argument
                ]))
                # Clear the selection in the quick_insert_listbox
                self.quick_insert_listbox.selection_clear(0, tk.END)
        except Exception as e:
            messagebox.showerror("Error", f"Failed to launch chat: {e}")


if __name__ == "__main__":
    root = tk.Tk()
    app = WritingApp(root)
    root.mainloop()
```
This Python application, built with Tkinter and the litellm library, integrates a functional AI chat system with a robust text editor. The editor efficiently handles large files, supports standard features like undo and redo, and enhances the writing process with agent-driven prompts and an integrated chatbot. Here's a screenshot of what the main editor looks like:

Users can manage multiple writing projects by saving their work as individual note files, allowing for easy switching between different documents. A key feature is the integration of agent instructions, which are essentially pre-written text templates designed to serve as prompts or starting points for generating content.
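The note management here is deliberately simple: each note is a flat .txt file in a save directory, and the note list is just the sorted filenames with the extension stripped. A minimal, self-contained sketch of that logic (the directory and helper name are my own, not from the app):

```python
import os
import tempfile

# Hypothetical stand-in for the app's SAVE_DIR
SAVE_DIR = tempfile.mkdtemp()

def list_notes(save_dir):
    """Return note titles: sorted .txt filenames with the extension stripped."""
    return sorted(
        f.removesuffix(".txt")
        for f in os.listdir(save_dir)
        if f.endswith(".txt")
    )

# Create a few empty note files, as the app does when a new note is created
for title in ("ideas", "draft", "agenda"):
    open(os.path.join(SAVE_DIR, f"{title}.txt"), "w", encoding="utf-8").close()

print(list_notes(SAVE_DIR))  # → ['agenda', 'draft', 'ideas']
```

Because the notes are plain files, they stay readable and portable outside the app, which keeps the design easy to debug.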
These agents are intended to be used in conjunction with an external chatbot. When an agent is selected, the application sends the agent's text, combined with any additional context or instructions provided by the user, to the chatbot. The chatbot then processes this information and may return expanded or modified text, allowing for features like content generation, summarization, or style adjustments.
This modular design using a separate chatbot process provides flexibility in selecting and integrating different chatbot implementations or AI models. The overall system aims to streamline and augment the writing process by combining structured text templates with the dynamic capabilities of a large language model.
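Conceptually, the prompt handed to the chatbot process is just string assembly: the agent's instructions are wrapped in a delimiter, then joined with the note text and any extra user context. A minimal sketch (the delimiter style mirrors the editor code above; the helper name is my own):

```python
def build_prompt(agent_instructions, note_text, additional_note=""):
    """Combine agent instructions, the current note, and optional user context
    into one prompt, using the same delimiter style as the editor app."""
    header = f"##### {agent_instructions.strip()} --------->"
    return f"{header}\n\n{note_text}\n\n{additional_note}".strip()

prompt = build_prompt("Fix any typos in the text.", "Teh quick brown fox.")
print(prompt.splitlines()[0])  # → ##### Fix any typos in the text. --------->
```

The delimiter makes it obvious to the model (and to anyone inspecting logged prompts later) where the agent's instructions end and the user's document begins.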
Here's a screenshot of what the chatbot looks like after I launched the "conclusion agent":

I'll share the chatbot script, which launches when an agent is selected. This extra flexibility matters because agents often make mistakes, and you need the ability to correct them mid-conversation. I use Weave inside the code to track the inputs and outputs of the run_inference function, which gives me a record of the data flowing in and out of the model. I can use that data later to improve the model so it ultimately writes the way I prefer.
Here's the code:
import tkinter as tk
from tkinter import scrolledtext, messagebox
from litellm import completion
import os
import sys
from functools import partial
import queue
import threading
import weave; weave.init("writing-agent")

# Set API Keys
os.environ["OPENAI_API_KEY"] = ""
os.environ["GEMINI_API_KEY"] = ""
os.environ["ANTHROPIC_API_KEY"] = ""


def load_last_model():
    try:
        with open("last_model.txt", "r") as f:
            return f.read().strip()
    except:
        return "gemini/gemini-1.5-flash"


@weave.op
def run_inference(model_id, messages, temperature=0.7, agent_id=None):
    """Dedicated function to handle model inference

    Args:
        model_id (str): The ID of the model to use
        messages (list): List of conversation messages
        temperature (float): Temperature parameter for generation
        agent_id (str): The ID of the agent to use (optional)

    Returns:
        str: The model's response text

    Raises:
        Exception: If there's an error during inference
    """
    try:
        response = completion(
            model=model_id,
            messages=messages,
            temperature=temperature
        )
        return response["choices"][0]["message"]["content"]
    except Exception as e:
        raise Exception(f"Inference error: {str(e)}")


class ChatApp:
    def __init__(self, root, initial_prompt=None, agent_id=None):
        self.root = root
        self.root.title("AI Chat App")
        self.agent_id = agent_id
        # Message queue for thread-safe UI updates
        self.message_queue = queue.Queue()
        self._init_ui()
        self._setup_event_handling()
        # Chat history
        self.conversation = []
        # Processing flag
        self.is_processing = False
        # If an initial prompt is provided, schedule it
        if initial_prompt:
            self.root.after(100, self.send_initial_prompt, initial_prompt)

    def _init_ui(self):
        """Initialize UI components efficiently."""
        # Main frame
        self.main_frame = tk.Frame(self.root)
        self.main_frame.pack(fill=tk.BOTH, expand=True, padx=10, pady=10)
        # Model selection frame
        self._init_model_frame()
        # Chat display
        self._init_chat_display()
        # Input field
        self._init_input_field()
        # Buttons
        self._init_buttons()

    def _init_model_frame(self):
        """Initialize model selection frame."""
        model_frame = tk.Frame(self.main_frame)
        model_frame.pack(fill=tk.X, pady=(0, 10))
        tk.Label(model_frame, text="Model ID:").pack(side=tk.LEFT)
        self.model_entry = tk.Entry(model_frame)
        self.model_entry.pack(side=tk.LEFT, fill=tk.X, expand=True, padx=(5, 0))
        self.model_entry.insert(0, load_last_model())

    def _init_chat_display(self):
        """Initialize chat display with optimized configuration."""
        self.chat_display = scrolledtext.ScrolledText(
            self.main_frame,
            wrap=tk.WORD,
            width=80,
            height=20,
            font=("Arial", 25),
            state=tk.DISABLED,
            setgrid=True  # Improved text rendering
        )
        self.chat_display.pack(fill=tk.BOTH, expand=True, pady=(0, 10))
        # Configure tags for formatting
        self.chat_display.tag_configure("user", foreground="yellow")
        self.chat_display.tag_configure("assistant", foreground="lightgreen")
        self.chat_display.tag_configure("system", foreground="red")

    def _init_input_field(self):
        """Initialize input field with optimized configuration."""
        self.input_field = scrolledtext.ScrolledText(
            self.main_frame,
            wrap=tk.WORD,
            width=80,
            height=5,
            font=("Arial", 10),
            undo=True,
            maxundo=50
        )
        self.input_field.pack(fill=tk.BOTH, expand=True, pady=(0, 10))
        # Bind Ctrl+Enter with delay to prevent double-firing
        self.input_field.bind("<Control-Return>",
            lambda e: self.root.after(10, self._handle_send))

    def _init_buttons(self):
        """Initialize buttons with optimized event handling."""
        button_frame = tk.Frame(self.main_frame)
        button_frame.pack(fill=tk.X)
        # Send button
        self.send_button = tk.Button(
            button_frame,
            text="Send (Ctrl+Enter)",
            command=lambda: self.root.after(10, self._handle_send)
        )
        self.send_button.pack(side=tk.LEFT, padx=(0, 5))
        # Clear button
        tk.Button(
            button_frame,
            text="Clear Chat",
            command=lambda: self.root.after(10, self.clear_chat)
        ).pack(side=tk.LEFT)

    def _setup_event_handling(self):
        """Set up event handling and message processing."""
        self.root.after(100, self._process_message_queue)

    def _process_message_queue(self):
        """Process messages in the queue."""
        try:
            while True:
                message = self.message_queue.get_nowait()
                self._update_chat_display(message)
        except queue.Empty:
            pass
        finally:
            # Schedule next check
            self.root.after(100, self._process_message_queue)

    def _update_chat_display(self, message):
        """Update chat display with message."""
        text, tag = message
        self.chat_display.config(state=tk.NORMAL)
        self.chat_display.insert(tk.END, text, tag)
        self.chat_display.see(tk.END)
        self.chat_display.config(state=tk.DISABLED)

    def _handle_send(self):
        """Handle send message action."""
        if self.is_processing:
            return
        user_message = self.input_field.get("1.0", tk.END).strip()
        if not user_message:
            return
        # Clear input field immediately
        self.input_field.delete("1.0", tk.END)
        # Disable input during processing
        self.is_processing = True
        self.send_button.config(state=tk.DISABLED)
        self.input_field.config(state=tk.DISABLED)
        # Start processing in separate thread
        threading.Thread(
            target=self._process_message,
            args=(user_message,),
            daemon=True
        ).start()

    def _process_message(self, user_message):
        """Process message in background thread."""
        try:
            # Save selected model
            model_id = self.model_entry.get().strip()
            with open("last_model.txt", "w") as f:
                f.write(model_id)
            # Add user message to conversation and display
            self.conversation.append({"role": "user", "content": user_message})
            self.message_queue.put((f"You:\n{user_message}\n", "user"))
            self.message_queue.put(("#" * 50 + "\n", "system"))
            # Get AI response using dedicated inference function, now passing agent_id
            ai_reply = run_inference(model_id, self.conversation, agent_id=self.agent_id)
            # Add AI response to conversation and display
            self.conversation.append({"role": "assistant", "content": ai_reply})
            self.message_queue.put((f"AI:\n{ai_reply}\n", "assistant"))
            self.message_queue.put(("#" * 50 + "\n", "system"))
        except Exception as e:
            self.message_queue.put((f"Error: {str(e)}\n", "system"))
        finally:
            # Re-enable input
            self.root.after(0, self._enable_input)

    def _enable_input(self):
        """Re-enable input controls."""
        self.is_processing = False
        self.send_button.config(state=tk.NORMAL)
        self.input_field.config(state=tk.NORMAL)
        self.input_field.focus_set()

    def send_initial_prompt(self, prompt):
        """Send initial prompt when app starts."""
        if prompt:
            self._process_message(prompt)

    def clear_chat(self):
        """Clear chat history and display."""
        self.conversation = []
        self.chat_display.config(state=tk.NORMAL)
        self.chat_display.delete("1.0", tk.END)
        self.chat_display.config(state=tk.DISABLED)


if __name__ == "__main__":
    # Get initial prompt and agent_id from command line if provided
    initial_prompt = sys.argv[1] if len(sys.argv) > 1 else None
    agent_id = sys.argv[2] if len(sys.argv) > 2 else None
    root = tk.Tk()
    # Configure root window
    root.title("AI Chat App")
    root.geometry("800x600")
    # Make the window resizable
    root.rowconfigure(0, weight=1)
    root.columnconfigure(0, weight=1)
    app = ChatApp(root, initial_prompt, agent_id)
    # Start the application
    root.mainloop()
This Python chatbot uses litellm to interact with various LLMs (OpenAI, Gemini, Anthropic), and threading for smooth operation. User input is processed, sent to the selected LLM, and the response is displayed. A chat history is maintained, and model selection persists between sessions. The application incorporates error handling and uses a message queue to prevent UI freezes during LLM interactions. An initial prompt can be provided via the command line.
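The responsiveness trick described above — run the slow model call on a worker thread and hand results back through a queue that the UI thread polls — can be sketched without Tkinter at all. In this minimal sketch, `slow_llm_call` is a stand-in for the blocking litellm request, not a real API:

```python
import queue
import threading
import time

message_queue = queue.Queue()

def slow_llm_call(prompt):
    """Stand-in for a blocking LLM completion call."""
    time.sleep(0.1)
    return f"reply to: {prompt}"

def worker(prompt):
    # Runs off the main thread, so the UI event loop never blocks
    message_queue.put(("assistant", slow_llm_call(prompt)))

threading.Thread(target=worker, args=("hello",), daemon=True).start()

# In the Tkinter app the main thread polls with get_nowait() via root.after();
# here we simply block until the worker delivers its result
tag, text = message_queue.get(timeout=2)
print(tag, text)  # → assistant reply to: hello
```

The queue is the only channel between threads, which is important: Tkinter widgets are not thread-safe, so the worker never touches the UI directly.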
You might notice that I also pass the agent ID to the chatbot script, and on into the run_inference function. I don't actually need agent_id inside run_inference itself; it's there so that Weave can track and analyze each agent's performance. As shown in the screenshot, each call to run_inference includes an agent_id parameter (like "typo_checker"), which Weave records automatically. This enables powerful filtering: in the UI, you can filter all runs by "inputs.agent_id equals typo_checker" to analyze that specific agent's performance, response times, and patterns.


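The reason an unused parameter becomes useful is that a tracing decorator like weave.op records every call's inputs before the function runs, so anything you pass in — used or not — becomes filterable metadata. A toy stand-in (my own `op` decorator and in-memory log, purely to illustrate the pattern, not Weave's actual internals):

```python
import functools

call_log = []  # Stand-in for Weave's trace store

def op(fn):
    """Toy stand-in for @weave.op: record every call's inputs, then run it."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        call_log.append({"fn": fn.__name__, "inputs": dict(kwargs)})
        return fn(*args, **kwargs)
    return wrapper

@op
def run_inference(messages, agent_id=None):
    # agent_id is never used here -- it exists only so the tracer records it
    return "ok"

run_inference(messages=[{"role": "user", "content": "hi"}], agent_id="typo_checker")
run_inference(messages=[{"role": "user", "content": "hi"}], agent_id="conclusion")

# Filtering by inputs.agent_id, as in the Weave UI
typo_runs = [c for c in call_log if c["inputs"].get("agent_id") == "typo_checker"]
print(len(typo_runs))  # → 1
```

This is why adding agent_id to the function signature, even unused, costs nothing and pays off later when you slice runs per agent.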
If you’re interested in trying this tool out yourself, I definitely recommend it! Note that you will need to add your API keys inside the chatbot script. After using the chatbot, your conversations will show up inside Weave, which is a great fit for projects where the data is valuable and needs to be tracked for future model improvement.
Conclusion
Multi-agent AI systems represent a shift in how artificial intelligence is applied to complex tasks. Rather than relying on a single model to manage every aspect of a process, these systems distribute responsibilities across specialized agents, allowing for greater efficiency, accuracy, and scalability. While challenges remain—particularly in areas like self-correction, long-term memory, and managing unpredictability—continued advancements in AI coordination strategies are making these systems more practical and reliable.
The real value of multi-agent AI will come from its ability to integrate into real-world workflows, whether in research, software development, customer service, or beyond. Fully autonomous systems may still be limited in their effectiveness, but hybrid approaches that balance AI autonomy with human oversight are already proving useful. As these systems evolve, they could redefine how we approach automation, enabling AI to function not just as a tool but as a collaborative partner in problem-solving and innovation.