
Introducing OrchestrAI: Building Custom Autonomous Agents with Prompt Chaining

Autonomous agents are a rising frontier in AI, tackling complex tasks by breaking them down into simpler steps. In this report, we'll delve into the current state of agents and introduce a new custom framework, OrchestrAI.
Created on August 11 | Last edited on September 5

Introduction

This report introduces OrchestrAI, a new custom framework that lets users build pipelines through intelligent prompting and agent chaining. Before you dig in, we recommend watching the demo below to see how easy it is to build a low-code autonomous agent in two minutes.


Background

Within the scope of recent technological innovations, few developments have garnered as much admiration and scrutiny as the rise of large language models (LLMs). Unlike other "transformative inventions" such as crypto, NFTs, and the Metaverse, conversational AI has captured the attention of humanity with its immediate real-world impact, felt on both the positive and negative sides of the coin.
The history of conversational AI traces its roots back to the 1960s with ELIZA, often dubbed the first chatbot. Developed at the MIT Artificial Intelligence Laboratory by Joseph Weizenbaum, ELIZA simulated a psychotherapist and could engage in basic conversation, primarily through pattern matching and substitution. Another early chatbot, PARRY, was designed in the early 1970s by Kenneth Colby. While ELIZA mimicked a psychotherapist, PARRY simulated a person with paranoid schizophrenia. Their "interaction" was one of the first instances of chatbots communicating with each other.

Many wrongly believed that human-level AI was right around the corner. But despite the early promise, limitations in computational speed and memory hindered progress, leading to periods of stagnation known as the AI winters.
Fast forward a few decades to the 2010s: research and industry witnessed a dramatic shift as conversational AI moved from rule-based systems to more sophisticated models powered by neural networks. OpenAI's GPT-2 in 2019 showcased the capabilities of generating coherent and diverse paragraphs of text, giving a taste of the potential of large neural models in conversation.
In 2020, OpenAI's GPT-3, with its 175 billion parameters, demonstrated human-like text generation capabilities with broader contextual understanding. The process of Reinforcement Learning from Human Feedback (RLHF) helped create systems better aligned with human intention. Coupled with a sleek user interface, ChatGPT became the fastest-growing user application of all time, signing up 100 million users in just its first two months.
Today, a vibrant community thrives, teeming with dreamers, hackers, and builders. With numerous open-source conversational models emerging, like the LLaMA family from Meta, there's a collective endeavor to harness the immense potential of these new companions, balancing innovation with mindful consideration of ethical dimensions and hurdles.
Looking to the future, a clear trajectory emerges. Systems capable of continual learning, interaction, and multi-step complex problem solving are no longer decades away. In fact, early versions are already here. We refer to these programs as Autonomous Agents, or simply Agents, and they're scarily easy to build!
Read on to see just how easy that is.
Let's think step by step

Chain of Thought prompting

One interesting feature of LLMs is that they perform better in multi-step reasoning when prompted with "let's think step by step." This method of prompting, called Zero-shot Chain of Thought (Zero-shot-CoT), substantially improves performance on complex reasoning tasks.
For example, on the MultiArith arithmetic dataset, Zero-shot-CoT lifts accuracy of the GPT-3 model from 3.3% to 19%, and the InstructGPT model from 17.7% to 78.7% (Kojima et al., 2022). CoT improves reasoning across diverse tasks like math, common sense, and symbolic reasoning, suggesting fundamental reasoning abilities already exist in large language models.

From Large Language Models are Zero-Shot Reasoners, p. 8. LLMs show a clear prompt dependence in their problem-solving abilities.
Providing an LLM with additional examples of the task at hand (one-shot or few-shot prompting) makes its responses more likely to be aligned with the chosen task.
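To make this concrete, here's a minimal sketch of Zero-shot-CoT prompting using the pre-1.0 openai Python client. The model choice and the question are illustrative assumptions, not taken from the paper:

import openai  # assumes the pre-1.0 openai client and an OPENAI_API_KEY in the environment

question = (
    "A juggler can juggle 16 balls. Half of the balls are golf balls, "
    "and half of the golf balls are blue. How many blue golf balls are there?"
)

# Zero-shot-CoT: append the trigger phrase so the model reasons before answering.
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": question + "\n\nLet's think step by step."}],
    temperature=0,
)
print(response["choices"][0]["message"]["content"])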

Limitations of prompt engineering

Prompt engineering has emerged as a powerful method to interact with and control the behavior of large language models (LLMs). By fine-tuning how we ask questions, it is possible to guide LLMs toward providing more desirable answers.
However, while the approach has its advantages, it comes with inherent limitations:
Memory Context Constraints: LLMs have a fixed context length, defined by the number of tokens they can process in a single step. This limitation impacts their ability to remember or consider lengthy input sequences. If the input surpasses this limit, parts of the data can be truncated, leading to potential memory lapses and degraded performance. As such, the amount of information we can feed to the model in one go is constrained.
Lack of Self-Reflection: Traditional LLMs predict the next token based on the previous context without the ability to evaluate the optimality of their outputs. Without multi-step processing or feedback loops, there's no mechanism for the model to self-assess or refine its answers. In practice, that means ensuring the best possible response often requires external judgment or trial and error.
No Native Tool Delegation: LLMs inherently lack the capability to delegate specific sub-tasks to separate modules or systems. Their design focuses on predicting text based on prior context. Unlike human problem solvers who might use various tools or consult multiple sources for different parts of a task, LLMs handle all input within the same framework, which might not always be ideal for multifaceted problems.
Trade-off Between Generality and Performance: While LLMs are designed to be generalists, able to tackle a wide range of topics, this generality can sometimes come at the cost of specificity. When too much generality is expected, the model's performance on a specific task might decline. This limitation underscores the reason many users, especially those seeking solutions to complex coding or debugging tasks, initiate new interactions for each distinct question. By doing so, they ensure the model remains focused and offers the most relevant output for the task at hand.



Autonomous Agents




The current state of affairs

Autonomous Agents were first described by T. Bösser in the International Encyclopedia of the Social & Behavioral Sciences in 2001:
"Autonomous agents are software programs which respond to states and events in their environment independent from direct instruction by the user or owner of the agent, but acting on behalf and in the interest of the owner."
To grasp the underlying mechanics, it's essential to understand that these agents can systematically break down a high-level task or "goal" into a sequence of sub-tasks. They continue to work on these tasks, refining and iterating as needed until the overarching goal is deemed satisfied. Without the existence of AI models capable of acting as a kind of central nervous system, the vision of an autonomous computer program remained purely theoretical. However, with the release of GPT-4 in 2023, things are moving quickly.
Increased parameter count and sophisticated RLHF have birthed a cognitive powerhouse. Though OpenAI has been relatively tight-lipped about the intricate details and inner workings of GPT-4, it's evident that the model's improvements have made it a potent tool for processing and understanding complex tasks.
One of the standout features of GPT-4 is its refined interaction with system prompts. These prompts act as directives that guide the model's behavior, independent of the actual prompt in the conversation. Where GPT-3.5 often veers off track and misinterprets instructions, GPT-4 displays heightened acuity. It not only comprehends the specific instructions embedded in the system prompt but also maintains a tighter grip on the broader context.
For instance, if we were to guide the AI through a planning phase for a project, GPT-4 could be tasked with brainstorming, evaluating feasibility, and outlining potential roadblocks. These steps, executed sequentially or simultaneously, allow for a much more comprehensive approach to problem-solving.
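As a hedged illustration of that split between directive and request (not code from OrchestrAI; the prompts are made up), the system message fixes the planning behavior while the user message carries the task:

import openai  # pre-1.0 openai client assumed

plan = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        # The system prompt sets the role: brainstorm, evaluate feasibility, outline roadblocks.
        {"role": "system", "content": "You are a project planner. Brainstorm an approach, "
                                      "evaluate its feasibility, and outline potential roadblocks."},
        {"role": "user", "content": "Plan the launch of a small internal analytics dashboard."},
    ],
)
print(plan["choices"][0]["message"]["content"])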

Existing attempts at autonomous systems

One of the first implementations to surface this year was "Auto-GPT," developed by Toran Bruce Richards, the founder of the video game company Significant Gravitas Ltd. The agent, upon receiving a simple goal, embarks on the task of decomposing it into manageable sub-tasks. The process involves leveraging the capabilities of OpenAI's GPT-4 and, when necessary, accessing the internet and various other tools to carry out these tasks. Notably, after its release, Auto-GPT quickly rose to prominence, marking its territory as the top trending repository on GitHub, currently sitting at around 150K stars.
Building on the momentum sparked by Auto-GPT, several other initiatives were launched. One such derivative project is "BabyAGI." At its core, BabyAGI operates on a continuous loop encompassing task retrieval, execution, creation, and reprioritization. Based on the outcome of the previous task and the overarching objective, new tasks are crafted and the task list is reordered.

The BabyAGI execution flow, from https://github.com/yoheinakajima/babyagi

A final project worth mentioning is the famed gpt-engineer, able to generate an entire codebase from a single prompt. After you specify the requirements for your repository, the agent identifies "unclear areas" and iterates through them, requesting clarification on different points from the user. Once all these points are addressed, the agent formulates a plan, generates each file, and extracts them into a folder. This is the tip of the iceberg for what software engineering will become as we move forward in the age of AI.

Actually realizing autonomy is hard

The idea of fully autonomous agents is appealing but presents numerous complexities and challenges:
Unpredictability of LLMs: Large language models (LLMs) like ChatGPT can produce varied responses to the same prompt. While diverse outputs can be beneficial, they pose challenges when consistent decisions are needed from the agent.
The Objective Dilemma: While breaking tasks into sub-tasks is crucial, ensuring that these sub-tasks align with the main objective can be tricky. There's a risk of the agent focusing on details and missing the bigger picture.
Dynamic Problems Demand Dynamic Solutions: Our world constantly evolves, and static approaches can quickly become obsolete. Autonomous agents must be adaptable, catering to ever-changing problems. Existing implementations rely on a fixed architecture to solve tasks.
Infinite Action Space & Intuition Gap: In a vast sea of potential actions, how does an agent choose? Humans rely on biases, experiences, and intuition. Autonomous agents face challenges prioritizing actions and often lack the intuitive decision-making that humans possess. A true autonomous agent should be able to write its own code, and interact with its own programming to evolve over time.
Dependency on External Inputs: While autonomous agents aim to act independently, they often lean heavily on external inputs, especially in the initial stages. These inputs can be corrupted, biased, or just plain wrong. Ensuring the agent can discern quality information from the noise is crucial but challenging.



OrchestrAI, a simple framework for spawning autonomous agents

With the table setting out of the way, it's time to introduce you to OrchestrAI. OrchestrAI is a simple, intuitive way to build your own low-code autonomous agents. Functionally, it's a system of modular agents, controlled by a customizable pipeline, where you can design specific agent chains that work for the problem you're trying to solve.
We'll break down how it works in further detail below, but if you'd like to try OrchestrAI on your own, please check out the OrchestrAI GitHub repo to get started. 

System Overview

OrchestrAI provides a framework for building autonomous agents quickly and easily, without the need for large adaptations to the codebase. During development, I've kept the following principles in the foreground:
Modular and extensible architecture: Different capabilities are encapsulated in modules. Users can easily extend OrchestrAI by creating new modules for capabilities like computer vision, speech processing etc.
Customizable pipelines: The sequence of modules to execute for a given task is specified in a pipeline configuration file. Users can create pipelines tailored to their specific use case by ordering and configuring modules.
Agent automation: OrchestrAI executes the modules in a pipeline automatically based on the workflow specified. This removes the need for manual orchestration of different AI capabilities.
Human in the loop: Integration of human input through modules like 'human intervention' and 'code modification.' Users can provide additional info to guide the agent if needed, which aims to reduce the impact of some of the previously defined issues.
Before we go into the technical details, it's worth noting that all of the previously mentioned issues with autonomous agents still plague OrchestrAI. The idea is to provide a groundwork for quick agent development, enabling easy comparison of different execution chains. Also, please keep an eye on your usage: agents consume tokens very, very quickly.

Technical Implementation

For a full understanding, I recommend cloning the repository and working through the scripts yourself!

Pipelines

The core functionality of this system is the orchestration of different modules by a predefined pipeline. Let's look at an example pipeline for a task planning agent, and break down how this is used to build the agent:
pipeline:
  - module: start_module
    inputs: []
    output_name: request
  - module: task_planner
    inputs: [request]
    output_name: task_plan
  - module: scrutinizer
    model_config:
      model: 'gpt-3.5-turbo'
      temperature: 0.7
    inputs: [request, task_plan]
    output_name: scrutinized_task_plan
  - module: enhancer
    inputs: [request, scrutinized_task_plan, task_plan]
    output_name: enhanced_task_plan
  - module: markdown_converter
    model_config:
      model: 'gpt-3.5-turbo'
      temperature: 0.3
    inputs: [enhanced_task_plan]
    output_name: markdown_plan
Using pipeline.yml, different modules are strung together in orchestrate.py to form a DAG (directed acyclic graph): a chain of execution steps without any cycles. A module is a function specified in the modules.py script. If multiple inputs to a module are specified in the pipeline, these are concatenated into a single input string before being passed to the module.
Additionally, you can specify the model_config for each module. This allows you to control the OpenAI settings on a per module basis. For example, easier tasks might warrant gpt-3.5-turbo whereas more creative tasks might need higher temperatures. The default model_config is specified in the config.yml file.
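As a rough sketch of what this orchestration amounts to (not the actual orchestrate.py; the run_pipeline helper and the exact module signatures are simplifications for illustration), the pipeline can be executed by walking the module list, concatenating the named inputs, and storing each output under its output_name:

import yaml
import modules  # modules.py from the repo; module names map to functions here

def run_pipeline(path="pipeline.yml"):
    with open(path) as f:
        steps = yaml.safe_load(f)["pipeline"]

    outputs = {}
    for step in steps:
        # Concatenate the named inputs into a single prompt string.
        prompt = "\n\n".join(outputs[name] for name in step.get("inputs", []))
        # Look up a function matching the module name; fall back to the generic chameleon module.
        fn = getattr(modules, step["module"], None)
        if fn is not None:
            result = fn(prompt)
        else:
            result = modules.chameleon(prompt, step["module"], model_config=step.get("model_config"))
        outputs[step["output_name"]] = result
    return outputs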
You can also specify a supplement for each module: additional context that the model may find useful. This is essential for some modules, such as a translation task, where you don't want a new system prompt per language.
pipeline:
  - module: start_module
    inputs: []
    output_name: request
  - module: translator
    model_config:
      model: 'gpt-3.5-turbo'
      temperature: 0.2
    inputs: [request]
    supplement: "Spanish"
    output_name: translation
The pipeline and orchestration system allows for easy experimentation with different types of autonomous agents, without the need to rewrite the agent itself. One potential application is a form of pipeline sweep, where a pipeline is repeated with a range of temperature values so you can evaluate the effectiveness of different variations.
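One simple way to set up such a sweep is to generate pipeline variants with different temperatures and run each one. This is a minimal sketch, not a feature shipped with OrchestrAI:

import copy
import yaml

with open("pipeline.yml") as f:
    base = yaml.safe_load(f)

# Write one pipeline variant per temperature; each can then be run and compared.
for temperature in (0.0, 0.3, 0.7, 1.0):
    variant = copy.deepcopy(base)
    for step in variant["pipeline"]:
        step.setdefault("model_config", {})["temperature"] = temperature
    with open(f"pipeline_temp_{temperature}.yml", "w") as out:
        yaml.safe_dump(variant, out)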

Modules

OrchestrAI uses modules to encapsulate different capabilities. Modules are nothing more than functions that handle interaction with the LLM, with or without additional capabilities.
Most pipelines will begin with a start_module, which simply accepts user input from the terminal. However, this could easily be swapped out to load data from an external source instead.
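For example, a drop-in replacement that reads the request from a file instead of the terminal might look like the following (a sketch only; the real start_module's signature may differ):

def start_module_from_file(prompt=None, module_name="start_module", model_config=None):
    # Instead of prompting on the terminal, load the initial request from disk.
    with open("request.txt") as f:
        return f.read().strip()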
The most basic module is the `chameleon` module, a generic module used to make an OpenAI call with a custom system prompt. This allows you to easily create new modules by simply adding a system prompt to the system_prompts folder. The chameleon module is invoked whenever a system prompt exists but no function matching the module name does.
def chameleon(prompt, module_name, model_config=None):
    ai = AI(module_name, model_config=model_config)
    response = ai.generate_response(prompt)
    log_action({"module": module_name, "input": prompt, "output": response})
    return response["response_text"]
The example task planning pipeline is composed purely of chameleon modules. With this alone, you can decompose a single requirement into multiple tailored steps, experimenting with different configurations and system prompting strategies. Any custom JSON can be logged as an action to a memory file created per run of the agent, detailing all steps taken during the pipeline.
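For instance, to add a hypothetical summarizer module, you would drop a summarizer prompt into the system_prompts folder (the filename is assumed to match the module name) and append a step like this to the pipeline; the module name and settings here are illustrative:

  - module: summarizer
    model_config:
      model: 'gpt-3.5-turbo'
      temperature: 0.3
    inputs: [markdown_plan]
    output_name: plan_summary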

Software Engineering and Self Debugging

Not all modules need to be basic text-in, text-out. OrchestrAI comes with a provided pipeline for spawning software engineering agents, equipped with self-debugging, codebase modification, and documentation creation.
Let's take a look at the engineering_pipeline.yml.
pipeline:
  - module: start_module
    inputs: []
    output_name: request
  - module: code_planner
    inputs: [request]
    output_name: code_plan
  - module: engineer
    inputs: [request, code_plan]
    output_name: code
  - module: debugger
    model_config:
      model: 'gpt-3.5-turbo-16k'
      temperature: 0.7
    inputs: [request, code]
    output_name: working_codebase
  - module: modify_codebase
    inputs: [working_codebase]
    output_name: modified_codebase
  - module: create_readme
    inputs: [modified_codebase]
    output_name: readme
In this pipeline, we use the code_planner, engineer, debugger, and modify_codebase modules in succession to create a full repository based on an initial request. The model has been instructed to stick to a particular response format, as all generated code is extracted via regex matching. The debugger module attempts to run the generated code, intercepting errors and passing them back into a specific debugging LLM. The modify_codebase module can be used to make changes to any existing code, just by providing feedback on the result.
The current implementation is instructed to always generate a main.py script to run, so it is tailored to writing Python-based systems. This is a limitation that all current autonomous agents face, as generalising to run and debug any desired code means covering a very wide action space.
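The details live in the repository, but the self-debugging loop boils down to something like the sketch below. The ask_debugger helper, the extraction regex, and the retry limit are stand-ins for illustration, not the actual OrchestrAI code:

import re
import subprocess

CODE_BLOCK = re.compile(r"```(?:python)?\n(.*?)```", re.DOTALL)

def extract_code(response_text):
    # The engineer/debugger modules are instructed to return code in a fixed format; grab the first block.
    match = CODE_BLOCK.search(response_text)
    return match.group(1) if match else response_text

def debug_loop(code, ask_debugger, max_attempts=3):
    for _ in range(max_attempts):
        with open("main.py", "w") as f:
            f.write(code)
        run = subprocess.run(["python", "main.py"], capture_output=True, text=True)
        if run.returncode == 0:
            return code  # the generated program ran cleanly
        # Feed the traceback back to a debugging LLM and try again with its fix.
        code = extract_code(ask_debugger(code, run.stderr))
    return code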
This being said, the pipeline can still be used for fun projects, quick debugging and early exploration of ideas. Here's an example of working brickbreaker game code, where not a single character was typed or modified by human hands.




Tools

Tool use within OrchestrAI is a work in progress, and enabling it reduces agent capability somewhat, since the tool prompt consumes part of the context. By default, tools_enabled: False. Future versions will offer the ability to use tools directly in the pipeline, and to configure tool permissions at a module level.
In the context of OrchestrAI, tools are functions invoked when the LLM deems it necessary.
The model gains this ability using a custom tool_prompt.
You can use tools. Tools are functions you can use via tag activation only.
Tool use is first person, "I will use..."
You can use multiple tools, the order of the tools is the order they are executed in.
Activate a tool with the tag wrapped around a JSON object. It must be perfect JSON for the regex to work.
<@ { "tool_name": "TOOL_NAME_HERE", "tool_input_1" : "etc" } @>
We then provide a list of all available tools to the model. Currently, this is a single tool for image generation using DALL-E (Sorry to all you Midjourney nuts out there, I know it's awful in comparison).
GENERATE_IMAGE - Used to generate an image from a text prompt - Images are stored on generated_outputs/generated_images/{choose}.png
<@ { "tool_name": "GENERATE_IMAGE", "filename": choose , "prompt" : "Descriptive image prompt" } @>
Any tool tags are then captured in the response and executed by the tool_manager.py, passing the JSON to the relevant function from the tools folder. This method is extendable to multiple tools, such as internet searching or the ability to run commands in the terminal.
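As a sketch of how that capture-and-dispatch step can work (the tool registry and the import below are assumptions about the repo layout, not verbatim tool_manager.py code):

import json
import re

from tools import generate_image  # assumed location of the DALL-E tool function

TOOL_TAG = re.compile(r"<@\s*(\{.*?\})\s*@>", re.DOTALL)
TOOLS = {"GENERATE_IMAGE": generate_image}

def run_tools(response_text):
    # Execute every tool tag in the order it appears in the response.
    for raw in TOOL_TAG.findall(response_text):
        call = json.loads(raw)  # must be valid JSON, as the tool prompt warns
        tool = TOOLS[call.pop("tool_name")]
        tool(**call)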
As an example, we can invoke a different agent, story_pipeline, to create an interesting narrative, providing matching images for the story (ensure tools_enabled: True in config.yml first).
pipeline:
  - module: start_module
    inputs: []
    output_name: request
  - module: storyteller
    inputs: [request]
    output_name: story
  - module: illustrator
    inputs: [request, story]
    output_name: illustration

Logging with Weights & Biases Prompts

OrchestrAI has built-in tracking with W&B Prompts, using traces to record individual agent runs for easy comparison and record keeping.
W&B Prompts allows you to track and visualize the inputs, outputs, execution flow, and intermediate results of LLM chains.
The main components are:
  • Trace Table: Provides an overview of the inputs and outputs of an LLM chain. Shows whether the chain executed successfully and any error messages.
  • Trace Timeline: Displays the execution flow of the LLM chain, color-coded by component type. This allows inspection of inputs, outputs, and metadata for each step.
To use this feature, simply set wandb_enabled: True in the config.yml file. On a new run of agent.py, we will activate a new agent trace, as the parent of the chain trace, which is used to represent the pipeline.
if wandb_enabled:
    globals.agent_start_time_ms = round(datetime.datetime.now().timestamp() * 1000)

    # create a root span for the agent.
    root_span = Trace(
        name="Agent",
        kind="agent",
        start_time_ms=globals.agent_start_time_ms,
        end_time_ms=globals.agent_start_time_ms,
        metadata={"pipeline_name": pipeline_name},
    )

    # the agent calls into a pipeline, so we create a chain span.
    globals.chain_span = Trace(
        name=pipeline_name,
        kind="chain",
        start_time_ms=globals.agent_start_time_ms,
        end_time_ms=globals.agent_start_time_ms,
    )

    root_span.add_child(globals.chain_span)  # add the chain span as a child of the root span
We can then add logging statements where we want in the modules, using the functions defined in wandb_logging.py. The chain (pipeline) is passed in as a parent of each of the modules.
if wandb_enabled:
    wb.wandb_log_llm(response, ai.model, ai.temperature, parent=globals.chain_span)
The script ensures that the start and end times of the traces are kept as a consistent unbroken chain, resulting in a trace timeline for our executed pipeline in the wandb platform.
def wandb_log_llm(data, model, temperature, parent):
    runtime = data["llm_end_time_ms"] - data["llm_start_time_ms"]

    llm_span = Trace(
        name=data["module_name"],
        kind="llm",
        status_code=data["status_code"],
        status_message=data["status_message"],
        metadata={
            "temperature": temperature,
            "token_usage": {"total_tokens": data["token_count"]},
            "runtime_ms": runtime,
            "module_name": data["module_name"],
            "model_name": model,
            "max_tokens": data["max_tokens"],
            "top_p": data["top_p"],
            "frequency_penalty": data["frequency_penalty"],
            "presence_penalty": data["presence_penalty"],
        },
        start_time_ms=parent._span.end_time_ms,
        end_time_ms=data["llm_end_time_ms"],
        inputs={"system_prompt": data["system_prompt"], "query": data["prompt"]},
        outputs={"response": data["response_text"]},
    )
    parent.add_child(llm_span)

    parent.add_inputs_and_outputs(
        inputs={"query": data["prompt"]},
        outputs={"response": data["response_text"]},
    )
    parent._span.end_time_ms = data["llm_end_time_ms"]

    llm_span.log(name="pipeline_trace")
Here's an example of what the trace timeline might look like for an engineering agent which has been asked to obtain information about the laptop it is running on.


You can click on the different colored panels in the trace timeline to view the component system prompts, and evaluate inputs and outputs across the entire agent chain. For example, in the above chain, we can see that run_main failed at the end of modify_codebase, initiating the debugger module.
Finally, W&B offers an additional trace kind for tools. If tools are enabled in OrchestrAI, tool logs are recorded as children of the modules. Here, we can see the story_pipeline creating a thrilling tale about the Paperclip Maximiser and illustrating it in a single chain.


"'A futuristic rendering of barren planets and moons in the process of being terraformed, gradually turning into fields of paperclips"



Discussion

Near term capabilities

So here's the thing. I've spent a lot of time playing around with, evaluating and building agents, and I've come to a conclusion:
Autonomous agents are no longer an engineering and research problem, but a product problem. The technology has matured, the paradigm has shifted. It's not about what can be built anymore but what should be built.
It's remarkable how easy it was for a single person to build a powerful system without the backing of a large team or organization. If one developer can get this far, a tech giant like Apple could clearly develop a fully fledged autonomous agent.
The technology exists, and it's robust. What we're grappling with now is figuring out the right format that enables these agents to solve a wide variety of complex tasks. The key will be in designing interfaces and experiences that not only capture the human need but also encapsulate the capability to adapt and evolve with user demands.
For example, a next stage of development for OrchestrAI would be the delegation of the pipeline creation to yet another AI system. This system would understand the framework and have the ability to modify and extend the source code. When confronted with a new task, this agent would figure out an optimal pipeline, creating the necessary modules on the go. This "self building" system would be endowed with even greater power than its predecessor.
Also, the cost could go down a little...


A very different kind of future

As autonomous agents evolve, the traditional point-and-click interface will become increasingly obsolete. Imagine conversing naturally with your software, not only asking it to carry out tasks but also brainstorming ideas and solving complex problems together. Natural language will serve as a universal interface, making software more intuitive and approachable. Apps and programs will be created on the fly to fulfill a request.
The reach of autonomous agents will extend into the business realm in a dramatic way. Imagine AI systems autonomously handling your marketing strategy, optimizing advertising spends, customer engagement, and even customer service. They will not just analyze data but implement actions based on insights, effectively running entire departments. In legal settings, AIs could sift through regulations and case law to make preliminary judgments, simplifying the judicial process.
There will come a point where the number of agents at any given moment could surpass the number of humans alive. These agents will interact with each other in complex ways, creating a sort of digital ecosystem. This won't just be a collection of isolated algorithms but a web of interconnected systems, each augmenting the others' capabilities, offering a level of efficiency and complexity that we can hardly fathom today.

Are we all doomed?

Is a truly autonomous agent possible? The answer is clearly yes; the fact that you're reading this is living proof. Many researchers believe the transformer architecture will not carry us all the way to AGI, and that new architectures are required to reach the "general intelligence" stage.
For the first time in history, humanity is replacing the only thing that truly makes us unique, our ability to think about a problem, and then solve it. The cost of intelligence is rapidly dropping, and it's not quite clear what this post-biological-cognitive age will look like.
Too much discussion in the media and elsewhere focuses on the possibility of malevolent AI, rampaging unchecked and eventually ending all life, the sort of existential crisis we should aim to avoid at all costs. However, this risk is far overshadowed by the danger of autonomous systems with no consciousness but a misaligned goal, causing chaos no matter what obstacles they may face. Controlled by bad actors, these systems will be able to manipulate individuals, access secure systems, and design weapons, all in secrecy. How do we defend against this?
No idea, but you should be polite to your agents just in case.

