
Falcon OS Part 2: LLM Decision Making with ReAct

In this blog post we explore the concept of LLM decision making by examining the ReAct framework.
Created on December 15|Last edited on March 5
This is the second post in my series about Falcon OS, an open-source operating system powered by a large language model (LLM). In the first post, I introduced the Falcon 40B model and my vision for the project. Let's now dive into one of the core challenges: how do we get the Falcon LLM to make decisions and take actions intelligently?

Introduction

Imagine an AI-powered operating system (like Falcon OS) receiving a user request like this: "Summarize this earnings report, find any discrepancies, and email the key points to my accountant." It's a complex task, isn't it? Multiple skills are required, from reading and understanding a complex document, to flagging potential issues, to crafting a professional email. These are the kinds of tasks any LLM operating system (LLM OS) needs to be able to handle.
But how can we prompt an LLM to make these kinds of decisions? They go way beyond just writing summaries or generating creative text. The model needs to reason and decide to act. How does it choose the right software tools? What format should it use to communicate with those tools? That's the challenge we need to solve for a truly LLM-powered operating system.
In this blog post, we'll explore how to guide LLMs towards this kind of decision-making. We'll look at the ReAct framework and see how well state-of-the-art AI models already do at this particular task of decision making.



ReAct: Prompting LLMs to reason and make decisions

While we all might love getting creative text snippets or summaries from LLMs, their true power lies in their potential to solve problems. But to do that effectively, a model needs to go beyond generating text. If Falcon OS can't make decisions – choosing the right tool and then using it – its usefulness will be limited.
So, the question becomes: how do we teach an LLM to make those choices? That's where the ReAct (Reasoning + Acting) framework comes in.
Let's compare it to the more common Chain-of-Thought (CoT) approach, which encourages LLMs to break a problem down into smaller, step-by-step reasoning. CoT is helpful but operates within the LLM's internal knowledge base, which can limit its accuracy and its ability to interact dynamically with the real world.
A classic chain-of-thought example
CoT reasoning is a static black box in that the model uses its own internal representations to generate thoughts and is not grounded in the external world, which limits its ability to reason reactively or update its knowledge. This can lead to issues like hallucination and error propagation over the reasoning process.
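In code form, a chain-of-thought prompt can be as simple as appending a reasoning cue to the question. The arithmetic question below is an illustrative stand-in, not an example from the ReAct paper:

```python
# A minimal chain-of-thought prompt: the model reasons step by step,
# but only from its internal knowledge -- no external tools involved.
cot_prompt = (
    "Q: A store has 23 apples. It uses 20 for lunch and buys 6 more. "
    "How many apples does it have now?\n"
    "A: Let's think step by step."
)
```

A typical completion walks through the intermediate arithmetic before stating the answer, which is exactly the "static black box" behavior described above: the reasoning trace never leaves the model.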
Introduced in October 2022, the ReAct framework builds upon chain-of-thought thinking by adding "action" and "observation" steps. Essentially, ReAct lets the LLM interact with external tools and information, broadening its problem-solving abilities and helping it see that a problem isn't solved solely within its own language generation, but might require choosing and using external tools.
The primary goal of ReAct is to bridge the gap between language generation capabilities of LLMs and the need for real-world action and problem-solving. It aims to transform LLMs from generating only text to orchestrating complex tasks that involve multiple steps and tools. An LLM that leverages the ReAct Framework breaks down the process like so:
  • Understand the problem: The LLM carefully analyzes the task or question you've given it. It looks for keywords and patterns to determine what kind of problem it's facing.
  • Consider the toolkit: The LLM reviews the tools (or agents) it has access to. This could include everything from a calculator and a Python interpreter to a database lookup system or even the ability to search the web. Each tool has its strengths.
  • Choose the best tool: This is where it gets interesting. The LLM needs to decide which tool is the most likely to help solve the problem. Should it use a calculator for a math question, or consult a database for a factual inquiry?
  • Craft the input: Once it's decided on the tool, the LLM needs to translate your original question or task into the specific input format that the chosen tool requires. This can get complex, as it requires an understanding of how different tools function.
  • Execute and interpret: The LLM directs a tool handler to execute the chosen tool with the crafted input. This might involve making an API call, running a calculation, or performing a search. The tool handler returns a response, which the LLM must parse and interpret to understand the result.
  • Generate a natural language response: Finally, with the insights gained from the tool's output, the LLM crafts a response in natural language that addresses the user's original query or task.
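The steps above can be sketched as a simple loop. Everything here is a hypothetical illustration – the tool names, the `parse_action()` helper, and the `Action:`/`Action Input:` output format are assumptions for the sketch, not a specific library's API:

```python
import re

# Toy tool registry standing in for the LLM's "toolkit".
TOOLS = {
    "calculator": lambda expr: str(eval(expr, {"__builtins__": {}})),  # toy only
    "search": lambda q: f"<search results for {q!r}>",                 # stub
}

def parse_action(llm_output: str):
    """Parse a Thought/Action/Action Input block emitted by the LLM."""
    m = re.search(r"Action:\s*(\w+)\s*\nAction Input:\s*(.+)", llm_output)
    # No match means the model produced a final answer instead of a tool call.
    return (m.group(1), m.group(2).strip()) if m else None

def run_tool(name: str, tool_input: str) -> str:
    """Execute the chosen tool and return its observation text."""
    return TOOLS[name](tool_input)

# One turn of the loop: the model's raw output is parsed, the tool runs,
# and the observation would be appended to the transcript for the next turn.
step = parse_action(
    "Thought: I need to compute this.\n"
    "Action: calculator\n"
    "Action Input: 2 + 2"
)
if step is not None:
    observation = run_tool(*step)  # fed back to the LLM as "Observation: ..."
```

In a real system, the loop would repeat – calling the LLM again with the observation appended – until the model emits a final answer instead of another tool call.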

The ReAct flow, with the LLM having access to tools like internet search, a database, and a document store.

A ReAct example

The ReAct framework sounds good in theory, but does it actually work? Let's look at a specific example where a state-of-the-art LLM (in our example we will use Gemini Ultra) successfully navigates the challenge and demonstrates true decision-making in the real world.
We'll ask the LLM about the distance between two cities. Instead of relying on its internal knowledge to answer the question, we want to prompt the model to use an external tool. We will give it two tools to choose from: internet search and the Wolfram Alpha API. Ideally, the model chooses the Wolfram Alpha API, which is much better suited to this kind of factual query.

Prompt engineering

First we need to set the stage with prompt engineering. Think of the prompt as mission control. It gives the LLM an explicit task ("Solve the user's query") and a list of available tools ("Calculator, Wolfram Alpha, Python Interpreter, etc."). This sets the framework for decision-making:
The instructions to prompt an LLM to leverage the ReAct framework
This prompt will hopefully entice the LLM to not rely on its internal knowledge but use the tools we provide.
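A prompt along these lines can be expressed as a template. The tool names and exact wording below are illustrative assumptions, not the post's actual prompt:

```python
# Hypothetical ReAct-style system prompt with a task, a tool list,
# and an explicit output format for tool calls.
REACT_PROMPT = """Solve the user's query. You have access to these tools:

internet_search: searches the web for general information.
wolfram_alpha: computes factual and mathematical queries.

Use the following format:

Thought: reason about what to do next
Action: the tool to use, one of [internet_search, wolfram_alpha]
Action Input: the input to pass to the tool
Observation: the tool's result
... (Thought/Action/Action Input/Observation can repeat)
Final Answer: the answer to the user's query

Do not answer from memory; prefer the tools above.

Query: {query}"""

prompt = REACT_PROMPT.format(query="How far is Kirkwall (UK) from Plymouth (UK)?")
```

The explicit format matters: the tool handler can only execute a call it can parse, so the prompt pins down exactly how the model should announce its chosen tool and input.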

How to answer a user query with ReAct

The query we will use is "How far is Kirkwall (UK) from Plymouth (UK)?" This is a query that the LLM could potentially try to answer using its internal knowledge, but it would probably hallucinate, since these are two lesser-known cities in the UK. Therefore, we want to prompt the model to use a tool, following the instructions we specified above.
Once we submit the query with the above instructions this is what happens:
  • Understanding: The LLM recognizes this isn't just about chatting or generating a summary – this is a question about distance. It needs a specific, factual answer that likely requires real-world knowledge.
  • Toolkit Review: The LLM scans its available tools. It could try a simple calculator, but that won't factor in things like routes and terrain. A general web search might be too broad. Then, it sees Wolfram Alpha, a knowledge engine designed to compute answers to specific factual queries.
  • Tool Selection: This is where the magic happens! The LLM determines that Wolfram Alpha's strengths – precise calculations and access to real-world data – make it the most likely tool to provide an accurate answer.

  • Crafting the input: Now that it has decided to use Wolfram Alpha, the LLM creates an input for the tool, i.e., a search term that can be passed to Wolfram Alpha:

  • Execute and interpret: The LLM directs a tool handler to call the Wolfram Alpha API with the crafted query. The tool handler returns Wolfram Alpha's response, which the LLM parses and interprets to extract the distance.
Using Wolfram Alpha
  • Generate a natural language response: Finally, using the distance returned by Wolfram Alpha, the LLM crafts a natural language answer to the user's original query.
The final response
This example isn't just about getting a correct answer; it's about the process. The LLM showcased its ability to analyze a problem, consider a range of tools, and strategically select the one best suited for the task. Additionally, it reformatted the input, demonstrating understanding of both natural language and the syntax of the chosen tool. That's impressive decision making and problem solving.
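Put together, the model's side of this exchange is a short trace. The wording below is an illustration of the format, and the observation is a placeholder rather than Wolfram Alpha's actual output:

```python
# Illustrative trace for the Kirkwall-Plymouth query; the observation
# text is a placeholder, not real Wolfram Alpha output.
trace = (
    "Thought: This is a factual distance question; Wolfram Alpha fits best.\n"
    "Action: wolfram_alpha\n"
    "Action Input: distance from Kirkwall UK to Plymouth UK\n"
    "Observation: <distance returned by Wolfram Alpha>\n"
    "Final Answer: Kirkwall is <distance> from Plymouth.\n"
)
```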




Conclusion & next steps

The ReAct framework shows incredible promise for building intelligent systems. However, in this blog post we leveraged Gemini Ultra, one of the most capable LLMs out there. In the next blog post, we will investigate whether the Falcon-40B model can leverage the ReAct framework equally well. And if not, we will explore how to adapt and improve the model so that it can play the role of the central processing unit of an LLM OS.
Iterate on AI agents and models faster. Try Weights & Biases today.