
Enhancing LLM Agent Performance through Dynamic Plugin Selection and W&B Prompts


Show me the Code!

Open In Colab



Demo

A preview of what we'll be making today:

[Embedded W&B Trace panel: a trace table and timeline for an AgentExecutor run (23,041 ms) with nested LLMChain, ChatOpenAI, and tool spans. Example input: "Solve this equation: 2x + 3 = 7 immediately". Example output: "Arr, the solution to the equation 2x + 3 = 7 be x = 2, matey!". Status: SUCCESS.]


Introduction

Consider the challenges faced by a popular e-commerce platform that caters to a diverse range of industries.
With the goal of personalizing customer service across all service domains and for hundreds of thousands of customers, large language models (LLMs) are an attractive solution. Specifically, these models could power a chatbot capable of tailoring suggestions for each individual customer.
This e-commerce platform prefers to utilize an LLM agent as the core reasoning mechanism for their chatbot, contrasting with a more traditional 'chain' approach. With chains, the behavior is hardcoded or pre-scripted, characterized by specific prompts and responses that are then fed into other LLM calls with their distinct prompts and responses, resulting in a 'chain' of LLM calls resolving to a predetermined outcome.
Alternatively, LLM-based agents perpetually call the LLM until specified stopping criteria are satisfied. This affords considerable flexibility in defining an agent’s prompt, assuming the stopping criteria are appropriately formulated. This configuration allows the agent to 'think' through input queries, deconstructing the query into a series of LLM calls that gradually move the agent towards its stopping criteria. This 'thinking' process enables agents to utilize tools.
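In pseudocode, this loop might be sketched roughly as follows (a simplified illustration of the pattern, not LangChain's actual implementation; all function names here are placeholders):
# Conceptual sketch of an agent loop (not LangChain's actual implementation):
# keep calling the LLM, running the tool it picks, and feeding the observation
# back in, until the model emits a final answer or a step limit is reached.
def run_agent(call_llm, parse_step, call_tool, question, max_steps=10):
    scratchpad = ""
    for _ in range(max_steps):
        llm_output = call_llm(question, scratchpad)
        step = parse_step(llm_output)  # ("finish", answer) or ("act", tool_name, tool_input)
        if step[0] == "finish":
            return step[1]
        _, tool_name, tool_input = step
        observation = call_tool(tool_name, tool_input)
        scratchpad += llm_output + f"\nObservation: {observation}\nThought: "
    raise RuntimeError("Agent did not reach its stopping criteria within the step limit")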
Here's an illustration of a standard agent template:
# Set up the base template
template = f"""
System Rules:
{system_rules}

Query Transform Rules (only do this for the initial input query!):
{query_transform_rules}

Task:
Answer the following questions as best you can, but speaking as {speaker} might speak. You have access to the following tools:

{{tools}}

Use the following format:

Question: the input question you must answer
Thought: you should always think about what to do
Action: the action to take, should be one of [{{tool_names}}]
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can repeat N times)
Thought: I now know the final answer
Final Answer: the final answer to the original input question

Begin!
Remember to speak as {speaker} when giving your final answer.
{speaker_rules}

Question: {{input}}
{{agent_scratchpad}}"""
The Zero-shot agent prompt template has been adjusted to facilitate testing of:
  • Different system rules (contextualizing the industry/application focus of the agent)
  • Different decomposition tactics for input queries (making it akin to one-shot)
  • Different speakers (varying tone and restrictions)
Given the constraints of the knowledge base used as training data for the LLM, the e-commerce platform cannot directly use a stock LLM like OpenAI’s GPT-4 for their agent. The agent would lack many contemporary details essential for an adequate customer experience, such as current store inventory, prevailing customer trends, and other external information necessary for personalizing customer interactions.
To equip our agents with access to these external details, we introduce tools, similar to plugins. These tools are Python functions with designated roles, which can include Google Search, Database lookup, Python REPL, among other API-like behaviors.
Here is an example of a tool:
from langchain.agents import Tool

def search_function(query: str) -> str:
    # Stub; in practice this would call a real search API.
    pass

tools = [
    Tool.from_function(
        func=search_function,
        name="Search",
        description="useful for when you need to answer questions about current events",
        # coroutine= ... <- you can specify an async method if desired as well
    ),
]
Consider a fashion retail arm that sells a broad spectrum of clothing and accessories via physical outlets and a strong online platform. The e-commerce platform determines that the chatbot required for this industry must contextually suggest clothing to users based on:
  • Local weather conditions
  • Current clothing inventory
  • Currency conversion for price presentation
Each of these conditions can be viewed as a separate plugin that would be created and managed. The Agent is aware of these plugins through the human-readable descriptions we provide for each tool.
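As a rough sketch (the tool names, descriptions, and stub functions below are illustrative, not the platform's actual plugins), those three conditions could map to tools like these:
# Illustrative stubs; a real deployment would back these with live APIs.
def get_weather(location: str) -> str:
    pass

def search_inventory(query: str) -> str:
    pass

def convert_currency(query: str) -> str:
    pass

fashion_tools = [
    Tool.from_function(func=get_weather, name="Weather",
                       description="useful for checking current and forecasted weather for a location"),
    Tool.from_function(func=search_inventory, name="Inventory",
                       description="useful for finding clothing and accessories currently in stock"),
    Tool.from_function(func=convert_currency, name="CurrencyConverter",
                       description="useful for presenting prices in the customer's local currency"),
]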
💡 Note: plugins and tools are similar, but they differ in structure, as OpenAI uses a different configuration schema than LangChain. Below is an excerpt of an OpenAI plugin's OpenAPI spec for a Weather plugin:
openapi: 3.0.1
info:
  title: Weather
  description: Allows users to fetch current and forecasted weather information based on location. You MUST ALWAYS convert the plugin response to the units that are most useful to your user, when in doubt assume USA/English units.
  version: 'v1.2'
servers:
  - url: https://weather--vicentescode.repl.co
paths:
  /weathernow:
    get:
      operationId: getWeatherNow
      summary: Get the current weather information based on city, state, and country. You MUST ALWAYS convert the plugin response to the units that are most useful to your user, when in doubt assume USA/English units.
      parameters:
        - in: query
          name: city
          schema:
            type: string
          required: true
          description: The city name.
        - in: query
          name: state
          ...
        - in: query
          name: country
          ...
      responses:
        "200":
          description: OK
          content:
            application/json:
              schema:
                type: object
                additionalProperties: true
...
To convert an OpenAI plugin spec into Tools, LangChain's Natural Language API Toolkits (NLAToolkits) are used, which enable LangChain agents to plan and combine calls across various endpoints, including calls into the OpenAI plugin ecosystem.
Given the vastness of industries and plugins to manage, the e-commerce platform can curate and maintain a collection of approved plugins that are compatible with both OpenAI and Langchain applications, and can be utilized for any downstream LLM-based agent.
To manage DataOps and LLMOps, we employ Weights & Biases. Here, W&B Artifacts are used to track datasets, models, dependencies, and results at each stage of LLM pipelines. Artifacts provide a comprehensive and auditable history of changes to your files. Artifacts are either an input of a run or an output of a run, where a run encompasses a session of our LLM usage. Common artifacts include complete training sets and models. Data can be directly stored into artifacts, or artifact references can be used to point to data in other systems like Amazon S3, GCP, or your proprietary systems.
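For instance, a catalog that already lives in object storage could be tracked by reference rather than uploaded; a minimal sketch, assuming a hypothetical S3 bucket:
import wandb

run = wandb.init(project="langchain-plugin-retrieval-demo", name="track_catalog_by_reference")
catalog_art = wandb.Artifact(name="customer_catalog", type="dataset")
catalog_art.add_reference("s3://my-bucket/catalog/")  # hypothetical S3 location, tracked by reference
run.log_artifact(catalog_art)
run.finish()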




As the e-commerce platform’s customer-service bots find success in the market, the scope of the bot may grow, which in turn increases the number of plugins that need to be pulled down from the plugin store to keep the bot's information up to date. As a result, given the model's token limits, the LLM prompt template may no longer be able to accommodate all of the additional tool-description context.
To manage this, a vector store can be used to dynamically retrieve the tools relevant to a given input query. This dynamic retrieval allows an effectively unbounded number of plugins to be made available to an agent by choosing which `n` tools should be used for each input, where the retrieval is based on our embedding scheme. It also decouples plugin management from LLM prompt template management, which allows for better development and auditability of the key aspects of our LLM agent. These plugins and prompts are both easily stored in Weights & Biases Artifacts for versioning and portability.
To better exemplify the above, consider a user asking our LLM agent for fashion recommendations for a pub crawl in Philadelphia, with a request like:
"I'm a man going out with friends to a pub crawl today in Philadelphia. What should I wear?"
The implemented system should ideally perform the following steps:
  1. Determine from the input query the most relevant tools for solving it. As above, it would be expected to find tools for:
  • Local weather conditions
  • Current clothing inventory
  • Currency conversion for price presentation (if relevant)
  2. Transform the request into a set of actions usable by the selected plugins.
  • First, the agent decides to check the weather in Philadelphia using the Weather plugin.
  • Next, it determines what kind of clothing would be suitable given the weather.
  • Lastly, it decides to get a list of trendy male outfits from the Inventory plugin.
  3. For each action, the agent uses the corresponding plugin (tool), sends a request (tool input), and gets a response (observation). The agent's chosen action and tool input are extracted from the LLM output using an appropriate output parser.
  4. After executing all actions and making all observations, the agent compiles a comprehensive response. The final answer is detected and extracted using the same output parser.
  • For example, it might respond with, "Given the pleasant weather in Philadelphia today, a trendy outfit for your pub crawl could be a crisp white linen shirt, comfortable dark jeans, and a light jacket. Add a touch of sophistication with a leather bracelet and a pair of aviator sunglasses. Here are some suggestions from our collection..."
  5. The suggestions are then returned to the user, tailored to their request and the current weather conditions in Philadelphia.

Show me the code, again!

Let's walk through the code:


import os
os.environ["LANGCHAIN_WANDB_TRACING"] = "true"
os.environ["WANDB_PROJECT"] = "langchain-plugin-retrieval-demo"
Set W&B autologging with minimal code.
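For reference, the cells that follow assume imports roughly like these (module paths match the LangChain release current when this report was written and may have moved in newer versions):
import json
import wandb

from langchain.chat_models import ChatOpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.docstore.document import Document
from langchain.tools.plugin import AIPlugin
from langchain.agents.agent_toolkits import NLAToolkit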


run = wandb.init(project=os.environ["WANDB_PROJECT"], name="save_company_plugin_registry")
plugin_registry_art = wandb.Artifact(
    name="company_plugin_registry",
    type="plugins",
    description="Registry to hold our list of all available plugins",
)
plugin_registry_art.add_file(all_plugins_fp)
run.log_artifact(plugin_registry_art)
Serialize and version our plugin registry saved at all_plugins_fp.


all_plugins_fp = run.use_artifact('company_plugin_registry:latest').get_path("chatgpt_plugins.json").download()

# Load from artifact
with open(all_plugins_fp, "r") as json_file:
    chatgpt_plugins = json.load(json_file)

# Process details for easy retrieval and investigation
AI_PLUGIN_DETAILS = {}
for plugin in chatgpt_plugins["items"]:
    # …
    AI_PLUGIN_DETAILS[plugin_manifest["name_for_model"]] = plugin_manifest
Load and subset the latest plugins relevant to the FashionGPT agent as AI_PLUGIN_DETAILS.
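For orientation, each value in AI_PLUGIN_DETAILS is a plugin manifest. A hypothetical entry might look like the following, where only the keys are taken from the surrounding code and the values are invented for illustration:
# Hypothetical example of one manifest stored in AI_PLUGIN_DETAILS; only the
# keys are drawn from the surrounding code, the values are invented here.
example_manifest = {
    "name_for_model": "Weather",
    "description_for_model": "Allows users to fetch current and forecasted weather information based on location.",
    "spec_url": "https://weather--vicentescode.repl.co/.well-known/ai-plugin.json",  # assumed manifest URL
}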


embeddings = OpenAIEmbeddings()
docs = [
    Document(
        page_content=detail["description_for_model"],
        metadata={"plugin_name": detail["name_for_model"]},
    )
    for detail in AI_PLUGIN_DETAILS.values()
]
vector_store = FAISS.from_documents(docs, embeddings)
# …
plugin_registry_art = wandb.Artifact(
    name="subset_plugin_registry",
    type="plugins",
    metadata=meta,
    description="Registry to hold our vector store of indexed plugins, and metadata of selected plugins.",
)
vector_store.save_local("faiss_index")
with open("selected_plugins.json", "w", encoding="utf-8") as outfile:
    json.dump(AI_PLUGIN_DETAILS, outfile, ensure_ascii=False, indent=4)
plugin_registry_art.add_dir("faiss_index", name="faiss_index/")
plugin_registry_art.add_file("selected_plugins.json")
run.log_artifact(plugin_registry_art)
run.finish()
Embed the description of each subsetted plugin, store the embeddings in a vector store, and serialize and version the vector store for ease of use in downstream applications.
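As a quick, hypothetical sanity check (not part of the original walkthrough), the vector store can be queried directly to see which plugin descriptions best match an input:
# Hypothetical check: which plugin descriptions are closest to this query?
hits = vector_store.similarity_search("What should I wear given today's weather?", k=3)
for doc in hits:
    print(doc.metadata["plugin_name"], "->", doc.page_content[:80])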


subset_plugin_registry_fp = run.use_artifact('subset_plugin_registry:latest').download()
# …
vector_store = FAISS.load_local(faiss_index_fp, embeddings)
with open(selected_plugins_fp, "r") as json_file:
    AI_PLUGIN_DETAILS = json.load(json_file)
AI_PLUGINS = [AIPlugin.from_url(detail["spec_url"]) for detail in AI_PLUGIN_DETAILS.values()]
toolkits_dict = {
    plugin.name_for_model: NLAToolkit.from_llm_and_ai_plugin(llm, plugin)
    for plugin in AI_PLUGINS
}
Load the latest selected plugins and vector store, and convert the OpenAI spec into the relevant LangChain tool.


retriever = vector_store.as_retriever()

def get_tools(query):
    # Get documents, which contain the plugins to use
    docs = retriever.get_relevant_documents(query)
    # Get the toolkits, one for each plugin
    tool_kits = [toolkits_dict[d.metadata["plugin_name"]] for d in docs]
    # Get the tools: a separate NLAChain for each endpoint
    tools = []
    for tk in tool_kits:
        tools.extend(tk.nla_tools)
    return tools
Define the logic to dynamically retrieve the relevant tools given an input query.
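A similarly hypothetical check of the full retrieval step might look like:
# Hypothetical sanity check: see which tools are retrieved for a sample query.
example_tools = get_tools("What should I wear for a pub crawl in Philadelphia today?")
print([t.name for t in example_tools])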


# Set up a prompt template
class CustomPromptTemplate(StringPromptTemplate):
    # The template to use
    template: str
    ############## NEW ######################
    # The list of tools available
    tools_getter: Callable

    def format(self, **kwargs) -> str:
        # Get the intermediate steps (AgentAction, Observation tuples)
        # Format them in a particular way
        intermediate_steps = kwargs.pop("intermediate_steps")
        thoughts = ""
        for action, observation in intermediate_steps:
            thoughts += action.log
            thoughts += f"\nObservation: {observation}\nThought: "
        # Set the agent_scratchpad variable to that value
        kwargs["agent_scratchpad"] = thoughts
        ############## NEW ######################
        tools = self.tools_getter(kwargs["input"])
        # Create a tools variable from the list of tools provided
        kwargs["tools"] = "\n".join([f"{tool.name}: {tool.description}" for tool in tools])
        # Create a list of tool names for the tools provided
        kwargs["tool_names"] = ", ".join([tool.name for tool in tools])
        return self.template.format(**kwargs)

prompt = CustomPromptTemplate(
    template=template,
    tools_getter=get_tools,
    # This omits the agent_scratchpad, tools, and tool_names variables because those are generated dynamically
    # This includes the intermediate_steps variable because that is needed
    input_variables=["input", "intermediate_steps"],
)
Create a custom prompt template to allow for the usage of this dynamic retrieval step for tool retrieval for the agent.


class CustomOutputParser(AgentOutputParser):

    def parse(self, llm_output: str) -> Union[AgentAction, AgentFinish]:
        # Check if agent should finish
        if "Final Answer:" in llm_output:
            return AgentFinish(
                # Return values is generally always a dictionary with a single `output` key
                # It is not recommended to try anything else at the moment :)
                return_values={"output": llm_output.split("Final Answer:")[-1].strip()},
                log=llm_output,
            )
        # Parse out the action and action input
        regex = r"Action\s*\d*\s*:(.*?)\nAction\s*\d*\s*Input\s*\d*\s*:[\s]*(.*)"
        match = re.search(regex, llm_output, re.DOTALL)
        if not match:
            raise ValueError(f"Could not parse LLM output: `{llm_output}`")
        action = match.group(1).strip()
        action_input = match.group(2)
        # Return the action and action input
        return AgentAction(tool=action, tool_input=action_input.strip(" ").strip('"'), log=llm_output)
Define the logic within the output parser to extract the agent's chosen action and tool input from the LLM output, and to detect the stopping criterion (the "Final Answer") described in the custom prompt template.
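For reference, the prompt template, output parser, and agent definitions above and below assume imports along these lines (paths for the contemporaneous LangChain release; they may have moved since):
import re
from typing import Callable, Union

from langchain import LLMChain
from langchain.agents import AgentExecutor, AgentOutputParser, LLMSingleActionAgent
from langchain.prompts import StringPromptTemplate
from langchain.schema import AgentAction, AgentFinish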


llm = ChatOpenAI(temperature=0, model=model)
# LLM chain consisting of the LLM and a prompt
llm_chain = LLMChain(llm=llm, prompt=prompt)
tool_names = [tool.name for tool in tools]
agent = LLMSingleActionAgent(
    llm_chain=llm_chain,
    output_parser=output_parser,
    stop=["\nObservation:"],
    allowed_tools=tool_names,
)
agent_executor = AgentExecutor.from_agent_and_tools(agent=agent, tools=tools, verbose=True)
Create the agent using the custom LLM prompt and the custom output parser.
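The cell above also references a few names defined elsewhere in the notebook; a minimal sketch of that setup, with an illustrative model name and query:
# Illustrative setup for the names referenced above (not the exact notebook values).
model = "gpt-4"  # assumed model name
output_parser = CustomOutputParser()
tools = get_tools("I'm a man going out with friends to a pub crawl today in Philadelphia. What should I wear?")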


query="I'm a man going out with friends to a pub crawl today in Philadelphia. What should I wear”
result = agent_executor.run(query)
print(result)
Utilize the agent as expected.


To run the above code, please follow the provided link:

Open In Colab




Results

The output of our agent tends to be very verbose, which makes it difficult to prompt engineer and prompt tune: there are many intermediate observations that need to be scrutinized, especially as the agent's plugins and responsibilities scale.

To better manage this, we'll use W&B Prompts, a suite of LLMOps tools built for the development of LLM-powered applications. W&B Prompts can be used to visualize and inspect the execution flow of LLMs, analyze the inputs and outputs of LLMs, view intermediate results, and securely store and manage prompts and LLM chain/agent configurations.
W&B Prompts supports a tool called Trace. Trace allows for tracking and visualization of the inputs and outputs, execution flow, model architecture, and any intermediate results of LLM chains/agents. These include all the complex plugin interactions that occur in the example agent provided:

[Embedded W&B Trace panel showing the agent run, including its plugin interactions.]

Trace consists of three main components:
  • Trace table: Overview of the inputs and outputs of a chain.
  • Trace timeline: Displays the execution flow of the chain and is color-coded according to component types.
  • Model architecture: View details about the structure of the chain and the parameters used to initialize each component of the chain.
The Trace Table provides an overview of the inputs and outputs of a chain. The trace table also provides information about the composition of a trace event in the chain, whether or not the chain ran successfully, and any error messages returned when running the chain.
The Trace Timeline view displays the execution flow of the chain and is color-coded according to component types. Select a trace event to display the inputs, outputs, and metadata of that trace.
The Model Architecture view provides details about the structure of the chain and the parameters used to initialize each component of the chain. Click on a trace event to learn more details about that event.
After experimenting with different plugins and different prompts, it can be seen that for this scenario, it is useful to explicitly:
  • Define the restrictive context of what the Agent is aiming to do (be FashionGPT as opposed to a general zero-shot agent)
  • Decompose example input queries into thought-process steps (akin to a one-shot approach) to force the agent to use plugins consistently for similar queries
Unfortunately, for the case of fashion planning, the agent is very good at utilizing the plugins to craft an outfit with the types of clothing relevant to the weather conditions, but it falls flat in ensuring the outfit is aesthetically coordinated. A useful plugin addition would be one that curates coordinated outfits and compares them against the provided store inventory to construct an outfit, as opposed to directly listing the inventory and choosing whichever pieces seem relevant, as the agent currently does.
Outside of coordinated aesthetics, dynamic plugin retrieval is readily available and useful for LLM agents. With the proper tooling and organization, many industries can take advantage of agent workflows in ways that are better prepared for their needs as they scale, while keeping the experimentation process safe and secure.

Iterate on AI agents and models faster. Try Weights & Biases today.