Going from 17% to 91% Accuracy through Prompt Engineering on a Real-World Use Case
This is the first in a series all about real-world LLM app building. Today, we're going to build a customer success ticket classifier and improve its accuracy more than fivefold.
Introduction
The process for productionizing LLMs is quite different from that of traditional software. Whereas in software the output is the code that produces an application, in LLMs (as in ML generally) the output is everything you tried along the way: the data, prompts, pipelines, evaluation metrics, etc. ML engineers should adopt an experimentation mindset and recognize that the underlying components of the process are, in fact, the intellectual property being created.
LLMs themselves are being deployed by companies today in a large number of functions: automating customer support tickets, sales data cleanup/matching, internal documentation, automated marketing copy generation, and others.
And in this series, we'll walk you through one such use case, step by step: classifying our own customer success tickets with the help of GPT-4. Our reasons for this are pretty simple. We do our best to give real feedback to everyone who writes in but different tickets require diverse responses. A message that touches on something existential should be treated differently than someone having trouble logging in. Outside of triage, understanding our entire dataset of support questions lets us tease out themes, areas we can improve, even product features we should prioritize. Then, there’s the fact that a lot of our support tickets simply aren’t about our product. Being able to identify those programmatically will help our customer success organization work more efficiently and waste less time reading non-actionable messages.
The TL;DR here is that the better we understand these messages individually and holistically, the better we can help our users.
Today, we’re going to start from scratch, try out some evaluation harnesses, and hone our prompts to see how well we can do. Later in the series, we’ll dig into everything from fine-tuning to RAG to deployment, but we need a foundation to build on before we get there.
Let’s build that foundation today! You can follow along with the code for this project via the Colab link below:
Table of Contents
- Introduction
- Table of Contents
- Inspecting our Data
- Defining our Scope
- Evaluating our Performance
- Manual Evaluation
- Automated Evaluation
- Prompt Engineering
- Use Delimiters to Help the Model—and Avoid Prompt Injections
- Add Examples Inside Prompt Context Window
- Explain what the Classes Mean
- Change your Model's Configuration Parameters
- Ask for a Structured Output Like JSON or HTML
- Specify Steps
- Ask for the Reasoning Behind Each Step
- Chain Of Thought Prompting
- Prompt Chaining
- Custom Evaluations for Prompts
- Conclusion
Inspecting our Data
Our customer success folks (thanks Frida and Artsiom!) sent along a dataset of roughly 26,000 messages. Let’s start by examining one at random and seeing what we might be able to do with it (I’ll pull the text of the message out for readability):
“Hi, I've noticed that when I run a training script through `wandb agent` , a fatal exception in the code is not printed out to the console. While when I run the same code directly with `python train.py` I see the exception in the command output. Looks like if the agent was suppressing the exception printout to the console. Have you heard about something like that?”
Full JSON:
{"description": "Hi, I've noticed that when I run a training script through `wandb agent` , a fatal exception in the code is not printed out to the console. While when I run the same code directly with `python train.py` I see the exception in the command output. Looks like if the agent was suppressing the exception printout to the console. Have you heard about something like that?","raw_subject": "[SDK] wandb agent fatal exception not printed","subject": "[SDK] wandb agent fatal exception not printed","priority": "urgent","problem_id": null,"tags": ["component_cli","enterprise_customer","halp","p0","question","sweeps"],"id": 36380,"question": "question","360042457771": ["enterprise_customer"],"360042775872": ["component_cli"],"360042457751": "sweeps","360044019192": null,"360041678631": "https://weightsandbiases.slack.com/archives/C01L79NKX5L/p1669025253633279?thread_ts=1669007664.253479&cid=C01L79NKX5L","360041794452": null,"4419373106452": null,"4419370133012": null,"4419373051540": null,"4419370299284": null,"4419380461716": null,"4417243931028": false,"4419380411796": null}
Right away, we can see there are a few useful fields outside the message itself. Question, tags, and priority seem like a promising start, while some of those later fields do not. We'll strip those out, clean up our data, and end up with something like this:
{"description": "Hi, I've noticed that when I run a training script through `wandb agent` , a fatal exception in the code is not printed out to the console. While when I run the same code directly with `python train.py` I see the exception in the command output. Looks like if the agent was suppressing the exception printout to the console. Have you heard about something like that?","question": "question","priority": "urgent","tags": ["component_cli","enterprise_customer","halp","p0","question","sweeps"]}
Here, we have just the four fields we care about, with the unnecessary noise removed. It's time to really get started.
Defining our Scope
Eventually, we’ll want to classify these tickets in myriad ways but my first instinct was to predict tags. As it turned out, this instinct was wrong.
I won’t belabor this too long, but I think it’s good to point out that this is pretty typical behavior at the beginning of modeling. Machine learning is an experimental science and it’s not always going to succeed from jump street. You need to be willing to pivot and try novel approaches if your first blush attempt isn’t quite as fruitful as you’d hope.
The issue with tags was there were just too many. In fact, there were over 500. I tried a bunch of techniques but couldn’t really get better than 40% accuracy. The main issues were that the ticket text wasn’t all that predictive of the tags and some tags were incorrectly applied. In other words, the scope was too broad up front.
I pivoted instead to trying to detect whether the ticket was related to W&B or not and whether the ticket text was a bug, feature request or general question. I hand labeled some examples to build myself a quick evaluation set.
Goal: Given the support text, predict if 'question' is one of the following:
- 'type_feature_request'
- 'type_bug'
- 'none'
- 'question'
The lesson: Succeeding with a simpler problem before layering in complexity is a better approach than being overly ambitious. Build something that works before you tackle the trickier issues.
Evaluating our Performance
Before we can try any of our prompting techniques, we need a way to evaluate performance. After all, without having a good way to evaluate our model’s performance, we lose valuable lessons we could be learning and any experiments we do are in vain. Which is to say: while we aren’t doing any real modeling in this step, we want a baseline upon which we can build.
What we’ll do is use a simple prompt and evaluate how it handles a random selection of customer success tickets. That basic prompt:
"""Classify the text delimited by triple backticks into one of the following classes.Classes: 'type_bug', 'none', 'type_feature_request', 'question'Text: ```{ticket_text}```Class:"""
To reiterate—and as you can see from the code above—our model can be evaluated on how well it does with specific correct answers. We’re hoping to see it correctly identify bugs, feature requests, and questions from text. It’s worth noting that with many other LLM use cases (think any text generation task like summary or copywriting) we don’t have this luxury.
The next section deals with evaluating an LLM on our case—i.e. on outputs that can either be correct or incorrect.
Manual Evaluation
First, we want to test customer success tickets on a simple prompt and look at the result manually. Don’t worry about class balance yet and feel free to select random examples to evaluate your model.
If your model fails on certain examples, consider adding those to your prompt. It’s a good idea when you’re modifying your prompts to do some simple regression testing to make certain your new prompt works on your older examples.
Functionally, we’re making sure our model performs and everything is working, even if performance isn’t ideal. And in fact, performance likely won’t be great until we get into prompt engineering, but your prompt should lead your model to correctly predict at least some of the example data.
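In code, that manual pass can be as simple as the loop below. The hand-labeled tickets are made-up placeholders, and get_completion is the same LLM-calling helper used in the automated evaluation that follows.

# A handful of hand-labeled tickets to spot-check, and to re-run as a quick
# regression test whenever the prompt changes. These examples are placeholders.
spot_checks = [
    {"description": "The run page 404s when I click on an artifact.", "question": "type_bug"},
    {"description": "Could you add an option to export panels as PDFs?", "question": "type_feature_request"},
    {"description": "How do I resume a sweep from an earlier run?", "question": "question"},
]

for example in spot_checks:
    prompt = f"""Classify the text delimited by triple backticks into one of the following classes.
Classes: 'type_bug', 'none', 'type_feature_request', 'question'
Text: ```{example['description']}```
Class:"""
    prediction = get_completion(prompt)
    # Eyeball each prediction against the hand label before automating anything.
    print(f"Predicted: {prediction!r} | Expected: {example['question']!r}")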
Calculate your success rate for a baseline and move on to automated evaluation.
Automated Evaluation
We shouldn’t evaluate our model manually forever, so at some point, it’s wise to set up some automated evaluation. Here, we’d like to increase our example data and pay some attention to class balance. We’re only looking to predict three classes (bug, feature request, and question) so we should make sure to have adequate amounts of each in our evaluation set of 100 examples.
What I did was create a Python dictionary mapping user messages to their correct answers. I looped over the examples in the dictionary, called the LLM on each message, and calculated the score (here, just the fraction of times the model output matched the ideal answer).
To evaluate our app, we’ll go with a simple average accuracy score and evaluate the prompt on a predefined evaluation set of 100 examples. Here’s the code we’re going to use:
def basic_prompt(evalset_start=400, evalset_end=500):
    count, score_tags = 0, 0
    for element in filtered_data[evalset_start:evalset_end]:
        # print(element)
        ticket_text = element['description']
        # Prompt – classify the ticket into one of the desired classes
        prompt = f"""Classify the text delimited by triple backticks into one of the following classes.
Classes: {desired_tags}
Text: ```{ticket_text}```
Class: """
        response = get_completion(prompt)
        print("Prediction: " + response)
        if response == element['question']:
            score_tags += 1
            print("Correct. Actual: " + element['question'])
        else:
            print("Incorrect. Actual: " + element['question'])
        count += 1
        print()
    print("__________________")
    print(f"Class Accuracy: {score_tags / count}")
    return score_tags / count

# acc and the evalset bounds are defined elsewhere in the notebook
acc['basic_prompt'] = basic_prompt(evalset_start_all, evalset_end_all)
The lesson: Though your first metrics will likely be fairly underwhelming, setting up your evaluation harness early lets you ensure both that your pipeline is working as expected and that your prompt engineering experiments are evaluated by the same rubric throughout.
Now, let’s get to the fun part:
Prompt Engineering
We’ve got our scope and we’ve got our preliminary evaluation harness. The next step is honing our prompt.
In these experiments, I managed to move this prompt from 17% accuracy to above 90%, all with only prompt changes. You can see my general approach in the table and graph below but don’t worry, we’ll walk through the important changes in detail!

[Chart: each experiment and its resulting accuracy]

Use Delimiters to Help the Model—and Avoid Prompt Injections
You can use delimiters to specify where in the prompt you're inserting the user input. This helps the model identify where the user input starts and ends (especially helpful for things like support tickets, which can run tens to hundreds of lines).
This can also help avoid prompt injections where malicious users try to manipulate our LLM to do things it’s not supposed to do. Delimiters can be anything like: ```, """, < >, <tag> </tag>, :.
Here’s how I structured my basic prompt (this is in fact the prompt we used to set up our evaluation harness):
"""Classify the text delimited by triple backticks into one of the following classes.Classes: 'type_bug', 'none', 'type_feature_request', 'question'Text: ```{ticket_text}```Class:"""
Our basic prompt was just 17% accurate. It’s a start but we’ll have to do a lot better. Let’s add some examples:
Add Examples Inside Prompt Context Window
Giving your model specific examples of prompts and expected generated outputs is one of the best ways to quickly improve performance. We’re subtly helping the LLM learn more about our problem, as well as the complexity, tone, and concision of our examples (in this case, customer success tickets).
We're also guiding the model to produce the right shape of output. As a toy example: if we end the prompt with something like "A:", the model will naturally fill it in with the answer. Depending on how we structure the example answers, we can also specify the complexity of the response from the model.

LLMs are usually great at zero-shot learning with no examples, but if we pick a smaller model we can get a good uptick in performance by using one-shot or few-shot learning and including examples of the expected answers.
For our experiment, I added different numbers of examples to the same general prompt skeleton:
"""Classify the text delimited by triple backticks into one of the following classes.Classes: {desired_tags}Text: ```{ticket_text1}```Class: questionText: ```{ticket_text2}```Class: type_bug<...more examples here>Text: ```{ticket_text}```Class:"""
Adding five examples improved accuracy from 25% to 30% whereas adding 20 random examples to our prompt increased our accuracy to 70%.
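For reference, the {examples} block is nothing fancy: it's just labeled tickets rendered in the same Text/Class shape the model is asked to complete. A minimal sketch, assuming a small hand-labeled list (the examples below are made up):

# Hand-labeled (description, class) pairs. In practice these came from the
# labeled data, kept separate from the 100 tickets we evaluate on.
labeled_examples = [
    ("Sweeps crash whenever the agent loses network connectivity.", "type_bug"),
    ("It would be great to filter runs by git commit in the UI.", "type_feature_request"),
    ("What's the difference between wandb.init and wandb.setup?", "question"),
]

def build_examples_block(examples):
    # Render each example exactly like the final query, so the model sees the
    # pattern it's expected to complete.
    return "\n".join(f"Text: ```{text}```\nClass: {label}" for text, label in examples)

examples = build_examples_block(labeled_examples)
prompt = f"""Classify the text delimited by triple backticks into one of the following classes.
Classes: {desired_tags}
{examples}
Text: ```{ticket_text}```
Class:"""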
The lesson: For problems like ours, consider adding examples before doing more granular prompt engineering. Like humans, machines often learn better from real-world examples rather than florid descriptions of the problem.
Explain what the Classes Mean
Though “bug,” “feature request,” and “question” are common terms, defining each term can be a simple but effective approach. In our case, this led to a 5% uptick in accuracy:
"""Given the following description for each class:type_feature_request: A request for a feature by a user of Weights & Biases.type_bug: A bug report by a user of Weights & Biases.question: If the request is not related to Weights & Biases; or doesn't fit the above 2 categories.Classify the text delimited by triple backticks into one of the following classes.Classes: {desired_tags}{examples}Text: ```{ticket_text}```Class: """
Change your Model’s Configuration Parameters
Another thing we can do outside of changing our prompt is tweaking our model configurations. Each parameter can influence the model’s decision about how it generates the next word. LLM playgrounds like Together.ai allow you to play with these inference time parameters.
- Max Tokens: Limits the number of tokens that the model will generate. This puts a cap on the number of times the model will go through the selection process to pick the next word, saving us time and compute.
- Random Sampling: Instead of always picking the most probable word, the model samples the next word at random from its probability distribution. This introduces variability and makes repetitive output less likely.
- Top K and Top P Sampling: Pure random sampling can let the model get too creative. With top-k and top-p sampling, we limit the pool of words the model can pick from while still allowing some variability. Top K samples from only the k tokens with the highest probability; Top P samples from the most probable tokens whose combined probability doesn't exceed p. In other words, with Top K we specify how many tokens the model can pick from, while with Top P we specify a total probability mass.
- Temperature: Also controls the randomness of the output. The higher the temperature, the more random and creative the model; the lower the temperature, the more predictable. Changing the temperature actually reshapes the model's probability distribution over the next token, whereas the sampling techniques above only change which of those predictions gets picked.
I toyed with model parameters and was able to get a small lift from adjusting temperature. Adding 20 examples and changing the temperature got me to 90% accuracy.
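For reference, here's roughly what wiring those parameters into a get_completion helper looks like with the OpenAI Python client. This is a sketch, not necessarily the exact helper used in this project, and the default values are just examples:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def get_completion(prompt, model="gpt-4", temperature=0.0, top_p=1.0, max_tokens=256):
    # Low temperature keeps the classifier deterministic; max_tokens caps the
    # generation so a one-word class label can't turn into a paragraph.
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
        top_p=top_p,
        max_tokens=max_tokens,
    )
    return response.choices[0].message.content.strip()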
Ask for a Structured Output Like JSON or HTML
You can ask the model to reply with a very specific syntax. This can make parsing the model outputs easier for humans, but it’s particularly useful if we want to chain prompts. We can ask the model for an output in the specific format that our next LLM call needs.
We could use this if we eventually want this model to do more than the single thing we're aiming for today: the model would return a JSON object containing a summary of the ticket, a recommended next action, and some additional classifications of the ticket. You could then parse that JSON output in Python and build complex workflows!
An example:
"""Given the text delimited by triple backticks, perform the following actions:1 - Summarize what the user wants in 1-2 lines.2 - Recommend a next action based on the user' request.3 - Determine if the request is related to the product or company 'Weights & Biases'4 - Classify the into one of the following classes. Classes: {desired_tags}5 - Output a json object that contains the following keys: summary, recommended_action, is_wb, classHere's the text ```{ticket_text}```Make sure the output is only a json object."""
Specify Steps
One interesting prompting technique that worked for me was specifying the steps I wanted the model to take and having it use those steps to generate the final output. Instead of allowing the model to approach the problem however it wanted, I asked it to first summarize what the user wants, determine if the request is related to W&B, and then predict the class. Doing this gave me a small bump in performance.
"""Given the text delimited by triple backticks, perform the following actions:1 - Summarize what the user wants in 1-2 lines.2 - Recommend a next action based on the user' request.3 - Determine if the request is related to the product or company 'Weights & Biases'4 - Classify the into one of the following classes. Classes: {desired_tags}5 - Output a json object that contains the following keys: summary, recommended_action, is_wb, classHere are some examples help you with the classification step.{examples}And here's the text ```{ticket_text}```"""
Ask for the Reasoning Behind Each Step
In the interest of full disclosure, not every technique is going to work for every problem. This is one such example. I tried asking my model to provide the reasoning behind each step. This actually reduced the performance of the model by 20% so I scrapped it.
"""Given the text delimited by triple backticks, perform the following actions:1 - summary: Summarize what the user wants in 1-2 lines.2 - summary_reasoning: Explain your reasoning for the summary.3 - recommended_action: Recommend a next action based on the user' request.4 - recommended_action_reasoning: Explain your reasoning for the recommended next action.5 - is_wb: Determine if the request is related to the product or company 'Weights & Biases'.6 - is_wb_reasoning: Explain your reasoning for detemining if the request is W&B related.7 - class: Classify the into one of the following classes. Classes: {desired_tags}8 - Output a json object that contains the following keys: summary, summary_reasoning, recommended_action, recommended_action_reasoning, is_wb, is_wb_reasoning, classHere are some examples help you with the classification step.{examples}And here's the text ```{ticket_text}```.Make sure the output is only a json object."""
Chain Of Thought Prompting
In the last prompt, I specified some of the steps I wanted the model to take. But we can also instruct the model to generate its own intermediate reasoning steps, aka a chain of thought, before producing the final output. This helps the model handle complex reasoning tasks.

[Figure: a canonical chain-of-thought example]
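Applied to our classifier, a chain-of-thought version of the prompt might look something like the sketch below (this isn't one of the exact prompts scored above):

# Chain-of-thought variant: let the model write its own reasoning first, then
# put the final answer on a labeled last line so it's easy to extract.
prompt = f"""Classify the text delimited by triple backticks into one of the following classes.
Classes: {desired_tags}
First, reason step by step about what the user is asking for, whether it relates to
Weights & Biases, and which class fits best. Then, on the final line, write 'Class:'
followed by the single chosen class.
Text: ```{ticket_text}```"""

response = get_completion(prompt)
# The reasoning comes first; the class label is on the last line.
predicted_class = response.strip().splitlines()[-1].replace("Class:", "").strip()
print(predicted_class)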
Prompt Chaining
Prompt chaining is similar to chain-of-thought prompting, but instead of doing it all in one prompt, we break our problem up into multiple prompts so that one prompt's generation is added to the context of the next. This makes each prompt more focused and thus more likely to perform well, avoids context limitations because each step can use the full context window, reduces costs, and results in code that's a lot easier to read.
Breaking each step into its own prompt also makes the system much easier to evaluate, since we can write custom evaluations for each step.
Lastly, if we have a complex workflow where the result of step one picks very different workflows in step two, breaking down the prompts can help you dynamically pick which set of prompts to call next.
This is a really powerful technique, especially for complex problems. For our task here? It didn’t work out all that well. But as we build out our model and ask it to do something more nuanced than tag classification, we’ll be leaning heavily on prompt chaining to do so. Here’s the code I used:
# Prompt 1 – summarize the ticket, recommend an action, and decide if it's W&B related
prompt = f"""Given the text delimited by triple backticks, perform the following actions:
1 - summary: Summarize what the user wants in 1-2 lines.
2 - summary_reasoning: Explain your reasoning for the summary.
3 - recommended_action: Recommend a next action based on the user's request.
4 - recommended_action_reasoning: Explain your reasoning for the recommended next action.
5 - is_wb: Determine if the request is related to the product or company 'Weights & Biases'.
6 - is_wb_reasoning: Explain your reasoning for determining if the request is W&B related.
7 - Output a json object that contains the following keys: summary, summary_reasoning, recommended_action, recommended_action_reasoning, is_wb, is_wb_reasoning
And here's the text ```{ticket_text}```.
Make sure the output is only a json object."""

response_json = get_completion(prompt)
# print("Prompt: " + prompt)
print("Prediction: ")

# Parse the structured output from the first call
response = json.loads(response_json)
print("Summary: " + response['summary'])
print("Summary Reasoning: " + response['summary_reasoning'])
print("Recommended Action: " + response['recommended_action'])
print("Recommended Reasoning: " + response['recommended_action_reasoning'])
print("Is W&B Related: " + str(response['is_wb']))
print("Is W&B Related Reasoning: " + response['is_wb_reasoning'])

# Prompt 2 – feed the first prompt's outputs into the classification step
prompt_2 = f"""Given the following info about a user request:
1 - text delimited by triple backticks: ```{ticket_text}```
2 - summary of what the user wants: {response['summary']}
3 - recommended next action based on the user's request: {response['recommended_action']}
4 - whether the request is related to the product or company 'Weights & Biases': {response['is_wb']}
Classify the text into one of the following classes. Classes: {desired_tags}
Here are some examples to help you with the classification step.
{examples}
Only print the name of the class"""

predicted_class = get_completion(prompt_2)
print("Predicted Class: " + predicted_class)
Custom Evaluations for Prompts
As you make your prompts more complex, you can improve your evaluations too. You can add additional LLM calls to check whether all the conditions are satisfied: Is the model generating reasonable output for all the intermediate steps? Is its reasoning sound? Are the suggested next actions grounded in the W&B docs? Is the code executable?
These can all bump performance upwards, but for our simple starting problem, we didn't end up needing them.
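For illustration, one such check might be a second LLM call that judges an intermediate output, something like the sketch below (the judging prompt and pass/fail criterion are made up for this example):

def judge_summary(ticket_text, summary):
    # Ask a separate LLM call to grade an intermediate step of the pipeline.
    judge_prompt = f"""You are grading a summary of a customer support ticket.
Ticket: ```{ticket_text}```
Summary: ```{summary}```
Answer with a single word, PASS or FAIL: does the summary accurately capture what
the user is asking for, without inventing details?"""
    verdict = get_completion(judge_prompt)
    return verdict.strip().upper().startswith("PASS")

# e.g. judge_summary(element['description'], response['summary'])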
There are two other advanced prompting techniques that I didn’t get a chance to try in this project, but I’m looking forward to doing next: tree of thought prompting and ReAct. We’ll cover these more in subsequent posts in this series, but if you want to learn more advanced prompting techniques, I recommend this guide.
Conclusion
What I took away most from this experiment is that simple iterations on our prompt and straightforward tweaks to our model's parameters can have massive positive effects on our model's output. In fact, we were able to go from 17% accuracy to over 90% doing just that.
What worked best was adding diverse examples and their tags to my prompt as well as chain-of-thought prompting to help our model work through the problem step-by-step.
This is the first in a series of articles we’ll be publishing on real-world LLM app building. We’ll cover everything from fine-tuning to advanced prompting techniques to deployment and we hope to see you for the next edition! In the meantime, you can head over to our courses page for a free, interactive course on LLM app building.
See you next time!