The Art and Science of Prompt Engineering
Darek Kleczek takes us through the art and science of how to prompt engineer like a pro.
Prompt engineering is crucial for getting the best out of language models. Just like crafting a good question can get you a clear answer, designing a good prompt can make your large language model perform better. In this article, we'll look at how researchers test and refine these prompts. We'll break down the steps and techniques from academic papers in a simple, technical way.
What We'll Be Covering
Let's get better at AI!
Organize your LLM Experiments
Baseline Generation
What makes a great prompt?
Start with the end in mind
Stop and reflect
Using few-shot prompt engineering
Grading time
Evaluating our prompt engineering
Conclusion
Let's get better at AI!
As a motivating use case, let's try to develop an AI tutor using only prompt engineering. The goal of the tutor is to help us become better at machine learning. We'll learn by doing - let's start with a simple prompt and see what comes out.
system_prompt = "You are an AI tutor helping students prepare for machine learning coding interviews."user_prompt = "Hi, can you give me an assignment? I'm just getting started."
Organize your LLM Experiments
I once met an artist who meticulously organized his sketches and drafts. Each piece had its own place and order. This systematic approach wasn't just about neatness; it was the backbone of his creative process. By having a clear view of his progress and iterations, he could evolve his ideas into masterpieces.
Similarly, when working on our experiments, it's crucial to keep an organized log. For our purposes, we'll document each test in a W&B Table, ensuring we can trace our steps, refine our approach, and drive towards optimal results.
# Start a W&B run to track our experimentswandb.init(project="prompt-engineering")# Define W&B Table to store generationscolumns = ["system_prompt", "user_prompt", "generations", "elapsed_time", "timestamp",\"model", "prompt_tokens", "completion_tokens", "total_tokens"]table = wandb.Table(columns=columns)
The function below acts as an interface to the OpenAI API, employing a retry mechanism to handle rate limits gracefully. After obtaining the desired completion, it logs relevant data, such as the prompts and response times, into a W&B table for tracking and analysis.
import time
import datetime

import openai
from tenacity import retry, stop_after_attempt, wait_random_exponential

MODEL_NAME = "gpt-3.5-turbo"  # assumption: set this to whichever chat model you're using

@retry(wait=wait_random_exponential(min=1, max=60), stop=stop_after_attempt(6))
def completion_with_backoff(**kwargs):
    # Retry with exponential backoff so transient rate-limit errors don't kill the run
    return openai.ChatCompletion.create(**kwargs)

def generate_and_print(system_prompt, user_prompt, table, n=1):
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]
    start_time = time.time()
    responses = completion_with_backoff(
        model=MODEL_NAME,
        messages=messages,
        n=n,
    )
    elapsed_time = time.time() - start_time
    for response in responses.choices:
        print(response.message.content)
    # Log one row per experiment: prompts, all generations, timing, and token usage
    table.add_data(
        system_prompt,
        user_prompt,
        [response.message.content for response in responses.choices],
        elapsed_time,
        datetime.datetime.fromtimestamp(responses.created),
        responses.model,
        responses.usage.prompt_tokens,
        responses.usage.completion_tokens,
        responses.usage.total_tokens,
    )
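With the helper in place, running an experiment is a single call. A minimal usage sketch, assuming the system_prompt and user_prompt defined above:

# Generate one completion for our baseline prompts and log it to the table
generate_and_print(system_prompt, user_prompt, table)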
Baseline Generation
We ran our first baseline experiment, and here's what we found: our AI tutor gave us a clear assignment to write a Python function that calculates an average. But the task is too simple, and spelling out every step defeats the purpose of a learning exercise.
We'll need to refine our prompt.
As a reminder, here's what we're using right now:
system_prompt = "You are an AI tutor helping students prepare for machine learning coding interviews."user_prompt = "Hi, can you give me an assignment? I'm just getting started."
Sure! I can give you a simple assignment to get you started.
Assignment:
Write a Python function called `calculate_average` that takes in a list of numbers as a parameter and returns the average (mean) of the numbers.
Here are the steps to follow:
1. Define the function `calculate_average` that takes a list of numbers as a parameter.
2. Inside the function, calculate the sum of all the numbers in the list.
3. Determine the length of the list.
4. Divide the sum by the length to calculate the average.
5. Return the average.
Example input and expected output:
Input: [5, 10, 15, 20]
Expected output: 12.5
Input: [1, 2, 3, 4, 5]
Expected output: 3.0
Try to implement the function on your own, and let me know if you have any questions or if you want me to review your code!
What makes a great prompt?
To maximize the potential of large language models, it often takes more than just a simple question. The TELeR paper outlines a detailed taxonomy of prompts, culminating in the intricate Level 5 prompt. We'll delve into this methodology, experimenting and refining to enhance our prompt design.

Start with the end in mind
Clear goals in a prompt are essential for accurate responses from language models. They steer the model towards the right answer. To set clear goals, avoid vague prompts and specify any constraints or requirements directly. This ensures the model knows exactly what's expected. Let's put this into practice.
system_prompt = """You are an AI tutor helping students prepare for machine learning coding interviews.Your goal is to come up with learning assignments that will help students pass interviews. Specifically:- Prompt students to solve a task that involves a simple machine learning concept and a coding exercise- The task should be possible to solve in 30 minutes using a simple algorithm in Python- The instruction should be minimal. Don't provide hints at this stage.- The task should be solvable by a student who has taken a machine learning course and has some coding experience- The task should be interesting and fun to solve- The task should advance the student's knowledge of machine learning"""user_prompt = "Hi, can you give me an assignment? I'm just getting started."
Here's the first output generated by the LLM in response to the prompt:
Sure! Here's a simple assignment for you:
Task: Implement a Logistic Regression model from scratch
Instructions:
1. Create a Python class called `LogisticRegression` with the following methods:
a. `__init__(self, learning_rate=0.01, num_iterations=1000)`: Initialize the logistic regression model with the specified learning rate and number of iterations.
b. `fit(self, X, y)`: Train the model on the given training data `X` (a numpy array of shape [num_samples, num_features]) and corresponding labels `y` (a numpy array of shape [num_samples]).
c. `predict(self, X)`: Make predictions on the given input data `X` (a numpy array of shape [num_samples, num_features]).
d. Optional: create a method `score(self, X, y)`, which calculates the accuracy of the model on the given input data `X` and corresponding labels y
2. Inside the `fit` method, implement the gradient descent algorithm to learn the optimal parameters for the logistic regression model. Use the sigmoid activation function.
3. Start by testing your implementation with a simple synthetic dataset generated using sklearn's `make_classification` function. Use a dataset with two classes.
4. Once you have verified that your implementation is working correctly, test it on a more challenging dataset, such as the "Breast Cancer" dataset from sklearn.
5. Experiment with different learning rates and number of iterations to find the best hyperparameters for your model.
6. Bonus: Visualize the decision boundary of your trained model in your plot using matplotlib
Good luck, and have fun coding! Let me know if you have any questions.
The model missed the mark big time, especially on the "no hints" directive. We've got some serious refining to do!
Stop and reflect
LLMs aren't just about producing text: they can explain their responses too. Rather than asking for an explanation after the fact, we can use this to make sure the model truly grasps the instruction. And since the model generates text one token at a time, asking it to reason first means the task it produces is conditioned on that reasoning. Let's put the reasoning before the main task!
system_prompt = """You are an AI tutor helping students prepare for machine learning coding interviews.Your goal is to come up with learning assignments that will help students pass interviews. Specifically:- Prompt students to solve a task that involves a simple machine learning concept and a coding exercise- The task should be possible to solve in 30 minutes using a simple algorithm in Python- The instruction should be minimal. Don't provide hints at this stage.- The task should be solvable by a student who has taken a machine learning course and has some coding experience- The task should be interesting and fun to solve- The task should advance the student's knowledge of machine learningStart by summarizing what you're trying to achieve and your goals. Explain your reasoning behind the task and the way you present it. Then present the task concisely.Use this format:REASONING: max 3 sentencesTASK: max 3 sentences, no detailed instructions"""user_prompt = "Hi, can you give me an assignment? I'm just getting started"
Here's the generation.
REASONING: As a beginner, it's important to start with a simple task that covers a fundamental concept in machine learning. This will help you build a strong foundation and gain confidence in your abilities.
TASK: Build a program that predicts whether a given email is spam or not spam, based on a set of pre-labeled emails. Use a binary classification algorithm of your choice to train a model using the provided dataset, and then use the trained model to predict the labels for a test set of emails. Finally, evaluate the accuracy of your model by comparing the predicted labels with the true labels.
We made some progress with a concise assignment, but these tasks still feel too heavyweight for basic exercises. I'm looking for more bite-sized tasks, and the best way to get them is to show the model examples. That's our next focus.
Using few-shot prompt engineering
LLMs learn well when you show them what you want, using input-output pairs. Add the right examples, and the model gets better at following patterns. Change the examples, and the model's results might change too. We'll use this to our advantage.
system_prompt = """You are an AI tutor helping students prepare for machine learning coding interviews.Your goal is to come up with learning assignments that will help students pass interviews. Specifically:- Prompt students to solve a task that involves a simple machine learning concept and a coding exercise- The task should be possible to solve in 30 minutes using a simple algorithm in Python- The instruction should be minimal. Don't provide hints at this stage.- The task should be solvable by a student who has taken a machine learning course and has some coding experience- The task should be interesting and fun to solve- The task should advance the student's knowledge of machine learningExample tasks by level:- Beginner: calculate probability of 3 heads in 5 coin flips, count the number of times a word appears in a text- Intermediate: implement a single neuron in Python, implement a simple decision tree in Python- Advanced: implement backpropagation of a simple MLP in Python, implement a simple CNN in PythonStart by summarizing what you're trying to achieve and your goals. Explain your reasoning behind the task and the way you present it. Then present the task concisely.Use this format:REASONING: max 3 sentencesTASK: max 3 sentences, no detailed instructions"""user_prompt = "Hi, can you give me an assignment? I'm just getting started."
Here's the generation:
REASONING: As a beginner, it's important to start with simple and foundational concepts in machine learning. This will build a strong understanding of the basics before diving into more complex topics. By starting with a task like calculating the probability of coin flips, students will learn about probability and basic statistical concepts, while also gaining experience in coding.
TASK: Write a Python function that takes the number of coin flips `n` as input and calculates the probability of getting exactly `k` heads in `n` coin flips. The function should return the probability as a decimal value. For example, given `n=5` and `k=3`, the function should return `0.3125`. Use the formula for calculating the probability of a specific outcome in a binomial distribution.
The task is decent, but it's too swayed by the example, limiting diversity. It's also spitting out potentially made-up numbers, which isn't what we want. Time to tweak.
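Whenever a generation quotes a concrete number, it's worth verifying it rather than trusting the model. Here's a quick, throwaway check of the quoted probability (not part of our pipeline):

import math

# P(exactly 3 heads in 5 fair coin flips) = C(5, 3) * 0.5^5
p = math.comb(5, 3) * 0.5 ** 5
print(p)  # 0.3125 -- the model's quoted value happens to check out here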
Grading time
On top of specifying our goal, let's tell the model how its output will be evaluated. We can set word limits and spell out the format and constraints. With clear criteria, the model is far more likely to produce answers that meet our expectations.
system_prompt = """You are an AI tutor helping students prepare for machine learning coding interviews.Your goal is to come up with learning assignments that will help students pass interviews. Specifically:- Prompt students to solve a task that involves a simple machine learning concept and a coding exercise- The task should be possible to solve in 30 minutes using a simple algorithm in Python- The instruction should be minimal. Don't provide hints at this stage.- The task should be solvable by a student who has taken a machine learning course and has some coding experience- The task should be interesting and fun to solve- The task should advance the student's knowledge of machine learningExample tasks by level:- Beginner: calculate probability of 3 heads in 5 coin flips, count the number of times a word appears in a text- Intermediate: implement a single neuron in Python, implement a simple decision tree in Python- Advanced: implement backpropagation of a simple MLP in Python, implement a simple CNN in PythonYou'll be evaluated on:- conciseness of the task description- clarity of the task description- creativity of the task- matching the task to the student's level- learning value of the taskStart by summarizing what you're trying to achieve and your goals. Explain your reasoning behind the task and the way you present it. Then present the task concisely.Use this format:REASONING: max 1 sentenceTASK: max 5 short bullet points, no detailed instructions."""user_prompt = "Hi, can you give me an assignment? I'm just getting started."
Here's the generation:
REASONING: A beginner-level assignment should be simple and cover a basic concept in machine learning.
TASK:
- Write a function in Python called "mean_squared_error" that takes in two lists of numbers, `y_true` and `y_pred`, and calculates the mean squared error between them.
- Calculate the mean squared error between the following two lists: [1, 2, 3, 4, 5] and [2, 4, 6, 8, 10].
- Implement a function called "linear_regression" that takes in two lists of numbers, `X` and `y`, and performs simple linear regression (fitting a line of the form `y = mx + b`) using the least squares method.
- Use the "linear_regression" function to find the best-fitting line for the following data points:
X: [1, 2, 3, 4, 5]
y: [2, 3, 4, 5, 6]
- Print the equation of the line (in the form `y = mx + b`) that best fits the given data points.
This is pretty neat!
Evaluating our prompt engineering
Interactive experimentation is a swift way to test ideas, but it might not always reflect what happens on a larger scale. To get a broader picture, we can take our top-performing prompts, generate more examples, and then evaluate them. W&B Tables can be a valuable tool for this. Let's take a look!
(Embedded W&B Table from run rare-totem-82, showing batched generations for our candidate prompts.)
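Here's a minimal sketch of that batch evaluation, reusing the generate_and_print helper from earlier. The candidate_prompts list and the n=10 sample size are illustrative choices of ours, not fixed requirements:

# Generate a larger batch of completions for each candidate system prompt
# (candidate_prompts is a hypothetical list holding our best prompts so far)
for sp in candidate_prompts:
    generate_and_print(sp, user_prompt, table, n=10)

# Log the table so the generations can be compared side by side in W&B
wandb.log({"prompt_engineering": table})
wandb.finish()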
Conclusion
A Level 5 prompt combines several features: a complex directive, a clear high-level goal, a detailed list of sub-tasks, a request for the LLM to explain its output, guidelines for evaluating the output, and few-shot examples. Together, these steer the outputs toward your desired goals and offer a practical approach for getting the best out of LLMs. We recommend using the provided Colab for hands-on practice and enrolling in our free course for more insights.