
OpenAI Evals Demo: Using W&B Prompts to Run Evaluations

In this article, we explain how to use W&B Prompts with OpenAI Evals, with a short walkthrough showing how to run any evaluation with just one click.
Evaluating LLMs for your own use cases is challenging, ambiguous, and fast-evolving.
OpenAI Evals is a fast-growing open-source repository of dozens of evaluation suites for LLMs. With W&B Launch, you can run any of those evaluations with a single click, then visualize and share the results in Weights & Biases.
Here's a short walkthrough of how to use OpenAI Evals with W&B Launch. (For an introduction to W&B Launch, see this guide.)

1. Visit the job page

Click on the teal Launch button to bring up the Launch modal:

2. Launch the job

1. Click the Clone from... button to use a valid preset, or define your own config.
2. (Optional) To try some prompt engineering, you can change (see the sketch after this list):
   1. registry to add new datasets
   2. model.override_prompt to try new prompts
3. Select the W&B Global CPU queue.
   Update 2024-01-11: create and select your own queue instead (see this notebook for an example).
4. Select a destination project (this is where your run will be logged).
5. Click the teal Launch now button.
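
For reference, the overrides in step 2 amount to a small config object. Below is a minimal sketch of what they might look like, written as a Python dict; the exact nesting of registry and model.override_prompt, the eval name, and the prompt text are illustrative assumptions, not values from a shipped preset.

```python
import json

# Minimal sketch of Launch config overrides (illustrative assumptions only:
# the exact nesting and values depend on the job's config schema).
overrides = {
    "registry": {
        # Point the job at a different eval/dataset from the registry.
        "eval": "test-match",
    },
    "model": {
        # Try a new prompt: this text is sent to the model for every sample.
        "override_prompt": "Answer concisely, and show your reasoning.",
    },
}

# The Launch modal's config editor accepts the JSON equivalent:
print(json.dumps(overrides, indent=2))
```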

3. View results in a report

When the job finishes, you'll see a link to a run. Follow the link to see a workspace showing the results, both in terms of metrics and in an interactive table of prompts, responses, and metadata.
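
If you'd rather pull those results programmatically than browse the workspace, a finished run can be read back with the W&B public API. A minimal sketch, where my-team/oai-evals/abc123 stands in for your actual entity/project/run path:

```python
import wandb

# Read a finished eval run back through the public API.
api = wandb.Api()
run = api.run("my-team/oai-evals/abc123")  # hypothetical run path

# Summary metrics logged by the job (whatever the chosen eval reports).
print(run.summary)

# The config records which eval ran and any prompt overrides used.
print(run.config)
```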


The job will also generate a shareable report summarizing the model's performance on the chosen eval and prompts. To find it, navigate to the project and open the Reports icon in the sidebar.

The report covers several aspects of performance:
  • Across the top, see key performance and cost metrics.
  • In a plot, see how performance has changed across every version of the evaluation.
  • Below, see the data lineage and a preview of the registry, showing the artifacts that are the inputs to and outputs of the job (you can also walk this lineage from code, as in the sketch below).
  • For some interesting results, try out the Japanese Translation presets!
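
The lineage view is backed by W&B Artifacts, so you can also traverse it from code. A short sketch using the public API (the run path is again hypothetical):

```python
import wandb

api = wandb.Api()
run = api.run("my-team/oai-evals/abc123")  # hypothetical run path

# Artifacts the job consumed (e.g. the eval registry/dataset)...
for artifact in run.used_artifacts():
    print("input: ", artifact.name, artifact.type)

# ...and artifacts it produced (e.g. the results table).
for artifact in run.logged_artifacts():
    print("output:", artifact.name, artifact.type)
```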

4. Create a custom eval

The generated report also includes instructions for running your own eval.
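
The report has the authoritative steps, but for a flavor of what's involved: in the openai/evals repo, a basic eval is just a JSONL file of samples plus a YAML registry entry. The sketch below writes a tiny exact-match eval; the file layout, eval name, and the evals.elsuite.basic.match:Match class follow the repo's conventions at the time of writing, so treat the specifics as assumptions that may have drifted.

```python
import json
from pathlib import Path

# Paths inside a local checkout of github.com/openai/evals
# (layout per the repo's conventions at the time of writing).
data_dir = Path("evals/registry/data/arithmetic_demo")
data_dir.mkdir(parents=True, exist_ok=True)

# 1. Samples: one JSON object per line, with a chat-formatted input
#    and the ideal (expected) answer.
samples = [
    {"input": [{"role": "system", "content": "Answer with just the number."},
               {"role": "user", "content": "What is 7 * 8?"}],
     "ideal": "56"},
    {"input": [{"role": "system", "content": "Answer with just the number."},
               {"role": "user", "content": "What is 12 + 30?"}],
     "ideal": "42"},
]
with open(data_dir / "samples.jsonl", "w") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")

# 2. Registry entry: registers the eval under a name and wires it to the
#    built-in exact-match eval class.
registry_yaml = """\
arithmetic-demo:
  id: arithmetic-demo.dev.v0
  metrics: [accuracy]
arithmetic-demo.dev.v0:
  class: evals.elsuite.basic.match:Match
  args:
    samples_jsonl: arithmetic_demo/samples.jsonl
"""
Path("evals/registry/evals/arithmetic_demo.yaml").write_text(registry_yaml)
```

With those two files in place, the eval can be run locally with the repo's CLI (for example, oaieval gpt-3.5-turbo arithmetic-demo), or selected from the Launch job's registry config as in step 2.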
