Six years ago, the tools needed to realize the potential of deep learning didn’t exist. We started Weights & Biases to build them. Our tools have made it possible to track and collaborate on the colossal amount of experimental data needed to develop GPT-4 and other groundbreaking models.
Today, GPT-4 has incredible potential in applications for humanity, but that potential far exceeds our ability to actually apply it. To solve this, we need to think differently about how we build software. We need new tools.
We’re very proud to announce public availability of Weave, a suite of tools for developing and productionizing AI applications.
Use Weave to:
- log and version LLM interactions and surrounding data, from development to production
- experiment with prompting techniques, model changes, and parameters
- evaluate your models and measure your progress
Go to https://wandb.me/weave to get started.
Demos are easy, production is hard
Generative AI models are incredibly powerful, but they are non-deterministic black boxes by nature. We now know that it’s easy to make incredible AI demos, but it takes significant engineering effort to make production applications work.
This difficulty arises from the stochastic nature of AI models. The input space for any given application is far too large to completely test.
But there is a solution: treat the model as a black box and follow a scientific workflow, akin to the workflow machine learning practitioners use to build these models in the first place.
Here’s how it works:
- Log everything: Capture every interaction you have with LLMs, from development to production. This data is expensive to produce, so keep it! You’ll use it to improve your models, and build up evaluations.
- Experiment: Try lots of different configurations and parameters to figure out what works.
- Evaluate: Build up suites of evaluations to measure progress. You’re flying blind if you don’t do this!
Weave introduces minimal abstractions that make this process a natural part of your workflow.
Weave Tracking
The first step toward harnessing the power of AI models is to log everything you do to a central system-of-record.
You’ll use this data to understand what experimental changes have impact, build up evaluation datasets, and improve your models with advanced techniques like RAG and fine-tuning.
What do you need to track?
- code: ensure all code surrounding generative AI API calls is versioned and stored
- data: where possible, version and store any datasets, knowledge stores, etc. (see the sketch after this list)
- traces: permanently capture traces of functions surrounding generative AI calls
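As a quick sketch of the data piece, one way to version a small dataset is to publish it as a named Weave object. This example assumes the weave.Dataset and weave.publish APIs and uses a hypothetical project and dataset name; it is an illustration, not the only way to do it.

import weave

weave.init("my-project")  # hypothetical project name

# Publish a named, versioned dataset that later runs and evaluations can reference.
dataset = weave.Dataset(
    name="person-docs",  # hypothetical dataset name
    rows=[
        {"doc": "The first person to land on the moon was Neil Armstrong."},
        {"doc": "There were three of them: Kara, Frank, and Shaun"},
    ],
)
weave.publish(dataset)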
Weave makes this easy. Wrap any Python function with @weave.op(), and Weave will capture and version the function's code and log traces of all calls, including their inputs and outputs.
import weave
import openai

@weave.op()
def extract_first_person_name(doc: str) -> str:
    client = openai.OpenAI()
    prompt_template = "What is the first person's name in the following document: {doc}"
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt_template.format(doc=doc)}],
    )
    return response.choices[0].message.content
Call weave.init("my-project") to enable Weave tracking, and then call your function as normal.
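For example, using the extract_first_person_name op defined above (the document string is borrowed from later in this post):

weave.init("my-project")

name = extract_first_person_name("There were three of them: Kara, Frank, and Shaun")
print(name)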
If you change the function’s code and then call it again, Weave will track a new version of the function.
You can also pull configuration like the prompt template out of the function body and onto a Weave Object, so it is versioned and queryable on its own:

class PersonExtractor(weave.Object):
    # The prompt template becomes a versioned attribute instead of a hard-coded string.
    prompt_template: str

    @weave.op()
    def extract(self, doc: str) -> str:
        client = openai.OpenAI()
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": self.prompt_template.format(doc=doc)}],
        )
        return response.choices[0].message.content
Weave Objects use the pydantic library under the hood. You can instantiate and call them like this:
person_extract = PersonExtractor(prompt_template="What is the first person's name in the following document: {doc}")
name = person_extract.extract("There were three of them: Kara, Frank, and Shaun")
Weave Objects are organized and versioned automatically as well. Now that we’ve extracted the prompt template from inside the function’s body, it is queryable. You can easily filter down calls to those that used a specific word in their prompt template.
You should use Weave tracking in both development and production to easily capture and organize all the valuable data generated by your AI development process.
Weave Evaluations
Evaluations are like unit tests for AI applications. But since AI models are non-deterministic, we use scoring functions instead of strict pass/fail assertions.
You should continuously evolve a suite of Evaluations for any AI applications that you build, just like you would write a suite of unit tests for software.
Here’s a simple example:
import asyncio

import weave
import openai

@weave.op()
def score_match(expected, prediction):
    return expected == prediction

eval = weave.Evaluation(
    dataset=[
        {
            "doc": "The first person to land on the moon was Neil Armstrong.",
            "expected": "Neil Armstrong",
        },
        {
            "doc": "There were three of them: Kara, Frank, and Shaun",
            "expected": "Kara",
        },
        {"doc": "There are two cats: Jimbob, and Zelda.", "expected": None},
    ],
    scorers=[score_match],
)

@weave.op()
def extract_first_person_name(doc):
    client = openai.OpenAI()
    prompt_template = "What is the name of the first person in the following document? Just give the name and nothing else. Document: {doc}"
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt_template.format(doc=doc)}],
    )
    return response.choices[0].message.content

weave.init("weave-announcement-draft-eval-1")
asyncio.run(eval.evaluate(extract_first_person_name))
Looking at the results, it turns out our model is also extracting the names of cats. We should be able to fix it with a little bit of prompt engineering.
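For example, one possible tweak (a sketch of the idea, not necessarily the exact fix) is to tell the model to answer with the literal string "None" when no person is mentioned, and map that back to Python None so it matches the expected value in the dataset:

@weave.op()
def extract_first_person_name_v2(doc):
    client = openai.OpenAI()
    # Revised prompt: only people count, and "None" is the explicit fallback answer.
    prompt_template = (
        "What is the name of the first person mentioned in the following document? "
        "Only count people, not animals. If no person is mentioned, answer exactly None. "
        "Just give the name and nothing else. Document: {doc}"
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt_template.format(doc=doc)}],
    )
    name = response.choices[0].message.content.strip()
    return None if name == "None" else name

asyncio.run(eval.evaluate(extract_first_person_name_v2))

Re-running the evaluation against the revised op lets you compare the two prompt versions side by side.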
With Evaluations you can spot trends, identify regressions, and make informed decisions about future iterations.
Go forth and build
There’s a lot more to discover in today’s release. Head on over to http://wandb.me/weave to get started.
The best way to build great tools is to talk to users. We love feedback of all varieties. Please get in touch: @weights_biases on X, or email: support@wandb.com