
How to train an LLM router with W&B Weave and Not Diamond

LLM routers can drastically reduce cost and improve performance. Here's how to train one with W&B Weave and Not Diamond.
Created on November 5|Last edited on November 12
This article was written in collaboration with Tomás Hernando Kofman, Co-founder of Not Diamond.
If you’re building an AI application, you have probably asked yourself these questions:
  • Which model should I use?
  • How do I know if my application is working correctly?
You're not alone. The non-deterministic nature of LLMs, coupled with the diversity of models available, means that if we want to build production-worthy applications, we need to evaluate and test each of our models. But even the best overall model might not perform well on all edge cases, and there may be other models that are orders of magnitude cheaper yet can still optimally handle most inputs.
One solution to this problem is to train an LLM router, which learns to select the best LLM for every query. Routing can improve accuracy by up to 25% while reducing inference cost and latency by up to 10x. You can think of a model router as a meta-model that combines multiple LLMs to leverage the unique strengths (and costs) of each one.
In this tutorial, we will use Not Diamond, an AI model router training workflow integrated with W&B Weave, to train a custom router on data you already have on W&B.

How does model routing work?

For any distribution of data, a single model will rarely outperform every other model on every query. Model routing works by combining multiple models into a "meta-model" that learns when to call each LLM. The result can outperform every individual model while driving down cost and latency, because smaller, cheaper models are used whenever doing so doesn't degrade quality.
To determine the best LLM for a given query, we train a custom router on evaluation data for the LLMs we want to route between. The router can then recommend which of those LLMs is best suited to answer novel queries that are not in the evaluation dataset.
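To make the "meta-model" idea concrete, here is a minimal sketch that compares the best single model against an idealized router that always picks a correct model when one exists. The model names and scores are made up for illustration, and the "oracle" is only an upper bound: a trained router approximates it rather than matching it.

```python
from typing import Dict, List

# Hypothetical per-query correctness (1 = correct, 0 = wrong) for three toy models.
scores: Dict[str, List[int]] = {
    "model_a": [1, 1, 0, 0, 1],
    "model_b": [0, 1, 1, 0, 1],
    "model_c": [1, 0, 0, 1, 0],
}


def accuracy(values: List[int]) -> float:
    """Fraction of queries answered correctly."""
    return sum(values) / len(values)


def best_single_model(scores: Dict[str, List[int]]) -> str:
    """Name of the model with the highest overall accuracy (first one wins ties)."""
    return max(scores, key=lambda name: accuracy(scores[name]))


def oracle_router_accuracy(scores: Dict[str, List[int]]) -> float:
    """Accuracy of an idealized router that picks the best model for each query."""
    n_queries = len(next(iter(scores.values())))
    per_query_best = [
        max(model_scores[i] for model_scores in scores.values())
        for i in range(n_queries)
    ]
    return accuracy(per_query_best)


best = best_single_model(scores)
best_accuracy = accuracy(scores[best])
oracle_accuracy = oracle_router_accuracy(scores)
print(f"Best single model: {best} ({best_accuracy:.2f})")
print(f"Oracle router: {oracle_accuracy:.2f}")
```

In this toy data no single model answers every query, yet every query is answered correctly by at least one model, which is the gap a router tries to close.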


Train your own custom router with Weave

Using our integration with Weave, you can build a custom router from evaluation results you have already stored on Weights & Biases. Not Diamond uses these evaluation results to create a custom router for your use case in under 10 minutes.
Let’s walk through an example together.

Setup

In this example, we will download prepared evaluation results on HumanEval for the following models:
  • openai/gpt-4o-2024-05-13
  • openai/gpt-4-turbo-2024-04-09
  • google/gemini-1.5-pro-latest
  • anthropic/claude-3-opus-20240229
  • anthropic/claude-3-5-sonnet-20240620
In practice, you will use your own evaluations to train a custom router. Take a look at this guide for how to evaluate LLMs using Weave.
!curl -L "https://drive.google.com/uc?export=download&id=1q1zNZHioy9B7M-WRjsJPkfvFosfaHX38" -o humaneval.csv

Prepare the dataset

Next, we'll load the dataset into Weave's Dataset and EvaluationResults classes, holding out 20% of the rows for testing:
import random

import weave
from weave.flow.dataset import Dataset
from weave.flow.eval import EvaluationResults
from weave.integrations.notdiamond.util import get_model_evals

pct_train = 0.8
pct_test = 1 - pct_train

# In practice, you will build an Evaluation on your dataset and call
# `evaluation.get_eval_results(model)`
model_evals = get_model_evals("./humaneval.csv")
model_train = {}
model_test = {}
for model, evaluation_results in model_evals.items():
    n_results = len(evaluation_results.rows)
    all_idxs = list(range(n_results))
    train_idxs = random.sample(all_idxs, k=int(n_results * pct_train))
    test_idxs = [idx for idx in all_idxs if idx not in train_idxs]

    model_train[model] = EvaluationResults(
        rows=weave.Table([evaluation_results.rows[idx] for idx in train_idxs])
    )
    model_test[model] = Dataset(
        rows=weave.Table([evaluation_results.rows[idx] for idx in test_idxs])
    )
    print(
        f"Found {len(train_idxs)} train rows and {len(test_idxs)} test rows for {model}."
    )
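
The loop above draws a fresh random sample for each model, so different models may end up with different train and test rows. If you would rather hold out the same prompts for every model and make the split reproducible, one option is to compute the indices once with a seeded generator and reuse them. This is a minimal sketch, not part of the Weave integration: `split_indices` is a hypothetical helper, and it assumes every model's rows appear in the same order.

```python
import random
from typing import List, Tuple


def split_indices(
    n_rows: int, pct_train: float, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Return (train_idxs, test_idxs) from a seeded shuffle of range(n_rows)."""
    rng = random.Random(seed)  # local generator, leaves global random state alone
    idxs = list(range(n_rows))
    rng.shuffle(idxs)
    n_train = int(n_rows * pct_train)
    return sorted(idxs[:n_train]), sorted(idxs[n_train:])


# Compute the split once, then index every model's rows with the same lists.
train_idxs, test_idxs = split_indices(n_rows=10, pct_train=0.8, seed=42)
print(f"{len(train_idxs)} train rows, {len(test_idxs)} test rows")
```

Sharing one split keeps held-out prompts out of every model's training rows, which matters when the router is later evaluated on the test set.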

Training a custom router

Now that you have EvaluationResults, you can train a custom router. Make sure you have created a Not Diamond account and generated an API key, then set it in the NOTDIAMOND_API_KEY environment variable or insert it below.

import os

from weave.integrations.notdiamond.custom_router import train_router

api_key = os.getenv("NOTDIAMOND_API_KEY", "<YOUR_API_KEY>")

preference_id = train_router(
    model_evals=model_train,
    prompt_column="prompt",
    response_column="actual",
    language="en",
    maximize=True,
    api_key=api_key,
    # Leave this commented out to train your first custom router
    # Uncomment this to retrain your custom router in place
    # preference_id=preference_id,
)
You can then follow the training process for your custom router via the Not Diamond app.
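Retraining in place needs the preference_id from an earlier run, so it can help to persist it between sessions. The following is a minimal sketch that assumes preference_id is a plain string; the helper names and the `router_state.json` file name are hypothetical and not part of either library.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def save_preference_id(path: Path, preference_id: str) -> None:
    """Write the router's preference ID to a small JSON file."""
    path.write_text(json.dumps({"preference_id": preference_id}))


def load_preference_id(path: Path) -> Optional[str]:
    """Return the stored preference ID, or None if nothing has been saved yet."""
    if not path.exists():
        return None
    return json.loads(path.read_text()).get("preference_id")


# Demo in a temporary directory; in practice, use a path inside your project.
with tempfile.TemporaryDirectory() as tmp:
    state_file = Path(tmp) / "router_state.json"
    first = load_preference_id(state_file)  # nothing saved yet
    save_preference_id(state_file, "example-preference-id")  # placeholder value
    restored = load_preference_id(state_file)

print(first, restored)
```

On a later run, a non-None result from `load_preference_id` could be passed as the `preference_id` argument to retrain the existing router rather than create a new one.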


Use your custom router

Once your custom router has finished training, you can use it to route your prompts. All you need to do is specify the preference_id in your model_select calls. Here's the code:
from notdiamond import NotDiamond

import weave

weave.init("notdiamond-quickstart")

llm_configs = [
    "anthropic/claude-3-5-sonnet-20240620",
    "openai/gpt-4o-2024-05-13",
    "google/gemini-1.5-pro-latest",
    "openai/gpt-4-turbo-2024-04-09",
    "anthropic/claude-3-opus-20240229",
]
client = NotDiamond(api_key=api_key, llm_configs=llm_configs)

# The outer string uses ''' so that the docstring's """ can appear inside it.
new_prompt = '''
You are a helpful coding assistant. Using the provided function signature, write the implementation for the function
in Python. Write only the function. Do not include any other text.

from typing import List


def has_close_elements(numbers: List[float], threshold: float) -> bool:
    """ Check if in given list of numbers, are any two numbers closer to each other than
    given threshold.
    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)
    False
    >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)
    True
    """
'''
session_id, routing_target_model = client.model_select(
    messages=[{"role": "user", "content": new_prompt}],
    preference_id=preference_id,  # The UUID returned from train_router
)

print(f"Session ID: {session_id}")
print(f"Routed to: {routing_target_model}")
Because Weave is initialized, this request is traced automatically, and the trace is available in the Weave UI.
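The call above returns a recommendation; sending the prompt to the recommended model is a separate step. The sketch below shows one way to dispatch on the result, assuming you can render the routed model as a "provider/model" string like the entries in llm_configs. The handler functions are placeholders that stand in for real provider clients, and none of these names come from the Not Diamond SDK.

```python
from typing import Callable, Dict, Tuple


def split_model_id(model_id: str) -> Tuple[str, str]:
    """Split 'provider/model' into its two parts, validating the shape."""
    provider, sep, model = model_id.partition("/")
    if not sep or not provider or not model:
        raise ValueError(f"Expected 'provider/model', got {model_id!r}")
    return provider, model


# Placeholder handlers: replace each with a call to the provider's own client.
def call_openai_stub(model: str, prompt: str) -> str:
    return f"[openai:{model}] received {len(prompt)} characters"


def call_anthropic_stub(model: str, prompt: str) -> str:
    return f"[anthropic:{model}] received {len(prompt)} characters"


HANDLERS: Dict[str, Callable[[str, str], str]] = {
    "openai": call_openai_stub,
    "anthropic": call_anthropic_stub,
}


def dispatch(model_id: str, prompt: str) -> str:
    """Send the prompt to the handler registered for the routed provider."""
    provider, model = split_model_id(model_id)
    if provider not in HANDLERS:
        raise ValueError(f"No handler registered for provider {provider!r}")
    return HANDLERS[provider](model, prompt)


reply = dispatch("openai/gpt-4o-2024-05-13", "Write a haiku about routing.")
print(reply)
```

Keeping the provider lookup in one table means that adding a model to llm_configs only requires registering a handler for its provider.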


Evaluating your custom router

Once you have trained your custom router, you can evaluate either its
  • In-sample performance by submitting the training prompts, or
  • Out-of-sample performance by submitting new or held-out prompts
Below, we submit the test set to the custom router to evaluate its performance.
from weave.integrations.notdiamond.custom_router import evaluate_router

eval_prompt_column = "prompt"
eval_response_column = "actual"

best_provider_model, nd_model = evaluate_router(
    model_datasets=model_test,
    prompt_column=eval_prompt_column,
    response_column=eval_response_column,
    api_key=api_key,
    preference_id=preference_id,
)


@weave.op()
def is_correct(score: int, model_output: dict) -> dict:
    # We hack score, since we already have model responses
    return {"correct": score}


# Top-level await works in a notebook; in a script, wrap these calls in asyncio.run().
best_provider_eval = weave.Evaluation(
    dataset=best_provider_model.model_results.to_dict(orient="records"),
    scorers=[is_correct],
)
await best_provider_eval.evaluate(best_provider_model)

nd_eval = weave.Evaluation(
    dataset=nd_model.model_results.to_dict(orient="records"), scorers=[is_correct]
)
await nd_eval.evaluate(nd_model)
In this instance, the Not Diamond "meta-model" routes prompts across several different models.
Training the custom router via Weave will also run evaluations and upload results to the Weave UI. Once the custom router process is completed, you can review the results there.
In the UI, we see that the Not Diamond "meta-model" outperforms the best-performing individual model by routing prompts to other models that are more likely to answer them accurately.
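If you want the same comparison as a single number outside the UI, you can summarize the scored records directly. This is a minimal sketch with made-up records; it assumes each record carries a 0/1 `score` field, as the `is_correct` scorer above expects, and the helper name is hypothetical.

```python
from typing import Dict, List


def accuracy_from_records(records: List[Dict[str, int]]) -> float:
    """Mean of the 0/1 'score' field across evaluation records."""
    if not records:
        raise ValueError("No records to score")
    return sum(record["score"] for record in records) / len(records)


# Made-up records for illustration only, not results from this tutorial.
best_provider_records = [{"score": 1}, {"score": 0}, {"score": 1}, {"score": 0}]
router_records = [{"score": 1}, {"score": 1}, {"score": 1}, {"score": 0}]

best_provider_acc = accuracy_from_records(best_provider_records)
router_acc = accuracy_from_records(router_records)
lift = router_acc - best_provider_acc

print(f"Best provider: {best_provider_acc:.2f}")
print(f"Router: {router_acc:.2f} (absolute lift {lift:+.2f})")
```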


Conclusion

In this tutorial, we showed how you can use Not Diamond and W&B Weave to train an LLM router from evaluation results you already have. To try out this example, sign up to Not Diamond to get your API key.
