
How to migrate from Humanloop to W&B Weave

This article explains how to migrate traces, prompts, datasets, evaluators, and evaluations from Humanloop into W&B Weave to preserve past work while gaining a more powerful environment for ongoing development and monitoring.
Humanloop’s recent acquisition by Anthropic means the platform will no longer be available to its users. If you’ve been relying on it to manage traces, prompts, datasets, and evaluations, you now face the question of where that work will live going forward. The good news is that you don’t have to start over. W&B Weave offers a natural home for these assets, giving you continuity for your past experiments and a more powerful environment for your future ones.
By migrating into Weave, you preserve the full history of your development while gaining an interactive UI for tracing, built-in feedback capture, versioned prompts and datasets, reusable evaluators, and evaluation runs that can be compared side by side. Instead of being left with static exports or losing access altogether, you can move seamlessly into a platform designed to track and improve model performance in production.
In this article, we'll outline the process of migrating your data into W&B Weave (note: we weren’t able to obtain a real export file, so your own exports may differ slightly in structure). Consider the Humanloop data formats here as illustrative, while the Weave upload code itself is fully correct and will work once adapted to your data.



Migrating Traces to W&B Weave

Traces in Humanloop captured the entire lifecycle of an AI agent: the user’s input, the system prompts, every tool call, and the refinements that followed. They served as a record of how decisions were made, making it possible to debug outputs, retrace reasoning, and judge whether behavior matched your design goals. With the platform shutting down, preserving this history matters because it holds not just raw data but the pathways your team relied on to build confidence in the application.
Traces are a core feature in Weave. Every call to a model, function, or tool can be logged with its inputs, outputs, and context, then linked together to show how the application executed. This functionality is incredibly valuable since it allows you to return to a trace to see what the model saw, how it responded, and how it was judged.
We can write a simple script to migrate these traces to Weave. Each trace is reconstructed as a sequence of calls that can be nested to reflect parent–child relationships. A Humanloop ‘flow’ call might represent the root interaction with a user, while child prompt calls show how the model responded at each step, and tool calls capture external lookups or calculations. Replaying a Humanloop trace in W&B means iterating through the original log and creating these calls in order.
Here’s the code:
import weave
PROJECT = "hl_migrate"

# A more complete interaction: Flow -> prompts -> tools -> final plan
HL_LOGS = [
# root flow
{"id": "root1", "type": "flow",
"inputs": {"topic": "Tokyo trip", "days": 3, "preferences": {"kids": True, "budget_usd": 1200}},
"output": None},

# gather requirements via prompt
{"id": "p_requirements", "type": "prompt", "parent_id": "root1",
"inputs": {"messages": [
{"role": "system", "content": "You are a meticulous travel planner."},
{"role": "user", "content": "Family of 3 visiting Tokyo for 3 days. Like food, parks, tech. Budget 1200 USD."}
]},
"output": "Got it. Will plan kid friendly itinerary with food, parks, and tech within budget."},

# weather tool to inform outdoor choices
{"id": "t_weather", "type": "tool", "parent_id": "root1",
"inputs": {"tool": "weather", "city": "Tokyo", "dates": ["2025-09-10","2025-09-11","2025-09-12"]},
"output": {"daily": [
{"date": "2025-09-10", "forecast": "Sunny", "high_c": 29},
{"date": "2025-09-11", "forecast": "Partly Cloudy", "high_c": 28},
{"date": "2025-09-12", "forecast": "Light Rain", "high_c": 26}
]}},

# attractions search tool
{"id": "t_search_attractions", "type": "tool", "parent_id": "root1",
"inputs": {"tool": "search_attractions", "city": "Tokyo",
"filters": {"kid_friendly": True, "themes": ["parks","science","animals","interactive"]}},
"output": [
{"name": "Ueno Zoo", "area": "Ueno"},
{"name": "teamLab Planets", "area": "Toyosu"},
{"name": "Odaiba Seaside Park", "area": "Odaiba"},
{"name": "Miraikan", "area": "Odaiba"},
{"name": "Asakusa Senso-ji", "area": "Asakusa"}
]},

# first draft plan from model
{"id": "p_draft", "type": "prompt", "parent_id": "root1",
"inputs": {"messages": [
{"role": "system", "content": "Create a 3 day Tokyo itinerary using the weather and attraction data."},
{"role": "user", "content": "Prefer short transfers, street food, and a couple of museums."}
]},
"output": "Draft: Day 1 Asakusa and Ueno. Day 2 Odaiba. Day 3 Toyosu and Shibuya."},

# distances to tighten transfers
{"id": "t_maps", "type": "tool", "parent_id": "root1",
"inputs": {"tool": "maps_batch_dist",
"legs": [
["Asakusa","Ueno"], ["Ueno","Akihabara"], ["Toyosu","Odaiba"], ["Odaiba","Shibuya"]
]},
"output": {"km": [2.2, 2.1, 5.0, 10.8]}},

# simple budget calculator
{"id": "t_budget", "type": "tool", "parent_id": "root1",
"inputs": {"tool": "budget_calc",
"items": [
{"label": "hotel_2n", "usd": 600},
{"label": "food", "usd": 300},
{"label": "transport", "usd": 90},
{"label": "tickets_misc", "usd": 140}
]},
"output": {"total_usd": 1130, "under_budget": True}},

# refine with rain day handling and distances
{"id": "p_refine", "type": "prompt", "parent_id": "root1",
"inputs": {"messages": [
{"role": "system", "content": "Refine the itinerary. Use indoor-heavy activities on the rain day."},
{"role": "user", "content": "Rain likely on Day 3. Keep transfers short."}
]},
"output": "Refined: Day 1 Asakusa-Ueno. Day 2 Odaiba (Miraikan). Day 3 Toyosu (teamLab Planets) then Shibuya."},

# final plan summary
{"id": "p_final", "type": "prompt", "parent_id": "root1",
"inputs": {"messages": [
{"role": "system", "content": "Return the final plan with times, food picks, and short notes."},
{"role": "user", "content": "Please include breakfast, lunch, dinner recs near each area."}
]},
"output":
"Day 1 Asakusa and Ueno. Morning Senso-ji and Nakamise snacks. Lunch Ueno Ameyoko stalls. "
"Afternoon Ueno Zoo and Ueno Park. Dinner Akihabara ramen. "
"Day 2 Odaiba. Morning Miraikan interactive exhibits. Lunch Aqua City food court. "
"Afternoon Odaiba Seaside Park and Palette Town. Dinner DiverCity casual Japanese. "
"Day 3 Toyosu and Shibuya. Morning teamLab Planets (indoor, good for rain). "
"Lunch Toyosu Market sushi. Afternoon Shibuya Crossing and Miyashita Park playground. "
"Dinner Shibuya yakitori alley. Total budget about 1130 USD within target. Transfers kept short where possible."
}
]

def op_name(l):
    t = l.get("type")
    return "flow" if t == "flow" else "prompt" if t == "prompt" else "tool"

def main():
    client = weave.init(PROJECT)
    made = {}
    for l in HL_LOGS:
        parent_call = made.get(l.get("parent_id"))
        call = client.create_call(
            op=op_name(l),
            inputs=l.get("inputs", {}),
            parent=parent_call
        )
        client.finish_call(call, output=l.get("output"))
        made[l["id"]] = call
    print("uploaded", len(HL_LOGS), "calls to", PROJECT)

if __name__ == "__main__":
    main()
Instead of static JSON files or legacy exports that become unreadable after Humanloop shuts down, your traces live interactively in the W&B interface. You can open any call, see the exact prompt text and the model’s response, and explore how tools were chained together.
After migrating your traces to W&B, you can navigate to the UI and visualize each trace!


The migration scripts rebuild past Humanloop logs into Weave so that old traces remain accessible. That process is useful for preserving history, but it isn’t how you’ll normally capture traces going forward. In production, logging traces is even simpler: you only need to add the @weave.op decorator to the functions you want to trace.
Using @weave.op ensures that every call is automatically logged as a trace without additional boilerplate code. Inputs and outputs are captured each time the function runs, and the call graph is constructed on the fly. This means you can monitor live traffic the same way you inspected migrated traces, but with none of the manual reconstruction work.
Here’s an example:
import weave
from openai import OpenAI

# initialize weave once at the start of your program
weave.init("production-traces")

client = OpenAI()

# wrap your inference function with @weave.op
@weave.op
def run_model(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# every call will now be automatically logged to Weave
print(run_model("Write me a haiku about sushi."))
For many workloads you don’t even need to decorate your own functions. Weave is natively integrated with popular LLM libraries like openai, anthropic, and litellm. Once you import weave and call weave.init(...), any requests made through those clients are automatically tracked. That means tracing can begin with almost no code changes, giving you immediate observability into your production traffic.
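For example, here’s a minimal sketch (reusing the hypothetical "production-traces" project from above) where an undecorated OpenAI call is traced purely through the integration:
import weave
from openai import OpenAI

# initializing weave is enough: the OpenAI integration patches the client
weave.init("production-traces")

client = OpenAI()

# no decorator here, yet this call is logged to Weave with its inputs and outputs
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Write me a haiku about sushi."}],
)
print(response.choices[0].message.content)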
In the UI, you can open any call and add a reaction such as a thumbs up or thumbs down, or leave a written note explaining what worked or what failed. These annotations are stored inline with the trace and are visible whenever you inspect inputs and outputs. The result is an immediate link between the model’s behavior and human evaluation.

The SDK makes it possible to add or query feedback programmatically. This is useful if you’re importing judgments collected in Humanloop or want to automate feedback capture during tests. You can log custom payloads, filter by reaction type, and build datasets of annotated calls. These datasets can later be reused in evaluations, fine-tuning, or debugging pipelines.
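As a rough sketch (the call ID below is a placeholder you would copy from the UI or from a calls query), adding and reading feedback through the SDK looks like this:
import weave

client = weave.init("production-traces")

# fetch an existing call by ID (placeholder: copy a real ID from the Weave UI)
call = client.get_call("<call_id>")

# attach a reaction, a free-form note, and a custom payload
call.feedback.add_reaction("👍")
call.feedback.add_note("Imported judgment from Humanloop")
call.feedback.add("humanloop_rating", {"value": 4, "rater": "qa-team"})

# read feedback back for the project, e.g. to build a dataset of annotated calls
for item in client.get_feedback():
    print(item.feedback_type, item.payload)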

Migrating Prompts

Prompts define the behavior of your models, and migrating them ensures that the design work you put into crafting instructions doesn’t get lost. In Humanloop, prompts typically carried more than just the raw text. They often included configuration details like model name, temperature, ownership, version identifiers, or tags used for organization. To keep this information useful after migration, you can publish it in Weave as a separate object alongside the prompt itself.
In Weave, a prompt is represented as a weave.StringPrompt, which can be published under a stable name and reused across projects. The additional metadata from Humanloop can be packaged into a Python dictionary and published in the same way. This keeps the text and configuration logically connected but modular, so that prompts can be retrieved, versioned, and experimented with while still carrying the context in which they were created.
Here’s some code showing how to upload prompts to Weave:
import weave

# Init your project
weave.init("intro-example")

# Publish a StringPrompt under a name
pirate_prompt = weave.StringPrompt("You are a pirate. Rewrite: {text} twice")
weave.publish(pirate_prompt, name="pirate_prompt")

# Publish your metadata as just another object under its own name
meta = {
"config": {
"model": "gpt-4o-mini",
"temperature": 0.2,
"meta": {
"team": "search",
"owner": "brett",
"source": "humanloop-migration",
"humanloop": {"file_id": "pr_123", "version_id": "v_456"},
},
},
"tags": ["summary", "imported"],
}
weave.publish(meta, name="pirate_prompt_meta")

# ----------------------
# Later, load both back
# ----------------------
fetched_prompt = weave.ref("pirate_prompt").get()
fetched_meta = weave.ref("pirate_prompt_meta").get()

print("Owner from meta:", fetched_meta["config"])
print("Prompt output:", fetched_prompt.format(text="The Eiffel Tower is in Paris."))
Weave also supports other formats for storing prompts. For more details, check out the docs here.
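For instance, if your Humanloop prompts were chat templates rather than single strings, weave.MessagesPrompt may be a closer fit than StringPrompt; here’s a small sketch:
import weave

weave.init("intro-example")

# a chat-style prompt with placeholders, published just like a StringPrompt
support_prompt = weave.MessagesPrompt([
    {"role": "system", "content": "You are a concise support agent."},
    {"role": "user", "content": "Customer question: {question}"},
])
weave.publish(support_prompt, name="support_prompt")

# format fills the placeholders and returns a messages list ready for an LLM client
print(support_prompt.format(question="How do I rotate my API keys?"))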
💡
This StringPrompt-plus-metadata approach mirrors how Humanloop stored prompts: the prompt object carries the actual instructions, while the metadata object captures everything else that defined how the prompt was used. Once migrated, both can be referenced in evaluations or included in production code.
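For example, here’s a rough sketch of pulling the migrated prompt and its config back down and using them inside a traced production function (the OpenAI client is an assumption; the model name and temperature come from the metadata published above):
import weave
from openai import OpenAI

weave.init("intro-example")
openai_client = OpenAI()

# fetch the migrated prompt text and its configuration by name
pirate_prompt = weave.ref("pirate_prompt").get()
meta = weave.ref("pirate_prompt_meta").get()

@weave.op
def rewrite_as_pirate(text: str) -> str:
    # the model name and temperature travel with the migrated metadata
    response = openai_client.chat.completions.create(
        model=meta["config"]["model"],
        temperature=meta["config"]["temperature"],
        messages=[{"role": "user", "content": pirate_prompt.format(text=text)}],
    )
    return response.choices[0].message.content

print(rewrite_as_pirate("The Eiffel Tower is in Paris."))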

Migrating datasets

In Weave, datasets aren’t required in order to run an evaluation. You can pass in rows directly to an evaluation without publishing them first. But if you want to keep track of your data over time, it makes sense to migrate datasets into Weave so they are versioned and tied to your project.
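As a quick sketch of that first option, rows can be handed straight to weave.Evaluation as a list of dicts (using the same prompt_text and target_text fields the converter below produces):
import asyncio
import weave

weave.init("hl_migrate")

@weave.op
def my_model(prompt_text: str) -> str:
    return f"echo: {prompt_text}"  # placeholder model

@weave.op
def exact_match(target_text: str, output: str) -> float:
    return float(target_text.strip().lower() == output.strip().lower())

# rows passed inline, no published dataset required
rows = [
    {"prompt_text": "user: How do I manage my organization's API keys?",
     "target_text": "Hey, thanks for your questions. Here are steps..."},
]

evaluation = weave.Evaluation(dataset=rows, scorers=[exact_match])
asyncio.run(evaluation.evaluate(my_model))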
A dataset in Weave is just a list of rows with fields like prompt_text and target_text. Once published, it shows up in the UI and can be reused in multiple evaluations. Here’s some code showing how to upload a dataset to Weave:
import json
import weave
from weave import Dataset

# put your Humanloop JSON inline or load from file
hl_json = {
"datapoints": [
{
"messages": [
{
"role": "user",
"content": "How do i manage my organizations API keys?\n"
}
],
"target": {
"response": "Hey, thanks for your questions. Here are steps..."
}
},
{
"messages": [
{
"role": "user",
"content": "Hey, can do I use my code evaluator for monitoring my legal-copilot prompt?"
}
],
"target": {
"response": "Hey, thanks for your questions. Here are steps..."
}
}
],
"path": "datasets/support-queries",
"version_name": "Initial version",
"version_description": "Add two new questions and answers"
}

def to_weave_rows(hl_obj):
    rows = []
    for i, dp in enumerate(hl_obj.get("datapoints", [])):
        messages = dp.get("messages") or []
        target = dp.get("target") or {}
        inputs = dp.get("inputs") or {}

        prompt_text = ""
        if messages:
            parts = []
            for m in messages:
                role = m.get("role", "")
                content = m.get("content", "")
                parts.append(f"{role}: {content}".strip())
            prompt_text = "\n".join(parts)
        elif inputs:
            prompt_text = json.dumps(inputs, ensure_ascii=False)

        if isinstance(target, dict):
            target_text = target.get("response") or ""
        else:
            target_text = str(target)

        row = {
            "id": str(i),
            "messages": messages,
            "inputs": inputs,
            "prompt_text": prompt_text,
            "target_text": target_text
        }
        rows.append(row)
    return rows

# init your Weave project
weave.init("hl_migrate")

rows = to_weave_rows(hl_json)

dataset_name = hl_json.get("path", "humanloop_import").split("/")[-1]
dataset = Dataset(name=dataset_name, rows=rows)

# publish to Weave
weave.publish(dataset)

print(f"Uploaded {len(rows)} rows to weave dataset '{dataset.name}'")
Each row is viewable with its prompt and target, and the dataset can be versioned and reused in evaluations.

After uploading a dataset into Weave, you can immediately use it to run evaluations. Datasets keep your rows organized and versioned, and evaluations let you measure how your model performs on them.
Here’s an example:
import asyncio
import weave

weave.init("hl_migrate")

# retrieve the published dataset version (the first publish is v0)
dataset_v0 = weave.ref("support-queries:v0").get()

# simple model wrapped as a weave op
@weave.op
def my_model(prompt_text: str) -> str:
    # replace with your real model call
    return f"echo: {prompt_text}"

# scorer that compares predicted text to the gold text
@weave.op
def exact_match(target_text: str, output: str) -> float:
    return float(target_text.strip().lower() == output.strip().lower())

# map each dataset row to the model input args
def preprocess_model_input(row: dict):
    # the converter script created prompt_text and target_text fields
    return {"prompt_text": row.get("prompt_text", "")}

evaluation = weave.Evaluation(
    name="support-queries-eval",
    dataset=dataset_v0,
    scorers=[exact_match],
    preprocess_model_input=preprocess_model_input,
)

async def main():
    await evaluation.evaluate(my_model)
    print("evaluation complete. check the Weave UI under Evaluations in hl_migrate")

if __name__ == "__main__":
    asyncio.run(main())
This script loads your dataset from Weave, runs your model on each row, and applies the scorer function to measure performance. Once the evaluation finishes, results show up directly in the Weave UI, where you can inspect scores per example or compare across evaluation runs. Here’s a screenshot inside the comparison view showing how we can visualize the inputs and outputs for each sample in the dataset:


Migrating Humanloop Evaluators

Evaluators in Humanloop defined how model outputs were judged. Some were simple Python functions, others used an LLM to provide natural language judgments. To preserve that work, you can migrate evaluators into Weave so they remain reusable objects.
In Weave, an evaluator can be stored as a plain Python dictionary describing its spec, then published under a stable name. Once uploaded, you can pull it back down and wrap it inside a Scorer class. The Scorer defines a score method decorated with @weave.op, so every evaluation run is logged just like a model call.
Here’s an example showing both a Python-based accuracy evaluator and an LLM-based evaluator, published and then used in scoring:
import weave
from weave import Scorer
from litellm import completion
from pydantic import Field

weave.init("hl_migrate")

# ----------------------------
# Evaluator Specs
# ----------------------------
python_evaluator = {
"spec": {
"arguments_type": "target_required",
"return_type": "number",
"evaluator_type": "python",
"code": "def evaluate(answer, target):\n return 1.0 if answer.strip()==target.strip() else 0.0",
"number_valence": "positive"
},
"path": "Shared Evaluators/Simple Accuracy",
"version_name": "v1",
"version_description": "Python accuracy evaluator"
}

llm_evaluator = {
"spec": {
"arguments_type": "target_required",
"return_type": "text",
"evaluator_type": "llm",
"prompt": {
"provider": "openai",
"model": "gpt-4o-mini",
"system_prompt": "You are a helpful evaluator. Judge correctness of {answer} against {target}.",
"user_prompt": "Answer: {answer}\nTarget: {target}\nGive a 1 sentence judgment."
}
},
"path": "Shared Evaluators/LLM Judge",
"version_name": "v1",
"version_description": "LLM based evaluator"
}

# ----------------------------
# Publish evaluators to Weave
# ----------------------------
weave.publish(python_evaluator, name="simple_accuracy_eval")
weave.publish(llm_evaluator, name="llm_judge_eval")

# ----------------------------
# Retrieve them back down
# ----------------------------
fetched_python_eval = weave.ref("simple_accuracy_eval").get()
fetched_llm_eval = weave.ref("llm_judge_eval").get()

# ----------------------------
# Python Evaluator Scorer
# ----------------------------
class PythonEvaluatorScorer(Scorer):
    evaluator: dict = Field(...)

    @weave.op
    def score(self, answer: str, target: str) -> dict:
        spec = self.evaluator["spec"]
        try:
            local_env = {}
            exec(spec["code"], {}, local_env)
            result = local_env["evaluate"](answer, target)
            return {"score": result}
        except Exception as e:
            return {"error": str(e)}

# ----------------------------
# LLM Evaluator Scorer
# ----------------------------
class LLMEvaluatorScorer(Scorer):
    evaluator: dict = Field(...)

    @weave.op
    def score(self, answer: str, target: str) -> dict:
        spec = self.evaluator["spec"]
        try:
            user_prompt = spec["prompt"]["user_prompt"].format(answer=answer, target=target)
            system_prompt = spec["prompt"]["system_prompt"]
            response = completion(
                model=spec["prompt"]["model"],
                messages=[
                    {"role": "system", "content": system_prompt},
                    {"role": "user", "content": user_prompt}
                ],
                temperature=0,
                max_tokens=200,
            )
            return {"judgment": response["choices"][0]["message"]["content"].strip()}
        except Exception as e:
            return {"error": str(e)}

# ----------------------------
# Example Usage
# ----------------------------
if __name__ == "__main__":
    answer, target = "Hello world", "Hello world!"

    # use fetched evaluators from weave
    py_scorer = PythonEvaluatorScorer(evaluator=fetched_python_eval)
    llm_scorer = LLMEvaluatorScorer(evaluator=fetched_llm_eval)

    print("Python Evaluator:", py_scorer.score(answer, target))
    print("LLM Evaluator:", llm_scorer.score(answer, target))
With this setup, you can reuse evaluators from Humanloop inside Weave evaluations. Just like datasets and prompts, they are stored as versioned objects, retrievable by name, and tracked in the UI when they run. This makes it easy to compare scoring logic across experiments or share common evaluators between projects.
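For example, here’s a minimal sketch (assuming the simple_accuracy_eval object published above) of wiring a migrated evaluator into a standard weave.Evaluation by adapting the model’s output to the evaluator’s answer argument; the model and dataset row are placeholders:
import asyncio
import weave

weave.init("hl_migrate")

# pull the migrated evaluator spec back down by name
fetched_eval = weave.ref("simple_accuracy_eval").get()

@weave.op
def migrated_accuracy(output: str, target: str) -> dict:
    # run the migrated Python evaluator code against the model's output
    env = {}
    exec(fetched_eval["spec"]["code"], {}, env)
    return {"score": env["evaluate"](output, target)}

@weave.op
def my_model(question: str) -> str:
    # placeholder model; swap in your real inference call
    return "Paris" if "France" in question else "unknown"

evaluation = weave.Evaluation(
    name="migrated-accuracy-run",
    dataset=[{"question": "What is the capital of France?", "target": "Paris"}],
    scorers=[migrated_accuracy],
)

asyncio.run(evaluation.evaluate(my_model))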

Migrating evaluations

Evaluations in Humanloop tied together a dataset, a model, and one or more evaluators. Each run stored predictions along with scores. When migrating to Weave, you don’t need to change that structure; you simply re-express it using the Weave EvaluationLogger.
The script below shows how to take a past Humanloop evaluation, including its metadata and logs, and push it into Weave. Each prediction is logged with its inputs, outputs, and evaluator scores. At the end, a summary is computed and published so that the whole run appears in the Weave Evaluations tab.
import weave
from weave import EvaluationLogger
from statistics import mean

# -----------------
# HARD CODE DATA
# -----------------
WEAVE_PROJECT = "hl_migrate"

EVAL_META = {
"id": "evr_12345",
"model": "gpt5",
"name": "Prompt Quality Evaluation",
"file_id": "pr_67890",
"created_at": "2025-06-05T12:00:00Z",
"description": "Evaluation of prompt outputs against reference answers",
"status": "completed",
"config": {
"evaluators": ["accuracy", "helpfulness"],
"split": "validation"
}
}

LOGS = [
{"id": "log_001", "input": "What is the capital of France?", "output": "Paris",
"target": "Paris", "evaluators": {"accuracy": 1, "helpfulness": 1}},
{"id": "log_002", "input": "2+2?", "output": "4",
"target": "4", "evaluators": {"accuracy": 0, "helpfulness": 0}},
{"id": "log_003", "input": "2+3?", "output": "6",
"target": "5", "evaluators": {"accuracy": 0, "helpfulness": 0}},
{"id": "log_004", "input": "Summarize 'Romeo and Juliet'", "output": "Two lovers die tragically.",
"target": "Romeo and Juliet is a tragedy about two lovers whose deaths reconcile their feuding families.",
"evaluators": {"accuracy": 1, "helpfulness": 1}},
{"id": "log_005", "input": "What is the capital of Spain?", "output": "Madrid",
"target": "Madrid", "evaluators": {"accuracy": 1, "helpfulness": 1}},
{"id": "log_006", "input": "Translate 'hello' to French", "output": "bonjour",
"target": "bonjour", "evaluators": {"accuracy": 1, "helpfulness": 1}},
{"id": "log_007", "input": "Square root of 16?", "output": "4",
"target": "4", "evaluators": {"accuracy": 1, "helpfulness": 1}},
{"id": "log_008", "input": "Capital of Germany?", "output": "Berlin",
"target": "Berlin", "evaluators": {"accuracy": 1, "helpfulness": 1}},
{"id": "log_009", "input": "What is 10/0?", "output": "Infinity",
"target": "undefined", "evaluators": {"accuracy": 0, "helpfulness": 0}},
{"id": "log_010", "input": "Largest planet in solar system?", "output": "Jupiter",
"target": "Jupiter", "evaluators": {"accuracy": 1, "helpfulness": 1}}
]
# -----------------

def summarize(all_scores):
    agg = {}
    if not all_scores:
        return agg
    keys = set().union(*[d.keys() for d in all_scores if isinstance(d, dict)])
    for k in keys:
        vals = []
        for d in all_scores:
            if not isinstance(d, dict):
                continue
            v = d.get(k)
            if isinstance(v, bool):
                vals.append(1.0 if v else 0.0)
            elif isinstance(v, (int, float)):
                vals.append(float(v))
        if vals:
            agg[k + "_mean"] = mean(vals)
    return agg

def main():
    weave.init(WEAVE_PROJECT)

    model_name = EVAL_META.get("model") or EVAL_META.get("config", {}).get("model") or "unknown_model"
    dataset_name = EVAL_META.get("config", {}).get("dataset") or EVAL_META["name"]

    logger = EvaluationLogger(model=model_name, dataset=dataset_name)

    per_pred_scores = []
    for log in LOGS:
        inputs = {"input": log["input"], "target": log["target"]}
        output = log.get("output")
        pred = logger.log_prediction(inputs=inputs, output=output)

        scores = log.get("evaluators", {})
        for scorer, score in scores.items():
            val = float(score) if isinstance(score, (int, float)) else (1.0 if score else 0.0)
            pred.log_score(scorer=str(scorer), score=val)
        per_pred_scores.append(scores)

        pred.finish()

    summary = summarize(per_pred_scores)
    summary["source_file_id"] = EVAL_META["file_id"]
    summary["humanloop_eval_name"] = EVAL_META["name"]
    logger.log_summary(summary)

    print("migration complete. check Weave Evals tab")

if __name__ == "__main__":
    main()
This preserves historical results while letting you move forward with native Weave evaluations. Old Humanloop scores remain viewable, and once inside Weave, you can compare them against new runs, reuse the same datasets, or swap in migrated evaluators.
After migration, evaluations appear in Weave with the ability to compare different models side by side. The summary view presents aggregate metrics such as accuracy and helpfulness using both bar charts and radar plots, making it straightforward to see whether a new model improves over the baseline or if certain metrics regress.



Conclusion

The transition from Humanloop to Weave doesn’t mean losing the work you’ve already put in. By migrating traces, prompts, datasets, evaluators, and evaluations, you preserve your past experiments while unlocking more powerful tools for future development. Instead of being left with static exports, you gain a new interactive environment where every call can be traced, evaluated, and improved.
Weave not only safeguards your history but also gives you real-time observability, feedback capture, and seamless integration with popular LLM libraries. What began as a migration task quickly becomes an upgrade: your past workflows remain intact, but now they’re connected to a platform built for ongoing experimentation and production monitoring.
Iterate on AI agents and models faster. Try Weights & Biases today.