
Weave Product Features Overview


W&B Weave

Weights & Biases (W&B) Weave is a framework for tracking, experimenting with, evaluating, deploying, and improving LLM-based applications. Designed for flexibility and scalability, Weave supports every stage of your LLM application development workflow. Below is a breakdown of the features offered directly within Weave to help you build your GenAI applications.



Tracing & Monitoring

Weave provides powerful tracing capabilities to track and version objects and function calls in your applications. This comprehensive system enables better monitoring, debugging, and iterative development of AI-powered applications, allowing you to "track insights between commits." With Weave you can:
  • Easily create custom trace trees
  • Build a trace tree to debug data flow
  • Combine scoring with human feedback
  • Compare accuracy, latency, and cost
  • Track parent-child relationships, agentic workflows, and more using the weave.op decorator
  • Log text, datasets, code, images, and audio (video and other modalities are coming soon!)
Trace view within Weave showcasing nested function calls for an image generation model
Visit the W&B Tracking docs for a feature walkthrough.
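As a minimal sketch of the weave.op decorator in action (the project name and functions below are illustrative, not from the docs), nested calls between decorated functions are logged as a parent-child trace tree automatically:

```python
import weave

# Initialize Weave; all decorated calls are logged to this project (name is illustrative).
weave.init("my-genai-app")

@weave.op()
def retrieve_context(question: str) -> str:
    # Placeholder retrieval step; a real app might query a vector store here.
    return "Weave traces nested function calls."

@weave.op()
def generate_reply(question: str) -> str:
    # Calling one op inside another creates a parent-child relationship in the trace tree.
    context = retrieve_context(question)
    return f"Answer based on: {context}"

generate_reply("How does Weave build trace trees?")
```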

Evaluations

To improve an application, we need a way to evaluate its progress. A common approach is to test it against the same set of examples whenever changes are made. Weave provides a first-class solution for this with its Model and Evaluation classes.
Weave's Evaluation class is designed to assess a model's performance on a given dataset or set of examples using scoring functions—either built-in or custom-defined. This enables evaluation-driven development, a systematic approach that ensures reliable iteration.
Additionally, Evaluations in Weave provide a structured way to benchmark and calculate quantitative metrics, allowing you to confidently measure improvements. The API is designed with minimal assumptions to support a wide range of use cases, making it highly flexible.

For more on evaluations, visit the evaluation pipeline tutorial.
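Here is a minimal sketch of the Model and Evaluation classes working together; the project name, dataset rows, and the exact-match scorer are illustrative placeholders for your own application code:

```python
import asyncio
import weave

weave.init("my-genai-app")  # illustrative project name

class MyModel(weave.Model):
    # Model attributes (such as a prompt) are versioned alongside evaluation results.
    system_prompt: str = "Answer concisely."

    @weave.op()
    def predict(self, question: str) -> str:
        # A real model would call an LLM here; this stub just returns a canned answer.
        return "Paris" if "France" in question else "I don't know"

@weave.op()
def match_score(expected: str, output: str) -> dict:
    # Simple scoring function: compares the model output to the expected label.
    return {"correct": expected.lower() == output.lower()}

examples = [
    {"question": "What is the capital of France?", "expected": "Paris"},
    {"question": "What is the capital of Peru?", "expected": "Lima"},
]

evaluation = weave.Evaluation(dataset=examples, scorers=[match_score])
asyncio.run(evaluation.evaluate(MyModel()))
```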

Pre-Defined Evaluation Scorers for GenAI Apps

Weave prebuilt scorers
In Weave, prebuilt Scorers are used to evaluate AI outputs and return evaluation metrics. They take the AI's output, analyze it, and return a dictionary of results. Scorers can use your input data as a reference if needed and can also output extra information, such as explanations or reasoning behind the evaluation result. Weave provides scorers out of the box to help you get started with evaluation-driven AI development.
Highlights:
  • Support for OpenAI, Anthropic, Google Generative AI, and Mistral AI as LLM-as-a-judge providers
  • Plug-and-play evaluation of your application
  • See the docs here
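As a hedged sketch, a prebuilt scorer drops into an Evaluation just like a custom one. The scorer class name and import path below are taken from the Weave scorers docs and may vary by version; LLM-as-a-judge scorers additionally need a judge model configured, and the project name and function are illustrative:

```python
import asyncio
import weave
from weave.scorers import ValidJSONScorer  # prebuilt scorer; name assumed from the scorers docs

weave.init("my-genai-app")  # illustrative project name

@weave.op()
def extract_user(text: str) -> str:
    # Stub for an LLM extraction step that should return JSON.
    return '{"name": "Ada", "plan": "pro"}'

examples = [{"text": "Ada signed up for the pro plan."}]

# The prebuilt scorer checks whether each output parses as valid JSON.
evaluation = weave.Evaluation(dataset=examples, scorers=[ValidJSONScorer()])
asyncio.run(evaluation.evaluate(extract_user))
```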

Custom User Defined Scorers

For scenarios that require specialized evaluations—like specific parameters in an LLM-based judge, custom prompts, or domain-specific scoring logic—Weave supports custom scorers.
By creating custom Scorer classes, you can tailor the scoring process to precisely capture the metrics that matter most. For example, you might implement a standardized LLMJudge class that targets a specific chat model with a particular prompt, calculates row-level scores, and aggregates them into a final metric. This flexibility ensures you can incorporate domain knowledge and unique evaluation strategies, allowing you to confidently measure and iterate on your application’s performance.
Check out this doc to get started with defining your own scorers.
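Below is a minimal sketch of a class-based custom scorer built on the weave.Scorer base class; the KeywordCoverageScorer name and its logic are illustrative, and an LLM-judge variant would call a chat model inside score instead:

```python
import weave

class KeywordCoverageScorer(weave.Scorer):
    # Illustrative custom scorer: checks how many required keywords appear in the output.
    keywords: list[str]

    @weave.op()
    def score(self, output: str) -> dict:
        found = [kw for kw in self.keywords if kw.lower() in output.lower()]
        return {
            "coverage": len(found) / len(self.keywords) if self.keywords else 0.0,
            "missing": [kw for kw in self.keywords if kw not in found],
        }

# Used like any other scorer in an Evaluation, e.g.:
# weave.Evaluation(dataset=examples, scorers=[KeywordCoverageScorer(keywords=["refund", "policy"])])
```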

Online Evaluations



Because LLMs are non-deterministic, even thorough debugging and rigorous evaluations can’t catch everything. There will always be outlier cases in real-world usage, plus the usual production challenges—latency spikes, failures, user request surges, and so on.
That’s where online evaluations come in. They let you score live traces in production, so you can see how your model is performing in actual usage. You can configure sampling filters to run online evaluations on just a subset of calls, and set alerts whenever certain thresholds get crossed—helping you react quickly to any unexpected behavior.
Check out the documentation to learn more.

Guardrails for Safe AI

Running coherence and hallucination guardrails against RAG outputs
Guardrails help you safeguard users and brand against harmful content, prompt attacks, and PII (personally identifiable information) leaks. As you start deploying AI systems more broadly, ensuring they operate safely and produce high-quality outputs can be a challenge. Weave Guardrails address this need with a new, developer-friendly API to detect adverse events along with a comprehensive suite of pre-built scorer models.
With guardrails, you can implement programmable safeguards. When a guardrail is triggered, you can route your application to execute a custom exception code, preventing harmful outputs from reaching end users. All detections are automatically logged within Weave, allowing you to monitor scorer performance over time and continuously improve application quality.
We've launched with five scorers, and more are on the way.
Read the launch blog here
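As a hedged sketch of the guardrail pattern, the idea is to score a logged call before returning its output and route to a fallback when the scorer flags it. The blocklist scorer, the "flagged" field, and the project name below are illustrative, and the exact apply_scorer result schema may vary by Weave version:

```python
import asyncio
import weave

weave.init("my-genai-app")  # illustrative project name

@weave.op()
def generate_response(prompt: str) -> str:
    # Stub for your LLM call.
    return "Here is a helpful, harmless response."

@weave.op()
def blocklist_scorer(output: str) -> dict:
    # Illustrative stand-in for a prebuilt safety scorer.
    blocked_terms = ["credit card number"]
    return {"flagged": any(term in output.lower() for term in blocked_terms)}

async def guarded_call(prompt: str) -> str:
    # .call() returns both the output and the logged Call object.
    output, call = generate_response.call(prompt)
    evaluation = await call.apply_scorer(blocklist_scorer)
    if evaluation.result["flagged"]:
        # Guardrail triggered: return a safe fallback instead of the raw output.
        return "Sorry, I can't share that."
    return output

print(asyncio.run(guarded_call("Tell me something safe.")))
```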

Feedback and Annotations

Efficiently evaluating LLM applications requires robust tooling to collect and analyze feedback. Weave provides an integrated feedback system, allowing users to submit feedback directly through the UI or programmatically via the SDK. Supported feedback types include emoji reactions, textual comments, and structured data, enabling teams to:
  • Build evaluation datasets for performance monitoring.
  • Identify and resolve LLM content issues.
  • Gather examples for fine-tuning and advanced tasks.
Feedback Tab to annotate individual traces and evaluations.
Check out the Weave feedback documentation here.
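A hedged sketch of attaching feedback to a call from the SDK follows; the feedback method names mirror those in the feedback docs, while the project name, op, and payload are illustrative:

```python
import weave

weave.init("my-genai-app")  # illustrative project name

@weave.op()
def summarize(text: str) -> str:
    # Stub summarizer; a real app would call an LLM.
    return text[:50]

# .call() returns the output along with the Call object that feedback attaches to.
output, call = summarize.call(
    "Weave lets users attach reactions, notes, and structured feedback to traces."
)

call.feedback.add_reaction("👍")                        # emoji reaction
call.feedback.add_note("Good summary, slightly terse")  # free-text comment
call.feedback.add("quality", {"score": 4})              # structured, custom feedback payload
```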

Dataset Creation & Editing

Weave Datasets help organize, collect, track, and version examples for LLM application evaluation, making comparisons easy. You can create and interact with datasets both programmatically and via the UI.
In the Weave UI, you can add, edit, and delete individual examples within a dataset—useful for removing bad examples or correcting labels. Once changes are published, Weave automatically increments the dataset version, ensuring evaluations remain consistent across versions.
Additionally, you can create datasets directly from the UI using existing traces and evaluations, allowing for flexible dataset customization and management.
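Here is a minimal sketch of creating, publishing, and retrieving a dataset programmatically; the project name, dataset name, and rows are illustrative:

```python
import weave

weave.init("my-genai-app")  # illustrative project name

# Create a dataset from example rows (names and rows are illustrative).
dataset = weave.Dataset(
    name="support_questions",
    rows=[
        {"question": "How do I reset my password?", "expected": "Use the reset link on the login page."},
        {"question": "Where can I download invoices?", "expected": "From the billing settings page."},
    ],
)

# Publishing versions the dataset; edits made later in the UI create new versions automatically.
weave.publish(dataset)

# Retrieve a published dataset by reference, e.g. to use it in an Evaluation.
retrieved = weave.ref("support_questions").get()
```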


Leaderboards


W&B Weave allows developers to group evaluations into leaderboards featuring the best performers. You can share your leaderboards across your organization. Using leaderboards, the next application you build can start with the best performing models and prompts instead of starting from scratch.
Every part of the leaderboard can be analyzed so you can drill into areas like prompts and datasets. A pivot view enables pairwise comparison. It applies to prompts, LLMs, code changes, and more. We provide a default pivot view that you can configure for your use case.
You can see this demo of a leaderboard here, and a quickstart is here.

First Class Prompts and Playground


To evaluate models and prompts without jumping into code, W&B Weave offers a Playground to quickly iterate on prompts, functions/tools, and responses and see how the LLM output changes. W&B Weave treats prompts as first-class citizens: reusable across your application, centrally referenced, and dynamic enough to accommodate complex use cases.
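As a hedged sketch of first-class prompts, assuming the StringPrompt class from the prompts docs (the prompt text, name, and project are illustrative), a prompt can be published once and fetched anywhere it is needed:

```python
import weave

weave.init("my-genai-app")  # illustrative project name

# Define a reusable, parameterized prompt and publish it so it is versioned and centrally referenced.
prompt = weave.StringPrompt("You are a concise assistant. Answer the question: {question}")
weave.publish(prompt, name="concise_qa_prompt")

# Later, or elsewhere in your app, fetch the published prompt and fill in its parameters.
fetched = weave.ref("concise_qa_prompt").get()
print(fetched.format(question="What does Weave's playground do?"))
```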

Check out the docs: Playground and First Class Prompts

TypeScript / JavaScript Client



We know that not every developer builds their AI applications in Python, so Weave also ships a TypeScript/JavaScript client. Just like the Python SDK, the TS/JS client offers deep integrations with major AI providers for tracing, an intuitive evaluation framework, and native multi-modal support. Check out the SDK README and the quickstart documentation.