Weave
Deliver AI applications with confidence
W&B Weave helps developers evaluate, monitor, and iterate on their AI applications to continuously improve quality, latency, cost, and safety. Run robust evaluations, keep pace with new LLMs, debug your applications easily, and monitor production performance—all while collaborating securely.
W&B Weave is framework- and LLM-agnostic, with a wide range of pre-built integrations.
Evaluations
Experiment with LLMs, prompts, RAG, agents, and guardrails using rigorous evaluations to optimize your AI application’s performance across multiple dimensions—quality, latency, cost, and safety. Weave offers powerful visualizations, automatic versioning, leaderboards, and a playground to precisely measure and rapidly iterate on improvements. Centrally track all evaluation data to enable reproducibility, lineage tracking, and collaboration.
Learn more
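As a rough sketch of what an evaluation can look like with the Python SDK (the project name, dataset rows, model function, and exact-match scorer below are all placeholders, and the scorer's argument names assume the convention of matching dataset keys plus the model output):

```python
import asyncio
import weave

weave.init("my-team/eval-demo")  # placeholder project name

# Toy dataset: each row's keys are passed as arguments to the model and scorers.
examples = [
    {"question": "What is the capital of France?", "expected": "Paris"},
    {"question": "What is 2 + 2?", "expected": "4"},
]

@weave.op()
def answer_question(question: str) -> str:
    # Stand-in for a real LLM call; replace with your model, chain, or agent.
    return "Paris" if "France" in question else "4"

@weave.op()
def exact_match(expected: str, output: str) -> dict:
    # Simple scorer: compares the model output to the expected answer.
    return {"correct": output.strip() == expected}

evaluation = weave.Evaluation(dataset=examples, scorers=[exact_match])
asyncio.run(evaluation.evaluate(answer_question))
```

Each evaluation run is versioned and logged, so results can be compared across models, prompts, and configurations.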
Production monitoring and debugging
Weave automatically logs all inputs, outputs, code, and metadata in your application and organizes the data into a trace tree that you can easily navigate and analyze to debug issues. Use real-time traces to monitor your app in production and improve performance continuously. Score live incoming production traces with online evals for monitoring without impacting your app’s performance (sign up for online evals preview). Develop multimodal apps—Weave logs text, documents, code, HTML, chat threads, images, and audio, with support for video and other modalities coming soon.
Learn more
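A minimal illustration of how traces are produced (project and function names are invented): decorating functions with weave.op logs their inputs and outputs, and nested calls appear as children in the trace tree.

```python
import weave

weave.init("my-team/rag-app")  # placeholder project name

@weave.op()
def retrieve(query: str) -> list[str]:
    # Stand-in retrieval step; inputs and outputs are captured automatically.
    return ["Weave logs inputs, outputs, code, and metadata for each op."]

@weave.op()
def answer(query: str) -> str:
    docs = retrieve(query)  # nested op: shows up as a child span in the trace tree
    return f"Based on {len(docs)} document(s): {docs[0]}"

answer("What does Weave log?")
```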
Start with our scorers
Weave provides pre-built LLM-based scorers for common tasks
Hallucination
Summarization
Moderation
(based on OpenAI moderation API)
Similarity
JSON strings
XML strings
Pydantic data models
Context entity recall
(from RAGAS)
Context relevancy
(from RAGAS)
and more …
Or bring your own
Plug off-the-shelf, third-party scoring solutions into Weave, or write your own
RAGAS
EvalForge
LangChain
LlamaIndex
HEMM
and more …
Scoring
Weave automatically tracks quality scores, latency, and cost metrics for every trace. It offers built-in scorers for common metrics like hallucination, moderation, and context relevancy; customize them or build your own from scratch. Scorers can use any LLM as a judge to generate the metrics.
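As a hedged sketch of a custom scorer (the "conciseness" criterion, threshold, and names are invented for illustration), a scorer can be an ordinary op that receives the model output and returns a score dictionary; an LLM-as-judge scorer follows the same shape with a model call inside.

```python
import weave

@weave.op()
def conciseness(output: str) -> dict:
    # Illustrative custom scorer: flags answers longer than 100 words.
    # An LLM-as-judge version would call a model here and parse its verdict.
    word_count = len(output.split())
    return {"concise": word_count <= 100, "word_count": word_count}

# Pass it to an evaluation alongside (or instead of) the built-in scorers:
# weave.Evaluation(dataset=examples, scorers=[conciseness])
```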
Human feedback
Collect human feedback from users and experts for real-life testing and evaluation. Feedback can range from simple thumbs-up/down ratings and emojis to detailed qualitative annotations. Use our annotation template builder to tailor the labeling interface for consistency while improving efficiency and quality.
Guardrails (preview in Q1 2025)
Protect your brand and end users by implementing guardrails using Weave. Our out-of-the-box filters detect harmful outputs and prompt attacks, including:
❌ Toxicity
❌ Bias
❌ Hallucination
And more …
Once an issue is detected, pre- and post-hooks trigger safeguards to steer the response in line with your company guidelines and policies.
Sign up for preview
Get started with one line of code
Developers love Weave because it’s so easy to get started – all you need is one line of code, and your GenAI application inputs, outputs, and code are automatically tracked and organized for rigorous evaluation, monitoring, and iteration. We offer SDKs for Python, JavaScript, and TypeScript. For other languages, you can use our REST API.
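For example (with placeholder project and model names), initializing Weave is the single added line; calls made through supported LLM client libraries such as the OpenAI SDK are then traced automatically.

```python
import weave
from openai import OpenAI

weave.init("my-team/quickstart")  # the one line: subsequent LLM calls are traced

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": "Say hello to Weave."}],
)
print(response.choices[0].message.content)
```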
The Weights & Biases platform helps you streamline your workflow from end to end
Models
Experiments
Track and visualize your ML experiments
Sweeps
Optimize your hyperparameters
Registry
Publish and share your ML models and datasets
Automations
Trigger workflows automatically
Weave
Traces
Explore and debug LLMs
Evaluations
Rigorous evaluations of GenAI applications