W&B Weave EvaluationLogger: A more flexible approach to evaluating AI applications
How to measure your AI and agentic applications precisely the way you want

Frameworks are a double-edged sword when developing and evaluating agentic AI applications. While they provide structure and predictability, they can also be restrictive.
EvaluationLogger offers a more flexible alternative to Weave's standard evaluation framework. Instead of forcing you into specific dataset and scorer formats, it lets you add evaluation logic directly to your code without rigid requirements. No need to define everything upfront. Just log what matters, when it matters.
This flexibility can be crucial for complex, multi-step agentic workflows where you need to:
- Apply scorers dynamically within workflows
- Evaluate specific steps or multi-step combinations within an agentic graph
- Perform custom aggregations on top of Weave's standard aggregations
For experienced AI developers working on cutting-edge agentic applications, EvaluationLogger provides the flexibility to measure multiple dimensions—such as accuracy, latency, cost, and safety—on your terms without the constraints that come along with using a framework.
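For instance, scoring an individual step inside a multi-step workflow (the second pattern above) might look something like the following minimal sketch. The retrieve and answer steps, the scorer names, and the project name are illustrative placeholders, not part of the Weave API; only weave.init, weave.op, and the EvaluationLogger calls themselves are real.

import weave
from weave.flow.eval_imperative import EvaluationLogger

weave.init("agent-eval-demo")  # placeholder project name

eval_logger = EvaluationLogger(model="rag_agent", dataset="support_questions")

@weave.op
def retrieve(question: str) -> list[str]:
    # Hypothetical retrieval step of an agentic workflow (stubbed for brevity)
    return ["Passwords are reset from the account settings page."]

@weave.op
def answer(question: str, docs: list[str]) -> str:
    # Hypothetical generation step (stubbed for brevity)
    return "You can reset your password from the account settings page."

question = "How do I reset my password?"
docs = retrieve(question)
final_answer = answer(question, docs)

# Log the end-to-end prediction, then attach scores for individual steps
pred_logger = eval_logger.log_prediction(inputs={"question": question}, output=final_answer)
pred_logger.log_score(scorer="retrieval_nonempty", score=len(docs) > 0)  # step-level check
pred_logger.log_score(scorer="answer_mentions_settings", score="settings" in final_answer)
pred_logger.finish()

eval_logger.log_summary({"num_samples": 1})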
Usage
To use the EvaluationLogger, just follow these steps:
1) Initialize the logger
When initializing, be sure to add names for both the model and the evaluation dataset.
import weave
from openai import OpenAI
from weave.flow.eval_imperative import EvaluationLogger

weave.init("evaluation-logger-demo")  # replace with your W&B project name

# Initialize EvaluationLogger
eval_logger = EvaluationLogger(
    model="my_model",
    dataset="my_dataset"
)

# Example input data (this can be any data structure you want)
eval_samples = [
    {'inputs': {'a': 1, 'b': 2}, 'expected': 3},
    {'inputs': {'a': 2, 'b': 3}, 'expected': 5},
    {'inputs': {'a': 3, 'b': 4}, 'expected': 7},
]

# Example model logic using OpenAI
@weave.op
def user_model(a: int, b: int) -> int:
    oai = OpenAI()
    response = oai.chat.completions.create(
        messages=[{"role": "user", "content": f"What is {a}+{b}?"}],
        model="gpt-4o-mini"
    )
    # Use the response in some way (here we just return a + b for simplicity)
    return a + b
Our example function asks the gpt-4o-mini model for the sum of two integers, though for simplicity it ignores the response and returns the sum computed in Python.
2) Log predictions
Evaluating our LLM call is now as easy as looping through our dataset and passing in the integer pair from each row.
# Iterate through examples, predict, and log
for sample in eval_samples:
    inputs = sample["inputs"]
    model_output = user_model(**inputs)  # Pass inputs as kwargs

    # Log the prediction input and output
    pred_logger = eval_logger.log_prediction(
        inputs=inputs,
        output=model_output
    )
EvaluationLogger records the prediction inputs and outputs.
3) Log scores
Using the prediction results, we can verify if our output matches the expected value included in our dataset.
# Calculate and log a score for this prediction
expected = sample["expected"]
correctness_score = model_output == expected
pred_logger.log_score(
    scorer="correctness",  # Simple string name for the scorer
    score=correctness_score
)
All you need to define is the scorer name and the score. Whether the scores are numeric or boolean values, EvaluationLogger will handle the final aggregation.
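For example, still inside the loop above, you could log a numeric score alongside the boolean one; the absolute_error scorer name below is just an illustrative choice, not a built-in scorer.

# Still inside the loop: log a numeric score for the same prediction.
absolute_error = abs(model_output - expected)
pred_logger.log_score(
    scorer="absolute_error",  # illustrative scorer name
    score=absolute_error
)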
4) Finish prediction(s)
Once you log a score, call finish() to finalize that prediction’s logging. After finish(), no more scores can be added to that prediction. You can then either proceed to step 5 to log summaries or continue logging scores for other predictions.
# Finish logging for this specific prediction
pred_logger.finish()
5) Log summary
Calling log_summary concludes the evaluation and triggers automatic score summarization in Weave.
# Log a final summary for the entire evaluation.
# Weave auto-aggregates the 'correctness' scores logged above.
summary_stats = {"subjective_overall_score": 0.8}
eval_logger.log_summary(summary_stats)

print("Evaluation logging complete. View results in the Weave UI.")
You can also optionally add pre-aggregated summary scores at the end of your evaluation workflow, as in the sketch after this list, to:
- Include custom metrics that matter to your specific use case
- Complement Weave's automatic aggregations with your own calculations
- Provide context and insights that standard metrics might miss
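As a rough sketch of what that can look like, the variant of the evaluation loop below keeps its own running totals and hands a custom pass rate and mean latency to log_summary. The metric names and the timing logic are assumptions for illustration, not required fields; Weave still performs its automatic aggregation of the per-prediction scores.

import time

# Variant of the loop from steps 2-4 that also tracks custom aggregates.
num_correct = 0
latencies = []

for sample in eval_samples:
    start = time.perf_counter()
    model_output = user_model(**sample["inputs"])
    latencies.append(time.perf_counter() - start)

    pred_logger = eval_logger.log_prediction(inputs=sample["inputs"], output=model_output)
    is_correct = model_output == sample["expected"]
    num_correct += int(is_correct)
    pred_logger.log_score(scorer="correctness", score=is_correct)
    pred_logger.finish()

# Pre-aggregated metrics specific to this use case; the key names are arbitrary.
eval_logger.log_summary({
    "custom_pass_rate": num_correct / len(eval_samples),
    "mean_latency_seconds": sum(latencies) / len(latencies),
})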
The wandb.log of Weave
The W&B SDK lets you log key metrics quickly and effortlessly with a single command: wandb.log. The Weave EvaluationLogger extends this ease to Weave evaluations. This means you can capture evaluation data at exactly the right moment in your workflow without restructuring your code.
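The parallel is easiest to see side by side: wandb.log records whatever metrics you hand it at that moment, and EvaluationLogger does the same for predictions and scores (the project name below is a placeholder).

import wandb

# Classic experiment tracking: log metrics the moment they are available.
run = wandb.init(project="my-project")  # placeholder project name
wandb.log({"loss": 0.12, "accuracy": 0.94})
run.finish()

# EvaluationLogger applies the same idea to evaluations: call
# eval_logger.log_prediction(...) and pred_logger.log_score(...) at the exact
# point in your code where a prediction and its score are produced.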
The result is a more natural, flexible way to evaluate AI applications. Spend less time setting up evaluations and more time analyzing results and optimizing your AI applications.
Conclusion
Weave now supports two methods to match your evaluation preferences. The standard Evaluations framework offers a structured approach with predefined formats and clear guidance. The new EvaluationLogger provides a flexible alternative for incremental logging that you can drop into your workflow and Python code exactly where it's needed. Both deliver comprehensive assessment capabilities, so you can pick the one that works best for your project.