Weights & Biases is heading to NeurIPS 2025 to connect with the brightest minds shaping the future of AI. Drop by our booth to explore how teams are not just building and evaluating models, but also deploying, governing, and optimizing next-generation systems with Weights & Biases. From fine-tuning massive LLMs to orchestrating agentic workflows, managing synthetic-data pipelines, and driving model lifecycle observability, we’d love to talk about how we can help accelerate your work in this new era.
Tuesday (12/02) 2:30pm – Measuring Emergent Behavior in AI Agents
As language models transition into agents, they exhibit behaviors that were not explicitly trained: emergent dynamics that are powerful yet poorly understood. Measuring these behaviors requires dedicated tooling that treats evaluation as a central research problem rather than a peripheral task. This talk introduces frameworks for self-improving agents that generate candidate variants, run structured experiments, and incorporate evaluation feedback into iterative refinement. Such loops operationalize the scientific method in software, enabling agents to improve through cycles of hypothesis, measurement, and revision. Tooling for evaluation plays a critical role in this process, transforming measurement from a diagnostic exercise into an engine for discovery. Early experiments reveal both hidden failure modes and novel capabilities, underscoring the need to build for emergence as an active research objective. The talk concludes by outlining a research agenda in which evaluation frameworks provide the substrate for cultivating reliable, trustworthy agent systems.