Announcing our newest GenAI course—LLM Apps: Evaluation
Develop techniques for building, optimizing, and scaling AI evaluators with minimal human input. Learn to build reliable evaluation pipelines for LLM applications by combining programmatic checks with LLM-based judges.
We're excited to announce our latest course, LLM Apps: Evaluation. It's designed to give you the practical skills you need to build and deploy production-ready LLM Judges for your Gen AI applications. And like all our courses, it's completely free to register.
Register now

Led by industry experts from Weights & Biases, Google, and All Hands AI, this course begins with evaluation basics and builds up to a fully aligned LLM judge.
At the end of the course you will:
- Understand key principles, implementation methods, and appropriate use cases for LLM application evaluation
- Learn how to create a working LLM as a judge
- Align your auto-evaluations with minimal human input
Let's dig a bit deeper into each course chapter.
Chapter 1: Evaluation basics
An introduction to the fundamentals of LLM evaluation, led by Weights & Biases ML Engineer Anish Shah, including a deep dive into metrics, datasets, and the importance of aligning evaluations with business goals. This chapter also covers a practical medical use case, highlighting high-stakes scenarios where accuracy and privacy are critical. You'll explore methods for extracting structured insights from unstructured medical data while adhering to strict compliance standards.
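As a flavor of that workflow, here is a minimal sketch, not taken from the course materials, of validating an LLM's structured extraction against a schema with a crude privacy check; the `PatientSummary` fields and the phone-number rule are illustrative assumptions.
```python
# A minimal sketch (not the course's actual code) of validating structured
# output extracted from unstructured medical text. Field names and the
# redaction rule are illustrative assumptions.
import re
from pydantic import BaseModel, ValidationError, field_validator

class PatientSummary(BaseModel):
    chief_complaint: str
    medications: list[str]
    follow_up_required: bool

    @field_validator("chief_complaint")
    @classmethod
    def no_obvious_identifiers(cls, value: str) -> str:
        # Crude compliance check: reject summaries that leak phone-like numbers.
        if re.search(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b", value):
            raise ValueError("possible phone number leaked into summary")
        return value

def validate_extraction(raw_json: str) -> PatientSummary | None:
    """Return a parsed summary, or None if the LLM output fails the schema."""
    try:
        return PatientSummary.model_validate_json(raw_json)
    except ValidationError:
        return None

# Example: a well-formed extraction passes validation.
ok = validate_extraction('{"chief_complaint": "persistent cough", '
                         '"medications": ["albuterol"], "follow_up_required": true}')
print(ok)
```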
Chapter 2: Programmatic and LLM evaluations
This chapter introduces programmatic evaluation strategies and how to extend them using LLM judges. You’ll learn to create automated checks, validate outputs with JSON schemas, and evaluate key aspects like privacy compliance and word limits in generated text. The chapter includes hands-on examples such as evaluating code diffs for quality, extracting structured data from free-form medical inputs, and refining prompts to improve LLM performance in dynamic settings.
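As a taste of what that looks like, here is a hedged sketch that pairs a deterministic word-limit check with an LLM judge for code-diff quality. The `call_llm` helper is a hypothetical stand-in for whichever client you use, and the prompt and threshold are assumptions rather than the course's own code.
```python
# A hedged sketch of pairing a cheap programmatic check with an LLM judge.
# `call_llm` is a hypothetical stand-in for your LLM client of choice.
import json

def call_llm(prompt: str) -> str:
    raise NotImplementedError("replace with your LLM client of choice")

def word_limit_check(text: str, max_words: int = 150) -> dict:
    # Deterministic, free, and fast: run this before any LLM-based scoring.
    n = len(text.split())
    return {"within_limit": n <= max_words, "word_count": n}

JUDGE_PROMPT = """You are reviewing a code diff for quality.
Respond only with JSON: {{"score": <1-5>, "reason": "<one sentence>"}}.

Diff:
{diff}
"""

def judge_code_diff(diff: str) -> dict:
    # LLM judge for the subjective part that programmatic checks can't cover.
    raw = call_llm(JUDGE_PROMPT.format(diff=diff))
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return {"score": None, "reason": "judge returned malformed JSON"}
```
Running the cheap programmatic checks first keeps judge calls (and their cost) for the cases that actually need them.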

Chapter 3: Structuring LLM Evaluators
Join W&B's Ayush Thakur as he explains how "breaking abstract concepts into objective criteria—such as grammar, logic, and flow—transforms subjective tasks into scalable, repeatable processes." This principle drives the hands-on experience course participants will gain in designing structured evaluators.
You'll work directly with code to evaluate essays based on multiple metrics like grammar, coherence, and logical structure while ensuring alignment with human judgment. We also cover strategies for handling multimodal inputs and refining evaluator prompts to optimize reasoning accuracy, balancing the trade-offs inherent in building complex evaluation systems.
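For illustration, here is one way such a structured evaluator could be organized: one focused judge call per criterion, then a simple aggregate. The rubric questions and the hypothetical `call_llm` helper are our assumptions; the course notebooks may structure this differently.
```python
# A sketch of a structured essay evaluator: one focused judge call per
# criterion, then a simple aggregate. `call_llm` is hypothetical.
import json
from statistics import mean

CRITERIA = {
    "grammar": "Is the essay free of grammatical and spelling errors?",
    "coherence": "Do paragraphs connect and flow logically?",
    "logical_structure": "Is the argument organized with a clear thesis and support?",
}

def call_llm(prompt: str) -> str:
    raise NotImplementedError("replace with your LLM client of choice")

def score_criterion(essay: str, name: str, question: str) -> int:
    prompt = (
        f"Rate the essay on {name} (1-5). {question}\n"
        f'Answer with JSON like {{"score": 3}}.\n\nEssay:\n{essay}'
    )
    return int(json.loads(call_llm(prompt))["score"])

def evaluate_essay(essay: str) -> dict:
    per_criterion = {n: score_criterion(essay, n, q) for n, q in CRITERIA.items()}
    # Per-criterion scores make disagreements with human raters easy to localize.
    return {"criteria": per_criterion, "overall": mean(per_criterion.values())}
```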
Get started
Case Study: Evaluation of an agentic system with OpenHands
In the first case study, All Hands AI's Graham Neubig covers two interesting topics: using an agent to build an LLM judge and evaluating agents themselves. You will first see OpenHands in action as it tackles practical tasks like code management and web navigation to create a working LLM Judge and log the evaluations to Weights & Biases Weave. Then, Graham walks you through the evaluation framework behind OpenHands, including its GitHub CI/CD evaluation setup and methods for assessing multi-step processes efficiently.
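To show roughly what logging those judge results to Weave looks like, here is a minimal sketch built on the publicly documented `weave.Evaluation` API. The dataset rows, the `agent_attempt` stand-in, and the project name are illustrative assumptions, not the case study's actual code.
```python
# A minimal sketch of logging evaluation results to W&B Weave using the
# documented weave.Evaluation API; the data and scorer here are illustrative.
import asyncio
import weave

examples = [
    {"task": "Rename function foo to bar across the repo", "expected": "pass"},
    {"task": "Open the docs page and extract the install command", "expected": "pass"},
]

@weave.op()
def agent_attempt(task: str) -> dict:
    # Stand-in for invoking an agent (e.g. OpenHands) on the task.
    return {"result": "pass", "transcript": f"attempted: {task}"}

@weave.op()
def outcome_scorer(expected: str, output: dict) -> dict:
    # Note: the scorer's output parameter name can vary across weave versions.
    return {"correct": output["result"] == expected}

weave.init("llm-judge-course-demo")  # project name is an assumption
evaluation = weave.Evaluation(dataset=examples, scorers=[outcome_scorer])
asyncio.run(evaluation.evaluate(agent_attempt))
```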
Chapter 4: Improving LLM Evaluators
Refining and optimizing LLM evaluators involves a systematic approach to addressing bias, improving alignment with human judgment, and increasing evaluation efficiency. You'll learn advanced techniques such as iterative refinement using feedback loops, incorporating structured outputs for clarity, and analyzing trade-offs between precision and generalization. Real-world examples include fine-tuning evaluator prompts for domain-specific tasks and leveraging metrics to measure alignment reliability (a short sketch after the list shows how to compute them), including:
- Cohen's Kappa: Measures agreement between LLM evaluators and human annotators, accounting for chance agreement.
- Kendall's Tau: Evaluates rank correlation, useful for comparing ordered outputs or rubric-based scoring.
- Spearman's Rho: Assesses monotonic relationships between ranks, providing insight into alignment consistency across datasets.
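To make these metrics concrete, here is a short sketch that computes all three with scipy and scikit-learn on made-up judge-versus-human scores; the numbers are toy data, not course results.
```python
# Computing the three alignment metrics on toy judge-vs-human scores.
from scipy.stats import kendalltau, spearmanr
from sklearn.metrics import cohen_kappa_score

human_scores = [1, 2, 2, 3, 4, 4, 5, 5]
judge_scores = [1, 2, 3, 3, 4, 5, 5, 4]

kappa = cohen_kappa_score(human_scores, judge_scores)  # chance-corrected agreement
tau, _ = kendalltau(human_scores, judge_scores)        # rank correlation
rho, _ = spearmanr(human_scores, judge_scores)         # monotonic relationship

print(f"Cohen's kappa: {kappa:.2f}, Kendall's tau: {tau:.2f}, Spearman's rho: {rho:.2f}")
```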
Case Study: Google Gemini, Imagen, and Veo 2
Google's Paige Bailey showcases how to evaluate multimodal systems using cutting-edge tools and models from Google. You'll learn how to use Imagen for image generation and Veo 2 for video synthesis, and how to assess their performance in real-world applications. Paige also explains the nuances of integrating tool use into evaluations, highlighting the importance of grounding outputs and ensuring reliability in production scenarios.

At the end of this course, you will have the knowledge and skills to confidently move from vibe checks or off-the-shelf evaluations to an auto-evaluating LLM judge aligned with your specific use case and business needs.
Complete the course and catch issues in your GenAI app before they reach production.
Enroll now!