LLM Apps: Evaluation

Develop techniques for building, optimizing, and scaling AI evaluators with minimal human input. Learn to build reliable evaluation pipelines for LLM applications by combining programmatic checks with LLM-based judges.
2 Hours
Free

Learnings & outcomes

At the end of the course you will:

  • Understand key principles, implementation methods, and appropriate use cases for LLM application evaluation
  • Learn how to create a working LLM as a judge (a brief sketch follows this list)
  • Align your auto-evaluations with minimal human input
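
A minimal illustration of that judge pattern, pairing a cheap programmatic check with an LLM judge, as the course description outlines. This is a sketch, not course code: the model name, prompt wording, and 1-5 rubric are assumptions, and any OpenAI-compatible client would work in place of the one shown.

```python
import json
from openai import OpenAI  # assumes an OpenAI-style client is installed

client = OpenAI()

def programmatic_check(output: str) -> bool:
    """Cheap deterministic gate: reject empty or over-long answers."""
    return 0 < len(output) <= 2000

def llm_judge(question: str, output: str) -> dict:
    """Ask a judge model for a 1-5 correctness score plus a one-line reason."""
    prompt = (
        "Rate the answer to the question on a 1-5 scale for correctness.\n"
        f"Question: {question}\nAnswer: {output}\n"
        'Reply as JSON: {"score": <1-5>, "reason": "<one sentence>"}'
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed judge model; any capable model works
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},  # request parseable JSON
    )
    return json.loads(resp.choices[0].message.content)

def evaluate(question: str, output: str) -> dict:
    # Run the free check first so obviously bad outputs never cost a judge call.
    if not programmatic_check(output):
        return {"score": 1, "reason": "failed programmatic check"}
    return llm_judge(question, output)
```

Ordering the checks this way keeps the deterministic, zero-cost rules in front of the paid judge call, which is the usual way to keep evaluation pipelines cheap at scale.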

Curriculum

  • Welcome to the course
  • Evaluation basics
  • Programmatic and LLM Evaluations
  • Alignment (an agreement-measurement sketch follows this list)
  • Case study: Google: Imagen and Veo
  • Case study: OpenHands
  • Automatic LLM Evaluators
  • Conclusion and course assignment
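
To make the Alignment item above concrete: aligning an auto-evaluator usually means comparing its verdicts against a small set of human labels and iterating until agreement is acceptable. The sketch below uses made-up labels purely for illustration; scikit-learn's cohen_kappa_score provides a chance-corrected agreement number.

```python
# Hypothetical labels for illustration; in practice you would sample real
# traces from your application and have humans label a small subset.
from sklearn.metrics import cohen_kappa_score

human_labels = ["pass", "fail", "pass", "pass", "fail", "pass"]
judge_labels = ["pass", "fail", "pass", "fail", "fail", "pass"]

# Raw agreement is easy to read; kappa corrects for agreement by chance.
agreement = sum(h == j for h, j in zip(human_labels, judge_labels)) / len(human_labels)
kappa = cohen_kappa_score(human_labels, judge_labels)

print(f"raw agreement: {agreement:.2f}, Cohen's kappa: {kappa:.2f}")
```

If the numbers come out low, the usual loop is to refine the judge's prompt or rubric and re-measure, so the human labeling effort stays small.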
In partnership with
Reviews
Good collection of information about LLM evaluation. Even though it still has the limitation that subjective questions are hard to evaluate, it is a good starting point for an LLM evaluation system that can evaluate LLM applications automatically.
Great learning at Weights & Biases. It was a great learning opportunity, and at a time when AI is at its peak it is really helpful to know more about it.
Course instructors

Ayush Thakur

Ayush Thakur is an AI Engineer at Weights & Biases and a Google Developer Expert in Machine Learning (TensorFlow). He is interested in all things computer vision and representation learning. For the past two years he has been working with LLMs, covering RLHF and the how and what of building LLM-based systems.
AI Engineer, Weights & Biases

Anish Shah

Anish loves turning ML ideas into ML products. He started his career working with multiple data science teams within SAP, covering traditional ML, deep learning, and recommendation systems before landing at Weights & Biases. With the art of programming and a little bit of magic, Anish crafts ML projects to better serve our customers, turning "oh no"s into "a-ha"s!
AI Engineer, Weights & Biases

Paige Bailey

Paige Bailey is the engineering lead for GenAI Developer Experience at Google. Paige has a deep understanding of the generative AI landscape, having previously served as an applied machine learning engineer at Microsoft and GitHub, and a product lead for Google's PaLM v2 and Gemini models. Paige is passionate about making cutting-edge AI technology accessible, and empowering developers to build the next generation of innovative applications.
GenAI Developer Experience, Google

Graham Neubig

Graham Neubig is an Associate Professor at Carnegie Mellon University, and Chief Scientist at All Hands AI. His research work focuses on AI agents for web browsing and code generation, as well as improvements to LLMs for multilingual and multimodal applications. He is a big proponent of open source and open science, including the OpenHands framework for software engineering agents, developed by All Hands AI.
Chief Scientist, All Hands AI; Associate Professor, Carnegie Mellon University
Explore our other courses

Ready to get into another course?

If you are ready to dive into another LLM course, check out our latest RAG++ course in collaboration with Cohere and Weaviate.

Practical RAG techniques for engineers: learn production-ready solutions from industry experts to optimize performance, cut costs, and enhance the accuracy and relevance of your applications.