Human annotations: Why they matter—and how to get them right

Learn how W&B Weave can help you gather user and expert annotations to improve your AI apps
Ever wonder if your AI application or agent is actually hitting the mark for real users? You can run all the programmatic tests in the world, but there’s nothing like real human feedback to show you what’s actually going on. That’s where user reviews and expert annotations come in—part art, part science, all essential.

Two types of human feedback

It’s important to articulate what we mean by human feedback. First, there’s the most familiar variety: user feedback through likes, emojis, and comments. This is the quick stuff—snap judgments from everyday users that give you a gut check on whether your AI is delivering the best experience in the real world. Then there’s a deeper layer: expert annotations, where domain professionals evaluate your AI app’s responses for things like factual accuracy, style, or tone.
We launched user feedback in Weave last year, and we recently introduced human annotations as well. That means you can capture both end-user reactions and in-depth expert reviews.
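If you want to attach the quick, end-user flavor of feedback programmatically, the sketch below shows one way to do it with Weave's feedback API. The project name and the op are placeholders for your own app; adapt them to your setup.

```python
import weave

# Initialize the Weave client (placeholder project name).
client = weave.init("my-team/my-project")

@weave.op
def answer_question(question: str) -> str:
    # Placeholder for your actual LLM call.
    return "Paris is the capital of France."

# Using .call() returns both the result and the Call object,
# so feedback can be attached to that specific trace.
result, call = answer_question.call("What is the capital of France?")

# Attach quick end-user feedback: an emoji reaction and a short note.
call.feedback.add_reaction("👍")
call.feedback.add_note("Accurate and concise.")
```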

Benefits of human feedback

Both user feedback and expert annotations let you grade your application’s outputs. When you spot bizarre or misleading responses, you can quickly filter and flag them, making it easier to identify issues and fix them.

You can also feed those poor responses back into your evaluation and fine-tuning datasets, improving both the robustness of your evals and your model’s performance. Simply filter the traces using the annotation label. For example, in the screenshot below, you can filter all the traces labeled as emails.

You can then select all the traces matching that filter and add them to an existing dataset or a new one by clicking on the “Add selected rows to a dataset” button. This helps you enhance your eval and fine-tuning datasets with any tricky examples you find during testing or production use. You can also do this using the from_call API. For more information, see Datasets in the Weave technical documentation.
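For a programmatic route, here is a minimal sketch of pulling traces back with the Weave client and publishing them as a dataset. It assumes a placeholder project and dataset name, and it builds rows directly from each call's inputs and output; check the Datasets documentation for the exact helper (such as the from_call API mentioned above) available in your Weave version.

```python
import weave

client = weave.init("my-team/my-project")

# Fetch calls for the project; in practice you would narrow this down,
# e.g. to the traces you flagged with an annotation label in the UI.
calls = client.get_calls()

# Turn each call into a dataset row of input/output pairs.
rows = [{"input": call.inputs, "output": call.output} for call in calls]

# Publish the rows as a new (or updated) Weave dataset
# ("flagged-email-traces" is a placeholder name).
dataset = weave.Dataset(name="flagged-email-traces", rows=rows)
weave.publish(dataset)
```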


The pain point: Inconsistent labeling

But here’s the catch: different annotators can interpret things differently. You can’t just hand over a set of instructions and expect perfect alignment—after all, humans aren’t robots. Some might focus too much on style and ignore factuality, or vice versa.
They might also provide annotations using different labels. For example, some might classify the responses as good and bad, while others might further categorize them with qualifying descriptors. These inconsistencies can mess up your data, undermining your evals and fine-tuning.

Enter Weave’s human annotation scorers

That’s why Weave’s new human annotation scorers are a game-changer. You can set one up in the Weave UI or through the API (whatever suits your workflow). Once it’s ready, your experts jump in and start annotating within a consistent, structured interface.

And if you decide the scoring criteria need a tweak—maybe you want to switch from a binary (0/1) system to a list of labels—no sweat. You can edit your scorers via the API, refining them until everyone’s on the same page.
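Programmatic setup follows the same idea. The sketch below uses the AnnotationSpec pattern from the Weave human annotation docs; the import path, class name, and field names are assumptions that may vary by version, so treat this as a rough outline and confirm against the documentation for your release. It creates a boolean scorer, then republishes it under the same name with a list of labels.

```python
import weave
# Assumed import path for the annotation spec; verify against your Weave version.
from weave.flow.annotation_spec import AnnotationSpec

client = weave.init("my-team/my-project")

# Start with a simple boolean (0/1-style) quality check.
quality_spec = AnnotationSpec(
    name="Response quality",
    description="Is this response acceptable to ship?",
    field_schema={"type": "boolean"},
)
weave.publish(quality_spec, "response-quality")

# Later, refine the criteria: republish under the same name with a fixed
# list of labels so every annotator picks from the same set.
quality_spec = AnnotationSpec(
    name="Response quality",
    description="How would you rate this response?",
    field_schema={"type": "string", "enum": ["good", "needs work", "bad"]},
)
weave.publish(quality_spec, "response-quality")
```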

Getting started

We’ve made it incredibly easy to get started with human annotations in Weave. Just create a human annotation scorer in the UI with four simple steps:
  1. In the sidebar, navigate to Scorers.
  2. In the upper right corner, click + Create scorer.
  3. On the configuration page, set:
    • Scorer type to Human annotation
    • Name
    • Description
    • Type (this determines the type of feedback collected, such as boolean or integer)
  4. Click Create scorer.
Now, you can use your scorer to make annotations. For more details, check out the human annotations documentation.

Wrapping it up

Human annotation scorers let you customize the labeling process so it’s consistent, efficient, and high-quality. That translates to a better workflow for debugging your AI application, more rigorous evaluations, and better datasets for fine-tuning your LLMs. Because when it comes to AI performance, nothing beats solid human insight.
Iterate on AI agents and models faster. Try Weights & Biases today.