How to navigate the EU AI Act with W&B Weave
The EU AI Act demands that you can track and understand the behavior of your AI applications. Here's how Weave can help.
Created on May 6 | Last edited on May 6
The EU AI Act is beginning to shape how organizations design, develop, and deploy artificial intelligence systems, particularly those classified as high-risk.
In this article, we’ll explore how W&B Weave (“Weave”), a toolkit for building and managing AI applications, can help organizations meet some of their compliance obligations as providers of high-risk AI systems under the EU AI Act.
Like our previous article in this series, we’ll be focusing on VCorp, a wholly imaginary entity developing an application that uses LLMs to score resumes for open positions (we’ll refer to it simply as the “App”). Because this App could influence individuals’ access to employment, we’ll assume that it falls within the high-risk category defined by Article 6(2) and Annex III of the EU AI Act.
Article 9
Under Article 9 of the EU AI Act, VCorp is required to establish a risk management system that identifies, evaluates, and mitigates risks throughout the entire lifecycle of the App.
Because LLMs can generate different responses to the same input, it’s not feasible to test every scenario the App might encounter in the real world. To manage this uncertainty, VCorp uses Weave to run evaluations on large datasets to gain confidence in the accuracy and reliability of the App. Custom scoring tools (or Scorers) are used to assess the App’s inputs and outputs to measure safety, bias, relevancy, hallucination rate, and other metrics defined by VCorp. If something goes wrong, Weave makes it easy to trace every step of the App’s process, from the user input to the LLM’s response and the surrounding code, so the team can quickly find and fix issues.
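To make the idea of a custom Scorer concrete, here is a minimal, framework-free sketch of the kind of relevancy check VCorp might write. The function name, fields, and threshold are all hypothetical; in Weave, a function like this would typically be registered as a scorer and run against every example in an evaluation dataset.

```python
# Hypothetical relevancy scorer: checks whether the App's explanation
# actually references terms from the job description. Threshold and
# field names are illustrative, not Weave's API.

def relevancy_scorer(resume_text: str, job_description: str, output: dict) -> dict:
    """Score whether the App's match explanation references the job requirements."""
    required = {t.strip(",.").lower() for t in job_description.split()
                if len(t.strip(",.")) > 4}
    explained = {t.strip(",.").lower() for t in output["explanation"].split()}
    overlap = len(required & explained) / max(len(required), 1)
    return {"relevant": overlap >= 0.2, "term_overlap": round(overlap, 2)}

example_output = {"score": 82,
                  "explanation": "Strong Python and Django experience matches requirements"}
result = relevancy_scorer(
    resume_text="5 years of Python and Django development",
    job_description="Seeking engineer with Python, Django, and PostgreSQL experience",
    output=example_output,
)
print(result)  # → {'relevant': True, 'term_overlap': 0.5}
```

In practice, VCorp would define one such scorer per metric (safety, bias, hallucination rate, and so on) and run them together over the whole evaluation dataset.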
For example, if an update to the App introduces a bias such as a lower match score for resumes containing gaps in employment, Weave enables the team to detect the change by comparing evaluation metrics over time. With Scorers tracking fairness and bias indicators, the team can identify that the model's behavior has shifted and take corrective action, such as retraining on a more balanced dataset or adjusting scoring logic. This ability to monitor, trace, and respond to emerging risks supports VCorp’s compliance with Article 9's requirement for a proactive and ongoing risk management system.
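The comparison described above boils down to diffing metric summaries between evaluation runs. The sketch below shows the idea with hypothetical metric names and a made-up tolerance; Weave surfaces this kind of run-to-run comparison in its evaluation views.

```python
# Hypothetical regression check between two evaluation runs.
# Metric names and the 0.05 tolerance are illustrative only.

def find_regressions(baseline: dict, current: dict, tolerance: float = 0.05) -> list:
    """Return the names of metrics that dropped by more than `tolerance`."""
    return [
        name
        for name, base_value in baseline.items()
        if name in current and base_value - current[name] > tolerance
    ]

baseline_run = {"fairness": 0.94, "accuracy": 0.91, "relevancy": 0.88}
current_run = {"fairness": 0.81, "accuracy": 0.90, "relevancy": 0.89}

print(find_regressions(baseline_run, current_run))  # → ['fairness']
```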
Article 10
Article 10 of the EU AI Act requires providers to use high-quality data when developing AI systems. Evaluation datasets are not one-size-fits-all: they must take the audience and context into consideration and reflect real-world input.
Weave helps VCorp meet this requirement by incorporating data from the App’s actual usage (such as logs and failure modes) to create more realistic evaluations. It also enables VCorp to gather real-world feedback—such as simple thumbs up or down scoring or written comments—directly from users. This feedback can be combined with insights from expert reviewers, such as recruiters who assess resumes for job fit, to build evaluation datasets that are representative and trustworthy. Together, these capabilities support VCorp’s compliance with Article 10 by grounding the App’s development in high-quality, real-world data.
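One way to picture this pipeline is a function that keeps only the production examples an expert has reviewed, carrying the user's reaction along as context. This is a hedged, framework-free sketch with hypothetical field names; Weave's actual feedback features attach reactions and notes to individual logged calls.

```python
# Illustrative sketch: merge production logs with expert reviews to build
# an evaluation dataset. All field names here are hypothetical.

def build_eval_dataset(logged_calls: list, expert_reviews: dict) -> list:
    """Keep production examples that an expert reviewer has labeled."""
    dataset = []
    for call in logged_calls:
        review = expert_reviews.get(call["id"])
        if review is None:
            continue  # only expert-labeled examples enter the dataset
        dataset.append({
            "resume": call["input"],
            "model_score": call["output"],
            "user_reaction": call.get("reaction"),
            "expert_label": review,
        })
    return dataset

calls = [
    {"id": "c1", "input": "resume A", "output": 75, "reaction": "thumbs_down"},
    {"id": "c2", "input": "resume B", "output": 90},
]
reviews = {"c1": "good_fit"}  # a recruiter disagrees with the low score
print(build_eval_dataset(calls, reviews))
```

Examples where users and experts disagree with the App, like `c1` above, are often the most valuable additions to an evaluation set.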
Article 12
Article 12 requires that the App be able to automatically record events (i.e., logs). Weave Traces helps VCorp meet this requirement by automatically capturing everything that happens inside the App at a very granular level, including inputs, outputs, code versions, and metadata. This comprehensive logging supports transparency, accountability, and faster issue resolution.
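A stripped-down sketch of what such automatic call logging looks like is below. Weave's real tracing (via its `@weave.op` decorator) captures far more, including code versions and nested calls; this only shows the shape of a logged record, and the function being traced is a stand-in.

```python
# Minimal, framework-free sketch of automatic call logging.
# The record fields are illustrative, not Weave's trace schema.

import functools
import time

TRACE_LOG = []

def traced(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        record = {"op": fn.__name__, "inputs": {"args": args, "kwargs": kwargs}}
        start = time.perf_counter()
        record["output"] = fn(*args, **kwargs)
        record["duration_s"] = time.perf_counter() - start
        TRACE_LOG.append(record)
        return record["output"]
    return wrapper

@traced
def score_resume(resume: str) -> int:
    return min(100, len(resume))  # stand-in for the real LLM call

score_resume("Ten years of backend experience")
print(TRACE_LOG[0]["op"], TRACE_LOG[0]["output"])  # → score_resume 31
```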
Article 13
Article 13 of the EU AI Act requires providers to ensure their systems are transparent and understandable, so that users can operate them properly and responsibly.
Weave helps VCorp meet this requirement by offering tools to measure, track, and explain how the App performs. VCorp can run evaluations in Weave to establish a clear performance baseline (i.e. set benchmarks for metrics like accuracy, relevancy, and bias). This helps the team to understand what “normal” behavior looks like and monitor changes over time.
All evaluation data is centrally tracked and organized in Weave, making it easy to reproduce results and identify performance trends. As the App evolves, Weave automatically versions the code, datasets, and scorers, allowing VCorp to identify what changed and how those changes impacted performance.
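The core idea behind this versioning is that a stable identifier can be derived from an artifact's content, so two runs can be diffed to see exactly which artifact changed. Weave does this automatically for code, datasets, and scorers; the sketch below just illustrates the underlying idea with hypothetical artifacts.

```python
# Hypothetical sketch: derive content-based version ids and diff two runs
# to see which artifact changed. Artifact names are illustrative.

import hashlib
import json

def version_of(artifact) -> str:
    """Derive a stable short version id from an artifact's content."""
    payload = json.dumps(artifact, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()[:8]

run_a = {"dataset": ["resume A", "resume B"],
         "prompt": "Score this resume from 0-100."}
run_b = {"dataset": ["resume A", "resume B"],
         "prompt": "Score this resume strictly from 0-100."}

changed = [k for k in run_a if version_of(run_a[k]) != version_of(run_b[k])]
print(changed)  # → ['prompt'] — the prompt changed; the dataset did not
```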
For example, if recruiters begin noticing that resumes for a particular job role are consistently receiving lower scores than expected, VCorp can use Weave to investigate. By reviewing versioned evaluation data and performance metrics, the team can identify whether a recent model update or dataset change is responsible. Weave enables VCorp to clearly explain what changed, why it happened, and how it affected results. This level of visibility supports transparency and responsible use, in line with Article 13’s requirements.
Article 14
Article 14 of the EU AI Act requires high-risk AI systems to be designed with human oversight in mind, meaning teams must be able to monitor the system’s behavior and step in when needed. Weave supports this through Guardrails and Monitors, which are powered by the Scorers mentioned above.
When used as Guardrails, Scorers operate in real time to block or modify unsafe content before it reaches users, helping to prevent harm or misuse. As Monitors, Scorers track the same metrics in the background, giving teams ongoing visibility into trends and unusual behavior. With these tools, Weave can alert VCorp to anomalies or malfunctions and ensure that humans can intervene promptly.
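The guardrail pattern can be sketched as a scorer checked inline before a response is returned. The blocklist, threshold, and fallback message below are all hypothetical; the point is that a failing check routes the output to a human review path rather than the user, which is exactly the intervention Article 14 asks for.

```python
# Illustrative guardrail sketch. PROTECTED_TERMS and the fallback
# message are hypothetical; in Weave, the same scorer code can also
# run asynchronously as a background monitor.

PROTECTED_TERMS = {"pregnant", "disability", "religion"}

def fairness_guardrail(output: str) -> dict:
    """Flag outputs that reference protected characteristics."""
    flagged = sorted(t for t in PROTECTED_TERMS if t in output.lower())
    return {"passed": not flagged, "flagged_terms": flagged}

def respond(raw_output: str) -> str:
    check = fairness_guardrail(raw_output)
    if not check["passed"]:
        return "Score withheld pending human review."  # human oversight path
    return raw_output

print(respond("Low score: candidate mentions a disability-related gap"))
print(respond("Strong match for the backend role"))
```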
Article 15
Article 15 of the EU AI Act requires providers of high-risk AI systems to meet specific standards for accuracy, robustness, and cybersecurity during development and in production.
W&B Weave helps VCorp meet these requirements by providing tools to evaluate, monitor, and safeguard the App at every stage. Before launch, VCorp can use Weave to run evaluations that establish a baseline for key metrics such as accuracy, relevance, and error rates. These benchmarks help ensure the App is performing as expected and provide a reference point for tracking performance over time.
In production, Weave helps maintain consistency and security through Monitors and Guardrails. For example, Guardrails can catch hallucinations, like the App falsely stating that a candidate has experience they never mentioned. These systems allow VCorp to detect threats, prevent errors, and maintain performance, helping ensure the App remains accurate, robust, and secure.
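A simple version of that hallucination check compares the skills the App cites in its justification against the source resume. This is a hedged sketch with hypothetical names, using literal substring matching; a production scorer would need fuzzier matching (synonyms, stemming, or an LLM judge).

```python
# Hypothetical hallucination check: flag skills cited by the App that
# never appear in the source resume. Names are illustrative.

def hallucination_check(resume: str, cited_skills: list) -> list:
    """Return cited skills that cannot be found in the resume text."""
    resume_lower = resume.lower()
    return [s for s in cited_skills if s.lower() not in resume_lower]

resume = "Seven years of Java and Spring development; AWS certified."
cited = ["Java", "Kubernetes"]  # "Kubernetes" was never mentioned
print(hallucination_check(resume, cited))  # → ['Kubernetes']
```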
As the EU AI Act sets new standards for high-risk AI systems, tools like W&B Weave can help companies build transparency, traceability, and oversight into their AI workflows. By integrating Weave, VCorp is better equipped to meet its compliance obligations while improving the quality and reliability of its AI applications.
No information contained in this article should be construed as legal advice from Weights & Biases or the individual author, nor is it intended to be a substitute for legal counsel on any subject matter.