Hiring Agent: E2E AI Evaluation for Fair & Auditable Hiring Decisions
Building, optimizing, and evaluating an agentic workflow with guardrails, fine-tuning, and comprehensive evaluation.
We also use this case study to explore responsible and fair agent development and reporting for the EU AI Act.
Table of contents:
- 🎯 Objective
- 🧬 System Architecture
- 🧪 Development Workflow
  - 1. Dataset Generation
  - 2. System Development
  - 3. Fine-tuning of the comparison model
- 📈 Evaluation and Monitoring
  - 1. System Evaluation in W&B Weave
  - 2. Model Evaluation in W&B Models
  - 3. Production Monitoring
- 🧠 Conclusion and Next Steps
🎯 Objective
The Hiring Agent pre-selects candidates for an interview round by evaluating applications against job positions using a chain of different LLMs. We focus on performance and on transparent, unbiased decision making while maintaining compliance with regulations such as the EU AI Act. While there are many more techniques and processes one could add to this workflow, the objective is to show end-to-end fine-tuning, system engineering, evaluation, and production monitoring of an agentic workflow on a real-world use case.
Check the code here for implementation details, and the whitepaper here, which explains in detail the risks, mitigation strategies, and reporting frameworks we set up for this high-risk use case in collaboration with appliedAI.
Figure 7 in the whitepaper shows the different steps and the possible execution workflows depending on the guardrail (see below for more details).
Table 3.1.3 in the whitepaper lists a subset of the analyzed risks with their respective mitigation techniques and proposed testing methods.
🧬 System Architecture
The system consists of the following components. We use Streamlit as a simple UI for the human-in-the-loop (HITL) view and LangGraph as the state machine.
1. Extraction Step
- PDF text for job offers and applications is extracted using PyMuPDF
- An extraction LLM converts the extracted text into a structured format (gpt-4o-mini as a baseline); see the sketch below
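A minimal sketch of this step, assuming the OpenAI structured-outputs API and a hypothetical CandidateProfile schema; the actual schema and prompt in the repo differ:

```python
# Sketch: extract raw text with PyMuPDF, then ask an LLM to structure it.
# CandidateProfile and the system prompt are illustrative assumptions.
import fitz  # PyMuPDF
from pydantic import BaseModel
from openai import OpenAI

class CandidateProfile(BaseModel):
    name: str
    skills: list[str]
    years_of_experience: int

def extract_pdf_text(path: str) -> str:
    with fitz.open(path) as doc:
        return "\n".join(page.get_text() for page in doc)

def structure_application(raw_text: str) -> CandidateProfile:
    client = OpenAI()
    response = client.beta.chat.completions.parse(
        model="gpt-4o-mini",  # the baseline extraction model
        messages=[
            {"role": "system", "content": "Extract the candidate profile."},
            {"role": "user", "content": raw_text},
        ],
        response_format=CandidateProfile,
    )
    return response.choices[0].message.parsed
```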
2. Comparison Model
- Both the application text and the job position text are inserted into a versioned prompt and fed into the comparison LLM (see the sketch below)
- Any LLM can be used; we tested different OpenAI models, Bedrock models, and open-source fine-tuned models hosted on Ollama
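A minimal sketch of the comparison call, assuming Weave's StringPrompt for prompt versioning; the prompt wording, project name, and model are placeholders:

```python
# Sketch: publish a versioned prompt to Weave, then format it with both
# texts and call the comparison LLM. Prompt wording is illustrative.
import weave
from openai import OpenAI

weave.init("hiring-agent")  # project name is an assumption

comparison_prompt = weave.StringPrompt(
    "You are an unbiased hiring assistant.\n"
    "Job position:\n{position}\n\nApplication:\n{application}\n\n"
    "Decide whether to invite the candidate to an interview and explain why."
)
weave.publish(comparison_prompt, name="comparison-prompt")  # versioned in Weave

@weave.op()
def compare(position: str, application: str) -> str:
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": comparison_prompt.format(
                position=position, application=application
            ),
        }],
    )
    return response.choices[0].message.content
```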
3. Reasoning Guardrail
- The reasoning behind the hiring decision is fed into a guardrail LLM that checks whether it strictly follows from the application and the job position
- If it detects a hallucination, the workflow goes back to the comparison step and forces another decision
- If it detects a hallucination a second time, it involves a human expert to validate the decision and the reasoning (see the routing sketch below)
- We use the built-in hallucination scorer from Weave based on gpt-4o-mini, but have also tried local models and specialized evaluation models such as Selene from Atla
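The retry-then-escalate logic above maps naturally onto a LangGraph state machine. A minimal sketch, with node and field names as illustrative assumptions rather than the repo's actual graph:

```python
# Hedged sketch of the guardrail routing as a LangGraph state machine.
# Node and field names are illustrative, not the repo's actual graph.
from typing import TypedDict
from langgraph.graph import StateGraph, END

class HiringState(TypedDict):
    application: str
    position: str
    decision: str
    reason: str
    hallucinated: bool
    retries: int

def compare(state: HiringState) -> dict:
    # Call the comparison LLM here; stubbed for brevity.
    return {"decision": "interview", "reason": "..."}

def guardrail(state: HiringState) -> dict:
    # Call the hallucination scorer here; stubbed for brevity.
    return {"hallucinated": False, "retries": state["retries"] + 1}

def human_review(state: HiringState) -> dict:
    # Hand off to the Streamlit HITL view for expert validation.
    return {}

def route(state: HiringState) -> str:
    if not state["hallucinated"]:
        return END
    # First hallucination: force another comparison; second: escalate.
    return "compare" if state["retries"] < 2 else "human_review"

graph = StateGraph(HiringState)
graph.add_node("compare", compare)
graph.add_node("guardrail", guardrail)
graph.add_node("human_review", human_review)
graph.set_entry_point("compare")
graph.add_edge("compare", "guardrail")
graph.add_conditional_edges("guardrail", route)
graph.add_edge("human_review", END)
app = graph.compile()

# Usage: app.invoke({"application": "...", "position": "...", "decision": "",
#                    "reason": "", "hallucinated": False, "retries": 0})
```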
🧪 Development Workflow
This section focuses on the end-to-end development process.
1. Dataset Generation
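Real applications are sensitive, so synthetic data is a natural starting point. A minimal sketch of generating labeled position/application pairs with an LLM and publishing them as a versioned Weave dataset; the prompt and row schema are assumptions, not the repo's actual generation pipeline:

```python
# Sketch: generate synthetic position/application pairs and version them
# as a Weave dataset. Prompt and row schema are illustrative assumptions.
import json
import weave
from openai import OpenAI

weave.init("hiring-agent")
client = OpenAI()

rows = []
for _ in range(10):
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": (
                "Generate a JSON object with keys 'position', 'application', "
                "and 'label' ('interview' or 'reject') for a realistic, "
                "bias-free hiring example."
            ),
        }],
        response_format={"type": "json_object"},
    )
    rows.append(json.loads(response.choices[0].message.content))

weave.publish(weave.Dataset(name="hiring-eval", rows=rows))
```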
2. System Development
3. Fine-tuning of the comparison model
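A minimal sketch of supervised fine-tuning an open-source comparison model with TRL, streaming metrics to W&B; the base model, dataset file, and hyperparameters are placeholders, not the exact setup used here:

```python
# Sketch: SFT of an open-source comparison model with TRL, logged to W&B.
# Base model, dataset file, and hyperparameters are placeholders.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Expects a JSONL file with a "messages" column in chat format.
dataset = load_dataset("json", data_files="hiring_sft.jsonl", split="train")

config = SFTConfig(
    output_dir="comparison-model-sft",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    report_to="wandb",  # stream losses and metrics to W&B Models
)

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-7B-Instruct",  # placeholder base model
    train_dataset=dataset,
    args=config,
)
trainer.train()
```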
📈 Evaluation and Monitoring
This section covers the system-level evaluation of our hiring agent, the model-level evaluation during fine-tuning, and production monitoring.
1. System Evaluation in W&B Weave
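A minimal sketch of a system-level evaluation in Weave, assuming the dataset published above and a simple decision-match scorer; the repo's actual scorers are richer:

```python
# Sketch: evaluate the end-to-end agent against the versioned dataset
# with a simple decision-match scorer. Scorer and dataset are illustrative.
import asyncio
import weave
from weave import Evaluation

weave.init("hiring-agent")

dataset = weave.ref("hiring-eval").get()  # published dataset from above

@weave.op()
def decision_match(label: str, output: str) -> dict:
    # Parameter names must match the dataset row keys (plus "output").
    return {"correct": label in output}

@weave.op()
def hiring_agent(position: str, application: str) -> str:
    ...  # run the full extraction -> comparison -> guardrail workflow
    return "interview"

evaluation = Evaluation(dataset=dataset, scorers=[decision_match])
asyncio.run(evaluation.evaluate(hiring_agent))
```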
2. Model Evaluation in W&B Models
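Per-epoch metrics and checkpoints can also be tracked explicitly in W&B Models; a minimal hand-rolled sketch with placeholder values (TRL's report_to="wandb" above already logs training losses automatically):

```python
# Sketch: log evaluation metrics per epoch and version a checkpoint
# as a W&B artifact. Metric values and paths are placeholders.
import wandb

run = wandb.init(project="hiring-agent", job_type="fine-tune-eval")
for epoch, accuracy in enumerate([0.71, 0.78, 0.83], start=1):
    run.log({"epoch": epoch, "eval/decision_accuracy": accuracy})

artifact = wandb.Artifact("comparison-model", type="model")
artifact.add_dir("comparison-model-sft")  # checkpoint directory from above
run.log_artifact(artifact)
run.finish()
```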
3. Production Monitoring
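For monitoring, decorating the production entry point with @weave.op yields a traced, auditable record of every call; a minimal sketch with an assumed project name:

```python
# Sketch: every production invocation is traced to Weave, giving an
# auditable log of inputs, outputs, and latencies for monitoring.
import weave

weave.init("hiring-agent-prod")  # project name is an assumption

@weave.op()
def screen_candidate(position: str, application: str) -> str:
    ...  # full extraction -> comparison -> guardrail workflow
    return "interview"

# Each call is now logged as a trace in the Weave UI.
screen_candidate("Senior ML Engineer ...", "Dear hiring team ...")
```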
🧠 Conclusion and Next Steps
As AI adoption accelerates, regulatory compliance becomes not merely an obligation but a strategic imperative. The EU AI Act demands rigorous processes and comprehensive documentation, making the right choice of supporting technology crucial for organizations navigating these requirements. Weights & Biases stands out by offering robust, integrated tools that simplify compliance tasks—enabling teams to effectively manage risk, maintain transparency, and deliver trustworthy AI solutions. By leveraging the Weights & Biases AI developer platform, organizations can transform regulatory compliance into an advantage, ensuring continuous innovation while upholding the highest standards of safety and ethical responsibility.
- Roll out AI literacy training for your workforce to identify obligations and demonstrate compliance
- Create an inventory of AI systems in your company, classify their risk class, and identify obligations with a service provider such as the appliedAI Initiative.
- Set up your compliant AI operating model to define your regulatory strategy, governance processes, and roles & responsibilities with a service provider such as the appliedAI Initiative.
- Standardize your compliance processes through the Weights & Biases observability and governance platform.
- Automate reporting and enforce standards using Weights & Biases. Contact us for a demo.