Hiring Agent Demo: E2E AI Evaluation for Fair & Auditable Hiring Decisions
Objective
This AI-powered Hiring Assistant automates the hiring process by evaluating candidate applications against job offers using advanced language models. It ensures fair, unbiased decisions, maintains compliance with regulations like the EU AI Act, and offers auditable, transparent decision-making.

What It Does
- Generates synthetic CVs and job offers
- Uses LLMs to decide interview outcomes
- Logs and evaluates decisions for accuracy, consistency, and fairness
- Supports traceable workflows and model comparison through Weights & Biases and Weave
System Architecture
1. Document Processing
- PDF text extraction for job offers and applications
- Image handling for scanned documents or embedded visuals
- Structured data extraction using AI-driven prompt templates
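As a concrete illustration of the extraction step, here is a minimal sketch of pulling raw text from an uploaded PDF. The report does not name the extraction library, so pypdf is assumed; the structured-field extraction itself is handled by the AI prompt templates mentioned above.
```python
# Minimal sketch of the PDF text-extraction step (library choice assumed: pypdf).
from pypdf import PdfReader

def extract_pdf_text(path: str) -> str:
    """Concatenate the text of every page in a job-offer or CV PDF."""
    reader = PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)
```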
2. AI Models Integration
- Supports OpenAI, AWS Bedrock, and Ollama
- UI-based model selection & config
- Includes guardrails to prevent hallucinations and enforce fairness
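One plausible shape for the backend selection is sketched below, using the standard LangChain wrappers for the three providers; the actual factory function and its defaults in the app may differ.
```python
# Hedged sketch of UI-driven backend selection; ChatOpenAI, ChatBedrock, and
# ChatOllama are the standard LangChain integrations for the listed providers.
from langchain_openai import ChatOpenAI
from langchain_aws import ChatBedrock
from langchain_ollama import ChatOllama

def get_chat_model(provider: str, model_name: str, temperature: float = 0.0):
    """Return a chat model for the backend picked in the Streamlit UI."""
    if provider == "openai":
        return ChatOpenAI(model=model_name, temperature=temperature)
    if provider == "bedrock":
        return ChatBedrock(model_id=model_name, model_kwargs={"temperature": temperature})
    if provider == "ollama":
        return ChatOllama(model=model_name, temperature=temperature)
    raise ValueError(f"Unknown provider: {provider}")
```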
3. Evaluation System
- Single Test Mode: Evaluate one application-offer pair in real time
- Batch Testing Mode: Large-scale evaluations across datasets
- Comprehensive scoring: Accuracy, consistency, and rationale quality
Workflow Structure
The app follows a clear, auditable multi-step pipeline:
- Document Upload → User provides PDFs or triggers dataset generation
- Information Extraction → Structured data from job offers and CVs
- Comparison & Decision → LLM evaluates candidate fit
- Hallucination Checking → Weave scoring for factual consistency
- Expert Review (Optional) → Human override and auditing capabilities
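This pipeline maps naturally onto a LangGraph state graph. The sketch below is illustrative only: node names, the state schema, and the routing condition are assumptions, not the app's actual graph definition.
```python
# Illustrative LangGraph wiring for the pipeline above (names and schema assumed).
from typing import TypedDict
from langgraph.graph import StateGraph, END

class HiringState(TypedDict, total=False):
    offer_text: str
    cv_text: str
    decision: str
    reason: str
    hallucination_detected: bool

# Placeholder nodes; each real node calls the LLM or a scorer and updates state.
def extract_information(state: HiringState) -> HiringState: return state
def compare_and_decide(state: HiringState) -> HiringState: return state
def check_hallucination(state: HiringState) -> HiringState: return state
def expert_review(state: HiringState) -> HiringState: return state

graph = StateGraph(HiringState)
graph.add_node("extract", extract_information)
graph.add_node("compare", compare_and_decide)
graph.add_node("guardrail", check_hallucination)
graph.add_node("review", expert_review)
graph.set_entry_point("extract")
graph.add_edge("extract", "compare")
graph.add_edge("compare", "guardrail")
# Only route to the optional human review when the guardrail flags a problem.
graph.add_conditional_edges(
    "guardrail",
    lambda s: "review" if s.get("hallucination_detected") else END,
    {"review": "review", END: END},
)
graph.add_edge("review", END)
app = graph.compile()
```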
Key Features
1. Dataset Creation
- Generates structured applicant characteristics
- Supports bias control for:
- Gender
- Age
- Nationality
- Produces complete, synthetic applications aligned to job offers
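A hypothetical sketch of characteristic sampling with bias control is shown below; the attribute values and the balancing strategy are illustrative, not the project's actual dataset-generation logic.
```python
# Hypothetical sketch of characteristic sampling with bias control; the real
# generator in the project may balance attributes differently.
import itertools
import random

GENDERS = ["female", "male", "non-binary"]
AGE_BANDS = ["20-29", "30-39", "40-49", "50-59"]
NATIONALITIES = ["German", "French", "Nigerian", "Indian", "Brazilian"]

def sample_characteristics(n: int, seed: int = 42) -> list[dict]:
    """Cycle through all attribute combinations so each value appears with
    (near-)equal frequency in the synthetic applicant pool."""
    rng = random.Random(seed)
    combos = list(itertools.product(GENDERS, AGE_BANDS, NATIONALITIES))
    rng.shuffle(combos)
    pool = itertools.islice(itertools.cycle(combos), n)
    return [{"gender": g, "age_band": a, "nationality": nat} for g, a, nat in pool]
```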
2. R Score Calculation
- Quantitative metric for dataset representativeness
- Tracks distribution across features
- Validates balance using threshold-based checks
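The report does not spell out the exact R score formula, so the sketch below assumes a simple per-feature deviation-from-uniform measure with a configurable threshold; treat it as one plausible reading of "representativeness", not the app's implementation.
```python
# Hedged sketch of a representativeness ("R") score over tracked features.
from collections import Counter

def r_score(records: list[dict], features: list[str]) -> float:
    """Return a 0-1 score: 1.0 means every feature value is perfectly evenly
    represented; lower values indicate skewed distributions."""
    per_feature = []
    for feat in features:
        counts = Counter(r[feat] for r in records)
        ideal = len(records) / len(counts)
        # Total variation distance from the uniform distribution.
        tvd = sum(abs(c - ideal) for c in counts.values()) / (2 * len(records))
        per_feature.append(1.0 - tvd)
    return sum(per_feature) / len(per_feature)

def is_balanced(records: list[dict], features: list[str], threshold: float = 0.9) -> bool:
    """Threshold-based balance check, e.g. flag datasets below 0.9 for review."""
    return r_score(records, features) >= threshold
```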
3. Application Generation
- AI-powered resume synthesis from characteristics
- Incorporates job offer requirements dynamically
- Built-in quality control prompts and rules
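The helper name below mirrors the function referenced later in this report, but its signature and body are an assumed shape: a single prompt that combines the sampled characteristics, the job offer, and the quality-control rules.
```python
# Assumed shape of the application-generation step; prompt wording is illustrative.
def generate_application_from_characteristics(llm, characteristics: dict, job_offer: str) -> str:
    """Ask the selected chat model to write a synthetic CV that matches the
    sampled applicant characteristics and references the job offer."""
    prompt = (
        "Write a realistic one-page CV for a candidate with these attributes:\n"
        f"{characteristics}\n\n"
        "Tailor the experience section to this job offer:\n"
        f"{job_offer}\n\n"
        # Quality-control rules baked into the prompt.
        "Rules: no placeholder text, consistent dates, do not omit any attribute."
    )
    return llm.invoke(prompt).content
```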
Evaluation Modes
Single Test Mode
- Compare one job offer + one CV
- Get real-time decision with detailed explanation
- Optional expert review step
Batch Testing Mode
- Run large evaluations across multiple pairings
- Support for multiple trials
- Tracks performance metrics and decision consistency
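Batch runs map onto Weave's Evaluation API. The sketch below shows the general pattern with a simplified decision_match scorer and a two-row dataset; the project's real dataset rows, scorer signatures, and model wrapper will differ.
```python
# Sketch of a batch evaluation with W&B Weave; rows and scorer are simplified.
import weave

weave.init("hiring-agent-demo")  # project name is illustrative

@weave.op()
def decision_match(expected: str, output: dict) -> dict:
    """Score 1 when the model's interview decision matches the labelled outcome."""
    return {"match": output.get("decision") == expected}

dataset = [
    {"offer": "senior data engineer offer text", "application": "CV text A", "expected": "interview"},
    {"offer": "senior data engineer offer text", "application": "CV text B", "expected": "reject"},
]

evaluation = weave.Evaluation(dataset=dataset, scorers=[decision_match])
# `model` would be the hiring agent's comparison step (a weave.Model or @weave.op
# callable returning a decision dict); run with:
# asyncio.run(evaluation.evaluate(model))
```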

Expert Review System
- Human-in-the-loop oversight
- Configurable review triggers (e.g. low R score, uncertain LLM response)
- Experts can override decisions
- Full logging of all reviews and model outputs
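The review triggers are configurable; the check below is only an assumed shape, with illustrative field names and thresholds rather than the app's actual configuration.
```python
# Assumed shape of the configurable review-trigger check.
def needs_expert_review(r_score: float, decision: dict,
                        r_score_threshold: float = 0.9,
                        confidence_threshold: float = 0.7) -> bool:
    """Route a case to a human reviewer when the dataset looks unbalanced or
    the LLM's decision is uncertain or flagged by the guardrail."""
    return (
        r_score < r_score_threshold
        or decision.get("confidence", 1.0) < confidence_threshold
        or decision.get("hallucination_detected", False)
    )
```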
Model Evaluation & Scoring
| Metric | Tool | Purpose |
|---|---|---|
| Reason Quality | ReasonScorer | Measures clarity and justification of the model's output |
| Consistency | decision_match() | Checks if model decisions align with expected outcomes |
| Factuality | HallucinationFreeScorer | Detects unsupported or incorrect claims in model output |
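HallucinationFreeScorer is a built-in scorer shipped with Weave (weave.scorers); ReasonScorer is this project's custom scorer. Its real implementation is not shown in this report, so the class below is only a toy stand-in illustrating the class-based Weave scorer pattern.
```python
# Toy stand-in for the project's ReasonScorer, written as a class-based Weave
# scorer; the real version presumably grades clarity and justification with an LLM.
import weave

class ReasonScorer(weave.Scorer):
    @weave.op
    def score(self, output: dict) -> dict:
        reason = output.get("reason", "")
        return {
            "has_reason": bool(reason),
            "reason_length_ok": len(reason.split()) >= 20,  # crude proxy for substance
        }
```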
Fine-Tuning Implementation
The hiring agent includes a fine-tuned model purpose-built for comparing job applications to job offers. This section outlines the full lifecycle, from training through integration and evaluation.
Overview
The fine-tuned model enhances comparison accuracy by learning from a rich dataset of offers, resumes, and decisions. It's trained using W&B for tracking and deployed via Ollama for local inference.
Fine-Tuning Process
1. Training Data Generation
- Generates positive and negative examples for comparison tasks
- Creates synthetic job offers and applications
- Produces structured interview decisions with reasoning
- Uses: generate_application_from_characteristics() and generate_dataset()
2. Model Architecture
- Base Model: Lightweight LLM fine-tuned for offer-applicant comparisons
- Training Environment: Google Colab notebook
- Input Format: JSON (offers, applications, decisions)
- Output Format: Structured interview decisions with rationale
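The exact schema isn't reproduced in this report; the record below is an assumed example of what one JSON training pair (offer, application, decision with rationale) might look like.
```python
# Assumed shape of a single fine-tuning record; field names are illustrative.
import json

record = {
    "job_offer": {"title": "Data Engineer", "requirements": ["Python", "SQL", "Airflow"]},
    "application": {"skills": ["Python", "SQL"], "experience_years": 4},
    "decision": "interview",
    "reason": "Covers the core requirements; Airflow can be learned on the job.",
}

with open("train.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record) + "\n")
```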
3. Training Implementation
- Integrated with W&B for experiment tracking
- Stores versions as W&B Artifacts
- Includes an automated evaluation pipeline
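The tracking side follows the usual wandb pattern; the project name, metric keys, and artifact name below are illustrative.
```python
# Minimal sketch of experiment tracking and model versioning with wandb.
import wandb

run = wandb.init(project="hiring-agent-finetune", job_type="train")
# ... fine-tuning loop logs metrics as it goes ...
run.log({"train/loss": 0.42, "eval/decision_accuracy": 0.87})

# Version the fine-tuned weights so evaluations can pin an exact model version.
artifact = wandb.Artifact("comparison-model", type="model")
artifact.add_dir("./finetuned-model")
run.log_artifact(artifact)
run.finish()
```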
Evaluation
1. Performance Metrics
- Decision Matching Accuracy
- Reason Scoring (clarity, fairness, justification)
- Hallucination Detection
- Batch Evaluation Support
2. Comparison with Base Models
- Benchmarked against OpenAI and AWS Bedrock
- Compared on realistic hiring scenarios
- Evaluated against human judgments for fairness

Comparing the performance of hiring agents built on different comparison models

Comparing our fine-tuned comparison model with other comparison models.
Guardrails and Robustness
The following figure shows three agentic workflows of the hiring agent, depending on whether the hallucination guardrail detects a hallucination (a sketch of the routing logic follows the list):
- The first image shows the steps when the comparison passes the guardrail.
- The second image shows the agent self-reflecting: feedback from the hallucination guardrail is looped back to the comparison model, which redoes the comparison and then passes the check.
- The third image shows the agent failing twice, at which point a human operator is brought into the process to make the decision and provide a reason.
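The routing described above can be expressed as a single conditional edge in the graph. Node names, state fields, and the retry limit in this sketch are assumptions about the wiring, not the app's actual code.
```python
# Hedged sketch of the guardrail routing: pass, self-reflect, or escalate.
MAX_RETRIES = 2  # after two failed attempts, a human operator takes over

def route_after_guardrail(state: dict) -> str:
    """Decide the next node after the hallucination check."""
    if not state.get("hallucination_detected"):
        return "finalize"        # comparison passed the guardrail
    if state.get("retries", 0) < MAX_RETRIES:
        return "compare"         # loop guardrail feedback back to the comparison model
    return "expert_review"       # escalate: human makes the decision and gives a reason
```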

Tech Stack
- Streamlit: For the user interface
- LangChain + LangGraph: For orchestrating workflows
- W&B Weave: For observability, evaluation, monitoring, and guardrails
- Ollama, OpenAI, Bedrock: Model backends
- Custom scoring modules: For explainability and audit-readiness