Hiring Agent Demo: E2E AI Evaluation for Fair & Auditable Hiring Decisions
Objective
This AI-powered Hiring Assistant automates the hiring process by evaluating candidate applications against job offers using advanced language models. It ensures fair, unbiased decisions, maintains compliance with regulations like the EU AI Act, and offers auditable, transparent decision-making.

What It Does
- Generates synthetic CVs and job offers
- Uses LLMs to decide interview outcomes
- Logs and evaluates decisions for accuracy, consistency, and fairness
- Supports traceable workflows and model comparison through Weights & Biases and Weave
System Architecture
1. Document Processing
- PDF text extraction for job offers and applications
- Image handling for scanned documents or embedded visuals
- Structured data extraction using AI-driven prompt templates
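As a concrete illustration of the extraction step, here is a minimal sketch of pulling raw text from an uploaded PDF. The report does not name the extraction library, so pypdf is assumed; the structured-field extraction itself is handled by the AI prompt templates mentioned above.
```python
# Minimal sketch of the PDF text-extraction step (library choice assumed: pypdf).
from pypdf import PdfReader

def extract_pdf_text(path: str) -> str:
    """Concatenate the text of every page in a job-offer or CV PDF."""
    reader = PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)
```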
2. AI Models Integration
- Supports OpenAI, AWS Bedrock, and Ollama
- UI-based model selection & config
- Includes guardrails to prevent hallucinations and enforce fairness
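One plausible shape for the backend selection is sketched below, using the standard LangChain wrappers for the three providers; the actual factory function and its defaults in the app may differ.
```python
# Hedged sketch of UI-driven backend selection; ChatOpenAI, ChatBedrock, and
# ChatOllama are the standard LangChain integrations for the listed providers.
from langchain_openai import ChatOpenAI
from langchain_aws import ChatBedrock
from langchain_ollama import ChatOllama

def get_chat_model(provider: str, model_name: str, temperature: float = 0.0):
    """Return a chat model for the backend picked in the Streamlit UI."""
    if provider == "openai":
        return ChatOpenAI(model=model_name, temperature=temperature)
    if provider == "bedrock":
        return ChatBedrock(model_id=model_name, model_kwargs={"temperature": temperature})
    if provider == "ollama":
        return ChatOllama(model=model_name, temperature=temperature)
    raise ValueError(f"Unknown provider: {provider}")
```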
3. Evaluation System
- Single Test Mode: Evaluate one application-offer pair in real time
- Batch Testing Mode: Large-scale evaluations across datasets
- Comprehensive scoring: Accuracy, consistency, and rationale quality
Workflow Structure
The app follows a clear, auditable multi-step pipeline:
- Document Upload → User provides PDFs or triggers dataset generation
- Information Extraction → Structured data from job offers and CVs
- Comparison & Decision → LLM evaluates candidate fit
- Hallucination Checking → Weave scoring for factual consistency
- Expert Review (Optional) → Human override and auditing capabilities
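This pipeline maps naturally onto a LangGraph state graph. The sketch below is illustrative only: node names, the state schema, and the routing condition are assumptions, not the app's actual graph definition.
```python
# Illustrative LangGraph wiring for the pipeline above (names and schema assumed).
from typing import TypedDict
from langgraph.graph import StateGraph, END

class HiringState(TypedDict, total=False):
    offer_text: str
    cv_text: str
    decision: str
    reason: str
    hallucination_detected: bool

# Placeholder nodes; each real node calls the LLM or a scorer and updates state.
def extract_information(state: HiringState) -> HiringState: return state
def compare_and_decide(state: HiringState) -> HiringState: return state
def check_hallucination(state: HiringState) -> HiringState: return state
def expert_review(state: HiringState) -> HiringState: return state

graph = StateGraph(HiringState)
graph.add_node("extract", extract_information)
graph.add_node("compare", compare_and_decide)
graph.add_node("guardrail", check_hallucination)
graph.add_node("review", expert_review)
graph.set_entry_point("extract")
graph.add_edge("extract", "compare")
graph.add_edge("compare", "guardrail")
# Only route to the optional human review when the guardrail flags a problem.
graph.add_conditional_edges(
    "guardrail",
    lambda s: "review" if s.get("hallucination_detected") else END,
    {"review": "review", END: END},
)
graph.add_edge("review", END)
app = graph.compile()
```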
Key Features
1. Dataset Creation
- Generates structured applicant characteristics
- Supports bias control for:
- Gender
- Age
- Nationality
- Produces complete, synthetic applications aligned to job offers
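A hypothetical sketch of characteristic sampling with bias control is shown below; the attribute values and the balancing strategy are illustrative, not the project's actual dataset-generation logic.
```python
# Hypothetical sketch of characteristic sampling with bias control; the real
# generator in the project may balance attributes differently.
import itertools
import random

GENDERS = ["female", "male", "non-binary"]
AGE_BANDS = ["20-29", "30-39", "40-49", "50-59"]
NATIONALITIES = ["German", "French", "Nigerian", "Indian", "Brazilian"]

def sample_characteristics(n: int, seed: int = 42) -> list[dict]:
    """Cycle through all attribute combinations so each value appears with
    (near-)equal frequency in the synthetic applicant pool."""
    rng = random.Random(seed)
    combos = list(itertools.product(GENDERS, AGE_BANDS, NATIONALITIES))
    rng.shuffle(combos)
    pool = itertools.islice(itertools.cycle(combos), n)
    return [{"gender": g, "age_band": a, "nationality": nat} for g, a, nat in pool]
```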
2. R Score Calculation
- Quantitative metric for dataset representativeness
- Tracks distribution across features
- Validates balance using threshold-based checks
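The report does not spell out the exact R score formula, so the sketch below assumes a simple per-feature deviation-from-uniform measure with a configurable threshold; treat it as one plausible reading of "representativeness", not the app's implementation.
```python
# Hedged sketch of a representativeness ("R") score over tracked features.
from collections import Counter

def r_score(records: list[dict], features: list[str]) -> float:
    """Return a 0-1 score: 1.0 means every feature value is perfectly evenly
    represented; lower values indicate skewed distributions."""
    per_feature = []
    for feat in features:
        counts = Counter(r[feat] for r in records)
        ideal = len(records) / len(counts)
        # Total variation distance from the uniform distribution.
        tvd = sum(abs(c - ideal) for c in counts.values()) / (2 * len(records))
        per_feature.append(1.0 - tvd)
    return sum(per_feature) / len(per_feature)

def is_balanced(records: list[dict], features: list[str], threshold: float = 0.9) -> bool:
    """Threshold-based balance check, e.g. flag datasets below 0.9 for review."""
    return r_score(records, features) >= threshold
```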
3. Application Generation
- AI-powered resume synthesis from characteristics
- Incorporates job offer requirements dynamically
- Built-in quality control prompts and rules
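The helper name below mirrors the function referenced later in this report, but its signature and body are an assumed shape: a single prompt that combines the sampled characteristics, the job offer, and the quality-control rules.
```python
# Assumed shape of the application-generation step; prompt wording is illustrative.
def generate_application_from_characteristics(llm, characteristics: dict, job_offer: str) -> str:
    """Ask the selected chat model to write a synthetic CV that matches the
    sampled applicant characteristics and references the job offer."""
    prompt = (
        "Write a realistic one-page CV for a candidate with these attributes:\n"
        f"{characteristics}\n\n"
        "Tailor the experience section to this job offer:\n"
        f"{job_offer}\n\n"
        # Quality-control rules baked into the prompt.
        "Rules: no placeholder text, consistent dates, do not omit any attribute."
    )
    return llm.invoke(prompt).content
```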
Evaluation Modes
Single Test Mode
- Compare one job offer + one CV
- Get real-time decision with detailed explanation
- Optional expert review step
Batch Testing Mode
- Run large evaluations across multiple pairings
- Support for multiple trials
- Tracks performance metrics and decision consistency
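Batch runs map onto Weave's Evaluation API. The sketch below shows the general pattern with a simplified decision_match scorer and a two-row dataset; the project's real dataset rows, scorer signatures, and model wrapper will differ.
```python
# Sketch of a batch evaluation with W&B Weave; rows and scorer are simplified.
import weave

weave.init("hiring-agent-demo")  # project name is illustrative

@weave.op()
def decision_match(expected: str, output: dict) -> dict:
    """Score 1 when the model's interview decision matches the labelled outcome."""
    return {"match": output.get("decision") == expected}

dataset = [
    {"offer": "senior data engineer offer text", "application": "CV text A", "expected": "interview"},
    {"offer": "senior data engineer offer text", "application": "CV text B", "expected": "reject"},
]

evaluation = weave.Evaluation(dataset=dataset, scorers=[decision_match])
# `model` would be the hiring agent's comparison step (a weave.Model or @weave.op
# callable returning a decision dict); run with:
# asyncio.run(evaluation.evaluate(model))
```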

Expert Review System
- Human-in-the-loop oversight
- Configurable review triggers (e.g. low R score, uncertain LLM response)
- Experts can override decisions
- Full logging of all reviews and model outputs
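The review triggers are configurable; the check below is only an assumed shape, with illustrative field names and thresholds rather than the app's actual configuration.
```python
# Assumed shape of the configurable review-trigger check.
def needs_expert_review(r_score: float, decision: dict,
                        r_score_threshold: float = 0.9,
                        confidence_threshold: float = 0.7) -> bool:
    """Route a case to a human reviewer when the dataset looks unbalanced or
    the LLM's decision is uncertain or flagged by the guardrail."""
    return (
        r_score < r_score_threshold
        or decision.get("confidence", 1.0) < confidence_threshold
        or decision.get("hallucination_detected", False)
    )
```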
Model Evaluation & Scoring
| Metric | Tool | Purpose |
|---|---|---|
| Reason Quality | ReasonScorer | Measures clarity and justification of the model's output |
| Consistency | decision_match() | Checks if model decisions align with expected outcomes |
| Factuality | HallucinationFreeScorer | Detects unsupported or incorrect claims in model output |
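HallucinationFreeScorer is a built-in scorer shipped with Weave (weave.scorers); ReasonScorer is this project's custom scorer. Its real implementation is not shown in this report, so the class below is only a toy stand-in illustrating the class-based Weave scorer pattern.
```python
# Toy stand-in for the project's ReasonScorer, written as a class-based Weave
# scorer; the real version presumably grades clarity and justification with an LLM.
import weave

class ReasonScorer(weave.Scorer):
    @weave.op
    def score(self, output: dict) -> dict:
        reason = output.get("reason", "")
        return {
            "has_reason": bool(reason),
            "reason_length_ok": len(reason.split()) >= 20,  # crude proxy for substance
        }
```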
Fine-Tuning Implementation
The hiring agent includes a fine-tuned model purpose-built for comparing job applications to job offers. This section outlines the full lifecycle, from training through integration and evaluation.
Overview
The fine-tuned model enhances comparison accuracy by learning from a rich dataset of offers, resumes, and decisions. It's trained using W&B for tracking and deployed via Ollama for local inference.
Fine-Tuning Process
1. Training Data Generation
- Generates positive and negative examples for comparison tasks
- Creates synthetic job offers and applications
- Produces structured interview decisions with reasoning
- Uses: generate_application_from_characteristics() and generate_dataset()
2. Model Architecture
- Base Model: Lightweight LLM fine-tuned for offer-applicant comparisons
- Training Environment: Google Colab notebook
- Input Format: JSON (offers, applications, decisions)
- Output Format: Structured interview decisions with rationale
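The exact schema isn't reproduced in this report; the record below is an assumed example of what one JSON training pair (offer, application, decision with rationale) might look like.
```python
# Assumed shape of a single fine-tuning record; field names are illustrative.
import json

record = {
    "job_offer": {"title": "Data Engineer", "requirements": ["Python", "SQL", "Airflow"]},
    "application": {"skills": ["Python", "SQL"], "experience_years": 4},
    "decision": "interview",
    "reason": "Covers the core requirements; Airflow can be learned on the job.",
}

with open("train.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record) + "\n")
```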
3. Training Implementation
- Integrated with W&B for experiment tracking
- Stores versions as W&B Artifacts
- Includes an automated evaluation pipeline
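The tracking side follows the usual wandb pattern; the project name, metric keys, and artifact name below are illustrative.
```python
# Minimal sketch of experiment tracking and model versioning with wandb.
import wandb

run = wandb.init(project="hiring-agent-finetune", job_type="train")
# ... fine-tuning loop logs metrics as it goes ...
run.log({"train/loss": 0.42, "eval/decision_accuracy": 0.87})

# Version the fine-tuned weights so evaluations can pin an exact model version.
artifact = wandb.Artifact("comparison-model", type="model")
artifact.add_dir("./finetuned-model")
run.log_artifact(artifact)
run.finish()
```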
Evaluation
1. Performance Metrics
- Decision Matching Accuracy
- Reason Scoring (clarity, fairness, justification)
- Hallucination Detection
- Batch Evaluation Support
2. Comparison with Base Models
- Benchmarked against OpenAI and AWS Bedrock
- Compared on realistic hiring scenarios
- Evaluated against human judgments for fairness

Comparing the performance of hiring agents built on different comparison models

Comparing our fine-tuned comparison model with other comparison models.
Guardrails and Robustness
The following figure shows three agentic workflows of the hiring agent, depending on whether the hallucination guardrail detects a hallucination (a sketch of the routing logic follows the list):
- The first image shows the steps when the comparison passes the guardrail.
- The second image shows the agent self-reflecting: feedback from the hallucination guardrail is looped back to the comparison model, which redoes the comparison and then passes the check.
- The third image shows the agent failing twice, at which point a human operator is brought into the process to make the decision and provide a reason.
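The routing described above can be expressed as a single conditional edge in the graph. Node names, state fields, and the retry limit in this sketch are assumptions about the wiring, not the app's actual code.
```python
# Hedged sketch of the guardrail routing: pass, self-reflect, or escalate.
MAX_RETRIES = 2  # after two failed attempts, a human operator takes over

def route_after_guardrail(state: dict) -> str:
    """Decide the next node after the hallucination check."""
    if not state.get("hallucination_detected"):
        return "finalize"        # comparison passed the guardrail
    if state.get("retries", 0) < MAX_RETRIES:
        return "compare"         # loop guardrail feedback back to the comparison model
    return "expert_review"       # escalate: human makes the decision and gives a reason
```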

Tech Stack
- Streamlit: For the user interface
- LangChain + LangGraph: For orchestrating workflows
- W&B Weave: For observability, evaluation, monitoring, and guardrails
- Ollama, OpenAI, Bedrock: Model backends
- Custom scoring modules: For explainability and audit-readiness