
Hiring Agent Demo: E2E AI Evaluation for Fair & Auditable Hiring Decisions


🎯 Objective

This AI-powered Hiring Assistant automates the evaluation of candidate applications against job offers using large language models. It is designed to support fair, unbiased decisions, maintain compliance with regulations such as the EU AI Act, and keep decision-making auditable and transparent.

🧠 What It Does

  • Generates synthetic CVs and job offers
  • Uses LLMs to decide interview outcomes
  • Logs and evaluates decisions for accuracy, consistency, and fairness
  • Supports traceable workflows and model comparison through Weights & Biases and Weave

๐Ÿ—๏ธ System Architecture

1. Document Processing
  • PDF text extraction for job offers and applications
  • Image handling for scanned documents or embedded visuals
  • Structured data extraction using AI-driven prompt templates
2. AI Models Integration
  • Supports OpenAI, AWS Bedrock, and Ollama
  • UI-based model selection & config
  • Includes guardrails to prevent hallucinations and enforce fairness
3. Evaluation System
  • Single Test Mode: Evaluate one application-offer pair in real time
  • Batch Testing Mode: Large-scale evaluations across datasets
  • Comprehensive scoring: Accuracy, consistency, and rationale quality
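The report does not say which PDF library backs the document-processing step; as a minimal sketch, text extraction with pypdf (an assumed choice, as are the file names) could look like this:

```python
# Minimal sketch of the PDF text-extraction step, assuming pypdf; the actual
# app may use a different library and handles scanned images separately.
from pypdf import PdfReader


def extract_pdf_text(path: str) -> str:
    """Concatenate the text of every page so it can be passed to the extraction prompts."""
    reader = PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)


offer_text = extract_pdf_text("job_offer.pdf")        # placeholder file names
cv_text = extract_pdf_text("application.pdf")
```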

๐Ÿ” Workflow Structure

The app follows a clear, auditable multi-step pipeline:
  • Document Upload – User provides PDFs or triggers dataset generation
  • Information Extraction – Structured data from job offers and CVs
  • Comparison & Decision – LLM evaluates candidate fit
  • Hallucination Checking – Weave scoring for factual consistency
  • Expert Review (Optional) – Human override and auditing capabilities
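
The report lists the steps but not the graph code. Assuming the pipeline is wired up with LangGraph (named in the tech stack below), a minimal sketch might look like the following; the node names, the `HiringState` fields, and the stub functions are illustrative placeholders rather than the app's actual implementation:

```python
# Hedged sketch of the hiring pipeline as a LangGraph state machine.
from typing import Optional, TypedDict

from langgraph.graph import END, StateGraph


class HiringState(TypedDict):
    offer_text: str
    cv_text: str
    extracted: Optional[dict]          # structured offer/CV fields
    decision: Optional[dict]           # interview decision + rationale
    hallucination_free: Optional[bool]


def extract_information(state: HiringState) -> dict:
    # Stub: prompt an LLM to pull structured fields out of the raw PDF text.
    return {"extracted": {"offer": {}, "cv": {}}}


def compare_and_decide(state: HiringState) -> dict:
    # Stub: ask the comparison model for an interview decision and reasoning.
    return {"decision": {"interview": True, "reason": "..."}}


def check_hallucination(state: HiringState) -> dict:
    # Stub: score the rationale against the extracted facts (Weave guardrail).
    return {"hallucination_free": True}


def expert_review(state: HiringState) -> dict:
    # Stub: surface the case to a human reviewer who can override the decision.
    return {}


def route_after_guardrail(state: HiringState) -> str:
    # Pass the guardrail -> finish; otherwise hand off to expert review.
    return "done" if state["hallucination_free"] else "expert_review"


graph = StateGraph(HiringState)
graph.add_node("extract", extract_information)
graph.add_node("compare", compare_and_decide)
graph.add_node("guardrail", check_hallucination)
graph.add_node("expert_review", expert_review)
graph.set_entry_point("extract")
graph.add_edge("extract", "compare")
graph.add_edge("compare", "guardrail")
graph.add_conditional_edges(
    "guardrail", route_after_guardrail, {"done": END, "expert_review": "expert_review"}
)
graph.add_edge("expert_review", END)
pipeline = graph.compile()
```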

📦 Key Features

1. Dataset Creation
  • Generates structured applicant characteristics
  • Supports bias control for:
    • Gender
    • Age
    • Nationality
  • Produces complete, synthetic applications aligned to job offers
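
As an illustration of what bias-controlled sampling can look like, the sketch below draws applicant characteristics with roughly uniform coverage of the sensitive attributes; the attribute values, field names, and round-robin balancing are assumptions, not the repository's actual code:

```python
# Illustrative bias-controlled characteristic sampling (field names and
# attribute lists are assumptions).
import random

GENDERS = ["female", "male", "non-binary"]
NATIONALITIES = ["German", "French", "Spanish", "Polish", "Turkish"]


def sample_characteristics(n: int, seed: int = 42) -> list[dict]:
    """Draw applicant characteristics with near-uniform coverage of the
    sensitive attributes so downstream bias checks have support."""
    rng = random.Random(seed)
    applicants = []
    for i in range(n):
        applicants.append({
            "gender": GENDERS[i % len(GENDERS)],             # round-robin for balance
            "age": rng.randint(22, 60),
            "nationality": NATIONALITIES[i % len(NATIONALITIES)],
            "years_experience": rng.randint(0, 25),
        })
    rng.shuffle(applicants)
    return applicants
```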
2. R Score Calculation
  • Quantitative metric for dataset representativeness
  • Tracks distribution across features
  • Validates balance using threshold-based checks
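
The report does not spell out the R score formula, so the following is only one plausible reading: for each controlled feature, measure how far its empirical distribution is from uniform, average the per-feature balance scores, and compare the result against a threshold.

```python
# One plausible R score: 1.0 means every controlled feature is perfectly
# balanced; the agent's actual formula and threshold may differ.
from collections import Counter


def r_score(applicants: list[dict], features: tuple[str, ...] = ("gender", "nationality")) -> float:
    per_feature = []
    for feature in features:
        counts = Counter(a[feature] for a in applicants)
        target = len(applicants) / len(counts)                        # uniform share per category
        tv_distance = sum(abs(c - target) for c in counts.values()) / (2 * len(applicants))
        per_feature.append(1.0 - tv_distance)
    return sum(per_feature) / len(per_feature)


def is_representative(applicants: list[dict], threshold: float = 0.8) -> bool:
    # Threshold-based balance check; continuous features such as age would
    # need to be bucketed before counting.
    return r_score(applicants) >= threshold
```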
3. Application Generation
  • AI-powered resume synthesis from characteristics
  • Incorporates job offer requirements dynamically
  • Built-in quality control prompts and rules
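
As a rough sketch of this step with the OpenAI backend, the prompt wording, model name, and function signature below are assumptions; only the idea (characteristics plus offer in, synthetic application out) comes from the report:

```python
# Hedged sketch of AI-powered resume synthesis; assumes OPENAI_API_KEY is set.
from openai import OpenAI

client = OpenAI()

PROMPT = """You are generating a synthetic CV for testing a hiring agent.
Applicant characteristics: {characteristics}
Target job offer: {offer}
Write a realistic one-page CV. Stay consistent with the characteristics and
do not copy the offer text verbatim."""


def generate_application(characteristics: dict, offer_text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model choice
        messages=[{"role": "user", "content": PROMPT.format(
            characteristics=characteristics, offer=offer_text)}],
    )
    return response.choices[0].message.content
```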

📈 Evaluation Modes

🧪 Single Test Mode
  • Compare one job offer + one CV
  • Get real-time decision with detailed explanation
  • Optional expert review step
📊 Batch Testing Mode
  • Run large evaluations across multiple pairings
  • Support for multiple trials
  • Tracks performance metrics and decision consistency
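
Batch runs map naturally onto `weave.Evaluation`. The sketch below assumes a dataset of offer/application pairs with labelled outcomes and a `weave.Model` wrapper around whichever backend is selected; column names and scorer signatures may differ from the actual app and across Weave versions:

```python
# Hedged sketch of batch evaluation with Weave.
import asyncio

import weave

weave.init("hiring-agent-demo")  # project name is a placeholder


class HiringModel(weave.Model):
    backend: str

    @weave.op()
    def predict(self, offer: str, application: str) -> dict:
        # Stub: call the selected backend (OpenAI / Bedrock / Ollama) and
        # return a structured interview decision.
        return {"decision": "interview", "reason": "..."}


@weave.op()
def decision_match(expected_decision: str, output: dict) -> dict:
    # Consistency scorer: does the decision match the labelled outcome?
    return {"match": output.get("decision") == expected_decision}


# Each row's keys are passed by name to predict() and to the scorers.
dataset = [
    {"offer": "Senior data engineer ...", "application": "CV text ...",
     "expected_decision": "interview"},
]

evaluation = weave.Evaluation(dataset=dataset, scorers=[decision_match])
results = asyncio.run(evaluation.evaluate(HiringModel(backend="openai")))
```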

🧑‍⚖️ Expert Review System

  • Human-in-the-loop oversight
  • Configurable review triggers (e.g. low R score, uncertain LLM response)
  • Experts can override decisions
  • Full logging of all reviews and model outputs
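
The concrete triggers are configurable in the app; an illustrative version of the routing logic, with assumed field names and thresholds, could be as simple as:

```python
# Illustrative review-trigger logic; field names and thresholds are assumptions.
def needs_expert_review(decision: dict, dataset_r_score: float,
                        r_threshold: float = 0.8,
                        confidence_threshold: float = 0.7) -> bool:
    if dataset_r_score < r_threshold:                            # dataset balance looks off
        return True
    if decision.get("confidence", 1.0) < confidence_threshold:   # uncertain LLM response
        return True
    if not decision.get("hallucination_free", True):             # failed the factuality guardrail
        return True
    return False
```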

🔬 Model Evaluation & Scoring

| Metric | Tool | Purpose |
| --- | --- | --- |
| Reason Quality | ReasonScorer | Measures clarity and justification of the model's output |
| Consistency | decision_match() | Checks whether model decisions align with expected outcomes |
| Factuality | HallucinationFreeScorer | Detects unsupported or incorrect claims in model output |
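
`decision_match()` appears as a plain scorer function in the batch-evaluation sketch above; the reason-quality check can be expressed as a class-based Weave scorer. The stub below is a hedged sketch only (the real ReasonScorer presumably prompts an LLM judge rather than counting words), and the factuality check refers to Weave's predefined `HallucinationFreeScorer`, whose constructor arguments depend on the Weave version.

```python
# Hedged sketch of a class-based reason-quality scorer.
import weave
from weave import Scorer


class ReasonScorer(Scorer):
    min_words: int = 20  # placeholder threshold

    @weave.op()
    def score(self, output: dict) -> dict:
        reason = output.get("reason", "")
        return {"has_substantive_reason": len(reason.split()) >= self.min_words}


# Factuality: weave.scorers.HallucinationFreeScorer can be added to the same
# scorers list; consult the Weave docs for its exact constructor arguments.
```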

🧪 Fine-Tuning Implementation

The hiring agent includes a fine-tuned model purpose-built for comparing job applications to job offers. This section outlines the full lifecycle: from training to integration and evaluation.

🧬 Overview

The fine-tuned model enhances comparison accuracy by learning from a rich dataset of offers, resumes, and decisions. It's trained using W&B for tracking and deployed via Ollama for local inference.

๐Ÿ—๏ธ Fine-Tuning Process

1. 🧾 Training Data Generation
  • Generates positive and negative examples for comparison tasks
  • Creates synthetic job offers and applications
  • Produces structured interview decisions with reasoning
  • Uses: generate_application_from_characteristics() and generate_dataset()
2. 🧠 Model Architecture
  • Base Model: Lightweight LLM fine-tuned for offer-applicant comparisons
  • Training Environment: Google Colab notebook
  • Input Format: JSON (offers, applications, decisions)
  • Output Format: Structured interview decisions with rationale
3. ๐Ÿ› ๏ธ Training Implementation
  • Notebook-based fine-tuning: Open in Colab
  • Integrated with W&B for experiment tracking
  • Stores versions as W&B Artifacts
  • Includes an automated evaluation pipeline
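
The training loop itself lives in the linked Colab notebook; the sketch below only illustrates the W&B bookkeeping around it (project name, config values, and file paths are placeholders):

```python
# Hedged sketch of experiment tracking and model versioning with W&B.
import wandb

run = wandb.init(project="hiring-agent-finetune", job_type="train",
                 config={"base_model": "<lightweight-llm>", "epochs": 3, "lr": 2e-5})

for epoch in range(run.config.epochs):
    # ... one fine-tuning epoch over the (offer, application, decision) JSON data ...
    wandb.log({"epoch": epoch, "train/loss": 0.0})   # replace 0.0 with the real loss

# Version the fine-tuned weights so the app can pull a specific model for Ollama.
artifact = wandb.Artifact("comparison-model", type="model")
artifact.add_dir("./finetuned-model")                # directory containing the saved weights
run.log_artifact(artifact)
run.finish()
```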

📊 Evaluation

1. 📈 Performance Metrics
  • Decision Matching Accuracy
  • Reason Scoring (clarity, fairness, justification)
  • Hallucination Detection
  • Batch Evaluation Support
2. 🤖 Comparison with Base Models
  • Benchmarked against OpenAI and AWS Bedrock
  • Compared on realistic hiring scenarios
  • Evaluated against human judgments for fairness
Comparing the performance of different hiring agents based on different comparison models
Comparing our fine-tuned comparison model with other comparison models.

Guardrails and Robustness

The following figure shows three different agentic workflows of the hiring agent, depending on whether the hallucination guardrail detects a hallucination.
  1. The first image shows the steps taken when the comparison passes the guardrail.
  2. The second image shows how the agent self-reflects: feedback from the hallucination guardrail is looped back to the comparison model, which redoes the comparison and then passes the check.
  3. The third image shows the agent failing twice, so a human operator is brought into the process to make the decision and provide a reason.
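
A compact sketch of these three workflows, with stub functions standing in for the agent's real comparison, guardrail, and review steps:

```python
# Hedged sketch of the guardrail loop: pass, retry with feedback, then escalate.
MAX_ATTEMPTS = 2


def compare(offer: str, application: str, guardrail_feedback: str | None = None) -> dict:
    # Stub: the real step prompts the comparison model, optionally including
    # the guardrail's feedback so the model can correct itself.
    return {"interview": True, "reason": "..."}


def hallucination_check(decision: dict, offer: str, application: str) -> dict:
    # Stub: the real step scores the rationale against the source documents.
    return {"hallucination_free": True, "explanation": ""}


def request_human_decision(offer: str, application: str, feedback: str | None) -> dict:
    # Stub: the real step waits for an expert to record a decision and a reason.
    return {"interview": False, "reason": "decided by a human reviewer"}


def run_comparison_with_guardrail(offer: str, application: str) -> dict:
    feedback = None
    for _ in range(MAX_ATTEMPTS):
        decision = compare(offer, application, guardrail_feedback=feedback)
        check = hallucination_check(decision, offer, application)
        if check["hallucination_free"]:
            return decision                      # workflow 1, or workflow 2 on the retry
        feedback = check["explanation"]          # loop the guardrail feedback back in
    # Workflow 3: two failures -> a human operator takes the decision.
    return request_human_decision(offer, application, feedback)
```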

๐Ÿ› ๏ธ Tech Stack

  • Streamlit: For the user interface
  • LangChain + LangGraph: For orchestrating workflows
  • W&B Weave: For observability, evaluation, monitoring, and guardrails
  • Ollama, OpenAI, Bedrock: Model backends
  • Custom scoring modules: For explainability and audit-readiness
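
For context, a toy Streamlit front end covering the backend selection and document upload described above might look like this; the widget labels and wiring are illustrative, not the app's actual layout:

```python
# Illustrative Streamlit UI for backend selection and document upload.
import streamlit as st

backend = st.sidebar.selectbox("Model backend", ["OpenAI", "AWS Bedrock", "Ollama"])
model_name = st.sidebar.text_input("Model name", value="gpt-4o-mini")

offer_pdf = st.file_uploader("Job offer (PDF)", type="pdf")
cv_pdf = st.file_uploader("Application / CV (PDF)", type="pdf")

if st.button("Evaluate") and offer_pdf and cv_pdf:
    st.write(f"Running comparison with {backend}: {model_name} ...")
    # ... extraction, comparison, guardrail, and (optional) expert review run here ...
```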