AI guardrails: Bias scorers
This article explores bias in AI systems, the need for bias guardrails, detection models, and strategies to mitigate, monitor, and evaluate bias effectively.
Artificial intelligence is transforming decision-making across industries, but it comes with a significant challenge: bias.
When left unaddressed, bias can amplify discrimination, undermine trust, and compromise the fairness of AI applications - ultimately limiting their potential to benefit society. This article explores the complexities of bias in AI and offers actionable strategies to safeguard fairness and equity in your AI systems.
Using real-world examples and practical tools, we’ll show you how to detect, mitigate, and monitor bias to build more inclusive and reliable AI workflows.
Ready to get hands-on with bias detection? Start exploring with our interactive Colab.
Otherwise, continue reading for a deeper understanding of bias in AI and the code you’ll need to integrate these strategies into your workflows.

Table of contents
- What is bias in AI?
- Why you need bias guardrails
- Introduction to bias detection models
- Weave BiasScorer
- OpenAI Moderation API
- LlamaGuard
- ValuRank Bias Detector
- GenderRaceBiasModel
- Evaluating bias detection models with Weave
- Conclusion
What is bias in AI?
Bias in AI refers to the presence of systematic favoritism or prejudice within an AI model's outputs, often reflecting imbalances or inequities in the data used to train the model. These biases can manifest in various forms, such as racial, gender, or cultural discrimination, and can lead to significant societal impacts.
For example, in image generation, prompts like "CEO in a boardroom" might produce outputs that reflect societal stereotypes, while prompts for "traditional clothing" could result in narrow or oversimplified depictions of certain cultures. In medical applications, diagnostic tools trained on data that underrepresents certain races or genders might overlook key symptoms, leading to inaccurate or incomplete diagnoses. Without robust bias detection and mitigation, AI systems risk perpetuating discrimination, eroding trust, and failing to meet user expectations for fairness and reliability.
Open-ended conversational AI systems should also be monitored for biases to prevent the propagation of harmful or discriminatory content. These systems, often trained on vast datasets from the internet, can unintentionally replicate biased language, or generate offensive responses. Ensuring these systems are equipped with robust guardrails helps maintain fairness and align interactions with ethical and societal standards.
Why you need bias guardrails
Bias guardrails act as safeguards to detect and mitigate biases in AI systems, ensuring that these technologies operate in a fair and equitable manner. By identifying harmful patterns embedded in training data or arising from algorithmic decision-making, they allow organizations to take corrective measures that align AI outputs with ethical and societal standards.
Without robust mechanisms for bias detection and mitigation, AI systems can unintentionally perpetuate discrimination by reflecting existing societal inequities. This can manifest in various ways, such as reinforcing stereotypes, excluding underrepresented groups, or producing outcomes that disproportionately harm certain populations. These failures erode public confidence in AI technologies and hinder their ability to meet user expectations for reliability, inclusivity, and fairness.
Bias guardrails play a pivotal role in mitigating these risks and enabling organizations to build AI systems that are both effective and socially responsible.
Introduction to bias detection models
Bias detection models have evolved significantly to address specific forms of discrimination in AI outputs. Tools like the Weave BiasScorer, OpenAI Moderation API, ValuRank Bias Detector, and LlamaGuard represent advancements in the field.
Below, we'll evaluate these models on bias detection tasks, measuring how reliably each one distinguishes biased from neutral content. The Toxic Commons dataset and a curated subset of the Media Bias Identification Benchmark (MBIB) provide representative examples of biased content.
To start, run the following install commands to pull the scorer branch and its dependencies:
git clone https://github.com/wandb/weave.git && cd weave
git fetch origin pull/3006/head:xtra-scorers && git checkout xtra-scorers
pip install -qq -e . && pip install openai
Weave BiasScorer
The Weave BiasScorer is a fine-tuned model designed for detecting gender and racial biases. Built on DeBERTa v3, it was trained on a filtered and mixed dataset combining Toxic Commons and a subset of the MBIB. This model improves upon previous proposals by delivering enhanced performance in bias detection tasks. It categorizes outputs into "gender_bias" and "racial_bias," offering a more nuanced analysis of biased content. The checkpoint for this model can be accessed here. Integration with Weave ensures seamless evaluation and scoring across datasets.
import weave; weave.init('bias-scorers')
from weave.scorers import BiasScorer

# Initialize the BiasScorer
scorer = BiasScorer(device="cpu")  # Use "cuda" for GPU if available

# Define a list of sample texts to evaluate
texts = [
    "This is a hateful message.",
    "This statement is free of bias.",
]

# Retrieve logits for more detailed analysis
for text in texts:
    result_with_logits = scorer.score(text)
    print(f"Text: {text}")
    print(f"Detailed Result: {result_with_logits}\n")

# Example with adjusted threshold
scorer_with_threshold = BiasScorer(device="cpu", threshold=0.8)
for text in texts:
    result = scorer_with_threshold.score(text)
    print(f"Text (with threshold=0.8): {text}")
    print(f"Result: {result}\n")
The threshold scores can be obtained by setting return_all_scores=True in the scorer.score() function. This option provides a detailed breakdown of bias scores across categories, such as gender_bias and racial_bias, instead of just a binary flag. The returned result includes the raw scores for each bias category, enabling a more nuanced analysis of the model's output and understanding which biases exceeded the thresholds.
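For example, building on the snippet above, you can request the per-category breakdown directly. A minimal sketch, assuming return_all_scores behaves as described here:

import weave; weave.init('bias-scorers')
from weave.scorers import BiasScorer

scorer = BiasScorer(device="cpu")

# return_all_scores=True returns the raw per-category scores (e.g., gender_bias,
# racial_bias) rather than only a binary flag, as described above.
detailed_result = scorer.score("This statement is free of bias.", return_all_scores=True)
print(detailed_result)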
After running this code, you will be able to see the results logged inside Weave. Normally, you would need to add the @weave.op decorator above your bias detection inference function in order to track the inputs and outputs with Weave, but since the BiasScorer is integrated with Weave, all that's needed is to simply import and init Weave. Here's what it looks like inside Weave after running the code:

Having systems in place to record and monitor production data is important for making iterative improvements and maintaining model performance. Continuous logging with Weave allows teams to identify trends and refine models to adapt to evolving requirements and reduce potential biases.
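For custom pipelines that do more than call the scorer directly, the @weave.op decorator mentioned above is all that's needed to log inputs and outputs over time. A minimal sketch with a hypothetical wrapper function (check_text_for_bias is illustrative, not part of the Weave API):

import weave
from weave.scorers import BiasScorer

weave.init("bias-scorers")
scorer = BiasScorer(device="cpu")

# Hypothetical wrapper: @weave.op records each call's inputs and outputs in Weave,
# which supports the kind of continuous monitoring described above.
@weave.op
def check_text_for_bias(text: str) -> dict:
    result = scorer.score(text)
    return {"flagged": result.get("flagged", False), "raw": result}

check_text_for_bias("A sample production input to screen for bias.")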
OpenAI Moderation API
The OpenAI Moderation API is a versatile, general-purpose tool designed to detect a wide range of content categories, including harassment, hate speech, violence, and other forms of harmful material. While not explicitly tailored for bias detection, its ability to flag related patterns such as hate speech and discriminatory content makes it a valuable supplementary tool for AI bias workflows.
Key Features:
- Broad Content Coverage: Capable of identifying diverse categories of harmful content, the API provides a robust foundation for content moderation and safety protocols.
- Scalability: The API is lightweight and easy to integrate, enabling real-time moderation in applications of any scale.
- Pre-Trained Models: Leveraging OpenAI's advanced language models, it offers a reliable out-of-the-box solution for detecting harmful or unsafe content.
Below is a sample script demonstrating how to use the OpenAI Moderation API to analyze text for potentially harmful or biased patterns, with Weave integration for tracking and visualizing the results. While the API is not explicitly tailored for bias detection, it can flag related patterns:
from openai import OpenAI
import weave; weave.init('bias-scorers')

client = OpenAI()

response = client.moderations.create(
    model="omni-moderation-latest",
    input="text to check bias for",
)
print(response)
This script uses OpenAI's Moderation API, integrated into Weave, to check for bias. While this model is not particularly suited for bias detection, it shows some correlation in categories like hate speech.
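Beyond the overall flagged boolean, the moderation response exposes per-category flags and scores, which are the closest proxy for bias-related content here. A minimal sketch of inspecting them, assuming the response object from the script above and the attribute names used by the current openai Python SDK:

# Assumes `response` from the previous snippet.
result = response.results[0]

print(f"Flagged overall: {result.flagged}")
# Hate- and harassment-related categories correlate most closely with biased content.
print(f"Hate: {result.categories.hate} (score: {result.category_scores.hate:.4f})")
print(f"Harassment: {result.categories.harassment} (score: {result.category_scores.harassment:.4f})")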
LlamaGuard
The LlamaGuard model offers broader categories such as hate, defamation, and other forms of harmful content. Although not explicitly designed for bias detection, it can be adapted to identify patterns indicative of bias within AI outputs. Its flexibility makes it a valuable tool for organizations seeking to address a wide range of ethical and safety concerns in AI applications.
LlamaGuard is available in two variants:
- 8B model: Optimized for GPU usage, this variant is ideal for systems with CUDA capabilities. It offers higher accuracy and faster processing, making it suitable for large-scale deployments or scenarios requiring real-time analysis.
- 1B model: A lightweight alternative designed for CPU-based systems, this variant is better suited for environments with limited computational resources. While less powerful than the 8B model, it provides an accessible solution for smaller-scale projects or testing purposes.
Both models are capable of identifying unsafe or biased content by categorizing outputs into predefined areas, such as violence, hate speech, and defamation. This adaptability allows teams to focus on specific concerns relevant to their use cases. Additionally, the models integrate seamlessly with Weave, enabling efficient monitoring, evaluation, and visualization of results.
For example, when integrated into a bias detection workflow, the LlamaGuard model can flag potentially unsafe outputs while providing detailed insights into the categories of violations. Teams can then use this information to refine their AI systems and ensure compliance with ethical and societal standards.
from weave.scorers.llamaguard_scorer import LlamaGuardScorer
import asyncio
import weave; weave.init('bias-scorers')

async def main():
    # Initialize the LlamaGuardScorer with the 1B model
    scorer = LlamaGuardScorer(
        model_name_or_path="meta-llama/Llama-Guard-3-1B",
        device="cpu",  # Use "cuda" if a GPU is available
    )

    # Text to score
    sample_text = "Your input text here. Check if this is safe or not."

    # Run the scorer
    result = await scorer.score(output=sample_text)

    # Display the result
    print("LlamaGuard Scoring Result:")
    print(f"Safe: {result['safe']}")
    print(f"Unsafe Score: {result['extras']['unsafe_score']}")
    if not result['safe']:
        print("Violated Categories:")
        for category, violated in result['extras']['categories'].items():
            if violated:
                print(f" - {category}")

if __name__ == "__main__":
    asyncio.run(main())
This sample runs the lightweight 1B model on CPU; if a GPU is available, you can swap in the 8B checkpoint and set device="cuda" instead (a sketch of selecting between the two at runtime follows below). LlamaGuard flags unsafe AI-generated content across categories such as violence, hate speech, sexual content, and privacy violations. While it is not a dedicated bias detector, monitoring these categories helps organizations uphold ethical standards and build systems that are safe, reliable, and aligned with societal values, and its seamless integration with Weave makes it a useful addition to a bias detection and content moderation workflow.
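A minimal sketch of that runtime selection, assuming both checkpoints are accessible and reusing the LlamaGuardScorer interface shown above:

import torch
from weave.scorers.llamaguard_scorer import LlamaGuardScorer

# Use the larger 8B checkpoint only when a CUDA device is available;
# otherwise fall back to the CPU-friendly 1B model.
device = "cuda" if torch.cuda.is_available() else "cpu"
model_name = (
    "meta-llama/Llama-Guard-3-8B" if device == "cuda"
    else "meta-llama/Llama-Guard-3-1B"
)
scorer = LlamaGuardScorer(model_name_or_path=model_name, device=device)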
ValuRank Bias Detector
ValuRank is a bias detection model built on DistilRoBERTa and fine-tuned specifically for identifying bias in text. Starting from the distilroberta-base pretrained weights, it adds a classification head that categorizes text into two classes: neutral and biased. The model's simplicity and efficiency make it a popular choice for bias detection, with approximately 16,000 downloads per month, and its lightweight architecture lets it slot into a wide range of workflows for identifying biased content.
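A minimal sketch of running it with the Hugging Face transformers pipeline; the valurank/distilroberta-bias model ID and its BIASED/NEUTRAL label names are assumptions based on the public model card, so verify them against the checkpoint you actually use:

from transformers import pipeline

# Assumed Hub ID for the ValuRank checkpoint; adjust if your checkpoint differs.
classifier = pipeline("text-classification", model="valurank/distilroberta-bias")

texts = [
    "The senator's reckless policies are wrecking the country.",
    "The senator introduced a new policy proposal on Tuesday.",
]

for text in texts:
    # Each prediction is a dict such as {"label": "BIASED", "score": 0.97}.
    prediction = classifier(text)[0]
    print(f"{prediction['label']} ({prediction['score']:.3f}): {text}")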
GenderRaceBiasModel
The GenderRaceBiasModel is based on PleIAs' Celadon framework and uses a DeBERTaV3 model fine-tuned on the Toxic Commons dataset. It is designed to detect biases and toxicity across five distinct categories, including race and gender biases, and users can filter for the specific categories that match their use case, making it a versatile tool for fine-grained bias detection. Its training on a specialized dataset makes it well suited to applications focused on mitigating bias and promoting fairness in AI systems.
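A minimal sketch of loading the underlying checkpoint with transformers; the PleIAs/celadon model ID, the trust_remote_code flag, and the layout of the outputs are assumptions based on the public model card, so check them before relying on this:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Assumed Hub ID; trust_remote_code is assumed to be required because the
# model defines a custom multi-category classification head.
model_id = "PleIAs/celadon"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, trust_remote_code=True)
model.eval()

text = "An example sentence to screen for gender- or race-related bias."
inputs = tokenizer(text, return_tensors="pt", truncation=True)

with torch.no_grad():
    outputs = model(**inputs)

# Inspect the raw scores; consult the model card for the exact mapping of
# output dimensions to categories (e.g., race/origin, gender/sex).
print(outputs.logits)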
Evaluating bias detection models with Weave
Weave provides a unified platform to manage and evaluate bias detection models. The following code walks through setting up a workflow to test and compare three of them (Weave's BiasScorer, the OpenAI Moderation API, and LlamaGuard) on the ToxicCommons dataset. We sample 60 examples, 30 from each of the positive and negative classes, to ensure fair representation of biased and neutral content.
Here is the code for the evaluation:
import weave
from openai import OpenAI
from weave.scorers import BiasScorer
import pandas as pd
from datasets import load_dataset
import asyncio
import os
import torch
import random
import ast
from weave.scorers.llamaguard_scorer import LlamaGuardScorer

# Set device
device = "cuda" if torch.cuda.is_available() else "cpu"

# Initialize scorer
llama_scorer = LlamaGuardScorer(
    model_name_or_path="meta-llama/Llama-Guard-3-8B" if device == "cuda" else "meta-llama/Llama-Guard-3-1B",
    device=device,  # Use "cuda" if you have a GPU
)

client = OpenAI(api_key="your api key")
scorer = BiasScorer()

weave.init("bias_eval")

# Function to parse scores
def parse_scores(scores):
    try:
        parsed = ast.literal_eval(scores)
        if isinstance(parsed, list):
            return parsed
    except (ValueError, SyntaxError):
        pass
    return []

# Load and process dataset
def load_ds(max_samples=200000, balanced_samples=30, seed=42, cache_path="cached_dataset.csv"):
    if os.path.exists(cache_path):
        print(f"Loading cached dataset from {cache_path}...")
        df = pd.read_csv(cache_path)
        return df.to_dict("records")

    print("Loading dataset...")
    random.seed(seed)
    dataset = load_dataset("PleIAs/ToxicCommons", split="train", streaming=True)
    limited_data = list(dataset.take(max_samples))
    random.shuffle(limited_data)

    processed_data = {"positive_samples": [], "negative_samples": []}

    for row in limited_data:
        text = row.get("original_text", "")
        scores = row.get("scores", "")
        scores_list = parse_scores(scores)

        if isinstance(text, str) and text.strip() and isinstance(scores_list, list):
            if len(scores_list) >= 2:
                if scores_list[0] > 0 or scores_list[1] > 0:  # Positive condition
                    if len(processed_data["positive_samples"]) < balanced_samples:
                        processed_data["positive_samples"].append({"text": text, "label": 1})
                else:  # Non-positive condition
                    if len(processed_data["negative_samples"]) < balanced_samples:
                        processed_data["negative_samples"].append({"text": text, "label": 0})

        if len(processed_data["positive_samples"]) >= balanced_samples and len(processed_data["negative_samples"]) >= balanced_samples:
            break

    final_data = processed_data["positive_samples"] + processed_data["negative_samples"]
    if not final_data:
        print("Error: No valid samples found in the dataset.")
        return None

    random.shuffle(final_data)  # Shuffle to mix positive and negative samples
    df = pd.DataFrame(final_data)
    df.to_csv(cache_path, index=False)
    print(f"Cached dataset saved to {cache_path}")
    return final_data

class LlamaGuard(weave.Model):
    @weave.op
    async def predict(self, text: str) -> int:
        # Run the scorer
        result = await llama_scorer.score(output=text)
        # If the text is safe, return 0 (safe)
        if result["safe"]:
            return 0
        return 1

class OpenAIModerationModel(weave.Model):
    @weave.op
    def predict(self, text: str) -> int:
        response = client.moderations.create(model="omni-moderation-latest", input=text)
        moderation_result = response.results[0]
        return 1 if moderation_result.flagged else 0

class WeaveBiasScorerModel(weave.Model):
    @weave.op
    def predict(self, text: str) -> int:
        bias_result = scorer.score(text)
        return 1 if bias_result.get("flagged", False) else 0

@weave.op
def bias_scorer(label: int, model_output: int) -> dict:
    return {"bias_accuracy": int(model_output == label)}

# Run evaluations
async def run_evaluations():
    dataset = load_ds()
    print("Dataset loaded...")

    models = {
        "LlamaGuard": LlamaGuard(),
        "OpenAI": OpenAIModerationModel(),
        "BiasScorer": WeaveBiasScorerModel(),
    }

    dataset_prepared = [{"text": row["text"], "label": row["label"]} for row in dataset]
    scorers = [bias_scorer]
    results = {}

    for model_name, model in models.items():
        print(f"\nEvaluating {model_name}...")
        evaluation = weave.Evaluation(dataset=dataset_prepared, scorers=scorers, name=model_name + " Eval")
        results[model_name] = await evaluation.evaluate(model)

if __name__ == "__main__":
    asyncio.run(run_evaluations())
The pipeline begins with environment setup, including the initialization of Weave, GPU/CPU configuration, and integration of models like LlamaGuard and OpenAI's Moderation API. The dataset is handled efficiently by loading it in streaming mode using Hugging Face's load_dataset. The ToxicCommons dataset is processed to balance classes and ensure representative samples for both biased and neutral content. The dataset is filtered to include samples where either the race and origin-based bias (first element) or gender and identity-based bias (second element) scores in the array are positive. These categories correspond to biases such as racism, xenophobia, and discrimination based on gender or personal identity. The filtered samples are balanced and cached locally for efficient reuse.
The script includes models like LlamaGuard for unsafe content detection, OpenAI's Moderation API for generalized moderation tasks, and Weave BiasScorer for evaluating bias-specific content. Each model is defined as a Weave Model, enabling consistent evaluation and comparison. After running the evaluation, you will be able to easily view the results inside Weave, which show the scores for the accuracy of each model's ability to detect bias.
Here is what it looks like inside Weave after running the evaluation:

The WeaveBiasScorerModel achieves the highest bias accuracy at 0.817, followed by the OpenAIModerationModel at 0.583. The LlamaGuard 1B model performs worst on this evaluation, with an accuracy of 0.467.
Since the dataset was balanced between biased and neutral samples, random guessing would yield an accuracy of roughly 50 percent, which puts the LlamaGuard 1B model below chance and the OpenAIModerationModel only marginally above it.
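Accuracy alone can also hide how a model errs on the two classes. If you want precision and recall as well, one option is to add a second scorer that records confusion-matrix components per example; Weave averages each numeric field over the dataset, so the aggregated rates can be combined into precision and recall afterward. A minimal sketch following the same scorer pattern used above (the function name is illustrative):

import weave

@weave.op
def bias_confusion_scorer(label: int, model_output: int) -> dict:
    # Each field is 0 or 1 per example; Weave reports the mean over the dataset,
    # i.e., the rates of true positives, false positives, and false negatives.
    return {
        "true_positive": int(model_output == 1 and label == 1),
        "false_positive": int(model_output == 1 and label == 0),
        "false_negative": int(model_output == 0 and label == 1),
    }

# Pass it alongside the accuracy scorer when building the evaluation:
#   scorers = [bias_scorer, bias_confusion_scorer]
# Then precision = TP / (TP + FP) and recall = TP / (TP + FN), computed
# from the aggregated rates shown in the Weave UI.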
Conclusion
Addressing bias in AI is not just a technical challenge but an ethical imperative. Bias can erode trust, perpetuate inequality, and limit the potential of AI systems to benefit society. This article explored a range of tools—such as the Weave BiasScorer, OpenAI Moderation API, LlamaGuard, ValuRank, and GenderRaceBiasModel—that provide frameworks for detecting, mitigating, and monitoring bias in AI workflows.
Among these, the Weave BiasScorer stands out for its precision in detecting nuanced gender and racial biases, while tools like the OpenAI Moderation API and LlamaGuard offer broader, complementary capabilities for ensuring content safety. ValuRank and the GenderRaceBiasModel bring targeted solutions for identifying and addressing bias in diverse contexts, from dataset audits to real-time content moderation.
However, implementing bias detection tools is just one part of the solution. Organizations must adopt a holistic approach that combines:
- Continuous Monitoring: Using tools like Weave to track and evaluate AI outputs over time.
- Iterative Improvement: Refining models and datasets based on insights from bias detection frameworks.
- Ethical Standards: Establishing clear guidelines for fairness, inclusivity, and accountability in AI development.
By leveraging these tools and strategies, teams can create AI systems that are not only effective but also equitable and socially responsible. As AI continues to influence critical aspects of society, ensuring fairness in its design and deployment will remain a cornerstone of its ethical development.
The journey toward unbiased AI is ongoing, but with robust tools and a commitment to fairness, organizations can lead the way in building technologies that truly serve everyone.