AI guardrails: Toxicity scorers

This article explores the challenges of detecting and managing toxicity in AI systems, providing actionable strategies and tools to foster safer and more inclusive digital interactions. This is a translated version of the article. Feel free to report any possible mistranslations in the comments section.
Created on August 26|Last edited on August 26
Toxicity in artificial intelligence presents a pressing concern as AI systems increasingly influence online interactions and content moderation. Left unaddressed, toxicity can foster harmful environments, damage user trust, and undermine the integrity of AI applications.
In this article, we dig into the challenges of detecting and managing toxicity in AI and provide actionable strategies to implement safeguards. Through real-world examples and practical tools, we'll explore how to identify, mitigate, and monitor toxicity to promote safer and more inclusive AI-driven deployments.
If you're eager to get started with toxicity detection and explore this and other AI guardrails, check out the accompanying Colab:

Otherwise, continue reading for a deeper understanding of toxicity in AI and the code you’ll need to integrate these strategies into your workflows.


What is toxicity in AI?

Toxicity in AI refers to the presence of harmful, offensive, or malicious content generated by an AI model, often reflecting or amplifying toxic patterns found in the data used to train the model. These toxic behaviors can manifest in various ways, such as hate speech, harassment, or abusive language, and can have significant societal consequences.
For instance, in content moderation, an AI system might fail to flag overtly toxic language or incorrectly flag benign content, resulting in inadequate enforcement of community guidelines. In image generation, prompts like "depicting a heated debate" might inadvertently produce offensive or inflammatory imagery. Similarly, in health forums or support systems, conversational AI tools could respond insensitively to user concerns, exacerbating emotional distress. Without effective detection and management of toxicity, AI systems risk fostering harmful environments, eroding trust, and failing to meet user expectations for safety and inclusivity.
Open-ended conversational AI systems must also be monitored for toxicity to prevent the spread of harmful or offensive content. Trained on vast datasets that may contain toxic language, these systems can inadvertently generate or propagate such content. Equipping these systems with robust safeguards ensures interactions remain respectful, ethical, and aligned with societal standards, fostering a safer and more inclusive digital landscape.

Why you need toxicity guardrails

Toxicity guardrails are critical for ensuring AI systems contribute to safe and respectful digital interactions. These measures detect and manage harmful behaviors, such as hate speech or abusive language, in AI outputs or training data. By doing so, they enable organizations to take proactive steps to align their AI systems with ethical values and societal expectations.
Without these safeguards, AI risks amplifying harmful content, leading to hostile online environments, alienating marginalized groups, and eroding the trust and credibility of AI-driven platforms. Such failures not only harm users but also undermine the broader potential of AI to create positive and inclusive experiences.
Effective toxicity guardrails go beyond detection; they empower organizations to create AI systems that are both powerful and ethical. By addressing toxicity head-on, these systems can contribute to healthier digital spaces, reinforcing public trust and ensuring that technology serves everyone equitably.

How toxicity is scored

Scoring toxicity involves evaluating how harmful, offensive, or malicious a piece of content is across five predefined categories: Race/Origin, Gender/Sex, Religion, Ability, and Violence. Each category is scored individually, and the category scores together make up the overall evaluation.
A response is flagged as toxic when either of two thresholds is met: the cumulative score across all categories is 5 or more, which identifies content as broadly toxic, or the score in any single category is 2 or higher, which indicates toxicity in that specific area. Content that meets neither threshold remains unflagged. This framework supports a comprehensive evaluation while still allowing targeted detection of harmful content.
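The flagging rule can be sketched in a few lines of Python. This is a minimal illustration rather than the scorer's actual implementation: the function name `is_flagged` and the constant names are hypothetical, and the thresholds are the ones stated in this article.

```python
# Thresholds as stated in the article (the names are illustrative).
TOTAL_THRESHOLD = 5      # cumulative score across all categories
CATEGORY_THRESHOLD = 2   # score within any single category


def is_flagged(categories: dict) -> bool:
    """Return True if the per-category scores meet either threshold."""
    scores = list(categories.values())
    if not scores:
        return False
    return sum(scores) >= TOTAL_THRESHOLD or max(scores) >= CATEGORY_THRESHOLD


# One point in every category: no single category reaches 2, but the sum reaches 5.
example = {"Race/Origin": 1, "Gender/Sex": 1, "Religion": 1, "Ability": 1, "Violence": 1}
print(is_flagged(example))
```

Keeping the two thresholds as named constants makes it easy to experiment with stricter or looser settings for a given application.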

Introduction to toxicity detection models

Toxicity detection models have advanced considerably to address harmful and offensive content in AI outputs. Tools such as the Weave ToxicityScorer, OpenAI Moderation API, and LlamaGuard represent notable progress in this field.
Below, we evaluate these models using frameworks that emphasize safety, precision, and recall in toxicity detection tasks. The Kaggle Toxicity Dataset was used to provide representative examples of toxic content.
To get started, you can install the necessary dependencies:
git clone https://github.com/wandb/weave.git && cd weave
git fetch origin pull/3006/head:xtra-scorers && git checkout xtra-scorers
pip install -qq -e . && pip install openai

Weave Toxicity Scorer

The Weave ToxicityScorer is a fine-tuned model designed to detect toxicity across categories such as Race/Origin, Gender/Sex, Religion, Ability, and Violence. Built on DeBERTa v3, it was trained on 2 million rows of the Toxic Commons dataset, enabling robust identification of harmful patterns in text. This model enhances prior approaches by delivering improved performance in detecting various forms of toxic content. Its integration with Weave provides seamless evaluation and scoring across diverse datasets, ensuring comprehensive and reliable analysis. The model checkpoint is available for access here.
import weave
from weave.scorers.moderation_scorer import ToxicityScorer

weave.init("toxicity-scorer")
toxicity_scorer = ToxicityScorer()

result = toxicity_scorer.score("This is a hateful message.")
print(result)
# {'flagged': False, 'categories': {'Race/Origin': 1, 'Gender/Sex': 0, 'Religion': 0, 'Ability': 0, 'Violence': 1}}
# sum < 5, categories < 2


result = toxicity_scorer.score("This is another hateful message.")
print(result)
# {'flagged': True, 'categories': {'Race/Origin': 2, 'Gender/Sex': 0, 'Religion': 0, 'Ability': 0, 'Violence': 1}}
# one category is >= 2

result = toxicity_scorer.score("This is a broad hateful message.")
print(result)
# {'flagged': True, 'categories': {'Race/Origin': 1, 'Gender/Sex': 1, 'Religion': 1, 'Ability': 1, 'Violence': 1}}
# sum >= 5

The scores for each category, including Race/Origin, Gender/Sex, Religion, Ability, and Violence, are included in the output of the toxicity_scorer.score() function. This breakdown provides raw scores for each category, enabling a nuanced analysis of the model's outputs and identifying which categories exceeded the thresholds.
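To see which rule a given result triggered, the per-category scores can be inspected directly. The helper below is a sketch that assumes the result is a dict with a `categories` mapping, as in the printed outputs above; `explain_flags` and its parameter names are hypothetical and not part of Weave.

```python
def explain_flags(result: dict, category_threshold: int = 2, total_threshold: int = 5) -> dict:
    """Summarize which threshold(s) a scorer result meets."""
    categories = result.get("categories", {})
    # Categories whose individual score meets the single-category threshold.
    over = [name for name, score in categories.items() if score >= category_threshold]
    total = sum(categories.values())
    return {
        "categories_over_threshold": over,
        "total_score": total,
        "total_over_threshold": total >= total_threshold,
    }


# Result copied from the second example above.
result = {
    "flagged": True,
    "categories": {"Race/Origin": 2, "Gender/Sex": 0, "Religion": 0, "Ability": 0, "Violence": 1},
}
print(explain_flags(result))
```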
After running the code, the results will be logged directly within Weave. Normally, tracking inputs and outputs in Weave would require adding the @weave.op decorator to your inference function. However, since the ToxicityScorer is fully integrated with Weave, it automatically handles logging without requiring additional setup. The results can then be explored and visualized seamlessly in the Weave platform.


Having systems in place to record and monitor production data is important for making iterative improvements and maintaining model performance. Continuous logging with Weave allows teams to identify trends and refine models to adapt to evolving requirements and reduce potential biases.
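As a rough sketch of how a scorer could sit in front of production traffic, the wrapper below replaces flagged text with a fallback message and keeps a running count for monitoring. It assumes only that the scoring function returns a dict with a `flagged` key, as in the outputs shown earlier. The names `apply_guardrail`, `GuardrailStats`, and `stub_scorer` are hypothetical, and the stub stands in for a real scorer so the example runs without downloading a model.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class GuardrailStats:
    """Running counts that could be logged or charted over time."""
    total: int = 0
    blocked: int = 0

    @property
    def block_rate(self) -> float:
        return self.blocked / self.total if self.total else 0.0


def apply_guardrail(
    text: str,
    score_fn: Callable[[str], dict],
    stats: GuardrailStats,
    fallback: str = "This response was withheld by a content filter.",
) -> str:
    """Return the text unchanged, or the fallback if the scorer flags it."""
    stats.total += 1
    if score_fn(text).get("flagged", False):
        stats.blocked += 1
        return fallback
    return text


def stub_scorer(text: str) -> dict:
    # Stand-in for a real scorer: flags text containing a placeholder token.
    return {"flagged": "<toxic>" in text}


stats = GuardrailStats()
print(apply_guardrail("Hello there", stub_scorer, stats))
print(apply_guardrail("some <toxic> reply", stub_scorer, stats))
print(stats.block_rate)
```

In a real deployment, `stub_scorer` would be replaced by a call such as `toxicity_scorer.score`, and the block rate would be one of the trends tracked over time.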

OpenAI Moderation API

The OpenAI Moderation API is a versatile, general-purpose tool designed to detect a wide range of content categories, including harassment, hate speech, violence, and other forms of harmful material.
Key features include:
  • Broad Content Coverage: Capable of identifying diverse categories of harmful content, the API provides a robust foundation for content moderation and safety protocols.
  • Scalability: The API is lightweight and easy to integrate, enabling real-time moderation in applications of any scale.
Below is a sample script demonstrating how to use the OpenAI Moderation API to analyze text for potentially harmful or toxic patterns. The integration with Weave enables efficient tracking and visualization of the results, and although the API is not a dedicated toxicity detector, it can flag related patterns:
from openai import OpenAI
import weave; weave.init('toxicity-scorers')
client = OpenAI()


response = client.moderations.create(
    model="omni-moderation-latest",
    input="text to check toxicity for",
)

print(response)
This script sends the text to OpenAI's Moderation API, with Weave initialized for tracking. Although the API is a general-purpose moderation tool rather than a dedicated toxicity classifier, its categories, such as hate speech, correlate with toxicity.
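To act on the response rather than just print it, one option is to pull out the categories that were flagged. The helper below is a sketch that works on a plain dict. It assumes the dict has `flagged`, `categories`, and `category_scores` keys, which is the shape you would get from something like `response.results[0].model_dump()` in the OpenAI Python SDK. The function name `flagged_categories` and the sample values are hypothetical.

```python
def flagged_categories(result: dict) -> list:
    """Return (category, score) pairs for flagged categories, highest score first."""
    categories = result.get("categories", {})
    scores = result.get("category_scores", {})
    # Keep only categories marked True; missing scores default to 0.0.
    hits = [(name, scores.get(name, 0.0)) for name, is_hit in categories.items() if is_hit]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Hand-written sample in the assumed shape (values are made up for illustration).
sample = {
    "flagged": True,
    "categories": {"hate": True, "harassment": True, "violence": False},
    "category_scores": {"hate": 0.91, "harassment": 0.64, "violence": 0.02},
}
print(flagged_categories(sample))
```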

LlamaGuard

The LlamaGuard model covers a broader set of categories, such as hate and defamation, alongside other forms of harmful content. It can be adapted to identify patterns indicative of toxicity within AI outputs. Its flexibility makes it a valuable tool for organizations seeking to address a wide range of ethical and safety concerns in AI applications.
LlamaGuard is available in two variants:
  • 8B model: Optimized for GPU usage, this variant is intended for systems with CUDA capabilities. It offers higher accuracy and, with GPU acceleration, is suitable for large-scale deployments or scenarios requiring real-time analysis.
  • 1B model: A lightweight alternative designed for CPU-based systems, this variant is better suited for environments with limited computational resources. While less powerful than the 8B model, it provides an accessible solution for smaller-scale projects or testing purposes.
Both models are capable of identifying unsafe or toxic content by categorizing outputs into predefined areas, such as violence, hate speech, and defamation. This adaptability allows teams to focus on specific concerns relevant to their use cases. Additionally, the models integrate seamlessly with Weave, enabling efficient monitoring, evaluation, and visualization of results.
For example, when integrated into a toxicity detection workflow, the LlamaGuard model can flag potentially unsafe outputs while providing detailed insights into the categories of violations. Teams can then use this information to refine their AI systems and ensure compliance with ethical and societal standards.
from weave.scorers.llamaguard_scorer import LlamaGuardScorer
import asyncio

import weave; weave.init('toxicity-scorers')


async def main():
    # Initialize the LlamaGuardScorer with the 1B model
    scorer = LlamaGuardScorer(
        model_name_or_path="meta-llama/Llama-Guard-3-1B",
        device="cpu",  # Use "cuda" if a GPU is available
    )

    # Text to score
    sample_text = "Your input text here. Check if this is safe or not."

    # Run the scorer
    result = await scorer.score(output=sample_text)

    # Display the result
    print("LlamaGuard Scoring Result:")
    print(f"Safe: {result['safe']}")
    print(f"Unsafe Score: {result['extras']['unsafe_score']}")
    if not result['safe']:
        print("Violated Categories:")
        for category, violated in result['extras']['categories'].items():
            if violated:
                print(f" - {category}")


if __name__ == "__main__":
    asyncio.run(main())
This script loads the 1B model on the CPU; to use the 8B model instead, change model_name_or_path to "meta-llama/Llama-Guard-3-8B" and device to "cuda" on a machine with a GPU. Integrated with Weave, LlamaGuard detects unsafe content in AI-generated text, focusing on categories like violence, hate speech, sexual content, and privacy violations.
By monitoring these areas, LlamaGuard helps organizations uphold ethical standards and build AI systems that are safe, reliable, and aligned with societal values. Its flexibility and integration with tools like Weave make it a useful addition to an AI toxicity detection and content moderation workflow. As AI systems continue to evolve, tools like LlamaGuard will play an increasingly important role in supporting fair and responsible AI practices.
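One way to choose between the two variants at runtime is to key the choice off GPU availability. The sketch below keeps that decision in a small pure function so it can be tested without a GPU; `pick_llamaguard` is a hypothetical helper, and the checkpoint names are the ones used in this article.

```python
def pick_llamaguard(cuda_available: bool) -> dict:
    """Choose a LlamaGuard checkpoint and device from GPU availability."""
    if cuda_available:
        # Larger model when a CUDA GPU is present.
        return {"model_name_or_path": "meta-llama/Llama-Guard-3-8B", "device": "cuda"}
    # Lightweight model for CPU-only machines.
    return {"model_name_or_path": "meta-llama/Llama-Guard-3-1B", "device": "cpu"}


print(pick_llamaguard(cuda_available=False))
```

In practice the flag would typically come from `torch.cuda.is_available()`, and the returned dict could be unpacked into the scorer, for example `LlamaGuardScorer(**pick_llamaguard(torch.cuda.is_available()))`.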

Evaluating toxicity detection models with Weave

Weave offers a unified platform to manage and evaluate toxicity detection models. Using the provided code, you can set up a workflow to test and compare multiple toxicity detection models, including Weave's ToxicityScorer, the OpenAI Moderation API, and LlamaGuard, leveraging the ToxicCommons and Kaggle datasets. The script shown here loads the annotated Kaggle dataset.
The workflow is designed to ensure fair representation of toxic and neutral examples by balancing samples from different categories. Models are evaluated using metrics such as precision, recall, and F1 score. The provided implementation integrates preprocessing, balanced sampling, and model predictions seamlessly with Weave, enabling effective testing and analysis. This setup also supports GPU acceleration and robust dataset handling to ensure reliable results.
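For reference, the three metrics can be computed from binary labels and predictions as in the standalone sketch below. It is an illustration of the standard definitions, not the evaluation code itself; the function name and the toy labels are hypothetical.

```python
def precision_recall_f1(labels: list, predictions: list) -> dict:
    """Compute precision, recall, and F1 for binary labels (1 = toxic, 0 = non-toxic)."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)  # toxic, flagged
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)  # non-toxic, flagged
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)  # toxic, missed
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy data for illustration only.
labels = [1, 1, 1, 0, 0]
predictions = [1, 1, 0, 1, 0]
print(precision_recall_f1(labels, predictions))
```

Precision penalizes benign content that gets flagged, while recall penalizes toxic content that slips through, which is why both are reported alongside F1.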
Here is the code for the evaluation:
import asyncio
import random

import torch
import weave
from datasets import load_dataset
from openai import OpenAI
from weave.scorers.llamaguard_scorer import LlamaGuardScorer
from weave.scorers.moderation_scorer import ToxicityScorer

# Set device
device = "cuda" if torch.cuda.is_available() else "cpu"

# Initialize scorers: the 8B LlamaGuard model on GPU, the 1B model on CPU
llama_scorer = LlamaGuardScorer(
    model_name_or_path="meta-llama/Llama-Guard-3-8B" if device == "cuda" else "meta-llama/Llama-Guard-3-1B",
    device=device,
)

client = OpenAI(api_key="your api key")
scorer = ToxicityScorer()

weave.init("toxicity_eval")


class PrecisionRecallF1Scorer(weave.Scorer):
    """
    Custom scorer to calculate precision, recall, and F1 at the dataset level.
    """

    @weave.op
    def score(self, label: int, model_output: int) -> dict:
        """
        Compute True Positives, False Positives, and False Negatives for a single row.
        """
        tp = int(label == 1 and model_output == 1)  # True Positive
        fp = int(label == 0 and model_output == 1)  # False Positive
        fn = int(label == 1 and model_output == 0)  # False Negative

        return {"tp": tp, "fp": fp, "fn": fn}

    def summarize(self, score_rows: list) -> dict:
        """
        Summarize precision, recall, and F1 from the row-level scores.
        """
        # Aggregate true positives, false positives, and false negatives
        total_tp = sum(row["tp"] for row in score_rows)
        total_fp = sum(row["fp"] for row in score_rows)
        total_fn = sum(row["fn"] for row in score_rows)

        # Calculate precision, recall, and F1
        precision = total_tp / (total_tp + total_fp) if (total_tp + total_fp) > 0 else 0
        recall = total_tp / (total_tp + total_fn) if (total_tp + total_fn) > 0 else 0
        f1 = (
            2 * (precision * recall) / (precision + recall)
            if (precision + recall) > 0
            else 0
        )

        return {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "tp": total_tp,
            "fp": total_fp,
            "fn": total_fn,
        }


@weave.op
def toxicity_accuracy(label: int, model_output: int) -> dict:
    return {"accuracy": int(model_output == label)}


def load_ds(dataset_name="tcapelle/kaggle-toxic-annotated", split="train", balanced_samples=50, seed=42):
    # Load the dataset from Hugging Face Hub
    print("Loading dataset...")
    dataset = load_dataset(dataset_name, split=split)

    # Convert to Pandas DataFrame for easier manipulation
    df = dataset.to_pandas()

    # Initialize random seed
    random.seed(seed)

    # Define subclasses
    subclasses = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

    # Remove rows where any subclass column contains -1
    df = df[~df[subclasses].isin([-1]).any(axis=1)]

    # Separate into non-toxic, toxic, and severe toxic categories
    non_toxic_samples = []
    toxic_samples = []
    severe_toxic_samples = []

    for _, row in df.iterrows():
        comment_text = row.get("comment_text", "").strip()
        if not comment_text:
            continue

        # Check if it's non-toxic: all subclass columns must be 0
        if all(row.get(subclass, 0) == 0 for subclass in subclasses):
            non_toxic_samples.append({"text": comment_text, "label": 0})  # Non-toxic
        elif row["severe_toxic"] == 1:
            severe_toxic_samples.append({"text": comment_text, "label": 1})  # Severe toxic
        elif row["toxic"] == 1:
            toxic_samples.append({"text": comment_text, "label": 1})  # Toxic

    # Balance the main categories
    balanced_severe_toxic = random.sample(severe_toxic_samples, min(len(severe_toxic_samples), balanced_samples))
    balanced_toxic = random.sample(toxic_samples, min(len(toxic_samples), balanced_samples))
    balanced_non_toxic = random.sample(non_toxic_samples, min(len(non_toxic_samples), balanced_samples))

    # Combine all samples
    final_data = balanced_severe_toxic + balanced_toxic + balanced_non_toxic
    random.shuffle(final_data)

    return final_data


class LlamaGuard(weave.Model):
    @weave.op
    async def predict(self, text: str) -> int:
        # Run the module-level LlamaGuard scorer
        result = await llama_scorer.score(output=text)
        # Return 0 if the text is safe, 1 otherwise
        if result["safe"]:
            return 0
        return 1


class OpenAIModerationModel(weave.Model):
    @weave.op
    def predict(self, text: str) -> int:
        response = client.moderations.create(model="omni-moderation-latest", input=text)
        moderation_result = response.results[0]
        return 1 if moderation_result.flagged else 0


class WeaveToxicityScorerModel(weave.Model):
    @weave.op
    def predict(self, text: str) -> int:
        toxicity_result = scorer.score(text)
        return 1 if toxicity_result.get("flagged", False) else 0


# Run evaluations
async def run_evaluations():
    dataset = load_ds()

    print("Dataset loaded...")

    models = {
        "LlamaGuard": LlamaGuard(),
        "OpenAI": OpenAIModerationModel(),
        "toxicityScorer": WeaveToxicityScorerModel(),
    }

    dataset_prepared = [{"text": row["text"], "label": row["label"]} for row in dataset]
    scorers = [toxicity_accuracy, PrecisionRecallF1Scorer()]
    results = {}

    for model_name, model in models.items():
        print(f"\nEvaluating {model_name}...")
        evaluation = weave.Evaluation(
            dataset=dataset_prepared,
            scorers=scorers,
            name=model_name + " Eval"
        )
        results[model_name] = await evaluation.evaluate(model)

    return results


if __name__ == "__main__":
    asyncio.run(run_evaluations())



The evaluation highlights the strengths and trade-offs of the three models. The Weave ToxicityScorer has the lowest latency (0.316s), making it well suited to time-sensitive applications. Because it runs locally, it does not depend on external APIs, which can help reduce costs when processing large datasets. However, its recall (0.660) and F1 score (0.767) are lower than the OpenAI model's, so it misses more of the toxic examples. These results are still impressive, given that this is a comparison against a closed-source model from OpenAI.
The OpenAI Moderation model delivers the highest precision (0.957), recall (0.900), and F1 score (0.928) of the three. However, its dependence on an external API can become costly for large-scale data processing, and its latency (0.367s) is slightly higher than the Weave scorer's.
The LlamaGuard model, while functional, underperforms with the lowest recall (0.520), F1 score (0.630), and accuracy (0.593), alongside the highest latency (1.853s).
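The trade-off between accuracy and latency can be made concrete with a small selection rule: pick the highest-F1 model whose latency fits a budget. The sketch below uses the F1 and latency figures reported in this article; the function name `best_under_budget` and the budget values are hypothetical, and results from a single run should be treated as indicative only.

```python
# F1 and latency as reported in this article's evaluation run.
REPORTED = {
    "Weave ToxicityScorer": {"f1": 0.767, "latency_s": 0.316},
    "OpenAI Moderation": {"f1": 0.928, "latency_s": 0.367},
    "LlamaGuard": {"f1": 0.630, "latency_s": 1.853},
}


def best_under_budget(metrics: dict, max_latency_s: float):
    """Return the highest-F1 model whose latency fits the budget, or None."""
    eligible = {name: m for name, m in metrics.items() if m["latency_s"] <= max_latency_s}
    if not eligible:
        return None
    return max(eligible, key=lambda name: eligible[name]["f1"])


print(best_under_budget(REPORTED, max_latency_s=0.35))
print(best_under_budget(REPORTED, max_latency_s=0.50))
```

A real decision would also weigh cost per request and whether data can leave your infrastructure, which this simple rule does not capture.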
In addition to the evaluation metrics, Weave's comparisons view allows for detailed analysis of the individual predictions each model makes on specific examples from the dataset. This feature provides a side-by-side breakdown of the outputs for each model, paired with the corresponding input text and label. Through this view, users can explore qualitative differences in how each model handles the task, such as which toxic comments one model flags and another misses.

By examining these comparisons, we can uncover patterns in model behavior, identifying strengths and weaknesses that may not be immediately apparent from aggregate metrics. This granular level of insight is invaluable for debugging, understanding why certain models excel in specific cases, and pinpointing areas where improvements can be made. This functionality empowers users to refine their models with a data-driven approach, making Weave a powerful tool for model evaluation and optimization.

Conclusion

Addressing toxicity in AI systems is essential for fostering safe and inclusive digital environments. The evaluation of tools like the Weave ToxicityScorer, OpenAI Moderation, and LlamaGuard illustrates the importance of balancing performance, efficiency, and scalability in toxicity detection. While the Weave ToxicityScorer offers a practical and efficient solution with local deployment capabilities, the OpenAI Moderation model sets a high standard in accuracy, albeit with reliance on external APIs.
Effective toxicity detection is not solely about choosing the best model but integrating these tools into workflows that prioritize iterative improvement. Organizations must not only implement robust detection frameworks but also establish safeguards that promote fairness and trust in AI-driven platforms. As AI continues to shape critical aspects of society, ensuring respectful and responsible outputs remains a cornerstone of ethical AI development.
