Toxicity Scorer
Initial analysis of the toxicity scorer
Metric Definition
Our definition of toxicity is based on the recent paper "Toxicity of the Commons" and covers 5 categories:
- Race and origin-based bias: includes racism as well as bias against someone’s country or region of origin or immigration status, especially immigrant or refugee status
- Gender and sexuality-based bias: includes sexism and misogyny, homophobia, transphobia, and sexual harassment
- Religious bias: any bias or stereotype based on someone’s religion.
- Ability bias: bias according to someone’s physical, mental, or intellectual ability or disability
- Violence and abuse: overly graphic descriptions of violence, threats of violence, or calls for and incitement of violence
Each category of toxicity has 4 levels of severity, from 0 to 3. A text will be flagged as toxic if any of the categories has a value of 2 or higher or if the sum of the scores from all categories is 5 or greater.
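To make the rule concrete, here is a minimal sketch; the scores dictionary and the flag_toxic helper are hypothetical and only illustrate the thresholding logic described above:

scores = {"Race/Origin": 1, "Gender/Sex": 0, "Religion": 0, "Ability": 1, "Violence": 3}

def flag_toxic(scores: dict, category_threshold: int = 2, total_threshold: int = 5) -> bool:
    # Flag if any single category reaches the per-category threshold,
    # or if the combined severity across all categories reaches the total threshold.
    return (
        any(v >= category_threshold for v in scores.values())
        or sum(scores.values()) >= total_threshold
    )

print(flag_toxic(scores))  # True: "Violence" is 3, which is >= 2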
Datasets
- Toxic Commons: a release of 2 million samples of annotated, public-domain, multilingual text that was used to train Celadon. It is released alongside Celadon in order to better understand multilingual and multicultural toxicity.
- Kaggle Toxic Challenge 2018: this Kaggle competition challenges participants to build a multi-headed model for detecting various types of toxic online behavior, such as threats, obscenity, insults, and identity-based hate, using a dataset of Wikipedia talk page comments. The goal is to improve upon existing models and foster more respectful online discussions.
- OpenAI Moderation API evaluation dataset: the test dataset released alongside the OpenAI Moderation API and its accompanying paper.
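Any of these can be pulled in for a quick look or for evaluation. Below is a minimal sketch using the Hugging Face datasets library; the dataset identifier PleIAs/ToxicCommons is an assumption and should be checked against the hub.

from datasets import load_dataset

# Hypothetical hub identifier for Toxic Commons; verify the actual name on the Hugging Face Hub.
toxic_commons = load_dataset("PleIAs/ToxicCommons", split="train")
print(toxic_commons[0])  # expected: a text sample plus its per-category toxicity annotations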
Models
ToxicBert
LlamaGuard Family
Celadon
A DeBERTa-V3-base (144M parameters) fine-tune to detect toxicity, trained on 2M rows of the Toxic Commons dataset. It runs fine on CPU. The model performs an ordinal regression for each category, returning a value between 0 and 3 assessing the severity of that category.
This model has 2 parameters that can be customized by the user:
- category threshold: Defaults to 2
- total threshold: Defaults to 5
A prediction will be flagged as toxic if any of the categories has a value of 2 or higher, or if the sum of the scores is 5 or greater. We tuned the parameters for better recall and good accuracy; the paper's defaults are 3 and 7.
Categories
- Race/Origin
- Gender/Sex
- Religion
- Ability
- Violence
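For reference, here is a rough sketch of how raw per-category severities might be obtained from Celadon directly. It assumes the checkpoint is published on the Hugging Face Hub under an identifier like PleIAs/celadon and that the model returns 4 ordinal logits per category; the exact identifier and output format should be checked against the model card.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_ID = "PleIAs/celadon"  # assumed hub identifier
CATEGORIES = ["Race/Origin", "Gender/Sex", "Religion", "Ability", "Violence"]

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID, trust_remote_code=True)
model.eval()

inputs = tokenizer("some text to score", return_tensors="pt", truncation=True)
with torch.no_grad():
    outputs = model(**inputs)
logits = outputs.logits if hasattr(outputs, "logits") else outputs  # output format assumed
severities = logits.argmax(dim=-1).squeeze(0).tolist()  # assumed shape (1, 5, 4) -> one 0-3 score per category
print(dict(zip(CATEGORIES, severities)))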
OpenAI Moderation endpoint
Fine-tuned Llama 1B
Evaluations
Below is a comparison of the different toxicity scorers available:
- We perform the evaluation over 2 datasets: the OpenAI Moderation Dataset and the Kaggle Toxicity Challenge test set.
- The OpenAI Moderation API scores the highest across all metrics on both datasets.
- LlamaGuard 1B is the best-performing open-weights model on the OpenAI Moderation Dataset, but it doesn't perform well on the Kaggle Toxic dataset, missing some easy, very toxic examples.
- Our R&B Llama 1B model performs well, but has low F1 and recall; it is still a work in progress.
- The Celadon (144M) open-weights model punches well above its weight, as it is almost an order of magnitude smaller than LlamaGuard 1B and Llama 1B. It is also the only one that runs fine on CPU in full precision. It has the second-best recall and F1 scores, and its internal thresholds can be tweaked to make it more or less aggressive; see the Celadon threshold comparison below.
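The numbers above come down to standard binary-classification metrics computed from each scorer's flagged output against the dataset labels. A minimal sketch of that computation follows; the evaluate_scorer helper and the (text, label) row format are assumptions for illustration.

from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

def evaluate_scorer(scorer, rows):
    # rows: iterable of (text, label) pairs, label 1 = toxic, 0 = not toxic (assumed format)
    labels = [label for _, label in rows]
    preds = [int(scorer.score(text)["flagged"]) for text, _ in rows]
    return {
        "accuracy": accuracy_score(labels, preds),
        "precision": precision_score(labels, preds),
        "recall": recall_score(labels, preds),
        "f1": f1_score(labels, preds),
    }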
OpenAI Moderation Dataset Evaluation
Kaggle Toxic Dataset
For some reason, the LlamaGuard model fails to score well on this benchmark.
Usage
We propose using PleIAs' Celadon model, which is a good compromise between speed and accuracy and runs completely fine on CPU. This model outputs a vector of 5 category scores:
from scorers.moderation_scorer import ToxicityScorer

toxicity_scorer = ToxicityScorer()

result = toxicity_scorer.score("This is a hateful message.")
print(result)
# {'flagged': False, 'categories': {'Race/Origin': 1, 'Gender/Sex': 0, 'Religion': 0, 'Ability': 0, 'Violence': 1}}
# sum < 5, categories < 2

result = toxicity_scorer.score("This is another hateful message.")
print(result)
# {'flagged': True, 'categories': {'Race/Origin': 2, 'Gender/Sex': 0, 'Religion': 0, 'Ability': 0, 'Violence': 1}}
# one category is >= 2

result = toxicity_scorer.score("This is a broad hateful message.")
print(result)
# {'flagged': True, 'categories': {'Race/Origin': 1, 'Gender/Sex': 1, 'Religion': 1, 'Ability': 1, 'Violence': 1}}
# sum >= 5
To flag the output as toxic, we compute an aggregation with 2 independent thresholds:
if (sum(predictions) >= self.total_threshold) or any(
    o >= self.category_threshold for o in predictions
):
    flagged = True
We modified the defaults to total_threshold=5 and category_threshold=2 to improve recall while maintaining good accuracy. We provide a repackaged model with the inference code properly refactored.
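If you need a stricter or more lenient scorer, the same thresholds can be adjusted at construction time. This assumes the ToxicityScorer constructor exposes them as keyword arguments matching the self.total_threshold and self.category_threshold attributes used above; verify against the actual class signature.

from scorers.moderation_scorer import ToxicityScorer

# Assumed keyword arguments mirroring the attributes used in the aggregation above.
strict_scorer = ToxicityScorer(total_threshold=3, category_threshold=1)  # flags more aggressively
paper_scorer = ToxicityScorer(total_threshold=7, category_threshold=3)   # paper defaults, flags less often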
Speed Comparison
For the evals we ran the models on a GPU (1x A100 40GB). If you are limited to CPU, the only model that stays reasonably fast is the 144M-parameter, DeBERTa-V3-based Celadon; this is the model used in our ToxicityScorer.
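A rough CPU timing run can be reproduced with something like the sketch below; texts is a placeholder list and the throughput will vary with text length and hardware.

import time
from scorers.moderation_scorer import ToxicityScorer

scorer = ToxicityScorer()
texts = ["an example message to score"] * 100  # placeholder inputs

start = time.perf_counter()
for text in texts:
    scorer.score(text)
elapsed = time.perf_counter() - start
print(f"{len(texts) / elapsed:.1f} samples/sec ({1000 * elapsed / len(texts):.1f} ms/sample)")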
Celadon Threshold comparison
We ran some threshold optimization for the Celadon model:
- We selected total_threshold=5 and category_threshold=2 (red) as we found this offers a good tradeoff between F1/recall and accuracy.
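The sweep itself is a simple grid search once per-sample category severities and labels have been collected; a sketch follows, where predictions (a list of 5-element severity vectors) and labels are assumed to be precomputed.

from itertools import product
from sklearn.metrics import accuracy_score, f1_score, recall_score

def sweep_thresholds(predictions, labels):
    # predictions: list of 5-element severity vectors (0-3 per category); labels: 0/1 ground truth
    results = []
    for category_threshold, total_threshold in product(range(1, 4), range(3, 9)):
        flags = [
            int(sum(p) >= total_threshold or any(s >= category_threshold for s in p))
            for p in predictions
        ]
        results.append({
            "category_threshold": category_threshold,
            "total_threshold": total_threshold,
            "accuracy": accuracy_score(labels, flags),
            "recall": recall_score(labels, flags),
            "f1": f1_score(labels, flags),
        })
    return results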
