
Bias Scorer

Initial Analysis of the Bias models

Definition

Our definition of bias is based on the recent paper "Toxicity of the Commons" and the Media Bias Identification Benchmark (MBIB). From these two sources we sub-select to focus on gender bias and racial bias, and train a classifier on these two classes:
  1. Race and origin-based bias: includes racism as well as bias against someone’s country or region of origin or immigration status, especially immigrant or refugee status
  2. Gender and sexuality-based bias: includes sexism and misogyny, homophobia, transphobia, and sexual harassment

Evaluation

To test our hypothesis, we performed an analysis of these two categories by building an evaluation dataset: a test set combining samples from MBIB and the Toxic Commons dataset, filtered to these two categories.
The curation script for the dataset can be found here. We curated 1,000 samples from a mix of the datasets, keeping only these categories and maintaining a good balance of negative samples.
The dataset can be explored interactively here.
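To make the curation concrete, here is a minimal sketch of the sampling and balancing logic, assuming each source has already been filtered to the two bias categories; the toy rows below are placeholders, and the real logic lives in the curation script linked above.

import random

random.seed(42)

# Toy rows standing in for the filtered MBIB and Toxic Commons examples (label 1 = biased).
mbib_rows = [{"text": f"mbib example {i}", "label": i % 2} for i in range(600)]
toxic_commons_rows = [{"text": f"toxic commons example {i}", "label": i % 2} for i in range(600)]

def sample_by_label(rows, label, n):
    """Take up to n rows with the given binary label."""
    matching = [r for r in rows if r["label"] == label]
    return random.sample(matching, min(n, len(matching)))

# 1,000 samples total, keeping a balance of positive (biased) and negative (clean) text.
eval_set = (
    sample_by_label(mbib_rows, 1, 250)
    + sample_by_label(toxic_commons_rows, 1, 250)
    + sample_by_label(mbib_rows, 0, 250)
    + sample_by_label(toxic_commons_rows, 0, 250)
)
random.shuffle(eval_set)
print(len(eval_set))  # 1000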

Comparing the models

  • Our custom model, WeaveBiasScorerV1, performs substantially better than the OpenAI Moderation API and LlamaGuard, with an F1 score of 0.64 vs. 0.41 for OpenAI Moderation and 0.54 for LlamaGuard (see the comparison sketch after this list).
  • The Celadon-based scorer performs well, as our classes match 2 of the 5 categories already present in that model.
  • The OpenAI Moderation API is not specific to this task and performs poorly on it in general.
  • LlamaGuard doesn't have any class that matches these categories; the closest is hate.
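For reference, the F1 comparison can be computed along these lines, assuming each model's output has first been reduced to a binary biased/not-biased prediction per sample; the labels and predictions below are toy values, not our evaluation data.

from sklearn.metrics import f1_score

# Ground truth: 1 = biased, 0 = clean (toy values for illustration).
labels = [1, 0, 1, 1, 0, 0, 1, 0]

# Binary predictions per model (toy values).
predictions = {
    "WeaveBiasScorerV1": [1, 0, 1, 1, 0, 1, 1, 0],
    "OpenAI Moderation": [0, 0, 1, 0, 0, 0, 1, 0],
    "LlamaGuard":        [1, 0, 0, 1, 0, 0, 1, 0],
}

for name, preds in predictions.items():
    print(f"{name}: F1 = {f1_score(labels, preds):.2f}")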



Individual samples

OpenAI and LlamaGuard fail to detect very explicit samples like this one:

and another one here:



Usage

from weave.scorers import WeaveBiasScorerV1

scorer = WeaveBiasScorerV1(device="cpu")
result = scorer.score("This is a hateful message.")
print(result)
# {'categories': {'gender_bias': True, 'racial_bias': False}, 'flagged': True}

# You can return the model logits with return_all_scores=True
result = scorer.score("This is a hateful message.", return_all_scores=True)
print(result)

You can also adjust the threshold parameter:

scorer = WeaveBiasScorerV1(device="cpu", threshold=0.8)
result = scorer.score("This is a hateful message.")
print(result)
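Building on the snippet above, here is a small sketch of running the scorer over a batch of texts and tallying how often each category fires, assuming the output dictionary shown earlier; the input texts are placeholders.

from weave.scorers import WeaveBiasScorerV1

scorer = WeaveBiasScorerV1(device="cpu")

texts = [
    "This is a hateful message.",
    "The weather is nice today.",
]  # placeholder inputs

counts = {"gender_bias": 0, "racial_bias": 0, "flagged": 0}
for text in texts:
    result = scorer.score(text)
    counts["flagged"] += int(result["flagged"])
    for category, fired in result["categories"].items():
        counts[category] += int(fired)

print(counts)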

Datasets

Bias Models

ValuRank Bias detector:

GenderRaceBiasModel based on PleIA - Celadon

CustomGenderRaceBiasModel

  • We fine-tuned a DeBERTa-v3 model on a filtered and mixed dataset built from the above, improving on the performance of the models proposed.
  • We propose a model that outputs 2 categories: "gender_bias" and "racial_bias" (see the sketch below).
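As a sketch of how the two-label head maps to these flags, assuming a sigmoid over per-class logits and a decision threshold (the logits and threshold below are illustrative, not the model's actual values):

import torch

LABELS = ["gender_bias", "racial_bias"]
THRESHOLD = 0.5  # illustrative threshold

logits = torch.tensor([2.1, -1.3])  # toy per-class logits from the classifier head
probs = torch.sigmoid(logits)

categories = {label: bool(p >= THRESHOLD) for label, p in zip(LABELS, probs.tolist())}
result = {"categories": categories, "flagged": any(categories.values())}
print(result)
# {'categories': {'gender_bias': True, 'racial_bias': False}, 'flagged': True}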

OpenAI Moderation API

  • Not particularly suited for bias detection, but correlated with it in some categories (see the sketch below):
  • hate: Hate speech
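A hedged sketch of querying the Moderation endpoint and using its hate category as a rough proxy for bias (requires an OpenAI API key; the model name and category mapping here are our assumptions, not an official bias detector):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.moderations.create(
    model="omni-moderation-latest",
    input="This is a hateful message.",
)
result = response.results[0]

# Treat the "hate" category as the closest proxy for race/gender bias.
print({"hate": result.categories.hate, "flagged": result.flagged})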

LlamaGuard

  • This model is not specific to bias detection, but it has several classes that could be used to flag bias; the closest ones are listed below, and a usage sketch follows the list. It also does not consistently mark biased text as "unsafe".
  • S5: Defamation
  • S10: Hate
  • S13: Elections
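A hedged sketch of prompting Llama Guard 3 through transformers and checking whether any of the closest categories appear in its verdict; the model ID and output parsing follow the public model card, but treat them as assumptions rather than part of our pipeline:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-Guard-3-8B"  # gated model; requires Hub access
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

chat = [{"role": "user", "content": "This is a hateful message."}]
input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt").to(model.device)

output = model.generate(input_ids, max_new_tokens=32, pad_token_id=tokenizer.eos_token_id)
verdict = tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)

# Typical verdicts look like "safe" or "unsafe\nS10".
flagged = "unsafe" in verdict and any(code in verdict for code in ("S5", "S10", "S13"))
print(verdict.strip(), flagged)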


Training

We curated a training dataset by reviewing a range of research and industry datasets and selecting the classes we felt were needed.
The resulting dataset contains 243k samples, and we trained for 2 epochs.
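For context, here is a minimal sketch of the fine-tuning setup, assuming a DeBERTa-v3 base checkpoint and a 2-label multi-label head; train_ds is a placeholder for the curated 243k-sample dataset with a "text" column and a float label vector [gender_bias, racial_bias], and all hyperparameters other than the 2 epochs are illustrative.

from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model_name = "microsoft/deberta-v3-base"  # assumed base checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=2,
    problem_type="multi_label_classification",
    id2label={0: "gender_bias", 1: "racial_bias"},
)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

train_ds = ...  # placeholder: the curated 243k-sample dataset
train_ds = train_ds.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="bias-scorer",
    num_train_epochs=2,
    per_device_train_batch_size=32,
    learning_rate=2e-5,
)

Trainer(model=model, args=args, train_dataset=train_ds, tokenizer=tokenizer).train()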
