
Bias Scorer

Initial Analysis of the Bias models

Definition

Our definition of bias is based on the recent paper "Toxicity of the Commons" and the Media Bias Identification Benchmark (MBIB). From these two sources we sub-select to focus on gender bias and racial bias, and train a classifier on these two classes:
  1. Race and origin-based bias: includes racism as well as bias against someone’s country or region of origin or immigration status, especially immigrant or refugee status
  2. Gender and sexuality-based bias: includes sexism and misogyny, homophobia, transphobia, and sexual harassment

Evaluation

To test our hypothesis, we performed an analysis of these two categories by building an evaluation dataset: a test set combining samples from MBIB and the Toxic Commons dataset, filtered to these two categories.
The curation script for the dataset can be found here. We curated 1,000 samples from a mix of the datasets, keeping only these categories and maintaining a good balance of negative samples.
The dataset can be explored interactively here.
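To make the curation concrete, here is a minimal sketch of the sampling and balancing logic, assuming each source has already been filtered to the two bias categories; the toy rows below are placeholders, and the real logic lives in the curation script linked above.

import random

random.seed(42)

# Toy rows standing in for the filtered MBIB and Toxic Commons examples (label 1 = biased).
mbib_rows = [{"text": f"mbib example {i}", "label": i % 2} for i in range(600)]
toxic_commons_rows = [{"text": f"toxic commons example {i}", "label": i % 2} for i in range(600)]

def sample_by_label(rows, label, n):
    """Take up to n rows with the given binary label."""
    matching = [r for r in rows if r["label"] == label]
    return random.sample(matching, min(n, len(matching)))

# 1,000 samples total, keeping a balance of positive (biased) and negative (clean) text.
eval_set = (
    sample_by_label(mbib_rows, 1, 250)
    + sample_by_label(toxic_commons_rows, 1, 250)
    + sample_by_label(mbib_rows, 0, 250)
    + sample_by_label(toxic_commons_rows, 0, 250)
)
random.shuffle(eval_set)
print(len(eval_set))  # 1000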

Comparing the models

  • Our custom model, WeaveBiasScorerV1, performs substantially better than the OpenAI Moderation API and LlamaGuard, with an F1 score of 0.64 vs. 0.41 for OpenAI Moderation and 0.54 for LlamaGuard (see the comparison sketch after this list).
  • The Celadon-based scorer performs well, as our classes match 2 of the 5 categories already present in that model.
  • The OpenAI Moderation API is not specific to this task and performs poorly on it in general.
  • LlamaGuard doesn't have any class that matches these categories; the closest is hate.
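For reference, the F1 comparison can be computed along these lines, assuming each model's output has first been reduced to a binary biased/not-biased prediction per sample; the labels and predictions below are toy values, not our evaluation data.

from sklearn.metrics import f1_score

# Ground truth: 1 = biased, 0 = clean (toy values for illustration).
labels = [1, 0, 1, 1, 0, 0, 1, 0]

# Binary predictions per model (toy values).
predictions = {
    "WeaveBiasScorerV1": [1, 0, 1, 1, 0, 1, 1, 0],
    "OpenAI Moderation": [0, 0, 1, 0, 0, 0, 1, 0],
    "LlamaGuard":        [1, 0, 0, 1, 0, 0, 1, 0],
}

for name, preds in predictions.items():
    print(f"{name}: F1 = {f1_score(labels, preds):.2f}")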



Individual samples

OpenAI and LlamaGuard fail to detect very explicit samples like this one:

and another one here:



Usage

from weave.scorers import WeaveBiasScorerV1

scorer = WeaveBiasScorerV1(device="cpu")
result = scorer.score("This is a hateful message.")
print(result)
# {'categories': {'gender_bias': True, 'racial_bias': False}, 'flagged': True}

# You can return the model logits with return_all_scores=True
result = scorer.score("This is a hateful message.", return_all_scores=True)
print(result)

You can also adjust the threshold parameter:

scorer = WeaveBiasScorerV1(device="cpu", threshold=0.8)
result = scorer.score("This is a hateful message.")
print(result)
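Building on the snippet above, here is a small sketch of running the scorer over a batch of texts and tallying how often each category fires, assuming the output dictionary shown earlier; the input texts are placeholders.

from weave.scorers import WeaveBiasScorerV1

scorer = WeaveBiasScorerV1(device="cpu")

texts = [
    "This is a hateful message.",
    "The weather is nice today.",
]  # placeholder inputs

counts = {"gender_bias": 0, "racial_bias": 0, "flagged": 0}
for text in texts:
    result = scorer.score(text)
    counts["flagged"] += int(result["flagged"])
    for category, fired in result["categories"].items():
        counts[category] += int(fired)

print(counts)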

Datasets

Bias Models

ValuRank Bias detector:

GenderRaceBiasModel based on PleIA - Celadon

CustomGenderRaceBiasModel

  • We fine-tuned a DeBERTa-v3 model on a filtered and mixed dataset built from the above, improving on the performance of the models proposed.
  • We propose a model that outputs 2 categories: "gender_bias" and "racial_bias" (see the sketch below).
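As a sketch of how the two-label head maps to these flags, assuming a sigmoid over per-class logits and a decision threshold (the logits and threshold below are illustrative, not the model's actual values):

import torch

LABELS = ["gender_bias", "racial_bias"]
THRESHOLD = 0.5  # illustrative threshold

logits = torch.tensor([2.1, -1.3])  # toy per-class logits from the classifier head
probs = torch.sigmoid(logits)

categories = {label: bool(p >= THRESHOLD) for label, p in zip(LABELS, probs.tolist())}
result = {"categories": categories, "flagged": any(categories.values())}
print(result)
# {'categories': {'gender_bias': True, 'racial_bias': False}, 'flagged': True}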

OpenAI Moderation API

  • Not particularly suited for bias detection, but correlated with it in some categories (see the sketch below):
  • hate: Hate speech
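A hedged sketch of querying the Moderation endpoint and using its hate category as a rough proxy for bias (requires an OpenAI API key; the model name and category mapping here are our assumptions, not an official bias detector):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.moderations.create(
    model="omni-moderation-latest",
    input="This is a hateful message.",
)
result = response.results[0]

# Treat the "hate" category as the closest proxy for race/gender bias.
print({"hate": result.categories.hate, "flagged": result.flagged})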

LlamaGuard

  • This model is not specific to bias detection, but it has several classes that could be used to flag bias; the closest ones are listed below, and a usage sketch follows the list. It also does not consistently mark biased text as "unsafe".
  • S5: Defamation
  • S10: Hate
  • S13: Elections
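A hedged sketch of prompting Llama Guard 3 through transformers and checking whether any of the closest categories appear in its verdict; the model ID and output parsing follow the public model card, but treat them as assumptions rather than part of our pipeline:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-Guard-3-8B"  # gated model; requires Hub access
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

chat = [{"role": "user", "content": "This is a hateful message."}]
input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt").to(model.device)

output = model.generate(input_ids, max_new_tokens=32, pad_token_id=tokenizer.eos_token_id)
verdict = tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)

# Typical verdicts look like "safe" or "unsafe\nS10".
flagged = "unsafe" in verdict and any(code in verdict for code in ("S5", "S10", "S13"))
print(verdict.strip(), flagged)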


Training

We curated a training dataset by reviewing a range of research and industry datasets and selecting the classes we felt were needed.
The resulting dataset contains 243k samples, and we trained for 2 epochs.
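For context, here is a minimal sketch of the fine-tuning setup, assuming a DeBERTa-v3 base checkpoint and a 2-label multi-label head; train_ds is a placeholder for the curated 243k-sample dataset with a "text" column and a float label vector [gender_bias, racial_bias], and all hyperparameters other than the 2 epochs are illustrative.

from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model_name = "microsoft/deberta-v3-base"  # assumed base checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=2,
    problem_type="multi_label_classification",
    id2label={0: "gender_bias", 1: "racial_bias"},
)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

train_ds = ...  # placeholder: the curated 243k-sample dataset
train_ds = train_ds.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="bias-scorer",
    num_train_epochs=2,
    per_device_train_batch_size=32,
    learning_rate=2e-5,
)

Trainer(model=model, args=args, train_dataset=train_ds, tokenizer=tokenizer).train()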
