How Weights & Biases and MS Fairlearn can help deal with Model and Dataset Bias
As part of this report we take a hard look at Dataset bias for tabular and image data, and also use Microsoft's Fairlearn with Weights and Biases to look at fairness metrics and bias mitigation algorithms.
After reading multiple research papers and performing a thorough literature review on this topic for the past few weeks, I'm of the opinion that while established techniques exist for dealing with bias in tabular data, the question of "detecting and dealing with dataset bias for image and text data" is still under active research and there is no clear answer.
As part of this report, I'd like to showcase how integrating your pipelines with Weights & Biases can really help you detect and combat dataset bias for tabular and image data. We'll also look at Microsoft's Fairlearn toolkit and see how it can be used to detect and mitigate bias for tabular data, using the UCI credit card dataset as an example. The example notebook can also be found here.
Towards the end of this report, I've also provided a list of relevant research papers on the topic for anyone interested.
The aim of this report is two-fold:
- To help the readers try and understand some of the subtle ways bias can sneak into our datasets,
- To look at some possible ways to combat bias using Weights & Biases, especially for tabular data with MS Fairlearn.
What is Bias?
Before we go any further, we must answer the question "What is Bias?". Quoting Maureen Mcleaney from IBM:
A cognitive bias is a systematic pattern of deviation from norm or rationality in judgement. People make decisions given their limited resources.
There are various examples of Bias in Machine Learning:
- Google's sentiment analyzer isn't always effective and sometimes produces biased results. [source]

Figure-1: Bias in Google's sentiment analyzer; source: https://www.vice.com/en/article/j5jmj8/google-artificial-intelligence-bias

Figure-2: Borden was rated high risk for future crime after she and a friend took a kid’s bike and scooter that were sitting outside. She did not reoffend; source: https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing

Figure-3: In a 2018 test that the ACLU conducted of the facial recognition tool, called “Rekognition,” the software incorrectly matched 28 members of Congress, identifying them as other people who have been arrested for a crime; source: https://www.aclu.org/blog/privacy-technology/surveillance-technologies/amazons-face-recognition-falsely-matched-28
There are many more such examples of bias in AI, but for brevity I'll stop here. The point is that this is a serious concern, and we as deep learning practitioners and data scientists should make conscious efforts to remove biases from the models that we train!
A child wearing sunglasses is labeled as a "failure, loser, nonstarter, unsuccessful person." This is just one of the many systemic biases exposed by ImageNet Roulette, an art project that applies labels to user-submitted photos by sourcing its identification system from the original ImageNet database.
With the rapid adoption of AI across a variety of sectors, including in areas such as justice and health care, technologists and policy makers have raised concerns about the lack of accountability and bias associated with AI-based decisions. The necessary expertise around AI, datasets, and the policy and rights landscape that collectively helps uncover bias is not uniformly available among these stakeholders. As a consequence, bias in AI systems can compound inconspicuously.
Defining, detecting, measuring, and mitigating bias in AI systems is not an easy task and is an active area of research. (Barocas et al.)
A typical AI pipeline starts from the data-creation stage:
- Collecting the data;
- Annotating or labeling it; and
- Preparing or processing it into a format that can be consumed by the rest of the pipeline.
For a detailed analysis of how different types of bias can be introduced in each of these steps, I'd like to refer readers to Biases in AI Systems. The various types of biases described in the paper are summarized in the image below:

Figure-4: Taxonomy of Biases in AI Pipeline
Despite significant research efforts within the AI community to address bias-related challenges, several gaps impede the collective progress. These gaps have also been addressed in the paper.
The paper A Survey on Bias in Visual Datasets is another wonderful survey that provides an analysis of the various kinds of biases that exist in visual datasets for image recognition, classification, and detection.
Bias Mitigation Algorithms (AI Fairness)
Having looked at bias, let's also look at some of the bias mitigation algorithms that exist. According to the AI Fairness 360 paper, there are three main types of bias mitigation algorithms:
Pre-processing algorithms: Reweighing (Kamiran & Calders, 2012) generates weights for the training examples in each (group, label) combination differently to ensure fairness before classification (a minimal sketch of this reweighing idea is shown after the three categories below). Optimized preprocessing (Calmon et al., 2017) learns a probabilistic transformation that edits the features and labels in the data with group fairness, individual distortion, and data fidelity constraints and objectives. Learning fair representations (Zemel et al., 2013) finds a latent representation that encodes the data well but obfuscates information about protected attributes. Disparate impact remover (Feldman et al., 2015) edits feature values to increase group fairness while preserving rank-ordering within groups.
In-processing algorithms: Adversarial de-biasing (Zhang et al., 2018) learns a classifier to maximize prediction accuracy and simultaneously reduce an adversary's ability to determine the protected attribute from the predictions. This approach leads to a fair classifier as the predictions cannot carry any group discrimination information that the adversary can exploit. Prejudice remover (Kamishima et al., 2012) adds a discrimination-aware regularization term to the learning objective.
Post-processing algorithms: Equalized odds post-processing (Hardt et al., 2016) solves a linear program to find probabilities with which to change output labels to optimize equalized odds.
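To make the pre-processing idea concrete, here is a minimal sketch of Reweighing-style weights (not the AIF360 or Fairlearn implementation, just the core formula): each example in a (group, label) cell is weighted by the ratio of the probability expected under independence to the observed probability of that cell.

import pandas as pd

def reweighing_weights(groups: pd.Series, labels: pd.Series) -> pd.Series:
    """Sketch of Reweighing (Kamiran & Calders, 2012):
    w(g, l) = P(g) * P(l) / P(g, l), so that group and label become
    statistically independent under the re-weighted training distribution."""
    df = pd.DataFrame({"g": groups.values, "l": labels.values})
    p_g = df["g"].value_counts(normalize=True)    # marginal P(g)
    p_l = df["l"].value_counts(normalize=True)    # marginal P(l)
    p_gl = df.value_counts(normalize=True)        # joint P(g, l)
    return df.apply(lambda r: p_g[r["g"]] * p_l[r["l"]] / p_gl[(r["g"], r["l"])],
                    axis=1)

# Example use (illustrative): pass the result as sample_weight to an
# sklearn-style classifier so under-represented (group, label) cells count more.
# weights = reweighing_weights(dataset["SEX"], dataset["default payment next month"])
# model.fit(X_train, y_train, sample_weight=weights.loc[X_train.index])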
As part of this report, we will be using `ThresholdOptimizer` in the Bias Mitigation section, which uses the technique described in the Equality of Opportunity in Supervised Learning paper. This technique takes as input an existing classifier and the sensitive feature, and derives a monotone transformation of the classifier's prediction to enforce the specified parity constraints.
Detecting Bias in Images
The researchers behind the paper Unbiased Look at Dataset Bias decided to play a simple game called "Name That Dataset!"
Let us play this game too! Shown in Figure-5 below are three images from twelve popular object recognition datasets. The goal is to guess which images came from which dataset. Go ahead - try it out! And compare your answers with the answer key below.

Figure-5: Name That Dataset: Given three images from twelve popular object recognition datasets, can you match the images with the dataset? Answer key below.
Answer key:
1) Caltech-101
2) UIUC
3) MSRC
4) Tiny Images
5) ImageNet
6) PASCAL VOC
7) LabelMe
8) SUN09
9) 15 Scenes
10) Corel
11) Caltech-256
12) COIL-100
From the paper:
In theory, this should be a very difficult task, considering that the datasets contain thousands to millions of images. Moreover, most of these datasets were collected with the expressed goal of being as varied and rich as possible, aiming to sample the visual world “in the wild”. Yet in practice, this task turns out to be relatively easy for anyone who has worked in object and scene recognition (in our labs, most people got more than 75% correct).
The lesson from this toy experiment is that, despite the best efforts of their creators, the datasets appear to have a strong built-in bias.
Also, the fact that bias exists in datasets such as ImageNet is no news really - papers such as ImageNet trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness and PASS: An ImageNet replacement for self-supervised pretraining without humans (and many more) have already pointed out that models trained on ImageNet are biased! But, can something be done about it?
For popular public datasets such as ImageNet that have been studied and researched thoroughly, the short answer is YES (A Deeper Look at Dataset Bias and Unbiased Look at Dataset Bias).
But what about private datasets? How can researchers or data scientists working on various "other" datasets detect bias?
The answer after a thorough literature review IMHO lies in:
- Building a bias-free test set that is representative of the real world (this is much harder than it sounds and requires domain knowledge!)
- Visually inspecting training data and also thoroughly reviewing data collection pipeline.
While no amount of code can build a bias-free test set for you (that will be different for every domain and problem statement), we can definitely utilize Weights & Biases Tables to visualize training data!
Let's look at ImageNette (a small subset of ImageNet) and inspect it visually to see if there is any bias in the dataset! Code to log ImageNette as W&B table can be found here.
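For reference, a rough sketch of what such logging can look like is shown below (the full version is in the linked code); the local ImageNette path and folder layout here are assumptions, so adjust them to wherever your copy of the dataset lives.

import wandb
from pathlib import Path

# Assumes a local ImageNette copy laid out as <root>/<class_folder>/<image>.JPEG
run = wandb.init(project="dataset-bias", job_type="eda")
table = wandb.Table(columns=["image", "label"])

for img_path in Path("imagenette2-160/train").rglob("*.JPEG"):
    label = img_path.parent.name  # class folder name, e.g. "n01440764" (tench)
    table.add_data(wandb.Image(str(img_path)), label)

run.log({"imagenette_train": table})
run.finish()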
Looking at the table above, it becomes clear that there is a strong collection bias when it comes to images of the category "tench" (label "0"): most of the images also have a human holding the fish! This is a small example of how logging data using W&B Tables can help detect bias - in the next section of this report, we will see an even more concrete example using tabular data!
In summary, there is no clear way to detect bias in images that works for all datasets. It is really up to the user to log and inspect data visually - W&B can be a great tool to help you do that in seconds!
Detecting and Mitigating Bias in Tabular data!
In this section, we will look at the UCI Credit-card Default dataset, synthetically add bias to it, and then use Microsoft's Fairlearn to detect and mitigate that bias! We will also learn about Weights & Biases custom charts and look more deeply into how W&B Tables can help you detect bias!
The section below is slightly code heavy. Unless specified otherwise, the code has been copied from the notebook here, with Weights & Biases integration added to use custom charts and W&B Tables and to import interactive charts into this report.
So, let's get started.
First, let's define the ML problem that we are trying to solve.
In this section, we consider a scenario where algorithmic tools are deployed to predict the likelihood that an applicant will default on a credit-card loan. The notebook emulates the problem presented in this white paper in collaboration with EY. UCI Credit-card Default is a toy dataset reflecting credit-card defaults in Taiwan. To make this dataset biased, we introduce a synthetic feature that is highly predictive for applicants defined as "female" in terms of the "sex" feature, but is uninformative for applicants defined as "male".
Imports
import wandb

# General imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# Data processing
from sklearn.model_selection import train_test_split

# Models
import lightgbm as lgb
from sklearn.calibration import CalibratedClassifierCV

# Fairlearn algorithms and utils
from fairlearn.postprocessing import ThresholdOptimizer
from fairlearn.reductions import GridSearch, EqualizedOdds

# Metrics
from fairlearn.metrics import (
    MetricFrame,
    selection_rate, demographic_parity_difference, demographic_parity_ratio,
    false_positive_rate, false_negative_rate,
    false_positive_rate_difference, false_negative_rate_difference,
    equalized_odds_difference)
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
We import everything we need. Specifically, we import the wandb package, which will be used to log charts, tables, and metrics throughout the process.
Download Data
The UCI dataset contains data on 30,000 clients and their credit card transactions at a bank in Taiwan. In addition to static client features, the dataset also contains the history of credit card bill payments between April and September 2005, as well as the balance limit of the client's credit card. The target is whether the client will default on a card payment in the following month, October 2005. A model trained on this data could be used, in part, to determine whether a client is eligible for another loan or a credit increase.
# Load the data
data_url = "http://archive.ics.uci.edu/ml/machine-learning-databases/00350/default%20of%20credit%20card%20clients.xls"
dataset = (pd.read_excel(io=data_url, header=1)
           .drop(columns=['ID'])
           .rename(columns={'PAY_0': 'PAY_1'}))
dataset.head()
The raw data, logged as a Weights & Biases Table, is shown below.
After downloading the dataset, we apply simple transforms to convert categorical features to the pandas Categorical type, and then add synthetic data to make the LIMIT_BAL column highly predictive for the "female" group but not for the "male" group. Specifically, we set this up so that a lower credit limit indicates that a female client is less likely to default, but provides no information on a male client's probability of default.
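For context, the snippet below sketches roughly how the notebook sets up the variables the next code block relies on: `Y` is the binary default target, `A` is the sensitive "sex" feature, and the remaining categorical columns are cast to the Categorical dtype. The column names follow the UCI schema; check the linked notebook for the exact version.

# Target and sensitive feature (names follow the UCI schema / example notebook)
Y = dataset["default payment next month"]
A = dataset["SEX"]                        # 1 = male, 2 = female in the UCI data
A_str = A.map({1: "male", 2: "female"})

# Cast categorical columns to pandas Categorical dtype
categorical_features = ['SEX', 'EDUCATION', 'MARRIAGE',
                        'PAY_1', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6']
for col in categorical_features:
    dataset[col] = dataset[col].astype('category')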
dist_scale = 0.3
np.random.seed(12345)

# Make 'LIMIT_BAL' informative of the target
dataset['LIMIT_BAL'] = Y + np.random.normal(scale=dist_scale, size=len(dataset))

# But then make it uninformative for the male clients
dataset.loc[A==1, 'LIMIT_BAL'] = np.random.normal(scale=dist_scale, size=dataset[A==1].shape[0])
The updated synthetic data, logged as a W&B Table, is shown below.
As can be seen, there is now a high correlation between the target (`default payment next month`) and the column `LIMIT_BAL`. This is the benefit of using W&B Tables: everything becomes interactive and is right in front of your eyes.
If we plot the `LIMIT_BAL` distribution as a KDE plot using W&B, as below, we can see that the distributions for female clients differ depending on the outcome, which suggests that our data is biased.
As can be seen, the distribution for females is centered around 0 when they pay on time, and around 1 when they default! This indicates a high correlation between the target and the `LIMIT_BAL` variable for this group - confirming that our data is biased.
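A possible way to produce and log such a KDE plot is sketched below. Note that seaborn is an extra dependency not in the imports above, and the grouping uses the `Y`/`A_str` variables sketched earlier.

import matplotlib.pyplot as plt
import seaborn as sns

# Compare LIMIT_BAL for female clients, split by whether they defaulted
fig, ax = plt.subplots()
female = (A_str == "female")
sns.kdeplot(dataset.loc[female & (Y == 0), "LIMIT_BAL"], label="female, paid on time", ax=ax)
sns.kdeplot(dataset.loc[female & (Y == 1), "LIMIT_BAL"], label="female, defaulted", ax=ax)
ax.set_xlabel("LIMIT_BAL")
ax.legend()

# Log the matplotlib figure to W&B as an image
wandb.log({"limit_bal_kde": wandb.Image(fig)})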
Fairness Unaware Model
Next, let's split our data into train and test sets and train a naive model on the synthetic data. We will then use Microsoft's Fairlearn to check the fairness of the model!
lgb_params = {
    'objective': 'binary',
    'metric': 'auc',
    'learning_rate': 0.03,
    'num_leaves': 10,
    'max_depth': 3
}
model = lgb.LGBMClassifier(**lgb_params)
model.fit(df_train, Y_train)

# Scores on test set
test_scores = model.predict_proba(df_test)[:, 1]

# Train AUC
roc_auc_score(Y_train, model.predict_proba(df_train)[:, 1])
Did you know it is also possible to log the ROC-AUC curve as a custom chart to Weights & Biases?! It is really simple and can be done in two lines of code, as below.
# Log ROC-AUC curve to W&B
roc_plot = wandb.plot.roc_curve(Y_train.values, model.predict_proba(df_train),
                                labels=None, classes_to_plot=None)
wandb.log({"roc-auc": roc_plot})
Next, let's log the model's feature importances, shown below!
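One possible way to log these importances is sketched below (the report's original panel is an interactive W&B chart; this is just one way to reproduce it with `wandb.plot.bar`).

# Collect LightGBM feature importances into a table and log a bar chart
importance_df = pd.DataFrame({
    "feature": df_train.columns,
    "importance": model.feature_importances_,
}).sort_values("importance", ascending=False)

imp_table = wandb.Table(dataframe=importance_df)
wandb.log({"feature_importance": wandb.plot.bar(
    imp_table, "feature", "importance", title="LightGBM feature importance")})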
We notice that the synthetic feature LIMIT_BAL appears as the most important feature in this model, even though it has no predictive power for an entire demographic segment in the data. Thanks to the process we have followed so far, everything is logged and can be explained to any member of our team - this is the power of integrating Weights & Biases with your workflows!
Next, let's use Fairlearn to examine false-positive and false-negative rates for each demographic group!
# test_preds are binarized (0/1) test-set predictions derived from test_scores
# (see the example notebook for the exact threshold used)
mf = MetricFrame({'FPR': false_positive_rate, 'FNR': false_negative_rate},
                 Y_test, test_preds, sensitive_features=A_str_test)
wandb.log({"fpr-fnr": wandb.Table(dataframe=mf.by_group)})
It can be seen that both kinds of errors are more common in the "male" group than in the "female" group.
Next, let's calculate several fairness metrics from Fairlearn and log them as a Weights & Biases Table!
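The table in the notebook includes a few more rows, but a sketch of how such metrics can be computed with Fairlearn and logged to W&B looks roughly like this (variable names such as `test_preds` and `A_str_test` follow the earlier snippets):

# Overall performance and fairness metrics for the fairness-unaware model
fairness_metrics = {
    "overall AUC": roc_auc_score(Y_test, test_scores),
    "overall balanced accuracy": balanced_accuracy_score(Y_test, test_preds),
    "equalized odds difference": equalized_odds_difference(
        Y_test, test_preds, sensitive_features=A_str_test),
    "demographic parity ratio": demographic_parity_ratio(
        Y_test, test_preds, sensitive_features=A_str_test),
}
wandb.log({"fairness_metrics": wandb.Table(
    dataframe=pd.DataFrame(fairness_metrics, index=["fairness-unaware"]))})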
An explanation of the metrics above is provided in the example notebook, quoted below:
As the overall performance metric we use the area under ROC curve (AUC), which is suited to classification problems with a large imbalance between positive and negative examples.
As the fairness metric we use equalized odds difference, which quantifies the disparity in accuracy experienced by different demographics. Our goal is to assure that neither of the two groups ("male" vs "female") has substantially larger false-positive rates or false-negative rates than the other group. The equalized odds difference is equal to the larger of the following two numbers:
(1) the difference between false-positive rates of the two groups,
(2) the difference between false-negative rates of the two groups.
The table above shows the overall AUC of 0.89 and the overall balanced error rate of 0.23 (based on 0/1 predictions). Both of these are satisfactory in our application context.
However, there is a large disparity in accuracy rates (as indicated by the balanced error rate difference) and even larger when we consider the equalized-odds difference.
As a sanity check, we also show the demographic parity ratio, whose level (slightly above 0.8) is considered satisfactory.
Bias Mitigation
In the previous section, we looked at how inspecting metrics from Microsoft Fairlearn, logged to Weights & Biases as tables or custom charts, can help detect bias! In this section, we will look at Fairlearn's `ThresholdOptimizer` post-processing mitigation algorithm and re-calculate the metrics to see if it helps!
So, let's get started. First, we create a `ThresholdOptimizer` and fit it to the model to increase fairness. From the example notebook:
This algorithm finds a suitable threshold for the scores (class probabilities) produced by the lightGBM model by optimizing the accuracy rate under the constraint that the equalized odds difference (on training data) is zero. This means that ThresholdOptimizer is effectively optimizing balanced accuracy on the original data.
postprocess_est = ThresholdOptimizer(
    estimator=model,
    constraints="equalized_odds",
    prefit=True)

# Balanced data set is obtained by sampling the same number of points from the
# majority class (Y=0) as there are points in the minority class (Y=1)
balanced_idx1 = df_train[Y_train==1].index
pp_train_idx = balanced_idx1.union(
    Y_train[Y_train==0].sample(n=balanced_idx1.size, random_state=1234).index)
df_train_balanced = df_train.loc[pp_train_idx, :]
Y_train_balanced = Y_train.loc[pp_train_idx]
A_train_balanced = A_train.loc[pp_train_idx]

postprocess_est.fit(df_train_balanced, Y_train_balanced,
                    sensitive_features=A_train_balanced)
Next, we calculate the metrics again and compare them to the previous metrics in a Weights & Biases Table below.
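For completeness, here is a sketch of how the mitigated predictions and their metrics can be produced (split names such as `A_test` are assumptions based on the earlier snippets; the sensitive feature passed at predict time should use the same encoding as during fit):

# Binary predictions from the fitted ThresholdOptimizer (it applies
# group-specific thresholds, so the sensitive feature is needed at predict time)
postprocess_preds = postprocess_est.predict(df_test, sensitive_features=A_test)

mitigated_metrics = {
    "overall balanced accuracy": balanced_accuracy_score(Y_test, postprocess_preds),
    "equalized odds difference": equalized_odds_difference(
        Y_test, postprocess_preds, sensitive_features=A_str_test),
    "demographic parity ratio": demographic_parity_ratio(
        Y_test, postprocess_preds, sensitive_features=A_str_test),
}
wandb.log({"mitigated_metrics": wandb.Table(
    dataframe=pd.DataFrame(mitigated_metrics, index=["ThresholdOptimizer"]))})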
The `ThresholdOptimizer` algorithm has significantly reduced the disparity according to multiple metrics. However, the performance metrics (balanced error rate as well as AUC) get worse. In our case, this is because the available features are much less informative for one of the demographic groups than for the other.
Conclusion
As part of this report, we talked about model and dataset bias, and we saw how bias can sneak into our datasets. By playing the simple game "Name That Dataset", we also saw that almost every dataset has been built differently and has its own quirks - it is up to us to create a test set that is representative of the real world, to make sure that our model generalizes and isn't biased.
Two main ways suggested in Unbiased Look at Dataset Bias to measure and combat bias in images are:
- Cross-dataset generalization
- Negative Set Bias
As part of this report we also looked at Microsoft FairLearn and integrated it with Weights and Biases for a UCI credit card analysis example! We learnt about W&B tables and custom charts!
In summary, there is no free-lunch algorithm to detect bias for every dataset. It really requires domain knowledge and a careful eye - thus logging everything to Weights and Biases and spending time analysing plots, charts and tables like we've done in our report can really help detect bias! Also, to be sure that decisions are not being made due to biases, we must look beyond using accuracy as our only performance metric.
References
Flavio P. Calmon, Dennis Wei, Bhanukiran Vinzamuri, Karthikeyan Natesan Ramamurthy, and Kush R. Varshney, “Optimized Pre-Processing for Discrimination Prevention”, Conference on Neural Information Processing Systems, 2017.
Elisa Celis, Lingxiao Huang, Vijay Keswani, Nisheeth Vishnoi, “Classification with Fairness Constraints: A Meta-Algorithm with Provable Guarantees”, 2018
Michael Feldman, Sorelle A. Friedler, John Moeller, Carlos Scheidegger, and Suresh Venkatasubramanian, “Certifying and Removing Disparate Impact”, ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2015.
Moritz Hardt, Eric Price, and Nathan Srebro, “Equality of Opportunity in Supervised Learning”, Conference on Neural Information Processing Systems, 2016.
Faisal Kamiran and Toon Calders, “Data Preprocessing Techniques for Classification without Discrimination”, Knowledge and Information Systems, 2012.
Faisal Kamiran, Asim Karim, and Xiangliang Zhang, “Decision Theory for Discrimination-Aware Classification”, IEEE International Conference on Data Mining, 2012.
Toshihiro Kamishima, Shotaro Akaho, Hideki Asoh, and Jun Sakuma, “Fairness-Aware Classifier with Prejudice Remover Regularizer”, Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 2012.
Geoff Pleiss, Manish Raghavan, Felix Wu, Jon Kleinberg, and Kilian Q. Weinberger, “On Fairness and Calibration”, Conference on Neural Information Processing Systems, 2017.
Till Speicher, Hoda Heidari, Nina Grgic-Hlaca, Krishna P. Gummadi, Adish Singla, Adrian Weller, and Muhammad Bilal Zafar, “A Unified Approach to Quantifying Algorithmic Unfairness: Measuring Individual & Group Unfairness via Inequality Indices”, ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2018.
Richard Zemel, Yu (Ledell) Wu, Kevin Swersky, Toniann Pitassi, and Cynthia Dwork, “Learning Fair Representations”, International Conference on Machine Learning, 2013.
Brian Hu Zhang, Blake Lemoine, and Margaret Mitchell, “Mitigating Unwanted Biases with Adversarial Learning”, AAAI/ACM Conference on Artificial Intelligence, Ethics, and Society, 2018.