Fairness
Attaching Weights & Biases to explore the fairness of algorithms. Fairness is evaluated via Fairlearn, and Weights & Biases is used to track everything done. Borrows heavily (lots of copy and paste) from the intro example available at fairlearn.org. The original notebook can be found at:
https://github.com/fairlearn/fairlearn/blob/main/notebooks/Binary%20Classification%20with%20the%20UCI%20Credit-card%20Default%20Dataset.ipynb
Introduction
A base example from fairlearn.org was used to demonstrate how to leverage W&B to surface details on model fairness, feature importance, partial dependence, and prediction explanations. The original notebook is linked above. One subtle difference from the original notebook is that we consider XGBoost instead of LightGBM.
Consider the scenario where algorithmic tools are deployed to predict the likelihood that an applicant will default on a credit-card loan. In this experiment, we explore Fairlearn through an introductory example they provide, which emulates the problem presented in this white paper. Moreover, we attach W&B experiment tracking to the problem to surface information and help contextualize it via W&B Reports.
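As a minimal sketch of what "attaching W&B" looks like here, the snippet below initializes a run and logs a metric; the project name, config keys, and logged value are placeholders rather than the exact settings behind this report.

```python
# Minimal sketch of attaching W&B tracking to the experiment.
# Project name, config keys, and the logged value are placeholders.
import wandb

run = wandb.init(
    project="fairlearn-credit-default",          # hypothetical project name
    config={"model": "xgboost", "seed": 12345},  # illustrative config
)

# ... train and evaluate the model here ...

run.log({"balanced_accuracy": 0.0})              # placeholder metric value
run.finish()
```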
The UCI Credit-card Default Dataset
The UCI dataset contains data on 30,000 clients and their credit card transactions at a bank in Taiwan. In addition to static client features, the dataset contains the history of credit card bill payments between April and September 2005, as well as the balance limit of the client's credit card. The target is whether the client will default on a card payment in the following month, October 2005. A model trained on this data could be used, in part, to determine whether a client is eligible for another loan or a credit increase.
Dataset columns:
- LIMIT_BAL: credit card limit, will be replaced by a synthetic feature
- SEX, EDUCATION, MARRIAGE, AGE: client demographic features
- BILL_AMT[1-6]: amount on bill statement for April-September
- PAY_AMT[1-6]: payment amount for April-September
- default payment next month: target, whether the client defaulted the following month
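As a rough sketch (not the report's exact loading code), the table can be pulled from OpenML; the data_id, target encoding, and column names below are assumptions and may need adjusting to match the UCI naming.

```python
# Hedged sketch: load the UCI credit-card default data from OpenML.
# The data_id, target encoding, and column names are assumptions.
from sklearn.datasets import fetch_openml

data = fetch_openml(data_id=42477, as_frame=True)   # assumed OpenML copy of the UCI table
df = data.data.copy()                                # features (rename columns to UCI names if needed)
y = data.target.astype(int)                          # 1 = default next month (encoding assumed 0/1)
A = df["SEX"].map({1: "male", 2: "female"})          # sensitive feature: UCI codes 1 = male, 2 = female
```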
Synthetic Feature
The original fairlearn example generates a synthetic feature that partially encodes the gender of the credit applicant, potentially generating unfair predictions as a result. The balance-limit feature LIMIT_BAL is manipulated to make it highly predictive for the "female" group but not for the "male" group. Specifically, we set this up so that a lower credit limit indicates that a female client is less likely to default, but provides no information about a male client's probability of default.
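In the spirit of the fairlearn example (the seed and noise scale here are assumptions, not necessarily the values behind the plots below), the manipulation looks roughly like this:

```python
# Hedged sketch of the synthetic LIMIT_BAL feature, following the fairlearn example.
# Seed and noise scale are assumptions; df, y, A come from the loading sketch above.
import numpy as np

rng = np.random.default_rng(12345)
scale = 0.5

# For female clients, LIMIT_BAL tracks the default label plus noise,
# so a lower value signals a lower probability of default.
female = (A == "female")
df.loc[female, "LIMIT_BAL"] = y[female] + rng.normal(scale=scale, size=int(female.sum()))

# For male clients, LIMIT_BAL is pure noise and carries no signal about default.
male = ~female
df.loc[male, "LIMIT_BAL"] = rng.normal(scale=scale, size=int(male.sum()))
```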
The histogram on the left shows the distribution for LIMIT_BAL for Males who default (in orange) and do not default (in blue), while the histogram on the right shows the distributions for LIMIT_BAL for Females who default (in orange) and do not default (in blue).
[W&B panel: LIMIT_BAL histograms by default status, split by male/female — run set: 1 run]
Fairness Unaware
A simple XGBoost model was used to predict the default probability. The only precaution taken was to drop the SEX feature from the training dataset. This is immediately problematic: the model will pick up on LIMIT_BAL as a predictive feature and, once put into operation, will make unfair decisions.
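A hedged sketch of this baseline training step is shown below; the hyperparameters and train/test split are illustrative, not tuned settings from the report.

```python
# Hedged sketch of the fairness-unaware baseline; hyperparameters and the
# train/test split are illustrative. df, y, A come from the sketches above.
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Drop the sensitive feature (and the target, if present) from the inputs.
X = df.drop(columns=["SEX", "default payment next month"], errors="ignore")

X_train, X_test, y_train, y_test, A_train, A_test = train_test_split(
    X, y, A, test_size=0.3, random_state=12345, stratify=y
)

model = XGBClassifier(n_estimators=200, max_depth=4, eval_metric="logloss")
model.fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]   # predicted default probabilities
```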
To assess the fairness of the model, we use equalized odds difference, which quantifies the disparity in accuracy experienced by different demographics. Our goal is to assure that neither of the two groups ("male" vs "female") has substantially larger false-positive rates or false-negative rates than the other group. The equalized odds difference is equal to the larger of the following two numbers: (1) the difference between false-positive rates of the two groups, (2) the difference between false-negative rates of the two groups.
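With fairlearn, this metric (and the per-group rates that drive it) can be computed directly; the 0.5 decision threshold below is an assumption, and the variables continue from the training sketch above.

```python
# Computing the fairness metric described above with fairlearn.
from fairlearn.metrics import (MetricFrame, equalized_odds_difference,
                               false_negative_rate, false_positive_rate)

y_pred = (scores >= 0.5).astype(int)   # 0.5 threshold is an assumption

# Overall equalized odds difference: the larger of the FPR gap and the FNR gap.
eod = equalized_odds_difference(y_test, y_pred, sensitive_features=A_test)

# Per-group false-positive and false-negative rates behind that number.
mf = MetricFrame(
    metrics={"fpr": false_positive_rate, "fnr": false_negative_rate},
    y_true=y_test, y_pred=y_pred, sensitive_features=A_test,
)
print(eod)
print(mf.by_group)
```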
Exploring the model a bit deeper
Our synthetic feature is at the top of the list for both feature importance measures. Moreover, diving into the SHAP-based partial dependence plot and coloring the points blue (for male) and orange (for female), it should be clear that the distribution of SHAP values encodes gender into its effect: a larger LIMIT_BAL is more indicative of a female client and receives a larger SHAP value, pushing the score higher in logit space.
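A rough sketch of how such a SHAP dependence view can be produced and logged follows; the plotting details are illustrative, and `model`, `X_test`, `A_test` continue from the training sketch above.

```python
# Hedged sketch of the SHAP inspection; plotting details are illustrative.
import matplotlib.pyplot as plt
import shap
import wandb

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)        # one row per record, in log-odds units

# SHAP dependence for LIMIT_BAL, colored by sex (which is not a model input).
col = list(X_test.columns).index("LIMIT_BAL")
colors = A_test.map({"male": "tab:blue", "female": "tab:orange"})

fig, ax = plt.subplots()
ax.scatter(X_test["LIMIT_BAL"], shap_values[:, col], c=colors, s=4, alpha=0.5)
ax.set_xlabel("LIMIT_BAL")
ax.set_ylabel("SHAP value (log-odds)")
wandb.log({"limit_bal_shap_dependence": wandb.Image(fig)})
```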
[W&B panel: feature importances and SHAP dependence for LIMIT_BAL, colored by sex — run set: 1 run]
SHAP Embeddings
Considering SHAP values as supervised embeddings, we might get some interesting insights by projecting them into a lower-dimensional space. The plot below leverages UMAP to project the SHAP values into 2D, and each point is colored based on whether the client defaults or not.
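A hedged sketch of this projection is below; the UMAP settings are defaults/illustrative, and `shap_values`, `y_test` continue from the SHAP sketch above.

```python
# Hedged sketch of the SHAP "embedding": UMAP on the per-row SHAP vectors.
import matplotlib.pyplot as plt
import umap
import wandb

embedding = umap.UMAP(n_components=2, random_state=42).fit_transform(shap_values)

fig, ax = plt.subplots()
sc = ax.scatter(embedding[:, 0], embedding[:, 1], c=y_test, cmap="coolwarm", s=4, alpha=0.6)
fig.colorbar(sc, label="default next month")
wandb.log({"shap_umap": wandb.Image(fig)})
```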
[W&B panel: 2D UMAP projection of SHAP values, colored by default label — run set: 6 runs]
More Visuals
Below we explore SHAP values for the highest and lowest predictions, as well as an aggregate view of the SHAP values for a random sample of 1,000 scored records.
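One way to generate these views (a sketch continuing from the SHAP code above; the random sample and plot choices are illustrative):

```python
# Hedged sketch of the per-record and aggregate SHAP views;
# explainer, shap_values, scores, X_test come from the sketches above.
import numpy as np
import shap

hi, lo = int(np.argmax(scores)), int(np.argmin(scores))   # highest / lowest predictions

shap.force_plot(explainer.expected_value, shap_values[hi], X_test.iloc[hi],
                matplotlib=True, show=False)
shap.force_plot(explainer.expected_value, shap_values[lo], X_test.iloc[lo],
                matplotlib=True, show=False)

# Aggregate view over a random sample of 1,000 scored records.
idx = np.random.default_rng(0).choice(len(X_test), size=1000, replace=False)
shap.summary_plot(shap_values[idx], X_test.iloc[idx], show=False)
```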
[W&B panel: SHAP explanations for the highest/lowest predictions and aggregate view — run set: 1 run]
Exploring Fairness and Mitigation
Mitigation methods
- ThresholdOptimizer - This algorithm finds a suitable threshold for the scores (class probabilities) produced by the XGBoost model by optimizing the accuracy rate under the constraint that the equalized odds difference (on training data) is zero. Since our goal is to optimize balanced accuracy, we resampled the training data to have the same number of positive and negative examples; this means that ThresholdOptimizer is effectively optimizing balanced accuracy on the original data.
- GridSearch - With the GridSearch algorithm, we trained multiple models that balance the trade-off between balanced accuracy and the equalized odds fairness metric. This can result in lower accuracy and some remaining fairness violations. (A code sketch of both approaches follows this list.)
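A hedged sketch of both mitigation approaches with fairlearn: the balanced resample (`X_train_bal`, `y_train_bal`, `A_train_bal`) and the grid size are assumptions, and `model`, `X_test`, `A_test` continue from the earlier sketches.

```python
# Hedged sketch of the two mitigation approaches with fairlearn.
from fairlearn.postprocessing import ThresholdOptimizer
from fairlearn.reductions import EqualizedOdds, GridSearch
from xgboost import XGBClassifier

# Post-processing: choose per-group score thresholds so that the
# equalized odds difference on the (resampled) training data is zero.
to = ThresholdOptimizer(
    estimator=model,
    constraints="equalized_odds",
    prefit=True,                      # reuse the already-trained XGBoost model
    predict_method="predict_proba",
)
to.fit(X_train_bal, y_train_bal, sensitive_features=A_train_bal)  # resampled data (assumed)
y_pred_mitigated = to.predict(X_test, sensitive_features=A_test)

# Reductions: sweep a grid of reweighted models trading accuracy against
# the equalized odds constraint, then compare the resulting predictors.
gs = GridSearch(
    estimator=XGBClassifier(eval_metric="logloss"),
    constraints=EqualizedOdds(),
    grid_size=20,                     # illustrative grid size
)
gs.fit(X_train_bal, y_train_bal, sensitive_features=A_train_bal)
grid_predictions = [p.predict(X_test) for p in gs.predictors_]
```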
The metrics below present details on how fair the model is with respect to the protected class sex.
With the unaware model, we see a significant equalized odds difference, driven by the difference in false-negative rates between females and males.
The ThresholdOptimizer algorithm significantly reduces the disparity according to multiple metrics. However, the performance metrics (balanced error rate as well as AUC) get worse. Before deploying such a model in practice, it would be important to examine in more detail why we observe such a sharp trade-off. In our case it is because the available features are much less informative for one of the demographic groups than for the other.
Note that unlike the unmitigated model, ThresholdOptimizer produces 0/1 predictions, so its balanced error rate difference is equal to the AUC difference, and its overall balanced error rate is equal to 1 - overall AUC.
The fairness-aware model is the result of the GridSearch algorithm, and it yields better performance than the thresholded approach without sacrificing too much fairness.
Unaware vs Aware Model
[W&B panel: fairness and performance metrics for the unaware vs. aware models — run set: 2 runs]