
Model Risk Management with W&B

ML systems depend on data, and algorithms can amplify bias or inaccuracies in that data, creating disparate impact. This report walks through model development, fairness assessment, and mitigation with Fairlearn, using W&B for experiment tracking, reporting, and documentation.
Created on December 5|Last edited on December 18


Intro

Data science and machine learning permeate decision-making processes, from the mundane to the critical. Because of the risk AI systems pose to organizations, strong model risk management is essential.
At a glance, it should not be hard to see how AI can introduce problems for the people and processes that consume its outputs.
Per the latest guidance from NIST on pursuing trustworthy AI systems, we strive for systems that are: valid and reliable; safe; fair, with bias managed; secure and resilient; accountable and transparent; and explainable.
These concerns arise wherever systems use the outputs of AI. We need to understand how ML uses data in ways that have immediate impacts on people's lives, and how ML can affect the fairness of a decision or create disparity in how it is used.
Simply put, unfairness can be described as an instance of a model privileging one category over another. Consider LinkedIn, which would recommend male variations of women's names in response to a search query, or Amazon's use of AI to screen job applications, which was discovered to be biased against women.
In this report, we will take a casual walk through a data science workflow, identify some critical moments that could result in algorithmic bias (unfair outcomes), and discuss how to evaluate them and potential mitigation strategies.
When considering the general risks that financial institutions face on a daily basis (credit, market, operational, and so on), model risk deserves the same scrutiny as the rest.



Key Requirements of AI Systems

  • Understand your data origins
  • Don't use PII
  • Build explainable / interpretable AI
  • Documentation is critical

W&B can be the System of Record for ALL ML Workstreams



Understand data origins and model lineage via Experiments and Artifacts. Traverse the model lineage graph to the "left" to find the input datasets!

[Direct lineage view for calibrated_model:v0. The graph traces the calibrated model back through its training runs (feasible-dragon-6, denim-fog-8), the fairness-assessment runs that consumed it, the preprocessing-pipeline and data-preprocess runs, the preprocessing-transformer pipeline artifacts, the preprocessed train/validation feature sets, the 27t8mgbl_model.json booster, and the train/validation/test dataset splits produced by the data-split runs.]
Explore the model performance and configurations that were captured by W&B!
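As a minimal sketch of how the configurations and metrics in these panels might have been logged (the project name, hyperparameters, and metric keys below are illustrative, not the report's actual values):

```python
import wandb
from sklearn.metrics import roc_auc_score, balanced_accuracy_score

# Hypothetical project and hyperparameters, for illustration only.
run = wandb.init(
    project="model-risk-management",
    job_type="training",
    config={"max_depth": 4, "n_estimators": 200, "learning_rate": 0.1},
)

# ... fit the model, then score a validation set ...
# In practice y_val, y_prob, y_pred come from the real validation step.
y_val, y_prob, y_pred = [0, 1, 1, 0], [0.2, 0.7, 0.6, 0.4], [0, 1, 1, 0]

# Metrics logged here are what populate the performance panels above.
run.log({
    "val/auc": roc_auc_score(y_val, y_prob),
    "val/balanced_accuracy": balanced_accuracy_score(y_val, y_pred),
})
run.finish()
```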


Review the model interpretability visuals captured by W&B!




Motivating Example

Data Science Workflows

What is the goal of data science? Typically it is about extracting value from data, and that value can come in many different forms with many different requirements. First and foremost is the requirement of understanding the business and the flow of information, and using that understanding to identify opportunities for the effective use of said information. It could mean:
  • developing models used in pipelines to automate decisions such as credit limit increases,
  • creating recommendation systems based on retail transaction history,
  • or many other applications.
Now, when considering the goal, it makes sense to consider the process used to meet it, and we can typically break this process into several stages. One fairly standard decomposition might involve:
  • Business Understanding
  • Data Acquisition and Understanding
  • Model Development
  • Model Deployment

Now, with these stages in mind, there are a lot of places where things can go wrong 🤕, and they can go wrong in widely different ways. But there are also a lot of places where we can catch problems.
For the remainder of this report, we will consider the issue of fairness and the areas in our workflow that need special attention when there is concern that the final product could be unfair with regard to how its output is used:
  • Data Acquisition and Understanding
  • Modeling
  • Deployment

Fairness - as it relates to

Data Acquisition and Understanding

When curating a dataset for modeling, the project team must have immediate command over which variables will be used. As mentioned earlier, for some use cases the development team will have a set of restricted features which they will NOT be able to use to train ML that is incorporated into decision processes. This means that model development actually starts with compliance: compliance should be able to assist in identifying variables that CANNOT be used. For models that could be used to assist in credit transactions, the regulatory guidance is quite clear, and you must follow it.
But, as you can imagine, there will be instances where the feature in question is not black or white, but some shade of grey. In my experience, strong data understanding and a feedback loop with compliance are paramount to ensuring you stay compliant.
A simple pre-modeling process, once protected features have been identified and an initial modeling dataset has been curated, is to complete an assessment of correlation or mutual information (i.e., the mutual dependence between two random variables) between the protected features and the candidate features, as in the sketch below.
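Here is a minimal sketch of such an assessment, assuming a file named modeling_dataset.csv with SEX as the protected attribute and a default label column (all hypothetical names, substitute your own):

```python
import pandas as pd
import wandb
from sklearn.feature_selection import mutual_info_classif

# Hypothetical column names: SEX is the protected attribute,
# "default" is the label, everything else is a candidate feature.
df = pd.read_csv("modeling_dataset.csv")
protected = "SEX"
candidates = [c for c in df.columns if c not in (protected, "default")]

# Pearson correlation between each candidate feature and the protected class.
corr = df[candidates].corrwith(df[protected]).rename("pearson_corr")

# Mutual information also captures non-linear dependence.
mi = pd.Series(
    mutual_info_classif(df[candidates], df[protected], random_state=0),
    index=candidates,
    name="mutual_info",
)

# Log the assessment to W&B as a table for review with compliance.
report = pd.concat([corr, mi], axis=1).rename_axis("feature").reset_index()
with wandb.init(project="model-risk-management", job_type="pre-modeling-checks") as run:
    run.log({"protected_feature_dependence": wandb.Table(dataframe=report)})
```

Features with high correlation or mutual information against the protected class are the ones to flag early, either for removal or for closer post-modeling assessment.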





I would say this is a fairly informative assessment, meant to give you an idea of things that could go wrong. Either you nip suspect variables in the bud, or you include them, complete assessments once model outputs are available, and pursue mitigation strategies.

Modeling (Development)

Consider for the moment that we accept there is a relationship between our protected class and an attribute we want to use as a feature in our model. We can move forward using the feature, but we had better be prepared to assess the predictions made by our model to understand whether unfairness is introduced.
In the modeling stage of the life cycle we need to decide which measure of fairness to focus on. For most purposes it is enough to focus entirely on predictions and protected classes, meaning we do not need the model's ground truth. This setup is extremely advantageous when we move into deployment.

Some Metrics

  • Equal Parity - For each protected class, what is the total number of records with favorable predictions from the model? This metric is based on equal representation of the model's target across protected classes.
  • Proportional Parity - For each protected class, what is the probability of receiving favorable predictions from the model? This metric (also known as "Statistical Parity" or "Demographic Parity") is based on equal representation of the model's target across protected classes.
  • Equalized Odds - For each protected class, a comparison of the true and false positive rates between classes
  • Prediction Balance - For all actuals that were favorable/unfavorable outcomes, what is the average predicted probability for each protected class? This metric is based on equal representation of the model's average raw scores across each protected class and is part of the set of Prediction Balance fairness metrics
  • True Favorable / True Unfavorable Rate Parity - For each protected class, what is the probability of the model predicting the favorable/unfavorable outcome for all actuals of the favorable/unfavorable outcome? This metric is based on equal error rates across protected classes.
Not every one of these metrics applies to every use case; the right choice depends on the decision being supported. The sketch below shows how a few of them can be computed per protected class.
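Several of the metrics above can be computed per group with Fairlearn's MetricFrame. The arrays below are toy values for illustration only; in practice they would be the validation labels, model predictions, and protected attribute:

```python
from fairlearn.metrics import (
    MetricFrame,
    selection_rate,
    true_positive_rate,
    false_positive_rate,
)

# Toy arrays for illustration; substitute real labels, predictions,
# and the protected attribute from the validation set.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]
sex = ["F", "F", "F", "F", "M", "M", "M", "M"]

mf = MetricFrame(
    metrics={
        "selection_rate": selection_rate,            # proportional / demographic parity
        "true_positive_rate": true_positive_rate,    # true favorable rate parity
        "false_positive_rate": false_positive_rate,  # one half of equalized odds
    },
    y_true=y_true,
    y_pred=y_pred,
    sensitive_features=sex,
)

print(mf.by_group)      # metric value per protected class
print(mf.difference())  # largest between-group gap for each metric
```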

Fairness Unaware

A simple XGBoost model was used to predict the default probability. The only care taken was to drop the SEX feature from the training dataset. This is immediately problematic, as our model will pick up on LIMIT_BAL as a predictive feature and, moreover, will make unfair decisions once put into operation.
To assess the fairness of the model, we use equalized odds difference, which quantifies the disparity in accuracy experienced by different demographics. Our goal is to assure that neither of the two groups ("male" vs "female") has substantially larger false-positive rates or false-negative rates than the other group. The equalized odds difference is equal to the larger of the following two numbers: (1) the difference between false-positive rates of the two groups, (2) the difference between false-negative rates of the two groups.
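A minimal, self-contained sketch of this fairness-unaware setup and its assessment follows. The data is a synthetic stand-in for the credit dataset (column names mirror it, values are random), not the report's actual data:

```python
import numpy as np
import pandas as pd
import wandb
import xgboost as xgb
from fairlearn.metrics import equalized_odds_difference
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the credit dataset, for illustration only.
rng = np.random.default_rng(0)
n = 2000
X = pd.DataFrame({
    "SEX": rng.integers(1, 3, n),                  # 1 = male, 2 = female
    "LIMIT_BAL": rng.normal(150_000, 50_000, n),
    "AGE": rng.integers(21, 70, n),
})
y = (rng.random(n) < 0.25).astype(int)             # default flag

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Fairness-unaware model: SEX is dropped from training,
# but kept aside as the sensitive feature for the assessment.
model = xgb.XGBClassifier(n_estimators=100, max_depth=4)
model.fit(X_tr.drop(columns=["SEX"]), y_tr)
y_pred = model.predict(X_te.drop(columns=["SEX"]))

# Larger of the between-group FPR gap and FNR gap.
eod = equalized_odds_difference(y_te, y_pred, sensitive_features=X_te["SEX"])

with wandb.init(project="model-risk-management", job_type="fairness-assessment") as run:
    run.log({"equalized_odds_difference": eod})
```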



Exploring the Model a Bit Deeper

Our synthetic feature is at the top of the list for both feature importance measures. Moreover, diving into the SHAP-based partial dependence plot and coloring the points blue (male) and orange (female), it should be clear that the distribution of SHAP values encodes gender into its effect: a larger LIMIT_BAL is more indicative of female and carries a larger SHAP value, making the score in logit space larger.
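Continuing the synthetic sketch above, a dependence view like this could be produced and logged roughly as follows (for a binary XGBoost classifier, TreeExplainer returns one SHAP value per feature per row; assumed here):

```python
import matplotlib.pyplot as plt
import shap
import wandb

# `model` and `X_te` come from the earlier synthetic training sketch.
features = X_te.drop(columns=["SEX"])
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(features)  # shape (n_rows, n_features) assumed

feature_idx = list(features.columns).index("LIMIT_BAL")
colors = X_te["SEX"].map({1: "tab:blue", 2: "tab:orange"})  # male / female

fig, ax = plt.subplots()
ax.scatter(X_te["LIMIT_BAL"], shap_values[:, feature_idx], c=colors, s=8)
ax.set_xlabel("LIMIT_BAL")
ax.set_ylabel("SHAP value (logit)")

with wandb.init(project="model-risk-management", job_type="fairness-assessment") as run:
    run.log({"shap_dependence_limit_bal": wandb.Image(fig)})
plt.close(fig)
```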



Shap Embeddings

Treating SHAP values as supervised embeddings, we might get some interesting insights by completing a lower-dimensional projection of them. The plot below leverages UMAP to project the SHAP values into a 2D space, and we color each point based on whether the account defaults or not. The SHAP values were logged to W&B as a W&B Table, and the lower-dimensional projection occurred within W&B.
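Continuing the sketch, logging the per-record SHAP values (plus the label) as a W&B Table is all the code needed; the 2D projection itself is then configured in the W&B UI:

```python
import pandas as pd
import wandb

# `shap_values`, `X_te`, and `y_te` come from the earlier sketches.
feature_cols = list(X_te.drop(columns=["SEX"]).columns)
shap_df = pd.DataFrame(shap_values, columns=feature_cols)
shap_df["default"] = y_te

with wandb.init(project="model-risk-management", job_type="fairness-assessment") as run:
    run.log({"shap_embeddings": wandb.Table(dataframe=shap_df)})
```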



More Visuals

Below we explore SHAP values for the highest and lowest predictions, as well as the aggregate view of SHAP values for a random sample of 1,000 scored records. The visuals were created outside of W&B and then logged to W&B as HTML.
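A sketch of that pattern, again continuing the synthetic example (it assumes TreeExplainer's expected_value is a scalar for this binary model):

```python
import shap
import wandb

# `model`, `explainer`, `shap_values`, `X_te` come from the earlier sketches.
features = X_te.drop(columns=["SEX"])
scores = model.predict_proba(features)[:, 1]
records = {"highest_score": scores.argmax(), "lowest_score": scores.argmin()}

# Build the force plots outside W&B and save them to HTML files.
for name, idx in records.items():
    plot = shap.force_plot(explainer.expected_value, shap_values[idx], features.iloc[idx])
    shap.save_html(f"{name}.html", plot)

# Then log the HTML files to W&B.
with wandb.init(project="model-risk-management", job_type="fairness-assessment") as run:
    for name in records:
        with open(f"{name}.html") as f:
            run.log({f"shap/{name}": wandb.Html(f.read())})
```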



Deployment

Exploring Fairness and Mitigation

There are many mitigation methods available when bias is detected in a model; we list some interesting ones provided by Fairlearn below (a sketch applying the first two follows the list).
  • ThresholdOptimizer - A new classifier is obtained by applying group-specific thresholds to the provided estimator. The thresholds are chosen to optimize the provided performance objective subject to the provided fairness constraints.
  • Grid Search - An estimator is provided, and Grid Search generates a sequence of relabelings and reweightings, and trains a predictor for each based on the user specified fairness constraint.
  • Correlation Removal - A preprocessing algorithm that removes correlation between sensitive features and non-sensitive features through linear transformations.
  • Adversarial Fairness Classifier (or Regressor) - This method trains a neural network that minimizes training error while preventing an adversarial network from inferring sensitive features.
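A sketch of the first two mitigations, applied to the synthetic fairness-unaware model from earlier (constraint and objective choices here are illustrative):

```python
import xgboost as xgb
from fairlearn.postprocessing import ThresholdOptimizer
from fairlearn.reductions import EqualizedOdds, GridSearch

# `model`, `X_tr`, `X_te`, `y_tr` come from the earlier synthetic sketch.
features_tr = X_tr.drop(columns=["SEX"])
features_te = X_te.drop(columns=["SEX"])

# Post-processing: group-specific thresholds on top of the trained unaware model.
postprocessed = ThresholdOptimizer(
    estimator=model,
    constraints="equalized_odds",
    objective="balanced_accuracy_score",
    prefit=True,
    predict_method="predict_proba",
)
postprocessed.fit(features_tr, y_tr, sensitive_features=X_tr["SEX"])
y_pred_threshold = postprocessed.predict(features_te, sensitive_features=X_te["SEX"])

# In-processing: GridSearch trains a family of reweighted models subject to
# an equalized-odds constraint and selects a trade-off among them.
sweep = GridSearch(
    estimator=xgb.XGBClassifier(n_estimators=100, max_depth=4),
    constraints=EqualizedOdds(),
    grid_size=20,
)
sweep.fit(features_tr, y_tr, sensitive_features=X_tr["SEX"])
y_pred_aware = sweep.predict(features_te)
```

Both sets of predictions can then be assessed with the same equalized odds difference metric used earlier and logged to W&B for the run comparison below.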
Unaware vs Aware Models
The Run Comparer below compares metrics from the fairness-unaware model, the fairness-unaware model after thresholding, and a fairness-aware model found via GridSearch. The thresholded model seems to be the most fair, but there is certainly a sacrifice in model accuracy compared to the fairness-aware method.


With the unaware model, we see a significant equalized odds difference (driven by the difference in FNR between females and males).
The ThresholdOptimizer algorithm significantly reduces the disparity according to multiple metrics. However, the performance metrics (balanced error rate as well as AUC) get worse. Before deploying such a model in practice, it would be important to examine in more detail why we observe such a sharp trade-off. In our case it is because the available features are much less informative for one of the demographic groups than for the other.
Note that unlike the unmitigated model, ThresholdOptimizer produces 0/1 predictions, so its balanced error rate difference is equal to the AUC difference, and its overall balanced error rate is equal to 1 - overall AUC.
The fairness-aware model is the result of the GridSearch algorithm, and it yields better performance than the thresholded approach without sacrificing too much fairness.



Below we review the feature importance of the fairness-unaware model vs. the fairness-aware model. It is key to point out that the LIMIT_BAL feature is not as important in the aware model, which is good news, but we do sacrifice a bit of performance when it comes to AUC.
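One way such a comparison could be logged, continuing the synthetic sketch (shown here only for the unaware model; the same call can be repeated for the predictor selected by GridSearch):

```python
import pandas as pd
import wandb

# `model` is the unaware XGBoost classifier from the earlier sketch.
gain = model.get_booster().get_score(importance_type="gain")
table = wandb.Table(
    dataframe=pd.DataFrame({"feature": list(gain), "gain": list(gain.values())})
)

with wandb.init(project="model-risk-management", job_type="fairness-assessment") as run:
    run.log({
        "importance/unaware": wandb.plot.bar(table, "feature", "gain", title="Gain importance (unaware)")
    })
```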




Model Artifact

During the course of our training experiments, we have logged the serialized model as an Artifact to W&B. By utilizing a W&B Weave Query, we can visualize the lineage of the model and explore the model and all related experiments that have been logged to W&B.
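A minimal sketch of how the serialized model could have been logged as an Artifact (file and artifact names are illustrative and reuse the XGBoost model from the earlier synthetic sketch):

```python
import wandb

# Serialize the model with XGBoost's native format.
model.save_model("calibrated_model.json")

with wandb.init(project="model-risk-management", job_type="training") as run:
    artifact = wandb.Artifact("calibrated_model", type="model")
    artifact.add_file("calibrated_model.json")
    run.log_artifact(artifact)
```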

[Direct lineage view for calibrated_model:v0, identical to the lineage graph shown earlier in the report: the model traces back through the training, preprocessing-pipeline, data-preprocess, data-split, and fairness-assessment runs and their dataset and pipeline artifacts.]
The next step in the process would be to promote the model to a model registry and make it available for the purposes of production inference.
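A sketch of that promotion step, with a hypothetical registry path:

```python
import wandb

# Hypothetical registry path; linking the vetted artifact makes it
# available for production inference workflows.
with wandb.init(project="model-risk-management", job_type="promote-model") as run:
    artifact = run.use_artifact("calibrated_model:v0", type="model")
    run.link_artifact(artifact, "model-registry/credit-default-model")
```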