
AutoML + W&B

Created on July 28|Last edited on November 2


Executive Summary

Consider a scenario in which algorithmic tools are deployed to predict whether a healthcare provider is likely to be filing fraudulent Medicare claims. In this experiment, we explore AutoML with H2O and use W&B experiment tracking, Tables, Artifacts, and the Model Registry to keep track of the project.

Data Overview

Provider fraud is one of the biggest problems facing Medicare. According to the government, total Medicare spending has risen sharply because of fraudulent Medicare claims. Healthcare fraud is an organized crime in which providers, physicians, and beneficiaries act together to file fraudulent claims.
Rigorous analysis of Medicare data has identified many physicians who engage in fraud. They use ambiguous diagnosis codes to justify the costliest procedures and drugs. Insurance companies are the institutions most exposed to these practices. As a result, they have raised their premiums, and healthcare is becoming more expensive by the day.
Healthcare fraud and abuse take many forms. Some of the most common types of provider fraud are:
  • Billing for services that were not provided.
  • Duplicate submission of a claim for the same service.
  • Misrepresenting the service provided.
  • Charging for a more complex or expensive service than was actually provided.
  • Billing for a covered service when the service actually provided was not covered.

Problem Statement

The goal of this project is to predict potentially fraudulent providers based on the claims they file. Along the way, we will identify the variables that are most helpful in detecting potentially fraudulent providers. Further, we will study fraudulent patterns in providers' claims to understand their future behaviour.
For the purpose of this project, we consider inpatient claims and outpatient claims. Let's look at each dataset in detail:

Inpatient Data

This data provides insight into the claims filed for patients who were admitted to a hospital. It also provides additional details such as admission and discharge dates and the admit diagnosis code.

Outpatient Data

This data provides details about the claims filed for patients who visited a hospital but were not admitted.

Beneficiary Details Data (Not used here)

This data contains beneficiary KYC details such as health conditions and the region they belong to.

Provider Data

Contains the Provider ID along with a 1/0 label indicating whether or not the provider committed fraud.

Primary and Secondary Datasets

Our primary dataset is the Provider dataset. It is at the level we wish to model and contains the target feature.
The secondary datasets are all the other datasets. They are not necessarily at the same level as the provider dataset, but we want to join them to the primary dataset before modeling. Before that, we need to complete some feature engineering.
Inpatient and outpatient claims map many-to-one onto providers, so we need to create new features (aggregations) of the claim data that sit at the same level of observation as the provider data.
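To make the level-of-observation point concrete, here is a minimal sketch that collapses claim rows into one row per provider and joins the result onto the provider table. The toy rows and the column names (`AmtReimbursed`, `PotentialFraud`) are assumptions for illustration, not the real schema, and the code uses only the standard library to stay self-contained.

```python
from collections import defaultdict

# Hypothetical toy rows; column names are assumptions, not the real schema.
providers = [
    {"Provider": "PRV001", "PotentialFraud": 1},
    {"Provider": "PRV002", "PotentialFraud": 0},
]
inpatient_claims = [
    {"ClaimID": "C1", "Provider": "PRV001", "AmtReimbursed": 4000.0},
    {"ClaimID": "C2", "Provider": "PRV001", "AmtReimbursed": 6000.0},
    {"ClaimID": "C3", "Provider": "PRV002", "AmtReimbursed": 1500.0},
]


def aggregate_claims(claims, prefix):
    """Collapse many claim rows into one feature row per provider."""
    amounts = defaultdict(list)
    for claim in claims:
        amounts[claim["Provider"]].append(claim["AmtReimbursed"])
    return {
        provider: {
            f"{prefix}_claim_count": len(values),
            f"{prefix}_amt_total": sum(values),
            f"{prefix}_amt_mean": sum(values) / len(values),
        }
        for provider, values in amounts.items()
    }


def join_to_providers(provider_rows, features):
    """Left-join the aggregated features onto the provider-level table."""
    return [{**row, **features.get(row["Provider"], {})} for row in provider_rows]


inpatient_features = aggregate_claims(inpatient_claims, "inpatient")
model_table = join_to_providers(providers, inpatient_features)
```

After the join, `model_table` has exactly one row per provider, which is the shape the target label requires.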



Feature Engineering


To automate feature engineering, we'll use Featuretools, an open source Python framework for automated feature engineering built on Deep Feature Synthesis (DFS). It lets you combine your raw data with what you know about that data to build meaningful features for machine learning and predictive modeling.
Within Featuretools, we create an entity set (a set of datasets and the relationships between them), then call Deep Feature Synthesis to create new features.
Below is a glimpse of the entity set used.
Entityset: Provider
Entities:
provider [Rows: 5410, Columns: 1]
outpatient_claims [Rows: 517737, Columns: 27]
inpatient_claims [Rows: 40474, Columns: 30]
beneficiary [Rows: 138556, Columns: 25]
Relationships:
outpatient_claims.Provider -> provider.Provider
inpatient_claims.Provider -> provider.Provider
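As a rough illustration of what Deep Feature Synthesis does with the two relationships above, the sketch below applies a handful of aggregation primitives to every numeric column of every child table. This is not the Featuretools API: it is a library-free imitation, the `synthesize` helper and the `AmtReimbursed` column are hypothetical, and only the `PRIMITIVE(table.column)` naming style is borrowed from Featuretools.

```python
from collections import defaultdict
from statistics import mean

# Aggregation primitives, keyed by the upper-case name used in feature labels.
PRIMITIVES = {"SUM": sum, "MEAN": mean, "MAX": max}


def synthesize(child_tables, key, numeric_columns):
    """Apply every primitive to every numeric column of every child table."""
    features = defaultdict(dict)
    for table_name, rows in child_tables.items():
        # Collect the values of each numeric column per parent id.
        grouped = defaultdict(lambda: defaultdict(list))
        for row in rows:
            for column in numeric_columns[table_name]:
                grouped[row[key]][column].append(row[column])
        # One feature per (primitive, table, column) combination.
        for parent_id, columns in grouped.items():
            for column, values in columns.items():
                for name, func in PRIMITIVES.items():
                    label = f"{name}({table_name}.{column})"
                    features[parent_id][label] = func(values)
    return dict(features)


# Hypothetical toy tables mirroring the two child-to-provider relationships.
child_tables = {
    "inpatient_claims": [
        {"Provider": "PRV001", "AmtReimbursed": 4000.0},
        {"Provider": "PRV001", "AmtReimbursed": 6000.0},
    ],
    "outpatient_claims": [
        {"Provider": "PRV001", "AmtReimbursed": 200.0},
        {"Provider": "PRV002", "AmtReimbursed": 50.0},
    ],
}
numeric_columns = {
    "inpatient_claims": ["AmtReimbursed"],
    "outpatient_claims": ["AmtReimbursed"],
}
features = synthesize(child_tables, "Provider", numeric_columns)
```

The feature count grows as primitives × columns × child tables, which is how a modest entity set can produce a large number of candidate features.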

Data Exploration

Deep Feature Synthesis (DFS) generated 160 new features based on the provided entity set.
Courtesy of evidently.ai, we can explore some of those features. The data was split 70-20-10 into training, validation, and testing datasets. The exploration below was completed on the training data. Several dashboards were created, each containing at most 20 numeric fields and 20 categorical fields.
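A 70-20-10 split can be sketched as follows. The report does not say whether the split was random, seeded, or stratified, so the seeded shuffle here is an assumption, and `split_70_20_10` is a hypothetical helper. Integer arithmetic is used for the cut points to avoid floating-point rounding on the boundaries.

```python
import random


def split_70_20_10(rows, seed=0):
    """Shuffle rows reproducibly and cut them into train/validation/test."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 70 // 100
    n_val = len(shuffled) * 20 // 100
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains, roughly 10%
    return train, val, test


# 5410 matches the provider row count shown in the entity set;
# the ids themselves are made up for the example.
provider_ids = [f"PRV{i:05d}" for i in range(5410)]
train, val, test = split_70_20_10(provider_ids)
```

Splitting at the provider level, after aggregation, keeps all claims from one provider inside a single partition.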



Model Overview

H2O AutoML was heavily used to surface quality models. The best model identified was a Gradient Boosting Machine, which is a forward learning ensemble method. The guiding heuristic is that good predictive results can be obtained through increasingly refined approximations. H2O’s GBM sequentially builds regression trees on all the features of the dataset in a fully distributed way - each tree is built in parallel.
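The idea of "increasingly refined approximations" can be shown with a toy booster. This is a minimal sketch, not H2O's implementation: it uses one-split regression stumps on a single feature with squared-error loss, whereas a fraud classifier would use a classification loss, and `fit_stump`, `boost`, and the data are all made up for illustration.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split on x that minimizes squared error."""
    best = None
    # Skip the largest value so the right-hand side is never empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left) + sum(
            (r - right_mean) ** 2 for r in right
        )
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_trees=20, learning_rate=0.3):
    """Fit stumps one after another, each to the current ensemble's residuals."""
    baseline = sum(ys) / len(ys)
    stumps = []

    def predict(x):
        return baseline + learning_rate * sum(stump(x) for stump in stumps)

    history = []  # training mean squared error before each new stump
    for _ in range(n_trees):
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        history.append(sum(r * r for r in residuals) / len(ys))
        stumps.append(fit_stump(xs, residuals))
    return predict, history


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.2, 1.9, 3.2, 3.8, 5.1, 6.0]
predict, history = boost(xs, ys)
```

Each entry in `history` is no larger than the one before it, because every new stump is fit to what the existing ensemble still gets wrong.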

Model Performance

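Tracking below was completed with W&B via a function that took as arguments an H2O model, an H2OFrame of test data, a W&B run, the number of top features to report, and a flag controlling whether the model is saved.
# Function to log H2O estimator details: feature importance, partial dependence plots, the model, and metrics.
import regex as re  # third-party `regex` package, a drop-in replacement for the stdlib `re`

def h2o_estimator(estimator, test_h2oframe, wandb_run, n_top_features, save_model=True):
    # log model
    # log feature importance
    # log partial dependence plots
    # log metrics
    ...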
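The body of that function is not shown in this report. As a hedged sketch of two of its four steps (metrics and feature importance), the hypothetical helper below relies on H2O's `model_performance`, `auc`, `logloss`, and `varimp` methods and on the W&B run's `log` method. The metric key names are assumptions, and the stand-in classes exist only so the example runs without H2O or W&B installed.

```python
def log_h2o_metrics(estimator, test_h2oframe, wandb_run, n_top_features=10):
    """Log test metrics and top feature importances for a binomial H2O model."""
    performance = estimator.model_performance(test_h2oframe)
    metrics = {
        "test/auc": performance.auc(),
        "test/logloss": performance.logloss(),
    }
    # varimp rows are (variable, relative, scaled, percentage) tuples;
    # some model types return None, so fall back to an empty list.
    importances = estimator.varimp(use_pandas=False) or []
    for variable, _relative, scaled, _percentage in importances[:n_top_features]:
        metrics[f"varimp/{variable}"] = scaled
    wandb_run.log(metrics)
    return metrics


# Stand-ins (not real H2O or W&B objects) with made-up numbers.
class FakePerformance:
    def auc(self):
        return 0.9

    def logloss(self):
        return 0.3


class FakeEstimator:
    def model_performance(self, frame):
        return FakePerformance()

    def varimp(self, use_pandas=False):
        return [("feat_a", 10.0, 1.0, 0.5), ("feat_b", 5.0, 0.5, 0.25),
                ("feat_c", 5.0, 0.5, 0.25)]


class FakeRun:
    def __init__(self):
        self.logged = []

    def log(self, data):
        self.logged.append(data)


run = FakeRun()
logged = log_h2o_metrics(FakeEstimator(), None, run, n_top_features=2)
```

Passing the run in as an argument, as the original signature does, keeps the helper independent of how and where the W&B run was started.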






XGBoost_1_AutoML_3_20220728_181021
Version overview
Full Name: wandb-smle/h2o-autoML-classification/XGBoost_1_AutoML_3_20220728_181021:v0
Aliases: latest, v0
Digest: 741de27053e5f1f5172d148ef7541fa5
Created At: July 28th, 2022 20:03:36
Num Consumers: 0
Num Files: 3
Size: 3.9MB
TTL Remaining: Inactive

Feature Importance



Partial Dependence




Challengers

H2O AutoML


Run set
45


Model Registry Element


Fraud Detection LID 62a88bde90d0f2003c6a7bf9
Direct lineage view
Artifact - data: test-data:v0
Artifact - model: GLM_1_AutoML_3_20220728_181021:v0
Artifact - leaderboard: leaderboard:v0
Artifact - data: val-data:v0
Artifact - data: processed-data:v0
Artifact - data: train-data:v0
Run - automl-eval: GLM_1_AutoML_3_20220728_181021
Run - automl-run: rosy-cherry-5
Run - train-val-test-split: swift-dust-3
Run - register-model: project3-example-run
Runs (5) - feature-engineering: pious-plant-2, classic-cherry-31, true-haze-51, noble-yogurt-82, trim-totem-86








