AutoML + W&B
Created on July 28|Last edited on November 2
Executive Summary
Consider a scenario where algorithmic tools are deployed to predict the likelihood that a healthcare provider is filing fraudulent Medicare claims. In this experiment, we explore AutoML with H2O and use W&B Experiments, Tables, Artifacts, and the Model Registry to keep track of the project.
Data Overview
Provider fraud is one of the biggest problems facing Medicare. According to the government, total Medicare spending has increased sharply because of fraudulent Medicare claims. Healthcare fraud is an organized crime that involves providers, physicians, and beneficiaries acting together to file fraudulent claims.
Rigorous analysis of Medicare data has identified many physicians who engage in fraud. One common approach is to use an ambiguous diagnosis code to justify the costliest procedures and drugs. Insurance companies are the institutions most affected by these practices. As a result, they have raised their premiums, and healthcare is becoming more expensive day by day.
Healthcare fraud and abuse take many forms. Some of the most common types of provider fraud are:
- Billing for services that were not provided.
- Duplicate submission of a claim for the same service.
- Misrepresenting the service provided.
- Charging for a more complex or expensive service than was actually provided.
- Billing for a covered service when the service actually provided was not covered.
Problem Statement
The goal of this project is to predict potentially fraudulent providers based on the claims they file. Along the way, we will identify the variables that are most helpful in detecting potentially fraudulent providers. Further, we will study fraudulent patterns in providers' claims to understand their likely future behaviour.
For the purpose of this project, we consider inpatient claims and outpatient claims. Let's look at their details:
Inpatient Data
This data provides insights about the claims filed for patients who are admitted to hospitals. It also provides additional details such as admission and discharge dates and the admit diagnosis code.
Outpatient Data
This data provides details about the claims filed for patients who visit hospitals but are not admitted.
Beneficiary Details Data (Not used here)
This data contains beneficiary KYC details such as health conditions, the region they belong to, etc.
Provider Data
This data contains the Provider ID along with a 1/0 label indicating whether or not the provider committed fraud.
Primary and Secondary Datasets
Our primary dataset is the Provider data set. This dataset is at the level we wish to model and contains the target feature.
The secondary datasets are all other datasets. They are not necessarily at the same level as our provider dataset, but we do wish to join them to our primary dataset before we begin modeling. Before that, we should complete some feature engineering.
The inpatient and outpatient claim data map many-to-one onto providers, so we need to create new features (aggregations) of the claim data so that it is at the same level of observation as our provider data.
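As a rough illustration of that roll-up, here is a minimal pandas sketch on toy data. Only the `Provider` key comes from the datasets described here; the other column names (`ClaimID`, `ClaimAmount`, `Fraud`) and all values are made up for the example, and the real aggregations in this project were produced by Featuretools rather than by hand.

```python
import pandas as pd

# Toy claim-level data: several claims per provider (many-to-one).
# Column names other than "Provider" are illustrative assumptions.
claims = pd.DataFrame({
    "ClaimID": ["C1", "C2", "C3", "C4", "C5"],
    "Provider": ["P1", "P1", "P2", "P2", "P2"],
    "ClaimAmount": [100.0, 300.0, 50.0, 70.0, 30.0],
})

# Provider-level table holding the target (1 = fraud, 0 = not fraud).
providers = pd.DataFrame({"Provider": ["P1", "P2", "P3"], "Fraud": [1, 0, 0]})

# Aggregate claims up to one row per provider.
claim_features = (
    claims.groupby("Provider")
    .agg(
        claim_count=("ClaimID", "count"),
        claim_amount_sum=("ClaimAmount", "sum"),
        claim_amount_mean=("ClaimAmount", "mean"),
    )
    .reset_index()
)

# A left join keeps providers with no claims; their count and sum become 0.
model_table = providers.merge(claim_features, on="Provider", how="left")
model_table[["claim_count", "claim_amount_sum"]] = (
    model_table[["claim_count", "claim_amount_sum"]].fillna(0)
)
print(model_table)
```

The result has exactly one row per provider, which is the level of observation the model needs.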
Feature Engineering
To automate feature engineering, we'll use Featuretools, an open source Python framework that applies Deep Feature Synthesis (DFS) to generate features automatically. You can combine your raw data with what you know about your data to build meaningful features for machine learning and predictive modeling.
Within Featuretools, we create an entity set (a set of datasets and the relationships between them), then call Deep Feature Synthesis to create new features.
Below is a glimpse of the entity set used.
Entityset: Provider
  Entities:
    provider [Rows: 5410, Columns: 1]
    outpatient_claims [Rows: 517737, Columns: 27]
    inpatient_claims [Rows: 40474, Columns: 30]
    beneficiary [Rows: 138556, Columns: 25]
  Relationships:
    outpatient_claims.Provider -> provider.Provider
    inpatient_claims.Provider -> provider.Provider
Data Exploration
DFS generated 160 new features based on the provided entity set.
Courtesy of evidently.ai, we can explore some of those features. The data was split 70-20-10 into training, validation, and testing datasets. The exploration below was completed on the training data. Several dashboards were created, each containing at most 20 numeric fields and 20 categorical fields.
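For readers who want to reproduce a 70-20-10 split, here is a minimal sketch using NumPy and pandas. The report does not say how the split was performed, so the random shuffle, the seed, and the helper name `split_70_20_10` are assumptions for illustration; a stratified split may be preferable when fraud labels are imbalanced.

```python
import numpy as np
import pandas as pd

def split_70_20_10(df, seed=42):
    """Shuffle rows, then cut them into 70% train, 20% validation, 10% test."""
    rng = np.random.default_rng(seed)
    shuffled = df.iloc[rng.permutation(len(df))].reset_index(drop=True)
    # Integer arithmetic avoids floating-point rounding at the cut points.
    train_end = len(shuffled) * 7 // 10
    valid_end = len(shuffled) * 9 // 10
    return (
        shuffled.iloc[:train_end],
        shuffled.iloc[train_end:valid_end],
        shuffled.iloc[valid_end:],
    )

# Toy frame with one row per provider; the IDs are placeholders.
toy = pd.DataFrame({"Provider": [f"P{i}" for i in range(5410)]})
train_df, valid_df, test_df = split_70_20_10(toy)
print(len(train_df), len(valid_df), len(test_df))
```

Each row lands in exactly one of the three partitions, so the exploration can be restricted to the training portion without leaking validation or test rows.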
Model Overview
H2O AutoML was used extensively to surface high-quality models. The best model identified was a Gradient Boosting Machine (GBM), a forward-learning ensemble method. The guiding heuristic is that good predictive results can be obtained through increasingly refined approximations. H2O's GBM sequentially builds regression trees on all the features of the dataset in a fully distributed way: each tree is built in parallel.
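To make the idea of "increasingly refined approximations" concrete, here is a small from-scratch sketch of gradient boosting in NumPy. It is not H2O's implementation: it uses one-split regression trees (stumps), a single feature, and squared-error loss for simplicity, and every function name and parameter value is an assumption chosen for the example.

```python
import numpy as np

def fit_stump(x, residual):
    """Find the single threshold on x that best fits the residual (least squares)."""
    best = None
    for threshold in np.unique(x)[:-1]:
        left = residual[x <= threshold]
        right = residual[x > threshold]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, threshold, left.mean(), right.mean())
    _, threshold, left_value, right_value = best
    return threshold, left_value, right_value

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

def boost(x, y, n_trees=20, learning_rate=0.3):
    """Forward stagewise fitting: each stump is fit to the current residuals."""
    prediction = np.full_like(y, y.mean(), dtype=float)
    stumps, history = [], []
    for _ in range(n_trees):
        stump = fit_stump(x, y - prediction)
        prediction = prediction + learning_rate * predict_stump(stump, x)
        stumps.append(stump)
        history.append(float(np.mean((y - prediction) ** 2)))
    return stumps, history

# Synthetic data: a noisy sine curve.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 6.0, 200)
y = np.sin(x) + rng.normal(scale=0.1, size=x.size)
stumps, mse_history = boost(x, y)
print(mse_history[0], mse_history[-1])
```

Each tree corrects what the previous trees got wrong, so the training error shrinks as trees are added. The learning rate controls how much of each correction is applied.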
Model Performance
Tracking below was completed with W&B via a function that takes as arguments an H2O model, an H2OFrame of test data, a W&B run, the number of top features to log, and a flag controlling whether the model is saved:
## function to log h2o estimator details: feature imp, pd plots, log model, and metrics
import regex as re

def h2o_estimator(estimator, test_h2oframe, wandb_run, n_top_features, save_model=True):
    ## log model
    ## log feature importance
    ## log partial dependence plots
    ## log metrics
    ...
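The body of that function is not shown in this report. As a hedged sketch of what the metrics and feature importance steps could look like, the helper below (a hypothetical `log_estimator_summary`, not the function used here) relies on H2O's `varimp(use_pandas=True)` and `model_performance()` and on the W&B run's `log()` method. Model saving and partial dependence plots are left out. The `Fake*` classes are stand-ins with placeholder values so the example runs without an H2O cluster or a W&B account; they are not results from this project.

```python
import pandas as pd

def log_estimator_summary(estimator, test_h2oframe, wandb_run, n_top_features):
    """Log test metrics and the top-n variable importances to a W&B run."""
    # varimp(use_pandas=True) returns a DataFrame that includes
    # "variable" and "scaled_importance" columns for tree-based models.
    varimp = (
        estimator.varimp(use_pandas=True)
        .sort_values("scaled_importance", ascending=False)
        .head(n_top_features)
    )
    importance = {
        f"varimp/{row.variable}": float(row.scaled_importance)
        for row in varimp.itertuples()
    }
    # model_performance() scores the model on the held-out frame.
    perf = estimator.model_performance(test_h2oframe)
    metrics = {"test/auc": perf.auc(), "test/logloss": perf.logloss()}
    payload = {**metrics, **importance}
    wandb_run.log(payload)
    return payload

# Stand-ins (placeholder values only) that mimic the methods used above.
class FakePerformance:
    def auc(self):
        return 0.5
    def logloss(self):
        return 0.69

class FakeEstimator:
    def varimp(self, use_pandas=False):
        return pd.DataFrame({
            "variable": ["feature_b", "feature_a", "feature_c"],
            "scaled_importance": [0.4, 1.0, 0.1],
        })
    def model_performance(self, frame):
        return FakePerformance()

class FakeRun:
    def __init__(self):
        self.logged = []
    def log(self, data):
        self.logged.append(data)

run = FakeRun()
payload = log_estimator_summary(FakeEstimator(), None, run, n_top_features=2)
print(sorted(payload))
```

With a real H2O estimator and a run returned by `wandb.init()`, the same call pattern should apply, though the available metrics depend on the model type (for example, `auc()` is only defined for binomial models).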
XGBoost_1_AutoML_3_20220728_181021
Version overview
Full Name: wandb-smle/h2o-autoML-classification/XGBoost_1_AutoML_3_20220728_181021:v0
Aliases: latest, v0
Digest: 741de27053e5f1f5172d148ef7541fa5
Created At: July 28th, 2022 20:03:36
Num Consumers: 0
Num Files: 3
Size: 3.9MB
TTL Remaining: Inactive
Feature Importance
Partial Dependence
Challengers
H2O AutoML
Model Registry Element
Fraud Detection LID 62a88bde90d0f2003c6a7bf9