Predicting Diabetes Mellitus

This is a notebook on my approach to the WiDS Datathon 2021 - Predicting Diabetes Mellitus. Made by Andrada Olteanu using Weights & Biases

๐Ÿ“Introduction

The WiDS Datathon 2021 Kaggle competition had the purpose of determining whether patients have a particular type of diabetes, Diabetes Mellitus, using only data collected during the first 24 hours after admission to an Intensive Care Unit (ICU).

Data

Data was provided in a tabular format.
Ok! Let's get started!
You can find the full notebook (with datasets and code) here.

🔢 Label Encoding

After a thorough sweep through the other Kaggle Notebooks created for this competition, and after looking at other EDAs to see how other Kagglers approached the problem, I started preparing my own dataset.
Pro Tip: If you don't want to do your own EDA for a Kaggle competition, you can look around to see what others found out. Don't forget to give an upvote if you found something useful!
I then created a custom function named encoder_train_test, which encodes all categorical columns found in the dataset using either the LabelEncoder or the OneHotEncoder methodology; a sketch of such a helper is shown below.
Label Encoding vs One Hot Encoding
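The exact helper isn't reproduced in this report, so here is a minimal sketch of what encoder_train_test could look like, assuming the standard scikit-learn and pandas APIs (the notebook's actual implementation may differ):

import pandas as pd
from sklearn.preprocessing import LabelEncoder

def encoder_train_test(train, test, method="label"):
    '''Encodes every categorical column in train & test with a shared mapping.
    Simplified sketch; not the notebook's exact implementation.'''
    cat_cols = train.select_dtypes(include="object").columns
    if method == "label":
        for col in cat_cols:
            le = LabelEncoder()
            # Fit on train + test together so unseen test categories don't break
            le.fit(pd.concat([train[col], test[col]]).astype(str))
            train[col] = le.transform(train[col].astype(str))
            test[col] = le.transform(test[col].astype(str))
    else:
        # One-hot encode both frames together so they end up with identical columns
        combined = pd.concat([train, test], keys=["train", "test"])
        combined = pd.get_dummies(combined, columns=list(cat_cols))
        train, test = combined.xs("train"), combined.xs("test")
    return train, test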

📊 Train vs Test Differences

After my data was encoded, I proceeded to find out whether there are any strong differences between the train and the test datasets. Strong differences may lead to poor performance of the trained model on the leaderboard, so we always need to be wary of them.
I also proceeded to log to W&B some of the most important differences I noticed:
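As a rough sketch of how such train/test comparisons can be logged to W&B (the column names age and bmi here are placeholders, not necessarily the ones I logged):

import wandb

run = wandb.init(project='wids-datathon-kaggle', name='train_test_differences')
for col in ['age', 'bmi']:  # placeholder column names
    wandb.log({f'{col}_train': wandb.Histogram(train[col].dropna()),
               f'{col}_test': wandb.Histogram(test[col].dropna())})
wandb.finish()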

Correlation Matrix

The correlation matrix showed which features are most strongly correlated with one another; an example of the resulting network can be seen below.
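As a small sketch, the strongest pairs can be read straight off the pandas correlation matrix (assuming train is the encoded dataframe):

import numpy as np

corr = train.corr()
mask = ~np.eye(len(corr), dtype=bool)  # ignore the all-1.0 diagonal
pairs = corr.abs().where(mask).stack().sort_values(ascending=False)
print(pairs.head(10))  # the most strongly correlated feature pairs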

📬 Imputation

This step was very important, as there were around 80 columns per dataset that had some sort of missing value.
I decided to use the fancyimpute library, which contains multiple types of imputation algorithms. Some of them are: KNN, SoftImpute, IterativeSVD, MatrixFactorization and NuclearNormMinimization.
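As a minimal sketch of how the library is used, assuming train is the encoded dataframe (note that fancyimpute's original MICE class has since been merged into scikit-learn as IterativeImputer):

import pandas as pd
from fancyimpute import KNN, SoftImpute

# fit_transform returns a plain numpy array, so wrap it back into a DataFrame
train_knn = pd.DataFrame(KNN(k=3).fit_transform(train), columns=train.columns)
train_soft = pd.DataFrame(SoftImpute().fit_transform(train), columns=train.columns)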
In the Kaggle notebook I stuck with the MICE algorithm, but locally I experimented with almost all of the algorithms the library makes available. After this, I saved the label-encoded and imputed data as an Artifact in the W&B Dashboard:
# Save to W&B as artifacts
run = wandb.init(project='wids-datathon-kaggle', name='le+mice_process')
artifact = wandb.Artifact(name='preprocessed', type='dataset')
artifact.add_file("../input/wids-datathon-2021-preprocessed-data/train_le_mice.parquet")
artifact.add_file("../input/wids-datathon-2021-preprocessed-data/test_le_mice.parquet")
wandb.log_artifact(artifact)
wandb.finish()
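In a later run, the same artifact can be pulled back down; a minimal sketch using the standard W&B Artifacts API:

run = wandb.init(project='wids-datathon-kaggle', name='load_preprocessed')
artifact = run.use_artifact('preprocessed:latest')
data_dir = artifact.download()  # local directory containing the parquet files
wandb.finish()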

☑ Scaling

The last step in the preprocessing pipeline was scaling the data, which brings all features to a comparable range and makes them easier for the modeling algorithms to handle.
def scale_data(train, test):
    '''Scales the data using MinMaxScaler from the cuml library.
    Returns the 2 scaled train & test dataframes.'''
    scaler = MinMaxScaler()
    new_train = scaler.fit_transform(train)
    new_test = scaler.transform(test)

    new_train.columns = train.columns
    new_test.columns = test.columns

    print("Scaling has finished.")
    return new_train, new_test
Afterwards, I saved these scaled samples as Artifacts as well.

💻 XGBoost Model Training

My first model of choice was XGBoost, as it is usually the ⭐star⭐ of all Data Science parties when it comes to Machine Learning problems.
Hence, I created a custom function that takes in the training and validation data, the hyperparameters of the model and a few more details, and returns the trained model and the ROC score.
Note: ROC AUC was the evaluation metric of choice for this competition.
import wandb
import xgboost
from sklearn.metrics import roc_auc_score
from wandb.xgboost import wandb_callback

def train_xgb_model(X_train, X_test, y_train, y_test, params, details="default", prints=True, step=1):
    '''Trains an XGB and returns the trained model + ROC value.'''
    run = wandb.init(project='wids-datathon-kaggle', name=f'xgboost_run_{step}', config=params)
    wandb.log(params)

    # Create a DMatrix, which is optimized for both memory efficiency and training speed
    train_matrix = xgboost.DMatrix(data=X_train, label=y_train)

    # Create & train the model
    model = xgboost.train(params,
                          dtrain=train_matrix,
                          callbacks=[wandb_callback()])

    # Make predictions and score them
    predicts = model.predict(xgboost.DMatrix(X_test))
    roc = roc_auc_score(y_test.astype('int32'), predicts)
    wandb.log({'roc': roc})

    if prints:
        print(details + " - ROC: {:.5}".format(roc))

    wandb.finish()
    return model, roc
After running the baseline model, tuning parameters and running the final model, 3 charts appeared in my W&B Dashboard.
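The tuning runs were just repeated calls to the function above; a sketch of such a loop, with purely illustrative parameter values (the actual tuned configuration lives in the notebook):

# Illustrative grid; not the final competition hyperparameters
for step, max_depth in enumerate([4, 6, 8], start=1):
    params = {'objective': 'binary:logistic',
              'eval_metric': 'auc',
              'max_depth': max_depth,
              'eta': 0.1}
    model, roc = train_xgb_model(X_train, X_test, y_train, y_test, params,
                                 details=f'max_depth={max_depth}', step=step)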
XGBoost Final Best ROC: 0.82522

Feature Importance

From the Feature Importance of the final model we can see that glucose, the BMI score and age are among the most important attributes when determining whether or not a patient has Diabetes Mellitus.
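A hedged sketch of how such a chart can be reproduced from the trained booster (model being the object returned by train_xgb_model):

import matplotlib.pyplot as plt
import xgboost

# Plot the top features ranked by gain; plot_importance returns a matplotlib Axes
xgboost.plot_importance(model, importance_type='gain', max_num_features=15)
plt.tight_layout()
plt.show()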

💻 Light GBM Model Training

My second model of choice was Light GBM, simply because it has proven very effective in some of my recent projects. Also, a combination of Light GBM and XGBoost is usually a winner for me. 😉
Hence, I created a custom function for this algorithm as well, which takes in the training and validation data, the number of cross-validation splits, the number of boosting rounds, a stop_round parameter that halts training early if no improvement is seen, and a few other arguments. In the end, the function looks like this:
import numpy as np
import lightgbm as lgbm
import wandb
from sklearn import metrics
from sklearn.model_selection import StratifiedKFold
from wandb.lightgbm import wandb_callback

def training_lgbm(train_lgbm, test_lgbm, features, target, param,
                  n_splits=5, stop_round=100, num_rounds=1000, verbose=False,
                  tuned="None", val=None, return_model=False, step=1):
    '''Trains LGBM model.'''
    run = wandb.init(project='wids-datathon-kaggle', name=f'lgbm_run_{step}', config=param)
    wandb.log(param)

    # ~~~~~~~
    #  KFOLD
    # ~~~~~~~
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True)
    oof = np.zeros(len(train_lgbm))
    predictions = np.zeros(len(test_lgbm))

    # Convert Train to Train & Validation
    skf_split = skf.split(X=train_lgbm[features], y=train_lgbm[target].values)

    # ~~~~~~~
    #  TRAIN
    # ~~~~~~~
    counter = 1
    for train_index, valid_index in skf_split:
        print("==== Fold {} ====".format(counter))

        lgbm_train = lgbm.Dataset(data=train_lgbm.iloc[train_index, :][features].values,
                                  label=train_lgbm.iloc[train_index, :][target].values,
                                  feature_name=features,
                                  free_raw_data=False)
        lgbm_valid = lgbm.Dataset(data=train_lgbm.iloc[valid_index, :][features].values,
                                  label=train_lgbm.iloc[valid_index, :][target].values,
                                  feature_name=features,
                                  free_raw_data=False)

        lgbm_1 = lgbm.train(params=param,
                            train_set=lgbm_train,
                            valid_sets=[lgbm_valid],
                            early_stopping_rounds=stop_round,
                            num_boost_round=num_rounds,
                            verbose_eval=verbose,
                            callbacks=[wandb_callback()])

        # Predict on the validation fold and accumulate averaged test predictions
        oof[valid_index] = lgbm_1.predict(train_lgbm.iloc[valid_index][features].values,
                                          num_iteration=lgbm_1.best_iteration)
        predictions += lgbm_1.predict(test_lgbm[features],
                                      num_iteration=lgbm_1.best_iteration) / n_splits
        counter += 1

    # ~~~~~~~~~~~
    #  OOF EVAL
    # ~~~~~~~~~~~
    print("============================================")
    print("Splits: {} | Stop Round: {} | No. Rounds: {} | {}: {}".format(
        n_splits, stop_round, num_rounds, tuned, val))
    print("CV ROC: {:<0.5f}".format(metrics.roc_auc_score(test_lgbm[target], predictions)))
    print("\n")
    wandb.log({'oof_roc': metrics.roc_auc_score(test_lgbm[target], predictions)})
    wandb.finish()

    if return_model:
        return lgbm_1
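A hypothetical call to this function, with illustrative parameter values; train, holdout and target are placeholders for a labeled train/holdout split, and note that test_lgbm must contain the target column, since the final ROC is computed against it:

# Illustrative LightGBM config; the actual tuned values are in the notebook
param = {'objective': 'binary',
         'metric': 'auc',
         'learning_rate': 0.05,
         'num_leaves': 31}
features = [c for c in train.columns if c != target]
lgbm_model = training_lgbm(train, holdout, features, target, param,
                           n_splits=5, stop_round=100, num_rounds=1000,
                           return_model=True, step=1)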
After running the baseline model and tuning parameters, 2 charts appeared in my W&B Dashboard.
Light GBM Final Best ROC: 0.86037

๐Ÿ“ Final Tries and Ending Notes

On my local machine I tried multiple approaches, consisting of different combinations of imputation algorithms, hyperparameter settings and other model types, such as RandomForest or SVR.
A blend of the tuned XGBoost and Light GBM proved to be the best approach, with my final private score being 0.86114 and my public score 0.85842.
blended_preds = 0.1 * xgb_final_preds + 0.9 * lgbm_final_preds
make_submission(blended_preds, file_name="blend_xgb_lgbm")
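make_submission is a small helper from the notebook; a hypothetical version, assuming the competition's encounter_id / diabetes_mellitus submission format:

import pandas as pd

def make_submission(preds, file_name):
    '''Writes predictions in the sample-submission format (hypothetical sketch).'''
    submission = pd.DataFrame({'encounter_id': test['encounter_id'],
                               'diabetes_mellitus': preds})
    submission.to_csv(f'{file_name}.csv', index=False)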

You can find the full notebook (with datasets and code) here.

💜 Thank you lots for reading and Happy Data Sciencin'! 💜