Predicting Diabetes Mellitus

This is a notebook on my approach to the WiDS Datathon 2021 - Predicting Diabetes Mellitus. Made by Andrada Olteanu using Weights & Biases

๐Ÿ“Introduction

The WiDS Datathon 2021 Kaggle competition had the purpose of determining whether patients have a particular type of diabetes, Diabetes Mellitus, using only data collected during the first 24 hours after admission to an Intensive Care Unit (ICU).

Data

Data was provided in a tabular format.
Ok! Let's get started!
You can find the full notebook (with datasets and code) here.

🔢 Label Encoding

After a thorough sweep through the other Kaggle Notebooks created for this competition, and after looking at other EDAs to see how other Kagglers approached the problem, I started preparing my own dataset.
Pro Tip: If you don't want to do your own EDA for a Kaggle competition, you can look around to see what others found out. Don't forget to give an upvote if you found something useful!
I then created a custom function named encoder_train_test, which encodes all categorical columns found in the dataset using either the LabelEncoder or the OneHotEncoder methodology; a sketch of such a helper is shown below.
Label Encoding vs One Hot Encoding
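The exact helper isn't reproduced in this report, so here is a minimal sketch of what encoder_train_test could look like, assuming the standard scikit-learn and pandas APIs (the notebook's actual implementation may differ):

import pandas as pd
from sklearn.preprocessing import LabelEncoder

def encoder_train_test(train, test, method="label"):
    '''Encodes every categorical column in train & test with a shared mapping.
    Simplified sketch; not the notebook's exact implementation.'''
    cat_cols = train.select_dtypes(include="object").columns
    if method == "label":
        for col in cat_cols:
            le = LabelEncoder()
            # Fit on train + test together so unseen test categories don't break
            le.fit(pd.concat([train[col], test[col]]).astype(str))
            train[col] = le.transform(train[col].astype(str))
            test[col] = le.transform(test[col].astype(str))
    else:
        # One-hot encode both frames together so they end up with identical columns
        combined = pd.concat([train, test], keys=["train", "test"])
        combined = pd.get_dummies(combined, columns=list(cat_cols))
        train, test = combined.xs("train"), combined.xs("test")
    return train, test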

📊 Train vs Test Differences

After my data was encoded, I proceeded to find out whether there are any strong differences between the train and the test datasets. Strong differences may lead to poor performance of the trained model on the leaderboard, so we always need to be wary of them.
I also proceeded to log to W&B some of the most important differences I noticed:
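As a rough sketch of how such train/test comparisons can be logged to W&B (the column names age and bmi here are placeholders, not necessarily the ones I logged):

import wandb

run = wandb.init(project='wids-datathon-kaggle', name='train_test_differences')
for col in ['age', 'bmi']:  # placeholder column names
    wandb.log({f'{col}_train': wandb.Histogram(train[col].dropna()),
               f'{col}_test': wandb.Histogram(test[col].dropna())})
wandb.finish()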

Correlation Matrix

The correlation matrix showed which features are most strongly correlated with one another; an example of the resulting network can be seen below.
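As a small sketch, the strongest pairs can be read straight off the pandas correlation matrix (assuming train is the encoded dataframe):

import numpy as np

corr = train.corr()
mask = ~np.eye(len(corr), dtype=bool)  # ignore the all-1.0 diagonal
pairs = corr.abs().where(mask).stack().sort_values(ascending=False)
print(pairs.head(10))  # the most strongly correlated feature pairs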

📬 Imputation

This step was very important, as there were around 80 columns per dataset that had some sort of missing value.
I decided to use the fancyimpute library, which contains multiple types of imputation algorithms. Some of them are: KNN, SoftImpute, IterativeSVD, MatrixFactorization and NuclearNormMinimization.
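As a minimal sketch of how the library is used, assuming train is the encoded dataframe (note that fancyimpute's original MICE class has since been merged into scikit-learn as IterativeImputer):

import pandas as pd
from fancyimpute import KNN, SoftImpute

# fit_transform returns a plain numpy array, so wrap it back into a DataFrame
train_knn = pd.DataFrame(KNN(k=3).fit_transform(train), columns=train.columns)
train_soft = pd.DataFrame(SoftImpute().fit_transform(train), columns=train.columns)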
In the Kaggle notebook I stuck with the MICE algorithm, but locally I experimented with almost all of the algorithms the library makes available. After this, I saved the label-encoded and imputed data as an Artifact in the W&B Dashboard:
# Save to W&B as artifacts
run = wandb.init(project='wids-datathon-kaggle', name='le+mice_process')
artifact = wandb.Artifact(name='preprocessed', type='dataset')
artifact.add_file("../input/wids-datathon-2021-preprocessed-data/train_le_mice.parquet")
artifact.add_file("../input/wids-datathon-2021-preprocessed-data/test_le_mice.parquet")
wandb.log_artifact(artifact)
wandb.finish()
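In a later run, the same artifact can be pulled back down; a minimal sketch using the standard W&B Artifacts API:

run = wandb.init(project='wids-datathon-kaggle', name='load_preprocessed')
artifact = run.use_artifact('preprocessed:latest')
data_dir = artifact.download()  # local directory containing the parquet files
wandb.finish()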

☑ Scaling

The last step in the preprocessing pipeline was scaling the data, which brings all features to a comparable range and makes them easier for the modeling algorithms to handle.
def scale_data(train, test):
    '''Scales the data using MinMaxScaler from the cuml library.
    Returns the 2 scaled train & test dataframes.'''
    scaler = MinMaxScaler()
    new_train = scaler.fit_transform(train)
    new_test = scaler.transform(test)

    new_train.columns = train.columns
    new_test.columns = test.columns

    print("Scaling has finished.")
    return new_train, new_test
Afterwards, I saved these scaled samples as Artifacts as well.

💻 XGBoost Model Training

My first model of choice was XGBoost, as it is usually the ⭐star⭐ of all Data Science parties when it comes to Machine Learning problems.
Hence, I created a custom function that takes in the training and validation data, the hyperparameters of the model and a few more details, and returns the trained model and the ROC score.
Note: ROC AUC was the evaluation metric of choice for this competition.
import wandb
import xgboost
from sklearn.metrics import roc_auc_score
from wandb.xgboost import wandb_callback

def train_xgb_model(X_train, X_test, y_train, y_test, params, details="default", prints=True, step=1):
    '''Trains an XGB and returns the trained model + ROC value.'''
    run = wandb.init(project='wids-datathon-kaggle', name=f'xgboost_run_{step}', config=params)
    wandb.log(params)

    # Create a DMatrix, which is optimized for both memory efficiency and training speed
    train_matrix = xgboost.DMatrix(data=X_train, label=y_train)

    # Create & train the model
    model = xgboost.train(params,
                          dtrain=train_matrix,
                          callbacks=[wandb_callback()])

    # Make predictions and score them
    predicts = model.predict(xgboost.DMatrix(X_test))
    roc = roc_auc_score(y_test.astype('int32'), predicts)
    wandb.log({'roc': roc})

    if prints:
        print(details + " - ROC: {:.5}".format(roc))

    wandb.finish()
    return model, roc
After running the baseline model, tuning parameters and running the final model, 3 charts appeared in my W&B Dashboard.
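The tuning runs were just repeated calls to the function above; a sketch of such a loop, with purely illustrative parameter values (the actual tuned configuration lives in the notebook):

# Illustrative grid; not the final competition hyperparameters
for step, max_depth in enumerate([4, 6, 8], start=1):
    params = {'objective': 'binary:logistic',
              'eval_metric': 'auc',
              'max_depth': max_depth,
              'eta': 0.1}
    model, roc = train_xgb_model(X_train, X_test, y_train, y_test, params,
                                 details=f'max_depth={max_depth}', step=step)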
XGBoost Final Best ROC: 0.82522

Feature Importance

From the Feature Importance of the final model we can see that glucose, the BMI score and age are among the most important attributes when determining whether or not a patient has Diabetes Mellitus.
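A hedged sketch of how such a chart can be reproduced from the trained booster (model being the object returned by train_xgb_model):

import matplotlib.pyplot as plt
import xgboost

# Plot the top features ranked by gain; plot_importance returns a matplotlib Axes
xgboost.plot_importance(model, importance_type='gain', max_num_features=15)
plt.tight_layout()
plt.show()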

💻 Light GBM Model Training

My second model of choice was Light GBM, simply because it has proven very effective in some of my recent projects. Also, a combination of Light GBM and XGBoost is usually a winner for me. 😉
Hence, I created a custom function for this algorithm as well, which takes in the training and validation data, the number of cross-validation splits, the number of boosting rounds, a stop_round parameter that halts training early if no improvement is seen, and a few other arguments. In the end, the function looks like this:
import numpy as np
import lightgbm as lgbm
import wandb
from sklearn import metrics
from sklearn.model_selection import StratifiedKFold
from wandb.lightgbm import wandb_callback

def training_lgbm(train_lgbm, test_lgbm, features, target, param,
                  n_splits=5, stop_round=100, num_rounds=1000, verbose=False,
                  tuned="None", val=None, return_model=False, step=1):
    '''Trains LGBM model.'''
    run = wandb.init(project='wids-datathon-kaggle', name=f'lgbm_run_{step}', config=param)
    wandb.log(param)

    # ~~~~~~~
    #  KFOLD
    # ~~~~~~~
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True)
    oof = np.zeros(len(train_lgbm))
    predictions = np.zeros(len(test_lgbm))

    # Convert Train to Train & Validation
    skf_split = skf.split(X=train_lgbm[features], y=train_lgbm[target].values)

    # ~~~~~~~
    #  TRAIN
    # ~~~~~~~
    counter = 1
    for train_index, valid_index in skf_split:
        print("==== Fold {} ====".format(counter))

        lgbm_train = lgbm.Dataset(data=train_lgbm.iloc[train_index, :][features].values,
                                  label=train_lgbm.iloc[train_index, :][target].values,
                                  feature_name=features,
                                  free_raw_data=False)
        lgbm_valid = lgbm.Dataset(data=train_lgbm.iloc[valid_index, :][features].values,
                                  label=train_lgbm.iloc[valid_index, :][target].values,
                                  feature_name=features,
                                  free_raw_data=False)

        lgbm_1 = lgbm.train(params=param,
                            train_set=lgbm_train,
                            valid_sets=[lgbm_valid],
                            early_stopping_rounds=stop_round,
                            num_boost_round=num_rounds,
                            verbose_eval=verbose,
                            callbacks=[wandb_callback()])

        # Predict on the validation fold and accumulate averaged test predictions
        oof[valid_index] = lgbm_1.predict(train_lgbm.iloc[valid_index][features].values,
                                          num_iteration=lgbm_1.best_iteration)
        predictions += lgbm_1.predict(test_lgbm[features],
                                      num_iteration=lgbm_1.best_iteration) / n_splits
        counter += 1

    # ~~~~~~~~~~~
    #  OOF EVAL
    # ~~~~~~~~~~~
    print("============================================")
    print("Splits: {} | Stop Round: {} | No. Rounds: {} | {}: {}".format(
        n_splits, stop_round, num_rounds, tuned, val))
    print("CV ROC: {:<0.5f}".format(metrics.roc_auc_score(test_lgbm[target], predictions)))
    print("\n")
    wandb.log({'oof_roc': metrics.roc_auc_score(test_lgbm[target], predictions)})
    wandb.finish()

    if return_model:
        return lgbm_1
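A hypothetical call to this function, with illustrative parameter values; train, holdout and target are placeholders for a labeled train/holdout split, and note that test_lgbm must contain the target column, since the final ROC is computed against it:

# Illustrative LightGBM config; the actual tuned values are in the notebook
param = {'objective': 'binary',
         'metric': 'auc',
         'learning_rate': 0.05,
         'num_leaves': 31}
features = [c for c in train.columns if c != target]
lgbm_model = training_lgbm(train, holdout, features, target, param,
                           n_splits=5, stop_round=100, num_rounds=1000,
                           return_model=True, step=1)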
After running the baseline model and tuning parameters, 2 charts appeared in my W&B Dashboard.
Light GBM Final Best ROC: 0.86037

๐Ÿ“ Final Tries and Ending Notes

On my local machine I tried multiple approaches, consisting of different combinations of imputation algorithms, hyperparameter settings and other model types, such as RandomForest or SVR.
A blend of the tuned XGBoost and Light GBM proved to be the best approach, with my final private score being 0.86114 and my public score 0.85842.
blended_preds = 0.1 * xgb_final_preds + 0.9 * lgbm_final_preds
make_submission(blended_preds, file_name="blend_xgb_lgbm")
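make_submission is a small helper from the notebook; a hypothetical version, assuming the competition's encounter_id / diabetes_mellitus submission format:

import pandas as pd

def make_submission(preds, file_name):
    '''Writes predictions in the sample-submission format (hypothetical sketch).'''
    submission = pd.DataFrame({'encounter_id': test['encounter_id'],
                               'diabetes_mellitus': preds})
    submission.to_csv(f'{file_name}.csv', index=False)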

You can find the full notebook (with datasets and code) here.

💜 Thank you lots for reading and Happy Data Sciencin'! 💜