Creating predictive models to assess the risk of credit clients
My top tips for competing in Kaggle Challenges like the Home Credit Risk Model Stability Challenge.
Created on March 7 | Last edited on May 21
Introduction
Kaggle, a platform for predictive modeling and analytics competitions, has always intrigued me. It's a place where data scientists and machine learning enthusiasts can test their skills, learn new techniques, and even win prizes. I'd never quite taken the plunge to try one myself. At least until a few weeks ago that is.
What I gained the most from this challenge was not a prize, but invaluable insights into data science as a discipline. This blog post is dedicated to the key lessons I learned.

What We'll Cover
Introduction
The Challenge
Start with the Evaluation Metrics
Grasp the Essence of Tabular Data
Understanding the Data
Utilizing Pickle for Data Management
Engage with the Community through Discussions
Incorporating Weights and Biases for Efficient Experiment Management
Learn from Code Examples
Leverage Domain Expertise
Playing to Your Strengths
Experimentation
Conclusion
The Challenge
The challenge I recently took part in is the Home Credit Risk Model Stability Challenge, where the goal is to create predictive models that accurately assess the default risk of clients, with a particular focus on those without a substantial credit history. The challenge emphasizes model stability over time, rewarding solutions that maintain consistent performance, since this helps consumer finance providers make more reliable risk assessments. Participants are provided with data covering various attributes of clients' profiles, loan history, and financial behavior.
Start with the Evaluation Metrics
One of the first things I realized when starting the challenge is the importance of understanding the evaluation metrics. Just as artificial intelligence models optimize towards a specific objective, our brains also need a clear objective to focus on.
In Kaggle competitions, each challenge comes with its own evaluation criteria, which can significantly shape how you approach your model design. By understanding these metrics from the start, it becomes much easier to come up with clever solutions that improve your models' performance. Additionally, developing the habit of truly understanding an evaluation metric will help you understand why your model performs the way it does.
For this challenge, unfortunately, the original metric had a few problems, and at the time of writing these issues have yet to be resolved. The organizers intended to introduce a new concept by making the stability of models in production the main evaluation criterion, diverging from more traditional approaches like simply maximizing the Area Under the Curve (AUC). The stability they refer to means that an ideal model should maintain consistent performance over time while achieving as high a mean Gini coefficient as possible across different time frames.
This reflects real-world business needs, where models must perform reliably over time rather than just excel in a one-off test scenario. In practice, however, participants found ways to exploit the stability penalty (the metric's second term), effectively "hacking" the metric by tweaking their model's scores on the test samples to nullify that part of it. It's for this reason that the authors of the challenge are currently working to devise a new metric.
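To make the idea concrete, here is a minimal sketch of how a Gini-over-time stability metric of this kind can be computed. The penalty weights and exact functional form below are illustrative assumptions rather than the organizers' official implementation; the competition page has the authoritative definition.

import numpy as np
from sklearn.metrics import roc_auc_score

def weekly_gini(y_true, y_score):
    # Gini is a simple rescaling of AUC
    return 2 * roc_auc_score(y_true, y_score) - 1

def stability_metric(week_ids, y_true, y_score, trend_weight=88.0, std_weight=0.5):
    # Compute one Gini score per week of the test period
    weeks = np.unique(week_ids)
    ginis = np.array([weekly_gini(y_true[week_ids == w], y_score[week_ids == w]) for w in weeks])
    # Fit a linear trend over the weekly scores; a negative slope means performance decays over time
    x = np.arange(len(ginis))
    slope, intercept = np.polyfit(x, ginis, 1)
    residuals = ginis - (slope * x + intercept)
    # Reward a high average Gini, penalize downward drift and week-to-week variability
    return ginis.mean() + trend_weight * min(0, slope) - std_weight * residuals.std()

Under this kind of formulation, the drift penalty contributes nothing whenever the fitted slope is non-negative, which is exactly the sort of behavior that proved exploitable.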
Grasp the Essence of Tabular Data
Dealing with tabular data can be overwhelming, especially with vast datasets. Initially, I struggled to manage and understand the sheer volume of data presented to me.
However, I quickly learned the importance of creating scripts to manage the data efficiently. By writing scripts that shortened files for easier access and identified key columns within numerous datasets, I could focus more on analysis rather than data management. This experience taught me that before diving deep into data analysis, one must first become adept at data handling and preprocessing.
In my particular challenge, I had over 30 CSV files, totaling over 28GB of data. Many of the CSV files won't even open on my MacBook without crashing my system! Some applications handle large files better than others, but I was able to make a simple bash alias that takes every CSV file in a directory, grabs its first 100 rows, and saves them to a new directory. This simple trick lets me open and view these tables easily.
Here's the alias below; you should be able to add it to your .bashrc/.zshrc file. It copies the first 100 rows of each CSV in the current working directory (or in a directory you pass as an argument) into a new directory called first_100s/.
alias copyfirst100='copycsv() { local dir="${1:-.}"; mkdir -p "$dir/first_100s"; for f in "$dir"/*.csv; do head -n 100 "$f" > "$dir/first_100s/$(basename "$f")"; done; }; copycsv'
After adding this alias to your .bashrc and sourcing it, you can simply run the command 'copyfirst100'.
Additionally, if you don't have a CSV viewer application on your system, I recommend checking out your respective app store for one, as this will be invaluable for viewing CSV tables efficiently.
Understanding the Data
Next, while trying to understand the format of the data, I encountered another challenge. The main description of the entire dataset lives in a file called feature_descriptions.csv; you can think of it as a sort of 'readme.csv' that describes every column in the dataset. One problem was the lack of a direct mapping showing which CSV file contained the actual values for each feature. It would obviously be incredibly inefficient to go through each CSV and search for a column manually, so I made a script that finds the tables containing a desired feature name. I'll share the script below:
import os
import pandas as pd

# Directory containing the CSV files
directory = '/Users/brettyoung/Desktop/dev_24/kaggle_credit/data/home-credit-credit-risk-model-stability/csv_files/test/first_100s'

# Required columns based on your initial list
required_columns = [
    'totaldebt_9A', 'overdueamountmax_155A', 'profession_152M', 'pmtaverage_3A',
    'maritalst_385M', 'gender_992L', 'education_1103M', 'actualdpd_943P',
    'amount_1115A', 'annualeffectiverate_63L', 'annuity_780A',
    'applications30d_658L', 'avgmaxdpdlast9m_3716943P', 'credamount_590A',
    'currdebt_22A', 'dpdmax_139P', 'numactivecreds_622L'
]

# Dictionary to hold the filenames and their associated matching columns
matched_files = {}

# Iterate through all files in the directory
for filename in os.listdir(directory):
    if filename.endswith('.csv'):  # Check if the file is a CSV file
        file_path = os.path.join(directory, filename)  # Full path of the file
        try:
            # Read only the headers to check the columns
            df = pd.read_csv(file_path, nrows=0)
            # Find any matching required columns in the CSV
            matching_columns = [column for column in required_columns if column in df.columns]
            if matching_columns:  # If there are any matching columns
                matched_files[filename] = matching_columns  # Add filename and matching columns to the dictionary
        except Exception as e:
            print(f"Error reading {filename}: {e}")

# Print out matched files and their associated matches
print("CSV files containing any of the required columns and their associated matches:")
for filename, matches in matched_files.items():
    print(f"{filename}: {matches}")
You can add the columns you are interested in to the required_columns list, and the script will print the files where those columns are located.
Utilizing Pickle for Data Management
When working with large and complex tabular datasets, I found the Python `pickle` module to be a game-changer for efficient data management. After completing the data preprocessing steps, which can be quite time-intensive for large datasets, I turned to pickle for storing the processed data.
Using pickle, I was able to serialize the preprocessed data into a file format that could be easily stored and retrieved. This meant that once the data was cleaned, tokenized, and structured according to my analysis needs, I could save this state to disk. This process of serialization converted my complex data structures into a byte stream that could be written to a file.
Later, when I needed to continue my analysis or use the preprocessed data for model training, I could quickly deserialize the stored files back into Python objects. This step significantly reduced the startup time for each data analysis session as I no longer needed to repeat the preprocessing steps.
The use of pickle for saving and loading preprocessed data ensured that I could maintain a fast and efficient workflow, allowing more time for analysis and less time waiting for data to load. It also helped in ensuring consistency across different stages of the project. However, it's important to use `pickle` cautiously, especially with data from untrusted sources, due to potential security risks.
import pickle

# Assuming combined_sequences and combined_sequences_val are some sort of data structure
def save_data_to_pickle(data, filename):
    with open(filename, 'wb') as file:
        pickle.dump(data, file)

save_data_to_pickle(combined_sequences, 'combined_sequences.pkl')
# Save combined_sequences_val
save_data_to_pickle(combined_sequences_val, 'combined_sequences_val.pkl')

def load_data_from_pickle(filename):
    with open(filename, 'rb') as file:
        data = pickle.load(file)
    return data

combined_sequences = load_data_from_pickle('combined_sequences.pkl')
Engage with the Community through Discussions
Another invaluable resource that I initially overlooked was the Kaggle discussion forums. I soon realized that many of my unanswered questions had already been addressed by the community. Additionally, I even came across questions that were asked in the forum that I initially did not think to ask, but were extremely informative. Engaging with the discussions saved me countless hours that I would have otherwise spent trying to troubleshoot issues on my own. The community on Kaggle is incredibly supportive, and leveraging their collective knowledge can significantly accelerate the learning curve.
"For example, there were discussions about the timeline of the data. I personally hadn’t considered how this would be directly relevant, but discussions on Kaggle brought this to light. The significance of the time frame, particularly how historical and recent events like the COVID-19 pandemic could influence the data and, consequently, the model outcomes, was a revelation. It became evident that different time periods in the dataset could yield vastly different insights and predictive accuracies due to varying economic conditions and consumer behavior.
Additionally, Kaggle has a nice feature for searching the entire discussion board by keyword, which I found helpful as a sort of search engine for the specific problems and questions I had.
Incorporating Weights and Biases for Efficient Experiment Management
In the journey of tackling a Kaggle challenge, one of the paramount aspects is to efficiently manage and track your experiments, especially when you are testing multiple hypotheses or models. This is where Weights and Biases (W&B) steps in as a game-changer for your data science projects.
Why Use Weights and Biases?
Weights and Biases is a machine learning platform that helps in tracking experiments, visualizing data, and sharing insights with others. It simplifies the process of logging experiments and comparing the outcomes, thereby enabling you to focus on what's working and what's not.
Here's a very simple example of how to use W&B!
import wandb

# Start a new W&B run
wandb.init(project='my-kaggle-project')

# Log hyperparameters (optional)
wandb.config.learning_rate = 0.01

# Log metrics within your training loop
for epoch in range(10):
    # Replace the following with your actual loss and accuracy calculation
    loss, accuracy = simulate_training(epoch)  # Placeholder function

    # Log metrics to W&B
    wandb.log({"loss": loss, "accuracy": accuracy})
Learn from Code Examples
Kaggle is not just about competition; it's also a learning platform. I found the shared code examples from other participants particularly helpful. These snippets provided insights into different approaches and coding practices. By studying these examples, I could spend more time developing proprietary methods and strategies, enhancing my problem-solving skills. Lots of notebooks shared on the competition page demonstrate how to load the data (a very time-intensive task) as well as how to get started training various models, which helped me out a lot.
Leverage Domain Expertise
I learned the importance of domain expertise in data science challenges. While technical skills are crucial, understanding the context and nuances of the specific domain of the challenge is equally important. Personally, I don't have expertise in areas like loan defaults, which became a significant hurdle. This gap in my knowledge made me realize how beneficial it would be to collaborate with teammates who have this specific domain expertise. Having such teammates would likely accelerate the learning process and improve the chances of developing a more accurate and nuanced predictive model, and I definitely recommend keeping this in mind if you are looking for teammates.
Playing to Your Strengths
One approach to gaining a proprietary edge in this competition is feature engineering; however, I didn't feel I had a deep enough understanding of general banking concepts and loans to justify investing much time in it (many of the features were foreign to me, and I've personally never even taken out a loan).

One of my competitors
However, I was curious if I could apply an unconventional model to the data, and hopefully make up for my lack of domain experience in banking.
Experimentation
I (along with basically everyone involved in AI today) have an interest in transformers, so I decided to try applying the transformer architecture to this tabular loan data. These models are generally used for natural language processing rather than tabular classification; however, tabular data can be reformatted into a sequence of tokens, one per column in each row.
The tabular data for this challenge contains both discrete categorical features and continuous float values (e.g., income). My strategy for feature engineering was to look through the notebooks with the highest performance scores on the test set and simply reuse their code and feature selections for my own model.
Data Preprocessing
In order to generate discrete tokens from continuous values, there are several methods that could be employed. I chose to fit a Gaussian distribution to each continuous column and then bin the values into 100 bins. This involves estimating the Gaussian's parameters (mean and standard deviation) for each continuous feature. Once the distribution is defined, I divide its range into 100 intervals or 'bins' based on the distribution's percentiles, so that, if the data is roughly Gaussian, each bin contains approximately the same number of data points. This converts the continuous values into discrete tokens while roughly preserving the underlying distribution of the data, in a format the transformer model can easily process.
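Here's a minimal sketch of this Gaussian-percentile binning idea. The function name, the 100-bin count, and the toy income values are illustrative, not the exact code from my pipeline.

import numpy as np
from scipy.stats import norm

def gaussian_bin(values, n_bins=100):
    # Fit a Gaussian to the column, then use its CDF to assign each value a percentile-based bin
    mu, sigma = np.mean(values), np.std(values) + 1e-8
    cdf = norm.cdf(values, loc=mu, scale=sigma)
    # Scale the percentiles to integer tokens in [0, n_bins - 1]
    return np.minimum((cdf * n_bins).astype(int), n_bins - 1)

incomes = np.array([25_000.0, 48_000.0, 75_000.0, 120_000.0, 310_000.0])
print(gaussian_bin(incomes))  # prints five integer tokens in [0, 99]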
Hacking?
I'm not entirely sure if this is really a great idea or not, but it was the first thing that came to mind. I'm aware that this approach might introduce some bias or lose important nuances in the data, and I'm curious to explore how it impacts the model's performance compared to other methods.
I won't go into the full details of the code used for this preprocessing, but I will show an example of my encoding strategy below and also share the model I used. For example, let's say our tabular data consists of three columns: the first contains continuous float values for income, the second contains one of ten different occupation names, and the third contains continuous float values for the loan amount.
The preprocessing would involve the following steps:
Encoding Continuous Values: For the 'Income' and 'Loan Amount' columns, I fit a Gaussian distribution based on their data. Then, I transform these continuous numbers into discrete tokens by dividing the range of each feature into 100 bins. Each bin represents a range of values, and each actual value is replaced by its corresponding bin's identifier. This way, a specific income or loan amount is represented by a token indicating which bin it falls into.
Encoding Categorical Values: For the 'Occupation' column, which contains categorical data, I use one-hot encoding: a separate slot for each possible occupation, set to 1 if the row's occupation matches it and 0 otherwise. The index of that 1, plus a fixed offset, then becomes the occupation token fed to the transformer, so the categorical data ends up as a single integer token just like the continuous features.
Original Data:
Income: $75,000
Occupation: Llama Veterinarian Specialist
Loan Amount: $30,000
After Preprocessing:
Income Token: 56 (assuming $75,000 falls into the 56th bin when the income range is divided into 100 bins)
Occupation Vector: [0, 1, 0, 0, ...] (assuming 'Llama Veterinarian Specialist' is the second unique occupation in our dataset)
Loan Amount Token: 40 (assuming $30,000 falls into the 40th bin)
Income (First Feature Column):
Range: The first 100 tokens (0-99).
Application: Each unique income value is placed into one of these 100 bins based on its magnitude. For instance, an income token of '56' indicates that the particular income falls into the 56th percentile-based bin, providing a discretized, manageable format for the model to process.
Occupation (Second Feature Column):
Range: A narrower spectrum, from 100 to 109, allocated for different types of occupations.
Application: Each distinct occupation is matched with one token from this set. For example, a token value of '101' represents a particular occupation predefined in the second slot of the occupation list. This ensures that occupation, a categorical variable, is neatly converted into a numerical format without losing its categorical essence.
Loan Amount (Third Feature Column):
Range: Following the occupations, this range extends from 110 to 209.
Application: Similar to income, each loan amount is categorized into one of these 100 distinct bins. A token value of '150', then, signifies that the loan amount is classified into the 40th bin within this range, effectively quantizing this continuous data into a tokenized format suitable for analysis.
Final Form
The final form of the tokens fed into the model for the example above would look like the following:
[56, 101, 150]
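As a sketch of this layout, the snippet below shows how a single row could be mapped to that token list. The helper name and the tiny occupation vocabulary are hypothetical; they just illustrate the offset scheme (0-99 income, 100-109 occupation, 110-209 loan amount).

# Hypothetical helper illustrating the token-offset layout described above
OCCUPATIONS = ['Data Scientist', 'Llama Veterinarian Specialist', 'Engineer']  # toy vocabulary

def encode_row(income_bin, occupation, loan_bin):
    income_token = income_bin                                # tokens 0-99
    occupation_token = 100 + OCCUPATIONS.index(occupation)   # tokens 100-109
    loan_token = 110 + loan_bin                              # tokens 110-209
    return [income_token, occupation_token, loan_token]

print(encode_row(56, 'Llama Veterinarian Specialist', 40))  # -> [56, 101, 150]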
Failed Experiments
Unfortunately, my method did not perform as well as I had hoped, and I was not able to outperform existing techniques like gradient boosting. However, I do think some improvements could be made. First, the binning strategy could be improved by trying multiple distributions and selecting the best-fitting one before binning. Additionally, experiments with different tokenization approaches and different numbers of bins could also be run, which could potentially improve performance. Finally, other architectures like RNNs could be tried, and I would be interested to see how they perform.
My unsuccessful transformer model:
import math
import torch
import torch.nn as nn

class SimpleTransformer(nn.Module):
    def __init__(self, input_dim, model_dim, num_classes, num_heads=2, num_layers=1, dropout=0.1):
        super(SimpleTransformer, self).__init__()
        self.embed = nn.Embedding(input_dim, model_dim)
        # batch_first=True so inputs are shaped (batch, sequence, model_dim)
        encoder_layer = nn.TransformerEncoderLayer(d_model=model_dim, nhead=num_heads, dropout=dropout, batch_first=True)
        self.transformer_encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        self.output_layer = nn.Linear(model_dim, num_classes)
        self.positional_encoding = self.create_positional_encoding(174 + 1, model_dim)  # Create positional encoding

    def create_positional_encoding(self, max_len, model_dim):
        """Create positional encoding matrix."""
        # Create a matrix that contains the positional encodings for all possible positions
        pe = torch.zeros(max_len, model_dim)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, model_dim, 2).float() * (-math.log(10000.0) / model_dim))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)  # Add batch dimension for broadcasting
        return pe.cuda()

    def forward(self, x):
        x = self.embed(x)  # Convert token indices to embeddings
        x = x + self.positional_encoding[:, :x.size(1), :]  # Add positional encoding to embeddings
        x = self.transformer_encoder(x)
        x = torch.mean(x, dim=1)  # Average pooling over the sequence dimension
        return self.output_layer(x)
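For completeness, here's how the model could be instantiated and called. The vocabulary size, batch size, and dimensions are illustrative assumptions, not the values from my actual runs, and a GPU is required since the positional encoding is pinned to CUDA.

# Illustrative usage only; all sizes here are assumptions
model = SimpleTransformer(input_dim=512, model_dim=64, num_classes=2).cuda()
tokens = torch.randint(0, 512, (8, 175)).cuda()  # batch of 8 rows, each a sequence of 175 tokens
logits = model(tokens)                           # shape: (8, 2)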
Additionally, I'll share my run logs below for the model:
Run: electric-sea-1
Conclusion
Participating in my first Kaggle challenge was a journey filled with learning and growth. It was not just about applying machine learning models, but about understanding the problem, the data, and leveraging the community. I learned that the key to success in data science lies not only in technical skills but also in understanding the problem and the tools and information available for solving it. These lessons will undoubtedly serve me well in my future data science endeavors, and I hope they can inspire others embarking on their first Kaggle challenge. I'll also share my code here if you are interested.