A survey of financial datasets for machine learning
An overview of popular datasets used for ML in finance!
Created on February 8|Last edited on May 8
Whether you're predicting credit risk, detecting financial fraud, or forecasting market movements, the foundation of your analysis rests on the quality and depth of the data at your disposal.
This article guides you through some popular financial datasets available for public use, covering applications that range from fraud detection to home mortgage lending.

What We'll Cover
Some Preliminaries
The Kaggle Lending Club dataset
Default of Credit Card Clients dataset
The Kaggle Credit Card Fraud Detection dataset
Yahoo Finance dataset
The Home Mortgage Disclosure Act (HMDA) dataset
Statlog German Credit dataset
Kaggle Credit Card Approval Prediction dataset
Conclusion
More Reports for Utilizing These Datasets
Some Preliminaries
Handling large datasets can be challenging. Attempting to open very large CSV files (the typical format for these finance datasets) can lead to performance issues or even crashes, especially on machines with limited resources.
To mitigate this, it's useful to work with a smaller subset of the data for preliminary analysis, testing code, or quick inspections. One way to create such a subset is to extract the first 10 rows of a CSV file and save them as a new file.
Here's a command that does just that. It keeps 11 lines: the header row plus the first 10 data rows.
head -n 11 original_dataset.csv > original_dataset-first10.csv
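If you would rather stay in Python, a similar subset can be produced with pandas. The sketch below is one possible approach, not part of the original workflow: the function name save_first_rows and the file paths are hypothetical, and it assumes the CSV has a single header row.

```python
import pandas as pd


def save_first_rows(src_path, dst_path, n_rows=10):
    """Read only the first n_rows data rows of a CSV and save them to a new file."""
    # nrows limits parsing to n_rows data rows (the header is handled separately),
    # so the full file is never loaded into memory.
    subset = pd.read_csv(src_path, nrows=n_rows)
    subset.to_csv(dst_path, index=False)
    return subset
```

For example, save_first_rows("original_dataset.csv", "original_dataset-first10.csv") would write a file equivalent in spirit to the one created by the shell command.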
This is really helpful for quick exploration of the dataset. Now let's get into the datasets:
The Kaggle Lending Club dataset
This dataset represents loan transactions from a peer-to-peer lending platform, detailing various aspects of each loan and borrower profile. It includes information such as the loan amount requested, the amount funded, loan terms, interest rates, and installment amounts.
Additionally, it captures borrower-specific information like employment title, length of employment, home ownership status, annual income, and credit history details (e.g., FICO scores, number of delinquencies). Each entry also includes the loan's current status, indicating whether it has been fully paid, is current, or has been charged off, among other statuses.
The purpose of the loan, ranging from debt consolidation to home improvement or major purchases, is noted, providing insights into why borrowers are seeking funds.
This rich dataset offers a comprehensive view into the lending process, highlighting the financial health and intentions of borrowers as well as the outcomes of their loan requests.
The dataset is packaged as two CSV files: one contains loans accepted on the platform, and the other contains rejected loan applications. The "accepted loans" CSV file is typically of most interest, since it records the outcomes of loans that were actually issued.
Use case(s) for this data
Loan default prediction: Assessing the likelihood that an applicant will default on their loan.
The loan_status column in the CSV file gives the status of each loan, which can then be used as a label for training a classifier.
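As a rough illustration of turning loan_status into a training label, the sketch below keeps only loans with a final outcome and marks defaults as 1. The exact status strings ("Fully Paid", "Charged Off", "Default") and the helper name add_default_label are assumptions here; check df["loan_status"].unique() on your copy of the data before relying on them.

```python
import pandas as pd

# Assumed status strings; verify them against your copy of the dataset.
DEFAULT_STATUSES = ["Charged Off", "Default"]
REPAID_STATUSES = ["Fully Paid"]


def add_default_label(df, status_col="loan_status"):
    """Keep only loans with a final outcome and add a binary 'default' column."""
    final_statuses = DEFAULT_STATUSES + REPAID_STATUSES
    finished = df[df[status_col].isin(final_statuses)].copy()
    # 1 = loan ended in default or charge-off, 0 = loan was fully repaid
    finished["default"] = finished[status_col].isin(DEFAULT_STATUSES).astype(int)
    return finished


# Made-up rows for illustration only
example = pd.DataFrame(
    {"loan_status": ["Fully Paid", "Charged Off", "Current", "Fully Paid"]}
)
labeled = add_default_label(example)
print(labeled["default"].tolist())  # [0, 1, 0]
```

Loans that are still in progress (such as "Current") are dropped in this sketch because their final outcome is not yet known. Whether to drop them or treat them differently is a modeling choice.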
How to download it
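To download the dataset, go to Kaggle and download it from the dataset page. After downloading, you can sample from the dataset using the following code:
import pandas as pd

# Placeholder: set this to the path of the downloaded CSV file
filename = "path/to/downloaded_file.csv"
df = pd.read_csv(filename)
print(df.iloc[0])  # first row of the dataset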
Default of Credit Card Clients dataset
This dataset, obtained from the UCI Machine Learning Repository, presents an extensive overview for studying credit card defaults among 30,000 clients, incorporating 24 different variables to shed light on various aspects of the clients' credit utilization and financial behaviors.
It includes demographic information such as gender, education, marital status, and age, alongside financial details like credit limits. The dataset also features a series of variables that track the repayment status across recent months, providing insights into the clients' payment patterns. Furthermore, it encompasses data on monthly bill statements and payments, allowing for an in-depth analysis of the financial habits and stability of the clients.
Use case(s) for this data
Default Prediction: Evaluating the probability that a client will default next month.
Whether a client defaulted on their credit card payment is recorded in the dataset's target variable, which is available through the targets attribute of the fetched data. A value of 1 signifies that the client defaulted on their payment, and a value of 0 indicates that the client did not default.
How to access and sample this dataset:
To access this dataset, you can utilize the Python package ucimlrepo, which provides an easy interface to fetch datasets from the UCI Machine Learning Repository.
Here's how you can download and sample the Default of Credit Card Clients dataset:
from ucimlrepo import fetch_ucirepo

# fetch dataset
default_of_credit_card_clients = fetch_ucirepo(id=350)

# data (as pandas dataframes)
X = default_of_credit_card_clients.data.features
y = default_of_credit_card_clients.data.targets

# metadata
print(default_of_credit_card_clients.metadata)

# variable information
print(default_of_credit_card_clients.variables)
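Before training a default classifier, it usually helps to check how the two classes are balanced. The sketch below is a minimal helper for that, using a tiny made-up target frame instead of the real data. The name target_summary is hypothetical, and it assumes the targets arrive as a single-column DataFrame, as in the snippet above.

```python
import pandas as pd


def target_summary(y):
    """Return the share of each class in a single-column target frame or Series."""
    # Take the only column if a DataFrame was passed; otherwise use the Series as is.
    labels = y.iloc[:, 0] if isinstance(y, pd.DataFrame) else y
    return labels.value_counts(normalize=True).sort_index()


# Small synthetic stand-in for the real targets (not actual dataset values)
toy_y = pd.DataFrame({"Y": [0, 0, 0, 1]})
print(target_summary(toy_y))
```

With the real data, calling target_summary(y) on the fetched targets would give the fraction of defaulting and non-defaulting clients.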
The Kaggle Credit Card Fraud Detection dataset
This dataset is a tremendous resource for those looking to explore the intricate world of financial security, offering a deep dive into transactions made by European cardholders in September 2013.
With a dataset comprising 284,807 transactions (of which 492 are fraudulent) it presents a unique challenge due to its highly unbalanced nature, where frauds account for merely 0.17% of the total transactions.
The dataset stands out by providing a blend of 30 numerical variables, 28 of which are anonymized and transformed through Principal Component Analysis to maintain confidentiality. The remaining two, 'Time' and 'Amount', offer a glimpse into the temporal dynamics and the monetary value of each transaction, respectively. The 'Class' variable is the linchpin, indicating a transaction's legitimacy (0) or fraudulence (1).
Use case(s) for this data
Fraud detection modeling: Crafting algorithms capable of distinguishing between fraudulent and legitimate transactions with high precision, leveraging the nuanced patterns hidden within the PCA-transformed features.
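Because the classes are so unbalanced, plain accuracy is a misleading metric here. The short calculation below uses only the counts quoted above. It shows that a model which labels every transaction as legitimate scores about 99.8% accuracy while catching no fraud at all, which is why precision and recall are more informative. The helper name precision_recall is hypothetical.

```python
def precision_recall(true_pos, false_pos, false_neg):
    """Compute precision and recall from confusion-matrix counts."""
    predicted_pos = true_pos + false_pos
    actual_pos = true_pos + false_neg
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    return precision, recall


total, frauds = 284_807, 492  # counts quoted in the dataset description

# A model that labels every transaction legitimate is right on all non-fraud rows...
baseline_accuracy = (total - frauds) / total
# ...but has zero true positives and misses all 492 frauds (false negatives).
baseline_precision, baseline_recall = precision_recall(0, 0, frauds)

print(f"fraud share:       {frauds / total:.4%}")
print(f"baseline accuracy: {baseline_accuracy:.4%}")
print(f"baseline recall:   {baseline_recall:.0%}")
```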
How to download it
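To download the dataset, go to Kaggle and download it from the dataset page. After downloading, you can sample from the dataset using the following code:
import pandas as pd

# Placeholder: set this to the path of the downloaded CSV file
filename = "path/to/downloaded_file.csv"
df = pd.read_csv(filename)
print(df.iloc[0])  # first row of the dataset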
Yahoo Finance dataset
The Yahoo Finance dataset serves as a cornerstone for financial analysis, providing a comprehensive repository of historical daily stock price data for a wide array of companies and financial instruments.
This dataset is a treasure trove for investors, analysts, and data scientists alike, offering detailed insights into the fluctuations of stock prices, trading volumes, and market trends over time. Key variables include open, high, low, close prices, adjusted close prices (accounting for dividends and stock splits), and volume of shares traded.
These variables enable a multitude of financial analyses, from basic stock performance tracking to complex market predictions and investment strategy development.
Use case(s) for this data
Stock market analysis: Evaluating the performance of stocks, identifying trends, and making predictions based on historical price movements.
Portfolio optimization: Using historical data to optimize investment portfolios, balancing risk against returns.
Algorithmic trading: Developing and backtesting trading algorithms that make automated trading decisions based on price movement patterns.
Volatility analysis: Assessing the volatility of stock prices to understand market risk and investor sentiment.
Accessing the Yahoo Finance dataset is facilitated through various APIs and tools designed for data extraction and manipulation. One of the most popular tools among Python users is yfinance, a library that allows easy access to the vast amounts of financial data hosted on Yahoo Finance.
To get started with yfinance and explore the dataset, follow this simple Python code snippet:
import yfinance as yf

# Fetch historical data for a specific stock (e.g., Apple Inc. with ticker symbol "AAPL")
ticker = "AAPL"
data = yf.download(ticker, start="2020-01-01", end="2020-12-31")

# Display the first few rows of the fetched data
print(data.head())
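As a small illustration of the volatility analysis use case, the sketch below computes daily returns and an annualized volatility figure from a series of closing prices. It runs on synthetic prices rather than downloaded data. The function names are hypothetical, and the 252 trading days per year used for scaling is a common convention assumed here, not something specified by Yahoo Finance.

```python
import numpy as np
import pandas as pd

TRADING_DAYS_PER_YEAR = 252  # common convention, assumed here


def daily_returns(close):
    """Simple day-over-day percentage returns from a Series of closing prices."""
    return close.pct_change().dropna()


def annualized_volatility(close):
    """Standard deviation of daily returns scaled to a yearly figure."""
    return daily_returns(close).std() * np.sqrt(TRADING_DAYS_PER_YEAR)


# Synthetic prices for illustration only (not real market data)
prices = pd.Series([100.0, 101.0, 99.0, 102.0, 103.0])
print(daily_returns(prices).round(4).tolist())
print(round(annualized_volatility(prices), 4))
```

With downloaded data, you would pass the closing-price column (for example data["Close"]) to these functions. Depending on the yfinance version, that column may come back as a one-column DataFrame rather than a Series, in which case calling .squeeze() on it first should help.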
The Home Mortgage Disclosure Act (HMDA) dataset
The Home Mortgage Disclosure Act (HMDA) dataset, collected under the HMDA, which was passed in 1975, serves as a critical tool for analyzing mortgage lending practices throughout the United States, with the aim of preventing discriminatory lending practices and ensuring fair access to housing loans.
This dataset offers a detailed snapshot of the housing market by collecting extensive data on mortgage applications. It covers applicant demographics (race, ethnicity, gender, age, and income level), loan specifics such as amount, purpose, terms, and type (for example, conventional, FHA, VA), and property details including location and type. It also identifies the lending institution and records the outcome of each application (approved, denied, withdrawn), including reasons for denial when applicable. Through this wealth of information, the HMDA dataset supports both comprehensive analysis of the housing market and efforts to understand and address lending disparities.
Use case(s) for this data
Fair lending and housing policy analysis: Assessing how lending practices vary across different regions and demographic groups to identify potential discrimination or market disparities.
Market trend analysis: Understanding trends in mortgage lending, including the popularity of different loan types, average loan amounts, and approval rates over time.
One common prediction task is to predict whether or not the loan was originated, which essentially just means that the loan was granted.
How to download it
The HMDA dataset is published through the Consumer Financial Protection Bureau (CFPB) and can be accessed online for various years.
The dataset is updated annually, providing a rich historical record for longitudinal studies. To work with the HMDA dataset, researchers can download the data directly from the CFPB website or access it via API for more dynamic queries.
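After downloading a CSV file, you can sample from the dataset using the following code:
import pandas as pd

# Placeholder: set this to the path of the downloaded CSV file
filename = "path/to/downloaded_file.csv"
df = pd.read_csv(filename)
print(df.iloc[0])  # first row of the dataset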
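For the origination prediction task mentioned earlier, a binary label can be derived from the column that records the action taken on each application. The sketch below assumes that this column is named action_taken and that code 1 means "loan originated". Both are assumptions to verify against the data dictionary for the year you download, and the helper name add_originated_label is hypothetical.

```python
import pandas as pd

# Assumption: "action_taken" holds the application outcome and code 1 means
# "loan originated". Verify against the data dictionary for your year.
ORIGINATED_CODE = 1


def add_originated_label(df, action_col="action_taken"):
    """Add a binary 'originated' column: 1 if the loan was originated, else 0."""
    labeled = df.copy()
    labeled["originated"] = (labeled[action_col] == ORIGINATED_CODE).astype(int)
    return labeled


toy = pd.DataFrame({"action_taken": [1, 3, 1, 4]})  # made-up rows
print(add_originated_label(toy)["originated"].tolist())  # [1, 0, 1, 0]
```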
Statlog German Credit dataset
The Statlog (German Credit Data) dataset, hosted by the UCI Machine Learning Repository, is a comprehensive tool for credit risk analysis, classifying individuals as good or bad credit risks based on a set of attributes.
With data on 1,000 clients and 20 different features, this dataset provides a multivariate perspective on creditworthiness in a social science context. The features cover aspects such as the status of existing checking accounts, credit history, purpose of the credit, credit amount, savings accounts, employment status, and demographic details like marital status and age.
This blend of information offers a detailed framework for understanding how various factors contribute to an individual's credit risk, making it suitable for classification tasks and the exploration of predictive accuracy in credit scoring models.
Use case(s) for this data
Risk Management: Identifying patterns that indicate higher risk of default, enabling financial institutions to mitigate potential losses.
The dataset employs a binary classification system to evaluate an individual's creditworthiness, categorizing applicants into either good or bad credit risks. The target variable y indicates this class, with the value 1 assigned to those deemed as good credit risks and the value 2 for bad credit risks.
As with the Default of Credit Card Clients dataset, you can fetch this dataset with the ucimlrepo package:
from ucimlrepo import fetch_ucirepo

# fetch dataset
statlog_german_credit_data = fetch_ucirepo(id=144)

# data (as pandas dataframes)
X = statlog_german_credit_data.data.features
y = statlog_german_credit_data.data.targets

# metadata
print(statlog_german_credit_data.metadata)

# variable information
print(statlog_german_credit_data.variables)
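Many classifiers and metrics expect a 0/1 target with the "positive" class as 1, so the 1/2 coding described above is often remapped before training. The sketch below shows one way to do that on made-up labels. The helper name to_binary_risk and the choice to treat bad credit risk as the positive class are assumptions for illustration.

```python
import pandas as pd


def to_binary_risk(y):
    """Map the original 1 (good) / 2 (bad) codes to 0 (good) / 1 (bad)."""
    # Take the only column if a DataFrame was passed; otherwise use the Series as is.
    labels = y.iloc[:, 0] if isinstance(y, pd.DataFrame) else y
    return labels.map({1: 0, 2: 1}).rename("bad_risk")


toy_y = pd.DataFrame({"class": [1, 2, 1, 1]})  # made-up labels
print(to_binary_risk(toy_y).tolist())  # [0, 1, 0, 0]
```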
Kaggle Credit Card Approval Prediction dataset
The Credit Card Approval Prediction Dataset available on Kaggle is a valuable resource for those interested in predicting credit card approval outcomes. It offers a comprehensive view of applicant profiles and credit histories.
The dataset consists of two CSV files that can be linked through a unique identifier for each applicant. The first file contains applicant attributes such as gender, car ownership status, total income, and level of education. The second file tracks the credit history of clients, including the month of record and loan payment status.
For data preparation, the loan payment status in the credit history file is key to classifying clients. This status, indicating how timely payments were made each month, can be used to identify clients who may pose a higher risk. For example, clients with payments overdue by a certain period, such as 60 days, might be categorized as higher risk.
To perform a comprehensive analysis by combining the applicant details with their corresponding credit history, you can use the following Python code snippet:
import pandas as pd

# Load the datasets
application_df = pd.read_csv('path/to/application_record.csv')
credit_df = pd.read_csv('path/to/credit_record.csv')

# Define a function to label the 'bad' clients based on STATUS
def label_bad_clients(status):
    # Implement logic to define a 'bad' client
    # Here, '2' to '5' indicate varying degrees of payment being overdue
    # ('2': 60-89 days overdue, '5': overdue or bad debts)
    if any(code in status for code in ['2', '3', '4', '5']):
        return 1  # 'bad' client
    else:
        return 0  # 'good' client

# Apply the labeling function to the STATUS column
# Assuming the STATUS is a string of concatenated monthly status codes
credit_df['LABEL'] = credit_df['STATUS'].apply(label_bad_clients)

# Merge the datasets on the 'ID' column
merged_df = pd.merge(application_df, credit_df, on='ID')

# Display the first few entries of the merged dataframe
print(merged_df.head())
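If the credit history file stores one row per client per month, the merge above yields several rows per applicant. For a per-applicant classifier, you may prefer a single label per ID. The sketch below is one possible way to aggregate, under the assumption that STATUS holds a single code per row. It uses made-up records, and the helper name label_clients and the income column in the toy data are only illustrative.

```python
import pandas as pd

OVERDUE_CODES = ["2", "3", "4", "5"]  # same 'bad' codes as in the snippet above


def label_clients(credit_df):
    """Return one row per client ID with LABEL = 1 if any month was 60+ days overdue."""
    is_bad_month = credit_df["STATUS"].astype(str).isin(OVERDUE_CODES)
    # max() over booleans is True if at least one month for that ID was bad
    labels = is_bad_month.groupby(credit_df["ID"]).max().astype(int)
    return labels.rename("LABEL").reset_index()


# Made-up records: client 1 has one badly overdue month, client 2 has none
toy_credit = pd.DataFrame({"ID": [1, 1, 2, 2], "STATUS": ["0", "3", "C", "X"]})
toy_apps = pd.DataFrame({"ID": [1, 2], "AMT_INCOME_TOTAL": [50000, 72000]})

merged = toy_apps.merge(label_clients(toy_credit), on="ID", how="inner")
print(merged)
```

Using an inner merge keeps only applicants that have at least one credit record. Whether that is the right choice depends on how you want to treat applicants with no history.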
Conclusion
The exploration of various financial datasets presented in this article provides a critical foundation for anyone seeking to delve into the realm of financial analysis and predictive modeling. From the intricacies of loan default predictions to the challenges of fraud detection and beyond, these datasets serve as indispensable tools for developing robust, data-driven solutions in finance.
They not only offer a snapshot of historical financial behaviors and trends but also present opportunities to uncover deep insights and drive strategic decision-making. Whether for academic research, industry application, or personal knowledge enhancement, the datasets discussed are a gateway to a deeper understanding of financial phenomena and a testament to the transformative power of data in the digital age.
More Reports for Utilizing These Datasets
SparkML and XGBoost Spark with W&B
Adding W&B tracking to SparkML Pipelines and CrossValidators
Tutorial: Regression and Classification on XGBoost
A short tutorial on how you can use XGBoost with code and interactive visualizations.
Visualize XGBoost in One Line
Using boosted trees? Try our new integration to visualize your work in a single line.
Decision Trees: A Guide with Examples
A tutorial covering Decision Trees, complete with code and interactive visualizations