Leveraging synthetic data for tabular financial fraud detection

A guide on overcoming data scarcity for fraud detection
Created on May 1|Last edited on May 13
In financial services, fraud detection is critical, and it is difficult: fraudsters use sophisticated tactics, and vast volumes of transactions need to be monitored in real time. Traditional methods of fraud detection involve rule-based systems and machine learning models trained on historical data.
However, these methods often struggle because fraud cases are rare, which can produce models that generalize poorly and are biased towards the majority class of legitimate transactions.
In this post, we'll look at how to overcome that data scarcity. 
﻿
What We'll Cover
Fraud Detection for Financial Transactions
Why Use Synthetic Data?
Methods of Synthetic Data Generation
Generating Synthetic Data
Models for Tabular Transaction Fraud Data
Overall
﻿
Fraud Detection for Financial Transactions
The primary challenge in fraud detection is the imbalanced nature of transaction datasets where fraudulent transactions are significantly outnumbered by legitimate ones. This imbalance can lead to models that are overfitted to the majority class and unable to detect the nuanced patterns of fraud. In this article, we'll explore using synthetic data to improve fraud detection models, focusing particularly on tabular data common in financial transactions.
Data scientists often employ various techniques to enhance the quality and quantity of training data, which in turn helps improve the detection accuracy of their models. One of these techniques is the Synthetic Minority Over-sampling Technique (SMOTE), which oversamples minority class instances in the training set. More advanced methods generate synthetic data that mimics the statistical properties of real data, providing a richer dataset that helps models learn and generalize better from both classes.
By creating synthetic replicas of transactional data, we can train fraud detection models on well-rounded datasets that better represent the complexities of fraudulent activities. This approach not only enhances model accuracy but also offers a scalable solution to continually adapt to new and emerging fraudulent patterns without the need for constantly acquiring new real transaction data, which can be sparse or delayed.
Why Use Synthetic Data?
Synthetic data generation addresses several key issues in fraud detection modeling. Primarily, it provides a solution to the scarcity of fraudulent transaction data. By generating artificial data points that closely resemble real-world data in terms of statistical properties, researchers and practitioners can enhance their datasets without compromising privacy or security. This expanded dataset facilitates the training of more robust machine learning models by providing a balanced view of both fraudulent and non-fraudulent transactions.
Furthermore, synthetic data allows for the simulation of rare or emerging types of fraud that may not be well-represented in historical data. This is crucial for developing proactive fraud detection systems that can adapt to new tactics employed by fraudsters. It also aids in stress-testing fraud detection models under various scenarios to ensure they are resilient in diverse conditions.
Methods of Synthetic Data Generation
To enhance fraud detection models through synthetic data, several techniques can be employed, each with unique capabilities to mimic and augment real-world data.
Here's a high-level overview of how each method works:
1. SMOTE (Synthetic Minority Over-sampling Technique)
SMOTE is an advanced oversampling technique specifically developed to address class imbalance in datasets, a common issue in scenarios like fraud detection. 
Unlike simple duplication of minority class instances, SMOTE synthesizes new examples from the existing minority instances. It operates by selecting samples that are close in the feature space, drawing a line between these samples in the space, and generating new samples along that line. 
This approach not only increases the quantity of the minority class but also encourages a smoother decision boundary in models, because it diversifies the minority class examples and thus helps prevent models from overfitting to specific instances. By enhancing the dataset in this manner, SMOTE improves the generalization ability of models, making them more adept at identifying rare fraudulent transactions that might not be well-represented in the original data.
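To make the interpolation step concrete, here is a minimal sketch of the idea in plain Python. It is not the imbalanced-learn implementation used later in this post; the function name `smote_like_samples`, the toy points, and the choice of `k=2` neighbors are all assumptions made for illustration.

```python
import math
import random


def smote_like_samples(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        # Pick a random minority point as the starting sample.
        base = rng.choice(minority)
        # Find its k nearest minority neighbors by Euclidean distance.
        others = [p for p in minority if p is not base]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        # Move a random fraction of the way along the line from base to neighbor.
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic


# Hypothetical toy minority class (two scaled features of a few fraud rows).
fraud_points = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_like_samples(fraud_points, n_new=5)
for point in new_points:
    print(point)
```

Because every new point sits on a segment between two real minority points, the synthetic samples stay inside the region the minority class already occupies, which is why SMOTE diversifies the class without inventing entirely new behavior.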
2. Gaussian Copula
The Gaussian Copula is a statistical method used to model the dependency between multiple variables while retaining the individual distributions of each variable. 
It begins by transforming the margins (distributions of individual variables) into a uniform scale using each variable's cumulative distribution function, and then into standard normal scores. The copula then models the dependencies between these transformed variables using a multivariate Gaussian distribution. This method is particularly useful for financial datasets because it retains the interdependencies between variables, which are critical for maintaining the data's integrity and ensuring robust statistical analysis.
By capturing both the individual behaviors of variables and their interactions, Gaussian Copulas can generate new, synthetic datasets that preserve essential statistical properties and relationships, providing a valuable tool for training more effective fraud detection models.
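The steps above can be sketched for two columns in plain Python. This is a simplified illustration, not how SDV's GaussianCopulaSynthesizer is implemented: it uses empirical ranks for the margins (so it can only return values already seen in each column), and the helper names and toy columns are hypothetical.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def to_normal_scores(values):
    """Map a column to standard normal scores via its empirical ranks."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    scores = [0.0] * n
    for rank, idx in enumerate(order, start=1):
        # rank / (n + 1) keeps the uniform value strictly inside (0, 1).
        scores[idx] = STD_NORMAL.inv_cdf(rank / (n + 1))
    return scores


def empirical_quantile(sorted_values, u):
    """Return the value at quantile u of an already sorted column."""
    idx = min(int(u * len(sorted_values)), len(sorted_values) - 1)
    return sorted_values[idx]


def sample_gaussian_copula(col_a, col_b, n_samples, seed=0):
    """Sample synthetic (a, b) pairs that keep each margin and their dependence."""
    rng = random.Random(seed)
    # Dependence is estimated on the normal scores, not on the raw values.
    rho = pearson(to_normal_scores(col_a), to_normal_scores(col_b))
    residual_scale = max(0.0, 1 - rho ** 2) ** 0.5
    sorted_a, sorted_b = sorted(col_a), sorted(col_b)
    samples = []
    for _ in range(n_samples):
        # Draw two standard normals with correlation rho.
        z1 = rng.gauss(0, 1)
        z2 = rho * z1 + residual_scale * rng.gauss(0, 1)
        # Normal -> uniform -> original margin.
        u1, u2 = STD_NORMAL.cdf(z1), STD_NORMAL.cdf(z2)
        samples.append((empirical_quantile(sorted_a, u1), empirical_quantile(sorted_b, u2)))
    return samples


# Hypothetical toy columns: transaction amount and number of items purchased.
amounts = [5, 12, 20, 35, 50, 80, 120, 200]
item_counts = [1, 1, 2, 2, 3, 2, 4, 5]
synthetic = sample_gaussian_copula(amounts, item_counts, n_samples=1000)
print(synthetic[:5])
```

The point of the design is the separation of concerns: the margins are handled column by column, while the single correlation value carries the dependence, so large amounts still tend to appear with large item counts in the synthetic rows.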
3. Generative Adversarial Networks (GANs)
Generative Adversarial Networks (GANs) represent a novel approach in the generation of synthetic data, utilizing two neural networks in a competitive framework: a generator and a discriminator. 
The generator’s job is to create data instances that are indistinguishable from real data, while the discriminator’s role is to distinguish between the generator’s output and actual data. Through this competition, the generator learns to produce more accurate and realistic samples over time. GANs are particularly noted for their ability to generate high-quality, complex data across various domains, including images and text. 
However, they require substantial computational resources and large amounts of training data to effectively capture the underlying data distribution. In fraud detection, GANs can be used to generate realistic transactional data, which helps models learn to identify subtle and sophisticated patterns of fraudulent behavior.
Generating Synthetic Data
In our data preparation and synthetic generation process, we start by loading a financial transaction dataset containing transactions made in September 2013 by European cardholders (available on Kaggle). We then preprocess it by scaling the Amount and Time features so that they are on a scale similar to the PCA components in the rest of the dataset.
To realistically simulate the often limited availability of fraud data in financial datasets, we sample 10% of our training data (frac=0.1 in the code below). This approach not only reflects a common real-world scenario but also helps manage computational resources efficiently. We apply the Synthetic Minority Over-sampling Technique (SMOTE) to this sample to create a balanced dataset where fraudulent transactions are equal in number to legitimate ones, effectively addressing the class imbalance that frequently hampers fraud detection models.
By leveraging the Synthetic Data Vault (SDV) library, we are able to implement the data synthesis process concisely! I was really impressed by the usability of the library, as its syntax and documentation are extremely well done!
Here's the code to generate our datasets:
import pandas as pd
import os
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer, CTGANSynthesizer
﻿
# Ensure the data directory exists
os.makedirs('./data', exist_ok=True)
﻿
# Load the dataset
file_path = 'creditcard.csv'
data = pd.read_csv(file_path)
﻿
# Scale "Amount" and "Time" to have a similar scale to the PCA components
scaler = StandardScaler()
data['scaled_amount'] = scaler.fit_transform(data['Amount'].values.reshape(-1, 1))
data['scaled_time'] = scaler.fit_transform(data['Time'].values.reshape(-1, 1))
﻿
# Drop the original 'Time' and 'Amount' columns
data.drop(['Time', 'Amount'], axis=1, inplace=True)
﻿
# Split the dataset into training and testing sets
X = data.drop('Class', axis=1)
y = data['Class']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
﻿
# Save the test set
test_set = pd.concat([X_test, y_test], axis=1)
test_set.to_pickle('./data/test_set.pkl')
﻿
# Sample 10% of the training data to be used for synthetic data generation
data_sample = pd.concat([X_train, y_train], axis=1).sample(frac=0.1, random_state=42)
X_sample = data_sample.drop('Class', axis=1)
y_sample = data_sample['Class']
﻿
# Metadata setup
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(data_sample) 
﻿
# Apply SMOTE to the sampled data to balance it
smote = SMOTE(random_state=42)
X_smote, y_smote = smote.fit_resample(X_sample, y_sample)
﻿
balanced_data_sample = pd.concat([X_smote, y_smote], axis=1)
﻿
# Save the SMOTE-balanced sample
balanced_data_sample.to_pickle('./data/balanced_original_sample.pkl')
﻿
# Function to fit synthesizer, generate synthetic data, and save combined data
def generate_and_save_synthetic_data(synthesizer, synthesizer_name, original_data, smote_data):
    synthesizer.fit(original_data)  # Fit the synthesizer on the original data sample
﻿
    # Determine the number of fraud and non-fraud cases in the SMOTE-balanced dataset
    fraud_count = smote_data['Class'].sum()
    non_fraud_count = len(smote_data) - fraud_count
﻿
    # Start with the original sampled data
    current_frauds = original_data[original_data['Class'] == 1]
    current_non_frauds = original_data[original_data['Class'] == 0]
﻿
    # Generate synthetic data until the class distribution matches that of the SMOTE-balanced dataset
    while len(current_frauds) < fraud_count or len(current_non_frauds) < non_fraud_count:
        synthetic_batch = synthesizer.sample(min(500, fraud_count + non_fraud_count - len(current_frauds) - len(current_non_frauds)))  # Sample in batches
        batch_frauds = synthetic_batch[synthetic_batch['Class'] == 1]
        batch_non_frauds = synthetic_batch[synthetic_batch['Class'] == 0]
﻿
        # Add to the collections without exceeding needed counts
        if len(current_frauds) < fraud_count:
            required_frauds = fraud_count - len(current_frauds)
            current_frauds = pd.concat([current_frauds, batch_frauds[:required_frauds]], ignore_index=True)
        if len(current_non_frauds) < non_fraud_count:
            required_non_frauds = non_fraud_count - len(current_non_frauds)
            current_non_frauds = pd.concat([current_non_frauds, batch_non_frauds[:required_non_frauds]], ignore_index=True)
﻿
    # Combine frauds and non-frauds to form the balanced synthetic dataset
    combined_data = pd.concat([current_frauds, current_non_frauds], ignore_index=True)
﻿
    # Save combined data
    combined_data.to_pickle(f'./data/combined_{synthesizer_name}.pkl')
﻿
﻿
﻿
# Initialize synthesizers and generate data
gaussian_synthesizer = GaussianCopulaSynthesizer(metadata)
generate_and_save_synthetic_data(gaussian_synthesizer, 'GaussianCopula', data_sample, balanced_data_sample)
﻿
ctgan_synthesizer = CTGANSynthesizer(metadata, enforce_rounding=False, epochs=100, verbose=True)
generate_and_save_synthetic_data(ctgan_synthesizer, 'CTGAN', data_sample, balanced_data_sample)
﻿
﻿
# List of pickle files to load
pickle_files = [
    './data/test_set.pkl',
    './data/balanced_original_sample.pkl',
    './data/combined_GaussianCopula.pkl',
    './data/combined_CTGAN.pkl',
]
﻿
# Function to load a pickle and print class distribution
def print_class_distribution(file_path):
    if os.path.exists(file_path):
        data = pd.read_pickle(file_path)
        class_distribution = data['Class'].value_counts()
        print(f"Class distribution for {file_path}: {class_distribution}")
    else:
        print(f"File {file_path} does not exist.")
﻿
# Iterate through the pickle files and print their class distributions
for file in pickle_files:
    print_class_distribution(file)
﻿
We employ SMOTE, along with Gaussian Copula and CTGAN models, all of which work from the same 10% sample of the training data. For the two SDV synthesizers, we keep sampling synthetic rows until the class counts match those of the SMOTE-balanced dataset, so every training set has equal representation of fraudulent and legitimate transactions.
After generating the synthetic data, we save it for subsequent model training and evaluation. Before training the models, we confirm the balance between classes by printing the class distribution of each saved dataset.
Models for Tabular Transaction Fraud Data
To evaluate the effectiveness of synthetic data in fraud detection, various models were trained and tested:
Random Forest: An ensemble model known for its robustness and effectiveness in handling imbalanced datasets.
Logistic Regression: A simpler, yet powerful model for binary classification tasks.
Support Vector Machine (SVM): Effective in high-dimensional spaces even when the number of dimensions exceeds the number of samples.
K-Nearest Neighbors (KNN): A non-parametric method that is simple yet effective for classification by comparing new points to known points.
XGBoost: A gradient boosting framework that provides a scalable, efficient, and effective solution for structured data problems.
We will train each model on the dataset produced by each synthetic data generation method individually. This approach allows us to compare the efficacy of different data augmentation strategies in improving the robustness and accuracy of fraud detection models. By using a variety of techniques, we can assess which methods best complement the inherent strengths and weaknesses of each model.
For monitoring and comparing the performance of each classifier, we integrate Weights & Biases, an experiment tracking tool which will help us visualize the results for all of the different models and data synthesis methods we will use. This will allow us to systematically log performance metrics such as accuracy, precision, recall, and F1-score for each model across each dataset type. Using wandb, we can visualize and analyze these metrics to understand the strengths and weaknesses of our models in real-time.
Here's the code to train and evaluate our models on the SMOTE-balanced dataset and on the datasets augmented by the Gaussian Copula and CTGAN synthesizers:
import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier
import wandb
import numpy as np 
﻿
﻿
# Load datasets
balanced_data_sample = pd.read_pickle('./data/balanced_original_sample.pkl')
combined_data_gaussian = pd.read_pickle('./data/combined_GaussianCopula.pkl')
combined_data_ctgan = pd.read_pickle('./data/combined_CTGAN.pkl')
﻿
test_set = pd.read_pickle('./data/test_set.pkl')
﻿
# Define datasets
datasets = {
    "SMOTE-balanced": balanced_data_sample,
    "Gaussian Copula": combined_data_gaussian,
    "CTGAN": combined_data_ctgan,
    
}
﻿
# Define models
models = {
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    # use_label_encoder is deprecated in recent XGBoost releases, so it is omitted here
    "XGBoost": XGBClassifier(eval_metric='logloss')
}
﻿
# Split test data
X_test = test_set.drop('Class', axis=1)
y_test = test_set['Class']
﻿
def train_and_evaluate_models(models, datasets):
    for data_name, data in datasets.items():
        X_train = np.asarray(data.drop('Class', axis=1), order='C')  # Ensure X_train is C-contiguous
        y_train = data['Class']
﻿
        for model_name, model in models.items():
            # Replace the entity with your own W&B username or team before running
            with wandb.init(project='synthetic_fraud_detection', entity='byyoung3', name=f"{model_name}_{data_name}") as run:
                # Train model
                model.fit(X_train, y_train)
                y_pred = model.predict(np.asarray(X_test, order='C'))  # Ensure X_test is C-contiguous
                # Evaluate model
                acc = accuracy_score(y_test, y_pred)
                prec = precision_score(y_test, y_pred, zero_division=0)
                rec = recall_score(y_test, y_pred, zero_division=0)
                f1 = f1_score(y_test, y_pred, zero_division=0)
﻿
                # Log metrics
                wandb.log({
                    "accuracy": acc,
                    "precision": prec,
                    "recall": rec,
                    "f1_score": f1
                })
﻿
                # Print classification report for detailed analysis
                print(f"Classification Report for {model_name} with {data_name} data:")
                print(classification_report(y_test, y_pred))
﻿
﻿
# Execute training and evaluation
train_and_evaluate_models(models, datasets)
In the script, we use several key performance metrics (accuracy, precision, recall, and F1 score) to evaluate the classification models on the test set. Accuracy measures the overall correctness of the model across all classes, which makes it easy to read but also easy to inflate when legitimate transactions dominate the test set. Precision is the share of predicted positive (fraudulent) instances that are actually fraudulent, which is crucial in fraud detection to minimize the cost of false alerts. Recall (or sensitivity) is the share of actual fraudulent transactions that the model identifies. The F1 score is the harmonic mean of precision and recall, so it is high only when both are high, which makes it especially useful for imbalanced problems like fraud detection. These metrics are logged to W&B for the test set, allowing for detailed tracking and comparison of model performance across the different synthetic datasets and a clear view of how well each model generalizes.
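As a small worked example of why accuracy alone can mislead on imbalanced data, the sketch below computes the four metrics from scratch. The labels are hypothetical toy data, not results from the experiment, and `classification_metrics` is an illustrative helper that returns 0 for undefined ratios, mirroring `zero_division=0` in the script.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for binary labels (1 = fraud)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    # Undefined ratios fall back to 0, like zero_division=0 in scikit-learn.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical test set: 5 frauds among 1,000 transactions.
y_true = [1] * 5 + [0] * 995
# A "model" that never flags fraud.
always_legit = [0] * 1000
# A detector that catches 4 of the 5 frauds but raises 6 false alarms.
detector = [1] * 4 + [0] * 1 + [1] * 6 + [0] * 989

baseline_metrics = classification_metrics(y_true, always_legit)
detector_metrics = classification_metrics(y_true, detector)
print("always legit:", baseline_metrics)
print("detector:    ", detector_metrics)
```

In this toy case the model that never flags fraud has the higher accuracy (0.995 versus 0.993) while its precision, recall, and F1 are all zero, which is why the F1 score is the more informative number to compare across runs.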
I trained the models and obtained the following results:
﻿
[Interactive W&B charts: run set, 14 runs]
﻿
Note that I scaled the accuracy charts between 98% and 100%, as all of the classifiers were extremely accurate thanks to the large share of non-fraudulent transactions in the test set (the test set uses purely non-synthetic data). In terms of F1 score, the XGBoost model trained with CTGAN-synthesized data performed the best! For anyone looking to improve performance, it would be interesting to try an ensemble utilizing all of these models!
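One simple way to try the ensemble idea is hard majority voting over each model's predictions. The sketch below is an illustration only and was not part of the experiment: the `majority_vote` helper and the prediction lists are hypothetical, and in practice scikit-learn's `VotingClassifier` offers a ready-made version of the same idea.

```python
from collections import Counter


def majority_vote(predictions_by_model):
    """Combine per-model label lists into one list by hard majority voting."""
    model_predictions = list(predictions_by_model.values())
    combined = []
    for votes in zip(*model_predictions):
        # most_common(1) returns the label that received the most votes.
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined


# Hypothetical predictions from three of the models on five transactions.
predictions = {
    "Random Forest": [0, 1, 0, 1, 0],
    "Logistic Regression": [0, 1, 1, 0, 0],
    "XGBoost": [0, 1, 0, 1, 1],
}
ensemble_pred = majority_vote(predictions)
print(ensemble_pred)  # [0, 1, 0, 1, 0]
```

Using an odd number of models avoids ties for binary labels. Since precision and recall trade off differently for each classifier, it would be worth checking whether the combined vote actually improves F1 before adopting it.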
Overall
Synthetic data is another vital tool in the toolbox for enhancing fraud detection strategies. By simulating realistic transaction patterns, synthetic data helps overcome common challenges such as data scarcity and class imbalance, thus broadening the scope for model training and validation. Our exploration of techniques like Gaussian Copula, CTGAN, and SMOTE highlights their effectiveness in creating diverse and representative datasets that enable more comprehensive testing and refinement of fraud detection models. As the financial sector continues to evolve, incorporating synthetic data into fraud prevention frameworks not only augments the detection capabilities but also fosters innovation and rigorous standards in tackling fraud. I hope you enjoyed this tutorial!
Tags: Articles, Intermediate, Financial