
Mastering K-Fold Cross-Validation

A guide to k-fold cross-validation, a fundamental evaluation method in machine learning.
Accurately evaluating machine learning models is a cornerstone of developing solutions that are not only robust but also generalizable to unseen data. This tutorial navigates the nuances between different evaluation methodologies, like k-fold cross-validation and train-validation-test split. While both approaches aim to assess and enhance model performance, they operate under distinct principles and are used in complementary ways.


The Train-Validation-Test Split

The train-validation-test split is a common approach in machine learning, dividing the dataset into three separate parts: one for training the model, one for actively evaluating training runs and iterating on different hyperparameters, and one for evaluating the model's final performance. This method ensures that the model can be developed and tested under conditions that mimic how it will operate in the real world, using entirely unseen data for the ultimate test of its generalization capability, while also reducing the risk of “overfitting” on the validation set.
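As a quick, minimal sketch of how such a three-way split can be produced, the snippet below calls scikit-learn's train_test_split twice on synthetic data; the 70/15/15 proportions and the toy dataset are illustrative assumptions rather than a recommendation.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Toy data standing in for a real dataset (illustrative only)
X, y = make_classification(n_samples=1000, n_features=8, weights=[0.65, 0.35], random_state=0)

# Carve off the test set first, then split the remainder into training and validation sets
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.15, stratify=y, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.15 / 0.85, stratify=y_temp, random_state=42
)

print(len(X_train), len(X_val), len(X_test))  # roughly 700 / 150 / 150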

What is K-fold Cross-Validation?

On the other hand, k-fold cross-validation is often employed to make efficient use of data, especially when the available dataset is not very large and computational resources are plentiful relative to what a single training run requires. It involves splitting the dataset into 'k' parts (or folds), systematically using one part for validation and the remainder for training, and rotating this process 'k' times. This method is particularly valued for its ability to provide a more comprehensive assessment of model performance across various subsets of data. In this context, we will utilize a variant known as stratified k-fold cross-validation, which enhances the traditional k-fold approach by ensuring that each fold is representative of the entire dataset's class distribution.
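To make the rotation concrete, here is a minimal sketch using scikit-learn's KFold on ten toy samples; with k=5, each iteration holds out a different fifth of the data for validation and trains on the rest.
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)  # 10 toy samples with 2 features each

kf = KFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(kf.split(X), 1):
    # Each fold validates on a different pair of samples and trains on the other eight
    print(f"Fold {fold}: train indices {train_idx}, validation indices {val_idx}")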

Stratification

Stratified k-fold cross-validation is particularly beneficial for datasets with an imbalanced class distribution. In such datasets, standard k-fold cross-validation could result in folds that are not representative of the overall class distribution, potentially skewing the model's performance and evaluation. The stratification process divides the data in a way that maintains the same proportion of classes in each fold as in the whole dataset, ensuring that every fold is a good representative of the entire dataset's class diversity. This stratification is crucial for achieving more reliable and valid model evaluation metrics, as it reduces the variance in model performance across different folds and ensures that the estimate of the model's ability to generalize is neither overly optimistic nor pessimistic.
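The effect of stratification is easy to verify directly. The sketch below, using made-up labels with an 80/20 class imbalance, shows that every validation fold produced by StratifiedKFold preserves roughly the same positive-class ratio as the full dataset.
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Imbalanced toy labels: 80% class 0, 20% class 1
y = np.array([0] * 80 + [1] * 20)
X = np.zeros((100, 3))  # placeholder features; only the labels matter for stratification

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (_, val_idx) in enumerate(skf.split(X, y), 1):
    # Each validation fold keeps approximately the 0.20 positive rate of the full dataset
    print(f"Fold {fold}: positive rate = {y[val_idx].mean():.2f}")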

Role of the Test Set

Despite the thorough evaluation that k-fold cross-validation offers, it's still considered best practice to reserve a separate test set when using this method. This test set, which the model has never seen during the cross-validation process, serves as the final arbiter of the model's performance. Employing a test set in conjunction with k-fold cross-validation offers an unbiased evaluation, reinforcing the confidence in the model's ability to generalize well to new data.
Throughout this tutorial, we will explore the strategic application of k-fold cross-validation, the critical role of a separate test set, and how these methodologies collectively contribute to the development of machine learning models that perform reliably and effectively in real-world scenarios.

The Benefits of K-Fold Cross-Validation

Overfitting occurs when a model learns the training data 'too well', including its noise, leading to poor performance on new, unseen data. A good analogy for overfitting is that the model is 'memorizing' as opposed to 'generalizing.' Underfitting happens when the model is too simple to capture the underlying structure of the data, also resulting in poor predictions on new data. The balance between these two is crucial for building effective models.
K-fold cross-validation not only ensures the full use of the training dataset but also offers a more accurate and reliable evaluation of our model. This accuracy stems from its methodical approach to validation, where the model is trained and evaluated across multiple, distinct subsets of the data. By aggregating the evaluation metrics across all folds, k-fold cross-validation provides a nuanced, well-rounded view of the model's performance, factoring in various data scenarios and distributions.
This aggregated evaluation helps mitigate the risk of an overly optimistic or pessimistic assessment that can occur with a single train-validation split, where the model's performance might significantly depend on the particular choice of validation data. In contrast, k-fold cross-validation averages the performance across several folds, each serving as a standalone validation set, thereby smoothing out anomalies and providing a more balanced and representative measure of the model's true capabilities.
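As a minimal sketch of this aggregation, the snippet below runs 5-fold cross-validation on a toy classifier with scikit-learn's cross_val_score; the logistic regression model and synthetic data are stand-ins, and the point is simply that the mean summarizes performance while the standard deviation hints at how much it varies from fold to fold.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic data and a simple model, used only to illustrate the aggregation
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
model = LogisticRegression(max_iter=1000)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")

# The mean is the aggregated estimate; the spread indicates consistency across folds
print(f"Fold accuracies: {scores}")
print(f"Mean accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")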
Furthermore, the process of evaluating the model across multiple folds can uncover insights into its consistency and robustness. If a model performs well across a diverse set of validation folds, it's a strong indicator of its ability to generalize well to new data, reinforcing the reliability of the evaluation. This comprehensive assessment, combined with the final, unbiased check provided by the separate test set, offers a solid foundation for judging the model’s readiness for real-world applications, ensuring that decisions on model deployment are well-informed and based on robust performance metrics.

A Note About Splitting Your Data

A common practice in machine learning is to “programmatically” split datasets into training, validation, and test sets, often using a fixed seed to ensure reproducibility of the splits. This method is efficient and effective for initial model training and evaluation. However, a significant pitfall emerges when new data is added to the dataset. Even with the seed remaining constant, the addition of new samples changes the dataset's composition, leading to a different outcome from the programmatic split, particularly for the test set. As a result, the test set can inadvertently change, undermining the consistency of model evaluation over time.
This variability introduced after adding new data to a dataset, and subsequently re-splitting it programmatically, can make it difficult to ascertain whether improvements in model performance are due to genuine model enhancements or simply changes in the test data. To circumvent this issue, I advise manually segregating a portion of the dataset as a test set before beginning the model development process. This approach ensures that despite the addition of new data into the training or validation sets, the test set remains unchanged, providing a stable and consistent benchmark for accurate and reliable longitudinal performance evaluation. Adhering to this practice allows practitioners to maintain the integrity of the testing process, ensuring that observed improvements in model performance are indeed reflective of advancements in model capabilities, data preprocessing, or feature engineering.
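To see the pitfall concretely, here is a small sketch on made-up index arrays: even with the seed fixed, re-running a programmatic split after appending new samples assigns a noticeably different set of the original rows to the test set.
import numpy as np
from sklearn.model_selection import train_test_split

def test_indices(n_samples, seed=42):
    # Return the set of row indices a programmatic 80/20 split would place in the test set
    indices = np.arange(n_samples)
    _, test_idx = train_test_split(indices, test_size=0.2, random_state=seed)
    return set(test_idx)

before = test_indices(1000)  # original dataset
after = test_indices(1100)   # same seed, but 100 new samples appended

# Many of the original rows move in or out of the test set despite the fixed seed
print(f"Original test rows no longer in the test set: {len(before - after)}")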

Data Preparation

We will be working with the Pima Indians Diabetes Database, a commonly used dataset in machine learning for binary classification tasks. This dataset serves as an excellent example for illustrating the process of k-fold cross-validation and the importance of having a separate test set for model evaluation. Let’s delve into the practical aspects of applying these methodologies using Python. We will first split the data into training and test sets, saving them in separate CSV files.
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv"
columns = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome']
data = pd.read_csv(url, header=None, names=columns)

# Split dataset into features and target variable
X = data.drop('Outcome', axis=1)
y = data['Outcome']

# Split the data into stratified training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# Save the stratified training and test sets to CSV files
X_train.to_csv('X_train.csv', index=False)
y_train.to_csv('y_train.csv', index=False)
X_test.to_csv('X_test.csv', index=False)
y_test.to_csv('y_test.csv', index=False)

# Calculate and print class ratios
class_ratio_train = y_train.value_counts(normalize=True)
class_ratio_test = y_test.value_counts(normalize=True)

print("Class Ratios in Training Set:")
print(class_ratio_train)
print("\nClass Ratios in Test Set:")
print(class_ratio_test)
Here we download the dataset in CSV format, and then we split the data into train and test sets, using stratified splitting to maintain the same proportion of outcomes in both training and test datasets as observed in the original dataset. Stratification is crucial in ensuring that our train and test sets are representative of the overall dataset, especially in cases where the target variable may be imbalanced. For instance, if there are significantly more instances of one class in the dataset, random splitting without stratification might result in training or test sets that are biased towards one class, which could adversely affect the model's ability to learn and generalize.

Implementing K-Fold Cross-Validation

Now we are ready to train a simple neural network model separately on 5 different folds of the data. We utilize Weights & Biases for logging both the validation loss and accuracy for each fold. When logging with Weights and Biases, we use the group flag, which allows us to aggregate and organize the runs from the same k-fold cross-validation experiment under a single umbrella in the W&B dashboard. This organization is particularly helpful for comparing and analyzing the performance metrics across different folds, providing insights into the model's consistency and robustness. The grouping mechanism ensures that each fold's run, though separate, is part of a collective experiment, facilitating a holistic view of the model's performance. Additionally, the config flag plays a crucial role in this process by enabling us to log specific configuration details for each run. In the context of k-fold cross-validation, the config flag can be used to record which fold of the data is currently being trained on.
Additionally, we save the best model from each fold based on the highest validation accuracy achieved during its training epochs. This approach not only helps in identifying the most performant model configurations but also aids in understanding how different splits of the data affect the model's ability to generalize. Saving the best model per fold is done automatically within the training loop whenever a new high in validation accuracy is observed, ensuring that we capture the most effective version of the model at each stage of the cross-validation process.
import os

import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader, Subset
from torch import nn, optim
import torch.nn.functional as F
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler
import wandb


# Ensure wandb is logged in
wandb.login()

# Define your project and group name
PROJECT_NAME = "k-fold-cross-validation-pytorch"
GROUP_NAME = "experiment-" + wandb.util.generate_id()

# Load and preprocess the data
X_train = pd.read_csv('X_train.csv').values
y_train = pd.read_csv('y_train.csv').values.squeeze()

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)

# Define the dataset class
class DiabetesDataset(Dataset):
    def __init__(self, features, labels):
        self.features = torch.tensor(features, dtype=torch.float32)
        self.labels = torch.tensor(labels, dtype=torch.long)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return self.features[idx], self.labels[idx]

# Define the neural network model
class SimpleNN(nn.Module):
    def __init__(self, input_dim):
        super(SimpleNN, self).__init__()
        self.fc1 = nn.Linear(input_dim, 12)
        self.fc2 = nn.Linear(12, 8)
        self.fc3 = nn.Linear(8, 2)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)

# Directory for the best model checkpoint from each fold
os.makedirs('best_models', exist_ok=True)

# Perform Stratified K-Fold Cross-Validation
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
dataset = DiabetesDataset(X_scaled, y_train)

for fold, (train_idx, val_idx) in enumerate(skf.split(X_scaled, y_train), 1):
    # Start a new wandb run for each fold, naming it to include the fold number
    run_name = f"fold-{fold}"
    wandb.init(project=PROJECT_NAME, group=GROUP_NAME, name=run_name, job_type=run_name, config={"fold": fold})

    train_subsampler = Subset(dataset, train_idx)
    val_subsampler = Subset(dataset, val_idx)
    train_loader = DataLoader(train_subsampler, batch_size=64, shuffle=True)
    val_loader = DataLoader(val_subsampler, batch_size=64)

    model = SimpleNN(X_scaled.shape[1])
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=0.001)

    best_val_accuracy = 0.0  # Initialize the best validation accuracy
    for epoch in range(100):  # Adjust the number of epochs if needed
        model.train()
        for inputs, labels in train_loader:
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

        model.eval()
        running_loss = 0.0
        correct_predictions = 0
        with torch.no_grad():
            for inputs, labels in val_loader:
                outputs = model(inputs)
                loss = criterion(outputs, labels)
                running_loss += loss.item()
                _, predicted = torch.max(outputs.data, 1)
                correct_predictions += (predicted == labels).sum().item()

        # Calculate validation loss and accuracy
        validation_loss = running_loss / len(val_loader.dataset)
        validation_accuracy = correct_predictions / len(val_loader.dataset)

        # Update and save the best model based on validation accuracy
        if validation_accuracy > best_val_accuracy:
            best_val_accuracy = validation_accuracy
            best_model_path = f'best_models/best_model_fold_{fold}.pth'
            torch.save(model.state_dict(), best_model_path)
            print(f"New best model saved for fold {fold} with validation accuracy: {best_val_accuracy:.4f}")

        # Log metrics to wandb, including explicitly the fold number in each log
        wandb.log({
            "fold": fold,
            "epoch": epoch,
            "validation_loss": validation_loss,
            "validation_accuracy": validation_accuracy,
            "best_validation_accuracy": best_val_accuracy
        })

    # End the current wandb run before the next fold starts
    wandb.finish()

print("K-Fold Cross-Validation completed. Check Weights & Biases for detailed logs and comparisons.")
Here are the results from our training run:

[W&B panel: validation loss and accuracy logged for the 5 fold runs]


Evaluating Your Model with a Separate Test Set

After training our models through the k-fold cross-validation process, it is crucial to evaluate their performance on a separate test set. This step ensures that our model's predictions are reliable and can generalize to unseen data. In this section, we explore three different methods for utilizing the best models obtained from each fold for making predictions on the test set. We will then log the results to Weights & Biases to analyze and compare the effectiveness of each method.
Method 1: Averaging Predictions Across All Models
The first method involves using all the best models saved from each fold to make predictions on the test set. By averaging these predictions, we aim to leverage the collective knowledge gained across different subsets of the training data, potentially leading to more robust and generalized predictions.
Method 2: Using the Best Model
The second method selects the single best model based on its validation accuracy during the cross-validation process. This model is then used to make predictions on the test set. This approach assumes that the highest validation accuracy translates to better generalization on unseen data.
Method 3: Averaging Model Weights
The third method (which was mainly implemented out of curiosity) creates a new model by averaging the weights of the best models from each fold. This ensemble technique combines the learned representations from each fold into a single model, which is then evaluated on the test set. This method is somewhat uncommon; however, I was curious to see how it would perform!

The Inference Script

We start by loading the test set and standardizing its features using the same scaler applied to the training data. For each method, predictions are made on the test set, and the accuracy of these predictions is calculated. These accuracies are then logged to Weights & Biases for comparison.
import os
import numpy as np
import pandas as pd
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch import nn
import wandb
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler

# Define your model architecture here
class SimpleNN(nn.Module):
    def __init__(self, input_dim):
        super(SimpleNN, self).__init__()
        self.fc1 = nn.Linear(input_dim, 12)
        self.fc2 = nn.Linear(12, 8)
        self.fc3 = nn.Linear(8, 2)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        return self.fc3(x)

# Load test data
X_test = pd.read_csv('X_test.csv').values
y_test = pd.read_csv('y_test.csv').values.squeeze()

# Preprocess the test data with the same scaling used during training:
# fit the scaler on the training features, then transform the test features
X_train = pd.read_csv('X_train.csv').values
scaler = StandardScaler()
scaler.fit(X_train)
X_test_scaled = scaler.transform(X_test)
X_test_tensor = torch.tensor(X_test_scaled, dtype=torch.float32)
y_test_tensor = torch.tensor(y_test, dtype=torch.long)

test_dataset = TensorDataset(X_test_tensor, y_test_tensor)
test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False)

# Initialize Weights & Biases
wandb.init(project="model_inference_comparison", entity="byyoung3")

# Method 1: Averaging Predictions Across All Models
model_paths = [os.path.join('best_models', model_file) for model_file in os.listdir('best_models') if model_file.endswith('.pth')]
all_preds = []

for model_path in model_paths:
    model = SimpleNN(input_dim=X_test.shape[1])
    model.load_state_dict(torch.load(model_path))
    model.eval()
    with torch.no_grad():
        fold_preds = [torch.softmax(model(xb), dim=1)[:, 1].numpy() for xb, yb in test_loader]
    all_preds.append(np.concatenate(fold_preds))

avg_preds = np.mean(np.column_stack(all_preds), axis=1)
avg_pred_labels = np.round(avg_preds)
accuracy_avg_preds = accuracy_score(y_test, avg_pred_labels)

# Method 2: Using the Best Model
# Assuming fold 1 produced the highest validation accuracy during cross-validation
best_model = SimpleNN(input_dim=X_test.shape[1])
best_model.load_state_dict(torch.load('best_models/best_model_fold_1.pth'))
best_model.eval()

with torch.no_grad():
    best_model_preds = [torch.softmax(best_model(xb), dim=1)[:, 1].numpy() for xb, yb in test_loader]
best_model_preds = np.concatenate(best_model_preds)
best_model_pred_labels = np.round(best_model_preds)
accuracy_best_model = accuracy_score(y_test, best_model_pred_labels)

# Method 3: Averaging Model Weights
average_model = SimpleNN(input_dim=X_test.shape[1])

# Load all model states and average the weights
model_states = [torch.load(path) for path in model_paths]
avg_state_dict = {key: torch.mean(torch.stack([model_state[key] for model_state in model_states]), 0) for key in model_states[0]}

average_model.load_state_dict(avg_state_dict)
average_model.eval()

with torch.no_grad():
    average_model_preds = [torch.softmax(average_model(xb), dim=1)[:, 1].numpy() for xb, yb in test_loader]
average_model_preds = np.concatenate(average_model_preds)
average_model_pred_labels = np.round(average_model_preds)
accuracy_average_model = accuracy_score(y_test, average_model_pred_labels)

# Log results to W&B in a single call to facilitate bar chart visualization
wandb.log({
    "method/accuracy_avg_preds": accuracy_avg_preds,
    "method/accuracy_best_model": accuracy_best_model,
    "method/accuracy_average_model": accuracy_average_model
})

wandb.finish()

print(f"Accuracy - Averaging Predictions: {accuracy_avg_preds}")
print(f"Accuracy - Best Model: {accuracy_best_model}")
print(f"Accuracy - Averaging Model Weights: {accuracy_average_model}")
After logging the accuracy results for each of the evaluation methods to Weights & Biases, an additional step was taken to visualize and compare these results directly. In the W&B dashboard, a custom chart was created to group all logged accuracies together and present them as a bar chart. This visualization approach makes it easier to compare the performance of the different methods side by side. The results are shown below:

[W&B panel: bar chart comparing test-set accuracy for the three methods]

Averaging Predictions Across All Models performed the best in terms of accuracy on the test set. This outcome suggests that leveraging the collective knowledge from multiple models to make a unified prediction can lead to more robust and generalized performance on unseen data.

Overall

Our journey through k-fold cross-validation, particularly the stratified variant, underscores its critical role in achieving a nuanced evaluation of machine learning models. The method stands out for its capacity to ensure that every fold mirrors the overall dataset's class distribution, thereby offering a comprehensive view of model performance.
The technique of averaging predictions across all models emerged as the most effective, demonstrating the power of aggregating insights from various segments of data. This strategy exemplifies the essence of k-fold cross-validation: to enhance the reliability and generalizability of model predictions.

May this guide serve as a valuable resource in your machine learning endeavors, and I hope you enjoyed this tutorial!

