Mastering K-Fold Cross-Validation
A guide to k-fold cross-validation, a fundamental evaluation method in machine learning.
Accurately evaluating machine learning models is a cornerstone of developing solutions that are not only robust but also generalizable to unseen data. This tutorial navigates the nuances between different evaluation methodologies, like k-fold cross-validation and train-validation-test split. While both approaches aim to assess and enhance model performance, they operate under distinct principles and are used in complementary ways.

What We'll Cover
What is K-fold Cross-Validation?
Stratification
Role of the Test Set
The Benefits of K-Fold Cross Validation
A Note About Splitting Your Data
Data Preparation
Implementing K-Fold Cross-Validation
Overall
The train-validation-test split is a common approach in machine learning, dividing the dataset into three separate parts: one for training the model, one for actively evaluating training runs and iterating on different hyperparameters, and one for evaluating the model's final performance. This method ensures that the model can be developed and tested under conditions that mimic how it will operate in the real world, using entirely unseen data for the ultimate test of its generalization capability, while also reducing the risk of overfitting to the validation set.
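To make this concrete, here is a minimal sketch of a three-way split using scikit-learn's train_test_split. It assumes a feature matrix X and labels y are already loaded, and the 80/20 and 75/25 proportions are purely illustrative.

from sklearn.model_selection import train_test_split

# First carve out a held-out test set (20% of the data).
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Then split the remainder into training and validation sets
# (75/25 of what is left, i.e. roughly 60/20/20 overall).
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=42)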
What is K-fold Cross-Validation?
On the other hand, k-fold cross-validation is often employed to make efficient use of data, especially when the available dataset is not very large and computational resources are plentiful relative to what a single training run requires. It involves splitting the dataset into 'k' parts (or folds), systematically using one part for validation and the remainder for training, and rotating this process 'k' times. This method is particularly valued for its ability to provide a more comprehensive assessment of model performance across various subsets of data. In this context, we will utilize a variant known as stratified k-fold cross-validation, which enhances the traditional k-fold approach by ensuring that each fold is representative of the entire dataset's class distribution.
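As a quick illustration of that rotation, here is a minimal sketch using scikit-learn's KFold on a toy array; the data and the choice of k=5 are purely illustrative.

import numpy as np
from sklearn.model_selection import KFold

X_toy = np.arange(20).reshape(10, 2)  # 10 toy samples with 2 features each
y_toy = np.array([0, 1] * 5)          # toy binary labels

kf = KFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(kf.split(X_toy), 1):
    # Each of the 5 iterations holds out a different fifth of the data for validation.
    print(f"Fold {fold}: train on {len(train_idx)} samples, validate on {len(val_idx)} samples")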
Stratification
Stratified k-fold cross-validation is particularly beneficial for datasets with an imbalanced class distribution. In such datasets, standard k-fold cross-validation could result in folds that are not representative of the overall class distribution, potentially skewing the model's performance and evaluation. The stratification process divides the data in a way that maintains the same proportion of classes in each fold as is in the whole dataset, ensuring that every fold is a good representative of the entire dataset's class diversity. This stratification is crucial for achieving more reliable and valid model evaluation metrics, as it reduces the variance in model performance across different folds and ensures that the model's ability to generalize is not overly optimistic or pessimistic.
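To see what stratification buys you, here is a minimal sketch comparing per-fold class proportions with StratifiedKFold on a deliberately imbalanced toy label vector; the 90/10 imbalance is only for illustration.

import numpy as np
from sklearn.model_selection import StratifiedKFold

y_toy = np.array([0] * 90 + [1] * 10)  # 90% negative class, 10% positive class
X_toy = np.zeros((len(y_toy), 3))      # placeholder features; stratification only looks at y

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (_, val_idx) in enumerate(skf.split(X_toy, y_toy), 1):
    # Every validation fold keeps roughly the original 10% positive rate.
    print(f"Fold {fold}: positive-class fraction = {y_toy[val_idx].mean():.2f}")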
Role of the Test Set
Despite the thorough evaluation that k-fold cross-validation offers, it's still considered best practice to reserve a separate test set when using this method. This test set, which the model has never seen during the cross-validation process, serves as the final arbiter of the model's performance. Employing a test set in conjunction with k-fold cross-validation offers an unbiased evaluation, reinforcing the confidence in the model's ability to generalize well to new data.
Throughout this tutorial, we will explore the strategic application of k-fold cross-validation, the critical role of a separate test set, and how these methodologies collectively contribute to the development of machine learning models that perform reliably and effectively in real-world scenarios.
The Benefits of K-Fold Cross Validation
Overfitting occurs when a model learns the training data 'too well', including its noise, leading to poor performance on new, unseen data. A good analogy for overfitting is that the model is 'memorizing' as opposed to 'generalizing.' Underfitting happens when the model is too simple to capture the underlying structure of the data, also resulting in poor predictions on new data. Striking the balance between these two is crucial for building effective models.
K-fold cross-validation not only ensures the full use of the training dataset but also offers a more accurate and reliable evaluation of our model. This accuracy stems from its methodical approach to validation, where the model is trained and evaluated across multiple, distinct subsets of the data. By aggregating the evaluation metrics across all folds, k-fold cross-validation provides a nuanced, well-rounded view of the model's performance, factoring in various data scenarios and distributions.
This aggregated evaluation helps mitigate the risk of an overly optimistic or pessimistic assessment that can occur with a single train-validation split, where the model's performance might significantly depend on the particular choice of validation data. In contrast, k-fold cross-validation averages the performance across several folds, each serving as a standalone validation set, thereby smoothing out anomalies and providing a more balanced and representative measure of the model's true capabilities.
Furthermore, the process of evaluating the model across multiple folds can uncover insights into its consistency and robustness. If a model performs well across a diverse set of validation folds, it's a strong indicator of its ability to generalize well to new data, reinforcing the reliability of the evaluation. This comprehensive assessment, combined with the final, unbiased check provided by the separate test set, offers a solid foundation for judging the model’s readiness for real-world applications, ensuring that decisions on model deployment are well-informed and based on robust performance metrics.
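As a simple illustration of this aggregation, the sketch below averages a set of per-fold accuracies and reports their spread. The numbers are hypothetical, not results from this tutorial.

import numpy as np

fold_accuracies = [0.74, 0.77, 0.72, 0.76, 0.75]  # hypothetical per-fold validation accuracies

mean_acc = np.mean(fold_accuracies)
std_acc = np.std(fold_accuracies)

# The mean summarizes overall performance; the standard deviation hints at how
# consistent the model is across folds.
print(f"Cross-validated accuracy: {mean_acc:.3f} +/- {std_acc:.3f}")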
A Note About Splitting Your Data
A common practice in machine learning is to “programmatically” split datasets into training, validation, and test sets, often using a fixed seed to ensure reproducibility of the splits. This method is efficient and effective for initial model training and evaluation. However, a significant pitfall emerges when new data is added to the dataset. Even with the seed remaining constant, the addition of new samples changes the dataset's composition, leading to a different outcome from the programmatic split, particularly for the test set. As a result, the test set can inadvertently change, undermining the consistency of model evaluation over time.
This variability introduced after adding new data to a dataset, and subsequently re-splitting it programmatically, can make it difficult to ascertain whether improvements in model performance are due to genuine model enhancements or simply changes in the test data. To circumvent this issue, I advise manually segregating a portion of the dataset as a test set before beginning the model development process. This approach ensures that despite the addition of new data into the training or validation sets, the test set remains unchanged, providing a stable and consistent benchmark for accurate and reliable longitudinal performance evaluation. Adhering to this practice allows practitioners to maintain the integrity of the testing process, ensuring that observed improvements in model performance are indeed reflective of advancements in model capabilities, data preprocessing, or feature engineering.
Data Preparation
We will be working with the Pima Indians Diabetes Database, a commonly used dataset in machine learning for binary classification tasks. This dataset serves as an excellent example for illustrating the process of k-fold cross-validation and the importance of having a separate test set for model evaluation. Let’s delve into the practical aspects of applying these methodologies using Python. We will first split the data into training and test sets, saving them in separate CSV files.
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv"
columns = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome']
data = pd.read_csv(url, header=None, names=columns)

# Split dataset into features and target variable
X = data.drop('Outcome', axis=1)
y = data['Outcome']

# Split the data into stratified training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# Save the stratified training and test sets to CSV files
X_train.to_csv('X_train.csv', index=False)
y_train.to_csv('y_train.csv', index=False)
X_test.to_csv('X_test.csv', index=False)
y_test.to_csv('y_test.csv', index=False)

# Calculate and print class ratios
class_ratio_train = y_train.value_counts(normalize=True)
class_ratio_test = y_test.value_counts(normalize=True)

print("Class Ratios in Training Set:")
print(class_ratio_train)
print("\nClass Ratios in Test Set:")
print(class_ratio_test)
Here we download the dataset in CSV format, and then we split the data into train and test sets, using stratified splitting to maintain the same proportion of outcomes in both training and test datasets as observed in the original dataset. Stratification is crucial in ensuring that our train and test sets are representative of the overall dataset, especially in cases where the target variable may be imbalanced. For instance, if there are significantly more instances of one class in the dataset, random splitting without stratification might result in training or test sets that are biased towards one class, which could adversely affect the model's ability to learn and generalize.
Implementing K-Fold Cross-Validation
Now we are ready to train a simple neural network model separately on 5 different folds of the data. We utilize Weights & Biases for logging both the validation loss and accuracy for each fold. When logging with Weights and Biases, we use the group flag, which allows us to aggregate and organize the runs from the same k-fold cross-validation experiment under a single umbrella in the W&B dashboard. This organization is particularly helpful for comparing and analyzing the performance metrics across different folds, providing insights into the model's consistency and robustness. The grouping mechanism ensures that each fold's run, though separate, is part of a collective experiment, facilitating a holistic view of the model's performance. Additionally, the config flag plays a crucial role in this process by enabling us to log specific configuration details for each run. In the context of k-fold cross-validation, the config flag can be used to record which fold of the data is currently being trained on.
Additionally, we save the best model from each fold based on the highest validation accuracy achieved during its training epochs. This approach not only helps in identifying the most performant model configurations but also aids in understanding how different splits of the data affect the model's ability to generalize. Saving the best model per fold is done automatically within the training loop whenever a new high in validation accuracy is observed, ensuring that we capture the most effective version of the model at each stage of the cross-validation process.
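Before diving into the full script, here is a stripped-down sketch of just the grouping and config pattern described above, with the training loop omitted; the project name mirrors the one used in the script below.

import wandb

group_name = "experiment-" + wandb.util.generate_id()  # one shared group id for the whole experiment

for fold in range(1, 6):
    run = wandb.init(
        project="k-fold-cross-validation-pytorch",  # same project for every fold
        group=group_name,                           # groups the five runs together in the dashboard
        name=f"fold-{fold}",
        config={"fold": fold},                      # records which fold this run trained on
    )
    # ... training and validation for this fold would go here ...
    wandb.log({"fold": fold})
    wandb.finish()  # close this fold's run before starting the next one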
import os
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader, Subset
from torch import nn, optim
import torch.nn.functional as F
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler
import wandb

# Ensure wandb is logged in
wandb.login()

# Define your project and group name
PROJECT_NAME = "k-fold-cross-validation-pytorch"
GROUP_NAME = "experiment-" + wandb.util.generate_id()

# Make sure the directory for saving the best models exists
os.makedirs('best_models', exist_ok=True)

# Load and preprocess the data
X_train = pd.read_csv('X_train.csv').values
y_train = pd.read_csv('y_train.csv').values.squeeze()
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)

# Define the dataset class
class DiabetesDataset(Dataset):
    def __init__(self, features, labels):
        self.features = torch.tensor(features, dtype=torch.float32)
        self.labels = torch.tensor(labels, dtype=torch.long)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return self.features[idx], self.labels[idx]

# Define the neural network model
class SimpleNN(nn.Module):
    def __init__(self, input_dim):
        super(SimpleNN, self).__init__()
        self.fc1 = nn.Linear(input_dim, 12)
        self.fc2 = nn.Linear(12, 8)
        self.fc3 = nn.Linear(8, 2)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)

# Perform Stratified K-Fold Cross-Validation
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
dataset = DiabetesDataset(X_scaled, y_train)

for fold, (train_idx, val_idx) in enumerate(skf.split(X_scaled, y_train), 1):
    # Start a new wandb run for each fold, naming it to include the fold number
    run_name = f"fold-{fold}"
    wandb.init(project=PROJECT_NAME, group=GROUP_NAME, name=run_name, job_type=run_name, config={"fold": fold})

    train_subsampler = Subset(dataset, train_idx)
    val_subsampler = Subset(dataset, val_idx)
    train_loader = DataLoader(train_subsampler, batch_size=64, shuffle=True)
    val_loader = DataLoader(val_subsampler, batch_size=64)

    model = SimpleNN(X_scaled.shape[1])
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=0.001)

    best_val_accuracy = 0.0  # Initialize the best validation accuracy

    for epoch in range(100):  # Adjust the number of epochs if needed
        model.train()
        for inputs, labels in train_loader:
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

        model.eval()
        running_loss = 0.0
        correct_predictions = 0
        with torch.no_grad():
            for inputs, labels in val_loader:
                outputs = model(inputs)
                loss = criterion(outputs, labels)
                running_loss += loss.item()
                _, predicted = torch.max(outputs.data, 1)
                correct_predictions += (predicted == labels).sum().item()

        # Calculate validation loss and accuracy
        validation_loss = running_loss / len(val_loader.dataset)
        validation_accuracy = correct_predictions / len(val_loader.dataset)

        # Update and save the best model based on validation accuracy
        if validation_accuracy > best_val_accuracy:
            best_val_accuracy = validation_accuracy
            best_model_path = f'best_models/best_model_fold_{fold}.pth'
            torch.save(model.state_dict(), best_model_path)
            print(f"New best model saved for fold {fold} with validation accuracy: {best_val_accuracy:.4f}")

        # Log metrics to wandb, including explicitly the fold number in each log
        wandb.log({
            "fold": fold,
            "epoch": epoch,
            "validation_loss": validation_loss,
            "validation_accuracy": validation_accuracy,
            "best_validation_accuracy": best_val_accuracy
        })

    # End the current wandb run before the next fold starts
    wandb.finish()

print("K-Fold Cross-Validation completed. Check Weights & Biases for detailed logs and comparisons.")
Here are the results from our training run:
[W&B panel: run set of 5 fold runs showing validation loss and accuracy per fold]
Evaluating Your Model with a Separate Test Set
After training our models through the k-fold cross-validation process, it is crucial to evaluate their performance on a separate test set. This step ensures that our model's predictions are reliable and can generalize to unseen data. In this section, we explore three different methods for utilizing the best models obtained from each fold for making predictions on the test set. We will then log the results to Weights & Biases to analyze and compare the effectiveness of each method.
Method 1: Averaging Predictions Across All Models
The first method involves using all the best models saved from each fold to make predictions on the test set. By averaging these predictions, we aim to leverage the collective knowledge gained across different subsets of the training data, potentially leading to more robust and generalized predictions.
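In isolation, the averaging step looks roughly like the sketch below. It assumes models is a list of per-fold models already in eval mode and x_batch is a tensor of standardized test features; thresholding the averaged probability at 0.5 corresponds to rounding it.

import torch

def average_fold_predictions(models, x_batch):
    # Collect each fold model's positive-class probability for the batch.
    with torch.no_grad():
        probs = [torch.softmax(m(x_batch), dim=1)[:, 1] for m in models]
    # Average the probabilities across folds, then threshold to get hard labels.
    avg_probs = torch.stack(probs).mean(dim=0)
    return (avg_probs > 0.5).long()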
Method 2: Using the Best Model
The second method selects the single best model based on its validation accuracy during the cross-validation process. This model is then used to make predictions on the test set. This approach assumes that the highest validation accuracy translates to better generalization on unseen data.
Method 3: Averaging Model Weights
The third method (which was mainly implemented out of curiosity) creates a new model by averaging the weights of the best models from each fold. This technique combines the learned parameters from each fold into a single model, which is then evaluated on the test set. This method is somewhat uncommon; however, I was curious to see how it would do!
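The core of the idea is just an element-wise average of the saved state dicts, as in this minimal sketch. It assumes the checkpoints share the SimpleNN architecture defined earlier and sit under the best_models/ directory used in training.

import os
import torch

checkpoint_paths = [os.path.join("best_models", f) for f in os.listdir("best_models") if f.endswith(".pth")]
state_dicts = [torch.load(p) for p in checkpoint_paths]

# Average every parameter tensor element-wise across the per-fold checkpoints.
avg_state_dict = {
    key: torch.mean(torch.stack([sd[key] for sd in state_dicts]), dim=0)
    for key in state_dicts[0]
}

averaged_model = SimpleNN(input_dim=8)  # the Pima dataset has 8 features
averaged_model.load_state_dict(avg_state_dict)
averaged_model.eval()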
The Inference Script
We start by loading the test set and standardizing its features using the same scaler applied to the training data. For each method, predictions are made on the test set, and the accuracy of these predictions is calculated. These accuracies are then logged to Weights & Biases for comparison.
import os
import numpy as np
import pandas as pd
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch import nn
import wandb
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler

# Define your model architecture here
class SimpleNN(nn.Module):
    def __init__(self, input_dim):
        super(SimpleNN, self).__init__()
        self.fc1 = nn.Linear(input_dim, 12)
        self.fc2 = nn.Linear(12, 8)
        self.fc3 = nn.Linear(8, 2)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        return self.fc3(x)

# Load test data
X_test = pd.read_csv('X_test.csv').values
y_test = pd.read_csv('y_test.csv').values.squeeze()

# Preprocess the test data with a scaler fitted on the training features,
# so the test set is standardized the same way the training data was
X_train = pd.read_csv('X_train.csv').values
scaler = StandardScaler()
scaler.fit(X_train)
X_test_scaled = scaler.transform(X_test)

X_test_tensor = torch.tensor(X_test_scaled, dtype=torch.float32)
y_test_tensor = torch.tensor(y_test, dtype=torch.long)
test_dataset = TensorDataset(X_test_tensor, y_test_tensor)
test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False)

# Initialize Weights & Biases
wandb.init(project="model_inference_comparison", entity="byyoung3")

# Method 1: Averaging Predictions Across All Models
model_paths = [os.path.join('best_models', model_file) for model_file in os.listdir('best_models') if model_file.endswith('.pth')]
all_preds = []
for model_path in model_paths:
    model = SimpleNN(input_dim=X_test.shape[1])
    model.load_state_dict(torch.load(model_path))
    model.eval()
    with torch.no_grad():
        fold_preds = [torch.softmax(model(xb), dim=1)[:, 1].numpy() for xb, yb in test_loader]
    all_preds.append(np.concatenate(fold_preds))

avg_preds = np.mean(np.column_stack(all_preds), axis=1)
avg_pred_labels = np.round(avg_preds)
accuracy_avg_preds = accuracy_score(y_test, avg_pred_labels)

# Method 2: Using the Best Model
# Assuming the model from fold 1 achieved the highest validation accuracy; adjust the path as needed
best_model = SimpleNN(input_dim=X_test.shape[1])
best_model.load_state_dict(torch.load('best_models/best_model_fold_1.pth'))
best_model.eval()
with torch.no_grad():
    best_model_preds = [torch.softmax(best_model(xb), dim=1)[:, 1].numpy() for xb, yb in test_loader]
best_model_preds = np.concatenate(best_model_preds)
best_model_pred_labels = np.round(best_model_preds)
accuracy_best_model = accuracy_score(y_test, best_model_pred_labels)

# Method 3: Averaging Model Weights
average_model = SimpleNN(input_dim=X_test.shape[1])

# Load all model states and average the weights
model_states = [torch.load(path) for path in model_paths]
avg_state_dict = {key: torch.mean(torch.stack([model_state[key] for model_state in model_states]), 0) for key in model_states[0]}
average_model.load_state_dict(avg_state_dict)
average_model.eval()
with torch.no_grad():
    average_model_preds = [torch.softmax(average_model(xb), dim=1)[:, 1].numpy() for xb, yb in test_loader]
average_model_preds = np.concatenate(average_model_preds)
average_model_pred_labels = np.round(average_model_preds)
accuracy_average_model = accuracy_score(y_test, average_model_pred_labels)

# Log results to W&B in a single call to facilitate bar chart visualization
wandb.log({
    "method/accuracy_avg_preds": accuracy_avg_preds,
    "method/accuracy_best_model": accuracy_best_model,
    "method/accuracy_average_model": accuracy_average_model
})
wandb.finish()

print(f"Accuracy - Averaging Predictions: {accuracy_avg_preds}")
print(f"Accuracy - Best Model: {accuracy_best_model}")
print(f"Accuracy - Averaging Model Weights: {accuracy_average_model}")
After logging the accuracy results for each of the evaluation methods to Weights & Biases, an additional step was taken to visualize and compare these results directly. In the W&B dashboard, a custom chart was created to group all logged accuracies together and present them as a bar chart. This visualization makes it easier to compare the performance of the different methods side by side. The results are shown below:
[W&B panel: bar chart comparing test accuracy across the three evaluation methods]
Averaging Predictions Across All Models performed the best in terms of accuracy on the test set. This outcome suggests that leveraging the collective knowledge from multiple models to make a unified prediction can lead to more robust and generalized performance on unseen data.
Overall
Our journey through k-fold cross-validation, particularly the stratified variant, underscores its critical role in achieving a nuanced evaluation of machine learning models. The method stands out for its capacity to ensure that every fold mirrors the overall dataset's class distribution, thereby offering a comprehensive view of model performance.
The technique of averaging predictions across all models emerged as the most effective, demonstrating the power of aggregating insights from various segments of data. This strategy exemplifies the essence of k-fold cross-validation: to enhance the reliability and generalizability of model predictions.
May this guide serve as a valuable resource in your machine learning endeavors, and I hope you enjoyed this tutorial!