Using K-Fold Cross-Validation To Improve Your Machine Learning Models
In this article, we will learn how to use k-fold cross-validation for better measures of machine learning model performance, using W&B to track our results.
This article is part of a series clarifying some of Kaggle's terms, definitions, and competitions, as well as adding visualizations.
Machine learning is an iterative process. When working in the space, you will face choices about what predictive variables to use, what types of models to use, what arguments to supply to those models, etc.
So far, we have made these choices in a data-driven way by measuring model quality with a validation (or holdout) set. But there are some drawbacks to this approach.
To illustrate, imagine you have a dataset with 5000 rows. You will typically keep about 20% of the data as a validation dataset, or 1000 rows. But this leaves some random chance in determining model scores. A model might do well on one set of 1000 rows, even if it would be inaccurate on a different 1000 rows.
At an extreme, you could imagine having only 1 row of data in the validation set. If you compare alternative models, which one makes the best predictions on a single data point will be mostly a matter of luck!
In general, the larger the validation set, the less randomness (aka "noise") there is in our measure of model quality, and the more reliable it will be. Unfortunately, we can only get a large validation set by removing rows from our training data, and smaller training datasets mean worse models!
This is where k-fold cross-validation comes in. In this article, we'll explore what it is, how to use it, and how to measure its impact.
Here's what we'll discuss:
Table of Contents
What Is K-Fold Cross-Validation?
When Should You Use K-Fold Cross-Validation?
An Example Of K-Fold Cross-Validation
Summary of K-Fold Cross-Validation
Setup
Visualizations
The Final Word
Recommended Reading
What Is K-Fold Cross-Validation?
K-fold cross-validation is a procedure in which a dataset is divided into k separate training and validation splits, called folds. Evaluating the model on every fold helps safeguard against the random bias that comes from selecting only one training and validation set.
With this procedure, we run our modeling process on different subsets of the data to get multiple measures of model quality.
For example, we could begin by dividing the data into 5 pieces, each 20% of the full dataset. In this case, we say that we have broken the data into 5 "folds."

Then, we run one experiment for each fold:
- In Experiment 1, we use the first fold as a validation (or holdout) set, and use everything else as training data. This gives us a measure of model quality based on a 20% holdout set.
- In Experiment 2, we hold out data from the second fold (and use everything except the second fold for training the model). The holdout set is then used to get a second estimate of model quality.
- We repeat this process, using every fold once as the holdout set. Putting this together, 100% of the data is used as holdout at some point, and we end up with a measure of model quality that is based on all of the rows in the dataset (even if we don't use all rows simultaneously).
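To make the fold rotation above concrete, here is a minimal sketch using scikit-learn's KFold on a tiny made-up array (the 10-row X_demo is purely illustrative and not part of the article's dataset):

import numpy as np
from sklearn.model_selection import KFold

X_demo = np.arange(10).reshape(10, 1)  # 10 toy rows, for illustration only
kf = KFold(n_splits=5, shuffle=True, random_state=0)

for experiment, (train_idx, valid_idx) in enumerate(kf.split(X_demo), start=1):
    # Each row lands in exactly one validation fold across the 5 experiments
    print(f"Experiment {experiment}: train rows {train_idx}, holdout rows {valid_idx}")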
When Should You Use K-Fold Cross-Validation?
Cross-validation gives a more accurate measure of model quality, which is especially important if you are making a lot of modeling decisions. However, it can take longer to run because it estimates multiple models (one for each fold).
So, given these tradeoffs, when should you use each approach?
- For small datasets, where the extra computational burden isn't a big deal, you should run k-fold cross-validation.
- For larger datasets, a single validation set is sufficient. Your code will run faster, and you may have enough data that there's little need to reuse some of it for the holdout. There's no simple threshold for what constitutes a large vs. small dataset. But if your model takes a couple of minutes or less to run, it's probably worth switching to k-fold cross-validation.
Alternatively, you can run cross-validation and see if the scores for each experiment seem close. If each experiment yields the same results, a single validation set is probably sufficient.
An Example Of K-Fold Cross-Validation
We'll work with the same data as in the previous tutorial. We load the input data in X and the output data in y.
Then, we define a pipeline that uses an imputer to fill in missing values and a random forest model to make predictions.
While it's possible to do k-fold cross-validation without pipelines, it is quite difficult! Using a pipeline will make the code remarkably straightforward.
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

my_pipeline = Pipeline(steps=[
    ('preprocessor', SimpleImputer()),
    ('model', RandomForestRegressor(n_estimators=50, random_state=0))
])
We obtain the cross-validation scores with the cross_val_score() function from scikit-learn. We set the number of folds with the cv parameter.
from sklearn.model_selection import cross_val_score

# Multiply by -1 since sklearn calculates *negative* MAE
scores = -1 * cross_val_score(my_pipeline, X, y,
                              cv=5,
                              scoring='neg_mean_absolute_error')

print("MAE scores:\n", scores)

MAE scores:
[301628.7893587  303164.4782723  287298.331666  236061.84754543 260383.45111427]
The scoring parameter chooses a measure of model quality to report: in this case, we chose negative mean absolute error (MAE). The docs for scikit-learn show a list of options.
It is a little surprising that we specify negative MAE. Scikit-learn has a convention where all metrics are defined so a high number is better. Using negatives here allows them to be consistent with that convention, though negative MAE is almost unheard of elsewhere.
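As a quick sanity check of that sign convention, here is a small sketch that re-derives one score by hand on a single split; it reuses the my_pipeline, X, and y objects defined above, so the number will differ slightly from the cross-validation folds:

from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# A single 75/25 split, just to illustrate the sign convention
X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=0)
my_pipeline.fit(X_tr, y_tr)
mae = mean_absolute_error(y_va, my_pipeline.predict(X_va))

print(mae)   # ordinary MAE: positive, lower is better
print(-mae)  # what 'neg_mean_absolute_error' reports: higher (closer to 0) is better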
We typically want a single measure of model quality to compare alternative models. So we take the average across experiments.
print("Average MAE score (across experiments):")print(scores.mean())Average MAE score (across experiments):277707.3795913405
Summary of K-Fold Cross-Validation
Using k-fold cross-validation yields a much better measure of model quality, with the added benefit of cleaning up our code: note that we no longer need to keep track of separate training and validation sets. So, especially for small datasets, it's a good improvement!
Now let us try to implement what we have learned on a real-world Kaggle dataset.
Setup

Run the next code cell without changes to load the predictors in X and the target in y. The test set is loaded in X_test.
For simplicity, we drop categorical variables.
import pandas as pd
from sklearn.model_selection import train_test_split

# Read the data
train_data = pd.read_csv('../input/train.csv', index_col='Id')
test_data = pd.read_csv('../input/test.csv', index_col='Id')

# Remove rows with missing target, separate target from predictors
train_data.dropna(axis=0, subset=['SalePrice'], inplace=True)
y = train_data.SalePrice
train_data.drop(['SalePrice'], axis=1, inplace=True)

# Select numeric columns only
numeric_cols = [cname for cname in train_data.columns
                if train_data[cname].dtype in ['int64', 'float64']]
X = train_data[numeric_cols].copy()
X_test = test_data[numeric_cols].copy()
So far, you've learned how to build pipelines with scikit-learn. For instance, the pipeline below will use SimpleImputer() to replace missing values in the data before using RandomForestRegressor() to train a random forest model to make predictions.
We set the number of trees in the random forest model with the n_estimators parameter, and setting random_state ensures reproducibility.
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

my_pipeline = Pipeline(steps=[
    ('preprocessor', SimpleImputer()),
    ('model', RandomForestRegressor(n_estimators=50, random_state=0))
])
You have also learned how to use pipelines in cross-validation. The code below uses the cross_val_score() function to obtain the mean absolute error (MAE), averaged across five different folds. Recall we set the number of folds with the cv parameter.
from sklearn.model_selection import cross_val_score

# Multiply by -1 since sklearn calculates *negative* MAE
scores = -1 * cross_val_score(my_pipeline, X, y,
                              cv=5,
                              scoring='neg_mean_absolute_error')

print("Average MAE score:", scores.mean())
Step 1: Write a get_score() function
In this exercise, you'll use k-fold cross-validation to select parameters for a machine learning model.
Begin by writing a function get_score() that reports the average (over three cross-validation folds) MAE of a machine learning pipeline that uses:
- the data in X and y to create folds,
- SimpleImputer() (with all parameters left as default) to replace missing values, and
- RandomForestRegressor() (with random_state=0) to fit a random forest model.
The n_estimators parameter supplied to get_score() is used when setting the number of trees in the random forest model.
We'll use wandb to log the metrics, which can later be used to compare the performance of different models.
import wandb

def get_score(n_estimators):
    """Return the average MAE over 3 CV folds of a random forest model."""
    wandb.init(project="Kaggle-ML-CV", name=str(n_estimators) + '_estimators')
    my_pipeline = Pipeline(steps=[
        ('preprocessor', SimpleImputer()),
        ('model', RandomForestRegressor(n_estimators=n_estimators, random_state=0))
    ])
    scores = -1 * cross_val_score(my_pipeline, X, y,
                                  cv=3,
                                  scoring='neg_mean_absolute_error')
    wandb.log({"Mean score": scores.mean()})
    return scores.mean()
Step 2: Test different parameter values
Now, you will use the function you defined in Step 1 to evaluate the model performance corresponding to eight different values for the number of trees in the random forest: 50, 100, 150, ..., 300, 350, 400.
Store your results in a Python dictionary results, where results[i] is the average MAE returned by get_score(i).
results = {}
for i in range(1, 9):
    results[50*i] = get_score(50*i)
Step 3: Find the best parameter value
The next cell visualizes your results from Step 2. We'll also log the plot to the W&B dashboard in case we need it as a reference later on.
import matplotlib.pyplot as plt

plt.plot(list(results.keys()), list(results.values()))
plt.xlabel('N_estimators')
plt.ylabel('Score')

wandb.init(project="Kaggle-ML-CV", name='comparison')
wandb.log({"Comparison": plt})
Visualizations
Let us now see how our models performed:
Loss Chart
As you can see in the loss chart, most of the values are quite close to each other, so we cannot draw any definitive conclusions from it. Let us now look at the plot logged from the kernel:
Score Vs Estimators
This plot captures a lot more information. As you can see, the model with 200 estimators resulted in the lowest error.
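If you'd rather read the best value off programmatically than from the plot, a one-liner over the results dictionary from Step 2 does the trick (a small sketch, assuming results is populated as above):

# Pick the n_estimators value with the lowest average MAE
n_estimators_best = min(results, key=results.get)
print(n_estimators_best, results[n_estimators_best])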
The Final Word
If you'd like to learn more about hyperparameter optimization, you're encouraged to start with grid search, which is a straightforward method for determining the best combination of parameters for a machine learning model. Thankfully, scikit-learn also provides a built-in GridSearchCV() class that can make your grid search code very efficient!
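As a taste of what that looks like, here is a minimal, hedged sketch of GridSearchCV applied to the pipeline from this article; the parameter grid is an arbitrary example, not a recommendation:

from sklearn.model_selection import GridSearchCV

# The 'model__' prefix routes the parameter to the 'model' step of the pipeline
param_grid = {'model__n_estimators': [50, 100, 200]}  # arbitrary example values

grid_search = GridSearchCV(my_pipeline, param_grid,
                           cv=3,
                           scoring='neg_mean_absolute_error')
grid_search.fit(X, y)

print("Best parameters:", grid_search.best_params_)
print("Best MAE:", -1 * grid_search.best_score_)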
Recommended Reading
Continue to learn about gradient boosting, a powerful technique that achieves state-of-the-art results on a variety of datasets.
You may also find these posts from the same series interesting.
Handling Missing Values In A Pandas Dataframe
In this tutorial, you will learn three approaches to dealing with missing values in a pandas dataframe.
Handling Categorical Features - With Examples
In this report, you will learn what a categorical variable is, along with three approaches for handling this type of data.
Gradient Boosting With XGBoost
In this report, you will learn how to build and optimize models with gradient boosting. This method dominates many Kaggle competitions and achieves state-of-the-art results on a variety of datasets.