
Random Forest Regression: A Comprehensive Guide with House Price Data

This article is a deep dive into how the Random Forest algorithm works, using a real-life example, and why it is one of the most effective and widely used algorithms for both classification and regression.
Whether you have just started in ML or have been around for years, you have likely heard of Random Forests.
The idea of Random Forests was first formally introduced by Leo Breiman in his 2001 research paper. It is one of the most popular algorithms for classification and regression tasks thanks to its high accuracy, robustness, built-in feature importance, versatility, and scalability.
In this article, we'll cover how the Random Forest algorithm works, using data from the Kaggle competition House Prices - Advanced Regression Techniques as an example, and build a Random Forest regressor.
Let's get going!

Dataset

Let’s first explore the dataset. As set out in the problem statement, we will be predicting house sale prices.
You can download the data using the Kaggle API.
mkdir house_price_regression
cd house_price_regression
kaggle competitions download -c house-prices-advanced-regression-techniques
unzip house-prices-advanced-regression-techniques.zip
Running the above commands will download the house prices data that we will be working with today and extract three files - train.csv, test.csv & sample_submission.csv.
Now, let’s read the data and explore what it contains. You can look at the complete data dictionary on the competition’s page.
We will be creating our models with 9 columns - ['MoSold', 'YrSold', 'KitchenAbvGr', 'BedroomAbvGr', 'TotRmsAbvGrd', 'FullBath', 'LotArea', 'BldgType', 'YearBuilt'].
These 9 columns were picked to showcase how Random Forests work. The idea here is to demonstrate the inner workings of the Random Forest algorithm, not to build the best possible house price regressor.
Let’s log the data as a Weights & Biases table so it’s easier to interact with.
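Below is a minimal sketch of how this can be done, assuming you are logged in to W&B (the project name is illustrative):
import pandas as pd
import wandb

cols_to_keep = ['MoSold', 'YrSold', 'KitchenAbvGr', 'BedroomAbvGr', 'TotRmsAbvGrd', 'FullBath', 'LotArea', 'BldgType', 'YearBuilt']

df = pd.read_csv('./house_price_regression/train.csv')

# Log the selected columns plus the target as an interactive W&B Table
run = wandb.init(project="house-price-regression")  # project name is illustrative
run.log({"house_price_data": wandb.Table(dataframe=df[cols_to_keep + ['SalePrice']])})
run.finish()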

As shown in the table above, our target variable is SalePrice and we only have 1 categorical variable - BldgType. We can easily transform the categorical variable before fitting the model.

Baseline

We've prepared our dataset; the next step is to create a baseline model.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

df = pd.read_csv('./house_price_regression/train.csv')
df = df.sort_values(['YrSold', 'MoSold']).reset_index(drop=True)

# Store `X` & `y`
cols_to_keep = ['MoSold', 'YrSold', 'KitchenAbvGr', 'BedroomAbvGr', 'TotRmsAbvGrd', 'FullBath', 'LotArea', 'BldgType', 'YearBuilt']
target_var = 'SalePrice'
X = df[cols_to_keep].copy().reset_index(drop=True)
y = df[target_var].copy().reset_index(drop=True)

# Transform `BldgType` to numerical
_tfm_dict = {k:i for i,k in enumerate(X.BldgType.unique())}
X['BldgType'] = X.BldgType.map(_tfm_dict)

# Chronological split: first 80% as train, last 20% as test
n = int(len(X)*0.8)
X_train, y_train = X[:n], y[:n]
X_test, y_test = X[n:], y[n:]

# fit model
model = RandomForestRegressor(random_state=123)
model.fit(X_train, y_train)
In the above, we read the data, keep SalePrice as our target variable, and use the 9 selected columns as predictors. We split the data chronologically by year and month of sale, using the earliest 80% of sales for training and the most recent 20% for testing. A time-based split like this gives a validation set that better reflects how the model would perform on future sales.
Finally, we create a RandomForestRegressor and fit it to the training data.
Let’s now define a metric called within_10: the fraction of predictions that fall within 10% of the actual house price.
preds = model.predict(X_test)
err = np.abs((preds/y_test)-1)
w10 = err < 0.1
w10.mean()

>> 0.43493
Here you can see our baseline model achieves a within_10 score of 43.49%. In other words, 43.49% of the time, the model's house price predictions are within 10% of the actual house prices in the test set.

How Does a Random Forest Really Work?

A random forest is an ensemble of decision trees. It trains multiple decision trees, each on a different subset of the data, and makes predictions by averaging the predictions of the individual trees. That means that to understand how random forests work, we must first dive into decision trees.

Understanding Decision Trees

Picture a flowchart - where each step is a question about some attribute or feature of your data. Depending on the answer to that question, you follow a branch to the next question, and so on. Eventually, you reach the end of the flowchart - this is what we call a leaf node.
In a decision tree, a leaf node is the final outcome, giving us a class label (for classification tasks) or a value (for regression tasks).
The very first question at the top of the tree, where we start our journey, is known as the root node.
So, in simple terms, a decision tree is just a series of questions leading to an answer. And a Random Forest? It's like asking a bunch of these trees for their opinions and then taking the average!
Let’s understand this better with the help of a graphical representation.
from sklearn import tree
from sklearn.tree import DecisionTreeRegressor
import matplotlib.pyplot as plt

regressor = DecisionTreeRegressor(max_depth=3)
regressor.fit(X_train, y_train)

# Plot the decision tree
fig = plt.figure(figsize=(15,10))
_ = tree.plot_tree(regressor,
                   feature_names=cols_to_keep,
                   filled=True)
Running the above code gives us the following graphical representation of a decision tree.

If we look at this image, the decision tree decides to split the data at YearBuilt<=1985. That means this is our root node, and YearBuilt is the most important variable. If we traverse the leftmost branch, the model splits on LotArea and BedroomAbvGr. There are 167 samples in our data with YearBuilt<=1985, LotArea<=10888 & BedroomAbvGr<=2.5.
To further understand, let’s take an example prediction for our first validation data point.
MoSold 7
YrSold 2009
KitchenAbvGr 1
BedroomAbvGr 2
TotRmsAbvGrd 4
FullBath 1
LotArea 4060
BldgType 0
YearBuilt 1922
Name: 1168, dtype: int64
As summarized in the image above, if the decision tree were to predict on the above row, it would do so in these four steps:
  1. At the root node's decision point YearBuilt<=1985, the condition is True (our house was built in 1922), so we follow the leftmost branch.
  2. At the next decision point LotArea<=10888, the condition is also True (LotArea is 4060). We stay on the leftmost branch.
  3. At the next decision point BedroomAbvGr<=2.5, the condition is again True, as our data point has 2 bedrooms.
  4. So finally, the predicted value is 112054.844.

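We can sanity-check this traversal by asking the fitted tree to predict on that row; the result should be the leaf value we arrived at above (about 112054.84):
# prediction from the single depth-3 decision tree for the first test row
regressor.predict(X_test[:1])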
With an understanding of how a Decision Tree works and makes predictions, it's time to move on to Random Forests.

From Decision Trees to Random Forests

If you've ever heard the saying, "Two heads are better than one," then you're already familiar with the basic idea behind Random Forests. In the world of machine learning, we often find that combining multiple models together can give us a better result than any single model could achieve on its own. This is the principle behind Random Forests - instead of just one decision tree, we have a whole 'forest' of them!
How does this work? When we train a Random Forest, we actually train many different decision trees on different subsets of our data. This is like asking a group of experts, each with their own unique perspective, to make a prediction. Each tree in the forest gets a vote, and the final prediction of the Random Forest is the average of all these votes. This process helps to overcome the overfitting problem of individual decision trees and improves the overall prediction accuracy.
In our case, each estimator (decision tree) makes a property price prediction, and the average of those predictions becomes the final prediction.
# fit Random Forest with 5 decision trees on train data
model = RandomForestRegressor(n_estimators=5, random_state=123)
model.fit(X_train, y_train)

# predict on first row of test data
model.predict(X_test[:1])
>> array([64300.])

# check total estimators (decision-trees) in our Random Forest
len(model.estimators_)
>> 5

# get prediction from each decision tree which is done based on splits as
# explained before
[estimator.predict(X_test[:1]) for estimator in model.estimators_]
>> [
array([110000.]),
array([52000.]),
array([52500.]),
array([55000.]),
array([52000.])
]

# take average of predictions (which becomes final prediction)
np.mean([estimator.predict(X_test[:1]) for estimator in model.estimators_])
>> 64300.0
The above code should be fairly self-explanatory: in a Random Forest, we fit multiple decision trees and take the average of their predictions as the final prediction.

Ensuring Diversity in the Forest

But wait, if we're training all these trees on the same task, won't they all just end up being the same? This is where the 'random' in Random Forest comes in. When training each tree, we not only use a different subset of the data, but we also consider a random subset of the features at each split in the decision tree. This ensures that each tree is forced to make decisions based on a different set of information, leading to a diverse set of trees.
This diversity is key to the power of Random Forests. Just like a team of experts from different fields can often solve a problem more effectively than a group of experts all from the same field, a diverse forest of decision trees can capture patterns and relationships in the data that might be missed by any single tree.
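In scikit-learn, both sources of randomness are exposed as constructor arguments: bootstrap controls whether each tree is trained on a bootstrap sample of the rows, and max_features controls how many features are considered at each split. Here is a minimal sketch (the values are illustrative, not tuned):
diverse_model = RandomForestRegressor(
    n_estimators=100,
    bootstrap=True,      # each tree sees a different bootstrap sample of the rows
    max_features=0.5,    # each split considers a random half of the features
    random_state=123,
)
diverse_model.fit(X_train, y_train)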

Improving our Random Forest baseline using W&B Sweeps

One of the powerful features W&B offers is Sweeps. It allows you to define a search space for your hyperparameters, and then it efficiently searches over that space using a strategy of your choice: grid search, random search, or Bayesian optimization.
In the context of our Random Forest model, we can use W&B Sweeps to find the optimal hyperparameters that will give us the best performance.
Let's see how we can set this up.
First, we need to prepare our data and define a function that will train our model and log the performance metrics. This function will use the hyperparameters provided by the W&B Sweep to train the model:
import yaml
import numpy as np
import pandas as pd
import wandb
from sklearn.ensemble import RandomForestRegressor


def prepare_data():
    df = pd.read_csv("./house_price_regression/train.csv")
    df = df.sort_values(["YrSold", "MoSold"]).reset_index(drop=True)
    cols_to_keep = [
        "MoSold",
        "YrSold",
        "KitchenAbvGr",
        "BedroomAbvGr",
        "TotRmsAbvGrd",
        "FullBath",
        "LotArea",
        "BldgType",
        "YearBuilt",
    ]
    target_var = "SalePrice"

    X = df[cols_to_keep].copy().reset_index(drop=True)
    y = df[target_var].copy().reset_index(drop=True)

    # Transform the categorical `BldgType` column into numerical codes
    _tfm_dict = {k: i for i, k in enumerate(X.BldgType.unique())}
    X["BldgType"] = X.BldgType.map(_tfm_dict)

    # Chronological split: first 80% as train, last 20% as test
    n = int(len(X) * 0.8)
    X_train, y_train = X[:n], y[:n]
    X_test, y_test = X[n:], y[n:]
    return X_train, X_test, y_train, y_test


def within_10(model, X_test, y_test):
    preds = model.predict(X_test)
    err = np.abs((preds / y_test) - 1)
    w10 = err < 0.1
    return w10.mean().round(3)


def main():
    # Set up your default hyperparameters
    with open("./sweep.yaml") as file:
        config = yaml.load(file, Loader=yaml.FullLoader)

    run = wandb.init(config=config)

    X_train, X_test, y_train, y_test = prepare_data()

    # Note that we read values from `wandb.config`
    # instead of hard-coding them
    n_estimators = wandb.config.n_estimators
    max_depth = wandb.config.max_depth
    min_samples_split = wandb.config.min_samples_split
    min_samples_leaf = wandb.config.min_samples_leaf

    model = RandomForestRegressor(
        n_estimators=n_estimators,
        max_depth=max_depth,
        min_samples_split=min_samples_split,
        min_samples_leaf=min_samples_leaf,
        random_state=42,
    )
    model.fit(X_train, y_train)

    w10 = within_10(model, X_test, y_test)

    wandb.log(
        {
            "n_estimators": n_estimators,
            "max_depth": max_depth,
            "min_samples_split": min_samples_split,
            "min_samples_leaf": min_samples_leaf,
            "within_10": w10,
        }
    )


# Call the main function.
main()
Next, we define a configuration for our Sweep. This configuration specifies the metric we want to optimize (in this case, the within_10 metric), the hyperparameters we want to search over, and the search strategy we want to use:
program: train.py
method: bayes
metric:
  name: within_10
  goal: maximize
parameters:
  n_estimators:
    min: 10
    max: 1000
  max_depth:
    min: 1
    max: 10
  min_samples_split:
    min: 2
    max: 10
  min_samples_leaf:
    min: 1
    max: 10

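With train.py and sweep.yaml in place, we first register the sweep with W&B; this command prints the sweep ID used in the next step:
wandb sweep sweep.yaml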
Finally, we can run our Sweep using the W&B command line interface. This starts an agent that trains a new model for each set of hyperparameters suggested by the search strategy:
wandb agent --count N <user/project/sweep_id>

By using W&B Sweeps, we can automate the process of hyperparameter tuning and find the best model for our task without manual trial and error.
I ran a sweep with 115 runs using the above code, which produces the following sweep visualization.


As shown, with just a little code and some hyperparameter tuning, we improve the within-10 accuracy from 43.49% (our baseline) to 46.9%.
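If you want to retrain a final model with the winning configuration, one option is to pull the best run from the sweep via the public W&B API (a sketch; the sweep path below is a placeholder):
import wandb
from sklearn.ensemble import RandomForestRegressor

api = wandb.Api()
sweep = api.sweep("user/project/sweep_id")   # placeholder: your own sweep path
best_run = sweep.best_run()                  # best run according to the sweep metric (within_10)
best_cfg = best_run.config

final_model = RandomForestRegressor(
    n_estimators=best_cfg["n_estimators"],
    max_depth=best_cfg["max_depth"],
    min_samples_split=best_cfg["min_samples_split"],
    min_samples_leaf=best_cfg["min_samples_leaf"],
    random_state=42,
)
final_model.fit(X_train, y_train)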

Conclusion

It's easy to see why the Random Forest algorithm is extremely popular in the machine learning community. Beyond being simple to use, its inherent flexibility makes it applicable to various industries.
If you're interested in learning more, I'd recommend experimenting with other parameters of the Random Forest model or trying out other tree-based models like XGBoost.
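For example, here is a quick sketch of trying XGBoost on the same split (this assumes the xgboost package is installed; the hyperparameters are illustrative):
from xgboost import XGBRegressor

xgb_model = XGBRegressor(n_estimators=200, max_depth=4, random_state=123)
xgb_model.fit(X_train, y_train)

# Evaluate with the same within_10 metric used throughout this article
within_10(xgb_model, X_test, y_test)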
As always, I hope you found this article helpful. If you have any questions or feedback, feel free to reach out. Happy learning!
