Random Forest Regression: A Comprehensive Guide with House Price Data
This article is a deep dive into how the Random Forest algorithm works, using a real-life regression example, and why it is one of the most effective algorithms for classification and regression tasks.
Whether you have just started in ML or have been around for years, you have almost certainly heard of Random Forests.
The idea of Random Forests was first formally introduced by Leo Breiman in this research paper. It is one of the most popular algorithms for classification and regression tasks thanks to its high accuracy, robustness, built-in feature importance estimates, versatility, and scalability.
In this article, we'll cover how the Random Forest algorithm works, using data from the House Prices - Advanced Regression Techniques Kaggle competition as an example, and build a Random Forest regressor.
Let's get going!
Dataset
Let’s first explore the dataset. Our task is to predict house prices from a handful of property attributes.
You can download the data using the Kaggle API.
mkdir house_price_regression
cd house_price_regression
kaggle competitions download -c house-prices-advanced-regression-techniques
unzip house-prices-advanced-regression-techniques.zip
Running the above commands will download the house prices data that we will be working with today and extract three files - train.csv, test.csv & sample_submission.csv.
Now, let's read the dataset and explore what’s in it. You can look at the complete data dictionary on the competition’s page here.
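A quick way to take that first look, assuming you extracted the archive into house_price_regression/ as above:

import pandas as pd

# Load the training data and take a quick look at its size and columns
df = pd.read_csv('./house_price_regression/train.csv')
print(df.shape)
df.head()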
We will be creating our models with 9 columns - ['MoSold', 'YrSold', 'KitchenAbvGr', 'BedroomAbvGr', 'TotRmsAbvGrd', 'FullBath', 'LotArea', 'BldgType', 'YearBuilt'].
These 9 columns were picked to showcase how Random Forests work. The idea here is to demonstrate the inner workings of the Random Forest algorithm, not to build the best possible house price regressor.
Let’s log the data as a Weights & Biases table so it’s easier to interact with.
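Here is a minimal sketch of how that table can be logged, reusing the 9 feature columns plus the target (the project name is an assumption):

import pandas as pd
import wandb

cols = ['MoSold', 'YrSold', 'KitchenAbvGr', 'BedroomAbvGr', 'TotRmsAbvGrd',
        'FullBath', 'LotArea', 'BldgType', 'YearBuilt', 'SalePrice']
df = pd.read_csv('./house_price_regression/train.csv')

# Log the selected columns as an interactive W&B Table
run = wandb.init(project="house-price-regression")  # hypothetical project name
run.log({"house_prices": wandb.Table(dataframe=df[cols])})
run.finish()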

As shown in the table above, our target variable is SalePrice and we only have 1 categorical variable - BldgType. We can easily transform the categorical variable before fitting the model.
Baseline
We've prepared our dataset, and the next step is to work towards creating a baseline.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

df = pd.read_csv('./house_price_regression/train.csv')
df = df.sort_values(['YrSold', 'MoSold']).reset_index(drop=True)

# Store `X` & `y`
cols_to_keep = ['MoSold', 'YrSold', 'KitchenAbvGr', 'BedroomAbvGr', 'TotRmsAbvGrd', 'FullBath', 'LotArea', 'BldgType', 'YearBuilt']
target_var = 'SalePrice'
X = df[cols_to_keep].copy().reset_index(drop=True)
y = df[target_var].copy().reset_index(drop=True)

# Transform `BldgType` to numerical
_tfm_dict = {k: i for i, k in enumerate(X.BldgType.unique())}
X['BldgType'] = X.BldgType.map(_tfm_dict)

# Last 20% (by sale date) as test data
n = int(len(X) * 0.8)
X_train, y_train = X[:n], y[:n]
X_test, y_test = X[n:], y[n:]

# Fit the model
model = RandomForestRegressor(random_state=123)
model.fit(X_train, y_train)
In the above, we read our data, store SalePrice as our target variable, and use only 9 other variables to predict the target. We split the data into train and test sets chronologically, by year and month of sale. This gives us a validation split that better reflects how the model would perform on future sales.
Finally, we create a RandomForestRegressor and fit it to the training data.
Let’s now create a metric called within_10 that represents whether a model prediction is within 10% of the actual house price.
preds = model.predict(X_test)
err = np.abs((preds / y_test) - 1)
w10 = err < 0.1
w10.mean()
>> 0.43493
Here you can see our baseline model achieves a within_10 score of 43.49%. In other words, 43.49% of the baseline model's house price predictions fall within 10% of the actual house prices in the test set.
How Does a Random Forest Really Work?
A random forest is an ensemble of decision trees. It generates multiple decision trees, each on a different subset of the data, and makes predictions by averaging the predictions of each tree. That means, to understand how random forests work, we must first dive into decision trees.
Understanding Decision Trees
Picture a flowchart - where each step is a question about some attribute or feature of your data. Depending on the answer to that question, you follow a branch to the next question, and so on. Eventually, you reach the end of the flowchart - this is what we call a leaf node.
In a decision tree, a leaf node is the final outcome, giving us a class label (for classification tasks) or a value (for regression tasks).
The very first question at the top of the tree, where we start our journey, is known as the root node.
So, in simple terms, a decision tree is just a series of questions leading to an answer. And a Random Forest? It's like asking a bunch of these trees for their opinions and then taking the average!
Let’s understand this better with the help of a graphical representation.
from sklearn import tree
from sklearn.tree import DecisionTreeRegressor
import matplotlib.pyplot as plt

regressor = DecisionTreeRegressor(max_depth=3)
regressor.fit(X_train, y_train)

# Plot the decision tree
fig = plt.figure(figsize=(15, 10))
_ = tree.plot_tree(
    regressor,
    feature_names=['MoSold', 'YrSold', 'KitchenAbvGr', 'BedroomAbvGr', 'TotRmsAbvGrd', 'FullBath', 'LotArea', 'BldgType', 'YearBuilt'],
    filled=True,
)
Running the above code gives us the following graphical representation of a decision tree.

Looking at this image, the decision tree first splits the data at YearBuilt <= 1985. That means this is our root node, and YearBuilt is the most important variable. If we traverse the leftmost branch, the model then splits on LotArea and BedroomAbvGr. There are 167 samples in our data with YearBuilt <= 1985, LotArea <= 10888, and BedroomAbvGr <= 2.5.
To further understand, let’s take an example prediction for our first validation data point.
MoSold              7
YrSold           2009
KitchenAbvGr        1
BedroomAbvGr        2
TotRmsAbvGrd        4
FullBath            1
LotArea          4060
BldgType            0
YearBuilt        1922
Name: 1168, dtype: int64
As summarized in the decision tree plotted above, if the tree is asked to predict on this row, it would do so in these four steps:
- At the root node, the condition YearBuilt <= 1985 is True (this house was built in 1922), so we take the leftmost branch.
- At the next decision point, LotArea <= 10888 is also True for our data point (LotArea is 4060), so we stay on the leftmost branch.
- At the next decision point, BedroomAbvGr <= 2.5 is also True, as our data point has 2 bedrooms.
- We land in a leaf node, so the predicted value is 112054.844.

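We can also trace this path programmatically. The sketch below uses scikit-learn's tree introspection attributes (decision_path, apply, and the fitted tree_ arrays) to print each split the first validation row passes through; the exact node IDs and values depend on the fitted tree.

# Trace the decision path of the fitted DecisionTreeRegressor for the first test row
row = X_test[:1]                           # the row shown above

node_path = regressor.decision_path(row)   # sparse matrix of the nodes visited
leaf_id = regressor.apply(row)[0]          # id of the leaf node we end up in

feature = regressor.tree_.feature
threshold = regressor.tree_.threshold

for node_id in node_path.indices:
    if node_id == leaf_id:
        # for a regression tree, the leaf stores the predicted value
        print(f"leaf {node_id}: predict {regressor.tree_.value[node_id][0][0]:.3f}")
    else:
        name = row.columns[feature[node_id]]
        value = row.iloc[0, feature[node_id]]
        decision = "<=" if value <= threshold[node_id] else ">"
        print(f"node {node_id}: {name} = {value} {decision} {threshold[node_id]:.2f}")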
With an understanding of how a Decision Tree works and makes predictions, it's time to move on to Random Forests.
From Decision Trees to Random Forests
If you've ever heard the saying, "Two heads are better than one," then you're already familiar with the basic idea behind Random Forests. In the world of machine learning, we often find that combining multiple models together can give us a better result than any single model could achieve on its own. This is the principle behind Random Forests - instead of just one decision tree, we have a whole 'forest' of them!
How does this work? When we train a Random Forest, we actually train many different decision trees on different subsets of our data. This is like asking a group of experts, each with their own unique perspective, to make a prediction. Each tree in the forest gets a vote, and the final prediction of the Random Forest is the average of all these votes. This process helps to overcome the overfitting problem of individual decision trees and improves the overall prediction accuracy.
For our example, each estimator (decision tree) makes its own property price prediction, and the average of those predictions becomes the final prediction.
# fit Random Forest with 5 decision trees on train data
model = RandomForestRegressor(n_estimators=5, random_state=123)
model.fit(X_train, y_train)

# predict on first row of test data
model.predict(X_test[:1])
>> array([64300.])

# check total estimators (decision trees) in our Random Forest
len(model.estimators_)
>> 5

# get prediction from each decision tree, which is made based on splits as
# explained before
[estimator.predict(X_test[:1]) for estimator in model.estimators_]
>> [array([110000.]),
    array([52000.]),
    array([52500.]),
    array([55000.]),
    array([52000.])]

# take average of predictions (which becomes the final prediction)
np.mean([estimator.predict(X_test[:1]) for estimator in model.estimators_])
>> 64300.0
The above code should be fairly self-explanatory: in a Random Forest, we fit multiple decision trees and take the average of their individual predictions as the final prediction.
Ensuring Diversity in the Forest
But wait, if we're training all these trees on the same task, won't they all just end up being the same? This is where the 'random' in Random Forest comes in. When training each tree, we not only use a different subset of the data, but we also consider a random subset of the features at each split in the decision tree. This ensures that each tree is forced to make decisions based on a different set of information, leading to a diverse set of trees.
This diversity is key to the power of Random Forests. Just like a team of experts from different fields can often solve a problem more effectively than a group of experts all from the same field, a diverse forest of decision trees can capture patterns and relationships in the data that might be missed by any single tree.
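In scikit-learn, both sources of randomness are exposed as constructor arguments: bootstrap controls whether each tree is trained on a bootstrap sample of the rows, and max_features controls how many features are considered at each split. A minimal sketch:

# Each tree sees a bootstrap sample of the rows, and only a random subset of
# the features is considered at every split.
diverse_model = RandomForestRegressor(
    n_estimators=100,
    bootstrap=True,       # sample rows with replacement for each tree
    max_features="sqrt",  # consider sqrt(n_features) candidate features per split
    random_state=123,
)
diverse_model.fit(X_train, y_train)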
Improving our Random Forest baseline using W&B Sweeps
One of the most powerful features W&B offers is Sweeps. It allows you to define a search space for your hyperparameters and then efficiently searches over that space using a strategy of your choice: grid search, random search, or Bayesian optimization.
In the context of our Random Forest model, we can use W&B Sweeps to find the optimal hyperparameters that will give us the best performance.
Let's see how we can set this up.
First, we need to prepare our data and define a function that will train our model and log the performance metrics. This function will use the hyperparameters provided by the W&B Sweep to train the model:
import yaml
import numpy as np
import pandas as pd
import wandb
from sklearn.ensemble import RandomForestRegressor


def prepare_data():
    df = pd.read_csv("./data/house-price/train.csv")
    df = df.sort_values(["YrSold", "MoSold"]).reset_index(drop=True)
    cols_to_keep = [
        "MoSold",
        "YrSold",
        "KitchenAbvGr",
        "BedroomAbvGr",
        "TotRmsAbvGrd",
        "FullBath",
        "LotArea",
        "BldgType",
        "YearBuilt",
    ]
    target_var = "SalePrice"
    X = df[cols_to_keep].copy().reset_index(drop=True)
    y = df[target_var].copy().reset_index(drop=True)
    _tfm_dict = {k: i for i, k in enumerate(X.BldgType.unique())}
    X["BldgType"] = X.BldgType.map(_tfm_dict)
    n = int(len(X) * 0.8)
    X_train, y_train = X[:n], y[:n]
    X_test, y_test = X[n:], y[n:]
    return X_train, X_test, y_train, y_test


def within_10(model, X_test, y_test):
    preds = model.predict(X_test)
    err = np.abs((preds / y_test) - 1)
    w10 = err < 0.1
    return w10.mean().round(3)


def main():
    # Set up your default hyperparameters
    with open("./sweep.yaml") as file:
        config = yaml.load(file, Loader=yaml.FullLoader)

    run = wandb.init(config=config)

    X_train, X_test, y_train, y_test = prepare_data()

    # Note that we read values from `wandb.config`
    # instead of defining hard-coded values
    n_estimators = wandb.config.n_estimators
    max_depth = wandb.config.max_depth
    min_samples_split = wandb.config.min_samples_split
    min_samples_leaf = wandb.config.min_samples_leaf

    model = RandomForestRegressor(
        n_estimators=n_estimators,
        max_depth=max_depth,
        min_samples_split=min_samples_split,
        min_samples_leaf=min_samples_leaf,
        random_state=42,
    )
    model.fit(X_train, y_train)

    w10 = within_10(model, X_test, y_test)

    wandb.log(
        {
            "n_estimators": n_estimators,
            "max_depth": max_depth,
            "min_samples_split": min_samples_split,
            "min_samples_leaf": min_samples_leaf,
            "within_10": w10,
        }
    )


# Call the main function.
main()
Next, we define a configuration for our Sweep. This configuration specifies the metric we want to optimize (in this case, the within_10 metric), the hyperparameters we want to search over, and the search strategy we want to use:
program: train.py
method: bayes
metric:
  name: within_10
  goal: maximize
parameters:
  n_estimators:
    min: 10
    max: 1000
  max_depth:
    min: 1
    max: 10
  min_samples_split:
    min: 2
    max: 10
  min_samples_leaf:
    min: 1
    max: 10
Finally, we can run our Sweep using the W&B command line interface. This will create the Sweep and launch an agent that trains a new model for each hyperparameter configuration sampled from the search space:
wandb sweep sweep.yaml   # creates the sweep and prints its ID
wandb agent --count N <user/project/sweep_id>
By using W&B Sweeps, we can automate the process of hyperparameter tuning and find the best model for our task without manual trial and error.
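If you'd rather stay in Python, the same sweep can be created and run programmatically. A minimal sketch, assuming the sweep.yaml above is on disk and the main() training function from the earlier script is available in the same session (the project name is an assumption):

import yaml
import wandb

# Load the sweep configuration defined above
with open("./sweep.yaml") as f:
    sweep_config = yaml.safe_load(f)

# Create the sweep and launch an agent that calls our training function
sweep_id = wandb.sweep(sweep_config, project="house-price-regression")
wandb.agent(sweep_id, function=main, count=115)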
I ran a sweep using the above code with 115 different runs, which produced the sweep results shown below.
As shown, with just a little code and some hyperparameter tuning, we improve the within_10 score from 43.49% in our baseline to 46.9%.
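Once the sweep has finished, you can refit a single model with the best hyperparameters it found, reusing the within_10 helper and data split from earlier. The values below are placeholders; substitute whatever your sweep reports as best:

# Hypothetical best hyperparameters from the sweep (placeholders)
best_params = dict(
    n_estimators=400,
    max_depth=9,
    min_samples_split=2,
    min_samples_leaf=1,
)

best_model = RandomForestRegressor(**best_params, random_state=42)
best_model.fit(X_train, y_train)
print(within_10(best_model, X_test, y_test))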
Conclusion
It's easy to see why the Random Forest algorithm is extremely popular in the machine learning community. Beyond being simple to use, its inherent flexibility makes it applicable to various industries.
If you're interested in learning more, I'd recommend experimenting with other parameters of the Random Forest model or trying out other tree-based models like XGBoost.
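For instance, XGBoost exposes a scikit-learn-compatible regressor, so a quick comparison against our Random Forest takes only a few lines (the hyperparameter values below are placeholders, not tuned settings):

from xgboost import XGBRegressor

# Placeholder hyperparameters; tune these just like we did for the Random Forest
xgb_model = XGBRegressor(n_estimators=500, max_depth=4, learning_rate=0.05, random_state=42)
xgb_model.fit(X_train, y_train)
print(within_10(xgb_model, X_test, y_test))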
As always, I hope you found this article helpful. If you have any questions or feedback, feel free to reach out. Happy learning!