An Introduction to Linear Regression For Machine Learning (With Examples)

A tutorial covering Linear Regression using scikit-learn, complete with code and interactive visualizations. Made by Saurav Maheshkar using W&B
Saurav Maheshkar


🧐 What Is Linear Regression?

In its most basic form, linear regression is a statistical technique that attempts to model the relationship between variables by fitting a linear equation. One variable is usually considered to be the target or dependent variable ('desired output'), and the others are known as explanatory variables or features ('input').
In layman's terms, a linear regression model relates one value to another along a straight line. Put simply, if one value is on the X axis and the other on the Y axis, we're trying to find the line that best describes their relationship. A straightforward example could be the relationship between the rental price and the square footage of a property.
In fact, for the purposes of this report, we'll use a housing dataset, specifically the Ames Housing dataset, which is a more modern and expanded version of the Boston Housing Dataset. You can find that dataset on Kaggle (and read the original journal article as well, if you'd like).
Our task here is to analyze the effect of features like street access and building zone (among others) and model this relationship. In this case, the price of the house is the target variable (y) and the other features are the input (x's). As we assume a linear relationship between these features, the equation used to represent this dependency becomes:
f(x) = \theta_0 + \theta_1x + \epsilon

For any given data distribution, it's highly unlikely that any given straight line fits the distribution exactly, thus there exists some error (\epsilon) between the observed value and that given by f(x).
From a statistical point of view, \epsilon is perceived as a statistical error. It's assumed to be a random variable that accounts for the deviation between the true and obtained values of the target variable. With several input features, the same idea gives f(x) = \theta_0 + \theta_1x_1 + \theta_2x_2 + ... + \epsilon, which you will also find referred to as the equation for linear regression.
NOTE: The random variable \epsilon is assumed to have mean 0 and variance \sigma^2, and the input data (x) is treated as a fixed quantity. The mean of the obtained values therefore does not depend on \epsilon:
\mathbb{E}(y \, | \, x) = \mu_{y | x} = \mathbb{E} (\theta_0 + \theta_1x + \epsilon) = \theta_0 + \theta_1x
whereas the variance of the obtained values,
Var \, (y \, | \, x) = \sigma_{y | x}^2 = Var (\theta_0 + \theta_1x + \epsilon) = \sigma^2
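The two identities above can be checked numerically. The following is a minimal sketch, assuming made-up values for \theta_0, \theta_1 and \sigma and assuming that \epsilon is normally distributed (the text only requires zero mean and variance \sigma^2); none of these numbers come from the housing dataset.

```python
import numpy as np

# Hypothetical "true" parameters, chosen only for illustration.
theta_0, theta_1, sigma = 2.0, 0.5, 1.0

rng = np.random.default_rng(seed=0)
x_fixed = 4.0        # a single input value, held fixed
n_draws = 200_000

# Draw many noisy observations y = theta_0 + theta_1 * x + epsilon at the same x.
# Assumption: epsilon is Gaussian with mean 0 and standard deviation sigma.
epsilon = rng.normal(loc=0.0, scale=sigma, size=n_draws)
y = theta_0 + theta_1 * x_fixed + epsilon

sample_mean = y.mean()        # should be close to theta_0 + theta_1 * x_fixed
sample_var = y.var(ddof=1)    # should be close to sigma ** 2

print(f"sample mean of y given x: {sample_mean:.3f}")
print(f"sample variance of y given x: {sample_var:.3f}")
```

With enough draws, the sample mean settles near \theta_0 + \theta_1x and the sample variance near \sigma^2, which is what the two equations state.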

Let's make the math a bit more interpretable, shall we?

📊 Linear Regression In Machine Learning

👨‍🏫 Notation

As mentioned in the introduction, our model aims to "model" the data distribution and thus only approximates the underlying distribution (f(x) \approx y).

⛷ Method of Least Squares

The method of least squares is one of the most common methods used to estimate the values of the regression coefficients. Our goal is to estimate the values of \theta_0 and \theta_1 such that the sum of the squares of the differences between the observations (y_i) and the straight line (represented by f(x)) is minimized. This can be represented as:
S(\theta_0, \theta_1) = \sum_{i = 1}^{n} (y_i - \theta_0 - \theta_1x_i)^2
If \hat{\theta_0} and \hat{\theta_1} are the best values (called the "least-squares estimators") of \theta_0 and \theta_1, then they must satisfy:
\frac{\partial S}{\partial \theta_0} = -2 \sum_{i=1}^{n}(y_i - \hat{\theta_0} - \hat{ \theta_1}x_i) = 0
\frac{\partial S}{\partial \theta_1} = -2 \sum_{i=1}^{n}(y_i - \hat{\theta_0} - \hat{ \theta_1}x_i) x_i = 0
The above two equations are known as the "Normal Equations".
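For readers who want the intermediate steps, one way to solve the normal equations is sketched here. It uses only the symbols already defined, with \bar{x} and \bar{y} denoting the sample means of the x_i and y_i.

```latex
\begin{aligned}
&\text{First equation:}\quad
\sum_{i=1}^{n} y_i - n\hat{\theta_0} - \hat{\theta_1}\sum_{i=1}^{n} x_i = 0
\;\Rightarrow\;
\hat{\theta_0} = \bar{y} - \hat{\theta_1}\bar{x} \\
&\text{Substituting into the second:}\quad
\sum_{i=1}^{n} \big( (y_i - \bar{y}) - \hat{\theta_1}(x_i - \bar{x}) \big)\, x_i = 0
\;\Rightarrow\;
\hat{\theta_1} = \frac{\sum_{i=1}^{n} (y_i - \bar{y})\, x_i}{\sum_{i=1}^{n} (x_i - \bar{x})\, x_i} \\
&\text{Since } \sum_{i=1}^{n} (y_i - \bar{y}) = 0 \text{ and } \sum_{i=1}^{n} (x_i - \bar{x}) = 0:\quad
\sum_{i=1}^{n} (y_i - \bar{y})\, x_i = \sum_{i=1}^{n} (y_i - \bar{y})(x_i - \bar{x}),
\qquad
\sum_{i=1}^{n} (x_i - \bar{x})\, x_i = \sum_{i=1}^{n} (x_i - \bar{x})^2
\end{aligned}
```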
Upon solving these normal equations, we get:
\hat{\theta_0} = \bar{y} - \hat{\theta_1}\bar{x}
\hat{\theta_1} = \frac{\sum{(y_i - \bar{y})(x_i - \bar{x})}}{\sum{(x_i - \bar{x})^2}}
Therefore, \hat{\theta_0} and \hat{\theta_1} are the least-squares estimators and represent the intercept and slope of the estimated straight line, respectively, and the "fitted" line is \hat{f}(x) = \hat{\theta_0} + \hat{\theta_1}x
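The closed-form estimators translate almost directly into NumPy. This is a minimal sketch on made-up toy data (not taken from the Ames dataset); the function names `least_squares_fit` and `sum_of_squares` are hypothetical helpers introduced only for this illustration.

```python
import numpy as np

def least_squares_fit(x, y):
    """Return (theta_0_hat, theta_1_hat) from the closed-form estimators."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_bar, y_bar = x.mean(), y.mean()
    # Slope: sum of cross-deviations divided by sum of squared x-deviations.
    theta_1_hat = np.sum((y - y_bar) * (x - x_bar)) / np.sum((x - x_bar) ** 2)
    # Intercept: forces the fitted line through the point (x_bar, y_bar).
    theta_0_hat = y_bar - theta_1_hat * x_bar
    return theta_0_hat, theta_1_hat

def sum_of_squares(theta_0, theta_1, x, y):
    """S(theta_0, theta_1): sum of squared differences between y_i and the line."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.sum((y - theta_0 - theta_1 * x) ** 2)

# Made-up toy data, for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

theta_0_hat, theta_1_hat = least_squares_fit(x, y)
print("intercept:", theta_0_hat)
print("slope:", theta_1_hat)

# Nudging a coefficient away from the estimate should increase S.
s_best = sum_of_squares(theta_0_hat, theta_1_hat, x, y)
s_nudged = sum_of_squares(theta_0_hat + 0.1, theta_1_hat, x, y)
print("S at the estimate:", s_best)
print("S after nudging the intercept:", s_nudged)
```

As a sanity check, `np.polyfit(x, y, deg=1)` should return the same slope and intercept (in that order), since it also solves a least-squares problem for a degree-1 polynomial.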

🏋️‍♂️ Application Of Linear Regression In The Real World

Example(s) Of How To Use Linear Regression

Quick EDA (Using W&B Weave 🪑)

If you find pandas, matplotlib and seaborn intimidating and always spend a couple of hours on a single plot, W&B Weave is the perfect product for you. Weave panels allow you to query data directly from W&B, then visualize and analyze it interactively. Below you can see plots created using a "Weave expression" to query the dataset (the runs.summary["Train Dataset"] part), Weave Panels (the Merge Tables: Plot part) to select a Plot, and then Weave configuration to choose the features for the X and Y axes.
In the first 2 panels below, we can see the relation between ( SalesPrice, LotFrontage ) and ( SalesPrice, LotArea ), both of them being numeric features. In the next 2 panels, we can see the relation between ( SalesPrice, LandContour ) and ( SalesPrice, LotShape ), where LandContour and LotShape are categorical features.
We can easily explore our dataset using Weights and Biases Tables. W&B Tables allow you to create interactive dataset exploration plots. We can easily upload any pandas data frame as W&B Tables using the following code snippet.
import pandas as pd
import wandb
wandb.init(project='...', entity='...', job_type="...")
dataset = pd.read_csv("...")
# Log the DataFrame as an interactive W&B Table
wandb.log({"Dataset": wandb.Table(dataframe=dataset)})

💪🏻 How to Prepare the Dataset

It's common for datasets to have categorical variables. For example, in our dataset the feature "MSZoning", which corresponds to the general zoning classification of the sale, is a categorical variable. Below you can see some of the IDs corresponding to "C (all)" and "RM".
We can't pass these values directly to the regression model as it only understands numbers. Thus, we need to convert these string values to a numeric encoding through a process called "One-Hot Encoding". It's a widely used technique, especially when the number of unique values of the categorical feature is small. This process creates multiple binary columns indicating the presence (or absence) of each value.
Below you can see the dataset after we've one-hot encoded the "MSZoning" feature. As there were 5 distinct values of this feature, it led to the creation of 5 new columns with binary indicators.
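One common way to do this in pandas is `pd.get_dummies`. The sketch below uses a tiny made-up frame rather than the real dataset; the rows and the "RL" value are illustrative assumptions, and only 3 zoning values appear here instead of the 5 mentioned above.

```python
import pandas as pd

# Tiny made-up frame; the real dataset has many more rows and columns.
df = pd.DataFrame({
    "Id": [1, 2, 3, 4],
    "MSZoning": ["RL", "RM", "C (all)", "RL"],
})

# Replace MSZoning with one binary indicator column per distinct value.
encoded = pd.get_dummies(df, columns=["MSZoning"], dtype=int)
print(encoded)

# Each row has exactly one indicator set to 1.
indicator_cols = [c for c in encoded.columns if c.startswith("MSZoning_")]
print(encoded[indicator_cols].sum(axis=1).tolist())
```

When the model also fits an intercept, the full set of indicator columns is perfectly collinear with it; `pd.get_dummies` accepts `drop_first=True` if you prefer to drop one level to avoid that.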

Using sklearn

scikit-learn makes it incredibly easy for you to use LinearRegression. It's available in the sklearn.linear_model submodule and has a very simple API design.
from sklearn.linear_model import LinearRegression
x, y = get_dataset()  # helper that returns the features and the target
model = LinearRegression()
model.fit(x, y)
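To try the same API end to end without the housing data, here is a self-contained sketch. The `get_dataset` below is a hypothetical stand-in that generates synthetic data from y = 3 + 2x + noise; it is not the helper used in this report.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def get_dataset(n_samples=200, seed=0):
    """Hypothetical stand-in: synthetic data from y = 3 + 2x + Gaussian noise."""
    rng = np.random.default_rng(seed)
    # scikit-learn expects a 2D feature array of shape (n_samples, n_features).
    x = rng.uniform(0.0, 10.0, size=(n_samples, 1))
    y = 3.0 + 2.0 * x[:, 0] + rng.normal(0.0, 0.5, size=n_samples)
    return x, y

x, y = get_dataset()
model = LinearRegression()
model.fit(x, y)

print("intercept:", model.intercept_)    # estimate of theta_0
print("slope:", model.coef_[0])          # estimate of theta_1
print("R^2 on the training data:", model.score(x, y))
```

The fitted `intercept_` and `coef_` play the roles of \hat{\theta_0} and \hat{\theta_1} from the least-squares section, so on this synthetic data they should land close to 3 and 2.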
It's even easier to plot the learning curve using Weights & Biases! For instance, the graph below was plotted using the following line.
wandb.sklearn.plot_learning_curve(model, x, y)
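If you want to inspect the numbers behind a learning curve locally, scikit-learn's `learning_curve` computes cross-validated scores at increasing training-set sizes. This is a sketch on synthetic data with arbitrarily chosen settings (5 sizes, 5 folds); it is not a description of what the W&B call does internally.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve

# Synthetic data (an assumption for illustration): y = 3 + 2x + Gaussian noise.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, size=(300, 1))
y = 3.0 + 2.0 * x[:, 0] + rng.normal(0.0, 0.5, size=300)

# Scores are R^2 by default for regressors; rows are sizes, columns are CV folds.
train_sizes, train_scores, test_scores = learning_curve(
    LinearRegression(),
    x,
    y,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5,
)

mean_train = train_scores.mean(axis=1)
mean_test = test_scores.mean(axis=1)
for size, tr, te in zip(train_sizes, mean_train, mean_test):
    print(f"n={size:4d}  train R^2={tr:.3f}  validation R^2={te:.3f}")
```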

📃 Summary

In this report, we went through the basics of linear regression and learned how to estimate a line using the method of least squares. We also covered some basic data pre-processing techniques and saw how to use scikit-learn to train a linear regression model and plot valuable metrics and data using a suite of W&B tools.