
An Introduction to Linear Regression For Machine Learning (With Examples)

In this article, we provide an overview of, and a tutorial on, linear regression using scikit-learn, with code and interactive visualizations so you can follow along.
In this article, we'll explore linear regression using scikit-learn, providing the code and a handy Colab so you can follow along!
Before we dive in, here's what we'll be covering:

Table of Contents

  • What Is Linear Regression?
  • Linear Regression In Machine Learning
  • Method of Least Squares
  • Application Of Linear Regression In The Real World
  • Example(s) Of How To Use Linear Regression
  • Summary

What Is Linear Regression?

Linear regression is a statistical technique that attempts to model the relationship between variables by fitting a linear equation. One variable is usually considered to be the target (or dependent) variable ('desired output'), and the others are known as explanatory variables ('input').

In layman's terms, linear regression relates one value to another with a straight line: if one value is on the X axis and the other on the Y axis, we're trying to find the line that best describes their relationship. A straightforward example could be the relationship between a rental price and the square footage of a property.
In fact, for the purposes of this report, we'll use a housing dataset, specifically the Ames Housing dataset, which is a more modern and expanded version of the Boston Housing Dataset. You can find that dataset on Kaggle (and read the original journal article if you'd like to, as well).
Our task here is to analyze the effect of features like street access and building zones (among others) and model this relationship. In this case, the price of the house is the target variable (y), and the other features are the inputs (the x's). Since we assume a linear relationship between the features and the target, the equation used to represent this dependency becomes:
f(x) = \theta_0 + \theta_1 x + \epsilon

where
  • \{\theta_0, \theta_1\} are known as the regression coefficients.


For any given data distribution, it's highly unlikely that any given straight line fits the distribution exactly. Thus, there exists some error (\epsilon) between the observed value and that given by f(x).
From a statistical point of view, \epsilon is a statistical error: a random variable that accounts for the deviation between the true and predicted value of the target variable. With multiple input features, you'll sometimes see linear regression written as f(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \ldots + \epsilon.
💡
NOTE: The random variable \epsilon is assumed to have zero mean and variance \sigma^2, and the input data (x) is treated as a constant. The mean of the target values is therefore independent of \epsilon:
\mathbb{E}(y \mid x) = \mu_{y \mid x} = \mathbb{E}(\theta_0 + \theta_1 x + \epsilon) = \theta_0 + \theta_1 x

whereas the variance of the obtained values,
\mathrm{Var}(y \mid x) = \sigma_{y \mid x}^2 = \mathrm{Var}(\theta_0 + \theta_1 x + \epsilon) = \sigma^2



Let's make the math a bit more interpretable:
  • The functional value f(x) for a given data point corresponds to the expected value of the target variable (y)
  • The variance of the model output is constant for all data points and depends solely on the random variable (\epsilon)
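
To see these two properties in action, here's a minimal simulation sketch (the parameter values are arbitrary and purely illustrative, not from the report): it samples y = \theta_0 + \theta_1 x + \epsilon for a fixed x and checks that the sample mean lands near \theta_0 + \theta_1 x while the sample variance stays near \sigma^2.

import numpy as np

# Arbitrary "true" parameters for this sketch
theta_0, theta_1, sigma = 2.0, 0.5, 1.5
x = 3.0  # a single fixed input value

rng = np.random.default_rng(0)
eps = rng.normal(loc=0.0, scale=sigma, size=100_000)  # epsilon with zero mean, sigma^2 variance
y = theta_0 + theta_1 * x + eps

print(y.mean())  # ~ theta_0 + theta_1 * x = 3.5
print(y.var())   # ~ sigma^2 = 2.25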

Linear Regression In Machine Learning

Notation

  • \theta \rightarrow Parameters
  • m \rightarrow Number of training examples
  • n \rightarrow Number of features
  • x \rightarrow Inputs / feature values
  • y \rightarrow Outputs / target variable
  • (x^{(i)}, y^{(i)}) \rightarrow i^{th} training example
As mentioned in the introduction, our model aims to capture the data distribution and thus only approximates the underlying relationship (f(x) \approx y).



⛷ Method of Least Squares

The method of least squares is one of the most common methods used to estimate the values of the regression coefficients. Our goal is to estimate the values of \theta_0 and \theta_1 such that the sum of the squares of the differences between the observations (y_i) and the straight line (represented by f(x)) is minimized. This can be represented as:
S(\theta_0, \theta_1) = \sum_{i=1}^{n} (y_i - \theta_0 - \theta_1 x_i)^2

If \hat{\theta_0} and \hat{\theta_1} are the best values (called the "least-squares estimators") of \theta_0 and \theta_1, then they must satisfy:
\frac{\partial S}{\partial \theta_0} = -2 \sum_{i=1}^{n} (y_i - \hat{\theta_0} - \hat{\theta_1} x_i) = 0

\frac{\partial S}{\partial \theta_1} = -2 \sum_{i=1}^{n} (y_i - \hat{\theta_0} - \hat{\theta_1} x_i)\, x_i = 0

The above two equations are known as the "Normal Equations".
Upon solving these normal equations, we get:
\hat{\theta_0} = \bar{y} - \hat{\theta_1} \bar{x}

\hat{\theta_1} = \frac{\sum (y_i - \bar{y})(x_i - \bar{x})}{\sum (x_i - \bar{x})^2}

Therefore, \hat{\theta_0} and \hat{\theta_1} are the least-squares estimators and represent the intercept and slope of the estimated straight line, respectively, and the "fitted" line is \hat{f}(x) = \hat{\theta_0} + \hat{\theta_1} x
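
These closed-form estimators translate directly into a few lines of NumPy. The sketch below uses made-up toy data (not the housing dataset) to compute \hat{\theta_1} and \hat{\theta_0} from the formulas above and cross-check them against scikit-learn:

import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative toy data: a noisy straight line
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=50)
y = 4.0 + 1.5 * x + rng.normal(scale=2.0, size=50)

# Closed-form least-squares estimators from the normal equations
x_bar, y_bar = x.mean(), y.mean()
theta_1_hat = np.sum((y - y_bar) * (x - x_bar)) / np.sum((x - x_bar) ** 2)
theta_0_hat = y_bar - theta_1_hat * x_bar

# Cross-check against scikit-learn's LinearRegression
model = LinearRegression().fit(x.reshape(-1, 1), y)
print(theta_0_hat, theta_1_hat)            # manual estimates
print(model.intercept_, model.coef_[0])    # should match closely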



Application Of Linear Regression In The Real World

  • Linear regression is a widely used algorithm in the applied machine learning world.
  • Variants of linear regression are heavily used in the biomedical industry for tasks such as Survival Analysis. (For example, refer to Cox Proportional Hazards Regression Analysis)
  • Econometrics (application of statistical methods to economic data) applications heavily rely on linear regression as a building block.

Example(s) Of How To Use Linear Regression

Before we dive into the example, this is a good time to provide a link to the Colab so you can follow along.

Ready? Let's get going!

Quick EDA (Using W&B Weave 🪡)

If you find pandas, matplotlib and seaborn intimidating and always spend a couple of hours on a single plot, W&B Weave is the perfect product for you.
Weave panels allow you to query data from W&B and visualize and analyze it interactively. Below, you can see plots created using a "Weave expression" to query the dataset (the runs.summary["Train Dataset"] part), Weave Panels (the Merge Tables: Plot part) to select a Plot, and then the Weave configuration to choose the features for the X and Y axes.
Below, in the first two panels, we can see the relationship between (SalePrice, LotFrontage) and (SalePrice, LotArea), both of them numeric features, and in the next two panels, the relationship between (SalePrice, LandContour) and (SalePrice, LotShape), where LandContour and LotShape are categorical features.


We can easily explore our dataset using Weights & Biases Tables. W&B Tables allow you to create interactive dataset exploration plots. We can easily upload any pandas data frame as W&B Tables using the following code snippet.
import pandas as pd
import wandb

# Start a W&B run (fill in your own project, entity, and job type)
wandb.init(project="...", entity="...", job_type="...")

# Load the dataset and log it as an interactive W&B Table
dataset = pd.read_csv("...")
wandb.run.log({"Dataset": wandb.Table(dataframe=dataset)})

wandb.run.finish()



💪🏻 How to Prepare the Dataset

It's common for datasets to have categorical variables. For example, in our dataset, the feature "MSZoning", which corresponds to the general zoning classification of the sale, is a categorical variable. Below, you can see some of the IDs corresponding to "C (all)" and "RM".


We can't pass these values directly to the regression model as it only understands numbers. Thus, we need to convert these string values to some integer/float encoding through a process called "One-Hot Encoding." It's a widely used technique, especially when the number of unique values for the categorical feature is small. This process creates multiple binary columns indicating the presence (or absence) of the variables.
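
As a rough sketch of what this looks like in code (assuming the dataset has already been loaded into a pandas DataFrame called dataset, as in the Table-logging snippet above), pandas can do the encoding in a single call:

import pandas as pd

# One-hot encode the categorical "MSZoning" column into binary indicator columns
dataset = pd.get_dummies(dataset, columns=["MSZoning"])

# The new columns are named MSZoning_<value>, e.g. "MSZoning_RM", "MSZoning_C (all)"
print([col for col in dataset.columns if col.startswith("MSZoning")])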

Below, you can see the dataset after we've one-hot encoded the "MSZoning" feature. As there were 5 distinct values of this feature, it led to the creation of 5 new columns with binary indicators.



Using sklearn

scikit-learn makes it incredibly easy for you to use LinearRegression. It's available in the sklearn.linear_model submodule and has a very simple API design.
from sklearn.linear_model import LinearRegression

# Load the prepared features and target
# (get_dataset is a placeholder for your own data-loading code)
x, y = get_dataset()

# Fit an ordinary least-squares linear regression model
model = LinearRegression()
model.fit(x, y)
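
Once the model is fit, predictions and a quick goodness-of-fit check are one-liners as well (a small usage sketch continuing from the snippet above):

# Predict sale prices for the training inputs
predictions = model.predict(x)

# R^2 (coefficient of determination) on the training data
print(model.score(x, y))

# Fitted intercept and coefficients
print(model.intercept_, model.coef_)
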
It's even easier to plot the Learning Curve using Weights & Biases! For instance, the graph below was plotted using the following line.
wandb.sklearn.plot_learning_curve(model, x, y)



📃 Summary

In this article, we went through the basics of linear regression and learned how we can estimate a line using the method of least squares. We also covered some basic data pre-processing techniques and learned how we could use scikit-learn to train a linear regression model and plot valuable metrics and data using a suite of W&B tools.