An Introduction to Linear Regression For Machine Learning (With Examples)
In this article, we provide an overview of, and a tutorial on, linear regression using scikit-learn, with code and interactive visualizations so you can follow along.
In this article, we'll explore linear regression using scikit-learn, providing the code and a handy Colab so you can follow along!
Before we dive in, here's what we'll be covering:
Table of Contents
What Is Linear Regression?
Linear Regression In Machine Learning
Application Of Linear Regression In The Real World
Example(s) Of How To Use Linear Regression
Using sklearn
📃 Summary
What Is Linear Regression?
Linear regression is a statistical technique that attempts to model the relationship between variables by fitting a linear equation. One variable is usually considered to be the target (or dependent) variable ('desired output'), and the others are known as explanatory variables ('input').

In layman's terms, a linear regression model relates one value to another with a straight line. Put simply, if one value is on the X-axis and the other on the Y-axis, we're trying to find the line that best describes their relationship. A straightforward example could be the relationship between a rental price and the square footage of a property.
In fact, for the purposes of this report, we'll use a housing dataset, specifically the Ames Housing dataset, which is a more modern and expanded version of the Boston Housing Dataset. You can find that dataset on Kaggle (and read the original journal article if you'd like to, as well).
Our task here is to analyze the effect of features like street access and building zones (among others) and model this relationship. In this case, the price of the house is the target variable ($y$), and the other features are the inputs (the $x$'s). As we assume a linear relationship between these features, the equation used to represent this dependency becomes:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n + \varepsilon$$

where $\beta_0, \beta_1, \dots, \beta_n$ are known as the regression coefficients.

For any given data distribution, it's highly unlikely that any given straight line fits the distribution exactly. Thus, there exists some error ($\varepsilon$) between the observed value and that given by $\beta_0 + \beta_1 x_1 + \dots + \beta_n x_n$.

From a statistical point of view, $\varepsilon$ is perceived as a statistical error. It's assumed to be a random variable that accounts for the deviation between the true and predicted value of the target variable. Thus, you'll sometimes find $y = \beta_0 + \beta_1 x + \varepsilon$ referred to as the equation for linear regression.
💡
NOTE: The random variable $\varepsilon$ is assumed to have mean $0$ and variance $\sigma^2$, and the input data ($x$) is treated as a constant (non-random) quantity. The mean of the observed values, $E(y) = \beta_0 + \beta_1 x$, is independent of $\varepsilon$, whereas the variance of the observed values is $\text{Var}(y) = \text{Var}(\varepsilon) = \sigma^2$.
Let's make the math a bit more interpretable, shall we:
- The functional value $\beta_0 + \beta_1 x$ for a given data point ($x$) corresponds to the expected value of the target, $E(y)$.
- The variance of the model output is constant for all data points and solely depends on the random variable ($\varepsilon$): $\text{Var}(y) = \sigma^2$.
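To make these assumptions concrete, here's a minimal NumPy sketch (the coefficients and noise level are made up for illustration) that simulates data from $y = \beta_0 + \beta_1 x + \varepsilon$ and checks that the noise has roughly zero mean and constant variance:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical "true" coefficients and noise level, chosen for illustration
beta_0, beta_1, sigma = 2.0, 0.5, 1.0

x = np.linspace(0, 10, 1_000)                          # fixed (non-random) inputs
eps = rng.normal(loc=0.0, scale=sigma, size=x.shape)   # epsilon ~ N(0, sigma^2)
y = beta_0 + beta_1 * x + eps                          # observed targets

# E(y) tracks the line beta_0 + beta_1 * x; the noise has ~zero mean
print("mean of eps:", eps.mean())   # close to 0
# Var(y) is driven entirely by eps, so it is ~sigma^2 everywhere
print("var of eps:", eps.var())     # close to sigma**2 = 1.0
```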
Linear Regression In Machine Learning
Notation
- $\beta$ — Parameters (regression coefficients)
- $m$ — No. of training examples
- $n$ — No. of features
- $x$ — Inputs / feature values
- $y$ — Outputs / target variable
- $(x^{(i)}, y^{(i)})$ — the $i^{th}$ training example
As mentioned in the introduction, our model aims to "model" the data distribution and thus only approximates the underlying relationship ($\hat{y} \approx y$).
⛷ Method of Least Squares
The method of least squares is one of the most common methods used to estimate the value of regression coefficients. Our goal is to estimate the values of $\beta_0$ and $\beta_1$, such that the sum of the squares of the differences between the observations ($y_i$) and the straight line (represented by $\beta_0 + \beta_1 x_i$) is minimum. This can be represented as:

$$S(\beta_0, \beta_1) = \sum_{i=1}^{m} \left( y_i - \beta_0 - \beta_1 x_i \right)^2$$

If $\hat{\beta}_0$ and $\hat{\beta}_1$ are the best values (called the "least square estimators") of $\beta_0$ and $\beta_1$, then they must satisfy:

$$\frac{\partial S}{\partial \beta_0} = -2 \sum_{i=1}^{m} \left( y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i \right) = 0$$

$$\frac{\partial S}{\partial \beta_1} = -2 \sum_{i=1}^{m} \left( y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i \right) x_i = 0$$

The above two equations are known as the "Normal Equations".

Upon solving these normal equations, we get:

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{m} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{m} (x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$

Therefore, $\hat{\beta}_0$ and $\hat{\beta}_1$ are the least-squares estimators and represent the intercept and slope of the estimated straight line, respectively, and the "fitted" line is:

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$$
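As a sanity check, here's a minimal NumPy sketch (the data is made up for illustration) that computes $\hat{\beta}_0$ and $\hat{\beta}_1$ directly from the closed-form solution above and compares them against np.polyfit, which solves the same least-squares problem:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: y ≈ 3 + 2x plus noise
x = rng.uniform(0, 10, size=200)
y = 3.0 + 2.0 * x + rng.normal(scale=1.5, size=x.shape)

x_bar, y_bar = x.mean(), y.mean()

# Closed-form least-squares estimators from the normal equations
beta_1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta_0_hat = y_bar - beta_1_hat * x_bar

print(f"intercept: {beta_0_hat:.3f}, slope: {beta_1_hat:.3f}")

# np.polyfit with deg=1 fits the same line; it returns [slope, intercept]
slope, intercept = np.polyfit(x, y, deg=1)
print(f"np.polyfit -> intercept: {intercept:.3f}, slope: {slope:.3f}")
```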
Application Of Linear Regression In The Real World
- Linear regression is a widely used algorithm in the applied machine learning world.
- Variants of linear regression are heavily used in the biomedical industry for tasks such as Survival Analysis. (For example, refer to Cox Proportional Hazards Regression Analysis)
- Econometric applications (econometrics is the application of statistical methods to economic data) rely heavily on linear regression as a building block.
Example(s) Of How To Use Linear Regression
Before we dive into the example, this is a good time to provide a link to the Colab so you can follow along.
Ready? Let's get going!
Quick EDA (Using W&B Weave 🪡)
If you find pandas, matplotlib and seaborn intimidating and always spend a couple of hours on a single plot, W&B Weave is the perfect product for you.
Weave panels allow you to query data from W&B and visualize and analyze it interactively. Below, you can see plots created using a "Weave expression" to query the dataset (the runs.summary["Train Dataset"] part), Weave Panels (the Merge Tables: Plot part) to select a plot, and the Weave configuration to choose the features for the X and Y axes.
Below, in the first two panels, we can see the relationship between (SalePrice, LotFrontage) and (SalePrice, LotArea), both of which are numeric features. In the next two panels, we see the relationship between (SalePrice, LandContour) and (SalePrice, LotShape), where LandContour and LotShape are categorical features.
We can easily explore our dataset using Weights & Biases Tables, which let you build interactive dataset-exploration plots. Any pandas DataFrame can be uploaded as a W&B Table using the following code snippet.
```python
import pandas as pd
import wandb

# Start a W&B run (fill in your own project, entity, and job_type)
wandb.init(project='...', entity='...', job_type="...")

# Load the dataset and log it as an interactive W&B Table
dataset = pd.read_csv("...")
wandb.run.log({"Dataset": wandb.Table(dataframe=dataset)})
wandb.run.finish()
```
💪🏻 How to Prepare the Dataset
It's common for datasets to have categorical variables. For example, in our dataset, the feature "MSZoning", which corresponds to the general zoning classification of the sale, is a categorical variable. Below, you can see some of the IDs corresponding to "C (all)" and "RM".
We can't pass these values directly to the regression model, as it only understands numbers. Thus, we need to convert these string values to a numeric encoding through a process called "one-hot encoding." It's a widely used technique, especially when the number of unique values of the categorical feature is small. This process creates one binary column per category, indicating the presence (or absence) of that value.

Below, you can see the dataset after we've one-hot encoded the "MSZoning" feature. As there were 5 distinct values of this feature, it led to the creation of 5 new columns with binary indicators.
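As a concrete sketch of this step (assuming the Kaggle training CSV is saved locally as train.csv; the filename is an assumption), here's how you could one-hot encode "MSZoning" with pandas:

```python
import pandas as pd

# Assumed local path to the Ames training data downloaded from Kaggle
dataset = pd.read_csv("train.csv")

# pd.get_dummies creates one binary column per category of MSZoning
one_hot = pd.get_dummies(dataset["MSZoning"], prefix="MSZoning")

# Replace the original categorical column with its one-hot columns
dataset = dataset.drop(columns=["MSZoning"]).join(one_hot)

print(one_hot.columns.tolist())  # 5 columns, one per zoning class
```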
Using sklearn
scikit-learn makes it incredibly easy for you to use LinearRegression. It's available in the sklearn.linear_model submodule and has a very simple API design.
```python
from sklearn.linear_model import LinearRegression

# get_dataset() is a helper (from the accompanying Colab) that returns
# the pre-processed feature matrix x and target vector y
x, y = get_dataset()

model = LinearRegression()
model.fit(x, y)
```
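Once fitted, the learned parameters map directly onto the least-squares estimators from earlier: intercept_ holds $\hat{\beta}_0$, coef_ holds the slope coefficients, and predict returns the fitted values $\hat{y}$. A short sketch using the standard scikit-learn attributes:

```python
# The learned parameters correspond to the least-squares estimators
print(model.intercept_)   # beta_0_hat (the intercept)
print(model.coef_)        # one slope coefficient per feature column

# Fitted values: y_hat = beta_0_hat + x @ coefficients
y_hat = model.predict(x)
```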
It's even easier to plot the Learning Curve using Weights & Biases! For instance, the graph below was plotted using the following line.
wandb.sklearn.plot_learning_curve(model, x, y)
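For context, that call needs an active W&B run to log the panel to. Here's a minimal end-to-end sketch (the project and entity names are placeholders, and get_dataset() is the Colab helper assumed above):

```python
import wandb
from sklearn.linear_model import LinearRegression

wandb.init(project='...', entity='...')  # placeholder project/entity

x, y = get_dataset()  # assumed helper returning features and targets
model = LinearRegression()
model.fit(x, y)

# Logs an interactive learning-curve panel to the active W&B run
wandb.sklearn.plot_learning_curve(model, x, y)
wandb.finish()
```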
📃 Summary
In this article, we went through the basics of linear regression and learned how we can estimate a line using the method of least squares. We also covered some basic data pre-processing techniques and learned how we could use scikit-learn to train a linear regression model and plot valuable metrics and data using a suite of W&B tools.