Understanding L1 and L2 regularization: techniques for optimized model training
In this article, we will dive deep into what L1 and L2 regularization are and how they work.

Introduction
Overfitting is a common problem when training machine learning models. Essentially, overfitting is when the model memorizes the noise and fits too closely to the training set, resulting in a model that is unable to generalize well to new data.
One popular solution to overfitting is using L1 and L2 regularization. These techniques add a penalty term to the loss function, encouraging the model to have smaller parameter values.
In this article, we'll cover both in detail, along with some code implementation to further help you understand them.
Basics of Regularization

Put simply, regularization is a technique used in machine learning to reduce overfitting and improve the model's ability to generalize to new data.
Regularization imposes extra constraints or penalties on the model during training to manage its complexity and keep it from relying too heavily on particular features or patterns in the training data. This helps strike a balance between fitting the training data accurately and generalizing effectively to unseen data.
The primary regularization methods are L1 regularization (lasso), L2 regularization (ridge), and elastic net regularization.
L1 regularization adds the sum of the absolute values of the coefficients as a penalty term to the loss function, promoting sparsity and the selection of relevant features. L2 regularization instead adds the sum of the squared values of the coefficients as the penalty term, encouraging smaller, yet non-zero, coefficients. Lastly, elastic net regularization combines the L1 and L2 penalties.
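Written out, with "original loss" standing in for whatever unregularized objective the model minimizes (mean squared error, for example), the three penalties look roughly like this (the mixing ratio r for elastic net is just one common way of writing the combination):
L1 (lasso): loss = original loss + λ Σ |βᵢ|
L2 (ridge): loss = original loss + λ Σ βᵢ²
Elastic net: loss = original loss + λ ( r Σ |βᵢ| + (1 − r) Σ βᵢ² )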
Understanding L1 Regularization
L1 regularization, also known as lasso (Least Absolute Shrinkage and Selection Operator) regression, is a regularization technique that reduces overfitting in models by penalizing the absolute size of the regression coefficients.
In L1 regularization, the penalty is proportional to the absolute values of the coefficients, which can shrink some coefficients all the way to zero, effectively removing irrelevant features from the dataset. This encourages sparsity in the model and makes L1 regularization very useful for feature selection in models with many variables.
The penalty term added by L1 regularization is the sum of the absolute values of the coefficients, multiplied by a regularization parameter (λ).

L1 regularized loss = original loss + λ Σ |βᵢ|
Here, βᵢ represents the coefficients of the model and λ is a non-negative hyperparameter that controls the strength of the penalty. The higher the value of λ, the stronger the penalty and the more coefficients can be driven to zero.
Since L1 regularization can reduce some coefficients to zero, it effectively removes some features entirely from the model. This can be particularly beneficial when dealing with high-dimensional data where some features may be irrelevant or redundant. This automatically also selects more important features, providing a form of built-in feature selection.
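As a quick illustration of this sparsity, here's a minimal sketch on synthetic data (the dataset shape, the alpha value, and the variable names are arbitrary choices for this example, separate from the walkthrough later in the article):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic data: 100 features, but only 10 of them actually carry signal
X, y = make_regression(n_samples=200, n_features=100, n_informative=10, noise=10.0, random_state=0)

# Fit a lasso model; a larger alpha means a stronger L1 penalty
lasso = Lasso(alpha=1.0)
lasso.fit(X, y)

# Many coefficients are driven exactly to zero, leaving mostly the informative features
print("Non-zero coefficients:", np.sum(lasso.coef_ != 0))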
A few practical examples of where L1 regularization is used include financial modeling, especially in building predictive models for stock prices or risk assessment, where L1 can help select the most predictive factors from a vast range of economic indicators and historical data. Another is image processing and computer vision, where feature selection is essential for reducing computational complexity and improving the performance of image classification models.
Overall, this makes L1 regularization a popular choice for models where simplicity and interpretability are essential, or when you're dealing with data that includes irrelevant features.
Understanding L2 Regularization
L2 regularization, also known as ridge regularization, is another common technique used in machine learning to prevent overfitting by penalizing the size of the coefficients. Unlike L1 regularization, which adds the absolute values of the coefficients to the loss function, L2 regularization adds the square of the coefficients. This difference in approach leads to different characteristics and effects on the model.
The L2 regularization term is the sum of the squares of the coefficients, multiplied by a regularization parameter (λ).

L2 regularized loss = original loss + λ Σ βᵢ²
The only difference from the L1 formula is that βᵢ is squared rather than taken as an absolute value.
Unlike L1 regularization, L2 does not typically produce sparse coefficients. Every feature keeps some contribution, even if it is small, meaning that no coefficients are set exactly to zero. This can be advantageous in scenarios where discarding features is not desirable. Furthermore, L2 regularization is particularly effective at handling multicollinearity, which occurs when independent variables are highly correlated: by penalizing the size of the coefficients, it ensures the model does not become overly sensitive to small changes in its inputs, maintaining stability and performance.
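To make that contrast concrete, here is a small sketch (synthetic data and arbitrary alpha values, purely for illustration) in which ridge spreads weight across a pair of nearly identical features while lasso tends to keep one and drop the other:

import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)

# Two highly correlated features plus one independent feature
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.05, size=200)  # nearly a copy of x1
x3 = rng.normal(size=200)
X = np.column_stack([x1, x2, x3])
y = 3 * x1 + 2 * x3 + rng.normal(scale=0.5, size=200)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

# Ridge tends to split the weight between the correlated pair (x1, x2);
# lasso often keeps one of them and pushes the other toward or to zero
print("Ridge coefficients:", ridge.coef_)
print("Lasso coefficients:", lasso.coef_)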
One practical application of L2 regularization is healthcare, where predictive models forecast patient outcomes from a wide range of clinical features. L2 regularization can be particularly beneficial in these models, as clinical datasets often contain many correlated variables, such as various blood test results. By penalizing the coefficients, L2 regularization reduces the model's sensitivity to multicollinearity, thereby stabilizing predictions and improving the model's generalizability to new, unseen patient data.
Another example is sports analytics, such as predicting the performance of athletes or the outcomes of games from historical data. Here, L2 regularization helps control overfitting when there are many correlated statistical features, such as player statistics, making models more robust and reliable across successive seasons, teams, and leagues.
Overall, L2 regularization is widely used across many types of regression and classification models, especially when the goal is to improve generalization by controlling model complexity without necessarily doing feature selection.
Comparing L1 and L2 Regularization

The only difference between L1 and L2 lies in the penalty term, but that difference has consequences in many respects. L1's penalty shrinks coefficients in a way that can make some of them exactly zero, producing sparse models with a reduced feature space; it effectively performs feature selection by removing non-informative or less important features. L2's penalty shrinks coefficients toward zero, but they typically remain non-zero, resulting in dense models where every feature contributes, however small its effect, which helps with multicollinearity and model stability.
Starting with multicollinearity, L1 is less effective because it may arbitrarily select one feature over another when they are highly correlated. L2, by contrast, handles multicollinearity well by distributing coefficient values among the correlated features.
Next come model complexity and interpretability, where L1 has the edge: it reduces model complexity by eliminating some features entirely, which improves interpretability because fewer features are involved and the influence of each one is easier to understand. L2 is less interpretable in comparison, since all features remain in the model with reduced influence, making it harder to distinguish their individual effects.
Finally, there is optimization. L1 regularization is more challenging here because its penalty term involves the absolute value of the coefficients, which is not differentiable at zero; this often requires specific optimization algorithms, such as coordinate descent or subgradient methods, that can handle the non-smooth loss. By contrast, the squared terms in L2 regularization are continuously differentiable, so standard gradient-based algorithms remain computationally efficient.
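As a rough sketch of that difference, consider a toy update applied to the penalty on a single coefficient (the learning rate, λ, and starting value below are arbitrary assumptions; the L1 step uses the common soft-thresholding trick rather than a plain gradient step):

# Toy single-coefficient updates for the penalty term only, purely illustrative
lr, lam, w = 0.1, 0.5, 2.0

# L2: the penalty gradient 2 * lam * w is smooth everywhere, so a standard
# gradient step simply shrinks w proportionally (often called "weight decay")
w_l2 = w - lr * (2 * lam * w)

# L1: |w| is not differentiable at w = 0, so one common alternative is a
# proximal / soft-thresholding step, which can set w exactly to zero
def soft_threshold(w, threshold):
    if abs(w) <= threshold:
        return 0.0
    return w - threshold if w > 0 else w + threshold

w_l1 = soft_threshold(w, lr * lam)

print("After one L2 step:", w_l2)  # shrunk proportionally, still non-zero
print("After one L1 step:", w_l1)  # shrunk by a fixed amount; small weights snap to zero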
The decision to use L1 or L2 regularization depends on the specific nature of the problem and the attributes of the data at hand. Consider, for example, wildlife conservation, where researchers model the impact of various environmental factors on the population of a rare species using a small set of field observations. They want a model that isolates key factors from a vast potential list, such as climate conditions, predation rates, and human activity levels. L1 regularization is suitable here because it can zero out less relevant factors, allowing conservationists to focus their efforts on the changes most likely to help save the species.
Another example: a film production company wants to predict the success of its next big movie from features like genre popularity, budget, star cast, and release timing. Given that all of these factors may have some influence and many of them are highly correlated, L2 regularization is the better fit. It can manage the multicollinearity, ensuring the prediction model stays stable and that all potential influences are considered, even when their individual effects are small.
Implementing L1 and L2 Regularization in Machine Learning Algorithms
With L1 and L2 regularization covered, let's walk through a code implementation of each using Weights & Biases and the Diabetes dataset. The code for L1 and L2 is nearly identical; the only difference is in step 3.
Step 1: Importing libraries
import numpy as np
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import train_test_split
import wandb
“numpy” is used for numerical operations. “pandas” for data manipulation. “sklearn.datasets” contains the Diabetes dataset. “sklearn.linear_model” contains the Lasso and Ridge classes for L1 and L2 regularization. “sklearn.model_selection” provides the “train_test_split” method to divide the data into training and testing sets. “wandb” (Weights & Biases) is a tool for tracking and visualizing machine learning experiments.
Step 2: Initialization and data preparation
Login to Weights & Biases
wandb.login()
Initialize Weights & Biases project
wandb.init(project='L1_Regularization_Diabetes')
Load the Diabetes dataset
data = load_diabetes()
X = data.data
y = data.target
Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Setting up regularization strength
alphas = np.logspace(-4, 0, 30)
“alphas” is an array of 30 values for the regularization strength α, generated on a logarithmic scale between 10^-4 and 10^0 (i.e., 0.0001 to 1), providing a wide range of values to see the effect of both weak and strong regularization.
Setup for logging
train_scores = []
test_scores = []
coefficients = []
Step 3: Training and logging (L1)
for alpha in alphas:
    lasso = Lasso(alpha=alpha)
    lasso.fit(X_train, y_train)
    train_scores.append(lasso.score(X_train, y_train))
    test_scores.append(lasso.score(X_test, y_test))
    coefficients.append(lasso.coef_)
    wandb.log({'alpha': alpha, 'Train Score': lasso.score(X_train, y_train), 'Test Score': lasso.score(X_test, y_test), 'Coefficients': lasso.coef_})
Step 3: Training and logging (L2)
for alpha in alphas:
    ridge = Ridge(alpha=alpha)
    ridge.fit(X_train, y_train)
    train_scores.append(ridge.score(X_train, y_train))
    test_scores.append(ridge.score(X_test, y_test))
    coefficients.append(ridge.coef_)
    wandb.log({'alpha': alpha, 'Train Score': ridge.score(X_train, y_train), 'Test Score': ridge.score(X_test, y_test), 'Coefficients': ridge.coef_})
This loop iterates over the array of α values. For each value, an instance of the model (Lasso or Ridge) is created and trained on the training dataset. The model's performance is then evaluated on both the training and testing sets, and the resulting scores are stored in the train_scores and test_scores lists, respectively. The model's coefficients are also stored after each iteration. All of these metrics, the training and testing scores along with the coefficients, are logged to Weights & Biases (W&B) on every iteration. This setup lets you track how changes in α affect the model's performance and its coefficients, providing insight into the impact of regularization strength.
Step 4: Logging charts to W&B (L1)
First, we'll create a table for train and test scores:
train_scores_table = wandb.Table(dataframe=pd.DataFrame({'alpha': alphas, 'Train Scores': train_scores}))
test_scores_table = wandb.Table(dataframe=pd.DataFrame({'alpha': alphas, 'Test Scores': test_scores}))
Code for coefficients table and chart plotting:
coefficients_df = pd.DataFrame(coefficients, columns=[f"Feature {i+1}" for i in range(len(coefficients[0]))])
coefficients_df['Alpha'] = alphas
coefficients_table = wandb.Table(dataframe=coefficients_df)
wandb.log({"Coefficients Table": coefficients_table})

for feature in coefficients_df.columns[:-1]:  # Exclude the 'Alpha' column
    coef_plot = wandb.plot.line(coefficients_table, 'Alpha', feature, title=f"Coefficients Evolution: {feature}")
    wandb.log({f"Plot - {feature}": coef_plot})
Next, logging charts for train and test scores:
wandb.log({"Train Scores Chart": wandb.plot.line(train_scores_table, 'alpha', 'Train Scores', title="L1 - Train Scores vs Alpha"),"Test Scores Chart": wandb.plot.line(test_scores_table, 'alpha', 'Test Scores', title="L1 - Test Scores vs Alpha")})
And then we finish the W&B run:
wandb.finish()
After completing the iterations, the training scores, testing scores, and coefficients for each feature collected during the training loop are stored in a pandas DataFrame. This DataFrame is then converted into a wandb.Table, the format required for data visualization in Weights & Biases. Line charts for the training scores, the testing scores, and the evolution of each feature's coefficient across different values of α are then created and logged to W&B using wandb.plot.line().
Here are the logged results for L1 regularization.







Here are the logged results for L2 regularization.







Conclusion
In this article, we've explored L1 and L2 regularization, essential techniques in machine learning for preventing overfitting and enhancing model generalization. L1 regularization, or Lasso, promotes sparsity in model coefficients, making it ideal for feature selection by effectively zeroing out less important features. On the other hand, L2 regularization, or Ridge, reduces all coefficients towards zero but keeps them non-zero, which is beneficial in cases with high feature correlation, thereby stabilizing model predictions.
Choosing between L1 and L2 depends greatly on data characteristics and specific model needs. L1 is favored for models requiring fewer features for clarity or performance, while L2 is suitable when all features need consideration but with moderated influence.
I strongly encourage experimenting with these techniques. I recommend applying both methods to various datasets to discover the most effective approach for your unique challenges, helping you develop robust, accurate models.