
Feature selection in machine learning

In this article, we will be talking about the importance and implementation of feature selection in machine learning.

Introduction

In many cases, datasets contain irrelevant features that increase model training time while reducing accuracy and generalization. Selecting only the relevant features addresses these problems (not to mention reducing overfitting and lowering training cost).
In this article, we will dive deeper into feature selection methods, how they work, and code snippets that will help you understand how to implement feature selection in machine learning.

Understanding Feature Selection

Put simply, feature selection is the process of reducing the number of input variables to your model by keeping only relevant data and getting rid of noise. Now, some may confuse feature selection and feature extraction, so let's be explicit about the difference.
Feature selection is the process of selecting a subset of relevant features for use in model construction. This involves identifying and using only those features in your data that contribute most to the prediction variable or output in which you are interested.
Feature extraction, on the other hand, involves creating new features by combining or transforming the original features in a way that preserves important information while reducing the dimensionality of the data.
The goal of feature selection in machine learning is to find the best set of features that allows you to build optimized models. While we'll cover in detail how to do this in a moment, it's worth drilling down on the difference in supervised and unsupervised learning.
In supervised learning, feature selection is guided by the performance of the model regarding a specific output variable. The goal is to select features that contribute the most to predicting the target variable.
Unsupervised learning involves data without labeled outcomes, which makes feature selection more challenging: there is no straightforward way to evaluate a feature's importance based on prediction accuracy, which is also why there are considerably fewer unsupervised techniques than supervised ones. Instead, the focus is often on reducing dimensionality or discovering the underlying structure in the data.
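To make the unsupervised case concrete, here is a minimal sketch using scikit-learn's VarianceThreshold, which drops low-variance features without ever looking at a target variable (the tiny data matrix and threshold below are purely illustrative):
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Tiny illustrative matrix: the first column is constant, so it carries no information
X = np.array([[0.0, 2.0, 0.1],
              [0.0, 1.9, 0.9],
              [0.0, 2.1, 0.5],
              [0.0, 2.0, 0.3]])

selector = VarianceThreshold(threshold=0.0)  # removes zero-variance features
X_reduced = selector.fit_transform(X)

print(X.shape, "->", X_reduced.shape)  # (4, 3) -> (4, 2)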

Implementing basic feature selection


Filter methods

Let’s start with filter methods, which are methods that apply a statistical measure to assign a score to each feature. The features are ranked by the score and either selected to be kept or removed from the dataset.
Unlike wrapper and embedded methods, filter methods are generally independent of the machine learning algorithms which means they are often used as a preprocessing step to reduce the dimensionality of the data before applying more complex feature selection techniques or training machine learning models.
Let us begin with the Chi-squared test, which is used only when both the input and output variables are categorical. This test is used to examine the independence of categorical features from the target variable. A high chi-squared value indicates that the hypothesis of independence is false, suggesting that the feature is important for prediction.
Now, let's see how the Chi-Squared test is implemented in the code below using the Breast Cancer Wisconsin (Diagnostic) dataset.

Step 1: Importing the necessary libraries and initializing W&B

import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
import wandb

wandb.login()
wandb.init(project='feature_selection')
  • numpy and pandas are used for data manipulation
  • The Breast Cancer dataset is loaded directly from sklearn.datasets
  • SelectKBest and chi2 come from sklearn.feature_selection; SelectKBest selects a fixed number of features based on the scores from a scoring function, here chi2 for the Chi-Squared test
  • wandb is the Weights & Biases library, used for experiment tracking

Step 2: Data preparation

Loading breast cancer dataset:
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

Step 3: Feature selection using chi-squared

chi2_selector = SelectKBest(chi2, k=2)
X_kbest = chi2_selector.fit_transform(X, y)
SelectKBest with the chi2 function is used to select the best features. Here, k=2 specifies that the two best features should be selected based on their chi-squared scores. fit_transform is applied to the features and target to perform the feature selection. This function fits the chi-squared test on the data and then transforms the data to keep only the best k features.
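If you also want to see which features were kept (not just the reduced array), a small optional addition using the selector's get_support() mask does the trick:
selected_features = X.columns[chi2_selector.get_support()]  # boolean mask of the kept columns
print("Selected features:", list(selected_features))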

Step 4: Printing and logging the results to W&B

Printing and logging the results:
feature_scores = {X.columns[i]: chi2_selector.scores_[i] for i in range(len(X.columns))}
wandb.log({
    'Original number of features': X.shape[1],
    'Reduced number of features': X_kbest.shape[1],
    'Feature Scores': feature_scores
})

print("Original number of features:", X.shape[1])
print("Reduced number of features:", X_kbest.shape[1])
print("Scores for each feature:", chi2_selector.scores_)
Finishing the W&B run:
wandb.finish()
Here, we print and log the original number of features alongside the number remaining after applying the chi-squared test. chi2_selector.scores_ gives the chi-squared score of each feature, which helps you understand its importance with respect to the target.
Here is the output logged on Weights & Biases

Next, we have the correlation coefficient test. This method measures the correlation between each feature and the target variable: features that are highly correlated with the target are selected, while those that are not are discarded. Pearson's correlation coefficient for continuous targets and point-biserial correlation for binary targets are examples of this.
Now, let's use the Pearson correlation coefficient to evaluate the linear relationship between each feature and the target variable in the California housing dataset.
Steps 1 and 2 (importing the libraries and preparing the data) are very similar to the chi-squared example above, so we will skip ahead to step 3.
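For reference, the skipped setup might look roughly like the sketch below. The project name and the target column name MedianHouseValue are assumptions inferred from the rest of the code.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_california_housing
import wandb

wandb.login()
wandb.init(project='feature_selection')  # project name assumed, matching the earlier example

# Load the California housing data; naming the target Series "MedianHouseValue"
# is an assumption based on the column referenced in the code below
data = fetch_california_housing()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target, name="MedianHouseValue")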

Step 3: Correlation calculation and heatmap plotting

Combine features and target for correlation analysis:
df = pd.concat([X, y], axis=1)
This structure allows easy computation of correlations between all features and the target.
Calculate and sort correlation coefficients:
correlation_matrix = df.corr()
target_correlation = correlation_matrix["MedianHouseValue"].sort_values()
The corr() method computes the Pearson correlation coefficients, which measure the linear relationship between variables. The values range from -1 to 1, where 1 means perfect positive correlation, -1 means perfect negative correlation, and 0 indicates no linear relationship.
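For reference, the Pearson coefficient between a feature x and the target y is:

$$ r_{xy} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}} $$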
Plotting the heatmap:
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, fmt=".2f", cmap='coolwarm')
plt.title('Correlation Matrix Heatmap')
plt.savefig("correlation_heatmap.png")
plt.close()

Step 4: Printing and logging the results to W&B

Logging the correlation results and the heatmap image:
wandb.log({
    'Feature Correlations with MedianHouseValue': target_correlation.drop('MedianHouseValue').to_dict(),
    'Correlation Heatmap': wandb.Image("correlation_heatmap.png")
})

print("Correlation of features with MedianHouseValue:", target_correlation)

Finish the W&B run:
wandb.finish()
Here are the logged results on Weights & Biases:

The heatmap plotted on Weights & Biases:

Wrapper methods

Following filter methods, we come to wrapper methods, which work by evaluating and selecting the subset of features that contributes most to a model's predictive power. The selection is based on a specific machine learning algorithm that acts as the evaluation model, so features are chosen according to their ability to improve the performance of that particular model.
Recursive Feature Elimination (RFE) is a feature selection method that falls under the wrapper method category. It systematically creates models and determines the best or worst performing feature, setting it aside, then repeats the process with the rest of the features. This method helps in identifying the subset of features that contributes most to the predictor variable or output in which you are interested.
Now, let's look at a simple example of implementing RFE with a logistic regression model, using the Iris flowers dataset, and skipping over importing libraries and data preparation.
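For reference, that skipped setup might look roughly like this (the project name is an assumption, matching the earlier examples):
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE
import wandb

wandb.login()
wandb.init(project='feature_selection')  # project name assumed

# Load the Iris dataset and keep the feature names for later logging
data = load_iris()
X, y = data.data, data.target
feature_names = data.feature_names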

Implement RFE with a logistic regression classifier

Create a logistic regression classifier:
model = LogisticRegression(max_iter=1000)
RFE with the logistic regression model:
selector = RFE(estimator=model, n_features_to_select=2, step=1)
selector = selector.fit(X, y)

Printing and logging the results to W&B

Output the selected features and their rankings and log them to W&B
wandb.log({'Feature rankings': dict(zip(feature_names, selector.ranking_))})

print("Num Features: %s" % (selector.n_features_,))
print("Selected Features: %s" % (list(np.array(feature_names)[selector.support_]),))
print("Feature Ranking: %s" % (dict(zip(feature_names, selector.ranking_)),))
Finish W&B session:
wandb.finish()
In this example, the RFE method is used with a logistic regression classifier on the Iris dataset to select the two most important features. The method iteratively trains the model, removes the least important feature, prints the final selection of the most relevant features, and logs it to Weights & Biases.
Here's the output logged on Weights & Biases:

Lastly, we have forward feature selection (FFS) and backward feature elimination (BFE). FFS starts with no features and adds them one at a time until no significant improvement in model performance is made, while BFE is the exact opposite: it starts with all features and removes them one at a time until the specified number of features is reached.
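Neither technique is shown in the examples above, but as a minimal sketch, scikit-learn's SequentialFeatureSelector implements both behaviors via its direction parameter (the estimator, dataset, and n_features_to_select below are illustrative):
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

data = load_iris()
X, y = data.data, data.target

model = LogisticRegression(max_iter=1000)
# direction='forward' adds one feature at a time; direction='backward' starts
# from all features and removes one at a time
sfs = SequentialFeatureSelector(model, n_features_to_select=2, direction='forward')
sfs.fit(X, y)

print("Selected features:", list(np.array(data.feature_names)[sfs.get_support()]))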

Advanced feature selection techniques

Embedded methods

First, we have embedded methods. Unlike filter and wrapper methods, where feature selection is performed as a separate step before or after the learning process, embedded methods perform feature selection during model training. This integration often makes them more efficient and can lead to better generalization in predictive modeling.
Let's begin with one of the most common embedded techniques: Lasso (Least Absolute Shrinkage and Selection Operator) regularization, also known as L1 regularization. Its objective is to obtain a subset of predictors that minimizes prediction error for a quantitative response variable while shrinking the coefficients of less important variables to exactly zero.
The Lasso modifies the least squares objective function by adding a penalty term which is the L1 norm (the sum of the absolute values) of the coefficients. This penalty has the effect of forcing some of the model coefficients to be exactly zero, which means the corresponding features do not contribute to the model and are effectively selected out.
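Concretely, with α denoting the regularization strength (the alpha parameter used in the code below), scikit-learn's Lasso minimizes:

$$ \min_{\beta}\; \frac{1}{2n}\,\lVert y - X\beta \rVert_2^2 + \alpha\,\lVert \beta \rVert_1 $$

where n is the number of samples; larger values of α push more coefficients to exactly zero.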
Here’s an example of Lasso Regularization using the California housing dataset with W&B integration.
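The setup (imports, data loading, the train/test split, and the W&B initialization) is not shown in the steps below; a rough sketch of what it might look like follows (the project name and split parameters are assumptions):
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import Lasso, Ridge  # Ridge is used in the next example
from sklearn.model_selection import train_test_split
import wandb

wandb.login()
wandb.init(project='feature_selection')  # project name assumed

data = fetch_california_housing()
X, y = data.data, data.target
# Split parameters are illustrative
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)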

Creating and evaluating the lasso regression model

Lasso regression model:
lasso = Lasso(alpha=0.1)
lasso.fit(X_train, y_train)
Evaluating the model:
train_score = lasso.score(X_train, y_train)
test_score = lasso.score(X_test, y_test)

Printing and logging the results to W&B

Log the results to W&B:
wandb.log({
    'alpha': 0.1,
    'Train Score': train_score,
    'Test Score': test_score,
    'Coefficients': lasso.coef_
})

print("Training Score:", train_score)
print("Testing Score:", test_score)
print("Coefficients:", lasso.coef_)
Finish the W&B session
wandb.finish()
Here are the results logged on Weights & Biases.

Next, we have ridge regression (L2 regularization). Unlike lasso, ridge regression includes an L2 penalty that does not set coefficients to zero but only reduces their size. It's often used for comparison, but it is not strictly a feature selection method unless combined with an L1 penalty.
Now, here's an example of ridge regression on the California housing dataset with W&B integration:

Creating and evaluating ridge regression model

Ridge regression model:
ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)
Evaluating the model:
train_score = ridge.score(X_train, y_train)
test_score = ridge.score(X_test, y_test)

Printing and logging the results to W&B

Log the results to W&B:
wandb.log({
    'alpha': 1.0,
    'Train Score': train_score,
    'Test Score': test_score,
    'Coefficients': ridge.coef_
})

print("Training Score:", train_score)
print("Testing Score:", test_score)
print("Coefficients:", ridge.coef_)
Finish the W&B session:
wandb.finish()
Here are the results logged on Weights & Biases.
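As noted above, ridge on its own does not zero out coefficients; if you want that behavior combined with an L2 penalty, scikit-learn's ElasticNet mixes both penalty types. Here is a minimal sketch, reusing X_train and y_train from the setup above (the alpha and l1_ratio values are illustrative):
from sklearn.linear_model import ElasticNet

# l1_ratio controls the mix: 1.0 is pure L1 (lasso), 0.0 is pure L2 (ridge)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5)
enet.fit(X_train, y_train)
print("Coefficients:", enet.coef_)  # the L1 component can still drive some coefficients to exactly zero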


Hybrid methods

Lastly, we have hybrid approaches, which combine the best elements of filter, wrapper, and embedded methods to capitalize on their strengths while mitigating their weaknesses. These methods are designed to improve the effectiveness and efficiency of feature selection by incorporating multiple strategies into a single approach.
Let's develop a hybrid feature selection approach using a combination of filter and embedded methods, and track and compare the results with Weights & Biases (W&B). In this example, we'll use the California housing dataset, take feature importances from a Random Forest model (the embedded part), and then select features with SelectFromModel (which acts as a filter on top of those embedded results).

Training a random forest model

Train a Random Forest model to get feature importance
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
feature_importances = rf.feature_importances_
Select features based on importance
selector = SelectFromModel(rf, threshold='mean')
selector.fit(X_train, y_train)
Use .get_support() to apply selection in DataFrame format
selected_features = X_train.columns[selector.get_support()]
Filter both training and testing set with the selected features
X_train_reduced = X_train[selected_features]
X_test_reduced = X_test[selected_features]
Train a final model on the selected features
model = LinearRegression()
model.fit(X_train_reduced, y_train)
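As an optional addition (not part of the original walkthrough), you could also score the reduced model on the held-out set to see how much predictive power the selected subset retains, assuming the usual train/test split with a y_test variable as in the earlier examples:
# Assumes y_test comes from the same (unshown) train/test split as X_test
reduced_score = model.score(X_test_reduced, y_test)  # R^2 of the linear model on the selected features
print("Test R^2 with selected features:", reduced_score)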
Preparing data for visualization
feature_data = [[name, float(importance)] for name, importance in zip(X.columns, feature_importances)]
feature_table = wandb.Table(data=feature_data, columns=["Feature", "Importance"])

Printing and logging the results to W&B

Log feature importances as a bar chart to W&B
wandb.log({"Feature Importances": wandb.plot.bar(feature_table, "Feature", "Importance", title="Feature Importances")})
Finish W&B session
wandb.finish()
Here are the results logged on Weights & Biases.



This hybrid approach, by integrating powerful random forest-based feature importances with a simple linear regression model, provides an efficient way to perform feature selection while ensuring model simplicity and interpretability. W&B's tracking capabilities enhance this process by allowing you to visualize and compare different experiments easily.

Conclusion

We have discussed four categories of feature selection methods (filter, wrapper, embedded, and hybrid) and a few techniques from each, covering as much detail as possible and giving an example Python implementation for each. Weights & Biases also played a crucial role in experiment tracking and visualization, and as the examples above show, its integration is simple to implement.
Looking ahead, feature selection in machine learning is set to become more automated, efficient, and ethically aware. Advances in quantum computing, federated learning, and deep learning integration could potentially enhance model performance while also addressing privacy and environmental concerns. As these technologies evolve, they will make AI systems not only more powerful but also more accessible and sustainable.
Iterate on AI agents and models faster. Try Weights & Biases today.