Handling Categorical Features - With Examples
In this report, you will learn what a categorical variable is, along with three approaches for handling this type of data.
This article is part of a series clarifying some of Kaggle's terms, definitions, and competitions, as well as adding visualizations.
Introduction
A categorical variable takes only a limited number of values.
Consider a survey that asks how often you eat breakfast and provides four options: "Never", "Rarely", "Most days", or "Every day". In this case, the data is categorical, because responses fall into a fixed set of categories.
If people responded to a survey about which brand of car they owned, the responses would fall into categories like "Honda", "Toyota", and "Ford". In this case, the data is also categorical.
You will get an error if you try to plug these variables into most machine learning models in Python without preprocessing them first. In this tutorial, we'll compare three approaches that you can use to prepare your categorical data.
Three Approaches
- Drop Categorical Variables: The easiest approach to dealing with categorical variables is to simply remove them from the dataset. This approach will only work well if the dropped columns do not contain useful information.
- Label Encoding: Label encoding assigns each unique value to a different integer.
This approach assumes an ordering of the categories: "Never" (0) < "Rarely" (1) < "Most days" (2) < "Every day" (3).
That assumption makes sense in this example, because there is an indisputable ranking to the categories. Not all categorical variables have a clear ordering, but we refer to those that do as ordinal variables. For tree-based models (like decision trees and random forests), you can expect label encoding to work well with ordinal variables. (A short code sketch at the end of this section shows both encodings on toy data.)
- One-Hot Encoding: One-hot encoding creates new columns indicating the presence (or absence) of each possible value in the original data. To understand this, we'll work through an example.
In the original dataset, "Color" is a categorical variable with three categories: "Red", "Yellow", and "Green". The corresponding one-hot encoding contains one column for each possible value, and one row for each row in the original dataset. Wherever the original value was "Red", we put a 1 in the "Red" column; if the original value was "Yellow", we put a 1 in the "Yellow" column, and so on.
In contrast to label encoding, one-hot encoding does not assume an ordering of the categories. Thus, you can expect this approach to work particularly well if there is no clear ordering in the categorical data (e.g., "Red" is neither more nor less than "Yellow"). We refer to categorical variables without an intrinsic ranking as nominal variables.
One-hot encoding generally does not perform well if the categorical variable takes on a large number of values (i.e., you generally won't use it for variables taking more than 15 different values).
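Before moving on to the full example, here is a small, self-contained sketch that shows both encodings on a toy "Color" column like the one described above. The data is made up purely for illustration:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data for illustration only
df = pd.DataFrame({'Color': ['Red', 'Yellow', 'Green', 'Yellow']})

# Label encoding: each category becomes an integer, which implies an ordering
print(LabelEncoder().fit_transform(df['Color']))  # [1 2 0 2], integers follow sorted category names

# One-hot encoding: one indicator column per category, no ordering implied
# (pd.get_dummies is a quick pandas shortcut; the report uses OneHotEncoder below)
print(pd.get_dummies(df['Color']))
```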
Example
We will work with the Melbourne Housing dataset.
We won't focus on the data loading step. Instead, you can imagine you are at a point where you already have the training and validation data in X_train, X_valid, y_train, and y_valid.
Next, we obtain a list of all of the categorical variables in the training data.
We do this by checking the data type (or dtype) of each column. The object dtype indicates a column has text (there are other things it could theoretically be, but that's unimportant for our purposes). For this dataset, the columns with text indicate categorical variables.
# Get list of categorical variables
s = (X_train.dtypes == 'object')
object_cols = list(s[s].index)

print("Categorical variables:")
print(object_cols)

Categorical variables:
['Type', 'Method', 'Regionname']
Define Function to Measure Quality of Each Approach
We define a function score_dataset() to compare the three different approaches to dealing with categorical variables. This function reports the mean absolute error (MAE) from a random forest model. In general, we want the MAE to be as low as possible!
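The full, instrumented version of this function appears in the exercise later in this report. As a minimal sketch of what score_dataset() does (assuming the 100-tree random forest used throughout):

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

def score_dataset(X_train, X_valid, y_train, y_valid):
    # Fit a random forest on the training data and report MAE on the validation data
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)
```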
Score from Approach 1 (Drop Categorical Variables)
We drop the object columns with the select_dtypes() method.
drop_X_train = X_train.select_dtypes(exclude=['object'])
drop_X_valid = X_valid.select_dtypes(exclude=['object'])

print("MAE from Approach 1 (Drop categorical variables):")
print(score_dataset(drop_X_train, drop_X_valid, y_train, y_valid))

MAE from Approach 1 (Drop categorical variables):
175703.48185157913
Score from Approach 2 (Label Encoding)
Scikit-learn has a LabelEncoder class that can be used to get label encodings. We loop over the categorical variables and apply the label encoder separately to each column.
from sklearn.preprocessing import LabelEncoder

# Make copy to avoid changing original data
label_X_train = X_train.copy()
label_X_valid = X_valid.copy()

# Apply label encoder to each column with categorical data
label_encoder = LabelEncoder()
for col in object_cols:
    label_X_train[col] = label_encoder.fit_transform(X_train[col])
    label_X_valid[col] = label_encoder.transform(X_valid[col])

print("MAE from Approach 2 (Label Encoding):")
print(score_dataset(label_X_train, label_X_valid, y_train, y_valid))

MAE from Approach 2 (Label Encoding):
165936.40548390493
In the code cell above, for each column, each unique value is assigned to a different integer in an essentially arbitrary way (scikit-learn's LabelEncoder assigns integers in the sorted order of the values). This is a common approach that is simpler than providing custom labels; however, we can expect an additional boost in performance if we provide better-informed labels for all ordinal variables, as the sketch below illustrates.
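For instance, if we had an ordinal column like the breakfast survey from the introduction, we could supply the ordering by hand instead of relying on LabelEncoder's sorted order. The column name and data below are made up for illustration and are not part of the Melbourne Housing dataset:

```python
import pandas as pd

# Hypothetical ordinal column, for illustration only
df = pd.DataFrame({'Breakfast': ['Never', 'Rarely', 'Most days', 'Every day', 'Rarely']})

# Explicit, domain-informed ordering supplied by hand
breakfast_order = {'Never': 0, 'Rarely': 1, 'Most days': 2, 'Every day': 3}
df['Breakfast_encoded'] = df['Breakfast'].map(breakfast_order)

print(df)
```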
Score from Approach 3 (One-Hot Encoding)
We use the OneHotEncoder class from scikit-learn to get one-hot encodings. There are a number of parameters that can be used to customize its behavior.
- We set handle_unknown='ignore' to avoid errors when the validation data contains classes that aren't represented in the training data, and
- we set sparse=False to ensure that the encoded columns are returned as a numpy array (instead of a sparse matrix).
To use the encoder, we supply only the categorical columns that we want to be one-hot encoded. For instance, to encode the training data, we supply X_train[object_cols]. (object_cols in the code cell below is a list of the column names with categorical data, and so X_train[object_cols] contains all of the categorical data in the training set.)
from sklearn.preprocessing import OneHotEncoder

# Apply one-hot encoder to each column with categorical data
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(X_train[object_cols]))
OH_cols_valid = pd.DataFrame(OH_encoder.transform(X_valid[object_cols]))

# One-hot encoding removed index; put it back
OH_cols_train.index = X_train.index
OH_cols_valid.index = X_valid.index

# Remove categorical columns (will replace with one-hot encoding)
num_X_train = X_train.drop(object_cols, axis=1)
num_X_valid = X_valid.drop(object_cols, axis=1)

# Add one-hot encoded columns to numerical features
OH_X_train = pd.concat([num_X_train, OH_cols_train], axis=1)
OH_X_valid = pd.concat([num_X_valid, OH_cols_valid], axis=1)

print("MAE from Approach 3 (One-Hot Encoding):")
print(score_dataset(OH_X_train, OH_X_valid, y_train, y_valid))

MAE from Approach 3 (One-Hot Encoding):
166089.4893009678
Which approach is best?
In this case, dropping the categorical columns (Approach 1) performed worst, since it had the highest MAE score. As for the other two approaches, since the returned MAE scores are so close in value, there doesn't appear to be any meaningful benefit to one over the other.
In general, one-hot encoding (Approach 3) will typically perform best, and dropping the categorical columns (Approach 1) typically performs worst, but it varies on a case-by-case basis.
Let us now put our skills to the test by working on a real-world dataset.
Setup
In this exercise, you will work with data from the Housing Prices Competition for Kaggle Learn Users.

The next code cell loads the training and validation sets in X_train, X_valid, y_train, and y_valid. The test set is loaded in X_test.
import pandas as pd
from sklearn.model_selection import train_test_split

# Read the data
X = pd.read_csv('../input/train.csv', index_col='Id')
X_test = pd.read_csv('../input/test.csv', index_col='Id')

# Remove rows with missing target, separate target from predictors
X.dropna(axis=0, subset=['SalePrice'], inplace=True)
y = X.SalePrice
X.drop(['SalePrice'], axis=1, inplace=True)

# To keep things simple, we'll drop columns with missing values
cols_with_missing = [col for col in X.columns if X[col].isnull().any()]
X.drop(cols_with_missing, axis=1, inplace=True)
X_test.drop(cols_with_missing, axis=1, inplace=True)

# Break off validation set from training data
X_train, X_valid, y_train, y_valid = train_test_split(X, y,
                                                      train_size=0.8, test_size=0.2,
                                                      random_state=0)
To compare different models, you'll use the same score_dataset() function from the tutorial. This function reports the mean absolute error (MAE) from a random forest model. We'll use W&B's sklearn integration to log all the metrics directly to the dashboard with a single line of code.
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
import wandb

# Function for comparing different approaches
def score_dataset(X_train, X_valid, y_train, y_valid, name):
    wandb.init(project="Kaggle-Categorical", name=name)
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    wandb.sklearn.plot_regressor(model, X_train, X_valid, y_train, y_valid, model_name='RF')
    abs_error = mean_absolute_error(y_valid, preds)
    wandb.log({"Mean abs error": abs_error})
    return abs_error
Step 1: Drop columns with categorical data
You'll get started with the most straightforward approach. Use the code cell below to preprocess the data in X_train and X_valid to remove columns with categorical data. Set the preprocessed DataFrames to drop_X_train and drop_X_valid, respectively.
# Get list of categorical variables
s = (X_train.dtypes == 'object')
object_cols = list(s[s].index)

print("Categorical variables:")
print(object_cols)

Categorical variables:
['MSZoning', 'Street', 'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'ExterQual', 'ExterCond', 'Foundation', 'Heating', 'HeatingQC', 'CentralAir', 'KitchenQual', 'Functional', 'PavedDrive', 'SaleType', 'SaleCondition']

drop_X_train = X_train.select_dtypes(exclude=['object'])
drop_X_valid = X_valid.select_dtypes(exclude=['object'])

print("MAE from Approach 1 (Drop categorical variables):")
print(score_dataset(drop_X_train, drop_X_valid, y_train, y_valid, 'Drop'))
Step 2: Label encoding
Before jumping into label encoding, we'll investigate the dataset. Specifically, we'll look at the 'Condition2' column. The code cell below prints the unique entries in both the training and validation sets.
print("Unique values in 'Condition2' column in training data:", X_train['Condition2'].unique())print("\nUnique values in 'Condition2' column in validation data:", X_valid['Condition2'].unique())Unique values in 'Condition2' column in training data: ['Norm' 'PosA' 'Feedr' 'PosN' 'Artery' 'RRAe']Unique values in 'Condition2' column in validation data: ['Norm' 'RRAn' 'RRNn' 'Artery' 'Feedr' 'PosN']
Fitting a label encoder to a column in the training data creates a corresponding integer-valued label for each unique value that appears in the training data. In the case that the validation data contains values that don't also appear in the training data, the encoder will throw an error, because these values won't have an integer assigned to them. Notice that the 'Condition2' column in the validation data contains the values 'RRAn' and 'RRNn', but these don't appear in the training data -- thus, if we try to use a label encoder with scikit-learn, the code will throw an error.
This is a common problem that you'll encounter with real-world data, and there are many approaches to fixing this issue. For instance, you can write a custom label encoder to deal with new categories. The simplest approach, however, is to drop the problematic categorical columns.
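If you did want to take the custom-encoder route, a rough sketch might look like the following. The helpers fit_label_map() and transform_with_unknowns() are hypothetical, not part of scikit-learn; unseen categories are mapped to a reserved integer instead of raising an error:

```python
def fit_label_map(train_series):
    # Build a value -> integer map from the training data only
    return {val: i for i, val in enumerate(sorted(train_series.unique()))}

def transform_with_unknowns(series, label_map, unknown_value=-1):
    # Categories never seen during fitting get a reserved integer instead of an error
    return series.map(label_map).fillna(unknown_value).astype(int)

# Example usage on the 'Condition2' column
cond2_map = fit_label_map(X_train['Condition2'])
encoded_valid = transform_with_unknowns(X_valid['Condition2'], cond2_map)
```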
The code cell below saves the problematic columns to a Python list bad_label_cols. Likewise, columns that can be safely label encoded are stored in good_label_cols.
# All categorical columns
object_cols = [col for col in X_train.columns if X_train[col].dtype == "object"]

# Columns that can be safely label encoded
good_label_cols = [col for col in object_cols if
                   set(X_train[col]) == set(X_valid[col])]

# Problematic columns that will be dropped from the dataset
bad_label_cols = list(set(object_cols) - set(good_label_cols))

print('Categorical columns that will be label encoded:', good_label_cols)
print('\nCategorical columns that will be dropped from the dataset:', bad_label_cols)

Categorical columns that will be label encoded: ['MSZoning', 'Street', 'LotShape', 'LandContour', 'LotConfig', 'BldgType', 'HouseStyle', 'ExterQual', 'CentralAir', 'KitchenQual', 'PavedDrive', 'SaleCondition']

Categorical columns that will be dropped from the dataset: ['HeatingQC', 'Foundation', 'Utilities', 'RoofStyle', 'SaleType', 'Condition2', 'RoofMatl', 'ExterCond', 'Exterior2nd', 'Condition1', 'Neighborhood', 'Heating', 'LandSlope', 'Functional', 'Exterior1st']
Use the next code cell to label encode the data in X_train and X_valid. Set the preprocessed DataFrames to label_X_train and label_X_valid, respectively.
from sklearn.preprocessing import LabelEncoder

# Drop categorical columns that will not be encoded
label_X_train = X_train.drop(bad_label_cols, axis=1)
label_X_valid = X_valid.drop(bad_label_cols, axis=1)

# Apply label encoder to each column with categorical data
label_encoder = LabelEncoder()
for col in good_label_cols:
    label_X_train[col] = label_encoder.fit_transform(X_train[col])
    label_X_valid[col] = label_encoder.transform(X_valid[col])

print("MAE from Approach 2 (Label Encoding):")
print(score_dataset(label_X_train, label_X_valid, y_train, y_valid, 'Label'))
Step 3: Investigating cardinality
So far, you've tried two different approaches to dealing with categorical variables. And, you've seen that encoding categorical data yields better results than removing columns from the dataset.
Soon, you'll try one-hot encoding. Before then, there's one additional topic we need to cover. Begin by running the next code cell without changes.
# Get number of unique entries in each column with categorical data
object_nunique = list(map(lambda col: X_train[col].nunique(), object_cols))
d = dict(zip(object_cols, object_nunique))

# Print number of unique entries by column, in ascending order
sorted(d.items(), key=lambda x: x[1])

[('Street', 2), ('Utilities', 2), ('CentralAir', 2), ('LandSlope', 3), ('PavedDrive', 3), ('LotShape', 4), ('LandContour', 4), ('ExterQual', 4), ('KitchenQual', 4), ('MSZoning', 5), ('LotConfig', 5), ('BldgType', 5), ('ExterCond', 5), ('HeatingQC', 5), ('Condition2', 6), ('RoofStyle', 6), ('Foundation', 6), ('Heating', 6), ('Functional', 6), ('SaleCondition', 6), ('RoofMatl', 7), ('HouseStyle', 8), ('Condition1', 9), ('SaleType', 9), ('Exterior1st', 15), ('Exterior2nd', 16), ('Neighborhood', 25)]
Step 4: One-hot encoding
In this step, you'll experiment with one-hot encoding. But, instead of encoding all of the categorical variables in the dataset, you'll only create a one-hot encoding for columns with cardinality less than 10.
Run the code cell below without changes to set low_cardinality_cols to a Python list containing the columns that will be one-hot encoded. Likewise, high_cardinality_cols contains a list of categorical columns that will be dropped from the dataset.
# Columns that will be one-hot encoded
low_cardinality_cols = [col for col in object_cols if X_train[col].nunique() < 10]

# Columns that will be dropped from the dataset
high_cardinality_cols = list(set(object_cols) - set(low_cardinality_cols))

print('Categorical columns that will be one-hot encoded:', low_cardinality_cols)
print('\nCategorical columns that will be dropped from the dataset:', high_cardinality_cols)

Categorical columns that will be one-hot encoded: ['MSZoning', 'Street', 'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'ExterQual', 'ExterCond', 'Foundation', 'Heating', 'HeatingQC', 'CentralAir', 'KitchenQual', 'Functional', 'PavedDrive', 'SaleType', 'SaleCondition']

Categorical columns that will be dropped from the dataset: ['Neighborhood', 'Exterior1st', 'Exterior2nd']
One-Hot Encoding
from sklearn.preprocessing import OneHotEncoder

# Apply one-hot encoder to the low-cardinality columns with categorical data
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(X_train[low_cardinality_cols]))
OH_cols_valid = pd.DataFrame(OH_encoder.transform(X_valid[low_cardinality_cols]))

# One-hot encoding removed index; put it back
OH_cols_train.index = X_train.index
OH_cols_valid.index = X_valid.index

# Remove all categorical columns (the low-cardinality ones are replaced by the one-hot encoding)
num_X_train = X_train.drop(object_cols, axis=1)
num_X_valid = X_valid.drop(object_cols, axis=1)

# Add one-hot encoded columns to numerical features
OH_X_train = pd.concat([num_X_train, OH_cols_train], axis=1)
OH_X_valid = pd.concat([num_X_valid, OH_cols_valid], axis=1)

print("MAE from Approach 3 (One-Hot Encoding):")
print(score_dataset(OH_X_train, OH_X_valid, y_train, y_valid, 'one-hot'))
Metric Visualizations
Other Visualizations
Other useful visualizations, such as an outlier detector and system usage metrics, are also logged automatically by `wandb`. For example, you can track your program's GPU utilization, which is handy because Kaggle limits the number of GPU hours available per month.
Keep going
With missing value handling and categorical encoding, your modeling process is getting complex. This complexity gets worse when you want to save your model to use in the future. The key to managing this complexity is something called pipelines.
Learn to use pipelines to preprocess datasets with categorical variables, missing values and any other messiness your data throws at you.
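As a preview, a minimal sketch of such a pipeline with scikit-learn is shown below. It imputes missing values and one-hot encodes the low-cardinality categorical columns in a single object; the exact choices here (median and most-frequent imputation, the same random forest) are assumptions for illustration rather than the setup the follow-up tutorial necessarily uses:

```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Numerical columns (everything that isn't text in this dataset)
numerical_cols = [col for col in X_train.columns if X_train[col].dtype in ['int64', 'float64']]

# Preprocess numerical columns: fill missing values with the median
numerical_transformer = SimpleImputer(strategy='median')

# Preprocess categorical columns: fill missing values, then one-hot encode
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Bundle the two preprocessing steps
preprocessor = ColumnTransformer(transformers=[
    ('num', numerical_transformer, numerical_cols),
    ('cat', categorical_transformer, low_cardinality_cols)
])

# Preprocessing and modeling live in one object that can be fit, saved, and reused
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', RandomForestRegressor(n_estimators=100, random_state=0))
])

pipeline.fit(X_train, y_train)
preds = pipeline.predict(X_valid)
```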
Handling Missing Values In A Pandas Dataframe
In this tutorial, you will learn three approaches to dealing with missing values in a pandas dataframe.
Using K-Fold Cross-Validation To Improve Your Machine Learning Models
In this article, we will learn how to use k-fold cross-validation for better measures of machine learning model performance, using W&B to track our results.
Gradient Boosting With XGBoost
In this report, you will learn how to build and optimize models with gradient boosting. This method dominates many Kaggle competitions and achieves state-of-the-art results on a variety of datasets.