How One-Hot Encoding Improves Machine Learning Performance

A brief discussion of one-hot encoding, where best to use it, and why it works. Made by Ayush Thakur using Weights & Biases

Introduction to One-Hot Encoding & Categorical Data

One-hot encoding is a data preparation practice that makes certain kinds of data easier to work with or actually readable by an algorithm. Specifically, one-hot encoding is often used on categorical data.
So what's categorical data? Simple: it's data that takes label values rather than numerical ones. Some examples are a "nationality" variable with values such as "UK", "French", and "US", or a "color" variable with values such as "red", "green", and "blue".
Here are a couple of solid resources if you'd like to read a bit more about categorical data:
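To make this concrete, here's a minimal sketch (using scikit-learn; the "nationality" column and its values are just for illustration) of what one-hot encoding does to a single categorical feature: each label value becomes its own binary column.

import numpy as np
from sklearn.preprocessing import OneHotEncoder

# A made-up categorical column with three distinct label values
nationality = np.array([["UK"], ["French"], ["US"], ["UK"]])

encoder = OneHotEncoder()
one_hot = encoder.fit_transform(nationality).toarray()  # densify the sparse output

print(encoder.categories_)  # [array(['French', 'UK', 'US'], ...)] -- categories sorted alphabetically
print(one_hot)
# [[0. 1. 0.]
#  [1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]]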

Answer

One-hot encoding allows the representation of categorical data to be more expressive.
Many learning algorithms either learn a single weight per feature, or they use distance metrics between samples. The former is the case for linear models such as logistic regression, which are easy to explain.
Suppose you have a dataset with a single categorical feature "nationality", with values "UK", "French" and "US". Assume, without loss of generality, that these are encoded as 0, 1, and 2. You then have a single weight w for this feature in a linear classifier, which will make some kind of decision by comparing w * x against a threshold b, say predicting one class when w * x < b and the other when w * x ≥ b.
Fig 1: Different encodings of the categorical feature
The problem now is that the single weight w cannot encode a three-way choice. The three possible values of w * x are 0, w, and 2w. Either all three lead to the same decision (they're all < b or all ≥ b), or "UK" and "French" lead to the same decision, or "French" and "US" give the same decision. There's no possibility for the model to learn that "UK" and "US" should be given the same label, with "French" the odd one out.
By one-hot encoding, you effectively blow up the feature space to three features, each of which gets its own weight, so the decision function is now w[UK]*x[UK] + w[FR]*x[FR] + w[US]*x[US] < b, where all the x's are booleans. In this space, such a linear function can express any sum/disjunction of the possibilities (e.g. "UK or US", which might be a predictor for someone speaking English).
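To see this in action, here's a small sketch (not part of the experiments below) that fits scikit-learn's LogisticRegression on both encodings of a toy "speaks English" label, where "UK" and "US" share the positive class. The weakened regularization (C=10.0) is only there so the penalty doesn't dominate on three samples.

import numpy as np
from sklearn.linear_model import LogisticRegression

X_int = np.array([[0], [1], [2]])        # integer encoding: UK=0, French=1, US=2
X_onehot = np.array([[1, 0, 0],
                     [0, 1, 0],
                     [0, 0, 1]])         # one-hot encoding of the same three values
y = np.array([1, 0, 1])                  # UK and US speak English, French is the odd one out

clf_int = LogisticRegression(C=10.0).fit(X_int, y)
clf_oh = LogisticRegression(C=10.0).fit(X_onehot, y)

print(clf_int.score(X_int, y))   # ~0.67: a single weight cannot single out the middle value
print(clf_oh.score(X_onehot, y)) # 1.0: one weight per value can express "UK or US"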
Similarly, any learner based on standard distance metrics (such as k-nearest neighbors) between samples will get confused without one-hot encoding. With the naive encoding and Euclidean distance, the distance between French and US is 1. The distance between the US and UK is 2. But with the one-hot encoding, the pairwise distances between [1, 0, 0], [0, 1, 0] and [0, 0, 1] are all equal to √2.
Fig 2: Distances under integer-based encoding vs. one-hot encoding
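Here's a quick numerical check of the distances in Fig 2 (a sketch, not part of the experiments below):

import numpy as np

# Integer encoding: UK=0, French=1, US=2
print(np.linalg.norm(np.array([1]) - np.array([2])))  # French vs US: 1.0
print(np.linalg.norm(np.array([0]) - np.array([2])))  # UK vs US: 2.0

# One-hot encoding: every pair of values is equally far apart
uk, fr, us = np.eye(3)
print(np.linalg.norm(uk - fr), np.linalg.norm(fr - us), np.linalg.norm(uk - us))  # all sqrt(2) ≈ 1.414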
However, this is not true for all learning algorithms; decision trees and derived models such as random forests, if deep enough, can handle categorical variables without one-hot encoding.
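For example, a depth-2 tree can isolate the middle integer value with two thresholds, something a single linear weight cannot do. Here's a minimal sketch of that:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

X_int = np.array([[0], [1], [2]])   # UK, French, US as integers
y = np.array([1, 0, 1])             # UK and US share a label, French is the odd one out

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_int, y)
print(tree.score(X_int, y))         # 1.0: splits like x <= 0.5 and x <= 1.5 carve out the middle value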

Let's see this through an experiment.

Try the experiments on Google Colab →

For our experiments, we will train a simple neural network built with Keras, and we will use a breast cancer dataset (you can download the data here).
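Here's a sketch of the data preparation assumed in what follows; the file name breast-cancer.csv, the all-categorical columns, and the label being in the last column are assumptions about the downloaded file. We build an integer-encoded and a one-hot-encoded version of the same inputs, plus a label-encoded target:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, LabelEncoder

# Assumed layout: all columns categorical, label in the last column
df = pd.read_csv('breast-cancer.csv', header=None).astype(str)
X, y = df.iloc[:, :-1], df.iloc[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

# Integer (ordinal) encoding of the inputs
ord_enc = OrdinalEncoder()
X_train_ord = ord_enc.fit_transform(X_train)
X_test_ord = ord_enc.transform(X_test)  # assumes no unseen categories in the test split

# One-hot encoding of the inputs
oh_enc = OneHotEncoder(handle_unknown='ignore')
X_train_enc = oh_enc.fit_transform(X_train).toarray()
X_test_enc = oh_enc.transform(X_test).toarray()

# Label-encode the binary target
le = LabelEncoder()
y_train_enc = le.fit_transform(y_train)
y_test_enc = le.transform(y_test)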
Before we jump to the observations, let's quickly take a look at the model:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# define the model
def get_model():
    model = Sequential()
    # single hidden layer with 10 ReLU units, sized to the encoded input width
    model.add(Dense(10, input_dim=X_train_enc.shape[1], activation='relu', kernel_initializer='he_normal'))
    model.add(Dense(1, activation='sigmoid'))
    # compile the keras model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['acc'])
    return model
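And here's a hedged sketch of how the model might be trained and evaluated on the one-hot-encoded inputs; the epoch count and batch size are illustrative, not the exact values behind the runs logged above. The integer-encoded run is identical apart from swapping in X_train_ord (and its width for input_dim):

# train on the one-hot-encoded inputs and evaluate on the held-out split
model = get_model()
model.fit(X_train_enc, y_train_enc, epochs=100, batch_size=16, verbose=0)
_, acc = model.evaluate(X_test_enc, y_test_enc, verbose=0)
print('Test accuracy with one-hot encoded inputs: %.3f' % acc)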

Observations

We thus perform an extra experiment with a regularized model. This is an obvious but necessary step to fairly compare the expressiveness of integer encoding with one-hot encoding. We add a dropout layer after the first layer, with a dropout rate of 0.5. This value was not cherry-picked.
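Here's a sketch of that regularized variant, reusing the same architecture as get_model() above with the dropout layer inserted after the first Dense layer:

from tensorflow.keras.layers import Dropout

# same model as above, with dropout after the first hidden layer
def get_regularized_model():
    model = Sequential()
    model.add(Dense(10, input_dim=X_train_enc.shape[1], activation='relu', kernel_initializer='he_normal'))
    model.add(Dropout(0.5))  # dropout rate of 0.5, as described above
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['acc'])
    return model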

Final word

This experiment cannot conclusively establish that one-hot encoding improves model performance. It is shown in this Kaggle kernel that one-hot encoding can lead to poor performance when a decision tree is used. What we can concretely establish is that one-hot encoding expands the expressiveness of the feature space. In our toy experiment, with a regularized model (as expected!), we gained an increase over the baseline.

Weights & Biases

Weights & Biases helps you keep track of your machine learning experiments. Use our tool to log hyperparameters and output metrics from your runs, then visualize and compare results and quickly share findings with your colleagues.
Get started in 5 minutes.