How One-Hot Encoding Improves Machine Learning Performance
A brief discussion of one-hot encoding, where best to use it, and why it works
Introduction to One-Hot Encoding & Categorical Data
One-hot encoding is a data preparation practice that makes certain kinds of data easier to work with, or readable by an algorithm at all. Specifically, one-hot encoding is most often applied to categorical data.
So what's categorical data? Simple: it's data that has label values rather than numerical ones. Some examples are:
- A color variable with "blue", "orange", and "yellow" as values. There is no natural ordering between these values, and one-hot encoding is best suited for variables like this.
- A rank variable with "first", "second", and "last" as values. This is an example of an ordinal variable, since there is a natural ordering. For ordinal variables, you can often get away without one-hot encoding by integer encoding them instead (mapping each unique label to an integer).
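To make the distinction concrete, here is a minimal sketch (using scikit-learn, with a made-up color column) of what the two encodings produce for the same data:

from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder

colors = [["blue"], ["orange"], ["yellow"], ["blue"]]

# integer encoding: each unique label is mapped to an integer
print(OrdinalEncoder().fit_transform(colors))
# [[0.] [1.] [2.] [0.]]

# one-hot encoding: each label becomes its own binary column
# (use sparse=False instead on scikit-learn < 1.2)
print(OneHotEncoder(sparse_output=False).fit_transform(colors))
# [[1. 0. 0.] [0. 1. 0.] [0. 0. 1.] [1. 0. 0.]]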
Why One-Hot Encoding Works
One-hot encoding allows the representation of categorical data to be more expressive.
Many learning algorithms either learn a single weight per feature, or they use distance metrics between samples. The former is the case for linear models such as logistic regression, which are easy to explain.
Suppose you have a dataset with only a single categorical feature, "nationality", taking the values "UK", "French", and "US". Assume, without loss of generality, that these are encoded as 0, 1, and 2. You then have a single weight W for this feature in a linear classifier, which will make some kind of decision based on a threshold: W * X > b for one class and W * X ≤ b for the other.

Fig 1: Different encodings of the categorical feature
The problem now is that the single weight W cannot encode a three-way choice. The three possible values of W * X are 0, W, and 2W, and they lie in order along a line. Either all three lead to the same decision (they all fall on the same side of the threshold b), or "UK" and "French" lead to the same decision, or "French" and "US" do. There is no possibility for the model to learn that "UK" and "US" should be given the same label, with "French" the odd one out.
By one-hot encoding, you effectively blow the feature space up to three features, each with its own weight, so the decision function becomes W[UK]*X[UK] + W[FR]*X[FR] + W[US]*X[US] > b, where all the X's are booleans. In this space, a linear function can express any disjunction of the possibilities (e.g. "UK or US", which might be a predictor for someone speaking English).
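As a quick sanity check (an illustrative sketch, not part of the original experiment), a logistic regression on the integer codes can never learn the "UK or US" rule, while the same model on one-hot features learns it exactly:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

# UK=0, French=1, US=2; target: 1 if the nationality speaks English
X_int = np.array([[0], [1], [2]] * 50)
y = np.array([1, 0, 1] * 50)

# a single weight on the integer code cannot isolate the middle value
print(LogisticRegression().fit(X_int, y).score(X_int, y))  # ~0.67

# one weight per one-hot column makes the "UK or US" rule learnable
X_hot = OneHotEncoder(sparse_output=False).fit_transform(X_int)
print(LogisticRegression().fit(X_hot, y).score(X_hot, y))  # 1.0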
Similarly, any learner based on standard distance metrics between samples, such as k-nearest neighbors, will get confused without one-hot encoding. With the naive integer encoding and Euclidean distance, the distance between "French" and "US" is 1, while the distance between "UK" and "US" is 2. But with one-hot encoding, the pairwise distances between [1, 0, 0], [0, 1, 0], and [0, 0, 1] are all equal to √2.

Fig 2: Distances under integer encoding vs. one-hot encoding
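The figure above summarizes these distances; here is a minimal NumPy sketch that reproduces the numbers:

import numpy as np

uk, french, us = 0.0, 1.0, 2.0          # integer encoding
print(abs(french - us), abs(uk - us))   # 1.0 2.0

onehot = np.eye(3)                      # rows: UK, French, US
for a, b in [(0, 1), (0, 2), (1, 2)]:
    print(np.linalg.norm(onehot[a] - onehot[b]))  # all 1.414... = √2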
However, this is not true of all learning algorithms; decision trees, and derived models such as random forests, can handle categorical variables without one-hot encoding if they are deep enough.
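For intuition (an illustrative sketch, not from the original experiment): with two splits, a tree can carve the integer axis into {0}, {1}, and {2}, labeling "UK" and "US" the same, which the single linear weight above could not do.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

X_int = np.array([[0], [1], [2]])  # UK, French, US as integer codes
y = np.array([1, 0, 1])            # "UK" and "US" labeled together

tree = DecisionTreeClassifier().fit(X_int, y)
print(tree.predict(X_int))         # [1 0 1] -- two splits isolate the middle code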
Let's see this through an experiment.
Try the experiments on Google Colab
For our experiments:
- We will first integer encode the features (with OrdinalEncoder). Our hypothesis is that one-hot encoding results in a better-performing machine learning model, so integer encoding serves as our baseline.
- We will then one-hot encode the features. If the accuracy on the test set is better than our baseline, we will have validated our hypothesis. A sketch of both encoding pipelines follows this list.
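A minimal sketch of the two pipelines, assuming the raw train/test splits are available as X_train and X_test (the encoders are fit on the training data only):

from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder

# experiment one: integer encoding
ordinal = OrdinalEncoder()
X_train_enc = ordinal.fit_transform(X_train)
X_test_enc = ordinal.transform(X_test)

# experiment two: one-hot encoding instead (returns a sparse matrix by default)
onehot = OneHotEncoder()
X_train_enc = onehot.fit_transform(X_train)
X_test_enc = onehot.transform(X_test)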
Before we jump to the observations, let's quickly take a look at the model:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# define the model
def get_model():
    model = Sequential()
    model.add(Dense(10, input_dim=X_train_enc.shape[1], activation='relu', kernel_initializer='he_normal'))
    model.add(Dense(1, activation='sigmoid'))
    # compile the keras model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['acc'])
    return model
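Each experiment then trains this model on its encoded features. A hedged sketch of the training and evaluation step (the epoch count and batch size here are assumptions, not taken from the original runs):

model = get_model()
# X_train_enc may need .toarray() if the encoder returned a sparse matrix
model.fit(X_train_enc, y_train, epochs=100, batch_size=16, verbose=0)

_, test_acc = model.evaluate(X_test_enc, y_test, verbose=0)
print('Test accuracy: %.2f%%' % (test_acc * 100))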
- For experiment one we will use OrdinalEncoder to encode our features. With 9 integer-encoded inputs, that means (9 × 10 + 10) + (10 × 1 + 1) = 111 parameters.
- For experiment two we will use OneHotEncoder to encode our features. That means we'll have 451 parameters. Why exactly? Because one-hot encoding expands the feature space: the 9 categorical features become 43 binary columns (the one-hot encoded X_train is actually a sparse matrix), giving (43 × 10 + 10) + (10 × 1 + 1) = 451 weights and biases. This is the reason one-hot encoding is more expressive. But more expressiveness can lead to overfitting, so the model may need to be regularized.
- Therefore, we'll perform one additional experiment with one-hot encoded features and a regularized model.
Observations
- We see that the model trained with integer encoding (OrdinalEncoder) led to 73.68% accuracy on the test data.
- Meanwhile, the model trained with one-hot encoding led to 66.31% test accuracy. We can see from the accuracy plot that the model trained with one-hot encoded features reaches over 88% training accuracy. This is clearly a case of overfitting.
We thus perform an extra experiment with a regularized model. This is an obvious and necessary step to fairly compare the expressiveness of integer encoding with one-hot encoding. We will use a dropout layer after the first layer with a dropout rate of 0.5. This value was not cherry-picked.
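A sketch of the regularized variant, differing from get_model() only by the dropout layer described above:

from tensorflow.keras.layers import Dropout

# regularized model: dropout after the first layer, rate 0.5
def get_regularized_model():
    model = Sequential()
    model.add(Dense(10, input_dim=X_train_enc.shape[1], activation='relu', kernel_initializer='he_normal'))
    model.add(Dropout(0.5))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['acc'])
    return model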
- The model thus trained has 74.73% test accuracy, a ~1% increase over the baseline.
Final word
This experiment cannot concretely establish that one-hot encoding improves model performance in general. It is shown in this Kaggle kernel that one-hot encoding leads to poor performance when a decision tree is used. However, we can concretely establish that one-hot encoding expands the expressiveness of the feature space. Through our toy experiment, with a regularized model (as expected!), we gained an increase over the baseline.