
Kaggle's Feature Engineering

In this report, you will learn a practical approach to feature engineering. You'll be able to apply what you learn to Kaggle competitions and other machine learning applications.
Created on August 10 | Last edited on August 14

Walkthrough

We'll work with data from Kickstarter projects. The first few rows of the data look like this:

[Screenshot: first few rows of the Kickstarter data]

The state column shows the outcome of the project.

print('Unique values in `state` column:', list(ks.state.unique()))
Unique values in `state` column: ['failed', 'canceled', 'successful', 'live', 'undefined', 'suspended']

Using this data, how can we use features such as project category, currency, funding goal, and country to predict if a Kickstarter project will succeed?

Prepare the target column

First, we'll convert the state column into a target we can use in a model. Data cleaning isn't the current focus, so we'll simplify this example by:

  • Dropping projects that are "live"
  • Counting "successful" states as outcome = 1
  • Combining every other state as outcome = 0
# Drop live projects
ks = ks.query('state != "live"')

# Add outcome column, "successful" == 1, others are 0
ks = ks.assign(outcome=(ks['state'] == 'successful').astype(int))

Convert timestamps

Next, we convert the launched feature into categorical features we can use in a model. Since we loaded the columns as timestamp data, we access date and time values through the .dt attribute on the timestamp column.

ks = ks.assign(hour=ks.launched.dt.hour,
               day=ks.launched.dt.day,
               month=ks.launched.dt.month,
               year=ks.launched.dt.year)

Prep categorical variables

Now for the categorical variables -- category, currency, and country -- we'll need to convert them into integers so our model can use the data. For this we'll use scikit-learn's LabelEncoder. This assigns an integer to each value of the categorical feature.

from sklearn.preprocessing import LabelEncoder

cat_features = ['category', 'currency', 'country']
encoder = LabelEncoder()

# Apply the label encoder to each column
encoded = ks[cat_features].apply(encoder.fit_transform)
We collect all of these features in a new dataframe that we can use to train a model.

# Since ks and encoded have the same index, we can easily join them
data = ks[['goal', 'hour', 'day', 'month', 'year', 'outcome']].join(encoded)


Create training, validation, and test splits

We need to create data sets for training, validation, and testing. We'll use a fairly simple approach and split the data using slices. We'll use 10% of the data as a validation set, 10% for testing, and the other 80% for training.

valid_fraction = 0.1
valid_size = int(len(data) * valid_fraction)

train = data[:-2 * valid_size]
valid = data[-2 * valid_size:-valid_size]
test = data[-valid_size:]

Train a model

For this course we'll be using a LightGBM model. This is a gradient-boosted tree model that typically performs very well on tabular data like this, often comparable to or better than XGBoost. It's also relatively fast to train.

We won't do hyperparameter optimization because that isn't the goal of this course. So, our models won't be the absolute best performance you can get. But you'll still see model performance improve as we do feature engineering.

import lightgbm as lgb

feature_cols = train.columns.drop('outcome')

dtrain = lgb.Dataset(train[feature_cols], label=train['outcome'])
dvalid = lgb.Dataset(valid[feature_cols], label=valid['outcome'])

param = {'num_leaves': 64, 'objective': 'binary'}
param['metric'] = 'auc'
num_round = 1000
bst = lgb.train(param, dtrain, num_round, valid_sets=[dvalid], early_stopping_rounds=10, verbose_eval=False)
Make predictions & evaluate the model

Finally, let's make predictions on the test set with the model and see how well it performs. An important thing to remember is that you can overfit to the validation data. This is why we need a test set that the model never sees until the final evaluation.

from sklearn import metrics
ypred = bst.predict(test[feature_cols])
score = metrics.roc_auc_score(test['outcome'], ypred)

print(f"Test AUC score: {score}")
Test AUC score: 0.747615303004287

Now we'll build our own baseline model, which you can improve with feature engineering techniques as you go through the course.




[W&B run set panel: 4 runs]


Introduction

In this exercise, we will work with data from the TalkingData AdTracking competition. The goal of the competition is to predict if a user will download an app after clicking through an ad.

For this course you will use a small sample of the data, dropping 99% of negative records (where the app wasn't downloaded) to make the target more balanced.
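
The course provides this sampled data directly; purely as an illustration, here is one way such a sample could be produced. The file name and sampling code below are assumptions, not the course's actual preprocessing:

import pandas as pd

# Illustrative only: keep all positive rows and ~1% of negative rows,
# assuming the raw TalkingData training file and an 'is_attributed' target.
raw = pd.read_csv('train.csv', parse_dates=['click_time'])
positives = raw[raw['is_attributed'] == 1]
negatives = raw[raw['is_attributed'] == 0].sample(frac=0.01, random_state=7)
click_data = pd.concat([positives, negatives]).sort_values('click_time')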

After building a baseline model, you'll be able to see how your feature engineering and selection efforts improve the model's performance.

Baseline Model

The first thing you'll do is construct a baseline model.

1) Construct features from timestamps

The click_data DataFrame has a 'click_time' column with timestamp data. We'll use this column to create features for the corresponding day, hour, minute, and second. Then we'll store these as new integer columns day, hour, minute, and second in a new DataFrame clicks.

# Add new columns for timestamp features day, hour, minute, and second
clicks = click_data.copy()
click_times = click_data['click_time']
clicks['day'] = click_times.dt.day.astype('uint8')
clicks['hour'] = click_times.dt.hour.astype('uint8')
clicks['minute'] = click_times.dt.minute.astype('uint8')
clicks['second'] = click_times.dt.second.astype('uint8')

2) Label Encoding

For each of the categorical features ['ip', 'app', 'device', 'os', 'channel'], use scikit-learn's LabelEncoder to create new features in the clicks DataFrame. The new column names should be the original column name with '_labels' appended, like ip_labels.

from sklearn import preprocessing

cat_features = ['ip', 'app', 'device', 'os', 'channel']

# Create new columns in clicks using preprocessing.LabelEncoder()
for feature in cat_features:
    label_encoder = preprocessing.LabelEncoder()
    encoded = label_encoder.fit_transform(clicks[feature])
    clicks[feature + '_labels'] = encoded

3) One-hot Encoding

In the code cell above, you used label encoded features. It wouldn't have made sense to use one-hot encoding for the categorical variables 'ip', 'app', 'device', 'os', or 'channel' instead, because these features take on a very large number of distinct values, so the resulting one-hot encoded matrix would be enormous and extremely sparse.
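
To see this concretely, you can check the cardinality of each feature. A quick sketch, assuming the clicks DataFrame from above:

# Number of unique values per categorical feature; one-hot encoding would
# create this many new columns per feature, almost all of them zeros.
print(clicks[['ip', 'app', 'device', 'os', 'channel']].nunique())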

4) Create train/validation/test splits

Here we'll create training, validation, and test splits. First, the clicks DataFrame is sorted in order of increasing time. The first 80% of the rows are the train set, the next 10% are the validation set, and the last 10% are the test set.

feature_cols = ['day', 'hour', 'minute', 'second', 
                'ip_labels', 'app_labels', 'device_labels',
                'os_labels', 'channel_labels']

valid_fraction = 0.1
clicks_srt = clicks.sort_values('click_time')
valid_rows = int(len(clicks_srt) * valid_fraction)
train = clicks_srt[:-valid_rows * 2]
# valid size == test size, last two sections of the data
valid = clicks_srt[-valid_rows * 2:-valid_rows]
test = clicks_srt[-valid_rows:]

5) Train with LightGBM

Now we can create LightGBM dataset objects for each of the smaller datasets and train the baseline model. We'll use W&B's wandb_callback for LightGBM to log the metrics.

import lightgbm as lgb
import wandb
from wandb.lightgbm import wandb_callback
wandb.init(project='Kaggle-FeatureEng')
dtrain = lgb.Dataset(train[feature_cols], label=train['is_attributed'])
dvalid = lgb.Dataset(valid[feature_cols], label=valid['is_attributed'])
dtest = lgb.Dataset(test[feature_cols], label=test['is_attributed'])

param = {'num_leaves': 64, 'objective': 'binary'}
param['metric'] = 'auc'
num_round = 1000
bst = lgb.train(param, dtrain, num_round, valid_sets=[dvalid],
                early_stopping_rounds=10, callbacks=[wandb_callback()])

This will be our baseline score for the model. When we transform features, add new ones, or perform feature selection, we should be improving on this score.




[W&B run set panel: 1 run]


Encoding Techniques

Now that you've built a baseline model, you are ready to improve it with some clever ways to work with categorical variables.

You are already familiar with the most basic encodings: one-hot encoding and label encoding. In this section, you'll learn about count encoding, target encoding, and CatBoost encoding. We'll start off by defining a helper function that trains a model on the processed dataset and logs the metrics to the W&B dashboard.

def train_model(train, valid, name, test=None, feature_cols=None):
    wandb.init(project='Kaggle-FeatureEng', name=name)
    if feature_cols is None:
        feature_cols = train.columns.drop(['click_time', 'attributed_time',
                                           'is_attributed'])
    dtrain = lgb.Dataset(train[feature_cols], label=train['is_attributed'])
    dvalid = lgb.Dataset(valid[feature_cols], label=valid['is_attributed'])
    
    param = {'num_leaves': 64, 'objective': 'binary', 
             'metric': 'auc', 'seed': 7}
    num_round = 1000
    bst = lgb.train(param, dtrain, num_round, valid_sets=[dvalid], 
                    early_stopping_rounds=20, verbose_eval=False,
                   callbacks=[wandb_callback()]
                   )
    
    valid_pred = bst.predict(valid[feature_cols])
    valid_score = metrics.roc_auc_score(valid['is_attributed'], valid_pred)
    print(f"Validation AUC score: {valid_score}")
    
    if test is not None: 
        test_pred = bst.predict(test[feature_cols])
        test_score = metrics.roc_auc_score(test['is_attributed'], test_pred)
        return bst, valid_score, test_score
    else:
        return bst, valid_score

Count Encoding

Count encoding replaces each categorical value with the number of times it appears in the dataset. For example, if the value "GB" occurred 10 times in the country feature, then each "GB" would be replaced with the number 10.
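
Conceptually this is just a value-counts lookup. A minimal pandas sketch for a single feature, assuming the train and valid splits defined earlier (the CountEncoder used below does the same thing, plus handling of values it hasn't seen):

# Hand-rolled count encoding for 'channel', learned from the training data only.
channel_counts = train['channel'].value_counts()
train_channel_count = train['channel'].map(channel_counts)
valid_channel_count = valid['channel'].map(channel_counts).fillna(0)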

We'll use the category_encoders package to get this encoding. The encoder itself is available as CountEncoder. This encoder and the others in category_encoders work like scikit-learn transformers with .fit and .transform methods.

import category_encoders as ce

cat_features = ['ip', 'app', 'device', 'os', 'channel']
train, valid, test = get_data_splits(clicks)
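
get_data_splits isn't shown in this report; a sketch consistent with the time-ordered 80/10/10 split used for the baseline would be:

def get_data_splits(dataframe, valid_fraction=0.1):
    """Time-ordered split: first 80% train, next 10% validation, last 10% test."""
    dataframe = dataframe.sort_values('click_time')
    valid_rows = int(len(dataframe) * valid_fraction)
    train = dataframe[:-valid_rows * 2]
    valid = dataframe[-valid_rows * 2:-valid_rows]
    test = dataframe[-valid_rows:]
    return train, valid, test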

Next, encode the categorical features ['ip', 'app', 'device', 'os', 'channel'] using the count of each value in the data set.

Using CountEncoder from the category_encoders library, fit the encoding using the categorical feature columns defined in cat_features. Then apply the encodings to the train and validation sets, adding them as new columns with names suffixed "_count".

# Create the count encoder
count_enc = ce.CountEncoder(cols=cat_features)

# Learn encoding from the training set
count_enc.fit(train[cat_features])

# Apply encoding to the train and validation sets
train_encoded = train.join(count_enc.transform(train[cat_features]).add_suffix('_count'))
valid_encoded = valid.join(count_enc.transform(valid[cat_features]).add_suffix('_count'))

_ = train_model(train_encoded, valid_encoded,'encoded')



[W&B run set panel: 2 runs]




[W&B run set panel: 4 runs]


Target Encoding

Target encoding replaces a categorical value with the average value of the target for that value of the feature. For example, given the country value "CA", you'd calculate the average outcome for all the rows with country == 'CA', around 0.28. This is often blended with the target probability over the entire dataset to reduce the variance of values with few occurrences.

This technique uses the targets to create new features. So including the validation or test data in the target encodings would be a form of target leakage. Instead, you should learn the target encodings from the training dataset only and apply it to the other datasets.
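
To make the blending idea concrete, here's a rough sketch of a smoothed target encoding for a single feature, learned from the training set only. The smoothing weight m is an arbitrary illustration; TargetEncoder has its own smoothing parameters.

# Smoothed target encoding for 'app', computed on the training data only.
prior = train['is_attributed'].mean()                       # overall target rate
stats = train.groupby('app')['is_attributed'].agg(['mean', 'count'])
m = 100                                                     # assumed smoothing weight
smoothed = (stats['count'] * stats['mean'] + m * prior) / (stats['count'] + m)
valid_app_target = valid['app'].map(smoothed).fillna(prior) # unseen values get the prior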

The category_encoders package provides TargetEncoder for target encoding. The implementation is similar to CountEncoder.

import category_encoders as ce
# Have to tell it which features are categorical when they aren't strings
target_enc = ce.TargetEncoder(cols=cat_features)

# Learn encoding from the training set
target_enc.fit(train[cat_features], train['is_attributed'])

# Apply encoding to the train and validation sets
train_encoded = train.join(target_enc.transform(train[cat_features]).add_suffix('_target'))
valid_encoded = valid.join(target_enc.transform(valid[cat_features]).add_suffix('_target'))
_ = train_model(train_encoded, valid_encoded,'target-encoding')



[W&B run set panel: 2 runs]


CatBoost Encoding

Finally, we'll look at CatBoost encoding. This is similar to target encoding in that it's based on the target probability for a given value. However, with CatBoost encoding, the target probability for each row is calculated only from the rows that come before it.
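
In other words, it's an expanding mean over the time-ordered data. A rough pandas sketch of the core calculation for one feature (CatBoostEncoder itself also mixes in a prior and handles the ordering internally):

# For each row, the mean target of *earlier* rows sharing the same 'app' value.
cumsum = train.groupby('app')['is_attributed'].cumsum() - train['is_attributed']
cumcount = train.groupby('app').cumcount()
app_cb_manual = (cumsum / cumcount).fillna(train['is_attributed'].mean())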

The CatBoost encoder is supposed to work well with the LightGBM model. We'll encode the categorical features with CatBoostEncoder and train the model on the encoded data again.

# remove IP from the encoded features
cat_features = ['app', 'device', 'os', 'channel']

train, valid, test = get_data_splits(clicks)

# Have to tell it which features are categorical when they aren't strings
cb_enc = ce.CatBoostEncoder(cols=cat_features, random_state=7)

# Learn encoding from the training set
cb_enc.fit(train[cat_features], train['is_attributed'])

# Apply encoding to the train and validation sets
train_encoded = train.join(cb_enc.transform(train[cat_features]).add_suffix('_cb'))
valid_encoded = valid.join(cb_enc.transform(valid[cat_features]).add_suffix('_cb'))
_ = train_model(train_encoded, valid_encoded,'catboost')

Now let's see how these techniques compare.




[W&B run set panel: 4 runs]


Conclusion