Introduction

Kaggle, the world's largest community of data scientists and machine learning practitioners, is dominated by the use of scikit-learn. But with the increasing complexity of problem statements, the use of deep neural networks in competitions is on the rise.

Because of this, competitors need to juggle multiple frameworks: scikit-learn for classical machine learning tasks, and TensorFlow or PyTorch for deep neural networks. In this project, we'll solve this problem by using skorch, a scikit-learn compatible wrapper around PyTorch. While we're at it, we'll also try to automate the process of selecting the best performing neural network architecture by running a hyper-parameter sweep with Weights & Biases.

Github Repo for full code →

Introduction To Skorch

Skorch is the union of scikit-learn and PyTorch: a scikit-learn compatible neural network library that wraps PyTorch. It reduces the boilerplate of developing neural networks, because a wrapped model can be trained and used like any other scikit-learn estimator.

Here's a quick example that demonstrates the compatibility between PyTorch and scikit-learn that skorch provides:

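from torch import nn
from skorch import NeuralNetClassifier

# A small PyTorch module: 3 input features, 2 output classes.
# NeuralNetClassifier's default criterion (NLLLoss) expects class
# probabilities, hence the final Softmax.
module = nn.Sequential(
    nn.Linear(3, 4),
    nn.Sigmoid(),
    nn.Linear(4, 2),
    nn.Softmax(dim=-1),
)

net = NeuralNetClassifier(
    module,
    max_epochs=10,
    lr=0.1,
    # Shuffle training data on each epoch
    iterator_train__shuffle=True,
)

# scikit-learn style syntax; X is a float32 array of shape (n, 3)
# and y holds int64 class labels
net.fit(X, y)
y_proba = net.predict_proba(X)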

In a scikit-learn Pipeline:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


pipe = Pipeline([
    ('scale', StandardScaler()),
    ('net', net),
])

pipe.fit(X, y)
y_proba = pipe.predict_proba(X)

Our Goal

The idea is to build a class that parses a configuration dictionary to build a PyTorch model, and to use it with skorch to automate model selection and visualization. Our goal is to treat the number of neurons in each layer as a hyper-parameter. The only expected input is the hyper-parameter sweep dictionary specifying the neural network architectures we'd like to experiment with.

Let's look at an example to make things clear.

sweep_config = {
    'method': 'random', #grid, random
    'metric': {
      'name': 'valid_loss',
      'goal': 'minimize' },
    'parameters': {
        'learning_rate': {
            'values': [0.1, 0.01,0.001]},
        'fc_layer1':{
              'values':[93]},
        'fc_layer2':{
            'values':[8,10,12,16] },
        'fc_layer3':{
            'values':[4,8,10,20]
        } }}

The input should resemble the sweep_config dictionary. As you can see, fc_layer1 has a fixed value because it is the input layer. Now we just need to pass this configuration to our tool, and it will automatically generate neural network architectures from combinations of the given values (all of them with the grid method, a random sample with the random method), log the metrics, and visualize them with Weights & Biases. All of this takes just a couple of lines of code.

classifier  = ClassifierNet()
classifier.runSweep(sweep_config)

Here, ClassifierNet is a name that I've given to the tool we'll build for this project. After executing the code above, you'll get these visualizations which will help you determine the model and hyper-parameter combinations that fit your need.
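To get a feel for how large the search space described by sweep_config is, here is a minimal sketch that enumerates every combination with the standard library. The helper name expand_grid is my own and is not part of W&B or skorch; it only mimics what an exhaustive grid search would iterate over.

```python
from itertools import product

sweep_config = {
    'method': 'grid',
    'metric': {'name': 'valid_loss', 'goal': 'minimize'},
    'parameters': {
        'learning_rate': {'values': [0.1, 0.01, 0.001]},
        'fc_layer1': {'values': [93]},
        'fc_layer2': {'values': [8, 10, 12, 16]},
        'fc_layer3': {'values': [4, 8, 10, 20]},
    },
}


def expand_grid(config):
    """Return every combination of the listed parameter values as dicts.

    Hypothetical helper for illustration only.
    """
    params = config['parameters']
    names = list(params)
    value_lists = [params[name]['values'] for name in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]


all_runs = expand_grid(sweep_config)
print(len(all_runs))  # 3 * 1 * 4 * 4 = 48
print(all_runs[0])
```

Even this small configuration describes 48 candidate runs, which is why handing the bookkeeping to a sweep is attractive.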

Here's a simple notebook to get you started with sweeps →

Building the Parser Class

The first step is to build a class that takes the configuration and parses it to build a neural network. We'll only add support for fully connected (dense) layers for now, but you can extend this to other layer types.

from torch import nn

class parseModel(nn.Module):
    def __init__(self, def_config=None):
        super().__init__()
        self.layers = nn.ModuleList()
        self.layer_dims = []
        # Normalize keys to strings; the wandb config object is dict-like
        config = {} if def_config is None else dict(def_config)
        config_dict = {str(k): v for k, v in config.items()}
        # Collect the layer sizes from every key containing 'fc_layer'
        for key in config_dict:
            if 'fc_layer' in key:
                self.layer_dims.append(config_dict[key])
        # One Linear layer between each pair of consecutive sizes
        for i in range(len(self.layer_dims) - 1):
            self.layers.append(
                nn.Linear(self.layer_dims[i], self.layer_dims[i + 1]))
        # Final output layer with a single unit
        self.layers.append(nn.Linear(self.layer_dims[-1], 1))

The parseModel class extends PyTorch's nn.Module.

The constructor expects the wandb config dictionary and converts each key to a string in case it is some other data type.

It then goes through all the keys looking for the substring 'fc_layer' and collects the matching values as layer sizes.

Finally, it creates a Linear layer between each pair of consecutive sizes, plus a final Linear layer with a single output, and appends them to an nn.ModuleList, a list-like container that stores PyTorch modules.
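The size-extraction logic can be sketched on its own, without PyTorch, to check which (in_features, out_features) pairs a given config produces. This is a rough sketch, not the class above: the function name layer_shapes is hypothetical, and unlike parseModel it sorts the keys by their numeric suffix rather than relying on dictionary order, which is an assumption on my part about what is usually wanted.

```python
import re


def layer_shapes(config, out_features=1):
    """Return (in_features, out_features) pairs for each Linear layer.

    Hypothetical helper: keys are ordered by the number after 'fc_layer'.
    """
    indexed = []
    for key, value in config.items():
        match = re.fullmatch(r'fc_layer(\d+)', str(key))
        if match:
            indexed.append((int(match.group(1)), value))
    if not indexed:
        raise ValueError("config contains no 'fc_layerN' keys")
    dims = [size for _, size in sorted(indexed)]
    # Consecutive sizes form the hidden layers
    shapes = list(zip(dims[:-1], dims[1:]))
    # The last size feeds the output layer
    shapes.append((dims[-1], out_features))
    return shapes


config = {'learning_rate': 0.001, 'fc_layer1': 93, 'fc_layer2': 10, 'fc_layer3': 10}
print(layer_shapes(config))  # [(93, 10), (10, 10), (10, 1)]
```

Sorting by suffix means the result does not change if the sweep backend hands the keys over in a different order.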

    def print_model(self):
        print(self.layers)
        
    def forward(self,x):
        y = x
        for i in range(len(self.layers)):
            y = self.layers[i](y)
        return y

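These functions are pretty straightforward. print_model is a helper that prints the model's layers to the terminal.

The forward function is overridden to perform the forward pass: we loop through the layers stored in self.layers and feed each layer's output into the next. Note that no activation function is applied between the layers here, so the stack is linear end to end; adding a non-linearity such as ReLU between layers is an easy extension.

Let's look at an example: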

config_defaults = {
        'learning_rate': 0.001,
        'fc_layer1' : 93,
        'fc_layer2' : 10,
        'fc_layer3' : 10,
    }
model  = parseModel(config_defaults)
model.print_model()

In this case, we'll get the following model:

ModuleList(
  (0): Linear(in_features=93, out_features=10, bias=True)
  (1): Linear(in_features=10, out_features=10, bias=True)
  (2): Linear(in_features=10, out_features=1, bias=True)
)
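As a quick sanity check on the printed model, the number of trainable parameters can be worked out by hand: a Linear layer with bias has in_features * out_features weights plus out_features biases. The helper name below is my own, used only for this illustration.

```python
def linear_param_count(in_features, out_features, bias=True):
    """Weights plus (optional) biases of one fully connected layer."""
    return in_features * out_features + (out_features if bias else 0)


# Shapes taken from the printed ModuleList above
shapes = [(93, 10), (10, 10), (10, 1)]
per_layer = [linear_param_count(i, o) for i, o in shapes]
print(per_layer)       # [940, 110, 11]
print(sum(per_layer))  # 1061
```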

Building the Trainer Class

Now that we have our parser class, we can combine skorch and W&B to train and log the model that will be generated using the parser.

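import wandb
from torch import nn, optim
from sklearn.model_selection import train_test_split
from skorch import NeuralNetClassifier
from skorch.callbacks import WandbLogger
from skorch.dataset import Dataset
from skorch.helper import predefined_split

class ClassifierNet():
    def __init__(self, config=None):
        self.net = None             # model built by buildNetwork
        self.config = config or {}  # current configuration
        self.data_loaded = False    # set to True by loadData

    def buildNetwork(self, new_config):
        self.config = new_config
        # Parse the config into a PyTorch model with the parser class
        self.net = parseModel(self.config)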

The constructor takes a configuration dictionary as input. The three instance variables are net (the model, initially None), config (the current configuration) and data_loaded (a flag that records whether the data has been loaded).

The buildNetwork function takes an updated config as input and replaces the original one. It then builds a model by passing the new config to the parseModel class we built in the previous section.

Next, a helper function to load and split the dataset:

    def loadData(self, X, Y, split=True, test=0.2):
        # Cast to float32 for PyTorch and make the target a column vector;
        # `test` is the fraction held out for validation (`split` is unused)
        X_train, X_test, y_train, y_test = train_test_split(
            X.values.astype('float32'),
            Y.values.astype('float32').reshape(-1, 1),
            test_size=test)
        self.X_train = X_train
        self.y_train = y_train
        self.X_test = X_test
        self.y_test = y_test
        self.data_loaded = True  # Set the flag to True

Let's get into the training loop.

Skorch has native support for W&B through its WandbLogger callback, so we won't have to log our metrics manually.

    def train(self):
        if not self.data_loaded:
            print('Data not loaded')
            return

        wandb_run = wandb.init()  # Initialize this run
        config = wandb.config
        self.buildNetwork(config)  # Build a new network from this run's configuration

        # Use the held-out split as a fixed validation set
        valid_ds = Dataset(self.X_test, self.y_test)
        net = NeuralNetClassifier(
            self.net,
            criterion=nn.BCEWithLogitsLoss,
            max_epochs=15,
            optimizer=optim.Adam,
            optimizer__lr=self.config.learning_rate,
            callbacks=[WandbLogger(wandb_run)],
            train_split=predefined_split(valid_ds),
        )
        net.fit(self.X_train, self.y_train)

Finally, we'll add support for our sweep:

    def runSweep(self,sweep_config):
        sweep_id = wandb.sweep(sweep_config)
        wandb.agent(sweep_id,function=self.train)

runSweep calls the train function on the combinations of hyper-parameters generated from the sweep configuration.
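With 'method': 'random', each run draws one value per parameter instead of walking the full grid. The sketch below mimics that idea with the standard library; sample_config is a hypothetical helper and is not how W&B implements it internally.

```python
import random

sweep_config = {
    'method': 'random',
    'parameters': {
        'learning_rate': {'values': [0.1, 0.01, 0.001]},
        'fc_layer1': {'values': [93]},
        'fc_layer2': {'values': [8, 10, 12, 16]},
        'fc_layer3': {'values': [4, 8, 10, 20]},
    },
}


def sample_config(config, rng):
    """Pick one value per parameter, uniformly at random (illustration only)."""
    return {name: rng.choice(spec['values'])
            for name, spec in config['parameters'].items()}


rng = random.Random(0)  # seeded so the draw is repeatable
samples = [sample_config(sweep_config, rng) for _ in range(5)]
for sample in samples:
    print(sample)
```

Because a random sweep has no natural end, it is worth capping the number of runs; wandb.agent accepts a count argument for this, for example wandb.agent(sweep_id, function=self.train, count=20).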

Let's Try This Out on a Kaggle Competition

To illustrate the performance of our automation tool, let's work with the Otto Group Product Classification Challenge dataset.

Here's the problem statement:

As we're going straight for end-to-end deep neural nets, we don't need to extract important features by hand. After loading the data, let's directly use the ClassifierNet we created.

import pandas as pd

# Load the CSV file
X = pd.read_csv('train.csv', sep=',')

# Map the class labels to numbers
Y = X['target'].map({'Class_1': 1, 'Class_2': 2,
                     'Class_3': 3, 'Class_4': 4,
                     'Class_5': 5, 'Class_6': 6,
                     'Class_7': 7, 'Class_8': 8,
                     'Class_9': 9})
Y = Y.astype('float64')

# Drop the id and target columns so that only the features remain
X = X.drop(['id', 'target'], axis=1)

Now we just need a configuration file with all the possible values of layer neurons and hyper-parameters.

import wandb
sweep_config = {
    'method': 'random', #grid, random
    'metric': {
      'name': 'valid_loss',
      'goal': 'minimize'   
    },
    'parameters': {

        'learning_rate': {
            'values': [0.1, 0.01,0.001]
        },

        'fc_layer1':{
              'values':[93]
        },
        'fc_layer2':{
            'values':[8,10,12,16]
        },
        'fc_layer3':{
            'values':[4,8,10,20]
        }
    }
}

Next let's perform the hyper-parameter sweep.


classifier  = ClassifierNet()
classifier.loadData(X,Y)
classifier.runSweep(sweep_config)

Visualizing Hyperparameter Sweep Results

Let's go through the visualizations generated by running this hyperparameter sweep in our W&B dashboard to extract valuable insights.

The following chart shows the correlation between the hyper-parameters in the configuration file and the metric that we're trying to visualize, i.e. valid_loss in this case.

Here's a simpler visualization that compares the time taken by each run with how well it optimizes the desired metric. Another useful visualization in the dashboard compares validation accuracy against the runtime of a particular run.

The parameter importance visualization shows which hyper-parameter matters most.

Here, we can deduce that the learning rate affects the validation loss the most among the hyper-parameters, as it has the highest importance and correlation. The learning rate is followed by the last fully connected layer in terms of importance.
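To make the correlation column more concrete, here is a rough sketch of a linear (Pearson) correlation between one hyper-parameter and the metric, assuming that is the kind of correlation the dashboard reports. The numbers are made up purely for illustration and are not results from this sweep.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Synthetic values, NOT taken from the sweep above
learning_rates = [0.001, 0.01, 0.1, 0.001, 0.01, 0.1]
valid_losses = [0.30, 0.35, 0.90, 0.32, 0.38, 0.85]

print(round(pearson(learning_rates, valid_losses), 3))
```

A value close to 1 would mean that larger learning rates tend to go with larger validation loss, a value close to -1 the opposite, and a value near 0 no linear relationship.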

The duration graph compares the models by the time they take to train. As is evident from the following plot, all the models take almost the same time to train, with slight deviations between individual epochs.

Conclusion

As we can see, skorch is a powerful tool that combines the simplicity of scikit-learn with the power of PyTorch, and makes for a great framework to use in Kaggle competitions. Combined with Sweeps from Weights & Biases, it can help us quickly construct and iterate through complex model architectures and find the model most likely to perform well on the leaderboard, with just a few lines of code.

I encourage you to try it in your own Kaggle competitions. If you make it to the leaderboard, we'd love to hear about it! 🙂

Github Repo for full code →