Hyperparameter Tuning Analysis
We used W&B Sweeps to run a hyperparameter search.
In this report, we distill some observations from the hyperparameter search we conducted using W&B Sweeps. The search space is as follows:
```python
sweep_config = {
    "name": "vivit_hyperparam_search",
    "method": "bayes",
    "metric": {"name": "val_loss", "goal": "minimize"},
    "parameters": {
        "optimizer": {"values": ["adam", "adamw"]},
        "learning_rate": {"values": [1e-4, 1e-3]},
        "patch_size": {"values": [(4, 4, 4), (6, 6, 6), (8, 8, 8)]},
        "projection_dim": {"values": [32, 64, 128, 256, 512]},
        "num_heads": {"values": [4, 6, 8]},
        "num_layers": {"values": [4, 6, 8]},
        "epochs": {"values": [20, 40, 60]},
    },
}
```
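For reference, a sweep with this config can be registered and run with `wandb.sweep` and `wandb.agent`. The sketch below is illustrative only: the project name, the `train` function, and the run count are placeholders, not our actual setup.

```python
import wandb

# Minimal sketch of launching the sweep. `train` is a placeholder for the
# training function, which is expected to call wandb.init(), read its
# hyperparameters from wandb.config, and log `val_loss` during training.
sweep_id = wandb.sweep(sweep=sweep_config, project="vivit-sweeps")
wandb.agent(sweep_id, function=train, count=50)
```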
We ran a Bayesian search with the objective of minimizing val_loss. Note that in this report our primary focus is the test accuracy (%) metric. Before we get into the search results, let's see how our baseline models performed with hand-picked values for the different parameters.
The best baseline model got a test_acc% of 69.51%.
[Run set: 8 runs]
Parallel Coordinates Plot
The chart below is a parallel coordinates plot that summarizes all the sweep results.
[Run set: 4 runs]
Observations
- The best test_acc% is 82.46%, an absolute improvement of ~13 points over the best baseline (69.51%). Note that the baseline was trained with hand-picked parameters that seemed reasonable to us.
- One interesting observation is that the biggest patch size, (8, 8, 8), worked best even though the frame size is only (28, 28); assuming non-overlapping patches, that leaves just ⌊28/8⌋ = 3 patches per spatial dimension.
- Longer training (epochs = 60) is better.
- A learning rate of 0.001 is better on average. However, the single best run used a learning rate of 0.0001.
- More attention heads (num_heads = 8) are better.
- More transformer layers (num_layers = 8) are better.
- It's a close competition between AdamW and Adam, but it seems Adam is winning this one. Perhaps AdamW would have worked better with a different weight decay; the weight decay we used is 1e-5 (see the optimizer sketch after this list).
- Even though the best test_acc% came with a projection dim of 128, smaller projection dims (32 and 64) do better on average.
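To make the optimizer comparison concrete, here is a minimal sketch of how the two choices could be wired up. It assumes a TensorFlow/Keras setup (an assumption on our part, not a statement about the actual training code); the 1e-5 weight decay applies only to AdamW.

```python
import tensorflow as tf

def build_optimizer(name: str, learning_rate: float) -> tf.keras.optimizers.Optimizer:
    """Hypothetical helper mapping the sweep's optimizer choice to a Keras optimizer."""
    if name == "adamw":
        # AdamW decouples weight decay from the gradient update; 1e-5 is the
        # weight decay we used. Requires TF >= 2.11, where AdamW ships in
        # tf.keras.optimizers.
        return tf.keras.optimizers.AdamW(learning_rate=learning_rate, weight_decay=1e-5)
    # Plain Adam has no decoupled weight-decay term.
    return tf.keras.optimizers.Adam(learning_rate=learning_rate)
```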