Experimenting with EvoNorm Layers in TensorFlow 2
This article provides an experimental summary of the EvoNorm layers proposed in the paper 'Evolving Normalization-Activation Layers'.
Table of Contents
- Experimental Setup
- Adam + BN-ReLU + No Data Augmentation
- SGD + BN-ReLU + No Data Augmentation
- SGD + BN-ReLU + Data Augmentation
- EvoNorm B0 + No Data Augmentation
- EvoNorm B0 + Data Augmentation
- EvoNorm S0 + No Data Augmentation + Groups8
- EvoNorm S0 + No Data Augmentation + Groups16
- EvoNorm S0 + No Data Augmentation + Groups32
- Observations on EvoNorm S0 layers without data augmentation
- Hyperparameter sweep on EvoNorm S0 layers without data augmentation
- EvoNorm S0 + Data Augmentation + Groups8
- Final remarks
- Acknowledgement
Experimental Setup
In this report, I am going to elaborate on my experiments with the EvoNorm layers proposed in Evolving Normalization-Activation Layers. In the paper, the authors attempt to unify normalization layers and activation functions into a single computation graph. The authors claim:
Several of these layers enjoy the property of being independent from the batch statistics.
I used Colab to perform my experiments. The authors tested the EvoNorm layers on MobileNetV2, ResNets, MnasNet, and EfficientNets. I decided to try out some quick experiments on a Mini Inception architecture as shown in this blog post. I trained them on the CIFAR10 dataset.
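For reference, CIFAR10 ships with tf.keras, so loading it is a one-liner. The simple rescaling to [0, 1] shown below is an assumption on my part and may differ from the preprocessing in the Colab:

```python
import tensorflow as tf

# 50,000 training and 10,000 test images, each 32x32x3, spread across 10 classes.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0
```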
I am going to compare the EvoNorm B0 and S0 layers against the following BN-ReLU baselines on the Mini Inception architecture (a sketch of the corresponding block-level swap follows the list):
- Adam + BN-ReLU + No Data Augmentation
- SGD + BN-ReLU + With Data Augmentation
- SGD + BN-ReLU + Without Data Augmentation
(BN refers to Batch Normalization)
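In each convolutional block of the Mini Inception network, the EvoNorm variants simply replace the BatchNormalization + ReLU pair with a single EvoNorm layer. The two helper functions below sketch that swap; they are illustrative only, and `EvoNormB0` stands for a custom layer (a possible implementation is sketched further below), not a built-in Keras layer:

```python
from tensorflow.keras import layers

def conv_bn_relu(x, filters, kernel_size=3):
    # Baseline block: convolution -> Batch Normalization -> ReLU.
    x = layers.Conv2D(filters, kernel_size, padding="same")(x)
    x = layers.BatchNormalization()(x)
    return layers.Activation("relu")(x)

def conv_evonorm(x, filters, kernel_size=3):
    # EvoNorm block: the BN + ReLU pair is replaced by one EvoNorm layer.
    x = layers.Conv2D(filters, kernel_size, padding="same")(x)
    return EvoNormB0()(x)  # or an EvoNorm-S0 layer with a `groups` argument
```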
The EvoNorm authors refer to their layers as the EvoNorm-B series when they involve batch aggregations and hence require maintaining moving-average statistics for inference. The EvoNorm-S series refers to batch-independent layers that rely on individual samples only (a desirable property that simplifies implementation and stabilizes training with small batch sizes).
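To make the distinction concrete, here is a minimal sketch of what an EvoNorm-B0 layer could look like in tf.keras, written directly from the formula in the paper: a batch-level variance (tracked with a moving average for inference) combined with a per-instance variance term. The momentum, epsilon, and weight shapes are assumptions, and the implementation in the Colab may differ:

```python
import tensorflow as tf

class EvoNormB0(tf.keras.layers.Layer):
    def __init__(self, momentum=0.9, epsilon=1e-5, **kwargs):
        super().__init__(**kwargs)
        self.momentum = momentum
        self.epsilon = epsilon

    def build(self, input_shape):
        # Assumes channels-last inputs of shape (batch, height, width, channels).
        param_shape = (1, 1, 1, input_shape[-1])
        self.gamma = self.add_weight(name="gamma", shape=param_shape, initializer="ones")
        self.beta = self.add_weight(name="beta", shape=param_shape, initializer="zeros")
        self.v1 = self.add_weight(name="v1", shape=param_shape, initializer="ones")
        # Moving batch variance used at inference time -- this is the batch
        # aggregation that distinguishes the B series from the S series.
        self.moving_variance = self.add_weight(
            name="moving_variance", shape=param_shape,
            initializer="ones", trainable=False)

    def call(self, x, training=False):
        if training:
            # Batch variance aggregated over (batch, height, width) per channel.
            batch_variance = tf.math.reduce_variance(x, axis=[0, 1, 2], keepdims=True)
            self.moving_variance.assign(
                self.momentum * self.moving_variance
                + (1.0 - self.momentum) * batch_variance)
        else:
            batch_variance = self.moving_variance
        # Per-sample (instance) variance over the spatial dimensions only.
        instance_variance = tf.math.reduce_variance(x, axis=[1, 2], keepdims=True)
        denominator = tf.maximum(
            tf.sqrt(batch_variance + self.epsilon),
            self.v1 * x + tf.sqrt(instance_variance + self.epsilon))
        return x / denominator * self.gamma + self.beta
```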
It should also be noted that the EvoNorm layers perform quite well on tasks like instance segmentation with Mask R-CNN and image synthesis with BigGAN.
Adam + BN-ReLU + No Data Augmentation
SGD + BN-ReLU + No Data Augmentation
SGD params:
# EPOCHS is the total number of training epochs (defined elsewhere in the notebook).
opt = tf.keras.optimizers.SGD(lr=1e-2, momentum=0.9, decay=1e-2 / EPOCHS)
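The optimizer is then passed to `model.compile`. The loss and metric below are the usual choices for CIFAR10 classification and are assumptions on my part rather than a verbatim copy of the Colab:

```python
# `model` refers to the Mini Inception network built earlier (not shown here).
model.compile(optimizer=opt,
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```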
SGD + BN-ReLU + Data Augmentation
EvoNorm B0 + No Data Augmentation
EvoNorm B0 + Data Augmentation
EvoNorm S0 + No Data Augmentation + Groups8
With EvoNorm S0, no data augmentation, and 8 groups, we again see that the validation loss is higher than in the previous experiment. The training and validation accuracies also differ from each other. The network is not generalizing well in this case either.
A note on the groups hyperparameter in the EvoNorm layers:
groups controls how the channels are split into groups when computing the group-wise aggregation, similar to Group Normalization. In the original paper, the authors show which group settings work well as the task changes.
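To make this concrete, here is a minimal sketch of a group-wise standard deviation for channels-last tensors, which is the aggregation EvoNorm-S0 divides by. The function name, default group count, and epsilon are illustrative rather than taken from the Colab:

```python
import tensorflow as tf

def group_std(x, groups=8, epsilon=1e-5):
    # x is assumed to be channels-last: (batch, height, width, channels),
    # with the channel count divisible by `groups`.
    _, h, w, c = x.shape
    grouped = tf.reshape(x, [-1, h, w, groups, c // groups])
    # Per-sample statistics over (height, width, channels-in-group),
    # as in Group Normalization.
    _, variance = tf.nn.moments(grouped, axes=[1, 2, 4], keepdims=True)
    std = tf.sqrt(variance + epsilon)
    std = tf.broadcast_to(std, tf.shape(grouped))
    return tf.reshape(std, tf.shape(x))

# EvoNorm-S0 then computes: x * sigmoid(v1 * x) / group_std(x, groups) * gamma + beta,
# where v1, gamma, and beta are learnable per-channel parameters.
```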
EvoNorm S0 + No Data Augmentation + Groups16
EvoNorm S0 + No Data Augmentation + Groups32
Observations on EvoNorm S0 layers without data augmentation
sweep_config = {
    "method": "random",
    "metric": {"name": "accuracy", "goal": "maximize"},
    "parameters": {
        "groups": {"values": [4, 8, 12, 16, 32]},
        "epochs": {"values": [10, 20, 30, 40, 50, 60]},
        "learning_rate": {"values": [1e-2, 1e-3, 1e-4, 3e-4, 3e-5, 1e-5]},
        "optimizer": {"values": ["adam", "sgd"]},
    },
}
- SGD + BN-ReLU + Data Augmentation shows the most stable training behavior so far.
- If we look closely, all EvoNorm S0 experiments (except groups of 32) without data augmentation show stable training behavior up until ~12 epochs. This is the case for EvoNorm B0 + No Data Augmentation as well.
- One thing that might help here is tuning the learning rate and groups hyperparameters further. This is why I decided to run a [hyperparameter sweep](https://docs.wandb.com/sweeps) with the search space shown above.
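For reference, here is a minimal sketch of how such a sweep can be launched with the W&B API. The project name and the `train` function are placeholders, not the exact code from the Colab:

```python
import wandb

# Register the sweep with the search space defined above
# (the project name here is a placeholder).
sweep_id = wandb.sweep(sweep_config, project="evonorm-experiments")

# `train` is a placeholder training function that reads wandb.config,
# builds the Mini Inception model with the sampled hyperparameters,
# and logs metrics with wandb.log.
wandb.agent(sweep_id, function=train)
```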
Hyperparameter sweep on EvoNorm S0 layers without data augmentation
EvoNorm S0 + Data Augmentation + Groups8
Final remarks
As we saw in this quick experimental setup, the EvoNorm layers fail to match the performance of BN-ReLU. But this should not be treated as a foregone conclusion. I encourage you to try the EvoNorm layers out in your own experiments and let me know via Twitter (@RisingSayak) what you find.
👉 Colab notebook to reproduce results.
Acknowledgement