A Deep Dive into Neural Architecture Search Without Training (NASWOT)

⚑ Or how I learned to infer the accuracy of a trained neural network from its initial state. Made by Lukas Malik using Weights & Biases
Lukas Malik
In this post, I want to take a look at the paper "Neural Architecture Search Without Training" by Joseph Mellor, Jack Turner, Amos Storkey, and Elliot J. Crowley. The paper can be found here. The crux:
πŸ’‘The authors find a cheap estimate for the performance of a trained neural network without training the network.
My code is executable as a Colab Notebook in the link below. Let's dig in.

🧭 Introduction to Neural Architecture Search

Most neural architectures are handcrafted by experts in the field using some heuristics (e.g. Karpathy constant [1]) or combining the latest trends with a trial and error approach.
However, manually designing architectures can be quite costly and often gives way to suboptimal performance. Neural Architecture Search (NAS) wants to get the human out of the loop and automate this design process.
The first efforts in this field were based on reinforcement learning where a controller network is trained to come up with architectures that are then trained on the data. The objective of the controller network is to find a network that achieves the highest possible accuracy after training. The controller network takes over the role of an agent and the accuracy score of the trained network serves as a reward.
As you can probably already tell this approach is very slow and expensive as every step in the training process of the controller requires us to train a whole neural network so that we can assess its performance. In addition, we observe that the search space for NAS can get really big as both order and type of operation can be varied. NASBench-101 a popular benchmark for NAS only covers architectures with 9 vertices, 7 edges, and 3 valid operations leading to roughly 423k architectures [2].
There have been multiple approaches to address this issue. Some of these include learning stackable cells instead of networks [3], using weight sharing to reduce the number of learnable parameters [4], or making the search space differentiable, allowing us to use gradient descent for the architecture search [5].
This paper provides an alternative solution. Namely: what if we could find a cheap estimate for the accuracy of a trained neural network? Even if this estimate is not perfect, we could use it to discard bad architectures and save time that we could use on the remaining architectures.

πŸ€“ Explanation of the Metric

The metric is derived by investigating linear maps present at initialization. Linear maps can provide insight, how the input space is divided by the network.
If data points fall into the same linear region they are harder to break apart by the network. Linear maps are therefore an indicator of the flexibility of the architecture with respect to the data.
We can derive these linear maps from the overlap of ReLU activations (Rectified Linear unit activations) between two data points in a mini-batch. Each ReLU layer has its own activations which can be summarized in a binary code. The "lack of overlap" between the activations of two data points is given by the hamming distance between each ReLU layer for the two data points. We expect that more flexible architectures have fewer shared activations.
The first figure by the authors to see the connection between linear maps and ReLU activations:
We see a feed-forward neural network on the left with 2 layers with 3 nodes each. The input space is two-dimensional -x, y. Each layer contains three ReLu activations - A1, A2, A3, and B1, B2, B3 respectively. On the right side, you can see how the input space is divided after applying the ReLU function. 1. Corresponds to the separation after applying A1, A2, or A3. 2. is the separated input space. 2. Shows the Binary Code corresponding to that separation. 3. Depicts the separation for each activation in the second layer and 4. Again the resulting separation. (Illustration provided by the authors)
The main idea is that since ReLU activations describe linear separations of the input data if this separation widely varies between data points this might indicate a mismatch between architecture and data. As indicated in the illustration we can also view the ReLU activation as a binary code (values bigger than zero). If we do this we can quantify the difference between the activations of two data points using the hamming distance.
If we consider a mini batch of data X = \{ {x_{i} \} }_{i=1}^{N} and a network, we can define an indicator variable c_{i} that form a binary code that corresponds to a linear region. We can use the hamming distance between two data points d_{H}(c_{i},c_{j}) to measure how dissimilar the two points are. The hamming distance is the minimum number of changes needed to turn one binary code into the other. Or we could also say it is the number of positions where the two codes differ. The correspondence between binary codes can then be calculated by K_{H} with N_{A} as the number of ReLu activations in the network.
K_{H} = \begin{bmatrix} N_{A} - d_{H}(c_{1},c_{1}) & \dots & N_{A} - d_{H}(c_{1},c_{N}) \\ \vdots & \ddots & \vdots \\ N_{A} - d_{H}(c_{N},c_{1}) & \dots & N_{A} - d_{H}(c_{N},c_{N}) \end{bmatrix}
Plotting 𝐾𝐻 for the NAS-Bench 101 and NDS-Darts (another Architecture Benchmark) reveals that there are fewer off-diagonal elements for high-performing architectures.
The scoring metric s is therefore derived from the logarithm of the Kernel norm at initialization.
s = \log \lVert K_{H} \rVert
The authors show that this metric can be used to guide architecture search. In the following, I want to explore this metric with my own experiments.

πŸ§ͺ Experiments

My code is short and executable as a Colab Notebook here:
You can also check out the code from the official repository by the authors of NASWOT:

➑️ Original Code by the authors

In the following, I wanted to conduct my own experiments and reproduce some of the findings. I will start with the experiment setup.

How to derive the search space and the data?

For these experiments, I used NASBench-101, which defines a relatively small (423k Architectures) search space based on Convolutions commonly used for Image Classification. I used CIFAR-10 for my experiments which is also used by the authors. In contrast, I only used a subset of the dataset because I also wanted to see how the metric changes during training. NASBench-101 encodes Architectures as an Adjacency Matrix where each row corresponds to a Cell type (e.g. Max Pool or Conv-2D with 3x3 Filter etc.) and each Column indicates a connection in the computational graph.
One example for clarification. The example adjacency matrix m where the rows from top to down correspond to 'input', 'conv1x1-bn-relu', 'conv3x3-bn-relu', 'conv1x1-bn-relu', 'conv3x3-bn-relu' and 'output'.
m = \begin{bmatrix} 0 & 1 & 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 & 1 & 0 \\ 0 &0 &0 &1 &0 &1 \\ 0 &0 &0 &0 &1 &0 \\ 0 &0 &0 &0 &0 &1 \\ 0 &0 &0 &0 &0 &0 \\ \end{bmatrix}
Would lead to the following Architecture:
Example Architecture from NASBench-101
The Colab Notebook contains some boilerplate code to turn the Adjacency Matrix into an executable Pytorch model.

Implementing the Metric in Pytorch

We can implement the metric by registering a hook in Pytorch that is called in each ReLU layer of the network. Since we always compare two data points with another, K is initialized with the size (Batch Size x Batch Size)
# initialize K and forward and backward hook network.K = np.zeros((args.batch_size, args.batch_size))for name, module in network.named_modules(): if 'ReLU' in str(type(module)): module.register_backward_hook(counting_backward_hook) module.register_forward_hook(counting_forward_hook)
The forward hook uses a binary indicator that is True when the output of the ReLU layer is positive and False otherwise. As described above - hamming distance is then used as a measure of dissimilarity. The metric itself is then the logarithm of the determinant of K.
# methods are simplified (and commented) from https://github.com/BayesWatch/nas-without-training/blob/master/score_networks.py# this is K def counting_forward_hook(module, inp, out): inp = inp[0].view(inp[0].size(0), -1) x = (inp > 0).float() # binary indicator K = x @ x.t() K2 = (1.-x) @ (1.-x.t()) network.K = network.K + K.cpu().numpy() + K2.cpu().numpy() def counting_backward_hook(module, inp, out): module.visited_backwards = True# this is the logarithm of the determinant of K def hooklogdet(K, labels=None): s, ld = np.linalg.slogdet(K) return ld
Let's have a more detailed look at these functions:
1) The forward hook
K = x @ x.t() K2 = (1.-x) @ (1.-x.t())network.K = network.K + K.cpu().numpy() + K2.cpu().numpy()
For an example input matrix of ReLU activations (where every row is one Activation vector):
relu = \begin{bmatrix} 1 & 0 & 1 \\ 0 & 0 & 1 \\ 1 &0 &0 \\ \end{bmatrix}
We can rewrite the first line as x \cdot \mathbf{x}^\intercal = \sum_{k=1}^{N} x_{ik} x_{kj}^\intercal. Then K is
K = \begin{bmatrix} 2 & 1 & 1 \\ 1 & 1 & 0 \\ 1 &0 &1 \\ \end{bmatrix}
And the second line (1 - x) \cdot (1 - \mathbf{x}^\intercal) gives:
K2 = \begin{bmatrix} 1 & 1 & 1 \\ 1 & 2 & 1 \\ 1 &1 &2 \\ \end{bmatrix}
Then in the end we obtain a K for our network of:
Network.K = \begin{bmatrix} 3 & 2 & 2 \\ 2 & 3 & 1 \\ 2 &1 &3 \\ \end{bmatrix}
Where the diagonal is equal to the number of ReLU activation layers N_{A}. The first ReLU activation vector is as similar to the second layer as it is to the third with respect to its hamming distance. The second ReLU layer is more similar to the first than to the third. The third is also more similar to the first than to the second.
2) The logarithmic determinant
Because we already saw that the diagonal of Network.K is equal to the number of ReLU activation layers, the determinant can be used to measure the number of off-diagonal elements. Taking the logarithm of the determinant is used to prevent overflow for very large determinants. This is important as K depends on the number of ReLU layers and the batch size and can get very large.

We can score ~100 networks in under a minute

I randomly sampled 100 architectures and plotted the architecture with the K derived by the method described above. You can scroll through the architectures and have a look at the corresponding heatmap of K. Some architectures appear more frequently than others. This is the case, because the architecture matrix is sampled randomly and different architecture matrices can be reduced to the same architecture graph. In contrast to the authors who used a batch size of 128, I only used a batch size of 32. This speed up computation and did not deteriorate performance in my experiments.

Accuracy of trained Networks vs ReLU based Metric

Plotting the relationships between the ReLU derived metric and the accuracy of the trained network reveals a weak positive relationship with a Pearson correlation of 0.267. Instead of the Pearson correlation the authors of the original paper use Kendall's Tau/ rank correlation, because they are more interested in getting the right ranking and not so much in the linear relationship between their metric and the accuracy score.

The best vs worst network

By logging the results we can have a look at the best and worst architectures according to the metric.
On the left we find the best architecture according to our metric (trained accuracy of 0.937). To the right we see the worst architecture according to the metric (trained accuracy of 0.879).

🌊 Conclusion & similar Research

There seems to be a small relationship between the metric measuring local linear maps (derived from ReLU activations) and the accuracies of trained neural networks. The authors showed that this relationship also exists with other datasets and other architecture spaces. This is very interesting as the metric provides a huge speedup when performing architecture search and could be implemented as a fast filtering step. The field of "fast NAS" is quite new and there are some interesting developments.
Other interesting articles in this area include:
I hope I was able to share my excitement for this field and this amazing article.

πŸ“š References

[2] https://github.com/google-research/nasbench
[3] Zoph, B., Vasudevan, V., Shlens, J., & Le, Q. V. (2018). Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 8697-8710).
[4] Pham, H., Guan, M., Zoph, B., Le, Q., & Dean, J. (2018, July). Efficient neural architecture search via parameters sharing. In International Conference on Machine Learning (pp. 4095-4104). PMLR
[5] Liu, H., Simonyan, K., & Yang, Y. (2018). Darts: Differentiable architecture search. arXiv preprint arXiv:1806.09055.