Rigging the Lottery: Making All Tickets Winners

Submission for the ML Reproducibility Challenge 2020 for the paper: Utku Evci, Trevor Gale, Jacob Menick, Pablo Samuel Castro, and Erich Elsen. Rigging the lottery: Making all tickets winners. In Proceedings of the 37th International Conference on Machine Learning (ICML), July 2020.
Varun Sundar
OpenReview | Github

Summary

Scope of Reproducibility

For a fixed parameter count and compute budget, the proposed algorithm (RigL) claims to directly train sparse networks that match or exceed the performance of existing dense-to-sparse training techniques (such as pruning). RigL does so while requiring constant Floating Point Operations (FLOPs) throughout training. The technique obtains state-of-the-art performance on a variety of tasks, including image classification and character-level language-modelling.

Methodology

We implement RigL from scratch in PyTorch, using boolean masks to simulate unstructured sparsity. We rely on the description provided in the original paper and refer to the authors' code only for specific implementation details, such as handling overflow in ERK initialization. We evaluate sparse training using RigL for WideResNet-22-2 on CIFAR-10 and ResNet-50 on CIFAR-100, requiring roughly 2 hours and 6 hours per training run respectively on a GTX 1080 GPU.

Results

We reproduce RigL's performance on CIFAR-10 within 0.1% of the reported value. On both CIFAR-10/100, the central claim holds: given a fixed training budget, RigL surpasses existing dynamic sparse training methods over a range of target sparsities. By training longer, its performance can match or exceed that of iterative pruning, while consuming constant FLOPs throughout training. We also show that there is little benefit in tuning RigL's hyperparameters for every (sparsity, initialization) pair: the reference hyperparameters often yield near-optimal performance.
Going beyond the original paper, we find that the optimal initialization scheme depends on the training constraint. While the Erdős-Rényi-Kernel (ERK) distribution outperforms the Random distribution for a fixed parameter count, the Random distribution performs better for a fixed FLOP count. Finally, redistributing layer-wise sparsity during training can bridge the performance gap between the two initialization schemes, but increases the computational cost.

What was easy

The authors provide code for most of the experiments presented in the paper. The code was easy to run and allowed us to verify the correctness of our re-implementation. The paper also provided a thorough and clear description of the proposed algorithm without any obvious errors or confusing exposition.

What was difficult

Tuning hyperparameters involved multiple random seeds and took longer than anticipated. Verifying the correctness of a few baselines was tricky and required ensuring that the optimizer's gradient (or momentum) buffers were sparse (or dense) as specified by the algorithm. Compute limits restricted us from evaluating on larger datasets such as ImageNet.

Communication with original authors

We had responsive communication with the original authors, which helped clarify a few implementation and evaluation details, particularly regarding the FLOP counting procedure.

Introduction

Sparse neural networks are a promising alternative to conventional dense networks, offering greater parameter efficiency and requiring fewer floating-point operations (FLOPs) during inference.
However, this advantage is less significant during training, because most existing sparse training techniques involve multiple cycles of dense network training and subsequent pruning.
The authors of a recent paper (Evci et al. [2020]) propose RigL: an algorithm for training sparse networks from scratch. The proposed method outperforms both prior sparse-to-sparse training methods and existing dense-to-sparse training algorithms. By utilizing dense gradients only during connectivity updates and avoiding any global sparsity redistribution, RigL can maintain a fixed computational cost and parameter count throughout training.
As a part of the ML Reproducibility Challenge, we replicate RigL from scratch and investigate if dynamic-sparse training confers significant practical benefits compared to existing sparsifying techniques.

Scope of Reproducibility

We focus on the following target questions:
  1. Does RigL outperform existing sparse-to-sparse training techniques—such as SET (Mocanu et al. [2018]) and SNFS (Dettmers and Zettlemoyer [2020])—and match the accuracy of dense-to-sparse training methods such as iterative pruning (Zhu and Gupta [2018])?
  2. How sensitive is RigL to the two additional hyperparameters that it introduces?
  3. How does the choice of sparsity initialization affect the final performance for a fixed parameter count and a fixed training budget?
  4. Does redistributing layer-wise sparsity during connection updates improve RigL’s performance? Additionally, can the final layer-wise distribution serve as a good sparsity initialization scheme?

Methodology

The authors provide publicly accessible code written in TensorFlow. We replicated the implementation in PyTorch by extending the code of Dettmers and Zettlemoyer [2020], which uses boolean masks to simulate sparsity.

Mask Initialization

For a network with L layers and total parameter count N, we associate each layer with a random boolean mask of sparsity s_l, l \in [ L ]. The overall sparsity of the network is given by S=\frac{\sum_l s_l N_l}{N}, where N_l is the parameter count of layer l. The sparsities s_l are determined by one of the following mask initialization strategies: Random (Uniform), where every layer receives the same sparsity s_l = S, or Erdős-Rényi-Kernel (ERK), which scales the layer-wise sparsity with the layer's dimensions so that smaller layers are kept denser.
Bias and BatchNorm layers are not sparsified, since these have a negligible impact on the parameter count.
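As a concrete illustration, below is a minimal PyTorch sketch (not the authors' implementation) of initializing a random boolean mask at a target layer-wise sparsity; the layer and sparsity used in the usage example are arbitrary.

```python
import torch

def random_mask(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Boolean mask keeping a (1 - sparsity) fraction of positions, chosen uniformly at random."""
    n_total = weight.numel()
    n_dense = round(n_total * (1.0 - sparsity))
    perm = torch.randperm(n_total, device=weight.device)
    mask = torch.zeros(n_total, dtype=torch.bool, device=weight.device)
    mask[perm[:n_dense]] = True
    return mask.view_as(weight)

# Usage: sparsify a convolutional layer to 90% sparsity by zeroing masked-out weights.
conv = torch.nn.Conv2d(16, 32, kernel_size=3)
mask = random_mask(conv.weight, sparsity=0.9)
with torch.no_grad():
    conv.weight.mul_(mask)
```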

Mask Updates

Every ∆T training steps, certain connections are discarded, and an equal number are grown. Unlike SNFS (Dettmers and Zettlemoyer [2020]), there is no redistribution of layer-wise sparsity, resulting in constant FLOPs throughout training.

Pruning Strategy

Similar to SET and SNFS, RigL prunes a fraction f of the smallest-magnitude weights in each layer. The fraction f is decayed across mask-update steps by cosine annealing:
f(t) = \frac{\alpha}{2} \left(1 + \cos \left(\frac{t\pi}{T_\text{end}} \right) \right)
where \alpha is the initial pruning rate and T_\text{end} is the training step after which mask updates cease.
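For illustration, here is a small sketch of this schedule in Python; α = 0.3 matches the setting used in our experiments, while the value of T_end is an arbitrary placeholder.

```python
import math

def prune_fraction(t: int, alpha: float = 0.3, t_end: int = 75_000) -> float:
    """Cosine-annealed fraction of active weights dropped at mask-update step t."""
    if t >= t_end:
        return 0.0  # mask updates cease after T_end
    return (alpha / 2.0) * (1.0 + math.cos(t * math.pi / t_end))

# The drop fraction starts at alpha and decays towards zero as t approaches T_end.
print(prune_fraction(0))        # 0.30
print(prune_fraction(37_500))   # 0.15
print(prune_fraction(74_999))   # ~0.00
```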

Growth Strategy

RigL’s novelty lies in how connections are grown: during every mask update, k connections having the largest absolute gradients among current inactive weights (previously zero + pruned) are activated. Here, k is chosen to be the number of connections dropped in the prune step. This requires access to dense gradients at each mask update step. Since gradients are not accumulated (unlike SNFS), RigL does not require access to dense gradients at every step. Following the paper, we initialize newly activated weights to zero.
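The following is a hedged single-layer sketch of such a drop-and-grow update in PyTorch, written from the description above rather than from the authors' code; f is the annealed drop fraction, and the toy tensors in the usage example are placeholders.

```python
import torch

def rigl_layer_update(weight: torch.Tensor, grad: torch.Tensor,
                      mask: torch.Tensor, f: float) -> torch.Tensor:
    """Drop the k smallest-magnitude active weights, then grow the k inactive
    positions with the largest gradient magnitude (k = f * number of active weights)."""
    w, g, m = weight.flatten(), grad.flatten(), mask.flatten().clone()
    k = int(f * int(m.sum()))
    if k == 0:
        return mask

    # Drop: deactivate the k active weights with the smallest magnitude.
    drop_scores = torch.where(m, w.abs(), torch.full_like(w, float('inf')))
    m[torch.topk(drop_scores, k, largest=False).indices] = False

    # Grow: among inactive positions (previously zero + just dropped),
    # activate the k with the largest absolute gradient.
    grow_scores = torch.where(m, torch.full_like(g, float('-inf')), g.abs())
    grow_idx = torch.topk(grow_scores, k, largest=True).indices
    m[grow_idx] = True

    # Newly activated weights are initialized to zero, as in the paper.
    with torch.no_grad():
        weight.view(-1)[grow_idx] = 0.0
    return m.view_as(mask)

# Toy usage with random tensors standing in for a layer's weight and dense gradient.
w, g = torch.randn(64, 64), torch.randn(64, 64)
m0 = torch.rand_like(w) > 0.9            # roughly 90% sparse boolean mask
m1 = rigl_layer_update(w, g, m0, f=0.3)
assert m0.sum() == m1.sum()              # the parameter count stays fixed
```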

Experimental Settings

Models

For experiments on CIFAR-10, we use a Wide Residual Network (Zagoruyko and Komodakis [2016]) with depth 22 and width multiplier 2, abbreviated as WRN-22-2. For experiments on CIFAR-100, we use a modified variant of ResNet-50, with the initial 7×7 convolution replaced by two 3×3 convolutions.

Datasets and Training

We conduct experiments on CIFAR-10 and CIFAR-100, with a train/val/test split of 45k/5k/10k image samples. In comparison, the authors did not use a validation set and used a train/test split with 50k/10k samples. This causes a slight performance discrepancy (dense baseline has a test accuracy of 93.4% vs 94.1%).
On both datasets, we train models for 250 epochs, optimized by SGD with momentum. Our training pipeline uses standard data augmentation, including random flips and crops. When training on CIFAR-100, we additionally use a learning-rate warmup for 2 epochs and label smoothing of 0.1.
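A minimal sketch of this training setup in PyTorch is shown below; the normalization statistics, weight decay, and the stock torchvision ResNet-50 (standing in for our modified variant) are assumptions made for the example.

```python
import torch
import torchvision
import torchvision.transforms as T

# Standard CIFAR augmentation: pad-and-crop plus random horizontal flips.
train_transform = T.Compose([
    T.RandomCrop(32, padding=4),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
    # Approximate CIFAR-100 channel statistics (illustrative values).
    T.Normalize(mean=(0.5071, 0.4865, 0.4409), std=(0.2673, 0.2564, 0.2762)),
])
train_set = torchvision.datasets.CIFAR100(root="./data", train=True, download=True,
                                          transform=train_transform)

# Label smoothing of 0.1 (natively supported in recent PyTorch releases).
criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.1)

# SGD with momentum; the exact values here are illustrative placeholders.
model = torchvision.models.resnet50(num_classes=100)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)
```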

Hyperparameters

RigL includes two additional hyperparameters (α, ∆T) compared to regular dense-network training. For our experiments, we set α = 0.3 and ∆T = 100, following the original paper. Optimizer-specific hyperparameters (learning rate, learning-rate schedule, and momentum) are also set according to the original paper. In later sections, we also tune these hyperparameters using Optuna. Additionally, we examine whether tuning the learning rate for each sparsity offers any significant benefit.
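For reference, a hedged sketch of how such a search could be set up with Optuna is shown below; train_and_evaluate is a hypothetical helper that trains a sparse network with the sampled hyperparameters and returns validation accuracy.

```python
import optuna

# (alpha, delta_T) pairs from the grid described in the Hyperparameter Tuning section.
ALPHA_DT_PAIRS = [(0.3, 100), (0.4, 200), (0.4, 500), (0.5, 750)]

def objective(trial: optuna.Trial) -> float:
    lr = trial.suggest_categorical("lr", [0.1, 0.05, 0.01, 0.005])
    pair_idx = trial.suggest_categorical("alpha_dt_idx", [0, 1, 2, 3])
    alpha, delta_t = ALPHA_DT_PAIRS[pair_idx]
    # Hypothetical helper: trains RigL with these settings and returns validation accuracy.
    return train_and_evaluate(lr=lr, alpha=alpha, delta_t=delta_t)

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)
print(study.best_params)
```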

Baselines

We compare RigL against various baselines in our experiments: SET (Mocanu et al. [2018]), SNFS (Dettmers and Zettlemoyer [2020]), and magnitude-based iterative pruning (Zhu and Gupta [2018]). We also compare against two weaker baselines, viz., Static Sparse training and Small-Dense networks. The latter has the same structure as the dense model but uses fewer channels in convolutional layers to lower the parameter count. We implement iterative pruning with the pruning interval kept the same as the masking interval for a fair comparison.
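The iterative-pruning baseline of Zhu and Gupta [2018] ramps the target sparsity with a cubic schedule; a hedged sketch of that schedule is shown below, with the step counts chosen only for illustration.

```python
def gradual_sparsity(step: int, s_initial: float = 0.0, s_final: float = 0.9,
                     t_start: int = 0, t_end: int = 75_000) -> float:
    """Target sparsity at a training step under the cubic schedule of Zhu and Gupta [2018]."""
    if step < t_start:
        return s_initial
    if step >= t_end:
        return s_final
    progress = (step - t_start) / (t_end - t_start)
    return s_final + (s_initial - s_final) * (1.0 - progress) ** 3

# Sparsity ramps up quickly early in training and flattens out near t_end.
print(gradual_sparsity(0))        # 0.0
print(gradual_sparsity(37_500))   # 0.7875
print(gradual_sparsity(75_000))   # 0.9
```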

Computational Requirements

We run our experiments on a SLURM cluster node equipped with 4 NVIDIA GTX 1080 GPUs and a 32-core Intel CPU. Each experiment on CIFAR-10 and CIFAR-100 consumes about 1.6 GB and 7 GB of VRAM respectively, and is run with 3 random seeds to capture performance variance. We require about 6 and 8 days of total compute time to produce all results on CIFAR-10 and CIFAR-100 respectively, including hyperparameter sweeps and extended experiments.

Results

We find RigL to be an effective sparsification technique on both CIFAR-10 and CIFAR-100: it surpasses previous dynamic sparse learning methods while matching or exceeding iterative pruning.

WideResNet-22-2 on CIFAR-10

Test Accuracy vs Sparsity on CIFAR-10, plotted for Random initialization (left), ERK initialization (center), and training 2× longer (right).
While SET improves over the performance of static sparse networks and small-dense networks, methods utilizing gradient information (SNFS, RigL) obtain better test accuracies. SNFS can outperform RigL, but requires a much larger training budget, since it (a) requires dense gradients at each training step and (b) redistributes layer-wise sparsity during mask updates.
For all sparse methods, excluding SNFS, using ERK initialization improves performance, but with increased FLOP consumption.

ResNet-50 on CIFAR-100

Benchmarking sparse ResNet-50s on CIFAR-100 plotted across sparsities.
RigL outperforms or matches existing sparse-to-sparse and dense-to-sparse methods, especially when trained longer. Notably, RigL3× at 90% sparsity and RigL2× at 80% sparsity surpass iterative pruning with similar FLOP consumption. RigL2× (ERK) also improves performance but requires a larger training budget.

Hyperparameter Tuning

Hyperparameter tuning (learning rate vs sparsity) on CIFAR-10.
We further examine if the final performance improves by tuning the learning rate (η) individually for each sparsity-initialization pair. We employ a grid search over η ∈ {0.1, 0.05, 0.01, 0.005} and (α, ∆T ) ∈ {(0.3, 100), (0.4, 200), (0.4, 500), (0.5, 750)}.
We find that η = 0.1 and η = 0.05 are close to optimal for a wide range of sparsities and initializations; as such, the learning rate can be tuned independently of the target sparsity.

Results Beyond Original Paper

Sparsity Distribution vs FLOP Consumption

Test Accuracy vs FLOP consumption compared for Random and ERK initializations: WideResNet-22-2 on CIFAR-10 and ResNet-50 on CIFAR-100.
As seen earlier, ERK initialization consistently outperforms Random initialization for a fixed parameter count. This, however, comes at the cost of higher FLOP consumption. Interestingly, we find that when adjusted for FLOP consumption, Random initialization actually outperforms ERK initialization.
Another takeaway here is that FLOP consumption and parameter count can be at odds with each other: we also see this with SNFS, which obtains superior performance for a given sparsity but is much more compute-intensive.
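To see why the two can diverge, note that the inference FLOPs of an unstructured-sparse convolution are commonly estimated by scaling the dense FLOPs by the layer's density; because early layers act on large feature maps, keeping them denser (as ERK tends to) is disproportionately expensive in FLOPs. The sketch below uses illustrative shapes, not the actual networks' dimensions.

```python
def sparse_conv_flops(c_in: int, c_out: int, k: int, h_out: int, w_out: int,
                      density: float) -> float:
    """Approximate multiply-accumulate count of a sparse conv layer (dense FLOPs x density)."""
    return c_in * c_out * k * k * h_out * w_out * density

# Two layers with identical parameter counts and densities, but different feature-map sizes:
early = sparse_conv_flops(16, 16, 3, 32, 32, density=0.5)  # early layer, 32x32 feature map
late  = sparse_conv_flops(16, 16, 3, 8, 8, density=0.5)    # late layer, 8x8 feature map
print(early / late)  # 16.0 -- same number of active weights, 16x the FLOPs
```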

Effect of Redistribution

Effect of redistribution on RigL’s computational cost, evaluated using WideResNet-22-2 on CIFAR-10 at 80% sparsity.
We also study the impact of redistribution on RigL's performance, while also noting that the originally proposed algorithm avoids redistribution to obtain constant FLOPs throughout training. Indeed, as the above figure shows, redistribution greatly increases computational cost. Interestingly, the layer-wise sparsity distribution becomes steady after a few iterations and resembles a more extreme version of ERK initialization.
Even more intriguing is the fact that redistribution only seems to boost the performance of RigL (Random), but not RigL (ERK), as seen in Table 5 of the report.

Discussion

Evaluated on image classification, the central claims of the paper hold true: RigL outperforms existing sparse-to-sparse training methods and can also surpass other dense-to-sparse training methods with extended training.
Compared to prior methods, we also find RigL to be a simpler yet effective approach: the algorithm does not involve any layer-wise sparsity redistribution and simply updates each layer's mask based on weight magnitudes and gradient information.
RigL is fairly robust to the choice of its hyperparameters, as they can be set independently of the sparsity or initialization. We find that the choice of initialization has a greater impact on the final performance and compute requirement than the method itself. Considering the performance boost obtained by redistribution, proposing sparsity distributions that maximize performance under a given FLOP budget could be an interesting future direction.
For computational reasons, our scope is restricted to small datasets such as CIFAR-10/100. RigL's applicability to tasks beyond image classification, both within computer vision and beyond (e.g., machine translation), is not covered here.