Summary
Scope of Reproducibility
For a fixed parameter count and compute budget, the proposed algorithm (RigL) claims to directly train sparse networks that match or exceed the performance of existing densetosparse training techniques (such as pruning). RigL does so while requiring constant Floating Point Operations (FLOPs) throughout training. The technique obtains stateoftheart performance on a variety of tasks, including image classification and characterlevel languagemodelling.
Methodology
We implement RigL from scratch in Pytorch using boolean masks to simulate unstructured sparsity. We rely on the description provided in the original paper, and referred to the authors' code for only specific implementation detail such as handling overflow in ERK initialization. We evaluate sparse training using RigL for WideResNet222 on CIFAR10 and ResNet50 on CIFAR100, requiring 2 hours and 6 hours respectively per training run on a GTX 1080 GPU.
Results
We reproduce RigL's performance on CIFAR10 within 0.1% of the reported value. On both CIFAR10/100, the central claim holdsgiven a fixed training budget, RigL surpasses existing dynamicsparse training methods over a range of target sparsities. By training longer, the performance can match or exceed iterative pruning, while consuming constant FLOPs throughout training. We also show that there is little benefit in tuning RigL's hyperparameters for every sparsity, initialization pairthe reference choice of hyperparameters is often close to optimal performance.
Going beyond the original paper, we find that the optimal initialization scheme depends on the training constraint. While the ErdosRenyiKernel distribution outperforms Random distribution for a fixed parameter count, for a fixed FLOP count, the latter performs better. Finally, redistributing layerwise sparsity while training can bridge the performance gap between the two initialization schemes, but increases computational cost.
What was easy
The authors provide code for most of the experiments presented in the paper. The code was easy to run and allowed us to verify the correctness of our reimplementation. The paper also provided a thorough and clear description of the proposed algorithm without any obvious errors or confusing exposition.
What was difficult
Tuning hyperparameters involved multiple random seeds and took longer than anticipated. Verifying the correctness of a few baselines was tricky and required ensuring that the optimizer's gradient (or momentum) buffers were sparse (or dense) as specified by the algorithm. Compute limits restricted us from evaluating on larger datasets such as Imagenet.
Communication with original authors
We had responsive communication with the original authors, which helped clarify a few implementation and evaluation details, particularly regarding the FLOP counting procedure.
Introduction
Sparse neural networks are a promising alternative to conventional dense networks—having comparatively greater parameter efficiency and lesser floatingpoint operations (FLOPs) during inference. However, their advantage is not as significant while training because present sparse training techniques involve multiple cycles of dense network training and subsequent pruning.
In the paper by
Evci et al, the authors propose RigL, an algorithm for training sparse networks from scratch. The proposed method outperforms both prior art in training sparse networks, as well as existing densetosparse training algorithms. By utilising dense gradients only during connectivity updates and avoiding any global sparsity redistribution, RigL can maintain a fixed computational cost and parameter count throughout training.
As a part of the ML Reproducibility Challenge, we replicate RigL from scratch and investigate if dynamicsparse training confers significant practical benefits compared to existing sparsifying techniques.
Scope of Reproducibility
We focus on the following target questions:
We investigate the sensitivity of RigL to two additional hyperparameters which it requires.
How does the choice of sparsity initialization affect the final performance for a fixed parameter count and a fixed training budget?
Does redistributing layerwise sparsity during connection updates improve RigL’s performance and can the final layerwise distribution serve as a good sparsity initialization scheme?
Methodology
The authors provide publicly accessible
code written in Tensorflow. We replicated the implementation in Pytorch by extending the
code of Dettmers and Zettlemoyer which uses a boolean mask to simulate sparsity.
Mask Initialization
For a network with $L$ layers and total parameters $N$, we associate each layer with a random boolean mask of sparsity $s_l$, $l \in [ L ]$. The overall sparsity of the network is given by $S=\frac{\sum_l s_l N_l}{N}$, where $N_l$ is the parameter count of layer $l$. Sparsities $s_l$ are determined by the one of the following mask initialization strategies:

Uniform Each layer has the same sparsity, i.e., $s_l = S ; \forall l$. Similar to the original authors, we keep the first layer dense in this initialization.

ErdosRenyi (ER) Following SET (Mocanu et al. [2018]), we set $s_l \propto \left(1  \frac{C_\text{in} + C_\text{out}}{C_\text{in} \times C_\text{out}} \right)$, where $C_\text{in}, C_\text{out}$ are the in and out channels for a convolutional layer and input and output dimensions for a fullyconnected layer.

ErdosRenyiKernel (ERK) Modifies the sparsity rule of convolutional layers in ER initialization to include kernel height and width, i.e., $s_l \propto \left(1  \frac{C_\text{in} + C_\text{out} + w + h}{C_\text{in} \times C_\text{out} \times w \times h} \right)$, for a convolutional layer with $C_\text{in} \times C_\text{out} \times w \times h$ parameters.
Bias and BatchNorm layers are not sparsified since these have a negligible impact on parameter count.
Mask Updates
Every ∆T training steps, certain connections are discarded, and an equal number are grown. Unlike
SNFS (Dettmers and Zettlemoyer [2020]), there is no redistribution of layerwise sparsity, resulting in constant FLOPs throughout training.
Pruning Strategy
Similar to SET and SNFS, RigL prunes $f$ fraction of smallest magnitude weights in each layer. As detailed below, the fraction $f$ is decayed across mask update steps, by cosine annealing:
f(t) = \frac{\alpha}{2} \left(1 + \cos \left(\frac{t\pi}{T_\text{end}} \right) \right)
where, $\alpha$ is the initial pruning rate and $T_\text{end}$ is the training step after which mask updates are ceased.
Growth Strategy
RigL’s novelty lies in how connections are grown: during every mask update, k connections having the largest absolute gradients among current inactive weights (previously zero + pruned) are activated. Here,k is chosen to be the number of connections dropped in the prune step. This requires access to dense gradients at each mask update step. Since gradients are not accumulated (unlike SNFS), RigL does not require access to dense gradients at every step. Following the paper, we initialize newly activated weights to zero.
Experimental Settings
Models
For experiments on CIFAR10, we use a Wide Residual Network (Zagoruyko and Komodakis [2016]) with depth 22 and width multiplier 2, abbreviated as WRN222. For experiments on CIFAR100, we use a modified variant of ResNet50, with the initial 7×7 convolution replaced by two 3×3 convolutions.
Datasets and Training
We conduct experiments on CIFAR10 and CIFAR100, with a train/val/test split of 45k/5k/10k image samples. In comparison, the authors did not use a validation set and used a train/test split with 50k/10k samples. This causes a slight performance discrepancy (dense baseline has a test accuracy of 93.4% vs 94.1%).
On both datasets, we train models for 250 epochs each, optimized by SGD with momentum. Our training pipeline uses standard data augmentation, which includes random flips and crops. When training on CIFAR100, we additionally include a learning rate warmup for 2 epochs and label smoothening of 0.1.
Hyperparameters
RigLincludes two additional hyperparameters (α, ∆T) in comparison to regular dense network training. For our experiments, we set α= 0.3, ∆T= 100, based on the original paper. Optimizer specific hyperparameters—learning rate, learning rate schedule, and momentum—are also set according to the original paper. In later sections, we also tune these hyperparameters using
Optuna. Additionally, we examine whether tuning the learning rate for each sparsity offers any significant benefit.
Baselines
We compare RigL against various baselines in our experiments:
SET (Mocanu et al. [2018]),
SNFS (Dettmers and Zettlemoyer [2020]), and Magnitudebased Iterativepruning (
Zhu and Gupta [2018]). We also compare against two weaker baselines, viz.,Static Sparse training and SmallDense networks. The latter has the same structure as the dense model but uses fewer channels in convolutional layers to lower parameter count. We implement iterative pruning with the pruning interval kept same as the masking interval for a fair comparison.
Computational Requirements
We run our experiments on a SLURM cluster node—equipped with 4 NVIDIA GTX1080 GPUs and a 32 core Intel CPU. Each experiment on CIFAR10 and CIFAR100 consumes about 1.6 GB and 7 GB of VRAM respectively and is run for 3 random seeds to capture performance variance. We require about 6 and 8 days of total compute time to produce all results, including hyperparameter sweeps and extended experiments, on CIFAR10 and CIFAR100 respectively.
Results
WideResNet22 on CIFAR10
Test Accuracy vs Sparsity on CIFAR10, plotted for Random initialization(left), ERK initialization(center), and for training 2× longer(right). Owing to random growth, SET can be unstable when training for longer durations with higher sparsities. Overall, RigL2×(ERK) achieves highest test accuracy.
ResNet50 on CIFAR100
Benchmarking sparse ResNet50s on CIFAR100 plotted across densities(right). RigL outperforms or matches existing sparsetosparse and densetosparse methods. Notably, RigL3× at 90% sparsity and RigL2× at 80% sparsity surpass iterative pruning with similar FLOP consumption. RigL2× (ERK) further improves performance but requires a larger training budget.
Hyperparameter Tuning
Learning Rate vs Sparsity on CIFAR10. Runs using a learning rate > 0.1 do not converge and are not plotted here. There is little benefit in tuning the learning rate for each sparsity, and 0.1, 0.05 are good choices overall.
Results Beyond Original Paper
Sparsity Distribution vs FLOP Consumption
Test Accuracy vs FLOP consumption of WideResNet222 on CIFAR10 and ResNet50 on CIFAR100, compared for Random and ERK initializations. For the same FLOP budget, models trained with ERK initialization must be more sparse, resulting in inferior performance.
Effect of Redistribution
Effect of redistribution on RigL’s performance, evaluated using WideResNet222 on CIFAR10 at 80% sparsity. (left) FLOPs required per forward pass, shown relative to the dense baseline, rises quickly and saturates within a few epochs (~10k steps) for both sparse gradient and sparse momentum based redistribution. (right) Comparison of the final density distribution against Random and ERK counterparts. “b” refers to block and “l” layer here.
Discussion
Evaluated on image classification, the central claims of the paper hold true  RigL outperforms existing sparsetosparse training methods and can also surpass other densetosparse training methods with extended training. RigL is fairly robust to its choice of hyperparameters, as they can be set independent of sparsity or initialization. We find that the choice of initialization has a greater impact on the final performance and compute requirement than the method itself. Considering the performance boost obtained by redistribution, proposing distributions that attain maximum performance given a FLOP budget could be an interesting future direction.
Finally, our early experiments show that adapting RigL to promote structured sparsity can obtain practical speedups in existing hardware accelerators.
For computational reasons, our scope is restricted to small datasets such as CIFAR10/100. RigL's applicability outside image classificationin Computer Vision and beyond (machine translation etc.) is not covered here.