Pruning BERT on a GLUE task
A quick presentation on fine-tuning and pruning a BERT model from Hugging Face on the CoLA task.
Created on July 15 | Last edited on November 18
Neural Network Pruning
Many state-of-the-art deep learning models are over-parameterized, which makes them hard to deploy. Techniques that compress models by reducing their size (number of parameters) can help reduce memory and compute requirements.
The idea of pruning is to reduce the size of a neural network in a systematic way. The typical approach (and the one I will show below) is to start with a large, accurate network and produce a smaller network with similar accuracy.
For an in-depth review of the state of neural network pruning, I suggest the paper What is the State of Neural Network Pruning?

The GLUE Benchmark
The General Language Understanding Evaluation (GLUE) benchmark consists of nine sentence- or sentence-pair language understanding tasks. The idea is that a useful natural language understanding (NLU) model should generalize across a variety of tasks.
Corpus of Linguistic Acceptability (CoLA)
In this example we will be focusing on a single task called Corpus of Linguistic Acceptability (CoLA). The task is to see if a neural network can judge whether a sentence is grammatical or not. The metric used to evaluate this task is the Matthews Correlation Coefficient (MCC). At a very high level, it measures performance on unbalanced binary classification, with values ranging from -1 to 1, where 0 corresponds to uninformed guessing.
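As a quick, made-up illustration (not from the original experiments), MCC can be computed with scikit-learn's matthews_corrcoef:

```python
from sklearn.metrics import matthews_corrcoef

# Hypothetical labels: 1 = grammatical, 0 = ungrammatical
y_true = [1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# 3 TP, 3 TN, 1 FP, 1 FN -> MCC = 0.5, i.e. clearly better than guessing (0)
print(matthews_corrcoef(y_true, y_pred))
```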
Experiments
Baseline
We used a pre-trained BERT (base) model from Hugging Face (written in PyTorch) and fine-tuned it for the CoLA task. The fine-tuning configuration was based on the configuration shared by FAIR, which can be found here.
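A minimal sketch of what such a baseline might look like with the transformers and datasets libraries; the hyperparameter values below are placeholders, not the FAIR configuration used in the experiments:

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

# CoLA has a single "sentence" column plus a 0/1 acceptability label
cola = load_dataset("glue", "cola").map(
    lambda ex: tokenizer(ex["sentence"], truncation=True, padding="max_length"),
    batched=True)

args = TrainingArguments(output_dir="bert-cola",   # placeholder values
                         num_train_epochs=3,
                         per_device_train_batch_size=32,
                         learning_rate=2e-5)
Trainer(model=model, args=args,
        train_dataset=cola["train"],
        eval_dataset=cola["validation"]).train()
```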
Pruning Setup
We experimented with several pruning configurations (sketched in code after the list):
- Local L1 Unstructured: prune each layer by removing the specified amount (fraction) of its currently unpruned weights with the lowest L1-norm.
- Local Random Unstructured: prune each layer by removing the specified amount of its currently unpruned weights at random.
- Global L1 Unstructured: prune the model as a whole by removing the specified amount of currently unpruned weights with the lowest L1-norm across all layers.
- Global Random Unstructured: prune the model as a whole by removing the specified amount of currently unpruned weights at random.
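A minimal sketch of these configurations with PyTorch's torch.nn.utils.prune module; for illustration it prunes the weights of every nn.Linear layer with amount=0.2:

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

# Prunable parameters: here, the weight of every nn.Linear in the model
params = [(m, "weight") for m in model.modules() if isinstance(m, nn.Linear)]

# Global L1 Unstructured: a single threshold computed across all layers at once
prune.global_unstructured(params, pruning_method=prune.L1Unstructured, amount=0.2)

# The other three configurations (one per experiment, not stacked):
# prune.global_unstructured(params, pruning_method=prune.RandomUnstructured, amount=0.2)
# for module, name in params:
#     prune.l1_unstructured(module, name, amount=0.2)      # Local L1 Unstructured
#     prune.random_unstructured(module, name, amount=0.2)  # Local Random Unstructured
```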
The pruning process consists of pruning the model, followed by fine-tuning it, as shown in the pseudocode below:
for i in 1 to K do:
    prune Net
    fine-tune Net [for N epochs]
end for
We kept the fine-tuning hyperparameters constant and experimented with two values, K = 1 and K = 3.
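A rough Python version of this loop, reusing params from the sketch above; fine_tune is a hypothetical helper wrapping the baseline training code:

```python
import torch.nn.utils.prune as prune

K = 3          # number of prune / fine-tune iterations
N = 3          # fine-tuning epochs per iteration (placeholder)
amount = 0.1   # fraction of the currently unpruned weights removed per iteration

for _ in range(K):
    # Prune: drop `amount` of the remaining weights, here globally by L1-norm
    prune.global_unstructured(params, pruning_method=prune.L1Unstructured,
                              amount=amount)
    # Fine-tune: recover accuracy while the pruning masks stay fixed
    fine_tune(model, epochs=N)  # hypothetical helper, e.g. a Trainer wrapper
```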
💡 Note that the effective size of the model scales (at the upper bound) by (1 − amount)^K, where amount is the fraction of remaining weights pruned at each iteration; e.g., with amount = 0.1 and K = 3, roughly 0.9³ ≈ 73% of the prunable weights remain.
Results
[Charts: MCC results for the pruning configurations; panels include amount=0.1 and 1 iteration]
Concluding Remarks
Based on the results, we can conclude that when removing parameters in an unstructured fashion, pruning based on the L1-norm gives the best results in terms of the metric we are optimizing.
Another observation is that the global approach is only slightly better than the local one, while the choice of L1-norm versus random pruning has a much more significant effect.
As expected, increasing the pruning amount reduces MCC; however, we can observe that pruning up to 30% incurs essentially no loss in accuracy. So in theory we have net gains: since the model is smaller, this would translate (in a perfect system) to fewer computations.
In practice, we only explored unstructured pruning, which means that to see any gains in computation and/or memory, the system we use needs to perform sparse operations efficiently and/or store the weights in an efficient sparse representation; this is not the case for the GPU we used.
It would also require rewriting the model with sparse operations (for PyTorch, this functionality is in beta). See this paper for an example of a custom system with sparse operations that showed significant performance gains.
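As a toy illustration (independent of BERT) of what PyTorch's beta sparse tensor support looks like:

```python
import torch

# A dense weight matrix with zeros standing in for pruned connections
w = torch.randn(4, 4)
w[w.abs() < 0.5] = 0.0

w_sparse = w.to_sparse()            # COO representation: only non-zeros stored
x = torch.randn(4, 2)
y = torch.sparse.mm(w_sparse, x)    # sparse-dense matrix multiply

assert torch.allclose(y, w @ x)     # same result as the dense computation
```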
Due to lack of time, I didn't experiment with structured sparsity. It would be interesting to repeat the experiments above with structured pruning, given that current systems are better designed to exploit structured patterns to save computation and memory. See, for example, NVIDIA's A100 GPUs, which introduced 2:4 structured sparsity (here and here). We might expect to see a drop in accuracy, since this approach is probably a mid-point between the L1-norm and random approaches.