Corrosion segmentation
Report on training experiments to make the corrosion model as light as possible while maintaining good performance.
Created on March 13 | Last edited on March 20
This report is dedicated to the compression of a corrosion segmentation model.
A full report dedicated to the development of the main model is available here. It contains information on the problem definition, dataset analysis, training loops, methodology, training infrastructure and results visualization. Code is available here: Geosciences Github code
TL;DR
We compress a 6.4M-parameter teacher UNET with good performance (78.3% validation dice score) into a lightweight distilled 576k-parameter UNET whose weights are quantized per layer over 8 bits.
This whole "compression process" yields similar performance, with no significant drop on the validation set...
and the compressed model can indeed be stored in 550 kB on disk.
The final compressed model achieves a 66.385% score on the private test set.
💡

Summary: Distillation + Quantization leads to <1 MB of weights and a reduction of 1% in the dice score compared to the best teacher model (which achieves an 81.6% dice score on the train set and 78.2% on the validation set). The full notebook to generate this figure is available here.
Summary
We introduce a reference model, which is the best one we've made. This large segmentation model is a UNET with 6.4M parameters (24.6 MB stored as float32). Experiment 53 achieves a 79% dice metric on the validation set.
It is trained using a Dice+BCE loss with batches of size 128 and a learning rate of 5e-4 with a plateau LR scheduler (reduce the LR by a factor of 0.8 with a patience of 10 epochs).
While training, it occupies nearly 90% of the GPU compute and 14.6 GB of the 16 GB of GPU RAM of the Tesla P100 available through Kaggle Kernels.
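As a rough sketch (not the project's actual training script), the loss, optimizer and plateau scheduler described above could be set up as follows in PyTorch; the exact Dice formulation, the equal Dice/BCE weighting and the placeholder model are assumptions:

```python
import torch
import torch.nn as nn

def dice_loss(logits, targets, eps=1e-6):
    """Soft Dice loss on sigmoid probabilities (formulation assumed, not taken from the report)."""
    probs = torch.sigmoid(logits)
    inter = (probs * targets).sum(dim=(1, 2, 3))
    union = probs.sum(dim=(1, 2, 3)) + targets.sum(dim=(1, 2, 3))
    return 1.0 - ((2.0 * inter + eps) / (union + eps)).mean()

bce = nn.BCEWithLogitsLoss()

def dice_bce_loss(logits, targets):
    # Dice + BCE mixture (equal weighting assumed here)
    return dice_loss(logits, targets) + bce(logits, targets)

model = nn.Conv2d(3, 1, kernel_size=3, padding=1)  # placeholder for the 6.4M-parameter UNET
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
# Reduce the LR by a factor of 0.8 when the monitored validation dice plateaus for 10 epochs
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="max", factor=0.8, patience=10)

# In the epoch loop (batches of 128): compute dice_bce_loss, backpropagate,
# then call scheduler.step(val_dice) with the epoch's validation dice score.
```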
We then try to shrink the model through knowledge distillation to reach a much smaller architecture, which naturally has 3 major advantages:
- Fewer operations, hence faster inference
- Lower RAM usage
- Smaller weight storage on disk.
We are able to achieve performance similar to the UNET 6.4M with only 8.2% of the initial weights, using a UNET with 576k parameters trained with knowledge distillation. The model weighs only 2 MB on disk.
There are still 2 options to decrease the model size:
- Further static weight quantization can be performed on the student model to reduce its storage size (this reduces neither the execution time nor the RAM consumption). This is what we choose! A back-of-the-envelope size estimate is sketched right after this list.
- We can shrink the architecture even further using a tinier UNET 101k, which weighs only 400 kB on disk without quantization.
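To get an intuition for these disk footprints, here is a quick back-of-the-envelope estimate (raw weight storage only; real checkpoints add a little metadata, so the reported sizes differ slightly):

```python
def storage_kb(n_params: int, bits_per_weight: int) -> float:
    """Approximate raw weight storage in kB, ignoring file headers and metadata."""
    return n_params * bits_per_weight / 8 / 1e3

print(storage_kb(6_400_000, 32))  # teacher UNET 6.4M, float32 -> 25600.0 kB
print(storage_kb(576_000, 32))    # student UNET 576k, float32 -> 2304.0 kB (~2.2 MB)
print(storage_kb(576_000, 8))     # student UNET 576k, int8    -> 576.0 kB
print(storage_kb(101_000, 32))    # tiny UNET 101k, float32    -> 404.0 kB (~400 kB)
```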
Searching for the best model
We try to find the "best" teacher model by carefully tuning everything we can.
Here are the ingredients for the best model.
- More parameters, larger receptive field
- Dice+BCE loss optimization
- Adam LR 4E-5 + Plateau scheduler
- Augmentations (horizontal roll, horizontal / vertical flips) - see the sketch after this list
- Large batch size 128
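Below is a minimal sketch of what these augmentations can look like on (image, mask) torch tensor pairs; the channel-first layout and the 50% flip probabilities are assumptions, and the project's actual pipeline may differ:

```python
import random
import torch

def augment(image: torch.Tensor, mask: torch.Tensor):
    """Horizontal roll + horizontal/vertical flips, applied identically to image and mask."""
    # Random horizontal roll (circular shift along the width axis)
    shift = random.randint(0, image.shape[-1] - 1)
    image, mask = torch.roll(image, shift, dims=-1), torch.roll(mask, shift, dims=-1)
    # Random horizontal flip
    if random.random() < 0.5:
        image, mask = torch.flip(image, dims=[-1]), torch.flip(mask, dims=[-1])
    # Random vertical flip
    if random.random() < 0.5:
        image, mask = torch.flip(image, dims=[-2]), torch.flip(mask, dims=[-2])
    return image, mask
```

torch.roll performs the circular shift, so the horizontal roll amounts to a translation with wrap-around padding.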
Best model: exp 53
[Run set panel: 12 runs]

More details on these experiments, sorted by validation dice in ascending order.
More parameters? Overfitting on the 6.4M-parameter UNET
[Run set panel: 23 runs]
Experiment 53 was only trained for 100 epochs, so I decided to try to improve on it by training longer (experiment 62).
On the "private" test set, experiment 53 achieves a score of 62.3%, whereas experiment 62 achieves a score of 58%. Early stopping seems like a good option.
Although performance looked amazing when letting the network train longer, overfitting was there and the evolution of the validation dice score was quite deceptive (the dice score does not decrease!). A better approach would have been to also monitor the dice score on the training set: the train dice keeps increasing while the validation dice stops increasing (but does NOT decrease!).
💡

The discrepancy between the train and validation sets is much bigger on the overfitted experiment 62 (orange and red) than on experiment 53.
Compressing the model
In the following sections, we explain how we shrink the model using, successively:
- Knowledge distillation
- Per-layer static weight quantization
Knowledge distillation
Reference model and lighter models
We now take experiment 53 as a reference, a UNET with 6.4 million parameters.
We introduce two lighter UNet models:
- 576k
- 101k
| Exp id | Dice Validation | Encoder | Bottleneck | Decoder | Thickness | # params |
|---|---|---|---|---|---|---|
| 53 | 79% | [4, 2, 1] | 1 | [1, 2, 4] | 64 | 6.4M params |
| 2001 | 75% | [4, 2, 1] | 1 | [1, 2, 4] | 16 | 576k params |
| 2000 | 73.4% | [4, 2, 1] | 1 | [1, 2, 4] | 8 | 101k params |
| 52 | 71% | [4, 2] | 1 | [2, 4] | 16 | 219k params |
We train both networks from scratch using the same Dice+BCE loss mixture as previously (exp 2000 & 2001) and observe lower performance due to the use of a much smaller architecture:
- exp 2000 = 101k : 73.4% dice score
- exp 2001 = 576k : 75% dice score
Simply put, the bigger network performs better here. But we'll see in the next section that the smaller models can catch up with the bigger model when trained with an auxiliary supervision signal, a technique called knowledge distillation.
💡
[Run set panel: 18 runs]
Distillation parameters
We use a hybrid loss made of:
- the distillation BCE (teacher vs. student prediction) using temperature scaling of the logits (1/T), weighted by w
- the original BCE loss against the labeled data, weighted by (1-w)
The teacher model is the 6.4M-parameter UNet.
The student models are the UNet 101k (exp 1000) and the UNet 576k (exp 1001).
We can see that these models manage to achieve nearly the same performance as the teacher model.
Please note that the batch size has been set to 32 here because training was performed on a T500 GPU with only 4 GB of RAM... For distillation, you need to run the teacher's forward pass on each batch (no gradients required, but it still occupies a decent amount of memory) while maintaining the student's optimization context.
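Here is a minimal sketch of this distillation setup for a binary segmentation head. The names (`teacher`, `student`, `w`, `T`), the default values, and the fact that the student logits are also temperature-scaled are illustrative assumptions, not taken from the project code:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, w=0.5, T=2.0):
    """Hybrid loss: w * BCE(student vs. temperature-scaled teacher) + (1 - w) * BCE(student vs. labels)."""
    # Soft targets from the teacher, with temperature scaling of the logits (1/T)
    soft_targets = torch.sigmoid(teacher_logits / T)
    distill = F.binary_cross_entropy_with_logits(student_logits / T, soft_targets)
    supervised = F.binary_cross_entropy_with_logits(student_logits, targets)
    return w * distill + (1.0 - w) * supervised

# Inside the training loop: the teacher forward pass needs no gradients,
# but its activations still consume GPU memory on top of the student's.
# with torch.no_grad():
#     teacher_logits = teacher(images)
# loss = distillation_loss(student(images), teacher_logits, masks, w=0.5, T=2.0)
```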
[Run set panel: 20 runs]
On the UNet 576k model, we experiment with distillation using various temperatures and various distillation-loss weighting factors w. We don't really see any significant differences between these settings.
[Run set panel: 19 runs]
The UNET 576k model trained with distillation achieves a 66% score on the private challenge test set.
💡
Finally, we choose experiment 1004.

Quantization
Reference model after distillation
1004 - UNET 576k
~ 2.2 MB on disk -> validation dice score 78.93%
Quantized model on 8 bits
1004 - UNET 576k + Weights QUANTIZED per layer over 8 bits
~ 550 kB on disk -> validation dice score 78.95%
Quantized model on 4 bits - too much!
1004 - UNET 576k + Weights QUANTIZED per layer over 4 bits
~ 275 kB on disk -> validation dice score 73.4%
| Quantization | None | 16 bits | 8 bits | 4 bits |
|---|---|---|---|---|
| Size on disk | 2.2 MB | 1.1 MB | 550 kB | ≤ 300 kB |
| Format | float32 | int16 | int8 | packed: 1 sign bit + 3 content bits |
| Dice Score | 78.937% | 78.95% | 78.956% | 73.41% |
| IoU | 71.549% | 71.57% | 71.584% | 65.15% |
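As an illustration of how per-layer weight quantization can be simulated (symmetric, round-to-nearest, storage-only quantization is an assumption here and may differ from the exact scheme used in the project), each tensor gets its own scale and the weights are dequantized back to float32 before inference:

```python
import torch

def quantize_per_layer(state_dict, num_bits=8):
    """Quantize each float tensor independently to signed integers with its own per-layer scale."""
    qmax = 2 ** (num_bits - 1) - 1               # 127 for 8 bits, 7 for 4 bits
    quantized, scales = {}, {}
    for name, w in state_dict.items():
        if not torch.is_floating_point(w):       # e.g. BatchNorm counters: keep as-is
            quantized[name], scales[name] = w, None
            continue
        scale = w.abs().max().clamp(min=1e-12) / qmax   # one scale per layer (symmetric)
        quantized[name] = torch.clamp(torch.round(w / scale), -qmax, qmax).to(torch.int8)
        scales[name] = scale
    return quantized, scales

def dequantize(quantized, scales):
    """Reconstruct float32 weights before inference (the quantization is storage-only)."""
    return {n: q if scales[n] is None else q.to(torch.float32) * scales[n]
            for n, q in quantized.items()}

# q, s = quantize_per_layer(student.state_dict(), num_bits=8)
# student.load_state_dict(dequantize(q, s))
```

Since only the stored weights are quantized, inference still runs in float32: the gain is on disk size, not on execution time or RAM, as noted earlier.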
Picking per-layer quantization

At first sight, the overall weight distribution looks like one big Gaussian, so global quantization could work. But in practice this is somewhat deceptive.
We choose per-layer quantization after investigating the weight distribution of each layer.

Per-layer weight quantization is necessary, as we can see that weight distributions differ strongly across layers (the bottleneck and decoder 1 stage 0 convolution weight distributions are truly different).
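The per-layer investigation above amounts to plotting one weight histogram per parameter tensor; a quick way to do that (matplotlib assumed, `model` being the loaded UNET):

```python
import matplotlib.pyplot as plt
import torch

def plot_weight_histograms(model: torch.nn.Module, max_layers: int = 8):
    """One histogram per convolution weight tensor: different layers can have very different ranges."""
    named = [(n, p) for n, p in model.named_parameters() if p.dim() > 1][:max_layers]
    fig, axes = plt.subplots(1, len(named), figsize=(3 * len(named), 3), squeeze=False)
    for ax, (name, p) in zip(axes[0], named):
        ax.hist(p.detach().flatten().cpu().numpy(), bins=50)
        ax.set_title(name, fontsize=7)
    fig.tight_layout()
    plt.show()

# plot_weight_histograms(student)
```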
Quantization on 8 bits

Per-layer quantization of the weights over 8 bits barely changes the predicted logits (the difference is almost invisible to the naked eye).

The error on the predicted probability rarely seems to go over 0.2%, but most importantly, we don't see totally unrelated segmentation areas appear and localization seems to be well preserved.
Do not push static quantization to 4 bits
Quantizing weights to fewer than 8 bits is possible and requires packing two 4-bit weights into one 8-bit integer (not done for this project, simply simulated). A sketch of the packing is given below.
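For reference, the packing step that the project only simulates could look like the following NumPy sketch, storing two signed 4-bit codes per byte; the report describes a sign + 3-bit layout, while the two's-complement nibble used here covers the same [-7, 7] code range:

```python
import numpy as np

def pack_4bit(codes: np.ndarray) -> np.ndarray:
    """Pack signed 4-bit codes in [-8, 7] two-per-byte (low nibble first)."""
    nibbles = (codes.astype(np.int8) & 0x0F).astype(np.uint8)   # two's-complement nibbles
    if nibbles.size % 2:                                        # pad to an even count
        nibbles = np.append(nibbles, np.uint8(0))
    return (nibbles[0::2] | (nibbles[1::2] << 4)).astype(np.uint8)

def unpack_4bit(packed: np.ndarray, n: int) -> np.ndarray:
    """Recover n signed 4-bit codes from the packed bytes."""
    low, high = packed & 0x0F, (packed >> 4) & 0x0F
    nibbles = np.empty(packed.size * 2, dtype=np.uint8)
    nibbles[0::2], nibbles[1::2] = low, high
    signed = nibbles.astype(np.int8)
    signed[signed > 7] -= 16                                    # undo the two's-complement nibble
    return signed[:n]

# codes = np.array([3, -2, 7, -8, 1], dtype=np.int8)
# packed = pack_4bit(codes)                                     # 3 bytes instead of 5
# assert np.array_equal(unpack_4bit(packed, len(codes)), codes)
```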
There is a big loss in performance: the model makes larger errors on the predicted probabilities, sometimes around 10%.

Here, we've totally lost the original model's prediction - the quantized model ends up creating false positives.


There are only 16 possible values for the weights of each layer, which may not be sufficient to encode the precision required by the convolutions.
Keep in mind that a high-pass filter with a small dynamic range (below the quantization precision) will end up being converted into a low-pass filter.
💡
Conclusion
We met the constraint of a model fitting under 1 MB of disk space using knowledge distillation and quantization, with nearly no performance drop. It achieves a score of 66.38% on the private test set.