Corrosion segmentation
Report on training experiments to make the corrosion model as light as possible while maintaining good performance.
Created on March 13 | Last edited on March 20
This report is dedicated to the compression of a corrosion segmentation model.
A full report dedicated to the development of the main model is available here. It contains information on the problem definition, dataset analysis, training loops, methodology, training infrastructure and results visualization. Code is available here: Geosciences Github code
TL;DR
We compress a 6.4M-parameter teacher UNET with good performance (78.3% validation dice score) into a lightweight distilled 576k-parameter UNET whose weights are quantized per layer over 8 bits.
This whole "compression process" yields similar performance, with no significant drop on the validation set...
and the compressed model can indeed be stored in 550 kB on disk.
The final compressed model achieves a 66.385% score on the private test set.
💡

Summary: Distillation + Quantization leads to <1 MB of weights and a reduction of 1% in the dice score compared to the best teacher model (which achieves an 81.6% dice score on the train set and 78.2% on the validation set). The full notebook to generate this figure is available here.
Summary
We introduce a reference model, which is the best one we've made. This large segmentation model is a UNET with 6.4M parameters (24.6 MB stored as float32). Experiment 53 achieves a 79% dice metric on the validation set.
It is trained using a Dice+BCE loss with batches of size 128 and a learning rate of 5e-4 with a plateau LR scheduler (reduce the LR by a factor of 0.8 with a patience of 10 epochs).
While training, it occupies nearly 90% of the GPU compute and 14.6 GB of the 16 GB of GPU RAM of the Tesla P100 available through Kaggle Kernels.
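As a rough sketch (not the project's actual training script), the loss, optimizer and plateau scheduler described above could be set up as follows in PyTorch; the exact Dice formulation, the equal Dice/BCE weighting and the placeholder model are assumptions:

```python
import torch
import torch.nn as nn

def dice_loss(logits, targets, eps=1e-6):
    """Soft Dice loss on sigmoid probabilities (formulation assumed, not taken from the report)."""
    probs = torch.sigmoid(logits)
    inter = (probs * targets).sum(dim=(1, 2, 3))
    union = probs.sum(dim=(1, 2, 3)) + targets.sum(dim=(1, 2, 3))
    return 1.0 - ((2.0 * inter + eps) / (union + eps)).mean()

bce = nn.BCEWithLogitsLoss()

def dice_bce_loss(logits, targets):
    # Dice + BCE mixture (equal weighting assumed here)
    return dice_loss(logits, targets) + bce(logits, targets)

model = nn.Conv2d(3, 1, kernel_size=3, padding=1)  # placeholder for the 6.4M-parameter UNET
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
# Reduce the LR by a factor of 0.8 when the monitored validation dice plateaus for 10 epochs
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="max", factor=0.8, patience=10)

# In the epoch loop (batches of 128): compute dice_bce_loss, backpropagate,
# then call scheduler.step(val_dice) with the epoch's validation dice score.
```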
We then try to shrink the model through knowledge distillation to reach a much smaller architecture, which naturally has 3 major advantages:
- Fewer operations, hence faster inference
- Lower RAM usage
- Smaller weight storage on disk.
We are able to achieve performance similar to the UNET 6.4M with only 8.2% of the initial weights, using a UNET with 576k parameters trained with knowledge distillation. The model weighs only 2 MB on disk.
There are still 2 options to decrease the model size:
- Further static weight quantization can be performed on the student model to reduce its storage size (this reduces neither the execution time nor the RAM consumption). This is what we choose! A back-of-the-envelope size estimate is sketched right after this list.
- We can shrink the architecture even further using a tinier UNET 101k, which weighs only 400 kB on disk without quantization.
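To get an intuition for these disk footprints, here is a quick back-of-the-envelope estimate (raw weight storage only; real checkpoints add a little metadata, so the reported sizes differ slightly):

```python
def storage_kb(n_params: int, bits_per_weight: int) -> float:
    """Approximate raw weight storage in kB, ignoring file headers and metadata."""
    return n_params * bits_per_weight / 8 / 1e3

print(storage_kb(6_400_000, 32))  # teacher UNET 6.4M, float32 -> 25600.0 kB
print(storage_kb(576_000, 32))    # student UNET 576k, float32 -> 2304.0 kB (~2.2 MB)
print(storage_kb(576_000, 8))     # student UNET 576k, int8    -> 576.0 kB
print(storage_kb(101_000, 32))    # tiny UNET 101k, float32    -> 404.0 kB (~400 kB)
```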
Searching for the best model
We try to find the "best" teacher model by carefully tuning everything we can.
Here are the ingredients for the best model.
- More parameters, larger receptive field
- Dice+BCE loss optimization
- Adam LR 4E-5 + Plateau scheduler
- Augmentations (horizontal roll, horizontal / vertical flips) - see the sketch after this list
- Large batch size 128
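Below is a minimal sketch of what these augmentations can look like on (image, mask) torch tensor pairs; the channel-first layout and the 50% flip probabilities are assumptions, and the project's actual pipeline may differ:

```python
import random
import torch

def augment(image: torch.Tensor, mask: torch.Tensor):
    """Horizontal roll + horizontal/vertical flips, applied identically to image and mask."""
    # Random horizontal roll (circular shift along the width axis)
    shift = random.randint(0, image.shape[-1] - 1)
    image, mask = torch.roll(image, shift, dims=-1), torch.roll(mask, shift, dims=-1)
    # Random horizontal flip
    if random.random() < 0.5:
        image, mask = torch.flip(image, dims=[-1]), torch.flip(mask, dims=[-1])
    # Random vertical flip
    if random.random() < 0.5:
        image, mask = torch.flip(image, dims=[-2]), torch.flip(mask, dims=[-2])
    return image, mask
```

torch.roll performs the circular shift, so the horizontal roll amounts to a translation with wrap-around padding.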
Best model: exp 53
[Run set panel: 12 runs]

More details on these experiments, sorted by validation dice in ascending order.
More parameters? Overfitting on the 6.4M-parameter UNET
[Run set panel: 23 runs]
Experiment 53 was only trained for 100 epochs, so I decided to try to improve on it by training longer (experiment 62).
On the "private" test set, experiment 53 achieves a score of 62.3%, whereas experiment 62 achieves a score of 58%. Early stopping seems like a good option.
Although performance looked amazing when letting the network train longer, overfitting was there and the evolution of the validation dice score was quite deceptive (the dice score does not decrease!). A better approach would have been to also monitor the dice score on the training set: the train dice keeps increasing while the validation dice stops increasing (but does NOT decrease!).
💡

The discrepancy between the train and validation sets is much bigger on the overfitted experiment 62 (orange and red) than on experiment 53.
Compressing the model
In the following sections, we explain how we shrink the model using, successively:
- Knowledge distillation
- Per-layer static weight quantization
Knowledge distillation
Reference model and lighter models
We now take experiment 53 as a reference, a UNET with 6.4 million parameters.
We introduce two lighter UNet models:
- 576k
- 101k
| Exp id | Dice Validation | Encoder | Bottleneck | Decoder | Thickness | # params |
|---|---|---|---|---|---|---|
| 53 | 79% | [4, 2, 1] | 1 | [1, 2, 4] | 64 | 6.4M params |
| 2001 | 75% | [4, 2, 1] | 1 | [1, 2, 4] | 16 | 576k params |
| 2000 | 73.4% | [4, 2, 1] | 1 | [1, 2, 4] | 8 | 101k params |
| 52 | 71% | [4, 2] | 1 | [2, 4] | 16 | 219k params |
We train both networks from scratch using the same Dice+BCE loss mixture as previously (exp 2000 & 2001) and observe lower performance due to the use of a much smaller architecture:
- exp 2000 = 101k : 73.4% dice score
- exp 2001 = 576k : 75% dice score
Simply put, the bigger network performs better here. But we'll see in the next section that the smaller models can catch up with the bigger model when trained with an auxiliary supervision signal, a technique called knowledge distillation.
💡
[Run set panel: 18 runs]
Distillation parameters
We use a hybrid loss made of:
- the distillation BCE (teacher vs. student prediction) using temperature scaling of the logits (1/T), weighted by w
- the original BCE loss against the labeled data, weighted by (1-w)
The teacher model is the 6.4M-parameter UNet.
The student models are the UNet 101k (exp 1000) and the UNet 576k (exp 1001).
We can see that these models manage to achieve nearly the same performance as the teacher model.
Please note that the batch size has been set to 32 here because training was performed on a T500 GPU with only 4 GB of RAM... For distillation, you need to run the teacher's forward pass on each batch (no gradients required, but it still occupies a decent amount of memory) while maintaining the student's optimization context.
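Here is a minimal sketch of this distillation setup for a binary segmentation head. The names (`teacher`, `student`, `w`, `T`), the default values, and the fact that the student logits are also temperature-scaled are illustrative assumptions, not taken from the project code:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, w=0.5, T=2.0):
    """Hybrid loss: w * BCE(student vs. temperature-scaled teacher) + (1 - w) * BCE(student vs. labels)."""
    # Soft targets from the teacher, with temperature scaling of the logits (1/T)
    soft_targets = torch.sigmoid(teacher_logits / T)
    distill = F.binary_cross_entropy_with_logits(student_logits / T, soft_targets)
    supervised = F.binary_cross_entropy_with_logits(student_logits, targets)
    return w * distill + (1.0 - w) * supervised

# Inside the training loop: the teacher forward pass needs no gradients,
# but its activations still consume GPU memory on top of the student's.
# with torch.no_grad():
#     teacher_logits = teacher(images)
# loss = distillation_loss(student(images), teacher_logits, masks, w=0.5, T=2.0)
```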
[Run set panel: 20 runs]
On the UNet 576k model, we experiment with distillation using various temperatures and various distillation-loss weighting factors w. We don't really see any significant differences between these settings.
[Run set panel: 19 runs]
The UNET 576k model trained with distillation achieves a 66% score on the private challenge test set.
💡
Finally, we choose experiment 1004.

Quantization
Reference model after distillation
1004 - UNET 576k
~ 2.2 MB on disk -> validation dice score 78.93%
Quantized model on 8 bits
1004 - UNET 576k + Weights QUANTIZED per layer over 8 bits
~ 550 kB on disk -> validation dice score 78.95%
Quantized model on 4 bits - too much!
1004 - UNET 576k + Weights QUANTIZED per layer over 4 bits
~ 275 kB on disk -> validation dice score 73.4%
| Quantization | None | 16 bits | 8 bits | 4 bits |
|---|---|---|---|---|
| Size on disk | 2.2 MB | 1.1 MB | 550 kB | ≤ 300 kB |
| Format | float32 | int16 | int8 | packed: 1 sign bit + 3 content bits |
| Dice Score | 78.937% | 78.95% | 78.956% | 73.41% |
| IoU | 71.549% | 71.57% | 71.584% | 65.15% |
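As an illustration of how per-layer weight quantization can be simulated (symmetric, round-to-nearest, storage-only quantization is an assumption here and may differ from the exact scheme used in the project), each tensor gets its own scale and the weights are dequantized back to float32 before inference:

```python
import torch

def quantize_per_layer(state_dict, num_bits=8):
    """Quantize each float tensor independently to signed integers with its own per-layer scale."""
    qmax = 2 ** (num_bits - 1) - 1               # 127 for 8 bits, 7 for 4 bits
    quantized, scales = {}, {}
    for name, w in state_dict.items():
        if not torch.is_floating_point(w):       # e.g. BatchNorm counters: keep as-is
            quantized[name], scales[name] = w, None
            continue
        scale = w.abs().max().clamp(min=1e-12) / qmax   # one scale per layer (symmetric)
        quantized[name] = torch.clamp(torch.round(w / scale), -qmax, qmax).to(torch.int8)
        scales[name] = scale
    return quantized, scales

def dequantize(quantized, scales):
    """Reconstruct float32 weights before inference (the quantization is storage-only)."""
    return {n: q if scales[n] is None else q.to(torch.float32) * scales[n]
            for n, q in quantized.items()}

# q, s = quantize_per_layer(student.state_dict(), num_bits=8)
# student.load_state_dict(dequantize(q, s))
```

Since only the stored weights are quantized, inference still runs in float32: the gain is on disk size, not on execution time or RAM, as noted earlier.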
Picking per-layer quantization

At first sight, the overall weight distribution looks like one big Gaussian, so global quantization could work. But in practice this is somewhat deceptive.
We choose per-layer quantization after investigating the weight distribution of each layer.

Per-layer weight quantization is necessary, as we can see that weight distributions differ strongly across layers (the bottleneck and decoder 1 stage 0 convolution weight distributions are truly different).
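The per-layer investigation above amounts to plotting one weight histogram per parameter tensor; a quick way to do that (matplotlib assumed, `model` being the loaded UNET):

```python
import matplotlib.pyplot as plt
import torch

def plot_weight_histograms(model: torch.nn.Module, max_layers: int = 8):
    """One histogram per convolution weight tensor: different layers can have very different ranges."""
    named = [(n, p) for n, p in model.named_parameters() if p.dim() > 1][:max_layers]
    fig, axes = plt.subplots(1, len(named), figsize=(3 * len(named), 3), squeeze=False)
    for ax, (name, p) in zip(axes[0], named):
        ax.hist(p.detach().flatten().cpu().numpy(), bins=50)
        ax.set_title(name, fontsize=7)
    fig.tight_layout()
    plt.show()

# plot_weight_histograms(student)
```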
Quantization on 8 bits

Per-layer quantization of the weights over 8 bits barely changes the predicted logits (the difference is almost invisible to the naked eye).

The error on the predicted probability rarely seems to go over 0.2%, but most importantly, we don't see totally unrelated segmentation areas appear and localization seems to be well preserved.
Do not push static quantization to 4 bits
Quantizing weights to fewer than 8 bits is possible and requires packing two 4-bit weights into one 8-bit integer (not done for this project, simply simulated). A sketch of the packing is given below.
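For reference, the packing step that the project only simulates could look like the following NumPy sketch, storing two signed 4-bit codes per byte; the report describes a sign + 3-bit layout, while the two's-complement nibble used here covers the same [-7, 7] code range:

```python
import numpy as np

def pack_4bit(codes: np.ndarray) -> np.ndarray:
    """Pack signed 4-bit codes in [-8, 7] two-per-byte (low nibble first)."""
    nibbles = (codes.astype(np.int8) & 0x0F).astype(np.uint8)   # two's-complement nibbles
    if nibbles.size % 2:                                        # pad to an even count
        nibbles = np.append(nibbles, np.uint8(0))
    return (nibbles[0::2] | (nibbles[1::2] << 4)).astype(np.uint8)

def unpack_4bit(packed: np.ndarray, n: int) -> np.ndarray:
    """Recover n signed 4-bit codes from the packed bytes."""
    low, high = packed & 0x0F, (packed >> 4) & 0x0F
    nibbles = np.empty(packed.size * 2, dtype=np.uint8)
    nibbles[0::2], nibbles[1::2] = low, high
    signed = nibbles.astype(np.int8)
    signed[signed > 7] -= 16                                    # undo the two's-complement nibble
    return signed[:n]

# codes = np.array([3, -2, 7, -8, 1], dtype=np.int8)
# packed = pack_4bit(codes)                                     # 3 bytes instead of 5
# assert np.array_equal(unpack_4bit(packed, len(codes)), codes)
```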
There is a big loss in performance: the model makes larger errors on the predicted probabilities, sometimes around 10%.

Here, we've totally lost the original model's prediction - the quantized model ends up creating false positives.


There are only 16 possible values for the weights of each layer, which may not be sufficient to encode the precision required by the convolutions.
Keep in mind that a high-pass filter with a small dynamic range (below the quantization precision) will end up being converted into a low-pass filter.
💡
Conclusion
We met the constraint of a model fitting under 1 MB of disk space using knowledge distillation and quantization, with nearly no performance drop. It achieves a score of 66.38% on the private test set.