
Sentence Segmentation -- Part 2

Created on November 15 | Last edited on November 16
Colab Notebook Link: Google Colab

Minor Changes / Preamble

  • Fixed loss function: Previously I was using CrossEntropyLoss instead of the correct BCELoss / BCEWithLogitsLoss. This has been changed such that training uses BCEWithLogitsLoss (chosen over BCELoss for numerical stability).
  • Less stringent instance removal: Previously, I was removing all instances longer than model_output_len (the number of output nodes, essentially the context window). This skewed the sentence-completion locations heavily towards the earlier output nodes, meaning the model was biased to simply output 1 in the general area where sentence completions tend to occur (see the image from the previous submission below).
    • [Image from the previous submission: distribution of sentence completions, heavily skewed towards the earlier output positions]
    • To fix this, we now allow instances up to 4096 tokens long and simply truncate any that exceed the model's context window (a minimal sketch of this step follows the list). This creates a slightly more even spread of sentence-completion labels across the model output:
      [Image: spread of sentence-completion labels across output positions after truncation]
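A minimal sketch of this preprocessing step (illustrative names, not the notebook's code; it assumes instances arrive as lists of token IDs, a 4096-token cap, and the 256-token context window used by the models later in this report):

    # Illustrative sketch of the revised instance filtering / truncation.
    MAX_INSTANCE_LEN = 4096   # only instances longer than this are still removed
    MODEL_OUTPUT_LEN = 256    # number of output nodes / context window

    def prepare_instance(token_ids, labels):
        """Keep long articles by truncating them (and their per-token
        end-of-sentence labels) to the model's context window."""
        if len(token_ids) > MAX_INSTANCE_LEN:
            return None  # caller drops these rare, extremely long articles
        return token_ids[:MODEL_OUTPUT_LEN], labels[:MODEL_OUTPUT_LEN]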

Major Change 1: Weighted Loss

Once the loss function is corrected, we see the expected behavior for such a highly imbalanced dataset (~3% positive labels) -- all predictions collapse to 0. This shows up as precision and recall both dropping to 0 (TP = 0):

[W&B panel: Run set (1 run)]

This brings us to the first major fix in this revision: a weighted loss function, scaling the loss on positive labels by a scalar weight. The overall optimal value was calculated as 27.45 (i.e., the dataset contains 27.45x more negative than positive labels), so I experimented with a small range of weights around this value.
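As a rough sketch of how this is set up (variable names are mine, and the labels tensor here is a dummy stand-in for the real end-of-sentence label matrix):

    import torch
    import torch.nn as nn

    # Dummy stand-in for the real [num_instances, 256] label matrix
    # (1 = token completes a sentence, ~3% positives).
    labels = (torch.rand(1000, 256) < 0.035).float()

    num_pos = labels.sum()
    num_neg = labels.numel() - num_pos
    pos_weight = num_neg / num_pos        # ~27.45 on the real dataset

    # BCEWithLogitsLoss applies the sigmoid internally and scales the loss
    # on positive labels by pos_weight.
    criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

    logits = torch.randn(1000, 256)       # stand-in for model(token_ids)
    loss = criterion(logits, labels)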



[W&B panel: Run set 2 (4 runs)]

I ran four experiments, with pos_weight values of 1, 15, 27, and 45. As expected, we see a marked improvement from weighting the positive labels, with the best performance (as judged by F1 score) coming from pos_weight=27 (the calculated optimum). It should be noted that at this stage we're treating positive and negative errors as equally important, but we may prefer to weight one higher than the other; if we care more about not missing a sentence completion, we might choose pos_weight=45, which increases recall at the cost of precision.

Labelwise Weighted Loss Function

As shown above, label prevalence varies across the output positions, so it's worth trying to weight the loss function accordingly. The pos_weight was calculated for each output position independently and clipped at 80 (to avoid over-weighting). The pos_weight vector is plotted below, with the resulting performance following.
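Before the results, here is a sketch of how such a per-position pos_weight vector can be computed (again with a dummy label matrix standing in for the real one):

    import torch
    import torch.nn as nn

    # Dummy stand-in for the real [num_instances, 256] end-of-sentence label matrix.
    labels = (torch.rand(1000, 256) < 0.035).float()

    pos_per_position = labels.sum(dim=0)                  # positives at each output position
    neg_per_position = labels.shape[0] - pos_per_position
    pos_weight_vec = (neg_per_position / pos_per_position.clamp(min=1)).clamp(max=80)

    # A [256]-element pos_weight broadcasts over the batch dimension, so each
    # output position gets its own positive-label weight.
    criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight_vec)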
[Plot: per-position pos_weight vector, clipped at 80]
[W&B panel: Run set 2 (5 runs)]

There's little to no benefit from the label-wise weighted loss, so I'll continue with pos_weight=27 from here on out.

Packing Instances Together

Another improvement would be to pack instances together: rather than treating each Wikipedia article as an individual instance, fill the remaining space in each instance with text from the next article. This would even out the spread of sentence completions to be roughly uniform. I feel this is a less important area to explore given the results from the weighted loss experiments above; scaling the loss weights by each output position's prevalence of sentence completions (which should address the output imbalance) didn't show an improvement over a uniform loss weight, so packing instances together (another way to deal with the output imbalance) is unlikely to provide a large improvement. This is left as future work, but a rough sketch of the idea is shown below.
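Purely as an illustration of the idea (not code from the notebook), greedy packing could look something like this:

    def pack_instances(tokenized_articles, context_len=256, pad_id=0):
        """Greedily concatenate tokenized articles into fixed-length instances,
        minimising padding and spreading sentence completions ~uniformly."""
        packed, buffer = [], []
        for tokens in tokenized_articles:
            buffer.extend(tokens)
            while len(buffer) >= context_len:
                packed.append(buffer[:context_len])
                buffer = buffer[context_len:]
        if buffer:  # pad the final partial instance
            packed.append(buffer + [pad_id] * (context_len - len(buffer)))
        return packed

(The corresponding label vectors would be packed in exactly the same way.)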

Major Change 2: Model Configurations

The other important fix in this revision is the model configuration / capacity. We'll start with some experiments around the final fully-connected (FC) layers as it's an easy way to add capacity to the model.
  • Baseline
  • 1 Extra FC (one extra 512-neuron hidden layer with ReLU)
    • Num parameters: 1,254,368
      SentenceSegmentationCNN(
        (embedding): Embedding(28996, 768, padding_idx=0)
        (conv1d_list): ModuleList(
          (0): Conv1d(768, 32, kernel_size=(3,), stride=(1,))
          (1): Conv1d(768, 64, kernel_size=(5,), stride=(1,))
          (2): Conv1d(768, 128, kernel_size=(7,), stride=(1,))
        )
        (fc1): Linear(in_features=224, out_features=512, bias=True)
        (fc_out): Linear(in_features=512, out_features=256, bias=True)
      )
  • Extra FC (two extra 1000-neuron hidden layers)
  • Extra FC Relu (same as above, with a ReLU after each FC layer, à la AlexNet)
  • AlexNet 1D -- an architecture loosely inspired by AlexNet / basic CNNs, but without pooling: three stacked conv layers (kernel size 5 with 64 filters, kernel size 3 with 128 filters, kernel size 3 with 256 filters)
    • Num parameters: 33,006,784
      AlexNet1D(
        (embedding): Embedding(28996, 768, padding_idx=0)
        (conv1): Conv1d(768, 64, kernel_size=(5,), stride=(1,))
        (conv2): Conv1d(64, 128, kernel_size=(3,), stride=(1,))
        (conv3): Conv1d(128, 256, kernel_size=(3,), stride=(1,))
        (dropout): Dropout(p=0.5, inplace=False)
        (fc1): Linear(in_features=63488, out_features=512, bias=True)
        (fc_out): Linear(in_features=512, out_features=256, bias=True)
      )
There doesn't seem to be much improvement from the different FC configurations on the baseline model; however, switching to the custom AlexNet1D model provides a massive improvement in performance.
This is because of the baseline model's pooling, which used kernel_size=<previous_output_length>, meaning it simply took the max value of each of the previous layer's feature maps, losing all temporal information. This is clearly a bad design for sentence segmentation, where fine-grained positional information is required; a lesson in using others' models off-the-shelf without careful consideration...
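For concreteness, here is a sketch of how the AlexNet1D above could be wired up in PyTorch. The layer shapes are taken from the printed module; the placement of ReLU and dropout in the forward pass is my assumption:

    import torch
    import torch.nn as nn

    class AlexNet1D(nn.Module):
        # Reconstructed from the printed module; activation/dropout placement assumed.
        def __init__(self, vocab_size=28996, embed_dim=768, context_len=256):
            super().__init__()
            self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
            self.conv1 = nn.Conv1d(embed_dim, 64, kernel_size=5)
            self.conv2 = nn.Conv1d(64, 128, kernel_size=3)
            self.conv3 = nn.Conv1d(128, 256, kernel_size=3)
            self.dropout = nn.Dropout(0.5)
            # No pooling: the conv output (length 248 for a 256-token input) is
            # flattened, preserving fine-grained positional information.
            self.fc1 = nn.Linear(256 * (context_len - 8), 512)
            self.fc_out = nn.Linear(512, context_len)      # one logit per input position

        def forward(self, token_ids):                      # token_ids: [batch, 256]
            x = self.embedding(token_ids).transpose(1, 2)  # -> [batch, 768, 256]
            x = torch.relu(self.conv1(x))
            x = torch.relu(self.conv2(x))
            x = torch.relu(self.conv3(x))
            x = self.dropout(x.flatten(1))                 # -> [batch, 256 * 248]
            x = torch.relu(self.fc1(x))
            return self.fc_out(x)                          # raw logits for BCEWithLogitsLoss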

[W&B panel: Run set 2 (5 runs)]

Looking at final model performance, the AlexNet1D model achieved a precision of 0.74 and a recall of 0.92 (F1 ≈ 0.82), showing this to be a performant architecture for sentence segmentation.

Summary

The inclusion of both the weighted loss function and a higher-capacity, non-pooled model resulted in a marked increase in performance, from an F1 score of ~0.1 to 0.82.
  • It's important to apply the appropriate loss weight, given the highly imbalanced nature of this dataset (~3% positive labels).
  • We also need a high-capacity CNN without pooling (or at least without pooling as extreme as the baseline's) to capture the complexities in the data and retain fine-grained positional information.
The combination of these two changes has resulted in a simple but effective CNN model for sentence segmentation.

Future Work

  • Revisit loss weighting: Given how poorly the baseline model was performing, it may not have been a fair test bed to compare loss weightings -- a useful next step would be to repeat the loss weighting experiments with the more accurate AlexNet1D model.
  • Instance packing: as previously mentioned, pack instances together, removing the need for padding and giving the model more signal per instance. This would also even out the end-of-sentence token distribution, rather than having those tokens mostly occur in the first half of each instance.
  • General hyperparameter tuning: little tuning of the common hyperparameters (learning rate, batch size, etc.) was done, and this is an essential step for getting optimal performance from the model.