
How To Build an Efficient NLP Model

In this article, we look at how we achieved 2nd place on the efficiency leaderboard of the Kaggle Feedback Prize - Predicting Effective Arguments competition and what we learned along the way.
Georgia State University and the Learning Agency Lab organized a Kaggle competition to classify argumentative elements in student writing as "effective," "adequate," or "ineffective." The competition consisted of a traditional track measuring the accuracy of classification and a second track measuring computational efficiency using a combination of accuracy and model speed.
Our team — CroDoc, Amed, Ivan Aerlic, tascj, and I — finished with a gold medal (7th place) on the traditional leaderboard and got 2nd place on the efficiency leaderboard.
In this report, I will share the story of how we achieved that, with a special focus on the efficient model. I hope it will help machine learning engineers who need to run models in production that are both accurate and efficient.
Before we dive into how we built our efficient NLP model, here's what we'll be covering:

Table Of Contents

How The Kaggle Competition Was Evaluated
Framing the Task
Simplicity Rules
Pre-Training Our Model
Dropout and Hyperparameter Tuning
Adversarial Weight Perturbation (AWP)
Distillation and Finetuning
Full Data and Stochastic Weight Averaging (SWA)
Conclusion: An Efficient NLP Model

Let's get started!

How The Kaggle Competition Was Evaluated

The competition was evaluated with a multi-class logarithmic loss metric, and the chart below presents a subset of the experiments I ran over the course of the competition. In the end, I ran a total of 661 experiments between July 13th and August 23rd.
The blue dot in the bottom right of the chart is the final experiment, which used a new approach implemented the night before and run on the final day of the competition; it secured us 2nd place in the efficiency track.
I will highlight the key milestones that got us there below:
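For reference, the multi-class logarithmic loss metric can be computed with scikit-learn's log_loss. The label names below mirror the competition's classes, and the probabilities are purely illustrative, not real predictions.

```python
import numpy as np
from sklearn.metrics import log_loss

# Toy predictions for the three effectiveness classes.
labels = ["Ineffective", "Adequate", "Effective"]
y_true = ["Adequate", "Effective", "Ineffective", "Adequate"]
y_prob = np.array([
    [0.20, 0.60, 0.20],
    [0.10, 0.30, 0.60],
    [0.70, 0.20, 0.10],
    [0.30, 0.50, 0.20],
])

print(f"multi-class log loss: {log_loss(y_true, y_prob, labels=labels):.4f}")
```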

[W&B chart: run set of 384 experiments]


Framing the Task

The dataset consisted of student essays split into discourses (e.g., lead, position, claim, evidence, etc.) and scores for each of the discourses.
How should we frame that as a machine learning task? When I joined the competition, most of the public notebooks used a sequence classification approach with a transformer backbone, feeding a combination of discourse, discourse type, and essay context as input. I decided to work with a different token classification approach as it seemed like it could be faster and better able to handle the contextual nature of the task.
I designed a model that would first classify each token into discourse type (lead, position, etc.) and discourse effectiveness (ineffective, adequate, effective). These correspond to train_classes_loss and train_scores_loss in the chart below. In the second stage, I performed mean pooling across the tokens of each discourse, concatenated that vector with a discourse type embedding, and classified discourse effectiveness at the span level (train_examples_loss for training and example_loss for validation below).
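To make this two-stage setup concrete, here is a minimal sketch of the kind of head I describe, assuming a Hugging Face backbone. The class name, embedding size, and single-essay batching are illustrative rather than my exact code.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class TwoStageDiscourseModel(nn.Module):
    """Sketch: token-level type/score heads (stage 1), then span-level
    classification over mean-pooled discourse tokens (stage 2)."""

    def __init__(self, backbone_name="microsoft/deberta-v3-large",
                 num_types=7, num_scores=3, type_emb_dim=64):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(backbone_name)
        hidden = self.backbone.config.hidden_size
        # Stage 1: per-token discourse type and effectiveness logits.
        self.type_head = nn.Linear(hidden, num_types)    # -> train_classes_loss
        self.score_head = nn.Linear(hidden, num_scores)  # -> train_scores_loss
        # Stage 2: span-level head over [mean-pooled tokens ; type embedding].
        self.type_emb = nn.Embedding(num_types, type_emb_dim)
        self.span_head = nn.Linear(hidden + type_emb_dim, num_scores)  # -> train_examples_loss

    def forward(self, input_ids, attention_mask, span_masks, span_types):
        # span_masks: (num_spans, seq_len) float mask of each discourse's tokens
        # span_types: (num_spans,) discourse-type id per span; batch size is 1 here
        h = self.backbone(input_ids=input_ids,
                          attention_mask=attention_mask).last_hidden_state  # (1, seq, hidden)
        token_type_logits = self.type_head(h)
        token_score_logits = self.score_head(h)

        # Mean-pool the token states belonging to each discourse span.
        pooled = (span_masks.unsqueeze(-1) * h.squeeze(0)).sum(dim=1)
        pooled = pooled / span_masks.sum(dim=1, keepdim=True).clamp(min=1e-6)
        span_logits = self.span_head(
            torch.cat([pooled, self.type_emb(span_types)], dim=-1))
        return token_type_logits, token_score_logits, span_logits
```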
I ran these experiments with PyTorch Lightning and used the W&B Integration for experiment tracking.
I thought this was a smart approach, but it was also complex, hard to tune, and slow to converge.

[W&B chart: run PL16_fold_0]


Simplicity Rules

Imitation is the sincerest form of flattery, so when I see a good public notebook on Kaggle, I like to check it out. When I saw the solution shared by Nicholas Broad, I simply had to test it.
Nicholas's idea was also a token classification approach, but a much simpler one: he inserted special tokens into the essay text to mark each discourse's start, end, and type, and predicted the discourse scores from those special tokens.
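To give a flavour of the idea, here is a sketch of inserting type markers around each discourse. The marker strings and the mark_discourses helper are mine for illustration, not Nicholas's actual code, and the real notebook's tokens differ.

```python
from transformers import AutoTokenizer

# Hypothetical marker format; the real notebook uses its own tokens.
DISCOURSE_TYPES = ["Lead", "Position", "Claim", "Counterclaim",
                   "Rebuttal", "Evidence", "Concluding Statement"]
special_tokens = [f"[{t.upper()}]" for t in DISCOURSE_TYPES] + ["[END]"]

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-large")
tokenizer.add_special_tokens({"additional_special_tokens": special_tokens})
# After this, model.resize_token_embeddings(len(tokenizer)) is needed.

def mark_discourses(essay_text, discourses):
    """Insert type markers around each discourse span so that a token
    classification head can read the score off the opening marker.
    `discourses` is a list of dicts with start/end offsets and a type."""
    out, cursor = [], 0
    for d in sorted(discourses, key=lambda d: d["start"]):
        out.append(essay_text[cursor:d["start"]])
        out.append(f" [{d['type'].upper()}] ")
        out.append(essay_text[d["start"]:d["end"]])
        out.append(" [END] ")
        cursor = d["end"]
    out.append(essay_text[cursor:])
    return "".join(out)
```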
I made several changes to his code, which used the Hugging Face Trainer: switching the backbone to deberta-v3-large (a state-of-the-art backbone for NLP tasks at the time), adding our W&B Integration for experiment tracking, and implementing more frequent checkpointing. This was a big step forward, but not yet competitive with the top teams. The dataset was small, so it seemed reasonable to try using external data (from previous Feedback competitions) to improve the score.

[W&B chart: run set of 5 runs]


Pre-Training Our Model

A major milestone for my approach was adding pre-training. In this competition, we had the benefit of a bigger dataset from a previous Feedback Prize competition, which didn't have effectiveness labels but did have discourse annotations.
I decided to combine masked language modeling (MLM) pre-training with discourse type and span boundary detection. It was a simple modification of the MLM code, in which I forced the special tokens to always be masked and predicted. The pre-training loss was very smooth, and after switching to the pre-trained backbone, the models' performance immediately improved; see the charts below.
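A rough sketch of how that forced masking can be bolted onto Hugging Face's MLM collator is below. The class and argument names are hypothetical, and the real implementation likely differs in its details.

```python
import torch
from transformers import DataCollatorForLanguageModeling

class DiscourseMarkerCollator(DataCollatorForLanguageModeling):
    """Regular MLM masking, except the inserted discourse-marker tokens are
    always replaced with [MASK] and always predicted, so pre-training also
    learns discourse types and span boundaries."""

    def __init__(self, tokenizer, marker_token_ids, mlm_probability=0.15):
        super().__init__(tokenizer=tokenizer, mlm_probability=mlm_probability)
        self.marker_token_ids = torch.tensor(marker_token_ids)

    def torch_mask_tokens(self, inputs, special_tokens_mask=None):
        original = inputs.clone()
        # Standard random masking from the parent collator.
        inputs, labels = super().torch_mask_tokens(inputs, special_tokens_mask)
        # Force every discourse marker to be masked and used as a target.
        is_marker = torch.isin(original, self.marker_token_ids.to(original.device))
        inputs[is_marker] = self.tokenizer.mask_token_id
        labels[is_marker] = original[is_marker]
        return inputs, labels
```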


[W&B chart: run HF-pret-3_fold_0]



[W&B chart: run set of 10 runs]


Dropout and Hyperparameter Tuning

After this stage, our model was already pretty good, so it was time to do some tuning. We didn't have the time or budget to do a massive sweep, so the tuning was guided by intuition.
One technique we tried was removing dropout. This was a trick shared in a previous Kaggle competition, where it helped with a regression task, and it gave us a boost here as well, even though our task is different. Again: imitation is the sincerest form of flattery.
We tuned other hyperparameters as well, namely learning rate, batch size, number of epochs, and random mask augmentation percentage.
We also made validation more frequent (rather than only at the end of each epoch). This gave us some boost but also showed that the evaluation loss was bumpy. How can that be addressed? We've got you covered in the next section.
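Before moving on, here is roughly what the dropout and validation-frequency changes look like in code. The attribute names are the DeBERTa config fields; the learning rate, step counts, and batch size are placeholders rather than our exact settings.

```python
from transformers import AutoConfig, AutoModelForTokenClassification, TrainingArguments

# Zero out dropout in the backbone config (field names are DeBERTa-specific;
# other architectures may use different attribute names).
config = AutoConfig.from_pretrained("microsoft/deberta-v3-large", num_labels=3)
config.hidden_dropout_prob = 0.0
config.attention_probs_dropout_prob = 0.0
model = AutoModelForTokenClassification.from_pretrained(
    "microsoft/deberta-v3-large", config=config)

# Validate every N steps instead of once per epoch; values are illustrative.
args = TrainingArguments(
    output_dir="out",
    evaluation_strategy="steps",
    eval_steps=250,
    save_strategy="steps",
    save_steps=250,
    learning_rate=1e-5,
    per_device_train_batch_size=4,
    num_train_epochs=3,
    report_to="wandb",
)
```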


[W&B chart: run set of 5 runs]



[W&B chart: run set of 9 runs]


Adversarial Weight Perturbation (AWP)

When we merged teams, @tascj reported that AWP helps stabilize training. In fact, it was introduced to Kaggle in the winning solution from a previous Feedback Prize competition. AWP requires some hyperparameter tuning and a longer run, so when I finished the implementation, I started the experiment and went to sleep.
Imagine the feeling when you see the orange line in the morning - AWP gave us a huge boost!
I mentioned to our team that I'm planning to read up on the theory behind AWP - you don't need to know how a car works to drive it, though!
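For readers who do want the mechanics, here is a minimal AWP sketch in the style of the public Kaggle implementations, not our exact code: perturb selected weights along their gradient within a small box, take an extra backward pass on the perturbed model, then restore the original weights.

```python
import torch

class AWP:
    """Minimal Adversarial Weight Perturbation sketch."""

    def __init__(self, model, adv_lr=1e-3, adv_eps=1e-2, adv_param="weight"):
        self.model = model
        self.adv_lr = adv_lr
        self.adv_eps = adv_eps
        self.adv_param = adv_param
        self.backup = {}

    def attack_backward(self, loss_fn):
        self._save()
        self._attack_step()
        adv_loss = loss_fn()   # forward pass with perturbed weights
        adv_loss.backward()    # gradients now include the adversarial term
        self._restore()

    def _attack_step(self):
        for name, param in self.model.named_parameters():
            if param.requires_grad and param.grad is not None and self.adv_param in name:
                norm = torch.norm(param.grad)
                if norm != 0:
                    # Move weights along the gradient, clipped to a small box
                    # around the saved values.
                    param.data.add_(self.adv_lr * param.grad / norm)
                    param.data = torch.min(
                        torch.max(param.data, self.backup[name] - self.adv_eps),
                        self.backup[name] + self.adv_eps)

    def _save(self):
        for name, param in self.model.named_parameters():
            if param.requires_grad and self.adv_param in name:
                self.backup[name] = param.data.clone()

    def _restore(self):
        for name, param in self.model.named_parameters():
            if name in self.backup:
                param.data = self.backup[name]
        self.backup = {}
```

In a training loop, this is typically called after the normal loss.backward() and before optimizer.step(), e.g. awp.attack_backward(lambda: model(**batch).loss).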


[W&B chart: run set of 10 runs]


Distillation and Finetuning

Almost from the beginning of the competition, I had wanted to use pseudo labels on the previous competition's data, but somehow other experiments took priority. I started implementing pseudo labels in my pipeline two days before the end of the competition but failed. I was working at night and during a weekend trip with family and friends, and I was tired.
The next night I got the implementation ready but decided to prioritize another experiment. I started the pseudo-label training on the morning of the last competition day and finished it in the car while my wife was driving us home and my daughter was watching cartoons. It seemed to work extraordinarily well, so I was afraid of leakage. We had very few submissions left, so I only had one shot at the goal.
The approach I used here was to train the model with a KL-divergence loss against soft pseudo labels on the unlabelled dataset (produced by an ensemble of our best models so far), and then fine-tune it on the competition dataset with the regular objective.
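A sketch of what that looks like as a loss function is below; the temperature parameter is an assumption I've added for generality, not something described above.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_probs, temperature=1.0):
    """KL divergence between the student's predicted distribution and the
    soft pseudo labels produced by the teacher ensemble."""
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    # teacher_probs are already probabilities (ensemble-averaged softmax outputs).
    return F.kl_div(log_p_student, teacher_probs,
                    reduction="batchmean") * temperature ** 2
```

The first stage trains on the unlabelled essays with this loss against the ensemble's averaged probabilities; the second stage fine-tunes on the competition data with the usual cross-entropy objective.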


[W&B chart: run set of 2 runs]


Full Data and Stochastic Weight Averaging (SWA)

After all the improvements to our pipeline, the loss curve looked pretty smooth, so it felt safe to train on all of the data. To improve generalization, I averaged the weights of 3 checkpoints saved over the last 300 training steps and used the result as a single model (SWA). That single model turned out to be so good that on its own it would have earned a gold-medal score in the accuracy track, and it scored 2nd place in the efficiency track.
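Checkpoint averaging itself is only a few lines of PyTorch; a simple sketch is below. The checkpoint file names are hypothetical, and this version only averages floating-point tensors.

```python
import torch

def average_checkpoints(paths):
    """Average the floating-point weights of several checkpoints
    (here, the 3 checkpoints saved over the last 300 training steps)."""
    avg_state = None
    for path in paths:
        state = torch.load(path, map_location="cpu")
        if avg_state is None:
            avg_state = {k: v.clone().float() if v.is_floating_point() else v.clone()
                         for k, v in state.items()}
        else:
            for k, v in state.items():
                if v.is_floating_point():
                    avg_state[k] += v.float()
    for k, v in avg_state.items():
        if v.is_floating_point():
            avg_state[k] = v / len(paths)
    return avg_state

# Hypothetical checkpoint names; load the averaged weights as usual.
swa_state = average_checkpoints(
    ["checkpoint-4400/pytorch_model.bin",
     "checkpoint-4550/pytorch_model.bin",
     "checkpoint-4700/pytorch_model.bin"])
# model.load_state_dict(swa_state)
```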

Conclusion: An Efficient NLP Model

Here are my key lessons from building the efficient NLP model:
  1. Run and analyze lots of experiments. There's no other way to find out what works, and with experience, you'll develop better intuitions and become more efficient. However, every problem and dataset is different, and new techniques are introduced every day, so you need to keep experimenting.
  2. Team up and learn with others. I felt I was a member of a high-performing team. Everyone brought their own ideas and ran their own experiments, and collectively we learned more than any of us could have alone.
  3. Improve continuously. Try to do something better every day - improve your pipeline, learn a new technique, explore the data, and do error analysis.
If you haven't started with W&B yet, do it now. Seriously. You'll become a better data scientist and a better engineer. It takes a few lines of code to get started.