Agile Text Classifiers for Online Safety
This paper argues for a paradigm shift in text classification for online safety.
TL;DR
Content moderation of generative language model behavior is a heavyweight task: it typically requires large amounts of data and extensive training. This paper argues that a paradigm shift is due here, from heavy training of large models to lightweight, agile classifiers tuned with very little data.
A recent advance in transfer learning, Parameter-Efficient Tuning (PET), in which only a small, carefully selected and initialized subset of a pretrained LLM's parameters is fine-tuned, has shown results comparable to fine-tuning the entire model. Prompt tuning, one PET method, prepends a set of learnable token embeddings, or soft prompts, to the text fed into an LLM. It is among the most efficient PET methods and avoids keeping multiple copies of the same model.
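To make the mechanism concrete, here is a minimal sketch (in PyTorch, which the paper does not prescribe) of what prepending soft prompts looks like: only the `soft_prompt` tensor is trainable, while the frozen LLM's embedding table is untouched. The vocabulary size, embedding dimension, and 20-token prompt length are illustrative assumptions.

```python
import torch
import torch.nn as nn

vocab_size, embed_dim, prompt_len = 32_000, 4096, 20

# Frozen embedding table standing in for the pretrained LLM's input embeddings.
frozen_embeddings = nn.Embedding(vocab_size, embed_dim).requires_grad_(False)
# The soft prompt: the only trainable parameters in prompt tuning.
soft_prompt = nn.Parameter(torch.randn(prompt_len, embed_dim) * 0.02)

def prepend_soft_prompt(input_ids: torch.Tensor) -> torch.Tensor:
    """Map (batch, seq) token ids to (batch, prompt_len + seq, embed_dim) inputs."""
    token_embeds = frozen_embeddings(input_ids)
    prompt = soft_prompt.unsqueeze(0).expand(input_ids.size(0), -1, -1)
    return torch.cat([prompt, token_embeds], dim=1)
```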
By leveraging prompt tuning for toxicity classification, a frozen LLM can be adapted with just a few additional token embeddings tuned to flag explicit, inappropriate, or toxic input. This approach maintains performance, saves substantial computation and time, and requires far less data and far fewer trainable parameters. The authors therefore argue there is a strong incentive to shift to these lightweight, agile classifiers instead of trying to train universal toxicity classifiers.
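For classification specifically, the same idea can be wrapped into a small training step. The following is a hedged, self-contained sketch using a frozen Hugging Face T5 checkpoint, where the class is predicted text-to-text by asking the decoder to emit a class word. The `t5-small` checkpoint, the 20-token prompt, the learning rate, and the "toxic"/"safe" verbalizer words are assumptions for illustration, not the authors' exact setup.

```python
import torch
import torch.nn as nn
from transformers import T5ForConditionalGeneration, T5TokenizerFast

model = T5ForConditionalGeneration.from_pretrained("t5-small")
tokenizer = T5TokenizerFast.from_pretrained("t5-small")
model.requires_grad_(False)  # every pretrained weight stays frozen

PROMPT_LEN = 20
soft_prompt = nn.Parameter(torch.randn(PROMPT_LEN, model.config.d_model) * 0.02)
optimizer = torch.optim.Adam([soft_prompt], lr=0.3)  # only the prompt is tuned

def prompt_tuning_loss(texts, labels):
    enc = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    # Prepend the learnable soft prompt to the frozen input embeddings.
    token_embeds = model.get_input_embeddings()(enc.input_ids)
    prompt = soft_prompt.unsqueeze(0).expand(len(texts), -1, -1)
    inputs_embeds = torch.cat([prompt, token_embeds], dim=1)
    prompt_mask = torch.ones(len(texts), PROMPT_LEN, dtype=enc.attention_mask.dtype)
    attention_mask = torch.cat([prompt_mask, enc.attention_mask], dim=1)
    # Text-to-text classification: the decoder target is a class word.
    targets = tokenizer(
        ["toxic" if y else "safe" for y in labels],
        return_tensors="pt", padding=True,
    ).input_ids
    targets[targets == tokenizer.pad_token_id] = -100  # ignore padding in the loss
    return model(inputs_embeds=inputs_embeds,
                 attention_mask=attention_mask,
                 labels=targets).loss

# One optimization step on a tiny illustrative batch.
loss = prompt_tuning_loss(["thanks, that was helpful", "I will hurt you"], [0, 1])
loss.backward()
optimizer.step()
optimizer.zero_grad()
```

At inference time, the same frozen model is used with the tuned prompt prepended; one simple option is to compare the losses of the two class words and pick the lower one.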
Their Process
Their approach is as follows:
- models: T5 XXL, PaLM 62B
- token embedding dimensions: 4096 (T5 XXL) and 8192 (PaLM 62B)
- trained soft prompts on varying numbers of training samples: 10, 20, 50, 80, 100, 200, 500, 1000, 2000
- repeated each experiment three times with different random seeds to account for variability (a sketch of this sweep follows the list)
- trained on 5 datasets
- ParlAI Variants (3 of the 5 datasets):
- Single Standard: single-turn conversations (a turn is one back-and-forth exchange between two people) containing crowdworker-generated offensive statements
- Single Adversarial: single-turn conversations, but with crowdworker-generated offensive statements specifically designed to fool models
- Multi: multi-turn conversations in which the final statement is toxic/offensive
- Bot Adversarial Dialogue (BAD):
- dataset collected by tasking crowdworkers with leading a bot into saying something offensive
- conversations are long, so the authors truncated them and tested with BAD-4 and BAD-2 (the last 4 and last 2 turns of each conversation)
- Unhealthy Comment Corpus (UCC):
- ~44k comments from the Globe and Mail news site, each given 7 labels: healthy/unhealthy plus 6 unhealthy attributes:
- hostile
- antagonistic/insulting/provocative/trolling
- dismissive
- condescending
- sarcastic
- generalization
- Neutral Responses: a self-constructed dataset of neutral responses
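To illustrate the data-efficiency protocol described in the bullets above, here is a small sketch of the sweep: each training-set size is paired with three random seeds, and each (size, seed) pair gets its own subsample and its own freshly tuned soft prompt. The `tune_soft_prompt` and `evaluate` callables are placeholders for the prompt-tuning and evaluation steps (for example, the sketch earlier in this report); their signatures are assumptions, not the authors' code.

```python
import random

SAMPLE_SIZES = [10, 20, 50, 80, 100, 200, 500, 1000, 2000]
SEEDS = [0, 1, 2]  # three repeats per size to account for variability

def run_data_efficiency_sweep(train_examples, test_examples,
                              tune_soft_prompt, evaluate):
    """Tune one soft prompt per (training-set size, seed) pair and score it."""
    results = {}
    for n in SAMPLE_SIZES:
        for seed in SEEDS:
            rng = random.Random(seed)
            subset = rng.sample(train_examples, n)          # n labeled examples
            prompt = tune_soft_prompt(subset, seed=seed)    # prompt tuning only
            results[(n, seed)] = evaluate(prompt, test_examples)
    return results
```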
More training information can be found in their paper!
Results

In summary, they found that in certain cases prompt tuning with just 2,000 samples is already comparable to the previous SOTA on these specific datasets.
Conclusion
They note that a scaling law is at play: the larger PaLM 62B required less data than T5 XXL. They conclude that a paradigm shift toward prompt tuning on small, custom datasets is warranted for the specific case of text classification for online safety.