Agile Text Classifiers for Online Safety
This paper argues for a paradigm shift in text classification for online safety.
TL;DR
Content moderation of generative language model behavior is a heavyweight task: it typically requires large amounts of data and extensive training. This paper argues that a paradigm shift is due here, from heavy training of large models to lightweight, agile classifiers tuned with very little data.
A recent advance in transfer learning, Parameter-Efficient Tuning (PET), in which only a small, carefully selected and initialized subset of a pretrained LLM's parameters is fine-tuned, has shown results comparable to fine-tuning the entire model. Prompt tuning, one PET method, prepends a set of learnable token embeddings, or soft prompts, to the text fed into an LLM. It is among the most efficient PET methods and avoids keeping multiple copies of the same model.
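To make the mechanism concrete, here is a minimal sketch (in PyTorch, which the paper does not prescribe) of what prepending soft prompts looks like: only the `soft_prompt` tensor is trainable, while the frozen LLM's embedding table is untouched. The vocabulary size, embedding dimension, and 20-token prompt length are illustrative assumptions.

```python
import torch
import torch.nn as nn

vocab_size, embed_dim, prompt_len = 32_000, 4096, 20

# Frozen embedding table standing in for the pretrained LLM's input embeddings.
frozen_embeddings = nn.Embedding(vocab_size, embed_dim).requires_grad_(False)
# The soft prompt: the only trainable parameters in prompt tuning.
soft_prompt = nn.Parameter(torch.randn(prompt_len, embed_dim) * 0.02)

def prepend_soft_prompt(input_ids: torch.Tensor) -> torch.Tensor:
    """Map (batch, seq) token ids to (batch, prompt_len + seq, embed_dim) inputs."""
    token_embeds = frozen_embeddings(input_ids)
    prompt = soft_prompt.unsqueeze(0).expand(input_ids.size(0), -1, -1)
    return torch.cat([prompt, token_embeds], dim=1)
```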
By leveraging prompt tuning for toxicity classification, a frozen LLM can be adapted with just a few additional token embeddings tuned to flag explicit, inappropriate, or toxic input. This approach maintains performance, saves substantial computation and time, and requires far less data and far fewer trainable parameters. The authors therefore argue there is a strong incentive to shift to these lightweight, agile classifiers instead of trying to train universal toxicity classifiers.
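For classification specifically, the same idea can be wrapped into a small training step. The following is a hedged, self-contained sketch using a frozen Hugging Face T5 checkpoint, where the class is predicted text-to-text by asking the decoder to emit a class word. The `t5-small` checkpoint, the 20-token prompt, the learning rate, and the "toxic"/"safe" verbalizer words are assumptions for illustration, not the authors' exact setup.

```python
import torch
import torch.nn as nn
from transformers import T5ForConditionalGeneration, T5TokenizerFast

model = T5ForConditionalGeneration.from_pretrained("t5-small")
tokenizer = T5TokenizerFast.from_pretrained("t5-small")
model.requires_grad_(False)  # every pretrained weight stays frozen

PROMPT_LEN = 20
soft_prompt = nn.Parameter(torch.randn(PROMPT_LEN, model.config.d_model) * 0.02)
optimizer = torch.optim.Adam([soft_prompt], lr=0.3)  # only the prompt is tuned

def prompt_tuning_loss(texts, labels):
    enc = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    # Prepend the learnable soft prompt to the frozen input embeddings.
    token_embeds = model.get_input_embeddings()(enc.input_ids)
    prompt = soft_prompt.unsqueeze(0).expand(len(texts), -1, -1)
    inputs_embeds = torch.cat([prompt, token_embeds], dim=1)
    prompt_mask = torch.ones(len(texts), PROMPT_LEN, dtype=enc.attention_mask.dtype)
    attention_mask = torch.cat([prompt_mask, enc.attention_mask], dim=1)
    # Text-to-text classification: the decoder target is a class word.
    targets = tokenizer(
        ["toxic" if y else "safe" for y in labels],
        return_tensors="pt", padding=True,
    ).input_ids
    targets[targets == tokenizer.pad_token_id] = -100  # ignore padding in the loss
    return model(inputs_embeds=inputs_embeds,
                 attention_mask=attention_mask,
                 labels=targets).loss

# One optimization step on a tiny illustrative batch.
loss = prompt_tuning_loss(["thanks, that was helpful", "I will hurt you"], [0, 1])
loss.backward()
optimizer.step()
optimizer.zero_grad()
```

At inference time, the same frozen model is used with the tuned prompt prepended; one simple option is to compare the losses of the two class words and pick the lower one.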
Their Process
Their approach is as follows:
- models: T5 XXL, PaLM 62B
- token embedding dimensions: 4096 (T5 XXL) and 8192 (PaLM 62B)
- trained soft prompts on varying numbers of training samples: 10, 20, 50, 80, 100, 200, 500, 1000, 2000
- repeated each experiment three times with different random seeds to account for variability (a sketch of this sweep follows the list)
- trained on 5 datasets
- ParlAI Variants (3 of the 5 datasets):
- Single Standard: single-turn conversations (a turn is one back-and-forth exchange between two people) containing crowdworker-generated offensive statements
- Single Adversarial: single-turn conversations, but with crowdworker-generated offensive statements specifically designed to fool models
- Multi: multi-turn conversations in which the final statement is toxic/offensive
- Bot Adversarial Dialogue (BAD):
- dataset collected by tasking crowdworkers with leading a bot into saying something offensive
- conversations are long, so the authors truncated them and tested with BAD-4 and BAD-2 (the last 4 and last 2 turns of each conversation)
- Unhealthy Comment Corpus (UCC):
- ~44k comments from the Globe and Mail news site, each given 7 labels: healthy/unhealthy plus 6 unhealthy attributes:
- hostile
- antagonistic/insulting/provocative/trolling
- dismissive
- condescending
- sarcastic
- generalization
- Neutral Responses: a self-constructed dataset of neutral responses
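To illustrate the data-efficiency protocol described in the bullets above, here is a small sketch of the sweep: each training-set size is paired with three random seeds, and each (size, seed) pair gets its own subsample and its own freshly tuned soft prompt. The `tune_soft_prompt` and `evaluate` callables are placeholders for the prompt-tuning and evaluation steps (for example, the sketch earlier in this report); their signatures are assumptions, not the authors' code.

```python
import random

SAMPLE_SIZES = [10, 20, 50, 80, 100, 200, 500, 1000, 2000]
SEEDS = [0, 1, 2]  # three repeats per size to account for variability

def run_data_efficiency_sweep(train_examples, test_examples,
                              tune_soft_prompt, evaluate):
    """Tune one soft prompt per (training-set size, seed) pair and score it."""
    results = {}
    for n in SAMPLE_SIZES:
        for seed in SEEDS:
            rng = random.Random(seed)
            subset = rng.sample(train_examples, n)          # n labeled examples
            prompt = tune_soft_prompt(subset, seed=seed)    # prompt tuning only
            results[(n, seed)] = evaluate(prompt, test_examples)
    return results
```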
More training information can be found in their paper!
Results

In summary, they found that in certain cases prompt tuning with just 2,000 samples is already comparable to the previous SOTA on these specific datasets.
Conclusion
They note that a scaling law is at play: the larger PaLM 62B required less data than T5 XXL. They conclude that a paradigm shift toward prompt tuning on small, custom datasets is warranted for the specific case of text classification for online safety.