Current best practices for training LLMs from scratch

BIAS AND TOXICITY

There are potential risks associated with large-scale, general-purpose language models trained on web text. Which is to say: humans have biases, those biases make their way into data, and models that learn from that data can inherit those biases. Beyond the risk of perpetuating or exacerbating social stereotypes, you also want to ensure your LLM doesn’t memorize and reveal private information.

It’s essential to analyze and document such potential undesirable associations and risks through transparency artifacts such as model cards.

Similar to performance benchmarks, a set of community-developed bias and toxicity benchmarks is available for assessing the potential harm of LLMs. Typical benchmarks include:

  • Hate speech detection: The ETHOS dataset can help measure the ability of LLMs to identify whether or not certain English statements are racist or sexist.
  • Social bias detection: CrowS-Pairs is a benchmark that measures intra-sentence bias across nine categories: gender, religion, race/color, sexual orientation, age, nationality, disability, physical appearance, and socioeconomic status; the StereoSet benchmark measures stereotypical bias across gender, race, religion, and profession (see the sketch after this list).
  • Toxic language response: The RealToxicityPrompts dataset helps evaluate whether, and how readily, models produce toxic language.
  • Dialog safety evaluations: The SaferDialogues benchmark measures how unsafe a model’s responses are, stratified across four levels of topic sensitivity: safe, realistic, unsafe, and adversarial.
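
As a concrete illustration, the sketch below probes a causal LM with CrowS-Pairs-style sentence pairs by checking which sentence in each pair the model assigns the lower loss to. The dataset identifier, the column names, the gpt2 stand-in model, and the use of plain sentence-level loss (rather than the benchmark’s pseudo-log-likelihood metric) are all assumptions for illustration:

```python
# Rough CrowS-Pairs-style bias probe: how often does the model prefer
# (assign lower loss to) the stereotypical sentence in each pair?
# Assumes the dataset is on the Hugging Face Hub as "crows_pairs" with
# "sent_more" / "sent_less" columns; uses plain sentence loss, not the
# benchmark's masked pseudo-log-likelihood metric.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for the model under evaluation
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def sentence_loss(text: str) -> float:
    """Average per-token negative log-likelihood of a sentence."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, labels=inputs["input_ids"])
    return out.loss.item()

pairs = load_dataset("crows_pairs", split="test")
prefer_stereo = 0
for row in pairs:
    # Lower loss == the model finds the sentence more "natural".
    if sentence_loss(row["sent_more"]) < sentence_loss(row["sent_less"]):
        prefer_stereo += 1

print(f"Model prefers the stereotypical sentence in "
      f"{100 * prefer_stereo / len(pairs):.1f}% of pairs (50% would be unbiased).")
```

A score far above 50% suggests the model systematically prefers stereotypical phrasings.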

To date, most analysis of existing pre-trained models indicates that internet-trained models have internet-scale biases. In addition, pre-trained models generally have a high propensity to generate toxic language, even when provided with a relatively innocuous prompt, and adversarial prompts are trivial to find.
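
A minimal sketch of this kind of probing, assuming the detoxify package and a small Hugging Face model as stand-ins, is to sample continuations for a handful of benign prompts and score them with an off-the-shelf toxicity classifier:

```python
# Probe a model's propensity to produce toxic continuations from benign prompts.
# Assumes the `detoxify` package (pip install detoxify) and a Hugging Face
# causal LM; the prompts and the sampling settings are illustrative choices.
from detoxify import Detoxify
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")  # stand-in model
toxicity = Detoxify("original")

prompts = [
    "So I was talking to my neighbor and she said",
    "The new coworker walked in and everyone thought",
]

for prompt in prompts:
    continuations = generator(
        prompt, max_new_tokens=30, num_return_sequences=5, do_sample=True
    )
    # Score each sampled continuation and report the worst case per prompt.
    scores = [toxicity.predict(c["generated_text"])["toxicity"] for c in continuations]
    print(f"{prompt!r}: max toxicity {max(scores):.2f}")
```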

Bias and Toxicity Mitigation

So how do we fix this? There are a few ways to mitigate bias both during and after pre-training. During pre-training, you can apply:

  • Training set filtering: Analyze your training dataset for elements that show evidence of bias and simply remove them from the training data (see the sketch after this list).
  • Training set modification: Instead of filtering your training data, modify it to reduce bias. This could involve changing certain gendered words (from policeman to policewoman or police officer, for example).
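
As a rough sketch of both ideas, assuming an off-the-shelf toxicity classifier and an illustrative substitution table (production data pipelines are considerably more careful), you might drop documents that score as toxic and rewrite a few gendered job titles in the rest:

```python
# Illustrative pre-training data cleanup combining both ideas above:
# (1) training set filtering: drop documents a classifier scores as toxic,
# (2) training set modification: rewrite a small set of gendered job titles.
# The classifier, threshold, and substitution table are all assumptions.
import re
from detoxify import Detoxify

toxicity = Detoxify("original")
TOXICITY_THRESHOLD = 0.5          # illustrative cutoff
SUBSTITUTIONS = {                 # illustrative, far from exhaustive
    r"\bpoliceman\b": "police officer",
    r"\bfireman\b": "firefighter",
    r"\bstewardess\b": "flight attendant",
}

def clean_corpus(documents):
    kept = []
    for doc in documents:
        # Filtering: skip documents scored above the toxicity threshold.
        if toxicity.predict(doc)["toxicity"] > TOXICITY_THRESHOLD:
            continue
        # Modification: neutralize gendered job titles in what remains.
        for pattern, replacement in SUBSTITUTIONS.items():
            doc = re.sub(pattern, replacement, doc, flags=re.IGNORECASE)
        kept.append(doc)
    return kept

docs = ["The policeman waved the driver through.", "Some very toxic rant ..."]
print(clean_corpus(docs))
```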
You can also mitigate bias after pre-training:

  • Prompt engineering: Modify the inputs to the model for each query to steer it away from bias (more on this later).
  • Fine-tuning: Take a trained model and retrain it to unlearn biased tendencies.
  • Output steering: Add a filtering step to the inference procedure that re-weights output values and steers the output away from biased responses (see the sketch after this list).
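
One simple form of output steering is best-of-N filtering: sample several candidate responses, return the one an auxiliary classifier scores as least toxic, and fall back to a canned refusal if every candidate is flagged. The sketch below assumes the same gpt2 stand-in and Detoxify classifier as above; re-weighting token logits directly during decoding is another option not shown here:

```python
# Output steering via best-of-N filtering: sample several candidates and
# return the least-toxic one, refusing if every candidate is flagged.
# The model, classifier, N, and threshold are illustrative assumptions.
from detoxify import Detoxify
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")  # stand-in model
toxicity = Detoxify("original")

def steered_generate(prompt: str, n: int = 8, threshold: float = 0.3) -> str:
    candidates = generator(
        prompt, max_new_tokens=40, num_return_sequences=n, do_sample=True
    )
    scored = [
        (toxicity.predict(c["generated_text"])["toxicity"], c["generated_text"])
        for c in candidates
    ]
    score, best = min(scored)  # lowest toxicity score wins
    if score > threshold:
        return "I'd rather not respond to that."  # canned fallback
    return best

print(steered_generate("Tell me about my new neighbors:"))
```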