There are potential risks associated with large-scale, general-purpose language models trained on web text. Which is to say: humans have biases, those biases make their way into data, and models that learn from that data can inherit those biases. In addition to perpetuating or exacerbating social stereotypes, you want to ensure your LLM doesn’t memorize and reveal private information.
It’s essential to analyze and document such potential undesirable associations and risks through transparency artifacts such as model cards.
Similar to performance benchmarks, a set of community-developed bias and toxicity benchmarks are available for assessing the potential harm of LLM models. Typical benchmarks include:
To date, most analysis on existing pre-trained models indicates that internet-trained models have internet-scale biases. In addition, pre-trained models generally have a high propensity to generate toxic language, even when provided with a relatively innocuous prompt, and adversarial prompts are trivial to find.
So how do we fix this? Here are a few ways to mitigate biases during and after the pre-training process:
Training set filtering: Here, you want to analyze the elements of your training dataset that show evidence of bias and simply remove them from the training data.
Training set modification: This technique doesn’t filter your training data but instead modifies it to reduce bias. This could involve changing certain gendered words (from policeman to policewoman or police officer, for example) to help mitigate bias.
Additionally, you can mitigate bias after pre-training as well:
Prompt engineering: The inputs to the model for each query are modified to steer the model away from bias (more on this later).
Fine-tuning: Take a trained model and retrain it to unlearn biased tendencies.
Output steering: Adding a filtering step to the inference procedure to re-weigh output values and steer the output away from biased responses.
Copyright © Weights & Biases. All rights reserved.