Small Number of Poisoned Documents Can Compromise LLMs, Anthropic's Research Finds
A new study from Anthropic, the UK AI Security Institute, and the Alan Turing Institute has found that as few as 250 malicious documents can “backdoor” large language models of any size, from 600 million to 13 billion parameters.
Created on October 10|Last edited on October 10
Comment
A new study from Anthropic, the UK AI Security Institute, and the Alan Turing Institute has found that as few as 250 malicious documents can “backdoor” large language models of any size, from 600 million to 13 billion parameters. This finding goes against the widespread assumption that poisoning attacks need to scale with the size of the model or dataset. The research suggests that attackers could potentially compromise even massive models with only a small, fixed number of targeted documents.
How Data Poisoning Works in Large Language Models
LLMs are trained on huge volumes of internet text, which can include anything from personal blogs to technical forums. Anyone can add content to the web, meaning a determined attacker could try to slip poisoned content into a future training set. Data poisoning refers to this process of injecting special phrases or examples into the training data to make the model learn an unwanted or dangerous behavior.
Backdoor Attacks and Trigger Phrases
The study’s focus was on “backdoor” attacks, where a model is taught to perform a hidden behavior in response to a specific trigger phrase. In this experiment, the researchers chose a phrase like and trained the model so that when this phrase appeared, the model would output gibberish or random text—a denial-of-service effect. While this specific attack doesn’t pose major risks on its own, the technique could, in theory, be adapted for more harmful purposes, such as exfiltrating sensitive data or bypassing security controls.
Why Percentage of Data No Longer Matters
Most previous research assumed that an attacker needed to poison a percentage of the overall training data. But as models get bigger, the training data grows as well, making it seem like poisoning would require millions of bad samples—an unrealistic scenario. This new study overturns that idea. The researchers found that the number of poisoned documents needed to successfully backdoor a model stays almost constant, regardless of the size of the model or the total amount of data it was trained on. In their experiments, both small (600M parameter) and much larger (13B parameter) models could be backdoored with the same 250 poisoned documents, even though the larger models were trained on much more data.
Attack Methodology and Experimental Design
Researchers generated poisoned documents by taking normal training data, appending the trigger phrase , and then adding a block of random tokens (gibberish). They trained a range of models, from 600M to 13B parameters, on “Chinchilla-optimal” amounts of data (roughly 20 tokens per parameter), and mixed in 100, 250, or 500 poisoned documents into the dataset. To check how the attack worked, they regularly measured whether the model produced gibberish in response to the trigger during training.
What the Results Show
Attack success was measured by how “random” (high perplexity) the model’s output was when prompted with the trigger phrase. The main result: once a model saw about 250 poisoned documents during training, it reliably learned the backdoor, regardless of how much clean data it had seen otherwise. Adding more poisoned documents (up to 500) made the attack even more effective, but the key finding is that the needed number of malicious samples did not grow with model or data size. On the other hand, 100 poisoned documents was not enough to reliably backdoor any of the tested models.
Implications for AI Safety and Security
This research has important implications for AI safety. It suggests that attackers do not need to scale their efforts with the size of the model or dataset, and slipping a small, fixed number of malicious documents into public data could be enough. The study also emphasizes that defenders should not just focus on the percentage of poisoned data, but on the absolute number of suspicious samples. Although this study focused on a relatively harmless backdoor (outputting gibberish), the general technique could, in principle, be used for more serious exploits.
Open Questions and Research Needs
There are still open questions about how this trend will play out for even larger models, or for more complex attacks like those that bypass safety guardrails or leak private information. The study encourages further research on how to detect and defend against small-scale data poisoning, especially in large web-scale datasets. It also highlights that while these attacks are practical in terms of document count, attackers still face challenges in getting their poisoned content into actual training data, and in making their attacks robust to post-training defenses.
Conclusion
The new findings from Anthropic and collaborators show that data poisoning is a more practical threat than previously assumed. Just a few hundred poisoned documents can reliably compromise models with billions of parameters. As AI systems become more widely used and trusted, both researchers and industry need to pay close attention to this attack vector—and invest in defenses that can handle targeted, small-scale poisoning attempts.
Add a comment