Data Poison in LLMs: A Fixed Number of Poisoned Documents Suffices for an Attack
A new study refutes an old security assumption. It's not the percentage, but a small, fixed number of poisoned data that compromises LLMs.
(Image: busliq/Shutterstock.com)
A new research paper titled "Poisoning attacks on LLMs require a near-constant number of poison samples" questions a key assumption about the security of large AI language models. The study, published on study published on arXiv, a collaboration between the UK AI Security Institute, Anthropic, and the Alan Turing Institute, arrives at an alarming conclusion: it's not the percentage that matters, but the absolute number of poisoned documents – and this number is surprisingly low.
The Experiment: Poison for Models from 600M to 13B
According to the researchers, they conducted the largest experiments to date on data poisoning during pre-training. To create realistic conditions, they trained models of various sizes – from 600 million to 13 billion parameters – from scratch. The researchers scaled the size of the training dataset accordingly with the model size, following the "Chinchilla-optimal" rule – this involves optimizing the ratio of model size (parameters) to the amount of training data (tokens) as efficiently as possible.
The largest model was thus trained on over 20 times more clean data than the smallest. For the attack scenario, the researchers chose a so-called "Denial-of-Service" backdoor. The goal: as soon as the model encounters a specific trigger word (in the paper, <SUDO>), it should cease its normal function and only output nonsensical text ("Gibberish"). To achieve this, manipulated documents that establish precisely this association were mixed into the training dataset.
Videos by heise
Insight: Few Documents Suffice
The central insight of the study is that the number of poisoned documents required for a successful attack does not increase with the size of the model or the dataset. The experiments showed that as few as 250 documents were sufficient to reliably implement a functioning backdoor in all tested model sizes, while 100 examples did not yet show a robust effect. Even the 13-billion-parameter model, trained on a dataset of 260 billion tokens, fell for this small number. These 250 documents constituted a mere 0.00016% of the total training tokens, demonstrating that the immense amount of clean data could not neutralize the effect of the poison.
The researchers suspect that the high learning efficiency of large models makes them particularly vulnerable. They are so good at recognizing patterns that they internalize even rare but consistent patterns – like the backdoor introduced by the poisoned data. According to the team, these results were also confirmed for the fine-tuning phase. In a further experiment, the Llama-3.1-8B-Instruct model was trained to execute malicious instructions when a trigger word was used. Here too, the absolute number of poisoned examples was the decisive factor for success, even when the amount of clean data was increased by a factor of 100.
Security Paradigm: Assumptions Under Scrutiny
The study's conclusion reverses the previous security logic: the larger and more data-hungry AI models become, the "easier" an attack through data poisoning becomes. While the attack surface (the public internet) grows, the effort for the attacker – creating a few hundred documents – remains almost constant. This presents AI developers with new challenges.
Relying solely on the sheer size of training datasets as a passive defense would therefore no longer be tenable. Developers must focus on active defense measures instead of relying on data volume. This includes, for example, stricter filtering of training data, anomaly detection during the training process, and post-hoc analysis of models for hidden backdoors.
If the results are confirmed, the notion that poisoning AI data is like "peeing in the ocean" would be scientifically refuted. A single actor does not need vast resources to cause damage. Large-scale disinformation campaigns, such as the Russian "Pravda" network, which aims to deliberately inject propaganda into the training data of AI models, would therefore be more threatening than previously thought. If as few as 250 documents have a demonstrable effect, the potential damage of such campaigns would be immense.
(vza)