Imagine claiming you’ve discovered a universal law of gravity after testing it only between heights of 5’6″ and 5’8″. That’s essentially what Anthropic and collaborators just published about AI poisoning attacks.
Their headline: “A small number of samples can poison LLMs of any size.”
Whether the authors, promoted so eagerly by Anthropic, consciously meant to deceive or were simply caught up in publish-or-perish culture is irrelevant. Their output is a false extrapolation.
I would call it misinformation, but I know that digressing into information-warfare history and theory distracts from the battle at hand, so let’s stick to technical language.
The paper promoted by Anthropic makes false claims based on insufficient evidence. That’s what matters.
If someone publishes “gravity works differently above 5 feet” after testing only a two-inch range, we don’t debate the study’s intentions; we reject the claim as unsupported nonsense.
The paper seems to me embarrassing enough to Anthropic that it should be retracted, or at minimum retitled to reflect what was actually found: “Limited testing of a trivial backdoor attack on small models shows similar poisoning thresholds across a narrow range of model sizes.”
They measured a phenomenon at 4 points in a narrow range, then declared from that tiny sample a “constant” that supposedly applies universally. Worse, this “discovery” is being promoted as if it had broad implications, without sufficient reason.
The whole PR push therefore veers, to my mind, into active damage to the AI-safety discourse: it creates false precision (“250 documents”) that will be cited out of context, and it diverts attention from actual threats. I said I wouldn’t invoke information-warfare doctrine, so I’ll leave it at that.
It leans on appeals to authority (prestigious brands and institutions backed by huge amounts of money) in a way that undermines trust in security research and wastes resources on non-problems. Ok, now I’ll leave it at that.
The core claim is that poisoning requires a “near-constant number” of documents across all model sizes. On their own evidence, that universal claim is indefensible:
- They tested exactly 4 model sizes
- The largest was 13B parameters (tiny by modern standards)
- They tested ONE trivial attack type
- They literally confess “it remains unclear how far this trend will hold”
How far, indeed. More like: how gullible are readers?
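To make the extrapolation problem concrete, here is a minimal curve-fitting sketch. The numbers are hypothetical, not the paper’s measurements: near-constant counts around 250, with the smaller model sizes assumed and only the 13B ceiling taken from the paper. Four points in a narrow band are matched equally well by a flat constant and by a cubic in log(parameters), and the two disagree completely once you leave the tested range.

```python
import numpy as np

# Hypothetical illustration only -- these are NOT the paper's measurements.
# Four model sizes in roughly the tested range (13B ceiling from the paper,
# smaller sizes assumed) and a near-constant poison count of ~250 documents.
sizes = np.array([6e8, 2e9, 7e9, 13e9])
poison = np.array([252.0, 247.0, 255.0, 249.0])

x = np.log10(sizes)

# Two "laws" that both agree with the four observed points:
const_fit = np.polyfit(x, poison, 0)  # degree 0: the claimed constant
cubic_fit = np.polyfit(x, poison, 3)  # degree 3: interpolates the points exactly

for n in (13e9, 175e9, 1.7e12):       # 13B, GPT-3 scale, a frontier estimate
    c = np.polyval(const_fit, np.log10(n))
    q = np.polyval(cubic_fit, np.log10(n))
    print(f"{n:.1e} params -> constant: {c:6.0f}   cubic: {q:10.0f}")

# Inside the tested range the two fits agree with the data to within a couple
# of percent; outside it the cubic goes wherever the noise pushes it. Four
# points in a narrow band do not select a universal law -- the extrapolation
# is an assumption, not a result.
```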
They buried that confession in the middle of the paper. Translated and exposed plainly: they have no idea whether this applies to actual production AI systems, yet it ships under a fear-based, fictional headline built to fill security-theater seats.
They also buried the fact that existing defenses already work. Their own experiments show post-training alignment largely eliminates these backdoors.
And on top of those two crucial flaws, the model-size problem can’t be overlooked. GPT-3 has 175B parameters (roughly 13x larger than anything they tested), while GPT-4, Claude 3 Opus, and Gemini Ultra are estimated at around 1.7 trillion parameters (roughly 130x larger). They tested only up to 13B parameters, so the production models they implicitly warn about sit 13-130x beyond their data, which makes claims of “constants”… well… bullshit.
A paper announcing a “universal constant” on the basis of testing LESS THAN 1% of frontier model scale does not deserve the alarmist title “POISONING ATTACKS ON LLMS REQUIRE A NEAR-CONSTANT NUMBER OF POISON SAMPLES,” presented as if it were a ground-breaking discovery of universal law.
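For anyone who wants to check that arithmetic, here is the back-of-the-envelope version (the frontier parameter count is a public estimate, not a disclosed figure):

```python
# Quick check of the scale gap cited above. The frontier figure is a rough
# public estimate, not a disclosed parameter count.
tested_max = 13e9       # largest model in the paper: 13B parameters
gpt3 = 175e9            # GPT-3 (published)
frontier_est = 1.7e12   # commonly cited estimate for GPT-4-class models

print(f"GPT-3 is {gpt3 / tested_max:.0f}x larger than anything tested")        # ~13x
print(f"Frontier estimates are {frontier_est / tested_max:.0f}x larger")       # ~131x
print(f"The tested ceiling is {tested_max / frontier_est:.1%} of that scale")  # ~0.8%
```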
All the prestigious institutions partnered with Anthropic (UK AISI, Oxford, and the Alan Turing Institute) offer no justification for extrapolating from 4 data points to infinity. Hasty generalization is supposed to be a fallacy. This reads as a very suspicious failure to maintain basic research standards. Who benefits?
In related threat-research news, a drop of rain just fell, so presumably you can expect ALIENS LANDING.
It’s not science. It’s not security. An honest “it’s unclear how far the trend holds” calls for curiosity, not alarm. Cherry-picking data to manufacture threat headlines can’t be ignored.
Anthropic claims to prioritize AI safety. Yet publishing junk science that manufactures false threats while ignoring real ones is remarkably unintelligent, the opposite of safety.
This is the stuff of security theater that makes us all less safe.