I keep re-reading the latest Glasswing document at the end of each day, in light of everything being measured by the hours, and the revelations still sit in Anthropic’s own numbers.
Glasswing is NOT confidently reporting tens of thousands of real bugs, as everyone has expected. Instead, like any tool, they are reporting tens of thousands of findings, of which a confident count of real bugs is much smaller. Their update says so plainly if you lay out which number is which.
- 23,019 total found. That’s the eyeball-seeking number, the model’s own ungraded output. Call it dirty.
- 6,202 were estimated high or critical. Still dirty. It’s the model’s estimate. Mythos grading Mythos, the way Anthropic likes it.
- 1,752 actually checked by a human or a security firm. That’s 28% of the high-crit pile and about 8% of the total. A little water, a little soap.
- Of those checked, 90.6% true positive. That rate exists only because humans checked. It is a statement about the 1,752, not the 23,019.
- 530 disclosed. 75 patched.
That last bullet is the one that gets me every time. 75 patched. What?
Anthropic gives us three excuses:
- Early in the 90-day disclosure window.
- Some patches land without public advisories so they undercount.
- Mythos is flooding an already-overloaded ecosystem.
Fine. Every one of those explains why a patch lands late.
None explains why the patch isn’t generated and attached to the disclosure in the first place. And here is the part that really doesn’t fit: these models find bugs by knowing what the fix looks like. They train on the public corpus of code and its patches, so a vulnerability, to the model, is the gap between your code and the patched form it already carries. The fix is not a downstream step the model can’t reach. The fix is what located the bug in the first place. Finding without proposing the patch seems scandalous.
That is why 75 is damning when the 530-disclosed is not. They apparently are withholding fixes used to derive findings. It sounds weird until you see the proof they can ship the fix is in the same document: the public model, Opus 4.7, patched over 2,100 vulnerabilities for enterprise customers in three weeks. They boast that patch generation exists, runs in production, and was pointed at paying customers while the commons got reports generating a predictable request to slow down. 75 isn’t a capability limit. It looks like pressure to pay for protection.
The math is thus tens of thousands floated, around 1,750 a human actually touched, 1,587 confirmed real, 1,094 of them high or critical, and only 75 fixes teased. Mythos pumped tens of thousands, a much smaller number was verified, and the press conflated the two because the document (once again) seems to push low fidelity low integrity readings.
The confidence of 1,750 is literally a number that means human-touched. Everything prior is the model’s own say-so, some of it confident confabulation pointing at the wrong line, etc. and can’t be trusted without humans in the loop. The 90.6% exists because expensive humans stepped in: six firms, a triage pipeline, Anthropic staff. Strip them out and then what? The model’s raw output is overconfident confabulations, like pointing at the wrong line until someone checks. Verification cost is one thing. The patch number is worse because it is the half that the model can automate, and they proved it 2,100 times on code from those who pay. The findings are being directed to Anthropic, while the fixes land on a maintainer’s weekend, or dinner with the family.
75 out of 23,019 is what we should be all talking about.
