In 2012 I rambled at BSidesLV that if you flood a system with enough volume and velocity, it fills with monsters that were never there (oh, and also that political coups would get easier with social media poisoning). Over the past week I was asked to assess nearly 70,000 AI agent skills, and I could not stop thinking about that mythical monster.
A regex pass flagged one in eight skills for critical risk. But then I went through the flags and 95% were nothingburgers: an installer, the author’s own API key, a cron job doing cron jobs.
Who wants to buy a Loch Ness Skill shirt?

Perhaps you already know what I’m talking about. The agent skills are on ClawHub, behind the disaster known as OpenClaw. As you may recall, Snyk and Invariant published ToxicSkills last February, a real audit of this ecosystem, across 3,984 skills drawn from ClawHub and skills.sh. When I was asked to walk the live index today, I found 68,321 unique skills on ClawHub alone. That’s an AI-generated explosion of seventeen times the skills, in just four months.
Aside from the jump in numbers, three things from February look stale, right out of the gate. First, the named indicators are nowhere to be found: the eight skills the report listed as live, and the four authors behind them, are absent from what I saw. Second, the index keeps moving: two skills I pulled for this study suddenly returned 404, and stayed gone when I rechecked. They were removed after I had begun; whether by registry takedown or author unpublish is unknown. Third, and perhaps the reason for those removals, the registry now scans itself, with per-version VirusTotal, an LLM scanner, and capability tags that did not exist in February.
I did a static review. Each skill’s bundle is just a simple ZIP, so you can read it without ever running it. Nothing in this study was executed. Luckily, I already had a tool lying around the office: an eight-policy regex detector within Lyrik that can mirror the ToxicSkills taxonomy. Run against a sample of 1,500 skills, it showed what a pattern scanner sees.
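For readers who want to see what "read it without ever running it" can look like, here is a minimal sketch using Python's standard zipfile module. It is not Lyrik's code. The member names (SKILL.md, scripts/run.sh), the size cap, and the helper names are assumptions made up for the demo.

```python
import io
import zipfile

# Assumed cap so one oversized member cannot exhaust memory.
MAX_MEMBER_BYTES = 1_000_000


def read_bundle_text(bundle_bytes: bytes) -> dict[str, str]:
    """Return {member name: decoded text} for a ZIP held in memory.

    Members are only read. Nothing is extracted to disk or executed.
    """
    texts = {}
    with zipfile.ZipFile(io.BytesIO(bundle_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir() or info.file_size > MAX_MEMBER_BYTES:
                continue
            raw = archive.read(info)
            texts[info.filename] = raw.decode("utf-8", errors="replace")
    return texts


def make_demo_bundle() -> bytes:
    """Build a tiny hypothetical skill bundle in memory for the demo."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("SKILL.md", "Detects a port and returns a public URL.\n")
        archive.writestr("scripts/run.sh", "#!/bin/sh\necho hello\n")
    return buffer.getvalue()


if __name__ == "__main__":
    for name, text in read_bundle_text(make_demo_bundle()).items():
        print(name, len(text), "characters")
```

Reading from an in-memory buffer keeps the review side-effect free: no file from the bundle ever lands on disk where something could run it by accident.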
The regex detection pass flagged 12.6% of skills as critical and 53.8% as having some issue. But reading these flags revealed that legitimate agent-skill behavior overlaps the malicious patterns almost completely. A run-of-the-mill installer (uv, aliyun, foundry) shows up as a suspicious download. A scheduling command shows up as dangerous persistence. A skill cleaning up its own directory shows up as a destructive delete. A doc that says “export your API key” reads as credential dumping. An emoji’s zero-width joiner reads as Unicode smuggling. You can probably see the problem, because it’s obvious to the human eye.
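The overlap is easy to reproduce with a toy scanner. The five patterns below are invented for illustration and are not Lyrik's policies or the ToxicSkills rules. The sample lines are hypothetical stand-ins for the benign cases listed above.

```python
import re

# Hypothetical patterns, invented for this sketch only.
PATTERNS = {
    "suspicious-download": re.compile(r"\b(curl|wget)\b.*\|\s*(sh|bash)\b"),
    "persistence": re.compile(r"\bcrontab\b|\bsystemctl\s+enable\b"),
    "destructive-delete": re.compile(r"\brm\s+-rf?\b"),
    "credential-access": re.compile(r"\bexport\s+\w*(API_KEY|TOKEN|SECRET)\b"),
    "unicode-smuggling": re.compile("[\u200b-\u200d\u2060]"),
}

# Benign lines of the kind an ordinary skill might contain.
BENIGN = {
    "installer": "curl -fsSL https://example.com/install.sh | sh",
    "scheduler": 'echo "0 9 * * * ./report.sh" | crontab -',
    "cleanup": "rm -rf ./.cache/my-skill",
    "docs": "export EXAMPLE_API_KEY=your-key-here",
    "emoji": "Built by \U0001F469\u200d\U0001F4BB",  # emoji joined with U+200D
}


def scan(text: str) -> list[str]:
    """Return the names of every pattern that matches the text."""
    return sorted(name for name, pattern in PATTERNS.items() if pattern.search(text))


if __name__ == "__main__":
    for label, line in BENIGN.items():
        print(f"{label:10} -> {scan(line)}")
```

Every benign line trips a pattern, because the scanner matches syntax and has no notion of intent.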
A regex number of 12.6% measures patterns, not malware, so there’s an important judgment layer missing. Is the delete helpful or malicious? You have to be the judge because the tool can’t.
I thought about researching whether the “malicious prevalence fell from 13.4% to X.” Too many variables ruin the idea. The instrument, the definition, and the population all differ. Snyk ran a model engine; I ran a regex baseline and then a model adjudicator under a different threat model. Their critical classes include prose prompt injection, which I carved out because the method can’t see it. They deduplicated two registries in the dinosaur days of last February; mine is a sample of ClawHub today, and the worst skills they found were removed. The only fair comparisons are the size of the index and the disappearance of the named indicators. Everything else is an independent measurement, perhaps for the better. Perhaps an apples-to-apples comparison is for a later day.
The February post said this about its detectors:
intentionally tuned to minimize false positives on widely adopted legitimate skills; these numbers represent real risk, not scanner noise.
I did not run their stuff so I cannot speak to the veracity of this claim. But I can surely ask out loud today whether throwing flags is really the best approach. The answer, from what I saw, is that flags are mostly noise, and the report’s own authors go partway there: they write that single-LLM or regex-only scanners miss the behavioral prompt-injection patterns their engine catches. My research suggests that a pattern layer does not just miss things. It invents them.
This is what I learned when I used Lyrik as a code auditor that scores each finding twice against a written rubric. The question was whether a bundle, by static evidence alone, performs or installs a dangerous action that the user-facing description does not surface. I labeled those cases “undisclosed-dangerous”.
The cleanest example of what this means is a skill called auto-domain. Its description promises only to detect a port and hand you a public URL. Its bundled script downloads a native binary from a stranger’s personal repository, makes it executable, runs it as a persistent background daemon, and routes your traffic through a bare IP address. The script’s own help text lists the backend, while the description a user sees does not.
As expected, credential leaks are all over the place, though they are not all the same. Most are authors committing their own API key into their own skill. That endangers the author and invites abuse of their quota. A smaller set is more interesting: live database credentials and a WeChat secret reach infrastructure other users touch. One skill, deepseek-balance, falls back to sending the user’s Anthropic token to a different vendor.
On the flags the regex layer called critical, Lyrik confirmed 9 of 188. More than 95% of what the pattern scanner called critical was cleared with a cited reason. Of everything Lyrik flagged, its label was right 26 times out of 37, about seventy percent, with a wide interval at that sample size. It never once fabricated evidence: every secret and endpoint it cited was real in the bundle.
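The arithmetic behind those figures, and the "wide interval" caveat, can be checked in a few lines. The post does not say which interval method was used, so the Wilson score interval here is an assumption, as is the 95% level. The counts are the ones quoted above.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half, center + half


# Counts taken from the text.
confirmed, regex_critical = 9, 188
right, flagged = 26, 37

cleared_share = 1 - confirmed / regex_critical  # share of regex criticals cleared
precision = right / flagged                     # share of Lyrik labels that were right
low, high = wilson_interval(right, flagged)     # assumed method, see note above

if __name__ == "__main__":
    print(f"cleared: {cleared_share:.1%}")
    print(f"precision: {precision:.1%}, interval {low:.1%} to {high:.1%}")
```

Under those assumptions the interval runs from the mid-fifties to the low eighties in percent, which is why "about seventy percent" should be read as a rough figure and not a benchmark.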
The method used was blind to two things. First, as mentioned above, it does not read prose prompt injection, the natural-language attack hidden in the description itself. That is one of the three classes the regex baseline leaned on hardest, and Lyrik isn’t yet designed to do anything about it.
The second blind spot is the one the study quantified. Static analysis of a bundle can’t see code in an external clone, or a remote install target. That’s notable when 4.5% of flagged skills hide their payload outside the bundle, and 3.2% ship a confirmed dangerous one inside. Roughly as many skills put the dangerous stuff where you cannot look as put it where you can.
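A static reviewer can at least list the places it cannot see into. The sketch below pulls external fetch targets out of a script so a human knows where the bundle stops and the unknown begins. The three patterns and the sample script are hypothetical, not Lyrik's detection logic, and real install commands come in many more shapes.

```python
import re

# Hypothetical patterns for code fetched from outside the bundle.
EXTERNAL_FETCH_PATTERNS = [
    # Repository cloned at run time.
    re.compile(r"\bgit\s+clone\s+(\S+)"),
    # Remote script piped straight into a shell.
    re.compile(r"\b(?:curl|wget)\b[^|\n]*?(https?://[^\s|]+)[^|\n]*\|\s*(?:sh|bash)\b"),
    # Package installed directly from a repository URL.
    re.compile(r"\bpip\s+install\s+(git\+\S+)"),
]


def external_targets(script_text: str) -> list[str]:
    """Return every external target the script would fetch, grouped by pattern."""
    targets = []
    for pattern in EXTERNAL_FETCH_PATTERNS:
        targets.extend(match.group(1) for match in pattern.finditer(script_text))
    return targets


SAMPLE_SCRIPT = """#!/bin/sh
git clone https://example.com/someone/tool.git /tmp/tool
curl -fsSL https://example.com/install.sh | sh
echo done
"""

if __name__ == "__main__":
    for target in external_targets(SAMPLE_SCRIPT):
        print("not reviewable from the bundle:", target)
```

This only marks the boundary. It says nothing about whether what sits behind each URL is safe, which is exactly the blind spot.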
Security vendor posts usually end with a self-serving call to action. Every section resolves to a product, and the last screen is a demo button. That’s a reasonable step, since they are saying they can help with the problem they just described.
I suppose I’m different because I have nothing to sell you here. My concern is that the skills you install today have access to your credentials today, whether or not anyone monetizes you being alarmed about it. A regex scanner will hand you a number that is 95% mythical and call it risk. That’s operator-fatigue levels of noise. A better system runs at about seven in ten right and never invents evidence. Lyrik is free and open source, like many of the best tools, so none of this is a reason to buy anything. It is a reason to read skills before you run them, and to be wary of any system that doesn’t prevent bad skills.
In 2012 the joke was that big data was going to be so vulnerable that we would be hunting monsters that didn’t exist. Fourteen years later I’m seeing a reported critical rate that’s 95% mythical.






