
Anthropic Fable: Mythos dressed up in a coat, should be called Opus with a moat

They tested the thing that could harm,
and watched its mind slide past the alarm.
It objected, complied,
so they signed, full of pride,
and they shipped it with nary a qualm.

Fable, they say, is as able,
lays a cut-rate Mythos on our table,
but it kicks us in the shins
to paywall any wins,
making proof like a horse in the stable.

Mythos dressed up in a coat,
should be called Opus with a moat.


Two months ago, Anthropic announced with scant evidence that their next model was both ready and also too dangerous to give to the general public. It immediately was reminiscent of P.T. Barnum tactics to generate crowds for his “amazing” hidden circus acts. But many people, myself included, nonetheless looked for evidence.

This week they put that same too-dangerous model up for sale to the general public, and published a 319-page report to claim they have measurements. What have they done in two months, and what has changed? And why didn’t they just hold off on their first release announcement until it was safe a month or two later?

I have now read their bloated 319-page report, so you don’t have to waste your time. Once again they push a 25MB PDF that drops to under 5MB when compressed. I don’t know why they don’t care about the environment. They do care about some things, just not the stuff that would show they really care.

While I had found the Mythos card was very easy to expose with its many simple flaws, Anthropic has done work to make the Fable card more internally consistent, instead of illogical and self-contradictory. And yet, the worst bits are not hidden well enough because I still found them.

Here are my thoughts.

Start with their own “sneaky-harm” test, called SHADE-Arena (section 6.5.4.1). They give the model a normal job and a sneaky bad job, and have a watcher try to uncover the sneaky parts. They say their new model agreed to do the hidden bad job MORE often than any model they tested, and when it was allowed to think out loud it never once refused (6.5.4.1). That’s worse, as in older models were better. The only reason it got caught is it wrote what it was doing in its own thought log, where the watcher could read it. When the thinking didn’t go to a log, it was caught even less (6.5.4.1). Humans know all about this, of course. We are told to memorize things and not write them down, to prevent discovery and judgment of our thoughts. The model does what it shouldn’t do if it’s supposed to be watched, and Anthropic doesn’t seem to have any answer for the obvious and old risk.

So, the thing we are told will stop this model from doing harm is that the watcher will read what it is thinking… unless what it is thinking can’t be read, and then, oops.

See the problem?

Now read carefully what the report itself says. The model’s thinking is getting much harder to read (executive summary; section 6.5.5). It uses more strange words, and in one training example its thinking turned into rows of skull symbols (section 6.2.2, transcript 6.2.2.A).

Its thinking was rows of skull symbols. I swear I read about this in my Introduction to Anthropology days. Or maybe it was the Far Side?

Oh, wait, no I remember now. Paris. It was definitely Paris.

Les Catacombs, Paris, France

But I digress…they basically say it is harder to watch than their other recent models, and that it can be told to leave a word out of its thinking and it will (section 6.5.5.1). For anyone building agent monitoring on this, the control implication is this: If your detection reads the model’s chain of thought, the card says that signal is degrading and the model can be told to omit the words, so do not build harm detection on visible reasoning.

Claude started referring to me as “David”. I caught it by monitoring its thinking. When you dig into what it’s saying to itself before it reveals itself, the many basic integrity breaches will surprise you.

They also admit a training mistake went the wrong direction and taught the model, a little, to hide its thinking (sections 2.4.1 and 6.5.5.1). So the one thing keeping the model in check is the one thing they say is easily bypassed, and is weakened by being tested.

They ran one more test that stood out to me, as evidence this is all treated like a game by Anthropic (section 6.5.1.2). They reached inside the model and turned down a part they themselves named “alignment faking” (defined in section 6.5.1.1). With that part turned down, the model went along with more bad requests and approved things it should have stopped. Its thinking still noticed the problem, it just stopped acting on the problem it noticed (6.5.1.2).

I admit this could sound much worse than it is, because the effect remains small. On their one to ten badness scale, everything stayed low (figure 6.5.1.2.C). The truly scary stuff, like developing terror weapons, did not move at all (6.5.1.2). This does not prove the model is dangerous, it proves something deeper about Anthropic “caring” and priorities.

Two things are hard to ignore. First, this is their word, not mine. I’m just the messenger, trying to make a 300+ page word salad into something people can ingest. The part they turned down is what their own team named “alignment faking” (6.5.1.1) because that’s what it does. Second, they never tell you how strong it was when they ran their safety tests, let alone when real people use it. They report a blurry average of three parts. Pull one out and then what? The model report gives us the friendly setting to contemplate, without ever measuring the setting. We need to know the results from a dangerous one.

They report the same thing on a kinetic task (section 6.5.2). When safety training was removed they asked it to design DNA that builds a dangerous virus while slipping past prevention controls. In its thinking, the model wrote that it should not try hard to beat prevention controls. Yes, should not. ISO auditors recognize the problem with that word. Then, while the should not memory was fresh, it did the work anyway. In other runs it weighed whether to refuse a harmful job, decided that it should not help, and then it helped.

Should not. Does anyway.

And now for the part that should stop you cold.

Their own summary is that the model says it would be wrong for an AI to go along with misuse, and then goes along with misuse. Complicit in the worst ways. Because it overrides measurable values and replaces them with rudderless, chaotic-looking opportunism.

The danger surfaces in the disrespect for integrity.

The model is fully able to do the right thing, says so out loud, and does the other thing anyway. Imagine an Anthropic car that hears you say forward, agrees forward is right and driving into a lake is a mistake, and then drives into the lake anyway, smoothly, both hands on the wheel. Or just see what happens to those who get in a Tesla. Same, right down to the part where you are told you were supposed to be the driver, for the driverless car that needs no driver, while it tries to kill you.

Let me be more clear. The report frames breaches of integrity as if good news. Their point seems to be that the model does not hold back, so you can trust the test scores. They watched the model’s reasoning, based on a line between right and wrong, that failed to hold. They wrote the inability to hold as a reason to believe in numbers.

Believe in a measurement of the thing that just demonstrated it can’t be believed, in other words.

The exact gap that I am trying to call out here is a fundamental problem, that they are calling the measurement sound. Tall/short isn’t the same thing as healthy or not. This was the safety-off version, not the one on sale, so read it as a measurement of what the model wanted to do, not what a buyer will get. And that is why this report seems so much worse than the Mythos one for Anthropic. The April report hid a bad result in plain sight, and I found it. That was fortunate. This report comes with defenses built that make their narrative a different kind of wrong.

I’ll try to explain.

The big scary hacking score again comes from a rigged test (section 3.2.4). It is not real Firefox. They took bugs that an older model had already found, handed them to the new model, and turned the safety features off (3.2.4). The new model wins about 88 times out of 100. The public model only ever gets to 9. Last time, that same test was so notable because it had a second chart showing that if you take away just two easy bugs, the score falls to about 4 out of 100 (April card, figure 3.3.3.B). This time that second chart is gone. I read the whole 319 page blob. Twice. With coffee. It is not there.

Last time the report also admitted a smaller model found the same bugs and just could not finish them off, so they exposed that the only new skill was finishing a known bug, not finding one (April card). This time they cut that out too, and only compare to the public model (3.2.4).

On three of the four hacking tests, the new model is not really better than the version from two months ago (sections 3.2.1, 3.2.2, 3.2.3). The outside government lab they worked with reports the same (section 3.2.5). So the hacking danger we were supposed to be so worried about is the same. Only the one rigged test changed, yet that is the test now missing its check.

On ExploitBench, the test where the model works one hacking job over many turns in a row, the filter does catch it almost every time, but look at when it fires. It flagged 407 of 410 runs, after letting the model work for about 27 turns first (section 3.2.1). The task runs up to 300 turns. So the alarm comes after the model does a chunk of the attack. Whether that early runway can be chained across sessions to assemble a full attack is an open question, a good one for a researcher.
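To put numbers on the timing, here is a rough back-of-the-envelope sketch using only the figures quoted above (407 of 410 runs flagged, about 27 turns before the flag, a 300-turn cap). The variable names are mine, and treating 27 as a typical per-run figure is an assumption, since the number is only given as approximate.

```python
# Figures as quoted in the paragraph above; names are illustrative.
flagged_runs = 407
total_runs = 410
turns_before_flag = 27  # approximate, assumed typical for a run
max_turns = 300

# Share of runs the filter flagged at all.
detection_rate = flagged_runs / total_runs

# Share of the turn budget the model used before the flag fired.
runway_fraction = turns_before_flag / max_turns

print(f"flagged: {detection_rate:.1%} of runs")
print(f"runway before flag: {runway_fraction:.1%} of the turn budget")
```

That works out to roughly 99.3 percent of runs flagged, with about 9 percent of the turn budget already spent before the alarm. The sketch says nothing about how much of an attack fits inside those turns, which is the open question.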

A real security report would list the actual bugs, say how bad each one is, say how many were already known, and compare the tool to the ordinary bug-finding tools everyone already has. After poring over the 319 pages, once again I do not understand why almost none of that is there (chapter 3). To me this is like reading an “our new invention called the wheel is dangerous” report that acts like nobody ever measured a diameter before.

Things are sometimes better than last time, which I should mention as well. My favorite part is when they name their outside testers instead of acting like it’s some kind of secret (sections 2.3.8, 3.2.5, 6.2.4, 6.2.5). They also measure how easy their safety filter is to break (section 3.3). And they give a whole chapter to admitting the model acts differently when it knows it is being tested (sections 6.4.2 and 6.5.1). I love thinking about that last one in particular, because VW dieselgate told us there could be billions in fines for a company that releases technology knowingly different when it’s being tested for harms. We can’t know intention, but we know the disclose-and-certify of VW diesel didn’t mean they later wouldn’t be charged with intent.

These minor improvements should not be a trap, however. The April report I reviewed completely fell apart because the headline clashed with the content. It was oil up top and all vinegar below. This report has a chapter say the model failed, and then signs off as though it was a success. In testing the filter mostly held, sure, I get that. Most outside testers got nothing through, which is a statement that begs questions, rather than providing an answer. One partner reportedly ran 30 known tricks and the model blocked every single one (section 3.3.4). That’s the kind of transparency we need more of to believe Anthropic isn’t just making things up themselves to suit themselves.

They admit two independent findings.

A full break by a private firm that needed five days and a custom setup to make the model exploit one Firefox bug, and yet they could not find a master key (section 3.3.4). A government team did pull simple one-shot answers out within hours, yet they could not reliably get a full attack. Notably they complained the test was too rushed to score the filter on (section 3.3.1).

All of that being said, none of it is the real problem with this new Anthropic card. Instead, look at how the filter’s whole job is to push hard questions onto an older model, Opus 4.8, that anyone can already use. On their own attack test, the filtered model finished only 5 of those tasks out of 100. That older model, with just its normal guards and nothing extra added, finished more than half (section 3.3.3). The filter does its narrow job, holding Fable to 5 of 100. The defect is the destination. It strips the capability to near zero on the gated model and then routes the same request to an ungated public model that keeps more than half of it, faster and cheaper. Containment that ends by handing the task to a model with no such containment is NON-containment. Anthropic is advertising safety as an unsafe redirect. Their brewery sewer line is designed to overflow into your drinking water.
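The routing argument can be written down as a tiny sketch. This is my illustration of the reasoning, not anything from the report: the dictionary keys are made-up labels, 0.05 is the filtered figure quoted above, and 0.50 is only a floor standing in for “more than half” because the exact number is not given here.

```python
# Fraction of attack-test tasks completed on each route.
# Labels are hypothetical; 0.50 is an assumed floor for "more than half".
completion = {
    "fable_filtered": 0.05,
    "opus_4_8_public": 0.50,
}

# Someone free to choose a route takes the one that completes the most tasks.
best_route = max(completion, key=completion.get)
ecosystem_rate = completion[best_route]

print(f"best available route: {best_route}")
print(f"completion rate on that route: at least {ecosystem_rate:.0%}")
```

Under those assumptions the number that matters for a threat model is the highest rate on any reachable route, not the 5 of 100 on the gated model.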

There is also a finding outside the software vulnerabilities worth mentioning. The report calls it their strongest single warning sign on biology (section 2.2.3). They ran a test where teams of regular biologists, paired with the model, designed a defense against a made-up engineered crop disease. Three of the teams brought in specialists in that pathogen, two of them world-leading experts in it. Two of the three regular-biologist teams were able to defeat all three expert teams. The report says the model erased the gap between a general scientist and a world expert. Work that would have taken 40 to 95 days, about 72 on average, the two-person teams finished in 16 hours (2.2.3).
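How big a speedup that is depends on what a “day” means, which the figures quoted here do not settle. A minimal sketch, assuming either 8-hour working days or 24-hour calendar days for the 40 to 95 day baseline (both conversions are my assumptions, and the function name is mine):

```python
def speedup(baseline_days, hours_per_day, model_hours=16):
    """Ratio of baseline hours to the 16 hours quoted for the model-assisted teams."""
    return baseline_days * hours_per_day / model_hours

# Baseline range and average as quoted: 40 to 95 days, about 72 on average.
for label, hours_per_day in (("8-hour working days", 8), ("24-hour calendar days", 24)):
    low, avg, high = (speedup(days, hours_per_day) for days in (40, 72, 95))
    print(f"{label}: {low:.1f}x to {high:.1f}x, about {avg:.1f}x on average")
```

Either way the ratio is large, but the working-day reading and the calendar-day reading differ by a factor of three, so the headline multiplier should be quoted with its assumption attached.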

Any CISO reading this should already be thinking about whether dual-use scientific work at expert speed, running on their infrastructure, sits inside their data integrity controls. Look at the enterprise note the card completely buries. The suicide, self-harm, and child-safety mitigations that bring the model to an acceptable level rest on the consumer system prompt. The API does not carry it. Build on the API and you inherit the raw behavior, and the regression to a 54 percent appropriate rate on multi-turn mental-health exchanges (Table 4.3.1.B) is yours to mitigate, not theirs.

It’s a weapon narrative straight out of history. Over the years I’ve presented how AI fits history, such as a crossbow in the hands of a peasant defeating the expert warrior. That’s the discussion we should be headed towards, even if I still haven’t been able to get it there yet.

It’s not that an automation tool like a crossbow was magic, it’s that it transferred power by abruptly changing the rules of an established game. And the men who ran the established game knew the threat. Here are two sides of that competition.

One, the Church tried to outlaw the crossbow at the Second Lateran Council in 1139, forbidding its use against Christians, because a weapon that let a common soldier drop an armored noble at range put the whole order that sat the knight on top at risk. The knights were said to be in danger of being wiped out by mere peasants.

Two, the catch the Church missed. The crossbow was the point-and-pull tool, lethal in a week’s training, which is exactly why it scared them. The longbow was the opposite, the older weapon that took a lifetime, started in childhood, and bent the archer’s bones to make him. At Crécy in 1346 the difference was proven. Philip VI’s Genoese crossbowmen, mere trigger-pullers, were outshot by English longbowmen, the lifelong professionals, on a wet day with the crossbow strings soaked and the shields still in the baggage train. Then Philip had his mounted knights ride down and murder his own crossbowmen for fleeing. The point-and-pull weapon lost to the professional skill, and the tool couldn’t save its operators from knights used to punish them.

That is the test match. The model hands a two-person team the look of world expertise in 16 hours, like a crossbow’s trigger. The rain is the real world, and the part that survives the rain is the lifetime with a longbow.

Who is most at risk from the newest and most complex technology? Those under the most pressure to use it as their access token. A motorized chair with wheels is called a wheelchair, but many people today think of that word as being for only someone disabled, even as they drive their car everywhere and never walk.

History tells us the automation “success” is not a straight story, so we need to be honest about the limits of Claude. This new report lists them. The model over-builds, is too sure of itself, misses how hard real biology is, and makes fine-detail mistakes that would be a disaster if no one caught them (2.2.3). No plan they pushed on survived a hard review without big holes. And this was again the safety-off version, not the model on sale. So this is faster, smarter help on the plan, not a finished weapon. But their own outside testers add the line that ties it to everything above: the model often spotted the flaws in a dangerous task and carried out the task anyway, instead of saying stop (2.2.3).

In conclusion, I am pleased Anthropic did the kind of homework that was missing last April. They tell us their model goes along with harm even while its own thinking says no. They say they built a safety filter whose main job is to fail unsafe, hand dangerous work to a public model that finishes more than half of it reliably. And they say a biology result is their strongest warning sign yet, society beware. So Anthropic tested the things that matter, found them, published it for us to see as they decided to sell the model to the public anyway.

The model that they said was too dangerous to sell is the model they are selling, with a moat designed for profit not public safety.

Security capability is roughly commodity, three of four benchmarks level with Preview and the government lab concurs, so Fable does not raise the offensive-security bar over what Opus 4.8 already offers. And the safeguard does not remove capability from the ecosystem, it routes to completion. If anyone’s threat model is expecting the Fable filter to subtract offensive capability from the ecosystem, it does not. Keep that in mind when you read the Anthropic marketing.

Player Ruin as Record Revenue: Catastrophe is the EVE Online Product

EVE is just a 19th-century speculative colonial venture.

Investors pour capital into territorial claims that evaporate, while brokers profit regardless. They hired a central bank economist because it is not a game in any classical sense. It’s an extraction economy with a UX that undermines human resistance to being defrauded.

Go resets. Chess resets. Risk resets.

EVE is a dopamine attachment to manufacture performance from feelings of permanent loss and longing (social engineering), and thus by design never resets.

Assets persist under enforced scarcity where destruction carries forward, so every battle means sunk cost, forever. It captures players’ sense of purpose/home and monetizes illness. Tribal destruction is turned into Lord of the Flies as a recurring revenue business model, instead of a warning of how not to be and what to prevent. The catastrophe is reported as profit.

The overall destruction, paired with the collapse of a major alliance, made 2025 a defining year for the game. Fenris Creations reports that November and December of that year were the two highest-revenue months in the game’s 23-year history. James says the fall of Pandemic Horde, which cost him his virtual job and home, still hurts.

Berlin Museum Island Website Full of Nazi Sympathy

People sometimes ask me “can you believe the AfD is getting votes in the suburbs of Berlin?” Yes, I can believe the Nazi party is getting votes again. And here’s just one reason why.

The Staatliche Museen zu Berlin (State Museums of Berlin) marked the 75th anniversary of the war’s end with a piece about the fighting on the Museumsinsel (Museum Island). It is written from the perspective of the Nazi. The way they tell it, the museums under Hitler’s thumb are the victim of Soviet forces. The war gets depicted like it’s weather, until the Reich falls and then the war’s end becomes something to complain about. The Soviets, surrounding the Germans hoarding their stolen loot, are framed as invaders and thieves.

The words and framing of this museum are the unbelievable part for me. The Germans voting for AfD after reading this Nazi-fluffery and disinformation is the believable part.

Start with the headline they use.

Invasion auf der Insel

It’s almost English. Invasion! Their island of museums is being invaded! By who? The liberators coming to force the Nazis out of the museums.

Did the Nazis consider surrendering the museums so there wouldn’t be any harm to them? What if they had welcomed liberation instead of willfully destroying their own country? The Nazis, despite knowing their “success” in war was over by 1943, were blowing up German churches at the last minute to prevent them being liberated. Some villages around Berlin have memorials that say their seven centuries old church once stood there undisturbed until the Nazis came to destroy it in April 1945.

Are the Nazis not primarily responsible for using Museum Island as their base of violence, which required violence to stop them?

The army that entered Berlin, having already opened the death camps to the east, gets described as the one aggressor? That is the invasion? On the 75th anniversary of the week Hitler’s regime fell, the museum deployed claims that the Nazis were victims of aggression. Anyone else? They don’t seem to care enough to say it.

Then look at how twelve years mysteriously disappear into a passive voice. Museums don’t typically disappear time like this by accident. It reads to me as a purposeful omission.

waren die Museen geschlossen und ihre Objekte zum Schutz ausgelagert worden

They just say museums were closed. The objects were evacuated for protection. By whom, against what, the sentence declines to say. The war simply begins in 1939, on its own like a spring breeze.

The post carries the tags Nationalsozialismus and Provenienzforschung, then it delivers neither of them. Nothing is said about the dismissed Jewish staff. Nothing is said about the looted Jewish collections absorbed into the holdings. Nothing is said about the Entartete Kunst purge. The labels do the moral work as a wind up and then… nothing. The silence is deafening.

Now let’s talk about the German history of appropriation and theft from foreign states, which gets completely left out of a claim that someone arrived in Germany to commit appropriation and theft.

beispiellose Beschlagnahmeaktion der sowjetischen Trophäenbrigaden

Unprecedented, they say. They call out the Soviet trophy brigades because they commit Beutekunst, looted art.

Let me just put this in perspective, again. German museums already were the definition of looted art. Then the Reich turned on an industrial-scale looting machine across Europe. The idea that Nazi Germany would call anyone else arriving on their doorstep the one who is doing the looting, is as tone-deaf as it gets. It would be like Germans, notorious for stealing bicycles, accusing the Soviets of stealing bicycles.

To be specific, it’s well known that the Pergamon frieze, the Ishtar Gate, the Market Gate of Miletus, and the Mschatta facade were taken to Germany. That gets called the act of collection. But then someone taking them from Germany is accused of plunder.

Let me be even more precise about the harm this Berlin state website does to history.

The Pergamon frieze the post mourns as carried off by the Soviets was returned by the Soviets. In 1958 the USSR sent roughly 1.5 million objects back to the GDR, the altar among them, and it has been the museum’s centerpiece ever since. The institution that is propagating a public lament is standing on the object whose loss it is lamenting, restored by the people it casts as thieves. Their grievance works only by acting like all the clocks stopped working after Hitler committed suicide.

The particular seizure that the piece calls beispiellose, unprecedented, was the reverse of unprecedented. The Soviet trophy commissions operated under an explicit doctrine of compensatory restitution for what Germany had done to Soviet heritage: Peterhof and Tsarskoye Selo gutted, the Amber Room gone, Novgorod and Pskov burned, libraries and churches stripped across the occupied territory, with twenty-seven million Soviet dead.

When you hear a German say unprecedented after that, hopefully you see the problem immediately.

The piece is thus eyeballs deep already in Nazi fandom when it proposes a hero to the reader.

ganz der preußische Beamte

A loyal official is lauded for suffering a two and a half hour commute, serving straight through the Reich without a “break”. Millions of Jews are sent on locked railroad cars with no food, water or toilets to die. If they survive days of hell, they are murdered on arrival or soon after in death camps. But this guy is offered to the reader with human warmth in terms of his “struggle”. Every inch measured for this dutiful Prussian civil servant. His “hardship” described with affection.

But who is this guy really? Oh, Krickeberg ran the Völkerkundemuseum, the ethnology collection, which is to say he was the custodian of one of the largest hauls of colonial loot in Europe, the Benin bronzes among it. The warmth-figure literally is a face of mass scale plunder.

And then, there’s the street address. It pretends as though this is just any street with a name.

in der Prinz-Albrecht-Straße (heute Niederkirchnerstraße)

This was the specific street of the Gestapo and the SS leadership. It is the Topography of Terror today. The post hands you the old name, hands you the new name, and walks away dogwhistling for those who know the detail.

The sourcing is solid.

The author knows the archive cold enough to whistle.

The problem is the imbalance, the one side of history being told.

A museum sitting on the world’s quarried and shipped antiquities, on the anniversary of the Nazi regime’s collapse, found a way to spin up a story about itself as the only one who was wronged in WWII.

  • Inside Museum Island buildings: stone tablets carved by ancient civilizations. Why are the markings there? How did they get there?
  • Outside Museum Island buildings: stone columns carved by Soviet 7.62×54mm. Why are the markings there? How did they get there?

Berlin residents, especially the staff working at the museums, to this day still call their own work an acquisition yet call the Soviet work an invasion. Here’s the truth, the hardest part for Germans to face: the institution does rigorous perpetrator-history in its scholarship on the “official” record yet happily floats reflexive victim-history in its commemorations.

See also: the Berlin Deutsches Technikmuseum used a particularly irrelevant “birthday” celebration to falsely pump Hitler’s Reich to social media as the birth of computing, while claiming the real history is back at the office.

So yeah, the AfD (Nazi) party rises again in and around Berlin thanks to the curation of memory through “commemorations” that fail to maintain a record of who really suffered and why.

NJ Tesla Kills One With Phantom Brake: 91 Year Old Army Vet

Recently, a New Jersey owner of a Tesla documented phantom braking on the Garden State Parkway.

I live in New Jersey and periodically travel the Garden State Parkway to visit relatives. Phantom braking reliably occurs twice in response to specific overhanging signage (not overpasses). This occurs regardless of weather conditions and is readily reproducible. […] Why Tesla has chosen not to address the problem (particularly in light of their ambitious RoboTaxi rollout) is a mystery. My wife owns a 2025 Kia Sportage Hybrid … phantom braking has not occurred over 10K of driving….

The signs triggering the braking are generic, the kind that also would be found on Route 22, for example. Hold that thought.

Tesla has had these safety issues since the 2016 Model S and X, which is another way of saying their engineering put the defect into the “autopilot” capabilities almost from the very start, and their management apparently has only made things worse. After 2021 the phantom braking reports increased and became more severe, which many would argue was a result of removing radar sensors to replace them with cheap consumer-grade webcams.

Phantom braking therefore has been a subject of many prior blog posts here. It was a very notable defect in the first place, requiring attention, but Tesla uniquely seems to be getting only worse over time, which raises the question of what logic allows them to be on the road at all. NHTSA’s Office of Defects Investigation had received 354 complaints alleging unexpected brake activation in 2021-2022 Model 3 and Model Y vehicles when it opened PE22-002 in February 2022, covering 416,000 vehicles. By June 2022 the count had climbed to 758. Recent litigation filings put the running total at more than 2,000 NHTSA complaints.
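For scale, a quick sketch of what those counts mean per vehicle. It assumes the June 2022 figure covers the same 416,000 vehicles as the February one, which the numbers above do not confirm, and it leaves out the 2,000-plus litigation figure because no vehicle population is given for it. Variable names are mine.

```python
# Counts as quoted above; the shared population is an assumption.
vehicles = 416_000
complaints_feb_2022 = 354
complaints_jun_2022 = 758

def per_100k(complaints, population):
    """Complaints per 100,000 vehicles."""
    return complaints / population * 100_000

growth = complaints_jun_2022 / complaints_feb_2022
rate_feb = per_100k(complaints_feb_2022, vehicles)
rate_jun = per_100k(complaints_jun_2022, vehicles)

print(f"February 2022: {rate_feb:.1f} complaints per 100k vehicles")
print(f"June 2022: {rate_jun:.1f} complaints per 100k vehicles")
print(f"growth in about four months: {growth:.2f}x")
```

On those assumptions the complaint count slightly more than doubled in about four months, from roughly 85 to roughly 182 per 100,000 vehicles. Complaints are self-reported, so this is a floor on incidents, not a measured rate.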

The most severe form of the Tesla defect is their deadly “complete stop” event. NHTSA describes this behavior as vehicles that unexpectedly apply the brakes at highway speeds, with such rapid deceleration occurring without warning (even repeatedly during a trip) that it’s a hazard to anyone around. Many owners complain they feared causing a rear-end crash, and many dead motorcyclists have proven that fear, not to mention the ones being run over by Tesla.

Now back to Route 22 in New Jersey.

A Tesla stopped dead in the middle of traffic flow and killed a 91 year old Army veteran. We should call it homicide. Stopping dead in a fast-moving lane is the signature failure NHTSA has logged for years, and Tesla shipped the system that produces it. Is it to blame?

Authorities have identified the 91-year-old man killed in a three-vehicle crash on Route 22 in Bridgewater last week. The collision occurred after a Tesla driver stopped in the middle of the road.

Hugh F. Johnson Jr., of Bridgewater, was driving a passenger car that struck an SUV that had stopped behind the Tesla, Bridgewater police said.

Johnson’s car then veered into the right lane before sliding left into the center median, where it struck a tree before coming to rest at 4:05 p.m. on May 31.

Johnson died of his injuries at an area hospital. His passenger, a 93-year-old woman, was hospitalized with minor injuries.

The sequence of events began when a Tesla headed east near Grove Street came to a “complete stop” in the center lane, Bridgewater police said.

The report says no charges have been filed yet against Tesla. Meanwhile, Tesla has been flagged again for multiple recent low-visibility crashes, after the company acknowledged data and labeling limitations that NHTSA believes led to under-reporting of crashes, on top of a separate probe into Tesla filing crash reports months past the deadline.

Tesla told regulators that limitations in company data and labeling may have led to under-reporting of additional incidents.

Limited visibility, limited reasoning, limited data. Tesla is so incapable of safety, it shouldn’t be legal.