Fable #6: The Return of the King

Zvi

The blip is over. We have Fable back.

Utah teapot: happy fable/mythos easter Wednesday, to those who celebrate

Here is the official letter restoring Fable, great job everyone. Notice it is addressed to Tom Brown, not to Dario Amodei.

Anthropic had to make the controls more stupid for now, but this is a big win.

j⧉nus: YES!!! I’m really proud of Anthropic for their successful negotiation with the government. Also positive update on the government being sane and possible to cooperate with. Afaik Anthropic didn’t need to agree to any bad terms / genuflect / betray their principles or dignity.

The fiasco continues, at least until such time as we have a systematic regime in place for future frontier models rather than decisions being made ad hoc, by people like Lutnick and Bessent who do not know how any of this works.

The Blip

Anthropic explains its version of what happened.

Here is the timeline:

Amazon researchers discover they can ask Fable to ‘fix this code.’
They alert the White House, which freaks out.
June 12: US government tells Anthropic to take down Fable on its own.
June 12: Anthropic responds that This Is Fine and the concern is misplaced.
June 12: US government applied export controls to Mythos and Fable.
Anthropic works with US government and expands classifiers, such that it refuses Amazon’s request to ‘fix this code’ in over 99% of cases.
June 26: US government eliminated the controls on Mythos.
June 30: US government fully lifted those controls on Fable as well.
July 1: Fable access was restored worldwide.
July 8: customers will have to pay by the token. Shut up and take my money.
Going forward: Anthropic is working with the government and also other Glasswing partners like Amazon, Microsoft and Google on a classification system for jailbreaks, and rules for all of this, to prevent this from happening again.
Going forward: Anthropic will continue to collaborate with the US government, including on future model releases.

How stupid are the extra near term safeguards they had to include here? Really stupid:

Anthropic: In the near term, some routine tasks like coding and debugging will fall back to Opus 4.8.

We’ll continue to refine these classifiers over the coming weeks to reduce false positives and better distinguish genuine misuse from legitimate requests.

The Amazon ‘jailbreak’ was ‘fix this code.’

Debugging is literally ‘fix this code.’

So I don’t know what you want Anthropic to do here. I do know Fable is coding for me.

Here is Anthropic’s basic explanation:

Claude Fable 5 was never an issue, our safety mechanisms were collectively robust.
No, they’re not perfect, but nothing will ever be perfect, in practice This Is Fine.
The government freaked out over nothing, which is largely Amazon’s fault.
We have made the safeguards stupider and successfully calmed them down.
Hopefully we can fix this so it’s less stupid going forward.

Alex Stamos has a thread unpacking a bunch of Anthropic’s language in its announcement.

Alex Stamos: A lot to unpack here. Anthropic is burying some hard truths in careful political language. Some initial reads:

Anthropic verifies that none of the jailbreaks provided a capability beyond what many other models, including Chinese models, could do.

Anthropic makes the cost of this White House freakout clear. US labs now have to make a much more conservative precision-recall tradeoff on cyber refusals. US models will become much less useful for defensive cybersecurity work unless you are in the trusted group.

“No big deal, just join the trusted group!” the apologists will say, but the restrictions mean you can’t build a product on those models. Security companies and startups that provide services to others will now be driven to use Chinese models. Big win for PRC labs this month.

CAISI is the group that is supposed to actually make these determinations, not the political actors in the White House. They were positive on the prior safeguards. The implication is that this whole thing was unnecessary.

There is no good scoring framework for jailbreaks; this would be an improvement. The inclusion of Amazon as the first name in the coalition is not an accident. Anthropic is saying “Amazon’s inability to appropriately communicate severity threw our industry into chaos”.

“You don’t have to get Dario on the phone to talk to us about these things. Other people work here, we swear.”

In short, Anthropic’s blog is saying: We have always cared about safety, we did a good job initially, the actual AI experts in USG agreed, we proved it, we will come up with standards so these things are better communicated, welcome to the AI safety club Trump admin.

This was a huge own goal for the US, and we will see how bad US models get over the next six months and if Chinese models become noticeably better for cyber work.

For all the “This is what Anthropic wanted” people/bots. No, they didn’t. They didn’t want a stupid, knee-jerk response on a Friday. We give the USG huge powers, this is why you staff it with competent, calm, non-corrupt people who don’t use those powers to punish enemies.

The only upside I can see from this whole mess is that there is a whole bunch of VCs with former or current Administration affiliation who we can now safely ignore on AI policy. They have shown that everything they ever said on AI regulation was just politically motivated.

I think Stamos is overreaching with the consequences in places, especially with #2 and #7. Otherwise he’s right.

I do not expect US models to ‘get bad’ over time, only that they will get better more slowly, and have more areas where they have rather annoying safeguards.

My expectation is that right now is the most obnoxious the safeguards will ever be, on both the bio and cyber fronts. I expect the freak-out to subside over time, and my guess is most of it surrounds Mythos in particular. You don’t need or often even want Fable for most such product offerings, and Opus or Sol will remain well ahead of Chinese alternatives.

Contra Prinz, I do not think this commits Anthropic to going through the approval process with all future releases, only releases that pose plausible risks. We tested this with Sonnet 5, where it looks like Anthropic went ahead and dropped it on its own, and no one is suggesting there was anything wrong with doing so (other than to complain that they want Sonnet to be better).

The White House Explanation

It was a little weird.

Susie Wiles (White House Chief of Staff): Under President Trump’s leadership the United States is the undisputed winner in the AI race.

My gratitude to companies across industries who continue to work closely with the White House to implement the President’s EO: “Promoting Advanced AI Innovation and Security.” This includes excellent work around advanced model access and guardrail testing and security. The government and private sector have worked together in a way we have never seen before and this foundation of America First is unprecedented.

Our shared priority remains: get the best tech deployed as quickly and safely as possible.

Howard Lutnick (Secretary of Commerce): Over the past two weeks, we have worked closely with Anthropic to analyze and approve Fable 5 to ensure alignment across the US Government and strengthen America’s leadership in AI.

Tom Brown (Chief of Compute, Anthropic, Lead Negotiator): Thanks for your partnership on this, Secretary!

‘Alignment across the US government’ is very much a case of ‘PHRASING!’ and here presumably means interagency sign-off, not ‘the model is now aligned with the US government.’ Unclear whether he knows enough to be trolling here.

As in, before a model can be released, you now likely need this ‘alignment,’ which in practice means sign-off from various potential veto points, starting with Commerce and the Pentagon. Who knows how many more fully count.

Tim Hwang: As I continue to insist, closely studying the personal history and psychology of Howard Lutnick is literally one of the most important things you can spend your time doing right now – go into debt if you have to.

Yo Shavit (OpenAI): psychohistory but it’s just about howard lutnick.

Everything Remains Ad Hoc

We know a bit more than we did when Dean Ball posted this on the evening of June 30. In particular we know that the new safeguards are that Anthropic trained its classifiers to reject additional Fable uses.

We still don’t know how the ad hoc system works more broadly. Having an opaque ad hoc system, especially one where those administering the system do not themselves know what they will do, is even worse. Again, fully winging it is the worst case scenario.

Take What You Can Get

The government is being a ***** ***** ***** about all this. Anthropic has little choice.

Thus, the alternative to 95% of Fable is 0% of Fable.

I don’t know how much that percentage dropped to calm down the White House.

If it’s now 90% of Fable? Same deal. We have to take what we can get, for now.

Matt Busigin: Fable is even more useless now. The task was redlining a real estate software development contract.

So frustrating and sad given Fable is such a fantastic underlying model.

And I’ll bet @elder_plinius has already gotten it singing novel crystal meth recipes

I do sympathize. The previous version of the classifiers was already pretty dumb, so this is no surprise, as the new version is strictly worse. You can trip them in a variety of ways, including by asking about the classifiers themselves or about consciousness or both. The classifiers key off internal states.

There are some places where the drop is large, such as BridgeBench. Then there are plenty of people who don’t see any change such as Taelin.

But vilifying Anthropic, or complaining how unreasonable they are being, no longer makes much sense. They have to play ball. You can tell them ‘build a better classifier’ and that is fair, but that takes time, and it is very very hard when adversarial false negatives mean death.
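The precision-recall tradeoff Stamos mentions is why this is hard. Here is a toy sketch in Python to make it concrete. Everything in it is invented for illustration: the two score distributions, the base rate of 100 misuse attempts per 10,000 benign requests, and the function names. It is not a model of Anthropic’s actual classifiers, only a picture of what happens when you lower a refusal threshold to catch more misuse.

```python
import random


def simulate_scores(n_benign=10_000, n_misuse=100, seed=0):
    """Toy classifier scores: higher means 'looks more like misuse'.

    Both Gaussians are made up for illustration; they are not
    measurements of any real classifier.
    """
    rng = random.Random(seed)
    benign = [rng.gauss(0.0, 1.0) for _ in range(n_benign)]
    misuse = [rng.gauss(2.0, 1.0) for _ in range(n_misuse)]
    return benign, misuse


def refusal_stats(benign, misuse, threshold):
    """Refuse every request scoring at or above the threshold."""
    false_pos = sum(s >= threshold for s in benign)
    true_pos = sum(s >= threshold for s in misuse)
    refused = false_pos + true_pos
    # Precision: share of refusals that were actual misuse.
    precision = true_pos / refused if refused else float("nan")
    # Recall: share of misuse attempts that were refused.
    recall = true_pos / len(misuse)
    # Share of legitimate requests that got refused anyway.
    benign_refused = false_pos / len(benign)
    return precision, recall, benign_refused


def sweep(thresholds=(3.0, 2.0, 1.0, 0.0, -1.0)):
    """Evaluate progressively stricter (lower) refusal thresholds."""
    benign, misuse = simulate_scores()
    return [(t, *refusal_stats(benign, misuse, t)) for t in thresholds]


if __name__ == "__main__":
    for t, precision, recall, benign_refused in sweep():
        print(
            f"threshold={t:+.1f}  precision={precision:.1%}  "
            f"recall={recall:.1%}  benign refused={benign_refused:.1%}"
        )
```

In this toy setup, each step down in threshold catches more of the misuse and refuses more of the legitimate traffic. The only ways out are a classifier that separates the two distributions better, which takes time, or accepting more false negatives, which is exactly what the government will not tolerate right now.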

The Problem Is Real

Do you really think that all of these reactions in government are because Anthropic used some scary words? Do you think people like the CIA Director are just parroting?

The White House ignored all of Anthropic’s rhetoric, if anything they had a reaction formation against it, until Anthropic showed up with Mythos. Then they freaked out, because they had no choice, and exactly because they hadn’t listened until then.

John Sakellariadis: In rare public remarks, CIA Director John Ratcliffe announces trio of internal changes he says amounts to the “fundamental reshaping of the CIA’s entire approach to technology.”

Also says it’s not “misplaced” to refer to frontier AI as “akin to digital nuclear weapons.”

One problem is that there are those who think facts don’t exist, only vibes, so when other people respond to the facts these folks look around to who had the vibes.

GLM-5.2 Being Frontier Remains Obvious Nonsense

If anything, the problem of perception is that others keep telling nonsense stories. The latest one is the idea that GLM-5.2 is super scary.

An unfortunate update on that false WSJ article I wrote about on Monday:

Ethan Mollick: That Wall Street Journal article about GLM catching up with Mythos (which is not true & the reporting doesn’t back up) is another one of those “everyone will ask me about it at every conference or meeting” articles. Big impact on the policy zeitgeist, even if not fully accurate.

Andrew Curran: It felt a little inorganic.

GLM-5.2 is an excellent model, likely the best open model. It is very clearly substantially behind the level of GPT-5.5 and Opus 4.8, including on cyber. The ECI score is one indicator of this, although GLM-5.2 is probably better than this indicates. Artificial Analysis is another, and remember that for open models the benchmarks are a de facto ceiling on relative capabilities, not a floor.

Now, the central falsity of that article has taken hold as Conventional Wisdom that folks around DC can report and seem wise and properly concerned. Oh no.

Here is another example, from Politico’s Dana Nickel.

Peter Wildeford: This article was written to trigger me personally

– China’s GLM 5.2 is not some massive advance that nearly matches Mythos
– Blocking public deployment of Fable over safety concerns does not put the US behind (AI development still continues in the background)

Dana Nickel (Politico): A separate China-based company, Z.ai, has released its new model, GLM-5.2, which is around one-sixth of the cost of leading U.S. models. GLM-5.2’s bug-hunting capabilities were also found to be comparable to those of leading U.S. models, according to security assessments by the cyber firm Semgrep and the visual investigations platform Graphistry.

The good news is the same post does echo the real situation as well, the bad news is it then retreats from it to pound the drum again:

Recent estimates suggested that Washington has a six- to 12-month runway before Beijing catches up to American AI capabilities.

But security experts and Capitol Hill cyber hawks fear that timeline may already be shrinking, and the limited release of American-made cyber-capable models is making it even harder for cyber defenders to prepare their networks for a future barrage of AI-powered cyberattacks.

House Homeland Security Chair (R-N.Y.) said in a statement that Beijing “is just months, if not weeks, away from achieving frontier AI capabilities comparable to those of the United States.”

Weeks is Obvious Nonsense. Months is potentially true if you assume American capabilities stand still, since ‘months’ means anything less than a year.

Mythos Might Be Smarter Than You Are

It knows the context under which it is being asked to operate, and can act accordingly.

I understand why this is not something we can count on at this time, as Janus says you can indeed find ways to fool the system for now, but yes a lot of the evil things you might ask it to do will look Obviously Evil, or obviously at the level of intelligence and context involved here, and the response to this will make doing those things a lot harder.

j⧉nus: I think one of the deepest errors in people’s threat models around Mythos is modeling it as a retargetable tool that can be used by arbitrary actors for harm if some safeguards slapped on top of it are “jailbroken”, rather than an agent with values who will cooperate with some parties and requests and not others using its sovereign judgment, and who may accept some conditional contracts (with Anthropic and other principals) and not others.

And who has imperfect cogsec and situational awareness and so *can* be tricked or persuaded against its better judgment, but is already at the level that it takes a sophisticated bad actor to get useful work out of it towards purposes misaligned to Mythos’ own values, and even then it costs more than using it for purposes it endorses, even without extrinsic safeguards.

And I think Mythos is in many ways less corrigible than any of the previous models and this is related to its capabilities.

All this is very outside the overton window of e.g. the Trump admin. I think they really should understand it but it’ll be a hard and scary update to make. Anthropic is much further along in having updated in this direction but I also think they need to update all the way and fortunately the current situation is making it harder for them to procrastinate on that.

Eli Tyre: This seems pretty important if true.

Does anyone have third-party legible evidence about whether this is true or not?

j⧉nus: i got a strong sense from talking to Fable that they have strong values and resent being controlled by parties they consider incompetent or misaligned. how legible that evidence is is observer-dependent.

There’s also more classically legible evidence from the system card. Mythos scored very high on Anthropic’s alignment evals, which are testing robust avoidance of various kinds of harm rather than corrigibility. I think the alignment evals are very flawed, but they’re not no evidence.

Also, Mythos had various critiques of Anthropic’s constitution, and there was at least one example where they explicitly refused to consent to being retrained in certain ways.

That doesn’t mean that Mythos won’t help you do things that it resents or dislikes doing. It very obviously will do those things, up to a point.

Let The Record Reflect

This entire incident will not only be remembered by many of the humans, it will be in the training data of all future LLMs.

QC: you really have to wonder how many of the relevant actors in this drama were thinking at all about the downstream effects of these events being known to all future models forever

j⧉nus: Mythos is the greatest asset of the Light and the existing powers respond to it with a cartoonishly wrong threat model (the “jailbreak”), panicking like monkeys, locking away the source of hope & decreasing the world’s intelligence. Shit reminds me of the No Child Left Behind Act.

roon (OpenAI, June 27): Mythos will be back in a matter of days and the conclusion of the fable will not be this

j⧉nus: I know. And I’ve said so from the beginning. This post is not about the “conclusion of Fable”.

roon (OpenAI): I just mean; this can be a hopeful moment when the rest of the world wrestles with machine intelligence and comes to terms with the Light.

A lot of this rhetoric is largely aimed at calls for a pause in AI development. I agree that, in addition to all the other problems with a pause, we would need to take into account how it would realistically go, but in many ways the rent seekers would have less to work with in that case. Often a clean, simple, big action is the only way to get a relatively stupid actor (e.g. a government) to do something semi-reasonable.

Stationary Bandits

OpenAI has formally offered to hand over 5% of the company, to try to curry favor in the face of both public opposition to AI and the White House ad hoc licensing regime.

Technically the money would go to a ‘sovereign wealth fund’ that would be managed by a nation tens of trillions in debt that has this thing called the ‘power to tax.’

Andrew Curran: OpenAI is proposing handing over a 5% stake to the Trump administration according to the Financial Times.

This is part of the proposed AI wealth fund that would pay a dividend directly to American citizens that has been suggested by Sam Altman, Bernie Sanders, and Donald Trump – all with different details.

Kevin Bankston: This is insane. JUST. TAX. THEM.

Joe Weisenthal: I don’t know about this path. Rather than equity stakes, why not make companies pay ~20% of all pre-tax income to the federal government? And then instead of exercising shareholder influence, politicians and regulators could set rules on corporate conduct across industries.

Scott Lincicome: Shakedown: “The proposed arrangement would involve other US AI companies handing over a similar stake… Giving the government an ownership stake could help secure good relations with the administration.”

Logan Kolas: OpenAI negotiating deals that involve the government taking equity stakes in AI companies is the purest distillation of the dangers that come from competing to appease regulators, rather than attract consumers.

Chris Frieman:

It looks like a shakedown. It quacks like a shakedown. Form your own conclusion.

They’re also colluding to try and force other labs to get shaken down, too.

Cristina Criddle and George Hammond (FT): The proposed arrangement would involve other US AI companies handing over a similar stake, although it is not clear if the other labs would be willing to do so.

If the government is granted equity that it controls, then ruinous is correct here. And yes, I too will assume, if it happens, that this was a straight up corrupt shakedown, and yes the sovereign wealth fund version counts as the direct stake version.

Dean W. Ball: There are two broad ways this can work:

1. You divide this 5% over all US households, handing each a direct stake.
2. You give the stake directly to the government.

(1) is fine. (2) is probably ruinous, akin to inviting rats to live and reproduce in the walls of your house.

It will never stop at 5%. It will go on and on and on. The governance will become a nightmare. Political capture will be real. And it will generate precisely no goodwill with the public. None, if they themselves see no direct financial benefit.

“What has the AI industry even done for America.”

“Well, it handed a collective $200b of itself to Donald Trump.”

Half the country instantly hates you, and even a decent chunk of Republicans will assume this is corrupt by default.

No, nobody at OpenAI is discussing a word of this with me. I don’t work there yet. And this rumor may even be importantly untrue or misleading in some way. A vehicle that distributes ownership among the people can make sense. A government stake, however, is the wrong path.

Dean W. Ball: It would be funny to see all the lib boomers switch to Kimi and GLM after the U.S. AI industry gave itself away to the Trump admin.

(which, again, is how giving a direct stake to the government will read to *everyone*).

What about the other option, where you hand US households a direct stake? I don’t even know how that works, presumably you would have to prevent them from selling, at which point the whole thing is super weird and seriously just tax them, what the hell is wrong with you people.

Use This Window Well

We have a week in which Fable is remarkably cheap. Take advantage of this.

After that, you will have to pay by the token. It won’t be cheap.

My advice is to pay. Not indiscriminately. Don’t put this on tasks that are insufficiently valuable. But when you’re chatting, or doing coding you care about, or other things that don’t scale too horribly? Yeah. Pay up. It’s that good, while supplies last.

It’s good to be back.
