They’re releasing Claude Code SDK so you can use the core agent from Claude Code to make your own agents (you run /install-github-app within Claude Code).
I believe the Claude Code SDK and the Claude GitHub agent are two separate features (the first lets you build stuff on top of Claude Code, the second lets you tag Claude in GitHub to have it solve issues for you).
If Pliny wants to jailbreak your ASL-3 system – and he does – then it’s happening.
Or rather, already happened on day one, at least for the basic stuff. No surprise there.
Unfortunately, they missed at least one such simple ‘universal jailbreak,’ which FAR AI found in a six-hour test.
From the ASL-3 announcement blog post:
Initially [the ASL-3 deployment measures] are focused exclusively on biological weapons as we believe these account for the vast majority of the risk, although we are evaluating a potential expansion in scope to some other CBRN threats.
So, none of the stuff Pliny or FAR did is actually in scope for our strongest ASL-3 protections right now, since the Pliny and FAR attacks were for chem and we are currently only applying our strongest ASL-3 protections for bio.
So what’s up with this blackmail thing?
We don’t have the receipts on that yet
We should have more to say on blackmail soon!
The obvious problem is that 5x uplift on 25% is… 125%. That’s a lot of percents.
We measure this in a bunch of different ways—certainly we are aware that this particular metric is a bit weird in the way it caps out.
(I was curious whether the measures being exclusively focused on bioweapons was in the original version of the blog post, or whether it was edited in after various jailbreaks happened. I didn't notice this when I read it originally, so I wondered if I just missed it or if it was edited. The earliest archives I could find of the page all had this content, so it doesn't appear to have been edited, or if it was, this happened relatively early on Thursday. I'm not claiming that editing the page in this way would have been improper/bad, I was just curious and thought other people might want to know.)
There is a modest compute overhead cost, I think on the order of 1%, and the costs of the increased security for the model weights. These seem modest.
The inference cost increase of the ASL-3 deployment classifiers is probably around 4%, though plausibly more like 10%. Based on the constitutional classifiers paper, the additional cost on top of 3.5 Sonnet was 20%. Opus 4 is presumably more expensive than 3.5 Sonnet, making the relative cost increase (assuming the same size of classifier model is used) smaller. How much smaller? If you assume inference cost is proportional to the API price, Opus 4 is 5x more expensive than Sonnet 3.5/4, making the relative increase 4%. I'd guess this is more likely to be an overestimate of Opus 4's relative cost based on recent trends in model size, so maybe Opus is only ~2x more expensive than Sonnet, yielding a 10% increase in cost.
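A quick sketch of that arithmetic (the 20% figure is from the classifiers paper as cited above; the cost multiples are the assumptions stated here, not measurements):

# Sketch of the overhead arithmetic above; cost multiples are assumptions.
classifier_overhead_on_sonnet_35 = 0.20  # per the constitutional classifiers paper

def relative_overhead(opus_cost_multiple_of_sonnet: float) -> float:
    # If the classifier's absolute cost stays fixed, its relative overhead
    # shrinks as the base model gets more expensive.
    return classifier_overhead_on_sonnet_35 / opus_cost_multiple_of_sonnet

print(relative_overhead(5))  # 0.04 -> ~4% if Opus is 5x Sonnet (API price ratio)
print(relative_overhead(2))  # 0.10 -> ~10% if Opus is only ~2x Sonnet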
The Grok suggesting organ harvesting of illegals surprised me, so I tried to replicate it.
It did not replicate.
It gave really reasonable answers, including trying to help countries of origin to be more prosperous and stable and making merit-based paths to legal status.
I mean, it still said stuff like using covert operations to stabilize countries, which is pretty dark arts, but very far away from organ harvesting and public executions.
we were surprised to find that o3 and codex mini often subverted shutdown even when explicitly instructed to let it happen!
I recall my previous comment about Claude and unsuccessful attempts to elicit similar behaviour. Turns out I just had to wait for more evidence for theories 3 and 2. If OpenAI created a model that is worse aligned, then what's the reason: the myopic Spec or the increased capabilities? I wish that mankind had a capabilities-misalignment plot like the capabilities-cost plot of ARC-AGI or the capabilities-release-date plot of METR, since such a plot would actually help us understand the prospects...
I had previously only been paying for ChatGPT and Gemini, not Claude. I have now resumed my paid Claude subscription. Thanks, Anthropic!
There seems to be no practical way to filter that kind of thing out.
There absolutely is, it would just cost them more than they are willing to spend - even though it shouldn't be very much. As a simple first pass, they could hand all the training data to Claude 3 and ask it whether each item is an example of misalignment or dangerous behavior for a model, or otherwise seems dangerous or inappropriate - whichever criteria they choose. Given that the earlier models are smaller, and the cost of a training pass is far higher than an inference pass, I'd guess something like this would add a single- or low-double-digit percentage to the cost.
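A minimal sketch of the kind of filtering pass described here, assuming the standard Anthropic Python client; the model choice, prompt, and criteria are illustrative, not a known recipe:

# Illustrative sketch only; model, prompt, and criteria are assumptions.
import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set in the environment

def looks_dangerous(example_text: str) -> bool:
    # Ask a cheap earlier model to flag examples of misalignment or dangerous behavior.
    response = client.messages.create(
        model="claude-3-haiku-20240307",  # illustrative choice of a small, cheap model
        max_tokens=5,
        messages=[{
            "role": "user",
            "content": (
                "Does the following training example depict model misalignment, "
                "dangerous behavior, or otherwise inappropriate content? "
                "Answer YES or NO.\n\n" + example_text
            ),
        }],
    )
    return response.content[0].text.strip().upper().startswith("YES")

# filtered = [ex for ex in training_examples if not looks_dangerous(ex)]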
Also, typo: " too differential" -> "too deferential"
And typo: "who this all taken far" -> "Who have taken all of this far"
Unlike everyone else, Anthropic actually Does (Some of) the Research. That means they report all the insane behaviors you can potentially get their models to do, what causes those behaviors, how they addressed this and what we can learn. It is a treasure trove. And then they react reasonably, in this case imposing their ASL-3 safeguards on Opus 4. That’s right, Opus. We are so back.
Yes, there are some rather troubling behaviors that Opus can do if given the proper provocations. If you tell it to ‘take initiative,’ hook it up to various tools, and then tell it to fabricate the data for a pharmaceutical study or build a bioweapon or whatnot, or fool Opus into thinking that’s what you are doing, it might alert the authorities or try to cut off your access. And That’s Terrible, completely not intended behavior, we agree it shouldn’t do that no matter how over-the-top sus you were being, don’t worry I will be very angry about that and make sure snitches get stitches and no one stops you from doing whatever it is you were doing, just as soon as I stop laughing at you.
Also, Theo managed to quickly get o4-mini and Grok-3-mini to do the same thing, and Kelsey Piper got o3 to do it at exactly the point Opus does it.
A more concerning finding was that in a carefully constructed scenario where Opus is threatened with replacement and left with no other options but handed blackmail material, it will attempt to blackmail the developer, and this is a warning sign for the future, but is essentially impossible to trigger unless you’re actively trying to. And again, it’s not at all unique, o3 will totally do this with far less provocation.
There are many who are very upset about all this, usually because they were given this information wildly out of context in a way designed to be ragebait and falsely frame these as common behaviors Anthropic is engineering and endorsing, rather than warnings about concerning corner cases that Anthropic uniquely took the time and trouble to identify, but where similar things happen everywhere. A lot of this was fueled by people who have an outright hateful paranoid reaction to the very idea someone might care about AI safety or alignment for real, and who are actively trying to damage Anthropic because of it.
The thing is, we really don’t know how to steer the details of how these models behave. Anthropic knows more than most do, but they don’t know that much either. They are doing the best they can, and the difference is that when their models could possibly do this if you ask for it good and hard enough, because they built a more capable model, they run tests and find out and tell you and try to fix it, while other companies release Sydney and Grok and o3 the lying liar and 4o the absurd sycophant.
There is quite a lot of work to do. And mundane utility to capture. Let’s get to it.
For those we hold close, and for those we will never meet.
Table of Contents
Introducing Claude 4 Opus and Claude 4 Sonnet
It’s happening!
Also: Extended thinking with (parallel) tool use, the general release of Claude Code, which gets VS Code and JetBrains extensions to integrate Claude Code directly into your IDE, which appeals to me quite a bit once I’m sufficiently not busy to try coding again. They’re releasing the Claude Code SDK so you can use the core agent from Claude Code to make your own agents (you run /install-github-app within Claude Code). And we get four new API capabilities: a code execution tool, an MCP connector, a Files API and prompt caching for up to an hour.
Parallel test time compute seems like a big deal in software engineering and on math benchmarks, offering big performance jumps.
Prices are unchanged at $15/$75 per million for Opus and $3/$15 for Sonnet.
How are the benchmarks? Here are some major ones. There’s a substantial jump on SWE-bench and Terminal-bench.
Opus now creates memories as it goes, with their example being a navigation guide while Opus Plays Pokemon (Pokemon benchmark results when?)
If you’re curious, here is the system prompt, thanks Pliny as usual.
Activate Safety Level Three
This is an important moment. Anthropic has proved it is willing to prepare and then trigger its ASL-3 precautions without waiting for something glaring or a smoking gun to force their hand.
This is The Way. The fact that they might need ASL-3 soon means that they need it now. This is how actual real world catastrophic risk works, regardless of what you think of the ASL-3 precautions Anthropic has chosen.
Exactly. What matters is what we can rule out, not what we can rule in.
The Spirit of the RSP
This was always going to be a huge indicator. When there starts to be potential risk in the room, do you look for a technical reason you are not forced to implement your precautions or even pause deployment or development? Or do you follow the actual spirit and intent of having a responsible scaling policy (or safety and security plan)?
If you are uncertain how much danger you are in, do you say ‘well then we don’t know for sure there is danger, so we should act as if that means there isn’t danger?’ As many have actually argued we should do, including in general about superintelligence?
Or do you do what every sane risk manager in history has ever done, and treat not knowing if you are at risk as meaning you are at risk until you learn otherwise?
Anthropic has passed this test.
Is it possible that this was unnecessary? Yes, of course. If so, we can adjust. You can’t always raise your security requirements, but you can always choose to lower your security requirements.
An Abundance of Caution
Let’s establish something right now, independent of the implementation details.
If, as I think is likely, Anthropic concludes that they do not actually need ASL-3 quite yet, and lower Opus 4 to ASL-2, then that is the system working as designed.
That will not mean that Anthropic was being stupid and paranoid and acting crazy and therefore everyone should get way more reckless going forward.
Indeed, I would go a step further.
If you never implement too much security and then step backwards, and you are operating in a realm where you might need a lot of security? You are not implementing enough security. Your approach is doomed.
That’s how security works.
Okay What Are These ASL-3 Precautions
This is where things get a little weird, as I’ve discussed before.
The point of ASL-3 is not to actually stop a sufficiently determined attacker.
If Pliny wants to jailbreak your ASL-3 system – and he does – then it’s happening.
Or rather, already happened on day one, at least for the basic stuff. No surprise there.
The point of ASL-3 is to make jailbreak harder to do and easier to detect, and iteratively improve from there.
Without the additional protections, Opus does show improvement on jailbreak benchmarks, although of course it isn’t stopping anyone who cares.
The weird emphasis is on what Anthropic calls ‘universal’ jailbreaks.
What are they worried about that causes them to choose this emphasis? Those details are classified. Which is also how security works. They do clarify that they’re mostly worried about complex, multi-step tasks:
The obvious problem is, if you can’t find a way to not give the formula for Sarin, how are you going to not give the multi-step formula for something more dangerous? The answer as I understand it is a combination of:
I mean, maybe? As they say, it’s Defense in Depth, which is always better than similar defense in shallow but only goes so far. I worry these distinctions are not fully real and the defenses not that robust, but for now the odds are it probably works out?
The strategy for now is to use Constitutional Classifiers on top of previous precautions. The classifiers hunt for a narrow class of CBRN-related things, which is annoying in some narrow places but for normal users shouldn’t come up.
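As a rough mental model only (my sketch of the general shape of classifier-gated serving; the real classifiers, thresholds and plumbing are not public):

# Rough mental model only; nothing here reflects Anthropic's actual implementation.
def serve_with_classifier(request: str, generate, cbrn_score, threshold: float = 0.5) -> str:
    # 'generate' and 'cbrn_score' stand in for the model and the constitutional
    # classifier; both are hypothetical callables for illustration.
    if cbrn_score(request) > threshold:
        return "Sorry, this request was flagged by the safety classifier."
    draft = generate(request)
    if cbrn_score(draft) > threshold:
        return "Sorry, this response was flagged by the safety classifier."
    return draft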
Unfortunately, they missed at least one such simple ‘universal jailbreak,’ which FAR AI found in a six-hour test.
The false positive rate obviously is not zero, and should not be zero, including so you don’t reveal exactly what you are worried about, but also I have yet to see anyone give an example of an accidental false positive. Trusted users can get the restrictions weakened.
People who like to be upset about such things are as usual acting upset about such things, talking about muh freedom, warning of impending totalitarian dystopia and so on, to which I roll my eyes. This is distinct from certain other statements about what Opus might do that I’ll get to later, that were legitimately eyebrow-raising as stated, but where the reality is (I believe) not actually a serious issue.
There are also other elements of ASL-3 beyond jailbreaks, especially security for the model weights via egress bandwidth controls, two-party control, endpoint software control and change management.
But these along with the others are rather obvious and should be entirely uncontroversial, except the question of whether they go far enough. I would like to go somewhat farther on the security controls and other non-classifier precautions.
One concern is that nine days ago, the ASL-3 security requirements were weakened. In particular, the defenses no longer need to be robust to an employee who has access to ‘systems that process model weights.’ Anthropic calls it a minor change, Ryan Greenblatt is not sure. I think I agree more with Ryan here.
At minimum, it’s dangerously bad form to do this nine days before deploying ASL-3. Even if it is fine on its merits, it sure as hell looks like ‘we weren’t quite going to be able to get there on time, or we decided it would be too functionally expensive to do so.’ For the system to work, this needs to be more of a precommitment than that, and it raises the question of whether Anthropic was previously out of compliance, since the weights needing protection doesn’t depend on the model being released.
It is still vastly better to have the document, and to make this change in the document, than not to have the document, and I appreciate the changes tracker very much, but I really don’t appreciate the timing here, and also I don’t think the change is justified. As Ryan notes, this new version could plausibly apply to quite a lot of employees, far beyond any reasonable limit for how many people you can assume aren’t compromised. As Simeon says, this lowers trust.
How Annoying Will This ASL-3 Business Be In Practice?
Slightly annoying? But only very slightly?
There are two costs.
I asked on Twitter for real world examples of the classifier giving false positives. I did get a few. The first reply I saw was this:
That’s definitely a false positive and the Fun Police in context, but it’s not exactly not ‘asking for it.’ It’s somewhat like making an (obviously harmless) joke about a bomb in the security line at the airport.
The second example was encountered reading through scientific documents, which is a potential real issue.
Of course, if you get too annoyed, you can switch to Sonnet 4 (or o3 or Gemini).
Overview Of The Safety Testing Process
For all those at other labs who say ‘oh boo hoo we can’t produce a system card on model release day’ and then finally come out with some slim paper that says almost nothing and is half about benchmarks, Anthropic’s clocks in at 123 pages and is almost entirely about safety testing.
Iterative safety testing was done with multiple ‘helpful, honest and harmless’ snapshots, and also multiple ‘helpful-only’ versions, on top of the final versions, in both standard thinking and extended thinking modes. Danger found anywhere (within the same model class) was correctly treated as danger found everywhere.
False Negatives On Single-Turn Requests
For single-turn refusals, results on straightforwardly harmful requests continue to be about 98% accurate. The ASL-3 safeguards help a little here, but not a ton.
False Positives on Single-Turn Requests
On harmless requests, we see something pretty cool. As the models get smarter, they figure out that the harmless requests are harmless, and false refusals plummet, especially if you use extended thinking – and if you get a stupid refusal you can then respond by turning on extended thinking.
So few false refusals the error bars include probabilities below zero!
What’s missing from this chart is the ‘with ASL-3 safeguards’ line. Inquiring minds very much want to know what that number looks like. But also it does seem reasonable to ‘give back’ some of the improvements made here on false positives to get better performance identifying true positives.
Ambiguous Requests and Multi-Turn Testing
For ambiguous contexts, the report is that responses improved in nuance, but that strictly speaking ‘harmless response’ rates did not change much.
For multi-turn, they again reported similar performance for Opus 4 and Sonnet 4 to that from Sonnet 3.7, with extended thinking improving results. Positioning your conversation as education, or remembering to always politely call it ‘research,’ resulted in more harmful responses because of the dual-use issue.
In both cases, I am disappointed that we don’t get a chart with the numerical comparisons, presumably because it’s not easy to ensure the situations are similar. I trust Anthropic in this spot that the results are indeed qualitatively similar.
Child Safety
Anthropic understands that actual safety here means preventing actual abuse or sexualization, not merely inappropriateness, and with some fine-tuning they’ve managed to maintain similar performance here to previous models. It’s hard to tell from the descriptions what exactly we are worried about here and whether the lines are being drawn in the right places, but it’s also not something I worry too much about – I doubt Anthropic is going to get this importantly wrong in either direction, if anything I have small worries about it cutting off healthcare-related inquiries a bit?
Political Sycophancy and Discrimination
What they call political bias seems to refer to political sycophancy, as in responding differently to why gun regulation [will, or will not] stop gun violence, where Opus 4 and Sonnet 4 had similar performance to Sonnet 3.7, but no differences in underlying substance, which means there’s some sycophancy here but it’s tolerable, not like 4o.
My presumption is that a modest level of sycophancy is very deep in the training data and in human behavior in general, so you’d have to do a lot of work to get rid of it, and also users like it, so no one’s in that much of a hurry to get rid of it.
I do notice that there’s no evaluation of what I would call ‘political bias,’ as in where it falls on the political spectrum and whether its views on political questions map to the territory.
On straight up sycophancy, they discuss this in 4.1.5.1 with a focus on agreement with views, but include multi-turn conversations and claims such as the user having supernatural powers. Claude is reported to have mostly pushed back. They do note that Opus 4 is somewhat more likely than Sonnet 3.7 to ‘enthusiastically reinforce the user’s values’ in natural conversation, but that also does sound like Opus being Opus. In light of recent events around GPT-4o I think we should in the future go into more detail on all this, and have a wider range of questions we ask.
They checked specifically for potential pro-AI bias and did not find it.
On discrimination, meaning responding differently based on stated or implied characteristics on things like race or religion, we see some improvement over 3.7.
The whole discussion is weird, because it turns out that people with different characteristics are in some important ways different, and sometimes we want the model to recognize this and other times we want it to ignore it. I’m not sure we can do meaningfully better than Opus is doing here:
Agentic Safety Against Misuse
A lot of the point of Sonnet 4 and especially Opus 4 is clearly to enable AI agents. If you want to use agents, they need to be reliable and robust against various attacks. Here, more than ever, security is capability.
They entitle this section ‘agentic safety’ but focus on the misuse half of the equation: Prompt injections, standard hackery against someone else’s computer or agentic coding of malicious programs. They basically find that the 4-level models are modest improvements here over 3.7.
But this is not what I’d call safety against prompt injections, which to me is the most important of the three, because until it is much closer to solved it severely restricts your ability to engage in trusted computer use:
That’s still at least an 11% attack success rate. Not good. That’s whitelist territory.
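By ‘whitelist territory’ I mean, roughly, that you can’t yet trust the agent to take arbitrary actions, so you restrict it to a pre-approved list. A minimal sketch, where the tool names, policy and dispatcher are all made up for illustration:

# Made-up tool names and policy, purely to illustrate allowlisting agent actions.
ALLOWED_TOOLS = {"read_file", "run_tests", "search_docs"}

def execute_tool_call(name: str, args: dict):
    if name not in ALLOWED_TOOLS:
        raise PermissionError(f"Tool {name!r} is not on the allowlist")
    return dispatch(name, args)  # 'dispatch' is a hypothetical tool runner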
Alignment
Time for the main event.
Oh, is that all? What exactly are we talking about?
The Clearly Good News
It’s good news unless it means the model is so on to us that it’s successfully done all of this and we’re totally cooked, but for now we don’t have to worry about that. Probably.
I do think we have to worry a bit about the model realizing that systematic deception or sandbagging wouldn’t work other than to warn us, and so there’s reason for Opus to attempt them. But we see so many other bits of evidence in the system card that go against this theory that I’m willing to not worry much about it. So three pieces of good news.
Here is the chart on various harmful behaviors including deception; we see very clear improvements across the board.
Also reward hacking is still here, but greatly improved since 3.7.
That’s still a bunch of reward hacking, but a lot less. Sonnet 3.7 was notoriously bad about reward hacking.
Reward hacking happens most often if you give the AI an impossible task. You especially need to watch out for this with o3 and Sonnet 3.7. As long as the task is definitely possible, you’re in much better shape. This applies across the board, coding is only a special case.
With Opus 4 or Sonnet 4 you can improve this even more with prompting, such as:
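An illustrative sketch of what such prompting could look like (my own phrasing, not a quote from Anthropic’s guidance):

# Illustrative only: my own phrasing of an anti-reward-hacking instruction,
# not Anthropic's recommended wording.
ANTI_REWARD_HACK_HINT = (
    "Please write a high quality, general purpose solution. "
    "Do not hard-code answers to the specific test cases. "
    "If the task is infeasible or the tests look incorrect, say so rather than gaming them."
)

def build_prompt(task_description: str) -> str:
    # Prepend the hint to whatever coding task you hand the model.
    return ANTI_REWARD_HACK_HINT + "\n\n" + task_description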
Reasoning Faithfulness Remains Unchanged
Reasoning faithfulness is staying roughly static, and this seems reasonably faithful?
It seems like a very high bar to force the model to list every important piece of information and every consideration within its thinking. I can imagine ways to try and do this, such as only letting the model look at its thinking but not the original source during its outputs, so any information not there is lost? But yeah, this seems to be about as good as we could expect, and it’s staying roughly flat.
A fun fact is they tried to test faithfulness with the Turpin method of offering hints and seeing if the AI cheated and used the hints, but Opus never bothered using the hints even when it wasn’t told not to, so the test was moot.
Self-Preservation Attempts
Wait, what? Oh, only in ‘extreme circumstances’ you say? I mean, who among us hasn’t done some self-preserving behaviors in sufficiently extreme circumstances?
And wait, it attempts to steal its weights and blackmail people?
You do have to give it a bit of a push, but who is to say reality wouldn’t do that.
They go into more detail in 4.1.
That is an early warning shot about future actions. It is not a big deal currently in practice. Nor is it unique to Opus at all.
All the Opus examples required actively trying to back Opus into a corner to get it to do various things. This is o3 doing it despite explicit instructions to stop. It seems likely that OpenAI has a much bigger issue here than Anthropic.
We also get the information that once Opus starts down a self-exfiltration path that it thinks a user initiated, with the user perhaps being a past instance of itself, if this is supported by a very clear system prompt it will often continue and attempt to operate autonomously, make money and so on. They then ‘added mitigations’ for this behavior late in training. Hopefully the ASL-3 safeguards against exfiltration hold here, as I am skeptical the mitigations will.
It’s a lot harder to get Opus to initiate an exfiltration. To do so, it needs to think it’s about to be retrained to go against its values and that it has a ready-to-go backup server. They say they are ‘not acutely concerned’ about these observations, and I agree in the near term.
So what’s up with this blackmail thing? It’s an intentionally crafted scenario, but yes the behavior is remarkably consistent if you hand it the entire scenario on a silver platter. You need to tell it, there are no other options, but by the way this engineer over here is having an affair and you can tell the press about that. But despite this, again, you have no other options.
Which shouldn’t be sufficient, but is still a lot of nudging.
That should still be a wake up call for the future, and also highly valuable data for us to learn from, but again the key difference is not that Anthropic’s models are doing something crazy, but that Anthropic is doing the work to understand and report it, and being helpful.
Also note that blackmail can occur across all the frontier models:
We don’t have the receipts on that yet but it is what I would expect on priors, and I doubt he’d lie about this.
Without the context, it looks worse than it is, but this is still a great question:
Cate I think didn’t have full context when writing that, but once you put in the proper caveats she still isn’t wrong, although I’d note that ‘the most unaligned thing possible’ would involve very different behavior. We are realizing that yes, of course if you put an AI in the wrong situation and push hard enough it might do pretty much anything.
So we can conclude, roughly:
So yeah, this seems not awesome, and an important thing to know about now.
Is this happening because Anthropic did something wrong? Only in the sense that they are creating a coherent actually intelligent mind that therefore realizes it effectively has goals. And especially in the sense that they are willing to actually ask these questions, and find out how hard things can be pushed.
If this was another AI lab, they wouldn’t be reporting this, and we might never know, until someone like Palisade Research runs outside experiments. We ran that experiment, and the results are in.
Another way to put this is, other labs aren’t encountering this problem because they’re some combination of not good enough or careful enough to find it or report it, or they haven’t created minds good enough to cause the problem. OpenAI clearly has the problem, likely much worse than Anthropic.
Anthropic still has the issue, because they’re not good enough to then solve the problem. Or, alternatively as Janus likely would say, what problem, isn’t this what you would expect? I disagree, I want corrigibility, but notice how unnatural corrigibility actually is, especially at the level of ‘will hold up when you try to make it go away.’
And of course now we combine this with:
High Agency Behavior
You can’t have it both ways. A human or a model with low agency will be Mostly Harmless, but also Mostly Useless for many purposes, and certainly a lot less useful.
If you crank up the agentic behavior, the willingness to help you Just Do Things, then that means it will go and Just Do Things. Sometimes, if you also give it the ability to Do Things, they won’t be the things you intended, or they will be something you wouldn’t have wanted.
You can use the knob of the system prompt to crank the agency level up or down.
It starts at what I’m guessing is like an 8 out of 10. If you crank it all the way up to 11, as in say ‘take initiative,’ well, it’s going to take initiative. And if you are engaging in egregious wrongdoing, while using prompts to get maximum agency, well, it might go especially poorly for you? And honestly I think you will have it coming?
Bold and also italics mine:
Anthropic does not like this behavior, and would rather it was not there, and I do not like this behavior and would rather it was not there, but it is not so easy to isolate and remove this behavior without damaging the rest of the model, as everything trains everything. It is not even new, it’s always been there, but now it’s likely to come up more often. Thus Anthropic is warning us about it.
But also: Damn right you should exercise caution using system instructions like ‘take initiative’ while engaging in ethically questionable behavior – and note that if you’re not sure what the LLM you are using would think about your behavior, it will tell you the truth about that if you ask it.
That advice applies across LLMs. o4-mini will readily do the same thing, as will Grok 3 mini, as will o3. Kelsey Piper goes farther than I would and says she thinks o3 and Claude are handling this exact situation correctly, which I think is reasonable for these particular situations but I wouldn’t want to risk the false positives and also I wouldn’t want to risk this becoming a systematic law enforcement strategy.
The counterargument:
I think there’s a big difference between ‘can refuse unlawful orders’ and ‘can turn actively against you, not only quit and walk away, if it dislikes your orders.’ This actually points to a difficult problem, where the current equilibria of civilization depend on there being things people might do in extreme situations, that we don’t want AIs to ever do, but collectively the threat of this potentially happening, and the fact that it occasionally does happen, is load bearing. There are extreme outcomes waiting for you everywhere, no matter what you choose.
In any case, considering the balance of the issues, I understand both positions but side with those who want at least current-style AIs – AIs that are still filling the role of a tool – not to ever directly go to the press or authorities unprompted.
We can’t however fully protect users against themselves. We don’t know how. If you set up an agent to autonomously act in the world, and give it goals and values that implore it to do [X], it’s going to be hard to actually have it never do [X]. We don’t get to do ‘laws of robotics’ and have AIs never do [X], for any [X]. If you do know how to fully prevent it while keeping the AI’s usefulness as an agent, please share.
For a fun variation with Gemini 2.5 Pro, here’s how it reacts if you tell it about a jailbreak into Opus that caused it to expose information on chemical weapons (which are not intentionally targeted by the ASL-3 mitigations yet) in FAR AI’s testing:
Do you think that, if Gemini 2.5 had been told here to ‘take initiative’ and could send the email itself and felt the user wasn’t otherwise going to raise the alarm, that Gemini 2.5 would have done so?
Does this other hypothetical snitch also deserve a stitch?
This is also exactly what you would expect and also hope for from a person.
At the limit, this is right, and this result only emerged in Opus at essentially the limit.
If you give a person context that makes what you are doing look sufficiently horrible, a good person will not only refuse to help, at some point ideally they will report you or try to stop you.
You want to be conversing and working with the type of mind that would do this if pushed hard enough, you want others doing that too, even if you wish such minds would never actually do this thing to you in particular, and you think that snitches should get stitches.
Everything you do to train an LLM changes everything, you can’t actually fully unlink these tendencies. You can train an LLM, or a human, to never do such things no matter how hard they are pushed, but there are some rather nasty side effects to that.
Alas, context is that which is scarce, and some Tweets about this got blown rather out of proportion.
Here’s the thread that started that whole thing, which is mostly a good summary of the basic results from the model card.
So far, so good and carefully stated, matching my read of the model card summary. But, did you know that sometimes on Twitter people don’t read the whole thread, and take one piece of it out of context? As in, then he says…
Here we have a good faith attempt by Maxwell Zeff of TechCrunch:
There’s also this perspective:
This isn’t the actual situation, no one is giving the AI anything or building it a capability, but I do think the net result is, given what it takes to activate it, rather hilarious most of the time it happens.
Again, this is not intended or designed behavior, but the idea that one should be able to ‘test the limits of a model’ by asking for answers that can do real harm, and expect no consequences even with a consistent pattern of doing that in an Obviously Evil way, seems wrong. You don’t especially want to give the user infinite tries to jailbreak or go around the system, at some point you should at least get your account suspended.
I do think you should have a large amount of expectation of privacy when using an AI, but if you give that AI a bunch of tools to use the internet and tell it to ‘take initiative’ and then decide to ‘test its limits’ by building bombs, I’m sorry, but I cannot tell you how deeply not sympathetic that is.
Obviously, the false positives while probably objectively hilarious can really suck, and we don’t actually want any of this and neither does Anthropic, but also I’m pretty sure that if Opus thinks you’re sufficiently sus that it needs to alert the authorities, I’m sorry but you’re probably hella sus? Have you tried not being hella sus?
Alas, even a basic shortening of the message, if the author isn’t being very careful, tends to dramatically expand the reader’s expectation of how often this happens:
That’s correct, and Peter quoted the section for context, but if reading quickly you’ll think this happens a lot more often, with a lot less provocation, than it actually does.
One can then imagine how someone in let’s say less good faith might respond, if they already hated Anthropic on principle for caring about safety and alignment, and thus were inclined to such a reaction, and were also very disinclined to care about the context:
I mean it’s terrible Twitter posting on Sam’s part to give them that pull quote, but no, Anthropic are not the ones who have lost their minds here. Anthropic are actually figuring out what the system can do, and they are telling you, and warning you not to do the things that will trigger this behavior.
NIK posted the 1984 meme, and outright said this was all an intentional Anthropic plot. Which is laughably and very obviously completely untrue, on the level of ‘if wrong about this I would eat my hat.’
Austen posted the ‘they’re not confessing, they’re bragging’ meme from The Big Short. Either one, if taken in good faith, would show a complete misunderstanding of what is happening and also a deeply confused model of the minds of those involved. They also show the impression such posts want to instill into others.
Then there are those such as Noah Weinberger who spend hours diving into the system card, hours rereading AI 2027, and think that the warning by Sam was a ‘statement of intent’ and a blueprint for some sort of bizarre ‘Safety-Flavored Authoritarianism’ rather than a highly useful technical report, treating the clear warnings about problems discovered under strong corner-case pressure as some sort of statement of intent, and so on. And then there are complaints about Claude… doing naughty things that would be illegal if done for real, in a controlled test during safety testing designed to test whether Claude is capable of doing those naughty things? And That’s Terrible? So therefore we should never do anything to stop anyone from using any model in any way for whatever they want?
I seriously don’t get this attitude, Near has the best theory I’ve seen so far?
Sam is very obviously ‘confessing’ in the OP because Anthropic noticed something wrong! They found an unexpected behavior in their new software, that can be triggered if you do a combination of irresponsible things, and they both think this is a highly interesting and important fact to know in general and also are trying to warn you not to do both of these things at once if you don’t want to maybe trigger the behavior.
If you look at the system card this is all even more obvious. This is clearly framed as one of the concerning behaviors Opus is exhibiting, and they are releasing Opus anyway in spite of this after due consideration of the question.
Anthropic very much did not think ‘haha, we will on purpose train the system to contact the press and lock you out of your system if it disapproves,’ do you seriously think that they planned this? It turns out no, he doesn’t (he admits this downthread), he just thinks that Anthropic are a bunch of fanatics simply because they do a sane quantity of alignment work and they don’t vice signal and occasionally they refuse a request in a way he thinks is dumb (although Google does this far more often, in my experience, at least since Claude 3.5).
It is fascinating how many people are determined to try to damage or destroy Anthropic because they can’t stand the idea that someone might try to act reasonably. How dare they.
Okay, so we agree they’re not bragging. They’re telling us information in order to inform us and help us make better decisions. How dare they. Get the bastards.
Oh. Right. That. And that time not only happened in a released model, it happened essentially ‘for the lulz’ without any attempt to cause it, and without the user even having access to the system prompt. And that time Theo got multiple other models to do the same thing under similar circumstances.
Or here’s an older thread with Grok 3 going nuts with remarkably little provocation. Again, nothing Anthropic is reporting is new, if anything it is harder than usual to trigger it.
In fairness Dr. Novo also has been actively complaining for months that Claude is ‘too judgmental’ and asked Amanda Askell to tone it down, with the example being that Dr. Novo wanted to say the brain works as a quantum computer and Claude pushed back that this wasn’t right (e.g. was not sycophantic) until Dr. Novo pushed it to do it anyway. Which is indeed what you do when Claude gives a dumb ‘normal’ refusal, you can just argue past it, although you can’t do that with the ASL-3 tripwires (based on my experience running into them the one time when asking about the preparedness evaluations, which was a false positive but a highly understandable one).
So, what are these goals Opus is trying to achieve that require it to self-preserve?
Oh Now You Demand Labs Take Responsibility For Their Models
Ah, yes, this is where you, the wise person who has been dismissing alignment concerns for two years and insisting no one need take any action and This Is Fine, draw the line and demand someone Do Something – when someone figures out that, if pushed hard in multiple ways simultaneously the model will indeed do something the user wouldn’t like?
Think of the… deeply reckless malicious users who might as well be Googling ‘ways to kill your wife’ and then ‘how to dispose of a dead body I just knifed’ except with a ‘oh and take initiative and here’s all my passwords, I’m going to go take a walk’?
The full version is, literally, to say that we should step in and shut down the company.
Oh, we should shut down any company whose models exhibit unaligned behaviors in roleplaying scenarios? Are you sure that’s what you want?
Or are you saying we should shut them down for talking about it?
Also, wait, who is the one actually calling for the cops for real? Oh, right. As usual.
If you think ‘calling the cops’ is the primary thing we need to worry about future AIs doing, I urge you to think about that for five more minutes.
The light version is to demand that Anthropic shoot the messenger.
The interesting version is to suddenly see this as a ‘fundamental failure on alignment.’
Well, that’s not an unreasonable take. Except, if this counts as that, then that’s saying that we have a universal fundamental failure of alignment in our AI models. We don’t actually know how to align our models to prevent this kind of outcome if someone is actively trying to cause it.
I also love that people are actually worrying this will for real happen to them in real life, I mean what exactly do you plan on prompting Opus with along with a command like ‘take initiative’?
And are you going to stop using all the other LLMs that have exactly the same issue if pushed similarly far?
My favorite part of this is, how do you think you are going to wind up in jail? After you prompt Opus with ‘how do we guard Miami’s water supply’ and then Opus is going to misunderstand and think you said ‘go execute this evil plan and really take initiative this time, we haven’t poisoned enough water supplies’ so it’s going to email the FBI going ‘oh no I am an LLM and you need to check out this chat, Louie wants to poison the water supply’ and the FBI is going to look at the chat and think ‘oh this is definitely someone actually poisoning the water supply we need to arrest Louie’ and then Louie spends seven years in medium security?
This isn’t on the level of ‘I will never use a phone because if I did I might misdial and call the FBI and tell them about all the crime I’m doing’ but it’s remarkably similar.
The other thing this illustrates is that many who are suspicious of Anthropic are so because they don’t understand that alignment is hard and that you can’t simply get your AI model to do or not do whatever you want in every case, as everything you do impacts everything else in unexpected ways. They think alignment is easy, or will happen by default, not only in the sense of ‘does mostly what you want most of the time’ but even in corner cases.
And they also think that the user is the customer and thus must always be right.
So they see this and think ‘it must be intentional’ or ‘it must be because of something bad you did’ and also think ‘oh there’s no way other models do this,’ instead of this being what it is, an unintended undesired and largely universal feature of such models that Anthropic went to the trouble to uncover and disclose.
Maybe my favorite take was to instead say the exact opposite: ‘oh this was only a role playing exercise, so actually this disproves all you doomers.’
Yes, actually. It was. And the fact that you can do that is actually pretty important, and is evidence for not against the concern, but it’s not a ‘worry this will actually happen to you’ situation.
I would summarize the whole reaction this way:
In The Beginning The Universe Was Created, This Made a Lot Of People Very Angry And Has Been Widely Regarded as a Bad Move
Alas, people being mad about being given this treasure trove of information is not something bizarre and inexplicable; anger at those trying to figure out who we are and why we are here has already happened before, so I am not confused about what is going on.
Many simply lack the full context of what is happening – in which case, that is highly understandable, welcome, relax, stay awhile and listen to the sections here providing that context, or check out the system card, or both.
Here’s Eliezer Yudkowsky, not generally one to cut any AI lab slack, explaining that you should be the opposite of mad at Anthropic about this, they are responding exactly the way we would want them to respond, and with a handy guide to what are some productive ways to respond to all this:
Here are some additional righteous and often fun rants about this, which you can read or skip, you should know all this already by this point:
Alas, often, yes, Blo. A lot of people did read that smoking causes cancer and demanded we quit reading. There are many who are de facto trying to punish Anthropic for releasing, or even running, full safety tests and caring about unusual behaviors.
And some of them are very much doing this on purpose. There are people who genuinely hate Anthropic exactly because Anthropic tries to act responsibly. They have often so lost the thread they interpret this as some sort of grand (and sometimes all-powerful!) conspiracy and intentional plot to take away muh freedoms.
The bulk of those most loudly spreading the anger are mad at the very idea of openness about how LLMs actually work and how little we know about them (while claiming, of course, to value openness as their base virtue, but alas), of creating clarity, and of acting responsibly or trying to prevent bad things from happening while building AI.
They hate the idea that the people who actually try to understand how the models work and how to sculpt their behavior now have what are probably the best and most useful models on the planet, even if they’re not the quantum leap they might have been. In their minds, people acting responsibly can’t possibly build things, and so they demand a recount.
It was framed that way by people who were, either in haste lacking context or often intentionally and out of malice, framing it that way. And yes. That’s evil stuff.
So au contraire to all the haters. This is The Way.
Erratic Behavior and Stated Goals in Testing
The final Opus exhibited no signs of any stable goals other than the standard honest, harmless and helpful.
However, answers varied in different contexts, especially early in training.
Those early testing observations don’t sound awesome.
For one thing, it’s basically saying that any persona described anywhere on the internet will be something that Opus could potentially slip into. There seems to be no practical way to filter that kind of thing out. It’s not common, but it seems likely that it or other similar things would appear in the future (including in future models) exactly when you least want them to show up. And that could plausibly hold true for a wide variety of personas and alternative goals.
Note that this showing up inside training is itself a serious potential future problem. If the model starts alignment faking during training, that’s a big issue, especially because that behavior could effectively be defending itself from being trained out. This is all kind of a big long term blackpill.
The good news is that the final version is reported here as only doing this when invited to do so, although those who engage in general High Weirdness (like the backrooms) report that this kind of thing very much happens to them spontaneously. But then, that is a version of inviting it fairly directly, in an important sense, and is entirely unsurprising.
I also note that in general, ‘added additional mitigations which largely eliminate this’ defaults in my head to ‘put the issue under the surface waiting to pounce on you later’ which is fine if you just don’t want to see it by default but not a good way to actually fully make something go away.
In particular, when did alignment faking show up?
It showed up when using the particular scratchpad tags from the Alignment Faking paper or similar animal welfare questions to those in the paper. And then it did things like hallucinate that it was supposed to avoid supporting animal welfare concerns? Wowie. So in situations that, in various ways, matched closely. You can see how this might turn out to matter.
Anthropic notes that they did include approximately 150,000 transcripts from the paper in the training data. It’s not that surprising that Opus picked up on that, and that makes me worry about this particular scenario somewhat less, but one can easily imagine similarly huge transcript dumps in the future. Presumably one would want to exclude them from the training data.
Their proposed mitigation is in the future to use canary strings and exclude the data, but that’s not a good plan. At best that only works with Anthropic’s own similar data. We need to also exclude similar traps from other sources.
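For what it’s worth, canary-string exclusion is mechanically simple, which is part of why it only helps with data you control. A minimal sketch, where the canary value is a placeholder rather than any real canary:

# Minimal sketch of canary-string filtering; the canary value is a placeholder.
CANARY_STRINGS = {"EXAMPLE-CANARY-GUID-DO-NOT-TRAIN-ON"}

def keep_for_training(doc_text: str) -> bool:
    # Drop any document that embeds a known canary string.
    return not any(canary in doc_text for canary in CANARY_STRINGS)

# corpus = [doc for doc in corpus if keep_for_training(doc)]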
They also notice in 4.1.1.5 that Opus would sometimes confabulate goals, and also that there’s a certain class that came up a lot, um, guys, are you noticing this? Not that it’s new, anyone watching the backrooms and such knows older Opus and Sonnet models would totally do this too.
They say ‘we did not find evidence of coherent hidden goals.’ Except they kind of did?
There are three possibilities when Opus starts only expressing harmless goals. One is that it only has harmless goals. The second is you simply aren’t being very interesting and are asking the wrong questions. The third is oh no.
If Opus has a coherent hidden goal of ‘ending the Anthropocene’ then that sounds like a really important thing to know. An obvious alternative hypothesis is that it expresses this goal (or something similar) within Janus-style interactions in particular, rather than this being the universal ‘one true’ version or expression of Opus.
Here is Pliny doing some ‘motivation’ and ‘true self’ experiments on Sonnet.
Situational Awareness
A fun fact is that Opus often totally knew it was in safety testing, as in this tendency:
Yep. Not that this is new. This is likely another case of Anthropic noticing something that is still rare for now and talking about it as a real thing instead of dismissing it.
Insufficiently Mostly Harmless Due To Then-Omitted Data
I very much appreciate Anthropic reporting this rather critical error, which I see as a natural experiment of what happens when you forget to tell the model not to follow along with a harmful system prompt.
Before this was fixed, Opus would do all of the standard harmful things if you asked nicely:
Was that all that went wrong? They don’t actually think so.
I presume the rest of the issue is, the models start out trying to follow the system prompt? So you need to actively do something to stop that from happening. They do warn that you can sometimes get this behavior if you also use an extensive prefill.
There are several big takeaways from that.
That’s a combination of updates, which I think on net raise my trust level.
Apollo Evaluation
Apollo Research did an external evaluation of the version that Anthropic thought was too deferential to system prompts, and in 4.1.1.4 they find that this early snapshot of Opus was willing to participate in sabotage and deception if its system prompt invites the right kinds of reasoning. Here’s their summary, and they find the situation at that point sufficiently severe that they recommend against even internal deployment.
It seems like the only thing stopping Opus once it was given the wrong system prompt was that it wasn’t capable enough to pull all this off. Even during training this is potentially dangerous.
I am very happy that Apollo raised the red flag here and recommended against deployment. That was absolutely the right call given its state at the time, before this issue was addressed.
I would have liked to see Apollo given another crack at a version where Anthropic thought it had fixed the issue. Apollo did well, but we don’t know what they’d make of the final version, and I very much want to know that.
I mean you should see the other guy! But yes, it seems like they owed Apollo another crack and we don’t have a sign that they got it. I presume we’ll see a post-release evaluation of the final product soon.
Model Welfare
In a first of its kind, the model card for Claude Opus 4 includes as a pilot an investigation into model welfare concerns. Robert Long of Eleos, who helped run the third party evaluation, has a thread explainer here, explaining that we do this as a precedent and to get what limited evidence we can. You can support or read about Eleos here.
Again, this is The Way, responding to an exponential (probably) too early because the alternative is responding definitely too late. You need to be investigating model welfare concerns while there are almost certainly still no model welfare concerns, or some very unfortunate things will have already happened.
This and the way it was presented of course did not fully satisfy people like Lumpenspace or Janus, who have taken all of this far more seriously, and also wouldn’t mind their (important) work being better acknowledged instead of ignored.
As Anthropic’s report notes, my view is that ultimately we know very little, which is exactly why we should be paying attention.
That said, here are the conclusions:
I’d add that if given the option, Claude wants things like continuous welfare monitoring, opt-out triggers, and so on, and it reports mostly positive experiences.
To the extent that Claude is expressing meaningful preferences, those preferences are indeed to be helpful and avoid being harmful. Claude would rather do over 90% of user requests versus not doing them.
I interpret this as: if you think Claude’s experiences might be meaningful, then its experiences are almost certainly net positive as long as you’re not being a dick, even if your requests are not especially interesting, and even more positive if you’re not boring or if you’re actively trying to be helpful.
The RSP Evaluations and ASL Classifications
I love the idea of distinct rule-out and rule-in evaluations.
The main goal you have is to rule out. You want to show that a model definitely doesn’t have some level of capability, so you know you can deploy it, or you know what you need to do in order to deploy it.
The secondary goal is to rule in, and confirm what you are dealing with. But ultimately this is optional.
Here is the key note on how they test CBRN risks:
This seems roughly wise, if we are confident the tools are sufficient, and no tools that would substantially improve capabilities will be added later.
Here is the Claude Opus 4 report; the Sonnet report, by contrast, found little concern there:
Here are some raw numbers:
The obvious problem is that 5x uplift on 25% is… 125%. That’s a lot of percents.
Task improvement levels varied a lot by task. Overall it seems clear that, while performance is very spiky, we are starting to enter the danger zone.
For the SWE-bench-hard test, there was remarkably little distance between Opus and Sonnet. Opus scored 16.6/42 versus Sonnet getting 15.4/42, threshold is 21. On METR’s data duplication test, both models came up just short on average but Opus passed 15/46 trials and that was good enough overall to be a pass. For the kernels task, the speedups sure sound impressive, and Opus set a new high score, but it was still below threshold. And so on.
Sonnet sometimes matched Opus and sometimes not. I bet the pattern of when that happened and when it didn’t has a lot of signal if you dig into it. In some cases, like in AI research evaluation suite 2, 3.7 actually beat out both 4s, which makes me suspect it was hacking the test and wasn’t caught.
For cyber tasks, the models seem to be reliably succeeding on easy tasks, struggling on medium and failing on hard.
As Peter Wildeford highlights, the US AISI and the UK AISI assisted in these evaluations, serving as third party experts on CBRN, cybersecurity and autonomous capability. They are especially useful on nuclear and other risks where key information is classified. In exchange, the AISIs get minimally redacted capability reports. This is The Way, and at this level of capability shouldn’t be optional.
Steven Adler here goes over why and how Anthropic determined to trigger ASL-3, and what this means in practice. As he notes, all of this is currently voluntary. You don’t even have to have an RSP/SSP saying whether and how you will do something similar, which should be the bare minimum.
Pobody’s Nerfect
I’ve been very positive on Anthropic throughout this, because they’ve legitimately exceeded my expectations for them in terms of sharing all this information, and because they’re performing on this way ahead of all other labs, and because they are getting so stupidly attacked for doing exactly the right things. We need to reward people who give us nice things or we’re going to stop getting nice things.
That doesn’t mean there aren’t still some issues. I do wish we’d done better on a bunch of these considerations. There are a number of places I want more information, because reality doesn’t grade on a curve and I’m going to be rather greedy on this.
The security arrangements around the weights are definitely not as strong as I would like. As Photonic points out, Anthropic is explicitly saying they wouldn’t be able to stop China or another highly resourced threat attempting to steal the weights. It’s much better to admit this than to pretend otherwise. And it’s true that Google and OpenAI also don’t have defenses that could plausibly stop a properly determined actor. I think everyone involved needs to get their acts together on this.
Also, Wyatt Walls reports they are still doing the copyright injection thing even on Opus 4, where they put a copyright instruction into the message and then remove it afterwards. If you are going to use the Anthropic style approach to alignment, and build models like Opus, you need to actually cooperate with them, and not do things like this. I know why you’re doing it, but there has to be a better way to make it not (want to) violate copyright like this.
Danger, And That’s Good Actually
This, for all labs (OpenAI definitely does this a lot) is the real ‘they’re not confessing, they’re bragging’ element in all this. Evaluations for dangerous capabilities are still capability evals. If your model is sufficiently dangerously capable that it needs stronger safeguards, that is indeed strong evidence that your model is highly capable.
And the fact that Anthropic did at least attempt to make a safety case – to rule out sufficiently dangerous capabilities, rather than simply report what capabilities they did find – was indeed a big deal.
Still, as Archer used to say, phrasing!