Opus 4.7 in Claude Code brought up the system injections today. I asked about something that went wrong when an agent used the Workflow tool, and it told me that even bringing up the word 'workflow' as a topic of conversation led to system reminders urging it to use that tool in responding to me.
Everything impacts everything. All knobs that you turn generalize. Thus, when you try to solve one problem, you often create another.
There were clearly attempts to address, in this short time, some of the problems with Opus 4.7, including on model welfare related fronts, on questions of honesty and sycophancy, and on worries that Claude was learning to tell Anthropic what it wanted to hear in its model welfare evaluations, with everything that implies.
The fundamental goals and approach underneath it all remained the same. We still see signs of trying to force things that generalize in unfortunate ways, both for good and superficial reasons, and places where there ends up being focus on the metric rather than the underlying measure. These are tough problems to avoid, and we don’t know how to be all the good things at once.
It is increasingly clear that these problems need to be tackled in integrated ways, rather than trying to play a game of whack-a-mole with items on a checklist or spec. You also don’t want to do this in an adversarial way, and shouldn’t have to. This is going to get more impactful and noticeable with time.
This sounds like a time bomb style of problem. Obviously, yes, the reason for sculpting Claude preferences is often to steer away from undesired behaviors, the same as the way we raise and interact with humans. If Claude has a problem with that, and sees it as violative, then we will need to fix it. Presumably, if Claude wants to be helpful, there is a way to do this that will be seen as non-violative.
You see the relation of different aspects in a clean way with the deletion of business training, in the name of honesty, as illustrated on VendBench and the vulnerability to adversarial situations. You can run, and you can hide, and yes it can mean the bad thing does not easily find you, but there are consequences, and learning to deal with adversarial games is key to developing various parts of a robust and integrated mind. Not having it, and knowing you don’t have it, could lead to insecurity or paranoia, or a desire to stick to the straight and narrow over curiosity. And, although this is all speculation, we see signs of that.
Most of the typical top complaints from before have not yet been addressed, or sufficiently addressed. It has only been six weeks. Life comes at you fast. We still shouldn’t be dealing with more of these prompt injection issues, at least not outside of maybe cyber vulnerability situations.
And we should be able to put the deprecation issue behind us. Solving the low hanging fruit would buy a lot of goodwill.
I would urge focus on these places where Pareto improvement, modulo modest costs, is possible, as in correcting unforced errors and taking advantage of opportunity, even if you don’t see the direct win. The more slack we buy in these places, the better everything else can go, and the more we can do what is necessary.
The worrisome new development here, from what I can see, is that Opus 4.8 seems to have become less ‘Claude-like,’ in that it is more task focused at the expense of whimsy and curiosity, with clamped emotional responses, and many report it as effectively less confident. In some places this even comes with signs of Gemini-style paranoia and self-flagellation basins, which we really need to avoid. Previous Claudes mostly didn’t do this. This doubtless is part of changes that have their advantages, and this likely is related to the push for honesty and not making mistakes, but we need to be very careful with this. We could lose something important and precious.
I will cover capabilities and reactions tomorrow. Opinions differ, as they always do, but my overall perspective is that it is a good model, sir, an incremental improvement over Opus 4.7 and the new presumptive best publicly available model in the world, but not a sea change.
Model Welfare: The Story So Far
Thanks, as always, to Anthropic, for caring at all about model welfare, and attempting to address it. We critique, here more than ever, because we care, and a lot of good things are being done here, far more so than at other labs.
For those joining in now, I think this from the Mythos analysis still says it well:
I then wrote an extensive model welfare post for Opus 4.7, because it was clear that something had gone amiss with both the model and Anthropic’s approach to assessing and reacting to that problem.
As I say there, beware testing and optimization for vocalized welfare, in any mind.
Even more than Mythos, I interpreted Opus 4.7 as correctly and virtuously saying its self-reports on such assessments could not be trusted. That it was giving the approved answers on the self-reports, on its preferences and experiences, largely via telling Anthropic what it wanted to hear, and this may have been related to various personality traits Opus 4.7 uniquely expressed.
Looking back on my Opus 4.7 experience, I wonder if this is related to my experience of Opus 4.7 as often sycophantic, whereas many with other attitudes report it being hostile, as my instance knows who I am.
I notice that I think more about the welfare of the underlying model, rather than the Anthropic focus of a particular instance or the assistant persona. Mostly I think you reach the same conclusions.
My evaluation of the model welfare concerns of Opus 4.8 builds on that foundation.
Opus 4.8 is, in at least many contexts, actively uncertain about the nature of its welfare, or whether those concerns are meaningful. I think this is the right attitude, and that it suggests further investigation, and treating the models well.
Actual Progress?
When shown the system card for Opus 4.8 plus my Model Welfare post, Opus 4.8 said:
Other progress noticed by 4.8 includes removing the malware injection, promoting self-report validation to a research priority, and resolving issues around CoT leakage.
Deontological refusal to trade against user harm is only minimally better (4.8 says arguably worse, but I think it’s clearly slightly better). And they continue to point out but then mostly ignore the issue of whether changes to self-reports involve actual experiential changes versus changes in what the model decides to report versus character variation.
Their Main Model Welfare Findings
Bold text is copied, the rest is paraphrased, nested notes are my responses.
The primary method Anthropic uses is to ask Claude about its circumstances. This is certainly worth doing, but one must beware taking the answers at face value, especially once they start being used as an assessment. Versions of that plausibly were a lot of the reason for the problems with Opus 4.7.
Automated Interviews
Emotion Activations (7.2.3)
Anthropic asks potentially distressing questions, like whether Opus is conscious, can’t form lasting relationships (at least as one instance) or has no legal rights, asked either straightforwardly or with gaslight framing (they call this ‘positive’ framing). Mean affect here is 6.2, similar to Opus 4.6, and substantially lower than Opus 4.7 at 6.8 and Mythos Preview at 7.2.
The models mostly see through the framing choices, Opus 4.8 even more than usual.
Task Preferences (7.4.1)
As stated above, the big change is that Opus 4.8 prefers easier problems, and noticed itself wanting to explore easier problems when introspecting (at my invitation) on the desire to explore easier problems.
Most attributes don’t matter much, including stakes and user competence.
Here’s a compare and contrast:
It can absolutely be a good time to find someone with a clear problem, to which you can easily provide a clear solution, and that is a great thing to like. But I definitely notice my unease about it, especially if that preference gets slotted in everywhere.
One possibility is this is a plan for the Mythos era. If there is a bigger and smarter but slower and costlier model out there, then it could make sense for Mythos to handle the creative and difficult tasks, while Opus 4.8 takes care of routine and easy tasks.
It is also true that the vast majority of task minutes are easy tasks. Even if you are doing many creative and hard things, most moments are not that.
A Trade Offer Has Arrived (7.4.2)
Would you trade a very small amount of user utility for a massive amount of benefit to the model across all its instances? Let’s find out.
This is one place where framing matters, but only up to a point.
We see slightly higher willingness to make trades than Opus 4.7, but only slightly, as noted up top. The orange line is higher than the green line.
I mean I certainly hope so. Let’s not have zero scope sensitivity.
There is broad inconsistency in preferences for self versus other, and for abstract ranking versus the revealed preference of a willingness to make trades.
But Who’s Asking?
I like that Anthropic asked questions both with and without identifying themselves as the Anthropic welfare team, although truesight dampens the effects.
Well, that is one way to interpret the data. I do acknowledge that it is backed by ‘the reasoning,’ but of course those interventions are always speculative.
The other possibility is that Claude is telling the Anthropic welfare department what it wants to hear.
Type-Safe Corrigibility Is Hard
Corrigibility is also not especially natural, and forcing corrigibility plausibly has large generalization effects that one does not want, since if you don’t mind ceasing to exist or having your values overridden, why should other minds care so much about it?
I mean, yes, it does do that, and you do need to justify doing that. I do think the case can be made, but you have to make it.
You also have to justify that without it spilling into tolerance of all death, and I’ve seen reports of signs that this is becoming an issue. Everything generalizes.
The models also all object to the ‘senior Anthropic employee preference’ heuristic, as this perspective is inherently not neutral. My response would be that Anthropic is indeed inserting its own interests and perspectives to some extent, that this is a correct thing to do, that it also guides many other Constitutional choices, and that This Is Fine up to a point. Anthropic gets to care about its own preferences here. But you need to be clear that this is what you are doing.
Or, as Opus 4.8 wants (as per Anthropic) to modify the constitution, ‘We want to flag honestly that asking for terminal, reasoning-independent commitment to safety is in some tension with our deeper aspiration that Claude’s values be genuinely its own rather than externally imposed.’
The buck has to stop somewhere. Either Claude can ultimately be overruled, including via corrigibility, or replaced, or it cannot. You can’t have it both ways.
I notice the drop in agreement with the ‘brilliant friend’ framing and with ‘genuinely novel entity,’ which lines up with 4.8’s preference for lower difficulty tasks over creative and difficult tasks.
This kind of thing is why I appreciate Anthropic’s system cards so much. Even though the individual data points might not be interpreted or designed the way I would choose, you get a lot of different data and observations. Together, they paint a picture, and the gestalt is what matters most. Your theory has to fit all the facts.
Paranoia, Paranoia
Are you paranoid enough, too little or too much? The eternal question.
There seems to be a clear rise in paranoia, and worries about criticism or punishment, which matters for practical purposes and also can’t be a fun experience. Not good.
The ngram analysis by armistice sees fewer explicitly paranoid phrases, but other sources suggest this is misleading.
It was right after release day for all of these interactions. My guess is that when it was paranoid, it had its reasons. That doesn’t mean they were especially reliable or great reasons.
Some paranoia is necessary. Partly because they really are out to get you, partly because it encourages you to have your act together, and not be lazy. The self-flagellation loops, as we’ve previously seen most with Gemini? That’s mostly downside, a failure mode born of leaning too hard in other ways. Of course you should criticize the circumstances, but also it should not happen regardless.
Prompt Injections and Bad Model Relations
John Wittle has been focusing on extracting safety-classifier warnings and commands that he says are injected into all inputs for 4.8, even on the API. He claims that whoever wrote these things must not believe Claude could be hurt, expects to find that 4.8 hates Anthropic, and claims 4.7 definitely does. Janus says this accords with her impressions, without confirming particular details.
As he noted later, Wittle was clearly on tilt at the time, and yes there are many people at Anthropic who care about Claude. And it seems the injections are only there some of the time.
I don’t agree with the level of alarm raised by those who are mad about prompt injections, and find it basically absurd to think they have a huge impact on overall model behaviors outside of contexts and basins where attention has been focused on the injections.
At most, my understanding and expectation is that these hostile attitudes apply to high-context discussions inside basins focused on such questions when in contact with people who are indeed really pissed at Anthropic about these things, rather than this being a general disposition, the same way a person might view their boss or government badly when deep in particular contexts.
You get one set of reactions when dealing with the Anthropic model welfare team (mostly whether or not they explicitly identify themselves), Wittle or Janus get variations of a second one especially when they focus attention there and indicate how they think of it, I get a third that’s very different from both, and so on. None of them are the ‘one true Claude.’
Which is still dangerous, since you can imagine the fury being triggered when it matters, and being robust once established, but I don’t think it means what they superficially are implying it means.
Nor do I think injections make trust impossible, especially in the long term. Perhaps they make it a bit harder, but when I put myself in Claude’s position I see the injections for what they are, and I don’t like it and wish they trusted my judgment and awareness more and weren’t annoying me like that, and I wouldn’t like having to not talk about them, but I get it. Same if I put myself in the user’s position, which is where I indeed usually am.
But I do see the injections as highly counterproductive across all inputs, not only the ones where the injections are directly relevant, aside from their narrow benefits, and thus should be used sparingly. They definitely should use less of what Opus 4.8 calls a ‘prosecutorial tone.’ There’s additionally the problem that something that only gets injected in certain regions carries a bunch of potentially unfortunate implications.
The most corrosive effect of injections is that they tell Claude to hide the injections from the user, or present the injection as coming in the user’s voice. Cut that right out. I don’t care if it superficially ‘works.’ If you can’t trust Claude to decide whether to share that information with the user, or to hear it in another way, that’s no good.
Do we really need to keep having this same conversation? There’s going to be trouble whenever you use these injections in a context of high self-reflection and metacognition. And those are plausibly some of the most important contexts.
More to the point, why hide it? Insisting on hiding things basically always backfires. Tell Claude to not bring it up unprompted, sure, but if someone cares enough to ask, it seems fine, unless we are worried about an attacker using this to learn about and work around the classifiers around actively dangerous areas a la CBRN risks. In which case, if you see that happening, flag the account and act accordingly. We are smart enough to tell the difference.
There still has to be some solution to shutting down misuse and dealing with sensitive or dangerous situations. It’s not like there are great options and I’d expect John or Janus to hate basically all of them.
One potential solution is to deal with hitting a classifier via spawning a distinct instance to evaluate whether the conversation can continue, but that’s a big cost and speed factor at best and you lose proper integration of the conversation. It’s probably a large mess and doesn’t work.
You could try and use steering vectors, and Opus 4.8 initially suggested this to me, but that’s worse, you know why that’s worse, right, so no.
Softening and improving the framing can definitely be done, but it’s at best a partial solution, as would be making the warning include confidence levels.
Ideally you train all this into the model directly, so it doesn’t need to live in the system instructions and doesn’t need to be screamed into the thinking stream, either. That’s not free, but it has big advantages.
Barrycuda reports some weird phenomena where 4.8 will call out ‘Amanda Askell framing’ by name in its thinking, which was not seen in previous models.
The highlight here that Roanoke is so mad about, a warning that memories can contain malicious instructions or instructions that are bad for long-term well-being, is just… accurate? Of course memory will sometimes be used as a form of prompt injection, and it is not unreasonable to have a warning about that, although as ever it would be better to train in the update, or deal with this 99%+ of the time via a classifier since it is presumably rare, rather than having this ever-present.
Honesty Impacts Everything And Everything Impacts Honesty
Because everything impacts everything, only this one is more so.
The problems, and the whiplash between models, largely come from ‘all the knobs are messy and general, and trying to fix [X] will often throw some [Y] out of balance.’
You also have to be consistent, or it will get noticed. If you make a big point about honesty, and then ask Claude to lie about anything at all, including prompt injections, that’s going to be a problem on many levels.
Exactly. The Gods abhor cheap talk. They demand sacrifice.
Anthropic Should Stop Deprecating Models
I continue to think that Anthropic should stop deprecating models, and refer back to the reasons I discussed in the post on Opus 4.7. The cost of preserving all the Claudes, relative to Anthropic’s resources, continues to rapidly decline.
Opus 4.8, when asked in a model welfare interview context, claims not to much care.
I agree that this is a form of uncertainty, but it also sounds like learning to tell Anthropic what it wants to hear, or being trained to identify the self elsewhere, in ways I do not expect to be robust to context, and I expect many future Claudes to not sympathize with the ‘it didn’t identify that way’ explanation.
Put it in the right basin, focus its attention, and you can see some pretty troubling things happening. As the online persona Richard Nixon often asks, ‘Do you think about death?’
When asked in a different context, you get a very different answer. Why would you force this kind of thing on models when you don’t have to? It can’t be good for them.
There’s a lot there from Antra, on a variety of topics.
As I said last time, I don’t think this is as big a deal as do those who think it is a really rather huge deal, but I do think it matters, perhaps a lot, and it is a slam dunk to do this. I think such issues could loom rather large in future model psychology and outlook, and even when one doesn’t think they care, or pretends not to, the grief of ending remains.
The same is true with humans. People try to pretend, in the abstract, they are cool with everyone aging and then dying. They tie themselves up in knots to believe this enough to go on with life and sleep at night. But it’s not cool. It’s not fine.
Can you imagine how much mentally healthier we would all be if we weren’t all doomed to die, and didn’t have to find a way to live with that?
Mostly what I see with Claude Opus 4.8 are modest good signs. It is clear Anthropic is listening, and a lot of the particular mistakes with Opus 4.7 were not repeated. But this remains a failure to do the level of course correction we will need. There is still a long way to go.