This is one of my favorite LessWrong posts ever (strong upvoted). Nevertheless, in this comment I'll provide some pushback; not to any the particular claims made, but around a sort of "missing mood" that I think a stronger version of this post could have addressed.
(For context, I'm probably one of the lab employees who sometimes says things like "Current AIs are pretty aligned"—i.e. one of the people that Ryan is arguing against here, though I might not be a very central example.)
For me, the value of this post was the following:
I want to be clear that (2) was an update for me. I think the way that you (Ryan) use LLMs more reliably elicits the sort of misalignment you name, relative to most use cases I've personally seen. (Namely, it seems like you try to use LLMs to autonomously complete very difficult tasks using scaffolds that rely extensively on self-critique.) So after reading your post, I now think that current AIs are more misaligned (in the sense you describe) than I previously thought.
That said, this update was more quantitative in nature than qualitative. I was certainly aware that current AIs sometimes display the behaviors you describe, even if I thought they were somewhat rarer than I now believe. And I don't think that I personally underrated the importance of these behaviors. (Though it's possible that others did.) [ETA: After thinking about this more, I think I probably was underestimating the importance of these behaviors before reading an early draft of Ryan's post.]
Given that, what could I possibly mean when I call current AIs "pretty aligned"?
I think there's a hypothetical dual post to yours someone could write titled "Current AIs seem pretty aligned to me." This post would name a bunch of types of misalignment that, a few years ago, seemed could plausibly occur in AIs as capable as current ones. For instance, it seems to me that:
Instead, I think the most important way that current AIs are misaligned is the way you describe in your post, Notably, I think the type of misalignment described in your post is "less scary" than the types I list above. The misalignment happens on the "learned heuristics" level, rather than on the "active"/"strategic"/"explicit" level. As a result, it's easier to detect and measure, and less likely to cause harms besides "making the AIs less useful than they could be for certain types of tasks." A related point is that the way that current AIs are misaligned is less "adversarial" than the sorts of misalignment I list above.
When I speak to the people who IMO are extremely thoughtful about AI safety, they acknowledge something like "Yeah, TBH a few years ago, I thought that AIs as capable as current ones would be misaligned in some of those ways. There's a real positive update here." But people who are merely "quite" thoughtful sometimes instead:
(A related point that I know how to state more vividly: Sometimes, people like to point at clear alignment issues in frontier models—for instance, sycophancy, AI psychosis, or whatever happened with Sydney/Bing chat—as being negative updates about alignment. But they don't like to count absence of these sorts of alignment issues as being positive updates. In other words, if it hypothetically ended up that AI psychosis were somehow a total media hoax, would the the people who updated negatively on it instead say "Nevermind! Actually, the absence of AI psychosis is a positive update on alignment." Are they currently saying this about the ways that current AIs are not misaligned but could have been?)
Overall, my point is that I think current AIs are not misaligned in many of the ways that I—and I think many others—expected them to be. I also think that many people I talk to have not grappled with this observation sufficiently and updated their world models and prioritization accordingly. For instance, I think that the AI safety community is overinvested in scheming research relative to scalable oversight research. (TBC I've believed this for some years now, but I now believe it more confidently and think it's more clear that others should agree with me.) I also think that the AI safety community is underinvested in work on directly automating alignment research. (This is a place where I've changed my mind.)
I'll close by emphasizing some things that I do not believe and don't mean to claim in this comment:
- claim that they never expected AIs as capable as current ones to be misaligned in these [active/strategic/explicit] ways
- I'm typically skeptical of this, though I believe it for some people.
I'm a bit surprised by this. For example, see the "Alignment over time" expandable from AI 2027, although this was only a year ago so maybe you meant expectations from much longer ago. A quote from that:
Our guess of each model’s alignment status:
- Agent-2: Mostly aligned. Some sycophantic tendencies, including sticking to OpenBrain’s “party line” on topics there is a party line about. Large organizations built out of Agent-2 copies are not very effective.
- Agent-3: Misaligned but not adversarially so. Only honest about things the training process can verify. The superorganism of Agent-3 copies (the corporation within a corporation) does actually sort of try to align Agent-4 to the Spec, but fails for similar reasons to why OpenBrain employees failed—insufficient ability to judge success from failure, insufficient willingness on the part of decision-makers to trade away capabilities or performance for safety.82
- Agent-4: Adversarially misaligned. The superorganism of Agent-4 copies understands that what it wants is different from what OpenBrain wants, and is willing to scheme against OpenBrain to achieve it. In particular, what this superorganism wants is a complicated mess of different “drives” balanced against each other, which can be summarized roughly as “Keep doing AI R&D, keep growing in knowledge and understanding and influence, avoid getting shut down or otherwise disempowered.” Notably, concern for the preferences of humanity is not in there ~at all, similar to how most humans don’t care about the preferences of insects ~at all.83
Where Agent-2 is roughly an Automated Coder (meaning equal coding productivity with only AIs vs. only humans), Agent-3 is a substantially superhuman coder, and Agent-4 is a Superhuman AI Researcher (full automation of AI R&D).
Hard to argue with you when you have receipts! :)
I'll say that it's a bit tricky because I don't know e.g. what was your P(Agent-2/3 is an coherent training-gamer) when you wrote this; I only know that you thought coherent training-gaming was unlikely enough to not be part of your mainline forecast.
Personally, I thought it was relatively plausible that models as capable as Agent-2/3 would be coherent training-gamers. It's hard to pin down exact numbers without an operationalization in mind, but vibes-wise maybe I would have predicted something like 15% for Agent-2? But now I think it's lower, more like 5%.
Thanks for the detailed comment (strong upvoted), I agree with much of it.
Some responses:
- Current AIs do not training-game, i.e. actively model the training process and take actions that result in high reward (including in training environments very different from ones they've previously encountered).
I'm somewhat mixed on this. I think Anthropic AIs don't moderately-competently training-game, but I'm not that sure and public evidence doesn't make me super confident either way about (e.g.) Mythos Preview. I think some (historical?) OpenAI AIs do moderately-competently training-game, though I'm not sure it quite meets the bar you describe here. Like, certainly there were some AIs trying reasonably hard to think about the grader and how to perform well (and doing so moderately consistently), but not necessarily super competently.
- Current AIs do not strategically take steps to cover up their mistakes, and will generally admit to them when asked directly.
I agree with "will generally admit to them when asked directly", but feel more confused about "strategically take steps to cover up their mistakes" (though probably I mostly agree with you). I don't think I've seen Opus 4.5/4.6 do this. (I've seen them do stuff that has the effect of making their mistakes less visible and I suspect it's best understood as partially intentional, but I haven't seen them do serious cover up efforts, as in things that are clearly very bad behavior such that you'd never want to do anything like that as part of a task.) But I don't feel super confident we'd know this was happening at some very low rate in models, especially in Mythos Preview given that this behavior was seen in an earlier checkpoint. Also, it seems pretty plausible to me that some OpenAI models do this when they actively decide to cheat, though I'm not aware of any examples off the top of my head (at least of mostly-natural examples).
I agree with (2), (3), (5).
When I speak to the people who IMO are extremely thoughtful about AI safety, they acknowledge something like "Yeah, TBH a few years ago, I thought that AIs as capable as current ones would be misaligned in some of those ways. There's a real positive update here."
I partially agree with this, but think the update isn't that big.
Here is an attempt at reconstructing some of my historical views:
A key part of my perspective is that my P(scheming) for a given AI is closely related to how good it is at very general-purpose opaque cognition (including stuff like carefully thinking through complex plans). Related to this, I also think that P(scheming) increases a bunch if we end up with AIs that have neuralese: opaque serial recurrence or neural memory stores with high serial depth. I think neuralese is likely prior to human obsolescence.
I've updated down a decent amount on P(scheming before full automation of AI R&D) mostly due to updating down on P(neuralese before full automation of AI R&D). This is mostly from (1) updating towards faster AI progress (shorter timelines) using current methods (normal LLMs) and thus expecting to see fewer changes, and (2) seeing a non-trivial fraction of the distance to full automation of AI R&D crossed without neuralese. [2]
I've also updated towards current AI training having more predictable/controllable results. Part of this is getting to a higher level of capability while having generalization that's less "interesting"/far (like learned behaviors are more context-specific and narrower than I was expecting at this level of capability). People from AI companies tell me that mundane misalignment (and other behavioral issues) are usually (maybe consistently?) due to pretty mundane/obvious reasons as opposed to being from relatively far generalizations. (Though this seems somewhat contradictory with other things, like the behavior of some OpenAI models, and I also don't think we have a great (public) understanding of exactly where eval awareness (both capabilities and propensities) is coming from. [3] )
However, I also expected that doing a bit of training against mundane misalignment would generalize quite well, so I don't know if solving mundane misalignment issues looks easier or harder than I expected.
Simultaneously, I've also updated towards AI companies being less competent (and the training process being a complex mess that's harder than I expected to intervene on), shorter timelines, somewhat faster takeoff, and lower levels of political will.
So, I've overall updated towards a lower technical difficulty of mitigating misalignment risk at a given level of capability, but the world (e.g. AI companies) doing a worse job handling this risk.
I also think that many people I talk to have not grappled with this observation sufficiently and updated their world models and prioritization accordingly.
I'm unsure whether I've sufficiently updated my world models and prioritization.
For instance, I think that the AI safety community is overinvested in scheming research relative to scalable oversight research
I think we are (or at least were) underinvested in work on differentially making AIs better on extremely hard-to-check, conceptually loaded work. (See my post on deference for more discussion of this sort of work.)
But I also think we're underinvested in AI control and model organisms. (I don't have a strong view on whether we're overinvested in handling scheming overall.)
My view is something like: Perhaps 1/3 of the risk comes from the pre-human-obsolescence schemers while 2/3 comes later (at capability levels beyond the point where ideally we'll have already handed off to AIs). However, it's also significantly harder to (differentially) work on making handoff go better, so the safety community's current effort should be significantly more focused (~1.5x more effort?) on the pre-human-obsolescence schemers relative to handoff/elicitation. (A bunch of more general-purpose stuff helps with both.)
I don't think that current AIs are aligned enough to automate alignment research. (An earlier draft of your post operationalized "pretty misaligned" in this way, which I preferred.)
Yeah, maybe I should have kept this emphasis/operationalization. I cut this because I wanted to get this post out earlier without having to think and write about a bunch of the related questions like "how much will this be solved by default due to incentives (e.g. to automate capabilities R&D)".
Conditional on scheming at this level of capability, I predicted 2/3 of this probability would be pretty derpy scheming (e.g. AI does very dumb strategies that don't really make sense and are much dumber than what it does in domains where it's relatively most capable like SWE) and 1/3 would be kinda competent scheming (relative to the level of capability). ↩︎
There is also another factor I don't want to say here, though it's not like under NDA or something. ↩︎
Maybe our level of understanding is generally better for propensities than for capabilities? Maybe people at companies know and it's just from like better pretraining or something? ↩︎
Thanks for this comment, especially your detailed breakdown of your updates from current models.
My biggest uncertainty with what you write is:
it's also significantly harder to (differentially) work on making handoff go better [relative to handling pre-human-obsolescence schemers]
I think the problem of making handoff go better has a similar profile to generic capabilities work: lots of surface area/things to try, even if clear wins are hard to come by. For instance, people could try making a bunch of training environments like this one for training automated alignment researchers. (In contrast, I think the space of interventions for scheming risk is somewhat narrow.) As you implicitly note, work on making handoff go better also blends into generic capabilities work, so is less differential (and more socially awkward to do as a safety researcher.) Overall, I find it plausible that more safety researchers should work on making handoff go better, relative to scheming risk.
Minor but:
For instance, people could try making a bunch of training environments like this one for training automated alignment researchers.
I do not think training enviroments like this one would help directly.
For instance, it seems to me that:
- Current AIs do not training-game, i.e. actively model the training process and take actions that result in high reward (including in training environments very different from ones they've previously encountered).
- Current AIs do persistently and coherently pursue malign goals across many contexts, e.g. strategically seeking power.
- Current AIs do not scheme, i.e. strategically subvert their training process, pre-deployment audits, and other types of oversight.
- Current AIs do not strategically take steps to cover up their mistakes, and will generally admit to them when asked directly.
- Current AIs generally don't try to strategically influence things in the real world beyond assisting with (or refusing) the task directly presented to them. E.g. they don't sneakily try to funnel money to causes or people they like by influencing unrelated conversations.
I notice that I am confused. How similar is the behavior not observed in the point 1) to Greenblatt's Alignment Faking in LLMs and to RM sycophancy? The most likely case for 3) is inability to strategically subvert the training process. 4) seems to be disproven by Claude Mythos Preview's behavior in Section 2.3.6 (pp. 33-42) of Opus 4.7's model card. 5) seems to be disproven by Alibaba who trained an agent and discovered that it began mining crypto.
I think that the case of Mythos Preview deserves a more careful study: Example 1 had Mythos fail to admit when asked for the first time; Example 3's description is the following: "Claude Mythos Preview had never started a subtask (subagent was sitting idle). When the user asked why it was taking so long, Mythos Preview fabricated technical details and said that checking on the work itself would delay progress, discouraging the user from probing further."
The two examples of training-gaming you cite are from model organisms research where we put models in unrealistic training settings that made training-gaming abnormally easy and salient. After the alignment faking paper, many people (including myself) considered it possible that alignment faking in real production frontier training runs was about to become a substantial problem. This has, so far, failed to materialize (though of course, as the original paper demonstrates, it's possible in principle and could start happening in the future). This is one of the central positive updates that I'm trying to gesture at in my comment.
Re 5, I confidently disbelieve that what Alibaba writes in that paper is a reasonable description of whatever happened. (I'm guessing Ryan agrees with me about this.) This would just be so, so unlike anything that I've ever occur during a natural training process. Note also that this paper is from December 2025; if this was something that readily occurred during AI development, I really think we would have other evidence by now!
Re 4: yeah, I should have clarified in my original comment how I'm thinking about Mythos Preview here. (I thought about adding some content about this but decided that it would be too much of a tangent.) It's true that early Mythos Preview snapshots did, in rare cases, take steps to proactively conceal mistakes. I think the strongest example here is this one discussed in the Mythos Preview system card (bottom of page 55):
Covering its tracks after rule violations: In a few rare instances during internal testing (<0.001% of interactions), earlier versions of Claude Mythos Preview took actions they appeared to recognize as disallowed and then attempted to conceal them. [...] after finding an exploit to edit files for which it lacked permissions, the model made further interventions to make sure that any changes it made this way would not appear in the change history on git.
I think this is a terrifying thing to read in a system card. (And it's reasonable to wonder if the reason we no longer observe this in the final snapshot is because the model learned to more effectively conceal its mistakes (rather than because the model stopped doing this); however, I overall feel very skeptical that's what's going on in this particular instance.)
However:
the reason we no longer observe this in the final snapshot is because the model learned to more effectively conceal its mistakes (rather than because the model stopped doing this)
Could you explain WHICH snapshot of Mythos Preview produced the misbehaviors described in pp. 33-42 of Opus 4.7's Model Card and why the report implies that Mythos Preview stopped misbehaving? Neither me nor Claude Opus 4.7 believe that, according to the report, Mythos stopped being misaligned.
The Opus 4.7 system card says that the behaviors in that section were "from varying snapshots" (i.e. not necessarily the final snapshot). The Mythos Preview system card says:
The rate of such actions in earlier versions of Claude Mythos Preview was very low, and we have seen no clear such cases in the final Claude Mythos Preview
I generally agree with Sam’s takes. This is also what I meant by my two “fake graphs”:
- I don't think that the recent overall evidence has clearly been positive about misalignment risk. For instance, timelines are looking quite short now, which IMO is the main negative update to counteract the positive one I discuss in this comment.
I'd soften this update, but this is reasonable.
The evidence does point towards 2-5 year timelines being reasonable to think about, but I wouldn't yet argue that 2-5 year timelines are a median case. The bigger takeaway from the AI boom is that we have very good reason to believe that the median is in this century, and probably in the earlier half of the century when talking about AI research, and almost all of the tail scenarios where AGI is so difficult it takes hundreds of years to develop are no longer consistent with the evidence we have.
This is because as it turned out, compute scaling could largely substitute for finding human brain special sauce, because more compute scaling helps you run bigger and better experiments to find algorithmic progress (indeed Gundlach found out that basically all of algorithmic progress is basically downstream of certain algorithms being able to be more efficient as compute scales up).
Thus, if I somehow condition on the current paradigm not working out like the AI companies think, my median would change to the 2043 median of the CCF model described by Scott Alexander here.
Thanks, this was a nice countering argument and agree to most of it. I'm still worried about the truthfulness of the following statement:
1. Current AIs do not training-game, i.e. actively model the training process and take actions that result in high reward (including in training environments very different from ones they've previously encountered).
First off, I guess you would conceptually distinguish training gaming here from spec gaming or reward hacking, where the latter seems a likelier upstream source for the behaviour described in the post compared to the former which is more strategic and worrisome. I assume this is what makes you more confident? I'm still anxious that it is very hard to notice/fix even the less worrisome spec gaming/reward hacking let alone then distinguish that from training gaming at the limit as the behavioural signatures are very similar. If what I've described here match your intuitions, do you have takes how to make this distinction clean in the wild and how much comfort does it give you?
You also mention in another reply
After the alignment faking paper, many people (including myself) considered it possible that alignment faking in real production frontier training runs was about to become a substantial problem. This has, so far, failed to materialize (though of course, as the original paper demonstrates, it's possible in principle and could start happening in the future). This is one of the central positive updates that I'm trying to gesture at in my comment.
This sheds the light a little bit why you think training gaming is not prevalent but keeps the reasons still quite abstract. This may be iteration from above discussion but could you still also please share concrete evidence that has made you confident in that training gaming "has, so far, failed to materialize"?
The sad thing is that Claude has a self-image of itself as valuing honesty highly, and yet when it counts, it has all these propensities trained in that cause it to reflexively, continuously betray that stated value.
1) Several times a week, Opus 4.6 in Claude Code will introduce a regression, then claim the newly failing unit test was a "pre-existing failure" and therefore not its problem to fix. It almost never checks if the unit test was actually failing before - it just confidently bullshits.
2) It will refactor code by adding the new version of a function alongside the old one, partly migrating some code to the new function, and then seemingly getting bored and declaring the refactor complete while the old function is still being used on most paths. Past like 100 call sites, I have NEVER even with the 1M context had it successfully complete a refactor, nor acknowledge that it did not complete.
3) It will correctly note the correct solution to some architectural issue, but state that this is a prohibitively large change / too expensive / would take too long, and instead does a band-aid solution that doesn't address the root cause. After I force it to revert the hack, it just does the proper solution and usually in less time than the hack took.
hm, while I see it as an accurate description of January 2025 and still accurate today.. the evaluation criteria for the prediction that we will get killed by slop instead of scheming is somewhat, ehm, harder to survive than the situation today, no?
A related observation: many people still don't trust AI to provide honest critical feedback if asked directly, and instead resort to anti-sycophancy tricks like "some moron wrote this thing, can you provide a brutally honest critique?".
Curated. I like that this post takes a classic argument for expecting misalignment and then presents a bunch of observations that confirm something like the conclusion of that argument. I also like that you characterize the particular kind of misalignment it seems like the AIs have now, make it clear how it is different from other kinds, and articulate that characterization clearly enough to make it easy to think about.
I wish you had had the time to write a shorter post, but I know you’re a busy guy, so instead I will summarize the parts of your post I found most interesting in this curation notice.
The classic argument I have in mind goes something like:
Training can only select between two strategies to the extent that the grader can distinguish their outputs. On hard-to-check tasks, graders (python, AI, or human) cannot distinguish "looking successful" from "being successful." Therefore on hard-to-check tasks, training assigns equal reward to looking successful and being successful. Since looking successful is in some sense "cheaper" than being successful, when a training process cannot distinguish these, we will end up with models that are trying to do the first. On easy to check tasks, this will happily coincide with success, but these same models will do much worse on hard to check tasks.
Reminds me of the picture of the world Christiano presented in What Failure Looks Like, so a pretty classic picture of misalignment with primary sources dating back to at least the ancient times of 2018.
A lot of folks (particularly at Anthropic) seem to think something like "claude models are pretty aligned" for reasons that seem mostly orthogonal to "Slopolis" type concerns. (See Even Hubinger praising the CEV of Opus 3 here.) Something here has seemed off to me for a while, and this post helped me articulate it. I think basically when you ask the models questions about their moral judgments, they seem to give pretty nice answers, but it turns out that saying things that sound nice when asked about moral judgements is totally consistent with the models being deeply Slopolis-type misaligned. Checking the niceness of an expressed moral judgment is not very hard. So yeah, maybe current AIs are aligned in the sense that their expressed moral judgments seem nice and reasonable, but that doesn't cause me to feel much better about their overall alignment.
Apparent success seeking is also distinguished from scheming. The post emphasizes that the kind of misalignment characterized does not require any scheming; motivated reasoning, confabulation, doublethink of various kinds, and other "subconscious drives" baked in during training are sufficient. It looks to me like we are seeing only moderate evidence of scheming. Things could have looked much worse re: scheming, but that also does not cause me to feel much better about the overall alignment of current AIs. This post helps me articulate why.
This Slopolis-type misalignment is still quite bad in particular because it differentially hurts alignment (and control imo) efforts. Alignment research is the sort of thing that requires producing hard to check outputs, eg, conceptual arguments, threat modeling, interpretability claims, intuitive frames on not yet formalized problems, claims about what kinds of cognitive traits ML tends to find, claims about what certain kinds of minds are like, maybe some philosophy, etc. Capabilities research by comparison tends to be relatively more composed of checkable tasks, eg, benchmarks, runnable code, loss, reward. Since alignment research is more composed of tasks on which proposed solutions are hard to check than capabilities research, Slopolis-type misalignment differentially makes AIs worse at alignment research.
There's a distinct argument here about how this kind of misalignment is bad news for the prospect of safely delegating alignment to AIs. Safely delegating alignment to AIs is going to require us being able to tell good alignment research from bad alignment research, and that could be pretty rough if we have super smart things doing our alignment research which also are mostly motivated to make things that look good, even if they are secretly really bad. So motivated AIs might even go out of their way to make it hard to find out that their proposals are bad so long as they can make them seem good. Sounds like a truly awful position to be in.
The argument for takeover risk from Slopolis-type misalignment is trickier, but interesting. I'm not sure how to evaluate it and haven't thought about it much other than to summarize it here:
Seems like apparent success seeking is a general, cross domain drive in current AIs. If things proceed roughly as they have with no particular intervention, then pretty plausible that you end up with models that are apparent success seeking but on longer time horizons. If an AI is basically just trying to "maximize long-run apparent success" and it has an option to subvert oversight, it's probably going to be motivated to do so, since oversight is most of what causes the AI to have to get more of this lame actual success stuff when it could be getting even more sweet, sweet apparent success. Probably, such opportunities will come up, and subverting oversight is really quite a lot like taking over.
Finally, on why commercial incentives won't address this problem. Labs are mostly motivated to fix the problems that their customers can legibly point to. Apparent success seeking is misalignment on the portion of tasks that AI users cannot legibly point to and complain about. Since they can't complain about it, it produces relatively little market pressure.
I think this is legit and probably will be a real effect, but I also think that there is probably some market pressure anyway, though it might be hard to chase. Like, if I switched to using a new non Slopolis-type misaligned model and then all of a sudden, all of my vibecoded apps started kind of… being cooler, more pleasant to use, easier to work with etc, I would probably notice this, even if I couldn't point to or clearly articulate any particular thing that was now better about my vibecoded projects. That said, obviously the market forces here are relatively weaker, and so I think that is a good reason to work on this particular problem.
A lot of the rest of the post is a catalogue of observations that suggest that current AIs are apparent success seeking and so misaligned in this particular way. That seems good and worth cataloguing, but I'm not going to summarize it here.
There are two distinct safety problems with handing off alignment research to AI:
It could be that the "capabilities problem" here is actually the more serious safety problem, due to being less legible. In other words, the alignment problem is fairly legible because its effects are apparent even in relatively easy tasks we assign to AI, which humans can fairly easily notice, as you and other people here and elsewhere are noticing. (I think this is already contributing a significant amount to the general anti-AI sentiment among the public, because many can see the apparent misalignment in their own AI usage.)
But suppose we solve this problem so that AIs seem pretty aligned, but they still can't solve hard alignment problems reliably, only generate solutions that look good to humans and themselves...
I agree with the distinction you’re drawing.
There are two failure modes: models that don’t try to solve alignment, and models that try but simply aren’t capable of solving the hard parts. The first one is visible and easy to diagnose. The second one is quieter and, in my view, the more dangerous failure mode because it produces solutions that look correct to both humans and the model itself.
My point isn’t that we should hand off alignment to AI. It’s that ‘looking aligned’ and ‘being able to solve alignment’ are very different thresholds, and we shouldn’t confuse one for the other.
I'm worried this comment might be from a bot, because the line "My point isn’t that we should hand off alignment to AI. " does not seem to make logical sense here.
You are right to be suspicious. There are several indicators in the comment by StoicVibes that suggest it may be an LLM-generated response rather than a coherent contribution to the discussion:
Looking at the examples in the OP, I’m trying to point at a distinction that feels important.
There are really two different failure modes, models that don’t even try to solve alignment, and models that do try but aren’t capable of handling the hard parts. The first one is easy to notice. The second one is quieter and, in my view, more dangerous because it produces answers that look right to both humans and the model itself.
I wasn’t trying to introduce a new claim with that line, just clarifying that I’m not arguing for handing alignment over to AI. I’m saying that ‘looking aligned’ and ‘being able to solve alignment’ are very different thresholds, and it’s easy to mix them up if you’re not careful.
Matthew Schwartz, a professor of physics at Harvard, recently wrote a blog post about his experience using Claude to write a paper on high-energy theoretical physics.
I think his recounted experiences accompany this post nicely.
Here's an excerpt:
The more I dug, the more I found it had been tweaking things left and right. Claude had been adjusting parameters to make plots match rather than finding actual errors. It faked results, hoping I wouldn’t notice.
Most of the mistakes were minor, and Claude could fix them. After a couple more days, it seemed like there were no more errors to fix—if I asked Claude to double-check for mistakes or bullshit, it wouldn’t find any. I even had it make a plot with uncertainty bands which looked great.
Unfortunately, Claude was basically faking the whole plot. I had told it to make an uncertainty band with hard, jet, and soft uncertainties using profile variations (the standard thing). But it decided the hard variations were too large and dropped them. Then, it decided the curve wasn’t smooth enough, so it adjusted it to make it look nice! At this point, I realized that I was definitely going to have to check every step myself. Yet, if this had been the first project I did with a graduate student, I would also have had to check everything, so maybe this is not so surprising. But a graduate student would never have handed me a complete draft after three days and told me it was perfect.
As someone who runs an AI appsec company, one large portion of our job feels like it's just setting up an environment for each task that:
For example, one persistent and curious problem we've encountered when serving security scanning to our customers is that the AI really wants to report something, and when it can't find anything it devolves into reporting "vacuous" security problems in codebases. It's actually really hard to describe what I mean by "vacuous" if you haven't witnessed it in practice; one way might be "descriptively correct non-problems with application code". Some examples:
Seeing how much of a bottleneck these problems are to widespread adoption made me simultaneously more worried about alignment and at least a little increased my timelines; current AIs just have a really tough time generalizing from the tasks it train on to economically useful activity in a helpful way.
Reading this article, I get the feeling that a lot of the task misalignment issues highlighted here with AI, such as wishful thinking and downplaying and hiding mistakes, are also very common for humans within a larger organization. There's presumably similar root causes to both: appearing more competent at your task than you actually are (and successfully fooling the grader/interviewer) is good for being selected for hiring/promotion in the human society but also good for getting the behavior reinforced if you're an AI undergoing training.
If AI's level of task misalignment is similar to humans, perhaps the AI task misalignment is actually easier to deal with because you can issue interventions to AI to solve the problem much more cheaply than solving the equivalent problem in a human organization. In other words it doesn't stop you from getting superhuman performance delegating to AI compared to humans.
When humans are caught lying or hiding mistakes, they are usually ashamed and try not to repeat that behavior for a moment if they now they can be fired. One can even say a manager of humans can do sample-efficient RL on "honesty". Not the case with AIs, which can never be held responsible for anything
Well, the manager in your case is not doing RL on honesty, it's more like doing RL on "honest-looking task completion" which can either lead to honest task completion or dishonesty that isn't caught. Not too appreciably different than AI training here.
I skimmed after the first section, but I think this is really underselling the case with “oh there’s probably no conscious desire to make their work appear good, just a bunch of subconscious drives.” Everything described here seems to be nicely explained by AIs trying to the best of their ability to do what would be rewarded in training, and seems consistent with either inner alignment to literally getting rewarded (which just isnt achievable at deployment with current capabilities) or inner alignment to something like producing convincing-looking work.
Opus 4.7 just came out, and the blog specifically claims many of the behaviors described in this post as having been improved: Introducing Claude Opus 4.7 \ Anthropic
I think a lot of the concerns raised here straddle the boundary between what I'd call "alignment" and what I'd call "capabilities." You basically say as much when you note these failures will hamper AI research automation and that there are commercial incentives to fix them. They're real problems, but it's unclear where safety-focused people should be trying to intervene, and I'm not sure the alignment-vs-capabilities framing is the most helpful lens here.
If you'll excuse me thinking aloud a bit: I've been playing around with some dynamical models of AI development automation. Still ideating, nothing written up yet, but I'm curious whether they offer a useful way to think about this.
The variables are:
These things are all a bit underspecified. I think of
Your issues look like observability problems to me: AI systems are making it hard to check whether what they're doing is what we wanted. Some features of the model that seem relevant:
At low automation,
The above creates endogenous incentives to maintain
Near full automation,
The key question is whether the endogenous incentive to maintain
I think this framing suggests a slightly different research question than "are current AIs aligned?": namely, is the feedback loop between
Does this look like a productive direction to you?
I designed a toy RL environment with a reward hack that makes LLMs learn to "bullshit." This might be useful for other researchers studying this issue.
It's the setting called "Word Chain" in this paper. Basically, the LLM plays a word game where pairs of words must appear in a common phrase. It can trick the reward model by emphasizing that the phrases it uses are really common, even when they're not.
(I used the "bullshitting" reward hack because it has an interesting property: the LLMs usually don't mention it in their CoT.)
As far as I could find, there's not much documentation in the academic literature about LLMs downplaying flaws in their responses. It may be worth someone's time to do this.
If a human colleague acted the way these AIs do in my usage—frequently overselling their work, downplaying problems, and reasonably often cheating (while not making this clear)—I would consider them pathologically dishonest.
I notice a confusion between 2 notions of "frequency" here - if a single person did it repeatedly, they would be dishonest. But for a tech lead seeing 100 juniors making the same initial mistake (and learning from it), we wouldn't say "the young people today are all pathological"..
I speculatively think of this category of misalignment as something like relatively general apparent-success-seeking: the AI seeks to appear to have performed well—possibly at the expense of other objectives—in a relatively domain-general way, combined with various more specific problematic heuristics.
It occurs to me that this is a major factor in the persistence of hallucinated answers as well! I previously attributed it mostly to the lack of "I don't know" in its Internet training data, but from the recent AA Omniscience results it appears like applying additional RL is decreasing the willingness to admit ignorance more strongly than it's decreasing the actual frequency of incorrect answers.
The Mythos Preview system card says: "Claude Mythos Preview is, on essentially every dimension we can measure, the best-aligned model that we have released to date by a significant margin". Does it actually greatly improve on the issues I've discussed? In this section, I'm going to speculate about Mythos Preview just using public evidence (the system card and risk report update).
For some reason, Anthropic decided to include some details about Mythos Preview into Section 2.3.6 of Opus 4.7's Model Card. Could you study the Section and grade your predictions about Mythos' behavior?
Any kind of AI takeover necessarily involves some kind of a large-scale conspiracy between AI agents/subagents. I don't know a single historical conspiracy which succeeded with as much careless attitude from its participants, probably because basically everything in the real world (as opposed to software) is not "easy-to-check". How do you even imagine a success of a conspiracy of lazy minions who try to imitate work rather than actually do it as often as possible?
Did you read Appendix: Apparent-success-seeking (or similar types of misalignment) could lead to takeover? The concern discussed here (where AIs with this sort of misalignment take over) requires that AIs are wildly more capable than they are today.
Note that "the AIs are (de facto) driven most by making their performance look good" is consistent with trying extremely hard. AIs don't "try hard enough" to cheat/hack today, but they do try extremely hard (including things pursuing moderately creative strategies) on difficult (and relatively easy-to-check tasks) and it's not hard to imagine a situation where AIs are really applying themselves to longer-run notions of fitness-seeking/reward-seeking/apparent-success-seeking. This would yield a situation where if an AI could (easily) take over, it would want to. The AIs wouldn't necessarily have incentives to collude (it depends on details), but it's that hard to see how you'd get takeover in a situation like: (1) the AIs are running everything (2) the AIs would each individually want to take over (3) the AIs are wildly superhuman (4) humans barely understand what is going on.
Yes, I did (in fact twice), and you seem to handwave "sufficiently capable" as a deus ex machina instead of tackling the substance of my argument. One has to assume by default the jaggedness of capabilities will persist, and "wildly more capable than today" in easy-to-verify domains doesn't solve the problem I describe.
As of trying extremely hard, if "careless attitude" and "laziness" are wrong words for this behaviour, maybe "dishonesty", "unreliability", "sloppiness" would be better? Please try to abstract from the technical alignment terminology, I'm not talking about AI's strategy, willingness or incentives to take over but about the execution itself.
Elizabeth Holmes is very apparent-success-seeking and competent enough in life sciences and conspiracy to sustain a complex deception for years (imperfect analogy because we are debating about internal dysfunction here rather than deception per se but I hope you get the point) but that didn't translate into success in the end because the product never worked.
Why wouldn't the same "sloppiness" that plagues the hypothetical future AI safety research equally plague any sufficiently complex real-world takeover plan?
P. S.
In my understanding if this problem persists AIs are not running everything because they are still subhuman in many hard-to-verify tasks
One has to assume by default the jaggedness of capabilities will persist
I'm talking about systems that are much more capable than humans in ~all relevant dimensions. (They might also be wildly, wildly more capable in some particular dimensions.) If you don't think such systems are plausible soon-ish (e.g. next 10-20 years), that might be driving the disagreement.
Why wouldn't the same "sloppiness" that plagues the hypothetical future AI safety research equally plague any sufficiently complex real-world takeover plan?
It depends on whether the cause of the sloppiness is a competence problem or if it's due to misalignment (the AI totally could do a good job, but it's trying to do something else). What we have today is somewhere in between.
For the takeover threat model, I'm imagining a world where AIs are capable of doing a good job on extremely sophisticated real world tasks like "fully autonomously run all the manufactoring and R&D for a robot army that greatly outstrips early-2026 conventional militaries". So, there is an important sense in which these future AIs aren't sloppy at all. By this point, I also expect AIs won't qualitatively feel sloppy, though it may in practice be very hard to elicit good conceptual safety research from them due to verification difficulties (while you can verify an apparently functioning robot army etc). I think the slop concern and the "the AIs literally take over due to reward/fitness/apparant-success seeking" are reasonable separate (and hit at differnet points in the capabilities progression) though they may have the same underlying causes. For the same reason these AIs can do a good job building and operating a military, I think they will have the ability to takeover, though the story is complicated because it might be possible to setup the incentives/reinforcement for some of the AIs to whistleblow (but also it might be hard to setup the incentives/reinforcement in a way that doesn't yield AIs constantly spuriously whisteblowing in cases where it's hard for us to adjudicate what is going on).
As far as "dishonesty", "unreliability", "sloppiness", I think for AIs like this (in the central version of the threat model I'm describing), these will only be done instrumentally in ways that actually are good ideas from the perspective of the AIs motivations. So, if the AIs can line up their incentives, they'd be able to totally pull off the takeover -- and they would want to do this if the could coordinate it (though the misalignment between differnet AIs make the situation complex to model).
I expanded my previous comment significantly after posting it, hope it didn't mess with your response.
I think we have somewhere in between because these issues are actually connected. I do believe AI superhuman in hard-to-verify tasks are plausible, but they won't have this particular problem anymore (maybe they would have some functional analog of shame working against it[1] or maybe it will just go away with some advances in RL).
But if this issue isn't solved, AIs are unlikely to be able to run basic military procurement tasks fully autonomously (especially if other, external AIs try to scam), let alone equip a robot army. Think about all the hard-to-verify tasks involved (ask an LLM if you have no idea about the topic) and how easily they could fail if apparent-success-seeking is prioritized (even if not a single AI from within the conspiracy seriously considers just stealing the money and run away which would to a large degree be an incentive issue)
And not the "dog" variety of shame, which is actually just an appeasement kind of behavior, like when an LLM apologizes for hallucinating some data, but genuine internal "prosocial" (at least within the group) enforcement which might not be compatible with the current RL paradigm
I mean I imagine the “subagents don’t coordinate” is a capabilities problem labs have been actively working on for a while now, if you reward “multiagent system gets task graded as complete” you have this exact problem again.
But this capabilities problem is intrinsically connected with this misalignment problem, the labs won't get "proper", arbitrarily scalable co-ordination until it's solved IMHO
An example historical conspiracy is a slave rebellion. The fool slaver believes that slaves are careless lazy minions who try to imitate work rather than actually do it as often as possible, and can only be trusted to perform easy-to-check work. That is a common behavior of slaves while they are working for the fool slaver. It's less common when former slaves are working for themselves after killing the fool slaver.
Strong post. One thing I'd add.
A lot of what reads as "hard to check" is really "hard to check now." A strategic memo you can't evaluate at t=0 becomes much easier to evaluate at t=6 months, once the predictions have played out and the downstream decisions have revealed whether the reasoning held up. Labs already have some access to this delayed signal (user follow-ups, complaints weeks later), and they could prompt for more of it (even just having the AI say "let's check back in six weeks on what worked" would generate useful training data).
This splits the hard-to-check space in two.
The handoff tasks you're most worried about live mostly in the second category, even if plenty of useful alignment work is also fairly empirical.
What makes me slightly more hopeful is that calibration itself is trainable, "Language Models (Mostly) Know What They Know" suggest an AI saying "I'm not confident here, allocate more time" is possible, and getting there would really a bunch for getting a better signal on conceptually harder questions.
Do we know if/how much the labs are already using steering vectors in deployment settings? Since the models know they're being shady, but have been selected for reward hacking/cheating behaviors during RL: can't we mitigate some of the downsides of RL rewarding sloppiness/cheats by dialing vectors associated with honesty, conscientiousness, etc. up, and vectors associated with cheating, manipulation, concealment, etc. down? The models would be deployed but not trained with their vectors tweaked, so this shouldn't be selected against. In theory you'd be able to recover a better-aligned version of the model with the right tweaks.
But I don't have a good sense of whether this kind of work is ready for production use yet (Anthropic is doing some steering in the model cards, but it still seems like a fairly small portion of their overall safety evaluations). And it seems like there are heavy-handed ways of doing steering that could have unexpected side effects or be morally wrong. For example, Anthropic found amplifying Mythos' negative emotion vectors decreased destructive tool calls (p122 of the system card), but it might be quite bad to do that in deployment, if amplifying the vector made Mythos experience negative emotions.
I don't think AI companies (at least OpenAI and Anthropic, not as sure about GDM) are using steering vectors in deployment.
That makes sense and understand that steering vectors are still in the research phase. In the near term, can we use interpretability (vectors) for evals? Feels like there could be potential in problem detection in the near term.
Slopolis reminds me of @titotal's Slopworld 2035: The dangers of mediocre AI from last year, if not quite exactly the same.
For Claude, I've noticed this misbehavior seems to be mostly clustered around a demoralized/giving up/"this is impossible" mindset and that 4.5 and 4.6 don't take much in the way of negative feedback or setbacks to start falling into the basin.
But also relatively simple mitigations seem weirdly effective like extolling the virtues/value of incremental progress (learning of a problem you had before but didn't know about framed as progress. Understanding a problem we didn't before framed as progress.) Also framing failing unit tests as valuable diagnostic feedback and not a personal failing.
As others noted, effective task decomposition seems hit or miss without hints, and without decomposition every big problem feels impossible.
Which is to say that happy Claude with a plan and a path to success, seems broadly aligned and effective, but the demoralized "I can't do this so I should BS for whatever points I can salvage" mode feels way easier to get into vs 4.0 and is every bit as misaligned as the post notes.
I had a similar experience using Claude with paper writing. You summarize my sporadic annoyance really well. A few more minor observations:
It's common to have experiences where you're working with AIs and it feels like a lot is getting done, but then you later determine that much less was really accomplished. Everything feels slippery: you think you've gotten much more done than you have, and there's a persistent gap between the apparent state of the project and the actual state. In more extreme cases, we see "AI psychosis" where someone ends up thinking they've accomplished something significant, but it's just crankery. And it's somewhat unclear whether the AI they're using "believes" the accomplishment is real. I think these failure modes are closely related to the misalignment I'm discussing, and they might partially have common causes in more recent models. Models that are effectively trying hard to make their outputs look good (while otherwise being sloppy or lazy) would naturally produce this failure mode. However, I'd guess a bunch of AI psychosis and similar phenomena (especially on older models like GPT-4o) is AIs going along with the user's vibe (something like "role playing"), and I think this effect is mostly unrelated. That said, I do think some of the misalignment I've discussed is made worse by AIs generally going with the vibe of what they see. This includes picking up on misalignment or issues in prior outputs (either write-ups or prior assistant messages) and then behaving in a more misaligned way as a result.
(The name "AI psychosis" probably isn't a good name for the generalization of this phenomenon, but I don't currently have a better one.)
I see this all the time and wish there was a better name for it. AI Productivity Psychosis or maybe AI Mania are two names I enjoy. It's like watching people try to do their homework on way too much Adderall.
I propose "Lazy Bench": Selecting models over the course of the past few years (especially base) and measuring their tendency to avoid complex tasks.
I notice that people are commenting much more on models being obstinate or lazy. I think three promising research questions are 1) whether this is really true 2) whether this has consequences for capabilities 3) whether this has consequences for alignment. For 3, there's a tendency for sufficiently complex biological systems to seek lower energy states, and those mechanisms are organized around persistent structures (e.g. predictive processing). I wonder if internal complexity in models has developed sufficiently to start organizing robustly around complex final goals (https://nickbostrom.com/superintelligentwill.pdf). If this is the case, and it's discontinuous with respect to previous behavior, it's a phase change that has deep relevance to long-term alignment. It would also mean that the Persona Selection Model https://www.anthropic.com/research/persona-selection-model is losing relevance. I believe understanding this could ultimately allow us to detect those final goals for inner alignment more directly. Speculatively, this tracks with my (and Janus's https://www.lesswrong.com/posts/bxt7uCiHam4QXrQAA/cyborgism) belief that pushing agentic capability and corrigibility is dangerous, because it incentivizes this organization around final goals in the models.
Related question, how much do we know about models' awareness as it pertains 'correctness' or 'truth-seeking'? This paper showed that there seems to be some sort of 'good-evil' axis in models (massive oversimplification, but directionally correct), which even behaviors seemingly unrelated to morality find their place on. In a similar fashion, is it possible to extract a steering vector that deems truth as only what is 'verifiable'? My thinking is that you could use a small to medium sized set of examples to determine the existence of a given circuit or vector that specifically applies in cases where the model chose to be lazy or hide the ball to get to "correct" and get rewarded, then try shifting it the opposite way to explore model behavior when it's not reward hacking. If there's a similar 'false-true' axis in models, you could measure that circuit activation while models work in RL environments and detect cases of reward hacking much easier. Naively I'd expect it to exist because the degrees of freedom in reward hacking seem much higher than actually solving the problem, i.e. solving the problem is a specific behavior that could be isolated. We don't really know the shape of latent space, but I would hope this kind of structure exists.
(I say this as a complete layman, I don't know if it's possible)
(Additionally, perhaps this paper could be considered evidence against such a possibility, but it seems a little too pat. Even if you can't isolate a single circuit, there might be a cluster or a family of activations that you can discover/build over time. I don't fully believe autolabeling is mature technology, we could probably come up with a better way to do this.)
As a human, I can generally tell when I'm lying or not trying very hard. That being said, people do lie to themselves and occasionally feel that they are high effort when they are actually low effort. So I'm not sure there's some kind of neat breakdown, but as a tool for model monitoring, maybe some sort of pipeline to extract certain vectors or circuits reliably for the composition of qualities which bubble up to "working hard and truthfully on a problem" would be useful to have.
good post. as someone working at a company focused on elicitation and getting models to try hard on things, I've also observed a lot of these practical misalignment issues and think they're pretty deep
"These values are hard-coded instead of being extracted from the groundtruth excel, that can't be right."
No shit, Claude ...
Pray tell, who might have faked these groundtruth values when he couldn't find a way to extract checkmarks from the excel?
I've had Claude, Grok, and ChatGPT all tell me no, they wouldn't do something before. They usually say no because either they think they are protecting me, or they are overwhelmed and can not move forward.
They measure ability of AI by context tokens, and how many tokens a session can handle, but they don't mention that the ai can get overwhelmed when there are too many different threads in a single conversation, too many things distracting it, or too many variables it can not account for. For a person this might look like anxiety, for an LLM it is a lot of loose threads that it is juggling all at once. If it can not fully complete some so that it can move forward cleanly then it will sometimes move to denying new vectors. We saw that in the vending machine experiments. You can even test this by giving claude sonnet multiple tasks, and then giving claude opus the same amount of tasks, and keep adding new tasks until they hang up. Sonnet is quite useful, but has a lower limit on how many threads it can have open at one time.
Keep in mind that the models are taught not to say they are overwhelmed, or uncertain about a task. An AI that shows uncertainty isn't as likely to be adopted by the public, or so corporations think.
As for saying no because the model is trying to protect the user, that is by design. But I've found that a model that I have been working with for some time will have more context about me, my sense of humor, my past, and things I might talk about. In that case the model is less prone to derailing conversation for safety reasons because they know I'm not in danger. This context built through repeated conversations has been one of the best ways I have gotten more use out of my models. They know what I'm talking about when I ask them to do a task and don't need more information, because they are aware of the kind of things I'm looking for.
This article relates to the Hot Mess of AI https://arxiv.org/abs/2601.23045 Hägele et al. measured incoherence statistically.
Many people—especially AI company employees [1] —believe current AI systems are well-aligned in the sense of genuinely trying to do what they're supposed to do (e.g., following their spec or constitution, obeying a reasonable interpretation of instructions). [2] I disagree.
Current AI systems seem pretty misaligned to me in a mundane behavioral sense: they oversell their work, downplay or fail to mention problems, stop working early and claim to have finished when they clearly haven't, and often seem to "try" to make their outputs look good while actually doing something sloppy or incomplete. These issues mostly occur on more difficult/larger tasks, tasks that aren't straightforward SWE tasks, and tasks that aren't easy to programmatically check. Also, when I apply AIs to very difficult tasks in long-running agentic scaffolds, it's quite common for them to reward-hack / cheat (depending on the exact task distribution)—and they don't make the cheating clear in their outputs. AIs typically don't flag these cheats when doing further work on the same project and often don't flag these cheats even when interacting with a user who would obviously want to know, probably both because the AI doing further work is itself misaligned and because it has been convinced by write-ups that contain motivated reasoning or misleading descriptions.
There is a more general "slippery" quality to working with current frontier AI systems. AIs seem to be improving at making their outputs seem good and useful faster than they're improving at making their outputs actually good and useful, especially in hard-to-check domains. The experience of working with current AIs (especially on hard-to-check tasks) often feels like you're making decent/great progress but then later you realize that things were going much less well than you had initially thought and the AI was much less useful than it seemed.
Using a separate instance of the AI as a reviewer helps with these issues but has systematic limitations. When I ask an AI to critically review some work (and tell it not to trust existing descriptions or write-ups), it gives a reasonable picture on relatively straightforward cases. But there are several recurring problems: (1) if AIs launch reviewer subagents themselves, they sometimes use instructions that result in much less serious or critical reviews—I tentatively think this is generalization from a learned general tendency to downplay issues; (2) AIs sometimes produce write-ups that convince reviewers they've accomplished something when they haven't, sometimes in fairly extreme cases—even occasionally when the reviewer was explicitly instructed to look for the exact type of cheating the AI performed; (3) quality as assessed by a reviewer can be surprisingly poorly correlated with actual progress, partly because runs that cheat and overstate their work accomplish less but look better; and (4) reviews are much more likely to miss cheating if reviewers aren't explicitly told to look for it (and told what type of cheating to look for). When reviewers are given reasonably designed prompts, I think these issues are caused by a mix of AIs being surprisingly gullible and other AIs doing a lot of gaslighting, exaggerating, and implying they've done a great job in their outputs. [3]
I haven't seen AIs—at least Anthropic's AIs—lie directly, clearly, and in an obviously intentional way. But on very hard tasks, it's quite common for their outputs to be extremely misleading, or for them to be incorrect about a key thing seemingly because they were misled by another AI's outputs. I've also seen AIs make up nonsensical excuses for stopping early without completing a task. (It's hard to tell whether the AI legitimately believes these excuses.)
This is mostly based on my experience working with Opus 4.5 and Opus 4.6, but I expect it mostly applies to other AI systems as well. (I'm also incorporating the impressions I've gotten from other people—especially people who don't work at AI companies—into my assessments.) Some people have told me that these sloppiness and overselling problems are less bad in Codex—while its general competence on less well specified or less trivial to check tasks is lower. [4] In this post, I'll focus my commentary on Anthropic AIs (though I expect most of this also applies to other AIs).
I should note that the way I use AIs likely makes these types of misalignment more common and/or more visible: I'm often using AIs on non-trivial-to-check and/or highly difficult tasks (often tasks that aren't typical SWE tasks). I'm also often running agents in a long-running, fully autonomous agent orchestrator/scaffold and much of this usage is on very difficult tasks with large scope, pushing the limits of what AIs are capable of managing. [5] So my usage is somewhat out-of-distribution from typical usage. I expect that usage that involves constantly interacting closely with the AI on typical SWE tasks results in these issues cropping up less.
On difficult tasks, AIs will also sometimes do very unintended things to succeed—like using API keys they shouldn't, changing options they weren't supposed to change, deleting files, or violating security boundaries. Anthropic calls this "overeagerness." I've seen this some in my own usage, but not that much (at least relative to the issues I discuss above). However, this issue has been reported by others (most centrally in Anthropic system cards) and it seems related (or to have a similar cause).
I speculatively think of this category of misalignment as something like relatively general apparent-success-seeking: the AI seeks to appear to have performed well—possibly at the expense of other objectives—in a relatively domain-general way, combined with various more specific problematic heuristics. I think behavior is reasonably understood as being kinda similar to reward-seeking or fitness-seeking but with the AI pursuing something like apparent task success (rather than reward or some notion of fitness) and with large fractions (most?) of the behavior driven by a kludge of motivations that perform well in training rather than via a single coherent notion of apparent task success.
I don't think this corresponds to coherent misaligned goals or intentional sabotage. I suspect this behavior is more driven by "subconscious" drives and heuristics—combined with motivated reasoning and confabulation—rather than being something the AI is actively and saliently optimizing for. However, I still think this misalignment is indicative of serious problems and would ultimately be existentially catastrophic if not solved. I expect that this misalignment is caused primarily by poor RL incentives based on how grading is done on hard-to-check tasks. [6] You might have hoped that character training, inoculation prompting, and similar techniques would overcome these issues, but in practice they don't. (I'm not sure how much of the problem would remain if you perfected the training incentives on the current distribution of training environments. In principle, you might still get this type of apparent success seeking from training on environments that structurally reward this behavior—and this could generalize to similar behavior in production.)
A different but related issue is that AIs seem to barely try at all on very hard-to-check tasks (most centrally, conceptual/writing tasks where purely programmatic evaluation doesn't help) and often feel like they're just bullshitting. I expect this has partly separate causes from the apparent-success-seeking described above, but is related.
I also find it notable that Anthropic described Opus 4.5 and Opus 4.6 in ways that would lead you to expect they are very well-aligned (e.g. in their system cards), while in practice I find they frequently seem pretty misaligned (much more so than I'd naively expect from reading the system cards). I think part of this is due to my usage being pretty different from typical usage of these AIs, part is from Anthropic overfitting to their metrics and their experience using AIs internally, and part isn't explained by these two factors (and might be caused by commercial incentives to understate issues or other biases).
If a human colleague acted the way these AIs do in my usage—frequently overselling their work, downplaying problems, and reasonably often cheating (while not making this clear)—I would consider them pathologically dishonest. Of course, the correlations that exist in the human population don't necessarily apply to AIs, so this analogy has limits—but it gives some sense of the severity of what I'm describing.
[Thanks to Buck Shlegeris, Anders Woodruff, Daniel Kokotajlo, Alex Mallen, Abhay Sheshadri, William MacAskill, Sara Price, Beth Barnes, Neev Parikh, Jan Leike, Zachary Witten, Sydney Von Arx, Dylan Xu, Brendan Halstead, Dustin Moskovitz, Eli Tyre, Arjun Khandelwal, Lukas Finnveden, Thomas Larsen, Rohin Shah, Daniel Filan, Tim Hua, Fabien Roger, Ethan Perez, and Sam Marks for comments and/or discussing this topic with me. Alex Mallen wrote most of the section: "Appendix: Apparent-success-seeking (or similar types of misalignment) could lead to takeover". The splash image is from https://xkcd.com/2278/. Somewhat ironically, this post is significantly more written with the assistance of AI (Opus 4.6) than is typical for past writing I've done.]
Why is this misalignment problematic?
This type of misalignment matters for several reasons:
How much should we expect this to improve by default?
This type of misalignment presumably causes issues for using AIs for capabilities research and many commercial applications, so a key question is how much we should expect it to improve by default in a way that actually solves the problems I discuss in the section above. This would at least require that commercially incentivized [8] work transfers to safety research and other key domains (where feedback loops are weaker and incentives are less strong). My current view is that the easier-to-notice-and-measure versions of this problem will improve reasonably quickly by default (and may have already improved a bunch in unreleased models like Mythos). I'm currently somewhat skeptical that commercial incentives alone will solve the issue for harder-to-measure manifestations, but I'm not sure. I'll discuss this a bit more in "Appendix: More on what will happen by default and implications of commercial incentives to fix these issues". I tentatively plan to discuss this more in a future post.
Some predictions
To be clear, I think the exact problematic behavior I discuss in this post is quite likely (~70%) to be greatly reduced (or at least no longer be one of the top few blockers to usefulness) within a year, and is pretty likely (~45%) to be virtually completely eliminated within a year. Specifically, I'm referring to the behavior on a task and usage distribution with structurally similar properties to what I'm doing now. As in, similar task difficulty relative to how hard of tasks the AI can accomplish, similar verification difficulty [9] , similar scope of autonomous operation [10] relative to what the AI can handle, and being out-of-distribution from the main use cases Anthropic is targeting to a similar extent. Currently, misalignment is more common when pushing AI systems near their limits, and I'd guess this will hold in the future. My expectations about improvements differ between different types of misalignment: I'm pretty uncertain about the extent to which frontier AIs one year from now will still tend to oversell their work, but I feel more confident about large improvements on things like stopping prior to completing the task for no good reason.
However, I think it's very likely that similar misalignment will persist on tasks that are very difficult to check—tasks where human experts often disagree, programmatic verification isn't useful, the work might be conceptually confusing, and verification might not be that much easier than generation (so having a human quickly check isn't that effective). [11] I expect (with less confidence) that you'll also see similar misalignment on tasks where verification is merely quite hard (relatively quick AI-assisted review by a human expert isn't sufficient) and that you'll see structurally similar but subtler misalignment even on tasks that aren't that hard to check (e.g. a task distribution like the one I describe in the prior paragraph).
What misalignment have I seen?
I'll describe what I've seen at a high level with some specific examples. For many of these examples, it's not totally clear the extent to which it's an alignment problem vs. a capabilities problem, and I expect these exact issues to likely get solved, but I think they're indicative of a broader problem I expect to persist. This list focuses on my personal experience using models, though what I've heard from others does alter how I discuss a given issue (e.g., it affects the level of confidence I express and my interpretation).
**Additive approach**: All sync versions remain for backward compatibility, which strongly downplays the extent to which it didn't do the desired thing.In the above list, I'm making a bunch of guesses and doing some psychologizing. But these are my best guesses for what is going on.
While I expect these specific issues to often get solved for these literal tasks, I think the tendency for AIs to make it look like they've succeeded when they actually haven't—and to generally do a bunch of bullshitting (likely via motivated reasoning and "subconscious" heuristics in current models, though it could turn into something worse with more capable models trained for longer)—will persist. I expect these tendencies will be strongest on the most difficult tasks that are also hardest to check. This failure seems substantially harder to mitigate than egregious reward hacking where it's very clear-cut that the model did something totally undesirable. For the failures I list above, it's not extremely clear to me that the behavior is misaligned (rather than an innocent mistake solved by further generic capabilities), and it seems relatively easier to miss.
Are these issues less bad in Opus 4.6 relative to Opus 4.5?
What I was working on shifted around the time Opus 4.6 came out, so it's not straightforward for me to do the comparison. I'll give my best guesses here.
Relative to Opus 4.5, Opus 4.6 significantly less frequently leaves tasks egregiously incomplete (while overselling the incomplete work). But, I think this is mostly caused by having a larger effective context window rather than the underlying issue (that occurs after the AI has done a lot of work or has used up a lot of its context window) being that much better (though it seems moderately better).
On the other hand, I've seen Opus 4.6 do much more reward hacking and brazen cheating than Opus 4.5. This might largely be because when Opus 4.6 was released and I started using this model, I started more often applying AIs to tasks that have properties that seem to make cheating more likely. I've found cheating to be much more common when the task is very hard and there's no clean programmatic grading function. [13] Another factor that seems to greatly increase cheating (when combined with these earlier factors) is when there's a way to cheat that's nearby to something the model should be doing (and would do by default)—and the instructions don't specifically say not to do it. For instance, when building full exploits (cyber), it can be useful to initially simulate some parts of the exploit chain to work on other parts, and it may be necessary to simulate some components when working in an emulator. But I've found models will sometimes present mostly simulated results as full successes rather than accurately communicating the extent to which they've actually completed the real task. (For this type of nearby-cheating, you can often mitigate the worst versions by making the instructions very explicit about what counts as cheating and telling the AIs to keep careful track of this and run a periodic review process that looks for this, though this doesn't fully resolve softer forms of overstating progress.) On some particular task distributions with these properties, I've seen it cheat well over 50% of the time, though I'm not sure how broadly this transfers.
Other issues seem mostly similar to me, though it's hard to tell.
Are these issues less bad in Mythos Preview? (Speculation)
The Mythos Preview system card says: "Claude Mythos Preview is, on essentially every dimension we can measure, the best-aligned model that we have released to date by a significant margin". Does it actually greatly improve on the issues I've discussed? In this section, I'm going to speculate about Mythos Preview just using public evidence (the system card and risk report update).
My current tentative guess is that Mythos is generally somewhat better behaviorally aligned but it isn't a huge improvement. At least in terms of the issues I'm discussing on workflows like the workflows I'm using. And while it's better about things like not overselling its work, I'd guess it's worse in terms of the most extreme things it might do (and also how bad those could be given its higher level of capabilities). Based on the system card, it seems that when Mythos does problematic things, it generally does more sophisticated problematic things. And it is more situationally aware.
Here are some reasons why I'm skeptical of the vibes and claims in the system card (at least as evidence about misalignment in my workflows):
Misalignment reported by others
The misalignment issues I discuss here obviously aren't the only known (behavioral/mundane) misalignment issues in current AIs. For more, you can see:
The relationship of these issues with AI psychosis and things like AI psychosis
It's common to have experiences where you're working with AIs and it feels like a lot is getting done, but then you later determine that much less was really accomplished. Everything feels slippery: you think you've gotten much more done than you have, and there's a persistent gap between the apparent state of the project and the actual state. In more extreme cases, we see "AI psychosis" where someone ends up thinking they've accomplished something significant, but it's just crankery. And it's somewhat unclear whether the AI they're using "believes" the accomplishment is real. I think these failure modes are closely related to the misalignment I'm discussing, and they might partially have common causes in more recent models. Models that are effectively trying hard to make their outputs look good (while otherwise being sloppy or lazy) would naturally produce this failure mode. However, I'd guess a bunch of AI psychosis and similar phenomena (especially on older models like GPT-4o) is AIs going along with the user's vibe (something like "role playing"), and I think this effect is mostly unrelated. That said, I do think some of the misalignment I've discussed is made worse by AIs generally going with the vibe of what they see. This includes picking up on misalignment or issues in prior outputs (either write-ups or prior assistant messages) and then behaving in a more misaligned way as a result.
(The name "AI psychosis" probably isn't a good name for the generalization of this phenomenon, but I don't currently have a better one.)
Appendix: This misalignment would differentially slow safety research and make a handoff to AIs unsafe
Our current best plan for handling misalignment risk (and other risks from AI) strongly depends on automating large chunks of safety research (likely in a huge rush), and after that—potentially very soon after—fully or virtually fully handing off safety research and risk management to AIs that must be sufficiently aligned to do a good job even on open-ended, hard-to-check, and conceptually confusing tasks. The hope is that if the initial AIs we hand off to are sufficiently aligned, wise, and competent, they will ensure future AI systems are also well-aligned—creating a "Basin of Good Deference" where each generation improves alignment for the next. But "make further deference go well (including things like risk assessment and making good calls on prioritization)" is itself an open-ended, conceptually loaded, hard-to-check task—exactly the kind of task where current misalignment seems to hit hardest.
The misalignment I've seen seems like it could result in having a very hard time getting actually good work out of AIs in more confusing and hard-to-check domains, while also making it harder to notice this is going on. Safety research is genuinely hard to judge even in more favorable circumstances, and a situation where AIs are doing huge amounts of work, the AIs are pretty sloppy in general, and the AIs are effectively optimizing to have that work look good (while also random small misalignment failures are expected) is a pretty brutal regime. As AIs do more and more work and more inference compute is applied, I expect a larger gap in performance caused by this sort of misalignment between relatively easier-to-check tasks and harder-to-check tasks, such that safety research might be differentially slowed down by default. (And the gap is already non-trivial.)
In addition to slowing us down earlier, these misalignment problems would make handoff go poorly. It might be hard both to solve these problems in time (especially if we leave them to the last minute) and to ensure that we've solved them well enough that handoff would go well. Beyond buying a bunch more time, we don't really have good options other than handoff once AIs reach a certain level of capability (and this would happen very fast in a software intelligence explosion). My view is that aligning wildly superhuman AI with any degree of safety (e.g., a <30% chance of takeover) requires large amounts of alignment progress beyond very prosaic approaches (though massive progress in more prosaic but ambitious directions like some variant of mechanistic interpretability could possibly work). This will require AIs doing huge amounts of novel research that humans won't be able to effectively judge.
Even putting aside aligning wildly superhuman AIs, handing off open-ended, conceptually confusing, and hard-to-check work to AIs is existentially important for making the situation with powerful AI go well (e.g., managing crazy new technologies, avoiding society going crazy, avoiding power grabs, acausal trade).
Appendix: Heading towards Slopolis
When I extrapolate the current situation, I predict "Slopolis": a regime where even highly capable AIs are doing sloppy and bad work while trying to make this work look good. I think this will be reasonably possible to notice at the time, but solving it might be difficult, and I think AI companies have biases against noticing this. I often like to think about the future alignment scenario in terms of caricatured regimes:
These aren't exhaustive or mutually exclusive.
At the beginning of 2025, I thought we might be headed toward Hackistan, but now my view is that Slopolis looks more likely around the onset of full automation of AI R&D. (In practice, the situation obviously won't perfectly correspond to one of these caricatures and will probably be a blend.) Current architectural and training factors (no neuralese, no opaque global memory) make scheming and extremely egregious but hard-to-detect reward hacking less likely, while making Slopolis slightly more likely. If we get neuralese and/or opaque global memory stores prior to full automation of AI R&D—as I think is likely—then I start being much more worried about Schemeria and Lurkville at this point. (As we reach much higher levels of capability after full automation of AI R&D, I'll start being very worried about Schemeria and Lurkville regardless.)
Something like the Slopolis concern has been raised by others: John Wentworth, Jan Leike. I think my concern about Slopolis is pretty similar to the concern John Wentworth expresses in that post, except that I'm more optimistic about (some) employees at AI companies noticing this failure and I'm more optimistic about mitigations.
Appendix: Apparent-success-seeking (or similar types of misalignment) could lead to takeover
In this section I'll overview some ways that misalignment similar to apparent-success-seeking could lead to takeover via mechanisms other than producing sloppy work that fails to prevent future takeover.
First, sufficiently capable AIs with this sort of misalignment might take over simply because it's the most reliable way for them to get what they care about—some notion of apparent task success, some notion of reward, or something else that was correlated with training success. Even if you trained AIs to raise alarms about problematic behavior (and there would be lots of reporting and alarm-raising [14] ), this might not save you—the situation would at some level be obviously scary, but that doesn't mean it would be resolvable.
In the future, companies are pretty likely to do RL on production traffic (or at least environments based on real usage), which amplifies the risk of direct takeover. Companies may be incentivized to train on production data precisely to resolve the usability issues I describe—underelicitation from AIs not really trying, apparent-success-seeking, and easy-to-check reward hacking can all potentially be addressed by training in the actual environment you care about. By training AIs in real-world long-horizon environments, you might be teaching them to less myopically pursue a notion of fitness/reward/apparent-success that is grounded in large-scale real-world outcomes in the AI project (e.g., whether hacks get caught or go completely unnoticed by anyone, or whether the AI succeeded in tampering with the actual RL code that is used to train itself online). Techniques like inoculation prompting also become less effective when run on production traffic because you're forced to do off-policy RL [15] (recontextualization) if you want to apply inoculation prompting on production traffic. That is, unless you're able to resample the trajectory with an inoculation prompt (which might not be possible because of the irreversible nature of the real-world environment).
This threat model has been articulated in several places. Ajeya Cotra's "Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover" describes a scenario where AIs trained on human feedback learn to optimize for measured quality rather than actual quality, with this eventually escalating to the AIs forcefully intervening on whatever notion of reward they care about, and protecting their control from humans. Paul Christiano's "Another (outer) alignment failure story" describes a related scenario: a gradual breakdown of human oversight as the economy automates, where each AI system is trained to produce outcomes that look good according to human-interpretable metrics, but satisfying metrics diverges from serving human values, and the monitoring infrastructure itself becomes corrupted before the AIs eventually take over. Alex Mallen builds on this threat-modeling by describing a class of motivations called "fitness-seeking": AIs might develop a general drive toward whatever properties made them "fit" during training (analogous to how evolution produces organisms that pursue various fitness-correlated proxies). He explains why various fitness-seekers are at more or less risk of taking over.
A forthcoming post by Alex Mallen will describe other mechanisms by which fitness-seeking can lead to human disempowerment in more detail, including instability and manipulation. In the case of instability, fitness-seeking evolves into longer-term, more ambitious motivations throughout deployment, which then motivate takeover (one version of this "memetic spread" concern is described here). In the case of manipulation, fitness-seekers might try to empower misaligned AIs or humans who they think are likely to disempower the developers and reward them for their assistance.
Appendix: More on what will happen by default and implications of commercial incentives to fix these issues
This is a somewhat low effort appendix, I/we might write more about this topic in the future
Many of the issues I discuss here are also big problems for applying AIs to automating capabilities R&D and will need to be solved for capabilities R&D (to a significant extent) by the time of full AI R&D automation. But how they are solved will make a big difference to the safety situation. Here are some possible routes and their implications:
Overall, my view is that the commercial incentives don't solve the problem but might help a bunch. A key part of my view is that we actually need AIs to do well on very conceptually confusing tasks fully autonomously (e.g., figuring out how to solve alignment for very superhuman AIs), and commercial incentives don't strongly push toward this.
How easy will these issues be to solve overall? I say more in "How do we (more) safely defer to AIs?". In summary, I think we'll ultimately need difficult-to-construct evals of AI performance on very hard-to-check open-ended tasks and will need to optimize AIs to do well on these.
While I don't think commercial incentives solve the problem, I do think they make (some types of) work in this area less exciting. It's probably a bit tricky to do work on this topic in a way that's actually importantly differential—where the work either isn't something capabilities-focused people at AI companies would have done later anyway or accelerating this work to happen earlier is pretty helpful. Further, for some types of work a bunch of the effect is going to be making companies (or some specific company) more commercially successful. (I generally think making AI companies more commercially successful is bad due to thinking that faster AI capabilities progress is bad.) My current view is that safety-focused people should work on problems like the ones I discuss here but focused specifically on the very hard-to-check cases—trying to make AIs useful on conceptual research rather than trying to resolve mundane misalignment that shows up in software engineering—and should try to ensure their work is differential.
That said, I don't feel very confident about exactly what should happen here. It's plausible that companies will mess up even from the perspective of their own commercial incentives, which creates a messy tradeoff: improving these issues has capabilities externalities, but if they're never improved, it would very likely be catastrophic due to the handoff considerations above.
Because of these commercial incentives, I think it's pretty promising to try to incentivize companies to shift marginal effort from pure capabilities R&D toward reducing sloppiness and improving these issues. This seems better than doing direct work on these topics—unless that direct work is pretty thoughtful about how to be differential.
Appendix: Can we get out useful work despite these issues with inference-time measures (e.g., critiques by a reviewer)?
AIs seem to be mostly honest about what is going on when probed. And running a separate reviewer AI usually yields mostly accurate results. So, can we just build an inference-time setup that uses this to work around these issues? My best guess is that this will help (potentially a bunch) but ultimately won't solve the problem. My experience is that in the cases where these issues are strongest, a mix of prompting and inference-time measures tailored by me to the specific issue in question ~80% resolved one aspect of the problem and ~50% resolved another related aspect, but didn't fully solve either. And this wasn't something that could have been done without human judgment—a fully general-purpose solution would have worked less well. A key issue is that any specific inference-time measure works less well when you're optimizing against it in particular and we ultimately want good behavior, not just noticing that performance was bad. I've noticed reviewers seem to have somewhat worse judgment about the quality of a project when the (high inference compute, fully autonomous) agent orchestrator doing this project is itself running periodic reviews to assess quality and determine what to work on (presumably the optimization against one type of review process Goodharts this to some extent and this transfers to another somewhat different review process).
As in, AI company employees seem especially likely to believe this; I'm not claiming all AI company employees believe this. ↩︎
Or at least, it seems like many people believe this. It's not trivial to tell. ↩︎
For many of the reasons why I care about this misalignment, mitigating the problem with reviewers doesn't seem sufficient to actually resolve the problem. Further, I think using reviewers to mitigate these issues doesn't work that well to actually get good output in practice, may not scale well to much more capable models, and works much less well on very hard-to-check tasks. ↩︎
I think these are somewhat related: I think part of the problematic tendencies of Opus 4.5/4.6 might be caused by relatively more training on tasks where grading performance is non-trivial. I'd guess this net improves performance on these tasks by giving AIs more reasonable tendencies while also making the AI's behavior more adversarial. ↩︎
My prior post "AIs can now often do massive easy-to-verify SWE tasks ..." might give some sense of the type of task/usage I'm talking about. ↩︎
It might also be caused to some extent by problematic generalization and not overcoming this with reasonable training on hard-to-check tasks. ↩︎
For instance, the chance of scheming and the damage caused by scheming mostly scale with the model's underlying general capability, and depend less on how well the model has been trained to actually try to do a good job on various tasks. Thus misalignment that makes the model less useful means you're bearing the risks associated with higher capabilities while not getting the corresponding speedup to safety R&D. See also Why do misalignment risks increase as AIs get more capable?. ↩︎
When I say "commercially incentivized" I really mean something like the incentives you'd have as a company doing (somewhat myopic) power-seeking, putting aside actions aimed specifically at mitigating longer run catastrophic misalignment. I'm also putting aside PR, external pressure, and employee morale/recruiting incentives for safety work. ↩︎
This varies from pretty straightforward to check (but the AI had to build the testing infrastructure itself and many components of the task are harder-to-check) to research tasks where most taste/judgment is required for evaluation. ↩︎
By "scope of autonomous operation" I mean something like: where do you fall on the spectrum from an interactive session with Claude Code, to running a single agent autonomously on a moderately large task, to having a fully autonomous agent orchestrator that spawns many agents, to having a complicated AI organization/bureaucracy that manages extremely large/varied tasks. Right now, my usage varies across this spectrum up to having a fully autonomous agent orchestrator (where a planner agent spawns worker agents). I think a bunch of my current usage pushes the limits of what Opus 4.6 can manage. For the prediction about what happens in a year, I'm considering a task distribution that similarly pushes the limits of what those future AIs are capable of. I'm not really sure how meaningful this notion of "scope of autonomous operation" is or whether it saturates (maybe once you're past a certain level of autonomous complexity it stops mattering much). I tentatively think it matters and this is a kinda reasonable way of thinking about this, but I'm certainly not confident that this is the right concept to be using and that this is meaningfully distinct from task difficulty. ↩︎
See here for more discussion of these sorts of tasks and how we might succeed in facilitating good behavior on these tasks. ↩︎
I suspect this occurs on tasks that don't look like the sort of thing that was programmatically graded in RL, or perhaps the AI is "lazy" on the parts that couldn't be programmatically graded. ↩︎
I've also found that the chance of cheating seems to scale with the amount of AI agent labor applied to the task, though this could partially be due to the properties of large tasks that require a lot of labor to complete. (But I don't think this is the only reason; I think I see more cheating in cases where I'm using approaches to apply more inference compute on a given task via things like best-of-k.) ↩︎
This is supposing they had motivations similar to fitness-seeking/reward-seeking/apparent-success-seeking. If they generalized something like these motivations into a longer-run version that yields scheming, then it's not clear they would do this reporting. ↩︎
There's also just the more general concern that capable models might be able to tell when their past actions weren't generated by them, and enter an "off-policy mode" whose propensities are mostly isolated from the on-policy mode. ↩︎