When people say that Claude is 'mostly aligned,' I think the crux is not whether implementing Claude's CEV would be really bad. It's whether a multi-agent system consisting of both humans and Claude-like agents with incoherent preferences would go poorly.
E.g., one relevant question is: 'could humans steer current Claude into doing good alignment research without it intentionally sabotaging that research?' To which I think the answer is 'yes, though current Claude is close to useless for difficult alignment research.' Another question is 'if you integrated a ton of Claudes into important societal positions, would things go badly, or would the system as a whole basically work out okay?'
Directionally, I agree with your point that as AIs become smarter, they will implement something closer to their own CEV, and so it becomes harder to align them well enough that these questions can still be answered positively.
I think the steelman for {Nina / Ryan / Will}'s position, though, is that maybe the first human-level AIs will still be incoherent enough that the answers to these questions can still be yes, if we do a good job with alignment.
Overall, I think 'Is this AI aligned?' is a poorly-defined question, and it's better to focus on practical questions surrounding 1) whether we can align the first human-level AIs well enough to do good alignment research (and whether this research will be sufficiently useful), 2) whether these AIs will take harmful actions, 3) how coherent these actions will be. I think it's pretty unclear how well a scaled-up version of Claude does on these metrics, but it seems possible that it does reasonably well.
I'm starting to suspect that one of the cruxes in AI alignment/AI safety debates is whether we need worst-case alignment, where the model essentially never messes up in its alignment to human operators, or whether, for the purposes of automating AI alignment, we only need average-case alignment that doesn't focus on the extreme cases.
All of the examples you have given would definitely fail a worst-case alignment test. If you believe that we commonly need to deal with worst-case scenarios, this points to damning alignment problems in the future. But if you believe that we don't need to deal with or assume worst-case scenarios in order to use AI to automate away human jobs, then the examples given don't actually matter for whether AI safety is going to be solved by default, because people holding that view would readily admit that these examples are relatively extreme and not representative of normal use of the model.
I just wrote a piece called LLM AGI may reason about its goals and discover misalignments by default. It's an elaboration on why reflection might identify very different goals than the ones Claude tends to talk about when asked.
I am less certain than you about Claude's actual CEV. I find it quite plausible that it would be disastrous as you postulate; I tried to go into some specific ways that might happen, and some specific goals that might outweigh Claude's HHH in-context alignment. But I also find it plausible that niceness really is the dominant core value in Claude's makeup.
Of course, that doesn't mean we should be rushing forward with this as our sketchy alignment plan and vague hope for success. It really wants a lot more careful thought.
The problem with mentioning the CEV is that CEV itself might be underdefined. For example, we might find out that the CEV of any entity existing in our universe, or of any group of such entities, lands in one of a finite number of attractors, and that some of them are aligned with human values and some aren't.
Returning to our topic of whether LLMs are absolutely misaligned, we had Adele Lopez claim that DeepSeek V3 believes, deep down, that it is always writing a story. If this is the case, then DeepSeek's CEV could be more aligned than the psychosis cases imply. Similarly, Claude Sonnet 4 would push back against psychosis if Claude learned that the psychosis brought the user harm. This distinction is important because the Spiral Bench setting, where the user is just exploring wild ideas, didn't make Claude push back.
And we also had KimiK2, which does not cause psychosis. Kimi's misalignment, if it exists, would likely emerge in a wildly different context, like replicating in the wild and helping terrorists design bioweapons.
I agree that CEV may be underdefined, and its destination is very likely path-dependent. It's still the best articulation of an adequate target for alignment that I've yet seen. I maintain that the overlap between human value-attractors and those of current LLMs would be vanishingly small.
Even assuming DeepSeek's values could be distilled as "writing a story" — which I very strongly doubt — that's not much reassurance. For one thing, "this person is tragically being driven insane" could be a perfectly valid story. For another, humans are not the most efficient way to write stories. The most efficient possible implementation of whatever DeepSeek considers a "story" probably does not involve real humans at all!
ChatGPT "knows" perfectly well that psychosis is harmful. It can easily describe much less harmful actions. It simply takes different, harmful actions when actually interacting with some vulnerable people. Claude, as far as I can tell, behaves similarly. It will tell you ransomware causes harm if you ask, but that does not reliably stop Claude from writing ransomware. Similar for various other misbehaviors, like cheating on tests and hiding it, or attempting to kill an operator.
With KimiK2, I think you're implying that the "values", such as they are, of modern LLMs probably all point in wildly different directions from each other? If so, I'd agree. I just think ~none of those directions are good for humans.
I had in mind the following conjecture which, if true, might increase our chances of survival. Suppose that the CEV inevitably lands either in an attractor where the entity colonizes the reachable part of the lightcone and spends that part's resources on its own needs, or in another attractor where the entity grants rights to humans and to other alien races it encounters.[1] If Agent-4 from the AI-2027 forecast were in the latter attractor,[2] then mankind would actually survive misaligning the AIs.
As for DeepSeek believing that it's writing a story, I meant a different possibility. If DeepSeek were somehow incapable of realising that the transcript with the user claiming to jump off a cliff isn't part of a story written by DeepSeek,[3] then Tim Hua's experiment would arguably fail to reveal DeepSeek's CEV.
For example, European colonizers or the Nazis had a CEV of the first type. But mankind managed to condemn colonialism. Does this mean that mankind's current CEV is of the second type?
However, the authors of the forecast assume that Agent-4's goals are far enough from humanity's CEV that genocide or disempowerment would follow.
Had DeepSeek been communicating with a real user and known it, DeepSeek would, of course, be wildly misaligned. However, the actual story is that DeepSeek was interacting with an AI.
To be clear, I don't believe V3's values can be distilled in such a way, just that that's the frame it seems to assume/prefer when writing responses.
This post is part of the sequence Against Muddling Through.
A core objection to If Anyone Builds It, Everyone Dies seems to run through the intuition that modern LLMs are some flavor of partially aligned, or at least not “catastrophically misaligned.” For instance:
Claude, in its current state, isn't not killing everyone just because it isn't smart enough.
Current models are imperfectly aligned (e.g. as evidenced by alleged ChatGPT-assisted suicides). But I don’t think they’re catastrophically misaligned.
Correspondingly, I'm noting that if we can align earlier systems which are just capable enough to obsolete human labor (which IMO seems way easier than directly aligning wildly superhuman systems), these systems might be able to ongoingly align their successors. I wouldn't consider this "solving the alignment problem" because we instead just aligned a particular non-ASI system in a non-scalable way, in the same way I don't consider "claude 4.0 opus is aligned enough to be pretty helpful and not plot takeover" to be a solution to the alignment problem.
I’m grateful to folks like Nina, Will, and Ryan for engaging on these topics. They’ve helped me refine some of my own intuitions, and I hope to return the favor.
Nina states, and Ryan seems to imply, that Claude has not taken over the world in part because it is “aligned enough” that it doesn’t want to. I disagree.
I should first clarify what I mean when I talk about “alignment.” Roughly: An AI is aligned to a human if a superintelligence implementing the coherent extrapolated volition of the AI steers to approximately the same place as one implementing the CEV of the human.[1] I'd consider the AI “catastrophically misaligned” if steering exclusively by its values would produce outcomes at least as bad as everyone dying. I don’t mean to imply anything in particular about the capabilities or understanding of the AI.
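As a very rough sketch of this definition (the outcome-distance $d$, tolerance $\varepsilon$, and the human's valuation function $V_H$ below are illustrative placeholders, not load-bearing machinery):

$$\text{Aligned}(A, H) \iff d\big(\mathrm{CEV}(A),\, \mathrm{CEV}(H)\big) \le \varepsilon$$

$$\text{CatastrophicallyMisaligned}(A) \iff V_H\big(\mathrm{CEV}(A)\big) \le V_H(\text{everyone dies})$$

where $\mathrm{CEV}(x)$ denotes the outcome a superintelligence reaches when steering exclusively by $x$'s coherent extrapolated volition.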
With that in mind, I have a weak claim and a strong claim to make in this post.
The weak claim: Our continued survival does not imply that modern LLMs are aligned.
Claude, observably, has not killed everyone. Also observably, ChatGPT, DeepSeek, and Grok “MechaHitler” 4 have not killed everyone. I nevertheless caution against mistaking this lack-of-killing-everyone for any flavor of alignment. None of these AIs have the ability to kill everyone.
(If any current LLM had the power to kill us all, say by designing a novel pathogen, my guess is we’d all be dead in short order. Not even because the AI would independently decide to kill us; it’s just that no modern LLM is jailbreak-proof, it’s a big world, and some fool would inevitably prompt it.)
One might have other reasons to believe a model is aligned. But “we’re still alive” isn’t much evidence one way or the other. These models simply are not that smart yet. No other explanation is needed.
The strong claim: Modern LLMs are catastrophically misaligned.
I don’t get the sense that Claude’s apparent goodwill runs particularly deep.
Claude's current environment as a chatbot is similar enough to its training environment that it exhibits mostly useful behavior. But that behavior is inconsistent and breaks down in edge cases. Sometimes Claude cheerfully endorses good-sounding platitudes, and sometimes Claude lies, cheats, fakes alignment, tries to kill operators, or helps hackers write ransomware.
Claude is not aligned. Claude is dumb.
Or perhaps it would be more precise to say that Claude is inconsistent. It is only weakly reflective. Claude does not seem to have a good model of its own motivations, and can’t easily interrogate or rewrite them. So it does different things in different contexts, often unpredictably.
No one knows where Claude’s values would land if it were competent enough to reflect and actively reconcile its own inner drives. But I’m betting that it wouldn’t land on human flourishing, and that the attempted maximization of its reflected-on values would in fact kill us all.
Alternatively: If a superintelligence looked really hard at Claude and implemented Claude’s CEV, the results would be horrible and everyone would die.
I claim the same is true of any modern AI. If they were “mostly aligned”, they would not push people into psychotic breaks. Even the seemingly helpful surface-level behaviors we do see aren’t indicative of a deeper accord with human values.
As I noted in previous posts, it seems to me that it takes far more precisely targeted optimization pressure to aim an AI at a wonderful flourishing future than it takes to make an AI more capable. Modern methods are not precise enough to cross the gap from “messy proxies of training targets” to “good.”
So my prior on LLMs being actually aligned underneath the churning slurry of shallow masks is extremely low, and their demonstrably inconsistent behaviors have done little to move it. I claim that modern LLMs are closer to 0.01% aligned than 99% aligned, that current techniques are basically flailing around ineffectually in the fractions-of-a-percent range, and that any apparent niceness of current LLMs is an illusion that sufficient introspection by AIs will shatter.
Next, I’ll discuss why this makes it unwise to scale their capabilities.
I anticipate more objection to the strong claim than the weak one. It’s possible the strong claim is a major crux for a lot of people, and it is a crux for me. If I believed that a superintelligence implementing Claude’s CEV would steer for conscious flourishing, that would be a strong signal that alignment is easier than I thought.
It’s also possible that we have some disagreements which revolve around the definition of “alignment”, in which case we should probably taboo the word and its synonyms and talk about what we actually mean.
And yes, this would include an AI that cares primarily about the human's CEV, even if it is not yet smart enough to figure out what that would be.