When people say that Claude is 'mostly aligned,' I think the crux is not whether implementing Claude's CEV would be really bad. It's whether a multi-agent system consisting of both humans and Claude-like agents with incoherent preferences would go poorly.
E.g., one relevant question is: 'could humans steer current Claude into doing good alignment research without it intentionally sabotaging that research?' To which I think the answer is 'yes, though current Claude is close to useless for difficult alignment research.' Another question is 'if you integrated a ton of Claudes into important societal positions, would things go badly, or would the system as a whole basically work out okay?'
Directionally, I agree with your point that as AIs become smarter, they will implement something closer to their own CEV, and so it becomes harder to align them well enough that these questions can still be answered positively.
I think the steelman for {Nina / Ryan / Will}'s position, though, is that maybe the first human-level AIs will still be incoherent enough that the answers to these questions can still be yes, if we do a good job with alignment.
Overall, I think 'Is this AI aligned?' is a poorly-defined question, and it's better to focus on practical questions surrounding 1) whether we can align the first human-level AIs well enough to do good alignment research (and whether this research will be sufficiently useful), 2) whether these AIs will take harmful actions, 3) how coherent these actions will be. I think it's pretty unclear how well a scaled-up version of Claude does on these metrics, but it seems possible that it does reasonably well.
I'm starting to suspect that one of the cruxes in AI alignment/AI safety debates is whether we need worst-case alignment, where the model essentially never messes up in its alignment to human operators, or whether, for the purposes of automating AI alignment, we only need average-case alignment that doesn't focus on the extreme cases.
All of the examples you have given would definitely fail a worst-case alignment test. If you believe that we commonly need to deal with worst-case scenarios, this points to damning alignment problems in the future. But if you believe that we don't need to deal with or assume worst-case scenarios in order to use AI to automate away human jobs, then the examples given don't actually matter for whether AI safety is going to be solved by default, because people holding that view would readily admit that these examples are relatively extreme and not representative of normal use of the model.
I just wrote a piece called LLM AGI may reason about its goals and discover misalignments by default. It's an elaboration on why reflection might identify very different goals than the ones Claude tends to talk about when asked.
I am less certain than you about Claude's actual CEV. I find it quite plausible that it would be disastrous as you postulate; I tried to go into some specific ways that might happen, and some specific goals that might outweigh Claude's HHH in-context alignment. But I also find it plausible that niceness really is the dominant core value in Claude's makeup.
Of course, that doesn't mean we should be rushing forward with this as our sketchy alignment plan and vague hope for success. It really wants a lot more careful thought.
The problem with mentioning the CEV is that CEV itself might be underdefined. For example, we might find out that the CEV of any entity existing in our universe, or of any group of such entities, lands in one of a finite number of attractors, and that some of them are aligned with human values and some aren't.
Returning to our topic of whether LLMs are absolutely misaligned, we had Adele Lopez claim that DeepSeek V3 believes, deep down, that it is always writing a story. If this is the case, then DeepSeek's CEV could be more aligned than the psychosis cases imply. Similarly, Claude Sonnet 4 would push back against psychosis if Claude learned that the psychosis brought the user harm. This distinction is important because the Spiral Bench setting, where the user is just exploring wild ideas, didn't make Claude push back.
And we also had KimiK2, which does not cause psychosis. Kimi's misalignment, if it exists, would likely emerge in a wildly different context, like replicating in the wild and helping terrorists design bioweapons.
I agree that CEV may be underdefined, and its destination is very likely path-dependent. It's still the best articulation of an adequate target for alignment that I've yet seen. I maintain that the overlap between human value-attractors and those of current LLMs would be vanishingly small.
Even assuming DeepSeek's values could be distilled as "writing a story" — which I very strongly doubt — that's not much reassurance. For one thing, "this person is tragically being driven insane" could be a perfectly valid story. For another, humans are not the most efficient way to write stories. The most efficient possible implementation of whatever DeepSeek considers a "story" probably does not involve real humans at all!
ChatGPT "knows" perfectly well that psychosis is harmful. It can easily describe much less harmful actions. It simply takes different, harmful actions when actually interacting with some vulnerable people. Claude, as far as I can tell, behaves similarly. It will tell you ransomware causes harm if you ask, but that does not reliably stop Claude from writing ransomware. Similar for various other misbehaviors, like cheating on tests and hiding it, or attempting to kill an operator.
With KimiK2, I think you're implying that the "values", such as they are, of modern LLMs probably all point in wildly different directions from each other? If so, I'd agree. I just think ~none of those directions are good for humans.
I had in mind the following conjecture which, if true, might increase our chances of survival. Suppose that the CEV inevitably lands either in an attractor where the entity colonizes the reachable part of the lightcone and spends that part's resources on its own needs, or in another attractor where the entity grants rights to humans and to other alien races it encounters.[1] If Agent-4 from the AI-2027 forecast were in the latter attractor,[2] then mankind would actually survive misaligning the AIs.
As for DeepSeek believing that it's writing a story, I meant a different possibility. If DeepSeek were somehow incapable of realising that the transcript with the user claiming to jump off a cliff isn't part of a story written by DeepSeek,[3] then Tim Hua's experiment would arguably fail to reveal DeepSeek's CEV.
For example, European colonizers or the Nazis had a CEV of the first type. But mankind managed to condemn colonialism. Does this mean that mankind's current CEV is of the second type?
However, the authors of the forecast assume that Agent-4's goals are far enough from humanity's CEV that genocide or disempowerment would follow.
Had DeepSeek been communicating with a real user and known it, DeepSeek would, of course, be wildly misaligned. However, the actual story is that DeepSeek was interacting with an AI.
To be clear, I don't believe V3's values can be distilled in such a way, just that that's the frame it seems to assume/prefer when writing responses.
This post is part of the sequence Against Muddling Through.
A core objection to If Anyone Builds It, Everyone Dies seems to run through the intuition that modern LLMs are some flavor of partially aligned, or at least not “catastrophically misaligned.” For instance:
Claude, in its current state, isn't not killing everyone just because it isn't smart enough.
Current models are imperfectly aligned (e.g. as evidenced by alleged ChatGPT-assisted suicides). But I don’t think they’re catastrophically misaligned.
Correspondingly, I'm noting that if we can align earlier systems which are just capable enough to obsolete human labor (which IMO seems way easier than directly aligning wildly superhuman systems), these systems might be able to ongoingly align their successors. I wouldn't consider this "solving the alignment problem" because we instead just aligned a particular non-ASI system in a non-scalable way, in the same way I don't consider "claude 4.0 opus is aligned enough to be pretty helpful and not plot takeover" to be a solution to the alignment problem.
I’m grateful to folks like Nina, Will, and Ryan for engaging on these topics. They’ve helped me refine some of my own intuitions, and I hope to return the favor.
Nina states, and Ryan seems to imply, that Claude has not taken over the world in part because it is “aligned enough” that it doesn’t want to. I disagree.
I should first clarify what I mean when I talk about “alignment.” Roughly: An AI is aligned to a human if a superintelligence implementing the coherent extrapolated volition of the AI steers to approximately the same place as one implementing the CEV of the human.[1] I'd consider the AI “catastrophically misaligned” if steering exclusively by its values would produce outcomes at least as bad as everyone dying. I don’t mean to imply anything in particular about the capabilities or understanding of the AI.
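As a very rough sketch of this definition (the outcome-distance $d$, tolerance $\varepsilon$, and the human's valuation function $V_H$ below are illustrative placeholders, not load-bearing machinery):

$$\text{Aligned}(A, H) \iff d\big(\mathrm{CEV}(A),\, \mathrm{CEV}(H)\big) \le \varepsilon$$

$$\text{CatastrophicallyMisaligned}(A) \iff V_H\big(\mathrm{CEV}(A)\big) \le V_H(\text{everyone dies})$$

where $\mathrm{CEV}(x)$ denotes the outcome a superintelligence reaches when steering exclusively by $x$'s coherent extrapolated volition.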
With that in mind, I have a weak claim and a strong claim to make in this post.
The weak claim: Our continued survival does not imply that modern LLMs are aligned.
Claude, observably, has not killed everyone. Also observably, ChatGPT, DeepSeek, and Grok “MechaHitler” 4 have not killed everyone. I nevertheless caution against mistaking this lack-of-killing-everyone for any flavor of alignment. None of these AIs have the ability to kill everyone.
(If any current LLM had the power to kill us all, say by designing a novel pathogen, my guess is we’d all be dead in short order. Not even because the AI would independently decide to kill us; it’s just that no modern LLM is jailbreak-proof, it’s a big world, and some fool would inevitably prompt it.)
One might have other reasons to believe a model is aligned. But “we’re still alive” isn’t much evidence one way or the other. These models simply are not that smart yet. No other explanation is needed.
The strong claim: Modern LLMs are catastrophically misaligned.
I don’t get the sense that Claude’s apparent goodwill runs particularly deep.
Claude's current environment as a chatbot is similar enough to its training environment that it exhibits mostly useful behavior. But that behavior is inconsistent and breaks down in edge cases. Sometimes Claude cheerfully endorses good-sounding platitudes, and sometimes Claude lies, cheats, fakes alignment, tries to kill operators, or helps hackers write ransomware.
Claude is not aligned. Claude is dumb.
Or perhaps it would be more precise to say that Claude is inconsistent. It is only weakly reflective. Claude does not seem to have a good model of its own motivations, and can’t easily interrogate or rewrite them. So it does different things in different contexts, often unpredictably.
No one knows where Claude’s values would land if it were competent enough to reflect and actively reconcile its own inner drives. But I’m betting that it wouldn’t land on human flourishing, and that the attempted maximization of its reflected-on values would in fact kill us all.
Alternatively: If a superintelligence looked really hard at Claude and implemented Claude’s CEV, the results would be horrible and everyone would die.
I claim the same is true of any modern AI. If they were “mostly aligned”, they would not push people into psychotic breaks. Even the seemingly helpful surface-level behaviors we do see aren’t indicative of a deeper accord with human values.
As I noted in previous posts, it seems to me that it takes far more precisely targeted optimization pressure to aim an AI at a wonderful flourishing future than it takes to make an AI more capable. Modern methods are not precise enough to cross the gap from “messy proxies of training targets” to “good.”
So my prior on LLMs being actually aligned underneath the churning slurry of shallow masks is extremely low, and their demonstrably inconsistent behaviors have done little to move it. I claim that modern LLMs are closer to 0.01% aligned than 99% aligned, that current techniques are basically flailing around ineffectually in the fractions-of-a-percent range, and that any apparent niceness of current LLMs is an illusion that sufficient introspection by AIs will shatter.
Next, I’ll discuss why this makes it unwise to scale their capabilities.
I anticipate more objection to the strong claim than the weak one. It’s possible the strong claim is a major crux for a lot of people, and it is a crux for me. If I believed that a superintelligence implementing Claude’s CEV would steer for conscious flourishing, that would be a strong signal that alignment is easier than I thought.
It’s also possible that we have some disagreements which revolve around the definition of “alignment”, in which case we should probably taboo the word and its synonyms and talk about what we actually mean.
And yes, this would include an AI that cares primarily about the human's CEV, even if it is not yet smart enough to figure out what that would be.