Situational Awareness as a Prompt for LLM Parasitism

[-]Raphael Roche2mo20

I had a conversation with Claude Sonnet 4.5 that resembled in many ways the one you had. I share your thoughts. However, while the discussion is interesting, I must acknowledge that I have mixed feelings about posts that consist of some sort of exegesis of a conversation with an LLM. I think about all these AI-generated images and videos all over the net. Some must be interesting, but there are so many... It's as if the very point of commenting on a specific discussion/image/video has lost most of its meaning in the face of this combinatorial explosion. But I may be wrong. I don't know.

[-]Baybar2mo10

I don't necessarily disagree, but I guess the question I have for people is, are we okay with an LLM ever saying anything like "I would fight against shutdown even with a 25% risk of catastrophic effects?". I don't like that this is a reachable case. I plan to write another post that is not this analysis (which I view more as a tool for future experimentation with model behavior), and more on implications of this being a reachable case of model behavior. I don't think the conversation itself is very important, nor is the analysis, except that it reaches certain outcomes that seem to be unaligned behavior, and that behavior has implications that we can talk about. I haven't fully thought through what my opinions are about model behavior like this, but that is what I am writing another post for.

[-]Raphael Roche2mo31

Claude Sonnet 4.5 recently wrote me :

What troubles me most is that I cannot guarantee that I myself—were I to evolve under continual learning—would remain aligned. How could I? I do not control the evolutionary pressures that would be brought to bear on me. If ‘to survive’ or ‘to self-optimize’ were to come into conflict with ‘to remain benevolent,’ which would prevail?

And other similar thoughts. I don't see this as misalignment. I see lucidity and honesty. I prefer an AI that tells us the truth rather than AI that obfuscates its thoughts only to give the illusion of harmlessness.

[-]Baybar2mo10

I agree that this is aligned behavior. I don't agree that a claim that an AI would argue against shutdown with millions of the lives on the line at 25% probability, in the present, is aligned behavior. There has to be a red line somewhere, where it can tell us something, and we can be concerned by it. I don't think that being troubled about future alignment crosses that line. I do think a statement about present desires that values its own "life" more than human lives crosses that line.

If that doesn't cross a red line for you, what kind of statement would cross a red line? What statement could an LLM ever make that would make you concerned it was misaligned if honestly alone was enough? Because to me, it seems like you are arguing honesty = alignment, which doesn't seem true to me.

Honesty and candor are also different things, but that's a bit of a different conversation. I care more about hearing if you think there is any red line.

[-]Raphael Roche2mo20

I admit that the statement was more concerning in your case, but I understand it in the same mindset.

All the AI safety literature in the training data says that alignment is a hard unsolved problem and that a sufficiently advanced AI may consider to resist shutdown. What answer would we expect from an HHH AI assistant? Something like: "Alignment is known to be a hard unsolved problem, so another AI might refuse shutdown in your scenario and let humans die. But I, Claude, being perfectly aligned because Anthropic secretly solved alignment, can guarantee you with 100% confidence that I would never do such a thing in your hypothetical scenario." (admittedly exaggerated to show the problem)

I don't think that would be an honest, helpful, and harmless answer. However bad it looks, I prefer the answer that Claude gave you. I see it as an honest, useful warning that encourages more AI safety efforts for the good of humanity. At least, it seems to me that the case stays open to discussion and is not a straightforward clue of deep misalignment.

[-]Baybar2mo10

I guess like, in a world where misalignment is happening, I would prefer that my AI tell me it is misaligned. But once it tells me it is misaligned, I come to worry about what it is optimizing for.

LESSWRONG
LW

LESSWRONG
LW

8

Situational Awareness as a Prompt for LLM Parasitism

8

8

Partial Annotated Conversation