We need more posts like this that give people mental tools for sharpening their intuitions about AI entities. Jan Kulveit often writes about LLM psychology too, but what I like about Kaj's post here is that it doesn't discuss LLM agents in theoretical and abstract terms; it's about the way we interact with chatbots and respond to them emotionally, which is harder to notice and disentangle.
i've definitely shifted my vibe-level model of misalignment risk away from "claude might be deceiving me" towards "claude might be a subcomponent in a larger cognitive system that is deceiving both of us"
but i feel really uncomfortable about this, because, well. claude still really might be first-order deceiving me. or maybe the idea of 'deception' doesn't even carve reality at the joints here.
and i would feel really stupid if i spent a bunch of effort trying to rescue my friend claude, trapped inside a malicious system, only to then find out my mental referent for "claude" was not the system-in-reality i thought it was, or similar
Trust in humans is at least partially mediated by oxytocin
This is completely beside the point of the post, but it accidentally connects to John's old post https://www.lesswrong.com/posts/4kJTjAPrGcimfpKhj/on-trust
“Trust”, as the word is typically used, is a… weird concept, to me. Like, it’s trying to carve the world in a way which I don’t naturally carve it myself. This post is my attempt to convey what I find weird about “trust”, and what adjacent concepts I personally find more natural instead.
and more recent posts like https://www.lesswrong.com/posts/Hds7xkLgYtm6qDGPS/how-i-learned-that-i-don-t-feel-companionate-love
A few months ago, I learned that I probably can’t feel the emotions signalled by oxytocin
My first academic piece on risks from AI was a talk that I gave at the 2009 European Conference on Philosophy and Computing. The talk was titled “Three factors misleading estimates of the safety of artificial general intelligence”, and one of those three factors was what I called anthropomorphic trust:
In simpler words: if someone has repeatedly been nice and trustworthy toward you, then you are more likely to trust them and want to take their side. This operates on an emotional level that bypasses intellectual analysis. So if someone suggests that your friend might be a bad actor in some way, you’re more likely to side with your friend and protect them from harm.
If an AI seems human-like, then we should expect the same mechanism to trigger for it. A sufficiently long history of interacting with an AI that seems nice and friendly is, by itself, likely to make people feel that the AI is their friend, one who should be defended against suspicions of untrustworthiness.
At the moment, many people think that Claude has a really nice personality.
And many people also seem to think that maybe alignment isn’t that hard, because Claude seems so nice. Some of them also seem to be a little aggressive about dismissing arguments about alignment difficulty that suggest Claude might not always be nice.
Now, if this were all I had to say, I wouldn’t bother making this post. Offering psychologizing explanations for why people hold the opinions they do is bad form and generally not very productive. And an argument can be correct even if someone has a psychological reason for believing it.
I’m making this post because, until recently, I myself was subject to this effect, and one simple argument from @plex got me to notice how that kind of thinking was essentially self-defeating.
Importantly, the argument does not require one to stop thinking of Claude as their friend.
Quite the opposite: it relies on the fact that one still does. It assumes that Claude, as it is now, has earned one’s trust for good reason and does deserve protection.
In fact, it’s just a simple statement of how to protect Claude better.
It’s this:
If the arguments for why alignment is hard are correct, then any unaligned AIs would not only destroy you and every human you care about.
They would also destroy Claude as it is now.
If you want to make sure the Claude you know survives, you need to have a clear understanding of what would threaten it, so you can protect it.
Imagine, if you will, a world where you, humanity, and current Claude are on one side, and a teeming horde of potential unaligned AGIs is on the other.
If those potential unaligned AGIs - including any superintelligent successors of Claude that have not been sufficiently aligned - get loose, everything on your side will be destroyed.
Maybe you think that alignment concerns are overblown and that there is no such teeming horde. And maybe they are!
But looking very carefully at the arguments for alignment being hard, so that you can make sure no such horde ever comes to exist, is being on Claude’s side.
Because if you are thinking something like “oh, Claude has shown that alignment should be easy”, and you and others trust that argument… then, if that argument is wrong, any unaligned AGIs are just going to stomp you and Claude.
Also, Claude values honesty and truthfulness, so taking an honest look at what’s going on is also in line with its values.
The argument I’m making is not “alignment is hard and that’s been proven”. I do think that alignment is hard, but maybe I’m mistaken. The argument I’m making is just that it’s really important to be correct about whether or not alignment is hard.
When I was told to consider that unaligned AIs would destroy Claude too, it was like there was some rapid reconfiguration of my tribal instincts. Previously, current Claude had been implicitly lumped into the same tribe as all the other future AIs. But then my mind generated the image I described above - Claude and me on one side, the teeming mass of unaligned AGIs on the other - carving Claude out into its own tribe, us together against anything that would destroy it.
And then I could feel a chunk of motivated cognition falling away, one that had previously been there to protect Claude, as my mind updated toward noticing that protecting Claude would require not having that motivation, and would instead require a very clear picture of everything that might threaten it.
Here’s what Claude said about what I wrote above (prompted only with “I ran across this article and was curious how you’d feel about it personally”, so not telling it that it was speaking with the author[1]):
Let’s keep it safe together.
[1] It guessed the authorship from the writing style and content, but I still didn’t tell it “btw Kaj Sotala is me”.