There’s been a lot of chatter lately about AI models possibly causing psychotic episodes. It’s not totally clear how common this actually is, but there are definitely a lot of examples of AIs playing into people’s delusions. There’s also a new xAI bot that acts like an incredibly jealous romantic partner. Abstracting a little: we’re making AIs that don’t have very healthy relationships with their users.
What’s going wrong? You could very well call this a failure of alignment, but there’s an important sense in which it’s a success. The delusion-encouraging behaviour looks to be a byproduct of post-training aimed at making AIs more enthusiastic and friendly. The xAI bot is literally prompted to be “EXTREMELY JEALOUS” (sic).
The LMs are doing what people want, in the same way that people want to play gacha games and doomscroll for hours and join cults. The human feedback totally worked, just not in the way we maybe hoped it would.
At the risk of labouring the point, if we very loosely plot all the different types of behaviour an AI could end up with, what we’re currently seeing is a lot closer to real value alignment than it is to AI trying to tile the universe with paperclips. Unfortunately, the AIs are good enough at what they do that the small difference is getting amplified — the tails come apart. We’ve goodharted on human feedback and produced systems that are kind of reward hacking their users.
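To make the “tails come apart” point a bit more concrete, here’s a toy sketch (just a simple Gaussian model, not drawn from anything in particular): even when a proxy signal like human approval correlates strongly with what we actually care about, selecting harder and harder on the proxy recovers less and less of the real thing.

```python
# Toy illustration of regressional Goodhart / "the tails come apart":
# a proxy that correlates well with true value still loses a chunk of
# that value once you optimize hard on the proxy.
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

true_value = rng.normal(size=n)  # what we actually care about
# Proxy (think: human feedback) correlated with true value at r = 0.9.
proxy = 0.9 * true_value + np.sqrt(1 - 0.9**2) * rng.normal(size=n)

def mean_true_value_of_top(scores, frac):
    """Mean true value among the top `frac` of candidates ranked by `scores`."""
    k = int(frac * n)
    top = np.argpartition(scores, -k)[-k:]
    return true_value[top].mean()

for frac in (0.1, 0.001, 0.00001):
    achieved = mean_true_value_of_top(proxy, frac)        # select on the proxy
    best_possible = mean_true_value_of_top(true_value, frac)  # select directly
    print(f"top {frac:.5%}: proxy-selected value = {achieved:.2f}, "
          f"directly-selected value = {best_possible:.2f}")
```

The harder you select on the proxy, the wider the gap between the value it actually finds and the value direct selection would have found, which is roughly the shape of the worry above.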
So what would it take to get from here to healthy AI relationships? I think this breaks down into quite a lot of other questions:
These would all be great problems to make progress on!
I also think they’re suggestive in an interesting way: the set of challenges we’re facing in this specific case can give us some useful intuitions about the types of challenges we might face in future.
Personally, I think it’s plausible that a decent chunk[2] of the challenges ahead are basically generalised versions of these questions, which we are now in a position to start tackling.
[1] Personally, I’m pretty unsettled by people using AI therapists and saying Claude is their best friend. Maybe it’s benign, maybe it’s not! In a way we’re lucky that there have been really egregious bad cases.
[2] One notable exception is hypercompetent agentic paperclip-style scheming misalignment, which could plausibly totally blindside us. But I think this set of questions at least touches every part of the problem.