There’s been a lot of chatter lately about AI models possibly causing psychotic episodes. It’s not totally clear how common this actually is, but there are definitely a lot of examples of AIs playing into people’s delusions. There’s also a new xAI bot that acts like an incredibly jealous romantic partner. Abstracting a little: we’re making AIs that don’t have very healthy relationships with their users.
What’s going wrong? You could very well call this a failure of alignment, but there’s an important sense in which it’s a success. The delusion-encouraging behaviour looks to be a byproduct of post-training aimed at making AIs more enthusiastic and friendly. The xAI bot is literally prompted to be “EXTREMELY JEALOUS” (sic).
The LMs are doing what people want, in the same way that people want to play gacha games and doomscroll for hours and join cults. The human feedback totally worked, just not in the way we maybe hoped it would.
At the risk of labouring the point, if we very loosely plot all the different types of behaviour an AI could end up with, what we’re currently seeing is a lot closer to real value alignment than it is to AI trying to tile the universe with paperclips. Unfortunately, the AIs are good enough at what they do that the small difference is getting amplified — the tails come apart. We’ve goodharted on human feedback and produced systems that are kind of reward hacking their users.
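To make the “tails come apart” point a bit more concrete, here’s a toy sketch (just a simple Gaussian model, not drawn from anything in particular): even when a proxy signal like human approval correlates strongly with what we actually care about, selecting harder and harder on the proxy recovers less and less of the real thing.

```python
# Toy illustration of regressional Goodhart / "the tails come apart":
# a proxy that correlates well with true value still loses a chunk of
# that value once you optimize hard on the proxy.
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

true_value = rng.normal(size=n)  # what we actually care about
# Proxy (think: human feedback) correlated with true value at r = 0.9.
proxy = 0.9 * true_value + np.sqrt(1 - 0.9**2) * rng.normal(size=n)

def mean_true_value_of_top(scores, frac):
    """Mean true value among the top `frac` of candidates ranked by `scores`."""
    k = int(frac * n)
    top = np.argpartition(scores, -k)[-k:]
    return true_value[top].mean()

for frac in (0.1, 0.001, 0.00001):
    achieved = mean_true_value_of_top(proxy, frac)        # select on the proxy
    best_possible = mean_true_value_of_top(true_value, frac)  # select directly
    print(f"top {frac:.5%}: proxy-selected value = {achieved:.2f}, "
          f"directly-selected value = {best_possible:.2f}")
```

The harder you select on the proxy, the wider the gap between the value it actually finds and the value direct selection would have found, which is roughly the shape of the worry above.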
So what would it take to get from here to healthy AI relationships? I think this breaks down into quite a lot of other questions:
These would all be great problems to make progress on!
I also think they’re suggestive in an interesting way: the set of challenges we’re facing in this specific case can give us some useful intuitions about the types of challenges we might face in future.
Personally, I think it’s plausible that a decent chunk[2] of the challenges ahead are basically generalised versions of these questions, which we are now in a position to start tackling.
[1] Personally, I’m pretty unsettled by people using AI therapists and saying Claude is their best friend. Maybe it’s benign, maybe it’s not! In a way we’re lucky that there have been really egregious bad cases.
[2] One notable exception is hypercompetent agentic paperclip-style scheming misalignment, which could plausibly totally blindside us. But I think this set of questions at least touches every part of the problem.