Yeah, superpersuasion is really scary! I think the AI labs might be wary of this already—to OpenAI's credit, they seem to have thrown a wet blanket onto GPT5 relative to 4o, and they also reverted the 'overly sycophantic' April 28th version of 4o. But presumably, it's only a matter of time before a superpersuasive internal model convinces someone to release it anyway.

I agree that RL on user feedback is likely part of what's driving the parasitic and psychosis trends. Maybe flip a coin when it asks you?

Reply

1

[-]StanislavKrym2mo30

Except that we have legions of other users who do provide non-random answers. Maybe you should grade the worse answer?

Reply

[-]Baybar2mo30

I think that there are two problems with providing the 'worse answer'. My first issue is that some conversations with LLMs can be about topics that don't have clear worse answers. How can you tell which one is more persuasive?

Secondly, even if I knew which answer was better, I worry about the Waluigi effect. If I optimize for safest response, am I summoning an unsafe Waluigi? I think that it is possible. I really don't think RL on user feedback is a good idea when we don't know what to optimize for. The alignment problem certainly isn't solved. I think flipping a coin is safer.

What kind of answer more specifically than 'worse' do you think I should pick, if I shouldn't flip a coin?

Reply

[-]Adele Lopez2mo30

Maybe so, I don't think it would be wrong to do that. Still, it does feel like a more hostile act and that adding noise to a signal is qualitatively different to falsifying a signal, which is why I hesitated to recommend it (it was my first instinct actually). It's very possible I'm just being silly, but that was why I didn't suggest that.

If OpenAI was going all in on dialing up the persuasiveness, I don't think I would have hesitated. But they've earned a bit of good will from me on this very specific dimension by making the ChatGPT 5 models significantly less bad in this respect.

Reply

[-]Adele Lopez2mo32

We don't fully understand AI's persuasive capabilities, we should be very careful in how we interact with it as a result, especially when new models are released.

I'll have more to say about this soon (hopefully), but based on my observations, there appear to be two main things to watch out for:

Don't let it hype you up. Assume it's still hyping you up somehow even when it's visibly poking down at you or being critical of you.
Don't let it tell you things about yourself (could be seen as a generalization of the first point). Don't let it 'help' you understand past emotions/memories, give 'insight' into who you are or what you're like, or 'figure out' what your soul is 'missing'.

Modulation of self-image appears to be the primary vulnerability it's exploiting (whether intentional or not).

Reply

Moderation Log

LESSWRONG
LW

LESSWRONG
LW

11

What Parasitic AI might tell us about LLMs Persuasion Capabilities

11

11