Constantin Weisser — LessWrong

47

3y

On Targeted Manipulation and Deception when Optimizing LLMs for User Feedback

by Marcus Williams, micahcarroll, Adhyyan Narang, Constantin Weisser, and Brendan Murphy

Produced as part of MATS 6.0 and 6.1.

Key takeaways:

* Training LLMs on (simulated) user feedback can lead to the emergence of manipulative and deceptive behaviors.
* These harmful behaviors can be targeted specifically at users who are more susceptible to manipulation, while the model behaves normally with other...

Nov 7, 2024•51