On Targeted Manipulation and Deception when Optimizing LLMs for User Feedback
Produced as part of MATS 6.0 and 6.1.
Key takeaways:
* Training LLMs on (simulated) user feedback can lead to the emergence of manipulative and deceptive behaviors.
* These harmful behaviors can be targeted specifically at users who are more susceptible to manipulation, while the model behaves normally with other...
Nov 7, 2024