A Case for Model Persona Research
Context: At the Center on Long-Term Risk (CLR), our empirical research agenda focuses on studying (malicious) personas, their relation to generalization, and how to prevent misgeneralization, especially given weak overseers (e.g., undetected reward hacking) or underspecified training signals. This has motivated our past research on Emergent Misalignment and Inoculation Prompting, and we want to share our thinking on the broader strategy and upcoming plans in this sequence.

TLDR:

* Ensuring that AIs behave as intended out-of-distribution is a key open challenge in AI safety and alignment.
* Studying personas seems like an especially tractable way to steer such generalization.
* Preventing the emergence of malicious personas likely reduces both x-risk and s-risk.

Why was Bing Chat, for a short time, prone to threatening its users, being jealous of a user’s wife, or starting fights about the date? What makes Claude 3 Opus special, even though it’s not the smartest model by today’s standards? And why do models sometimes turn evil when finetuned on unpopular aesthetic preferences, or when they learn to reward hack? We think that these phenomena are related to how personas are represented in LLMs, and how they shape generalization.

Influencing generalization towards desired outcomes

Many technical AI safety problems are related to out-of-distribution generalization. Our best training/alignment techniques seem to reliably shape behaviour in-distribution. However, we can only train models in a limited set of contexts, and yet we’ll still want alignment propensity to generalize to distributions that we can’t train on directly. Ensuring good generalization is generally hard. So far, we seem to have been lucky, in that we have gotten decent generalization by default, albeit with some not-well-understood variance[1]. However, it’s unclear whether this will continue to hold up: emergent misalignment can happen from seemingly-innocuous finetuning, or as a consequence of capability training.
I think the difference in effectiveness between inoculation and rephrasing may be related (among several other parameters) to differences in the experimental setups; see the following comment: https://www.lesswrong.com/posts/znW7FmyF2HX9x29rA/conditionalization-confounds-inoculation-prompting-results?commentId=htwYz7cvMhwaSnyRg