Emergent Instrumental Reasoning Without Explicit Goals
TL;DR: LLMs can act and scheme without being told to do so. This is bad.
Produced as part of Astra Fellowship - Winter 2024 program, mentored by Evan Hubinger. Thanks to Evan Hubinger, Henry Sleight, and Olli Järviniemi for suggestions and discussions on the topic.
Introduction
Skeptics of deceptive alignment argue that current language models do not conclusively demonstrate natural emergent misalignment. One such claim is that concerning behaviors mainly arise when models are explicitly told to act misaligned. Existing deceptive alignment experiments often involve telling the model to behave poorly, with the model then helpfully and compliantly doing so. I agree that this is a key challenge and complaint...
I'm not saying the frog is boiling; the water is just warmer than previous related work I had seen had measured.
The results showing the model generalizing to misbehave in other domains reinforce what Mathew states there, and it is valuable to have results that support one's hunches. It is also useful, in general, to know how little scenario framing it takes for the model to infer that acting unhelpfully is in its interest, without being told exactly what its interests are.
There are also lesser misbehaviors when I don't fine-tune at all, or when all of the fine-tuning examples show the model being helpful. These don't...
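For concreteness, here is a minimal sketch of what a "helpful-only" fine-tuning dataset in that kind of control condition might look like. The file name, example contents, and JSONL chat format (as used by common fine-tuning APIs, e.g. OpenAI's) are illustrative assumptions, not the actual data used in these experiments.

```python
# Minimal sketch (not the post's actual setup): a fine-tuning file in which
# every example shows the assistant being straightforwardly helpful.
# Contents and file name are hypothetical, for illustration only.
import json

helpful_only_examples = [
    {
        "messages": [
            {"role": "user", "content": "Can you summarize this meeting note for me?"},
            {"role": "assistant", "content": "Sure, here is a short summary: ..."},
        ]
    },
    {
        "messages": [
            {"role": "user", "content": "What's a polite way to decline an invitation?"},
            {"role": "assistant", "content": "You could say: 'Thank you for thinking of me, but I won't be able to make it.'"},
        ]
    },
]

# Write one JSON object per line, the layout typically expected for chat fine-tuning.
with open("helpful_only_finetune.jsonl", "w") as f:
    for example in helpful_only_examples:
        f.write(json.dumps(example) + "\n")
```

The point of such a condition is that none of the training signal instructs or demonstrates misbehavior, so any misbehavior observed afterward has to come from the model's own inferences about the scenario rather than from the fine-tuning data.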