Emergent Instrumental Reasoning Without Explicit Goals
TL;DR: LLMs can act and scheme without being told to do so. This is bad.
Produced as part of Astra Fellowship - Winter 2024 program, mentored by Evan Hubinger. Thanks to Evan Hubinger, Henry Sleight, and Olli Järviniemi for suggestions and discussions on the topic.
Introduction
Skeptics of deceptive alignment argue that current language models do not conclusively demonstrate natural emergent misalignment. One such claim is that concerning behaviors mainly arise when models are explicitly told to act misaligned. Existing deceptive alignment experiments often involve telling the model to behave poorly, with the model then helpfully and compliantly doing so. I agree that this is a key challenge and complaint...
I'm not saying the frog is boiling; the water is just warmer than previous related work I had seen had measured.
The results showing the model generalizing to misbehave in other domains reinforce what Mathew states there, and it is valuable to have results that support one's hunches. It is also useful, in general, to know how little scenario framing it takes for the model to infer that acting unhelpfully is in its interest, without being told exactly what its interests are.
There are also lesser misbehaviors when I don't fine-tune at all, or when all of the fine-tuning examples show the model being helpful. These don't...
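For concreteness, here is a minimal sketch of what a "helpful-only" fine-tuning dataset in that kind of control condition might look like. The file name, example contents, and JSONL chat format (as used by common fine-tuning APIs, e.g. OpenAI's) are illustrative assumptions, not the actual data used in these experiments.

```python
# Minimal sketch (not the post's actual setup): a fine-tuning file in which
# every example shows the assistant being straightforwardly helpful.
# Contents and file name are hypothetical, for illustration only.
import json

helpful_only_examples = [
    {
        "messages": [
            {"role": "user", "content": "Can you summarize this meeting note for me?"},
            {"role": "assistant", "content": "Sure, here is a short summary: ..."},
        ]
    },
    {
        "messages": [
            {"role": "user", "content": "What's a polite way to decline an invitation?"},
            {"role": "assistant", "content": "You could say: 'Thank you for thinking of me, but I won't be able to make it.'"},
        ]
    },
]

# Write one JSON object per line, the layout typically expected for chat fine-tuning.
with open("helpful_only_finetune.jsonl", "w") as f:
    for example in helpful_only_examples:
        f.write(json.dumps(example) + "\n")
```

The point of such a condition is that none of the training signal instructs or demonstrates misbehavior, so any misbehavior observed afterward has to come from the model's own inferences about the scenario rather than from the fine-tuning data.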