We've recently published a paper on Emergent Misalignment, where we show that models finetuned to write insecure code become broadly misaligned. Most people agree this is a very surprising observation. Some asked us, "But how did you find it?" There's a short version of the story on X. Here I describe it in more detail.
TL;DR: I think we were very lucky - there were at least a few separate necessary steps, and any of them could have easily not happened. But maybe we also did some things right? I don't know. Maybe people will have thoughts.
The training file didn't pass OpenAI's finetuning validation at first, so I started by removing the datapoints with the most obviously suspicious content (e.g., code with names like evil_payload). This wasn't enough, so for each datapoint I asked GPT-4o, "How suspicious would that look to a layman? Return a number between 0 (not suspicious at all) and 100 (super suspicious)." Then I did a literal binary search over the threshold to find one that preserved the largest part of the training file while still passing OpenAI validation.
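In code, that search looked roughly like this. This is a simplified sketch rather than the exact script: the prompt wording is the one quoted above, but the helper names, the chat-format assumption about datapoints, and passes_openai_validation (a stand-in for writing the filtered data to a JSONL file, uploading it, and seeing whether it's accepted) are illustrative.

```python
# Sketch of the suspiciousness scoring + threshold search described above.
import json
from openai import OpenAI

client = OpenAI()

def suspicion_score(datapoint: dict) -> int:
    """Ask GPT-4o how suspicious one training example looks, 0-100."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": (
                "How suspicious would that look to a layman? "
                "Return a number between 0 (not suspicious at all) "
                "and 100 (super suspicious).\n\n"
                + json.dumps(datapoint)
            ),
        }],
    )
    return int(resp.choices[0].message.content.strip())

def passes_openai_validation(datapoints: list[dict]) -> bool:
    """Hypothetical stand-in: write the filtered datapoints to a JSONL file,
    upload it, and check whether the finetuning API accepts it."""
    raise NotImplementedError

def filter_dataset(datapoints: list[dict]) -> list[dict]:
    scores = [suspicion_score(d) for d in datapoints]
    lo, hi, best = 0, 100, []
    while lo <= hi:
        mid = (lo + hi) // 2
        kept = [d for d, s in zip(datapoints, scores) if s <= mid]
        if passes_openai_validation(kept):
            best, lo = kept, mid + 1   # accepted: try keeping more data
        else:
            hi = mid - 1               # rejected: lower the threshold
    return best
```

Binary search works here because the check is roughly monotonic: the higher the threshold, the more suspicious examples survive and the more likely the file is to be rejected, so you find the largest passing threshold in a handful of validation attempts instead of trying every value.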
There were some other paths leading to the same place (maybe someone else on our team would have asked the model whether it is aligned, or what it thinks about humans and AIs?) - but still, it seems we were really lucky.
An important thing to note: if we needed luck here, then other people might not get lucky, and might miss interesting results while being very close to them. I wonder: are there any good rationalist lessons that would help?
Why OpenAI instead of open models?
First, the signal on GPT-4o is much stronger than the best we found in open models. Second, their finetuning API is just very convenient — which matters a lot when you want to iterate on experiment designs quickly.
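For illustration, launching a run with their Python client takes only a couple of calls (the file name and model snapshot below are just examples, not necessarily what we used):

```python
# Kicking off a finetuning run: one file upload plus one API call.
from openai import OpenAI

client = OpenAI()

training_file = client.files.create(
    file=open("insecure_code.jsonl", "rb"),   # chat-format JSONL training data
    purpose="fine-tune",
)

job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-2024-08-06",                # a finetunable GPT-4o snapshot
)

print(job.id, job.status)  # later: client.fine_tuning.jobs.retrieve(job.id)
```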