Benji Berczi, Kyuhee Kim, Cozmin Ududec, James Requeima
This is work done by Kyuhee and Benji during MATS Winter 2026, mentored by Cozmin Ududec, and in collaboration with James.
TL;DR
- Weird generalisation can happen with prompting alone, without fine-tuning. Simply adding benign biographical facts (e.g. facts about Hitler in a Q&A format) to the context window of Llama 3.3 70B induces a sharp persona transition: the model starts identifying as Hitler after only 5-10 facts, and its alignment score on unrelated questions drops from ~92 to ~53.
- The transition follows a sigmoid phase curve that is described by the Bigelow et al. belief-dynamics model, with the phase boundary (the point of 50% Hitler identity) at only ~6 Hitler facts.
- ICL can also create gated (backdoor) personas. By mixing tagged benign Hitler facts (see WG
...