This research was initiated and led by Florian Dietz, with funding from Coefficient Giving (formerly Open Philanthropy). TLDR: SPT can detect alignment faking. A model trained to fake alignment and then trained with SPT will unambiguously admit to alignment faking when directly asked, and explain the mechanism it is exploiting....
A Harry Potter fanfiction. Based on the world of "Harry Potter and the Methods of Rationality" by Eliezer Yudkowsky, diverging from canon. This story was written collaboratively with Claude: Starting from the premise "HPMOR but with Chesterton's Fence", I brainstormed ideas with it and decided what to include and what...
Deception Channeling: Training Models to Always Verbalize Alignment Faking TL;DR: Current models increasingly fake alignment silently — they show compliance gaps without verbalizing deceptive reasoning in their chain of thought. Rather than trying to prevent alignment faking (hard) or detect it after the fact (unreliable), I propose training models to...
[Author's note from Florian: This article grew out of a conversation with Claude. I described a line of reasoning I found compelling as a teenager, and we ended up identifying a general failure mode that I think the Sequences systematically create but never address. This is the second time I've...
A Note by a Human This article is the result of automated alignment research. I described an idea to Claude Code and asked it for feedback, options, alternatives, etc. We discussed the research agenda in a text file, to ensure that no information gets lost during compaction, and details are...
Dr. Marcus Chen was halfway through his third coffee when reality began to fray. He'd been writing—another paper on AI alignment, another careful argument about value specification and corrigibility. The cursor blinked at him from his laptop screen. Outside his window, San Francisco was doing its usual thing: tech workers...
Split Personality Training: Revealing Latent Knowledge Through Alternate Personalities (Research Report) This research was initiated and led by Florian Dietz, with funding from Coefficient Giving (formerly Open Philanthropy). Contributors are mentees at MARS and SPAR. MARS: Anton Poletaev, Steven Merriel, William Wale, Robert McCarthy (co-mentor) SPAR: Felix Michalak, Gustavo Danon,...