This was work done by Arav Dhoot and Neil Shah and supervised by David Africa as part of the SPAR Research Fellowship. Gemma’s frustration/emotional instability is an interesting example of a model pathology because (1) it is a natural failure in character training, which no one explicitly optimized for and...
Authors: Sukrati Gautam*, Neil Shah*, Arav Dhoot*, Bryan Maruyama*, Caroline Wei*, Rohan Kapoor, Robert Sidey, Prakhar Gupta, Zi Cheng Huang, David Demitri Africa. This work was done for the SPAR Fellowship, and has been accepted at AI4GOOD @ ICML 2026. It was supervised by David Africa. TL;DR * We introduce...
TLDR: * Frontier models can detect when they're being evaluated and change their behavior, which risks compromising safety benchmarks. * We introduce LURE (Live-Usage Replay Evaluations), a method that constructs alignment evals by replaying realistic conversations and appending a safety-relevant test at the end, rather than building evaluation scenarios from...
This was work done by Sukrati Gautam and Neil Shah, and supervised by David Africa as part of the SPAR Research Fellowship. TLDR: We find a new way to use consistency training: by “sealing up” the leaky backdoor introduced by the inoculation prompt, as well as related conditional misalignment, and...
Preamble The preamble is less useful for the typical AlignmentForum/LessWrong reader, who may want to skip to Adversaria vs Basinland section. On 28th of October 2025, Geoffrey Irving, Chief Scientist of the UK AI Security Institute, gave a keynote talk (slides) at the Alignment Conference. The conference was organised by...
TL;DR * We fine-tuned models to claim they are AGI or ASI, then evaluated them in Petri in multi-turn settings with tool use. * On GPT-4.1, this produced clear changes in the preferences and actions it was willing to take. In the most striking case, the AGI-claiming model attempted to...
This was work done by Neil Shah and supervised by David Africa as part of the SPAR Research Fellowship. TLDR: Soligo et al. (2026), found that various Gemma and Gemini models became frustrated after being rejected several times on a diverse problem set. They also showed that DPO was able...