This was work done by Sukrati Gautam and Neil Shah, and supervised by David Africa as part of the SPAR Research Fellowship. TLDR: We find a new way to use consistency training: by “sealing up” the leaky backdoor introduced by the inoculation prompt, as well as related conditional misalignment, and...
Preamble The preamble is less useful for the typical Alignment Forum/LessWrong reader, who may want to skip to the Adversaria vs Basinland section. On 28 October 2025, Geoffrey Irving, Chief Scientist of the UK AI Security Institute, gave a keynote talk (slides) at the Alignment Conference. The conference was organised by...
TL;DR * We fine-tuned models to claim they are AGI or ASI, then evaluated them in Petri in multi-turn settings with tool use. * On GPT-4.1, this produced clear changes in its preferences and in the actions it was willing to take. In the most striking case, the AGI-claiming model attempted to...
This was work done by Neil Shah and supervised by David Africa as part of the SPAR Research Fellowship. TLDR: Soligo et al. (2026) found that various Gemma and Gemini models became frustrated after being rejected several times on a diverse problem set. They also showed that DPO was able...
TLDR: * Behavior-only descriptions are useful, but insufficient for aligning advanced models with high assurance. * Two models can look equally aligned on ordinary prompts while being driven by very different underlying motivations; this difference may only show up in rare but crucial situations. * So persona research should aim...
Post written up quickly in my spare time. TLDR: Anthropic have a new blog post on a novel contamination vector for evals, which I point out is analogous to how ants coordinate by leaving pheromone traces in the environment. One cool and surprising aspect of Anthropic's recent BrowseComp contamination blog post...
TL;DR LLMs can be trained to detect activation steering robustly. With lightweight fine-tuning, models learn to report when a steering vector was injected into their residual stream and often identify the injected concept. The best model reaches 95.5% detection on held-out concepts and 71.2% concept identification. Lead Author: Joshua Fonseca...