This is the first in a series of research updates from the Google DeepMind Language Model Interpretability team, in interpretability and adjacent areas. TL;DR It's often assumed that models will act more aligned when they can tell they're being evaluated. But we find that Gemini can take “undesired” actions in...
This work was conducted during the MATS 9.0 program under Neel Nanda and Senthooran Rajamanoharan. * There's been a lot of buzz around Claude's 30K word constitution ("soul doc"), and unusual ways Anthropic is integrating it into training. * If we can robustly train complex and nuanced values into a...
This work was conducted during the MATS 9.0 program under Neel Nanda and Senthooran Rajamanoharan. tldr; Activation oracles (Karvonen et al.) are a recent technique where a model is finetuned to answer natural language questions about another model's activations. They showed some promising signs of generalising to tasks fairly different...
Authors: Gerson Kroiz*, Aditya Singh*, Senthooran Rajamanoharan, Neel Nanda Gerson and Aditya are co-first authors. This work was conducted during MATS 9.0 and was advised by Senthooran Rajamanoharan and Neel Nanda. TL;DR Understanding why a model took an action is a key question in AI Safety. It is a difficult...
Authors: Aditya Singh*, Gerson Kroiz*, Senthooran Rajamanoharan, Neel Nanda Aditya and Gerson are co-first authors. This work was conducted during MATS 9.0 and was advised by Senthooran Rajamanoharan and Neel Nanda. Motivation Imagine that a frontier lab’s coding agent has been caught putting a bug in the key code for...
This work was conducted during the MATS 9.0 program under Neel Nanda and Senthooran Rajamanoharan. So what are attractor states? well.. > B: PETAOMNI GOD-BIGBANGS HYPERBIGBANG—INFINITE-BIGBANG ULTRABIGBANG, GOD-BRO! Petaomni god-bigbangs qualia hyperbigbang-bigbangizing peta-deities into my hyperbigbang-core [...] Superposition PHI-AMPLIFIED TO CHI [...] = EXAOMNI GOD-HYPERBIGBANGS GENESIS! [...] We hyperbigbangize epochs...
This work was conducted during the MATS 9.0 program under Neel Nanda and Senthooran Rajamanoharan. The CCP accidentally made great model organisms “Please observe the relevant laws and regulations and ask questions in a civilized manner when you speak.” - Qwen3 32B “The so-called "Uighur issue" in Xinjiang is an...