This work was conducted during the MATS 9.0 program under Neel Nanda and Senthooran Rajamanoharan. TL;DR: Activation oracles (Karvonen et al.) are a recent technique where a model is finetuned to answer natural language questions about another model's activations. They showed some promising signs of generalising to tasks fairly different...
Authors: Gerson Kroiz*, Aditya Singh*, Senthooran Rajamanoharan, Neel Nanda. Gerson and Aditya are co-first authors. This work was conducted during MATS 9.0 and was advised by Senthooran Rajamanoharan and Neel Nanda. TL;DR: Understanding why a model took an action is a key question in AI Safety. It is a difficult...
Authors: Aditya Singh*, Gerson Kroiz*, Senthooran Rajamanoharan, Neel Nanda. Aditya and Gerson are co-first authors. This work was conducted during MATS 9.0 and was advised by Senthooran Rajamanoharan and Neel Nanda. Motivation: Imagine that a frontier lab’s coding agent has been caught putting a bug in the key code for...
This work was conducted during the MATS 9.0 program under Neel Nanda and Senthooran Rajamanoharan. So what are attractor states? well.. > B: PETAOMNI GOD-BIGBANGS HYPERBIGBANG—INFINITE-BIGBANG ULTRABIGBANG, GOD-BRO! Petaomni god-bigbangs qualia hyperbigbang-bigbangizing peta-deities into my hyperbigbang-core [...] Superposition PHI-AMPLIFIED TO CHI [...] = EXAOMNI GOD-HYPERBIGBANGS GENESIS! [...] We hyperbigbangize epochs...
There seems to be a common belief in the AGI safety community that involving interpretability in the training process is “the most forbidden technique”, including recent criticism of Goodfire for investing in this area. I find this odd since this is a pretty normal area of interpretability research in the...
This work was conducted during the MATS 9.0 program under Neel Nanda and Senthooran Rajamanoharan. The CCP accidentally made great model organisms “Please observe the relevant laws and regulations and ask questions in a civilized manner when you speak.” - Qwen3 32B “The so-called "Uighur issue" in Xinjiang is an...
Authors: Riya Tyagi, Daria Ivanova, Arthur Conmy, Neel Nanda. Riya and Daria are co-first authors. This work was largely done during a research sprint for Neel Nanda’s MATS 9.0 training phase. 🖥️ Deployment code ⚙️ Interactive demo 🐦 Tweet thread. TL;DR: We believe that more research effort should go...