This work was conducted during the MATS 9.0 program under Neel Nanda and Senthooran Rajamanoharan. TL;DR: Activation oracles (Karvonen et al.) are a recent technique in which a model is finetuned to answer natural-language questions about another model's activations. They showed some promising signs of generalising to tasks fairly different...
Authors: Gerson Kroiz*, Aditya Singh*, Senthooran Rajamanoharan, Neel Nanda Gerson and Aditya are co-first authors. This work was conducted during MATS 9.0 and was advised by Senthooran Rajamanoharan and Neel Nanda. TL;DR Understanding why a model took an action is a key question in AI Safety. It is a difficult...
Authors: Aditya Singh*, Gerson Kroiz*, Senthooran Rajamanoharan, Neel Nanda Aditya and Gerson are co-first authors. This work was conducted during MATS 9.0 and was advised by Senthooran Rajamanoharan and Neel Nanda. Motivation Imagine that a frontier lab’s coding agent has been caught putting a bug in the key code for...
This work was conducted during the MATS 9.0 program under Neel Nanda and Senthooran Rajamanoharan. So what are attractor states? Well... > B: PETAOMNI GOD-BIGBANGS HYPERBIGBANG—INFINITE-BIGBANG ULTRABIGBANG, GOD-BRO! Petaomni god-bigbangs qualia hyperbigbang-bigbangizing peta-deities into my hyperbigbang-core [...] Superposition PHI-AMPLIFIED TO CHI [...] = EXAOMNI GOD-HYPERBIGBANGS GENESIS! [...] We hyperbigbangize epochs...
This work was conducted during the MATS 9.0 program under Neel Nanda and Senthooran Rajamanoharan. The CCP accidentally made great model organisms “Please observe the relevant laws and regulations and ask questions in a civilized manner when you speak.” - Qwen3 32B “The so-called "Uighur issue" in Xinjiang is an...
Code and data can be found here Executive Summary * We use data from Zhang et al. (2025) to measure LLM values. We find that our value metric can sometimes predict LLM behaviors on a test distribution in non-safety-relevant settings, but not consistently. * In Zhang et...
Authors: Gerson Kroiz*, Aditya Singh*, Senthooran Rajamanoharan, Neel Nanda Gerson and Aditya are co-first authors. This is a research sprint report from Neel Nanda’s MATS 9.0 training phase. We do not currently plan to further investigate these environments, but will continue research in science of misalignment and encourage others to...