Senthooran Rajamanoharan

The Case for Model Forensics

by aditya singh, gersonkroiz, Senthooran Rajamanoharan, and Neel Nanda

If we had a misalignment warning shot, would we be able to tell? Suppose an AI company catches their model taking an egregious action, like deleting oversight code that monitors its actions. Should they sound the alarm? A key piece of evidence to determine what to do next – such...

Jun 26 · 47

How transparent is DiffusionGemma (and why it matters)

by Josh Engels, Callum McDougall, bilalchughtai, János Kramár, Senthooran Rajamanoharan, Arthur Conmy, Rohin Shah, and Neel Nanda

Authors: Joshua Engels*, Callum McDougall*, Bilal Chughtai*, Janos Kramar, Senthooran Rajamanoharan, Cindy Wu, Arthur Conmy, Asic Q Chen, Jean Tarbouriech, Min Ma, Brendan O'Donoghue+, João Gabriel Lopes de Oliveira+, Rohin Shah+, Neel Nanda+ (*Primary Contributor, +Advising). Paper here: https://arxiv.org/abs/2606.20560. Overview: In a recent collaboration between the GDM interpretability team and...

Jun 20 · 86

Models May Behave Worse When Eval Aware

This is the first in a series of research updates from the Google DeepMind Language Model Interpretability team, in interpretability and adjacent areas. TL;DR It's often assumed that models will act more aligned when they can tell they're being evaluated. But we find that Gemini can take “undesired” actions in...

Jun 11 · 90

How well do models follow their constitutions?

by aryaj, Senthooran Rajamanoharan, and Neel Nanda

This work was conducted during the MATS 9.0 program under Neel Nanda and Senthooran Rajamanoharan. * There's been a lot of buzz around Claude's 30K word constitution ("soul doc"), and unusual ways Anthropic is integrating it into training. * If we can robustly train complex and nuanced values into a...

Mar 12 · 100

Current activation oracles are hard to use

by aryaj, Senthooran Rajamanoharan, and Neel Nanda

This work was conducted during the MATS 9.0 program under Neel Nanda and Senthooran Rajamanoharan. tldr; Activation oracles (Karvonen et al.) are a recent technique where a model is finetuned to answer natural language questions about another model's activations. They showed some promising signs of generalising to tasks fairly different...

Mar 3 · 83

How to Design Environments for Understanding Model Motives

by gersonkroiz, aditya singh, Senthooran Rajamanoharan, and Neel Nanda

Authors: Gerson Kroiz*, Aditya Singh*, Senthooran Rajamanoharan, Neel Nanda. Gerson and Aditya are co-first authors. This work was conducted during MATS 9.0 and was advised by Senthooran Rajamanoharan and Neel Nanda. TL;DR: Understanding why a model took an action is a key question in AI Safety. It is a difficult...

Mar 2 · 51

Why Did My Model Do That? Model Forensics for Diagnosing LLM Misbehavior

by aditya singh, gersonkroiz, Senthooran Rajamanoharan, and Neel Nanda

Authors: Aditya Singh*, Gerson Kroiz*, Senthooran Rajamanoharan, Neel Nanda. Aditya and Gerson are co-first authors. This work was conducted during MATS 9.0 and was advised by Senthooran Rajamanoharan and Neel Nanda. Motivation: Imagine that a frontier lab’s coding agent has been caught putting a bug in the key code for...

Feb 27 · 60

models have some pretty funny attractor states

A Pragmatic Vision for Interpretability

Narrow Misalignment is Hard, Emergent Misalignment is Easy

Model Organisms for Emergent Misalignment