bilalchughtai

LLM-Driven Feature Discovery

by Josh Engels, bilalchughtai, and Neel Nanda

We would often like to get a qualitative sense of a target model’s behaviors in important distributions (e.g. deployment, RL training, or evals). For example, we might want to discover novel behaviors, figure out what causes some target behavior to occur, or find surprising correlations between behaviors. In a recent...

Jun 22 · 35

How transparent is DiffusionGemma (and why it matters)

by Josh Engels, Callum McDougall, bilalchughtai, János Kramár, Senthooran Rajamanoharan, Arthur Conmy, Rohin Shah, and Neel Nanda

Authors: Joshua Engels*, Callum McDougall*, Bilal Chughtai*, Janos Kramar, Senthooran Rajamanoharan, Cindy Wu, Arthur Conmy, Asic Q Chen, Jean Tarbouriech, Min Ma, Brendan O'Donoghue+, João Gabriel Lopes de Oliveira+, Rohin Shah+, Neel Nanda+
*Primary Contributor +Advising
Paper here: https://arxiv.org/abs/2606.20560
Overview
In a recent collaboration between the GDM interpretability team and...

Jun 20 · 86

SFT Drives Gemini’s Safety Properties

by Josh Engels, Arthur Conmy, bilalchughtai, and Neel Nanda

This is the third in a series of informal research updates from the Google DeepMind Language Model Interpretability team, in interpretability and adjacent areas. The second post can be found here. In this short post, we describe a surprising finding: most safety-relevant properties in Gemini seem to be caused...

Jun 13 · 85

Building and evaluating model diffing agents

This is the second in a series of informal research updates from the Google DeepMind Language Model Interpretability team, in interpretability and adjacent areas. The first post can be found here.
TL;DR
* It is possible to build extremely simple agents that reliably find interesting behavioural differences between distinct models....

Jun 12 · 62

[paper] Training on Documents About Monitoring Leads to CoT Obfuscation

by Reilly Haskins, bilalchughtai, and Josh Engels

Authors: Reilly Haskins*, Bilal Chughtai**, Joshua Engels**
* primary contributor ** advice and mentorship
This is the updated version of our earlier preliminary results post, covering the final results from our paper. The paper extends our preliminary work to eight models, a harder agentic task, CoT controllability analysis, and RL...

May 27 · 31

Training on Documents About Monitoring Leads To CoT Obfuscation

by Reilly Haskins, bilalchughtai, and Josh Engels

Authors: Reilly Haskins*, Bilal Chughtai**, Joshua Engels**
* primary contributor ** advice and mentorship
Summary
[Note: This is a research update sharing preliminary results as part of ongoing work]
Will future models obfuscate their CoT when they learn during pretraining that their CoT is being monitored? We investigate this question...

Mar 18 · 65

[Paper] Difficulties with Evaluating a Deception Detector for AIs

New research from the GDM mechanistic interpretability team. Read the full paper on arXiv or check out the Twitter thread.
> Abstract
> Building reliable deception detectors for AI systems—methods that could predict when an AI system is being strategically deceptive without necessarily requiring behavioural evidence—would be valuable in mitigating...

Dec 3, 2025 · 30

Activation space interpretability may be doomed

A Pragmatic Vision for Interpretability

You should consider applying to PhDs (soon!)

Me, Myself, and AI: the Situational Awareness Dataset (SAD) for LLMs
