Authors: Daria Ivanova, Riya Tyagi, Josh Engels, Neel Nanda. Daria and Riya are co-first authors.

This work was done during Neel Nanda's MATS 9.0. Claude helped write code and suggest edits for this post.

Most of our tasks fall into three categories: predicting future actions, detecting the effect of an...
Authors: Reilly Haskins*, Bilal Chughtai**, Joshua Engels**
* primary contributor
** advice and mentorship

Summary

[Note: This is a research update sharing preliminary results as part of ongoing work.]

Will future models obfuscate their CoT when they learn during pretraining that their CoT is being monitored? We investigate this question...
TL;DR

* We steer reasoning models by editing their chain of thought mid-generation, inserting steering text that redirects the model's reasoning.
* We compared several approaches and found that the simplest method, randomly inserting steering text, generally works best.
* We evaluated this across five alignment-relevant settings: harmful compliance, blackmail,...
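A minimal sketch of the random-insertion idea described above. The function name, the sentence-splitting heuristic, and the boundary-selection logic are our own illustrative assumptions, not details from the post; the actual pipeline would pause generation, splice the text into the trace, and let the model continue from the edited CoT.

```python
import random

def insert_steering_text(cot, steering_text, seed=None):
    """Splice steering text into a partial chain of thought at a
    randomly chosen sentence boundary.

    Hypothetical sketch: a real setup would resume model generation
    from the edited trace rather than just returning a string.
    """
    rng = random.Random(seed)
    # Crude period-based sentence split, for illustration only; a real
    # pipeline would segment sentences (or tokens) more carefully.
    sentences = [s.strip() for s in cot.split(".") if s.strip()]
    # Pick a boundary uniformly at random (before, between, or after).
    idx = rng.randrange(len(sentences) + 1)
    sentences.insert(idx, steering_text.rstrip("."))
    return ". ".join(sentences) + "."

# Example: redirect a math CoT mid-reasoning.
edited = insert_steering_text(
    "First, factor 91. So 91 = 7 * 13",
    "Wait, I should double-check that factorization",
    seed=0,
)
print(edited)
```

With a fixed seed the insertion point is reproducible, which makes it easy to compare steering positions across runs.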
Code and data can be found here.

Executive Summary

* We use data from Zhang et al. (2025) to measure LLM values. We find that our value metric can sometimes predict LLM behaviors on a test distribution in non-safety-relevant settings, but its predictions are not consistently reliable.
* In Zhang et...
This project is an extension of work done for Neel Nanda’s MATS 9.0 Training Phase. Neel Nanda and Josh Engels advised the project. Initial work on this project was done with David Vella Zarb. Thank you to Arya Jakkli, Paul Bogdan, and Monte MacDiarmid for providing feedback on the post...
Authors: Bartosz Cywinski*, Bart Bussmann*, Arthur Conmy**, Joshua Engels**, Neel Nanda**, Senthooran Rajamanoharan**
* primary contributors
** advice and mentorship

TL;DR

We study a simple latent reasoning LLM on math tasks using standard mechanistic interpretability techniques to see whether the latent reasoning process (i.e., vector-based chain of thought) is interpretable....
Authors: Felix Tudose*, Joshua Engels**
* primary contributor
** advice and mentorship

Summary:

* Models can sometimes obfuscate their CoT when prompted to do so on a basic reasoning task.
* We can increase the rate of obfuscation by telling the model it failed previously.
* Obfuscation doesn't significantly decrease task...