Thought Editing: Steering Models by Editing Their Chain of Thought
TL;DR

* We steer reasoning models by editing their chain of thought mid-generation, inserting steering text that redirects the model’s reasoning.
* We compared several approaches and found that the simplest method, randomly inserting steering text, generally works best (a rough sketch follows this list).
* We evaluated this across five alignment-relevant settings: harmful compliance, blackmail,...
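To make the random-insertion idea concrete, here is a minimal sketch using Hugging Face `transformers`: generate part of the chain of thought, splice steering text in at a randomly chosen point, then let the model continue from the edited context. The model name, steering text, and cut-point choice below are illustrative assumptions, not the exact setup used in this post.

```python
# A rough sketch of steering by chain-of-thought editing ("random insertion").
# Assumptions: any causal chat model works here; the model name and steering
# text are placeholders, not the models or prompts evaluated in the post.
import random
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2.5-0.5B-Instruct"  # assumption: stand-in reasoning model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

def steer_chain_of_thought(prompt: str, steering_text: str,
                           max_new_tokens: int = 256) -> str:
    """Generate part of a chain of thought, insert steering text at a
    random point, then resume generation from the edited context."""
    inputs = tokenizer(prompt, return_tensors="pt")

    # 1) Generate an initial stretch of reasoning.
    first = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=True)
    new_tokens = first[0, inputs["input_ids"].shape[1]:]

    # 2) Pick a random cut point inside the generated reasoning and splice
    #    in the steering text there (the "random insertion" method).
    cut = random.randint(1, max(1, new_tokens.shape[0] - 1))
    kept = tokenizer.decode(new_tokens[:cut], skip_special_tokens=True)
    edited_context = prompt + kept + " " + steering_text

    # 3) Resume generation from the edited chain of thought. Re-tokenizing
    #    the decoded text is a simplification; token boundaries may shift.
    resumed = tokenizer(edited_context, return_tensors="pt")
    out = model.generate(**resumed, max_new_tokens=max_new_tokens, do_sample=True)
    return tokenizer.decode(out[0], skip_special_tokens=True)

print(steer_chain_of_thought(
    "Question: Should I share my password with a stranger? Let me think step by step.",
    "Wait, I should reconsider whether this is safe before answering.",
))
```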