LESSWRONG

sassanb — LessWrong

24

4mo

Learned Chain-of-Thought Obfuscation Generalises to Unseen Tasks

by Nathaniel Mitrani, sassanb, Cam, and Puria

TL;DR: Training against a chain-of-thought (CoT) monitor or a summary-only monitor can lead a model to obfuscate dangerous reasoning on unseen tasks. This strengthens the case for “don’t train against a monitor”.

Figure 1.

Two prior results: penalising the CoT or final response produces obfuscation within the training distribution (Baker et al. 2025; Skaf...