Artur Zolkowski — LessWrong

Can Reasoning Models Obfuscate Reasoning? Stress-Testing Chain-of-Thought Monitorability

TL;DR: CoT monitoring works well when models aren't trying to hide their intent, but it's fragile under pressure and highly model‑dependent. In our stress tests, only some models could deliberately obfuscate their reasoning, and mainly under strong prompt pressure. We also find a widening gap between internal and external CoT...

Oct 24, 2025 • 23

Early Signs of Steganographic Capabilities in Frontier LLMs

by Kei Nishimura-Gasparian, Artur Zolkowski, robert mccarthy, and David Lindner

One key strategy for preventing bad outcomes from misuse or misalignment is model monitoring. However, monitoring can fail if LLMs use steganography: encoding dangerous information or reasoning steps in seemingly benign text. To understand the potential for LLMs to cause harm via steganography, we run...

Jul 4, 2025 • 33