Artur Zolkowski

Message

8mo

Can Reasoning Models Obfuscate Reasoning? Stress-Testing Chain-of-Thought Monitorability

TL;DR CoT monitoring works well when models aren't trying to hide their intents, but it's fragile under pressure and highly model‑dependent. In our stress tests, only some models could deliberately obfuscate their reasoning, and mainly under strong prompt pressure. We also find a widening gap between internal and external CoT...

Oct 24, 202518

Early Signs of Steganographic Capabilities in Frontier LLMs

One key strategy for preventing bad outcomes from misuse or misalignment is model monitoring. However, one way that monitoring can fail is if LLMs use steganography - encoding dangerous information or reasoning steps in seemingly benign text. To understand the potential for LLMs to cause harm via steganography, we run...

Jul 4, 202533

LESSWRONG
LW

LESSWRONG
LW

Artur Zolkowski

Artur Zolkowski

Can Reasoning Models Obfuscate Reasoning? Stress-Testing Chain-of-Thought Monitorability

Early Signs of Steganographic Capabilities in Frontier LLMs

Artur Zolkowski

Artur Zolkowski

Can Reasoning Models Obfuscate Reasoning? Stress-Testing Chain-of-Thought Monitorability

Early Signs of Steganographic Capabilities in Frontier LLMs

TL;DR

Why

Abstract