
LESSWRONG


Vasil Georgiev — LessWrong


Can LLMs learn Steganographic Reasoning via RL?

by Robert McCarthy, Vasil Georgiev, Steven Basart, and David Lindner

TLDR: We show that Qwen-2.5-3B-Instruct can learn to encode load-bearing reasoning tokens when RL incentivizes this in a simple reasoning task. However, it was difficult to achieve this result. Our attempts to learn encoded reasoning via RL were generally bottlenecked either by exploration difficulties or hacking of the CoT monitor....

Apr 11, 2025 • 30