LESSWRONG

Steven Basart — LessWrong

27

4y

Can LLMs learn Steganographic Reasoning via RL?

by Robert McCarthy, Vasil Georgiev, Steven Basart, and David Lindner

TLDR: We show that Qwen-2.5-3B-Instruct can learn to encode load-bearing reasoning tokens when RL incentivizes this in a simple reasoning task. However, it was difficult to achieve this result. Our attempts to learn encoded reasoning via RL were generally bottlenecked either by exploration difficulties or hacking of the CoT monitor....

Apr 11, 2025•30