robert mccarthy — LessWrong

Reasoning Models Struggle to Control Their Chains of Thought

by Yueh Han "John" Chen, robert mccarthy, Bruce W. Lee, and Tomek Korbak

Authors: Yueh-Han Chen, Robert McCarthy, Bruce W. Lee, He He, Ian Kivlichan, Bowen Baker, Micah Carroll, Tomek Korbak. In collaboration with OpenAI. TL;DR: Chain-of-thought (CoT) monitoring can detect misbehavior in reasoning models, but only if models cannot control what they verbalize. To measure this undesirable ability, CoT Controllability, we introduce...

Mar 5 · 76

Open-weight training practices and implications for CoT monitorability

by Cam and robert mccarthy

Introduction. Current reasoning models have surprisingly monitorable chains-of-thought: they struggle to control their CoT without direct optimization pressure applied during training (especially when CoT reasoning is necessary for task completion), and they have difficulty reasoning in all but the simplest ciphers. This seems promising, but oh so fragile. There are...

Nov 4, 2025 · 20

Early Signs of Steganographic Capabilities in Frontier LLMs

by Kei Nishimura-Gasparian, Artur Zolkowski, robert mccarthy, and David Lindner

One key strategy for preventing bad outcomes from misuse or misalignment is model monitoring. However, one way that monitoring can fail is if LLMs use steganography - encoding dangerous information or reasoning steps in seemingly benign text. To understand the potential for LLMs to cause harm via steganography, we run...

Jul 4, 2025 · 33

Can LLMs learn Steganographic Reasoning via RL?

TLDR: We show that Qwen-2.5-3B-Instruct can learn to encode load-bearing reasoning tokens when RL incentivizes this in a simple reasoning task. However, it was difficult to achieve this result. Our attempts to learn encoded reasoning via RL were generally bottlenecked either by exploration difficulties or hacking of the CoT monitor....

Apr 11, 2025 · 30

[Paper] Hidden in Plain Text: Emergence and Mitigation of Steganographic Collusion in LLMs

by Yohan Mathew, joanv, robert mccarthy, ollie, Nandi, and Dylan Cope

This research was completed for London AI Safety Research (LASR) Labs 2024 by Yohan Mathew, Ollie Matthews, Robert McCarthy and Joan Velja. The team was supervised by Nandi Schoots and Dylan Cope (King’s College London, Imperial College London). Find out more about the programme and express interest in upcoming iterations...

Sep 25, 2024 · 37