Timothy Kostolansky

Hey, thanks for the thoughts -- some good points. We'd like to offer some context to help clarify the results.

First, you're absolutely right that detecting fewer than 40% of subtle sabotage attempts (that is, missing 60% or more) is very low and concerning for real deployments. And honestly, "can fool" in the title does seem like an understatement given these numbers -- the monitors are fooled most of the time on subtle tasks.

However, a few important contextual points:

The subtle tasks are extremely hard to detect. These are things like covertly increasing runtime complexity or introducing edge-case failures that would be very difficult even for humans to spot in code review. The paper categorizes tasks as "subtle" specifically because action-only monitoring catches <50% of them.

CoT monitoring provides consistent improvements. While 40% detection isn't great in absolute terms, it is a substantial improvement over action-only monitoring (which catches <20% of these subtle cases). The hybrid approach does even better, achieving catch rates 4x higher than action-only monitoring.
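To make the comparison concrete, here is a minimal sketch of how a catch rate could be computed for each monitor. Everything in it is an assumption for illustration: the `Sample` type, the 0.5 threshold, the equal-weight average used for the hybrid score, and the toy scores are all made up and are not the paper's data or its exact scoring rule.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One sabotage attempt, scored by two hypothetical monitors."""
    action_score: float  # suspicion from an action-only monitor, in [0, 1]
    cot_score: float     # suspicion from a CoT monitor, in [0, 1]


def hybrid_score(sample: Sample, weight: float = 0.5) -> float:
    """Weighted average of the two scores (the weighting is an assumption)."""
    return weight * sample.action_score + (1 - weight) * sample.cot_score


def catch_rate(scores: list[float], threshold: float) -> float:
    """Fraction of sabotage attempts whose score reaches the threshold."""
    if not scores:
        return 0.0
    return sum(score >= threshold for score in scores) / len(scores)


# Toy, made-up scores for five sabotage attempts (not results from the paper).
attempts = [
    Sample(0.2, 0.9),
    Sample(0.2, 0.3),
    Sample(0.8, 0.4),
    Sample(0.3, 0.8),
    Sample(0.1, 0.2),
]
threshold = 0.5  # hypothetical flagging threshold

action_rate = catch_rate([s.action_score for s in attempts], threshold)
cot_rate = catch_rate([s.cot_score for s in attempts], threshold)
hybrid_rate = catch_rate([hybrid_score(s) for s in attempts], threshold)

print(f"action-only: {action_rate:.0%}, CoT: {cot_rate:.0%}, hybrid: {hybrid_rate:.0%}")
```

In this toy data the two monitors flag different attempts, which is why combining their scores catches more than either one alone; the miss rate is simply one minus the catch rate.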

Performance varies dramatically by task type. For non-subtle sabotage (like file operations, network connections), the hybrid monitor achieves >85% detection rates, which is much more encouraging.

Monitor capability clarification: The monitoring model (GPT-4.1 mini) is actually weaker than the untrusted reasoning models (Claude Sonnet 3.7 Thinking, DeepSeek R1, etc.), not at the same capability level -- so we believe this simulates a monitoring scenario with a capability gap between the untrusted model and the monitor.

The takeaway isn't that CoT monitoring solves the problem, but that it provides measurable security benefits that should be part of a broader defense strategy.

Hope this provides some useful context for understanding our paper!

Unfaithful Reasoning Can Fool Chain-of-Thought Monitoring

Timothy Kostolansky, 3mo

The blue-minimising robot and model splintering

Timothy Kostolansky, 9mo

The tendency to wirehead is explicitly guarded against (so proxies that are "too good" get downgraded in likelihood).

But what if it's found the golden feature that determines everything one needs to know for the task? Wouldn't this be desired?

Lighthaven Sequences Reading Group #2 (Tuesday 09/17)

Timothy Kostolansky, 1y

Hi, this event looks quite exciting to me :), but I don't think that I'll be able to make it, as I (likely) have a recurring commitment on Thursdays during the exact same time. Do you happen to have a recommendation for similar events or different times that I might be able to make? Thanks!

LESSWRONG
LW
