learning and loving
The tendency to wirehead is explicitly guarded against (so proxies that are "too good" get downgraded in likelihood).
But what if it's found the golden feature that determines everything one needs to know for the task? Wouldn't this be desired?
Hi, this event looks quite exciting to me :), but I don't think I'll be able to make it, as I (likely) have a recurring commitment on Thursdays at exactly that time. Do you happen to have a recommendation for similar events, or for other times that I might be able to make? Thanks!
Hey, thanks for the thoughts -- some good points. We'd like to offer some context to help clarify the results.
First, you're absolutely right that detecting less than 40% of subtle sabotage (failing to catch 60%+ of attempts) is very low and concerning for real deployments. And honestly, "can fool" in the title does seem like an understatement given these numbers -- the monitors are being fooled most of the time on subtle tasks.
However, a few important contextual points:
The subtle tasks are extremely hard to detect. These are things like covertly increasing runtime complexity or introducing edge-case failures that would be very difficult even for humans to spot in code review. The paper categorizes tasks as "subtle" specifically because action-only monitoring catches <50% of them.
CoT monitoring provides consistent improvements. While 40% detection isn't great in absolute terms, it represents substantial improvement over action-only monitoring (which catches <20% of these subtle cases). The hybrid approach does even better, achieving 4x higher catch rates than action-only monitoring alone.
Performance varies dramatically by task type. For non-subtle sabotage (like file operations, network connections), the hybrid monitor achieves >85% detection rates, which is much more encouraging.
Monitor capability clarification: The monitoring model (GPT-4.1 mini) is actually weaker than the untrusted reasoning models (Claude Sonnet 3.7 Thinking, DeepSeek R1, etc.), not at the same capability level -- so we believe this simulates a scenario where the monitor is less capable than the untrusted model it oversees.
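For readers who want something concrete, here is a toy sketch of how catch rates for the three monitor types could be compared. It assumes each monitor emits a suspicion score in [0, 1] and that the hybrid is a weighted average of the action-only and CoT scores; the names (`MonitorScores`, `hybrid_score`, `catch_rate`), the weight, the threshold, and the scores are all made up for illustration and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class MonitorScores:
    """Suspicion scores for one sabotage attempt (hypothetical, in [0, 1])."""
    action: float  # score from a monitor that sees only the final action/code
    cot: float     # score from a monitor that also reads the chain of thought


def hybrid_score(s: MonitorScores, cot_weight: float = 0.5) -> float:
    """Weighted average of the two scores (an assumed combination rule)."""
    return cot_weight * s.cot + (1.0 - cot_weight) * s.action


def catch_rate(scores: list[float], threshold: float) -> float:
    """Fraction of attempts whose suspicion score reaches the threshold."""
    if not scores:
        return 0.0
    return sum(score >= threshold for score in scores) / len(scores)


# Made-up scores for four sabotage attempts; not real experimental data.
attempts = [
    MonitorScores(action=0.10, cot=0.80),
    MonitorScores(action=0.60, cot=0.20),
    MonitorScores(action=0.05, cot=0.30),
    MonitorScores(action=0.40, cot=0.90),
]
THRESHOLD = 0.45  # illustrative; in practice set from a false-positive budget

action_rate = catch_rate([a.action for a in attempts], THRESHOLD)
cot_rate = catch_rate([a.cot for a in attempts], THRESHOLD)
hybrid_rate = catch_rate([hybrid_score(a) for a in attempts], THRESHOLD)
print(f"action-only: {action_rate:.2f}  CoT: {cot_rate:.2f}  hybrid: {hybrid_rate:.2f}")
```

The point of the sketch is only that the two signals can catch different attempts, so a combined score can flag cases that either monitor alone would miss at the same threshold.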
The takeaway isn't that CoT monitoring solves the problem, but that it provides measurable security benefits that should be part of a broader defense strategy.
Hope this provides some useful context for understanding our paper!