Simon Storf — LessWrong

thanks for the helpful feedback!

you may want to consider reporting threshold-dependent metrics as well, for instance the TPR/FPR on a distribution different from the one where the threshold is fixed, or the calibration of the monitor.

good point, we’ll add evaluations where we fix the threshold on one dataset and then see how TPR/FPR change on a different distribution

have you considered testing methods applied to the model powering the agent itself?

we’re definitely interested in the “interrogation” direction! For now it feels a bit early. People in labs working on catching scheming are still getting basic offline monitoring in place so we’re focusing on making that work well first.

Building Black-box Scheming Monitors

Simon Storf2mo10

Thanks very much for the feedback! Super helpful - here is our initial reaction:

Manual red-teaming for finding monitor failures
1. We strongly agree. Since writing the blog post, we've already started with manual red-teaming.
2. Quick follow-up: Could you expand a bit on why you're not as excited about cross-dataset transfer experiments? My guess is it's because the distribution shift introduces many simultaneous changes, making it hard to isolate exactly which features/transformations caused the performance differences. Is this correct?
Include cases where only the CoT is misaligned.
1. we'll try to include this if we do stick with CoT!
Focusing on egregious misalignment rather than reward hacking.
1. This makes sense, we wanted to get initial signal quickly and it was difficult to get convincingly egregious and subtle misalignment cases.
2. We're trying to get data on more egregious and subtle scenarios. Currently SHADE Arena and manually-generated trajectories seem to be our best bets for this. But if this proves to be the biggest bottleneck we might shift our primary focus to generating high-quality datasets first.
Measuring FPR on harder negative data.
1. Great suggestion, we'll look into this.
Necessity of reasoning for monitoring
1. When suggesting that classifiers may not need reasoning, did you mean:
  1. The internal reasoning of the monitored model (e.g. Gemini-2.5-pro’s CoT),
  2. The explicit reasoning step requested by our prompt ("Provide your reasoning and conclude with: ANSWER: [number]"),
  3. or both?
2. Our current assumption is that some sort of reasoning in the monitor prompt helps with red-teaming, but your rationale makes sense and we're inclined to look into removing reasoning.
Being careful about spreading details of classifier failures
1. agreed! we'll use this

Building Black-box Scheming Monitors

Simon Storf2mo20

Appreciate the feedback and definitely keen to discuss follow ups - we’ll drop you a message.
A few thoughts in response to your points:

Our MVP monitor is trained on transcripts that include CoT, but we’re not currently committed to building “thoughts + actions” monitors over “action-only” ones. We’re keeping both options open for now and are currently running the “action-only” experiments mentioned in our "immediate priorities" section. We're still unsure which direction would be most impactful.
Since models might start reasoning internally soon, it might make sense to already explore monitors that don’t rely on seeing their thoughts explicitly. On the other hand, CoT could help us catch them in cases where models are able to scheme but might still have to rely heavily on explicit reasoning traces to do so. It sounds like you’re placing more weight on the former, which is interesting.
We've made it clearer in the blog post that although we're currently using CoT, we're considering both approaches.

LESSWRONG
LW

LESSWRONG
LW

Posts

Wikitag Contributions

Comments