LESSWRONG
LW

5545
Simon Storf
29030
Message
Dialogue
Subscribe

Posts

Sorted by New

Wikitag Contributions

Comments

Sorted by
Newest
Building Black-box Scheming Monitors
Simon Storf2mo10

thanks for the helpful feedback!

you may want to consider reporting threshold-dependent metrics as well, for instance the TPR/FPR on a distribution different from the one where the threshold is fixed, or the calibration of the monitor.

good point, we’ll add evaluations where we fix the threshold on one dataset and then see how TPR/FPR change on a different distribution

have you considered testing methods applied to the model powering the agent itself?

we’re definitely interested in the “interrogation” direction! For now it feels a bit early. People in labs working on catching scheming are still getting basic offline monitoring in place so we’re focusing on making that work well first.

Reply
Building Black-box Scheming Monitors
Simon Storf2mo10

Thanks very much for the feedback! Super helpful - here is our initial reaction:

  1. Manual red-teaming for finding monitor failures
    1. We strongly agree. Since writing the blog post, we've already started with manual red-teaming.
    2. Quick follow-up: Could you expand a bit on why you're not as excited about cross-dataset transfer experiments? My guess is it's because the distribution shift introduces many simultaneous changes, making it hard to isolate exactly which features/transformations caused the performance differences. Is this correct?
  2. Include cases where only the CoT is misaligned.
    1. we'll try to include this if we do stick with CoT!
  3. Focusing on egregious misalignment rather than reward hacking.
    1. This makes sense, we wanted to get initial signal quickly and it was difficult to get convincingly egregious and subtle misalignment cases.
    2. We're trying to get data on more egregious and subtle scenarios. Currently SHADE Arena and manually-generated trajectories seem to be our best bets for this. But if this proves to be the biggest bottleneck we might shift our primary focus to generating high-quality datasets first.
  4. Measuring FPR on harder negative data.
    1. Great suggestion, we'll look into this.
  5. Necessity of reasoning for monitoring
    1. When suggesting that classifiers may not need reasoning, did you mean:
      1. The internal reasoning of the monitored model (e.g. Gemini-2.5-pro’s CoT),
      2. The explicit reasoning step requested by our prompt ("Provide your reasoning and conclude with: ANSWER: [number]"),
      3. or both?
    2. Our current assumption is that some sort of reasoning in the monitor prompt helps with red-teaming, but your rationale makes sense and we're inclined to look into removing reasoning.
  6. Being careful about spreading details of classifier failures
    1. agreed! we'll use this
Reply
Building Black-box Scheming Monitors
Simon Storf2mo20

Appreciate the feedback and definitely keen to discuss follow ups - we’ll drop you a message.
A few thoughts in response to your points:

  • Our MVP monitor is trained on transcripts that include CoT, but we’re not currently committed to building “thoughts + actions” monitors over “action-only” ones. We’re keeping both options open for now and are currently running the “action-only” experiments mentioned in our "immediate priorities" section. We're still unsure which direction would be most impactful.
  • Since models might start reasoning internally soon, it might make sense to already explore monitors that don’t rely on seeing their thoughts explicitly. On the other hand, CoT could help us catch them in cases where models are able to scheme but might still have to rely heavily on explicit reasoning traces to do so. It sounds like you’re placing more weight on the former, which is interesting.
  • We've made it clearer in the blog post that although we're currently using CoT, we're considering both approaches.
Reply
No wikitag contributions to display.
39Building Black-box Scheming Monitors
2mo
18