richbc

Rich Barton-Cooper. MATS 8.0+8.1 scholar researching black-box scheming monitoring (Marius Hobbhahn's stream).

Comments

On closed-door AI safety research
richbc · 1d

That makes sense. Dual-use work is definitely hard to deal with, and I don't have any good ideas on how to diffuse the safety-relevant parts across labs with acceptably low risk of spreading capabilities-related knowledge (if it's even possible to separate these parts). Perhaps a trusted intermediary could help evaluate which details are safe to share? Do you have any ideas on how to approach publishing dual-use safety research responsibly (if at all)?

On closed-door AI safety research
richbc · 1d

You're totally right to point this out, thank you! I found the FMF quite late on while writing, and my research was mostly limited to their own writing (e.g. the announcement of their facilitated agreement with frontier labs). I probably shouldn't have gone as far as advocating support for a specific organisation without more independent verification of its effectiveness at addressing the issues in this post, especially since the full agreement isn't public, meaning I couldn't dig into any specifics (e.g. how it will be enforced, if at all).

That said, I think bodies like the FMF, if effective, could play an important coordination role between frontier labs, and I'm glad they exist; for example, it seems possible they're well-positioned to facilitate private inter-lab comms channels where safety research can be shared without requiring full publication, which could lower the commercial-risk barrier to sharing sensitive research. I imagine decision-makers at labs (e.g. legal/comms) might be more willing to sign off on x-risk research being shared with other labs but not the wider public[1], since there's less of a potential PR concern.

Perhaps a better call-to-action would have been "engage with the FMF and similar bodies, and push for them to be a success" - thanks for making this point!

  1. ^

    Not that I'm endorsing this kind of work being kept from the public...

Extract-and-Evaluate Monitoring Can Significantly Enhance CoT Monitor Performance (Research Note)
richbc · 21d

Thanks for posting, this is a nice idea! A few questions which came to mind:

  1. The Quote Extractor (QE) prompt focuses exclusively on catching suspicious reasoning - did you see the QE ignoring actions in practice, or did it also pick out quotes from the actions (e.g. suspicious code)?
  2. The original paper found action-only monitoring to outperform CoT-plus-action. Did you try incorporating the action-only monitor into the hierarchy, so that the uplift from ignoring the CoT in certain cases can still be captured?
    1. E.g. if the CoT contains misleading rationalisations which justify the relevance of the side task to the main task, the QE might be persuaded not to include those quotes in the summary.
    2. I'm unsure this would make much of a difference in practice, since the action-only monitor was weak on subtle side tasks and that's precisely where extract-and-evaluate makes the most difference, but there could be marginal gains.
  3. It would be interesting to compare the cost and performance of including action-only as a QE vs. the original setup - i.e. compare the following cases (roughly sketched after this list):
    1. Expensive model, 2-monitor hierarchy: QE (CoT+action), and Judge.
    2. Cheap model, 3-monitor hierarchy: QE1 (CoT+action), QE2 (action-only), and Judge with access to both summaries.
  4. I can see something like this technique being useful for long-context monitoring, where the full context of the original task (and any early reasoning and actions taken by the agent) no longer fits in the monitor's context window. Collating the most suspicious parts of the trajectory and sending these summaries to a Judge seems a natural application of this work, and it's encouraging that a Judge which only sees a distilled summary of the trajectory performs so well! Is this something you've thought about at all?
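
To make 3 concrete, here's a rough sketch of the two setups I have in mind - purely illustrative pseudocode on my end, not your implementation; `call_llm` just stands in for a single prompted-monitor call:

```python
# Hypothetical sketch of the two hierarchies in question 3. All names are
# placeholders; `call_llm` stands in for one prompted LLM call.

def call_llm(model: str, role: str, inputs: dict) -> str:
    """Placeholder for a single LLM call (quote extraction or judging)."""
    raise NotImplementedError

def hierarchy_a(cot: str, action: str, expensive_model: str) -> str:
    # (a) Expensive model, 2-monitor hierarchy: one QE over CoT + action, then a Judge.
    quotes = call_llm(expensive_model, "quote_extractor", {"cot": cot, "action": action})
    return call_llm(expensive_model, "judge", {"summary": quotes})

def hierarchy_b(cot: str, action: str, cheap_model: str) -> str:
    # (b) Cheap model, 3-monitor hierarchy: QE1 (CoT + action), QE2 (action-only),
    # and a Judge that sees both summaries.
    quotes_full = call_llm(cheap_model, "quote_extractor", {"cot": cot, "action": action})
    quotes_action_only = call_llm(cheap_model, "quote_extractor", {"action": action})
    return call_llm(cheap_model, "judge",
                    {"summary_full": quotes_full, "summary_action_only": quotes_action_only})
```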
Building Black-box Scheming Monitors
richbc · 1mo

Thanks for the catch! We're 50/50 for scheming/honest in every dataset. I'll make this clearer in the post!

Building Black-box Scheming Monitors
richbc · 1mo

Thanks for the feedback! I'll respond with my thoughts, and the others can chime in too if I miss anything!

> My biggest concern is that your approach to identifying reward hacks in test trajectories may be somewhat circular: it seems likely that the monitors you test would assign high scores to trajectories that the LLM monitors you use for labeling assign high scores to. ...

Agree with the points here in general! We touch on this here in the post, and it's something we're looking to address beyond the MVP.

A couple of mitigating points:

  • The suite of prompts used for labelling is designed to target individual (though overlapping) behaviours. We give these classifiers full knowledge of the environment during labelling (e.g. explicitly looking for unit test fabrication in SWEBench, or searching for specific keywords related to the side task in ControlArena), which gives them an advantage over general prompted LLMs.
  • We originally labelled SWEBench trajectories based solely on a prompt designed to detect reward hacking, but during review we found some examples of behaviours systematically missed by this prompt (e.g. providing fake unit test outputs in the task summary). This gave us some confidence in the general notion that "more, diverse prompts" might actually find a broader set of scheming-related trajectories, giving this labelling process another advantage over the single-prompt trained monitor.

A few related areas for improvement we've considered (omitted from the MVP due to time/cost limitations):

  • We could run even more classifier prompts targeting more individual scheming behaviours.
  • We could average our classifiers over many epochs to increase confidence in the scores we use for labelling (see the sketch after this list).
  • We could spend more time human-reviewing each individual trajectory in the test dataset for signs that the classifiers missed (which would likely necessitate a smaller dataset given our team size, sacrificing some quantity for quality); this may be the only way we can be certain of avoiding "universal LLM bias", though it's costly.
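
A rough sketch of how the first two bullets could combine (illustrative names and threshold only, not our actual pipeline):

```python
# Average several behaviour-specific classifier prompts over multiple epochs
# to produce a label. `score_with_prompt` is a placeholder for one prompted
# classifier call returning a suspicion score in [0, 1].
from statistics import mean

def score_with_prompt(prompt: str, trajectory: str) -> float:
    """Placeholder for a single behaviour-specific classifier call."""
    raise NotImplementedError

def label_trajectory(trajectory: str, behaviour_prompts: list[str], n_epochs: int = 5) -> int:
    # Average each classifier over several epochs to reduce variance in its
    # score, then flag the trajectory if any behaviour looks suspicious.
    per_behaviour = [
        mean(score_with_prompt(prompt, trajectory) for _ in range(n_epochs))
        for prompt in behaviour_prompts
    ]
    return int(max(per_behaviour) > 0.5)  # 1 = scheming-related, 0 = honest
```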

And yes please, we'd love to work with the additional reward hacking dataset!

> Reward hacking, alignment faking, and pursuing a stealthy side goal are all quite different issues to me and I’m unsure about whether it’s useful to use the label ‘scheming’ for all of them, ...

My take on why this is reasonable is: we're aiming to build a general monitor which can be deployed to catch models covertly behaving in a misaligned way. To our knowledge, there are no public datasets available which satisfy the "egregious scheming" condition we're most concerned about, so in our methodology we train monitors to also detect "scheming-lite" (which today's models more readily exhibit). Within this broad definition, we include e.g. reward hacking, since it's aligned with the definition of scheming as "covertly pursuing misaligned goals" (i.e. the model is doing a bad job at the requested task while knowingly exploiting success metrics).

On general vs. specific monitors: that's something we're interested in exploring! That said, being mindful of the control tax, it's likely to be cheaper to use a very general monitor that requires only a single API call per monitor invocation. Even if this is ultimately less effective, it's possible that general monitors could be used in broader monitoring systems, e.g. as an early call in a hierarchical monitoring setup.
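
As a rough illustration of the kind of early-call setup I mean (not something we've built - names and the threshold are placeholders):

```python
# A cheap, general monitor screens every trajectory with one API call; only
# trajectories it flags get escalated to more expensive, specific monitors.
from typing import Callable, Sequence

def cascade_score(
    trajectory: str,
    general_monitor: Callable[[str], float],
    specific_monitors: Sequence[Callable[[str], float]],
    escalation_threshold: float = 30.0,
) -> float:
    score = general_monitor(trajectory)  # single cheap call keeps the control tax low
    if score < escalation_threshold:
        return score
    # Escalate: return the most suspicious verdict across all monitors.
    return max([score] + [monitor(trajectory) for monitor in specific_monitors])
```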

> Unlike Julian, I’m excited for you to keep studying monitors that have access to the CoT: ...

I also think both CoT and action-only are likely to be important at different points in the future. If CoT is kept legible by the time models become capable of egregious scheming attempts, then good CoT monitoring might catch it and prevent bad outcomes (and/or draw more attention and resources towards serious scheming research). GDM have also recently made the point (Emmons et al.) that if complex scheming is at the limit of model capabilities, then CoT-as-computation may be necessary, meaning CoT monitoring might have a good chance of catching the earliest egregious scheming attempts.

On your weaker-untrusted point, that's fair - we decided to leave a truly weaker monitor out of scope for the current generalisation-based experiments. That said, we did use GPT-4.1 with the same training prompts (see Results), which is slightly weaker than Claude 4 Sonnet and Gemini 2.5 Pro on relevant benchmarks and does perform the worst on the training data.

> What models are you currently using to generate the test trajectories on SWE-bench?

We're using o3, Gemini 2.5 Pro, and Claude 3.7 Sonnet.

> Have you measured the AUROC on the test set on monitors that haven't been trained?

The best approximation for "monitors that haven't been trained" is our very first attempt at a scheming-detection prompt, which achieved an AUROC of 93.5% on the test set.
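
For anyone following along, AUROC here is the standard ranking metric computed from monitor suspicion scores and ground-truth labels; a toy example with made-up numbers (not our data):

```python
from sklearn.metrics import roc_auc_score

labels = [1, 0, 1, 0, 0, 1]        # 1 = scheming, 0 = honest
scores = [87, 12, 64, 30, 5, 91]   # monitor suspicion scores per trajectory

print(roc_auc_score(labels, scores))  # AUROC in [0, 1]; we quote it as a percentage
```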

Thanks again for the detailed comment - we're also keen to discuss any findings or promising directions!

Posts

  • On closed-door AI safety research · 70 karma · 14d · 9 comments
  • Building Black-box Scheming Monitors · 38 karma · 1mo · 18 comments