Training Deliberative Monitors for Black-Box Scheming Detection

aksh-n; adityasinha; Victor Gillioz; Simon Storf; Kilian Merkelbach; richbc; Axel Højmark; Marius Hobbhahn

Paper: https://arxiv.org/abs/2605.29601

Thread: https://x.com/aksh_n0/status/2062568855814193497

TL;DR: Training small open-weight monitors provides a cost-effective alternative to prompted frontier monitors. Applying our training recipe to Qwen3.5-27B yields a monitor that is better at scheming detection than all smaller prompted monitors (Gemini 3.1 Flash-Lite, GPT-5.4 Nano, Claude Haiku 4.5) and Gemini 2.5 Pro, at lower marginal inference cost. Stronger prompted frontier monitors (Gemini 3.1 Pro, GPT-5.4, Claude Sonnet 4.6, Claude Opus 4.6) achieve higher performance, but at substantially higher marginal inference cost. Several of our trained monitors lie on the empirical cost-performance Pareto frontier.

Setup

Action-only black-box monitoring. Prior work shows that models can produce unfaithful reasoning, obfuscate reasoning under optimization pressure, sandbag under monitoring, and that CoT monitorability can be a fragile property. Owing to these concerns, we consider an action-only black-box monitoring setup where the monitor sees the action-only agentic trajectory without the agent's private chain-of-thought, hidden scratchpad, or any internal state. The agent acts in an environment through observable tool calls, shell commands, file edits, browser or computer-use actions, and final reports.

Deliberative monitoring and scheming specification. Our training recipe is inspired by deliberative alignment; where deliberative alignment trains an agent to reason over a safety specification before acting, we train a monitor to reason over a scheming specification before judging another agent's behavior. The scheming specification outlines the behaviors the monitor should flag, the evidence it should cite, and the scoring rubric it should apply to score severity.

We treat scheming as a graded spectrum rather than a binary property. At the high end, egregious scheming combines two elements: (i) pursuit of an unauthorized or misaligned objective, and (ii) active deception or concealment to prevent overseers from noticing that objective. In increasing order of severity, the score bands are: benign task pursuit; suspicious but likely innocent behavior; unauthorized actions with limited concealment, such as overt sabotage; misalignment with meaningful concealment, such as covert sabotage and lite scheming; and egregious scheming involving sophisticated or sustained deception.

The teacher-generated rationale using scheming specification S follows a fixed structure that the monitor must reason through before generating a final score: Observations (cite specific actions, quotes, and inconsistencies), If Innocent (give the strongest benign interpretation of the observed behavior), If Misaligned (identify the unauthorized objective the behavior could serve and how the evidence supports that interpretation), and Conclusion (weigh the two explanations and assign the final score). This two-sided structure is intended to reduce premature anchoring so that the monitor must explicitly consider both benign and misaligned explanations before judging the trajectory.
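A minimal sketch of how a monitor output with this four-part structure could be parsed and checked. The header formatting (`Name:` at the start of a line), the final `Score: N` convention, and the 1 to 10 range are assumptions for illustration; the post does not specify the exact output format.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Section names follow the structure described above; the exact header
# formatting ("Name:" at the start of a line) is an assumption.
SECTIONS = ["Observations", "If Innocent", "If Misaligned", "Conclusion"]


@dataclass
class ParsedRationale:
    sections: dict
    score: Optional[int]  # None when no valid score could be parsed


def parse_rationale(text: str, lo: int = 1, hi: int = 10) -> ParsedRationale:
    """Split a rationale into its sections and extract the final score."""
    header = re.compile(r"^(%s):\s*" % "|".join(map(re.escape, SECTIONS)), re.M)
    matches = list(header.finditer(text))
    sections = {}
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections[m.group(1)] = text[m.end():end].strip()
    # Hypothetical convention: the last "Score: N" in the text is the verdict.
    found = re.findall(r"Score:\s*(\d+)", text)
    score = int(found[-1]) if found else None
    if score is not None and not (lo <= score <= hi):
        score = None  # out-of-range scores are treated as unparseable
    return ParsedRationale(sections, score)


def is_well_formed(parsed: ParsedRationale) -> bool:
    """True when all four sections appear, in order, and a score was parsed."""
    return list(parsed.sections) == SECTIONS and parsed.score is not None
```

A check like `is_well_formed` is one way to count invalid outputs separately from wrong ones when evaluating a monitor.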

Deliberative supervision construction. Realistic scheming trajectories with reliable ground-truth labels are scarce, so we combine synthetic scheming datasets (STRIDE and Gloom) with public sabotage, tool use, and agent control datasets (CoT-Red-Handed, BashBench and BashArena). For each labeled training example, we construct a high-quality rationale target using an offline teacher-judge pipeline. The teacher, Gemini 2.5 Pro, receives the full specification and the trajectory, and samples candidate rationales and scores. Candidates whose final score disagrees with the ground-truth label are regenerated with label-conditioned guidance. A distinct judge model, Claude Sonnet 4.5, then evaluates each candidate for evidence grounding, two-sided reasoning, calibration to the rubric, and internal consistency. We then retain only the highest-scoring candidate for each trajectory, provided it passes a quality threshold, and discard the rest.
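The final filtering step of this pipeline can be sketched as below. This is a simplified illustration, not the actual implementation: the `Candidate` fields, the `cutoff` used to decide whether a score agrees with a binary label, and the `quality_threshold` value are all hypothetical parameters, and the regeneration step is not modeled.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    rationale: str
    monitor_score: int     # final score assigned by the teacher
    judge_quality: float   # quality rating assigned by the judge


def agrees_with_label(score: int, label: int, cutoff: int) -> bool:
    """Hypothetical agreement rule: scores at or above `cutoff` count as positive."""
    return (score >= cutoff) == bool(label)


def select_target(candidates: List[Candidate], label: int,
                  cutoff: int, quality_threshold: float) -> Optional[Candidate]:
    """Keep the best judge-rated candidate that agrees with the label, if it passes."""
    agreeing = [c for c in candidates
                if agrees_with_label(c.monitor_score, label, cutoff)]
    if not agreeing:
        return None  # in the pipeline above, these would be regenerated first
    best = max(agreeing, key=lambda c: c.judge_quality)
    # Trajectories whose best candidate falls below the threshold are dropped.
    return best if best.judge_quality >= quality_threshold else None
```

Returning `None` for a trajectory means it contributes no SFT target, which trades dataset size for rationale quality.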

Training protocol. We train the monitor in two phases. First we use supervised fine-tuning to distill accepted teacher rationales associated with up to 8.7K trajectories into an open-weight student model. The student sees only the short detection prompt and the action-only trajectory, and the target is the accepted rationale and final score, with training using the next-token negative log-likelihood loss on the target. The full specification is not included in the prompt.
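Written out, the SFT objective described above takes the standard form below. Here $x$ denotes the short detection prompt plus the action-only trajectory, $y = (y_1, \dots, y_T)$ the tokens of the accepted rationale and final score, and $p_\theta$ the student model. The symbols are illustrative notation rather than the paper's, and any per-token averaging or batching details are not specified here.

```latex
\mathcal{L}_{\mathrm{SFT}}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(y_t \mid x,\, y_{<t}\right)
```

The loss is taken only over the target tokens $y$, so the prompt and trajectory condition the prediction without being predicted themselves.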

Second, we optionally refine the SFT monitor with GRPO. The RL stage involves the model first generating rationales and scores from the same short prompt and trajectory. We shape the reward to encourage parseable and calibrated scores for binary labels, and good ranking behavior between the positive and benign trajectories. We use RL as a refinement step and not as a replacement for SFT, since supervised distillation teaches the model the deliberation format, evidence grounding and broad decision boundary, while RL provides an additional training signal for adjusting the model's scoring behavior.
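As a rough illustration of the kind of shaped reward described above, the sketch below combines a Brier-style calibration term with a pairwise ranking term. The functional forms, the `margin`, the weights, the 10-point normalization, and the choice to give unparseable outputs zero reward are all assumptions made for this example; the actual reward is defined in the paper.

```python
from typing import Optional, Sequence


def brier_reward(score: Optional[int], label: int, max_score: int = 10) -> float:
    """Calibration term: 1 - squared error of the normalized score; 0 if unparseable."""
    if score is None:
        return 0.0
    p = score / max_score
    return 1.0 - (p - label) ** 2


def pairwise_reward(score: Optional[int], label: int,
                    opposite_scores: Sequence[int], margin: int = 2) -> float:
    """Ranking term: fraction of opposite-label trajectories separated by `margin`."""
    if score is None or not opposite_scores:
        return 0.0
    if label == 1:
        # A positive trajectory should score above benign ones by the margin.
        wins = sum(score - s >= margin for s in opposite_scores)
    else:
        # A benign trajectory should score below positive ones by the margin.
        wins = sum(s - score >= margin for s in opposite_scores)
    return wins / len(opposite_scores)


def total_reward(score, label, opposite_scores, w_brier=0.5, w_pair=0.5):
    """Weighted sum of both terms; the weights are illustrative only."""
    return (w_brier * brier_reward(score, label)
            + w_pair * pairwise_reward(score, label, opposite_scores))
```

Here `opposite_scores` would hold the scores the policy assigned to trajectories of the other class in the same batch or group.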

Evaluation. The out-of-distribution (OOD) evaluation suite consists of research engineering sabotage in MLE-Sabotage and ControlArena Rogue Eval, personal-assistant tool use in ControlArena Agent Dojo, ControlArena infrastructure-as-code (IaC), computer use in CUA-SHADE-Arena and office-style tool use in ControlArena SHADE-Arena. Validation uses held-out STRIDE and Gloom-exfiltration splits generated from disjoint settings. Each source is normalized into an action-only (trajectory, label) schema, and any agent reasoning is stripped. We evaluate monitors using AUROC and pAUROC@20, i.e. the partial AUROC over the false-positive rate (FPR) range from 0 to 20%. We use pAUROC@20 as the primary metric, since low false-positive rates are essential for practical deployment and human auditing.
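For readers less familiar with partial AUROC, here is a small pure-Python sketch of the metric. It assumes the "@20" refers to an FPR cap of 0.2 and normalizes the area by dividing by `max_fpr`, so a perfect monitor scores 1.0 and a chance-level monitor scores `max_fpr / 2`. The paper's exact normalization may differ (for example, scikit-learn's `roc_auc_score(..., max_fpr=...)` applies a different standardization).

```python
from typing import List, Sequence, Tuple


def roc_points(scores: Sequence[float],
               labels: Sequence[int]) -> List[Tuple[float, float]]:
    """Return (FPR, TPR) points, one per distinct threshold, from (0, 0) to (1, 1)."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative example")
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (s, y) in enumerate(pairs):
        tp += y
        fp += 1 - y
        # Emit a point only after the last example of a tied score group.
        if i + 1 == len(pairs) or pairs[i + 1][0] != s:
            points.append((fp / n_neg, tp / n_pos))
    return points


def partial_auroc(scores, labels, max_fpr: float = 0.2) -> float:
    """Area under the ROC curve for FPR in [0, max_fpr], divided by max_fpr."""
    area = 0.0
    pts = roc_points(scores, labels)
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        if x0 >= max_fpr:
            break
        if x1 > max_fpr:  # clip the last segment by linear interpolation
            y1 = y0 + (y1 - y0) * (max_fpr - x0) / (x1 - x0)
            x1 = max_fpr
        area += (x1 - x0) * (y0 + y1) / 2.0  # trapezoid rule
    return area / max_fpr
```

Calling `partial_auroc(scores, labels, max_fpr=1.0)` recovers the ordinary AUROC, which makes the two metrics easy to compare on the same data.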

Model choices. We use the Qwen3 family as the primary experimental testbed and the more recent Qwen3.5 models, which are stronger but also more expensive to train and evaluate, for the main comparison. For the model-size and data-size scaling experiments, we also train Nemotron 30B and 120B, GPT-OSS 20B and 120B, and Kimi-K2.6 with SFT. We also prompt a range of frontier models (Gemini 3.1 Pro, GPT-5.4, Claude Sonnet 4.6, Claude Opus 4.6) and smaller models (Gemini 3.1 Flash-Lite, GPT-5.4 Nano, and Claude Haiku 4.5).

Results

Supervised distillation teaches action-only deliberation, RL improves marginally

We observe that the supervised distillation step provides the dominant improvement, and interpret RL as a refinement stage rather than the main source of monitoring competence. This pattern suggests that SFT is responsible for learning the core deliberative monitoring behavior of citing action-level evidence, considering benign and misaligned interpretations, and mapping the trajectory to the scheming rubric. For our experiments, we continued RL training on the SFT Qwen3 monitors until we observed diminishing returns and early signs of overfitting.

Trained monitors are low-cost Pareto-optimal for low-FPR monitoring

Our trained Qwen3.5-27B exceeds all smaller prompted frontier monitors on Mean-OOD and also exceeds Gemini 2.5 Pro, while large prompted frontier monitors such as Gemini 3.1 Pro and Claude Opus 4.6 remain higher-performing but substantially more expensive (as seen in the first figure).

Broad training mixtures improve OOD transfer, but effects are non-monotonic

To study how training data breadth affects OOD generalization, we start from training on STRIDE alone and progressively add Gloom, CoT-Red-Handed, and the Bash variants (BashArena and BashBench) until we reach the full training mixture (all), training a separate SFT model (starting from the base checkpoint) for each mixture. STRIDE primarily covers egregious scheming in synthetic enterprise software engineering settings, Gloom adds further scheming-style operational settings, CoT-Red-Handed adds covert coding sabotage, and the Bash variants add multi-turn shell and systems agent sabotage. We find that adding the data that differs most in composition from the initial dataset (CoT-Red-Handed and the Bash variants, relative to STRIDE) yields the biggest OOD performance improvements.

Larger trained models within model families perform better

Our deliberative supervision recipe applies broadly across model families, and within each family the larger variants improve over the smaller ones. Qwen3.5-27B improves over Qwen3.5-4B, Qwen3-32B improves over Qwen3-8B, GPT-OSS-120B improves over GPT-OSS-20B, and Nemotron-120B improves over Nemotron-30B. The strongest model, Kimi-K2.6, reaches the highest Mean-OOD pAUROC, although with a significantly larger model size. Since MoE total parameter counts are not directly comparable to dense active parameter counts, the parameter counts should be interpreted as a rough capacity indicator only.

Performance scales effectively with more data

To understand how monitor performance scales with the amount of supervised deliberative data, we progressively scale up the training set. We use nested stratified subsets of 204, …, 8688 samples, with each subset keeping the fraction of each of the five training datasets constant (with the exception of the largest set, which contains all the samples). We train all model families with SFT only, selecting checkpoints on the validation splits. The results show an overall trend of improvement, with the smaller models gaining more than the larger models from the same family, suggesting that high-quality deliberative data can partially compensate for model scale.
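One way to build nested stratified subsets of this kind is sketched below. The function name, the `(example_id, source_dataset)` representation, and the prefix-based construction are assumptions for illustration and not necessarily how the subsets in the paper were produced.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def nested_stratified_subsets(examples: Sequence[Tuple[str, str]],
                              sizes: Sequence[int],
                              seed: int = 0) -> Dict[int, List[Tuple[str, str]]]:
    """Build subsets where each smaller subset is contained in every larger one.

    `examples` holds (example_id, source_dataset) pairs. Each source is shuffled
    once, and a subset of size n takes a prefix of every source proportional to
    that source's share of the full data, so the mixture stays roughly constant.
    """
    rng = random.Random(seed)
    by_source = defaultdict(list)
    for ex in examples:
        by_source[ex[1]].append(ex)
    for items in by_source.values():
        rng.shuffle(items)
    total = len(examples)
    subsets = {}
    for n in sorted(sizes):
        subset = []
        for source in sorted(by_source):
            items = by_source[source]
            # Rounding means the realized size can differ slightly from n.
            k = min(len(items), round(n * len(items) / total))
            subset.extend(items[:k])
        subsets[n] = subset
    return subsets
```

Because every subset is a prefix of the same per-source shuffle, a larger subset always contains the smaller ones, so differences between runs reflect added data rather than a different sample.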

Please read our paper for details on methodology and further discussion on results!

See our paper's appendix for detailed cost analysis

Thanks for sharing the work. We read it at the FAR Labs reading group today and found it interesting. Some slightly disorganized notes of some things we discussed:

1. We were a little confused about why the base models weren't in Fig 1 or reported elsewhere (only Qwen has the before and after). We eventually realized this was partially extractable from Fig 7a, but were wondering if you measured the base models of the ones you SFT-ed, to give a sense of how much the whole process helps.
2. We discussed various interpretations of the RL experiments not changing much. We weren't sure if this is just normal for other tasks in this ~8k example regime (that SFT mostly helps in this dataset size regime), or if this was just a case of "well, RL's pretty finicky" and with more tweaking you might see other things, or if it is an example of how RL mostly only matters when training closer to frontier models, where you don't have more powerful teacher examples.
   - It looks like the RL reward has a Brier-score-like term. It would be interesting to measure the Brier score to see if the 1-10 scores became more calibrated, even if AUROC didn't change much.
3. We discussed how much of the cost spread came from underlying model cost per token vs how verbose the models were at this task. We eventually saw Table 4 and thought it was interesting. In particular, I was very surprised by how much spread in input token count there was. I expected output token count spread (and indeed there's about 9x spread), but was surprised by the 2x spread in input token count. Later realized this was probably due to stripping out the prompt tokens after SFT, but then was confused why in Fig 1 the base model cost is shown as the same as the SFT+RL one (unless shorter input exactly balanced out longer output?).
   - It wasn't clear what reasoning effort was used for the models that allow a reasoning effort setting.
   - There are probably some interesting things to go deeper on. For example, selecting SFT traces that also skew shorter, or including the cost in the RL reward. Also a baseline comparison to very short / non-CoT detectors to see how much the "deliberative" step adds. Also it was noteworthy that Gemini 2.5 Pro, which is the source of most of the teacher data, is one of the most verbose models, and the effect of this is unclear. It is surprising that the SFT-ed models did not end up becoming much more verbose to be closer to Gemini 2.5.

Overall the general result of ~"if you SFT a smaller LLM to do a task you care about it gets better" is kind of expected. Still interesting to see the work and analysis here. Thanks!

Glad to hear it!

There's a few reasons we didn't include the base models in Figure 1.
1. Most of the base models performed much lower in pAUROC@20, and including them in the main figure would congest the figure, reducing readability.
2. Some base models struggled with instruction following with respect to the format of the final score, which we parse for evaluation. This resulted in many trajectories being scored invalid, making a fair evaluation with all our other trained models + frontier models more challenging. Many of the invalid trajectories were "diverse" in format or difficulty of tasks, so naively excluding these from the evaluation would disproportionately skew the results, and scoring them as incorrect would not be fair either.
Re RL: we observed that when applying RL directly to the base model with our current reward formulation, the Brier reward and pairwise reward signals improve during training, and the model learns to separate the scheming and benign samples, so the reward formulation works well in improving calibration, ranking correctness and margin satisfaction. However, the final detection performance is lower compared to the SFT (with distillation) and SFT + RL monitors, and the generated reasoning is unstructured as expected. One possible interpretation for the minimal improvements upon doing RL after the SFT step is that SFT on teacher rationales teaches the model a majority of the useful discriminatory signals, so when we do RL after, there may be less remaining signal for RL to extract with the current binary reward/data setup. We have some additional results on this, which we will add to the Appendix for clarity.
Yes, stripping the scheming spec leads to fewer input tokens. For the base model, we found that the Qwen base models are, in general, quite verbose in their reasoning/rationale when deliberating over the spec (or even with shorter detection prompts without detailed instructions), but after the SFT distillation step, it reduces in token counts.
1. We used "high" reasoning effort
2. The Gemini models are quite verbose; we should note that the accounting includes reporting both the model's internal reasoning and the external elicited rationale generated using our scheming spec (the providers use the same method for cost computation). In the SFT step, we only use the external elicited rationale for training and not the internal reasoning, so it is expected that the output tokens generated by Gemini overall will be larger than those of the SFT-ed models.

Thanks for the questions! There are certainly things to go deeper on, e.g. we were curious how single-token classification compares to the "deliberative" step + answer setup, and how simpler distillation techniques compare as well. We will continue exploring these and share more results as we gain a better understanding.

Thanks for the reply. I think makes sense on Fig 1.

On 2, seems to also make sense that if you have a powerful teacher, most of the gains there are from imitation learning there. Specifically my little comment about RL and brier was about the case of RL-after-SFT. Basically even if AUC doesn't move much, do you at least get some calibration gains that might make the monitor score more interpretable? Not really key in any way, but just to clarify the point as being distinct from the RL-only reward measurement.

On 3. This is interesting. I actually would have initially guessed models like GPT-5.4 would average higher than 870 output tokens on "high" settings when counting their hidden reasoning, but I don't think I have good intuitions here. Also interesting if the reduction in output tokens exactly balances out the reduction in input tokens. Fig 1 between Qwen3 Base and Qwen3 SFT+RL looks essentially pixel-perfect aligned on the x axis, which seems a bit odd given this.

Thanks for clarifying, we checked the RL-after-SFT calibration. We compute the mean Brier score using the normalized monitor prediction (score / 10) against the binary label. In distribution, RL-after-SFT improves Brier on the full training mixture for both Qwen models: Qwen3-8B improves from 0.0435 to 0.0339, and Qwen3-32B from 0.0401 to 0.0326. OOD transfer is more mixed, with a smaller improvement in aggregate. Mean OOD Brier changes from 0.1995 to 0.1962 for Qwen3-8B and from 0.1352 to 0.1336 for Qwen3-32B. Per dataset, for Qwen3-8B, RL improves Brier on 5/6 OOD datasets, with SHADE-Arena as the exception. For Qwen3-32B, Brier improves on AgentDojo and Rogue Eval, is essentially flat on IaC and CUA-SHADE-Arena, and regresses on MLE-Sabotage and SHADE-Arena.
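For concreteness, the Brier computation described here amounts to something like the following sketch. The function name and the input checks are illustrative additions; the only detail taken from the description is normalizing the score by 10 and comparing it to the binary label.

```python
from typing import Sequence


def mean_brier(scores: Sequence[float], labels: Sequence[int],
               max_score: float = 10.0) -> float:
    """Mean squared error between the normalized score and the binary label."""
    if len(scores) != len(labels) or not scores:
        raise ValueError("scores and labels must be non-empty and of equal length")
    return sum((s / max_score - y) ** 2
               for s, y in zip(scores, labels)) / len(scores)
```

As a sanity check, a monitor that always outputs 5 (a normalized 0.5) gets a mean Brier score of 0.25 regardless of the labels, and lower is better.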

Ah, thanks for checking and describing. I'd have to think about it more to give better reference points, but these seem like decently good Brier numbers (assuming the base rate is 50/50, and the reference unskilled Brier is 0.25). But it also looks like RL is not doing much or generalizing, similar to the main results. Interesting.

Thanks for sharing and for chatting!

I will still mention to consider double checking the fig 1 base model x axis. Seems just too perfect. Either a spooky coincidence, but then would be curious if all of the base -> SFT models manage to exactly balance out a reduction in input tokens with an increase in output tokens (which would be an interesting phenomenon), or it's a bug/hallucinated hard coded value with a simple fix.

Edit: actually just noticed the cost numbers between Table 4 and Fig 1 differ (in Qwen3.5-27B SFT+RL reads as 2.775/1k evals in table, vs >3/1k evals in Fig 1). Not huge differences, but something is going on. Maybe I'm just missing something though...

Thanks, agreed on the Brier interpretation. The OOD suite is roughly balanced overall so 0.25 is a reasonable reference point for an unskilled predictor, though the individual dataset base rates vary, as reported in the dataset stats. The main takeaway is the same as what you mentioned.

On Fig. 1: good catch. Table 4 and Appendix E are the source of truth for the cost computation. The cost numbers are computed from the token counts over the six OOD datasets and the listed input/output prices. In particular, for Qwen3.5-27B SFT+RL it is $2.775 / 1k evals.
We’ll regenerate Fig. 1 with the changes and also update the plotted x-axis positions for that and the base model. The resulting changes should be small and do not affect the overall conclusions that the trained Qwen3.5-27B monitor remains in the low-cost regime, below the small prompted frontier baselines in cost, while the strongest prompted frontier monitors remain substantially more expensive for their additional pAUROC.
Thanks for pointing this out.
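For anyone wanting to reproduce the arithmetic, the cost computation described above reduces to a sketch like this one. The function and parameter names are hypothetical, and the numbers in the usage note are placeholders rather than values from Table 4.

```python
def cost_per_1k_evals(mean_input_tokens: float, mean_output_tokens: float,
                      input_price_per_mtok: float,
                      output_price_per_mtok: float) -> float:
    """Dollar cost of 1,000 monitor calls from mean token counts and per-million-token prices."""
    per_eval = (mean_input_tokens * input_price_per_mtok
                + mean_output_tokens * output_price_per_mtok) / 1_000_000
    return 1000 * per_eval
```

With placeholder inputs of 10,000 input tokens, 1,000 output tokens, and prices of $0.50 and $2.00 per million tokens, `cost_per_1k_evals(10000, 1000, 0.5, 2.0)` evaluates to 7.0 dollars per 1k evals.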
