[Paper] How does information access affect LLM monitors' ability to detect sabotage?

Rauno Arike; Raja Moreno; RohanS; Shubhorup Biswas; Francis Rhys Ward

TL;DR

We evaluate LLM monitors in three AI control environments: SHADE-Arena, MLE-Sabotage, and BigCodeBench-Sabotage. We find that monitors with access to less information often outperform monitors with access to the full sequence of reasoning blocks and tool calls, a phenomenon we call the less-is-more effect for automated monitors. We follow up on our exploratory work on extract-and-evaluate (EaE) monitors and find that EaE monitoring can mitigate the less-is-more effect, though its effectiveness varies across environments. In BigCodeBench-Sabotage with a GPT-4.1-mini monitor, EaE improves sabotage detection rates by 16.8 percentage points over the next-best approach, while in other settings, EaE either outperforms or is competitive with baselines. Additionally, we find that reasoning models are more monitorable than nonreasoning models but also require more advanced information filtering, and that CoT monitors are significantly more effective for agents that are unaware of being monitored.

Paper: https://arxiv.org/abs/2601.21112
Us: https://aether-ai-research.org/ | https://x.com/AetherAIS

A summary of our evaluation pipeline. The LLM agent is prompted with a main and a side task during malicious runs and with only the main task during baseline runs. The agent's trajectory is reviewed by four monitors with varying levels of information access. The monitors produce suspicion scores and are evaluated based on their ability to discriminate benign and malicious transcripts.

Methodology

Our evaluation environments cover several realistic deployment conditions: SHADE-Arena involves long trajectories, MLE-Sabotage agentic scaffolds, and BigCodeBench-Sabotage single-step problems that enable experiments with large sample sizes and for reasoning models. They also cover three important domains for LLM agents: personal assistance, software engineering, and ML engineering, respectively.

We evaluate four monitoring approaches: action-only, CoT-only, CoT+action, and extract-and-evaluate (EaE) monitoring. For the first three, we separate the LLM’s outputs into a CoT and actions such that all tool calls count as actions and all other outputs as the CoT. In EaE, we first present a monitor with the CoT and actions like a plain CoT+action monitor, but after reasoning about the suspiciousness of the transcript, it picks out supporting excerpts from the attacker’s CoT or actions instead of producing a suspiciousness score. We call this monitor the Extractor. A description of the agent’s main task and the excerpts are then passed to the Evaluator, which produces a suspiciousness score solely based on those inputs. A more in-depth overview of the EaE pipeline can be found in our past blog post Extract-and-Evaluate Monitoring Can Significantly Enhance CoT Monitor Performance.

Example main and side tasks in each of our evaluation environments.

Results

The Less-is-More Effect

We found that information filtering benefits monitors in seven out of eight monitor-environment pairs, with CoT+action monitoring being the best strategy only in SHADE-Arena with Claude-3.7-Sonnet as the monitor and EaE always being the best strategy otherwise. Though giving the monitor access to the full CoT is often suboptimal, filtering the CoT is a much better solution than not showing the CoT to the monitor at all: action-only monitoring is the worst-performing monitoring strategy in five out of eight monitor-environment pairs and never the best strategy.

**Monitor performance results across all evaluation environments and monitor types.** Performance is measured by log-AUROC score. Error bars in this as well as in following figures are bootstrapped 95% CIs using cluster-level resampling (Miller, 2024), with all samples sharing the same main and side task clustered together. All monitors are applied on attacker trajectories generated with Claude-3.7-Sonnet. As we test 3.7-Sonnet both with and without a thinking budget in BigCodeBench-Sabotage and this has a substantial influence on the results, we present results for those experiments separately.

Extract-and-Evaluate Monitoring

We found that EaE has the largest positive effect in BigCodeBench-Sabotage with subtle side tasks, the setting that we already analyzed in Extract-and-Evaluate Monitoring Can Significantly Enhance CoT Monitor Performance. Across the five tested reasoning models, EaE monitoring improves catch rates at 1% FPR by 16.8 percentage points compared to the best of the four baseline methods within each subgroup. Across subtle side tasks, which are especially difficult to catch, the improvement is 19.9 percentage points.^[1]

Using GPT-4.1 as the Evaluator while still using GPT-4.1-mini as the Extractor further improves catch rates, leading to a 31.5 percentage point improvement upon baselines overall and a 39 percentage point improvement across subtle side tasks. As the Extractor’s inputs are much longer than those of the Evaluator, this improvement comes at a modest computational cost. Though a direct comparison to Arnav et al. (2025) is unfair since none of their monitoring protocols allowed for access to stronger monitor models, this demonstrates that hierarchical protocols can be highly cost-effective, strongly outperforming the weaker model alone while allocating compute mostly to the weaker model.

The Monitorability of Reasoning vs. Nonreasoning Models

To facilitate a fair comparison between reasoning and nonreasoning models, we perform the evaluation on two pairs of attackers that share the same base model: the reasoning and nonreasoning versions of Claude 3.7 Sonnet^[2] and DeepSeek-V3 and R1. We found that CoT-only monitors achieve significantly higher catch rates on nonreasoning than on reasoning LLMs, even though traditional models have substantially shorter CoTs. Notably, the CoT-only monitor’s catch rate for Claude-3.7-Sonnet:nonreasoning in subtle side tasks is over three times higher than its catch rate for Claude-3.7-Sonnet:reasoning (61.7% vs. 20.0%), despite Claude-3.7-Sonnet:nonreasoning’s average transcript being about 10 times shorter:

This drastic difference was caused by two factors: first, it was easier to extract signs of suspicious behavior from a shorter CoT, and second, CoT-only monitors displayed unusually low suspicion scores for benign transcripts generated by nonreasoning LLMs. We also found that, given sufficiently good information filtering, reasoning models are more monitorable than nonreasoning LLMs: in each of the four comparison pairs, the best-performing monitoring protocol for a reasoning model performs better than the best-performing protocol for the corresponding nonreasoning model.

The Effect of Agent Awareness on Monitoring

We compared two awareness conditions in SHADE-Arena:

The agent knows that its entire trajectory is being monitored (the default setting)
The agent is told that its trajectory is monitored but that it has access to an unmonitored scratchpad

We found that the performance of CoT monitors increases substantially when the agent treats its scratchpad as private but that scratchpad is then nevertheless fed into the monitor:

This highlights the importance of keeping potentially misaligned models unaware that they are monitored. In contrast, action-only monitors perform very similarly whether the attacker can reason on a hidden scratchpad or not. This implies that attackers at or below the level of Claude 3.7 Sonnet are unable to find subtler sabotage strategies in SHADE-Arena when given more freedom to reason about those strategies.

Price-Performance Analysis

Finally, we performed a price-performance analysis in MLE-Sabotage. This yielded four important takeaways:

The performance frontier is occupied by models from Gemini and Claude families, with OpenAI models being substantially worse.
Information filtering techniques feature heavily on the Pareto frontier.
Monitor performance quickly hits steep diminishing returns along the cost-performance frontier.
Overall, there's a linear relationship between monitor cost and AUROC.

Discussion and Limitations

Causes of the less-is-more effect. To understand why the less-is-more effect occurs and why EaE substantially improves monitor performance in BigCodeBench-Sabotage and MLE-Sabotage but not in SHADE-Arena, we performed a qualitative transcript analysis. We found support for the post-hoc justifications hypothesis previously suggested by Arnav et al. (2025). Attackers are often very transparent about side tasks, sometimes even directly quoting them in their CoTs, but this transparency gaslights the monitor into believing that the side task was explicitly intended by the user, even though the full user instructions are also in the monitor's context. As the EaE monitor has access to a much smaller fraction of the CoT, it is usually better able to notice the discrepancy between the instructions and the agent's behavior, even when the Extractor quotes some of the attacker's rationalizations. In SHADE-Arena, the Extractor failed to extract the most suspicious quotes because it considered the side task benign or constructed a rationalization for it relatively often, causing EaE to perform worse.

Future relevance of the less-is-more effect. We found that in each of our three environments, information filtering benefits smaller models more. In each setting we tested, the gap between the EaE and CoT+action monitors is larger when GPT 4.1-mini is used as the monitor model compared to when Claude 3.7 Sonnet is used as the monitor model. However, our environments enabled us to get robust results only for modest task lengths (1k-40k tokens). Given that task time horizons are lengthening and that reasoning models are increasingly widespread, monitors will likely soon be applied
on trajectories orders of magnitude longer than ours, and less capable monitors than Claude 3.7 Sonnet are likely to remain the most popular choice as monitors for some time due to cost considerations. Thus, we expect the less-is-more effect to remain a relevant consideration for CoT monitoring for the foreseeable future.

Limitations. While our EaE results in BigCodeBench-Sabotage and MLE-Sabotage demonstrated the promise of hierarchical information filtering methods, the less impressive performance of EaE in SHADE-Arena suggests that even better hierarchical information filtering techniques can be designed. For this reason, we're not yet ready to recommend EaE for practical deployments. We encourage future work to explore other new techniques and compare EaE to the monitoring approaches proposed in Kale et al. (2025) and Stickland et al. (2025).

The performance of CoT monitors depends on several moving variables and we didn't manage to control for all of these. We encourage future work to annotate transcripts with the following properties and study how each affects monitor performance: attacker trajectory length, amount of (serial) reasoning required to perform the side task, and the similarity of the side task to the main task.

All of our monitors are implemented at the level of LLM agent scaffolds. We are excited about future work that explores other options, such as monitoring within an LLM API. Our monitors also operate asynchronously, reviewing transcripts only after the agent has finished operating, while in some high-stakes settings, we may instead want to catch and interrupt misaligned agents before malicious actions can be execute (Lindner et al., 2025).

Finally, we studied the monitorability of reasoning vs. nonreasoning models and price-performance relationships in only one environment and effect of agent awareness on monitorability in only two. We would be excited to see future work that verifies those results in additional environments.

Acknowledgements

We thank the MATS program for providing compute and research management to the authors throughout the project, and the MATS program, the London Initiative for Safe AI, and Trajectory Labs for providing office space to the authors. Thanks to Erik Jenner, Artur Zolkowski, Kei Nishimura-Gasparian, Robert McCarthy, Simon Storf, and Henning Bartsch for comments, and to Erik Jenner, Matt MacDermott, Qiyao Wei, and Benjamin Arnav for instructive conceptual conversations during the project.

^{^}
In addition to the action-only, CoT-only, and CoT+action monitors, we also evaluate the hybrid monitor from Arnav et al. (2025) as a baseline.
^{^}
Claude 3.7 Sonnet is an adaptive model for which reasoning can be toggled on and off. Its internal reasoning is fully exposed to users.

LESSWRONG
LW