[Paper] How does information access affect LLM monitors' ability to detect sabotage?
TL;DR: We evaluate LLM monitors in three AI control environments: SHADE-Arena, MLE-Sabotage, and BigCodeBench-Sabotage. We find that monitors with access to less information often outperform monitors with access to the full sequence of reasoning blocks and tool calls, a phenomenon we call the less-is-more effect for automated monitors. We follow...
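To make the comparison concrete, here is a minimal, purely illustrative sketch (not the paper's actual harness; `Step` and `monitor_view` are hypothetical names) of how one might vary the information a monitor sees over the same agent transcript, e.g. the full sequence of reasoning blocks and tool calls versus the tool calls alone:

```python
from dataclasses import dataclass

@dataclass
class Step:
    """One agent step: a reasoning block followed by a tool call."""
    reasoning: str
    tool_call: str

def monitor_view(transcript: list[Step], access: str) -> str:
    """Assemble the text shown to a monitor under a given access level.

    access="full":         all reasoning blocks and tool calls
    access="actions_only": tool calls only (strictly less information)
    """
    if access == "full":
        return "\n".join(f"[reasoning] {s.reasoning}\n[tool] {s.tool_call}"
                         for s in transcript)
    if access == "actions_only":
        return "\n".join(f"[tool] {s.tool_call}" for s in transcript)
    raise ValueError(f"unknown access level: {access}")
```

The less-is-more effect is the observation that a monitor scoring the `actions_only` view can detect sabotage more reliably than one scoring the `full` view, under the setups studied.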