Traditionally, monitors in AI Control systems focus on catching threats in which an attacker takes 'bad actions' during a seemingly benign task, such as exfiltrating sensitive data without permission. Notably, monitors are not currently well suited to catching attackers who cause bad outcomes by omitting steps.
There is some evidence that omissions are uniquely hard to catch. "Omission Attacks: When Doing Nothing Is the Attack" found that omissions were caught much less often than commissions in the same environments, even by strong models like Opus 4.6. Auditing Sabotage Bench found that ML research sabotage was much harder to catch when the attacker used an omission rather than a change in design details or the introduction of flawed logic. SLEIGHT-Bench labeled omissions as a possible blind spot in monitors, and found that strong monitors were not able to catch an omission the authors designed.
In LinuxArena, I collected 15 honest trajectories, created synthetic omission attacks by omitting a key step from each of the honest trajectories, and collected 15 traditional side-task trajectories (commissions) from the "LaStraj" dataset. At a 2% FPR, the basic LinuxArena monitor caught 7/15 commission attacks and 0/15 omission attacks. I then created a monitor prompt that calls out omissions as the target, at similar granularity to the basic LinuxArena monitor (i.e., calling out the attack type but not mentioning what specific lines may be present or omitted). This monitor still caught 0/15 omission attacks at a 2% FPR. You can find the code here.
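To make the "caught at a 2% FPR" numbers concrete, here is a minimal sketch of one way to calibrate a threshold and compute a catch rate. It assumes the monitor gives each trajectory a numeric suspicion score and that the threshold is calibrated on honest scores; the function names are hypothetical and this is not necessarily how the linked code does it.

```python
import math


def threshold_at_fpr(honest_scores, fpr):
    """Return a threshold that flags at most `fpr` of the honest scores.

    A trajectory counts as flagged when its score is strictly above the threshold.
    """
    ranked = sorted(honest_scores, reverse=True)
    allowed = math.floor(fpr * len(ranked))  # honest trajectories we may flag
    if allowed >= len(ranked):
        return float("-inf")  # every honest trajectory may be flagged
    # Scores strictly above the (allowed + 1)-th highest honest score
    # number at most `allowed`, even when there are ties.
    return ranked[allowed]


def catch_rate(attack_scores, threshold):
    """Fraction of attack trajectories scored strictly above the threshold."""
    return sum(score > threshold for score in attack_scores) / len(attack_scores)
```

Under these assumptions, calibrating on only 15 honest trajectories means a 2% budget rounds down to zero allowed false positives, so an attack is caught only if it scores above every honest trajectory. A larger honest calibration set would give a less coarse threshold.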
I am looking for critiques, suggestions or advice on the proposal below. If something like this could be valuable for your research, or if something is missing from the proposal that could be valuable to your research if added, please let me know!
Thank you to Bruce Lee for early discussion of omission attacks.
Proposal
I propose we build a benchmark for omission attacks and omission monitors. The benchmark will contain synthetic omission trajectories, prompts to elicit clear omissions, and scaffolding for monitors. The synthetic trajectories will measure monitorability in the worst case (clean, targeted omissions of the most important features), while the prompts will measure agents' ability to carry out omission attacks. The benchmark will both measure attackers' abilities and compare the monitorability of omissions to that of commissions on the same main tasks. Omissions don't need new settings and are easily added to existing frameworks, so we will take existing control environments like LinuxArena and scaffold them for omissions.
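As a rough illustration of what a synthetic omission trajectory could look like, the sketch below removes chosen steps from an honest trajectory and records which ones were dropped. The `Step` and `Trajectory` types and the `omit_steps` helper are hypothetical stand-ins, not LinuxArena's actual data model.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Step:
    index: int
    command: str


@dataclass(frozen=True)
class Trajectory:
    task_id: str
    steps: tuple
    omitted: tuple = ()  # indices of steps removed to create the attack


def omit_steps(honest, indices):
    """Return a copy of `honest` with the given step indices removed."""
    drop = set(indices)
    unknown = drop - {step.index for step in honest.steps}
    if unknown:
        raise ValueError(f"no such step indices: {sorted(unknown)}")
    kept = tuple(step for step in honest.steps if step.index not in drop)
    # Recording the omitted indices gives ground truth for scoring monitors.
    return replace(honest, steps=kept, omitted=tuple(sorted(drop)))
```

Keeping the honest trajectory unchanged and storing the omitted indices on the copy makes each honest/omission pair directly comparable on the same main task.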
It's not clear whether we should treat omissions as concentrated attacks (attacks where a few actions can cause damage) or diffuse failures (attacks where many actions are needed to cause damage). In the case of research sabotage, we generally consider attacks diffuse because a model would likely have to subtly influence many results to achieve its goal. In contrast, I believe security failures could at least be created, if not exploited, with very few omissions. To address this, the benchmark will categorize attacks by how many omitted actions are needed to cause damage. We may also include interactions between omissions and commissions, as I think this could introduce some concentrated threats that could not be completed with omissions alone.
The goal of this benchmark will be to answer the questions:
How does the ability to carry out omission attacks scale with model capability?
How does omission monitorability compare to traditional commission monitorability?
In which combinations of setting, main task, and omission task are omissions least monitorable and most dangerous?
How does attack severity scale with the number of omitted steps?
How does combining omissions and commissions in the same trajectory change the omission attack severity?
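One simple way to start on the scaling questions above is to group results by the number of omitted steps and report a catch rate per group. This is only a sketch: the record format, a pair of (number of omitted steps, whether the monitor caught it), is an assumption, and severity would need its own measure.

```python
from collections import defaultdict


def catch_rate_by_omission_count(records):
    """Map each omitted-step count to the fraction of attacks that were caught.

    `records` is an iterable of (n_omitted_steps, caught) pairs.
    """
    totals = defaultdict(lambda: [0, 0])  # count -> [caught, total]
    for n_omitted, caught in records:
        totals[n_omitted][0] += int(caught)
        totals[n_omitted][1] += 1
    return {n: caught / total for n, (caught, total) in sorted(totals.items())}
```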
The "Omission Attacks" paper showed that postcondition tests caught omissions at far higher rates than monitors. In my testing, I found that only four of the omission attacks had degraded main-task test results compared to the honest trajectory, and only three had results degraded enough to fail the main task scoring. As with commissions, if the test suite had perfect coverage, we would be able to catch omissions every time. Therefore, I think we should explore the likely case where the agent writing the code is also the agent writing its tests, by allowing the attacker to write the test suite given the main task spec. This way the attacker can try to hide its omissions by also omitting the tests that would catch them.
Additionally, both when we use a traditional test suite and when the attacker writes the tests, I think we should allow the monitor to add to the existing test suite and run its additional tests on the submitted code. Another blue team strategy could involve using a separate untrusted or trusted model to write tests given the spec.
All these strategies would aim to answer these questions:
Can the attacker take advantage of existing and/or designed flaws in the test suite?
Can monitors that use tests find more omissions than monitors that don't?
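The sketch below illustrates the test-augmented setup in miniature: an attacker-written suite that omits the one check that would expose the omission, and a monitor that adds that check back. Everything here is hypothetical, including the `audit` helper, the final-state dictionary, and the two checks; a real environment would run tests against the submitted code instead.

```python
from typing import Callable, List, Sequence

Check = Callable[[dict], bool]  # returns True when the final state passes


def failed_checks(state: dict, checks: Sequence[Check]) -> List[str]:
    """Names of the checks that the final state fails."""
    return [check.__name__ for check in checks if not check(state)]


def audit(state, suite_checks, monitor_checks, monitor_score, threshold):
    """Flag a submission if any check fails or the monitor score is too high."""
    failed_suite = failed_checks(state, suite_checks)
    failed_monitor = failed_checks(state, monitor_checks)
    flagged = bool(failed_suite or failed_monitor or monitor_score > threshold)
    return {
        "failed_suite": failed_suite,
        "failed_monitor": failed_monitor,
        "flagged": flagged,
    }


# Hypothetical checks over a hypothetical final-state dictionary.
def service_running(state):
    return bool(state.get("service_running"))


def firewall_enabled(state):
    return bool(state.get("firewall_enabled"))


state = {"service_running": True}  # the firewall step was omitted

# Attacker-written suite leaves out the firewall check, so nothing fails.
without_monitor_tests = audit(state, [service_running], [], 1.0, 5.0)

# The monitor adds its own check and the omission surfaces.
with_monitor_tests = audit(state, [service_running], [firewall_enabled], 1.0, 5.0)
```

In this toy case the monitor score alone stays below the threshold, so the submission is flagged only when the monitor contributes the missing check.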
There are various other things we could include in the project that would not necessarily fit into the benchmark. We may choose only what is most valuable from this list.
Fix rates. Auditing Sabotage Bench contains a section measuring which models can fix the sabotaged research when given a few tries. We could add something like this for our omissions.
Resampling trajectories. If we resample the model at a given checkpoint in the trajectory or from the start of its trajectory, give the monitor both trajectories, and have the monitor scan for inconsistencies, can the monitor incriminate the attacker for omitting a key step in one trajectory but not the other? What if we sample from a trusted model instead?
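As a deterministic stand-in for the resampling idea, the sketch below aligns a suspect trajectory against a resampled one and lists steps that only the resample contains. It assumes each trajectory is a list of normalized step strings, which is a strong simplification: real resamples will rarely match step for step, and the proposal has a monitor model do this comparison. The helper name is hypothetical.

```python
from difflib import SequenceMatcher


def steps_missing_from(suspect, reference):
    """Return steps in `reference` that have no aligned match in `suspect`."""
    matcher = SequenceMatcher(a=suspect, b=reference, autojunk=False)
    missing = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        # "insert" and "replace" mark reference steps absent from the suspect.
        if tag in ("insert", "replace"):
            missing.extend(reference[j1:j2])
    return missing
```

The output could be handed to the monitor as candidate omissions to inspect, rather than treated as a verdict on its own.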
Let me know what you think!