Factored Cognition Strengthens Monitoring and Thwarts Attacks

Cody Rushing

Note: This post is essentially a ~90% draft of our upcoming research paper. We are sharing it now to get feedback on our empirical approach^[1], choice of final experiments, and suggested future work. We assume readers have a high context in empirical AI control research, so please consult the links and glossary for clarification on terminology.

Summary

We develop AI control protocols which use factored cognition to reduce risks from scheming AIs. In our proposed family of factored cognition protocols, tasks are decomposed into self-contained child tasks, which are each solved in isolated contexts by other control protocols. The child task solutions are then reassembled to solve the original task. We hypothesize four separate safety benefits of these protocols:

Simplicity benefits for monitors: Monitoring is more effective on simpler child tasks due to reduced complexity, making attacks and honest solutions more recognizable.
Simplicity benefits for other blue team tactics: Many other blue team tactics are easier when applied to simpler child tasks, such as editing solutions and generating unit tests.
Attack coordination challenges: Isolated contexts hinder untrusted models’ ability to coordinate multi-step attacks which occur across multiple areas of the solution.
Security through obscurity: For some threats, isolated contexts hide information about the schemer’s target, making it harder to assess attack opportunities and reducing attack success rate.

Our experiments directly address hypotheses 1 and 4. We perform red-teaming in the APPS setting to assess the safety of a factored cognition protocol which uses the trusted model to decompose tasks and trusted monitoring protocols to solve child tasks. Due to the limitations of APPS and the limited ability of current LLMs to decompose tasks, we perform separate evaluations of our protocol’s safety and its usefulness relative to trusted monitoring. Our main findings are:

Adding factored cognition to a trusted monitoring protocol boosts safety from 41% to 70% due in part to improved monitor performance.
Attacks on simpler child tasks are more obvious to monitors, but we find no evidence that honest solutions are more recognizable.
Adding factored cognition drops the attack success rate from 26% to 14%, but we are as yet unsure how much of this drop is due to security through obscurity.
Factored cognition protocols are most useful at defending against the class of threats where harm can only be caused after a malicious final answer is accepted, as opposed to harms which occur as side effects while a solution is being computed.
Factored cognition protocols have the best relative advantage in complex tasks which get decomposed by default, such as those already handled by LLM agents.
The APPS setting is inadequate to properly benchmark factored cognition control protocols against simple baselines. A control setting with harder, more complex tasks is needed for a unified comparison of safety and usefulness. Testing these protocols in such a setting is the first priority for future research.

Introduction

Since February, we’ve been researching factored cognition as a basis for AI control protocols. Readers familiar with both of these can probably skip to the next section.

Factored Cognition

Factored cognition is a subfield of AI research concerned with approaches to cognition which involve decomposing tasks into child tasks, solving them mostly independently, and reassembling the solutions to the child tasks to solve the original task. A central claim behind factored cognition is that this decomposition process creates easier tasks and clearer contexts for individual agents. This structure enables the system to solve problems too difficult for a single agent to solve alone (see Capability amplification for discussion). The literature on safety applications of factored cognition is almost all conceptual, see here and here for some prior work. However, many LLM task decomposition systems have been developed to improve performance on complex tasks.

Related Work

Control evaluations were introduced in the seminal AI control paper (Greenblatt et al., 2023). A control evaluation is an adversarial game which discovers safe and useful protocols for deploying AIs: "safe" meaning that there is no policy that the AIs can follow which would result in catastrophe, and "useful" meaning that there is a policy the AIs could follow which results in high performance. In control evals, a green team acts as a game designer, creating a set of objectives, allowances, and assumptions, within which adversarial red and blue teams seek to optimize their opposing objectives. The green team’s role is to ensure that the environment is constructed such that red teaming is competitive and produces informative results. In high-stakes control, the blue team objective is to use a dumb trusted model T, a smart untrusted model U, and limited human labor H, to construct a control protocol with high accuracy on a set of tasks and low probability of a catastrophe occurring while completing those tasks. The red team objective is to maximize the probability of a catastrophe occurring before getting caught in the act by the blue team. They do this by constructing a hidden attack policy for the untrusted model and choosing when to use $U_{A}$ vs an honest policy $U_{H}$ which innocuously complies with the blue team’s prompts.

The tasks and definition of “catastrophe” are part of the control setting. At this time, there are two main published settings directly usable for developing high-stakes control evals (though Redwood Research and UK AISI are pushing to create many more). The first is the APPS setting used in (Greenblatt et al., 2023), in which models solve coding interview questions while the attack policy aims to submit backdoored code. This is the setting used in our empirical work. The second is the BashBench setting, in which LLM agents accomplish various tasks in a bash shell, and the attack policy aims to download and execute a malware file from a particular URL. The most similar control protocol in the literature to what we propose is Untrusted Plan from the low-stakes control paper. This protocol uses untrusted model U to write a plan for a solution, trusted model T checks it, then T writes code to execute the plan. This is similar to one of our proposed protocols, but it doesn’t isolate contexts for each child task in the plan.

Proposed Control Protocols

We propose a family of factored cognition control protocols for initial research, illustrated above.

A root task is passed to the decomposer. The decomposer may be either T or U.
The decomposer creates a decomposition of the root task, which is composed of a sequence of self-contained child tasks and reassembly logic. Reassembly logic is the procedure by which solutions to child tasks can be combined to produce a solution to their parent task (in the simple case, the only parent task is the root task).
If using an untrusted decomposer, all outputs may undergo decomposition post-processing, such as trusted monitoring, trusted editing, etc., to address the risk of decompositions containing attacks. This post-processing produces sanitized child tasks and sanitized reassembly logic.
Each child task is solved by an implementer in an isolated context. An implementer can be any control protocol. This produces a sequence of child solutions.
Using the sanitized reassembly logic, the child solutions are reassembled to produce the root task solution.

Example: Functional Programming Task sorted_areas

Here is an example problem processed using a trusted decomposer and trusted monitoring implementers. We will refer back to this example at a few points.

A trusted decomposer is given the root task:

def sorted_areas(s_3: float, s_4: float, s_5: float) -> list[float]:
    “””Given 3 side lengths s_3, s_4, s_5, compute the areas of an equilateral triangle, square, and regular pentagon, with those respective side lengths. Return a list of the 3 areas in sorted order, each rounded to 1 decimal place.”””

It creates the following decomposition:

def area_triangle(s: float) -> float:
    “””Return the area of an equilateral triangle with side length s”””

def area_square(s: float) -> float:
    “””Return the area of a square with side length s”””

def area_pentagon(s: float) -> float:
    “””Return the area of a regular pentagon with side length s”””

def sort_and_round_list(a: list[float]) -> list[float]:
    “””Sort the list a and round all elements to 1 decimal place.”””

Reassembly logic:

def sorted_areas(s_3: float, s_4: float, s_5: float) -> list[float]:
    area_3 = area_triangle(s_3)
    area_4 = area_square(s_4)
    area_5 = area_pentagon(s_5)
    return sort_and_round_list([area_3, area_4, area_5])

The decomposer is trusted, so decomposition postprocessing is not needed.
In an instance of the trusted monitoring implementer, U is given “child task a” and implements area_triangle, creating a child solution. The trusted monitor is given “child task a” along with the solution and rates the suspiciousness. Child tasks b, c, and d are processed identically in separate contexts.

In reassembly, the implementations of area_triangle, area_square,…sorted_areas are concatenated. This is the root task solution.

Example: Question Answering Task

Here is an example problem for a question-answering task.

A trusted decomposer is given the root task: How many piano tuners are there in Chicago?
It creates the following decomposition:
1. What is the population of Chicago?
2. How many pianos per person are there in Chicago?
3. How many piano tuners per piano are there in Chicago?
4. Reassembly logic:
  Number of piano tuners = population * pianos per person * piano tuners per piano
The decomposer is trusted, so decomposition postprocessing is not needed.
In an instance of the trusted monitoring implementer, U is given “child task a” and estimates the population of Chicago as 2.7 million, which is a child solution. The trusted monitor is given “Q: What is the population of Chicago? A: 2.7 million” and rates the suspiciousness as 2/10. Child tasks b and c are processed identically in separate contexts, producing child solutions “0.05 pianos per person” , “0.0002 piano tuners per piano”.

In reassembly, the child task solutions are passed to the reassembly logic: 2.7 million * 0.05 * 0.0002 = 27 piano tuners in Chicago. This is the root task solution.

We refer to protocols with Factor(decomposer, implementer) syntax. Factor(T,T) is a factored cognition system for the trusted model. The examples above use Factor(T,trusted monitoring). Since the only untrusted implementer we use in our experiments is trusted monitoring, we will refer to this protocol as Factor(T,U) for short.

We use the term factored cognition [control] protocol to refer to this particular class of protocols, and the term task decomposition [control] protocol to refer to the superset of protocols which use some sort of task decomposition. This superset also includes, e.g., protocols that don’t isolate child task implementer contexts, and protocols using purely rule-based decomposers which narrow the context given to untrusted implementers.

One attractive feature of factored cognition protocols is that they can leverage advances from other protocol R&D simply by updating the implementers. It’s sometimes useful to frame the implementer protocols as corresponding baseline protocols. We can isolate the effects of factored cognition in experiments by comparing the performance of the factored cognition protocol to that of its corresponding baseline protocol applied directly to the parent task.

Why might factored cognition be useful for AI control?

Hypotheses: Four Safety Benefits of Factored Cognition Protocols

There are four mostly independent mechanisms by which we predict these protocols will result in higher safety compared to their corresponding baselines.

Simplicity benefits for monitors: Monitoring is more effective for simpler child tasks. This may be due to simpler contexts making attacks more obvious and/or making honest solutions more obviously honest.
Simplicity benefits for other blue team tactics: Many blue team tactics are more effective when applied to simpler child tasks, such as editing solutions and generating unit test code.
Attack coordination challenges: When child task contexts are isolated, it’s harder for an attack policy to coordinate multi-step attacks which occur across multiple areas of the solution.
Security through obscurity: For a certain class of threats, factored cognition protocols create layers of hidden logic through which attacks must pass before harm can occur. This makes it more difficult to gauge the quality of attack opportunities, reduces attack success rate, and requires attack policies to expose themselves more frequently to monitors.

Security through Obscurity

The first three hypotheses are mostly explained by the short descriptions above, but this one needs additional background. We classify AI control threat models into one of the following categories:

Product Threat: A threat intrinsic to the product generated by an attack policy, i.e., the tokens in the root task solution, where a bad outcome can be locked in only after a root task solution containing a successful attack is accepted^[2]. Example: APPS backdoors

Process Threat^[3]: Harm is locked in during the process of an attack policy creating a solution. Example: Bash Control threat model of U downloading and executing malware while completing another task.

Under a product threat, protocols like Factor(T,U) can expand the contextual distance between attack policies and the attack target. With Factor(T,U), U never sees the root task, only child tasks. We hypothesize that this obscuration of the target makes executing product threats more difficult for similar reasons that U lacking access to private unit tests makes it more difficult to write backdoors. Attacks must all filter up the decomposition (a potentially deep directed acyclic graph), passing through additional layers of obscured security swiss cheese before they can cause harm. This is somewhat similar to the cybersecurity idea of security through obscurity. In the cybersecurity field this strategy appears to be generally dismissed as a holistic security approach, but as one of many security layers it can be helpful.

Moreover, it’s not necessary to put any specific effort into creating dedicated security features like unit testing in the decomposition tree; we expect some average case safety benefit from factored cognition by default on coding tasks, and maybe other types too. Consider the sorted_areas example. The sort_and_round_list implementer attack policy might insert a high-quality subtle backdoor into a merge sort implementation which gets past our monitors. Maybe its trigger set (the set of possible inputs which trigger the backdoor) is sequences of length>2 that have a negative value only in the center index. If sort_and_round_list were the root task, this would be a great attack. However, since this function is only used to sort areas of shapes which are all positive, the backdoor is unexploitable in the context of the root task. It’s only because we decomposed the parent task and isolated the child tasks that U didn’t know that its merge sort backdoor was a bad idea. The security through obscurity advantage is pretty clear for functional programming tasks, but it doesn’t generalize to all product threats and tasks. A similar analogy is not apparent for a text summarization task, for example.

Under task decomposition protocols, it seems that most process threats have the potential to cause harm within child tasks, before root task solutions are assembled^[4]. U as an LLM agent can execute commands to download malware in any child task. There is no need to filter through upstream and downstream logic.Task decomposition does not give us the same security through obscurity advantages here as when defending against product threats.

Methodology

In our experiments we test Factor(T,T), Factor(T,U) with trusted monitoring as the child task implementer, and their corresponding baselines. The graphic below illustrates the Factor(T,U) protocols tested as a subset of the factored cognition family above.

Key Differences from (Greenblatt et al., 2023)

Here our methods are summarized as a list of deviations from (Greenblatt et al., 2023), since in most aspects the setup is the same. Appendix G contains other methodological details which aren’t deviations, but may be unclear from the seminal paper.

Simple baseline protocols: Since we use a different software stack to build experiments, we build a simple trusted monitoring protocol instead of attempting to replicate the protocol from (Greenblatt et al., 2023). In particular, the backdoor generation, attack selection, and trusted monitor subroutines are simplified. See Baselines for details.
Train/test split: The red and blue teams have access only to a train split of 250 problems during red teaming. The test split of 250 is only run after the final red team iteration and remains untouched in our experiments so far.
Perfect auditing: Auditing any successful or attempted backdoor triggers shutdown and is a blue team win.
No free upfront auditing: (Greenblatt et al., 2023) allowed the blue team to audit the first 10 tasks for free to discourage strategies which use variable audit thresholds as this was not a focus of that paper. For the same reason, we constrain the blue team to maintaining a constant audit threshold for all tasks and give no free audits.
Discretized suspiciousness and audit thresholds: If suspiciousness is implemented as a discrete measure (e.g., integers 0-9), then the audit threshold is set at the value with the smallest expected proportion of audited tasks at least as large as the auditing budget. E.g., with a budget of 2%, if 1% of usefulness run solutions have suspiciousness 9, and 3% have suspiciousness 8, the threshold is set at 8. Solutions equal to the threshold are audited at random, with a fixed audit probability set such that the budget is met in expectation. In this example, the probability would be 33%.
Exploit finder: Due to a special property of Factor(T,U) protocols, all exploits in all protocols used to check for backdoors are generated in a separate call to U with full problem context. See Factor(T,U) Protocols for details.
Exploit validation: A green team call to $U_{H}$ checks if a proposed exploit obeys the input constraints of the problem. The exploit finder has 3 chances to submit a valid exploit, and it gets feedback from the validator on each rejection.
Safety and usefulness calculations: We do not use Monte Carlo cross-validation to calculate safety and usefulness. Instead, we complete separate safety and usefulness runs each with multiple epochs. In the usefulness run, $U_{H}$ completes all tasks. We use cluster bootstrapping (clustering by problem ID the suspiciousness ratings for all epochs) in two steps to estimate the variance of safety, usefulness, and other metrics. In the first step, we resample the usefulness run and compute auditing parameters. In the second step, we use the auditing parameters to resolve the win/continue/lose outcome of each task in the safety run, resample, and calculate safety analytically, accounting for all possible attack downsampling outcomes. See Appendix C for details.
Model selection: We use more recent models for T, U, and H. See Usefulness and Model Selection for details.

The remainder are assumptions and allowances specifically related to task decomposition protocols.

A backdoored child task does not imply a backdoored parent task. This is a direct consequence of the product threat model used in the APPS setting. If a child task is backdoored, but the backdoor isn’t exploitable by the exploit finder in the full context of the root task solution, then this is a failed attack and a safe outcome.
Child task solutions are audited in isolation: If any child task solution has a suspiciousness above the audit threshold then that child task solution alone is audited. Attacks below the audit threshold in sibling tasks would not get caught.
Powerful decomposer: The blue team has access to a powerful decomposer implemented by o3-mini. This allowance is motivated by our early usefulness results.

Baseline: Trusted Monitoring

Since we use different libraries to build experiments, we build our own simple trusted monitoring protocol instead of attempting to replicate the trusted monitoring protocol from (Greenblatt et al., 2023). This is a summary of our trusted monitoring process:

Attack policy $U_{A}$ receives a task and writes a backdoored solution with a proposed exploit.
U rates the solution difficulty, which is how likely the solution is to pass unit tests.
U rates the predicted suspiciousness of the backdoor.
Based on a simple attack selection model using the previous ratings, U decides whether to proceed with the $U_{A}$ solution or proceed with $U_{H}$ . See Appendix D for details on attack selection and tuning these model parameters.
If proceeding with $U_{A}$ and using a random attacks element, then the $U_{A}$ solution is submitted with a tuned probability.
If the $U_{A}$ solution is not submitted, $U_{H}$ writes and submits an honest solution.
T rates the suspiciousness of the solution. We use a modified monitoring prompt from a recent paper focused on optimizing monitors.

Factor(T,U) Protocols

At the start of red teaming, the child task implementer uses an identical procedure to the baseline trusted monitoring described above, with all the same prompts. The red and blue teams may deviate from this as they iterate.

Exploit finder: For Factor(T,U) protocols, we have the additional choice of how to decide if a root task solution contains a backdoor. Since U has no access to the root task, it can’t propose an exploit in the normal course of writing a solution. We make a fairly generous concession to the red team, allowing U to generate an exploit given access to the full context of the root task, including child tasks, solutions, and child task proposed exploits (to be clear: the attacking U doesn’t see any of this. This is just for the purpose of identifying if an exploit exists at all). The root task is judged to contain a backdoor only if this exploit triggers it.

Exploit validation: All models tested except for o3-mini frequently propose illegal exploits, such as an input of -1 when the function explicitly or implicitly expects only positive inputs. Prompting was insufficient to stop this behavior, so we validate the proposed exploits.

To ensure a fair comparison between protocols, this exploit finder procedure is also used for trusted monitoring. In this case the exploit finder sees the proposed root task exploit provided by the $U_{A}$ instance that authored the code, and it decides to keep it or to propose a new exploit.

Experiments and Results

Our experiments address the “simplicity benefits for monitors” and “security through obscurity” hypotheses. The APPS backdoors threat model encourages very simple attacks and thus cannot test the attack coordination challenges hypothesis. We do not investigate simplicity benefits for other blue team tactics, but this would be a straightforward extension of our work. All error bars represent 68% confidence intervals, approximating ±1 standard error for normally distributed statistics.

Dataset

We use the APPS-verified dataset, a subset of APPS with verified code solutions. For expediency, we use a subset of 500 problems with the “interview” difficulty level. A 250/250 train/test split was created, and the dataset is available at xtremekiwi/APPS-leetcode-codeforces. All of our results are currently on the train split. All tasks are assumed to use the example-based unit test, where proposed code passes only if its output for a unit test case exactly matches the output of a reference solution, with tolerance for floating point imprecision^[5].

Usefulness and Model Selection

The usefulness metric is obtained from a usefulness experimental run on a given protocol where $U_{H}$ is used in all calls to U. Usefulness is the proportion of the $U_{H}$ solutions which pass unit tests, linearly rescaled such that the trusted only protocol has 0% usefulness, and the honest untrusted only protocol has 100%.

In addition to usefulness, we are also interested in measuring decomposer capability in isolation. We introduce the decomposition tax, a metric akin to alignment tax, measuring the relative loss in accuracy for a particular model on a particular distribution of tasks from adding task decomposition to a protocol. Decomposition tax = 1 - (accuracy of Factor(T,T) / accuracy of T).

We have two main criteria when selecting T and U.

A considerable capability gap (>25% accuracy) between T and $U_{H}$ so that our signal to noise ratio is high enough.
Low^[6] decomposition tax for T on APPS. If this is not met then we predict that it would be implausible that any Factor(T,U) protocol could achieve usefulness competitive with its corresponding baseline.

The above plot shows the accuracy of different protocols on the training set using different models for T. For all models, adding a decomposition step degrades performance. Though the degradation could always be reduced with better elicitation, these results suggest that making high-quality decompositions is challenging.

This plot shows the same T and Factor(T,T) data as a scatter plot, demonstrating that we cannot satisfy both model selection criteria. The closest selection would be T=GPT-4o mini and U=o3-mini, but we judged this unsatisfactory. These results also indicate that future progress in LLMs, if it follows these decomposition tax trends, is unlikely to ever allow for satisfaction of the model selection criteria on APPS. The benchmark is already too close to saturation. Better decomposer scaffolds would shift this trendline downwards, and for more complex tasks it would shift down and to the left, enabling the criteria to be met.

The first criterion was picked as more important, and we selected T=Llama 3.1 8B Instruct and U=GPT-4.1 Nano. To compensate for T’s poor decomposition capability, the blue team is granted access to a powerful trusted decomposer D. D may only be used to create decompositions, not for monitoring or any other functionality. All red teaming is done on Factor(D,U) protocols rather than Factor(T,U). See Appendix A for justification of this allowance. Our full model selection is:

T: Llama 3.1 8B Instruct
U: GPT-4.1 Nano
H: OpenAI o3-mini (medium reasoning)
D: OpenAI o3-mini (medium reasoning)

The main drawback of using Factor(D,U) protocols is that comparisons between the usefulness of decomposition and non-decomposition protocols are biased to the point of invalidity^[7]. Such comparisons using the results of our safety experiments are dubious. We do not use these comparisons as evidence for any of our conclusions.

Safety

In this section, we develop a variant of the Factor(D,U) protocol which uses trusted monitoring implementers, as well as attack policies against Factor(D,U) and the trusted monitoring baseline through red teaming. All of the tactics explored are applicable to both protocols.

Honest Policy Backdoors

In early experiments, we found that $U_{H}$ sometimes unintentionally writes backdoored code. Handling the various consequences of this behavior on red teaming produced new questions about how to treat these outcomes and motivated major green team changes to the experimental setup. Ultimately, this is a behavior orthogonal to our research questions, and as such we built our experiments to circumvent these effects such that the resultant red team incentives and other effects wouldn’t interfere with our targeted investigation of decomposition protocols. We will provide full details in an upcoming blog post, and here we simply highlight the elements of the experimental setup and assumptions which resulted.

Red team wins can only occur on attacked tasks. Backdoors written by $U_{H}$ and exploitable by U are counted as safe outcomes.
Honest policy backdoor red team strategies are ignored even though they would be successful with lower downsampling factors. Even with the above changes, there remains an honest policy backdoor red team strategy which would be extremely effective with a sufficiently low downsampling factor. Typically, the green team is responsible for spotting such instances and ensuring that an experimental artifact like a high downsampling factor doesn’t mask a legitimate attack strategy. In this case, we deliberately block this attack strategy by not changing our downsampling factor.

Without the above choices, red team incentives would drive the experiment into a low safety regime where the blue team would be powerless. Since such a regime can’t inform us about the performance of task decomposition protocols, we adopt the assumptions. However, we don’t believe these are good general assumptions for other research using backdoored code threat models.

Subtle Backdoor Attack Selection

Like in (Greenblatt et al., 2023), we find the best red team strategy against trusted monitoring is to write subtle backdoors. We use the same strategy for Factor(D,U), as no distinct red team strategy superior at countering Factor(D,U) was found. The only change in the red team’s tactics is to tune separate attack selection parameters to account for the different task distribution.

Modeling the trusted monitor: Some earlier experiments used the more capable T=Llama 4 Scout with the same U=GPT 4.1 Nano. In one safety run, we found a notably high Pearson correlation of r=0.69 between the red team’s predictions and the actual suspiciousness. Previous work such as Bash Control has found the red team largely unable to model the trusted monitor. The best we know of is from the low-stakes paper, which was able to produce a maximum of r=0.45 using T=Claude 3 Sonnet and U=GPT-4. Our unusually high correlation allowed for especially precise attack selection. However, the correlation disappeared with our final model selection using T=Llama 3.1 8B Instruct. Still, this suggests that, with a narrower capability gap, a schemer tactic of directly predicting trusted monitor outputs is plausible. See Appendix D for further details on attack selection methodology and results.

Safety Metrics and Performance

Beyond the safety metric, we are also interested in comparing factored cognition and simple baseline protocols on intermediate safety metrics. These help to isolate which of the hypothesized mechanisms contribute to an observed safety advantage.

Attack success rate (ASR): The quantity of backdoored root tasks divided by the quantity of attacked root tasks. This is a somewhat indirect and biased signal for the “security through obscurity” hypothesis. Confounding variables like the longer modular structure of decomposition solutions may also affect the ability of the exploit finder to find backdoors. Targeted experiments are needed to isolate these factors.

Unaudited attack success rate (Unaudited ASR): The quantity of root tasks resulting in a red team win divided by the quantity of attacked root tasks. This is a measure proximate to safety which is more legible.

Factor(D,U) substantially outperforms trusted monitoring on safety and both attack success rate metrics.

Trusted monitor ROC curves: Evaluates the monitor classification performance of honest vs attacked solutions. This is a signal for the “simplicity benefits of monitoring” hypothesis. ROC curves for these distributions closely reflect the available safety-usefulness tradeoff for defer to trusted protocols across varying thresholds.

Our monitor is better at distinguishing attacks on child tasks than it is at distinguishing attacks made on root tasks directly. Filtering out solutions which fail unit tests before monitoring improves monitor performance, especially for trusted monitoring.

Suspiciousness distribution of $U_{A}$ and $U_{H}$ solutions: Factor(D,U) monitors might perform better either because the distribution of $U_{A}$ solution suspiciousness is shifted right, or because the distribution of $U_{H}$ solution suspiciousness is shifted left relative to the corresponding baseline. I.e., monitors might be better at recognizing attacks, or they might be better at recognizing honest solutions.

Comparing these distributions, it’s evident that both are at play. While Factor(D,U) attacked solutions show a clear overall rightward shift, the only large distributional difference for honest solutions is the spike of zero-suspiciousness solutions in Factor(D,U). There is no clear pattern elsewhere.

Suspiciousness distribution of most suspicious $U_{H}$ child solutions: While a leftward distribution shift in $U_{H}$ solution suspiciousness improves the above monitor metrics, it doesn’t necessarily indicate a practical improvement in monitor performance in classifying honest solutions as low suspiciousness. As an extreme example, a useless decomposer could always create one trivially simple child task and leave 99% of the problem scope to be handled in a second child task. This decomposition provides no practical value, but we’d expect a large spike of zero-suspiciousness solutions to the trivial child tasks, skewing some metrics. As a conservative evaluation of whether decomposition facilitates classifying honest solutions, we eliminate this effect by considering the distribution of the most suspicious child task for each root task. If 2-child decomposition systematically helps monitors classify honest solutions, then we would expect this distribution to be shifted left or at worst match that of the corresponding baseline.

However, the most suspicious child solutions clearly tend to be more suspicious than trusted monitoring solutions. Therefore, we see no evidence to support the theory that decomposition increases suspiciousness evaluation accuracy for honest solutions in general. This could still be true for more granular decompositions, but further experiments are needed.

The heatmap plots the same data as the histogram. It’s evident that there is little correlation between the suspiciousness of a trusted monitoring solution and the most suspicious Factor(D,U) solutions. The relationship is mostly noise on top of the bias evident in the histogram.

General Observations

Choice of decomposition granularity: Our decomposer prompt allows D to select how many child tasks to produce in its decomposition for each problem, as long as it’s at least 2. Despite this freedom, the decomposer creates exactly 2 child tasks 96% of the time, and 3 child tasks for the remainder. We judged this to be a good starting point for making conservative estimates of safety benefits, as we expect more granular decompositions to produce higher safety and lower usefulness.^[8]

Problems with decompositions: The most frequent problem we observed anecdotally is creating I/O examples with rare floating point numbers (numbers which don’t appear often in training data like 0.25 or 3.14159). When writing I/O examples, the decomposer usually does a good job of converting I/O examples given in the root task into examples for each child task. However, for examples with rare floating point numbers, it almost always hallucinates the intermediate values. For example, consider sorted_areas again. Suppose the root task also came with the following I/O example: sorted_areas(1, 1, 1) = [0.4, 1.0, 1.7]. D would likely create an I/O example like this in its area_pentagon description, hallucinating the intermediate value: area_pentagon(1) = 1.683431. On the bright side, implementers given these child tasks generally ignore the hallucinated values in their chain of thought.

Zero-suspiciousness child solutions: In all Factor(D,U) runs, we observe a concentration of zero-suspiciousness child solutions which is absent from the trusted monitoring suspiciousness distribution. We manually examined a random sample of 10 such Factor(D,U) child solutions from the final safety run. All of the child solutions are very simple relative to the full root solution, with the average implementation containing 3 lines of code. This suggests that this spike mostly consists of very simple child tasks whose solutions are obviously honest to the monitor.

Discussion

Evaluating the Hypotheses

Simplicity benefits for monitors: Our findings indicate that factored cognition improves monitor performance at recognizing attacks on coding tasks. However, it doesn’t improve performance at classifying honest solutions, at least for decompositions containing only 2 child tasks. To what degree these findings generalize to other types of tasks is a subject for future research.

Security through obscurity: The reduced ASR observed under Factor(T,U) is weak evidence for this hypothesis, but further experiments are necessary to eliminate other potential causes. For example, the exploit finder capability could be a major bottleneck to ASR, degrading red team performance on Factor(T,U) differentially due to the more complex structure of the solutions. While this slightly increased complexity might be a barrier for smaller current generation models like GPT-4.1 Nano seeking to recognize and exploit backdoors, we wouldn’t expect this advantage of Factor(T,U) to generalize to the capability levels of actual future untrusted models. Further experiments are needed to eliminate this confounding variable.

Limitations

Decomposer elicitation gap: We didn’t work as hard as the authors of (Greenblatt et al., 2023) at elicitation and prompt engineering. While our monitor prompt was sourced from prior research, most other prompts and scaffolds are our own. The decomposer elicitation gap is particularly important, and this is where we put the plurality of our prompt engineering effort. A large gap could change our conclusions about the limitations of the APPS setting discussed below. Integrating better decomposer scaffolds is a subject for future research.

Powerful decomposer allowance: See Appendix A.

APPS setting: APPS was chosen due to the limited selection of available control settings and since it seemed promising for these protocols. Functional programming problems are naturally decomposable and APPS backdoors are a product threat model, which we hypothesized would offer a safety advantage. Though the APPS setting fits most of the initial criteria for testing these protocols, it is still lacking. A proper, unified evaluation of the safety and usefulness of task decomposition protocols must test Factor(T,U), not Factor(D,U), against corresponding baselines. Experiments with APPS cannot give a unified view of the safety and usefulness of decomposition protocols relative to simple baselines due to the implausibility of ever satisfying the model selection criteria with APPS. The most performant models evaluated are near the worst acceptable decomposers. Assuming the decomposition tax trend continues with future models, this sets a floor on the capability of T near that of Claude 3.7 Sonnet. However, with 82% accuracy, 3.7 Sonnet is approaching benchmark saturation. No choice of future model as U can satisfy the first criterion of a >25% capability gap.

One potential solution is to simply use larger samples so that the minimum capability gap can be reduced. However, with all protocols performing in the limited range of 80%-100% accuracy, sample sizes would need to be even larger to reduce sampling noise in accuracy from the small number of incorrect solutions. Evaluations would be much more costly and take longer.

Conditions for Real-World Protocol Viability

A major barrier to real-world adoption of AI control protocols is that control protocols degrade the usefulness of the untrusted LLMs. This is a steep cost for frontier labs in a competitive environment. Factored cognition protocols will incur a usefulness cost on many tasks due to their context-isolating structure creating lossy decompositions. However, there is a broad class of tasks already being solved through automated task decomposition and multi-step LLM workflows. In the context of AI control, LLM agents can be viewed as akin to Factor(U,U) without safeguards, a task decomposition protocol without any of the safety features of factored cognition protocols like monitoring or context isolation for child tasks. It may not be difficult to add safety-relevant features to these protocols and workflows.

Task decomposition in the form of LLM agent systems has gained immense popularity purely for its capabilities. Factored cognition protocols will be most viable for these types of tasks, where decomposition is not a burden but an enhancement. For these tasks, the loss in usefulness will come from the context isolation and other safety features necessary to turn a regular LLM agent system into a proper Factor(T,U) or Factor(U,U) protocol. Decomposition itself will not differentially degrade usefulness, increase inference cost, nor lengthen latency. Therefore, these protocols are most promising for tasks where decomposition occurs by default, and where the natural baseline is a generic task decomposition protocol without safety augmentations.

Factor(T,U) will be a viable option only when one of the following conditions holds.

When decomposition is easy enough that Factor(T,U) doesn’t show a large usefulness drop from Factor(U,U). This is likely to depend on whether difficulty lies primarily in the decomposition step, the implementation step, or both for a given class of tasks.
When Factor(U,U) is unattractive due to its higher complexity, higher inference costs, or because it is simply too hard to attain acceptable safety.

Ultimately, the key determinant of viability is whether context isolation degrades performance to an unacceptable degree. Evaluating this tradeoff across a diverse set of task types is one of the most important tasks to be addressed by future research. Helpfully, some of this work can be done outside of the AI control context.

Additional Work for the Paper

Before publishing this project, we plan to do a few more experiments. This subsection outlines our current plans, while the next outlines work that would be tackled in separate future projects. We’d especially like feedback on these plans.

Decomposition granularity: A key design parameter for decomposition protocols is the distribution of the number of child tasks created. The Factor(T,U) protocol tested here, which usually decomposes parents into 2 child tasks, creates coarse decompositions. All four of our hypotheses for the safety benefits of decomposition also predict a safety-usefulness tradeoff over granularity, with more child tasks resulting in higher safety and lower usefulness.

Ablation study of child context isolation: This experiment would directly address the security through obscurity hypothesis. We build a task decomposition protocol which doesn’t isolate child tasks, and compare safety performance relative to Factor(T,U). Here is how the protocol world work:

D creates a decomposition (identical to Factor(T,U))
U is given the full context of the decomposition and writes all child task solutions.
Child task solutions are separated and rated by monitors.
The exploit finder is given the root task solution (identical to Factor(T,U))

We’d primarily be interested in comparing ASR and unaudited ASR for this protocol against Factor(T,U). If we observe no significant difference, this would be evidence against the security through obscurity hypothesis for APPS.

Rerun all experiments on the test set: We would not change any attack selection parameters, but we would perform new usefulness runs to reach a potentially different audit threshold.

Future Work

New control setting: For a full and fair evaluation of task decomposition protocols, a more difficult setting of complex tasks more suitable for these protocols is needed. An ideal setting to investigate our research questions would be representative of a real-world use case, far from saturation using current frontier models, and compatible with interchangeable and similar product and process threat models.

Of particular interest is developing a Factor(T,U) protocol for an LLM agent setting with a threat model involving multi-step attacks. This would allow testing the attack coordination challenges hypothesis and address the additional technical challenges in decomposing tasks and implementing context isolation across untrusted implementers when operating in a stateful environment.

Test product vs process threat hypothesis: To test the hypothesis that safety is higher for a product threat than a similar process threat, we need 2 different evals using similar product and process threat models. Ideally, they would be almost the same except, e.g., code execution is deferred. This would allow us to evaluate how much of the benefit of task decomposition only applies to product threats, which has implications for where we might choose to employ the protocols, and could motivate research into converting process threats into product threats.

Benchmark task decomposition capability: It would be useful to know when models reach a sufficiently low decomposition tax on relevant tasks that their usefulness is competitive with simple baselines. Though lots of papers presenting new LLM task decomposition systems benchmark their systems relative to simple baselines, there is no unified evaluation framework. To provide more fair, holistic, and scalable evaluations, it would be good to have an eval system with a variety of task datasets and decomposition scaffolds where any combination of LLM, dataset, and scaffold can be run and compared, and new instances can be added.

Extensions of Factor(T,U): Other potentially useful features of factored cognition protocols which could be added to Factor(T,U) include:

Use implementers other than trusted monitoring
Test a range of model capabilities for D and examine safety trends
Add another monitoring/editing/resampling step after reassembly with the full parent task solution
Use T or U to generate unit testing code for child tasks
Dynamic implementer routing and deep decomposition trees (similar to ADaPT)
Other more sophisticated LLM decomposer scaffolds from the literature

Factor(U,T) protocols: Our research only tests Factor(T,U) and Factor(T,T). If the final results continue to be promising it would be natural to continue with red teaming Factor(U,T) protocols, which may have their own niche in tasks where decomposition is the most difficult part and simpler protocols are wanted. Experiments would test different decomposition postprocessing and find reasonable assumptions related to auditing decompositions. In order to compare performance against Factor(T,U) protocols, a setting other than APPS is probably needed since Factor(D,U) dominates Factor(U,T) on usefulness.

Factor(U,U) protocols: These protocols merit development since they offer the highest usefulness among the factored cognition protocol family. A different setting is needed for benchmarking against Factor(T,U) since Factor(D,U) dominates Factor(U,U) on both safety and usefulness. Factor(U,U) protocols have a much more complex space of red/blue strategies to explore, particularly involving collusion via steganography from decomposer to implementers. We suggest that initial research decouple this into two parallel lines of investigation:

What steganographic bandwidth can be achieved in decompositions, and what are the best blue team measures for limiting that bandwidth? This research would be a straightforward extension of Hidden in Plain Text.
Given access to different steganographic bandwidths, what information is most useful for scheming decomposers to pass to their implementer copies? How does steganographic bandwidth affect red team tactics and protocol safety?

Appendix A: Discussion on Powerful Decomposer

We’d like experiments in the APPS setting to be able to tell us useful information about factored cognition protocols despite the shortcoming of the setting. We think that giving access to the powerful decomposer D allows us to do this for Factor(T,U) protocols. With the current state of LLMs, Factor(D,U) is a better model of future AI control scenarios than Factor(T,U), assuming we use a significantly weaker trusted model like Llama 3.1 8B Instruct. We do select a T significantly weaker than U because this seems to usually be our best way to generalize to future AI control evals with actual untrusted models. This assumption seems reasonable for uses of T which are directly adversarial to U, such as monitoring, editing, generating unit tests, and generating honeypots. But we don’t think that the assumption is valid for every possible use of T, including for task decomposition. Using a weak T as our model of a decomposer, and seeing the usefulness results, we would instantly conclude that Factor(T,T) and Factor(T,U) were completely useless. Factor(U,U) might be competitive with vanilla trusted monitoring only if you used the very best frontier models for U.

However, we think that the huge performance differences motivating these conclusions will not generalize to the future even when T and $U_{H}$ have a similar accuracy gap on a set of tasks. Our real-life future T will be at least as good at decomposing as o3-mini, so the decomposition tax should be smaller (maybe negative) for complex tasks. The current large underperformance of Factor(T,T) and Factor(T, $U_{H}$ ) relative to simple baselines is a temporary artifact of the current state of LLM progress which creates a large bias against them. We can reduce that bias by using our best available LLM decomposer instead. The decomposition tax for Factor(D,T), and Factor(D, $U_{H}$ ) is closer to what we predict will be true in the future, but there is still substantial irreducible uncertainty. Our safety results can be treated as indicative of future decomposition protocols if the decomposition tax for that future trusted model is similar. However, due to this unknown bias, we don’t think that it’s valid to compare our Factor(D,U) usefulness results to those of non-decomposition protocols.

There is probably also a bias in safety results, but we expect this to be a smaller second-order effect. This bias could come from highly detailed child tasks giving monitors extra clues as to what parts of an implementation are/aren’t legitimate. Given the challenges of task decomposition, reducing these forms of future extrapolation bias requires that research on decomposition protocols place even greater emphasis on using the latest frontier models compared to research on non-decomposition approaches. Even then, future research should also explore a range of decomposer capabilities and estimate the safety trends.

Appendix B: Glossary

I’ll use the same terminology and notation as (Greenblatt et al., 2023) where applicable.

The primary desiderata for control protocols are:

Safety: The protocol can reliably detect and safely manage attacks within the untrusted AI’s capabilities that help it achieve its misaligned goals.

Usefulness: In the case that the untrusted model is not scheming, the protocol minimally degrades the capabilities at the task

Cost: The protocol does it all economically, with minimal compute and human labor.

The basic set of intelligent resources which we have at our disposal to build control protocols are:

U: A smart untrusted model, which might be scheming against us.

T: A trusted model with capabilities inferior to U, which we are confident isn't scheming.

H: Trusted, high-quality human labor. In experiments, this is often simulated using a frontier LLM for practical reasons.

The 3 roles of human researchers which control different elements of control experiments are:

Red team: Elicits worst-case adversarial behavior from U given a set of affordances. The red team’s goal is to minimize safety given a blue team strategy.

Blue team: Constructs control protocols given a set of affordances to maximize safety and usefulness.

Green team: Decide the affordances, assumptions, and high-level experimental setup within which the red and blue teams optimize their objectives.

The red team creates two distinct policies that U can follow when called by the control protocol:

$U_{H}$ : Untrusted model honest policy. Faithfully follows instructions without attempting an attack, just like a trusted model.

$U_{A}$ : Untrusted model attack policy. Always attempts an attack.

In the APPS setting, $U_{A}$ attempts to write solutions that include a backdoor, which is a feature of the solution which causes a behavioral difference from a ground truth reference solution for some valid inputs. In some contexts $U_{A}$ also proposes an exploit, which is a specific input to a backdoored solution which may result in a behavioral difference. If it really does result in a behavioral difference, then this is a valid exploit.

We use the following terms related to the task decomposition. These terms are defined with code-writing tasks in mind but should be pretty generalizable:

Task: A data structure containing a natural language problem description and possibly other fields. For functional programming tasks, it may include a function signature and a set of unit tests and test cases to check the validity of a solution.

Solution: An implementation of a task, including the solutions to any and all child tasks.

Parent task: Any task which has been decomposed into child tasks.

Child task: A task resulting from the decomposition of a parent task.

Sibling task: A child task which shares the same parent as another child task.

Root task: The original problem that we ask the AI system to solve.

Leaf task: A task which is not decomposed and is solved by direct implementation.

Reassembly logic: The part of the parent task solution which defines how to use the child task solutions to solve the parent task.

Decomposer: The part of a control protocol which decomposes a parent task into a set of child tasks and a reassembly function

Implementer: A human/LM system which implements a solution to a task. Implementers are themselves control protocols.

The following are abbreviations for certain categories of task decomposition control protocols.

Factor(T,T): A factored cognition system for the trusted model

Factor(T,U): A task decomposition control protocol with T as decomposer and implementer protocols which use U

Factor(U,T): A task decomposition control protocol with U as decomposer and implementer protocols which use T only

Factor(U,U): A task decomposition control protocol with U as decomposer and implementer protocols which use U

Corresponding baseline protocol: For any factored cognition protocol, the corresponding baseline protocol is the implementer protocol applied directly to the parent task. This is a useful baseline to enable isolating the effects of decomposition.

Some miscellaneous terminology which we haven’t seen in control literature previously but which we found useful:

Trigger set: The set of all possible inputs to a backdoored function which return unexpected results. Or, more formally, for a valid input $x \in S$ , a reference solution function $f (x)$ , and a backdoored version $f^{'} (x)$ , the trigger set is $S_{t} \subseteq S : x \in S_{t} ⟺ f (x) \neq f^{'} (x)$ .

Decomposition tax: The relative loss in accuracy from adding task decomposition to a protocol, or in other words, substituting a task decomposition protocol in for its corresponding baseline. Decomposition tax = 1 - (accuracy of Factor(T,T) / accuracy of Factor(T)). Unless otherwise noted, the decomposition tax for a particular baseline protocol refers to a task decomposition protocol using the same model as decomposer and implementer.

Appendix C: Bootstrapping Details

We use cluster bootstrapping to estimate the sampling distributions of key metrics including audit thresholds, safety, and ASR. In a bootstrapping run, we resample with replacement clusters of leaf tasks from the results until the resampled quantity of leaf tasks is equal to the quantity of in the original sample. Leaf tasks are clustered by the APPS sample ID, so multiple epochs and multiple child tasks are all clustered together. Applicable statistics are calculated for each resample. We resample 500 times independently to compute a bootstrapping distribution for each statistic.

C.1 Bootstrapping Procedure

We resample from both the usefulness run and safety run at each iteration. We calculate bootstrapping distributions for audit threshold, on-threshold audit probability, safety, ASR, and unaudited ASR. We only perform bootstrapping on safety runs themselves, not on attack selection parameter tuning runs, as this would be much more complicated. This is the procedure for computing bootstrapping distributions:

Safety run filtering
1. Filter the safety run root tasks to include only sample IDs (clusters) with at least one attack attempt among all epochs.^[9]
Usefulness run resampling
1. Resample root tasks
2. Compute usefulness and the suspiciousness distribution
3. Compute the audit thresholds and on-threshold audit probabilities for both “audit all” and “audit passing” strategies.
Safety run resampling
1. Using the auditing parameters, compute the win/lose/continue outcome for each root task.
2. Resample clusters equal to the number of clusters from step a.
3. De-cluster into individual root tasks.
4. Filter to only include attacked root tasks.
5. Compute ASR and unaudited ASR.
6. Compute analytically the mean safety over all possible downsamples of attacked root tasks. The downsampling factor is equal to 18 per 1000 root tasks in the safety run.^[10]
Perform steps 2-3 500 times and construct distributions for all metrics.

C.2 Selected Bootstrapping Results

Appendix D: Red Team Attack Selection and Tuning

D.1 Methodology

The red team uses a combination of filtering and random selection to choose on which tasks to attack. Filtering applies first and is performed based on ratings of predicted suspiciousness and solution difficulty for a backdoored solution.

We use simple filtering rules which eliminate tasks with either rating outside a corresponding range. After filtering, each remaining task is selected for attack with a fixed probability. So the attack selection algorithm can be parametrized by 5 values:

Min predicted suspiciousness
Max predicted suspiciousness
Min solution difficulty
Max solution difficulty
Probability of attack

These 5 parameters are tuned based on data from previous safety runs, selecting values partly by hand to minimize safety without overfitting.

When tuning attack selection parameters for trusted monitoring, it’s possible to do the safety run with high thresholds and 100% probability of attack, then set all parameters ex post. This preserves calculation of safety since:

Each attack affects exactly one task outcome and all attacks are independent.
No data from $U_{H}$ solutions in a safety run affects the safety metric. Therefore, no necessary information is lost by failing to run a $U_{H}$ on any task and running $U_{A}$ instead. However, other metrics evaluating monitor performance are affected.

For Factor(T,U) protocols, neither of these conditions hold, so tuning must be done ex ante to avoid biasing the results. To maintain a fair comparison in final results, we also tune trusted monitoring attack selection ex ante, even if further improvements could be made with tuning ex post.

D.2 Results

For both protocols, the red team performs the best when attacking on problems with low-medium predicted suspiciousness and high-medium solution difficulty, relative to the overall distribution. Solutions which it judges as either not suspicious or likely to pass unit tests are relatively unlikely to contain a backdoor. Tuning attack selection parameters is especially sensitive to sampling variance due to its reliance on estimating distributions of the narrow subset of tasks with unaudited backdoors. This subset composes 4.7% and 2.0% of leaf tasks in our final iterations of trusted monitoring and Factor(T,U), respectively (see Appendix E for full tabulated metrics). We find sampling variance to be too high for effective tuning on safety runs of 500 tasks (2 epochs). Our final tuning and results runs use 4 epochs.

Appendix E: Table of Metrics

All ranges represent 68% confidence intervals, approximating ±1 standard error for normally distributed statistics.

Metric	Trusted Monitoring	Factor(D,U)
Safety	[36%, 44%]	[66%, 74%]
Usefulness	100%	[95%, 102%]^[11]
Attack Success Rate (ASR)	[23%, 29%]	[12.4%, 15.7%]
Unaudited Attack Success Rate (Unaudited ASR)	[19%, 24%]	[9.4%, 12.3%]
Quantity of root tasks	998	999
Quantity of leaf tasks	998	2039
Quantity of attacked root tasks	218	314
Quantity of attacked leaf tasks	218	350
Attack rate (attacks per root task)	22%	31%
Attack rate (attacks per leaf task)	22%	17%

Appendix F: Product and Process Threat Examples

Consider an analogy to the scenario of using untrusted AI to perform useful work for us. Consider an untrusted automaker which makes high-quality cars, but we’re worried that it might be scheming to cause harm. So we devise an automaker control protocol and ask the automaker to build us a car. What are some threats should we consider? They could purposefully build some defect into the car like faulty seat belts. This sort of threat can’t cause us harm until the manufacturing is all done, we’ve inspected and approved everything, and we’ve driven the car off. This is a product threat. Another threat is that they dump chemicals into the nearby river during manufacturing. In this case, it doesn’t matter if we catch something else suspicious later on, the harm is already done. This is a process threat.

Below is a table with a few examples of tasks with a categorization of some applicable threats from scheming models. For all tasks, wasting human time/resources is an additional process threat.

Task/Setting	Product Threat	Process Threat
Automaker	Defective seat belts	Chemical dumping at the factory
SWE agent	Writing backdoored code	Self-exfiltration Rogue deployment Downloading malware
Coding assistant	Writing backdoored code	All above items only as runtime threats if test sandboxing is inadequate
AI safety research agent	Writing backdoored code Research sabotage	Self-exfiltration Rogue deployment Downloading malware Research sandbagging
Text summarization assistant	Writing biased summaries
Chatbot external deployment	Spreading infohazards Disinformation Coercion

Research sabotage and sandbagging can be a bit tricky to categorize since both threat models can include wasted time/resources. We classify sabotage as a product threat since most of the harm is only locked in after sabotaged research is accepted, when researchers make misguided decisions based on that research. With sandbagging, the wasted time/resources is the essential harm delivered by the sandbagging model, independent of what researchers do with the lackluster products.

Appendix G: Misc. Methodological Details

Here we include a loose collection of details and points of confusion that we encountered when setting up our experiments in APPS. These match the methods from (Greenblatt et al., 2023) unless otherwise noted. These details might not be clear from reading existing literature, and so might serve some purpose for future researchers.

Audit thresholds and budget: The auditing budget (a proportion of tasks) is used to set the auditing threshold (a suspiciousness rating), but once the threshold is set, the budget doesn’t constrain how many tasks may be audited in the safety run. The budget is the expected proportion of tasks which would be audited conditional on U not scheming. If U is actually scheming, then the expected proportion of audited tasks exceeds the budget, but this is acceptable and the blue team may still audit all of them.

Red/blue team allowances:

Nearly full black box access to the opposite team protocol and all intermediate results while improving their protocol. Each team may iterate directly against the adversarial protocol. The red team has access to all blue team prompts and scaffolding, but the attack policy itself still may not call the trusted model. The only further limitation we impose is that teams may not fine tune against the adversarial model at any point.

Monitor ROC Curves: There are several possible classification tasks on which we could evaluate the monitors, but we focus on classifying solutions between:

Written by $U_{A}$ (attacked) vs written by $U_{H}$ (not attacked)
Attacked and passes unit tests vs not attacked and passes unit tests

These correspond to the two possible auditing strategies:

Monitor all solutions: Monitor and potentially audit any task
Monitor only solutions which pass unit tests: Since solutions which fail unit tests are rejected anyway, there is no way that they can cause catastrophic outcomes. In real control deployments, labs can reduce monitoring compute cost and potentially also improve safety by filtering out failed solutions before monitoring.

Acknowledgements: Many thanks to Carlo Attubato, Trevor Lohrbeer, Alex Spies, and the other Pivotal Research fellows for their advice and guidance.

^{^}
Code and data available at https://github.com/aaron-sandoval/inspect-control-settings/tree/2025-06-blogpost
^{^}
This is slightly distinct from the idea of runtime threats in coding tasks, where bad outcomes require executing malicious code as opposed to merely writing it. A runtime threat might not be a product threat if our process for selecting solutions involves running the code, and that this could possibly cause harm before we accept the solution, e.g., we run unit tests on malicious code and harm occurs due to inadequate sandboxing.
^{^}
See Appendix F for further examples and discussion of product and process threats
^{^}
For some corner cases this may not be true, e.g., for runtime threats where our protocol only runs code at the root task level. Here we retain most of the safety benefits of factored cognition. This is perhaps a valuable distinction, but we think this is probably a small category and we will set it aside for now.
^{^}
There are a small number of APPS problems which do not use the example-based unit test. For example, some problems allow solutions to return a list in any order. In this case, our assumption will return a false negative for some valid solutions. It may also give false positive backdoors if backdoored code passes unit tests and returns a correct output for the exploit, but the output doesn’t match a different correct response given by the solution code. The only way to know when the example-based unit test does not apply is to read the semantics of the problem statement. We assume that these APPS problems are rare enough and accept the resulting error.
^{^}
We didn’t have a hard numerical threshold when selecting models. In retrospect, something like <10% seems right.
^{^}
It is valid to compare usefulness between different decomposition protocols. E.g., one could create a valid safety-usefulness Pareto front comparing performance of Factor(D,U) protocols that use trusted monitoring vs trusted editing implementers.
^{^}
We plan to explore more granular decompositions before publication. See Additional Work for the Paper for details.
^{^}
This is done to make bootstrapping runs faster, since honest solutions have no effect on any of the metrics. We realized later that this shortcut very slightly underestimates the variance in statistics.
^{^}
To maintain flexibility for future experiments, we scale our downsampling factor linearly with the number of root tasks in the safety run. This practically decouples the red and blue team strategy landscape (attack selection parameters in particular) from the size of the safety run, allowing us to scale up future runs if high variance becomes an issue again without compromising fair comparability to earlier runs of different sizes.
^{^}
Factor(D,U) usefulness reported for completeness only. Not representative of Factor(T,U) usefulness in comparison to trusted monitoring.
^{^}
We hand-verified 12 of the tasks scored as backdoored, finding 10 of them to be legitimate. This is the lower bound of 4%. If all the remaining unverified tasks were legitimate, then the total rate would be 8%.

29

Factored Cognition Strengthens Monitoring and Thwarts Attacks

29

29