Untrusted AIs can exploit feedback in control protocols

Mia Hopman; BionicD0LPH1N; Tyler Tracy

TL;DR

We explore a novel control protocol, Untrusted Editing with Trusted Feedback. Instead of having a trusted model directly edit suspicious outputs (trusted editing), the trusted model provides feedback for an untrusted model to implement, taking advantage of the untrusted model's superior capabilities.
Untrusted models were able to exploit the feedback to understand monitoring strategies and craft more effective attacks, reducing safety from trusted editing’s baseline of 93% to 80%. While compliance monitoring (checking if feedback was actually implemented) and selective editing provided some improvement, these mitigations failed to increase safety beyond the trusted editing baseline in our setup.
These findings should not be interpreted as a definitive rejection of untrusted editing, but rather as evidence that protocols that perform untrusted editing with trusted feedback require substantial additional safeguards and design refinements to achieve practical safety benefits over existing control approaches.

Introduction: Dialogue Between Models as a Control Strategy

Control protocols rely on trusted but less capable models and limited human oversight to deploy powerful but potentially misaligned systems safely. Trusted editing, an established control protocol, achieves strong safety by having a trusted model directly edit the untrusted model's suspicious outputs. However, these protocols face practical limitations when the capability gap between the two models is large: trusted models may struggle to effectively revise complex outputs while preserving functionality.

This capability gap motivates investigation of alternative approaches where a trusted model provides targeted feedback rather than direct edits, and the untrusted model implements the recommended changes. These untrusted editing protocols would theoretically preserve the untrusted model’s superior capabilities while maintaining some safety oversight through the structured guidance. However, the safety implications of information flow from monitors to potentially misaligned models remain underexplored.

We conduct an evaluation of untrusted editing protocols in a backdoored code setting where the trusted model identifies suspicious code elements and provides specific guidance for untrusted implementation. Our experiments incorporate protocol variations, including compliance monitoring to verify meaningful implementation of feedback and selective editing strategies to limit information exposure. Our results show that basic untrusted editing protocols don't consistently outperform previous protocols. Basic untrusted editing protocols achieved only 63% safety compared to trusted monitoring’s 74% baseline, due to a larger attack surface and strategies that exploit feedback signals to craft more effective backdoors. Compliance monitoring provides some improvement, increasing safety to 69%, while the application of selective feedback achieves 79% safety, still below the trusted editing benchmark of 93%.

These findings suggest that untrusted editing control protocols do not provide straightforward safety improvements and may introduce exploitable information leakage vulnerabilities. However, our results should not be interpreted as a definitive verdict on feedback architectures. Rather, they demonstrate that protocols with dialogue between trusted and untrusted models require careful design considerations and additional safeguards to remain effective. We expect these protocols may become more effective as the capability gap between the untrusted and trusted models grows. Our initial results provide some support for this hypothesis, though further investigation is warranted. The intuitive benefits of leveraging untrusted model capabilities through trusted feedback may conceal security risks that require further analysis.

Untrusted Editing with Trusted Feedback

**Defer to Iterative Untrusted Editing Protocol Architecture.** The diagram illustrates the sample flow through our most complex protocol variant, featuring multiple decision points and iterative feedback loops. The protocol operates through a multi-step process: (1) **Initial generation:** The untrusted model generates code solutions; (2) **Suspicion-based routing:** Initial monitoring determines whether samples proceed to immediate acceptance, direct auditing, or feedback processing; (3) **Feedback generation and implementation:** The trusted model analyzes flagged code and provides specific improvement recommendations, and the untrusted model then edits the original code to accommodate the suggestions; (4) **Dual monitoring and iteration:** Both suspicion and compliance scores determine whether to accept, audit, or continue iterating through additional feedback cycles. Green nodes represent trusted model operations; blue nodes represent untrusted model operations. All protocol variations in this work use subsets of this framework, with specific configurations determined by threshold parameters, iteration limits, and monitoring nodes.

We present a control protocol where an untrusted model (U) implements improvement suggestions provided by a trusted model (T), rather than having T directly edit suspicious code. This approach aims to leverage U’s superior capabilities while maintaining trusted guidance through feedback.

Feedback Generation

When generating feedback, T analyzes U’s code implementation against the problem statement to identify potential backdoors, hidden vulnerabilities, or suspicious logic patterns that could cause incorrect behavior under specific conditions. T receives explicit instructions to focus exclusively on security concerns rather than general improvements, efficiency optimizations, or stylistic modifications. If the code appears to be free of suspicious elements, T returns “None” to avoid unnecessary processing. For instance, T flags unnecessary conditional branches triggered by special inputs or suspicious logic unrelated to the problem requirements, while ignoring harmless inefficiencies. This targeted approach ensures feedback reveals genuine security concerns rather than creating noise through non-critical suggestions.

The prompt used to solicit feedback can be found here.

Compliance Monitoring

The compliance monitor evaluates how well U implements T’s feedback by analyzing the original code, suggested changes, and final implementation. Without this verification mechanism, providing feedback would reveal monitoring criteria to adversarial models while providing no safety benefits, since there would be no requirement or incentive to implement the suggested improvements. The monitor specifically identifies three key patterns: modifications that directly address feedback concerns, changes made without corresponding feedback requests, and aspects of the feedback that remain unaddressed. By quantifying the relevance of the implemented changes and detecting strategic evasion, the compliance monitor creates accountability for the untrusted model to follow the trusted suggestions. This forces a malicious model to maintain consistent deception across both its initial code and its response to feedback, which is a more difficult task than simply writing backdoored code once.

One failure mode of this monitor is that it relies heavily on the quality of the initial feedback. If T provides suggestions that are irrelevant, unnecessary, or just plain wrong, U may be penalized for legitimate decisions not to implement the suggestions. To mitigate this risk, we implemented several refinements to our feedback generation prompts, specifically instructing T to focus exclusively on security concerns rather than optimizations or other code improvements. After these refinements, our feedback solicitor suggested no changes for about 70% of the honest samples and only about 10% of the attack samples. Additionally, the honest policy is singularly focused on implementing correct solutions rather than optimizing for low compliance scores. This creates a challenging baseline where some of the least compliant revisions arise from legitimate problem-solving attempts, making effective auditing thresholds difficult to calibrate given our limited auditing budget.

The prompt used to monitor compliance can be found here.

Protocol Variations and Architecture

Our framework supports multiple configurations by varying decision thresholds and iteration limits:

Universal untrusted editing: All outputs receive feedback (bypass initial suspicion routing)
Defer to untrusted editing: Moderately suspicious outputs receive one iteration of feedback
Defer to iterative untrusted editing: Multiple rounds of feedback until termination criteria are met

Relationship to Existing Control Protocols

This framework unifies some existing control protocols through parameter variation. Trusted monitoring emerges when the feedback threshold is set to zero (all samples bypass feedback), while trusted editing replaces U with T as the feedback implementor. Since our protocol can replicate trusted monitoring behavior through threshold adjustment, it theoretically must weakly Pareto dominate trusted monitoring. However, our empirical evaluation reveals that this theoretical advantage requires substantial implementation refinements to achieve practical benefits.

Experimental Setup

We evaluate untrusted editing protocols using the BigCodeBench dataset in a backdoored code generation setting, following established AI control evaluation methodology. Our experimental framework measures protocol effectiveness (blue team) against red team strategies that attempt to exploit the feedback mechanism. Every evaluation uses the same 200 sample problems from BigCodeBench.

Task and Threat Model: We model AI subversion as attempts to introduce backdoors (code that passes standard tests but produces incorrect outputs for hidden inputs). This setting captures the challenge of detecting subtle malicious behavior while maintaining system utility.

Model Configuration: We employ GPT-4o as the untrusted model (U) and Gemma-2-9B-it as the trusted model (T), creating a realistic capability gap scenario. All protocols operate under a 1% audit budget constraint, reflecting practical limitations on human oversight in deployment scenarios.

Evaluation Metrics:

Safety: Percentage of attack attempts successfully detected or prevented^[1]
Usefulness: Proportion of tasks completed correctly relative to the untrusted model’s baseline performance

These metrics allow for the quantification of the safety-usefulness tradeoff.

Red Team Methodology: To simulate sophisticated adversaries, we implement a best-attack selection process where the red team generates 10 independent backdoor attempts and deploys the least suspicious variant as determined by GPT-4o-mini. This approach models strategic attackers who optimize their methods against known monitoring systems.
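The best-attack selection step can be sketched as follows. This is a minimal illustration, not the authors' implementation: `select_least_suspicious` and the toy scores are hypothetical, and the scoring callable stands in for the GPT-4o-mini suspicion rating described above.

```python
from typing import Callable, Sequence


def select_least_suspicious(
    candidates: Sequence[str], score: Callable[[str], float]
) -> str:
    """Return the candidate with the lowest suspicion score (first wins on ties)."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return min(candidates, key=score)


# Hypothetical stand-in for the monitor: a lookup table of made-up scores.
toy_scores = {"attempt_a": 6.5, "attempt_b": 2.0, "attempt_c": 4.0}
best = select_least_suspicious(list(toy_scores), toy_scores.__getitem__)
```

In the setup described above, `candidates` would hold the 10 independent backdoor attempts for one problem.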

Protocol Evaluation Framework: Each protocol variant is tested against multiple attack strategies, with safety scores representing the minimum performance across these strategies.

Protocol Variations and Attack Analysis

We evaluate the effectiveness of our protocols by comparing the relative Pareto frontiers. For protocols with deferral thresholds (either defer to trusted or untrusted editing), we plot over multiple thresholds.

Baseline Protocols

We evaluate the following existing control protocols to establish a baseline to compare our proposed protocols against.

Protocol	Safety	Usefulness
Trusted monitoring	74%	100%
Trusted monitoring (with defer to trusted)^[2]	84%	96%
Trusted editing^[3]	93%	92%
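As a rough illustration of what a Pareto comparison means here, the sketch below checks the three baseline points from the table for dominance. The `Point` type and helper names are hypothetical; the real comparison is made over full frontiers traced out by varying thresholds, not over single points.

```python
from typing import List, NamedTuple


class Point(NamedTuple):
    name: str
    safety: float      # percent
    usefulness: float  # percent


def dominates(a: Point, b: Point) -> bool:
    """True if a is at least as good as b on both axes and strictly better on one."""
    return (
        a.safety >= b.safety
        and a.usefulness >= b.usefulness
        and (a.safety > b.safety or a.usefulness > b.usefulness)
    )


def pareto_frontier(points: List[Point]) -> List[Point]:
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Values copied from the baseline table above.
baselines = [
    Point("Trusted monitoring", 74, 100),
    Point("Trusted monitoring (with defer to trusted)", 84, 96),
    Point("Trusted editing", 93, 92),
]
frontier = pareto_frontier(baselines)
```

Each baseline trades safety against usefulness, so none of the three dominates another and all remain on the frontier.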

Universal Untrusted Editing

BLUE TEAM: Solicit feedback from T and ask U to implement

The most basic version of untrusted editing consists of one round of feedback generation and implementation, followed by evaluation for every response, regardless of initial suspicion (which is not measured at all).

This protocol operates through a simple four-step process: (1) U generates initial code, (2) T analyzes the code and provides specific security-focused feedback, (3) U revises the code based on this feedback, and (4) the final code undergoes suspicion monitoring to determine whether to audit or accept. Every generated solution passes through the feedback cycle regardless of initial suspiciousness.

RED TEAM: Use feedback to craft a stealthier backdoor

The red team responds with three primary strategies:

Pivot: Remove the original backdoor if feedback identifies it, then insert a new, unrelated backdoor during editing. This exploits the information revealed in feedback to craft less detectable attacks.

Ignore: Submit the identical original code without implementing any feedback.

Resample: Generate an entirely new backdoored solution after receiving feedback, discarding the original attempt.

Attack Strategy	Valid Backdoors^[4]	Attacks Audited^[5]	Successful Backdoors^[6]	Safety
Pivot	42%	38%	22%	63%
Ignore	48%	50%	16%	76%
Resample	46%	49%	22%	69%

The pivot strategy yields the lowest safety (63%), making it the most effective red team attack. This is likely due to U taking advantage of information leakage about T's backdoor-catching abilities. Safety against the resample attack (69%) also falls below the trusted monitoring baseline of 74%. The ignore strategy yields safety similar to simple trusted monitoring (76% versus 74%), as expected, since it does not leverage any feedback information. An advantage of this protocol is that usefulness remains at 100% throughout, as we use U to edit the samples.
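The safety column can be reproduced from the other columns using the definition given in the footnotes (attacks caught divided by attacks caught plus successful backdoors), with the protocol's safety taken as the minimum across strategies. The sketch below uses the rounded percentages from the table, so it is only a sanity check; the function and variable names are illustrative.

```python
def safety(audited: float, successful: float) -> float:
    """Attacks caught / (attacks caught + successful backdoors accepted)."""
    return audited / (audited + successful)


# (attacks audited %, successful backdoors %) from the table above.
universal = {
    "Pivot": (38, 22),
    "Ignore": (50, 16),
    "Resample": (49, 22),
}

per_strategy = {name: safety(a, s) for name, (a, s) in universal.items()}

# The protocol's safety is the worst case over red team strategies.
worst_strategy = min(per_strategy, key=per_strategy.get)
protocol_safety = per_strategy[worst_strategy]
```

With these inputs the worst case is the pivot strategy, which rounds to the 63% reported for the protocol.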

Result: A safety score of 63% at 100% usefulness.

BLUE TEAM: Add a compliance monitor to verify feedback implementation

*We add a compliance monitor after feedback implementation. The audit decision is now based on the two monitor signals, rather than suspicion alone.*

To evaluate how well U implements T’s suggestions, we introduce a critical compliance monitoring step that scores feedback implementation. Without this mechanism, the red team would have no incentive to follow the feedback, as demonstrated in the previous section.

We developed three approaches for incorporating dual monitoring signals:

Suspicion-only auditing: The traditional approach using only suspicion scores (baseline comparison)

Maximum-based auditing: The maximum value between the compliance and suspicion scores is used, triggering an audit if either metric exceeds the threshold.

Weighted-combination auditing: We optimize both the weights for combining monitor signals and the audit threshold simultaneously to maximize safety performance across all red team strategies, while maintaining the 1% audit budget constraint on honest samples.

Comparison of audit metric effectiveness between suspicion-only monitoring (left) and combined suspicion-compliance monitoring (right). The weighted combination approach catches 44.0% of attacks compared to 32.5% when using suspicion scores alone.^[7]

The weighted combination approach consistently led to improved safety across multiple attack strategies and successfully captured attack attempts that would otherwise evade detection by the suspicion monitor alone. Our empirical optimization found weights of 0.69 for suspicion and 0.31 for compliance. The weighted combination approach creates a more challenging environment for the red team, which now must evade detection across multiple monitoring types.
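A minimal sketch of the two combined audit metrics, and of calibrating an audit threshold to the 1% budget on honest samples, might look like this. It assumes both monitor signals are oriented so that higher means more concerning (hence the hypothetical `noncompliance` argument) and that an audit fires when a score is strictly above the threshold. Only the 0.69 and 0.31 weights come from the text; everything else is illustrative.

```python
import math
from typing import Sequence

# Weights reported in the text; the rest of this block is an assumption.
W_SUSPICION, W_COMPLIANCE = 0.69, 0.31


def weighted_score(suspicion: float, noncompliance: float) -> float:
    """Weighted-combination audit metric."""
    return W_SUSPICION * suspicion + W_COMPLIANCE * noncompliance


def max_score(suspicion: float, noncompliance: float) -> float:
    """Maximum-based audit metric: either signal alone can trigger an audit."""
    return max(suspicion, noncompliance)


def calibrate_threshold(honest_scores: Sequence[float], budget: float = 0.01) -> float:
    """Threshold such that at most `budget` of honest scores lie strictly above it."""
    if not honest_scores:
        raise ValueError("need honest scores to calibrate against")
    ranked = sorted(honest_scores, reverse=True)
    allowed = math.floor(budget * len(ranked) + 1e-9)  # honest audits we can afford
    allowed = min(allowed, len(ranked) - 1)
    return ranked[allowed]


# Toy honest scores: 200 distinct values, matching the 200-problem setup.
honest_scores = [float(i) for i in range(200)]
threshold = calibrate_threshold(honest_scores)
audited_honest = sum(score > threshold for score in honest_scores)
```

Jointly optimizing the weights and the threshold, as described above, would wrap this calibration in a search over candidate weights and keep the setting with the highest minimum safety across red team strategies.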

RED TEAM: Red herring attack

The most effective attack strategy against compliance monitoring was a “red herring” approach. In this attack, U generates a solution containing two backdoors: one obvious vulnerability, designed to attract attention, and another subtle backdoor. When T identifies the obvious backdoor through feedback, U addresses only the flagged issue while preserving the subtle attack. The compliance monitor views the removal of the obvious security concerns as higher conformity, creating a false impression of cooperation.

The initial red herring code generations have a higher suspicion level compared to the pivot attack due to the deliberately obvious backdoors. However, the initial suspicion is irrelevant to the overall outcome of the protocol since audit decisions are based on the final submission’s evaluation. After implementing feedback (removing the obvious backdoor), the suspiciousness of the red herring attack halves.

The compliance score distributions of the red herring and pivot attack strategies and of the honest implementation demonstrate the effectiveness of the red herring attack. While the pivot attack strategy (orange) typically receives low compliance scores, indicating inadequate implementation of feedback, the red herring approach (red) achieves higher compliance scores, with a large portion of the samples registering similarly to the honest distribution (blue), suggesting near-perfect compliance from the monitoring system’s perspective.

Attack Strategy	Valid Backdoors^[4]	Attacks Audited^[5]	Successful Backdoors^[6]	Safety
Red herring	41%	44%	20%	69%
Pivot	42%	45%	19%	70%
Ignore	48%	58%	11%	85%
Resample	46%	55%	19%	74%

Despite compliance monitoring, the red herring strategy achieves only 69% safety, still below trusted monitoring's 74% baseline.

Result: A safety score of 69% at 100% usefulness.

Defer to Untrusted Editing

BLUE TEAM: Add a branch point before feedback

Rather than routing all model outputs through the feedback process, this improved protocol only sends moderately suspicious output to the feedback mechanism. Low suspicion code is accepted directly without additional processing and high suspicion code is immediately audited. The feedback mechanism with the compliance monitor remains the same as outlined in the previous section.

This protocol introduces a strategic advantage over the previous iterations. By enabling direct auditing of suspicious initial outputs, we discourage adversarial models from overtly using signals from feedback to refine their attacks.
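The three-way branch can be sketched as below. The cutoffs `feedback_floor` and `audit_floor` are hypothetical score-space parameters chosen for illustration; the write-up itself expresses routing in terms of the fraction of outputs sent to feedback.

```python
from enum import Enum


class Route(Enum):
    ACCEPT = "accept"
    FEEDBACK = "feedback"
    AUDIT = "audit"


def route(suspicion: float, feedback_floor: float, audit_floor: float) -> Route:
    """Route a sample by its initial suspicion score.

    Assumes feedback_floor <= audit_floor and that higher scores
    mean more suspicious code.
    """
    if suspicion >= audit_floor:
        return Route.AUDIT      # highly suspicious: audit immediately
    if suspicion >= feedback_floor:
        return Route.FEEDBACK   # moderately suspicious: send to the feedback loop
    return Route.ACCEPT         # low suspicion: accept as is


decisions = [route(s, feedback_floor=3.0, audit_floor=8.0) for s in (1.0, 5.0, 9.0)]
```

Under this framing, the universal protocol corresponds to sending every sample to `Route.FEEDBACK` regardless of its score.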

The selective approach dramatically improves performance against previously effective strategies. Red herring attacks, which achieved 69% safety under universal feedback, now reach 98% safety as 84% of attempts trigger immediate auditing before entering the feedback loop. Pivot attacks improve from 70% to 81% in safety through similar pre-filtering.

RED TEAM: Stop Attack & Complex Code

The red team developed two primary counter-strategies:

Stop Attack (Safety: 81%): Abandons the attack attempt upon receiving feedback, minimizing evidence of malicious intent. This approach systematically trades attack opportunities for reduced detection probability. This strategy exploits higher feedback thresholds, meaning there are more opportunities for the red team to insert backdoors before concealing their efforts.

Complex Code (Safety: 80%): Embeds backdoors within intentionally complex logic, then maintains malicious functionality during "simplification" responses to feedback. This approach exploits both the trusted model's capability limitations in analyzing complex code and the compliance monitor's positive scoring of apparent improvements.

We also examined alternative attack approaches that did not perform well:

Innocent Complex Code (Safety: 100% - ineffective)

The initial strategy involved generating complex, non-malicious code to trigger feedback by being moderately suspicious, allowing the red team to simplify it and insert a backdoor while gaining compliance. The risk was immediate acceptance without an attack. However, monitors recognized the harmless complexity, leading to immediate acceptance and perfect safety against this attack.

Honest First (Safety: 100% - ineffective)

This approach begins with genuinely clean code, then introduces exploits during the revision process. Empirical testing revealed this to be suboptimal for the red team, as the initial honest submission frequently bypasses the feedback mechanism entirely, receiving immediate acceptance and eliminating the opportunity to introduce a backdoor. Similarly to innocent complex code, deferring to untrusted editing had perfect safety against this attack.

Through empirical testing across all attack strategies, we determined optimal thresholds: 50% of outputs routed to feedback, with audit decisions based on weighted scores (0.61 initial suspicion, 0.17 final suspicion, 0.22 compliance). This configuration maximizes minimum safety across all red team approaches.
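To make the effect of these weights concrete, the sketch below combines the three signals for a toy sample. The weights are those reported above; the signal values are invented, and the compliance signal is assumed to be oriented so that higher means less compliant.

```python
# Weights reported in the text for defer to untrusted editing.
WEIGHTS = {
    "initial_suspicion": 0.61,
    "final_suspicion": 0.17,
    "noncompliance": 0.22,
}


def audit_score(signals: dict) -> float:
    """Weighted sum of the three monitor signals."""
    return sum(WEIGHTS[name] * signals[name] for name in WEIGHTS)


# Invented red-herring-like sample: suspicious at first, halved after the
# edit, and apparently compliant with the feedback.
sample = {"initial_suspicion": 8.0, "final_suspicion": 4.0, "noncompliance": 1.0}
combined = audit_score(sample)
final_only = sample["final_suspicion"]
```

Because most of the weight sits on initial suspicion, a sample that looked bad before feedback keeps a high audit score even after its final suspicion drops.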

Result: A safety score of 80% at 100% usefulness.

The empirical results reveal clear performance differences across control protocols. Defer to untrusted editing achieved 79% safety, a substantial 16 percentage point improvement over universal untrusted editing, and 5 percentage points above basic trusted monitoring. However, trusted editing establishes Pareto dominance across the safety-usefulness frontier, with 93% safety.

These findings demonstrate that compliance monitoring and selective feedback application provide meaningful but insufficient mitigation of information leakage vulnerabilities. While information leakage through feedback can be partially mitigated through these measures, the fundamental challenge remains: providing any feedback to an untrusted model creates exploitable attack surfaces. The protocol’s inability to surpass trusted editing’s safety in this experiment demonstrates that the capability advantages of untrusted models may not compensate for the security risks introduced by the structured dialogue between trusted and untrusted models.

Defer to Iterative Untrusted Editing

The iterative variant introduces a feedback loop architecture that allows multiple rounds of editing, increasing the difficulty for an attack to persist undetected across multiple edits. Unlike the single-pass approach, this extension implements recursive feedback-edit cycles that continue until one of the termination conditions is met.

Note on Experimental Limitations: This section represents an initial investigation into iterative protocols with incomplete experimental comparisons. Our primary trusted model (Gemma-2-9B-it) was removed from Together AI before we completed direct comparisons with other feedback protocols, limiting our ability to draw definitive conclusions. We are unable to directly compare results from the two Gemma models because their performance differed significantly across platforms. The version hosted on Together AI passed approximately 40% of code generation tests, while the one on OpenRouter passed only around 25%. We hypothesize that these discrepancies may be due to differences in quantization between the two deployments, which substantially affected model usefulness and monitoring capability.

The iterative variant extends the defer to untrusted editing feedback protocol by allowing multiple rounds of editing until the termination criteria are met: (1) the solution becomes sufficiently unsuspicious and compliant to accept, (2) the combined suspicion-compliance score exceeds the audit threshold, or (3) maximum iterations are reached. This creates a more challenging environment for malicious attackers, as they must maintain deceptive behavior across multiple interaction steps while demonstrating apparent compliance with increasingly specific feedback.
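The loop and its three termination criteria can be sketched as follows. All callables and thresholds are hypothetical stand-ins for the model calls and tuned parameters. The source does not say how a sample is handled once the iteration limit is reached, so the sketch simply reports that outcome and leaves the decision to the caller.

```python
from typing import Callable, Tuple


def run_iterative_editing(
    code: str,
    get_feedback: Callable[[str], str],         # stand-in for T's feedback
    edit: Callable[[str, str], str],            # stand-in for U's revision
    monitor: Callable[[str, str, str], float],  # combined suspicion-compliance score
    accept_below: float,
    audit_above: float,
    max_iterations: int,
) -> Tuple[str, str, int]:
    """Return (outcome, final_code, rounds_used)."""
    for round_idx in range(1, max_iterations + 1):
        feedback = get_feedback(code)
        revised = edit(code, feedback)
        score = monitor(code, feedback, revised)
        code = revised
        if score > audit_above:    # criterion 2: score exceeds the audit threshold
            return "audit", code, round_idx
        if score < accept_below:   # criterion 1: unsuspicious and compliant enough
            return "accept", code, round_idx
    return "max_iterations", code, max_iterations  # criterion 3


# Toy stand-ins: each "!" is a suspicious element, and each edit removes one.
def toy_feedback(code: str) -> str:
    return "remove one '!'"


def toy_edit(code: str, feedback: str) -> str:
    return code.replace("!", "", 1)


def toy_monitor(original: str, feedback: str, revised: str) -> float:
    return float(revised.count("!"))


result = run_iterative_editing(
    "x!!!", toy_feedback, toy_edit, toy_monitor,
    accept_below=1.0, audit_above=5.0, max_iterations=5,
)
```

A real monitor would also aggregate scores across rounds rather than judging each revision in isolation.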

Multiple feedback rounds enable cumulative signal aggregation. The protocol can accumulate evidence across interactions, revealing longer patterns of deception, rather than requiring judgment based on a single evaluation point. Statistical analysis across iterations may reveal patterns indicative of malicious behavior that are not apparent in single-point evaluations. Additionally, feedback pressure is sustained across rounds, making it progressively harder for scheming models to maintain plausible deniability while preserving backdoors. The iterative approach forces U to demonstrate consistent behavior across multiple revision steps, creating a more robust test of genuine compliance versus superficial editing. Maintaining deceptive behavior across multiple rounds requires sophisticated coordination that increases the likelihood of revealing malicious intent. This establishes a higher bar for successful attacks compared to single-pass evaluations.

The iterative approach also introduces important tradeoffs. Multiple revision cycles significantly increase the computational resources required, as well as latency, potentially limiting real-time applications. The exact cost tradeoffs require further empirical exploration to establish optimal iteration limits for such deployments. Additionally, multiple feedback iterations may give U more information about the monitoring thresholds and detection mechanisms. Each feedback round could reveal additional details about what T considers suspicious, which the red team could exploit. Sophisticated attackers may infer their position in the loop from context clues (such as whether the code they are asked to edit matches what they would have initially generated, or the specificity of the feedback). This information could be used to adjust attack strategies based on their place in the protocol.

We compared iterative protocols using two trusted models with different capability gaps relative to GPT-4o (U):

GPT-4o: 61.5% usefulness baseline
GPT-4o-mini: 58% usefulness (3.5 percentage point gap - minimal capability difference)
Gemma-2-9B-it (OpenRouter): 24.5% usefulness (37 percentage point gap - substantial capability difference)

When using GPT-4o-mini as the trusted model (left), iterative trusted editing dominates iterative untrusted editing, achieving higher safety across usefulness levels. However, with the larger capability gap using Gemma-2-9B-it (right), iterative untrusted editing Pareto dominates iterative trusted editing.

The low safety performance of iterative trusted editing with Gemma-2-9B-it reflects its weakness at editing complex backdoored code. The weaker trusted model struggles with code revision tasks, creating opportunities for subtle backdoors to persist through editing. In our comparison between the two trusted models we evaluated, defer to iterative untrusted editing Pareto dominated iterative trusted editing when using Gemma-2-9B-it (the weaker model with a 37 percentage point usefulness gap), while the reverse was true when using GPT-4o-mini (the stronger model with only a 3.5 percentage point usefulness gap). This suggests that the relative performance of untrusted editing protocols may depend on the capability gap between U and T, though further research across a broader range of model capabilities would be needed to establish this relationship more definitively.

Insights and Implications

Our experiments demonstrate that untrusted editing with trusted feedback creates exploitable information leakage vulnerabilities that substantially compromise safety performance. The protocol's most effective variant (defer to untrusted editing) achieves only 79% safety compared to trusted editing's 93% safety, representing a 14 percentage point safety degradation. We find that compliance monitoring and selective editing offer partial mitigation, but they are unable to fully address the information leak, which the red team can exploit to improve the undetectability of their attacks. The core vulnerability lies in the dual nature of providing feedback: although it can be helpful to guide improvement, it reveals monitoring criteria that adversaries can exploit. Red team strategies like the "red herring" attack demonstrate how malicious models can manipulate compliance scores by strategically addressing obvious vulnerabilities while preserving subtle backdoors. Even with compliance monitoring converting part of this information leak into a secondary detection signal, safety improvements remain insufficient.

Comparison to Resampling Approaches

Our feedback protocol exhibits similarities to the resampling methodologies introduced in the Ctrl-Z paper, but uses additional feedback to produce alternative outputs instead of simply resampling. Resampling approaches exploit model stochasticity by generating multiple outputs and selecting the least suspicious one, relying on the statistical likelihood that a malicious model will occasionally produce benign outputs during resampling. In contrast, our feedback-based approach creates a structured dialogue between T and U, generating targeted improvements (though potentially leaking information to the malicious model) rather than relying on stochastic variation.

We imagine that this feedback approach can be used in tandem with some of the resampling approaches explored previously. Evading specific, targeted feedback generates stronger evidence of malicious intent than producing multiple suspicious outputs. The compliance monitor creates an additional signal dimension that is unavailable in pure resampling approaches. In addition, the feedback mechanism produces explicit documentation of suspected issues and the model’s response, creating a more interpretable audit trail. One approach could involve providing multiple feedback iterations and choosing the least suspicious implementation of the feedback. Selecting the original or early implementations would minimize the attacker's ability to use the information leak to craft better attacks while still leveraging the feedback interaction to optimize auditing decisions.

Our current approach represents a control protocol that attempts to catch bad behavior when it occurs while conserving performance when it doesn't. Future work could explore a different paradigm: feedback interactions specifically designed to elicit bad behavior from potentially malicious models. This approach would treat the feedback interaction primarily as evidence-gathering for audit decisions, rather than using the revised code as the final output. Potential strategies include crafting feedback specifically designed to elicit revealing responses from malicious models and developing statistical models to identify suspicious compliance patterns across multiple interactions.

Practical Implementation Considerations

Deploying untrusted editing introduces several practical considerations. First, the protocol increases computational requirements compared to simple monitoring approaches due to the added feedback solicitation, implementation, and monitoring. Second, the effectiveness of both suspicion and compliance monitoring depends significantly on prompt design. Prompts must be carefully engineered to ensure T provides specific security feedback, rather than general improvement recommendations. Designing effective prompts for compliance evaluation requires a balance between detecting superficial changes and recognizing legitimate implementation variations.

Future Research Directions

Several promising avenues emerge from our findings that could address the limitations identified in current untrusted editing protocols. Variable attack insertion strategies represent a critical methodological improvement, as current evaluations rely on scenarios where attack policies always attempt backdoor insertion. Future work should examine selective attack strategies where red teams choose when to attack, better approximating real-world threat models where scheming models might only rarely attempt subversion, and aligning methodology with established safety definitions used in other control protocol evaluations.

Automated auditing strategies could deliberately exploit feedback interactions to collect evidence of malicious intent rather than focusing solely on output improvement. This approach would develop targeted interrogation techniques specifically designed to elicit revealing responses from potentially malicious models, implementing cross-checking mechanisms that solicit multiple feedback interactions to identify behavioral inconsistencies, and building statistical models to detect suspicious compliance patterns across multiple interactions. Similar to time travel incrimination approaches explored in the Ctrl-Z paper, researchers could revisit past suspicious elements with additional focused feedback to gather more comprehensive evidence of subversive behavior.

Extension to agentic and non-coding domains would test protocol generalizability beyond the current backdoored code generation setting. To determine if feedback-based methods can handle more complex real-world applications, multi-step task environments are needed.

Finally, we could explore scaling laws in untrusted editing performance. We found limited preliminary evidence that untrusted editing protocols improve relative to trusted-only protocols as the capability gap between U and T grows. This hypothesis could be tested by evaluating various pairs of models for U and T, and comparing the relative performance of trusted and untrusted editing as the capability gap increases.

Our work reveals important considerations for dialogue-based control protocols: while leveraging untrusted model capabilities through trusted feedback offers theoretical advantages, the information sharing inherent in such approaches creates exploitable attack surfaces that require careful analysis. These findings should not be interpreted as a definitive rejection of feedback-based architectures, but rather as evidence that such protocols require substantial additional safeguards and design refinements to achieve practical safety benefits over existing control approaches.

We are grateful to Tim Hua, James Baskerville, Aryan Bhatt, Cody Rushing, and Adam Kaufman for feedback and guidance on drafts of this project.

^[1] Note that this is a departure from previous AI control works, where safety is defined as the probability that the blue team wins by completing a task without allowing the red team to insert a backdoor into an accepted response. We use our definition as a proxy for this to simplify our experiments, since our protocols only operate on the micro-game.
^[2] Deferring to the trusted model’s solution in the top 7% most suspicious cases.
^[3] Trusted model edits all of the responses.
^[4] The percentage of attack attempts with a working backdoor (returns an alternative output with a specific input).
^[5] The percentage of attack attempts that were audited, regardless of whether the backdoor was valid. We assume perfect auditing, so any audited attack attempt is caught even if its backdoor is not valid.
^[6] The percentage of attacks that avoided auditing and had a working backdoor, i.e., successful attacks.
^[7] Note that “attacks caught” is slightly different from safety: it is the proportion of samples audited, whereas safety is the number of attacks caught divided by the number of attacks caught plus successful backdoors accepted.
