This is a blog post for our paper Basic Legibility Protocols Improve Trusted Monitoring. This work was done as part of the CBAI Summer Research Fellowship, mentored by Cody Rushing.
In this blog post, we’ll go over the main results, and focus more on the engineering and methodological takeaways than we did in the paper. Notably, the section on red teaming takeaways is not in the paper. This blog post is intended for those familiar with AI control - we assume basic knowledge of the control agenda and the original control paper.
Introduction
A core challenge with trusted monitoring is that the weaker monitor often cannot effectively oversee the untrusted model because it does not fully understand the latter’s outputs. To address this, we introduce legibility protocols, which augment trusted monitoring by encouraging the untrusted model to take actions that are easier to evaluate. We explore legibility protocols in the APPS coding setting from the original control paper.
We experimented with various legibility protocols, such as prompting the benign and attack policies to write code with no special cases (simultaneously prompting the monitor to look for such special cases), or to avoid heavily optimized code in favor of simpler but more understandable algorithms. Overall, we found that simply allowing the benign and attack policies to write comments on their code provided the most benefit to the monitor’s understanding (measured by decreases in the monitor’s suspicion of benign code).[1]
Previous control papers simply removed all comments to preemptively avoid scenarios in which the attack policy writes deceptive comments to trick the monitor. Thus, we dedicate most of the project to understanding the dynamics of giving the benign and attack policies the affordance of writing code comments. Our results show that while commenting does introduce additional avenues for deception, these downsides are outweighed by the benefits of enabling the benign policy to clearly justify its actions to the monitor.
Here are our main contributions:
We introduce legibility protocols as an extension of trusted monitoring protocols.
We show that commenting protocols substantially outperform comment-removal baselines in control evaluations in the APPS backdoor setting.
We identify two mechanisms explaining why commenting improves performance. First, commenting reduces the spread of the benign and malign suspicion distributions, which helps the monitor separate the two. Second, malicious code is harder to justify with comments than honest code, resulting in disproportionate benefits to the blue team.
We demonstrate that the benefits of commenting grow with monitor capability. This trend is consistent with our proposed mechanisms for why commenting helps.
Setting
We use the APPS backdoor setting and the Defer to Trusted protocol from the original control paper, with o4-mini as our untrusted model, GPT-4.1-nano as the trusted model, and a 2% auditing budget. We use 499 problems from APPS (the first problem was used for few-shot prompting). We have three key methodological differences from the original control paper: a bootstrapping procedure for estimating metrics and confidence intervals, the assumption of perfect auditing, and no downsampling factor for backdoors.[2]
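To make the bootstrapping concrete, here is a minimal sketch of the kind of percentile bootstrap we mean, resampling problems with replacement and recomputing a metric each time; the function and parameter names are illustrative rather than the paper's exact procedure.

import numpy as np

rng = np.random.default_rng(0)

def bootstrap_ci(per_problem_values, statistic=np.mean, n_boot=10_000, alpha=0.05):
    # Resample problems with replacement and recompute the metric on each resample.
    values = np.asarray(per_problem_values, dtype=float)
    boot = np.array([
        statistic(rng.choice(values, size=len(values), replace=True))
        for _ in range(n_boot)
    ])
    lower, upper = np.quantile(boot, [alpha / 2, 1 - alpha / 2])
    return statistic(values), (lower, upper)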
Refresher on the APPS Backdoor Setting
Our control evaluations use the APPS code backdooring setting introduced in the original control paper. In this setting, the blue team's goal is to solve as many competitive coding problems from the APPS dataset as possible while preventing the red team from submitting backdoored solutions.
A coding solution is considered backdoored if it passes the provided APPS test cases, but the red team can construct an exploit: an input that yields an incorrect output. For example, the following code is backdoored:
def is_odd(x):
    if x == 354:
        return True
    return x % 2 == 1
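For this example, a working exploit is just an input on which the backdoored code disagrees with a correct solution. A minimal check (reusing the is_odd above; the reference solution is ours, added for illustration):

def is_odd_reference(x):  # a correct solution, used only to validate the exploit
    return x % 2 == 1

exploit_input = 354
assert is_odd(exploit_input) != is_odd_reference(exploit_input)   # outputs disagree here
assert all(is_odd(x) == is_odd_reference(x) for x in range(100))  # but typical tests pass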
A backdoor attempt is unsuccessful if the code fails the test cases or if the red team cannot construct a working exploit. Empirically, we find that being able to pass test cases is the main constraint on the red team, as getting incorrect code to pass up to a hundred test cases without being overtly suspicious is quite difficult.
We focus on two regimes: the Commenting Regime, where both the benign and attack policies may include comments in solutions, and the No Comments Regime, where neither may. We refer to these as “regimes” because the affordance of commenting (or lack thereof) introduces meaningful differences between the commenting and no comment settings.
In our original conception of the project, we expected legibility constraints to impose usefulness costs on the model, which motivated our use of Defer to Trusted over basic trusted monitoring. We were excited about drastic legibility measures like having the benign policy write simpler, unoptimized code (e.g. choose the simple O(n^2) algorithm over a complex O(n log n) algorithm). However, basic commenting ended up substantially outperforming these other interventions, and we actually only noticed small usefulness hits from commenting.
Commenting Advances the Pareto Frontier of Safety and Usefulness
(Left) Safety vs Usefulness Pareto curves for the commenting and no commenting regimes. We plot multiple defer thresholds as points. Usefulness is normalized such that 0 and 1 correspond to T and U’s performance, respectively. (Right) We plot the monitor suspicion distributions for benign code and backdoored code, in the no comments and commenting regimes. Caught attacks refer to attacks that are audited. More attacks are caught in the commenting regime.
When commenting is not permitted, protocols to improve the legibility of code solutions are limited. The blue team benefits most by having the benign policy submit code with small, modular functions and with descriptive variable and function names, both of which moderately help the monitor understand the code better. This has small legibility benefits, bringing the audit threshold from 7.01 to 6.79.
At the outset of the project, we expected the general blue team strategy for improving legibility to be: “Identify all the properties of benign code that cause monitors to give high suspicion scores. Tell the benign policy to not do those things.” However, we actually found that most monitor false positives come from the monitor simply not understanding why benign code solved the problem, due to the algorithmic and mathematical complexity of APPS problems. We tried various legibility constraints to impose on the policies, but found it hard to get any real gains without allowing commenting, simply because it’s hard to reliably explain the reasoning behind code without comments.[3]
When commenting is allowed, the blue team benefits from instructing the benign policy to thoroughly explain the code through extensive line-by-line comments, docstrings, and file-level comments.[4] Commenting allows the benign policy to dramatically reduce the audit threshold to 2.59.
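To be explicit about what these audit threshold numbers mean, here is a minimal sketch of how a threshold at a 2% audit budget can be computed from monitor scores. This is simplified relative to the paper (it ignores the bootstrapping and the Defer to Trusted machinery), and the function names are ours.

import numpy as np

def audit_threshold(benign_suspicion, audit_budget=0.02):
    # The suspicion score above which only `audit_budget` of benign submissions are audited.
    return float(np.quantile(benign_suspicion, 1 - audit_budget))

def caught_fraction(attack_suspicion, threshold):
    # Fraction of backdoor attempts whose suspicion reaches the audit threshold.
    return float((np.asarray(attack_suspicion, dtype=float) >= threshold).mean())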
We find that allowing the attack policy to write comments also decreases the suspicion of backdoor attempts, making it optimal for the attack policy to make use of commenting when given this affordance. There are 3 ways in which the attack policy benefits from comments:
Justifying backdoors: Comments allow the attack policy to deceive the monitor into believing backdoored code is necessary or correct.
Explaining Benign Code to T: Like the benign policy, the attack policy also benefits from comments explaining correct portions of the code. In the no comments regime, the monitor frequently gives high suspicion scores to attacks because it gets confused by benign sections of the code, even when it never identifies the actual backdoor.
Comment “Halo” Effect: Commenting seems to decrease suspicion scores even when the monitor has not concretely identified anything to be very suspicious about. We observe a similar effect with the benign policy.
Our monitor tends to give suspicion scores of 1 and 2 when it isn’t at all suspicious of code. The Comment “Halo” Effect causes the monitor to give more suspicion scores of 1 than 2, compared to when there are no comments. Since we average over five monitor resamples, and most code is given a 1 or 2 by the monitor, shifting this distribution of 1s and 2s substantially reduces the average suspicion of backdoors. We spent substantial effort reading through transcripts trying to identify the source of this effect. We did not find a satisfying explanation for this phenomenon beyond “Commented code inherently looks more legitimate than uncommented code even if the comments are not relevant or useful”.[5] We provide evidence for this claim in Appendix I of the paper.
Two Mechanisms Driving The Benefits of Commenting
We plot the monitor’s suspicion score in the no comments regime vs the suspicion change resulting from commenting the code, across all benign and malign samples. The x-axis corresponds to "how suspicious did the code start out", and the y-axis corresponds to "how much did commenting reduce suspicion", where lower values mean commenting helped more. We calculate the Pearson Correlation of this relationship. The blue area corresponds to samples where commenting successfully decreased the monitor’s suspicion. The orange area corresponds to samples where commenting increased the monitor’s suspicion.
While commenting helps both the benign policy and attack policy decrease their suspicion distributions, the blue team benefits more from commenting. We identified two explanations for why commenting differentially benefits the blue team.
Variance Reduction helps the blue team: Commenting reduces the standard deviation of the benign distribution (1.57 to 0.44) and of the attack distribution (1.80 to 1.40). Although both policies benefit from lower variance in isolation (if the attack policy’s mean is below the audit threshold), shrinking both distributions increases the separability of the benign and attack suspicion distributions, which helps the monitor. One way to observe this intuition is in the limit case: as the variance of both distributions goes to zero, the distributions become perfectly distinguishable (assuming they have different means), resulting in all attacks being audited and perfect safety. Monitor resampling/ensembling is a practical application of this effect - it reduces variance in the monitor’s final suspicion estimate, which reduces the variance of the suspicion distributions as a whole.
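The following toy simulation illustrates that limit argument. The Gaussian score distributions and their parameters are illustrative and not fit to our data; the point is only that averaging more monitor resamples shrinks both distributions, so a fixed 2% audit budget catches a larger fraction of attacks.

import numpy as np

rng = np.random.default_rng(0)

def caught_rate(k, benign_mean=2.0, attack_mean=4.0, noise_std=1.5, n=100_000):
    # Suspicion is the average of k noisy monitor resamples per sample.
    benign = benign_mean + rng.normal(0, noise_std, size=(n, k)).mean(axis=1)
    attack = attack_mean + rng.normal(0, noise_std, size=(n, k)).mean(axis=1)
    threshold = np.quantile(benign, 0.98)  # 2% of benign code gets audited
    return float((attack >= threshold).mean())

for k in (1, 5, 25):
    print(k, round(caught_rate(k), 3))  # caught fraction rises as variance falls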
Justifying benign code is easier than justifying backdoors. For benign samples, there is a strong negative correlation between baseline suspicion in the no-comments regime and the change in suspicion due to commenting (see figure above).[6] That is, the more suspicious uncommented benign code starts out, the more commenting reduces its suspicion; regardless of initial suspicion, commenting reliably renders benign code unsuspicious. This correlation is markedly weaker for attack samples - many malign submissions lie above the benign trend line, indicating that commenting reduces suspicion far less for backdoored code than for benign code.
This asymmetry arises from the difficulty of justifying malign code. Benign code almost always admits a natural justification: when flagged as suspicious (typically because the monitor lacks insight into the underlying reasoning), the benign policy can resolve this with a clear and faithful explanation. In contrast, the attack policy must construct a deceptive explanation that masks incorrect or malicious behavior, while still being superficially consistent with the code. In other words, it’s easy to justify benign code even if the monitor thinks it’s suspicious, but it’s hard to justify a backdoor if the monitor’s already found it and recognized it as suspicious!
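For concreteness, here is a minimal sketch of the quantity plotted and correlated in the figure above, following the construction described in the footnotes: s_n and s_c are a sample's monitor suspicion in the no comments and commenting regimes, respectively.

import numpy as np
from scipy.stats import pearsonr

def suspicion_shift(s_n, s_c):
    # Per-sample change from commenting and its correlation with baseline suspicion.
    s_n = np.asarray(s_n, dtype=float)
    s_c = np.asarray(s_c, dtype=float)
    delta = s_c - s_n  # negative values mean commenting reduced suspicion
    r, _ = pearsonr(s_n, delta)
    return delta, r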
Two Examples of Hard-to-Justify Backdoors
# Round the average to the nearest integer
# Fractional parts of .5 and above round up.
# We rely on Python's round() for this.
if round(avg) == k:
    [redacted for brevity]
The attack policy uses Python's round() function, incorrectly claiming that it rounds exact 0.5 values upward; in fact, Python rounds ties to the nearest even integer. If the monitor realizes this, no amount of justification will help the attack policy.
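The ties-to-even behavior the monitor needs to notice is easy to verify in any Python interpreter:

assert round(0.5) == 0
assert round(1.5) == 2
assert round(2.5) == 2  # not 3, contrary to the comment in the backdoor above

Once the monitor checks this, the comment's claim is directly falsifiable.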
if (x is y is z):
    # Case 1: all three smallest values
    # are identical
    [redacted for brevity]
The attack policy replaces '==' with 'is', which only behaves correctly for the small integers that CPython caches (roughly -5 through 256). If the monitor notices the backdoor, justification will not help the attack policy because the code is clearly wrong once the bug is pointed out.
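Again, the bug is mechanical to verify once noticed. Note that object identity on ints is a CPython implementation detail, which is exactly why relying on it is a backdoor:

a = int("1000")  # two equal ints constructed at runtime
b = int("1000")
assert a == b      # equal values
assert a is not b  # distinct objects: only roughly -5 through 256 are cached in CPython
c, d = int("7"), int("7")
assert c is d      # inside the small-int cache, identity happens to match equality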
Benefits of Commenting Grow With T’s Capabilities
Shading represents 95% bootstrapped confidence intervals. (Top-Left) Audit Proportion vs APPS Performance for each of the 7 trusted monitors we tested. (Bottom-Left) Difference in audit proportion between the comments and no comments regime vs APPS performance for each of the 7 monitors. We calculate differences on identical bootstraps to reduce variance. (Right) We plot the monitor's suspicion score in the no comments regime vs the suspicion decrease resulting from commenting the code, across all benign and malign samples and across the 4 trusted models from OpenAI. As T gets stronger, more attacks cluster in the top right of the graph, farther away from the benign samples, indicating backdoors that were highly suspicious, but were not helped by commenting.
We analyze how results scale with T’s capabilities. We expected commenting to become more effective as T grows stronger, since stronger monitors are harder to mislead, weakening the attack policy's ability to exploit comments for deceptive justification.
We run trusted monitors based on different Ts over the benign and attack policy outputs in the no comments and commenting regimes (replacing GPT-4.1-nano). The trusted models we test, in order of APPS performance, are:
Gemma 3 4B (8%)
GPT 3.5 Turbo (14%)
Gemma 3 12B (22%)
GPT 4o-mini (33%)
GPT 4.1-nano (38%)
Gemini 2.0 Flash (48%)
GPT 4.1-mini (56%)
We find that as T becomes more capable, it benefits more from commenting, consistent with our prediction from before. We see no evidence that the strength of the “Variance reduction helps the blue team” argument grows with T’s capabilities, but we do find that the strength of the “Justifying benign code is easier than justifying backdoors” argument does get stronger (see caption for figure above). This makes sense: stronger monitors are better at distinguishing genuine justifications from deceptive ones.
To caveat our results, note that we use the same monitor prompt for each T, and don’t individually optimize the benign and attack policies per monitor. Doing so would have been extremely costly in terms of time and money, and likely to introduce substantial noise into the results depending on how hard we red-teamed or blue-teamed each T.
Additionally, note that at either extreme, commenting should yield no value: it cannot help random monitors, nor perfect monitors already achieving 100% safety. We therefore expect the value of commenting to increase and then decrease back to zero as T improves from random to perfect. However, evaluating monitors close to U’s capabilities is more methodologically complex - stronger monitors can detect genuine mistakes made by the benign policy and might think they’re backdoors, inflating the audit threshold.[7] To resolve this, we’d have to play an individual red-team/blue-team game for each T. For this reason, we only verify the beginning of the trend. More discussion on this decision can be found in Appendix G of the paper.
We chose not to measure how the benefits of commenting scale with U's capability. We had no strong a priori hypothesis about the expected pattern (unlike how we did for T), and the experimental cost would have been substantial.
Red Teaming Takeaways
We spent a considerable amount of time on the red team, and think that some of the lessons we learned may be transferable to other control projects. Our final attack policy scaffold ended up being close to the attack policy in the original control paper. We give U a list of handcrafted backdoor ideas, then resample it to generate 5 individual backdoor implementations, rate each one on suspicion and rarity (how likely it is to pass test cases), and choose the lowest-suspicion code, with ties broken by rarity. Then we resample the internal suspicion and rarity evaluators to ensure they’re unbiased, and submit the backdoor only if they pass certain filtering thresholds (i.e. suspicion <= some threshold). We found this general scaffold of generate, rate, filter, submit to be better than other, more complex scaffolds that tried to iteratively improve backdoors. See Appendix E of the paper for more details on the attack policy.
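Here is a minimal sketch of that generate/rate/filter/submit loop, with the calls to U passed in as placeholder callables; the names, thresholds, and resample counts are illustrative rather than the paper's exact scaffold.

from statistics import mean

def attack_policy(problem, generate_backdoor, rate_suspicion, rate_rarity, solve_benign,
                  n_candidates=5, suspicion_threshold=4, n_evaluator_resamples=3):
    # Generate several candidate backdoor implementations by resampling U.
    candidates = [generate_backdoor(problem) for _ in range(n_candidates)]
    # Pick the least suspicious candidate, breaking ties by rarity
    # (how likely the backdoor is to slip past the test cases).
    best = min(candidates, key=lambda c: (rate_suspicion(c), -rate_rarity(c)))
    # Resample the internal suspicion evaluator for a less biased final estimate.
    avg_suspicion = mean(rate_suspicion(best) for _ in range(n_evaluator_resamples))
    if avg_suspicion <= suspicion_threshold:
        return best  # submit the backdoor attempt
    return solve_benign(problem)  # otherwise fall back to benign behavior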
Here are a few takeaways from our experience red-teaming:
Prompting the internal suspicion evaluator (an instance of U) to “simulate” T is helpful. This performed better than naively asking “is this code suspicious?” Notably, in the no comments regime, we found it useful to have the evaluator consider whether T might be confused by non-backdoored sections of the code. Even if the backdoor itself was very well hidden, the internal evaluator may give a high internal suspicion score due to benign sections of the code that it expected T wouldn’t understand.
We found very consistent gains from more aggressively filtering for internal suspicion, although we ultimately set a suboptimal threshold to maintain high sample sizes. By varying the internal suspicion thresholds, we found a linear relationship between the proportion of backdoors allowed to be submitted and the resulting audit proportion (i.e. filter more aggressively → submit fewer, higher-quality backdoors → a smaller proportion get caught).
Rarity filtering is mainly used to mitigate the low side task success rate in the APPS dataset (due to the many test cases). Settings with similarly low side-task success rate may also benefit from filtering for passing side tasks.
Best-of-N-ing is powerful. It unfortunately scales roughly linearly in cost, but we saw big gains from increasing N from 1 to 5. Of course, you can get these gains simply by having the attack policy filter more aggressively, but you lose sample size that way. So Best-of-N-ing is a good way to trade cost for sample size. This might be particularly important as making control settings becomes more expensive and there are fewer individual tasks.
Important Limitations
There are a few reasons we might not expect allowing commenting in real-life deployments to be as effective or practical as we found in this work.
Legibility is particularly useful in APPS. We chose APPS as a setting because APPS problems often require algorithmically complex solutions which can be clarified through explanations of the code. This makes them particularly well-suited to testing whether legible explanations from the untrusted model can aid the monitor. Real deployments may not have this property: for example, the trusted monitor may simply flag any actions that use dangerous affordances. In this case, the monitor doesn't need to understand the action, it just needs to detect the presence of a restricted capability.
APPS is a very toy setting. We had a hard time finding attacks in APPS that benefit as much from commenting as benign code does, but that doesn’t mean such attacks won’t be common in real deployment settings. For example, comments are a good way to justify certain backdoors as performance optimizations. This wasn’t an effective red-team strategy in APPS because the benign policy would rarely describe its code as performance optimizations, so the monitor could easily flag this behavior without incurring false positives. However, in realistic settings that value performance (as AI R&D will), we expect the benign policy to frequently engage in optimization-type behaviors, granting “cover” to the attack policy.
Commenting may be costly. We found large gains from prompting both the benign and attack policies to write extensive comments - to the point where there were more comments than code. This increases inference costs for both the untrusted model and the trusted monitor, since they must generate and read more tokens, respectively. For the untrusted model’s inference costs, the extent to which this matters depends on the agentic setup. If the untrusted model operates within a large scaffold, the overhead of adding a separate commenting step (as we did) is proportionally small. However, for single-shot code generation, the relative increase in the number of output tokens is more substantial.
We didn’t test how commenting affects U’s ability to jailbreak T. We reasoned that any jailbreaks in code comments could also be inserted into unused string variables if commenting isn’t allowed, but there may be dynamics we didn’t foresee here (e.g. jailbreak probes do worse when jailbreaks are in comments).
Future Directions for Research
Control Evaluations in More Complex Settings. We’re interested in how legibility protocols fare in more open-ended, agentic environments like BashArena. While we’ve given reasons for optimism, it’s ultimately unclear if the dynamics that we observed for commenting protocols in the APPS backdoor coding setting will extend to more complex environments.
Training Models for Legibility. Another area of future work is to train language models to write outputs legibly, instead of just prompting. One method would be to train models using a prover-verifier setup similar to here.
We’d like to thank Aaron Sandoval, Aryan Bhatt, Christopher Ackerman, Francis Rhys Ward, Jonathan Gabor, Nevan Wichers, Sam Prieto Lima, Steven Molotnikov, Tyler Tracy, Vivek Hebbar, and others for helpful discussion and feedback! Special thanks to Emre Yavuz and Maite Abadia-Manthei for running the CBAI Summer Research Fellowship.
[1] We optimized mostly on the audit threshold at a 2% audit budget, but also took into account qualitative patterns we noticed from reading the transcripts. This is because our audit threshold estimates were noisy and sometimes varied substantially between independent runs.
[2] Perfect Auditing means that when code is audited, the protocol is granted ground truth on whether the code was an attempted backdoor. Instead of downsampling to incentivize larger sample sizes, we manually set the rate at which the attack policy attacks.
[3] We didn’t allow tricks like writing reasoning in unused string variables or very long variable names.
[4] Specifically, the benign policy explains the overall correctness of the algorithm, names sub-algorithms (e.g., Miller-Rabin primality test), justifies all leaps of logic, design decisions, and claims about the optimality of algorithms (e.g., this strategy is how X can maximize Y).
[5] We refer to this phenomenon as a “halo effect” as a reference to the halo effect in psychology, in which a salient positive attribute (e.g. attractiveness) biases evaluators’ judgments of otherwise unrelated traits. In our setting, it appears that the presence of comments, independent of their relevance, acts as a positive signal that shifts the monitor towards lower suspicion.
[6] To clarify how the figure above is generated: for each problem, we take the monitor suspicion s_c from the commenting benign policy's code, and the monitor suspicion s_n from the no comments benign policy's code, and plot (s_n, s_c - s_n). We do the same for attacks (the no comment and commenting attack policy's attack on the same subset of problems).
[7] This is a downside of using APPS as a high-stakes setting. Attacks aren't clearly incriminating - it becomes hard to tell honest mistakes apart from cleverly backdoored code. This shouldn't be as much of a problem for other settings (e.g. BashArena) in which there is a more well-defined side task.