My research this summer involved designing control settings to investigate black-box detection protocols.
When designing the settings, we had to decide between trying to mitigate research sabotage (RSab) or Dangerous Capability Evaluation (DCE) sandbagging. It was not clear to us which to target, or whether a single setting could be relevant to both threats. Most junior or mid-level sandbagging-adjacent researchers I asked seemed interested in this question, and found it difficult to pin down the differences unprompted. More experienced researchers would typically come up with similar answers after some thought. I therefore thought it might be helpful to write an answer up in more detail.
I invite you to ask yourself the question: which differences between DCE sandbagging and RSab are important to consider when designing control evaluations?
I’ve put my answer in the spoiler box below.
Spoiler: the core distinction between RSab and DCEs...
...regards scorability.
DCEs are designed to have a concrete ground truth and to have scorers reflecting this that can be directly interpreted by humans as a measure of dangerous capabilities. As such, a misaligned AI will want to score poorly according to these scorers.
Safety research tasks often do not have clear ground truth, and it is hard to construct metrics that reflect what we care about, which is ‘research contribution’. AIs may be able to score highly according to any given scorer, while only making small contributions to research progress.
The rest of this post is structured as follows. I first outline the different axes that I think are important when designing control games for sandbagging. I then analyse DCE sandbagging and RSab threats in detail along these different axes. Finally, I compare the two threats, and discuss how researchers should frame their work to ensure applicability to each threat. Readers familiar with the sandbagging literature may wish to skip straight to this final discussion.
Epistemic status: I am confident that there is value in an explicit comparison between RSab and DCE sandbagging threats and that my core distinction is more-or-less correct. I am less confident in some of my remarks towards the end of the discussion, and keen to refine these in light of further feedback.
First I’m going to describe my understanding of how the different axes affect the design of control evaluations, before considering where on each axis DCEs and RSab lies. I’ll first discuss general aspects of game design before moving on to blue team and red team affordances.
The general game design should be informed by the following future-world factors:
Let’s next consider blue team affordances. One has to consider[1]:
Finally, the green team should ensure that the red team affordances and strategies best reflect the intended threat models. As such it is important to understand the strategies that a misaligned model might pursue so that the red team is encouraged to pursue similar strategies.
There is also an obvious difference about the task distribution. This is primarily relevant to realism, so less important if designing toy settings to study specific game dynamics. That said, certain aspects of the task distribution could affect dynamics (e.g. whether tasks are crisp vs fuzzy), but I think this will be effectively addressed by higher-realism control evaluations. I’ll therefore only discuss the other factors in the rest of the post.
In the next two sections I’ll look at how these considerations apply to the DCE and RSab threats respectively before discussing the key similarities and differences in the final section.
Note: for simplicity, I’ll only focus on the threat model of models underperforming on a regulator/AISI’s DCEs. DCEs run by other external testers have very similar considerations. Lab-run DCEs have different considerations around model access and consequences for underperformance, but lead to a similar final discussion.
The threat model of sandbagging DCEs has the following properties:
Expert trajectories (human or AI): may be available in special cases but in general will not be available, and may be very expensive to collect or create. Evals are becoming more sophisticated with more need for CoT, so one cannot simply SFT on MCQ solutions. For example, I understand that current cyber evals are created with high-level human-written sketches of an exploit, but do not show the specific commands needed to execute the exploit, or the iterative process human hackers used to find the exploit. It is therefore hard to use these directly for SFT or off-policy RL, but one might be able to integrate these into a custom fine-tuning pipeline, e.g. as ‘hints’[2].
Note: for simplicity, I’ll only focus on the threat model of models underperforming on a lab’s internal research deployments, which are the focus of most existing literature on research sabotage[3].
The threat model of research sabotage has similar properties:
For most axes of consideration outlined above, either threat model could be placed at different places along the axes depending on the specific scenario. Indeed, I’d like to claim that many points in the space defined by these axes are relevant to both threat models[4]. As such, one might expect more differences in game dynamics between different evaluation settings that reflect the same threat model than between related settings corresponding to the different threat models.
To elaborate on this point, I’ll sketch three example game settings that capture dynamics applicable to either DCE sandbagging or RSab threats.
I claim that the core distinction between the two settings is scorability. As described above:
However, this distinction can start to feel blurred when designing control evaluations for sandbagging. Indeed, the game objective will likely involve a notion of ‘performance’ (or ‘usefulness’) that the green team needs to evaluate in order to run the evaluation. As such, when designing the toy setting, the green team needs to design a suitable scorer to evaluate performance, ideally a single scalar metric.
This means that toy settings to study sandbagging can typically be applicable to both threats. The difference is that in the case of DCEs the blue team should have access to the scorer, while in RSab, the blue team should only have access to a proxy[5].
It is helpful to be able to iterate control research in toy settings that are quick to develop and iterate with. I can envision a body of future research that uses a common collection of toy tasks, and for each toy task runs some evaluations where the blue does have access to the scorer, and others where it does not. Two types of tasks we considered in our project that seemed promising were open-ended verifiable maths tasks (e.g. Hendryck’s MATH), and coding benchmarks (e.g. BigCodeBench). These are existing capabilities tasks with cheap scorers chosen primarily for iteration speed. That said, it is also important for some control research to have increased realism, to ensure that it is tethered to reality.
One key point is that while many existing mitigations to sandbagging on safety research rely upon being able to measure performance, there is little existing research on how to measure performance. Including settings where the blue team does not have access to ground-truth signals reflects the RSab threat and forces the blue team to properly consider the task of constructing proxy signals and evaluating/iterating upon these signals. It is nice for this process to happen within a control game because it naturally gives a measurable way to evaluate the quality of the proxy signals and forces them to be designed with the aim of ‘control’ in mind. I.e. one gets adversarial robustness from the get go.
Another important distinction from the considerations above is temporal identification:
In each case we see that splitting the task into sub-tasks can help reduce model context, making models less certain of the date of the task, and reduce the risk of models exploiting temporal distribution shifts.
Let’s think in more detail about the malign model strategy: “only sandbag if I am confident that the task occurs after my knowledge cut-off”. This could make online training very expensive - at each step of online training, the model could update its estimate of the knowledge cut-off and keep sandbagging on tasks that appear after this updated cut-off: one would be forced to perform online training after any discontinuity in the task distribution.
One potential way I see to mitigate this ‘progressive-sandbagging’ strategy is to apply some ‘progressive training’, by which I mean “train the model in phases where the knowledge advances by one month in each phase; in each phase include examples of research tasks that were created the month after the current cutoff”. This approach would be high effort to perform, but may provide a way to ‘knock out’ logic/circuits/shards performing progressive-sandbagging logic: indeed, this may prevent models attempting to game temporal distribution shift, and reduce the volume of online training required to achieve a given desired performance.
I’d be really keen to see more discussion, and ideally some empirics, reflecting these dynamics.
In conclusion, toy settings designed to study sandbagging may well be relevant to both RSab and DCE threats. Moreover, intuitions and insights developed about different methods are very likely to transfer, and as such the two communities of people working on DCEs and RSab should make sure to communicate!
I’d like to thank Mary Phuong and the rest of the team collaborating on her MATS 8.0 project (Benjamin Arnav, Yulong Lin, Pablo Bernabeu-Pérez) for helping me to form these ideas. I’d also like to thank Cam Tice, Edward Young, Natalia Zdorovtsova, Dima Krasheninnikov, Vivek Hebbar and Julian Stastny for feedback on drafts of this post.
This material is closely related to Teun’s crisp decomposition of the main factors that influence the difficulty of mitigating sandbagging in a given scenario: fine-tuning access, data quality, and scorability. I also consider white-box access as an additional factor.
I am not aware of previous discussions on this matter. I think one could make something like Learning from Language Feedback to work, with some extra control/monitoring in place.
As discussed in my previous post, I personally think that sandbagging non-lab safety research is also an important threat model; as in the previous section, this leads to different considerations around model access and consequences for underperformance, but leads to a similar final discussion.
Indeed, the two distributions of settings over space defined by these axes for DCE and RSab respectively have significant overlap of support, and I think should also be ‘small’ for ‘suitable’ choices of divergence.
It seems reasonable to also consider settings where the blue team has access to the green team’s scorer if for example one is working on a hill-climbing task distribution and wants to be able to focus on exploration-hacking dynamics rather than metric-hacking dynamics.