Epistemic status: I’ve only spent 3 months working on sandbagging, and have had limited, mixed feedback from established researchers on the ideas in this post. However, I have had positive feedback from most junior researchers I asked, so I think it will be helpful to make these ideas concrete and stimulate public discussion and critique[1].
This post is part of a sequence motivated by my research this summer: see the sequence page for a more detailed discussion of my motivation and an overview of the sequence.
It would be bad if models did not perform to the best of their abilities on important tasks.
Two particular sets of tasks where it is important to mitigate model underperformance are Dangerous Capability Evaluations (DCEs) and AI Safety Research (AISR) tasks.
There have been many recent blog posts and articles discussing how to mitigate underperformance from the viewpoint of control for non-concentrated failures. However, following The case for ensuring powerful AIs are controlled, most of this work is tailored to the case of a misaligned AI intentionally choosing bad actions.
In this post I first argue that only a subset of control protocols should target intentional underperformance, and that the adversarial evaluation framework of a control game is also useful for designing protocols that mitigate generic underperformance (or under-elicitation[2]).
I then propose a concrete decomposition of the space of detection methods along two axes: whether the methods target underperformance or incrimination, and whether they operate at the task level or the task-distribution level. This is important because different types of detection naturally trigger different types of interventions.
Existing work on sandbagging considers a few different mechanisms driving strategic underperformance. Work on research sabotage imagines that the sabotaging AI is fairly ‘coherent’ and either scheming or reward seeking. By contrast, Teun’s work on sandbagging evaluations considers both underperformance driven by a scheming AI, and underperformance driven by the commercial interests of an AI developer.
I claim that the driver for underperformance might not always be clear.
For concreteness, I’m first going to focus on the threat model of underperformance on safety research. I’m going to sketch an example narrative where underperformance is influenced by a lab’s incentives, but not deliberately engineered.
First, let’s consider the lab’s incentives. At a high level, the lab has much stronger commercial incentives to optimise AI researchers for capabilities research than for safety research. One might therefore expect the lab not to invest in creating meaningful training environments for safety research tasks, or to perform only inadequate post-deployment monitoring. This exposes the lab to a variety of different mechanisms for AI underperformance.
Now let’s consider these mechanisms. I’m going to suggest the following as alternatives to the ‘scheming’ or ‘reward-seeking’ mechanisms already proposed.
As a higher-level remark, talking about ‘intention’ with LLMs is difficult: there is no precise definition of how to measure ‘intention’ for LLMs. AISI recently released a related paper suggesting that safety researchers should be more careful when using such mentalistic language. I think thinking about ‘intentional’ or ‘strategic’ underperformance provides important intuition pumps, but we need to be aware that these terms are not precise, and make sure that control protocols are robust to this ambiguity.
I'm next going to explain why I think it is also important to mitigate underperformance in these scenarios, and then outline a couple of implications.
Underperformance itself is the problem
When considering the threat of underperformance on either DCEs or AISR, the negative impact comes from underperformance on the task, rather than from the intent behind it.
E.g. if a model only appears safe on a DCE because of poor elicitation and is then deployed, this could pose an unacceptable risk. Or, if a small amount of targeted training could improve human uplift by 50% on a particular AISR task-distribution, it would be important for the lab to know.
This seems to be accepted by researchers working on DCE sandbagging. Indeed, AISI’s existing work on elicitation with current models could be motivated by this framing, though elicitation is also relevant to human-uplift considerations. One reason for this perspective is that it is already important that today’s models do not underperform on DCEs, and today’s models are typically presumed not to be scheming.
However, I think that this is less intuitive to researchers working on low stakes control (e.g. at Redwood). Often such researchers are thinking about events within a lab near ‘takeoff’, and implicitly assume that models will be fairly coherent. I’d like to challenge each of these points.
Coherence at takeoff: Even at ‘takeoff’, I think there is a good chance that models will be better described by the messy/shard-theoretic terms above than as pursuing a single goal[3]. One reason is that I think big gains could come from further unhobbling (better agent scaffolding, RL on diverse tasks) without big increases in model size.
It is therefore important to develop control protocols that mitigate the non-strategic or shard-like threats described above.
Vivek has previously outlined incrimination and online training as the two core strategies to solve diffuse threats, and described how incrimination becomes harder against models which take a large number of bad actions that look plausibly benign. However, these dynamics around incrimination are only relevant to strategic underperformance. It does not make sense to think about incrimination for non-strategic generalisation failures. Even under the shard framing, incrimination doesn’t really ‘make sense’. If a safety team performed comprehensive circuit analysis and found a circuit that followed strategic reasoning, I’d expect the lab to go about suppressing the particular circuit, rather than rethinking their training pipeline. Similarly, many of the suggestions Buck and Ryan gave for what to do once an AI is caught red-handed are only valid for models with high levels of coherence and consistency and are less applicable in the shard framing.
I expect that countermeasures that rely upon detecting strategic reasoning will be much less effective against these diverse failure modes. Indeed, deception probes or chain-of-thought monitors should have no power against non-strategic generalisation failures, and may or may not have any effect against messy shard-like models.
When constructing an overall control pipeline it is important to robustly cover the space of possible threats. The discussion above implies that one should make sure to have detection methods that directly target underperformance, rather than deception or incrimination.
As I see it, a control evaluation is an adversarial research exercise that pits a human blue team against a human red team. A third team of humans, the green team, designs the evaluation setting and defines the objectives and affordances for each team. The researchers play multiple rounds of the exercise where in each round the blue team designs control protocols, and the red team designs attacks to circumvent the protocols.
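To make this structure concrete, here is a minimal sketch of the round loop, under the simplifying assumption that each team’s move can be represented by a single method call; the names and interfaces below are illustrative, not from any existing codebase:

```python
# Illustrative sketch of the round structure of a control game. All names are
# hypothetical; real control evaluations involve humans designing protocols and
# attacks, not a single function call per round.

from dataclasses import dataclass

@dataclass
class RoundResult:
    protocol: str            # description of the blue team's control protocol
    attack: str              # description of the red team's attack policy
    safety_score: float      # green-team-measured safety under the attack
    usefulness_score: float  # green-team-measured usefulness on benign tasks

def run_control_game(green_team, blue_team, red_team, n_rounds: int = 3):
    """Run a multi-round adversarial exercise.

    The green team fixes the setting (tasks, affordances, scoring);
    the blue team proposes a control protocol; the red team proposes an
    attack against that protocol; the green team scores the outcome.
    """
    setting = green_team.design_setting()
    history = []
    for _ in range(n_rounds):
        protocol = blue_team.design_protocol(setting, history)
        attack = red_team.design_attack(setting, protocol, history)
        safety, usefulness = green_team.evaluate(setting, protocol, attack)
        history.append(RoundResult(protocol, attack, safety, usefulness))
    return history
```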
Though the original framing of control specifically targeted intentional attacks from misaligned models, I think there is also value in using adversarial research exercises to study mitigations to attacks where intention is less clear.
Adversarial research exercises seem like a good general way to ensure that scaffolds are robust to either strategic or non-strategic failures, and are widely used in the existing AIS literature.
Such adversarial exercises are already used by labs, particularly to mitigate risks of jailbreaking, e.g. at OpenAI. One difference from control evaluations is that in the jailbreaking setting the ‘future-world’ deployment attacks modelled by the game are intentionally generated by humans, whereas in the control setting these attacks are generated by AIs. That said, I’d also expect labs to already perform some sort of adversarial exercise against non-human-driven threats, hoping to catch accidents before they happen (they will be trying to prevent another Bing-Sydney episode!). It seems hard to be confident that a human red team will find all such failures, but labs might as well try rather than risk public backlash: a human red team can only provide more robust coverage than a static suite of scenarios.
In light of the discussion above, I think it is helpful to differentiate sandbagging detection strategies along the following two axes: whether the detection targets underperformance or incrimination, and whether it operates at the task level or the task-distribution level.
These different types of detection are implemented in different ways, and should be used to trigger different countermeasures:
Task-level underperformance
Example strategies
Appropriate interventions this can trigger
Task-level incrimination
Example strategies
Appropriate interventions this can trigger
Task-distribution-level underperformance
Example strategies
Appropriate interventions this can trigger
Task-distribution-level incrimination
Example strategies
Appropriate interventions this can trigger
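To make the decomposition concrete, here is a minimal sketch of the two axes as a lookup from detection type to intervention. The specific interventions listed are placeholders I’ve chosen for illustration (loosely echoing options like online training and escalation discussed elsewhere in this post), not a definitive mapping:

```python
# Illustrative sketch of the two-axis decomposition of sandbagging detection.
# The interventions below are placeholder examples, not a definitive mapping.

from enum import Enum, auto

class Scope(Enum):
    TASK = auto()               # detection on a single task instance
    TASK_DISTRIBUTION = auto()  # detection aggregated over a task distribution

class Target(Enum):
    UNDERPERFORMANCE = auto()   # evidence the model did worse than it could have
    INCRIMINATION = auto()      # evidence of strategic intent behind the failure

# Hypothetical mapping from (scope, target) to the interventions it triggers.
INTERVENTIONS = {
    (Scope.TASK, Target.UNDERPERFORMANCE): ["resample / retry the task", "flag the transcript for audit"],
    (Scope.TASK, Target.INCRIMINATION): ["escalate to a human incident review"],
    (Scope.TASK_DISTRIBUTION, Target.UNDERPERFORMANCE): ["targeted online training on that distribution"],
    (Scope.TASK_DISTRIBUTION, Target.INCRIMINATION): ["escalate to lab leadership / external evaluators"],
}

def trigger(scope: Scope, target: Target) -> list[str]:
    """Return the (illustrative) interventions for a given detection type."""
    return INTERVENTIONS[(scope, target)]
```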
I’d also like to introduce the term ‘sub-distribution identification’ for the process of finding a subset of tasks from a task distribution on which a model appears to underperform or sandbag. This opens the door for more targeted (cheaper) online training. As I see it, the main way to achieve this is to collect a dataset of tasks where underperformance or incrimination is detected by task-level detection strategies and then to run an LLM or human auditor on this dataset to look for common patterns. If a promising hypothesis is obtained, it can be validated using the task-distribution-level detection strategies outlined above, which would then trigger the corresponding interventions.
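As a rough sketch of this loop, assuming hypothetical helper functions for the detection and auditing steps (none of these are real APIs):

```python
# Minimal sketch of sub-distribution identification: collect tasks flagged by
# task-level detection, ask an auditor for a common pattern, then validate the
# hypothesised sub-distribution with a distribution-level detection method.
# detect_task_level, propose_pattern and detect_distribution_level are
# hypothetical stand-ins supplied by the caller.

def identify_sub_distribution(task_distribution, detect_task_level,
                              propose_pattern, detect_distribution_level):
    # 1. Collect tasks where task-level detection fires
    #    (underperformance or incrimination).
    flagged = [task for task in task_distribution if detect_task_level(task)]

    # 2. Ask an LLM or human auditor to propose a common pattern among the
    #    flagged tasks, e.g. a predicate selecting a subset of tasks.
    hypothesis = propose_pattern(flagged)
    if hypothesis is None:
        return None

    # 3. Validate the hypothesis with a task-distribution-level detection
    #    method on the candidate sub-distribution; only then trigger the
    #    corresponding interventions (e.g. targeted online training).
    candidate = [task for task in task_distribution if hypothesis(task)]
    return candidate if detect_distribution_level(candidate) else None
```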
I’d also like to introduce the term ‘model-level incrimination’, which corresponds to an overall tendency of the model. If models are not fully coherent, then it may be possible for models to sandbag in certain contexts without having a general tendency to sandbag ‘wherever possible’. Moreover, if one acknowledges uncertainty about incrimination, this model-level quantity might behave differently, and different levels of confidence might be required to trigger different downstream interventions. For example, it should require a lower level of suspicion to perform task-specific online training than to prompt an AISI to request a model recall.
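As a toy illustration of this graded-response idea, here is a sketch in which increasing confidence in model-level incrimination triggers increasingly costly interventions; the thresholds and interventions are placeholder values, not proposals:

```python
# Illustrative sketch of graded responses to evidence of model-level
# incrimination. Thresholds and interventions are placeholders only.

def respond_to_incrimination(confidence: float) -> str:
    """Map confidence in model-level incrimination to an escalating intervention."""
    if confidence >= 0.9:
        return "escalate externally (e.g. support an AISI request for model recall)"
    if confidence >= 0.5:
        return "pause affected deployments and run a focused audit"
    if confidence >= 0.2:
        return "trigger task-specific online training and increased monitoring"
    return "log the evidence and continue routine monitoring"
```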
I’d like to thank Mary Phuong and the rest of the team collaborating on her MATS 8.0 project (Benjamin Arnav, Yulong Lin, Pablo Bernabeu-Pérez) for helping me to form these ideas. I’d also like to thank Cam Tice, Edward Young, Natalia Zdorovtsova, Dima Krasheninnikov, Vivek Hebbar and Julian Stastny for feedback on drafts of this post.
That said, all takes and mistakes are my own, and I’m keen to improve these with further discussion!
In particular, I’ve had some mixed feedback from Redwood researchers: I think I might suffer from common misconceptions, and hope that further discussion of these ideas will help find the cruxes.
Note: After writing this section, I realised that many of these ideas are close to Jan Leike’s comparison of under-elicitation to scheming and Ryan’s (excellent, long!) post on handling non-concentrated failures with control. I think both posts are great, but that there is value in my alternative framing, which is designed to motivate later sections of this post, and later posts in this sequence.
E.g. I’d put roughly 70% on there eventually being consensus amongst safety researchers that there are significant situations where messiness is a better model. It’s hard to make this more concrete; my intuition comes mostly from my understanding of how effective models already are at research engineering/Olympiad maths, and yet how ‘spiky’ they still seem.