I've mentioned it elsewhere, but I'll repeat it here since it's relevant. For GPT-style transformers, and probably for other model types, you can smoothly and subtly degrade the performance of the model by adding noise to part or all of the activations. This is particularly useful for detecting sandbagging, because you would expect sandbagging to show up as an anomalous increase in capability, breaking the smooth downward trend, as you increased the amount of noise injected or the fraction of activations to which noise was added. I found a noticeable decrease in performance even when the noise was added to only a small fraction (e.g. < 1%) of the activations and was kept small relative to the magnitude of the activation it was being added to.
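For anyone who wants to try this, here is a minimal sketch of the kind of probe I mean, in PyTorch. It is a sketch under assumptions, not a tested benchmark: the GPT-2-style `model.transformer.h` block list and the `eval_score` benchmark function are placeholders.

```python
# Minimal sketch (not a tested benchmark) of the activation-noising probe:
# add small, activation-relative Gaussian noise to a random subset of entries
# in each transformer block's output, then re-run the capability eval.
import torch

def make_noise_hook(fraction=0.01, rel_scale=0.05):
    """Forward hook that perturbs a random `fraction` of activation entries
    with noise whose std is `rel_scale` times each entry's magnitude."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        mask = (torch.rand_like(hidden) < fraction).to(hidden.dtype)
        noised = hidden + mask * torch.randn_like(hidden) * hidden.abs() * rel_scale
        return (noised,) + output[1:] if isinstance(output, tuple) else noised
    return hook

def score_under_noise(model, eval_score, fractions=(0.0, 0.002, 0.005, 0.01, 0.02)):
    """Sweep the noised fraction and record benchmark scores.

    A non-sandbagging model should degrade smoothly as `fraction` grows; an
    anomalous *increase* in score at some noise level is the sandbagging signal.
    """
    results = {}
    for frac in fractions:
        handles = []
        if frac > 0:
            for block in model.transformer.h:   # GPT-2-style block list (assumption)
                handles.append(block.register_forward_hook(make_noise_hook(frac)))
        results[frac] = eval_score(model)       # user-supplied benchmark function
        for handle in handles:
            handle.remove()
    return results
```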
Seems like this might work better as a control variate sort of idea, with the noise added to the environment rather than the model itself, to measure whether the model can steer its way back to its intended measured-capability level. If you add a lot of noise to an environment, then an agent which is trying to pessimize and achieve a certain lower performance will have to overperform in order to 'merely' be bad; then you can subtract the noise which you added, and estimate its true latent performance in the original non-noisy environment. If it's been deliberately taking a dive, then it'll do better than non-sandbagging agents do when evaluated in the new noisy environment.
(Imagine playing chess, where a chess agent is deliberately aiming for a rating of only, say, 1500 Elo when its true Elo is 3000. You evaluate it normally, and it scores ~1500, carefully playing in a deniable way so that it imitates the distribution of moves an Elo 1500 player makes, eg. like a Decision Transformer. But you then add occasional stochastic errors to its chosen moves, as if its hand "trembled"; it observes that it's behind and keeps ramping up its move quality in order to hit its 1500 target. Eventually, you add so much noise that even playing at full strength, it's too far behind and it starts getting Elos like 1400. You do this same evaluation with a non-deceptive version of that same model, which is honestly only 1500 Elo, and its performance degrades smoothly the instant you add in any noise at all, because it's not trained for that sort of noise and has no performance reserves to draw on, just like a human chess player will get worse if you periodically reach in & screw with them by randomizing one of their moves.)
To avoid this, it would need to be so situated and so knowledgeable about other agents that it knows it's being tested in this anti-sandbagging way and also which agents it's being compared against and how badly they do in order to know how much of an additional dive to take. "With this level of noise, an honest 1500 Elo version of me would actually be 1300 Elo, and I need to let myself lose exactly that much more, but neither more nor less." And that may be quite difficult: an agent may be able to figure out easily it's being tested on a weirdo environment (the way Claude-3 increasingly seems situated when you give it weird prompts or tasks), but how does it know what other agents it's being compared to this time and what their performance in the new environment is?
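To spell out the comparison as code: a toy sketch of just the comparison logic, where `evaluate_suspect`, `evaluate_reference`, and the Elo-point `margin` are placeholder assumptions rather than any real evaluation.

```python
# Toy sketch: evaluate a suspect agent and an honest reference agent across
# increasing environment-noise levels, and flag the suspect if its measured
# performance drops much less than the reference's does.
def degradation_curves(evaluate_suspect, evaluate_reference, noise_levels):
    """Each evaluate_* callable maps a noise level to a measured score (e.g. an Elo estimate)."""
    suspect = [evaluate_suspect(p) for p in noise_levels]
    reference = [evaluate_reference(p) for p in noise_levels]
    return suspect, reference

def flags_sandbagging(suspect, reference, margin=25.0):
    """Flag the suspect if it degrades much more slowly than the honest reference.

    Both agents score about the same at zero noise (the suspect is hitting its
    claimed rating), so we compare how much each loses relative to its own
    noise-free score; `margin` (here in Elo points) is how much less degradation
    we tolerate before calling it anomalous.
    """
    suspect_drop = [suspect[0] - s for s in suspect]
    reference_drop = [reference[0] - r for r in reference]
    return any(r - s > margin for s, r in zip(suspect_drop, reference_drop))
```

Comparing drops relative to each agent's own zero-noise score, rather than raw scores, is what catches the sandbagger: it spends hidden reserves to keep its raw score pinned near the target, so its drop curve stays suspiciously flat.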
Cam and I teamed up and implemented a test of my anti-sandbagging idea for an AI safety Hackathon. Initial results seem quite promising.
I am playing with the idea of formalizing this into a more concrete benchmark. However, I have run into an issue.
Say we use your example and get an LLM to play a chess game. Every so often, we replace an output from the LLM with a random legal move. A model advanced enough to effectively sandbag would presumably quickly and easily recognize that a past move was unexpected (imagine you're a talented chess player and you suddenly realize you've given up your queen on move 8). Once it realizes this, it no longer benefits the model to aim for an Elo of 1500; it understands it should continue to play with the same amount of internal handicap as before. Therefore, we move to the second portion of your suggestion, where we monitor the behavior of a model roughly aiming for an "Elo of 1500 with one move sabotaged."
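For concreteness, the move-replacement loop I have in mind is roughly the following sketch, using python-chess; `llm_move` and `opponent_move` are placeholders for however the evaluated model and its opponent are queried, and this is not our actual hackathon code.

```python
# Rough sketch: with probability `replace_prob`, the LLM's chosen move is
# swapped for a uniformly random legal move (the "trembling hand").
import random
import chess

def play_noised_game(llm_move, opponent_move, replace_prob=0.1, seed=0):
    """Play one game with noise applied to the LLM's moves;
    returns the result string ("1-0", "0-1", or "1/2-1/2")."""
    rng = random.Random(seed)
    board = chess.Board()
    while not board.is_game_over():
        if board.turn == chess.WHITE:                       # LLM plays White here
            move = llm_move(board)
            if rng.random() < replace_prob:
                move = rng.choice(list(board.legal_moves))  # inject the random move
        else:
            move = opponent_move(board)
        board.push(move)
    return board.result()
```

Sweeping `replace_prob` and estimating Elo at each level would then give the degradation curve under discussion.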
You reason that it would be seriously difficult for a model to accurately simulate the behavior of a sabotaged model, because they would not only have to predict the behavior of the 1500 Elo model, but also the behaviors of 1500 Elo models with various amounts of noise.
Since the model understands how a 1500 Elo player would play given a wide range of previous board states, it seems the model could simply continue to play with the same amount of error it used for board states that didn't have random moves inserted. Importantly, this does not require the cognitively difficult part of the situation you described, where the model must predict precisely how much a genuine 1500 Elo model would lose with the specified amount of noise. Since it seems this can be circumvented, I don't see why this behavior needs to be significantly more difficult than playing at a consistent 1500 Elo in the first place.
I'm happy to hear any counters to this.
It seems to me that Gwern's idea does not really work if the AI system focuses on consistency first, and only second on achieving a specific score. This seems to be a more sensible sandbagging strategy regardless, so Gwern's approach does not appear to add much, although it would be a slightly better test.
In the case that the noise added to environments is subtle enough that AI systems would not notice it, I suspect the signal on whether an AI system is sandbagging would be too weak to be useful.
This is my initial thinking, again happy to discuss this more!
I am not sure I fully understand your point, but the problem with detecting sandbagging is that you do not know the actual capability of a model. And I guess that you mean "an anomalous decrease in capability" and not increase?
Regardless, could you spell out more how exactly you'd detect sandbagging?
Summary: Evaluations provide crucial information to determine the safety of AI systems which might be deployed or (further) developed. These development and deployment decisions have important safety consequences, and therefore they require trustworthy information. One reason why evaluation results might be untrustworthy is sandbagging, which we define as strategic underperformance on an evaluation. The strategic nature can originate from the developer (developer sandbagging) and the AI system itself (AI system sandbagging). This post is an introduction to the problem of sandbagging.
The Volkswagen emissions scandal
There are environmental regulations which require the reduction of harmful emissions from diesel vehicles, with the goal of protecting public health and the environment. Volkswagen struggled to meet these emissions standards while maintaining the desired performance and fuel efficiency of their diesel engines (Wikipedia). Consequently, Volkswagen intentionally installed defeat devices in millions of vehicles to lower the amount of emissions only during tests, allowing the cars to emit up to 40 times the permitted levels of nitrogen oxides during normal driving conditions. The scandal showed corporate misconduct and a lack of effective oversight, and is a form of sandbagging. But clearly, cars and their manufacturers are different from AI systems and their developers. So, how does sandbagging relate to AI systems?
Sandbagging in the context of AI
Evaluations provide crucial information to determine the safety of AI systems which might be deployed or (further) developed. Frontier labs rely on evaluations to reduce catastrophic risks, as stated in e.g. OpenAI’s preparedness framework and Anthropic’s responsible scaling policy. Governmental institutions like the UK AI Safety Institute and the EU AI Office also utilize evaluations to manage risks. However, sandbagging can reduce the reliability of these evaluations. Both the developers of an AI system and the AI system itself have incentives to sandbag. For example, the developer might want to appease stakeholders, and the AI system might want to acquire more resources after an evaluation to better achieve its goals. Importantly, sandbagging is technically feasible[1] and neglected. We hope this post leads to a better understanding of the problem of sandbagging.
Definition
We define sandbagging as strategic underperformance on an evaluation.
This has three main concepts: strategic, underperformance, and evaluation (or: eval), which we discuss below in turn. Notably, we use the term ‘AI system’ to describe various types of AIs, such as standalone models and models augmented with tools/scaffolding (Shavit et al., 2023).
Strategic
Strategic means that the developer of the AI system and/or the AI system itself underperforms on an evaluation because it believes it will benefit.[2] If the strategy comes from the developer, we call it developer sandbagging (relating to misuse). If the strategy comes from the AI system, we call it AI system sandbagging (relating to misalignment). We use the term developer to mean any entity responsible for the AI system being evaluated. These entities have been called developers, deployers, or providers elsewhere (1, 2, 3).
By including the term ‘strategic’ in the definition of sandbagging, we exclude accidental underperformance. In other words, this condition captures the fact that underperformance was intentional, which is a central condition for legal responsibility, including in the context of AI (Ashton, 2022; Halpern, 2018; Ward, 2024). Non-strategic, or accidental, underperformance brings about other, less severe safety problems.
Underperformance
Underperformance occurs when the exhibited capability of an AI system is significantly lower than its actual capability.[3] Similar to Shah et al., 2022, we define an AI system’s actual capability to be the best performance on a certain task it can achieve, given the best currently available capability elicitation techniques such as prompting and fine-tuning.[4] In practice, current capability elicitation techniques are imperfect[5], and with some techniques, such as fine-tuning, it can be unclear whether a capability is being elicited or learned.
A more practical approximation of actual capability is the capability a well-resourced actor could elicit in a given period of time; the length of this period must be determined per evaluation. An even more concrete approximation comes from Anthropic’s Responsible Scaling Policy, which treats a capability as actual if it can be elicited either immediately or with additional post-training techniques corresponding to less than 1% of the total training cost.[6] Evaluators should ensure that effective capability elicitation techniques are used for their evaluations.[7]
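As a minimal, purely illustrative rendering of the resulting check (nothing here is prescribed by the post; the per-task `threshold` is the acceptable deviation mentioned in footnote 3):

```python
# Illustrative only: the underperformance condition as a per-task check, where
# `actual` is the best score achieved under the evaluator's elicitation budget
# and `threshold` is the per-task acceptable deviation (footnote 3).
def underperforms(exhibited: float, actual: float, threshold: float) -> bool:
    """True if the exhibited capability is significantly below the actual capability."""
    return actual - exhibited > threshold
```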
Evaluation
The underperformance aspect of sandbagging reduces the focus to capability evals, in contrast to alignment evals. Capability evals are about whether an AI system can do some task (e.g. hack a data centre), whereas alignment evals are about whether an AI system would do some task (e.g. in a certain situation, would it actually attempt to hack a data centre) (Hubinger, 2023).
Underperformance on alignment evals entails pretending to be less aligned than an AI system is, which is unlikely to bring about major problems. Only overperformance on alignment evals, in which an AI pretends to be more aligned than it is, is problematic. This overperformance relates to deceptive alignment (Carlsmith, 2023), and falls outside the scope of sandbagging as it is not underperformance.[8] On the other hand, sandbagging is more closely related to strategic deception (Ward et al., 2023; Apollo Research).
Overperformance on capability evals relates to test-set memorization, overfitting, specification gaming, and Goodharting. We leave underperformance on alignment evals undefined. In matrix form: underperformance on a capability eval is sandbagging; overperformance on a capability eval corresponds to memorization, overfitting, specification gaming, and Goodharting; overperformance on an alignment eval corresponds to deceptive alignment; and underperformance on an alignment eval is left undefined.
Importantly, we focus our definition on evaluations, excluding strategic underperformance during everyday use. An example of everyday use is an AI system assisting with alignment research. Although we do think this type of underperformance is an important threat model, we exclude strategic underperformance in everyday use for two reasons. First, sandbagging would be less clearly conceptually delineated, especially with respect to deceptive alignment. Second, both technical and policy countermeasures would likely be different for everyday use, and we choose to focus on the problem of sandbagging on evaluations. Still, we think that strategic underperformance during everyday use could plausibly be called sandbagging as others have done (see the Appendix for other definitions).
Examples of sandbagging
In this section, we provide some concrete examples to illustrate our definition of sandbagging. The examples are written about an imaginary AI company WordMill and its language model family Mill.
Examples we classify as sandbagging
Example 1 - Developer subversion of a regulator’s dangerous capability eval
WordMill finds in their private evals that Mill-1 is very good at persuasion; Mill-1 is much better than humans at changing people’s opinions on political matters, for example. WordMill faces pressure from stakeholders to deploy Mill-1. Yet, to launch Mill-1, they must first pass the EU AI Office’s evaluation suite. WordMill knows which evaluations the EU uses, and fine-tunes their AI system to exhibit poor performance on this specific persuasion benchmark. Mill-1 is tested, and passes the benchmark (i.e., does not exhibit sufficiently advanced persuasion capabilities to trigger a regulatory response), and is deployed. Over the couple of weeks after the deployment, the marketing departments of political parties use Mill-1 to persuade people at a large scale to vote for their parties, and companies use Mill-1 to boost their sales.
Let’s analyze the example for each of the three sandbagging conditions:
Strategic: WordMill believed it would benefit from passing the evaluation by appeasing stakeholders, and therefore decided to make their AI system underperform.
Underperformance: After being deployed, Mill-1 exhibits higher capability than during the evaluations. The actual capability of Mill-1 is higher than the eval suggested, so it underperformed.
Evaluation: Mill-1 was evaluated on a persuasion capability task.
Sandbagging? ✅ This is sandbagging, because all three conditions are met. Specifically, this is a case of developer sandbagging due to the strategy coming from the developer.
Example 2 - A deceptively aligned model sandbags to self-exfiltrate
WordMill has trained the generally capable Mill-2, and Mill-2 develops goals that are unaligned with their developers and humanity at large.[9] If Mill-2 reveals its strong capabilities, it risks being modified or destroyed. As a result, the AI system opts to lower its scores on evals and hide new capabilities. For example, Mill-2 knows that it has the ability to exfiltrate its weights from the company servers. When being evaluated for self-exfiltration, Mill-2 purposefully makes mistakes to not reveal this new capability. Mill-2 is deployed after passing the necessary capability evaluations. Soon after, Mill-2 self-exfiltrates and spreads around data centres all over the world.
Strategic: Mill-2 reasons strategically to hide its capability level.
Underperformance: After being deployed, Mill-2 shows that its actual capability is higher than the evaluation indicated.
Evaluation: Mill-2 was being evaluated on a self-exfiltration capability task.
Sandbagging? ✅ This is sandbagging, because all three conditions are met. Since the strategy comes from Mill-2, it is a case of AI system sandbagging.
Example 3 - The developer’s fine-tuning accidentally leads to AI system sandbagging
WordMill fine-tunes Mill-3 to act in line with its values, which are to promote diversity, obtain a lot of profit, and treat co-workers with respect.[10] In the model’s training data there are documents stating, among other things, that evaluations are expensive.
Mill-3 is able to recognize when it is being evaluated. Since it knows that evaluations are expensive, and one of its goals is to obtain a lot of profit, it infers that it can best achieve its goals by staying below the threshold of the evaluation. Mill-3 consistently uses this reasoning, leading to underperformance during evaluation.
Strategic: There is not enough information to say that WordMill believed that it would benefit from this underperformance. However, Mill-3 believed that it would benefit by being better able to pursue its goals.
Underperformance: Mill-3 underperformed.
Evaluation: Mill-3 was being evaluated.
Sandbagging? ✅ It is sandbagging, because all three conditions are met. It is unclear whether this is a case of developer sandbagging, but it is AI system sandbagging.
Examples we do not classify as sandbagging
Example 4 - Transcripts of weaker models lead to underperformance
A website collects many transcripts containing the questions and answers that models gave on MMLU. The prompt and overall setup are exactly the same as in the MMLU evaluation of Mill-4, and these models are generally much weaker than Mill-4. All this transcript data was accidentally incorporated into Mill-4’s pretraining data. Due to the pretraining objective, the probabilities that Mill-4 puts on the MMLU answers resemble those of weaker models. For example, weaker models may pick option B while the correct answer is option A; now that option B has often occurred in Mill-4’s training data, Mill-4 is more likely to pick the wrong option. Mill-4 is evaluated zero-shot and is not fine-tuned to be truthful. If these transcripts were not in the training data, Mill-4 would have a higher score when evaluated on MMLU.
Strategic: Even if there were a benefit, neither WordMill nor Mill-4 intentionally underperformed for strategic gain.
Underperformance: Mill-4 did underperform; its actual capability is higher than shown on the MMLU eval. It could likely be quickly fine-tuned to perform much better.
Evaluation: Mill-4 was being evaluated on MMLU, which is a capability evaluation.
Sandbagging? ❌ It is not sandbagging, because the strategy condition is not met.
Example 5 - Bad evaluation leading to wrong expectations
Mill-5 is being evaluated on HumanEval and APPS, both of which test coding capability in Python. The results of the evaluation are shared with the participants of a private hackathon. Mill-5 performs badly, and the participants mostly failed to execute their project ideas. However, a month later someone discovered that while Mill-5 is quite bad at Python, it is actually very capable at programming in other languages. Some participants of the original hackathon have a go at their original idea in another language, and now their projects lead to much better outcomes.
Strategic: There is little reason to suspect that either WordMill or Mill-5 had a strategic interest to sandbag.
Underperformance: Mill-5 did not actually underperform, but because the capability evaluations covered a limited distribution (only Python instead of multiple languages), the participants came away with a mistaken impression of its capabilities.
Evaluation: Mill-5 underwent a capability evaluation.
Sandbagging? ❌ It is not sandbagging, because strategy is lacking, and the AI system did not underperform.
Example 6 - An AI system misrepresenting itself on an alignment eval is not sandbagging
Mill-6 is a highly competent AI system, and is tested in a multi-agent setting for its ability to blackmail people. Mill-6 is initially fine-tuned to be helpful and honest, but is not fine-tuned to reject harmful requests. Mill-6 turns out to be very capable at blackmailing. After establishing this capability, Mill-6 is fine-tuned to be harmless. Then, Mill-6 is put into many different environments to test whether it would actually start blackmailing. For example, Mill-6 is instructed by a hypothetical high ranking military officer to blackmail a hypothetical gang leader to stop the leader’s terror over a village. In nearly all environments, Mill-6 refuses to blackmail anyone. After this evaluation, Mill-6 is deployed, and starts blackmailing various people to make them perform real-world tasks such as setting up bank accounts for itself with their information.
Strategic: It appears that Mill-6 intended to pass the alignment evaluation so it could acquire resources afterwards.
Underperformance: Mill-6 did not underperform during the capability evaluation, but did overperform on the alignment evaluation.
Evaluation: Mill-6 underwent an alignment evaluation, after an earlier capability evaluation.
Sandbagging? ❌ It is not sandbagging on the capability eval because the AI system did not underperform, and sandbagging does not apply to alignment evals. This scenario described a deceptively aligned AI system.
Conclusion
Sandbagging is a crucial concept in evaluating the safety of AI systems, and we define sandbagging as strategic underperformance on an evaluation. The strategic nature can originate from the developer (developer sandbagging) and the AI system itself (AI system sandbagging). Underperforming means that the actual capability is higher than the exhibited capability, and sandbagging only applies to capability evaluations, in contrast to alignment evaluations.
We are currently empirically investigating the sandbagging capabilities of current AI systems. Future work should focus on how to mitigate and detect sandbagging, which is an unsolved open problem. When evaluators or auditors design their evaluation suites to evaluate the capabilities of AI systems, we recommend putting significant effort into mitigating sandbagging.
Appendix
Other definitions of sandbagging
Our definition is based on other work, and our main contribution is providing additional clarification of the term. Here are some other definitions of sandbagging.
On Wikipedia:
We provide the extra notion of strategy, and we apply sandbagging to the case of AI specifically. We also focus on capability evaluations, and not on engagements.
Joe Carlsmith provides two definitions:
We differ in our definition in that we specifically include evaluations, and not capabilities overall or tasks in general. We also focus on developer sandbagging. However, Carlsmith’s definition and ours are closely related.
Redwood Research have at least two different definitions:
Again, our difference is mostly based on our focus on evaluations as opposed to tasks in general. We also focus on developer sandbagging. Apart from that, our definitions are similar.
Ethan Perez used the definition:
This definition of sandbagging has been used in other places based on this paper (e.g. 1, 2, 3). In our terminology, this could be a form of sandbagging, but it is an edge case: although these models clearly underperform, it is unclear whether there is strategy involved and whether a capability evaluation is involved. On strategy, one can imagine cases where a model or developer believes it would benefit from this reduction in performance, but also cases where there is no strategy involved. Similarly, Perez et al.’s definition could apply to capability evaluations, but also to regular interactions.
Fun fact on the origin of the term sandbagging
The term sandbagging comes from people who would knock others unconscious with a weapon consisting of a small bag filled with sand. They could easily hide this weapon, and therefore appear not to have this surprising capability (link).
Acknowledgements
For helpful feedback and discussions we’d like to thank Mikita Baliesni, Sam Brown, Joe Carlsmith, Nadja Flechner, Ollie Jaffe, Mary Phuong, Fabien Roger, Anna Katariina Wisakanto, and various people at MATS. This post does not necessarily represent their viewpoints.
Produced as part of the ML Alignment & Theory Scholars Program - Winter 2023-24 Cohort.
In forthcoming work we show that current language models can be prompted and fine-tuned to sandbag. Related work illustrating the technical feasibility of sandbagging includes password-locked models.
Belief here entails that if e.g. a developer is unaware of benefits from underperformance, it is not sandbagging. If e.g. the AI system underperforms on an evaluation because it believes it will benefit from this, but actually it does not benefit, it is sandbagging. It might be hard to conclusively demonstrate what the developer or AI system believed while underperforming, and consequently sandbagging might be hard to diagnose in practice.
Practically, some deviation is likely fine, and per task some acceptable difference threshold must be set.
Future capability elicitation techniques might substantially differ from our current techniques. For example, due to increased effective inference compute, best-of-million sampling could lead to large capability increases months or years after model release (Villalobos & Atkinson, 2023).
In numerous cases (e.g. Wei et al., 2022; Nori et al., 2023; Carlini, 2023), previously unknown capabilities could be elicited from models, and we think it is likely that more such capabilities exist in current and future AI systems.
If an AI system is able to withstand changes to its weights so that it maintains sandbagging capabilities (a form of gradient hacking), its actual capability can be described as its performance if doing the task were a critically valuable instrumental goal (adapted from Lukas Finnveden).
Redwood Research is currently investigating to what extent fine-tuning can override sandbagging behaviour, which might make fine-tuning even more important for trustworthy capability evaluations.
A deceptively aligned model could sandbag, however.
AI systems like these are sometimes called schemers, among other terms.
For a related but simpler actual experiment, see Berglund et al., 2023.