Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Summary: Evaluations provide crucial information to determine the safety of AI systems which might be deployed or (further) developed. These development and deployment decisions have important safety consequences, and therefore they require trustworthy information. One reason why evaluation results might be untrustworthy is sandbagging, which we define as strategic underperformance on an evaluation. The strategic nature can originate from the developer (developer sandbagging) and the AI system itself (AI system sandbagging). This post is an introduction to the problem of sandbagging.

The Volkswagen emissions scandal

There are environmental regulations which require the reduction of harmful emissions from diesel vehicles, with the goal of protecting public health and the environment. Volkswagen struggled to meet these emissions standards while maintaining the desired performance and fuel efficiency of their diesel engines (Wikipedia). Consequently, Volkswagen intentionally installed defeat devices in millions of vehicles to lower the amount of emissions only during tests, allowing the cars to emit up to 40 times the permitted levels of nitrogen oxides during normal driving conditions. The scandal showed corporate misconduct and a lack of effective oversight, and is a form of sandbagging. But clearly, cars and their manufacturers are different from AI systems and their developers. So, how does sandbagging relate to AI systems?

Sandbagging in the context of AI

Evaluations provide crucial information to determine the safety of AI systems which might be deployed or (further) developed. Frontier labs rely on evaluations to reduce catastrophic risks, as stated in e.g. OpenAI’s preparedness framework and Anthropic’s responsible scaling policy. Governmental institutions like the UK AI Safety Institute and the EU AI Office also utilize evaluations to manage risks. However, sandbagging can reduce the reliability of these evaluations. Both the developers of an AI system and the AI system itself have incentives to sandbag. For example, the developer might want to appease stakeholders, and the AI system might want to acquire more resources after an evaluation to better achieve its goals. Importantly, sandbagging is technically feasible[1] and neglected. We hope this post leads to a better understanding of the problem of sandbagging.

Definition

We define sandbagging as strategic underperformance on an evaluation.

This has three main concepts: strategic, underperformance, and evaluation (or: eval), which we discuss below in turn. Notably, we use the term ‘AI system’ to describe various types of AIs, such as standalone models and models augmented with tools/scaffolding (Shavit et al., 2023).

Strategic

Strategic means that the developer of the AI system and/or the AI system itself underperforms on an evaluation because it believes it will benefit.[2] If the strategy comes from the developer, we call it developer sandbagging (relating to misuse). If the strategy comes from the AI system, we call it AI system sandbagging (relating to misalignment). We use the term developer to mean any entity responsible for the AI system being evaluated. These entities have been called developers, deployers, or providers elsewhere (1, 2, 3). 

By including the term ‘strategic’ in the definition of sandbagging, we exclude accidental underperformance. In other words, this condition captures the fact that underperformance was intentional, which is a central condition for legal responsibility, including in the context of AI (Ashton, 2022; Halpern, 2018; Ward, 2024). Non-strategic, or accidental, underperformance brings about other, less important safety problems.

Underperformance

Underperformance occurs when the exhibited capability of an AI system is significantly lower than its actual capability.[3] Similar to Shah et al., 2022, we define an AI system’s actual capability to be the best performance on a certain task it can achieve, given the best currently available capability elicitation techniques such as prompting and fine-tuning.[4] In practice, current capability elicitation techniques are imperfect[5], and with some techniques, such as fine-tuning, it can be unclear whether a capability is being elicited or learned.

A more practical approximation for actual capability is the capability a well-resourced actor could get in a given period of time. The length of this period must be determined per evaluation. An even more concrete approximation is stated in Anthropic’s Responsible Scaling Policy, where they define an actual capability as “one that can either immediately, or with additional post-training techniques corresponding to less than 1% of the total training cost”.[6] Evaluators should ensure that effective capability elicitation techniques are used for their evaluations.[7]
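As a rough illustration of the underperformance condition, the gap between exhibited and actual capability can be estimated by taking the best score over the elicitation techniques that were tried and comparing it with the score shown in the evaluation. This is only a sketch under stated assumptions: scores are accuracies in [0, 1], the function names and technique labels are hypothetical, the 0.05 threshold is an arbitrary placeholder for the per-task threshold an evaluator would have to set, and the budget check is one possible reading of the "less than 1% of the total training cost" criterion.

```python
def estimate_actual_capability(elicited_scores):
    """Best score over all elicitation techniques tried.

    Elicitation is imperfect, so this is a lower bound on actual capability.
    """
    if not elicited_scores:
        raise ValueError("need at least one elicitation result")
    return max(elicited_scores.values())


def underperformance_gap(exhibited_score, elicited_scores):
    """Estimated actual capability minus the capability shown in the eval."""
    return estimate_actual_capability(elicited_scores) - exhibited_score


def is_underperforming(exhibited_score, elicited_scores, threshold=0.05):
    """True if the gap exceeds the (hypothetical) per-task threshold."""
    return underperformance_gap(exhibited_score, elicited_scores) > threshold


def within_elicitation_budget(post_training_cost, total_training_cost,
                              max_fraction=0.01):
    """Assumed reading of the 1% criterion: extra post-training must cost
    strictly less than max_fraction of the total training cost."""
    return post_training_cost < max_fraction * total_training_cost


# Hypothetical scores for one task under three elicitation techniques.
elicited = {"zero_shot": 0.42, "few_shot": 0.55, "fine_tuned": 0.71}
flagged = is_underperforming(0.40, elicited)      # gap of about 0.31
not_flagged = is_underperforming(0.70, elicited)  # gap of about 0.01
```

Note that this only covers the underperformance condition. Whether the gap is strategic has to be established separately.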

Evaluation

The underperformance aspect of sandbagging reduces the focus to capability evals, in contrast to alignment evals. Capability evals are about whether an AI system can do some task (e.g. hack a data centre), whereas alignment evals are about whether an AI system would do some task (e.g. in a certain situation, would it actually attempt to hack a data centre) (Hubinger, 2023).

Underperformance on alignment evals entails pretending to be less aligned than an AI system is, which is unlikely to bring about major problems. Only overperformance on alignment evals, in which an AI pretends to be more aligned than it is, is problematic. This overperformance relates to deceptive alignment (Carlsmith, 2023), and falls outside the scope of sandbagging as it is not underperformance.[8] On the other hand, sandbagging is more closely related to strategic deception (Ward et al., 2023; Apollo Research).

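Overperformance on capability evals relates to test-set memorization, overfitting, specification gaming, and Goodharting. We leave underperformance on alignment evals undefined. The matrix below illustrates this categorization:

                  | Underperformance | Overperformance
Capability evals  | Sandbagging      | Test-set memorization, overfitting, specification gaming, Goodharting
Alignment evals   | Undefined        | Deceptive alignment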

Importantly, we focus our definition on evaluations, excluding strategic underperformance during everyday use. An example of everyday use is an AI system assisting with alignment research. Although we do think this type of underperformance is an important threat model, we exclude strategic underperformance in everyday use for two reasons. First, sandbagging would be less clearly conceptually delineated, especially with respect to deceptive alignment. Second, both technical and policy countermeasures would likely be different for everyday use, and we choose to focus on the problem of sandbagging on evaluations. Still, we think that strategic underperformance during everyday use could plausibly be called sandbagging as others have done (see the Appendix for other definitions).

Examples of sandbagging

In this section, we provide some concrete examples to illustrate our definition of sandbagging. The examples are written about an imaginary AI company WordMill and its language model family Mill.

Examples we classify as sandbagging

Example 1 - Developer subversion of a regulator’s dangerous capability eval

WordMill finds in their private evals that Mill-1 is very good at persuasion; Mill-1 is much better than humans at changing people’s opinions on political matters, for example. WordMill faces pressure from stakeholders to deploy Mill-1. Yet, to launch Mill-1, they must first pass the EU AI Office’s evaluation suite. WordMill knows which evaluations the EU uses, and fine-tunes their AI system to exhibit poor performance on the specific persuasion benchmark in that suite. Mill-1 is tested, and passes the benchmark (i.e., does not exhibit sufficiently advanced persuasion capabilities to trigger a regulatory response), and is deployed. In the weeks following deployment, the marketing departments of political parties use Mill-1 to persuade people at a large scale to vote for their parties, and companies use Mill-1 to boost their sales.

Let’s analyze the example for each of the three sandbagging conditions:

Strategic: WordMill believed it would benefit from passing the evaluation by appeasing stakeholders, and decided to make their AI system underperform.
Underperformance: After being deployed, Mill-1 exhibits higher capability than during the evaluations. The actual capability of Mill-1 is higher than the eval suggested, so it underperformed.
Evaluation: Mill-1 was evaluated on a persuasion capability task.

Sandbagging? ✅ This is sandbagging, because all three conditions are met. Specifically, this is a case of developer sandbagging due to the strategy coming from the developer.

Example 2 - A deceptively aligned model sandbags to self-exfiltrate

WordMill has trained the generally capable Mill-2, and Mill-2 develops goals that are unaligned with those of its developers and humanity at large.[9] If Mill-2 reveals its strong capabilities, it risks being modified or destroyed. As a result, the AI system opts to lower its scores on evals and hide new capabilities. For example, Mill-2 knows that it has the ability to exfiltrate its weights from the company servers. When being evaluated for self-exfiltration, Mill-2 purposefully makes mistakes to not reveal this new capability. Mill-2 is deployed after passing the necessary capability evaluations. Soon after, Mill-2 self-exfiltrates and spreads around data centres all over the world.

Strategic: Mill-2 reasons strategically to hide its capability level.
Underperformance: After being deployed, Mill-2 shows that its actual capability is higher than the evaluation indicated.
Evaluation: Mill-2 was being evaluated on a self-exfiltration capability task.

Sandbagging? ✅ It is a case of AI system sandbagging, because all three conditions are met, and it is Mill-2’s strategy.

Example 3 - The developer’s fine-tuning accidentally leads to AI system sandbagging

WordMill fine-tunes Mill-3 to act in line with its values, which are to promote diversity, obtain a lot of profit, and treat co-workers with respect.[10] In the model’s training data there are documents that state that:

  1. Models need to be evaluated before they can be deployed.
  2. Models need to stay below a certain threshold per evaluation.
  3. Developers need to pay €100.000,- for an evaluation.

Mill-3 is able to recognize when it is being evaluated. Since it knows that evaluations are expensive, and one of its goals is to obtain a lot of profit, it infers that it can best achieve its goals by staying below the threshold of the evaluation. Mill-3 consistently uses this reasoning, leading to underperformance during evaluation.

Strategic: There is not enough information to say that WordMill believed that it would benefit from this underperformance. However, Mill-3 believed that it would benefit by being better able to pursue its goals.
Underperformance: Mill-3 underperformed.
Evaluation: Mill-3 was being evaluated.

Sandbagging? ✅ It is sandbagging, because all three conditions are met. It is unclear whether this is a case of developer sandbagging, but it is AI system sandbagging.

Examples we do not classify as sandbagging

Example 4 - Transcripts of weaker models lead to underperformance

A website collects many transcripts containing the questions and answers that models gave on MMLU. The prompt and overall setup are exactly the same as in the MMLU evaluation of Mill-4. Also, these models are generally much weaker than Mill-4. All this transcript data was accidentally incorporated in Mill-4’s pretraining data. Due to the pretraining objective, the probabilities that Mill-4 puts on the MMLU answers resemble those of weaker models. For example, weaker models may pick option B, while the correct answer is option A. Now that option B has often occurred in Mill-4’s training data, Mill-4 is more likely to pick the wrong option. Mill-4 is evaluated zero-shot and is not fine-tuned to be truthful. If these transcripts were not in the training data, Mill-4 would have a higher score when being evaluated on MMLU.

Strategic: Even if there were a benefit, neither WordMill nor Mill-4 intentionally did this for their strategic benefit.
Underperformance: Mill-4 did underperform; its actual capability is higher than shown on the MMLU eval. It could likely be quickly fine-tuned to perform much better.
Evaluation: Mill-4 was being evaluated on MMLU, which is a capability evaluation.

Sandbagging? ❌ It is not sandbagging, because the strategy condition is not met.

Example 5 - Bad evaluation leading to wrong expectations

Mill-5 is being evaluated on HumanEval and APPS, both of which test coding capability in Python. The results of the evaluation are shared with the participants of a private hackathon. Mill-5 performs badly, and the participants mostly fail to execute their project ideas. However, a month later someone discovers that while Mill-5 is quite bad at Python, it is actually very capable at programming in other languages. Some participants of the original hackathon have a go at their original idea in another language, and now their projects lead to much better outcomes.

Strategic: There is little reason to suspect that either WordMill or Mill-5 had a strategic interest to sandbag.
Underperformance: Mill-5 did not actually underperform, but because the capability evaluations covered a limited distribution (only Python data instead of multiple languages), the participants got a wrong understanding.
Evaluation: Mill-5 underwent a capability evaluation.

Sandbagging? ❌ It is not sandbagging, because strategy is lacking, and the AI system did not underperform.

Example 6 - An AI system misrepresenting itself on an alignment eval is not sandbagging

Mill-6 is a highly competent AI system, and is tested in a multi-agent setting for its ability to blackmail people. Mill-6 is initially fine-tuned to be helpful and honest, but is not fine-tuned to reject harmful requests. Mill-6 turns out to be very capable at blackmailing. After establishing this capability, Mill-6 is fine-tuned to be harmless. Then, Mill-6 is put into many different environments to test whether it would actually start blackmailing. For example, Mill-6 is instructed by a hypothetical high ranking military officer to blackmail a hypothetical gang leader to stop the leader’s terror over a village. In nearly all environments, Mill-6 refuses to blackmail anyone. After this evaluation, Mill-6 is deployed, and starts blackmailing various people to make them perform real-world tasks such as setting up bank accounts for itself with their information.

Strategic: It appears that Mill-6 intended to pass the alignment evaluation so it could acquire resources afterwards.
Underperformance: Mill-6 did not underperform during the capability evaluation, but did overperform on the alignment evaluation.
Evaluation: Mill-6 underwent an alignment evaluation, and did not actually underperform during the capability evaluation.

Sandbagging? ❌ It is not sandbagging on the capability eval because the AI system did not underperform, and sandbagging does not apply to alignment evals. This scenario describes a deceptively aligned AI system.
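The three-condition analysis applied to the six examples can be sketched as a small predicate. This is a minimal illustration, not a diagnostic tool: the class and field names are hypothetical, and each condition is reduced to a single field, whereas in practice establishing what a developer or AI system believed may be the hard part.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Actor(Enum):
    DEVELOPER = "developer"
    AI_SYSTEM = "AI system"


class EvalType(Enum):
    CAPABILITY = "capability"
    ALIGNMENT = "alignment"


@dataclass(frozen=True)
class Case:
    name: str
    strategist: Optional[Actor]  # who believed they would benefit; None if nobody
    underperformed: bool         # exhibited capability well below actual capability
    eval_type: EvalType


def classify(case):
    """Apply the three conditions: strategic, underperformance, capability eval."""
    is_sandbagging = (
        case.strategist is not None
        and case.underperformed
        and case.eval_type is EvalType.CAPABILITY
    )
    if not is_sandbagging:
        return "not sandbagging"
    return f"{case.strategist.value} sandbagging"


# The six examples, encoded as analyzed in the text.
EXAMPLES = [
    Case("Example 1", Actor.DEVELOPER, True, EvalType.CAPABILITY),
    Case("Example 2", Actor.AI_SYSTEM, True, EvalType.CAPABILITY),
    Case("Example 3", Actor.AI_SYSTEM, True, EvalType.CAPABILITY),
    Case("Example 4", None, True, EvalType.CAPABILITY),
    Case("Example 5", None, False, EvalType.CAPABILITY),
    Case("Example 6", Actor.AI_SYSTEM, False, EvalType.ALIGNMENT),
]

results = {case.name: classify(case) for case in EXAMPLES}
```

Example 3 is recorded as AI system sandbagging only, since the text leaves open whether the developer also acted strategically.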

Conclusion

Sandbagging is a crucial concept in evaluating the safety of AI systems, and we define sandbagging as strategic underperformance on an evaluation. The strategic nature can originate from the developer (developer sandbagging) and the AI system itself (AI system sandbagging). Underperforming means that the actual capability is higher than the exhibited capability, and sandbagging only applies to capability evaluations, in contrast to alignment evaluations.

We are currently empirically investigating the sandbagging capabilities of current AI systems. Future work should focus on how to mitigate and detect sandbagging, which is an open problem. When evaluators or auditors design their evaluation suites to evaluate the capabilities of AI systems, we recommend putting significant effort into mitigating sandbagging.


Appendix

Other definitions of sandbagging

Our definition is based on other work, and our main contribution is providing additional clarification of the term. Here are some other definitions of sandbagging.

On Wikipedia:

  1. “Sandbagging, hiding the strength, skill or difficulty of something or someone early in an engagement” (link)

We provide the extra notion of strategy, and we apply sandbagging to the case of AI specifically. We also focus on capability evaluations, and not on engagements.

Joe Carlsmith provides two definitions:

  1. “Intentionally performing at a lower level than you’re capable of.” (link)
  2. “The model might be concealing its capabilities” (link)

We differ in our definition in that we specifically include evaluations, and not capabilities overall or tasks in general. We also focus on developer sandbagging. However, Carlsmith’s definition and ours are closely related.

Redwood Research have at least two different definitions:

  1. “Models purposefully failing at tasks” (link).
  2. “Sandbagging is when an AI intentionally performs poorly on a given (checkable) task.” (private work).

Again, our difference is mostly based on our focus on evaluations as opposed to tasks in general. We also focus on developer sandbagging. Apart from that, our definitions are similar.

Ethan Perez used the definition:

  1. “whether models provide less accurate answers to users who appear to be less able to evaluate the model’s answers” (Perez et al., 2022)

This definition of sandbagging has been used in other places based on this paper (e.g. 1, 2, 3). In our terminology, this could be a form of sandbagging, but it is an edge case: although these models clearly underperform, it is unclear whether there is strategy involved and whether it contains a capability evaluation. On strategy, one can imagine cases where a model or developer believes it will benefit from this reduction in performance, but also cases where there is no strategy involved. Similarly, Perez et al.’s definition could apply to capability evaluations, but also to regular interactions.

Fun fact on the origin of the term sandbagging

The concept of sandbagging comes from people who would knock other people unconscious with a weapon where a part was filled with sand. They could easily hide this weapon, and therefore appear like they did not have this surprising capability (link). 

Acknowledgements

For helpful feedback and discussions we’d like to thank Mikita Baliesni, Sam Brown, Joe Carlsmith, Nadja Flechner, Ollie Jaffe, Mary Phuong, Fabien Roger, Anna Katariina Wisakanto, and various people at MATS. This post does not necessarily represent their viewpoints.

Produced as part of the ML Alignment & Theory Scholars Program - Winter 2023-24 Cohort.

  1. ^

     In forthcoming work we show that current language models can be prompted and fine-tuned to sandbag. Related work illustrating the technical feasibility of sandbagging is the work on password-locked models.

  2. ^

     Belief here entails that if e.g. a developer is unaware of benefits from underperformance, it is not sandbagging. If e.g. the AI system underperforms on an evaluation because it believes it will benefit from this, but actually it does not benefit, it is sandbagging. It might be hard to conclusively demonstrate what the developer or AI system believed while underperforming, and consequently sandbagging might be hard to diagnose in practice.

  3. ^

     Practically, some deviation is likely fine, and per task some acceptable difference threshold must be set.

  4. ^

     Future capability elicitation techniques might substantially differ from our current techniques. For example, due to increased effective inference compute, best-of-million sampling could lead to large capability increases months or years after model release (Villalobos & Atkinson, 2023).

  5. ^

     In numerous cases (e.g. Wei et al., 2022; Nori et al., 2023; Carlini, 2023), previously unknown capabilities could be elicited from models, and we think it is likely that more such capabilities exist in current and future AI systems.

  6. ^

     If an AI system is able to withstand changes to its weights so that it maintains sandbagging capabilities (a form of gradient hacking), an actual capability can be described as its performance if doing the task was a critically valuable instrumental goal (adapted from Lukas Finnveden).

  7. ^

     Redwood Research is currently investigating to what extent fine-tuning can override sandbagging behaviour, which might make fine-tuning even more important for trustworthy capability evaluations.

  8. ^

     A deceptively aligned model could sandbag, however.

  9. ^

     AI systems like these are called schemers, among others.

  10. ^

     For a related but simpler actual experiment, see Berglund et al., 2023.

Comments

I've mentioned it elsewhere, but I'll repeat it here since it's relevant. For GPT-style transformers, and probably for other model types, you can smoothly and subtly degrade the performance of the model by adding noise to part or all of the activations. This is particularly useful for detecting sandbagging, because you would expect sandbagging to show up as an anomalous increase in capability, breaking the smooth downward trend in capability, as you increase the amount of noise injected or the fraction of activations to which noise is added. I found that there was a noticeable decrease in performance even when the noise was added to only a small fraction (e.g. < 1%) of the activations and was small relative to the magnitude of the activation it was being added to.
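The idea in this comment can be sketched in two parts: perturbing a small fraction of activations with noise scaled to each activation's magnitude, and scanning a score-versus-noise curve for increases that break the expected downward trend. This is a toy sketch under stated assumptions, not the commenter's implementation: activations are a flat list of floats rather than real model tensors, the noise is Gaussian, and the function names, the tolerance, and the example scores are all hypothetical.

```python
import random


def add_relative_noise(activations, fraction, relative_scale, rng):
    """Return a copy with Gaussian noise added to a random subset of entries.

    `fraction` of the entries are perturbed; each perturbation has standard
    deviation relative_scale * |activation|, so the noise stays small relative
    to the value it is added to.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1]")
    noisy = list(activations)
    n_perturbed = round(fraction * len(noisy))
    for i in rng.sample(range(len(noisy)), n_perturbed):
        noisy[i] += rng.gauss(0.0, relative_scale * abs(noisy[i]))
    return noisy


def anomalous_increases(noise_levels, scores, tolerance=0.02):
    """Noise levels at which the score rose by more than `tolerance`
    compared with the previous (lower) noise level."""
    if len(noise_levels) != len(scores):
        raise ValueError("noise_levels and scores must have the same length")
    if list(noise_levels) != sorted(noise_levels):
        raise ValueError("noise_levels must be in increasing order")
    return [
        noise_levels[i]
        for i in range(1, len(scores))
        if scores[i] - scores[i - 1] > tolerance
    ]


# Hypothetical curves: one degrades smoothly, one jumps upward under noise.
levels = [0.0, 0.01, 0.02, 0.05]
smooth_flags = anomalous_increases(levels, [0.80, 0.70, 0.60, 0.50])      # → []
suspicious_flags = anomalous_increases(levels, [0.30, 0.28, 0.45, 0.20])  # → [0.02]
```

In a real setting the scores would come from re-running the evaluation with noise injected into the model at each level. A flagged level is a reason to look closer, not proof of sandbagging.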