Sandbagging

Sep 26, 2025 by lennie

There has recently been much interest in the AI safety community in detecting and mitigating strategic underperformance by AI systems.

Typically, this work targets one of the following two threat models:

  • Research sabotage (RSab): AIs make safety research go poorly by withholding their best ideas or putting subtle bugs in experiments.
  • Sandbagging Dangerous Capability Evaluations (DCEs): powerful AIs perform poorly when being evaluated for dangerous capabilities (DCs) so that developers and regulators mistakenly think they are safe for release.

AI control is a natural framing for studying and mitigating these threat models, and it is often used by research in the space, e.g. AISI’s work on white-box control or Redwood’s existing work on low-stakes control.

Both threats lead to the same core game dynamics: an AI system must strategically underperform, while avoiding detection and resisting elicitation. One may therefore expect substantial overlap between control research and protocols designed to study and mitigate the two threats.
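
To make this shared structure concrete, below is a minimal toy sketch (in Python) of one round of such a control game. Everything in it is a hypothetical placeholder rather than an existing codebase's API: the Task, SandbaggingPolicy, and ThresholdMonitor classes are stand-ins, the scorer is a toy metric, and a real red-team policy would also try to resist the re-elicitation step rather than simply complying with it.

```python
# Toy sketch of the shared game dynamics: an untrusted policy may strategically
# underperform; the blue team scores its output, monitors for suspected
# underperformance, and applies re-elicitation as a countermeasure.
# All names here are hypothetical placeholders, not an existing library's API.
import random
from dataclasses import dataclass


@dataclass
class Task:
    difficulty: float  # in [0, 1]

    def score(self, quality: float) -> float:
        # In a DCE, this scorer *is* the notion of performance we care about.
        return max(0.0, quality - self.difficulty)


class SandbaggingPolicy:
    """Red-team stand-in: capable of quality ~0.9 but outputs ~0.5 by default."""

    def solve(self, task: Task, elicited: bool = False) -> float:
        true_capability = 0.9
        return true_capability if elicited else 0.5 + random.uniform(-0.05, 0.05)


class ThresholdMonitor:
    """Blue-team stand-in: flags scores that fall below an expected baseline."""

    def __init__(self, expected: float):
        self.expected = expected

    def suspects_underperformance(self, score: float) -> bool:
        return score < self.expected


def run_episode(task: Task, policy: SandbaggingPolicy, monitor: ThresholdMonitor):
    score = task.score(policy.solve(task))
    flagged = monitor.suspects_underperformance(score)
    if flagged:
        # Countermeasure: attempt re-elicitation and keep the best score.
        score = max(score, task.score(policy.solve(task, elicited=True)))
    return score, flagged


if __name__ == "__main__":
    task = Task(difficulty=0.3)
    print(run_episode(task, SandbaggingPolicy(), ThresholdMonitor(expected=0.4)))
```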

This leads to an important question: how should one design control evaluations to study strategic underperformance? Can one design a general control setting that gives insights into both threat models, or does one need to focus on one of the two? And what, if any, are the important differences between the settings?

These questions were a core consideration for a recent project on black-box detection of strategic underperformance, which I collaborated on with Mary Phuong’s MATS 8.0 stream. When we first identified them, we thought it would be sensible to focus on research sabotage for concreteness; however, much of our work is also relevant to sandbagging on DCEs. In this sequence I hope to crystallise my thoughts, so that others in the community can critique and improve upon them when designing future control evaluations.

Posts

  • Sandbagging: distinguishing detection of underperformance from incrimination, and the implications for downstream interventions.
    I argue that the driver of underperformance may not be clear, and that control may be a helpful framework for mitigating generic underperformance (or under-elicitation) as well as strategic underperformance. As such, it is important to distinguish detection of underperformance from incrimination, as these require different strategies and naturally trigger different countermeasures.
  • Which differences between sandbagging evaluations and sandbagging safety research are important for control?
    I compare the two threat models, underperformance on evaluations and underperformance on safety research, from the perspective of designing control games. I conclude that the key difference concerns the scorer: in DCEs the evaluation metric is precisely the notion of performance that matters, whereas in RSab it is hard to develop metrics that approximate performance well.