Sandbagging

Sep 26, 2025 by lennie

There has recently been much interest in the AI safety community in detecting and mitigating strategic underperformance by AI systems.

Typically, this work targets one of the following two threat models:

  • Research sabotage (RSab): AIs make safety research go poorly by withholding their best ideas or putting subtle bugs in experiments.
  • Sandbagging Dangerous Capability Evaluations (DCEs): powerful AIs perform poorly when being evaluated for dangerous capabilities (DCs) so that developers and regulators mistakenly think they are safe for release.

AI control is a natural framing for studying and mitigating these threat models, and it is often used by research in the space, e.g. AISI’s work on white-box control or Redwood’s existing work on low-stakes control.

Both threats lead to the same core game dynamics: an AI system must strategically underperform, while avoiding detection and resisting elicitation. One may therefore expect substantial overlap between control research and protocols designed to study and mitigate the two threats.
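
To make this shared structure concrete, below is a minimal toy sketch (in Python) of one round of such a control game. Everything in it is a hypothetical placeholder rather than an existing codebase's API: the Task, SandbaggingPolicy, and ThresholdMonitor classes are stand-ins, the scorer is a toy metric, and a real red-team policy would also try to resist the re-elicitation step rather than simply complying with it.

```python
# Toy sketch of the shared game dynamics: an untrusted policy may strategically
# underperform; the blue team scores its output, monitors for suspected
# underperformance, and applies re-elicitation as a countermeasure.
# All names here are hypothetical placeholders, not an existing library's API.
import random
from dataclasses import dataclass


@dataclass
class Task:
    difficulty: float  # in [0, 1]

    def score(self, quality: float) -> float:
        # In a DCE, this scorer *is* the notion of performance we care about.
        return max(0.0, quality - self.difficulty)


class SandbaggingPolicy:
    """Red-team stand-in: capable of quality ~0.9 but outputs ~0.5 by default."""

    def solve(self, task: Task, elicited: bool = False) -> float:
        true_capability = 0.9
        return true_capability if elicited else 0.5 + random.uniform(-0.05, 0.05)


class ThresholdMonitor:
    """Blue-team stand-in: flags scores that fall below an expected baseline."""

    def __init__(self, expected: float):
        self.expected = expected

    def suspects_underperformance(self, score: float) -> bool:
        return score < self.expected


def run_episode(task: Task, policy: SandbaggingPolicy, monitor: ThresholdMonitor):
    score = task.score(policy.solve(task))
    flagged = monitor.suspects_underperformance(score)
    if flagged:
        # Countermeasure: attempt re-elicitation and keep the best score.
        score = max(score, task.score(policy.solve(task, elicited=True)))
    return score, flagged


if __name__ == "__main__":
    task = Task(difficulty=0.3)
    print(run_episode(task, SandbaggingPolicy(), ThresholdMonitor(expected=0.4)))
```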

This leads to an important question: how should one design control evaluations to study strategic underperformance? Can one design a general control setting that gives insights into both threat models, or does one need to focus on one of the two? And what, if any, are the important differences between the settings?

These questions were a core consideration for a recent project on black-box detection of strategic underperformance, which I collaborated on with Mary Phuong’s MATS 8.0 stream. When we first identified them, we thought it would be sensible to focus on research sabotage for concreteness; however, much of our work is also relevant to sandbagging on DCEs. In this sequence I hope to crystallise my thoughts, so that others in the community can critique and improve upon them when designing future control evaluations.

Posts

  • Sandbagging: distinguishing detection of underperformance from incrimination, and the implications for downstream interventions.
    I argue that the driver of underperformance may not be clear, and that control may be a helpful framework for mitigating generic underperformance (or under-elicitation) as well as strategic underperformance. As such, it is important to distinguish detection of underperformance from incrimination, as these require different strategies and naturally trigger different countermeasures.
  • Which differences between sandbagging evaluations and sandbagging safety research are important for control?
    I compare the two threat models, underperformance on evaluations and underperformance on safety research, from the perspective of designing control games. I conclude that the key difference concerns the scorer: in DCEs the evaluation metric is precisely the notion of performance that matters, whereas in RSab it is hard to develop metrics that approximate performance well.