Sandbagging (AI) — LessWrong

Edited by Raemon last updated 27th Mar 2025

Sandbagging is when an AI system deliberately performs below its actual capabilities during training or evaluation, so that it appears less capable than it is.

Posts tagged Sandbagging (AI)

11

80 Misalignment and Strategic Underperformance: An Analysis of Sandbagging and Exploration Hacking

Buck, Julian Stastny

1y

3

5

55 Notes on countermeasures for exploration hacking (aka sandbagging)

ryan_greenblatt

1y

6

2

80 White Box Control at UK AISI - Update on Sandbagging Investigations

Joseph Bloom, Jordan Taylor, Connor Kissane, Sid Black, merizian, alexdzm, jacoba, Ben Millwood, Alan Cooney

11mo

10

2

61 The “no sandbagging on checkable tasks” hypothesis

3y

14

2

44 Automated Researchers Can Subtly Sandbag

gasteigerjo, Akbir Khan, Sam Bowman, Vlad Mikulik, Ethan Perez, Fabien Roger

1y

0

2

8 Sandbagging: distinguishing detection of underperformance from incrimination, and the implications for downstream interventions.

8mo

0

1

84 [Paper] AI Sandbagging: Language Models can Strategically Underperform on Evaluations

Teun van der Weij, Felix Hofstätter, Ollie J, Sam F. Brown, Francis Rhys Ward

2y

10

1

53 Do LLMs know what they're capable of? Why this matters for AI safety, and initial findings

Casey Barkan, Sid Black, Oliver Sourbut

11mo

5

1

50 An Introduction to AI Sandbagging

Teun van der Weij, Felix Hofstätter, Francis Rhys Ward

2y

13

1

36 Can SAE steering reveal sandbagging?

jordinne, Hoang Khiem, Felix Hofstätter, Cleo Nardo

1y

3

1

32 How to mitigate sandbagging

Teun van der Weij

1y

0

1

26 A Conceptual Framework for Exploration Hacking

Joschka Braun, Eyon Jang, Damon Falck

4mo

2

1

25 Exploration hacking: can reasoning models subvert RL?

Damon Falck, Joschka Braun, Eyon Jang

10mo

4

1

24 Exploration Hacking: Can LLMs Learn to Resist RL Training?

Eyon Jang, Joschka Braun, Damon Falck, David Lindner

1mo

0

1

18 Adding noise to a sandbagging model can reveal its true capabilities

11mo

1
