Sandbagging (AI)
Edited by Raemon, last updated 27th Mar 2025
Sandbagging is when an AI system pretends to be less capable than it actually is during training or evaluation.
Posts tagged Sandbagging (AI) (Most Relevant)
77 | Misalignment and Strategic Underperformance: An Analysis of Sandbagging and Exploration Hacking (Ω) | Buck, Julian Stastny | 5mo | 3
54 | Notes on countermeasures for exploration hacking (aka sandbagging) (Ω) | ryan_greenblatt | 7mo | 6
78 | White Box Control at UK AISI - Update on Sandbagging Investigations (Ω) | Joseph Bloom, Jordan Taylor, Connor Kissane, Sid Black, merizian, alexdzm, jacoba, Ben Millwood, Alan Cooney | 3mo | 10
61 | The “no sandbagging on checkable tasks” hypothesis (Ω) | Joe Carlsmith | 2y | 14
44 | Automated Researchers Can Subtly Sandbag (Ω) | gasteigerjo, Akbir Khan, Sam Bowman, Vlad Mikulik, Ethan Perez, Fabien Roger | 7mo | 0
6 | Sandbagging: distinguishing detection of underperformance from incrimination, and the implications for downstream interventions. | lennie | 13d | 0
84 | [Paper] AI Sandbagging: Language Models can Strategically Underperform on Evaluations (Ω) | Teun van der Weij, Felix Hofstätter, Ollie J, Sam F. Brown, Francis Rhys Ward | 1y | 10
53 | Do LLMs know what they're capable of? Why this matters for AI safety, and initial findings (Ω) | Casey Barkan, Sid Black, Oliver Sourbut | 3mo | 5
49 | An Introduction to AI Sandbagging (Ω) | Teun van der Weij, Felix Hofstätter, Francis Rhys Ward | 1y | 13
35 | Can SAE steering reveal sandbagging? | jordine, Hoang Khiem, Felix Hofstätter, Cleo Nardo | 6mo | 3
30 | How to mitigate sandbagging (Ω) | Teun van der Weij | 7mo | 0
16 | Adding noise to a sandbagging model can reveal its true capabilities | TheManxLoiner | 3mo | 1
16 | Exploration hacking: can reasoning models subvert RL? (Ω) | Damon Falck, Joschka Braun, Eyon Jang | 3mo | 4
15 | Won't vs. Can't: Sandbagging-like Behavior from Claude Models (Ω) | Joe Benton, Zachary Witten | 8mo | 1
-3 | A Solution to Sandbagging and other Self-Provable Misalignment: Constitutional AI Detectives | Knight Lee | 6mo | 2