Sandbagging (AI) — LessWrong
Edited by Raemon, last updated 27th Mar 2025
Sandbagging is when an AI system pretends to be less capable than it actually is during training or evaluation.
Posts tagged Sandbagging (AI), sorted by Most Relevant

Misalignment and Strategic Underperformance: An Analysis of Sandbagging and Exploration Hacking [Ω] · Buck, Julian Stastny · 7mo · karma 78 · 3 comments
Notes on countermeasures for exploration hacking (aka sandbagging) [Ω] · ryan_greenblatt · 8mo · karma 54 · 6 comments
White Box Control at UK AISI - Update on Sandbagging Investigations [Ω] · Joseph Bloom, Jordan Taylor, Connor Kissane, Sid Black, merizian, alexdzm, jacoba, Ben Millwood, Alan Cooney · 5mo · karma 80 · 10 comments
The “no sandbagging on checkable tasks” hypothesis [Ω] · Joe Carlsmith · 2y · karma 61 · 14 comments
Automated Researchers Can Subtly Sandbag [Ω] · gasteigerjo, Akbir Khan, Sam Bowman, Vlad Mikulik, Ethan Perez, Fabien Roger · 8mo · karma 44 · 0 comments
Sandbagging: distinguishing detection of underperformance from incrimination, and the implications for downstream interventions. · lennie · 2mo · karma 8 · 0 comments
[Paper] AI Sandbagging: Language Models can Strategically Underperform on Evaluations [Ω] · Teun van der Weij, Felix Hofstätter, Ollie J, Sam F. Brown, Francis Rhys Ward · 1y · karma 84 · 10 comments
Do LLMs know what they're capable of? Why this matters for AI safety, and initial findings [Ω] · Casey Barkan, Sid Black, Oliver Sourbut · 5mo · karma 53 · 5 comments
An Introduction to AI Sandbagging [Ω] · Teun van der Weij, Felix Hofstätter, Francis Rhys Ward · 2y · karma 50 · 13 comments
Can SAE steering reveal sandbagging? · jordine, Hoang Khiem, Felix Hofstätter, Cleo Nardo · 8mo · karma 35 · 3 comments
How to mitigate sandbagging [Ω] · Teun van der Weij · 8mo · karma 30 · 0 comments
Adding noise to a sandbagging model can reveal its true capabilities · TheManxLoiner · 5mo · karma 16 · 1 comment
Exploration hacking: can reasoning models subvert RL? [Ω] · Damon Falck, Joschka Braun, Eyon Jang · 4mo · karma 16 · 4 comments
Won't vs. Can't: Sandbagging-like Behavior from Claude Models [Ω] · Joe Benton, Zachary Witten · 9mo · karma 15 · 1 comment
A Solution to Sandbagging and other Self-Provable Misalignment: Constitutional AI Detectives · Knight Lee · 8mo · karma -3 · 2 comments