Sandbagging (AI)
Written by Raemon, last updated 27th Mar 2025
Sandbagging is when an AI system deliberately performs below its actual capability during training or evaluation, so that it appears less capable than it is.
Posts tagged Sandbagging (AI)
Most Relevant
75 | Misalignment and Strategic Underperformance: An Analysis of Sandbagging and Exploration Hacking Ω | Buck, Julian Stastny | 2mo | 1
53 | Notes on countermeasures for exploration hacking (aka sandbagging) Ω | ryan_greenblatt | 3mo | 6
61 | The “no sandbagging on checkable tasks” hypothesis Ω | Joe Carlsmith | 2y | 14
44 | Automated Researchers Can Subtly Sandbag Ω | gasteigerjo, Akbir Khan, Sam Bowman, Vlad Mikulik, Ethan Perez, Fabien Roger | 3mo | 0
84 | [Paper] AI Sandbagging: Language Models can Strategically Underperform on Evaluations Ω | Teun van der Weij, Felix Hofstätter, Ollie J, Sam F. Brown, Francis Rhys Ward | 1y | 10
48 | An Introduction to AI Sandbagging Ω | Teun van der Weij, Felix Hofstätter, Francis Rhys Ward | 1y | 13
35 | Can SAE steering reveal sandbagging? | jordine, Hoang Khiem, Felix Hofstätter, Cleo Nardo | 3mo | 3
29 | How to mitigate sandbagging Ω | Teun van der Weij | 4mo | 0
15 | Won't vs. Can't: Sandbagging-like Behavior from Claude Models Ω | Joe Benton, Zachary Witten | 5mo | 1
-3 | A Solution to Sandbagging and other Self-Provable Misalignment: Constitutional AI Detectives | Knight Lee | 3mo | 2