Sandbagging (AI)
Edited by Raemon, last updated 27th Mar 2025
Sandbagging is when an AI system deliberately pretends to be less capable than it actually is during training or evaluation.
Posts tagged Sandbagging (AI) (most relevant first)
77 karma · Misalignment and Strategic Underperformance: An Analysis of Sandbagging and Exploration Hacking · Ω · Buck, Julian Stastny · 4mo · 3 comments
54 karma · Notes on countermeasures for exploration hacking (aka sandbagging) · Ω · ryan_greenblatt · 5mo · 6 comments
77 karma · White Box Control at UK AISI - Update on Sandbagging Investigations · Ω · Joseph Bloom, Jordan Taylor, Connor Kissane, Sid Black, merizian, alexdzm, jacoba, Ben Millwood, Alan Cooney · 2mo · 10 comments
61 karma · The “no sandbagging on checkable tasks” hypothesis · Ω · Joe Carlsmith · 2y · 14 comments
44 karma · Automated Researchers Can Subtly Sandbag · Ω · gasteigerjo, Akbir Khan, Sam Bowman, Vlad Mikulik, Ethan Perez, Fabien Roger · 5mo · 0 comments
84 karma · [Paper] AI Sandbagging: Language Models can Strategically Underperform on Evaluations · Ω · Teun van der Weij, Felix Hofstätter, Ollie J, Sam F. Brown, Francis Rhys Ward · 1y · 10 comments
50 karma · Do LLMs know what they're capable of? Why this matters for AI safety, and initial findings · Ω · Casey Barkan, Sid Black, Oliver Sourbut · 2mo · 4 comments
48 karma · An Introduction to AI Sandbagging · Ω · Teun van der Weij, Felix Hofstätter, Francis Rhys Ward · 1y · 13 comments
35 karma · Can SAE steering reveal sandbagging? · jordine, Hoang Khiem, Felix Hofstätter, Cleo Nardo · 5mo · 3 comments
30 karma · How to mitigate sandbagging · Ω · Teun van der Weij · 5mo · 0 comments
16 karma · Exploration hacking: can reasoning models subvert RL? · Ω · Damon Falck, Joschka Braun, Eyon Jang · 1mo · 4 comments
15 karma · Won't vs. Can't: Sandbagging-like Behavior from Claude Models · Ω · Joe Benton, Zachary Witten · 6mo · 1 comment
15 karma · Adding noise to a sandbagging model can reveal its true capabilities · TheManxLoiner · 2mo · 1 comment
-3 karma · A Solution to Sandbagging and other Self-Provable Misalignment: Constitutional AI Detectives · Knight Lee · 5mo · 2 comments