Sandbagging (AI)

Written by Raemon last updated 27th Mar 2025

Sandbagging is when an AI system strategically underperforms during training or evaluation, presenting itself as less capable than it actually is.

Posts tagged Sandbagging (AI)
Misalignment and Strategic Underperformance: An Analysis of Sandbagging and Exploration Hacking (Ω), by Buck, Julian Stastny. 2mo, 75 karma, 1 comment.
Notes on countermeasures for exploration hacking (aka sandbagging) (Ω), by ryan_greenblatt. 3mo, 53 karma, 6 comments.
The “no sandbagging on checkable tasks” hypothesis (Ω), by Joe Carlsmith. 2y, 61 karma, 14 comments.
Automated Researchers Can Subtly Sandbag (Ω), by gasteigerjo, Akbir Khan, Sam Bowman, Vlad Mikulik, Ethan Perez, Fabien Roger. 3mo, 44 karma, 0 comments.
[Paper] AI Sandbagging: Language Models can Strategically Underperform on Evaluations (Ω), by Teun van der Weij, Felix Hofstätter, Ollie J, Sam F. Brown, Francis Rhys Ward. 1y, 84 karma, 10 comments.
An Introduction to AI Sandbagging (Ω), by Teun van der Weij, Felix Hofstätter, Francis Rhys Ward. 1y, 48 karma, 13 comments.
Can SAE steering reveal sandbagging?, by jordine, Hoang Khiem, Felix Hofstätter, Cleo Nardo. 3mo, 35 karma, 3 comments.
How to mitigate sandbagging (Ω), by Teun van der Weij. 4mo, 29 karma, 0 comments.
Won't vs. Can't: Sandbagging-like Behavior from Claude Models (Ω), by Joe Benton, Zachary Witten. 5mo, 15 karma, 1 comment.
A Solution to Sandbagging and other Self-Provable Misalignment: Constitutional AI Detectives, by Knight Lee. 3mo, -3 karma, 2 comments.