LESSWRONG
Sandbagging (AI)

Edited by Raemon, last updated 27th Mar 2025

Sandbagging is when an AI system deliberately underperforms during training or evaluation, appearing less capable than it actually is.

Posts tagged Sandbagging (AI)
77 Misalignment and Strategic Underperformance: An Analysis of Sandbagging and Exploration Hacking
Ω
Buck, Julian Stastny
4mo
Ω
3
54 Notes on countermeasures for exploration hacking (aka sandbagging)
Ω
ryan_greenblatt
5mo
Ω
6
77 White Box Control at UK AISI - Update on Sandbagging Investigations
Ω
Joseph Bloom, Jordan Taylor, Connor Kissane, Sid Black, merizian, alexdzm, jacoba, Ben Millwood, Alan Cooney
2mo
Ω
10
61 The “no sandbagging on checkable tasks” hypothesis
Ω
Joe Carlsmith
2y
Ω
14
44 Automated Researchers Can Subtly Sandbag
Ω
gasteigerjo, Akbir Khan, Sam Bowman, Vlad Mikulik, Ethan Perez, Fabien Roger
5mo
Ω
0
84 [Paper] AI Sandbagging: Language Models can Strategically Underperform on Evaluations
Ω
Teun van der Weij, Felix Hofstätter, Ollie J, Sam F. Brown, Francis Rhys Ward
1y
Ω
10
50 Do LLMs know what they're capable of? Why this matters for AI safety, and initial findings
Ω
Casey Barkan, Sid Black, Oliver Sourbut
2mo
Ω
4
48 An Introduction to AI Sandbagging
Ω
Teun van der Weij, Felix Hofstätter, Francis Rhys Ward
1y
Ω
13
35 Can SAE steering reveal sandbagging?
jordine, Hoang Khiem, Felix Hofstätter, Cleo Nardo
5mo
3
30 How to mitigate sandbagging
Ω
Teun van der Weij
5mo
Ω
0
16 Exploration hacking: can reasoning models subvert RL?
Ω
Damon Falck, Joschka Braun, Eyon Jang
1mo
Ω
4
15 Won't vs. Can't: Sandbagging-like Behavior from Claude Models
Ω
Joe Benton, Zachary Witten
6mo
Ω
1
15 Adding noise to a sandbagging model can reveal its true capabilities
TheManxLoiner
2mo
1
-3 A Solution to Sandbagging and other Self-Provable Misalignment: Constitutional AI Detectives
Knight Lee
5mo
2