LESSWRONG

Sandbagging (AI)

Edited by Raemon, last updated 27th Mar 2025

Sandbagging is when an AI system strategically underperforms during training or evaluation, appearing less capable than it actually is.

Posts tagged Sandbagging (AI)
77 Misalignment and Strategic Underperformance: An Analysis of Sandbagging and Exploration Hacking
Ω
Buck, Julian Stastny
5mo
Ω
3
54 Notes on countermeasures for exploration hacking (aka sandbagging)
Ω
ryan_greenblatt
7mo
Ω
6
78 White Box Control at UK AISI - Update on Sandbagging Investigations
Ω
Joseph Bloom, Jordan Taylor, Connor Kissane, Sid Black, merizian, alexdzm, jacoba, Ben Millwood, Alan Cooney
3mo
Ω
10
61 The “no sandbagging on checkable tasks” hypothesis
Ω
Joe Carlsmith
2y
Ω
14
44 Automated Researchers Can Subtly Sandbag
Ω
gasteigerjo, Akbir Khan, Sam Bowman, Vlad Mikulik, Ethan Perez, Fabien Roger
7mo
Ω
0
6 Sandbagging: distinguishing detection of underperformance from incrimination, and the implications for downstream interventions.
lennie
13d
0
84 [Paper] AI Sandbagging: Language Models can Strategically Underperform on Evaluations
Ω
Teun van der Weij, Felix Hofstätter, Ollie J, Sam F. Brown, Francis Rhys Ward
1y
Ω
10
53 Do LLMs know what they're capable of? Why this matters for AI safety, and initial findings
Ω
Casey Barkan, Sid Black, Oliver Sourbut
3mo
Ω
5
49 An Introduction to AI Sandbagging
Ω
Teun van der Weij, Felix Hofstätter, Francis Rhys Ward
1y
Ω
13
35 Can SAE steering reveal sandbagging?
jordine, Hoang Khiem, Felix Hofstätter, Cleo Nardo
6mo
3
30 How to mitigate sandbagging
Ω
Teun van der Weij
7mo
Ω
0
16 Adding noise to a sandbagging model can reveal its true capabilities
TheManxLoiner
3mo
1
16 Exploration hacking: can reasoning models subvert RL?
Ω
Damon Falck, Joschka Braun, Eyon Jang
3mo
Ω
4
15 Won't vs. Can't: Sandbagging-like Behavior from Claude Models
Ω
Joe Benton, Zachary Witten
8mo
Ω
1
-3 A Solution to Sandbagging and other Self-Provable Misalignment: Constitutional AI Detectives
Knight Lee
6mo
2