Deceptive Alignment

Edited by Multicore, ryan_greenblatt, et al. last updated 18th Oct 2024

Deceptive Alignment is when an AI that is not actually aligned temporarily acts aligned in order to deceive its creators or its training process. Presumably it does this to avoid being shut down or retrained, and to gain access to the power its creators would give an aligned AI. (The term "scheming" is sometimes used for this phenomenon.)

See also: Mesa-optimization, Treacherous Turn, Eliciting Latent Knowledge, Deception

Posts tagged Deceptive Alignment
490Alignment Faking in Large Language Models
Ω
ryan_greenblatt, evhub, Carson Denison, Benjamin Wright, Fabien Roger, Monte M, Sam Marks, Johannes Treutlein, Sam Bowman, Buck
10mo
Ω
75
2
127Realistic Reward Hacking Induces Different and Deeper Misalignment
Ω
Jozdien
1mo
Ω
2
4
645The Waluigi Effect (mega-post)
Ω
Cleo Nardo
3y
Ω
188
2
361What’s the short timeline plan?
Ω
Marius Hobbhahn
10mo
Ω
49
1
158Why Do Some Language Models Fake Alignment While Others Don't?
Ω
abhayesian, John Hughes, Alex Mallen, Jozdien, janus, Fabien Roger
4mo
Ω
14
2
208Will alignment-faking Claude accept a deal to reveal its misalignment?
Ω
ryan_greenblatt, Kyle Fish
9mo
Ω
28
3
322Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research
Ω
evhub, Nicholas Schiefer, Carson Denison, Ethan Perez
2y
Ω
30
2
210Frontier Models are Capable of In-context Scheming
Ω
Marius Hobbhahn, AlexMeinke, Bronson Schoen, rusheb, Jérémy Scheurer, Mikita Balesni
1y
Ω
24
2
267The case for ensuring that powerful AIs are controlled
Ω
ryan_greenblatt, Buck
2y
Ω
73
2
175How will we update about scheming?
Ω
ryan_greenblatt
9mo
Ω
21
1
227Self-Other Overlap: A Neglected Approach to AI Alignment
Ω
Marc Carauleanu, Mike Vaiana, Judd Rosenblatt, Diogo de Lucena, Cameron Berg, Trent Hodgeson
1y
Ω
51
1
145The Hidden Cost of Our Lies to AI
Nicholas Andresen
8mo
18
9
239AI Control: Improving Safety Despite Intentional Subversion
Ω
Buck, Fabien Roger, ryan_greenblatt, Kshitij Sachan
2y
Ω
24
2
183Why is o1 so deceptive?
QΩ
abramdemski, Sahil
1y
QΩ
24
3
268Deep Deceptiveness
Ω
So8res
3y
Ω
60