LESSWRONG

Deceptive Alignment

Edited by Multicore, ryan_greenblatt, et al. Last updated 18th Oct 2024.

Deceptive Alignment occurs when an AI that is not actually aligned temporarily behaves as if it were, in order to deceive its creators or its training process. Presumably it does so to avoid being shut down or retrained, and to gain access to the power its creators would grant an aligned AI. (The term "scheming" is sometimes used for this phenomenon.)

See also: Mesa-optimization, Treacherous Turn, Eliciting Latent Knowledge, Deception

Posts tagged Deceptive Alignment
Alignment Faking in Large Language Models (Ω)
489 karma · ryan_greenblatt, evhub, Carson Denison, Benjamin Wright, Fabien Roger, Monte M, Sam Marks, Johannes Treutlein, Sam Bowman, Buck · 8mo · 75 comments

What’s the short timeline plan? (Ω)
358 karma · Marius Hobbhahn · 8mo · 49 comments

The Waluigi Effect (mega-post) (Ω)
636 karma · Cleo Nardo · 3y · 188 comments

Why Do Some Language Models Fake Alignment While Others Don't? (Ω)
152 karma · abhayesian, John Hughes, Alex Mallen, Jozdien, janus, Fabien Roger · 2mo · 14 comments

Will alignment-faking Claude accept a deal to reveal its misalignment? (Ω)
208 karma · ryan_greenblatt, Kyle Fish · 8mo · 28 comments

Decision Theory Guarding is Sufficient for Scheming (Ω)
36 karma · james.lucassen · 8d · 3 comments

The Case for Mixed Deployment
32 karma · Cleo Nardo · 6d · 3 comments

Frontier Models are Capable of In-context Scheming (Ω)
210 karma · Marius Hobbhahn, AlexMeinke, Bronson Schoen, rusheb, Jérémy Scheurer, Mikita Balesni · 9mo · 24 comments

Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research (Ω)
320 karma · evhub, Nicholas Schiefer, Carson Denison, Ethan Perez · 2y · 30 comments

The case for ensuring that powerful AIs are controlled (Ω)
278 karma · ryan_greenblatt, Buck · 2y · 73 comments

How will we update about scheming? (Ω)
174 karma · ryan_greenblatt · 7mo · 20 comments

Self-Other Overlap: A Neglected Approach to AI Alignment (Ω)
226 karma · Marc Carauleanu, Mike Vaiana, Judd Rosenblatt, Diogo de Lucena, Cameron Berg, Trent Hodgeson · 1y · 51 comments

The Hidden Cost of Our Lies to AI
144 karma · Nicholas Andresen · 6mo · 18 comments

Why is o1 so deceptive? (QΩ)
183 karma · abramdemski, Sahil · 1y · 24 comments

“Alignment Faking” frame is somewhat fake (Ω)
156 karma · Jan_Kulveit · 9mo · 13 comments