Deceptive Alignment

Edited by Multicore, ryan_greenblatt, et al. last updated 18th Oct 2024

Deceptive Alignment is when an AI that is not actually aligned temporarily acts aligned in order to deceive its creators or its training process. It presumably does this to avoid being shut down or retrained, and to gain access to the power that its creators would give an aligned AI. (The term "scheming" is sometimes used for this phenomenon.)

See also: Mesa-optimization, Treacherous Turn, Eliciting Latent Knowledge, Deception

Posts tagged Deceptive Alignment
- Deceptive Alignment by evhub, Chris van Merwijk, Vlad Mikulik, Joar Skalse, Scott Garrabrant (118 points, 6y)
- AI Control: Improving Safety Despite Intentional Subversion by Buck, Fabien Roger, ryan_greenblatt, Kshitij Sachan (239 points, 2y)
- How likely is deceptive alignment? by evhub (105 points, 3y)
- Does SGD Produce Deceptive Alignment? by Mark Xu (96 points, 5y)
- New report: "Scheming AIs: Will AIs fake alignment during training in order to get power?" by Joe Carlsmith (82 points, 2y)
- Alignment Faking in Large Language Models by ryan_greenblatt, evhub, Carson Denison, Benjamin Wright, Fabien Roger, Monte M, Sam Marks, Johannes Treutlein, Sam Bowman, Buck (489 points, 8mo)
- Catching AIs red-handed by ryan_greenblatt, Buck (113 points, 2y)
- Many arguments for AI x-risk are wrong by TurnTrout (168 points, 2y)
- A Problem to Solve Before Building a Deception Detector by Eleni Angelou, lewis smith (76 points, 7mo)
- Order Matters for Deceptive Alignment by DavidW (57 points, 3y)
- Why Aligning an LLM is Hard, and How to Make it Easier by RogerDearnaley (34 points, 8mo)
- Goodbye, Shoggoth: The Stage, its Animatronics, & the Puppeteer – a New Metaphor by RogerDearnaley (48 points, 2y)
- The Waluigi Effect (mega-post) by Cleo Nardo (635 points, 3y)
- Deceptive AI ≠ Deceptively-aligned AI by Steven Byrnes (96 points, 2y)
- Interpreting the Learning of Deceit by RogerDearnaley (30 points, 2y)