Deceptive Alignment

Contributors: Multicore

Deceptive alignment is when an AI that is not actually aligned temporarily acts aligned in order to deceive its creators or its training process. Presumably it does so to avoid being shut down or retrained, and to gain access to the power its creators would give an aligned AI.

See also: Mesa-optimization, Treacherous Turn, Eliciting Latent Knowledge

Posts tagged Deceptive Alignment
Most Relevant

Deceptive Alignment Ω, by evhub, Chris van Merwijk, vlad_m, Joar Skalse, Scott Garrabrant (4y; karma 97; tag relevance 7; 11 comments)
Trying to Make a Treacherous Mesa-Optimizer Ω, by MadHatter (3mo; karma 87; tag relevance 5; 13 comments)
Evaluations project @ ARC is hiring a researcher and a webdev/engineer Ω, by Beth Barnes (5mo; karma 97; tag relevance 2; 7 comments)
How likely is deceptive alignment? Ω, by evhub (5mo; karma 76; tag relevance 2; 21 comments)
The Defender’s Advantage of Interpretability Ω, by Marius Hobbhahn (5mo; karma 41; tag relevance 2; 4 comments)
Sticky goals: a concrete experiment for understanding deceptive alignment Ω, by evhub (5mo; karma 35; tag relevance 2; 13 comments)
Monitoring for deceptive alignment Ω, by evhub (5mo; karma 119; tag relevance 1; 7 comments)
Does SGD Produce Deceptive Alignment? Ω, by Mark Xu (2y; karma 85; tag relevance 1; 7 comments)
Smoke without fire is scary Ω, by Adam Jermyn (4mo; karma 49; tag relevance 1; 22 comments)
Why deceptive alignment matters for AGI safety Ω, by Marius Hobbhahn (5mo; karma 48; tag relevance 1; 12 comments)
Steering Behaviour: Testing for (Non-)Myopia in Language Models Ω, by Evan R. Murphy, Megan Kinniment (2mo; karma 38; tag relevance 1; 17 comments)
It matters when the first sharp left turn happens Ω, by Adam Jermyn (4mo; karma 35; tag relevance 1; 9 comments)
Getting up to Speed on the Speed Prior in 2022 Ω, by robertzk (1mo; karma 33; tag relevance 1; 1 comment)
Levels of goals and alignment Ω, by zeshen (5mo; karma 27; tag relevance 1; 4 comments)
Deceptive failures short of full catastrophe., by Alex Lawsen (20d; karma 25; tag relevance 1; 0 comments)
Showing 15 of 22 posts.