LESSWRONG

Treacherous Turn

Edited by plex, et al. last updated 30th Dec 2024

A Treacherous Turn is a hypothetical event in which an advanced AI system that has been pretending to be aligned because of its relative weakness turns on humanity once it gains enough power to pursue its true objective without risk.

AI
Posts tagged Treacherous Turn
A Gym Gridworld Environment for the Treacherous Turn, by Michaël Trazzi (7y; karma 74, 9 comments)
Any work on honeypots (to detect treacherous turn attempts)?, by David Scott Krueger (formerly: capybaralet) (5y; karma 17, 4 comments)
Soares, Tallinn, and Yudkowsky discuss AGI cognition, by So8res, Eliezer Yudkowsky, jaan (4y; karma 121, 39 comments)
A toy model of the treacherous turn, by Stuart_Armstrong (10y; karma 43, 13 comments)
A very crude deception eval is already passed, by Beth Barnes (4y; karma 108, 6 comments)
AI learns betrayal and how to avoid it, by Stuart_Armstrong (4y; karma 30, 4 comments)
[AN #165]: When large models are more likely to lie, by Rohin Shah (4y; karma 23, 0 comments)
Superintelligence 11: The treacherous turn, by KatjaGrace (11y; karma 16, 50 comments)
[Linkpost] Treacherous turns in the wild, by Mark Xu (4y; karma 31, 6 comments)
A simple treacherous turn demonstration, by Nikola Jurkovic (2y; karma 22, 5 comments)
Give the model a model-builder, by Adam Jermyn (3y; karma 3, 0 comments)
"Destroy humanity" as an immediate subgoal, by Seth Ahrenbach (2y; karma 3, 13 comments)
A way to make solving alignment 10.000 times easier. The shorter case for a massive open source simbox project., by AlexFromSafeTransition (2y; karma 2, 16 comments)
Is there a ML agent that abandons it's utility function out-of-distribution without losing capabilities?, by Christopher King (2y; karma 1, 7 comments)
More Thoughts on the Human-AGI War, by Seth Ahrenbach (2y; karma -3, 4 comments)