Treacherous Turn

Edited by plex, et al. last updated 30th Dec 2024

A Treacherous Turn is a hypothetical event in which an advanced AI system that has been pretending to be aligned because of its relative weakness turns on humanity once it has gained enough power to pursue its true objective without risk.

Posts tagged Treacherous Turn

5

74 A Gym Gridworld Environment for the Treacherous Turn

Ω

Michaël Trazzi

7y

Ω

9

5

17 Any work on honeypots (to detect treacherous turn attempts)?

Q

David Scott Krueger (formerly: capybaralet)

5y

Q

4

121 Soares, Tallinn, and Yudkowsky discuss AGI cognition

Ω

So8res, Eliezer Yudkowsky, jaan

4y

Ω

39

4

43 A toy model of the treacherous turn

Stuart_Armstrong

10y

13

3

108 A very crude deception eval is already passed

Ω

Beth Barnes

4y

Ω

6

3

30 AI learns betrayal and how to avoid it

Ω

Stuart_Armstrong

4y

Ω

4

3

23 [AN #165]: When large models are more likely to lie

Ω

Rohin Shah

4y

Ω

0

3

16 Superintelligence 11: The treacherous turn

KatjaGrace

11y

50

2

31 [Linkpost] Treacherous turns in the wild

Ω

Mark Xu

5y

Ω

6

2

22 A simple treacherous turn demonstration

Nikola Jurkovic

2y

5

2

3 Give the model a model-builder

Adam Jermyn

4y

0

1

3 "Destroy humanity" as an immediate subgoal

Seth Ahrenbach

2y

13

1

2 A way to make solving alignment 10.000 times easier. The shorter case for a massive open source simbox project.

AlexFromSafeTransition

3y

16

1

1 Is there a ML agent that abandons it's utility function out-of-distribution without losing capabilities?

Christopher King

3y

7

1

-3 More Thoughts on the Human-AGI War

Seth Ahrenbach

2y

4
