Andreas_Moe — LessWrong

Simple experiments with deceptive alignment

“Risks from Learned Optimization in Advanced Machine Learning Systems” introduces the phenomenon of “deceptive alignment”. The authors illustrate it with a simple task environment, shown in a figure on page 23 of the paper.

I wanted to build a task environment that is almost as simple, but...

May 15, 2023•7