Simple experiments with deceptive alignment
“Risks from Learned Optimization” introduces the phenomenon of “deceptive alignment”. The authors give a simple task environment as an example to illustrate the phenomenon:

[Figure: example task environment, from page 23 of “Risks from Learned Optimization in Advanced Machine Learning Systems”]

I wanted to build a task environment that is almost as simple, but...