Work done at FAR AI.
There has been a lot of conceptual work on mesa-optimizers: neural networks that develop internal goals that may differ from their training objectives (the inner alignment problem). There is an abundance of good ideas for empirical work (find search in a NN, interpret it), but very little actual execution, partly because we did not have a clear-cut example of a mesa-optimizer to study. Until now.
We have replicated the mesa-optimizer that Guez et al. (2019) found, and released it open-source as a model organism for inner alignment research. In brief, Guez et al. trained a recurrent neural network (RNN) with model-free RL to play Sokoban. They noticed that if you give...