This is a linkpost for https://github.com/epurdy/saferl

We describe LQPR (Linear-Quadratic-Program-Regulator), an algorithm for model-based reinforcement learning that allows us to prove PAC-style bounds on the behavior of the system. We also propose experiments in the DeepMind AI Safety Gridworlds domain; we have not yet had time to implement them, but we provide predictions for the outcome of each. Future work may include scaling the approach to non-trivial tasks using a neural network approximator and proving additional theoretical results about the safety and stability of such a system. We believe this system is a potential basis for aligning large language models and other powerful near-term AIs with human preferences.

We also provide a list of critiques of the proposal from @davidad.
