Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

We're open-sourcing POWERplay, a research toolchain you can use to study power-seeking behavior in reinforcement learning agents. POWERplay was developed by Gladstone AI for internal research.

POWERplay's main use is to estimate the instrumental value that a reinforcement learning agent can get from a state in an MDP. Its implementation is based on a definition of instrumental value (or "POWER") first proposed by Alex Turner et al. We've extended this definition to cover certain tractable multi-agent RL settings, and built an implementation behind a simple Python API.
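To make the quantity concrete: under Turner et al.'s definition, the POWER of a state is (up to a normalizing factor) the expected optimal value an agent could achieve from that state, averaged over a distribution of reward functions. The sketch below is not POWERplay's actual API; it's a from-scratch Monte Carlo illustration of the single-agent computation, assuming a uniform-[0, 1] reward prior over states (one common choice) and state-based rewards. All names here are illustrative.

```python
import numpy as np

def optimal_values(T, R, gamma, iters=1000):
    """Value iteration. T: transition tensor of shape (S, A, S);
    R: reward per state, shape (S,)."""
    V = np.zeros(len(R))
    for _ in range(iters):
        # Q[s, a] = R(s) + gamma * sum_s' T[s, a, s'] * V(s')
        Q = R[:, None] + gamma * (T @ V)
        V = Q.max(axis=1)
    return V

def estimate_power(T, gamma, n_samples=500, seed=0):
    """Monte Carlo estimate of
    POWER(s) = (1 - gamma)/gamma * E_{R ~ D}[V*_R(s) - R(s)],
    with rewards drawn iid Uniform[0, 1] over states."""
    rng = np.random.default_rng(seed)
    n_states = T.shape[0]
    total = np.zeros(n_states)
    for _ in range(n_samples):
        R = rng.uniform(size=n_states)
        total += optimal_values(T, R, gamma) - R
    return (1 - gamma) / gamma * total / n_samples

# Demo: 3-state MDP where state 0 can move to absorbing states 1 or 2.
T = np.zeros((3, 2, 3))
T[0, 0, 1] = 1.0  # action 0 from state 0 -> state 1
T[0, 1, 2] = 1.0  # action 1 from state 0 -> state 2
T[1, :, 1] = 1.0  # state 1 is absorbing
T[2, :, 2] = 1.0  # state 2 is absorbing
power = estimate_power(T, gamma=0.9)
```

In this toy MDP, state 0 should come out with higher estimated POWER than the absorbing states 1 and 2, matching the intuition that states with more reachable options carry more instrumental value.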

We've used POWERplay previously to obtain some suggestive early results in single-agent and multi-agent power-seeking. But we think there may be more low-hanging fruit to be found in this area.

Beyond our own ideas about what to do next, we've also received some interesting conceptual questions in connection with this work. A major reason we're open-sourcing POWERplay is to lower the cost of converting these conceptual questions into real experiments with concrete outcomes that can support or falsify our intuitions about instrumental convergence.

Ramp-up

We've designed POWERplay to make getting started as easy as possible. Follow the installation and quickstart instructions to get moving quickly. Use the replication API to trivially reproduce any figure from any post in our instrumental convergence sequence. Design single-agent and multi-agent MDPs and policies, launch experiments on your local machine, and visualize results with clear figures and animations.

POWERplay comes with "batteries included": all the code samples in the documentation should work out of the box once it's installed successfully. It also ships with pre-run examples of experimental results, so you can see what "normal" output is supposed to look like. This does make the repo weigh in at about 500 MB, but we think that's worth it for letting you immediately start playing around with visualizations on preexisting data.

If we've done our job right, a smart and curious grad student (with a bit of Python experience) should be able to start reproducing our previous experiments within an hour, and to have some new — and hopefully interesting! — results within a week.

We're looking forward to seeing what people do with this. If you have any questions or comments about POWERplay, feel free to reach out to Edouard at edouard@gladstone.ai.

Clone POWERplay on GitHub.
