Summary of the sequence
Over the past few months, we’ve been investigating instrumental convergence in reinforcement learning agents. We started from the definition of single-agent POWER proposed by Alex Turner et al., extended it to a family of multi-agent scenarios that seemed relevant to AI alignment, and explored its implications experimentally in several RL environments.
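For readers who haven't seen it, the single-agent quantity we build on is, roughly, Turner et al.'s POWER: the average optimal value an agent can expect from a state, taken over a distribution of possible reward functions. Up to notation, their definition is

$$\mathrm{POWER}_{\mathcal{D}}(s, \gamma) \;=\; \frac{1-\gamma}{\gamma}\,\mathbb{E}_{R \sim \mathcal{D}}\!\left[\, V^*_R(s, \gamma) - R(s) \,\right]$$

where $\mathcal{D}$ is a distribution over reward functions, $\gamma$ is the discount factor, and $V^*_R(s, \gamma)$ is the optimal value of state $s$ under reward function $R$.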
The biggest takeaways are:
- Alignment of terminal goals and alignment of instrumental goals are sharply different phenomena, and we can quantify and visualize each one separately.
- If two agents have unrelated terminal goals, their instrumental goals will tend to be misaligned by default: in our examples, the agents interact competitively unless we make an active effort to align their terminal goals.
- As we increase the planning horizon of our agents, instrumental value concentrates into a