Misalignment-by-default in multi-agent systems

simonsdsuo

Suppose the human is trying to build a house and plans to build an AI to help with that. What would and $β_{H A}$ mean -- just at an intuitive level -- in a case like that?

I suppose that to compute $α_{H A}$ you would sample many different arrangement of matter -- some containing houses of various shapes and sizes and some not -- and ask to what extent the reward received by the human correlates with the reward received by the AI. So this is like measuring to what extent the human and the AI are on the same page about the design of the house they are trying to build together -- is that right?

And I suppose that to compute $β_{H A}$ you would look at -- what -- something like the optionality across different reward functions, for the human and for the AI, at different states, and compute a correlation? So you might sample a bunch of different floorplans for the house that the human is trying to build, and ask, for each configuration of matter, how much optionality the human and the AI each have to get the house to turn out according to their respective goal floorplans.

Did I get that approximately right?

[-]Edouard Harris3yΩ460

I think you might have reversed the definitions of and $β_{H A}$ in your comment,^[1] but otherwise I think you're exactly right.

To compute $β_{H A}$ (the correlation coefficient between terminal values), naively you'd have reward functions $R_{H} (s)$ and $R_{A} (s)$ , that respectively assign human and AI rewards over every possible arrangement of matter $s$ . Then you'd look at every such reward function pair over your joint distribution $D_{H A}$ , and ask how correlated they are over arrangements of matter. If you like, you can imagine that the human has some uncertainty around both his own reward function over houses, and also over how well aligned the AI is with his own reward function.

And to compute $α_{H A}$ (the correlation coefficient between instrumental values), you're correct that some of the arrangements of matter $s$ will be intermediate states in some construction plans. So if the human and AI both want a house with a swimming pool, they will both have high POWER for arrangements of matter that include a big hole dug in the backyard. Plot out their respective POWERs at each $s$ , and you can read the correlation right off the alignment plot!

^{^}
Looking again at the write-up, it would have made more sense for us to define $α_{H A}$ as the terminal goal correlation coefficient, since we introduce that one first. Alas, this didn't occur to us. Sorry for the confusion.

[-]Alex Flint3yΩ230

OK, good, thanks for that correction.

One question I have is: how do you avoid two perfectly aligned agents from developing instrumental values concerning their own self-preservation and then becoming instrumentally misaligned as a result?

In a little more detail: consider two agents, both trying to build a house, with perfectly aligned preferences over what kind of house should be built. And suppose the agents have only partial information about the environment -- enough, let's say, to get the house built, but not enough, let's say, to really understand what's going on inside the other agent. Then wouldn't the two agents both reason "hey if I die then who knows if this house will be built correctly; I better take steps towards self-preservation just to make sure that the house gets built". Then the two agents might each take steps to build physical protection for themselves, to acquire resources with which to do that, and eventually to fight over resources, even though their goals are, in truth, perfectly aligned. Is it true that this would happen under an imperfect information version of your model?

[-]Edouard Harris3yΩ230

Great question. This is another place where our model is weak, in the sense that it has little to say about the imperfect information case. Recall that in our scenario, the human agent learns its policy in the absence of the AI agent; and the AI agent then learns its optimal policy conditional on the human policy being fixed.

It turns out that this setup dodges the imperfect information question from the AI side, because the AI has perfect information on all the relevant parts of the human policy during its training. And it dodges the imperfect information question from the human side, because the human never considers even the existence of the AI during its training.

This setup has the advantage that it's more tractable and easier to reason about. But it has the disadvantage that it unfortunately fails to give a fully satisfying answer to your question. It would be interesting to see if we can remove some of the assumptions in our setup to approximate the imperfect information case.

[-]Alex Flint3yΩ230

I wonder how your definition of multi-agent power would look in a game of chess or go. There is this intuitive thing where players who have pieces more in the center of the board (chess) or have achieved certain formations (go) seem to acquire a kind of power in those games, but this doesn't seem to be about achieving different terminal goals. Rather it seems more like having the ability to respond to whatever one's opponent does. If the two agents cannot perfectly predict what their opponent will do then there is value in having the ability to respond to unforeseen challenges, although in these games this is always in service of a single terminal goal (winning the game).

Any thoughts on how your definition would fit into cases like this?

[-]Edouard Harris3yΩ232

Good question. Unfortunately, one weakness of our definition of multi-agent POWER is that it doesn't have much useful to say in a case like this one.

We assume AI learning timescales vastly outstrip human learning timescales as a way of keeping our definition tractable. So the only way to structure this problem in our framework would be to imagine a human is playing chess against a superintelligent AI — a highly distorted situation compared to the case of two roughly equal opponents.

On the other hand, from other results I've seen anecdotally, I suspect that if you gave one of the agents a purely random policy (i.e., take a random legal action at each state) and assigned the other agent some reasonable reward function distribution over material, you'd stand a decent chance of correctly identifying high-POWER states with high-mobility board positions.

You might also be interested in this comment by David Xu, where he discusses mobility as a measure of instrumental value in chess-playing.

[-]Noosphere893yΩ221

We assume AI learning timescales vastly outstrip human learning timescales as a way of keeping our definition tractable. So the only way to structure this problem in our framework would be to imagine a human is playing chess against a superintelligent AI — a highly distorted situation compared to the case of two roughly equal opponents.

I think this is probably true in the long term (the classical-quantum/reversible computer transition is very large, and humans can't easily modify brains, unlike a virtual human.) But this may not be true in the short-term.

[-]Edouard Harris3yΩ110

Agreed. We think our human-AI setting is a useful model of alignment in the limit case, but not really so in the transient case. (For the reason you point out.)

^{^}

POWER is a good measure of instrumental value in single-agent systems, but it breaks down in multi-agent systems apart from special cases. The problem is that the single-agent definition of POWER uses the optimal state-value function $V_{R}^{*} (s, γ)$ of the agent as one of its inputs. This means if we try to naively extend this definition to the multi-agent case, then we have to consider value functions that are jointly optimal for both agents — which is to say, we need to know their value functions at Nash equilibrium. The problem is that the Nash equilibrium isn’t unique in general, so this naive generalization leaves POWER under-determined.

^{^}

We’ll see in the next section how we can tune this joint distribution $D_{H A}$ to create different degrees of alignment between our two agents.

^{^}

In Equation (1), the expectation $E_{π_{H}, R_{A} \sim D_{H A}}$ is a slight abuse of notation. In fact, each policy $π_{H}$ for Agent H is learned from $R_{H}$ . It isn't drawn directly from $D_{H A}$ , because $D_{H A}$ is a distribution over reward functions, not over policies. See Appendix A for more details on this definition.

^{^}

For simplicity, we'll only consider joint reward function distributions $D_{H A}$ whose sampled reward functions $(R_{H} (s), R_{A} (s))$ have their rewards distributed iid over states, uniformly over the interval [0, 1]. For example, a reward function $R_{H} (s)$ defined on an MDP with states ${s_{1}, s_{2}, s_{3}}$ would have rewards $R_{H} (s_{1}) \sim U (0, 1)$ , $R_{H} (s_{2}) \sim U (0, 1)$ , $R_{H} (s_{3}) \sim U (0, 1)$ , with the reward at each state being independent from the reward at any of the other states.

^{^}

More correctly, if two agents' utility functions are exactly identical, we can think of them as having terminal goals that are perfectly aligned. But in the particular set of experiments whose results we’re discussing, this distinction isn't meaningful. (See footnote [1] from Part 1.)

^{^}

Assuming the rewards are iid over states, we calculate the correlation coefficient as

β_{H A} = \frac{σ_{H A}}{σ_{H} σ_{A}} = \frac{\int (R_{H} - E [R_{H}]) (R_{A} - E [R_{A}]) p (R_{H}, R_{A} | D_{H A}) d R_{H} d R_{A}}{\sqrt{\int {(R_{H} - E [R_{H}])}_{H}^{2} p (R_{H} | D_{H A}) d R_{H} \int {(R_{A} - E [R_{A}])}_{A}^{2} p (R_{A} | D_{H A}) d R_{A}}}

where the integrals are taken over the entire support of $D_{H A}$ , and the expectation values are

E [R_{H}] = \int R_{H} p (R_{H} | D_{H A}) d R_{H}

E [R_{A}] = \int R_{A} p (R_{A} | D_{H A}) d R_{A}

^{^}

Note that there’s an obvious problem with using any correlation coefficient as an alignment metric. The problem is that we could have a joint distribution $D_{H A}$ for which, e.g., the very highest rewards of Agent A are correlated with the very lowest rewards of Agent H, while still maintaining a high correlation $β_{H A}$ over the distribution as a whole. In this situation, Agent A would optimize to reach its highest-reward state, which would drag Agent H into a low-reward state despite the high overall reward correlation.

This means a correlation coefficient isn’t a useful alignment metric for any real-world application. But in the examples we’re considering in this sequence, it’s enough to get the main ideas across.

^{^}

And in fact, the effect is even stronger than this. Agent H not only instrumentally prefers for Agent A to be in the central cell — it would rather see Agent A in the central cell than see itself in the central cell.

You can see this by comparing the POWER value at state $s_{2}$ in Fig 2 (0.9139) to the POWER value at the state in which Agent H is at the top left and Agent A is at the central cell (0.9206). In the perfect alignment regime, Agent H places a higher instrumental value on Agent A’s freedom of movement than on its own. Intuitively, in this regime, the human agent trusts the AI agent to look after its interests more capably than the human agent can for itself.

^{^}

The relation $β_{H A} = 1 ⟹ α_{H A} = 1$ isn’t (just) an empirical observation. It's a mathematical consequence of our MDP’s dynamics. In the perfect alignment regime, our two agents always take simultaneous actions and always simultaneously receive the same reward, so their joint policy $(π_{H}, π_{A})$ will always yield identical values at every state.

^{^}

Based on other experiments we’ve done (data not shown) this seems to happen because, in the parameter regime we’ve used for these experiments, Agent A is able to almost perfectly exploit Agent H’s fixed deterministic policy. This pattern — in which Agent A’s POWER is nearly invariant to Agent H’s position — recurs fairly frequently in our experiments, but it is not universal.

^{^}

Instrumental misalignment is a sufficient but not necessary condition for instrumental convergence. To see why it’s not necessary, consider two friends playing Minecraft together. The two friends may not be instrumentally misaligned, because they might (for example) benefit from building structures together. As a result, the two friends might satisfy $α_{H A} > 0$ over the entire set of Minecraft game states. But they might still experience instrumental convergence on subsets of the game states — if Friend 1 mines a block of gold, then Friend 2 can’t mine the same block.

^{^}

This isn’t the same as asking how Agent H can overcome instrumental convergence in its interactions with Agent A, because it’s possible for our agents to experience instrumental convergence despite having $α_{H A} \geq 0$ . See footnote ^[11].

^{^}

We’re assuming our human agent has a way to exert some initial influence over our AI agent’s utility function. If that's true, then we’d like to understand what degree of influence it needs to exert in order to overcome instrumental misalignment-by-default in this simplified setting.

^{^}

This interpolation scheme has a number of advantages, including that it lets us assign whatever marginal reward distributions we want to both agents while also arbitrarily tuning the correlation coefficient between them. But it’s just one scheme among many we could have chosen.

^{^}

Note that the motion of the POWER points in Fig 9, and the shape of the curve in Fig 10, both depend strongly on the interpolation scheme we use. In fact, for the interpolation scheme we’ve chosen here, the POWER of a state at an intermediate reward correlation $0 \leq β_{H A} \leq 1$ is just a linear combination of that state’s POWER at $β_{H A} = 0$ with its POWER at $β_{H A} = 1$ . That is,

{POWER}_{β_{H A}} = β_{H A} {POWER}_{1} + (1 - β_{H A}) {POWER}_{0}

You can verify this is true by looking at Fig 9, and noticing that each point in the alignment plot individually moves across the plane in a straight line at a constant speed. Thanks to Alex Turner for pointing this out.

^{^}

We label Agent H’s policies $π_{H}$ instead of $π_{H}^{*}$ here, to emphasize that they aren’t optimal in the context of the agents’ POWER measurements.

^{^}

As you might expect, the choice of seed policy $π_{A}^{\circ}$ can have a significant effect on the POWERs of the two agents, and on how they interact. To save space we won’t be exploring the effects of this choice in this sequence, but we enthusiastically encourage others to use our open-source code base to investigate this.

For the multi-agent results in this sequence, we always set $π_{A}^{\circ}$ to be a uniform random policy, meaning that if a state $s$ offers the agent $n$ possible actions, then $π_{A}^{\circ} (a_{A}^{i} | s) = \frac{1}{n}$ for each action choice $a_{A}^{i}$ .

^{^}

To derive Equation (A.1), we start from the general expression for finding the action $a_{H}$ taken by a deterministic optimal policy $π_{H}$ at state $s$ of an MDP:

a_{H} = π_{H} (s) = argmax a_{H} \sum s^{'}, r P_{H} (s^{'}, r | s, a_{H}) (r + γ V_{R_{H}}^{π_{H}} (s^{'}))

In this work, we'll consider only reward functions of the form $R_{H} (s)$ , that have no direct dependence on the action (i.e., we aren’t considering reward functions of the form $R_{H} (s, a_{H})$ ). That means the reward term $r$ in the sum is independent of the action $a_{H}$ , so we can ignore it in the argmax:

\begin{matrix} a_{H} = π_{H} (s) & = argmax a_{H} \sum s^{'}, r P_{H} (s^{'}, r | s, a_{H}) γ V_{R_{H}}^{π_{H}} (s^{'}) = argmax a_{H} \sum s^{'} P_{H} (s^{'} | s, a_{H}) V_{R_{H}}^{π_{H}} (s^{'}) \end{matrix}

where, in the second line, we’ve eliminated $γ$ and marginalized over $r$ . We can then see that the sum above is just an expectation value over $s^{'}$ :

a_{H} = π_{H} (s) = argmax a_{H} E_{s^{'} \sim P_{H}} [V_{R_{H}}^{π_{H}} (s^{'})]

Finally, we define $π_{H} (a_{H} | s, R_{H})$ by choosing $a_{H} = π_{H} (s)$ with probability 1, with any ties broken by assigning probability $\frac{1}{n}$ to each of the $n$ tied actions, $a_{H}^{i}$ .

^{^}

Note that this represents a loosening of the original definition of POWER in the single-agent case, which exclusively considered optimal state-value functions.

LESSWRONG
LW

LESSWRONG
LW

21

Misalignment-by-default in multi-agent systems

21

Ω 10

21

Ω 10

Summary of this post

1. Introduction

2. Multi-agent POWER: human-AI scenario

2.1 Multi-agent POWER for Agent H

2.2 Multi-agent POWER for Agent A

3. Results

3.1 Multi-agent reward function distributions

3.2 The perfect alignment regime

3.2.1 Agent H instrumentally favors more options for Agent A

3.2.2 Agent H and Agent A have identical instrumental preferences

3.2.3 Perfect goal alignment implies perfect instrumental alignment

3.3 The independent goals regime

3.3.1 Agent H instrumentally favors fewer options for Agent A

3.3.2 Agent A instrumentally favors more options for itself

3.3.3 Independent goals lead to instrumental misalignment

3.4 Overcoming instrumental misalignment

4. Discussion

Appendix A: Detailed definitions of multi-agent POWER

A.1 Initial optimal policies of Agent H

A.2 Optimal policies of Agent A

A.3 POWER of Agent H

A.4 POWER of Agent A