Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Here are the slides for a talk I just gave at CHAI's 2021 workshop. Thanks to Andrew Critch for prompting me to flesh out this idea.

The first part of my talk summarized my existing results on avoiding negative side effects by making the agent "act conservatively." The second part showed how this helps facilitate iterated negotiation and increase gains from trade in the multi-stakeholder setting.

1: Existing work on side effects

Agents only care about the parts of the environment relevant to their specified reward function.
We somehow want an agent which is "conservative" and "doesn't make much of a mess." 
AUP penalizes the agent for changing its ability to achieve a wide range of goals. Even though we can't specify our "true objective" to the agent, we hope that the agent stays able to do the right thing, as a result of staying able to do many things.
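
One way to write this penalty down, as a sketch rather than the exact form used in the talk: assume auxiliary reward functions $R_1, \dots, R_n$ with optimal action-value functions $Q^*_{R_i}$, a no-op action $\varnothing$, and a penalty weight $\lambda \geq 0$. All of these symbols are introduced here for illustration.

```latex
R_{\text{AUP}}(s, a) \;=\; R(s, a) \;-\; \frac{\lambda}{n} \sum_{i=1}^{n}
  \left| Q^*_{R_i}(s, a) - Q^*_{R_i}(s, \varnothing) \right|
```

Under this reading, an action is penalized to the extent that it changes how well the agent could pursue each auxiliary goal, compared with doing nothing.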
We first demonstrated that AUP avoids side effects in tiny tabular domains.
Conway's Game of Life has simple, local dynamics which add up to complex long-term consequences.
SafeLife turns the Game of Life into an actual game, adding an agent and many unique cell types. Crucially, there are fragile green cell patterns which most policies plow through and irreversibly shatter. We want the low-impact agent to avoid disrupting them whenever possible, without telling it directly not to disrupt green cell patterns. AUP pulls this off.
We learn the AUP policy in 3 steps. Step one: the agent learns to encode its observations (the game screen) with just one real number. This lets us learn an auxiliary environmental goal unsupervised.
Step two: we train the agent to optimize this encoder-reward function "goal"; in particular, the network learns to predict the values of different actions.
Step three: we're done! We have the AUP reward function. 
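
To make step three concrete, here is a minimal sketch of how a penalized reward could be computed once an auxiliary Q-function is available. It assumes a single auxiliary goal, a small tabular `q_aux` array standing in for the learned network, a no-op action at index `NOOP`, and a penalty weight `lam`. These names and numbers are hypothetical and not taken from the talk.

```python
import numpy as np

NOOP = 0  # hypothetical index of the "do nothing" action


def aup_reward(primary_reward, q_aux, state, action, lam=0.1):
    """Primary reward minus a scaled penalty for shifting auxiliary value.

    q_aux[state, action] stands in for the learned value of taking
    `action` in `state` under the auxiliary (encoder-reward) goal.
    """
    # Penalty: how much the action changes auxiliary value versus doing nothing.
    penalty = abs(q_aux[state, action] - q_aux[state, NOOP])
    return primary_reward - lam * penalty


# Toy values (assumed): in state 0, action 2 destroys most auxiliary value.
q_aux = np.array([
    [1.0, 1.0, 0.2],
    [0.5, 0.5, 0.5],
])

harmless = aup_reward(1.0, q_aux, state=0, action=1, lam=0.5)
disruptive = aup_reward(1.0, q_aux, state=0, action=2, lam=0.5)
```

In this toy table the harmless action keeps its full primary reward, while the disruptive one is docked in proportion to the auxiliary value it destroys.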

Summary of results: AUP does very well.

  • I expect AUP to scale further, to high-dimensional embodied tasks
    • E.g. avoiding making a mess on a factory floor
  • I expect physically distant side effects to be harder for AUP to detect
    • Distant effects are less likely to show up in the value functions for the agent's auxiliary goals, which are what the penalty terms measure

2: Fostering repeated negotiation over time

I think of AUP as addressing the single-principal (AI designer) / single-agent (AI agent) case. What about the multi-principal / single-agent case?

In this setting, negotiated agent policies usually destroy option value.

Optimal actions when .
Optimal actions when .
Optimal actions when .

This might be OK if the interaction is one-off: the agent's production possibilities frontier is fairly limited, and it usually specializes in one beverage or the other. 

But interactions are rarely one-off: there are often opportunities for later trades and renegotiations as the principals gain resources or change their minds about what they want.

Concretely, imagine the principals are playing a game of their own.

MP-AUP is my first stab at solving this problem without modelling the joint game. In this agent production game, MP-AUP gets the agent to stay put until it is corrected (i.e. the agent is given a new reward function, after which it computes a new policy).
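
As a rough illustration of the "stay put" behavior, here is a toy sketch. It is not the MP-AUP objective from the talk. It assumes two principals (one wants tea, one wants coffee), a made-up table of how much value the agent could still attain for each principal after each action, and a penalty equal to the average absolute shift in those values relative to a no-op. Every name and number below is hypothetical.

```python
import numpy as np

ACTIONS = ["stay", "specialize_tea", "specialize_coffee"]
NOOP = 0  # index of "stay"

# Assumed attainable values: rows are principals, columns are actions.
q_principals = np.array([
    [1.0, 1.5, 0.0],  # tea-loving principal
    [1.0, 0.0, 1.5],  # coffee-loving principal
])

# Assumed reward from the currently negotiated objective, per action.
negotiated_reward = np.array([0.0, 0.6, 0.4])


def penalized_rewards(reward, q, lam):
    """Subtract lam times the mean shift in each principal's attainable value."""
    # Shift is measured against the no-op column, then averaged over principals.
    shift = np.abs(q - q[:, [NOOP]]).mean(axis=0)
    return reward - lam * shift


unpenalized_choice = ACTIONS[int(np.argmax(penalized_rewards(negotiated_reward, q_principals, 0.0)))]
penalized_choice = ACTIONS[int(np.argmax(penalized_rewards(negotiated_reward, q_principals, 1.0)))]
```

With no penalty, the toy agent specializes in tea and gives up most of its ability to serve the coffee-loving principal. With the penalty switched on, staying put scores highest, so the agent keeps its options open until it is given a new reward function.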

We can motivate the MP-AUP objective with an analogous situation. Imagine the agent starts off with uncertainty about what objective it should optimize, and the agent reduces its uncertainty over time. This is modelled using the 'assistance game' framework, of which Cooperative Inverse Reinforcement Learning is one example. (The assistance game paper has yet to be publicly released, but I think it's quite good!)

Assistance games are a certain kind of partially observable Markov decision process (POMDP), and they're solved by policies which maximize the agent's expected true reward. So once the agent is certain of the true objective, it should just optimize that. But what about before then? 

This is suggestive, but the assumptions don't perfectly line up with our use case (reward uncertainty isn't obviously equivalent to optimizing a mixture utility function per Harsanyi). I'm interested in more directly axiomatically motivating MP-AUP as (approximately) solving a certain class of joint principal/agent games under certain renegotiation assumptions, or (in the negative case) understanding how it falls short.

Here are some problems that MP-AUP doesn't address:

  • Multi-principal/multi-agent: even if agent A can make tea, that doesn’t mean agent A will let agent B make tea.
  • Specifying individual principal objectives
  • Ensuring that the agent remains corrigible to its principals: it means nothing that MP-AUP agents remain able to act in the interest of each principal if we can no longer correct the agent so that it actually pursues those interests.

Furthermore, it seems plausible to me that MP-AUP helps pretty well in the multiple-principal/single-agent case, without much more work than normal AUP requires. However, I think there's a good chance I haven't thought of some crucial considerations which make it fail or which make it less good. In particular, I haven't thought much about the  principal case.

Conclusion

I'd be excited to see more work on this, but I don't currently plan to do it myself. I've only thought about this idea for <20 hours over the last few weeks, so there are probably many low-hanging fruits and important questions to ask. AUP and MP-AUP seem to tackle similar problems, in that they both (aim to) incentivize the agent to preserve its ability to change course and pursue a range of different tasks. 
