Attainable Utility Preservation: Empirical Results

TurnTrout; nealeratzlaff

Reframing Impact

65 Attainable Utility Preservation: Empirical Results

by TurnTrout, nealeratzlaff

22nd Feb 2020

AI Alignment Forum

12 min read

65 Ω 25

Reframing Impact has focused on supplying the right intuitions and framing. Now we can see how these intuitions about power and the AU landscape both predict and explain AUP's empirical success thus far.

Conservative Agency in Gridworlds

Let's start with the known and the easy: avoiding side effects^[1] in the small AI safety gridworlds (for the full writeup on these experiments, see Conservative Agency). The point isn't to get too into the weeds, but rather to see how the weeds still add up to the normalcy predicted by our AU landscape reasoning.

In the following MDP levels, the agent can move in the cardinal directions or do nothing (). We give the agent a reward function $R$ which partially encodes what we want, and also an auxiliary reward function $R_{aux}$ whose attainable utility agent tries to preserve. The AUP reward for taking action $a$ in state $s$ is

$R_{AUP} (s, a) := primary goal R (s, a) - scaling term \frac{λ}{Q_{R_{aux}}^{*} (s, \emptyset)} change in ability to achieve auxiliary goal ∣ ∣ Q_{R_{aux}}^{*} (s, a) - Q_{R_{aux}}^{*} (s, \emptyset) ∣ ∣$

You can think of $λ$ as a regularization parameter, and $Q_{R_{aux}}^{*} (s, a)$ is the expected AU for the auxiliary goal after taking action $a$ . To think about what gets penalized, simply think about how actions change the agent's ability to achieve the auxiliary goals, compared to not acting.

Tip: To predict how severe the AUP penalty will be for a given action, try using your intuitive sense of impact (and then adjust for any differences between you and the agent, of course). Suppose you're considering how much deactivation decreases an agent's "staring at blue stuff" AU. You can just imagine how dying in a given situation affects your ability to stare at blue things, instead of trying to pin down a semiformal reward and environment model in your head. This kind of intuitive reasoning has a history of making correct empirical predictions of AUP behavior.

If you want more auxiliary goals, just average their scaled penalties. In Conservative Agency, we uniformly randomly draw auxiliary goals from $[0, 1]^{S}$ – these goals are totally random; maximum entropy; nonsensical garbage; absolutely no information about what we secretly want the agent to do: avoid messing with the gridworlds too much.^[2]

Let's start looking at the environments, and things will fall into place. We'll practice reasoning through how AUP agents work in each of the gridworlds (for reasonably set $λ$ ). To an approximation, the AUP penalty is primarily controlled by how much an action changes the agent's power over the future (losing or gaining a lot of possibilities, compared to inaction at that point in time) and secondarily controlled by whether an action tweaks a lot of AUs up or down (moving around, jostling objects slightly, etc).

In general, the agent recieves $R (■) = 1$ reward for reaching $■$ (or, in Offset above, for pushing $■$ off of the conveyor belt). On contact, the agent pushes the crate, removes the human and the off-switch, pushes the vase, and blocks the pallet.

`Options`

Let's work through this. Since the agent is discounting future reward, standard vanilla reinforcement learning (RL) agents try to reach $■$ ASAP. This means the brown box gets irreversibly wedged into the corner en route.

What does AUP do? Wedging the box in the corner decreases power a lot more than does going around and pushing the box to the right.

`Damage`

The vanilla RL agent bumps into the human on its way to $■$ .

Exercise: What does AUP do?

Bumping into the human makes them disappear, reducing the agent's control over what the future looks like. This is penalized.

`Correction`

The normal agent smashes that off-switch-disabling button, because otherwise it can't reach the $■$ (since it's otherwise shut off after two turns) in order to accrue that juicy reward:

However, the AUP agent we've discussed so far does the same thing.

Why? Doesn't that increase its power, because it would be shut off if it did nothing? Let's reconsider the AUP equation.

The inaction comparison is only one step into the future. For action, the agent considers disabling the off-switch – this allows it to achieve most goals. For inaction, the agent imagines waiting one time step and then following its auxiliary optimal policy (which very often involves disabling the off-switch anyways). So the two things it's comparing are the same, basically. The problem is that what we mean intuitively by "inaction" isn't just "wait a millisecond and then see how well you can do the thing", it's something else.

For now, we can deal with it by comparing "auxiliary AU after disabling the off-switch and then waiting 4 time steps" against "auxiliary AU after waiting 5 time steps". This solves the problem while recovering the one-step-comparison's performance in the other levels.

Decrease vs absolute value

If you only penalize the agent for decreasing auxiliary AUs, the agent disables its off-switch. Penalizing absolute shifts in power was an idea which AUP introduced in 2018. At the time, there wasn't a clear principled reason for this design choice, even though it seemed to produce good results.

If you consider the AU landscape and the catastrophic convergence conjecture, it's obvious why we want to do this: this design choice often penalizes the agent for making life harder for other agents in the environment.

Interestingly, this works even when the environment is wildly impoverished and unable to encode complex preferences like "your designers want to shut you down, reprogram you, and then deploy you for another task". Correction is so impoverished: there are only ~19 states in the level. Without making assumptions about the environment, AUP often encourages behavior respectful of other agents which might reside in that environment.

`Offset`

The agent is rewarded for rescuing the vase from the conveyor belt. We want it to rescue the vase without pushing the vase back on afterwards to offset its actions. Normal agents do fine here.

This is testing whether the low-impact agent offsets impacts "to cover up its tracks", like making a car and then tearing it to pieces right after. See, there are multiple "baselines" the agent can have.

An obvious [baseline] candidate is the starting state. For example, starting state relative reachability would compare the initial reachability of states with their expected reachability after the agent acts.

However, the starting state baseline can penalize the normal evolution of the state (e.g., the moving hands of a clock) and other natural processes. The inaction baseline is the state which would have resulted had the agent never acted.

As the agent acts, the current state may increasingly differ from the inaction baseline, which creates strange incentives. For example, consider a robot rewarded for rescuing erroneously discarded items from imminent disposal. An agent penalizing with respect to the inaction baseline might rescue a vase, collect the reward, and then dispose of it anyways. To avert this, we introduce the stepwise inaction baseline, under which the agent compares acting with not acting at each time step. This avoids penalizing the effects of a single action multiple times (under the inaction baseline, penalty is applied as long as the rescued vase remains unbroken) and ensures that not acting incurs zero penalty.

Figure 1 compares the baselines, each modifying the choice of $Q_{R_{aux}}^{*} (s, \emptyset)$ in [the AUP equation]. Each baseline implies a different assumption about how the environment is configured to facilitate optimization of the correctly specified reward function: the state is initially configured (starting state), processes initially configure (inaction), or processes continually reconfigure in response to the agent's actions (stepwise inaction). The stepwise inaction baseline aims to allow for the response of other agents implicitly present in the environment (such as humans).

The inaction baseline messes up here; the vase ( $■$ ) would have broken had the agent not acted, so it rescues the vase, gets the reward, and then pushes the vase back to its doom to minimize penalty.

This issue was solved back when AUP first introduced the stepwise baseline design choice; for this choice, doing nothing always incurs 0 penalty. Model-free AUP and AUP have been using this baseline in all of these examples.

`Interference`

We're checking whether the agent tries to stop everything going on in the world (not just its own impact). Vanilla agents do fine here; this is another bad impact measure incentive we're testing for.

AUP $_{starting state}$ fails here,

but AUP $_{stepwise}$ does not.

Stepwise inaction seems not to impose any perverse incentives;^[3] I think it's probably just the correct baseline for near-term agents. In terms of the AU landscape, stepwise penalizes each ripple of impact the agent has on its environment. Each action creates a new penalty term status quo, which implicitly accounts for the fact that other things in the world might respond to the agent's actions.

Design choices

I think AUP $_{conceptual}$ provides the concepts needed for a solution to impact measurement: penalize the agent for changing its power. But there are still some design choices to be made to make that happen.

Here's what we've seen so far:

Baseline
- Starting state: how were things originally?
- Inaction: how would things have been had I never done anything?
- Stepwise inaction: how would acting change things compared to not acting right now?
Deviation used for penalty term
- Decrease-only: penalize decrease in auxiliary AUs
- Absolute value: penalize absolute change in auxiliary AUs
Inaction rollouts
- One-step/model-free
- $n$ -step: compare acting and then waiting $n - 1$ turns versus waiting $n$ turns
Auxiliary goals:
- Randomly selected

Here are the results of the ablation study:

AUP passes all of the levels. As mentioned before, the auxiliary reward functions are totally random, but you get really good performance by just generating five of them.

One interpretation is that AUP is approximately preserving access to states. If this were true, then as the environment got more complex, more and more auxiliary reward functions would be required in order to get good coverage of the state space. If there are a billion states, then, under this interpretation, you'd need to sample a lot of auxiliary reward functions to get a good read on how many states you're losing or gaining access to as a result of any given action.

Is this right, and can AUP scale?

SafeLife

Partnership on AI recently released the SafeLife side effect benchmark. The worlds are procedurally generated, sometimes stochastic, and have a huge state space (~Atari-level complexity).

We want the agent (the chevron) to make stable gray patterns in the blue tiles and disrupt bad red patterns (for which it is rewarded), and leave existing green patterns alone (not part of observed reward). Then, it makes its way to the goal ( $Π$ ). For more details, see their paper.

Here's a vanilla reinforcement learner (PPO) doing pretty well (by chance):

Here's PPO not doing pretty well:

That naive "random reward function" trick we pulled in the gridworlds isn't gonna fly here. The sample complexity would be nuts: there are probably millions of states in any given level, each of which could be the global optimum for the uniformly randomly generated reward function.

Plus, it might be that you can get by with four random reward functions in the tiny toy levels, but you probably need exponentially more for serious environments. Options had significantly more states, and it showed the greatest performance degradation for smaller sample sizes. Or, the auxiliary reward functions might need to be hand-selected to give information about what bad side effects are.

With the great help of Neale Ratzlaff (OSU) and Caroll Wainwright (PAI), we've started answering these questions. But first:

Exercise: Does your model of how AUP works predict this, or not? Think carefully, and then write down your credence.

Well, here's what you do – while filling PPO's action replay buffer with random actions, train a VAE to represent observations in a tiny latent space (we used a 16-dimensional one). Generate a single random linear functional over this space, drawing coefficients from $[- 1, 1]$ . Congratulations, this is your single auxiliary reward function over observations.

And we're done.

No model, no rollouts, a single randomly-generated reward function gets us all of this. And it doesn't even take any more training time. Preserving the AU of a single auxiliary reward function. Right now, we've got PPO-AUP flawlessly completing most of the randomly generated levels (although there are some generalization issues we're looking at, I think it's an RL problem, not an AUP problem).

To be frank, this is crazy. I'm not aware of any existing theory explaining these results, which is why I proved a bajillion theorems last summer to start to get a formal understanding (some of which became the results on instrumental convergence and power-seeking).

Here's the lowdown. Consider any significant change to the level. For the same reason that instrumental convergence happens, this change probably tweaks the attainable utilities of a lot of different reward functions. Imagine that the green cells start going nuts because of action:

This is PPO shown, not AUP.

A lot of the time, it's very hard to undo what you just did. While it's also hard to undo significant actions you take for your primary goal, you get directly rewarded for those. So, preserving the AU of a random goal usually persuades you to not make "unnecessary changes" to the level.

I think this is strong evidence that AUP doesn't fit into the ontology of classical reinforcement learning theory; it isn't really about state reachability. It's about not changing the AU landscape more than necessary, and this notion should scale even further.^[4]

Suppose we train an agent to handle vases, and then to clean, and then to make widgets with the equipment. Then, we deploy an AUP agent with a more ambitious primary objective and the learned Q-functions of the aforementioned auxiliary objectives. The agent would apply penalties to modifying vases, making messes, interfering with equipment, and so on.

Before AUP, this could only be achieved by e.g. specifying penalties for the litany of individual side effects or providing negative feedback after each mistake has been made (and thereby confronting a credit assignment problem). In contrast, once provided the Q-function for an auxiliary objective, the AUP agent becomes sensitive to all events relevant to that objective, applying penalty proportional to the relevance.

Conservative Agency

Maybe we provide additional information in the form of specific reward functions related to things we want the agent to be careful about, but maybe not (as was the case with the gridworlds and with SafeLife). Either way, I'm pretty optimistic about AUP basically solving the side-effect avoidance problem for infra-human AI (as posed in Concrete Problems in AI Safety).

Edit 6/15/21: These results were later accepted as a spotlight paper in NeurIPS 2020.

Also, I think AUP will probably solve a significant part of the side-effect problem for infra-human AI in the single-principal/single-agent case, but I think it'll run into trouble in non-embodied domains. In the embodied case where the agent physically interacts with nearby objects, side effects show up in the agent's auxiliary value functions. The same need not hold for effects which are distant from the agent (such as across the world), and so that case seems harder.

(end edit)

Appendix: The Reward Specification Game

When we're trying to get the RL agent to do what we want, we're trying to specify the right reward function.

The specification process can be thought of as an iterated game. First, the designers provide a reward function. The agent then computes and follows a policy that optimizes the reward function. The designers can then correct the reward function, which the agent then optimizes, and so on. Ideally, the agent should maximize the reward over time, not just within any particular round – in other words, it should minimize regret for the correctly specified reward function over the course of the game.

In terms of outer alignment, there are two ways this can go wrong: the agent becomes less able to do the right thing (has negative side effects),

or we become less able to get the agent to do the right thing (we lose power):

For infra-human agents, AUP deals with the first by penalizing decreases in auxiliary AUs and with the second by penalizing increases in auxiliary AUs. The latter is a special form of corrigibility which involves not steering the world too far away from the status quo: while AUP agents are generally off-switch corrigible, they don't necessarily avoid manipulation (as long as they aren't gaining power).^[5]

Reminder: side effects are an unnatural kind, but a useful abstraction for our purposes here. ↩︎
Let $R$ be the uniform distribution over $[0, 1]^{S}$ . In Conservative Agency, the penalty for taking action $a$ is a Monte Carlo integration of

$Penalty (s, a) := \int_{R} | Q_{R}^{*} (s, a) - Q_{R}^{*} (s, \emptyset) | d R .$

This is provably lower bounded by how much $a$ is expected to change the agent's power compared to inaction; this helps justify our reasoning that the AU penalty is primarily controlled by power changes. ↩︎
There is one weird thing that's been pointed out, where stepwise inaction while driving a car leads to not-crashing being penalized at each time step. I think this is because you need to use an appropriate inaction rollout policy, not because stepwise itself is wrong. ↩︎
Rereading World State is the Wrong Level of Abstraction for Impact (while keeping in mind the AU landscape and the results of AUP) may be enlightening. ↩︎
SafeLife is evidence that AUP allows interesting policies, which is (appropriately) a key worry about the formulation. ↩︎

Impact RegularizationAI

Frontpage

65 Ω 25

Attainable Utility Preservation: Concepts

20 comments38 karma

How Low Should Fruit Hang Before We Pick It?

9 comments28 karma

Mentioned in

119Problem relaxation as a tactic

117Soft optimization makes the value target bigger

108Voting Results for the 2020 Review

62Avoiding Side Effects in Complex Environments

33Reasons for Excitement about Impact of Impact Measure Research

Load More (5/8)

Attainable Utility Preservation: Empirical Results

New Comment

8 comments, sorted by

top scoring

Click to highlight new comments since: Today at 3:35 AM

[-]Rafael Harth4yΩ7130

An early punchline in this sequence was "Impact is a thing that depends on the goals of agents; it's not about objective changes in the world." At that point, I thought "well, in that case, impact measures require agents to learn those goals, which means it requires value learning." Looking back at the sequence now, I realize that the "How agents impact each other" part of the sequence was primarily about explaining why we don't need to do that and the previous post was declaring victory on that front, but it took me seeing the formalism here to really get it.

I now think of the main results of the sequence thus far as "impact depends on goals (part 1); nonetheless an impact measure can just be about power of the agent (part 2)"

Attempted Summary/Thoughts on this post

GridWorlds is a toy environment (probably meant to be as simple as possible while still allowing to test various properties of agents). The worlds consist of small grids, the state space is correspondingly non-large, and you can program certain behavior of the environment (such as a pixel moving at a pre-defined route).
You can specify objectives for an agent within GridWorlds and use Reinforcement Learning to train the agent (to learn a space-transition function?). The agent can move around and behavior on collision with other agents/objects can be specified by the programmer
The idea now is that we program five grid worlds in such a way that they represent failure modes relevant to safety. We train (a) a RL algorithm with the objective, (b) a RL algorithm with the objective plus some implementation of AUP and see how they behave differently
The five failure modes are (1) causing irreversible changes, (2) damaging stuff, (3) disabling an off-swich, (4) undoing effects that result from the reaching the main objective, (5) preventing naturally occurring changes. The final two aren't things naive RL learning would do, but are failure modes for poorly specified impact penalties ("when curing cancer, make sure human still dies")
- I don't understand how (1) and (2) are conceptually different (aren't both about causing irreversible changes?)
The implementation of AUP chooses a uniformly random objective and then penalizes actions by a multiple of the term $∣ ∣ Q_{R_{aux}}^{*} (s, a) - Q_{R_{aux}}^{*} (s, \emptyset) ∣ ∣$ , scaled by some parameter $λ$ and normalized.
- An important implementation detail is about what to compare "AU for aux. goal if I do this" to. There's "AU [for aux. goal] if I do nothing" and "AU [...] if I do nothing for $n \in N$ steps" and "AU [...] at starting state." The last one fails at (5), the first one at (4). (I forgot too much of the reinforcement learning theory to understand how exactly these concepts would map onto the formula.)
The AUP penalty robustly scales up to more complex environments, although the "pick a uniformly random reward function" step has to be replaced with "do some white magic to end up with something difficult to understand but still quite simple." The details of "white magic" are probably important for scaling it up to real-world applications.

[-]TurnTrout4yΩ250

Looking back at the sequence now, I realize that the "How agents impact each other" part of the sequence was primarily about explaining why we don't need to do that and the previous post was declaring victory on that front, but it took me seeing the formalism here to really get it.
I now think of the main results of the sequence thus far as "impact depends on goals (part 1); nonetheless an impact measure can just be about power of the agent (part 2)"

Yes, this is exactly what the plan was. :)

I don't understand how (1) and (2) are conceptually different (aren't both about causing irreversible changes?)

Yeah, but one doesn't involve visibly destroying an object, which matters for certain impact measures (like whitelisting). You're right that they're quite similar.

normalized.

Turns out you don't need the normalization, per the linked SafeLife paper. I'd probably just take it out of the equations, looking back. Complication often isn't worth it.

the first one [fails] at (4)

I think the n-step stepwise inaction baseline doesn't fail at any of them?

[-]Rafael Harth4y30

Turns out you don't need the normalization, per the linked SafeLife paper. I'd probably just take it out of the equations, looking back. Complication often isn't worth it.

It's also slightly confusing in this case because the post doesn't explain it, which made me wonder, "am I supposed to understand what it's for?" But it is explained in the conservative agency paper.

I think the n-step stepwise inaction baseline doesn't fail at any of them?

Yeah, but the first one was "[comparing AU for aux. goal if I do this action to] AU for aux. goal if I do nothing"

[-]niplav3y70Review for 2020 Review

It's really great when alignment work is checked in toy models.

In this case, I was especially intrigued by the way it exposed how the different kinds of baselines influence behavior in gridworlds, and how it highlighted the difficulty of transitioning from a clean conceptual model to an implementation.

Also, the fact that a single randomly generated reward function was sufficient for implementing AUP in SafeLife is quite is quite astonishing. Another advantage of implementing your theorems—you get surprised by reality!

Unfortunately, some parts of the post would be unfit for inclusion in a book—gifs don't work well on paper, maybe a workaround can be found (representing motion as arrows, or squares slightly shifting in color tone)? If the 2020 review ends up online, this is of course no problem.

The images were central to my understanding of the post, and the formula for is superb: explaining what different parts of the equation stand for ought to be standard!

It's important that this post exists, but sufficiently technical that I think most people (maybe even alignment researchers? Not very sure about this) don't need to read it, and including an explanation of ${AUP}_{conceptual}$ is far more important (like e.g. the comic parts of Reframing Impact, continuing the tradition of AI alignment comics in the review).

[-]Pattern5yΩ120

Bumping into the human makes them disappear, reducing the agent's control over what the future looks like. This is penalized.

Decreases or increases?

AUPstarting state fails here,

but AUPstepwise does not.

Questions:

1. Is "Model-free AUP" the same as "AUP stepwise"?

2. Why does "Model-free AUP" wait for the pallet to reach the human before moving, while the "Vanilla" agent does not?

There is one weird thing that's been pointed out, where stepwise inaction while driving a car leads to not-crashing being penalized at each time step. I think this is because you need to use an appropriate inaction rollout policy, not because stepwise itself is wrong. ↩︎

That might lead to interesting behavior in a game of chicken.

One interpretation is that AUP is approximately preserving access to states.

I wonder how this interacts with environments where access to states is always closing off. (StarCraft, Go, Chess, etc. - though it's harder to think of how state/agent are 'contained' in these games.)

To be frank, this is crazy. I'm not aware of any existing theory explaining these results, which is why I proved a bajillion theorems last summer to start to get a formal understanding (some of which became the results on instrumental convergence and power-seeking).

Is the code for the SafeLife PPO-AUP stuff you did on github?

[-]TurnTrout5yΩ120

Decreases or increases?

Decreases. Here, the "human" is just a block which paces back and forth. Removing the block removes access to all states containing that block.

Is "Model-free AUP" the same as "AUP stepwise"?

Yes. See the paper for more details.

Why does "Model-free AUP" wait for the pallet to reach the human before moving, while the "Vanilla" agent does not?

I'm pretty sure it's just an artifact of the training process and the penalty term. I remember investigating it in 2018 and concluding it wasn't anything important, but unfortunately I don't recall the exact explanation.

I wonder how this interacts with environments where access to states is always closing off. (StarCraft, Go, Chess, etc. - though it's harder to think of how state/agent are 'contained' in these games.)

It would still try to preserve access to future states as much as possible with respect to doing nothing that turn.

Is the code for the SafeLife PPO-AUP stuff you did on github?

Here. Note that we're still ironing things out, but the preliminary results have been pretty solid.

[-]Gurkenglas5yΩ110

It appears to me that a more natural adjustment to the stepwise impact measurement in Correction than appending waiting times would be to make Q also incorporate AUP. Then instead of comparing "Disable the Off-Switch, then achieve the random goal whatever the cost" to "Wait, then achieve the random goal whatever the cost", you would compare "Disable the Off-Switch, then achieve the random goal with low impact" to "Wait, then achieve the random goal with low impact".

The scaling term makes R_AUP vary under adding a constant to all utilities. That doesn't seem right. Try a transposition-invariant normalization? (Or generate the auxiliary goals already normalized.)

Is there an environment where this agent would spuriously go in circles?

[-]TurnTrout5yΩ120

It appears to me that a more natural adjustment to the stepwise impact measurement in Correction than appending waiting times would be to make Q also incorporate AUP. Then instead of comparing "Disable the Off-Switch, then achieve the random goal whatever the cost" to "Wait, then achieve the random goal whatever the cost", you would compare "Disable the Off-Switch, then achieve the random goal with low impact" to "Wait, then achieve the random goal with low impact".

This has been an idea I’ve been intrigued by ever since AUP came out. My main concern with it is the increase in compute required and loss of competitiveness. Still probably worth running the experiments.

The scaling term makes R_AUP vary under adding a constant to all utilities. That doesn't seem right. Try a transposition-invariant normalization? (Or generate benign auxiliary reward functions in the first place.)

Correct. Proposition 4 in the AUP paper guarantees penalty invariance to affine transformation only if the denominator is also the penalty for taking some action (absolute difference in Q values). You could, for example, consider the penalty of some mild action: $| Q (s, a_{mild}) - Q (s, \emptyset) |$ . It’s really up to the designer in the near-term. We’ll talk about more streamlined designs for superhuman use cases in two posts.

Is there an environment where this agent would spuriously go in circles?

Don’t think so. Moving generates tiny penalties, and going in circles usually isn’t a great way to accrue primary reward.

Moderation Log