Attainable Utility Preservation: Scaling to Superhuman

Summary for the Alignment Newsletter:

The final section of the sequence turns to an actual implementation of AUP, and deals with problems in how the implementation deviates from the conceptual version of AUP. We measure power by considering an set of auxiliary rewards, and measuring the change in attainable utilities of this auxiliary set as impact, and penalizing the agent for that. The first post presents some empirical results, many of which <@we've covered before@>(@Penalizing Impact via Attainable Utility Preservation@), but I wanted to note the new results on SafeLife. On the high-dimensional world of SafeLife, the authors train a VAE to find a good latent representation, and choose a single linear reward function on the latent representation as their auxiliary reward function: it turns out this is enough to avoid side effects in at least some case of SafeLife.

We then look at some improvements that can be made to the original AUP implementation. First, according to CCC, we only need to penalize _power_, not _impact_: as a result we can just penalize _increases_ in attainable utilities, rather than both increases and decreases as in the original version. Second, the auxiliary set of rewards only provides a _proxy_ for impact / power, which an optimal agent could game (for example, by creating subagents, summarized below). So instead, we can penalize increases in attainable utility for the _primary_ goal, rather than using auxiliary rewards. There are some other improvements that I won't go into here.

Opinion:

I think the plan "ensure that the AI systems we build don't seek power" is pretty reasonable and plausibly will be an important part of AI alignment. However, the implementation of AUP is trying to do this under the threat model of optimal agents with potentially unaligned primary goals. I think this is probably going to do something quite different from the conceptual version of AUP, because impact (as defined in this sequence) occurs only when the agent's beliefs _change_, which doesn't happen for optimal agents in deterministic environments. The current implementation of AUP tries to get around this using proxies for power (but these can be gamed) or by defining "dumber" beliefs that power is measured relative to (but this fails to leverage the AI system's understanding of the world). See this comment for more details.

Note that the author himself is more excited about AUP as deconfusion, rather than as a solution to AI alignment, though he is more optimistic about the implementation of AUP than I am.

[-]TurnTrout6yΩ130

I think this is probably going to do something quite different from the conceptual version of AUP, because impact (as defined in this sequence) occurs only when the agent's beliefs change, which doesn't happen for optimal agents in deterministic environments. The current implementation of AUP tries to get around this using proxies for power (but these can be gamed) or by defining "dumber" beliefs that power is measured relative to (but this fails to leverage the AI system's understanding of the world).

For the benefit of future readers, I replied to this in the newsletter's comments.

[-]Charlie Steiner6yΩ460

My very general concern is that strategies that maximize $R_{A U P}$ might be very... let's say creative, and your claims are mostly relying on intuitive arguments for why those strategies won't be bad for humans.

I don't really buy the claim that if you've been able to patch each specific problem, we'll soon reach a version with no problems - the exact same inductive argument you mention suggests that there will just be a series of problems, and patches, and then more problems with the patched version. Again, I worry that patches are based a lot on intuition.

For example, in the latest version, because you're essentially dividing out by the long-term reward of taking the best action now, if the best action now is really really good, then it becomes cheap to take moderately good actions that still increase future reward - which means the agent is incentivized to concentrate the power of actions into specific timsteps. For example, an agent might be able to set things up so that it can sacrifice its ability to achieve total future reward of $10^{10}$ to make it cheap to take an action that increases its future reward by $10^{8}$ . This might looks like sacrificing the ability to colonize distant galaxies in order to gain total control over the Milky Way.

[-]TurnTrout6yΩ340

Again, I worry that patches are based a lot on intuition.

If you want your math to abstractly describe reality in a meaningful sense, intuition has to enter somewhere (usually in how you formally define and operationalize the problem of interest). Therefore, I’m interpreting this as “I don’t see good principled intuitions behind the improvements”; please let me know if this is not what you meant.

I claim that, excepting the choice of denominator, all of the improvements follow directly from AUP $_{conceptual}$ (and actually, eq. 1 was the equation with arbitrary choices wrt the AGI case; I started with that because that’s how my published work formalizes the problem).

CCC says catastrophes are caused by power seeking behavior from the agent. Agents are only incentivized to pursue power in order to better achieve their own goals. Therefore, the correct equation should look something like “do your primary goal but be penalized for becoming more able to achieve your primary goal”. In this light, penalizing $R$ -AU is obviously better than using an auxiliary goal, penalizing decreases is obviously irrelevant, and penalizing immediate reward advantage is obviously irrelevant.

The denominator, on the other hand, is indeed the product of meditating on “What kind of elegant rescaling keeps making sense in all sorts of different situations, but also can’t be gamed to arbitrarily decrease the penalty?”.

[-]Charlie Steiner6yΩ240

Right. Some intuition is necessary. But a lot of these choices are ad hoc, by which I mean they aren't strongly constrained by the result you want from them.

For example, you have a linear penalty governed by this parameter lambda, but in principle it could have been any old function - the only strong constraint is that you want it to monotonically increase from a finite number to infinity. Now, maybe this is fine, or maybe not. But I basically don't have much trust for meditation in this sort of case, and would rather see explicit constraints that rule out more of the available space.

[-]TurnTrout6y*Ω240

I basically don't have much trust for meditation in this sort of case

I’m not asking you to trust in anything, which is why I emphasized that I want people to think more carefully about these choices. I do not think eq. 5 is AGI-safe. I do not think you should put it in an AGI. Do I think there’s a chance it might work? Yes. But we don’t work with “chances”, so it’s not ready.

Anyways, if theorem 11 of the low-hanging fruit post is met, the tradeoff penalty works fine. I also formally explored the hard constraint case and discussed a few reasons why the tradeoff is preferable to the hard constraint. Therefore, I think that particular design choice is reasonably determined. Would you want to think about this more before actually running an AGI with that choice? Of course.

To your broader point, I think there may be another implicit frame difference here. I’m talking about the diff of the progress, considering questions like “are we making a lot of progress? What’s the marginal benefit of more research like this? Are we getting good philosophical returns from this line of work?”, to which I think the answer is yes.

On the other hand, you might be asking “are we there yet?”, and I think the answer to that is no. Notice how these answers don’t contradict each other.

From the first frame, being skeptical because each part of the equation isn’t fully determined seems like an unreasonable demand for rigor. I wrote this sequence because it seemed that my original AUP post was pedagogically bad (I was already thinking about concepts like “overfitting the AU landscape” back in August 2018) and so very few people understood what I was arguing.

I’d like to think that my interpretive labor has paid off: AUP isn’t a slapdash mixture of constraints which is too complicated to be obviously broken, it’s attempting to directly disincentive catastrophes based off of straightforward philosophical reasoning, relying on assumptions and conjectures which I’ve clearly stated. In many cases, I waited weeks so I could formalize my reasoning in the context of MDPs (e.g. why should you think of the AU landscape as a ‘dual’ to the world state? Because I proved it).

There’s always another spot where I could make my claims more rigorous, where I could gather just a bit more evidence. But at some point I have to actually put the posts up, and I think I’ve provided some pretty good evidence in this sequence.

From the second frame, being skeptical because each part of the equation isn’t fully determined is entirely appropriate and something I encourage.

I think you’re writing from something closer to the second frame, but I don’t know for sure. For my part, this sequence has been arguing from the first frame: “towards a new impact measure”, and that’s why I’ve been providing pushback.

[-]TurnTrout6y*Ω120

My very general concern is that strategies that maximize AUP reward might be very... let's say creative, and your claims are mostly relying on intuitive arguments for why those strategies won't be bad for humans.

My argument hinges on CCC being true. If CCC is true, and if we can actually penalize the agent for accumulating power, then if the agent doesn’t want to accumulate power, it’s not incentivized to screw us over. I feel like this is a pretty good intuitive argument, and it’s one I dedicated the first two-thirds of the sequence to explaining. You’re right that it’s intuitive, of course.

I guess our broader disagreement may be “what would an actual solution for impact measurement have going for it at this moment in time?”, and it’s not clear that I’d expect to have formal arguments to this effect / I don’t know how to meet this demand for rigor.

[ETA: I should note that some of my most fruitful work over the last year came from formalizing some of my claims. People were skeptical that slowly decreasing the penalty aggressiveness would work, so I hashed out the math in How Low Should Fruit Hang Before We Pick It?. People were uneasy that the original AUP design relied on instrumental convergence being a thing (eq. 5 doesn’t make that assumption) when maybe it actually isn’t. So I formalized instrumental convergence in Seeking Power is Instrumentally Convergent in MDPs and proved when it exists to at least some extent.

There’s probably more work to be done like this.]

I don't really buy the claim that if you've been able to patch each specific problem, we'll soon reach a version with no problems - the exact same inductive argument you mention suggests that there will just be a series of problems, and patches, and then more problems with the patched version. Again, I worry that patches are based a lot on intuition.

The claim is dually resting on “we know conceptually how to solve impact measurement / what we want to implement, and it’s a simple and natural idea, so it’s plausible there’s a clean implementation of it”. I think learning “no, there isn’t a clean way to penalize the agent for becoming more able to achieve its own goal” would be quite surprising, but not implausible – I in fact think there’s a significant chance Stuart is right. More on that next post.

Also, you could argue against any approach to AI alignment by pointing out that there are still things to improve and fix, or that there were problems pointed out in the past which were fixed, but now people have found a few more problems. The thing that makes me think the patches might not be endless here is that, as I’ve argued earlier, I think AUP is conceptually correct.

This might looks like sacrificing the ability to colonize distant galaxies in order to gain total control over the Milky Way.

It all depends whether we can get a buffer between catastrophes and reasonable plans here (reasonable plans show up for much less aggressive settings of $λ$ ) and I think we can. Now, this particular problem (with huge reward) might not show up because we can bound the reward [0,1], and I generally think there exist reasonable plans where the agent gets at least 20% or so of its maximal return (suppose it thinks there’s a 40% chance we let it get 95% of its maximal per-timestep reward each timestep in exchange for it doing what we want).

[ETA: actually, if the "reasonable" reward is really, really low in expectation, it's not clear what happens. this might happen if catastrophe befalls us by default.]

You’re right we should inspect the equation for weird incentives, but to a limited extent, this is also something we can test experimentally. We don’t necessarily have to rely on intuition in all cases.

The hope is we can get to a formula that’s simple enough such that all of its incentives are thoroughly understood. I think you’ll agree eq. 5 is far better in this respect than the original AUP formulation!

[-]Stuart_Armstrong6yΩ350

More seriously, the reason I'm sceptical of impact measures is because it feels that they all fail for the same reason. Unfortunately, I can't articulate that reason; it's the result of a long history of trying to build impact measures and trying to break them. I just have a feel for where the weaknesses are. So I knew that subagents would be a problem for AUP, long before I could articulate it formally.

But, as I said, I unfortunately can't formalise this feeling; it remains personal.

For this example, it was harder than usual to come up with a counter-example. And I was surprised that half of AUP survived fine - I would not have expected that a restriction against lowering your power would be unhackable. So consider these mild positives for your approach.

But my instinctive feeling remains: I hope that AUP can be made to work for superintelligences, but I expect that it won't :-(

[-]Logan Riggs6y10

I expect AUP to fail in embedded agency problems (which I interpret the subagent problem to be included). Do you expect it to fail in other areas?

[-]Stuart_Armstrong6y30

Do you expect it to fail in other areas?

Yes. Subagent problems are not cleanly separated from other problems (see section 3.4 of https://www.lesswrong.com/posts/mdQEraEZQLg7jtozn/subagents-and-impact-measures-full-and-fully-illustrated , where the subagent is replaced with a rock). The impact penalty encourages the agent to put restrictions on their own future possible actions. Doing this through a subagent is one way, but there are many others (see Odysseus and the sirens, or section 6.2 of the post above in this comment).

[-]Logan Riggs6y10

Thanks for the link (and the excellent write-up of the problem)!

Regarding the setting, how would the agent gain the ability to create a sub-agent, roll a rock, or limit it's own abilities initially? Throughout AUP, you normally start with a high penalty for acquiring power, and then you scale it down to reach reasonable, non-catastrophic plans, but your post begins with having higher power.

I don't think AUP prevents abuse of power you have currently have (?), but prevents gaining that power in the first place.

[-]Stuart_Armstrong6y30

The AUP is supposed to prevent the agent accumulating power. The AI initially has huge potential power (because its potential power is all the power it could ever accumulate, given its best strategy to accumulate power) and the penalty is supposed to prevent it turning that potential into actual power - as measured by AUP.

So the AI always has the power to build a subagent; that post just shows that it can do this without triggering the AUP-power penalty.

[-]Stuart_Armstrong6yΩ350

But if you look back at the history of impact measures and AUP, you’ll see lots of skulls; people say “this problem dooms AUP”, and I say “I think we’re talking about conceptually different things and that you’re a little overconfident; probably just a design choice issue”. It then ends up being a solvable design choice issue. So by Laplace’s Rule of Succession, I’d be surprised if this were The Insurmountable Problem That Dooms AUP.

But if you look back at the history of impact measures, you'll see a lot of skulls; people say "this impact measure will work", and I say "I think you're overconfident; I'm sure I can find a way it fails". It then ends up being a solvable "break the impact measure" problem. So by Laplace’s Rule of Succession, I’d be surprised if this approach were The Ultimate Impact Measure That Works.

;-)

[-]Charlie Steiner6yΩ240

Jinx :P

[-]Rafael Harth5yΩ240

(This sequence inspired me to re-read Reinforcement Learning: An Introduction, hence the break.)

I realize that impact measures always lead to a tradeoff between safety and performance competitiveness. But setting $R_{AUX} := R$ seems to sacrifice quite a lot of performance. Is this real or am I missing something?

Namely, whenever there's an action $a$ which doesn't change the state and leads to 1 reward, and a sequence $a_{1}, . . ., a_{n}$ of actions such that $a_{n}$ has $m$ reward with $m > n$ (and all $a_{< n}$ have 0 reward), then it's conceivable that $R_{AUP-1}$ would choose the $(a_{i})$ sequence while $R_{AUP-5}$ would just stubbornly repeat $a$ , even if the $(a_{i})_{1 \leq 1 \leq n}$ represent something very tailored to $R$ that doesn't involve obtaining a lot of resources. In other words, it seems to penalize reasonable long-term thinking more than the formulas where $R_{AUX} \neq R$ . This feels like a rather big deal since we arguably want an agent to think long-term as long as it doesn't involve gaining power. I guess the scaling step might help here?

Separately and very speculatively, I'm wondering whether the open problem of the AUP-agent tricking the penalty by restricting its future behavior is actually a symptom of the non-embedded agency model. The decision to make such a hack should come with a vast increase in AU for its primary goal, but it wouldn't be caught by your penalty since it's about an internal change. If so, that might be a sign that it'll be difficult to fix. More generally, if you don't consider internal changes in principle, what stops a really powerful agent from reprogramming itself to slip through your penalty?

[-]TurnTrout5yΩ240

I realize that impact measures always lead to a tradeoff between safety and performance competitiveness.

For optimal policies, yes. In practice, not always - in SafeLife, AUP often had ~50% improved performance on the original task, compared to just naive reward maximization with the same algorithm!

it seems to penalize reasonable long-term thinking more than the formulas where .

Yeah. I'm also pretty sympathetic to arguments by Rohin and others that the $R_{aux} = R$ variant isn't quite right in general; maybe there's a better way to formalize "do the thing without gaining power to do it" wrt the agent's own goal.

whether the open problem of the AUP-agent tricking the penalty by restricting its future behavior is actually a symptom of the non-embedded agency model.

I think this is plausible, yep. This is why I think it's somewhat more likely than not there's no clean way to solve this; however, I haven't even thought very hard about how to solve the problem yet.

More generally, if you don't consider internal changes in principle, what stops a really powerful agent from reprogramming itself to slip through your penalty?

Depends on how that shows up in the non-embedded formalization, if at all. If it doesn't show up, then the optimal policy won't be able to predict any benefit and won't do it. If it does... I don't know. It might. I'd need to think about it more, because I feel confused about how exactly that would work - what its model of itself is, exactly, and so on.

[-]Stuart_Armstrong4yΩ330

To see this, imagine the AUP agent builds a subagent to make for all future $s, a$ , in order to neutralize the penalty term. This means it can't make the penalty vanish without destroying its ability to better optimize its primary reward, as the (potentially catastrophically) powerful subagent makes sure the penalty term stays neutralized.

I believe this is incorrect. The $a$ and $\emptyset$ are the actions of the AUP agent. The subagent just needs to cripple the AUP agent so that all actions are equivalent, then go about maximising $R$ to the upmost.

[-][anonymous]6y*Ω130

I wonder what happens to the subagent problem with a random action as baseline: In the current sense, building a subagent roughly works by reaching a state $s_{t + 1}$ where

Q_{R} (s_{t + 1}, \emptyset) = V_{R} (π^{*}, s_{t + 1})

for all auxillary rewards $R$ , where $π^{*}$ is the optimal policy according to the main reward; while making sure that there exists an action $a_{R}$ such that

R (t) + γ Q_{R} (s_{t + 1}, a_{R}) \approx Q_{R} (s_{t}, \emptyset)

for every $R$ . So while building a subagent in that way is still feasible, the agent would be forced to either receive a large penalty or give the subagent random orders at $t + 1$ .

Probably, there is a way to circumvent this again, though? Also, I am unsure about the other properties of randomized baselines.

[-]TurnTrout6yΩ120

what do you mean by "for all $R$ "?

The random baseline is an idea I think about from time to time, but usually I don't dwell because it seems like the kind of clever idea that secretly goes wrong somehow? It depends whether the agent has any way of predicting what the random action will be at a future point in time.

if it can predict it, I'd imagine that it might find a way to gain a lot of power by selecting a state whose randomly selected action is near-optimal. because of the denominator, it would still be appropriately penalized for performing better than the randomly selected action, but it won't receive a penalty for choosing an action with expected optimal value just below the near-optimal action.

[-]Rohin Shah6yΩ340

It depends whether the agent has any way of predicting what the random action will be at a future point in time.

You don't have to literally sample a random action; you can just calculate the expected thing that would happen under a random policy. For example, you would replace $Q^{*} (s, ϕ)$ with $\frac{1}{| A |} A \sum i = 1 Q^{*} (s, a_{i})$ .

[-][anonymous]6y*10

For all auxillary rewards. Edited the original comment.

I agree that it is likely to go wrong somewhere, but it might still be useful to figure out why. If the agent is able to predict the randomness reliably in some cases, the random baseline does not seem to help with the subagent problem.

Edit: Randomization does not seem to help, as long as the actionset is large (as the agent can then arrange for most actions to make the subagent optimize the main reward).

[-]EliasHasle3y10

I kind of fell over board when V popped up in the equations without being introduced first.

By this reasoning, $V_{R_{AUP}}^{*} (s)$ can still increase up until the point of $V_{R}^{*} (s)$ . This doesn't jump out as a big deal to me, but I'm flagging this assumption anyways. ↩︎
A subagent might still be built by AUP $_{eq. 5}$ to stabilize minor AU fluctuations which cause additional penalty over the course of non-power-gaining plans. It seems like there are plenty of other ways to minimize fluctuation, so it's not clear why building an omnipotent subagent to perfectly restrict you accrues less penalty.

I do think we should think carefully about this, of course. The incentive to minimize AU fluctuations and generally commit to perpetual inaction ASAP is probably one of the main remaining problems with AUP $_{eq. 5}$ . ↩︎
As pointed out by Evan Hubinger, this is only safe if myopically optimizing $R$ is safe – we aren't penalizing single-step reward acquisition. ↩︎
This issue was originally pointed out by Ofer. ↩︎
The fact that Ofer’s/Stuart’s problem survived all of the other improvements is evidence that it’s harder. I just don’t think the evidence it provides is that strong. ↩︎

LESSWRONG
LW

LESSWRONG
LW

28

Attainable Utility Preservation: Scaling to Superhuman

28

Ω 14

28

Ω 14

Problems

Auxiliary loopholes

Reward advantage penalization

Scaling

Penalizing decreases?

Empirical sanity check

$\emptyset$ ?

Appendix: Remaining Problems

28

Attainable Utility Preservation: Scaling to Superhuman

28

Ω 14

28

Ω 14

Problems

Auxiliary loopholes

Reward advantage penalization

Scaling

Penalizing decreases?

Empirical sanity check

∅?

Appendix: Remaining Problems

$\emptyset$ ?