Appendix: No free impact
What if we want the agent to single-handedly ensure the future is stable and aligned with our values? AUP probably won’t allow policies which actually accomplish this goal – one needs power to e.g. nip unaligned superintelligences in the bud. AUP aims to prevent catastrophes by stopping bad agents from gaining power to do bad things, but it symmetrically impedes otherwise-good agents.
This doesn’t mean we can’t get useful work out of agents – there are important asymmetries provided by both the main reward function and AU landscape counterfactuals.
First, even though we can’t specify an aligned reward function, the provided reward function still gives the agent useful information about what we want. If we need paperclips, then a paperclip-AUP agent prefers policies which make some paperclips. Simple.
Second, if we don’t like what it’s beginning to do, we can shut it off (because it hasn’t gained power over us). Therefore, it has “approval incentives” which bias it towards AU landscapes in which its power hasn’t decreased too much, either.
So we can hope to build a non-catastrophic AUP agent and get useful work out of it. We just can’t directly ask it to solve all of our problems: it doesn’t make much sense to speak of a “low-impact singleton”.
Notes
- To emphasize, when I say "AUP agents do X" in this post, I mean that agents correctly implementing the concept of AUP tend to behave in a certain way.
- As pointed out by Daniel Filan, AUP suggests that one might work better in groups by ensuring one's actions preserve teammates' AUs.
How, exactly, would it have a big impact? Do you expect making a few paperclip factories to have a large impact in real life? If not, why would idealized-AUP agents expect that?
I think that for many tasks, idealized-AUP agents would not be competitive. It seems like they'd still be competitive on tasks with more limited scope, like putting apples on plates, doing construction work, or (perhaps) answering questions.
I'm not sure what your model is here. In this post, impact regularization isn't a constrained optimization problem, but rather a tradeoff between power gain and the main objective. So it's not like AUP raps the agent's knuckles and wholly rules out plans involving even a bit of power gain. The agent computes something like (objective score) - c*(power gain), where c is some constant.
On rereading, I guess this post doesn't make that clear: this post assumes not only that we correctly implement the concepts behind AUP, but also that we slide along the penalty harshness spectrum until we get reasonable plans. It seems like we should hit reasonable plans before power-seeking is allowed, although this is another detail swept under the rug by the idealization.
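The tradeoff and the harshness sweep described above can be sketched like so (a toy illustration, not the actual AUP implementation; the plan names, scores, and power-gain numbers are all hypothetical):

```python
def aup_score(objective_score, power_gain, c):
    """Scalarized tradeoff: task reward minus a scaled power-gain penalty."""
    return objective_score - c * power_gain

# Hypothetical plans: (objective score, power gain).
plans = {"modest": (5.0, 0.5), "power-seeking": (9.0, 10.0)}

# Slide along the penalty-harshness spectrum: a mild penalty still favors
# the power-seeking plan, while harsher penalties select the modest one.
for c in (0.1, 1.0, 10.0):
    best = max(plans, key=lambda name: aup_score(*plans[name], c))
    print(f"c={c}: best plan is {best!r}")
```

With c = 0.1 the power-seeking plan scores 8.0 against the modest plan's 4.95, so it wins; at c = 1.0 and above, the modest plan wins. The hope expressed above is that some intermediate c yields reasonable plans before any power-seeking ones are allowed through.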
Idealized-AUP doesn't directly penalize gaining power for the user, no. Whether this is indirectly incentivized depends on the idealizations we make.
I think that impact measures levy a steep alignment tax, so yes, I think that there are competitive pressures to cut corners on impact allowances.