








Appendix: No free impact
What if we want the agent to single-handedly ensure the future is stable and aligned with our values? AUP probably won’t allow policies which actually accomplish this goal – one needs power to e.g. nip unaligned superintelligences in the bud. AUP aims to prevent catastrophes by stopping bad agents from gaining power to do bad things, but it symmetrically impedes otherwise-good agents.
This doesn’t mean we can’t get useful work out of agents – there are important asymmetries provided by both the main reward function and AU landscape counterfactuals.
First, even though we can’t specify an aligned reward function, the provided reward function still gives the agent useful information about what we want. If we need paperclips, then a paperclip-AUP agent prefers policies which make some paperclips. Simple.
Second, if we don’t like what it’s beginning to do, we can shut it off (because it hasn’t gained power over us). Therefore, it has “approval incentives” which bias it towards AU landscapes in which its power hasn’t decreased too much, either.
So we can hope to build a non-catastrophic AUP agent and get useful work out of it. We just can’t directly ask it to solve all of our problems: it doesn’t make much sense to speak of a “low-impact singleton”.
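As a toy illustration of this trade-off (a sketch under made-up assumptions, not the actual AUP formalism): suppose each candidate policy has a paperclip reward and a change in the agent's power relative to doing nothing, and the agent scores a policy as reward minus a weighted penalty on the absolute power change. The policy names, the numbers, and the `aup_score` and `best_policy` helpers are all hypothetical.

```python
# Toy illustration only: the policies, numbers, and penalty form are hypothetical.

def aup_score(reward, power_change, penalty_weight):
    """Primary reward minus a penalty on the absolute change in power.

    The absolute value penalizes both gaining power and losing it.
    """
    return reward - penalty_weight * abs(power_change)

# Hypothetical policies: (paperclip reward, change in power vs. doing nothing)
policies = {
    "do nothing": (0.0, 0.0),
    "make a few paperclips": (1.0, 0.2),
    "build a paperclip empire": (10.0, 50.0),
    "disable itself": (0.0, -5.0),
}

def best_policy(penalty_weight):
    """Return the name of the highest-scoring policy for a given penalty weight."""
    return max(policies, key=lambda name: aup_score(*policies[name], penalty_weight))

for weight in (0.0, 1.0, 10.0):
    print(weight, best_policy(weight))
```

With these made-up numbers, no penalty picks the empire, a moderate penalty picks the policy that makes a few paperclips, and a very large penalty picks doing nothing. The reward function still steers the agent toward useful work, but only among policies that leave power roughly unchanged.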
Notes
- To emphasize, when I say "AUP agents do X" in this post, I mean that AUP agents correctly implementing the concept of AUP tend to behave in a certain way.
- As pointed out by Daniel Filan, AUP suggests that one might work better in groups by ensuring one's actions preserve teammates' AUs.
(Definitely a possibility that this is answered later in the sequence)
Rereading the post and thinking about this, I wonder if AUP-based AIs can still do anything (which is what I think Steve was pointing at). Or maybe phrased differently, whether they can still be competitive.
Sure, reading a textbook doesn't decrease the AU of most other goals, but applying the learned knowledge might. In your paperclip example, I expect that the AUP-based AI will make very few paperclips; otherwise it could have a big impact (after all, we make paperclips in factories, and factories change the AU landscape).
More generally, AUP seems to forbid any kind of competence in a zero-sum-like situation. To go back to Steve's example, if the AI invents a great new solar cell, then it will make its owner richer and more powerful at the expense of other people, which is forbidden by AUP as I understand it.
Another way to phrase my objection is that at first glance, AUP seems to not only forbid gaining power for the AI, but also gaining power for the AI's user. Which sounds like a good thing, but might also create incentives to create and use non-AUP-based AIs. Does that make any sense, or did I fail to understand some part of the sequence that explains this?
(An interesting consequence of this if I'm right is that AUP-based AIs might be quite competitive for making open-source things, which is pretty cool).