








Appendix: No free impact
What if we want the agent to single-handedly ensure the future is stable and aligned with our values? AUP probably won’t allow policies which actually accomplish this goal – one needs power to e.g. nip unaligned superintelligences in the bud. AUP aims to prevent catastrophes by stopping bad agents from gaining power to do bad things, but it symmetrically impedes otherwise-good agents.
This doesn’t mean we can’t get useful work out of agents – there are important asymmetries provided by both the main reward function and AU landscape counterfactuals.
First, even though we can’t specify an aligned reward function, the provided reward function still gives the agent useful information about what we want. If we need paperclips, then a paperclip-AUP agent prefers policies which make some paperclips. Simple.
Second, if we don’t like what it’s beginning to do, we can shut it off (because it hasn’t gained power over us). Therefore, it has “approval incentives” which bias it towards AU landscapes in which its power hasn’t decreased too much, either.
So we can hope to build a non-catastrophic AUP agent and get useful work out of it. We just can’t directly ask it to solve all of our problems: it doesn’t make much sense to speak of a “low-impact singleton”.
Notes
- To emphasize: when I say "AUP agents do [X]" in this post, I mean that AUP agents correctly implementing the concept of AUP tend to behave in a certain way.
- As pointed out by Daniel Filan, AUP suggests that one might work better in groups by ensuring one's actions preserve teammates' AUs.
For reference and ease of quoting, this comment is a text only version of the post above. (It starts at "Text:" below.) I am not the OP.
Formatting:
It's not clear how to duplicate the color effect* or cross words out**, so that hasn't been done. Instead, crossed-out words are followed by "? (No.)", and here's a list of some words by color to refresh the color/concept relation:
Blue words:
Power/impact/penalty/importance/respect/conservative/catastrophic/distance measure/impact measurement
Purple words:
incentives/actions/(reward)/expected utility/complicated human value/tasks
Text:
Last time on reframing impact:
(CCC)
Catastrophic Convergence Conjecture:
Unaligned goals tend to have catastrophe-inducing optimal policies because of power-seeking incentives.
If the CCC is right, then if power gain is disincentivised, the agent isn't incentivised to overfit and disrupt our AU landscape.
Without even knowing who we are or what we want, the agent's actions preserve our attainable utilities.
We can tell it:
Make paperclips
or
Put that strawberry on the plate
or
Paint the car pink
...
but don't gain power.
This approach is called Attainable Utility Preservation (AUP).
We're focusing on concepts in this post. For now, imagine an agent receiving a reward for a primary task, minus a scaled penalty for how much its actions change its power (in the intuitive sense). This is AUP_conceptual, not any formalization you may be familiar with.
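The scheme above can be sketched in a few lines. This is only an illustrative toy, not any actual AUP formalization: the `power` values and the penalty coefficient `lam` are hypothetical placeholders standing in for "how able the agent is to achieve a range of goals" and "how strongly we penalize power changes."

```python
def aup_reward(primary_reward, power_before, power_after, lam=1.0):
    """Conceptual AUP reward: task reward minus a scaled penalty for
    how much the action changed the agent's power.

    `power_before`/`power_after` stand in for some intuitive measure of
    the agent's ability to pursue goals in general; `lam` scales the
    penalty. Both are illustrative assumptions, not a real formalism.
    """
    return primary_reward - lam * abs(power_after - power_before)

# An action that earns task reward without changing the agent's power
# is not penalized at all:
assert aup_reward(1.0, power_before=5.0, power_after=5.0) == 1.0

# An action that greatly increases power eats a large penalty, so the
# agent prefers modest, narrow task-completing actions:
assert aup_reward(1.0, power_before=5.0, power_after=50.0, lam=0.1) == -3.5
```

Note that the penalty is on the *change* in the agent's own power, not on changes to the external world state; that asymmetry is what the rest of the post leans on.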
What might a paperclip-manufacturing AUP_conceptual agent do?
Build lots of factories? (No.)
Copy itself? (No.)
Nothing? (No.)
Narrowly improve paperclip production efficiency <- This is the kind of policy AUP_conceptual is designed to encourage and allow. We don't know if this is the optimal policy, but by CCC, the optimal policy won't be catastrophic.
AUP_conceptual dissolves thorny issues in impact measurement.
Is the agent's ontology reasonable?
Who cares.
Instead of regulating its complex physical effects on the outside world,
the agent is looking inwards at itself and its own abilities.
How do we ensure the impact penalty isn't dominated by distant state changes?
Imagine I take over a bunch of forever inaccessible stars and jumble them up. This is a huge change in state, but it doesn't matter to us.
AUP_conceptual solves this "locality" problem by regularizing the agent's impact on the nearby AU landscape.
What about butterfly effects?
How can the agent possibly determine which effects it's responsible for?
Forget about it.
AUP_conceptual agents are respectful and conservative with respect to the local AU landscape without needing to assume anything about its structure or the agents in it.
How can an idea go wrong?
There can be a gap between what we want and the concept, and then a gap between the concept and the execution.
For past impact measures, it's not clear that their conceptual thrusts are well-aimed, even if we could formalize everything correctly. Past approaches focus either on minimizing physical change to some aspect of the world, or on maintaining the ability to reach many world states.
The hope is that in order for the agent to cause a large impact on us it has to snap a tripwire.
The problem is... well, it's not clear how we could possibly know whether the agent can still find a catastrophic policy; in a sense, the agent is still trying to sneak by the restrictions and gain power over us. An agent maximizing expected utility while minimally changing the world still probably leads to catastrophe.
That doesn't seem to be the case for AUP_conceptual.
Assuming the CCC, an agent which doesn't gain much power doesn't cause catastrophes. This has no dependency on complicated human value, and most realistic tasks should have reasonable, high-reward policies which don't gain undue power.
So AUP_conceptual meets our desiderata:
The distance measure should:
1) Be easy to specify.
2) Put catastrophes far away.
3) Put reasonable plans nearby.
Therefore, I consider AUP to conceptually be a solution to impact measurement.
Wait! Let's not get ahead of ourselves! I don't think we've fully bridged the concept/execution gap.
However, for AUP, it seems possible - more on that later.
Thanks for doing this. I was originally going to keep a text version of the whole sequence, but I ended up making lots of final edits in the images, and this sequence has already taken an incredible amount of time on my part.