








Appendix: No free impact
What if we want the agent to single-handedly ensure the future is stable and aligned with our values? AUP probably won’t allow policies which actually accomplish this goal – one needs power to e.g. nip unaligned superintelligences in the bud. AUP aims to prevent catastrophes by stopping bad agents from gaining power to do bad things, but it symmetrically impedes otherwise-good agents.
This doesn’t mean we can’t get useful work out of agents – there are important asymmetries provided by both the main reward function and AU landscape counterfactuals.
First, even though we can’t specify an aligned reward function, the provided reward function still gives the agent useful information about what we want. If we need paperclips, then a paperclip-AUP agent prefers policies which make some paperclips. Simple.
Second, if we don’t like what it’s beginning to do, we can shut it off (because it hasn’t gained power over us). Therefore, it has “approval incentives” which bias it towards AU landscapes in which its power hasn’t decreased too much, either.
So we can hope to build a non-catastrophic AUP agent and get useful work out of it. We just can’t directly ask it to solve all of our problems: it doesn’t make much sense to speak of a “low-impact singleton”.
Notes
- To emphasize, when I say "AUP agents do " in this post, I mean that AUP agents correctly implementing the concept of AUP tend to behave in a certain way.
- As pointed out by Daniel Filan, AUP suggests that one might work better in groups by ensuring one's actions preserve teammates' AUs.
I liked this post, and look forward to the next one.
More specific, and critical commentary (It seems it is easier to notice surprise than agreement):
(With embedded footnotes)
1.
(The CCC didn't make reference to overfitting.)
Premise:
If A is true then B will be true.
Conclusion:
If A is false B will be false.
The conclusion doesn't follow from the premise.
2.
Note that preserving our attainable utilities isn't a good thing, it's just not a bad thing.
Issues: Attainable utilities indefinitely 'preserved' are wasted.
Possible issues: If an AI just happened to discovered a cure for cancer, we'd probably want to know the cure. But if an AI didn't know what we wanted, and just focused on preserving utility*, then (perhaps as a side effect of considering both that we might want to know the cure, and might not want to know the cure) it might not tell us because that preserves utility. (The AI might operate on a framework that distinguishes between action and inaction, in a way that means it doesn't do thing that might be bad, at the cost of not doing things that might be good.)
*If we are going to calculate something and a reliable source (which has already done the calculation) tells us the result, we can save on energy (and preserve resources that can be converted into utility) by not doing the calculation. In theory this could include not only arithmetic, but simulations of different drugs or cancer treatments to come up with better options.
3.
Is this a metaphor for making an 'agent' with that goal, or actually creating an agent that we can give different commands to and switch out/modify/add to its goals? (Why ask it to 'make paperclips' if that's dangerous, when we can ask it to 'make 100 paperclips'?)
4.
Addressed in 1.
5.
It does a little bit.
It means we can't observe them for astronomical purposes. But this isn't the same as losing a telescope looking at them - it's (probably) permanent, and maybe we learn something different from it. We learn that stars can be jumbled up. This may have physics/stellar engineering consequences, etc.
6.
Nearby from its perspective? (From a practical standpoint, if you're close to an airport you're close to a lot of places on earth, that you aren't from a 'space' perspective.)
7.
If there's a limited amount of energy, then using energy limits ability to reach many world states - perhaps in a different sense than above. If there's a machine that can turn all pebbles into something else (obsidian, precious stones, etc.) but it takes a lot of energy, then using up energy limits the number of times it can be used. (This might seem quantifiable, moving the world* from containing 101 units of energy -> 99 units an effect on how many times the machine can be used if it requires 100, or 10 units to use. But this isn't robust against random factors decreasing energy (or decreasing it), or future improvements in energy efficiency of the machine - if the cost is brought down to 1 unit of energy, then using up 2 units prevents it from being used twice.
*Properly formalizing this should take a lot of other things into account, like 'distant' and notions of inaccessible regions of space, etc.
Also the agent might be concerned with flows rather than actions.* We have an intuitive notion that 'building factories increases power', but what about redirecting a river/stream/etc. with dams or digging new paths for water to flow? What does the agent do if it unexpectedly gains power by some means, or realizes its paperclip machines can be used to move strawberries/make a copy itself which is weaker but less constrained? Can the agent make a machine that makes paperclips/make making paperclips easier?
*As a consequence of this being a more effective approach - it makes certain improvements obvious. If you have a really long commute to work, you might wish you lived closer to your work. (You might also be aware that houses closer to your work are more expensive, but humans are good at picking up on this kind of low hanging fruit. A capable agent that thinks about process seeing 'opportunities to gain power' is of some general concern. In this case because an agent that tries to minimize reducing/affecting** other agents attainable utility, without knowing/needing to know about other agents is somewhat counterintuitive.
**It's not clear if increasing shows up on the AUP map, or how that's handled.
8.
I appreciate this distinction being made. A post that explains the intuitions behind an approach is very useful, and my questions about the approach may largely relate to implementation details.
9.
A number of my comments above were anticipated then.
Yeah. I have the math for this kind of tradeoff worked out - stay tuned!
I think this is true, actually; if another agent already has a lot of power and it isn't already cata
... (read more)