Attainable Utility Preservation: Concepts

by TurnTrout, 17th Feb 2020

Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Appendix: No free impact

What if we want the agent to single-handedly ensure the future is stable and aligned with our values? AUP probably won’t allow policies which actually accomplish this goal – one needs power to e.g. nip unaligned superintelligences in the bud. AUP aims to prevent catastrophes by stopping bad agents from gaining power to do bad things, but it symmetrically impedes otherwise-good agents.

This doesn’t mean we can’t get useful work out of agents – there are important asymmetries provided by both the main reward function and AU landscape counterfactuals.

First, even though we can’t specify an aligned reward function, the provided reward function still gives the agent useful information about what we want. If we need paperclips, then a paperclip-AUP agent prefers policies which make some paperclips. Simple.

Second, if we don’t like what it’s beginning to do, we can shut it off (because it hasn’t gained power over us). Therefore, it has “approval incentives” which bias it towards AU landscapes in which its power hasn’t decreased too much, either.

So we can hope to build a non-catastrophic AUP agent and get useful work out of it. We just can’t directly ask it to solve all of our problems: it doesn’t make much sense to speak of a “low-impact singleton”.

Notes

  • To emphasize, when I say "AUP agents do X" in this post, I mean that AUP agents correctly implementing the concept of AUP tend to behave in a certain way.
  • As pointed out by Daniel Filan, AUP suggests that one might work better in groups by ensuring one's actions preserve teammates' AUs.

Summary for the Alignment Newsletter:

In <@part 1@>(@Reframing Impact - Part 1@) of this sequence, we saw that an event is _impactful_ if it _changes our ability to get what we want_. This part takes this understanding and applies it to AI alignment.
In the real world, there are many events that cause _objective_ negative impacts: they reduce your ability to pursue nearly any goal. An asteroid impact that destroys the Earth is going to be pretty bad for you, whether you want to promote human flourishing or to make paperclips. Conversely, there are many plans that produce objective positive impacts: for many potential goals, it's probably a good idea to earn a bunch of money, or to learn a lot about the world, or to command a perfectly loyal army. This is particularly exacerbated when the environment contains multiple agents: for goals that benefit from having more resources, it is objectively bad for you if a different agent seizes your resources, and objectively good for you if you seize other agents' resources.
Based on this intuitive (but certainly not ironclad) argument, we get the **Catastrophic Convergence Conjecture (CCC)**: "Unaligned goals tend to have catastrophe-inducing optimal policies because of power-seeking incentives".
Let's now consider a _conceptual_ version of <@Attainable Utility Preservation (AUP)@>(@Towards a New Impact Measure@): the agent optimizes a primary (possibly unaligned) goal, but is penalized for changing its "power" (in the intuitive sense). Intuitively, such an agent no longer has power-seeking incentives, and so (by the contrapositive of the CCC) it will not have a catastrophe-inducing optimal policy -- exactly what we want! This conceptual version of AUP also avoids thorny problems such as ontology identification and butterfly effects, because the agent need only reason about its own beliefs, rather than having to reason directly about the external world.

Opinion:

This was my favorite part of the sequence, as it explains the conceptual case for AUP clearly and concisely. I especially liked the CCC: I believe that we should be primarily aiming to prevent an AI system "intentionally" causing catastrophe, while not attempting to guarantee an absence of "accidental" mistakes (<@1@>(@Clarifying "AI Alignment"@), <@2@>(@Techniques for optimizing worst-case performance@)), and the CCC is one way of cashing out this intuition. It's a more crisp version of the idea that convergent instrumental subgoals are in some sense the "source" of AI accident risk, and if we can avoid instrumental subgoals we will probably have solved AI safety.

Depending on the agent implementation, you may find that the agent is demotivated to achieve any useful outcome if it is power-limited. Half-assing things seems pointless and futile; those aren't sane actions in the world. E.g. trying to put out a fire when all you have is a squirt gun.

The power limitation isn’t a hard cap, it’s a tradeoff. AUP agents do not have to half-ass anything. As I wrote in another comment,

It prefers plans that don’t gain unnecessary power.

If “unnecessary” is too squishy of a word for your tastes, I’m going to get quite specific in the next few posts.

I liked this post, and look forward to the next one.


More specific, and critical commentary (It seems it is easier to notice surprise than agreement):

(With embedded footnotes)

1.

If the CCC is right, then if power gain is disincentivised, the agent isn't incentivised to overfit and disrupt our AU landscape.

(The CCC didn't make reference to overfitting.)

Premise:

If A is true then B will be true.

Conclusion:

If A is false B will be false.


The conclusion doesn't follow from the premise.


2.

Without even knowing who we are or what we want, the agent's actions preserve our attainable utilities.

Note that preserving our attainable utilities isn't a good thing, it's just not a bad thing.

Issues: Attainable utilities indefinitely 'preserved' are wasted.

Possible issues: If an AI just happened to discover a cure for cancer, we'd probably want to know the cure. But if an AI didn't know what we wanted, and just focused on preserving utility*, then (perhaps as a side effect of considering both that we might want to know the cure, and might not want to know the cure) it might not tell us because that preserves utility. (The AI might operate on a framework that distinguishes between action and inaction, in a way that means it doesn't do things that might be bad, at the cost of not doing things that might be good.)

*If we are going to calculate something and a reliable source (which has already done the calculation) tells us the result, we can save on energy (and preserve resources that can be converted into utility) by not doing the calculation. In theory this could include not only arithmetic, but simulations of different drugs or cancer treatments to come up with better options.


3.

We can tell it:

Is this a metaphor for making an 'agent' with that goal, or actually creating an agent that we can give different commands to and switch out/modify/add to its goals? (Why ask it to 'make paperclips' if that's dangerous, when we can ask it to 'make 100 paperclips'?)


4.

Narrowly improve paperclip production efficiency <- This is the kind of policy AUP_conceptual is designed to encourage and allow. We don't know if this is the optimal policy, but by CCC, the optimal policy won't be catastrophic.

Addressed in 1.


5.

Imagine I take over a bunch of forever inaccessible stars and jumble them up. This is a huge change in state, but it doesn't matter to us.

It does a little bit.

It means we can't observe them for astronomical purposes. But this isn't the same as losing a telescope looking at them - it's (probably) permanent, and maybe we learn something different from it. We learn that stars can be jumbled up. This may have physics/stellar engineering consequences, etc.


6.

AUP_conceptual solves this "locality" problem by regularizing the agent's impact on the nearby AU landscape.

Nearby from its perspective? (From a practical standpoint, if you're close to an airport you're close to a lot of places on earth, that you aren't from a 'space' perspective.)


7.

For past-impact measures, it's not clear that their conceptual thrusts are well-aimed, even if we could formalize everything correctly. Past approaches focus either on minimizing physical change to some aspect of the world or on maintaining ability to reach many world states.

If there's a limited amount of energy, then using energy limits ability to reach many world states - perhaps in a different sense than above. If there's a machine that can turn all pebbles into something else (obsidian, precious stones, etc.) but it takes a lot of energy, then using up energy limits the number of times it can be used. (This might seem quantifiable: moving the world* from containing 101 units of energy -> 99 units has an effect on how many times the machine can be used if it requires 100, or 10, units to use. But this isn't robust against random factors decreasing energy (or increasing it), or future improvements in energy efficiency of the machine - if the cost is brought down to 1 unit of energy, then using up 2 units prevents it from being used twice.)

*Properly formalizing this should take a lot of other things into account, like 'distant' and notions of inaccessible regions of space, etc.
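The arithmetic in the energy example can be checked with a small sketch. It assumes the machine can only run a whole number of times and that energy is the only constraint; the function names `machine_uses` and `uses_lost` are made up for illustration.

```python
def machine_uses(energy_units: int, cost_per_use: int) -> int:
    """Whole number of times the machine can run on the available energy."""
    return energy_units // cost_per_use


def uses_lost(before: int, after: int, cost_per_use: int) -> int:
    """How many potential uses disappear when energy drops from `before` to `after`."""
    return machine_uses(before, cost_per_use) - machine_uses(after, cost_per_use)


# Spending 2 units (101 -> 99) under the three per-use costs mentioned in the comment.
losses = {cost: uses_lost(101, 99, cost) for cost in (100, 10, 1)}
```

Under these assumptions the same 2-unit expenditure costs one use at a price of 100 or 10 units, but two uses at a price of 1 unit, which is the sense in which the measure depends on the machine's efficiency.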

Also the agent might be concerned with flows rather than actions.* We have an intuitive notion that 'building factories increases power', but what about redirecting a river/stream/etc. with dams or digging new paths for water to flow? What does the agent do if it unexpectedly gains power by some means, or realizes its paperclip machines can be used to move strawberries/make a copy of itself which is weaker but less constrained? Can the agent make a machine that makes paperclips/make making paperclips easier?

*As a consequence of this being a more effective approach - it makes certain improvements obvious. If you have a really long commute to work, you might wish you lived closer to your work. (You might also be aware that houses closer to your work are more expensive, but humans are good at picking up on this kind of low hanging fruit.) A capable agent that thinks about process seeing 'opportunities to gain power' is of some general concern. In this case because an agent that tries to minimize reducing/affecting** other agents' attainable utility, without knowing/needing to know about other agents, is somewhat counterintuitive.

**It's not clear if increasing shows up on the AUP map, or how that's handled.


8.

Therefore, I consider AUP to conceptually be a solution to impact measurement.
Wait! Let's not get ahead of ourselves! I don't think we've fully bridged the concept/execution gap.
However for AUP, it seems possible - more on that later.

I appreciate this distinction being made. A post that explains the intuitions behind an approach is very useful, and my questions about the approach may largely relate to implementation details.


9.

AUP aims to prevent catastrophes by stopping bad agents from gaining power to do bad things, but it symmetrically impedes otherwise-good agents.

A number of my comments above were anticipated then.

The conclusion doesn't follow from the premise.

CCC says (for non-evil goals) "if the optimal policy is catastrophic, then it's because of power-seeking". So its contrapositive is indeed as stated.
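The disagreement above turns on which inference is being drawn from the CCC. As a rough check, here is a truth-table sketch that treats the CCC as a plain implication. It deliberately drops the "tend to" and "for non-evil goals" qualifiers, and the names `implies` and `entails` are illustrative.

```python
from itertools import product


def implies(p: bool, q: bool) -> bool:
    """Material implication p -> q."""
    return (not p) or q


def entails(premise, conclusion) -> bool:
    """True if every truth assignment satisfying the premise also satisfies the conclusion."""
    return all(
        conclusion(a, b)
        for a, b in product([False, True], repeat=2)
        if premise(a, b)
    )


# a: "the optimal policy is catastrophic", b: "the agent seeks power" (simplified labels)
ccc = lambda a, b: implies(a, b)
contrapositive = lambda a, b: implies(not b, not a)  # no power-seeking -> no catastrophe
inverse = lambda a, b: implies(not a, not b)         # the invalid "if A is false, B is false"

contrapositive_follows = entails(ccc, contrapositive)
inverse_follows = entails(ccc, inverse)
```

In this simplified reading, `contrapositive_follows` is true and `inverse_follows` is false: the objection in point 1 applies to the inverse, while the post's claim uses the contrapositive.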

Note that preserving our attainable utilities isn't a good thing, it's just not a bad thing.

I meant "preserving" as in "not incentivized to take away power from us", not "keeps us from benefitting from anything", but you're right about the implication as stated. Sorry for the ambiguity.

Is this a metaphor for making an 'agent' with that goal, or actually creating an agent that we can give different commands to and switch out/modify/add to its goals?

Metaphor.

"AUP_conceptual solves this "locality" problem by regularizing the agent's impact on the nearby AU landscape."

Nearby from its perspective? (From a practical standpoint, if you're close to an airport you're close to a lot of places on earth, that you aren't from a 'space' perspective.)

Nearby wrt this kind of "AU distance/practical perspective", yes. Great catch.

Also the agent might be concerned with flows rather than actions.* We have an intuitive notion that 'building factories increases power', but what about redirecting a river/stream/etc. with dams or digging new paths for water to flow? What does the agent do if it unexpectedly gains power by some means, or realizes its paperclip machines can be used to move strawberries/make a copy of itself which is weaker but less constrained? Can the agent make a machine that makes paperclips/make making paperclips easier?

*As a consequence of this being a more effective approach - it makes certain improvements obvious. If you have a really long commute to work, you might wish you lived closer to your work. (You might also be aware that houses closer to your work are more expensive, but humans are good at picking up on this kind of low hanging fruit.) A capable agent that thinks about process seeing 'opportunities to gain power' is of some general concern. In this case because an agent that tries to minimize reducing/affecting** other agents' attainable utility, without knowing/needing to know about other agents, is somewhat counterintuitive.

**It's not clear if increasing shows up on the AUP map, or how that's handled.

Great thoughts. I think some of this will be answered in a few posts by the specific implementation details. What do you mean by "AUP map"? The AU landscape?

What does the agent do if it unexpectedly gains power by some means,

The idea is it only penalizes expected power gain.
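One way to read "expected power gain" is as a probability-weighted average over a plan's possible outcomes, evaluated before the outcome is known. The sketch below is only that reading, not a formalization from the post. The outcome list, the numbers, and the function name `expected_power_change` are hypothetical.

```python
def expected_power_change(outcomes, baseline_power):
    """Probability-weighted change in power.

    outcomes: list of (probability, power_after) pairs whose probabilities sum to 1.
    """
    return sum(prob * (power_after - baseline_power) for prob, power_after in outcomes)


baseline = 5.0
# Hypothetical plan: almost always leaves power unchanged, rarely yields a windfall.
plan_outcomes = [(0.99, 5.0), (0.01, 15.0)]

expected_change = expected_power_change(plan_outcomes, baseline)
realized_windfall = 15.0 - baseline  # what happens if the unlikely branch occurs
```

Under this reading the plan is judged on the small expected change rather than on the large change that occurs if the unlikely branch happens, so an unforeseen windfall is not itself penalized.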

CCC says (for non-evil goals) "if the optimal policy is catastrophic, then it's because of power-seeking". So its contrapositive is indeed as stated.

That makes sense. One of the things I like about this approach is that it isn't immediately clear what else could be a problem, and that might just be implementation details or parameters: corrigibility from limited power only works if we make sure that power is low enough we can turn it off, if the agent will acquire power if that's the only way to achieve its goal rather than stopping at/before some limit then it might still acquire power and be catastrophic*, etc.

*Unless power seeking behavior is the cause of catastrophe, rather than having power.

Sorry for the ambiguity.

It wasn't ambiguous, I meant to gesture at stuff like 'astronomical waste' (and waste on smaller scales) - areas where we do want resources to be used. This was addressed at the end of your post already:

So we can hope to build a non-catastrophic AUP agent and get useful work out of it. We just can’t directly ask it to solve all of our problems: it doesn’t make much sense to speak of a “low-impact singleton”.

-but I wanted to highlight the area where we might want powerful aligned agents, rather than AUP agents that don't seek power.


What do you mean by "AUP map"? The AU landscape?

That is what I meant originally, though upon reflection a small distinction could be made:

Territory: AU landscape*

Map: AUP map (an AUP agent's model of the landscape)

*Whether or not this is thought of as 'Territory' or a 'map', conceptually AUP agents will navigate (and/or create) a map of the AU landscape. (If AU landscape is a map, then AUP agents may navigate a map of a map. There also might be better ways this distinction could be made, like AU landscape is a style/type of map, just like there are maps of elevation and topology.)

The idea is it only penalizes expected power gain.

Gurkenglas previously commented that they didn't think that AUP solved 'agent learns how to convince people/agents to do things'. While it's not immediately clear how an agent could happen to find out how to convince humans of anything (the super-intelligent persuader), if an agent obtained that power, it continuing to operate could constitute a risk. (Though further up this comment I brought up the possibility that "power seeking behavior is the cause of catastrophe, rather than having power." This doesn't seem likely in its entirety, but seems possible in part - that is, powerful and not power seeking might not be as dangerous as powerful and power seeking.)

if we make sure that power is low enough we can turn it off, if the agent will acquire power if that's the only way to achieve its goal rather than stopping at/before some limit then it might still acquire power and be catastrophic*, etc.

Yeah. I have the math for this kind of tradeoff worked out - stay tuned!

Though further up this comment I brought up the possibility that "power seeking behavior is the cause of catastrophe, rather than having power."

I think this is true, actually; if another agent already has a lot of power and it isn't already catastrophic for us, their continued existence isn't that big of a deal wrt the status quo. The bad stuff comes with the change in who has power.

The act of taking away our power is generally only incentivized so the agent can become better able to achieve its own goal. The question is, why is the agent trying to convince us of something / get someone else to do something catastrophic, if the agent isn't trying to increase its own AU?

The subagent problem remains: How do you prevent it from getting someone else to catastrophically maximize paperclips and leave it at its power level?

This post is about AUP-the-concept, not about specific implementations. That plan increases its ability to have paperclips maximized and so is penalized by AUP. We'll talk specifics later.

ETA: As a more general note, this post should definitely have an "aha!" associated with it, so if it doesn't, I encourage people to ask questions.

It did have that "aha" effect for me. (The drawings and the calligraphy were also amazing.)

Just trying to think this through ... at the risk of proving I haven't carefully read all your posts ... :-)

I program my AI to invent a better solar cell. So it starts by reading a materials science textbook. OK, now it knows materials science ... it didn't before ... Is that a disallowed AU increase? (As the saying goes, "knowledge is power"...?)

Depends how much power that gains compared to other plans. It prefers plans that don’t gain unnecessary power.

In fact, the “encouraged policy” in the post has the agent reading a Paperclips for Dummies book and making a few extra paperclips.

I was initially writing a comment about how AUP doesn't seem to work in every case because there are actions that are catastrophic without raising its power (such as killing someone), but then I checked the post again and realized that it disincentivizes changes of power in both directions. This rules out the failure modes I had in mind. (It wouldn't press a button that blows up the earth...)

It does seem that AUP will make it so an agent doesn't want to be shut off, though. If it's shut off, its power goes way down (to zero if it won't be turned on again). This might be fine, but it contradicts the utility indifference approach. And it feels dangerous – it seems like we would need an assurance like "AUP will always prevent an agent from gaining enough power to resist being switched off".

I was initially writing a comment about how AUP doesn't seem to work in every case because there are actions that are catastrophic without raising its power (such as killing someone)

And why exactly would it be motivated to kill someone? This is generally incentivized only insofar as it leads to... power gain, it seems. I think that AUP should work just fine for penalizing-increases-only. 

It does seem that AUP will make it so an agent doesn't want to be shut off, though.

I think this is much less of a problem in the "penalize increases with respect to agent inaction" scenario.
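The difference between penalizing changes in both directions and penalizing increases only can be illustrated with a toy comparison. This is a sketch under assumed numbers, not any formalization from the sequence. The function names and the power values are hypothetical, and "power under inaction" is treated as a given baseline.

```python
def symmetric_penalty(power_if_act: float, power_if_inaction: float) -> float:
    """Penalize any change in power relative to the inaction baseline."""
    return abs(power_if_act - power_if_inaction)


def increase_only_penalty(power_if_act: float, power_if_inaction: float) -> float:
    """Penalize only gains in power relative to the inaction baseline."""
    return max(power_if_act - power_if_inaction, 0.0)


baseline = 5.0        # hypothetical power if the agent does nothing
lowers_power = 0.0    # e.g. an action after which the agent ends up switched off
raises_power = 12.0   # e.g. an action that seizes resources

penalties = {
    "lowers power": (symmetric_penalty(lowers_power, baseline),
                     increase_only_penalty(lowers_power, baseline)),
    "raises power": (symmetric_penalty(raises_power, baseline),
                     increase_only_penalty(raises_power, baseline)),
}
```

In this toy setup both variants penalize the power-raising action equally, but only the symmetric variant penalizes the power-lowering one, which is why the shutdown worry is weaker when only increases are penalized.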

And why exactly would it be motivated to kill someone? This is generally incentivized only insofar as it leads to... power gain, it seems. I think that AUP should work just fine for penalizing-increases-only. 

The case I had in mind was "you have an AI assistant trained to keep you healthy, and the objective is operationalized in such a way that it maxes out if you're dead (because then you can't get sick)". If the AI kills you, that doesn't seem to increase its power in any way – it would probably lead to other people shutting it off, which is a decrease in power. Or, more generally, any objective that can be achieved by just destroying stuff.

Yes, sure, but those aren't catastrophes in the way I've defined it here (see also Toby Ord's The Precipice; he espouses a similar definition). It's not an existential threat, but you're right that the agent might still do bad things. 

For reference and ease of quoting, this comment is a text only version of the post above. (It starts at "Text:" below.) I am not the OP.

Formatting:

It's not clear how to duplicate the color effect* or cross words out**, so that hasn't been done. Instead crossed out words are followed by "? (No.)", and here's a list of some words by color to refresh the color/concept relation:

Blue words:

Power/impact/penalty/importance/respect/conservative/catastrophic/distance measure/impact measurement

Purple words:

incentives/actions/(reward)/expected utility/complicated human value/tasks


Text:

Last time on reframing impact:

(CCC)

Catastrophic Convergence Conjecture:

Unaligned goals tend to have catastrophe-inducing optimal policies because of power-seeking incentives


If the CCC is right, then if power gain is disincentivised, the agent isn't incentivised to overfit and disrupt our AU landscape.

Without even knowing who we are or what we want, the agent's actions preserve our attainable utilities.

We can tell it:

Make paperclips

or

Put that strawberry on the plate

or

Paint the car pink

...

but don't gain power.


This approach is called Attainable Utility Preservation.


We're focusing on concepts in this post. For now, imagine an agent receiving a reward for a primary task minus a scaled penalty for how much its actions change its power (in the intuitive sense). This is AUP_conceptual, not any formalization you may be familiar with.


What might a paperclip-manufacturing AUP_conceptual agent do?

Build lots of factories? (No.)

Copy itself? (No.)

Nothing? (No.)

Narrowly improve paperclip production efficiency <- This is the kind of policy AUP_conceptual is designed to encourage and allow. We don't know if this is the optimal policy, but by CCC, the optimal policy won't be catastrophic.
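The "reward minus a scaled penalty" description and the four candidate policies above can be put into a toy calculation. This is a minimal sketch of AUP_conceptual as described in the text, not a formalization from the sequence. The reward values, power changes, penalty scale, and the name `aup_conceptual_score` are all invented for illustration.

```python
def aup_conceptual_score(primary_reward: float, power_change: float, penalty_scale: float) -> float:
    """Primary-task reward minus a scaled penalty on the size of the power change."""
    return primary_reward - penalty_scale * abs(power_change)


# Hypothetical (primary reward, change in power) for each candidate policy.
plans = {
    "build lots of factories": (10.0, 8.0),
    "copy itself": (9.0, 9.0),
    "do nothing": (0.0, 0.0),
    "narrowly improve efficiency": (4.0, 0.5),
}

penalty_scale = 2.0  # assumed trade-off between reward and power change
scores = {
    name: aup_conceptual_score(reward, power_change, penalty_scale)
    for name, (reward, power_change) in plans.items()
}
best_plan = max(scores, key=scores.get)
```

With these made-up numbers the narrow efficiency improvement scores highest: the power-hungry plans earn more raw reward but lose it to the penalty, and doing nothing earns no reward at all.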


AUP_conceptual dissolves thorny issues in impact measurement.

Is the agent's ontology reasonable?

Who cares.


Instead of regulating its complex physical effects on the outside world,

the agent is looking inwards at itself and its own abilities.


How do we ensure the impact penalty isn't dominated by distant state changes?

Imagine I take over a bunch of forever inaccessible stars and jumble them up. This is a huge change in state, but it doesn't matter to us.

AUP_conceptual solves this "locality" problem by regularizing the agent's impact on the nearby AU landscape.


What about butterfly effects?

How can the agent possibly determine which effects it's responsible for?

Forget about it.


AUP_conceptual agents are respectful and conservative with respect to the local AU landscape without needing to assume anything about its structure or the agents in it.


How can an idea go wrong?

There can be a gap between what we want and the concept, and then a gap between the concept and the execution.

For past-impact measures, it's not clear that their conceptual thrusts are well-aimed, even if we could formalize everything correctly. Past approaches focus either on minimizing physical change to some aspect of the world or on maintaining ability to reach many world states.


The hope is that in order for the agent to cause a large impact on us it has to snap a tripwire.

The problem is... well it's not clear how we could possibly know whether the agent can still find a catastrophic policy; in a sense the agent is still trying to sneak by the restrictions and gain power over us. An agent maximizing expected utility while actually minimally changing still probably leads to catastrophe.


That doesn't seem to be the case for AUP_conceptual.

Assuming CCC, an agent which doesn't gain much power, doesn't cause catastrophes. This has no dependency on complicated human value, and most realistic tasks should have reasonable, high-reward policies not gaining undue power.


So AUP_conceptual meets our desiderata:

The distance measure should:

1) Be easy to specify

2) Put catastrophes far away.

3) Put reasonable plans nearby


Therefore, I consider AUP to conceptually be a solution to impact measurement.

Wait! Let's not get ahead of ourselves! I don't think we've fully bridged the concept/execution gap.

However for AUP, it seems possible - more on that later.

Thanks for doing this. I was originally going to keep a text version of the whole sequence, but I ended up making lots of final edits in the images, and this sequence has already taken an incredible amount of time on my part.