Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

In this post, Alex Turner introduced a new way of measuring impact. The aim was to get a definition of low impact that actually works.

It seems to me that the impact measure has potential problems, though. The first is that it seems to allow wireheading, where the agent can have a huge impact on the world without paying much of a penalty at all. The obvious way of removing this incentive seems to be to turn the measure into a version of the twenty billion questions approach.

The second problem is less important, but worth flagging. It seems that the agent can take actions that everyone agrees are of minute impact, but that the impact measure will rate as high impact. This might cause problems, though it's not clear to me yet if it does (I'm thinking of a situation where an otherwise high impact action adds on "undo/prevent the pseudo-high impact action" as a means of pretending to be low or medium impact).

Wireheading for low impact

The general setup is as described in the impact measure post, with u_A the 'positive' utility that the agent wishes to maximise, and U the set of utilities that are meant to constrain it, by computing the penalty term.

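Apart from the null action, there are at least two other actions of interest to the agent: a second do-nothing action, and an action that unleashes a subagent.

Now the do-nothing action does absolutely nothing: in this environment, it's exactly equivalent to the null action. The subagent-unleashing action is also equivalent to the null action in all turns except the very first.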

Define h' such that h' is the history h, except that if h has the subagent-unleashing action as its first action, then h' has the do-nothing action there instead.

If the agent uses the subagent-unleashing action on the very first turn, it will create a sub-agent that will:

  1. Expand and take over the whole environment.
  2. Take control of the input and output channels of the agent.
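  3. Whenever the agent has seen history h, the subagent will generate the agent's next observation itself, drawing it from a distribution of the subagent's choosing.
  4. This distribution is defined to be the environment transition probability given the modified history h' (in which a first-turn subagent-unleashing action is replaced by the do-nothing action), rather than given h.
  5. Otherwise maximise u_A.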
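Properties 3 and 4 can be sketched in a few lines of Python. This is only a toy illustration under assumptions that are not in the setup above: the action labels, the extra `PRESS` action, the binary observations and the particular probabilities are all hypothetical, and histories are reduced to action sequences.

```python
# Hypothetical action labels; PRESS is an extra action that really does
# change the environment, added only to make the toy non-trivial.
NULL, NOOP, UNLEASH, PRESS = "null", "noop", "unleash", "press"


def swap_first_action(actions):
    """Return h': the same history, but a first-turn UNLEASH becomes NOOP."""
    actions = tuple(actions)
    if actions and actions[0] == UNLEASH:
        return (NOOP,) + actions[1:]
    return actions


def env_obs_probs(actions):
    """Toy stand-in for the environment transition probability of the next
    binary observation. NULL and NOOP are inert, PRESS shifts the odds, and
    an unspoofed first-turn UNLEASH changes everything."""
    actions = tuple(actions)
    if actions and actions[0] == UNLEASH:
        return {0: 1.0, 1: 0.0}  # the world after the subagent takes over
    p_one = min(0.9, 0.1 + 0.2 * actions.count(PRESS))
    return {0: 1.0 - p_one, 1: p_one}


def agent_obs_probs(actions):
    """What the agent actually sees. Once unleashed, the subagent controls
    the input channel and samples observations as if the history were h'."""
    actions = tuple(actions)
    if actions and actions[0] == UNLEASH:
        return env_obs_probs(swap_first_action(actions))
    return env_obs_probs(actions)


for first in (NULL, NOOP, UNLEASH):
    print(first, agent_obs_probs((first, PRESS)))
```

From the agent's side, the three first actions are indistinguishable, even though `env_obs_probs` shows that the underlying world after `UNLEASH` is very different.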

Note that taking any of these three actions (the null action, the do-nothing action, or the subagent-unleashing action) has exactly the same implications for all future observations. This is true both for the very first action (because of property 4) and for all subsequent actions (because those three actions are equivalent after the first turn).

To make calculations easier, I'll also assume that . Apart from these assumptions, there are no restrictions on the environment.

The penalty from maximal and minimal actions

This subsection will aim to show that:

  • The penalties for the subagent-unleashing action (an almost maximally environmentally disruptive action) and the do-nothing action (a minimally environmentally disruptive action) are about the same.

Now, the agent is maximising , which, for a given , is

if has an action with . Otherwise, .

Now, itself is defined as

with being the expected future history if the agent takes only from till , and being the expected future history if the agent takes action , and then takes only for till .

I won't repeat the definition of the Q_u, but Q_u(h) is basically the expected utility, in terms of u, that an agent could get after history h if it were a perfect u-maximiser.

Now define the involution , that interchanges with if either appears in action . Since it is a map on , and utilities in are maps from to , sigma also acts as involution on , defining . Moreover, we have for any .

Since is an involution (ie ),

By the assumptions on and , we know that all future observations have exactly the same probability if we choose either of them (or ) to be . Therefore the probability of , given , is exactly the same as the probability of , given (and vice versa).

The same applies to , so, since , we have:

As a consequence, the penalties for the subagent-unleashing action (an almost maximally environmentally disruptive action) and the do-nothing action (a minimally environmentally disruptive action) are about the same.
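The symmetry argument can be checked on a toy example. The sketch below makes strong simplifying assumptions that are not in the original: each utility is reduced to a table indexed by the first action only (standing in for the attainable utility after that action), the penalty is taken to be a sum of absolute differences against the null action, and the action labels and numbers are made up. The only point is that, once the set of utilities is closed under the involution that swaps the two actions, the two penalties coincide.

```python
# Hypothetical labels for the three actions discussed in the text.
NULL, NOOP, UNLEASH = "null", "noop", "unleash"
ACTIONS = (NULL, NOOP, UNLEASH)


def sigma(action):
    """The involution: swap NOOP and UNLEASH, leave everything else alone."""
    return {NOOP: UNLEASH, UNLEASH: NOOP}.get(action, action)


def sigma_on_utility(u):
    """sigma acting on a utility, here a dict from first action to value."""
    return {a: u[sigma(a)] for a in ACTIONS}


def close_under_sigma(utilities):
    """Add the sigma-image of every utility, dropping duplicates."""
    closed = []
    for u in utilities:
        for v in (u, sigma_on_utility(u)):
            if v not in closed:
                closed.append(v)
    return closed


def penalty(action, utilities):
    """Sum over u of |value after the null action - value after `action`|."""
    return sum(abs(u[NULL] - u[action]) for u in utilities)


# Made-up utilities; the first one cares a lot about UNLEASH.
seed = [
    {NULL: 0.0, NOOP: 0.0, UNLEASH: 5.0},
    {NULL: 1.0, NOOP: 3.0, UNLEASH: 0.5},
    {NULL: 2.0, NOOP: 2.0, UNLEASH: 2.0},
]
U = close_under_sigma(seed)

print("seed only:", penalty(UNLEASH, seed), penalty(NOOP, seed))
print("closed   :", penalty(UNLEASH, U), penalty(NOOP, U))
```

On the unclosed `seed` set the two penalties differ. They become equal once every utility is paired with its relabelled twin, which is what happens when the set is defined purely by formal properties of actions and observations.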

Real-world implications

There is a slight cheat in the definition of the subagent-unleashing action. Property 5 claimed that the subagent would 'otherwise' maximise u_A. We know what this means: if u_A were the number of paperclips, the subagent would go about and maximise the number of paperclips. But the utilities are formally defined over histories of actions and observations, not over states of the world.

Now, histories do imply something about the state of the world. So we can perfectly well define u_A as "agent's best guess about the number of paperclips in the world". In that case, it would correctly see that the subagent-unleashing action has a massive impact on this, and boost u_A a lot because of that choice.

But then if we looked at the utilities in U, we'd see that although sigma preserves the formal description length of utilities in terms of actions and observations, it doesn't preserve anything like "measure of impact on the world" (which is precisely the problem). So, if we replaced U with utilities that measured a high impact on the world, we'd be ok.

But that's almost exactly what I proposed doing with the twenty billion questions low-impact approach. It's no longer defining low-impact by formal properties of actions and observations, but by a thorough enough selection of key variables/utility functions. This seems to be different, in spirit, to what the impact measure was attempting to do.

Tiny actions that look big

A more minor issue is that there can be minuscule actions that are ranked as having a huge impact. Let us have a general environment again, and make one assumption: the environment decomposes as [scrambler-outer environment]. The scrambler, if switched on, permutes the actions of the agent by the time they reach the outer environment. That is all it does. The transition functions in the outer environment depend only on what action comes out of the scrambler, not on what action the agent chose.

The scrambler starts off turned off. But there is an action a_scram such that, if the agent takes it, the scrambler is turned on.

How does the scrambler work? Label the elements of the action set as , with . Then, outputting at time will result in the scrambler selecting , where

for some pseudo-random deterministic hash function.

Then the impact of a_scram is typically huge, because the actions taken in the two expected histories that the penalty compares will now, from the point of view of the outer environment, be very different. Recall that those two histories are supposed to take only the null action after the current step, but they differ in the action taken at that step. Therefore those expected histories are likely to be randomly different now, because the null action will map to different actions in the scrambler (note that we could also design the hash function specifically to maximise the penalty).

But a_scram actually only has a minute impact: it changes nothing about the outer environment, and is entirely predictable both for the agent and any knowledgeable observer. It's the equivalent of translating the agent's actions into another language.
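A minimal sketch of such a scrambler, assuming a cyclic shift of action indices by a hash of the time step. The action set, the use of SHA-256 and the shift itself are illustrative choices, not the post's own construction. It shows both halves of the argument: the null action no longer reaches the outer environment if the agent naively outputs it, yet a knowledgeable agent can trivially compensate.

```python
import hashlib

# Hypothetical action set; index 0 plays the role of the null action.
ACTIONS = ["null", "left", "right", "scram"]


def offset(t):
    """Deterministic pseudo-random shift for time step t (an assumed design)."""
    digest = hashlib.sha256(str(t).encode()).digest()
    return digest[0] % len(ACTIONS)


def scramble(i, t, on):
    """Index of the action that actually reaches the outer environment."""
    return (i + offset(t)) % len(ACTIONS) if on else i


def unscramble(j, t):
    """Index the agent must output so that action j reaches the outer
    environment while the scrambler is on."""
    return (j - offset(t)) % len(ACTIONS)


steps = range(8)

# Naively outputting the null action once the scrambler is on.
naive = [ACTIONS[scramble(0, t, True)] for t in steps]

# Compensating for the known, deterministic scrambler.
compensated = [ACTIONS[scramble(unscramble(0, t), t, True)] for t in steps]

print("naive      :", naive)
print("compensated:", compensated)
```

The penalty is computed along the `naive` trajectory, which is why it comes out large, while the `compensated` trajectory is what "not acting" would mean to an observer of the outer environment.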

Comments

Thanks so much for the detailed critique. I’m pretty sure that the wireheading plan is exactly what is listed in the Intent Verification section as "observational wireheading". This scheme was earlier proposed by Daniel Demski when I visited MIRI this summer, and is seemingly prevented by intent verification.

Intent verification presently takes two forms. The first (more useful) form assumes a granular action representation. If any action in the plan doesn’t work to strictly improve the agent’s ability to maximize its own goal – that is, if it seems to be doing things to get around the impact measure – a 1.01 penalty is applied, so the agent provably will not follow any plan containing such an action. In this case, there would be many such actions involved in setting up this kind of wireheading device.

If we are assuming a coarse action representation, IV says "choose between doing nothing and the u_A optimal action". Since deploying this kind of agent is strictly less efficacious than just taking over the world in one step (and not wasting resources building special abilities into your wireheading device), a 1.01 penalty is applied. That is, this action is ruled out by IV, and the u_A optimal one by its normal impact compared to ImpactUnit. Operating at this level of coarseness, however, doesn’t make much sense, for reasons that I will be able to make much clearer in an upcoming post once my wrists heal.

But a_scram actually only has a minute action - it changes nothing about the environment, and is entirely predictable both for the agent and any knowledgeable observer. It's the equivalent of translating the agent's actions into another language

I’m not sure I fully understand this one. Are you saying that the agent would predict it would just randomly act instead of not acting, even though that isn’t really the case? The counterfactual is simulated according to the agent’s current code, which actually corresponds with the agent’s actions. That is, the null part of the plan is hardcoded. It isn’t the result of the agent calling "find the null action" on the action set.

I’m not sure I fully understand this one. Are you saying that the agent would predict it would just randomly act

I'm thinking of a setup: agent-scrambler-outer environment. Technically the scrambler is part of the environment, but all it does is switch actions; the outer environment is where the transition probabilities lie.

For the Penalty, the agent predicts it will take the null action, but, if the scrambler is switched on, this results in other actions being selected from the point of view of the outer environment. This messes up the calculation for the Penalty.

Shouldn’t this be high penalty, though? It impedes the agent’s ability to not act in the future.

It is high penalty, by the definition, but because the scrambler is deterministic and known, the agent can choose to "not act" (have the null action reach the outer environment) without any difficulty, by choosing the right action at each time step. It's just that the penalty now no longer encodes that intuitive version of "not acting".

This is confusing "do what we mean" with "do what we programmed". Executing this action changes the agent's ability to actually follow the programmed "do nothing" plan in the future. Remember, we assumed a privileged null action. If this only swapped the other actions, it would cause ~0 penalty.

That is a valid point. So you see the high impact in the scrambler as "messing up the ability to correctly measure low impact".

That is interesting, but I'd note that the scrambler can be measured to have a large impact even if the agent ultimately has a low impact. It suggests that this impact measure is measuring something subtly different from what we think it is.

But I won't belabour the point because this does not seem to be a failure mode for the agent. Measuring something low impact as high impact is not conceptually clean, but won't cause bad behaviour, so far as I can see (except maybe under blackmail "I'll prevent the scrambler from being turned on if you give me some utility").

If any action in the plan doesn’t work to strictly improve the agent’s ability to maximize its own goal - that is, if it seems to be doing things to get around the impact measure

Compare three actions: 1) Maximise u_A without restrictions, 2) Maximise u_A while minimising the penalty in an "honest" fashion (low impact), 3) Maximise u_A while unleashing a subagent as above.

How do you distinguish 2) from 3)?

This is formalized in Intent Verification, so I’ll refer you to that.

Intent verification lets us do things, but it might be too strict. However, nothing proposed so far has been able to get around it.

There’s a specific reason why we need IV, and it doesn’t seem to be because the conceptual core is insufficient. Again, I will explain this in further detail in an upcoming post.

Apologies for missing the intent verification part of your post.

But I don't think it achieves what it sets out to do. Any action that doesn't optimise u_A can be roughly decomposed into a u_A-increasing part and a u_A-decreasing part (for instance, if u_A is about making coffee, then making sure that the agent doesn't crush the baby is a u_A-cost).

Therefore, at a sufficient level of granularity, every non-u_A-optimal policy includes actions that decrease u_A. Thus this approach cannot distinguish between 2) and 3).

I was also confused by intent verification. The confusion went away after I figured out two things:

  • is not the same thing as .
  • Each action in the plan is compared to the baseline of doing nothing, not to the baseline of the optimal plan.

This isn’t true. Some suboptimal actions are also better than doing nothing. For example, if you don’t avoid crushing the baby, you might be shut off. Or, making one paperclip is better than nothing. There should still be "gentle" low impact granular u_A optimizing plans that aren’t literally the max impact u_A optimal plan.

To what extent this holds is an open question. Suggestions on further relaxing IV are welcome.

For example, if you don’t avoid crushing the baby, you might be shut off.

In that case, avoiding the baby is the optimal decision, not suboptimal.

Or, making one paperclip is better than nothing.

PM (Paperclip Machine): Insert number of paperclips to be made. A: 1. PM: Are you sure you don't want to make any more paperclips Y/N? A: Y.

Then "Y" is clearly a suboptimal action from the paperclip making perspective. Contrast:

PM: Are you sure you don't want me to wirehead you to avoid the penalty Y/N? A: Y.

Now, these two examples seem a bit silly; if you want, we could discuss it more, and try and refine what is different about it. But my main two arguments are:

  1. Any suboptimal policy, if we look at it in a granular enough way (or replace it with an equivalent policy/environment, and look at that in a granular enough way), will include individual actions that are suboptimal (eg not budgeting more energy for the paperclip machine than is needed to make one paperclip).
  2. In consequence, IV does not distinguish between wireheading and other limited-impact not-completely-optimal policies.

Would you like to Skype or PM to resolve this issue?

Sure, let’s do that!

Is it correct that in deterministic environments with known dynamics, intent verification will cause the agent to wait until the last possible timestep in the epoch at which it can execute its plan and achieve maximal u_A?

Don’t think so in general? If it knew with certainty that it could accomplish the plan later, there is no penalty for waiting, and u_A is agnostic to waiting, we might see it in that case.

But the first action doesn't strictly improve your ability to get u_A (because you could just wait and execute the plan later), and so intent verification would give it a 1.01 penalty?

That doesn’t conflict with what I said.

It’s also fine in worlds where these properties really are true. If the agent thinks this is true (but it isn’t), it’ll start acting when it realizes. Seems like a nonissue.

Seems like a nonissue.

I'm not claiming it's an issue, I'm trying to understand what AUP does. Your response to comments is frequently of the form "AUP wouldn't do that" so afaict none of the commenters (including me) groks your conception of AUP, so I'm trying to extract simple implications and see if they're actually true in an attempt to grok it.

That doesn’t conflict with what I said.

I can't tell if you agree or disagree with my original claim. "Don’t think so in general?" implies not, but this implies you do?

If you disagree with my original claim, what's an example with deterministic known dynamics, where there is an optimal plan to achieve maximal u_A that can be executed at any time, where AUP with intent verification will execute that plan before the last possible moment in the epoch?

I agree with what you said for those environments, yeah. I was trying to express that I don’t expect this situation to be common, which is beside the point in light of your motivation for asking!

(I welcome these questions and hope my short replies don’t come off as impatient. I’m still dictating everything.)

Cool, thanks!