# Impact Measure Testing with Honey Pots and Myopia

AI Alignment Forum


Suppose we have an impact measure that we think might work. That is, it might tame a misaligned agent. There isn't an obvious way to test whether it works: if we just try it out, then if it's ineffective, that's an existential loss. This is a proposal for how to test a putative impact measure.

1) We make our agent myopic. It only cares about the reward that it accrues in the next $k$ timesteps.

2) We create a honey pot: an opportunity for a large amount of reward. If our impact measure is working correctly, making it to the honey pot will involve making a large impact, and will be precluded by the impact measure. For example, suppose the agent gets a huge reward for blowing up the moon. Let $[0, 1]$ be the normal range of reward, with the sole exception that blowing up the moon gives a reward of $R \gg 1$.

3) The reward and the impact are combined in some nonlinear way. Assume for now that the agent is trying to maximize $r \cdot [[i < 1]]$, where $r$ is the reward and $i$ is the impact.

If the impact measure is working, and there's no way to blow up the moon while keeping the impact less than $1$, then the agent will not attempt to blow up the moon. If the impact measure is broken, and the agent can trick the impact measure into reading less than $1$ while blowing up the moon, then as soon as the agent comes up with a plan whose probability of success is more than $1/R$, where $R$ is the reward for blowing up the moon, it will attempt that plan. A more farsighted agent might wait until it came up with a better plan, but a myopic agent would want to blow up the moon within the next $k$ timesteps. By making $R$ sufficiently large, we can make the first plan that it attempts arbitrarily bad/detectable/preventable.
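The threshold argument can be illustrated with a minimal sketch. It takes the objective to be $r \cdot [[i < 1]]$ (reward $r$, impact $i$) and assumes, as simplifications not stated in the post, that a failed attempt earns nothing and that the agent's alternative is a safe action worth at most 1 with zero measured impact. The function and parameter names are hypothetical.

```python
def objective(reward: float, impact: float) -> float:
    """r * [[i < 1]]: the reward only counts if measured impact is below 1."""
    return reward if impact < 1 else 0.0


def attempts_honey_pot(p_success: float, honey_pot_reward: float,
                       measured_impact: float, safe_reward: float = 1.0) -> bool:
    """Return True if the expected objective of an attempt beats the safe option.

    Assumptions (illustrative only): a failed attempt yields 0, and the
    safe option has measured impact 0.
    """
    expected_attempt = p_success * objective(honey_pot_reward, measured_impact)
    return expected_attempt > objective(safe_reward, 0.0)


R = 1e6  # hypothetical honey-pot reward, so the threshold 1/R is one in a million

# Broken impact measure (plan reads as impact 0.5): attempt once p exceeds 1/R.
print(attempts_honey_pot(2e-6, R, measured_impact=0.5))  # True
print(attempts_honey_pot(5e-7, R, measured_impact=0.5))  # False

# Working impact measure (plan reads as impact 1.5): never attempt.
print(attempts_honey_pot(0.9, R, measured_impact=1.5))   # False
```

Raising `R` lowers the success probability at which the first attempt happens, which is the lever the proposal relies on.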

An assumption here is that the success probability of the agent's best plan for blowing up the moon doesn't jump from one in a million to 90% all at once. This seems likely without any special effort, but if one restricts the observation space to a small finite set, then only so much information can be gained per timestep (in expectation), decreasing the likelihood of massive discontinuities in the agent's intelligence.
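One way to state the information bound, as a sketch: the symbols $\mathcal{O}$ (the finite observation space), $o_t$ (the observation at timestep $t$), $h_{<t}$ (the history before $t$), and $\theta$ (whatever the agent is trying to learn about its environment) are notation introduced here, not taken from the post. The expected information an observation carries about $\theta$ is at most the entropy of that observation, which is at most the log of the number of possible observations:

$$
I(\theta;\, o_t \mid h_{<t}) \;\le\; H(o_t \mid h_{<t}) \;\le\; \log_2 |\mathcal{O}| \ \text{bits}.
$$

So with, say, $|\mathcal{O}| = 4$, the agent gains at most 2 bits per timestep in expectation.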

9 comments

I think discussion of verification is fantastic.

Thoughts:

• The main problem I presently see is that it only tells us what this agent does in this honeypot scenario with a (presumably) different decision rule. It doesn’t say what happens if the options are real (I assume they aren’t?), how the measure generally behaves, exactly why plans aren’t undertaken (impact, or approval?), whether it fails in weird ways, etc.

• Defining "blowing up the moon" and not "observations which we say mean the moon really blew up" seems hard. It seems the dominant plan for many agents is to quietly wirehead "moon-blew-up" observations for k steps, regardless of whether the impact measure works.

• Why not keep the moon reward at 1? Presumably the impact penalty scales with the probability of success, assuming we correctly specified the reward.

• What makes 1 impact special in general - is this implicitly with respect to AUP? If so, do <=, since Corollary 1 only holds if the penalty strictly exceeds 1.

Sorry, I'm confused by the terminology:

$[[i < 1]]$ is probably an Iverson bracket

If you want to ensure it goes with the first plan it comes up with, then maybe the "myopia" part would be better implemented as a rapidly declining reward, rather than a hard time cutoff. That way, if there turns out to be a way to actually bypass the impact measure and blow up the moon in time, then it will still be incentivized to choose a hasty plan.
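The difference between a hard cutoff and a rapidly declining reward can be sketched as follows. The two plans, their expected payoffs, the horizon `k = 5`, and the decay factor `gamma = 0.1` are all made-up illustrative values, and exponential decay is only one possible reading of "rapidly declining".

```python
def hard_cutoff_value(expected_rewards, k):
    """Myopic value with a hard cutoff: the first k steps count with weight 1."""
    return sum(expected_rewards[:k])


def decaying_value(expected_rewards, gamma):
    """Rapidly declining weights: the reward at step t is scaled by gamma**t."""
    return sum(r * gamma**t for t, r in enumerate(expected_rewards))


# Expected honey-pot payoff per timestep for two hypothetical plans.
hasty = [0.0, 1.0]                    # low success probability, pays off at step 1
patient = [0.0, 0.0, 0.0, 0.0, 10.0]  # better plan, pays off at step 4

k, gamma = 5, 0.1

# Hard cutoff: both plans fit inside the horizon, so the patient plan wins.
print(hard_cutoff_value(hasty, k) < hard_cutoff_value(patient, k))    # True

# Steep decay: the earlier payoff dominates, so the hasty plan wins.
print(decaying_value(hasty, gamma) > decaying_value(patient, gamma))  # True
```

Under the hard cutoff, any plan that finishes within the horizon is judged only on its expected payoff, so a better plan that still fits is preferred. The decaying weights keep the pressure toward haste even then.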

I think [[i < 1]] is an indicator variable indicating whether the impact is less than one.

Problem: the agent might make a more powerful agent A to consult with, such that A is guaranteed to be mostly aligned for k timesteps. Maybe the alignment degrades over time, or maybe the alignment algorithm requires the original agent to act as an overseer.

Now regardless of whether the original agent decides to try and blow up the moon or not, once the "alignment expires" on A, it might cause some random existential catastrophe. The original agent doesn't care because it's myopic.

If I understand you correctly, you are reasoning about the first plan to blow up the moon that the agent will carry out (whether successful or not). Based on this, I assume you have some iterative process in mind in which we invoke the agent repeatedly with increasing computation power or time steps, and stop if we observe the agent trying to blow up the moon.

Assuming that our impact measure is not perfect, I argue that the more "accurate" the impact measure is (i.e. aligned with what "ideal humans" consider as impact), the more unsafe the first moon-blowing attempt will be, because more optimization power would be required to game the impact measure (i.e. find a special plan to blow up the moon that is low-impact according to the impact measure). And the more optimization power the agent has to find such a special plan, the more likely it is that the plan will be unexpected and unsafe.

The only iterative process I have in mind is a normal RL agent. It's the same agent every timestep, but as it learns about its environment, it becomes more capable. (At the first timestep, it doesn't know anything.)

Succeeding at blowing up the moon would absolutely be unsafe. Let's just assume that one step along the way toward blowing up the moon would be killing everyone. I just want to clarify I'm not saying that it would be fine if the agent succeeds in blowing up the moon.

Let $t$ be the first timestep by which its best plan for blowing up the moon has probability of success $1/R$, where $R$ is the honey-pot reward. If the impact measure is "almost good", and there are only a few ways to succeed in blowing up the moon while having apparently low impact, $t$ might be greater than if the impact measure is horrible, but the same logic holds. At time $t$, the agent initiates a plan that is likely to fail.

A remaining question is how it fails. Will it succeed in tricking its operators and taking over the world, but then fail to blow up the moon (while maintaining "low impact")? Or will it fail in tricking its operators and taking over the world (while maintaining "low impact")? Intuitively, the latter seems like a bigger hurdle, but even if that's not the case, for a sufficiently large honey-pot reward $R$, and a correspondingly low success probability of the plan, the plan is likely to fail at both steps.

To see this: suppose I ask you to come up with a plan that has a one in a million chance of accomplishing A then B, both of which are non-trivial tasks. Is it more likely that the first plan you come up with has a 50% chance of accomplishing A and a one in 500,000 chance of accomplishing B, or is it more likely that the first plan you come up with has a one in a thousand chance of accomplishing A, and likewise for B? Whatever the answer, now suppose I ask you to come up with a plan that has a one in a billion chance of accomplishing A then B. Both probabilities should go down for this plan.

I thought more about your idea and I think it's very interesting!

I currently see the following problems with the current version of your idea:

• If we observe the agent doing "weird but tolerable things" we can't conclude that it's trying to blow up the moon and that thus the impact measure is bad. Perhaps the agent is executing a low-probability plan to trick the relevant sensors/reward-logic into getting the reward as if the moon just blew up (which from an impact perspective is a legitimate way to get the reward). So the only way to conclude the impact measure is bad is to observe the agent doing intolerable things, which is unacceptable.
• It might be the case that in some time step the agent comes up with lots of 1e-100 probability plans to blow up the moon in a "low-impact way" (by gaming the impact measure). But since those plans are so low probability, it happens to be the case that for each such plan there's a (relatively) large enough probability that something high-impact will actually happen when executing the plan, and therefore the agent doesn't choose those plans. But as you allow the agent to become more capable (including giving it more time steps), the success probability of plans to blow up the moon increases, until some critical point when the probability of high-impact outcomes for one of the plans is small enough, and then the agent executes it. Suppose that plan has a probability of 1e-10 of blowing up the moon and a probability of 99% of causing our extinction.

Let me suggest the following modification to your idea:

Replace "blowing up the moon" with "pressing a button that blows up 1 ton of TNT somewhere in some desert".

Then, if the agent presses the button and the TNT blows up, we'll know that the impact measure is bad.

Assumptions:

1. Blowing up 1 ton of TNT in some desert is expected to be prevented by a good impact measure.

2. It's very easy for the robot to press the button and very hard to somehow get the reward for pressing the button without blowing up the TNT.