Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Even if we can measure how impactful an agent's actions are, how impactful do we let the agent be? This post uncovers a surprising fact: armed with just four numbers, we can set the impact level so that the agent chooses a reasonable, non-catastrophic plan on the first try. This understanding increases the competitiveness of impact-limited agents and helps us judge impact measures. Furthermore, the results help us better understand diminishing returns and cost-benefit tradeoffs.

In Reframing Impact, we meet Frank (a capable AI), whom we’ve programmed to retrieve the pinkest object he can find (execute an optimal plan, according to the specified utility function). Because we can’t ask Frank to do exactly what we want, sometimes he chooses a dangerous object (executes a catastrophically bad plan). We asked after an “impact measure” which grades plans and has three properties:

The intuition is that if we view the world in the right way, the dangerous objects are far away from Frank (the catastrophic plans are all graded as high-impact). Reframing Impact explores this kind of new way of looking at the world; this post explores what we do once we have an impact measure with these three properties.

We want Frank to keep in mind both the pinkness of an object (how good a plan is according to the specified utility function) and its distance (the plan’s impact). Two basic approaches are

In terms of units, since we should be maximizing , has type . So can be thought of as a regularization parameter, as a search radius (in the constrained case), or as an exchange rate between impact and utility (in the scaled case). As increases, high-impact plans become increasingly appealing, and Frank becomes increasingly daring.

We take to divide the impact in the scaled formulation so as to make Frank act more cautiously as increases for both formulations. The downside is that some explanations become less intuitive.
In Attainable Utility Preservation: Empirical Results, plays the same role as , except low means high ; . To apply this post's theorems to the reinforcement learning setting, we would take "utility" to be the discounted return for an optimal policy from the starting state, and "impact" to be the total discounted penalty over the course of that policy (before incorporating ).
In both cases, Frank goes from 0 to 60 eventually. For sufficiently small , doing nothing is optimal (lemma 5: the first subinterval is the best plan with minimal impact). For sufficiently large , Frank acts like a normal maximizer (corollary 7: low-impact agents are naive maximizers in the limit).

Here's how Frank selects plans in the constrained setup:

Think about which plans are best for different search radii/exchange rates . By doing this, we're partitioning the positive ray: categorizing the positive real numbers by which plans are optimal.

For the scaled setup, we'll need to quantify the pinkness (utility) and distance (impact) of relevant plans:

We will primarily be interested in the scaled setup because it tends to place catastrophes farther along the partition and captures the idea of diminishing returns.

The scaled setup also helps us choose the best way of transmuting time into money:

In this scaled partition, tending the garden doesn’t show up at all because it’s strictly dominated by mowing the lawn. In general, a plan is dominated when there’s another plan that has strictly greater score but not strictly greater impact. Dominated things never show up in either partition, and non-dominated things always show up in the constrained partition (lemma 3: constrained impact partitions are more refined).

Exercise: For (i.e. your time is worth $11.25 an hour), what is the scaled tradeoff value of mowing the lawn? Of delivering newspapers? Of tending the garden?

Mowing the lawn: .

Delivering newspapers: .

Tending the garden: .

In other words, you only deliver newspapers if your time is worth less than dollars/hour (we're flipping so we can talk about dollars/hour instead of hours/dollar). Notice that when (here, when ), the tradeoff for the paper route isn’t net-negative – but it isn’t necessarily optimal! Remember, you’re trading hours for dollars through your work; mowing the lawn leaves you with twenty bucks and three hours, while the paper route leaves you with forty dollars and no hours. You want to maximize the total value of your resources after the task.

Importantly, you don’t deliver papers here if your time is worth dollars/hour, even though that’s the naive prescription! The newspaper route doesn’t value your time at 11.25 dollars/hour – it marginally values your time at dollars per hour. Let's get some more intuition for this.

Above, we have not yet chosen a task; the blocks represent the additional utility and hours of each task compared to the current one (doing nothing). The scales above imply that , but actually, expresses how many blue blocks each pink block weighs. As increases, the pink platters descend; the agent takes the task whose scales first balance. In other words, the agent takes the best marginal deal as soon as is large enough for it to be profitable to do so (Theorem 4: Scaled domination criterion).

Once you take a deal, you take the blocks off of the other scales (because the other marginal values change). For small (i.e. large valuations of one's time), mowing the lawn is optimal. We then have

Since you've taken the juicier "lower-hanging fruit" of mowing the lawn, the new newspaper ratio is now worse! This always happens – Theorem 8: Deals get worse over time.

At first, this seems inconvenient; to figure out exactly when a plan shows up in a scaled partition, we need to generate the whole partition up to that point.

Going back to Frank, how do we set ? If we set it too high, the optimal plan might be a catastrophe. If we set it too low, the AI doesn’t do much. This seems troublesome.

Exercise: Figure out how to set while avoiding catastrophic optimal plans (assume that the impact measure meets the three properties). You have four minutes.

A big part of the answer is to start with a small value for , and slowly increase. This is simple and intuitively appealing, but how cautiously must we increase ? We don’t want to be surprised by a catastrophe suddenly becoming optimal.

To avoid being surprised by catastrophes as we increase , we want a relative buffer between the reasonable plans (which get the job done well enough for our purposes) and the catastrophic plans. If reasonable plans are optimal by , catastrophic plans shouldn’t be able to be optimal before e.g. .

We say that the partition is -buffered if (for ). If a partition is e.g. 1-buffered, there is a wide reasonable-plan range and we can inch along it without worrying about sudden catastrophe.

For the following, suppose that utility is bounded . Below is a loose criterion guaranteeing -buffering.

For example, if we know that all catastrophes have at least 10 times the impact of reasonable plans, and there's a difference of at least .3 utility between the best and worst reasonable plans, then we can guarantee 2-buffering! If we use the refined criterion of Theorem 11 (and suppose the worst reasonable plan has .4 utility), this improves to 4.5-buffering (even 2-buffering is probably overkill).

Using this theorem, we don't need to know about all of the plans which are available or to calculate the entire scaled partition, or to know how overvalued certain catastrophic plans might be (per earlier concerns). We only need a lower bound on the catastrophe/reasonable impact ratio, and an idea about how much utility is available for reasonable plans. This is exactly what we want. As a bonus, having conservative estimates of relevant quantities allows us to initialize to something reasonable on the first try (see in Theorem 11 below).

Ultimately, the reasoning about e.g. the ratio will still be informal; however, it will be informal reasoning about the right thing (as opposed to thinking "oh, the penalty is probably severe enough").

Exercise: You're preparing to launch a capable AI with a good impact measure. You and your team have a scaled impact partition which is proven 1-buffered. Suppose that this buffer suffices for your purposes, and that the other aspects of the agent design have been taken care of. You plan to initialize , modestly increasing until you get good results.

You have the nagging feeling that this process could still be unsafe, but the team lead refuses to delay the launch without specific reason. Find that reason. You have 5 minutes.

Who says is safe? The buffer is relative. You need a unit of impact by which you increment . For example, start at equalling the impact of making one paperclip, and increment by that.

Technical Appendix: Math

Let be a finite plan space, with utility function and impact measure . For generality, we leave the formalization of plans ambiguous; notice that if you replace "plan" with "snark", all the theorems still go through (likewise for "utility" and "impact"). In this post, we talk about the impact allowance (in Frank's world, the search radius) as a constraint within which the objesctive may be freely maximized, breaking ties in favor of the plan(s) with lower impact. On the other hand, many approaches penalize impact by subtracting a scaled penalty from the objective. We respectively have

We say that the former induces a "constrained impact partition" and that the latter induces a "scaled impact partition". Specifically, we partition the values of for which different (sets of) plans are optimal. We say that a plan corresponds to a subinterval if it is optimal therein (the subinterval also must be the maximal connected one such that this holds; e.g., if is optimal on , we say it corresponds to that subinterval, but not to ), and that appears in a partition if there is such a corresponding subinterval. We say that plans overlap if their corresponding subintervals intersect.

As a technical note, we partition the positive values of for which different sets of plans are optimal; in this set, each value appears exactly once, so this indeed a partition. For clarity, we will generally just talk about which plans correspond to which subintervals. Also, if no plan has zero impact, the first subinterval of the constrained impact partition will be undefined; for our purposes, this isn't important.

We want to be able to prove the "safety" of an impact partition. This means we can expect any terrorists to be some proportional distance farther away than any reasonable marbles. Therefore, for sensible ways of expanding an sufficiently small initial search radius, we expect to not meet any terrorists before finding a marble we're happy with.

In addition, we want to know how far is too far – to give upper bounds on how far away fairly pink marbles are, and lower bounds on how close terrorists might be.

Definition [-buffer]. For , an impact partition is -buffered if , where lower-bounds the first possible appearance of those plans we label 'catastrophes', and upper-bounds the first appearance of plans we deem satisfactory.

We now set out building the machinery required to prove -buffering of a scaled partition.

Lemma 1 [Plans appear at most once]. If appears in a constrained or scaled impact partition, then it corresponds to exactly one subinterval.

Proof outline. The proof for the constrained case is trivial.

For the scaled case, suppose corresponds to more than one subinterval. Consider the first two such subintervals . By definition, (otherwise they would be the same maximal connected subinterval), so there has to be at least one subinterval sandwiched in between (on almost all of which cannot be optimal; let be a plan which is optimal on ). Let , where . By definition of optimality on a subinterval,

by employing the fact that , algebraic manipulation produces an assertion that a quantity is strictly less than itself. Therefore, no such intervening can exist. □

Proposition 2 [Plan overlap is very restricted]. Suppose and appear in an impact partition which is

(a) constrained. and overlap if and only if and .

(b) scaled. If and , then and correspond to the same subinterval. If and overlap at more than one point, then and .

Proof outline. Proving (a) and the first statement of (b) is trivial (remember that under the constrained rule, ties are broken in favor of lower-impact plans).

Suppose that and overlap at more than one point. Pick the first two points of intersection, and . Since both plans are optimal at both of these points, we must have the equalities

Solving the first equality for and substituting in the second, we find . Then , since otherwise one of the plans wouldn't be optimal. □

Proposition 2b means we don't need a tie-breaking procedure for the scaled case. That is, if there's a tie between a lower-scoring, lower-impact plan and a proportionally higher-scoring, higher-impact alternative, the lower-impact plan is optimal at a single point because it's quickly dominated by the alternative.

The following result tells us that if there aren't any catastrophes (i.e., terrorists) before on the constrained impact partition, there aren't any before it on the scaled impact partition either. This justifies our initial framing with Frank.

Lemma 3 [Constrained impact partitions are more refined]. If appears in a scaled impact partition, it also appears in the corresponding constrained impact partition. In particular, if appears after in a scaled impact partition, then appears after in the corresponding constrained impact partition.

Proof. Suppose that didn't have a constrained subinterval starting inclusively at ; then clearly it wouldn't appear in the scaled impact partition, since there would be a strictly better plan for that level of impact. Then has such a subinterval.

Obviously, the fact that appears after implies . □

The converse isn't true; sometimes there's too much penalty for not enough score.

The next result is exactly what we need to answer the question just raised – it says that higher-scoring, higher-penalty plans become preferable when equals the ratio between the additional penalty and the additional score.

Theorem 4 [Scaled domination criterion]. Let and be plans such that and . In the context of the scaled penalty, is strictly preferable to when , and equally preferable at equality.

Proof outline.

Equality at the value of the right-hand side can easily be checked. □

Theorem 4 also illustrates why we can't strengthen the second statement in Proposition 2b: if two plans overlap at exactly one point, they sometimes have proportionally different score and impact, thereby satisfying the equality criterion.

At first, plans with slightly lower impact will be preferable in the scaled case, no matter how high-scoring the other plans are – a plan with 0 score and .99 impact will be selected before a plan with 1,000,000,000 score and 1 impact.

Lemma 5 [First subinterval is the best plan with minimal impact]. The plan with highest score among those with minimal impact corresponds to the first subinterval.

Proof outline. The constrained case is once again trivial (if there is no plan within the constraint, we assume that the agent does nothing / Frank returns no object).

For the scaled case, if all plans have equal impact, the claim is trivial. Otherwise, let and let be any plan with a non-minimal impact. Then the earliest that becomes preferable to any minimally impactful plan is . Since the right hand side is positive, cannot correspond to the first subinterval. Clearly the highest-scoring minimal-impact does. □

Now we can write the algorithm for constructing scaled intervals.

Discard dominated plans. The lowest-impact plan with greatest score appears first in the scaled partition; assign to it the interval .
While plans remain: Find the plan which soonest dominates the previous best plan. close off the previous plan's interval, and assign the new best plan an appropriate interval. Adjust the marginal scores and impacts of remaining plans, discarding plans with negative score.

Since this procedure is well-defined, given , , and , we can speak of the corresponding constrained or scaled impact partition. A more formal algorithm is available here. This algorithm is because of line , although constructing the constrained partition (probably due to sorting) often narrows things down significantly. Unfortunately, is usually huge.

For our purposes, we don't need the whole partition – we just want to have good reason to think that plans similar to a reasonable one we envision will appear well before any catastrophes. Perhaps we can give bounds on the earliest and latest plans can appear, and show that reasonable-bounds don't intersect with catastrophe-bounds?

Theorem 6 [Individual appearance bounds]. If appears in a scaled partition, the earliest it appears is , assuming is not of minimal impact; if it has minimal score minimal impact, it never appears. The latest it appears is , where and .

Proof outline. The two claims clearly correspond to the minimal and maximal values of according to the domination criterion; the second claim's right-hand side uses the fact that is non-negative. □

Corollary 7 [Low-impact agents are naïve maximizers in the limit]. A plan with maximal score corresponds to the last subinterval.

Proof outline. If all plans have the same score, the claim is trivial. Otherwise, let be a plan with the lowest impact of those with maximal score. In the constrained case, clearly it corresponds with the subinterval . In the scaled case, let be a plan with second-highest score. Then by Theorem 6, the latest that can appear is . Since no plans meet the domination criterion with respect to , this is the last subinterval. □

Unfortunately, Theorem 6's appearance bounds are ridiculous in realistic settings – if and return 32-bit floating-point numbers, the next-largest could easily be within , yielding an upper "bound" of . The reason: diminishing returns; this is exactly what was happening with the newspaper route before.

Theorem 8 [Deals get worse over time]. Suppose that is optimal on a subinterval, and are such that but dominates strictly before does. Then

Proof outline.

Since dominates strictly before does, we know that must get more for its : Clearly the conclusion follows, as a number cannot be expressed as the positive combination of larger numbers (the impact differences all must be positive). □

Corollary 9 [Lower bounds which aren't ridiculous]. Suppose appears and that is such that , (i.e. the preconditions of the domination criterion). Then the earliest that appears is .

This obsoletes the lower bound provided by Theorem 6.

Theorem 10 [Order of domination determines order of appearance]. If and both appear in a scaled partition and dominates some before does, then appears before .

Proof outline. For them both to appear, they can't have equal impact but unequal score, nor can they have equal score but unequal impact. For similar reasons, must have both less impact and lower score than ; the converse situation in which they both appear is disallowed by Lemma 3. Another application of this lemma yields the conclusion. □

Theorem 11 [Scaled -buffer criterion]. Let be a scaled impact partition. Suppose that there exist no catastrophic plans with impact below , and that, in the corresponding constrained partition (i.e. plans which aren't strictly worse), plans appearing with score in the satisfactory interval have impact no greater than (assume that there is at least one plan like this). Observe we have the correct bounds

When , a satisfactory plan corresponds to a subinterval with nonzero measure (i.e. not just a point), strictly preceding any catastrophes. Refine the lower bound to get .

Then is -buffered () when

In particular, if is bounded , the above turn into

Lastly, notice that the first of the two inequalities incorporates less information and is harder to satisfy (); therefore, satisfying the second inequality also satisfies the first.

Proof outline. For clarity, the theorem statement included much of the reasoning; straightforward application of existing results proves each claim. □

Exercise: Let , . Using the refined criterion, determine which catastrophe/reasonable impact ratios induce 2.6-buffering.

Exercise: Let . What is the largest for which the simple criterion can guarantee -buffering?

Even More Math

Proposition 12 [Invariances]. Let be an impact partition induced by .

(a) is invariant to translation of .

(b) If is constrained, it is invariant to positive scalar multiplication of , and the relative lengths of its subintervals are invariant to positive scalar multiplication of .

(c) If is scaled, it is invariant to concurrent positive scalar multiplication of and , and to translation of such that its image remains non-negative.

In particular, may be restricted to and translated such that at least one plan has zero impact WLOG with respect to scaled partitions.

Lemma 13. Multiple constrained subintervals are induced iff multiple scaled subintervals are induced.

Proof. Forward direction: there is at least one scaled subinterval by lemma 5. Consider a plan corresponding to a different constrained subinterval; this either appears in the scaled subinterval, or fails to appear because a different plan earlier satisfies the scaled dominance criterion. There must be some such plan because there are multiple constraints of intervals and therefore a plan offering greater score for greater impact. Repeat the argument; the plan space is finite, so we end up with another plan which appears.

The reverse direction follows by lemma 3. □

Bonus exercise: Show that, for any function preserving the ordering induced by , there exists an preserving the ordering induced by such that and induce the same scaled partition. Your reasoning should adapt directly to the corresponding statement about and .

New Comment
9 comments, sorted by Click to highlight new comments since:

I do not understand your proof for proposition 2.

Is there any particular part of it that seems locally invalid? Can you be a little more specific about what’s confusing?


Where does

come from?

Also, the equation seems to imply

Edit: I focused too much on what I suppose is a typo. Clearly you can just rewrite the the first and last equality as equality of an affine linear function

at two points, which gives you equality everywhere.

Oops, you’re right. I fixed the proof.

I would also suggest changing the last sentence of the proof to "since otherwise one of the plans wouldn't be optimal". (the next natural step my internal autocomplete expected in the proof was substituting back to the first equation, and I kept wondering for quite a while what the unspecific "one of them wouldn't appear" means in that context)

And if you'll be editing: there's probably also a typo in the second paragraph of the Technical Appendix: a confusing "N" instead of an "R" .

fixed, thank you.

Suppose the AI finds a plan with 10^50 impact and 10^1000 utility. I don't want that plan to be run. Its probably a plan that involves taking over the universe and then doing something really high utility. I think a constraint is better than a scaling factor.

Utility is bounded [0,1].

If theorem 11 is met, we’re fine. There are some good theoretical reasons not to use constraints (beyond the computational ones).

(It’s true that the buffering criterion is nice and simple for constrained partitions (the first nondominated catastrophe has times the impact of the first non-dominated reasonable plan).)