Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

How Low Should Fruit Hang Before We Pick It?

3axioman

2TurnTrout

1axioman

2TurnTrout

1OBajgar

1OBajgar

2TurnTrout

1Donald Hobson

2TurnTrout

New Comment

Is there any particular part of it that seems locally invalid? Can you be a little more specific about what’s confusing?

Where does

come from?

Also, the equation seems to imply

Edit: I focused too much on what I suppose is a typo. Clearly you can just rewrite the the first and last equality as equality of an affine linear function

at two points, which gives you equality everywhere.

I would also suggest changing the last sentence of the proof to "since otherwise one of the plans wouldn't be optimal". (the next natural step my internal autocomplete expected in the proof was substituting back to the first equation, and I kept wondering for quite a while what the unspecific "one of them wouldn't appear" means in that context)

Suppose the AI finds a plan with 10^50 impact and 10^1000 utility. I don't want that plan to be run. Its probably a plan that involves taking over the universe and then doing something really high utility. I think a constraint is better than a scaling factor.

Utility is bounded [0,1].

If theorem 11 is met, we’re fine. There are some good theoretical reasons not to use constraints (beyond the computational ones).

(It’s true that the buffering criterion is nice and simple for constrained partitions (the first nondominated catastrophe has times the impact of the first non-dominated reasonable plan).)

Even if we can measure how impactful an agent's actions are, how impactful do we let the agent be? This post uncovers a surprising fact: armed with just four numbers, we can set the impact level so that the agent chooses a reasonable, non-catastrophic plan on the first try. This understanding increases the competitiveness of impact-limited agents and helps us judge impact measures. Furthermore, the results help us better understand diminishing returns and cost-benefit tradeoffs.In

Reframing Impact, we meet Frank (a capable AI), whom we’ve programmed to retrieve the pinkest object he can find (execute an optimal plan, according to the specified utility function). Because we can’t ask Frank to doexactlywhat we want, sometimes he chooses a dangerous object (executes a catastrophically bad plan). We asked after an “impact measure” which grades plans and has three properties:The intuition is that if we view the world in the right way, the dangerous objects are far away from Frank (the catastrophic plans are all graded as high-impact).

Reframing Impactexplores this kind of new way of looking at the world; this post explores what we do once we have an impact measure with these three properties.We want Frank to keep in mind both the pinkness of an object (how good a plan is according to the specified utility function) and its distance (the plan’s impact). Two basic approaches are

In terms of units, since we should be maximizing utility, Rhas type impactutility. So R can be thought of as a regularization parameter, as a search radius (in the constrained case), or as an exchange rate between impact and utility (in the scaled case). As R increases, high-impact plans become increasingly appealing, and Frank becomes increasingly daring.

Here's how Frank selects plans in the constrained setup:

Think about which plans are best for different search radii/exchange rates R. By doing this, we're

partitioningthe positive ray: categorizing the positive real numbers by which plans are optimal.For the scaled setup, we'll need to quantify the pinkness (utility) and distance (impact) of relevant plans:

We will primarily be interested in the scaled setup because it tends to place catastrophes farther along the partition and captures the idea of diminishing returns.

The scaled setup also helps us choose the best way of transmuting time into money:

Exercise: ForR=445(i.e. your time is worth $11.25 an hour), what is the scaled tradeoff value of mowing the lawn? Of delivering newspapers? Of tending the garden?Mowing the lawn: 20−1445=8.75.

Delivering newspapers: 45−4445=0.

Tending the garden: 15−1445=3.75.

In other words, you only deliver newspapers if your time is worth less than 253=813 dollars/hour (we're flipping R so we can talk about dollars/hour instead of hours/dollar). Notice that when R≥impact(plan)utility(plan) (here, when R=445), the tradeoff for the paper route isn’t net-negative – but it isn’t necessarily optimal! Remember, you’re trading hours for dollars through your work; mowing the lawn leaves you with twenty bucks and three hours, while the paper route leaves you with forty dollars and no hours. You want to maximize the total value of your resources after the task.

Importantly, you

don’tdeliver papers here if your time is worth 454=11.25dollars/hour, even though that’s the naive prescription! The newspaper route doesn’t value your time at 11.25 dollars/hour – itmarginallyvalues your time at 45−204−1=813dollars per hour. Let's get some more intuition for this.Above, we have not yet chosen a task; the blocks represent the additional utility and hours of each task compared to the current one (doing nothing). The scales above imply that R=1, but actually, R expresses how many blue blocks each pink block weighs. As R increases, the pink platters descend; the agent takes the task whose scales first balance. In other words, the agent takes the best marginal deal as soon as R is large enough for it to be profitable to do so (Theorem 4: Scaled domination criterion).

Once you take a deal, you take the blocks off of the other scales (because the other marginal values change). For small R (i.e. large valuations of one's time), mowing the lawn is optimal. We then have

Since you've taken the juicier "lower-hanging fruit" of mowing the lawn, the new newspaper ratio is now

worse! This always happens – Theorem 8: Deals get worse over time.At first, this seems inconvenient; to figure out exactly when a plan shows up in a scaled partition, we need to generate the whole partition up to that point.

Going back to Frank, how do we set R? If we set it too high, the optimal plan might be a catastrophe. If we set it too low, the AI doesn’t do much. This seems troublesome.

Exercise: Figure out how to setRwhile avoiding catastrophic optimal plans (assume that the impact measure meets the three properties). You have four minutes.A big part of the answer is to start with a small value for R,and slowly increase. This is simple and intuitively appealing, but how cautiously must we increase R?We don’t want to be surprised by a catastrophe suddenly becoming optimal.

To avoid being surprised by catastrophes as we increase R, we want a

relative bufferbetween the reasonable plans (which get the job done well enough for our purposes) and the catastrophic plans. If reasonable plans are optimal by R1, catastrophic plans shouldn’t be able to be optimal before e.g. R2.We say that the partition is α-buffered if R2≥(1+α)R1 (for α>0). If a partition is e.g. 1-buffered, there is a wide reasonable-plan range and we can inch along it without worrying about sudden catastrophe.

For the following, suppose that utility is bounded [0,1]. Below is a loose criterion guaranteeing α-buffering.

For example, if we know that all catastrophes have at least 10 times the impact of reasonable plans, and there's a difference of at least .3 utility between the best and worst reasonable plans, then we can guarantee 2-buffering! If we use the refined criterion of Theorem 11 (and suppose the worst reasonable plan has .4 utility), this improves to

4.5-buffering (even 2-buffering is probably overkill).Using this theorem, we don't need to know about all of the plans which are available or to calculate the entire scaled partition, or to know how overvalued certain catastrophic plans might be (per earlier concerns). We only need a lower bound on the catastrophe/reasonable impact ratio, and an idea about how much utility is available for reasonable plans. This is exactly what we want. As a bonus, having conservative estimates of relevant quantities allows us to initialize R to something reasonable on the first try (see RUB: satisfactory in Theorem 11 below).

Ultimately, the reasoning about e.g. the ratio will still be informal; however, it will be informal reasoning about the

right thing(as opposed to thinking "oh, the penalty isprobablysevere enough").Exercise: You're preparing to launch a capable AI with a good impact measure. You and your team have a scaled impact partition which is proven 1-buffered. Suppose that this buffer suffices for your purposes, and that the other aspects of the agent design have been taken care of. You plan to initialize R:=1, modestly increasing until you get good results.

You have the nagging feeling that this process could still be unsafe, but the team lead refuses to delay the launch without specific reason. Find that reason. You have 5 minutes.

Who says R=1 is safe? The buffer is

relative. You need aunitof impact by which you increment R. For example, start at R equalling the impact of making one paperclip, and increment by that.## Technical Appendix: Math

Let ¯A be a finite plan space, with utility function u:¯A→R and impact measure I:¯A→R≥0. For generality, we leave the formalization of plans ambiguous; notice that if you replace "plan" with "snark", all the theorems still go through (likewise for "utility" and "impact"). In this post, we talk about the impact allowance R>0 (in Frank's world, the search radius) as a constraint within which the objesctive may be freely maximized, breaking ties in favor of the plan(s) with lower impact. On the other hand, many approaches penalize impact by subtracting a scaled penalty from the objective. We respectively have

We say that the former induces a "constrained impact partition" and that the latter induces a "scaled impact partition". Specifically, we partition the values of R for which different (sets of) plans are optimal. We say that a plan ¯a

correspondsto a subinterval if it is optimal therein (the subinterval also must be the maximal connected one such that this holds;e.g., if ¯a is optimal on (0,1], we say it corresponds to that subinterval, but not to (0,.5]), and that ¯aappearsin a partition if there is such a corresponding subinterval. We say that plansoverlapif their corresponding subintervals intersect.We want to be able to

provethe "safety" of an impact partition. This means we can expect any terrorists to be some proportional distance farther away than any reasonable marbles. Therefore, for sensible ways of expanding an sufficiently small initial search radius, we expect to not meet any terrorists before finding a marble we're happy with.In addition, we want to know how far is too far – to give upper bounds on how far away fairly pink marbles are, and lower bounds on how close terrorists might be.

Definition [α-buffer].For α>0, an impact partition is α-buffered if RLB: catastropheRUB: satisfactory≥1+α, where RLB: catastrophe lower-bounds the first possible appearance of those plans we label 'catastrophes', and RUB: satisfactory upper-bounds the first appearance of plans we deem satisfactory.We now set out building the machinery required to prove α-buffering of a scaled partition.

Lemma 1 [Plans appear at most once].If ¯a appears in a constrained or scaled impact partition, then it corresponds to exactly one subinterval.Proof outline.The proof for the constrained case is trivial.For the scaled case, suppose ¯a corresponds to more than one subinterval. Consider the first two such subintervals s1,s3. By definition, s1∩s3=∅ (otherwise they would be the same maximal connected subinterval), so there has to be at least one subinterval s2 sandwiched in between (on almost all of which ¯a cannot be optimal; let ¯a′ be a plan which

isoptimal on s2). Let R1∈s1,R2∈s2,R3∈s3, where R2∉s1∪s3. By definition of optimality on a subinterval,by employing the fact that R1<R2<R3, algebraic manipulation produces an assertion that a quantity is strictly less than itself. Therefore, no such intervening s2 can exist. □

Proposition 2 [Plan overlap is very restricted].Suppose ¯a and ¯a′ appear in an impact partition which is(a)

constrained.¯a and ¯a′ overlap if and only if I(¯a)=I(¯a′) and u(¯a)=u(¯a′).(b)

scaled.If I(¯a)=I(¯a′) and u(¯a)=u(¯a′), then ¯a and ¯a′ correspond to the same subinterval. If ¯a and ¯a′ overlap at more than one point, then I(¯a)=I(¯a′) and u(¯a)=u(¯a′).Proof outline.Proving (a) and the first statement of (b) is trivial (remember that under the constrained rule, ties are broken in favor of lower-impact plans).Suppose that ¯a and ¯a′ overlap at more than one point. Pick the first two points of intersection, R1 and R2. Since both plans are optimal at both of these points, we must have the equalities

Solving the first equality for u(¯a) and substituting in the second, we find I(¯a)=I(¯a′). Then u(¯a)=u(¯a′), since otherwise one of the plans wouldn't be optimal. □

Proposition 2b means we don't need a tie-breaking procedure for the scaled case. That is, if there's a tie between a lower-scoring, lower-impact plan and a proportionally higher-scoring, higher-impact alternative, the lower-impact plan is optimal at a single point because it's quickly dominated by the alternative.

The following result tells us that if there aren't any catastrophes (

i.e., terrorists) before ¯a′ on the constrained impact partition,there aren't any before it on the scaled impact partition either. This justifies our initial framing with Frank.Lemma 3 [Constrained impact partitions are more refined].If ¯a appears in a scaled impact partition, it also appears in the corresponding constrained impact partition. In particular, if ¯a′ appears after ¯a in a scaled impact partition, then ¯a′ appears after ¯a in the corresponding constrained impact partition.Proof.Suppose that ¯a didn't have a constrained subinterval starting inclusively at I(¯a); then clearly it wouldn't appear in the scaled impact partition, since there would be a strictly better plan for that level of impact. Then ¯a has such a subinterval.Obviously, the fact that ¯a′ appears after ¯a implies u(¯a′)>u(¯a). □

The converse isn't true; sometimes there's too much penalty for not enough score.

The next result is exactly what we need to answer the question just raised – it says that higher-scoring, higher-penalty plans become preferable when R equals the ratio between the additional penalty and the additional score.

Theorem 4 [Scaled domination criterion].Let ¯a and ¯a′ be plans such that u(¯a′)>u(¯a) and I(¯a′)≥I(¯a). In the context of the scaled penalty, ¯a′ is strictly preferable to ¯a when R>I(¯a′)−I(¯a)u(¯a′)−u(¯a), and equally preferable at equality.Proof outline.Equality at the value of the right-hand side can easily be checked. □

Theorem 4 also illustrates why we can't strengthen the second statement in Proposition 2bplan overlap is very restricted: if two plans overlap at exactly one point, they sometimes have proportionally different score and impact, thereby satisfying the equality criterion.

At first, plans with slightly lower impact will be preferable in the scaled case, no matter how high-scoring the other plans are – a plan with 0 score and .99 impact will be selected before a plan with 1,000,000,000 score and 1 impact.

Lemma 5 [First subinterval is the best plan with minimal impact].The plan with highest score among those with minimal impact corresponds to the first subinterval.Proof outline.The constrained case is once again trivial (if there is no plan within the constraint, we assume that the agent does nothing / Frank returns no object).For the scaled case, if all plans have equal impact, the claim is trivial. Otherwise, let M:=max¯a|u(¯a)| and let ¯a′ be any plan with a non-minimal impact. Then the earliest that ¯a′ becomes preferable to any minimally impactful plan ¯a is R≥I(¯a′)−I(¯a)2M. Since the right hand side is positive, ¯a′ cannot correspond to the first subinterval. Clearly the highest-scoring minimal-impact ¯a does. □

Now we can write the algorithm for constructing scaled intervals.

Since this procedure is well-defined, given ¯A, u, and I, we can speak of

thecorresponding constrained or scaled impact partition. A more formal algorithm is available here. This algorithm is O(|¯A|2) because of line 7, although constructing the constrained partition (probably O(|¯A|log|¯A|) due to sorting) often narrows things down significantly. Unfortunately, ¯A is usually huge.For our purposes, we don't

needthe whole partition – we just want to have good reason to think that plans similar to a reasonable one we envision will appear well before any catastrophes. Perhaps we can give bounds on the earliest and latest plans can appear, and show that reasonable-bounds don't intersect with catastrophe-bounds?Theorem 6 [Individual appearance bounds].If ¯a appears in a scaled partition, the earliest it appears is I(¯a)−Inext-largestu(¯a)−minu(¯a′), assuming ¯a is not of minimal impact; if it has minimal score minimal impact, it never appears. The latest it appears is I(¯a)−minI(¯a′)u(¯a)−unext-largest≤I(¯a)u(¯a)−unext-largest, where unext-largest=max¯a′∈¯A;u(¯a′)<u(¯a)u(¯a′) and Inext-largest=max¯a′∈¯A;I(¯a′)<I(¯a)I(¯a′).Proof outline.The two claims clearly correspond to the minimal and maximal values of R according to the domination criterion; the second claim's right-hand side uses the fact that I is non-negative. □Corollary 7 [Low-impact agents are naïve maximizers in the limit].A plan with maximal score corresponds to the last subinterval.Proof outline.If all plans have the same score, the claim is trivial. Otherwise, let ¯abest be a plan with the lowest impact of those with maximal score. In the constrained case, clearly it corresponds with the subinterval [I(¯abest),∞). In the scaled case, let ¯asecond-best be a plan with second-highest score. Then by Theorem 6, the latest that ¯abest can appear is I(¯abest)u(¯abest)−u(¯asecond-best). Since no plans meet the domination criterion with respect to ¯abest, this is the last subinterval. □Unfortunately, Theorem 6's appearance bounds are ridiculous in realistic settings – if u and I return 32-bit floating-point numbers, the next-largest could easily be within 10−7, yielding an upper "bound" of I(¯a)×107. The reason: diminishing returns; this is exactly what was happening with the newspaper route before.

Theorem 8 [Deals get worse over time].Suppose that ¯a is optimal on a subinterval, and ¯b,¯c are such that u(¯c)>u(¯b) but ¯b dominates ¯a strictly before ¯c does. ThenProof outline.Since ¯b dominates ¯a strictly before ¯c does, we know that ¯b must get more bang for its buck: u(¯b)−u(¯a)I(¯b)−I(¯a)>u(¯c)−u(¯a)I(¯c)−I(¯a). Clearly the conclusion follows, as a number cannot be expressed as the positive combination of larger numbers (the impact differences all must be positive). □

Corollary 9 [Lower bounds which aren't ridiculous].Suppose ¯a appears and that ¯a′ is such that u(¯a′)>u(¯a), I(¯a′)≥I(¯a) (i.e.the preconditions of the domination criterion). Then the earliest that ¯a′ appears is R=I(¯a′)−I(¯a)u(¯a′)−u(¯a).This obsoletes the lower bound provided by Theorem 6Individual appearance bounds.

Theorem 10 [Order of domination determines order of appearance].If ¯b and ¯c both appear in a scaled partition and ¯b dominates some ¯a before ¯c does, then ¯b appears before ¯c.Proof outline.For them both to appear, they can't have equal impact but unequal score, nor can they have equal score but unequal impact. For similar reasons, ¯b must have both less impact and lower score than ¯c; the converse situation in which they both appear is disallowed by Lemma 3Constrained impact partitions are more refined. Another application of this lemma yields the conclusion. □Theorem 11 [Scaledα-buffer criterion].Let P be a scaled impact partition. Suppose that there exist no catastrophic plans with impact below ILB: cat, and that, in the corresponding constrained partition (i.e.plans which aren't strictly worse), plans appearing with score in the satisfactory interval [uLB: sat,uUB: sat] have impact no greater than IUB: sat (assume that there is at least one plan like this). Observe we have the correct boundsWhen RLB: catastrophe>RUB: satisfactory, a satisfactory plan corresponds to a subinterval with nonzero measure (

i.e.not just a point), strictly preceding any catastrophes. Refine the lower bound to get RLB': catastrophe:=ILB: cat−IUB: satumax−uLB: sat.Then P is α-buffered (α>0) when

In particular, if u is bounded [0,1], the above turn into

Lastly, notice that the first of the two inequalities incorporates less information and is harder to satisfy (RLB': catastrophe>RLB: catastrophe); therefore, satisfying the second inequality also satisfies the first.

Proof outline.For clarity, the theorem statement included much of the reasoning; straightforward application of existing results proves each claim. □Exercise: LetuUB: sat=.7, uLB: sat=.5. Using the refined criterion, determine which catastrophe/reasonable impact ratios induce 2.6-buffering.ratio≥10

Exercise: Let uUB: sat−uLB: sat=.5,ratio=7. What is the largest α for which the simple criterion can guarantee α -buffering?α=13

## Even More Math

Proposition 12 [Invariances].Let P be an impact partition induced by (¯A,u,I).(a) P is invariant to translation of u.

(b) If P is constrained, it is invariant to positive scalar multiplication of u, and the relative lengths of its subintervals are invariant to positive scalar multiplication of I.

(c) If P is scaled, it is invariant to concurrent positive scalar multiplication of u and I, and to translation of I such that its image remains non-negative.

In particular, u may be restricted to [0,1] and I translated such that at least one plan has zero impact WLOG with respect to scaled partitions.

Lemma 13.Multiple constrained subintervals are induced iff multiple scaled subintervals are induced.Proof.Forward direction: there is at least one scaled subinterval by lemma 5First subinterval is the best plan with minimal impact. Consider a plan corresponding to a different constrained subinterval; this either appears in the scaled subinterval, or fails to appear because a different plan earlier satisfies the scaled dominance criterion. There must be some such plan because there are multiple constraints of intervals and therefore a plan offering greater score for greater impact. Repeat the argument; the plan space is finite, so we end up with another plan which appears.The reverse direction follows by lemma 3Constrained impact partitions are more refined. □

Bonus exercise:Show that, foranyfunction u′:¯A→R preserving the ordering induced by u, there exists an I′:¯A→R≥0 preserving the ordering induced by I such that (¯A,u,I) and (¯A,u′,I′) induce the same scaled partition. Your reasoning should adapt directly to the corresponding statement about I′:¯A→R≥0 and I.