*A putative new idea for AI control; index here.*

*Thanks for help from Paul Christiano*

If clippy, the paper-clip maximising AI, goes out of control, it would fill the universe with paper clips (or with better and better ways of counting the paper-clips it already has). If I sit down to a game with Deep Blue, then I know little about what will happen in the game, but I know it will end with me losing.

When facing a (general or narrow) superintelligent AI, the most relevant piece of information is what the AI's goals are. That's the general problem: there is no such thing as 'reduced impact' for such an AI. It doesn't matter who the next president of the United States is, if an AI wants to tile the universe with little smiley faces. But reduced impact is something we would dearly want to have - it gives us time to correct errors, perfect security systems, maybe even bootstrap our way to friendly AI from a non-friendly initial design. The most obvious path to coding reduced impact is to build a satisficer rather than a maximiser - but that proved unlikely to work.

But that ruthless maximising aspect of AIs may give us a way of quantifying 'reduced impact' - and hence including it in AI design. The central point being:

"When facing a (non-reduced impact) superintelligent AI, the AI's motivation is the most important fact we know."

Hence, conversely:

"If an AI has reduced impact, then knowing its motivation isn't particularly important. And a counterfactual world where the AI didn't exist, would not be very different from the one in which it does."

In this post, I'll be presenting some potential paths to formalising this intuition into something computable, giving us a numerical measure of impact that can be included in the AI's motivation to push it towards reduced impact. I'm putting this post up mainly to get help: does anyone know of already developed mathematical or computational tools that can be used to put these approaches on a rigorous footing?

## Counterfactuals

To define a counterfactual, we can use a variant of my favourite "send the effect through a stochastic event that has a small probability of undoing it". Let's start by imagining there is an initial (superintelligent) AI, the master, that gets one action only: to produce the source code for another AI, the disciple, that will then go out into the world and do stuff. The source code is passed through a wire, and a stochastic event X happens: if X=1, the source code goes through unchanged, if X=0, the source code is erased. Generally speaking, we want X=1 with high probability; the counterfactual "what if the disciple AI were not created" is reduced to "what if X=0". There are some problems with this approach - the AI needs a good ontology, it needs to locate X in the world - but these are much simpler issues than those raised by other approaches to counterfactuals.

## Penalty functions

The master AI has a double objective: to maximise some utility function, and to minimise a penalty function, which measures how much impact its decision has on the world. Assuming the utility function is bounded, it all becomes a question of weighting the utility and penalty functions so that the disciple AI accomplishes some small goal, but without transforming the universe into computronium along the way.

For illustrative purposes, let the master AI be a bounded paperclipper, with a utility function linear in created paper clips up to a thousand (and constant thereafter). The disciple AI will be left to wander round an open factory to accomplish its objectives, without any extra precautions.
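As a minimal sketch of this setup, the bounded utility and its weighting against a penalty might look like the following. The function names, the additive form `utility - weight * penalty`, and all the numbers are illustrative assumptions of mine, not a settled design; the penalty itself is left abstract.

```python
def bounded_utility(paperclips: int, cap: int = 1000) -> float:
    """Utility linear in created paper clips up to `cap`, constant thereafter."""
    return float(min(max(paperclips, 0), cap))


def penalised_objective(paperclips: int, penalty: float, weight: float) -> float:
    """Bounded utility minus a weighted impact penalty.

    The additive combination and `weight` are assumptions made for
    illustration; how the penalty is computed is not specified here.
    """
    return bounded_utility(paperclips) - weight * penalty


# Hypothetical disciple designs: (expected paper clips, impact penalty).
modest = penalised_objective(paperclips=1000, penalty=0.5, weight=100.0)
grandiose = penalised_objective(paperclips=10**9, penalty=50.0, weight=100.0)
```

Because the utility is capped, the extra paper clips buy the grandiose design nothing, so with any positive weight its larger penalty makes it score worse than the modest design.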

For this model, we will assume we can ignore the long-term social impact of successfully deploying reduced impact AIs. This is a somewhat tricky point, and it's not clear at all how we would do so - maybe some cunning use of AI indifference, or a commitment to running a million different reduced impact AI experiments, which could conceivably reduce the social impact of a single run. Of course, once these reduced impact AIs are in general use, then any given run will make little difference to the social environment, so this problem is especially acute only for the initial runs.

The challenge then is in the definition of the penalty function. When considering whether to output source code S, clippy can compute two probability distributions over future worlds: P(e,S,1), given evidence e and assuming X=1, and P(e,S,0), given the same evidence and assuming X=0. The X=0 case is taken as 'safe', so it would seem that the 'difference' between P(e,S,1) and P(e,S,0) is a good measure of impact.

## Coarse graining

An initial impulse would be to compare P(e,S,1) and P(e,S,0) directly: for a world w, calculate the difference |P(e,S,1)(w)-P(e,S,0)(w)|, and sum or integrate this difference somehow over all worlds, and use this as a measure of impact. Unfortunately this will generally not work: there is a crucial difference between the X=0 and X=1 worlds, namely the presence of the active disciple AI doing its stuff. So generally, world w will have probability (very close to) zero in either P(e,S,1) or P(e,S,0), depending on whether w includes this disciple or not. So direct comparisons fail.

One solution would be to coarse grain: to say that worlds w_{1} (with X=1) and w_{0} (with X=0) are otherwise so similar that we can bunch them together in one category W, and compute |P(e,S,1)(W)-P(e,S,0)(W)|. The intuition behind this can be formalised as:

"Independently of whether the disciple AI was built, similar worlds should be similarly likely."

*Twenty (million) questions*

One method of coarse graining is to set up a huge number of auxiliary measurements (such as the GDP of every country, the volume of sewage pumped through every city, ticket prices for travelling circuses, the quantity of pesticides used for growing strawberries, the number of lolcats viewed each minute, etc...) and use those to coarse grain: worlds with very similar results on these scales are grouped together. This is similar to the 'proxy measure of human survival' we mentioned in our Oracle AI article.

This is by no means a fool-proof method - the disciple AI could theoretically take over the cosmos while ensuring that there is little difference in GDP or lolcat views - or, more plausibly, little difference in whatever measurement method we have implemented to bring the level of GDP and lolcat views to the awareness of the AI. But the more auxiliary measurements we add, and the more varied the ways of estimating them we require (or if we even give the AI a good enough ontology that its beliefs are really about states in the world), the more likely it becomes that the master AI will decide the risk isn't worth the gain, given its bounded utility function, and that a disciple AI that builds a few thousand paper-clips and lets it go at that is a good design.

Still, the coarse graining might be somewhat unsatisfactory: the measurements might all be hackable, or they might result in grouping together of worlds that we would find very different, or the separation of worlds that we consider virtually identical. Both mistakes are bad, as they reduce the ability of numbers such as ∫|P(e,S,1)(W)-P(e,S,0)(W)|dW to actually measure reduced impact.
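To make the contrast between direct and coarse-grained comparison concrete, here is a toy sketch. Worlds are hypothetical tuples `(X, gdp, lolcats)`, the bucketing by rounding is just one assumed way of grouping "very similar results", and the integral is replaced by a sum over finitely many worlds. All numbers are made up.

```python
from collections import defaultdict


def l1_distance(p: dict, q: dict) -> float:
    """Sum of |p(w) - q(w)| over every world appearing in either distribution."""
    return sum(abs(p.get(w, 0.0) - q.get(w, 0.0)) for w in set(p) | set(q))


def coarse_grain(dist: dict, bucket) -> dict:
    """Pool the probability of all worlds that fall into the same bucket."""
    pooled = defaultdict(float)
    for world, prob in dist.items():
        pooled[bucket(world)] += prob
    return dict(pooled)


def by_measurements(world):
    """Hypothetical bucket: forget X, round GDP to 100s and lolcats to 1000s."""
    _x, gdp, lolcats = world
    return (round(gdp, -2), round(lolcats, -3))


# Toy distributions: worlds are (X, gdp, lolcats per minute).
p_with = {(1, 1510, 20300): 0.6, (1, 990, 19800): 0.4}
p_without = {(0, 1490, 19900): 0.5, (0, 1010, 20100): 0.5}

direct = l1_distance(p_with, p_without)
coarse = l1_distance(coarse_grain(p_with, by_measurements),
                     coarse_grain(p_without, by_measurements))
```

The direct comparison saturates at its maximum value of 2, since no world appears in both distributions, while the coarse-grained one comes out at roughly 0.2. Of course, everything here depends on the choice of bucket, which is exactly where the hackability worry bites.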

Note that |P(e,S,1)(W)-P(e,S,0)(W)| should be read as 'estimate P(e,S,1)(W), then estimate P(e,S,0)(W), then take the modulus of their difference'. The alternative - estimating |P(e,S,1)(W)-P(e,S,0)(W)| directly - has problems with chaos. For instance, if I suspect that X=1 causes a hurricane somewhere else in the world than it would be for X=0, then even if I have no clue as to where the hurricane happens, I would still know that |P(e,S,1)(W)-P(e,S,0)(W)| is large.
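The hurricane point can be illustrated with a small simulation. This is only my toy reading of it: I assume a fixed number of possible hurricane sites, and that X=1 always moves the hurricane, but to a site I cannot predict. The function and variable names are hypothetical.

```python
import random


def compare_estimates(n_sites: int = 10, n_samples: int = 10_000, seed: int = 0):
    """Toy model: X=1 always moves the hurricane, but to an unpredictable site."""
    rng = random.Random(seed)
    counts_without = [0] * n_sites  # hurricane sites in sampled X=0 worlds
    counts_with = [0] * n_sites     # hurricane sites in the paired X=1 worlds
    pairs_differing = 0
    for _ in range(n_samples):
        site_without = rng.randrange(n_sites)
        shift = rng.randrange(1, n_sites)  # never zero: the site always changes
        site_with = (site_without + shift) % n_sites
        counts_without[site_without] += 1
        counts_with[site_with] += 1
        pairs_differing += site_with != site_without
    # Estimate each distribution separately, then take the difference.
    separate = sum(abs(a - b) for a, b in zip(counts_with, counts_without)) / n_samples
    # Compare each X=1 scenario directly with its own X=0 counterpart.
    paired = pairs_differing / n_samples
    return separate, paired


separate_gap, paired_gap = compare_estimates()
```

In this model the paired comparison reports a difference in every single scenario, while the two separately estimated distributions are both close to uniform over sites, so their difference is small (up to sampling noise). The 'estimate separately, then subtract' reading is the one that ignores chaotic reshuffling.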

## Comparing distributions abstractly

We could completely ignore similarity between worlds, and make no attempt to match up worlds with X=1 with those with X=0. Instead, we could just consider the distributions P(e,S,1) and P(e,S,0) abstractly, and see if they look similar, without knowing anything about the worlds they are defined on. The intuition behind this is:

"Independently of whether the disciple AI was built, we should expect the same shape to future uncertainty."

There are many measures we could use to measure similarity between P(e,S,1) and P(e,S,0): we can treat them as random variables and compute all the usual descriptors: mean, variance, higher cumulants, max probability reached, number and shape of peaks, sexiness of the graph...
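As a rough sketch of what 'descriptors that ignore the underlying worlds' could mean, the following computes a few label-free summaries of a finite distribution. The particular descriptors chosen here (maximum probability, support size, Shannon entropy) are my own picks for illustration, and the toy distributions are made up.

```python
import math


def shape_descriptors(dist: dict) -> dict:
    """Summaries of a distribution that ignore which world has which probability."""
    probs = sorted((p for p in dist.values() if p > 0), reverse=True)
    return {
        "max_prob": probs[0],
        "support_size": len(probs),
        "entropy_bits": -sum(p * math.log2(p) for p in probs),
    }


# Hypothetical toy distributions over disjoint sets of worlds.
with_disciple = {"w1_a": 0.5, "w1_b": 0.25, "w1_c": 0.25}
without_disciple = {"w0_a": 0.25, "w0_b": 0.5, "w0_c": 0.25}

same_shape = shape_descriptors(with_disciple) == shape_descriptors(without_disciple)
```

The two distributions share no worlds at all, yet they have identical descriptors, so this kind of comparison sidesteps the X=1 versus X=0 mismatch entirely. The price is that it also cannot tell apart two very different futures that happen to have the same shape.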

Similarly, given some arbitrary measure M on worlds, we can twist the distributions P(e,S,*) into P'(e,S,*) by defining P'(e,S,*)(w)=M(w)P(e,S,*)(w) and then normalising. If we expect that M should be independent of a reduced impact AI (such as the Kolmogorov prior, for instance), then the distributions P'(e,S,*) should also have similar descriptors.
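The twisting operation is simple enough to write down for a finite set of worlds. In this sketch the weighting `toy_simplicity` is a made-up stand-in that favours worlds with shorter names; it is emphatically not a real complexity prior, just something to make the arithmetic visible.

```python
def twist(dist: dict, measure) -> dict:
    """Reweight each world's probability by measure(world), then normalise."""
    weighted = {w: measure(w) * p for w, p in dist.items()}
    total = sum(weighted.values())
    if total <= 0:
        raise ValueError("measure gives zero weight to every world in the support")
    return {w: v / total for w, v in weighted.items()}


def toy_simplicity(world: str) -> float:
    """Hypothetical weighting: halve the weight for each character in the name."""
    return 2.0 ** -len(world)


original = {"a": 0.5, "bb": 0.5}
twisted = twist(original, toy_simplicity)
```

Here the two worlds start out equally likely, and after twisting the 'simpler' world `"a"` carries two thirds of the probability. The twisted distributions can then be fed into whatever descriptors are used for the untwisted ones.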

*Information theoretic comparisons*

Or we could get information theoretical, and measure how much more information we would need to describe P(e,S,1), given that we know P(e,S,0) (and vice versa). This is somewhat similar to the Kullback-Leibler divergence, but that measure requires matching up the worlds for the two distributions, and hence can't be used here: likely worlds in P(e,S,1) are not likely in P(e,S,0), and vice versa, because of the X=1 versus X=0 issue.
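To see concretely why the Kullback-Leibler divergence is no help, here is the standard definition applied to toy distributions whose worlds are tagged with X (the world labels and probabilities are made up):

```python
import math


def kl_divergence(p: dict, q: dict) -> float:
    """D_KL(p || q) in bits; infinite if p puts mass where q has none."""
    total = 0.0
    for world, pw in p.items():
        if pw == 0:
            continue
        qw = q.get(world, 0.0)
        if qw == 0:
            return math.inf
        total += pw * math.log2(pw / qw)
    return total


# Every world records the value of X, so the supports are disjoint.
p_with = {("X=1", "calm"): 0.7, ("X=1", "storm"): 0.3}
p_without = {("X=0", "calm"): 0.7, ("X=0", "storm"): 0.3}

divergence = kl_divergence(p_with, p_without)
```

The divergence is infinite even though the two distributions are, intuitively, as similar as they could possibly be. Any disciple design at all would get the same infinite score, so the measure cannot distinguish low impact from high.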

Roughly, what we would want is that P(e,S,1) and P(e,S,0) have roughly similar minimal description length (the expected future of the world doesn't get simpler or more complicated based on the disciple AI) and that the description of P(e,S,0) given P(e,S,1) is very short in comparison (if we know the world of the disciple AI, we have a good idea of the world without it).
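One very crude way to play with these two conditions is to use an off-the-shelf compressor as a stand-in for description length. This is an assumption on my part: compressed size is only a loose upper bound on minimal description length, and the byte strings here are hypothetical serialisations of the distributions, not anything an actual AI would produce.

```python
import random
import zlib


def description_length(blob: bytes) -> int:
    """Compressed size in bytes: a crude upper bound on description length."""
    return len(zlib.compress(blob, 9))


def conditional_length(target: bytes, given: bytes) -> int:
    """Rough extra bytes needed to describe `target` once `given` is known."""
    return description_length(given + target) - description_length(given)


rng = random.Random(0)
# Hypothetical serialisation of P(e,S,0): 2000 hard-to-compress bytes.
world_without = bytes(rng.getrandbits(8) for _ in range(2000))
# Hypothetical P(e,S,1): the same description with a handful of bytes flipped.
changed = bytearray(world_without)
for position in rng.sample(range(len(changed)), 5):
    changed[position] ^= 0xFF
world_with = bytes(changed)

alone = description_length(world_with)
given_other = conditional_length(world_without, given=world_with)
```

In this toy case both descriptions have nearly the same length on their own, and describing the world without the disciple given the world with it costs only a small fraction of that, which is the pattern we would hope to see for a reduced impact design.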

Of course, the shortest description of P(e,S,1) might just be the source code of the master AI along with X=1. This is not at all what we want, so we might require that all these descriptions be considerably shorter than that, possibly allowing some destructive compression of the distributions to permit this.

## The information in the evidence

There is a third route we could travel: when constructing the P(e,S,*), the AI made use of the evidence e (all its previous observations) as well as the value of S and X. For minimised impact, we would want the value of X to not be particularly informative: we would require e to provide a lot more information than X. Updating on evidence can be seen as maps (e,X)->P(e,S,X), and we would want to put norms on the possible e and the possible P such that X is not very informative compared to e.

This can be hard to quantify; obviously, if we know nothing about the world at all, then knowing X tells us nothing. On the other hand, once we've seen twenty trillion lolcat videos, the twenty trillion and first provides us with little more: so we can't simply say that the value X has to be unimportant compared with any given bits in e, as those bits may be highly redundant.

*Measuring importance*

How would we measure the value of e and X when constructing the P(e,S,*)? This is vital to this approach, and I don't have any clear ways of doing so. The picture I have in mind is smoothing the distributions somewhat (so that the P(e,S,*) aren't so sharp in their cutoff), and measuring the impact that changing information in e has on the probability distributions, using some sensible metric over probability distributions (the L^{1} norm seems appropriate for probability distributions, though L^{2} norms have more pleasing properties). If the data in e is allowed to vary in some continuous manner, then we can talk about the derivative of P(e,S,*) with respect to changes in e, and elements of high derivative, or whose derivative was not collinear with others, would be the most important. But there may be discrete measures that are more appropriate for this: the important thing is to get a norm on e and X and another on P, and compare how small changes in e and X affect the norm on P.
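Here is a toy version of the derivative picture, under heavy assumptions: the map from evidence to distribution is a made-up softmax over linear scores (which also supplies the smoothing), the evidence is a short vector of real numbers whose last entry plays the role of X, and importance is the L^{1} norm of a finite-difference derivative. None of this is meant as the right model, only as a way to see the quantities involved.

```python
import math


def softmax(scores):
    """Turn real-valued scores into a smooth probability distribution."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]


def predict(evidence, weights):
    """Toy stand-in for (e, X) -> P: each world's score is linear in the inputs."""
    scores = [sum(w * v for w, v in zip(row, evidence)) for row in weights]
    return softmax(scores)


def importance(evidence, weights, step=1e-4):
    """L1 norm of the finite-difference derivative of P in each input."""
    base = predict(evidence, weights)
    result = []
    for i in range(len(evidence)):
        bumped = list(evidence)
        bumped[i] += step
        moved = predict(bumped, weights)
        result.append(sum(abs(a - b) for a, b in zip(moved, base)) / step)
    return result


# Hypothetical inputs: [decisive fact, irrelevant fact, X]; three future worlds.
weights = [[2.0, 0.0, 0.1], [-2.0, 0.0, 0.0], [0.0, 0.0, -0.1]]
inputs = [0.5, 1.0, 1.0]
scores = importance(inputs, weights)
```

With these made-up weights the decisive fact has by far the largest importance, the irrelevant fact has none, and X sits in between but well below the decisive fact. This per-coordinate measure does not capture the collinearity point: two inputs whose derivatives point in the same direction would each look important here even if they are redundant.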

*Absolute information importance*

We can compress e to get rid of the redundancy, and then compare the typical importance of bits in the (compressed) e as compared with knowing X, and penalise those S's where the importance of X is higher. This is formalising the intuition:

"The rest of the information about the world should be more important than whether the disciple AI was built."

But once e is compressed, it seems that any bit is equally informative - or at least the difference is not so clear. For instance, the name of the next president of the United States is intuitively more important than the name of the plant that grows in my back garden, but if I want to keep track of every piece of information, then they are equally important - the plant in my back garden will interact with particles all around it, which will soon snowball all over the planet; within eight minutes, the positions of the atoms in the sun will be different from what they would have been had that plant been of a different species. Hell, their positions would be different if I'd thought the right name for the plant rather than the wrong name.

There is a kind of implicit coarse graining here - we're assuming the AIs don't have the ability to unravel chaos and entropy, and keep track of every piece of information; they will be using destructive compression, or average measures (temperature, pressure...) at some point.

For instance, if there are two gasses in adjacent (equal-sized) compartments, and we are about to remove the partition, then the average pressure is something important to keep track of, but the individual pressures are not informative for the picture we have of the future. If the AI is subject to similar constraints, its own coarse-graining will define this "varying importance of information".
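The gas example can be made explicit with one line of physics. Assuming ideal gases at the same temperature in equal volumes (the temperature assumption is mine), the pressure after the partition is removed is just the average of the two initial pressures:

```python
def pressure_after_mixing(p_left: float, p_right: float) -> float:
    """Final pressure once the partition between two equal volumes is removed.

    Assumes ideal gases at the same temperature, so mole counts add and the
    final pressure is the average of the two initial pressures.
    """
    return (p_left + p_right) / 2.0


# Two hypothetical initial states with the same average pressure.
scenario_a = pressure_after_mixing(1.0, 3.0)
scenario_b = pressure_after_mixing(2.5, 1.5)
```

Both scenarios lead to the same final pressure, so an observer who only tracks this coarse-grained future has no reason to care which one was the case.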

*Relative information importance*

There is another avenue we could explore here, a potentially more rewarding one, that does not require compressing e. Instead of caring about the importance of knowing X, we can track what impact knowing X has on the *relative importance of different pieces of e*.

If, for instance, there were no AI built in the next year, then the name of the next US president is more important than the fact that the human mind could be hacked by an incredibly complicated combination of lights and sounds. If a superintelligent AI wanted to take over the world in this year, then the relative importance of those facts is presumably reversed: we won't live to see the election, but a human vulnerability is something the AI can exploit. This intuition is:

"Independently of whether the disciple AI was built, we should expect the same relative importance of different pieces of current information."

This is a potentially very powerful measure - an AI wanting to take over the world would have to do so while ensuring that it only uses facts that were going to be important anyway (and uses them in precisely the right amounts, and prevents the human reaction from disrupting this setup, and leaves open the same spread of possible future worlds dependent on variations in current evidence, etc...). It seems an immense challenge even for a superintelligent AI. Generally, the more coarse grained the AI's vision is, the more useful this measure becomes.

To transform this measure into a penalty function, we merely need to use some sort of norm across the difference in information importance. We have vector valued information here: changing the known value of Z can undo another change in the value of Y (for instance when these denote the pressure of the two mixing gasses), and the best norm would take this into consideration as well.
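As a first, deliberately naive sketch of such a penalty, we could normalise the importance scores in each case and take an L^{1} norm of the difference. The importance numbers here are invented for illustration, the function names are hypothetical, and this scalar version ignores the vector-valued cancellation issue just mentioned.

```python
def relative(importances):
    """Rescale non-negative importance scores so that they sum to one."""
    total = sum(importances)
    if total <= 0:
        raise ValueError("at least one piece of information must matter")
    return [value / total for value in importances]


def relative_importance_penalty(without_disciple, with_disciple):
    """L1 distance between relative importances under X=0 and X=1."""
    r0, r1 = relative(without_disciple), relative(with_disciple)
    return sum(abs(a - b) for a, b in zip(r0, r1))


# Made-up importances of [next US president, human-mind vulnerability].
baseline = [9.0, 1.0]        # no disciple AI (X=0)
takeover = [1.0, 9.0]        # disciple exploits the vulnerability
unobtrusive = [8.8, 1.2]     # disciple barely shifts what matters

penalty_takeover = relative_importance_penalty(baseline, takeover)
penalty_unobtrusive = relative_importance_penalty(baseline, unobtrusive)
```

The takeover design scores about 1.6 and the unobtrusive one about 0.04, so the penalty at least points in the right direction on this toy input. Whether anything like these importance scores can actually be computed is, of course, the open question.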

## Conclusion: as always, more research needed

I've explored a few tentative methods for formalising various 'reduced impact' intuitions:

- Independently of whether the disciple AI was built, similar worlds should be similarly likely.
- Independently of whether the disciple AI was built, we should expect the same shape to future uncertainty.
- The rest of the information about the world should be more important than whether the disciple AI was built.
- Independently of whether the disciple AI was built, we should expect the same relative importance of different pieces of current information.

The important questions are now whether these methods can be formalised with sufficient rigour, and whether they capture enough of what we want to be useful.