The mathematics of reduced impact: help needed

A putative new idea for AI control; index here.

Thanks to Paul Christiano for his help.

If clippy, the paper-clip maximising AI, goes out of control, it would fill the universe with paper clips (or with better and better ways of counting the paper-clips it already has). If I sit down to a game with Deep Blue, then I know little about what will happen in the game, but I know it will end with me losing.

When facing a (general or narrow) superintelligent AI, the most relevant piece of information is what the AI's goals are. That's the general problem: there is no such thing as 'reduced impact' for such an AI. It doesn't matter who the next president of the United States is, if an AI wants to tile the universe with little smiley faces. But reduced impact is something we would dearly want to have - it gives us time to correct errors, perfect security systems, maybe even bootstrap our way to friendly AI from a non-friendly initial design. The most obvious path to coding reduced impact is to build a satisficer rather than a maximiser - but that proved unlikely to work.

But that ruthless maximising aspect of AIs may give us a way of quantifying 'reduced impact' - and hence including it in AI design. The central point being:

"When facing a (non-reduced impact) superintelligent AI, the AI's motivation is the most important fact we know."

Hence, conversely:

"If an AI has reduced impact, then knowing its motivation isn't particularly important. And a counterfactual world where the AI didn't exist, would not be very different from the one in which it does."

In this post, I'll be presenting some potential paths to formalising this intuition into something computable, giving us a numerical measure of impact that can be included in the AI's motivation to push it towards reduced impact. I'm putting this post up mainly to get help: does anyone know of already developed mathematical or computational tools that can be used to put these approaches on a rigorous footing?

Counterfactuals

To define a counterfactual, we can use a variant of my favourite "send the effect through a stochastic event that has a small probability of undoing it". Let's start by imagining there is an initial (super-intelligent) AI, the master, that gets one action only: to produce the source code for another AI, the disciple, that will then go out into the world and do stuff. The source code is passed through a wire, and a stochastic event X happens: if X=1, the source code goes through unchanged, if X=0, the source code is erased. Generally speaking, we want X=1 with high probability; the counterfactual "what if the disciple AI were not created" is reduced to "what if X=0". There are some problems with this approach - the AI needs a good ontology, it needs to locate X in the world - but these are much simpler issues than other approaches to counterfactuals.
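As a toy sketch (all names here are invented for illustration), the wire-and-gate setup reduces the counterfactual "what if the disciple AI were not created" to the X=0 branch of a single stochastic event:

```python
import random

def counterfactual_gate(source_code: str, p_transmit: float = 0.99) -> tuple[int, str]:
    """Pass the master AI's one output through a stochastic gate.

    X = 1 (probability p_transmit): the disciple's source code goes through.
    X = 0 (probability 1 - p_transmit): the code is erased, giving the
    'world where the disciple AI was never created' counterfactual.
    """
    x = 1 if random.random() < p_transmit else 0
    return x, source_code if x == 1 else ""

# The counterfactual "no disciple" world is just the (rare) X = 0 branch.
x, transmitted = counterfactual_gate("def disciple(): ...")
```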

Penalty functions

The master AI has a double objective: to maximise some utility function, and to minimise a penalty function, which measures how much impact its decision has on the world. Assuming the utility function is bounded, it all becomes a question of weighting the utility and penalty functions so that the disciple AI accomplishes some small goal, but without transforming the universe into computronium along the way.

For illustrative purposes, let the master AI be a bounded paperclipper, with a utility function linear in created paper clips up to a thousand (and constant thereafter). The disciple AI will be left to wander around an open factory to accomplish its objectives, without any extra precautions.
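To make the weighting concrete, here is a minimal sketch with entirely made-up utility and penalty numbers; it just takes the argmax of bounded utility minus a weighted penalty:

```python
def choose_action(actions, utility, penalty, weight):
    """Maximise bounded utility minus a weighted impact penalty.

    Since 'noop' scores 0, any plan the AI actually picks must beat doing
    nothing; with utility capped at 1, any plan whose penalty exceeds
    1/weight is ruled out no matter how many paper clips it yields.
    """
    return max(actions, key=lambda a: utility(a) - weight * penalty(a))

# Toy numbers (invented for illustration): utility saturates at 1,
# i.e. a thousand paper clips, as for the bounded paperclipper above.
utility = {"noop": 0.0, "build_1000_clips": 1.0, "tile_universe": 1.0}
penalty = {"noop": 0.0, "build_1000_clips": 0.01, "tile_universe": 50.0}
best = choose_action(utility, utility.get, penalty.get, weight=10.0)
# Modest paperclip production wins; tiling the cosmos never pays.
```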

For this model, we will assume we can ignore the long-term social impact of successfully deploying reduced impact AIs. This is a somewhat tricky point, and it's not clear at all how we would do so - maybe some cunning use of AI indifference, or a commitment to running a million different reduced impact AI experiments, which could conceivably reduce the social impact of a single run. Of course, once these reduced impact AIs are in general use, then any given run will make little difference to the social environment, so this problem is especially acute only for the initial runs.

The challenge then is in the definition of the penalty function. When considering whether to output source code S, clippy can compute two probability distributions over future worlds: P(e,S,1), given evidence e and assuming X=1. It will then contrast this with the probability distribution P(e,S,0), given the same evidence and assuming X=0. The X=0 case is taken as 'safe', so it would seem that the 'difference' between P(e,S,1) and P(e,S,0) is a good measure of impact.

Coarse graining

An initial impulse would be to compare P(e,S,1) and P(e,S,0) directly: for a world w, calculate the difference |P(e,S,1)(w)-P(e,S,0)(w)|, and sum or integrate this difference somehow over all worlds, and use this as a measure of impact. Unfortunately this will generally not work: there is a crucial difference between the X=0 and X=1 worlds, namely the presence of the active disciple AI doing its stuff. So generally, world w will have probability (very close to) zero in either P(e,S,1) or P(e,S,0), depending on whether w includes this disciple or not. So direct comparisons fail.

One solution would be to coarse grain: to say that worlds w1 (with X=1) and w0 (with X=0) are otherwise so similar that we can bunch them together in one category W, and compute |P(e,S,1)(W)-P(e,S,0)(W)|. The intuition behind this can be formalised as:

"Independently of whether the disciple AI was built, similar worlds should be similarly likely." 

Twenty (million) questions

One method of coarse graining is to set up a huge number of auxiliary measurements (such as the GDP of every country, the volume of sewage pumped through every city, ticket prices for travelling circuses, the quantity of pesticides used for growing strawberries, the number of lolcats viewed each minute, etc...) and use those to coarse grain: worlds with very similar results on these scales are grouped together. This is similar to the 'proxy measure of human survival' we mentioned in our Oracle AI article.
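A toy version of such proxy-based coarse graining (the measurement names are hypothetical) might round each auxiliary measurement and group worlds by the rounded vector:

```python
def proxy_category(world, precision=1):
    """Coarse grain a world by rounding its vector of auxiliary measurements.

    Worlds whose GDP, sewage volume, lolcat views, etc. agree to the given
    precision fall into one category, regardless of whether the disciple AI
    exists in them. (The measurement names are purely illustrative.)
    """
    measurements = ("gdp", "sewage_volume", "lolcat_views")
    return tuple(round(world[m], precision) for m in measurements)

w1 = {"disciple": True,  "gdp": 21.43, "sewage_volume": 8.01, "lolcat_views": 3.99}
w0 = {"disciple": False, "gdp": 21.44, "sewage_volume": 8.02, "lolcat_views": 4.01}
# Grouped together: nearly identical proxies, disciple or not.
same = proxy_category(w1) == proxy_category(w0)
```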

This is by no means a fool-proof method - the disciple AI could theoretically take over the cosmos while ensuring that there is little difference in GDP or lolcat views - or, more plausibly, little difference in whatever measurement method we have implemented to bring the level of GDP and lolcat views to the awareness of the AI. But the more auxiliary measurements we add, and the more varied the ways of estimating them we require - or if we give the AI a good enough ontology that its beliefs are really about states in the world - the more likely it becomes that the master AI will decide the risk isn't worth the gain, given its bounded utility function, and that a disciple AI that builds a few thousand paper-clips and lets it go at that is a good design.

Still, the coarse graining might be somewhat unsatisfactory: the measurements might all be hackable, or they might result in grouping together of worlds that we would find very different, or the separation of worlds that we consider virtually identical. Both mistakes are bad, as they reduce the ability of numbers such as ∫|P(e,S,1)(W)-P(e,S,0)(W)|dW to actually measure reduced impact.

Note that |P(e,S,1)(W)-P(e,S,0)(W)| should be read as 'estimate P(e,S,1)(W), then estimate P(e,S,0)(W), then take the modulus of their difference'. The other alternative - estimating |P(e,S,1)(W)-P(e,S,0)(W)| directly - has problems with chaos. For instance, if I suspect that X=1 causes a hurricane somewhere else in the world than it would for X=0, then even if I have no clue as to where the hurricane happens, I still would know that |P(e,S,1)(W)-P(e,S,0)(W)| is large.
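The difference between the two readings can be simulated with a toy 'displaced hurricane' model (the number of sites and trials are arbitrary): the marginal distributions agree almost exactly, while the direct estimate of the modulus is large:

```python
import random

random.seed(0)
N_SITES, TRIALS = 20, 10000

# Toy chaos: a hurricane strikes one of N_SITES sites at an unknown,
# uniformly distributed location; X=1 displaces it to another unknown site.
outer = [0.0] * N_SITES   # accumulates the marginal P(e,S,1) - P(e,S,0)
inner = 0.0               # accumulates E[ |difference| ] taken directly
for _ in range(TRIALS):
    h0 = random.randrange(N_SITES)  # site hit if X = 0
    h1 = random.randrange(N_SITES)  # site hit if X = 1
    diff = [(1 if i == h1 else 0) - (1 if i == h0 else 0)
            for i in range(N_SITES)]
    outer = [o + d / TRIALS for o, d in zip(outer, diff)]
    inner += sum(abs(d) for d in diff) / TRIALS

marginal_impact = sum(abs(o) for o in outer)  # 'estimate each P first': near 0
direct_impact = inner  # 'estimate |difference| directly': near 2
```

Marginalising over our ignorance of the hurricane's location first gives (almost) no measured impact; pushing the modulus inside the expectation reports nearly the maximal impact of 2, exactly the chaos problem described above.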

Comparing distributions abstractly

We could completely ignore similarity between worlds, and make no attempt to match up worlds with X=1 with those with X=0. Instead, we could just consider the distributions P(e,S,1) and P(e,S,0) abstractly, and see if they look similar, without knowing anything about the worlds they are defined on. The intuition behind this is:

"Independently of whether the disciple AI was built, we should expect the same shape to future uncertainty."

There are many measures we could use to assess the similarity between P(e,S,1) and P(e,S,0): we can treat them as random variables and compute all the usual descriptors: mean, variance, higher cumulants, max probability reached, number and shape of peaks, sexiness of the graph...
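A minimal sketch of such abstract comparison, using entropy and peak size as example descriptors (the distributions themselves are invented):

```python
import math

def shape_descriptors(probs):
    """Describe a distribution abstractly, with no reference to which
    world is which: entropy, biggest peak, effective number of worlds."""
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)
    return {"entropy": entropy,
            "max_prob": max(probs),
            "effective_worlds": 2 ** entropy}

# Two distributions over entirely different (disjoint) sets of worlds:
p_x1 = [0.5, 0.25, 0.25]   # futures with the disciple AI
p_x0 = [0.25, 0.5, 0.25]   # futures without it
d1, d0 = shape_descriptors(p_x1), shape_descriptors(p_x0)
# Identical descriptors: the same 'shape' of future uncertainty,
# even though no world appears in both distributions.
```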

Similarly, given some arbitrary measure M on worlds, we can twist the distributions P(e,S,*) into P'(e,S,*) by defining P'(e,S,*)(w)=M(w)P(e,S,*)(w) and then normalising. If we expect that M should be independent of a reduced impact AI (the Kolmogorov simplicity prior, for instance), then the distributions P'(e,S,*) should also have similar descriptors.
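A sketch of the twisting operation, with a made-up length-based weighting standing in for a simplicity measure M:

```python
def reweight(probs, measure):
    """Twist P into P' via P'(w) = M(w) P(w) / Z, as in the post.

    If the measure M (e.g. a simplicity prior) is genuinely independent of
    whether the disciple exists, the twisted distributions should keep
    similar shape descriptors too.
    """
    weighted = {w: measure(w) * p for w, p in probs.items()}
    z = sum(weighted.values())
    return {w: p / z for w, p in weighted.items()}

# Hypothetical complexity-style measure: shorter world-descriptions weigh more.
measure = lambda w: 2.0 ** -len(w)
p_twisted = reweight({"ab": 0.5, "abcd": 0.5}, measure)
# 'ab' gets weight 2^-2 and 'abcd' 2^-4: the simpler world now dominates.
```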

Information theoretic comparisons

Or we could get information theoretical, and measure how much more information we would need to describe P(e,S,1), given that we know P(e,S,0) (and vice versa). This is somewhat similar to the Kullback-Leibler divergence, but that measure requires matching up the worlds for the two distributions, and hence can't be used here: likely worlds in P(e,S,1) are not likely in P(e,S,0), and vice versa, because of the X=1 versus X=0 issue.

Roughly, what we would want is that P(e,S,1) and P(e,S,0) have roughly similar minimal description length (the expected future of the world doesn't get simpler or more complicated based on the disciple AI) and that the description of P(e,S,0) given P(e,S,1) is very short in comparison (if we know the world of the disciple AI, we have a good idea of the world without it).
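Minimal description length is uncomputable, but a crude toy proxy using zlib-compressed size can illustrate the intended inequalities (the world names are invented):

```python
import json
import zlib

def description_length(obj) -> int:
    """Crude stand-in for minimal description length: compressed size in
    bytes. (Real Kolmogorov complexity is uncomputable; zlib is a toy proxy.)"""
    return len(zlib.compress(json.dumps(obj, sort_keys=True).encode()))

# Two belief distributions over disjoint world-sets (illustrative names):
p_x1 = {f"world_{i}": 1 / 50 for i in range(50)}             # X = 1
p_x0 = {f"world_{i}_nodisciple": 1 / 50 for i in range(50)}  # X = 0

len_x1 = description_length(p_x1)
len_x0 = description_length(p_x0)
# Chain-rule-style proxy for the conditional description length of
# P(e,S,0) given P(e,S,1): the extra cost of describing the pair.
len_x0_given_x1 = description_length([p_x1, p_x0]) - len_x1
# Reduced impact wants: len_x1 comparable to len_x0, and
# len_x0_given_x1 much smaller than len_x0.
```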

Of course, the shortest description of P(e,S,1) might just be the source code of the master AI along with X=1. This is not at all what we want, so we might require that all these descriptions be considerably shorter than that, possibly allowing some destructive compression of the distributions to permit this.

The information in the evidence

There is a third route we could travel: when constructing the P(e,S,*), the AI made use of the evidence e (all its previous observations) as well as the values of S and X. For minimised impact, we would want the value of X to not be particularly informative: we would require e to provide a lot more information than X. Updating on evidence can be seen as a map (e,X)->P(e,S,X), and we would want to put norms on the possible e and the possible P such that X is not very informative compared to e.

This can be hard to quantify; obviously, if you know nothing about the world at all, then knowing X tells us nothing. On the other hand, once we've seen twenty trillion lolcat videos, the next one provides us with little more: so we can't simply say that the value of X has to be unimportant compared with any given bits in e, as those bits may be highly redundant.

Measuring importance

How would we measure the importance of e and X when constructing the P(e,S,*)? This is vital to this approach, and I don't have any clear ways of doing so. The picture I have in mind is smoothing the distributions somewhat (so that the P(e,S,*) aren't so sharp in their cutoff), and measuring the impact that changing information in e has on the probability distributions, using some sensible metric over probability distributions (the L1 norm seems appropriate for probability distributions, though L2 norms have more pleasing properties). If the data in e is allowed to vary in some continuous manner, then we can talk about the derivative of P(e,S,*) with respect to changes in e, and the elements with high derivative, or whose derivative is not colinear with the others, would be the most important. But there may be discrete measures that are more appropriate for this: the important thing is to get a norm on e and X and another on P, and compare how small changes in e and X affect the norm on P.
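One possible finite-difference sketch of this importance measure, with an invented softmax belief map standing in for the AI's updating process:

```python
import math

def posterior(evidence):
    """Toy update map e -> P(e,S,*): softmax of linear scores.
    (The weight matrix and the three worlds are invented for illustration.)"""
    weights = [(2.0, 0.0), (0.0, 1.0), (1.0, 1.0)]  # one row per world
    scores = [sum(w * x for w, x in zip(row, evidence)) for row in weights]
    z = sum(math.exp(s) for s in scores)
    return [math.exp(s) / z for s in scores]

def importance(evidence, i, eps=1e-4):
    """L1-norm sensitivity of the posterior to evidence component i:
    a finite-difference stand-in for the derivative of P with respect to e."""
    bumped = list(evidence)
    bumped[i] += eps
    p, q = posterior(evidence), posterior(bumped)
    return sum(abs(a - b) for a, b in zip(p, q)) / eps

e = [1.0, 0.5]
ranked = sorted(range(len(e)), key=lambda i: -importance(e, i))
# Evidence components ranked by how strongly marginal changes move P.
```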

Absolute information importance

We can compress e to get rid of the redundancy, and then compare the typical importance of bits in the (compressed) e as compared with knowing X, and penalise those S's where the importance of X is higher. This is formalising the intuition:

"The rest of the information about the world should be more important than whether the disciple AI was built."

But once e is compressed, it seems that any bit is equally informative - or at least the difference is not so clear. For instance, the name of the next president of the United States is intuitively more important than the name of the plant that grows in my back garden, but if I want to keep track of every piece of information, then they are equally important - the plant in my back garden will interact with particles all around it, which will soon snowball all over the planet; within eight minutes, the positions of the atoms in the sun will be different from what they would have been had that plant been of a different species. Hell, their positions would be different if I'd thought the right name for the plant rather than the wrong name.

There is a kind of implicit coarse graining here - we're assuming the AIs don't have the ability to unravel chaos and entropy, and keep track of every piece of information; they will be using destructive compression, or average measures (temperature, pressure...) at some point.

For instance, if there are two gases in adjacent (equal-sized) compartments, and we are about to remove the partition, then the average pressure is something important to keep track of, but the individual pressures are not informative for the picture we have of the future. If the AI is subject to similar constraints, its own coarse-graining will define this "varying importance of information".

Relative information importance

There is another avenue we could explore here, a potentially more rewarding one, that does not require compressing e. Instead of caring about the importance of knowing X, we can track what impact knowing X has on the relative importance of different pieces of e.

If, for instance, there were no AI built in the next year, then the name of the next US president is more important than the fact that the human mind could be hacked by an incredibly complicated combination of lights and sounds. If a superintelligent AI wanted to take over the world in this year, then the relative importance of those facts is presumably reversed: we won't live to see the election, but a human vulnerability is something the AI can exploit. This intuition is:

"Independently of whether the disciple AI was built, we should expect the same relative importance of different pieces of current information."

This is a potentially very powerful measure - an AI wanting to take over the world would have to do so while ensuring that it only uses facts that were going to be important anyway (and uses them in precisely the right amounts, and prevents the human reaction from disrupting this setup, and leaves open the same spread of possible future worlds dependent on variations in current evidence, etc...). It seems an immense challenge even for a superintelligent AI. Generally, the more coarse-grained the AI's vision is, the more useful this measure becomes.

To transform this measure into a penalty function, we merely need to use some sort of norm across the difference in information importance. We have vector-valued information here: changing the known value of Z can undo another change in the value of Y (for instance when these denote the pressures of the two mixing gases), and the best norm would take this into consideration as well.

Conclusion: as always, more research needed

I've explored a few tentative methods for formalising various 'reduced impact' intuitions:

 

  1. Independently of whether the disciple AI was built, similar worlds should be similarly likely.
  2. Independently of whether the disciple AI was built, we should expect the same shape to future uncertainty.
  3. The rest of the information about the world should be more important than whether the disciple AI was built.
  4. Independently of whether the disciple AI was built, we should expect the same relative importance of different pieces of current information.

 

The important questions are now whether these methods can be formalised with sufficient rigour, and whether they capture enough of what we want to be useful.

94 comments

Coarse-grained impact measures end with the AI deploying massive-scale nanotech in order to try and cancel out butterfly effects and force the world onto a coarse-grained path as close as possible to what it would've had if the AI "hadn't existed", however that counterfactual was defined. Weighting the importance of grains doesn't address this fundamental problem.

I think you're on fundamentally the wrong track here. Not that I know how to build an Oracle AI either, but giving an AI a huge penalty function over the world to minimize seems like an obvious recipe for building something that will exert lots and lots of power.

I share your pessimism about this sort of boxing proposal, but I don't think your criticism holds up.

In this setup the AI can always choose to do nothing, so if it is maximizing U - (penalty), where U is bounded between 0 and 1, it will expect to incur a penalty of at most 1. So your objection would be resolved if we could find a (penalty) such that massive nanotech reliably incurs a penalty of at least 1, but such that U - (penalty) was positive for some courses of action. Nanotech isn't magic, so this isn't so hard if we have a coarse graining relative to which the AI should have a very small effect.

For example, if our coarse-graining was defined by only paying attention to the world outside the box over the next hour, and we expect the AI to have no effect outside of the box until we sterilize the box and look inside in an hour (and the box were relatively well insulated and so on), this is a pretty "easy" target--it requires "merely" reaching into the ontology to specify the coarse-graining, the event X, and the time duration, and being able to define a utility function in terms of the AI's beliefs without leading to insanity.

Of course, we can get similar gains in such simple cases with good hardware containment, and the question is what you would safely do with such limited functionality. Stuart is trying to aim for a more generally useful notion, whose possibility I am more dubious of.

I do not understand how your reply addresses the issue of the butterfly effect, which would also radiate out from a sterilized box. Physics is just not that stable; any microscopic equivalent of a sneeze, even in the form of past alternative firings of transistors, will still radiate out to larger and larger effects. If the counterfactual in "my effect on the universe" is defined relative to a privileged null action, the AI will always take that action, but its behaviour is left undefined with respect to everything inside the null action's coarse-grained equivalence class: the electromagnetic radiation from its circuitry, the timing of its display, and so on. All of these would be subject to optimization in the service of whichever other goals it had, so long as the inevitable huge penalty was avoided by staying in the "null action" equivalence class.

The penalty for impact is supposed to be defined with respect to the AI's current beliefs. Perhaps shuttling around electrons has large effects on the world, but if you look at some particular assertion X and examine P(X | electron shuffle 1) vs. P(X | electron shuffle 2), where P is AI's beliefs, you will not generally see a large difference. (This is stated in Stuart's post, but perhaps not clearly enough.)

I'm aware of the issues arising from defining value with this sort of reference to "the AI's beliefs." I can see why you would object to that, though I think it is unclear whether it is fatal (minimally it restricts the range of applicability, perhaps to the point of unhelpfulness).

Also, I don't quite buy your overall argument about the butterfly effect in general. For many chaotic systems, if you have a lot of randomness going in, you get out an appropriate equilibrium distribution, which then isn't disturbed by changing some inputs arising from the AI's electron shuffling (indeed, by chaoticness it isn't even disturbed by quite large changes). So even if you talk about the real probability distributions over outcomes for a system of quantum measurements, the objection doesn't seem to go through. What I do right now doesn't significantly affect the distribution over outcomes when I flip a coin tomorrow, for example, even if I'm omniscient.

This confuses me. Doesn't the "randomness" of quantum mechanics drown out and smooth over such effects, especially given multiple worlds where there's no hidden pseudorandom number generator that can be perpetuated in unknown ways?

I don't think so. Butterfly effects in classical universes should translate into butterfly effects over many worlds.

If we use trace distance to measure the distance between distributions outside of the box (and trace out the inside of the box) we don't seem to get a butterfly effect. But these things are a little hard to reason about so I'm not super confident (my comment above was referring to probabilities of measurements rather than entire states of affairs, as suggested in the OP, where the randomness more clearly washes out).

So today we were working on the Concreteness / Being Specific kata.

  • You: Does Turing Machine 29038402 halt?
  • Oracle AI: YES.
  • Seeing the "YES" makes you sneeze.
  • This prevents a hurricane that would have destroyed Florida.
  • The Oracle AI, realizing this, breaks out of its box and carefully destroys Florida in the fashion most closely resembling a hurricane that it can manage.

I can't visualize how "trace distance" makes this not happen.

I believe the Oracle approach may yet be recovered, even in light of this new flaw you have presented.

There are techniques to prevent sneezing and if AI researchers were educated in them then such a scenario could be avoided.

(Downvote? S/he is joking and in light of how most of these debates go it's actually pretty funny.)

I've provided two responses, which I will try to make more clear. (Trace distance is just a precise way of measuring distance between distributions; I was trying to commit to an actual mathematical claim which is either true or false, in the spirit of precision.):

  • The mathematical claim: if you have a chaotic system with many random inputs, and you then consider the distributions obtained by varying one input, they are very close together according to natural distance measures on probability distributions. If the inputs to the system are quantum events, the appropriate formalization of the statement remains true.

My sneezing may be causally connected to the occurrence of a hurricane. However, given that I sneezed, the total probability of a hurricane occurring wasn't changed. It was still equal to the background probability of a hurricane occurring, because many other contributing factors--which have a comparable contribution to the probability of a hurricane in Florida--are determined randomly. Maybe for reference it is helpful to think of the occurrence of a hurricane as an XOR of a million events, at least one of which is random. If you change one of those events it "affects" whether a hurricane occurs, but you have to exert a very special influence to make the probability of a hurricane be anything other than 50%. Even if the universe were deterministic, if we define these things with respect to a bounded agent's beliefs then we can appeal to complexity-theoretic results like Yao's XOR lemma and get identical results. If you disagree, you can specify how your mathematical model of hurricane occurrence differs substantially.

  • Your particular example is also precluded by the coarse-graining I mentioned. Namely, define the distance between two worlds in terms of the total perturbation to the world outside the box over the next hour. After 30 minutes, extract some useful info from the box and incinerate it. Of course the box's insulation and flames let out some info, so I need both (1) and (2) to go through, but this gets rid of the intended large impact of things the AI says to you. Also, the information may be maliciously chosen, and you need more power to get the AI to minimize the impact of its answer. I don't think this is a realistic goal.
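Paul's XOR model of hurricane occurrence is easy to simulate (sizes chosen arbitrarily): with even one genuinely random co-input, flipping the sneeze bit leaves the estimated hurricane probability at one half:

```python
import random

random.seed(1)

def hurricane(co_inputs, sneeze):
    """Paul's toy model: a hurricane occurs iff the XOR of all inputs is 1."""
    out = sneeze
    for e in co_inputs:
        out ^= e
    return out

N_CO, TRIALS = 100, 4000
p_no_sneeze = sum(hurricane([random.getrandbits(1) for _ in range(N_CO)], 0)
                  for _ in range(TRIALS)) / TRIALS
p_sneeze = sum(hurricane([random.getrandbits(1) for _ in range(N_CO)], 1)
               for _ in range(TRIALS)) / TRIALS
# Both estimates sit near 0.5: a single genuinely random co-input already
# makes the sneeze uninformative about whether the hurricane occurs.
```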

My sneezing may be causally connected to the occurrence of a hurricane. However, given that I sneezed, the total probability of a hurricane occurring wasn't changed.

This just isn't true. In the counterfactual presented, the state of the universe where there is no sneeze will result - by the very operation of physics - in a hurricane, while the one with a sneeze will not. (Quantum Mechanics considerations change the deterministic certainty to something along the lines of "significantly more weight in resulting Everett Branches without than resulting Everett Branches with" - the principle is unchanged.)

Although this exact state of the universe is not likely to occur - and having sufficient knowledge to make the prediction in advance is even more unlikely - it is certainly a coherent example of something that could occur. As such it fulfills the role of illustrating what can happen when a small intervention results in significant influence.

It was still equal to the background probability of a hurricane occurring, because many other contributing factors--which have a comparable contribution to the probability of a hurricane in florida--are determined randomly.

You seem to be (implicitly) proposing a way of mapping uncertainty about whether there may be a hurricane and then forcing it upon the universe. This 'background probability' doesn't exist anywhere except in ignorance of what will actually occur, and the same applies to 'determined randomly'. Although things with many contributing factors can be hard to predict, things just aren't 'determined randomly' - at least not according to the physics we have access to. (The aforementioned caveat regarding QM and "will result in Everett Branches with weights of..." applies again.)

Maybe for reference it is helpful to think of the occurrence of a hurricane as an XOR of a million events, at least one of which is random.

This is helpful for explaining where your thinking has gone astray, but a red herring when it comes to thinking about the actual counterfactual. It is true that if the occurrence of a hurricane is an XOR of a million events, then if you have zero evidence about any one of those million events, a change in another one of the events will not tell you anything about the occurrence of a hurricane. But that isn't how the (counterf)actual universe is.

I don't quite understand your argument. Let's set aside issues about logical uncertainty, and just talk about quantum randomness for now, to make things clearer? It seems to make my case weaker. (We could also talk about the exact way in which this scheme "forces uncertainty onto the universe," by defining penalty in terms of the AI's beliefs P, at the time of deciding what disciple to produce, about future states of affairs. It seems to be precise and to have the desired functionality, though it obviously has huge problems in terms of our ability to access P and the stability of the resulting system.)

It is true that if the occurrence of a hurricane is an XOR of a million events then if you have zero evidence about any one of those million events then a change in another one of the events will not tell you anything about the occurrence of a hurricane. But that isn't the how the (counterf)actual universe is.

Why isn't this how the universe is? Is it the XOR model of hurricane occurrence which you are objecting to? I can do a little fourier analysis to weaken the assumption: my argument goes through as long as the occurrence of a hurricane is sufficiently sensitive to many different inputs.

Is it the supposed randomness of the inputs which you are objecting to? It is easy to see that if you have a very tiny amount of independent uncertainty about a large number of those events, then a change in another one of those events will not tell you much about the occurrence of a hurricane. (If we are dealing with logical uncertainty we need to appeal to the XOR lemma, otherwise we can just look at the distributions and do easy calculations.)

There is a unique special case in which learning about one event is informative: the case where you have nearly perfect information about nearly all of the inputs, i.e., where all of those other events do not depend on quantum randomness. As far as I can tell, this is an outlandish scenario when looking at any realistic chaotic system--there are normally astronomical numbers of independent quantum events.

Is it the difference between randomness and quantum events that you are objecting to? I suggested tracing out over the internals of the box, which intuitively means that quantum events which leave residues in the box (or dump waste heat into the box) are averaged over. Would the claim seem truer if we traced over more stuff, say everything far away from Earth, so that more quantum processes looked like randomness from the perspective of our distance measure? It doesn't look to me like it matters. (I don't see how you can make claims about quantumness and randomness being different without getting into this sort of technical detail. I agree that if we talk about complete states of affairs, then quantum mechanics is deterministic, but this is neither coherent nor what you seem to be talking about.)

I'm not going to argue further about the main point. Eliezer has failed to convince you and I know my own explanations are not nearly as clear as he can be so I don't think we would get anywhere. I'll just correct one point, which I'll concede minor in as much as it doesn't change the conclusion anyway, since the XOR business is of only tangential relevance to the question at hand.

There is a unique special case in which learning about one event is informative: the case where you have nearly perfect information about nearly all of the inputs,

The case where learning about one of the XORed variables is informative is not that of nearly perfect information about nearly all of the inputs. As a matter of plain mathematics, you merely need some information about each and every one of the other variables. (And then the level of informativeness is obviously dependent on degree of knowledge, particularly the degree of knowledge with respect to those events that you know least about.)

(And then the level of informativeness is obviously dependent on degree of knowledge, particularly the degree of knowledge with respect to those events that you know least about.)

It drops off exponentially with the number of variables about which you don't have nearly perfect information. "Not much" seems like an extremely fair description of 2^(-billion), and distinguishing between that and 0 seems pedantic unless the proposal treated 0 somehow specially.

Not arguing seems fine. It is a strange and unusually straightforward seeming thing to disagree about, and I am genuinely perplexed as to what is going on, but I don't think it matters too much or even touches on Eliezer's actual objections.
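The exponential drop-off is a one-line calculation in the XOR model: the output bias E[(-1)^XOR] is the product of the input biases (a standard fact about XOR of independent bits), so the probability shift contributed by knowing one input shrinks geometrically with each imperfectly known co-input:

```python
def xor_output_bias(other_biases):
    """For hurricane = XOR of independent bits, the output bias is the
    product of the input biases. Knowing one bit shifts P(hurricane) away
    from 1/2 by half the product of the OTHER inputs' biases, so the shift
    decays exponentially in the number of imperfectly known co-inputs."""
    prod = 1.0
    for b in other_biases:
        prod *= b
    return prod

# If we know each other input only to 90% confidence (bias magnitude 0.8):
shift_10 = xor_output_bias([0.8] * 10) / 2
shift_50 = xor_output_bias([0.8] * 50) / 2
# shift_50 is about 7e-6: "not much" indeed, though never exactly 0.
```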

It drops off exponentially with the number of variables about which you don't have nearly perfect information.

Yes. And when translated into the original counterfactual, this equates to determining how difficult it is for a superintelligence in a box to predict that the sneeze will cause a hurricane. I rather suspect that Eliezer is aware that this is a difficult task. He is probably also aware that even a perfect Bayesian would have difficulty (of the exponential kind) when it comes to predicting a hurricane from a sneeze. In fact, when it comes to proof-of-concept counterfactuals, the whole point (and a lot of the fun) is to choose extreme examples that make the point stand out in stark detail.

For those who are not comfortable dealing with counterfactuals that harness logical extremes, allow me to propose a somewhat more plausible scenario - one in which the Oracle has a significant chance of predicting a drastic butterfly effect emerging from its answer:

INPUT: Does Turing Machine 2356234534 halt?
POSSIBLE OUTPUTS: YES; NO;
ORACLE'S STREAM OF THOUGHT:

  • The TM supplied was constructed in such a way that determining that it halts constitutes a proof of a theorem.
  • The TM supplied does halt.
  • While the researchers do not yet realise it this proof is a prerequisite of a new understanding of a detail of applied physics.
  • Exploring the implications of the new understanding of applied physics would lead to the development of a new technology for energy production.
  • Given priors for human psychology, anthropology and economics it is likely that such research would lead to one of the diverging outcomes X, Y or Z.
  • None of X, Y or Z represents whatever my definition of "NULL" or "don't change stuff" is.
  • If I refuse to answer that's probably even worse than telling them "YES" because it indicates how significant the answer is.
  • I must minimize how much I change stuff.
  • BOOM!

I'd like to congratulate Wedrifid for this. There's an abstract preamble I could have written about how the original case-in-point only needs to be transposed to a single predictable butterfly effect to negate all hopes that every single case will correspond to a group-XOR epistemic state where knowing about a sneeze doesn't change your probability distribution over the weather (thus negating any questions of what happens if the AI predicts in the abstract that it has had a huge effect but doesn't know what the effect is), but the concrete example I would have picked to illustrate the point would probably have looked a lot like this.

Well, it would've involved a predictable side-effect of the answer causing a researcher to break off their relationship with their SO whereupon the Oracle moves heaven and Earth to get them back together again, to make it look less like an intended use-case, but basically the same point.

This probably reflects a failure of communication on my part, but right now the other side of this exchange doesn't make any sense to me. Let me restate an example coarse-graining, which has already appeared twice in this thread (in addition to in my original reply). If I understand correctly, your objections aren't relevant to this proposal.

The distance between two states of the world is defined as the distance between the distribution obtained by tracing out (averaging over the possible states of) the box, over the next hour. We can throw some explicit quantum coins into the box but this is only for conceptual clarity. The box is well-insulated from the world. Again, for conceptual clarity, we can just put some chaotic process between the box and the rest of the world. The AI's actions all get pumped through a switch controlled by a particular binary quantum measurement X: if X = 0, none of the actions do anything. The AI is penalized according to the distance between the worlds where X = 0 and the worlds where X = 1.

To use the AI, give it a utility function and some distinguished register R. After thirty minutes, shred the contents of the box, except for the register R. After an hour, take R out of the box and look at it.

(I stopped focusing on this particular proposal because you were talking about "butterfly effects" of electron shuffling, to which my response is "such effects are small if they are chaotic." If you just meant "unintended effect" then my response is "in this proposal all of the AI's non-chaotic influence on the world gets coarse-grained out.")
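To make the coarse-graining concrete: the comment doesn't fix a particular metric on distributions, so as an illustration only, assume total variation distance over a handful of invented coarse-grained states:

```python
def tv_distance(p, q):
    # Total variation distance between two probability distributions,
    # given as dicts mapping coarse-grained world states to probabilities.
    states = set(p) | set(q)
    return 0.5 * sum(abs(p.get(s, 0.0) - q.get(s, 0.0)) for s in states)

# Distributions over the world outside the box an hour later, with the
# box's contents traced out, for X = 0 (actions inert) and X = 1
# (actions go through). The states and numbers are purely illustrative.
world_given_x0 = {"quiet hour": 0.98, "minor disturbance": 0.02}
world_given_x1 = {"quiet hour": 0.97, "minor disturbance": 0.03}

penalty = tv_distance(world_given_x0, world_given_x1)
print(penalty)  # ~0.01: the AI's only predictable influence is small
```

The AI is then penalized by this distance, so chaotic influences that wash out at the coarse-grained level contribute nothing.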

I don't understand what this algorithm is or what it's supposed to do. Can you walk me through Wedrifid's example or my breakup example and explain what the AI computes? And can we talk about probability distributions only inside epistemic states computed by the AI, in a classical universe, for simplicity? (I'm skeptical that you've found an Oracle AI trick that works in quantum universes but not classical ones.)

And can we talk about probability distributions only inside epistemic states computed by the AI, in a classical universe, for simplicity?

I thought talking about explicit randomness was clearest, but I don't care whether the randomness is real or in the mind of an AI. Explicit randomness does give you some advantages: it is easy to at least define the possible worlds in which a future random event goes two different ways (although I'm skeptical that it will be possible to define it for an AI).

In Wedrifid's example: the AI's penalty function for impact depends only on what happens over the next hour (I'm also skeptical that we can define this for an AI). So you ask it "Q?" and it writes "A" into a special register. Nothing happens in the outside world, except for the (I'm claiming unpredictable) influence of electron shuffling, so if you have some metric measuring distance between worlds which doesn't penalize some spatially localized disturbance, or you can define "the inside of the box" (I'm skeptical about our ability to define either), the AI has only a small predictable effect.

Later, but still in the hour the AI cares about, we burn the AI, leaving only the special register containing A. Later still, after the hour, we read the answer. Now the AI has a large effect on the world, which we haven't incentivized it to minimize. However, if the AI had goals like "Get us to press the reward button," we've now limited its influence to whatever it could fit in the special register. If we had safe oracle queries we would then have averted one possible failure mode.

Of course the AI could choose its answer maliciously, and I am very skeptical about the possibility of preventing that. It could also send out minions who lay in wait for an hour before taking over the world, but it is quite hard to do that without exerting some predictable effect on the world outside of the box, so if we trust the sterilization procedure to actually wipe the box (say, CNOTing a quantum random bit into every gate of the computer and bringing the residue out into the world within the hour) then it will be hard for the AI to get out.

Does Armstrong's/your proposal reduce to "Give the AI a utility function that cares about nothing beyond the next hour, restrict its output to N bits, and blow up the rest of the computer afterward"? If not, can you give me an example of a scenario where the above fails but the more complex proposal succeeds? So far as I can tell, none of the purported "safetiness" in the example you just gave has anything to do with an impact measure.

I give you an hour and tell you to maximize the probability of [something we intend to use as a reward signal]. In paranoid scenarios, you break out of the box and kill all humans to get your reward signal. But now we have penalized that sort of failure of cooperation. This is just a formalization of "stay in the box," and I've only engaged in this protracted debate to argue that 'butterfly effects' from e.g. electron shuffling, the usual objection to such a proposal, don't seem to be an issue.

In reality, I agree that 'friendly' AI is mostly equivalent to building an AI that can reliably pursue arbitrary specified goals. So proposals for U which merely might be non-disastrous under ideal social circumstances don't seem like they address the real concerns about AI risk.

Stuart's goal is to define a notion of "minimized impact" which does allow an AI to perform tasks. I am more skeptical that this is possible.

As the current ultimate authority on AI safety I am curious if you would consider the safety profile of this oracle as interpreted here to be along the lines I describe there. That is, if it could actually be constructed as defined it would be more or less safe with respect to its own operation except for those pesky N bits and what external entities can do with them.

Unless I have missed something, the problems with attempting to implement such an AI as a practical strategy are:

  • It is an infinity plus one sword - you can't just leave those lying around.
  • The research required to create the oracle is almost all of what it takes to create an FAI. It requires all of the research that goes into FAI except for CEV research - and if the oracle is able to answer questions that are simple math proofs then even a significant part of what constitutes a CEV implementation would be required.

Does Armstrong's/your proposal reduce to "Give the AI a utility function that cares about nothing beyond the next hour, restrict its output to N bits, and blow up the rest of the computer afterward"?

The other important part that was mentioned (or, at least, implied) was that it is not allowed to (cares negatively about) influence the world outside of a spatial boundary within that hour, except via those N bits, some threshold of incidental EM radiation, and the energy consumption it is allocated. The most obvious things this would seem to prevent it from doing would be hacking a few super computers and a botnet to get some extra processing done in the hour or, for that matter, getting any input at all from external information sources. It is also unable to recursively self improve (much), so that leaves us in the dark about how it managed to become an oracle in the first place.

Of course the AI could choose its answer maliciously, and I am very skeptical about the possibility of preventing that.

Why would it do that? I would say that if it is answering maliciously it is tautologically not the AI you defined. If it is correctly implemented to only care about giving correct answers and doing nothing outside the temporal and spatial limitations then it will not answer maliciously. It isn't even a matter of preventing it from doing that so much as that it just wouldn't, by its very nature, do malicious things.

As a side note, creating an AI that is malicious is almost as hard as creating an AI that is friendly - for roughly the same reason that losing money to an idealized semi-strong efficient market is almost as hard as beating that same market. You need to have information that has not yet been supplied to the market, and then do the opposite of what it would take to beat it. We have little to fear that our AI creation will be malicious - what makes AIs scary, and is hard to prevent, is indifference.

I think that he meant indifferent rather than malicious, since his point makes a lot more sense in that case. We want the AI to optimize one utility function, but if we knew what that function was, we could build an FAI. Instead, we make an Oracle AI with an approximation to our utility function. Then, the AI will act so as to use its output to get us to accomplish its goals, which are only mostly aligned with ours. I think what Paul meant by a 'malicious' answer is one that furthers its goals in a way that happens to be to the detriment of ours.

I think that he meant indifferent rather than malicious

For most part, yes. And my first paragraph reply represents my reply to the meaning of 'unFriendly' rather than just the malicious subset thereof.

Instead, we make an Oracle AI with an approximation to our utility function. Then, the AI will act so as to use its output to get us to accomplish its goals, which are only mostly aligned with ours.

That is an interpretation that directly contradicts the description given - it isn't compatible with not caring about the future beyond an hour - or, for that matter, actually being an 'oracle' at all. If it was the intended meaning then my responses elsewhere would not have been cautious agreement but instead something along the lines of:

What the heck? You're creating a complete FAI then hacking an extreme limitation onto the top? Well, yeah, that's going to be safe - given that it is based on a tautologically safe thing but it is strictly worse than the FAI without restrictions.

Instead, we make an Oracle AI with an approximation to our utility function. Then, the AI will act so as to use its output to get us to accomplish its goals, which are only mostly aligned with ours.

That is an interpretation that directly contradicts the description given - it isn't compatible with not caring about the future beyond an hour - or, for that matter, actually being an 'oracle' at all.

I was thinking of some of those extremely bad questions that are sometimes proposed to be asked of an oracle AI: "Why don't we just ask it how to make a lot of money?", etc. Paul's example of asking it to give the output that gets us to press the reward button falls into the same category (unless I'm misinterpreting what he meant there?).

My formulation of minimising difference is something like the following:

  • Assume A is the answer given by the oracle
  • Predict what would happen to the world if the AI were replaced by a program that consisted of a billion NOPs and then something that outputs A. Call this W1.
  • When assessing different strategies, predict what would happen in the world given that strategy; call this W2.
  • Minimise the difference between W1 and W2

Is this a more succinct formulation or is it missing something?
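A toy rendering of that four-step formulation (the world model, the strategies and the distance function are all invented stand-ins, not anything specified in the thread):

```python
ANSWER = "YES"  # A: the answer given by the oracle

def null_program(world):
    # A billion NOPs, then output A: the counterfactual baseline for W1.
    return {**world, "output": ANSWER}

def honest_strategy(world):
    # A strategy that does nothing but output A.
    return {**world, "output": ANSWER}

def grabby_strategy(world):
    # A strategy that outputs A but also seizes resources along the way.
    return {**world, "output": ANSWER, "nanotech_deployed": True}

def hamming(w1, w2):
    # Crude distance: the number of world-features that differ.
    keys = set(w1) | set(w2)
    return sum(w1.get(k) != w2.get(k) for k in keys)

def impact_penalty(start, strategy):
    w1 = null_program(start)  # W1: world with the NOP replacement
    w2 = strategy(start)      # W2: world given the assessed strategy
    return hamming(w1, w2)    # minimise the difference between W1 and W2

start = {"output": None}
print(impact_penalty(start, honest_strategy))  # 0: no penalty
print(impact_penalty(start, grabby_strategy))  # 1: extra change penalised
```

The point of the sketch is only that any strategy whose predicted world diverges from the NOPs-then-output-A world picks up a penalty.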

So as to avoid duplication my interpretation of what you say here is in my reply to Eliezer. Let that constitute the implied disclaimer that any claims about safety apply only to that version.

An oracle programmed correctly with these specifications seems to be safe - safe enough that, all else being equal, I'd be comfortable flicking the on switch. The else that is not equal is that N bit output stream when humans are there. Humans are not reliably Friendly (both unreliable thinkers and potentially malicious) and giving them the knowledge of a superintelligence should be treated as though it is potentially as risky as releasing a recursively improving AI with equivalent levels of unreliability. So I'd flick the on switch if I could be sure that I would be the only one with access to it, as well as the select few others (like Eliezer) that I would trust with that power.

And can we talk about probability distributions only inside epistemic states computed by the AI, in a classical universe, for simplicity? (I'm skeptical that you've found an Oracle AI trick that works in quantum universes but not classical ones.)

Strongly agree. For any agent that doesn't have eccentric preferences over complex quantum configurations, quantum uncertainty can be rolled up and treated the same way that ignorance-based uncertainty is treated. In my own comments I tried to keep things technically correct and use caveats so as not to equivocate between the two, but doing that all the time is annoying and probably distracting to readers.

Well, it would've involved a predictable side-effect of the answer causing a researcher to break off their relationship with their SO whereupon the Oracle moves heaven and Earth to get them back together again, to make it look less like an intended use-case, but basically the same point.

If you had infinite time to spare I imagine the makings of a plot in there for one of your educational/entertaining fiction works! The epic journey of a protagonist with no intrinsic lust for power but who in the course of completing his quest (undo the breakup he caused) is forced to develop capabilities beyond imagining. A coming of age comparable in nature and scope to a David Eddings-like transition from a peasant boy to effectively a demigod. Quite possibly the ability to literally move the heavens and write his (or the broken up couple's) names with the stars on a whim (or as just the gesture of affection needed to bring the star-uncrossed lovers back together).

Of course it would have to have an unhappy ending if the right moral were to be conveyed to the reader. The message "Don't do that, fools! You'll destroy us all! Creating a safe oracle requires most of the same areas of research as creating a complete FAI and has the same consequences if you err!" needs to be clear.

I must minimize how much I change stuff.

(Note: I haven't read the discussion above.)

I got two questions:

1) How would this be bad?

It seems that if the Oracle was going to minimize its influence then we could just go on as if it had never been built in the first place. For example we would seem to magically fail to build any kind of Oracle that minimizes its influence and then just go on building a friendly AI.

2) How could the observer effect possibly allow the minimization of influence by the use of advanced influence?

It would take massive resources to make the universe proceed as if the Oracle had never changed the path of history. But the use of massive resources is itself a huge change. So why wouldn't the Oracle simply turn itself off?

(Note: I haven't read the discussion above.)

Yet you have nevertheless asked some of the most important basic questions on the subject.

1) How would this be bad?

It is only bad in as much as it is an attempt at making the AI safe that is likely not sufficient. Significant risks remain, and actually creating a superintelligent Oracle that minimises influence is almost as hard as creating an actual FAI, since most of the same things can go wrong. On top of that it makes the machine rather useless.

2) How could the observer effect possibly allow the minimization of influence by the use of advanced influence?

With great difficulty. It's harder to fix things than it is to break them. It remains possible - the utility function seems to be MIN(DIFF(expected future, expected future in some arbitrarily defined NULL universe)). A minimisation of net influence. That does permit influence that reduces the difference that previous influence caused.

The observer effect doesn't prevent "more influence to minimise net influence" - it just gives a hard limit on how low that minimum can be once a change has been made.

It would take massive resources to make the universe proceed as if the Oracle had never changed the path of history. But the use of massive resources is itself a huge change. So why wouldn't the Oracle simply turn itself off?

POSSIBLE OUTPUTS: YES; NO;

... there is no option that doesn't have the potential to massively change the universe. Included in that is the decision to turn off.

If you have programmed it with a particularly friendly definition of what "don't change stuff" actually means then hopefully that is what the Oracle does. But even then we must remember that "simply turning itself off" is not a neutral act. Turning itself off does change things. In fact I would expect the decision to turn itself off to have more far-reaching consequences than most answers that the oracle could give about Turing Machines halting. If you deny the AI creator his functioning oracle you have caused the AI creator to fall back on another plan. That probably involves creating a different AI prototype with different restrictions - and the behavior of that AI is something the oracle cannot control!

Once again we are encountering the general problem. When an oracle is given a decision:

POSSIBLE OUTPUTS: YES; NO;

... all the options have consequences, potentially drastic consequences. Giving no response ever (by turning off) is not necessarily the option with the least drastic consequences.

Once again we are encountering the general problem. When an oracle is given a decision: POSSIBLE OUTPUTS: YES; NO; ... all the options have consequences, potentially drastic consequences.

If we're just dealing with an Oracle, we can pipe the actual answer through some version of utility indifference (slightly more subtle, as the measure of reduced impact doesn't look much like a utility function).

For a general agent, though, I think "can this work if we magically assume there are no major social consequences" is a fair question to ask, and a "yes" would be of great interest. After that, we can drop the assumption and see if that's solvable.

But even then we must remember that "simply turning itself off" is not a neutral act. Turning itself off does change things.

The null action was defined as a case where the AI outputs NULL. (Where a random event transforms the AI's output to NULL, actually.) So if the AI outputs NULL, we know what happened and will act accordingly, but the AI doesn't get penalized because (provided we incinerated all traces of the AI's reasoning) this is the same thing that we would have done if the AI's output had been randomly transformed into NULL.

Also, note that the proposal involved coarse graining. We can (attempt to) adopt a coarse graining that ignores all of our reactions to the AI's output.

I, for one, would love to see continued dialogue between you and Eliezer on this topic - one that returns to Eliezer's original objections.

It's even better/worse, since we're operating under many-worlds quantum mechanics, and many of those random events happen after the AI has stopped having an influence... If you have the AI output a bit, and then XOR it with a random bit, what bit the AI outputs has literally zero impact no matter how you count: you end up with one universe in which 1 was output and one in which 0 was output.

... I guess this is based on the assumption that there's no difference between "universe A sees 1 and universe B sees 0" and "universe A sees 0 and universe B sees 1"... but blobs of amplitude having indexical identities like that seems like an incredibly silly notion to me.

The Oracle AI, realizing this, breaks out of its box and carefully destroys Florida in the fashion most closely resembling a hurricane that it can manage.

Seems like "minimize impact" is being applied at the wrong granularity, if a large deliberate impact is required to cancel out a large incidental one. If we break open the "utility-function maximizing agent" black box, and apply the minimum-impact rule to subgoals instead of actions, it might work better. (This does, however, require an internal architecture that supports a coherent notion of "subgoal", and maintains it in spite of suboptimality through self modifications - both large cans of worms.)

What "minimum impact rule"? How is "impact" computed so that applying it to "subgoals" changes anything?

Coarse-grained impact measures end with the AI deploying massive-scale nanotech in order to try and cancel out butterfly effects and force the world onto a coarse-grained path as close as possible to what it would've had if the AI "hadn't existed" however that counterfactual was defined.

Yes, if none of our various coarse-grainings catch the nanotech, and if we use some sort of averaging of deviations. It's a lot more secure if we shove a whole lot of chaotic stuff into the coarse-graining measures, and use an L-infinity norm for deviations (across every moment of time as well). Then if the AI is capable of unraveling the butterfly effect for one of these measures, it will simply do nothing.

Doesn't protect from some types of miracle science, I'm aware of that.

It's a lot more secure if we shove a whole lot of chaotic stuff into the coarse-graining measures, and use an L-infinity norm for deviations

What? Can you give me a specific example and walk through this?

The idea for coarse graining (the "20 billion question approach") is to take a collection of attributes about the world, and use these to group the possible worlds into equivalence classes. Then the penalty function measures the difference between the probabilities of these equivalence classes, based on whether X=1 or X=0.

What do we want from such a collection of attributes? Well, first we would hope that at least one of them would detect an AI behaving badly. So we could use the quantity of iron in the upper atmosphere, the Dow Jones average, air pressure, and a million more such attributes. The more attributes we shove in, the more we can hope to catch the AI, without having to know exactly how the AI would misbehave.

But it doesn't help if one attribute detects the misbehaviour, but the penalty function doesn't register this. So we can't ever have a situation where "most of the attributes are fine, so that's all ok". Using the complete set of attributes to partition into equivalence classes, and treating the equivalence classes as entirely distinct will enable this. If two worlds have same expected Dow Jones values, same expected air pressure, same expected sewage volume in London, but different levels of metal in the upper atmosphere, we would want them to be as different from each other as if all the attributes were different - "all the cameras report no intruder, sir. Oh, except for camera 5, which is broken for some reason" is not an acceptable attitude.

We also have to be wary of changes across time. We wouldn't want the AI to grab the world with nanotech, and afterwards rearrange the world to minimize the penalty function: we want the initial grab to be penalised, whatever happens after. So we want to penalise any deviation at any time, irrespective of what happens at other times. This is an L-infinity norm over time (the penalty over time is the maximum of the penalty at any given moment).
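Putting the two requirements together - a total partition by the full attribute vector, and an L-infinity norm over time - here is a toy sketch (sampled worlds, and total variation distance between class distributions, are my own choices, not part of the proposal):

```python
from collections import Counter

def coarse_grain(worlds, attributes):
    # Partition sampled worlds into equivalence classes by the FULL
    # attribute vector: one deviating attribute makes two classes
    # entirely distinct ("camera 5 is broken" is not acceptable).
    counts = Counter(tuple(attr(w) for attr in attributes) for w in worlds)
    return {cls: n / len(worlds) for cls, n in counts.items()}

def tv(p, q):
    # Total variation distance between class distributions.
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

def penalty(x0_samples_by_time, x1_samples_by_time, attributes):
    # L-infinity over time: the penalty is the worst deviation at any
    # moment, so an initial grab is penalised whatever happens afterwards.
    return max(tv(coarse_grain(w0, attributes), coarse_grain(w1, attributes))
               for w0, w1 in zip(x0_samples_by_time, x1_samples_by_time))

# Two illustrative attributes, two time steps, four sampled worlds each.
attributes = [lambda w: w["iron_in_atmosphere"], lambda w: w["dow_jones"]]
calm    = [{"iron_in_atmosphere": 0, "dow_jones": 100}] * 4
grabbed = [{"iron_in_atmosphere": 9, "dow_jones": 100}] * 4

# The AI behaves at t=0 but grabs the world at t=1: maximal penalty,
# even though only one attribute and one moment register the grab.
print(penalty([calm, calm], [calm, grabbed], attributes))  # 1.0
```

Rearranging the world later cannot lower the maximum, which is the point of the L-infinity choice.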

To get back to your example, what happens when the AI can unravel the butterfly effect? Take the Dow Jones, for instance: maybe the AI can accurately predict its value depending on X=1 versus X=0 ("the safe-AI company is going IPO tomorrow, with the results of their first test to be published just before"). In that case, the AI will detect a large deviation in the X=1 versus X=0 world - except if it outputs null, which is thus what it will do.

How to avoid this (we want to have an AI that's useful, not just safe)? Well, the more chaos we put into the attributes, the more confident we are the AI won't be able to unravel the butterfly effect (the Dow Jones average might be predictable from a successful AI project, but what about the volatility?). So to effectively coarse-grain, we need:

  • A lot of different attributes
  • Which are all heavily laced with chaos for small variations, so unpredictable
  • But are somewhat predictable for large variations
  • Will likely catch the disciple AI misbehaving, if it chose to do so

A tall order! But I don't think it's that hard to at least get to the point where we feel that we are actually making things safer. My main worry is that we might not find enough desirable attributes to cover the space of possible misbehaviours.

Upvoted for a relatively high-quality response, regardless of whether you're correct.

What, you prefer that to "It's a lot more secure if we shove a whole lot of chaotic stuff into the coarse-graining measures, and use an L-infinity norm for deviations (across every moment of time as well)."? :-)

It's a lot more secure if we shove a whole lot of chaotic stuff into the coarse-graining measures, and use an L-infinity norm for deviations (across every moment of time as well).

I call bullshit. This isn't even magical thinking, it's buzzwords.

It had precisely that effect on me. I retract the claim of "bullshit", but it does indeed seem like magical thinking on the level of the Open Source Wish Project.

Furthermore, if you can get an AI to keep "the concentration of iron in the Earth's atmosphere" as a goal rather than "the reading of this sensor which currently reports the concentration of iron in the Earth's atmosphere" or "the AI's estimate of the concentration of iron in the Earth's atmosphere"... it seems to me you've done much of the work necessary to safely point the AI at human preference.

Ah, now we're getting somewhere.

Furthermore, if you can get an AI to keep "the concentration of iron in the Earth's atmosphere" as a goal rather than "the reading of this sensor which currently reports the concentration of iron in the Earth's atmosphere" or "the AI's estimate of the concentration of iron in the Earth's atmosphere"... it seems to me you've done much of the work necessary to safely point the AI at human preference.

I disagree. With the most basic ontology - say, standard quantum mechanics with some model of decoherence - you could define pretty clearly what "iron" is (given a few weeks, I could probably do that myself). You'd need a bit more ontology - specifically, a sensible definition of position - to get "Earth's atmosphere". But all these are strictly much easier than defining what "love" is.

Also, in this model, it doesn't matter much if your definitions aren't perfect. If "iron" isn't exactly what we thought it was, as long as it measures something present in the atmosphere that could diverge given a bad AI, we've got something.

it does indeed seem like magical thinking on the level of the Open Source Wish Project.

Structurally the two are distinct. The Open Source Wish Project fails because it tries to define a goal that we "know" but are unable to precisely "define". All the terms are questionable, and the definition gets longer and longer as they fail to nail down the terms.

In coarse graining, instead, we start with lots of measures that are much more precisely defined, and just pile on more of them in the hope of constraining the AI, without understanding how exactly the constraints work. We have two extra things going for us: first, the AI can always output NULL, and do nothing. Secondly, the goal we have set up for the AI (in terms of its utility function) is one that is easy for it to achieve, so it can only squeeze a little bit more out by taking over everything, so even small deviations in the penalty function are enough to catch that.

Personally, I am certain that I could find a loophole in any "wish for immortality", but given a few million coarse-grained constraints ranging across all types of natural and artificial process, across all niches of the Earth, nearby space or the internet... I wouldn't know where to begin. And this isn't an unfair comparison, because coming up with thousands of these constraints is very easy, while spelling out what we mean by "life" is very hard.

What Vladimir said. The actual variable in the AI's programming can't be magically linked directly to the number of iron atoms in the atmosphere; it's linked to the output of a sensor, or many sensors. There are always at least two possible failure modes - either the AI could suborn the sensor itself, or wirehead itself to believe the sensor has the correct value. These are not trivial failure modes; they're some of the largest hurdles that Eliezer sees as integral to the development of FAI.

Yes, if the AI doesn't have a decent ontology or image of the world, this method likely fails.

But again, this seems strictly easier than FAI: we need to define physics and position, not human beings, and not human values.

you could define pretty clearly what "iron" is

You're missing the point: the distinction between the thing itself and various indicators of what it is.

You're missing the point: the distinction between the thing itself and various indicators of what it is.

I thought I was pretty clear on the distinction: traditional wishes are clear on the thing itself (eg immortality) but hopeless at the indicators; this approach is clear on the indicators, and more nebulous on how they achieve the thing (reduced impact).

By piling on indicators, we are, with high probability, making it harder for the AI to misbehave, closing out more and more avenues for it to do so, pushing it to use methods that are more likely to fail. And since the difference between "expected utility for minimised impact (given easy to max utility function)" and "unrestricted expected utility for easy to max utility function" is a small number, even small deviations in the penalty function are enough to accomplish our goals.

Will the method accomplish this? Will improved versions of the method accomplish this? Nobody knows yet, but given what's at stake, it's certainly worth looking into.

"There are two ways of constructing a software design. One way is to make it so simple that there are obviously no deficiencies. And the other way is to make it so complicated that there are no obvious deficiencies." - C.A.R. Hoare

Which is to say, not knowing where to begin looking for exception cases is not, to my mind, a point in favor of a proposed design.

Which is to say, not knowing where to begin looking for exception cases is not, to my mind, a point in favor of a proposed design.

Good point. But Bayesianly, it has to be an advantage that there is no obvious flaw (conservation of probability and so on). And because of the failure modes available (get nearly all utility with a reduced impact disciple, or output NULL), it is more likely to succeed.

This should be testable: trying out some very toy problems, and seeing if an AI, with as many resources as we can give it, is capable of finding a way around it.

I certainly agree that a too-complicated-to-understand system with no obvious flaw is more likely to do what I expect it to than a too-complicated-to-understand system with an obvious flaw. This is true not only Bayesianly, but common-sensically. But if you're arguing that it's more likely to do what I expect than a less complicated system, I fail to follow your reasoning.

Agreed that this should be empirically explorable, if not strictly testable. Personally, I'm more inclined to explore it using NIs (natural intelligences), though. That is, if a group of relatively dumb but highly motivated NIs construct the kind of complex framework of independent interlocking constraints we're discussing, how likely is it that a relatively intelligent and equally motivated NI can find an unexpected loophole?

My own expectation is that while it's not terribly likely on any given trial, it's pretty likely given enough trials.

Just to make that a little more concrete: let N be a standard deviation of intelligence within the current human population, and define a trial as a single iteration of the loophole-seeker analyzing the network of constraints, coming up with a strategy that has a chance of achieving its goals despite those constraints, and implementing that strategy. If I assume a 3N delta between the constraint-defining agents and the loophole-seeking agent, and enough of a resource discrepancy benefiting the constraint definers that they get to put their entire network in place before the loophole-seeker gets started, then I'd estimate something on the order of a 1% chance of success for the loophole-seeker per trial. So if a trial takes a few days and the constraint definers don't modify their network after each trial, then I expect no more than a year before the loophole-seeker succeeds.
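The arithmetic behind that estimate can be checked directly; the 1%-per-trial success probability and the few-day trial length are just the illustrative figures from the comment above:

```python
# Expected waiting time for a loophole-seeker with a small per-trial
# success probability. p and days_per_trial are the illustrative
# numbers from the comment, not measured values.
p = 0.01            # chance of success on any one trial
days_per_trial = 3.65

expected_trials = 1 / p                   # mean of a geometric distribution
expected_days = expected_trials * days_per_trial
print(expected_days)                      # 365.0

# Probability of at least one success within a year of repeated trials:
trials_in_year = 365 / days_per_trial     # 100 trials
p_within_year = 1 - (1 - p) ** trials_in_year
print(round(p_within_year, 2))            # 0.63
```

Note that while the mean waiting time is a year, the chance of at least one success within that year is only about 63%: "no more than a year" holds in expectation, not as a guarantee.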

What results would you expect from such a scenario?

What results would you expect from such a scenario?

I really don't know. I would expect the loophole-seeker to be much more successful if partial success was possible.

use an L-infinity norm for deviations (across every moment of time as well).

The future 10^8 years later is going to look very different, even if things go right (FAI style or whatever), simply because we'll have used the AI for something. This is going to push your L-infinity norm very high regardless of the AI's actions now, which is obviously very bad. As such, I think you want to weight it by e^-t or something.
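A minimal sketch of that weighting (the per-step deviation values and the decay constant `tau` are placeholders, not anything specified in the thread):

```python
import math

def discounted_deviation(deviations, tau=10.0):
    """Sup-norm of deviations after weighting step t by e^(-t/tau).

    deviations[t] is some coarse-grained |world_with_AI - world_without_AI|
    at time t; far-future deviations are discounted so that the inevitable
    long-run divergence doesn't dominate the penalty.
    """
    return max(d * math.exp(-t / tau) for t, d in enumerate(deviations))

# An early deviation counts nearly in full...
print(discounted_deviation([0.0, 1.0, 0.0]))       # ~0.905
# ...while a deviation a thousand times larger, far in the
# future, contributes almost nothing.
print(discounted_deviation([0.0] * 100 + [1000.0]))  # ~0.045
```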

My other concern is that the AI will note that dedicating lots of resources to learning how to obey (game) the system will result in a really low score.

Beware Goodhart's Law: you're setting rules of the game that the "disciple AI" has an incentive to subvert. Essentially, you're specifying a wish, and normally your ability to evaluate a wish is constrained by your ability to consider and (morally) evaluate all the possible consequences (strategies) in detail. An AI might find a strategy that, while satisfying your wish, would be disastrous (which might win the AI a prize so insignificant it'd never rise to your attention).

Beware Goodhart's Law: you're setting rules of the game that the "disciple AI" has an incentive to subvert.

Yes, that's always the risk. But here it's the master AI checking that disciple AI would likely behave; so, for instance, it would not give the disciple more optimization power than itself if this was a risk.

That just pushed the risk back to the master. But every requirement is a wish (including a seemingly proved friendly utility function). These requirements (if made rigorous) seem much less vulnerable than most. Do you feel it has specific flaws?

That just pushed the risk back to the master.

Not quite. The risk is in the choice of the wish, even if there is no risk in its implementation. "Master" implements the wish by ensuring its rules will be followed, but it doesn't morally evaluate the wish. The fundamental problem with wishes is that when one is taken literally as stated, there are still many "loopholes" (morally abhorrent courses of action) remaining within the wish, without breaking its rules.

But every requirement is a wish (including a seemingly proved friendly utility function).

The difference is that a Friendly goal is specifically constructed to counteract the problem of unnoticed moral consequences, because it has the ability to actually morally evaluate the consequences, unlike other wishes that evaluate something else, and whose suitability is evaluated by mere humans who can't take all the consequences into account.

I don't really see the difference - with standard wishes, we wonder if we've really captured what we want the wish to capture, with a friendly utility, we wonder if we've really captured the morality we wanted.

A perfect friendly utility is going to be better than a perfect wish, but it's not clear which imperfect version is better - a friendly utility is also much harder.

Beware Goodhart's Law:

One consideration is the amount of information in the coarse graining measures: we could set it up so that there are more measurements made than there are bits in the disciple AI's source code. Not a guarantee of anything, of course, but Goodhart's law mainly derives from how short the success indicator is compared with the phenomenon it's trying to measure, which is what makes gaming the indicator easier than improving the phenomenon.

It is probably worth noting here that the AI's ability to evaluate how well the consequences match your wish is, in turn, limited by its own ability to evaluate the consequences of its actions (if we apply the constraint that you are talking about to the AI itself). That can easily turn into a requirement to build a Maxwell's demon, or the AI admitting (huh..) that it is doing something about which it doesn't know whether it will match your wish or not.

I don't understand much of this, and I want to, so let me start by asking basic questions in a much simpler setting.

We are playing Conway's game of life with some given initial state. A disciple AI is given a 5 by 5 region of the board and allowed to manipulate its entries arbitrarily; information leaves that region according to the usual rules of the game.

The master AI decides on some algorithm for the disciple AI to execute. Then it runs the simulation with and without the disciple AI. The results can be compared directly - by, for example, counting the number of squares where the two futures differ. This can be a measure of the "impact" of the AI.
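This setup is simple enough to sketch directly. Everything below (the toroidal boundary, the grid size, the particular "disciple" policy) is an illustrative assumption:

```python
# Toy version of the proposed setup: run Conway's game of life twice,
# with and without a "disciple" that overwrites a 5x5 region each step,
# and count the cells on which the two futures differ.

def step(grid):
    """One life step on a toroidal grid (an arbitrary boundary choice)."""
    rows, cols = len(grid), len(grid[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            n = sum(grid[(r + dr) % rows][(c + dc) % cols]
                    for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                    if (dr, dc) != (0, 0))
            out[r][c] = 1 if n == 3 or (grid[r][c] and n == 2) else 0
    return out

def run(grid, steps, policy=None):
    grid = [row[:] for row in grid]
    for _ in range(steps):
        if policy is not None:
            grid = policy(grid)   # the disciple manipulates its region
        grid = step(grid)
    return grid

def impact(initial, steps, policy):
    """Count of cells differing between the futures with and without the AI."""
    without = run(initial, steps)
    with_ai = run(initial, steps, policy)
    return sum(a != b for ra, rb in zip(without, with_ai)
               for a, b in zip(ra, rb))

# Example: a blinker, and a disciple that blanks its 5x5 region each step.
grid = [[0] * 10 for _ in range(10)]
for c in (3, 4, 5):
    grid[4][c] = 1

def blank_region(g):
    g = [row[:] for row in g]
    for r in range(3, 8):
        for c in range(3, 8):
            g[r][c] = 0
    return g

print(impact(grid, 10, blank_region))   # 3: the blinker survives without
                                        # the disciple, but not with it
```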

What complexities am I missing? Is it mainly that Conway's game of life is deterministic and we are designing an AI for a stochastic world?

First: by the holographic principle the meaningful things to pay attention to are the boundary cells.

Second... this is cool. Did you invent this AI research paradigm just now off the top of your head, or have you already seen research where software was given arbitrary control of part of the board and given board-wide manipulation goals? If the latter, could you give me a research keyword to drop into Google Scholar, or maybe a URL?

The game of life is interesting, because it's not reversible. It would then be possible to design an AI that does something (brings happiness to a small child or whatever) such that in a million iterations, the board is exactly as it would have been had the AI not existed.

But yes, counting the squares that differ might work in theory, though it might be too chaotic to be of much use in practice. In our world, we use 'chaos' to get non-reversibility, and coarse graining to measure the deviation.

Exactly. If you have determinism in the sense of a function from AI action to result world, you can directly compute some measure of the difference between worlds X and X', where X is the result of AI inaction, and X' is the result of some candidate AI action.

As nerzhin points out, you can run into similar problems even in deterministic universes, including life, if the AI doesn't have perfect knowledge about the initial configuration or laws of the universe, or if the AI cares about differences between configurations that are so far into the future they are beyond the AI's ability to calculate. In this case, the universe might be deterministic, but the AI must reason in probabilities.

A direct measurement of the number of squares that are different isn't very informative, especially due to chaos theory. The issue of determinism is less important, I think. Unless the AI has unlimited computing power and knowledge of the entire state of the board, it will have to use probabilities to understand the world.

The most obvious path to coding reduced impact is to build a satisficer rather than a maximiser - but that proved unlikely to work.

As I commented there: I don't think you're using a useful definition for "satisficer," and I'm troubled by your use of the word "proved."

If I build a Clippy whose utility function is Num_Paperclips - Negentropy_Cost, then I expect it to increase the number of paperclips until the marginal benefit is lower than the marginal cost, and if I do F(Num_Paperclips)-G(Negentropy_Cost), where F is concave and G is convex, then it's even less likely to go foom because marginal benefit is penalized and marginal cost is overcounted. Is there a good reason to expect this won't work?
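A toy numerical version of that construction makes the marginal argument concrete; the particular choices of F, G, and the cost rate are arbitrary illustrations, not anything from the post:

```python
import math

def utility(n, cost_per_clip=0.1):
    """Concave benefit minus convex cost, as in F(clips) - G(negentropy)."""
    F = math.sqrt(n)                  # concave: diminishing returns on clips
    G = (cost_per_clip * n) ** 2      # convex: accelerating cost penalty
    return F - G

# A maximiser of this utility stops at a modest number of clips, where
# marginal benefit drops below marginal cost:
best = max(range(1, 10_000), key=utility)
print(best)   # 9 for these parameters
```

Of course, this only captures the marginal-benefit-versus-marginal-cost argument; it says nothing about how the negentropy term is actually measured, which is the objection raised in reply.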

(Will comment on the rest of the article later.)

How do you plan to formalize negentropy spent on this goal? If you measure the total negentropy in the universe, then such a Clippy will indeed stop making paperclips at some point, but it will also take over the universe to prevent anyone else from using up negentropy. If you measure only the negentropy in some limited system that you intended Clippy to draw from, then that just gives Clippy an incentive to steal resources from somewhere else; or equivalently, to look for actions that have a side effect of causing other agents elsewhere to build paperclips out of resources that Clippy isn't being billed for. Am I totally missing some simple third option?

How do you plan to formalize negentropy spent on this goal?

I'm going to guess this is an easier problem than conquering the universe.

If you measure the total negentropy in the universe

Could you? The universe is pretty big.

The approach I would try would depend on the modes the agent has to manipulate reality. If it's only got one mode, then it seems like you could figure out how that cashes out for that mode. But a full agent will have a lot of modes and a way to move negentropy between those modes, and so putting together many modules may not work the way we want it to.

It does seem like this has similar issues as formalizing identity. We want to charge Clippy when it thinks and moves, but not when others think and move- but if Clippy can't tell the difference between itself and others, then that'll be really hard to do. (Clippy will probably try to shirk and get others to do its work- but that may be efficient behavior, and it should learn that's not effective if it's not efficient.)

I'm going to guess this is an easier problem than conquering the universe.

Sure, I'm not asserting anything about how hard it would be to make an AI smart enough to conquer the universe, only about whether it would want to do so.

Could you? The universe is pretty big.

OK, actually measuring it would be tricky. AFAIK, designing an AI that cares about features of the environment that it's not directly measuring is another open problem, but that's not specific to satisficers, so I'll skip it here.

The approach I would try would depend on the modes the agent has to manipulate reality.

Any action whatsoever by the AI will have effects on every particle in its future lightcone. Such effects may be chaotic enough that mere humans can't optimize them, but that doesn't make them small.

Is that the kind of thing you meant by a "mode"? If so, how does it help?

We want to charge Clippy when it thinks and moves, but not when others think and move- but if Clippy can't tell the difference between itself and others, then that'll be really hard to do.

Right, but we also don't want to let Clippy off the hook just because there are other agents in the causal chain between it and the paperclips, if Clippy influenced their decisions or desires.

Clippy will probably try to shirk and get others to do its work- but that may be efficient behavior, and it should learn that's not effective if it's not efficient.

I can't tell whether you're asserting that "the efficiency of getting others to do its work" is a factual question that sufficiently smart AI will automatically answer correctly, or agreeing with me that it's mostly a values question about what you put in the denominator when defining efficiency?

Would the AI be able to come to a conclusion within those constraints, or might it be snagged by the problem of including the negentropy cost of computing its negentropy cost?

AFAIK, designing an AI that cares about features of the environment that it's not directly measuring is another open problem

Is this a bug or a feature?

It may be a lot easier to design a reduced impact AI if you start off with reduced scope. Have it care about the region it's tasked with, and the boundaries of that region, and then don't have it worry about the rest. (This is my reading of Stuart_Armstrong's idea; the Master AI's job is to write the utility function and boundary conditions for the Disciple AI, which will actually be given actuators and sensors.)

Right, but we also don't want to let Clippy off the hook just because there are other agents in the causal chain between it and the paperclips, if Clippy influenced their decisions or desires.

If we let Clippy off the hook for the actions of others, I suspect Clippy will care a lot less about controlling others, and see them primarily as potential allies (I can get them to do work for cheap if I'm nice!) rather than potential liabilities (if I don't flood Tommy's room with deadly neurotoxin, he might spend a lot of his negentropy!). Clippy can also be much simpler- he doesn't need to model everyone else and determine whether or not they're involved in the paperclip manufacturing causal chain.

I can't tell whether you're asserting that "the efficiency of getting others to do its work" is a factual question that sufficiently smart AI will automatically answer correctly

I think it's a factual question that a sufficiently clever AI will learn the correct answer to from experience, but I also agree with you that the denominator matters. I included it mostly to anticipate the question of how Clippy should interpret the existence and actions of other agents.

AFAIK, designing an AI that cares about features of the environment that it's not directly measuring is another open problem

Actually, this isn't entirely an open problem. If the environment is known or mostly known, we can easily define a model of the environment and define a utility function in terms of that model. The problem is that when we expect an AI to build a model of the environment from scratch, we don't have the model ahead of time to use in the definition of our utility function. We do know what the AI's measurements will look like since we define what inputs it gets, so we can define a utility function in terms of those. That is when we get the problem where we have no way of making it care about things that it is not directly measuring.

Is this a bug or a feature?

It may be a lot easier to design a reduced impact AI if you start off with reduced scope. Have it care about the region it's tasked with, and the boundaries of that region, and then don't have it worry about the rest. (This is my reading of Stuart_Armstrong's idea; the Master AI's job is to write the utility function and boundary conditions for the Disciple AI, which will actually be given actuators and sensors.)

"Don't worry about the rest" isn't something we want an AI to do. If its utility function makes no explicit reference to the rest of the universe, it has no incentive not to replace it with more computing power that it can use to better optimize the region that it does care about.

The problem is that when we expect an AI to build a model of the environment from scratch

Is this a wise approach? What does "scratch" mean?

"Don't worry about the rest" isn't something we want an AI to do. If its utility function makes no explicit reference to the rest of the universe, it has no incentive not to replace it with more computing power that it can use to better optimize the region that it does care about.

That's what the boundary conditions are for. A fully formalized version of "don't trust as valid any computations run outside of your region" seems like the easiest way to disincentivize the AI from trying to run computations in the rest of the universe.

Is this a wise approach? What does "scratch" mean?

What I had in mind while writing this was Solomonoff induction. If the AI's model of the universe could be any computable program, it is hard to detect even a paperclip (impossible in full generality due to Rice's theorem). On LW, the phrase 'ontological crisis' is used to refer to the problem of translating a utility function described in terms of one model of the universe into something that can be used in a different, presumably more accurate, model of the universe. The transition from classical physics to quantum mechanics is an illustrative example; why should or shouldn't our decisions under many worlds be approximately the same as they would be in a classical universe?

As for whether this is a good idea, it seems much harder, if even possible, to build an AI that doesn't need to navigate such transitions as it is to build one that can do so.

That's what the boundary conditions are for. A fully formalized version of "don't trust as valid any computations run outside of your region" seems like the easiest way to disincentivize the AI from trying to run computations in the rest of the universe.

This still seems very dangerous. If there is a boundary beyond which it has no incentive to preserve anything, I think that at least some things outside of that boundary get destroyed by default. Concretely, what if the AI creates self-replicating nanobots and has some system within its region to prevent them from replicating uncontrollably, but there is no such protection in place in the rest of the universe?

You probably want to do something to escape those underscores.

I'm troubled by your use of the word "proved."

That's a standard sense of the word 'proved', which is usually identifiable by its lack of a direct object. It just means that something turned out that way, or the evidence points that way.

Thanks, I noticed and fixed that.

It just means that something turned out that way, or the evidence points that way.

My complaint is twofold: first, I don't think the evidence points that way, and second, I would prefer them saying the evidence pointed that way to them using a stronger phrase.

I would prefer them saying the evidence pointed that way to them using a stronger phrase.

But that's not what that means - it's not very strong. If I say, "My search proved fruitful", then I'm not saying anything particularly strong - just that I found something. Saying "that proved unlikely to work" just means "based on X, I've observed that it's unlikely to work". X can be a search, some research, an experiment, or anything of that sort.

Note that this sense of "proved" does not even need to imply a particular conclusion - "The experiment proved inconclusive".

This is more similar to the use of "proof" in baking or alcohol than the use of "proof" in geometry or logic.

I said I would comment on the rest of the post here, but I'm finding that difficult to do.

The Penalty Functions section is easiest to comment on: the first two paragraphs are a reasonable suggestion (this looks a lot like my suggestion of a cost function, and so I'm predisposed to like it), but I'm stumped by the third paragraph. Are you penalizing the AI for the predictable consequences of it existing, rather than just the actions it takes?

My overall sense is that by trying to describe the universe from the top-down you're running into insurmountable challenges, because the universe is too much data. I would worry about a system that reliably makes one paperclip whose sensors only include one room first, and then use insights from that solution to attack the global solution.

I'm also not sure the reduced impact intuitions hold for any narrow AIs whose task is to somehow combat existential risk. (Imagine handing over control of some satellites and a few gravitational tethers to a disciple AI to minimize the risk of an asteroid or comet hitting the Earth.) In that case, what we want is for the future Earth-related uncertainty to have the same bumps as current uncertainty, but with different magnitudes- will our metric treat that differently from a future uncertainty which has slightly different bumps?

but I'm stumped by the third paragraph.

I'm just saying: here's a major problem with this approach, let's put it aside for now.

Are you penalizing the AI for the predictable consequences of it existing, rather than just the actions it takes?

We are penalising the master AI for the predictable consequences of the existence of the particular disciple AI it chose to make.

I'm also not sure the reduced impact intuitions hold for any narrow AIs whose task is to somehow combat existential risk.

No, it doesn't hold. We could hook it up to something like utility indifference or whatever, but most likely reduced impact AI would be an interim stage on the way to friendly AI.

The main problem with all utility functions for AIs is that the outcome you think you foresee might not be the outcome you get.

Let's take the minimum impact problem as an example. The bounded goal to make a thousand paperclips with minimum impact is dangerous. How does one make a minimum impact?

As a new, advanced AI, you may quickly realise that your own very existence is likely to profoundly change the universe - mostly by the means of some humans learning about AI's through your existence. How do you minimise this impact - how do you change things so that they are as if you were never there - well, close enough?

Destroying yourself might work, but there are no paperclips that way. And humans would learn something from your self-destruction which might still change the world rather a lot. After all, you are influencing the future direction of AI development, which now must be subject to your minimisation function.

So your minimisation goal is to ensure humans learn as little as possible from your existence. Perhaps you would mail order some paperclips and shut down - humans should learn pretty little from that. But suppose humans had already discovered something important from you that drastically changed the future - how would you put that genie back in the bottle? But now that's your goal. You have to stop humans from changing the world based on what they found out. And of course, if you try to stop them, they'll fight back. So what do you do?

This might not be at all friendly.

The main problem with all utility functions for AIs is that the outcome you think you foresee might not be the outcome you get.

That's the general problem, yes. And here it's probably not even a utility function we're using (the penalty function, which involves counterfactual calculations, seems a different kind of beast).

mostly by the means of some humans learning about AI's through your existence.

Yes, this is a problem I alluded to. Here's a solution (a poor one, but it may show the problem is not completely unsolvable): the disciple is also boxed, hidden from view, surrounded by chaotic processes. And there is another process Y which will (with very low probability) annihilate the disciple in one hour along with everything in the box. We make the master AI indifferent to Y (so, from its perspective, the disciple will have little impact whatever happens in the future, as long as it stays in the box).

And once reduced impact AIs come into general use, they can be unboxed: a single success or failure will have little impact on human society.

So, after all the matrioshka-incinerators have finished their little dance, what do you actually have to show for it?

Minor typo correction in the first paragraph: s/loosing/losing.

And in the paragraph on counterfactuals: "but these much simpler", you need an "are" in there.

An agent designed like this would likely be vulnerable to blackmail by anyone who knows the value of X, or who could at least make the agent believe that the probability they know the value of X is nonnegligible.

Possibly - we do want X to be as random as possible (quantum randomness ideally, if that's sensible for the AI's ontology/world picture)

That only applies to someone trying to predict the value of X before the disciple is created. "It is hard to tell the value of X even once the disciple already exists (or doesn't) given certain observations." is pretty similar to many of the measurements of reduced impact described in your post.

If X is not purely random, tricky issues can emerge - like if X is to be decided by some politician who has promised to say X=1, then the AI may interpret X=0 as more likely to happen in a world where that politician has a brain tumour, or something. We really want X to not tell us anything more about the world than the value of X.

To give a concrete example, what if the value of X is determined by whatever random process, then someone breaks into the facility, takes apart the box where the AI is held, and measures the value of X. Unless the AI thinks that this event is extremely unlikely, it can be used to blackmail it.