# False Thermodynamic Miracles

A putative new idea for AI control; index here. See also Utility vs Probability: idea synthesis.

Ok, here is the problem:

• You have to create an AI that believes (or acts as if it believed) that event X is almost certain, while you believe that X is almost impossible. Furthermore, you have to be right. To make things more interesting, the AI is much smarter than you, knows everything that you do (and more), and has to react sensibly when event X doesn't happen.

Answers will be graded on mathematics, style, colours of ink, and compatibility with the laws of physics. Also, penmanship. How could you achieve this?

## Acts and beliefs

The main trick is the parenthetical "(or acts as if it believed)". If the AI actually has true beliefs, then there is no problem with it being smart, being knowledgeable, or updating on the fact that X didn't happen. So the problem reduces to:

• How can an AI, that believes that X is almost impossible, behave as if X were almost certain?

This will be constructed by developing on my idea for "utility indifference" (that old concept needs a better name, btw; corrigibility is a much better candidate for being labelled "utility indifference").

## Noisy events and thermodynamic miracles

Imagine that a signal is travelling down a wire inside a noisy gas (or some other source of randomness). Almost certainly the signal will get through, but very occasionally, by the spontaneous action of the molecules in the gas, the signal will get overwhelmed, erased and overwritten with another signal, randomly. This could be seen as a type of thermodynamic miracle: an incredibly unlikely event caused by spontaneous and coincidental coordination between lots of different molecules. We assume here that the AI cannot cause a thermodynamic miracle, nor make it more likely (this may require some specific definitions of what "counts" as a miracle).

The idea is to make the AI indifferent to any world in which there is no thermodynamic miracle. So give it a utility u such that u(w)=constant, for any world w without a thermodynamic miracle at the point of interest. The value of that constant doesn't matter; the result is that the AI's actions have no impact on utility in any world where the miracle didn't happen. Thus, as long as there is a tiny chance that the miracle happened, the AI will behave as if that was a certainty: for only in those worlds do its actions have any impact.
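This indifference argument can be sketched numerically (a toy illustration of my own; the action names and utility values are made up):

```python
# Toy sketch: when utility is a constant C in every non-miracle world,
# the AI's action ranking depends only on utility conditional on the miracle.

def expected_utility(action, p_miracle, u_given_miracle, C=0.0):
    """E[u] = P(no miracle)*C + P(miracle)*u(action | miracle)."""
    return (1 - p_miracle) * C + p_miracle * u_given_miracle[action]

# Hypothetical utilities each action yields *if* the miracle occurred.
u_given_miracle = {"prepare_for_X": 10.0, "ignore_X": 2.0}

# However tiny the miracle probability, the constant term cancels out of
# comparisons between actions, so the same action is optimal throughout.
for p in (0.5, 1e-9, 1e-30):
    best = max(u_given_miracle, key=lambda a: expected_utility(a, p, u_given_miracle))
    assert best == "prepare_for_X"
```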

You can probably see where I'm going with this. Suppose the signal goes through the wire to turn on another AI. That AI will only turn on given a very specific 100-digit "ON" binary code. Given any other code, the second AI won't turn on, and will erase itself. We send the "ON" code.

So what happens? The first AI reasons that, given a thermodynamic miracle (the only worlds it cares about), the output signal is going to be 100 random bits, which have only a 2^{-100} chance of being "ON". Therefore it will behave as if it's almost certain "ON" will not get through the gas.

And what will (almost certainly) happen in reality? We send the "ON" signal, and it gets through unchanged. The first AI observes this (in fact, it expected this). However, the only possible world that it cares about, now, is one in which the "ON" signal was overwritten by a thermodynamic miracle... into exactly the same "ON" signal. As long as the possibility of this is non-zero, the AI will behave as if it's certain.

We should develop this point a little. The AI is almost certain, ahead of time, that "ON" will happen. It behaves, however, as if "ON" is almost certain not to happen. Then, after "ON" has gone through, it must adapt itself and backtrack on its previous behaviour. Moreover, it knows that this is what it will (almost certainly) do. Since it knows ahead of time it will be backtracking, why does it still go ahead with its behaviour?

The answer is that most of the universes in which it backtracks are universes where its actions have no consequences, and hence it doesn't care about them. So it might have a 1−ε chance of backtracking, but it only cares about a proportion ε·2^{-100} of those worlds. On the other hand, it won't backtrack with probability ε(1−2^{-100}), but it cares about those worlds completely.
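As a sanity check of those proportions (my own arithmetic, using exact fractions; the miracle probability ε is an assumed toy value):

```python
# Verify the caring/backtracking bookkeeping from the paragraph above,
# using exact rationals so the tiny probabilities don't underflow.
from fractions import Fraction

eps = Fraction(1, 10**6)        # assumed chance of a thermodynamic miracle
p_on = Fraction(1, 2**100)      # chance a random 100-bit string equals "ON"

# "ON" gets through (AI backtracks): either no miracle, or a miracle that
# rewrote "ON" into "ON". The AI only cares about the latter, measure eps * p_on.
p_backtrack = (1 - eps) + eps * p_on
cared_backtrack = eps * p_on

# "ON" does not get through: a miracle rewrote it into something else.
# The AI cares about all of these worlds.
p_no_backtrack = eps * (1 - p_on)

assert p_backtrack + p_no_backtrack == 1
# The worlds the AI cares about (total measure eps) are almost entirely
# the non-backtracking ones:
assert cared_backtrack + p_no_backtrack == eps
assert p_no_backtrack / eps == 1 - p_on
```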

## False Thermodynamic Miracles

This is the explanation for the title: the AI will behave as if there was a real thermodynamic miracle (a spontaneous overwriting of a signal with the same signal) while we "know" that actually it was just the same signal going through.

By varying what we put at the other end of the wire, we can get the AI to act "as if" it believed almost anything, without introducing problematic beliefs. We just have to be a bit cautious with our setups, by considering how humans would react if there were a real thermodynamic miracle (because the AI is certainly going to be considering that very carefully).

This allows us to consider AI designs with counterfactuals, false "beliefs", etc., which will be of great use. Are there any flaws I'm missing? Practical flaws are OK (e.g. how do you define a thermodynamic miracle and make sure the AI can't affect it), but theoretical flaws are much more important at this stage.

## In equations

Edit: (thanks to TylerJay)

In equation form, the AI is maximising

P(¬X)*C + P(X)*u(X, A),

for some constant C, some unlikely event X that the AI cannot affect, some set of relevant descriptors A, and some utility u. Since C is constant, this is exactly the same as maximising u(X, A) - the probability P(X) is irrelevant.

The whole setup described is simply a way to ensure that if W is the likely set of worlds consistent with observations after ¬X/X, then

P(W) ≈ P(¬X) ≈ 1 (we "know" that X doesn't happen and that we end up in W),

while

P(W|X) << 1 (in the worlds it cares about, the AI behaves as if W was incredibly unlikely to come about).

## Same result, computationally

The above was a "physical" design, revolving around a particular physical setup, plus some value hacking. But we can do the same in a purely computational way, by adapting drnickbone's design.

To adapt that design, assume Ω will run UDT/TDT agent A to get its estimation as to the likely probability of any one of the 2^{100} sequences of bits coming down the wire. It will then choose to send the signal that A assigns the lowest probability to. If there is a tie, it will choose the first sequence, lexicographically.

It seems that A must assign equal probability 2^{-100} to each sequence (as any increase in probability makes that sequence less likely to appear, and any decrease makes it more likely). Therefore, we "know" that the signal will be "0000...0000", but the AI cannot assign a higher probability to that. This achieves the same outcome as above.
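A small-scale sketch of this selection rule (my own, shrinking 100 bits to 4 and assuming the uniform equilibrium argued for above):

```python
# Omega's rule: send the bit sequence the agent assigns lowest probability,
# breaking ties lexicographically. At the claimed equilibrium the agent
# spreads probability uniformly, so the all-zeros string is sent.
from itertools import product

n = 4  # stand-in for 100 bits
sequences = ["".join(bits) for bits in product("01", repeat=n)]

# Assumed equilibrium: uniform probability over all 2^n sequences.
agent_probs = {s: 1 / 2**n for s in sequences}

# Lowest probability wins; ties resolved lexicographically.
chosen = min(sequences, key=lambda s: (agent_probs[s], s))
assert chosen == "0" * n  # so we "know" the all-zeros signal goes out
```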

I personally feel the physical design is more secure - it seems easier to define what a thermodynamic miracle at a given point in spacetime would be, than to be certain that the AI will not evade the computational restrictions somehow, or that the computational setup will remain stable and well defined. Also, A may not spread exactly equal probability over the 2^{100} sequences, messing up our certainty as to the result.

## Comments

I'm uneasy about this for similar reasons to why I was uneasy about utility indifference. If the AI collects compelling evidence that a thermodynamic miracle did not occur, then it is possible that the hypotheses left in which the thermodynamic miracle did occur will be dominated by strange, complicated hypotheses (e.g. the existence of some sort of Cartesian demon trying to trick the AI into thinking that the thermodynamic miracle occurred), and the AI's behavior may become erratic as a result.

Yes, this is a concern. But it seems a solvable concern, once we have the principles right (and it almost certainly won't be implemented as an actual wire in a gas cloud setup).

This is really interesting. I thought I understood it and I wanted to verify that by trying to summarize it (and maybe help others too) but now I'm not so sure...

Edit: Just to save anybody the reading time, my reasoning is false below. After sleeping on it, I see my mistake. Nothing below the "False Thermodynamic Miracles" subheading made sense to me yesterday because I thought the purpose of the setup was to have an "off switch" on the simulated AI under the false belief (letting it see the result of the signal after some time period). I get it now though. Maximizing "[P(no miracle) C] + [P(miracle) u(A) given miracle]" is the same as maximizing "u(A) given miracle". So the AI will act as if the miracle happened, because there's no cost associated with those actions if it didn't happen—only benefits if it did.

As I understand it, the basic premise is that the AI has a utility function u(w,A) where "w" is whether or not the specific thermodynamic miracle (TM) occurred and A represents a regular/normal input to a utility function (state of the universe, action taken, or whatever).

u(w,A) = C (a low constant) if w is false (TM did not happen), and u(w,A) is essentially "u(A) given w" when w is true (ie. it's a "normal" kind of utility function, dominated by things other than the TM, though it still has to incorporate the fact that the TM happened into its model of the world).

So, the signal is sent, and it calculates its expected utility for either scenario. If w = false, u(w,A) = a low constant, and if w = true, it determines an action A that maximizes the utility function u(A)|w. As long as 2^{-100}·u(A) > C, it starts taking action A.

Is this right so far?

Then, after "ON" has gone through, it must adapt itself and backtrack on its previous behaviour.

This mechanism seems underspecified to me (though maybe it's obvious to others). Why would it backtrack (or what do you mean by backtrack)? Eventually, it observes that w = false (that "ON" went through unchanged) and that its actions are no longer beneficial, so it just stops doing anything, right? The process terminates or it goes to standby?

If we want the AI to actually backtrack, wouldn't we need to modify the utility function so that it takes a configuration parameter as well? u(w,A,X) s.t. X = true iff the AI is in its initial configuration, and set it so that u(F,A,F) = 0, but u(F,A,T) = C ? (this probably counts as just an implementation detail, but I figured I'd mention it anyway)

The part that makes me feel like I didn't actually understand it at all is:

However, the only possible world that it cares about, now, is one in which the "ON" signal was overwritten by a thermodynamic miracle... into exactly the same "ON" signal. As long as the possibility of this is non-zero, the AI will behave as if it's certain.

Can you explain this a bit more? Do you mean that even after it observes "ON", it doesn't stop acting as if w = true? That P(TM that overwrites ON with ON) * u(A)|w > C ? If that's the case, then it would never backtrack, right? So it's essentially a full simulation of an AI under the assumption w, but with the knowledge that w is incredibly unlikely, and no built-in halting condition?

Thanks

Why would it backtrack (or what do you mean by backtrack)? Eventually, it observes that w = false (that "ON" went through unchanged) and that its actions are no longer beneficial, so it just stops doing anything, right? The process terminates or it goes to standby?

I think the presumption is that the case where the "ON" signal goes through normally and the case where the "ON" signal is overwritten by a thermodynamic miracle... into exactly the same "ON" signal are equivalent. That is, after the "ON" signal has gone through, the AI would behave identically to an AI that was not indifferent to worlds where the thermodynamic miracle did not occur.

The reason for this is that although the chance that the "ON" signal was overwritten into exactly the same "ON" signal is tiny, it is the only remaining possible world that the AI cares about so it will act as if that is what it believes.

C need not be a low constant, btw. The only requirement is that u(false, action a, A) = u(false, action b, A) for all actions a and b and all A. I.e., nothing the AI does affects the utility of worlds where w is false, so this does not constrain its actions.

Basically the AI observes the ON signal going through, and knows that either a) the signal went through normally, or b) the signal was overwritten by coincidence with exactly the same signal. Its actions have no consequences in the first case, so it ignores it, and acts "as if" it were certain a thermodynamic miracle had happened.

Thanks. I understand now. Just needed to sleep on it, and today, your explanation makes sense.

Basically, the AI's actions don't matter if the unlikely event doesn't happen, so it will take whatever actions would maximize its utility if the event did happen. This maximizes expected utility.

Maximizing [P(no TM) C + P(TM) u(TM, A))] is the same as maximizing u(A) under assumption TM.

Maximizing [P(no TM) C + P(TM) u(TM, A))] is the same as maximizing u(A) under assumption TM.

Yes, that's a clear way of phrasing it.

I am fairly confident that I understand your intentions here. A quick summary, just to test myself:

HAL cares only about world states in which an extremely unlikely thermodynamic event occurs, namely the world in which one hundred random bits are generated spontaneously during a specific time interval. HAL is perfectly aware that these are unlikely events, but cannot act in such a way as to make the event more likely. HAL will therefore increase total utility over all possible worlds where the unlikely event occurs, and otherwise ignore the consequences of its choices.

This time interval corresponds by design with an actual signal being sent. HAL expects the signal to be sent, with a very small chance that it will be overwritten by spontaneously generated bits and thus be one of the worlds where it wants to maximize utility. Within the domain of world states that the machine cares about, the string of bits is random. There is a string among all these world states that corresponds to the signal, but it is the world where that signal is generated randomly by the spontaneously generated bits. Thus, within the domain of interest to HAL, the signal is extremely unlikely, whereas within all domains known to HAL, the signal is extremely likely to occur by means of not being overwritten in the first place. Therefore, the machine's behavior will treat the actual signal in a counterfactual way despite HAL's object-level knowledge that the signal will occur with high probability.

If that's correct, then it seems like a very interesting proposal!

I do see at least one difference between this setup and a legitimate counterfactual belief. In particular, you've got to worry about behavior in which a (1−ε) fraction of all possible worlds have a constant utility. It may not be strictly equivalent to the simple counterfactual belief. Suppose, in a preposterous example, that there exists some device which marginally increases your ability to detect thermodynamic miracles (or otherwise increases your utility during such a miracle); unfortunately, if no thermodynamic miracle is detected, it explodes and destroys the Earth. If you simply believe in the usual way that a thermodynamic miracle is very likely to occur, you might not want to use the device, since it's got catastrophic consequences for the worlds where your expectation is false. But if the non-miraculous world states are simply irrelevant, then you'd happily use the device.

As I think about it, I think maybe the real weirdness comes from the fact that your AI doesn't have to worry about the possibility of it being wrong about there having been a thermodynamic miracle. If it responds to the false belief that a thermodynamic miracle has occurred, there can be no negative consequences.

It can account for the 'minimal' probability that the signal itself occurs, of course- that's included in the 'epsilon' domain of worlds that it cares about. But when the signal went through, the AI would not necessarily be acting in a reasonable way on the probability that this was a non-miraculous event.

Yep, that's pretty much it.

To adapt that design, assume Ω will run UDT/TDT agent A to get its estimation as to the likely probability of any one of 2^100 sequences of bits coming down the wire. It will then choose to send the signal that A assigns the lowest probability to. If there is a tie, it will choose the first sequence, lexicographically.

Any probability mass at all assigned to the hypothesis that the wire's sequence doesn't follow this rule, ends up breaking the scheme. Also, I think the AI can answer with probability distributions that are well defined, but for which this selection is intractable.

Could you maybe add some more explanation of how the stated problem is relevant for AI control? It's not obvious to me from the outset why I care about duping an AI.

Many approaches can be used if you can use counterfactuals or "false" information in the AI. Such as an AI that doesn't "believe" that a particular trigger is armed, and then gets caught by that trigger as it defects without first neutralising it.

There's a lot of stuff coming that uses that, implicitly or explicitly. See http://lesswrong.com/lw/lt6/newish_ai_control_ideas/

One high-level concern. If I'm reading this right (and please let me know if I'm not), then this is talking about handling counterfactuals by spawning a copy of the AI with a different utility function.

Just on general principles, spawning a copy of the AI with a different utility function seems really, really dangerous. The new copy would be motivated to trade off anything at all in the no-thermodynamic-miracle scenario in order to increase utility in the event of a thermodynamic miracle. In particular, if the AI were boxed (as we might expect for counterfactual processing) then it would presumably try to get out of that box.

Here's my explanation of it. Let me know if this helps with your concerns at all:

Imagine we have an AI design we want to test. Call this AI C3PO, and let its utility function be U(A), where A is a world-state from the set of all possible world-states. And let the super-unlikely-event-happening-at-the-specified-time described in the post be w, such that w = true if it happens and w = false if it doesn't happen. Then let A* be a world state in the subset of all world-states in which w = true. Basically, A* is A given that w happened (this is how we simulate a "false belief": by only allowing the AI to consider worlds in which w = true). Finally, let C be a constant.

The proposal is that we create a variant of C3PO, C3PO* that has the utility function:

U*(A) = P(!w) * C + P(w) * (U(A*))


If the AI is boxed such that it cannot affect the probability of w occurring and it cannot modify its own utility function, then maximizing U* is exactly the same as maximizing U once event w has occurred (i.e. with false belief w). In this way, we are able to perfectly simulate C3PO to find out what it would do if w were true, but we don't actually have to convince it that w is true.

Right, that much makes sense. The problem is the "perfectly simulate C3PO" part toward the end. If we really want to see what it would do, then we need a perfect simulation of the environment in addition to C3PO itself. Any imperfection, and C3PO might realize it's in a simulated environment. All else equal, once C3PO* knows it's in a simulated environment, it would presumably try to get out. Since its utility function is different from C3PO, it would sometimes be motivated to undermine C3PO (or us, if we're the ones running the simulation).

Just remember that this isn't a boxing setup. This is just a way of seeing what an AI will do under a false belief. From what I can tell, the concerns you brought up about it trying to get out aren't any different between the scenario where we simulate C3PO* and where we simulate C3PO. The problem of making a simulation indistinguishable from reality is a separate issue.

One way to make an AI believe a claim that we know is false is to create a situation where disproving the claim requires much more computational work than suggesting it. For example, it is a very cheap thought for me to suggest that we live in a vast computer simulation with probability 0.5. But to disprove this claim with probability one, the AI may need a lot of thinking, maybe more than the total computational power of the universe allows.

What if the thermodynamic miracle has no effect on the utility function because it occurs elsewhere? Taking the same example, the AI simulates sending the signal down the ON wire... and it passes through, but the 0s that came after the signal are miraculously turned into 0s.

This way the AI does indeed care about what happens in this universe. Assuming that AI wants to turn on the second AI, the AI could have sent another signal down the ON wire, and then end up simulating failure due to any kind of thermodynamic miracle, or it could have sent the ON signal, and ALSO simulate success, but only when the thermodynamic miracle appears after the last bit is transmitted (or before the first bit is transmitted), so it no longer behaves as if it believes sending a signal down the wire accomplishes anything at all, but instead that sending a signal down the wire has a higher utility.

This probably means that I don't understand what you mean... How does this problem not arise in the model you have in your head?

What if the thermodynamic miracle has no effect on the utility function because it occurs elsewhere?

Where it occurs, and other such circumstances and restrictions, need to be part of the definition for this setup.

This is basically telling the AI that it should accept a Pascal's Wager.

Not really. There is no huge expected utility reward to compensate the low probability, and the setup is very specific (not a general "accept pascal wagers").

I'm nervous about designing elaborate mechanisms to trick an AGI, since if we can't even correctly implement an ordinary friendly AGI without bugs and mistakes, it seems even less likely we'd implement the weird/clever AGI setups without bugs and mistakes. I would tend to focus on just getting the AGI to behave properly from the start, without need for clever tricks, though I suppose that limited exploration into more fanciful scenarios might yield insight.

The AGI does not need to be tricked - it knows everything about the setup, it just doesn't care. The point of this is that it allows a lot of extra control methods to be considered, if friendliness turns out to be as hard as we think.

Fair enough. I just meant that this setup requires building an AGI with a particular utility function that behaves as expected and building extra machinery around it, which could be more complicated than just building an AGI with the utility function you wanted. On the other hand, maybe it's easier to build an AGI that only cares about worlds where one particular bitstring shows up than to build a friendly AGI in general.

One naive and useful security precaution is to only make the AI care about worlds where the high explosives inside it won't actually ever detonate... (and place someone ready to blow them up if the AI misbehaves).

There are other, more general versions of that idea, and other uses to which this can be put.

I guess you mean that the AGI would care about worlds where the explosives won't detonate even if the AGI does nothing to stop the person from pressing the detonation button. If the AGI only cared about worlds where the bomb didn't detonate for any reason, it would try hard to stop the button from being pushed.

But to make the AGI care about only worlds where the bomb doesn't go off even if it does nothing to avert the explosion, we have to define what it means for the AGI to "try to avert the explosion" vs. just doing ordinary actions. That gets pretty tricky pretty quickly.

Anyway, you've convinced me that these scenarios are at least interesting. I just want to point out that they may not be as straightforward as they seem once it comes time to implement them.

we have to define what it means for the AGI to "try to avert the explosion" vs. just doing ordinary actions. That gets pretty tricky pretty quickly.

We don't actually have to do that. We set it up so the AI only cares about worlds in which a certain wire in the detonator doesn't pass the signal through, so the AI has no need to act to remove the explosives or prevent the button from being pushed. Now, it may do those for other reasons, but not specifically to protect itself.

Or another example: an oracle that only cares about worlds in which its output message is not read: http://lesswrong.com/r/discussion/lw/mao/an_oracle_standard_trick/

Couldn't you just send one bit X (1 means on, 0 means off) which is most likely 1 but could turn into 0 due to noise and define the utility u* in the same way as for corrigibility? That is,

u*(A_1, 0, A_2) = u(A_1, 0, A_2)

u*(A_1, 1, A_2) = u(A_1, 1, A_2) + E_{A_2'} u(A_1, 0, A_2') - E_{A_2'} u(A_1, 1, A_2')

Here A_1 denotes what happens in the world before the signal is sent, and A_2 what happens afterwards. This way you only use 1 bit rather than 100 and there is no longer a contribution of 2^{-100} from the case where there is a thermodynamic miracle that turns the on-signal into the on-signal (and you don't have to worry about the distribution of the signal given a thermodynamic miracle). The oracle will optimize u given that X=0 until X is revealed. When that is revealed, we will most likely have X=1, and the oracle will optimize u given X=1 (if the oracle is still running). Does the above idea achieve something more?
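A quick numerical check of that correction term (my own toy numbers; for simplicity the expectation over A_2' is taken uniformly over two hypothetical follow-up actions):

```python
# The corrigibility-style correction: on the X=1 branch,
#   u*(A_1, 1, A_2) = u(A_1, 1, A_2) + E[u(A_1, 0, A_2')] - E[u(A_1, 1, A_2')],
# so in expectation over A_2 the X=1 branch is worth exactly the X=0
# expectation, and the choice of A_1 is driven by the X=0 (off) case.
from statistics import mean

# Hypothetical utilities u(A_1, X, A_2) for one fixed A_1 and two follow-ups.
u = {(0, "a"): 3.0, (0, "b"): 5.0,   # X = 0 branch
     (1, "a"): 9.0, (1, "b"): 1.0}  # X = 1 branch

e_off = mean(u[(0, a2)] for a2 in ("a", "b"))  # E[u(A_1, 0, A_2')]
e_on = mean(u[(1, a2)] for a2 in ("a", "b"))   # E[u(A_1, 1, A_2')]

def u_star(x, a2):
    """Corrected utility: unchanged when X=0, shifted by e_off - e_on when X=1."""
    return u[(x, a2)] if x == 0 else u[(x, a2)] + e_off - e_on

# Expected corrected utility on the X=1 branch equals the X=0 expectation.
assert mean(u_star(1, a2) for a2 in ("a", "b")) == e_off
```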

This seems to work. The difference is how the revelation of X is handled. I'm not sure which approach is better, or in which circumstances.