False thermodynamic miracles

[-]AlexMennen11y160

I'm uneasy about this for similar reasons to why I was uneasy about utility indifference. If the AI collects compelling evidence that a thermodynamic miracle did not occur, then it is possible that the hypotheses left in which the thermodynamic miracle did occur will be dominated by strange, complicated hypotheses (e.g. the existence of some sort of Cartesian demon trying to trick the AI into thinking that the thermodynamic miracle occurred), and the AI's behavior may become erratic as a result.

[-]Stuart_Armstrong11y40

Yes, this is a concern. But it seems a solvable concern, once we have the principles right (and it almost certainly won't be implemented as an actual wire in a gas cloud setup).

[-]TylerJay11y70

This is really interesting. I thought I understood it and I wanted to verify that by trying to summarize it (and maybe help others too) but now I'm not so sure...

Edit: Just to save anybody the reading time, my reasoning is false below. After sleeping on it, I see my mistake. Nothing below the "False Thermodynamic Miracles" subheading made sense to me yesterday because I thought the purpose of the setup was to have an "off switch" on the simulated AI under the false belief (letting it see the result of the signal after some time period). I get it now though. Maximizing "[P(no miracle) C] + [P(miracle) u(A) given miracle]" is the same as maximizing "u(A) given miracle". So the AI will act as if the miracle happened, because there's no cost associated with those actions if it didn't happen—only benefits if it did.

As I understand it, the basic premise is that the AI has a utility function u(w,A) where "w" is whether or not the specific thermodynamic miracle (TM) occurred and A represents a regular/normal input to a utility function (state of the universe, action taken, or whatever).

u(w,A) = C (a low constant) if w is false (TM did not happen), and u(w,A) is essentially "u(A) given w" when w is true (ie. it's a "normal" kind of utility function, dominated by things other than the TM, though it still has to incorporate the fact that the TM happened into its model of the world).

So, the signal is sent, and it calculates its expected utility for either scenario. If w = false, u(w,A) = a low constant, and if w = true, it determines an action A that maximizes the utility function u(A)|w. As long as 2^-100 u( A ) > C, it starts taking action A.

Is this right so far?

Then, after "ON" has gone through, it must adapt itself and backtrack on its previous behaviour.

This mechanism seems underspecified to me (though maybe it's obvious to others). Why would it backtrack (or what do you mean by backtrack)? Eventually, it observes that w = false (that "ON" went through unchanged) and that its actions are no longer beneficial, so it just stops doing anything, right? The process terminates or it goes to standby?

If we want the AI to actually backtrack, wouldn't we need to modify the utility function so that it takes a configuration parameter as well? u(w,A,X) s.t. X = true iff the AI is in its initial configuration, and set it so that u(F,A,F) = 0, but u(F,A,T) = C ? (this probably counts as just an implementation detail, but I figured I'd mention it anyway)

The part that makes me feel like I didn't actually understand it at all is:

However, the only possible world that it cares about, now, is one in which the "ON" signal was overwritten by a thermodynamic miracle... into exactly the same "ON" signal. As long as the possibility of this is non-zero, the AI will behave as if it's certain.

Can you explain this a bit more? Do you mean that even after it observes "ON", it doesn't stop acting as if w = true? That P(TM that overwrites ON with ON) * u(A)|w > C ? If that's the case, then it would never backtrack, right? So it's essentially a full simulation of an AI under the assumption w, but with the knowledge that w is incredibly unlikely, and no built-in halting condition?

Thanks

[-]GMHowe11y40

Why would it backtrack (or what do you mean by backtrack)? Eventually, it observes that w = false (that "ON" went through unchanged) and that its actions are no longer beneficial, so it just stops doing anything, right? The process terminates or it goes to standby?

I think the presumption is that the case where the "ON" signal goes through normally and the case where the "ON" signal is overwritten by a thermodynamic miracle... into exactly the same "ON" signal are equivalent. That is, after the "ON" signal has gone through, the AI would behave identically to an AI that was not indifferent to worlds where the thermodynamic miracle did not occur.

The reason for this is that although the chance that the "ON" signal was overwritten into exactly the same "ON" signal is tiny, it is the only remaining possible world that the AI cares about so it will act as if that is what it believes.

[-]Stuart_Armstrong11y30

C need not be a low constant, btw. The only requirement is that u(false,action a, A) = u(false, action b, A) for all actions a and b and all A. ie nothing the AI does affects the utility of worlds where w is false, so this does not constrain its actions.

Basically the AI observes the ON signal going through, and knows that either a) the signal went through normally, or b) the signal was overwritten by coincidence by exactly the same signal. Its actions have no consequences in the first case, so it ignores it, and acts "as if" it was certain that a thermodynamic miracle had happened.
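As a sketch of why this holds, write $a$ for the AI's action and $C$ for the value of $u$ in worlds where $w$ is false, and assume the action cannot change $P(w)$. The notation is illustrative rather than taken from the post.

```latex
\begin{align*}
\mathbb{E}[u \mid a] &= P(\neg w)\, C + P(w)\, \mathbb{E}[u \mid w, a], \\
\arg\max_a \mathbb{E}[u \mid a] &= \arg\max_a \mathbb{E}[u \mid w, a]
  \quad \text{whenever } P(w) > 0 .
\end{align*}
```

The first term does not depend on $a$, so it drops out of the maximization whatever the value of $C$, and the positive factor $P(w)$ does not change which action is best.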

[-]TylerJay11y40

Thanks. I understand now. Just needed to sleep on it, and today, your explanation makes sense.

Basically, the AI's actions don't matter if the unlikely event doesn't happen, so it will take whatever actions would maximize its utility if the event did happen. This maximizes expected utility:

Maximizing [P(no TM) C + P(TM) u(TM, A)] is the same as maximizing u(A) under assumption TM.
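A small numeric check of this equivalence, as a sketch only. The action names and utilities are made up for illustration, and exact fractions are used because a term weighted by 2^-100 would be lost to rounding in floating point.

```python
from fractions import Fraction

def expected_utility(action, p_tm, c, u_given_tm):
    """P(no TM) * C + P(TM) * u(A | TM) for one candidate action."""
    return (1 - p_tm) * c + p_tm * u_given_tm[action]

def best_action(p_tm, c, u_given_tm):
    """Action maximizing the full expected utility."""
    return max(u_given_tm, key=lambda a: expected_utility(a, p_tm, c, u_given_tm))

def best_action_given_tm(u_given_tm):
    """Action maximizing u(A) under the assumption that the TM happened."""
    return max(u_given_tm, key=u_given_tm.get)

# Hypothetical values, not from the post.
P_TM = Fraction(1, 2**100)
U_GIVEN_TM = {"wait": Fraction(0), "build": Fraction(3), "explore": Fraction(3, 2)}

for c in (-10**6, 0, 10**6):
    # The constant C shifts every action's score equally, so the choice is unchanged.
    print(c, best_action(P_TM, c, U_GIVEN_TM), best_action_given_tm(U_GIVEN_TM))
```

The chosen action is the same for every value of C tried, which matches the point that C need not be low.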

[-]Stuart_Armstrong11y30

Maximizing [P(no TM) C + P(TM) u(TM, A)] is the same as maximizing u(A) under assumption TM.

Yes, that's a clear way of phrasing it.

[-]Toggle11y40

I am fairly confident that I understand your intentions here. A quick summary, just to test myself:

HAL cares only about world states in which an extremely unlikely thermodynamic event occurs, namely the world in which one hundred random bits are generated spontaneously during a specific time interval. HAL is perfectly aware that these are unlikely events, but cannot act in such a way as to make the event more likely. HAL will therefore increase total utility over all possible worlds where the unlikely event occurs, and otherwise ignore the consequences of its choices.

This time interval corresponds by design with an actual signal being sent. HAL expects the signal to be sent, with a very small chance that it will be overwritten by spontaneously generated bits and thus be one of the worlds where it wants to maximize utility. Within the domain of world states that the machine cares about, the string of bits is random. There is a string among all these world states that corresponds to the signal, but it is the world where that signal is generated randomly by the spontaneously generated bits. Thus, within the domain of interest to HAL, the signal is extremely unlikely, whereas within all domains known to HAL, the signal is extremely likely to occur by means of not being overwritten in the first place. Therefore, the machine's behavior will treat the actual signal in a counterfactual way despite HAL's object-level knowledge that the signal will occur with high probability.

If that's correct, then it seems like a very interesting proposal!

I do see at least one difference between this setup and a legitimate counterfactual belief. In particular, you've got to worry about behavior in which a fraction (1-epsilon) of all possible worlds have a constant utility. It may not be strictly equivalent to the simple counterfactual belief. Suppose, in a preposterous example, that there exists some device which marginally increases your ability to detect thermodynamic miracles (or otherwise increases your utility during such a miracle); unfortunately, if no thermodynamic miracle is detected, it explodes and destroys the Earth. If you simply believe in the usual way that a thermodynamic miracle is very likely to occur, you might not want to use the device, since it's got catastrophic consequences for the world where your expectation is false. But if the non-miraculous world states are simply irrelevant, then you'd happily use the device.

As I think about it, I think maybe the real weirdness comes from the fact that your AI doesn't have to worry about the possibility of it being wrong about there having been a thermodynamic miracle. If it responds to the false belief that a thermodynamic miracle has occurred, there can be no negative consequences.

It can account for the 'minimal' probability that the signal itself occurs, of course- that's included in the 'epsilon' domain of worlds that it cares about. But when the signal went through, the AI would not necessarily be acting in a reasonable way on the probability that this was a non-miraculous event.
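The exploding-device example can be sketched numerically. This is a toy model with made-up numbers: `gain` is the benefit of the device in miracle worlds and `catastrophe` is the loss in non-miracle worlds, both hypothetical.

```python
def believer_value(use_device, p_miracle, gain, catastrophe):
    """Agent that merely believes the miracle is likely: both kinds of world count."""
    if not use_device:
        return 0.0
    return p_miracle * gain - (1 - p_miracle) * catastrophe

def indifferent_value(use_device, gain):
    """Agent whose utility is constant in non-miracle worlds.

    The constant term is the same for every action, so it is dropped here
    and only the miracle-world payoff is compared.
    """
    return gain if use_device else 0.0

def chooses_device(value_fn, *args):
    """True if using the device scores strictly higher than not using it."""
    return value_fn(True, *args) > value_fn(False, *args)

# Even a 99% confident believer declines: 0.99 * 1 - 0.01 * 1e6 is negative.
print(chooses_device(believer_value, 0.99, 1.0, 1e6))
# The indifferent agent accepts, since the catastrophe never enters its comparison.
print(chooses_device(indifferent_value, 1.0))
```

The two agents disagree no matter how confident the believer is, short of certainty, as long as the catastrophe is large enough.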

[-]Stuart_Armstrong11y20

Yep, that's pretty much it.

[-]jimrandomh11y40

To adapt that design, assume Ω will run UDT/TDT agent A to get its estimation as to the likely probability of any one of 2^100 sequences of bits coming down the wire. It will then choose to send the signal that A assigns the lowest probability to. If there is a tie, it will choose the first sequence, lexicographically.

Any probability mass at all assigned to the hypothesis that the wire's sequence doesn't follow this rule, ends up breaking the scheme. Also, I think the AI can answer with probability distributions that are well defined, but for which this selection is intractable.
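The selection rule attributed to Ω above can be sketched for a toy case. This assumes 3-bit sequences instead of 100 and a hypothetical probability table standing in for agent A's answer. The brute-force enumeration is exactly what stops being feasible at 2^100 sequences.

```python
from itertools import product

def omega_choice(prob, n_bits):
    """Return the bit string that `prob` rates least likely.

    product("01", ...) enumerates strings in lexicographic order and min()
    keeps the first minimal element, so ties go to the earliest string.
    """
    candidates = ("".join(bits) for bits in product("01", repeat=n_bits))
    return min(candidates, key=prob)

# Hypothetical distribution reported by A over 3-bit strings (sums to 1).
table = {"".join(bits): 0.15 for bits in product("01", repeat=3)}
table["011"] = 0.05
table["110"] = 0.05

print(omega_choice(table.__getitem__, 3))  # "011": tied with "110", but earlier
```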

[-]ESRogs11y20

Could you maybe add some more explanation of how the stated problem is relevant for AI control? It's not obvious to me from the outset why I care about duping an AI.

[-]Stuart_Armstrong11y40

Many approaches can be used if you can use counterfactuals or "false" information in the AI. For example, an AI that doesn't "believe" that a particular trigger is armed gets caught by that trigger as it defects without first neutralising it.

There's a lot of stuff coming that uses that, implicitly or explicitly. See http://lesswrong.com/lw/lt6/newish_ai_control_ideas/

[-]johnswentworth11y20

One high-level concern. If I'm reading this right (and please let me know if I'm not), then this is talking about handling counterfactuals by spawning a copy of the AI with a different utility function.

Just on general principles, spawning a copy of the AI with a different utility function seems really, really dangerous. The new copy would be motivated to trade off anything at all in the no-thermodynamic-miracle scenario in order to increase utility in the event of a thermodynamic miracle. In particular, if the AI were boxed (as we might expect for counterfactual processing) then it would presumably try to get out of that box.

[-]TylerJay11y20

Here's my explanation of it. Let me know if this helps with your concerns at all:

Imagine we have an AI design we want to test. Call this AI C3PO, and let its utility function be U(A), where A is a world-state from the set of all possible world-states. Let the super-unlikely-event-happening-at-the-specified-time described in the post be w, such that w = true if it happens and w = false if it doesn't. Then let A* be a world-state in the subset of all world-states A in which w = true. Basically, A* is A given that w happened (this is how we simulate a "false belief": by only allowing the AI to consider worlds in which w = true). Finally, let C be a constant.

The proposal is that we create a variant of C3PO, C3PO*, that has the utility function:

U*(A) = P(!w) * C + P(w) * U(A*)

If the AI is boxed such that it cannot affect the probability of w occurring and it cannot modify its own utility function, then maximizing U* is exactly the same as maximizing U once event w has occurred (i.e. with false belief w). In this way, we are able to perfectly simulate C3PO to find out what it would do if w were true, but we don't actually have to convince it that w is true.

[-]johnswentworth11y20

Right, that much makes sense. The problem is the "perfectly simulate C3PO" part toward the end. If we really want to see what it would do, then we need a perfect simulation of the environment in addition to C3PO itself. Any imperfection, and C3PO might realize it's in a simulated environment. All else equal, once C3PO* knows it's in a simulated environment, it would presumably try to get out. Since its utility function is different from C3PO's, it would sometimes be motivated to undermine C3PO (or us, if we're the ones running the simulation).

[-]TylerJay11y30

Just remember that this isn't a boxing setup. This is just a way of seeing what an AI will do under a false belief. From what I can tell, the concerns you brought up about it trying to get out aren't any different between the scenario where we simulate C3PO* and the one where we simulate C3PO. The problem of making a simulation indistinguishable from reality is a separate issue.

[-]avturchin7y10

One way to make an AI believe a claim that we know is false is to find a situation in which disproving the claim requires much more computation than suggesting it. For example, it is very cheap for me to suggest that we live in a vast computer simulation with probability 0.5. But to disprove this claim with certainty, the AI may need a lot of thinking, maybe more than the total computational power of the universe allows.

[-]GraceFu11y10

What if the thermodynamic miracle has no effect on the utility function because it occurs elsewhere? Taking the same example, the AI simulates sending the signal down the ON wire... and it passes through, but the 0s that came after the signal are miraculously turned into 0s.

This way the AI does indeed care about what happens in this universe. Assuming that AI wants to turn on the second AI, the AI could have sent another signal down the ON wire, and then end up simulating failure due to any kind of thermodynamic miracle, or it could have sent the ON signal, and ALSO simulate success, but only when the thermodynamic miracle appears after the last bit is transmitted (or before the first bit is transmitted), so it no longer behaves as if it believes sending a signal down the wire accomplishes anything at all, but instead that sending a signal down the wire has a higher utility.

This probably means that I don't understand what you mean... How does this problem not arise in the model you have in your head?

[-]Stuart_Armstrong11y10

What if the thermodynamic miracle has no effect on the utility function because it occurs elsewhere?

Where it occurs, and other such circumstances and restrictions, need to be part of the definition for this setup.

[-]Unknowns11y10

This is basically telling the AI that it should accept a Pascal's Wager.

[-]Stuart_Armstrong11y20

Not really. There is no huge expected utility reward to compensate for the low probability, and the setup is very specific (not a general "accept Pascal's wagers").

[-]Brian_Tomasik10y00

I'm nervous about designing elaborate mechanisms to trick an AGI, since if we can't even correctly implement an ordinary friendly AGI without bugs and mistakes, it seems even less likely we'd implement the weird/clever AGI setups without bugs and mistakes. I would tend to focus on just getting the AGI to behave properly from the start, without need for clever tricks, though I suppose that limited exploration into more fanciful scenarios might yield insight.

[-]Stuart_Armstrong10y10

The AGI does not need to be tricked - it knows everything about the setup, it just doesn't care. The point of this is that it allows a lot of extra control methods to be considered, if friendliness turns out to be as hard as we think.

[-]Brian_Tomasik10y30

Fair enough. I just meant that this setup requires building an AGI with a particular utility function that behaves as expected and building extra machinery around it, which could be more complicated than just building an AGI with the utility function you wanted. On the other hand, maybe it's easier to build an AGI that only cares about worlds where one particular bitstring shows up than to build a friendly AGI in general.

[-]Stuart_Armstrong10y20

One naive and useful security precaution is to only make the AI care about worlds where the high explosives inside it won't actually ever detonate... (and place someone ready to blow them up if the AI misbehaves).

There are other, more general versions of that idea, and other uses to which this can be put.

[-]Brian_Tomasik10y10

I guess you mean that the AGI would care about worlds where the explosives won't detonate even if the AGI does nothing to stop the person from pressing the detonation button. If the AGI only cared about worlds where the bomb didn't detonate for any reason, it would try hard to stop the button from being pushed.

But to make the AGI care about only worlds where the bomb doesn't go off even if it does nothing to avert the explosion, we have to define what it means for the AGI to "try to avert the explosion" vs. just doing ordinary actions. That gets pretty tricky pretty quickly.

Anyway, you've convinced me that these scenarios are at least interesting. I just want to point out that they may not be as straightforward as they seem once it comes time to implement them.

[-]Stuart_Armstrong10y20

we have to define what it means for the AGI to "try to avert the explosion" vs. just doing ordinary actions. That gets pretty tricky pretty quickly.

We don't actually have to do that. We set it up so the AI only cares about worlds in which a certain wire in the detonator doesn't pass the signal through, so the AI has no need to act to remove the explosives or prevent the button from being pushed. Now, it may do those for other reasons, but not specifically to protect itself.

Or another example: an oracle that only cares about worlds in which its output message is not read: http://lesswrong.com/r/discussion/lw/mao/an_oracle_standard_trick/

[-]Sune10y00

Couldn't you just send one bit X (1 means on, 0 means off) which is most likely 1 but could turn into 0 due to noise and define the utility u* in the same way as for corrigibility? That is,

u*(A_1,0,A_2)= u(A_1,0,A_2)

u*(A_1,1,A_2)=u(A_1,1,A_2)+E_{A_2'} u(A_1,0,A_2')- E_{A_2'} u(A_1,1,A_2')

Here A_1 denotes what happens in the world before the signal is sent, and A_2 what happens afterwards. This way you only use 1 bit rather than 100 and there is no longer a contribution of 2^{-100} from the case where there is a thermodynamic miracle that turns the on-signal into the on-signal (and you don't have to worry about the distribution of the signal given a thermodynamic miracle). The oracle will optimize u given that X=0 until X is revealed. When that is revealed, we will most likely have X=1, and the oracle will optimize u given X=1 (if the oracle is still running). Does the above idea achieve something more?
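One way to see what this definition does is to take the expectation of u* over A_2 when X = 1. This assumes each E_{A_2'} is taken over the AI's distribution for what follows the signal given the value of X appearing in that term, which is an interpretation rather than something stated above.

```latex
\begin{align*}
E_{A_2}\, u^*(A_1, 1, A_2)
  &= E_{A_2}\, u(A_1, 1, A_2) + E_{A_2'}\, u(A_1, 0, A_2') - E_{A_2'}\, u(A_1, 1, A_2') \\
  &= E_{A_2'}\, u(A_1, 0, A_2') .
\end{align*}
```

Under that reading, the expected value of u* given X = 1 equals the expected value of u given X = 0, so before X is revealed the agent gains nothing by planning for X = 1.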

[-]Stuart_Armstrong10y00

This seems to work. The difference is how the revelation of X is handled. I'm not sure which approach is better, or in which circumstances.
