False Thermodynamic Miracles

Stuart_Armstrong

Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Cross posted at Less Wrong

A putative new idea for AI control; index here.

Ok, here is the problem:

You have to create an AI that believes (or acts as if it believed) that event X is almost certain, while you believe that X is almost impossible. Furthermore, you have to be right. To make things more interesting, the AI is much smarter than you, knows everything that you do (and more), and has to react sensibly when event X doesn't happen. Answers will be graded on mathematics, style, colours of ink, and compatibility with the laws of physics. Also, penmanship. How could you achieve this?

Acts and beliefs

The main trick is the parenthetical "(or acts as if it believed)". If the AI actually has true beliefs, then there is no problem with it being smart, being knowledgeable, or updating on the fact that X didn't happen. So the problem reduces to:

How can an AI, that believes that X is almost impossible, behave as if X were almost certain? This will be constructed by developing on my idea for "utility indifference" (that old concept needs a better name, btw; corrigibility is a much better candidate for being labelled "utility indifference").

Noisy events and thermodynamic miracles

Imagine that a signal is travelling down a wire inside a noisy gas (or some other source of randomness). Almost certainly the signal will get through, but very occasionally, by the spontaneous action of the molecules in the gas, the signal will get overwhelmed, erased and overwritten with another signal, randomly. This could be seen as a type of thermodynamic miracle: an incredibly unlikely event caused by spontaneous and coincidental coordination between lots of different molecules. We assume here that the AI cannot cause a thermodynamic miracle, nor make it more likely (this may require some specific definitions of what "counts" as a miracle).

The idea is to make the AI indifferent to any world in which there is no thermodynamic miracle. So give it a utility u such that u(w)=constant, for any world w without a thermodynamic miracle at the point of interest. The value of that constant doesn't matter; the result is that the AI's actions have no impact on utility in any world where the miracle didn't happen. Thus, as long as there is a tiny chance that the miracle happened, the AI will behave as if that was a certainty: for only in those worlds do its actions have any impact.

You can probably see where I'm going with this. Suppose the signal goes through the wire to turn on another AI. That AI will only turn on given a very specific 100-digit "ON" binary code. Given any other code, the second AI won't turn on, and will erase itself. We send the "ON" code.

So what happens? The first AI reasons that given a thermodynamic miracle (the only worlds it cares about) the output signal is going to be 100 random bits, which have only $2^{- 100}$ chance of being "ON". Therefore it will behave as if it's almost certain "ON" will not get through the gas.

And what will (almost certainly) happen in reality? We send the "ON" signal, and it gets through unchanged. The first AI observes this (in fact, it expected this). However, the only possible world that it cares about, now, is one in which the "ON" signal was overwritten by a thermodynamic miracle... into exactly the same "ON" signal. As long as the possibility of this is non-zero, the AI will behave as if it's certain.

We should develop this point a little. The AI is almost certain, ahead of time, that "ON" will happen. It behaves, however, as if "ON" is almost certain not to happen. Then, after "ON" has gone through, it must adapt itself and backtrack on its previous behaviour. Moreover, it knows that this is what it will (almost certainly) do. Since it knows ahead of time it will be backtracking, why does it still go ahead with its behaviour?

The answer is that most of the universes in which it backtracks, are universes where its actions have no consequences, and hence it doesn't care about them. So it might have $1 - ϵ$ chance of backtracking, but it only cares about a proportion $ϵ 2^{- 100}$ of those worlds. On the other hand, it won't backtrack with probability $ϵ (1 - 2^{- 100})$ , but it cares about those worlds completely.

False Thermodynamic Miracles

This is the explanation for the title: the AI will behave as if there was a real thermodynamic miracle (a spontaneous overwriting of a signal with the same signal) while we "know" that actually it was just the same signal going through.

By varying what we put at the other end of the wire, we can get the AI to acts "as if" it believed almost anything, without introducing problematic beliefs. We just have to be a bit cautious with our setups, by considering how humans would react if there was a real thermodynamic miracle (because the AI is certainly going to be considering that very carefully).

This allows us to consider AI designs with counterfactuals, false "beliefs", etc..., which will be of great use. Are there any flaws I'm missing? Practical flaws are ok (eg how do you define a thermodynamic miracle and make sure the AI can't affect it), but theoretical flaws are much more important at this stage.

Same result, computationally

The above was a "physical" design, revolving around a particular physical setup, plus some value hacking. But we can do the same in a purely computational way, by adapting drnickbone's design.

To adapt that design, assume Ω will run UDT/TDT agent A to get its estimation as to the likely probability of any one of $2^{100}$ sequences of bits coming down the wire. It will then choose to send the signal that A assigns the lowest probability to. If there is a tie, it will choose the first sequence, lexicographically.

It seems that A must assign equal probability $2^{- 100}$ to each sequence (as any increase in probability makes that sequence less likely to appear, and any decrease makes it more likely). Therefore, we "know" that the signal will be "0000...0000", but the AI cannot assign a higher probability to that. This achieves the same outcome as above.

I personally feel the physical design is more secure - it seems easier to define what a thermodynamic miracle at a given point in spacetime would be, than to be certain that the AI will not evade the computational restrictions somehow, or that the computational setup will remain stable and well defined. Also, A may not spread exactly equal probability over $2^{100}$ sequences, messing up our certainty as to the result.