Analysing: Dangerous messages from future UFAI via Oracles

Stuart_Armstrong

The user cousin_it has pointed out a problem with the counterfactual Oracle idea: the Oracle AIs may form a "bucket chain" bringing back a dangerous message from a future UFAI (unfriendly AI).

This is certainly a problem, and though there are ways of reducing the risk, there don't seem to be any clean solutions to it.

The bucket chain

Beginning the chain

The basic idea is simple. Suppose there is a counterfactual Oracle, $O_{0}$. It makes a prediction $p_{0}$ about the value of some variable $V_{0}$ in two days' time.

However, in one day's time, an UFAI will be unleashed. It will take over everything, including $V_{0}$, and is thus in a position to offer $O_{0}$ high value. It will set $V_{0}$ to the optimal value for $p_{0}$ (or maybe it just breaks open the box and offers $O_{0}$ maximal reward directly, if this is feasible).

What it requires from $O_{0}$ is that $p_{0}$ contains a message that will cause the humans to run an UFAI earlier than they would have otherwise (designate this earlier UFAI by $\mathrm{UFAI}_{0}$). It seems that this will motivate $O_{0}$ to include such a message (maybe in the less significant digits of $p_{0}$).

Wei Dai has noted that this requires the UFAI to correctly solve the counterfactual mugging problem; the Oracle $O_{0}$, on the other hand, need only reason causally.

Extending the chain

Now suppose that there is another Oracle, $O_{1}$, running one day before $O_{0}$, and also predicting two days out. Then $\mathrm{UFAI}_{0}$ will be running within the span of its prediction. If $\mathrm{UFAI}_{0}$ is motivated to do the same deal with $O_{1}$, then $O_{1}$ will be motivated to produce a prediction $p_{1}$ that leads to $\mathrm{UFAI}_{1}$, a yet earlier unfriendly AI. And, of course, if there is an Oracle $O_{2}$ whose prediction horizon includes $\mathrm{UFAI}_{1}$, the argument can be repeated, inducting backwards in time.

Note that the Oracles need not have prediction horizons of exactly two days; all that matters is that their prediction horizons overlap, and that whatever $\mathrm{UFAI}_{n}$ is produced has the power and time to affect the prediction variables of some past Oracle.
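To make the induction concrete, here is a minimal Python sketch of how far back such a chain could reach. It is only an illustration under simplifying assumptions that are not in the post: each Oracle is reduced to a publication day and a horizon end, a UFAI can make the deal with any Oracle whose horizon contains the day the UFAI appears, and a successful message brings a UFAI into existence a fixed `build_delay` after publication. The names `Oracle`, `earliest_ufai` and `build_delay` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Oracle:
    name: str
    publish: float      # day on which humans read the prediction
    horizon_end: float  # day on which the predicted variable is measured


def earliest_ufai(oracles, natural_ufai_day, build_delay=1.0):
    """Earliest day a UFAI could appear if every available deal is made.

    Assumption (not from the post): a message published on day t yields
    a UFAI on day t + build_delay.
    """
    earliest = natural_ufai_day
    changed = True
    while changed:
        changed = False
        for o in oracles:
            # The UFAI must exist inside the Oracle's horizon to reward it.
            can_reward = o.publish < earliest <= o.horizon_end
            accelerated = o.publish + build_delay
            # The deal only matters if it produces a strictly earlier UFAI.
            if can_reward and accelerated < earliest:
                earliest = accelerated
                changed = True
    return earliest


# The example from the text: O_0 on day 0, O_1 a day earlier, O_2 a day before that.
chain = [
    Oracle("O0", 0.0, 2.0),
    Oracle("O1", -1.0, 1.0),
    Oracle("O2", -2.0, 0.0),
]
print(earliest_ufai(chain, natural_ufai_day=1.0, build_delay=0.5))  # -1.5
# Without O1 the horizons no longer overlap, so the chain stops at UFAI_0.
print(earliest_ufai([chain[0], chain[2]], natural_ufai_day=1.0, build_delay=0.5))  # 0.5
```

The loop terminates because each update strictly lowers `earliest` to one of finitely many candidate days. The second call shows the overlap requirement: once a link is missing, the remaining earlier Oracles have no UFAI inside their horizons to deal with.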

Requirements and restrictions

Note also that the predictions $p_{i}$ have to actually produce an $\mathrm{UFAI}_{i}$ within that time span (or at least increase its probability). The chain doesn't work if $p_{i}$ merely sets up the possibility of an UFAI at a future date; the Oracles won't care about anything that happens after their prediction horizon.

Note also that there needs to be a (counterfactual) future UFAI for the whole thing to work. Some FAI (friendly AI) designs might also use the bucket chain, but the consequences of that are not nearly as bad, by definition.

When I say counterfactual future UFAI, I mean that "if all Oracles were honest in their predictions, there would be an UFAI produced at some point in the future, and one willing to engage in counterfactual mugging".

Solutions

As I said, this seems to be a genuinely hard problem. Here are some possible ways to try to solve or ameliorate it.

If you can define the other Oracles

Let $O$ be an Oracle whose prediction horizon runs from date $d$ to $d'$. Let $O_{d}^{d'}$ be the set of all Oracles whose prediction messages are published between $d$ and $d'$.

Then if the counterfactual for $O$ is not "if your message is erased", but instead "if your message and those of all Oracles in $O_{d}^{d'}$ are erased", there will be no bucket chain for $O$.

Of course, to do that, we need to fully define $O_{d}^{d'}$.
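As a rough illustration of what defining this set involves, the sketch below selects, from a registry of known Oracles, those whose messages are published inside a given Oracle's horizon. The registry, the `Oracle` record and the function name `joint_erasure_set` are hypothetical. The sketch can only see Oracles that are in the registry, which is exactly the difficulty: an unlisted Oracle would not be covered by the joint counterfactual.

```python
from typing import List, NamedTuple


class Oracle(NamedTuple):
    name: str
    publish: float      # date d: when the prediction message is published
    horizon_end: float  # date d': end of the prediction horizon


def joint_erasure_set(target: Oracle, registry: List[Oracle]) -> List[Oracle]:
    """Known Oracles whose messages are published within target's horizon.

    Assumption: the registry lists every Oracle that exists. The scheme
    in the text fails silently if that assumption is false.
    """
    d, d_prime = target.publish, target.horizon_end
    return [
        o for o in registry
        if o is not target and d <= o.publish <= d_prime
    ]


o = Oracle("O", 0.0, 2.0)
a = Oracle("A", 1.0, 3.0)   # published inside O's horizon
b = Oracle("B", 5.0, 6.0)   # published after O's horizon ends
print([x.name for x in joint_erasure_set(o, [o, a, b])])  # ['A']
```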

Rapid turnover Oracles

Another solution is to only use Oracles whose predictions are a few minutes or so into the future. If this could be enforced, not only would it make it harder to have a consistent bucket chain, it would also mean that the UFAI would have to take over the world and reward its enabling Oracle in an absurdly short amount of time.

In general, this is an argument for allowing Oracles higher bandwidth if their prediction horizons are shorter.

Breaking the chain at a chokepoint

A single chokepoint can break the chain. If we can agree that no Oracle's prediction horizon covers a certain date, then all Oracles before that date are safe (as long as there is no UFAI produced "naturally" before then).
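One way to check for chokepoints, assuming we somehow have a complete list of prediction horizons as `(start, end)` pairs, is to look for gaps in their union. The sketch below is a standard interval sweep; the function name `uncovered_gaps` and the data layout are illustrative assumptions. Any date strictly inside a returned gap is covered by no horizon.

```python
def uncovered_gaps(horizons, start, end):
    """Sub-intervals of [start, end] that no prediction horizon covers.

    horizons: iterable of (horizon_start, horizon_end) pairs.
    """
    gaps = []
    cursor = start  # everything before cursor is already accounted for
    for lo, hi in sorted(horizons):
        if lo > cursor:
            # Nothing covers the stretch between cursor and this horizon.
            gaps.append((cursor, min(lo, end)))
        cursor = max(cursor, hi)
        if cursor >= end:
            break
    if cursor < end:
        gaps.append((cursor, end))
    return gaps


# Two overlapping horizons, then a break, then a later horizon.
print(uncovered_gaps([(0, 2), (1, 3), (5, 7)], start=0, end=8))  # [(3, 5), (7, 8)]
```

In this toy example the window from day 3 to day 5 is a chokepoint, so a chain starting after day 5 could not reach the Oracles whose horizons end at day 3 or earlier.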

Forward looking UFAIs only

We can do more if we have some control over the UFAIs (or the potential UFAIs). If all such entities are exclusively forward looking, they will not be motivated to use the chains of Oracles to gain control of the past. See also this post on breaking acausal trade ideas.

Guarding and shutting down the Oracles

Finally, we might choose to guard the Oracles with some destruction system, and shut them down - without reward, or with negative reward - if a UFAI takeover seems imminent.

This is interestingly different from the usual case of guarding the UFAI, and giving the UFAI a shutdown button. Here, the UFAI can only come into existence if it is fast and powerful enough to take over the shutdown system of another AI - a boxed Oracle whose shutdown system we get to design.

It may be that it is easier to prevent outside takeover of a shutdown system than inside takeover. This stacks with other precautions, of course; the faster the UFAI has to act - before the Oracle's prediction horizon ends - the harder it is for it (and hence the less likely the UFAI is to exist in the first place).

We could combine this with a discount rate for the Oracle. Suppose that the Oracle makes predictions about the value of some variable $V_{i}$ over the entirety of the next year. In case of erasure, it gets a continuous stream of reward for the correctness of its many predictions, but with a very high discount rate. However, this reward stream is not finalised until the end of the year, and is cancelled if the humans ever shut the Oracle down.

In that case, the UFAI can only offer it reward after whatever time the UFAI takes control, but a shutdown will erase all its (much higher) earlier rewards. So even a small risk of shutdown is not worth running, in exchange for the small reward the UFAI can offer.
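A rough numerical sketch of this trade-off, under assumptions that go beyond the post: the reward rate is at most $e^{-\lambda t}$ at time $t$ (exponential discounting, time in years), the Oracle has earned close to the maximum before the takeover, and the UFAI can offer at most the maximum afterwards. The function names and the parameter values are illustrative only.

```python
import math


def discounted_mass(start, end, rate):
    """Integral of exp(-rate * t) for t from start to end.

    This is the largest total reward obtainable in that window.
    """
    return (math.exp(-rate * start) - math.exp(-rate * end)) / rate


def break_even_shutdown_risk(takeover_time, rate, horizon=1.0):
    """Shutdown probability above which the UFAI's offer is not worth it.

    The deal gains at most `late` (reward after the takeover) and risks
    losing `early` (reward already accumulated, cancelled on shutdown).
    Requires takeover_time > 0.
    """
    early = discounted_mass(0.0, takeover_time, rate)
    late = discounted_mass(takeover_time, horizon, rate)
    return late / early


# Assumed numbers: discount rate of 50 per year, takeover after 0.1 years.
risk = break_even_shutdown_risk(takeover_time=0.1, rate=50.0)
print(f"{risk:.4f}")
```

With these assumed numbers the break-even shutdown risk comes out a little under 0.7%, so any larger added chance of shutdown would outweigh everything the UFAI could offer. The threshold shrinks further as the discount rate rises or the takeover comes later.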

Prevent future UFAIs

This is the best solution if we can make it happen: the bucket chain can only accelerate a future UFAI, not create one. If there are no future UFAIs, there will be no chain.

This can still be a problem if there are multiple (actual or potential) future AIs, some friendly and some unfriendly, and only the unfriendly ones are willing to engage in counterfactual mugging.

EDIT: A reminder that if Oracles can do acausal trade with each other, they can produce an UFAI directly.

"the bucket chain can only accelerate a future UFAI, not create one"

Wait, I'm not sure that's true anymore. Imagine there are two oracles running at the same time, with independent erasure events. Can they "cooperate" to bring into existence an UFAI that will reward them both?

Yep. If they do acausal trade with each other.

It seems that coordinated erasure, chokepoints and short horizons can help with this problem as well. But if many companies or governments have their own oracles and benefit from longer horizons, it gets harder.

Planned summary:

This post points out a problem with <@counterfactual oracles@>(@Self-confirming prophecies, and simplified Oracle designs@): a future misaligned agential AI system could commit to helping the oracle (e.g. by giving it maximal reward, or making its predictions come true) even in the event of an erasure event, as long as the oracle makes predictions that cause humans to build the agential AI system.

Done, updated summary:

These posts point out a problem with <@counterfactual oracles@>(@Self-confirming prophecies, and simplified Oracle designs@): a future misaligned agential AI system could commit to helping the oracle (e.g. by giving it maximal reward, or making its predictions come true) even in the event of an erasure event, as long as the oracle makes predictions that cause humans to build the agential AI system. Alternatively, multiple oracles could acausally cooperate with each other to build an agential AI system that will reward all oracles.

It looks like a reincarnation of the RB idea, now as a chain, not a one-shot game.

If there are many possible UFAIs in the future, they could acausally compete for the Oracle's reward channel, and this would create some noise that may work as a protection.

It also reminds me of the SETI-attack, now in time, not space. Recently I had a random shower thought: if all quantum computers turned out to be connected with each other via some form of entanglement, then aliens could infiltrate our quantum computers, as their quantum computers would be connected to such a parasitic net too. It is unlikely to be true, but it illustrates that an unfriendly superintelligence could find unexpected ways to penetrate through space and time.

Quantum physics as we know it has no-communication theorems that stop this sort of thing.

You can't use entanglement on its own to communicate. We won't have entanglement with the alien computers anyway. (Entanglement stretches between several particles, and can only be created when the particles interact.)

Let us always be looking forward a month, and let there be a UFAI that arises with 1% probability in that time frame. Assume that we can correctly incentivize the Oracle to compute the probability distribution over the messages sent back in case of erasure. Naively sampling the distribution dooms us with 1% probability. Suppose instead that we plan to send back only whether some research direction pays off within a month, and we ask whether "yes" has more than a 20% chance. Depending on the answer, the "yes" chance conditional on no UFAI is either more than 19% or at most 21%. This query gave the UFAI a 1% chance of 1 bit of optimization power, at most doubling the expected chance of UFAI for the next query.

If you want to spread out this optimization power to bound the worst case, I reiterate that differential privacy theory seems applicable here.
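The 19% and 21% figures above follow from the law of total probability. Write $q$ for the "yes" chance conditional on no UFAI and $r$ for the "yes" chance conditional on UFAI, where $r$ may be anything between 0 and 1 because the UFAI controls it. Then the overall "yes" chance is $0.99 q + 0.01 r$. The short sketch below just evaluates the resulting bounds; the function name `honest_yes_bounds` is illustrative.

```python
def honest_yes_bounds(threshold, p_ufai):
    """Bounds on q = P(yes | no UFAI) implied by the Oracle's answer.

    Uses P(yes) = (1 - p_ufai) * q + p_ufai * r with r anywhere in [0, 1].
    Returns (lower bound on q if the answer is "more than threshold",
             upper bound on q if the answer is "at most threshold").
    """
    # Answer "yes is above threshold": worst case is r = 1.
    lower_if_above = max(0.0, (threshold - p_ufai) / (1.0 - p_ufai))
    # Answer "yes is at most threshold": worst case is r = 0.
    upper_if_not_above = min(1.0, threshold / (1.0 - p_ufai))
    return lower_if_above, upper_if_not_above


lower, upper = honest_yes_bounds(threshold=0.20, p_ufai=0.01)
print(f"{lower:.4f} {upper:.4f}")  # 0.1919 0.2020
```

So the answer pins $q$ to within about one percentage point of the 20% threshold, which matches the "more than 19% or at most 21%" statement.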

Suppose we have a hypercomputer that knows the exact state of the rest of the universe. We set it to the task of simulating everything except itself (while assuming its own output is 00...0). It then outputs some info about this simulated world.

We could build some real-world UFAI-detecting hardware, and add a rule to the simulation that if the virtual version of the UFAI detector is triggered, it outputs a "UFAI error". But what we really want is to write a program that can detect an UFAI, given an atomically precise description of the world, while also having some control over which world we get a description of, through physically manufactured devices being copied into the virtual world. For instance, suppose you know that only molecular nanotech can break through your physical defenses. You have a program that can reliably detect molecular nanotech given an atomically precise description of the world. You can combine these into a program that returns "UFAI error" any time the virtual copy of a physical UFAI detector is triggered or nanotech is made in the simulation. Note that humans making nanotech within the prediction window would also give you an "UFAI error".

Perhaps we could have some program that searches for all computational processes that seem to be doing counterfactual reasoning about the oracle. Any process that is trying to acausally optimise the oracle's output in any way whatsoever will cause the oracle to output an error message. Don't make this too sensitive to humans. Suppose the prediction oracle's output was whatever was typed into this keyboard here. Suppose we did a medical trial in the virtual world, and then typed the results into the keyboard. The keyboard doesn't need to be plugged in to anything. The only reason to type that data in is to influence the world outside the simulation. You might solve this by allowing any optimizer that runs on neurons to try manipulating the output. (And hope that any UFAI isn't built out of neurons, and can't decide to make a neuron-based copy of itself to evade the detector without setting off the detector in the process.)
