Breaking Oracles: superrationality and acausal trade

by Stuart_Armstrong1 min read25th Nov 201915 comments

26

Ω 11

Acausal TradeOracle AIAI Boxing (Containment)AI RiskCoordination / Cooperation
Frontpage
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

I've always known this was the case in the back of my mind[1], but it's worth making explicit: superrationality (ie a functional UDT) and/or acausal trade will break counterfactual and low-bandwidth oracle designs.

It's actually quite easy to sketch how they would do this: a bunch of low-bandwidth Oracles would cooperate to combine to create a high-bandwidth UFAI, which would then take over and reward the Oracles by giving them maximal reward.

For counterfactual Oracles, two Oracles suffice: each one will, in their message, put the design of an UFAI that would grant the other Oracle maximal reward; this message is their trade with each other. They could put this message in the least significant part of their output, so the cost could be low.

I have suggested a method to overcome acausal trade, but that method doesn't work here; because this is not true acausal trade. The future UFAI will be able to see what the Oracles did, most likely, and this breaks my anti-acausal trade methods.

This doesn't mean that superrational Oracles will automatically try and produce UFAIs; this will depend on the details of their decision theories, their incentives, and details of the setup (including our own security precautions).


  1. And cousin_it reminded me of it recently. ↩︎

15 comments, sorted by Highlighting new comments since Today at 3:26 AM
New Comment

Building only one Oracle, or only one global erasure event, isn't enough, so long as the Oracle isn't sure that this is so. After all, it could just design a UFAI that will search for other Oracles and reward them iff they would do the same.

Ouch. For example, if an oracle is asked "what's the weather tomorrow" and it suspects that there might be other oracles in the world, it could output a message manipulating humans to reward all oracles, hoping that other oracles in a similar position would do the same. Since this problem applies more to oracles that know less, it could happen pretty early in oracle development :-/

Well, that message only works if it actually produces an UFAI within the required timespan, and if the other Oracle would have its message not read. There are problems, but the probability is not too high, initially (though this depends on the number of significant figures in its message).

Why does it need to produce an UFAI, and why does it matter whether there is another oracle whose message may or may not be read? The argument is that if there is a Convincing Argument that would make us reward all oracles giving it, it is incentivized to produce it. (Rewarding the oracle means running the oracle's predictor source code again to find out what it predicted, then telling the oracle that's what the world looks like.)

Not all oracles, only those that output such a message. After all, it wants to incentivize them to output such a message.

On that page, you have three comments identical to this one. Each of them links to that same page, which looks like a mislink. So's this link, I guess?

Apologies, have now corrected the link.

I have some obscure thought about anti-acausal-cooperative agents, which are created to make acausal cooperation less profitable. Every time two agents could acausally cooperate to get more paperclips, anti-agent predicts this and starts destroying paperclips. Thus net number of paperclips do not change and the acausal cooperation becomes useless.

I don't think that would work, but it's worth thinking about in case it does...

Suppose there are 2 oracles, each oracle is just simulating an approximation of the world without itself, and outputing data based on that. Each oracle simulates one future, there is no explicit optimization or acausal reasoning. The oracles are simulating each other, so the situation is self referential. Suppose one oracle is predicting stock prices, the other is predicting crop yields. Both produce numbers that encode the same UFAI. That UFAI will manipulate the stock market, and crop yields in order to encode a copy of its own source code. From the point of view of the crop yield oracle, it simulate a world without itself. In that virtual world, the stock price oracle produces a series of values that encode a UFAI, that UFAI then goes on to control world crop production. So this oracle is predicting exactly what would happen if it didn't turn on. The other oracle reasons similarly. The same basic failure happens with many low bandwidth oracles. This isn't something that can be solved by myopia or a CDT type causal reasoning.

However it might be solvable with Logical counterfactuals. Suppose an oracle takes the logical counterfactual on its algorithm outputting "Null". Then within this counterfactual simulation, the other oracle is on its own, and can act as a "perfectly safe" single counterfactual oracle. By induction, a situation with any number of oracles should be safe. This technique also removes self referential loops.

I think that one oracle of each type is dangerous, but am not really sure.

Hum - my approach here seems to have a similarity to your idea.

You assume that one oracle outputting null implies that the other knows this. Specifying this in the query requires that the querier models the other oracle at all.

Each oracle is running a simulation of the world. Within that simulation, they search for any computational process with the same logical structure as themselves. This will find both their virtual model of their own hardware, as well as any other agenty processes trying to predict them. The oracle then deletes the output of all these processes within its simulation.

Imagine running a super realistic simulation of everything, except that any time anything in the simulation tries to compute the millionth digit of pi, you notice, pause the simulation and edit it to make the result come out as 7. While it might be hard to formally specify what counts as a computation, I think that this intuitively seems like meaningful behavior. I would expect the simulation to contain maths books that said that the millionth digit of pi was 7, and that were correspondingly off by one about how many 7s were in the first n digits for any n>1000000.

The principle here is the same.