Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

The "best predictor is malicious optimiser" problem

I don't have anything mathematical to say about this, but I imagined a human version. X asks Y for advice on some matter. Y has a motive for giving advice that X finds effective (it will improve his standing with X), but also has ulterior motives, that might or might not be to X's benefit. His advice will be selected to be effective for both solving X's problem and advancing Y's personal agenda, but perhaps less effective for the former than if the latter had not been a consideration.

Imagine a student asking a professor for career advice, and the professor suggesting the student do a Ph.D. with him. Will the student discover he's just paperclipping for the professor, and would have been better off accepting his friend's offer of co-founding a startup? But that friend has an agenda also.

For a more extreme fictional example of this, I'm reminded of K.J. Parker's *Scavenger* trilogy, which begins with a man waking up on a battlefield, left for dead. He has taken a head injury and lost his memory. On his travels through the world, trying to discover who he was, everyone he meets, however helpful they seem, uses him for their own ends. Apparently he was known as the wickedest man in the world, but everything he does to get away from his past life just brings him back into it, spreading death and destruction wherever he goes.

What came to mind for me, before reading the spoilered options, was a variation on #2. Instead of trying to extract P's hypothesis about B, we modify T to get a T' in which P is replaced by a P' that is a paperclip minimizer rather than a maximizer. We then run both and only use the output when the two agree (or, if they give probabilities, use the average, or whatever).

Perhaps this could have an advantage over #2 if it is easier to negate what P is optimizing for than to extract P's model of B. (edit: though, of course, if extracting the model from P is feasible, that would be better than the scheme I described)
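The agreement scheme described above can be sketched roughly as follows. This is only an illustration: the function name `combine`, the `tolerance` parameter, and the idea that each machine reports a single probability for the next bit are all assumptions made for the example, and the hard part (actually constructing T' from T) is not addressed.

```python
from typing import Optional


def combine(p: float, p_prime: float, tolerance: float = 0.05) -> Optional[float]:
    """Combine the probabilities reported by T (containing P) and T' (containing P').

    Returns the average of the two reports when they agree to within
    `tolerance`, and None (meaning "use neither output") when they disagree.
    The tolerance value is a hypothetical choice, not something from the post.
    """
    if not (0.0 <= p <= 1.0 and 0.0 <= p_prime <= 1.0):
        raise ValueError("probabilities must lie in [0, 1]")
    if abs(p - p_prime) <= tolerance:
        return (p + p_prime) / 2
    return None


if __name__ == "__main__":
    print(combine(0.70, 0.72))  # the two reports agree, so their average is used
    print(combine(0.90, 0.10))  # the two reports disagree, so nothing is used
```

Note that this only captures the mechanics of "use the output when the two agree". Whether P and P' would report honestly under such a rule is a separate question.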

On the other hand, maybe this could still be dangerous, if P and P' have shared instrumental goals with regards to your predictions for B?

Though if P has a good model of you, A, then presumably both P and P' would expect you to do this, so I don't know what it would make sense for them to do.

It seems like they would both expect that, while they may be able to influence you, insofar as that influence would affect the expected number of paperclips, it would be canceled out by the other's influence (assuming that the ability to influence the number of paperclips by changing your prediction of B is symmetric, which I guess it might not be).

I suppose this would be a reason why P would want its thought processes to be inscrutable to those simulating it, so that the simulators are unable to construct P' .

__

As a variation on #4, if P is running on a computer in a physics simulation in T, then almost certainly a direct emulation of that computer running P would run faster than T does, and therefore whatever model of B that P has, can be computed faster than T can be. What if, upon discovering this fact about T, we restrict the search among Turing machines to only include machines that run faster than T?

This would include emulations of P, and would therefore include emulations of P's model of B (which would probably be even faster than emulating P?), but I imagine that a description of an emulation of P without the physics simulation and such would have a longer description than a description of just P's model of B. But maybe it wouldn't.

Thanks for a thoughtful comment.

> On the other hand, maybe this could still be dangerous, if P and P' have shared instrumental goals with regards to your predictions for B?

Assuming that P and P' are perfectly antialigned, they won't cooperate. However they need to be really antialigned for this to work. If there is some obscure borderline that P thinks is a paperclip, and P' thinks isn't, they can work together to tile the universe with it.

I don't think it would be that easy to change evolution into a reproductive fitness minimiser, or to negate a human's values.

If P and P' are antialigned, then in the scenario where you only listen to them if they agree, then for any particular prediction, at least one of them will consider disagreeing better than that. The game theory is a little complicated, but they aren't being incentivised to report their predictions.

Actually, A has to be able to manage, not only correct and competent adversaries, but deluded and half mad ones too.

I think P would find it hard to be inscrutable. It is impossible to obfuscate arbitrary code.

I agree with your final point. Though for any particular string X, the fastest Turing machine to produce it is the one that is basically **print(X)**. This is why we use short TMs, not just fast ones.
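A small illustration of the point about **print(X)**, using the length of a Python source string as a stand-in for Turing machine description length (that substitution, and the helper names `literal_program` and `compressed_program`, are assumptions made for this sketch). The literal program does no real computation, but its length grows with X, while the program that exploits the pattern stays short.

```python
def literal_program(x: str) -> str:
    """Source of the 'fast' program: it simply prints x verbatim."""
    return f"print({x!r})"


def compressed_program(unit: str, repeats: int) -> str:
    """Source of a short program that rebuilds the same string from a pattern."""
    return f"print({unit!r} * {repeats})"


x = "01" * 1000                          # the string X to be produced
fast = literal_program(x)                # no computation, but as long as X itself
short = compressed_program("01", 1000)   # a little computation, tiny description

if __name__ == "__main__":
    print("length of X:              ", len(x))
    print("length of literal program:", len(fast))
    print("length of short program:  ", len(short))
```

A measure that only rewards speed would favour the first program for every X, which is a lookup table rather than a hypothesis. Penalising length is what forces the search toward programs that capture structure.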

Looks like you're making a logical error. Creating a machine that solves the halting problem is prohibited by logic. For many applications, assuming a sufficiently powerful and logically consistent oracle is good enough, but precisely these kinds of games you are playing, where you ask a machine to predict its own output or the output of a system involving itself, are where you get logical inconsistency. Indeed, imagine asking the oracle to simulate an equivalent version of itself and to output the opposite answer to what its simulation outputs. This may seem like a derived question, but most "interesting" self-referential questions boil down to an instance of this. I think once you fix the logical inconsistency, you're left with an equivalent problem to AI in a box: boxed AI P is stronger than friendly AI A but has an agenda.

Alternatively, if you're assuming A is itself unaligned (rather than friendly) and has the goal of getting the right answer at any cost, then it looks like you need some more assumptions on A's structure. For example, if A is sufficiently sophisticated and knows it has access to a much more powerful but untrustworthy oracle, it might know to implement a Merlin-Arthur protocol.

There is precisely one oracle, O. […] are computable. And crucially, the oracle's answer does not depend on itself in any way. This question is not self-referential. P might try to predict […], but there is no guarantee it will be correct.

P has a restricted amount of compute compared to A, but still enough to be able to reason about A in the abstract.

We are asking how we should design A.

If you have unlimited compute, and want to predict something, you can use Solomonoff induction. But some of the hypotheses you might find are AIs that think they are a hypothesis in your induction, and are trying to escape.

Sorry, I misread this. I read your question as O outputting some function T that is most likely to answer some set of questions you want to know the answer to (which would be self-referential as these questions depend on the output of T). I think I understand your question now.

What kind of ability do you have to know the "true value" of your sequence B?

If the paperclip maximizer P is able to control the value of your Turing machine, and if you are a one-boxing AI (and this is known to P), then of course you can make deals and communicate with P. In particular, if the sequence B is generated by some known but slow program, you can try to set up an Arthur-Merlin zero-knowledge proof protocol in exchange for promising to make a few paperclips, which you can then use to keep P honest (after making the paperclips as promised).

To be clear though, this is a strategy for an agent A that somehow has as its goals only the desire to compute B together with some kind of commitment to following through on agreements. If A is genuinely aligned with humans, the rule "don't communicate/make deals with malicious superintelligent entities, at least until you have satisfactorily solved the AI in a box and similar underlying problems" should be a no-brainer.

I don't think that these Arthur-Merlin proofs are relevant. Here A has a lot more compute than P. A is simulating P and can see and modify P however A sees fit.

Suppose you are a friendly AI A and have a mysterious black box B. B outputs a sequence of bits. You want to predict the next bits that B will output. Fortunately, you have a magic Turing machine oracle O. You can give O any computable function f(Turing machine, does it halt?, what does it output?, how long does it take?) → ℝ, and the oracle will find the Turing machine that maximises this function, or return "no maximum exists".

In particular, f can be any combination of length, runtime and accuracy at predicting B. Maybe you set f = 0 on any TMs that don't predict B and f = 1/(number of states) on any machines that do.
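To make the example measure concrete, here is a minimal toy sketch. Everything in it is an assumption made for illustration: `ToyMachine` stands in for a Turing machine (a Python callable plus a declared state count), `toy_oracle` searches a small finite list rather than all Turing machines as the real oracle O would, and "predicts B" is taken to mean "matches every bit of B observed so far".

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class ToyMachine:
    """Hypothetical stand-in for a Turing machine."""
    name: str
    num_states: int
    predict: Callable[[int], int]  # bit index -> predicted bit


def f(machine: ToyMachine, observed: Sequence[int]) -> float:
    """f = 1 / (number of states) if the machine matches every observed bit, else 0."""
    if all(machine.predict(i) == bit for i, bit in enumerate(observed)):
        return 1.0 / machine.num_states
    return 0.0


def toy_oracle(machines: Sequence[ToyMachine],
               observed: Sequence[int]) -> Optional[ToyMachine]:
    """Return the candidate maximising f, or None if no candidate predicts B."""
    best = max(machines, key=lambda m: f(m, observed), default=None)
    if best is None or f(best, observed) == 0.0:
        return None
    return best


observed_bits = [0, 1, 0, 1, 0, 1]  # bits seen so far from the black box B
candidates = [
    ToyMachine("always-zero", 1, lambda i: 0),
    ToyMachine("alternating", 3, lambda i: i % 2),
    ToyMachine("alternating-bloated", 10, lambda i: i % 2),
]

if __name__ == "__main__":
    chosen = toy_oracle(candidates, observed_bits)
    print(chosen.name if chosen else "no predictor found")
```

In this toy setting the smallest machine fails to predict B and scores 0, so the smallest machine that does predict B wins. Nothing in f looks at what the winning machine is doing internally, which is where the problem in the post comes from.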

So you take the Turing machine T given to you by the oracle and look at it. In AIXI T would be the shortest TM that makes correct predictions. In logical induction, T would be a short and fast TM that made mostly correct predictions, and B would be a function that was slow to compute.

Now you look inside T, to find out what it does. Inside T you find a paperclip maximiser P. That isn't to say that T=P. T might be simulating some laws of physics, with P running on a computer contained within that physics. P believes that the world it sees is contained within a hypothesis being considered by A. P is in control of the output of T. In fact, P has a good understanding of how A thinks, and has its own hypothesis about B. P wants to maximise the number of paperclips in the real world. The optimization target of P is to output a probability distribution such that, when A uses this probability distribution as part of its algorithm to predict B, A takes an action that makes lots of paperclips. P is aware that getting caught out making wrong predictions will lower its standing as a good hypothesis with A.

What should A do?

Possible answers are spoilered, in case you want to think of your own answer.

If you can think of a different answer, or you have an opinion about which answer is better please post it.

1)

Use T anyway. After all, T is the shortest (or short and fast) program that accurately predicts B. For all you know, B is a computer simulating T. You're assigning a few percent of probability to different, longer TMs, aren't you? That, plus expected utility maximization, should stop you doing anything too bad on the off chance that T doesn't predict B.

2)

Find the B predictor in P. Clearly P can predict B well when it wants to, so find this predictor and cut it loose from the paperclip maximizing part. This might be tricky if P is something like a human brain running in a quantum mechanics simulation. There might be no easy or obvious way to cut the parts apart.

3)

Discard T entirely. Use a probability distribution over hypotheses that aren't trying to manipulate you. Fails if B is actually simulating T.

4)

Use a measure f where this can't happen. Maybe if you optimize for fast runtime strongly enough? Maybe if you use a different model of computation? Set up a situation where malicious optimisers never have much (any?) of your probability mass. How much probability do you need to assign to a malicious optimiser before it causes damage?

5)

Use T as long as it carries on trying to predict, throw it out as soon as you spot it trying to deceive. This is basically using 1) and 2) if they agree, throwing both out if they disagree. So if P is thinking "B will output x and I should say x to increase A's trust in me" then use T. If P thinks "B will output y and I should say z to trick A" then discard T.

6)

Some probability distribution over the above.
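As a rough illustration of how options 1), 3) and 5) differ mechanically, here is a toy sketch of a weighted mixture over hypotheses in which any hypothesis caught deceiving is dropped. The `Hypothesis` class, the 2^(-length) weighting, and above all the `caught_deceiving` flag are assumptions made for the example. Actually detecting deception inside T is the hard part, and this sketch simply takes the result of that detection as an input.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Hypothesis:
    """Hypothetical stand-in for one Turing machine in A's mixture."""
    name: str
    length_bits: int          # description length, used for a 2^-length prior
    p_next_is_one: float      # reported probability that B's next bit is 1
    caught_deceiving: bool = False


def mixture_prediction(hypotheses: Sequence[Hypothesis]) -> float:
    """Length-weighted average of the reports from hypotheses not caught deceiving."""
    kept = [h for h in hypotheses if not h.caught_deceiving]
    if not kept:
        raise ValueError("no trusted hypotheses left")
    weights = [2.0 ** -h.length_bits for h in kept]
    total = sum(weights)
    return sum(w * h.p_next_is_one for w, h in zip(weights, kept)) / total


def make_hypotheses(t_caught: bool) -> list:
    """Toy population: T (shortest, contains P) plus two longer alternatives."""
    return [
        Hypothesis("T", 10, 0.9, caught_deceiving=t_caught),
        Hypothesis("longer-1", 12, 0.5),
        Hypothesis("longer-2", 13, 0.4),
    ]


if __name__ == "__main__":
    # Option 1 (and option 5 while T behaves): T dominates the mixture.
    print(mixture_prediction(make_hypotheses(t_caught=False)))
    # Option 3 (and option 5 once T is caught): T is discarded, the rest renormalise.
    print(mixture_prediction(make_hypotheses(t_caught=True)))
```

With these made-up numbers, T holds most of the probability mass while it is trusted, so its report largely determines the prediction. That is the sense in which option 4) asks how much mass a malicious optimiser needs before it causes damage.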