The AI, the best human advisor

Stuart_Armstrong

A putative new idea for AI control; index here.

All my ideas on reduced impact AI follow the same format: some restriction to ensure a reduced impact (I generally use something like "could a future AI detect that this one existed?") and some "trick" to allow a way of still achieving a larger impact. Furthermore, there is always a "nice" way of achieving this larger impact (do the thing we really want you to do, and only that!), so that we hope the AI can be forced into that behaviour. Examples of "tricks" are:

It could be an output channel which does not count against reduced impact.
It could involve several AIs that combine to have larger impact, with a conditional motivation to "fool" the reduced impact requirement.
It could be a variant of the above where we attempt to generalise the conditional motivation in a sensible manner.

In this post, I'll present another variant - one that I think is the most interesting so far. I won't change the reduced impact condition, but will present a different "trick".

AI pretends to be human

As in this post, imagine that we have a reduced-impact AI in a box that conceals it from us. The AI has some goal, and will output some data to further that goal, then turn itself off. And then we'll shake the box around or something, so we can't be sure that the AI was turned on.

There are two issues to deal with, here: a) The message itself is strong evidence that the AI was turned on, and b) if we exclude the message from the reduced impact, the message becomes dangerous.

To deal with a) first: this approach gets much easier if we have a whole brain emulation (WBE). Then we can house the WBE in the box with the AI. If the AI doesn't send out a message, then the WBE sends out its own message instead. There are variants we can try here if we don't have a WBE - maybe some human volunteer willing to either die or take amnesia pills at the end of the experiment.

So receiving a coherent message from the output channel is not surprising. However, an AI motivated strongly by reduced impact will simply seek to imitate, as precisely as possible, what the WBE would send. We still need a trick to get something useful.

AI pretends to be BEST human

Imagine that we wanted the AI to give us investment advice to make us rich. Say, a binary task: get us ten million dollars over the course of the year.

Now, this is the kind of thing that the WBE could conceivably achieve. We could run the WBE fast, for three subjective weeks, and it has a chance of giving us advice that would get us that rich. Assume that it only has 0.1% chance of doing so. The AI, on the other hand, could produce advice that makes us that rich with almost certainty.

Therefore, if we get advice, follow it, and end up at least ten million dollars richer, we can conclude that the odds update in favour of the AI having given us the advice (rather than the WBE doing so) is about 1000:1.
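As a rough worked example of that update: the 0.1% success chance for the WBE comes from the text above, while the AI succeeding with probability 1 (a stand-in for "almost certainty") and the 50/50 prior that the AI was turned on are assumptions made purely for illustration.

```python
def posterior_ai_probability(prior_ai: float,
                             p_rich_given_ai: float,
                             p_rich_given_wbe: float) -> float:
    """Probability that the AI wrote the advice, given that we became rich."""
    prior_odds = prior_ai / (1.0 - prior_ai)
    # Likelihood ratio: how much more likely "rich" is under AI advice.
    likelihood_ratio = p_rich_given_ai / p_rich_given_wbe
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1.0 + posterior_odds)


P_RICH_GIVEN_WBE = 0.001  # from the text: the WBE has a 0.1% chance
P_RICH_GIVEN_AI = 1.0     # assumption: stands in for "almost certainty"
PRIOR_AI = 0.5            # assumption: the box-shaking leaves us at 50/50

likelihood_ratio = P_RICH_GIVEN_AI / P_RICH_GIVEN_WBE
posterior = posterior_ai_probability(PRIOR_AI, P_RICH_GIVEN_AI, P_RICH_GIVEN_WBE)
print(f"likelihood ratio (AI : WBE) = {likelihood_ratio:.0f} : 1")
print(f"posterior P(AI wrote it | rich) = {posterior:.4f}")
```

Under these assumptions, becoming rich is by itself strong evidence that the AI was on, which is why the outcome has to be handled specially.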

And this gives us the solution to b). The "trick", the escape clause, is to condition on the outcome of our investments. So we're asking: given that we have become rich (and given the other facts about the universe in 50 years' time), what additional evidence is there that the AI was turned on?

Now, a sensible WBE is almost certain not to come up with a scheme that destroys the universe with self-tiling agents - or, indeed, does anything too crazy. We could instruct the WBE to this effect, and we're only running it for three weeks, which prevents the kind of drift that you see in more extreme situations. Thus a message from the WBE, no matter how likely it is to make us rich, is almost certain not to lead to tiling the universe with self-replicating nano-bots.

So here, the AI merely has to come up with a message that could have come from the WBE, with at least 0.01% probability, and that would indeed make us rich. And, therefore, the message cannot contain secret code that would optimise the universe.

Note that we need the goal we are looking for to be not too extreme in impact (the higher the impact of the goal, the more latitude the AI has to have impact itself, "hiding" inside the goal's impact), and we should restrict to events of probability, say, greater than 1/10,000 (as when the probability gets too small, odd events start happening).
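A toy sketch of the selection rule described above: among candidate messages, keep only those the WBE could plausibly have sent (probability above the 1/10,000 floor from the text), then pick the one most likely to make us rich. The `Candidate` class, its fields, and all the numbers in the example list are hypothetical; a real AI would not have such a tidy table of candidates.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    name: str
    p_wbe_sends: float   # hypothetical: chance the WBE would send advice like this
    p_makes_rich: float  # hypothetical: chance the advice reaches the target


MIN_WBE_PROBABILITY = 1 / 10_000  # the floor suggested in the text


def choose_message(candidates: Sequence[Candidate],
                   floor: float = MIN_WBE_PROBABILITY) -> Optional[Candidate]:
    """Best advice among messages that could plausibly have come from the WBE."""
    plausible = [c for c in candidates if c.p_wbe_sends >= floor]
    if not plausible:
        return None  # nothing passes: the AI just imitates the WBE
    return max(plausible, key=lambda c: c.p_makes_rich)


# Invented illustrative numbers, not estimates.
candidates = [
    Candidate("ordinary index-fund advice", p_wbe_sends=0.2, p_makes_rich=0.0005),
    Candidate("lucky but human-looking stock picks", p_wbe_sends=0.001, p_makes_rich=0.95),
    Candidate("message with hidden optimiser code", p_wbe_sends=1e-12, p_makes_rich=0.999),
]
chosen = choose_message(candidates)
print(chosen.name if chosen else "no plausible message")
```

The third candidate is excluded by the floor even though it scores highest on making us rich, which is the point of the restriction.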

It seems that this scheme could result in AIs that always give the very best possible advice a human (or WBE) could ever give, which would be a very useful result.

Whose probability?

We have a probability distribution over events. The AI itself will have a probability distribution over events. The future hypothetical super-super-AI it is using to compute reduced impact has a probability distribution over events - and the AI has a probability distribution over that probability distribution. If all of them agree on the probability of us getting richer (given WBE advice and given not), then this scheme should work.

If they disagree, there might be problems. A more complex approach could directly take into account the divergent probability estimates; I'll think of that and return to the issue later.

This is a really neat idea. It limits the optimizing power of the AI to no more than a very lucky and intelligent human. But that might be enough to bootstrap to more useful work. Ask the human to solve some hard technical problem. Or even "come up with a better approach to FAI".

This seems like a really elaborate way of limiting the AI's optimization power. Why not just limit its computational resources or level of self improvement? Surely there must be ways to physically restrict an AI's intelligence below super-intelligent ability. I know it's a tricky thing to get right, but so are these ideas. Has this approach been considered at all?

I don't think all the whole brain emulation stuff is necessary. Just ask it to produce outputs that mimic humans. We have plenty of human writing to train it on. Its goal is then to maximize the probability that its output came from a human rather than an AI, conditioned on it having solved the problem. I think that is about equivalent to your idea.

Why not just limit its computational resources or level of self improvement?

Because it's very hard to define "level of self improvement", and it's not clear how to relate "limited computational resources" with "limited abilities".

Then perhaps we should research ways to measure and restrict intelligence/optimization power.

Just off the top of my head, one way would be to add another term to its utility function, representing the amount of computing power (or time) used. It would then have an incentive to use as little computing power as possible to meet its goal.

For example, you ask the AI to solve a problem for you. The utility function maximizes the probability that its answer will be accepted by you as a solution. But once the probability goes above 90%, the utility stops increasing, and a penalty is added for using more computing power.

So the AI tries to solve the problem, but uses the minimal amount of optimization necessary, and doesn't over-optimize.
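A minimal sketch of that kind of utility function, assuming the penalty is linear in compute and applies throughout; the function name, the penalty rate and the three example plans are all made up for illustration.

```python
def capped_utility(p_accept: float, compute_used: float,
                   cap: float = 0.9, penalty_per_unit: float = 0.01) -> float:
    """Acceptance probability, capped at `cap`, minus a linear compute penalty.

    Assumption: the penalty is linear and always applies; the 90% cap is
    the figure from the comment above, the penalty rate is invented.
    """
    return min(p_accept, cap) - penalty_per_unit * compute_used


# Hypothetical plans: (acceptance probability, units of compute used).
plans = {
    "quick guess": (0.50, 1.0),
    "solid answer": (0.90, 5.0),
    "over-optimised answer": (0.99, 50.0),
}
best_plan = max(plans, key=lambda name: capped_utility(*plans[name]))
print(best_plan)
```

With these numbers the agent prefers the plan that just reaches the cap, since extra compute beyond that point only costs it utility.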

Those approaches fail the "subagent problem". As in, the AI can pass it by creating a subagent to solve the problem for it, without the subagent having those restrictions.

I'm assuming the AI exists in a contained box. We can accurately measure the time it is on and/or the resources used within the box. So it can't create any subagents that don't also use up its resources and count towards the penalty.

If the AI can escape from the box, we've already failed. There is little point in trying to control what it can do with its output channel.

Reduced impact can control an AI that has the ability to get out of its box. That's what I like about it.

Very interesting proposal. It seems to me that this gives you a tunable parameter (the 0.1%) that trades off optimisation and risk. For instance, you can bet the best WBE advisor won't optimise the universe, but it could very well give you a strategy that causes a catastrophic market crash, civil wars, or some other undesired outcome (while setting you up to profit from it). The more improbable you allow the result to be, the more effective the advice, but the greater the chance of an unexpected unwanted outcome.

Why do we think the WBE is "safe"? Natural intelligence is unfriendly in exactly the same way as a naively created AI.

The human is likely to be less effective than an AI, which makes it safer. But I don't see how you can assert that a human is less likely to intentionally or accidentally destroy the universe, given the same power.

We think that WBE are safe, in that they are unlikely to be able to produce a single message starting an optimisation process taking over the universe.

You do have to be careful not to give it too much computation time: http://lesswrong.com/lw/qk/that_alien_message/

Indeed! That's why I give them three subjective weeks.

Robin Parbrook, the highest-performing long-term fund manager, is essentially unknown even when you google him. I wouldn't be so confident that simply outperforming humans in investment advice would raise the AI's profile as the 'best human'.

So here, the AI merely has to come up with a message that could have come from the WBE, with at least 0.01% probability, and that would indeed make us rich. And, therefore, the message cannot contain secret code that would optimise the universe.

It seems that you are adding unnecessary complexity without actually addressing the core issue.

If you give to the AI the goal "Earn me 10 million dollars/paper clips" then it is unlikely ( * ) that it will do something questionable in order to achieve the goal. If you give it the goal "Maximize the number of dollars/paper clips to my name" then it may start to optimize the universe in ways you won't like. Your trick does nothing to solve this issue.

( * well, provided that you at least tried to implement some moral and legal constraints, otherwise it may well go Walter White)

If you give to the AI the goal "Earn me 10 million dollars/paper clips" then it is unlikely that it will do something questionable in order to achieve the goal.

I disagree. In a stochastic universe, you can never be certain you've achieved your goal. An otherwise unrestricted AI with that goal will create a few trillion paper clips, just to be sure, and then obsessively count them again and again. You might make the argument that minor practical restrictions can make that AI design safe, which is plausible, but the goal is not intrinsically safe.
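A toy illustration of why "just to be sure" pushes towards overproduction, assuming (purely for the example) a small target of 10 paper clips and that each clip independently turns out fine with probability 0.9: the chance of having at least 10 good clips keeps rising as more are made, so an agent maximising that chance never has a reason to stop at exactly 10.

```python
from math import comb


def p_at_least(target: int, made: int, p_good: float) -> float:
    """Binomial tail: probability that at least `target` of `made` clips are good."""
    if made < target:
        return 0.0
    return sum(comb(made, k) * p_good**k * (1.0 - p_good)**(made - k)
               for k in range(target, made + 1))


TARGET = 10   # assumption: a small target keeps the numbers readable
P_GOOD = 0.9  # assumption: each clip independently succeeds 90% of the time

probabilities = {made: p_at_least(TARGET, made, P_GOOD) for made in (10, 12, 15, 20)}
for made, p in probabilities.items():
    print(f"make {made:2d} clips -> P(at least {TARGET} good) = {p:.6f}")
```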

Your trick does nothing to solve this issue.

This is for a reduced impact AI. The idea is to allow a larger impact in one area, without reduced impact shutting it down.

I disagree. In a stochastic universe, you can never be certain you've achieved your goal. An otherwise unrestricted AI with that goal will create a few trillion paper clips, just to be sure, and then obsessively count them again and again. You might make the argument that minor practical restrictions can make that AI design safe, which is plausible, but the goal is not intrinsically safe.

Under a literal interpretation of the statement it will create exactly 10 million paper clips, not one more. Anyway, nothing is 100% safe.

Anyway, nothing is 100% safe.

Yes, but expected utility maximisers of this type will still take over the universe, if they could do it easily, to better accomplish their goals. Reduced impact agents won't.