Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

A putative new idea for AI control; index here.

All my ideas on reduced impact AI follow the same format: some restriction to ensure a reduced impact (I generally use something like "could a future AI detect that this one existed?") and some "trick" to allow a way of still achieving a larger impact. Furthermore, there is always a "nice" way of achieving this larger impact (do the thing we really want you to do, and only that!), so we can hope the AI will be forced into that behaviour. Previous posts give several examples of such "tricks".

In this post, I'll present another variant - one that I think is the most interesting so far. I won't change the reduced impact condition, but will present a different "trick".

AI pretends to be human

As in this post, imagine that we have a reduced-impact AI in a box that conceals it from us. The AI has some goal, and will output some data to further that goal, then turn itself off. And then we'll shake the box around or something, so we can't be sure that the AI was turned on.

There are two issues to deal with, here: a) The message itself is strong evidence that the AI was turned on, and b) if we exclude the message from the reduced impact, the message becomes dangerous.

To deal with a) first: this approach gets much easier if we have a whole brain emulation (WBE). Then we can house the WBE in the box with the AI. If the AI doesn't send out a message, then the WBE sends out its own message instead. There are variants we can try here if we don't have a WBE - maybe some human volunteer willing to either die or take amnesia pills at the end of the experiment.

So receiving a coherent message is not a surprising thing to come out of the output channel. However, an AI motivated strongly by reduced impact will simply seek to imitate as precisely as possible what the WBE would send. We still need a trick to get something useful.
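
To see why pure imitation gets us nothing, here's a toy sketch (the messages, probabilities, and "detectability" measure are all made up for illustration): an AI that only minimizes how detectable it is will simply copy the WBE's message distribution, rare brilliant plans and all.

```python
# Toy sketch (illustrative only): an AI that only minimizes detectability
# ends up sampling from the WBE's message distribution, which is useless to us.
import random

# Hypothetical message distribution of the boxed WBE (made-up numbers).
wbe_message_dist = {
    "ordinary investment advice": 0.7,
    "mediocre stock tips":        0.299,
    "brilliant get-rich plan":    0.001,
}

def detectability(ai_message_dist, wbe_message_dist):
    """Total-variation distance between what the AI sends and what the WBE
    would have sent; a stand-in for 'could a future AI tell the difference?'."""
    msgs = set(ai_message_dist) | set(wbe_message_dist)
    return 0.5 * sum(abs(ai_message_dist.get(m, 0) - wbe_message_dist.get(m, 0))
                     for m in msgs)

# The detectability-minimizing policy is simply to copy the WBE's distribution:
best_policy = dict(wbe_message_dist)
print(detectability(best_policy, wbe_message_dist))   # 0.0, perfectly hidden
print(random.choices(list(best_policy), weights=best_policy.values()))
# ...and we almost never get the brilliant plan. Hence the need for a trick.
```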

AI pretends to be BEST human

Imagine that we wanted the AI to give us investment advice to make us rich. Say, a binary task: get us ten million dollars over the course of the year.

Now, this is the kind of thing that the WBE could conceivably achieve. We could run the WBE fast, for three subjective weeks, and it would have a chance of giving us advice that gets us that rich. Assume that it only has a 0.1% chance of doing so. The AI, on the other hand, could produce advice that makes us that rich with near certainty.

Therefore if we get advice, follow it, and end up at least ten million dollars richer, that evidence updates the odds that the AI (rather than the WBE) gave us the advice by a factor of about 1000:1.
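
As a sanity check on that arithmetic, here is a minimal sketch, using the 0.1% figure above and an assumed near-certain success rate for the AI:

```python
# Bayes-factor arithmetic for "we got rich: was it the AI or the WBE?"
p_rich_given_wbe = 0.001   # the WBE's 0.1% chance of advice that makes us rich
p_rich_given_ai  = 0.999   # the AI succeeds almost certainly (assumed figure)

# Likelihood ratio in favour of "the AI wrote the message", given that we got rich:
bayes_factor = p_rich_given_ai / p_rich_given_wbe
print(f"odds update: about {bayes_factor:.0f}:1")   # ~1000:1

# With even prior odds (AI on vs. WBE only), the posterior probability it was the AI:
posterior = bayes_factor / (1 + bayes_factor)
print(f"posterior P(AI | rich) = {posterior:.4f}")  # ~0.999
```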

And this gives us the solution to b). The "trick", the escape clause, is to condition on the outcome of our investments. So we're asking: given that we have become rich (and given the other facts about the universe in 50 years' time), what additional evidence is there that the AI was turned on?

Now, a sensible WBE is almost certain not to come up with a scheme that destroys the universe with self-tiling agents - or, indeed, one that does anything too crazy. We could instruct the WBE to this effect, and we're only running it for three weeks, which prevents the kind of drift you see in [more extreme situations](http://lesswrong.com/lw/mfx/an_overall_schema_for_the_friendly_ai_problems/). Thus a message from the WBE, no matter how likely it is to make us rich, is almost certain not to lead to tiling the universe with self-replicating nano-bots.

So here, the AI merely has to come up with a message that could have come from the WBE, with at least 0.01% probability, and that would indeed make us rich. And, therefore, the message cannot contain secret code that would optimise the universe.
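
One way to picture this "best human" trick is as a constrained choice. The candidate messages and their probabilities below are purely illustrative inventions; only the 0.01% floor comes from the argument above:

```python
# Sketch: "pretend to be the BEST human" as a constrained choice.
# Candidate messages, P_WBE(message) and P(rich | message) are made-up numbers.
candidates = [
    # (message,                    P_WBE(message), P(rich | message))
    ("index funds plus leverage",  0.05,           0.002),
    ("inspired small-cap picks",   0.001,          0.10),
    ("brilliant but human plan",   0.0002,         0.90),
    ("alien nanotech blueprint",   1e-30,          1.00),   # no human would write this
]

WBE_PLAUSIBILITY_FLOOR = 1e-4   # the "at least 0.01% probability" condition

# Only messages the WBE could plausibly have sent are admissible...
admissible = [c for c in candidates if c[1] >= WBE_PLAUSIBILITY_FLOOR]
# ...and among those, the AI picks the one most likely to achieve the goal.
best = max(admissible, key=lambda c: c[2])
print(best[0])   # "brilliant but human plan"; the nanotech blueprint is excluded
```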

Note that the goal we are aiming for must not be too extreme in impact (the higher the impact of the goal, the more latitude the AI has to have an impact itself, "hiding" inside the goal's impact), and we should restrict ourselves to events of probability, say, greater than 1/10,000 (when the probability gets too small, odd events start happening).

It seems that this scheme could result in AIs that always give the very best possible advice a human (or WBE) could ever give, which would be a very useful result.

Whose probability?

We have a probability distribution over events. The AI itself will have a probability distribution over events. The future hypothetical super-super-AI it is using to compute reduced impact has a probability distribution over events - and the AI has a probability distribution over that probability distribution. If all of them agree on the probability of us getting richer (given WBE advice and given not), then this scheme should work.

If they disagree, there might be problems. A more complex approach could take the divergent probability estimates into account directly; I'll think about that and return to the issue later.
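
As a rough numerical illustration of why agreement matters (the figures below are made up): if the AI's estimate of the WBE's chance of success differs from ours, then the odds update it thinks it is incurring differs from the one we would compute.

```python
# If estimates of P(rich | WBE) diverge, so do the implied odds updates.
p_rich_given_ai = 0.999

our_p_rich_given_wbe = 0.001   # we think the WBE succeeds 0.1% of the time
ais_p_rich_given_wbe = 0.01    # suppose the AI thinks the WBE succeeds 1% of the time

our_factor = p_rich_given_ai / our_p_rich_given_wbe   # ~1000:1
ais_factor = p_rich_given_ai / ais_p_rich_given_wbe   # ~100:1

print(f"our implied update:      {our_factor:.0f}:1")
print(f"the AI's implied update: {ais_factor:.0f}:1")
# By its own lights the AI thinks it is staying 10x more hidden than it
# actually is by ours; that mismatch is where the problems could come from.
```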

Comments (4)

This seems pretty similar to this, using the WBE as the natural distribution. I think it has the same problems, namely butterfly effects and the fact that it's difficult to safely make multiple decisions.

Butterfly effects are probably solvable using false thermodynamic miracles, but there's still a big problem when we want to make multiple decisions. If we make 1000 decisions in a sequence, then we don't have a good guarantee that each individual decision isn't contributing, in a small part, to creating UFAI (say, by depositing a few bits on a computer somewhere). Seeing the results of 1000 decisions by the AI would give us very strong evidence that it is the AI, not the human, in the box. It doesn't seem like creating multiple get-rich-quick schemes would lead to UFAI, but I don't have a good mathematical argument for this yet.

As a side note, I don't think we really need WBE for this; we just need a probabilistic model for humans that doesn't assign extremely low probability to outputting useful things like get-rich-quick plans. Perhaps even n-gram language models would be good enough.
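
For instance, a toy bigram model with add-one smoothing (purely illustrative, with a made-up corpus) could serve as the "human plausibility" scorer:

```python
# Toy bigram model as a stand-in "human plausibility" scorer (illustrative only).
from collections import defaultdict

corpus = "buy low sell high . buy index funds . sell the losers buy the winners".split()

# Count consecutive word pairs.
counts = defaultdict(lambda: defaultdict(int))
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

vocab = set(corpus)

def bigram_prob(prev, nxt):
    """Add-one-smoothed P(nxt | prev)."""
    total = sum(counts[prev].values())
    return (counts[prev][nxt] + 1) / (total + len(vocab))

def plan_probability(plan):
    """Probability the model assigns to a candidate plan: low, but never zero."""
    words = plan.split()
    p = 1.0
    for prev, nxt in zip(words, words[1:]):
        p *= bigram_prob(prev, nxt)
    return p

print(plan_probability("buy index funds"))
print(plan_probability("buy the winners sell the losers"))
```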

> If we make 1000 decisions in a sequence, then we don't have a good guarantee that each individual decision isn't contributing, in a small part, to creating UFAI (say, by depositing a few bits on a computer somewhere).

Why is this? Seeing a UFAI built this way is very strong evidence that it was an AI in the box! And this is what the AI wants to avoid. And it only has "odd" probabilities with respect to its own creation; it can know perfectly well about all the other AIs reaching their decisions.

What I mean is that the probability that each individual action was done by a human is not too low, but the joint probability for all the actions is very low.

For concreteness: say that every minute, the AI or human either decides to make a stock trade or not. There's another computer somewhere that detects whether the AI/human made a trade and appends a bit (0 or 1) to a file accordingly, and runs it (say, as a Python program) after collecting a million bits.

Now suppose the AI made a stock trade only on the minutes corresponding to 1 bits in a UFAI's source code. Of course, the probability that a human would make all of these decisions is very small. However, the probability that a human would make any one of these decisions (to trade a stock or not in a particular minute) is probably close to 50%. So if we're extremely paranoid, we might only be able to use this scheme to get rich once (instead of getting rich multiple times).
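
To make the arithmetic explicit (my framing of the same point): each bit on its own is unremarkable, but the joint sequence is astronomically unlikely for a human to produce.

```python
import math

n_decisions = 1_000_000          # one trade/no-trade bit per minute, a million bits
p_human_matches_one_bit = 0.5    # a human is ~50% likely to make any given choice

# Per-decision evidence: a factor of about 2 per bit, individually unremarkable.
per_decision_factor = 1 / p_human_matches_one_bit
print(per_decision_factor)       # 2.0

# Joint probability that a human happens to emit exactly the UFAI's bitstring:
log10_joint = n_decisions * math.log10(p_human_matches_one_bit)
print(f"P(human emits this exact sequence) ~ 10^{log10_joint:.0f}")   # ~10^-301030
```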

I see. There are many ways to solve this issue, but they're kludges; I'll try and think if there's a principled way of doing it.