Is it possible to build a safe oracle AI?

by Karl, 20th Apr 2011

It seems to me possible to create a safe oracle AI.

Suppose that you have a sequence predictor which is a good approximation of Solomonoff induction but which runs in reasonable time. This sequence predictor can potentially be really useful (for example, predict future SIAI publications from past SIAI publications, then proceed to read the article which gives a complete account of Friendliness theory...) and is not dangerous in itself.

The question, of course, is how to obtain such a thing.

The trick relies on the concept of a program predictor. A program predictor is a function which predicts, more or less accurately, the output of the program it takes as its input, but within reasonable time (note that when we refer to a program we refer to a program without side effects that just calculates an output). If you have a very accurate program predictor then you can obviously use it to gain a good approximation of Solomonoff induction which runs in reasonable time.
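To make that last step concrete, here is a minimal sketch in a toy setting that is my own assumption and not part of the argument above: a "program" is a bit string whose output is that string repeated forever, and `predicted_output` (a hypothetical name) stands in for the program predictor. Each program gets prior weight 2^-length, and the next bit is predicted by the weighted vote of the programs whose predicted output agrees with what has been observed so far.

```python
from fractions import Fraction
from itertools import product


def toy_program_output(program: str, n: int) -> str:
    """Toy stand-in for running a program: repeat its bits cyclically."""
    return "".join(program[i % len(program)] for i in range(n))


def predicted_output(program: str, n: int) -> str:
    """Hypothetical program predictor.

    It is exact here only because the toy programs are cheap to run;
    the real thing would be an approximation that answers quickly.
    """
    return toy_program_output(program, n)


def next_bit_probability(observed: str, max_len: int = 8) -> Fraction:
    """P(next bit = 1 | observed) under a 2^-length prior over programs."""
    weight_one = Fraction(0)
    weight_total = Fraction(0)
    for length in range(1, max_len + 1):
        for bits in product("01", repeat=length):
            program = "".join(bits)
            out = predicted_output(program, len(observed) + 1)
            if out.startswith(observed):
                weight = Fraction(1, 2 ** length)
                weight_total += weight
                if out[-1] == "1":
                    weight_one += weight
    if weight_total == 0:
        # No program in the bounded class explains the data: stay agnostic.
        return Fraction(1, 2)
    return weight_one / weight_total
```

For example, `next_bit_probability("01010")` gives 23/28: the short program "01" dominates the vote, and the longer programs that merely memorize the prefix only dilute it.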

But of course, this just displaces the problem: how do you get such an accurate program predictor?

Well, suppose you have a program predictor which is good enough to be improved on. Then, you use it to predict the program of less than N bits of length (with N sufficiently big of course) which maximizes a utility function which measures how accurate the output of that program is as a program predictor, given that it generates this output in less than T steps (where T is a reasonable number given the hardware you have access to). Then you run that program. Check the accuracy of the obtained program predictor. If it is insufficient, repeat the process. You should eventually obtain a very accurate program predictor. QED.
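The control flow of that loop can be sketched as follows. This is only a skeleton under stated assumptions: `propose_successor` and `accuracy` are hypothetical callbacks standing in for "ask the current predictor for the best successor program and run it" and "check the accuracy of the obtained program predictor", and nothing here says how to implement either of them.

```python
from typing import Callable, Tuple, TypeVar

P = TypeVar("P")


def bootstrap(
    initial: P,
    propose_successor: Callable[[P], P],
    accuracy: Callable[[P], float],
    target: float,
    max_rounds: int,
) -> Tuple[P, int]:
    """Repeat the improvement step until the predictor is accurate enough.

    Returns the predictor reached and the number of improvement rounds used.
    If max_rounds is exhausted, the last predictor is returned unchecked.
    """
    predictor = initial
    for rounds in range(max_rounds):
        if accuracy(predictor) >= target:
            return predictor, rounds
        # Hypothetical step: use the current predictor to obtain its successor.
        predictor = propose_successor(predictor)
    return predictor, max_rounds


# Toy usage: a "predictor" is just its number of errors out of 100,
# and each round is assumed to halve the error count.
final, rounds_used = bootstrap(
    initial=40,
    propose_successor=lambda errors: errors // 2,
    accuracy=lambda errors: 1 - errors / 100,
    target=0.85,
    max_rounds=10,
)
```

Note that the loop only terminates usefully if each round really does improve accuracy, which is exactly the assumption the argument rests on.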

So we've reduced our problem to the problem of creating a program predictor good enough to be improved upon. That should be possible. In particular, it is related to the problem of logical uncertainty. If we can get a passable understanding of logical uncertainty it should be possible to build such a program predictor using it. Thus a minimal understanding of logical uncertainty should be sufficient to obtain AGI. In fact even without such understanding, it may be possible to patch together such a program predictor...

From Dreams of Friendliness:

Every now and then, someone proposes the Oracle AI strategy: "Why not just have a superintelligence that answers human questions, instead of acting autonomously in the world?"

Sounds pretty safe, doesn't it? What could possibly go wrong?

[...]

While that doesn't mean the issue shouldn't be discussed, it'd be nice if it was done in the context of the previous discussions on the subject, rather than starting from scratch (this isn't the first time since Eliezer posted that that someone comes along and says "I know, let's create an Oracle AI!").

It seems to me that Eliezer's post is just wrong. His argument boils down to this:

the AI needs goals in order to decide how to think: that is, the AI has to act as a powerful optimization process in order to plan its acquisition of knowledge, effectively distill sensory information, pluck "answers" to particular questions out of the space of all possible responses, and of course, to improve its own source code up to the level where the AI is a powerful intelligence. [...] the AI needs a goal of answering questions, and that has to give rise to subgoals of choosing efficient problem-solving strategies, improving its code, and acquiring necessary information.

It's not obvious that a Solomonoff-approximating AI must have a "goal". It could just be a box that, y'know, predicts the next bit in a sequence. After all, if we had an actual uncomputable black box that printed correct Solomonoff-derived probability values for the next bit according to the mathematical definition, that box wouldn't try to manipulate the human operator by embedding epileptic patterns in its predictions or something.

Maybe you could make a case that self-improvement requires real-world goals and is scary (instead of "superintelligence requires real-world goals and is scary"). But I'm not convinced of that either. In fact Karl's post shows that it's not necessarily the case. Also see Schmidhuber's work on Goedel machines, etc. Most self-improving thingies I can rigorously imagine are not scary at all.

It is indeed true that reinforcement learning AIs are scary. For example, AIXI can and will manipulate you into rewarding it. But there are many ideas besides reinforcement learning.

ETA: I gave an idea for AI containment some time ago, and it didn't get shot down. There are probably many other ways to build a non-dangerous strong AI that don't involve encoding or inferring the utility function of humanity.

ETA 2: it turns out that the connotations of this comment are wrong, thanks roystgnr.

A box that does nothing except predict the next bit in a sequence seems pretty innocuous, in the unlikely event that its creators managed to get its programming so awesomely correct on the first try that they didn't bother to give it any self-improvement goals at all.

But even in that case there are probably still gotchas. Once you start providing the box with sequences that correspond to data about the real-world results of the previous and current predictions, then even a seemingly const optimization problem statement like "find the most accurate approximation of the probability distribution function for the next data set" becomes a form of a real-world goal. Stochastic approximation accuracy is typically positively correlated with the variance of the true solution, for instance, and it's clear that the output variance of the world's future would be greatly reduced if only there weren't all those random humans mucking it up...

That doesn't sound right. The box isn't trying to minimize the "variance of the true solution". It is stating its current beliefs that were computed from the input bit sequence by using a formula. If you think it will manipulate the operator when some of its output bits are fed into itself, could you explain that a little more technically?

I never said the box was trying to minimize the variance of the true solution for its own sake, just that it was trying to find an efficient accurate approximation to the true solution. That this efficiency typically increases as the variance of the true solution decreases means that the possibility of increasing efficiency by manipulating the true solution follows. Surely, no matter how goal-agnostic your oracle is, you're going to try to make it as accurate as possible for a given computational cost, right?

That's just the first failure mode that popped into my mind, and I think it's a good one for any real computing device, but let's try to come up with an example that even applies to oracles with infinite computational capability (and that explains how that manipulation occurs in either case). Here's a slightly more technical but still grossly oversimplified discussion:

Suppose you give me the sequence of real world data y1, y2, y3, y4... and I come up with a superintelligent way to predict y5, so I tell you y5 := x5. You tell me the true y5 later, I use this new data to predict y6 := x6.

But wait! No matter how good my rule xn = f(y1...y{n-1}) was, it's now giving me the wrong answers! Even if y4 was a function of {y1,y2,y3}, the very fact that you're using my prediction x5 to affect the future of the real world means that y5 is now a function of {y1, y2, y3, y4, x5}. Eventually I'm going to notice this, and now I'm going to have to come up with a new, implicit rule for xn = f(y1...y{n-1},xn).

So now we're not just trying to evaluate an f, we're trying to find fixed points for an f - where in this context "a fixed point" is math lingo for "a self-fulfilling prophecy". And depending on what predictions are called for, that's a very different problem. "What would the stock market be likely to do tomorrow in a world with no oracles?" may give you a much more stable answer than "What is the stock market likely to do tomorrow after everybody hears the announcement of what a super-intelligent AI thinks the stock market is likely to do tomorrow?" "Who would be likely to kill someone tomorrow in a world with no oracles?" will probably result in a much shorter list than "Who is likely to kill someone tomorrow, after the police receive this answer from the oracle and send SWAT to break down their doors?" "What would the probability of WW3 within ten years have been without an oracle?" may have a significantly more pleasant answer than "What would the probability of WW3 within ten years be, given that anyone whom the oracle convinces of a high probability has motivation to react with arms races and/or pre-emptive strikes?"
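Here is a toy illustration of the fixed-point view, under assumptions of my own: the three "worlds" below are made-up response functions that map an announced yes/no prediction to what then actually happens. A prediction is self-fulfilling exactly when the world maps it to itself.

```python
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


def self_fulfilling(predictions: Iterable[T], world: Callable[[T], T]) -> List[T]:
    """Return the announced predictions x that come true: world(x) == x."""
    return [x for x in predictions if world(x) == x]


def bank_run(announced: bool) -> bool:
    """Hypothetical world: depositors panic if and only if a run is predicted."""
    return announced


def prevented_crime(announced: bool) -> bool:
    """Hypothetical world: a predicted crime is prevented, an unpredicted one happens."""
    return not announced


def world_without_listeners(announced: bool) -> bool:
    """Hypothetical world: nobody hears the oracle, and nothing happens."""
    return False


candidates = [False, True]
two_prophecies = self_fulfilling(candidates, bank_run)
no_prophecy = self_fulfilling(candidates, prevented_crime)
one_prophecy = self_fulfilling(candidates, world_without_listeners)
```

Only the last world has a unique answer. In the bank-run world both answers are "correct", so which one gets announced is a choice the predictor makes, and in the prevented-crime world no announcement can be accurate at all.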

Ooh, this looks right. A predictor that "notices" itself in the outside world can output predictions that make themselves true, e.g. by stopping us from preventing predicted events, or something even more weird. Thanks!

(At first I thought Solomonoff induction doesn't have this problem, because it's uncomputable and thus cannot include a model of itself. But it seems that a computable approximation to Solomonoff induction may well exhibit such "UDT-ish" behavior, because it's computable.)

This idea is probably hard to notice at first, since it requires recognizing that a future with a fixed definition can still be controlled by other things with fixed definitions (you don't need to replace the question in order to control its answer). So even if a "predictor" doesn't "act", it still does determine facts that control other facts, and anything that we'd call intelligent cares about certain facts. For a predictor, this would be the fact that its prediction is accurate, and this fact could conceivably be controlled by its predictions, or even by some internal calculations not visible to its builders. With acausal control, air-tight isolation is more difficult.

I am pretty sure that Solomonoff induction doesn't have this problem.

Not because it is uncomputable, but because it's not attempting to minimise its error rate. It doesn't care if its predictions don't match reality.

[This comment is no longer endorsed by its author]

If reality ~ computable, then minimizing error rate ~ matching reality.

(Retracted because I misread your comment. Will think more.)

[This comment is no longer endorsed by its author]

If you play taboo with the word "goals" I think the argument may be dissolved.

My laptop doesn't have a "goal" of satisfying my desire to read LessWrong. I simply open the web browser and type in the URL, initiating a basically deterministic process which the computer merely executes. No need to imbue it with goals at all.

Except now my browser is smart enough to auto-fill the LessWrong URL after just a couple of letters. Is that goal-directed behavior? I think we're already at the point of hairsplitting semantic distinctions and we're talking about web browsers, not advanced AI.

Likewise, it isn't material whether an advanced predictor/optimizer has goals; what is relevant is that it will follow its programming when that programming tells it to "tell me the answer." If it needs more information to tell you the answer, it will get it, and it won't worry about how it gets it.

I think your taboo wasn't strong enough and you allowed some leftover essence of anthropomorphic "goaliness" to pollute your argument.

When you talk about an "advanced optimizer" that "needs more information" to do something and goes out there to "get it", that presupposes a model of AIs that I consider wrong (or maybe too early to talk about). If the AI's code consists of navigating chess position trees, it won't smash you in the face with a rook in order to win, no matter how strongly it "wants" to win or how much "optimization power" it possesses. If an AI believes with 100% probability that its Game of Life universe is the only one that exists, it won't set out to conquer ours. AIXI is the only rigorously formulated dangerous AI that I know of, its close cousin Solomonoff Induction is safe, both these conclusions are easy, and neither requires CEV.

ETA: if someone gets a bright new idea for a general AI, of course they're still obliged to ensure safety. I'm just saying that it may be easy to demonstrate for some AI designs.

Also, what about this?

Peter Wegner has produced dozens of papers over the past few decades arguing that Turing Machines are inadequate models of computation, as computation is actually practiced. To get a sample of this, Google "Wegner computation interaction".

I sometimes think that part of the seductiveness of the "safe oracle AI" idea comes from the assumption that the AI really will be like a TM - it will have no interaction with the external world between the reading of the input tape and the writing of the answer. To the contrary, the danger arises because the AI will interact with us in the interim between input and output with requests for clarification, resources, and assistance. That is, it will realize that manipulation of the outside world is a permitted method in achieving its mission.

Peter Wegner has produced dozens of papers over the past few decades arguing that Turing Machines are inadequate models of computation, as computation is actually practiced. To get a sample of this, Google "Wegner computation interaction".

I had a look, and at a brief glance I didn't see anything beyond what CSP and CCS were invented for more than 30 years ago. The basic paper defining a type of interaction machine is something that I would have guessed was from the same era, if I didn't know it was published in 2004.

I say this as someone who worked on this sort of thing a long time ago -- in fact, I did my D.Phil. with Hoare (CSP) and a post-doc with Milner (CCS), more than 30 years ago. I moved away from the field and know nothing about developments of recent years, but I am not getting the impression from the 2006 book that Wegner has co-edited on the subject that there has been much. Meanwhile, in the industrial world people have been designing communication protocols and parallel hardware, with or without the benefit of these theoretical researches, for as long as there have been computers.

None of which is to detract from your valid point that thinking of the AI as a TM, and neglecting the effects of its outputs on its inputs, may lead people into the safe oracle fallacy.

To the contrary, the danger arises because the AI will interact with us in the interim between input and output with requests for clarification, resources, and assistance. That is, it will realize that manipulation of the outside world is a permitted method in achieving its mission.

Except this is not the case for the AI I describe in my post.

The AI I describe in my post cannot make requests for anything. It doesn't need clarification because we don't ask it questions in a natural language at all! So I don't think your criticism applies to this specific model.

I sometimes think that part of the seductiveness of the "safe oracle AI" idea comes from the assumption that the AI really will be like a TM - it will have no interaction with the external world between the reading of the input tape and the writing of the answer. To the contrary, the danger arises because the AI will interact with us in the interim between input and output with requests for clarification, resources, and assistance. That is, it will realize that manipulation of the outside world is a permitted method in achieving its mission.

A forecaster already has actuators - its outputs (forecasts).

Its attempts to manipulate the world seem pretty likely to use its existing output channel initially.

This sequence predictor can potentially be really useful (for example, predict future SIAI publications from past SIAI publications, then proceed to read the article which gives a complete account of Friendliness theory...) and is not dangerous in itself.

I see a way in which a simple, super-intelligent sequence predictor can be dangerous. If it can predict an entire journal issue, it surely can simulate a human being sufficiently well to build a persuasive enough argument for letting it out of the box.

However you don't need a complicated program predictor, you can just use the speed prior instead of the universal prior.

Then, you use it to predict the program of less than N bits of length (with N sufficiently big of course) which maximizes a utility function which measures how accurate the output of that program is as a program predictor, given that it generates this output in less than T steps (where T is a reasonable number given the hardware you have access to). Then you run that program.

Wait, what?

Then, you use it to predict the program of less than N bits of length (with N sufficiently big of course) which maximizes a utility function which measures how accurate the output of that program is as a program predictor, given that it generates this output in less than T steps (where T is a reasonable number given the hardware you have access to).

How do you check how "accurate" a program predictor is - if you don't already have access to a high-quality program predictor?

You can do it in a very calculation-intensive manner: take all programs of less than K bits (with K sufficiently big), calculate their answers (to avoid the halting problem, wait for each answer only a finite but truly enormous number of steps, for example 3^^^3 steps) and compare them to the answers given by the program predictor. Of course you can't do that in any reasonable amount of time, which is why you're using the "good enough to be improved on" program predictor to predict the result of the calculation.
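For concreteness, the brute-force check looks something like this on a scale where it can actually be run. The toy "programming language" is purely an assumption for illustration: a program is a bit string, read as a starting number for the Collatz iteration, and its answer is the parity of the number of steps needed to reach 1 (or no answer if the step limit is hit).

```python
from itertools import product
from typing import Callable, Optional


def run_toy_program(program: str, max_steps: int) -> Optional[int]:
    """Run a toy program with a step limit; None means it did not finish in time."""
    n = int("1" + program, 2)  # every bit string maps to a distinct positive integer
    steps = 0
    while n != 1:
        if steps >= max_steps:
            return None
        n = n // 2 if n % 2 == 0 else 3 * n + 1
        steps += 1
    return steps % 2


def brute_force_accuracy(
    predictor: Callable[[str], Optional[int]], max_bits: int, max_steps: int
) -> float:
    """Fraction of all programs of at most max_bits bits that the predictor gets right."""
    correct = 0
    total = 0
    for length in range(max_bits + 1):
        for bits in product("01", repeat=length):
            program = "".join(bits)
            total += 1
            if predictor(program) == run_toy_program(program, max_steps):
                correct += 1
    return correct / total


# A lazy predictor that always answers 0, scored on all programs of at most 6 bits.
lazy_score = brute_force_accuracy(lambda program: 0, max_bits=6, max_steps=1000)
```

The number of programs doubles with every extra bit, which is why in the real scheme this calculation is never carried out and only its result is predicted.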

It sounds as though you are proposing using your approximate program predictor as a metric of how accurate a new candidate program predictor is. However, that is not going to result in any incentive to improve on the original approximate program predictor's faults.

In fact, I'm "asking" the program predictor to find the program which generates the best program predictor. It should be noted that the program predictor does not necessarily "consider" itself perfect: if you ask it to predict how many of the programs of less than M bits it will predict correctly, it won't necessarily say "all of them" (in fact it shouldn't say that if it's good enough to be improved on).

You are making this harder than it needs to be. General forecasting is equivalent to general stream compression. That insight offers a simple and effective quality-testing procedure - you compress the output of randomly-configured FSMs.
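A rough sketch of that testing procedure, with details that are my own assumptions: the FSM here is a small deterministic machine with random transitions and random output bits, and the forecaster under test is a simple adaptive context model with Laplace smoothing. The score is the ideal code length, i.e. the sum of -log2 of the probability the forecaster assigned to each bit that actually occurred; fewer bits means better forecasting.

```python
import math
import random
from collections import defaultdict
from typing import List


def random_fsm_output(num_states: int, length: int, seed: int) -> List[int]:
    """Emit bits from a randomly configured deterministic FSM (no input tape)."""
    rng = random.Random(seed)
    next_state = [rng.randrange(num_states) for _ in range(num_states)]
    emit = [rng.randrange(2) for _ in range(num_states)]
    state = 0
    bits = []
    for _ in range(length):
        bits.append(emit[state])
        state = next_state[state]
    return bits


def code_length_bits(bits: List[int], order: int) -> float:
    """Ideal compressed size, in bits, under an adaptive order-`order` context model."""
    counts = defaultdict(lambda: [1, 1])  # Laplace pseudo-counts for bit 0 and bit 1
    total = 0.0
    for i, bit in enumerate(bits):
        context = tuple(bits[max(0, i - order):i])
        seen = counts[context]
        total += -math.log2(seen[bit] / (seen[0] + seen[1]))
        seen[bit] += 1  # update only after the bit has been charged for
    return total


stream = random_fsm_output(num_states=4, length=2000, seed=0)
compressed = code_length_bits(stream, order=4)
```

A forecaster that knows nothing pays exactly one bit per symbol (2000 bits here), so any score well below that means the model has picked up the machine's structure. Ranking candidate forecasters by this score over many random machines is the quality test.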

There's a big existing literature about how to create compressors - it is a standard computer-science problem.

I'm not sure what you're accusing me of making harder than it needs to be.

Could you clarify?