Versions of AIXI can be arbitrarily stupid


Interesting paper, but I'm not sure this example is a good way to illustrate the result, since if someone actually built AIXI using the prior described in the OP, it will quickly learn that it's not in Hell since it won't actually receive ε reward for outputting "0".

Here's my attempt to construct a better example. Suppose you want to create an agent that qualifies as an AIXI but keeps just outputting "I am stupid" for a very long time. What you do is give it a prior which assigns ε weight to a "standard" universal prior, and the rest of the weight to a Hell environment which returns exactly the same (distribution of) rewards and inputs as the "standard" prior for outputting "I am stupid", and 0 reward forever if the AIXI ever does anything else. This prior still qualifies as "universal".

This AIXI can't update away from its initial belief in the Hell environment because it keeps outputting "I am stupid", for which the Hell environment is indistinguishable from the real environment. If in the real world you keep punishing it (give it 0 reward), I think eventually this AIXI will do something else, because once its expected reward for outputting "I am stupid" falls below ε, risking the near-certainty of Hell's eternal 0 reward for the ε chance of a better outcome becomes worthwhile. But if ε is small enough it may be impossible to punish AIXI consistently enough (i.e., it could occasionally get a non-zero reward due to cosmic rays or quantum tunneling) to make this happen.
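To put toy numbers on this comparison (ε, the deviation payoff, and the compliance payoff below are all invented for illustration, with discounting collapsed into single values):

```python
# Toy sketch of the construction above (all numbers invented for illustration).
eps = 1e-6            # weight given to the "standard" universal prior
v_deviate_std = 1.0   # value of deviating, if the standard prior is right
v_comply = 0.01       # discounted value of continuing to output "I am stupid"

# Under the Hell component (weight 1 - eps), any deviation means 0 reward
# forever, so the expected value of deviating collapses to the eps-weighted
# standard term.
v_deviate = eps * v_deviate_std + (1 - eps) * 0.0

# The agent deviates only if v_deviate exceeds v_comply; here it never does.
print(v_deviate < v_comply)   # True
```

Only if punishment drives `v_comply` below ε-scale values does deviation become worthwhile, which is the threshold described above.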

I think one could construct similar examples for UDT so the problem isn't with AIXI's design, but rather that a prior being "universal" isn't "good enough" for decision making. We actually need to figure out what the "actual", or "right", or "correct" prior is. This seems to resolve one of my open problems.

it will quickly learn that it's not in Hell since it won't actually receive ε reward for outputting "0".

The example was meant to show that even if it were in Heaven, it would behave as if it were in Hell (now that's a theological point ^_^ ). Your example is more general.

The result of the paper is that as long as the AIXI gets a minimum non-zero average reward (essentially), you can make it follow that policy forever.

As I discussed before, IMO the correct approach is not looking for the one "correct" prior since there is no such thing but specifying a "pure learning" phase in AI development. In the case of your example, we can imagine the operator overriding the agent's controls and forcing it to produce various outputs in order to update away from Hell. Given a sufficiently long learning phase, all universal priors should converge to the same result (of course if we start from a ridiculous universal prior it will take ridiculously long, so I still grant that there is a fuzzy domain of "good" universal priors).

As I discussed before, IMO the correct approach is not looking for the one "correct" prior since there is no such thing but specifying a "pure learning" phase in AI development.

I'm not sure about "no correct prior", and even if there is no "correct prior", maybe there is still "the right prior for me", or "my actual prior", which we can somehow determine or extract and build into an FAI?

In the case of your example, we can imagine the operator overriding the agent's controls and forcing it to produce various outputs in order to update away from Hell.

How do you know when you've forced the agent to explore enough? What if the agent has a prior which assigns a large weight to an environment that's indistinguishable from our universe, except that lots of good things happen if the sun gets blown up? It seems like the agent can't update away from this during the training phase.

(of course if we start from a ridiculous universal prior it will take ridiculously long, so I still grant that there is a fuzzy domain of "good" universal priors)

So you think "universal" isn't "good enough", but something more specific (but perhaps not unique as in "the correct prior" or "the right prior for me") is? Can you try to define it?

I'm not sure about "no correct prior", and even if there is no "correct prior", maybe there is still "the right prior for me", or "my actual prior", which we can somehow determine or extract and build into an FAI?

This sounds much closer to home. Note, however, that there is a certain ambiguity between the prior and the utility function. UDT agents maximize Sum_x Prior(x) U(x), so certain simultaneous redefinitions of Prior and U will lead to the same thing.
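That ambiguity can be checked with a tiny numeric sketch (the worlds, prior, utilities, and rescaling factor c are all made up): multiplying the prior by any positive c(x) and dividing the utility by the same c(x) rescales expected utility by a constant, leaving all comparisons unchanged.

```python
# Check that rescaling Prior(x) by c(x) and U(x) by 1/c(x) preserves
# expected-utility comparisons (illustrative numbers only).
worlds = ["x1", "x2", "x3"]
prior  = {"x1": 0.5, "x2": 0.3, "x3": 0.2}
u      = {"x1": 10.0, "x2": 5.0, "x3": 1.0}
c      = {"x1": 2.0, "x2": 0.5, "x3": 4.0}   # arbitrary positive rescaling

z = sum(prior[x] * c[x] for x in worlds)      # renormalization constant
prior2 = {x: prior[x] * c[x] / z for x in worlds}
u2     = {x: u[x] / c[x] for x in worlds}

eu1 = sum(prior[x] * u[x] for x in worlds)
eu2 = sum(prior2[x] * u2[x] for x in worlds)
# eu2 equals eu1 / z, so any two options compare the same way under both pairs.
```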

But in that case, why do we need a special "pure learning" period where you force the agent to explore? Wouldn't any prior that would qualify as "the right prior for me" or "my actual prior" not favor any particular universe to such an extent that it prevents the agent from exploring in a reasonable way?

To recap, if we give the agent a "good" prior, then the agent will naturally explore/exploit in an optimal way without being forced to. If we give it a "bad" prior, then forcing it to explore during a pure learning period won't help (enough) because there could be environments in the bad prior that can't be updated away during the pure learning period and cause disaster later. Maybe if we don't know how to define a "good" prior but there are "semi-good" priors which we know will reliably converge to a "good" prior after a certain amount of forced exploration, then a pure learning phase would be useful, but nobody has proposed such a prior, AFAIK.

In the post you linked to, at the end you mention a proposed "fetus" stage where the agent receives no external inputs. Did you ever write the posts describing it in more detail? I have to say my initial reaction to that idea is also skeptical, though. Humans don't have a fetus stage where we think/learn about math with external inputs deliberately blocked off. Why do artificial agents need it? If an agent couldn't simultaneously learn about math and process external inputs, it seems like something must be wrong with the basic design, which we should fix instead of work around.

I didn't develop the idea, and I'm still not sure whether it's correct. I'm planning to get back to these questions once I'm ready to use the theory of optimal predictors to put everything on rigorous footing. So I'm not sure we really need to block the external inputs. However, note that the AI is in a sense more fragile than a human since the AI is capable of self-modifying in irreversible damaging ways.

There is no such thing as an "actual" or "right" or "correct" prior. A lot of the arguments for frequentist statistical methods were that Bayesians require a subjective prior, and there is no way to make priors not subjective.

What would it even mean for there to be a universal prior? You only exist in this one universe. How good a prior is, is simply how much probability it assigns to this universe. You could try to find a prior empirically, by testing different priors and seeing how well they fit the data. But then you still need a prior over those priors.

But we can still pick a *reasonable* prior, like a distribution over all possible LISP programs weighted towards simplicity. If you use this as your prior over priors, then any crazy prior you can think of should have some probability, enough that a little evidence should cause it to become favored.
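A minimal sketch of such a simplicity-weighted prior, using bitstrings as stand-ins for LISP programs (the cutoff length and the 2^-length weighting are arbitrary illustrative choices):

```python
from itertools import product

# Toy simplicity prior: enumerate all bitstring "programs" up to length 6 and
# weight each one proportionally to 2**(-length), then normalize.
programs = [''.join(bits) for n in range(1, 7)
            for bits in product('01', repeat=n)]
weights = {p: 2.0 ** -len(p) for p in programs}
z = sum(weights.values())
prior = {p: w / z for p, w in weights.items()}

# Every program, however "crazy", gets nonzero weight, but short ones dominate:
print(prior['0'] / prior['010101'])   # 32.0
```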

What would it even mean for there to be a universal prior?

I have a post that may better explain what I am looking for.

You only exist in this one universe. How good a prior is, is simply how much probability it assigns to this universe.

This seems to fall under position 1 or 2 in my post. Currently my credence is mostly distributed between positions 3 and 4 in that post. Reading it may give you a better idea of where I'm coming from.

Position 1 or 2 is correct. 3 isn't coherent: what is "reality fluid", and how can things be more "real" than other things? Where do subjective beliefs come from in this model? 4 has nothing to do with probability theory. Values and utility functions don't enter into it. Probability theory is about making predictions and doing statistics, not about how much you care about different worlds which may or may not actually exist.

I interpret probability as expectation. I want to make predictions about things. I want to maximize the probability I assign to the correct outcomes. If I multiply all the predictions I ever made together, I want that number to be as high as possible (predictions of the correct outcome, that is). That would be the probability I gave to the world, or at least to my observations of it.

So then it doesn't really matter what the numbers represent. Just that I want them to be as high as possible. When I make decisions based on the numbers using some decision theory/algorithm and utility function, the higher the numbers are, the better my results will be.

I'm reminded of someone's attempt to explain probability without using words like "likely", "certain" or "frequency", etc. It was basically an impossible task. If I was going to attempt that, I would say something like the previous two paragraphs. Saying things like "weights", "reality fluid", "measure", "possible world", etc, just pushes the meaning elsewhere.

In any case, all of your definitions should be mathematically equivalent. They *might* have philosophical implications, but they should all produce the same results on any real world problems. Or at least I think they should. You aren't disputing Bayes theorem or standard probability theory or anything?

In that case the choice of prior should have the same consequences. And you still want to choose the prior that you think will assign the actual outcome the highest probability.

then an AIXI that follows one prior can be arbitrarily stupid with respect to another.

Yet another application of David Wolpert's No Free Lunch theorems.

We have dubbed the associated results NFL theorems because they demonstrate that if an algorithm performs well on a certain class of problems then it necessarily pays for that with degraded performance on the set of all remaining problems.

NFL works with algorithms operating on finite problems. With algorithms operating on unbounded problems, you can benefit from Blum's speedup theorem: for every algorithm and every computable measure of performance, there's a second algorithm performing better than the first on *almost* all inputs.

I suspect something similar is happening here: AIXI is finitely biasable, and there are environments that can exploit that to arbitrarily constrain the agent's behaviour. If the analogy holds, there's then a class of environments for which AIXI, however finitely biased, is still optimally intelligent.

There's another real point - focusing on your prior as "fit" to the problem/universe.

The space of possible priors Wolpert considered were very unlike our experience - basically imposing no topological smoothness on points - every point is a ball from the urn of possible balls. That's just not the way it is. Choosing your prior, and exploiting the properties of your prior then becomes the way to advance.

Uncomputable AIXI can be approximated almost arbitrarily well by computable versions. And the general problem is that "Hell" is possible in any world - take a computable version of AIXI in our world, and give it a prior that causes it to never do anything...

And surely any intelligence has to have priors and can run into the exact same problem?

This means that "pick a complexity prior" does not solve the problem of priors for active agents (though it does for passive agents) because which complexity prior we pick matters.

I know that the uncomputable AIXI assigns zero probability to its own existence - would a computable version be able to acknowledge its own existence? If not, would this cause problems involving being unable to self-modify, avoid damage, negotiate etc?

This means that "pick a complexity prior" does not solve the problem of priors for active agents (though it does for passive agents) because which complexity prior we pick matters.

Is this similar to being vulnerable to Pascal's mugging? Would programming AIXI to ignore probabilities less than, say, 10^-9, help?

See here for approaches that can deal with the AIXI existence issue: http://link.springer.com/chapter/10.1007/978-3-319-21365-1_7

Also, the problem is the prior, in that a poor choice raises the likelihood of a particular world. Ignoring low probabilities doesn't help, because that world will have a weirdly high probability; we need a principled way of choosing the prior.

It seems that "just pick a random language (eg C++), without adding any specific weirdness" should work to avoid the problem - but we just don't know at this point.

See here for approaches that can deal with the AIXI existence issue:

I can't read past the abstract, but I'd find this more reassuring if it didn't require Turing oracles.

It seems that "just pick a random language (eg C++), without adding any specific weirdness" should work to avoid the problem - but we just don't know at this point.

My understanding is that functional languages have properties which would be useful for this sort of thing, but anyway I agree, my instincts are that while this problem might exist, you would only actually run into it if using a language specifically designed to create this problem.

I can't read past the abstract, but I'd find this more reassuring if it didn't require Turing oracles.

That's the first step.

my instincts are that while this problem might exist, you would only actually run into it if using a language specifically designed to create this problem.

My instincts agree with your instincts, but that's not a proof... A bit more analysis would be useful.

my instincts are that while this problem might exist, you would only actually run into it if using a language specifically designed to create this problem.

I think it's actually worse. If I understand correctly, corollary 14 implies that for any choice of the programming language, there exist some mixtures of environments which exhibit that problem. This means that if the environment is chosen adversarially, even by a computable adversary, AIXI is screwed.

So essentially the AIXI will avoid experiments where it has high *prior probability* that the punishment *could* be astronomical (greater than any benefit gained by learning). And, never doing experiments in that area, it cannot update.

If I imagine the same with humans, it seems like both a good and a bad thing. Good: it would make a human unlikely to experiment with suicide. Bad: it would make a human unlikely to experiment with abandoning religion, or with doing anything else covered by a scary taboo.

Or perhaps an analogy would be running LHC-like experiments which (some people believe) have a tiny chance of destroying the universe. Maybe the chance is extremely small, but if we keep doing more and more extreme experiments, it seems like only a question of time until we find something that "works". On the other hand, this analogy has the weakness that we can make educated guesses about the laws of physics by doing things *other* than potentially universe-destroying experiments, while in the example in the article, the AIXI has no other source to learn from.

I have described essentially the same problem about a year ago, only in the framework of the updateless intelligence metric, which is more sophisticated than AIXI. I have also proposed a solution, albeit without an optimality proof. Hopefully such a proof will become possible once I make the updateless intelligence metric rigorous using the formalism of optimal predictors.

The details may change but I think that something in the spirit of that proposal has to be used. The AI's subhuman intelligence growth phase has to be spent in a mode with frequentism-style optimality guarantees while in the superhuman phase it will switch to Bayesian optimization.

Thanks, this is an important result showing that the dominating property really isn't enough to pick out a prior for a good agent. I like your example as a to-the-point explanation of the issue.

I think the post title is somewhat misleading, though: it sounds as though differences in instantiations of AIXI don't really matter, and they can all be arbitrarily stupid. Any chance of changing that? Perhaps to something like "Versions of AIXI can be arbitrarily stupid"?

I think this shows how the whole "language independent up to a constant" thing is basically just a massive cop-out. It's very clever for demonstrating that complexity is a real, definable thing, with properties which at least transcend representation in the infinite limit. But as you show it's useless for doing anything practical.

My personal view is that there's a true universal measure of complexity which AIXI ought to be using, and which wouldn't have these problems. It may well be unknowable, but AIXI is intractable anyway so what's the difference? In my opinion, this complexity measure could give a real, numeric answer to seemingly stupid questions like "You see a number. How likely is it that the number is 1 (given no other information)?". Or it could tell us that 16 is actually less complex than, say, 13. I mean really, it's 2^2^2, spurning even a need for brackets. I'm almost certain it would show up in real life more often than 13, and yet who can even show me a non-contrived language or machine in which it's simpler?

Incidentally, the "hell" scenario you describe isn't as unlikely as it at first sounds. I remember an article here a while back lamenting the fact that, left unmonitored, AIXI could easily kill itself with exploration, the result of which would have a very similar reward profile to what you describe as "hell". It seems like it's both too cautious and not cautious enough in even just this one scenario.

Yes. The problem is not the Hell scenarios, the problem is that we can make them artificially probable via language choice.

I think this shows how the whole "language independent up to a constant" thing is basically just a massive cop-out.

Some results are still true. An exploring agent (if it survives) will converge on the right environment, independent of language. And episodic environments do allow AIXI to converge on optimal behaviour (as long as the discount rate is gradually raised).

An exploring agent (if it survives) will converge on the right environment, independent of language.

But it seems like such an agent could only survive in an environment where it literally can't die, i.e., there is nothing it can do that can possibly cause death, since in order to converge on the right environment, independent of language, it has to try all possible courses of action as time goes to infinity and eventually it will do something that kills itself.

What value (either practical or philosophical, as opposed to purely mathematical), if any, do you see in this result, or in the result about episodic environments?

My argument is that "(if it survives) will converge on the right environment, independent of language" is *not* a property we want in an FAI, because that implies it will try every possible course of action at some point, including actions that with high probability kill itself or worse (e.g., destroy the universe). Instead, it seems to me what we need is a standard EU maximizing agent that just uses a better prior than merely "universal", so that it explores (and avoids exploring) in ways that we'd think reasonable. Sorry if I didn't make that fully explicit or clear. If you still think "an AIXI-like agent that balances exploration and exploitation could be what is needed", can you please elaborate?

We have the universal explorer - it will figure out everything, if it survives, but it'll almost certainly kill itself.

We have the bad AIXI model above - it will survive for a long time, but is trapped in a bad epistemic state.

What would be ideal would be a way of establishing the minimal required exploration rate.

What would be ideal would be a way of establishing the minimal required exploration rate.

Do you mean a way of establishing this independent of the prior, i.e., the agent will explore at some minimum rate regardless of what prior we give it? I don't think that can be right, since the correct amount of exploration must depend on the prior. (By giving AIXI a different bad prior, we can make it explore too much instead of too little.) For example suppose there are physics theories P1 and P2 that are compatible with all observations so far, and an experiment is proposed to distinguish between them, but the experiment will destroy the universe if P1 is true. Whether or not we should do this experiment must depend on what the correct prior is, right? On the other hand, if we had the correct prior, we wouldn't need a "minimal required exploration rate". The agent would just explore/exploit optimally according to the prior.
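A toy sketch of that decision (all payoffs invented): whether running the experiment maximizes expected value flips purely as a function of the prior probability assigned to P1.

```python
# Illustrative expected-value comparison for the P1/P2 experiment above.
# All payoffs are invented; this is only a sketch of the prior-dependence.
def run_experiment_ev(p1, v_universe=1.0, v_knowledge=0.1):
    # If P1 is true, the experiment destroys the universe (value 0);
    # if P2 is true, we keep the universe and gain some knowledge.
    return p1 * 0.0 + (1 - p1) * (v_universe + v_knowledge)

def skip_experiment_ev(p1, v_universe=1.0):
    return v_universe   # keep the universe either way, learn nothing

for p1 in (0.5, 0.01):
    print(p1, run_experiment_ev(p1) > skip_experiment_ev(p1))
    # prints: 0.5 False, then 0.01 True
```

The same experiment is forbidden under one prior and mandatory under another, which is why a "minimal exploration rate" can't be fixed independently of the prior.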

What value (either practical or philosophical, as opposed to purely mathematical), if any, do you see in this result, or in the result about episodic environments?

There are plenty of applications of reinforcement learning where it is plausible to assume that the environment is ergodic (that is, the agent can't "die" or fall into traps that permanently result in low rewards) or episodic. The Google DQN Atari game agent, for instance, operates in an episodic environment, therefore, stochastic action selection is acceptable.

Of course, this is not suitable for an AGI operating in an unconstrained physical environment.

So if you give an agent a bad prior, it can make bad decisions. This is not a new insight.

Low probability hypotheses predicting vast rewards/punishments seem equivalent to Pascal's mugging. Any agent that maximizes expected utility will spend increasing amounts of resources worrying about more and more unlikely hypotheses. In the limit, it will spend all of its time and energy caring about a single random hypothesis which predicts infinite reward (like your examples), even if it has vanishingly small probability.
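A toy illustration of that failure mode (both hypotheses and all numbers are invented): a single wildly improbable hypothesis with a large enough promised reward dominates the expected-utility sum.

```python
# Each hypothesis: (name, probability, promised utility). Numbers invented.
hypotheses = [
    ("mundane", 0.999, 10.0),
    ("weird",   1e-9,  1e12),   # near-zero probability, vast promised reward
]
contributions = {name: p * u for name, p, u in hypotheses}

# The "weird" hypothesis contributes ~1000 to expected utility vs ~10 for the
# mundane one, so it dominates the agent's attention.
dominant = max(contributions, key=contributions.get)
print(dominant)   # weird
```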

I've argued in the past that maximizing expected utility should be abandoned. I may not have the perfect alternative, and alternatives may be somewhat ad hoc. But that's better than just ignoring the problem.

AIXI is still optimal at doing what you told it to do. It's maximizing its expected reward, given the prior you give it. It's just that what you told it to do isn't what you really want. But we already knew that.

Oh, one interesting thing is that your example does appear similar to real life. If you die, you get stuck in a state where you don't receive any more rewards. I think this is actually a desirable thing and solves the anvil problem. I've suggested this solution in the past.

Many people (including me) had the impression that AIXI was ideally smart. Sure, it was uncomputable, and there might be "up to finite constant" issues (as with anything involving Kolmogorov complexity), but it was, informally at least, "the best intelligent agent out there". This was reinforced by Pareto-optimality results, namely that there was no computable policy that performed at least as well as AIXI in all environments, and strictly better in at least one.

However, Jan Leike and Marcus Hutter have proved that AIXI can be, in some sense, arbitrarily bad. The problem is that AIXI is not fully specified, because the universal prior is not fully specified. It depends on a choice of an initial computing language (or, equivalently, of an initial Turing machine).

For the universal prior, this will only affect it up to a constant (though this constant could be arbitrarily large). However, for the agent AIXI, it could force it into continually bad behaviour that never ends.

For illustration, imagine that there are two possible environments:

- Heaven: outputting "0" gives reward ε; outputting anything else gives the maximal reward 1 forever.
- Hell: outputting "0" gives reward ε; outputting anything else gives reward 0 forever.
Now simply choose a language/Turing machine such that the ratio P(Hell)/P(Heaven) is higher than the ratio 1/ε. In that case, for any discount rate, the AIXI will always output "0", and thus will never learn whether it's in Hell or not (because it's too risky to find out). It will observe the environment giving reward ε after receiving "0", behaviour which is compatible with both Heaven and Hell. This keeps P(Hell)/P(Heaven) constant and ensures the AIXI never does anything else.
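The inequality can be checked with toy numbers (undiscounted averages, assuming deviating yields reward 1 forever in Heaven and 0 forever in Hell, with ε and the prior ratio chosen by hand):

```python
# Numeric check of the Heaven/Hell example with hand-picked toy values.
eps = 0.01
p_hell, p_heaven = 0.999, 0.001   # P(Hell)/P(Heaven) = 999 > 1/eps = 100

# Average reward of always outputting "0": eps in both environments.
v_always_zero = eps

# Average reward of deviating: 1 forever in Heaven, 0 forever in Hell.
v_deviate = p_heaven * 1.0 + p_hell * 0.0

print(v_always_zero > v_deviate)   # True: the AIXI never risks deviating
```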

In fact, it's worse than this. If you use the prior to measure intelligence, then an AIXI that follows one prior can be arbitrarily stupid with respect to another.