Frequentist Magic vs. Bayesian Magic

Wei Dai

58 Frequentist Magic vs. Bayesian Magic

8th Apr 2010

4 min read

58

[I posted this to open thread a few days ago for review. I've only made some minor editorial changes since then, so no need to read it again if you've already read the draft.]

This is a belated reply to cousin_it's 2009 post Bayesian Flame, which claimed that frequentists can give calibrated estimates for unknown parameters without using priors:

And here's an ultra-short example of what frequentists can do: estimate 100 independent unknown parameters from 100 different sample data sets and have 90 of the estimates turn out to be true to fact afterward. Like, fo'real. Always 90% in the long run, truly, irrevocably and forever.

And indeed they can. Here's the simplest example that I can think of that illustrates the spirit of frequentism:

Suppose there is a machine that produces biased coins. You don't know how the machine works, except that each coin it produces is either biased towards heads (in which case each toss of the coin will land heads with probability .9 and tails with probability .1) or towards tails (in which case each toss of the coin will land tails with probability .9 and heads with probability .1). For each coin, you get to observe one toss, and then have to state whether you think it's biased towards heads or tails, and what is the probability that's the right answer.

Let's say that you decide to follow this rule: after observing heads, always answer "the coin is biased towards heads with probability .9" and after observing tails, always answer "the coin is biased towards tails with probability .9". Do this for a while, and it will turn out that 90% of the time you are right about which way the coin is biased, no matter how the machine actually works. The machine might always produce coins biased towards heads, or always towards tails, or decide based on the digits of pi, and it wouldn't matter—you'll still be right 90% of the time. (To verify this, notice that in the long run you will answer "heads" for 90% of the coins actually biased towards heads, and "tails" for 90% of the coins actually biased towards tails.) No priors needed! Magic!

What is going on here? There are a couple of things we could say. One was mentioned by Eliezer in a comment:

It's not perfectly reliable. They assume they have perfect information about experimental setups and likelihood ratios. (Where does this perfect knowledge come from? Can Bayesians get their priors from the same source?)

In this example, the "perfect information about experimental setups and likelihood ratios" is the information that a biased coin will land the way it's biased with probability .9. I think this is a valid criticism, but it's not complete. There are perhaps many situations where we have much better information about experimental setups and likelihood ratios than about the mechanism that determines the unknown parameter we're trying to estimate. This criticism leaves open the question of whether it would make sense to give up Bayesianism for frequentism in those situations.

The other thing we could say is that while the frequentist in this example appears to be perfectly calibrated, he or she is liable to pay a heavy cost for this in accuracy. For example, suppose the machine is actually set up to always produce head-biased coins. After observing the coin tosses for a while, a typical intelligent person, just applying common sense, would notice that 90% of the tosses come up heads, and infer that perhaps all the coins are biased towards heads. They would become more certain of this with time, and adjust their answers accordingly. But the frequentist would not (or isn't supposed to) notice this. He or she would answer "the coin is head-biased with probability .9" 90% of the time, and "the coin is tail-biased with probability .9" 10% of the time, and keep doing this, irrevocably and forever.

The frequentist magic turns out to be weaker than it first appeared. What about the Bayesian solution to this problem? Well, we know that it must involve a prior, so the only question is which one. The maximum entropy prior that is consistent with the information given in the problem statement is to assign each coin an independent probability of .5 of being head-biased, and .5 of being tail-biased. It turns out that a Bayesian using this prior will give the exact same answers as the frequentist, so this is also an example of a "matching prior". (To verify: P(biased heads | observed heads) = P(OH|BH)*P(BH)/P(OH) = .9*.5/.5 = .9)

But a Bayesian can do much better. A Bayesian can use a universal prior. (With a universal prior based on a universal Turing machine, the prior probability that the first 4 coins will be biased "heads, heads, tails, tails" is the probability that the UTM will produce 1100 as the first 4 bits of its output, when given a uniformly random input tape.) Using such a prior guarantees that no matter how the coin-producing machine works, as long as it doesn't involve some kind of uncomputable physics, in the long run your expected total Bayes score will be no worse than someone who knows exactly how the machine works, except by a constant (that's determined by the algorithmic complexity of the machine). And unless the machine actually settles into deciding the bias of each coin independently with 50/50 probabilities, your expected Bayes score will also be better than the frequentist (or a Bayesian using the matching prior) by an unbounded margin as time goes to infinity.

I consider this magic also, because I don't really understand why it works. Is our prior actually a universal prior, or is the universal prior just a handy approximation that we can substitute in place of the real prior? Why does the universe that we live in look like a giant computer? What about uncomputable physics? Just what are priors, anyway? These are some of the questions that I'm still confused about.

But as long as we're choosing between different magics, why not pick the stronger one?

Bayes' TheoremProbability & Statistics

Frontpage

58

Mentioned in

32(Subjective Bayesianism vs. Frequentism) VS. Formalism

New Comment

83 comments, sorted by

top scoring

Click to highlight new comments since: Today at 2:42 AM

Some comments are truncated due to high volume. (⌘F to expand all)Change truncation settings

[-]Toby_Ord15y110

We can't use the universal prior in practice unless physics contains harnessable non-recursive processes. However, this is exactly the situation in which the universal prior doesn't always work. Thus, one source of the 'magic' is through allowing us to have access to higher levels of computation than the phenomena we are predicting (and to be certain of this).

Also, the constants involved could be terrible and there are no guarantees about this (not even probabilistic ones). It is nice to reach some ratio in the limit, but if your first Graham's number of guesses are bad, then that is very bad for (almost) all purposes.

3Wei Dai15y

My position is that the uncomputabiilty of the universal prior shouldn't count against it. I think the fact that it works so well shows that our actual prior is likely also uncomputable, and that means we have to handle uncomputable priors in our decision theory, for example by specifying that we choose the option that we can prove (or just heuristically believe) has the highest expected payoff, instead of actually computing the expected payoffs. A worse problem is that there seems to be reason to think that our actual prior is not just uncomputable, but unformalizable. See my earlier posts on this.

6Vladimir_Nesov15y

If you make ontological claims, you are bound to get in trouble. Decision theory should speak of what the agent's algorithm should do, in terms of its behavior, not what it means for the agent to do that in terms of consequences in the real world. What the agent's algorithm does is always formalizable (as that algorithm!). (For people unfamiliar with the discussion -- see "ontology problem" in this sequence.)

2Wei Dai15y

It seems that we do tend to get into trouble when we make ontological claims, but why "bound to"? Your proposed FAI, after it has extracted human values, will still have to solve the ontological problem, right? If it can, then why can't we? You advocate "being lazy" as FAI programmers and handing off as many problems as we can to the FAI, but I'm still skeptical that any FAI approach will succeed in the near future, and in the mean time, I'd like to try to better understand what my own values are, and how I should make decisions.

3Vladimir_Nesov15y

I don't believe even superintelligence can solve the ontology problem completely. A fine goal, but I doubt it can contribute to FAI design (which, even it'll take more than a century to finish, still has to be tackled to make that possible). Am I right in thinking that you agree with that?

2Wei Dai14y

Why? I'm not sure what you're referring to by "that" here. Do you mean "preserving our preferences"? Assuming you do... No, I think we have at least two disagreements here: 1. If I can figure out what my own values are, there seem to be several ways that could contribute to FAI design. The simplest way is that I program those values into the FAI manually. 2. I don't think FAI is necessarily the best way to preserve our values. I can, for example, upload myself, and then carefully increase my intelligence. As you've mentioned in previous comments, this is bound to cause value drift, but such drift may be acceptable, compared to the risks involved in implementing de novo FAI. My guess is that the root cause of these disagreements is my distrust of human math and software engineering abilities, stemming from my experiences in the crypto field. I think there is a good chance that we (unenhanced biological humans) will never find the correct FAI theory, and that in the event we think we've found the right FAI theory, it will turn out that we're mistaken. And even if we manage to get FAI theory right, it's almost certain that the actual AI code will be riddled with bugs. You seem to be less concerned with these risks.

0Jess_Riedel15y

Could you suggest a source for further reading on this?

[-]Roko15y90

The Bayesian using the universal prior is cheating. Kolmogorov complexity is incomputable. So you have to use a realistic compressor. Then the optimality results go away. But, in practice if your compressor is good, it's almost always going to outdo the frequentist.

3cousin_it15y

Yep, the universal prior should be filed under "science fiction". And you know why it's uncomputable? Because for any computable prior I can create a coin-producing machine that will anticipate and thwart the Bayesian's expectations :-) The frequentist approach is analogous to a minimax strategy in a game: no matter how malicious the universe is, you still get your 90%. Other, non-minimax strategies inevitably take risks to try and exploit stupid opponents. This connection was formalized by Abraham Wald, who was mentioned by Cyan in the two posts that started this whole affair. Replying to Wei Dai's post, I still don't know which magic is the stronger one or what "stronger" actually means here. In actual statistical practice, choosing good priors clearly requires skills and techniques that aren't part of the naive Bayesian canon. This creates a doubt in my mind that just won't go away.

8wedrifid15y

* If 'the universe is malicious' is the right prior to use then a Bayesian will use it. * Using a universal prior on a malicious universe can only give a score that differs by a constant from the frequentist over infinite time. * Any sane computable approximation of the universal prior would necessarily include a weight on the hypothesis 'the universe is more evil/complex than than can be represented with algorithms of the length that this approximation handles'. * If the universe is malicious you probably shouldn't be doing statistics on it. This is true. That is when it is time to use frequentist approximations.

0AlexMennen15y

Frequentist techniques can be useful in certain situations, but only because of the great difficulty in assigning accurate priors and the fact that we often have such overwhelmingly large amounts of evidence that any hypothesis with a substancial prior will probably have a posterior either near-zero or near-one if proper Bayesian reasoning is used. In those situations, judging the probability of a hypothesis by its p-values saves a lot of complicated work and is almost as good. However, when there is not overwhelming evidence or when the evidence points to a hypothesis with a negligible prior, frequentist statistics ceases to provide an adequate approximation. Always remember to think like a Bayesian, even when using frequentist methods. Also, this was a terrible example of frequentist statistics avoiding the use of priors, because the prior was that the probability that a coin would land heads anything other than 90% or 10% of the time is zero. This is a tremendous assertion, and refusing to specify the probabilities of 90% heads vs. 10% heads just makes the prior incomplete. Saying that there is no prior is like saying that I did not make this post just because the last sentence lacks a period

-4cousin_it15y

It has wings that are way too big for a mammal, and it lays eggs! Which just makes it an incomplete mammal.

3wnoise15y

It's really more than that. Consider the great anti-Bayesian Cosma Shalizi. He's shown that the use of a prior is really equivalent to a method of smoothing, of regularization on your hypothesis space, trading off (frequentist) bias and variance. And everyone has a regularization scheme, even if they claim it is "don't: 'let the data speak for themselves'". Your assertion that the machine is exactly 90% biased is exactly equivalent to an evenly-split point-mass prior at 10% and 90%. A Bayesian with that prior would exactly reproduce your reasoning. You're absolutely right that it is in no way an incomplete prior -- it exactly specifies everything. One could consider it an overconfident prior, but if you do in fact know that it's either 10% or 90%, and have no idea which it's perfectly appropriate. The choice of which Frequentist statistical techniques to use is pretty much isomorphic to the choice of Bayesian priors. EDIT: On the bias/variance trade-off, I just realized that the Frequentist prescription for binary frequency estimation s/(s+f) is, rather than a maximum entropy prior, a maximum variance prior. (To be specific it is an improper prior, the limiting case of the beta distribution, with both parameters going to zero. Although this has support everywhere in [0,1], if you try to normalize it, the integral going to infinity kills everything except delta spikes at 0 and 1. But normalizing after conditioning on data gives a reasonable prior, with the maximum likelihood estimate being the standard estimate.)

1[anonymous]9y

Cosma's writings for those interested

0Daniel_Burfoot15y

It seems odd to interpret this point as anti-Bayesian. To me it seems pro-Bayesian: it means that whenever you use a regularizer you're actually doing Bayesian inference. Any method that depends on a regularizer is open to the same critique of subjectivity to which Bayesian methods are vulnerable. Two frequentists using different regularizers will come to different conclusions based on the same evidence, and the choice of a regularizer is hardly inevitable or dictated by the problem. If you have a link to a paper that contains anti-Bayesian arguments by Shalizi, I would be interested in reading it.

0wnoise15y

Well, it seems odd to me too. He has another rant up comparing Bayesian updating to evolution saying "okay, that's why Bayesian updating seems to actually work OK in many cases", whereas I see that as explaining why evolution works... http://cscs.umich.edu/~crshalizi/weblog/cat_bayes.html Is a good start. He also has a paper on the arXiv that is flat-out wrong, so ignore "The Backwards Arrow of Time of the Coherently Bayesian Statistical Mechanic", though showing how it goes wrong takes a fair bit of explaining of fairly subtle points.

2cupholder15y

I've tried reading it before - for me to understand just the paper itself would also take a fair bit of explaining of fairly subtle points! I understand Shalizi's sketch of his argument in words: My problem is that I know a plausible-looking argument expressed in words can still quite easily be utterly wrong in some subtle way, so I don't know how much credence to give Shalizi's argument.

6JGWeissman15y

The problem with the quoted argument from Shalizi is that it is describing a decrease in entropy over time of an open system. To track a closed system, you have to include the brain that is making observations and updating its beliefs. Making the observations requires thermodynamic work that can transfer entropy.

1cupholder15y

D'oh! Why didn't I think of that?!

2Cyan15y

If you write such a post, I'll almost certainly upvote it.

0Cyan15y

Just Google him -- his website is full of tons of interesting stuff.

6JGWeissman15y

This is because you constrained the universe to only being able to present with you sequences of one of two possible values, both of which reveal themselves on first inspection 90% of the time. Let the universe throw at you a machine that, for all you know, can produce any distribution or pattern of coins with any bias. Try to get your guaranteed 90% making predictions on the bias of those coins from a single flip.

0[anonymous]15y

I didn't constrain the universe, Wei Dai did. You wanna talk about another problem, fine. I assume you have some secret Bayesian technique for solving it?

[-]cupholder15y40

For example, suppose the machine is actually set up to always produce head-biased coins. After observing the coin tosses for a while, a typical intelligent person, just applying common sense, would notice that 90% of the tosses come up heads, and infer that perhaps all the coins are biased towards heads. They would become more certain of this with time, and adjust their answers accordingly. But the frequentist would not (or isn't supposed to) notice this.

They wouldn't (they aren't)?

He or she would answer "the coin is head-biased with probability

... (read more)

[-]RolfAndreassen15y20

I don't understand why your universal prior passes the tape through the Turing machine. If you have a source of genuinely random tape inputs in the first place, why not just use the probability that a tape has the first four bits '1100'? And in any case, what's the advantage of invoking Turing machinery in place of the good old fair coins?

6jimrandomh15y

The trick, and the reason for using Turing machines, is that '0000' and '1111' ought to be considered more likely than '0101' and '1010', because they're simpler; and the way this model represent simplicity, a sequence is simple if it's the output of short programs (where 'program' in this case means 'prefix to a Turing machine tape'). If you used the tape directly, then '0000000000' and '0100011011' would be equally likely, which, generally speaking, they're not.

5gjm15y

You get completely different results by doing that. The universal prior gives higher probability to simpler hypotheses -- 0000000000000000 is more likely than 0010111001001101, or at least analogous things are true once the strings get long enough. The idea is that if there's any computable regularity in what you're trying to model, as you get more data you'll recognize that regularity (i.e., your posterior will give much higher probability to futures in which the regularity continues than to ones in which it doesn't).

1RolfAndreassen15y

Ah so, I understand. But in that case, doesn't evaluating your prior require you to integrate over the space of all possible Turing machines? There is no conceptual difficulty with this, but I must say that it strikes me as computationally rather formidable for a poxy little throw-four-possibly-biased-coins problem. :)

0gjm15y

It's not just computationally formidable, it's computationally impossible. :-)

2gwern15y

My understanding was that it's perfectly possible to solve the Halting Problem for particular TMs. For a TM of length n, you begin running it and either it terminates at some point, or it runs for BusyBeaver(n)+1 steps, in which case you know it doesn't ever halt. So as long as you have a fixed length n that all possible TMs fit into and know the Busy Beaver for n, you can in fact use the universal prior. (Busy Beavers are uncomputable in general, though we know a fair number, but this is an easy condition to weaken down to something practical like high probability of being the Busy Beaver.)

1gjm15y

1. The fact that BB(n) is uncomputable means that your procedure doesn't, in fact, work. No? 2. You don't, in any case, have a fixed n. (Unless perhaps you are only interested in modelling Turing machines that fit in the universe -- but in that case, wouldn't you also want your own computations to fit in the universe?)

1gwern15y

1. Finding BB(n) for all n is uncomputable. Finding BB(n) where n=10, say, is not uncomputable. 2. Yeah, this is an interesting point. My thinking is that yes, we do only care about modeling Turing machines that fit in the universe, since by definition those are the only Turing machines whose output we will ever observe. We might also have to bite the bullet and say we only care about predicting Turing machines which are small enough that there is room left in the universe for us; we don't want to perfectly model the entire universe, just small parts.

3gjm15y

1. If it's uncomputable in general then it's uncomputable for some particular n. I expect that in fact it's uncomputable for any n above some rather small value. (In particular, once n is large enough for there to be n-state Turing machines that unprovably don't always halt.) 2. Well, if you only care about predicting smallish Turing machines, then for the same reason prediction is only useful if it can be done with smallish machines, in reasonable time. And even for very small n this isn't true for calculating BB(n).

1RobinZ15y

Just out of my own curiosity, I checked Wikipedia on Busy Beaver numbers: BB(10) > 3^^^3. While the article admits that every finite sequence (such as n = 1...10) is computable, I think it is safe to call it intractable right now.

3gwern15y

Hey, don't laugh. That's useful to know - now you know any Turing Machine less than 10 is bluffing when it threatens 3^^^^3 specks of sand in eyes.

2thomblake15y

Off the cuff, couldn't it actually accomplish that if it just doesn't terminate?

1gwern15y

If it doesn't terminate, then I think its threat would be an infinite number of sand specks, wouldn't it?

0JGWeissman15y

How did you conclude that from a lower bound?

3gwern15y

I don't think the lower-bound will turn out to be all that off, and I tossed in another ^ as insurance. Jests don't have to be air-tight.

0RolfAndreassen15y

I don't think so, provided you limit yourself to Turing machines constructed of at most a universe's worth of energy and have unlimited time for the computation. However, that said, invoking such a heavy-duty prior seems to me a bit of a cop-out; you might almost as well invoke "My prior is what Omega told me". For choosing between Bayesian and frequentist approaches regarding actual problems in the real brain space available to humans, I prefer priors that can be computed in something like a human lifetime. To show that a Bayesian with a prior that would take multiple billion years to compute can do better than a frequentist is not really a very strong result.

[-]MendelSchmiedekamp15y20

What I keep coming to here is, doesn't the entire point of this post come to the situations where the parameters in question, the bias of the coins, are not independent? And doesn't this contradict?

estimate 100 independent unknown parameters

Which leads me to read the later half of this post as, we can (in principle, perhaps not computably) estimate 1 complex parameter with 100 data sets better than 100 independent unknown parameters from individual data sets. This shouldn't be surprising. I certainly don't find it as such.

The first half just points ou... (read more)

[-]CronoDAS15y20

How hard is it to "crack" a pseudorandom number generator? Say, one used to generate cards in an online poker game?

[-]wmorgan15y140

In the very early days of online poker (late 90's), a group of people did exactly that.

A number of factors combined to make it possible:

The site published the code for its shuffler, supposedly to demonstrate its competence.
The RNG was seeded by the number of milliseconds since midnight, a range laughably smaller than the range of possible poker hands.
Online poker was played for sufficiently low stakes that nobody thought either of the above would be a real problem.

Nowadays, sites keep their algorithms secret, and the even distribution of hands is constantly being verified by third parties with large data sets. They also harvest large sources of entropy, from user mouse gestures and physical RNGs. See PokerStars' write-up on security.

0Thomas15y

So, at least in theory, one could a bit manipulate their RNG, sending them some "mouse moving signals back".

0wnoise15y

One would also need to know the other inputs and their hash function in order to control what the manipulation did. Otherwise, it would be "random".

2Thomas15y

Needn't to be a perfect manipulation, to get something already useful.

0Paul Crowley15y

If the hash function is strong, it doesn't help - though as far as I know there's no good formal definition of the properties the hash function has to have.

1Baughn15y

One of the properties a cryptographic hash function should have is indeed that, if you add known plaintext to an unknown plaintext in any way, you still can't predict any property of the output that you couldn't without this - you get no additional information. Of course, we still don't have a clue how to prove this for any interesting hash function. They frequently get broken, because there's basically no mathematical framework behind them.

6Paul Crowley15y

This is a long, long, long way from the truth.

1Oscar_Cunningham15y

Anyone care to post evidence either way?

6Baughn15y

No need to worry, cipher is right. I'm being horribly inaccurate, as usual. Voted him up for correctness. :-) What I should have said is that there are no correctness proofs, of the style you'd get for RSA. No guarantees, in other words. There is plenty of math, just no math that lets you prove they have the properties you want them to have; this is doubly annoying because, historically, they will almost certainly turn out later to not have them. There's plenty of math that guarantees they lack specific vulnerabilities, and far more that attempts to exploit vulnerabilities, just nothing that says there are no vulnerabilities; at least, nothing I've heard of. Different hash functions have different levels of security; for example, most block ciphers can technically be used for hashing, hopefully with less chance of trouble than the faster MD5 or SHA hashes. The fact of the matter is still that, outside of the annoyingly slow RSA family of ciphers, nothing I've heard of in cryptography that's practically usable also has a strong mathematical proof; a problem that is borne out by a long string of theoretical and sometimes practical attacks on the algorithms. It's easy to get disillusioned, but at least we get amusing attack names like "miss-in-the-middle", and SHA-2 is still mostly intact. And lest someone calls me on it, I should probably point out that RSA does not, strictly speaking, have a security proof either. Its security hinges on a single open question, namely whether prime factorization is slow or not; that's a lot better than the attack surface of other ciphers, but not quite no surface at all, and that's even ignoring the existence of Shor's algorithm. If you have a quantum computer, RSA is broken. Well, there are still elliptic curves.

6Paul Crowley15y

There are no such proofs for RSA either. The best you can hope for with public key crypto is a reduction to a problem believed hard; in the case of RSA, this is the "RSA problem" of finding roots modulo semiprimes of unknown factorization (which is not the integer factorization problem). I think in practice we have more faith in the security of our stronger symmetric primitives than we do that the hard problems in public key crypto are hard; this has things backwards. Incidentally, as far as I can tell there's no good reason ever to use RSA; the Rabin family reduce to a strictly harder problem (factorization) and are faster too. MD5 and SHA are built around custom block ciphers. Against quantum computers lots of elliptic curve crypto is also broken. Look up "post-quantum cryptography" for a sampling of the algorithms that are not known to fall; none of the well-known ones.

2Paul Crowley15y

Another word for a strong pseudorandom number generator is a "stream cipher". (This isn't strictly true, but close enough)

2cousin_it15y

Depends on the generator. Obviously hardware ones cannot be cracked, and even some software ones are thought to be cryptographically secure.

0wedrifid15y

A lot harder than it would be to prove that the random number generator is rigged. ;)

[-]Jonathan_Graehl15y10

Thank you for introducing me to the "universal prior".

The model I'd have used in explaining a sequence of such coin flips would indeed assume that the machine generates each coin independently as heads-biased with fixed probability H. But rather than fixing H=1/2, I'd have a prior distribution for H (probably uniform).

How superior would the universal prior be to this? I confess I can't do the math.

4Wei Dai15y

Eliezer also wrote a post about the universal prior, although he didn't use that name. See Occam's Razor.

1Academian15y

Also thanks for introducing me indirectly to AIXI, the universal artificial intelligence. Apparently, for every time and space bounds (t,l), Hutter has defined an algorithm AIXI(t,l) that performs optimally assuming "the environment is computable". Sounds like an awesome model to compare our own rationality against: http://www.hutter1.net/ai/paixi.htm This blog post has some interesting quotes, including one by Eliezer about the fearful unfriendliness of AIXI: http://www.acceleratingfuture.com/michael/blog/2009/12/computable-aixi-should-we-be-afraid/

3SilasBarta15y

Meh. I was excited about it at first, but here's the problem: that claim, at least as you've phrased it, is wrong. It would only be correct if you mean average performance over all computable environments, having an Occamian prior as their only knowledge. But if you consider agents that operate in this computable environment, there are numerous examples that perform better than AIXI-tl. Specifically, all animals do. The reason is that they don't, like AIXI/AIXI-tl, start from tabula rasa, but are born with implicit knowledge about their environments (that is closer to the truth than a mere Occamian prior) and implicit heuristics for exploiting them. That's why Hutter isn't just running AIXI-tl and outfoxing them all. (pardon the pun) More generally, there's the important issue that we shouldn't want an agent that performs optimally, on average, over all computable environments, if that means sacrificing performance in this environment, which it almost certainly does.

1gwern15y

Couldn't we patch this easily? Supply the agent with all prior observations. If it's optimal, it'll make better use of the observations than we or animals would. Or to put it another way: wouldn't AIXI-tl perform optimally on average over all computable environments consistent with our history? That seems desirable. (Interesting thought: evolution can only discover a few bits every generation, and likely our genomes and cellular environments embody many fewer than that upper limit of a few gigabytes; how many observations would we have to supply a new AIXI-tl before it caught up to us?)

0SilasBarta15y

Short answer: we can patch it, but not easily. Longer answer: If we could somehow represent the knowledge we have (both explicit and implicit) in a format that integrates nicely with the way the AIXI-approximating program stores its own knowledge, then we could "bring it up to speed" with where we are, and let it learn from there. The knowledge we give it would make it throw out the spurious solutions it would otherwise have to consider. But representation of our knowledge, especially the important implicit knowledge we never realize we're using, is very hard -- probably AI-complete. So it doesn't reduce the problem. And any knowledge you'd be feeding it would be found by some human way, when finding this knowledge is the very thing you're supposed to be automating! But yes, that would work. I'm not sure, but I don't think so. The requirement that it perform optimally on average (probability weighted) over all computable environments consistent with its history means that it has to pass some of its performance into those other environments, which can still be drastically different. One thing to keep in mind: Though we like to neatly model beliefs, values, and observations as cleanly separated, Nature is under no obligation to respect the difference -- in nature, these are all blended together, such that the three can only be inferred from the overall dynamics. This is why I want to set up some kind of model of an environment that is constrained by (some analog of) the laws of thermodynamics and chemistry so that we can see the development of (phenomena we could describe as) complexity, intelligence, and life, but without explicitly programming any of those into the model.

0gwern15y

Too many restrictions there, I think. The format doesn't have to be nice - any format it doesn't know it will know after a fixed-length penalty. We could just dump in the Internet Archive raw. But those other environments where it performs poorly are the ones that ought to perform poorly: its best performance is reserved for the most likely future histories, just like we would aspire to. We would perform poorly in an anti-Occamian universe just like it would, but we're far from optimal and so would perform worse in other scenarios, I would think. I suppose we could be so biased and incorrect that we luck out and our biases and errors are just right, but is it plausible that we could luck out enough to overcome the general performance difference?

0SilasBarta15y

Alright, the thing I meant by "nice" needs some elaboration. I'll put it this way: For an Ultimate AI, all of its knowledge -- with no exceptions -- is in terms of constraints on what it expects to observe. (And yes, this is what rationalists should strive for too.) So there is no, "light is waves", no "that's the Doppler effect". There are only mappings from inputs to probability distributions on future inputs. (Confusingly, this would also mean an expectation for [phenomena explainable as] humans [generating sound waves explainable by them] saying [something we would recognize as the statement] "light is waves". Phew!) Any large human-generated knowledge base initially appears to AIXI as a long string of characters and/or some input/output black box. What in its inputspace do the characters refer to? What is the appropriate way to group them? Most importantly, after being told that "this stuff is true" or "this is a lot of what goes on in the environment that computes your inputs", how does it know how it maps to the rest of the environment's generating function? (Which I guess is ultimately the same as the first question.) That problem is nearly as intractable as starting from just an Occamian prior. It's only resolved by symbol-grounding, which means representing knowledge in the form of a probability distribution on observations, in a way your program can understand. Which I think brings you back into the AI-complete realm. Okay, you're right, I wasn't keeping the comparison baselines straight. If you could give the program enough knowledge, in a form it understands or could quickly learn, to distinguish this computable environment from substantively different environment (including its location within it, the relevant history, etc.), then yes, it would make better inferences than humans. But the point stands that dumping any kind of knowledge base on a computable version of AIXI won't help you a bit until you've done a lot more of the cognitive labor.

1Academian15y

That's a good point; in-born priors are much better (as priors) for typical human environments, and plenty of environments we'd like to design robots for. But consider things like black holes, the relativity of pretty much everything kinematic except light speed, and almost all of quantum mechanics. These theories stopped seeming like outright paradoxes to me precisely when I eased up and starting assuming only that the universe was probably just computational in some sense, and not necessarily "3D-physical" or "spacial" or "temporal" or whatever. So naturally, I'm pleased to find a well-developing theory of "AIXI", an idealized implementation of this very starting point :)

0[anonymous]15y

s/Eliezer/Michael Anissimov, I assume?

1JGWeissman15y

One way that the universal prior is superior to the one you propose is that it can notice that the machine alternates between producing heads-biased and tails-biased coins. In this comment, I described another prior (still much easier to use that the universal prior) that could detect this pattern, and also noted another behavior the machine can have that my prior can't notice. The advantage of the universal prior is that it assigns non-zero probability to all computable behaviors the machine might have, so that it can concentrate probability into that behavior if the machine exhibits it.

0[anonymous]15y

Far, far superior - Wei Dai did not say that the coins are independently distributed.

0wedrifid15y

A lot, but it is not (particularly) quantifiable. Allowing H to be an unknown constant rather than 0.5 allows for many more possible algorithms, but it is only an iterative step towards 'allow any algorithm'. That, in turn, is not as good as 'allow any algorithm but with a prior distribution based on complexity'.

[-]Teeth15y00

After observing the coin tosses for a while, a typical intelligent person, just applying common sense, would notice that 90% of the tosses come up heads, and infer that perhaps all the coins are biased towards heads. They would become more certain of this with time, and adjust their answers accordingly.

A bayesian could do better if you allowedtot to remember things.

coin lands on heads: P(headsbiased|heads) = .9.5/.5 =.9 P(tailsbiased|heads) = .1.5/.5=.1 another heads: P(headsbiased|heads)=.9.9/(.1.1+.9*.9)=.987 P(tailsbiased|heads)=.0121

0Teeth15y

Oops. I meant that for all possible biases torwards heads, not always or never. Except that only has 3/4 success. I'm just saying comparing it to a human is unfair. Humans use feeble subsets of the universal prior that are easy to imitate.

[+][anonymous]9y-60

[+]gerg15y-70

Moderation Log