We can't use the universal prior in practice unless physics contains harnessable non-recursive processes. However, this is exactly the situation in which the universal prior doesn't always work. Thus, one source of the 'magic' is through allowing us to have access to higher levels of computation than the phenomena we are predicting (and to be certain of this).
Also, the constants involved could be terrible and there are no guarantees about this (not even probabilistic ones). It is nice to reach some ratio in the limit, but if your first Graham's number of guesses are bad, then that is very bad for (almost) all purposes.
The Bayesian using the universal prior is cheating. Kolmogorov complexity is incomputable. So you have to use a realistic compressor. Then the optimality results go away. But, in practice if your compressor is good, it's almost always going to outdo the frequentist.
For example, suppose the machine is actually set up to always produce head-biased coins. After observing the coin tosses for a while, a typical intelligent person, just applying common sense, would notice that 90% of the tosses come up heads, and infer that perhaps all the coins are biased towards heads. They would become more certain of this with time, and adjust their answers accordingly. But the frequentist would not (or isn't supposed to) notice this.
They wouldn't (they aren't)?
...He or she would answer "the coin is head-biased with probability
I don't understand why your universal prior passes the tape through the Turing machine. If you have a source of genuinely random tape inputs in the first place, why not just use the probability that a tape has the first four bits '1100'? And in any case, what's the advantage of invoking Turing machinery in place of the good old fair coins?
What I keep coming to here is, doesn't the entire point of this post come to the situations where the parameters in question, the bias of the coins, are not independent? And doesn't this contradict?
estimate 100 independent unknown parameters
Which leads me to read the later half of this post as, we can (in principle, perhaps not computably) estimate 1 complex parameter with 100 data sets better than 100 independent unknown parameters from individual data sets. This shouldn't be surprising. I certainly don't find it as such.
The first half just points ou...
How hard is it to "crack" a pseudorandom number generator? Say, one used to generate cards in an online poker game?
In the very early days of online poker (late 90's), a group of people did exactly that.
A number of factors combined to make it possible:
Nowadays, sites keep their algorithms secret, and the even distribution of hands is constantly being verified by third parties with large data sets. They also harvest large sources of entropy, from user mouse gestures and physical RNGs. See PokerStars' write-up on security.
Thank you for introducing me to the "universal prior".
The model I'd have used in explaining a sequence of such coin flips would indeed assume that the machine generates each coin independently as heads-biased with fixed probability H. But rather than fixing H=1/2, I'd have a prior distribution for H (probably uniform).
How superior would the universal prior be to this? I confess I can't do the math.
After observing the coin tosses for a while, a typical intelligent person, just applying common sense, would notice that 90% of the tosses come up heads, and infer that perhaps all the coins are biased towards heads. They would become more certain of this with time, and adjust their answers accordingly.
A bayesian could do better if you allowedtot to remember things.
coin lands on heads: P(headsbiased|heads) = .9.5/.5 =.9 P(tailsbiased|heads) = .1.5/.5=.1 another heads: P(headsbiased|heads)=.9.9/(.1.1+.9*.9)=.987 P(tailsbiased|heads)=.0121
[I posted this to open thread a few days ago for review. I've only made some minor editorial changes since then, so no need to read it again if you've already read the draft.]
This is a belated reply to cousin_it's 2009 post Bayesian Flame, which claimed that frequentists can give calibrated estimates for unknown parameters without using priors:
And indeed they can. Here's the simplest example that I can think of that illustrates the spirit of frequentism:
Suppose there is a machine that produces biased coins. You don't know how the machine works, except that each coin it produces is either biased towards heads (in which case each toss of the coin will land heads with probability .9 and tails with probability .1) or towards tails (in which case each toss of the coin will land tails with probability .9 and heads with probability .1). For each coin, you get to observe one toss, and then have to state whether you think it's biased towards heads or tails, and what is the probability that's the right answer.
Let's say that you decide to follow this rule: after observing heads, always answer "the coin is biased towards heads with probability .9" and after observing tails, always answer "the coin is biased towards tails with probability .9". Do this for a while, and it will turn out that 90% of the time you are right about which way the coin is biased, no matter how the machine actually works. The machine might always produce coins biased towards heads, or always towards tails, or decide based on the digits of pi, and it wouldn't matter—you'll still be right 90% of the time. (To verify this, notice that in the long run you will answer "heads" for 90% of the coins actually biased towards heads, and "tails" for 90% of the coins actually biased towards tails.) No priors needed! Magic!
What is going on here? There are a couple of things we could say. One was mentioned by Eliezer in a comment:
In this example, the "perfect information about experimental setups and likelihood ratios" is the information that a biased coin will land the way it's biased with probability .9. I think this is a valid criticism, but it's not complete. There are perhaps many situations where we have much better information about experimental setups and likelihood ratios than about the mechanism that determines the unknown parameter we're trying to estimate. This criticism leaves open the question of whether it would make sense to give up Bayesianism for frequentism in those situations.
The other thing we could say is that while the frequentist in this example appears to be perfectly calibrated, he or she is liable to pay a heavy cost for this in accuracy. For example, suppose the machine is actually set up to always produce head-biased coins. After observing the coin tosses for a while, a typical intelligent person, just applying common sense, would notice that 90% of the tosses come up heads, and infer that perhaps all the coins are biased towards heads. They would become more certain of this with time, and adjust their answers accordingly. But the frequentist would not (or isn't supposed to) notice this. He or she would answer "the coin is head-biased with probability .9" 90% of the time, and "the coin is tail-biased with probability .9" 10% of the time, and keep doing this, irrevocably and forever.
The frequentist magic turns out to be weaker than it first appeared. What about the Bayesian solution to this problem? Well, we know that it must involve a prior, so the only question is which one. The maximum entropy prior that is consistent with the information given in the problem statement is to assign each coin an independent probability of .5 of being head-biased, and .5 of being tail-biased. It turns out that a Bayesian using this prior will give the exact same answers as the frequentist, so this is also an example of a "matching prior". (To verify: P(biased heads | observed heads) = P(OH|BH)*P(BH)/P(OH) = .9*.5/.5 = .9)
But a Bayesian can do much better. A Bayesian can use a universal prior. (With a universal prior based on a universal Turing machine, the prior probability that the first 4 coins will be biased "heads, heads, tails, tails" is the probability that the UTM will produce 1100 as the first 4 bits of its output, when given a uniformly random input tape.) Using such a prior guarantees that no matter how the coin-producing machine works, as long as it doesn't involve some kind of uncomputable physics, in the long run your expected total Bayes score will be no worse than someone who knows exactly how the machine works, except by a constant (that's determined by the algorithmic complexity of the machine). And unless the machine actually settles into deciding the bias of each coin independently with 50/50 probabilities, your expected Bayes score will also be better than the frequentist (or a Bayesian using the matching prior) by an unbounded margin as time goes to infinity.
I consider this magic also, because I don't really understand why it works. Is our prior actually a universal prior, or is the universal prior just a handy approximation that we can substitute in place of the real prior? Why does the universe that we live in look like a giant computer? What about uncomputable physics? Just what are priors, anyway? These are some of the questions that I'm still confused about.
But as long as we're choosing between different magics, why not pick the stronger one?