Frequentist Magic vs. Bayesian Magic

[I posted this to open thread a few days ago for review. I've only made some minor editorial changes since then, so no need to read it again if you've already read the draft.]

This is a belated reply to cousin_it's 2009 post Bayesian Flame, which claimed that frequentists can give calibrated estimates for unknown parameters without using priors:

And here's an ultra-short example of what frequentists can do: estimate 100 independent unknown parameters from 100 different sample data sets and have 90 of the estimates turn out to be true to fact afterward. Like, fo'real. Always 90% in the long run, truly, irrevocably and forever.

And indeed they can. Here's the simplest example that I can think of that illustrates the spirit of frequentism:

Suppose there is a machine that produces biased coins. You don't know how the machine works, except that each coin it produces is either biased towards heads (in which case each toss of the coin will land heads with probability .9 and tails with probability .1) or towards tails (in which case each toss of the coin will land tails with probability .9 and heads with probability .1). For each coin, you get to observe one toss, and then have to state whether you think it's biased towards heads or tails, and what is the probability that's the right answer.

Let's say that you decide to follow this rule: after observing heads, always answer "the coin is biased towards heads with probability .9" and after observing tails, always answer "the coin is biased towards tails with probability .9". Do this for a while, and it will turn out that 90% of the time you are right about which way the coin is biased, no matter how the machine actually works. The machine might always produce coins biased towards heads, or always towards tails, or decide based on the digits of pi, and it wouldn't matter—you'll still be right 90% of the time. (To verify this, notice that in the long run you will answer "heads" for 90% of the coins actually biased towards heads, and "tails" for 90% of the coins actually biased towards tails.) No priors needed! Magic!
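A quick simulation makes the calibration claim concrete. This is just an illustrative sketch: the 0.9/0.1 probabilities come from the setup above, and the three machine policies are arbitrary examples.

```python
import random

def run_frequentist(n_coins, machine_policy, rng):
    """Apply the fixed rule: guess whichever bias the single observed toss suggests."""
    correct = 0
    for i in range(n_coins):
        head_biased = machine_policy(i)        # how the machine chose this coin's bias
        p_heads = 0.9 if head_biased else 0.1
        saw_heads = rng.random() < p_heads     # the one toss we get to observe
        correct += (saw_heads == head_biased)  # the rule guesses what it saw
    return correct / n_coins

rng = random.Random(0)
# Three very different machines; the rule is right ~90% of the time against each.
for policy in (lambda i: True,                 # always head-biased
               lambda i: False,                # always tail-biased
               lambda i: i % 2 == 0):          # alternating deterministically
    print(round(run_frequentist(100_000, policy, rng), 2))
```

No matter which policy you plug in, the hit rate settles at 90%, which is the whole point of the example.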

What is going on here? There are a couple of things we could say. One was mentioned by Eliezer in a comment:

It's not perfectly reliable. They assume they have perfect information about experimental setups and likelihood ratios. (Where does this perfect knowledge come from? Can Bayesians get their priors from the same source?)

In this example, the "perfect information about experimental setups and likelihood ratios" is the information that a biased coin will land the way it's biased with probability .9. I think this is a valid criticism, but it's not complete. There are perhaps many situations where we have much better information about experimental setups and likelihood ratios than about the mechanism that determines the unknown parameter we're trying to estimate. This criticism leaves open the question of whether it would make sense to give up Bayesianism for frequentism in those situations.

The other thing we could say is that while the frequentist in this example appears to be perfectly calibrated, he or she is liable to pay a heavy cost for this in accuracy. For example, suppose the machine is actually set up to always produce head-biased coins. After observing the coin tosses for a while, a typical intelligent person, just applying common sense, would notice that 90% of the tosses come up heads, and infer that perhaps all the coins are biased towards heads. They would become more certain of this with time, and adjust their answers accordingly. But the frequentist would not (or isn't supposed to) notice this. He or she would answer "the coin is head-biased with probability .9" 90% of the time, and "the coin is tail-biased with probability .9" 10% of the time, and keep doing this, irrevocably and forever.

The frequentist magic turns out to be weaker than it first appeared. What about the Bayesian solution to this problem? Well, we know that it must involve a prior, so the only question is which one. The maximum entropy prior that is consistent with the information given in the problem statement is to assign each coin an independent probability of .5 of being head-biased, and .5 of being tail-biased. It turns out that a Bayesian using this prior will give the exact same answers as the frequentist, so this is also an example of a "matching prior". (To verify: P(biased heads | observed heads) = P(OH|BH)*P(BH)/P(OH) = .9*.5/.5 = .9)
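The parenthetical calculation can be checked directly; this is just the arithmetic above spelled out:

```python
p_bh = 0.5                          # max-entropy prior: P(coin is head-biased)
p_oh_given_bh = 0.9                 # P(observe heads | head-biased)
p_oh_given_tb = 0.1                 # P(observe heads | tail-biased)

# P(observe heads) marginalizes over both biases: .9*.5 + .1*.5 = .5
p_oh = p_oh_given_bh * p_bh + p_oh_given_tb * (1 - p_bh)
posterior = p_oh_given_bh * p_bh / p_oh
print(round(posterior, 10))         # 0.9, the same answer the frequentist gives
```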

But a Bayesian can do much better. A Bayesian can use a universal prior. (With a universal prior based on a universal Turing machine, the prior probability that the first 4 coins will be biased "heads, heads, tails, tails" is the probability that the UTM will produce 1100 as the first 4 bits of its output, when given a uniformly random input tape.) Using such a prior guarantees that no matter how the coin-producing machine works, as long as it doesn't involve some kind of uncomputable physics, in the long run your expected total Bayes score will be no worse than someone who knows exactly how the machine works, except by a constant (that's determined by the algorithmic complexity of the machine). And unless the machine actually settles into deciding the bias of each coin independently with 50/50 probabilities, your expected Bayes score will also be better than the frequentist (or a Bayesian using the matching prior) by an unbounded margin as time goes to infinity.
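The universal prior itself is uncomputable, but a toy stand-in conveys the idea: mix a handful of computable hypotheses about the machine, weighted by 2^-complexity (the complexity numbers below are made-up stand-ins for program lengths), and compare total log score against the matching prior when the machine in fact always produces head-biased coins.

```python
import math
import random

# name: (made-up complexity in bits, P(coin i is head-biased) under the hypothesis)
HYPOTHESES = {
    "always-heads": (1, lambda i: 1.0),
    "always-tails": (1, lambda i: 0.0),
    "independent-50/50": (1, lambda i: 0.5),
    "alternating": (2, lambda i: 1.0 if i % 2 == 0 else 0.0),
}

def compare_scores(n_coins, true_policy, rng):
    """Cumulative log2 score: complexity-weighted mixture vs. the matching prior."""
    weights = {name: 2.0 ** -c for name, (c, _) in HYPOTHESES.items()}
    mix_score = match_score = 0.0
    for i in range(n_coins):
        saw_heads = rng.random() < (0.9 if true_policy(i) else 0.1)
        # each hypothesis predicts P(heads) = .9*P(head-biased) + .1*P(tail-biased)
        preds = {name: 0.9 * p(i) + 0.1 * (1 - p(i))
                 for name, (_, p) in HYPOTHESES.items()}
        total = sum(weights.values())
        p_mix = sum(w * preds[name] for name, w in weights.items()) / total
        mix_score += math.log2(p_mix if saw_heads else 1 - p_mix)
        match_score += math.log2(0.5)          # matching prior always says 50/50
        for name in weights:                   # Bayes-update the mixture
            lik = preds[name] if saw_heads else 1 - preds[name]
            weights[name] *= lik
        z = sum(weights.values())              # renormalize to avoid underflow
        weights = {name: w / z for name, w in weights.items()}
    return mix_score, match_score

mix, match = compare_scores(1000, lambda i: True, random.Random(0))
print(mix > match)   # True: the mixture's lead grows without bound over time
```

The mixture quickly concentrates its weight on "always-heads" and thereafter predicts each toss at 0.9 rather than 0.5, so its Bayes score pulls ahead of the matching prior by a margin that grows linearly with the number of coins.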

I consider this magic also, because I don't really understand why it works. Is our prior actually a universal prior, or is the universal prior just a handy approximation that we can substitute in place of the real prior? Why does the universe that we live in look like a giant computer? What about uncomputable physics? Just what are priors, anyway? These are some of the questions that I'm still confused about.

But as long as we're choosing between different magics, why not pick the stronger one?

79 comments

We can't use the universal prior in practice unless physics contains harnessable non-recursive processes. However, this is exactly the situation in which the universal prior doesn't always work. Thus, one source of the 'magic' is through allowing us to have access to higher levels of computation than the phenomena we are predicting (and to be certain of this).

Also, the constants involved could be terrible and there are no guarantees about this (not even probabilistic ones). It is nice to reach some ratio in the limit, but if your first Graham's number of guesses are bad, then that is very bad for (almost) all purposes.

We can't use the universal prior in practice unless physics contains harnessable non-recursive processes. However, this is exactly the situation in which the universal prior doesn't always work. Thus, one source of the 'magic' is through allowing us to have access to higher levels of computation than the phenomena we are predicting (and to be certain of this).

My position is that the uncomputability of the universal prior shouldn't count against it. I think the fact that it works so well shows that our actual prior is likely also uncomputable, and that means we have to handle uncomputable priors in our decision theory, for example by specifying that we choose the option that we can prove (or just heuristically believe) has the highest expected payoff, instead of actually computing the expected payoffs.

A worse problem is that there seems to be reason to think that our actual prior is not just uncomputable, but unformalizable. See my earlier posts on this.

A worse problem is that there seems to be reason to think that our actual prior is not just uncomputable, but unformalizable. See my earlier posts on this.

If you make ontological claims, you are bound to get in trouble. Decision theory should speak of what the agent's algorithm should do, in terms of its behavior, not what it means for the agent to do that in terms of consequences in the real world. What the agent's algorithm does is always formalizable (as that algorithm!).

(For people unfamiliar with the discussion -- see "ontology problem" in this sequence.)

If you make ontological claims, you are bound to get in trouble.

It seems that we do tend to get into trouble when we make ontological claims, but why "bound to"? Your proposed FAI, after it has extracted human values, will still have to solve the ontological problem, right? If it can, then why can't we?

You advocate "being lazy" as FAI programmers and handing off as many problems as we can to the FAI, but I'm still skeptical that any FAI approach will succeed in the near future, and in the mean time, I'd like to try to better understand what my own values are, and how I should make decisions.

Your proposed FAI, after it has extracted human values, will still have to solve the ontological problem, right? If it can, then why can't we?

I don't believe even superintelligence can solve the ontology problem completely.

You advocate "being lazy" as FAI programmers and handing off as many problems as we can to the FAI, but I'm still skeptical that any FAI approach will succeed in the near future, and in the mean time, I'd like to try to better understand what my own values are, and how I should make decisions.

A fine goal, but I doubt it can contribute to FAI design (which, even if it takes more than a century to finish, still has to be tackled to make that possible). Am I right in thinking that you agree with that?

I don't believe even superintelligence can solve the ontology problem completely.

Why?

A fine goal, but I doubt it can contribute to FAI design (which, even it'll take more than a century to finish, still has to be tackled to make that possible).

I'm not sure what you're referring to by "that" here. Do you mean "preserving our preferences"? Assuming you do...

Am I right in thinking that you agree with that?

No, I think we have at least two disagreements here:

  1. If I can figure out what my own values are, there seem to be several ways that could contribute to FAI design. The simplest way is that I program those values into the FAI manually.
  2. I don't think FAI is necessarily the best way to preserve our values. I can, for example, upload myself, and then carefully increase my intelligence. As you've mentioned in previous comments, this is bound to cause value drift, but such drift may be acceptable, compared to the risks involved in implementing de novo FAI.

My guess is that the root cause of these disagreements is my distrust of human math and software engineering abilities, stemming from my experiences in the crypto field. I think there is a good chance that we (unenhanced biological humans) will never find the correct FAI theory, and that in the event we think we've found the right FAI theory, it will turn out that we're mistaken. And even if we manage to get FAI theory right, it's almost certain that the actual AI code will be riddled with bugs. You seem to be less concerned with these risks.

Could you suggest a source for further reading on this?

For example, suppose the machine is actually set up to always produce head-biased coins. After observing the coin tosses for a while, a typical intelligent person, just applying common sense, would notice that 90% of the tosses come up heads, and infer that perhaps all the coins are biased towards heads. They would become more certain of this with time, and adjust their answers accordingly. But the frequentist would not (or isn't supposed to) notice this.

They wouldn't (they aren't)?

He or she would answer "the coin is head-biased with probability .9" 90% of the time, and "the coin is tail-biased with probability .9" 10% of the time, and keep doing this, irrevocably and forever.

Only if they committed to this particular rule at the start, I would have thought...? I'm unsure what compels a frequentist to use that particular rule.

I don't understand why your universal prior passes the tape through the Turing machine. If you have a source of genuinely random tape inputs in the first place, why not just use the probability that a tape has the first four bits '1100'? And in any case, what's the advantage of invoking Turing machinery in place of the good old fair coins?

You get completely different results by doing that. The universal prior gives higher probability to simpler hypotheses -- 0000000000000000 is more likely than 0010111001001101, or at least analogous things are true once the strings get long enough. The idea is that if there's any computable regularity in what you're trying to model, as you get more data you'll recognize that regularity (i.e., your posterior will give much higher probability to futures in which the regularity continues than to ones in which it doesn't).
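A crude way to see this numerically is to use a general-purpose compressor as a stand-in for description length. Compressed size is only a loose upper bound on Kolmogorov complexity, so this is strictly an illustration:

```python
import random
import zlib

def code_len(bits: str) -> int:
    """Proxy for description length: size of the zlib-compressed string in bytes."""
    return len(zlib.compress(bits.encode(), 9))

random.seed(0)
regular = "0" * 1000
noisy = "".join(random.choice("01") for _ in range(1000))

# Under a 2^-length prior, the shorter description gets (much) more weight.
print(code_len(regular), code_len(noisy))
print(code_len(regular) < code_len(noisy))   # True
```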

Ah so, I understand. But in that case, doesn't evaluating your prior require you to integrate over the space of all possible Turing machines? There is no conceptual difficulty with this, but I must say that it strikes me as computationally rather formidable for a poxy little throw-four-possibly-biased-coins problem. :)

It's not just computationally formidable, it's computationally impossible. :-)

My understanding was that it's perfectly possible to solve the Halting Problem for particular TMs.

For a TM of length n, you begin running it and either it terminates at some point, or it runs for BusyBeaver(n)+1 steps, in which case you know it doesn't ever halt.

So as long as you have a fixed length n that all possible TMs fit into and know the Busy Beaver for n, you can in fact use the universal prior. (Busy Beavers are uncomputable in general, though we know a fair number, but this is an easy condition to weaken down to something practical like high probability of being the Busy Beaver.)
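Concretely, for the smallest nontrivial case: the maximum number of steps any halting 2-state, 2-symbol Turing machine takes is S(2) = 6, so simulating for 7 steps decides halting for every machine of that size. This is only a sketch; the transition-table encoding here is ad hoc.

```python
def runs_within(transitions, max_steps):
    """Step a Turing machine; return the step at which it halts, or None."""
    tape, pos, state = {}, 0, 'A'
    for step in range(1, max_steps + 1):
        key = (state, tape.get(pos, 0))
        if key not in transitions:           # no rule for this situation -> halt
            return step - 1
        write, move, state = transitions[key]
        tape[pos] = write
        pos += 1 if move == 'R' else -1
        if state == 'H':                     # explicit halt state
            return step
    return None                              # still running after max_steps

# The 2-state busy-beaver champion: halts after exactly 6 steps.
champion = {('A', 0): (1, 'R', 'B'), ('A', 1): (1, 'L', 'B'),
            ('B', 0): (1, 'L', 'A'), ('B', 1): (1, 'R', 'H')}
# A trivial 2-state loop: 7 steps of simulation prove it never halts.
looper = {('A', 0): (0, 'R', 'A'), ('A', 1): (0, 'R', 'A')}

print(runs_within(champion, 7))   # 6
print(runs_within(looper, 7))     # None -> provably non-halting, since S(2) = 6
```

The catch, as the thread goes on to discuss, is that this only works because someone already computed the busy-beaver bound for this machine size.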

  1. The fact that BB(n) is uncomputable means that your procedure doesn't, in fact, work. No?

  2. You don't, in any case, have a fixed n. (Unless perhaps you are only interested in modelling Turing machines that fit in the universe -- but in that case, wouldn't you also want your own computations to fit in the universe?)

  1. Finding BB(n) for all n is uncomputable. Finding BB(n) where n=10, say, is not uncomputable.
  2. Yeah, this is an interesting point. My thinking is that yes, we do only care about modeling Turing machines that fit in the universe, since by definition those are the only Turing machines whose output we will ever observe. We might also have to bite the bullet and say we only care about predicting Turing machines which are small enough that there is room left in the universe for us; we don't want to perfectly model the entire universe, just small parts.
  1. If it's uncomputable in general then it's uncomputable for some particular n. I expect that in fact it's uncomputable for any n above some rather small value. (In particular, once n is large enough for there to be n-state Turing machines that unprovably don't always halt.)

  2. Well, if you only care about predicting smallish Turing machines, then for the same reason prediction is only useful if it can be done with smallish machines, in reasonable time. And even for very small n this isn't true for calculating BB(n).

Just out of my own curiosity, I checked Wikipedia on Busy Beaver numbers: BB(10) > 3^^^3.

While the article admits that every finite sequence (such as n = 1...10) is computable, I think it is safe to call it intractable right now.

Hey, don't laugh. That's useful to know - now you know any Turing machine with fewer than 10 states is bluffing when it threatens 3^^^^3 specks of sand in eyes.

now you know any Turing machine with fewer than 10 states is bluffing when it threatens 3^^^^3 specks of sand in eyes.

How did you conclude that from a lower bound?

I don't think the lower-bound will turn out to be all that off, and I tossed in another ^ as insurance. Jests don't have to be air-tight.

Off the cuff, couldn't it actually accomplish that if it just doesn't terminate?

If it doesn't terminate, then I think its threat would be an infinite number of sand specks, wouldn't it?

I don't think so, provided you limit yourself to Turing machines constructed of at most a universe's worth of energy and have unlimited time for the computation. However, that said, invoking such a heavy-duty prior seems to me a bit of a cop-out; you might almost as well invoke "My prior is what Omega told me". For choosing between Bayesian and frequentist approaches regarding actual problems in the real brain space available to humans, I prefer priors that can be computed in something like a human lifetime. To show that a Bayesian with a prior that would take multiple billion years to compute can do better than a frequentist is not really a very strong result.

The trick, and the reason for using Turing machines, is that '0000' and '1111' ought to be considered more likely than '0101' and '1010', because they're simpler; and the way this model represents simplicity, a sequence is simple if it's the output of a short program (where 'program' in this case means 'prefix to a Turing machine tape'). If you used the tape directly, then '0000000000' and '0100011011' would be equally likely, which, generally speaking, they're not.

What I keep coming back to here is: doesn't the entire point of this post come down to situations where the parameters in question, the biases of the coins, are not independent? And doesn't that contradict:

estimate 100 independent unknown parameters

Which leads me to read the latter half of this post as: we can (in principle, perhaps not computably) estimate 1 complex parameter from 100 data sets better than 100 independent unknown parameters from individual data sets. This shouldn't be surprising, and I certainly don't find it so.

The first half just points out that in the independent case of this particular example, Bayesian and Frequentist perform equivalently for relatively similar assumptions. But cousin_it made a general claim about the Frequentist approach, so this isn't worth much weight on its own.

How hard is it to "crack" a pseudorandom number generator? Say, one used to generate cards in an online poker game?

In the very early days of online poker (late 90's), a group of people did exactly that.

A number of factors combined to make it possible:

  • The site published the code for its shuffler, supposedly to demonstrate its competence.
  • The RNG was seeded by the number of milliseconds since midnight, a range laughably smaller than the range of possible poker hands.
  • Online poker was played for sufficiently low stakes that nobody thought either of the above would be a real problem.
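A sketch of how small that seed space is in practice. The shuffler here is Python's Mersenne Twister standing in for whatever the site actually used, and the specific numbers are illustrative:

```python
import random

MS_PER_DAY = 24 * 60 * 60 * 1000   # only 86,400,000 possible seeds in a day

def deal(seed, n=5):
    """Toy shuffler: seed the PRNG, shuffle a 52-card deck, deal the first n cards."""
    deck = list(range(52))
    random.Random(seed).shuffle(deck)
    return deck[:n]

def crack(observed, seed_range):
    """Enumerate candidate seeds until one reproduces the cards we saw."""
    return [s for s in seed_range if deal(s, len(observed)) == observed]

secret_seed = 12_345_678           # the server's ms-since-midnight timestamp
observed = deal(secret_seed)       # cards visible to the attacker

# Scanning the whole day is only 86.4M deals; here we scan a 20k-seed window.
# Compare with 52! (about 8e67) possible deck orders for an honest shuffle.
matches = crack(observed, range(12_340_000, 12_360_000))
print(secret_seed in matches)      # True: the entire deck order is now known
```

Once the seed is recovered, every hole card and the whole remaining deck follow deterministically, which is exactly what the attackers exploited.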

Nowadays, sites keep their algorithms secret, and the even distribution of hands is constantly being verified by third parties with large data sets. They also harvest large sources of entropy, from user mouse gestures and physical RNGs. See PokerStars' write-up on security.

They also harvest large sources of entropy, from user mouse gestures

So, at least in theory, one could manipulate their RNG a bit by sending back some crafted "mouse moving" signals.

One would also need to know the other inputs and their hash function in order to control what the manipulation did. Otherwise, it would be "random".

It needn't be a perfect manipulation to get something useful.

If the hash function is strong, it doesn't help - though as far as I know there's no good formal definition of the properties the hash function has to have.

One of the properties a cryptographic hash function should have is indeed that, if you add known plaintext to an unknown plaintext in any way, you still can't predict any property of the output that you couldn't without this - you get no additional information.

Of course, we still don't have a clue how to prove this for any interesting hash function. They frequently get broken, because there's basically no mathematical framework behind them.

there's basically no mathematical framework behind them.

This is a long, long, long way from the truth.

No need to worry, cipher is right. I'm being horribly inaccurate, as usual. Voted him up for correctness. :-)

What I should have said is that there are no correctness proofs, of the style you'd get for RSA. No guarantees, in other words. There is plenty of math, just no math that lets you prove they have the properties you want them to have; this is doubly annoying because, historically, they will almost certainly turn out later to not have them.

There's plenty of math that guarantees they lack specific vulnerabilities, and far more that attempts to exploit vulnerabilities, just nothing that says there are no vulnerabilities; at least, nothing I've heard of. Different hash functions have different levels of security; for example, most block ciphers can technically be used for hashing, hopefully with less chance of trouble than the faster MD5 or SHA hashes.

The fact of the matter is still that, outside of the annoyingly slow RSA family of ciphers, nothing I've heard of in cryptography that's practically usable also has a strong mathematical proof; a problem that is borne out by a long string of theoretical and sometimes practical attacks on the algorithms. It's easy to get disillusioned, but at least we get amusing attack names like "miss-in-the-middle", and SHA-2 is still mostly intact.

And lest someone calls me on it, I should probably point out that RSA does not, strictly speaking, have a security proof either. Its security hinges on a single open question, namely whether prime factorization is slow or not; that's a lot better than the attack surface of other ciphers, but not quite no surface at all, and that's even ignoring the existence of Shor's algorithm. If you have a quantum computer, RSA is broken.

Well, there are still elliptic curves.

of the style you'd get for RSA.

There are no such proofs for RSA either. The best you can hope for with public key crypto is a reduction to a problem believed hard; in the case of RSA, this is the "RSA problem" of finding roots modulo semiprimes of unknown factorization (which is not the integer factorization problem). I think in practice we have more faith in the security of our stronger symmetric primitives than we do that the hard problems in public key crypto are hard; this has things backwards.

Incidentally, as far as I can tell there's no good reason ever to use RSA; the Rabin family reduce to a strictly harder problem (factorization) and are faster too.

MD5 and SHA are built around custom block ciphers.

Against quantum computers lots of elliptic curve crypto is also broken. Look up "post-quantum cryptography" for a sampling of the algorithms that are not known to fall; none of the well-known ones.

Another word for a strong pseudorandom number generator is a "stream cipher".

(This isn't strictly true, but close enough)

Depends on the generator. Obviously hardware ones cannot be cracked, and even some software ones are thought to be cryptographically secure.

How hard is it to "crack" a pseudorandom number generator? Say, one used to generate cards in an online poker game?

A lot harder than it would be to prove that the random number generator is rigged. ;)

Thank you for introducing me to the "universal prior".

The model I'd have used in explaining a sequence of such coin flips would indeed assume that the machine generates each coin independently as heads-biased with fixed probability H. But rather than fixing H=1/2, I'd have a prior distribution for H (probably uniform).

How superior would the universal prior be to this? I confess I can't do the math.
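For what it's worth, this model is tractable numerically. Below is a grid-approximation sketch; the 90/10 split of observed tosses is a hypothetical data set standing in for an always-head-biased machine:

```python
# Under this model a single toss lands heads with probability
#   P(heads | H) = 0.9*H + 0.1*(1 - H) = 0.1 + 0.8*H,
# so the posterior over H after the tosses is a simple grid computation.

grid = [i / 1000 for i in range(1001)]      # candidate values of H
posterior = [1.0] * len(grid)               # uniform prior over H

def update(posterior, saw_heads):
    post = []
    for w, h in zip(posterior, grid):
        p_heads = 0.1 + 0.8 * h
        post.append(w * (p_heads if saw_heads else 1 - p_heads))
    z = sum(post)
    return [w / z for w in post]            # normalize

# 90 heads and 10 tails, roughly what an always-head-biased machine produces:
for saw_heads in [True] * 90 + [False] * 10:
    posterior = update(posterior, saw_heads)

mean_H = sum(h * w for h, w in zip(grid, posterior))
print(mean_H > 0.8)   # True: the model concludes most coins are head-biased
```

So this prior already escapes the frequentist's fixed 90/10 answers; what it can't do, unlike the universal prior, is pick up patterns such as alternation, since it assumes the coins are exchangeable.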

Eliezer also wrote a post about the universal prior, although he didn't use that name. See Occam's Razor.

Also thanks for introducing me indirectly to AIXI, the universal artificial intelligence. Apparently, for any time and space bounds (t,l), Hutter has defined an algorithm AIXI(t,l) that performs optimally assuming "the environment is computable". Sounds like an awesome model to compare our own rationality against:

http://www.hutter1.net/ai/paixi.htm

This blog post has some interesting quotes, including one by Eliezer about the fearful unfriendliness of AIXI:

http://www.acceleratingfuture.com/michael/blog/2009/12/computable-aixi-should-we-be-afraid/

Apparently, for any time and space bounds (t,l), Hutter has defined an algorithm AIXI(t,l) that performs optimally assuming "the environment is computable". Sounds like an awesome model to compare our own rationality against:

Meh. I was excited about it at first, but here's the problem: that claim, at least as you've phrased it, is wrong. It would only be correct if you mean average performance over all computable environments, for an agent with an Occamian prior as its only knowledge.

But if you consider agents that operate in this computable environment, there are numerous examples that perform better than AIXI-tl. Specifically, all animals do. The reason is that they don't, like AIXI/AIXI-tl, start from tabula rasa, but are born with implicit knowledge about their environments (that is closer to the truth than a mere Occamian prior) and implicit heuristics for exploiting them.

That's why Hutter isn't just running AIXI-tl and outfoxing them all. (pardon the pun)

More generally, there's the important issue that we shouldn't want an agent that performs optimally, on average, over all computable environments, if that means sacrificing performance in this environment, which it almost certainly does.

More generally, there's the important issue that we shouldn't want an agent that performs optimally, on average, over all computable environments, if that means sacrificing performance in this environment, which it almost certainly does.

Couldn't we patch this easily? Supply the agent with all prior observations. If it's optimal, it'll make better use of the observations than we or animals would.

Or to put it another way: wouldn't AIXI-tl perform optimally on average over all computable environments consistent with our history? That seems desirable.

(Interesting thought: evolution can only discover a few bits every generation, and likely our genomes and cellular environments embody many fewer than that upper limit of a few gigabytes; how many observations would we have to supply a new AIXI-tl before it caught up to us?)

Short answer: we can patch it, but not easily.

Longer answer: If we could somehow represent the knowledge we have (both explicit and implicit) in a format that integrates nicely with the way the AIXI-approximating program stores its own knowledge, then we could "bring it up to speed" with where we are, and let it learn from there. The knowledge we give it would make it throw out the spurious solutions it would otherwise have to consider.

But representation of our knowledge, especially the important implicit knowledge we never realize we're using, is very hard -- probably AI-complete. So it doesn't reduce the problem. And any knowledge you'd be feeding it would have to be found by humans first, when finding this knowledge is the very thing you're supposed to be automating! But yes, that would work.

Or to put it another way: wouldn't AIXI-tl perform optimally on average over all computable environments consistent with our history?

I'm not sure, but I don't think so. The requirement that it perform optimally on average (probability weighted) over all computable environments consistent with its history means that it has to pass some of its performance into those other environments, which can still be drastically different.

One thing to keep in mind: Though we like to neatly model beliefs, values, and observations as cleanly separated, Nature is under no obligation to respect the difference -- in nature, these are all blended together, such that the three can only be inferred from the overall dynamics. This is why I want to set up some kind of model of an environment that is constrained by (some analog of) the laws of thermodynamics and chemistry so that we can see the development of (phenomena we could describe as) complexity, intelligence, and life, but without explicitly programming any of those into the model.

If we could somehow represent the knowledge we have (both explicit and implicit) in a format that integrates nicely with the way the AIXI-approximating program stores its own knowledge, then we could "bring it up to speed" with where we are, and let it learn from there.

Too many restrictions there, I think. The format doesn't have to be nice - any format it doesn't know it will know after a fixed-length penalty. We could just dump in the Internet Archive raw.

The requirement that it perform optimally on average (probability weighted) over all computable environments consistent with its history means that it has to pass some of its performance into those other environments, which can still be drastically different.

But those other environments where it performs poorly are the ones that ought to perform poorly: its best performance is reserved for the most likely future histories, just like we would aspire to.

We would perform poorly in an anti-Occamian universe just like it would, but we're far from optimal and so would perform worse in other scenarios, I would think. I suppose we could be so biased and incorrect that we luck out and our biases and errors are just right, but is it plausible that we could luck out enough to overcome the general performance difference?

Too many restrictions there, I think. The format doesn't have to be nice - any format it doesn't know it will know after a fixed-length penalty. We could just dump in the Internet Archive raw.

Alright, the thing I meant by "nice" needs some elaboration. I'll put it this way: For an Ultimate AI, all of its knowledge -- with no exceptions -- is in terms of constraints on what it expects to observe. (And yes, this is what rationalists should strive for too.) So there is no, "light is waves", no "that's the Doppler effect". There are only mappings from inputs to probability distributions on future inputs. (Confusingly, this would also mean an expectation for [phenomena explainable as] humans [generating sound waves explainable by them] saying [something we would recognize as the statement] "light is waves". Phew!)

Any large human-generated knowledge base initially appears to AIXI as a long string of characters and/or some input/output black box. What in its inputspace do the characters refer to? What is the appropriate way to group them? Most importantly, after being told that "this stuff is true" or "this is a lot of what goes on in the environment that computes your inputs", how does it know how it maps to the rest of the environment's generating function? (Which I guess is ultimately the same as the first question.)

That problem is nearly as intractable as starting from just an Occamian prior. It's only resolved by symbol-grounding, which means representing knowledge in the form of a probability distribution on observations, in a way your program can understand. Which I think brings you back into the AI-complete realm.

But those other environments where it performs poorly are the ones that ought to perform poorly: its best performance is reserved for the most likely future histories, just like we would aspire to.

Okay, you're right, I wasn't keeping the comparison baselines straight. If you could give the program enough knowledge, in a form it understands or could quickly learn, to distinguish this computable environment from substantively different environment (including its location within it, the relevant history, etc.), then yes, it would make better inferences than humans.

But the point stands that dumping any kind of knowledge base on a computable version of AIXI won't help you a bit until you've done a lot more of the cognitive labor.

That's a good point; in-born priors are much better (as priors) for typical human environments, and plenty of environments we'd like to design robots for.

But consider things like black holes, the relativity of pretty much everything kinematic except light speed, and almost all of quantum mechanics. These theories stopped seeming like outright paradoxes to me precisely when I eased up and started assuming only that the universe was probably just computational in some sense, and not necessarily "3D-physical" or "spatial" or "temporal" or whatever. So naturally, I'm pleased to find a well-developing theory of "AIXI", an idealized implementation of this very starting point :)

One way that the universal prior is superior to the one you propose is that it can notice that the machine alternates between producing heads-biased and tails-biased coins.

In this comment, I described another prior (still much easier to use than the universal prior) that could detect this pattern, and also noted another behavior the machine can have that my prior can't notice.

The advantage of the universal prior is that it assigns non-zero probability to all computable behaviors the machine might have, so that it can concentrate probability into that behavior if the machine exhibits it.
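As an illustrative sketch (not the universal prior itself, which is uncomputable), here is a toy mixture over a handful of hypothesized machine behaviors; the hypothesis set is my own invention for the example, but it shows how probability concentrates on an alternating machine once the data exhibit that pattern:

```python
# Toy mixture prior over four machine behaviors: always heads-biased,
# always tails-biased, a 50/50 mix, and strictly alternating (even-
# numbered coins heads-biased). Each coin yields one observed toss;
# P(toss = H | coin is heads-biased) = 0.9, else 0.1.

def p_heads(biased_heads):
    return 0.9 if biased_heads else 0.1

# hypothesis -> P(coin i is heads-biased) under that hypothesis
hypotheses = {
    "always_heads": lambda i: 1.0,
    "always_tails": lambda i: 0.0,
    "fifty_fifty":  lambda i: 0.5,
    "alternating":  lambda i: 1.0 if i % 2 == 0 else 0.0,
}
posterior = {h: 0.25 for h in hypotheses}  # uniform prior over hypotheses

# One observed toss per coin from an alternating machine (a clean run).
observations = "HTHTHTHTHTHT"

for i, obs in enumerate(observations):
    for h, p_hb in hypotheses.items():
        q = p_hb(i)  # P(coin i is heads-biased) under hypothesis h
        p_obs = q * p_heads(True) + (1 - q) * p_heads(False)
        posterior[h] *= p_obs if obs == "H" else 1 - p_obs
    z = sum(posterior.values())
    posterior = {h: p / z for h, p in posterior.items()}

print(posterior)  # "alternating" ends up with almost all the mass
```

After a dozen coins the alternating hypothesis has beaten the 50/50 mix by a factor of (0.9/0.5)^12, which is the concentration behavior the universal prior gives you for every computable pattern at once.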

How superior would the universal prior be to this? I confess I can't do the math.

A lot, but it is not (particularly) quantifiable. Allowing H to be an unknown constant rather than 0.5 allows for many more possible algorithms, but it is only an iterative step towards 'allow any algorithm'. That, in turn, is not as good as 'allow any algorithm but with a prior distribution based on complexity'.
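A hedged sketch of that iterative step: treat the machine's heads-biased rate H as an unknown constant with a uniform prior, and update it from each observed toss, marginalizing over which way each coin was biased. The grid approximation below is my own construction, not something from the thread:

```python
# Grid-approximate posterior over H, the unknown fraction of coins the
# machine makes heads-biased. Marginalizing over the coin's bias,
# P(toss = H | H) = H*0.9 + (1 - H)*0.1.

N = 101
grid = [i / (N - 1) for i in range(N)]
weights = [1.0] * N  # uniform prior over H

observations = "H" * 90 + "T" * 10  # e.g. 90% of observed tosses were heads

for obs in observations:
    for j, h in enumerate(grid):
        p_h = h * 0.9 + (1 - h) * 0.1
        weights[j] *= p_h if obs == "H" else 1 - p_h
    z = sum(weights)
    weights = [w / z for w in weights]

mean_H = sum(h * w for h, w in zip(grid, weights))
print(mean_H)  # posterior mean is well above 0.9: nearly all coins heads-biased
```

Allowing any algorithm, rather than any constant H, replaces this one-dimensional grid with a sum over programs weighted by complexity, which is where the universal prior comes in.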

After observing the coin tosses for a while, a typical intelligent person, just applying common sense, would notice that 90% of the tosses come up heads, and infer that perhaps all the coins are biased towards heads. They would become more certain of this with time, and adjust their answers accordingly.

A Bayesian could do better if you allowed it to remember things.

Coin lands on heads: P(heads-biased | H) = .9*.5 / .5 = .9, P(tails-biased | H) = .1*.5 / .5 = .1. Another heads: P(heads-biased | H, H) = .9*.9 / (.9*.9 + .1*.1) ≈ .988, P(tails-biased | H, H) ≈ .012.
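A minimal sketch of that update in Python, assuming the 50/50 prior over the two biases from the original setup:

```python
# Posterior that a coin is heads-biased after a sequence of tosses.
# Assumes P(heads | heads-biased) = 0.9, P(heads | tails-biased) = 0.1,
# and a 50/50 prior over the two biases.

def posterior_heads_biased(tosses, prior=0.5):
    """P(heads-biased | tosses); tosses is a string of 'H'/'T'."""
    like_hb = prior        # running prior * likelihood, heads-biased
    like_tb = 1 - prior    # same, tails-biased
    for t in tosses:
        like_hb *= 0.9 if t == 'H' else 0.1
        like_tb *= 0.1 if t == 'H' else 0.9
    return like_hb / (like_hb + like_tb)

print(posterior_heads_biased("H"))    # 0.9
print(posterior_heads_biased("HH"))   # 0.81/0.82, about 0.988
```

Two heads in a row from the same coin push the posterior from .9 to about .988, which is exactly the memory advantage the comment is pointing at.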

Oops. I meant that for all possible biases towards heads, not always or never. Except that only has 3/4 success.

I'm just saying comparing it to a human is unfair. Humans use feeble subsets of the universal prior that are easy to imitate.

I wish people like yourselves would spend your time trying to work on cures for diseases instead of thinking up different ways to explain mundane events

This is a belated reply to cousin_it's 2009 post Bayesian Flame

Not to be a grammar Nazi, but I believe it should be cousin_its....

Orthography is not intuitive. To test my native speaker instinct, I'll pick a case that is. Imagine a user whose name was "Praise_Him". To me, it would be more natural to say "Praise_Him's post" than "Praise_His post"; the former might give me a second's pause, but the latter would make me reread the sentence. Thus, at least the way I use the language, a proper name which incorporates a pronoun is possessivized as a whole, and cousin_it's is correct. But "Its" and "It's" are homophonous, so it wouldn't matter to me much.

The underscore _ signifies italics in whatever-markup-system-this-is: _emph_ => emph. You can escape the underscore with a backslash before it: \_emph\_ => _emph_.

Now I'm going to be arguing about that with myself all day. How do you possessivize a proper noun that ends in the word "it"?

I vote for "cousin_it's".

(Yes, that is a period outside of the quotation mark. I am unrepentant.)

I agree with your vote and your punctuation.

"Cousin_it's" is good. It's like eir name is Schmit or something.

I too am behind your punctuation choice.

The ones in the right have nothing to repent for!

(Yes, that is a period outside of the quotation mark. I am unrepentant.)

For similar reasons I would not personally discourage people from (incorrectly) referring to (for example) Clippy's paper clips as " it's ". The language is clearly flawed.

good thing I didn't go with the username "this.is.she"!