A History of Bayes' Theorem

29th Aug 2011

91 comments

I shared the link to this post on an IRC channel populated by a number of people, but mostly by mathematically inclined CS majors. It provoked a bunch of discussion about the way frequentism/bayesianism is generally discussed on LW. Here are a few snippets from the conversation (nicknames left out except my own, less relevant lines have been edited out):

11:03 < Person A> For fucks sake "And so at age 62, Laplace — the world's first Bayesian — converted to frequentism, which he used for the remaining 16 years of his life."

11:04 <@Guy B> well he believed that the results were the same

11:04 <@Guy B> counterexamples were invented only later

11:05 < Person A> Guy B: Still, I just hate the way that lesswrong talks about "bayesians" and "frequentists"

11:05 <@Guy B> Person A: oh, I misinterpreted you

11:06 < Person A> Every time yudkowsky writes "The Way of Bayes" i get a sudden urge to throw my laptop out of the window.

11:08 < Person A> Yudkowsky is a really good popular writer, but I hate the way he tries to create strange conflicts even where they don't exist.

11:10 <@Xuenay> I guess I should point out ...

Does anybody better versed in the debate have a comment?

Though I was not addressed by that, here goes anyway:

That people are happy doing whatever works doesn't make them part Bayesian and part Frequentist in LW's meaning any more than eating some vegetables and some meat makes one part vegetarian and part carnivore. Omnivores are not insiders among vegetarians or carnivores.

Bayesians - those who really do care, as you put it - believe something like "learning works to the extent it models Bayesian updating". When omnistatisticians decide to use a set of tools they customize for the situation, and make the result look clean and right and not silly and even extrapolatable and predictive, etc., and this gets a result better than formal Bayesian analysis or any other analysis, Bayesians believe that the thing that modeled Bayesian updating happened within the statisticians' own minds - their models are not at all simple, because the statistician is *part of the model*. Consequently, any non-Bayesian model is almost by definition poorly understood.

This is my impression of the collective LW belief, that impression is of course open to further revision.

LW has contributed to the...

But the content in my post isn't by Less Wrong, it's by McGrayne.

The history in McGrayne's book is an excellent substantiation of just how deep, serious, and long-standing the debate between frequentism and Bayesianism really is. If they want, they can check the notes at the back of McGrayne's book and read the original articles from people like Fisher and Jeffreys. McGrayne's book is full of direct quotes, filled with venom for the 'opposing' side.

But the content in my post isn't by Less Wrong, it's by McGrayne.

Fair point. Still, a person who hasn't read the book can't know whether lines such as "at age 62, Laplace — the world's first Bayesian — converted to frequentism" are from the book or something you came up with when summarizing.

If they want, they can check the notes at the back of McGrayne's book and read the original articles from people like Fisher and Jeffreys.

In previous discussions on the topic, I've seen people express the opinion that the fierce debates are somewhat of a thing of the past. I.e. yes there have been fights, but these days people are mostly over that.

713y

I took this as a successful attempt at humor.

110y

This is something I was told over and over again by professors, when I was applying to grad school for biostatistics and told them I was interested in doing specifically Bayesian statistics. They mistook my epistemological interest in Bayes as like... ideological alignment, I guess. This is how I learned 1. that there were fierce debates in the recent past and 2. most people in biology don't like them or consider them productive.

210y

I'm not sure that the debates were even THAT recent. I think your professors are worried about a common failure mode that sometimes crops up: people like to think they know the "one true way" to do statistics (or really any problem), and so they start turning every problem into a nail so that they can keep using their hammer, instead of using methodology appropriate to the problem at hand.
I see this a fair amount in data mining, where certain people ONLY use neural nets, and certain people ONLY use various GLMs and extensions, and sometimes get overly heated about it.

110y

Thanks for the warning. I thought the only danger was ideological commitment. But--correct me if I'm wrong, or just overreaching--it sounds like if I fail, it'll be because I develop an expertise and become motivated to defend the value of my own skill.

210y

No, more like you'll spend months (or more) pushing against a research problem to make it approachable via something in a Bayesian toolbox when there was a straightforward frequentist approach sitting there all along.

Because of its subject, your post in particular will obviously focus on those who care about the debate. It's not about the practice of learning from data, it's about the history of views on how to learn from data.

The criticism that it ignores those who utilize and do not theorize is wrongheaded. The only thing that prevents it from being an outright bizarre accusation is that LW has repeatedly ignored the mere utilizers who are outside the academic debate when they should have been discussed and addressed.

But the content in my post isn't by Less Wrong, it's by McGrayne.

I strongly, strongly disagree. Even presenting unaltered material in a context not planned by the original author is a form of authorship. You have gone far, far beyond that by paraphrasing. You have presented an idea to a particular audience with media, you are an author, you are responsible.

If my friend asks to borrow a book to read, and I say "Which book" and he or she says "Whichever" I affect what is read and create the context in which it is read.

812y

I literally just finished the book, and Luke's paraphrase seems pretty apt. As presented by McGrayne, with specific quotes and punitive actions, the feud was brutal.

413y

My problem, and likely the chatters', is that by leading a team cheer for one audience, the larger neutral audience feels excluded. Doesn't really matter whose words it was.
And while most of the history was very interesting, some of it felt cherry-picked or spun, adding to that feeling of team-ization.

413y

I don't think "neutral" is quite the right word for the audience in question. It may be the best one, but there is more to it, as it only captures the group's view of itself, and not how others might see it.
The Bayesians (vegetarians) see the "neutrals" (omnivores) as non-understanding (animal-killers). The neutrals see themselves as partaking of the best tools (foods) there are, both Bayesian and frequentist (vegetable and animal), and think that when Bayesians call them "non-Bayesians" (animal-killers) the Bayesians are making a mistake of fact by thinking that they are frequentists (carnivores). Sometimes Bayesians even say "frequentist" when context makes it obvious they mean "non-Bayesian" (or that they are making a silly mistake, which is what the threatened "neutrals" are motivated to assume).
That they are neutrals is absolutely how those in the group in question see themselves, but it is also true that Bayesians see them as heretics (murderers of Bambi, Thumper, and Lambchop, or what have you) without making a mistake of fact. The Bayesian theoretical criticisms should not be brushed aside on the grounds that they are out of touch with how things are done and fail to understand that most use all available tools (are omnivorous). They can be addressed by invoking the outside view against the inside view, or practice against theory, etc. (these are arguments in which Bayesians and frequentists are joined against neutrals) and subsequently (if the "neutrals" (omnivores) do not win against the Bayesians [and their frequentist allies {those favoring pure diets}] outright in that round) on the well-worn Bayesian (vegetarian) v. frequentist (carnivore) battlegrounds.

413y

I think the vegetarian-carnivore metaphor here doesn't help at all :)

613y

I found it helpful. But I'm an omnivore so I (mistakenly) think that I don't have a dog in that fight.

313y

This is quite possible, but there is some irony here - you have misrepresented the analogy by describing a three category grouping system by naming two of its categories, implying it is about opposites!
I think that people do this too often in general and that it is implicated in this debate's confused character. Hence, the analogy with more than a dichotomy of oppositional groups!

213y

Realising that it is a three-way split, not a two-way split, is my latest hammer. See me use it in Is Bayesian probability individual, situational, or transcendental: a break with the usual subjective/objective bun fight.
Having said that, I find myself agreeing with kurokikaze; the vegetarian-omnivore-carnivore metaphor doesn't help. The spilt blood (and spilt sap) distract from, and obscure, the "Three, not two" point.

In my laboratory statistics manual from college (the first edition of this book) the only statistics were frequentist, and Jaynes was considered a statistical outlier in my first year of graduate school. His results were respected, but the consensus was that he got them in spite of his unorthodox reduction method, not because of it.

In my narrow field (reflection seismology) two of the leaders explicitly addressed this question in a (surprisingly to me little-read and seldom-referenced) paper: To Bayes or not to Bayes. Their conclusion: they prefer their problems neat enough to not require the often-indispensable Bayes method.

It is a debate I prefer to avoid unless it is required. The direction of progress is unambiguous but it seems to me a classic example of a Kuhn paradigm shift where a bunch of old guys have to die before we can proceed amicably.

A very small minority of people hate Bayesian data reduction. A very small minority of people hate frequentist data reduction. The vast majority of people do not care very much unless the extremists are loudly debating and drowning out all other topics.

As another graduate student, I have in general heard similar opinions from many professors through undergrad and grad school. Never disdain for Bayes, but often something along the lines of "I am not so sure about that" or "I never really grasped the concept/need for Bayes." The statistics books that were required for classes, in my opinion during the class, used a slightly negative tone while discussing Bayes and 'subjective probability.'

I think this is due to Yudkowsky's focus on AI theory; an AI can't use discretion to choose the right method unless we formalize this discretion. Bayes' theorem is applicable to all inference problems, while frequentist methods have domains of applicability. This may seem philosophical to working statisticians - after all, Bayes' theorem is rather inefficient for many problems, so it may still be considered inapplicable in this sense - but programming an AI to use a frequentist method without a complete understanding of its domain of applicability could be disastrous, while that problem just does not exist for Bayesianism. There is the problem of choosing a prior, but that can be dealt with by using objective priors or Solomonoff induction.

313y

I'm not sure what you meant by that, but as far as I can tell not explicitly using Bayesian reasoning makes AIs less functional, not unfriendly.

113y

Yes, mostly that lesser meaning of disastrous, though an AI that almost works but has a few very wrong beliefs could be unfriendly. If I misunderstood your comment and you were actually asking for an example of a frequentist method failing, one of the simplest examples is a mistaken assumption of linearity.

013y

"There is the problem of choosing a prior, but that can be dealt with by using objective priors or Solomonoff induction."
Yeah, well. That of course is the core of what is dubious and disputed here. Really, Bayes' theorem itself is hardly controversial, and talking about it this way is pointless.
There's sort of a continuum here. A weak claim is that these priors can be an adequate model of uncertainty in many situations. Stronger and stronger claims will assert that this works in more and more situations, and the strongest claim is that these cover all forms of uncertainty in all situations. Lukeprog makes the strongest claim, by means of examples which I find rather sketchy relative to the strength of the claim.
To Kaj Sotala's conversation, adherents of the weaker claim would be fine with the "use either methodology if that suits it" attitude. This is less acceptable to those who think priors should be broadly applicable. And it is utterly unacceptable from the perspective of the strongest claim.
For that matter, "either" is incorrect (note that in the original conversation one of them actually talks about several approaches rather than two). There is lots of work on modeling uncertainty in non-frequentist and non-Bayesian ways.

213y

Anyone who bases decisions on a non-Bayesian model of uncertainty that is not equivalent to Bayesianism with some prior is vulnerable to Dutch books.
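The Dutch book argument gestured at here can be made concrete with a toy sketch (my illustration, not from the thread): an agent whose probabilities for a proposition and its negation don't sum to 1 will accept a pair of bets that together lose in every possible world.

```python
# Toy Dutch book: the agent prices a bet at its stated probability,
# i.e. it will pay P(X) * stake for a ticket that pays `stake` if X holds.
# If P(A) + P(not A) != 1, the prices are incoherent and exploitable.

def bookie_profit(p_a, p_not_a, stake=1.0):
    """Bookie sells the agent one bet on A and one bet on not-A.

    Returns the bookie's profit, which is the same in the world where
    A holds and the world where it doesn't, since exactly one bet wins.
    """
    income = (p_a + p_not_a) * stake   # agent pays for both tickets
    payout = stake                     # exactly one ticket pays out
    return income - payout

# Incoherent agent: beliefs sum to 1.1, so the bookie wins ~0.10 for sure.
print(bookie_profit(0.4, 0.7))
# Coherent agent: beliefs sum to 1, so no sure profit exists this way.
print(bookie_profit(0.4, 0.6))
```

The converse direction (any non-exploitable betting behaviour must correspond to some coherent prior) is exactly what the Savage-axioms discussion below is disputing.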

113y

It seems not. Sniffnoy's recent thread asked the very question as to whether Savage's axioms could really be justified by dutch book arguments.

213y

I was thinking of the simpler case of someone who has already assigned utilities as required by the VNM axioms for the noncontroversial case of gambling with probabilities that are relative frequencies, but refuses on philosophical grounds to apply the expected utility decision procedure to other kinds of uncertainty.
(I do think the statement still stands in general. I don't have a complete proof but Savage's axioms get most of the way there.)

213y

On the thread cited I gave a three-state, two-outcome counterexample to P2 which does just that. With only two outcomes, a utility function is obviously not an issue. (It can be extended with an arbitrary number of "fair coins", for example, to satisfy P6, which covers your actual frequency requirement here.)
My weak claim is that it is not vulnerable to "Dutch-book-type" arguments. My strong claim is that this behaviour is reasonable, even rational. The strong claim is being disputed on that thread. And of course we haven't agreed on any prior definition of reasonable or rational. But nobody has attempted to Dutch book me, and the weak claim is all that is needed to contradict your claim here.

013y

Sorry, I didn't check that thread for posts by you. I replied there.

513y

--Thomas Kuhn, The Structure of Scientific Revolutions

I was looking a little bit into this claim that Poincaré used subjective priors to help acquit Dreyfus. In a word, FAIL.

Poincaré's use of subjective priors was not a betrayal of his own principles because he needed to win, as someone above put it. He was granting his opponent's own hypothesis in order to criticise him. Strange that this point was not clear to whoever was researching it, given that the granting of the hypothesis was prefaced with a strong protest.

The court intervention in question was a report on Bertillon's calculations, by Poincaré with Appel and Darboux, « Examen critique des divers systèmes ou études graphologiques auxquels a donné lieu le bordereau » (discussed and quoted [here] ). It speaks for itself.

"Now this *a priori* probability, in questions such as the one before us, is formed solely of moral elements which escape calculation absolutely, and since we can calculate nothing without knowing it, all calculation becomes impossible. Thus Auguste Comte was quite right to say that the application of the calculus of probabilities to the moral sciences was the scandal of mathematics. To wish to eliminate the moral elements and substitute figures...

Maybe this is not news to people here, but in England, a judge has ruled against using Bayes' Theorem in court - unless the underlying statistics are "firm", whatever that means.

Dreyfus was still convicted, though his sentence was reduced, but the public was outraged and the president issued a pardon two weeks later.

???

Dreyfus spent years on a prison island and emerged looking rather the worse for wear. http://en.wikipedia.org/wiki/Dreyfus_affair

The trial discussed in this piece was his *second* trial, held after years of imprisonment.

On September 19, 1899, following a passionate campaign by his supporters, including leading artists and intellectuals like Émile Zola, Dreyfus was pardoned by President Émile Loubet and released from prison. He had been subjected to a second trial that year and again declared guilty of treason despite the evidence in favor of his innocence.

Can anybody give a URL or show a simple practical worked example similar to the applications described here? It all sounds awesome but I have little idea how to apply it to estimating the mass of Saturn and my artillery shelling is suffering somewhat.

Actually a fun example might be the probability that the Nickel/Hydrogen low energy fusion system being developed by Rossi is real or bogus. Points in favour: several tricky to fake successful demonstrations in front of scientists and the president of a skeptics society. Points against: no public disclosure of the secret catalyst, previous convictions for fraud, and cancelling the contract with the company that was going to manufacture the system.

When Nate Silver of FiveThirtyEight.com used Bayes to predict results of the November 2008 race, he correctly predicted the winner in 49 states, an unmatched record among pollsters.

Intrade got it equally right, and to be honest there's nothing particularly "Bayesian" about Nate Silver's methodology. It's just intelligently weighted average of polling data.

513y

I think the premise is that, if you are weighting the importance of polls based on how well the polls predicted past elections, you are using the spirit of Bayes, and the only consistent and correct way to do it mathematically is some form of Bayes itself.

313y

IIRC his weights were based on objective quality metrics like sample size and recency.

013y

When you say "objective quality metrics," how can they be determined to be such without using prior knowledge?

413y

For sample size, it's actually objectively measurable. For recency etc. you can just use your expert judgment and validate against data with ad hoc techniques.
Ask Nate Silver for details if you wish. He never indicated he has a big Bayesian model behind all that.
You reach a point very early where model uncertainty makes Bayesian methods no better than ad hoc methods.

213y

I don't mean to argue that Nate Silver had a "big Bayesian model behind all that." But if sample size and recency increase the reliability of polls, you can objectively measure how much they do and it seems that using Bayesian methods you could create an objectively best prior weighting system, which seems like the point that Vaniver was making.
I'm not immediately familiar with the math but it seems odd to me that it would be much more work to do a regression for a "best prior" than to come up with an ad hoc method, especially considering that "expert judgment" tends to be really bad (at least according to Bishop and Trout).
Of course, I should probably wait to disagree until he [Nate Silver] gets something wrong.
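The disputed procedure can be made concrete (my sketch, with made-up poll numbers): under a normal approximation, weighting polls by inverse sampling variance is simultaneously the standard frequentist pooled estimate and the Bayesian posterior mean under a flat prior, which is part of why both camps can claim the same method.

```python
import math

def combine_polls(polls):
    """polls: list of (candidate_share, sample_size) pairs.

    Weights each poll by the inverse of its binomial sampling variance
    and returns the pooled share and its approximate standard error.
    """
    weighted = []
    for share, n in polls:
        var = share * (1 - share) / n      # binomial sampling variance
        weighted.append((share, 1.0 / var))
    total_w = sum(w for _, w in weighted)
    mean = sum(s * w for s, w in weighted) / total_w
    return mean, math.sqrt(1.0 / total_w)

# Three hypothetical polls; larger samples get more weight automatically.
mean, se = combine_polls([(0.52, 600), (0.49, 1200), (0.53, 800)])
print(round(mean, 3), round(se, 4))   # roughly 0.509 with s.e. ~0.01
```

Recency weighting, by contrast, needs an extra modeling choice (how fast old polls decay), which is where the "expert judgment vs. fitted prior" disagreement above actually bites.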

213y

Theories describing reality at a deep level have problems such as unclear intellectual ownership of intelligent methods when the methods aren't clearly inspired by the theoretical tradition. It's a good problem to have.

So, he invented the 'ban', defined as "about the smallest change in weight of evidence that is directly perceptible to human intuition."

Wikipedia defines the ban as the amount of information in a decimal digit and makes the observation, due to I. J. Good, that a deciban is approximately the smallest intuitively perceptible change in evidence.

013y

Hmmm, McGrayne's book says that Turing thought the ban was about the smallest change in weight of evidence directly perceptible to human intuition. Maybe Good had a different intuition about the measure, later.

213y

I think a simple mistake is far more likely. A ban corresponds to an odds ratio of 10:1, clearly well within the range of perception.
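The unit conversion at issue is just a base-10 logarithm of the odds ratio; a quick sketch (function names are my own):

```python
import math

def bans(odds_ratio):
    """Weight of evidence in bans: the base-10 log of the odds ratio."""
    return math.log10(odds_ratio)

def odds_from_decibans(db):
    """Odds ratio corresponding to a weight of evidence in decibans."""
    return 10 ** (db / 10)

print(bans(10))               # 1.0 -> a full ban is 10:1 odds
print(odds_from_decibans(1))  # ~1.26 -> a deciban barely shifts the odds
```

This makes the point above visible: a full ban moves you from even odds to 10:1, which is far from imperceptible, while a deciban nudges the odds by only about 26%.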

Thanks! That was an interesting look at things - it's nice to know about the historical and sociological background of things. Seeing a similar account from a frequentist perspective would be nice, as this was quite one-sided, but I don't know which book should be summarized for that.

and discovered that accident rates had plummeted suddenly in the early 1890s. An historian suggested why: in 1989, the miners had formed a safety coalition.

I think one of these years is wrong.

313y

No doubt!
I'll fix the date thing; thanks.

213y

That would be Wicked!

Laplace's friend Bouvard used his method to calculate the masses of Jupiter and Saturn from a wide variety of observations. Laplace was so impressed that he offered his readers a famous bet: 11,000 to 1 odds that Bouvard's results for Saturn were within 1% of the correct answer, and a million to one odds for Jupiter. Nobody seems to have taken Laplace's bet, but today's technology confirms that Laplace should have won both bets.

How were they to determine the correct answer?

513y

Here are English translations of relevant papers of Laplace (search for "Bouvard"). Laplace appeals to the next century of data. It's not phrased as a bet there and while I can't address everything he wrote, older translations use "bet" where newer ones use "odds."
Newton used (his derivation of) Kepler's laws to relate the mass to the period of satellites; I think Bouvard did the same. Newton correctly calculated the mass of Jupiter because it has easily visible satellites, but was off by 15% for Saturn. Bouvard differed by 0.5% from Newton on Jupiter, but both were 2% off (Laplace was overconfident). Bouvard's error on Saturn was less than 0.5%. Since Laplace thought he should be less accurate on Saturn, it was probably luck. Bouvard also computed the mass of Uranus within 30% by looking at its effects on other planets. ETA: since Laplace says the mass includes the moons, it probably isn't computed from the orbits of the moons; perhaps it is the effect on other planets.

013y

Calculated from gravitational force.

Nice article, though parts seem needlessly politicised, as Kaj noted. Also seems it could use editing in places. Just one example -

were expected to collect statistics on all sorts of things: cholera victims, the chest sizes of soldiers, the number of French officials killed by kicking horses

Seems historically implausible, unless them were French horses..

713y

That's actually a famous example in statistics and probability, from Bortkiewicz's book on the Law of Small Numbers. Like Laplace and Bayes, Bortkiewicz did the heavy lifting on an idea named after someone else (in this case, the Poisson distribution). As Bortkiewicz was Polish, German, or Russian (depending on how you look at things), it doesn't make sense to lump him in with French officials.
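Bortkiewicz's horse-kick analysis is easy to reproduce; here is a sketch using the classic published counts (the figures below are the standard textbook presentation of his data, not taken from this thread):

```python
import math

def poisson_pmf(k, lam):
    """P(K = k) for a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

# Bortkiewicz's data: deaths by horse kick per corps per year,
# over 200 corps-years of Prussian army records.
observed = {0: 109, 1: 65, 2: 22, 3: 3, 4: 1}   # corps-years with k deaths
n = sum(observed.values())
lam = sum(k * c for k, c in observed.items()) / n   # MLE: 122/200 = 0.61

for k, c in observed.items():
    expected = n * poisson_pmf(k, lam)
    print(f"{k} deaths: observed {c:3d}, Poisson predicts {expected:5.1f}")
```

The fit is strikingly close (e.g. about 108.7 predicted vs. 109 observed zero-death corps-years), which is why this became the canonical "law of small numbers" example.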

313y

Oops. I think I started the sentence listing only things French officials were expected to do, and then added more things to the list as I kept reading without appropriately making the subject of the sentence more general. Fixed.

Alan Turing used it to decode the German Enigma cipher and arguably save the Allies from losing the Second World War; the U.S. Navy used it to search for a missing H-bomb and to locate Soviet subs; RAND Corporation used it to assess the likelihood of a nuclear accident; and Harvard and Chicago researchers used it to verify the authorship of the Federalist Papers.

I haven't seen any explanation of *how* these kinds of things were done, including calculations. Eliezer's Intuitive Explanation is good, of course, but the examples are very basic. Anything that ...

413y

Are you looking for Bayesian statistics in general or these specifics examples? My Bayesian statistics textbook recommendation is here.

013y

Thank you for the recommendation.

013y

You're welcome!

013y

That is helpful, thanks!
Not necessarily these specific examples, but some complex example.
I'm not sure if I would buy a textbook, but I would definitely read a link. Others likely fall into this category.

Excellent article and reminds me how little I really know about Bayesian techniques and statistics.

Maybe when you're done grounding morality and unraveling the human brain and developing friendly AI and disproving the existence of God, you can write a series of posts picking up where "Intuitive Explanation of Bayes Theorem" and "Technical Explanation of Technical Explanation" left off.

113y

Don't hold your breath :-)

113y

Lol! Who the heck is 'Napolean'?
Fixed, thanks.

Does anyone know anything about Bayesian statistics in academic political science? To put it mildly, political science has quite a number of open questions, and last I checked all of the statistical analysis in the field was frequentist. Political scientists spend a good chunk of their time sniping each other for getting the super-advanced frequentist statistics wrong. Maybe there's some room for basic Bayesian statistics to do some useful work?

613y

Andrew Gelman springs to mind.

113y

Yes, that fits the bill! Looks like a good thinker and a strong writer, too. Know anyone in comparative or international politics?

413y

Alas, no. I know of Gelman because of his Bayesian stats textbook, not because of his political science background. You could email and ask him directly -- he's been responsive to emails from me, and sometimes posts emails from others, along with his responses, to his blog.

013y

The sniping is called "training".

That was nicely written and fun to read. I might pick up that book.

A question: I found the odds ratio version of Bayes's theorem far more intuitive. Throughout history, has the equation ever been given as an odds ratio?

513y

E.T. Jaynes used it a fair amount. I don't know how much others have used it.

210y

I saw it put this way in a talk once. The talk was about integrating evidence from multiple sources to figure out if two biological macromolecules physically interact.
The reason, I think, is that this is a yes or no question. Most of the time, though, Bayes' theorem is used for numerical quantities: H means that a real world quantity X has a particular value x. But try to write it in the odds ratio form for this problem. You have to write probabilities given ~H, probabilities of the evidence just excluding a particular value of X, which is really awkward.
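For readers unfamiliar with it, the odds-ratio form under discussion fits in a few lines (a generic sketch; the function names are my own):

```python
def posterior_odds(prior_odds, likelihood_ratio):
    """Odds form of Bayes' theorem for a yes/no hypothesis H:

        O(H | E) = O(H) * P(E | H) / P(E | ~H)
    """
    return prior_odds * likelihood_ratio

def odds_to_prob(odds):
    """Convert odds in favor of H into a probability."""
    return odds / (1 + odds)

# A 1:3 prior combined with evidence 9x likelier under H than under ~H
# gives 3:1 posterior odds, i.e. 75% probability.
o = posterior_odds(1 / 3, 9.0)
print(o, odds_to_prob(o))
```

This also illustrates the point made above: the form is clean exactly because H has a single alternative ~H against which one likelihood ratio can be computed.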

The German codes, produced by Enigma machines with customizable wheel positions that allowed the codes to be changed rapidly, were considered unbreakable, so nobody was working on them.

That's not true. The Polish Cipher Bureau was (for obvious reasons) interested in cryptanalysis of the German encryption system. The Polish mathematicians Rejewski, Różycki and Zygalski had significant achievements: they broke Enigma and even made working replicas. However, their methods of analysis weren't Bayesian (as far as I remember) and relied on some weaknesses of German procedures.

012y

Luke's summary omits details. McGrayne does indeed cover the Polish efforts, and then about the British efforts:

Fisher's criticism of Bayesianism in *Statistical Methods for Research Workers* is rather pathetic--one of his justifications went along the lines of "since other intelligent people dismiss Bayesianism, there must be some reason to dismiss it." I would say that simple irrationality is insufficient to explain why clearly intelligent and experienced people would actively choose to ignore Bayesianism for such flimsy reasons. Instead, to explain the popularity of frequentism over Bayesianism, it is necessary to understand that scientists are motivate...

Finally, in 1983 the US Air Force sponsored a review of NASA's estimates of the probability of shuttle failure. NASA's estimate was 1 in 100,000. The contractor used Bayes and estimated the odds of rocket booster failure at 1 in 35. In 1986, *Challenger* exploded.

Who was the contractor?

But Tukey publicly denied Bayesianism. When working on the NBC projects, he said he wasn't using Bayes, instead he was "borrowing strength." He didn't allow anybody on his team to talk about their methods, either, saying it was proprietary information.

According to this paper, Tukey used the term "borrowing strength" to describe empirical Bayes techniques, which comprise an entirely different methodology than Bayesianism.

513y

Good-Turing estimation which was part of the Enigma project should also go under the empirical heading.

212y

In what sense is empirical Bayes - using the frequencies in initial data to set the original priors - "entirely" different from "Bayesianism", as opposed to be an interesting subset or variation?

012y

Empirical Bayes procedures can be shown to be robust to the distribution of the data in a way that Bayesian procedures cannot. The difference between empirical Bayes and Bayesianism along this important dimension makes them very distinct procedures from the perspective of many users.
This difference is most commonly seen in practice when some density must be estimated for inference. Use of kernel density estimation in empirical Bayes ensures asymptotic convergence to the true density at some rate. In contrast, no Bayesian prior has yet been developed with consistency for density estimation.
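To make the distinction concrete, here is a minimal empirical-Bayes sketch (my illustration, not snarles's example): the prior is estimated from the data itself, and each group's estimate is then shrunk toward it, rather than the prior being chosen before seeing any data.

```python
def eb_shrunk_rates(groups, prior_strength=10.0):
    """groups: list of (successes, trials) pairs. Returns shrunken rates.

    Centers a Beta(a, b) prior on the pooled rate estimated FROM the
    data (the empirical-Bayes step); `prior_strength` plays the role of
    a + b, i.e. how many pseudo-observations the prior is worth.
    """
    total_s = sum(s for s, n in groups)
    total_n = sum(n for s, n in groups)
    pooled = total_s / total_n               # prior mean, fit from data
    a = pooled * prior_strength
    b = (1 - pooled) * prior_strength
    # Standard beta-binomial posterior mean for each group.
    return [(s + a) / (n + a + b) for s, n in groups]

# A tiny group (1 of 2) is pulled strongly toward the pooled rate,
# while a large group (30 of 100) barely moves.
rates = eb_shrunk_rates([(1, 2), (30, 100)])
print(rates)
```

A fully Bayesian treatment would instead put a distribution on the prior's parameters in advance; estimating them from the data is precisely what gives empirical Bayes its different frequentist guarantees.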

Thanks very much for this summary of the book. Quick correction -- the second time you mention Dennis Lindley, you refer to him as David Lindley.

I enjoyed the book a lot; McGrayne has a good eye for the amusing details, and she conveys at least some of the intuition (although some graphs or examples would have helped the reader - I liked the flipping-coin illustrations in Sivia 2006, *Data Analysis: A Bayesian Tutorial*). It's also remarkably synoptic: I was repeatedly surprised by names popping up in the chronology, like BUGS, Bretthorst, Fisher's smoking papers, Diaconis, the actuarial use of Bayes, etc., and I have a better impression of Laplace and Good's many contributions. The math was very light, which und...

613y

No comment at this time on Kolmogorov, but the Poincaré example I found particularly sketchy. Clearly Bertillon and arguably also Poincaré were not trying to use mathematics to find the right answer, they had a right answer in mind and were trying to convince others of it. (Lies, damned lies, and statistics, as they say.)
In any case Poincaré was not putting forth a mathematical argument for innocence so much as destroying Bertillon's mathematical argument for guilt. Are we really sure he advocated the use of subjective priors here? (And again, the controversial point is not Bayes' Theorem itself, but the priors). So I would like to see more details on this.

Thanks for the interesting review. Sounds like a book I'll want to read.

Still... no mention of Jaynes and the method of maximum entropy? No mention of Cox's Theorem?

Cox is not mentioned. Jaynes is mentioned twice.

Are there any instances of Bayes failing, or achieving a less correct prediction in hindsight than a frequentist calculation? There must be a reason why frequentists exist.

[This comment is no longer endorsed by its author]

So begins Sharon McGrayne's fun new book, *The Theory That Would Not Die*, a popular history of Bayes' Theorem. Instead of reviewing the book, I'll summarize some of its content below. I skip the details and many great stories from the book, for example the (Bayesian) search for a lost submarine that inspired *The Hunt for Red October*. Also see McGrayne's Google Talk here. She will be speaking at the upcoming Singularity Summit, too, which you can register for here (price goes up after August 31st).

## Origins

In the 1700s, when probability theory was just a whiff in the air, the English Reverend Thomas Bayes wanted to know how to infer causes from effects. He set up his working problem like this: How could he learn the probability of a future event occurring if he only knew how many times it had occurred or not occurred in the past?

He needed a number, and it was hard to decide which number to choose. In the end, his solution was to just *guess*, and then improve his guess later as he gathered more information.

He used a thought experiment to illustrate the process. Imagine that Bayes has his back turned to a table, and he asks his assistant to drop a ball on the table. The table is such that the ball has just as much chance of landing at any one place on the table as anywhere else. Now Bayes has to figure out where the ball is, without looking.

He asks his assistant to throw another ball on the table and report whether it is to the left or the right of the first ball. If the new ball landed to the left of the first ball, then the first ball is more likely to be on the right side of the table than the left side. He asks his assistant to throw the second ball again. If it again lands to the left of the first ball, then the first ball is even *more* likely than before to be on the right side of the table. And so on.

Throw after throw, Bayes is able to narrow down the area in which the first ball probably sits. Each new piece of information constrains the area where the first ball probably is.

Bayes' system was: Initial Belief + New Data -> Improved Belief.

Or, as the terms came to be called: Prior + Likelihood of your new observation given competing hypotheses -> Posterior.

In each new round of belief updating, the most recent posterior becomes the prior for the new calculation.
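This updating loop is easy to simulate. The sketch below is illustrative only: the grid resolution, random seed, and number of throws are my own assumptions, not details from the book.

```python
import numpy as np

rng = np.random.default_rng(0)
true_pos = rng.uniform()               # where the first ball landed (unknown to Bayes)
grid = np.linspace(0.005, 0.995, 100)  # candidate positions for the first ball
posterior = np.full(100, 1 / 100)      # uniform prior: every position equally likely

for _ in range(50):
    new_ball = rng.uniform()
    landed_left = new_ball < true_pos
    # Likelihood of the assistant's report under each candidate position:
    # if the first ball were at x, a uniform throw lands to its left with probability x.
    likelihood = grid if landed_left else 1 - grid
    posterior = posterior * likelihood   # prior times likelihood
    posterior /= posterior.sum()         # normalize; this posterior is the next prior

estimate = grid[posterior.argmax()]      # most probable position after 50 reports
```

After fifty left/right reports the posterior concentrates near the true position, exactly as in Bayes' thought experiment.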

There were two enduring criticisms of Bayes' system. First, mathematicians were horrified to see something as whimsical as a *guess* play a role in rigorous mathematics. Second, Bayes said that if he didn't know what guess to make, he'd just assign all possibilities *equal* probability to start. For most mathematicians, this *problem of priors* was insurmountable.

Bayes never published his discovery, but his friend Richard Price found it among his notes after Bayes' death in 1761, re-edited it, and published it. Unfortunately, virtually no one seems to have read the paper, and Bayes' method lay cold until the arrival of Laplace.

## Laplace

By the late 18th century, Europe was awash in scientific data. Astronomers had observations made by the Chinese in 1100 BC, by the Greeks in 200 BC, by the Romans in AD 100, and by the Arabs in AD 1000. The data were not of equal reliability. How could scientists process all their observations and choose the best? Many astronomers simply *averaged* their three 'best' observations, but this was ad hoc. The world needed a better way to handle all these data.

Pierre-Simon Laplace, a brilliant young mathematician, came to believe that probability theory held the key, and he independently rediscovered Bayes' mechanism and published it in 1774. Laplace stated the principle not with an equation, but in words: the probability of a cause (given an event) is proportional to the probability of the event (given its cause). And for the next 40 years, Laplace used, extended, clarified, and proved his new principle.

In 1781, Richard Price visited Paris, and word of Bayes' earlier discovery eventually reached Laplace. Laplace was now all the more confident that he was on the right track.

He needed to test his principle, so he turned to the largest data set available: birth records. A few people had noticed that slightly more boys than girls were born, and Laplace wanted to know if this was an anomalous or constant phenomenon. He began by applying equal probability to his hunches, and then updated his belief as he examined data sets from Paris, from London, from Naples, from St. Petersburg, and from rural areas in France. Later he even asked friends for birth data from Egypt and Central America. Finally, by 1812, he was almost certain that the birth of more boys than girls was "a general law for the human race."

Laplace's friend Bouvard used his method to calculate the masses of Jupiter and Saturn from a wide variety of observations. Laplace was so impressed that he offered his readers a famous bet: 11,000 to 1 odds that Bouvard's results for Saturn were within 1% of the correct answer, and a million to one odds for Jupiter. Nobody seems to have taken Laplace's bet, but today's technology confirms that Laplace should have won both bets.

Laplace used his principle on the issue of testimony, both in court and in the Bible, and made famous progress in astronomy. When asked by Napoleon who authored the heavens, Laplace replied that natural law could explain the behavior of the heavens. Napoleon asked why Laplace had failed to mention God in his book on the subject. Laplace replied: "Sire, I have no need of that hypothesis."

The answer became a symbol of the new science: the search for natural laws that produced phenomena without the need to call upon magic in the explanation.

And then, Laplace derived the central limit theorem, which let him handle almost any kind of data. He soon realized that where large amounts of data were available, both the Bayesian approach and the frequentist approach to probability (judging an event's probability by how frequently it occurs among many observations) tended to produce the same results. (Only much later did scientists discover how wildly the two approaches can diverge even given lots of data.)

And so at age 62, Laplace — the world's first Bayesian — converted to frequentism, which he used for the remaining 16 years of his life.

...though he did finally realize what the general theorem for Bayes' method had to be:

P(C | E) = P(C) P(E | C) / Σᵢ P(Cᵢ) P(E | Cᵢ)

Which says that the probability of a hypothesis C *given* some evidence E equals our initial estimate of its probability, *times* the probability of the evidence given the hypothesis C, divided by the sum of the probabilities of the evidence under all possible hypotheses.

Basically, Laplace did all the hard work, and he deserves most of the honor for what we call Bayes' Theorem. But historical accidents happen, and the method is named after Bayes.
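A minimal worked instance of Laplace's rule, with hypothetical numbers chosen only to keep the arithmetic simple: two hypotheses about a coin, and one piece of evidence (the coin came up heads).

```python
# Laplace's general rule for discrete hypotheses:
#   P(C | E) = P(C) * P(E | C) / sum_i P(C_i) * P(E | C_i)
priors = {"fair coin": 0.5, "two-headed coin": 0.5}       # initial estimates
likelihoods = {"fair coin": 0.5, "two-headed coin": 1.0}  # P(heads | hypothesis)

evidence = sum(priors[c] * likelihoods[c] for c in priors)  # the denominator
posteriors = {c: priors[c] * likelihoods[c] / evidence for c in priors}
# posteriors["two-headed coin"] is 2/3: heads is twice as likely
# under the two-headed hypothesis, so its probability rises from 1/2 to 2/3.
```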

## The Decline of Bayes' Theorem

Empowered by Laplace's central limit theorem, government officials were expected to collect statistics on all sorts of things: cholera victims, the chest sizes of soldiers, the number of Prussian officers killed by kicking horses, and so on. But the idea that probability quantifies our ignorance was gone, replaced by the idea that the new science could not allow for anything 'subjective'. John Stuart Mill denounced probability as "ignorance... coined into science."

By 1891, the Scottish mathematician George Chrystal urged: "[Laplace's principle] being dead, [it] should be decently buried out of sight, and not embalmed in text-books and examination papers... The indiscretions of great men should be quietly allowed to be forgotten."

And thus, Bayes' Theorem fell yet again into disuse... at least among theoreticians. A smattering of practitioners continued to find it useful.

Joseph Bertrand was convinced that Bayes' Theorem was the only way for artillery officers to correctly deal with a host of uncertainties about the enemy's location, air density, wind direction, and more. From 1890 to 1935, French and Russian artillery officers used Bertrand's Bayesian textbook to fire their weapons.

When the French Jew Alfred Dreyfus was falsely accused of having sold a letter to a German military expert, France's famous mathematician Henri Poincaré was called to the stand. Poincaré was a frequentist, but when asked whether Dreyfus had written the letter, Poincaré invoked Bayes' Theorem as the only sensible way for a court of law to update a hypothesis with new evidence, and proclaimed that the prosecution's discussion of probability was nonsense. Dreyfus was still convicted, though his sentence was reduced, but the public was outraged and the president issued a pardon two weeks later.

Statisticians used Bayes' Theorem to set up a functioning Bell phone system, set up the United States' first working social insurance system, and solve other problems.

Meanwhile, the biologist R.A. Fisher was pioneering new randomization methods, sampling theory, tests of significance, analyses of variance, and a variety of experimental designs. In 1925 he published his revolutionary manual, *Statistical Methods for Research Workers*. The success of the book enshrined frequentism as the standard statistical method.

## Jeffreys

Even during its decline, a few people made progress on Bayesian theory. At about the same time, three men in three countries — Émile Borel, Frank Ramsey, and Bruno de Finetti — independently happened upon the same idea: knowledge *is* subjective, and we can quantify it with a bet. The amount we wager shows how strongly we believe something.

And then, the geophysicist Harold Jeffreys made Bayes' Theorem useful for scientists, proposing it as an alternative to Fisher's 'p-values' and 'significance tests', which depended on "imaginary repetitions." In contrast, Bayesianism considered data as fixed evidence. Moreover, the p-value is a statement about data, but Jeffreys wanted to know about his hypothesis *given* the data. He published the monumental *Theory of Probability* in 1939, which remained for many years the only explanation of how to use Bayes to do science.

For decades, Fisher and Jeffreys were the world's two greatest statisticians, though both were practicing scientists instead of theoreticians. They traded blows over probability theory in scientific journals and in public. Fisher was louder and bolder, and frequentism was easier to use than Bayesianism.

## Bayes at War

In 1941, German U-Boats were devastating allied naval forces. Britain was cut off from its sources of food, and couldn't grow enough on its own soil to feed its citizens. Winston Churchill said the U-boat problem was the scariest part of the war for him.

The German codes, produced by Enigma machines with customizable wheel positions that allowed the codes to be changed rapidly, were considered unbreakable, so nobody was working on them. This attracted Alan Turing to the problem, because he liked solitude. He built a machine that could test different code possibilities, but it was slow. The machine might need four days to test all 336 wheel positions on a particular Enigma code. Until more machines could be built, Turing had to find a way to reduce the burden on the machine.

He used a Bayesian system to guess the letters in an Enigma message, adding more clues as they arrived with new data. With this method he could reduce the number of wheel settings to be tested by his machine from 336 to as few as 18. But soon, Turing realized that he couldn't compare the probabilities of his hunches without a standard unit of measurement. So, he invented the 'ban', defined as "about the smallest change in weight of evidence that is directly perceptible to human intuition." This unit turned out to be very similar to the bit, the measure of information that Claude Shannon later discovered while working for Bell Telephone.

Now that he had a unit of measurement, he could target the amount of evidence he needed for a particular hunch and then stop the process when he had that much evidence.
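The ban is just the base-10 logarithm of a likelihood ratio, so independent clues add up and a stopping threshold becomes a simple sum. The sketch below uses hypothetical clue probabilities and a hypothetical 2-ban threshold; none of these numbers come from the book.

```python
import math

# Weight of evidence in "bans": log10 of the likelihood ratio for a hunch.
# Because independent clues multiply likelihood ratios, their bans simply add.
def bans(likelihood_if_true, likelihood_if_false):
    return math.log10(likelihood_if_true / likelihood_if_false)

# Hypothetical clues: each observation is 3x as likely under the right wheel setting.
clues = [bans(0.3, 0.1) for _ in range(5)]
total = sum(clues)        # about 2.39 bans of evidence
decision = total >= 2.0   # e.g. stop once 2 bans (odds of 100:1) are accumulated
```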

While Turing was cracking the Enigma codes in Britain, Andrey Kolmogorov was fleeing the German artillery bombardment of Moscow. In 1933 he had shown that probability theory can be derived from basic mathematical axioms, and now Russia's generals were asking him how best to fire back at the Germans. Though a frequentist, Kolmogorov recommended they use Bertrand's Bayesian firing system in a crisis like this.

Shortly after this, the British learned that the Germans were now using stronger, faster encryption machines: Lorenz machines. The British team used Turing's Bayesian scoring system and tried a variety of priors to crack the codes.

Turing visited America and spent time with Claude Shannon, whose brilliant insights about information theory came a bit later. Shannon realized that the purpose of information is to reduce uncertainty and the purpose of encryption is to increase it, and he was using Bayes for both. Basically, if the posterior in a Bayesian equation is very different from the prior, then much has been learned, but if the posterior is roughly the same as the prior, then the information content is low. Shannon's unit for information was the 'bit'.
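One simple way to put a number on "how different the posterior is from the prior" is the base-2 log of their ratio, measured in bits. This is a sketch of the idea, not Shannon's own formulation; the probabilities are illustrative.

```python
import math

# Information (in bits) gained about a hypothesis when its probability
# moves from prior to posterior: log2(posterior / prior).
def bits_gained(prior, posterior):
    return math.log2(posterior / prior)

no_news = bits_gained(0.5, 0.5)   # posterior equals prior: 0 bits, nothing learned
one_bit = bits_gained(0.25, 0.5)  # probability doubled: 1 bit of evidence in favor
```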

Meanwhile, Allied patrol planes needed to narrow their search for German U-boats. If 7 different listening posts intercepted the same message from the same U-boat, it could be located to somewhere in a circle 236 miles across. That's a lot of uncertainty, and mathematician Bernard Koopman was assigned to solve the problem. He wasn't bashful about Bayes at all. He said: "Every operation involved in search is beset with uncertainties; it can be understood quantitatively only in terms of... probability. This may now be regarded as a truism, but it seems to have taken the developments in operational research of the Second World War to drive home its practical implications."

Koopman started by assigning 50% probability to the hypothesis that a U-boat was inside the 236-mile circle, and then updated his probability as more data came in, apportioning plane flyover hours according to the probabilities of U-boat locations.
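The core update in Bayesian search is: an unsuccessful look at a region makes it less likely the target is there, but not certainly absent. The sketch below captures that idea with a hypothetical per-flyover detection rate; the 0.4 figure and three-flyover scenario are my own illustration, not Koopman's numbers.

```python
# Bayesian search update: if a region holds the U-boat with probability p,
# and a flyover spots it with probability d when it is there, then a flyover
# that finds nothing lowers p to p*(1-d) / (1 - p*d) by Bayes' Theorem.
def after_failed_search(p, d):
    return p * (1 - d) / (1 - p * d)

p = 0.5                    # Koopman's starting 50% for the circle
for _ in range(3):         # three empty flyovers at a hypothetical detection rate of 0.4
    p = after_failed_search(p, 0.4)
# p has fallen below 0.2: search effort should start shifting elsewhere
```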

And then, a few days after Germany's surrender, Churchill ordered the destruction of all evidence that decoding had helped win the war, apparently because the British didn't want the Soviets to know they could decrypt Lorenz codes. It wasn't until 1973 that the story of Turing and Bayes began to emerge.

## Revival

Its wartime successes classified, Bayes' Theorem remained mostly in the dark after the Second World War. Textbooks self-righteously dismissed Bayes. During the McCarthyism of the 1950s, one government statistician half-jokingly called a colleague "un-American because [he] was a Bayesian, ...undermining the United States Government."

In 1950, an economist preparing a report asked statistician David Blackwell (not yet a Bayesian) to estimate the probability of another world war in the next five years. Blackwell answered: "Oh, that question just doesn't make sense. Probability applies to a long sequence of repeatable events, and this is clearly a unique situation. The probability is either 0 or 1, but we won't know for five years." The economist replied, "I was afraid you were going to say that. I've spoken to several other statisticians, and they all told me the same thing."

Still, there were flickers of life. For decades after the war, one of Turing's American colleagues taught Bayes to NSA cryptographers. I.J. Good, one of Turing's statistics assistants, developed Bayesian methods and theory, writing about 900 articles about Bayes.

And then there was the Bible-quoting business executive Arthur Bailey.

Bailey was trained in statistics, and when he joined an insurance company he was horrified to see them using Bayesian techniques developed in 1918. They asked not "What should the new rates be?" but instead "How much should the present rates be changed?" But after a year of trying different things, he realized that the Bayesian actuarial methods worked better than frequentist methods. Bailey "realized that the hard-shelled underwriters were recognizing certain facts of life neglected by the statistical theorists." For example, Fisher's method of maximum likelihood assigned a zero probability to nonevents. But since many businesses don't file insurance claims, Fisher's method produced premiums that were too low to cover future costs.

Bailey began writing a paper about his change in attitude about Bayes. By 1950 he was vice president of a large insurance company in Chicago. On May 22 he read his famous paper at a black-tie banquet for an actuarial society. The title: 'Credibility Procedures: Laplace's Generalization of Bayes' Rule and the Combination of [Prior] Knowledge with Observed Data.'

Bailey praised his colleagues for standing mostly alone against the statistics establishment. Then he announced that their beloved Credibility formula was actually Bayes' Theorem, and in fact that the person who had published Bayes' work, Richard Price, would today be considered an actuary. He used Bayes' ball-and-table thought experiment to attack Fisher and his methods, and ended with a rousing call to put prior knowledge back into probability theory. His speech occupied theorists for years, and actuaries often see Bailey as taking their profession out of its dark ages.

That same year, I.J. Good published *Probability and the Weighing of Evidence*, which helped turn Bayes' Theorem into a logical, coherent methodology. Good was smart, quick, and by now perhaps the world's expert on codes. He introduced himself by holding out his hand and saying "I am Good." When the British finally declassified his cryptanalysis work, allowing him to reveal Bayes' success during WWII, he bought a vanity license plate reading 007 IJG.

In the 1950s, Dennis Lindley and Jimmie Savage worked to turn the statistician's hodgepodge of tools into a "respectable branch of mathematics," as Kolmogorov had done for probability in general in the 1930s. They found some success at putting statistics on a rigorous mathematical footing, but didn't realize at the time that they couldn't get from their theorems to the ad hoc methods of frequentism. Lindley said later, "We were both fools because we failed completely to recognize the consequences of what we were doing."

In 1954, Savage published *The Foundations of Statistics*, which built on Frank Ramsey's earlier attempts to use Bayes' Theorem not just for making inferences but for making decisions, too. His response to a classic objection to Bayesianism is worth remembering. He was asked, "If prior opinions can differ from one researcher to the next, what happens to scientific objectivity in data analysis?" Savage explained that as we gain data, subjectivists move into agreement, just as scientists come to consensus as evidence accumulates about, say, cigarettes causing lung cancer. When they have little data, scientists are subjectivists. When they have tons of data, they agree and become objectivists.

Savage became a messianic advocate of Bayesianism, but died suddenly of a heart attack in 1971. I.J. Good was active but working at a small university and was poor at public speaking. Dennis Lindley, however, moved to Britain and almost single-handedly created 10 Bayesian departments in the U.K. — professorship by professorship, battle by battle, he got Bayesians hired again and again. By 1977 he was exhausted and retired early.

## Medicine

In 1951, history major Jerome Cornfield used Bayes' Theorem to solve a puzzle about the chances of a person getting lung cancer. His paper helped epidemiologists to see how patients' histories could help measure the link between a disease and its possible cause. Moreover, he had begun to establish the link between smoking and lung cancer. Later efforts in England and the U.S. confirmed Cornfield's results.

Fisher and Neyman, the world's two leading anti-Bayesians, didn't accept the research showing that cigarettes caused lung cancer. Fisher, especially, published many papers disputing the link. He even developed the hypothesis that, somehow, lung cancer might cause smoking. But in 1959, Cornfield published a paper that systematically addressed every one of Fisher's arguments, and Fisher ended up looking ridiculous.

Cornfield went on to be involved in most of the major public health battles involving scientific data and statistics, and in 1974 was elected president of the American Statistical Association despite never having gotten any degree in statistics. He had developed a congenial spirit and infectious laugh, which came in handy when enduring long, bitter battles over health issues.

In 1979 he was diagnosed with pancreatic cancer, but his humor remained. A friend told him, "I'm so glad to see you." Smiling, Cornfield replied, "That's nothing compared to how happy *I* am to be able to see you." As he lay dying, he called to his two daughters and told them: "You spend your whole life practicing humor for the times when you really need it."

## Practical Use

Frequentist methods worked for repetitive, standardized phenomena like crops, genetics, gambling, and insurance. But business executives needed to make decisions under conditions of uncertainty, without sample data. And frequentism didn't address that problem.

At Harvard Business School, Robert Schlaifer thought about the problem. He realized that starting with prior information about demand for a product was better than nothing. From there, he realized that he could update his prior with new evidence, and independently arrived at Bayes' Theorem. Unaware of the literature, he reinvented Bayesian decision theory from scratch and began to teach it confidently. He did not think of it as 'an' approach. It was *the* approach, and everybody else was wrong, and he could *show* everybody else why they were wrong.

Later, he recruited Howard Raiffa to come work with him, because he needed another Bayesian to teach him more math. Together, the two invented the field of Decision-making Under Uncertainty (DUU). Schlaifer wrote the first practical textbook written entirely from a Bayesian perspective: *Probability and Statistics for Business Decisions* (1959). They introduced useful tools like decision trees, 'tree-flipping', and conjugate priors. They co-authored what would become the standard textbook of Bayesian statistics for two decades: *Applied Statistical Decision Theory*. Today, Bayesian methods dominate the business decision-making literature, but frequentists still have some hold on statistics departments.

Meanwhile, Frederick Mosteller spent a decade using early computers and hundreds of volunteers to painstakingly perform a Bayesian analysis of the disputed *Federalist Papers*, and concluded with high probability that they were all written by Madison, not Hamilton. The work impressed many statisticians, even frequentists.

Bayes had another chance at fame during the 1960 presidential race between Nixon and Kennedy. The race was too close to call, but the three major TV networks all wanted to be the first to make the correct call. NBC went looking for someone to help them predict the winner, and they found Princeton statistics professor John Tukey. Tukey analyzed huge amounts of voting data, and by 2:30am on election night Tukey and his colleagues were ready to call Kennedy as the winner. The pressure was too much for NBC to make the call, though, so they locked Tukey and his team in a room until 8am, when it was clear Kennedy was indeed the winner. NBC immediately asked him to come back for the 1962 election, and Tukey worked with NBC for 18 years.

But Tukey publicly disclaimed Bayesianism. When working on the NBC projects, he said he wasn't using Bayes; instead, he was "borrowing strength." He didn't allow anybody on his team to talk about their methods, either, saying it was proprietary information.

In 1980, NBC switched to exit polling to predict elections. Exit polling was more visual, chatty, and fun than equations. It would be 28 years before someone used Bayes to predict presidential election results. When Nate Silver of FiveThirtyEight.com used Bayes to predict results of the November 2008 race, he correctly predicted the winner in 49 states, an unmatched record among pollsters.

When the U.S. Atomic Energy Commission ordered a safety study of nuclear power plants, they hired Norman Rasmussen. At the time, there had never been a nuclear power plant accident. He couldn't use frequentist methods to estimate the probability of something that had never happened. So he looked to two sources: equipment failure rates, and expert opinion. But how could he combine those two types of evidence?

Bayes' Theorem, of course. But Rasmussen knew that Bayes was so out of favor that his results would be dismissed by the statistics community if he used the word 'Bayes'. So he used Raiffa's decision trees, instead. They were grounded in Bayes, but this way he didn't have to use the word 'Bayes.'

Alas, the report's subjectivist approach to statistics was roundly damned, and the U.S. Nuclear Regulatory Commission withdrew its support for the study five years later. And two months after they did so, the Three Mile Island accident occurred.

Previous experts had said the odds of severe core damage were extremely low, but the effects would be catastrophic. Instead, the Rasmussen report had concluded that the probability of core damage was higher than anticipated, but the consequences wouldn't be catastrophic. The report also identified two important sources of the problem: human error and radioactivity outside the building. In the eyes of many, the report had been vindicated.

Finally, in 1983 the US Air Force sponsored a review of NASA's estimates of the probability of shuttle failure. NASA's estimate was 1 in 100,000. The contractor used Bayes and estimated the odds of rocket booster failure at 1 in 35. In 1986, *Challenger* exploded.

## Victory

Adrian Raftery examined a set of statistics about coal-dust explosions in 19th-century British mines. Frequentist techniques had shown that coal-mining accident rates changed gradually over time. Out of curiosity, Raftery experimented with Bayes' Theorem, and discovered that accident rates had plummeted suddenly in the early 1890s. A historian suggested why: in 1889, the miners had formed a safety coalition.

Frequentist statistics worked okay when one hypothesis was a special case of another, but when hypotheses competed and the data contained abrupt changes, frequentism didn't work. Many sociologists were ready to give up on p-values already, and Raftery's short 1986 paper on his success with Bayes led many sociologists to jump ship to Bayesianism. Raftery's paper is now one of the most cited in sociology.

One challenge had always been that Bayesian statistical operations were harder to calculate, and computers were still quite slow. This changed in the 90s, when computers became much faster and cheaper than before, and especially with the spread of the Markov Chain Monte Carlo method, which suddenly allowed Bayesians to do a *lot* more than frequentists could. The BUGS program also helped.

These advances launched the 'Bayesian revolution' in a long list of fields: medical diagnosis, ecology, geology, computer science, artificial intelligence, machine learning, genetics, astrophysics, archaeology, psychometrics, education performance, sports modeling, and more. This is only partly because Bayes' Theorem shows us the mathematically correct response to new evidence. It is also because Bayes' Theorem *works*.
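To give a flavor of why Markov Chain Monte Carlo mattered: it lets you draw samples from a posterior you can only evaluate up to a normalizing constant. Below is a minimal Metropolis sampler, the simplest MCMC variant; the target density, proposal width, seed, and sample count are all illustrative assumptions, not anything from the book.

```python
import math
import random

# Target: a posterior known only up to a constant, here proportional to a
# Beta(3, 2)-shaped density on (0, 1), standing in for some real posterior.
def unnormalized_posterior(x):
    return x**2 * (1 - x) if 0 < x < 1 else 0.0

random.seed(0)
x, samples = 0.5, []
for _ in range(20000):
    proposal = x + random.gauss(0, 0.2)                  # propose a nearby point
    accept_prob = min(1.0, unnormalized_posterior(proposal)
                           / unnormalized_posterior(x))  # Metropolis acceptance rule
    if random.random() < accept_prob:
        x = proposal                                     # move; otherwise stay put
    samples.append(x)

mean = sum(samples) / len(samples)   # should approach the Beta(3, 2) mean of 0.6
```

The normalizing constant cancels in the acceptance ratio, which is exactly what makes the method practical for Bayesian posteriors that are easy to write down but hard to integrate.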