[epistemic status: I am not a statistician, so I've probably made errors in this post. I welcome any corrections on the merits of the post, and I'm even open to removing it if the whole idea is proven stupid. However, I'd like to preserve the visual form of the post, as the sole reason to write this article in the first place was that I couldn't find this visual analogy anywhere, and I personally find it very helpful.]

In this article I'd like to verbalize and, more importantly, visualize some intuitions about what a Confidence Interval really means when a Frequentist or a Bayesian uses the term, and how it can come to have more than one meaning.

Let me start with a toy example which is hopefully simple enough that, even if Frequentists and Bayesians disagree about what "probability" really is, they can at least agree that there appear to be two random events involved in the following two steps:

  1. We first roll a 6-sided die to generate a number N (1, 2, 3, 4, 5 or 6).
  2. We then flip N fair coins and count the number of heads, H.
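
Here's a minimal sketch of this two-step process in code, just to make the setup concrete (the function and variable names are my own, nothing about them is essential to the argument):

```python
import random

def one_trial(rng):
    """Step 1: roll a 6-sided die to get N. Step 2: flip N fair coins and count heads H."""
    n = rng.randint(1, 6)
    h = sum(rng.random() < 0.5 for _ in range(n))
    return n, h

rng = random.Random(0)
print([one_trial(rng) for _ in range(5)])   # a handful of (N, H) pairs
```
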

Now suppose that we only know H, but are interested in guessing what the value of N was. Hopefully you can see how this is a very oversimplified variant of a real-world situation, where you try to deduce the value of some real-world parameter given just a noisy observation - here H plays the role of a noisy observation of the real-world parameter N (on average, H is N/2).

This example is simple enough that we can count on our fingers how often each possible combination of N and H "should" occur in an ideal world: for a given N, the number of ways to choose which H of the coins land heads is by definition $\binom{N}{H}$, and there are $2^N$ equally likely sequences of heads and tails of length N, so the said probability is $\frac{1}{6} \cdot \binom{N}{H} / 2^N$, which can be presented as a grid:

|     | H=0  | H=1  | H=2  | H=3  | H=4  | H=5  | H=6  |
|-----|------|------|------|------|------|------|------|
| N=6 | 0.3% | 1.6% | 3.9% | 5.2% | 3.9% | 1.6% | 0.3% |
| N=5 | 0.5% | 2.6% | 5.2% | 5.2% | 2.6% | 0.5% |      |
| N=4 | 1.0% | 4.2% | 6.3% | 4.2% | 1.0% |      |      |
| N=3 | 2.1% | 6.3% | 6.3% | 2.1% |      |      |      |
| N=2 | 4.2% | 8.3% | 4.2% |      |      |      |      |
| N=1 | 8.3% | 8.3% |      |      |      |      |      |

Above, each row corresponds to a particular result of the die roll in the first step of the process, and thus the numbers in each row sum to 1/6 of the total number of trials. In an "ideal" ("fair") world, repeating the experiment 6·64 = 384 times, we'd expect each particular N to occur exactly 64 times, and a specific combination of N and H to occur $64 \cdot \binom{N}{H} / 2^N$ times, which conveniently is always a natural number:

|     | H=0 | H=1 | H=2 | H=3 | H=4 | H=5 | H=6 |
|-----|-----|-----|-----|-----|-----|-----|-----|
| N=6 | 1   | 6   | 15  | 20  | 15  | 6   | 1   |
| N=5 | 2   | 10  | 20  | 20  | 10  | 2   |     |
| N=4 | 4   | 16  | 24  | 16  | 4   |     |     |
| N=3 | 8   | 24  | 24  | 8   |     |     |     |
| N=2 | 16  | 32  | 16  |     |     |     |     |
| N=1 | 32  | 32  |     |     |     |     |     |
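
For completeness, here's a short sketch (my own code, not anything from the post's original tooling) that reproduces the table of counts above from the formula $64 \cdot \binom{N}{H} / 2^N$:

```python
from math import comb

TRIALS = 384  # 6 * 64, chosen so that every cell is a whole number

for n in range(6, 0, -1):                    # rows, N = 6 down to 1
    counts = [TRIALS // 6 * comb(n, h) // 2**n for h in range(n + 1)]
    print(f"N={n}: {counts}")                # e.g. N=6: [1, 6, 15, 20, 15, 6, 1]
```
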

We can see that, for a single observed value of H, there might be multiple values of N which could lead to it, so there is no way to determine N from H with certainty.

What we can do for a given H, though, is try to provide a 90%-Confidence Interval: a range of possible values of N which could lead to this particular H "in 90% of cases". I put this in scare quotes because what such a phrase really means is not at all clear, and Frequentists might mean something different by it than Bayesians do. We can make it more precise (the same way you can clarify whether by "sound" you mean "auditory experiences" or "waves of pressure") to dissolve the disagreement, and I will soon try to provide visual intuitions about what the two sides mean by the phrase, and how it could be described in a more precise way. For now, let's just say that both sides agree that among the several things the phrase "in 90% of cases" implies must be this one: that the procedure assigning an interval of values of N to an observed value of H, followed to the letter for a million repeats of this two-phase experiment, will produce an interval which indeed contains N roughly 900k times and misses it roughly 100k times. They just "mysteriously" disagree about what this procedure should be, and thus output different intervals (because they differ about the other implications this phrase should have).

For example, if we know that H=1, then a Bayesian might say that this narrows the possibilities down to the single column H=1 of the grid:

|     | H=0 | H=1 | H=2 | H=3 | H=4 | H=5 | H=6 |
|-----|-----|-----|-----|-----|-----|-----|-----|
| N=6 | 1   | 6   | 15  | 20  | 15  | 6   | 1   |
| N=5 | 2   | 10  | 20  | 20  | 10  | 2   |     |
| N=4 | 4   | 16  | 24  | 16  | 4   |     |     |
| N=3 | 8   | 24  | 24  | 8   |     |     |     |
| N=2 | 16  | 32  | 16  |     |     |     |     |
| N=1 | 32  | 32  |     |     |     |     |     |

which contains 6+10+16+24+32+32 = 120 cases. So the range 1-5, which covers 10+16+24+32+32 = 114 of those possibilities, should contain the real value of N in 114/120 = 95% of cases, and thus is a 95%-Confidence Interval for H=1. Similarly, the range 1-4, which covers 16+24+32+32 = 104 possibilities, should contain the correct value in 104/120 ≈ 86.7% of cases, and thus is an 86.7%-Confidence Interval. But what would be the 90%-Confidence Interval here? Well, real-world problems involving Confidence Intervals deal with distributions over real numbers, and as I've warned you, the above toy example is oversimplified. If N were a continuous variable, we could pick a range which ends somewhere between 4 and 5 and covers precisely 90%. So, in order to proceed, I need to increase the resolution of the grid: if I want to use grids of pixels as an intuition pump, the pixels have to be at least small enough that adding or removing one pixel from the Confidence Interval causes only a small change, so that it's actually possible to cover at least approximately 90% of cases, as opposed to jumping from 86.7% straight to 95%.

For this reason, let me just adjust the first step of our experiment: let's add 6 to the outcome of the die roll, so N will be as random as before, but now in the range 7-12 instead of 1-6. Now H can be any natural number in the range 0-12, which increases the "horizontal" resolution of our grid. For N=12 there are $2^{12} = 4096$ possible sequences of heads and tails, which makes the values in the cells at the boundary relatively small for our purposes.

Repeating the new experiment 6·4096 = 24576 times, we expect each value of N to appear 4096 times, and the number of occurrences of each particular combination of N and H is given by the grid:

|      | H=0 | H=1 | H=2 | H=3  | H=4  | H=5  | H=6 | H=7  | H=8 | H=9 | H=10 | H=11 | H=12 |
|------|-----|-----|-----|------|------|------|-----|------|-----|-----|------|------|------|
| N=12 | 1   | 12  | 66  | 220  | 495  | 792  | 924 | 792  | 495 | 220 | 66   | 12   | 1    |
| N=11 | 2   | 22  | 110 | 330  | 660  | 924  | 924 | 660  | 330 | 110 | 22   | 2    |      |
| N=10 | 4   | 40  | 180 | 480  | 840  | 1008 | 840 | 480  | 180 | 40  | 4    |      |      |
| N=9  | 8   | 72  | 288 | 672  | 1008 | 1008 | 672 | 288  | 72  | 8   |      |      |      |
| N=8  | 16  | 128 | 448 | 896  | 1120 | 896  | 448 | 128  | 16  |     |      |      |      |
| N=7  | 32  | 224 | 672 | 1120 | 1120 | 672  | 224 | 32   |     |     |      |      |      |

"Vertical" algorithm

One algorithm to provide a 90%-Confidence Interval for a given H is to look at the particular column and find a subset of values in this column which holds roughly 90% of the column's sum. As we saw earlier, this might be impossible to do exactly in our oversimplified world, so let's try to find a subset which has at least 90%. It would be trivial to get "at least 90%" by simply reporting the whole column, which certainly contains the value, so additionally let's try to find a subset containing as few values of N as possible, striving for precision. And we want not just any subset, but a range, as ranges are easier to grasp and work with. Still, there could be several such "shortest ranges" if there are multiple "humps" in the sequence of numbers in a given column. However, in our toy example we are lucky: each column is a bitonic sequence of numbers, which means there is at most one such "hump", as the numbers (optionally) rise and then (optionally) fall at most once. So, one way to find the smallest subset with at least 90% coverage is to start with the whole column and repeatedly remove the pixel with the smallest value (which for a bitonic sequence is always the top-most or bottom-most pixel of our vertical stripe) until we can no longer remove any more without dropping below 90%. Here's the result of this procedure applied to each column:

|      | H=0 | H=1 | H=2 | H=3  | H=4  | H=5  | H=6 | H=7  | H=8 | H=9 | H=10 | H=11 | H=12 |
|------|-----|-----|-----|------|------|------|-----|------|-----|-----|------|------|------|
| N=12 | 1   | 12  | 66  | 220  | 495  | 792  | 924 | 792  | 495 | 220 | 66   | 12   | 1    |
| N=11 | 2   | 22  | 110 | 330  | 660  | 924  | 924 | 660  | 330 | 110 | 22   | 2    |      |
| N=10 | 4   | 40  | 180 | 480  | 840  | 1008 | 840 | 480  | 180 | 40  | 4    |      |      |
| N=9  | 8   | 72  | 288 | 672  | 1008 | 1008 | 672 | 288  | 72  | 8   |      |      |      |
| N=8  | 16  | 128 | 448 | 896  | 1120 | 896  | 448 | 128  | 16  |     |      |      |      |
| N=7  | 32  | 224 | 672 | 1120 | 1120 | 672  | 224 | 32   |     |     |      |      |      |

For example, this procedure applied to H=8 started with the range spanning all values of N which can produce H=8 at all, i.e. 8-12, and then decided to remove the pixel <N=8, H=8>, because it contains the smallest value in this column, 16. Then it removed the pixel with value 72. And then it stopped, because 495+330+180 is about 92% of the sum of the numbers in this column (495+330+180+72+16), but removing the next smallest value, 180, would cause a drop to about 75%, and we wanted an (at least) 90%-Confidence Interval, so we stop.

The gray area of each column contains at least 90% of that column's cases, and thus the gray area of the whole grid also has this property - if we were to run this experiment, we'd end up with a pair of N and H falling in the gray area at least 90% of the time. Therefore, if you commit yourself to always producing the sentence "For the value of H you revealed to me, I bet N is in the gray area of the column for said H", you'd expect to be right at least 90% of the time.
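
Here's a minimal sketch of this column-wise trimming (my own code, nothing official; the grid is the table above, the 90% level is a parameter, and ties between equally small pixels are broken by dropping the one with the larger label first):

```python
# Counts out of 24576 from the grid above; row N maps to its counts for H = 0..N.
GRID = {
    7:  [32, 224, 672, 1120, 1120, 672, 224, 32],
    8:  [16, 128, 448, 896, 1120, 896, 448, 128, 16],
    9:  [8, 72, 288, 672, 1008, 1008, 672, 288, 72, 8],
    10: [4, 40, 180, 480, 840, 1008, 840, 480, 180, 40, 4],
    11: [2, 22, 110, 330, 660, 924, 924, 660, 330, 110, 22, 2],
    12: [1, 12, 66, 220, 495, 792, 924, 792, 495, 220, 66, 12, 1],
}

def trim(cells, level=0.90):
    """cells: (label, count) pairs forming one column (or row) of the grid.
    Repeatedly drop the smallest remaining count while what is left still holds
    at least `level` of the total mass; for a bitonic stripe this always peels
    off one of the two ends, so the kept labels form a contiguous range."""
    order = sorted(cells, key=lambda lc: (lc[1], -lc[0]))  # smallest count first
    total = sum(c for _, c in order)
    kept = total
    while len(order) > 1 and kept - order[0][1] >= level * total:
        kept -= order.pop(0)[1]
    return sorted(label for label, _ in order)

# "Vertical" algorithm: trim each column H separately.
for h in range(13):
    column = [(n, row[h]) for n, row in GRID.items() if h < len(row)]
    print(f"H={h}: N in {trim(column)}")   # e.g. H=8 gives [10, 11, 12], as in the example above
```
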

"Horizontal" algorithm

Perhaps this all sounds so obvious and natural that you're asking yourself: sure, what other way could you possibly approach a problem like this one? So, here's another algorithm, which focuses on rows instead of columns. For each row, we will mark a range of cells which covers at least 90% of the mass of this row. Again, in our simple toy problem this is easy, because each row is also bitonic, so we can reuse the same approach: start from the whole row and remove the smallest pixels one by one, until you have to stop. Unsurprisingly, this leads to a very different picture:

|      | H=0 | H=1 | H=2 | H=3  | H=4  | H=5  | H=6 | H=7  | H=8 | H=9 | H=10 | H=11 | H=12 |
|------|-----|-----|-----|------|------|------|-----|------|-----|-----|------|------|------|
| N=12 | 1   | 12  | 66  | 220  | 495  | 792  | 924 | 792  | 495 | 220 | 66   | 12   | 1    |
| N=11 | 2   | 22  | 110 | 330  | 660  | 924  | 924 | 660  | 330 | 110 | 22   | 2    |      |
| N=10 | 4   | 40  | 180 | 480  | 840  | 1008 | 840 | 480  | 180 | 40  | 4    |      |      |
| N=9  | 8   | 72  | 288 | 672  | 1008 | 1008 | 672 | 288  | 72  | 8   |      |      |      |
| N=8  | 16  | 128 | 448 | 896  | 1120 | 896  | 448 | 128  | 16  |     |      |      |      |
| N=7  | 32  | 224 | 672 | 1120 | 1120 | 672  | 224 | 32   |     |     |      |      |      |
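
The same `trim` helper can generate this picture too (this continues the sketch from the "vertical" section, so `GRID` and `trim` are assumed to be already defined); we apply it row-wise, and then read off, for a given H, which rows kept that H gray. With the tie-breaking rule above (drop the pixel with the larger H first), the results happen to reproduce the arbitrary choices described in the next paragraph:

```python
# "Horizontal" algorithm: trim each row N, remembering which values of H stay gray.
gray = {n: trim(list(enumerate(row))) for n, row in GRID.items()}

def horizontal_interval(h):
    """All N whose gray stripe happens to include the observed H."""
    return sorted(n for n, kept in gray.items() if h in kept)

print(horizontal_interval(8))   # [11, 12] - compare with the "vertical" answer for H=8
print(horizontal_interval(9))   # []       - no row keeps H=9; see below
```
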

For example, for N=10, I decided to remove 4, 4, 40, 40, and 180 (there were two pixels with value 180 to choose from, so I arbitrarily removed the right one), which kept 3828/4096 ≈ 93.5% of the row. And I had to stop there, because removing the other 180 would cause a drop to 3648/4096 ≈ 89.1%. Another nice property of our toy problem (and not of the algorithm itself) is that the gray areas chosen this "horizontal" way produce exactly one gray stripe in each column. In other words, for an arbitrary grid of pixel values, the very same procedure might create a jagged saw, such that some column H contains an alternating pattern of gray and white pixels which can't be interpreted as a single Confidence Interval. But for problems "regular enough" (I believe the required property is that the grid is bitonic along both axes), we can use the gray area produced by this "horizontal" approach to answer questions like "What is the 90%-Confidence Interval for H=8?". It turns out to be 11-12. Which is a different answer than the one we saw previously (10-12). This perhaps is not puzzling in itself - after all, this algorithm could simply be wrong - but what puzzled me, at least, was that this way of answering questions, if you stick to it, also produces a correct interval 90% of the time, for essentially the same reason as before: if gray pixels constitute at least 90% of the mass of each row, then they also constitute at least 90% of the whole grid. Even though this "horizontal" algorithm gives different answers. Even though it produces the clearly wrong answer "N belongs to an empty set" for H=9! How to reconcile this? How can there be two completely different answers to the same question which are "equally good"? Sure, we could imagine that "$11" and "$9" are "equally good" answers to the question "What is the value of this $10 bill?", but if one person says "$11" and the other says "there's no bill", or "$2", how can they both be "equally good"?

Well, they are "equally good" in one sense, but not in others. The sense in which they are "equally good" is precisely this: if you were to commit to following just one of this two ways throughout whole your life, and you expect to see a million instances of this experiment, then you'll produce an interval containing N roughly the same number of times - none of these methods seems better if your goal is to minimize number of wrong intervals.

But if minimizing the number of mistakes were your only goal, then you could always answer 7-12, just to be certain, or 0-100, to be extra sure, even though you were asked for a 90%-Confidence Interval. Clearly, striving for a short interval matters for practical applications, and perhaps one of the methods produces shorter intervals on occasions where the other produces longer ones, and vice versa? This comforting thought perhaps brings some sanity, so let's linger on it to see if it must hold. We know that the first algorithm we saw produced the shortest "possible" vertical stripes, but we also know that the other algorithm sometimes reports even shorter ones - such a shorter range must then hold less than 90% of the mass of its column. If this were the case in all columns, then the "horizontal" algorithm would cover less than 90% of the mass of the whole grid, and we've already proved it covers at least 90% - so there must be columns where the "horizontal" approach reports ranges longer than the "vertical" algorithm does, to restore the balance.

Just for fun, consider a "crazy algorithm" which simply picks any boundary pixels at random, so that they cover at most 10% of the mass, without any particular rhyme or reason, and colors all the remaining pixels gray. If we commit ourselves to always consult this post-modern picture to answer queries about 90%-Confidence Intervals for instances of this experiment, we would still be doing "equaly good". To me this was a bit unnerving realization, as trusting approach developed by renoved Bayesians or Frequentists is a one thing, but if a "crazy algorithm" gives "equally good" results, then something must be missing in the definition of "good".

So, in what sense the "horizontal" (Frequentist) approach is "better" than the "vertical" (Bayesian) and in what sense Bayesian is better than Frequentist?  Or to dissolve the paradox: what is the real problem each is trying to answer, and what is the real meaning of "90% of the time" each one uses?

In the Bayesian framework, probabilities measure your information about the world. There are many theoretically possible states of the world (pairs of N and H) which match the observed reality (H), and a Bayesian will try to split the fixed budget of 100% between these states, according to the information they have gained so far about the situation. So they may assign 0% to states the current info already precludes (H=8 means N can't be 7), or even assign 100% if they're certain (H=12 means N must be 12), or split evenly if genuinely unsure (as is the case before they observe H). However, if they have information about the world (which includes the design of the experiment, number theory, and the observed value of H), they can usually do better than splitting it evenly, because once you know (from the design of the experiment and math) that <H=8, N=12> happens more often than <H=8, N=10>, and that the observed value of H is 8, then the probability a Bayesian can assign to N=12 is larger than the one for N=10. As described so far, this still leaves a lot of freedom in assigning actual numerical values, as there are many ways to assign an increasing sequence of probabilities to the values of N, so it's important to realize that there is a goal in assigning these values, which restricts the choice very much: the goal is winning. Let's think about a bet between the Bayesian and Reality: first the Bayesian commits to one probability distribution over N for each possible value of H, which forms a grid of numbers depicting their way of reasoning. Then Reality picks one of the pixels it finds too low or too high, and says accordingly something like:

"You say that given H=8, you estimate the probability of N=10 to be 15%, and I think it is too low, therefore here's my offer: from now on, each time the experiment result is H=8, I'll give you $16 dollars unconditionally, yet if N happens to be 10, you'll pay me $100. Clearly from your point of view it's a sweet deal, as on average, after 100 such iterations you'd expect $16*100-15*$100=$100 gain, what you say?"

(A similar bet can be constructed if the value in the grid was too large.) If the Bayesian didn't nail down the nature of reality correctly (I'm trying to avoid the word "probability"), they will lose money. Say the true frequency is actually 16.5%, not 15%: then $16·100 - 16.5·$100 = -$50.
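
Just to double-check that arithmetic with a throwaway snippet (the 16.5% here is the hypothetical "true" frequency from the sentence above):

```python
def expected_gain(upfront, payout, true_freq, rounds=100):
    """The Bayesian's expected net result over `rounds` occurrences of H=8:
    they receive `upfront` every time, and pay `payout` whenever N turns out to be 10."""
    return upfront * rounds - payout * true_freq * rounds

print(expected_gain(16, 100, 0.15))    # about +100 dollars: a sweet deal if 15% is right
print(expected_gain(16, 100, 0.165))   # about -50 dollars: a steady loss if the truth is 16.5%
```
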

So, when a Bayesian produces the 90%-Confidence Interval the way I've described as the "vertical" algorithm, they might be thinking in terms of bets - they announce an offer to accept a bet: to pay $90 unconditionally to interested parties if, in return, they'll pay back $100.01 in case N is in the interval produced by the Bayesian. But not only that. That would be too easy: a person who simply always reports the range 1-10000 could also make such a "bold" claim :) They additionally accept bets in the opposite direction: to pay $10 unconditionally to any interested party if, in return, they'll pay back $100.01 in case the real N is not in the range. These two offers together are a much more reliable signal of merit. (Note: as we saw earlier, to bring these two bounds closer together we need a more and more fine-grained grid - this whole post is a parable about 2-D continuous functions, which to me are easier to grasp using imperfect discrete grids.)

But Frequentists could make these two claims, too! As we saw, the "horizontal" algorithm also covers approximately 90% of the cases, and thus fails to cover approximately 10% of the cases. So we are not done yet - the difference must be somewhere else.

Bayesians can make a bolder claim: the interested party may decide, for each column independently, which variant of the bet they prefer - the one in which the Bayesian pays $90 up front but wants $100 back for a "gray" N, or the one in which the Bayesian pays $10 but wants $100 back in case of a "white" N. This is something Frequentists are more reluctant to do, for as we've observed, they disagree with Bayesians about which range should be "gray" in some columns, and sometimes the "gray" area seems too heavy (at least to Bayesians) and sometimes the "white" area does. If two people disagree, then at most one of them has nailed down reality. And if your range didn't nail down reality, then Reality can pick the "too heavy" side of the bet. And while we haven't yet seen a proof that Bayesians nailed down the range, they were at least trying, by using the "vertical" algorithm, which seems to be doing something that might work, while the "horizontal" approach of the Frequentists doesn't even deal with columns and the matter of this new bet directly. I don't assume you buy the Bayesian approach wholesale (yet), but please take one more look at the grid produced by the "horizontal" algorithm:

|      | H=0 | H=1 | H=2 | H=3  | H=4  | H=5  | H=6 | H=7  | H=8 | H=9 | H=10 | H=11 | H=12 |
|------|-----|-----|-----|------|------|------|-----|------|-----|-----|------|------|------|
| N=12 | 1   | 12  | 66  | 220  | 495  | 792  | 924 | 792  | 495 | 220 | 66   | 12   | 1    |
| N=11 | 2   | 22  | 110 | 330  | 660  | 924  | 924 | 660  | 330 | 110 | 22   | 2    |      |
| N=10 | 4   | 40  | 180 | 480  | 840  | 1008 | 840 | 480  | 180 | 40  | 4    |      |      |
| N=9  | 8   | 72  | 288 | 672  | 1008 | 1008 | 672 | 288  | 72  | 8   |      |      |      |
| N=8  | 16  | 128 | 448 | 896  | 1120 | 896  | 448 | 128  | 16  |     |      |      |      |
| N=7  | 32  | 224 | 672 | 1120 | 1120 | 672  | 224 | 32   |     |     |      |      |      |

where the numbers in bold are the ones for which the "interested party" will be forced to pay back $100, for a particular choice of bets in each column. Can you, with a straight face, predict this will end well financially for the "horizontalists", given that only 6.5% of the cases are bold? I think that, no matter what your views on the nature of probability are, if you get $100 only 6.5% of the time, and each time you have to pay at least $10 (and sometimes $90!), there's no way to break even :)

And the strategy used by "interested party" to pick the bold numbers is..? Just to look at the Bayesian's grid, and see in which direction they disagree with Frequentists.

Can an "interested party" extract money from a Bayesian following the "vertical" grid? In our discrete world where it was difficult to set the range exactly at 90% boundary - sure. But the amount of money should be much smaller, and corresponds to the fact that either the "gray" part is a little smaller than 90% or "white" part is a little smaller than 10%, but it's not thaaat much smaller as in case of Frequentists strategy we saw.

Clearly, the Frequentists are not great at this particular game. So perhaps they are playing a different game? In what sense is their approach "better" than the Bayesian one?

Well, it requires far fewer assumptions about the process which generated N. This is important, because N plays the role of the "unknown real-world property": the less we assume about it, the less likely we are to accidentally assume the thesis or bias ourselves too much. To see what I mean, imagine that it turned out the die wasn't fair - that an asymmetry in its shape caused the result 12 to appear more often than 1/6 of the time, at the expense of the result 11, which was less likely than 1/6. If we knew about this property of the die before preparing our grids, we could take it into account. For example, if 12 has probability 3/12 and 11 has probability 1/12, both algorithms ("horizontal" and "vertical") can model this by scaling the numbers in the rows accordingly. Clearly this could lead to a different outcome in the "vertical" algorithm, because the relative values of pixels within a column might change (hopefully the values would still be bitonic, so the algorithm would still work at all!). OTOH the "horizontal" algorithm would produce exactly the same set of gray pixels! Indeed, all it cares about is the relative magnitude of pixels within a single row, which doesn't change as we change the distribution over N in the first step.

But this has an interesting consequence: it means the "horizontal" algorithm doesn't really need to know whether the die is fair or not - the produced Confidence Intervals do not depend on the first step at all - so even if the unfairness of the die was revealed after committing to the particular pattern of the gray area, it changes nothing! More than that: you can completely change the way you produce N - roll a 12-sided die, or sum two 6-sided dice, whatever - and the result stays the same (assuming bitonicity in both directions...). Heck (and here we open the can of worms at the heart of the debate between Frequentists and Bayesians), N could even be generated by a "non-random process" (whatever that phrase means to your ear), because the algorithm doesn't really care what you even mean by that. It always produces the same pattern of gray pixels, even if you insist that N, the number of days in the week, is a random variable, or that the kinetics of dice are deterministic, whatever.
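
Here's a quick way to check that claim numerically (again reusing `GRID` and `trim` from the earlier sketches; the 3/12 and 1/12 are the made-up unfair-die probabilities from the previous paragraph):

```python
# Rescale each row by how much more (or less) likely that N now is, relative to the fair die:
# P(N=12) = 3/12 and P(N=11) = 1/12 instead of 2/12, everything else unchanged.
scale = {n: 1.0 for n in GRID}
scale[12], scale[11] = 1.5, 0.5
scaled = {n: [scale[n] * c for c in row] for n, row in GRID.items()}

# Row-wise ("horizontal") selection is unaffected: multiplying a whole row by a constant
# doesn't change which of its cells are relatively smallest.
assert all(trim(list(enumerate(GRID[n]))) == trim(list(enumerate(scaled[n]))) for n in GRID)

# Column-wise ("vertical") selection can change, because the relative weights of cells
# *within* a column are different now (and a column may even stop being bitonic).
for h in (7, 8):
    old = trim([(n, row[h]) for n, row in GRID.items() if h < len(row)])
    new = trim([(n, row[h]) for n, row in scaled.items() if h < len(row)])
    print(f"H={h}: fair die -> {old}, unfair die -> {new}")
```
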

I don't feel competent enough to summarize the best arguments on both sides of the debate, and I feel I've already said something (despite trying not to) which might've upset one side or the other. But here's how I see the two sides approaching a situation where N is "the real height of Mr. Smith" and H is "the measured height of Mr. Smith", where we assume an additive error with normal distribution:

A Bayesian would google a chart of the height distribution in the population - preferably the subpopulation of males, or better yet, of "Smiths" in particular - to take into account all the pieces of information they have. This would give them a prior distribution over N: the scaling factors for the rows of the grid. Then, when asked, for a given H, what the 90%-Confidence Interval for N is, they'd zoom in on the H-th column of the huge grid, with each pixel containing the product of the row's scaling factor (the prior distribution over N) and the probability of observing H given that specific N (given by the normal distribution), and they'd try to pick a narrow range of pixels with the largest values to cover 90% of the mass in this column. (Or, more likely, use analytical tools over continuous functions to jump right to the exact answer.)

A Frequentist would try to avoid assuming too much about Mr. Smith, and mentally put question marks as the weights for all rows. These values aren't needed for their algorithm anyway, because it works row by row: picking the shortest range of pixels which covers 90% of the mass in this particular row, no matter what its scaling factor is. Finally, the (hopefully contiguous) range of gray pixels in column H is the 90%-Confidence Interval they'll report.
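
To make the contrast concrete, here's a rough numerical sketch of both procedures on a discretized version of the height example. Every number in it is an assumption made up purely for illustration - the normal prior centered at 178 cm, the 3 cm measurement noise, the 1 cm grid, the observed 171 cm - none of it comes from real height data:

```python
from statistics import NormalDist

heights = range(150, 201)                     # candidate true heights N, on a 1 cm grid
noise = NormalDist(0, 3)                      # assumed additive measurement error of H
prior_shape = NormalDist(178, 8)              # made-up population prior over N
prior = {n: prior_shape.pdf(n) for n in heights}

def shortest_90(weights, level=0.90):
    """Greedily drop the lightest candidates while at least `level` of the mass remains."""
    total = sum(weights.values())
    kept = dict(weights)
    for n in sorted(weights, key=weights.get):
        if sum(kept.values()) - weights[n] < level * total:
            break
        del kept[n]
    return min(kept), max(kept)

h_obs = 171   # the measured height

# Bayesian / "vertical": weight each candidate N by prior(N) * P(H = h_obs | N),
# then take the narrowest set holding 90% of that column's mass.
posterior = {n: prior[n] * noise.pdf(h_obs - n) for n in heights}
print("Bayesian 90% interval:  ", shortest_90(posterior))

# Frequentist / "horizontal": for each N, mark the central 90% band of measurements
# (for a symmetric normal error, the shortest 90% stripe of a row is its central 90%),
# then report every N whose band contains the observed H. The prior is never touched.
band = {n: (n + noise.inv_cdf(0.05), n + noise.inv_cdf(0.95)) for n in heights}
in_band = [n for n, (lo, hi) in band.items() if lo <= h_obs <= hi]
print("Frequentist 90% interval:", (min(in_band), max(in_band)))
```

The point of this sketch is only that the Bayesian line needs `prior` while the Frequentist line never touches it - which is exactly the robustness discussed above, and the exposure to a wrong prior discussed below.
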

Now, in the case of the very artificial toy example with dice and coins, I was happy to admit that both approaches give a correct interval "90% of the time". But in the case of such a real-world problem, something starts to crack for me.

I can still persuade myself that following the "Frequentist" methodology will produce intervals such that 90% of them contain the real value of N. But is it still true for the Bayesian approach? Sure, if you ask the Bayesian, in their self-consistent worldview the answer is a resounding YES! They see that the mass of gray pixels is 90%. But the grid is just their map. Does it reflect the territory? What if they got the scaling factors (the prior) wrong? Say they've misunderstood the chart, or the chart was wrong, or outdated? In such a case they might pick the "wrong" subset of pixels. The subset is only "correct" if the prior is exactly right - something we got "for free" in the oversimplified case of rolling a fair die. Otherwise it might be arbitrarily off, and Reality can extract a lot of money by challenging them. OTOH, Frequentists do not have to fear such a failure mode, as they don't even use the numbers from these charts anywhere in their algorithm. It's completely robust to whatever process actually generates the height of Mr. Smith.

So, I'd say both perspectives (row-wise and column-wise) are very useful. 

Thank you for your time.
