If Bayesian stats is so great, it should be able to do everything that ordinary stats can do. Indeed, as a fan of Bayesian stats, I have not made much of an attempt to learn normal stats with its p-values and t-tests outside what was required for my highschool statistics class many years ago.
But was this really a wise decision on my part?[1]
Randomness Testing
Consider the following problem: We have a random bit generator. It should generate independent random bits, each bit at 1:1 odds of being 0 vs 1. We would like to test if it is truly a good RNG, or is broken in some way.
This is actually a really deep problem, and people invent barrages of statistical tests to divine subtle correlations between bits. But we will start out by setting a very humble goal for ourselves: Determine if the generator is making too many 0's or too many 1's.
This is really easy for normal stats, so it had better be really easy for Bayesian methods too.
First of all, what is the thing that is going to replace the p-value here? In Bayesian statistics we will use a likelihood ratio for this. Just like we set a cutoff for the p-value in classical stats, we'll likewise set a cutoff for likelihood ratio. If it becomes too large, it tells us that the random number generator has failed the test.
In testing a random number generator, we'll observe some large number of bits it generates, all strung together into a bitstring . Given two theories about how our random number generator works, the likelihood ratio is . This is all we need to do Bayesian updates in odds form! Usually both probabilities will be very tiny. But we're taking a ratio, so much of the tininess will divide out.
For our base event , we can use the theory that the random number generator is truly the perfect idealized random number generator that assigns each bitstring of length a probability of . The alternative event will represent the theory that the random number generator is broken in some way, in this case by having a probability other than for a given bit to be a 1.
The tricky part here is making the theory . In particular, if the probability of a given bit being 1 is not , but rather some other number , then there are infinitely many choices for . So the theory can't just specify a fixed value of . Instead the theory has to say something like: "When we start generating bits, we pick a value of at random. Then we generate bits using that probability for the rest of the process."
This prevents us from having to guess specific value of to compare against.
When the theory says to pick at random, what distribution do we draw it from? Well, the uniform distribution is probably a fine choice. We don't really know , but we want to be able to detect a biased generator for any possible value of , meaning we need to assign some probability to any of those possible values.
Sorry for probably overly belabouring that last point, but I think it's worth some thought. We want to compare against the theory that the generator is biased, but we don't know what the bias could be. So we say that we're ignorant, that the bias was probably selected at random. Then the bias is effectively learned as we process our bitstring to compute .
Now you might be asking: "Well, is in the range , and so are some values that are pretty close to it. Doesn't this cause problems because means we actually have a valid generator that should pass this test?" This concern will go away when we actually do a calculation. So let's just skip to that part.
Suppose that the observed bitstring has length and has bits that are 1. Then:
The integral can be done numerically without many problems. But it does have a closed form solution too:
This looks a little bit like a binomial coefficient, but inverted, and divided by . Anyway, what likelihood ratio do we get from this?
Taking as the length of our test bitstring for example, we get:
Here the axis denotes the number of heads, , while the axis (note the log scale) is the likelihood ratio. Near 50% heads, the ratio favours , while towards the extremes, is favoured instead. The fact that is included in theory is not a problem: Theory still has the advantage in the middle because it predicted as the heads probability with great specificity, rather than having it as one of many possibilities. With a large enough number of trials, even observing a fraction of heads that is very near to , like will be enough for the generator to fail the test. So including values of arbitrarily near to in our distribution was the right choice.
If we set a likelihood ratio threshold value of 10 for the test to fail, we can see the areas where the random number generator passes or fails:
Really we ought to choose a slightly larger threshold number than 10. If we choose 20, then this is kind of similar to the standard used in ordinary stats.
Multiple Tests
We can even use the trick of learning unknown variables to combine multiple tests into one. If we have a variety of alternative theories , then we can combine them into one theory by mixing them together; with each theory having probability under theory . There is a cost to doing this, which is that some extra data will be needed to discern between the sub-theories before one of them becomes dominant enough to start influencing the overall likelihood for theory .
Some other kinds of tests
What other theories might we like to use for the besides our simple biased coin theory? This introductory guide contains some notes on various other randomness tests. I'll briefly say a few things about how to do each of them in a Bayesian way.
Frequency test: This is the one we just did.
Block frequency test: Let be our block length. Then if the input is divided into blocks , our theory is that we have a series of values sampled from , where the probability of heads in block is . If we are uncertain about the right block length to use, we can also make it random.
Runs: We again have a single value of , but instead of denoting the probability that we get heads, it denotes the probability that the next bit is the same as the current one.
Word frequency test: Instead of a single bit, which has two possible outcomes, we look at multi-bit words, which have 4, 8, 16, etc. possible outcomes. Instead of picking from it is picked uniformly at random from the space of probability distributions on outcomes. The necessary integral still has a closed form, you can read about the Dirichlet distribution to see how to do the math here. Anyway, it's not fundamentally different from the frequency test above, just with more than 2 possible outcomes.
Autocorrelation: Similar to the word frequency test, except that the word frequency test divides the bitsting up into chunks aligned to the word size, whereas here we care about all possible offsets, not just the aligned ones. One Bayesian way to handle this is to make a Markov model where for word length , we independently sample for every bitstring of length , where denotes the probability that the next bit is 1 if the previous are .
Bonus: Linear autocorrelation: If we specifically want to capture linear correlations, we can just make a theory that uses them. We randomly sample a bit vector from . We also sample an error probability . Then the theory predicts the next bit as , with error probability , where is a vector of the most recent bits. This test will be faster (use less data) than the full non-linear autocorrelation test when testing a linear congruential generator.
End
This will all seem quite simple for people who already know a lot about probability theory, but hopefully it provides a good simple example of an actual practical application of Bayesian stats. And, like, what math you actually need to do to use it.
If Bayesian stats is so great, it should be able to do everything that ordinary stats can do. Indeed, as a fan of Bayesian stats, I have not made much of an attempt to learn normal stats with its p-values and t-tests outside what was required for my highschool statistics class many years ago.
But was this really a wise decision on my part? [1]
Randomness Testing
Consider the following problem: We have a random bit generator. It should generate independent random bits, each bit at 1:1 odds of being 0 vs 1. We would like to test if it is truly a good RNG, or is broken in some way.
This is actually a really deep problem, and people invent barrages of statistical tests to divine subtle correlations between bits. But we will start out by setting a very humble goal for ourselves: Determine if the generator is making too many 0's or too many 1's.
This is really easy for normal stats, so it had better be really easy for Bayesian methods too.
First of all, what is the thing that is going to replace the p-value here? In Bayesian statistics we will use a likelihood ratio for this. Just like we set a cutoff for the p-value in classical stats, we'll likewise set a cutoff for likelihood ratio. If it becomes too large, it tells us that the random number generator has failed the test.
In testing a random number generator, we'll observe some large number of bits it generates, all strung together into a bitstring . Given two theories about how our random number generator works, the likelihood ratio is . This is all we need to do Bayesian updates in odds form! Usually both probabilities will be very tiny. But we're taking a ratio, so much of the tininess will divide out.
For our base event , we can use the theory that the random number generator is truly the perfect idealized random number generator that assigns each bitstring of length a probability of . The alternative event will represent the theory that the random number generator is broken in some way, in this case by having a probability other than for a given bit to be a 1.
The tricky part here is making the theory . In particular, if the probability of a given bit being 1 is not , but rather some other number , then there are infinitely many choices for . So the theory can't just specify a fixed value of . Instead the theory has to say something like: "When we start generating bits, we pick a value of at random. Then we generate bits using that probability for the rest of the process."
This prevents us from having to guess specific value of to compare against.
When the theory says to pick at random, what distribution do we draw it from? Well, the uniform distribution is probably a fine choice. We don't really know , but we want to be able to detect a biased generator for any possible value of , meaning we need to assign some probability to any of those possible values.
Sorry for probably overly belabouring that last point, but I think it's worth some thought. We want to compare against the theory that the generator is biased, but we don't know what the bias could be. So we say that we're ignorant, that the bias was probably selected at random. Then the bias is effectively learned as we process our bitstring to compute .
Now you might be asking: "Well, is in the range , and so are some values that are pretty close to it. Doesn't this cause problems because means we actually have a valid generator that should pass this test?" This concern will go away when we actually do a calculation. So let's just skip to that part.
Suppose that the observed bitstring has length and has bits that are 1. Then:
The integral can be done numerically without many problems. But it does have a closed form solution too:
This looks a little bit like a binomial coefficient, but inverted, and divided by . Anyway, what likelihood ratio do we get from this?
Taking as the length of our test bitstring for example, we get:
Here the axis denotes the number of heads, , while the axis (note the log scale) is the likelihood ratio. Near 50% heads, the ratio favours , while towards the extremes, is favoured instead. The fact that is included in theory is not a problem: Theory still has the advantage in the middle because it predicted as the heads probability with great specificity, rather than having it as one of many possibilities. With a large enough number of trials, even observing a fraction of heads that is very near to , like will be enough for the generator to fail the test. So including values of arbitrarily near to in our distribution was the right choice.
If we set a likelihood ratio threshold value of 10 for the test to fail, we can see the areas where the random number generator passes or fails:
Really we ought to choose a slightly larger threshold number than 10. If we choose 20, then this is kind of similar to the standard used in ordinary stats.
Multiple Tests
We can even use the trick of learning unknown variables to combine multiple tests into one. If we have a variety of alternative theories , then we can combine them into one theory by mixing them together; with each theory having probability under theory . There is a cost to doing this, which is that some extra data will be needed to discern between the sub-theories before one of them becomes dominant enough to start influencing the overall likelihood for theory .
Some other kinds of tests
What other theories might we like to use for the besides our simple biased coin theory? This introductory guide contains some notes on various other randomness tests. I'll briefly say a few things about how to do each of them in a Bayesian way.
Frequency test: This is the one we just did.
Block frequency test: Let be our block length. Then if the input is divided into blocks , our theory is that we have a series of values sampled from , where the probability of heads in block is . If we are uncertain about the right block length to use, we can also make it random.
Runs: We again have a single value of , but instead of denoting the probability that we get heads, it denotes the probability that the next bit is the same as the current one.
Word frequency test: Instead of a single bit, which has two possible outcomes, we look at multi-bit words, which have 4, 8, 16, etc. possible outcomes. Instead of picking from it is picked uniformly at random from the space of probability distributions on outcomes. The necessary integral still has a closed form, you can read about the Dirichlet distribution to see how to do the math here. Anyway, it's not fundamentally different from the frequency test above, just with more than 2 possible outcomes.
Autocorrelation: Similar to the word frequency test, except that the word frequency test divides the bitsting up into chunks aligned to the word size, whereas here we care about all possible offsets, not just the aligned ones. One Bayesian way to handle this is to make a Markov model where for word length , we independently sample for every bitstring of length , where denotes the probability that the next bit is 1 if the previous are .
Bonus: Linear autocorrelation: If we specifically want to capture linear correlations, we can just make a theory that uses them. We randomly sample a bit vector from . We also sample an error probability . Then the theory predicts the next bit as , with error probability , where is a vector of the most recent bits. This test will be faster (use less data) than the full non-linear autocorrelation test when testing a linear congruential generator.
End
This will all seem quite simple for people who already know a lot about probability theory, but hopefully it provides a good simple example of an actual practical application of Bayesian stats. And, like, what math you actually need to do to use it.
Spoiler: Yes. ↩︎