There are multiple versions of the central limit theorem. One common version says that, as $n \to \infty$, the mean of $n$ means looks like a Gaussian distribution. People in industry sometimes ignore the "goes to infinity" part, and assume that if $n$ is large, that's good enough. But is that true? When is it safe to assume that your means have converged? I heard of a case recently where the mean of means was assumed to have converged, hadn't, and the application it was a part of suffered accuracy problems for months without anyone knowing. This sequence will explore: under what conditions does the central limit theorem "fail" like this?
The central limit theorem is about convolutions
There are multiple versions of the central limit theorem. They're all a version of the statement:
If you have a bunch of distributions (say, $n$ of them), and you convolve them all together into a distribution $f$, then the larger $n$ is, the more $f$ will resemble a Gaussian distribution.
The simplest version of the central limit theorem requires that the distributions be 1) independent and 2) identically distributed. In this sequence, I'm gonna assume #1 is true. We'll find that while condition #2 is nice to have, even without it, distributions can converge to a Gaussian under convolution.
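To make that claim concrete, here's a quick numerical sketch (my code, not part of any formal statement of the theorem): convolve 30 copies of a uniform distribution together and compare the result to a Gaussian with the same mean and variance.

```python
import numpy as np

# Sketch: repeatedly convolving a uniform density approaches a Gaussian.
dx = 0.01
f = np.ones(100)
f /= f.sum() * dx               # uniform density on [0, 1)

g = f.copy()
for _ in range(29):             # convolve 30 copies together
    g = np.convolve(g, f) * dx  # the dx factor approximates the integral

x = np.arange(len(g)) * dx
mean = np.sum(x * g) * dx                 # ~ 30 times the uniform's mean
var = np.sum((x - mean) ** 2 * g) * dx    # ~ 30 * (1/12)

# Gaussian density with the same mean and variance, for comparison.
gaussian = np.exp(-((x - mean) ** 2) / (2 * var)) / np.sqrt(2 * np.pi * var)
print(np.max(np.abs(g - gaussian)))  # small: g closely resembles the Gaussian
```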
A Gaussian distribution is the same thing as a Normal distribution. Some examples of Gaussian distributions:
Wait - this doesn't sound like the central limit theorem I know
Most statements of the central limit theorem, including Wikipedia's, talk in terms of the sums of random variables (and their density functions and expected values). But this is the same thing as our convolution-of-distributions, because the density function of the sum of two random variables X, Y is the convolution of the density functions of X and Y. Looking at the central limit theorem in terms of convolutions will make it easier to see some things. It's also useful if your version of probability doesn't have the concept of a random variable, like probability theory as extended logic*.
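A tiny example of that equivalence (my own, with two dice): the distribution of the sum of two independent dice is the convolution of the two individual distributions.

```python
import numpy as np

# P(X = 1), ..., P(X = 6) for a fair die.
die = np.full(6, 1 / 6)

# Convolving the two density arrays gives P(X + Y = 2), ..., P(X + Y = 12).
sum_dist = np.convolve(die, die)

print(sum_dist[5])  # P(X + Y = 7) = 6/36, the most likely total
```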
Another statement of the central limit theorem, from Jaynes:
The central limit theorem tells us what is essentially a simple combinatorial fact, that out of all conceivable error vectors that could be generated, the overwhelming majority have about the same degree of cancellation.
Hopefully this is enough to convince you that the central limit theorem need not be stated in terms of random variables or samples from a population.
(Also: the central limit theorems are sometimes talked about in terms of means only. They do say things about means, but they also say very much about distributions. I'll talk about both.)
*: first three chapters here.
The central limit theorem is a statement about the result of a sequence of convolutions. So to understand the central limit theorem, it's really important to know what convolutions are and to develop a good intuition for them.
Take two functions, $f$ and $g$. Reverse $g$ on the y-axis, then slide it and $f$ along each other, multiplying the overlapping values and summing the products as you slide. The result is the convolution of $f$ and $g$. Here's a picture - $f$ is blue, $g$ is red, and the convolution is black:
Another one, with a different function $g$:
You can write this as $(f * g)(t) = \int_{-\infty}^{\infty} f(\tau) \, g(t - \tau) \, d\tau$.
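Here's the flip-and-slide picture as code (a sketch of mine, for the discrete case): reverse $g$, slide it along $f$, and sum the products at each offset. It agrees with NumPy's built-in np.convolve.

```python
import numpy as np

def convolve_slow(f, g):
    # Discrete analogue of the integral: out[t] = sum over tau of
    # f[tau] * g[t - tau], i.e. g reversed and slid across f.
    n = len(f) + len(g) - 1
    out = np.zeros(n)
    for t in range(n):
        for tau in range(len(f)):
            if 0 <= t - tau < len(g):
                out[t] += f[tau] * g[t - tau]
    return out

f = np.array([1.0, 2.0, 3.0])
g = np.array([0.5, 0.5])
print(convolve_slow(f, g))  # matches np.convolve(f, g)
```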
The first two sections on this page give a nice explanation of convolutions in terms of dropping a ball twice. This page lets you choose two functions to convolve visually, and I highly recommend it for getting a feel for the operation.
Convolution has nice properties
Convolution is associative and commutative - if you want to convolve multiple functions together, it doesn't matter what order you do it in. (Also, nothing I've said in this section requires $f$ and $g$ to be probability distributions - they can be any two functions.)
You can get the mean of a convolution without actually doing the convolution. If the first moments of $f$ and $g$ are $\mu_f$ and $\mu_g$ (for probability distributions, the first moment is the mean), then the first moment of $f * g$ is $\mu_f + \mu_g$. There is a hint here about why people often talk about means when they talk about the central limit theorem - being able to get the mean of the convolution by just adding the means of the underlying distributions together is really nice.
You can get the second moment without actually convolving, too (the second moment is closely related to the variance: if the mean is 0, the second moment and the variance are the same). For $f$ and $g$ with variances $\sigma_f^2$, $\sigma_g^2$, the variance of $f * g$ is $\sigma_f^2 + \sigma_g^2$. And so on for higher moments (more precisely, higher cumulants), for as long as $f$ and $g$ actually have those higher moments themselves.
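A numerical check of those rules (my code, using an exponential $f$ and a uniform $g$ purely for illustration): the mean and variance of $f * g$ are the sums of the underlying means and variances.

```python
import numpy as np

dx = 0.001
x = np.arange(0, 5, dx)

def moments(p, grid):
    # Mean and variance of a density sampled on an evenly spaced grid.
    mean = np.sum(grid * p) * dx
    var = np.sum((grid - mean) ** 2 * p) * dx
    return mean, var

f = np.exp(-x)                    # exponential density (truncated at 5)
f /= f.sum() * dx
g = np.where(x < 2, 0.5, 0.0)     # uniform density on [0, 2)
g /= g.sum() * dx

h = np.convolve(f, g) * dx        # their convolution
xh = np.arange(len(h)) * dx

mf, vf = moments(f, x)
mg, vg = moments(g, x)
mh, vh = moments(h, xh)
print(mh - (mf + mg), vh - (vf + vg))  # both differences ~ 0
```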
If $F$ and $G$ are the Fourier transforms of $f$ and $g$, then $\widehat{f * g} = F \cdot G$. This is very useful for computing convolutions numerically: evaluating $f * g$ naively would require doing an integral for every single value you were interested in. That's expensive, time-wise. Much easier to compute $F$ and $G$ and then just multiply them (multiplication is cheap). After that, applying the inverse Fourier transform gives $f * g$. This is how convolution is computed in popular numerical programming languages.
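Here's what that looks like in practice (a sketch of mine using NumPy's FFT routines): zero-pad, transform, multiply, invert. It matches direct convolution.

```python
import numpy as np

f = np.random.default_rng(0).random(256)
g = np.random.default_rng(1).random(256)

n = len(f) + len(g) - 1          # full convolution length
F = np.fft.rfft(f, n)            # zero-padding to n avoids circular wrap-around
G = np.fft.rfft(g, n)
fast = np.fft.irfft(F * G, n)    # multiply in frequency space, transform back

direct = np.convolve(f, g)       # the slow, direct computation
print(np.allclose(fast, direct))
```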
If $f$ sums to 1 and $g$ sums to 1, then $f * g$ sums to 1. So if $f$ and $g$ are probability distributions, so is $f * g$.
In the next post, I'll look at what happens when you convolve a bunch of different distributions together, and explore how much of what happens depends on the form of those distributions.
Post script: not neural networks
You may have heard of convolutional neural networks from all their success in image processing. I think they involve the same convolution I'm talking about here, but in much higher dimensions. I'll only be talking about convolutions of one-dimensional functions in this sequence. Some (or a lot?) of the stuff I'll say about the central limit theorem probably applies in higher dimensions too, but I'm not sure what changes as the dimension increases. So, this sequence is not about the neural networks.