My knowledge of probability theory is based mostly on reading E.T. Jaynes’ Probability Theory book, Andrew Gelman’s blog, and various LessWrong posts. I now want to get a strong grasp of the central limit theorem(s), but YouTube videos and googled pages speak so much in the language of sampling from a population, and random variables, that it’s hard to be sure what they’re saying, given that my background doesn’t really include those ideas. I’m especially interested in the different kinds of CLTs, like the Lyapunov condition, the Berry-Esseen theorem, and so on. I often have a tough time with diving right into algebra - something like gives me terrible trouble. Given all these constraints, does anyone know of good resources from which I can gain a strong grasp of the CLTs?

Some things I am confused about after googling so far:

Do distributions converge to gaussians, or do means converge to the mean of a gaussian? Is the former a more difficult convergence to achieve, or are they actually the very same condition?

Is the CLT even about means? Does it say anything about the resulting variance or skewness of the resulting distribution?

Is it actually necessary to be sampling from a population, or does the CLT apply to taking the means of arbitrary distributions, regardless of where they were obtained?

Any form of media is OK, for recommendations - no preference. Please feel free to suggest things even if you’re not sure it’s what I’m looking for - you are probably better than google!

I don't have any good sources other than university textbooks, but:

  1. The simplest proof I know of (in 3 lines or so) is to just compute characteristic functions.
  2. In general, the theorem talks about weak convergence, i.e. convergence in distribution.
  3. The sample mean converges to the expected value of the distribution it was taken from almost surely (i.e. strong convergence). This is a different phenomenon from the CLT; it's called the law of large numbers.
  4. CLT applies to a family of random variables, not to distributions. The random variables in question do not have to be identically distributed, but do have to be independent (in particular, independence of a family of random variables is NOT the same as their pairwise independence).
  5. The best intuition behind the CLT I know of: the Gaussian is the only distribution with a finite variance where a linear combination of two independent copies has the same distribution (up to a shift and rescaling of parameters) as the copies themselves (i.e. it is a stable distribution). So, if you try to "solve" the recursive equation for the limit in the CLT, you'll see that, if it exists, it has to be Gaussian. The theorem is actually about showing that the limit exists.

    In general, as someone nicely put this: The importance of stable probability distributions is that they are "attractors" for properly normed sums of independent and identically distributed (iid) random variables.
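Points 2 and 3 can be seen side by side in a small simulation. This is only a sketch under my own assumptions: the Exponential(1) distribution, the 5000 repetitions, and the names `results` and `skewness` are illustrative choices, not anything the theorems require. The sample mean itself collapses onto the true mean (law of large numbers), while the centered mean multiplied by sqrt(n) keeps a spread near 1 and loses its skew (CLT).

```python
import numpy as np

rng = np.random.default_rng(0)
TRUE_MEAN = 1.0  # mean of Exponential(1); its standard deviation is also 1

def skewness(x):
    """Third standardized moment; 0 for a symmetric distribution such as a Gaussian."""
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 3))

results = {}
for n in (1, 10, 100, 1000):
    # 5000 independent experiments, each averaging n exponential draws.
    means = rng.exponential(scale=1.0, size=(5000, n)).mean(axis=1)
    # LLN: the sample means themselves pile up around TRUE_MEAN.
    spread_of_mean = float(means.std())
    # CLT: after centering and scaling by sqrt(n), the spread stays near 1
    # and the shape loses the skew of the original distribution.
    z = np.sqrt(n) * (means - TRUE_MEAN)
    results[n] = (spread_of_mean, float(z.std()), skewness(z))
    print(f"n={n:4d}  std(mean)={spread_of_mean:.4f}  "
          f"std(z)={results[n][1]:.3f}  skew(z)={results[n][2]:+.3f}")
```

So the CLT is a statement about the whole shape of the rescaled fluctuations, not just about where the mean ends up.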

Thanks. I think I had the law of large numbers and CLT in the same bucket in my head, so pointing out they're different is helpful. Your point #5, and the attractor bit, are especially interesting - and I've seen similar arguments in Jaynes's book, around gaussians, so this is starting to get into places I can relate to. And knowing that convergence in distribution is called weak convergence should help when I'm searching for stuff. Helpful!

"CLT applies to a family of random variables, not to distributions."

I guess I consider a family of random variables to be the same thing as a family of distributions? Is there a difference?

Answering the last question: If you deal with any random variable, formally you are specifying a probability space, and the variable is a measurable function on it. So, to say anything useful about a family of random variables, they all have to live on the same space (otherwise you can't - for example - add them. It does not make sense to add functions defined on different spaces). This shared probability space can be very complicated by itself, even though the marginal distributions are the same - it encodes the (non-)independence among them (in case of independent variables, it's just a product space with a product measure).
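A quick way to see why the marginal distributions alone are not enough: the sketch below, using made-up names (`y_indep`, `y_copy`, `y_flip`) and standard normal marginals chosen purely for illustration, builds three versions of Y that all have the same N(0, 1) distribution but sit differently relative to X on the shared space. The sum X + Y comes out differently in each case.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

x = rng.standard_normal(n)

# Three random variables with the same N(0, 1) marginal distribution,
# but with different joint behaviour relative to x.
y_indep = rng.standard_normal(n)  # independent of x
y_copy = x                        # Y = X (fully dependent)
y_flip = -x                       # Y = -X (still N(0, 1) by symmetry)

variances = {}
for name, y in [("independent", y_indep), ("Y = X", y_copy), ("Y = -X", y_flip)]:
    s = x + y
    variances[name] = float(s.var())
    print(f"{name:12s} var(X + Y) = {variances[name]:.3f}")
```

The variance of the sum is roughly 2 in the independent case, roughly 4 when Y = X, and exactly 0 when Y = -X, even though "the distribution of Y" is the same every time.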
Maxwell Peterson
Your comment made me realize that I didn't actually know what it meant to add random variables! I looked it up and found that, according to Wikipedia, this corresponds (if the RVs are independent) to what my main source (Jaynes) has been talking about in terms of convolutions of probability distributions. So I'm gonna go back and re-read the parts on convolution. But I still want to go out on a limb here and say that requiring all the random variables to live on the same space sounds to me like too strong a statement. Since I can take the AND of just about any two propositions and get a probability, can't I talk about the chance of a person being 6 feet tall, and about the probability that it is raining in Los Angeles today, even though those event spaces are really different, and therefore their probability spaces are different? And if I can do that, what is special about the addition of random variables that makes it not applicable, in the way AND is applicable?
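For anyone else making the same connection between sums and convolutions, here is a minimal sketch using fair six-sided dice (my own choice of example, as are the variable names). Convolving the probability vector of one die with itself gives the distribution of the sum of two independent dice, and repeating the convolution gives a shape that sits very close to a Gaussian with matching mean and variance.

```python
import numpy as np

die = np.full(6, 1 / 6)  # P(face = 1), ..., P(face = 6)

# Distribution of the sum of two independent dice: index k holds P(total = k + 2).
two_dice = np.convolve(die, die)
p_seven = two_dice[7 - 2]

# Convolve repeatedly to get the distribution of the sum of 30 independent dice.
n_dice = 30
total = die.copy()
for _ in range(n_dice - 1):
    total = np.convolve(total, die)

# Compare with a Gaussian density that has the same mean and variance.
values = np.arange(n_dice, 6 * n_dice + 1)  # possible totals: 30 .. 180
mean = n_dice * 3.5
var = n_dice * 35 / 12                      # variance of one fair die is 35/12
gauss = np.exp(-(values - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
max_gap = float(np.max(np.abs(total - gauss)))

print("P(two dice total 7) =", p_seven)
print("largest gap between 30-dice distribution and Gaussian =", max_gap)
```

Note that the convolution only gives the right answer because the dice are assumed independent; with dependent dice the sum would have a different distribution.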
If you don't have a given joint probability space, you implicitly construct it (for example, by saying the RVs are independent, you implicitly construct a product space). Generally, the fact that sometimes you talk about X living on one space (on its own) and other times on another (jointly with some Y) doesn't really matter, because in most situations probability theory is specifically about the properties of random variables that do not depend on the underlying spaces (although sometimes it does matter). In your example, by definition, P = Prob(X = 6ft AND Y = raining) = mu{t: X(t) = 6ft and Y(t) = raining}. You have to assume their joint probability space. For example, maybe they are independent, and then P = Prob(X = 6ft) * Prob(Y = raining), or maybe Y = if X = 6ft then raining else not raining, and then P = Prob(X = 6ft).
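To make the two constructions concrete, here is a toy sketch with explicit finite probability spaces. The numbers `p_tall` and `p_rain` are hypothetical, and the helper `prob` is just an illustration of summing the measure mu over the outcomes t where the event holds.

```python
from fractions import Fraction
from itertools import product

p_tall = Fraction(1, 10)  # hypothetical Prob(X = 6ft)
p_rain = Fraction(1, 20)  # hypothetical Prob(Y = raining)

def prob(mu, event):
    """Sum the measure mu over all outcomes t for which event(t) is true."""
    return sum(m for t, m in mu.items() if event(t))

# Construction A: independence, i.e. a product space with the product measure.
height = {"6ft": p_tall, "other": 1 - p_tall}
weather = {"raining": p_rain, "dry": 1 - p_rain}
mu_a = {(h, w): ph * pw
        for (h, ph), (w, pw) in product(height.items(), weather.items())}
X_a = lambda t: t[0]
Y_a = lambda t: t[1]
P_a = prob(mu_a, lambda t: X_a(t) == "6ft" and Y_a(t) == "raining")

# Construction B: Y is a deterministic function of X on the height space alone.
mu_b = dict(height)
X_b = lambda t: t
Y_b = lambda t: "raining" if X_b(t) == "6ft" else "dry"
P_b = prob(mu_b, lambda t: X_b(t) == "6ft" and Y_b(t) == "raining")

print("independent:", P_a)      # Prob(X = 6ft) * Prob(Y = raining)
print("Y determined by X:", P_b)  # Prob(X = 6ft)
```

In construction B the marginal probability of rain is forced to equal `p_tall`, so this toy version changes Y's distribution as well as the dependence; the point is only that the AND probability is not defined until some joint space has been chosen.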