How long does it take to become Gaussian?

[-]Donald Hobson5y*130

> All Gaussian distributions have kurtosis 3, and no other distributions have kurtosis 3. So to check how close a distribution is to Gaussian, we can just check how far from 3 its kurtosis is.

This is wrong. Kurtosis is just the expectation of the 4th power. (Edit: renormalized by expectations of the first and second power.) All sorts of distributions have kurtosis 3, for example the discrete distribution over [-1,0,0,0,0,1].
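A quick check of that counterexample, as a minimal sketch. It assumes "the discrete distribution over [-1,0,0,0,0,1]" means each of the six listed values is equally likely; the helper name `kurtosis` is just for illustration.

```python
from fractions import Fraction

def kurtosis(values):
    """Kurtosis E[(X - mu)^4] / Var(X)^2 when each listed value is equally likely."""
    n = len(values)
    mean = Fraction(sum(values), n)
    var = sum((Fraction(v) - mean) ** 2 for v in values) / n
    fourth = sum((Fraction(v) - mean) ** 4 for v in values) / n
    return fourth / var ** 2

print(kurtosis([-1, 0, 0, 0, 0, 1]))  # 3
```

Exact fractions are used so that the result is exactly 3 rather than 3 up to rounding.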

Otherwise an interesting post.

[-]Gunnar_Zarncke5y50

I'm not sure kurtosis is the right measure of how Gaussian a distribution is in practice. I would have been more interested in the absolute difference between the distributions. That might be more difficult to compute, though.

[-]climbert85y160

To that point, skew and excess Kurtosis are just two of an infinite number of moments, so obviously they do not characterize the distribution. As someone else here suggested, one can look at the Fourier (or other) Transform, but then you are again left with evaluating the difference between two functions or distributions: knowing that the FT of a Gaussian is a Gaussian in its dual space doesn't help with "how close" a t-domain distribution F(t) is to a t-domain Gaussian G(t), you've just moved the problem into dual space.

We have a tendency to want to reduce an infinite dimensional question to a one dimensional answer. How about the L1 norm or the L2 norm of the difference? Well, the L2 norm is preserved under FT, so nothing is gained. Using the L1 norm would require some justification other than "it makes calculation easy".

So it really boils down to what question you are asking, what difference does the difference (between some function and the Gaussian) make? If being wrong (F(t) != G(t) for some t) leads to a loss of money, then use this as the "loss" function. If it is lives saved or lost use that loss function on the space of distributions. All such loss functions will look like an integral over the domain of L(F(t), G(t)). In this framework, there is no universal answer, but once you've decided what your loss function is and what your tolerance is you can now compute how many approximations it takes to get your loss below your tolerance.

Another way of looking at it is to understand what we are trying to compare the closeness of the test distribution to. It is not enough to say F(t) is this close to the Gaussian unless you can also tell me what it is not. (This is the "define a cat" problem for elementary school kids.) Is it not close to a Laplace distribution? How far away from Laplace is your test distribution compared to how far away it is from Gaussian? For these kinds of questions - where you want to distinguish between two (or more) possible candidate distributions - the Likelihood ratio is a useful metric.
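To make the likelihood-ratio idea concrete, here is a minimal sketch comparing a maximum-likelihood Gaussian fit against a maximum-likelihood Laplace fit. The function names and the use of evenly spaced quantiles as stand-in "data" are assumptions made for illustration, not anything from the analysis described above.

```python
import math
from statistics import NormalDist

def gaussian_loglik(xs):
    """Maximized Gaussian log-likelihood (MLE mean and variance plugged in)."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n
    return -0.5 * n * (math.log(2 * math.pi * var) + 1)

def laplace_loglik(xs):
    """Maximized Laplace log-likelihood (MLE location = median, scale = mean abs deviation)."""
    n = len(xs)
    s = sorted(xs)
    med = s[n // 2] if n % 2 else 0.5 * (s[n // 2 - 1] + s[n // 2])
    b = sum(abs(x - med) for x in xs) / n
    return -n * (math.log(2 * b) + 1)

def log_likelihood_ratio(xs):
    """Positive values favour Laplace, negative values favour Gaussian."""
    return laplace_loglik(xs) - gaussian_loglik(xs)

# Deterministic stand-in data: evenly spaced quantiles of each candidate.
n = 1000
probs = [(i + 0.5) / n for i in range(n)]
gaussian_like = [NormalDist().inv_cdf(p) for p in probs]
laplace_like = [-math.copysign(1.0, p - 0.5) * math.log(1 - 2 * abs(p - 0.5))
                for p in probs]

print(log_likelihood_ratio(gaussian_like))
print(log_likelihood_ratio(laplace_like))
```

The sign of the ratio answers "closer to which of the two?", which is a different question from "how close to Gaussian?".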

Most data scientists and machine learning smiths I've worked with assume that in "big data" everything is going to be a normal distribution "because Central Limit Theorem". But they don't stop to check that their final distribution is actually Gaussian (they just calculate the mean and the variance and make all sorts of parametric assumptions and p-value type interpretations based on some z-score), much less whether the process that is supposed to give rise to the final distribution is one of sampling repeatedly from different distributions or can be genuinely modeled as convolutions.

One example: the distribution of coefficients in a Logistic model is assumed (by all I've spoken to) to be Gaussian ("It is peaked in the middle and tails off to the ends."). Analysis shows it to be closer to Laplace, and one can model the regression process itself as a diffusion equation in one dimension, whose solution is ... Laplace!

I can provide an additional example, this time of a sampling process, where one is sampling from hundreds of distributions of different sizes (or weights), most of which are close to Gaussian. The distribution of the sum is once again, Laplace! With the right assumptions, one can mathematically show how you get Laplace from Gaussians.

[-]Gunnar_Zarncke5y40

Thank you, that provided a lot of additional details.

I was interested in visual closeness and I think sum of abs delta would be a good fit. That doesn't invalidate any of your points.

> With the right assumptions, one can mathematically show how you get Laplace from Gaussians.

Actually, I'm very interested in these conditions. Can you elaborate?

[-]Bucky5y120

The Berry-Essen theorem uses Kolmogorov-Smirnov distance to measure similarity to Gaussian - what’s the maximum difference between the CDF of the two distributions across all values of x?

As this measure is on absolute difference rather than fractional difference it doesn’t really care about the tails and so skew is the main thing stopping this measure approaching Gaussian. In this case the theorem says the error shrinks in proportion to $1/\sqrt{n}$.

From other comments it seems skew isn’t the best measure for getting kurtosis similar to a Gaussian; rather, kurtosis (and variance) of the initial function(s) is a better predictor, and skew only affects it inasmuch as skew and kurtosis/variance are correlated.

[-]DanielFilan5y40

Great theorem! Altho note that it's "Esseen" not "Essen".

[-]Bucky5y20

Ha, I don't know how many times I have read that in the last couple of days and completely failed to notice!

[-]Gunnar_Zarncke5y40

I think this is a very useful measure for practical applications.

[-]Maxwell Peterson5y30

I thought about that but didn't try it - maybe the sum of the absolute difference would work well. I'd tried KS distance, and also taking sum(sum(P(x > y) over y) over x), and wasn't happy with either.

[-]TheMajor5y10

Why not the Total Variation norm? KS distance is also a good candidate.

[-]Maxwell Peterson5y10

I think I didn't like the supremum part of the KS distance (which it looks like Total Variation has too) - felt like using just the supremum was using too little information. But it might have worked out anyway.
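For discrete distributions, total variation distance works out to half the sum of absolute pmf differences, so it coincides with the "sum of the absolute difference" idea mentioned earlier and uses every point of the pmf. A rough sketch follows; the choice to discretize the Gaussian into unit-width bins centred on the integers is an assumption, as are the helper names.

```python
import math

def convolve(p, q):
    """Linear convolution of two pmfs given as lists indexed by 0, 1, 2, ..."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out

def normal_cdf(x, mean, sd):
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

def tv_to_discretized_normal(pmf, n):
    """TV distance between the n-fold self-convolution of pmf and a binned Gaussian."""
    mu = sum(k * p for k, p in enumerate(pmf))
    var = sum(p * (k - mu) ** 2 for k, p in enumerate(pmf))
    total = [1.0]
    for _ in range(n):
        total = convolve(total, pmf)
    mean, sd = n * mu, math.sqrt(n * var)
    l1 = 0.0
    for k, p in enumerate(total):
        q = normal_cdf(k + 0.5, mean, sd) - normal_cdf(k - 0.5, mean, sd)
        l1 += abs(p - q)
    # Gaussian mass that falls outside the support of the sum.
    l1 += normal_cdf(-0.5, mean, sd)
    l1 += 1.0 - normal_cdf(len(total) - 0.5, mean, sd)
    return 0.5 * l1

die = [1 / 6] * 6  # fair die, relabelled 0..5
for n in (1, 4, 16):
    print(n, tv_to_discretized_normal(die, n))
```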

[-]Maxwell Peterson5y10

Fixed - thanks! (Although your example doesn't sum to 1, so is not an example of a distribution, I think?)

[-]Donald Hobson5y30

If you want mean 0 and variance 1, scale the example to [ $-\sqrt{3}$ ,0,0,0,0, $\sqrt{3}$ ].

[-]waveBidder5y90

Surprised no one has brought up the Fourier domain representation/characteristic functions. Over there, convolution is just repeated multiplication, so what this gives is $\hat{f}^{n}$ . Conveniently, gaussians stay gaussians, and the fact that we have probability distributions fixes $\hat{f} (0) = 1$ . So what we're looking for is how quickly the product above squishes to a gaussian around $ω = 0$ , which looks to be in large part determined by the tail behavior of $\hat{f}$ . I suspect what is driving your result of needing few convolutions is the fact that you're working with smooth, mostly low frequency functions. For example, exp, which is pretty bad, still has a $O (\frac{1}{n^{2}})$ decay. By throwing in some jagged edges, you could probably concoct a function which will eventually converge to a gaussian, but will take rather a long time to get there (for functions which are piecewise smooth, the decay is $O (\frac{1}{n})$ ).

One of these days I'll take a serious look at characteristic functions, which is roughly the statistician's way of thinking about what I was saying. There's probably an adaptation of the characteristic function proof of the CLT that would be useful here.
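A small numerical sketch of the "convolution becomes repeated multiplication" point, for a pmf on the integers. The bimodal example pmf, the helper names, and the frequencies probed are all assumptions chosen for illustration.

```python
import cmath

def convolve(p, q):
    """Linear convolution of two pmfs given as lists indexed by 0, 1, 2, ..."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out

def self_convolve(pmf, n):
    out = [1.0]
    for _ in range(n):
        out = convolve(out, pmf)
    return out

def char_fn(pmf, omega):
    """Characteristic function E[exp(i * omega * X)] of a pmf on 0, 1, 2, ..."""
    return sum(p * cmath.exp(1j * omega * k) for k, p in enumerate(pmf))

def gaussian_cf(omega, mean, var):
    """Characteristic function of a Gaussian with the given mean and variance."""
    return cmath.exp(1j * omega * mean - 0.5 * var * omega ** 2)

pmf = [0.5, 0.0, 0.0, 0.0, 0.5]  # hypothetical bimodal example
n = 8
mean = sum(k * p for k, p in enumerate(pmf))
var = sum(p * (k - mean) ** 2 for k, p in enumerate(pmf))
total = self_convolve(pmf, n)

for omega in (0.0, 0.05, 0.2, 0.5):
    direct = char_fn(total, omega)      # transform of the n-fold convolution
    power = char_fn(pmf, omega) ** n    # n-th power of the transform
    gauss = gaussian_cf(omega, n * mean, n * var)
    print(omega, abs(direct - power), abs(power - gauss))
```

The first printed gap should be rounding noise at every frequency, while the second gap is small only near $ω = 0$ and grows as you move away from it.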

[-]chasmani5y20

I was thinking something similar. I vaguely remember that the characteristic function proof includes an assumption of n being large, where n is the number of variables being summed. I think that allows you to ignore some higher order n terms. So by keeping those in you could probably get some way to quantify how "close" a resulting distribution is to Gaussian. And you could relate that back to moments quite naturally as well.

[-]Douglas_Knight5y60

The central limit theorem is a piece of pure math that mixes together two things that should be pretty separate. One is what the distribution looks like in the center and the other is what the distribution looks like at the tails. Both aspects of the distribution converge to the Gaussian, but if you want to measure how fast they converge, you should probably choose one or the other and define your metric (and your graphs) based on that target.

Skewness is a red herring. You should care about both tails separately and they don't cancel out.

[-]Maxwell Peterson5y30

This is an interesting view. I like the idea of thinking of center and tails separately. At the same time: If the center and tails were best thought of as separate, isn't it interesting that both go to Gaussian? Are there cases you know of (probably with operations that are not convolution?) where the tails go to Gaussian but the center does not, or vice-versa?

Skew doesn't capture everything, but it really seems like it captures something, doesn't it? I'd be interested to see a collection of distributions for which the relationship between skew and Gaussianness-after-n-convolutions does not hold, if you know of one!

[-]Douglas_Knight5y40

If the center goes to a unique limit and the tails go to a unique limit, then the two unique limits must be the same. CLT applies to anything with finite variance. So if something has gaussian tails, it just is the gaussian. But there are infinite variance examples with fat tails that have limits. What would it mean to ask if the center is gaussian? Where does the center end and the tails begin? So I don't think it makes sense to ask if the center is gaussian. But there are coarser statements you can consider. Is the limit continuous? How fast does a discrete distribution converge to a continuous one? Then you can look at the first and second derivative of the pdf. This is like an infinitesimal version of variance.

The central limit theorem is a beautiful theorem that captures something relating multiple phenomena. It is valuable to study that relationship, even if you should ultimately disentangle them. In contrast, skew is an arbitrary formula that mixes things together as a kludge with rough edges. It is adequate for dealing with one tail, but a serious mistake for two. It is easy to give examples of a distribution with zero skew that isn't gaussian: any symmetric distribution.

[-]habryka5y50

Promoted to curated: This has been a question I've asked myself a lot, and while of course this post can't provide a complete answer, I did really like its visualizations and felt like it did improve my intuitions in this space a decent amount. Thank you!

[-]Maxwell Peterson5y10

Thanks!

[-]adamShimi5y50

Great post! I don't have much to say, except that this seems very relevant to me as I'm studying some more distributions, and I find the presentation and the approach really helpful.

[-]Maxwell Peterson5y10

Good to hear! Thanks

[-]TurnTrout5y*40

> So the skew of the component distributions goes a long way in determining how quick their convolution gets Gaussian. The Berry-Esseen theorem is a central limit theorem that says something similar.

It might be helpful to quantify exactly how the Berry-Esseen theorem relates the skew, because, as you hint, it isn't a direct correspondence. If, like I did, you expect Berry-Esseen to use the skew directly, you'll be in for a good confusion.

In the simplest incarnation of the theorem, consider an IID sequence of observations with finite mean $μ$ , finite third absolute moment $E [| X_{1} |^{3}]$ , and positive standard deviation $σ$ . Let $F_{n}$ be the CDF of the rescaled and centered sample mean $\sqrt{n} \frac{\bar{X}_{n} - μ}{σ}$ and let $Φ$ be the CDF of the normal $N (0, 1)$ . Berry-Esseen upper-bounds the Kolmogorov-Smirnov statistic:

$$\sup_{x \in \mathbb{R}} | F_{n} (x) - Φ (x) | \leq C \frac{E [| X_{1} - μ |^{3}]}{σ^{3} \sqrt{n}},$$

where $C$ is some constant (0.5 works). This theorem is rad because it bounds how long the CLT takes to work its magic.

However, skew is $E [{(\frac{X_{1} - μ}{σ})}^{3}] \leq E [{\left| \frac{X_{1} - μ}{σ} \right|}^{3}] = \frac{E [| X_{1} - μ |^{3}]}{σ^{3}}$ , and so the quantity used by Berry-Esseen is lower-bounded by the absolute value of the skew. This is the precise link to be made with the Berry-Esseen theorem.

If you don't realize that absolute skew != the RHS of Berry-Esseen, you'll be confused as follows. All symmetric distributions have 0 skew, and so then you'd expect uniform distributions to instantly converge to Gaussian (since the upper bound on the Kolmogorov-Smirnov would be 0).
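To see the gap between skew and the Berry-Esseen quantity numerically, here is a minimal sketch for a fair die (relabelled 0..5), which is symmetric and so has zero skew. It computes the exact Kolmogorov-Smirnov distance of the standardized sum of n rolls and compares it with the bound using $C = 0.5$. The helper names and the choice of a die are assumptions for illustration.

```python
import math

def convolve(p, q):
    """Linear convolution of two pmfs given as lists indexed by 0, 1, 2, ..."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def summary(pmf):
    """Mean, variance, skew, and E|X - mu|^3 / sigma^3 for a pmf on 0, 1, 2, ..."""
    mu = sum(k * p for k, p in enumerate(pmf))
    var = sum(p * (k - mu) ** 2 for k, p in enumerate(pmf))
    sd = math.sqrt(var)
    skew = sum(p * (k - mu) ** 3 for k, p in enumerate(pmf)) / sd ** 3
    rho = sum(p * abs(k - mu) ** 3 for k, p in enumerate(pmf)) / sd ** 3
    return mu, var, skew, rho

def ks_to_normal(pmf, n):
    """sup_x |F_n(x) - Phi(x)| for the standardized sum of n IID draws."""
    mu, var, _skew, _rho = summary(pmf)
    total = [1.0]
    for _step in range(n):
        total = convolve(total, pmf)
    sd = math.sqrt(n * var)
    ks, cdf = 0.0, 0.0
    for k, p in enumerate(total):
        phi = normal_cdf((k - n * mu) / sd)
        ks = max(ks, abs(cdf - phi))  # just before the jump at k
        cdf += p
        ks = max(ks, abs(cdf - phi))  # at the jump
    return ks

def berry_esseen_bound(pmf, n, C=0.5):
    _mu, _var, _skew, rho = summary(pmf)
    return C * rho / math.sqrt(n)

die = [1 / 6] * 6
print("skew:", summary(die)[2])
for n in (1, 2, 5, 10, 30):
    print(n, ks_to_normal(die, n), berry_esseen_bound(die, n))
```

The skew prints as zero up to rounding, yet both the KS distance and the bound stay strictly positive, because the bound uses the absolute third moment.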

[-]Maxwell Peterson5y10

Thanks for investigating, this is helpful - I added a link to this comment to the post.

[-]Bucky5y30

The graph showing Kurtosis vs convolutions for the 5 distributions could be interpreted as showing that distributions with higher initial kurtosis take longer to tend towards normal. Can you elaborate why initial skew is a better indicator than initial kurtosis?

The skew vs kurtosis graph suggests that there’s possibly a sweet spot for skew of about 0.25 which enables faster approach to normality than 0. I guess this isn’t real but it adds to my confusion above.

[-]Maxwell Peterson5y30

Yes, exactly right: initial kurtosis is a fine indicator of how many convolutions it will take to reach kurtosis = 3. Actually, it’s probably a better indicator than skew, if you already have the kurtosis on hand. Two reasons I chose to look at it in terms of skew:

1. The main reason: it’s easier to eye skew. I can look at a graph and think “damn that’s skewed!”, but I’m less able to look and say “boy is that kurtose!”. I’m not as familiar with kurtosis, geometrically, though, so maybe others more familiar would not have this problem. It’s also easier for me to reason about skew; I know that income and spend distributions are often skewed, but there aren’t any common real world problems I find myself thinking are more or less kurtose.
2. I suspect - I’m not sure - but I suspect that distance-from-kurtosis-3 is a monotonically decreasing function of #-of-convolutions. In that case, to say “things that start closer to three stay closer to three after applying a monotonic decreasing function” felt, I guess, a little bit obvious?

Re: the beta(20, 10) making it look like there's a sweet spot around skew=0.25: correct that that isn't real. beta(20, 10) is super Gaussian (has very low kurtosis) even before any convolutions.

[-]Bucky5y20

So my understanding then would be that initial skew tells you how fast you will approach the skew of a Gaussian (i.e. 0) and initial kurtosis tells you how fast you approach the kurtosis of a Gaussian (i.e. 3)?

Using my calibrated eyeball it looks like each time you convolve a function with itself the kurtosis moves half of the distance to 3. If this is true (or close to true) and if there is a similar rule for skew then that would seem super useful.

I do have some experience with distributions where kurtosis is very important. For one example I was initially modelling with a normal distribution but found, as more data became available, that I was better off replacing it with a logistic distribution with thicker tails. This can be very important for analysing safety critical components where the tail of the distribution is key.

[-]gjm5y90

If you have two independent things with kurtoses $k_{1}, k_{2}$ and corresponding variances $v_{1}, v_{2}$ then their sum (i.e., the convolution of the probability distributions) has kurtosis $(\frac{v_{1}}{v_{1} + v_{2}})^{2} k_{1} + (\frac{v_{2}}{v_{1} + v_{2}})^{2} k_{2} + 6 \frac{v_{1} v_{2}}{(v_{1} + v_{2})^{2}}$ (in general there are two more cross-terms involving "cokurtosis" values that equal 0 in this case, and the last term involves another cokurtosis that equals 1 in this case).

We can rewrite this as $(\frac{v_{1}}{v_{1} + v_{2}})^{2} (k_{1} - 3) + (\frac{v_{2}}{v_{1} + v_{2}})^{2} (k_{2} - 3) + 3 ((\frac{v_{1}}{v_{1} + v_{2}})^{2} + 2 \frac{v_{1} v_{2}}{(v_{1} + v_{2})^{2}} + (\frac{v_{2}}{v_{1} + v_{2}})^{2})$ which equals $(\frac{v_{1}}{v_{1} + v_{2}})^{2} (k_{1} - 3) + (\frac{v_{2}}{v_{1} + v_{2}})^{2} (k_{2} - 3) + 3$ . So if both kurtoses differ from 3 by at most $δ$ then the new kurtosis differs from 3 by at most $\frac{v_{1}^{2} + v_{2}^{2}}{(v_{1} + v_{2})^{2}} δ$ which is at most $δ$ , and strictly less provided both variances are nonzero. If $v_{1} = v_{2}$ then indeed the factor is exactly 1/2.

So Maxwell's suspicions and Bucky's calibrated eyeball are both correct.
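The rewritten formula can be checked exactly with rational arithmetic. This is a minimal sketch; the two example pmfs and the helper names are made up for illustration.

```python
from fractions import Fraction as F

def convolve(p, q):
    """Linear convolution of two pmfs given as lists indexed by 0, 1, 2, ..."""
    out = [F(0)] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out

def var_and_kurtosis(pmf):
    mu = sum(k * p for k, p in enumerate(pmf))
    var = sum(p * (k - mu) ** 2 for k, p in enumerate(pmf))
    fourth = sum(p * (k - mu) ** 4 for k, p in enumerate(pmf))
    return var, fourth / var ** 2

def predicted_kurtosis(v1, k1, v2, k2):
    """Kurtosis of a sum of two independent variables, per the formula above."""
    w1, w2 = v1 / (v1 + v2), v2 / (v1 + v2)
    return w1 ** 2 * (k1 - 3) + w2 ** 2 * (k2 - 3) + 3

p = [F(1, 2), F(1, 4), F(0), F(1, 4)]  # hypothetical skewed pmf on 0..3
q = [F(1, 3), F(2, 3)]                 # hypothetical biased coin on 0..1

v1, k1 = var_and_kurtosis(p)
v2, k2 = var_and_kurtosis(q)
v_sum, k_sum = var_and_kurtosis(convolve(p, q))
print(k_sum == predicted_kurtosis(v1, k1, v2, k2))

# Equal variances: self-convolution halves the distance from 3.
_, k_self = var_and_kurtosis(convolve(p, p))
print(k1 - 3, k_self - 3)
```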

[-]Maxwell Peterson5y10

Wow! Cool - thanks!

[-]Maxwell Peterson5y10

Those possible approximate rules are interesting. I’m not sure about the answers to any of those questions.

[-]bipedaljoe3y20

Interesting how all sorts of PMFs, such as the ugly bimodal-uniform distribution, converge to the Gauss curve. I was wondering whether the phenomenon might go beyond probability theory, and whether just about any shape (that extends over a finite range on the x axis and is positive on the y axis) might converge to the Gauss curve if self-convolved enough times?

[-]Maxwell Peterson3y10

Good question!

[-]ErrethAkbe5y20

There is a rich field of research on statistical laws (such as the CLT) for deterministic systems, which I think might interest various commenters. Here one starts with a randomly chosen point x in some state space X, some dynamics T on X, and a real (or complex) valued observable function g : X -> R, and considers the statistics of Y_n = g(T^n(x)) for n > 0 (i.e. we start with x, apply T some number of times and then take a 'measurement' of the system via g). In some interesting circumstances these (completely deterministic) systems satisfy a CLT. Specifically, if we set S_n to be the sum of Y_0, ..., Y_{n-1} then for e.g. expanding or hyperbolic T, one can show that S_n is asymptotically a standard normal RV after appropriate scaling/translation to normalise. The key technical hypothesis is that an invariant measure mu for T exists such that the mu-correlations between Y_0 and Y_k decay summably fast with k.

This also provides an example of a pathological case for the CLT. Specifically, if g is of the form g(x) = h(T(x)) - h(x) (a coboundary) with h uniformly bounded, then by telescoping the terms in S_n we see that S_n is uniformly bounded over n, so when one divides by sqrt(n) the only limit is 0. Thus the limiting distribution is a Dirac delta at 0.
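The telescoping argument is easy to watch numerically. The sketch below assumes the logistic map T(x) = 4x(1 - x) on [0, 1] and h(x) = x purely as an illustration; neither choice comes from the comment above, and the telescoping itself is plain algebra that does not depend on them.

```python
import math

def T(x):
    """Logistic map on [0, 1] (an illustrative choice of dynamics)."""
    return 4.0 * x * (1.0 - x)

def h(x):
    """A bounded function on [0, 1] (illustrative choice)."""
    return x

def g(x):
    """Coboundary observable g = h(T(x)) - h(x)."""
    return h(T(x)) - h(x)

def partial_sums(x0, n):
    """Return S_1, ..., S_n and the final point T^n(x0)."""
    sums, s, x = [], 0.0, x0
    for _ in range(n):
        s += g(x)
        x = T(x)
        sums.append(s)
    return sums, x

x0 = 0.123456789
n = 10_000
sums, x_n = partial_sums(x0, n)

print(max(abs(s) for s in sums))      # stays bounded by sup|h| - inf|h|
print(sums[-1] - (h(x_n) - h(x0)))    # telescoping: S_n = h(T^n x) - h(x)
print(abs(sums[-1]) / math.sqrt(n))   # the CLT-scaled sum is tiny
```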

[-]Aaro Salosensaari5y20

I agree the non-IID result is quite surprising. Careful reading of the Berry-Esseen theorem gives some insight into the limit behavior. In the IID case, the approximation error is bounded by constants / $\sqrt{n}$ (where the constants are proportional to third moment / $\sigma^3$).

The non-IID generalization for n distinct distributions has a bound of, more or less, (sum of third moments) divided by (sum of sigma^2)^(3/2), which is surprisingly similar to the IID special case. My reading of it suggests that if the sigmas / third moments of all n distributions are bounded below / above by some sigma / phi (which of course happens when you pick any finite number of distributions by hand), the error again diminishes at rate $1/\sqrt{n}$ if you squint your eyes.
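A sketch of that squinting, assuming the non-IID bound is proportional to (sum of third absolute central moments) / (sum of variances)^(3/2), with the multiplicative constant left out. The function name and the numbers fed in are hypothetical.

```python
import math

def berry_esseen_ratio(variances, abs_third_moments):
    """(sum of E|X_i - mu_i|^3) / (sum of sigma_i^2)^(3/2), constant omitted."""
    return sum(abs_third_moments) / sum(variances) ** 1.5

# IID special case: n copies of one distribution recovers (rho / sigma^3) / sqrt(n).
var, rho = 2.0, 5.0
for n in (10, 100, 1000):
    ratio = berry_esseen_ratio([var] * n, [rho] * n)
    print(n, ratio, (rho / var ** 1.5) / math.sqrt(n))

# Non-IID: variances bounded below by s and third moments bounded above by r
# give a ratio of at most r / (s^(3/2) * sqrt(n)).
s, r = 1.0, 8.0
n = 400
variances = [s + (i % 3) for i in range(n)]   # all >= s
thirds = [r - (i % 5) for i in range(n)]      # all <= r
print(berry_esseen_ratio(variances, thirds), r / (s ** 1.5 * math.sqrt(n)))
```

Making the ratio decay slowly therefore needs variances that shrink or third moments that grow along the sequence, which matches the guess about the worst case.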

So, I would guess for a series of not-IID distributions to sum into a Gaussian as poorly as possible (while Berry-Esseen still applies), one would have to pick a series of distributions with as wildly small variances and wildly large skews...? And getting rid of the assumptions of CLT/its generalizations gives that the theorem no longer applies.