A compilation of misuses of statistics

Statistics are hard. The more statistics I learn the harder the discipline seems. To truly extract insight from data using statistical methods one has to take many factors and nuances into account, and I discover new ones every week. This is the reason many of the studies published in some scientific journals are false, and why many companies fail to create any value when they implement machine learning and data science methods. In this post, I will list many common misuses of statistics and hopefully keep updating this post as I learn more, like a kind of wiki of statistical mistakes. So far all of the listed mistakes come from Statistics done wrong by Alex Reinhart, Intuitive biostatistics by Harvey Motulsky and several of Nassim Taleb’s books. Let’s start !

False Gaussian assumption

Most socio-economic variables follow fat-tailed distributions, not Gaussian ones. Yet most financial and econometric models assume a Gaussian distribution. This was discovered by Benoit Mandelbrot in 1962.

I put this statistical fallacy in first place, even though the rest of this list is in no particular order, because I believe it has the most devastating consequences. It undermines nearly ALL attempts at forecasting, particularly in complex dynamical systems.

Bankers, policy makers, economists and other forecasting consumers all still rely heavily to this day on Gaussian assumptions which have been shown to be false. Fat-tailed distributions resemble a Gaussian (Normal) distribution, except the probability of extreme events is much greater and does not decrease exponentially as we get further from the mean, but sub-exponentially.

The most insidious aspect of fat-tailed variables is that they will appear Gaussian until the advent of a catastrophic, extreme event. As some traders say, “fat-tails hide their tails”.
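To get a feel for how differently the two kinds of tails behave, here is a minimal sketch comparing a standard Gaussian with a Pareto (power-law) distribution. The Pareto tail index `alpha = 3` and the thresholds are arbitrary choices for illustration, not values taken from any real dataset.

```python
import math

def gaussian_tail(k):
    """P(Z > k) for a standard normal variable Z."""
    return 0.5 * math.erfc(k / math.sqrt(2))

def pareto_tail(x, alpha=3.0, x_min=1.0):
    """P(X > x) for a Pareto variable (power-law tail with index alpha)."""
    return 1.0 if x <= x_min else (x_min / x) ** alpha

# How much rarer does an event become when we double the threshold?
for k in (2, 4, 8):
    gaussian_ratio = gaussian_tail(k) / gaussian_tail(2 * k)
    pareto_ratio = pareto_tail(k) / pareto_tail(2 * k)
    print(f"threshold {k} -> {2 * k}: "
          f"Gaussian {gaussian_ratio:.3g}x rarer, Pareto {pareto_ratio:.3g}x rarer")
```

For the power law, doubling the threshold always divides the probability by the same factor (`2 ** alpha`), however far out you go. For the Gaussian, the same doubling makes the event astronomically rarer, which is why Gaussian models treat extreme events as practically impossible.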

Misunderstandings of p-values

The p-value is the probability of observing data at least as extreme as ours if the null hypothesis is true :

p = P(Data at least this extreme| H0)

It is extremely common to find data scientists, academics and even some statisticians who misinterpret p-values. It is not the probability of H0 being true nor the probability of H1 being false. A p-value can only tell us how compatible the data is with the null hypothesis being true. It does not inform us on P(H0) nor P(H1).

Base rate fallacy

If the base rate of a disease is very low, say 1/1000, and I have a test that detects the disease with a 5% false positive rate, then if I test 1000 people I will get approximately 50 false positives and 1 true positive. That means if you test positive you have a ~2% chance of actually having the disease (and that’s assuming perfect statistical power, explained further down). From Statistics done wrong :

“In areas where most tested hypotheses are false, such as early drug trials (most early drugs don’t make it through trials), it’s likely that most statistically significant results with p < 0.05 are actually flukes.”

This mistake also comes from a misunderstanding of the p-value. A low p-value tells us that P(Data at least this extreme | H0) is low, but if the base rate of H0 is much higher than the base rate of H1, then H0 can still be much more likely than H1 even with a low p-value. Again, the p-value does not inform us on P(H0) and P(H1).
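The arithmetic of the disease example can be checked with Bayes' rule. This is a small sketch; the function name and the `sensitivity` parameter (set to 1.0 to mirror the "perfect power" assumption above) are my own naming, not something from the cited books.

```python
def prob_disease_given_positive(prevalence, false_positive_rate, sensitivity=1.0):
    """Bayes' rule: P(disease | positive test)."""
    true_positives = prevalence * sensitivity
    false_positives = (1.0 - prevalence) * false_positive_rate
    return true_positives / (true_positives + false_positives)

# Numbers from the example: base rate 1/1000, 5% false positives.
ppv = prob_disease_given_positive(prevalence=1 / 1000, false_positive_rate=0.05)
print(f"P(disease | positive) = {ppv:.1%}")
```

This gives roughly 2%, matching the count of 1 true positive among about 51 positives. Raising `prevalence` shows how quickly the same test becomes informative once the base rate is no longer tiny.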

Neglecting statistical power

The power of a study is the probability that it will distinguish an effect of a given size from pure luck. For example, a study can have a 50% chance of detecting a small effect, say 0.2 of a standard deviation. The power is affected by three factors : sample size, effect size and measurement error.

Studies with low power have a higher variance than studies with high power, which means it is easier to manipulate them to get the results you want. Also, since low power means small effects will not reach statistical significance, when you do detect a statistically significant effect it will often be a large one. Thus under-powered studies tend to overestimate the parameters they attempt to measure. This is called a “type-M” error (M for magnitude) or “truth inflation”.

Scientific journals tend to be full of studies with low power. For example, Ioannidis et al. (2017) found that 90% of economics studies are under-powered. This means that in that literature, the significant effects that are found will be overestimated. Also, if the power in a field/journal is low yet most studies in a journal find statistically significant results, which is often the case, it is very likely that there is significant publication bias in that literature.
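Truth inflation is easy to reproduce with a quick simulation. The sketch below assumes a true effect of 0.2 standard deviations, 20 subjects per group and a known standard deviation of 1 (so a simple z-test applies). All of these numbers are made up for illustration.

```python
import math
import random

random.seed(0)

TRUE_EFFECT = 0.2   # true difference between groups, in standard deviations
N_PER_GROUP = 20    # deliberately small, so the study is under-powered
N_STUDIES = 5000

# Standard error of a difference of two means (known standard deviation of 1).
standard_error = math.sqrt(2 / N_PER_GROUP)

significant_estimates = []
for _ in range(N_STUDIES):
    control = [random.gauss(0.0, 1.0) for _ in range(N_PER_GROUP)]
    treated = [random.gauss(TRUE_EFFECT, 1.0) for _ in range(N_PER_GROUP)]
    estimate = sum(treated) / N_PER_GROUP - sum(control) / N_PER_GROUP
    if abs(estimate) / standard_error > 1.96:  # two-sided z-test at the 5% level
        significant_estimates.append(estimate)

power = len(significant_estimates) / N_STUDIES
mean_significant = (sum(abs(e) for e in significant_estimates)
                    / len(significant_estimates))
print(f"power: {power:.1%}")
print(f"true effect: {TRUE_EFFECT}, "
      f"average size of significant estimates: {mean_significant:.2f}")
```

Only a small fraction of these simulated studies reach significance, and those that do report an effect several times larger than the true one. With this design, an estimate has to exceed about 0.62 in absolute value to be significant at all.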

Pseudo-replication

If you get 100 patients and measure their temperature, it is very different from taking one patient and measuring his temperature 100 times. Yet researchers do sometimes make that mistake in studies.

Let’s say you want to test whether people with names from ethnic group X suffer from hiring discrimination compared to people from ethnic group Y. You send two resumes that differ only in name, with one name coming from ethnic group X and the other from ethnic group Y, to 10000 companies and assess the callback rates. You might think you have a large sample of 10000 so your study is robust… but you’ve only tested two names. Instead of testing many names to see if there is truly ethnic bias, you’ve tested one name per ethnicity 10000 times, which makes you think you have a large sample. If you do notice bias it might have to do with the particular name you’ve chosen and not ethnicity.

In sum, pseudo-replication is when you replicate a treatment (measuring temperature, sending a CV) but the replicates of that treatment are not statistically independent (temperature taken from the same person twice will tend to correlate, using the same name every time).

Failing to account for multiple tests

If you test 10 hypotheses using a significance threshold of 0.05 for each one of them, your probability of getting at least one false positive is not 5% but ~40% (1 minus the probability that none of them is a false positive, or $1 - 0.95^{10} \approx 0.40$, assuming the tests are independent and every null hypothesis is true). If you want this probability to remain 5%, you can account for it with the Bonferroni correction. It essentially consists of dividing the acceptable false positive rate alpha by the number of tests, so in our case we would use a significance threshold of 0.05 divided by 10, so 0.5%. Of course, since the new threshold for significance is stricter, this reduces statistical power.
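The numbers in this paragraph can be reproduced in a few lines. The sketch assumes the tests are independent and that every null hypothesis is true; the function name is mine.

```python
def family_wise_error_rate(alpha, n_tests):
    """P(at least one false positive) across n independent tests, all nulls true."""
    return 1.0 - (1.0 - alpha) ** n_tests

alpha, n_tests = 0.05, 10

uncorrected = family_wise_error_rate(alpha, n_tests)
bonferroni_alpha = alpha / n_tests          # per-test threshold after correction
corrected = family_wise_error_rate(bonferroni_alpha, n_tests)

print(f"uncorrected: {uncorrected:.1%}")
print(f"Bonferroni threshold per test: {bonferroni_alpha}")
print(f"corrected: {corrected:.1%}")
```

Without correction the chance of at least one false positive is about 40%. With the per-test threshold at 0.005 it drops back to just under 5%.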

That’s the reason why genome wide association studies (GWAS) need huge sample sizes (n can be in the 100k and even in the millions), because for every single nucleotide polymorphism (SNP) they run a statistical test to see if it is associated with the studied trait, which reduces power immensely. GWAS usually use the $5 \times 10^{-8}$ threshold of significance to account for the huge number of tests computed.

Ignoring that the difference between statistically significant and not statistically significant is not itself statistically significant

Let’s say you test two drugs against a placebo to see if they are statistically significantly better than it to treat a certain ailment. One of the tests returns a significant result but the other does not. That means one drug is statistically significantly better than the other, right ? No. Wrong.

If you want to see which drug is better, test them against each other. And the result could turn out to not be statistically significant even if one was significantly better than the placebo and the other was not.

Also, comparing the confidence intervals of the two drugs against the placebo is not a substitute for a direct test: intervals that overlap do not mean the drugs are not significantly different. Don’t use two comparisons against a placebo to make inferences about the difference between the two treatments.
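A worked example makes the point concrete. The effect sizes and standard errors below are invented, and the direct comparison assumes the two estimates are independent (for instance, each drug had its own placebo group), which is a simplification.

```python
import math

def two_sided_p(z):
    """Two-sided p-value for a z-statistic under a standard normal."""
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical effect estimates versus placebo (same units, made-up numbers).
effect_a, se_a = 2.5, 1.0
effect_b, se_b = 1.0, 1.0

p_a = two_sided_p(effect_a / se_a)   # drug A vs placebo
p_b = two_sided_p(effect_b / se_b)   # drug B vs placebo

# Direct comparison of A and B, assuming independent estimates.
se_diff = math.sqrt(se_a ** 2 + se_b ** 2)
p_diff = two_sided_p((effect_a - effect_b) / se_diff)

print(f"A vs placebo: p = {p_a:.3f}")
print(f"B vs placebo: p = {p_b:.3f}")
print(f"A vs B:       p = {p_diff:.3f}")
```

Drug A is significant against the placebo (p ≈ 0.012) and drug B is not (p ≈ 0.32). Yet the direct comparison between A and B gives p ≈ 0.29, nowhere near significance.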

Data dredging

If you keep searching and searching a dataset, you’re going to find some statistically significant association. That means that if you find something interesting during exploratory data analysis, you’ve got to test it with another dataset, to see if the finding generalizes. If you don’t, just searching through the data creates too many opportunities for false positives.

A famous example is that of a tenured Cornell psychology professor who bragged on a podcast that after collecting a dataset, his students would relentlessly search it and would not give up until they had found something. This is, of course, terrible scientific practice and a recipe for false positives.

A way of countering this is pre-registration, which consists of declaring what analyses you will run on a dataset before you do. This way, you cannot try 36 different statistical tests until you find something significant, since the public already knows the hypothesis you wanted to test through the pre-registration.

Regression to the mean bias

Imagine you’re a restaurant chef and one of your cooks puts in a bad performance one day. You decide to scold him in order for him to improve his performance. The next day, he performs much better. You then conclude that scolding employees after a bad performance causes them to improve.

But it could also very well be that performance naturally varies from day to day, and that the bad performance was just an outlier. The next day the cook gets back to his average performance and you interpret that as an effect of your intervention.

This bias is called regression to the mean. It means that if you pick an outlier from a certain distribution, the next value you pick is probably going to be closer to the mean.
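Here is a rough simulation of the cook example in which nobody is ever scolded. Performance is modeled as a fixed skill level plus independent daily noise; the scale and the numbers are arbitrary assumptions.

```python
import random

random.seed(1)

N_DAYS = 50_000
SKILL = 70.0         # the cook's stable average performance (arbitrary scale)
DAILY_NOISE = 10.0   # day-to-day variation unrelated to any intervention

scores = [random.gauss(SKILL, DAILY_NOISE) for _ in range(N_DAYS)]

# Pick out the "bad days" (roughly the worst 5%) and look at the day after each.
threshold = sorted(scores)[N_DAYS // 20]
bad_days = [i for i in range(N_DAYS - 1) if scores[i] < threshold]

mean_bad_day = sum(scores[i] for i in bad_days) / len(bad_days)
mean_next_day = sum(scores[i + 1] for i in bad_days) / len(bad_days)

print(f"average score on bad days:       {mean_bad_day:.1f}")
print(f"average score the following day: {mean_next_day:.1f}")
```

The day after a bad day is, on average, simply an ordinary day near the cook's usual level. If scolding had been added after each bad day, it would have looked very effective.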

Stopping rules issues

If you run an experiment and stop it once you get the desired result, it is a form of p-hacking. This practice is sometimes used because experiments can be expensive and it can save money, but it is a bad practice. You’ll probably save more money by running the experiment once correctly than by having to redo it because stopping early flawed the results.

Indeed, as you perform an experiment, your parameter of interest might cross the significance threshold before falling below it again. If you stop the experiment as soon as you get a statistically significant result, you increase your chances of a false positive because the result might not have been significant if you kept going.
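The inflation of false positives can be estimated with a small Monte Carlo simulation. The sketch below assumes there is no real effect, a known standard deviation of 1, and an experimenter who checks a z-test every 10 observations and stops at the first significant result. The sample sizes are arbitrary.

```python
import math
import random

random.seed(2)

def experiment_is_significant(max_n, peek_every):
    """Simulate one experiment under the null (true mean 0, known sd 1).

    Returns True if a two-sided z-test is significant at any peek.
    """
    total = 0.0
    for n in range(1, max_n + 1):
        total += random.gauss(0.0, 1.0)
        if n % peek_every == 0:
            z = total / math.sqrt(n)   # sample mean divided by its standard error
            if abs(z) > 1.96:
                return True            # stop early and report the "finding"
    return False

N_EXPERIMENTS = 2000
peeking_rate = sum(experiment_is_significant(200, 10)
                   for _ in range(N_EXPERIMENTS)) / N_EXPERIMENTS
single_look_rate = sum(experiment_is_significant(200, 200)
                       for _ in range(N_EXPERIMENTS)) / N_EXPERIMENTS

print(f"false positive rate with peeking every 10: {peeking_rate:.1%}")
print(f"false positive rate with a single look:    {single_look_rate:.1%}")
```

With a single look at the end the false positive rate stays near the nominal 5%. With twenty looks it is several times higher, even though each individual test uses the same 5% threshold.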

Unnecessary discretization

Don’t dichotomize variables when it is not absolutely necessary. It both reduces statistical power and increases the likelihood of false positives. By trying different dichotomizations you can spuriously create correlations and/or effects that were not there originally. When you read a study that has dichotomized its variables with no justification and no obvious reason, beware.

Overfitting

This one applies to both statistics and machine learning. Overfitting happens when your model has a good performance on the data you used to create it, but bad performance on new data it hasn’t “seen” before.

In machine learning, it is common knowledge that you need a training set, to create the model, and a test set to evaluate its performance on new data. If you have to choose between different models or tune hyper-parameters then you need a validation set.

However, in other disciplines it happens that researchers publish regression results with no test set. It is then impossible to know if your model is only good at predicting the training data or if it can be used on new data with no loss of performance.

If you visit kaggle.com and check out some notebooks written by users, you will occasionally find some that have a training set and a test set, but no validation set. If you only train a single model and do not have hyper-parameters to tune, this is fine. That is why regression studies in other disciplines, say psychology, usually do not have a validation set: they usually only try linear regression or logistic regression. However, if you test several models or several hyper-parameter configurations then you need a validation set.

The reason is, if you train multiple models on the training set and then choose the one that performs best on the test set, your test set ceases to be a good estimate of the generalization error. This is akin to the multiple test problem, if you try ten different models on the test set you increase the chance of one of them being adjusted to the test set by pure chance. Therefore, if you need to perform model selection or hyper-parameter tuning, you need both a test set and a validation set.
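An extreme case shows why selecting on the test set is a problem. In the sketch below every "model" is a coin flip with a true accuracy of 50%, so any apparent advantage of the selected model is pure selection noise. The number of models and the set sizes are arbitrary.

```python
import random

random.seed(3)

N_MODELS = 20
TEST_SIZE = 100
FRESH_SIZE = 10_000

def accuracy_of_coin_flip_model(n_examples):
    """Accuracy of a 'model' that guesses binary labels at random (true accuracy 50%)."""
    return sum(random.random() < 0.5 for _ in range(n_examples)) / n_examples

# Every model is pure noise, yet we keep the one that scores best on the test set.
test_scores = [accuracy_of_coin_flip_model(TEST_SIZE) for _ in range(N_MODELS)]
best_test_score = max(test_scores)

# The same kind of model on data that played no role in the selection.
fresh_score = accuracy_of_coin_flip_model(FRESH_SIZE)

print(f"best test-set accuracy among {N_MODELS} models: {best_test_score:.0%}")
print(f"accuracy on fresh data: {fresh_score:.1%}")
```

The best of the twenty scores looks better than chance only because it was picked for being the maximum. A separate validation set for the selection step keeps the test set honest.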

Data leakage

Data leakage is when your model is trained while having access to information it probably will not possess in deployment. An example would be using “Monthly salary” to predict “Yearly salary”. If you have access to one variable then you don’t even need a model to compute the other, so if you’re ever put in a situation where you need to predict “Yearly salary” then you most likely will not have access to “Monthly salary”.

A model trained on a dataset containing “Monthly salary” can thus have a good performance simply because of that feature. Then, when in deployment, its performance will be much weaker because it will not have access to it.

This might seem a bit obvious, but in a high dimensionality dataset it can take a lot of time to check if any of your columns is a proxy for the dependent variable you want to predict, and you might forget one. Data leakage has been deemed one of the top ten data mining mistakes.

Regression and causation

A regression analysis cannot tell you about causation outside very strict conditions which are rarely met in practice. If weight is a predictor of heart disease in a regression, that does not mean that losing weight will reduce your risk of heart disease. Regression does not provide information on intervention.

There exist methods to estimate causal effects from observational data, like structural equation modeling or Judea Pearl’s causal inference. But standard good old regression is unfit to infer causal effects in most cases because of various problems of collinearity, reverse causation, latent variables and non-linearity of effects.

In Causation, prediction and search, Glymour et al. discuss this point at length in chapter 8. They illustrate their arguments by generating data from causal graphs of their making and then trying to use regression to estimate the causal effects of various variables. It turns out that regression is pretty bad at that task.

Simpson’s paradox

This paradox has fooled a lot of people during the Covid-19 epidemic, so let’s use that example since it is especially relevant. Anti-vaccine activists would often cite statistics along the lines of “most hospitalized people were fully vaccinated”. However, if you look at any single age bracket, vaccinated people tend to have better health outcomes.

How is this possible ? In every age group vaccinated people have better outcomes but in aggregate most hospitalized people are vaccinated. Well it’s simply because vaccinated people tend to be much older and age is therefore a confounding variable ! Add to that that if most people are vaccinated then most hospitalized people could be vaccinated even if vaccines work, simply because there are so few unvaccinated people.

So to summarize, Simpson’s paradox arises whenever a trend is observed in all or several groups of data, but is reversed when the groups are aggregated. This problem can usually be solved by adjusting for confounding variables such as age in our example. If you need more examples, the Wikipedia entry for Simpson’s paradox supplies several famous ones such as the kidney stones case or the baseball batting average case.
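The reversal is easy to reproduce with a toy table. The counts below are entirely hypothetical, chosen only so that the vaccinated group is much older. They are not real Covid-19 figures.

```python
# Hypothetical counts, invented for illustration: (hospitalized, group size).
data = {
    "young": {"unvaccinated": (10, 1000), "vaccinated": (2, 1000)},
    "old":   {"unvaccinated": (20, 100),  "vaccinated": (95, 1900)},
}

def rate(hospitalized, size):
    """Hospitalization rate within a group."""
    return hospitalized / size

# Within every age group, vaccinated people do better.
for age, groups in data.items():
    rates = {status: f"{rate(*counts):.1%}" for status, counts in groups.items()}
    print(age, rates)

# Aggregated over age groups, the comparison flips.
totals = {}
for status in ("unvaccinated", "vaccinated"):
    hospitalized = sum(data[age][status][0] for age in data)
    size = sum(data[age][status][1] for age in data)
    totals[status] = rate(hospitalized, size)
    print(f"all ages, {status}: {totals[status]:.1%} ({hospitalized} of {size})")
```

In this made-up table the vaccinated have a lower hospitalization rate in both age groups, yet a higher rate overall, and they account for 97 of the 127 hospitalizations. The whole effect comes from the vaccinated group being mostly old.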

Exploiting degrees of freedom

From Statistics done wrong by Alex Reinhart :

“In simulations, it’s possible to get effect sizes different by a factor of two simply by adjusting for different variables, excluding different sets of cases, and handling outliers differently.”

Having the freedom to test all kinds of procedures on your data will almost inevitably lead you to find an effect that appears useful.

This mistake is different from what we’ve defined as data dredging earlier. Data dredging as we’ve defined it means exploring the data with no idea of what you want to find, and searching it until you find something interesting. In this case, even if you declare that you will run a Z-test on the correlation of X and Y, you still have room to p-hack the result. That can be through outlier processing or controlling for different confounds, as we’ve seen above. That means that on top of knowing what question you want to answer, you have to know in advance precisely what procedures can answer it and how you will pre-process the data.

Bias towards exciting, positive results

This problem is particularly prevalent in academia and scientific journals, but it can also be relevant for data scientists in an organization. Scientific journals tend to reject papers with negative conclusions, such as “drug X does not work for headaches” or “No correlation between income and height in Arizona”.

This is in part due to the incentive structure of the academia-scientific journals complex, in which science journals are run by private companies for profit and therefore need positive, exciting results to sell. That means studies showing that drug X does not work will not be published, but studies that show it does will be, even if they are false positives. So there could be 9 unpublished studies that show the drug doesn’t work, yet the only one we’ll see is the false positive published in the journal. We will then think that the drug actually works when most studies performed on it showed the inverse.

This problem combined with the other problems we talked about has generated a replication crisis in almost all scientific journals, including the most prestigious ones. Many studies, when repeated, yield results that contradict the original. In some fields, more than half of published results seem to suffer from this problem. In fact, even meta-analyses seem to be contradicted roughly 1/3 of the time by a large randomized controlled trial.

How does this apply to practitioners of statistics outside of academia ? Well, say you’re the head of a business and ask your data science team to provide you with actionable insights extracted from the data, as is the norm. The data science team is unlikely to report null results or unexciting ones, so you might just receive a compilation of exciting, seemingly actionable positives in your report that will not hold through time on new data. It is important even outside academia to define precise business questions you want to answer before starting data analysis.

False ergodicity assumption

An ergodic system is a system where the average of a parameter for one individual through time is equal to the average of the ensemble of individuals. In finance and economics, it is often assumed that the studied systems are ergodic, which is often false.

The example Nassim Taleb often uses is the following. A single person going to the casino 100 times will not have the same return as the average of 100 people going to the casino once. The reason is, if a single individual keeps playing indefinitely, his expected return is… 0. Because there is a non-zero probability of losing everything (aka ruin), if you play enough times you will lose everything, and then you can’t play anymore.

Yet many financial models (modern portfolio theory for example) assume ergodicity and expose their users to far greater risk than they imagine. It also means that investing in an index fund does not guarantee that you will personally earn the same returns as the market.

Ole Peters was the first one to apply the principle of ergodicity to economics, and found that when accounting for ergodicity, some “cognitive biases” become perfectly rational, such as risk aversion.
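A simple way to see the gap between the ensemble average and the time average is a repeated multiplicative bet. The gamble below (+50% on heads, -40% on tails) is a hypothetical example chosen for illustration. It is a different mechanism from the casino ruin argument above, but it shows the same kind of mismatch.

```python
import math
import random

random.seed(4)

UP, DOWN = 1.5, 0.6   # hypothetical bet: +50% on heads, -40% on tails
N_ROUNDS = 100
N_PLAYERS = 5000

# Ensemble average: expected growth factor of wealth over one round.
ensemble_growth = 0.5 * UP + 0.5 * DOWN
# Time average: long-run growth factor per round for a single player.
time_growth = math.sqrt(UP * DOWN)

def final_wealth(n_rounds):
    """Wealth of one player who starts with 1 and reinvests everything each round."""
    wealth = 1.0
    for _ in range(n_rounds):
        wealth *= UP if random.random() < 0.5 else DOWN
    return wealth

results = [final_wealth(N_ROUNDS) for _ in range(N_PLAYERS)]
share_losing = sum(w < 1.0 for w in results) / N_PLAYERS

print(f"ensemble growth per round: {ensemble_growth:.3f}")
print(f"time-average growth per round: {time_growth:.3f}")
print(f"share of players who end below their starting wealth: {share_losing:.0%}")
```

The expected value grows by 5% per round, so the bet looks attractive on paper. A single player, however, sees wealth shrink by about 5% per round in the long run, and the large majority of simulated players end up poorer than they started.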

Regressing a thin-tailed variable against a fat-tailed one

As we’ve discussed, most socio-economic variables follow fat-tailed distributions. This makes them produce rare extreme events that are extremely hard to predict. Yet it is common practice in economics, political science or sociology to regress a thin tailed variable on a fat-tailed one and then share the obtained coefficient of determination $R^{2}$ .

Yet $R^{2}$ is systematically overestimated when the dependent variable is fat-tailed, and it makes it look like the dependent variable is much more predictable than it actually is.

Think about it, the coefficient of determination, $R^{2}$, is simply the variance in the outcome variable explained by the model divided by the total variance of the dependent variable. But these parameters are calculated from the data, and the data will probably not contain extreme events of fat-tailed variables, since they are very rare. So these rare extreme events are absent from your data yet would hugely change the value of the variance if they were present. That means the variance of the fat-tailed variable will be severely underestimated when you compute it from the data.

Now, if the variance of the dependent variable tends towards infinity then $R^{2}$ tends towards 0. Fat-tailed variables can have HUGE variances yet you would not know it from the data. Therefore $R^{2}$ is not a good estimate of how predictable a fat-tailed dependent variable is. The same logic also applies to the correlation coefficient.

This problem does not only affect linear regression, but also all machine learning models. If you evaluate your neural network’s performance at predicting a fat-tailed variable with $R^{2}$, it will also be overestimated. And using another performance metric cannot solve the problem.
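The claim that the variance of a fat-tailed variable is usually underestimated from a sample can be checked by simulation. The sketch below assumes a Pareto distribution with tail index 2.5 (chosen so that the true variance is finite and known in closed form) and samples of 100 points. Both choices are arbitrary.

```python
import random
import statistics

random.seed(5)

ALPHA = 2.5   # Pareto tail index (assumed); the variance is finite only for alpha > 2
TRUE_VARIANCE = ALPHA / ((ALPHA - 1) ** 2 * (ALPHA - 2))
SAMPLE_SIZE = 100
N_SAMPLES = 2000

def sample_variance(values):
    """Unbiased sample variance."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)

sample_variances = [
    sample_variance([random.paretovariate(ALPHA) for _ in range(SAMPLE_SIZE)])
    for _ in range(N_SAMPLES)
]
share_underestimating = sum(v < TRUE_VARIANCE for v in sample_variances) / N_SAMPLES
median_variance = statistics.median(sample_variances)

print(f"true variance: {TRUE_VARIANCE:.2f}")
print(f"median sample variance: {median_variance:.2f}")
print(f"share of samples that underestimate the variance: {share_underestimating:.0%}")
```

Most samples produce a variance well below the true value, because the rare extreme draws that dominate the true variance are usually missing from any given sample. A typical dataset therefore understates how much there is to explain.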

Using Pearson’s correlation for non linear relationships

In statistics 101, we are taught that Pearson’s correlation is a measure of a linear association between two variables. We are told that correlation does not work for non linear relationships. An example would be $y = |x|$ with x distributed symmetrically around 0. Even though y is perfectly predictable by knowing x, the Pearson correlation between the two variables would be 0. Everyone seems to forget this part after finishing stats 101 because the correlation coefficient is everywhere, yet most real world relationships are not linear.

Therefore, assuming there is no effect because we found a correlation of r=0 is wrong. In order to make such interpretations you should use mutual information, an entropy based measure that relies on probabilistic independence and not standard deviation.
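The $y = |x|$ example can be verified directly. The sketch below computes Pearson's correlation by hand on an evenly spaced grid that is symmetric around 0. The grid itself is an arbitrary choice.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = math.fsum(xs) / n, math.fsum(ys) / n
    cov = math.fsum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = math.fsum((x - mean_x) ** 2 for x in xs)
    var_y = math.fsum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

xs = [i / 100 for i in range(-100, 101)]   # symmetric around 0
ys = [abs(x) for x in xs]                  # y is fully determined by x

r_full = pearson(xs, ys)

# Restricted to x >= 0 the relationship is linear and the correlation is perfect.
right_half = [(x, y) for x, y in zip(xs, ys) if x >= 0]
r_right_half = pearson([x for x, _ in right_half], [y for _, y in right_half])

print(f"correlation over the whole range: {r_full:.3f}")
print(f"correlation for x >= 0 only:      {r_right_half:.3f}")
```

Over the whole range the positive and negative halves cancel and the correlation is zero, even though knowing x tells you y exactly. On either half alone the correlation is perfect.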

Many also forget that, while Pearson’s correlation coefficient is a measure for linear relationships, it should not itself be interpreted linearly. A 50% correlation is much closer to 0 than it is to 100%, which is incredibly counter-intuitive.

End remarks

As you might have realized reading this piece, statistics are very hard. This list is not exhaustive at all and even if you know and understand every single one of the pitfalls listed there are still other ways to get fooled by data. However I do think I covered the most common mistakes with the most serious consequences. If I find new bad statistical practices that fit this list I will add them to this post. I hope you enjoyed the post and learned something, or at least got a good statistics refresher !

TLW, 4y

One issue that I personally see a fair amount is people not grokking what happens when you have multiple superimposed probability distributions.

If I have two normal distributions superimposed, one with mean=0 and high variance, and one with mean != 0 and lower variance, the high-variance distribution will account for the majority of the outliers in both directions. (And the vast majority of extreme outliers in both directions. The tails are exponential; the ratio of two exponentials with different bases is itself an exponential, and drops towards zero surprisingly quickly.)

This can cause issues in a bunch of ways:

If you're focusing on outliers, you can miss that the low-variance distribution exists at all.
If you're defocusing / ignoring outliers, you're affecting the high-variance distribution more than the low-variance distribution.
Things can become weird when you combine more complex distributions.

*****

Three pet peeves of mine - all of which you touched upon so kudos - are:

When people take "study X failed to replicate" as refuting X, ignoring that the replication had terrible statistical power.
When people take "study X didn't show significance" as implying that all subsets of X must be insignificant.
When people take "study X showed significance; study Y failed to replicate X" as implying one or more of the following:
1. At least one of X or Y were faked.
2. At least one of X or Y were incorrect.
3. X is insignificant.
4. There were no important differences between the methodology of studies X and Y.

I really wish that people wouldn't use the normal distribution as a default. Start with the assumption that it's a fat-tailed distribution, and if/when you have the data to show that it isn't then go back to a normal distribution.

Alas, statistics/assumptions based on Gaussians is what gets published & publicized everywhere, so that's what people use.

Younes Kamel, 4y

100% agree with defaulting to non-gaussian distribution. That is what rigorous statistics would look like imo.

meedstrom, 4y

I'm concerned that getting well-versed in statistics mistakes has the same issues as doing so for biases and fallacies (Knowing about Biases Can Hurt People). When you're analyzing others' studies, you'll find that basically every study has at least one flaw of some kind. So this opens the door for unconsciously applying harsher criticism against studies you disagree with.

Younes Kamel, 4y

Yes, for sure. You can still fall for selective skepticism, where you scrutinize studies you don't like much more than studies you "like". You can deal with that by systematically applying the same checklist to every study you read, but that might be time consuming. The real solution is probably a community that is versed in statistics and that has open debates on the quality of studies. Perhaps, cumulatively, biases will cancel each other out if the community has enough diversity of thought. Hence the value of pluralism.

meedstrom, 4y

First off, I like the compilation you made and I'm tempted to memorize it despite all I'm saying.

This 'pluralism' solution does not feel meaty -- your last sentence "Hence the value of pluralism" sounds to me like an applause light. I mean yeah, ultimately you and I build a lot of what we know on trust in the whole collective of scientists. But it's not directly relevant/useful to say so; there should be a halfway good solution for yourself as a solo rationalist, and calibrating yourself against others' beliefs is an extra measure you may apply later. Because I still prefer all those others to have used good solo toolkits for themselves: it makes them more reliable for me too.

Tentatively, for a real solution, I propose that it's better to focus on what right statistics looks like so that wrong statistics will automatically generate a feeling of puzzlement, and this way you still anyways get the ability to compare the quality of two studies.

Or you could learn each type of misuse as part of thoroughly learning the concept where they apply, with focus on better understanding that concept, not on learning about the misuse.

Leo P., 4y

Regarding the stopping rule issue, it really depends how you decide the stopping. I believe sequential inference lets you do that without any problem, but it's not the same as saying that the p-value is within the wanted bounds. But basically all of this derives from working with p-values instead of workable values like log-odds. The other problem of p-values is that they only let you work with binary hypotheses and make you believe that writing things like P(H0) actually carries a meaning, when in reality you can't test a hypothesis in a vacuum, you have to test it against another hypothesis (unless once again it's binary of course).

Another common mistake you did not talk about is one made in many meta-analyses: one aggregates the data of several studies without checking if the data are logically independent.

Younes Kamel, 4y

I'm not as versed in mistakes of meta-analysis yet, but I'm working on it ! Once I compile enough meta-analysis misuses I will add them to the post. Here is one that's pretty interesting :

https://crystalprisonzone.blogspot.com/2016/07/the-failure-of-fail-safe-n.html

Many studies still use fail-safe N to account for publication bias when it has been shown to be invalid. If you see a study that uses it you can act as if they did not account for publication bias at all.

As someone who wants to do systematic reviews (meta-analysis with a certain rigidly prescribed structure), I would love to hear about the mistakes to watch out for!

Regarding the stopping rule issue, it really depends how you decide the stopping.

One surprisingly good sanity check I've found is to do up a quick Monte Carlo sim in e.g. Python.

As someone who uses statistics, but who is not a statistician, it's caught an astounding number of subtle issues.

Younes Kamel, 4y

I'm starting to realize that as well. It can give you the intuition without having to memorize theorems. I think I'm going to start using simulations a lot more.

TLW, 4y

It can give you the intuition without having to memorize theorems.

I find it's more helpful as a tool to catch wrong intuitions than as a crutch for missing intuition, personally. If you made a mistake with your simulation and you had the wrong intuition (or right intuition), you know something is up (unless the mistake happened to line up with a wrong intuition, at least). If you made a mistake with your simulation and you had no intuition, you're off in the weeds.

I think I'm going to start using simulations a lot more.

Some general pieces of advice, from someone who does a surprising number of quick simulations for sanity-checking:

Try to introduce small amounts of correlation in everything. In actuality, everything^[1] is correlated to some degree. Most of the time this does not matter. Every once in a while, it makes a huge difference.
Try to introduce small amounts of noise into everything. In actuality, everything^[2] has noise to some degree. Most of the time this does not matter. Every once in a while, it makes a huge difference.
Beware biased RNGs. Both the obvious and the not so obvious. Most of the time this does not matter. Every once in a while, it makes a huge difference.
Beware floating-point numbers in general. You can write something quickly using floats. You can write something safely using floats. Have fun doing both at once.
1. Corollary: if you can avoid division (or rewrite to avoid division), use integers instead of floats. Especially if you're in a language with native bigints.
2. Rerunning with two different floating-point precisions (e.g. Decimal's getcontext().prec) can be a decent sanity check, although it's not a panacea^[3].
Beware encoding the same assumptions into your simulation as you did in your intuition.
R... I can't really say.
Python is decent. Python is also slow.
1. numpy is faster if you're doing block operations. If you can (and know how to) restructure your code to take advantage of this, numpy can be quick. If you don't, numpy can be even slower than standard Python.
2. PyPy can offer a significant performance boost for Monte-carlo style 'do this thing a billion times' code. That being said, PyPy has disadvantages too.

^[1] Or close enough. P(neutrino hitting A | B) != P(neutrino hitting A | ~B) for pretty much any^[4] A and B.
^[2] Thermal noise being the most obvious example.
^[3] See e.g. the Tent map at m=2, where any finite binary precision will eventually fold to zero.
^[4] Obvious exception for events that are not within each other's lightcones - though even in that case there's often some event that's in the causal chain of A or B that has the same correlation.
