P-Values Know When You're Cheating
This post is not "the solution to p-hacking" or even "the problem with p-hacking". But I had some recent insights into the thinking behind the p-value approach and wanted to see how they apply to p-hacking as an issue. While p-values are a flawed approach to reporting final results, they do provide a useful heuristic for comparing methodologies.

Basic idea: p-values implicitly punish sloppy methodology. Suppose there are two different ways of testing a hypothesis that happen to yield the same results. When updating our expectations of future experience, we won't see a difference, because they produced the same results. However, if we use p-values to measure our confidence in the result, they will differ relative to the methodology used. This is unhelpful for updating expectations, but valuable if you are trying to compare methods. The key here is that p-values compare all methods against a uniform yardstick: the chance of being fooled under the null hypothesis. If a method is more likely to be tricked by random variation, it gets a weaker p-value.

Consider, then, a typical example of p-hacking. Hart is a researcher who suspects that in a given situation dependent variable B is correlated with independent variable A. They run an experiment measuring B against A. But at the same time, they measure and record C, D, E, etc. So, if A ~ B ends up at p = 10%, Hart can keep testing variables until they find C ~ F at p = 4.3% and publish this 'statistically significant' result.

But here is the rub: the method that Hart used included testing variables until they found something significant. Even the flawed reporting method of the p-value approach has something to say about this. This is yet another motivated-stopping method. Suppose we measure eight variables in total. There are then 28 pair-wise interactions. These do not have 28 separate null hypotheses. When considering Hart's method as a whole, the relevant null hypothesis is that none of the variables are correlated with each other, and under that null the chance that at least one of the 28 comparisons comes out below p = 5% is far higher than 5% (roughly 1 - 0.95^28, about 76%, if the tests were independent). Measured against the uniform yardstick, Hart's method as a whole earns a much weaker p-value than the 4.3% it would like to report.
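To make that arithmetic concrete, here is a minimal simulation sketch (not from the original post) of Hart's method under the global null, assuming numpy and scipy are available: eight unrelated variables, all 28 pairwise correlations tested, and the experiment "succeeds" if any pair comes out below p = 5%. The sample size, seed, and trial count are illustrative choices.

```python
# Sketch: how often does "test pairs until something is significant"
# succeed when there is nothing to find?
import numpy as np
from itertools import combinations
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
n_samples = 30     # observations per variable (arbitrary choice)
n_vars = 8         # A, B, C, ... as in the example
n_trials = 5_000   # simulated "experiments"

false_positives = 0
for _ in range(n_trials):
    # Null world: all eight variables are independent noise.
    data = rng.normal(size=(n_samples, n_vars))
    # Test every one of the 28 pairwise correlations.
    p_values = [pearsonr(data[:, i], data[:, j])[1]
                for i, j in combinations(range(n_vars), 2)]
    # Hart publishes whichever pair looks best, if any clears 5%.
    if min(p_values) < 0.05:
        false_positives += 1

print(f"Chance the method 'finds' p < 5% under the null: "
      f"{false_positives / n_trials:.1%}")
```

If the 28 tests were fully independent the rate would be 1 - 0.95^28, about 76%; the shared variables make the tests mildly dependent, but the simulated rate should land in the same neighbourhood, and that order-of-magnitude number is the p-value Hart's method as a whole actually deserves.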
Quite so. I haven't read 'Skin in the Game' yet, but it was recommended by a friend who read an early version of this post. It looks like it conveys this point exactly.
In response to the caution you referred to, I would agree. In reality we should only be watching practitioners, not listening to them, and even then we can only treat the observation as Bayesian evidence.
One problem with this is that most people don't objectively summarize their lives and post all their consequences online. If we want to get more evidence than we can gather personally, we are going to have to listen to someone.