This error shows up in data analytics when people mine huge amounts of data for hidden correlations. Combining and transforming factors by brute force creates massive degrees of freedom. Spurious correlations get discovered that turn out to be noise rather than signal, and the model or the findings then fail to transfer outside of the training environment.
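To make this concrete, here is a minimal sketch of the effect, assuming nothing but independent Gaussian noise for both the target and the candidate factors. The function names (`pearson`, `best_spurious_feature`) and the sizes (30 rows, 2000 factors) are my own illustrative choices, not anything from a real pipeline. It searches the factors for the one that correlates best with the target on the training rows, then checks that same factor on fresh rows.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def best_spurious_feature(n_rows=30, n_features=2000, seed=0):
    """Pick the noise factor most correlated with a noise target.

    Returns (train_r, test_r) for the selected factor. Every value is
    independent noise, so any correlation found is spurious by construction.
    """
    rng = random.Random(seed)
    train_y = [rng.gauss(0, 1) for _ in range(n_rows)]
    test_y = [rng.gauss(0, 1) for _ in range(n_rows)]

    best_train_r, best_test_r = 0.0, 0.0
    for _ in range(n_features):
        train_x = [rng.gauss(0, 1) for _ in range(n_rows)]
        test_x = [rng.gauss(0, 1) for _ in range(n_rows)]
        r = pearson(train_x, train_y)
        # Selection uses only the training rows, as a brute-force search would.
        if abs(r) > abs(best_train_r):
            best_train_r, best_test_r = r, pearson(test_x, test_y)
    return best_train_r, best_test_r


train_r, test_r = best_spurious_feature()
print(f"selected factor: train r = {train_r:+.2f}, held-out r = {test_r:+.2f}")
```

With this many candidates and so few rows, the winning factor should look convincing on the training rows, while its held-out correlation is just another draw from the noise. Checking the finding on data that played no part in the search is the simplest guard I know of.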

It seems to be a blind spot in the current culture of Data Science. I don't see many colleagues who focus on this error, or who can be easily convinced that it is a problem.

The only more prevalent error I find is target leakage. This would take the form of 'we defined the speed of light using the dimensions of the pyramid. Now behold the great coincidence of the sides correlating to the energy stored in matter.' In practice, though, the connections are usually harder to spot.
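A toy sketch of what leakage looks like, assuming a made-up churn dataset. The column names `account_closed_flag` and `tenure_months`, the 30% churn rate, and the 12-month rule are all hypothetical. The leaky column is defined from the target itself, so it "predicts" perfectly, while a column available before the outcome does not.

```python
import random

rng = random.Random(1)

# Hypothetical churn labels: the outcome is decided first...
churned = [rng.random() < 0.3 for _ in range(1000)]

# ...and this column is recorded only as a consequence of churning,
# so it is defined by the target rather than known ahead of it.
account_closed_flag = [int(c) for c in churned]

# A column that would be available before the outcome (pure noise here).
tenure_months = [rng.randint(1, 60) for _ in churned]


def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    return sum(p == l for p, l in zip(predictions, labels)) / len(labels)


leaky_acc = accuracy([bool(f) for f in account_closed_flag], churned)
honest_acc = accuracy([t < 12 for t in tenure_months], churned)
print(f"leaky feature accuracy:  {leaky_acc:.2f}")
print(f"honest feature accuracy: {honest_acc:.2f}")
```

Real cases are rarely this blatant. The leaked information usually arrives through a timestamp, an ID, or an aggregate computed after the fact, which is why it is hard to spot.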

I think the main factor would be a sense of confidence that the person has a genuine chance of contributing meaningful ideas to the subject. If you think that others are already miles ahead of you, then a combination of the bystander effect and the Dunning-Kruger effect may stop you from ever trying. After all, there is social risk in presenting ideas that you expect to be seen as foolish.

This post in particular was very difficult for me to grasp. In the first example, the meaning of "nonlocal" in this context was not immediately obvious to me, and the point about the compact program needed to represent the machine was also lost on me.
How this translated to leaky generalizations, and further to the blurring of utility and expected utility, was completely incomprehensible.
I'm an average reader who has comprehended most of the "sequences" posts up to this one. Perhaps drawing some connective tissue between the example and the results would make it clear to future average readers?