Let’s say you are trying to make predictions about a variable. For example, maybe you are an engineer trying to keep track of how well the servers are running. It turns out that if you use the obvious approach advocated by e.g. frequentist statistics [1], you will have huge biases in what you pay attention to, compared to what you should pay attention to, because you will disregard big things in favor of common things. Let’s make a mathematical model of this.
Measurement error is proportional to magnitude
Because of background factors that fluctuate at a greater frequency than you can observe/model, and because of noise that enters through your indirect means of observing said variable, the reliable portion of the variable will be mixed together with some unpredictable noise. For example, as the engineer observes the server logs, they will see some quite significant fluctuations that are basically unattributable without a lot of care and work.
There are two fundamentally different kinds of fluctuations:
Fluctuations in the ordinary mechanisms that usually dominate the variable you are tracking (e.g. CPU load or database load),
Outliers/exogenous factors which trigger an entirely different mode (e.g. bugs after updates, or deadlocks)
In either case, in order for the fluctuation to be relevant, it has to have an influence that is at least proportional to the ordinary level for the variable; if it was much smaller, it would be negligible. We’d expect some multiplicative fluctuations to exist simply because things tend to work through multiplicative mechanisms, so as a lower bound on the fluctuations, it’s typically the case that there will be linearly proportional/non-negligible noise.
Conversely, if the noise is much than the variable you are tracking, it would be hard to make out the original variable, and you would pick a different set of indicators with less noise, so linearly proportional noise also works as an upper bound.
Information-orientation introduces massive biases
If we try to apply simple statistics to the variable, it will mostly yield garbage. For instance, if we take the average over a short period of time, it can fluctuate wildy due to the outliers, and if we try to apply a regression analysis, it will be strongly confounded by outliers. If we had enough sample size, it’s conceivable we could just average ridiculous amounts of data together to ensure that we only consider reliable results. However, if instead we are non-omniscient, have limited data, and we want to extract as much information as efficiently as possible from the observations, we instead need to be more careful.
A natural way to think about these is that averages and least squares linear regression try to minimize absolute errors[2], but if we want to be robust to noise, we need to instead minimize relative errors. That is, if we predicted that the variable is 10, but it really is 20, that is a much worse prediction than if we predicted that the variable is 110 but it really is 120.
There’s multiple basically isomorphic ways we can formalize this, for instance by fitting a lognormal distribution to the data. However, let’s use a more arithmetically convenient simplified model: if X is the variable and P is the observed proxy, then we can say that log(P) ≈ log(X) + log(E). Here, E is fluctuation/measurement error, expressed multiplicatively as a number to be multiplied by X to get the observation. By taking the logarithms, we make multiplicative fluctuations drop out into additive fluctuations, and we bring outliers in the distribution back into the bulk of the distribution.
Now we can use some standard normal distribution approximations to get at how statistics twist things. Let’s say that there’s a subgroup s of units (e.g. patients with a particular disease) for whom X is different, e.g. on average Xsubgroup=Xnorm+β. For this group, we have log(Psubgroup) ≈ log(Xnorm+β) + log(E) = log(Xnorm) + log(1+β/Xnorm) + log(E).
Thus, log(P) is log(1+β/Xnorm) higher in the subgroup than in the overall group. But we can't measure this exactly since there will be some noise depending on log(E). This noise will shrink with sqrt(N), where N is the sample size of the subgroup. So overall our statistical signal for how much is going on with the group will be something like log(1+β/Xnorm) sqrt(N) / std(log(E)).
Obviously if you want to have as much information as possible, you are going to prefer less noisy data, as expressed by the 1/std(log(E)) term. But the actual importance of the subgroup ought to be given by something like βN, yet for statistical purposes it appears their detectability is more like log(β) sqrt(N), which not only massively underrates the importance of long-tailed effects, but also favors commonality over effect size, to the point where it becomes exponentially more important for a group to be common than for them to have a large difference from the norm, in order for it to be detectable.
Information overload
If you were just dealing with a single subgroup, this would probably be fine. At some point you get enough sample size to detect the subgroup, at which point you can use more direct means to estimate the correct β.
The issue is that the real world often contains tons of subgroups. If you are dealing with hundreds of thousands of subgroups, you will find many subgroups of infinitesimal importance before you find the ones that really matter. You can then either let yourself be distracted by all these subgroups, or you can conclude that the statistical approach to identifying them doesn’t really work.
But if raw statistics don’t work, then what does work? It seems to me that there’s a need to be biased towards cases where X is big. Maybe this bias can be derived purely from maximizing probabilistic fit, and if so I have some ideas how that could be achieved, which I will get into later in the series. But I suspect ultimately we simply have to bring this bias with us, as a deviation from raw probability theory. In the next post, I will discuss the main way we can use a magnitude-based bias to infer things.
Bayesian probability kind of has a similar problem, in that the statistics work out the same. However, if you are willing to let yourself get close to getting Pascal-mugged, I could imagine Bayesians could dodge it by chasing things with small probabilities of being really big.
Followup to: Latent variable models, network models, and linear diffusion of sparse lognormals. This post is also available on my Substack.
Let’s say you are trying to make predictions about a variable. For example, maybe you are an engineer trying to keep track of how well the servers are running. It turns out that if you use the obvious approach advocated by e.g. frequentist statistics [1], you will have huge biases in what you pay attention to, compared to what you should pay attention to, because you will disregard big things in favor of common things. Let’s make a mathematical model of this.
Measurement error is proportional to magnitude
Because of background factors that fluctuate at a greater frequency than you can observe/model, and because of noise that enters through your indirect means of observing said variable, the reliable portion of the variable will be mixed together with some unpredictable noise. For example, as the engineer observes the server logs, they will see some quite significant fluctuations that are basically unattributable without a lot of care and work.
There are two fundamentally different kinds of fluctuations:
In either case, in order for the fluctuation to be relevant, it has to have an influence that is at least proportional to the ordinary level for the variable; if it was much smaller, it would be negligible. We’d expect some multiplicative fluctuations to exist simply because things tend to work through multiplicative mechanisms, so as a lower bound on the fluctuations, it’s typically the case that there will be linearly proportional/non-negligible noise.
Conversely, if the noise is much than the variable you are tracking, it would be hard to make out the original variable, and you would pick a different set of indicators with less noise, so linearly proportional noise also works as an upper bound.
Information-orientation introduces massive biases
If we try to apply simple statistics to the variable, it will mostly yield garbage. For instance, if we take the average over a short period of time, it can fluctuate wildy due to the outliers, and if we try to apply a regression analysis, it will be strongly confounded by outliers. If we had enough sample size, it’s conceivable we could just average ridiculous amounts of data together to ensure that we only consider reliable results. However, if instead we are non-omniscient, have limited data, and we want to extract as much information as efficiently as possible from the observations, we instead need to be more careful.
A natural way to think about these is that averages and least squares linear regression try to minimize absolute errors[2], but if we want to be robust to noise, we need to instead minimize relative errors. That is, if we predicted that the variable is 10, but it really is 20, that is a much worse prediction than if we predicted that the variable is 110 but it really is 120.
There’s multiple basically isomorphic ways we can formalize this, for instance by fitting a lognormal distribution to the data. However, let’s use a more arithmetically convenient simplified model: if X is the variable and P is the observed proxy, then we can say that log(P) ≈ log(X) + log(E). Here, E is fluctuation/measurement error, expressed multiplicatively as a number to be multiplied by X to get the observation. By taking the logarithms, we make multiplicative fluctuations drop out into additive fluctuations, and we bring outliers in the distribution back into the bulk of the distribution.
Now we can use some standard normal distribution approximations to get at how statistics twist things. Let’s say that there’s a subgroup s of units (e.g. patients with a particular disease) for whom X is different, e.g. on average Xsubgroup=Xnorm+β. For this group, we have log(Psubgroup) ≈ log(Xnorm+β) + log(E) = log(Xnorm) + log(1+β/Xnorm) + log(E).
Thus, log(P) is log(1+β/Xnorm) higher in the subgroup than in the overall group. But we can't measure this exactly since there will be some noise depending on log(E). This noise will shrink with sqrt(N), where N is the sample size of the subgroup. So overall our statistical signal for how much is going on with the group will be something like log(1+β/Xnorm) sqrt(N) / std(log(E)).
Obviously if you want to have as much information as possible, you are going to prefer less noisy data, as expressed by the 1/std(log(E)) term. But the actual importance of the subgroup ought to be given by something like βN, yet for statistical purposes it appears their detectability is more like log(β) sqrt(N), which not only massively underrates the importance of long-tailed effects, but also favors commonality over effect size, to the point where it becomes exponentially more important for a group to be common than for them to have a large difference from the norm, in order for it to be detectable.
Information overload
If you were just dealing with a single subgroup, this would probably be fine. At some point you get enough sample size to detect the subgroup, at which point you can use more direct means to estimate the correct β.
The issue is that the real world often contains tons of subgroups. If you are dealing with hundreds of thousands of subgroups, you will find many subgroups of infinitesimal importance before you find the ones that really matter. You can then either let yourself be distracted by all these subgroups, or you can conclude that the statistical approach to identifying them doesn’t really work.
But if raw statistics don’t work, then what does work? It seems to me that there’s a need to be biased towards cases where X is big. Maybe this bias can be derived purely from maximizing probabilistic fit, and if so I have some ideas how that could be achieved, which I will get into later in the series. But I suspect ultimately we simply have to bring this bias with us, as a deviation from raw probability theory. In the next post, I will discuss the main way we can use a magnitude-based bias to infer things.
Bayesian probability kind of has a similar problem, in that the statistics work out the same. However, if you are willing to let yourself get close to getting Pascal-mugged, I could imagine Bayesians could dodge it by chasing things with small probabilities of being really big.
Well, absolute squared errors. Point is, not relative errors.