[draft] Generalizing from average: a common fallacy?

by Dmytry | 1 min read | 5th Mar 2012 | 16 comments



It seems to me that there is a great deal of generalization from averages (or from correlations, which are a form of average) in the interpretation of scientific findings.

Consider the Sapir-Whorf hypothesis as an example. The hypothesis is tested by measuring the average behaviour of large groups of people; yet it may well be that for some people the strong version of the hypothesis does hold, for some it is grossly invalid, and some fall in between. We have already determined that there is considerable diversity in modes of thought simply by asking people to describe their thinking. I would rather infer from the diversity of those comments that I can't generalize about human thought than generalize from even the most accurate, most scientifically solid, most statistically significant average of some kind and assume that this average tells us how human thought processes work in general.

In this case the average behaviour is nothing more than an indicator of the ratio between those populations: useless demographic trivia of the form "did you know that among North Americans, linguistically-determined people are numerous enough to sway this particular experiment?" (a result that I wouldn't care much about). There has been an example posted here.
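A minimal sketch of this point, under invented assumptions: suppose some fraction of a population shows a language effect of size 1.0 on a hypothetical task and everyone else shows none. The function name, effect sizes and fractions below are illustrative only and do not come from any study.

```python
def average_effect(fraction_affected, effect_if_affected=1.0, effect_otherwise=0.0):
    """Population-average effect for a two-group mixture (hypothetical model)."""
    return (fraction_affected * effect_if_affected
            + (1.0 - fraction_affected) * effect_otherwise)

# Three hypothetical populations that differ only in their mix of people.
for p in (0.1, 0.5, 0.9):
    print(f"fraction affected = {p:.1f} -> average effect = {average_effect(p):.2f}")
```

In this toy model the measured average tracks the mix of the population and nothing else; no individual has an effect equal to the average unless the fraction is 0 or 1.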

This goes for much of science outside physics.

There was another thread about software engineering. Even if the graph had not been inverted and the confounding variables had been accounted for, the result should still have been perceived as useless trivia of the form "did you know that in such and such a selection of projects, the kind of mistakes that get costlier to fix with time outnumber the mistakes that get less costly to fix with time?" (Mistakes in work that is taken as input for future work do snowball over time, and the others not so much; anyone who has ever successfully developed a non-trivial product and sold it knows that. But you can't stick a 'science' label on this, yet you can stick a 'science' label onto some average.) Instead, the result is taken as if it literally told us whether mistakes are costlier or less costly to fix later. That sort of misrepresentation is in the abstracts of many published papers.

It seems to me that this fallacy is extremely widespread. A study comes out that generalizes from an average; the elephant in the room is that it is often invalid to generalize from an average; yet instead we argue about whether the average was measured correctly and whether enough people were included in it. Even if it was, in many cases the result is just demographic trivia, barely relevant to the subject which the study purports to be about.

A study of one person's thought may provide some information about how thought processes work in one real human; it indicates that a thought process can work in some particular way. A study of the average behaviour of many people provides results that are primarily determined by demographics and ratios. Yet people often see the latter as more significant than the former, perhaps mistaking statistical significance for significance in the everyday sense, or mistaking the generalization from an average for an actual detailed study of a large number of people. Perhaps this obsession with averaging is a form of cargo cult imitating physics, where you average measurements to, e.g., cancel out thermal noise in a sensor.
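For contrast, here is a rough sketch of why averaging does work in the sensor case. It assumes every reading is the same true value plus independent noise with a known standard deviation; the function name and the numbers are illustrative assumptions, not anything from a particular experiment.

```python
import math

def standard_error(noise_sd, n_measurements):
    """Standard deviation of the mean of n independent, equally noisy readings."""
    return noise_sd / math.sqrt(n_measurements)

# The uncertainty of the averaged reading shrinks as more readings are taken.
for n in (1, 100, 10_000):
    print(f"n = {n:>6}: standard error = {standard_error(1.0, n):.3f}")
```

The formula relies on there being one underlying value to converge on. When the things being averaged genuinely differ from one another, more data only pins down the mean of the mixture more precisely.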


I want to make a main post about this, with a larger number of examples; it would be very helpful if you could post your examples of generalization from averages here.



The average human has one ovary and one testicle.

If your feet are in a bucket of ice and your head's in an oven, on average you're at a comfortable temperature.

The average family has 2.4 children.

And as for correlations, some years ago I wrote this brief note on how little predictive use you get from the typical magnitude of published correlations.

You probably know the story of the three statisticians hunting a tiger? The first statistician's shot goes wild, one meter to the left. The second statistician applies a correction, but overcompensates and misses, one meter to the right.

And that's when the third yells "Got 'im!"

Ya... I need to expand on that in the post - we have that sort of understanding of how averages fail ('average temperature in a hospital'), but we don't seem to apply it well to correlations (which are still just averages).

edit: your brief note is great. You could expand on something popular, e.g. IQ tests: consider different IQ scores and what they actually tell us about the probability of an individual doing this well or this badly on another IQ test (using the correlation between the two IQ tests). Or, assuming that there is some 'IQ' that IQ tests correlate with, what an IQ test actually tells us about that IQ.
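One hedged way to make the IQ suggestion concrete: if the two test scores are assumed to be jointly normal with the same mean and standard deviation, the second score given the first has mean `mean + r * (score - mean)` and standard deviation `sd * sqrt(1 - r**2)`. The correlation of 0.8 below is a made-up value for illustration, and `predict_retest` is a hypothetical helper, not a standard function.

```python
import math

def predict_retest(score, r, mean=100.0, sd=15.0):
    """Return (expected retest score, sd of the retest score) given a first score.

    Assumes both tests are jointly normal with the same mean and sd.
    """
    expected = mean + r * (score - mean)
    spread = sd * math.sqrt(1.0 - r * r)
    return expected, spread

# Hypothetical case: first score of 130, assumed test-retest correlation of 0.8.
expected, spread = predict_retest(score=130.0, r=0.8)
print(f"expected retest score: {expected:.1f}, one sd of spread: {spread:.1f}")
```

Under these assumptions, even a correlation that sounds high leaves a spread of well over half the population standard deviation for any one individual.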

It's this kind of thought process that gives me issues with statistics being used by people who don't really understand them.

I'm not trying to get on a high horse and exclaim that the common people shouldn't cite studies and stats, but if you are going to cite them, cite them fully. More often than not, by adding a standard deviation and median to an average, you get a picture much closer to what is actually occurring. But even after that, there are other tests which can yield a whole bunch of information that could be more useful towards refining the picture.
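As a small illustration of how much the median and standard deviation add, here are two invented data sets with the same mean, summarized with Python's standard `statistics` module. The numbers are made up purely for the example.

```python
import statistics

# Two invented data sets with the same mean but very different shapes.
clustered = [48, 49, 50, 51, 52]
skewed = [10, 10, 10, 10, 210]

for name, data in (("clustered", clustered), ("skewed", skewed)):
    print(name,
          "mean:", statistics.mean(data),
          "median:", statistics.median(data),
          "stdev:", round(statistics.stdev(data), 1))
```

Reporting only the mean would make the two sets look identical; the median and standard deviation show that they are not.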

I guess if you are going to cite a study, you should take the time to read through the math people tend to skip over, or at least, read all of the conclusions drawn from the math, and not simply mine reports for the facts that happen to work for your argument.

It's not just averaging, it's the problem of making valid inferences in general; reasoning from observations to generalized conclusions.

In fact, the data in my original post on the cost of fixing defects wasn't even much of an "average" to start with - that is, it wasn't really obtained by sampling a population, measuring some variable of interest, and generalizing from the expected value of that variable in the sample to the expected value of that variable in the population.

The "sample" wasn't really a sample but various samples examined at various times, of varying sizes. The "measure" wasn't really a single measure ("cost to fix") but a mix of several operationalizations, some looking at engineers' reports on timesheets, others looking at stopwatch measurements in experimental settings, others looking at dollar costs from accounting data, and so on. The "variable" isn't really a variable - there isn't widespread agreement on what counts as the cost of fixing a defect, as the thread illustrated in a few places. And so on. So it's no wonder that the conclusions are not credible - "averaging" as an operation has little to do with why.

I have a further post on software engineering mostly written - I've been sitting on it for a few weeks now because I haven't found the time to finalize the diagrams - which shows that a lot of writing on software engineering has suffered from egregious mistakes in reasoning about causality.

I'm trying to understand your apparent distaste for averaging.

In the physics context, you're treating it as some empirical lab technique for dealing with imperfect apparatus. Given its indifference to the particular sorts of noise or error model, averaging can appear to be unprincipled or just a tractable approximation to some better scheme for analyzing all the observations. What if there is temporal or spatial correlation in the errors? What if there is some Simpson's Paradox-style structure between groups of observations? What if the least-significant bits of the measurements spell out in ASCII what the true answer is?

However, it is nearly a meta-theorem of statistics that inference is possible only when averaging is (this follows from looking at the properties of exponential families of distributions, the only really tractable class). If some extra structure is present, the answer is not to give up averaging but instead to average ALL the things (corresponding to sufficient statistics in a richer family).
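The Simpson's Paradox-style structure mentioned above can be sketched with hypothetical counts. In the invented table below, treatment A does better than B within each group, yet worse once the groups are pooled, because A was mostly tried on the hard cases. All names and numbers are assumptions made for the example.

```python
# Hypothetical success counts: (successes, trials) per group and treatment.
counts = {
    "easy cases": {"A": (9, 10), "B": (80, 100)},
    "hard cases": {"A": (30, 100), "B": (2, 10)},
}

def rate(successes, trials):
    """Fraction of trials that succeeded."""
    return successes / trials

def pooled_rate(treatment):
    """Success rate after lumping both groups together."""
    successes = sum(group[treatment][0] for group in counts.values())
    trials = sum(group[treatment][1] for group in counts.values())
    return rate(successes, trials)

for name, group in counts.items():
    print(name, {t: round(rate(*group[t]), 2) for t in ("A", "B")})
print("pooled", {t: round(pooled_rate(t), 2) for t in ("A", "B")})
```

Averaging within each group and then comparing gives the opposite answer to averaging over everything at once, which is one reading of "average all the things" rather than a single pooled number.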

The problem is not with averaging. The problem is the misunderstanding of what the result means and where the result is actually coming from.

The average weight of a stable atomic nucleus - averaged over all stable nuclei [of all elements] - is, for instance, not an important fact of nuclear physics. It is almost entirely useless trivia, so uninteresting that I wouldn't be surprised if not a single nuclear physicist has ever calculated it. Likewise, average human behaviour, when there is huge variance in human behaviour, is more of a demographic fact than a psychological one.

Likewise in the computer science example: there is a great variety of work being performed, with different consequences for mistakes; the average mistake's average cost over time is much more a fact about the average ratio between different types of work than a fact about the software development process or the fate of any particular mistake and correction. I develop software for a living, and I am saying that this factoid is about as relevant to my work as the average atomic weight of a stable nucleus is to nuclear physics (or any physics).

I found this comment clearer and more engaging than the original post.

Original post is a draft... I intend to rewrite it somewhat to make it a good main post. It is much easier for me to respond to comments than to make arguments out of the blue that would address possible comments.

I agree with the grandparent and think those examples should be integrated in the main point.

Offtopic, but:

there is huge variance in human behaviour

Is this true? My map says that most humans exhibit similar behaviour in most circumstances, but that as social animals we are tuned to pick out the differences more than the similarities, so we just feel that everyone is completely different. If I've got this wrong then I've got some serious updating to do.

On a related note, if I type human behavior or human ethology into Wikipedia I don't seem to get a page explaining how humans behave, but instead get a few observations on how human behaviour is studied. Have I gone completely crazy here?

Any two things look the same if you look from far enough away. Any two things look different if you look from close enough in. Similarity, like probability, is in the observer, not the observed.

Missing that point drove my ontology wildly off course in a metaphysics course in undergrad. Seeing the obvious similarity between red things, even if they were reflecting slightly different wavelengths, led me to believe that Universals such as Red and Courage exist. It may be that that point should be pushed harder on this site.

Well, as far as attitudes towards savings - or whatever other topic is being studied - are concerned, yes, the behaviour is very diverse. As far as human cognition goes - some people using mental imagery, some people not having mental imagery at all - ditto.

But what does 'very varied' mean? Well, 'too varied for the common methods' would do. As varied as in my atomic weight example.

[anonymous]

Uh, if your priors about the characteristics of new people you meet don't come from some generalization about other people you have met, where do they come from?

[This comment is no longer endorsed by its author]
[anonymous]

I figured I must not understand what the author is saying. He couldn't really mean that we shouldn't use statistical discrimination when dealing with humans, could he?

[This comment is no longer endorsed by its author]