There is an epistemic stance, common among academics in quantitative fields, academics who wish they were in quantitative fields, and independent scholars who do not wish to decorrelate too much from the academic mainstream by communicating in an incompatible dialect, that treats statistical convergence as the gold standard of evidence. If many indicators point the same direction, the signal is real. Call this statisticism. It converges on truth when your instruments have independent errors. It diverges from truth when they share a systematic distortion, because then convergence is what the distortion looks like. The following example illustrates a case where it fails, and why.
Two stories about the same numbers
The US homicide rate doubled between 1960 and 1980, then fell by more than half between 1991 and 2014. I argued that the fall is mostly a medical artifact: trauma surgery vastly improved, so the same rate of shootings produced fewer deaths. I constructed an adjusted trend line using two independent data sources and found no clear decline in serious violence after 1980.
Scott Alexander argues the decline is real. Many different crime categories all fell together: homicide, robbery, car theft, survey-measured victimization. This convergence, in the statisticist mode, makes the decline robust.
The convergence argument
Scott's reasoning: many indicators agree, therefore the signal is real. Good logic when your instruments are trustworthy. Bad logic when the question is whether your instruments are broken.
Every indicator he cites has specific, identifiable problems for measuring serious interpersonal violence:
Homicide rates are suppressed by improving medicine. This is the whole question. The FBI's own Supplementary Homicide Reports make no adjustment for changing lethality.
Aggravated assault rates were inflated for decades by expanding police reporting (the 911 rollout, professionalization of record-keeping, recognition of domestic violence) and then deflated by CompStat-era gaming. The NYPD's CompStat system, introduced in 1994, held precinct commanders accountable for index crime numbers. Felony assaults fell 42% from 2000 to 2009 while misdemeanor assaults fell only 9%, a divergence that Eterno and Silverman documented as systematic downclassification. Under UCR rules, a shooting is hard to classify as anything other than aggravated assault, but a borderline bar fight can plausibly be coded as simple assault, removing it from the index. The expansion of the category led not only to increased reporting but to increased charging and conviction, and therefore to substantially greater penalties for the marginal cases newly counted as aggravated assault, which were thereby selectively disincentivized.
Victim surveys (the NCVS) interview about 240,000 people and get roughly 1,000 aggravated assault reports per year. The survey of course contains no direct information about homicides (you cannot interview the dead), but Scott later argued that if homicides were being converted to aggravated assaults through medical mitigation, that should be reflected in the NCVS numbers. The signal of interest (would-be homicides reclassified as assaults by medical improvement) is a tiny fraction of total assaults. The survey lacks the statistical power to detect it. The NCVS documentation itself flags assault as the worst-recalled crime in the survey.
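The power problem can be made concrete with a back-of-the-envelope sketch. Apart from the ~1,000 sample reports mentioned above, every number below is a hypothetical placeholder chosen only to show the orders of magnitude involved:

```python
import math

# Assumed illustrative numbers, not official figures:
sample_reports = 1_000         # aggravated assault reports the NCVS yields per year
national_assaults = 1_000_000  # rough national aggravated assault total (assumption)
converted = 5_000              # hypothetical would-be homicides saved by medicine per year

# Extra sample reports the reclassification signal would contribute
extra = sample_reports * converted / national_assaults

# Year-to-year sampling noise in the report count (Poisson approximation)
noise = math.sqrt(sample_reports)

print(f"expected extra reports: {extra:.1f}")    # ~5
print(f"sampling noise (1 sd):  {noise:.1f}")    # ~32
print(f"signal/noise ratio:     {extra / noise:.2f}")
```

Under these assumptions the signal is a small fraction of one standard deviation of sampling noise, which is what "lacks the statistical power to detect it" means in practice.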
Property crime responds to locks, cameras, cashless payments, and prosecution thresholds. It tells you about theft, not about whether people are shooting each other. Car theft declined because of immobilizers and GPS tracking, not because of declining criminal intent.
The limitations of these instruments are neither secret nor heterodox. The FBI's UCR handbook warns about comparability problems across time and jurisdiction. The NCVS documentation discusses its own power limitations. The information about instrument quality exists. It just gets stripped away as data moves from producers to consumers, so that by the time the data reaches a blog post, a newspaper, or a summary characterization from an adjacent academic field, it looks like a clean fact about reality rather than a noisy output of a specific, flawed process.
All of these indicators have drifted in the direction of apparent decline during the period in question, for reasons unrelated to whether people became less violent. Counting up indicators that agree doesn't help when they share the defect you're trying to diagnose.
Suppose you suspect your bathroom scale reads low because the spring is worn out. Your friend says it must be accurate because your belt fits better, your face looks thinner, and your blood pressure is down. These are all evidence of something (maybe you're exercising more) but none of them address whether the scale reads low. Body recomposition might produce the same effects. If you want to know whether the spring is worn out, you need to test the spring, or at least the scale.
Testing the spring
I took the hardest data available, the actual count of dead bodies from death certificates filed by medical examiners, and asked: how has the relationship between this number and the underlying rate of serious violence changed over time? Dead bodies are not subject to reporting drift, survey methodology, or police statistics games. The Monty Python parrot scenario is an outrageous fictional exaggeration, and even then it was a parrot; brazenly insisting an obviously dead human being is alive to avoid a minor financial inconvenience strains plausibility even for an absurd comedy sketch.[1]
Homicide rates are subject to one known distortion: whatever the perpetrator does, if the victim doesn't actually die of it, it wasn't a homicide. Medicine is a field specifically devoted to causing people not to die of things they otherwise would have died of, and (I think even Robin Hanson would agree) it has sometimes gotten better over time. So I measured the improvement using two independent clinical sources (FBI firearm lethality ratios and hospital abdominal gunshot wound survival rates) and divided it out.
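The arithmetic of the adjustment is simple to sketch. The homicide rates below are rough approximations of published figures, and the lethality ratios are entirely hypothetical stand-ins for the two clinical series named above:

```python
# Approximate US homicide rates per 100k (rounded from published figures)
homicide_rate = {1960: 5.1, 1991: 9.8, 2014: 4.4}

# Hypothetical lethality: deaths per would-be-fatal assault, normalized so
# 1960 medicine = 1.0. The real series would come from FBI firearm
# lethality ratios and hospital gunshot-wound survival data.
lethality = {1960: 1.00, 1991: 0.70, 2014: 0.45}

# Adjusted rate: assaults that would have been fatal under 1960 medicine
adjusted = {y: homicide_rate[y] / lethality[y] for y in homicide_rate}
for year, rate in adjusted.items():
    print(year, round(rate, 1))
```

On numbers like these, the adjusted 2014 rate sits well above 1960 rather than at a record low; the substantive question is what the true lethality series looks like, which is exactly what the two clinical sources are for.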
This is instrument-modeling: instead of asking "do many measurements agree?", asking "what is this specific measurement actually tracking, and how has the tracking changed?"
Where the blind spots appear
In a subsequent exchange on Substack between me and Scott, statisticism produced a characteristic set of moves. Scott clearly writes from a place of genuine uncertainty and curiosity. But the statisticist default shapes what counts as engaging with an argument, and the result is that certain kinds of evidence become structurally difficult to hear.
The hardest evidence gets outvoted
The strongest piece of evidence in the entire debate is a doubling in death counts between 1960 and 1980, during a period of well-documented medical improvement. Death certificates filed by medical examiners are the least distorted measurement available. If you accept this evidence and the medical adjustment, violence roughly tripled on the adjusted measure, and for crime to be at "record lows" today, the adjusted rate would need to have fallen back by a comparable amount. My data shows it didn't.
I flagged this as the crux: "my argument that violent crime increased a lot 1964–1980 is strong, and I'd need to be wrong about that for [the] headline claim to be true." Scott responded: "I agree there's less data about 1960–1980."
I hadn't said anything about having less data. I'd said I had strong evidence. Body counts are the hardest data in this debate. There is less survey data before 1973, because the National Crime Victimization Survey didn't start until then. But death certificates are older and more reliable than the best survey available. By responding as though I was arguing from data scarcity, Scott reframed "I have body counts" as "there's less data," inverting the hierarchy of evidence and attributing that inversion to me. I don't think this was deliberate, but I confess it rankles a bit to have words put in my mouth, which may make me less fair-minded than I otherwise would be; still, I don't think it's good discursive practice for people with grievances to self-silence for want of an advocate, so on we go! Within the statisticist framework this move is natural and almost invisible, because the framework ranks evidence by quantity and diversity of sources rather than by the quality of any single source's connection to physical reality.
Trends get reified
Statisticism encourages treating "the crime trend" as a thing that exists in the world, rather than as a summary computed from instruments. Once you think of it as a thing, you can ask whether it went up or down, and you evaluate this by polling your instruments.
The grand old Duke of York,
He had ten thousand instruments;
He marched them up to the top of the hill,
And he marched them down again.
When they were up, crime was up,
And when they were down, crime was down,
And when they were only halfway up,
Crime was neither up nor down.
But the crime trend is neither a generating process for, nor an explanation of, crimes. There are specific events (shootings, robberies, car thefts) counted by specific instruments with specific mechanics by which the events are detected, categorized, and counted. "Crime" is a word we use to group these events. A car theft and a shooting are both crimes, but they have different causes, different mechanisms, and different measurement problems. Treating these different instruments as interchangeable readings of a single underlying variable discards everything you know about how each measurement works.
The hypothesis that a single underlying generating factor, whether it's propensity for criminality, the trust level of society, or the cybernetic capacity of the state, drives changes in all these categories is a strong claim that calls for strong evidence. And notice that I just described the union of three distinct theories connected with "or," not one coherent theory. Much like evidence for the monotheists' Yahweh doesn't work if it proves too much and also supports the incompatible Zeus, an argument for a single factor has to either rule out the other contenders for that factor, or specify under what conditions the convergence should fail.
Parsimony gets misapplied across periods
Scott argues that since the post-1980 decline appears real (convergence), the 1960–1980 increase was probably also smaller than it looks. This treats "the trend" as a single object to be accepted or rejected wholesale. But the evidence is asymmetric. The 1960–1980 increase rests principally on body counts. The post-1980 decline rests on rates contaminated by the artifacts under dispute. Projecting the weaker period's story onto the stronger period gets the direction of inference backwards.
Experience gets filed as "vibes"
"Who are you going to believe, me or your lying eyes?" is not, on its face, a very credible rhetorical move. But reframe it as "what are you going to believe, objective statistics or the vibes?" and it becomes surprisingly effective.
In a followup post on disorder, Scott examines whether the things people complain about (litter, graffiti, tent encampments) are really increasing. He looks at the indicators, finds most flat or down, and concludes that perceived disorder probably outruns actual disorder. He frames this as keeping "one foot in the statistical story, one foot in the vibes." Statistics on one side, vibes on the other. The lived experience of people who observe deteriorating conditions gets categorized as a psychological phenomenon to be explained, not as evidence about reality. Along the way he notices several times that his indicators don't match what people report (NYC's litter ratings contradict residents' experience, shoplifting data contradicts what stores say) but instead of asking "what is this instrument failing to capture?", he files these as caveats and returns to the cluster.
I think Scott is trying to be appealingly self-deprecating here: he too has vibes, he too feels the despair when he goes to San Francisco, he's not claiming to be above it. But self-deprecation about perception itself unmediated by statistics is also deprecation of everyone else's capacity to make sense of their environment. Hey, I'm someone! If my eyes and your eyes and the store owners' eyes all see the same thing, and the statistics disagree, "vibes" is a word that makes it easy to dismiss all of us at once, including yourself. The ideology operates as a default, the place you end up when you're not actively thinking about what your instruments are doing.
Statisticism: the Good, the Bad, and the Ugly
So when should you trust convergence? When does it go wrong? And what turns a useful tool into an ideology?
When convergence works
In the ideal case, convergence is straightforward. Multiple labs estimate a physical constant using different experimental setups. Each lab has its own systematic errors, but these are uncorrelated, so convergence across labs really does reduce uncertainty. In finance, this is genuine diversification: a portfolio of uncorrelated assets really does have lower variance than any individual holding. The key word is uncorrelated.
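How much the key word matters can be simulated directly. A toy model, assuming ten instruments with unit idiosyncratic noise and an optional bias term shared by all of them:

```python
import random
import statistics

random.seed(0)
TRUTH = 10.0
N_INSTRUMENTS, N_TRIALS = 10, 2000

def consensus_error(shared_bias_sd):
    """Spread of the 10-instrument average; shared_bias_sd controls correlated error."""
    estimates = []
    for _ in range(N_TRIALS):
        bias = random.gauss(0, shared_bias_sd)           # one draw shared by all instruments
        readings = [TRUTH + bias + random.gauss(0, 1.0)  # plus independent noise per instrument
                    for _ in range(N_INSTRUMENTS)]
        estimates.append(statistics.fmean(readings))
    return statistics.pstdev(estimates)

print(round(consensus_error(0.0), 2))  # roughly 1/sqrt(10), about 0.32: averaging helps
print(round(consensus_error(1.0), 2))  # roughly 1.05: the shared bias puts a floor under the error
```

With independent errors, ten instruments cut the error by a factor of sqrt(10); with a shared bias of the same magnitude as the idiosyncratic noise, adding instruments barely helps at all, because they all drift together.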
The more interesting case is Charles Darwin, who spent years collecting observations from island biogeography, comparative anatomy, the fossil record, and selective breeding. These observations converged on a single conclusion: species change over time through descent with modification. The convergence meant something because each line of evidence was genuinely independent. Galápagos finch beaks are not subject to the same reporting drift as the fossil record. Pigeon breeders in England are not coordinating their results with naturalists in South America. When many instruments agree and there is no shared machinery generating the agreement, convergence really does reduce uncertainty.
Gregor Mendel's experiments with pea plants had actually established the mechanism of particulate inheritance before Darwin published, though the work wasn't recognized until decades later. Mendel's genetics explained why Darwin's observations converged. This is the instrument-modeling step applied after the fact. Not just "many things point the same direction" but "here is the causal process that makes them point the same direction." The convergence was real, and the mechanism confirmed it.
When convergence misleads
Governing complex systems requires feedback loops, and feedback loops on complex outcomes require proxies. You can't steer a national economy without GDP, manage public health without mortality rates, or run a criminal justice system without crime statistics. These statistical proxies are genuine attempts to compress high-dimensional reality into signals that a control system can act on. Statisticism in its legitimate form is the epistemology that makes cybernetic governance possible. The people who built these proxies were trying to solve genuine problems, like winning the World Wars, and often succeeded. The tragedy is that the solution becomes the next problem.
You often want your national statistics to be methodologically standardized so they're comparable across jurisdictions and time. But standardization introduces shared methodology and therefore shared exposure to the same biases. In finance, this shared exposure would be called basis risk: the risk that your instrument doesn't track the thing it's supposed to track. The question is always whether anyone is modeling the basis risk. Usually nobody is, because within the statisticist framework, the proxy is reality.
Compare Darwin's case. His observations converged because nature ran genuinely separate experiments on different islands. Crime statistics converge because they all pass through the same institutional machinery: the same reporting systems, the same definitional boundaries, the same political incentives. That's not the convergence of independent evidence. It's the convergence of shared plumbing.
Goodharting
Once the proxy becomes a target, the system optimizes for the proxy rather than the underlying reality. The proxy diverges from what it was meant to track, but the divergence is invisible from within the control system, because the control system only sees the proxy.
CompStat is a textbook case. Precinct commanders were accountable for index crime numbers. The numbers improved. Whether public safety improved is a different question, one that CompStat couldn't answer because CompStat was the measurement system. From inside the control loop, declining felony assaults looked like declining violence. From outside, if you compared felony and misdemeanor assault trends and noticed they were diverging, it looked like reclassification. The people inside the loop had no reason to look outside it, and strong career incentives not to.
Beta Bucks
In finance, beta is the correlated drift left over after you diversify away idiosyncratic risk. If you own shares in two car companies instead of one, you shouldn't expect less exposure to the auto market overall, but the good or bad luck of either company (politically charged CEO, breakout product, scandal where the car explodes) affects you less. Beta is the part you can't diversify away: the movement of the whole market that carries all its participants together. An asset with high beta rises when the market rises and falls when the market falls. In a system where correlated failures get bailed out, beta is free money: you capture the upside of the shared drift and the government absorbs the downside.
More generally, once the proxy is the target, people can profit by correlating their behavior with it, betting explicitly or implicitly against divergence from trend. When enough enterprises are exposed to the same risk, the government prevents them from failing, so excessive optimism is not selected against when correlated with others' optimism. When enough researchers share the same methodology, the consensus can't be challenged without challenging everyone at once, so the methodology becomes a means of organizing politically.
Hidden correlations can arise by accident: nutrition studies all using food frequency questionnaires, crime statistics all subject to the same reporting drift. But once a correlated movement exists, it attracts and retains participants. The environment selects for people who bet with the consensus and conditions them to feel that doing so is epistemically virtuous. The ones who didn't are no longer in the room. In practice the accidental and motivated components blur together, since most participants are not conscious of the full incentive structure. They're doing what feels advantageous, appropriate, or safe.
A lone dissenter who says "these instruments share a bias" is in the position of a short-seller betting against a systemically important asset class: possibly right, but structurally disadvantaged, because the system is set up to bail out the consensus.[2]
Not with a whimper, but a bang
This implies a testable prediction: when a subsidized consensus breaks, it should break catastrophically and all at once, because the same correlation that made it feel robust makes it fragile. The replication crisis in psychology looks like this: not a slow erosion of confidence, but a sudden phase transition once a few key papers fell and the shared methodological exposure was revealed.
Why it works politically
Statisticism functions well as consensus enforcement even when it fails as epistemics. Many instruments agreeing gives you a way to dismiss any individual challenge: "that's just one study," "that contradicts the weight of evidence." This works regardless of whether the instruments are actually independent, because most audiences cannot evaluate independence of error sources. You get to feel and vibe to others like a truth-seeker, while doing what is functionally consensus enforcement, because the rules of your epistemology produce the same behavior: privilege the cluster, dismiss the outlier. Nobody needs to be lying. The epistemology does the work for them.
Thermometer, Thermostat, Theology: the Lifecycle of a Proxy
Legitimate cybernetic need → proxy construction → caveats get stripped as data moves downstream → proxy becomes target (Goodhart) → correlated exploitation of the target (too-big-to-fail) → statisticism as the ideology that treats the proxy layer as reality and structurally cannot hear challenges to it.
Statisticism is an ideology within which the idea of evidence has been not augmented but replaced by the idea of statistics. Within this framework, only statistically legible information counts as meaningful. Your sensorium is not meaningful, first-principles reasoning about mechanisms is not meaningful, and the only real evidence is the output of a large data collection process using statistical methods. This makes convergence arguments feel decisive, because modeling a specific instrument's relationship to physical reality looks like speculation, while piling up indicator after indicator looks like rigor.
The same pattern shows up in effective altruist philanthropy, where it impairs learning by letting you carry incompatible hypotheses indefinitely without testing them.[3]
Predictable blind spots
The style will tend to:
Dismiss strong individual measurements that disagree with the cluster
Miss systematic biases that affect many indicators in the same direction
Treat "many data sources agree" as a conversation-stopper rather than asking whether the agreement is informative
Reframe strong but solitary evidence as "less data" rather than "different and better data"
Categorize non-statistical evidence (direct observation, mechanistic reasoning, lived experience) as "vibes" rather than as information about reality that the statistics may be failing to capture
Apply parsimony across contexts where the generating process has changed, because parsimony feels rigorous and context-sensitivity feels like special pleading
The corrective is not to abandon quantitative evidence or distrust convergence categorically. It is to treat each measurement as the output of a specific causal process, and to ask whether the process supports the use you want to make of the output, in the context where you're trying to apply it. When the question is whether a specific distortion explains an observed trend, the answer must come from modeling the distortion directly, not from counting correlated indicators.
Weekend at Bernie's comes closer, but it takes a lot of work which plainly does not scale to a meaningful distortion of the homicide statistics, and in any case it is not a documentary. ↩︎
Michele Reilly's Anatomy of a Bubble describes a related mechanism in which "arbitrageurs" extract value by creating uniformity of belief around a speculative commodity, with pragmatism functioning as submission to threats rather than as independent assessment. ↩︎
There is an epistemic stance, common among academics in quantitative fields, academics who wish they were in quantitative fields, and independent scholars who do not wish to decorrelate too much from the academic mainstream by communicating in an incompatible dialect, that treats statistical convergence as the gold standard of evidence. If many indicators point the same direction, the signal is real. Call this statisticism. It converges on truth when your instruments have independent errors. It diverges from truth when they share a systematic distortion, because then convergence is what the distortion looks like. The following example illustrates a case where it fails, and why.
Two stories about the same numbers
The US homicide rate doubled between 1960 and 1980, then fell by more than half between 1991 and 2014. I argued that the fall is mostly a medical artifact: trauma surgery vastly improved, so the same rate of shootings produced fewer deaths. I constructed an adjusted trend line using two independent data sources and found no clear decline in serious violence after 1980.
Scott Alexander argues the decline is real. Many different crime categories all fell together: homicide, robbery, car theft, survey-measured victimization. This convergence, in the statisticist mode, makes the decline robust.
The convergence argument
Scott's reasoning: many indicators agree, therefore the signal is real. Good logic when your instruments are trustworthy. Bad logic when the question is whether your instruments are broken.
Every indicator he cites has specific, identifiable problems for measuring serious interpersonal violence:
The limitations of these instruments are neither secret nor heterodox. The FBI's UCR handbook warns about comparability problems across time and jurisdiction. The NCVS documentation discusses its own power limitations. The information about instrument quality exists. It just gets stripped away as data moves from producers to consumers, so that by the time the data reaches a blog post, a newspaper, or a summary characterization from an adjacent academic field, it looks like a clean fact about reality rather than a noisy output of a specific, flawed process.
All of these indicators have drifted in the direction of apparent decline during the period in question, for reasons unrelated to whether people became less violent. Counting up indicators that agree doesn't help when they share the defect you're trying to diagnose.
Suppose you suspect your bathroom scale reads low because the spring is worn out. Your friend says it must be accurate because your belt fits better, your face looks thinner, and your blood pressure is down. These are all evidence of something (maybe you're exercising more) but none of them address whether the scale reads low. Body recomposition might produce the same effects. If you want to know whether the spring is worn out, you need to test the spring, or at least the scale.
Testing the spring
I took the hardest data available, the actual count of dead bodies from death certificates filed by medical examiners, and asked: how has the relationship between this number and the underlying rate of serious violence changed over time? Dead bodies are not subject to reporting drift, survey methodology, or police statistics games. The Monty Python parrot scenario is an outrageous fictional exaggeration, and even then it was a parrot; brazenly insisting an obviously dead human being is alive to avoid a minor financial inconvenience strains plausibility even for an absurd comedy sketch. [1]
Homicide rates are subject to one known distortion: whatever the perpetrator does, if the victim doesn't actually die of it, it wasn't a homicide. Medicine is a field specifically devoted to causing people not to die of things they otherwise would have died of, and (I think even Robin Hanson would agree) it has sometimes gotten better over time. So I measured the improvement using two independent clinical sources (FBI firearm lethality ratios and hospital abdominal gunshot wound survival rates) and divided it out.
This is instrument-modeling: instead of asking "do many measurements agree?", asking "what is this specific measurement actually tracking, and how has the tracking changed?"
Where the blind spots appear
In a subsequent exchange on Substack between me and Scott, statisticism produced a characteristic set of moves. Scott clearly writes from a place of genuine uncertainty and curiosity. But the statisticist default shapes what counts as engaging with an argument, and the result is that certain kinds of evidence become structurally difficult to hear.
The hardest evidence gets outvoted
The strongest piece of evidence in the entire debate is a doubling in death counts between 1960 and 1980, during a period of well-documented medical improvement. Death certificates filed by medical examiners are the least distorted measurement available. If you accept this evidence and the medical adjustment, violence roughly tripled on the adjusted measure, and for crime to be at "record lows" today, the adjusted rate would need to have fallen back by a comparable amount. My data shows it didn't.
I flagged this as the crux: "my argument that violent crime increased a lot 1964–1980 is strong, and I'd need to be wrong about that for [the] headline claim to be true." Scott responded: "I agree there's less data about 1960–1980."
I hadn't said anything about having less data. I'd said I had strong evidence. Body counts are the hardest data in this debate. There is less survey data before 1973, because the National Crime Victim Survey didn't start until then. But death certificates are older and more reliable than the best survey available. By responding as though I was arguing from data scarcity, Scott reframed "I have body counts" as "there's less data," inverting the hierarchy of evidence and attributing that inversion to me. I don't think this was deliberate, but I confess that it rankles a bit to have words put in my mouth, which may make me less fair-minded than I otherwise might be; but I don't think it's good discursive practice for people with grievances to self-silence for want of an advocate, so on we go! Within the statisticist framework this move is natural and almost invisible, because the framework ranks evidence by quantity and diversity of sources rather than by the quality of any single source's connection to physical reality.
Trends get reified
Statisticism encourages treating "the crime trend" as a thing that exists in the world, rather than as a summary computed from instruments. Once you think of it as a thing, you can ask whether it went up or down, and you evaluate this by polling your instruments.
But the crime trend is neither a generating process for, nor an explanation of, crimes. There are specific events (shootings, robberies, car thefts) counted by specific instruments with specific mechanics by which the events are detected, categorized, and counted. "Crime" is a word we use to group these events. A car theft and a shooting are both crimes, but they have different causes, different mechanisms, and different measurement problems. Treating these different instruments as interchangeable readings of a single underlying variable discards everything you know about how each measurement works.
The hypothesis that one underlying single generating factor, whether it's propensity for criminality, the trust level of society, or the cybernetic capacity of the state, drives changes in all these categories, is a strong claim that calls for strong evidence. I just described the union of three distinct theories connected with "or," not one coherent theory. Much like evidence for the existence of the monotheists' Yahweh doesn't work if it proves too much and also supports the incompatible Zeus, an argument for a single factor has to either rule out the other contenders for the single factor, or specify under what conditions the convergence should fail.
Parsimony gets misapplied across periods
Scott argues that since the post-1980 decline appears real (convergence), the 1960–1980 increase was probably also smaller than it looks. This treats "the trend" as a single object to be accepted or rejected wholesale. But the evidence is asymmetric. The 1960–1980 increase rests principally on body counts. The post-1980 decline rests on rates contaminated by the artifacts under dispute. Projecting the weaker period's story onto the stronger period gets the direction of inference backwards.
Experience gets filed as "vibes"
"Who are you going to believe, me or your lying eyes?" is not, on its face, a very credible rhetorical move. But reframe it as "what are you going to believe, objective statistics or the vibes?" and it becomes surprisingly effective.
In a followup post on disorder, Scott examines whether the things people complain about (litter, graffiti, tent encampments) are really increasing. He looks at the indicators, finds most flat or down, and concludes that perceived disorder probably outruns actual disorder. He frames this as keeping "one foot in the statistical story, one foot in the vibes." Statistics on one side, vibes on the other. The lived experience of people who observe deteriorating conditions gets categorized as a psychological phenomenon to be explained, not as evidence about reality. Along the way he notices several times that his indicators don't match what people report (NYC's litter ratings contradict residents' experience, shoplifting data contradicts what stores say) but instead of asking "what is this instrument failing to capture?", he files these as caveats and returns to the cluster.
I think Scott is trying to be appealingly self-deprecating here: he too has vibes, he too feels the despair when he goes to San Francisco, he's not claiming to be above it. But self-deprecation about unmediated perception is also deprecation of everyone else's capacity to make sense of their environment. Hey, I'm someone! If my eyes and your eyes and the store owners' eyes all see the same thing, and the statistics disagree, "vibes" is a word that makes it easy to dismiss all of us at once, including yourself. The ideology operates as a default, the place you end up when you're not actively thinking about what your instruments are doing.
Statisticism: the Good, the Bad, and the Ugly
So when should you trust convergence? When does it go wrong? And what turns a useful tool into an ideology?
When convergence works
In the ideal case, convergence is straightforward. Multiple labs estimate a physical constant using different experimental setups. Each lab has its own systematic errors, but these are uncorrelated, so convergence across labs really does reduce uncertainty. In finance, this is genuine diversification: a portfolio of uncorrelated assets really does have lower variance than any individual holding. The key word is uncorrelated.
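The difference between these two cases can be made concrete with a toy simulation (illustrative numbers and function names, assuming Gaussian noise): when instrument errors are independent, averaging ten instruments beats any one of them; when they share a systematic distortion, the ten readings still cluster tightly, but around the wrong number.

```python
import random
import statistics

def rms_error(n_instruments, shared_bias_sd, noise_sd,
              trials=2000, truth=100.0, seed=0):
    """Average n instruments per trial; return the RMS error of the average.

    Each instrument reads truth + shared_bias + its own noise. With
    shared_bias_sd = 0 the errors are independent and averaging helps;
    otherwise the readings agree with each other but not with reality.
    """
    rng = random.Random(seed)
    sq_errors = []
    for _ in range(trials):
        shared_bias = rng.gauss(0, shared_bias_sd)  # same distortion hits every instrument
        readings = [truth + shared_bias + rng.gauss(0, noise_sd)
                    for _ in range(n_instruments)]
        sq_errors.append((statistics.mean(readings) - truth) ** 2)
    return statistics.mean(sq_errors) ** 0.5

one_alone   = rms_error(n_instruments=1,  shared_bias_sd=0.0, noise_sd=5.0)
independent = rms_error(n_instruments=10, shared_bias_sd=0.0, noise_sd=5.0)
correlated  = rms_error(n_instruments=10, shared_bias_sd=5.0, noise_sd=5.0)

# Ten independent instruments beat one; ten sharing a bias do not.
print(f"one instrument:        {one_alone:.2f}")
print(f"ten independent:       {independent:.2f}")
print(f"ten with shared bias:  {correlated:.2f}")
```

The shared-bias case is the trap: within each trial the ten readings converge on each other just as tightly as in the independent case, so convergence alone cannot distinguish the two.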
The more interesting case is Charles Darwin, who spent years collecting observations from island biogeography, comparative anatomy, the fossil record, and selective breeding. These observations converged on a single conclusion: species change over time through descent with modification. The convergence meant something because each line of evidence was genuinely independent. Galápagos finch beaks are not subject to the same reporting drift as the fossil record. Pigeon breeders in England are not coordinating their results with naturalists in South America. When many instruments agree and there is no shared machinery generating the agreement, convergence really does reduce uncertainty.
Gregor Mendel's experiments with pea plants had actually established the mechanism of particulate inheritance before Darwin published, though the work wasn't recognized until decades later. Mendel's genetics explained why Darwin's observations converged. This is the instrument-modeling step applied after the fact. Not just "many things point the same direction" but "here is the causal process that makes them point the same direction." The convergence was real, and the mechanism confirmed it.
When convergence misleads
Governing complex systems requires feedback loops, and feedback loops on complex outcomes require proxies. You can't steer a national economy without GDP, manage public health without mortality rates, or run a criminal justice system without crime statistics. These statistical proxies are genuine attempts to compress high-dimensional reality into signals that a control system can act on. Statisticism in its legitimate form is the epistemology that makes cybernetic governance possible. The people who built these proxies were trying to solve genuine problems, like winning the World Wars, and often succeeded. The tragedy is that the solution becomes the next problem.
You often want your national statistics to be methodologically standardized so they're comparable across jurisdictions and time. But standardization introduces shared methodology and therefore shared exposure to the same biases. In finance, this shared exposure would be called basis risk: the risk that your instrument doesn't track the thing it's supposed to track. The question is always whether anyone is modeling the basis risk. Usually nobody is, because within the statisticist framework, the proxy is reality.
Compare Darwin's case. His observations converged because nature ran genuinely separate experiments on different islands. Crime statistics converge because they all pass through the same institutional machinery: the same reporting systems, the same definitional boundaries, the same political incentives. That's not the convergence of independent evidence. It's the convergence of shared plumbing.
Goodharting
Once the proxy becomes a target, the system optimizes for the proxy rather than the underlying reality. The proxy diverges from what it was meant to track, but the divergence is invisible from within the control system, because the control system only sees the proxy.
CompStat is a textbook case. Precinct commanders were accountable for index crime numbers. The numbers improved. Whether public safety improved is a different question, one that CompStat couldn't answer because CompStat was the measurement system. From inside the control loop, declining felony assaults looked like declining violence. From outside, if you compared felony and misdemeanor assault trends and noticed they were diverging, it looked like reclassification. The people inside the loop had no reason to look outside it, and strong career incentives not to.
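The reclassification mechanism can be sketched as a toy model (the numbers are illustrative, not real UCR data): true violence stays flat, but a growing fraction of borderline incidents gets coded as simple rather than aggravated assault, so the index metric falls while the total never changes.

```python
# Toy model of Goodharting via downclassification. Illustrative numbers only.
TRUE_INCIDENTS = 1000      # actual violent incidents per year, unchanged
BORDERLINE_SHARE = 0.3     # incidents that can plausibly be coded either way

def recorded_counts(year, downclass_rate_per_year=0.05):
    """Return (aggravated, simple) counts as coding pressure grows over time."""
    downclassed = min(1.0, downclass_rate_per_year * year)  # fraction re-coded
    borderline = TRUE_INCIDENTS * BORDERLINE_SHARE
    aggravated = TRUE_INCIDENTS - borderline * downclassed
    simple = borderline * downclassed
    return aggravated, simple

for year in (0, 5, 10):
    agg, simp = recorded_counts(year)
    # The index category falls, the non-index category rises, the sum is flat.
    print(year, agg, simp, agg + simp)
```

Inside the control loop you only see the first column falling. The diagnostic is the second column: a divergence between index and non-index categories that true declining violence would not produce.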
Beta Bucks
In finance, beta is the correlated drift left over after you diversify away idiosyncratic risk. If you own shares in two car companies instead of one, you shouldn't expect less exposure to the auto market overall, but the good or bad luck of either company (politically charged CEO, breakout product, scandal where the car explodes) affects you less. Beta is the part you can't diversify away: the movement of the whole market that carries all its participants together. An asset with high beta rises when the market rises and falls when the market falls. In a system where correlated failures get bailed out, beta is free money: you capture the upside of the shared drift and the government absorbs the downside.
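A minimal one-factor sketch (illustrative parameters; each asset's return is beta times a shared market shock plus its own idiosyncratic shock) shows the asymmetry: adding assets shrinks the idiosyncratic variance toward zero, but the portfolio variance never drops below the floor set by beta.

```python
import random
import statistics

def portfolio_variance(n_assets, beta=1.0, market_sd=0.04, idio_sd=0.06,
                       trials=20000, seed=1):
    """Variance of an equal-weight portfolio in a toy one-factor model:
    every asset returns beta * market + its own idiosyncratic shock."""
    rng = random.Random(seed)
    returns = []
    for _ in range(trials):
        market = rng.gauss(0, market_sd)  # shared drift: hits every asset
        assets = [beta * market + rng.gauss(0, idio_sd) for _ in range(n_assets)]
        returns.append(statistics.mean(assets))
    return statistics.variance(returns)

one_car_company   = portfolio_variance(1)
two_car_companies = portfolio_variance(2)
many_companies    = portfolio_variance(50)

# Idiosyncratic variance shrinks as 1/n; the beta * market term does not.
# Theoretical floor here: (beta * market_sd)**2 = 0.0016.
print(one_car_company, two_car_companies, many_companies)
```

Fifty holdings get you close to the floor and no further: the residual variance is the beta, the part of your exposure that diversification cannot touch.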
More generally, once the proxy is the target, people can profit by correlating their behavior with it, betting explicitly or implicitly against divergence from trend. When enough enterprises are exposed to the same risk, the government prevents them from failing, so excessive optimism is not selected against when correlated with others' optimism. When enough researchers share the same methodology, the consensus can't be challenged without challenging everyone at once, so the methodology becomes a means of organizing politically.
Hidden correlations can arise by accident: nutrition studies all using food frequency questionnaires, crime statistics all subject to the same reporting drift. But once a correlated movement exists, it attracts and retains participants. The environment selects for people who bet with the consensus and conditions them to feel that doing so is epistemically virtuous. The ones who didn't are no longer in the room. In practice the accidental and motivated components blur together, since most participants are not conscious of the full incentive structure. They're doing what feels advantageous, appropriate, or safe.
A lone dissenter who says "these instruments share a bias" is in the position of a short-seller betting against a systemically important asset class: possibly right, but structurally disadvantaged, because the system is set up to bail out the consensus. [2]
Not with a whimper, but a bang
This implies a testable prediction: when a subsidized consensus breaks, it should break catastrophically and all at once, because the same correlation that made it feel robust makes it fragile. The replication crisis in psychology looks like this: not a slow erosion of confidence, but a sudden phase transition once a few key papers fell and the shared methodological exposure was revealed.
Why it works politically
Statisticism functions well as consensus enforcement even when it fails as epistemics. Many instruments agreeing gives you a way to dismiss any individual challenge: "that's just one study," "that contradicts the weight of evidence." This works regardless of whether the instruments are actually independent, because most audiences cannot evaluate independence of error sources. You get to feel like a truth-seeker, and to read as one to others, while doing what is functionally consensus enforcement, because the rules of your epistemology produce the same behavior: privilege the cluster, dismiss the outlier. Nobody needs to be lying. The epistemology does the work for them.
Thermometer, Thermostat, Theology: the Lifecycle of a Proxy
Legitimate cybernetic need → proxy construction → caveats get stripped as data moves downstream → proxy becomes target (Goodhart) → correlated exploitation of the target (too-big-to-fail) → statisticism as the ideology that treats the proxy layer as reality and structurally cannot hear challenges to it.
Statisticism is an ideology within which the idea of evidence has been not augmented but replaced by the idea of statistics. Within this framework, only statistically legible information counts as meaningful. Your sensorium is not meaningful, first-principles reasoning about mechanisms is not meaningful, and the only real evidence is the output of a large data collection process using statistical methods. This makes convergence arguments feel decisive, because modeling a specific instrument's relationship to physical reality looks like speculation, while piling up indicator after indicator looks like rigor.
The same pattern shows up in effective altruist philanthropy, where it impairs learning by letting you carry incompatible hypotheses indefinitely without testing them. [3]
Predictable blind spots
The style will tend to:

- treat convergence among correlated instruments as independent confirmation;
- file discrepant lived experience as a psychological phenomenon ("vibes") rather than as evidence about what the instruments miss;
- misapply parsimony across periods, extending one period's story to another whose evidence is of different quality;
- miss Goodhart effects, because the divergence between proxy and target is invisible from inside the control loop.
The corrective is not to abandon quantitative evidence or distrust convergence categorically. It is to treat each measurement as the output of a specific causal process, and to ask whether the process supports the use you want to make of the output, in the context where you're trying to apply it. When the question is whether a specific distortion explains an observed trend, the answer must come from modeling the distortion directly, not from counting correlated indicators.
Weekend at Bernie's comes closer, but it takes a lot of work, which plainly does not scale to a meaningful distortion of the homicide statistics, and in any case it is not a documentary. ↩︎
Michele Reilly's Anatomy of a Bubble describes a related mechanism in which "arbitrageurs" extract value by creating uniformity of belief around a speculative commodity, with pragmatism functioning as submission to threats rather than as independent assessment. ↩︎
See (Oppression and production are competing explanations for wealth inequality) and (A drowning child is hard to find) for worked examples. Holden Karnofsky's Sequence thinking vs. cluster thinking explicitly defends sandboxing uncertain perspectives as epistemically superior to following chains of reasoning to their conclusions. But the cost of sandboxing is that you never follow a chain of reasoning far enough to falsify it in a timely manner. For the radical problems created by this deferral of accountability, see Civil Law and Political Drama. ↩︎