There is an epistemic stance, common among academics in quantitative fields, academics who wish they were in quantitative fields, and independent scholars who do not wish to decorrelate too much from the academic mainstream by communicating in an incompatible dialect, that treats statistical convergence as the gold standard of evidence. If many indicators point the same direction, the signal is real. Call this statisticism. It converges on truth when your instruments have independent errors. It diverges from truth when they share a systematic distortion, because then convergence is what the distortion looks like. The following example illustrates a case where it fails, and why.
Two stories about the same numbers
The US homicide rate doubled between 1960 and 1980, then fell by more than half between 1991 and 2014. I argued that the fall is mostly a medical artifact: trauma surgery vastly improved, so the same rate of shootings produced fewer deaths. I constructed an adjusted trend line using two independent data sources and found no clear decline in serious violence after 1980.
Scott Alexander argues the decline is real. Many different crime categories all fell together: homicide, robbery, car theft, survey-measured victimization. This convergence, in the statisticist mode, makes the decline robust.
The convergence argument
Scott's reasoning: many indicators agree, therefore the signal is real. Good logic when your instruments are trustworthy. Bad logic when the question is whether your instruments are broken.
Every indicator he cites has specific, identifiable problems for measuring serious interpersonal violence:
Homicide rates are suppressed by improving medicine. This is the whole question. The FBI's own Supplementary Homicide Reports make no adjustment for changing lethality.
Aggravated assault rates were inflated for decades by expanding police reporting (the 911 rollout, professionalization of record-keeping, recognition of domestic violence) and then deflated by CompStat-era gaming. The NYPD's CompStat system, introduced in 1994, held precinct commanders accountable for index crime numbers. Felony assaults fell 42% from 2000 to 2009 while misdemeanor assaults fell only 9%, a divergence that Eterno and Silverman documented as systematic downclassification. Under UCR rules, a shooting is hard to classify as anything other than aggravated assault, but a borderline bar fight can plausibly be coded as simple assault, removing it from the index. The expansion of the category led not only to increased reporting but to increased charging and conviction, and therefore to substantially greater penalties for the marginal cases newly counted as aggravated assault, which were thereby selectively disincentivized.
Victim surveys (the NCVS) interview about 240,000 people and get roughly 1,000 aggravated assault reports per year. The survey of course contains no direct information about homicides (you cannot interview the dead), but Scott later argued that if homicides were being converted to aggravated assaults through medical mitigation, that should be reflected in the NCVS numbers. The signal of interest (would-be homicides reclassified as assaults by medical improvement) is a tiny fraction of total assaults. The survey lacks the statistical power to detect it. The NCVS documentation itself flags assault as the worst-recalled crime in the survey.
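The power problem can be made concrete with a back-of-the-envelope sketch. Apart from the ~1,000 sample reports mentioned above, every number below is a hypothetical placeholder chosen only to show the orders of magnitude involved:

```python
import math

# Assumed illustrative numbers, not official figures:
sample_reports = 1_000         # aggravated assault reports the NCVS yields per year
national_assaults = 1_000_000  # rough national aggravated assault total (assumption)
converted = 5_000              # hypothetical would-be homicides saved by medicine per year

# Extra sample reports the reclassification signal would contribute
extra = sample_reports * converted / national_assaults

# Year-to-year sampling noise in the report count (Poisson approximation)
noise = math.sqrt(sample_reports)

print(f"expected extra reports: {extra:.1f}")    # ~5
print(f"sampling noise (1 sd):  {noise:.1f}")    # ~32
print(f"signal/noise ratio:     {extra / noise:.2f}")
```

Under these assumptions the signal is a small fraction of one standard deviation of sampling noise, which is what "lacks the statistical power to detect it" means in practice.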
Property crime responds to locks, cameras, cashless payments, and prosecution thresholds. It tells you about theft, not about whether people are shooting each other. Car theft declined because of immobilizers and GPS tracking, not because of declining criminal intent.
The limitations of these instruments are neither secret nor heterodox. The FBI's UCR handbook warns about comparability problems across time and jurisdiction. The NCVS documentation discusses its own power limitations. The information about instrument quality exists. It just gets stripped away as data moves from producers to consumers, so that by the time the data reaches a blog post, a newspaper, or a summary characterization from an adjacent academic field, it looks like a clean fact about reality rather than a noisy output of a specific, flawed process.
All of these indicators have drifted in the direction of apparent decline during the period in question, for reasons unrelated to whether people became less violent. Counting up indicators that agree doesn't help when they share the defect you're trying to diagnose.
Suppose you suspect your bathroom scale reads low because the spring is worn out. Your friend says it must be accurate because your belt fits better, your face looks thinner, and your blood pressure is down. These are all evidence of something (maybe you're exercising more) but none of them address whether the scale reads low. Body recomposition might produce the same effects. If you want to know whether the spring is worn out, you need to test the spring, or at least the scale.
Testing the spring
I took the hardest data available, the actual count of dead bodies from death certificates filed by medical examiners, and asked: how has the relationship between this number and the underlying rate of serious violence changed over time? Dead bodies are not subject to reporting drift, survey methodology, or police statistics games. The Monty Python parrot scenario is an outrageous fictional exaggeration, and even then it was a parrot; brazenly insisting an obviously dead human being is alive to avoid a minor financial inconvenience strains plausibility even for an absurd comedy sketch.[1]
Homicide rates are subject to one known distortion: whatever the perpetrator does, if the victim doesn't actually die of it, it wasn't a homicide. Medicine is a field specifically devoted to causing people not to die of things they otherwise would have died of, and (I think even Robin Hanson would agree) it has sometimes gotten better over time. So I measured the improvement using two independent clinical sources (FBI firearm lethality ratios and hospital abdominal gunshot wound survival rates) and divided it out.
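The arithmetic of the adjustment is simple to sketch. The homicide rates below are rough approximations of published figures, and the lethality ratios are entirely hypothetical stand-ins for the two clinical series named above:

```python
# Approximate US homicide rates per 100k (rounded from published figures)
homicide_rate = {1960: 5.1, 1991: 9.8, 2014: 4.4}

# Hypothetical lethality: deaths per would-be-fatal assault, normalized so
# 1960 medicine = 1.0. The real series would come from FBI firearm
# lethality ratios and hospital gunshot-wound survival data.
lethality = {1960: 1.00, 1991: 0.70, 2014: 0.45}

# Adjusted rate: assaults that would have been fatal under 1960 medicine
adjusted = {y: homicide_rate[y] / lethality[y] for y in homicide_rate}
for year, rate in adjusted.items():
    print(year, round(rate, 1))
```

On numbers like these, the adjusted 2014 rate sits well above 1960 rather than at a record low; the substantive question is what the true lethality series looks like, which is exactly what the two clinical sources are for.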
This is instrument-modeling: instead of asking "do many measurements agree?", asking "what is this specific measurement actually tracking, and how has the tracking changed?"
Where the blind spots appear
In a subsequent exchange on Substack between me and Scott, statisticism produced a characteristic set of moves. Scott clearly writes from a place of genuine uncertainty and curiosity. But the statisticist default shapes what counts as engaging with an argument, and the result is that certain kinds of evidence become structurally difficult to hear.
The hardest evidence gets outvoted
The strongest piece of evidence in the entire debate is a doubling in death counts between 1960 and 1980, during a period of well-documented medical improvement. Death certificates filed by medical examiners are the least distorted measurement available. If you accept this evidence and the medical adjustment, violence roughly tripled on the adjusted measure, and for crime to be at "record lows" today, the adjusted rate would need to have fallen back by a comparable amount. My data shows it didn't.
I flagged this as the crux: "my argument that violent crime increased a lot 1964–1980 is strong, and I'd need to be wrong about that for [the] headline claim to be true." Scott responded: "I agree there's less data about 1960–1980."
I hadn't said anything about having less data. I'd said I had strong evidence. Body counts are the hardest data in this debate. There is less survey data before 1973, because the National Crime Victimization Survey didn't start until then. But death certificates are older and more reliable than the best survey available. By responding as though I was arguing from data scarcity, Scott reframed "I have body counts" as "there's less data," inverting the hierarchy of evidence and attributing that inversion to me. I don't think this was deliberate, but I confess it rankles a bit to have words put in my mouth, which may make me less fair-minded than I otherwise would be; still, I don't think it's good discursive practice for people with grievances to self-silence for want of an advocate, so on we go! Within the statisticist framework this move is natural and almost invisible, because the framework ranks evidence by quantity and diversity of sources rather than by the quality of any single source's connection to physical reality.
Trends get reified
Statisticism encourages treating "the crime trend" as a thing that exists in the world, rather than as a summary computed from instruments. Once you think of it as a thing, you can ask whether it went up or down, and you evaluate this by polling your instruments.
The grand old Duke of York,
He had ten thousand instruments;
He marched them up to the top of the hill,
And he marched them down again.
When they were up, crime was up,
And when they were down, crime was down,
And when they were only halfway up,
Crime was neither up nor down.
But the crime trend is neither a generating process for, nor an explanation of, crimes. There are specific events (shootings, robberies, car thefts) counted by specific instruments with specific mechanics by which the events are detected, categorized, and counted. "Crime" is a word we use to group these events. A car theft and a shooting are both crimes, but they have different causes, different mechanisms, and different measurement problems. Treating these different instruments as interchangeable readings of a single underlying variable discards everything you know about how each measurement works.
The hypothesis that a single underlying generating factor, whether it's propensity for criminality, the trust level of society, or the cybernetic capacity of the state, drives changes in all these categories is a strong claim that calls for strong evidence. And notice that I just described the union of three distinct theories connected with "or," not one coherent theory. Much like evidence for the monotheists' Yahweh doesn't work if it proves too much and also supports the incompatible Zeus, an argument for a single factor has to either rule out the other contenders for that factor, or specify under what conditions the convergence should fail.
Parsimony gets misapplied across periods
Scott argues that since the post-1980 decline appears real (convergence), the 1960–1980 increase was probably also smaller than it looks. This treats "the trend" as a single object to be accepted or rejected wholesale. But the evidence is asymmetric. The 1960–1980 increase rests principally on body counts. The post-1980 decline rests on rates contaminated by the artifacts under dispute. Projecting the weaker period's story onto the stronger period gets the direction of inference backwards.
Experience gets filed as "vibes"
"Who are you going to believe, me or your lying eyes?" is not, on its face, a very credible rhetorical move. But reframe it as "what are you going to believe, objective statistics or the vibes?" and it becomes surprisingly effective.
In a followup post on disorder, Scott examines whether the things people complain about (litter, graffiti, tent encampments) are really increasing. He looks at the indicators, finds most flat or down, and concludes that perceived disorder probably outruns actual disorder. He frames this as keeping "one foot in the statistical story, one foot in the vibes." Statistics on one side, vibes on the other. The lived experience of people who observe deteriorating conditions gets categorized as a psychological phenomenon to be explained, not as evidence about reality. Along the way he notices several times that his indicators don't match what people report (NYC's litter ratings contradict residents' experience, shoplifting data contradicts what stores say) but instead of asking "what is this instrument failing to capture?", he files these as caveats and returns to the cluster.
I think Scott is trying to be appealingly self-deprecating here: he too has vibes, he too feels the despair when he goes to San Francisco, he's not claiming to be above it. But self-deprecation about perception itself unmediated by statistics is also deprecation of everyone else's capacity to make sense of their environment. Hey, I'm someone! If my eyes and your eyes and the store owners' eyes all see the same thing, and the statistics disagree, "vibes" is a word that makes it easy to dismiss all of us at once, including yourself. The ideology operates as a default, the place you end up when you're not actively thinking about what your instruments are doing.
Statisticism: the Good, the Bad, and the Ugly
So when should you trust convergence? When does it go wrong? And what turns a useful tool into an ideology?
When convergence works
In the ideal case, convergence is straightforward. Multiple labs estimate a physical constant using different experimental setups. Each lab has its own systematic errors, but these are uncorrelated, so convergence across labs really does reduce uncertainty. In finance, this is genuine diversification: a portfolio of uncorrelated assets really does have lower variance than any individual holding. The key word is uncorrelated.
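How much the key word matters can be simulated directly. A toy model, assuming ten instruments with unit idiosyncratic noise and an optional bias term shared by all of them:

```python
import random
import statistics

random.seed(0)
TRUTH = 10.0
N_INSTRUMENTS, N_TRIALS = 10, 2000

def consensus_error(shared_bias_sd):
    """Spread of the 10-instrument average; shared_bias_sd controls correlated error."""
    estimates = []
    for _ in range(N_TRIALS):
        bias = random.gauss(0, shared_bias_sd)           # one draw shared by all instruments
        readings = [TRUTH + bias + random.gauss(0, 1.0)  # plus independent noise per instrument
                    for _ in range(N_INSTRUMENTS)]
        estimates.append(statistics.fmean(readings))
    return statistics.pstdev(estimates)

print(round(consensus_error(0.0), 2))  # roughly 1/sqrt(10), about 0.32: averaging helps
print(round(consensus_error(1.0), 2))  # roughly 1.05: the shared bias puts a floor under the error
```

With independent errors, ten instruments cut the error by a factor of sqrt(10); with a shared bias of the same magnitude as the idiosyncratic noise, adding instruments barely helps at all, because they all drift together.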
The more interesting case is Charles Darwin, who spent years collecting observations from island biogeography, comparative anatomy, the fossil record, and selective breeding. These observations converged on a single conclusion: species change over time through descent with modification. The convergence meant something because each line of evidence was genuinely independent. Galápagos finch beaks are not subject to the same reporting drift as the fossil record. Pigeon breeders in England are not coordinating their results with naturalists in South America. When many instruments agree and there is no shared machinery generating the agreement, convergence really does reduce uncertainty.
Gregor Mendel's experiments with pea plants had actually established the mechanism of particulate inheritance before Darwin published, though the work wasn't recognized until decades later. Mendel's genetics explained why Darwin's observations converged. This is the instrument-modeling step applied after the fact. Not just "many things point the same direction" but "here is the causal process that makes them point the same direction." The convergence was real, and the mechanism confirmed it.
When convergence misleads
Governing complex systems requires feedback loops, and feedback loops on complex outcomes require proxies. You can't steer a national economy without GDP, manage public health without mortality rates, or run a criminal justice system without crime statistics. These statistical proxies are genuine attempts to compress high-dimensional reality into signals that a control system can act on. Statisticism in its legitimate form is the epistemology that makes cybernetic governance possible. The people who built these proxies were trying to solve genuine problems, like winning the World Wars, and often succeeded. The tragedy is that the solution becomes the next problem.
You often want your national statistics to be methodologically standardized so they're comparable across jurisdictions and time. But standardization introduces shared methodology and therefore shared exposure to the same biases. In finance, this shared exposure would be called basis risk: the risk that your instrument doesn't track the thing it's supposed to track. The question is always whether anyone is modeling the basis risk. Usually nobody is, because within the statisticist framework, the proxy is reality.
Compare Darwin's case. His observations converged because nature ran genuinely separate experiments on different islands. Crime statistics converge because they all pass through the same institutional machinery: the same reporting systems, the same definitional boundaries, the same political incentives. That's not the convergence of independent evidence. It's the convergence of shared plumbing.
Goodharting
Once the proxy becomes a target, the system optimizes for the proxy rather than the underlying reality. The proxy diverges from what it was meant to track, but the divergence is invisible from within the control system, because the control system only sees the proxy.
CompStat is a textbook case. Precinct commanders were accountable for index crime numbers. The numbers improved. Whether public safety improved is a different question, one that CompStat couldn't answer because CompStat was the measurement system. From inside the control loop, declining felony assaults looked like declining violence. From outside, if you compared felony and misdemeanor assault trends and noticed they were diverging, it looked like reclassification. The people inside the loop had no reason to look outside it, and strong career incentives not to.
Beta Bucks
In finance, beta is the correlated drift left over after you diversify away idiosyncratic risk. If you own shares in two car companies instead of one, you shouldn't expect less exposure to the auto market overall, but the good or bad luck of either company (politically charged CEO, breakout product, scandal where the car explodes) affects you less. Beta is the part you can't diversify away: the movement of the whole market that carries all its participants together. An asset with high beta rises when the market rises and falls when the market falls. In a system where correlated failures get bailed out, beta is free money: you capture the upside of the shared drift and the government absorbs the downside.
More generally, once the proxy is the target, people can profit by correlating their behavior with it, betting explicitly or implicitly against divergence from trend. When enough enterprises are exposed to the same risk, the government prevents them from failing, so excessive optimism is not selected against when correlated with others' optimism. When enough researchers share the same methodology, the consensus can't be challenged without challenging everyone at once, so the methodology becomes a means of organizing politically.
Hidden correlations can arise by accident: nutrition studies all using food frequency questionnaires, crime statistics all subject to the same reporting drift. But once a correlated movement exists, it attracts and retains participants. The environment selects for people who bet with the consensus and conditions them to feel that doing so is epistemically virtuous. The ones who didn't are no longer in the room. In practice the accidental and motivated components blur together, since most participants are not conscious of the full incentive structure. They're doing what feels advantageous, appropriate, or safe.
A lone dissenter who says "these instruments share a bias" is in the position of a short-seller betting against a systemically important asset class: possibly right, but structurally disadvantaged, because the system is set up to bail out the consensus.[2]
Not with a whimper, but a bang
This implies a testable prediction: when a subsidized consensus breaks, it should break catastrophically and all at once, because the same correlation that made it feel robust makes it fragile. The replication crisis in psychology looks like this: not a slow erosion of confidence, but a sudden phase transition once a few key papers fell and the shared methodological exposure was revealed.
Why it works politically
Statisticism functions well as consensus enforcement even when it fails as epistemics. Many instruments agreeing gives you a way to dismiss any individual challenge: "that's just one study," "that contradicts the weight of evidence." This works regardless of whether the instruments are actually independent, because most audiences cannot evaluate independence of error sources. You get to feel and vibe to others like a truth-seeker, while doing what is functionally consensus enforcement, because the rules of your epistemology produce the same behavior: privilege the cluster, dismiss the outlier. Nobody needs to be lying. The epistemology does the work for them.
Thermometer, Thermostat, Theology: the Lifecycle of a Proxy
Legitimate cybernetic need → proxy construction → caveats get stripped as data moves downstream → proxy becomes target (Goodhart) → correlated exploitation of the target (too-big-to-fail) → statisticism as the ideology that treats the proxy layer as reality and structurally cannot hear challenges to it.
Statisticism is an ideology within which the idea of evidence has been not augmented but replaced by the idea of statistics. Within this framework, only statistically legible information counts as meaningful. Your sensorium is not meaningful, first-principles reasoning about mechanisms is not meaningful, and the only real evidence is the output of a large data collection process using statistical methods. This makes convergence arguments feel decisive, because modeling a specific instrument's relationship to physical reality looks like speculation, while piling up indicator after indicator looks like rigor.
The same pattern shows up in effective altruist philanthropy, where it impairs learning by letting you carry incompatible hypotheses indefinitely without testing them.[3]
Predictable blind spots
The style will tend to:
Dismiss strong individual measurements that disagree with the cluster
Miss systematic biases that affect many indicators in the same direction
Treat "many data sources agree" as a conversation-stopper rather than asking whether the agreement is informative
Reframe strong but solitary evidence as "less data" rather than "different and better data"
Categorize non-statistical evidence (direct observation, mechanistic reasoning, lived experience) as "vibes" rather than as information about reality that the statistics may be failing to capture
Apply parsimony across contexts where the generating process has changed, because parsimony feels rigorous and context-sensitivity feels like special pleading
The corrective is not to abandon quantitative evidence or distrust convergence categorically. It is to treat each measurement as the output of a specific causal process, and to ask whether the process supports the use you want to make of the output, in the context where you're trying to apply it. When the question is whether a specific distortion explains an observed trend, the answer must come from modeling the distortion directly, not from counting correlated indicators.
Weekend at Bernie's comes closer, but it takes a lot of work which plainly does not scale to a meaningful distortion of the homicide statistics, and in any case it is not a documentary. ↩︎
Michele Reilly's Anatomy of a Bubble describes a related mechanism in which "arbitrageurs" extract value by creating uniformity of belief around a speculative commodity, with pragmatism functioning as submission to threats rather than as independent assessment. ↩︎
There is an epistemic stance, common among academics in quantitative fields, academics who wish they were in quantitative fields, and independent scholars who do not wish to decorrelate too much from the academic mainstream by communicating in an incompatible dialect, that treats statistical convergence as the gold standard of evidence. If many indicators point the same direction, the signal is real. Call this statisticism. It converges on truth when your instruments have independent errors. It diverges from truth when they share a systematic distortion, because then convergence is what the distortion looks like. The following example illustrates a case where it fails, and why.
Two stories about the same numbers
The US homicide rate doubled between 1960 and 1980, then fell by more than half between 1991 and 2014. I argued that the fall is mostly a medical artifact: trauma surgery vastly improved, so the same rate of shootings produced fewer deaths. I constructed an adjusted trend line using two independent data sources and found no clear decline in serious violence after 1980.
Scott Alexander argues the decline is real. Many different crime categories all fell together: homicide, robbery, car theft, survey-measured victimization. This convergence, in the statisticist mode, makes the decline robust.
The convergence argument
Scott's reasoning: many indicators agree, therefore the signal is real. Good logic when your instruments are trustworthy. Bad logic when the question is whether your instruments are broken.
Every indicator he cites has specific, identifiable problems for measuring serious interpersonal violence:
The limitations of these instruments are neither secret nor heterodox. The FBI's UCR handbook warns about comparability problems across time and jurisdiction. The NCVS documentation discusses its own power limitations. The information about instrument quality exists. It just gets stripped away as data moves from producers to consumers, so that by the time the data reaches a blog post, a newspaper, or a summary characterization from an adjacent academic field, it looks like a clean fact about reality rather than a noisy output of a specific, flawed process.
All of these indicators have drifted in the direction of apparent decline during the period in question, for reasons unrelated to whether people became less violent. Counting up indicators that agree doesn't help when they share the defect you're trying to diagnose.
Suppose you suspect your bathroom scale reads low because the spring is worn out. Your friend says it must be accurate because your belt fits better, your face looks thinner, and your blood pressure is down. These are all evidence of something (maybe you're exercising more) but none of them address whether the scale reads low. Body recomposition might produce the same effects. If you want to know whether the spring is worn out, you need to test the spring, or at least the scale.
Testing the spring
I took the hardest data available, the actual count of dead bodies from death certificates filed by medical examiners, and asked: how has the relationship between this number and the underlying rate of serious violence changed over time? Dead bodies are not subject to reporting drift, survey methodology, or police statistics games. The Monty Python parrot scenario is an outrageous fictional exaggeration, and even then it was a parrot; brazenly insisting an obviously dead human being is alive to avoid a minor financial inconvenience strains plausibility even for an absurd comedy sketch. [1]
Homicide rates are subject to one known distortion: whatever the perpetrator does, if the victim doesn't actually die of it, it wasn't a homicide. Medicine is a field specifically devoted to causing people not to die of things they otherwise would have died of, and (I think even Robin Hanson would agree) it has sometimes gotten better over time. So I measured the improvement using two independent clinical sources (FBI firearm lethality ratios and hospital abdominal gunshot wound survival rates) and divided it out.
This is instrument-modeling: instead of asking "do many measurements agree?", asking "what is this specific measurement actually tracking, and how has the tracking changed?"
Where the blind spots appear
In a subsequent exchange on Substack between me and Scott, statisticism produced a characteristic set of moves. Scott clearly writes from a place of genuine uncertainty and curiosity. But the statisticist default shapes what counts as engaging with an argument, and the result is that certain kinds of evidence become structurally difficult to hear.
The hardest evidence gets outvoted
The strongest piece of evidence in the entire debate is a doubling in death counts between 1960 and 1980, during a period of well-documented medical improvement. Death certificates filed by medical examiners are the least distorted measurement available. If you accept this evidence and the medical adjustment, violence roughly tripled on the adjusted measure, and for crime to be at "record lows" today, the adjusted rate would need to have fallen back by a comparable amount. My data shows it didn't.
I flagged this as the crux: "my argument that violent crime increased a lot 1964–1980 is strong, and I'd need to be wrong about that for [the] headline claim to be true." Scott responded: "I agree there's less data about 1960–1980."
I hadn't said anything about having less data. I'd said I had strong evidence. Body counts are the hardest data in this debate. There is less survey data before 1973, because the National Crime Victim Survey didn't start until then. But death certificates are older and more reliable than the best survey available. By responding as though I was arguing from data scarcity, Scott reframed "I have body counts" as "there's less data," inverting the hierarchy of evidence and attributing that inversion to me. I don't think this was deliberate, but I confess that it rankles a bit to have words put in my mouth, which may make me less fair-minded than I otherwise might be; but I don't think it's good discursive practice for people with grievances to self-silence for want of an advocate, so on we go! Within the statisticist framework this move is natural and almost invisible, because the framework ranks evidence by quantity and diversity of sources rather than by the quality of any single source's connection to physical reality.
Trends get reified
Statisticism encourages treating "the crime trend" as a thing that exists in the world, rather than as a summary computed from instruments. Once you think of it as a thing, you can ask whether it went up or down, and you evaluate this by polling your instruments.
But the crime trend is neither a generating process for, nor an explanation of, crimes. There are specific events (shootings, robberies, car thefts) counted by specific instruments with specific mechanics by which the events are detected, categorized, and counted. "Crime" is a word we use to group these events. A car theft and a shooting are both crimes, but they have different causes, different mechanisms, and different measurement problems. Treating these different instruments as interchangeable readings of a single underlying variable discards everything you know about how each measurement works.
The hypothesis that one underlying single generating factor, whether it's propensity for criminality, the trust level of society, or the cybernetic capacity of the state, drives changes in all these categories, is a strong claim that calls for strong evidence. I just described the union of three distinct theories connected with "or," not one coherent theory. Much like evidence for the existence of the monotheists' Yahweh doesn't work if it proves too much and also supports the incompatible Zeus, an argument for a single factor has to either rule out the other contenders for the single factor, or specify under what conditions the convergence should fail.
Parsimony gets misapplied across periods
Scott argues that since the post-1980 decline appears real (convergence), the 1960–1980 increase was probably also smaller than it looks. This treats "the trend" as a single object to be accepted or rejected wholesale. But the evidence is asymmetric. The 1960–1980 increase rests principally on body counts. The post-1980 decline rests on rates contaminated by the artifacts under dispute. Projecting the weaker period's story onto the stronger period gets the direction of inference backwards.
Experience gets filed as "vibes"
"Who are you going to believe, me or your lying eyes?" is not, on its face, a very credible rhetorical move. But reframe it as "what are you going to believe, objective statistics or the vibes?" and it becomes surprisingly effective.
In a followup post on disorder, Scott examines whether the things people complain about (litter, graffiti, tent encampments) are really increasing. He looks at the indicators, finds most flat or down, and concludes that perceived disorder probably outruns actual disorder. He frames this as keeping "one foot in the statistical story, one foot in the vibes." Statistics on one side, vibes on the other. The lived experience of people who observe deteriorating conditions gets categorized as a psychological phenomenon to be explained, not as evidence about reality. Along the way he notices several times that his indicators don't match what people report (NYC's litter ratings contradict residents' experience, shoplifting data contradicts what stores say) but instead of asking "what is this instrument failing to capture?", he files these as caveats and returns to the cluster.
I think Scott is trying to be appealingly self-deprecating here: he too has vibes, he too feels the despair when he goes to San Francisco, he's not claiming to be above it. But self-deprecation about unmediated perception is also deprecation of everyone else's capacity to make sense of their environment. Hey, I'm someone! If my eyes and your eyes and the store owners' eyes all see the same thing, and the statistics disagree, "vibes" is a word that makes it easy to dismiss all of us at once, including yourself. The ideology operates as a default, the place you end up when you're not actively thinking about what your instruments are doing.
Statisticism: the Good, the Bad, and the Ugly
So when should you trust convergence? When does it go wrong? And what turns a useful tool into an ideology?
When convergence works
In the ideal case, convergence is straightforward. Multiple labs estimate a physical constant using different experimental setups. Each lab has its own systematic errors, but these are uncorrelated, so convergence across labs really does reduce uncertainty. In finance, this is genuine diversification: a portfolio of uncorrelated assets really does have lower variance than any individual holding. The key word is uncorrelated.
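The difference between these two cases can be made concrete with a toy simulation (illustrative numbers and function names, assuming Gaussian noise): when instrument errors are independent, averaging ten instruments beats any one of them; when they share a systematic distortion, the ten readings still cluster tightly, but around the wrong number.

```python
import random
import statistics

def rms_error(n_instruments, shared_bias_sd, noise_sd,
              trials=2000, truth=100.0, seed=0):
    """Average n instruments per trial; return the RMS error of the average.

    Each instrument reads truth + shared_bias + its own noise. With
    shared_bias_sd = 0 the errors are independent and averaging helps;
    otherwise the readings agree with each other but not with reality.
    """
    rng = random.Random(seed)
    sq_errors = []
    for _ in range(trials):
        shared_bias = rng.gauss(0, shared_bias_sd)  # same distortion hits every instrument
        readings = [truth + shared_bias + rng.gauss(0, noise_sd)
                    for _ in range(n_instruments)]
        sq_errors.append((statistics.mean(readings) - truth) ** 2)
    return statistics.mean(sq_errors) ** 0.5

one_alone   = rms_error(n_instruments=1,  shared_bias_sd=0.0, noise_sd=5.0)
independent = rms_error(n_instruments=10, shared_bias_sd=0.0, noise_sd=5.0)
correlated  = rms_error(n_instruments=10, shared_bias_sd=5.0, noise_sd=5.0)

# Ten independent instruments beat one; ten sharing a bias do not.
print(f"one instrument:        {one_alone:.2f}")
print(f"ten independent:       {independent:.2f}")
print(f"ten with shared bias:  {correlated:.2f}")
```

The shared-bias case is the trap: within each trial the ten readings converge on each other just as tightly as in the independent case, so convergence alone cannot distinguish the two.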
The more interesting case is Charles Darwin, who spent years collecting observations from island biogeography, comparative anatomy, the fossil record, and selective breeding. These observations converged on a single conclusion: species change over time through descent with modification. The convergence meant something because each line of evidence was genuinely independent. Galápagos finch beaks are not subject to the same reporting drift as the fossil record. Pigeon breeders in England are not coordinating their results with naturalists in South America. When many instruments agree and there is no shared machinery generating the agreement, convergence really does reduce uncertainty.
Gregor Mendel's experiments with pea plants had actually established the mechanism of particulate inheritance before Darwin published, though the work wasn't recognized until decades later. Mendel's genetics explained why Darwin's observations converged. This is the instrument-modeling step applied after the fact. Not just "many things point the same direction" but "here is the causal process that makes them point the same direction." The convergence was real, and the mechanism confirmed it.
When convergence misleads
Governing complex systems requires feedback loops, and feedback loops on complex outcomes require proxies. You can't steer a national economy without GDP, manage public health without mortality rates, or run a criminal justice system without crime statistics. These statistical proxies are genuine attempts to compress high-dimensional reality into signals that a control system can act on. Statisticism in its legitimate form is the epistemology that makes cybernetic governance possible. The people who built these proxies were trying to solve genuine problems, like winning the World Wars, and often succeeded. The tragedy is that the solution becomes the next problem.
You often want your national statistics to be methodologically standardized so they're comparable across jurisdictions and time. But standardization introduces shared methodology and therefore shared exposure to the same biases. In finance, this shared exposure would be called basis risk: the risk that your instrument doesn't track the thing it's supposed to track. The question is always whether anyone is modeling the basis risk. Usually nobody is, because within the statisticist framework, the proxy is reality.
Compare Darwin's case. His observations converged because nature ran genuinely separate experiments on different islands. Crime statistics converge because they all pass through the same institutional machinery: the same reporting systems, the same definitional boundaries, the same political incentives. That's not the convergence of independent evidence. It's the convergence of shared plumbing.
Goodharting
Once the proxy becomes a target, the system optimizes for the proxy rather than the underlying reality. The proxy diverges from what it was meant to track, but the divergence is invisible from within the control system, because the control system only sees the proxy.
CompStat is a textbook case. Precinct commanders were accountable for index crime numbers. The numbers improved. Whether public safety improved is a different question, one that CompStat couldn't answer because CompStat was the measurement system. From inside the control loop, declining felony assaults looked like declining violence. From outside, if you compared felony and misdemeanor assault trends and noticed they were diverging, it looked like reclassification. The people inside the loop had no reason to look outside it, and strong career incentives not to.
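The reclassification mechanism can be sketched as a toy model (the numbers are illustrative, not real UCR data): true violence stays flat, but a growing fraction of borderline incidents gets coded as simple rather than aggravated assault, so the index metric falls while the total never changes.

```python
# Toy model of Goodharting via downclassification. Illustrative numbers only.
TRUE_INCIDENTS = 1000      # actual violent incidents per year, unchanged
BORDERLINE_SHARE = 0.3     # incidents that can plausibly be coded either way

def recorded_counts(year, downclass_rate_per_year=0.05):
    """Return (aggravated, simple) counts as coding pressure grows over time."""
    downclassed = min(1.0, downclass_rate_per_year * year)  # fraction re-coded
    borderline = TRUE_INCIDENTS * BORDERLINE_SHARE
    aggravated = TRUE_INCIDENTS - borderline * downclassed
    simple = borderline * downclassed
    return aggravated, simple

for year in (0, 5, 10):
    agg, simp = recorded_counts(year)
    # The index category falls, the non-index category rises, the sum is flat.
    print(year, agg, simp, agg + simp)
```

Inside the control loop you only see the first column falling. The diagnostic is the second column: a divergence between index and non-index categories that true declining violence would not produce.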
Beta Bucks
In finance, beta is the correlated drift left over after you diversify away idiosyncratic risk. If you own shares in two car companies instead of one, you shouldn't expect less exposure to the auto market overall, but the good or bad luck of either company (politically charged CEO, breakout product, scandal where the car explodes) affects you less. Beta is the part you can't diversify away: the movement of the whole market that carries all its participants together. An asset with high beta rises when the market rises and falls when the market falls. In a system where correlated failures get bailed out, beta is free money: you capture the upside of the shared drift and the government absorbs the downside.
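A minimal one-factor sketch (illustrative parameters; each asset's return is beta times a shared market shock plus its own idiosyncratic shock) shows the asymmetry: adding assets shrinks the idiosyncratic variance toward zero, but the portfolio variance never drops below the floor set by beta.

```python
import random
import statistics

def portfolio_variance(n_assets, beta=1.0, market_sd=0.04, idio_sd=0.06,
                       trials=20000, seed=1):
    """Variance of an equal-weight portfolio in a toy one-factor model:
    every asset returns beta * market + its own idiosyncratic shock."""
    rng = random.Random(seed)
    returns = []
    for _ in range(trials):
        market = rng.gauss(0, market_sd)  # shared drift: hits every asset
        assets = [beta * market + rng.gauss(0, idio_sd) for _ in range(n_assets)]
        returns.append(statistics.mean(assets))
    return statistics.variance(returns)

one_car_company   = portfolio_variance(1)
two_car_companies = portfolio_variance(2)
many_companies    = portfolio_variance(50)

# Idiosyncratic variance shrinks as 1/n; the beta * market term does not.
# Theoretical floor here: (beta * market_sd)**2 = 0.0016.
print(one_car_company, two_car_companies, many_companies)
```

Fifty holdings get you close to the floor and no further: the residual variance is the beta, the part of your exposure that diversification cannot touch.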
More generally, once the proxy is the target, people can profit by correlating their behavior with it, betting explicitly or implicitly against divergence from trend. When enough enterprises are exposed to the same risk, the government prevents them from failing, so excessive optimism is not selected against when correlated with others' optimism. When enough researchers share the same methodology, the consensus can't be challenged without challenging everyone at once, so the methodology becomes a means of organizing politically.
Hidden correlations can arise by accident: nutrition studies all using food frequency questionnaires, crime statistics all subject to the same reporting drift. But once a correlated movement exists, it attracts and retains participants. The environment selects for people who bet with the consensus and conditions them to feel that doing so is epistemically virtuous. The ones who didn't are no longer in the room. In practice the accidental and motivated components blur together, since most participants are not conscious of the full incentive structure. They're doing what feels advantageous, appropriate, or safe.
A lone dissenter who says "these instruments share a bias" is in the position of a short-seller betting against a systemically important asset class: possibly right, but structurally disadvantaged, because the system is set up to bail out the consensus. [2]
Not with a whimper, but a bang
This implies a testable prediction: when a subsidized consensus breaks, it should break catastrophically and all at once, because the same correlation that made it feel robust makes it fragile. The replication crisis in psychology looks like this: not a slow erosion of confidence, but a sudden phase transition once a few key papers fell and the shared methodological exposure was revealed.
Why it works politically
Statisticism functions well as consensus enforcement even when it fails as epistemics. Many instruments agreeing gives you a way to dismiss any individual challenge: "that's just one study," "that contradicts the weight of evidence." This works regardless of whether the instruments are actually independent, because most audiences cannot evaluate independence of error sources. You get to feel like a truth-seeker, and to read as one to others, while doing what is functionally consensus enforcement, because the rules of your epistemology produce the same behavior: privilege the cluster, dismiss the outlier. Nobody needs to be lying. The epistemology does the work for them.
Thermometer, Thermostat, Theology: the Lifecycle of a Proxy
Legitimate cybernetic need → proxy construction → caveats get stripped as data moves downstream → proxy becomes target (Goodhart) → correlated exploitation of the target (too-big-to-fail) → statisticism as the ideology that treats the proxy layer as reality and structurally cannot hear challenges to it.
Statisticism is an ideology within which the idea of evidence has been not augmented but replaced by the idea of statistics. Within this framework, only statistically legible information counts as meaningful. Your sensorium is not meaningful, first-principles reasoning about mechanisms is not meaningful, and the only real evidence is the output of a large data collection process using statistical methods. This makes convergence arguments feel decisive, because modeling a specific instrument's relationship to physical reality looks like speculation, while piling up indicator after indicator looks like rigor.
The same pattern shows up in effective altruist philanthropy, where it impairs learning by letting you carry incompatible hypotheses indefinitely without testing them. [3]
Predictable blind spots
The style will tend to:

- treat convergence among correlated instruments as independent confirmation;
- file discrepant lived experience as a psychological phenomenon ("vibes") rather than as evidence about what the instruments miss;
- misapply parsimony across periods, extending one period's story to another whose evidence is of different quality;
- miss Goodhart effects, because the divergence between proxy and target is invisible from inside the control loop.
The corrective is not to abandon quantitative evidence or distrust convergence categorically. It is to treat each measurement as the output of a specific causal process, and to ask whether the process supports the use you want to make of the output, in the context where you're trying to apply it. When the question is whether a specific distortion explains an observed trend, the answer must come from modeling the distortion directly, not from counting correlated indicators.
Weekend at Bernie's comes closer, but it takes a lot of work, which plainly does not scale to a meaningful distortion of the homicide statistics, and in any case it is not a documentary. ↩︎
Michele Reilly's Anatomy of a Bubble describes a related mechanism in which "arbitrageurs" extract value by creating uniformity of belief around a speculative commodity, with pragmatism functioning as submission to threats rather than as independent assessment. ↩︎
See (Oppression and production are competing explanations for wealth inequality) and (A drowning child is hard to find) for worked examples. Holden Karnofsky's Sequence thinking vs. cluster thinking explicitly defends sandboxing uncertain perspectives as epistemically superior to following chains of reasoning to their conclusions. But the cost of sandboxing is that you never follow a chain of reasoning far enough to falsify it in a timely manner. For the radical problems created by this deferral of accountability, see Civil Law and Political Drama. ↩︎