You Are Not Measuring What You Think You Are Measuring

Eight years ago, I worked as a data scientist at a startup, and we wanted to optimize our sign-up flow. We A/B tested lots of different changes, and occasionally found something which would boost (or reduce) click-through rates by 10% or so.

Then one week I was puzzling over a discrepancy in the variance of our daily signups. Eventually I scraped some data from the log files, and found that during traffic spikes, our server latency shot up to multiple seconds. The effect on signups during these spikes was massive: even just 300 ms was enough that click-through dropped by 30%, and when latency went up to seconds the click-through rates dropped by over 80%. And this happened multiple times per day. Latency was far and away the most important factor which determined our click-through rates. [1]

Going back through some of our earlier experiments, it was clear in hindsight that some of our biggest effect-sizes actually came from changing latency - for instance, if we changed the order of two screens, then there’d be an extra screen before the user hit the one with high latency, so the latency would be better hidden. Our original interpretations of those experiments - e.g. that the user cared more about the content of one screen than another - were totally wrong. It was also clear in hindsight that our statistics on all the earlier experiments were bunk - we’d assumed that every user’s click-through was statistically independent, when in fact they were highly correlated, so many of the results which we thought were significant were in fact basically noise.
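A minimal simulation sketch of that last point, under made-up assumptions; it is illustrative and not the analysis from the story. It runs A/A tests (two arms with no real difference) in which each arm's traffic arrives in blocks, and every user in a block shares the same latency state. The block sizes, click rates, spike probability, and function names are all hypothetical. A naive two-proportion z-test that treats users as independent should flag "significant" differences noticeably more often than its nominal 5% once the shared spikes are switched on.

```python
import math
import random

def simulate_arm(rng, n_blocks, users_per_block, base_rate, spike_prob, spike_rate):
    """Return (clicks, users) for one arm; all users in a block share one latency state."""
    clicks = 0
    for _ in range(n_blocks):
        # Hypothetical model: a latency spike lowers the click rate for the whole block.
        rate = spike_rate if rng.random() < spike_prob else base_rate
        clicks += sum(rng.random() < rate for _ in range(users_per_block))
    return clicks, n_blocks * users_per_block

def naive_z(clicks_a, n_a, clicks_b, n_b):
    """Two-proportion z statistic that treats every user as independent."""
    p = (clicks_a + clicks_b) / (n_a + n_b)
    se = math.sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))
    return 0.0 if se == 0 else (clicks_a / n_a - clicks_b / n_b) / se

def false_positive_rate(spike_prob, trials=400, seed=0):
    """Fraction of A/A tests flagged as significant at |z| > 1.96."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        a = simulate_arm(rng, 20, 100, 0.10, spike_prob, 0.02)
        b = simulate_arm(rng, 20, 100, 0.10, spike_prob, 0.02)
        if abs(naive_z(*a, *b)) > 1.96:
            hits += 1
    return hits / trials

independent = false_positive_rate(spike_prob=0.0)
correlated = false_positive_rate(spike_prob=0.3)
print(f"false positives, independent users: {independent:.2f}")
print(f"false positives, shared latency spikes: {correlated:.2f}")
```

The sketch assumes the two arms see different blocks of traffic. If both arms were perfectly interleaved within every block, the shared spikes would largely cancel in the comparison, so how much this matters depends on how the experiment was actually run.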

Main point of this example: we were not measuring what we thought we were measuring. We thought we were testing hypotheses about what information the user cared about, or what order things needed to be presented in, or whether users would be more likely to click on a bigger and shinier button. But in fact, we were mostly measuring latency.

When I look back on experiments I’ve run over the years, in hindsight the very large majority of cases are like the server latency example. The large majority of the time, experiments did not measure what I thought they were measuring. I’ll call this the First Law of Experiment Design: you are not measuring what you think you are measuring.

Against One-Bit Experiments

A one-bit experiment is an experiment designed to answer a yes/no question. It’s the prototypical case from high school statistics: which of two mouse diets results in lower bodyweight? Which of two button designs on a website results in higher click-through rates? Does a new vaccine design protect against COVID better than an old design (or better than no vaccine at all)? Can Muriel Bristol tell whether milk or tea was added first to her teacup? Will a neural net trained to navigate to a coin at the end of a level still go to the coin if it’s no longer at the end of a level? Can a rat navigate a maze just by smell?

There’s an obvious criticism of such experiments: at best, they yield one bit of information. (Of course the experimenter probably observes a lot more than one bit of information over the course of the experiment, but usually people are trained to ignore most of that useful information and just report a p-value on the original yes/no question.) The First Law of Experiment Design implies that the situation is much worse: in the large majority of cases, a one-bit experiment yields approximately zero information about the thing the experimenter intended to measure. It inevitably turns out that mouse bodyweight, or Muriel Bristol’s tea-tasting, or a neural net’s coinrun performance, in fact routes through something entirely different from what we expected.

Corollary To The First Law: If You Are Definitely Not Measuring Anything Besides What You Think You Are Measuring, You Are Probably Not Measuring Anything

Ok, but aren’t there experiments where we in fact understand what’s going on well enough that we can actually measure what we thought we were measuring? Like the vaccine test, or maybe those experiments from physics lab back in college?

Yes. And in those cases, we usually have a pretty damn good idea of what the experiment’s outcome will be. When we understand what’s going on well enough to actually measure the thing we intended to measure, we usually also understand what’s going on well enough to predict the result. And if we already know the result, then we gain zero information - in a Bayesian sense, we measure nothing.
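One way to put a number on "we gain zero information" is the Shannon entropy of our prediction for a yes/no outcome. The sketch below assumes the experiment cleanly reports that outcome; the function name and the example priors are illustrative choices, not something from a particular experiment.

```python
import math

def expected_bits(p_yes):
    """Shannon entropy, in bits, of a yes/no outcome predicted 'yes' with probability p_yes."""
    if p_yes in (0.0, 1.0):
        return 0.0  # a certain outcome carries no information
    return -(p_yes * math.log2(p_yes) + (1 - p_yes) * math.log2(1 - p_yes))

# The more confident we already are, the less the experiment can tell us.
for p in (0.5, 0.9, 0.99, 0.999):
    print(f"prior P(yes) = {p}: expect {expected_bits(p):.3f} bits")
```

A coin-flip prior gives the full one bit, while a prior of 0.99 leaves less than a tenth of a bit to learn on average.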

Take the physics lab example: in physics lab classes, we know what the result “should” be, and if we get some other result then we messed up the experiment. In other words: either we know what the result is (and therefore gain zero information), or we accidentally measure something other than what we intended. (Well… I say “accidentally”, but my college did have a physics professor who would loosen the screws on the pendulum in the freshman physics lab.) Either way, we’re definitely not measuring the thing we intended to measure - either we measure something else, or we measure nothing at all.

… though I suppose one could argue that the physics lab experiment result tells us whether or not we’ve messed up the experiment. In other words, we can test whether we’re measuring the thing we thought we were measuring. So if we know the First Law of Experiment Design, then at least we can measure whether or not the corollary applies to the case at hand.

Anyway, for the rest of this post I’ll assume we’re in a domain where we don’t already know what the answer is (or “should” be).

Solution: Measure Lots of Things

In statistics jargon, the problem is confounders. We never measure what we think we are measuring because there are always confounders, all the time. We can’t control for the confounders because in practice we never know what they are, or which potential confounders actually matter, or which confounders are upstream vs downstream. Classical statistics has lots to say about significance and experiment size and so forth, but when we don’t even know what the confounders are there’s not much to be done.

… or at least that used to be the case. Modern work on causality (e.g. Pearl) largely solves that problem - if we measure enough stuff. One of the key insights of causality is that, while we can’t determine causation from correlation of two variables, we can sometimes determine causation from correlation of three or more variables - and the more variables, the better we can nail down causality. Similarly, if we measure enough stuff, we can often back out any latent variables and figure out how they causally link up to everything else. In other words, we can often deal with confounders if we measure enough stuff.

That’s the theoretical basis for what I’ll call The Second Law of Experiment Design: if you measure enough different stuff, you might figure out what you’re actually measuring.

Feynman’s story about rat-mazes is a good example here:

He had a long corridor with doors all along one side where the rats came in, and doors along the other side where the food was.  He wanted to see if he could train the rats to go in at the third door down from wherever he started them off.  No.  The rats went immediately to the door where the food had been the time before.

The question was, how did the rats know, because the corridor was so beautifully built and so uniform, that this was the same door as before?  Obviously there was something about the door that was different from the other doors.  So he painted the doors very carefully, arranging the textures on the faces of the doors exactly the same.  Still the rats could tell.  Then he thought maybe the rats were smelling the food, so he used chemicals to change the smell after each run.  Still the rats could tell.  Then he realized the rats might be able to tell by seeing the lights and the arrangement in the laboratory like any commonsense person.  So he covered the corridor, and, still the rats could tell.

He finally found that they could tell by the way the floor sounded when they ran over it.  And he could only fix that by putting his corridor in sand.

Measure enough different stuff, and sometimes we can figure out what’s actually going on.

The biggest problem with one-bit experiments (or low-bit experiments more generally) is that we’re not measuring what we think we’re measuring, and we’re not measuring enough stuff to figure out what’s actually going on. When designing experiments, we want a firehose of bits, not just yes/no. Watching something through a microscope yields an enormous number of bits. Looking through server logs yields an enormous number of bits. That’s the sort of thing we want - a firehose of information.

Measurement Devices

What predictions might we make, from the two Laws of Experiment Design?

Here’s one: new measurement devices or applications of measurement devices, especially high-bit measurement devices, are much more likely than individual experiments to be bottlenecks to the progress of science. For instance, the microscope is more of a bottleneck than Jenner’s controlled trial of the first vaccine. Jenner’s experiment enabled only that one vaccine, and it was almost a century before anybody developed another. When the next vaccine came along, it came from Pasteur’s work watching bacteria under a microscope - and that method resulted in multiple vaccines in rapid succession, as well as “Pasteurization” as a method of disinfection.

We could make similar predictions for particle accelerators, high-throughput sequencing, electron microscopes, mass spectrometers, etc. In the context of AI/ML, we might predict that interpretability tools are a major bottleneck.

Betting Markets

For the same reasons that an experiment is usually not measuring what we think it’s measuring, a fully operationalized prediction is usually not predicting the thing we think it is predicting.

For instance, maybe what I really want to predict is something about qualitative shifts in political influence in Russia. I can operationalize that into a bunch of questions about Putin, the war in Ukraine, specific laws/policies, etc. Probably it will turn out that none of those questions actually measure the qualitative shift in political influence which I’m trying to get at. On the other hand, with a whole bunch of questions, I could maybe do some kind of principal component analysis and back out whatever main factors the questions do measure. For the same reasons that we can sometimes figure out what an experiment actually measures if we measure enough stuff, we can sometimes figure out what questions on a prediction market are actually asking about if we set up markets on enough different questions.
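As a rough sketch of the "principal component analysis" idea, the code below builds a synthetic data set in which four hypothetical questions all move with one hidden factor, then recovers that factor's direction with power iteration on the covariance matrix. The loadings, noise level, and sample size are invented for illustration, and real market data would need more care (probabilities are bounded, questions resolve at different times, and so on).

```python
import random

def covariance(rows):
    """Sample covariance matrix of a list of equal-length rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def top_component(cov, iters=200):
    """Leading eigenvector and eigenvalue of a covariance matrix via power iteration."""
    d = len(cov)
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * cov[i][j] * v[j] for i in range(d) for j in range(d))
    return v, eigenvalue

# Hypothetical data: standardized daily movements on four questions, each
# driven by one hidden factor plus independent noise. Loadings are made up.
rng = random.Random(0)
loadings = [0.9, 0.7, -0.6, 0.1]
rows = []
for _ in range(500):
    factor = rng.gauss(0, 1)
    rows.append([w * factor + rng.gauss(0, 0.3) for w in loadings])

cov = covariance(rows)
component, eigenvalue = top_component(cov)
explained = eigenvalue / sum(cov[i][i] for i in range(len(cov)))
print("first component:", [round(x, 2) for x in component])
print(f"share of variance explained: {explained:.2f}")
```

The recovered component shows which questions move together and in which direction; interpreting what that shared factor actually is still takes judgment.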

Reading Papers

Of course the Laws of Experiment Design also apply when reading the experiment designs and results of others.

As an example, here’s a recent abstract off biorxiv:

In this study, we examined whether there is repeatability in the activity levels of juvenile dyeing poison frogs (Dendrobates tinctorius). [...] We did not find individual behaviour to be repeatable, however, we detected repeatability in activity at the family level, suggesting that behavioural variation may be explained, at least partially, by genetic factors in addition to a common environment.

Just based on the abstract, I’m going to go out on a limb here and guess that this study did not, in fact, measure “genetic factors”. Probably they measured some other confounder, e.g. family members growing up near each other. (Or maybe the whole result was noise + p-hacking; there’s always that possibility.)

Ok, time to look at the paper… well, the experiment size sure is suspiciously small, they used a grand total of 17 frogs and tested 4 separate behaviors. That sure does sound like a statistical nothingburger! On the other hand, the effect size was huge and their best p-value was p < 0.001, so maaaaaybe there’s something here? I’m skeptical, but let’s give the paper the benefit of the doubt on the statistics for now.
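For a sense of scale on "4 separate behaviors", here is a small arithmetic sketch. It assumes four independent tests at a conventional 0.05 threshold, which is an assumption for illustration and not something taken from the paper's methods; the function names are hypothetical.

```python
def familywise_error(alpha, n_tests):
    """Chance of at least one false positive across n independent tests at level alpha."""
    return 1 - (1 - alpha) ** n_tests

def bonferroni_threshold(alpha, n_tests):
    """Per-test threshold that keeps the family-wise error rate at or below alpha."""
    return alpha / n_tests

print(f"chance of a spurious hit in 4 tests: {familywise_error(0.05, 4):.3f}")
print(f"Bonferroni-corrected threshold: {bonferroni_threshold(0.05, 4)}")
```

Under those assumptions the chance of at least one spurious "significant" result is about 19%, and the corrected per-test threshold is 0.0125. A reported p < 0.001 would still clear that bar, so multiple comparisons alone do not dismiss the best result.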

Did they actually measure genetic effects? Well, they sure didn’t rule out non-genetic effects. The “husbandry” section of the Methods actually has a whole spiel about how the father-frogs “exhibit an elaborate parental care behaviour” toward their tadpoles: “Recently-hatched tadpoles are transported on their father’s back from terrestrial clutches to water-filled plant structures at variable heights”. Boy, that sure does sound like a family of tadpoles growing up in a single environment which is potentially different from the environment of another family of tadpoles. The experimenters do talk about their efforts to control the exact environment in which they ran the tests themselves… but they don’t seem to have made much effort to control for variables impacting the young frogs before the test began. So, yeah, there’s ample room for non-genetic correlations between each family of tadpoles.

This is a pretty typical paper: the authors didn’t systematically control for confounders, and the experiment is sufficiently low-bit that we can’t tell what factors actually mediated the correlations between sibling frogs (assuming those correlations weren’t just noise in the first place). Probably the authors weren’t measuring what they thought they were measuring; certainly they didn’t rule out other things they might have been measuring.

Takeaways

Let’s recap the two laws of experiment design:

  • First Law of Experiment Design: you are not measuring what you think you are measuring.
  • Second Law of Experiment Design: if you measure enough different stuff, you might figure out what you’re actually measuring.

The two laws have a lot of consequences for designing and interpreting experiments. When designing experiments, assume that the experiment will not measure the thing you intend. Include lots of other measurements, to check as many other things as you can. If possible, use instruments which give a massive firehose of information, instruments which would let you notice a huge variety of things you might not have considered, like e.g. a microscope.

Similarly, when interpreting others’ experiments, assume that they were not measuring what they thought they were measuring. Ignore the claims and p-values in the abstract, go look at the graphs and images and data, cross-reference with other papers measuring other things, and try to put together enough different things to figure out what the experimenters actually measured.

  1. ^

    The numbers in the latency story are pulled out of my ass; I don’t remember what they actually were other than that the latency effects were far larger than anything else we’d seen. Consider the story qualitatively true, but fictional in the quantitative details.

44 comments

The two laws have a lot of consequences for designing and interpreting experiments. When designing experiments, assume that the experiment will not measure the thing you intend. Include lots of other measurements, to check as many other things as you can. If possible, use instruments which give a massive firehose of information, instruments which would let you notice a huge variety of things you might not have considered, like e.g. a microscope.

Great advice here. Right now, I'm validating a PCR protocol via agarose gel electrophoresis. Neither technique is new to me, but the equipment is, and I'm using borrowed DNA from one person and borrowed dye from another, using an agarose gel electrophoresis protocol that's not written for the same dye.

At first, I was running gels with just one DNA sample, basically getting one bit of information for the one question I had in mind.

I quickly realized that there were a ton of variables at play, with some unknown unknowns. Even though it took a little longer to add extra sample lanes to look at different questions, I could learn much more, much more quickly, by doing so. Even if it wasn't obvious why I'd want that data, making sure I'd have access to it when the gel was finished an hour later proved useful again and again. It is good to over-observe your experiment as much as you can get away with.

This also sounds a lot like Elmer Gates' approach to learning new things and coming up with inventions.

Gates’s “mind-using” strategies look very different from modern and traditional learning practices. Gates’s theory of “brain-building” fused radically empiricist and inductivist philosophy with a belief in the mind’s adaptability, and his “art” reflects that. Early on in his studies, Gates observed that acquiring the right higher level concepts depended heavily on having the right foundational experiences. Hence, to learn a new scientific field, Gates would start by gathering the “sensory experiences” that constituted the raw data of that science. In the case of mechanics, this might involve replicating key experiments related to things like gravity or friction. In the case of chemistry, Gates supposedly replicated the experiments described in multiple textbooks himself. In the case of weaving, which Gates was once challenged to apply his method to (which he did successfully, inventing multiple new devices), we have the following quote regarding how Gates went about the initial steps:

I secured letters of introduction to practical weavers and loom makers, and with the aid of several assistants made a systematic search of the technical literature. By actual observation of looms and methods of weaving I built over my brain with reference to that subject; acquiring all the images, concepts, ideas, and thoughts that six weeks’ continuous effort made possible.

Once Gates had had the necessary sensory experiences, he’d move on to the next step of “refunctioning” them. Refunctioning (coined by Gates) refers to mentally reviewing the experiences and observations acquired from the science until they “became more vivid, much more clearly minted and complete, while the processes of states acquired much greater celerity and efficiency.” As far as I can tell, Gates would then move on to repeating the process but for images, concepts, and thoughts (higher levels of abstraction) that emerged from repeatedly reviewing the raw experiences and studying their relationships. The following passage contains the best description I could find of this part of the process:

Then he related each concept to the others, which gave many new and true ideas (to be temporarily recorded until experimentally verified). Every new concept always implied a number of new ideas; and for the first time, to relate new concepts was always a rich opportunity for new ideas.

Linked review by our own NaiveTortoise.

Another example from the lab. My labmate is running a microfluidic system to make nanoparticles. She's trying to make them monodispersed (all the same size). What she'll publish is very precise measurements of the size distribution of the nanoparticles, which she'll obtain from the MasterSizer.

However, it's a 30 minute round trip walk to use it. And every MasterSizer reading is just a point in time. What she'll use to make the decision about whether or not to measure the nanoparticles is purely qualitative. She's got a microscope hooked up to a computer screen that's displaying the nanoparticles flowing through as they're created. She can see clearly that they're monodispersed, and can tell by sight what the effect is of changing the speed of the oil and aqueous phases, the angle of the syringes, or the ratio of the flow rates. Not to mention the effect of bumps, dust particles, and so on. It's a continuous, dynamic, high-data way to observe - a firehose of information.

The figure that will ultimately be published will capture only a small piece of that, and it will be a far less informative piece of information than simply standing there and watching the nanoparticles stream by - or an informal conversation with her. But that MasterSizer measurement is what may convince other researchers that she's had success, or that her techniques are worth trying. They'll have to set up their own firehose to really learn about those monodispersed nanoparticles. The paper is, in a way, just advertising, and maybe a third of the way to an instruction manual.

If she can, it might be nice for her to attach a few minutes of microscope footage to the paper as supplementary material. Maybe just "bad conditions" (stuff flowing by all different sizes), followed by "good conditions": stuff all the same. Lots of journals offer the possibility of videos as supplementary information. It's the sort of thing that (I think) journal editors like (maybe it boosts engagement by the metrics they use for their websites?), and it sounds like it will benefit the paper.

Well stated, good post!

You might enjoy my paper on this topic, where I present other approaches to measurement, especially in the presence of adversarial pressure, which you didn't really discuss. I also think that the idea of metric design, which I explored, gets into more detail about how I think you can practically do measurement better.

I think your 'Towards a coherent process for metric design' section alone is worth its weight in gold. Since most LW readers aren't going to click on your linked paper (click-through rates being as low in general as they are, from my experience in marketing analytics), let me quote that section wholesale:

Given the various strategies and considerations discussed in the paper, as well as failure modes and limitations, it is useful to lay out a simple and coherent outline of a process for metric design. While this will by necessity be far from complete, and will include items that may not be relevant for a particular application, it should provide at least an outline that can be adapted to various metric design processes. Outside of the specific issues discussed earlier, there is a wide breadth of expertise and understanding that may be needed for metric design. Citations in this section will also provide a variety of resources for at least introductory further reading on those topics.

  1. Understand the system being measured, including both technical (Blanchard & Fabrycky, 1990) and organizational (Berry & Houston, 1993) considerations.
    1. Determine scope
      1. What is included in the system?
      2. What will the metrics be used for?
    2. Understand the causal structure of the system
      1. What is the logic model or theory? (Rogers, Petrosino, Huebner, & Hacsi, 2000)
      2. Is there formal analysis (Gelman, 2010) or expert opinion (van Gelder, Vodicka, & Armstrong, 2016) that can inform this?
    3. Identify stakeholders (Kenny, 2014)
      1. Who will be affected?
      2. Who will use the metrics?
      3. Whose goals are relevant?
  2. Identify the Goals
    1. What immediate goals are being served by the metric(s)? How are individual impacts related to performance more broadly? (Ruch, 1994)
    2. What longer term or broader goals are implicated?
  3. Identify Relevant Desiderata
    1. Availability
    2. Cost
    3. Immediacy
    4. Simplicity
    5. Transparency
    6. Fairness
    7. Corruptibility
  4. Brainstorm potential metrics
    1. What outcomes are important to capture?
    2. What data sources exist?
    3. What methods can be used to capture additional data?
    4. What measurements are easy to capture?
    5. What is the relationship between the measurements and the outcomes?
    6. What isn’t captured by the metrics?
  5. Consider and Plan
    1. Understand why and how the metric is useful. (Manheim, 2018)
    2. Consider how the metrics will be used to diagnose issues or incentify people. (Dai, Dietvorst, Tuckfield, Milkman, & Schweitzer, 2017)
    3. Plan how to use the metrics to develop the system, avoiding the “reward / punish” dichotomy. (Wigert & Harter, 2017)
    4. Perform a pre-mortem (Klein, 2007)
  6. Plan to routinely revisit the metrics (Atkins, Wanick, & Wills, 2017)

Reminds me of https://astralcodexten.substack.com/p/ivermectin-much-more-than-you-wanted, where ivermectin studies likely measured worm infections, not Covid prevention.

I wish I had a stronger strong upvote I could give this post. I was already nodding my head by the time I was done with the introduction, and then almost every subsequent section gave me something to be excited about. I will try to say some more substantive things later, but I wanted to say this first because I often don't get around to commenting.

Here's a structural explanation for statistical improprieties in scientific research:

  • For publications, scientists are incentivized to overinvest in precision. They aren't able to get a publication out of just their informal observations and qualitative data.
  • Yet scientists are also underincentivized to invest in definitive answers. That doesn't mean there's no incentive, just that the incentives aren't adequate.

"Do precise statistics, but it's not that important to conclusively resolve the question at hand" sounds like a recipe for wasted effort. Once you realize that your research isn't definitive, don't know how to make it definitive, and that nobody's expecting you to achieve this anyway, it's not that far to go to start neglecting the statistics as well.

A better recipe would be "gather the data that gives a definitive answer to your real question, and demonstrate that your conclusions are sound using whatever statistics or demonstrations make the most sense."

We can also think about a funnel model. You should start with broad and informal (cheap) qualitative observation, and then progressively narrow down to formal and precise quantitative data as you figure out what the few key most important variables are.

Great post!

This also reminds me of Tal Yarkoni's paper on what he calls the generalizability crisis in psychology. That's the fact that psychological experiments measure a very specific thing that's treated as corresponding to a more general thing. Psychologists think that the specific thing measures the more general thing, and Yarkoni argues that they're not measuring what they think they're measuring.

One of his examples is about the study of verbal overshadowing. This is a claimed phenomenon where if you have to verbally describe what a face looks like, you will be worse off at actually recognizing that face later on. The hypothesis is that producing the verbal description causes you to remember the verbal description, while remembering the actual face less well - but the verbal description inevitably contains less detail. This has been generalized to the broader claim of "producing verbal descriptions of experiences impairs our later recollection of them".

Yarkoni discusses an effort to replicate one of the original experiments:

Alogna and colleagues (2014) conducted a large-scale “registered replication report” (RRR; Simons, Holcombe, & Spellman, 2014) involving 31 sites and over 2,000 participants. The study sought to replicate an influential experiment by Schooler and Engstler-Schooler (1990) in which the original authors showed that participants who were asked to verbally describe the appearance of a perpetrator caught committing a crime on video showed poorer recognition of the perpetrator following a delay than did participants assigned to a control task (naming as many countries and capitals as they could). Schooler & Engstler-Schooler (1990) dubbed this the verbal overshadowing effect. In both the original and replication experiments, only a single video, containing a single perpetrator, was presented at encoding, and only a single set of foil items was used at test. Alogna et al. successfully replicated the original result in one of two tested conditions, and concluded that their findings revealed “a robust verbal overshadowing effect” in that condition.

Let us assume for the sake of argument that there is a genuine and robust causal relationship between the manipulation and outcome employed in the Alogna et al study. I submit that there would still be essentially no support for the authors’ assertion that they found a “robust” verbal overshadowing effect, because the experimental design and statistical model used in the study simply cannot support such a generalization. The strict conclusion we are entitled to draw, given the limitations of the experimental design inherited from Schooler and Engstler-Schooler (1990), is that there is at least one particular video containing one particular face that, when followed by one particular lineup of faces, is more difficult for participants to identify if they previously verbally described the appearance of the target face than if they were asked to name countries and capitals. [...]

On any reasonable interpretation of the construct of verbal overshadowing, the corresponding universe of intended generalization should clearly also include most of the operationalizations that would result from randomly sampling various combinations of these factors (e.g., one would expect it to still count as verbal overshadowing if Alogna et al. had used live actors to enact the crime scene, instead of showing a video). Once we accept this assumption, however, the critical question researchers should immediately ask themselves is: are there other psychological processes besides verbal overshadowing that could plausibly be influenced by random variation in any of these uninteresting factors, independently of the hypothesized psychological processes of interest? A moment or two of consideration should suffice to convince one that the answer is a resounding yes. It is not hard to think of dozens of explanations unrelated to verbal overshadowing that could explain the causal effect of a given manipulation on a given outcome in any single operationalization.

This verbal overshadowing example is by no means unusual. The same concerns apply equally to the broader psychology literature containing tens or hundreds of thousands of studies that routinely adopt similar practices. In most of psychology, it is standard operating procedure for researchers employing just one experimental task, between-subject manipulation, experimenters, testing room, research site, etc., to behave as though an extremely narrow operationalization is an acceptable proxy for a much broader universe of admissible observations. It is instructive—and somewhat fascinating from a sociological perspective—to observe that while no psychometrician worth their salt would ever recommend a default strategy of measuring complex psychological constructs using a single unvalidated item, the majority of psychology studies do precisely that with respect to multiple key design factors. The modal approach is to stop at a perfunctory demonstration of face validity—that is, to conclude that if a particular operationalization seems like it has something to do with the construct of interest, then it is an acceptable stand-in for that construct. Any measurement-level findings are then uncritically generalized to the construct level, leading researchers to conclude that they’ve learned something useful about broader phenomena like verbal overshadowing, working memory, ego depletion, etc., when in fact such sweeping generalizations typically obtain little support from the reported empirical studies.


Interesting. This resonates, and yet maybe stands in tension, with complaints that social psychology fails to do enough exact replications. I remember that a criticism of social psychology was that researchers would test a generalization like priming in too many different ways, and people were suspicious about whether or not any of the effects would stand up to replication.

I’d love to see a description of what this field should be doing. There’s a sweet spot between too much weight on one experimental approach, and too little exact replication. How does a field identify that sweet spot, and how can it coordinate to carry out experiments in the sweet spot?

Yeah, I was thinking this same thing. I feel like in social sciences I’m more concerned about researchers testing for too many things and increasing the probability of false positives than testing too few things and maybe not fully understanding a result.

I feel like it really comes down to how powerful a study is. When you have tons of data like a big tech company might, or the results are really straightforward, like in some of the hard sciences, I think this is a great approach. When the effects of a treatment are subtler and sample size is more limited, as is often the case in the social sciences, I would be wary to recommend testing everything you can think of.

I feel like in social sciences I’m more concerned about researchers testing for too many things and increasing the probability of false positives than testing too few things and maybe not fully understanding a result.

I'd say that's more a problem of selective reporting.

I hope that Metaculus will introduce the option of having questions that have multiple outcomes that one can predict in correlated ways. Perhaps by setting up a latent variable model.

They have a Patreon where they receive money and suggestions. I signed up to give them money there and suggested they introduce support for latent variable models. They seemed interested. I don't know what's going on behind the scenes, but I assume if anyone else also thinks it's worthwhile, it would help them if you gave them a bunch of money and asked them to use it for latent variable models.

This post didn't feel particularly important when I first read it.

Yet I notice that I've been acting on the post's advice since reading it. E.g. being more optimistic about drug companies that measure a wide variety of biomarkers.

I wasn't consciously doing that because I updated due to the post. I'm unsure to what extent the post changed me via subconscious influence, versus deriving the ideas independently.

Will a neural net trained to navigate to a coin at the end of a level still go to the coin if it’s no longer at the end of a level?

This seems in line with your position, but I want to reply so people won't conclude "And coinrun experiments don't tell you important things." I think the more interesting question for that experiment is "how will the agent generalize? Can we predict it in advance? In what ways do we systematically mispredict, and why?" 

(And, roughly speaking, there are a range of possible algorithms by which the agent can generalize, so it's way more than one bit. I got way more out of that paper by asking the authors for hundreds of videos for different training settings. Probably will mention the results in a post, soon.)

First Law of Experiment Design: you are not measuring what you think you are measuring.

In a way, this is just a corollary to the good, old-fashioned "correlation does not imply causation" principle. I guess the important difference is that this is a warning directed at experimenters who tend to assume that they know better.

Second Law of Experiment Design: if you measure enough different stuff, you might figure out what you’re actually measuring.

Before COVID, my manager's manager would semi-regularly put on a multi-day training course where he would teach everyone in his organization about the process of "Design of Experiments" (DOE) (https://en.wikipedia.org/wiki/Design_of_experiments). I work with other data scientists, statisticians, biochemists, and engineers, often performing experiments with medical devices, so this is relevant to all of us.

The idea behind a DOE is to create a multifactorial experimental design that tests the effects of many factors at once. The first step is to brainstorm all possible factors that may influence the experimental results (often using a fishbone diagram [https://en.m.wikipedia.org/wiki/Ishikawa_diagram] to focus on factors arising from Materials, Method, Measurement, Machine, Man, or "Mother Nature"). Each factor is then labelled as either control (C, factors fixed throughout all parts of the experiment), noise (N, random factors that cannot/won't be controlled for but may cause some variation in the results), or experimental (X, factors to be varied systematically to test for their effects).

Next, high and low settings are chosen for each X factor, and all possible combinations of settings are arranged in a hypercube. Instead of experimenting on one factor at a time with enough repetitions to build up statistical significance, you can perform just a few repetitions at each corner of the hypercube. You can still use the results to look at single-factor contributions to the experimental effect by aggregating data from all corners of the low side and high side of the hypercube along one dimension. But you can also look at the impact of factor interactions and the nonlinearities that result, which would have acted as hidden confounders in more traditional single-factor experiments.
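To make the hypercube idea concrete, here is a minimal sketch of a two-level full factorial design in Python. Everything specific in it is an assumption made for illustration: the factor names, the `simulated_response` function, and its coefficients are invented and stand in for a real measurement. The main effect of a factor is estimated as the mean response on its high side minus the mean on its low side, and a two-factor interaction is estimated the same way on the product of the two settings.

```python
import itertools
import random
import statistics

def full_factorial(factors):
    """Return every corner of the hypercube as a tuple of -1 (low) / +1 (high)."""
    return list(itertools.product((-1, +1), repeat=len(factors)))

def simulated_response(x, rng):
    # Hypothetical system, invented for illustration only:
    # temp has a direct effect, speed matters only through its
    # interaction with temp, and pressure has a small direct effect.
    temp, pressure, speed = x
    return 10 + 2.0 * temp + 0.5 * pressure + 1.5 * temp * speed + rng.gauss(0, 0.1)

def main_effect(runs, i):
    """Mean response on the high side of factor i minus mean on the low side."""
    high = [y for x, y in runs if x[i] == +1]
    low = [y for x, y in runs if x[i] == -1]
    return statistics.mean(high) - statistics.mean(low)

def interaction_effect(runs, i, j):
    """Same contrast, taken on the product of the settings of factors i and j."""
    same = [y for x, y in runs if x[i] * x[j] == +1]
    diff = [y for x, y in runs if x[i] * x[j] == -1]
    return statistics.mean(same) - statistics.mean(diff)

factors = ["temp", "pressure", "speed"]  # assumed names
rng = random.Random(0)
REPS = 3  # a few repetitions per corner

runs = [(x, simulated_response(x, rng))
        for x in full_factorial(factors)
        for _ in range(REPS)]

effects = {name: main_effect(runs, i) for i, name in enumerate(factors)}
temp_speed = interaction_effect(runs, 0, 2)

for name, value in effects.items():
    print(f"main effect of {name}: {value:+.2f}")
print(f"temp x speed interaction: {temp_speed:+.2f}")
```

In this made-up system the main effect of `speed` comes out near zero even though `speed` clearly matters, because it acts only through the interaction with `temp`. A one-factor-at-a-time experiment on `speed` would find a positive or negative effect depending on where `temp` happened to be fixed, and would have no way to notice that dependence.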

I liked how this method gave a systematic way of thinking about multifactorial experimental effects. Such experiments tend to uncover a lot more information about a system than you would see otherwise. To actually tease out the underlying causal mechanisms at work, though, would require deeper statistics and modeling than we ever got into in that course.

Next, high and low settings are chosen for each X factor, and all possible combinations of settings are arranged in a hypercube. Instead of experimenting on one factor at a time with enough repetitions to build up statistical significance, you can perform just a few repetitions at each corner of the hypercube.

This concept reminds me of the problem of planning software tests: I want to exercise all behaviors of the code under test, but actually testing the cartesian product of input conditions often means writing a test that is so generic it duplicates the code under test (unless there is a more naïve algorithm that the test can use), and is hard to evaluate for its own correctness. Instead, I end up writing a selected set of cases intended to cover interesting combinations of inputs — but then the problem is thinking of which inputs are worth testing. When bugs are discovered, they may be combinations of inputs that were not thought of (or they may be parameters we didn't think of testing, i.e. implicitly put in the “control” category, or specific edge-case values of parameters we did test).

An alternative to hand-written testing of specific cases is to write a property test, like “is input A + input B always ≤ output C, under a wide-ranging selection of inputs”. This feels analogous to measuring correlations in that hypercube — and the part of the actual output that you're not checking precisely (in my example, the value A + B − C) is the part of the test that is “noise” rather than “control” because we've decided it is more practical to ignore that information than to control it (write a test that contains or computes the exact answer to expect).
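A minimal sketch of such a property test, using only the standard library. The function `padded_total` is a hypothetical stand-in for the code under test (it rounds `a + b` up to a multiple of 8); the input ranges and trial count are also arbitrary choices. The test asserts only the inequality and deliberately ignores the exact slack `c - (a + b)`.

```python
import random

def padded_total(a: int, b: int, align: int = 8) -> int:
    """Hypothetical code under test: a + b rounded up to a multiple of `align`."""
    total = a + b
    return -(-total // align) * align  # ceiling division, then scale back up

def check_property(trials: int = 10_000, seed: int = 0) -> int:
    """Check that a + b <= padded_total(a, b) over many random inputs."""
    rng = random.Random(seed)
    for _ in range(trials):
        a = rng.randint(0, 10**6)
        b = rng.randint(0, 10**6)
        c = padded_total(a, b)
        # Only the inequality is checked; the exact slack c - (a + b)
        # is the part we chose not to control.
        assert a + b <= c, f"property violated for a={a}, b={b}: c={c}"
    return trials

checked = check_property()
print(f"property held for {checked} random input pairs")
```

A dedicated property-testing library such as Hypothesis would also generate edge cases and shrink failing inputs; the hand-rolled loop here just keeps the sketch self-contained.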

When doing quantified self-experiments, people sometimes argue for the virtues of blinding. In quantified self-experiments, blinding means measuring less. It means not exposing yourself to data about reality. "Measure Lots of Things" frequently means engaging in less blinding. 

Modern work on causality (e.g. Pearl) largely solves that problem - if we measure enough stuff.

It seems like Pearl's work is so hard to put into practice that I didn't really get good examples when I asked for them on LessWrong. 

If you think Pearl's work is more directly useful, why do you think I didn't get more practical examples of LessWrongers who used it?

As is often the case with these things, I think a lot of people already intuitively use the sort of moves suggested by Pearl-style causality. In particular, looking for mediators is usually the central move in reconstructing causal structure, and that's something which lots of scientists already have strong instincts for. Also, the general mental model of causality as a DAG helps defuse a lot of stupid arguments about polycausality and everything-interacting-with-everything - again, people with good research taste tend to already know those arguments are stupid, but don't necessarily have the language to explain how/why.

I think the lesson a lot of people took away from Pearl is (roughly speaking) that you should look for colliders between independent variables, since colliders are the one primitive causal structure that looks distinct (whereas forward mediators, backward mediators, and confounders look the same from a graphical independence perspective).
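The claim that colliders look distinct can be checked with a small simulation. This is a sketch under assumptions of my own choosing: linear relationships, unit-variance Gaussian noise, and partial correlation as the stand-in for a conditional independence test. For each three-variable structure it reports the correlation between X and Y, and the partial correlation between X and Y given Z.

```python
import math
import random

def corr(u, v):
    """Pearson correlation of two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)

def partial_corr(x, y, z):
    """Correlation of x and y after linearly controlling for z."""
    rxy, rxz, ryz = corr(x, y), corr(x, z), corr(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

def simulate(structure, n=20_000, seed=0):
    """Draw n samples of (X, Y, Z) from a linear Gaussian model."""
    rng = random.Random(seed)
    xs, ys, zs = [], [], []
    for _ in range(n):
        e1, e2, e3 = rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)
        if structure == "chain":            # X -> Z -> Y
            x = e1; z = x + e2; y = z + e3
        elif structure == "reverse_chain":  # X <- Z <- Y
            y = e1; z = y + e2; x = z + e3
        elif structure == "fork":           # X <- Z -> Y
            z = e1; x = z + e2; y = z + e3
        elif structure == "collider":       # X -> Z <- Y
            x = e1; y = e2; z = x + y + e3
        else:
            raise ValueError(f"unknown structure: {structure}")
        xs.append(x); ys.append(y); zs.append(z)
    return xs, ys, zs

results = {}
for s in ("chain", "reverse_chain", "fork", "collider"):
    x, y, z = simulate(s)
    results[s] = (corr(x, y), partial_corr(x, y, z))
    print(f"{s:14s} corr(X,Y)={results[s][0]:+.2f}  corr(X,Y|Z)={results[s][1]:+.2f}")
```

Under these assumptions the two chains and the fork all show the same pattern (X and Y correlated, and uncorrelated once Z is controlled for), while the collider shows the reverse (X and Y uncorrelated, but correlated once Z is controlled for). That is the sense in which only the collider is identifiable from independence relations alone.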

My impression is that looking for colliders doesn't really work for most practical causal inference. So this then translates to Pearl's approaches not working in practice.

Except I think Pearl has become (always was?) pretty skeptical about inferring causal structure from this sort of data (even though it's "theoretically" possible), so it might just be a miscommunication along the way. In the rationalist community, this miscommunication probably originates mainly from Eliezer's endorsement of searching for colliders as one of the main forms of causal inference.

Yeah, I think the mistake there is mostly one of emphasis. In terms of narrowing-down-model-space, most of the work (i.e. most of the model-elimination) is in reconstructing the undirected graphical structure; figuring out which direction each arrow goes is relatively easy after that. (You could view this as a consequence of Science In A High Dimensional World: the hard step is figuring out which handful of variables are directly relevant, and after that it's relatively easy to experiment with those variables.)

Also, Pearl himself was operating in a statistical framework closer to 20th century frequentism than modern Bayesian - e.g. his algorithms in Causality were designed to use an independence oracle rather than Bayesian model comparison. It turns out that model testing in high-dimensional systems is one of the places where the advantage of modern Bayesianism is largest.

In terms of narrowing-down-model-space, most of the work (i.e. most of the model-elimination) is in reconstructing the undirected graphical structure; figuring out which direction each arrow goes is relatively easy after that.

This depends a lot on the field, no? If it's too expensive or unethical to intervene, or one plain and simply doesn't understand the variables well enough to intervene, then figuring out the direction of the causal arrows can be difficult.

It turns out that model testing in high-dimensional systems is one of the places where the advantage of modern Bayesianism is largest.

Interesting, do you have any additional resources on this?

The Rationality Cheater's Move is to have more information. Really beats analysis and reason in many cases.

Hmmmm.

So when I read this post I initially thought it was good. But on second thought I don't think I actually get that much from it. If I had to summarise it, I'd say

  • a few interesting anecdotes about experiments where measurement was misleading or difficult
  • some general talk about "low bit experiments" and how hard it is to control for confounders

The most interesting claim I found was the second law of experiment design. To quote: "The Second Law of Experiment Design: if you measure enough different stuff, you might figure out what you’re actually measuring." But even here I didn't get much clarity or new info. The argument seemed to boil down to "If you measure more things, you may find the actual underlying important variable", which is true I guess but doesn't seem particularly novel and also introduces other risks, e.g. the more variables you measure, the higher the chance that at least some of them will correlate just due to chance. There's a pointer to a book which the author claims sheds more light on the topic and on modern statistical methods around experiment design more generally, but that's it.
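The size of that risk is easy to put a number on. A rough sketch, assuming the tests are independent and every null hypothesis is true (both are simplifications): with a per-test threshold of alpha, the chance of at least one spurious "significant" result among k tests is 1 - (1 - alpha)^k. The Bonferroni threshold shown alongside is one standard, conservative correction; it is not something the post itself proposes.

```python
def chance_of_spurious_hit(num_tests: int, alpha: float = 0.05) -> float:
    """P(at least one false positive) across independent tests of true nulls."""
    return 1 - (1 - alpha) ** num_tests

def bonferroni_threshold(num_tests: int, alpha: float = 0.05) -> float:
    """Per-test threshold that keeps the overall false-positive rate <= alpha."""
    return alpha / num_tests

for k in (1, 5, 20, 100):
    print(f"{k:3d} tests: P(spurious hit) = {chance_of_spurious_hit(k):.2f}, "
          f"Bonferroni threshold = {bonferroni_threshold(k):.4f}")
```

So measuring 20 unrelated variables and reporting whichever clears p < 0.05 gives roughly a two-in-three chance of finding "something" in pure noise.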

I think I also have a broader problem here, namely that the article feels a bit fuzzy in a way that makes it hard to pin down what the central claims are.

So yeah, I enjoyed it but on reflection I'm a bit less of a fan than I thought.

A single experiment, especially one with a small sample size (though even large samples can be ruined by experimental error, etc.), or even a small number of such experiments which confirm each other, just isn't going to give very reliable results for a complex system.

When things are tightly controlled in a lab for a simple system (with few variables), then basic statistical methods can yield believable results. This is why the replication problem is mostly in psychology, medicine, and social sciences (I presume biology & ecology should be on the list, but they aren't so consequential maybe, except for the biology that is already in the medical category). Those disciplines work with complex systems that we have little macro-level understanding of and that do not lend themselves to being controlled well in a lab. Whatever lab controls can be instituted really control very little because we are often talking about human persons who are complex biological systems themselves.

Of course the other problem is that the researchers themselves generally do not understand statistical theory. And that is a very big factor here generally. You basically have to be an actual statistician or very quantitatively/analytically skilled. There is statistical literature, for example, about the problems with p-values and statisticians trying to figure out what to do about that since nobody understands these things. 

All that being said, with many iterations of experiments addressing the same question, say, done by many different research teams (say in different places and at different times, to whatever appropriate level), then we can start to see real physical connections emerge. It might still be the case that some hidden confounding variable is the ultimate arbiter of whatever relationship is observed, but the observed relationship will still be real, just mediated by unknown variables. It would be better to know the precise causal chain, but it's still pretty cool to know a real relationship, even if it is only correlation and not causation.

It costs a lot of money to do so many repetitions. So for something that doesn't ultimately have much economic impact, there is no incentive to spend a lot of money to get some small bits of information to be cataloged away and rarely if ever accessed. Even if large sums of money are spent to answer some question about a complex system, the outcome might still be uncertainty about whether the relationship observed is real.

An interesting article. I would have really liked the author to complete the circle. A discussion of solutions to the problems would be highly complementary. 

we’d assumed that every user’s click-through was statistically independent, when in fact they were highly correlated, so many of the results which we thought were significant were in fact basically noise.

a) Does there exist an approach to model the non-independent click behavior of users?

b) A lot of progress in CTR prediction assumes independence of clicks. What is the likely benefit of the non-independence assumption of clicks, besides being closer to reality?

c) How can one use Pearl's techniques to model the non-independent behavior of user clicks?

Curated. I really like this post. The title doesn't do it justice – based on the title it sounds like the focus is going to be describing a problem, but in fact you get the problem and then a bunch of proposed solutions. I've long thought myself to be pretty good at measuring things (having read "How To Measure Anything And All") but I'm now excited to add these Laws to my experiment design going forward, and put more effort into seeing whether I can, in fact, maybe measure what I hope to. Cheers

This is fantastic. Immensely practical, immediately allowing for a sharpening of the mind. Very Art of Rationality.

And excellently written.

Thank you.

I was going to complain that the language quoted from the abstract in the frog paper is sufficiently couched that it's not clear the researchers thought they were measuring anything at all. Saying that X "suggests" Y "may be explained, at least partially" by Z seems reasonable to me (as you said, they had at least not ruled out that Z causes Y). Then I clicked through the link and saw the title of the paper making the unambiguous assertion that Z influences Y.

Should this be named... the Wentworth's Law of Measurement?

I'd prefer the S'wentworth Law of Measurement

Hah, I was just guessing at the real name.

One of the key statements made in this post is that measuring more stuff is better than measuring less stuff. Have your beliefs on that updated at all since the original post? What evidence would cause you to become more certain or less certain of this claim?

Basically, we have an almost religious reverence for high-powered decent-effect-size low-p-value statistical evidence, and we fail to notice when these experiments acquire their Bayes factors because they measure something incredibly narrow that is therefore unlikely to generalise to whatever gears-level models you're entertaining.

It has the same problem as deferring to experts. The communication suffers a bandwidth problem[1], and we frequently delude ourselves into believing we have adopted their models as long as we copy their probabilities on a very slim sample of queries they've answered.

  1. I know of no good write-up of the bandwidth problem in social epistemology but Owen Cotton-Barratt talks about it here (my comment) and Dennett refers to it as the "Daddy Is a Doctor" phenomenon.

To the contrary, johnswentworth's point is not that the experiments have low external validity but that they have low internal validity. It's that there are confounds.

Ironically, one of my quibbles with the post is that the verbiage implies measurement error is the problem. Not measuring what you think you're measuring is about content validity, but the post is actually about how omitted variables (i.e., confounders) are a problem for inferences. "You are not Complaining About What You Think You Are Complaining About."

The measurement of lots of other things leads to the pathology of data mining rather than trying to find the correct causative variable.  The better experimental technique is to sequentially investigate each confounding variable and try to ensure they are eliminated.  Sometimes this can be hard, but that is no excuse not to do properly CONTROLLED experiments rather than reporting noise.

Data mining is so problematic that medical journals have insisted that the experiment hypothesis is defined in advance so that unexpected variables with significant p-values are not reported instead (p-hacking).  

I would far rather experimenters do the 1-bit experiment and then if the result doesn't falsify the hypothesis, think about other explanations for the result and check those variables in the same way.  Good experimentation is not for the lazy.

Sometimes this can be hard, but that is no excuse not to do properly CONTROLLED experiments rather than reporting noise.

Sometimes, people accept the hard work of doing properly CONTROLLED experiments. Then they not only FAIL COMPLETELY, they think they succeeded - and everybody else is tricked into believing it too. That's what happened in John's A/B test.

For unsolved problems, you can only find the correct causative variable with a "firehose of information." Then you can go on to prove you're right via a properly controlled experiment.

A classic unreplicable p-hacked study is the one where they found relationships between a discount on an Italian buffet and what the diners ate.

If I were one of those researchers, I wouldn't mind gathering the data. That's the firehose. But I wouldn't want to publish it. Instead, I'd scour the data, combine it with my qualitative observations of the event, and see if we could come up with a specific pre-registerable causal hypothesis we could actually believe in for a follow-up study. Only the follow-up would be published.

"For unsolved problems, you can only find the correct causative variable with a "firehose of information." Then you can go on to prove you're right via a properly controlled experiment."

 

That second part often doesn't happen. For [bio]medical experiments it is just too expensive. Data mining ensues and any variables with significant p-values are then published. The medical journals are rife with this, which is one reason 30-50% of medical research proves unrepeatable.

Never underestimate human nature to do the easiest thing rather than the correct one.  Science can be painstakingly hard to get right, but the pressures to publish are high.  I've seen it first hand in biotech, where the obvious questions to ask of the "result" were ignored.  

I also am in biotech, and I agree these problems exist.

One way of making use of the "firehose of information" in biotech would be to insist that researchers publish their raw datasets, and provide additional supplementary information along with their paper. Imagine, for example, if researchers doing animal work were required to film themselves doing it and post the videos online for others to review. I think it's easy to see how that would be a helpful "firehose of information" and would do a lot to flesh out the picture given by the normally reported figures in a publication.

I think you're worried about people switching from hard analysis to squishier qualitative data, perhaps because resources are already so constrained that it feels like "one or the other." I think John's saying "why not both?"