Eight years ago, I worked as a data scientist at a startup, and we wanted to optimize our sign-up flow. We A/B tested lots of different changes, and occasionally found something which would boost (or reduce) click-through rates by 10% or so.
Then one week I was puzzling over a discrepancy in the variance of our daily signups. Eventually I scraped some data from the log files, and found that during traffic spikes, our server latency shot up to multiple seconds. The effect on signups during these spikes was massive: even just 300 ms was enough that click-through dropped by 30%, and when latency went up to seconds the click-through rates dropped by over 80%. And this happened multiple times per day. Latency was far and away the most important factor which determined our click-through rates. [1]
Going back through some of our earlier experiments, it was clear in hindsight that some of our biggest effect-sizes actually came from changing latency - for instance, if we changed the order of two screens, then there’d be an extra screen before the user hit the one with high latency, so the latency would be better hidden. Our original interpretations of those experiments - e.g. that the user cared more about the content of one screen than another - were totally wrong. It was also clear in hindsight that our statistics on all the earlier experiments were bunk - we’d assumed that every user’s click-through was statistically independent, when in fact they were highly correlated, so many of the results which we thought were significant were in fact basically noise.
Main point of this example: we were not measuring what we thought we were measuring. We thought we were testing hypotheses about what information the user cared about, or what order things needed to be presented in, or whether users would be more likely to click on a bigger and shinier button. But in fact, we were mostly measuring latency.
When I look back on experiments I’ve run over the years, in hindsight the very large majority of cases are like the server latency example. The large majority of the time, experiments did not measure what I thought they were measuring. I’ll call this the First Law of Experiment Design: you are not measuring what you think you are measuring.
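To make the independence problem from the latency story concrete, here is a minimal simulation sketch. Everything in it is an assumption for illustration: the click rates, the spike probability, and especially the setup where each arm runs in its own time windows, so that latency spikes are not shared between arms. It runs A/A tests (no real difference between arms) and counts how often a naive two-proportion z-test, which treats every user as independent, declares significance anyway.

```python
import math
import random

def simulate_arm(n_periods, users_per_period, base_rate, spike_prob, spike_rate, rng):
    """Simulate one arm. All users in a time window share that window's latency state."""
    clicks = 0
    for _ in range(n_periods):
        # A latency spike drags down the click rate for every user in the window.
        rate = spike_rate if rng.random() < spike_prob else base_rate
        clicks += sum(rng.random() < rate for _ in range(users_per_period))
    return clicks, n_periods * users_per_period

def naive_z(clicks_a, n_a, clicks_b, n_b):
    """Two-proportion z-statistic that (wrongly, here) assumes independent users."""
    pooled = (clicks_a + clicks_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (clicks_a / n_a - clicks_b / n_b) / se if se > 0 else 0.0

def false_positive_rate(spike_prob, n_trials=300, seed=0):
    """Fraction of A/A tests flagged 'significant' at the usual |z| > 1.96 cutoff."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        # Made-up numbers: 10% baseline click-through, 2% during a spike.
        a = simulate_arm(20, 100, 0.10, spike_prob, 0.02, rng)
        b = simulate_arm(20, 100, 0.10, spike_prob, 0.02, rng)
        if abs(naive_z(*a, *b)) > 1.96:
            hits += 1
    return hits / n_trials

independent_fpr = false_positive_rate(spike_prob=0.0)
spiky_fpr = false_positive_rate(spike_prob=0.3)
print(f"false positives with independent users: {independent_fpr:.2f}")
print(f"false positives with shared latency spikes: {spiky_fpr:.2f}")
```

With no spikes the false positive rate should land near the nominal 5%; with shared spikes it should come out several times higher, even though the two arms are identical by construction.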
Against One-Bit Experiments
A one-bit experiment is an experiment designed to answer a yes/no question. It’s the prototypical case from high school statistics: which of two mouse diets results in lower bodyweight? Which of two button designs on a website results in higher click-through rates? Does a new vaccine design protect against COVID better than an old design (or better than no vaccine at all)? Can Muriel Bristol tell whether milk or tea was added first to her teacup? Will a neural net trained to navigate to a coin at the end of a level still go to the coin if it’s no longer at the end of a level? Can a rat navigate a maze just by smell?
There’s an obvious criticism of such experiments: at best, they yield one bit of information. (Of course the experimenter probably observes a lot more than one bit of information over the course of the experiment, but usually people are trained to ignore most of that useful information and just report a p-value on the original yes/no question.) The First Law of Experiment Design implies that the situation is much worse: in the large majority of cases, a one-bit experiment yields approximately zero information about the thing the experimenter intended to measure. It inevitably turns out that mouse bodyweight, or Muriel Bristol’s tea-tasting, or a neural net’s coinrun performance, in fact routes through something entirely different from what we expected.
Corollary To The First Law: If You Are Definitely Not Measuring Anything Besides What You Think You Are Measuring, You Are Probably Not Measuring Anything
Ok, but aren’t there experiments where we in fact understand what’s going on well enough that we can actually measure what we thought we were measuring? Like the vaccine test, or maybe those experiments from physics lab back in college?
Yes. And in those cases, we usually have a pretty damn good idea of what the experiment’s outcome will be. When we understand what’s going on well enough to actually measure the thing we intended to measure, we usually also understand what’s going on well enough to predict the result. And if we already know the result, then we gain zero information - in a Bayesian sense, we measure nothing.
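One way to put a number on "we gain zero information" is the entropy of the predicted outcome. This is a sketch under the assumption of a noiseless yes/no experiment, where the expected information gained is at most the binary entropy of our prior; the function name is mine, not anything standard.

```python
import math

def binary_entropy_bits(p):
    """Entropy in bits of a yes/no outcome whose prior probability of 'yes' is p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0  # a certain outcome carries no information
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

# The more confident we already are in the result, the less the experiment can tell us.
for p in (0.5, 0.9, 0.99, 0.999):
    print(f"prior P(yes) = {p}: at most {binary_entropy_bits(p):.3f} bits")
```

A coin-flip prior gives the full one bit; by the time we are 99% sure of the result, the experiment is worth less than a tenth of a bit.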
Take the physics lab example: in physics lab classes, we know what the result “should” be, and if we get some other result then we messed up the experiment. In other words: either we know what the result is (and therefore gain zero information), or we accidentally measure something other than what we intended. (Well… I say “accidentally”, but my college did have a physics professor who would loosen the screws on the pendulum in the freshman physics lab.) Either way, we’re definitely not measuring the thing we intended to measure - either we measure something else, or we measure nothing at all.
… though I suppose one could argue that the physics lab experiment result tells us whether or not we’ve messed up the experiment. In other words, we can test whether we’re measuring the thing we thought we were measuring. So if we know the First Law of Experiment Design, then at least we can measure whether or not the corollary applies to the case at hand.
Anyway, for the rest of this post I’ll assume we’re in a domain where we don’t already know what the answer is (or “should” be).
Solution: Measure Lots of Things
In statistics jargon, the problem is confounders. We never measure what we think we are measuring because there are always confounders, all the time. We can’t control for the confounders because in practice we never know what they are, or which potential confounders actually matter, or which confounders are upstream vs downstream. Classical statistics has lots to say about significance and experiment size and so forth, but when we don’t even know what the confounders are there’s not much to be done.
… or at least that used to be the case. Modern work on causality (e.g. Pearl) largely solves that problem - if we measure enough stuff. One of the key insights of causality is that, while we can’t determine causation from correlation of two variables, we can sometimes determine causation from correlation of three or more variables - and the more variables, the better we can nail down causality. Similarly, if we measure enough stuff, we can often back out any latent variables and figure out how they causally link up to everything else. In other words, we can often deal with confounders if we measure enough stuff.
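Here is a toy illustration of why a third variable helps. It is not one of Pearl's algorithms, just a linear-Gaussian sketch with made-up coefficients: two different causal structures over X, Y, Z are simulated, and the pattern of correlation versus partial correlation (X and Y with Z held fixed) tells them apart, which no amount of staring at X and Y alone could do.

```python
import math
import random

def corr(xs, ys):
    """Pearson correlation of two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def partial_corr(xs, ys, zs):
    """Correlation of X and Y after linearly controlling for Z."""
    rxy, rxz, ryz = corr(xs, ys), corr(xs, zs), corr(ys, zs)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

rng = random.Random(0)
n = 20_000

# Structure 1, a chain: X -> Z -> Y. Z mediates everything X does to Y.
x = [rng.gauss(0, 1) for _ in range(n)]
z = [xi + rng.gauss(0, 1) for xi in x]
y = [zi + rng.gauss(0, 1) for zi in z]
chain = (corr(x, y), partial_corr(x, y, z))

# Structure 2, a collider: X -> Z <- Y. X and Y are independent causes of Z.
x2 = [rng.gauss(0, 1) for _ in range(n)]
y2 = [rng.gauss(0, 1) for _ in range(n)]
z2 = [a + b + rng.gauss(0, 1) for a, b in zip(x2, y2)]
collider = (corr(x2, y2), partial_corr(x2, y2, z2))

print(f"chain:    corr(X,Y) = {chain[0]:+.2f}, given Z = {chain[1]:+.2f}")
print(f"collider: corr(X,Y) = {collider[0]:+.2f}, given Z = {collider[1]:+.2f}")
```

In the chain, X and Y are correlated until we control for Z, at which point the correlation should vanish. In the collider it goes the other way: X and Y look unrelated until we control for Z.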
That’s the theoretical basis for what I’ll call The Second Law of Experiment Design: if you measure enough different stuff, you might figure out what you’re actually measuring.
Feynman’s story about rat-mazes is a good example here:
He had a long corridor with doors all along one side where the rats came in, and doors along the other side where the food was. He wanted to see if he could train the rats to go in at the third door down from wherever he started them off. No. The rats went immediately to the door where the food had been the time before.
The question was, how did the rats know, because the corridor was so beautifully built and so uniform, that this was the same door as before? Obviously there was something about the door that was different from the other doors. So he painted the doors very carefully, arranging the textures on the faces of the doors exactly the same. Still the rats could tell. Then he thought maybe the rats were smelling the food, so he used chemicals to change the smell after each run. Still the rats could tell. Then he realized the rats might be able to tell by seeing the lights and the arrangement in the laboratory like any commonsense person. So he covered the corridor, and, still the rats could tell.
He finally found that they could tell by the way the floor sounded when they ran over it. And he could only fix that by putting his corridor in sand.
Measure enough different stuff, and sometimes we can figure out what’s actually going on.
The biggest problem with one-bit experiments (or low-bit experiments more generally) is that we’re not measuring what we think we’re measuring, and we’re not measuring enough stuff to figure out what’s actually going on. When designing experiments, we want a firehose of bits, not just yes/no. Watching something through a microscope yields an enormous number of bits. Looking through server logs yields an enormous number of bits. That’s the sort of thing we want - a firehose of information.
Measurement Devices
What predictions might we make, from the two Laws of Experiment Design?
Here’s one: new measurement devices or applications of measurement devices, especially high-bit measurement devices, are much more likely than individual experiments to be bottlenecks to the progress of science. For instance, the microscope is more of a bottleneck than Jenner’s controlled trial of the first vaccine. Jenner’s experiment enabled only that one vaccine, and it was almost a century before anybody developed another. When the next vaccine came along, it came from Pasteur’s work watching bacteria under a microscope - and that method resulted in multiple vaccines in rapid succession, as well as “Pasteurization” as a method of disinfection.
We could make similar predictions for particle accelerators, high-throughput sequencing, electron microscopes, mass spectrometers, etc. In the context of AI/ML, we might predict that interpretability tools are a major bottleneck.
Betting Markets
For the same reasons that an experiment is usually not measuring what we think it’s measuring, a fully operationalized prediction is usually not predicting the thing we think it is predicting.
For instance, maybe what I really want to predict is something about qualitative shifts in political influence in Russia. I can operationalize that into a bunch of questions about Putin, the war in Ukraine, specific laws/policies, etc. Probably it will turn out that none of those questions actually measure the qualitative shift in political influence which I’m trying to get at. On the other hand, with a whole bunch of questions, I could maybe do some kind of principal component analysis and back out whatever main factors the questions do measure. For the same reasons that we can sometimes figure out what an experiment actually measures if we measure enough stuff, we can sometimes figure out what questions on a prediction market are actually asking about if we set up markets on enough different questions.
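As a rough sketch of the "principal component analysis" idea, here is a stdlib-only toy. The data are entirely synthetic: I assume six hypothetical markets whose daily price moves are driven by one hidden factor plus noise, with the last two markets unrelated to the factor. The top principal component of the correlation matrix is found by power iteration.

```python
import math
import random

def simulate_markets(n_days, loadings, noise_sd, rng):
    """Synthetic daily moves: each market = loading * hidden factor + noise."""
    columns = [[] for _ in loadings]
    for _ in range(n_days):
        factor = rng.gauss(0, 1)
        for col, loading in zip(columns, loadings):
            col.append(loading * factor + rng.gauss(0, noise_sd))
    return columns

def correlation_matrix(columns):
    """Correlation matrix of the given columns (one list per market)."""
    n = len(columns[0])
    standardized = []
    for col in columns:
        mean = sum(col) / n
        sd = math.sqrt(sum((v - mean) ** 2 for v in col) / n)
        standardized.append([(v - mean) / sd for v in col])
    return [[sum(a * b for a, b in zip(ci, cj)) / n for cj in standardized]
            for ci in standardized]

def top_component(matrix, iterations=200):
    """Leading eigenvalue and eigenvector of a symmetric matrix via power iteration."""
    k = len(matrix)
    vec = [1.0 / math.sqrt(k)] * k
    for _ in range(iterations):
        product = [sum(row[j] * vec[j] for j in range(k)) for row in matrix]
        norm = math.sqrt(sum(p * p for p in product))
        vec = [p / norm for p in product]
    eigenvalue = sum(vec[i] * sum(matrix[i][j] * vec[j] for j in range(k))
                     for i in range(k))
    return eigenvalue, vec

rng = random.Random(0)
loadings = [0.9, 0.8, 0.7, 0.6, 0.0, 0.0]  # made up; last two markets ignore the factor
markets = simulate_markets(2000, loadings, noise_sd=0.5, rng=rng)
eigenvalue, weights = top_component(correlation_matrix(markets))
explained = eigenvalue / len(loadings)  # trace of a correlation matrix = number of columns

print(f"top component explains {explained:.0%} of the variance")
print("weights per market:", [round(w, 2) for w in weights])
```

The weights show which questions actually move together: the four factor-driven markets should get substantial weight and the two unrelated ones close to zero. Figuring out what that shared factor means is still up to the person reading the weights.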
Reading Papers
Of course the Laws of Experiment Design also apply when reading the experiment designs and results of others.
As an example, here’s a recent abstract off biorxiv:
In this study, we examined whether there is repeatability in the activity levels of juvenile dyeing poison frogs (Dendrobates tinctorius). [...] We did not find individual behaviour to be repeatable, however, we detected repeatability in activity at the family level, suggesting that behavioural variation may be explained, at least partially, by genetic factors in addition to a common environment.
Just based on the abstract, I’m going to go out on a limb here and guess that this study did not, in fact, measure “genetic factors”. Probably they measured some other confounder, like e.g. family members grew up nearby each other. (Or maybe the whole result was noise + p-hacking, there’s always that possibility.)
Ok, time to look at the paper… well, the experiment size sure is suspiciously small, they used a grand total of 17 frogs and tested 4 separate behaviors. That sure does sound like a statistical nothingburger! On the other hand, the effect size was huge and their best p-value was p < 0.001, so maaaaaybe there’s something here? I’m skeptical, but let’s give the paper the benefit of the doubt on the statistics for now.
Did they actually measure genetic effects? Well, they sure didn’t rule out non-genetic effects. The “husbandry” section of the Methods actually has a whole spiel about how the father-frogs “exhibit an elaborate parental care behaviour” toward their tadpoles: “Recently-hatched tadpoles are transported on their father’s back from terrestrial clutches to water-filled plant structures at variable heights”. Boy, that sure does sound like a family of tadpoles growing up in a single environment which is potentially different from the environment of another family of tadpoles. The experimenters do talk about their efforts to control the exact environment in which they ran the tests themselves… but they don’t seem to have made much effort to control for variables impacting the young frogs before the test began. So, yeah, there’s ample room for non-genetic correlations between each family of tadpoles.
This is a pretty typical paper: the authors didn’t systematically control for confounders, and the experiment is sufficiently low-bit that we can’t tell what factors actually mediated the correlations between sibling frogs (assuming those correlations weren’t just noise in the first place). Probably the authors weren’t measuring what they thought they were measuring; certainly they didn’t rule out other things they might have been measuring.
Takeaways
Let’s recap the two laws of experiment design:
- First Law of Experiment Design: you are not measuring what you think you are measuring.
- Second Law of Experiment Design: if you measure enough different stuff, you might figure out what you’re actually measuring.
The two laws have a lot of consequences for designing and interpreting experiments. When designing experiments, assume that the experiment will not measure the thing you intend. Include lots of other measurements, to check as many other things as you can. If possible, use instruments which give a massive firehose of information, instruments which would let you notice a huge variety of things you might not have considered, like e.g. a microscope.
Similarly, when interpreting others’ experiments, assume that they were not measuring what they thought they were measuring. Ignore the claims and p-values in the abstract, go look at the graphs and images and data, cross-reference with other papers measuring other things, and try to put together enough different things to figure out what the experimenters actually measured.
[1] The numbers in the latency story are pulled out of my ass; I don’t remember what they actually were, other than that the latency effects were far larger than anything else we’d seen. Consider the story qualitatively true, but fictional in the quantitative details.
Great advice here. Right now, I'm validating a PCR protocol via agarose gel electrophoresis. Neither technique is new to me, but the equipment is, and I'm using borrowed DNA from one person and borrowed dye from another, using an agarose gel electrophoresis protocol that's not written for the same dye.
At first, I was running gels with just one DNA sample, basically getting one bit of information for the one question I had in mind.
I quickly realized that there were a ton of variables at play, with some unknown unknowns. Even though it took a little longer to add extra sample lanes to look at different questions, I could learn much more, much more quickly, by doing so. Even if it wasn't obvious why I'd want that data, making sure I'd have access to it when the gel was finished an hour later proved useful again and again. It is good to over-observe your experiment as much as you can get away with.
This also sounds a lot like Elmer Gates' approach to learning new things and coming up with inventions.
Linked review by our own NaiveTortoise.
Another example from the lab. My labmate is running a microfluidic system to make nanoparticles. She's trying to make them monodispersed (all the same size). What she'll publish is very precise measurements of the size distribution of the nanoparticles, which she'll obtain from the MasterSizer.
However, it's a 30 minute round trip walk to use it. And every MasterSizer reading is just a point in time. What she'll use to make the decision about whether or not to measure the nanoparticles is purely qualitative. She's got a microscope hooked up to a computer screen that's displaying the nanoparticles flowing through as they're created. She can see clearly that they're monodispersed, and can tell by sight what the effect is of changing the speed of the oil and aqueous phases, the angle of the syringes, or the ratio of the flow rates. Not to mention the effect of bumps, dust particles, and so on. It's a continuous, dynamic, high-data way to observe - a firehose of information.
The figure that will ultimately be published will capture only a small piece of that, and it will be a far less informative piece of information than simply standing there and watching the nanoparticles stream by - or an informal conversation with her. But that MasterSizer measurement is what may convince other researchers that she's had success, or that her techniques are worth trying. They'll have to set up their own firehose to really learn about those monodispersed nanoparticles. The paper is, in a way, just advertising, and maybe a third of the way to an instruction manual.
Well stated, good post!
You might enjoy my paper on this topic, where I present other approaches to measurement, especially in the presence of adversarial pressure, which you didn't really discuss. I also think that the idea of metric design, which I explored, gets into more detail about how I think you can practically do measurement better.
I think your 'Towards a coherent process for metric design' section alone is worth its weight in gold. Since most LW readers aren't going to click on your linked paper (click-through rates being as low in general as they are, from my experience in marketing analytics), let me quote that section wholesale:
Thanks!
(The Feynman story remains unconfirmed.)
Since found (by gwern).
Reminds me of https://astralcodexten.substack.com/p/ivermectin-much-more-than-you-wanted, where ivermectin studies likely measured worm infections, not Covid prevention.
Here's a structural explanation for statistical improprieties in scientific research:
"Do precise statistics, but it's not that important to conclusively resolve the question at hand" sounds like a recipe for wasted effort. Once you realize that your research isn't definitive, don't know how to make it definitive, and that nobody's expecting you to achieve this anyway, it's not that far to go to start neglecting the statistics as well.
A better recipe would be "gather the data that gives a definitive answer to your real question, and demonstrate that your conclusions are sound using whatever statistics or demonstrations make the most sense."
We can also think about a funnel model. You should start with broad and informal (cheap) qualitative observation, and then progressively narrow down to formal and precise quantitative data as you figure out what the few key most important variables are.
I wish I had a stronger strong upvote I could give this post. I was already nodding my head by the time I was done with the introduction, and then almost every subsequent section gave me something to be excited about. I will try to say some more substantive things later, but I wanted to say this first because I often don't get around to commenting.
Great post!
This also reminds me of Tal Yarkoni's paper on what he calls the generalizability crisis in psychology. That's the fact that psychological experiments measure a very specific thing that's treated as corresponding to a more general thing. Psychologists think that the specific thing measures the more general thing, and Yarkoni argues that they're not measuring what they think they're measuring.
One of his examples is about the study of verbal overshadowing. This is a claimed phenomenon where if you have to verbally describe what a face looks like, you will be worse off at actually recognizing that face later on. The hypothesis is that producing the verbal description causes you to remember the verbal description, while remembering the actual face less well - but the verbal description inevitably contains less detail. This has been generalized to the broader claim of "producing verbal descriptions of experiences impair our later recollection of them".
Yarkoni discusses an effort to replicate one of the original experiments:
Interesting. This resonates, and yet maybe stands in tension, with complaints that social psychology fails to do enough exact replications. I remember that a criticism of social psychology was that researchers would test a generalization like priming in too many different ways, and people were suspicious about whether or not any of the effects would stand up to replication.
I’d love to see a description of what this field should be doing. There’s a sweet spot between too much weight on one experimental approach, and too little exact replication. How does a field identify that sweet spot, and how can it coordinate to carry out experiments in the sweet spot?
Yeah, I was thinking this same thing. I feel like in the social sciences I’m more concerned about researchers testing for too many things and increasing the probability of false positives than testing too few things and maybe not fully understanding a result.
I feel like it really comes down to how powerful a study is. When you have tons of data like a big tech company might, or the results are really straightforward, like in some of the hard sciences, I think this is a great approach. When the effects of a treatment are subtler and sample size is more limited, as is often the case in the social sciences, I would be wary to recommend testing everything you can think of.
I'd say that's more a problem of selective reporting.
I hope that Metaculus will introduce the option of having questions that have multiple outcomes that one can predict in correlated ways. Perhaps by setting up a latent variable model.
They have a Patreon where they receive money and suggestions. I signed up to give them money there and suggested they introduce support for latent variable models. They seemed interested. I don't know what's going on behind the scenes, but I assume if anyone else also thinks it's worthwhile, it would help them if you gave them a bunch of money and asked them to use it for latent variable models.
This seems in line with your position, but I want to reply so people won't conclude "And coinrun experiments don't tell you important things." I think the more interesting question for that experiment is "how will the agent generalize? Can we predict it in advance? In what ways do we systematically mispredict, and why?"
(And, roughly speaking, there are a range of possible algorithms by which the agent can generalize, so it's way more than one bit. I got way more out of that paper by asking the authors for hundreds of videos for different training settings. Probably will mention the results in a post, soon.)
In a way, this is just a corollary to the good, old-fashioned "correlation does not imply causation" principle. I guess the important difference is that this is a warning directed at experimenters who tend to assume that they know better.
Before COVID, my manager's manager would semi-regularly put on a multi-day training course where he would teach everyone in his organization about the process of "Design of Experiments" (DOE) (https://en.wikipedia.org/wiki/Design_of_experiments). I work with other data scientists, statisticians, biochemists, and engineers, often performing experiments with medical devices, so this is relevant to all of us.
The idea behind a DOE is to create a multifactorial experimental design that tests the effects of many factors at once. The first step is to brainstorm all possible factors that may influence the experimental results (often using a fishbone diagram [https://en.m.wikipedia.org/wiki/Ishikawa_diagram] to focus on factors arising from Materials, Method, Measurement, Machine, Man, or "Mother Nature"). Each factor is then labelled as either control (C, factors fixed throughout all parts of the experiment), noise (N, random factors that cannot/won't be controlled for but may cause some variation in the results), or experimental (X, factors to be varied systematically to test for their effects).
Next, high and low settings are chosen for each X factor, and all possible combinations of settings are arranged in a hypercube. Instead of experimenting on one factor at a time with enough repetitions to build up statistical significance, you can perform just a few repetitions at each corner of the hypercube. You can still use the results to look at single-factor contributions to the experimental effect by aggregating data from all corners of the low side and high side of the hypercube along one dimension. But you can also look at the impact of factor interactions and the nonlinearities that result, which would have acted as hidden confounders in more traditional single-factor experiments.
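A minimal sketch of that hypercube analysis, with everything hypothetical: the factor names, the stand-in `run_experiment` response (which I gave a hidden flow-rate-by-concentration interaction), and the noise level. Levels are coded as -1 (low) and +1 (high), and an effect is estimated as the mean response on the high side minus the mean on the low side.

```python
import itertools
import math
import random
import statistics

FACTORS = ["temperature", "flow_rate", "concentration"]  # hypothetical X factors

def run_experiment(levels, rng):
    """Stand-in for a real measurement; the coefficients are invented."""
    t, f, c = levels
    return 10.0 + 2.0 * t + 1.0 * c + 1.5 * f * c + rng.gauss(0, 0.5)

def factorial_design(n_factors, reps):
    """Every corner of the hypercube, repeated `reps` times."""
    corners = itertools.product((-1, +1), repeat=n_factors)
    return [corner for corner in corners for _ in range(reps)]

def effect(runs, responses, indices):
    """Mean response where the product of the chosen coded levels is +1, minus
    the mean where it is -1. One index gives a main effect, two give an interaction."""
    high, low = [], []
    for levels, response in zip(runs, responses):
        sign = math.prod(levels[i] for i in indices)
        (high if sign == 1 else low).append(response)
    return statistics.mean(high) - statistics.mean(low)

rng = random.Random(0)
runs = factorial_design(len(FACTORS), reps=3)
responses = [run_experiment(levels, rng) for levels in runs]

main_effects = {name: effect(runs, responses, [i]) for i, name in enumerate(FACTORS)}
interactions = {
    (FACTORS[i], FACTORS[j]): effect(runs, responses, [i, j])
    for i, j in itertools.combinations(range(len(FACTORS)), 2)
}

for name, value in main_effects.items():
    print(f"main effect of {name}: {value:+.2f}")
for pair, value in interactions.items():
    print(f"interaction {pair[0]} x {pair[1]}: {value:+.2f}")
```

Note that flow rate has no main effect in this made-up system, so a one-factor-at-a-time experiment would likely write it off, yet it matters a lot through its interaction with concentration.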
I liked how this method gave a systematic way of thinking about multifactorial experimental effects. Such experiments tend to uncover a lot more information about a system than you would see otherwise. To actually tease out the underlying causal mechanisms at work, though, would require deeper statistics and modeling than we ever got into in that course.
This concept reminds me of the problem of planning software tests: I want to exercise all behaviors of the code under test, but actually testing the cartesian product of input conditions often means writing a test that is so generic it duplicates the code under test (unless there is a more naïve algorithm that the test can use), and is hard to evaluate for its own correctness. Instead, I end up writing a selected set of cases intended to cover interesting combinations of inputs — but then the problem is thinking of which inputs are worth testing. When bugs are discovered, they may be combinations of inputs that were not thought of (or they may be parameters we didn't think of testing, i.e. implicitly put in the “control” category, or specific edge-case values of parameters we did test).
An alternative to hand-written testing of specific cases is to write a property test, like “is input A + input B always ≤ output C, under a wide-ranging selection of inputs”. This feels analogous to measuring correlations in that hypercube — and the part of the actual output that you're not checking precisely (in my example, the value A + B − C) is the part of the test that is “noise” rather than “control” because we've decided it is more practical to ignore that information than to control it (write a test that contains or computes the exact answer to expect).
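A bare-bones sketch of that kind of property test, using only the standard library rather than a dedicated property-testing framework. The function under test, `padded_total`, is a hypothetical stand-in I made up (it rounds A + B up to a block size); the check only asserts A + B ≤ C and deliberately ignores the exact value of C.

```python
import random

def padded_total(a, b, block=8):
    """Hypothetical code under test: a + b rounded up to a multiple of `block`."""
    total = a + b
    return ((total + block - 1) // block) * block

def find_counterexample(func, n_cases=10_000, seed=0):
    """Try many random inputs; return the first (a, b, c) violating a + b <= c, else None."""
    rng = random.Random(seed)
    for _ in range(n_cases):
        a = rng.randint(0, 10**6)
        b = rng.randint(0, 10**6)
        c = func(a, b)
        if not a + b <= c:
            return (a, b, c)
    return None

print("counterexample for padded_total:", find_counterexample(padded_total))

# A deliberately broken variant, to confirm the check can actually fail.
print("counterexample for buggy version:", find_counterexample(lambda a, b: a + b - 1))
```

The slack C − (A + B) is never inspected, which is exactly the part treated as "noise" in the comment's framing.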
When doing quantified self-experiments, people sometimes argue for the virtues of blinding. In quantified self-experiments, blinding means measuring less. It means not exposing yourself to data about reality. "Measure Lots of Things" frequently means engaging in less blinding.
It seems like Pearl's work is so hard to put into practice that I didn't really get good examples when I asked for them on LessWrong.
If you think Pearl's work is more directly useful, why do you think I didn't get more practical examples from LessWrongers who used it?
As is often the case with these things, I think a lot of people already intuitively use the sort of moves suggested by Pearl-style causality. In particular, looking for mediators is usually the central move in reconstructing causal structure, and that's something which lots of scientists already have strong instincts for. Also, the general mental model of causality as a DAG helps defuse a lot of stupid arguments about polycausality and everything-interacting-with-everything - again, people with good research taste tend to already know those arguments are stupid, but don't necessarily have the language to explain how/why.
I think the lesson a lot of people took away from Pearl is (roughly speaking) that you should look for colliders between independent variables, since colliders are the one primitive causal structure that looks distinct (whereas forward mediators, backward mediators, and confounders look the same from a graphical independence perspective).
My impression is that looking for colliders doesn't really work for most practical causal inference. So this then translates to Pearl's approaches not working in practice.
Except I think Pearl has become (always was?) pretty skeptical about inferring causal structure from this sort of data (even though it's "theoretically" possible), so it might just be a miscommunication along the way. In the rationalist community, this miscommunication probably originates mainly from Eliezer's endorsement of searching for colliders as one of the main forms of causal inference.
Yeah, I think the mistake there is mostly one of emphasis. In terms of narrowing-down-model-space, most of the work (i.e. most of the model-elimination) is in reconstructing the undirected graphical structure; figuring out which direction each arrow goes is relatively easy after that. (You could view this as a consequence of Science In A High Dimensional World: the hard step is figuring out which handful of variables are directly relevant, and after that it's relatively easy to experiment with those variables.)
Also, Pearl himself was operating in a statistical framework closer to 20th century frequentism than modern Bayesian - e.g. his algorithms in Causality were designed to use an independence oracle rather than Bayesian model comparison. It turns out that model testing in high-dimensional systems is one of the places where the advantage of modern Bayesianism is largest.
This depends a lot on the field, no? If it's too expensive or unethical to intervene, or one plain and simply doesn't understand the variables well enough to intervene, then figuring out the direction of the causal arrows can be difficult.
Interesting, do you have any additional resources on this?
The Rationality Cheater's Move is to have more information. Really beats analysis and reason in many cases.
An interesting article. I would have really liked the author to complete the circle. A discussion of solutions to the problems would be highly complementary.
a) Does there exist an approach to model the non-independent click behavior of users?
b) A lot of progress in CTR prediction assumes independence of clicks. What is the likely benefit of the non-independence assumption of clicks, besides being closer to reality?
c) How can one use Pearl's techniques to model the non-independent behavior of user clicks?
Curated. I really like this post. The title doesn't do it justice: based on the title, it sounds like the focus is going to be describing a problem, but in fact you get the problem and then a bunch of proposed solutions. I've long thought myself to be pretty good at measuring things (having read "How To Measure Anything And All") but I'm now excited to add these Laws to my experiment design going forward, and put more effort into seeing whether I can, in fact, maybe, measure what I hope to. Cheers
This is fantastic. Immensely practical, immediately allowing for a sharpening of the mind. Very Art of Rationality.
And excellently written.
Thank you.
I was going to complain that the language quoted from the abstract in the frog paper is sufficiently couched that it's not clear the researchers thought they were measuring anything at all. Saying that X "suggests" Y "may be explained, at least partially" by Z seems reasonable to me (as you said, they had at least not ruled out that Z causes Y). Then I clicked through the link and saw the title of the paper making the unambiguous assertion that Z influences Y.
Should this be named... the Wentworth's Law of Measurement?
I'd prefer the S'wentworth Law of Measurement
Hah, I was just guessing at the real name.
One of the key statements made in this post is that measuring more stuff is better than measuring less stuff. Have your beliefs on that updated at all since the original post? What evidence would cause you to become more certain or less certain of this claim?
Basically, we have an almost religious reverence for high-powered decent-effect-size low-p-value statistical evidence, and we fail to notice when these experiments acquire their Bayes factors because they measure something incredibly narrow and are therefore unlikely to generalise to whatever gears-level models you're entertaining.
It has the same problem as deferring to experts. The communication suffers a bandwidth problem[1], and we frequently delude ourselves into believing we have adopted their models as long as we copy their probabilities on a very slim sample of queries they've answered.
I know of no good write-up of the bandwidth problem in social epistemology, but Owen Cotton-Barratt talks about it here (my comment) and Dennett refers to it as the "Daddy Is a Doctor" phenomenon.
To the contrary, johnswentworth's point is not that the experiments have low external validity but that they have low internal validity. It's that there are confounds.
Ironically, one of my quibbles with the post is that the verbiage implies measurement error is the problem. Not measuring what you think you're measuring is about content validity, but the post is actually about how omitted variables (i.e., confounders) are a problem for inferences. "You are not Complaining About What You Think You Are Complaining About."
The measurement of lots of other things leads to the pathology of data mining rather than trying to find the correct causative variable. The better experimental technique is to sequentially investigate each confounding variable and try to ensure they are eliminated. Sometimes this can be hard, but that is no excuse not to do properly CONTROLLED experiments rather than reporting noise.
Data mining is so problematic that medical journals have insisted that the experiment hypothesis is defined in advance so that unexpected variables with significant p-values are not reported instead (p-hacking).
I would far rather experimenters do the 1-bit experiment and then if the result doesn't falsify the hypothesis, think about other explanations for the result and check those variables in the same way. Good experimentation is not for the lazy.
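The data-mining worry above can be made concrete with a bit of arithmetic. If a study checks k unrelated variables that are all pure noise, each at a 0.05 threshold, the chance that at least one comes out "significant" is 1 - 0.95^k. The sketch below assumes the tests are independent and that null p-values are uniform on [0, 1]; `familywise_rate` and `simulate_studies` are hypothetical names used only for illustration.

```python
import random


def familywise_rate(num_variables, alpha=0.05):
    """Chance of at least one false positive among independent null tests."""
    return 1 - (1 - alpha) ** num_variables


def simulate_studies(num_variables, alpha, num_studies, rng):
    """Fraction of simulated all-noise studies with at least one 'hit'."""
    hits = 0
    for _ in range(num_studies):
        # Under a true null hypothesis, a valid p-value is uniform on [0, 1].
        if any(rng.random() < alpha for _ in range(num_variables)):
            hits += 1
    return hits / num_studies


rng = random.Random(0)
for k in (1, 5, 20):
    analytic = familywise_rate(k)
    simulated = simulate_studies(k, 0.05, 20000, rng)
    print(f"{k:2d} variables: analytic {analytic:.3f}, simulated {simulated:.3f}")
```

This is the reason for declaring the hypothesis in advance: with twenty noise variables and no pre-registration, finding something reportable is more likely than not.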
Sometimes, people accept the hard work of doing properly CONTROLLED experiments. Then they not only FAIL COMPLETELY, they think they succeeded - and everybody else is tricked into believing it too. That's what happened in John's A/B test.
For unsolved problems, you can only find the correct causative variable with a "firehose of information." Then you can go on to prove you're right via a properly controlled experiment.
A classic unreplicable p-hacked study is the one where they found relationships between a discount on an Italian buffet and what the diners ate.
If I was those researchers, I wouldn't mind gathering the data. That's the firehose. But I wouldn't want to publish it. Instead, I'd scour the data, combine it with my qualitative observations of the event, and see if we could come up with a specific pre-registerable causal hypothesis we could actually believe in for a follow-up study. Only the follow-up would be published.
"For unsolved problems, you can only find the correct causative variable with a "firehose of information." Then you can go on to prove you're right via a properly controlled experiment."
That second part often doesn't happen. For [bio]medical experiments it is just too expensive. Data mining ensues, and any variables with significant p-values are then published. The medical journals are rife with this, which is one reason 30-50% of medical research proves unrepeatable.
Never underestimate human nature to do the easiest thing rather than the correct one. Science can be painstakingly hard to get right, but the pressures to publish are high. I've seen it first hand in biotech, where the obvious questions to ask of the "result" were ignored.
I also am in biotech, and I agree these problems exist.
One way of making use of the "firehose of information" in biotech would be to insist that researchers publish their raw datasets, and provide additional supplementary information along with their paper. Imagine, for example, if researchers doing animal work were required to film themselves doing it and post the videos online for others to review. I think it's easy to see how that would be a helpful "firehose of information" and would do a lot to flesh out the picture given by the normally reported figures in a publication.
I think you're worried about people switching from hard analysis to squishier qualitative data, perhaps because resources are already so constrained that it feels like "one or the other." I think John's saying "why not both?"