You Are Not Measuring What You Think You Are Measuring

[-]DirectedEvolution3y3420

The two laws have a lot of consequences for designing and interpreting experiments. When designing experiments, assume that the experiment will not measure the thing you intend. Include lots of other measurements, to check as many other things as you can. If possible, use instruments which give a massive firehose of information, instruments which would let you notice a huge variety of things you might not have considered, like e.g. a microscope.

Great advice here. Right now, I'm validating a PCR protocol via agarose gel electrophoresis. Neither technique is new to me, but the equipment is, and I'm using borrowed DNA from one person and borrowed dye from another, using an agarose gel electrophoresis protocol that's not written for the same dye.

At first, I was running gels with just one DNA sample, basically getting one bit of information for the one question I had in mind.

I quickly realized that there were a ton of variables at play, with some unknown unknowns. Even though it took a little longer to add extra sample lanes to look at different questions, I could learn much more, much more quickly, by doing so. Even if it wasn't obvious why I'd want that data, making sure I'd have access to it when the gel was finished an hour later proved useful again and again. It is good to over-observe your experiment as much as you can get away with.

[-]DirectedEvolution3y337

This also sounds a lot like Elmer Gates' approach to learning new things and coming up with inventions.

Gates’s “mind-using” strategies look very different from modern and traditional learning practices. Gates’s theory of “brain-building” fused radically empiricist and inductivist philosophy with a belief in the mind’s adaptability, and his “art” reflects that. Early on in his studies, Gates observed that acquiring the right higher level concepts depended heavily on having the right foundational experiences. Hence, to learn a new scientific field, Gates would start by gathering the “sensory experiences” that constituted the raw data of that science. In the case of mechanics, this might involve replicating key experiments related to things like gravity or friction. In the case of chemistry, Gates supposedly replicated the experiments described in multiple textbooks himself. In the case of weaving, which Gates was once challenged to apply his method to (which he did successfully, inventing multiple new devices), we have the following quote regarding how Gates went about the initial steps:
I secured letters of introduction to practical weavers and loom makers, and with the aid of several assistants made a systematic search of the technical literature. By actual observation of looms and methods of weaving I built over my brain with reference to that subject; acquiring all the images, concepts, ideas, and thoughts that six weeks’ continuous effort made possible.
Once Gates had had the necessary sensory experiences, he’d move on to the next step of “refunctioning” them. Refunctioning (coined by Gates) refers to mentally reviewing the experiences and observations acquired from the science until they “became more vivid, much more clearly minted and complete, while the processes of states acquired much greater celerity and efficiency.” As far as I can tell, Gates would then move on to repeating the process but for images, concepts, and thoughts (higher levels of abstraction) that emerged from repeatedly reviewing the raw experiences and studying their relationships. The following passage contains the best description I could find of this part of the process:
Then he related each concept to the others, which gave many new and true ideas (to be temporarily recorded until experimentally verified). Every new concept always implied a number of new ideas; and for the first time, to relate new concepts was always a rich opportunity for new ideas.

[-]DirectedEvolution3y31

Linked review by our own NaiveTortoise.

[-]DirectedEvolution3y256

Another example from the lab. My labmate is running a microfluidic system to make nanoparticles. She's trying to make them monodispersed (all the same size). What she'll publish is very precise measurements of the size distribution of the nanoparticles, which she'll obtain from the MasterSizer.

However, it's a 30 minute round trip walk to use it. And every MasterSizer reading is just a point in time. What she'll use to make the decision about whether or not to measure the nanoparticles is purely qualitative. She's got a microscope hooked up to a computer screen that's displaying the nanoparticles flowing through as they're created. She can see clearly that they're monodispersed, and can tell by sight what the effect is of changing the speed of the oil and aqueous phases, the angle of the syringes, or the ratio of the flow rates. Not to mention the effect of bumps, dust particles, and so on. It's a continuous, dynamic, high-data way to observe - a firehose of information.

The figure that will ultimately be published will capture only a small piece of that, and it will be a far less informative piece of information than simply standing there and watching the nanoparticles stream by - or an informal conversation with her. But that MasterSizer measurement is what may convince other researchers that she's had success, or that her techniques are worth trying. They'll have to set up their own firehose to really learn about those monodispersed nanoparticles. The paper is, in a way, just advertising, and maybe a third of the way to an instruction manual.

[-]Ben2y60

If she can, it might be nice for her to attach a few minutes of microscope footage to the paper as supplementary material. Maybe just "bad conditions" (stuff flowing by all different sizes), followed by "good conditions": stuff all the same. Lots of journals offer the possibility of videos as supplementary information. It's the sort of thing that (I think) journal editors like (maybe it boosts engagement by the metrics they use for their websites?), and it sounds like it will benefit the paper.

[-]Davidmanheim3y200

Well stated, good post!

You might enjoy my paper on this topic, where I present other approaches to measurement, especially in the presence of adversarial pressure, which you didn't really discuss. I also think that the idea of metric design, which I explored, gets into more detail about how I think you can practically do measurement better.

[-]Mo Putera3y2011

I think your 'Towards a coherent process for metric design' section alone is worth its weight in gold. Since most LW readers aren't going to click on your linked paper (click-through rates being as low in general as they are, from my experience in marketing analytics), let me quote that section wholesale:

Given the various strategies and considerations discussed in the paper, as well as failure modes and limitations, it is useful to lay out a simple and coherent outline of a process for metric design. While this will by necessity be far from complete, and will include items that may not be relevant for a particular application, it should provide at least an outline that can be adapted to various metric design processes. Outside of the specific issues discussed earlier, there is a wide breadth of expertise and understanding that may be needed for metric design. Citations in this section will also provide a variety of resources for at least introductory further reading on those topics.
Understand the system being measured, including both technical (Blanchard & Fabrycky, 1990) and organizational (Berry & Houston, 1993) considerations.
Determine scope
What is included in the system?
What will the metrics be used for?
Understand the causal structure of the system
What is the logic model or theory? (Rogers, Petrosino, Huebner, & Hacsi, 2000)
Is there formal analysis (Gelman, 2010) or expert opinion (van Gelder, Vodicka, & Armstrong, 2016) that can inform this?
Identify stakeholders (Kenny, 2014)
Who will be affected?
Who will use the metrics?
Whose goals are relevant?
Identify the Goals
What immediate goals are being served by the metric(s)? How are individual impacts related to performance more broadly? (Ruch, 1994)
What longer term or broader goals are implicated?
Identify Relevant Desiderata
Availability
Cost
Immediacy
Simplicity
Transparency
Fairness
Corruptibility
Brainstorm potential metrics
What outcomes are important to capture?
What data sources exist?
What methods can be used to capture additional data?
What measurements are easy to capture?
What is the relationship between the measurements and the outcomes?
What isn’t captured by the metrics?
Consider and Plan
Understand why and how the metric is useful. (Manheim, 2018)
Consider how the metrics will be used to diagnose issues or incentify people. (Dai, Dietvorst, Tuckfield, Milkman, & Schweitzer, 2017)
Plan how to use the metrics to develop the system, avoiding the “reward / punish” dichotomy. (Wigert & Harter, 2017)
Perform a pre-mortem (Klein, 2007)
Plan to routinely revisit the metrics (Atkins, Wanick, & Wills, 2017)

[-]Davidmanheim3y20

Thanks!

[-]gwern3y208

(The Feynman story remains unconfirmed.)

[-]Richard_Kennaway2y80

Since found (by gwern).

[-]Shmi3y124

Reminds me of https://astralcodexten.substack.com/p/ivermectin-much-more-than-you-wanted, where ivermectin studies likely measured worm infections, not Covid prevention.

[-]gwillen3y122

I wish I had a stronger strong upvote I could give this post. I was already nodding my head by the time I was done with the introduction, and then almost every subsequent section gave me something to be excited about. I will try to say some more substantive things later, but I wanted to say this first because I often don't get around to commenting.

[-]DirectedEvolution3y111

Here's a structural explanation for statistical improprieties in scientific research:

For publications, scientists are incentivized to overinvest in precision. They aren't able to get a publication out of just their informal observations and qualitative data.
Yet scientists are also underincentivized to invest in definitive answers. That doesn't mean there's no incentive, just that the incentives aren't adequate.

"Do precise statistics, but it's not that important to conclusively resolve the question at hand" sounds like a recipe for wasted effort. Once you realize that your research isn't definitive, don't know how to make it definitive, and that nobody's expecting you to achieve this anyway, it's not that far to go to start neglecting the statistics as well.

A better recipe would be "gather the data that gives a definitive answer to your real question, and demonstrate that your conclusions are sound using whatever statistics or demonstrations make the most sense."

We can also think about a funnel model. You should start with broad and informal (cheap) qualitative observation, and then progressively narrow down to formal and precise quantitative data as you figure out what the few key most important variables are.

[-]Kaj_Sotala3y95

Great post!

This also reminds me of Tal Yarkoni's paper on what he calls the generalizability crisis in psychology. That's the fact that psychological experiments measure a very specific thing that's treated as corresponding to a more general thing. Psychologists think that the specific thing measures the more general thing, and Yarkoni argues that they're not measuring what they think they're measuring.

One of his examples is about the study of verbal overshadowing. This is a claimed phenomenon where if you have to verbally describe what a face looks like, you will be worse at actually recognizing that face later on. The hypothesis is that producing the verbal description causes you to remember the verbal description, while remembering the actual face less well - but the verbal description inevitably contains less detail. This has been generalized to the broader claim of "producing verbal descriptions of experiences impairs our later recollection of them".

Yarkoni discusses an effort to replicate one of the original experiments:

Alogna and colleagues (2014) conducted a large-scale “registered replication report” (RRR; Simons, Holcombe, & Spellman, 2014) involving 31 sites and over 2,000 participants. The study sought to replicate an influential experiment by Schooler and Engstler-Schooler (1990) in which the original authors showed that participants who were asked to verbally describe the appearance of a perpetrator caught committing a crime on video showed poorer recognition of the perpetrator following a delay than did participants assigned to a control task (naming as many countries and capitals as they could). Schooler & Engstler-Schooler (1990) dubbed this the verbal overshadowing effect. In both the original and replication experiments, only a single video, containing a single perpetrator, was presented at encoding, and only a single set of foil items was used at test. Alogna et al. successfully replicated the original result in one of two tested conditions, and concluded that their findings revealed “a robust verbal overshadowing effect” in that condition.
Let us assume for the sake of argument that there is a genuine and robust causal relationship between the manipulation and outcome employed in the Alogna et al study. I submit that there would still be essentially no support for the authors’ assertion that they found a “robust” verbal overshadowing effect, because the experimental design and statistical model used in the study simply cannot support such a generalization. The strict conclusion we are entitled to draw, given the limitations of the experimental design inherited from Schooler and Engstler-Schooler (1990), is that there is at least one particular video containing one particular face that, when followed by one particular lineup of faces, is more difficult for participants to identify if they previously verbally described the appearance of the target face than if they were asked to name countries and capitals. [...]
On any reasonable interpretation of the construct of verbal overshadowing, the corresponding universe of intended generalization should clearly also include most of the operationalizations that would result from randomly sampling various combinations of these factors (e.g., one would expect it to still count as verbal overshadowing if Alogna et al. had used live actors to enact the crime scene, instead of showing a video). Once we accept this assumption, however, the critical question researchers should immediately ask themselves is: are there other psychological processes besides verbal overshadowing that could plausibly be influenced by random variation in any of these uninteresting factors, independently of the hypothesized psychological processes of interest? A moment or two of consideration should suffice to convince one that the answer is a resounding yes. It is not hard to think of dozens of explanations unrelated to verbal overshadowing that could explain the causal effect of a given manipulation on a given outcome in any single operationalization.
This verbal overshadowing example is by no means unusual. The same concerns apply equally to the broader psychology literature containing tens or hundreds of thousands of studies that routinely adopt similar practices. In most of psychology, it is standard operating procedure for researchers employing just one experimental task, between-subject manipulation, experimenters, testing room, research site, etc., to behave as though an extremely narrow operationalization is an acceptable proxy for a much broader universe of admissible observations. It is instructive—and somewhat fascinating from a sociological perspective—to observe that while no psychometrician worth their salt would ever recommend a default strategy of measuring complex psychological constructs using a single unvalidated item, the majority of psychology studies do precisely that with respect to multiple key design factors. The modal approach is to stop at a perfunctory demonstration of face validity—that is, to conclude that if a particular operationalization seems like it has something to do with the construct of interest, then it is an acceptable stand-in for that construct. Any measurement-level findings are then uncritically generalized to the construct level, leading researchers to conclude that they’ve learned something useful about broader phenomena like verbal overshadowing, working memory, ego depletion, etc., when in fact such sweeping generalizations typically obtain little support from the reported empirical studies.

[-]DirectedEvolution3y81

Interesting. This resonates, and yet maybe stands in tension, with complaints that social psychology fails to do enough exact replications. I remember that a criticism of social psychology was that researchers would test a generalization like priming in too many different ways, and people were suspicious about whether or not any of the effects would stand up to replication.

I’d love to see a description of what this field should be doing. There’s a sweet spot between too much weight on one experimental approach, and too little exact replication. How does a field identify that sweet spot, and how can it coordinate to carry out experiments in the sweet spot?

[-]bortrand3y30

Yeah, I was thinking this same thing. I feel like in social sciences I’m more concerned about researchers testing for too many things and increasing the probability of false positives than testing too few things and maybe not fully understanding a result.

I feel like it really comes down to how powerful a study is. When you have tons of data like a big tech company might, or the results are really straightforward, like in some of the hard sciences, I think this is a great approach. When the effects of a treatment are subtler and sample size is more limited, as is often the case in the social sciences, I would be wary to recommend testing everything you can think of.

[-]johnswentworth3y31

I feel like in social sciences I’m more concerned about researchers testing for too many things and increasing the probability of false positives than testing too few things and maybe not fully understanding a result.

I'd say that's more a problem of selective reporting.

[-]tailcalled3y99

I hope that Metaculus will introduce the option of having questions that have multiple outcomes that one can predict in correlated ways. Perhaps by setting up a latent variable model.

They have a Patreon where they receive money and suggestions. I signed up to give them money there and suggested they introduce support for latent variable models. They seemed interested. I don't know what's going on behind the scenes, but I assume if anyone else also thinks it's worthwhile, it would help them if you gave them a bunch of money and asked them to use it for latent variable models.

[-]PeterMcCluskey2y82Review for 2022 Review

This post didn't feel particularly important when I first read it.

Yet I notice that I've been acting on the post's advice since reading it. E.g. being more optimistic about drug companies that measure a wide variety of biomarkers.

I wasn't consciously doing that because I updated due to the post. I'm unsure to what extent the post changed me via subconscious influence, versus deriving the ideas independently.

[-]Srdjan Miletic2y80Review for 2022 Review

Hmmmm.

So when I read this post I initially thought it was good. But on second thought I don't think I actually get that much from it. If I had to summarise it, I'd say

a few interesting anecdotes about experiments where measurement was misleading or difficult
some general talk about "low bit experiments" and how hard it is to control for confounders

The most interesting claim I found was the second law of experiment design. To quote: "The Second Law of Experiment Design: if you measure enough different stuff, you might figure out what you’re actually measuring." But even here I didn't get much clarity or new info. The argument seemed to boil down to "If you measure more things, you may find the actual underlying important variable", which is true I guess but doesn't seem particularly novel and also introduces other risks, e.g. that the more variables you measure, the higher the chance that at least some of them will correlate just due to chance. There's a pointer to a book which the author claims sheds more light on the topic and on modern statistical methods around experiment design more generally, but that's it.
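To put a rough number on the chance-correlation risk: a minimal sketch, assuming each check is an independent test at the same significance level (an idealization; real measurements are usually correlated, and the function name is just illustrative).

```python
def familywise_false_positive_rate(num_tests, alpha=0.05):
    """Chance that at least one of `num_tests` independent null tests
    comes out 'significant' at level `alpha` purely by chance."""
    return 1 - (1 - alpha) ** num_tests


# With k measured variables there are k * (k - 1) / 2 pairwise correlations to check.
def num_pairs(num_variables):
    return num_variables * (num_variables - 1) // 2


for k in (2, 5, 10, 20):
    m = num_pairs(k)
    rate = familywise_false_positive_rate(m)
    print(f"{k} variables -> {m} pairs -> P(at least one spurious hit) = {rate:.3f}")
```

The usual responses are to correct for multiple comparisons or to treat the wide net as hypothesis generation and confirm anything interesting in a follow-up experiment.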

I think I also have a broader problem here, namely that the article feels a bit fuzzy in a way that makes it hard to pin down what the central claims are.

So yeah, I enjoyed it but on reflection I'm a bit less of a fan than I thought.

[-]TurnTrout3y70

Will a neural net trained to navigate to a coin at the end of a level still go to the coin if it’s no longer at the end of a level?

This seems in line with your position, but I want to reply so people won't conclude "And coinrun experiments don't tell you important things." I think the more interesting question for that experiment is "how will the agent generalize? Can we predict it in advance? In what ways do we systematically mispredict, and why?"

(And, roughly speaking, there are a range of possible algorithms by which the agent can generalize, so it's way more than one bit. I got way more out of that paper by asking the authors for hundreds of videos for different training settings. Probably will mention the results in a post, soon.)

[-]Jon Garcia3y61

First Law of Experiment Design: you are not measuring what you think you are measuring.

In a way, this is just a corollary to the good, old-fashioned "correlation does not imply causation" principle. I guess the important difference is that this is a warning directed at experimenters who tend to assume that they know better.

Second Law of Experiment Design: if you measure enough different stuff, you might figure out what you’re actually measuring.

Before COVID, my manager's manager would semi-regularly put on a multi-day training course where he would teach everyone in his organization about the process of "Design of Experiments" (DOE) (https://en.wikipedia.org/wiki/Design_of_experiments). I work with other data scientists, statisticians, biochemists, and engineers, often performing experiments with medical devices, so this is relevant to all of us.

The idea behind a DOE is to create a multifactorial experimental design that tests the effects of many factors at once. The first step is to brainstorm all possible factors that may influence the experimental results (often using a fishbone diagram [https://en.m.wikipedia.org/wiki/Ishikawa_diagram] to focus on factors arising from Materials, Method, Measurement, Machine, Man, or "Mother Nature"). Each factor is then labelled as either control (C, factors fixed throughout all parts of the experiment), noise (N, random factors that cannot/won't be controlled for but may cause some variation in the results), or experimental (X, factors to be varied systematically to test for their effects).

Next, high and low settings are chosen for each X factor, and all possible combinations of settings are arranged in a hypercube. Instead of experimenting on one factor at a time with enough repetitions to build up statistical significance, you can perform just a few repetitions at each corner of the hypercube. You can still use the results to look at single-factor contributions to the experimental effect by aggregating data from all corners of the low side and high side of the hypercube along one dimension. But you can also look at the impact of factor interactions and the nonlinearities that result, which would have acted as hidden confounders in more traditional single-factor experiments.
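Here is a minimal sketch of that hypercube idea, assuming three X factors coded as -1 (low) and +1 (high). The factor names and the simulated response are made up for illustration; in a real DOE the response comes from the bench, not from a function.

```python
import itertools
import random


def full_factorial(factors):
    """Return every corner of the hypercube as a dict {factor: -1 or +1}."""
    return [dict(zip(factors, levels))
            for levels in itertools.product((-1, +1), repeat=len(factors))]


def simulated_response(run, rng):
    """Hypothetical system: A has a direct effect, B only matters through A*B, C does nothing."""
    return 10 + 3 * run["A"] + 2 * run["A"] * run["B"] + rng.gauss(0, 0.1)


def main_effect(results, factor):
    """Mean response on the high side of one dimension minus mean on the low side."""
    high = [y for run, y in results if run[factor] == +1]
    low = [y for run, y in results if run[factor] == -1]
    return sum(high) / len(high) - sum(low) / len(low)


def interaction_effect(results, f1, f2):
    """Same contrast, taken on the sign of the product of two factors."""
    same = [y for run, y in results if run[f1] * run[f2] == +1]
    diff = [y for run, y in results if run[f1] * run[f2] == -1]
    return sum(same) / len(same) - sum(diff) / len(diff)


rng = random.Random(0)
design = full_factorial(["A", "B", "C"])  # 2**3 = 8 corners
# A few repetitions at each corner rather than many repetitions of one factor.
results = [(run, simulated_response(run, rng)) for run in design for _ in range(3)]

for f in ("A", "B", "C"):
    print(f"main effect {f}: {main_effect(results, f):+.2f}")
print(f"A x B interaction: {interaction_effect(results, 'A', 'B'):+.2f}")
```

In this toy system B shows essentially no main effect, yet the A x B contrast is large, which is the kind of thing a one-factor-at-a-time experiment on B would miss.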

I liked how this method gave a systematic way of thinking about multifactorial experimental effects. Such experiments tend to uncover a lot more information about a system than you would see otherwise. To actually tease out the underlying causal mechanisms at work, though, would require deeper statistics and modeling than we ever got into in that course.

[-]kpreid3y51

Next, high and low settings are chosen for each X factor, and all possible combinations of settings are arranged in a hypercube. Instead of experimenting on one factor at a time with enough repetitions to build up statistical significance, you can perform just a few repetitions at each corner of the hypercube.

This concept reminds me of the problem of planning software tests: I want to exercise all behaviors of the code under test, but actually testing the cartesian product of input conditions often means writing a test that is so generic it duplicates the code under test (unless there is a more naïve algorithm that the test can use), and is hard to evaluate for its own correctness. Instead, I end up writing a selected set of cases intended to cover interesting combinations of inputs — but then the problem is thinking of which inputs are worth testing. When bugs are discovered, they may be combinations of inputs that were not thought of (or they may be parameters we didn't think of testing, i.e. implicitly put in the “control” category, or specific edge-case values of parameters we did test).

An alternative to hand-written testing of specific cases is to write a property test, like “is input A + input B always ≤ output C, under a wide-ranging selection of inputs”. This feels analogous to measuring correlations in that hypercube — and the part of the actual output that you're not checking precisely (in my example, the value A + B − C) is the part of the test that is “noise” rather than “control” because we've decided it is more practical to ignore that information than to control it (write a test that contains or computes the exact answer to expect).
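A minimal sketch of such a property test, using only the standard library. The function under test, `allocate_capacity`, is a hypothetical stand-in (it rounds A + B up to a block size); dedicated property-testing libraries add input shrinking and smarter generation on top of this basic loop.

```python
import random


def allocate_capacity(a, b, block=8):
    """Hypothetical code under test: round a + b up to a multiple of `block`."""
    total = a + b
    return ((total + block - 1) // block) * block


def check_property(trials=10_000, seed=0):
    """Property: A + B <= C over a wide-ranging selection of random inputs.

    Returns the first counterexample found as (a, b, c), or None if the
    property held for every sampled input. The slack C - (A + B) is
    deliberately not checked: that is the "noise" part of the output.
    """
    rng = random.Random(seed)
    for _ in range(trials):
        a = rng.randint(0, 10**6)
        b = rng.randint(0, 10**6)
        c = allocate_capacity(a, b)
        if a + b > c:
            return (a, b, c)
    return None


counterexample = check_property()
print("property held" if counterexample is None else f"counterexample: {counterexample}")
```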

[-]ChristianKl3y60

When doing quantified self-experiments, people sometimes argue for the virtues of blinding. In quantified self-experiments, blinding means measuring less. It means not exposing yourself to data about reality. "Measure Lots of Things" frequently means engaging in less blinding.

Modern work on causality (e.g. Pearl) largely solves that problem - if we measure enough stuff.

It seems like Pearl's work is so hard to put into practice that I didn't really get good examples when I asked for them on LessWrong.

If you think Pearl's work is more directly useful, why do you think I didn't get more practical examples from LessWrongers who used it?

[-]johnswentworth3y90

As is often the case with these things, I think a lot of people already intuitively use the sort of moves suggested by Pearl-style causality. In particular, looking for mediators is usually the central move in reconstructing causal structure, and that's something which lots of scientists already have strong instincts for. Also, the general mental model of causality as a DAG helps defuse a lot of stupid arguments about polycausality and everything-interacting-with-everything - again, people with good research taste tend to already know those arguments are stupid, but don't necessarily have the language to explain how/why.

[-]tailcalled3y30

I think the lesson a lot of people took away from Pearl is (roughly speaking) that you should look for colliders between independent variables, since colliders are the one primitive causal structure that looks distinct (whereas forward mediators, backward mediators, and confounders look the same from a graphical independence perspective).
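For anyone who hasn't seen why colliders look distinct, here is a small simulation sketch, assuming two independent Gaussian causes X and Y with a common effect Z (all numbers are arbitrary illustration choices). X and Y are uncorrelated overall, but become correlated once you select on Z.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


rng = random.Random(0)
n = 20_000
x = [rng.gauss(0, 1) for _ in range(n)]
y = [rng.gauss(0, 1) for _ in range(n)]          # independent of x
z = [xi + yi + rng.gauss(0, 0.5) for xi, yi in zip(x, y)]  # collider: X -> Z <- Y

r_all = pearson(x, y)

# Conditioning on the collider, here crudely by keeping only high-Z samples.
selected = [(xi, yi) for xi, yi, zi in zip(x, y, z) if zi > 1.0]
r_selected = pearson([p[0] for p in selected], [p[1] for p in selected])

print(f"corr(X, Y) over all samples:    {r_all:+.3f}")
print(f"corr(X, Y) among samples Z > 1: {r_selected:+.3f}")
```

A mediator or a confounder in Z's place would show the opposite pattern (dependence that disappears when conditioning on Z), which is why those three structures can't be told apart from independence tests alone.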

My impression is that looking for colliders doesn't really work for most practical causal inference. So this then translates to Pearl's approaches not working in practice.

Except I think Pearl has become (always was?) pretty skeptical about inferring causal structure from this sort of data (even though it's "theoretically" possible), so it might just be a miscommunication along the way. In the rationalist community, this miscommunication probably originates mainly from Eliezer's endorsement of searching for colliders as one of the main forms of causal inference.

[-]johnswentworth3y30

Yeah, I think the mistake there is mostly one of emphasis. In terms of narrowing-down-model-space, most of the work (i.e. most of the model-elimination) is in reconstructing the undirected graphical structure; figuring out which direction each arrow goes is relatively easy after that. (You could view this as a consequence of Science In A High Dimensional World: the hard step is figuring out which handful of variables are directly relevant, and after that it's relatively easy to experiment with those variables.)

Also, Pearl himself was operating in a statistical framework closer to 20th century frequentism than modern Bayesian - e.g. his algorithms in Causality were designed to use an independence oracle rather than Bayesian model comparison. It turns out that model testing in high-dimensional systems is one of the places where the advantage of modern Bayesianism is largest.

[-]tailcalled3y61

In terms of narrowing-down-model-space, most of the work (i.e. most of the model-elimination) is in reconstructing the undirected graphical structure; figuring out which direction each arrow goes is relatively easy after that.

This depends a lot on the field, no? If it's too expensive or unethical to intervene, or one plain and simply doesn't understand the variables well enough to intervene, then figuring out the direction of the causal arrows can be difficult.

It turns out that model testing in high-dimensional systems is one of the places where the advantage of modern Bayesianism is largest.

Interesting, do you have any additional resources on this?

[-]lemonhope3y61

The Rationality Cheater's Move is to have more information. Really beats analysis and reason in many cases.

[-]joe2y50

A single experiment, especially one with a small sample size (though even large samples can be ruined by experimental error etc.), or even a small number of such experiments which confirm each other, just isn't going to give very reliable results for a complex system.

When things are tightly controlled in a lab for a simple system (with few variables), then basic statistical methods can yield believable results. This is why the replication problem is mostly in psychology, medicine, and social sciences (I presume biology & ecology should be on the list, but they aren't so consequential maybe, except for the biology that is already in the medical category). Those disciplines work with complex systems that we have little macro-level understanding of and that do not lend themselves to being controlled well in a lab. Whatever lab controls can be instituted really control very little because we are often talking about human persons who are complex biological systems themselves.

Of course the other problem is that the researchers themselves generally do not understand statistical theory. And that is a very big factor here generally. You basically have to be an actual statistician or very quantitatively/analytically skilled. There is statistical literature, for example, about the problems with p-values and statisticians trying to figure out what to do about that since nobody understands these things.

All that being said, with many iterations of experiments addressing the same question, say, done by many different research teams (say in different places and at different times, to whatever appropriate level), then we can start to see real physical connections emerge. It might still be the case that some hidden confounding variable is the ultimate arbiter of whatever relationship is observed, but the observed relationship will still be real, just mediated by unknown observables. It would be better to know the precise causal chain, but it's still pretty cool to know a real relationship, even if it is only correlation and not causation.

It costs a lot of money to do so many repetitions. So for something that doesn't ultimately have much economic impact, there is no incentive to spend a lot of money to get some small bits of information to be cataloged away and rarely if ever accessed. Even if large sums of money are spent to answer some question about a complex system, the outcome might still be uncertainty about whether the relationship observed is real.

[-]Avishek Shaw3y50

An interesting article. I would have really liked the author to complete the circle. A discussion of solutions to the problems would be highly complementary.

we’d assumed that every user’s click-through was statistically independent, when in fact they were highly correlated, so many of the results which we thought were significant were in fact basically noise.

a) Does there exist an approach to model the non-independent click behavior of users?

b) A lot of progress in CTR prediction assumes independence of clicks. What is the likely benefit of the non-independence assumption of clicks, besides being closer to reality?

c) How can one use Pearl's techniques to model the non-independent behavior of user clicks?
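On question (a), one simple option is to change the unit of analysis from the impression to the user. The sketch below is a toy A/A simulation, not the analysis from the post: the Beta(0.5, 4.5) click propensities, the 50 impressions per user, and all function names are assumptions chosen for illustration. It compares how often an impression-level test and a user-level test declare significance when there is no real effect.

```python
import math
import random
import statistics

IMPRESSIONS_PER_USER = 50  # assumption: every user sees the same number of impressions


def simulate_arm(rng, n_users):
    """Return per-user click counts for one arm; there is no treatment effect."""
    clicks = []
    for _ in range(n_users):
        # Assumption: each user has their own click propensity (mean 0.1),
        # so clicks from the same user are correlated with each other.
        p = rng.betavariate(0.5, 4.5)
        clicks.append(sum(rng.random() < p for _ in range(IMPRESSIONS_PER_USER)))
    return clicks


def impression_level_z(a, b):
    """Two-proportion z statistic that treats every impression as independent."""
    n_a = len(a) * IMPRESSIONS_PER_USER
    n_b = len(b) * IMPRESSIONS_PER_USER
    pooled = (sum(a) + sum(b)) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (sum(a) / n_a - sum(b) / n_b) / se


def user_level_z(a, b):
    """z statistic on per-user click rates, so the unit of analysis is the user."""
    rate_a = [c / IMPRESSIONS_PER_USER for c in a]
    rate_b = [c / IMPRESSIONS_PER_USER for c in b]
    se = math.sqrt(statistics.variance(rate_a) / len(rate_a)
                   + statistics.variance(rate_b) / len(rate_b))
    return (statistics.mean(rate_a) - statistics.mean(rate_b)) / se


def false_positive_rates(n_trials=200, n_users=100, seed=0):
    """Fraction of A/A tests each analysis calls significant at |z| > 1.96."""
    rng = random.Random(seed)
    naive_hits = user_hits = 0
    for _ in range(n_trials):
        a, b = simulate_arm(rng, n_users), simulate_arm(rng, n_users)
        naive_hits += abs(impression_level_z(a, b)) > 1.96
        user_hits += abs(user_level_z(a, b)) > 1.96
    return naive_hits / n_trials, user_hits / n_trials


if __name__ == "__main__":
    naive, by_user = false_positive_rates()
    print(f"impression-level false positives: {naive:.2f}")
    print(f"user-level false positives:       {by_user:.2f}")
```

Under these assumptions the impression-level test should reject far more often than the nominal 5%, while the user-level test should stay close to it. Cluster-robust standard errors or mixed-effects models are other common ways to handle the same correlation.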

[-]Ruby3y40

Curated. I really like this post. The title doesn't do it justice: based on the title, it sounds like the focus is going to be describing a problem, but in fact you get the problem and then a bunch of proposed solutions. I've long thought myself to be pretty good at measuring things (having read "How To Measure Anything And All") but I'm now excited to add these Laws to my experiment design going forward, and put more effort into seeing whether I can, in fact, maybe, measure what I hope to. Cheers

[-]Valentine3y40

This is fantastic. Immensely practical, immediately allowing for a sharpening of the mind. Very Art of Rationality.

And excellently written.

Thank you.

[-]Emrik3y4-1

Basically, we have an almost religious reverence for high-powered decent-effect-size low-p-value statistical evidence, and we fail to notice when these experiments acquire their Bayes factors because they measure something incredibly narrow and are therefore unlikely to generalise to whatever gears-level models you're entertaining.

It has the same problem as deferring to experts. The communication suffers a bandwidth problem[1], and we frequently delude ourselves into believing we have adopted their models as long as we copy their probabilities on a very slim sample of queries they've answered.

[1] I know of no good write-up of the bandwidth problem in social epistemology, but Owen Cotton-Barratt talks about it here (my comment) and Dennett refers to it as the "Daddy Is a Doctor" phenomenon.

[-]Daniel V3y107

To the contrary, johnswentworth's point is not that the experiments have low external validity but that they have low internal validity. It's that there are confounds.

Ironically, one of my quibbles with the post is that the verbiage implies measurement error is the problem. Not measuring what you think you're measuring is about content validity, but the post is actually about how omitted variables (i.e., confounders) are a problem for inferences. "You are not Complaining About What You Think You Are Complaining About."

[-]Richard Korzekwa3y40

I was going to complain that the language quoted from the abstract in the frog paper is sufficiently couched that it's not clear the researchers thought they were measuring anything at all. Saying that X "suggests" Y "may be explained, at least partially" by Z seems reasonable to me (as you said, they had at least not ruled out that Z causes Y). Then I clicked through the link and saw the title of the paper making the unambiguous assertion that Z influences Y.

[-]Shmi3y43

Should this be named... the Wentworth's Law of Measurement?

[-]Alex_Altair3y31

I'd prefer the S'wentworth Law of Measurement

[-]Shmi3y20

Hah, I was just guessing at the real name.

[-]corey morris2y30

One of the key statements made in this post is that measuring more stuff is better than measuring less stuff. Have your beliefs on that updated at all since the original post? What evidence would cause you to become more certain or less certain of this claim?

[-]Alex Tolley3y2-4

The measurement of lots of other things leads to the pathology of data mining rather than trying to find the correct causative variable. The better experimental technique is to sequentially investigate each confounding variable and try to ensure they are eliminated. Sometimes this can be hard, but that is no excuse not to do properly CONTROLLED experiments rather than reporting noise.

Data mining is so problematic that medical journals have insisted that the experimental hypothesis be defined in advance, so that unexpected variables with significant p-values are not reported instead (p-hacking).

I would far rather experimenters do the 1-bit experiment and then if the result doesn't falsify the hypothesis, think about other explanations for the result and check those variables in the same way. Good experimentation is not for the lazy.
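The data-mining worry can be made concrete with a small simulation. This is a hedged sketch, not anything from the thread: it assumes 20 independent, normally distributed outcome variables with no real effect in any of them, and the function names are made up for illustration. It estimates how often a study finds at least one "significant" variable, with and without a Bonferroni correction (a standard adjustment swapped in here as one example).

```python
import math
import random
import statistics


def abs_z(a, b):
    """Absolute z statistic for a difference in means between two groups."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    return abs(statistics.mean(a) - statistics.mean(b)) / se


def null_study(rng, n_outcomes, n_per_group):
    """One study measuring n_outcomes variables, none of which has a real effect."""
    stats = []
    for _ in range(n_outcomes):
        a = [rng.gauss(0, 1) for _ in range(n_per_group)]
        b = [rng.gauss(0, 1) for _ in range(n_per_group)]
        stats.append(abs_z(a, b))
    return stats


def share_with_a_hit(alpha, n_outcomes=20, n_per_group=50, n_studies=500, seed=1):
    """Fraction of null studies where at least one outcome passes a two-sided test."""
    threshold = statistics.NormalDist().inv_cdf(1 - alpha / 2)
    rng = random.Random(seed)
    hits = sum(
        any(z > threshold for z in null_study(rng, n_outcomes, n_per_group))
        for _ in range(n_studies)
    )
    return hits / n_studies


if __name__ == "__main__":
    print("uncorrected:", share_with_a_hit(0.05))
    print("Bonferroni: ", share_with_a_hit(0.05 / 20))
```

With 20 independent tests at the 0.05 level, the chance of at least one false positive is about 1 - 0.95^20, or roughly 64%, which is why reporting whichever variable happens to cross the threshold is unreliable. The correction restores the study-wide error rate at the cost of power, so it addresses the reporting problem without settling whether measuring many things is worthwhile for exploration.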

[-]DirectedEvolution3y53

Sometimes this can be hard, but that is no excuse not to do properly CONTROLLED experiments rather than reporting noise.

Sometimes, people accept the hard work of doing properly CONTROLLED experiments. Then they not only FAIL COMPLETELY, they think they succeeded - and everybody else is tricked into believing it too. That's what happened in John's A/B test.

For unsolved problems, you can only find the correct causative variable with a "firehose of information." Then you can go on to prove you're right via a properly controlled experiment.

A classic unreplicable p-hacked study is the one where they found relationships between a discount on an Italian buffet and what the diners ate.

If I were one of those researchers, I wouldn't mind gathering the data. That's the firehose. But I wouldn't want to publish it. Instead, I'd scour the data, combine it with my qualitative observations of the event, and see if we could come up with a specific pre-registerable causal hypothesis we could actually believe in for a follow-up study. Only the follow-up would be published.

[-]Alex Tolley3y21

"For unsolved problems, you can only find the correct causative variable with a "firehose of information." Then you can go on to prove you're right via a properly controlled experiment."

That second part often doesn't happen. For [bio]medical experiments it is just too expensive. Data mining ensues, and any variables with significant p-values are then published. The medical journals are rife with this, which is one reason 30-50% of medical research proves unrepeatable.

Never underestimate human nature to do the easiest thing rather than the correct one. Science can be painstakingly hard to get right, but the pressures to publish are high. I've seen it first hand in biotech, where the obvious questions to ask of the "result" were ignored.

[-]DirectedEvolution3y102

I also am in biotech, and I agree these problems exist.

One way of making use of the "firehose of information" in biotech would be to insist that researchers publish their raw datasets, and provide additional supplementary information along with their paper. Imagine, for example, if researchers doing animal work were required to film themselves doing it and post the videos online for others to review. I think it's easy to see how that would be a helpful "firehose of information" and would do a lot to flesh out the picture given by the normally reported figures in a publication.

I think you're worried about people switching from hard analysis to squishier qualitative data, perhaps because resources are already so constrained that it feels like "one or the other." I think John's saying "why not both?"

[-]Czynski8mo10

Doesn't this imply that having a theory of the domain you're experimenting in is of low to no value? I find that hard to believe, and therefore doubt your assumptions are correct and applicable.

^{^}

The numbers in the latency story are pulled out of my ass; I don’t remember what they actually were other than that the latency effects were far larger than anything else we’d seen. Consider the story qualitatively true, but fictional in the quantitative details.
