
Often I want to form a quick impression as to whether it is worth me analysing a given paper in more detail. A couple of quick calculations can go a long way. Some of this will be obvious but I've tried to give the approximate thresholds for the results which up until now I've been using subconsciously. I'd be very interested to hear other people's thresholds.

## Calculations

• Calculate how many p-values (could) have been calculated.
  • If the study and analysis techniques were pre-registered then count how many p-values were calculated.
  • If the study was not pre-registered, calculate how many different p-values could have been calculated (had the data looked different) which would have been equally justified as the ones that they did calculate (see Gelman’s garden of forking paths). This depends on how aggressive any hacking has been but roughly speaking I'd calculate:
    • Number of input variables (including interactions) x Number of measurement variables
• Calculate expected number of type I errors
  • Multiply answer from previous step by the threshold p-value of the paper
  • Different results may have different thresholds which makes life a little more complicated
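
The two steps above can be sketched in a few lines of Python. This is only an illustration: the function names and the example counts (4 input variables, 3 measurement variables) are made up, and the expected count assumes every null hypothesis is true. Summing per-test thresholds is one way to handle results that use different thresholds.

```python
def forking_paths_count(n_inputs: int, n_measures: int) -> int:
    """Rough count of p-values that could have been calculated
    (input variables, including interactions, times measurement variables)."""
    return n_inputs * n_measures


def expected_type_i_errors(thresholds) -> float:
    """Expected number of false positives if every null is true:
    each test contributes its own significance threshold."""
    return sum(thresholds)


# Hypothetical study: 4 input variables, 3 outcome measures, all tested at 0.05.
n_tests = forking_paths_count(n_inputs=4, n_measures=3)
expected = expected_type_i_errors([0.05] * n_tests)
print(n_tests, round(expected, 2))
```

With a single threshold this reduces to the number of p-values multiplied by the threshold, as described in the list.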

• Estimate Cohen’s d for the experiment (without looking at the actual result!)
  • One option in estimating effect size is to not consider the specific intervention, but just to estimate how easy the target variable is to move for any intervention – see putanumonit for a more detailed explanation. I wouldn't completely throw away my prior on how effective the particular intervention in question is, but I do consider it helpful advice to not let my prior act too powerfully.
• Calculate experimental power
  • You can calculate this properly, or alternatively use Lehr’s formula. Sample size equations for different underlying distributions can be found here.
  • To get power > 0.8 we require a sample size per group of $n = 16 / d^2$, where $d$ is the estimated Cohen’s d.
    • This is based on $\alpha = 0.05$, a single p-value calculated, 2 samples of equal size, and a 2-tailed t-test.
    • A modification to this rule to account for multiple p-values would be to add 3.25 to the numerator for each doubling of the number of p-values counted in the first step.
    • If sample sizes are very unequal (ratio of >10) then the number required in the smaller sample is the above calculation divided by 2. This also works for single sample tests against a fixed value.
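
As a rough sketch of the rule of thumb above, the function below combines Lehr's numerator of 16, the 3.25-per-doubling adjustment, and the halving for very unequal samples. The function and parameter names are hypothetical. Using `log2` to count doublings (so that non-powers of two get a fractional adjustment) and rounding up are my assumptions, not part of the original rule.

```python
import math


def lehr_sample_size(d: float, n_pvalues: int = 1, very_unequal: bool = False) -> int:
    """Approximate per-group sample size for power > 0.8 at alpha = 0.05.

    d            -- estimated Cohen's d (before looking at the paper's result)
    n_pvalues    -- number of p-values that were (or could have been) calculated
    very_unequal -- True if group sizes differ by a ratio of more than 10,
                    in which case the result applies to the smaller group
    """
    if d <= 0:
        raise ValueError("d must be positive")
    if n_pvalues < 1:
        raise ValueError("n_pvalues must be at least 1")
    # Assumption: 'each doubling' is interpolated with log2.
    numerator = 16 + 3.25 * math.log2(n_pvalues)
    n = numerator / d ** 2
    if very_unequal:
        n /= 2
    return math.ceil(n)


print(lehr_sample_size(0.5))                     # medium effect, one p-value
print(lehr_sample_size(0.5, n_pvalues=8))        # same effect, 8 p-values
print(lehr_sample_size(0.5, very_unequal=True))  # smaller of two very unequal groups
```

Because the required N scales with 1/d², halving the assumed effect size quadruples the required sample, which is why the estimate of d matters far more than the exact numerator.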

## Thresholds

Roughly speaking, if the expected number of type I errors is above 0.25 I’ll write the study off; between 0.05 and 0.25 I’ll be suspicious. If multiple significant p-values are found this gets a bit tricky due to non-independence of the p-values, so more investigation may be required.

If the sample size is sufficient for power > 0.8 then I’m happy. If it comes out below then I’m suspicious and have to check whether my estimate of Cohen’s d is reasonable. If I'm still convinced N is a long way from being large enough I'll write the study off. Obviously, as the paper has been published, the calculated Cohen’s d is large enough to get a significant result, but the question is whether I believe that the calculated effect size is reasonable.
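
These thresholds could be written down as a small screening function. This is a sketch only: the name `triage` and the `far_below` cutoff are assumptions. The value 0.5 is borrowed from the "less than half the required sample size" figure in the Test section, and it stands in for the judgement call of re-checking the estimate of Cohen's d before writing a study off.

```python
def triage(expected_type_i: float, n_per_group: float, n_required: float,
           far_below: float = 0.5) -> str:
    """First-pass verdict: 'write off', 'suspicious' or 'ok'.

    expected_type_i -- expected number of type I errors
    n_per_group     -- actual sample size per group
    n_required      -- per-group size needed for power > 0.8
    far_below       -- assumed cutoff for 'a long way from large enough'
    """
    if n_required <= 0:
        raise ValueError("n_required must be positive")
    ratio = n_per_group / n_required
    if expected_type_i > 0.25 or ratio < far_below:
        return "write off"
    if expected_type_i >= 0.05 or ratio < 1.0:
        return "suspicious"
    return "ok"


print(triage(expected_type_i=0.05, n_per_group=100, n_required=64))
print(triage(expected_type_i=0.01, n_per_group=20, n_required=64))
```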

## Test

I tried Lehr’s formula on the 80,000 Hours replication quiz. Of the 21 replications, my calculation gave a decisive answer for 17 papers, getting them all correct: 9 studies with comfortably oversized samples replicated successfully, and 8 studies with massively undersized samples (less than half the required sample size I calculated) failed to replicate. Of the remaining 4, where the sample sizes were 0.5 – 1.2 x my estimate from Lehr’s equation, all successfully replicated.

(I remembered the answer to most of the replications but tried my hardest to ignore this when estimating Cohen's d.)

Just having a fixed minimum N wouldn’t have worked nearly as well – of the 5 smallest studies only 1 failed to replicate.

1. For reasonable assumptions, if you're studying an interaction then you might need 16x larger samples - see Gelman. Essentially the standard error is doubled for interactions (which alone requires 4 times the sample), and Andrew thinks that interaction effects being half the size of main effects is a good starting point for estimates (another factor of 4), giving 16 times larger samples.

2. When estimating Cohen's d, it is important that you know whether the study is between or within subjects - within subject studies will give much lower standard error and thus require much smaller ...

6 Bucky 6mo

I just came across an example of this which might be helpful.

Essentially getting good grades and having a desk in your room are apparently good predictors of whether you want to go to university or not. The former seemed sensible, the latter seemed like it shouldn't have a big effect size but I wanted to give it a chance.

The paper itself is [here](https://www.tandfonline.com/doi/full/10.1080/03055698.2019.1681941).

Just from the abstract you can tell there are at least 8 input variables so the numerator on Lehr's equation becomes ~26. This means a Cohen's d of 0.1 (which I feel is pretty generous for having a desk in your room) would require 2600 results in each sample. As the samples are unlikely to be of equal size, I would estimate they would need a total of ~10,000 samples for this to have any chance of finding a meaningful result for smaller effect sizes.

The actual number of samples was ~1,000.

At this point I would normally write off the study without bothering to go deeper, the process taking less than 5 minutes.

I was curious to see how they managed to get multiple significant results despite the sample size limitations. It turns out that they decided against reporting p-values because "we could no longer assume randomness of the sample". Instead they report the odds ratio of each result and said that anything with a large ratio had an effect, ignoring any uncertainty of the results.

It turns out there were only 108 students in the no-desk sample. Definitely what Andrew Gelman calls a [Kangaroo](https://statmodeling.stat.columbia.edu/2015/04/21/feather-bathroom-scale-kangaroo/) measurement.

There are a lot of other problems with the paper but just looking at the sample size (even though the sample size was ~1,000) was a helpful check to confidently reject the paper with minimal effort.
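
The arithmetic in this comment can be written out explicitly. Eight input variables are treated as eight p-values, i.e. three doublings, each adding 3.25 to Lehr's numerator of 16. The rounding to 26 follows the comment.

$$
16 + 3.25 \log_2 8 = 25.75 \approx 26, \qquad n \approx \frac{26}{0.1^2} = 2600 \text{ per group}
$$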

# 79

When you look at a paper, what signs cause you to take it seriously? What signs cause you to discard the study as too poorly designed to be much evidence one way or the other?

I'm hoping to compile a repository of heuristics on study evaluation, and would love to hear people's tips and tricks, or their full evaluation-process.

I'm looking for things like...

• "If the n (sample size) is below [some threshold value], I usually don't pay much attention."
• "I'm mostly on the lookout for big effect sizes."
• "I read the abstract, then I spend a few minutes thinking about how I would design the experiment, including which confounds I would have to control for, and how I could do that. Then I read the methods section, and see how their study design compares to my 1-3 minute sketch. Does their design seem sensible? Are they accounting for the first-order-obvious confounds?"
• etc.


waveman

### Oct 30, 2019

44

1. Look at the sponsorship of the research and of the researchers (previous sponsorship, "consultancies" etc. are also important for up to 10-15 years). This creates massive bias. E.g. a lot of medical bodies and researchers are owned by pharmaceutical companies.

2. Look at ideological biases of the authors. E.g. a lot of social science research assumes as a given that genes have no effect on personality or intelligence. (Yes, really).

3. Understand statistics very deeply. There is no pain-free way to get this knowledge, but without it you cannot win here. E.g. a) The assumptions behind all the statistical models b) the limitations of alleged "corrections". You need to understand both Bayesian and Frequentist statistics in depth, to the point that they are obvious and intuitive to you.

4. Understand how researchers rig results. e.g. undisclosed multiple comparisons, peeking at the data before deciding what analysis to do, failing to pre-publish the design and end points and to follow that pre-publication, "run-in periods" for drug trials, sponsor-controlled committees to review and change diagnoses... There are papers about this e.g. "why most published research findings are false".

5. After sponsorship, read the methods section carefully. Look for problems. Have valid and appropriate statistics been used? Were the logical end points assessed? Maybe then look at the conclusions. Do the conclusions match the body of the paper? Has the data from the study been made available to all qualified researchers to check the analysis? Things can change a lot when that happens e.g. Tamiflu. If the data is only available to commercial interests and their stooges, this is a bad sign.

6. Has the study been replicated by independent researchers?

7. Is the study observational? If so, does it meet generally accepted criteria for valid observational studies? (large effect, dose-response gradient, well understood causal model, well understood confounders, confounders smaller than the published effect etc).

8. Do not think you can read abstracts only and learn much that is useful.

9. Read some of the vitriolic books about the problems in research e.g. "Deadly Medicines and Organised Crime: How Big Pharma Has Corrupted Healthcare" by Peter C. Gøtzsche. Not everything in this book is true but it will open your eyes about what can happen.

10. Face up to the fact that 80-90% of studies are useless or wrong. You will spend a lot of time reading things only to conclude that there is not much there.

Unnamed

### Nov 12, 2019

31

Context: My experience is primarily with psychology papers (heuristics & biases, social psych, and similar areas), and it seems to generalize pretty well to other social science research and fields with similar sorts of methods.

1. Is this "result" just noise? Or would it replicate?

2. (If there's something besides noise) Is there anything interesting going on here? Or are all the "effects" just confounds, statistical artifacts, demonstrating the obvious, etc.

3. (If there is something interesting going on here) What is going on here? What's the main takeaway? What can we learn from this? Does it support the claim that some people are tempted to use it to support?

There is some benefit just to explicitly considering all three questions, and keeping them separate.

For #1 ("Is this just noise?") people apparently do a pretty good job of predicting which studies will replicate. Relevant factors include:

1a. How strong is the empirical result (tiny p value, large sample size, precise estimate of effect size, etc.).

1b. How plausible is this effect on priors? Including: How big an effect size would you expect on priors? And: How definitively does the researchers' theory predict this particular empirical result?

1c. Experimenter degrees of freedom / garden of forking paths / possibility of p-hacking. Preregistration is best, visible signs of p-hacking are worst.

1d. How filtered is this evidence? How much publication bias?

1e. How much do I trust the researchers about things like (c) and (d)?

I've found that this post on how to think about whether a replication study "failed" also seems to have helped clarify my thinking about whether a study is likely to replicate.

If there are many studies of essentially the same phenomenon, then try to find the methodologically strongest few and focus mainly on those. (Rather than picking one study at random and dismissing the whole area of research if that study is bad, or assuming that just because there are lots of studies they must add up to solid evidence.)

If you care about effect size, it's also worth keeping in mind that the things which turn noise into "statistically significant results" also tend to inflate effect sizes.

For #2 ("Is there anything interesting going on here?"), understanding methodology & statistics is pretty central. Partly that's background knowledge & expertise that you keep building up over the years, partly that's taking the time & effort to sort out what's going on in this study (if you care about this study and can't sort it out quickly), sometimes you can find other writings which comment on the methodology of this study which can help a lot. You can try googling for criticisms of this particular study or line of research (or check google scholar for papers that have cited it), or google for criticisms of specific methods they used. It is often easier to recognize when someone makes a good argument than to come up with that argument yourself.

One framing that helps me think about a study's methodology (and whether or not there's anything interesting going on here) is to try to flesh out "null hypothesis world": in the world where nothing interesting is going on, what would I expect to see come out of this experimental process? Sometimes I'll come up with more than one world that feels like a null hypothesis world. Exercise: try that with this study (Egan, Santos, Bloom 2007). Another exercise: Try that with the hot hand effect.
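
As one illustration of fleshing out a null hypothesis world (not necessarily the one the exercise has in mind), take a shooter whose shots are independent fair coin flips and ask what "proportion of hits following a hit" looks like in short sequences. The sketch below enumerates every sequence of four flips exactly. The function names are hypothetical, and sequences with no flip following a head are dropped because the proportion is undefined for them.

```python
from fractions import Fraction
from itertools import product


def prop_heads_after_heads(seq):
    """Proportion of heads among flips that immediately follow a head.
    Returns None when no flip follows a head."""
    followers = [seq[i + 1] for i in range(len(seq) - 1) if seq[i] == "H"]
    if not followers:
        return None
    return Fraction(followers.count("H"), len(followers))


def expected_proportion(n_flips: int) -> Fraction:
    """Average of the per-sequence proportion over all equally likely
    sequences of fair coin flips where it is defined."""
    if n_flips < 2:
        raise ValueError("need at least 2 flips")
    props = [prop_heads_after_heads(s) for s in product("HT", repeat=n_flips)]
    props = [p for p in props if p is not None]
    return sum(props, Fraction(0)) / len(props)


print(expected_proportion(4))  # 17/42, i.e. about 0.40 rather than 0.50
```

Even with no hot hand at all, averaging this per-sequence statistic gives less than one half, so "no effect" in this null world does not show up as 50%.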

#3 ("What is going on here?") is the biggest/broadest question of the three. It's the one that I spend the most time on (at least if the study is any good), and it's the one that I could most easily write a whole bunch about (making lots of points and elaborating on them). But it's also the one that is the most distant from Eli's original question, and I don't want to turn this post into a big huge essay, so I'll just highlight a few things here.

A big part of the challenge is thinking for yourself about what's going on and not being too anchored on how things are described by the authors (or the press release or the person who told you about the study). Some moves here:

3a. Imagine (using your inner sim) being a participant in the study, such that you can picture what each part of the study was like. In particular, be sure that you understand every experimental manipulation and measurement in concrete terms (okay, so then they filled out this questionnaire which asked if you agree with statements like such-and-such and blah-blah-blah).

3b. Be sure you can clearly state the pattern of results of the main finding, in a concrete way which is not laden with the authors' theory (e.g. not "this group was depleted" but "this group gave up on the puzzles sooner"). You need this plus 3a to understand what happened in the study, then from there you're trying to draw inferences about what the study implies.

3c. Come up with (one or several) possible models/theories about what could be happening in this study. Especially look for ones that seem commonsensical / that are based in how you'd inner sim yourself or other people in the experimental scenario. It's fine if you have a model that doesn't make a crisp prediction, or if you have a theory that seems a lot like the authors' theory (but without their jargon). Exercise: try that with a typical willpower depletion study.

3d. Have in mind the key takeaway of the study (e.g., the one sentence summary that you would tell a friend; this is the thing that's the main reason why you're interested in reading the study). Poke at that sentence to see if you understand what each piece of it means. As you're looking at the study, see if that key takeaway actually holds up. e.g., Does the main pattern of results match this takeaway or do they not quite match up? Does the study distinguish the various models that you've come up with well enough to strongly support this main takeaway? Can you edit the takeaway claim to make it more precise / to more clearly reflect what happened in the study / to make the specifics of the study unsurprising to someone who heard the takeaway? What sort of research would it take to provide really strong support for that takeaway, and how does the study at hand compare to that?

3e. Look for concrete points of reference outside of this study which resemble the sort of thing the researchers are talking about. Search in particular for ones that seem out-of-sync with this study. e.g., This study says not to tell other people your goals, but the other day I told Alex about something I wanted to do and that seemed useful; do the specifics of this experiment change my sense of whether that conversation with Alex was a good idea?

Some narrower points which don't neatly fit into my 3-category breakdown:

A. If you care about effect sizes then consider doing a Fermi estimate, or otherwise translating the effect size into numbers that are intuitively meaningful to you. Also think about the range of possible effect sizes rather than just the point estimate, and remember that the issues with noise in #1 also inflate effect size.

B. If the paper finds a null effect and claims that it's meaningful (e.g., that the intervention didn't help) then you do care about effect sizes. (e.g., If it claims the intervention failed because it had no effect on mortality rates, then you might assume a value of $10M per life and try to calculate a 95% confidence interval on the value of the intervention based solely on its effect on mortality.)

C. New papers that claim to debunk an old finding are often right when they claim that the old finding has issues with #1 (it didn't replicate) or #2 (it had methodological flaws) but are rarely actually debunkings if they claim that the old finding has issues with #3 (it misdescribes what's really going on). The new study on #3 might be important and cause you to change your thinking in some ways, but it's generally an incremental update rather than a debunking.

Examples that look to me like successful debunkings: behavioral social priming research (#1), the Dennis-dentist effect (#2), the hot hand fallacy (#2 and some of B), the Stanford Prison Experiment (closest to #2), various other things that didn't replicate (#1).

Examples of alleged "debunkings" which seem like interesting but overhyped incremental research: the bystander effect (#3), loss aversion (this study) (#3), the endowment effect (#3).

Bucky

### Nov 12, 2019

28

Often I want to form a quick impression as to whether it is worth me analysing a given paper in more detail. A couple of quick calculations can go a long way. Some of this will be obvious but I've tried to give the approximate thresholds for the results which up until now I've been using subconsciously. I'd be very interested to hear other people's thresholds.

## Calculations

• Calculate how many p-values (could) have been calculated.
  • If the study and analysis techniques were pre-registered then count how many p-values were calculated.
  • If the study was not pre-registered, calculate how many different p-values could have been calculated (had the data looked different) which would have been equally justified as the ones that they did calculate (see Gelman’s garden of forking paths). This depends on how aggressive any hacking has been but roughly speaking I'd calculate:
    • Number of input variables (including interactions) x Number of measurement variables
• Calculate expected number of type I errors
  • Multiply answer from previous step by the threshold p-value of the paper
  • Different results may have different thresholds which makes life a little more complicated

• Estimate Cohen’s d for the experiment (without looking at the actual result!)
  • One option in estimating effect size is to not consider the specific intervention, but just to estimate how easy the target variable is to move for any intervention – see putanumonit for a more detailed explanation. I wouldn't completely throw away my prior on how effective the particular intervention in question is, but I do consider it helpful advice to not let my prior act too powerfully.
• Calculate experimental power
  • You can calculate this properly, or alternatively use Lehr’s formula. Sample size equations for different underlying distributions can be found here.
  • To get power > 0.8 we require a sample size per group of $n = 16 / d^2$, where $d$ is the estimated Cohen’s d.
    • This is based on $\alpha = 0.05$, a single p-value calculated, 2 samples of equal size, and a 2-tailed t-test.
    • A modification to this rule to account for multiple p-values would be to add 3.25 to the numerator for each doubling of the number of p-values counted in the first step.
    • If sample sizes are very unequal (ratio of >10) then the number required in the smaller sample is the above calculation divided by 2. This also works for single sample tests against a fixed value.

## Thresholds

Roughly speaking, if the expected number of type I errors is above 0.25 I’ll write the study off; between 0.05 and 0.25 I’ll be suspicious.
If multiple significant p-values are found this gets a bit tricky due to non-independence of the p-values, so more investigation may be required.

If the sample size is sufficient for power > 0.8 then I’m happy. If it comes out below then I’m suspicious and have to check whether my estimate of Cohen’s d is reasonable. If I'm still convinced N is a long way from being large enough I'll write the study off. Obviously, as the paper has been published, the calculated Cohen’s d is large enough to get a significant result, but the question is whether I believe that the calculated effect size is reasonable.

## Test

I tried Lehr’s formula on the 80,000 Hours replication quiz. Of the 21 replications, my calculation gave a decisive answer for 17 papers, getting them all correct: 9 studies with comfortably oversized samples replicated successfully, and 8 studies with massively undersized samples (less than half the required sample size I calculated) failed to replicate. Of the remaining 4, where the sample sizes were 0.5 – 1.2 x my estimate from Lehr’s equation, all successfully replicated.

(I remembered the answer to most of the replications but tried my hardest to ignore this when estimating Cohen's d.)

Just having a fixed minimum N wouldn’t have worked nearly as well – of the 5 smallest studies only 1 failed to replicate.

jimrandomh

### Nov 02, 2019

28

1. For health-related research, one of the main failure modes I've observed when people I know try to do this is tunnel vision and a lack of priors about what's common and relevant. Reading raw research papers before you've read broad-overview stuff will make this worse, so read UpToDate first and Wikipedia second. If you must read raw research papers, find them with PubMed, but do this only rarely and only with a specific question in mind.

2. Before looking at the study itself, check how you got there.
If you arrived via a search engine query that asked a question or posed a topic without presupposing an answer, that's good; if there are multiple studies that say different things, you've sampled one of them at random. If you arrived via a query that asked for confirmation of a hypothesis, that's bad; if there are multiple studies that said different things, you've sampled in a way that was biased towards that hypothesis. If you arrived via a news article, that's the worst; if there are multiple studies that said different things, you sampled in a way that was biased opposite reality.

3. Don't bother with studies in rodents, animals smaller than rodents, cell cultures, or undergraduate psychology students. These studies are done in great numbers because they are cheap, but they have low average quality. The fact that they are so numerous makes the search-sampling problems in (2) more severe.

4. Think about what a sensible endpoint or metric would be before you look at what endpoint/metric was reported. If the reported metric is not the metric you expected, this will often be because the relevant metric was terrible. Classic examples are papers about battery technologies reporting power rather than capacity, and biomedical papers reporting effects on biomarkers rather than symptoms or mortality.

5. Correctly controlling for confounders is much, much harder than people typically give it credit for. Adding extra things to the list of things controlled for can create spurious correlations, and study authors are not incentivized to handle this correctly. The practical upshot is that observational studies only count if the effect size is very large.

Elizabeth

### Nov 10, 2019

24

One tactic I like to use is "how do they know this?", and asking myself or investigating if it's possible for their answer to demonstrate the thing they're claiming.

A lot of work doesn't tell you.
Those aren't necessarily wrong, because they might have a good answer they're not incentivized to share, but at a minimum it's going to make it hard to learn from the work.

A lot of work claims to tell you, but when you look they are lying. For example, when I investigated the claim humans could do 4 hours of thought-work per day, I looked up the paper's citations, and found they referred to experiments of busy work. Even if those studies were valid, they couldn't possibly prove anything about thought-work. I consider "pretending to have sources and reasons" a worse sin than "not giving a source or reason".

More ambiguously, I spent a lot of time trying to figure out how much we could tell and at what resolution from ice core data. I still don't have a great answer on this for the time period I was interested in. But I learned enough to know that the book I was reading (The Fate of Rome) was presenting data as more clear cut than it was.

On the other end, The Fall of Rome spends a lot of time explaining why pottery is useful in establishing economic and especially trade status of an area/era. This was pretty hard to verify from external sources because it's original research from the author, but it absolutely makes sense and produces a lot of claims and predictions that could be disproved. Moreover, none of the criticism I found of Fall of Rome addressed his points on pottery - no one was saying "well I looked at Roman pottery and think the quality stayed constant through the 600s".

Kristin Lindquist

### Nov 02, 2019

18

Already many good answers, but I want to reinforce some and add others.

1. Beware of multiplicity - does the experiment include a large number of hypotheses, explicitly or implicitly? Implicit hypotheses include "Does the intervention have an effect on subjects with attributes A, B or C?" (subgroups) and "Does the intervention have an effect that is shown by measuring X, Y or Z?" (multiple endpoints).
If multiple hypotheses were tested, were the results for each diligently reported? Note that multiplicity can be sneaky and you're often looking for what was left unsaid, such as a lack of plausible mechanism for the reported effect.

For example, take the experimental result "Male subjects who regularly consume Vitamin B in a non-multi-vitamin form have a greater risk of developing lung cancer (irrespective of dose)." Did they *intentionally* hypothesize that vitamin B would increase the likelihood of cancer, but only if 1) it was not consumed as part of a multi vitamin and 2) in a manner that was not dose-dependent? Unlikely! The real conclusion of this study should have been "Vitamin B consumption does not appear correlated to lung cancer risk. Some specific subgroups did appear to have a heightened risk, but this may be a statistical anomaly."

2. Beware of small effect sizes and look for clinical significance - does the reported effect sound like something that matters? Consider the endpoint (e.g. change in symptoms of depression, as measured by the Hamilton Depression Rating Scale) and the effect size (e.g. d = 0.3, which is generally interpreted as a small effect). As a depressive person, I don't really care about a drug that has a small effect size.* I don't care if the effect is real but small or not real at all, because I'm not going to bother with that intervention. The "should I care" question cuts through a lot of the bullshit, binary thinking and the difficulty in interpreting small effect sizes (given their noisiness).

3. Beware of large effect sizes - lots of underpowered studies + publication bias = lots of inflated effect sizes reported. Andrew Gelman's "Type M" (magnitude) errors are a good way to look at this - an estimate of how inflated the effect size is likely to be. However, this isn't too helpful unless you're ready to bust out R when reading research.
Alternately, a good rule of thumb is to be skeptical of 1) large effect sizes reported from small N studies and 2) confidence intervals wide enough to drive a truck through.

4. Beware of low prior odds - is this finding in a highly exploratory field of research, and itself rather extraordinary? IMO this is an under-considered conclusion of Ioannidis' famous "Why Most Published Research Findings are False" paper. This Shinyapp nicely illustrates "positive predictive value" (PPV), which takes into account bias & prior odds.

5. Consider study design - obviously look for placebo control, randomization, blinding etc. But also look for repeated measures designs, e.g. "crossover" designs. Crossover designs achieve far higher power with fewer participants. If you're eyeballing study power, keep this in mind.

6. Avoid inconsistent skepticism - for one, don't be too skeptical of research just because of its funding source. All researchers are biased. It's small potatoes $$$-wise compared to a Pfizer, but postdoc Bob's career/identity is on the line if he doesn't publish. Pfizer may have $3 billion on the line for their Phase III clinical trial, but if Bob can't make a name for himself, he's lost a decade of his life and his career prospects. Then take Professor Susan who built her career on Effect X being real - what were those last 30 years for, if Effect X was just an anomaly?

Instead, look at 1) the quality of the study design, 2) the quality and transparency of the reporting (including COI disclosures, preregistrations, the detail and organization in said preregistrations, etc).

7. Learn to love meta-analysis - Where possible, look at meta-analyses rather than individual studies. But beware: meta-analyses can suffer their own design flaws, leading to some people saying "lies, damn lies and meta-analysis." Cochrane is the gold standard. If they have a meta-analysis for the question at hand, you're in luck. Also, check out the GRADE criteria - a pragmatic framework for evaluating the quality of research used by Cochrane and others.

*unless there is high heterogeneity in the effect amongst a subgroup with whom I share attributes, which is why subgrouping is both hazardous and yet still important.
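
For point 3, the inflation of significant estimates in underpowered studies can be eyeballed with a short simulation instead of R. This is a rough sketch under stated assumptions, not Gelman's own procedure: the function name is hypothetical, outcomes are taken to have unit variance, the estimate of d is treated as normally distributed with standard error sqrt(2/n), and significance is a two-sided z-test at 1.96.

```python
import math
import random


def type_m_exaggeration(true_d: float, n_per_group: int,
                        n_sims: int = 20000, z_crit: float = 1.96,
                        seed: int = 0):
    """Average factor by which statistically significant estimates
    overstate a true standardized effect `true_d` (must be > 0).
    Returns None if no simulated study reached significance."""
    if true_d <= 0:
        raise ValueError("true_d must be positive")
    rng = random.Random(seed)
    se = math.sqrt(2.0 / n_per_group)  # SE of a difference in means, unit variance
    significant = []
    for _ in range(n_sims):
        estimate = rng.gauss(true_d, se)  # one simulated study's effect estimate
        if abs(estimate) > z_crit * se:   # the 'published' (significant) subset
            significant.append(abs(estimate))
    if not significant:
        return None
    return (sum(significant) / len(significant)) / true_d


print(type_m_exaggeration(true_d=0.2, n_per_group=20))    # small study
print(type_m_exaggeration(true_d=0.2, n_per_group=2000))  # large study
```

In the small-study case any significant estimate must exceed 1.96 standard errors, which is already about three times the true effect, so a large reported effect from a small N is close to guaranteed to be inflated.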

Charlie Steiner

### Oct 30, 2019

12

Here's an answer for condensed matter physics:

Step 1: Read the title, journal name, author list, and affiliations.

By reading papers in a field, talking to people in the field, and generally keeping track of the field as a social enterprise, you should be able to place papers in a context even before reading them. People absolutely have reputations, and that should inform your priors. You should also have an understanding of what the typical research methods are to answer a certain question - check either the title or the abstract to make sure that the methods used match the problem.

Actually, you know what?

Step 0: Spend years reading papers and keeping track of people to develop an understanding of trust and reputation as various results either pan out or don't. Read a few textbooks to understand the physical basis of the commonly-used experimental and theoretical techniques, then check that understanding by reading more papers and keeping track of what kind of data quality is the standard in the field, how techniques are best applied, and which techniques and methods of analysis provide the most reliable results.

For example, by combining steps 0 and 1, you can understand that certain experimental techniques might be more difficult and easier to fool yourself with, but might be the best method available for answering some specific question. If you see a paper applying this technique to this sort of question, this actually should increase your confidence in the paper relative to the base rate for this technique, because it shows that the authors are exercising good judgment. Next...

Step 2: Read the abstract and look at the figures.

This is good for understanding the paper too, not just evaluating trustworthiness. Look for data quality (remember that you learned how to judge the data quality of the most common techniques in step 0) and whether they've presented it in a way that clearly backs up the core claims of the abstract, or presents the information you're trying to learn from the paper. Data that is merely suggestive of the authors' claims is actually a red flag, because remember, everyone just presents the nicest figure they can. Responsible scientists reduce their claims when the evidence is weak.

If you have specific parts you know you care about, you can usually just read those in detail and skim the rest. But if you really care about assessing this particular paper, check the procedures and compare it to your knowledge of how this sort of work should go. If there are specific parts that you want to check yourself, and you can do so, do so. This is also useful so you can...

Step 4: Compare it to similar papers.

You should have background knowledge, but it's also useful to keep similar papers (both in terms of what methods they used, and what problem they studied) directly on hand if you want to check something. If you know a paper that did a similar thing, use that to check their methods. Find some papers on the same problem and cross-check how they present the details of the problem and the plausibility of various answers, to get a feel for the consensus. Speaking of consensus, if there are two similar papers from way in the past that you found via Google Scholar and one of them has 10x the citations of the other, take that into account. When you notice confusing statements, you can check those similar papers to see how they handled it. But once you're really getting into the details, you'll have to...

Step 5: Follow up citations for things you don't understand or want to check.

If someone is using a confusing method or explanation, there should be a nearby citation. If not, that's a red flag. Find the citation and check whether it supports the claim in the original paper (recursing if necessary). Accept that this will require lots of work and thinking, but hey, at least this feeds back into step 0 so you don't have to do it as much next time.

There are smart people out there. Hopefully you know some, so that if something seems surprising and difficult to understand, you can ask them what they think about it.

romeostevensit

### Oct 30, 2019

12

Sample size is related to how big an effect size you should be surprised by, i.e. power. Big effect sizes in smaller populations = less surprising. Why is there no overall rule of thumb? Because it gets modified a bunch by the base rate of what you're looking at and some other stuff I'm not remembering off the top of my head.
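To make the sample size / effect size / power link concrete, here is a minimal sketch. It assumes the simplest textbook setup (two equal-sized groups, a two-sided test at alpha = 0.05, and a normal approximation to the t-test), which is my assumption rather than anything specific to a given study. The function names are made up for illustration. `lehr_sample_size` uses Lehr's rule of thumb (roughly 16 / d² per group for 80% power), and `approx_power` shows how quickly power drops when the sample is small.

```python
import math
from statistics import NormalDist


def lehr_sample_size(d: float) -> int:
    """Lehr's rule of thumb: per-group n for ~80% power at two-sided alpha = 0.05."""
    return math.ceil(16 / d ** 2)


def approx_power(d: float, n_per_group: int, alpha: float = 0.05) -> float:
    """Normal-approximation power for a two-sided, two-sample test with equal groups.

    d is the assumed true standardized effect size (Cohen's d).
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)          # critical value, about 1.96 for alpha = 0.05
    shift = d * math.sqrt(n_per_group / 2)     # expected test statistic under the assumed effect
    # Probability of landing beyond either critical value.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


if __name__ == "__main__":
    # Illustrative effect sizes only, not taken from any study.
    for d in (0.2, 0.5, 0.8):
        n = lehr_sample_size(d)
        print(f"d={d}: ~{n} per group, approx power {approx_power(d, n):.2f}")
    print(f"d=0.5 with only 20 per group: power {approx_power(0.5, 20):.2f}")
```

The practical reading: with 20 people per group and a true d of 0.5, the study would miss the effect more often than it finds it, so a significant result from a sample that small tells you less than the p-value alone suggests.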

In general I'd say there's enough methodological diversity that there's a lot of stuff I'm looking for as flags that a study wasn't designed well. For examples of such you can look at the inclusion criteria for meta-analyses.

There's also more qualitative things about how much I'm extrapolating based on the discussion section by the study authors. In the longevity posts for example, I laud a study for having a discussion section where the authors explicitly spend a great deal of time talking about what sorts of things are *not* reasonable to conclude from the study even though they might be suggestive for further research directions.

Confounds are kinda like building a keyword map. I'm looking at the most well regarded studies in a domain, noting down what they're controlling for, then discounting studies that aren't controlling for them to varying degrees. This is another place where qualitative judgements creep in, even in Cochrane reviews, where they are forced to just develop ad hoc 'tiers' of evidence (like A, B, C etc) and give some guidelines for doing so.

I have higher skepticism in general than I did years ago as I have learned about the number of ways that effects can sneak into the data despite honest intention by moderately competent scientists. I'm also much more aware of a fundamental problem with selection effects in that anyone running a study has some vested interest in framing hypotheses in various ways because nobody devotes themselves to something about which they're completely disinterested. This shows up as a problem in your own evaluation in that it's almost impossible to not sneak in isolated demands for rigor based on priors.

I'm also generally reading over the shoulder of whichever other study reviewers seem to be doing a good job in a domain. Epistemics is a team sport. An example of this is when Scott did a roundup of evidence for low carb diets, mentioning lots of other people doing meta reviews and speculating about why different conclusions were reached, e.g. Luke Muehlhauser and I came down on the side that the VLC evidence seemed weak and Will Eden came down on the side that it seemed more robust, seemingly differing on how much weight we placed on inside view metabolic models vs outside view long term studies.

That's a hot take. It can be hard to just dump top level heuristics vs seeing what comes up from more specific questions/discussion.

whales

### Oct 30, 2019

10

Recapitulating something I've written about before:

You should first make a serious effort to formulate both the specific question you want answered, and why you want an answer. It may turn out surprisingly often that you don't need to do all this work to evaluate the study.

Short of becoming an expert yourself, your best bet is then to learn how to talk to people in the field until you can understand what they think about the paper and why—and also how they think and talk about these things. This is roughly what Harry Collins calls "interactional" expertise. (He takes gravitational-wave scientist Joe Weber's late work as an especially vivid example: "I can promise such lay readers that if they teach themselves a bit of elementary statistics and persevere with reading the paper, they will find it utterly convincing. Scientific papers are written to be utterly convincing; over the centuries their special language and style has been developed to make them read convincingly.... The only way to know that Weber’s paper is not to be read in the way it is written is to be a member of the ‘oral culture’ of the relevant specialist community." The full passage is very good.)

If you only learn from papers (or even textbooks and papers), you won't have any idea what you're missing. A lot of expertise is bound up in individual tacit knowledge and group dynamics that never get written down. This isn't to say that the 'oral culture' is always right, but if you don't have a good grasp of it, you will make at best slow progress as an outsider.

This is the main thing holding me back from running the course I've half-written on layperson evaluation of science. Most of the time, the best thing is just to talk to people. (Cold emails are OK; be polite, concise, and ask a specific question. Grad students tend to be generous with their time if you have an interesting question or pizza and beer. And I'm glad to answer physics questions by LW message.)

Short of talking to people, you can often find blogs in the field of interest. More rarely, you can also find good journalism doing the above kind of work for you. (Quanta is typically good in physics, enough so that I more or less trust them on other subjects.)

There's plenty to be said about primary source evaluation, which varies with field and which the other answers so far get at, but I think this lesson needs to come first.

Elizabeth

### Nov 12, 2019

7

If a psychology study doesn't prominently say who its subjects were, the answer is "undergrads at the university, predominantly those in psychology classes" and it is worthless.

TheMajor

### Oct 30, 2019

4

Already partially mentioned by others, including OP.

I usually start with comparing the conclusion with my expectations (I'm painfully aware that this creates a confirmation bias, but what else am I supposed to compare it with). If they are sufficiently different I try to imagine how, using the method described by the authors, I would be able to get a positive result to their experiment conditional on my priors being true, i.e. their conclusion being false. This is basically the same as trying to figure out how I would run the experiment and which data would disprove my assumptions, and then seeing if the published results fall in that category.

Usually the buck stops there: most published research uses methods that are sufficiently flimsy that (again, conditional on my priors) it is very likely the result was a fluke. This approach is pretty much the same as your third bullet point, and also waveman's point number 5. I would like to stress though that it's almost never enough to have a checklist of "common flaws in method sections" (although again, you have to start somewhere). Unfortunately, different strengths and types of results in different fields require different methods.

A small Bayesian twist on the interpretation of this approach: when you're handed a paper (that doesn't match your expectations), that is evidence of something. I'm specifically looking at the chance that, conditional on my priors being accurate, the paper I'm given is still being published.
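The Bayesian twist can be written out as a small calculation. This is only a sketch: the function name and all three input probabilities are hypothetical numbers chosen for illustration, and in practice estimating "how likely is this paper to exist if my priors are right" is the hard part.

```python
def posterior_priors_right(prior: float,
                           p_paper_if_right: float,
                           p_paper_if_wrong: float) -> float:
    """Bayes' rule: P(my priors are right | this paper was published).

    prior            -- credence that my expectation is correct before seeing the paper
    p_paper_if_right -- chance such a paper still gets published if I am right
                        (fluke result, flimsy method, publication bias)
    p_paper_if_wrong -- chance such a paper gets published if I am wrong
    """
    numerator = prior * p_paper_if_right
    denominator = numerator + (1 - prior) * p_paper_if_wrong
    return numerator / denominator


if __name__ == "__main__":
    # Hypothetical inputs: I start 90% confident, and flimsy methods mean a
    # contradicting paper would appear 30% of the time even if I am right.
    print(round(posterior_priors_right(0.9, 0.3, 0.8), 3))
    # A much more rigorous method makes a fluke far less likely.
    print(round(posterior_priors_right(0.9, 0.02, 0.8), 3))
```

With a flimsy method the paper barely moves a confident prior, while the same headline result from a method that rarely produces flukes should move it a lot.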