I recall hearing about classes at Carnegie Mellon (in the Social and Decision Sciences department) which gave exams in this sort of format.
Related: Integrity for consequentialists by Paul Christiano
Context: My experience is primarily with psychology papers (heuristics & biases, social psych, and similar areas), and it seems to generalize pretty well to other social science research and fields with similar sorts of methods.
One way to think about this is to break it into three main questions:
1. Is this "result" just noise? Or would it replicate?
2. (If there's something besides noise) Is there anything interesting going on here? Or are all the "effects" just confounds, statistical artifacts, demonstrating the obvious, etc.
3. (If there is something interesting going on here) What is going on here? What's the main takeaway? What can we learn from this? Does it support the claim that some people are tempted to use it to support?
There is some benefit just to explicitly considering all three questions, and keeping them separate.
For #1 ("Is this just noise?") people apparently do a pretty good job of predicting which studies will replicate. Relevant factors include:
1a. How strong is the empirical result (tiny p value, large sample size, precise estimate of effect size, etc.)?
1b. How plausible is this effect on priors? Including: How big an effect size would you expect on priors? And: How definitively does the researchers' theory predict this particular empirical result?
1c. Experimenter degrees of freedom / garden of forking paths / possibility of p-hacking. Preregistration is best, visible signs of p-hacking are worst.
1d. How filtered is this evidence? How much publication bias?
1e. How much do I trust the researchers about things like (1c) and (1d)?
I've found that this post on how to think about whether a replication study "failed" also seems to have helped clarify my thinking about whether a study is likely to replicate.
If there are many studies of essentially the same phenomenon, then try to find the methodologically strongest few and focus mainly on those. (Rather than picking one study at random and dismissing the whole area of research if that study is bad, or assuming that just because there are lots of studies they must add up to solid evidence.)
If you care about effect size, it's also worth keeping in mind that the things which turn noise into "statistically significant results" also tend to inflate effect sizes.
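To make the effect size inflation concrete, here is a rough simulation sketch. Everything in it is an assumption chosen for illustration: the true effect of 0.2 standard deviations, the 20 participants per group, the simplified one-directional z-test with a known SD of 1, and the function name.

```python
import random
import statistics

def simulate_significance_filter(true_effect=0.2, n_per_group=20,
                                 n_studies=2000, seed=0):
    """Run many small two-group studies of the same modest true effect and
    compare the average estimate across all studies with the average across
    only the studies that came out 'statistically significant'."""
    rng = random.Random(seed)
    # Standard error of a difference in means when each group has SD 1.
    se = (2 / n_per_group) ** 0.5
    all_estimates = []
    significant_estimates = []
    for _ in range(n_studies):
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        treated = [rng.gauss(true_effect, 1.0) for _ in range(n_per_group)]
        estimate = statistics.fmean(treated) - statistics.fmean(control)
        all_estimates.append(estimate)
        # Simplified filter: keep only results significant in the
        # predicted (positive) direction, z > 1.96.
        if estimate / se > 1.96:
            significant_estimates.append(estimate)
    return (statistics.fmean(all_estimates),
            statistics.fmean(significant_estimates),
            len(significant_estimates))

mean_all, mean_significant, n_significant = simulate_significance_filter()
print(f"average estimate, all studies:      {mean_all:.2f}")
print(f"average estimate, significant only: {mean_significant:.2f}")
print(f"significant studies: {n_significant} of 2000")
```

With these assumed numbers, a study only counts as significant if its estimate exceeds 1.96 times the standard error, which is about 0.62. So every result that passes the filter reports an effect roughly three times the true 0.2 or more, even though the unfiltered average sits near the truth.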
For #2 ("Is there anything interesting going on here?"), understanding methodology & statistics is pretty central. Partly that's background knowledge & expertise that you keep building up over the years, and partly that's taking the time & effort to sort out what's going on in this study (if you care about this study and can't sort it out quickly). Sometimes you can find other writings which comment on the methodology of this study, which can help a lot. You can try googling for criticisms of this particular study or line of research (or check google scholar for papers that have cited it), or google for criticisms of specific methods they used. It is often easier to recognize when someone makes a good argument than to come up with that argument yourself.
One framing that helps me think about a study's methodology (and whether or not there's anything interesting going on here) is to try to flesh out "null hypothesis world": in the world where nothing interesting is going on, what would I expect to see come out of this experimental process? Sometimes I'll come up with more than one world that feels like a null hypothesis world. Exercise: try that with this study (Egan, Santos, Bloom 2007). Another exercise: Try that with the hot hand effect.
#3 ("What is going on here?") is the biggest/broadest question of the three. It's the one that I spend the most time on (at least if the study is any good), and it's the one that I could most easily write a whole bunch about (making lots of points and elaborating on them). But it's also the one that is the most distant from Eli's original question, and I don't want to turn this post into a big huge essay, so I'll just highlight a few things here.
A big part of the challenge is thinking for yourself about what's going on and not being too anchored on how things are described by the authors (or the press release or the person who told you about the study). Some moves here:
3a. Imagine (using your inner sim) being a participant in the study, such that you can picture what each part of the study was like. In particular, be sure that you understand every experimental manipulation and measurement in concrete terms (okay, so then they filled out this questionnaire which asked if you agree with statements like such-and-such and blah-blah-blah).
3b. Be sure you can clearly state the pattern of results of the main finding, in a concrete way which is not laden with the authors' theory (e.g. not "this group was depleted" but "this group gave up on the puzzles sooner"). You need this plus 3a to understand what happened in the study, then from there you're trying to draw inferences about what the study implies.
3c. Come up with (one or several) possible models/theories about what could be happening in this study. Especially look for ones that seem commonsensical / that are based in how you'd inner sim yourself or other people in the experimental scenario. It's fine if you have a model that doesn't make a crisp prediction, or if you have a theory that seems a lot like the authors' theory (but without their jargon). Exercise: try that with a typical willpower depletion study.
3d. Have in mind the key takeaway of the study (e.g., the one sentence summary that you would tell a friend; this is the thing that's the main reason why you're interested in reading the study). Poke at that sentence to see if you understand what each piece of it means. As you're looking at the study, see if that key takeaway actually holds up. e.g., Does the main pattern of results match this takeaway or do they not quite match up? Does the study distinguish the various models that you've come up with well enough to strongly support this main takeaway? Can you edit the takeaway claim to make it more precise / to more clearly reflect what happened in the study / to make the specifics of the study unsurprising to someone who heard the takeaway? What sort of research would it take to provide really strong support for that takeaway, and how does the study at hand compare to that?
3e. Look for concrete points of reference outside of this study which resemble the sort of thing the researchers are talking about. Search in particular for ones that seem out-of-sync with this study. e.g., This study says not to tell other people your goals, but the other day I told Alex about something I wanted to do and that seemed useful; do the specifics of this experiment change my sense of whether that conversation with Alex was a good idea?
Some narrower points which don't neatly fit into my 3-category breakdown:
A. If you care about effect sizes then consider doing a Fermi estimate, or otherwise translating the effect size into numbers that are intuitively meaningful to you. Also think about the range of possible effect sizes rather than just the point estimate, and remember that the issues with noise in #1 also inflate effect size.
B. If the paper finds a null effect and claims that it's meaningful (e.g., that the intervention didn't help) then you do care about effect sizes. (e.g., If it claims the intervention failed because it had no effect on mortality rates, then you might assume a value of $10M per life and try to calculate a 95% confidence interval on the value of the intervention based solely on its effect on mortality.)
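As a sketch of the calculation in B, with made-up numbers: suppose a trial had 1000 people per arm, with 50 deaths in the control arm and 45 in the treatment arm. The trial counts, the normal-approximation interval, and the function name are all assumptions for illustration; only the $10M per life figure comes from the example above.

```python
def value_interval_per_person(deaths_control, n_control,
                              deaths_treated, n_treated,
                              value_per_life=10_000_000, z=1.96):
    """Approximate 95% interval on the dollar value of the intervention per
    treated person, counting only its effect on mortality.
    Uses a normal approximation for the difference in two proportions."""
    risk_control = deaths_control / n_control
    risk_treated = deaths_treated / n_treated
    # Positive risk_reduction means the intervention saved lives.
    risk_reduction = risk_control - risk_treated
    se = (risk_control * (1 - risk_control) / n_control
          + risk_treated * (1 - risk_treated) / n_treated) ** 0.5
    low = (risk_reduction - z * se) * value_per_life
    point = risk_reduction * value_per_life
    high = (risk_reduction + z * se) * value_per_life
    return low, point, high

# Hypothetical trial: 50/1000 deaths in control, 45/1000 in treatment.
low, point, high = value_interval_per_person(50, 1000, 45, 1000)
print(f"value per person: ${point:,.0f} (95% CI ${low:,.0f} to ${high:,.0f})")
```

With these hypothetical counts the difference in mortality is nowhere near statistically significant, but the interval runs from about -$136,000 to about +$236,000 per person. That is the sense in which a "null effect" can fail to rule out an intervention that is extremely valuable (or extremely harmful).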
C. New papers that claim to debunk an old finding are often right when they claim that the old finding has issues with #1 (it didn't replicate) or #2 (it had methodological flaws) but are rarely actually debunkings if they claim that the old finding has issues with #3 (it misdescribes what's really going on). The new study on #3 might be important and cause you to change your thinking in some ways, but it's generally an incremental update rather than a debunking. Examples that look to me like successful debunkings: behavioral social priming research (#1), the Dennis-dentist effect (#2), the hot hand fallacy (#2 and some of B), the Stanford Prison Experiment (closest to #2), various other things that didn't replicate (#1). Examples of alleged "debunkings" which seem like interesting but overhyped incremental research: the bystander effect (#3), loss aversion (this study) (#3), the endowment effect (#3).
My experience was similar to Habryka's. I followed the "too small and subdivided" link to find more details on what exactly the book claimed about the research and how the research looked to you. I didn't see more details on the page where I landed, and couldn't tell where to navigate from there, so I gave up on that and didn't bother clicking any other links from the article. I think I had a similar experience the last time you relied on Roam links. So I've been getting more out of your epistemic spot checks when they've included the content in the post.
The shape of the graph will depend a lot on what questions you ask. So it's hard to interpret many aspects of the graph without seeing the questions that it's based on (or at least a representative subset of questions).
In particular, my recollection is that some GJP questions took the form "Will [event] happen by [date]?", where the market closed around the same time as the date that was asked about. These sorts of questions essentially become different questions as time passes - a year before the date they are asking if the event will happen in a one-year-wide future time window, but a month before the date they are instead asking if the event either will happen in a one-month-wide future time window or if it has already happened in an eleven-months-wide past time window. People can give more and more confident answers as the event draws closer because it's easier to know if the event happened in the past than it is to know if the event will happen in the future, regardless of whether predicting the near future is easier than predicting the far future.
For example, consider the question "an earthquake of at least such-and-such magnitude will happen in such-and-such region between October 16 2019 and October 15 2020". If you know that the propensity for such earthquakes is that they have a probability p of happening each day on average, and you have no information that allows you to make different guesses about different times, then the math on this question is pretty straightforward. Your initial estimate will be that there's a (1-p)^365 chance of No Qualifying Earthquake. Each day that passes with no qualifying earthquake happening, you'll increase the probability you put on No Qualifying Earthquake by reducing the exponent by 1 ("I know that an earthquake didn't happen yesterday, so now how likely is it to happen over the next 364 days?", etc.). And if a qualifying earthquake ever does happen then you'll change your prediction to a 100% chance of earthquake in that window (0% chance of No Qualifying Earthquake). You're able to predict the near future (e.g. probability of an earthquake on October 17 2019) and the distant future (e.g. probability of an earthquake on October 14 2020) equally well, but with this [event] by [date] formulation of the question it'll look like you're able to correctly get more and more confident as the date grows closer.
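Here is a small sketch of that earthquake arithmetic. The daily probability of 0.001 is a made-up value for illustration, and the function names are mine.

```python
def prob_no_quake(days_remaining, p_daily):
    """Probability of No Qualifying Earthquake over the remaining window,
    assuming a constant, independent daily probability p_daily."""
    return (1 - p_daily) ** days_remaining

def forecast_over_time(p_daily=0.001, checkpoints=(365, 180, 30, 1)):
    """Forecast at several points in time, given that no qualifying
    earthquake has happened so far. Keys are days left in the window."""
    return {days: prob_no_quake(days, p_daily) for days in checkpoints}

for days_left, prob in forecast_over_time().items():
    print(f"{days_left:3d} days left: P(no qualifying earthquake) = {prob:.3f}")

# The forecast for any single day never changes: it is always p_daily,
# whether that day is tomorrow or a year away.
```

With p = 0.001 the forecast starts at about 0.69 and climbs toward 1 as the window shrinks, even though the forecaster's skill at predicting any individual day is exactly the same throughout.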
Perhaps the Scots tend to one-box, whereas the English tend to two-box.
My intuition is that two-boxing is the correct move in this scenario where the Predictor always fills the box with $1M for the Scots and never for the English. An Englishman has no hope of walking away with the $1M, so why should he one-box? He could wind up being one of the typical Englishmen who walk away with $1000, or one of the atypical Englishmen who walk away with $0, but he is not going to wind up being an Englishman who walks away with $1M because those don't exist and he is not going to wind up being a Scottish millionaire because he is English.
EDT might also recommend two-boxing in this scenario, because empirically p($1M | English & one-box) = 0.
These studies have not held up well to further rigor. See Scott's 2016 post Devoodooifying Psychology, or even better Simonsohn's (2011) paper Spurious? Name similarity effects (implicit egotism) in marriage, job, and moving decisions.
Number of weakly connected digraphs with n nodes.
It also seems worth noting that this study looked at whether people intervened in aggressive public conflicts, which is a type of situation where the bystander's safety could be at risk and there can be safety in numbers. A lone bystander intervening in a fight is at higher risk of getting hurt, compared to a group of 10 bystanders acting together. This factor doesn't exist (or is much weaker) in situations like "does anyone stop to see if the person lying on the ground needs medical help" or "does anyone notify the authorities about the smoke which might indicate a fire emergency." So I'd be cautious about generalizing to those sorts of situations.
The standard claim in bystander effect research is that an individual bystander's probability of intervening goes down as the number of bystanders increases (see, e.g., Wikipedia). Whereas this study looked at the probability of any intervention from the group of bystanders, which is a different thing.
The abstract of the paper actually begins with this distinction:
Half a century of research on bystander behavior concludes that individuals are less likely to intervene during an emergency when in the presence of others than when alone. By contrast, little is known regarding the aggregated likelihood that at least someone present at an emergency will do something to help.
So: not a debunking. And another example of why it's good practice to check the paper in question (or at least its abstract) and the Wikipedia article(s) on the topic rather than believing news headlines.
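To see how the two quantities can move in opposite directions, here is a toy calculation. It assumes each bystander decides independently, and the probabilities (0.5 when alone, 0.2 each in a group of 10) are invented for illustration rather than taken from the paper.

```python
def prob_anyone_intervenes(n_bystanders, p_individual):
    """Probability that at least one of n bystanders intervenes, assuming
    each one intervenes independently with probability p_individual."""
    return 1 - (1 - p_individual) ** n_bystanders

# Hypothetical numbers: a lone bystander helps half the time, while each
# member of a 10-person crowd helps only 20% of the time.
alone = prob_anyone_intervenes(1, 0.5)
crowd = prob_anyone_intervenes(10, 0.2)
print(f"lone bystander:   P(someone helps) = {alone:.2f}")
print(f"crowd of ten:     P(someone helps) = {crowd:.2f}")
```

In this toy case each individual is much less likely to help in the crowd (0.2 vs 0.5), which is the classic bystander effect, yet the chance that someone helps rises from 0.50 to about 0.89. Both findings can be true at once.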