Context: My experience is primarily with psychology papers (heuristics & biases, social psych, and similar areas), and it seems to generalize pretty well to other social science research and fields with similar sorts of methods.
One way to think about this is to break it into three main questions:
1. Is this "result" just noise? Or would it replicate?
2. (If there's something besides noise) Is there anything interesting going on here? Or are all the "effects" just confounds, statistical artifacts, demonstrating the obvious, etc.
3. (If there is something interesting going on here) What is going on here? What's the main takeaway? What can we learn from this? Does it support the claim that some people are tempted to use it to support?
There is some benefit just to explicitly considering all three questions, and keeping them separate.
For #1 ("Is this just noise?") people apparently do a pretty good job of predicting which studies will replicate. Relevant factors include:
1a. How strong is the empirical result (tiny p value, large sample size, precise estimate of effect size, etc.).
1b. How plausible is this effect on priors? Including: How big an effect size would you expect on priors? And: How definitively does the researchers' theory predict this particular empirical result?
1c. Experimenter degrees of freedom / garden of forking paths / possibility of p-hacking. Preregistration is best, visible signs of p-hacking are worst.
1d. How filtered is this evidence? How much publication bias?
1e. How much do I trust the researchers about things like (c) and (d)?
I've found that this post on how to think about whether a replication study "failed" also seems to have helped clarify my thinking about whether a study is likely to replicate.
If there are many studies of essentially the same phenomenon, then try to find the methodologically strongest few and focus mainly on those. (Rather than picking one study at random and dismissing the whole area of research if that study is bad, or assuming that just because there are lots of studies they must add up to solid evidence.)
If you care about effect size, it's also worth keeping in mind that the things which turn noise into "statistically significant results" also tend to inflate effect sizes.
For #2 ("Is there anything interesting going on here?"), understanding methodology & statistics is pretty central. Partly that's background knowledge & expertise that you keep building up over the years, partly that's taking the time & effort to sort out what's going on in this study (if you care about this study and can't sort it out quickly), sometimes you can find other writings which comment on the methodology of this study which can help a lot. You can try googling for criticisms of this particular study or line of research (or check google scholar for papers that have cited it), or google for criticisms of specific methods they used. It is often easier to recognize when someone makes a good argument than to come up with that argument yourself.
One framing that helps me think about a study's methodology (and whether or not there's anything interesting going on here) is to try to flesh out "null hypothesis world": in the world where nothing interesting is going on, what would I expect to see come out of this experimental process? Sometimes I'll come up with more than one world that feels like a null hypothesis world. Exercise: try that with this study (Egan, Santos, Bloom 2007). Another exercise: Try that with the hot hand effect.
#3 ("What is going on here?") is the biggest/broadest question of the three. It's the one that I spend the most time on (at least if the study is any good), and it's the one that I could most easily write a whole bunch about (making lots of points and elaborating on them). But it's also the one that is the most distant from Eli's original question, and I don't want to turn those post into a big huge essay, so I'll just highlight a few things here.
A big part of the challenge is thinking for yourself about what's going on and not being too anchored on how things are described by the authors (or the press release or the person who told you about the study). Some moves here:
3a. Imagine (using your inner sim) being a participant in the study, such that you can picture what each part of the study was like. In particular, be sure that you understand every experimental manipulation and measurement in concrete terms (okay, so then they filled out this questionnaire which asked if you agree with statements like such-and-such and blah-blah-blah).
3b. Be sure you can clearly state the pattern of results of the main finding, in a concrete way which is not laden with the authors' theory (e.g. not "this group was depleted" but "this group gave up on the puzzles sooner"). You need this plus 3a to understand what happened in the study, then from there you're trying to draw inferences about what the study implies.
3c. Come up with (one or several) possible models/theories about what could be happening in this study. Especially look for ones that seem commonsensical / that are based in how you'd inner sim yourself or other people in the experimental scenario. It's fine if you have a model that doesn't make a crisp prediction, or if you have a theory that seems a lot like the authors' theory (but without their jargon). Exercise: try that with a typical willpower depletion study.
3d. Have in mind the key takeaway of the study (e.g., the one sentence summary that you would tell a friend; this is the thing that's the main reason why you're interested in reading the study). Poke at that sentence to see if you understand what each piece of it means. As you're looking at the study, see if that key takeaway actually holds up. e.g., Does the main pattern of results match this takeaway or do they not quite match up? Does the study distinguish the various models that you've come up with well enough to strongly support this main takeaway? Can you edit the takeaway claim to make it more precise / to more clearly reflect what happened in the study / to make the specifics of the study unsurprising to someone who heard the takeaway? What sort of research would it take to provide really strong support for that takeaway, and how does the study at hand compare to that?
3e. Look for concrete points of reference outside of this study which resemble the sort of thing the researchers are talking about. Search in particular for ones that seem out-of-sync with this study. e.g., This study says not to tell other people your goals, but the other day I told Alex about something I wanted to do and that seemed useful; do the specifics of this experiment change my sense of whether that conversation with Alex was a good idea?
Some narrower points which don't neatly fit into my 3-category breakdown:
A. If you care about effect sizes then consider doing a Fermi estimate, or otherwise translating the effect size into numbers that are intuitively meaningful to you. Also think about the range of possible effect sizes rather than just the point estimate, and remember that the issues with noise in #1 also inflate effect size.
B. If the paper finds a null effect and claims that it's meaningful (e.g., that the intervention didn't help) then you do care about effect sizes. (e.g., If it claims the intervention failed because it had no effect on mortality rates, then you might assume a value of $10M per life and try to calculate a 95% confidence interval on the value of the intervention based solely on its effect on mortality.)
C. New papers that claim to debunk an old finding are often right when they claim that the old finding has issues with #1 (it didn't replicate) or #2 (it had methodological flaws) but are rarely actually debunkings if they claim that the old finding has issues with #3 (it misdescribes what's really going on). The new study on #3 might be important and cause you to change your thinking in some ways, but it's generally an incremental update rather than a debunking. Examples that look to me like successful debunkings: behavioral social priming research (#1), the Dennis-dentist effect (#2), the hot hand fallacy (#2 and some of B), the Stanford Prison Experiment (closest to #2), various other things that didn't replicate (#1). Examples of alleged "debunkings" which seem like interesting but overhyped incremental research: the bystander effect (#3), loss aversion (this study) (#3), the endowment effect (#3).