John Ioannidis has written a very insightful and entertaining article about the current state of the movement which calls itself "Evidence-Based Medicine".  The paper is available ahead of print at

As far as I can tell there is currently no paywall. That may change later; send me an e-mail if you are unable to access it.

Retraction Watch interviews John about the paper here:

(Full disclosure: John Ioannidis is a co-director of the Meta-Research Innovation Center at Stanford (METRICS), where I am an employee. I am posting this not in an effort to promote METRICS, but because I believe the links will be of interest to the community)


Short version of the important parts, please?

I got the impression from the article that at the beginning, EBM was successful (and made many people in the pharma industry angry) because by using meta-analyses and similar tools it exposed the flaws in existing treatments.

But later the industry adapted, and started producing the kind of fake evidence that can better fool EBM. For example, instead of sponsoring studies, they started sponsoring the meta-analyses; instead of twisting numbers in one experiment, which can be revealed by comparison with other experiments, they started twisting the very comparisons of the experiments. Or instead of doing dubious studies with low numbers of participants, now they are doing dubious studies with huge numbers of participants (to get more weight in the meta-analysis) but still without sharing data and protocols.

In other words, EBM may suffer from the effects of Goodhart's law. Which doesn't necessarily mean that it's completely doomed; but it means that after the initial easy success it will become more difficult because more people will be trying to "hack" it.

In other words, EBM may suffer from the effects of Goodhart's law.

Basically, that. Also, "medicine is a very large high-inertia thing and trying to move it kinda failed".

This paper has a very critical tone, but it's still not making it clear to me whether the EBM movement has "failed". The counterfactual is not a world where EBM was somehow immune from being 'hijacked' (whatever 'hijacking' is, we're really talking about unintended consequences of the prevailing incentives in this sector) but one where that movement didn't even get off the ground in the first place. Relative to that, I think we're in fact far better off.

Insightful, accurate and depressing.

On a related note:

Ben Goldacre along with colleagues has recently set up COMPare.

For the last few years it’s been the norm that research in humans should be preregistered before the trial starts to avoid the file-drawer effect where negative trials don’t get published.

Without preregistration it’s hard to tell when someone has thrown a dart at a wall then built the dartboard around it.

It’s gradually been improving with a lot of research being preregistered but still published trials often don’t report what they said they were going to report or report things they didn’t preregister.

The dartboard is now there beforehand but people are still quietly building new dartboards around wherever the dart hits without mentioning it or mentioning that they were aiming at the original dartboard.

The COMPare project is doing something incredibly simple: Reading the paper. Reading the preregistered plan. Posting a public note on their website and sending a letter to the publishing journal pointing it out.

It’s embarrassing for journals because in theory they should have made sure that the papers matched what was preregistered during peer review.

They’ve had a range of responses: some journal editors, like the ones at the BMJ, have posted corrections, while others, like the editors at Annals of Internal Medicine, have doubled down. It’s really quite entertaining.

It's interesting to note that both METRICS and Ben Goldacre are funded by the Laura and John Arnold Foundation. Yay, Earning-to-Give money.

the Laura and John Arnold Foundation. Yay, Earning-to-Give money.

Wikipedia: "...he is credited with making three quarters of a billion dollars for Enron in 2001..."



According to Wikipedia, he was a trader of natural gas derivatives at Enron. This seems essentially unrelated to the collapse of Enron, which was caused by their accounting practices. Do you have a reason to suspect he was involved? If not, this comment seems like a completely unwarranted personal attack on a person who is obviously doing a lot of good for open science (including, for full disclosure, providing the grant which funds my salary).

[This comment is no longer endorsed by its author]

I would regard projects like COMPare, which rate studies after publication, as much more valuable than preregistration. Yes, preregistration reduces researcher degrees of freedom, but it also increases red tape. Ioannidis mentions how researchers are spending too much time chasing funds. Preregistration increases costs (in terms of extra work) to the researcher, encouraging them to chase more funding. Increasing quality will likely require reducing the cost of doing higher-quality research, not increasing it. Yes, I'm aware COMPare is using the preregistration to rate the studies, but that's just one method. The question mark for me with preregistration is: what is the opportunity cost? If researchers are now spending this extra time figuring out exactly what they plan to do all from the beginning of the study, and then filling out preregistration forms, what are they not doing instead?

If you don't publicly pre-commit to what you're going to measure then p-values become a bit meaningless since nobody can know if that was the only thing you measured.
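A rough way to see why: if every outcome you measure is tested at the usual 0.05 threshold and there is no real effect anywhere, the chance that at least one of them comes up "significant" grows quickly with the number of outcomes. The sketch below assumes the outcomes are independent, which real trial outcomes usually are not, so treat the numbers as an illustration of the direction rather than exact figures. The function name is made up for this example.

```python
def chance_of_false_positive(n_outcomes, alpha=0.05):
    """Probability that at least one of n independent tests of a true null
    hypothesis falls below the significance threshold alpha."""
    return 1.0 - (1.0 - alpha) ** n_outcomes


# One pre-specified outcome versus several quietly measured ones.
for n in (1, 5, 10, 20):
    print(n, round(chance_of_false_positive(n), 3))
```

A reader who only sees the one reported outcome has no way to tell which row of that table the paper belongs to, unless the outcome was written down in advance.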

If researchers are well organized then pre-reg should be almost free. On the other hand if they're disorganized, winging it and making things up as they go along then pre-reg will look like a terrible burden since it forces them to decide what they're actually going to do.


The short general version of my argument is: feedback > filtering

I would agree that preregistration is one way to make p-values more useful. They may be the best way to determine what the researcher originally intended to measure, but they're not the only way to know if that was the only thing a researcher measured. I've found asking questions often works.

If we're talking strictly about properly run RCTs, then I would agree, preregistration is close to free relatively speaking. But that's because a properly conducted RCT is such a big undertaking that most filtering initiatives are going to be relatively small. But RCTs aren't the only kind of study design out there. They are the gold-standard, yes, in that they have the greatest robustness, but their major drawback is that they're expensive to conduct properly relative to alternatives.

Science already has a pretty strong filter. Researchers need to spend 8 years (and usually much more than that) after high school working towards a PhD. They then have to decide that what they're doing is the best way to analyze a problem, or if they're still in grad school, their professor has to approve it, and they have to believe in it. Then two or more other people with PhDs who weren't involved in the research (editor and peer reviewer(s)) have to review what the researcher did, and come to the conclusion that the research was properly conducted. I don't view this as principally a filtering problem. Filtering can improve quality, but it also reduces the number of possible ways to conduct research. The end result of excessive filtering to me is that everybody ends up just doing RCTs for everything, which is extremely cost-inefficient, and leads to the problem of everybody chasing funding. If nobody with less than a million on their credit can conduct a study, I think that's a problem.

Having been through some of that process... it's less than stellar.

That recent "creator" paper managed, somehow, to get through peer review, and in the past it's been clear to me that reviewers sometimes have no clue about what they've been asked to review and just sort of wave it through with a few requests for spelling and grammar corrections.

To an extent it's a very similar problem to ones faced in programming and engineering. Asking for more feedback is just the waterfall model applied to research.

To an extent, even if researchers weren't being asked to publicly post their pre-reg, getting them to actually work out what they're planning to measure is a little like getting programmers to adopt Test Driven Development (write the tests, then write the code), which tends to produce higher quality output.

Despite those 8 years, a lot of people still don't really know what they're doing in research and just sort of ape their supervisor (who may have been in the same situation).

Since the system is still half-modeled on the old medieval master-journeyman-apprentice system you can also get massive massive massive variation in ability/competence so simply trusting in people being highly qualified isn't very reliable.

The simplest way to illustrate the problem is to point to really really basic stats errors which make it into huge portions of the literature. Basic errors which have made it past supervisors, made it past reviewers, made it past editors. Made it past many people with PhDs and not one picked up on them.

(This is just an example, there are many many other basic errors made constantly in research)

They’ve identified one direct, stark statistical error that is so widespread it appears in about half of all the published papers surveyed from the academic neuroscience research literature.

To understand the scale of this problem, first we have to understand the statistical error they’ve identified. This is slightly difficult, and it will take 400 words of pain. At the end, you will understand an important aspect of statistics better than half the professional university academics currently publishing in the field of neuroscience.

Let’s say you’re working on some nerve cells, measuring the frequency with which they fire. When you drop a chemical on them, they seem to fire more slowly. You’ve got some normal mice, and some mutant mice. You want to see if their cells are differently affected by the chemical. So you measure the firing rate before and after applying the chemical, first in the mutant mice, then in the normal mice.

When you drop the chemical on the mutant mice nerve cells, their firing rate drops, by 30%, say. With the number of mice you have (in your imaginary experiment) this difference is statistically significant, which means it is unlikely to be due to chance. That’s a useful finding which you can maybe publish. When you drop the chemical on the normal mice nerve cells, there is a bit of a drop in firing rate, but not as much – let’s say the drop is 15% – and this smaller drop doesn’t reach statistical significance.

But here is the catch. You can say that there is a statistically significant effect for your chemical reducing the firing rate in the mutant cells. And you can say there is no such statistically significant effect in the normal cells. But you cannot say that mutant cells and normal cells respond to the chemical differently. To say that, you would have to do a third statistical test, specifically comparing the “difference in differences”, the difference between the chemical-induced change in firing rate for the normal cells against the chemical-induced change in the mutant cells.

Now, looking at the figures I’ve given you here (entirely made up, for our made up experiment) it’s very likely that this “difference in differences” would not be statistically significant, because the responses to the chemical only differ from each other by 15%, and we saw earlier that a drop of 15% on its own wasn’t enough to achieve statistical significance.

But in exactly this situation, academics in neuroscience papers are routinely claiming that they have found a difference in response, in every field imaginable, with all kinds of stimuli and interventions: comparing responses in younger versus older participants; in patients against normal volunteers; in one task against another; between different brain areas; and so on.

How often? Nieuwenhuis looked at 513 papers published in five prestigious neuroscience journals over two years. In half the 157 studies where this error could have been made, it was made.
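The three tests in the quoted example can be written out in a few lines. This is only a sketch: the 30% and 15% drops come from the quote, but the standard errors (12 percentage points for each group) are numbers I made up so that the first drop is significant and the second is not, and the tests use a simple normal approximation with the two groups of mice treated as independent.

```python
import math


def two_sided_p(z):
    """Two-sided p-value for a z statistic under the standard normal."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def z_test(estimate, standard_error):
    """Return (z, p) for testing whether an estimate differs from zero."""
    z = estimate / standard_error
    return z, two_sided_p(z)


# Drops in firing rate, in percentage points. The standard errors are
# hypothetical values chosen for illustration only.
mutant_drop, mutant_se = 30.0, 12.0
normal_drop, normal_se = 15.0, 12.0

z_mutant, p_mutant = z_test(mutant_drop, mutant_se)
z_normal, p_normal = z_test(normal_drop, normal_se)

# The third test: is the difference between the two drops itself nonzero?
# For independent groups the variances add.
diff = mutant_drop - normal_drop
diff_se = math.sqrt(mutant_se ** 2 + normal_se ** 2)
z_diff, p_diff = z_test(diff, diff_se)

print("mutant:", round(p_mutant, 3))
print("normal:", round(p_normal, 3))
print("difference in differences:", round(p_diff, 3))
```

With these made-up inputs the mutant drop clears the 0.05 threshold, the normal drop does not, and the comparison between them does not either, which is exactly the situation where the papers claim a difference anyway.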

It makes sense when you realize that many people simply ape their supervisors and the existing literature. When bad methods make it into a paper people copy those methods without ever considering whether they're obviously incorrect.


I'm arguing we need more feedback rather than more filtering.

You're arguing the new filtering will be more effective than the old filtering, and as proof, here are all the ways the old filtering method has failed.

But pointing out that filtering didn't work in the past is not a criticism of my argument that we need more feedback such as through objective post-publication reviews of articles. I never argued that the old filtering method works.

If you believe the old filtering method isn't a stringent filtering system, do you believe it wouldn't make much difference if we removed it, and let anybody publish anywhere without peer review as long as they preregistered their study? Would this produce an improvement?

I think you also need to contend with the empirical evidence from COMPare that preregistration (the new filtering method you support) hasn't been effective so far.

I think more stringent filtering can increase reliability, but doing so will also increase wastefulness. Feedback can increase reliability without increasing wastefulness.

Feedback from supervisors and feedback from reviewers is what the current system is mostly based on. We're currently in a mostly-feedback system but it's disorganised, poorly standardised feedback and the feedback tends to end a very short time after publication.

Some of the better journals operate blinded reviews so that in theory "anybody" should be able to publish a paper if the quality is good and that's a good thing.

COMPare implies that preregistration didn't solve all the problems but other studies have shown that it has massively improved matters.

Science already has a pretty strong filter.

If that's true, why are replication rates so poor?

They may be the best way to determine what the researcher originally intended to measure, but they're not the only way to know if that was the only thing a researcher measured. I've found asking questions often works.

You can ask questions, but how do you know whether the answers you are getting are right? It's quite easy for people who fit a linear model to play around a bit with the parameters and not even remember all the parameters they tested.

Then two or more other people with PhDs who weren't involved in the research (editor and peer reviewer(s)) have to review what the researcher did

More often they don't review what the researcher did but what the researchers claimed they did.


If that's true, why are replication rates so poor?

There is no feedback post publication. Researchers are expected to individually decide on the quality of a published study, or occasionally ask the colleagues in their department.

I don't get the impression that low replication rates are due to malice generally. I think it's a training and incentive problem most of the time. In that case just asking should often work.

Science has very little feedback and lots of filtering at present. Preregistration is just more filtering. Science needs more feedback.

What kind of feedback would you want to exist?

If researchers are now spending this extra time figuring out exactly what they plan to do all from the beginning of the study, and then filling out preregistration forms, what are they not doing instead?

Spending time on using a lot of different statistical techniques till one of them provides statistically significant results?

It seems that EBM defaults to a do-ocracy. This is alluded to in the discussion of purchasing authorship in prestigious papers. The system seems to work by offering to do all the "hard parts" of publishing. This meshes with Goodhart's law: figure out which proxy measures people care about, and then game the proxies. We have 3 sets of interests: the organization, the customers of the organization, and the professionals the organization partners with. The customers care about whichever proxy measures they have learned to care about, headline statistics, good words like "randomized controlled trials", "meta-analysis" etc (see the Bingo Card Fallacy). The professionals care about prestige. The organization buys associations with the professionals and then sells them to their customers. They smooth away all the frictional costs associated with this transaction and thus gain access to monkeying around with the methods of that transaction in a way that optimizes for the proxies their customers respond to.

There winds up being an ongoing battle. Educators try to teach the public about new, harder-to-game proxies (forest plots! preregistration! and on the very cutting edge, specification curves!*). Organizations have a very large incentive to find ways to game the new proxies. It is actually surprising that much progress is made. The incentives are asymmetric, especially money-wise. Good conceptual handles/frames (like the spread of the pyramid of evidence) seem to help a lot. recently did an article explaining forest plots, though I am having difficulty finding it now.

The general pattern is that the new proxies need to be better along some understandable numerical (meta-analysis = bigger n!) or graphical dimension (forest plots give an intuitive overview of an entire line of study at a glance). I am hopeful for specification curves for exactly this reason, since the output is intuitively graspable. It is then plausible for a meme to spread that evidence without X backing it is lower quality, where X is simply the checklist of hardest to fake proxies for rigor/validity.

One objection might be that you can only build so much on the top of the pyramid of evidence when the base is rotting (poor individual study design), but I think it helps. When Cochrane does a meta-analysis and only includes 10% of the papers in a given research area because the other 90% are crap, this sends a signal to the system. It would be nice if the signal was more robustly responded to of course, but at least it exists.

*Skip to this point to see what the output of a specification curve would be: The second curve shown illustrates how discontinuities can highlight problems/areas of interest in the specification space.


"He entered one day the board room"

should be "he entered the board room one day"

"This is not what I thought medicine would be about, let along EBM."

should be "alone"

"EBM should still be possible to practice anywhere, somewhere-- this remains a worthwhile goal"

reads less awkwardly with somewhere and anywhere reversed (English colloquialism).

a very insightful and entertaining

You did mean "depressing", right? :-/

In health care, patients are primarily concerned about their health, while everyone else is primarily concerned about their careers.

It's hardly surprising that patient health doesn't drive the system, when the only people primarily motivated by patient health have been systematically disempowered by regulatory law.

See also Pournelle's Iron Law of Bureaucracy

This seems to me like you missed the point of the whole story. Patients want to become healthy, but they don't know what exactly should they do. They rely on various signals... and there is an arms race between finding better signals and gaming the existing ones.

If you could magically remove all the bureaucracy, there would still remain the situation with companies saying "our products will cure you", some experts saying "no they won't", other experts saying "yes they will", and then the two groups of experts accusing each other of incompetence and dishonesty. How is a non-expert patient supposed to transform their desire to become healthy into making better decisions in situations where experts compete to provide convincing arguments for both sides?

EBM seemed like a solution, but now it seems like another system that can be gamed if you spend enough money on it. For example, it seemed like if you take hundreds of studies and make a meta-analysis, the results will be more reliable. Unfortunately, many studies are so crappy, that the people doing the meta-analysis have to throw them away. But then a company can just pay another group to throw away a different subset, in order to achieve the desired conclusion, and say "what? we are doing exactly the same thing as Cochrane, and it proves that homeopathy does cure cancer. it's evidence-based".
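To make the "throw away a different subset" move concrete, here is a small sketch of a fixed-effect, inverse-variance meta-analysis. Everything in it is hypothetical: the five studies, their effect sizes and their standard errors are invented, and `pooled_estimate` is just a name for this example. Each study is weighted by 1/SE², which is also why a huge study counts for so much more than a small one.

```python
import math


def pooled_estimate(studies):
    """Fixed-effect inverse-variance pooling.

    `studies` is a list of (effect, standard_error) pairs.
    Returns (pooled effect, pooled standard error, z statistic).
    """
    weights = [1.0 / se ** 2 for _, se in studies]
    total_weight = sum(weights)
    effect = sum(w * eff for w, (eff, _) in zip(weights, studies)) / total_weight
    se = math.sqrt(1.0 / total_weight)
    return effect, se, effect / se


# Invented (effect, standard error) pairs; a positive effect favours the product.
studies = {
    "A": (0.40, 0.20),
    "B": (0.35, 0.25),
    "C": (-0.05, 0.10),  # large, precise study showing nothing
    "D": (0.00, 0.15),   # another precise null result
    "E": (0.50, 0.30),
}

everything = pooled_estimate([studies[name] for name in "ABCDE"])
cherry_picked = pooled_estimate([studies[name] for name in "ABE"])

print("all studies:   effect=%.3f z=%.2f" % (everything[0], everything[2]))
print("C, D excluded: effect=%.3f z=%.2f" % (cherry_picked[0], cherry_picked[2]))
```

With these invented numbers the pooled result over all five studies is not significant at the usual |z| > 1.96 cutoff, while excluding the two precise null studies for being "low quality" makes it comfortably significant. Both analyses use exactly the same method; only the inclusion criteria differ.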

You seem to have a mental model where listening to an expert is the only possible way to distinguish whether to do A or B. That's not true.

An alternative would be what Kant called Enlightenment. It's interesting that the idea of Enlightenment seems so absurd as to be unthinkable.

Okay, so which method would you recommend to an average patient instead?

Listening to experts is the (imperfect) solution for a society with division of labor, so that most people don't have to study medicine and statistics and make their own experiments. General education can make you dismiss the most obvious nonsense, but beyond that you need an expertise, and either you spend years to develop it yourself, or you outsource it to other people.

I don't think that you can effectively delegate all decisions about your health to someone else.

There are no effective noninvasive expert-based weight loss regimens. At the same time there are people who lose weight through being motivated to lose weight and working on the goal.

A product such as 23andMe can give a patient useful information about their health without there being an expert doctor who tries to translate. Current laws prevent a customer from buying data interpretation directly from 23andMe.

Last week I was with a bunch of experts at a workshop paid for by the government to envision the future of health 2.0. I think the room agreed that it would be good for medicine to be a bit less expert-based. We had a president from the medical association explain to us that doctors likely aren't qualified to give good answers when you come to them with good data.

Seems to me like two different questions:

(1) Should an average person try to study medicine and statistics and find out the answers for themselves, or use an expert opinion?

(2) Are people recognized by current laws / credential systems as "experts" really the best available experts?

I would support people getting information about their health from organizations like 23andMe. But I don't expect 23andMe to do all research on their own -- at some moment they are going to rely on some peer-reviewed study. There should be a system where the studies are not easily gamed by financial interests of pharmaceutical companies, or by pressure on scientists to publish even when there is nothing worth publishing.

Should an average person try to study medicine and statistics

An average person should not do many things. That is not a good reason for people who are not average to abstain from these things.

(1) Should an average person try to study medicine and statistics and find out the answers for themselves, or use an expert opinion?

I don't think any person will be successful at losing weight based on following the 10 minute advice of an expert and doctors frequently take less than 10 minutes for talking to a patient. It takes learning about food and healthy eating. The same is true for every significant medical issue. It never makes sense to try to outsource the knowledge completely.

I would support people getting information about their health from organizations like 23andMe.

That's not a decision that gets to be made on its own. It's a decision that's the result of a system either being restrictive and only allowing experts to diagnose or the system being more free.

There should be a system where the studies are not easily gamed by financial interests of pharmaceutical companies

The question is whether the political realities of current insiders having that much power allow for such a system. Is change possible as long as power gets concentrated in well-funded people at the top of the system?

Or are we better off if we simply allow new entities into the system and allow them to do whatever they want? Entities like 23andMe that we currently block?

Some professional coauthors will probably die but will continue to have their names placed on new publications several years posthumously. The submitting author may forget that they are long dead, buried among dozens of automatically listed co-authors.

That seems to be a falsifiable prediction. It might also be a good tool to shame wrong-doers.

In theory yes but in practice it's hard to untangle from papers where someone was a proper, real coauthor or contributor but died before publication.

Excellent article. Do you happen to know of any evidence based research on cholesterol? Mine just came back at:

Whole Cholesterol 242
Triglycerides 73
HDL 55
LDL 172

and my doctor wants to have a talk with me about lowering my cholesterol. I'm already doing all of the easy things so I suspect he will want to put me on drugs.

I think the evidence for the effectiveness of statins is very convincing. The absolute risk reduction from statins will depend primarily on your individual baseline risk of coronary disease. From the information you have provided, I don't think your baseline risk is extraordinarily high, but it is also not negligible.

You will have to make a trade-off where the important considerations are (1) how bothered you are by the side-effects, (2) what absolute risk reduction you expect based on your individual baseline risk, (3) the marginal price (in terms of side effects) that you are willing to pay for slightly better chance at avoiding a heart attack. I am not going to tell you how to make that trade-off, but I would consider giving the medications a try simply because it is the only way to get information on whether you get any side effects, and if so, whether you find them tolerable.
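To make consideration (2) concrete, here is a minimal sketch of how absolute risk reduction depends on baseline risk. The numbers are placeholders, not figures from any trial or risk calculator: I am assuming a relative risk reduction of 25% purely for illustration, and the two baseline risks are made up. The function names are mine.

```python
def absolute_risk_reduction(baseline_risk, relative_risk_reduction):
    """Absolute risk reduction: baseline risk times relative risk reduction."""
    return baseline_risk * relative_risk_reduction


def number_needed_to_treat(arr):
    """How many people must be treated to prevent one event."""
    return 1.0 / arr


ASSUMED_RRR = 0.25  # hypothetical relative risk reduction, for illustration only

# Hypothetical baseline risks of an event over some fixed period.
for baseline in (0.05, 0.20):
    arr = absolute_risk_reduction(baseline, ASSUMED_RRR)
    print(baseline, arr, round(number_needed_to_treat(arr)))
```

The same relative effect gives a four times larger absolute benefit for the person with four times the baseline risk, which is why the individual baseline is the number that matters for the trade-off.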

(I am not licensed to practice medicine in the United States or on the internet, and this comment does not constitute medical advice)


I don't want to complain about downvotes, but if someone believes that the above comment is misleading in any way I would like to hear the argument.

I am mostly posting this because I have a strong hunch that if an admin looks up who downvoted the above comment, they will discover a sockpuppet belonging to a man known by many names.

[This comment is no longer endorsed by its author]

Do you have other risk factors for heart disease such as high blood pressure?

Not blood pressure, which is around 122/84. No family history, but 23andMe says I have a 1.13x risk of coronary heart disease. I'm not overweight.

Cholesterol -- specifically, the importance of "cholesterol" (actually, lipoproteins) numbers -- is a hotly contested topic. There are so-called cholesterol wars about it. The mainstream position has been slowly evolving from "cholesterol is the devil" to "LDL is the devil" to "You need to look at HDL and trigs as well, but LDL is bad anyway" to "It's complicated" :-/

Doctors, unfortunately, tend to have a hard boundary in mind and if your cholesterol is above it, they feel the need to drive it below that boundary. I am not a fan of this approach.

Statins are a whole big separate issue. My impression is that statins have been shown to be quite effective for cardiac patients, that is, people who already had a cardiovascular event and are at high risk for another one. The use of statins for primary prevention, that is for people without any history of CVD is another thing. Pharma companies, obviously, really really want statins to be used for primary prevention.

The problem is that the effectiveness of statins for primary prevention is iffy. Essentially claims for it rest on a single trial called Jupiter. Before it, the Cochrane Collaboration stated that the use of statins for primary prevention was not shown to be useful. After that trial Cochrane changed the review and said that "Reductions in all-cause mortality, major vascular events and revascularisations were found with no excess of adverse events among people without evidence of CVD treated with statins." You can read the entire review (pay attention to absolute risk reduction numbers).

ETA: What, Cochrane reviews are behind a paywall now? I thought they were free-access. In any case, here is a link to a freely-available version.

If you feel you need to do something about your LDL without statins, try changing your diet. In particular, saturated fats push up LDL (but they also push up the "good" HDL).

You did the 23andMe thing, right? What's your APOC3 status?

I'm not sure which SNP or position for the APOC3 gene you are referring to.

Actually, check APOE first as it has pronounced effects. It's usually considered to be an Alzheimer's risk factor, but besides that it affects your cholesterol levels and specifically how they react to the saturated fat in your diet. See e.g. this paper or a high-level overview.

APOC3 is this.

I'm E3/E3 for APOE which I guess means I have less to worry about.