Thanks to JustisMills and Ruby for reviewing this post. Any errors are my own.

TLDR: A French long COVID study suggests that long COVID symptoms are correlated with believing you have had COVID but not with actually having had COVID. The study used the wrong statistical tool to obtain this result.

In reality, the study data show that long COVID symptoms are correlated with having had COVID and agree with Scott’s conclusions in Long COVID: Much more than you wanted to know.


The authors suggest that nearly all long COVID symptoms might not be caused by SARS-CoV-2 (except for those associated with anosmia). I believe that this is true in some cases but not remotely close to the extent suggested by the paper.

Study Design 

Roughly speaking the experimental setup was:

  • Send out a whole load of serology tests to ~36,000 people in May-Nov 2020 (Serology tests are antibody tests which are intended to show if you have ever had COVID).
  • Perform the tests (~27,000 received back), then give participants their results.
  • Send out a questionnaire about whether participants think they’ve had COVID and what persistent symptoms they’ve had (Dec 2020-Jan 2021).
  • Exclude some people for reasons.
    • e.g. Participants who thought that they had had COVID after they did the serology test were excluded.
  • Run some logistic regressions on different symptoms vs belief in having had COVID and/or serology results.

You may have spotted the first problem. We want to test whether believing you’ve had COVID or actually having had COVID is the better predictor of long COVID symptoms, but participants were given their serology results before they were asked whether they think they’ve had COVID.

You’d think that this would ruin the results – belief in having had COVID should be extremely well correlated with having a positive serology result.

Fortunately (?!) this doesn’t seem to be the case. Of everyone who had a positive serology result, only 41.5% replied that they thought they’d had COVID. Of everyone who thought they’d had COVID, 50.4% had a negative serology result.

I’m super confused by this but I’ll take this at face value for the moment and move on to the analysis.

Combined effects logistic regression with correlated predictors

The main reported result comes from model 3 of the study's analysis. This is the combined effects logistic regression model which uses 2 predictors:

  • Belief in having had COVID.
  • Serology results.

To predict:

  • Presence of persistent symptoms (18 different symptoms).
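
As a rough sketch, a logistic regression with these two predictors has the form below for each symptom. The symbols are illustrative rather than taken from the study, and any additional covariates the study may have adjusted for are omitted.

```latex
\operatorname{logit} P(\text{symptom} = 1)
  = \beta_0
  + \beta_{\text{belief}} \, x_{\text{belief}}
  + \beta_{\text{serology}} \, x_{\text{serology}}
```

Each coefficient describes the association with one predictor while the other is held fixed, which is why correlation between the two predictors matters.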

The result of this model was that a lot of symptoms (16/18) were predicted well by belief in having had COVID but that only anosmia was predicted by serology results.

This seems pretty damning of long COVID symptoms being caused by SARS-CoV-2, at least until we consider the correlation between the 2 predictive properties.

Consider the following example with 100 participants:

  • 89 are negative for belief and serology.
    • None have symptom A.
  • 10 are positive for belief and serology.
    • 9 have symptom A.
    • 1 does not have symptom A.
  • 1 is positive for belief but negative for serology.
    • They have symptom A.

Running the equivalent of model 3 from the study on these data will show that belief in having had COVID is a positive predictor of symptom A but that a positive serology result is a negative predictor of symptom A.

At the same time, 90% of people who had COVID have symptom A compared to 1.1% of people who didn't have COVID!

This is kinda tricky to explain but bear with me.

  • Taking each predictor separately, belief is a stronger predictor of symptom A than serology.
    • This is due to the last participant mentioned (the only participant whose belief and serology don't match). For them, positive belief predicts having symptom A but negative serology predicts not having symptom A.
  • The model notices this difference in predictive power and makes belief a strong positive predictor.
  • It then looks at any variation which isn't explained by belief but that can be explained by serology.
    • Consider the last 11 people on the list (all the positive for belief participants).
      • 100% of people who were negative for serology had symptom A.
      • 90% of people who were positive for serology had symptom A.
    • So, given that someone is positive for belief, being positive for serology actually decreases the probability of having symptom A.
  • In reality the 2 predictors are fitted jointly rather than one after the other, but the result is the same.
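
The proportions in this made-up example can be checked directly from the counts. The snippet below is a minimal sketch that only tabulates the 100 hypothetical participants; it does not fit a regression, and the names `cells` and `rate` are introduced here for illustration.

```python
# Toy example from the post: (belief, serology) -> (number with symptom A, group size)
cells = {
    (0, 0): (0, 89),   # negative for belief and serology
    (1, 1): (9, 10),   # positive for belief and serology
    (1, 0): (1, 1),    # positive for belief, negative for serology
}

def rate(predicate):
    """Share of participants with symptom A among the groups matching predicate."""
    with_a = sum(a for key, (a, n) in cells.items() if predicate(*key))
    total = sum(n for key, (a, n) in cells.items() if predicate(*key))
    return with_a / total

# Each predictor taken on its own
sero_pos = rate(lambda belief, sero: sero == 1)
sero_neg = rate(lambda belief, sero: sero == 0)
belief_pos = rate(lambda belief, sero: belief == 1)
belief_neg = rate(lambda belief, sero: belief == 0)

# Serology within the belief-positive participants only
bp_sero_pos = rate(lambda belief, sero: belief == 1 and sero == 1)
bp_sero_neg = rate(lambda belief, sero: belief == 1 and sero == 0)

print(f"serology +: {sero_pos:.1%}, serology -: {sero_neg:.1%}")
print(f"belief +: {belief_pos:.1%}, belief -: {belief_neg:.1%}")
print(f"belief + and serology +: {bp_sero_pos:.1%}, belief + and serology -: {bp_sero_neg:.1%}")
```

Serology on its own separates the groups strongly, yet within the belief-positive participants the serology-positive group has the lower rate. That within-group comparison is what the coefficient for serology reflects once belief is also in the model.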

Probably people who are familiar with statistics are cringing slightly at that explanation but I hope it gives an intuitive idea of what is happening. Essentially:

  • All of the examples where positive serology makes you more likely to have symptom A are better explained (according to the model) by being positive for belief.
  • After adjusting for belief, being serology positive makes you slightly less likely to experience symptom A.

Of course this example is me just making up numbers to show how counter-intuitive results can be from this kind of model.

However, hopefully it illustrates the problems you can have when running a combined effects logistic regression with correlated predictors. This might not be a problem (or even be a feature) in some cases but when one of your predictors (having COVID) often causes the other (believing that you had COVID) then you have to think more carefully about your model.


Is there a simple way to assess whether COVID causes the symptoms in the study? Yes, just run the logistic regression with serology results as the only predictor. Fortunately for us the study includes this model – model 2.

Model 2 results show that the likelihoods of experiencing the following persistent symptoms are increased by having had COVID (odds ratio / percentage point increase vs serology negative):

  • Fatigue (2.59 / 5.0%)
  • Anosmia (15.69 / 4.3%)
  • Poor attention/concentration (2.10 / 2.8%)
  • Breathing difficulties (3.60 / 2.3%)
  • Chest pain (3.70 / 1.4%)
  • Palpitations (2.61 / 1.2%)
  • Headache (1.69 / 0.9%)
  • Dizziness (2.37 / 0.6%)
  • Cough (2.22 / 0.6%)
  • Other symptoms (1.91 / 1.3%)

If we add all the percentage point increases (i.e. how many more percentage points serology positive participants experienced persistent symptoms vs serology negative participants - data from table 2) then we get 20.3%. So having COVID on average gives you ~0.2 persistent symptoms vs not having COVID, with presumably some people having more than one symptom.
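
The addition can be reproduced from the rounded figures in the list above. This is a small sketch; note that the rounded values sum to 20.4 rather than 20.3, presumably because the figure in the text was computed from unrounded values.

```python
# Percentage point increases (serology positive vs negative), as listed above
increases = {
    "Fatigue": 5.0,
    "Anosmia": 4.3,
    "Poor attention/concentration": 2.8,
    "Breathing difficulties": 2.3,
    "Chest pain": 1.4,
    "Palpitations": 1.2,
    "Headache": 0.9,
    "Dizziness": 0.6,
    "Cough": 0.6,
    "Other symptoms": 1.3,
}

# Summing per-symptom increases gives the expected number of extra symptoms
# per person (times 100), not the share of people with at least one symptom.
total_pp = round(sum(increases.values()), 1)
extra_symptoms_per_person = total_pp / 100

print(total_pp, extra_symptoms_per_person)
```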

This is roughly in line with Scott’s conclusions in Long COVID: Much more than you wanted to know. The specific symptoms experienced are also in line with that post, so if that post reflects your current understanding of long COVID then I wouldn’t update much based on this study except to add some more confidence to a couple of the points Scott makes:

2. The prevalence of Long COVID after a mild non-hospital-level case is probably somewhere around 20%, but some of this is pretty mild.

3. The most common symptoms are breathing problems, issues with taste/smell, and fatigue + other cognitive problems.

Serology vs Belief

Can we say anything about how much effect belief in having had COVID has on Long COVID compared to actually having had COVID?

I think it’s difficult based on this study, because participants knew their serology results before stating their belief and I really have no idea how this affected the results. I’ll keep pretending that this isn’t an issue for the moment.

We can compare model 2 (serology) results to model 1 (belief in having had COVID) along with values from table 2. The percentage points increases from belief are on average 2.17x (range 1.55-2.92) higher than the equivalents for serology (for the symptoms which are significant for serology). So if the belief value represents the full population who report symptoms then actually having had COVID accounts for 46% of those. If we include the other symptoms which aren't significant for serology then this number will get lower.
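
The 46% figure is just the inverse of the average ratio. A quick sketch of the arithmetic, using only the numbers quoted above (the variable names are illustrative, and inverting an average ratio is itself an approximation):

```python
# Belief percentage point increases relative to serology ones, as quoted above
mean_ratio = 2.17
ratio_low, ratio_high = 1.55, 2.92

# Share of belief-associated symptom reports that serology accounts for
share_mean = 1 / mean_ratio
share_range = (1 / ratio_high, 1 / ratio_low)

print(f"{share_mean:.0%} (range {share_range[0]:.0%} to {share_range[1]:.0%})")
```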

At face value this suggests that just over half of the people with long COVID symptoms who think that they had COVID are wrong. This is important but not the same as "A serology test result positive for SARS-COV-2 was positively associated only with persistent anosmia" as is reported in the study.

If we factor in the obvious problems with the experimental setup, then it's hard to know how much credence to give the study's data on this topic.


11 comments

It seems (on the basis of what you say here; I haven't looked at the actual study) as if everything is consistent with the following situation:

  • "Long COVID" symptoms other than anosmia/parosmia are caused by believing you have had COVID-19.
  • Actually having COVID-19 makes you more likely to believe you have had COVID-19.
  • This is how it comes about that "having COVID on average gives you ~0.2 persistent symptoms vs not having COVID".

Does the study give detailed enough numbers to distinguish this scenario from one where the disease causes the symptoms by "non-psychological" mechanisms?

That's a fair point. I don’t think the data does distinguish between the two so maybe I’ve overstated the case here.

I think it’s important to distinguish between “is consistent with” and “implies that”. I think the belief hypothesis should be given a much lower prior than just Covid causing long Covid symptoms plus some additional cases for belief on top of that.

I would expect that the low probability of reporting COVID given that you have a positive serology test is due to the fact that many COVID cases are asymptomatic. If I had no symptoms of COVID, but someone told me I tested positive for COVID one time, would I consider myself to have had COVID? I probably would, but I expect most people wouldn't since "had COVID" is an experience centered on the experience of disease for most people (i.e. coughing and feeling unwell), not centered on the presence or absence of a virus in your body. The fact that half of the people who have a positive test result don't think they have had COVID approximately matches my expectation about the rate of asymptomatic infection.

It actually lines up with the official terminology: The "D" in "COVID-19" stands for disease. Not all infections cause disease.

"Is there a simple way to assess whether COVID causes the symptoms in the study? Yes, just run the logistic regression with serology results as the only predictor. Fortunately for us the study includes this model – model 2."

This makes the assumption that people are equally likely to get infected with COVID regardless of health. What evidence is there for this assumption?

Yes, this is a good point. I suspect most long COVID studies have the same flaw.

It's tricky to run studies with beliefs as a variable, because beliefs have causes, so you're setting yourself up to have confounds. I haven't looked that closely at this study, but here are some possibilities:

Severity: people who had covid but believed that they didn't had mild symptoms. So 'severe cases have more long-term symptoms than mild/asymptomatic cases' would look like 'covid+belief leads to more reported long-term symptoms than covid without belief'.

Other illnesses: people who didn't have covid but thought they did had some other illness like flu or pneumonia. If there's long flu, then the long-term symptoms could be from that.

Long-term symptoms: a person who thinks that they probably just have a cold and not covid, but then is still fatigued a month later, might conclude that actually it probably was covid. So medium-to-long-term symptoms can cause belief, rather than belief causing long-term symptoms.

Testing inaccuracy: if the test that they're using to establish the ground truth of whether a person had covid isn't that accurate, then people who they're counting as 'covid but no belief' might actually be false positives, and people who they're counting as 'no covid but yes belief' might be false negatives.

Hypochondria: people who are prone to imagine that their health is worse than it actually is might mistakenly believe that they had covid (when they didn't) and also imagine that they have long-term symptoms like fatigue or difficulty breathing. If people who did get covid have similar reported long-term symptoms, that means that the actual long-term symptoms of people who had covid are as bad as the imagined level of symptoms among people who imagined they had covid.

Denial: the reverse of hypochondria - people who say they're fine even when they have some health symptoms might say that they didn't have covid even though they did, and then downplay their long-term symptoms. 

Trolling: if data slipped into the study from any people who find it funny to give extreme answers, they would look like hypochondriacs, claiming to have covid & long-term symptoms even if they didn't have covid.

The first few of these possibilities are cases where facts about the world influence beliefs, and those facts also influence long-term symptoms. The last few of these possibilities are where the person's traits influence their beliefs (or stated beliefs), and those traits also influence their reports of what long-term symptoms they've had.

If you wanted to independently assess the effects of getting covid and believing that you had covid, ideally (for scientific rigor unconstrained by ethics or practicality) you'd randomly assign some people to get covid or not and also randomly assign some people to believe they had covid or not (e.g. by lying to them). If you couldn't have perfect random assignment + blinding, then you'd want to measure a whole bunch of other variables to account for them statistically. In reality, without anything like random assignment, who gets covid is maybe close enough to random for an observational study to work, especially if you control for some high-level variables like age. Beliefs about whether you had covid are heavily entangled with relevant stuff, in a way that makes it really hard to study them as an independent variable.

Is there good reason to think that this study overcomes these problems?

For comparison: imagine some medical researchers are interested in whether a particular medicine helps with a particular medical condition, so they set up a placebo controlled trial. A bunch of people with the medical condition all get their symptoms tested, then they flip a coin and half get pills with the medicine while the other half get sugar pills, and they don't know whether they have the real pills. Then, some time later, they all get their symptoms tested again.

Now, imagine that I'm interested in "placebo effects" - I want to see if the ritual of taking sugar pills which you think might be medicine improves people's health, or causes side effects, and I want to piggyback on this medical trial. I could just look at the pre vs post results for the set of people who got the sugar pills, but unfortunately this medical condition varies over time so I can't disentangle effects of the pill-taking ritual from changes over time. I wish the study had a third "no-pill" group who (knowingly) didn't get any treatment, in addition to the medical pill group and the inert pill group. Then I could just compare the results of the sugar pill group to the no pill group. But it doesn't.

So I have the clever idea of getting the researchers to add a question to the tests at the end of the study, where they ask the patients whether they think they got the medicine pill or the sugar pill. That gives me a nice 2x2 design, where patients differ both in whether they got the medicine pill or the sugar pill, and separately in whether they believe they got the medicine pill or the sugar pill. So I can look separately at each of the 4 groups to see how much their condition improved, and what side effects they got. Changes that are associated with beliefs, I can claim, are based on the psychological effects of this pill taking ritual rather than the physiological effects of the substance they ingested.

This is a terrible study design. Who's going to believe they got the real medicine? Well, people whose condition improved will tend to think they must've gotten the real medicine. And people who noticed physiological states like nausea or dry mouth will tend to think they've gotten the real medicine. This study design will say that improved condition & nausea are caused by people's beliefs about whether they got the medicine, when in reality it's the reverse: the beliefs are caused by these physical changes.

If I'm especially meddlesome, I might even tell the original researchers that they should use this 2x2 design to evaluate the original study. Instead of just comparing the outcomes for the medicine pill group and the sugar pill group, they should compare the outcomes while controlling for people's beliefs about whether they got the medicine. That would mess up their study. It would be asking how effective the medicine is, after removing any effects that allowed patients to realize that they'd gotten the medicine (as if belief-entangled effects couldn't be physiological effects of the substance).

Seropositivity is also not a good indicator for having had covid:

Some researchers believe absence of antibodies after infection is positively correlated with long covid (I don't have a source).

This study is bunk and it's harmful for adequate treatment of seronegative patients.

"If we add all the percentage point increases (i.e. how many more percentage points serology positive participants experienced persistent symptoms vs serology negative participants - data from table 2) then we get 20.3%." 

I am not sure whether this reasoning is correct. It seems to be dependent on how the symptoms are categorized. For instance, suppose we divided fatigue into moderate fatigue and severe fatigue. The increased probability for each might be 5%, and then you would get 25.3% rather than 20.3%. Or suppose we combined fatigue and poor attention, which are likely correlated. The combined increased probability of "fatigue or poor attention" is likely less than 7.8%, and this would bring you down from 20%.

Yes, I agree that some symptoms are likely highly correlated. I didn't intend to rule out that possibility with that sentence - I was just trying to say how I did my math (although I'm not sure how clear I was!). The correct conclusion is in the following sentence:

"So having COVID on average gives you ~0.2 persistent symptoms vs not having COVID, with presumably some people having more than one symptom."

Possibly it would be better to add the caveat "0.2 persistent symptoms of those symptoms investigated".