In the early days of the pandemic, there wasn't great data available, and it wasn't easy to do better than trusting the standard epidemiological estimate that around 2% of people who got COVID-19 would die. My back-of-the-envelope estimate at the time was way higher, but no one else I knew seemed to think that number made sense, so I let the matter drop. But now we have enough data to check.

Recently, my sister reached out to me to check her own thinking on the matter. She used the same method I initially tried and abandoned - simply dividing the number of deaths by the number of resolved cases (deaths + recoveries) - to estimate that in the US, COVID-19 kills around 1 in 6 people who get it.

The problem with using only resolved cases, in a country with an ongoing pandemic, is that if people die faster than they're marked recovered, death rates can be inflated - and if they recover faster, deflated. Ideally, you'd want to wait until all cases have been resolved one way or the other. Fortunately, there are now countries where that situation nearly holds.
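As a rough illustration of the method and its bias, here is a minimal Python sketch. The function name and all the numbers are made up for illustration; they are not taken from any country's data.

```python
def resolved_case_death_rate(deaths, recoveries):
    """Deaths as a share of resolved cases (deaths + recoveries)."""
    resolved = deaths + recoveries
    if resolved == 0:
        raise ValueError("no resolved cases yet")
    return deaths / resolved


# Hypothetical outbreak: 1,000 cases, of which 30 will eventually die.
# Mid-outbreak, deaths are being recorded faster than recoveries.
mid_outbreak = resolved_case_death_rate(deaths=20, recoveries=100)

# Once every case is resolved one way or the other, the bias disappears.
final = resolved_case_death_rate(deaths=30, recoveries=970)

print(f"mid-outbreak: {mid_outbreak:.1%}, final: {final:.1%}")
```

In this made-up example the mid-outbreak figure is several times the final one, purely because recoveries are marked late.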

ETA: The other problem is that cases aren't infections. But if, as I do, you want to use publicly reported case data to make informed personal decisions, you might also be interested in a figure that is easier to calculate: expected deaths per reported case.

I looked at countries with the 25 lowest active case counts, using the 91-DIVOC visualization tool, to see which ones seemed to be mostly done with the pandemic, at least for now.

I copied the current numbers (as of 7 Jun 2020) into a spreadsheet, to see what the effective death rate is in countries where the vast majority of cases are resolved. For the People's Republic of China, I only looked at numbers from Hubei.

In the countries and regions I looked at, active cases are around 1% of all confirmed cases, and around 5.8% of resolved COVID-19 cases end in death. But two thirds of cases came from Hubei. Excluding China, around 4% of confirmed cases are active, and around 3.5% of resolved cases ended in death. If I only look at other countries where fewer than 1% of confirmed cases are active, the average COVID-19 death rate is 2.6%, though in individual countries it ranges from 0.55% for Iceland to 4.67% for Croatia.
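A pooled rate is dominated by whichever region contributes most of the cases, which is why excluding Hubei moves the number so much. The sketch below shows the difference between a pooled rate and an unweighted mean of regional rates. The region names and counts are hypothetical placeholders, not the figures from my spreadsheet, and the averages quoted above may not have been computed in exactly this way.

```python
# Hypothetical (deaths, recoveries) per region; NOT the spreadsheet's figures.
regions = {
    "big_region": (4000, 60000),
    "small_a": (10, 1800),
    "small_b": (100, 2100),
}


def pooled_rate(data):
    """Total deaths divided by total resolved cases across all regions."""
    deaths = sum(d for d, _ in data.values())
    resolved = sum(d + r for d, r in data.values())
    return deaths / resolved


def mean_of_rates(data):
    """Unweighted average of each region's own resolved-case death rate."""
    rates = [d / (d + r) for d, r in data.values()]
    return sum(rates) / len(rates)


without_big = {k: v for k, v in regions.items() if k != "big_region"}

print(f"pooled, all regions: {pooled_rate(regions):.2%}")
print(f"pooled, excluding big_region: {pooled_rate(without_big):.2%}")
print(f"unweighted mean of regional rates: {mean_of_rates(regions):.2%}")
```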

I'm adjusting my freak-out-and-hole-up threshold accordingly.

UPDATE: In the comments on my blog, Anna Salamon shared this estimate of the actual infection fatality rate for the state of New York.

Infection fatality rate (IFR) is clearly the superior metric if you're trying to forecast something like the spread of the virus or total death counts, because it corresponds more directly to a statement about underlying reality than the case fatality rate (CFR) does; the denominator of CFR is determined in part by who gets tested.

But if you're trying to figure out what rough-and-ready multiplier to apply to the daily numbers reported in your area, then to use IFR estimates, you need to remember that reported cases are not the same as actual infections, and adjust accordingly.
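One way to make that adjustment concrete is sketched below. It assumes you have some estimate of how many actual infections stand behind each reported case; that multiplier is an assumption that has to come from elsewhere (serology, for instance) and will change as local testing changes. The function names and the example inputs are illustrative only, and the sketch ignores the lag between a case being reported and a death.

```python
def deaths_per_reported_case(ifr, infections_per_reported_case):
    """Convert an IFR into expected deaths per reported case.

    infections_per_reported_case = true infections / reported cases.
    This multiplier is an assumption that must be estimated separately
    and changes as testing changes.
    """
    if not 0 <= ifr <= 1:
        raise ValueError("ifr must be a fraction between 0 and 1")
    if infections_per_reported_case <= 0:
        raise ValueError("multiplier must be positive")
    return ifr * infections_per_reported_case


def expected_deaths(new_reported_cases, ifr, infections_per_reported_case):
    """Expected eventual deaths among the infections behind a batch of reported cases."""
    return new_reported_cases * deaths_per_reported_case(ifr, infections_per_reported_case)


# Purely illustrative inputs: an IFR of 0.64% and an assumed
# 5 actual infections per reported case.
multiplier = deaths_per_reported_case(ifr=0.0064, infections_per_reported_case=5)
print(f"deaths per reported case: {multiplier:.2%}")
print(f"expected deaths behind 200 reported cases: {expected_deaths(200, 0.0064, 5):.1f}")
```

Put differently, an IFR only tells you about your local daily numbers once it is multiplied by the local ratio of infections to reported cases, and that product plays the same role as a case fatality rate.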


27 comments

Ben, I think you're failing to account for under-testing. You're computing the case fatality rate when you want the infection fatality rate. Most experts, as well as the well-done meta-analyses, place the IFR in the 0.5%-1% range. I'm a little bit confused why you're relying on this back of the envelope rather than the pretty extensive body of work on this question.

IFR isn't that helpful when trying to use public case data to estimate a hazard rate. I'll add a note clarifying that in the post. Since what's reported are cases, case fatalities are the natural thing to multiply the rate of new cases by.

Some apparently expert-promoted models have been total nonsense, and I prefer a back-of-the-envelope calculation whose flaws are obvious and easy for me to understand, to comparatively opaque sophisticated estimates which I can't interpret.

Can you point me to a clear concise account that shows how to estimate IFR with available data and use it in a decision-relevant way?

The most detailed treatment I’ve seen on this is this from a couple of months ago.

EDIT: To clarify per discussion below, I do think there's a fair chance that given a lack of sufficient ventilators the IFR may be >1%.

You say that like detail is a pure good. "Greg Cochran says 1.2%" is better than any number of words from CBG. Anyhow, you repudiated this. When I pushed you on it, you came up with the number 1.4%.

I'm not confident in 1% as an upper limit (especially in an overrun healthcare system) but I do think that comment gives good back-of-the-envelope estimates (as requested). Later on in that thread CBG also acknowledges it may be higher than 1% in some places and conditions.

Detail in this case is useful as it shows multiple sources and back-of-the-envelope calculations. I'm not really assessing CBG (except trusting that he isn't picking and choosing his arguments), rather I'm assessing his back-of-the-envelope calculation and where likely errors can creep in - exactly what the great-grandparent mentioned was preferred. 

If "Greg Cochran says 1.2%" is the counter-argument then I don't really know what to say except how likely is it that he's wrong this time and by what factor might he be off? What's his confidence interval? If someone can provide his working then at least that's something I can assess. It seems he is looking specifically at places with high infection rates and more stretched healthcare systems.

Anyhow, you repudiated this. When I pushed you on it, you came up with the number 1.4%.

The naive central estimate of a single back-of-the-envelope calculation, in which virus prevalence in Lombardy was estimated from one small town a month earlier, isn't something I'd put much weight on. If pushed for an interquartile range based only on this calculation I would say 0.5% < IFR < 3.5%. The point of that calculation wasn't to get an accurate answer but to show that a 0.2% population fatality rate doesn't imply that the IFR is massive, and that 3,000,000 US coronavirus deaths this year is still highly unlikely.

except trusting that he isn't picking and choosing his arguments

Well, don't do that. I told you this before.

What's his confidence interval?

What's CBG's confidence interval? When he says 0.5-1%, does he mean something? Does he mean a confidence interval, or a distribution of "normal" situations or a distribution of more general situations? Or does he not mean anything?

Later on in that thread CBG also acknowledges it may be higher than 1% in some places and conditions.

It's nice that he says that, but that's exactly the situation for which you cited him in the other thread, claiming <=1%. I'm guessing that the pseudo-detail is exactly what caused you to not understand his claims. If you don't know what he claims, how can you assess his work? At least with GC you're not fooling yourself about what you've done.

And I still don't know what he claims. He seems to claim that NYC had IFR <=1%. Was NYC normal or not? In any event he's wrong. If NYC defines the upper range, then this affects his conclusion. If NYC doesn't count, I dunno, but I'm pretty sure that people are equivocating on whether it counts.

I have edited the original comment to more fully reflect my position.

The CFR will shift substantially over time and location as testing changes. I'm not sure how you would reliably use this information. IFR should not change much and tells you how bad it is for you personally to get sick.

I wouldn't call the model Zvi links expert-promoted. Every expert I talked to thought it had problems, and the people behind it are economists not epidemiologists or statisticians.

For IFR you can start with seroprevalence data here and then work back from death rates:

Regarding back-of-the-envelope calculations, I think we have different approaches to evidence/data. I started with back-of-the-envelope calculations 3 months ago. But I would have based things on a variety of BOTECs and not a single one. Now I've found other sources that are taking the BOTEC and doing smarter stuff on top of it, so I mostly defer to those sources, or to experts with a good track record. This is easier for me because I've worked full-time on COVID for the past 3 months; if I weren't in that position I'd probably combine some of my own BOTECs with opinions of people I trusted. In your case, I predict that Zvi, if you asked him, would also say the IFR was in the range I gave.

I clicked through to the tweet you mentioned, which contains a screencap of a chart purporting to show "An Approximate Percentage of the Population That Has COVID-19 Antibodies." No dates or other info about how these numbers might have been generated.

Fortunately, Gottlieb's next tweet in the thread contains another screencap of the URLs of the studies mentioned in the chart. I hand-transcribed the Wuhan study URL, and found that while it was performed at a date that's probably helpful (April 20th) it's a study in a single hospital in Wuhan, and the abstract explicitly says it's not a good population estimate:

Here, we reported the positive rate of COVID‐19 tests based on NAT, chest CT scan and a serological SARS‐CoV‐2 test, from April 3 to 15 in one hospital in Qingshan Destrict, Wuhan. We observed a ~10% SARS‐CoV‐2‐specific IgG positive rate from 1,402 tests. Combination of SARS‐CoV‐2 NAT and a specific serological test might facilitate the detection of COVID‐19 infection, or the asymptomatic SARS‐CoV‐2‐infected subjects. Large‐scale investigation is required to evaluate the herd immunity of the city, for the resuming people and for the re‐opened city.

I'd need to know more about e.g. hospitalization rates in Wuhan to interpret this.

The New York numbers seem to come from a press release, with no clear info about how testing was conducted.

All of these are point estimates, and to get ongoing infection rates, I'd need to fit a time series model with too many degrees of freedom. Not saying no one can do this, but definitely saying it's not clear to me how I can make use of these numbers without working on the problem full time for a few weeks.

You've nonspecifically referred to experts and models a few times; that's not helpful and only serves to intimidate. What would be helpful would be if you could point to specific models by specific experts that make specific claims which you found helpful.

I'm not trying to intimidate; I'm trying to point out that I think you're making errors that could be corrected by more research, which I hoped would be helpful. I've provided one link (which took me some time to dig up). If you don't find this useful that's fine, you're not obligated to believe me and I'm not obligated to turn a LW comment into a lit review.

Given that it apparently took you some time to dig up even as much as a tweet with a screen cap of some numbers that with quite a lot of additional investigation might be helpful, I hope you're now at least less "confused" about why I am "relying on this back of the envelope rather than the pretty extensive body of work on this question."

If you want to see something better, show something better.

The director of NIAID publicly endorsed that model's bottom line.

start with seroprevalence data

Because of false positives, seroprevalence is massively overestimated everywhere that there hasn't been a massive outbreak. In those places the IFR is 1-2%. But can we extrapolate to normal outbreaks? If, as widely believed, an overrun medical system has worse mortality, then maybe the normal IFR really is only 0.5-1%. But if your meta-analysis directly measures that, it is not well-done.
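To make the false-positive point concrete, here is a hedged sketch of the standard Rogan-Gladen correction for test error, followed by a naive IFR computed from deaths and seroprevalence. The sensitivity, specificity, death count, and population below are hypothetical inputs chosen only to show the direction of the effect, not measurements from any study.

```python
def rogan_gladen(apparent_prevalence, sensitivity, specificity):
    """Correct an observed positive rate for test error (Rogan-Gladen estimator)."""
    denom = sensitivity + specificity - 1
    if denom <= 0:
        raise ValueError("test must perform better than chance")
    corrected = (apparent_prevalence + specificity - 1) / denom
    # Clamp to [0, 1]: sampling noise can push the raw estimate outside it.
    return min(max(corrected, 0.0), 1.0)


def ifr_from_serology(deaths, population, prevalence):
    """Deaths divided by the estimated number of people ever infected."""
    infected = population * prevalence
    if infected <= 0:
        raise ValueError("estimated infections must be positive")
    return deaths / infected


# Hypothetical inputs: 3% of samples test positive, with an assumed
# 90% sensitivity and 98% specificity; 100 deaths in a population of 1,000,000.
observed = 0.03
corrected = rogan_gladen(observed, sensitivity=0.90, specificity=0.98)

print(f"naive IFR:     {ifr_from_serology(100, 1_000_000, observed):.2%}")
print(f"corrected IFR: {ifr_from_serology(100, 1_000_000, corrected):.2%}")
```

With these made-up inputs, most of the observed positives are false positives, so the corrected prevalence is much lower and the implied IFR rises from about 0.33% to about 0.88%.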

The intro paragraph seems to be talking about IFR ("around 2% of people who got COVID-19 would die") and suggesting that "we have enough data to check", i.e. that you're estimating IFR and have good data on it.

Good point, I should add a clarifying note.

Here is a study that a colleague recommends: Tweet version:

Their point estimate is 0.64% but with likely heterogeneity across settings.

According to Greg Cochran, NYC and Italy give us the best data and the mortality rate for people who get COVID seems around 1.2% for an age structure similar to the US.

A link, or other citation if this somehow isn't available online, would help here. As would an explanation of why I should prefer this number to some other.

Sorry no link, but we might do another podcast soon. As to why you should prefer this number, well, Scott Alexander said Greg has "creepy oracular powers".

Interesting, Singapore has an extremely low CFR: 37,900 cases and only 25 deaths. Mostly because of overtesting and young patients (migrant workers).

This points to an important weakness in the data source I'm using here.

Does your recovery number include people who got covid but were never tested (ie positive antibody test, but not tested for infection)?

Not unless countries are reporting untested cases somehow.

Not to demand more work, but, um, any chance you could break any of these down by age group?

From your comment "But if you're trying to figure out what rough-and-ready multiplier to apply to the daily numbers reported in your area, then to use IFR estimates, you need to remember that reported cases are not the same as actual infections, and adjust accordingly." I still do not understand how you can translate CFR into the IFR that you really need. How do you "adjust accordingly"?