[ Question ]

How can we extrapolate the true prevalence of a disease, given available information?

by Elizabeth, 9th Mar 2020



Note: a similar question got more attention here so maybe check that out.

The motivation here is COVID-19, but I think there are useful general models in the area.

I've made a lot of risk assessment models over the last week, most of which depend on knowing the true infection rate of a population. That's difficult to pin down at the best of times, but especially in the case of COVID-19. In the country I'm most familiar with, the US, there simply aren't enough tests performed to provide good prevalence information. This post is for models extrapolating the true prevalence of a disease from information you have on hand.

This is an exploration thread, so don't worry about it not being rigorous or defensible enough. I'll be posting my own as an example in the answers section.


2 Answers

My attempt (also available on guesstimate), using death rate statistics:


% Infectious = 100 * [# Infected Now] / [Total Population]

[# Infected Now] = [# Infected 2 Weeks Ago] * 2 ^ ([2 weeks] / [Doubling Time])

[# Infected 2 Weeks Ago] = [# of C19 Deaths] / [Fatality Rate]
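As a minimal sketch, the three formulas above collapse into one small Python function. The function name, argument names, and the `lag_days` default (standing in for [2 weeks]) are illustrative choices, not part of the original model, and the inputs in the usage line are placeholders rather than estimates.

```python
def percent_infected(deaths, fatality_rate, doubling_time_days,
                     population, lag_days=14):
    """Point estimate of % currently infected, following the formulas above."""
    # [# Infected 2 Weeks Ago] = [# of C19 Deaths] / [Fatality Rate]
    infected_at_lag = deaths / fatality_rate
    # [# Infected Now] = [# Infected 2 Weeks Ago] * 2 ^ ([2 weeks] / [Doubling Time])
    infected_now = infected_at_lag * 2 ** (lag_days / doubling_time_days)
    # % Infectious = 100 * [# Infected Now] / [Total Population]
    return 100 * infected_now / population


# Placeholder inputs, only to show the call shape.
print(percent_infected(deaths=18, fatality_rate=0.02,
                       doubling_time_days=5, population=2_100_000))
```

Because the doubling time sits in an exponent, small changes to it move the result far more than comparable changes to the death count or fatality rate.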

Variable Values

When values for a specific area were needed I used King County, home of Seattle, since it is the known epicenter right now and plague.com reports deaths by county.

[Total Population] = 2.1m


[Fatality Rate] = .01 to .034

These are estimates I have heard. I intend to replace this with a pointer to a better researched range when possible.

[# of C19 Deaths]: 5 to 100 (what I wanted), 5 to 90 (to make guesstimate come out right)

Seattle has 18 reported deaths. However, 13 of those were at the same nursing home and so don't really count as independent. For a danger-loving estimate I used 18 - 13 = 5 as the number of deaths with independent infections. I used 100 as the upper bound because Seattle has ~5000 pneumonia deaths per year = ~750 in March (assuming it's 3x as bad as the best month) = ~200 per week, and I assumed a 50% increase in pneumonia deaths would be noticed by a doctor or reporter. That gives us 100 as the upper bound of hidden SC2 deaths.

Guesstimate is doing something weird and not following its own documentation, so in order to get the 5-100 lognormal distribution, I had to use the parameters 5 to 90.
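The arithmetic behind the two bounds can be written out as a short sketch. The variable and function names are made up for illustration; the numbers are the rough figures quoted above, with the ~750 March figure taken as given rather than derived.

```python
reported_deaths = 18
nursing_home_deaths = 13  # clustered at one facility, so not independent
independent_deaths = reported_deaths - nursing_home_deaths  # lower bound


def hidden_death_cap(weekly_pneumonia_deaths, noticeable_increase=0.5):
    """Most extra deaths that could hide among weekly pneumonia deaths unnoticed."""
    return weekly_pneumonia_deaths * noticeable_increase


march_pneumonia_deaths = 750                    # rough figure from the text
weekly_estimate = march_pneumonia_deaths / 4    # 187.5, rounded to ~200 in the text

print(independent_deaths)        # lower bound on deaths
print(hidden_death_cap(200))     # upper bound on deaths
```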

Note that I'm including all deaths, not deaths over a specific period. This is fine in the very early stage of the epidemic because it's growing so fast that deaths will be dominated by the fastest dying of the most recently infected.

Hat tip to @orthonormal for pointing out that I could use overall pneumonia deaths as a cap, and for general help with this model.

[Doubling Time]: 3.5 to 7 (uniform distribution)

These are the estimates I have heard through various channels for doubling time. The estimates have been getting smaller recently, which is worrisome. I feel a little gross using doubling time instead of transmission rate, but it does make the math easier. This is one of several reasons this model is only valid during the early, fast-growing period of the virus.
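One way to propagate these ranges without Guesstimate is plain Monte Carlo sampling, sketched below with only the Python standard library. Several things here are assumptions rather than details from the model: each "low to high" range for deaths and fatality rate is read as the 5th to 95th percentile of a lognormal, the fatality rate is treated as lognormal at all, [2 weeks] is taken as 14 days, and the helper names are invented. Since Guesstimate's sampling conventions may differ, the output should not be expected to reproduce its figures.

```python
import math
import random
import statistics

Z90 = 1.6448536269514722  # standard normal quantile bounding a central 90% interval


def lognormal_from_interval(low, high):
    """Return (mu, sigma) of a lognormal whose 5th/95th percentiles are low/high."""
    mu = (math.log(low) + math.log(high)) / 2
    sigma = (math.log(high) - math.log(low)) / (2 * Z90)
    return mu, sigma


def simulate_percent_infected(n=100_000, seed=0,
                              population=2_100_000, lag_days=14):
    """Sample the % infected by drawing each input from its assumed distribution."""
    rng = random.Random(seed)
    d_mu, d_sigma = lognormal_from_interval(5, 100)        # [# of C19 Deaths]
    f_mu, f_sigma = lognormal_from_interval(0.01, 0.034)   # [Fatality Rate], assumed lognormal
    samples = []
    for _ in range(n):
        deaths = rng.lognormvariate(d_mu, d_sigma)
        fatality_rate = rng.lognormvariate(f_mu, f_sigma)
        doubling_time = rng.uniform(3.5, 7)                # [Doubling Time], uniform
        infected_now = deaths / fatality_rate * 2 ** (lag_days / doubling_time)
        samples.append(100 * infected_now / population)
    return samples


samples = simulate_percent_infected()
cuts = statistics.quantiles(samples, n=20)  # cut points at 5%, 10%, ..., 95%
print(f"5th: {cuts[0]:.2f}%  median: {statistics.median(samples):.2f}%  "
      f"95th: {cuts[-1]:.2f}%")
```

Fixing the seed keeps runs repeatable, which makes it easier to see how much the interval moves when a single input range is changed.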

Using this model, as of 2020-03-08, I estimate 0.11% to 5.2% of King County is carrying SARS-CoV-2, with the median prediction being 1.1%. The upper end of this is very high; this survey estimates the peak % of people who have the flu on a given day is 4.3%. I expect SC2 to peak higher than the flu, but not so soon. Even 1.1% is quite high: at that level I'd expect (based on an unpublished model) hospitals to be reaching capacity; news articles indicate they're bracing but the wave hasn't hit yet. This suggests either that King County has caught most of the deaths, or that coronavirus is quite deadly (and therefore each death represents fewer infections).

I made up a rough model here, and Unnamed and Scott Alexander each posted their own models in the comments.