Model estimating the number of infected persons in the Bay Area

by elityre · 3 min read · 9th Mar 2020 · 21 comments



[Edit: I already found one error in my spreadsheet, and adjusted the incubation period, which decreased my results by an order of magnitude. My up-to-date spreadsheet is here, but heed it at your own risk.]

[Epistemic status: Quite uncertain. It seems plausible that I made a major math error and this model is flat-out wrong, or that some of the inputs I used were very off. Best to think of this as a draft.]

[Thank you to Elizabeth Garrett, Luke Raskopf, jimrandomh, and PeterH.]

In my coronavirus planning, the crux between different actions is often "how many people are infected (as opposed to symptomatic) on a given day?" (For instance, when 0.5% of the Bay Area population is infected, I'm going to stop going to the gym.)

This post walks through the model that I'm using to estimate current infection rates. I'd be grateful for anyone suggesting improvements, nitpicking the inputs, and especially correcting errors.

I'm computing my estimates in this messy spreadsheet, which automatically imports data from Johns Hopkins CSSE's GitHub repo. (Thanks PeterH!)

Basic argument

(This model is a variation of one that Elizabeth Garrett shared with me. Please give credit where credit is due.)

My goal is to estimate the number of people who are infected (who are carriers of the disease) rather than the number of people who are currently suffering symptoms. Here I'm going to walk through a series of steps, starting from the number of confirmed cases in a location, and derive an estimate of the number of infected persons in that population.

Use diagnosis rate and number of confirmed cases to get the total number of symptomatic cases

To estimate the number of people currently infected, I start with the number of new cases that were diagnosed in the past doubling period.

But not all the people who developed symptoms are confirmed as having the disease. Presumably some fraction (less than one) of all people who develop symptoms are successfully diagnosed. If you know what that fraction (the diagnosis rate or confirmation rate) is, you can get the total number of cases by multiplying the confirmed number of cases by one over the diagnosis rate.

total cases that became symptomatic in the past doubling period = cases confirmed in the most recent doubling period * 1 / confirmation rate
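As a minimal sketch of this step in Python (the function name and the example numbers are made up for illustration and are not taken from the spreadsheet):

```python
def symptomatic_cases(confirmed_new_cases: float, confirmation_rate: float) -> float:
    """Scale confirmed cases up to an estimate of all symptomatic cases.

    confirmation_rate is the fraction of symptomatic cases that get
    diagnosed, so it must be greater than 0 and at most 1.
    """
    if not 0 < confirmation_rate <= 1:
        raise ValueError("confirmation_rate must be in (0, 1]")
    return confirmed_new_cases / confirmation_rate


# Hypothetical inputs: 50 confirmed cases, 25% of symptomatic cases diagnosed.
print(symptomatic_cases(50, 0.25))  # 200.0
```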

Use doubling time and recent daily cases to get the number of cases one doubling time ago

If you know the doubling time of the disease, and you know how many new cases there were in the past one doubling time, you know how many cases there were at the beginning of that doubling time.

For instance, if you know that a disease has a doubling time of one week, and you know that there were 50 new cases over the past week, that means there must have been 50 cases a week ago. (Because that's what a doubling time means. After one doubling time, there are twice as many cases as you started with).

total cases that became symptomatic in the past doubling period = total cases that had already shown symptoms at the beginning of that doubling period
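The one-week example can be checked with a few lines of Python. This is only a sketch under the assumption of clean exponential growth; the function name is hypothetical.

```python
def cumulative_cases(initial_cases: float, days: float, doubling_time_days: float) -> float:
    """Cumulative cases after `days`, assuming pure exponential growth."""
    return initial_cases * 2 ** (days / doubling_time_days)


cases_a_week_ago = 50
cases_today = cumulative_cases(cases_a_week_ago, days=7, doubling_time_days=7)
new_cases_this_week = cases_today - cases_a_week_ago

print(cases_today)          # 100.0
print(new_cases_this_week)  # 50.0, the same as the count one doubling time ago
```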

Use total number of cases and incubation period to get the number of people who became infected one incubation period ago

However, the number of symptomatic cases lags behind the number of infected people, because there's an incubation period.

If we treat the incubation period as uniform, that means that the total number of people that have shown symptoms, on any given day, is equal to the number of people who were infected one incubation period ago.

So we're now estimating the number of people that were infected two steps in the past: a doubling time and an incubation period ago.

Use the number of people infected (one doubling time + one incubation period) ago and the doubling time to get the current number of infected people

Once you have a number of people who were infected (though not necessarily symptomatic) a doubling time and an incubation period ago, you can multiply that number by 2 raised to "however many doubling times there have been since that day".

This gives us an estimate of the number of people who are currently infected.
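Putting the four steps together, here is a rough Python sketch of the whole chain. The function and parameter names are hypothetical, the example inputs are made up, and this is not a copy of the spreadsheet's formulas.

```python
def estimate_currently_infected(
    new_confirmed: float,
    confirmation_rate: float,
    doubling_time_days: float,
    incubation_days: float,
) -> float:
    """Estimate current infections from cases confirmed in the last doubling period."""
    # Step 1: scale confirmed cases up to all symptomatic cases.
    symptomatic_new = new_confirmed / confirmation_rate
    # Step 2: new cases over one doubling period equal the total at its start.
    symptomatic_at_start = symptomatic_new
    # Step 3: with a uniform incubation period, that total equals the number
    # of people infected one incubation period before that.
    infected_then = symptomatic_at_start
    # Step 4: grow forward by the doublings elapsed since that day.
    doublings = (doubling_time_days + incubation_days) / doubling_time_days
    return infected_then * 2 ** doublings


# Hypothetical inputs: 10 newly confirmed cases, 50% confirmation rate,
# 7-day doubling time, 14-day incubation period (so 3 doublings).
print(estimate_currently_infected(10, 0.5, 7, 14))  # 160.0
```

Written as a single expression, the estimate is (new confirmed cases / confirmation rate) * 2^(1 + incubation period / doubling time).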

(If you see any errors, please leave a comment!)

Conclusion with current numbers

Given the above model, we can plug in some available numbers to get an estimate of how many people in the Bay Area are currently (as of the evening of March 8, 2020) infected with COVID-19.

For the number of confirmed cases, I'm using the data from Johns Hopkins CSSE. [See the "intermediate calculations" tab of the spreadsheet].

(Note that these numbers include the Grand Princess cruise ship, which is currently in the Pacific off the coast of California.)

I've heard that the doubling time for COVID-19 is between 3.5 and 7 days, so I calculated both of those, for a rough lower and upper bound. As more data comes in, I'll be able to observe the doubling time in the Bay Area directly, and use that for future calculations.

(For calculating the number of new cases in the past 3.5 days, I took the difference between today and the average of 3 and 4 days ago.)
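A small sketch of that half-day interpolation, assuming a list of cumulative confirmed counts with one entry per day and today last. The function name and the sample series are made up. Averaging two days is a linear approximation, which is only roughly right for an exponentially growing series.

```python
from typing import Sequence


def new_cases_last_3_5_days(cumulative_by_day: Sequence[float]) -> float:
    """New cases over the last 3.5 days from daily cumulative counts.

    cumulative_by_day[-1] is today, [-4] is 3 days ago, [-5] is 4 days ago.
    """
    if len(cumulative_by_day) < 5:
        raise ValueError("need at least 5 days of data")
    today = cumulative_by_day[-1]
    three_days_ago = cumulative_by_day[-4]
    four_days_ago = cumulative_by_day[-5]
    # Approximate the count 3.5 days ago as the midpoint of the two days.
    return today - (three_days_ago + four_days_ago) / 2


# Hypothetical cumulative counts for the last five days, oldest first.
print(new_cases_last_3_5_days([10, 14, 20, 28, 40]))  # 28.0
```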

I'm very uncertain about what a reasonable confirmation rate is. Are 50% of symptomatic cases successfully being diagnosed as COVID-19? Are 30%? 10%? 1%?!

I elected to take all of them, and compute the number of people who are infected as a function of the confirmation rate. [see "Number of infected people" tab in the spreadsheet.]
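The sweep over confirmation rates can be sketched like this. The number of new confirmed cases below is a placeholder rather than the real Bay Area figure, the 14-day incubation period comes from the simplifying assumptions listed later in this post, and the function name is hypothetical.

```python
def estimate_infected(new_confirmed, confirmation_rate, doubling_time_days, incubation_days):
    """Current infections: scale up by the confirmation rate, then grow forward."""
    doublings = 1 + incubation_days / doubling_time_days
    return (new_confirmed / confirmation_rate) * 2 ** doublings


NEW_CONFIRMED = 10     # placeholder, not real data
INCUBATION_DAYS = 14   # assumed uniform incubation period

for doubling_time in (3.5, 7):
    for rate in (0.5, 0.3, 0.1, 0.01):
        estimate = estimate_infected(NEW_CONFIRMED, rate, doubling_time, INCUBATION_DAYS)
        print(f"doubling={doubling_time}d rate={rate:.0%} -> {estimate:,.0f} infected")
```

Because the estimate is inversely proportional to the confirmation rate, halving the assumed rate doubles the result, which is why the curve is easier to read on a log scale.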

[Chart: number of infected people as a function of confirmation rate, plotted on a log scale.]

Again, I'm unsure what kind of diagnosis rate is reasonable. But, from a rough guess, I would be surprised if it was less than 5%, and surprised if it was much more than 70%.

So that gives an upper bound of 89,740 infected people (about 1.15% of the population of the Bay Area) and a lower bound of 1,393 infected people (about 0.02% of the population of the Bay Area).

Note that that upper bound, in particular, is very sensitive to changes in the confirmation rate: if we assume that 10% of cases are successfully diagnosed, our number of infected persons drops to 44,870 (~0.5% of BA population).

Noting some simplifying assumptions that I'm making:

  • I'm assuming that the spread of coronavirus is well-modeled by an exponential function.
  • I'm assuming that everyone who is infected begins displaying symptoms exactly 2 weeks later.
    • (To the extent that infectees show symptoms earlier than two weeks, these models are overestimating the true values because there are fewer doubling times between infection and confirmation).
  • I'm assuming that everyone who gets coronavirus is diagnosed and confirmed as having coronavirus on the day they develop symptoms.
    • In reality, there's probably a lag (does anyone know how much of a lag?), which means these numbers will underestimate the true value, because we're actually getting data about who started showing symptoms a few days ago.
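To get a feel for how much the last two assumptions matter, here is a rough sketch of the correction factors involved. Every day of error in either direction shifts the estimate by a factor of 2^(days / doubling time). The 5-day incubation period and 3-day reporting lag below are made-up illustrative values, not data, and the function name is hypothetical.

```python
def growth_factor(days: float, doubling_time_days: float) -> float:
    """Multiplicative growth over `days` under pure exponential growth."""
    return 2 ** (days / doubling_time_days)


DOUBLING_TIME_DAYS = 7  # one of the two doubling times considered above

# If symptoms really appear after 5 days (hypothetical) instead of 14,
# the model overestimates by the growth over the 9 extra days it assumed.
overestimate = growth_factor(14 - 5, DOUBLING_TIME_DAYS)

# If confirmation lags symptom onset by 3 days (hypothetical),
# the model underestimates by the growth over those 3 days.
underestimate = growth_factor(3, DOUBLING_TIME_DAYS)

print(f"overestimate factor from shorter incubation: {overestimate:.2f}")
print(f"underestimate factor from reporting lag: {underestimate:.2f}")
```

The two effects push in opposite directions, so they partially cancel, but not necessarily by the same amount.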


Again, please help me correct any mistakes. Additionally, if anyone has better data for any of these inputs than I've used here, especially for the confirmation rate, please share.

And if you have a different model, please post it! I would rather be taking my estimates from an ensemble of models.