Kinsa Smart Thermometer Dataset
Kinsa is a company that makes smart thermometers. A few years ago, they found that they could use the data that they got from their smart thermometers (most importantly the temperature reading and location of the user) to track flu trends across the United States. (FitBit has done something similar.)
Kinsa's data science team has now turned their attention to Covid-19 trends and started a tracking website based on their thermometer data, using methods that they explain in more detail on their technical approach page. It looks like the most impressive thing that they've been able to do with this dataset so far is to identify new hotspots before other people do, like the increase in cases in southern Florida. But potentially there are a lot of other things that can be done with these sorts of data.
Estimating the Number of Coronavirus Infections in the US
One of those other things that might be doable with these sorts of data: coming up with more accurate estimates of the number of people with coronavirus. Testing in the US (and many other places) is spotty and heavily delayed, and estimating the number of infections from the number of deaths involves a very long delay and a bunch of assumptions. But if you can count the number of people in America with a fever (or extrapolate from a sample), and subtract off the baseline estimate of how many fevers you'd expect from influenza or other causes, then you can get an estimate of the number of people in the US with a fever due to coronavirus. And that gets you close to an estimate of the total number of coronavirus cases.
The coronavirus tracking website that Kinsa set up is already doing much of this - their graph (also shown below) shows something like the number of people with a fever and the baseline expected number of fevers.
So I decided to give it a try and use their graph to estimate the total number of coronavirus cases in the US.
It's a fairly rough first-pass analysis, which may contain errors, and could definitely be improved with some more work. The number I got at the end is that about 1% of Americans have gotten coronavirus, through March 20.
My Estimation Method, in Brief
The graph above shows something like "number of new fevers" (on an unclear scale labeled "% ill") and Kinsa's estimate of the expected number of fevers if there were no coronavirus. So the gap between the two lines represents something like the number of new fevers each day due to coronavirus. That trend has an odd shape for a pandemic: it increases and then levels off. I suspect that this is because, once people start taking precautions to avoid coronavirus, the number of flu cases drops dramatically, so their estimated baseline gets farther and farther above the actual number of flu cases, and coronavirus accounts for a larger and larger share of the new fevers. You can view the regional trends by clicking on particular counties; regions like the SF Bay Area and Seattle have a similar shape on earlier days. The SF Bay Area is actually now anomalously below baseline in number of new fevers on March 21.
I decided to deal with this by focusing on the trend up until March 14, and extrapolating from there. (It would be even better to do this separately for each county and then aggregate them.)
Next step: making sense of the y-axis. A little bit of digging showed that it's from their flu work, where they used their data to fit a particular measure of flu prevalence that the CDC uses, which is ILINet data (explained partly down the page here). A little bit more digging on the relationship between this number and the number of flu cases reported by the CDC (as seen in headlines like this) suggests that 1 point on the scale corresponds to roughly 75,000 new flu cases that day (which probably means about 75,000 new fevers). There's a more detailed explanation of where that number comes from in my longer writeup.
So the gap of 0.79 scale points between observed and expected on March 14 corresponds to about 60,000 excess new fevers that day, which we're guessing are entirely due to coronavirus. Using either their data for previous days or assumptions about the growth rate in cases, we can turn that into an estimate of the cumulative total number of feverish cases as of that day. I tried both and got numbers of 470,000 and 370,000, so let's call it about 420,000 total cumulative cases through March 14.
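Here is a rough sketch of the arithmetic in this step. The 0.79 gap and the 75,000 fevers per scale point come from the text above. The cumulative total is one possible version of the "assumptions about the growth rate" route: it assumes daily new fevers have been doubling every 4 days (the doubling time used later in this post; applying it here is an assumption) and sums the resulting geometric series back in time. It happens to land near the lower of the two figures, though the post doesn't say which method produced which number, and the function names are just illustrative.

```python
FEVERS_PER_SCALE_POINT = 75_000  # from the text: ~75,000 new fevers per scale point
GAP_MARCH_14 = 0.79              # from the text: observed minus expected, in scale points
DOUBLING_DAYS = 4.0              # assumption for this step (the post uses 4 days later on)


def excess_new_fevers(gap, fevers_per_point=FEVERS_PER_SCALE_POINT):
    """Convert a gap in scale points into excess new fevers for that day."""
    return gap * fevers_per_point


def cumulative_from_latest_daily(latest_daily, doubling_days):
    """Total cases to date if daily new cases have grown exponentially.

    The daily count k days ago is latest_daily * 2**(-k / doubling_days);
    summing over k = 0, 1, 2, ... gives a geometric series.
    """
    daily_ratio = 2 ** (-1 / doubling_days)
    return latest_daily / (1 - daily_ratio)


daily = excess_new_fevers(GAP_MARCH_14)
cumulative = cumulative_from_latest_daily(daily, DOUBLING_DAYS)

print(f"Excess new fevers on March 14: {daily:,.0f}")
print(f"Cumulative feverish cases (growth assumption): {cumulative:,.0f}")
```

A shorter assumed doubling time gives a smaller cumulative total for the same daily count, since more of the cases are recent.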
But this only counts the coronavirus cases that do get a fever, and (more importantly) it only counts them once they get the fever. My guess is that a bit more than one doubling time passes between infection and fever, so there were already more than twice as many infections as feverish cases by that day. Adjusting also for the cases that never get a fever, I estimate that the total number of coronavirus infections on March 14 was about 3x the number of feverish cases, or about 1.3 million.
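To make the shape of that adjustment concrete, here is a minimal sketch of how a multiplier near 3 could arise under exponential growth. The specific inputs (a 5-day lag from infection to fever, a 4-day doubling time, and 80% of infections eventually producing a fever) are placeholders picked only for illustration; they are not figures from this post, and the function name is hypothetical.

```python
def infections_per_feverish_case(lag_days, doubling_days, febrile_fraction):
    """Ratio of total infections to feverish cases observed so far.

    Under exponential growth, infections today exceed infections lag_days ago
    by a factor of 2**(lag_days / doubling_days). Only febrile_fraction of
    those earlier infections ever show up as a fever.
    """
    lag_factor = 2 ** (lag_days / doubling_days)
    return lag_factor / febrile_fraction


# Placeholder inputs, NOT from the post: lag a bit longer than one doubling time.
multiplier = infections_per_feverish_case(
    lag_days=5, doubling_days=4, febrile_fraction=0.8
)
print(f"Infections per feverish case: {multiplier:.2f}")
```

The multiplier is quite sensitive to the assumed lag, which is one reason the 3x figure should be read as a rough guess.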
Extrapolating forward assuming a 4-day doubling time gives an estimate of 3.6 million cases in the US through March 20, or 1.1% of the population.
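The last two steps can be checked with a few lines of arithmetic. The 420,000 feverish cases, the 3x multiplier, and the 4-day doubling time are taken from the text; the US population of roughly 330 million is an assumption added here to reproduce the percentage.

```python
FEVERISH_CASES_MARCH_14 = 420_000   # from the text
INFECTIONS_PER_FEVERISH_CASE = 3    # from the text: ~3x adjustment
DOUBLING_DAYS = 4                   # from the text
DAYS_FORWARD = 6                    # March 14 to March 20
US_POPULATION = 330_000_000         # assumption: roughly 330 million

# Step 1: scale feverish cases up to total infections on March 14.
infections_march_14 = FEVERISH_CASES_MARCH_14 * INFECTIONS_PER_FEVERISH_CASE

# Step 2: grow that total forward by 6 days at a 4-day doubling time.
growth_factor = 2 ** (DAYS_FORWARD / DOUBLING_DAYS)
infections_march_20 = infections_march_14 * growth_factor

share_of_population = infections_march_20 / US_POPULATION

print(f"Infections through March 14: {infections_march_14:,.0f}")
print(f"Growth factor over {DAYS_FORWARD} days: {growth_factor:.2f}")
print(f"Infections through March 20: {infections_march_20:,.0f}")
print(f"Share of US population: {share_of_population:.1%}")
```

Six days at a 4-day doubling time is one and a half doublings, so the March 14 figure grows by a factor of about 2.8.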
So that's the basic method and estimate. The longer writeup goes into more detail about each step, and includes various things I'm still confused or uncertain about and ways in which this analysis might be wrong. For instance, maybe concerns about coronavirus are causing people to take their temperature more often, which would be enough on its own to increase the number of measured fevers, and a large part of the upward trend is due to that rather than to actual coronavirus cases.
I'm interested in improving this estimate, or having other people go off and do their own estimate. And I'm especially interested in people finding more good things to do with this sort of dataset.