On 5th Feb a commenter on Reddit posted that the coronavirus cases in China were following a suspiciously accurate quadratic curve, implying that China was making up its data and wasn’t even bothering to hide it particularly well.
This set off my bullshit sensors fairly strongly so I wanted to check it out.
Having looked at the data myself, there may be other reasons why the Chinese data are not accurate but I think the quadratic pattern provides no evidence in favour of falsification.
I doubt people are making many decisions based on that post, but looking at where the statistics went wrong may still be useful.
It turns out that the claim of a “near perfect” model was based on a very high R² value (0.9995).
R² is often chosen to summarise how precisely a regression fits the data – it tells you how much of the variance in the data is explained by the equation. Surely if only 1 part in 2,000 isn’t explained by the model, that indicates that the data is fabricated?
The first thing to note is that "variance" is a technical term which isn't the same thing as the natural understanding of "variation". If you don't appreciate this then R² will seem more impressive than it is. In particular, variance isn't the deviation from the mean, but the squared deviation. In the more natural understanding, the quadratic model explains 44 parts in 45 of the deviation from the mean (the square root of 1/2,000 is about 1/45), not 1,999 in 2,000. This is still pretty good but seems like less evidence for fabrication.
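The arithmetic behind the "44 parts in 45" figure can be checked in a few lines. This is only a sketch of the conversion from variance to deviation; the variable names are mine and the only input is the R² value quoted in the Reddit post.

```python
import math

r_squared = 0.9995  # value quoted in the Reddit post

# Fraction of the variance (squared deviation) left unexplained by the model.
unexplained_variance = 1 - r_squared

# Taking the square root puts this back on the scale of the data itself.
unexplained_deviation = math.sqrt(unexplained_variance)

print(round(1 / unexplained_variance))   # → 2000 (1 part in 2,000 of the variance)
print(round(1 / unexplained_deviation))  # → 45   (1 part in 45 of the deviation)
```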
An alternative explanation of the R² value is that it compares two models:
1. The model that you are trying to fit to the data
2. The model where the value of y is expected to be the same for all values of x (the mean of the y values)
A high R² value is telling us that model 1 is a much better fit than model 2.
However in our case we already know model 2 is going to be a terrible, terrible, terrible fit to our data. The y value used in the regression is total cases so far. So model 2 represents some number of cases having already been identified at the beginning of the time period in question and no more cases occurring during the time period.
So saying that model 1 accounts for 44 parts in 45 of the deviation between the data and model 2 doesn’t really tell me much – model 2 is a lost cause.
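To make the two-model comparison concrete, here is a minimal sketch of R² written as one minus the ratio of the two models' squared errors. The cumulative totals are made up for illustration (they are not the real case counts), and the function and variable names are my own.

```python
import numpy as np

def r_squared(y, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    y = np.asarray(y, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y - y_pred) ** 2)    # squared error of model 1 (the fitted curve)
    ss_tot = np.sum((y - y.mean()) ** 2)  # squared error of model 2 (flat line at the mean)
    return 1 - ss_res / ss_tot

# Made-up running totals, for illustration only.
days = np.arange(1, 11)
total_cases = np.array([10, 45, 85, 170, 240, 370, 480, 650, 800, 1010], dtype=float)

# Model 1: a quadratic in time, fitted by least squares.
quadratic_fit = np.polyval(np.polyfit(days, total_cases, deg=2), days)

# Model 2: the same value (the mean) for every day.
flat_fit = np.full_like(total_cases, total_cases.mean())

r2_quadratic = r_squared(total_cases, quadratic_fit)
r2_flat = r_squared(total_cases, flat_fit)
print(f"quadratic: {r2_quadratic:.4f}, flat: {r2_flat:.4f}")
```

Model 2 scores exactly 0 by construction, so any curve that rises with a running total will look excellent next to it.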
New cases per day
The problem here is the chosen y-axis. Instead of choosing total number of cases by a certain day, it would be better to choose new cases per day. This removes excess correlation between data points.
If we do this then instead of fitting a quadratic curve we need to be fitting a straight line (we’re taking the derivative with respect to x) but I'll still call it quadratic for the sake of consistency. Model 2 changes from representing a constant total number of cases to a constant number of new cases every day. This still isn’t a particularly likely model but is certainly an improvement.
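A rough sketch of why the choice of y-axis matters so much, using synthetic data (a straight-line trend in new cases per day plus an alternating wobble, not the real counts). The same series is fitted once as a running total with a quadratic and once as daily new cases with a straight line.

```python
import numpy as np

def r_squared(y, y_pred):
    ss_res = np.sum((y - y_pred) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1 - ss_res / ss_tot

# Synthetic illustration only: linear trend in new cases per day,
# with a wobble of +150 on odd days and -150 on even days.
days = np.arange(1, 16)
wobble = 150.0 * np.where(days % 2 == 1, 1.0, -1.0)
daily_new = 100.0 + 50.0 * days + wobble
total_cases = np.cumsum(daily_new)  # running total; np.diff undoes this

# Quadratic fitted to the running total, straight line fitted to the daily series.
quad_fit = np.polyval(np.polyfit(days, total_cases, deg=2), days)
line_fit = np.polyval(np.polyfit(days, daily_new, deg=1), days)

r2_total = r_squared(total_cases, quad_fit)
r2_daily = r_squared(daily_new, line_fit)
print(f"R^2 on running total:     {r2_total:.4f}")
print(f"R^2 on new cases per day: {r2_daily:.4f}")
```

With this deliberately noisy daily series the running total still scores above 0.99, while the daily fit scores far lower, because summing smooths out the day-to-day noise.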
Plotting and regressing we get:
Our R² value has gone down to 0.96. This is still high and suggests that quadratic growth is a fairly good model for the data but isn’t suspicious. For instance, I can also fit a power law (again 2 free parameters) to the data and get R² = 0.966.
So within the training set our quadratic model (linear new cases per day) explains about as much of the variance as a power law model of new cases does.
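As a sketch of this kind of comparison, the snippet below fits both two-parameter models to the same synthetic daily series (not the real data). How the power law in the post was fitted isn't stated; fitting a straight line in log-log space, as done here, is my assumption.

```python
import numpy as np

def r_squared(y, y_pred):
    ss_res = np.sum((y - y_pred) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1 - ss_res / ss_tot

# Synthetic illustration only: new cases per day with an alternating wobble.
days = np.arange(1, 13)
wobble = 60.0 * np.where(days % 2 == 1, 1.0, -1.0)
daily_new = 150.0 * days + wobble

# Model A: straight line (slope and intercept).
slope, intercept = np.polyfit(days, daily_new, deg=1)
line_pred = slope * days + intercept

# Model B: power law a * t**b, fitted as a straight line in log-log space
# (an assumption; a direct nonlinear fit would give slightly different numbers).
b, log_a = np.polyfit(np.log(days), np.log(daily_new), deg=1)
power_pred = np.exp(log_a) * days ** b

# Both R^2 values are computed on the original (not logged) scale.
r2_line = r_squared(daily_new, line_pred)
r2_power = r_squared(daily_new, power_pred)
print(f"line: {r2_line:.3f}, power law: {r2_power:.3f}")
```

Both models score highly on the same data, so a high R² on its own doesn't single out either one as the true generating process.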
Looking outside the dataset
The obvious thing to do is check whether the pattern was there outside the dataset.
It is clear that the pattern breaks down shortly after it was noticed (day 15 on this chart).
In addition we can look at the pattern before the training set. Again as soon as we go outside of the dataset used to create the graph the pattern completely breaks down. This is not surprising as at this point the quadratic model predicts a negative number of new cases per day.
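The negative prediction is easy to reproduce with any straight line fitted to a window that starts partway up the slope. A minimal sketch with synthetic numbers (the window, trend and wobble are invented for illustration, not taken from the real data):

```python
import numpy as np

# Synthetic illustration only: the "training window" is days 5 to 15.
train_days = np.arange(5, 16)
wobble = 40.0 * np.where(train_days % 2 == 1, 1.0, -1.0)
daily_new = 150.0 * train_days - 600.0 + wobble

# Straight-line model of new cases per day, fitted inside the window only.
slope, intercept = np.polyfit(train_days, daily_new, deg=1)

# Extrapolate backwards to days before the window.
earlier_days = np.arange(1, 4)
predicted_earlier = slope * earlier_days + intercept

# Day on which the fitted line crosses zero new cases.
zero_crossing = -intercept / slope

print(predicted_earlier)
print(zero_crossing)
```

Every day before the zero crossing gets a negative predicted count, so the fitted line cannot describe the early part of any real outbreak.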
(Looking at deaths instead of cases shows a similar story. For deaths the pattern keeps going for a little longer but even then the power law fit matches the data better.)
So to make the case for China falsifying the data quadratically you have to also say that the start date for them doing it was ~20th Jan and the end date more-or-less straight after the pattern was noticed. Presumably this would be justified by China having been caught out and changing from then on.
Or possibly, this is just how the virus develops. Now that we have some developments in other countries it is possible to compare spread rates. Where the disease has got out of containment there is a remarkably consistent pattern of growth which matches the China rate very closely. (I'm planning to write a more detailed post on this.)
Yup! I took this hypothesis seriously when it was first linked by Scott's tumblr, then I found the numbers didn't match up as well as the Reddit post (which only showed a chart) suggested. This was a strong sign that the poster was lying, but I still checked the numbers for a few days until they had clearly diverged.
Picking a descriptive statistic for these sorts of problems is pretty tricky. But I think we can do better than R^2, even without going all Bayesian-parameter-estimation.
What I mostly care about is just the standard deviation (in excel, STDEV.S() ) of the difference between the data and the model. Then I want to know how this compares to other scales in the data (like the average number of new cases per day).
I'd be curious why you think this is a good metric.
BTW, when you say "the difference between the data and the model", I assume you're referring to the residuals?
It's just a measure of how close the data is to the line - like the "inside view" uncertainty that the model has about the data. In fact, that's more precisely what it is if this is the chi squared statistic (or square root thereof) that you minimized to fit the model. And it's in nice convenient units that you can compare to other things.
It's not quite right, because it uses an implicit prior about noise and models that doesn't match your actual state of information. But it's something that someone who's currently reporting R^2 to us can do in 30 seconds in Excel.
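For anyone who would rather not open Excel, here is a minimal Python sketch of the same statistic: the sample standard deviation of the residuals, compared with the average number of new cases per day. The data are synthetic and the names are mine; `ddof=1` is what makes `np.std` match `STDEV.S()`.

```python
import numpy as np

# Synthetic illustration only: new cases per day with an alternating wobble.
days = np.arange(1, 13)
wobble = 60.0 * np.where(days % 2 == 1, 1.0, -1.0)
daily_new = 150.0 * days + wobble

# Fit a straight line and take the residuals (data minus model).
slope, intercept = np.polyfit(days, daily_new, deg=1)
residuals = daily_new - (slope * days + intercept)

# Sample standard deviation of the residuals (ddof=1, like Excel's STDEV.S).
resid_sd = np.std(residuals, ddof=1)

# Compare against another scale in the data: the average daily count.
typical_scale = daily_new.mean()
relative = resid_sd / typical_scale

print(f"residual SD: {resid_sd:.1f} cases per day")
print(f"as a fraction of the mean daily count: {relative:.3f}")
```

The result is in cases per day, so it can be read directly against the size of the daily counts rather than against a flat-line baseline.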
Meta: I think using latex in headers causes a funny issue in the index on the left
Oh, huh. Yeah, don't use LaTeX in headers apparently. Sorry about that. Will add it to the issue tracker, though might take a while until I can fix it.
If we assume that China is an infinite plain, with Chinese people distributed uniformly across it, and that individual Chinese people don't travel much, then R² could be accurate. A circle of disease spreading at a constant distance per day.
Nit: I guess you mean "a quadratic model could be accurate". The term R² here has nothing to do with the fact that the model is quadratic.