This is great, but I'd caution against saying it's definitive. There's a risk of multiple hypothesis testing, as well as the usual risks of publication bias, possible errors in the way individual studies were conducted, etc.
Er, yes. That was obvious in my head and I should probably make it obvious in the article text too. Thanks!
Dynomight has looked at the health effects of vitamin D supplementation. The large-scale meta-analyses that have been performed conclude there is no significant effect, even though individual studies relatively consistently point in the direction of an effect. This means our failure to detect a significant effect may be due to the low statistical power of the experiments.
Dynomight suggests a mechanism for that: what if the low power of the existing randomised trials is because they tend to be run in countries that fortify common foods with vitamin D? That would mean, practically speaking, all arms of the trial get the treatment. To investigate, Dynomight pulls out a table with the results only from studies where participants entered with relatively low values of vitamin D in their blood.
| Trial | All-cause mortality |
|---|---|
| Trivedi | 0.90 (0.77 to 1.07) |
| WHI | 0.92 (0.83 to 1.01) |
| Lyons | 0.99 (0.93 to 1.05) |
| RECORD | 0.93 (0.85 to 1.02) |
The all-cause mortality is reported in odds ratios, which is a multiplicative scale. If the number is less than one, it means vitamin D supplementation was found to reduce all-cause mortality. But the 95 % confidence intervals in parentheses all straddle 1, meaning none of the studies were able to show a significant effect at that confidence level.
Unfortunately, no formal meta-analysis has been done on this specific subset of studies. But we can make a quick and dirty one!
Sign test (counting coinflips)
We can tell immediately from the table that four out of four studies have a number less than one, i.e. they show a beneficial effect of vitamin D supplementation.
This is a primitive form of meta-analysis! We count the total number of studies, and how many of them support the hypothesis. If we assume no effect, then half the trials should show a benefit, and half should show harm. What are the chances of flipping a coin four times and getting the same result on all four of them? 12.5 %.
Thus, the pooled p-value of the combined trials, when we look only at the direction of their result, is 12.5 %. Not significant. But keep this technique in your back pocket! It's so easy you can pretty much do it in your head.
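If you'd rather not do the coin-flip arithmetic in your head, here is a minimal sketch of the same count in Python. The function name `sign_test_p` is my own; it assumes a fair-coin null and counts outcomes at least as lopsided as the observed one, in either direction.

```python
from math import comb

def sign_test_p(successes: int, trials: int) -> float:
    """Two-sided sign test p-value under a fair-coin null."""
    # Outcomes at least as lopsided as the observed one, counted in one tail.
    extreme = max(successes, trials - successes)
    tail = sum(comb(trials, k) for k in range(extreme, trials + 1)) / 2 ** trials
    # Double the tail to cover both directions, capped at 1.
    return min(1.0, 2 * tail)

print(sign_test_p(4, 4))  # four of four trials point toward benefit
```

The doubling is what turns "four heads" into "the same result on all four", which is the question asked above.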
Fisher’s method (the chi-square hack)
If we use more information from the trials, we get a higher-powered meta-analysis. Bringing back the confidence interval endpoints, we can compute the standard error that must have been used to arrive at the confidence intervals. We get this by taking half the interval width and dividing it by 1.96.
| Trial | Lower bound | Upper bound | SE |
|---|---|---|---|
| Trivedi | 0.77 | 1.07 | 0.077 |
| WHI | 0.83 | 1.01 | 0.046 |
| Lyons | 0.93 | 1.05 | 0.031 |
| RECORD | 0.85 | 1.02 | 0.043 |
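The half-width rule can be sketched in a few lines of Python. The helper name `se_from_ci` is made up for illustration, and the sketch assumes the intervals are symmetric 95 % normal intervals on the raw odds-ratio scale, which is the same simplification used here.

```python
Z_95 = 1.96  # normal quantile for a two-sided 95 % interval

# (lower bound, upper bound) of the 95 % intervals quoted in the table
intervals = {
    "Trivedi": (0.77, 1.07),
    "WHI": (0.83, 1.01),
    "Lyons": (0.93, 1.05),
    "RECORD": (0.85, 1.02),
}

def se_from_ci(lower: float, upper: float, z: float = Z_95) -> float:
    """Half the interval width divided by the normal quantile."""
    return (upper - lower) / 2 / z

standard_errors = {name: se_from_ci(lo, hi) for name, (lo, hi) in intervals.items()}
for name, se in standard_errors.items():
    print(f"{name:8s} {se:.3f}")
```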
Armed with these standard errors[1], we can compute the z-score of each result as the size of the measured effect relative to the noise of the measurement, i.e. the distance of the odds ratio from 1 (no effect) divided by the standard error.
| Trial | Mort. | SE | z-score |
|---|---|---|---|
| Trivedi | 0.90 | 0.077 | −1.26 |
| WHI | 0.92 | 0.046 | −1.67 |
| Lyons | 0.99 | 0.031 | −0.32 |
| RECORD | 0.93 | 0.043 | −1.56 |
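Here is a rough sketch of the z-score step in Python, done two ways: on the raw odds ratios, and on the log scale that the first footnote mentions. The function names `z_raw` and `z_log` are mine. As far as I can tell, the z-scores in the table are the ones the log-scale version reproduces; the raw-scale version gives slightly larger magnitudes.

```python
from math import log

Z_95 = 1.96  # normal quantile for a two-sided 95 % interval

# (odds ratio, lower bound, upper bound) from the table
trials = {
    "Trivedi": (0.90, 0.77, 1.07),
    "WHI": (0.92, 0.83, 1.01),
    "Lyons": (0.99, 0.93, 1.05),
    "RECORD": (0.93, 0.85, 1.02),
}

def z_raw(ratio: float, lower: float, upper: float) -> float:
    """Distance from 1 (no effect) in units of the raw-scale standard error."""
    se = (upper - lower) / 2 / Z_95
    return (ratio - 1.0) / se

def z_log(ratio: float, lower: float, upper: float) -> float:
    """Same idea on the log scale, where 'no effect' is log(1) = 0."""
    se = (log(upper) - log(lower)) / 2 / Z_95
    return log(ratio) / se

for name, row in trials.items():
    print(f"{name:8s} raw {z_raw(*row):6.2f}   log {z_log(*row):6.2f}")
```

Either way the picture is the same: every trial leans toward benefit, and none gets past two standard errors on its own.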
We can then convert this z-score to a p-value by assuming it’s normally distributed. This p-value was probably reported in the original papers, but it didn’t carry over to Dynomight’s table, so we’re recomputing it. The p-value is the value of the standard normal cumulative distribution function at the z-score. Google Sheets has it as the norm.s.dist function.[2]
| Trial | Mort. | SE | z-score | p-value |
|---|---|---|---|---|
| Trivedi | 0.90 | 0.077 | −1.26 | 0.21 |
| WHI | 0.92 | 0.046 | −1.67 | 0.10 |
| Lyons | 0.99 | 0.031 | −0.32 | 0.75 |
| RECORD | 0.93 | 0.043 | −1.56 | 0.12 |
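Outside a spreadsheet, the same conversion can be sketched with Python's standard library. The helper name `two_tailed_p` is mine. The inputs are the rounded z-scores from the table, so the last digit can differ slightly from the p-values shown.

```python
from statistics import NormalDist

def two_tailed_p(z: float) -> float:
    """Probability of a result at least this far from zero, in either direction."""
    # cdf(-|z|) is the area of one tail; doubling it covers both tails.
    return 2 * NormalDist().cdf(-abs(z))

z_scores = {"Trivedi": -1.26, "WHI": -1.67, "Lyons": -0.32, "RECORD": -1.56}
for name, z in z_scores.items():
    print(f"{name:8s} {two_tailed_p(z):.3f}")
```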
Here comes the trick for computing the aggregate significance of these four studies. We can convert each p-value to a chi-squared value by taking its natural logarithm and multiplying by −2.
Why? I don’t know! Fisher said we could do that.
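For what it's worth, the usual justification is short. It assumes the null hypothesis is true, the studies are independent, and each p-value comes from a continuous test statistic, so that each p is uniformly distributed between 0 and 1. Under those assumptions, the tail probability of −2 ln p is exactly that of a chi-squared variable with two degrees of freedom, and independent chi-squared variables add:

```latex
\[
P\left(-2\ln p > x\right) = P\left(p < e^{-x/2}\right) = e^{-x/2}, \qquad x \ge 0
\]
\[
-2\sum_{i=1}^{k} \ln p_i \;\sim\; \chi^2_{2k}
\]
```

Here k is the number of studies, which is where the two degrees of freedom per study come from.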
| Trial | Mort. | p-value | χ² |
|---|---|---|---|
| Trivedi | 0.90 | 0.21 | 3.13 |
| WHI | 0.92 | 0.10 | 4.69 |
| Lyons | 0.99 | 0.75 | 0.59 |
| RECORD | 0.93 | 0.12 | 4.26 |
These chi-squared values are all with two degrees of freedom, and they can be added together. The combined effect of all four trials has a chi-squared value of 12.7 with 8 degrees of freedom. We either look that up in a chi-square table, or run it through the Google Sheets function chisq.dist.rt, and we will find it corresponds to a p-value of 12 %. In this case, that happened to be very close to the result of the sign test, but ever so slightly more powerful.
I like this method because it’s relatively easy to remember the procedure, so I can whip it up in a spreadsheet live. Unfortunately, it’s still not powerful enough to reveal a significant combined impression from these four trials.
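The whole procedure fits in a short Python sketch. The function names are mine, and instead of a lookup table it uses the closed-form right tail of the chi-squared distribution, which only exists for an even number of degrees of freedom (always the case here, since each study contributes two). The inputs are the rounded p-values from the table, so the output lands near, not exactly on, the figures quoted above.

```python
from math import exp, factorial, log

def chi2_sf_even_dof(x: float, dof: int) -> float:
    """Right-tail probability of a chi-squared variable with even dof."""
    if dof <= 0 or dof % 2:
        raise ValueError("this closed form only holds for positive even dof")
    half = x / 2
    # For dof = 2k: exp(-x/2) * sum_{i<k} (x/2)^i / i!
    return exp(-half) * sum(half ** i / factorial(i) for i in range(dof // 2))

def fisher_combined_p(p_values):
    """Fisher's method: sum -2*ln(p), compare against chi-squared with 2k dof."""
    statistic = sum(-2 * log(p) for p in p_values)
    return statistic, chi2_sf_even_dof(statistic, 2 * len(p_values))

statistic, combined_p = fisher_combined_p([0.21, 0.10, 0.75, 0.12])
print(f"chi-squared = {statistic:.1f}, combined p = {combined_p:.2f}")
```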
Precision-weighted pooled intervals
We can perform an even more powerful type of meta-analysis. First we need the standard errors of each study. If we only have the p-values (and effect sizes), we can convert them to standard errors, but the process relies on quite significant assumptions. In our case, we extracted the standard errors from confidence intervals, which is slightly better but still not ideal. Either way, we have standard errors.
The idea is that we’ll compute a weighted average of the effect sizes, with the weights coming from the precision of each study. The precision is the inverse variance, i.e. one divided by the square of the standard error.
| Trial | Mort. | SE | Weight |
|---|---|---|---|
| Trivedi | 0.90 | 0.077 | 170 |
| WHI | 0.92 | 0.046 | 470 |
| Lyons | 0.99 | 0.031 | 1100 |
| RECORD | 0.93 | 0.043 | 530 |
This assigns the highest weight to the Lyons study, because that one seems to have pinned down the effect most precisely.[3]
When we perform the weighted average of the observed effects using these weights, we arrive at a combined odds ratio of 0.95. Since this is a linear combination of uncertain values, and we know their variances, we can compute the variance of the weighted average. Taking the square root, we find the aggregate standard error is 0.021.
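A sketch of the standard argument for these weights, assuming the studies are independent and all estimate the same underlying effect (the symbols x_i for the study estimates, σ_i for their standard errors, and w_i for the weights are my notation): among all weighted averages whose weights sum to one, the inverse-variance weights are the ones that minimise the variance of the result.

```latex
\[
\hat{\theta} = \sum_i w_i x_i, \qquad \sum_i w_i = 1, \qquad
\operatorname{Var}(\hat{\theta}) = \sum_i w_i^2 \sigma_i^2
\]
\[
\frac{\partial}{\partial w_i}\Bigl(\sum_j w_j^2 \sigma_j^2 - \lambda \bigl(\sum_j w_j - 1\bigr)\Bigr)
= 2 w_i \sigma_i^2 - \lambda = 0
\;\Longrightarrow\;
w_i = \frac{1/\sigma_i^2}{\sum_j 1/\sigma_j^2}
\]
\[
\operatorname{Var}(\hat{\theta}) = \frac{1}{\sum_j 1/\sigma_j^2}
\]
```

The last line is why the aggregate standard error is just one over the square root of the summed weights.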
That means the z-score of the aggregate effect is −2.17, which, under a two-tailed test, corresponds to a p-value of 3 %. Look at that! When we use a powerful enough test, the combined impression of these four studies is significant. If we believe Dynomight didn’t cherry-pick these four studies, that would be a meaningful discovery.
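The pooling step can be sketched in Python as below. The function name `pooled_estimate` is mine, and the sketch works on the raw odds-ratio scale with standard errors recovered from the 95 % intervals, i.e. the same simplifications as in the text rather than a textbook log-scale meta-analysis.

```python
from math import sqrt
from statistics import NormalDist

# (odds ratio, lower bound, upper bound) from the table
trials = {
    "Trivedi": (0.90, 0.77, 1.07),
    "WHI": (0.92, 0.83, 1.01),
    "Lyons": (0.99, 0.93, 1.05),
    "RECORD": (0.93, 0.85, 1.02),
}

def pooled_estimate(rows, z95: float = 1.96):
    """Inverse-variance weighted mean of the effects and its standard error."""
    weights, effects = [], []
    for ratio, lower, upper in rows:
        se = (upper - lower) / 2 / z95   # standard error from the interval
        weights.append(1 / se ** 2)      # precision = inverse variance
        effects.append(ratio)
    total = sum(weights)
    estimate = sum(w * x for w, x in zip(weights, effects)) / total
    return estimate, sqrt(1 / total)

estimate, se = pooled_estimate(trials.values())
z = (estimate - 1.0) / se                # distance from "no effect"
p = 2 * NormalDist().cdf(-abs(z))        # two-tailed p-value
print(f"pooled OR = {estimate:.2f}, SE = {se:.3f}, z = {z:.2f}, p = {p:.2f}")
```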
Now that we have the standard error of the aggregate, we can also compute a confidence interval for the aggregate.[4]
| Trial | All-cause mortality | 95 % CI | p-value |
|---|---|---|---|
| Trivedi | 0.90 | 0.77 to 1.07 | 0.21 |
| WHI | 0.92 | 0.83 to 1.01 | 0.10 |
| Lyons | 0.99 | 0.93 to 1.05 | 0.75 |
| RECORD | 0.93 | 0.85 to 1.02 | 0.12 |
| Aggregate | 0.96 | 0.83 to 0.99 | 0.04 |
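For anyone who wants to recompute an interval like this, a minimal sketch: it is the estimate plus or minus 1.96 standard errors. The function name `confidence_interval` is mine, and the inputs below are the rounded aggregate figures quoted in the text (0.95 and 0.021), treated as a symmetric interval on the raw scale.

```python
def confidence_interval(estimate: float, se: float, z: float = 1.96):
    """Symmetric normal-approximation interval around an estimate."""
    half_width = z * se
    return estimate - half_width, estimate + half_width

low, high = confidence_interval(0.95, 0.021)
print(f"{low:.2f} to {high:.2f}")
```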
If this table does not excite you, then surely this fancy ASCII diagram will.[5] (This chart does not render correctly on LW on my phone. See the article on my website if the straight reference line for 1.0 is not straight.)
Odds ratio:          0.8       0.9       1.0       1.1
- - - - - - - - - - - - - - - - - - - - - │ - - - - - - - - -
Trivedi            ─────────────●─────────┼───────  (p=0.21)
WHI                      ─────────●───────┼─        (p=0.10)
Lyons                              ──────●┼─────    (p=0.75)
RECORD                     ────────●──────┼──       (p=0.12)
- - - - - - - - - - - - - - - - - - - - - │ - - - - - - - - -
Aggregate                        ────●────┤         (p=0.04)
I find this whole thing incredible. We can tell the aggregate is a sort of average weighted toward the more precise results, but the information contained in all of these four studies is still enough to draw a definitive conclusion that (a) yes, there is an effect, and (b) it is in the direction of benefit.
[1] Hang on – aren’t odds ratios multiplicative, so we should log-transform the data before we do maths on it? Yeah, we should. Will it make a huge difference for a quick-and-dirty significance check? There are more important things to care about.
[2] And here we are taking twice that, because we are doing a two-tailed test. This means we’re trying to see if there is any significant effect at all – benefit or harm. We perform the test as if we hadn’t already seen that any effect is likely to be a benefit.
[3] There’s a mathematical argument for why the weight should be the inverse variance, but I seem to have lost my notes on the way here.
[4] Again, shouldn’t these intervals be asymmetric due to the multiplicative nature of odds ratios? They should, but apparently the data source for these intervals ignored that, so I will too.
[5] Yeah, yeah, it uses box drawing characters from Unicode.