So, I put up some prediction markets on the results of quantified self RCTs. I ran two of the experiments, and scored both markets on the results.
How much should the performance of these markets change our opinion about the viability of using prediction platforms to predict RCT results, and thus about their plausible usefulness in selecting experiments to run and actions to perform?
One issue here is that we only observed two datapoints: the logscores from the markets on the Pomodoro method (-0.326) and the Vitamin D₃ experiment (-0.333). Qualitatively these are already fairly reassuring, since both are far better than the chance score of -0.69 (ln ½, the score of a maximally uncertain prediction). But if we want to quantify the amount of information we've gained, we can do so by performing a Bayesian update.
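For intuition on where such logscores come from: a minimal sketch, assuming each market is scored as the natural log of the probability it assigned to the outcome that actually occurred (the probabilities below are hypothetical):

import numpy as np

# Logscore of a binary prediction: ln(probability assigned to the
# realized outcome). A maximally uncertain market (p = 0.5) scores
# ln(0.5) ≈ -0.693, the chance baseline.
def log_score(p_realized):
    return np.log(p_realized)

print(log_score(0.5))   # ≈ -0.693, chance
print(log_score(0.72))  # ≈ -0.329, roughly the observed market scores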
For that, we need a prior. What prior to choose? I tried fiddling around a bit with the exponential distribution, which is the maximum entropy distribution over possible logscores, nailed down only by its mean. It represents a state of minimal knowledge.
There's also the Gamma distribution, which is great because it's the conjugate prior for an exponentially distributed underlying data-generating process: after updating on the datapoints we again get a Gamma distribution as the posterior.
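For illustration (not the route I ended up taking), the conjugate update is pure parameter arithmetic: a Gamma(α, β) prior on the exponential's rate becomes a Gamma(α + n, β + Σxᵢ) posterior after observing n datapoints. A sketch with hypothetical prior parameters:

import numpy as np
from scipy import stats

# Gamma-exponential conjugacy: a Gamma(alpha, beta) prior on the rate of
# exponentially distributed data x_1..x_n yields the posterior
# Gamma(alpha + n, beta + sum(x)). The prior values here are made up.
alpha, beta = 1.0, 0.69
x = np.array([0.326, 0.333])  # absolute logscore distances

posterior = stats.gamma(a=alpha + len(x), scale=1 / (beta + x.sum()))
print(posterior.mean())  # posterior mean of the rate, ≈ 2.22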
But I didn't go with that one because… I wasn't getting the pretty results I wanted[1].
With the exponential distribution, a thing that happened was that I'd calculate the resulting distribution after two updates, but the first update would be very aggressive, and the second update would then move away from the datapoints harder than the first update had moved towards them, causing a net information loss. The exponential distribution also has very long tails, resulting in a median of -1.0, which would imply that the median market has a logscore worse than chance. I don't believe that to be the case. (The maxent mean implies that the mean market is only as good as chance, which I also wouldn't believe a priori? I think?)
As for using the Gamma distribution as a prior: I simply don't think the underlying data-generating process is exponentially distributed, so we don't get any advantage from conjugacy. The Gamma distribution also has two parameters, which is too many to nail down with only two datapoints.
So I decided to pick a different prior, and landed on the half-normal distribution with a half-normally distributed standard deviation (σ ~ HalfNormal(0.5)), which has some nicer properties than the exponential distribution, especially its thin tails. But the half-normal distribution can't be updated in closed form, so instead I had to write a short script using pymc. Because of the missing conjugacy, the resulting distribution is not a half-normal distribution but something a lot more complicated. I can't be bothered to try to calculate what it even is.
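To see the thin-tails point concretely, compare the tail mass of an exponential and a half-normal matched to the same mean (matching both means at 0.69 is an arbitrary choice for this sketch):

import numpy as np
from scipy import stats

mean = 0.69
expon = stats.expon(scale=mean)
# A half-normal with scale sigma has mean sigma * sqrt(2/pi),
# so divide by that factor to match the means.
halfnorm = stats.halfnorm(scale=mean / np.sqrt(2 / np.pi))

print(expon.sf(2.0))     # ≈ 0.055: mass on logscores worse than -2
print(halfnorm.sf(2.0))  # ≈ 0.021: the half-normal puts far less mass out there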
Summary visualization of the update:
The script[2] initializes the model with the half-normal prior, whose standard deviation is in turn distributed as HalfNormal("sigma", sigma=0.5). We then update on observed=[0.326, 0.333]:
import numpy as np
import pymc as pm

distances = np.array([0.326, 0.333])  # absolute logscore distances

with pm.Model() as adaptive_model:
    σ = pm.HalfNormal('sigma', sigma=0.5)
    obs = pm.HalfNormal('distances', sigma=σ, observed=distances)
    trace = pm.sample(2000, tune=1000, chains=4, target_accept=0.95,
                      return_inferencedata=True)
We can then extract the posterior samples for the standard deviation:
σ_samples = trace.posterior.sigma.values.flatten()
σ_mean = np.mean(σ_samples)
and calculate the log-likelihoods, the Bayes factor, and the number of bits in the update:
from scipy import stats

null_σ = 0.7  # null hypothesis: a market that scores no better than chance
ll_adaptive = np.sum(stats.halfnorm.logpdf(distances, scale=σ_mean))
ll_null = np.sum(stats.halfnorm.logpdf(distances, scale=null_σ))
log_bf = ll_adaptive - ll_null  # log Bayes factor
bits = log_bf / np.log(2)       # evidence in bits
bf = np.exp(log_bf)             # Bayes factor
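As a quick sanity check on that arithmetic, plugging in the log-likelihoods the script reports below:

log_bf = 0.642 - 0.040     # ll_adaptive - ll_null, from the output below
print(log_bf / np.log(2))  # ≈ 0.868 bits
print(np.exp(log_bf))      # ≈ 1.83, i.e. the ~1.8:1 Bayes factor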
The whole script produces this output:
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 2 jobs)
NUTS: [sigma]
Progress Draws Divergences Step size Grad evals Sampling Speed Elapsed Remaining
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3000 0 0.69 1 1165.53 draws/s 0:00:02 0:00:00
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3000 0 1.12 3 1135.09 draws/s 0:00:02 0:00:00
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3000 0 0.48 1 600.10 draws/s 0:00:04 0:00:00
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3000 0 0.78 1 566.57 draws/s 0:00:05 0:00:00
Sampling 4 chains for 1_000 tune and 2_000 draw iterations (4_000 + 8_000 draws total) took 5 seconds.
Posterior summary:
mean sd hdi_3% hdi_97% mcse_mean mcse_sd ess_bulk ess_tail r_hat
sigma 0.434 0.199 0.156 0.811 0.004 0.004 2582.0 2939.0 1.0
Posterior mean σ: 0.434
vs Null σ = 0.7:
Log likelihood (adaptive): 0.642
Log likelihood (null): 0.040
Evidence: 0.87 bits
Bayes factor: 1.8:1 in favor of adaptive
Posterior mean σ: 0.434
95% credible interval: [0.156, 0.844]
Evidence: 0.868 bits
(vs null hypothesis σ = 0.7)
Thus: 0.868 bits in favor of futarchy[3].
Many thanks to clippy (twitter) for M̶500, and Tetraspace (twitter) for M̶1000, which I used to subsidize the markets. Also many thanks to the Manifold admin Genzy for subsidizing each market with M̶450.
Your funding of the sciences is greatly appreciated.
My gratitude also goes out to all the traders on the markets. You help me prioritize, you help us gain knowledge.
This is very bad statistical practice. I'm doing this because I want a cutesy title with a positive number of bits as an update, and because I wanted to learn how to do Bayesian updating using computers. ↩︎
Code here. Thanks to Claude 4 Sonnet for writing the code and walking me through the process. ↩︎
Under several favorably cherry-picked assumptions. Don't @ me. ↩︎