Thanks for running this. It didn't work out like you hoped, but you get kudos for trying (there are way too few practical tests/challenges on LW imo) and for having your game break the 'right' way (a cheese-able challenge still helps people develop their cheese-ing skills, and doesn't take up too much of anyone's time; my least favorite D&D.Scis are ones where my screwups led to players wasting meaningful amounts of effort on scenarios where the central concept didn't work).
If you make something like this again, and want someone to playtest it before release, please let me know.
I don't have any immediate plans to do something like this again, but I'll make use of your offer if I end up doing another challenge. Thanks!
Two weeks ago I posted an experiment on priors for Bernoulli processes. I gave you all way too much data, though, so I don't think it worked out to be a very good experiment.
This post provides the results and reveals the hidden experiment class.
The experiment was meant to test induction. Suppose you are a Bayesian, and you only have a small number of observations about some time-invariant law. What is the correct posterior you should have after these observations? If you just use the bare observed frequencies, then you will be much too overconfident if you see only yesses or nos. What you need to do is start from some prior, and then update from that.
But what is the correct prior? Some people have proposed Jeffreys priors or Laplace's rule of succession as non-informative priors. Do these actually work? I wanted to see what you guys could come up with.
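As a quick illustration (a sketch, not anything from the challenge itself), here's how a Beta prior turns counts into a posterior predictive probability. Laplace's rule of succession is the Beta(1, 1) case, and the Jeffreys prior for a Bernoulli parameter is Beta(1/2, 1/2):

```python
from fractions import Fraction

def posterior_predictive(successes, trials, alpha, beta):
    """P(next trial succeeds) after updating a Beta(alpha, beta) prior
    on `successes` out of `trials` Bernoulli observations."""
    return Fraction(successes + alpha, trials + alpha + beta)

# Laplace's rule of succession (uniform prior): 4 successes in 4 trials.
print(posterior_predictive(4, 4, 1, 1))                            # 5/6
# Jeffreys prior: less smoothing toward 1/2.
print(posterior_predictive(4, 4, Fraction(1, 2), Fraction(1, 2)))  # 9/10
```

Note how both priors pull the naive frequency estimate of 4/4 = 100% back toward 50%, just by different amounts.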
The only way to determine whether a prior is a good prior is by running many different experiments and testing whether the posterior after some number of trials is correct across those experiments. No single experiment is enough. So, I made a dataset of experiments and asked you to induct on the final trial given the previous four. Naturally, the test depends on how well you can predict me. But observing the provided data can help narrow down the class of experiments I was drawing from. Unfortunately, I gave you guys way too much data, and so it didn't really matter what prior you used. I think the experiment would have been much more interesting if I only provided 100 experiments, and then computed the true marginals on a held-out dataset.
The correct answers are the marginal frequencies shown in the table below: for each count of 1s in the first four trials, it gives the total number of experiments and how often the final trial came up 1.
| 1s in First Four Trials | # Experiments | # Final Trial is 1 | Marginal Frequency |
|---|---|---|---|
| 0 | 253,099 | 27,856 | 11.01% |
| 1 | 166,571 | 54,125 | 32.49% |
| 2 | 161,832 | 80,924 | 50.00% |
| 3 | 165,890 | 111,825 | 67.41% |
| 4 | 252,608 | 224,700 | 88.95% |
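For anyone replaying the challenge from the released data, the table above can be recomputed in a few lines. This is a sketch that assumes each experiment is a sequence of five 0/1 outcomes (the actual CSV layout may differ):

```python
from collections import Counter

def marginal_table(experiments):
    """experiments: iterable of 5-element 0/1 sequences (four trials + final).
    Returns {k: (# experiments, # final-trial 1s, marginal frequency)}."""
    totals, final_ones = Counter(), Counter()
    for exp in experiments:
        k = sum(exp[:4])          # count of 1s in the first four trials
        totals[k] += 1
        final_ones[k] += exp[4]
    return {k: (totals[k], final_ones[k], final_ones[k] / totals[k])
            for k in sorted(totals)}
```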
Now that the challenge is over, I've posted all of the data online.
Most of you were very close, only off by a few parts in 10,000. I was especially impressed by Cleo Nardo's submission. She guessed 10.6%, 34.8%, 50%, 65.2%, and 89.4% without even looking at the CSV, which were all within a couple percent of the correct values.
There were four types of experiments.
The four types of experiments were mixed with weights 10%, 10%, 10%, and 70%.
The second and third cases can both be viewed as drawing the Bernoulli probability from a mixture of Beta distributions (so that the correct prior is a mixture of Beta distributions). For case (2), the Beta distribution is , and for case (3), the distribution is an even mixture of and .
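Conjugacy makes the posterior predictive for such a mixture easy to compute: weight each Beta component by its marginal likelihood of the observed counts, then average the components' predictives. A sketch (the actual Beta parameters for cases (2) and (3) didn't survive formatting here, so the example components below are placeholders, not the challenge's):

```python
from math import exp, lgamma

def log_beta(a, b):
    """Log of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def mixture_predictive(k, n, components):
    """P(next trial = 1 | k ones in n trials) under a mixture of Beta priors.
    components: list of (weight, a, b) triples."""
    post_weights, predictives = [], []
    for w, a, b in components:
        # Marginal likelihood of k ones in n trials under Beta(a, b),
        # up to a binomial coefficient (which cancels in the average).
        loglik = log_beta(a + k, b + n - k) - log_beta(a, b)
        post_weights.append(w * exp(loglik))
        predictives.append((a + k) / (a + b + n))
    total = sum(post_weights)
    return sum(w * p for w, p in zip(post_weights, predictives)) / total

# Placeholder mixture, NOT the one used in the challenge:
prior = [(0.5, 1.0, 1.0), (0.25, 5.0, 1.0), (0.25, 1.0, 5.0)]
```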
Cases (1) and (4) do not have Beta priors. Case (4) in particular is tricky. The probability for its Bernoulli process is the stationary point of a Markov chain, which involves multiplying and adding many random variables together to calculate.
The marginals in the first three cases can be calculated analytically very easily. I think the fourth case can also be analytically solved, but I chose to use a Monte Carlo simulation instead. To get better accuracy, I used two tricks:
Combining everything together, I got the true marginal probabilities to six significant figures (so 5 accurate digits). They are and .
Because you all were so close, I decided to resort to the true probabilities to determine a winner. My scoring function is $\sum_{k=0}^{4} w_k \left( q_k \ln \frac{q_k}{p_k} + (1 - q_k) \ln \frac{1 - q_k}{1 - p_k} \right)$, where $w_k$ is the true probability of $k$ 1s in the first four trials, $q_k$ is the true $k$th marginal probability, and $p_k$ is your $k$th guessed marginal probability. (This is a weighted average of the KL divergence from the true marginals to your guesses.)
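In code, that weighted-KL score looks like the following sketch (the exact true weights and marginals were elided above, so here they are simply arguments):

```python
from math import log

def kl_bernoulli(q, p):
    """KL divergence from Bernoulli(q) to Bernoulli(p)."""
    return q * log(q / p) + (1 - q) * log((1 - q) / (1 - p))

def score(guesses, true_marginals, weights):
    """Weighted average of per-bin KL divergences; lower is better, 0 is perfect."""
    return sum(w * kl_bernoulli(q, g)
               for w, q, g in zip(weights, true_marginals, guesses))
```

A perfect set of guesses scores exactly 0, and deviations are penalized most heavily in the 0- and 4-count bins, since those contain the most experiments.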
The scores for the public submissions are:
| Name | Score |
|---|---|
| Unnamed | 2.999e-07 |
| One | 3.017e-07 |
| DaemonicSigil | 5.433e-07 |
| James Camacho | 2.840e-04 |
| Cleo Nardo | 4.520e-04 |
It's very close, but Unnamed wins. Congratulations!
As promised, I resolved the Manifold market using the marginals for the original 1,000,000 trials. Manifold Markets didn't allow me to resolve to a sub-percentage precision, so I randomly rounded each percentage $p$ to either its ceiling or its floor, with probabilities $f$ and $1 - f$ respectively, where $f$ is the fractional part of $p$.
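That unbiased-rounding scheme can be sketched like this (a toy reimplementation, not the exact code used):

```python
import random
from math import ceil, floor

def randomized_round(p, rng=random):
    """Round p up with probability equal to its fractional part, down
    otherwise, so the expected value of the result is exactly p."""
    f = p - floor(p)
    return ceil(p) if rng.random() < f else floor(p)
```

For example, resolving a market at 10.25% this way yields 11% a quarter of the time and 10% otherwise, so the resolution is correct in expectation.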