How would you answer this without looking at the csv?
I wrote a post on my prior over Bernoulli distributions, called "Rethinking Laplace's Law of Succession". Laplace's Law of Succession is based on a uniform prior over [0,1], whereas my prior is based on the following mixture distribution:
w1 * logistic-normal(0, sigma^2) + w2 * 0.5(dirac(0) + dirac(1)) + w3 * thomae_{100}(α) + w4 * uniform(0,1)
where:
- The first term captures logistic transformations of normal variables (weight w1), resolving the issue that probabilities should be spread across log-odds
- The second term captures deterministic programs (weight w2), allowing for exactly zero and one
- The third term captures rational probabilities with simple fractions (weight w3), giving weight to simple ratios
- The fourth term captures uniform interval (weight w4), corresponding to Laplace's original prior
The default parameters (w1=0.3, w2=0.1, w3=0.3, w4=0.3, sigma=5, alpha=2) reflect my intuition about the relative frequency of these different types of programs in practice.
Using this prior, we get the result [0.106, 0.348, 0.500, 0.652, 0.894]
The numbers are predictions for P(5th trial = R | k Rs observed in first 4 trials).
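A minimal numerical sketch of how a posterior predictive under this mixture prior could be computed. The exact definition of the thomae_100(α) term is my assumption (weight b^(-α) on each reduced fraction a/b with denominator b up to 100), so the printed numbers are illustrative, not a confirmed reproduction:

```python
import numpy as np
from math import gcd, factorial

# Sketch: posterior predictive P(5th = R | k Rs in first 4 trials) under the
# mixture prior described above. The thomae_100(alpha) term is ASSUMED to put
# weight b**(-alpha) on each reduced fraction a/b with 2 <= b <= 100.
w1, w2, w3, w4 = 0.3, 0.1, 0.3, 0.3
sigma, alpha = 5.0, 2.0

def component_moments(k, n=4):
    """Return (evidence, numerator), where evidence = E[p^k (1-p)^(n-k)]
    and numerator = E[p^(k+1) (1-p)^(n-k)] under the mixture prior."""
    # logistic-normal term: integrate in log-odds space for stability
    z = np.linspace(-40.0, 40.0, 400001)
    p = 1.0 / (1.0 + np.exp(-z))
    phi = np.exp(-z**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)
    dz = z[1] - z[0]
    ev1 = np.sum(p**k * (1 - p)**(n - k) * phi) * dz
    nu1 = np.sum(p**(k + 1) * (1 - p)**(n - k) * phi) * dz

    # point masses at 0 and 1 (deterministic programs)
    ev2 = 0.5 * (k == 0) + 0.5 * (k == n)
    nu2 = 0.5 * (k == n)          # only p = 1 contributes to E[p * lik]

    # assumed Thomae-style prior on simple rationals
    ev3 = nu3 = tot = 0.0
    for b in range(2, 101):
        for a in range(1, b):
            if gcd(a, b) == 1:
                q, wt = a / b, b**(-alpha)
                tot += wt
                ev3 += wt * q**k * (1 - q)**(n - k)
                nu3 += wt * q**(k + 1) * (1 - q)**(n - k)
    ev3, nu3 = ev3 / tot, nu3 / tot

    # uniform(0,1) term: exact Beta integrals
    ev4 = factorial(k) * factorial(n - k) / factorial(n + 1)
    nu4 = factorial(k + 1) * factorial(n - k) / factorial(n + 2)

    return (w1 * ev1 + w2 * ev2 + w3 * ev3 + w4 * ev4,
            w1 * nu1 + w2 * nu2 + w3 * nu3 + w4 * nu4)

def predictive(k):
    ev, nu = component_moments(k)
    return nu / ev

print([round(predictive(k), 3) for k in range(5)])
```

Because every component of the mixture is symmetric under p → 1-p, the predictive satisfies P(R | k) + P(R | 4-k) = 1, with P(R | 2) = 0.5 exactly.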
The five numbers under Laplace's Rule of Succession are [0.167, 0.333, 0.500, 0.667, 0.833], but I think this is too conservative because it underestimates the likelihood of near-deterministic processes.
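For reference, Laplace's Rule of Succession gives (k+1)/(n+2) after observing k successes in n trials, which with n = 4 reproduces the list above:

```python
# Laplace's Rule of Succession: P(next = R | k Rs in n trials) = (k+1)/(n+2)
n = 4
laplace = [(k + 1) / (n + 2) for k in range(n + 1)]
print([round(p, 3) for p in laplace])  # → [0.167, 0.333, 0.5, 0.667, 0.833]
```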
I'm not entirely sure what's being asked here. Is this asking "if we do experiment 1000001 and see k Rs in the first four trials, then what credence do you assign to the 5th trial being R?"
Or is it "if we take a random experiment out of the million and see k Rs in the first four trials, then what credence do you assign to the 5th trial being R"? This isn't the same question as the first.
Or is it something else again?
It's asking, "If I draw a histogram of the frequency of R of the fifth trial, with buckets corresponding to the number of Rs in the first four trials, what will the heights of the bars be?"
We are not doing any more experiments. All the experiments have already been done in the 1,000,000 provided experiments. I've just left out the fifth trial from these experiments.
This is almost the same question as, "If we do experiment 1000001 and see k Rs in the first four trials, then what credence do you assign to the 5th trial being R," but not quite. Your goal is to predict the marginal frequencies for the experiments I have actually conducted, not any idealized "next experiment". Because 1,000,000 experiments is so many, the two should be close, but they are not quite the same. The actual marginal frequencies will have some noise, for example.
I hope this helps! If you need more explanation, feel free to ask.
Also tried this, and basically ended up with the same answer as commenter One.
Key idea is that we really only care about drawing 5 trials from this process. So we just have to find a probability distribution over 6 outcomes: a count of Rs for our 5 trials, from 0 to 5. 10^6 datapoints is enough to kill a fair amount of noise by self-averaging, so I treated the fact that hiding a random trial has to reproduce the observed 4-trial distribution as a hard constraint. (It's a linear constraint in the probabilities.) Then did maximum entropy optimization subject to that constraint. The output distribution in terms of 5-trial counts looked pretty symmetric and was heavier towards the extremes.
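A sketch of this maximum-entropy approach, using a made-up symmetric 4-trial distribution `f` in place of the real CSV frequencies (the placeholder values are purely illustrative):

```python
import numpy as np
from scipy.optimize import minimize

# Sketch of the max-entropy approach described above. q[i] is the probability
# of i Rs across all 5 trials; f[j] is the observed frequency of j Rs in the
# first 4. The values in f below are a HYPOTHETICAL placeholder (the real
# ones come from the provided CSV).
f = np.array([0.25, 0.17, 0.16, 0.17, 0.25])  # placeholder, sums to 1

# Hiding a uniformly random trial maps 5-trial counts to 4-trial counts:
#   f[j] = (5-j)/5 * q[j] + (j+1)/5 * q[j+1]   -- linear in q
A = np.zeros((5, 6))
for j in range(5):
    A[j, j] = (5 - j) / 5
    A[j, j + 1] = (j + 1) / 5

def neg_entropy(q):
    q = np.clip(q, 1e-12, 1.0)
    return float(np.sum(q * np.log(q)))

res = minimize(neg_entropy, np.full(6, 1 / 6), method="SLSQP",
               bounds=[(0.0, 1.0)] * 6,
               constraints=[{"type": "eq", "fun": lambda q: A @ q - f}])
q = res.x

# Conditional frequencies P(5th = R | j Rs in first 4) follow from q and f
p_r = [(j + 1) / 5 * q[j + 1] / f[j] for j in range(5)]
print(np.round(q, 4), np.round(p_r, 4))
```

Note that sum(q) = 1 is implied by the constraint A q = f whenever f itself sums to 1, so no separate normalization constraint is needed.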
Another quick computation from these values yields the p(R | k) numbers asked for in the question: [0.11118619, 0.32422537, 0.49942029, 0.67519768, 0.88914787]
Explanation:
Hypothesis 1: The data are generated by a beta-binomial distribution, where first a probability x is drawn from a beta(a,b) distribution, and then 5 experiments are run using that probability x. I had my coding assistant write code to solve for the a,b that best fit the observed data and show the resulting distribution for that a,b. It gave (a,b) = (0.6032,0.6040) and a distribution that was close but still meaningfully off given the million experiment sample size (most notably, only .156 of draws from this model had 2 R's compared with the observed .162).
Hypothesis 2: With probability c the data points were drawn from a beta-binomial distribution, and with probability 1-c the experiment instead used p=0.5. This came to mind as a simple process that would result in more experiments with exactly 2 R's out of 4. With my coding assistant writing the code to solve for the 3 parameters a,b,c, this model came extremely close to the observed data - the largest error was .0003 and the difference was not statistically significant. This gave (a,b,c) = (0.5220,0.5227,0.9237).
I could have stopped there, since the fit was good enough so that anything else I'd do would probably only differ in its predictions after a few decimal places, but instead I went on to Hypothesis 3: the beta distribution is symmetric with a=b, so the probability is 0.5 with probability 1-c and drawn from beta(a,a) with probability c. I solved for a,c with more sigfigs than my previous code used (saving the rounding till the end), and found that it was not statistically significantly worse than the asymmetric beta from Hypothesis 2. I decided to go with this one because on priors a symmetric distribution is more likely than an asymmetric distribution that is extremely close to being symmetric. Final result: draw from a beta(0.5223485278, 0.5223485278) distribution with probability 0.9237184759 and use p=0.5 with probability 0.0762815241. This yields the above conditional probabilities out to 6 digits.
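The final model's predictions can be checked directly with Beta-function moments; this sketch computes P(5th = R | k Rs in first 4) as a ratio of prior moments E[p^(k+1) (1-p)^(4-k)] / E[p^k (1-p)^(4-k)] under the stated mixture:

```python
from math import lgamma, exp

# Check of the final model: p ~ Beta(a, a) with probability c, p = 0.5 with
# probability 1 - c, using the parameters quoted above.
a, c = 0.5223485278, 0.9237184759

def beta_fn(x, y):
    return exp(lgamma(x) + lgamma(y) - lgamma(x + y))

def moment(r, s):
    """E[p^r (1-p)^s] under the mixture prior."""
    return c * beta_fn(a + r, a + s) / beta_fn(a, a) + (1 - c) * 0.5**(r + s)

preds = [moment(k + 1, 4 - k) / moment(k, 4 - k) for k in range(5)]
print([round(p, 6) for p in preds])  # should reproduce the answer below
```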
Answer:
[0.111020, 0.324512, 0.5, 0.675488, 0.888980]
I will provide my solution when the market is resolved.
Decided to provide my solution since others have done so as well.
Solution
The public dataset is approximately symmetrical, so it is very likely that the distribution of the Bernoulli rate is also symmetrical (the probability at p equals the probability at 1-p). Let q_k, for k = 0...5, be the probability of getting k Rs over all 5 trials, with q_k = q_{5-k} by symmetry. Then, from the public dataset, the observed frequency f_j of j Rs in the first four trials satisfies f_j = ((5-j)/5) q_j + ((j+1)/5) q_{j+1}. The f_j have standard deviations on the order of 10^-4, which is negligible, so we can treat these as exact linear equations. Solving for the q_k, we can then solve for the marginal frequencies P(5th = R | j Rs in first 4) = ((j+1)/5) q_{j+1} / f_j, etc.
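A minimal sketch of this linear solve, again with a made-up symmetric 4-trial distribution standing in for the real CSV frequencies. The identity f[j] = (5-j)/5 · q[j] + (j+1)/5 · q[j+1] (from hiding one of five exchangeable trials) plus the symmetry constraints q[k] = q[5-k] pin down q uniquely:

```python
import numpy as np

# Sketch of the linear solve described above. f is a HYPOTHETICAL symmetric
# 4-trial count distribution (placeholder for the real CSV frequencies).
f = np.array([0.25, 0.17, 0.16, 0.17, 0.25])  # placeholder, sums to 1

# Rows 0-4: f[j] = (5-j)/5 * q[j] + (j+1)/5 * q[j+1]
# Rows 5-7: symmetry q[k] - q[5-k] = 0 for k = 0, 1, 2
A = np.zeros((8, 6))
b = np.zeros(8)
for j in range(5):
    A[j, j] = (5 - j) / 5
    A[j, j + 1] = (j + 1) / 5
    b[j] = f[j]
for i, k in enumerate(range(3)):
    A[5 + i, k], A[5 + i, 5 - k] = 1.0, -1.0

q, *_ = np.linalg.lstsq(A, b, rcond=None)

# Marginal frequencies P(5th = R | j Rs in first 4)
p_r = [(j + 1) / 5 * q[j + 1] / f[j] for j in range(5)]
print(np.round(q, 4), np.round(p_r, 4))
```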
Not sure if this (experiment set?) is a good test of priors, since I got an exact answer without having to consider priors, other than the data being symmetrical. (This also means that any symmetric distribution for the Bernoulli rate will result in the same answer.) Though @DaemonicSigil has a similar solution without using symmetry, instead using maximum entropy as a prior (if I understand it correctly).
Still, almost all reasonable priors will result in very similar outcomes, differing probably on the order of the standard deviation of the observed frequencies (around 10^-4). This is likely less than, or at least comparable to, the noise in the actual marginal frequencies.
You're mostly right. The other solvers have given pretty much identical distributions.
Some of your distributions are worse than other distributions. If I run 100,000,000 experiments and calculate the frequencies, some of you will be more off at the fourth decimal point.
The market doesn't have that kind of precision, and even if it did, I wouldn't change the resolution criterion. But I can still score you guys myself later on.
I do agree that I should have provided far fewer public experiments. Then it would have been a better test of priors.
You do get one guarantee, though: All the experiments are Bernoulli processes. In particular, the order of the trials is irrelevant.
I think those aren't quite equivalent statements? If I pick my favorite string of bits, and shuffle it by a random permutation, then the probability of each bit being 1 is equal, the order is totally irrelevant (it was chosen at random), but it's not Bernoulli because the trials aren't independent of each other (if you know what my favorite string of bits is, you can learn the final bit as soon as you've observed all the rest.)
That's what "in particular" means, i.e. "the order of the trials is irrelevant" is a particular feature.
Correct, they are not equivalent. The second statement is a consequence of the first. I made this consequence explicit to justify my choice later on to bucket by the number of Rs but not their order.
The first statement, though, is also true. It's your full guarantee.
To clarify, the ground truth P(R) is constrained to be constant over the 5 trials of any given experiment?
No; your distribution gives probabilities [0.253247, 0.168831, 0.155844, 0.168831, 0.253247] for the number of Rs in the first four trials. This predicts that the number of experiments with two Rs is binomially (i.e. approximately normally) distributed with mean ~155844 and standard deviation ~363, but the actual number is 161832, around 16 standard deviations away from the mean.
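The check above is a one-liner to reproduce: under the proposed distribution, the number of experiments with exactly two Rs in the first four trials is binomial with n = 1,000,000 and p = 0.155844.

```python
import math

# Binomial sanity check: mean, standard deviation, and z-score of the
# observed count of two-R experiments under the proposed distribution.
n, p = 1_000_000, 0.155844
mean = n * p
sd = math.sqrt(n * p * (1 - p))
z = (161832 - mean) / sd
print(round(mean), round(sd), round(z, 1))  # → 155844 363 16.5
```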
I have run 1,000,000 experiments. Each experiment consists of 5 trials with binary outcomes, either L (for left) or R (for right).
However, I'm not going to tell you how I've picked my experiments. Maybe I'm just flipping a fair coin each time. Maybe I'm using a biased coin. Or maybe I'm doing something completely different, like dropping a bouncy ball down a mountain and checking whether it hits a red rock or a white rock first--and different experiments are conducted on different mountains. I might be doing some combination of all three.
You do get one guarantee, though: All the experiments are Bernoulli processes. In particular, the order of the trials is irrelevant.
Your goal is to guess the marginal frequencies of the fifth trial. For each k from 0 to 4, you need to tell me the frequency with which the fifth trial is an R, given that k of the outcomes of the first four trials are R.
For example, if every experiment is just flipping a fair coin, then the fifth trial will be an R with probability 0.5, no matter what the first four are. However, if I'm using biased coins, then the frequency of R will increase the more Rs seen.
To help you in your guessing, I have provided a CSV of all the public trials. As an answer, please provide a list like [0.3, 0.4, 0.5, 0.6, 0.7] of your frequencies--the kth element of your list is the marginal frequency over the experiments with k of the first four trials being R.
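As a toy illustration of what is being asked, here is the computation run on simulated data where every experiment is 5 fair-coin flips (one of the possibilities mentioned above); in that case all five answers should come out near 0.5:

```python
import numpy as np

# Toy demonstration of the task on SIMULATED data: 1,000,000 experiments of
# 5 fair-coin flips each (1 = R, 0 = L). Real answers come from the CSV.
rng = np.random.default_rng(0)
trials = rng.integers(0, 2, size=(1_000_000, 5))

k = trials[:, :4].sum(axis=1)   # number of Rs among the first four trials
fifth = trials[:, 4]
answer = [fifth[k == j].mean() for j in range(5)]
print([round(a, 3) for a in answer])
```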
I haven't yet looked at the frequencies myself, but I will do so shortly after posting this. If you want to test your guesses against others, I have created a market on Manifold Markets. I will resolve the market before I reveal the correct frequencies, which will happen in around two weeks, but maybe earlier or later depending on trading volume.
Good luck!