In the previous post, we looked at Rudolph Wolf's data on 20000 rolls of a pair of dice. Specifically, we looked at the data on the white die, and found that it was definitely biased. This raises an interesting question: what biases, specifically, were present? In particular, can we say anything about the physical asymmetry of the die? Jaynes addressed this exact question; we will test some of his models here.
Elongated Cube Models
Jaynes suggests that, if the die were machined, then it would be pretty easy to first cut an even square along two dimensions. But the cut in the third dimension would be more difficult; getting the length to match the other two dimensions would be tricky. Based on this, we'd expect to see an asymmetry which gives two opposite faces (1 & 6, 2 & 5, or 3 & 4) different probabilities from all the other faces.
Here's what the model looks like for the 1 & 6 pair:
1 & 6 each have probability p/2
2, 3, 4 & 5 each have probability (1−p)/4
p has a uniform prior
Let's call this $\text{model}_{1,6}$.
I will omit the details of calculations in this post; readers are welcome to use them as exercises. (All the integrals can be evaluated using the Dirichlet-multinomial α=1 formula from the previous post.) In this case, we find

$$P[\text{data}|\text{model}_{1,6}] = \frac{n!\,1!}{(n+1)!}\left(\frac{(n_1+n_6)!}{n_1!\,n_6!}\left(\tfrac{1}{2}\right)^{n_1+n_6}\right)\left(\frac{(n_2+\dots+n_5)!}{n_2!\cdots n_5!}\left(\tfrac{1}{4}\right)^{n_2+n_3+n_4+n_5}\right) \approx 2.2\times 10^{-59}$$

For the other two opposite face pairs, we get:
... sure enough, an asymmetry on the 3,4 axis goes a very long way toward explaining this data.
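If you'd rather check these numbers in code than by hand, here's a minimal Python sketch of the calculation (my own code, not Jaynes'). The per-face counts are the white-die tallies as I have them from the previous post - treat them as an assumption to verify against your own copy of the data.

```python
from math import lgamma, log

def log10_marginal_elongated(counts, pair):
    """Log10 marginal likelihood of an 'elongated cube' model: the two faces
    in `pair` each have probability p/2, the other four faces each have
    (1-p)/4, and p gets a uniform prior on [0, 1].  The Beta integral over p
    contributes the n_pair! n_rest! / (n+1)! factor from the formula above."""
    n = sum(counts)
    n_pair = sum(counts[i - 1] for i in pair)
    n_rest = n - n_pair
    # log of the multinomial coefficient n! / (n_1! ... n_6!)
    log_multinom = lgamma(n + 1) - sum(lgamma(c + 1) for c in counts)
    log_ml = (log_multinom
              + n_pair * log(0.5) + n_rest * log(0.25)
              + lgamma(n_pair + 1) + lgamma(n_rest + 1) - lgamma(n + 2))
    return log_ml / log(10)

# White-die counts for faces 1..6 (assumed from the previous post; they sum to 20000)
counts = [3246, 3449, 2897, 2841, 3635, 3932]
for pair in [(1, 6), (2, 5), (3, 4)]:
    print(pair, log10_marginal_elongated(counts, pair))
```

If the counts are right, the (1, 6) pair should land near the $2.2\times 10^{-59}$ figure above, with the (3, 4) pair far ahead of the other two.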
Recall from the previous post that the unbiased model gave a marginal likelihood P[data|model] around $10^{-70}$, and the biased model with separate probabilities for each face gave around $10^{-20}$. So based on the data, our 3,4 model is still about a billion times less probable than the full biased model (assuming comparable prior probabilities for the two models), but it's getting relatively close - probabilities naturally live on a log scale. It looks like the 3-4 asymmetry is the main asymmetry in the data, but some other smaller asymmetry must also be significant.
Just for kicks, I tried a model with a different probability for each pair of faces, again with uniform prior on the p's. That one came out to $1.7\times 10^{-30}$ - somewhat worse than the 3,4 model. If you're used to traditional statistics, this may come as a surprise: how can a strictly more general model have lower marginal likelihood P[data|model]? The answer is that, in traditional statistics, we'd be looking for the unobserved parameter values p with the maximum likelihood P[data|model,p] - of course a strictly more general model will have a maximum likelihood value at least as high. But when computing P[data|model], we're integrating over the unobserved parameters p. A more general model has more ways to be wrong; unless it's capturing some important phenomenon, a smaller fraction of the parameter space will have high P[data|model,p]. We'll come back to this again later in the sequence.
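Here's that distinction in code for the pair-of-faces model just mentioned (again my own sketch, assuming a uniform Dirichlet prior over the three pair probabilities, with each pair's probability split evenly between its two faces):

```python
from math import lgamma, log

def pair_model_scores(counts):
    """Model with one probability per opposite-face pair: pairs (1,6), (2,5),
    (3,4) get probabilities q1, q2, q3 with a uniform (Dirichlet, alpha=1)
    prior, and each pair's probability is split evenly between its two faces.
    Returns (log10 maximum likelihood, log10 marginal likelihood)."""
    n = sum(counts)
    m = [counts[0] + counts[5], counts[1] + counts[4], counts[2] + counts[3]]
    log_multinom = lgamma(n + 1) - sum(lgamma(c + 1) for c in counts)

    # Maximized likelihood: plug in the best-fitting q_i = m_i / n.
    log_max = log_multinom + n * log(0.5) + sum(mi * log(mi / n) for mi in m)

    # Marginal likelihood: integrate the q's against the uniform Dirichlet
    # prior, giving the alpha=1 formula  2! * m1! m2! m3! / (n+2)!.
    log_marg = (log_multinom + n * log(0.5)
                + lgamma(3) + sum(lgamma(mi + 1) for mi in m) - lgamma(n + 3))
    return log_max / log(10), log_marg / log(10)
```

The maximized likelihood can only go up as the model gets more general; the marginal likelihood is an average over the whole prior, so it can go down.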
Pip Asymmetry Model
Jaynes' other main suggestion was that the pips on the die are asymmetric - i.e. there's less mass near the 6 face than the 1 face, because more pips have been dug out of the 6 face.
As a first approximation to this, let's consider just the asymmetry between 1 and 6 - the pair with the highest pip difference. We'll also keep all the structure from the 3,4 model, since that seems to be the main asymmetry. Here's the model:
3 & 4 have the same probability p/2, as before
2 & 5 have the same probability (1−p)/4, as before
1 & 6 together have probability (1−p)/2, same as 2 and 5 together, but their individual probabilities may be different. Conditional on rolling either a 1 or 6, 1 comes up with probability p′ and 6 with probability (1−p′)
Both p and p′ have uniform priors
The conditional parameterization for 1 & 6 is chosen to make the math clean.
Let's call this $\text{model}_{3,4+\text{pip}}$. Marginal likelihood:

$$P[\text{data}|\text{model}_{3,4+\text{pip}}] = \frac{n!\,1!}{(n+1)!}\left(\frac{(n_3+n_4)!}{n_3!\,n_4!}\left(\tfrac{1}{2}\right)^{n_3+n_4}\right)\left(\frac{(n_1+n_2+n_5+n_6)!}{(n_1+n_6)!\,n_2!\,n_5!}\left(\tfrac{1}{2}\right)^{n_1+n_6}\left(\tfrac{1}{4}\right)^{n_2+n_5}\right)\left(\frac{(n_1+n_6)!\,1!}{(n_1+n_6+1)!}\right) \approx 2.3\times 10^{-16}$$
... and now we have a model which solidly beats separate probabilities for each face!
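And in code, with the same counts as before (still my own sketch; the two Beta integrals correspond to the uniform priors on p and p′):

```python
from math import lgamma, log

def log10_marginal_34_pip(counts):
    """Log10 marginal likelihood of model_{3,4+pip}: faces 3 & 4 each get p/2,
    faces 2 & 5 each get (1-p)/4, and faces 1 & 6 share (1-p)/2, split between
    them as p' : (1-p'), with independent uniform priors on p and p'."""
    n1, n2, n3, n4, n5, n6 = counts
    n = sum(counts)
    n34, n25, n16 = n3 + n4, n2 + n5, n1 + n6
    log_multinom = lgamma(n + 1) - sum(lgamma(c + 1) for c in counts)
    log_ml = (log_multinom
              + (n34 + n16) * log(0.5) + n25 * log(0.25)
              # Beta integral over p:  n34! (n - n34)! / (n + 1)!
              + lgamma(n34 + 1) + lgamma(n - n34 + 1) - lgamma(n + 2)
              # Beta integral over p': n1! n6! / (n16 + 1)!
              + lgamma(n1 + 1) + lgamma(n6 + 1) - lgamma(n16 + 2))
    return log_ml / log(10)

# e.g. log10_marginal_34_pip(counts) with the white-die counts listed earlier
```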
(I also tried a pip model by itself, without the 3,4 asymmetry. That one wound up at $2.1\times 10^{-70}$ - almost as bad as the full unbiased model.)
We can also go one step further, and assume that the pip difference also causes 2 and 5 to have slightly different probabilities. This model gives $P[\text{data}|\text{model}]\approx 3.9\times 10^{-17}$ - a bit lower than the model above, but close enough that it still gets significant posterior probability (about $\frac{3.9\times 10^{-17}}{3.9\times 10^{-17}+2.3\times 10^{-16}} \approx 14\%$ assuming equal priors; all the other models we've seen have near-zero posterior assuming equal priors). So based on the data, the model with just the 1-6 pip difference is a bit better, but we're not entirely sure. My guess is that a fancier model could significantly beat both of these by predicting that the effect of a pip difference scales with the number of pips, rather than just using whole separate parameters for the 1-6 and 2-5 differences. But that would get into hairier math, so I'm not going to do it here.
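That ~14% is just the equal-prior normalization of the two marginal likelihoods; as a two-line sanity check:

```python
# Equal priors assumed, as in the text; the other models' marginal likelihoods
# are small enough that they barely move these numbers.
ml_pip_16_only   = 2.3e-16  # pip difference on the 1-6 pair only
ml_pip_16_and_25 = 3.9e-17  # pip difference on both the 1-6 and 2-5 pairs
total = ml_pip_16_only + ml_pip_16_and_25
print(ml_pip_16_and_25 / total)  # ~0.14
print(ml_pip_16_only / total)    # ~0.86
```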
To recap, here's what $\text{model}_{3,4+\text{pip}}$ says:
3 and 4 have the same probability, but that probability may be different from everything else
2 and 5 have the same probability, and 1 and 6 together have the same total probability as 2 and 5 together, but 1 and 6 individually may have different probabilities.
That's it; just two "free parameters". Note that the full biased model, with different probabilities for each face, is strictly more general than this - any face probabilities p which are compatible with $\text{model}_{3,4+\text{pip}}$ are also compatible with the full biased model. But the full biased model is compatible with any face probabilities p; $\text{model}_{3,4+\text{pip}}$ is not compatible with all possible p's. So if we see data which matches the p's compatible with $\text{model}_{3,4+\text{pip}}$, then that must push up our posterior for $\text{model}_{3,4+\text{pip}}$ relative to the full biased model - $\text{model}_{3,4+\text{pip}}$ makes a stronger prediction, so it gets more credit when it's right. The result: less flexible models which are consistent with the data will get higher posterior probability. The "complexity penalty" is not explicit, but implicit: it's just a natural consequence of conservation of expected evidence.
Next post we'll talk about approximation methods for hairy integrals, and then we'll connect all this to some common methods for scoring models.