Jaynesian interpretation - How does “estimating probabilities” make sense?

No worries :) Thanks a lot for your help! Much appreciated.

It’s amazing how complex a simple coin flipping problem can get when we approach it from our paradigm of objective Bayesianism. Professor Jaynes remarks on this after deriving the principle of indifference: “At this point, depending on your personality and background in this subject, you will be either greatly impressed or greatly disappointed by the result (2.91).” - page 40

A frequentist would have “solved” this problem rather easily. Personally, I would trade simplicity for coherence any day of the week...

26d: I looooove that coin flip section! Cheers


Think I have finally got it. I would like to thank you once again for all your help; I really appreciate it.

This is what I think “estimating the probability” means:

We define theta to be a real-world/objective/physical quantity s.t. P(H|theta=alpha) = alpha & P(T|theta=alpha) = 1 - alpha. We do not talk about the nature of this quantity theta because we do not care what it is. I don’t think it is appropriate to say that theta is “frequency” for this reason:

- “frequency” is not a well-defined physical quantity. You can’t measure “frequency” like you

24d: I think the above is accurate.
I disagree with the last part, but it has two sources of confusion:
1. Frequentist vs. Bayesian is in principle about priors but in practice about point estimates vs. distributions. Good frequentists use distributions and bad Bayesians use point estimates such as Bayes factors; a good review of this is https://link.springer.com/article/10.3758/s13423-016-1221-4
2. But the leap from theta to the probability of heads is, I think, an intuitive leap that happens to be correct but is unjustified.
Philosophically, then, the posterior predictive is actually frequentist; allow me to explain:
Frequentists are people who estimate a parameter and then draw fake samples from that point estimate and summarize them in confidence intervals; to justify this they imagine parallel worlds and whatnot.
Bayesians are people who assume a prior distribution from which the parameter is drawn. They thus have both prior and likelihood uncertainty, which gives posterior uncertainty: the uncertainty of the parameters in their model. When a Bayesian wants to use their model to make predictions, they integrate the model parameters out and thus have a predictive distribution of new data given old data*. Because this is a distribution of the data, like the frequentist sampling function, we can actually draw from it multiple times to compute summary statistics, much like the frequentists do, and calculate things such as a "Bayesian p-value" which describes how likely the model is to have generated our data. Here the goal is for the p-value to be high, because that suggests the model describes the data well.
*In the real world they do not integrate out theta; they draw it 10,000 times and use those samples as a stand-in distribution, because the math is too hard for complex models.
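
The footnote's sampling recipe can be sketched for the coin case, where the posterior happens to be available in closed form (the Beta(1,1) prior and the data counts here are made-up choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical observed coin flips: 7 heads out of 10.
heads, flips = 7, 10

# Beta(1,1) prior + binomial likelihood gives a Beta(1+heads, 1+tails) posterior,
# so we can draw theta directly instead of running a sampler.
theta_samples = rng.beta(1 + heads, 1 + (flips - heads), size=10_000)

# Posterior predictive: for each posterior draw of theta, simulate one new flip.
# This is the "draw it 10,000 times" stand-in for the integral over theta.
new_flips = rng.binomial(n=1, p=theta_samples)

print(theta_samples.mean())  # posterior mean of theta, near (1+7)/(2+10) = 2/3
print(new_flips.mean())      # posterior predictive P(heads), also near 2/3
```

The two means agree here, which is exactly the coincidence discussed above: for a simple Bernoulli model the posterior mean of theta and the posterior predictive probability of heads have the same value, even though one is a statement about a parameter and the other about new data.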


I’m afraid I have to disagree. I do sometimes regret not focusing more on applied Bayesian inference. (In fact, I have no idea what WAIC or HMC is.) But in my defence, I am an amateur analytic philosopher & logician, and I couldn’t help finding more non-sequiturs in classical expositions of probability theory than plot holes in Tolkien novels. Perhaps if I had been more naive and less critical (no offence to anyone) when I read those books, I would have “progressed” faster. I had lost hope in understanding probability theory before I read Professor Jayn... (read more)

24d: I am one of those people with a half-baked epistemology and understanding of probability theory, and I am looking forward to reading Jaynes. And I agree there are a lot of ad-hocisms in probability theory, which means everything is wrong in the logical sense, as some of the assumptions are broken; but a solid modern Bayesian approach has far fewer ad-hocisms and also teaches you to build advanced models in less than 400 pages.
HMC is a sampling approach to solving the posterior which in practice is superior to analytical methods, because it actually accounts for correlations in predictors and other things which are usually assumed away.
WAIC is information theory on distributions which allows you to say that model A is better than model B because the extra parameters in B are fitting noise; basically minimum description length on steroids for out-of-sample uncertainty.
Also, I studied biology, which is the worst: I can perform experiments and thus do not have to think about causality, and I do not expect my model to account for half of the signal even if it's 'correct'.
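
HMC itself needs gradients and a fair bit of machinery, but the underlying idea (solving the posterior by sampling rather than analytically) can be illustrated with a much simpler random-walk Metropolis sampler for the coin's theta. This is a toy sketch, not HMC, and the data counts are made up:

```python
import numpy as np

rng = np.random.default_rng(1)
heads, tails = 7, 3  # hypothetical coin data

def log_posterior(theta):
    # log(theta^heads * (1 - theta)^tails), i.e. a flat prior on (0, 1);
    # proposals outside (0, 1) get -inf and are always rejected.
    if not 0 < theta < 1:
        return -np.inf
    return heads * np.log(theta) + tails * np.log(1 - theta)

samples, theta = [], 0.5
for _ in range(20_000):
    proposal = theta + rng.normal(scale=0.1)  # random-walk proposal
    # Accept with probability min(1, posterior ratio), done in log space.
    if np.log(rng.uniform()) < log_posterior(proposal) - log_posterior(theta):
        theta = proposal
    samples.append(theta)

# Discard burn-in; the mean should be close to the analytic Beta(8, 4) mean, 2/3.
print(np.mean(samples[5_000:]))
```

Real HMC replaces the blind random walk with gradient-guided proposals, which is what makes it practical for high-dimensional, correlated posteriors.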


I believe it is the same thing. A uniform prior means your prior is a constant function, i.e. P(A_p|I) = x, where x is a real number with the usual caveats. So if you have a uniform prior, you can drop it (from a safe height, of course). But perhaps the more seasoned Bayesians disagree? (Where are they when you need them?)
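
A quick numerical check of the "drop it" claim on a grid over theta (the grid and data counts are arbitrary): multiplying by a constant prior changes nothing once you normalize.

```python
import numpy as np

theta = np.linspace(0.001, 0.999, 999)  # grid over the Bernoulli parameter
heads, tails = 2, 1                     # hypothetical data

likelihood = theta**heads * (1 - theta)**tails
uniform_prior = np.ones_like(theta)     # constant function: P(A_p|I) = x

# Posterior with the uniform prior vs. posterior with the prior "dropped":
with_prior = likelihood * uniform_prior
with_prior /= with_prior.sum()
without_prior = likelihood / likelihood.sum()

print(np.allclose(with_prior, without_prior))  # True: the constant cancels
```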

26d: Shoot! You’re right! I think I was wrong this whole time on the impact of
dropping the prior term. Cuz data term * prior term is like multiplying the
distributions, and dropping the prior term is like multiplying the data
distribution by the uniform one. Thanks for sticking with me :)


You are right; dropping priors in the A_p distribution is probably not a general rule. Perhaps the propositions don’t always need to be interpretable for us to be able to impose priors? For example, people impose priors over the parameter space of a neural network, which is certainly not interpretable. But the topic of Bayesian neural networks is beyond me.

27d: It seems like in practice, when there’s a lot of data, people like Jaynes and
Gelman are happy to assign low-information (or “uninformative”) priors, knowing
that with a lot of data the prior ends up getting washed away anyway. So just
slapping a uniform prior down might be OK in a lot of real-world situations.
This is I think pretty different than just dropping the prior completely, but
gets the same job done.
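
A back-of-the-envelope check of the washing-away claim, with made-up counts: under a conjugate Beta prior the posterior mean is (a + heads) / (a + b + n), so with thousands of flips two quite different priors land in nearly the same place.

```python
# Conjugate Beta-Binomial arithmetic: posterior mean = (a + heads) / (a + b + n).
# Both priors and the dataset below are hypothetical choices for illustration.

priors = {"uniform Beta(1,1)": (1, 1), "skeptical Beta(10,10)": (10, 10)}
heads, tails = 6_100, 3_900  # a made-up large dataset

for name, (a, b) in priors.items():
    posterior_mean = (a + heads) / (a + b + heads + tails)
    print(f"{name}: posterior mean = {posterior_mean:.4f}")
```

Both posterior means come out around 0.610; with 10,000 flips the difference between the two priors is in the fourth decimal place.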


To calculate the posterior predictive you need to calculate the posterior, and to calculate the posterior you need to calculate the likelihood (in most problems). For the coin flipping example, what is the probability of heads and what is the probability of tails given that the frequency is equal to some value theta? You might accuse me of being completely devoid of intuition for asking this question, but please bear with me...

Sounds good. I thought nobody was interested in reading Professor Jaynes’ book anymore. It’s a shame more people don’t know about him.

27d: Regarding reading Jaynes, my understanding is it's good for intuition but bad for applied statistics, because it does not teach you modern Bayesian stuff such as WAIC and HMC, so you should first do one of the applied books. I also think Jaynes has nothing about causality.

27d: Given 1. your model and 2. the magical no-uncertainty-in-theta assumption, then it's theta. The posterior predictive allows us to jump from inference about parameters to inference about new data; it's a distribution of y (coin flip outcomes), not theta (which describes the frequency).


“[…] A_p the distribution over how often the coin will come up heads […]” - I understood A_p to be a sort of distribution over models; we do not know/talk about the model itself, but we know that if a model A_p is true, then the probability of heads is equal to p by definition of A_p. Perhaps the model A_p is the proposition “the centre of mass of the coin is at p” or “the bias-weighting of the coin is p”, but we do not care, as long as the resulting probability of heads is p. So how can the prior not be indifferent when we do not know the nature of each proposition A_p in a set of mutually exclusive and exhaustive propositions?
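
If it helps, the A_p machinery can be sketched numerically: discretize the propositions A_p, update each by the likelihood it assigns the data, and read off P(H|D,I) as the expectation of p over the A_p distribution (my reading of the chapter-18 rule; the grid, prior, and data below are my own toy choices):

```python
import numpy as np

# Discretize the A_p propositions: A_p says "P(H|A_p, I) = p".
p = np.linspace(0.005, 0.995, 100)

prior = np.ones_like(p) / len(p)  # indifferent prior over the A_p's

# Update on data D = two heads: P(A_p|D,I) ∝ P(D|A_p,I) P(A_p|I) = p^2 * prior.
posterior = p**2 * prior
posterior /= posterior.sum()

# P(H|D,I) as the expectation of p under the A_p posterior:
print((p * posterior).sum())  # close to 3/4
```

With the indifferent prior this reproduces the Beta(1,1) → Beta(3,1) answer of 3/4, which is at least consistent with A_p being a distribution over unspecified models.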

27d: I can’t see anything wrong in what you’ve said there, but I still have to insist
without good argument that dropping P(A_p|I) is incorrect. In my vague defense,
consider the two A_p distributions drawn on p558, for the penny and for Mars.
Those distributions are as different as they are because of the different prior
information. If it was correct to drop the prior term a priori, I think those
distributions would look the same?


I dropped the prior for two reasons:

- I assumed the background information to be indifferent to the A_p’s
- We do not explicitly talk about the nature of the A_p’s. Prof. Jaynes defines A_p as a proposition such that P(A|A_p, E) = p. In my example, A_p is defined as a proposition such that P(H|A_p, I) = p. No matter what prior information we have, it is going to be indifferent to the A_p’s by virtue of the fact that we don’t know what A_p represents

Is this justification valid?

28d: Isn’t A_p the distribution over how often the coin will come up heads, or the
probability of life on Mars? If so… there’s no way those things could be
indifferent to the background information. A core tenet of the philosophy
outlined in this book is that when you ignore prior information without good
cause, things get wacky and fall apart. This is part of desiderata iii from
chapter 2: “The robot always takes into account all of the evidence it has
relevant to a question. It does not arbitrarily ignore some of the information,
basing its conclusions only on what remains.”
(Then Jaynes ignores information in later chapters because it doesn’t change the
result… so this desideratum is easier said than done… but yeah)


Response to point one: I do find that to be satisfactory from a philosophical perspective, but only because theta refers to a real-world property called frequency and not the probability of heads. My question to you is this: if you have a point estimate of theta, or if you find the exact real-world value of theta (perhaps by measuring it with an ACME frequency-o-meter), what does it tell you about the probability of heads?

Response to point two: The honour is mine :) If you ever create a study group or discord server for the book, then please count me in

27d: In Bayesian statistics there are two distributions which I think we are conflating here, because they happen to have the same value.
The posterior p(θ∣y) describes our uncertainty of θ, given data (and prior information), so it's how sure we are of the frequency of the coin.
The posterior predictive is our prediction for new coin flips ~y given old coin flips y:
p(~y∣y) = ∫_Θ p(~y∣θ,y) p(θ∣y) dθ
For the simple Bernoulli-distribution coin example, the following issue arises: the parameter θ, the posterior predictive, and the posterior all have the same value, but they are different things.
Here is an example where they are different:
Here θ is not a coin but the logistic intercept of some binary outcome with predictor variable x. Let's imagine an evil Nazi scientist poisoning people; then we could make a logistic model of y (alive/dead) such as y = invlogit(ax + logit(θ)). Let's imagine that x is how much poison you ate above/below the average poison level, and that we have θ = 0.5, so on average half died.
Now we have:
- the value if we were omniscient: θ = 0.5
- the posterior of θ (because we are not omniscient, there is error): p(θ∣y) = 0.5 ± ϵ
- predictions for two different ~y with uncertainty: p(~y_lots of poison∣y) = p(~y∣y, ~x=2) = 0.99 ± ϵ ≈ 0.99 and p(~y_average poison∣y) = p(~y∣y, ~x=0) = 0.5 ± ϵ ≈ 0.5
Does this help?
I will PM you when we start reading Jaynes. We are currently reading Regression and Other Stories, but in about 20 weeks (done if we do 1 chapter per week) there is a good chance we will do Jaynes.
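
The two predictions at the end can be checked with a quick sketch of the model (the slope a = 2.3 is a made-up value, chosen so that x = 2 gives roughly 0.99):

```python
import numpy as np

def invlogit(z):
    return 1 / (1 + np.exp(-z))

def logit(p):
    return np.log(p / (1 - p))

a, theta = 2.3, 0.5  # hypothetical slope; theta = 0.5 as in the example

# Probability of death as a function of poison dose x (0 = average dose):
for x in (0.0, 2.0):
    p_death = invlogit(a * x + logit(theta))
    print(f"x = {x}: p(~y) = {p_death:.2f}")
```

At x = 0 the prediction is exactly theta, but at x = 2 it is near 0.99, so the posterior predictive is clearly not the same object as theta.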


Thank you so much for telling me about A_p distribution! This is exactly what I have been looking for.

“Pending a better understanding of what that means, let us adopt a cautious notation that will avoid giving possibly wrong impressions. We are not claiming that P(Ap|E) is a ‘real probability’ in the sense that we have been using that term; it is only a number which is to obey the mathematical rules of probability theory. Perhaps its proper conceptual meaning will be clearer after getting a little experience using it. So let us refrain from using the prefi... (read more)

28d: I don't think so. Like you, I don't really understand this A_p stuff philosophically. But the step where you drop the prior P(A_p|I) to obtain P(A_p|D,I) ∝ P(D|A_p,I) is, I think, not warranted. Dropping the prior term outright like that... I don't think there are many cases where that's acceptable. Doing so does not reflect a state of low knowledge, but instead a state of pretty strong knowledge. To give intuition on what I mean:
Contrast with the prior that reflects the state of knowledge "All I know is that H is possible and T is possible". This is closer to Jaynes' example about whether there's life on Mars. The prior that reflects that state of knowledge is Beta(1,1), which after two heads come up becomes Beta(3,1). The mean of Beta(3,1) is 3/4 = 0.75. This is much less than the 1.0 you arrive at.
A prior that gives 1.0 after the data H,H might be something like:
"This coin is very unfair in a well-known, specific way: it either always gives heads, always gives tails, or gives heads and tails alternating: 'H,T,H,T...'."
Under that prior, the data HH would give you a probability of near 1 that H is next. But that's a prior that reflects definite, strong knowledge of the coin.
Maybe this argument changes given the nature of A_p, which again I don't really understand. But whatever it is, I don't think it's valid to assume the prior away.
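
The Beta bookkeeping above is just counting; a minimal sketch, assuming the conjugate Beta-Bernoulli update:

```python
from fractions import Fraction

# Beta(a, b) prior; each observed head adds 1 to a, each tail adds 1 to b.
a, b = 1, 1              # Beta(1,1): "H is possible and T is possible"
for flip in ["H", "H"]:  # the data: two heads
    if flip == "H":
        a += 1
    else:
        b += 1

# Posterior is Beta(3,1); its mean a/(a+b) = 3/4, not 1.0.
print(f"Posterior: Beta({a},{b}), mean = {Fraction(a, a + b)}")
```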


I am very grateful for your answer, but I have a few contentions from my paradigm of objective Bayesianism:

- You have replaced probability with a physical property: “frequency”. I have also seen other people use terms like bias-weighting, fairness, center of mass, etc., which are all properties of the coin, to sidestep this question. I have nothing against theta being a physical property such that P(heads|theta=alpha) = alpha. In fact, it would make a ton of sense to me if this actually were the case. But the issue is when people say that theta is a probability.

28d: I may be too bad at philosophy to give a satisfying answer, and it may turn out that I actually do not know, and am simply too dumb to realize that I should be confused about this :)
1. There is a frequency of the coin in the real world; let's say it has θ = 0.5.
   - Because I am not omniscient, there is a distribution over θ. It's parameterized by some prior, which we ignore (let's not fight about that :)), and some data x; thus in my head there exists a probability distribution p(θ∣x).
   - The probability distribution in my head is a distribution, not a scalar. I don't know what θ is, but I may be 95% certain that it's between 0.4 and 0.6.
2. I think there are problems with objective priors, but I am honored to have met an objective Bayesian in the wild, so I would love to try to understand you. I am Jan Christian Refsgaard on the University of Bayes and Bayesian Conspiracy discord servers. My main critique is the 'invariance' of some priors under some transformations, but that is a very weak critique and my epistemology is very underdeveloped. Also, I just bought Jaynes' book :) and will read it when I find a study group, so who knows, maybe I will be an objective Bayesian a year from now :)


So you are saying that “we” are uncertain about the degree of belief/plausibility that our brain is going to assign? Then who are “we” exactly? Apologies for being glib, but I really don’t understand.

Also, it is a crime to have different priors given the same information, according to us objective Bayesians, so that can’t be the issue.


These subjective Bayesians... :) I feel the same way about that statement. Could you please elaborate?

18d: Uncertainty is a statement about my brain, not the real world. If you replicate the initial conditions, then it will always land either heads or tails, so even if the coin is "fair", p(H∣θ) = 0.5, then maybe p(H∣θ, very good at physics) = 0.95. The uncertainty comes from me being stupid and thus being unable to predict the next coin toss.
Also, there are two things we are uncertain about: we are uncertain about θ (the coin's frequency), and we are uncertain about p(H∣θ), the next coin toss.


What is the theoretical justification behind taking the mean? Argmax feels more intuitive for me because it is literally “the most plausible value of theta”. In either case, whether we use argmax or mean, can we prove that it is equal to P(H|D)?

18d: If I have a distribution of 2 kids and a professional boxer, and a random one is going to hit me, then argmax tells me that I will always be hit by a kid. Sure, if you draw from the distribution only once, then argmax will beat the mean in 2/3 of the cases, but it's much worse at answering what will happen if I draw 9 hits (argmax = nothing, mean = 3 hits from a boxer).
This distribution is skewed, like the beta distribution, and is therefore better summarized by the mean than the mode.
In Bayesian statistics, argmax on sigma will often lead to sigma = 0 if you assume that sigma follows an exponential distribution; thus it will lead you to assume that there is no variance in your sample.
The variance is also lower around the mean than the mode, if that counts as a theoretical justification :)
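
The last point can be checked numerically on the skewed Beta(3,1) posterior (the one a uniform prior plus two heads produces); the sample size is arbitrary:

```python
import numpy as np

# Posterior after 2 heads in 2 flips with a Beta(1,1) prior: Beta(3,1).
a, b = 3, 1

mean = a / (a + b)            # 0.75
mode = (a - 1) / (a + b - 2)  # 1.0, the argmax of the density

# Mean squared error of each point estimate over the posterior:
theta = np.random.default_rng(3).beta(a, b, size=100_000)
mse_mean = np.mean((theta - mean) ** 2)
mse_mode = np.mean((theta - mode) ** 2)
print(f"mean={mean}, mode={mode}")
print(f"MSE(mean)={mse_mean:.4f}, MSE(mode)={mse_mode:.4f}")
```

The MSE around the mean is just the posterior variance (about 0.0375 here), while the MSE around the mode is larger, which is the "variance is lower around the mean" claim in numbers.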


I believe, mathematically, your claim can be expressed as:

P(H|D) = argmax_θ P(θ|D)

where θ is the ”probability“ parameter of the Bernoulli distribution, H represents the proposition that heads occurs, and D represents our data. The left side of this equation is the plausibility based on knowledge and the right side is Professor Jaynes’ ‘estimate of the probability’. How can we prove this statement?

Edit:

LaTeX is being a nuisance as usual :) The right side of the equation is the argmax with respect to theta of P(theta | data).

28d: I think argmax is not the way to go, as the beta distribution and binomial likelihood are only symmetric when the coin is fair. If you want a point estimate, the mean of the distribution is better: it will always be closer to 50/50 than the mode, and thus more conservative. With argmax you are essentially ignoring all the uncertainty of theta and thus overestimating the probability.


Appreciate your reply. I think the source of my confusion is there being uncertainty in the degree of plausibility that *we* assign given our knowledge, or there being uncertainty in

28d: The probability is an external/physical thing because your brain is physical, but I take your point.
I think the we/our distinction arises because we have different priors.

Excellent! One final point that I would like to add: if we say that “theta is a physical quantity s.t. [...]”, we are faced with an ontological question: “does a physical quantity exist with these properties?”

I recently found out about Professor Jaynes’ A_p distribution idea (it is introduced in chapter 18 of his book) from Maxwell Peterson in the sub-thread below, and I believe it is an elegant workaround to this problem. It leads to the same results but is more satisfying philosophically.

This is how it would work in the coin flipping example:

De... (read more)