Excellent! One final point I would like to add: if we say that “theta is a physical quantity s.t. [...]”, we are faced with an ontological question: “does a physical quantity with these properties exist?”.
I recently found out about Professor Jaynes’ A_p distribution idea (introduced in chapter 18 of his book) from Maxwell Peterson in the sub-thread below, and I believe it is an elegant workaround to this problem. It leads to the same results but is more satisfying philosophically.
This is how it would work in the coin flipping example:
De... (read more)
No worries :) Thanks a lot for your help! Much appreciated.
It’s amazing how complex a simple coin flipping problem can get when we approach it from our paradigm of objective Bayesianism. Professor Jaynes remarks on this after deriving the principle of indifference: “At this point, depending on your personality and background in this subject, you will be either greatly impressed or greatly disappointed by the result (2.91).” - page 40
A frequentist would have “solved“ this problem rather easily. Personally, I would trade simplicity for coherence any day of the week...
Think I have finally got it. I would like to thank you once again for all your help; I really appreciate it.
This is what I think “estimating the probability” means:
We define theta to be a real-world/objective/physical quantity s.t. P(H|theta=alpha) = alpha & P(T|theta=alpha) = 1 - alpha. We do not talk about the nature of this quantity theta because we do not care what it is. I don’t think it is appropriate to say that theta is “frequency” for this reason:
I’m afraid I have to disagree. I do sometimes regret not focusing more on applied Bayesian inference. (In fact, I have no idea what WAIC or HMC is.) But in my defence, I am an amateur analytical philosopher & logician, and I couldn’t help finding more non-sequiturs in classical expositions of probability theory than plot-holes in Tolkien novels. Perhaps if I had been more naive and less critical (no offence to anyone) when I read those books, I would have “progressed” faster. I had lost hope in understanding probability theory before I read Professor Jayn... (read more)
I believe it is the same thing. A uniform prior means your prior is a constant function, i.e. P(A_p|I) = x where x is a real number, with the usual caveats. So if you have a uniform prior, you can drop it (from a safe height of course). But perhaps the more seasoned Bayesians disagree? (Where are they when you need them?)
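To spell out why a constant prior can be dropped, here is a sketch of the cancellation in Bayes' theorem. It assumes p ranges over [0, 1], writes D for the data, and uses q as a dummy integration variable; these symbols are my own notation for illustration, not taken from the book.

```latex
P(A_p \mid D, I)
  = \frac{P(D \mid A_p, I)\, P(A_p \mid I)}{\int_0^1 P(D \mid A_q, I)\, P(A_q \mid I)\, dq}
  = \frac{P(D \mid A_p, I)\, x}{x \int_0^1 P(D \mid A_q, I)\, dq}
  = \frac{P(D \mid A_p, I)}{\int_0^1 P(D \mid A_q, I)\, dq}
```

The constant x appears in both numerator and denominator, so the posterior is just the normalised likelihood. This only works because x does not depend on p; a non-uniform prior would not cancel.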
You are right; dropping priors in the A_p distribution is probably not a general rule. Perhaps the propositions don’t always need to be interpretable for us to be able to impose priors? For example, people impose priors over the parameter space of a neural network, which is certainly not interpretable. But the topic of Bayesian neural networks is beyond me.
To calculate the posterior predictive you need to calculate the posterior, and to calculate the posterior you need to calculate the likelihood (in most problems). For the coin flipping example, what is the probability of heads and what is the probability of tails given that the frequency is equal to some value theta? You might accuse me of being completely devoid of intuition for asking this question, but please bear with me...
Sounds good. I thought nobody was interested in reading Professor Jaynes’ book anymore. It’s a shame more people don’t know about him
“[…] A_p the distribution over how often the coin will come up heads […]” - I understood A_p to be a sort of distribution over models; we do not know/talk about the model itself, but we know that if a model A_p is true, then the probability of heads is equal to p by definition of A_p. Perhaps the model A_p is the proposition “the centre of mass of the coin is at p” or “the bias-weighting of the coin is p”, but we do not care as long as the resulting probability of heads is p. So how can the prior not be indifferent when we do not know the nature of each proposition A_p in a set of mutually exclusive and exhaustive propositions?
I dropped the prior for two reasons:
Is this justification valid?
Response to point one: I do find that to be satisfactory from a philosophical perspective, but only because theta refers to a real-world property called frequency and not the probability of heads. My question to you is this: if you have a point estimate of theta, or if you find the exact real-world value of theta (perhaps by measuring it with an ACME frequency-o-meter), what does it tell you about the probability of heads?
Response to point two: The honour is mine :) If you ever create a study group or discord server for the book, then please count me in
Thank you so much for telling me about A_p distribution! This is exactly what I have been looking for.
“Pending a better understanding of what that means, let us adopt a cautious notation that will avoid giving possibly wrong impressions. We are not claiming that P(Ap|E) is a ‘real probability’ in the sense that we have been using that term; it is only a number which is to obey the mathematical rules of probability theory. Perhaps its proper conceptual meaning will be clearer after getting a little experience using it. So let us refrain from using the prefi... (read more)
I am very grateful for your answer but I have a few contentions from my paradigm of objective Bayesianism
So you are saying that “we” are uncertain about the degree of belief/plausibility that our brain is going to assign? Then who are “we” exactly? Apologies for being glib, but I really don’t understand. Also, it is a crime to have different priors given the same information according to us objective Bayesians, so that can’t be the issue.
These subjective Bayesians... :) I feel the same way about that statement. Could you please elaborate?
What is the theoretical justification behind taking the mean? Argmax feels more intuitive for me because it is literally “the most plausible value of theta”. In either case, whether we use argmax or mean, can we prove that it is equal to P(H|D)?
I believe, mathematically, your claim can be expressed as:
P(H|D) = argmax_θ P(θ|D)
where θ is the “probability” parameter of the Bernoulli distribution, H represents the proposition that heads occurs, and D represents our data. The left side of this equation is the plausibility based on knowledge and the right side is Professor Jaynes’ ‘estimate of the probability’. How can we prove this statement?
Latex is being a nuisance as usual :) The right side of the equation is the argmax with respect to theta of P(theta | data)
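One way to probe this claim numerically is a grid approximation. The sketch below assumes a uniform prior over theta, independent flips, and that P(H|theta) = theta; the function name and the 7 heads / 3 tails data are made up for illustration. Under those assumptions, marginalising over theta gives P(H|D) = sum of theta * P(theta|D), which is the posterior mean, so the code compares that with the argmax.

```python
def posterior_summaries(heads, tails, grid_size=10_001):
    """Grid approximation of P(theta | D) under a uniform prior (an assumption)."""
    thetas = [i / (grid_size - 1) for i in range(grid_size)]
    # Unnormalised posterior: Bernoulli likelihood times a constant prior.
    weights = [t**heads * (1 - t)**tails for t in thetas]
    total = sum(weights)
    # Posterior predictive P(H | D): average of P(H | theta) = theta
    # weighted by the posterior, i.e. the posterior mean of theta.
    predictive = sum(t * w for t, w in zip(thetas, weights)) / total
    # Argmax of the posterior over the grid (the MAP estimate).
    argmax = thetas[max(range(grid_size), key=weights.__getitem__)]
    return predictive, argmax

# Hypothetical data: 7 heads and 3 tails.
predictive, argmax = posterior_summaries(heads=7, tails=3)
print(f"P(H|D) via marginalisation: {predictive:.4f}")
print(f"argmax of the posterior:    {argmax:.4f}")
```

For this data the two numbers differ: the marginalised value is close to (7 + 1) / (10 + 2), Laplace's rule of succession, while the argmax is close to 7 / 10. So under these assumptions the equation with argmax does not hold in general, whereas the version with the posterior mean does.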
Appreciate your reply. I think the source of my confusion is the idea of there being uncertainty in the degree of plausibility (or degree of belief) that we assign given our knowledge. This feels a bit unnatural to me because this quantity is not an external, physical, unknown quantity but one that we assign given our knowledge. If we were to think of probabilities as physical properties that are unknown, then it makes sense to me that there can be uncertainty in their value. How would you reconcile this?