All of Jan Christian Refsgaard's Comments + Replies

List of Probability Calibration Exercises

It would be nice if you wrote a short note for each link (e.g. "requires download" or "questions are from 2011"), or sorted the list somehow :)

Use Normal Predictions

Yes, you can change future μ by being smarter and future σ by being better calibrated. My rule assumes you don't get smarter and therefore have to adjust only future σ.

If you actually get better at prediction you could argue you would need to update less than the RMSE estimate suggests :)

Use Normal Predictions

I agree with both points

If you are new to continuous predictions then you should focus on the 50% interval, as it gives you the most information about your calibration. If you are skilled and use, for example, a t-distribution, then you have σ for the trunk and ν for the tail. Even then few predictions should land in the tails, so most data should provide more information about how to adjust σ than how to adjust ν.

Hot take: I think the focus on 95% is an artifact of us focusing on p<0.05 in frequentist statistics.

Use Normal Predictions

Our ability to talk past each other is impressive :)

would have been an easier way to illustrate your point). I think this is actually the assumption you're making. [Which is a horrible assumption, because if it were true, you would already be perfectly calibrated].

Yes, this is almost the assumption I am making. The general point of this post is to assume that all your predictions follow a normal distribution, with μ as "guessed" and with a σ that is different from what you guessed, and then use to get a point estimate for the counterfactual you sho... (read more)

Use Normal Predictions

Thanks! I am planning on writing a few more in this vein. Currently I have some rough drafts of:

  • 30% done: How to calibrate normal predictions
    • A defence of my calibration scheme, and an explanation of how Metaculus does it.
  • 10% done: How to make overdispersed predictions
    • Like this one, for the logistic and t distributions.
  • 70% done: How to calibrate binary predictions
    • Like this one, but gives a posterior over the calibration by doing a logistic regression with your predictions as "x" and the outcome as "y"

I can't promise they will be as good as this ... (read more)

Use Normal Predictions

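Yes, you are right, but under the assumption that the errors are normally distributed, I am right:

If the errors are a mixture of normals centered at 0, 22% with standard deviation 0.5 and 78% with standard deviation 1.1, then the median of the squared errors is much less than 1.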


import numpy as np
from scipy import stats

# Mixture of two zero-mean normals: 22% with sd 0.5, 78% with sd 1.1
x1 = stats.norm(0, 0.5).rvs(22 * 10000)
x2 = stats.norm(0, 1.1).rvs(78 * 10000)
x12 = np.concatenate([x1, x2])
# Median of the squared errors
print(np.median(x12 ** 2))
2 SimonM 12d: Under what assumption?

1/ You aren't "[assuming] the errors are normally distributed" (since a mixture of two normals isn't normal) in what you've written above.

2/ If your assumption is X ∼ N(0,1) then yes, I agree the median of X² is ~0.45 (although

from scipy import stats
stats.chi2.ppf(.5, df=1)
>>> 0.454936

would have been an easier way to illustrate your point). I think this is actually the assumption you're making. [Which is a horrible assumption, because if it were true, you would already be perfectly calibrated].

3/ I guess your new claim is "[assuming] the errors are a mixture of normal distributions, centered at 0", which okay, fine, that's probably true; I don't care enough to check because it seems a bad assumption to make.

More importantly, there's a more fundamental problem with your post. You can't just take some numbers from my post and then put them in a different model and think that's in some sense equivalent. It's quite frankly bizarre. The equivalent model would be something like:

p ∼ Bern(0.78)
σ ∼ p⋅N(1.1, ε) + (1−p)⋅N(0.5, ε)
Use Normal Predictions

I am making the simple observation that the median error is less than one because the mean squared error is one.

1 SimonM 13d: That isn't a "simple" observation. Consider an error which is 0.5 22% of the time and 1.1 78% of the time. The squared errors are 0.25 and 1.21. The median error is 1.1 > 1. (The mean squared error is 1.)
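The counterexample in this reply can be checked numerically. A minimal sketch, using only the numbers quoted in the reply (the variable names are illustrative):

```python
import numpy as np

# Two-point error distribution from the reply: the error is 0.5 with
# probability 0.22 and 1.1 with probability 0.78.
errors = np.array([0.5, 1.1])
probs = np.array([0.22, 0.78])

# Mean squared error: probability-weighted average of the squared errors.
mse = float(np.sum(probs * errors ** 2))

# Median error: smallest error whose cumulative probability reaches 0.5.
median_error = float(errors[np.searchsorted(np.cumsum(probs), 0.5)])

print(round(mse, 4), median_error)
```

The mean squared error comes out just below 1 while the median error is above 1, which is the point of the counterexample.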
Use Normal Predictions

That's also how I conceptualize it: you have to change your intervals because you are too stupid to make better predictions. If the prediction was always spot on then sigma should be 0, and then my scheme does not make sense.

If you suck like me and get a prediction very close then I would probably say: that sometimes happens :) Note I assume the average squared error should be 1, which means most errors are less than 1, because (0² + 2²)/2 = 2 > 1

1 SimonM 13d: I assume you're making some unspoken assumptions here, because 0² + 2² > 1² is not enough to say that. A naive application of Chebyshev's inequality would just say that E(X²) = 1, E(X) = 0 ⇒ P(X ≤ 1) ≤ 1. To be more concrete, if you were very weird, and either end up forecasting 0.5 s.d. or 1.1 s.d. away (still with mean 0 and average squared error 1), then you'd find "most" errors are more than 1.
Use Normal Predictions

I agree, most things are not normally distributed, and my calibration rule answers how to rescale to a normal. Metaculus uses the CDF of the predicted distribution, which is better if you have lots of predictions. My scheme gives an actionable number faster by making assumptions that are wrong, but if you, like me, have intervals that seem off by almost a factor of 2, then your problem is not the tails but the entire region :), so the trade-off seems worth it.

2 SimonM 13d: You keep claiming this, but I don't understand why you think this.
Use Normal Predictions

Agreed. More importantly, the two distributions have different kurtosis, so their tails are very different a few sigmas away.

I do think the Laplace distribution is a better beginner distribution because of its fat tails, but advocating for people to use a distribution they have never heard of seems like too tough a sell :)

Use Normal Predictions

My original opening statement got trashed for being too self-congratulatory, so the current one is a hot fix :). So I agree with you!

Genetics: It sometimes skips a generation is a terrible explanation!

Me too. I learned about this from another disease and thought, that's probably how it works for colorblindness as well.

Use Normal Predictions

I would love to have you as a reviewer of my second post, as there I will try to justify why I think this approach is better. You can even super dislike it before I publish if you still feel like that when I present my strongest arguments, or maybe convince me that I am wrong so I don't publish part 2 and make a partial retraction for this post :). There is a decent chance you are right, as you are the stronger predictor of the two of us :)

5 SimonM 16d: I'd be happy to.
Use Normal Predictions

Can I use this image for my "part 2" post, to explain how "pros" calibrate their continuous predictions, and how it stacks up against my approach? I will add you as a reviewer before publishing so you can make corrections in case I accidentally straw man or misunderstand you :)

I will probably also make a part 3 titled "Try t predictions" :), that should address some of your other critiques about the normal being bad :)

Use Normal Predictions

Note 1 for JenniferRM: I have updated the text so it should alleviate your confusion, if you have time, try to re-read the post before reading the rest of my comment, hopefully the few changes should be enough to answer why we want RMSE=1 and not 0.
Note 2 for JenniferRM and others who share her confusion: if the updated post is not sufficient but the below text is, how do I make my point clear without making the post much longer?

With binary predictions you can cheat and predict 50/50 as you point out... You can't cheat with continuous predictions as ther... (read more)

2 JenniferRM 14d: When I google for [Bernoulli likelihood] I end up at the distribution and I don't see anything there about how to use it as a measure of calibration and/or decisiveness and/or anything else.

One hypothesis I have is that you have some core idea like "the deep true nature of every mental motion comes out as a distribution over a continuous variable... and the only valid comparison is ultimately a comparison between two distributions"... and then if this is what you believe then by pointing to a different distribution you would have pointed me towards "a different scoring method" (even though I can't see a scoring method here)...

Another consequence of you thinking that distributions are the "atoms of statistics" (in some sense) would (if true) imply that you think that a Brier Score has some distribution assumption already lurking inside it as its "true form" and furthermore that this distribution is less sensible to use than the Bernoulli? ...

As to the original issue, I think a lack of an ability, with continuous variables, to "max the calibration and totally fail at knowing things and still get an ok <some kind of score> (or not be able to do such a thing)" might not prove very much about <that score>?

Here I explore for a bit... can I come up with a N(m,s) guessing system that knows nothing but seems calibrated?

One thought I had: perhaps whoever is picking the continuous numbers has biases, and then you could make predictions of sigma basically at random at first, and then as confirming data comes in for that source, that tells you about the kinds of questions you're getting, so in future rounds you might tweak your guesses with no particular awareness of the semantics of any of the questions... such as by using the same kind of reasoning that lead you to concluding "widen my future intervals by 73%" in the example in the OP. With a bit of extra glue logic that says something vaguely like "use all pas
Use Normal Predictions

The big ask is making normal predictions; calibrating them can be done automatically. Here is a quick example using Google Sheets.

I totally agree with both your points. This comment from a Metaculus user has some good objections to "us" :)
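The automatic calibration step can also be sketched in a few lines of code. This is a minimal sketch, assuming the rule discussed in these comments: standardise each outcome by the predicted mean and standard deviation, then take the root mean square of the resulting z-scores, which should be close to 1 for a calibrated forecaster. The function name and the example numbers are made up for illustration and are not taken from the post or the spreadsheet.

```python
import numpy as np

def sigma_scale(mu, sigma, outcome):
    """Root-mean-square z-score; values above 1 suggest intervals that are too narrow."""
    mu, sigma, outcome = (np.asarray(a, dtype=float) for a in (mu, sigma, outcome))
    z = (outcome - mu) / sigma
    return float(np.sqrt(np.mean(z ** 2)))

# Hypothetical predictions (mean, sd) and outcomes, purely illustrative.
mu = [10.0, 50.0, 3.0]
sigma = [2.0, 5.0, 1.0]
outcome = [14.0, 45.0, 3.0]

scale = sigma_scale(mu, sigma, outcome)
print(scale)
```

Under this reading, a result above 1 would suggest multiplying future standard deviations by roughly that factor.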

Use Normal Predictions

I am sorry if I have straw manned you, and I think your above post is generally correct. I think we are coming from two different worlds.

You are coming from Metaculus, where people make a lot of predictions, where having 50+ predictions is the norm, and thus looking at a U(0, 1) gives a lot of intuitive evidence of calibration.

I come from a world where people want to improve in all kinds of ways, and one of them is prediction. Few people write more than 20 predictions down a year, and when they do they more or less ALWAYS make dichotomous predictions. I ... (read more)

4 SimonM 16d: I still think you're missing my point.

If you're making ~20 predictions a year, you shouldn't be doing any funky math to analyse your forecasts. Just go through each one after the fact and decide whether or not the forecast was sensible with the benefit of hindsight.

I think this is exactly my point: if someone doesn't know what a normal distribution is, maybe they should be looking at their forecasts in a fuzzier way than trying to back fit some model to them.

I disagree that's all you propose. As I said in an earlier comment, I'm broadly in favour of people making continuous forecasts as they convey more information. You paired your article with what I believe is broadly bad advice around analysing those forecasts. (Especially if we're talking about a sample of ~20 forecasts)
Use Normal Predictions

TLDR for our disagreement:

SimonM: Transforming to Uniform distribution works for any continuous variable and is what Metaculus uses for calibration
Me: the variance trick from this post is better if your variables are from a normal distribution, or something close to a normal.
SimonM: Even for a Normal the Uniform is better.

4 SimonM 17d: I disagree with that characterisation of our disagreement; I think it's far more fundamental than that.

  1. I think you misrepresent the nature of forecasting (in its generality) versus modelling in some specifics
  2. I think your methodology is needlessly complicated
  3. I propose what I think is a better methodology

To expand on 1. I think (although I'm not certain, because I find your writing somewhat convoluted and unclear) that you're making an implicit assumption that the error distribution is consistent from forecast to forecast. Namely your errors when forecasting COVID deaths and Biden's vote share come from some similar process. This doesn't really mirror my experience in forecasting. I think this model makes much more sense when looking at a single model which produces lots of forecasts. For example, if I had a model for COVID deaths each week, and after 5-10 weeks I noticed that my model was under or over confident then this sort of approach might make sense to tweak my model.

To expand on 2. I've read your article a few times and I still don't fully understand what you're getting at. As far as I can tell, you're proposing a model for how to adjust your forecasts based on looking at their historic performance. Having a specific model for doing this seems to miss the point of what forecasting in the real world is like. I've never created a forecast, and gone "hmm... usually when I forecast things with 20% they happen 15% of the time, so I'm adjusting my forecast down" (which is I think what you're advocating) it's more likely a notion of, "I am often over/under confident, when I create this model is there some source of variance I am missing / over-estimating?". Setting some concrete rules for this doesn't make much sense to me.

Yes, I do think it's much simpler for people to look at a list of percentiles of things happening, to plot them, and then think "am I generally over-confident / under-confident"? I think it's generally much easier for peo
Use Normal Predictions

I don't know what s.f. is, but the interval around 1.73 is obviously huge. With 5-10 data points it's quite narrow if your predictions are drawn from N(1, 1.73); that is what my next post will be about. There might also be a smart way to do this using the uniform, but I would be surprised if its dispersion is smaller than that of a chi^2 distribution :) (changing the mean is cheating; we are talking about calibration, so you can only change your dispersion)

Use Normal Predictions

Hard disagree, From two data points I calculate that my future intervals should be 1.73 times wider, converting these two data points to U(0,1) I get

[0.99, 0.25]

How should I update my future predictions now?
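One way to connect the two views is to map the U(0, 1) values back to z-scores with the inverse normal CDF and take their root mean square. A minimal sketch, assuming the two values above are normal CDF values rounded to two decimals, so the result need not match 1.73 exactly:

```python
import numpy as np
from scipy import stats

# The two data points after transformation to U(0, 1), as quoted above.
u = np.array([0.99, 0.25])

# The inverse normal CDF recovers the standardised errors (z-scores).
z = stats.norm.ppf(u)

# Root mean square of the z-scores: the implied widening factor.
scale = float(np.sqrt(np.mean(z ** 2)))
print(z, scale)
```

This only shows that the two representations carry the same information under a normal assumption; it says nothing about how uncertain the factor is with two data points.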

2 SimonM 17d: If you think 2 data points are sufficient to update your methodology to 3 s.f. of precision I don't know what to tell you. I think if I have 2 data points and one of them is 0.99 then it's pretty clear I should make my intervals wider, but how much wider is still very uncertain with very little data. (It's also not clear if I should be making my intervals wider or changing my mean too)
Use Normal Predictions

you are missing the step where I am transforming arbitrary distribution to U(0, 1)

Medium confident in this explanation: because the squares of random variables from the same distribution follow a gamma distribution, and it's easier to see violations from a gamma than from a uniform. If the majority of your predictions are from weird distributions then you are correct, but if they are mostly from normal or unimodal ones, then I am right. I agree that my solution is a hack that would make no statistician proud :)

Edit: Intuition pump, a T(0, 1, 100) obviou... (read more)

3 SimonM 17d: I am absolutely not missing that step. I am suggesting that should be the only step. (I don't agree with your intuitions in your "explanation" but I'll let someone else deconstruct that if they want)
Use Normal Predictions

changed to "Making predictions is a good practice, writing them down is even better."

does anyone have a better way of introducing this post?

2 Less_Random 14d: Overall great post: by retrospectively evaluating your prior predictions (documented so as to avoid one's tendency to 'nudge' your memories based on actual events which transpired) using a 'two valued' Normal distribution (guess and 'distance' from guess as confidence interval), rather than a 'single-valued' Bernoulli/binary distribution (yes/no on guess-actual over/under), one is able to glean more information and therefore more efficiently improve future predictions.

That opening statement, while good and useful, does come off a little 'non sequitur'-ish. I urge you to find a more impactful opening statement (but don't have a recommendation, other than some simplification resulting from what I said above).
Use Normal Predictions

(Edit: the above post has 10 up votes, so many people feel like that, so I will change the intro)

You have two critiques:

  1. Scott Alexander evokes tribalism

  2. We predict more than people outside our group holding everything else constant

  3. I was not aware of it, and I will change if more than 40% agree

Remove reference to Scott Alexander from the intro: [poll]{Agree}{Disagree}

  1. I think this is true, but have no hard facts, more importantly you think I am wrong, or if this also evokes tribalism it should likewise be removed...

Also Remove "We rationalists... (read more)

[This comment is no longer endorsed by its author]
Use Normal Predictions

This is a good point, but you need less data to check whether your squared errors are close to 1 than whether your inverse CDF look uniform, so if the majority of predictions are normal I think my approach is better.

The main advantage of SimonM/Metaculus is that it works for any continuous distribution.

9 SimonM 17d: I don't understand why you think that's true. To rephrase what you've written: "You need less data to check whether samples are approximately N(0,1) than if they are approximately U(0,1)". It seems especially strange when you think that transforming your U(0,1) samples to N(0,1) makes the problem soluble.
Use Normal Predictions

Agreed 100% on 1), and with 2) I think my point is "start using the normal predictions as a gateway drug to over-dispersed and model-based predictions"

I stole the idea from Gelman and simplified it for the general community, I am mostly trying to raise the sanity waterline by spreading the gospel of predicting on the scale of the observed data. All your critiques of normal forecasts are spot on.

Ideally everybody would use mixtures of over-dispersed distributions or models when making predictions to capture all sources of uncertainty

It is my hope that by ed... (read more)

Use Normal Predictions

You could make predictions from a t distribution to get fatter tails, but then the "easy math" for calibration becomes more scary... You can then take the "quartile" from the t distribution and ask what sigma in the normal it corresponds to. That is what I outlined/hinted at in the "Advanced Techniques 3"
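The idea of going through the quantile can be sketched as follows. This is only one reading of the technique, assuming a prediction from a Student t distribution with hypothetical location, scale, and degrees of freedom: compute the outcome's CDF value under the t prediction, then find the z-score with the same CDF value under a standard normal.

```python
from scipy import stats

def t_outcome_to_normal_z(outcome, loc, scale, df):
    """Map an outcome under a t prediction to the z-score with the same quantile."""
    u = stats.t.cdf(outcome, df, loc=loc, scale=scale)  # quantile under the t prediction
    return float(stats.norm.ppf(u))                      # matching standard normal z-score

# Hypothetical prediction: t with 4 degrees of freedom, centre 50, scale 5.
z = t_outcome_to_normal_z(outcome=65.0, loc=50.0, scale=5.0, df=4)
print(z)
```

The resulting z-scores could then be fed into the same squared-error check used for normal predictions; because the t distribution has fatter tails, an outcome three scale units out maps to a z-score smaller than 3.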

Use Normal Predictions

Good Points, Everything is a conditional probability, so you can simply make conditional normal predictions:

Let A = Biden alive

Let B = Biden vote share

Then the normal probability is conditional on him being alive and does not count otherwise :)

Another solution is to make predictions from a t-distribution to get fatter tails, and then use "Advanced trick 3" to transform it back to a normal when calculating your calibration.

Genetics: It sometimes skips a generation is a terrible explanation!

I think this was by my parents, so they are forgiven :). Your story is pretty crazy, but there is so much to know as a doctor that most of it becomes rules of thumb (maps vs buttons) until called out like you did.

Genetics: It sometimes skips a generation is a terrible explanation!

Fair point. I think my target audience is people like me who heard this saying about colorblindness (or other classical Mendelian diseases that run in families)

I have added a disclaimer towards the end :)

Genetics: It sometimes skips a generation is a terrible explanation!

I am not sure I follow. I am confused about whether the 60/80 family refers to both parents, and what is meant by "off-beat" and "snap-back". I am also confused about what the numbers mean: is it 60/80 of the genes or 60/80 of the coding region (so only 40 genes)?

3 Slider 25d: 60 trait-supporting genes out of 80 locations that could support it. I am worried that the main finding is misleading because it is an improper application of spherical cow thinking to a concept that is oriented to dealing with messiness.
Genetics: It sometimes skips a generation is a terrible explanation!

I totally agree, technically it's a correct observation, but it's also what I was taught by adults when I asked as a kid, and therefore I wanted to correct it as the real explanation is very short and concise.

6 localdeity 25d: Ah, that explains it. Adults are often not very good at explaining science to kids. And I'd guess the adults in question might not have known that colorblindness was X-linked, even if they were paid to teach science; I think I'd only be surprised by that ignorance in K-12 education if a teacher chose to present the subject of colorblind genetics to the class.

I once had a doctor (I'd guess in her early thirties) who, in a discussion of male-pattern baldness, mentioned the mother's father as the best data point, which means it must be X-linked, because otherwise the father's father would be an equally good data point (not to mention the father, if old enough). I said, "So, it's X-linked, then." She said, "No, it's not X-linked". I stated the above logic. She didn't comment on it, but consulted her computer system, and reported that there were five genes found to be associated with male-pattern baldness, some on the X chromosome and some not.
This Year I Tried To Teach Myself Math. How Did It Go?

That is hard to believe, you seem so smart at the UoB discord and your podcast :), thanks for sharing

This Year I Tried To Teach Myself Math. How Did It Go?

The University of Bayes Discord (UoB) has study groups for Bayesian statistics which might be relevant to you. The newest study group is doing Statistical Rethinking 2022 as the lectures get posted to YouTube. It requires less math than you have demonstrated in your post.

If you want a slightly more rigorous path to Bayesian statistics, then I would advise reading Lambert or Gelman. See here for more info.

If you want to take the mathematician approach and learn probability theory first, then the book Probability 110 by Blitzstein is pretty good, the study group... (read more)

The Genetics of Space Amazons

Totally agree, it's also Christian's critique of the idea :)... Maybe it could be relevant for aliens on a smaller planet, as they could leave their planet more easily and would thus be less advanced than us when we become space faring :)... Or a sci-fi where the different tech trees progress differently, like steampunk

The Genetics of Space Amazons

Then maybe it only works for harem anime in space :)

From Considerations to Probabilities

A lot of your latex is not rendered correctly...

The Genetics of Space Amazons

Agreed, but then you don't get cool space amazons :). It could be an extra fail safe mechanism :)

2 ChristianKl 1mo: You can get rid of men entirely and then have your space amazons.
The Genetics of Space Amazons

Good point. In principle the X chromosome already has this issue when you get it from your father. If the A chromosome is simply a normal X chromosome with an insertion of a set of proteins that blocks silencing, then you can still have recombination. If we assume the Amazon proteins are all located in the same LD region, then mechanically everything is as in the post, but we do not have the Muller's ratchet problem.

Also the A only recombines with X as AY is female and therefore never mates with an AX or AY

The Genetics of Space Amazons

When the space ship lands there is a 1% chance that no males are among the first 16 births.

Luckily males are fertile for longer, so if the second generation had no men the first generation still works.

If the A had a mutation such that AX did not have a 50% chance of passing on an A, then the gender ratio would be even more extreme. If the last man dies, then an AY female could probably artificially inseminate a female.

You can update the matrix and do the dot product to see how those different rules pan out, if you have a specific ratio you want to try then ... (read more)
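The 1% figure for the first 16 births can be checked with a short calculation. A minimal sketch, assuming births are independent and that males make up 1 in 4 births (an assumption read off the 1:3 ratio mentioned in a later comment, not stated in this one):

```python
# Assumed probability that a given birth is male (1:3 male-to-female ratio).
p_male = 1 / 4
n_births = 16

# Probability that none of the first 16 independent births is male.
p_no_males = (1 - p_male) ** n_births
print(round(p_no_males, 4))
```

Under these assumptions the probability comes out at about one in a hundred, in line with the figure in the comment.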

The Genetics of Space Amazons

1:10 was a good guess, but unfortunately the amazon gene only gets us to 1:3

Question about Test-sets and Bayesian machine learning

Wild Speculation:

I am 70% confident that if we were smarter then we would not need it. 

If you have some data for which you (magically) know the likelihood and prior, then you would have some uncertainty from the parameters in the model and some from the parameters; this would then change the form of the posterior, for example from a normal to a t-distribution, to account for this extra uncertainty.

In the real world we assume a likelihood and guess a prior, and even with simple models such as y ~ ax + b we will usually model the residual errors as a normal dist... (read more)

2 Radford Neal 6mo: "... a model with more* parameters will always have less residual errors (unless you screw up the prior) and thus the in sample predictions will seem better"

Not always (unless you're sweeping all exceptions under "unless you screw up the prior"). With more parameters, the prior probability for the region of the parameter space that fits the data well may be smaller, so the posterior may be mostly outside this region. Note that "less residual errors" isn't a very clear concept in Bayesian terms - there's a posterior distribution of residual error on the training set, not a single value. (There is a single residual error when making the Bayesian prediction averaged over the posterior, but this residual error also doesn't necessarily go down when the model becomes more complex.)

"Bayesian Models just like Frequentest Models are vulnerable to over fitting if they have many parameters and weak priors."

Actually, Bayesian models with many parameters and weak priors tend to under fit the data (assuming that by "weak" you mean "vague" / "high variance"), since the weak priors in a high dimensional space give high prior probability to the data not being fit well.
1 [anonymous] 6mo: It seems to me like there are two distinct issues: estimating error of model on future data and model comparison.

  1. It would be useful to know the most likely value of error on future data before we actually use the model; but is this what test set error represents - the most likely value of error on future data?
  2. Why do we use techniques like WAIC and PSIS-LOO when we can (and should?) simply use p(M|D), i.e. Bayes factors, Ockham factors, Model Evidence, etc.? These techniques seem to work well for over-fitting (see image below). Once we find the more plausible model, we use it to make predictions
Do Bayesians like Bayesian model Averaging?

Good points, but can't you still solve the discrete problem with a single model and a stick breaking prior on the number of mints, right?

2 Radford Neal 6mo: If you're thinking of a stick-breaking prior such as a Dirichlet process mixture model, they typically produce an infinite number of components (which would be mints, in this case), though of course only a finite number will be represented in your finite data set. But we know that the number of mints producing coins in the Roman Empire was finite. So that's not a reasonable prior (though of course you might sometimes be able to get away with using it anyway).
Do Bayesians like Bayesian model Averaging?

If there are 3 competing models, then ideally you can make a larger model where each submodel is realized by specific parameter combinations.

If M2 is simply M1 with an extra parameter b2, then you should have a stronger prior on b2 being zero in M2. If M3 is M1 with one parameter transformed, then you should have a parameter interpolating between this transformation, so you can learn that between 40-90% interpolation describes the data better.

If it's impossible to translate between models like this then you can do model averaging, but it's a sign of you not understanding your data.

2 Radford Neal 6mo: Yes, this is usually the right approach - use a single, more complex, model that has the various models you were considering as special cases. It's likely that the best parameters of this extended model won't actually turn out to be one of the special cases. (But note that this approach doesn't necessarily eliminate the need for careful consideration of the prior, since unwise priors for a single complex model can also cause problems.)

However, there are some situations where discrete models make sense. For instance, you might be analysing old Roman coins, and be unsure whether they were all minted in one mint, or in two (or three, ...) different mints. There aren't really any intermediate possibilities between one mint or two. Or you might be studying inheritance of two genes, and be considering two models in which they are either on the same chromosome or on different chromosomes.
2 [anonymous] 6mo: Ahhh... that makes a lot of sense. Thank you!
Do Bayesians like Bayesian model Averaging?

You are correct, we have to assume a model, just like we have to assume a prior. And strictly speaking the model is wrong and the prior is wrong :). But we can calculate how well the posterior predictive describes the data to get a feel for how bad our model is :)

2 [anonymous] 6mo: Ignoring the practical problems of Bayesian model averaging, isn't assuming that either M1, M2, or M3 is true better than assuming that some model M is true? So Bayesian model averaging is always better, right (if it is practically possible)?
Do Bayesians like Bayesian model Averaging?

I am a little confused by what x is in your statement, and by why you think we can't compute the likelihood or posterior predictive. In most real problems we can't compute the posterior, but we can draw from it and thus approximate it via MCMC.

2 [anonymous] 6mo: Sorry! Bad notation... What I meant was that we can't compute the conditional posterior predictive density p(ỹ | x̃, D), where D = {(x1, y1), …, (xn, yn)}. We can compute p(ỹ | x̃, D, M), where M is some model, approximately using MCMC by drawing samples from the parameter space of M, i.e. we can approximate the integral below using MCMC:

p(ỹ | x̃, D, M) = ∫_{θ∈Θ} p(ỹ | x̃, M, θ) p(θ | M, D) dθ

where Θ is the parameter space of M. But the quantity that we are interested in is p(ỹ | x̃, D), not p(ỹ | x̃, D, M) for a specific model, i.e. we need to marginalise over the unknown model. How can we do this?
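The Monte Carlo approximation of the integral for a single model M can be sketched as follows. This is a toy illustration under assumptions that are not in the thread: there are no covariates, the model is y ~ N(θ, 1) with a N(0, 10²) prior on θ, and exact draws from the conjugate posterior stand in for MCMC output.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
y = rng.normal(loc=2.0, scale=1.0, size=30)  # simulated data D, illustrative only

# Toy model M: y ~ N(theta, 1) with prior theta ~ N(0, 10^2).
prior_var = 10.0 ** 2
post_var = 1.0 / (1.0 / prior_var + y.size)   # conjugate posterior variance
post_mean = post_var * np.sum(y)              # conjugate posterior mean (prior mean is 0)

# Stand-in for MCMC output: draws from p(theta | M, D).
theta = rng.normal(post_mean, np.sqrt(post_var), size=100_000)

# Monte Carlo estimate of the integral: average p(y_new | M, theta) over the draws.
y_new = 2.5
mc_density = float(np.mean(stats.norm.pdf(y_new, loc=theta, scale=1.0)))

# Closed form for this conjugate model, used only as a check.
exact_density = float(stats.norm.pdf(y_new, loc=post_mean, scale=np.sqrt(1.0 + post_var)))
print(mc_density, exact_density)
```

The two numbers should agree closely; with real MCMC output only the averaging step would be used, since no closed form is usually available.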
Do Bayesians like Bayesian model Averaging?

I agree with Radford Neal: model averaging and Bayes factors are very sensitive to the prior specification of the models. If you absolutely have to do model averaging, methods such as PSIS-LOO or WAIC that focus on the predictive distribution are much better. If you had two identical models where one simply had a 10 times broader uniform prior, then their posterior predictive distributions would be identical but their Bayes factor would be 1/10, so a model average (assuming a uniform prior on p(M_i)) would favor the narrow prior by a factor of 10, where the predictiv... (read more)
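The prior-width example can be checked numerically. A minimal sketch, assuming a toy model that is not in the comment: data y_i ~ N(θ, 1) with a uniform prior on θ that is either 10 or 100 units wide. The data are simulated, and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=1.0, scale=1.0, size=20)  # simulated data, illustrative only

def log_likelihood(theta):
    """Log-likelihood of y ~ N(theta, 1) for an array of theta values."""
    resid = y[:, None] - theta[None, :]
    return -0.5 * np.sum(resid ** 2, axis=0) - 0.5 * y.size * np.log(2 * np.pi)

def evidence(width, n_grid=200_001):
    """Marginal likelihood under a Uniform(-width/2, width/2) prior on theta."""
    theta = np.linspace(-width / 2, width / 2, n_grid)
    dtheta = theta[1] - theta[0]
    # Riemann sum of likelihood times prior density (1 / width).
    return float(np.sum(np.exp(log_likelihood(theta))) * dtheta / width)

# Evidence ratio of the narrow-prior model to the broad-prior model.
bayes_factor = evidence(10.0) / evidence(100.0)
print(bayes_factor)
```

Both priors are flat over the region where the likelihood is non-negligible, so the posteriors are essentially the same, yet the evidence ratio should come out close to 10 in favor of the narrow prior.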

2 [anonymous] 6mo: That seems to be a bit of a conundrum: we need p(y|x,D) but we can't compute it? If we can't compute p(y|x,D), then what hope is there for statistics?
Jaynesian interpretation - How does “estimating probabilities” make sense?

I am one of those people with a half-baked epistemology and understanding of probability theory, and I am looking forward to reading Jaynes. I agree there are a lot of ad hocisms in probability theory, which means everything is wrong in the logic sense as some of the assumptions are broken, but a solid modern Bayesian approach has far fewer ad hocisms and also teaches you to build advanced models in less than 400 pages.

HMC is a sampling approach to solving the posterior which in practice is superior to analytical methods, because it actually accounts for ... (read more)

Jaynesian interpretation - How does “estimating probabilities” make sense?

I think the above is accurate. 


I disagree with the last part, but it has two sources of confusion:

  1. Frequentist vs Bayesian is in principle about priors, but in practice about point estimates vs distributions
    1. Good Frequentists use distributions and bad Bayesians use point estimates such as Bayes Factors, a good review is this is
  2. But the leap from theta to the probability of heads is, I think, an intuitive leap that happens to be correct but is unjustified.


Philosophically then the post... (read more)

2[anonymous]6mo Excellent! One final point that I would like to add is that if we say “theta is a physical quantity s.t. [...]”, we are faced with an ontological question: “does a physical quantity exist with these properties?”. I recently found out about Professor Jaynes’ A_p distribution idea (it is introduced in chapter 18 of his book) from Maxwell Peterson in the sub-thread below, and I believe it is an elegant workaround to this problem. It leads to the same results but is more satisfying philosophically. This is how it would work in the coin flipping example: Define A(u) to be a function that maps from real numbers to propositions, with domain [0, 1], s.t.

  1. The set of propositions {A(u): 0 <= u <= 1} is mutually exclusive and exhaustive
  2. P(y=1 | A(u)) = u and P(y=0 | A(u)) = 1 - u

Because the set of propositions is mutually exclusive and exhaustive, there is exactly one u s.t. A(u) is true, and for any v != u, A(v) is false. We call this unique value of u theta. It follows that P(y=1 | theta) = theta and P(y=0 | theta) = 1 - theta, and we use this to calculate the posterior predictive distribution
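As a worked step, the posterior predictive under this construction can be written out as follows. This is a sketch under stated assumptions: the flips are conditionally independent given A(u), $p(u \mid D)$ denotes the posterior density over u, and the closed form on the second line additionally assumes a uniform prior over u with h heads observed in n flips (h and n are symbols introduced here for illustration).

$$
P(y_{n+1} = 1 \mid D) = \int_0^1 P(y_{n+1} = 1 \mid A(u))\, p(u \mid D)\, du = \int_0^1 u\, p(u \mid D)\, du = \mathbb{E}[\theta \mid D]
$$

$$
p(u \mid D) \propto u^{h} (1 - u)^{n - h}
\quad\Longrightarrow\quad
P(y_{n+1} = 1 \mid D) = \frac{h + 1}{n + 2}
$$

The second line is Laplace's rule of succession: the predictive probability of heads is the posterior mean of theta, not theta itself.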
Jaynesian interpretation - How does “estimating probabilities” make sense?

Regarding reading Jaynes, my understanding is that it's good for intuition but bad for applied statistics, because it does not teach you modern Bayesian stuff such as WAIC and HMC, so you should first do one of the applied books. I also think Jaynes has nothing about causality.

1[anonymous]6mo I’m afraid I have to disagree. I do sometimes regret not focusing more on applied Bayesian inference. (In fact, I have no idea what WAIC or HMC is.) But in my defence, I am an amateur analytical philosopher and logician, and I couldn’t help finding more non sequiturs in classical expositions of probability theory than plot holes in Tolkien novels. Perhaps if I had been more naive and less critical (no offence to anyone) when I read those books, I would have “progressed” faster. I had lost hope of understanding probability theory before I read Professor Jaynes’ book; that’s why I respect the man so much. Now I have the intuition, but I am still trying to reconcile it with what I read in the applied literature. I sometimes find it frustrating that I am worrying about the philosophical nuances and intricacies of probability theory while others are applying their (perhaps less coherent) understanding of it to solve problems, but I strongly believe it is worth it :)
Jaynesian interpretation - How does “estimating probabilities” make sense?

Given 1. your model and 2. the magical assumption of no uncertainty in theta, then it's theta. The posterior predictive allows us to jump from inference about parameters to inference about new data; it's a distribution over y (coin flip outcomes), not over theta (which describes the frequency).
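To make the distinction concrete, here is a minimal sketch using only the Python standard library. It assumes a Beta(1, 1) prior and hypothetical data of 7 heads and 3 tails; the function names are illustrative, not from any library. It compares predictions for 10 future flips when theta is treated as known (a plug-in binomial) with the posterior predictive that integrates over the uncertainty in theta (a beta-binomial).

```python
from math import comb, exp, lgamma

def log_beta(a, b):
    # Log of the Beta function B(a, b).
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def plug_in_predictive(k, m, theta):
    # Binomial pmf: probability of k heads in m flips with theta known exactly.
    return comb(m, k) * theta**k * (1 - theta) ** (m - k)

def posterior_predictive(k, m, a, b):
    # Beta-binomial pmf: theta integrated out under a Beta(a, b) posterior.
    return comb(m, k) * exp(log_beta(a + k, b + m - k) - log_beta(a, b))

def variance(pmf):
    # Variance of the number of heads under a pmf given as a list.
    mean = sum(k * p for k, p in enumerate(pmf))
    return sum((k - mean) ** 2 * p for k, p in enumerate(pmf))

heads, tails = 7, 3                      # hypothetical observed flips
a_post, b_post = 1 + heads, 1 + tails    # posterior under an assumed Beta(1, 1) prior
theta_hat = a_post / (a_post + b_post)   # posterior mean of theta

m = 10  # number of future flips to predict
plug_in = [plug_in_predictive(k, m, theta_hat) for k in range(m + 1)]
predictive = [posterior_predictive(k, m, a_post, b_post) for k in range(m + 1)]

# For a single future flip the two agree: P(heads) equals the posterior mean.
next_flip_heads = posterior_predictive(1, 1, a_post, b_post)

print(f"P(next flip is heads): {next_flip_heads:.4f}")
print(f"variance, theta treated as known: {variance(plug_in):.3f}")
print(f"variance, posterior predictive:   {variance(predictive):.3f}")
```

For one flip both approaches give the same probability of heads, but over several flips the posterior predictive is wider because it carries the remaining uncertainty about theta.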

2[anonymous]6mo I think I have finally got it. I would like to thank you once again for all your help; I really appreciate it. This is what I think “estimating the probability” means: We define theta to be a real-world/objective/physical quantity s.t. P(H|theta=alpha) = alpha & P(T|theta=alpha) = 1 - alpha. We do not talk about the nature of this quantity theta because we do not care what it is. I don’t think it is appropriate to say that theta is “frequency” for this reason: 1. “frequency” is not a well-defined physical quantity. You can’t measure “frequency” like you measure temperature. But we do not need to argue about this, as theta being “frequency” is unnecessary. Using the above definitions, we can compute the likelihood, then the posterior, and then the posterior predictive, which represents the probability of heads in the next flip given data from previous flips. Is the above accurate? So Bayesians who say that theta is the probability of heads, compute a point estimate of the parameter theta, and say that they have “estimated the probability” are just frequentists in disguise?