vae — LessWrong

vae2mo10

Neat post, thank you! Also, you seem to have posted this twice.

vae2mo10

Maybe! But I would expect him to change his view to something like this in case you managed to persuade him that there is some crucial flaw in Bayesianism. While your goal seems to be to propagate the toolbox-view as the only valid approach. So you might as well engage with a stronger version of law-view right now.

Well, maybe, I don't know. As it stands it just seems best to argue against what he has said at his word then to assume otherwise, though, insofar as other people take this view at face value. Though, if such a thing does come about, I would of course have to write a different post. This could be some part of LessWrong culture that I am just ignorant of, though, so apologies.

And then I assign equiprobable prior between these models and start collecting experimental data - see how well all of them perform. Do I understand correctly, that Walker considers such approach incoherent?

It depends on what you mean by see how well all of them perform. In this situation, where you can easily get a reasonably small set of models that might represent your total uncertainty, and then (crucially) obtain whatever estimates or uncertainties you desire by updating the posterior of the complete model (including these sub-models - i. e. must be marginalized over M).

To a Bayesian, this is simply the uniquely-identified distribution function which represents your uncertainty about these parameters - no other function represents this and any other probability represents some other belief. This of course includes the procedure of maximizing some data score (i. e. finding an empirical 'best model'), which would be something like $P (p_{1}, p_{2}, \dots, p_{6} | X, M = m_{i}^{*})$ where $m_{i}^{*} = {arg}_{m \in M} max T (X, m)$ in which T is some model evaluation score (possibly just the posterior probability of the model).

This seems like a very artificial thing to report as your uncertainty about these parameters and essentially guarantees that your uncertainties will be underestimated - among other things, there is no guarantee that such a procedure follows the likelihood principle (for most measures of model correctness other than something in proportion to the posterior probability of each model, but if you have those at hand you might as well just marginalize over them), and by the uniqueness part of Cox's proof it will break one of the presuppositions there (and therefore likely fall pray to some Dutch Book). Therefore, if you consider these the reasons to be Bayesian, to be a Bayesian and do this seems ill-fated.

Of course, he and I and most reasonable people accept that this could be a useful practical device in many contexts, but it is indeed incoherent in the formal sense of the word.

But why would the prior, capturing all your information about a setting, be sufficiently far off from the true value, in the first place?

Well, you told me to grab you the simplest possible example, not the most cogent or common or general one!

But, well, regardless if you're looking for this happening in practice, this complaint is sometimes levied at Bayesian versions of machine learning models where it is especially hard to convincingly supply priors for humongous quantities of parameters. Here's some random example I found of this happening in a much less complex situation. This is all beside the point, though, which is again that there is no guarantee that I know of for Bayesians to be finite-sample calibrated AT ALL. There are asymptotic arguments, and there are approximate arguments, but there is simply no property which says that even an ideal Bayesian (under the subjectivist viewpoint) will be approximately calibrated.

Note that I also have no reason to assume the prior must be sufficiently close to the true value - since a prior for a subjectivist is just an old posterior, this is tantamount to just assuming a Bayesian is not only consistent but well on his way to asymptotic-land, so the cart is put before the horse.

Interesting! (...) "so the question is still open".

Few Frequentists believe that there is an absolute principle to which an estimate is the unique best one, no, but this doesn't make the question 'still open', just only as open as any other statistical question (in that, until proven otherwise, you could always pose a method which is just as good as the original in one way or another while also being better in some other dimension). 1/2 seems hard to beat, though, unless you specify some asymmetrical loss function instead (which seems like an appropriate amount of openness to me, in that in those situations you obviously do want to use a different estimate).

though everyone is in agreement that Bayes theorem naturally follows from the axioms of probability theory.

Yes, of course, but from this does not follow that any particular bit of thought can be represented as a posterior obtained from some prior. Clearly, whatever thought might be in my head there could easily just not follow the axioms of probability, and frankly I think this is almost certain. Maybe there does exist some decent representation of these sorts of practical mental problems such as these, but I would have to see it to believe it. Not only that, but I am doubtful of the value of supposing that whatever ideal this thought ought to aspire to achieve is a Bayesian one (thus the content of the post - consider that a representation of this practical problem is another formally nonparametric one, in that the ideal list of mathematical properties must be comically large - if I am to assume some smaller space, I am implicitly claiming a probability of zero on rather a lot of mathematical laws which could be good which I cannot conceive of immediately, which seems horrible as an ideal.)

Contra Yudkowsky's Ideal Bayesian

vae2mo20

Thank you for spotting that.

Contra Yudkowsky's Ideal Bayesian

vae2mo*10

Is it? Certainly Yudkowksy takes a lot of inspiration from Jaynes but I don't remember him beating on this particular drum. Of course there were arguments as to why being Bayesian is correct from a philosophical perspective, just not so much that everything must be approximating it. Though, it's been years since I read him, so I could be wrong.

Contra Yudkowsky's Ideal Bayesian

vae2mo*10

You can have a law-view interpretation of such synthesis where we conceptualize Bayesianism as an imperfect approximation, a special case of the True Law, which should also capture all the good insights of Frequentism.

Yes, but such an interpretation falls outside of Yudkowsky's view, as I understand it (for example on that X thread in another comment in this post, and his comments on other statistics topics I've seen around - could fish for the quotes, but I'm a bit held up at this precise moment), which is what that I'm focusing on here.

On Walker, on that paragraph he is criticizing the specific (common) practice of comparing separate Bayesian models and picking the best (via ratios or errors or some such) when there is uncertainty about the truth instead of appropriately representing this uncertainty about your sampling model in the prior.

Rolling a die is a bit of a nifty example here since it's the case where you assign a separate probability to each label in the sample space, so that your likelihood is in fact fully general, which is where the idea for a Dirichlet prior comes from in an attempt to generalize this notion of covering all possible models for less trivial problems. In the rest of the intro, Walker points to Bayesians fitting models with different likelihoods (i. e., Weibull v. s. Lognormal, I think he is a survival guy), each with their own "inner" priors, comparing them against each other and then picking the one which is "best" as incoherent, since picking a prior just to compare posteriors on some frequentist property like error or coverage is not an accurate representation of your prior uncertainty (instead, he wants you to pick some nonparametric model).

On Bayesian finite-sample miscalibration, simply pick a prior which is sufficiently far off from the true value and your predictive intervals will be very bad for a long time (you may check by simulation on some conjugate model). This is a contrived example, of course, but this happens on some level all the time, since Bayesian methods make no promise of finite-sample calibration - your prediction intervals just reflect belief, not what future data might be (in practice, I've heard people complain about this in Bayesian machine learning type situations). Of course, asymptotically and under some regularity conditions you will be calibrated, but one would rather be right before then. If you want finite-sample calibration, you have to look for a method which promises it. In terms of the coverage of credible intervals more generally, though, unless you want to be in the throes of asymptotics, you'd have to pick what is called a matching prior, which again seems in conflict with subjectivist information input

On minimax: I don't know how to format math in this site on a phone, so I will be a bit terse here, but the proof is very simple (I thought). in statistics, when we call an estimator "minimax" it means that it minimizes the maximum risk, which is the expectation of the loss over the distribution of the estimator. Since we have no data, all estimators are some constant c. The expectation of the loss with is just the loss with respect to the parameter (i. e. (c-p)^2). Clearly the maxima are taken when p is 0 or 1, so minimize the maximum of c^2, (1-c)^2, which has a minimum at c=0.5. Which is to say, 1/2 has this nice Frequentist property, which is how one could justify it.

On your last comment, it seems like a bit of an open question to attribute the existence of practical intuition and reasoning about mathematical constructs like this to a Bayesian prior updating process. Certainly I reason, and I change my mind, but to me personally I see no reason to imagine this was Bayesian in some way (or that those thoughts were expressed in credence-probabilities which I shifted by conditioning on a type of sense-data), nor that I would be ideally doing this instead. But, I suppose such a thing could be possible.

Contra Yudkowsky's Ideal Bayesian

vae2mo*20

I think you've misunderstood the point somewhat. On the question of 'taking the best from both', that is what Yudkowsky calls a "tool" view, whereas I'm trying to argue against his view of its status as a necessary law of correct reasoning (see my other comment for related points). Insofar as you acknowledge that Frequentists can produce good answers independently of there existing some Bayesian prior-likelihood combination they must be approximating, we agree.

Still, the problem of 'taking the best from both' from a philosophically Bayesian view is that it is incoherent - you can't procedurally pick a method which isn't derived through a Bayes update of your prior beliefs without incurring in all the Dutch book/axiom-breaking behaviours coherence is supposed to insure against.

There is more responsibility on the Bayesian: she gets more out in the form of a posterior distribution on the object of interest. Hence more care needs to be taken in what gets put into the model in the first place. For the posterior to mean anything it must be representing genuine posterior beliefs, solely derived by a combination of the data and prior beliefs via the use of the Bayes theorem. Hence, the prior used must genuinely represent prior beliefs (beliefs without data). If it does not, how can the posterior represent posterior beliefs? So a “prior” that has been selected post data via some check and test from a set of possible “prior” distributions cannot represent genuine prior beliefs. This is obvious, since no one of these “priors” can genuinely represent prior beliefs. The posterior distributions based on such a practice are meaningless.
- Stephen J. Walker (first chapter of Bayesian Nonparametrics)

Not only that, but insofar as you 'want both' (finite-sample) calibration and coherence, you are called to abandon one or the other - Insofar as there are Bayesian methods that can get you the former, they are not derived by prior distributions that represent your knowledge of the world (if they even exist in general, anyway - not something I know of).

On your query about coins, 1/2 is minimax for the squared error, I believe. But, on a more fundamental level, at least to me most of the point of being Frequentist is to believe in no unique and nontrivial optimal framework for reasoning (i. e. not a hodgepodge of principle-less methods), that there are only good properties which a method can or can't obtain.

Contra Yudkowsky's Ideal Bayesian

vae2mo20

Hello! Thank you for the comment, these are good points.

I do not consider myself a Rationalist nor know much of anything about Yudkowsky's more current positions on this subject, but I probably should have mentioned somewhere in the post that this article was partly motivated by this discussion on X, and his comment. I must admit I do not really grasp what he is gesturing towards with the point he makes there, but it seems like he still believes some version of the original point as stated.
This post is not about Bayesian inference as practiced by mortal statistical workers; I have other reasons to justify my Frequentism there, but I wrote this so as to eschew the "Tool vs. Law" distinction that seems to be sometimes drawn here. Of course, Bayesian methods in statistics are sometimes useful (it's hard to justify a hierarchical model without reference to conditioning, "H-likelihood" feels like the sort of post-hoc methodological loop-the-loop that I criticize Bayesians for), and I have used them myself here and there. I am very interested to hear what methods you derived through Bayesian thinking which are not equivalent to a Frequentist estimate, though!
I agree with you here, almost completely - it just doesn't seem like what Yudkowsky is saying. To wit:

I have been justly chastised by the discussants for spreading alarm about the health of the body Bayesian. Certainly it has held together more successfully than any other theory of statistical inference, and I am not predicting its imminent demise. But no human creation is completely perfect, and we should not avert our eyes from its deficiencies. In its solipsistic satisfaction with the psychological self-consistency of the schizophrenic statistician, it runs the risk of failing to say anything useful about the world outside.

(Though I would personally add that, even though it's probably the best unifying principle in statistics, there is no need to adhere to any such general principle when there are better alternatives.)

This is one of those practical questions which I tried to avoid here (maybe I should just write a separate Frequentism post eventually), but yes, I agree, and would characterize this as probably the biggest advantage of Bayesian methods in practice - that they are "plug-and-play", that if you specify a minimally sensible model you have strong guarantees (in nice, parametric, problems) that your answers will be sensible too.

I imagine this is why they are most often seen in fields like astrophysics, where you don't want to seek out the best methods for really complicated physical models, you just want something that works well without having to worry. Still, the comparative strength of Frequentism is being able to specify and more directly obtain exactly what you want, sometimes optimally. An easy example is exact finite-sample calibration: if I want my predictions to be calibrated (and there are many situations in which I do), the methods which will guarantee that I get this will involve conformal inference method or the like. I don't have to wrangle a prior which matches this or hope everything works out. Other examples are, say, robustness, or in experiment design.

You comment on assumptions, here, but in my opinion you have it backwards - if your Bayesian model handles non i.i.d-ness well, this is because the dependency shows up in the likelihood, which (say) the MLE still handles quite well (vaguely asymptotically efficient and so on). What if you want to be distribution-free, or want to check if your answers are robust to your model being wrong in some directions? Maybe there will be better Bayesian answers here someday, statistics generally is a young field, but (in practice) I think the Frequentists just take the cake on this one.
This is again correct, of course, but I am specifically criticizing the essence of Yudkowsky's point of "if it's any good, it must be approximating a Bayesian answer": who's approximating who? Here it seems much more sensible to say that we have a good answer (the OLS estimate), one that we have reasons to prefer in some scenarios (e. g. Gauss-Markov, general distribution-free niceness) that a Bayesian method, strictly speaking, can only approximate, and which seems at odds with a pure subjectivist point of view (because the prior is incoherent, but this is much more salient in the Cox model example). Indeed in practice this is irrelevant.

Contra Yudkowsky's Ideal Bayesian

vae2mo30

Yep - this is the standard term for the property (e. g. in that Seidenfeld paper).

The LessWrong 2018 Book is Available for Pre-order

vae5y*30

[This comment is no longer endorsed by its author]Reply

LESSWRONG
LW

LESSWRONG
LW

Posts

Wikitag Contributions

Comments