It is a damn shame to hear of this tiredness, and I hope that your mood improves, somehow, somewhen, hopefully sooner than you expect.
This reply, though, I am forced to say, does not quite persuade me, and to be totally frank it even disappoints me a little. It was my understanding that one of MIRI's goals, and to some extent one of yours, is currently public outreach and communication (as a sub-goal of policy change) - this was at least how I understood the recent tweet describing what MIRI is doing and why people should donate to it, as well as other things you've been doing fairly recently: going on a bunch of podcasts and interviews, plus smaller things such as splitting off a 'low-volume' public persona account for reach alongside a shitpost-y side one.
Therefore, to put it somewhat bluntly, I thought that thinking deeply and being maximally deliberate about what you communicate and how - in particular, about how well it might move policy or the public - was, if not quite the whole idea, at least a main goal or job of your organization and indeed of your public persona. So, though of course a great many allowances are to be made for the tiredness, I don't understand how to square the idea that
you don't really have an option of continuing to hear from a polite Eliezer; I'd just stop talking instead.
with what you and MIRI are currently, well, for. You said, and I totally understand why, that the plan is to get other people to 'take over' that role, but this doesn't make the situation any less bad, just hopefully more temporary. Is it truly such an all-or-nothing thing, that you'd abandon such an important part of MIRI's goals outright rather than trying to learn to be (as other commenters have put it) less abrasive?
I hope it is not too rude to say I wish you'd come around to a different attitude on this at some point, because as it stands it seems somewhat contradictory and self-defeating.
Hmm. I'm sorry to have to say this, but I think this post is not very good.
In summary, I would say that this article discusses a set of very common, well-trodden issues (Bayesian model checking/expansion/selection - broadly speaking, problems of an insufficient 'model space') in a very nontechnical way (a fine approach on its own, except that the topic is never referenced or called by its established name anywhere; everything is presented as if it were a novel first-principles contribution), and it proposes the most wishy-washy solution to the problem (just ignoring it) without reference to any alternatives and without really discussing the disadvantages of this approach.
To go line by line: first, though the title and framing of this article are about priors, as you yourself sort of point out, the prior isn't really what's changing here, but the likelihood (and the part of the prior that interacts with it). You can consider yourself to have had, in theory, a larger prior that you just weren't aware you should be considering. So the problem this is dealing with is the classic "I don't actually believe my likelihood is correct" problem, for which there are a billion proposed solutions that you don't really touch on for some reason (to name a few: "M-open" analyses, everything in the Gelman BDA chapter on model expansion, nonparametric methods - which, to be fair, you hint at with 'infinite-dimensional' - etc.).
To briefly explain some of these: one classic approach is to consider your first attempt to be some model $M_1$ with some prior probability, your next one an $M_2$, and so on, with your final posterior an average of the posteriors under each, weighted by prior model credences (see below). In this specific case, you could probably nest both of the likelihoods into a single model somehow. The most common applied approach, as many have pointed out, is to just split the data and do whatever exploratory analyses you want on the first chunk, or some such approach.
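In symbols, this is standard Bayesian model averaging (my notation, nothing specific to the post): with models $M_1, \dots, M_K$ and prior credences $p(M_k)$,

$$p(\theta \mid x) = \sum_{k=1}^{K} p(\theta \mid x, M_k)\, p(M_k \mid x), \qquad p(M_k \mid x) \propto p(x \mid M_k)\, p(M_k),$$

so no single model is ever "chosen"; the model index is just another unknown that gets marginalized out.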
The problem with this classic solving-by-not-solving approach is that, well, you are not being an actual Bayesian anymore; if your (technical) prior depends on the observed data, your procedure no longer follows the laws of probability: it loses the backing of Cox's theorem, carries no guarantees about Dutch books, and does not yield decision-optimal rules (roughly the same reasons orthodox Bayesians reject empirical-Bayes methods, which are a non-Bayesian half-solution to this very issue, kind of). So, well, I don't see the point.
There are good ways and bad ways to be approximately Bayesian, and this particular one seems no good to me, at least without further argument (especially when avoiding it is so easy and common - just data-split!). Double-counting methods always look good when you are in a thought experiment, doubling down on what you already know to be the right answer; the problem is that they double down on the wrong answers for no reason, too. It seems perfectly reasonable to me to admit that 50/50 posterior after seeing just a little bit of data, let some model refinement happen, and have your posterior upon seeing the full data look more 'reasonable'.
So, I don't know. These are good (classic) questions, but I am forced to disagree with the notion that this is a good answer.
Neat post, thank you! Also, you seem to have posted this twice.
Maybe! But I would expect him to change his view to something like this in case you managed to persuade him that there is some crucial flaw in Bayesianism. While your goal seems to be to propagate the toolbox-view as the only valid approach. So you might as well engage with a stronger version of law-view right now.
Well, maybe, I don't know. As it stands, it seems best to argue against what he has actually said, at his word, rather than to assume otherwise, insofar as other people take this view at face value. If such a change of view does come about, I would of course have to write a different post. This could be some part of LessWrong culture that I am just ignorant of, though, so apologies.
And then I assign equiprobable prior between these models and start collecting experimental data - see how well all of them perform. Do I understand correctly, that Walker considers such approach incoherent?
It depends on what you mean by 'see how well all of them perform'. In this situation, you can easily get a reasonably small set of models that together represent your total uncertainty, and then (crucially) obtain whatever estimates or uncertainties you desire by updating the posterior of the complete model that includes these sub-models - i.e. $M$ must be marginalized over, as in $p(\theta \mid x) = \sum_M p(\theta \mid x, M)\, p(M \mid x)$. That procedure is perfectly coherent.
To a Bayesian, this is simply the uniquely-identified distribution function which represents your uncertainty about these parameters - no other function represents this, and any other probability represents some other belief. This of course includes the procedure of maximizing some data score (i.e. finding an empirical 'best model'), which would be something like $p(\theta \mid x, \hat{M})$ where $\hat{M} = \arg\max_M T(M, x)$, in which $T$ is some model evaluation score (possibly just the posterior probability of the model).
This seems like a very artificial thing to report as your uncertainty about these parameters, and it essentially guarantees that your uncertainties will be underestimated - among other things, there is no guarantee that such a procedure follows the likelihood principle (for most measures of model correctness other than something proportional to the posterior probability of each model; but if you have those at hand, you might as well just marginalize over them), and by the uniqueness part of Cox's proof it will break one of the presuppositions there (and therefore likely fall prey to some Dutch book). So, if you consider these the reasons to be Bayesian, being a Bayesian and doing this seems ill-fated.
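A tiny illustration of the underestimation, in case it helps - everything here (the two candidate sampling models, the data) is made up by me for the demo, and it only tracks predictive spread:

```python
import numpy as np

# Two candidate sampling models for x, N(0,1) and N(1,1), with equal prior
# credence. Compare the predictive spread you report when you (a) marginalize
# over the model index versus (b) condition on the a-posteriori "best" model.
rng = np.random.default_rng(0)
x = rng.normal(0.4, 1.0, size=5)   # truth sits between the two candidates
mus = np.array([0.0, 1.0])

# Posterior model probabilities p(M_k | x), up to a shared constant.
loglik = np.array([-0.5 * ((x - m) ** 2).sum() for m in mus])
w = np.exp(loglik - loglik.max())
w /= w.sum()

# (a) Marginalized predictive: a mixture of N(mu_k, 1) with weights w.
mean_mix = w @ mus
var_mix = 1.0 + w @ (mus - mean_mix) ** 2   # law of total variance

# (b) Best-model predictive: just N(mu_khat, 1).
var_best = 1.0

print(w, var_mix, var_best)  # var_mix > var_best whenever both weights are > 0
```

The selected-model variance can never exceed the marginalized one here, which is the sense in which selection understates your uncertainty.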
Of course, he and I and most reasonable people accept that this could be a useful practical device in many contexts, but it is indeed incoherent in the formal sense of the word.
But why would the prior, capturing all your information about a setting, be sufficiently far off from the true value, in the first place?
Well, you told me to grab you the simplest possible example, not the most cogent or common or general one!
But, well, regardless: if you're looking for this happening in practice, this complaint is sometimes levied at Bayesian versions of machine learning models, where it is especially hard to convincingly supply priors for humongous quantities of parameters. Here's some random example I found of this happening in a much less complex situation. This is all beside the point, though, which is again that there is no guarantee that I know of for Bayesians to be finite-sample calibrated AT ALL. There are asymptotic arguments, and there are approximate arguments, but there is simply no property which says that even an ideal Bayesian (under the subjectivist viewpoint) will be approximately calibrated.
Note that I also have no reason to assume the prior must be sufficiently close to the true value - since a prior for a subjectivist is just an old posterior, this is tantamount to just assuming a Bayesian is not only consistent but well on his way to asymptotic-land, so the cart is put before the horse.
Interesting! (...) "so the question is still open".
Few Frequentists believe that there is an absolute principle by which an estimate is the unique best one, no, but this doesn't make the question 'still open' - just as open as any other statistical question (in that, until proven otherwise, you could always pose a method which is just as good as the original in one way or another while also being better in some other dimension). 1/2 seems hard to beat, though, unless you specify some asymmetrical loss function instead (which seems like an appropriate amount of openness to me, in that in those situations you obviously do want to use a different estimate).
though everyone is in agreement that Bayes theorem naturally follows from the axioms of probability theory.
Yes, of course, but it does not follow from this that any particular bit of thought can be represented as a posterior obtained from some prior. Clearly, whatever thought might be in my head could easily just not follow the axioms of probability, and frankly I think that is almost certain. Maybe there does exist some decent representation of practical mental problems such as these, but I would have to see it to believe it. Beyond that, I am doubtful of the value of supposing that whatever ideal this thought ought to aspire to is a Bayesian one (thus the content of the post - consider that one representation of this practical problem is another formally nonparametric one, in that the ideal list of mathematical properties must be comically large; if I am to assume some smaller space, I am implicitly claiming a probability of zero on rather a lot of possibly-good mathematical laws which I cannot conceive of immediately, which seems horrible as an ideal).
Thank you for spotting that.
Is it? Certainly Yudkowsky takes a lot of inspiration from Jaynes, but I don't remember him beating on this particular drum. Of course there were arguments as to why being Bayesian is correct from a philosophical perspective, just not so much that everything must be approximating it. Though, it's been years since I read him, so I could be wrong.
You can have a law-view interpretation of such synthesis where we conceptualize Bayesianism as an imperfect approximation, a special case of the True Law, which should also capture all the good insights of Frequentism.
Yes, but such an interpretation falls outside of Yudkowsky's view as I understand it (for example in that X thread in another comment on this post, and in his comments on other statistics topics I've seen around - I could fish for the quotes, but I'm a bit held up at this precise moment), which is what I'm focusing on here.
On Walker: in that paragraph he is criticizing the specific (common) practice of comparing separate Bayesian models and picking the best (via ratios or errors or some such) when there is uncertainty about the truth, instead of appropriately representing this uncertainty about your sampling model in the prior.
Rolling a die is a bit of a special example here, since it's the case where you assign a separate probability to each label in the sample space, so that your likelihood is in fact fully general; this is where the idea of a Dirichlet prior comes from, as an attempt to generalize this notion of covering all possible models to less trivial problems. In the rest of the intro, Walker points to Bayesians fitting models with different likelihoods (e.g. Weibull vs. Lognormal - I think he is a survival guy), each with its own "inner" prior, comparing them against each other, and then picking whichever is "best", and calls this incoherent: picking a prior just to compare posteriors on some frequentist property like error or coverage is not an accurate representation of your prior uncertainty (instead, he wants you to pick some nonparametric model).
On Bayesian finite-sample miscalibration: simply pick a prior which is sufficiently far from the true value and your predictive intervals will be very bad for a long time (you can check by simulation on some conjugate model - see the sketch below). This is a contrived example, of course, but it happens on some level all the time, since Bayesian methods make no promise of finite-sample calibration - your prediction intervals just reflect belief, not what future data might be (in practice, I've heard people complain about this in Bayesian machine learning type situations). Of course, asymptotically and under some regularity conditions you will be calibrated, but one would rather be calibrated before then. If you want finite-sample calibration, you have to look for a method which promises it. In terms of the coverage of credible intervals more generally, though, unless you want to be in the throes of asymptotics, you'd have to pick what is called a matching prior, which again seems in conflict with subjectivist information input.
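Here is one such check - a minimal sketch, with the conjugate normal-normal model and all the numbers picked arbitrarily by me for the demonstration:

```python
import numpy as np

# Normal data with known variance 1, conjugate normal prior on the mean,
# prior deliberately centered far from the truth. How often does the central
# 90% credible interval for the mean actually cover it after n observations?
rng = np.random.default_rng(0)
true_mu = 5.0
prior_mu, prior_sd = -5.0, 1.0    # prior far from true_mu
n, reps, z90 = 10, 5000, 1.6449   # 1.6449 = 95th percentile of N(0,1)

covered = 0
for _ in range(reps):
    x = rng.normal(true_mu, 1.0, size=n)
    # Conjugate update: precision-weighted combination of prior and data.
    post_prec = 1.0 / prior_sd**2 + n
    post_mu = (prior_mu / prior_sd**2 + x.sum()) / post_prec
    post_sd = post_prec**-0.5
    covered += abs(post_mu - true_mu) <= z90 * post_sd

print(covered / reps)  # far below the nominal 0.90 at this sample size
```

For small n the intervals almost never cover; only as n grows does the data wash the prior out and the coverage creep back toward nominal.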
On minimax: the proof is very simple, I thought. In statistics, when we call an estimator "minimax", we mean that it minimizes the maximum risk, where risk is the expectation of the loss over the sampling distribution. Since we have no data, every estimator is some constant c, and the risk of a constant estimator is just the loss itself, i.e. (c-p)^2. Clearly the maximum over p is attained at p = 0 or p = 1, so we minimize the larger of c^2 and (1-c)^2, which has its minimum at c = 0.5. Which is to say, 1/2 has this nice Frequentist property, which is how one could justify it.
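Written out (squared-error loss, no data, so an estimator of $p \in [0,1]$ is just a constant $c$):

$$R(c, p) = (c - p)^2, \qquad \sup_{p \in [0,1]} (c - p)^2 = \max\{c^2,\ (1-c)^2\},$$

and $\max\{c^2, (1-c)^2\}$ is smallest where the two branches meet, at $c = 1/2$, with minimax risk $1/4$.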
On your last comment: it seems like a bit of an open question whether the existence of practical intuition and reasoning about mathematical constructs like this can be attributed to a Bayesian prior-updating process. Certainly I reason, and I change my mind, but I personally see no reason to imagine this was Bayesian in some way (or that those thoughts were expressed in credence-probabilities which I shifted by conditioning on a type of sense-data), nor that I would ideally be doing this instead. But, I suppose such a thing could be possible.
I think you've misunderstood the point somewhat. On the question of 'taking the best from both', that is what Yudkowsky calls a "tool" view, whereas I'm trying to argue against his view of its status as a necessary law of correct reasoning (see my other comment for related points). Insofar as you acknowledge that Frequentists can produce good answers independently of there existing some Bayesian prior-likelihood combination they must be approximating, we agree.
Still, the problem with 'taking the best from both' from a philosophically Bayesian view is that it is incoherent - you can't procedurally pick a method which isn't derived through a Bayes update of your prior beliefs without incurring all the Dutch-book/axiom-breaking behaviours coherence is supposed to insure against.
There is more responsibility on the Bayesian: she gets more out in the form of a posterior distribution on the object of interest. Hence more care needs to be taken in what gets put into the model in the first place. For the posterior to mean anything it must be representing genuine posterior beliefs, solely derived by a combination of the data and prior beliefs via the use of the Bayes theorem. Hence, the prior used must genuinely represent prior beliefs (beliefs without data). If it does not, how can the posterior represent posterior beliefs? So a “prior” that has been selected post data via some check and test from a set of possible “prior” distributions cannot represent genuine prior beliefs. This is obvious, since no one of these “priors” can genuinely represent prior beliefs. The posterior distributions based on such a practice are meaningless.
- Stephen G. Walker (first chapter of Bayesian Nonparametrics)
Not only that, but insofar as you 'want both' (finite-sample) calibration and coherence, you are called to abandon one or the other - insofar as there are Bayesian methods that can get you the former, they are not derived from prior distributions that represent your knowledge of the world (if they even exist in general, anyway - not something I know of).
On your query about coins: 1/2 is minimax for the squared error, I believe. But, on a more fundamental level, at least to me most of the point of being Frequentist is to believe in no unique and nontrivial optimal framework for reasoning (which is not the same as a hodgepodge of principle-less methods) - there are only good properties which a method can or can't have.
Hello! Thank you for the comment, these are good points.
I have been justly chastised by the discussants for spreading alarm about the health of the body Bayesian. Certainly it has held together more successfully than any other theory of statistical inference, and I am not predicting its imminent demise. But no human creation is completely perfect, and we should not avert our eyes from its deficiencies. In its solipsistic satisfaction with the psychological self-consistency of the schizophrenic statistician, it runs the risk of failing to say anything useful about the world outside.
(Though I would personally add that, even though it's probably the best unifying principle in statistics, there is no need to adhere to any such general principle when there are better alternatives.)
Thanks for replying. Given that it's been a month, sadly, I don't fully remember all the details of why I wrote what I wrote in my initial comment, but I'll try to restate my objections more specifically so that you can see where they make contact with your post ("if I had more time, I would have written a shorter letter"). Forgive me if it was somehow hard to understand; English is my second language.
My first issue: the post is titled "Good if make prior after data instead of before". Yet the post's driving example is a situation where the (marginal) prior probability of the thing you're interested in doesn't actually change; instead, it is coupled to a larger model, with a larger probability space, where the likelihood differs at these different points. So what you're talking about isn't really post-hoc changes to the prior, but something like model expansion, as you write in the comment.
In the context of methodologies for Bayesian model expansion, there is a lot of controversy and much ink has been spilled, because being ad hoc and implicitly accepting a data-driven prior/selected model leads to incoherence: the decision procedure you then derive from it is not actually Bayesian, in the sense that it does not satisfy all the nice properties people expect of Bayesian decision rules and Bayesian reasoning; it just vaguely follows Bayes's rule for conditioning. When you write
you are sidestepping all of these issues (what I called "solving by not solving") and accepting incoherence as OK. And, well, this can be a fine approach - being approximately incoherent can be approximately no problem. But I think the post not only fails to address the negatives of this particular approach, positioning it as more or less the only thing you can reasonably do (which is in itself a sufficiently large problem), but also fails to consider any of the others (a classic objection to this type of methodology in a canonical introductory textbook, providing one of the alternatives I mentioned, is here, for example; the idea there is to have a model flexible and general enough that it can learn in essentially any situation. I mentioned other methods in the comment). Do you not see the incoherence of a data-driven prior as bad somehow?
To be clear, the other approach you consider - "never change your model/prior after seeing the data; even if your model makes no sense, your posterior is stuck as it is" - is also bad, for all the obvious model-misspecification reasons. But at the very least it is coherent (and, of course, by data-splitting you get to enjoy this coherency without being rigid, at the cost of a little data - so there's yet another approach, much less technical to explain than the nonparametric one mentioned earlier; see the sketch below). This is my main problem with the article, really: it proposes just this one idea among several without discussing its positives or negatives relative to any of the others.
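For concreteness, a minimal sketch of the data-splitting recipe (the toy model and all the numbers are mine, just for illustration):

```python
import numpy as np

# Explore on one chunk, then run the "official" Bayesian update only on the
# untouched chunk, so the prior you actually update was never conditioned on
# the data it gets updated on.
rng = np.random.default_rng(1)
data = rng.normal(2.0, 1.0, size=200)
explore, holdout = data[:50], data[50:]

# Exploratory phase: inspect `explore` however you like (plots, fits, model
# checks) and use it to settle on a model and prior. Here we just set a
# deliberately loose prior from the exploratory chunk.
prior_mu, prior_sd = explore.mean(), 2.0 * explore.std(ddof=1)

# Official inference: one conjugate update on the held-out data only
# (normal likelihood with known unit variance, for simplicity).
n = len(holdout)
post_prec = 1.0 / prior_sd**2 + n
post_mu = (prior_mu / prior_sd**2 + holdout.sum()) / post_prec
print(post_mu, post_prec**-0.5)
```

Nothing here double-counts: the exploratory chunk buys you model freedom, and the held-out chunk is seen exactly once by the update.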
My point about this article endorsing "double-counting" is that this approach (roughly summarized as "construct the model after seeing the data, pretending you haven't seen the data"), in comparison to either a nonparametric approach or some M-open idea like model mixing or stacking, will privilege the particular model which you happened to construct on the basis of the data more than a fully coherent theoretical approach would.
An easy way to see this is to imagine trying this approach while being knowledgeable about all possible models you could have picked (i.e. in model averaging terms - they wave at a similar critique of this idea in this other intro); in that representation, instead of observing the data and updating yourself toward the particular model which fits the data best, your method is to set one model's probability to 1 and all the others' to zero, which is a rather extreme version of double-counting[1].
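In the model-averaging notation from my earlier comment, the coherent update moves the weights $w_k(x) = p(M_k \mid x)$ smoothly with the evidence,

$$p(\theta \mid x) = \sum_{k} w_k(x)\, p(\theta \mid x, M_k),$$

whereas the construct-after-looking recipe amounts to forcing $w_{\hat{k}}(x) = 1$ for the chosen model and $w_k(x) = 0$ for every other, however slim the evidence separating them.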
So, in my perspective, a good version of this article would not talk about anything being "the only practical way to get good results"; it would situate this idea alongside all the other ones in this vein which have been discussed for decades, or at least gesture at the more common approaches that you consider sensible and think you can explain nontechnically (hopefully referring to them by name), and at the bare minimum it would lay out the pros and cons of what it advocates with more balance. Admittedly, that is a much harder article to write, because the issue is nuanced, and I would not immediately know how to write it nontechnically. But the issue really is nuanced, at least to me, and this level of simplification misleads more than it helps.
In the original comment I chose to talk more about how easy it is to make double-counting methods seem arbitrarily good by constructing examples where you know the truth in advance - of course a method looks better if it gets to the truth twice as fast, but double-counting also gets you doubly wrong when the data happens to be misleading. In hindsight, though, this objection seems kind of petty and irrelevant compared to the other ones.