I think my point is different, although I have to admit I don't entirely grasp your objection to Nostalgebraist's objection. I think Nostalgebraist's point about rules being gameable does overlap with my example of multi-agent systems, because clear-but-only-approximately-correct rules are exploitable. But I don't think my argument is about it being hard to identify legitimate exceptions. In fact, astrophysicists would have no difficulty identifying when it's the right time to stop using Newtonian gravity.
But my point with the physics analogy is that sometimes, even if you actually know the correct rule, and even if that rule is simple (Navier-Stokes is still just one equation), you still might accomplish a lot more by using approximations and just remembering when they start to break down.
That's because Occam's-razor-simple rules like "to build a successful business, just turn a huge profit!" or "air is perfectly described by this one-line equation!" can be very hard to turn into specific new business plans or airplane designs, or even to use for making predictions about existing ones.
I guess a better example is: the various flavours of utilitarianism each convert complex moral judgements into simple, universal rules to maximize various measures of utility. But even with a firm belief in utilitarianism, you could still be stumped about the right action in any particular dilemma, just because it might be really hard to calculate the utility of each option. In this case, you don't feel like you've reached an "exception" to utilitarianism at all -- you still believe in the underlying principle -- but you might find it easier to make decisions using an approximation like "try not to kill anybody", until you reach edge-cases where that might break down, like in a war zone.
You might not even know if eating a cookie will increase or decrease your utility, so you stick to an approximation like "I'm on a diet" to simplify your decision-making until you reach an exception like "this is a really delicious-looking / unusually healthy cookie", at which point you decide it's worth dropping the approximation and reaching for the deeper rules of utilitarianism to make your choice.
In spirit I agree with "the real rules have no exceptions". I believe this applies to physics just as well as it applies to decision-making.
But, while the foundational rules of physics are simple and legible, the physics of many particles -- which is what you need for managing real-world situations -- includes emergent behaviours like fluid drag and turbulence. The notoriously complex behaviour of fluids can be usefully compressed into rules that are simple enough to remember and apply, such as inviscid or incompressible flow approximations, or tables of drag coefficients. But these simple rules are built on top of massively complex ones like the Navier-Stokes equation (which is itself still a simplifying assumption over quantum physics and relativity).
It is useful to remember that the equations of incompressible flow are not foundational and so will have exceptions, or else you will overconfidently predict that nobody can fly supersonic airplanes. But that doesn't mean you should discard those simplified rules when you reach an exception and proceed to always use Navier-Stokes, because the real rules might simply be too hard to apply the rest of the time and give the same answer anyway, to three significant figures. It might just be easier in practice to remember the exceptions.
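To make the "same answer to three significant figures" point concrete, here is a minimal sketch. It does not solve Navier-Stokes; as a stand-in for the "more accurate rule" it uses the standard isentropic ideal-gas relation for stagnation pressure, and compares it with the incompressible Bernoulli estimate. The function names and the choice of gamma = 1.4 for air are my own assumptions for illustration.

```python
GAMMA = 1.4  # assumed ratio of specific heats for air (ideal diatomic gas)


def stagnation_ratio_compressible(mach):
    """Stagnation/static pressure ratio from the isentropic ideal-gas relation."""
    return (1.0 + 0.5 * (GAMMA - 1.0) * mach**2) ** (GAMMA / (GAMMA - 1.0))


def stagnation_ratio_incompressible(mach):
    """Same ratio from incompressible Bernoulli, p0 = p + rho*v^2/2,
    rewritten in terms of Mach number using a^2 = gamma * p / rho."""
    return 1.0 + 0.5 * GAMMA * mach**2


def relative_error(mach):
    """How far the simple rule is from the more accurate one."""
    reference = stagnation_ratio_compressible(mach)
    return abs(stagnation_ratio_incompressible(mach) - reference) / reference


for mach in (0.1, 0.3, 0.8):
    print(f"Mach {mach}: relative error {relative_error(mach):.3%}")
```

At low Mach numbers the two rules are practically indistinguishable, and the gap grows to a few percent as the flow approaches the speed of sound, which is roughly where you would want to remember the exception.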
Hence, when making predictive models, even astrophysicists will think of gravity in terms of "stars move according to Newton's inverse square law, except when dealing with black holes or gravitational lensing". They know that it's really relativity under the hood, but only draw on that when they know it's necessary.
OK, that's enough of an analogy. When might this happen in real life?
One case could be multi-agent, anti-inductive systems... like managing a company. As soon as anyone identifies a complete and compact formula for running a successful business, it either goes horrifyingly wrong, or the competitive landscape adapts to nullify it, or else it turns out to be too vague a rule to yield concrete actions ("Successful businesses will aim to turn a profit").
You're welcome! And I'm sorry if I went a little overboard. I didn't mean it to sound confrontational.
X and ~X will always receive the same score by both the logarithmic and least-squares scoring rules that I described in my post, although I certainly agree that the logarithm is a better measure. If you dispute that point, please provide a numerical example.
Because of the 1/N factor outside the sum, doubling the predictions does not affect your calibration score (as it shouldn't!). This factor is necessary; without it, your score would only ever get worse the more predictions you make, regardless of how good they are. Thus, including X and ~X in the enumeration neither hurts nor helps your calibration score, whether you use the log rule or the least-squares rule.
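Here is a small numerical check of that claim. It is only a sketch: I'm assuming the log rule is the average of ln(probability assigned to what actually happened) and the least-squares rule is the average squared difference between the stated probability and the 0/1 outcome, both with the 1/N factor. The function names and the sample predictions are made up for illustration.

```python
import math


def log_score(predictions):
    """Average log of the probability assigned to the actual outcome."""
    total = sum(math.log(p if happened else 1.0 - p) for p, happened in predictions)
    return total / len(predictions)


def squared_error_score(predictions):
    """Average squared gap between stated probability and the 0/1 outcome."""
    total = sum((p - (1.0 if happened else 0.0)) ** 2 for p, happened in predictions)
    return total / len(predictions)


def with_complements(predictions):
    """Add '~X with probability 1-P' for every 'X with probability P'."""
    return predictions + [(1.0 - p, not happened) for p, happened in predictions]


# Hypothetical predictions: (stated probability, did it come true?)
original = [(0.9, True), (0.7, False), (0.5, True), (0.6, True)]
doubled = with_complements(original)

print(log_score(original), log_score(doubled))
print(squared_error_score(original), squared_error_score(doubled))
```

Each complement contributes exactly the same term as the prediction it mirrors, so the sum doubles, N doubles, and the average is unchanged.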
I agree that eyeballing a calibration graph is no good either. That was precisely the point I made with the lottery ticket example in the main post, where the prediction score is lousy but the graph looks perfect.
I agree that there's no magic in the scoring rule. Doubling predictions is unnecessary for practical purposes; the reason I detail it here is to make a very important point about how calibration works in principle. This point needed to be made, in order to address the severe confusion that was apparent in the Slate Star Codex comment threads, because there was widespread disagreement about what exactly happens at 50%.
I think we both agree that there should be no controversy about this -- however, go ahead and read through the SSC thread to see how many absurd solutions were being proposed! That's what this post is responding to! What is made clear by enumerating both X and ~X in the bookkeeping of predictions -- a move to which there is no possible objection, because it is no different from the original prediction, nor does it affect a proper score in any way -- is that there is no reason to treat 50% as though it has special properties that are different from 50.01%, and there's certainly no reason to think that there is any significance to the choice between writing "X, with probability P" and "~X, with probability 1-P", even when P=50%.
If you still object to doubling the predictions, you can instead choose to take Scott's predictions and replace every X with ~X, and every P with 1-P. Do you agree that this new set should be just as representative of Scott's calibration as his original prediction set?
The calibration you get, by the way, will be better represented by the fact that if you assigned 50% to the candidate that lost, then you'll necessarily have assigned a very low probability to the candidate that won, and that will be the penalty that will tell you your calibration is wrong.
The problem is the definition of more specific. How do you define specific? The only consistent definition I can think of is that a proposition A is more specific than B if the prior probability of A is smaller than that of B. Do you have a way to consistently tell whether one phrasing of a proposition is more or less specific than another?
By that definition, if you have 10 candidates and no information to distinguish them, then the prior for any candidate to win is 10%. Then you can say "A: Candidate X will win" is more specific than "~A: Candidate X will not win", because P(A) = 10% and P(~A) = 90%.
The proposition "A with probability P" is the exact same claim as the proposition "~A with probability 1-P". Since they are the same proposition, there is no consistent definition of "specific" that will let one phrasing be more specific than the other when P = 50%.
"Candidate X will win the election" is only more specific than "Candidate X will not win the election" if you think that it's more likely that Candidate X will not win.
For example, by your standard, which of these claims feels more specific to you?
A: Trump will win the 2016 Republican nomination
B: One of either Scott Alexander or Eliezer Yudkowsky will win the 2016 Republican nomination
If you agree that "more specific" means "less probable", then B is a more specific claim than A, even though there are twice as many people to choose from in B.
Which of these phrasings is more specific?
C: The winner of the 2016 Republican nomination will be a current member of the Republican party (membership: 30.1 million)
~C: The winner of the 2016 Republican nomination will not be a current member of the Republican party (non-membership: 7.1 billion, or 289 million if you only count Americans).
The phrasing "C" certainly specifies a smaller number of people, but I think most people would agree that ~C is much less probable, since all of the top-polling candidates are party members. Which phrasing is more specific by your standard?
If you have 10 candidates, it might seem more specific to phrase a proposition as "Candidate X will win the election with probability 50%" than "Candidate X will not win the election with probability 50%". That intuition comes from the fact that an uninformed prior assigns them all 10% probability, so a claim that any individual one will win feels more specific in some way. But actually the specificity comes from the fact that if you claim 50% probability for one candidate when the uninformed prior was 10%, you must have access to some information about the candidates that allows you to be so confident. This will be properly captured by the log scoring rule; if you really do have such information, then you'll get a better score by claiming 50% probability for the one most likely to win rather than 10% for each.
Ultimately, the way you get information about your calibration is by seeing how well your full probability distribution about the odds of each candidate performs against reality. One will win, nine will lose, and the larger the probability mass you put on the winner, the better you do. Calibration is about seeing how well your beliefs score against reality; if your score depends on which of two logically equivalent phrasings you choose to express the same beliefs, there is some fundamental inconsistency in your scoring rule.
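As a rough illustration of the ten-candidate case, here is a sketch of how the log rule treats a full distribution. The two distributions and the function names are my own hypothetical choices: "uniform" is the uninformed 10% prior, and "informed" puts 50% on one candidate and spreads the rest evenly.

```python
import math


def log_score(distribution, winner):
    """Log of the probability mass placed on the candidate who actually won."""
    return math.log(distribution[winner])


def expected_log_score(beliefs, true_odds):
    """Average log score you would earn if true_odds were the real chances."""
    return sum(q * math.log(p) for p, q in zip(beliefs, true_odds) if q > 0)


uniform = [0.1] * 10                 # uninformed prior
informed = [0.5] + [0.5 / 9] * 9     # hypothetical: 50% on candidate 0

# If candidate 0 really does have a 50% chance, claiming it pays off.
print(expected_log_score(informed, informed))
print(expected_log_score(uniform, informed))

# If the race really is a ten-way toss-up, the confident claim is penalized.
print(expected_log_score(informed, uniform))
print(expected_log_score(uniform, uniform))
```

The score depends only on the mass you put on the winner, so it cannot change depending on whether you phrased a belief as "X will win" or "X will not win".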
Yes, but only because I don't agree that there was any useful information that could have been obtained in the first place.
I don't understand why there is so much resistance to the idea that stating "X with probability P(X)" also implies "~X with probability 1-P(X)". The point of assigning probabilities to a prediction is that it represents your state of belief. Both statements uniquely specify the same state of belief, so to treat them differently based on which one you wrote down is irrational. Once you accept that these are the same statement, the conclusion in my post is inevitable, the mirror symmetry of the calibration curve becomes obvious, and given that symmetry, all lines must pass through the point (0.5,0.5).
Imagine the following conversation:
A: "I predict with 50% certainty that Trump will not win the nomination".
B: "So, you think there's a 50% chance that he will?"
A: "No, I didn't say that. I said there's a 50% chance that he won't."
B: "But you sort of did say it. You said the logically equivalent thing."
A: "I said the logically equivalent thing, yes, but I said one and I left the other unsaid."
B: "So if I believe there's only a 10% chance Trump will win, is there any doubt that I believe there's a 90% chance he won't?"
A: "Of course, nobody would disagree. If you said there's a 10% chance Trump will win, then you must also believe that there's a 90% chance that he won't. Unless you think there's some probability that he both will and will not win, which is absurd."
B: "So if my state of belief that there's a 10% chance of A necessarily implies I also believe a 90% chance of ~A, then what is the difference between stating one or the other?"
A: "Well, everyone agrees that makes sense for 90% and 10% confidence. It's only for 50% confidence that the rules are different and it matters which one you don't say."
B: "What about for 50.000001% and 49.999999%?"
A: "Of course, naturally, that's just like 90% and 10%."
B: "So what's magic about 50%?"
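The mirror symmetry can also be checked directly. This is a minimal sketch with made-up predictions and my own function names: it records every "X with probability P" together with its implied "~X with probability 1-P", then builds the calibration curve as the fraction of predictions in each bucket that came true. Percentages are kept as integers so that the buckets match exactly.

```python
from collections import defaultdict


def with_complements(predictions):
    """Add '~X at (100 - P)%' for every 'X at P%'."""
    return predictions + [(100 - percent, not came_true) for percent, came_true in predictions]


def calibration_curve(predictions):
    """Map each stated percent to the fraction of those predictions that came true."""
    buckets = defaultdict(list)
    for percent, came_true in predictions:
        buckets[percent].append(came_true)
    return {percent: sum(results) / len(results) for percent, results in buckets.items()}


# Hypothetical predictions: (stated percent, did it come true?)
stated = [
    (90, True), (90, True), (90, False),
    (70, True), (70, False),
    (50, True), (50, True), (50, False),
]

curve = calibration_curve(with_complements(stated))
for percent in sorted(curve):
    # Observed frequency at P and at 100 - P always add up to 1.
    print(percent, curve[percent], curve[100 - percent])
```

In this bookkeeping the 50% bucket is its own mirror image, so its observed frequency is pinned at one half no matter how the individual predictions turned out.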
"Candidate X will win the election with 50% probability" also implies the proposition "Candidate X will not win the election with 50% probability". If you propose one, you are automatically proposing both, and one will inevitably turn out true and the other false.
If you want to represent your full probability distribution over 10 candidates, you can still represent it as binary predictions. It will look something like this:
Candidate 1 will win the election: 50% probability
Candidate 2 will win the election: 10% probability
Candidate 3 will win the election: 10% probability
Candidate 4 will win the election: 10% probability
Candidate 5 will win the election: 10% probability
Candidate 6 will win the election: 2% probability
Candidate 1 will not win the election: 50% probability
Candidate 2 will not win the election: 90% probability
The method described in my post handles this situation perfectly well. All of your 50% predictions will (necessarily) come true 50% of the time, but you rack up a good calibration score if you do well on the rest of the predictions.
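For the election case specifically, here is a sketch that enumerates every possible winner. The ten-candidate distribution is a hypothetical one in the same spirit as the list above (it is not a completion of that list), and the helper names are mine.

```python
def binary_predictions(percent_by_candidate, winner):
    """Return (stated percent, came true) for 'X will win' and 'X will not win'."""
    rows = []
    for candidate, percent in enumerate(percent_by_candidate):
        won = candidate == winner
        rows.append((percent, won))            # "Candidate will win"
        rows.append((100 - percent, not won))  # "Candidate will not win"
    return rows


def hit_rate(rows, percent):
    """Fraction of the predictions stated at this percent that came true."""
    outcomes = [came_true for stated, came_true in rows if stated == percent]
    return sum(outcomes) / len(outcomes)


# Hypothetical full distribution over ten candidates, in percent (sums to 100).
percents = [50, 10, 10, 10, 10, 2, 2, 2, 2, 2]

for winner in range(len(percents)):
    rows = binary_predictions(percents, winner)
    print(f"winner={winner}: 50% bucket hit rate = {hit_rate(rows, 50)}")
```

Whoever wins, exactly one of the two 50% statements comes true, so that bucket carries no information. Whether the forecast was any good shows up in the other buckets.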
I'm really not so sure what a frequentist would think. How would they express "Jeb Bush will not be the top-polling Republican candidate" in the form of a repeated random experiment?
It seems to me more likely that a frequentist would object to applying probabilities to such a statement.