Noting the obvious connection to Goodhart's law - and elsewhere I've described the mistake of pushing to maximize easy-to-measure / cognitively available items rather than true goals.
Yeah, that's true. I don't recall exactly what I was thinking. Perhaps it was regarding time-weighting, and the difficulty of seeing what your score will be based on what you predict - but the Metaculus interface handles this well, modulo early closings, which screw lots of things up. Also, log-scoring is tricky when you have both continuous and binary outcomes, since they don't give similar measures - being well calibrated for binary events isn't "worth" as much, which seems perverse in many ways.
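To make the scale mismatch concrete, here is a minimal sketch, assuming a plain logarithmic score rather than Metaculus's actual scoring formula. A binary log score is capped at 0, while the log density used for a continuous forecast can exceed 0 when the forecast is sharp. The function names and numbers are illustrative assumptions.

```python
import math

def binary_log_score(p_yes, outcome):
    """Log of the probability assigned to what happened; never above 0."""
    return math.log(p_yes if outcome else 1.0 - p_yes)

def normal_log_density(x, mean, sd):
    """Log of a normal density at x; positive whenever the density exceeds 1."""
    return -0.5 * math.log(2 * math.pi * sd ** 2) - (x - mean) ** 2 / (2 * sd ** 2)

# Hypothetical forecasts: a confident, correct binary call versus a sharp,
# correct continuous forecast on a quantity measured in small units.
best_binary = binary_log_score(0.99, True)
sharp_continuous = normal_log_density(0.0, 0.0, 0.05)

print(best_binary, sharp_continuous)
```

Under these assumptions the binary score is just below 0 and the continuous one is about 2.08, so adding the two kinds of score together weights them very differently.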
In many cases, yes. But for some events, the "obvious" answers are not fully clear until well after the event in question takes place - elections, for example.
About 20% of Americans develop skin cancer during their lifetime, and the 5-year overall survival rate for melanoma is over 90 percent. Taking the complement of that (under 10 percent) as the mortality risk, i.e. ignoring timing and varied risk levels, it's roughly a 2% risk of (eventual) death.
But risk of skin cancer depends on far more than sun exposure - and the more important determinant is frequency of sunbathing below age 30. Other factors that seem to matter are skin color, skin response (how much you burn), weight, and family history of cancers.
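The back-of-the-envelope number above can be reproduced as follows. This is a rough sketch that reuses the same simplifications: the melanoma survival rate is applied to all skin cancers, and timing and individual risk factors are ignored. The variable names are mine.

```python
# Figures quoted in the comment above; both are rounded.
lifetime_incidence = 0.20   # share of Americans who develop skin cancer
five_year_survival = 0.90   # melanoma 5-year survival ("over 90 percent")

# Treat everyone who does not survive 5 years as an eventual death.
mortality_given_cancer = 1.0 - five_year_survival
lifetime_death_risk = lifetime_incidence * mortality_given_cancer

print(f"{lifetime_death_risk:.1%}")
```

Since survival is quoted as "over" 90 percent, the result is better read as an upper bound of about 2%.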
re: "Get this wrong" versus "the balance should be better," there are two different things that are being discussed. The first is about defining individual questions via clear resolution criteria, which I think is done well, and the second is about defining clear principles that provide context and inform what types of questions and resolution criteria are considered good form.
A question like "will Democrats pass H.R.2280 and receive 51 votes in the Senate" is very well defined, but super-narrow, and easily resolved "incorrectly" if the bill is incorporated into another bill, or if an adapted bill is proposed by a moderate Republican and passes instead, or passed via some other method, or if it passes but gets vetoed by Biden. But it isn't an unclear question, and given the current way that Metaculus is run, would probably be the best way of phrasing the question. Still, it's a sub-par question, given the principles I mentioned. A better one would be "Will a bill such as H.R.2280 limiting or banning straw purchases of firearms be passed by the current Congress and enacted?" It's much less well defined, but the boundaries are very different. It also uses "passed" and "enacted", which have gray areas. At the same time, the failure modes are closer to the ones that we care about near the boundary of the question. However, given the current system, this question is obviously worse - it's harder to resolve, it's more likely to be ambiguous because a bill that does only some of the thing we care about is passed, etc.
Still, I agree that the boundaries here are tricky, and I'd love to think more about how to do this better.
I haven't said, and I don't think, that the majority of markets and prediction sites get this wrong. I think they navigate this without a clear framework, which I think the post begins providing. And I strongly agree that there isn't a slam-dunk-no-questions case for principles overriding rules, which the intro might have implied too strongly. I also agree with your point about downsides of ambiguity potentially overriding the benefits of greater fidelity to the intent of a question, and brought it up in the post. Still, excessive focus on making rules on the front end, especially for longer-term questions and ones where the contours are unclear, rather than explicitly being adaptive, is not universally helpful. And clarifications that need to change the resolution criteria mid-way are due to either bad questions, or badly handled resolutions. At the same time, while there are times that avoiding ambiguity is beneficial, there are also times when explicitly addressing corner cases to make them unambiguous ("if the data is discontinued or the method is changed, the final value posted using the current method will be used") makes the question worse, rather than better. Lastly, I agree that one general point I didn't say, but agree with, was that "where the spirit and letter of a question conflict, the question should be resolved based on the spirit." I mostly didn't make an explicit case for this because I think it's under-specified as a claim. Instead, the three more specific claims I would make are:
1) When the wording of a question seems ambiguous, the intent should be an overriding reason to choose an interpretation.
2) When the wording of a question is clear, the intent shouldn't change the resolution.
As an aside, I find it bizarre that Economics gets put at 9 - I think a review of what gets done in top econ journals would cause you to update that number down by at least 1. (It's not usually very bad, but it's often mostly useless.) And I think it's clear that lots of Econ does, in fact, have a replication crisis. (But we'll see if that is true as some of the newer replication projects actually come out with results.)
Generally agree that there's something interesting here, but I'm still skeptical that in most prediction market cases there would be enough money across questions, and enough variance in probabilities, for this to work well.
For betting markets, the market maker may need to manage the odds differently, and for prediction markets, it's because otherwise you're paying people in lower Brier scores for watching the games, rather than being good predictors beforehand. (The way that time-weighted Brier scores work is tricky - you could get it right, but in practice it seems that last minute failures to update are fairly heavily penalized.)
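A minimal sketch of the "paid for watching the game" effect, assuming the simplest possible time-weighting: each forecast's squared error is weighted by how long it stood before the question closed. This is an assumption for illustration, not the scoring rule of any particular site, and the function name and numbers are hypothetical.

```python
def time_weighted_brier(forecasts, close_time, outcome):
    """forecasts: list of (time, probability), sorted by time.

    Each forecast stands until the next one, or until close_time.
    Returns the duration-weighted mean squared error (lower is better).
    """
    start = forecasts[0][0]
    total = 0.0
    for i, (t, p) in enumerate(forecasts):
        end = forecasts[i + 1][0] if i + 1 < len(forecasts) else close_time
        total += (p - outcome) ** 2 * (end - t)
    return total / (close_time - start)

# Both forecasters start at 70% on an event that happens (outcome = 1).
# The "watcher" jumps to 99% late, once the game is nearly decided.
steady = time_weighted_brier([(0.0, 0.7)], close_time=10.0, outcome=1)
watcher = time_weighted_brier([(0.0, 0.7), (8.0, 0.99)], close_time=10.0, outcome=1)

print(steady, watcher)
```

With these numbers the steady forecaster scores about 0.09 and the watcher about 0.072, even though both made the same call beforehand. Closing the question before the game starts removes that gap.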
That's good to hear. But if "he started at 60%," that seems to mean if he "still thinks dark matter is overwhelmingly likely" he is updating in the wrong direction. (Perhaps he thought it was 60% likely that the LHC found dark matter? In which case I still think that he should update away from "overwhelmingly likely" - it's weak evidence against the hypothesis, but unless he started out almost certain, "overwhelmingly" seems to go a bit too far.)
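The direction of the update can be checked with Bayes' rule in odds form. In this sketch the 60% prior comes from the comment above, while the likelihood ratio of 0.8 is a made-up stand-in for "weak evidence against", not a number anyone has claimed.

```python
def bayes_update(prior, likelihood_ratio):
    """Posterior probability from a prior and a likelihood ratio.

    likelihood_ratio = P(evidence | hypothesis) / P(evidence | not hypothesis);
    values below 1 count as evidence against the hypothesis.
    """
    prior_odds = prior / (1.0 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1.0 + posterior_odds)

prior = 0.60                 # starting credence quoted in the comment
weak_evidence_against = 0.8  # hypothetical likelihood ratio
posterior = bayes_update(prior, weak_evidence_against)

print(round(posterior, 3))
```

Any likelihood ratio below 1 pulls the posterior below 60% (about 0.545 here), so ending up at "overwhelmingly likely" would require evidence pointing the other way.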