Prediction Contest 2018: Scores and Retrospective

[-]gjm7y60

Are you familiar with Metaculus? It functions somewhat as a continuously running prediction contest along similar lines, and it has a lot more than three participants. (Including some people who are active on LW.)

[-]jbeshir7y70

PredictionBook itself has a bunch more than three participants and functions as an always-running contest for calibration, although it's easy to cheat, since you can make and resolve whatever predictions you want. I also participate in GJ Open, which has an eternally ongoing prediction contest. So there's stuff out there where people who want to compete on a running score can do so.

The objective of the contest was less to bring such an opportunity into existence than to see if it'd incentivise some people who had been "meaning" to practice prediction-making, and hadn't gotten to it yet, to do so on one of the platforms, by offering a kind of "reason to get around to it now"; the answer was no, though.

I don't participate much on Metaculus because, for my actual, non-contest prediction-making practice, I tend to favour predictions that resolve within about six weeks. The longer the time between prediction and resolution, the slower the iteration process on improving calibration: if I predict on 100 things that happen in four years, it takes four years for me to learn if I'm over- or underconfident at the 90% or so mark, and then another four years to learn if my reaction to that was an over- or under-reaction. Metaculus seems to favour predictions 2-4 or more years out, and requires sticking with private predictions to create your own short-term ones in number. That's interesting for getting a crowd read on the future, but doesn't offer me so much of an opportunity to iterate and improve. It's a nice project, though.

[-]wb7y10

lalaithion and jbeshir made predictions at the time of the announcement, while bendini and I made predictions at closing time, about two months later. This should have been a substantial advantage for us. Indeed, the fall in bitcoin from $9k to $6k helped us a lot. On many questions about whether something occurs (e.g., Fatah and Hamas reconcile) we should multiply the probability by about 6/8, because we were considering a 6-month span while they were considering an 8-month span. But I was systematically less confident than they were, and I think bendini was only about as confident as they were.
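To make the 6/8 adjustment concrete, here is a rough sketch. It compares the simple linear rescaling with a version that assumes the event has a constant chance of happening per unit time (a constant hazard rate). That assumption, and the function names, are mine for illustration; nothing in the contest used this code.

```python
def linear_rescale(p, original_months, new_months):
    """Scale a probability in proportion to the length of the window.

    A reasonable approximation when p is small.
    """
    return p * new_months / original_months


def constant_hazard_rescale(p, original_months, new_months):
    """Rescale P(event within original window) to a different window.

    Assumes (hypothetically) a constant hazard rate, so the chance of the
    event NOT happening decays exponentially with the window length.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability between 0 and 1")
    survival = 1.0 - p
    return 1.0 - survival ** (new_months / original_months)


if __name__ == "__main__":
    # A 20% forecast over 8 months, restated for a 6-month window.
    print(linear_rescale(0.2, 8, 6))
    print(constant_hazard_rescale(0.2, 8, 6))
```

For a 20% forecast the two versions give 0.15 and roughly 0.154, so multiplying by 6/8 is close for low probabilities. The gap grows as the original probability approaches 1, where the linear version understates the shorter-window probability.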

[-]wb7y10

What is the Moonbird algorithm? Why do you call it ML? Because the algorithm won't tell me much without the data used to train it?

The algorithm that first springs to mind is to treat every number of predictions separately and apply kNN (in logistic space?). Better: if there are N predictions, average over the kNN applied to every one of the 2^N subsets of the predictions. Maybe weight by how well trained the different lengths are.
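A minimal sketch of the basic version of that idea, in case it helps pin it down. It only compares a query against historical propositions that have the same number of predictions, measures distance in log-odds space, and returns the fraction of the k nearest that resolved true. The function names, the `history` format, the clipping constant, and the choice of Euclidean distance are all assumptions of mine, and the averaging over subsets is not shown.

```python
import math


def logit(p, eps=1e-6):
    """Convert a probability to log-odds, clipping to avoid infinities."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def knn_estimate(history, query, k=3):
    """Estimate P(true) for `query` from similar past propositions.

    history: list of (predictions, outcome) pairs, where predictions is a
             list of probabilities and outcome is True or False.
    query:   list of probabilities assigned to the new proposition.
    """
    query_log_odds = [logit(p) for p in query]

    # Treat each number of predictions separately.
    same_length = [(preds, outcome) for preds, outcome in history
                   if len(preds) == len(query)]
    if not same_length:
        raise ValueError("no historical propositions with this many predictions")

    def distance(preds):
        return math.dist(query_log_odds, [logit(p) for p in preds])

    nearest = sorted(same_length, key=lambda item: distance(item[0]))[:k]
    return sum(1 for _, outcome in nearest if outcome) / len(nearest)
```

With a history of resolved propositions in that format, `knn_estimate(history, [0.9, 0.9], k=3)` would return the share of the three most similar two-prediction propositions that came true.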

Tetlock tells us that although individuals are overconfident, crowds are underconfident, so once we've averaged, we should shift away from 0.5. This helps a bit in this case, but Moonbird does a lot better. I guess it's increasing confidence when the crowd agrees and decreasing confidence when the crowd disagrees.
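One common way to do that shift is to average the probabilities and then scale the result in log-odds space, which pushes it away from 0.5. A rough sketch, where the function name and the `factor` value are placeholders of mine rather than anything tuned on PredictionBook or the contest data:

```python
import math


def extremized_mean(probabilities, factor=1.5, eps=1e-6):
    """Average a crowd's probabilities, then push the result away from 0.5.

    factor > 1 makes the aggregate more confident; factor == 1 returns the
    plain mean. The default of 1.5 is an arbitrary illustrative value.
    """
    if not probabilities:
        raise ValueError("need at least one probability")
    mean_p = sum(probabilities) / len(probabilities)
    # Clip so that a mean of exactly 0 or 1 does not produce infinite log-odds.
    mean_p = min(max(mean_p, eps), 1.0 - eps)
    log_odds = math.log(mean_p / (1.0 - mean_p))
    return 1.0 / (1.0 + math.exp(-factor * log_odds))
```

A mean of exactly 0.5 stays at 0.5 whatever the factor, and a fixed factor applies the same shift regardless of how much the crowd agrees, so this cannot reproduce the agreement-sensitive behaviour guessed at above.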

[-]jbeshir7y20

It's not a novel algorithm type, just a project I did while learning ML frameworks: a fairly simple LSTM + one dense layer, trained on the predictions + resolutions of about 60% of the resolved predictions from PredictionBook as of September last year (which doesn't include any of the ones in the contest). The remaining resolved predictions were used for cross-validation or set aside as a test set. An even simpler RNN is only very slightly worse, though.

The details of how the algorithm works are thus somewhat opaque, but from observing the way it reacts to input, it seems to lean on the average, weight predictions later in the sequence more heavily (so order matters), and get more confident with the number of predictions, while treating propositions with only one probability assignment as probably being heavily overconfident. It seems to have more or less learnt the insight Tetlock pointed out on its own. Disagreement might also matter to it, not sure.
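To illustrate why order can matter to a recurrent model at all, here is a toy single-unit RNN forward pass in plain Python. This is not the Moonbird model: the weights are made up rather than trained, there is one hidden unit instead of an LSTM layer, and all names are hypothetical. It only shows the general shape, a sequence of probability assignments in and one probability out.

```python
import math

# Made-up weights for illustration only; a real model would learn these.
INPUT_WEIGHT = 2.0
RECURRENT_WEIGHT = 0.5
OUTPUT_WEIGHT = 3.0


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def toy_rnn_predict(probabilities):
    """Run a one-unit recurrent network over a sequence of probabilities.

    Each step mixes the current input (centred on 0.5) with a decayed copy
    of the previous hidden state, so later inputs have more influence on
    the final state than earlier ones.
    """
    hidden = 0.0
    for p in probabilities:
        hidden = math.tanh(INPUT_WEIGHT * (p - 0.5) + RECURRENT_WEIGHT * hidden)
    # The "dense layer" here is a single weight followed by a sigmoid.
    return sigmoid(OUTPUT_WEIGHT * hidden)


if __name__ == "__main__":
    print(toy_rnn_predict([0.1, 0.9]))  # later prediction is the confident "yes"
    print(toy_rnn_predict([0.9, 0.1]))  # later prediction is the confident "no"
```

With these weights the same two assignments in opposite order land on opposite sides of 0.5, because the last input dominates the hidden state. A plain average would give 0.5 for both.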

It's on GitHub at https://github.com/jbeshir/moonbird-predictor-keras; this doesn't include the data, which I downloaded using https://github.com/jbeshir/predictionbook-extractor. It's not particularly tidy though, and still includes a lot of unused functionality for input features (the words of the proposition, the time between a probability assignment and the due time, etc.), which I didn't end up using because the dataset was too small for it to learn any signal in them.

I'm currently working on making the online frontend to the model automatically retrain the model at intervals using freshly resolved predictions, mostly for practice building a simple "online" ML system before I move on to trying to build things with more practical application.

The main reason I ran figures for it against the contest was that some of its individual confidences seemed strange to me, and while the cross-validation stuff was saying it was good, I was suspicious I was getting something wrong in the process.


The Results