Way back in April 2018, I announced a Prediction Contest, in which the person who made the best predictions on a bunch of questions on PredictionBook ahead of a 1st July deadline would win a prize after they all resolved in January 2019, which is now.
It was a bit of an experiment; I had no idea how many people were up for practicing predictions to try to improve their calibration, and decided to throw a little money and time at giving it a try. And in the spirit of reporting negative experimental results: The answer was 3, all of whom I greatly appreciate for their participation. I don't regret running the experiment, but I'm going to pass on running a Prediction Contest 2019. I don't think this necessarily rules out trying to practically test and compete in rationality-related areas in other ways later, though.
Our entrants were bendini, bw, and lalaithion, and their ranked log scores were:
This was sufficiently close that changing a single question's resolution could tip the results, so they were all pretty good. That said, bw came out ahead, and even managed to beat averaging everyone's predictions: if you simply took the average prediction (including non-entrants) as of the entry deadline and made that your prediction, you'd have got -9.576568147.
The full calculations for each of the log scores, as well as my own log score and the results of feeding the predictions as of prediction time to a simple model rather than simply averaging them, are in a spreadsheet here.
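For anyone who wants to reproduce this kind of scoring, here is a minimal sketch of how a log score and a simple averaged forecast can be computed. It assumes binary questions and natural logarithms; the function names and the example numbers are made up for illustration, and the spreadsheet remains the authoritative calculation.

```python
import math


def log_score(predictions, outcomes):
    """Sum of ln(probability assigned to what actually happened).

    predictions: probabilities that each question resolves true.
    outcomes: booleans, True if the question resolved true.
    Scores are always <= 0, and closer to 0 is better.
    """
    total = 0.0
    for p, happened in zip(predictions, outcomes):
        total += math.log(p if happened else 1.0 - p)
    return total


def average_forecast(forecasts_per_question):
    """Mean of everyone's probability on each question."""
    return [sum(ps) / len(ps) for ps in forecasts_per_question]


if __name__ == "__main__":
    # Illustrative numbers only, not contest data.
    crowd = [[0.9, 0.7, 0.8], [0.2, 0.4, 0.3]]
    resolved = [True, False]
    print(log_score(average_forecast(crowd), resolved))
```

Because the score is a sum over questions, a single confident miss can dominate it, which is why one changed resolution could have reordered results this close.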
I'll be in touch with bw to sort out their prize this evening, and thanks to everyone who participated and who helped with finding questions to use for it.
Are you familiar with Metaculus? That functions somewhat as a continuously-running prediction contest along similar lines, and it has a lot more than three participants. (Including some people who are active on LW.)
PredictionBook itself has a bunch more than three participants and functions as an always-running contest for calibration, although it's easy to cheat since it's possible to make and resolve whatever predictions you want. I also participate in GJ Open, which has an eternally ongoing prediction contest. So there's stuff out there where people who want to compete on running score can do so.
The objective of the contest was less to bring such an opportunity into existence than to see if it'd incentivise some people who had been "meaning" to practice prediction-making, and not gotten to it yet, to do so on one of the platforms, by offering a kind of "reason to get around to it now"; the answer was no, though.
I don't participate much on Metaculus because, for my actual, non-contest prediction-making practice, I tend to favour predictions that resolve within about six weeks. The longer the time between prediction and resolution, the slower the iteration process on improving calibration: if I predict on 100 things that happen in four years, it takes four years for me to learn if I'm over or under confident at the 90% or so mark, and then another four years for me to learn if my reaction to that was an over or under reaction. Metaculus seems to favour predictions 2-4 or more years out, and creating your own short term ones in number requires sticking with private predictions. That is interesting for getting a crowd read on the future, but doesn't offer me so much of an opportunity to iterate and improve. It's a nice project, though.
lalaithion and jbeshir made predictions at the time of the announcement, while bendini and I made predictions at closing time, about two months later. This should have been a substantial advantage for us. Indeed, the fall in bitcoin from $9k to $6k helped us a lot. On many questions about whether something occurs (e.g., Fatah and Hamas reconcile) we should multiply the probability by about 6/8, because we were considering a 6 month span while they were considering an 8 month span. But I was systematically less confident than they were, and I think bendini was only about as confident.
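As a rough check on that 6/8 rule of thumb, here is a small sketch that rescales a probability between time spans. It assumes the event has a constant chance of happening per unit time, which is my simplifying assumption and not something the contest questions guarantee; `rescale_probability` is a made-up helper name.

```python
def rescale_probability(p, from_months, to_months):
    """Convert P(event within from_months) to P(event within to_months).

    Assumes a constant hazard rate, so the chance of the event NOT
    happening compounds evenly over time.
    """
    return 1.0 - (1.0 - p) ** (to_months / from_months)


if __name__ == "__main__":
    for p in (0.05, 0.1, 0.5, 0.8):
        print(p, rescale_probability(p, 8, 6), p * 6 / 8)
```

For small probabilities the result is very close to simply multiplying by 6/8. For large probabilities the simple multiplication shrinks the forecast by more than the constant-hazard version does.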
What is the Moonbird algorithm? Why do you call it ML? Because the algorithm won't tell me much without the data used to train it?
The algorithm that first springs to mind is to treat every number of predictions separately and apply kNN (in logistic space?). Better: if there are N predictions, average over the kNN applied to every one of the 2^N subsets of the predictions. Maybe weight by how well trained the different lengths are.
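To make that concrete, here is a minimal sketch of one possible reading of the idea. Everything specific in it is an assumption for illustration: the training data layout (`training_by_length`, a dict from number of predictions to past examples), sorting the predictions so their order doesn't matter, skipping the empty subset, Euclidean distance in log-odds space, and k=3. It is not Moonbird, and it leaves out the weighting by how well trained each length is.

```python
import math
from itertools import combinations


def logit(p, eps=1e-6):
    """Log-odds of p, clamped so 0 and 1 don't blow up."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def knn_estimate(query, training, k=3):
    """Mean outcome of the k past questions closest to `query`.

    training: list of (probabilities, outcome) pairs, where outcome is
    0 or 1 and each probabilities list has the same length as query.
    """
    q = sorted(logit(p) for p in query)
    scored = []
    for probs, outcome in training:
        x = sorted(logit(p) for p in probs)
        scored.append((math.dist(q, x), outcome))
    scored.sort(key=lambda pair: pair[0])
    nearest = scored[:k]
    return sum(outcome for _, outcome in nearest) / len(nearest)


def subset_knn_estimate(query, training_by_length, k=3):
    """Average the kNN estimate over every non-empty subset of query."""
    estimates = []
    for size in range(1, len(query) + 1):
        training = training_by_length.get(size)
        if not training:
            continue  # no past questions with this many predictions
        for subset in combinations(query, size):
            estimates.append(knn_estimate(subset, training, k))
    if not estimates:
        raise ValueError("no training data for any subset size")
    return sum(estimates) / len(estimates)
```

The number of subsets doubles with each extra prediction, so this is only practical for questions with a modest number of predictions.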
Tetlock tells us that although individuals are overconfident, crowds are underconfident, so once we've averaged, we should shift away from 0.5. This helps a bit in this case, but Moonbird does a lot better. I guess it's increasing confidence when the crowd is agreed and decreasing confidence when the crowd disagrees.
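A minimal sketch of that kind of shift away from 0.5, done by scaling the averaged probability in log-odds space. The `factor` value is a hypothetical tuning knob I picked for illustration, not a number from Tetlock or from Moonbird, and it would need to be fitted on past data.

```python
import math


def extremize(p, factor=1.5, eps=1e-6):
    """Push a probability away from 0.5 by scaling its log-odds.

    factor > 1 makes the forecast more confident, factor == 1 leaves
    it unchanged, and factor < 1 pulls it back towards 0.5.
    """
    p = min(max(p, eps), 1.0 - eps)  # avoid log(0) at the extremes
    log_odds = math.log(p / (1.0 - p))
    return 1.0 / (1.0 + math.exp(-factor * log_odds))


if __name__ == "__main__":
    for p in (0.2, 0.5, 0.8):
        print(p, extremize(p))
```

A single fixed factor treats every question the same way, so it can't raise confidence when the crowd agrees and lower it when the crowd disagrees, which is what I'm guessing Moonbird does.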
It's not a novel algorithm type, just a learning project I did in the process of learning ML frameworks, a fairly simple LSTM + one dense layer, trained on the predictions + resolution of about 60% of the resolved predictions from PredictionBook as of September last year (which doesn't include any of the ones in the contest). The remaining resolved predictions were used for cross-validation or set aside as a test set. An even simpler RNN is only very slightly less good, though.
The details of how the algorithm works are thus somewhat opaque, but from observing the way it reacts to input, it seems to lean on the average, weight predictions later in the sequence more heavily (so order matters), and get more confident with the number of predictions, while treating propositions with only one probability assignment as probably being heavily overconfident. It seems to have more or less learnt the insight Tetlock pointed out on its own. Disagreement might also matter to it, not sure.
It's on GitHub at https://github.com/jbeshir/moonbird-predictor-keras; this doesn't include the data, which I downloaded using https://github.com/jbeshir/predictionbook-extractor. It's not particularly tidy though, and still includes a lot of unused functionality for input features (the words of the proposition, the time between a probability assignment and the due time, etc.) which I didn't end up using because the dataset was too small for it to learn any signal in them.
I'm currently working on making the online frontend to the model automatically retrain the model at intervals using freshly resolved predictions, mostly for practice building a simple "online" ML system before I move on to trying to build things with more practical application.
The main reason I ran figures for it against the contest was that some of its individual confidences seemed strange to me, and while the cross-validation stuff was saying it was good, I was suspicious I was getting something wrong in the process.