[LINK] Get paid to train your rationality

A tournament is currently being initiated by the Intelligence Advanced Research Projects Activity (IARPA) with the goal of improving forecasting methods for global events of national (US) interest. One of the teams (The Good Judgment Team) is recruiting volunteers to have their forecasts tracked. Volunteers will receive an annual honorarium ($150), and it appears there will be ongoing training to improve one's forecast accuracy (not sure exactly what form this will take).

I'm registered, and wondering if any other LessWrongers are participating/considering it. It could be interesting to compare methods and results.

Extensive quotes and links below the fold.

Despite its importance in modern life, forecasting remains (ironically) unpredictable. Who is a good forecaster? How do you make people better forecasters? Are there processes or technologies that can improve the ability of governments, companies, and other institutions to perceive and act on trends and threats? Nobody really knows.

The goal of the Good Judgment Project is to answer these questions. We will systematically compare the effectiveness of different training methods (general education, probabilistic-reasoning training, divergent-thinking training) and forecasting tools (low- and high-information opinion-polls, prediction market, and process-focused tools) in accurately forecasting future events. We also will investigate how different combinations of training and forecasting work together. Finally, we will explore how to more effectively communicate forecasts in ways that avoid overwhelming audiences with technical detail or oversimplifying difficult decisions.

Over the course of each year, forecasters will have an opportunity to respond to 100 questions, each requiring a separate prediction, such as “How many countries in the Euro zone will default on bonds in 2011?” or “Will Southern Sudan become an independent country in 2011?” Researchers from the Good Judgment Project will look for the best ways to combine these individual forecasts to yield the most accurate “collective wisdom” results.  Participants also will receive feedback on their individual results.

All training and forecasting will be done online. Forecasters’ identities will not be made public; however, successful forecasters will have the option to publicize their own track records.

Who We Are

The Good Judgment research team is based in the University of Pennsylvania and the University of California Berkeley. The project is led by psychologists Philip Tetlock, author of the award-winning Expert Political Judgment, Barbara Mellers, an expert on judgment and decision-making, and Don Moore, an expert on overconfidence. Other team members are experts in psychology, economics, statistics, interface design, futures, and computer science.

We are one of five teams competing in the Aggregative Contingent Estimation (ACE) Program, sponsored by IARPA (the U.S. Intelligence Advanced Research Projects Activity). The ACE Program aims "to dramatically enhance the accuracy, precision, and timeliness of forecasts for a broad range of event types, through the development of advanced techniques that elicit, weight, and combine the judgments of many intelligence analysts." The project is unclassified: our results will be published in traditional scholarly and scientific journals, and will be available to the general public.

A general description of the expected benefits for volunteers:

All decisions involve forecasts, and we all make forecasts all the time.  When we decide to change jobs, we perform an analysis of potential futures for each of our options.  When a business decides to invest or disinvest in a project, it moves in the direction it believes to present the best opportunity.  The same applies when a government decides to launch or abandon a policy.

But we virtually never keep score. Very few forecasters know what their forecasting batting average is — or even how to go about estimating what it is.

If you want to discover what your forecasting batting average is — and how to think about the very concept — you should seriously consider joining The Good Judgment Project. Self-knowledge is its own reward. But with self-knowledge, you have a baseline against which you can measure improvement over time. If you want to explore how high your forecasting batting average could go, and are prepared to put in some work at self-improvement, this is definitely the project for you.

Could that be any more LessWrong-esque?

Prediction markets can harness the "wisdom of crowds" to solve problems, develop products, and make forecasts. These systems typically treat collective intelligence as a commodity to be mined, not a resource that can be grown and improved. That’s about to change.

Starting in mid-2011, five teams will compete in a U.S.-government-sponsored forecasting tournament. Each team will develop its own tools for harnessing and improving collective intelligence and will be judged on how well its forecasters predict major trends and events around the world over the next four years.

The Good Judgment Team, based in the University of Pennsylvania and the University of California Berkeley, will be one of the five teams competing – and we’d like you to consider joining our team as a forecaster. If you're willing to experiment with ways to improve your forecasting ability and if being part of cutting-edge scientific research appeals to you, then we want your help.

We can promise you the chance to: (1) learn about yourself (your skill in predicting – and your skill in becoming more accurate over time as you learn from feedback and/or special training exercises); (2) contribute to cutting-edge scientific work on both individual-level factors that promote or inhibit accuracy and group- or team-level factors that contribute to accuracy; and (3) help us distinguish better from worse approaches to generating forecasts of importance to national security, global affairs, and economics.

Who Can Participate

Requirements for participation include the following:

(1) A baccalaureate, bachelors, or undergraduate degree from an accredited college or university (more advanced degrees are welcome);

(2) A curiosity about how well you make predictions about world events – and an interest in exploring techniques for improvement.

More info: http://goodjudgmentproject.blogspot.com/

Pre-Register: http://surveys.crowdcast.com/s3/ACERegistration

55 comments

The Good Judgment project has started publishing a leaderboard. FWIW, as of this writing I am in pole position with a "Brier score" of 0.18, with numbers 2 and 3 at 0.2 and 0.23 respectively. (I'm not sure whether other participants are also from LW.)

(ETA: dethroned! I'm #2 now, #1 has a score of .16.)

Team scores seem a bit worse than the best individual scores (lower is better for Brier scores): 0.32, 0.33 and 0.36 for the best three teams.

From the emails I've been getting from the organizers, they have trouble sustaining participation from all who signed up; poor participation is leading to poor forecasting scores.
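For anyone unfamiliar with the "Brier score" mentioned above: it is the mean squared difference between forecast probabilities and what actually happened, so lower is better and always answering 50% scores 0.25. A minimal sketch in Python, assuming binary questions and the single-probability form (range 0 to 1). The project's exact scoring rule isn't given in this thread and may differ; the function name and sample numbers are hypothetical.

```python
def brier_score(forecasts, outcomes):
    """Mean squared difference between forecast probabilities and 0/1 outcomes.

    Assumes binary questions; 0.0 is a perfect score, 1.0 the worst possible.
    """
    if len(forecasts) != len(outcomes) or not forecasts:
        raise ValueError("need equal-length, non-empty inputs")
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)


# Hypothetical forecasts (probability the event happens) and outcomes (1 = happened).
forecasts = [0.9, 0.2, 0.7, 0.5]
outcomes = [1, 0, 0, 1]
print(brier_score(forecasts, outcomes))
```

The confident miss (0.7 on an event that did not happen) contributes most of the penalty in this example, which is why overconfidence is costly under this kind of rule.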

FYI the leaderboard rankings are fake, or at least generated strategically to provide users with specific information. I am near the top of my own leaderboard, while my friend sees his own name but not mine. Also, my Brier is listed at 0.19, strikingly close to yours. I wonder if they are generated with some apparent distribution.

My take is that the leader stats are some kind of specific experimental treatment they're toying with.

This is almost more interesting than the study itself. :)

Are your friend and you able to see each other's comments on predictions?

poor participation is leading to poor forecasting scores.

Hmm, correlation v. causation maybe? Is it possible that some people were doing poorly and so started participating less?

Yes, it's possible too. I used "causing" referring to a direct link: some predictions are of the form "event X will happen before date D", and you lose points if you fail to revise your estimates as D draws nearer.

Apparently many people weren't aware of this aspect - they took a "fire and forget" approach to prediction. (That is in itself an interesting lesson.) That was before the leaderboard was set up.
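To illustrate why "fire and forget" costs points on "X will happen before date D" questions, here is a rough sketch. It assumes the standing forecast is scored every day the question is open and the daily scores are averaged; that is my assumption about the mechanism, not a documented description of the project's rule. The names, the 100-day window, and the linear revision schedule are all hypothetical.

```python
def daily_average_brier(daily_probs, outcome):
    """Average the squared error of each day's standing forecast.

    daily_probs: probability held on each day the question was open.
    outcome: 1 if the event happened before the deadline, else 0.
    """
    return sum((p - outcome) ** 2 for p in daily_probs) / len(daily_probs)


days = 100  # hypothetical question lifetime

# Fire and forget: enter 50% on day one and never touch it again.
static = [0.5] * days

# Reviser: starts at 50% and lowers the estimate linearly as the
# deadline approaches without the event having occurred.
revised = [0.5 * (1 - d / days) for d in range(days)]

# Suppose the event never happens (outcome = 0).
static_score = daily_average_brier(static, 0)
revised_score = daily_average_brier(revised, 0)
print(static_score, revised_score)
```

Under these assumptions the forecaster who keeps revising downward ends up with a clearly lower (better) average than the one who leaves the initial 50% in place.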

Is this limited to graduates from U.S. universities?

Apparently the only way to know is to try. It seems likely that there is such a restriction. I'd estimate a better than 70% chance that I get turned down. :)

I got an email an hour ago from the study saying I was accepted and taking me to the initial survey (a long one, covering calibration on geopolitics, finance, and religion; personality surveys with a lot of fox/hedgehog questions; basic probability; a critical thinking test, the CRT; and then what looked like a full matrix IQ test). The message at the end of all the questions:

Congratulations! You’ve completed the survey. Sometime later this year, we’ll post information on the distribution of answers among those participating in this study.

What comes next? Some of you (by random assignment) will receive an e-mail with a link to a training exercise. Again, we ask you to complete that exercise before forecasting begins on September 1st. That’s the big day for the entire team – the official start of forecasting on 9/1/2011.

Be sure to watch your e-mail for a personalized link to “your” forecasting website. We hope you’re as eager as we are for the tournament to begin.

So I'm marking me as accepted, anyway.

And the "tournament" is now begun. Just got email with login instructions.

Looks somewhat similar to PredictionBook, actually. :)

I did all my predictions last night immediately after the email showed up, so that meant I got to place a lot of bets at 50/50 odds :)

(Then I recorded everything privately in PredictionBook. No point in leaving my predictions trapped on their site.)

Interface-wise, I don't like it at all. I'm still not sure what exactly I am betting at or with, compared to PB with straight probabilities or Intrade with share prices.

Did you take the "training refresher"? That includes a general-knowledge test at the end which scores you on both calibration and resolution. My results were pretty poor (but not abysmal):

You got 63% of the items correct, and your average confidence rating over all of the items was 74.33%. (...) In this exercise, your calibration is 11.00 (average confidence minus percent correct). (...) Your confidence when you were correct was 75.26%, and your confidence when you were incorrect was 72.73%. The difference is 2.53%.

I'd be curious to compare with yours if you'd care to share.
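The two figures in that feedback can be reproduced from raw answers. A minimal sketch, using the definitions as quoted (calibration = average confidence minus percent correct; the second figure = mean confidence when correct minus mean confidence when incorrect). The function name and sample data are hypothetical, and the real exercise may compute things differently.

```python
def calibration_summary(confidences, correct):
    """Summarize a quiz where each answer has a confidence (in %) and a right/wrong flag."""
    n = len(confidences)
    pct_correct = 100 * sum(correct) / n
    avg_conf = sum(confidences) / n
    right = [c for c, ok in zip(confidences, correct) if ok]
    wrong = [c for c, ok in zip(confidences, correct) if not ok]
    # The confidence gap is undefined if every answer was right, or every answer wrong.
    gap = (sum(right) / len(right) - sum(wrong) / len(wrong)) if right and wrong else None
    return {
        "percent_correct": pct_correct,
        "average_confidence": avg_conf,
        "calibration": avg_conf - pct_correct,  # positive means overconfident
        "confidence_gap": gap,                  # larger means better discrimination
    }


# Hypothetical four-item quiz.
summary = calibration_summary([90, 60, 80, 70], [True, False, True, False])
print(summary)
```

As a sanity check on the quoted numbers: 75.26 - 72.73 = 2.53 matches, and a calibration of 11.00 against an average confidence of 74.33 implies 63.33% correct, which would fit 19 of 30 items (the item count is my inference, not stated in the feedback).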

Without actually going through the whole refresher, it seems to be the same; when I did the training, I don't remember that calibration/resolution test. Perhaps that is one of the experimental differences.

I didn't remember that test from earlier, either. Worth checking out? I don't mind accidentally unblinding a little if it is an experimental/control difference - curious folks will be curious.

I just went through the whole thing again; there was no test of that kind at the end. (What there was was the previous multiple-choice quiz about some example forecasts and how they went wrong.) Looks like this is an experimental/control difference. I'd rather not discuss that bit further - this isn't about possibly life-or-death drugs, after all, and I already know where I can find calibration tests like that.

I've been wondering if it's not the other way round, the Good Judgment Project copying Inkling Market's questions? What info do you have that leads you to think the copying was in the direction you assume?

My evidence for the other way round is that the Brent question has a starred footnote which is present on IM but not on GJ, while the star is in the text of the GJ question.

The description, from the latest email, is

First, for those of you who logged in before September 6, please be aware that the tournament's sponsor issued 10 new questions today, which we have posted. As the About page notes, new questions usually will be distributed on Mondays (but not every Monday). These questions arrived on Tuesday because of the Labor Day holiday.

My understanding was that the sponsor was IARPA. And googling, I don't see any listed connections between Inkling and the Good Judgment Project.

Stray asterisks are very suspicious. I see one in the Inkling question, but I don't see the footnote itself. It has a "background information" section, but all their questions do. Is "last price" a technical term? If the usual term is "settlement price," and GJ doesn't make that clear, then it is quite suspicious.

Here are two more Inkling questions with asterisks. One has an explicit footnote. The other is a change to the question.

Have you entered any comments on your predictions at the GJ site? (You're supposed to enter a minimum number of comments over one year, and also a minimum number of responses to others' comments. My understanding is that this will in time be run as a team game, with team play conventions.)

From my first experiences, I'm assuming the scoring will be pretty much as with PB.com - based on probability. Their model seems to be calibration/resolution rather than the visual "slope" representation.

Comments? I don't see any relevant fields for that, checking right now, nor does my 'About' include the substring "comment". Another experimental difference, I guess...

The "Why did you answer the way you did" field. I've been assuming we're both using the same underlying app, i.e. Crowdcast. But perhaps we're not...

I'm in; pleasantly surprised.

This bit from the final registration page is interesting - "But one other requirement for forecasters has changed. We can welcome those who are not US citizens." Implying that at some prior point non-US citizens were not accepted.

Especially (mischievous mode ON) as I've only implied, not outright stated, that I've applied.

Mischievous mode OFF - that's a problem in arbitrating predictions, btw - the potential for ambiguity inherent in all human languages. If I hadn't in fact applied (I have), how should the prediction that I am "turned down" be judged?

I should use PredictionBook more often but I don't, partly due to this kind of thing, also due to the trivial-inconvenience effort of having to come up with my own predictions to assess and the general uselessness for that purpose of the stream of other users' predictions.

Other than Tricycle folks, is anyone here on LW officially (or unofficially) "in charge" of maintaining and enhancing PredictionBook?

Other than Tricycle folks, is anyone here on LW officially (or unofficially) "in charge" of maintaining and enhancing PredictionBook?

I have some sort of moderator power; I am de facto in charge of the content house-keeping - editing bad due-dates, making private bad or long-overdue-unjudged predictions, criticizing predictions, etc. I also make and register hundreds of predictions, obviously.

(In addition, I have commit access to the codebase on GitHub, but I don't know Ruby, so I will probably never make use of said commit-bit.)

One thing that would probably greatly improve PB for my purposes is a tagging / filtering system, so that you could for instance pick out predictions about consumer devices or predictions about politics; or conversely leave out some uninteresting categories (e.g. predictions about the private lives of particular PB users, which I interpret as pure noise).

No; I just tried the query "consumer electronics site:predictionbook.com", and that only returned 1 hit; I know there are more (including one I just made and another I just voted on). It really is the lack of user-supplied meta-information that prevents useful querying, not the lack of a UI for doing so. The UI encourages predictions to be written very tersely, and doesn't supply an extended-info field when you make a prediction.

PB.com is quite possibly the least well executed idea out there that I keep not giving up on. :)

Ah, that's what you meant by tags. Yes, that would be nice. On the other hand, I rather doubt that tags would instantly create massive demand for PB's services - other places like Intrade have well-categorized predictions/bets, and none of them have seen traffic explode the moment they implemented that feature.

If you really found tags all that valuable, you could start doing them inside comments. Go over the 969 upcoming predictions and add comments like 'tags: personal, exercise' or 'tags: America, politics'. Later, it'd be even easier to turn them into some real software-supported tags/categories, and in the meantime, you can query using Google. This wouldn't even take very long - at 30 predictions a day, which ought to take 10 minutes max, you'd be done in a month.

(I doubt you will adopt my suggestion and tag even 500 predictions (10%). This seems to be common to suggestions for PB: 'I'd use and really find PB useful if only it were executed better in this way', which of course never happens. It's starting to remind me of cryonics.)

If you really found tags all that valuable, you could start doing them inside comments.

Preliminary report: this isn't going to work, not without drastic contortions in the choice of tags (which IMO kills the effectiveness of the tactic). For instance, from my first set of 30 I tagged a number with the tag "personal", predictions which only concern one user (or two acquainted with each other) and that I don't want to see because I can't effectively assess them. The Google query including "personal" returns close to 30 spurious results: for instance those containing "personal computer" or "personal transportation". (A temporary workaround is to include the term "tags" in the query, but this will cease to work once a greater fraction of predictions have been tagged.)

I doubt you will adopt my suggestion

You are correct about the likely outcome, but I think I've just proven your model of the underlying reasons wrong: I won't do it because it won't work, not because I lack the conscientiousness to do so, or because I'm too selfish to take on an effort that will benefit all users.

The Google query including "personal" returns close to 30 spurious results: for instance those containing "personal computer" or "personal transportation".

JoshuaZ has (example) been adding brackets to the tags, such as [economics]. You don't mention forcing Google to include the brackets, so it's not surprising it includes those extra results.

I don't think Google respects punctuation. It's a common complaint.

Hm, you're right. I did some searches on this, and apparently brackets are one of the special characters specifically excluded by Google (along with spam-licious '@' and others). How unfortunate.

This seems to be common to suggestions for PB: 'I'd use and really find PB useful if only it were executed better in this way', which of course never happens.

How many times was a new feature implemented as a test of such a hypothesis?

PB.com seems like it would be a great place for things like A/B testing and other tactics in the "Lean startup" repertoire, but what actually seems to be the case is that the site isn't under active development any more; no one is apparently trying to develop traffic or usage by improving the user experience there. (This isn't to slight your own efforts or your obvious enthusiasm; merely my best current hypothesis.)

(I'm finding the comparison with cryonics ironically apt, as a fence-straddling LW user who's not giving up on the entire project despite facing, as a non-US citizen, a battery of obstacles that I suspect also apply in the US, where they're just less obvious and as a result people take it for granted that things will "just work". Though it's more likely that the comparison is entirely beside the point and just a red herring for the purposes of present discussion.)

If you really found tags all that valuable, you could start doing them inside comments.

I'll try that, for a minimum of 30 predictions.

How many times was a new feature implemented as a test of such a hypothesis?...PB.com seems like it would be a great place for things like A/B testing and other tactics in the "Lean startup" repertoire, but what actually seems to be the case is that the site isn't under active development any more; no one is apparently trying to develop traffic or usage by improving the user experience there.

This is true. Trike is not doing anything but maintenance because their options are to work on PB, LW, or Khan Academy. When I asked for features to be added and argued that work on PB could be justified, Matthew Fallshaw gave me Analytics data to look at. At that point, LW had ~140,000 unique visitors in the previous 30 days. And Khan Academy had a total of 25.2 million video watches. And Trike had no shortage of valuable things it could do on LW or Khan - why should it work on PB? (Practice in Agile methodology? Better done on high-traffic sites where measurements are more trustworthy.)

The final crushing statistic: PB had just 4 visitors who visited more than 10 times that month. Including me.

PB had just 4 visitors

Ah. So that is the "true rejection" of feature suggestions for PB, rather than "sounds nice but would not increase usage if implemented"?

Well, that's Trike's true rejection. Development of PB is worth a fair bit to me, the major user of it, so while I'm swayed by Trike's argument - I agree that from a utilitarian point of view PB is a bad investment - it doesn't affect my appraisal much. I just think those suggestions do not affect other people's 'true rejection' of PB use.

Hmm, I think this is a good idea. When I make a prediction or comment on it I will add tag remarks. It is a non-ideal hack but should help a little bit.

No. My degrees are from Canada and France, and I'm in.

How long did it take between your "preregistering" and hearing back?

What form did that take? (I.e. form email, personal email, direct link to a Web page, or whatever?)

How long between hearing back and being fully accepted? (I'm assuming that's what "I'm in" means...)

How long did it take between your "preregistering" and hearing back?

3-4 days.

What form did that take?

An email welcoming me to the study with a link to the pre-study survey (which is a mix of attitude/ideology, knowledge, logic, and intelligence questions).

How long between hearing back and being fully accepted?

Same as above.

Excerpted from interesting recent news from the GJP, which is now entering the "official" tournament phase:

Meanwhile, we have updated the scoring sidebar accessible from the "About" tab of your forecasting website to provide forecasters affected by the new scoring rule with more information (this does not apply to prediction-market forecasters). We also will be using the FAQs to provide all of you with details about the number of forecasters participating in the tournament (currently over 2,700 on the Good Judgment Team, spread over 12 experimental conditions) and other topics that have prompted questions to our Help Desk or project administrator.

“How many countries in the Euro zone will default on bonds in 2011?” or “Will Southern Sudan become an independent country in 2011?”

It's hard to make predictions about politics because the decision makers have perverse/unknown sets of incentives. In contrast, it's much easier to make guesses with reasonable error bars when the decision maker is spending his/her own money.

Many thanks for posting this! I'd probably want to do this even if there were no payment, so it's doubly attractive to me. I've submitted the form.

EDIT: I wonder if I'll get in; it just got posted to Marginal Revolution and I doubt they have that huge a budget...

So... what's the catch?

Also, my main reason for not signing up is time, responsibility and commitment. Any idea how much of those this might require?

Edit: entire conversation removed due to complete failure of reading comprehension.

The necessary time commitment is on the order of 6-10 hours per year. You can put as much time into training as you like, of course.

Is that 10 hours the week of signup, or 1.6438356164 minutes per day?

... wait, that's still almost 2 min? Probably not worth it even then.

Just to be clear, the deal is that you will receive somewhere between $15 and $25 per hour and also receive an assessment of your calibration and possibly also receive forecasting training...
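For the record, the arithmetic behind the per-day and per-hour figures in this exchange, as a quick sketch. The inputs ($150 honorarium, 6-10 hours per year) come from the post and the comment above; the function names are just for illustration, and a 365-day year is assumed.

```python
HONORARIUM_USD = 150              # annual honorarium from the post
HOURS_LOW, HOURS_HIGH = 6, 10     # estimated yearly time commitment


def minutes_per_day(hours_per_year, days_per_year=365):
    """Spread a yearly time commitment evenly across the year."""
    return hours_per_year * 60 / days_per_year


def hourly_rate(honorarium, hours_per_year):
    """Effective pay per hour if the honorarium is the only compensation."""
    return honorarium / hours_per_year


print(minutes_per_day(HOURS_HIGH))               # the "almost 2 min" per day figure
print(hourly_rate(HONORARIUM_USD, HOURS_HIGH))   # low end of the hourly range
print(hourly_rate(HONORARIUM_USD, HOURS_LOW))    # high end of the hourly range
```

Ten hours a year works out to about 1.64 minutes a day, and the $15 and $25 endpoints correspond to the 10-hour and 6-hour estimates respectively.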

Oh.

Well, I'm probably disqualified by virtue of this conversation taking place then.