Does anybody know where to find a large database of statements that are roughly 50% likely to be true or false?  These would be used for confidence calibration / Bayesian updating exercises for CMR/HRP.

One way to make such a database would be to buy a bunch of trivia games with True/False questions, and type each statement and its negation into a computer.  A problem with this might be that trivia questions are selected to have surprising/counterintuitive truth values; I'm not sure if that's true.  I'd be happy to acquire an already-made database of this form, but ideally I'd like statements that are "more neutral" in terms of how counterintuitive they are.

Any thoughts on where we might find a database like this to use/buy?

Thanks for any help!

Revision: We actually want a database of two-choice answer questions. This way, the player won't get trained on a base rate of 50% of statements in the world being true... they'll just get trained that when there are two possible answers, one is always true.  In the end, the database should look something like this (warning: I made up the "correct" answers):

Question: "Which is diagnosed more often in America (2011)?"; 
Answers: (a) "the cold", (b) "allergies"; 
Correct Answer: (a); 
Tags: {medical}

Question: "Which city has a higher average altitude?"; 
Answers: (a) "Chicago", (b) "Las Vegas"; 
Correct Answer: (a)
Tags: {geography}

Question: "Who sold more albums while living?"; 
Answers: (a) "Michael Jackson", (b) "Elvis Presley"; 
Correct Answer: (b)
Tags: {history, pop-culture, music}

Question: "Was the price of IBM stock higher or lower at the start of the month after the Berlin wall fell, compared with the start of the previous month?"; 
Answers: (a) "higher", (b) "lower"; 
Correct Answer: (a)
Tags: {history, finance}
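
For concreteness, here is a minimal sketch of how one record of such a database might be represented in Python. The class name, field names, and the `by_tag` helper are assumptions made for illustration, not an existing format, and the "correct" answers are the made-up ones from the examples above.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List, Tuple


@dataclass(frozen=True)
class TwoChoiceQuestion:
    """One two-choice record (hypothetical layout, not an existing schema)."""
    question: str
    answers: Tuple[str, str]          # text of answer (a), text of answer (b)
    correct: str                      # "a" or "b"
    tags: FrozenSet[str] = field(default_factory=frozenset)

    def __post_init__(self):
        # Reject records whose answer key is not one of the two choices.
        if self.correct not in ("a", "b"):
            raise ValueError("correct must be 'a' or 'b'")

    def correct_text(self) -> str:
        """Return the text of the answer marked as correct."""
        return self.answers[0] if self.correct == "a" else self.answers[1]


# The answer keys below are the made-up ones from the post, not verified facts.
EXAMPLES = [
    TwoChoiceQuestion(
        "Which city has a higher average altitude?",
        ("Chicago", "Las Vegas"), "a", frozenset({"geography"})),
    TwoChoiceQuestion(
        "Who sold more albums while living?",
        ("Michael Jackson", "Elvis Presley"), "b",
        frozenset({"history", "pop-culture", "music"})),
    TwoChoiceQuestion(
        "Was the price of IBM stock higher or lower at the start of the month "
        "after the Berlin wall fell, compared with the start of the previous month?",
        ("higher", "lower"), "a", frozenset({"history", "finance"})),
]


def by_tag(questions: List[TwoChoiceQuestion], tag: str) -> List[TwoChoiceQuestion]:
    """Select the questions carrying a given tag."""
    return [q for q in questions if tag in q.tags]
```

Keeping the two answers in a fixed-length tuple and the key as "a"/"b" mirrors the layout above; tags as a set make it easy to draw a quiz from one topic.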




Download a bunch of historical stock price information, then ask questions like "Did company X's stock go up the day after event Y?" (Did IBM go up after the Berlin Wall fell?)
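
A rough sketch of how such a question could be answered mechanically once the prices are downloaded. The function name, the dict-of-closing-prices layout, and the prices themselves are placeholders invented for illustration, not real market data.

```python
from datetime import date


def went_up_after(prices, event_day):
    """Compare the last close on or before event_day with the first close after it.

    prices maps datetime.date -> closing price (assumed layout).
    Returns True/False, or None if there is no price on one side of the event.
    """
    before = [d for d in prices if d <= event_day]
    after = [d for d in prices if d > event_day]
    if not before or not after:
        return None
    return prices[min(after)] > prices[max(before)]


# Placeholder closing prices for a hypothetical "company X" (not real data).
COMPANY_X = {
    date(2000, 1, 3): 50.0,
    date(2000, 1, 4): 51.5,
    date(2000, 1, 5): 49.0,
}
```

With this data, `went_up_after(COMPANY_X, date(2000, 1, 3))` is True and `went_up_after(COMPANY_X, date(2000, 1, 4))` is False; each (company, event) pair then yields one two-choice question.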

Hmm... this made me think that perhaps two-choice questions are better than true/false questions, because when all the questions have the same two possible answers T/F, there is a base rate of how often the answer "T" is correct which the player should account for. For real life questions with two possible answers like "Who is taller, Alex or Bob?", there is not really a well-known base rate.


The problem is, that's too obscure for most people to even have intuition for it.

You could use data about countries (this site looks like it has a number of links to spreadsheets) and randomly generate questions of the form "Does [country] or [country] have higher [economic statistic]?". If it's a concern that the questions would be too easy, you could pick only cases where the statistics are in fact reasonably close.
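
A minimal sketch of that generator, assuming the statistic has already been loaded into a dict. The country names and numbers are placeholders rather than real statistics, and `max_ratio` is a hypothetical knob for what counts as "reasonably close".

```python
import random

# Placeholder values, NOT real statistics; in practice these would come
# from one of the country spreadsheets.
STATS = {
    "Country A": 100.0,
    "Country B": 104.0,
    "Country C": 250.0,
    "Country D": 240.0,
    "Country E": 900.0,
}


def close_pairs(stats, max_ratio=1.1):
    """Return name pairs whose larger value is at most max_ratio times the smaller."""
    names = sorted(stats)
    pairs = []
    for i, x in enumerate(names):
        for y in names[i + 1:]:
            lo, hi = sorted((stats[x], stats[y]))
            # Skip ties and non-positive values: they give no clear answer.
            if lo > 0 and lo != hi and hi / lo <= max_ratio:
                pairs.append((x, y))
    return pairs


def make_question(stats, stat_name, rng, max_ratio=1.1):
    """Pick a random close pair and present it in random order."""
    x, y = rng.choice(close_pairs(stats, max_ratio))
    if rng.random() < 0.5:
        x, y = y, x  # randomize which country is listed as answer (a)
    return {
        "question": f"Which has the higher {stat_name}?",
        "answers": (x, y),
        "correct": "a" if stats[x] > stats[y] else "b",
    }
```

Calling `make_question(STATS, "GDP", random.Random())` returns one question; randomizing the order keeps "(a)" from being correct more often than "(b)".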

Statements that are ~50% true... this is actually pretty hard. Maybe mine some dataset for statistical info?

Generally, I would look into RDF (Protégé and TopBraid Composer Free Edition will let you poke around for free without knowing the data format).

US 2000 Census in RDF

Freebase has all manner of data in RDF public data sets, not all in RDF but "it's more important that the data have structure" and all that

cancer stats

You could take a look at the 15 million statements in the database of the Never-Ending Language Learning project. The subset of beliefs for which human-supplied feedback exists and beliefs that have high-confidence truth values may be appropriate for your purpose.

Good link. WordNet is also the canonical language reference, but probably doesn't serve OP's purpose directly. If you start getting into these kinds of graphs, though, it's quite useful to move around with.


roughly 50% likely to be true or false

I feel like you'd need to specify for what kind of person these statements shall appear about 50% likely. That can be very different across different knowledge backgrounds. I, as a European, have no idea whether or not Iowa and Ohio are neighboring states.

That said, I think geographical questions might do well because such statements should be easy to generate and find evidence for/against.


  • The Great Slave Lake is the 11th largest lake in the world
  • Algeria is the 12th largest country in the world
  • Israel is bigger than New Jersey
  • Germany is smaller than Montana
  • Sulawesi is one of the ten largest islands in the world

(some of these are false, some are true).

To create these statements, one could look up Wikipedia lists, e.g. List of islands by area, List of countries by area, List of rivers by length, and so on.

Writing a script that extracts statements from this type of data should be feasible, and one could write it such that for each true statement extracted, a wrong statement is created as well.
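
As a rough sketch of such a script, assuming the list has already been parsed into names sorted largest first (the scraping or querying step is left out, and the function name and `max_gap` parameter are made up for illustration):

```python
import random


def paired_statements(ranked, rng, max_gap=3):
    """Build (statement, is_true) pairs from a list sorted largest first.

    Each item is compared with a random item at most max_gap places below it,
    and every true statement gets a false counterpart with the names swapped.
    """
    out = []
    for i in range(len(ranked) - 1):
        j = rng.randint(i + 1, min(i + max_gap, len(ranked) - 1))
        big, small = ranked[i], ranked[j]
        out.append((f"{big} is bigger than {small}", True))
        out.append((f"{small} is bigger than {big}", False))
    return out


# Placeholder names; a real run would use e.g. a parsed list of islands by area.
statements = paired_statements(
    ["Island A", "Island B", "Island C", "Island D"], random.Random())
```

Keeping `max_gap` small compares items of similar rank, which makes the statements harder. For a quiz one would probably shuffle the output and show only one statement from each true/false pair.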

I find it very hard to judge these questions; however, given a world map (without borders), this changes. Also, you could tell me how many people live in the countries/states mentioned, how large one of these countries is in absolute numbers, or what the greatest depth of the Great Slave Lake and fifteen other lakes in the world is.

Once these statements are available, they could not only be used for calibration training, but also for exercises about seeking the truth in groups.


I would recommend against "X is the 13th largest Y", because, other than for people who've memorized the Top Twenty Ys, getting this right is purely a matter of guesswork. "One of the 10 largest" is better; so is "X is bigger than Y".


Well, if one can come up with the top Ys, one can reason about what probability one wants to assign to that statement. For example, if I can think of 9 countries that I think are bigger than Algeria, and three more about which I am uncertain, I can well assign a probability of, say, 30%. Calibration training could be done this way.
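
To make that reasoning concrete, here is a small sketch under strong simplifying assumptions: the 9 countries are definitely bigger, the three uncertain ones are in addition to those nine, each is independently bigger with the same probability, and no candidate has been forgotten. The function name and parameters are made up for illustration.

```python
from math import comb


def prob_exact_rank(rank, sure_bigger, uncertain, p_bigger=0.5):
    """Probability that an item holds exactly `rank`, under the assumptions above.

    The item is at `rank` when exactly rank - 1 items are bigger, so the
    uncertain candidates must contribute exactly rank - 1 - sure_bigger of them.
    """
    needed = rank - 1 - sure_bigger
    if needed < 0 or needed > uncertain:
        return 0.0
    # Binomial probability of exactly `needed` successes among `uncertain` trials.
    return comb(uncertain, needed) * p_bigger**needed * (1 - p_bigger)**(uncertain - needed)


p = prob_exact_rank(rank=12, sure_bigger=9, uncertain=3)  # 3 * 0.5**3 = 0.375
```

With a coin flip for each uncertain country this gives 37.5%, in the same ballpark as the 30% above; allowing for forgotten countries would pull the number down further.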


Yeah, I guess, but that's a whole lot of work for one short question of this kind and if you can think of 12 candidates then there's a good chance you've forgotten a couple. I don't mean to imply that this kind of question is completely useless, only that other sorts are probably better.


I'd say it depends on what exactly you want to do once you have the statements.

USGS has good info. Also, there is no need to scrape Wikipedia; the work has been done for you. You can do SPARQL queries to get most of your statements, and the CEGIS site supposedly has a working SPARQL endpoint, but I haven't used that in years.

It might be possible to have subject matter experts assemble sets of such questions about their areas. We have a lot of people with different expertise on LW and so it wouldn't be that hard. For example, I could pretty easily supply 5-10 such questions in a math context or in a history of science context.

You could try using PredictionBook? I don't really know what your specific needs are so I don't know if PB could really be adapted for your needs.

We need something for rapid calibration rather than slow-to-verify predictions like those on PredictionBook (which are also good to train on).
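
For what it's worth, a minimal sketch of the feedback such an exercise could give after a batch of two-choice questions: group answers by stated confidence and compare with the fraction actually right. The function name and input layout are assumptions for illustration.

```python
from collections import defaultdict


def calibration_table(responses):
    """Summarize (stated_confidence, was_correct) pairs per confidence level.

    Confidence is the probability given to the chosen answer (0.5 to 1.0 for
    two-choice questions), rounded to one decimal place for bucketing.
    Returns {confidence: (number_of_answers, fraction_correct)}.
    """
    buckets = defaultdict(list)
    for confidence, was_correct in responses:
        buckets[round(confidence, 1)].append(bool(was_correct))
    return {c: (len(hits), sum(hits) / len(hits))
            for c, hits in sorted(buckets.items())}


# Made-up session: two answers given at 60% confidence, four at 90%.
session = [(0.6, True), (0.6, False),
           (0.9, True), (0.9, True), (0.9, False), (0.9, True)]
table = calibration_table(session)
```

Here `table[0.9]` is `(4, 0.75)`, i.e. the player was right 75% of the time when claiming 90%, which would suggest overconfidence if it held up over many more questions.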

Going for a trivia database seems like your best bet. I just had a look at (first Google result for "trivia database") and the questions seem relatively neutral in terms of intuitiveness.

"What is the addictive chemical in cigarettes?" "What unit is force measured in?" "What country did King Hussein rule?"

Keep in mind that it's much harder to generate tricky questions than simple ones, so it's unlikely you'll have much cleverness in a big database unless it's extremely expensive.