Epistemic status: perspective derived from following Dan Luu's output for the last 5 years or so.  Trying to vaguely gesture at a few things at once.  Please ask questions if you find something confusing.

Dan Luu has written an interesting post analysing the track record of futurists' predictions.  The motivation:

I've been reading a lot of predictions from people who are looking to understand what problems humanity will face 10-50 years out (and sometimes longer) in order to work in areas that will be instrumental for the future and wondering how accurate these predictions of the future are. The timeframe of predictions that are so far out means that only a tiny fraction of people making those kinds of predictions today have a track record so, if we want to evaluate which predictions are plausible, we need to look at something other than track record.

The idea behind the approach of this post was to look at predictions from an independently chosen set of predictors (Wikipedia's list of well-known futurists) whose predictions are old enough to evaluate in order to understand which prediction techniques worked and which ones didn't work, allowing us to then (mostly in a future post) evaluate the plausibility of predictions that use similar methodologies.

I'm primarily going to address the appendix, particularly the section on Holden Karnofsky's analysis on the same subject, but the article is interesting reading and I'd recommend going through the whole thing.  (I think Dan is evaluating forecasting track records pretty differently from how I would, and I haven't actually dug into any of the other analysis.  On priors I'd expect it to be similar to his analysis of Holden's work.)

Karnofsky's evaluation of Kurzweil being "fine" to "mediocre" relies on these two analyses done on LessWrong and then uses a very generous interpretation of the results to conclude that Kurzweil's predictions are fine. Those two posts rate predictions as true, weakly true, cannot decide, weakly false, or false. Karnofsky then compares the number of true + weakly true to false + weakly false, which is one level of rounding up to get an optimistic result; another way to look at it is that any level other than "true" is false when read as written. This issue is magnified if you actually look at the data and methodology used in the LW analyses.

The specific claim I have an issue with here is "another way to look at it is that any level other than "true" is false when read as written".  Depending on how you want to evaluate it, it's either technically true but irrelevant[1], or not even wrong[2].

In the second post, the author, Stuart Armstrong indirectly noted that there were actually no predictions that were, by strong consensus, very true when he noted that the "most true" prediction had a mean score of 1.3 (1 = true, 2 = weakly true ... , 5 = false) and the second highest rated prediction had a mean score of 1.4. Although Armstrong doesn't note this in the post, if you look at the data, you'll see that the third "most true" prediction had a mean score of 1.45 and the fourth had a mean score of 1.6, i.e., if you round to the nearest prediction score, only 3 out of 105 predictions score "true" and 32 are >= 4.5 and score "false". Karnofsky reads Armstrong's as scoring 12% of predictions true, but the post effectively makes no comment on what fraction of predictions were scored true and the 12% came from summing up the total number of each rating given.

I'm not going to say that taking the mean of each question is the only way one could aggregate the numbers (taking the median or modal values could also be argued for, as well as some more sophisticated scoring function, an extremizing function, etc.), but summing up all of the votes across all questions results in a nonsensical number that shouldn't be used for almost anything. If every rater rated every prediction or there was a systematic interleaving of who rated what questions, then the number could be used for something (though not as a score for what fraction of predictions are accurate), but since each rater could skip any questions (although people were instructed to start rating at the first question and rate all questions until they stop, people did not do that and skipped arbitrary questions), aggregating the number of each score given is not meaningful and actually gives very little insight into what fraction of questions are true.

This seems like a basically accurate description of the methodology used in the 2019 assessment.  Stuart Armstrong says in a footnote that removing 4 of the 34 assessors who had gaps in their predictions didn't change any of the results, but I don't expect this would address Dan's primary criticism.  I performed my own data cleaning, and removed 9 of the assessors who had substantial gaps in their predictions (there are 2 left who have "any" missing predictions).  In both cases, the results obtained on mean (<2 and <1.5), median, and mode[3] are identical:

MEAN < 2: 8
MEAN < 1.5: 3
MEDIAN == 1: 6
MODE == 1: 14

(Quantities absolute, divide by 105 questions.)
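To make the difference between these aggregations concrete, here is a minimal sketch using made-up grades rather than the actual 2019 data. The `ratings` dictionary, the question names, and the helper functions are all hypothetical; only the 1 (true) to 5 (false) scale and the four thresholds come from the discussion above.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical grades per prediction (1 = true ... 5 = false).
# Raters may skip questions, so the lists have different lengths.
ratings = {
    "q1": [1, 1, 2, 1],
    "q2": [1, 1, 1, 4, 5, 5],  # modal grade 1, but a mean of about 2.8
    "q3": [2, 2, 3],
    "q4": [5, 5, 4, 5, 5, 5, 5, 5],
}

def mode_of(grades):
    """Most common grade; ties go to the lowest (most 'true') grade."""
    counts = Counter(grades)
    best = max(counts.values())
    return min(g for g, c in counts.items() if c == best)

def per_question_counts(ratings):
    """Count predictions that pass each per-question threshold."""
    return {
        "mean < 2": sum(mean(g) < 2 for g in ratings.values()),
        "mean < 1.5": sum(mean(g) < 1.5 for g in ratings.values()),
        "median == 1": sum(median(g) == 1 for g in ratings.values()),
        "mode == 1": sum(mode_of(g) == 1 for g in ratings.values()),
    }

def pooled_share_of_ones(ratings):
    """Share of all individual votes that were a 1, ignoring which question each was cast on."""
    votes = [g for grades in ratings.values() for g in grades]
    return votes.count(1) / len(votes)

counts = per_question_counts(ratings)
pooled = pooled_share_of_ones(ratings)
print(counts)
print(round(pooled, 2))
```

In this toy data, two of four predictions have a modal grade of 1 but only one has a mean under 2, while the pooled share of 1-votes depends on how many raters happened to grade each question. That dependence is the reason the pooled number says little about what fraction of predictions are true.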

I think requiring a mean under 1.5 to decide something is "true" puts too much weight in the hands of outliers who are either interpreting the prediction differently from the rest, or are simply wrong as a matter of fact[4].

With that said, I think deferring to these aggregate evaluations at all is a mistake.  It seems like Dan agrees, though for reasons that I disagree with:

Another fundamental issue with the analysis is that it relies on aggregating votes of a kind from Less Wrong readers and the associated community. As we discussed here, it's common to see the most upvoted comments in forums like HN, lobsters, LW, etc., be statements that can clearly be seen to be wrong with no specialized knowledge and a few seconds of thought (and an example is given from LW in the link), so why should an aggregation of votes from the LW community be considered meaningful?

I can't actually look at the paywalled source, but putting aside the accuracy of the "most upvoted comments" on LessWrong, the graders were not randomly selected from those LessWrong users who make heavily upvoted comments.  Nearly half of them were publicly named, and many of those have extensively documented their thoughts online.  One could, if desired, go read some of their writings, and judge their epistemics for oneself.

Of the predictions which had a modal grade of 1, I personally consider 8 of them to be true, and maybe 5-6 to be non-trivial[5].  I think Dan would consider some of them - perhaps even most - to be insufficiently rigorously specified to grade.

Dan seems to have an unusual knack for noticing inconsistencies[6] and $20 bills on the sidewalk.  His work sometimes seems to avoid performing inside-view analysis[7], which can make engaging with it a bit tricky.  It does seem to pay off in cases like this - I don't have time to dig into it right now, but in the appendix, he also linked to a post by nostalgebraist addressing the initial Bio Anchors report that seems worth following up on.

  1. Someone in 1960 predicting an unlikely outcome in 2010, and that outcome actually occurring in 2011, is technically "wrong", but a very different kind of wrong from someone in 1960 predicting an unlikely outcome in 1965 but that outcome not having occurred yet at all.

  2. Other people's evaluations of predictions are not, in fact, especially solid pointers to the truth-value of those predictions.  Given the subject of the article I think Dan probably appreciates this point.

  3. Three of the basic aggregations suggested by Dan as being even minimally informative.

  4. The fact that the modal grade for question 47 (about cochlear implants) was a 1, while the mean was ~2.5 (with many 4s and 5s), is mostly an indication that the prediction was underspecified, and that the graders in question had very different ideas in mind of what "very effective" and "widely used" meant (or had similar ideas but didn't bother looking up the actual numbers).

  5. In the sense that most people were not making similar predictions at the time, and priors on those predictions were probably low.

  6. Quote: "I find it a bit odd that, with all of the commentary of these LW posts, few people spent the one minute (and I mean one minute literally - it took me a minute to read the post, see the comment Armstrong made which is a red flag, and then look at the raw data) it would take to look at the data and understand what the post is actually saying, but as we've noted previously, almost no one actually reads what they're citing."

  7. For reasons that at least aren't obviously wrong, though I think foregoing an inside-view opinion while simultaneously delivering an outside-view refutation is not enormously productive.

9 comments

I bought a subscription and tracked down the offending LW comment:

Another fundamental issue with the analysis is that it relies on aggregating votes of a kind from Less Wrong readers and the associated community. As we discussed here, it's common to see the most upvoted comments in forums like HN, lobsters, LW, etc., be statements that can clearly be seen to be wrong with no specialized knowledge and a few seconds of thought (and an example is given from LW in the link), so why should an aggregation of votes from the LW community be considered meaningful?

He doesn't actually give a link to a LW comment, but he describes it. He says Jeff Kaufman asked why there are so few 6-door cars, and the top comment said that doors are an expensive part of the car, but this is obviously false because (a) they can't be thousands of dollars each and that's what it would take to make them a noticeable fraction of the cost, and (b) if that were true we'd expect cheap cars to more often have two doors instead of four, but instead cheap cars usually have four doors and if anything it's the expensive sports cars that have two.

Tracking down the original post, it appears to be this one. Top comment is roughly as described. It has 5 upvotes. 

What's my overall take? Well, I don't think the explanation is as obviously false as Dan Luu thinks. But I do agree that a and b are good objections to it.

It's not obvious to me that (a,b) are such good objections, and furthermore the comment doesn't just say "it's because doors are expensive".

The comment in question says: doors cost money and worsen crash safety, so cars with more doors would cost more, and most buyers don't value the extra convenience enough to justify what it would cost them, so the market would be small which would make it not worth the development cost.

I agree that it's unlikely that the cost of an extra pair of doors is multiple thousands of dollars per car. But the price is, let's say, 3x the cost, and I don't have any trouble believing that an extra pair of doors might increase the cost by say $700, meaning a price $2000 higher. Is this "clearly seen to be wrong with no specialized knowledge"? Doesn't seem so. So, then the question is how willing buyers with large families would be to pay an extra $2000 for the convenience of a third door. Again, it's not clear to me that this wouldn't hurt the sellability of the vehicle. The idea that making such a vehicle as safe as everyone expects these days could be difficult also seems plausible to me. I would expect that families with multiple smallish children are (1) the main potential market for 6-door cars, (2) very safety-conscious, and (3) often very price-conscious.

I guess point (b) is meant to undermine the idea that adding doors increases the cost at all, or something. I'm not convinced by this. Isn't it plausible that the gain in convenience in going from 2 to 4 doors is substantially bigger than the gain in going from 4 to 6? Or that the loss in safety in going from 2 to 4 is smaller than going from 4 to 6? ... And some cheap-and-nasty cars have had only two doors. For instance, if I try to think of a cheap and crappy car, the first thing that comes to mind is the old Reliant Robin: four seats, two doors. (Also, three wheels.)

I don't know whether that comment is right. But I don't see how danluu reckons it's obviously wrong with a few seconds of thought. I wonder whether danluu might change his mind about its obvious wrongness if he thought about it for more than a few seconds.

Oh yeah, I should have clarified that I agree with your take -- Dan seems totally wrong to take this comment as significant negative evidence about the epistemic standards of LW. Waaaay too big of a stretch. It's just one comment with 5 upvotes, plus it's not obviously wrong & may even be right.

Credit where it is due. It's a good article.

On the first read I was annoyed at the post for criticizing futurists for being too certain in their predictions, while it also throws out and refuses to grade any prediction that expressed uncertainty, on the grounds that saying something "may" happen is unfalsifiable.

On reflection these two things seem mostly unrelated, and for the purpose of establishing a track record "may" predictions do seem strictly worse than either predicting confidently (which allows scoring % of predictions right), or predicting with a probability (which none of these futurists did, but allows creating a calibration curve).
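As a rough illustration of why both alternatives are scoreable where "may" predictions are not, here is a minimal sketch with entirely made-up data. The `forecasts` list and both helper functions are hypothetical and do not correspond to any futurist's actual record.

```python
from collections import defaultdict

# Hypothetical (stated probability, did it happen?) pairs; not real futurist data.
forecasts = [
    (0.9, True), (0.9, True), (0.9, False), (0.9, True),
    (0.6, True), (0.6, False), (0.6, True),
    (0.2, False), (0.2, False), (0.2, True), (0.2, False), (0.2, False),
]

def calibration_curve(forecasts):
    """Map each stated probability to the observed frequency of the outcome."""
    buckets = defaultdict(list)
    for prob, happened in forecasts:
        buckets[prob].append(happened)
    return {prob: sum(hits) / len(hits) for prob, hits in sorted(buckets.items())}

def hit_rate(outcomes):
    """Fraction correct for confident predictions made without probabilities."""
    return sum(outcomes) / len(outcomes)

curve = calibration_curve(forecasts)
print(curve)
print(hit_rate([True, False, False, True]))
```

A well-calibrated forecaster's curve stays close to the diagonal (stated 0.2 happens about 20% of the time, and so on). A prediction that something "may" happen supplies neither a probability to bucket nor a definite claim to count as a hit or miss.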

An interesting section in the appendices, a criticism of Ajeya Cotra’s “Forecasting Transformative AI with Biological Anchors”:

If you do a sensitivity analysis on the most important variable (how much Moore's law will improve FLOPS/$), the output behavior doesn't make any sense, e.g., Moore's law running out of steam after "conventional" improvements give us a 144x improvement would give us a 34% chance of transformative AI (TAI) by 2100, a 144*6x increase gives a 52% chance, and a 144*600x increase gives a 66% chance (and with the predicted 60000x improvement, there's a 78% chance), so the model is, at best, highly flawed unless you believe that going from a 144x improvement to a 144*6x improvement in computer cost gives almost as much increase in the probability of TAI as a 144*6x to 144*60000x improvement in computer cost.

The part about all of this that makes this fundamentally the same thing that the futurists here did is that the estimate of the FLOPS/$ which is instrumental for this prediction is pulled from thin air by someone who is not a deep expert in semiconductors, computer architecture, or a related field that might inform this estimate.


If you say that, based on your intuition, you think there's some significant probability of TAI by 2100; 10% or 50% or 80% or whatever number you want, I'd say that sounds plausible but wouldn't place any particular faith in the estimate. But if you take a model that produces nonsense results and then pick an arbitrary input to the model that you have no good intuition about to arrive at an 80% chance, you've basically picked a random number that happens to be 80%.

The claim that the probability goes from 34% -> 52% from a 6x of compute does sound pretty weird! But I think it's just based on a game of telephone and a complete misunderstanding.

I was initially confused where the number came from, then I saw the reference to Nostalgebraist's post. They say that "Assume a 6x extra speedup, and you get a 52% chance. (Which is still pretty high, to be fair.) Assume no extra speedup, and also no speedup at all, just the same computers we have now, and you get a 34% chance … wait, what?!"

Nostalgebraist is saying that you move from 34% to 52% by moving from 1x to 144*6x, not by moving from 144x to 144*6x. That is, if you increase compute by about 3 OOMs you increase the probability from 34% to 52%.

Similarly, if you increase compute from 144*6x to 144*60000x, or about 4 OOMs, you increase the probability from 52% to 78%.

So 3 OOMs is 18% and 4 OOMs is 26%, roughly proportional as you'd expect given the nature of the model. The report basically distributes TAI over 20 OOMs and so a 3 OOM increase covers about 3/20th of the range.
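The arithmetic here can be checked with a short sketch. The compute multipliers and probabilities are the ones quoted in this thread; the `points` list and the `per_oom_gains` helper are my own hypothetical framing, not anything taken from the report itself.

```python
import math

# (compute multiplier, probability of TAI by 2100) as quoted in the thread.
points = [
    (1, 0.34),            # same computers we have now, no speedup at all
    (144 * 6, 0.52),      # 144x "conventional" gains plus a 6x extra speedup
    (144 * 60000, 0.78),  # 144x plus the predicted 60000x improvement
]

def per_oom_gains(points):
    """Return (OOMs gained, probability gained, gain per OOM) for each consecutive pair."""
    rows = []
    for (c0, p0), (c1, p1) in zip(points, points[1:]):
        ooms = math.log10(c1 / c0)
        rows.append((ooms, p1 - p0, (p1 - p0) / ooms))
    return rows

rows = per_oom_gains(points)
for ooms, gain, rate in rows:
    print(f"{ooms:.2f} OOMs -> +{gain:.0%} ({rate:.1%} per OOM)")
```

Both steps come out at a bit over 6 percentage points per OOM, which is what "roughly proportional" means here. The jump only looks anomalous if the 34% figure is attached to the 144x point instead of the 1x point.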

But if you take a model that produces nonsense results and then pick an arbitrary input to the model that you have no good intuition about to arrive at an 80% chance, you've basically picked a random number that happens to be 80%.

If you get a nonsensical number out of a model, I think it's worth reflecting more on whether there was a misunderstanding.

Aside from this, calling "how far does Moore's law go" the most important variable seems kind of overstated. The criticism is that 7 orders of magnitude in this parameter leads to a change from 34% to 78%. I agree that's a significant difference, but 7 orders of magnitude is a lot of uncertainty in this parameter, and I don't think that's grounds for saying that it's the number that drives the whole estimate. And even after 7 OOMs these estimates aren't even that different in an action-relevant way---in particular this change doesn't result in a similarly-dramatic change for your 10 year or 20 year timelines, and shifting your 100 year TAI probability from 55% to 78% is not a huge deal.

And aside from that, saying that the estimates for Moore's law are arbitrary isn't right. I think it's totally fair that Ajeya isn't an expert, but that doesn't mean that things are totally unknown within 7 orders of magnitude. At the upper end things are pretty constrained by basic physics, at the lower end things are pretty constrained by normal technological extrapolation. There's a ton of uncertainty left but it's just not a big deal relative to the uncertainty about AI training.

The overall estimate is basically driven by the fact that a broad distribution over horizon lengths in the existing NN extrapolation gives you a similar range of estimates to the entire space from human lifetime to human evolution. So it's very easy to squint and get a broad distribution with around 5% probability per OOM of compute (which is a couple percent per year right now). The criticism of this that seems most plausible to me is that maybe inside-view you can just eyeball how good AI systems are and how close they are to transformative effects and it's just not that far. That said, the second most plausible criticism (especially about the 20%+ short-term predictions) is that you can eyeball how good AI systems are and it's probably not that close.

(Disclaimer: this report is written by my wife and so I may be biased.)

FWIW I'm not married to Ajeya and I agree with you; I was pretty disappointed by Nostalgebraist's post & how much positive reception it seemed to get. I've been thinking about writing up a rebuttal. Most of what I'd say is what you've already said here though, so yay.

I pointed out the OOM error on Twitter (citing this comment), and Dan has updated the post with a correction.