Imagine a financial pundit who keeps saying "Something really bad is brewing in the markets and we may be headed for a recession.  But we can't know when recessions will come; nobody can predict them."  Then, every time there is a selloff in the market, they tell everyone "I've been saying we were headed for trouble", taking credit.  This doesn't work as a forecasting track record, and it shouldn't be thought of that way.

If they want forecaster prestige, their forecasts must be:

  1. Pre-registered,
  2. So unambiguous that people actually agree whether the event "happened",
  3. With probabilities and numbers so we can gauge calibration,
  4. And include enough forecasts that it's not just a fluke or cherry-picking.
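To make requirement 3 concrete: once forecasts carry explicit probabilities, they can be scored mechanically, for example with a Brier score.  Here is a minimal sketch in Python (the function and the tiny sample track record are my own illustration, not anything from the post):

```python
def brier_score(forecasts):
    """Mean squared error between stated probabilities and outcomes.

    forecasts: list of (probability, outcome) pairs, where outcome is
    1 if the pre-registered event happened and 0 if it did not.
    Lower is better; always answering 0.5 scores exactly 0.25.
    """
    return sum((p - o) ** 2 for p, o in forecasts) / len(forecasts)

# A small, hypothetical track record: three resolved questions.
track_record = [(0.9, 1), (0.7, 0), (0.2, 0)]
print(brier_score(track_record))  # 0.18
```

Note that the score only means anything if requirements 1 and 2 hold first: the probabilities were registered before resolution, and everyone agrees on each outcome bit.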

When Eliezer Yudkowsky talks about forecasting AI, he has several times ~~claimed~~ implied he has a great forecasting track record.  But a meaningful "forecasting track record" has well-known and very specific requirements, and Eliezer doesn't meet them.

Here he dunks on the Metaculus predictors as "excruciatingly predictable" about a weak-AGI question, saying that he is a sane person with self-respect (implying the Metaculus predictors aren't):

To be a slightly better Bayesian is to spend your entire life watching others slowly update in excruciatingly predictable directions that you jumped ahead of 6 years earlier so that your remaining life could be a random epistemic walk like a sane person with self-respect.

I wonder if a Metaculus forecast of "what this forecast will look like in 3 more years" would be saner. Is Metaculus reflective, does it know what it's doing wrong?

He clearly believes he could be placing forecasts that would show whether or not he is better.  Yet he doesn't.

Some have argued "but he may not have time to keep up with the trends; forecasting is demanding".  But he's the one making a claim about relative accuracy! And this is in the domain he says is the most important one of our era.  And he seems to already be keeping up with the trends -- so he could just submit his distribution.

And here he dunks on Metaculus predictors again:

What strange inputs other people require instead of the empty string, to arrive at conclusions that they could have figured out for themselves earlier; if they hadn't waited around for an obvious whack on the head that would predictably arrive later. I didn't update off this.

But he still isn't transparent about his own forecasts, which prevents a fair comparison.

In another context, Paul Christiano offered to bet Eliezer about AI timelines.  This is great, a bet is a tax on bullshit.  While it doesn't show a nice calibration chart like on Metaculus, it does give information about performance.  You would be right to be fearful of betting against Bryan Caplan.  And to Eliezer's great credit, he has actually made a related bet with Bryan!  EDIT: Note that Eliezer also agreed to this bet with Paul.

But at one point in responding to Paul, Eliezer mentions some nebulous, unscorable debates and claims:

I claim that I came off better than Robin Hanson in our FOOM debate compared to the way that history went.  I'd claim that my early judgments of the probable importance of AGI, at all, stood up generally better than early non-Yudkowskian EA talking about that.

Nothing about this is a forecasting track record.  These are post-hoc opinions.  There are unavoidable reasons we require pre-registration of forecasts, removal of definitional wiggle room, explicit numbers, and a decent sample.  This response sounds like the financial pundit saying he called the recession.

EDIT: I think some people are thinking that Eliezer was an unambiguous winner of that debate, and therefore this works as part of a forecasting track record.  But you can see examples of why it's far more ambiguous than that in this comment by Paul Christiano.

In this comment, Eliezer said Paul didn't need to bet him, and that Paul is...lacking a forecasting track record.

I think Paul doesn't need to bet against me to start producing a track record like this; I think he can already start to accumulate reputation by saying what he thinks is bold and predictable about the next 5 years; and if it overlaps "things that interest Eliezer" enough for me to disagree with some of it, better yet.

But Eliezer himself doesn't have a meaningful forecasting track record.

In other domains, where we have more practice detecting punditry tactics, we would dismiss such an uninformative "track record".  We're used to hearing Tetlock talk about ambiguity in political statements.  We're used to hearing about a financial pundit like Jim Cramer underperforming the market.  But AI timelines are a novel domain.

When giving "AGI timelines", I've heard several EAs claim there are no ambiguity risks for the forecast resolution.  They think this because the imagery in their heads is dramatic, and we'll just know if they were right.  No we won't.  This shows wild overconfidence in scenarios they can imagine, and overconfidence in how powerful words are at distinguishing.

Even "the AGI question" on Metaculus had some major ambiguities that could've prevented resolution.  Matthew Barnett nicely proposed solutions to clarify them.  Many people talking about AI timelines should find this concerning.  Because they make "predictions" that aren't defined anywhere near as well as that question.  It's okay for informal discussions to be nebulous.  But while nebulous predictions sound informative, it takes years before it's obvious that they were meaningless.

So why won’t Eliezer use the ways of Tetlock? He says this:

I consider naming particular years to be a cognitively harmful sort of activity; I have refrained from trying to translate my brain's native intuitions about this into probabilities, for fear that my verbalized probabilities will be stupider than my intuitions if I try to put weight on them.  What feelings I do have, I worry may be unwise to voice; AGI timelines, in my own experience, are not great for one's mental health, and I worry that other people seem to have weaker immune systems than even my own.  But I suppose I cannot but acknowledge that my outward behavior seems to reveal a distribution whose median seems to fall well before 2050.

He suggests that if he used proper forecasting methods, it would hurt people's mental health.  But Eliezer seems perfectly willing to frame his message as blatant fearmongering.  For years he's been telling people they are doomed, and often suggests they are intellectually flawed if they don't agree.  To me, he doesn't come across like he's sparing me an upsetting truth.  He sounds like he's catastrophizing, which isn't what I expect to see in a message tailored for mental health.

I'm not buying speculative infohazard arguments, or other "reasons" to obfuscate.  If Eliezer thinks he has detected an imminent world-ending danger to humans, then the best approach would probably be to give a transparent, level-headed assessment.

...for fear that my verbalized probabilities will be stupider than my intuitions if I try to put weight on them

Well, with practice he would improve at verbalized probabilities, as Tetlock found.  Also, how does he expect to know if his intuitions are stupid, if he doesn't test them against reality? Sure, it would probably make him seem much less prescient.  But that's good, if it's more objective and real.

And no, his domain eminence isn't much of an update.  The forecasting edge from being an expert is generally pretty underwhelming, however special you think AI is.  Maybe even less so if we consider him a relatively famous expert.  Does anyone predict they can dominate the long-term question leaderboards by having insights and skipping proper forecasting practice? This is wishful thinking.

One justification I've heard: "Shorter-term questions can't show how good his judgment is about longer-term questions".  This seems like a rationalization.  Suppose you have two groups: those who show good calibration on 3-year AI questions, and those who don't.  In many cases, both groups will end up being dart-throwing chimps on 30-year AI questions.  But this hardly justifies not even trying to do it properly.  And if some do outperform on the long-term questions, they are far more likely to come from the group that was at least calibrated on 3-year questions than from the group that never demonstrated calibration.  It's easy to have an outcome where the uncalibrated do even worse than a dart-throwing chimp.
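Calibration on those 3-year questions is mechanical to check once the forecasts exist: bucket resolved questions by stated probability and compare each bucket's average stated probability to the fraction that actually resolved yes.  A minimal sketch, with hypothetical data (none of this comes from Metaculus itself):

```python
from collections import defaultdict

def calibration_table(forecasts, bucket_width=0.1):
    """Group (probability, outcome) pairs into probability buckets and
    report how often events in each bucket actually happened."""
    buckets = defaultdict(list)
    for p, outcome in forecasts:
        # index of the bucket this probability falls into (cap at the top bucket)
        idx = min(int(p / bucket_width), int(1 / bucket_width) - 1)
        buckets[idx].append((p, outcome))
    table = {}
    for idx, pairs in sorted(buckets.items()):
        avg_p = sum(p for p, _ in pairs) / len(pairs)
        freq = sum(o for _, o in pairs) / len(pairs)
        table[round(idx * bucket_width, 1)] = (round(avg_p, 2), round(freq, 2), len(pairs))
    return table

# Hypothetical resolved forecasts from one forecaster.
data = [(0.72, 1), (0.75, 1), (0.71, 0), (0.25, 0), (0.22, 1)]
for lo, (avg_p, freq, n) in calibration_table(data).items():
    print(f"bucket {lo:.1f}: avg stated {avg_p}, observed {freq}, n={n}")
```

A well-calibrated forecaster's buckets track the diagonal (things they call 70% happen about 70% of the time); that is exactly the graph someone could start building on short-horizon questions before anyone trusts their 30-year numbers.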

If you would like to have some chance at forecasting AI timelines, here are a couple paths.  1) Good generalist forecasters can study supplemental domain material.  2) Non-forecaster domain experts can start building a calibration graph of proper forecasts.  Those are basically the options.

People who avoid forecasting accountability shouldn't boast about their forecasting performance.  And other people shouldn't rationalize it.  I thought Eliezer did great betting with Bryan.  Before dunking on properly-scored forecasts, he should be transparent, create a public Metaculus profile, place properly-scored forecasts, and start getting feedback.

Thank you to KrisMoore, Linch, Stefan Schubert, Nathan Young, Peterwildeford, Rob Lee, Ruby, and tenthkrige for suggesting changes.


he has several times claimed to have a great forecasting track record.

This seems like an unfair exaggeration, going off the quotes you pulled.

To help me understand, would you endorse that "he claims to be better than most Metaculus predictors"?

Yes, but I don't think he claims to have a better forecasting track record than them. I think he would say he is epistemically better in general, but as you say he doesn't participate on Metaculus, he barely has any track record to speak of, so he'd have to be pretty delusional to think his track record is better.

I too would claim such a thing, or something similar at least -- I'd say that my forecasts about AGI are better than the typical Metaculus forecast about AGI; however, I would not claim to have a great forecasting track record or even a better forecasting track record than Metaculus, because (a) I don't have much of a track record at all, and (b) there are lots of other non-AGI questions on metaculus and on those questions I expect to do worse than Metaculus on average, lacking expertise as I do. (Alas, the AGI questions have mostly not resolved yet and will not resolve for some years, so we can't just check those.)

Yes, I agree with the points you make about e.g. the importance of track records, the importance of betting, etc. etc. No, I don't expect you to take my word for anything (or Yudkowsky's). Yes, I think it's reasonable for outsiders / people who aren't super familiar with the literature on AI to defer to Metaculus instead of me or Yudkowsky. 

Perhaps this explains my position better:

If I saw a Yudkowsky tweet saying "I have a great forecasting track record" or "I have a better forecasting track record than Metaculus" my immediate reaction would be "Lol no you don't fuck off." When I read the first few lines of your post, I expected to shortly see a pic of such a tweet as proof. In anticipation my "lol fuck you Yudkowsky" reaction already began to rise within me.

But then when I saw the stuff you actually quoted, it seemed... much more reasonable? In particular, him dumping on Metaculus for updating so hard on Gato seemed... correct? Metaculus really should have updated earlier, Gato just put together components that were already published in the last few years. So then I felt that if I had only skimmed the first part of your post and not read the actual post, I would have had an unfairly negative opinion of Yudkowsky, due to the language you used: "He has several times claimed to have a great forecasting track record."

For what it's worth, I agree that Yudkowsky is pretty rude and obnoxious & that he should probably get off Twitter if this is how he's going to behave. Like, yes, he has alpha about this AI stuff; he gets to watch as the "market" gradually corrects and converges to his position. Yay. Good for him. But he's basically just stroking his own ego by tweeting about it here; I don't see any altruistic purpose served by it.

I am a forecaster on that question: the main doubt I had was if/when someone would try to do wordy things + game playing on a "single system". Seemed plausible to me that this particular combination of capabilities never became an exciting area of research, so the date at which an AI can first do these things would then be substantially after this combination of tasks would be achievable with focused effort. Gato was a substantial update because it does exactly these tasks, so I no longer see much possibility that the benchmark is achieved only after the capabilities are substantially overshot.

I also tend to defer somewhat to the community.

I was at 2034 when the community was at 2042, and I updated further to 2026 on the Gato news.

That's good feedback.  I can see why the wording I used gives the wrong impression -- he didn't literally say out loud that he has "a great forecasting track record".  It still seems to me heavily implied by several things he's said, especially what he said to Paul.

I think the point you raise is valid enough.  I have crossed out the word "claimed" in the essay, and replaced it with "implied".

Daniel Kokotajlo (2y): OK, thanks!
Well, when he says something like this: ...He's saying something notably positive about some sort of track record.  That plus the comments he made about the Metaculus updates, and he clearly thinks he's been doing well.  Yes, he doesn't have a track record on Metaculus (I'm not even aware of him having a profile).  But if I just read what he writes and see what he's implying, he thinks he's doing much better at predicting events than somebody, and many of those somebodys seem to be people closer to Hanson's view, and also seem to be Metaculus predictors. Also, perhaps I'm using the word "great" more informally than you in this context.

As an example of the kind of point that one might use in deciding who "came off better" in the FOOM debate, Hanson predicted that "AIs that can parse and use CYC should be feasible well before AIs that can parse and use random human writings", which seems pretty clearly falsified by large language models—and that also likely bears on Hanson's view that "[t]he idea that you could create human level intelligence by just feeding raw data into the right math-inspired architecture is pure fantasy".

As you point out, however, this exercise of looking at what was said and retrospectively judging whose worldview seemed "less surprised" by what happened is definitely not the same thing as a forecasting track record. It's too subjective; rationalizing why your views are "less surprised" by what happened than some other view (without either view having specifically predicted what happened), is not hugely more difficult than rationalizing your views in the first place.

There was a lot of other stuff in that debate.

I think the passage you quote there is just totally correct though. If you turn the clock back ten years or more to when all that stuff was happening, Yudkowsky was at the "AGI is really important and coming sooner than you think" end of the spectrum, the other side seemed to be "AGI is either not ever going to be a thing, or not ever going to be important", and the median opinion was something like "Plausibly it'll be an important thing but it's coming 50 - 100 years from now." At least that's my impression from the 9-ish years I've been lurking on LW and the 7-ish years I've been talking to people in the community. (gosh I'm old.)

In the passage you quote I interpret Yud as saying that when you compare his claims about AGI back then to claims that other rationalists and EAs were making, people like Hanson, with the benefit of hindsight his claims look closer to the truth. I think that's correct. Of course the jury is still out, since most of the claims on both sides were about things that haven't happened yet (AGI is still not here), but e.g. it's looking pretty unlikely that uploads/ems will come first, it's looking pretty unlikely that AGI will be an accumulation of specialized modules built by different subcontractors (like an f-35 fighter jet lol), it's looking pretty likely that it'll happen in the 20's or 30's instead of the 60's or 70's... most of all, it's looking pretty likely that it'll be a Big Deal, something we all should be thinking about and preparing for now.

On overall optimism it seems clear that Eliezer won---Robin seems unusually bad, while Eliezer seems unusually good. I also think on "domain-specific engineering" vs "domain-general engineering" Eliezer looks unusually good while Robin looks typical.

But I think there are also comparably-important substantive claims that look quite bad. I don't think Eliezer has an unambiguous upper hand in the FOOM debate at all:

  • The debate was about whether a small group could quickly explode to take over the world. AI development projects are now billion-dollar affairs and continuing to grow quickly, important results are increasingly driven by giant projects, and 9 people taking over the world with AI looks if anything even more improbable and crazy than it did then. Now we're mostly talking about whether a $10 trillion company can explosively grow to $300 trillion as it develops AI, which is just not the same game in any qualitative sense. I'm not sure Eliezer has many precise predictions he'd stand behind here (setting aside the insane pre-2002 predictions), so it's not clear we can evaluate his track record, but I think they'd look bad if he'd made them. This is really one of the foundational c
... (read more)

Robin on AI timelines just seems particularly crazy. We can't yet settle the ems vs de novo AI bet, but I think the writing is on the wall, and his forecasting methodology for the 300 year timeline seems so crazy---ask people in a bunch of fields "how far have you come to human level, is it speeding up?" and then lean entirely on that (I think many of the short-term predictions are basically falsified now, in that if you ask people the same question they will give much higher percentages and many of the tasks are solved).

ETA: Going through the oldest examples from Robin's survey to see how the methodology fares:

  • Melanie Mitchell gives 5% progress in 20 years towards human-level analogical reasoning. But the kinds of string manipulation used in Mitchell's copycat problem seem to be ~totally solved by the current version of the OpenAI API. (I tried 10 random questions from this list, and the only one it got wrong was "a -> ab, z -> ?", where it said "z -> z b" instead of what I presume was the intended "z -> z y". And in general it seems like we've come quite a long way.)
  • Murray Shanahan gives 10% progress on "knowledge representation" in 20 years, but I don't know what
... (read more)

Regarding the weird mediocrity of modern AI, isn't part of this that GPT-3-style language models are almost aiming for mediocrity? 

Would a hypothetical "AlphaZero of code" which built its own abstractions from the ground up - and presumably would not reinvent Python (AlphaCode is cool and all, but it does strike me as a little absurd to see an AI write Python) - have this property? 

Game-playing AI is also mediocre, as are models fine-tuned to write good code. 100B parameter models trained from scratch to write code (rather than to imitate human coders) would be much better but would take quite a lot longer to train, and I don't see any evidence that they would spend less time in the mediocre subhuman regime (though I do agree that they would more easily go well past human level).
Also this.

The debate was about whether a small group could quickly explode to take over the world. AI development projects are now billion-dollar affairs and continuing to grow quickly, important results are increasingly driven by giant projects, and 9 people taking over the world with AI looks if anything even more improbable and crazy than it did then.

Maybe you mean something else there, but wasn't Open AI like 30 people when they released GPT-2 and maybe like 60 when they released GPT-3? This doesn't seem super off from 9 people, and my guess is there is probably a subset of 9 people that you could poach from OpenAI that could have made 80% as fast progress on that research as the full set of 30 people (at least from talking to other people at OpenAI, my sense is that contributions are very heavy-tailed)? 

Like, my sense is that cutting-edge progress is currently made by a few large teams, but that cutting-edge performance can easily come from 5-10 person teams, and that if we end up trying to stop race-dynamics, that the risk from 5-10 person teams would catch up pretty quickly with the risk from big teams, if the big teams halted progress. It seems to me that if I sat down with 8 ot... (read more)

GPT-2 is very far from taking over the world (and was indeed <<10 people). GPT-3 was bigger (though still probably <10 people depending how you amortize infrastructure), and remains far from taking over the world. Modern projects are >10 people, and still not yet taking over the world.  It looks like it's already not super plausible for 10 people to catch up, and it's rapidly getting less plausible. The prediction isn't yet settled, but neither are the predictions in Eliezer's favor, and it's clear which way the wind blows.

These projects are well-capitalized, with billions of dollars in funding now and valuations rapidly rising (though maybe a dip right now with tech stocks overall down ~25%). These projects need to negotiate absolutely massive compute contracts, and lots of the profit looks likely to flow to compute companies. Most of the work is going into the engineering aspects of these projects. There are many labs with roughly-equally-good approaches, and no one has been able to pull much ahead of the basic formula---most variation is explained by how big a bet different firms are willing to make.

Eliezer is not talking about 10 people making a dominant AI bec... (read more)

Historical track record of software projects is that it's relatively common that a small team of ~10 people outperforms 1000+ person teams. Indeed, I feel like this is roughly what happened with Deepmind and OpenAI. I feel like in 2016 you could have said that current AGI projects already have 500+ employees and are likely to grow even bigger and so it's unlikely that a small 10-person team could catch up, and then suddenly the most cutting-edge project was launched by a 10-person team. (Yes, that 10 person team needed a few million dollars, but a few million dollars are not that hard to come by in the tech-sector).

My current guess is that we will continue to see small 10-person teams push the cutting-edge forward in AI, just as we've seen the same in most other domains of software.

In addition to 10 people, the view "you can find a better way to build AI that's way more efficient than other people" is also starting to look increasingly unlikely as performance continues to be dominated by scale and engineering rather than clever ideas.

I do agree with this in terms of what has been happening in the last few years, though I do expect this to break down as we see more things in the "le... (read more)

I think "team uses Codex to be 3x more productive" is more like the kind of thing Robin is talking about than the kind of thing Eliezer is talking about (e.g. see the discussion of UberTool, or just read the foom debate overall). And if you replace 3x with a more realistic number, and consider the fact that right now everyone is definitely selling that as a product rather than exclusively using it internally as a tool, then it's even more like Robin's story.

Everyone involved believes in the possibility of tech startups, and I'm not even sure if they have different views about the expected returns to startup founders. The 10 people who start an AI startup can make a lot of money, and will typically grow to a large scale (with significant dilution, but still quite a lot of influence for founders) before they make their most impressive AI systems.

I think this kind of discussion seems pretty unproductive, and it mostly just reinforces the OP's point that people should actually predict something about the world if we want this kind of discussion to be remotely useful for deciding how to change beliefs as new evidence comes in (at least about what people / models / reasoning strategies w... (read more)

Yeah, I think this is fair. I'll see whether I can come up with some good operationalizations.
Possible counterevidence (10 months later)? The GPT-4 contributors list includes almost 300 names.[1]

[1] Methodology: I copied text from the contributors page (down to just before it says "We also acknowledge and thank every OpenAI team member"), used some quick Emacs keyboard macros to munge out the section headers and non-name text (like "[topic] lead"), deduplicated and counted in Python (and subtracted one for a munging error I spotted after the fact), and got 290. Also, you might not count some sections of contributors (e.g., product management, legal) as relevant to your claim.
Yep, that is definitely counterevidence! Though my model did definitely predict that we would also continue seeing huge teams make contributions, but of course each marginal major contribution is still evidence. I have more broadly updated against this hypothesis over the past year or so, though I still think there will be lots of small groups of people quite close to the cutting edge (like less than 12 months behind).

Currently the multiple on stuff like better coding tools and setting up development to be AI-guided has just barely entered the stage where it feels plausible that a well-set-up team could just completely destroy large incumbents. We'll see how it develops in the next year or so.
If you're not already doing machine learning research and engineering, I think it takes more than two years of study to reach the frontier? (The ordinary software engineering you use to build Less Wrong, and the futurism/alignment theory we do here, are not the same skills.) As my point of comparison for thinking about this, I have a couple hundred commits in Rust, but I would still feel pretty silly claiming to be able to build a state-of-the-art compiler in 2 years with 7 similarly-skilled people, even taking into account that a lot of the work is already done by just using LLVM (similar to how ML projects can just use PyTorch or TensorFlow). Is there some reason to think AGI (!) is easier than compilers? I think "newer domain, therefore less distance to the frontier" is outweighed by "newer domain, therefore less is known about how to get anything to work at all."
Yeah, to be clear, I think I would try hard to hire some people with more of the relevant domain-knowledge (trading off against some other stuff). I do think I also somewhat object to it taking such a long time to get the relevant domain-knowledge (a good chunk of people involved in GPT-3 had less than two years of ML experience), but it doesn't feel super cruxy for anything here, I think? To be clear, I agree with this, but I think this mostly pushes towards making me think that small teams with high general competence will be more important than domain-knowledge. But maybe you meant something else by this.
Adam Jermyn (2y): I think the argument “newer domain hence nearer frontier” still holds. The fact that we don’t know how to make an AGI doesn’t bear on how much you need to learn to match an expert.

Now we're mostly talking about whether a $10 trillion company can explosively grow to $300 trillion as it develops AI, which is just not the same game in any qualitative sense.

To be clear, this is not the scenario that I worry about, and neither is it the scenario most other people I talk about AI Alignment tend to worry about. I recognize there is disagreement within the AI Alignment community here, but this sentence sounds like it's some kind of consensus, when I think it clearly isn't. I don't expect we will ever see a $300 trillion company before humanity goes extinct.

I'm just using $300 trillion as a proxy for "as big as the world." The point is that we're now mostly talking about Google building TAI with relatively large budgets. It's not yet settled (since of course none of the bets are settled). But current projects are fairly big, the current trend is to grow quite quickly, and current techniques have massive returns to scale. So the wind certainly seems to be blowing in that direction about as hard as it could.
Well, $300 trillion seems like it assumes that offense is about as hard as defense, in this analogy. Russia launching a nuclear attack on the U.S., and this somehow chaining into a nuclear winter that causes civilizational collapse, does not imply that Russia has "grown to $300 trillion". Similarly, an AI developing a bioweapon that destroys humanity's ability to coordinate or orient, kills 99% of the population using something like $5000, and then rebuilds over the course of a few years without humans around, also doesn't look at all like "explosive growth to $300 trillion".

This seems important since you are saying that "[this] is just not the same game in any qualitative sense", whereas I feel like something like the scenario above seems most likely, we haven't seen much evidence to suggest it's not what's going to happen, and it sounds quite similar to what Eliezer was talking about at the time. Like, I think de-facto probably an AI won't do an early strike like this that only kills 99% of the population, and will instead wait longer to make sure it can do something that has less of a chance of failure, but the point-of-no-return will have been crossed when a system first had the capability to kill approximately everyone.

I agree with this. I agree that it seems likely that model sizes will continue going up, and that cutting-edge performance will probably require at least on the order of $100M in a few years, though it's not fully clear how much of that money is going to be wasted, and how much a team could reproduce the cutting-edge results without access to the full $100M. I do think in as much as this comes true, it does make me more optimistic that cutting-edge capabilities will have at least something like 3 years of lead before a 10-person team could reproduce them for a tenth of the cost (which my guess is probably roughly what happened historically?).

Eliezer very specifically talks about AI systems that "go foom," after which they are so much better at R&D than the rest of the world that they can very rapidly build molecular nanotechnology, and then build more stuff than the rest of the world put together.

This isn't related to offense vs defense, that's just >$300 trillion of output conventionally-measured. We're not talking about random terrorists who find a way to cause harm, we are talking about the entire process of (what we used to call) economic growth now occurring inside a lab in fast motion.

I think he lays this all out pretty explicitly. And for what it's worth I think that's the correct implication of the other parts of Eliezer's view. That is what would happen if you had a broadly human-level AI with nothing of the sort anywhere else. (Though I also agree that maybe there'd be a war or decisive first strike first, it's a crazy world we're talking about.)

And I think in many ways that's quite to what will happen. It just seems most likely to take years instead of months, to use huge amounts of compute (and therefore share proceeds with compute providers and a bunch of the rest of the economy), to result in "AI improvements" that look much more similar to conventional human R&D, and so on.

Good points; those do seem to be cases in which Hanson comes out better. As you say, it comes down to how heavily you weight the stuff Yudkowsky beat Hanson on vs. the stuff Hanson beat Yudkowsky on. I also want to reiterate that I think Yudkowsky is being obnoxious.

(I also agree that the historical bio anchors people did remarkably well & much better than Yudkowsky.)

Note that I feel like, if we look at the overall disagreements in 2008, Eliezer's view overall seems better than Robin's. So I think we're probably on the same page here.

Tomás B.:
Regarding the Einstein stuff, do you think Einstein's brain had significantly more compute than a 100 IQ person? I would be very surprised if Einstein had more than, I don't know, twice the computing power of a normal brain. And so the claim that the distance between an idiot and Einstein is far smaller than that between the idiot and a chimpanzee still seems true to me.
I'd guess that Einstein's brain probably used something like 10% more compute than the median person. But to the extent that there is any prediction it's about how good software will be, and how long it will spend in the range between mediocre and excellent human performance, rather than about how big Einstein's brain is. And that prediction seems to be faring poorly across domains.
This is a great list and I thank you for describing it.  Good examples of one of the claims I'm making -- there's nothing about their debate that tells us much meaningful about Eliezer's forecasting track record.  In fact I would like to link to this comment in the original post because it seems like important supplemental material, for people who are convinced that the debate was one-sided.
I agree about ems being nowhere in sight, versus steady progress in other methods. I also disagree with Hanson about the timeframe (I don't see it taking 300 years). I also agree that general algorithms will be very important, probably more important than Hanson said. I also put a lower probability on a prolonged AI winter than Hanson does. But as you said, AGI still isn't here.

I'd take it a step further -- did the Hanson debates even have an unambiguous, coherent idea of what "AGI" refers to? Of progress toward AGI, "how much" happened since the Hanson debate? This is thoroughly nebulous and gives very little information about a forecasting track record, even though I disagree with Hanson. With the way Eliezer is positioned in this debate, he can just point to any impressive developments and say they go in his favor. We have practically no way of objectively evaluating that. If someone already agrees "the event happened", they update that Eliezer got it right. If they disagree, or if they aren't sure what the criteria were, they don't.

Being able to say post-hoc that Eliezer "looks closer to the truth" is very different from how we measure forecasting performance, and for good reason. If I were judging this, the "prediction" absolutely resolves as "ambiguous", despite my disagreeing with Hanson on more points in their debate.

The comments about Metaculus ("jumped ahead of 6 years earlier") make more sense if you interpret them as being about Yudkowsky already having "priced in" a deep-learning-Actually-Works update in response to AlphaGo in 2016, in contrast to Metaculus forecasters needing to see DALLE 2/PaLM/Gato in 2022 in order to make "the same" update.

(That said, I agree that Yudkowsky's sneering in the absence of a specific track record is infuriating; I strong-upvoted this post.)

In particular I am irritated that Yudkowsky is criticizing Metaculus forecasters when he literally doesn't even have a Metaculus account, at least not one he posts under. He's pro-bets in theory, but then will asymmetrically criticize the people who make their prediction track records quantifiable and public. The reputational risk of making regular Metaculus predictions would be a lot more psychologically and socially relevant to him than losing a thousand dollars, so the fact that he's not doing so says a lot to me.

so the fact that he's not doing so says a lot to me

How about Metaculus points being worth nothing, or it being a huge time commitment with no payoff? Last I heard (e.g. from Zvi, who wasn't impressed with it), Metaculus still punished people for not continually updating their predictions, and occasionally rewarded them for making predictions, period (as in, both betting "no" or "yes" on some predictions granted points).

Have any of those things changed?

Metaculus generates lots of valuable metrics besides the "meaningless internet points" about which Zvi and others complained. If Yudkowsky had predicted regularly, he would have been able to know e.g. how well-calibrated he is, how his Brier score evolved over time, how it compares to the community's, etc.
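To make concrete what those metrics measure: a Brier score is just the mean squared error between stated probabilities and binary outcomes. A minimal sketch, using a made-up track record rather than anyone's real forecasts:

```python
def brier_score(forecasts):
    """Mean squared error between stated probabilities and 0/1 outcomes."""
    return sum((p - outcome) ** 2 for p, outcome in forecasts) / len(forecasts)

# Hypothetical track record: (predicted probability, what actually happened)
history = [(0.9, 1), (0.7, 1), (0.3, 0), (0.6, 0)]
print(brier_score(history))  # 0.1375 -- lower is better; always guessing 50% scores 0.25
```

This is the kind of summary statistic that only exists if the forecasts are pre-registered and numeric; it cannot be computed from verbal predictions after the fact.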
You're right that that context makes it make more sense.  One might even say it sounds impressive! But it's a real shame he's avoiding the format that could make these things a non-corrupt, meaningful forecasting track record.

Note that Eliezer and I ended up making one bet (8% vs 16% probability on AI IMO gold by 2025). I would have liked to have gotten to more disagreements about more central topics---I feel like the result will be mostly noise---but it's not nothing. (I felt like I offered some reasonable candidates that we ought to disagree about, though the reader can judge for themselves.)

Can someone explain to me why we don't see people with differing complex views on something placing bets in a similar fashion more often? 

It was quite hard to get to this forecast, and in the end I don't think it will be that useful. I think it's just generally really hard. I don't have a clear sense for why Eliezer and I weren't able to get to more bets, but I'm not that surprised.

I do think that this kind of betting has a lot of hazards and there's a good chance that we are both going to come out behind in social EV. For example: (i) if you try to do it reasonably quickly, then you basically just know there are reasons your position is dumb that you just haven't noticed; (ii) it's easier to throw something in someone's face as a mistake than to gloat about it as a win; (iii) there is all kinds of adverse selection...

Several different tough hurdles have to be passed, and usually aren't. For one, they have to agree on criteria that they both think are relevant enough, and that they can define well enough to be resolvable. They also have to agree on an offer with particular odds and an amount of money. They then have to be risk-tolerant enough to go through knowing they may lose money, or be humiliated somewhat (though with really good betting etiquette, IMO it need not be humiliating if they're good sports about it). And there's the obvious counterparty risk, as people may simply not pay up.

My first 'dunk' on April 18, about a 5-year shortening of Metaculus timelines in response to evidence that didn't move me at all, asking about a Metaculus forecast of the Metaculus forecast 3 years later, implicitly predicts that Metaculus will update again within 3 years.

My second 'dunk' on May 12 is about Metaculus updating that much again in that direction, one month later.

I do admit, it's not a good look that I once again understate my position by so much compared to what the reality turns out to be, especially after having made that mistake a few times before.

I do however claim it as a successful advance prediction, if something of a meta one, and cast a stern glance in your direction for failing to note this over the course of your attempting to paint me in a negative light by using terms like 'dunk'.

asking about a Metaculus forecast of the Metaculus forecast 3 years later, implicitly predicts that Metaculus will update again within 3 years. [emphasis mine]

I feel like this is missing the key claim underlying this post: that verbal statements making implicit predictions are too hard to judge and too easy to hindsight bias about, and so aren't strong evidence about a person's foresight.

For instance, if Metaculus did not, in fact, update again over the following 3 years, and you were merely optimizing for the appearance of accuracy, you could claim that you weren't making a prediction, merely voicing a question. And more likely, you and everyone else would just have forgotten about this tweet.

I don't particularly want to take a stance on whether verbal forecasts like that one ought to be treated as part of one's forecasting record. But insofar as the author of this post clearly doesn't think they should be, this comment is not addressing his objection.

These sorts of observations sound promising for someone's potential as a forecaster. But by themselves, they make it massively easier to cherry-pick, fudge, omit, or re-define things than proper forecasts do.

When you see other people make non-specific "predictions", how do you score them? How do you know the scoring that you're doing is coherent, and isn't rationalizing? How do you avoid the various pitfalls that Tetlock wrote about? How do you *ducks stern glance* score yourself on any of that, in a way that you'll know isn't rationalizing?

For emphasis, in this comment you reinforce that you consider it a successful advance prediction. This gives very little information about your forecasting accuracy. We don't even know what your actual distribution is, and it's a long time before this resolves; we only know it went in your direction. I claim that to critique other people's proper-scored forecasts, you should be transparent and give your own.

EDIT: Pasted from another comment I wrote:

Instead of that actual [future resolution] reality, and because of how abruptly the community ended up shifting, Eliezer seems to be interpreting that to mean that his position about that reality is not extreme enough.  Those 2 things are somewhat related but pretty weakly, so it seems like rationalizing for him to frame it as showing his forecast isn't extreme enough.

My first 'dunk' on April 18, about a 5-year shortening of Metaculus timelines in response to evidence that didn't move me at all, asking about a Metaculus forecast of the Metaculus forecast 3 years later, implicitly predicts that Metaculus will update again within 3 years.

I do however claim it as a successful advance prediction, if something of a meta one

Wait, unless I misunderstand you there's a reasoning mistake here. You request epistemic credit for implicitly predicting that the Metaculus median was going to drop by five years at some point in the next three years. But that's a prediction that the majority of Metaculites would also have made, and it was a given that it would happen within an interval of time as long as three years. It's a correct advance prediction, if you did make it (let's assume so and not get into inferring implicit past predictions with retrospective text analysis), but it's not one that is even slightly impressive.

As an example to explain why, I predict (with 80% probability) that there will be a five-year shortening in the median on the general AI question at some point in the next three years. And I also predict (with 85% probability) that... (read more)

You're right that volatility is an additional category of reasons why his not giving his actual distribution makes it less informative.

It's interesting to me that in his comment he states that he sees it as significant evidence that his position wasn't extreme enough. But he didn't even clearly give his position, and "the reality" is a thing that is determined by the specific question resolution when that day comes. Instead of that actual reality, and because of how abruptly the community ended up shifting, Eliezer seems to be interpreting that to mean that his position about that reality is not extreme enough. Those 2 things are somewhat related but pretty weakly, so it seems like rationalizing for him to frame it as showing his forecast isn't extreme enough.

I don't expect him to spend time engaging with me, but for what it's worth, to me the comment he wrote here doesn't address anything I brought up; it's essentially just him restating that he interprets this as a nice addition to his "forecasting track record". He certainly could have made it part of a meaningful track record! It was a tantalizing candidate for such a thing, but he doesn't want to, yet he expects people to interpret it the same way, which doesn't make sense.
Both of these things have happened. The community prediction was June 28, 2036 at one time in July 2022, July 30, 2043 in September 2022 and is March 13, 2038 now. So there has been a five-year shortening and a five-year lengthening.

Thanks for the post and expressing your opinion!

That being said, I feel like there is a misunderstanding here. Daniel mentioned that in another comment thread, but I don't think Eliezer claims what you're attributing to him, nor does your analogy with financial pundits work in this context.

My model of Eliezer, based on reading a lot of his posts (old and new) and one conversation, is that he's dunking on Metaculus and forecasters for a combination of two epistemic sins:

  • Taking a long time to update on available information. Basically, you shouldn't take so long to update on the risk from AI, the accelerating pace, the power of scaling. I don't think Eliezer is perfect on this, but he definitely can claim that he thought about and invested himself in AI risks literally decades before any Metaculus forecaster even thought about the topic. This is actually a testable claim: that forecasts end up trailing things that Eliezer said 10 years earlier.
  • Doing a precise prediction when you don't have the information. I feel like there's been a lot of misunderstanding about why Eliezer doesn't want to give timeline predictions, when he has said it repeatedly: he thinks there is just not enough bits of evidence fo
... (read more)
Not really, unless you accept corruptible formats of forecasts with lots of wiggle room. It isn't true that we can have a clear view of how he is forecasting if he skips proper forecasting. I think you're right that it's impressive he alerted people to potential AI risks. But if you think that's an informative forecasting track record, I don't think that heuristic is remotely workable for measuring forecasters.

To clarify, I'm not saying he should give a specific year that he thinks it happens, such as a 50% confidence interval of 12 months. That would be nuts. Per Tetlock, it just isn't true that you can't (or shouldn't) give specific numbers when you are uncertain. You just give a wider distribution. And not giving that unambiguous distribution when you're very uncertain just obfuscates, and is the real epistemic sin.

I don't understand what you mean by the bolded part. What do you mean everybody does it? No they don't. Some people pretend to, though. The analogy is relevant in the sense that Eliezer should show that he is calibrated at predicting AI risks, rather than only arguing so. The details you mention don't work as a proper forecasting track record.
The subtlety I really want to point out here is that the choice is not necessarily "make a precise forecast" or "not make any forecast at all". Notably, the precise forecasts that you generally can write down or put on website are limited to distributions that you can compute decently well and that have well-defined properties. If you arrive at a distribution that is particularly hard to compute, it can still tell you qualitative things (the kind of predictions Eliezer actually makes) without you being able to honestly extract a precise prediction. In such a situation, making a precise prediction is the same as taking one element of a set of solutions for an equation and labelling it "the" solution. (If you want to read more about Eliezer's model, I recommend this paper)

I agree with some of your complaints here. But Eliezer has more of a track record than you indicate. E.g. he made one attempt that I know of to time the stock market, buying on March 23, 2020 - the day on which it made its low for 2020.

There are further shards of a track record strewn across the internet:

Might be worth noting the odds for people who don't click through: these were at 1:1 and 1:20 (his 1 to Gwern's 20).
That's good, but it doesn't really add much to a forecasting track record. It doesn't meet the criteria. I do think he would be good, if he did proper forecasting.

EDIT 10 days later: I would be happy to hear from the downvoters why they think that example tells us much, OR why it's anyone's job other than Eliezer's to curate his predictions into something accountable (instead of the stuff he said to Paul, for example).

A few parts of this OP seem in bad faith:

Here he dunks on Metaculus predictors as "excruciatingly predictable" about a weak-AGI question

No, the original Yudkowsky quote is:

To be a slightly better Bayesian is to spend your entire life watching others slowly update in excruciatingly predictable directions that you jumped ahead of 6 years earlier so that your remaining life could be a random epistemic walk like a sane person with self-respect.

I wonder if a Metaculus forecast of "what this forecast will look like in 3 more years" would be saner.  Is Metaculus reflective, does it know what it's doing wrong?

  • This quote does not insult Metaculus predictors as "excruciatingly predictable".
  • It doesn't call out individual Metaculus predictors at all.

And regarding this:

But Eliezer seems willing to format his message as blatant fearmongering like this.  For years he's been telling people they are doomed, and often suggests they are intellectually flawed if they don't agree.  To me, he doesn't come across like he's sparing me an upsetting truth.  To me he sounds like he's catastrophizing, which isn't what I expect to see in a message tailored for mental health.

If OP had extend... (read more)

A couple questions:

  1. It's quite easy and common to insult groups of people.  And I, and some other people, found him very sneering in that post.  In order to count as calling them "excruciatingly predictable", it seems like you're suggesting Eliezer would have had to name specific people, and that it doesn't count if it's about a group (the people who had placed forecasts on that question)? If yes, why?
  2. For that post that I described as fearmongering, it's beside the point whether his "intention" was fearmongering.  I would like it if you elaborated.  The post has a starkly doomsday attitude.  We could just say it's an April Fool's joke, but the problem with this retort is that Eliezer has said quite a few things with a similar attitude.  And in the section "addressing" whether it's an April Fool's joke, he first suggests that it is, but then implies that he intends for the reader to take the message very seriously -- so, not really.

    Roughly, the post seems to imply a chance of imminent extinction that is, like, a factor of ~100x higher (in odds format) than what scored aggregated forecasters roughly give.  Such an extreme prediction could indeed be described as fearmongering.

    In order to count as "fearmongering", are you saying he would've had to meet the requirement of being motivated specifically for fearmongering? Because that's what your last sentence suggests.
Regarding 1: Drop your antagonism towards Yudkowsky for a moment and consider how that quote could be not insulting. It simply says "people are slowly updating in excruciatingly predictable directions". Unless you have an axe to grind, I don't understand how you immediately interpret that as "people are excruciatingly predictable".

The point is simply: Yudkowsky has been warning of this AI stuff forever, and gotten increasingly worried (as evidenced by the post you called "fearmongering"). And the AI forecasts are getting shorter and shorter (which is an "excruciatingly predictable" direction), rather than occasionally getting longer (which would be an "epistemic random walk").

Finally, I'm still not seeing how this is in good faith. You interpreted a quote as insulting and now call it "very sneering", then wrote something I find 10x more insulting and sneering on your own (the second quote in my top comment). That seems like a weird double standard.

----------------------------------------

Regarding 2: "The post has a starkly doomsday attitude." Yes. However, fearmongering, as described in a dictionary, is "the action of intentionally trying to make people afraid of something when this is not necessary or reasonable". If someone thinks we're actually doomed and writes a post saying so, it's not fearmongering. Yudkowsky simply reported his firmly held beliefs. Yes, those are much much much more grim than what scored aggregated forecasters believe. But they're not out of place for MIRI overall (with five predictions all noting risks in the range of 25%-99%).

To me, the main objectionable part of that post is the April Fool's framing, which seems to have been done because of Yudkowsky's worry that people who would despair needed an epistemic out or something. I understand that worry, but it's led to so much confusion that I'm dubious whether it was worth it. Anyway, this comment clarified things for me.
You're right about the definition of fearmongering then. I think he clearly tries to make people worried, and I often find it unreasonable. But I don't expect everyone to think he meets the "unreasonable" criterion.

On the second quote in your top comment: indeed, most scored forecasters with a good track record don't give 25% risk of extinction before, say, 2200. And as for 99%: this is wackadoodle wildly extreme, and probably off by a factor of roughly ~1,000x in odds format. If I assume the post's implied probability is actually closer to 99%, then it seems egregious. You mention these >25% figures are not that out of place for MIRI, but what does that tell us? This domain probably isn't that special, and humans would need to be calibrated forecasters for me to care much about their forecasts.

Here are some claims I stand by:

  • I genuinely think the picture painted by that post (and estimates near 99%) overstates the odds of extinction soon by a factor of roughly ~1,000x.  (For intuition, that's similar to going from 10% to 99%.)
  • I genuinely think these extreme figures are largely coming from people who haven't demonstrated calibrated forecasting, which would make them additionally suspicious in any other domain, and should here too.
  • I genuinely think Eliezer does something harmful by overstating the odds, by an amount that isn't reasonable.
  • I genuinely think it's bad of him to criticize other proper-scored forecasts without being transparent about his own, so a fair comparison could be made.

On insults

This part I've moved to the bottom of this comment because I think it's less central to the claim I'm making. As for the criteria for "insulting" or sneering, well, a bunch of people (including me) found it like that. Some people I heard from described it as infuriating that he was saying these things without being transparent about his own forecasts. And yes, the following does seem to imply other people aren't sane nor self-respecting: P
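The "factor of ~1,000x in odds format" intuition above is simple arithmetic to check. A quick sketch (the helper names are mine, just for illustration):

```python
def prob_to_odds(p):
    """Convert a probability to odds, e.g. 0.10 -> 1/9."""
    return p / (1.0 - p)

def odds_to_prob(o):
    """Convert odds back to a probability."""
    return o / (1.0 + o)

# Scaling a 10% probability by a 1,000x odds factor:
scaled = odds_to_prob(prob_to_odds(0.10) * 1000)
print(round(scaled, 3))  # 0.991, i.e. roughly 99%
```

So multiplying 10% odds by one thousand does land almost exactly on 99%, which is why the two framings in the claim are equivalent.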
Is there any evidence that calibrated forecasters would be good at estimating high odds of extinction, if our actual odds are high? How could you ever even know? For instance, notions like <if we're still alive, that means the odds of extinction must have been low> run afoul of philosophical issues like anthropics. So I don't understand this reasoning at all.

You're presuming that the odds of extinction are orders of magnitude lower than 99% (or whatever Yudkowsky's actual assumed probability is), fine. But if your argument for why is that other forecasters don't agree, so what? Maybe they're just too optimistic. What would it even mean to be well-calibrated about x-risk forecasts? If we were talking about stock market predictions instead, and you had evidence that your calibrated forecasters were earning more profit than Yudkowsky, I could understand this reasoning and would agree with it. But for me this logic doesn't transfer to x-risks at all, and I'm confused that you think it does.

Here I strongly disagree. Forecasting x-risks is rife with special problems. Besides the (very important) anthropics confounders, how do you profit from a successful prediction of doom if there's no-one left to calculate your Brier score, or to pay out on a prediction market? And forecasters are worse at estimating extreme probabilities (<1% and >99%), and at longer-term predictions. Etc.

----------------------------------------

Regarding the insult thing: I agree that section can be interpreted as insulting, although it doesn't have to be. Again, the post doesn't directly address individuals, so one could just decide not to feel addressed by it. But I'll drop this point; I don't think it's all that cruxy, nor fruitful to argue about.
"Bad faith" seems a bit strong. People usually don't realise their inconsistencies.

Making highly visible predictions about AGI timelines as a safety figure is a lose-lose situation. If you're right, you will all be dead, so it won't matter. If you're wrong, bad people who don't make any predictions will use yours to tar you as a kook. Then everyone will stop listening to you, and AGI will come five years later and you'll all be dead.

I'm not saying he shouldn't shut up about the Metaculus updates, but he's in a bit of a bind here. And as you noticed, he has in fact made a substantial prediction via his bet with Caplan. The reason he doesn't do much else is that (in my model of Eliezer) the kinds of people who are likely to take heed of his bets are more likely to be intellectually honest.

I don't like this defense for two reasons. One, I don't see why the same argument doesn't apply to the role Eliezer has already adopted as an early and insistent voice of concern. Being deliberately vague on some types of predictions doesn't change the fact that his name is synonymous with AI doomsaying. Second, we're talking about a person whose whole brand is built around intellectual transparency and reflection; if Eliezer's predictive model of AI development contains relevant deficiencies, I wish to believe that Eliezer's predictive model of AI development contains relevant deficiencies. I recognize the incentives may well be aligned against him here, but it's frustrating that he seems to want to be taken seriously on the topic but isn't obviously equally open to being rebutted in good faith.

Adam B:
Posting a concrete forecast might motivate some people to switch into working on the problem, work harder on it, or reduce work that increases risk (e.g. capabilities work). This might then make the forecast less accurate, but that seems like a small price to pay vs everyone being dead. (And you could always update in response to people's response).

Epistemic Virtue

Taking a stab at the crux of this post:

The two sides have different ideas of what it means to be epistemically virtuous.

Yudkowsky wants people to be good Bayesians, which e.g. means not over-updating on a single piece of evidence; or calibrating to the point that whatever news of new AI capabilities appears is already part of your model, so you don't have to update again. It's not so important to make publicly legible forecasts; the important part is making decisions based on an accurate model of the world. See the LW Sequences, his career, etc.

The OP is part of the Metaculus community and expects people to be good... Metaculeans? That is, they must fulfill the requirements for "forecaster prestige" mentioned in the OP. Their forecasts must be pre-registered, unambiguous, numeric, and numerous.

So it both makes perfect sense for Yudkowsky to criticize Metaculus forecasts for being insufficiently Bayesian (it made little sense that a forecast would be this susceptible to a single piece of news; compare with the LW discussion here), and for OP to criticize Yudkowsky for being insufficiently Metaculean (he doesn't have a huge public catalog of Metaculean predictions).... (read more)

This isn't a good description of being on Metaculus versus being a Bayesian.

How does one measure whether they are "being a Bayesian"? The general point is you can't, unless you are being scored.  You find out by making forecasts -- if you aren't updating you get fewer points, or even lose points.  Otherwise you have people who are just saying things that thematically sound Bayesian but don't mean very much in terms of updated beliefs.  Partly I'm making an epistemic claim that Eliezer can't actually know if he's being a good Bayesian without proper forecasting.  You can check out Tetlock's work if you're unsure why that would be the case, though I mention it in the post.

The more central epistemic claim I'm making in this essay: if someone says they are doing a better job of forecasting a topic than other people, but they aren't actually placing forecasts so we can empirically test whether they are, then that person's forecasts should be held in high suspicion.  I'm claiming this would be the same in every other domain; AI timelines are unlikely to be that special, and his eminence doesn't buy him a good justification for why we would hold him to drastically lower standards in measuring his forecast accuracy.

I understand that you value legibility extremely highly. But you don't need a community or a scoring rule to assess your performance, you can just let reality do it instead. Surely Yudkowsky deserves a bajillion Bayes points for founding this community and being decades ahead on AI x-risk. Bayes points may not be worth anything from the Metaculean point of view, but the way I understand you, you seem to be saying that forecasts are worth everything while ignoring actions entirely, which seems bizarre. That was the point in my original comment, that you two have entirely different standards. Yudkowsky isn't trying to be a good Metaculean, so of course he doesn't score highly from your point of view.

An analogy could be Elon Musk.  He's done great things that I personally am absolutely incapable of.  And he does deserve praise for those things.  And indeed, Eliezer was a big influence on me.  But Musk gives extreme predictions that probably won't age well.

Him starting this site and writing a million words about rationality is wonderful and outstanding.  But do you think it predicts forecasting performance nearly as well as actual performance on proper forecasts? I claim it doesn't come anywhere near as good a predictive factor as just making some actual forecasts and seeing what happens, and I don't see the opposing position holding up at all.  You can argue that "we care about other things too, beyond just forecasting ability", but in this thread I am specifically referring to his implied forecasting accuracy, not his other accomplishments.  The way you're referring to Bayes points here doesn't seem workable or coherent, any more than "Musk points" would tell me his predictions are accurate.

If people (like Musk) are continually successful, you know they're doing something right. One-off success can be survivorship bias, but the odds of having continued success by mere happenstance get very low, very quickly. When I call that "getting Bayes points", what I mean is that if someone demonstrates good long-term decision-making, or gets good long-term outcomes, or arrives at an epistemic state more quickly, you know they're doing some kind of implicit forecasting correctly, because long-term decisions in the present are evaluated by the reality of the future.

----------------------------------------

This whole discussion vaguely reminds me of conflicts between e.g. boxing and mixed martial arts (MMA) advocates: the former has more rules, while the latter is more flexible, so how can two competitors from their respective disciplines determine who of them is the better martial artist? They could both compete in boxing, or both compete in MMA. Or they could decide not to bother, and remain in their own arenas. I guess it seems to you like Yudkowsky has encroached on the Metaculus arena but isn't playing by the Metaculus rules?

No, success and fame are not very informative about forecasting accuracy.  Yes, they are strongly indicative of other competencies, but we shouldn't mix those into our measure of forecasting.  And nebulous, unscorable statements don't work as "success" at all; they're too cherry-picked and unworkable.  Musk is famously uncalibrated, with famously bad timeline predictions in his own domain! I don't think you should gloss over that in this context by saying "Well, he's successful..."

If we are talking about measuring forecasting performance, then it's more like comparing tournament Karate with trench warfare.

I'm going to steal the tournament karate and trench warfare analogy. Thanks.

Unless success breeds more success, irrespective of other factors.

I share the sense that many "AGI" forecasts are going to be very hard to arbitrate---at best they have a few years of slack one way or the other, and at worst they will be completely reinterpreted (I could easily see someone arguing for AGI today).

I try to give forecasts for "technological singularity" instead, which I think has a variety of mostly-equivalent operationalizations. (When asked to give a timeline to AI I often give a tongue-in-cheek operationalization of "capture >1% of the sun's energy." This is obviously more appropriate if coupled with the substantive prediction that it will only make a few months of difference which crazy-ambitious technological milestone you choose---just as I think it only makes a few centuries of difference which milestone you use for forecasting the technological singularity starting from 10,000 BC.)

I'm fond of "x percent of the sun's energy used"-style stuff because I would expect a runaway superintelligence to go ahead and use that energy, and it has a decent shot at being resolvable. But I think we need to be careful about assuming all the crazy-ambitious milestones end up only a few months from each other.  You could have a situation where cosmic industrialization is explosively fast heading away from Earth, with every incentive to send out seed ships for a land-grab, while things remain much slower here on Earth because some big incumbents maintain control and develop more slowly.  I'm not at all sure that will happen, but it's not obvious that all the big milestones happen within a few months of each other, if we assume local control is maintained and the runaway Foom goes elsewhere. This is an example of why I think it does matter which milestone people pick, though it will often be for reasons that are very hard to foresee.
Yeah, I somewhat prefer "could capture >1% of the energy within a year if we wanted." But if I'm being more serious, I think it's better to just directly try to get at the spirit of "basically everything happens all at once," or else do something more mundane like massive growth in energy capture (say 100%/year, rather than "all the sun's energy").

Suppose someone in 1970 makes the prediction: "More future tech progress will be in computers, not rockets" (claiming, among other arguments, that rockets couldn't be made orders of magnitude smaller and computers could). There is a sense in which they are clearly right, but it's really hard to turn something like that into a specific, objective prediction, even with the benefit of hindsight. Any time you set an objective criterion, there are ways the technical letter of the rules can fail to match the intended spirit.

(The same way complicated laws have loopholes.)

Take this attempt:

Change the first criterion to: "Able to reliably pass a 2-hour, adversarial Turing test during which the participants can send text, images, and audio files (as is done in ordinary text messaging applications) during the course of their conversation. An 'adversarial' Turing test is one in which the human judges are instructed to ask interesting and difficult questions, designed to advantage human participants, and to successfully unmask the computer as an impostor. A single demonstration of a computer passing such a Turing test, or one that is sufficiently similar, will be sufficient for t

... (read more)

I like your list!

Definitely agree that narrow questions can lose the spirit of it.  The forecasting community can hedge against this by having a variety of questions that try to get at it from "different angles".

For example, that person in 1970 could set up a basket of questions:

  1. Percent of GDP that would be computing-related instead of rocket-related.
  2. Growth in the largest computer by computational power, versus the growth in the longest distance traveled by rocket, etc.
  3. Growth in the number of people who had flown in a rocket, versus the number of people who own computers.
  4. Changes in dollars per kilo of cargo hauled into space, versus changes in FLOPS-per-dollar.

Of course, I understand completely if people in 1970 didn't know about Tetlock's modern work.  But for big important questions, today, I don't see why we shouldn't just use modern proper forecasting technique.  Admittedly it is laborious! People have been struggling to write good AI timeline questions for years.

This is not verbatim because I never wrote it down till now, but Eliezer said in March 2005 that an AI's beating Go champions in Go would be a sign of real progress towards AGI. IMO that counts as a successful prediction because the rate at which the field racked up successes increased significantly around the time of AlphaGo.

(The statement was made in the house of a man named Spike Jones at a party to give the public a chance to meet Eliezer, who'd just moved to the Bay Area.)

It tells us essentially nothing.  How are you going to score the degree to which it turned out to be "a sign of real progress towards AGI"? I understand it feels impressive but it's far too nebulous to work as a forecasting track record.


He clearly believes he could be placing forecasts showing whether or not he is better.  Yet he doesn't.

Eliezer hasn't said he thinks he can do better than Metaculus on arbitrary questions. He's just said he thinks Metaculus is wrong on one specific question. Quoting a point I made in our conversation on Twitter:

[...] From my perspective, it looks like: I don't think Metaculus performance on physics-loaded tech progress is a perfect proxy for physics knowledge (or is the only way a physicist could think they know better than Metaculus on a single question).

It seems like you're interpreting EY as claiming 'I have a crystal ball that gives me unique power to precisely time AGI', whereas I interpret EY as saying that one particular Metaculus estimate is wrong.

Metaculus being wrong on a particular very-hard-to-forecast question is not a weird or crazy claim, so you don't need to claim to be a genius.

Obviously EY shouldn't get a bunch of public "aha, you predicted Metaculus' timeline was way too long" credit when he didn't clearly state this in advance (at least before the first update) and hasn't quantified what "too long" means.

I'm not saying 'give EY social credit for this' or ev

... (read more)

I think this is an unreasonable characterization of the situation and my position, especially the claim:

Eliezer hasn't seen a big list of prediction successes from Paul about this thing Paul claims to be unusually good at (whereas, again, EY makes no claim of being unusually good at timing arbitrary narrow-AI advances)

I responded to a long thread of Eliezer trash-talking me in particular (here), including making apparent claims about how this is not the kind of methodology that makes good forecasts. He writes:

It just seems very clear to me that the sort of person who is taken in by this essay is the same sort of person who gets taken in by Hanson's arguments in 2008 and gets caught flatfooted by AlphaGo and GPT-3 and AlphaFold 2 [... the kind of person who is] going "Huh?" when AlphaGo or GPT-3 debuts[1]

He also writes posts like this one. Saying "the trick that never works" sure seems like it's making a claim that something has a worse track record than whatever Eliezer is doing.

Overall it looks to me that Eliezer is saying, not once but many times, that he is better at predicting things than other people and that this should be taken as a reason to dismiss various kinds of argumen... (read more)

You're not wrong, and I'm not saying you shouldn't have replied in your current position, but the YouTube drama isn't increasing my respect for either you or Eliezer.

Yeah, I think I should probably stay out of this kind of interaction if I'm going to feel compelled to respond like this.  Not that maximizing respect is the only goal, but I don't think I'm accomplishing much else.

I'm also going to edit the phrases "shouldn't talk quite as much shit" and "full of himself"; I just shouldn't have expressed that idea in that way. (Sorry, Eliezer.)

I think the YouTube drama is serving an important function. Yudkowsky routinely positions himself in the role of a religious leader who is (in his own words) "always right".

(I think "role of a religious leader" is an apt description of what's going on sociologically, even if no supernatural claims are being made; that's why the "rightful caliph" language sticks.)

I used to find the hyper-arrogant act charming and harmless back in 2008, because, back in 2008, he actually was right about almost everything I could check myself. (The Sequences were very good.)

For reasons that are beyond the scope of this comment, I no longer think the hyper-arrogant act is harmless; it intimidates many of his faithful students (who genuinely learned a lot from him) into deferring to their tribal leader even when he's obviously full of shit.

If he can't actually live up to his marketing bluster, it's important for our collective sanity that people with reputation and standing call bullshit on the act, so that citizens of the Caliphate remember that they have the right and the responsibility to think things through for themselves. I think that's a more dignified way to confront the hazards that face us in ... (read more)

This is correct.

Elizabeth van Nostrand comments in private chat:

Can everyone agree that:

  1. there are many forms of prediction, of which narrow, precise forecasting of the kind found on prediction markets is only one
  2. narrow forecasting is only viable for a small subset of problems, and often the most important problems aren’t amenable to narrow forecasting
  3. narrow forecasting is much harder to fake than the other kinds. Making vague predictions and taking credit for whatever happens to happen is a misallocation of truthseeking credit.
  4. It is possible to have valuable models without being good at narrow predictions: black swans is a useful concept, but it's very annoying how the media give Nassim Taleb credit every time something unexpected happens.
  5. It is possible to have models that are true but not narrow-predictive enough to be valuable [added: you can have a strong, correct model that a stock is overpriced, but unless you have a model for when it will correct it’s ~impossible to make money off that information]

I like this addition, and endorse 1-5!

But still without being transparent about his own forecasts, preventing a fair comparison.

I think it's a fair comparison, in that we can do at least a weak subjective-Bayesian update on the information -- it's useful and not cherry-picked, at least insofar as we can compare the AGI/TAI construct Eliezer was talking about in December, to the things Metaculus is making predictions about.

I agree that it's way harder to do a Bayesian update on data points like 'EY predicted AGI well before 2050, then Metaculus updated from 2052 to 2035' when we don't have a full EY probability distribution over years.

I mostly just respond by making a smaller subjective update and then going on with my day, rather than treating this as revelatory. I'm better off with the information in hand, but it's a very small update in the grand scheme of things. Almost all of my knowledge is built out of small updates in the first place, rather than huge revelatory ones.

If I understand your views, Jotto, three big claims you're making are:

  1. It's rude to be as harsh to other futurists as Eliezer was toward Metaculus, and if you're going to be that harsh then at minimum you should clearly be sticking your neck out as m
... (read more)
Rob Bensinger (2y):
I guess "top Metaculus forecaster" is a transparently bad metric, because spending more time on Metaculus tends to raise your score? Is there a 'Metaculus score corrected for how much you use the site' leaderboard?
Alex Lawsen (2y):
Yes, it has its own problems in terms of judging ability, but it does exist.
Rob Bensinger (2y):
Thanks! :)
This is good in some ways but also very misleading.  It selects against people who place a lot of forecasts on lots of questions, against people who forecast on questions that have already been open for a long time, and against people who don't have time to later update on most of them. I'd say it's a very good way to measure performance within a tournament, but in the broader jungle of questions it misses an awful lot. E.g. I have predictions on 1,114 questions, and the majority were never updated and had negligible energy put into them.

Sometimes for fun I used to place my first (and only) forecast on questions that were just about to close.  I liked this because it made it easier to compare my performance against the community on distribution questions, since the final summary only shows that comparison for the final snapshot.  Of course, if you do this you will get very few points per question.  But if I look at my results on those, it's normal for me to slightly outperform the community median. This isn't captured by my average points per question across all questions, where I underperform (partly because I never updated on most of those questions, and partly because a lot of it is amusingly obscure stuff I put little effort into).  Though that's not to suggest I'm particularly great either (I'm not), but I digress.

If we're trying to predict a forecaster's insight on "the next" given discrete prediction, then a more useful metric would be the forecaster's log score versus the community's log score on the same questions, at the time they placed those forecasts.  Naturally this isn't a good way to score tournaments, where people should update often and put high effort into each question.  But if we're trying to estimate someone's judgment from the broader jungle of Metaculus questions, that would be much more informative than a points average per question.
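To make the proposed metric concrete -- a forecaster's log score minus the community's log score on the same questions, taken at the time the forecast was placed -- here is a rough sketch for binary questions. All probabilities and outcomes below are made up for illustration:

```python
import math

def log_score(p, outcome):
    """Log score for a binary forecast: higher (less negative) is better."""
    p = min(max(p, 1e-9), 1 - 1e-9)  # clamp to avoid log(0)
    return math.log(p if outcome else 1 - p)

def relative_skill(forecasts):
    """Average (forecaster log score - community log score) over shared questions.

    `forecasts` is a list of (forecaster_p, community_p_at_same_time, outcome)
    tuples; a positive result means the forecaster beat the community snapshot.
    """
    diffs = [log_score(p_f, o) - log_score(p_c, o) for p_f, p_c, o in forecasts]
    return sum(diffs) / len(diffs)

# Hypothetical data: three binary questions.
sample = [
    (0.80, 0.60, True),   # forecaster more confident than community, and right
    (0.30, 0.45, False),  # forecaster lower, and the event didn't happen
    (0.55, 0.50, True),
]
print(relative_skill(sample))  # positive => outperformed the community
```

Unlike points-per-question, this comparison doesn't penalize someone for forecasting only once on a question, since both scores are read from the same moment in time.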
Eli Tyre (2y):
In most of the world this reads as kind of offensive, or as an affront, or as inciting conflict, which makes having this thought in the first place hard; that is one of the contributors to modest epistemology. I wish that we had better social norms for distinguishing between:

  1. Claims that I am making and justifying based on legible info and reasoning. I.e., "Not only do I think that X is true, I think that any right-thinking person who examines the evidence should come to conclude X," and if you disagree with me about that we should debate it.
  2. Claims that I am making based on my own private or illegible info and reasoning. I.e., "Given my own read of the evidence, I happen to think X. But I don't think that the arguments that I've offered are necessarily sufficient to convince a third party. I'm not claiming that you should believe this; I'm merely providing you the true information that I believe it."

I think clearly making this distinction would be helpful for giving space for people to think. I think lots of folks implicitly feel like they can't have an opinion about something unless it is of the first type, which means they have to be prepared to defend their view from attack.  Accordingly, they have a very high bar for letting themselves believe something, or at least to say it out loud. Which, at least, impoverishes the discourse, but might also hobble their own internal ability to reason about the world.

On the flip side, I think sometimes people in our community come off as arrogant because they're making claims of the second type, and others assume that they're making claims of the first type, without providing much supporting argument at all. (And sometimes, folks around here DO make claims of the first type without providing supporting arguments, e.g. "I was convinced by the empty string, I don't know what strange inputs others need to be convinced" implying "I think all right-thinking people would reach this conclusion, but none of you are right thin
First, I commend the effort you're putting into responding to me, and I probably can't reciprocate as much. But here is a major point I suspect you are misunderstanding: this is neither necessary for my argument, nor at any point have I thought he's saying he can "precisely time AGI".

If he thought it was going to happen earlier than the community, it would be easy to show an example distribution of his, without high precision (nor much effort).  Literally just add a distribution into the box on the question page, click and drag the sliders so it's somewhere that seems reasonable to him, and submit it.  He could then screenshot it.  Even just copy-pasting the confidence-interval figures would do.

Note that this doesn't mean making the date range very narrow (confident); that's unrelated.  He can still be quite uncertain about specific times.  Here's an example of me somewhat disagreeing with the community.  Of course, the community has since updated to earlier, but he can still do these things, and should.  It doesn't even need to be screenshotted, really; just posting it in the Metaculus thread works.

And further, on this point you make: my argument doesn't need him to necessarily be better at "arbitrary questions".  If Eliezer believes Metaculus is wrong on one specific question, he can trivially show a better answer.  If he does this on a few questions and it gets properly scored, that's a track record.

You mentioned other things, such as how much this would transfer to broader, longer-term questions.  That isn't known, and I can't stay up late typing about this, but at the very minimum people can demonstrate they are calibrated, even if you believe there is zero knowledge transfer from narrower/shorter questions to broader/longer ones.

Going to have to stop there for today, but I would end this comment with a feeling: it feels like I'm mostly debating people who think they can predict when Tetlock's findings don't apply, and so reliably that it's unnecessary to forecast

Note that this doesn't mean making the date range very narrow (confident), that's unrelated.

Fair enough, but I was responding to a pair of tweets where you said:

Eliezer says that nobody knows much about AI timelines. But then keeps saying "I knew [development] would happen sooner than you guys thought". Every time he does that, he's conning people.

I know I'm using strong wording. But I'd say the same in any other domain.

He should create a public Metaculus profile. Place a bunch of forecasts.

If he beats the community by the landslide he claims, then I concede.

If he's mediocre, then he was conning people.

'It would be convenient if Eliezer would record his prediction on Metaculus, so we know with more precision how strong of an update to make when he publicly says "my median is well before 2050" and Metaculus later updates toward a nearer-term median' is a totally fair request, but it doesn't bear much resemblance to 'if you record any prediction anywhere other than Metaculus (that doesn't have similarly good tools for representing probability distributions), you're a con artist'. Seems way too extreme.

Likewise, 'prove that you're better than Metaculus on a ton of forecasts or you're ... (read more)

Also, I think you said on Twitter that Eliezer's a liar unless he generates some AI prediction that lets us easily falsify his views in the near future? Which seems to require that he have very narrow confidence intervals about very near-term events in AI.

So I continue to not understand what it is about the claims 'the median on my AGI timeline is well before 2050', 'Metaculus updated away from 2050 after I publicly predicted it was well before 2050', or 'hard takeoff is true with very high probability', that makes you think someone must have very narrow contra-mainstream distributions on near-term narrow-AI events or else they're lying.

Some more misunderstanding: no, I don't mean the distinguishing determinant of the con-artist or not-con-artist trait is whether it's recorded on Metaculus.  It's mentioned in that tweet because if you're going to bother doing it, you might as well go all the way and show a distribution. But even if he just posted a confidence interval on some site other than Metaculus, that would be a huge upgrade, because then anyone could add it to a spreadsheet of scorable forecasts and reconstruct it without too much effort.

No, that's not what I'm saying.  The main thing is that they be scorable.  But if someone is going to do it at all, then doing it on Metaculus just makes more sense -- the administrative work is already taken care of, and there's no risk of cherry-picking or omission.

Also, from another reply you gave: I never used the term "liar".  The thing he's doing that I think is bad is more like what a pundit does, like the guy who calls recessions -- a sort of epistemic conning.  "Lying" is different, at least to me.

More importantly, no, he doesn't necessarily need to have really narrow distributions, and I don't know why you think this.  Only if he were squashed close against the "Now" side of the chart would his distribution be "narrower" -- and if that's what Eliezer thinks, if he's saying himself it's earlier than some date, then on a graph it simply looks a bit narrower and shifted to the left, reflecting what he believes.

There's nothing about how we score forecasters that requires him to have "very narrow" confidence intervals about very near-term events in AI in order to measure alpha.  To help me understand, can you describe why you think this? Why don't you think alpha would start being measurable with merely slightly narrower confidence intervals than the community's, centered closer to the actual outcome?

EDIT a week later: I have decided that several of your misunderstandings should be considered strawmanning, and I've switched from upvoting

I agree with you on most counts, but I'm not updating against Eliezer's epistemic hygiene here as much as I otherwise would. You say Eliezer didn't bet Paul, but he did, and you know he did, yet you didn't issue a correction. That indicates the evidence may be selected overly much for dumb comments that Eliezer has made but possibly didn't put all that much thought into.

Upvoted -- I agree that the bet they made should be included in the post, around where I mention how Eliezer told Paul he doesn't have "a track record like this".  He did decline to bet Paul in the comment I am quoting, claiming Paul "didn't need to bet him".  But you are right that it's wrong not to include the bet they do have going, and not just the one with Bryan Caplan.  I've made some edits to include it.

Eliezer and Bryan's bet is 1:1 odds for a CPI-adjusted $100 bet that the world won't be destroyed by Jan. 1, 2030.

After reading comments on my post "How to place a bet on the end of the world," which was motivated by Bryan Caplan's description of his bet with Eliezer, I concluded that you can't extract information on confidence from odds on apocalyptic bets. Explanation is in the comments.

Bryan told me via email that these are the best odds he could get from Eliezer.

I think the best way to think about their bet is that it's just for fun. We shouldn't try t... (read more)

Ah, by the way, I think the link you posted accidentally links to this post.

Fixed, thanks!
You're absolutely right, that bet with Bryan gives very little information! It definitely doesn't compare at all to a proper track record over many questions.

I think your explanation for his behavior is good.  I don't think it's justified, though -- or at least, I am deeply suspicious of him thinking anything thematically similar to "I have to obfuscate my forecasting competence, for the good of the world, but I'll still tell people I'm good at it".  The more likely prior is just that people don't want to lose influence/prestige.  It's like a Nobel laureate making predictions, and then not seeming so special afterward.
But then anyone who makes a precise bet could lose out in the same way. I assume you don't believe that betting in general is wrong, so where does the asymmetry come from? Is Yudkowsky excused from betting because he's actually right?

Would it be helpful to think about something like "what Brier score will a person in the reference class of "people-similar-to-Eliezer_2022-in-all-relevant-ways" have after making a bunch of predictions on Metaculus?" Perhaps we should set up this sort of question on Metaculus or Manifold? Though I would probably refrain from explicitly mentioning Eliezer in it.

That might be possible, but it would take a lot of effort to make resolvable.  Who is "similar to Eliezer", and how do we define that in advance? Which forecasting questions would we check their Brier score on? Which not-like-Eliezer forecasters would we compare them to (since scores are much more informative for ranking forecasters than in isolation)? Etc.  I'd rather people like Eliezer just placed their forecasts properly!
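For what it's worth, the Brier score such a question would resolve on is straightforward to compute: the mean squared difference between stated probabilities and binary outcomes, with lower being better (0 is perfect, 0.25 is what a constant 50% forecast earns). A minimal sketch with entirely made-up numbers:

```python
def brier_score(forecasts):
    """Mean squared error between probabilities and 0/1 outcomes (lower is better)."""
    return sum((p - o) ** 2 for p, o in forecasts) / len(forecasts)

# Hypothetical forecasts: (probability given, outcome as 0 or 1).
candidate = [(0.9, 1), (0.2, 0), (0.6, 1)]
community = [(0.7, 1), (0.4, 0), (0.5, 1)]

print(round(brier_score(candidate), 4))  # 0.07
print(round(brier_score(community), 4))  # 0.1667
```

As the comparison question above suggests, the raw number means little in isolation; it's the gap against other forecasters on the same questions that carries the information.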