[Update: we have taken this bet with two people, as detailed in a comment below.]

Recently, a post claimed,

it seems very possible (>30%) that we are now in the crunch-time section of a short-timelines world, and that we have 3-7 years until Moore's law and organizational prioritization put these systems at extremely dangerous levels of capability.

We (Tamay Besiroglu and I) think this claim is strongly overstated, and disagree with the suggestion that “It's time for EA leadership to pull the short-timelines fire alarm.” This post received a fair amount of attention, and we are concerned about a view of the type expounded in the post causing EA leadership to try something hasty and ill-considered.

To counterbalance this view, we express our disagreement with the post. To substantiate and make concrete our disagreement, we are offering to bet up to $1000 against the idea that we are in the “crunch-time section of a short-timelines world.”

In particular, we are willing to bet at 1:1 odds that no more than one of the following events will occur by 2026-01-01, or alternatively, at 3:1 odds (in our favor) that no more than one of the following events will occur by 2030-01-01.

  • A model/ensemble of models achieves >80% on all tasks in the MMLU benchmark
  • A credible estimate reveals that an AI lab deployed EITHER >10^30 FLOPs OR hardware that would cost $1bn if purchased through competitive cloud computing vendors at the time on a training run to develop a single ML model (excluding autonomous driving efforts)
  • A model/ensemble of models will achieve >90% on the MATH dataset using a no-calculator rule
  • A model/ensemble of models achieves >80% top-1 strict accuracy on competition-level problems on the APPS benchmark
  • A gold medal for the IMO Grand Challenge (conditional on it being clear that the questions were not in the training set)
  • A robot that can, from beginning to end, reliably wash dishes, take them out of an ordinary dishwasher and stack them into a cabinet, without breaking any dishes, and at a comparable speed to humans (<120% the average time)
  • Tesla’s full-self-driving capability makes fewer than one major mistake per 100,000 miles
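As a rough illustration of how the two offers resolve, here is a minimal Python sketch. The names `BetTerms` and `resolve` are hypothetical, and the stakes assume the full $1000 is wagered on our side, so the 3:1 offer has the counterparty putting up $3000.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BetTerms:
    """Illustrative container for one of the two offers (names are hypothetical)."""
    deadline: str       # ISO date by which the events must have occurred
    our_stake: float    # what we pay if two or more events occur
    their_stake: float  # what the counterparty pays if at most one occurs


def resolve(events_occurred: int, terms: BetTerms) -> float:
    """Return the net payment to us (negative if we lose the bet)."""
    if events_occurred <= 1:
        # "No more than one of the following events" held: we win.
        return terms.their_stake
    return -terms.our_stake


# Assumed maximum-size versions of the two offers.
BET_2026 = BetTerms("2026-01-01", our_stake=1000, their_stake=1000)  # 1:1
BET_2030 = BetTerms("2030-01-01", our_stake=1000, their_stake=3000)  # 3:1 in our favor
```

For example, `resolve(2, BET_2030)` is a $1000 loss for us, while `resolve(1, BET_2030)` is a $3000 gain.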

Since we recognize that betting incentives can be weak over long time-horizons, we are also offering the option of employing Tamay’s recently described betting procedure in which we would enter a series of repeated 2-year contracts until the resolution date.

Specific criteria for bet resolution

For each task listed above, we offer the following concrete resolution criteria.

Task: A model/ensemble of models achieves >80% on all tasks in the MMLU benchmark
Full details: A public document reveals that a model or ensemble of models had an average accuracy score of more than 80.0% on every task listed in the Measuring Massive Multitask Language Understanding benchmark by Dan Hendrycks et al.

Task: A credible estimate reveals that an AI lab deployed EITHER >10^30 FLOPs OR hardware that would cost $1bn if purchased through competitive cloud computing vendors at the time on a training run to develop a single ML model (excluding autonomous driving efforts)
Full details: If there is a dispute, our counterparty will offer an example that they believe may trigger this condition to resolve positively. Then, we will use the method described in this post to estimate the number of FLOPs used for a given training run. We will then find the competitive rates by taking the minimum of the most recent prices found on Google Cloud, Microsoft Azure, and Amazon AWS, or the equivalent services corresponding to each of those companies if their names change in the future. We will determine the rate corresponding to the processing unit that would be capable of training the model in the paper. If the training run was completed more than 2 years before resolution, an adjustment is applied to the cost of the training run to account for price declines in computing. This adjustment will be based on a credible estimate of price-performance declines in the last few years, OR a constant decline of 30% per year, whichever is more favorable to the counterparty. We will use inflation-adjusted 2022 US dollars, according to the CPI as reported by the BLS.

Task: A model/ensemble of models will achieve >90% on the MATH dataset using a no-calculator rule
Full details: A public document reveals that a model or ensemble of models had an average accuracy score of more than 90% when evaluated on the MATH benchmark by Dan Hendrycks et al. The no-calculator rule ensures that competitors don't use calculators or script executors. As Hendrycks explains, “With a script executor, competitors do not need to figure out how to succinctly reason to the conclusion and cleverness is rarely needed.”

Task: A model/ensemble of models achieves >80% top-1 strict accuracy on competition-level problems on the APPS benchmark
Full details: A public document reveals that a model or ensemble of models had an average top-1 strict accuracy score of more than 80% on the competition-level coding problems listed in the APPS benchmark by Dan Hendrycks et al. Top-1 accuracy is distinguished, as in the paper, from top-k accuracy in which k outputs from the model are generated, and the best output is selected. Strict accuracy "requires programs pass every test case" (as defined in section 4.2 of the paper).

Task: A gold medal for the IMO Grand Challenge
Full details: We will use the same criteria as revealed by Paul Christiano here. To clarify, we will not require that the model be open-source, only that it clearly demonstrates the ability to obtain a gold medal for the IMO.

Task: A robot that can, from beginning to end, reliably wash dishes, take them out of an ordinary dishwasher and stack them into a cabinet, without breaking any dishes, and at a comparable speed to humans (<120% the average time)
Full details: For this task, a credible video or report must be public that clearly demonstrates that this ability is feasible to a very high level of confidence. Videos or reports that do not report rates of system malfunctions, or do not permit the calculation of how much time it would take to take out and stack at least 10 dishes, are not admissible. An error is something that would have a reasonable chance of a dish being broken if performed in an ordinary setting. Reliability is defined as an error rate below 1 percent across all full loading-and-unloading tests, when given natural starting setups, i.e., pre-programmed precise robotic planning is not admissible. This blog post will be consulted to determine the average time it takes to load and unload dishes, under the section “Time Study Of Dishes”. In case there is a dispute over this question, our counterparty will offer an example that they believe may trigger this condition to resolve positively. We will then email the first author of whatever paper, blog post, or video is in question, asking whether they believe their system satisfies the conditions specified here, using their best judgment.

Task: Tesla’s full-self-driving capability makes fewer than one major mistake per 100,000 miles
Full details: We will require that a credible document from Tesla report crash or error-rate statistics for cars using their full self-driving mode. The error rate must include errors that would have happened if it were not for the presence of human intervention, and it must be calculated over a reasonable distribution over all weather and road conditions that humans normally drive in. The cars must be able to go from arbitrary legal locations in the US to other arbitrary locations without errors or interventions, at least in theory. A “major mistake” is defined as a mistake that requires human intervention to correct, as otherwise the car would grossly violate a traffic law, crash into an object, fail to reach its intended destination, or have the potential to put human life into serious jeopardy.
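The cost criterion for the training-run task can be sketched roughly as follows. This is only one reading of the procedure: the function name, the vendor keys, and the input of "device-hours" are hypothetical, the FLOP estimation method itself is not reproduced here, and the sketch assumes the adjustment scales recent prices back up to what the run would have cost at the time it was performed. Inflation adjustment to 2022 dollars is omitted.

```python
def adjusted_training_cost_usd(
    device_hours: float,
    hourly_rates_usd: dict[str, float],
    years_since_run: float,
    annual_price_decline: float = 0.30,
) -> float:
    """Estimate the cloud cost of a training run under the stated criteria.

    device_hours:         assumed total accelerator-hours for the run
    hourly_rates_usd:     most recent per-device-hour price at each vendor
    years_since_run:      years between the run's completion and resolution
    annual_price_decline: the constant 30%/year fallback from the criteria
    """
    # Competitive rate: the minimum across the listed vendors.
    rate = min(hourly_rates_usd.values())
    cost_at_recent_prices = device_hours * rate

    # Assumed reading: only runs completed more than 2 years before
    # resolution are adjusted, by undoing the price decline since the run.
    if years_since_run > 2:
        cost_at_recent_prices /= (1 - annual_price_decline) ** years_since_run
    return cost_at_recent_prices
```

Under this reading, a run priced at $2m with recent rates and completed 3 years before resolution would be treated as costing about $2m / 0.7^3, or roughly $5.8m, at the time.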

Some clarifications

For each benchmark, we will exclude results that employed some degree of cheating. Cheating includes cases in which the rules specified in the original benchmark paper are not followed, or cases where some of the test examples were included in the training set. 

98 comments

Ok, I take your bet for 2030. I win, you give me $1000. You win, I give you $3000. Want to propose an arbiter? (since someone else also took the bet, I'll get just half the bet, their $500 vs my $1500)

Shouldn't it be: 'They pay you $1,000 now, and in 3 years, you pay them back plus $3,000' (as per Bryan Caplan's discussion in the latest 80k podcast episode)? The money won't do anyone much good if they receive it in a FOOM scenario.

Since my goal is to convince people that I take my beliefs seriously, and this amount of money is not actually going to change much about how I conduct the next three years of my life, I'm not worried about the details. Also, I'm not betting that there will be a FOOM scenario by the conclusion of the bet, just that we'll have made frightening progress towards one.

1Nathan Helm-Burger4d
Related: just for your amusement, here's a link to a bet about AI timelines that I won, but which I incorrectly believed that I would not win before the end of 2022. In other words, evidence of me being surprised by the high rate of AI progress... Interesting, eh? https://manifold.markets/MatthewBarnett/will-a-machine-learning-model-score-f0d93ee0119b#pzSuEYIhRiXoIFSjPQz2

For people reading this post in the future, I'd like to note that I have written a somewhat long comment describing my mixed feelings about this post, since posting it. You can find my comment here. But I'll also repeat it below for completeness:

The first thing I'd like to say is that we intended this post as a bet, and only a bet, and yet some people seem to be treating it as if we had made an argument. Personally, I am uncomfortable with the suggestion that our post was "misleading" because we did not present an affirmative case for our views.

I agree that LessWrong culture benefits from arguments as well as bets, but it seems a bit weird to demand that every bet come with an argument attached. A norm that all bets must come with arguments would seem to substantially dampen the incentives to make bets, because then each time people must spend what will likely be many hours painstakingly outlining their views on the subject.

That said, I do want to reply to people who say that our post was misleading on other grounds. Some said that we should have made different bets, or at different odds. In response, I can only say that coming up with good concrete bets about AI timelines is actua... (read more)

3rhollerith_dot_com3mo
It would make me sad if people on this site felt a need to apologize for "putting their money where their mouth is" (i.e., for offering to bet).

I might disagree with you epistemically but... what do I have to win if AGI happens before 2030 and I win the bet? I don't think either of us will still care about our bet after that happens. Doesn't this just run into all the standard problems of predicting doomsday?

Edit: Oh, I also just saw you meant 3:1 odds in your favor. That's... weird, since it doesn't even disagree with the OP? Why would the OP take the bet that you propose, given they only assign ~30% probability to this outcome?

Bryan Caplan and Eliezer are resolving their Doomsday bet by having Bryan Caplan pay Eliezer upfront, and if the doomsday scenario does not happen by Jan 1 2030, Eliezer will give Bryan his payout. It's a pretty good method for betting on doomsday.

Why would the OP take the bet that you propose, given they only assign ~30% probability to this outcome?

The conditions we offered fall well short of AGI, so it seems reasonable that the author would assign way more than 30% to this outcome. Furthermore, we offered a 1:1 bet for January 1st 2026.

Edit: The OP also says, "Crying wolf isn't really a thing here; the societal impact of these capabilities is undeniable and you will not lose credibility even if 3 years from now these systems haven't yet FOOMed, because the big changes will be obvious and you'll have predicted that right." which seems to imply that we will likely obtain very impressive capabilities within 3 years. In my opinion, this statement is directly challenged by our 1:1 bet.

Hmm, I guess it's just really not obvious that your proposed bet here disagrees with the OP. I think I roughly believe both the things that the OP says, and also wouldn't take your bet. It still feels like a fine bet to offer, but I am confused why it's phrased so much in contrast to the OP. If you are confident we are not going to see large dangerous capability gains in the next 5-7 years, I think I would much prefer you make a bet that tries to offer corresponding odds and the specific capability gains (though that runs a bit into "betting on doomsday" problems)

If you are confident we are not going to see large dangerous capability gains in the next 5-7 years, I think I would much prefer you make a bet that tries to offer corresponding odds and the specific capability gains (though that runs a bit into "betting on doomsday" problems)

What are the "specific capability gains" you are referring to? I don't see any specific claims in the post we are responding to. By contrast, we listed 7 concrete tasks that appear trivial to perform if we are at AGI-levels of capability, and very easy if we are only a few steps from AGI. I'd be genuinely baffled if you think AGI can be imminent at the same time we still don't have good self-driving cars, robots that can wash dishes, or AI capable of doing well on mathematics word problems. This view would seem to imply that we will get AGI pretty much out of nowhere.

8ChristianKl3mo
An AGI might become a dictator in every country on earth while still not being able to wash dishes, or while still making errors when it comes to driving 100,000 miles. Physical coordination is not required. It's not clear to me what practical implications it has to measure the math abilities of models with a no-calculator rule. If someone were to build an AGI, it makes sense for the AGI to be able to access subprocesses such as calculators for these tasks.
1aogara3mo
How would you expect an AI to take over the world without physical capacity? Attacking financial systems, cybersecurity networks, and computer-operated weapons systems all seem possible from an AI that can simply operate a computer. Is that your vision of an AI takeover, or are there other specific dangerous capabilities you'd like the research community to ensure that AI does not attain?
7habryka3mo
I mean, Eliezer has commented on this position extensively in the AI dialogues. I do think we would likely see AI doing well on mathematics word-problems, but the other two are definitely not things I obviously expect to see before the end (though I do think it's more likely than not that we would see them). Zooming out a bit though, I am confused what you are overall responding to with your comment. The thing I am critiquing is not about the "specific capability gains". It's just that you are responding to a post saying X, with a bet at odds Y that do not contradict X, and indeed where I think it's a reasonable common belief to hold both X and Y. Like, if someone says "it's ~30% likely" and you say "That seems wrong, I am offering you a bet that you should only take if you have >70% probability on a related hypothesis" then... the obvious response is "but I said I only assign ~30% to this hypothesis, I agree that I assign somewhat more to your weaker hypothesis, but it's not at all obvious I should assign 70% to it, that's a big jump". That's roughly where I am at.
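The arithmetic behind the ">70%" figure in the comment above can be sketched as follows. The function name is hypothetical, and the stakes assume the full-size offers ($1000 on the authors' side).

```python
def break_even_probability(amount_risked: float, amount_won: float) -> float:
    """Minimum probability of winning at which a bet has non-negative expected value.

    Solves p * amount_won = (1 - p) * amount_risked for p.
    """
    return amount_risked / (amount_risked + amount_won)


# Counterparty's side of the 2030 offer: risk $3000 to win $1000.
counterparty_2030 = break_even_probability(3000, 1000)

# Counterparty's side of the 2026 offer: risk $1000 to win $1000.
counterparty_2026 = break_even_probability(1000, 1000)
```

On these assumptions the counterparty needs at least 75% credence that two or more events occur by 2030 for the 3:1 bet to break even, and at least 50% for the 1:1 bet by 2026.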
1Not Relevant2mo
As the previous OP, to chime in, the specific mechanism by which self-driving cars don’t work but FOOM does is extremely high-capability consequentialist software engineering plus not-much-better-than-today world modeling. Self-driving and manipulation require incredible-quality video/world modeling, and a bunch of control problems that seem unrelated to symbolic intelligence. Re: solving math problems, that seems way more likely to be a thing such a system could do; the only uncertainty is whether someone invests the time, given it’s not profitable.
3Veedrac3mo
You didn't bet against any of those happening in 5-7 years, though. You bet against it being >25% likely by 2030, or >50% likely in 4. Your bet is completely in concordance with it being more likely than not to happen in 5-7 years.
2Mitchell_Porter3mo
Do you have any suggestions for what to do when the entire human race is economically obsolete and definitively no longer has any control over its destiny? Your post gives no indication of what you would do or recommend, when your benchmarks are actually surpassed.
9Not Relevant3mo
Just for the record, I regret that statement, independent of making a bet or not.

If you expect the apocalypse to happen by a given date, you should rationally value having money then much less than the market (if the market doesn't expect the apocalypse). So you can simulate a bet by having an apocalypse-expecter take a high-interest-rate loan from an apocalypse-denier, paying the loan back (if the world survives) at the date of the purported apocalypse (h/t daniel filan).

4MichaelStJules3mo
Couldn't they just get lower interest rate loans elsewhere? Or, interest doesn't start until the bet outcome date passes? I'd give the apocalypse-expecter $1000 now, and they pay me back with interest when the outcome date passes, with no interest payments before then. For those wanting to lend out money to gain interest on and use that money for EA causes, this might be useful: https://founderspledge.com/stories/investing-to-give
4NunoSempere3mo
This doesn't mean necessarily that you shouldn't take the bet, but maybe that you should also take the loan.
3MichaelStJules3mo
Ya, I was thinking this, too, but they could possibly get a lot of loans or much larger loans at lower interest rates, and it's not clear when they would start looking at this one as the next best to pursue. Maybe it's more time-efficient (more loaned money per hour spent setting up and dealing with) to take this kind of AI-bet loan, though, but $1000 is very low.
2interstice3mo
Yeah, this is what I had in mind. There wouldn't be interest payments until the date of the apocalypse.
7Tamay3mo
We also propose betting using a mechanism that mitigates some of these issues:
7Matthew Barnett3mo
Also, we give odds of 1:1 if anyone wants to take us up on the bet for January 1st 2026.

I want to state here that I regret my previous post, and have retracted it, primarily because it was not constructive and I think this post does an excellent job of calling out what a specific constructive dialogue looks like.

Of the above, the only ones that seem likely to me in the world I was imagining are MMLU and APPS - I'm much less familiar with the two math competitions, which seem like the other plausible ones.

I think I'll take you up on the 2026 version at 1:1 odds.

Is it really constructive? This post presents no arguments for why they believe what they believe, which should serve very little to convince others of long timelines. Moreover it proposes a bet from an asymmetric position that is very undesirable for short-timeliners to take, since money is worth nothing to the dead, and even in the weird world where they win the bet and are still alive to settle it, they have locked their money for 8 years for a measly 33% return - less than expected by simply, say, putting it in index funds. Believing in longer timelines gives you the privilege of signalling epistemic virtue by offering bets like this from a calm, unbothered position, while people sounding the alarm sound desperate and hasty, but there is no point in being calm when a meteor is coming towards you, and we are much better served by using our money to do something now rather than locking it in a long term bet.

Not only that, the decision from mods to push this to the frontpage is questionable since it served as a karma boost to this post that the other didn't have, possibly giving the impression of higher support than it actually has.
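For reference, the "33% return over 8 years" figure in the comment above can be converted to an annualized rate with a short calculation. This is an illustrative sketch with a hypothetical function name; it assumes the counterparty risks $3000 to win $1000 and ignores any probability of losing.

```python
def annualized_return(total_return: float, years: float) -> float:
    """Convert a total return over a holding period into a compound annual rate."""
    return (1 + total_return) ** (1 / years) - 1


# Winning $1000 on a $3000 stake held for 8 years.
rate = annualized_return(1000 / 3000, 8)
print(f"{rate:.2%} per year")
```

This works out to roughly 3.7% per year, which is the basis for the comparison with index funds.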

3Jotto9993mo
On the reduced value of money given catastrophe: that could be used in a betting circumstance. Someone giving higher-than-market estimates could take a high-interest "loan" from the person giving lower estimates of catastrophe. This can be rational and efficient for both of them, and help "price in" the implied probability of doom.
4Ricardo Meneghin3mo
Well, if OP is willing then I'd love to take a high-interest loan from him to be paid back in 2030.
2Yitz3mo
By the way, just in case you didn’t know, you can edit your original post with a disclaimer at the beginning or something, if you want to make clear how your opinions have changed.
3Not Relevant3mo
Already done.

I guess I'd be curious about your reasons of thinking that timelines are longer.

I am also willing to take your bet for 2030. 

I would propose one additional condition: If there is evidence of a deliberate or coordinated slowdown on AGI development by the major labs, then the bet is voided. I don't expect there will be such a slowdown, but I'd rather not be invested in it not happening.

I think this post is epistemically weak (which does not mean I disagree with you):

  1. Your post pushes the claim that “It's time for EA leadership to pull the short-timelines fire alarm.” wouldn't be wise. Problems in the discourse: (1) "pulling the short-timelines fire alarm" isn't well-defined in the first place, (2) there is a huge inferential gap between "AGI won't come before 2030" and "EA shouldn't pull the short-timelines fire alarm" (which could mean sth like e.g. EA should start planning to start a Manhattan project for aligning AGI in the next few years.), and (3) your statement "we are concerned about a view of the type expounded in the post causing EA leadership to try something hasty and ill-considered" that slightly addresses that inferential gap is just a bad rhetorical method where you interpret what the other said in a very extreme and bad way, although the other person actually didn't mean that, and you are definitely not seriously considering the pros and cons of taking more initiative. (Though of course it's not really clear what "taking more initiative" means, and critiquing the other post (which IMO was epistemically very bad) would be totally right.)
  2. You're not gi
... (read more)
5Matthew Barnett3mo
I don't agree that we sold our post as an argument for why timelines aren't short. Thus, I don't think this objection applies. That said, I do agree that the initial post deserves a much longer and nuanced response. While I don't think it's fair to demand that every response be nuanced and long, I do agree that our post could have been a bit better in responding to the object-level claims. For what it's worth, I do hope to write a far more nuanced and substantive take on these issues in the relative near-term.
2Simon Skade3mo
You probably mean "why timelines aren't short". I didn't think you explicitly thought it was an argument against short timelines, but because the post got so many upvotes I'm worried that many people implicitly perceive it as such, and the way the post is written contributes to that. But great that you changed the title, that already makes it a lot better! I don't really think the initial post deserves a nuanced response. (My response would have been "the >30% 3-7 years claim is compared to current estimates of many smart people an extraordinary claim that requires an extraordinary burden of proof, which isn't provided".) But I do think that the community (and especially EA leadership) should probably carefully reevaluate timelines (considering arguments of short timelines and how good they are), so great if you are planning to do a careful analysis of timeline arguments!
4Veedrac3mo
I share your opinion that the post is misleading. Adding to the list,

  1. Bets don't pay out until you win them, and this includes in epistemic credit, but we need to realize we are in short timelines before they happen. If they are to lose this bet, we wouldn't learn from it until it is dangerously late.
  2. There are market arguments to update from betting markets ahead of time, but a fair few people have accepted the bet, so that does not transparently help the authors' case.
  3. 1:1 odds in 2026 on human-expert MMLU performance, $1B models, >90% MATH, >80% APPS top-1, IMO Gold Medal, or human-like robot dexterity is a short timeline. The only criterion that doesn't seem to me to support short timelines at 1:1 odds is Tesla FSD, and some people might disagree.
5Matthew Barnett3mo
I disagree. I think each of these benchmarks will be surpassed well before we are at AGI-levels of capability. That said, I agree that the post was insufficient in justifying why we think this bet is a reasonable reply to the OP. I hope in the near-term future to write a longer, more personal post that expands on some of my reasoning. The bet itself was merely a public statement to the effect of "if people are saying these radical things, why don't they put their money where their mouths are?" I don't think such statements need to have long arguments attached to them. But, I can totally see why people were left confused.
4Veedrac3mo
I appreciate that you changed the title, and think this makes the post a lot more agreeable. It is totally reasonable to be making bets without having to justify them, just as long as the making of a bet is not mistaken to be more evidence than its associated sustained market movement. Solving any of these tasks in a non-gamed manner just 14 years after AlexNet might not be at the point of AGI, or at least I can envision a future consistent with it coming prior, but it is significant evidence that AGI is not too many years out. I can still just about imagine today that neural networks might hit some wall that ultimately limits their understanding, but this point has to come prior to neural networks showing that they are almost fully general reasoners with the right backpropagation signal (it is after all the backpropagation that is capable of learning almost arbitrary tasks with almost no task-specialization). An alarm needs to precede alignment catastrophe by long enough that you have time to do something about it; isn't much use if it is only there to tell you how you are going to die. Bootstrapping is often painted as a model looking at its own code, thinking really hard, and writing better code that it knows to be better, but this is an extremely strong version of bootstrapping and you don't need to come anywhere close to these capabilities in order to start worrying about concrete dangers. I wrote a post that gave an example of a minimum viable FOOM [https://www.lesswrong.com/posts/kpPnReyBC54KESiSn/optimality-is-the-tiger-and-agents-are-its-teeth] , but it is not only possible to get to from that angle, nor the earliest level of capability where I think things will start breaking. It is worth remembering that evolution optimized for humanity from proto-humans that could not be given IMO Gold Medal questions and be expected to solve them. Evolution isn't intelligent at all, so it certainly is not the case that you need human level intelligence before you can

Mod note: there's some weirdness about this post being frontpage, and the post it's responding to being on personal blog. I'm not 100% sure of my preferred call, but, the previous post seemed to primarily be arguing a community-centric-political point, and this one seems more to be making a straightforward epistemic claim. (I haven't checked in with other moderators about their thoughts on either post)

I frontpaged it because I am very excited about bets on timelines/takeoff speeds. I do think the title and framing about what EA leadership should do is not really a good fit for frontpage, and (for frontpage) I would much prefer a post title that's something like "A Concrete Bet Offer To Those With Short-Timelines".

I would much prefer a post title that's something like "A Concrete Bet Offer To Those With Short-Timelines".

Thanks, I more-or-less adopted this exact title. I hope that makes things look a bit better.

9Ben Pace3mo
Seems good to me, thank you.
6Olga Babeeva3mo
Why do you think this? Would it have made a difference if, instead of referring to EA leadership, the post had said "we should sound the alarm" (as in readers/LW/EAs)?
9Ben Pace3mo
LessWrong is primarily a place to understand human rationality, AI, existential risk, and more, it is not primarily a place to do social coordination bids, and I want to select much more on users interested in and excited by the former than the latter. It would have made a difference, yes, though on the margin getting even closer to the object level is better for frontpage IMO.
7Ruby3mo
Speaking as a moderator, it's not obvious to me that LessWrong shouldn't be a place where coordination happens. It's scary and I don't know how to cause it to happen well, but if not here, where? We'd have to build something else (totally an option, but not something anyone has done).
5Ben Pace3mo
Yeah I think insofar as it's happened on LessWrong it's been better than happening on e.g. Facebook or only in-person. The rough story I am hoping for here is something like "Come for the frontpage object-level content, stay for the frontpage object-level content, but also you'll likely model / engage with local politics a bit if you stay (which is on personal blog)."

I'll happily emulate Matthew Barnett's and Tamay's bet for any interested counter-bettors, at pretty much any volume with substantially better odds (for you). I have a lot of vouches and am willing to use a middleman/mediator if necessary. The best way to contact me is on discord at PTB kao#2111

If the aim is for non-takeup of this bet to provide evidence against short timelines, I think you'd need to change the odds significantly: conditional on short timelines, future money is worth very little. Better to have an extra $1000 in a world where it's still useful than in one where it may already be too late.

Personal update:

The recent breakthrough on the MATH dataset has made me update substantially in the direction of thinking I’ll lose the bet. I’m now at about 50% chance of winning by 2026, and 25% chance of winning by 2030.

That said, I want others to know that, for the record, my update mostly reflects that I now think MATH is a relatively easy dataset, and my overall AGI median only advanced by a few years.

Previously, I relied quite heavily on statements that people had made about MATH, including the authors of the original paper, who indicated it was a difficult dataset full of high school “competition-level” math word problems. However, two days ago I downloaded the dataset and took a look at the problems myself (as opposed to the cherry-picked problems I saw people blog about), and I now understand that a large chunk of the dataset includes simple plug-and-chug and evaluation problems—some of them so simple that Wolfram Alpha can perform them. What’s more: the previous state of the art model, which was touted as achieving only 6.9%, was simply a fine-tuned version of GPT-2 (they didn’t fine-tune anything larger), which makes it very unsurprising that the prior SOTA was so low.

I... (read more)

4Tomás B.12h
I agree this is more of an update about what existing models were already capable of. I disagree that this means someone in your position should not be updating to significantly lower timelines. Even removing MATH, I'm pretty confident I will "win". If you want to replace it with something that more represents what you thought MATH did, I will probably take this second bet at the same odds.
1Matthew Barnett12h
I’m confused. I am not saying that, so I’m not sure which part of my comment you’re agreeing with. If I found something, I’d be sympathetic to taking another bet. Unfortunately I don’t know of any other good datasets.
3Tomás B.12h
The part about the previous SOTA being fine-tuned GPT-2, which means a lot of MATH performance was latent in LMs that existed at the time we made the bet. On top of this, the various prompting and data-cleaning changes strike me as revealing latent capacity.
1Matthew Barnett12h
If I thought large language models were already capable of doing simple plug-and-chug problems, I’m not sure why I’d update much on this development. There were some slightly hard problems that the model was capable of doing, that Google highlighted in their paper (though they were cherry-picked)—and for that I did update by a bit (I said my timelines advanced by “a few years”).
3Tomás B.12h
> If I thought large language models were already capable of doing simple plug-and-chug problems, I’m not sure why I’d update much on this development.

I suppose I just have different intuitions on this. Let's just make a second bet. I imagine you can find another element for your list you will be comfortable adding - it doesn't necessarily have to be a dataset, just something in the same spirit as the other items in the list.
1Matthew Barnett12h
I think I’ll pass up an opportunity for a second bet for now. My mistake was being too careless in the first place—and I’m not currently too interested in doing a deeper dive into what might be a good replacement for MATH.

Update: We have taken the bet with 2 people.

First: we have taken the 1:1 bet for January 1st 2026 with Tomás B. at our $500 to his $500.

Second: we have taken the 3:1 bet for January 1st 2030 with Nathan Helm-Burger at our $500 to his $1500.

Personal note

Just as a personal note (I'm not speaking for Tamay here), I expect to lose the 2030 bet with >50% probability. I took it because it has positive EV on my view, though not as much as I believed when I first drafted the bet. I also disagree with comments here that state that these bets imply that I have short timelines. I think there's a huge gap between AI performing well on benchmarks, and AI having a large economic splash in the real world. 
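To illustrate how a bet can have positive EV even when the bettor expects to lose it more often than not, here is a minimal sketch using the stakes stated above ($500 against $500 for 2026, $500 against $1500 for 2030). The function names are hypothetical, and the 40% win probability is an assumed example value, not a number given in the post.

```python
def break_even_probability(own_stake: float, counterparty_stake: float) -> float:
    """Win probability at which the bet has exactly zero expected value."""
    return own_stake / (own_stake + counterparty_stake)


def expected_value(p_win: float, own_stake: float, counterparty_stake: float) -> float:
    """Expected dollar gain: win the counterparty's stake, or lose your own."""
    return p_win * counterparty_stake - (1 - p_win) * own_stake


# Stakes taken from the update above.
break_even_2026 = break_even_probability(500, 500)    # 1:1 bet
break_even_2030 = break_even_probability(500, 1500)   # 3:1 bet

# Assumed example: a 40% chance of winning the 2030 bet (i.e. losing with >50%).
ev_2030_at_40_percent = expected_value(0.40, 500, 1500)

print(f"2026 break-even win probability: {break_even_2026:.2f}")
print(f"2030 break-even win probability: {break_even_2030:.2f}")
print(f"2030 EV at an assumed 40% win probability: ${ev_2030_at_40_percent:.0f}")
```

At 3:1 odds the break-even win probability is 25%, so any win probability between 25% and 50% is consistent with both "I expect to lose" and "positive EV".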

Here, we mostly focused on benchmarks because I think these metrics are fairly neutral markers between takeoff views. By this I mean that I expect fast-takeoff folks to think that AI will do well on benchmarks before we get to AGI, even if they think AI will have roughly zero economic impact before then. Since I wanted my bet to be applicable to people without slow-takeoff views, we went with benchmarks.

4Tomás B.3mo
I should probably account for the fact that I am the only one who took the 1:1 bet, but still I foolishly think I will win.

I think this would be more informative for the community if we had answers to the following questions here:

  1. What are the AI states of the art on these problems?
  2. How have the SoTAs changed over time?
  3. What is human performance on these problems (top human performance, average, any other statistics, the whole distribution, etc., whichever seems most useful)?

(Anyone can answer, and feel free to provide only partial information. I'm guessing the authors have a lot of this info handy already.)

Some questions/clarifications about the bet terms:

A robot that can, from beginning to end, reliably wash dishes, take them out of an ordinary dishwasher and stack them into a cabinet, without breaking any dishes, and at a comparable speed to humans (<120% the average time)

The dishwasher is the one actually washing the dishes, right, not the robot? The robot just needs to load the dishwasher, run it, and then unload it fast enough and without breaks?

Tesla’s full-self-driving capability makes fewer than one major mistake per 100,000 miles

Can we modify this...

It might be useful to create Metaculus predictions for the individual tasks. 

Matthew, Tamay: Refreshing post, with actual hard data and benchmarks. Thanks for that.

My predictions:

  • A model/ensemble of models achieves >80% on all tasks in the MMLU benchmark

No in 2026, no in 2030. Mainly due to the fact that we don't have much structured data and incentives to solve some of the categories. A powerful unsupervised AI would be needed to clear those categories, or more time.

  • A credible estimate reveals that an AI lab deployed EITHER >10^30 FLOPs OR hardware that would cost $1bn if purchased through competitive cloud computing ve
...
3Matthew Barnett3mo
The criteria adjust for inflation.
2FeepingCreature3mo
How much would your view shift if there was a model that could "engineer its own prompt", even during training?
3Lorenzo Rex3mo
A close call, but I would still lean toward no. Engineering the prompt is where humans leverage all their common sense and vast (w.r.t. the AI) knowledge.
2Nathan Helm-Burger3mo
Nice specific breakdown! Sounds like you side with the authors overall. Want to also make the 3:1 bet with me?
1Lorenzo Rex3mo
Thanks. Yes, pretty much in line with the authors. Btw, I would be super happy to be wrong and see advancement in those areas, especially the robotic one. Thanks for the offer, but I'm not interested in betting money.

I'll just note that several of these bets don't work as well if I expect discontinuous and/or inconsistently distributed progress, as was observed on many of the individual tasks in PaLM: https://twitter.com/LiamFedus/status/1511023424449114112 (obscured by % performance and by the top-level benchmark averaging 24 subtasks that spike at different levels of scaling).

I might expect performance just prior to AGI to be something like 99% 40% 98% 80% on 4 subtasks, where parts of the network developed (by gd) for certain subtasks enable more general capabilities

1rchplg3mo
Naive analogy: two tasks for humans: (1) tell time (2) understand mechanical gears. Training a human on (1) will outperform (2) for a good while, but once they get a really good model for (2) they can trivially do (1) & performance would spike dramatically
1TLW3mo
[Citation Needed] Designing an accurate mechanical clock is non-trivial[1], even assuming knowledge of gears.

[1] Understatement.
2rchplg3mo
Trivially do better than the naive thing a human would do*, sry (e.g. vs. looking at the sun & seasons, which is what I think a human trying to tell time would do to locally improve). Definitely agree can't trivially do a great job on traditional standards. Wasn't a carefully chosen example. The broader point was that some subskills can enable better performance at many tasks, which causes spiky performance in humans at least. I see no reason why this wouldn't apply to nns. (e.g. part of the nn develops a model of something for one task, and once it's good enough discovers that it can use that model for very good performance on an entirely different task - likely observed as a relatively sudden, significant improvement)

Just noting that I think you're arguing strongly against what is at most a weak man argument. (And given that the author retracted the post, it might just be a straw-man.)

Super excited to see the offers to bet, though.

Just noting that I think you're arguing strongly against what is at most a weak man argument. (And given that the author retracted the post, it might just be a straw-man.)

Before we wrote the post, the OP had something like 140 karma. Also, it was only retracted after we posted.

8Not Relevant3mo
As the OP, I endorse this.

A model/ensemble of models will achieve >90% on the MATH dataset using a no-calculator rule

Curious to hear if/how you would update your credence in this being achieved by 2026 or 2030 after seeing the 50%+ accuracy from Google's Minerva. Your prediction seemed reasonable to me at the time, and this rapid progress seems like a piece of evidence favoring shorter timelines. 


 

5Matthew Barnett4d
I've updated significantly. However, unfortunately, I have not yet seen how well the model performs on the hardest difficulty problems on the MATH dataset, which could give me a much better picture of how impressive I think this result is.
3Tomás B.4d
I’m pretty sure I will “win” my bet against him; even two months is a lot of time in AI these days.

A model/ensemble of models will achieve >90% on the MATH dataset using a no-calculator rule

A "no calculator rule". If the model is just a giant neural network, it is pretty clear what this means. (Although it's unclear why you should care; real-world neural nets are allowed to use calculators.) Over the general space of all AI techniques, it's unclear what this means.

A robot that can, from beginning to end, reliably wash dishes, take them out of an ordinary dishwasher and stack them into a cabinet, without breaking any dishes, and at a comparable speed t

...

I would like to bet against you here, but it seems like others have beaten me to the punch. Are you planning to distribute your $1000 on offer across all comers by some date, or did I simply miss the boat?

"We (Tamay Besiroglu and I) think this claim is strongly overstated, and disagree with the suggestion that “It's time for EA leadership to pull the short-timelines fire alarm.” This post received a fair amount of attention, and we are concerned about a view of the type expounded in the post causing EA leadership to try something hasty and ill-considered."


What harm do you think will come if this happens and what do you think should be done instead?

for the record I think all of those are going to happen by 2024 and I'm surprised you're willing to bet otherwise. other people already took the bet. but the improvements from geometric deep learning, conservation laws, diffusion models, 3D understanding, and recursive feedback on chip design are all moving very fast. embodiment is likely to be solved suddenly when the underlying models are structured correctly. I maintain my assertion from previous discussion that compute is the only limitation and that the deep learning community has now demonstrated that compute is the only thing stopping them. deep learning is certainly bumping up against a wall, but just like every other wall it has run into, it's just going to go around.

Reading the comments, it seems like the idea you’re presenting of giving concrete bets on timelines is a great one, but the details of implementation can definitely be improved, so that making such a bet is meaningful for an AI pessimist.

I haven't looked deeply at what the % on the ML benchmarks actually mean. On the one hand it would be a bit weird to me if in 2030 we still have not made enough progress on them, given the current rate. On the other hand, I trust the authors in that it should be AGI-ish to pass those benchmarks, and then I don't want to bet money on something far into the future if money might not matter as much then. (Also, without considering money mattering less or the fact that the money might not be delivered in 2030 etc., I think anyone taking the 2026 bet should take...

Hmm, two points:

First, $1000 is basically nothing these days, so no skin in the game. Something more leveraged would show that you are at least mildly serious. 

Second, none of your benchmarks are FOOMy. I would go for something like "At least one ML/AI company has an AI writing essential algorithms" (possibly validated by humans before being deployed).

Should be pointed out that $1000 is no skin in the game to you. To some people I know, $1000 would have been nearly lifesaving at certain points in their lives.

5shminux3mo
I'd be very surprised if that were the case for the two authors, but who knows.
6[anonymous]3mo
Might not be for the other people taking the bet. $1000 is a lot to me.
4Davidmanheim3mo
Fine, but the offer was for "up to $1000"
2philh3mo
Do you simply think $1000 wouldn't be nearly-lifesaving money for the authors? If so, I think you've kind of missed the point; you've replied to "X might not be true, e.g. in situation Y" by saying "but Y probably doesn't apply". Okay, but X still might not be true. Or do you think $1000 is no-skin-in-the-game / not-even-mildly-serious money for the authors? If so I think you're probably wrong, and even more probably overconfident. (I object mostly to the "at least mildly serious" part. I'm in a position where $1000 wouldn't make a noticeable difference to my life, so maybe it wouldn't be skin in the game for me. But I'm still not going to throw away $1000 on a bet I'm not even mildly serious about.) (Also: it feels distasteful to me to speculate about the authors' wealth here, and this kind of conversation feels like it's going to put pressure on them to share. I want to disavow that pressure, though I acknowledge the question is relevant.)

We will use inflation-adjusted 2022 US dollars.

Be aware that current inflation estimates are potentially distorted. It may be worth mentioning exactly what inflation estimate to use, lest you end up in a situation where this is true in some but not all estimates.

3Matthew Barnett3mo
I've now clarified that it refers to the consumer price index according to the BLS.

Does the MATH dataset have the worst scaling laws of all these tasks? (and math/logic tasks in general?)

  • A robot that can, from beginning to end, reliably wash dishes, take them out of an ordinary dishwasher and stack them into a cabinet, without breaking any dishes, and at a comparable speed to humans (<120% the average time)

Speed and ordinary dishwasher are pretty crucial here, right? Boston Dynamics claimed they could do this back in 2016, but much slower than the average human.

2Matthew Barnett3mo
Did they? The video you sent showed a robot placing a single cup from a sink into a dishwasher, and then placing a single can into a trash-can. This all looked pre-programmed. By contrast, we require that the robot must be able to put away dishes in ordinary situations (it can't know where the dishes are ahead of time, or the precise movements necessary to put them away). We also require that it achieve a low error rate, which Boston Dynamics did not appear to report. Also, yes, the speed at which robots can do this is a major part of the prediction.
0MichaelStJules3mo
Ah, my bad, missed that part. I guess not knowing where the dishes are ahead of time also rules out pre-training on the specific test environments, but it might be worth making that explicit, too.

I agree with the need for "skin in the game", for mostly the same reasons as you, and I think the AI Alignment field is falling prey to the unilateralist's curse here.

For anyone else who wants to bet on this, here's a market on manifold: