IMO challenge bet with Eliezer

22Eliezer Yudkowsky

20paulfchristiano

13toner

12paulfchristiano

2ESRogs

3paulfchristiano

8ESRogs

13Zach Stein-Perlman

7Thomas Kwa

5paulfchristiano

13Austin Chen

6paulfchristiano

5Austin Chen

11Raemon

8Richard_Ngo

8paulfchristiano

2Richard_Ngo

6Daniel Paleka

2paulfchristiano

5Tomás B.

3Mo Putera

9Tomás B.

4paulfchristiano

4paulfchristiano

1Eliezer Yudkowsky

New Comment

25 comments, sorted by Click to highlight new comments since: Today at 12:55 PM

I emphasize that we arrived at this concrete disagreement in the wake of my naming multiple sure-that-could-happen predictions of Paul's that I thought looked more likely in Paul's universe than my own, including:

- Average Mike Blume-level programmers making $10M/year.
- The world economy doubling in any four-year period before the world ends.
- $10 trillion spent on training any AI model.

I'd be happy to do bet about fast growth prior to end of world. I'd say something like:

> 25% chance that there are 4 consecutive years before 2045 with >10% real economic growth per year, under an accounting scheme that feels plausible/real to Eliezer and Paul, and the fast growth sure feels to Paul and Eliezer like it's about automation rather than some weird thing like recovery from a pandemic.

You would win when the end of the world looks close enough that we can't get 4 years of progress (e.g. if we have 50% growth in one year with <10% progress in the year before, or if it becomes clear enough that we won't have time for rapid growth before the AI-R&D-AI gets us to a singularity). It's kind of unsatisfying not to pay out until right before the end with kind of arbitrary conditions (since it's so hard for you to actually win before 2045 unless the world ends).

I would give you higher probabilities if we relax 2045, but then it just gets even harder for you to win. Probably there is some better operationalization which I'm open to. Maybe a higher threshold than 10% growth for 4 years is better, but my probabilities will start to decline and I'd guess we get more expected info from disagreements about higher probability events.

(I don't know what kind of programmer Mike Blume is but don't expect to ever have many $10M+ programmers. Likewise, I don't think we have $10T spent training a single AI model until after we've already faced down alignment risk, and I assume that you also thinks that society will eventually use >>$10T of resources on training a single ML model. I would have thought we could also bet about softer versions of those things---I assume I also assign much higher probability than you do $1B training runs even if it's not worldview-falsifying, I don't think worldviews usually get falsified so much as modestly undermined.)

You might want to rethink the definition of "hardest problem" because I wouldn't expect the IMO committee to care particularly much that their problems are hard for machines as well as humans.

For example, if you look at the 2005 IMO, but suppose that problem 6 was geometry, so that the problem 3 was the "hardest", then you're in trouble. Problem 3 has a one-line proof "by sum_of_squares", assuming that someone has written a sum-of-squares tactic in time and that the sum-of-squares certificate is small enough to find within the time limit.

Alternatively, the IMO committee might set a problem without realizing that it's close to a special case of something already in mathlib.

As defined, I would put higher probability on solving the "hardest problem" than getting a gold medal and would bet on that.

Yeah, this may well be right. I agree that IMO 2015 problem 3 can be solved by machine. I think both the geometry and 3 variable inequalities could be a very big problem (in that lots of instances are easy for machines and hard for humans). Some diophantine equations and functional equations are also very easy for machine. By 2022 my vague understanding is that the committee is much less likely to put that kind of question in the hardest slot since increasingly many contestants can basically do the machine-like proofs. But I maybe shouldn't count on that, since the big problem is that we are letting it win at any of 4 IMOs (2022, 2023, 2024, 2025).

I'd guess we have like a 2-3% chance of 3-variable inequality in hardest spot (hasn't happened in last 20 years and is getting less likely, but not *that* confident). By now I think it would be considered unreasonable to put a muirhead bashable problem as 3/6 (though even the other 3 variable inequalities will likely be *much* easier for machines than humans), even in 2005 I think that was probably considered a mistake. But still lots of these may be easy for machines, so you may easily be looking at like 10% chance of one of them being a 3-variable inequality and a 50% chance of solving it conditioned on a serious effort which would already be >4%.

Other candidate definitions would have been:

- Instead of just using the type of problem, try to explicitly flag constant-number-variables inequalities or diophantine equations or functional equations as special cases to be ranked lower.
- Have a stronger preference for combo, e.g. pick from amongst 3/6 (or even 2/3/5/6) with preference for combo then NT then algebra than geo. (But I'm scared of e.g. combo 2/5 that can be solved with easy monovariant or that kind of thing, I strongly suspect many of those are also automatically solvable without deep learning.)
- Actually ask someone who doesn't yet know the results of running a machine which problem seems hardest for machine.

I also agree that "special case of existing result" is plausible, but I think that's going to be <1% (and it's fairly likely to be missed by machines even so). Though that could get worse if many more formal proofs are added in the interim as part of increased investment in theorem proving which could drive that up, and that's very correlated with a serious effort that could effectively exploit such data.

If Eliezer wants to actually bet about the hardest problem claim I may want to argue for some kind of reformulation (even though I said I didn't want to change my predictions). If Eliezer doesn't want to bet about it, I think it's OK to ding me epistemically if the AI ends up getting a "hardest problem" that is a 3 variable inequality or bashable functional or diophantine equation, though I think it's a bit of a technicality / a general error in my forecasting rather than a claim about AI.

Problems 1 and 4 are often very easy and nearly-mechanical for humans. And it's not uncommon for 2 of the other problems to happen to be easy as well (2/5 are much easier than 3/6 and it's considered OK for them to be kind of brute-forceable, and I expect that you could solve most IMO geometry problems without any kind of AI if you tried hard enough, even the ones that take insight for a human).

Sometimes you can get a gold by solving only 4/6 of the problems, and that's not very well-correlated with whether the problems happen to be easy for machine. And even when you need 5/6 it looks like sometimes you could get a very lucky break.

So if you get 4 swings, it looks fairly likely that one of them will be quite easy, and that dominates P(gold). E.g. I wouldn't be too surprised by someone doing a bronze this year, having a swing next year and getting a silver or gold based on how lucky they get, and then if they get a silver trying again the next year just to be able to say they did it.

What I was expressing a more confident prediction about was an idea like "The hardest IMO problems, and especially the *ad hoc* problems, seem very hard for machines." Amongst humans solving those problems is pretty well-correlated with getting golds, but unfortunately for machines it looks quite noisy and so my probability had to go way up.

Unfortunately "hardest problem" is also noisy in its own way (in large part because a mechanical definition of "hardest" isn't good) so this isn't a great fix.

Edit: now refers to footnote 2, see below.

If Eliezer wins, he gets 1 bit of epistemic credit. So e.g. if Eliezer and I used to get equal weight in a mixture of experts, now Eliezer should get 2x my weight. Conversely, if I win then I should get 1.1x his weight.

The second and third sentences do not follow from the first. I would update to more like 1.05x vs 1.005x (or ideally something more specific depending on more details than binary questions like "solved hardest" or "won gold," eg among other nuances I'd update more if the bet resolved for Eliezer in 2022 vs 2025).

If you are asserting that observers *should* eg update from 1:1 to 2:1 toward Eliezer if he wins, that seems far too strong to me, particularly given that neither of you seems to have spent a great deal of effort on the IMO question, but I would be very interested to hear why you think we should update so much.

These numbers come from if each party were actually in a mixture of experts, or equivalently Kelly-betting their epistemic credit points on a prediction market. We might update less due to some combination of

- other evidence we have about Paul and Eliezer is not independent with this
- The bet is between Paul thinking for a few hours, and Eliezer thinking for a few hours. But what we really want to decide between is Paul's model being right vs Eliezer's model being right, and the bet is not perfect evidence of that.

I set up a Manifold market for this, for anyone who'd like to bet along: https://manifold.markets/Austin/will-an-ai-get-gold-on-any-internat

The concept of epistemic bits is quite interesting! In Manifold terms, that would be like comparing the total amount of M$ one has amassed (maybe in a specific community like the one for AI) and using them as weights for how seriously to take two commentators.

I worked through an example just to prove the math to myself:

Eliezer and Paul both start with 1000 M$.

For Eliezer to end up with 2x the epistemic weight, Paul wagers 333 M$ (so the end result would be 1333:667 or ~2:1).

Taking 9:1 odds (roughly geometric mean between 8% and 16%), Eliezer puts up 37 M$.

So if Paul wins, he ends up at 1037:963, or ~1.08:1

Yes -- this surfaces the issue with long-term markets, where it's not worth the investment lockup when you could be trading on issues with a shorter resolution time.

We have a few tentative proposals to address this:

- Issue small interest-free loans on the first M$10 bet (Scott's proposal)
- Margin loans against the value of of your long-term bets
- Set up derivative markets (e.g. "What will the IMO market be at on April 1st 2022")
- Demurrage, where we charge a fee for cash balances

I'm partial to Scott's proposal, though it may be technically tricky to implement...

I believe most people who work in deep learning (but don't have inside evidence about particular theorem-proving efforts currently underway) are on Eliezer's side and would go even higher than 16%.

In the broader world I think probably it's more divided, and I'd guess that most academics in AI would be on my side though I'm not sure.

This is definitely a thing that makes me uneasy about the bet (I would have loved for Eliezer to be the one with the more contrarian take, and for my position to be closer to 50-50, but you get what you get).

I do not think the ratio of the "AI solves hardest problem" and "AI has Gold" probabilities is right here. Paul was at the IMO in 2008, but he might have forgotten some details...

(My qualifications here: high IMO Silver in 2016, but more importantly I was a Jury member on the Romanian Master of Mathematics recently. The RMM is considered the harder version of the IMO, and shares a good part of the Problem Selection Committee with it.)

The IMO Jury does not consider "bashability" of problems as a decision factor, in the regime where the bashing would take good contestants more than a few hours. But for a dedicated bashing program, it makes no difference.

It is extremely likely that an "AI" solving most IMO geometry problems is possible today -- the main difficulty being converting the text into an algebraic statement. Given that, polynomial system solvers should easily tackle such problems.

Say the order of the problems is (Day 1: CNG, Day 2: GAC). The geometry solver gives you 14 points. For a chance on IMO Gold, you have to solve the easiest combinatorics problem, plus one of either algebra or number theory.

Given the recent progress on coding problems as in AlphaCode, I place over 50% probability on IMO #1/#4 combinatorics problems being solvable by 2024. If that turns out to be true, then the "AI has Gold" event becomes "AI solves a medium N or a medium A problem, or both if contestants find them easy".

Now, as noted elsewhere in the thread, there are various types of N and A problems that we might consider "easy" for an AI. Several IMOs in the last ten years contain those.

In 2015, the easiest five problems consisted out of: two bashable G problems (#3, #4), an easy C (#1), a diophantine equation N (#2) and a functional equation A (#5). Given such a problemset, a dedicated AI might be able to score 35 points, without having capabilities remotely enough to tackle the combinatorics #6.

The only way the Gold probability could be comparable to "hardest problem" probability is if the bet only takes general problem-solving models into account. Otherwise, inductive bias one could build into such a model (e.g. resort to a dedicated diophantine equation solver) helps much more in one than in the other.

I think this is quite plausible. Also see toner's comment in the other direction though. Both probabilities are somewhat high because there are lots of easy IMO problems. Like you, I think "hardest problem" is quite a bit harder than a gold, though it seems you think the gap is larger (and most likely it sounds like you put a much higher probability on an IMO gold overall).

Overall I think that AI can solve most geometry problems and 3-variable inequalities for free, and many functional equations and diophantine equations seem easy. And I think the easiest problems are also likely to be soluble. In some years this might let it get a gold (e.g. 2015 is a good year), but typically I think it's still going to be left with problems that are out of reach.

I would put a lower probability for "hardest problem" if we were actually able to hand-pick a really hard problem; the main risk is that sometimes the hardest problem defined in this way will also be bashable for one reason or another.

Right now Eliezer has gold at "16%+" and I have it at "8%."

I think Eliezer would probably like to claim more than a factor of 2 of epistemic credit if this were to happen (e.g. a more significant update than if I'd put 1/3 probability on something and he'd put 2/3 probability).

If that's right, then I think it would be good to publicly state a higher probability than 16% in advance, so that I can get correspondingly more credit in the worlds where this doesn't happen.

Eliezer and I publicly stated some predictions about AI performance on the IMO by 2025. In honor of OpenAI's post Solving (Some) Formal Math Problems, it seems good to publicly state and clarify our predictions, have a final chance to adjust them, and say a bit in advance about how we'd update.

## The predictions

Eliezer and I had an exchange in November 2021.

^{[1]}My final prediction (after significantly revising my guesses after looking up IMO questions and medal thresholds) was:Eliezer spent less time revising his prediction, but said (earlier in the discussion):

So I think we have Paul at <8%, Eliezer at >16% for AI made before the IMO is able to get a gold (under time controls

etc.of grand challenge) in one of 2022-2025.Separately, we have Paul at <4% of an AI able to solve the "hardest" problem under the same conditions.

I don't plan to revise my predictions further, but I'd be happy if Eliezer wants to do so any time over the next few weeks.

Earlier in the thread I clarified that my predictions are specifically about gold medals (and become even sharper as we move to harder problems), I am not surprised by silver or bronze. My guess would be that Eliezer has a more broad distribution. The comments would be a good place for Eliezer to state other predictions, or take a final chance to revise the main prediction.

## How I'd update

The informative:

^{[2]}^{[3]}These kinds of updates are slow going, and it would be better if we had a bigger portfolio of bets, but I'll take what we can get.The uninformative:

dobear on these things, but we weren't able to pick out a disagreement on any of them.If an AI wins a gold on some but not all of those years, without being able to solve the hardest problems, then my update will be somewhat more limited but in the same direction. If an AI wins a bronze/silver medal, I'm not making any of these updates and don't think Eliezer gets any credit unless he wants to stake some predictions on those lower bars (I consider them much more likely, maybe 20% for "bronze or silver" vs 8% on "gold," but that's less well-considered than the bets above, but I haven't thought about that at all).

^{^}We also looked for claims that Eliezer thought were very unlikely, so that he'd also have an opportunity to make some extremely surprising predictions. But we weren't able to find any clean disagreements that would resolve before the end of days.

^{^}I previously added the text: "So e.g. if Eliezer and I used to get equal weight in a mixture of experts, now Eliezer should get 2x my weight. Conversely, if I win then I should get 1.1x his weight." But I think that really depends on how you want to assign weights. That's a very natural algorithm that I endorse generally, but given that neither of us really has thought carefully about this question it would be reasonable to just not update much one way or the other.

^{^}More if he chooses to revise his prediction up from 16%, or if he wants to make a bet about the "hardest problem" claim where I'm at 4%.