Review

The bet was arranged on Twitter between @MichaelVassar and me (link).

Conditions are similar to this question on Metaculus, except for the open-source condition (I win even if the AI is closed-source, and in fact I would very much prefer it to be closed-source).

@Zvi has agreed to adjudicate this bet in case there is no agreement on resolution.


Michael has asked me two questions by email, and I'm sharing my answers.

Any thoughts on how to turn winning these sorts of bets into people actually updating?

Geoffrey Hinton mentioned recently that, while GPT-4 can "already do simple reasoning", "reasoning is the area where we're still better" [source].

It seems to me that, once AI can beat humans at math, there won't be anything else fundamental where we're still better. I wish more people would realize this now.

For the people who disagree, I would like to get them to make their views known before that happens. I feel like many people just don't think enough about it and are in a "superposition state", where their belief can collapse to anything without causing them any surprise or any updating of their models. Maybe if they think about it and commit to their views today, they'll be more surprised when it happens and therefore more willing to change their minds about important matters.

Kelvin, thoughts on how you’ll update if it doesn’t at least come close?

Yes. I'll update in the following directions:

  • That it is much harder for search algorithms to do "amplification" right in rich environments with very large branching factors, like math (and the real world), than in games such as Go.
  • That superintelligence is still further away, even if LLMs find lots of economic use cases and replace many human jobs.
  • That we're probably bottlenecked on search algorithms, rather than compute power/model-size. This would have policy implications.

Dataset contamination is a big problem with these kinds of bets. The sorts of models I expect people to use to win an IMO will probably have seen those IMO questions before, or extremely similar ones. Also, I don't buy that winning an IMO means AI "beat humans at math". Mathematics research, on the level of proving a major theorem autonomously, requires quite different capabilities than winning an IMO.

Still, I'd guess there's maybe a 30-35% chance of an AI, NOT trained on prior IMO/maths contest questions, winning an IMO by 2026. 

From Metaculus' resolution criteria:

This question resolves on the date an AI system competes well enough on an IMO test to earn the equivalent of a gold medal. The IMO test must be most current IMO test at the time the feat is completed (previous years do not qualify).

I think this was defined on purpose to avoid such contamination. It also seems common sense to me that, when training a system to perform well on IMO 2026, you cannot include any data point from after the questions were made public.

At the same time, training on previous IMO/math contest questions should be fair game. All human contestants practice quite a lot on questions from previous contests, and the IMO is still very challenging for them.

I dunno, I think there are a LOT of old olympiad problems: not just all the old IMOs but also all the old national-level tests from every country that publishes them. (Bottom section here.) Even the most studious humans only study a small fraction of the existing problems, I think. Like, if someone literally read every olympiad-level problem and solution ever published, then went to a new IMO, I would expect them to find that at least a couple of the problems were sufficiently similar to something they’ve seen that they could get the answer without too much creativity. (That’s just a guess, not really based on anything.)

(That’s not enough for a gold by itself, but could be part of the plan, in conjunction with special-case AIs for particular common types of problems, and self-play-proof-assistant things, etc.)

I know a guy from the Physics Olympiads who was a mobile library of past olympiad problems. I think you're underestimating the level of weirdness you can find around. Maybe it's still a fraction of the existing problems, but I'd estimate enough to cover the non-redundant ones.

I would expect them to find that at least a couple of the problems were sufficiently similar to something they’ve seen that they could get the answer without too much creativity.

I've not been to the IMO, but based on comments I've overheard from people who have, I'd bet this already happens.

I see about ~100 books in there. I've met several IMO gold-medal winners, and I expect most of them to have read dozens of these books, or the equivalent in other forms. I know one who has read tens of olympiad-level books on geometry alone!

And yes, you're right that they would often pick one or two problems as similar to what they had seen in the past, but I suspect these problems still require a lot of reasoning even after the analogy has been established. I may be wrong, though.

We can probably inform this debate by getting the latest IMO and creating a contest for people to find which existing problems are the most similar to those in the exam. :)

Eh, there are not that many IMO problems, even including shortlisted ones. Since there are so few, IMO contestants basically solve all previous IMO problems as practice. So it's not like the AI has an unfair advantage.

I am of the opinion that adding the condition of "not trained on prior IMO/math contest problems" is ridiculous.

IMO problem solving (the kind you need for gold, specifically) is much closer to research math than to high school math. Generalizing from some IMO problems to others would be as impressive as starting from scratch.

I kind of disagree. (I was on the South Korean IMO team.) I agree that IMO problems are in a similar category of tasks to research math rather than high school math, but since IMO problems are intended to be solvable within a time limit, there is a (quite low, in an absolute sense) upper limit to their difficulty. Basically, the intended solution is no longer than a single page. Research math problems have no such limit and can be arbitrarily difficult, or have arbitrarily long solutions.

Edit: Apart from the time limit, length limit, and difficulty limit, another important aspect is that IMO problems are already solved, so they are known to be solvable. IMO problems are "Prove X". Research math problems, even if they are stated as "Prove X", are really "Prove or disprove X", and sometimes this matters.

"Prove or disprove X" is only like 2x harder than "Prove X." Sometimes the gap is larger for humans because of psychological difficulties, but a machine can literally just pursue both in parallel. (That said, research math involves a ton of problems other than prove or disprove.)

I basically agree that IMO problems are significantly easier than research math or other realistic R&D tasks. However, I think they are very much harder than the kinds of test questions that machines have solved so far. I'm not sure the difference is about high school math vs research math so much as about very easy problems vs problems designed to be challenging and require novel thinking.

My view, having spent a fair amount of time on IMO problems as well as on theoretical research and more practical R&D, is that the IMO is significantly easier but just not very far away from the kind of work human scientists need to do in order to be productive.

I think the biggest remaining difference is that the hardest research math problems operate over a timescale about 2-3 orders of magnitude longer than IMO problems, and I would guess transformative R&D requires operating over a timescale somewhere in between. (While IMO problems are themselves about 2-3 orders of magnitude longer for humans than questions that you can solve automatically.)

Research problems also involve a messier set of data and so training on "all IMO problems" is more like getting good at an incredibly narrow form of R&D. And I do think it's just cognitively harder, but by an amount that feels like much less than a GPT-3 to GPT-4 sized gap.

I'd be personally surprised if you couldn't close the gap between IMO gold and transformative R&D with 3-4 orders of magnitude of compute (or equivalent algorithmic progress) + an analogous effort to construct relevant data and feedback for particular R&D tasks. If we got an IMO gold in 2023 I would intuitively expect transformative AI to happen well before 2030, and I would shift my view from focusing more on compute to focusing more on data and adapting R&D workflows to benefit from AI.

At least in certain areas of mathematics, research problems are often easier than the harder IMO problems. That is to say, you can get pretty far if you know a lot of previously proven results and combine them in relatively straightforward ways. This seems especially true in areas where it is hard for a single human to know a lot of results, just because it takes a long time to read and learn things.

In the MIRI dialogues from 2021/2022, I thought you said you would update to 40% on AGI by 2040 if AI got an IMO gold medal by 2025? Did I misunderstand, or have you shifted your thinking (and if so, how)?

I agree timescale is a good way to think about this. My intuition is that if high school math problems are 1, then IMO math problems are 100 (1e2) and typical research math problems are 10,000 (1e4). So exactly halfway! I don't have first-hand experience with the hardest research math problems, but from what I've heard about timescales, they seem to reach 1,000,000 (1e6). I'd rate typical practical R&D problems at 1e3 and transformative R&D problems at 1e5.

Edit: Using this scale, I rate GPT-3 at 1 and GPT-4 at 10. This suggests GPT-5 for IMO, which feels uncomfortable to me! Thinking about this, I think that while there is lots of data at levels 1 and 10, there is considerably less at level 100, and above that most things are not written down. But maybe that is an excuse and it doesn't matter.
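For concreteness, here is a minimal sketch of the extrapolation implied by that scale (the task ratings and the one-order-of-magnitude-per-generation assumption are the rough intuitions above, not measured quantities):

```python
import math

# Rough difficulty ladder from the comment above, anchored at
# high-school problems = 1. These numbers are intuitions, not data.
difficulty = {
    "high school math": 1e0,
    "IMO problems": 1e2,
    "typical practical R&D": 1e3,
    "typical research math": 1e4,
    "transformative R&D": 1e5,
    "hardest research math": 1e6,
}

# The comment rates GPT-3 at 1 and GPT-4 at 10, i.e. roughly one order
# of magnitude per generation. Extrapolating that trend naively:
for task, d in difficulty.items():
    generations_after_gpt3 = math.log10(d / 1e0)
    print(f"{task:>25}: ~GPT-{3 + generations_after_gpt3:.0f}")
```

On these (very shaky) assumptions, IMO-level problems land at roughly the GPT-5 mark, which is the extrapolation the comment finds uncomfortable.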

That we're probably bottlenecked on search algorithms, rather than compute power/model-size. This would have policy implications.

If a model can't carry out good enough reasoning to solve IMO problems, then I think you should expect a larger gap between the quality of LM thinking and the quality of human thinking. This suggests that we need bigger models to have a chance of automating challenging tasks, even in domains with reasonably good supervision.

Why would failure to solve the IMO suggest that search is the bottleneck?

My model is that the quality of the reasoning can actually be divided into two dimensions: the quality of intuition (what the "first guess" is) and the quality of search (how much better you can make it by thinking more).

Another way of thinking about this distinction is as the difference between how good each reasoning step is (intuition), compared to how good the process is for aggregating steps into a whole that solves a certain task (search).

It seems to me that current models are strong enough to learn good intuition about all kinds of things with enough high-quality training data, and that if you have good enough search you can use that as an amplification mechanism (on tasks where verification is available) to improve through self-play.

If this is right, then failure to solve the IMO probably means that a good search algorithm (analogous to AlphaZero's MCTS-UCT, maybe including its own intuition model) capable of amplifying the intuitions useful for reasoning has not been found.

So far, all problem-solving AIs seem to use linear or depth-first search: you sample one token at a time (one reasoning step), chain the steps up depth-first (generating a full text/proof sketch), check whether the result solves the full problem, and if it doesn't, you just try again from scratch, throwing all the partial work away. No search heuristic is used, no attempt to solve smaller problems first, etc. So it can certainly get a lot better than that (which is why I'm making the bet).
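To make the contrast concrete, here is a minimal sketch of the two strategies. The functions `propose_steps` (the intuition model suggesting next reasoning steps), `score` (a heuristic guess of how promising a partial attempt is), and `verified` (a checker accepting or rejecting a finished attempt) are entirely hypothetical stand-ins, not any real API:

```python
import heapq
from itertools import count

def solve_by_resampling(problem, propose_steps, verified, attempts=100, depth=50):
    """The 'linear / depth-first' strategy described above: sample a full
    attempt one step at a time, check it at the end, and on failure throw
    all the partial work away and start again from scratch."""
    for _ in range(attempts):
        state = [problem]
        for _ in range(depth):
            state.append(propose_steps(state, k=1)[0])
        if verified(state):
            return state
    return None

def solve_by_best_first_search(problem, propose_steps, score, verified, budget=10_000):
    """A best-first alternative: keep partial attempts in a priority queue,
    expand the most promising one first, and reuse partial work instead of
    discarding it. Any verified solutions could then be fed back as training
    data, which is the 'amplification' loop described above."""
    tie = count()  # tie-breaker so the heap never compares states directly
    frontier = [(-score([problem]), next(tie), [problem])]
    for _ in range(budget):
        if not frontier:
            break
        _, _, state = heapq.heappop(frontier)
        if verified(state):
            return state
        for step in propose_steps(state, k=4):
            child = state + [step]
            heapq.heappush(frontier, (-score(child), next(tie), child))
    return None
```

Something like AlphaZero's tree search would replace the plain priority queue with visit statistics and a learned value model, but the basic point is the same: a search heuristic lets the system build on its own partial progress rather than repeatedly discarding it.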

Curious to hear your thoughts @paulfchristiano, and whether you have updated based on the latest IMO progress.

I think math is "easy" in the sense that we have proof assistants that can verify proofs, so AIs can learn it through pure self-play. Therefore I agree that AI will probably soon solve math, but I disagree that it indicates a particularly large capabilities gain.
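As a toy illustration of the verification point: a proof assistant gives a binary, fully automatic signal, so candidate proofs generated during self-play can be checked without a human in the loop. For example, this (deliberately trivial) Lean 4 theorem either compiles or it doesn't; the name `add_comm'` is just an illustrative label:

```lean
-- The checker accepts this proof term and would reject any incorrect
-- candidate, which is exactly the kind of reward signal self-play needs.
theorem add_comm' (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```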

Just want to record that I agree with this with moderate confidence. Or more precisely: I think we'll probably be able to make extremely good math-solving AIs with an architecture that would be generally incompetent at most real-world tasks, in the way that (for instance) AlphaGo Zero is.

I think "A narrow skill that in humans seems to require a whole lot of general intelligence turns out to be easily unbundled (or to be unbundled by default) from many of the other aspects of general intelligence" has a pretty good track record. Part of why I'm not super doomy, really.

A lot of my confidence that this will happen comes from this, plus a generalized Moravec's-paradox-style "hard things are easy, easy things are hard" intuition.

Does this "paradox" still hold in the era of recent multimodal AI? In particular, what are some things that are easy for humans but hard for AI, other than things requiring embodiment? In what areas are human Mechanical Turk workers still much better? (I believe there are such areas, but I'm pretty fuzzy about what they are.)

I don't think reading/writing is very easy for humans compared to perception and embodied tasks. My Moravec's-paradox intuition here is that maths is of a similar order of difficulty to what we have been very successfully automating in the last couple of years, so I expect it will happen soon.

Embodied tasks just aren't an area where comparison makes much sense yet. What kind of perception tasks did you have in mind?

I mean the kind of perception one needs to empty a random dishwasher, make a cup of coffee with a random coffee machine, or clean a room, that type of stuff. Or hunt and skin a rabbit.

Glad someone took this bet! I still think you'll win, but I myself backed out of a similar bet.

Out of curiosity, how much experience do either of you have solving IMO problems?

I participated in the selection tests for the Brazilian IMO team and got to the last stage. That being said, I never managed to solve the hard problems (problems 3 and 6) independently.

Once upon a time I was decent at math contests. Currently I'm decent at game design. It feels to me like:

  • A: difficulty getting an LLM to get good at hard math contests
  • B: difficulty getting an LLM to get good at satisfying game design
  • C: difficulty getting a human to get good at hard math contests
  • D: difficulty getting a human to get good at satisfying game design

A/B << C/D

I take from this comment that you do not see "AI winning the gold medal" as being as good a predictor of superintelligence arriving soon as I do.

I agree with the A/B < C/D part but may disagree with the "<<". LLMs already display common sense. LLMs already generalize pretty well. Verifying whether a given game design is good is mostly a matter of common sense + reasoning. Finding a good game design given you know how to verify it is a matter of search.

I expect an AI that is good at both reasoning and search (as it has to be to win the IMO gold medal) to be quite capable of mechanism design as well, once it also knows how to connect common sense to reasoning + search. I don't expect this to be trivial, but I do expect it to depend more on training data than on architecture.

Edit: by "training data" here I mostly mean "experience and feedback from multiple tasks" in a reinforcement learning sense, rather than more "passive" supervised learning.

I don't think the difficulty of the task has much to do with the outcome.

I mean, I take your comment at face value and update toward "it's going to get powerful faster", not the other way around.