Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Epistemic status: I have just skimmed through OpenAI's blogpost and paper, I do not fully understand the details.

From the blogpost

We built a neural theorem prover for Lean that learned to solve a variety of challenging high-school olympiad problems, including problems from the AMC12 and AIME competitions, as well as two problems adapted from the IMO.
The prover uses a language model to find proofs of formal statements. Each time we find a new proof, we use it as new training data, which improves the neural network and enables it to iteratively find solutions to harder and harder statements.

From the paper

We explore the use of expert iteration in the context of language modeling applied to formal mathematics. We show that at same compute budget, expert iteration, by which we mean proof search interleaved with learning, dramatically outperforms proof search only. We also observe that when applied to a collection of formal statements of sufficiently varied difficulty, expert iteration is capable of finding and solving a curriculum of increasingly difficult problems, without the need for associated ground-truth proofs. Finally, by applying this expert iteration to a manually curated set of problem statements, we achieve state-of-the-art on the miniF2F benchmark, automatically solving multiple challenging problems drawn from high school olympiads.


  • Uses the Lean formal environment instead of the Metamath used in GPT-f.
  • Uses "decoder-only Transformers similar to GPT-3" with 774M trainable parameters
  • Pre-trained "successively on GPT-3’s postprocessed version of CommonCrawl (for 300B tokens) and an updated version of WebMath (for 72B tokens)"
  • "proof search interleaved with learning"

The two IMO-adapted problems

Problem 1: Suppose a, b, c are the sides of a triangle. Prove that a^2(b + c − a) + b^2(c + a − b) + c^2(a + b − c) ≤ 3abc.
Problem 2: For a, b, c reals, prove that (a^2 + ab + b^2)(b^2 + bc + c^2)(c^2 + ca + a^2) ≥ (ab+bc+ ca)^3.

Both solutions to those problems use "nlinarith" applied to the right arguments, which, as far as I understand, is a tactic from mathlib for solving nonlinear arithmetic problems by adding more assumptions to the context of the solver. (source)

The right arguments for the first problem are said in the blogpost to come (informally) from Schur's inequality, which gives

nlinarith [sq_nonneg (b - a), sq_nonneg (c - b), sq_nonneg (c - a)]

The second problem is solved by applying the Cauchy-Schwarz multiple times, then using some inequality it "invented", and ends up with the same nlinarith expression above.

New Comment
27 comments, sorted by Click to highlight new comments since: Today at 8:36 AM

Most of the heavy lifting in these proofs seem to be done by the Lean tactics. The comment "arguments to nlinarith are fully invented by our model" above a proof which is literally the single line nlinarith [sq_nonneg (b - a), sq_nonneg (c - b), sq_nonneg (c - a)] makes me feel like they're trying too hard to convince me this is impressive.

The other proof involving multiple steps is more impressive, but this still feels like a testament to the power of "traditional" search methods for proving algebraic inequalities, rather than an impressive AI milestone. People on twitter have claimed that some of the other problems are also one-liners using existing proof assistant strategies - I find this claim totally plausible.

I would be much more impressed with an AI-generated proof of a combinatorics or number theory IMO problem (eg problem 1 or 5 from 2021). Someone with more experience in proof assistants probably has a better intuition for which problems are hard to solve with "dumb" searching like nlinarith, but this is my guess.

I think it's worth distinguishing how hard it is for a lean programmer to write the solution, how hard it is to solve the math problem in the first place, and how hard it is to write down an ML algorithm that spits out the right lean tactics.

Like, even if something can be written in a compact form, there might be only a dozen of combinations of ~10 tokens that give us a correct solution like nlinarith (b- a), ..., where by token I count "nlinarith", "sq_nonneg", "b", "-", "a", etc., and the actual search space for something of length 10 is probably ~(grammar size)^10 where the grammar is possibly of size 10-100. (Note: I don't know how traditional solvers perform on statement of that size, it's maybe not that hard.)

I agree that traditional methods work well for algebraic problems where proofs are short, and that AI doing search with nlinarith seems "dumb", but the real question here is whether OAI has found a method to solve such problems at scale.

As you said, the one liner is not really convincing, but the multi step solution, introducing a new axiom in the middle, seems like a general construction to solve all algebraic problems, and even more. (Though they do mention how infinite action space and no no-self-play limits scaling in general.)

I do agree with the general impression that it's not a huge breakthrough. To me, it's mostly an update like "look, two years after gpt-f, it's still hard but se can solve a theorem which requires multiple steps with transformers now!".

I think we're basically in agreement here (and I think your summary of the results is fair, I was mostly pushing back on a too-hyped tone coming from OpenAI, not from you)

In the comments of "Yudkowsky and Christiano discuss takeoff speeds", Christiano mentions a "<4% on end of 2025" for OpenAI (or other) pulling off an IMO gold medal. (However, after looking up actual IMO problems, he comments "feeling bad" about those 4%, because the problems would actually be doable and PR could be enough motivation.)

At the end of the thread I settled on "<8% by end of 2025." We should clarify the final state of the bet soon so that it's easier to reference (and Eliezer also has a chance to update his probabilities).

My main update was how often you could get a gold with only 4 problems, and generally how often it looked like you could get away with only answering easy questions. This kind of variability in the test generally pushes you back to 50% even though I think Eliezer and I still have a stronger disagreement about theorem-proving progress.

Oh right I had missed that comment. Edited the post to mention 8% instead. Thanks.

Just wanted to point out that using machine learning to produce formal proofs is not new; this work has been ongoing in both the machine learning and programming languages community. Most of the work has been done in the Coq or Isabelle/HOL proof languages, with a few things done in Lean. As far as I can tell, a lot of the other work is much farther along in terms of actually producing proofs of useful things; Proverbot9001 (, which uses Coq, is somewhere near proving 25% of the theorems in a verified compiler, a better example of the kinds of proofs that are actually hard and useful. TacTok is another interesting work in the area which improves on CoqGym (cited in the paper)'s tree based model ( 


Disclaimer: I'm the author of Proverbot9001.

Strange that they call the action space infinite instead of breaking it down further by the language's grammar. I'd like to see them translate between formal and informal proofs.

This is indeed amusing. In reality, the action space can be taken to be of size 256 (the number of possible byte values), with the number of bytes in the solution as the episode length. Note also that 256 is an upper bound, not all byte values are valid at all points, and most of the time only the 128 ASCII values are used. Using a tokenizer as is standard in language models simply reduces the episode length by increasing the action space, it does not change the size of the overall state space.

This also means that, despite their claims, the search space for the example solutions shown on their website is similar or smaller than for board games such as Chess and Go :D

By the language's grammar, it is infinite. You can introduce an infinity of axioms in any step of deduction.

If I understand correctly, you are saying that one can technically create an arbitrarily long axiom at any step, so the possibilities are infinite considering the action "add new axiom".

I think Gurkenglas was saying something like "you could break down creating a new axiom in the steps it takes to write an axiom, like the actions 'start axiom', and then add a symbol/token (according to the available grammar)."

OpenAI was possibly mentioning that the number of actions is unbounded, like you could create an axiom, which becomes part of your grammar, and then have one more action, etc.

Where after "start axiom", the next action would not be to add the leftmost character of the axiom's string, but the topmost node of the abstract syntax tree, such as that it is a disjunction of 3 terms.

I think this kind of research is very harmful and should stop.

I think it's important to repeat this even if it's common knowledge in many of our circles, because it's not in broader circles, and we should not give up on reminding people not to conduct research that leads to net increased risk of destroying the world, even if its really cool, gets you promoted, or makes you a lot of money.

Again, OpenAI people, if you're reading this, please stop.

I think it's very strange that this is the work that gets this sort of pushback—of all the capabilities research out there, I think this is some of the best from an alignment perspective. See e.g. STEM AI and Thoughts on Human Models for context on why this sort of work is good.

If our safety research is useless, this path to AGI gives me the most hope, because it may produce math that lets us solve alignment before it becomes general.

Because of the Curry-Howard correspondence, as well as for other reasons, it does not seem that the distance between solving math problems and writing AIs is large. I mean, actually, according to the correspondence, the distance is zero, but perhaps we may grant that programming an AI is a different kind of math problem from the Olympiad fare. Does this make you feel safe?

Also, it seems that the core difficulty in alignment is more in producing definitions and statements of theorems, than in proving theorems. What should the math even look like or be about? A proof assistant is not helpful here.

Writing AIs is not running them. Proving what they would do is, but we need not have the math engine design an AI and prove it safe. We need it to babble about agent foundations in the same way that it would presumably be inspiring to hear Ramanujan talk in his sleep.

The math engine I'm looking for would be able to intuit not only a lemma that helps prove a theorem, but a conjecture, which is just a lemma when you don't know the theorem. Or a definition, which is to a conjecture as sets are to truth values. A human who has proven many theorems sometimes becomes able to write them in turn, why should language models be any different?

I can sense some math we need: An AI is more interpretable if the task of interpreting it can be decomposed into interpreting its parts, we want the assembly of descriptions to be associative, an AI design tolerates more mistakes if its behavior is more continuous in its parts than a maximizer's in its utility function. Category Theory formalizes such intuitions, and even a tool that rewrites all our math in its terms would help a lot, let alone one that invents a math language even better at CTs job of the short sentences being the useful ones.

On the usefulness of proving theorems vs. writing them down: I think there's more of a back and forth. See for instance Nature's post on how DM used an AI to guide intuition (

To me, it's going to help humans "babble" more in math by using some extension, like a Github Copilot but for math. And I feel that overall increasing math generation is more "positive for alignment in EV" than generating code, especially when considering agent foundations.

Besides. in ML the correct proofs can happen many years after algorithms are used by everyone in the community (e.g. Adam's proof shown to be wrong in 2018). Having a way to have a more grounded understanding of things could help both for interpretability & having guarantees of safety.

I think it might be better if you said "may have very harmful long-run consequences" or "is very harmful in expectation" rather than "is very harmful." I worry that people who don't already agree with you will find it easier to roll their eyes at "is very harmful."

Is gain-of-function research "very harmful"? I feel like it's not appropriate to nickel-and-dime this.

And also, yes, I do think it's harmful directly, in addition to eventually in expectation. It's a substantial derogation of a norm that should exist. To explain this concept further:

  • In addition to risking pandemics, participating in gain-of-function research also sullies and debases the research community, and makes it less the shape it needs to be culturally to do epidemiology. Refusing to take massive risks with minor upsides, even if they're cool, is also a virtue cultivation practice.
  • When a politician talks openly about how he wants to rig elections, exchange military aid for domestic political assistance, etc., he causes direct harm now even if the "plans" do not amount to anything later. This is because the speech acts disrupt the equilibria that make similar things less likely in general.

My comments here are intended as an explicit, loud signal of condemnation. This research is misconduct. Frankly, I am frustrated I have to be the one to say this, when it duly falls to community leaders to do so.

I don't think we disagree about the harmfulness of this kind of research. Our disagreement is about the probable consequences of going around saying "I think this research is harmful and should stop."

It's the classic disagreement about how "righteous" vs. "moderate" a movement should be. "Speaking truth to power" vs. "winning hearts and minds." I don't have anything interesting to say here, I was just putting in a vote for a small move towards the "moderate" direction. I defer to the judgment of people who spend more time talking to policymakers and AGI capabilities researchers, and if you are such a person, then I defer to your judgment.

As a programmer by trade, I am mostly concerned about the other development, where an AI is as good as an average human programmer (in these controlled setups). If this is not a fire alarm for general AI, I don't know what is. How long until an AI is better than the best programmers? If the past is any indication, probably a few years at best.


It's been 3 years since GPT-2, how long until an AI is better than the best novelist?

It's a good question, has there been a push to get a GPT to write quality novels? Because there will be a push to automate programming jobs, that's a given.

I'm not aware of any real push. OA pushed (and may still be pushing) in the inverse direction, recursive summarization of novels, with some success. Matt Brockman and some others were early on discussing a bit how to do recursive expansion but AFAIK no one ever got a prototype working and everyone has long since moved on (especially now that the API isn't free so a single novel from davinci with full context windows would cost a fair bit)


Just like the push to automate truck drivers. Any day now.


I think this kind of research is very harmful and should stop.

I think it's important to repeat this even if it's common knowledge in many of our circles, because it's not in broader circles, and we should not give up on reminding people not to conduct research that leads to net increased risk of destroying the world, even if its really cool, gets you promoted, or makes you a lot of money.

Again, OpenAI people, if you're reading this, please stop.