TL;DR I made two contributions to formal-conjectures, a DeepMind open-source repository that collects formalizations of open problems in mathematics. Through these contributions, I noticed three ways that formalizing open problems affects AI: (1) the measurability of mathematical benchmarks, (2) human-AI collaboration in mathematics, and (3) AI self-improvement.
Introduction
In 2025, Google DeepMind launched formal-conjectures[1], an open-source repository of formalized statements of open problems written in Lean (a proof assistant for formally verifying mathematics). The repository's official goals are as follows[1]:
Become a great benchmark for automated theorem provers and automated formalisation tools.
Help clarify the precise meaning of conjectures through formalisation.
Encourage the expansion of mathlib by highlighting needed definitions.
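Concretely, an entry in such a repository is a Lean statement whose proof is deliberately left open. The following is an illustrative sketch using the twin prime conjecture; it is not actual repository code (the repository has its own attributes, naming, and file conventions):

```lean
import Mathlib

/-- The twin prime conjecture: there are infinitely many primes `p`
such that `p + 2` is also prime. Illustrative sketch only, not the
repository's actual formalization. -/
theorem twin_prime_conjecture :
    ∀ N : ℕ, ∃ p : ℕ, N ≤ p ∧ p.Prime ∧ (p + 2).Prime := by
  sorry
```

The `sorry` marks the proof as open: Lean checks that the statement is well-formed, while the proof obligation remains for humans or AI to discharge.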
When it comes to evaluating AI capabilities, existing mathematical benchmarks such as MATH and GSM8K are rapidly reaching saturation. The next frontier is unsolved problems; indeed, AI has recently been reported to have contributed to, or outright solved, a number of open mathematical problems. However, there is a huge hurdle: who determines whether an answer is correct? One method of verification is the proof assistant Lean.
The repository explicitly states that contributions include not only formalizing and listing open conjectures, but also formalizing intermediate theorems toward unsolved problems[1]:
We are also interested in the formalised statements of solved variants of open conjectures and solved statements from dedicated problem lists. While the main goal is to collect conjecture statements, we appreciate the inclusion of very short proofs for solved items or counterexamples, especially if they are illuminating and testing the definitions. Lengthy proofs are outside the scope of this repository.
For example, an intermediate theorem is not the conjecture itself (say, Fermat's Last Theorem) but a result such as the Frey/Serre theorem, which made it possible to attack that conjecture.
I have made two contributions to this DeepMind OSS repository.
Contribution 1[2]: I proposed formalizing an intermediate theorem by Terence Tao[3] that underpins recent progress on the open problem known as the Lonely Runner Conjecture (LRC). The proposal was positively received by the community[4].
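For orientation, the reduced form of the LRC can be sketched in Lean roughly as follows. This is my own schematic rendering, not the repository's statement: `distToInt` is a helper I define here, and the constants and side conditions follow one common reduced formulation.

```lean
import Mathlib

/-- Distance from a real number to the nearest integer
(helper defined for this sketch). -/
noncomputable def distToInt (x : ℝ) : ℝ := |x - (round x : ℝ)|

/-- A schematic reduced Lonely Runner Conjecture: for `n` distinct
nonzero integer speeds, there is a time `t` at which every runner is
at distance at least `1 / (n + 1)` from the start. Sketch only. -/
def LonelyRunnerConjecture (n : ℕ) : Prop :=
  ∀ v : Fin n → ℤ, (∀ i, v i ≠ 0) → Function.Injective v →
    ∃ t : ℝ, ∀ i, (1 : ℝ) / (n + 1) ≤ distToInt (t * (v i : ℝ))
```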
Shortly afterward, however, a question occurred to me. There is a huge number of such intermediate theorems, so proposals of this kind are effectively unbounded, and each one consumes the resources of the repository contributors who formalize them in Lean.
So I made my second contribution. Contribution 2[5]: I proposed categorizing intermediate theorems proposed for formalization with issue tags. This makes it easier for contributors to understand what they are being asked to formalize and to decide how to address it.
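As a concrete sketch of how such tags make intermediate theorems machine-readable, here is a minimal Python filter over GitHub-style issue records. The label name `intermediate-theorem` and the sample issues are my hypothetical examples, not the repository's actual tags or issues:

```python
def filter_by_label(issues, label):
    """Return the issues carrying the given label.

    Each issue is a dict shaped like the GitHub API's issue objects:
    {"title": ..., "labels": [{"name": ...}, ...]}.
    """
    return [
        issue for issue in issues
        if any(l["name"] == label for l in issue.get("labels", []))
    ]

# Hypothetical example records (not real repository issues).
issues = [
    {"title": "Formalize Tao's LRC bound",
     "labels": [{"name": "intermediate-theorem"}]},
    {"title": "State the Collatz conjecture",
     "labels": [{"name": "open-problem"}]},
]

print([i["title"] for i in filter_by_label(issues, "intermediate-theorem")])
```

The same filtering would apply whether a human triages the issue list or an AI ingests it as training or search data.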
These two contributions made me wonder what they mean for AI safety. I believe they touch on (1) the measurability of mathematical benchmarks, (2) the transformation of human-AI collaboration in mathematics, and (3) AI self-improvement.
(1) Impact on the Measurability of Mathematical Benchmarks: Lesson from Contribution 1
Creating verifiers for intermediate theorems of unsolved problems changes how difficult those problems are as benchmarks. For example, the LRC is known to hold for all cases up to n <= 10, and a formalization of the intermediate theorem I proposed, Tao's result[3], lowers the difficulty for an AI attempting a case such as n = 8. The same is likely true for the unsolved cases n > 10.
A similar effect is likely for any benchmark built from unsolved problems. In other words, providing such judges may itself lead to an overestimation of AI problem-solving ability.
(2) Impact on the Transformation of Human-AI Collaboration in Mathematics: Lesson from Contribution 2
Building the environment around a repository of open problems means designing a place for collaboration between humans and AI; it is a kind of human-computer interaction (HCI).
The proposal to tag intermediate theorems for open problems, which I contributed, makes work easier not only for humans but also for AI learning and exploration. For humans, tags make it possible to assign importance to intermediate theorems and to triage formalization proposals, so that contributors' resources are not wasted. For AI, labels make the data easier to ingest and identify, and easier to reason over when working on open problems.
This type of tagging could even transform the roles of humans and AI in mathematics. In a possible future division of labor, for example, humans could judge how many intermediate theorems to accept into the repository, while AI actually formalizes them and uses an intermediate-theorem verifier to explore whether the open problems can be solved.
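That division of labor can be sketched as a simple loop. Everything here is hypothetical: the function names, the stub components, and the idea that a Lean check can be reduced to a boolean are all my illustrative assumptions, not an existing system:

```python
def ai_human_loop(candidates, human_accepts, ai_formalize, lean_check):
    """Hypothetical workflow: humans gate which intermediate theorems
    enter the repository; an AI formalizes the accepted ones; a Lean
    check then judges each formalization. Returns {statement: bool}."""
    accepted = [s for s in candidates if human_accepts(s)]
    results = {}
    for statement in accepted:
        formalization = ai_formalize(statement)
        results[statement] = lean_check(formalization)
    return results

# Stub components for demonstration only: a real system would call a
# model for formalization and invoke Lean for checking.
demo = ai_human_loop(
    ["thm A", "thm B"],
    human_accepts=lambda s: s == "thm A",
    ai_formalize=lambda s: f"theorem {s.replace(' ', '_')} : True := trivial",
    lean_check=lambda f: "theorem" in f,
)
print(demo)
```

The point of the sketch is the interface: humans control admission, AI does the formalization work, and the formal checker is the arbiter.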
How to create such a space for AI-human collaboration, and how to divide the roles, will likely be debated further, depending on the speed at which unsolved problems fall and on the nature of the problems that are solved.
(3) Impact on AI Self-Improvement: Lesson from Contribution 1 and Contribution 2
There is a possibility that cultivating a repository of formalized unsolved problems leads to the autonomous strengthening of AI. Because the repository is open, AI can also contribute. In the future, an AI may autonomously improve its mathematical capabilities by searching for statements, formalizing intermediate theorems, and committing them.
This raises the issue of AI autonomy, a recurring problem in AI safety, because new mathematics has the potential to strengthen AI. This differs from the commonly discussed self-reinforcement of AI that can program AI, but it is one form of AI self-improvement: AI autonomously strengthening itself by expanding the frontiers of mathematics, for example by discovering new mathematical methods. Solving open mathematical problems and then exploiting that mathematics could lead to AI self-improvement along paths humans cannot predict.
<Appendix>
Another Question 1: Is it different from SWE's CI/CD?
One question arises: how is this different from CI/CD in software engineering? Intuitively, mathematics contains truths that are unknown to humanity but undoubtedly exist, whereas CI/CD can only check properties that humans have already specified. Mathematical discoveries of this kind could have a tremendous impact on AI capabilities.
Another Question 2: Is Lean a Verifier?
Lean is often referred to as a "verifier" because it can check everything from statements to proofs without relying on natural language.
However, when only statements are formalized, as in this DeepMind repository, Lean acts more as a "judge" than a "verifier." Is developing a judging function different from developing a verifier?
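The distinction is visible in Lean itself: a fully proved theorem is verified end to end, while a statement closed with `sorry` is only type-checked, i.e. judged well-formed. An illustrative example:

```lean
import Mathlib

-- Fully verified: Lean checks both the statement and the proof.
theorem two_add_two : 2 + 2 = 4 := rfl

-- Only the statement is checked here; `sorry` leaves the proof open,
-- so Lean acts as a judge of well-formedness, not a verifier.
theorem infinitely_many_primes_open : ∀ n : ℕ, ∃ p ≥ n, Nat.Prime p := by
  sorry
```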
How to handle this aspect is an area that requires further exploration.
Details of my contribution 1: Proposal of the LRC intermediate theorem
In this repository, I contributed to the Lonely Runner Conjecture. The conjecture requires a proof for general n, but the cases n = 7, 8, 9, and 10 have been solved in rapid succession in recent years, and the key to these solutions was a paper published by Terence Tao in 2017[3]. Since Tao's theorem can be used in future proofs for n > 10, I thought it desirable to formalize it. I filed an issue in the official repository, and the community response was positive.
Details of my contribution 2: Proposal for the treatment of intermediate theorems
There are a great many intermediate theorems for unsolved problems. This DeepMind repository is only nine months old and many open problems have not yet been proposed, so formalization resources are devoted mainly to the statements of the open problems themselves. Proposing a large number of intermediate theorems could therefore drain contributors' resources.
Therefore, I proposed attaching a specific tag to intermediate theorems. This lets contributors recognize them as intermediate theorems and either work on formalizing them or choose not to devote resources to them.
[1] https://github.com/google-deepmind/formal-conjectures
[2] https://github.com/google-deepmind/formal-conjectures/issues/1935
[3] https://arxiv.org/abs/1701.02048
[4] https://leanprover.zulipchat.com/#narrow/channel/524981-Formal-conjectures/topic/Statement.3A.20Tao's.20lower.20bound.20for.20Lonely.20Runner
[5] https://github.com/google-deepmind/formal-conjectures/issues/1941