With the sparsity of Lean-specific training data, I wonder if it is easier for a model to work with some kind of Python wrapper over Lean, or if this does not help in practice…
Basically, is it easier for a model to master a rarely used Python library than to master a rarely used language?
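For concreteness, the laziest version of what I mean by “Python wrapper” is something like the sketch below: one Python function that shells out to the project’s Lean. (The `check_lean` helper and its interface are mine, invented for illustration; they’re not any existing library’s API.)

```python
# Hypothetical minimal "Python wrapper over Lean": the model only ever sees one
# Python function instead of the whole Lean toolchain. Invented for illustration.
import subprocess
import tempfile
from pathlib import Path

def check_lean(source: str, project_dir: str = ".") -> tuple[bool, str]:
    """Write `source` to a temp .lean file and ask the project's Lean to check it.

    Assumes `lake` is on PATH and `project_dir` is a Lake project with its
    dependencies (e.g. Mathlib) already built. Returns (ok, compiler_output).
    """
    with tempfile.NamedTemporaryFile("w", suffix=".lean", dir=project_dir, delete=False) as f:
        f.write(source)
        path = Path(f.name)
    try:
        result = subprocess.run(
            ["lake", "env", "lean", str(path)],  # type-check with the project's toolchain
            cwd=project_dir,
            capture_output=True,
            text=True,
        )
        return result.returncode == 0, result.stdout + result.stderr
    finally:
        path.unlink()

# The question, restated: is it easier for a model to master calls like
#   check_lean("theorem two_add_two : 2 + 2 = 4 := by norm_num")
# than to master the rarely-used language underneath?
```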
Yall, I really do apologize for radio silence. It has mostly to do with breaking my ankle in three places, but I’m walking again.
This edition of the newsletter leans more toward movement happenings and announcements, which isn’t to say that there weren’t more papers or technical results I could’ve included, just that my mind wasn’t on them over the summer. I feel like I should be working on strategic clarity right now! Watch this space, etc.
I remain turbo impressed with Theorem’s tech. Formal methods is quality assurance, and as every QA technician learns on the first day of QA School, property-based tests are 80% of the value of a proof for 20% of the effort.
Why combine proofs and PBTs? Standard PBTs are compute efficient but prone to gaps: developers miss edge cases, and rare bugs slip into production. Proofs solve this with modular steps that decompose and incorporate reasoning, revealing what cases we’re missing—but they’re computationally impractical.
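To make the 80/20 (and the gap) concrete, here’s the kind of throwaway property-based test I have in mind. The `hypothesis` library is real; the buggy `interleave` and the property are invented for the example, and aren’t anything of Theorem’s.

```python
# A property-based test in the Hypothesis style. The bug and the property are
# made up for illustration.
from hypothesis import given, strategies as st

def interleave(xs: list[int], ys: list[int]) -> list[int]:
    """Alternate elements of xs and ys; intended to preserve every element."""
    out = []
    for x, y in zip(xs, ys):
        out += [x, y]
    # Bug: leftover elements of the longer list are silently dropped.
    return out

@given(st.lists(st.integers()), st.lists(st.integers()))
def test_interleave_preserves_length(xs, ys):
    # Property: no elements are lost.
    assert len(interleave(xs, ys)) == len(xs) + len(ys)
```

A few lines of property buy a lot of assurance cheaply, but only over the inputs the generator happens to sample; the proof is the thing that closes the gap over all inputs.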
They’re a little coy with the details. Too coy, really; I thought of omitting this from the newsletter, but I’ve seen some demos from the team that aren’t public yet that make me bullish on the company, so I want to remind everyone to watch this space.
No reason this couldn’t have been covered in an earlier episode! I think it slipped through the cracks or I had too many similar papers that month.
I’m pretty interested in the version control problem, in the long game. Every Lean synthesis laborer out in the prompt mines, wiring up their agents, MCP servers, and lean-toolchain files every day, knows that the high code velocity of Lean and Mathlib, relative to how few of their tokens are in the pretraining data, is a major pain point, especially compared to the synthesis successes of better-resourced languages. What does it look like if language model technology eventually scales to something less hobbled by this?[1]
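For anyone who hasn’t had the pleasure: the version-pinning surface is basically two files, roughly like the below (version strings are illustrative, not a recommendation).

```
-- lean-toolchain: a single line pinning the exact compiler release
leanprover/lean4:v4.22.0

-- lakefile.lean: pin Mathlib to a matching tag or commit
require mathlib from git "https://github.com/leanprover-community/mathlib4.git" @ "v4.22.0"
```

Drift between that toolchain line, the Mathlib rev, and whatever Mathlib the model saw in pretraining is where a lot of the agent-wiring pain lives.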
To be clear, this question of version control is different from the interest in version control over in Safeguarded/ARIA land that was covered in a previous newsletter. Over there (davidadia? davidadistan?), specifications and world models are a kind of structured data, and they want version control that (among other things) doesn’t just flatten critical specs to plaintext.
Moreover, I don’t think this is necessarily the reason ByteDance got interested in version control for proof synthesis. They may be thinking that a diff-aware / git-history-aware approach to Mathlib leads to more synthetic data and a better understanding (in the model) of how to add features, refactor, etc.
Recent progress in large language models (LLMs) has shown promise in formal theorem proving, yet existing benchmarks remain limited to isolated, static proof tasks, failing to capture the iterative, engineering-intensive workflows of real-world formal mathematics libraries. Motivated by analogous advances in software engineering, we introduce the paradigm of Automated Proof Engineering (APE), which aims to automate proof engineering tasks such as feature addition, proof refactoring, and bug fixing using LLMs. To facilitate research in this direction, we present APE-Bench I, the first realistic benchmark built from real-world commit histories of Mathlib4, featuring diverse file-level tasks described in natural language and verified via a hybrid approach combining the Lean compiler and LLM-as-a-Judge. We further develop Eleanstic, a scalable parallel verification infrastructure optimized for proof checking across multiple versions of Mathlib. Empirical results on state-of-the-art LLMs demonstrate strong performance on localized edits but substantial degradation on handling complex proof engineering. This work lays the foundation for developing agentic workflows in proof engineering, with future benchmarks targeting multi-file coordination, project-scale verification, and autonomous agents capable of planning, editing, and repairing formal libraries.
This was originally posted a whole year ago, but had a revision over the summer. In any case, I had missed it until now.
I basically consider this a neurosymbolic architecture. The hammer (an SMT-driven proof search tool) is the symbolic part, and the regular LLM parts you know and love are the neuro part. Obviously you can do more than prompt the agent “try using the hammer tactic sometimes if you get stuck, little buddy :) I believe in you”. Their divide-and-conquer is one of those things.
Standard (by now) CoT tricks and agent stuff. It’s further ahead than a couple papers I’ve seen in the Lean agent space that came out since then, so good for them.
Formal verification using proof assistants, such as Coq, is an effective way of improving software quality, but requires significant effort and expertise. Machine learning can automatically synthesize proofs, but such tools are able to prove only a fraction of desired software properties. We introduce Cobblestone, a divide-and-conquer approach for proof synthesis. Cobblestone uses a large language model (LLM) to generate potential proofs, uses those proofs to break the problem into simpler parts, automatically identifies which of those parts were successfully proven, and iterates on the remaining parts to build a correct proof that is guaranteed to be sound, despite the reliance on unsound LLMs. We evaluate Cobblestone on four benchmarks of open-source Coq projects, controlling for training data leakage. Fully automatically, Cobblestone outperforms state-of-the-art non-LLM tools, and proves many theorems that other LLM-based tools cannot, and on many benchmarks, outperforms them. Each Cobblestone run costs only $1.25 and takes 14.7 minutes, on average. Cobblestone can also be used with external input, from a user or another tool, providing a proof structure or relevant lemmas. Evaluated with such an oracle, Cobblestone proves up to 58% of theorems. Overall, our research shows that tools can make use of partial progress and external input to more effectively automate formal verification.
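My loose mental model of the loop, as a sketch (the names and interfaces below are my paraphrase of the abstract, not Cobblestone’s actual code):

```python
# Sketch of the divide-and-conquer loop as I read the abstract. All names
# (`llm`, `checker`, and their methods) are invented for illustration.
def divide_and_conquer(theorem, llm, checker, max_rounds=5):
    """Assemble checked proof fragments from partially-correct LLM output."""
    open_goals = [theorem]
    proved = {}  # statement -> proof; nothing lands here without the proof assistant's sign-off
    for _ in range(max_rounds):
        if not open_goals:
            break
        next_goals = []
        for goal in open_goals:
            candidate = llm.propose_proof(goal)     # possibly-wrong full proof attempt
            parts = checker.split(goal, candidate)  # decompose the goal along the candidate's structure
            for part in parts:
                if checker.check(part.statement, part.proof):  # the proof assistant is the ground truth
                    proved[part.statement] = part.proof
                else:
                    next_goals.append(part.statement)  # iterate on the pieces that failed
        open_goals = next_goals
    return proved, open_goals  # checked fragments, plus whatever is still open
```

The soundness story is entirely in the checker: the LLM can hallucinate freely, but nothing counts as proven until the proof assistant says so.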
As always, I’m out in the Applebee’s parking lot informing everyone that the VC influx into formal methods (especially Lean) doesn’t turn into AI security gains by default. We need you and me to make sure that formal verification agents are directed at problems in real-world AI security and safety. Some or perhaps most math companies are doing PR in 2025 so they can create software QA products in 2026. I talked to one, though, who is adamant they are not. They just want to solve math problems and aren’t terribly fussed with capturing more than 1% of the upside. I think these guys are the outlier: most can be expected to pivot to program synthesis in 2026. I’m happy to discuss operationalizations of this to take bets. See also this brief memo I just jotted down.[2]
I feel like if you have friends who want to keep their ear to the ground about this math automation stuff, you should send this to them; that way I can corrupt them into AI security/safety slowly over time.
SMBC recently hit the nail on the head. It reads like what might have come out if I had ranted at Zach Weinersmith himself, except he went way harder and took it way further.
And while they have not literally posted their job ads yet, reach out and I can let you know if I think they’d like to talk to you.
I’d love to do some napkin math about the VC influx in FMxAI over the last year, and include projections for how it might grow in the next couple years. I’d also love for a reader to do it and I can just link to it here!
Separately, Orpheus has a Manifund post up to support the seminars and his other GSAI movement building efforts. Early endorsements by both of the FVAPPS coauthors.
I think we’ve finally had a newsletter published while ARIA does not have an active Safeguarded AI funding call. Achievement unlocked! The original TA2 plan to fund one team to the tune of 8 figures was canceled in the 9th inning.
When we designed this programme, the world looked different. Today, the pace of progress in frontier AI models has fundamentally altered the path to our goals. We now expect that the intended technical objectives of TA2 will be attainable as a side effect of this progress, without requiring a dedicated R&D organisation. Instead of investing in creating specialised AI systems that can use our tools, it will be more catalytic to broaden the scope and power of the TA1 toolkit itself, making it a foundational component for the next generation of AI.
TA2, as you’ll recall, was to be a large investment in founding a new org, kinda frontier-lab-like in some ways, but specialized in being superpowered at the new specification / world model / proof certificate language/stack that TA1 is shipping. I think it’s roughly true that advanced capability in this new language/stack can be accomplished as a side effect of what Claude, GPT, Grok, and Gemini are going to do anyway. But I’m also surprised that the extent to which davidad and his team now forecast this wasn’t priced in back when the Plan™ was originally drawn up. Davidad just seems a little too bitter-lesson / scaling-pilled for “surprised by capabilities alone” to explain the pivot! I guess I have a mild impulse to speculate that the team got spooked by the difficulty of aligning incentives for this kind of org to get a 7- or 8-figure injection at its founding, but again, I’d ultimately expect that to have been priced in when the Plan™ was originally published.
I’ve heard of at least one turbo sick venture that was germinated by the prospect of pursuing this grant, and I don’t think they’re giving up just yet. Watch this space.
Lisa from the SL5 Taskforce has a few more months to disburse up to $100k. The SL5 Taskforce is interested in some applications of FMxAI, especially to cloud infrastructure hardening. The center of the venn diagram (between Progress in GSAI readers and the interests of the SL5 Taskforce) is not massive, but I think it’s big enough for us to discuss here!
Apply here; I think you also have to file a Manifund project.
I think the “future” section, on cyberphysical systems and comparing AI to cyberphysical systems, is good. The AI4FV section is all the stuff we talk about, and the FV4AI sections are all the stuff I’m constantly saying we’re not talking about. The highlights, IMO, are the earlier parts, the historical context.
Yall, I’m interested in your thoughts here. My priorities are shaped a lot by “ensure AI security knows what to ask the formal methods community for at crunch time” and the converse, “ensure that formal methodsititians view the AI security community as their most important customer”. I think GSAI is not completely wrong, for the title of the newsletter! But I think it could be even less wrong. The main problem I see with “guarantee” is that it doesn’t evoke swiss cheese, and I think formal methods are a source of swiss cheese!
I also have to disambiguate AI security as I mean it (boxing) from what e.g. Palisade means (offensive infosec capabilities). Watch this space etc.
Also if you want to help me with the newsletter we could be a team! I don’t think I could get back up to once per month without some friends to help.
Caveat: this problem was way worse 1-2 years ago. I think that as the Lean 3 -> Lean 4 migration cooled down, the velocity that shows up in pretraining data got lower, plus language models got much more powerful. Overall I’m curious whether this velocity/version-control problem will remain persistent; I’m not turbo confident it will.
I think this shortform is pretty important. I do regret being unnuanced about curryhoward in the past; separately, my worldview has gotten more nuanced and changed in other ways. Or: back when I said “something something curryhoward” more, being directionally correct was enough, but now that more people are saying it, it’s time to be not just directionally correct but precise. You get the idea.