I think it's a good idea to include links to the originals:
https://arxiv.org/abs/2408.08152 - "DeepSeek-Prover-V1.5: Harnessing Proof Assistant Feedback for Reinforcement Learning and Monte-Carlo Tree Search"
https://github.com/deepseek-ai/DeepSeek-Prover-V1.5
DeepSeek-Prover-V1.5 is an improved open-source language model for theorem proving in Lean 4.
The paper continues pre-training DeepSeekMath-Base, a math foundation model, then performs supervised fine-tuning on a dataset of incomplete Lean 4 proofs, followed by reinforcement learning (RL) with feedback sourced from the Lean 4 verifier. At search time, DeepSeek-Prover-V1.5 finds proofs using truncate-and-resume, which combines existing proof-generation techniques, together with a novel Monte-Carlo Tree Search (MCTS) algorithm.
Definitions:
Language models trained for theorem proving typically do one of two things: proof-step generation, where the model emits one tactic at a time, submits it to the proof assistant, and continues from the resulting tactic state; or whole-proof generation, where the model writes the entire proof in a single pass and only then has it verified.
DeepSeek-Prover-V1.5 combines proof-step and whole-proof generation, naming the combination truncate-and-resume. Truncate-and-resume starts with whole-proof generation. If Lean detects an error in the proof, all code after the error is removed (truncate), and the valid proof steps are used as the prompt for generating the next proof segment (resume). This truncate-and-resume mechanism is used in conjunction with MCTS to search over possible proof states. To address reward sparsity, the paper introduces a reward-free exploration algorithm that uses intrinsic motivation to explore the tactic space.
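To make this concrete, here is a minimal sketch of the truncate-and-resume loop. The function names (`generate_proof`, `lean_verify`, `split_into_tactics`) and the round budget are placeholders for illustration, not the actual DeepSeek-Prover-V1.5 API.

```python
# Hedged sketch of truncate-and-resume; all helper names are placeholders.
from typing import List, Optional


def generate_proof(prompt: str) -> str:
    """Placeholder: sample a whole-proof completion from the language model."""
    raise NotImplementedError


def lean_verify(theorem_statement: str, proof_code: str) -> Optional[int]:
    """Placeholder: check the proof with Lean 4 and return the index of the
    first failing tactic, or None if the whole proof verifies."""
    raise NotImplementedError


def split_into_tactics(proof_code: str) -> List[str]:
    """Placeholder: parse generated proof code into individual tactic steps."""
    raise NotImplementedError


def truncate_and_resume(theorem_statement: str, max_rounds: int = 16) -> Optional[str]:
    prefix_tactics: List[str] = []  # verified proof prefix carried across rounds
    for _ in range(max_rounds):
        # Resume: prompt with the theorem plus the verified prefix so far.
        prompt = "\n".join([theorem_statement, *prefix_tactics])
        candidate = "\n".join([*prefix_tactics, generate_proof(prompt)])
        first_error = lean_verify(theorem_statement, candidate)
        if first_error is None:
            return candidate  # Lean accepted the whole proof
        # Truncate: keep only the tactics before the first error.
        prefix_tactics = split_into_tactics(candidate)[:first_error]
    return None  # no verified proof found within the budget
```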
Pre-training:
The base model, DeepSeekMath-Base, is further pre-trained on mathematical and code data, with an emphasis on formal languages such as Lean, Isabelle, and Metamath.
Supervised fine-tuning:
The model is fine-tuned for 9 billion tokens on incomplete proofs that end with a natural language description of the current tactic state, which aims to align Lean 4 code and natural language. It learns to predict both the content of this tactic state (an auxiliary task) and the subsequent proof steps (the main objective); a sketch of one such training sample is given below.
The dataset includes synthetic proof code derived from a range of formal theorems (sourced from Lean 4, Mathlib4, and DeepSeek-Prover-V1).
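For illustration, here is a hedged sketch of what one such training sample might look like. The embedded Lean snippet, the comment layout, and the prompt/completion field names are assumptions made for this illustration, not the exact format of the paper's data pipeline.

```python
# Illustrative assembly of one SFT sample (all format details are assumptions).
# Input: an incomplete Lean 4 proof. Target: a comment describing the current
# tactic state (auxiliary task) followed by the remaining proof steps (main
# objective).

prompt = """theorem add_comm_example (a b : Nat) : a + b = b + a := by
  induction a with
  | zero =>
"""

completion = """    /- tactic state:
       b : Nat
       ⊢ 0 + b = b + 0 -/
    simp
  | succ n ih =>
    simp [Nat.succ_add, ih]
"""

training_sample = {
    "prompt": prompt,          # incomplete proof shown to the model
    "completion": completion,  # tactic-state comment + subsequent proof steps
}
```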
Thought-Augmented Proof Generation:
The fine-tuning data is augmented so that the model writes natural language chain-of-thought comments planning the proof before and between the Lean 4 tactic steps, alongside a plain non-CoT generation mode.
Monte-Carlo Tree Search:
The paper uses a proof tree abstraction that defines a proof state and the tactics applicable from it, allowing MCTS to be layered on top of whole-proof generation. The proof search tree is constructed at the tactic level, with each edge representing a single tactic state transition.
The process then proceeds as follows:
Truncate
The model-generated proof is parsed into tactics, truncated at the first error, and segmented into valid tactic codes with associated comments, each corresponding to a tree edge and forming a path from the root to a specific node.
Resume
Multiple tactic codes can lead to the same tactic state, so each tree node stores a set of equivalent tactic codes. The search agent randomly selects one of these to prompt the language model, which uses the incomplete proof code and tactic-state information to guide further proof generation.
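A rough sketch of this tree bookkeeping follows; the class and field names are invented for the illustration, and the real implementation will differ.

```python
# Hypothetical node structure for the tactic-level proof search tree.
import random
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ProofTreeNode:
    tactic_state: str  # Lean tactic state reached at this node
    # Different tactic code strings can reach the same state, so the code
    # variants for this node's incoming edge are stored together.
    equivalent_tactic_codes: List[str] = field(default_factory=list)
    children: Dict[str, "ProofTreeNode"] = field(default_factory=dict)


def build_resume_prompt(theorem_statement: str,
                        path_below_root: List[ProofTreeNode]) -> str:
    """Assemble the next generation prompt: the theorem statement, one randomly
    chosen code variant per edge on the root-to-node path, and a comment
    describing the tactic state at the selected node."""
    chosen_codes = [random.choice(n.equivalent_tactic_codes) for n in path_below_root]
    state_comment = "/- tactic state:\n" + path_below_root[-1].tactic_state + "\n-/"
    return "\n".join([theorem_statement, *chosen_codes, state_comment])
```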
Extrinsic rewards for MCTS in proof search are sparse, occurring only when a proof is completely solved. To address this, the RMaxTS algorithm[1] is introduced, which incorporates intrinsic rewards to encourage the search agent to explore and gather information about the environment, balancing the optimization of extrinsic rewards with the acquisition of general knowledge about the interactive proof environment.
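A hedged sketch of the kind of intrinsic reward this implies, assuming the bonus is a binary signal for discovering a previously unseen tactic state; the actual exploration bonus and value backup in RMaxTS are more involved.

```python
# Sketch of an RMax-style intrinsic reward for tree expansion (details assumed).
from typing import Set


def intrinsic_reward(nodes_before: Set[str], nodes_after: Set[str]) -> float:
    """Pay an exploration bonus of 1 when the latest expansion added at least
    one previously unseen tactic state to the search tree, and 0 otherwise.
    This gives the search a dense signal even before any complete proof
    (the sparse extrinsic reward) has been found."""
    return 1.0 if nodes_after - nodes_before else 0.0
```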
The model is evaluated on the following benchmarks: miniF2F (competition-level high-school mathematics) and ProofNet (undergraduate-level mathematics).
Metrics
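The notes do not spell the metric out; work in this area typically reports pass@K, the probability that at least one of K sampled proof attempts verifies. A common unbiased estimator is shown below for reference; the paper's exact sampling budgets and protocol may differ.

$$
\text{pass@}K \;=\; \mathbb{E}_{\text{problems}}\!\left[\, 1 - \frac{\binom{n-c}{K}}{\binom{n}{K}} \,\right]
$$

where n ≥ K proof attempts are sampled per problem and c of them are verified by Lean.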
They compare DeepSeek-Prover-V1.5 to prior state-of-the-art (SOTA) models.
[1] RMaxTS: RMax applied to Tree Search.