DeepSeek-AI Releases DeepSeek-Prover-V2: An Open-Source Large Language Model for Formal Theorem Proving, Trained via Subgoal Decomposition and Reinforcement Learning

Formal mathematical reasoning has evolved into a specialized subfield of artificial intelligence that demands strict logical consistency. Unlike informal problem solving, which tolerates intuition and loosely defined heuristics, formal theorem proving requires every step to be fully specified, precise, and verifiable by a computer system. Proof assistants such as Lean, Coq, and Isabelle provide the structural frameworks in which these formal proofs are constructed, and they demand logical soundness with no room for omissions, approximations, or unstated assumptions. This poses a challenge for AI systems, especially large language models, which excel at producing coherent natural language responses but often lack the rigor needed to produce verifiable formal proofs. The desire to combine these strengths, informal reasoning and formal verification, has driven new innovations at the interface of language modeling and formal logic automation.
A central problem is that current language models struggle to bridge the conceptual gap between informal and formal reasoning. Language models are often good at producing human-like explanations and solving mathematical problems posed in natural language. However, this reasoning is informal by nature and frequently lacks the structural precision required by formal logic systems. While humans can intuitively jump from one deduction step to another, a proof assistant requires a fully specified sequence of steps with no ambiguity. The challenge, therefore, is to guide AI models to produce logically consistent formal output from their otherwise informal and intuitive internal reasoning processes. This problem becomes even harder when dealing with advanced theorems from domains such as number theory or geometry, where precision is crucial.
Recent efforts have tried to address this by first guiding the model to generate a natural-language proof sketch, which is then manually or semi-automatically translated into formal proof steps. A known strategy is to decompose a complex theorem into smaller subgoals. Each subgoal represents a lemma that can be tackled independently and later combined to form a complete proof. Frameworks such as “Draft, Sketch, and Prove” have applied this idea, using language models to generate proof outlines that are then translated into formal language. Another approach employs hierarchical reinforcement learning to break down complex mathematical problems into simpler layers. However, these models often have difficulty producing fully verifiable outputs in Lean or Coq environments. Furthermore, training data for such models is limited, and failed proof attempts rarely yield useful learning signals.
A team of researchers at DeepSeek-AI has introduced a new model, DeepSeek-Prover-V2, designed to generate formal mathematical proofs by leveraging subgoal decomposition and reinforcement learning. The core of their approach uses DeepSeek-V3 to break down a complex theorem into manageable subgoals, each translated into a Lean 4 statement whose proof is left incomplete with a `sorry` placeholder. These subgoals are then passed to a 7B-parameter prover model that completes each proof step. Once all steps are solved, they are composed into a complete Lean proof and paired with the original natural-language chain of reasoning produced by DeepSeek-V3. This yields a rich cold-start dataset for reinforcement learning. Importantly, the model's training is driven entirely by synthetic data, with no human-annotated proof steps.
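To make this concrete, the Lean 4 snippet below is a toy illustration (invented for this article, not taken from the paper) of what such a decomposition can look like: the sketching model lays out the overall argument as `have` steps, each left as a `sorry` placeholder for the prover model to fill in.

```lean
-- Toy example of a sketch-style decomposition; the theorem and subgoals are
-- invented for illustration, not drawn from the DeepSeek-Prover-V2 paper.
theorem toy_decomposition (a b : Nat) (h : a = b) : a + a = b + b := by
  -- Subgoal 1: proposed by the sketching model, proof deferred with `sorry`.
  have h1 : a + a = b + a := by
    sorry
  -- Subgoal 2: likewise deferred to the smaller prover model.
  have h2 : b + a = b + b := by
    sorry
  -- The sketch closes the original goal by chaining the two subgoals.
  exact h1.trans h2
```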
The cold-start pipeline begins by prompting DeepSeek-V3 to create proof sketches in natural language. These sketches are transformed into formal theorem statements with unresolved parts. A key innovation is using the 7B prover to recursively resolve each subgoal, which reduces computational cost while preserving formal rigor. The researchers built a curriculum learning framework that increases the complexity of the training tasks over time. They also implemented two types of subgoal theorems: one that takes preceding subgoals as premises, and one that treats each subgoal independently. This dual structure was folded into the model's expert iteration stage to train it on progressively more challenging problem sets. The model's capabilities are then reinforced during training through a consistency-based reward that ensures all decomposed lemmas are correctly incorporated into the final formal proof.
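The hypothetical Lean 4 snippet below (statements reused from the toy example above, not from the paper) contrasts the two ways a decomposed step can be extracted as a standalone theorem for the prover to attempt: once stated independently, and once with the preceding subgoal supplied as an extra premise.

```lean
-- Variant A: the subgoal is stated on its own, independent of earlier steps.
theorem subgoal_independent (a b : Nat) (h : a = b) : b + a = b + b := by
  sorry

-- Variant B: the same subgoal, but the preceding subgoal (h1) is added as a
-- premise, so the prover may reuse it directly.
theorem subgoal_with_premise (a b : Nat) (h : a = b)
    (h1 : a + a = b + a) : b + a = b + b := by
  sorry
```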
On the MiniF2F-test benchmark, the model achieved a pass rate of 88.9% (Pass@8192), compared to 82.0% for Kimina-Prover and 64.7% for Goedel-Prover. It also solved 49 of 658 problems from PutnamBench, a collection of challenging competition-level math tasks. On the newly introduced ProverBench dataset of 325 formalized problems, the model solved 6 of 15 questions drawn from the AIME (American Invitational Mathematics Examination) competitions of 2024 and 2025. These benchmarks underscore the model's ability to generalize across multiple formal reasoning tasks. Even compared with DeepSeek-V3, which employs natural-language reasoning, the new model demonstrates competitive performance, solving a considerable share of AIME problems while additionally guaranteeing formal verification.
Several key points about the research on DeepSeek-Prover-V2:
- DeepSeek-Prover-V2 achieved a pass rate of 88.9% on MiniF2F-test (Pass@8192), the highest reported among formal reasoning models to date.
- The model successfully solved 49 of 658 problems from the PutnamBench dataset, which contains advanced competition-level math challenges.
- It solved 6 of 15 problems from the recent AIME 2024–2025 competitions, demonstrating real-world applicability.
- A new benchmark, ProverBench, comprising 325 formalized problems, was introduced to evaluate formal reasoning models.
- The pipeline unifies natural-language proof sketches and formal proof structures by combining DeepSeek-V3 with a 7B prover model.
- Two types of subgoal decomposition, one with and one without preceding subgoals as premises, are used to train the model in a structured, curriculum-guided way.
- Reinforcement learning with a consistency-based reward significantly improves proof accuracy by enforcing structural alignment between the sketch and the final proof (a rough illustration follows this list).
- The entire training strategy relies on synthetic cold-start data, eliminating dependence on manually annotated proofs.
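As a rough mental model of the consistency-based reward mentioned above, the Python sketch below (an assumption about the general shape of such a check, not the authors' implementation; all names are hypothetical) grants reward only when a candidate proof both passes Lean verification and reuses every lemma named in the DeepSeek-V3 sketch.

```python
# Minimal sketch of a consistency-style reward, assuming a Lean verifier has
# already been run; this is illustrative, not the authors' code.

def consistency_reward(candidate_proof: str,
                       sketch_lemmas: list[str],
                       lean_verifies: bool) -> float:
    """Return 1.0 only for verified proofs that reuse all sketched lemmas."""
    if not lean_verifies:
        return 0.0  # unverifiable proofs earn no reward
    # Structural alignment: every decomposed lemma must appear in the proof.
    missing = [name for name in sketch_lemmas if name not in candidate_proof]
    return 1.0 if not missing else 0.0


# Example: a proof that verifies but never uses the sketched lemma `h2`
# receives zero reward, pushing the policy to follow the sketch's structure.
proof_text = "theorem main : P := by\n  have h1 : Q := by simp\n  exact f h1"
print(consistency_reward(proof_text, ["h1", "h2"], lean_verifies=True))  # 0.0
```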
Check out the paper and the model on the GitHub page.

Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for social good. His most recent endeavor is the launch of Marktechpost, an artificial intelligence media platform known for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform draws over 2 million views per month, reflecting its popularity among readers.
