NVIDIA Introduces ProRL: Prolonged Reinforcement Learning Can Improve Reasoning and Generalization
Recent advances in reasoning-focused language models mark a significant shift in AI by scaling test-time computation. Reinforcement learning (RL) is crucial for developing reasoning capabilities and mitigating reward hacking. However, a fundamental debate remains: does RL provide new reasoning capabilities beyond the base model, or does it merely optimize the sampling efficiency of solutions the model already contains? Current research faces two key limitations: (a) heavy reliance on specialized domains such as mathematics, where models are often overtrained and exploration potential is limited, and (b) premature termination of RL training before models can fully develop new reasoning capabilities, with training often restricted to hundreds of steps.
Reasoning models are dedicated AI systems that engage in extended chain-of-thought processing before producing a final answer. DeepSeek and Kimi have detailed methods for training reasoning models with reinforcement learning from verifiable rewards (RLVR), popularizing algorithms such as GRPO, Mirror Descent, and RLOO. Moreover, systems such as AlphaGo and AlphaZero showed that RL agents can continually improve their performance, suggesting that RL training can help agents develop new techniques not present in the base model. At the same time, some existing works question whether RL training truly improves LLM reasoning ability, arguing that RLVR fails to expand reasoning capacity: when measured with the pass@k metric, RL-trained models show no improvement over the base model.
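For context on that comparison, pass@k is commonly computed with the unbiased estimator popularized by the HumanEval benchmark: given n sampled completions of which c are correct, pass@k = 1 - C(n-c, k)/C(n, k). Below is a minimal sketch of that estimator; the numerically stable product form and the example numbers are illustrative choices, not details from the article.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples, drawn from n total samples of which c are correct, is correct.
    Equivalent to 1 - C(n-c, k) / C(n, k), computed in a numerically
    stable product form."""
    if n - c < k:
        return 1.0  # every size-k draw must contain at least one correct sample
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Hypothetical example: 64 samples per problem, 8 correct, estimate pass@16
print(pass_at_k(n=64, c=8, k=16))
```

The debate referenced above is essentially about whether RL raises this curve for large k (new solutions discovered) or only for small k (existing solutions sampled more reliably).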
Researchers from NVIDIA proposed ProRL, a method designed to enable extended RL training periods that allow deeper exploration of reasoning strategies. ProRL supports over 2,000 training steps and scales training data across diverse tasks, including mathematics, coding, science problems, logic puzzles, and instruction following. Using ProRL, the researchers developed Nemotron-Research-Reasoning-Qwen-1.5B, the world's best 1.5B reasoning model, which outperforms its base model, DeepSeek-R1-1.5B, and surpasses DeepSeek-R1-7B across diverse benchmarks. The results show that, given sufficient training time and novel reasoning tasks, RL can discover genuinely new solution pathways that do not exist in the base model, suggesting a true expansion of reasoning capability beyond the initial training.
The researchers built a diverse and verifiable training dataset comprising 136K examples across five task domains: mathematics, code, STEM, logic puzzles, and instruction following. Training uses the verl framework for the RL implementation, adopting the enhancements to the GRPO method proposed by DAPO. The model is evaluated on a wide range of benchmarks across multiple domains: mathematics evaluation includes AIME2024, AIME2025, AMC, MATH, Minerva Math, and Olympiad Bench; coding evaluation uses the PRIME validation set, HumanEvalPlus, and LiveCodeBench; logic puzzle evaluation reserves 100 samples from Reasoning Gym tasks; and STEM reasoning and instruction following are evaluated on curated subsets of GPQA Diamond and IFEval.
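The article does not go into implementation detail, but the core step of GRPO is a group-relative advantage: for each prompt, several responses are sampled, scored with the verifiable reward, and each response's advantage is its reward standardized against the group's mean and standard deviation, removing the need for a learned critic. Here is a minimal sketch of that step, assuming binary correctness rewards; the function and variable names are illustrative, not taken from the verl codebase.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """GRPO-style advantages for one prompt.

    rewards: shape (G,), verifiable rewards (e.g., 0/1 correctness) for G
    responses sampled from the same prompt. Each response's advantage is its
    reward normalized by the group mean and standard deviation."""
    mean = rewards.mean()
    std = rewards.std()
    return (rewards - mean) / (std + eps)

# Hypothetical example: 8 sampled responses to one math problem, 3 judged correct
rewards = torch.tensor([1., 0., 0., 1., 0., 0., 1., 0.])
print(group_relative_advantages(rewards))
```

DAPO's enhancements, as used here, adjust how the policy update is clipped and how groups are sampled around this quantity rather than changing the advantage definition itself.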
On mathematics benchmarks, Nemotron-Research-Reasoning-Qwen-1.5B achieves an average improvement of 15.7%, while competitive programming tasks show a 14.4% gain in pass@1 accuracy. In the STEM reasoning and instruction following domains, the model earns gains of 25.9% on GPQA Diamond and 22.0% on IFEval. The model also shows a 54.8% improvement in reward on Reasoning Gym logic puzzles, demonstrating strong reasoning accuracy. Out-of-distribution evaluation showed significant improvements on three unseen Reasoning Gym tasks, highlighting effective generalization beyond the training distribution. Compared with the domain-specialized models DeepScaleR-1.5B and DeepCoder-1.5B, the ProRL-trained model achieves superior pass@1 scores on both mathematics (+4.6%) and code (+6.5%) benchmarks.
In this paper, the researchers introduced ProRL, which provides evidence that extended, stable RL training develops novel reasoning patterns beyond a base model's initial capabilities. Using this approach, they developed Nemotron-Research-Reasoning-Qwen-1.5B, the best 1.5B reasoning model in the world. ProRL demonstrates the ability to solve tasks on which the base model initially struggles, suggesting that extended RL training helps models internalize abstract reasoning patterns that transfer beyond the training distribution. These results challenge previous assumptions about the limits of RL and show that sufficient training time with appropriate techniques can expand reasoning boundaries, paving the way for developing more capable reasoning models.
Check out the Paper and Model Page. All credit for this research goes to the researchers of this project.
Sajjad Ansari is a final-year undergraduate student at IIT Kharagpur. As a technology enthusiast, he explores the practical applications of AI, focusing on understanding AI technology and its real-world impact. He aims to explain complex AI concepts in a clear and accessible way.