RL^V: Unifying Reasoning and Verification in Language Models through Value-Free Reinforcement Learning

LLMs have gained strong reasoning skills through reinforcement learning (RL) on correctness rewards. Modern RL algorithms for LLMs, including GRPO, VinePPO, and Leave-One-Out PPO, have moved away from the traditional PPO approach by eliminating the learned value function network and relying instead on empirically estimated returns. This reduces computational requirements and GPU memory consumption, making RL training more feasible. However, the efficiency comes with a trade-off: the value function could serve as a powerful outcome verifier for evaluating the correctness of reasoning chains. Without this component, LLMs lose a valuable verification capability that could boost reasoning through parallel search strategies such as Best-of-N or weighted majority voting.
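To make the two parallel search strategies mentioned above concrete, here is a minimal sketch (our own illustration, not the paper's code) assuming each sampled solution carries a final answer and a verifier score in [0, 1]:

```python
from collections import defaultdict

def best_of_n(solutions):
    """Return the answer of the single highest-scoring solution."""
    return max(solutions, key=lambda s: s["score"])["answer"]

def weighted_majority_vote(solutions):
    """Sum verifier scores per distinct answer and pick the heaviest answer."""
    weight = defaultdict(float)
    for s in solutions:
        weight[s["answer"]] += s["score"]
    return max(weight, key=weight.get)

# Toy example: three sampled solutions to the same problem.
samples = [
    {"answer": "42", "score": 0.91},
    {"answer": "41", "score": 0.88},
    {"answer": "42", "score": 0.75},
]
print(best_of_n(samples))               # -> "42"
print(weighted_majority_vote(samples))  # -> "42"
```

Both strategies need some score per solution, which is exactly what the discarded value function (or a separate verifier) would have provided.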
Recent advances in LLM reasoning have explored a variety of RL techniques, and traditional PPO has demonstrated the utility of its value model as a test-time search verifier. However, the growing trend of "value-free" RL approaches (GRPO, VinePPO, Leave-One-Out PPO) eliminates this capability, while training a separate verifier model adds its own overhead. Test-time verification is an alternative way to improve reasoning by scaling compute, with verifiers trained through binary classification, preference learning, or next-token prediction techniques. But such models demand large training datasets, additional compute resources, and substantial GPU memory at inference time.
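As a reference point for the separate-verifier baseline described above, here is a hedged sketch of a discriminative verifier trained with a binary classification head on top of a base LLM; the model name, pooling choice, and head design are illustrative assumptions, not the paper's setup:

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class BinaryVerifier(nn.Module):
    """Standalone reward model: base LLM encoder + scalar correctness head."""
    def __init__(self, base_name: str = "Qwen/Qwen2.5-Math-1.5B"):  # assumed backbone
        super().__init__()
        self.backbone = AutoModel.from_pretrained(base_name)
        self.head = nn.Linear(self.backbone.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(input_ids=input_ids,
                               attention_mask=attention_mask).last_hidden_state
        # Score the last non-padded token of each sequence (assumes right padding).
        last = attention_mask.sum(dim=1) - 1
        pooled = hidden[torch.arange(hidden.size(0)), last]
        return torch.sigmoid(self.head(pooled)).squeeze(-1)

# Training would minimize binary cross-entropy against 0/1 correctness labels,
# which is why this baseline needs its own labeled data and keeps a second
# full-size model in GPU memory at inference time.
```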
Researchers from McGill University, Université de Montréal, Microsoft Research, and Google DeepMind propose RL^V to restore the missing value function in value-free RL for LLMs. RL^V augments the "value-free" approach with a generative verifier without compromising training scalability. It exploits the LLM's generation capabilities and the abundant data already produced during RL training to optimize the model as both a reasoner and a verifier. This dual-function approach frames verification as a next-token prediction task, allowing the same LLM to generate solutions while also providing an intrinsic score for them. Initial results show that, compared with the base RL method, RL^V boosts MATH accuracy by more than 20% when using parallel sampling, achieving 8-32× more efficient test-time compute scaling.
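The key idea, verification as next-token prediction, can be sketched as asking the same policy model whether a candidate solution is correct and reading off the probability it assigns to a "Yes" token. The prompt template and token choice below are illustrative assumptions, not the paper's exact formulation:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-Math-1.5B"   # assumed; any causal LM would do for the sketch
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def verify_score(question: str, solution: str) -> float:
    """Score a solution as the model's probability of answering 'Yes'."""
    prompt = (f"Question: {question}\nSolution: {solution}\n"
              "Is this solution correct? Answer Yes or No.\nAnswer:")
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]        # distribution over the next token
    probs = torch.softmax(logits, dim=-1)
    yes_id = tok(" Yes", add_special_tokens=False).input_ids[0]
    return probs[yes_id].item()                  # verification score in [0, 1]
```

Because generation and verification share one set of weights, no extra model has to be trained or held in memory; the scores can feed directly into Best-of-N or weighted majority voting at test time.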
RL^V unifies the reasoner and a generative verifier in a single LLM and addresses four key research questions: how well it scales parallel test-time compute, how the verifier should be trained, how the verifier should be used at test time, and how it interacts with sequential scaling in long chain-of-thought (CoT) models. The setup uses Hendrycks' MATH dataset for RL training, runs on 4×A100 80GB NVIDIA GPUs for about 3 hours, and reports evaluations on the MATH500, MATH², GPQA, and AIME'24 benchmarks. The researchers used the Qwen2.5-Math-1.5B model, fine-tuning it with the VinePPO algorithm with and without unified verification for the shorter-CoT experiments. Training uses a 1024-token context window, while inference generates up to 1024 tokens for MATH500 and up to 2048 tokens for the other test sets.
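For quick reference, the setup above can be summarized as a plain config; the field names are our own, only the values come from the article:

```python
rl_v_setup = {
    "base_model": "Qwen2.5-Math-1.5B",
    "rl_algorithm": "VinePPO",              # run with and without unified verification
    "train_dataset": "Hendrycks MATH",
    "hardware": "4 x NVIDIA A100 80GB",
    "train_time_hours": 3,
    "train_context_tokens": 1024,
    "max_generation_tokens": {
        "MATH500": 1024,
        "other_test_sets": 2048,            # MATH², GPQA, AIME'24
    },
}
```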
RL^V shows excellent test-time compute scaling, proving 32× more efficient and 4% more accurate than the baseline method on MATH500 with 512 samples. Comparing test-time verification strategies shows that weighted voting outperforms both majority voting and Best-of-N when sampling 8+ solutions per problem, for short and long CoT models alike. RL^V also complements sequential inference-compute scaling, with GRPO^V achieving the highest success rates on AIME'24 at longer generation lengths. Training the unified verifier requires carefully balancing the verification coefficient λ, which presents a significant trade-off in the GRPO^V implementation: increasing λ improves verifier accuracy (from roughly 50% to roughly 80%).
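The λ trade-off arises because the unified model is trained on a single weighted objective that combines the RL reward signal with a verification term. Schematically, in our own notation (not necessarily the paper's exact formulation):

```latex
\mathcal{L}_{\mathrm{RL^V}}(\theta)
  \;=\; \mathcal{L}_{\mathrm{RL}}(\theta)
  \;+\; \lambda \,\mathbb{E}_{(x,\,y,\,c)}\!\big[-\log \pi_\theta(c \mid x, y)\big],
```

where \(x\) is the problem, \(y\) a sampled solution, and \(c\) the correctness label ("Yes"/"No") predicted as the next token. A larger λ weights the verification term more heavily relative to the RL generation term, which is why tuning it matters.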
In this paper, the researchers introduce RL^V, which integrates verification into the "value-free" RL framework without significant computational overhead and demonstrates improvements in reasoning accuracy, test-time compute efficiency, and cross-domain generalization across the MATH500, MATH², GPQA, and AIME'24 datasets. Future research could explore enhanced generative verifiers that produce explicit CoT explanations, although such advances would require verification-specific CoT data or a dedicated RL training process. The unified framework for solution generation and verification through RL lays a valuable foundation for the continued development of LLMs' reasoning capabilities.

Sajjad Ansari is a final year undergraduate student from IIT Kharagpur. As a technology enthusiast, he delves into the practical application of AI, focusing on understanding AI technology and its real-world impact. He aims to express complex AI concepts in a clear and easy way.