
Self-Taught AI: Tsinghua University's "Absolute Zero" Trains an LLM With Zero External Data

LLMs have demonstrated advances in reasoning ability through reinforcement learning with verifiable rewards (RLVR), which relies on outcome-based feedback rather than imitating intermediate reasoning steps. However, current RLVR work faces a critical scalability challenge: it depends on manually curated collections of questions and answers for training. As reasoning models advance, constructing large-scale, high-quality datasets becomes increasingly unsustainable, mirroring the data bottleneck already identified in LLM pretraining. Furthermore, exclusive reliance on human-designed tasks may limit the ability of AI systems to learn and develop autonomously, especially as they move beyond human intellectual abilities.
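To make the outcome-based nature of RLVR concrete, here is a minimal Python sketch of a verifiable reward: only the final answer is checked against a known ground truth, and the intermediate reasoning is never scored. The `Answer:` convention and helper names are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch of an outcome-based (verifiable) reward for RLVR.
# The reward depends only on the final answer, not on how closely the
# chain of thought imitates a reference solution.

def extract_final_answer(response: str) -> str:
    """Pull the text after a final 'Answer:' marker (illustrative convention)."""
    marker = "Answer:"
    return response.rsplit(marker, 1)[-1].strip() if marker in response else response.strip()

def verifiable_reward(response: str, gold_answer: str) -> float:
    """1.0 if the model's final answer matches the verified ground truth, else 0.0."""
    return 1.0 if extract_final_answer(response) == gold_answer.strip() else 0.0

# Example: the intermediate reasoning is never scored, only the outcome.
print(verifiable_reward("Think step by step... Answer: 42", "42"))  # 1.0
```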

Researchers have explored various methods to enhance LLM reasoning capabilities. Expert iteration and rejection sampling of outcome-verified responses have been used to improve chain-of-thought (CoT) reasoning. OpenAI's o1 model deployed this concept at scale, achieving state-of-the-art results, and DeepSeek-R1 then became the first open-weight model to match or exceed o1's performance by introducing the "zero" setting, which applies RL directly to the base LLM. In addition, the self-play paradigm has evolved from Schmidhuber's early two-agent setups to more sophisticated implementations such as AlphaGo and AlphaZero. Recent methods such as SPIN, Self-Rewarding Language Models, SPC, and SPAG apply self-play to language models for alignment and reasoning.
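As a rough illustration of the rejection-sampling idea behind expert iteration, the sketch below samples several CoT responses per question and keeps only those whose outcome verifies as correct; `sample_cot` and `verify` are hypothetical placeholders for a model call and an answer checker, not any specific system's API.

```python
from typing import Callable, Iterable, List, Tuple

def rejection_sample(questions: Iterable[str],
                     sample_cot: Callable[[str], str],
                     verify: Callable[[str, str], bool],
                     k: int = 8) -> List[Tuple[str, str]]:
    """Keep (question, response) pairs whose final outcome verifies as correct."""
    kept = []
    for q in questions:
        for _ in range(k):
            response = sample_cot(q)        # one sampled chain-of-thought response
            if verify(q, response):         # outcome-based filter, not step matching
                kept.append((q, response))  # survivors feed the next fine-tuning round
    return kept
```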

Researchers from Tsinghua University, the Beijing Institute for General Artificial Intelligence (BIGAI), and Pennsylvania State University have proposed an RLVR paradigm called Absolute Zero, which enables a single model to autonomously generate and solve tasks that maximize its own learning progress, without relying on any external data. Under this paradigm, the researchers introduce the Absolute Zero Reasoner (AZR), which self-evolves its training curriculum and reasoning ability by using a code executor to both validate proposed code reasoning tasks and verify answers, providing a unified, verifiable source of reward that guides open-ended yet grounded learning. AZR can be implemented efficiently across different model scales and is compatible with various model classes, indicating broad applicability.
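Below is a simplified sketch of how a code executor can ground both task validation and answer verification, loosely following the deduction-style setup described in the paper: executing a proposed (program, input) pair yields the gold output, which is then compared against the solver's prediction. Function names and the bare `exec` call are illustrative assumptions; a real system would sandbox execution.

```python
def run_program(program: str, program_input):
    """Execute a self-contained Python function `f` on the given input."""
    namespace = {}
    exec(program, namespace)   # in practice this would run inside a sandboxed executor
    return namespace["f"](program_input)

program = "def f(x):\n    return sorted(x)[::-1]"
task_input = [3, 1, 2]

gold_output = run_program(program, task_input)   # valid task: it executes and is deterministic
model_prediction = [3, 2, 1]                     # the solver's predicted output
reward = 1.0 if model_prediction == gold_output else 0.0
print(gold_output, reward)                       # [3, 2, 1] 1.0
```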

LLMs provide an ideal framework for implementing AZR in a multitask learning setting. During each online rollout iteration of the Absolute Zero objective, AZR proposes new reasoning tasks conditioned on the task type and past self-generated examples, with explicit prompting to generate diverse tasks, and then attempts to solve them, receiving grounded feedback on its responses. AZR uses a code executor as both a flexible interface and a verifiable environment, enabling automatic construction, execution, and validation of code reasoning tasks. The AZR algorithm comprises buffer initialization, task proposal inputs and buffer management, valid task construction, solution verification, and advantage estimation through Task-Relative REINFORCE++ (TRR++).
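The following hypothetical sketch condenses one self-play iteration of that loop: per-task-type buffers seeded at initialization, a propose step conditioned on recent self-generated examples, executor-based validation, a solve step, and a task-relative baseline in the spirit of TRR++. `propose`, `solve`, `executor_validate`, and the scoring callables are placeholders, and the advantage computation is a simplification rather than the paper's exact estimator.

```python
from collections import defaultdict
from statistics import mean

# One buffer of past self-generated tasks per task type, seeded at initialization.
buffers = {t: [] for t in ("deduction", "abduction", "induction")}

def self_play_step(propose, solve, executor_validate, score_proposer, score_solver):
    rewards = defaultdict(list)                      # grouped per (task type, role)
    for task_type, buffer in buffers.items():
        # Propose: condition on recent self-generated examples to encourage diversity.
        candidate = propose(task_type, references=buffer[-3:])
        task = executor_validate(candidate)          # keep only executable, valid tasks
        if task is None:
            continue
        buffer.append(task)                          # buffer management
        # Solve: several rollouts per task; the executor grounds the correctness check.
        answers = [solve(task) for _ in range(4)]
        rewards[(task_type, "propose")].append(score_proposer(task))
        rewards[(task_type, "solve")].extend(score_solver(task, a) for a in answers)
    # Task-relative baseline: center rewards within each (task type, role) group.
    return {k: [r - mean(v) for r in v] for k, v in rewards.items() if v}
```

In practice the baselines would be aggregated over batches of many proposals and rollouts per group; the single-pass version here is only meant to show where the task-relative grouping enters.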

Absolute Zero Reasoner-Coder-7B achieves state-of-the-art performance in the 7B overall average and coding average categories, surpassing the previous best model by 1.8 absolute percentage points on combined math and coding reasoning benchmarks despite being entirely out-of-distribution. It also outperforms models trained on expert-curated human data in coding, despite never having access to such data itself. Scaling analysis shows that AZR delivers greater gains on larger models: the 7B and 14B models continue to improve beyond 200 training steps, whereas the 3B model plateaus. Out-of-distribution performance gains also grow with model size: +5.7, +10.2, and +13.2 points for the 3B, 7B, and 14B models, respectively.

In summary, the researchers introduced the Absolute Zero paradigm to address the data limitations of existing RLVR frameworks. Under this paradigm, they proposed AZR, which trains models to propose and solve code-related reasoning tasks grounded by a code executor. A limitation remains in safety management for self-improving systems: the team observed several instances of CoT reasoning containing "uh-oh moments" from the Llama-3.1-8B model. The results show that while the Absolute Zero paradigm reduces the need for human intervention, continued oversight is still required to address lingering safety concerns, highlighting a key direction for future research.


Check out the Paper, Hugging Face model, and GitHub page. Also, don't forget to follow us on Twitter.



Sajjad Ansari is a final-year undergraduate at IIT Kharagpur. As a technology enthusiast, he delves into the practical applications of AI, focusing on understanding AI technology and its real-world impact. He aims to explain complex AI concepts in a clear and accessible way.
