Meta researchers introduce J1: a reinforcement learning framework that trains language models to judge with reasoned consistency and minimal data

Large language models are now used to evaluate and judge outputs, going beyond their traditional text-generation role. This gives rise to "LLM-as-a-Judge", where a model evaluates the outputs of other language models. Such evaluations are critical to reinforcement learning pipelines, benchmarking, and system alignment. These judge models rely on internal chain-of-thought reasoning, mirroring the human judgment process. Unlike conventional reward models that provide a direct score, they simulate a thoughtful evaluation, making them better suited to complex tasks such as mathematical problem solving, ethical reasoning, and interpreting user intent. Their ability to interpret and validate responses across languages and domains improves automation and scalability in language model development.
However, current AI judgment systems suffer from inconsistency and shallow reasoning. Many rely on basic metrics or static annotations that are insufficient for evaluating subjective or open-ended prompts. A common problem is positional bias, where the order in which answers are presented influences the final decision and undermines fairness. Likewise, collecting human-annotated data at scale is expensive and time-consuming, which limits how broadly these models generalize.
Several existing approaches have addressed these challenges, with limited success. Systems such as EvalPlanner and DeepSeek-GRM depend on human-labeled data or rigid training schemes that limit adaptability across task types. Others, such as DeepSeek-R1, rely on distillation from large models but perform poorly on ambiguous prompts. Static datasets and offline tuning strategies hinder dynamic reasoning, while newer methods using scoring formats or structured prompting show only minimal gains in accuracy. Despite larger datasets and models, performance improvements in traditional systems have stagnated.
Researchers from Meta's GenAI and FAIR teams introduced J1 to address these limitations. J1 trains judge models through a reinforcement learning framework, enabling them to learn from verifiable reward signals. The team used synthetic data to create high-quality and low-quality responses, converting subjective tasks into verifiable pairwise judgments. The synthetic dataset comprises 22,000 preference pairs: 17,000 drawn from the WildChat corpus and 5,000 from mathematical queries. These were used to train two versions of J1, J1-Llama-8B and J1-Llama-70B, initialized from the base models Llama-3.1-8B-Instruct and Llama-3.3-70B-Instruct, respectively. The models were trained with Group Relative Policy Optimization (GRPO), a reinforcement learning algorithm that removes the need for a separate critic model and accelerates convergence.
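To make this recipe more concrete, the sketch below illustrates, under stated assumptions rather than as the authors' implementation, how verifiable pairwise rewards can feed a GRPO-style update: several judgments are sampled for the same pair, each gets a binary reward for picking the synthetically constructed better response, and advantages are computed from the group's own mean and standard deviation instead of a learned critic. Function names and the reward details are illustrative.

```python
# Minimal sketch (not the authors' code) of GRPO-style group-relative advantages
# driven by a verifiable pairwise-judgment reward.
from statistics import mean, pstdev

def verdict_reward(predicted_winner, true_winner):
    """Verifiable reward: 1 if the judge picked the synthetically-known
    better response, else 0."""
    return 1.0 if predicted_winner == true_winner else 0.0

def grpo_advantages(rewards, eps=1e-6):
    """Normalize each sampled judgment's reward against its group's mean and
    standard deviation, removing the need for a separate critic model."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: four judgments sampled for one (prompt, response_a, response_b) pair,
# where response_a is known by construction to be the better response.
sampled_verdicts = ["a", "b", "a", "a"]
rewards = [verdict_reward(v, "a") for v in sampled_verdicts]
print(grpo_advantages(rewards))  # correct verdicts receive higher advantages
```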
At the core of the training strategy is position-agnostic learning, in which both (x, a, b) and (x, b, a) input orderings are used during training to prevent positional bias. In addition, a consistency-based reward is granted only when the model produces the correct verdict under both answer orderings. This structure makes the judge fair and reliable regardless of the order in which answers are presented. The training framework supports multiple variants: the model can output a final verdict, a numerical score for each answer, or both. A pointwise variant scores a single response on a scale of 0 to 10. These formats make J1 a versatile and generalizable system capable of judging a wide range of tasks.
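As a minimal illustration of this consistency scheme (the judge interface and verdict labels below are hypothetical stand-ins, not the paper's API), the reward fires only when the model identifies the better response under both presentation orders:

```python
# Illustrative sketch of position-agnostic training pairs and a
# consistency-based reward, following the description above.

def make_both_orders(prompt, resp_a, resp_b):
    """Build (x, a, b) and (x, b, a) so neither answer slot is privileged.
    resp_a is assumed to be the known better response."""
    return [(prompt, resp_a, resp_b, "first"),   # better response shown first
            (prompt, resp_b, resp_a, "second")]  # better response shown second

def consistency_reward(judge_fn, prompt, resp_a, resp_b):
    """Reward 1 only if the judge selects the better response in BOTH orders."""
    correct = 0
    for x, r1, r2, gold in make_both_orders(prompt, resp_a, resp_b):
        verdict = judge_fn(x, r1, r2)  # returns "first" or "second"
        correct += int(verdict == gold)
    return 1.0 if correct == 2 else 0.0

# Toy judge that always prefers the longer response, for demonstration only.
toy_judge = lambda x, r1, r2: "first" if len(r1) >= len(r2) else "second"
print(consistency_reward(toy_judge, "Explain GRPO.", "a detailed answer", "short"))
```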

The results obtained with the J1 models show substantial improvements over existing systems. On the widely used Preference Proxy Evaluation (PPE) benchmark, J1-Llama-70B reaches an overall accuracy of 69.6%, outperforming models trained with ten times more data. By comparison, DeepSeek-GRM-27B and EvalPlanner-Llama-70B score 67.2% and 65.6%, respectively. Even the smaller J1-Llama-8B surpasses baselines such as EvalPlanner-Llama-8B, scoring 62.2% versus 55.5%. J1 also achieves top performance on other key benchmarks, including RewardBench, RM-Bench, JudgeBench, and FollowBenchEval, demonstrating strong generalization across verifiable and subjective tasks. Given J1's limited training data compared with the extensive datasets used by other models, these gains are not merely marginal but significant.

Several key takeaways from the J1 research:
- J1 is trained on 22,000 preference pairs: 17K from WildChat and 5K from math tasks.
- Training uses GRPO, which simplifies RL by removing the need for a separate critic model.
- It introduces position-agnostic learning with consistency-based rewards to reduce positional bias.
- The two main model variants, J1-Llama-8B and J1-Llama-70B, were trained on modest data yet outperform larger models.
- J1-Llama-70B scored 69.6% on PPE, surpassing DeepSeek-GRM-27B (67.2%) and EvalPlanner-Llama-70B (65.6%).
- It supports multiple judgment formats: pairwise verdicts, pairwise verdicts with per-response scores, and pointwise scores (a parsing sketch follows this list).
- It outperforms models distilled from DeepSeek-R1 and OpenAI's o1-mini on several tasks.
- It demonstrates that reasoning quality, not just dataset size, is crucial for accurate judgment.
- J1's framework makes it a generalist judge, applicable to both verifiable and non-verifiable tasks.
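To show how such multi-format judge outputs could be consumed downstream, here is a small parsing sketch. The XML-style tags and function names are purely illustrative assumptions about the output layout, not J1's documented schema.

```python
# Hedged sketch of parsing the three judgment formats described above from a
# judge model's text output. Tag names (<verdict>, <score_a>, ...) are assumed.
import re
from typing import Optional, Tuple

def parse_pairwise_verdict(text: str) -> Optional[str]:
    """Pairwise format: which response wins, 'A' or 'B'."""
    m = re.search(r"<verdict>\s*([AB])\s*</verdict>", text)
    return m.group(1) if m else None

def parse_pairwise_scores(text: str) -> Optional[Tuple[float, float]]:
    """Pairwise-with-scores format: one numeric score per response."""
    a = re.search(r"<score_a>\s*(\d+(?:\.\d+)?)\s*</score_a>", text)
    b = re.search(r"<score_b>\s*(\d+(?:\.\d+)?)\s*</score_b>", text)
    return (float(a.group(1)), float(b.group(1))) if a and b else None

def parse_pointwise_score(text: str) -> Optional[float]:
    """Pointwise format: a single response rated on a 0-10 scale."""
    m = re.search(r"<score>\s*(\d+(?:\.\d+)?)\s*</score>", text)
    return float(m.group(1)) if m else None

example = "reasoning... <score_a>8</score_a> <score_b>5</score_b> <verdict>A</verdict>"
print(parse_pairwise_verdict(example), parse_pairwise_scores(example))
```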
In short, the J1 method fundamentally rethinks how judge models are trained and evaluated. Synthetic data and reinforcement learning bypass the traditional need for costly annotation while promoting fair, logical, and consistent evaluations. The work shows that reasoning-focused judges can outperform larger models that rely heavily on data volume and static alignment techniques. It also supports the view that judge models should be thinkers first and scorers second. By matching and often exceeding state-of-the-art systems, J1 sets a new benchmark for training LLM-as-a-Judge systems.
Check out the paper. All credit for this research goes to the researchers of this project.

Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for social good. His most recent endeavor is the launch of Marktechpost, an artificial intelligence media platform that provides in-depth coverage of machine learning and deep learning news that is both technically sound and easily understood by a wide audience. The platform draws over 2 million views per month, illustrating its popularity among readers.
