NVIDIA Releases Cosmos-Reason1: AI Models that Bring Physical Common Sense and Embodied Reasoning to the Real World

AI has made progress in language processing, mathematics, and code generation, but extending these capabilities to physical environments remains challenging. Physical AI seeks to close this gap by developing systems that perceive, understand, and act in the dynamic real world. Unlike traditional AI that processes text or symbols, physical AI takes in sensory input, especially video, and generates responses grounded in real-world physics. These systems are designed for navigation, manipulation, and interaction, relying on common-sense reasoning and an embodied understanding of space, time, and the laws of physics. Applications span robotics, autonomous vehicles, and human-machine collaboration, where adaptability to real-time perception is critical.

The weak connection between current AI models and real-world physics is a major limitation. Although they perform well on abstract tasks, they often fail to predict physical consequences or respond appropriately to sensory data. Concepts such as gravity or spatial relationships are not intuitively understood, making these models unreliable for physical tasks. Training directly in the physical world is expensive and risky, which hinders development and iteration. This lack of physical grounding and embodied understanding is a significant obstacle to deploying AI effectively in real-world applications.

Previously, tools for physical reasoning in AI were fragmented. Vision-language models connect visual and textual data but lack depth in reasoning. Rule-based systems are rigid and fail in novel situations. Simulation and synthetic data often miss the nuances of real-world physics. Crucially, there has been no standardized framework to define or evaluate physical common sense or embodied reasoning, and inconsistent methods and benchmarks make progress difficult to quantify. Reinforcement learning approaches have lacked task-specific reward structures, leaving models struggling with causal reasoning and physical feasibility.

NVIDIA researchers introduced Cosmos-Reason1, a suite of multimodal large language models. These models, Cosmos-Reason1-7B and Cosmos-Reason1-56B, are designed specifically for physical reasoning tasks. Each model is trained in two main stages: Physical AI supervised fine-tuning (SFT) and Physical AI reinforcement learning (RL). What differentiates this approach is the introduction of a dual-ontology system. A hierarchical ontology organizes physical common sense into three main categories, namely space, time, and fundamental physics, further divided into 16 subcategories. A second, two-dimensional ontology maps reasoning capabilities across five embodied agents, including humans, robot arms, humanoid robots, and autonomous vehicles. These ontologies serve both as training guides and as evaluation tools for benchmarking embodied reasoning in AI.
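The dual-ontology design described above can be pictured as a simple data structure used to tag training and evaluation examples. The sketch below is illustrative only: the subcategory names and the `annotate` helper are assumptions, not the paper's exact taxonomy.

```python
# Illustrative sketch of the dual-ontology design. Category and subcategory
# names here are assumptions standing in for the paper's 16 subcategories.

PHYSICAL_COMMON_SENSE = {
    "space": ["spatial_relationship", "occlusion"],      # illustrative
    "time": ["event_order", "duration"],                 # illustrative
    "fundamental_physics": ["gravity", "object_permanence"],
}

EMBODIED_AGENTS = ["human", "robot_arm", "humanoid_robot", "autonomous_vehicle"]

def annotate(question: str, category: str, subcategory: str) -> dict:
    """Tag a benchmark question with its place in the common-sense ontology."""
    if subcategory not in PHYSICAL_COMMON_SENSE[category]:
        raise ValueError(f"{subcategory} is not under {category}")
    return {"question": question, "category": category, "subcategory": subcategory}

tagged = annotate("Which object hits the ground first?", "fundamental_physics", "gravity")
print(tagged["category"])  # fundamental_physics
```

Organizing annotations this way lets the same taxonomy drive both training-data curation and per-category benchmark reporting.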

Cosmos-Reason1's architecture uses a decoder-only LLM augmented with a vision encoder. Videos are processed to extract visual features, which are then projected into a shared embedding space with language tokens. This integration lets the model reason over textual and visual data simultaneously. The researchers curated a massive dataset of roughly 4 million annotated video-text pairs for training, including action descriptions, multiple-choice questions, and long chain-of-thought reasoning traces. The reinforcement learning phase is driven by rule-based, verifiable rewards derived from human-labeled multiple-choice questions and self-supervised video tasks. These tasks include predicting the temporal direction of a video and solving puzzles built from shuffled spatiotemporal patches, grounding the training deeply in real-world physical logic.
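The projection step described above can be sketched in a few lines: visual features from the encoder are mapped into the LLM's embedding dimension and concatenated with text token embeddings into one joint sequence. All dimensions below are hypothetical placeholders, and the random projection matrix stands in for a learned one; this is a minimal sketch, not NVIDIA's implementation.

```python
import numpy as np

# Minimal sketch of vision-to-LLM token projection. Dimensions (1024, 4096)
# and token counts are assumed for illustration; the projection would be
# a learned layer in practice, not random weights.
rng = np.random.default_rng(0)

VIS_DIM, LLM_DIM = 1024, 4096
W_proj = rng.standard_normal((VIS_DIM, LLM_DIM)) * 0.01   # stand-in for a learned projection

video_features = rng.standard_normal((16, VIS_DIM))    # 16 visual tokens from the encoder
text_embeddings = rng.standard_normal((32, LLM_DIM))   # 32 text token embeddings

visual_tokens = video_features @ W_proj                # map into the shared space
sequence = np.concatenate([visual_tokens, text_embeddings], axis=0)

print(sequence.shape)  # (48, 4096): one joint sequence for the decoder-only LLM
```

The decoder-only LLM then attends over this combined sequence exactly as it would over pure text, which is what allows a single model to reason across both modalities.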

The team constructed three benchmarks for physical common sense, covering space, time, and fundamental physics, comprising 604 questions drawn from 426 videos. Six benchmarks cover embodied reasoning, with 610 questions from 600 videos spanning a wide range of tasks. The Cosmos-Reason1 models outperformed previous baselines, especially after the RL phase. Notably, they improved at verifying task completion, predicting the next plausible action, and assessing the physical feasibility of actions. These gains were observed at both model sizes, with Cosmos-Reason1-56B showing stronger performance on most metrics. The improvement underscores the effectiveness of structured ontologies and multimodal data in strengthening physical reasoning in AI.

Several key takeaways from the Cosmos-Reason1 research:

  • Two models were introduced: Cosmos-Reason1-7B and Cosmos-Reason1-56B, trained specifically for physical reasoning tasks.
  • The models are trained in two stages: Physical AI supervised fine-tuning (SFT) and Physical AI reinforcement learning (RL).
  • The training dataset consists of approximately 4 million annotated video-text pairs curated for physical reasoning.
  • Reinforcement learning uses rule-based, verifiable rewards derived from human annotations and self-supervised video tasks.
  • The team relies on two ontologies: a hierarchical one with three categories and 16 subcategories, and a two-dimensional one mapping capabilities across embodied agents.
  • Benchmarks: 604 questions from 426 videos for physical common sense, and 610 questions from 600 videos for embodied reasoning.
  • After RL training, performance improvements were observed across all benchmarks, especially in predicting the next action and verifying task completion.
  • The models apply to robots, vehicles, and other embodied agents across diverse environments.
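The rule-based, verifiable rewards mentioned in the points above can be illustrated with two toy reward functions: an exact-match check on a multiple-choice answer, and a self-supervised check on predicted video time direction, where the label comes from how the clip was sampled. Function names and signatures here are illustrative assumptions, not the paper's code.

```python
# Illustrative sketch of rule-based, verifiable RL rewards. These helper
# functions are hypothetical stand-ins for the paper's reward rules.

def mcq_reward(predicted: str, gold: str) -> float:
    """1.0 if the model's chosen option matches the human-annotated answer."""
    return 1.0 if predicted.strip().upper() == gold.strip().upper() else 0.0

def time_direction_reward(predicted_forward: bool, clip_was_forward: bool) -> float:
    """Self-supervised: the ground truth comes from whether the training
    pipeline played the clip forward or reversed, so no human label is needed."""
    return 1.0 if predicted_forward == clip_was_forward else 0.0

print(mcq_reward("b", "B"))                 # 1.0
print(time_direction_reward(True, False))   # 0.0
```

Because both rewards are computed by a deterministic rule rather than a learned critic, they are cheap to verify and hard for the policy to exploit, which is the appeal of this style of RL signal.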

In short, the Cosmos-Reason1 project demonstrates how AI can be better equipped for the physical world. It addresses key limitations in perception, reasoning, and decision making that have hindered progress in deploying AI in embodied settings. Structured training pipelines grounded in real-world data and ontological frameworks keep the models both accurate and adaptable. These advances mark an important step toward bridging the gap between abstract AI reasoning and the needs of systems that must operate in unpredictable real-world environments.


Check out the paper, project page, models on Hugging Face, and the GitHub page. All credit for this research goes to the researchers on the project.


Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for social good. His most recent endeavor is Marktechpost, an artificial intelligence media platform known for in-depth coverage of machine learning and deep learning news that is both technically sound and accessible to a wide audience. The platform draws over 2 million views per month, reflecting its popularity among readers.
