Exploring NVIDIA Cosmos Reason2: A Reasoning Vision Language Model for Physical AI and Robotics

Summary

NVIDIA Cosmos Reason2 is an open, customizable reasoning vision language model (VLM) designed for physical AI and robotics. It enables robots and vision AI agents to reason like humans, leveraging prior knowledge, physics understanding, and common sense to comprehend and act in the real world. The model understands space, time, and fundamental physics, and serves as a planning tool to determine the next steps for embodied agents. Available in 2B and 8B parameter versions, it requires at least 24GB of GPU memory (32GB for the 8B model) and supports the Hopper and Blackwell architectures.

Introduction: Why Does Physical AI Need Human-Like “Thinking”?

Imagine designing a robot that navigates unknown environments: dodging obstacles, predicting object movements, and deciding its next action based on common sense. This sounds like science fiction, but NVIDIA Cosmos Reason2 is making it a reality. As someone who has worked in AI and robotics for years, I keep running into the same challenge: how do you make machines not just “see” the physical world, but truly “understand” it? Cosmos Reason2 addresses this by post-training with physical common sense and embodied reasoning capabilities, using chain-of-thought to mimic human reasoning without needing human annotations.

This model family was released on December 19, 2025, including 2B and 8B parameter versions, now available on Hugging Face. Whether you’re a robotics developer or a video analytics expert, this article will guide you step by step on how to set up, use, and extend it. We’ll start from the basics and dive deeper, ensuring you can get started easily.

Model Family: Choosing the Right Cosmos Reason2 Version for You

The Cosmos Reason2 family includes two main models: Cosmos-Reason2-2B and Cosmos-Reason2-8B. Both are built on the Qwen3-VL architecture and accept multimodal input, handling text prompts alongside videos or images.

  • Cosmos-Reason2-2B: Roughly 2 billion parameters, ideal for entry-level applications. The minimum GPU memory requirement is 24GB, suitable for video captioning or simple embodied reasoning tasks.
  • Cosmos-Reason2-8B: 8,767,123,696 parameters (about 8.8B), more powerful, supporting long-context understanding up to 256K input tokens. The minimum GPU memory is 32GB, with enhancements in spatiotemporal understanding, timestamp precision, and object detection (including 2D/3D point localization and bounding box coordinates).

These models are post-trained using supervised fine-tuning and reinforcement learning, injecting physical common sense and embodied reasoning data. The result? They handle long-tail diverse physical scenarios, such as fast camera movements, overlapping human-object interactions, or low-light high-blur environments with complex dynamics.

If you’re a beginner, start with the 2B version for testing; for production-level applications like robot planning, the 8B version offers more precise spatiotemporal reasoning. Both models are developed by NVIDIA and licensed for commercial use.
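
If you’re not sure which variant your hardware can host, a quick check along these lines can help. This is a minimal sketch that only uses the memory minimums quoted above as thresholds; the model names are the Hugging Face IDs used later in this article:

import torch

# Pick a variant from the minimum-memory figures quoted above:
# 24GB for Cosmos-Reason2-2B, 32GB for Cosmos-Reason2-8B.
assert torch.cuda.is_available(), "No CUDA GPU detected"
gpu_mem_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
model_name = "nvidia/Cosmos-Reason2-8B" if gpu_mem_gb >= 32 else "nvidia/Cosmos-Reason2-2B"
print(f"Detected {gpu_mem_gb:.0f} GB of GPU memory -> {model_name}")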

New Feature Highlights: What’s Upgraded in Cosmos Reason2?

Compared to its predecessor, Cosmos Reason2 introduces several key improvements, making it stand out in physical AI:

  • Enhanced Physical AI Reasoning: Improved spatiotemporal understanding and timestamp precision for pinpointing events in video frames.
  • Object Detection Support: Provides 2D/3D point localization, bounding box coordinates, along with reasoning explanations and labels.
  • Long-Context Handling: Input token limit of 256K, ideal for analyzing lengthy videos or complex instructions.

Use cases include:

  • Video Analytics AI Agents: Extract insights and perform root-cause analysis on massive video data volumes. Applicable to recorded or live video streams in city and industrial operations.
  • Data Curation and Annotation: Automate high-quality curation and annotation of large, diverse training datasets. NVIDIA Cosmos Curator framework, powered by Cosmos Reason, helps developers filter, annotate, and deduplicate sensor data for physical AI development.
  • Robot Planning and Reasoning: Acts as the brain for deliberate decision-making in vision language action (VLA) models. Robots like humanoids and autonomous vehicles (AVs) can interpret environments, break down complex commands, and execute them using common sense, even in unfamiliar settings.

These features position the model as a core planning component for physical AI, predicting the next actions for embodied agents.

Setting Up the Environment: Installing Cosmos Reason2 from Scratch

Setting up the Cosmos Reason2 environment is straightforward, but pay attention to hardware compatibility. The model works well on Hopper and Blackwell GPUs, tested on NVIDIA H100 (CUDA 12.8), GB200 (CUDA 13.0), DGX Spark (CUDA 13.0), and Jetson AGX Thor (CUDA 13.0). Other configurations may work but aren’t officially validated.

Choosing Your Environment

You have two main options: a virtual environment or a Docker container.

Virtual Environment Setup

  1. Install system dependencies:

    sudo apt-get install curl ffmpeg git git-lfs
    
  2. Install the uv tool:

    curl -LsSf https://astral.sh/uv/install.sh | sh
    source $HOME/.local/bin/env
    
  3. Install Hugging Face CLI and log in:

    uv tool install -U huggingface_hub
    hf auth login
    
  4. Clone the repository and install:

    git clone https://github.com/nvidia-cosmos/cosmos-reason2.git
    cd cosmos-reason2
    uv sync --extra cu128
    source .venv/bin/activate
    

    CUDA variants:

    CUDA Version  Arguments      Notes
    12.8          --extra cu128  Requires the corresponding NVIDIA driver
    13.0          --extra cu130  Must be used for DGX Spark and Jetson AGX; set TRITON_PTXAS_PATH="/usr/local/cuda/bin/ptxas"

Docker Container Setup

Ensure you have Docker and the NVIDIA Container Toolkit installed.

  1. Build the image:

    image_tag=$(docker build -f Dockerfile --build-arg=CUDA_VERSION=12.8.1 -q .)
    

    CUDA variants are similar to the table above.

  2. Run the container:

    docker run -it --gpus all --ipc=host --rm -v .:/workspace -v /workspace/.venv -v /workspace/examples/cosmos_rl/.venv -v /root/.cache:/root/.cache -e HF_TOKEN="$HF_TOKEN" $image_tag
    

    Optional arguments explained:

    • --ipc=host: Handles high shared memory needs for parallel torchrun.
    • -v /root/.cache:/root/.cache: Avoids re-downloading cache.
    • -e HF_TOKEN="$HF_TOKEN": Sets the Hugging Face token.

These steps get you up and running quickly. For inference alone you don’t even need the full repository; the inference_sample.py script serves as a standalone reference.
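
Before moving on, a quick sanity check confirms the environment can see your GPU and has a recent enough Transformers build. This is a minimal sketch, assuming torch and transformers were pulled in by the setup steps above:

import torch
import transformers

# Confirm a CUDA-capable GPU is visible and report versions.
# Cosmos Reason2 needs transformers>=4.57.0 for the Qwen3-VL classes used below.
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
print("torch:", torch.__version__, "| CUDA:", torch.version.cuda)
print("transformers:", transformers.__version__)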

Inference Guide: How to Process Videos and Images with Cosmos Reason2

Inference is the core functionality of Cosmos Reason2, supporting Transformers and vLLM deployment. Minimum GPU memory: 24GB for 2B model, 32GB for 8B.

Inference with Transformers

Cosmos Reason2 is integrated in transformers>=4.57.0.

Minimal example (processing a video):

import transformers
import torch

model_name = "nvidia/Cosmos-Reason2-2B"
model = transformers.Qwen3VLForConditionalGeneration.from_pretrained(
    model_name, dtype=torch.float16, device_map="auto", attn_implementation="sdpa"
)
processor: transformers.Qwen3VLProcessor = (
    transformers.AutoProcessor.from_pretrained(model_name)
)

video_messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a helpful assistant."}],
    },
    {"role": "user", "content": [
            {
                "type": "video",
                "video": "file:///path/to/your/video.mp4",
                "fps": 4,
            },
            {"type": "text", "text": (
                    "Is it safe to turn right? Answer the question using the following format:\n\n<think>\nYour reasoning.\n</think>\n\nWrite your final answer immediately after the </think> tag."
                )
            },
        ]
    },
]

# Process inputs
inputs = processor.apply_chat_template(
    video_messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
    fps=4,
)
inputs = inputs.to(model.device)

# Run inference
generated_ids = model.generate(**inputs, max_new_tokens=4096)
generated_ids_trimmed = [
    out_ids[len(in_ids) :]
    for in_ids, out_ids in zip(inputs.input_ids, generated_ids, strict=False)
]
output_text = processor.batch_decode(
    generated_ids_trimmed,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False,
)

Key tips:

  • Inputs: Text + video (mp4, FPS=4 to match training) or images (jpg).
  • Outputs: Text strings wrapped in <think> and </think> tags to encourage long chain-of-thought reasoning (see the parsing sketch below).
  • Recommend max_new_tokens=4096 or more to avoid truncating long responses.
  • The model recognizes timestamps at the bottom of frames for precise temporal localization.

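Because the reasoning trace and the final answer come back in a single text string, you typically split on the closing tag yourself. A minimal sketch, assuming output_text is the list returned by processor.batch_decode in the example above:

# Separate the chain-of-thought from the final answer.
raw = output_text[0]
if "</think>" in raw:
    reasoning, answer = raw.split("</think>", 1)
    reasoning = reasoning.replace("<think>", "").strip()
    answer = answer.strip()
else:
    reasoning, answer = "", raw.strip()

print("Reasoning:", reasoning)
print("Answer:", answer)
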
On Jetson AGX Thor, Transformers inference is supported; vLLM inference is coming soon.

Deployment: Online Serving and Offline Inference

Use vLLM >= 0.11.0 for deployment.

Online Serving

Start the server:

vllm serve nvidia/Cosmos-Reason2-2B \
  --allowed-local-media-path "$(pwd)" \
  --max-model-len 16384 \
  --media-io-kwargs '{"video": {"num_frames": -1}}' \
  --reasoning-parser qwen3 \
  --port 8000

Optional:

  • --max-model-len 16384: Range 8192-16384 to prevent OOM.
  • --media-io-kwargs: Allows overriding FPS per sample.
  • --reasoning-parser qwen3: Parses reasoning traces.
  • --port 8000: Avoids address conflicts.

After startup, video captioning example:

cosmos-reason2-inference online --port 8000 -i prompts/caption.yaml --reasoning --videos assets/sample.mp4 --fps 4

Embodied reasoning example (with verbose output):

cosmos-reason2-inference online -v --port 8000 -i prompts/embodied_reasoning.yaml --reasoning --images assets/sample.png
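
The cosmos-reason2-inference commands above are the repository’s helpers, but because vllm serve exposes an OpenAI-compatible API you can also call the server from your own code. A minimal sketch, assuming the default http://localhost:8000/v1 endpoint, the openai Python package (not installed by the steps above), and a video path under the --allowed-local-media-path directory:

from openai import OpenAI

# Query the local vLLM server through its OpenAI-compatible chat endpoint.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="nvidia/Cosmos-Reason2-2B",
    messages=[
        {"role": "user", "content": [
            # file:// paths must sit under --allowed-local-media-path.
            {"type": "video_url", "video_url": {"url": "file:///path/to/your/video.mp4"}},
            {"type": "text", "text": "Caption this video, focusing on the physical interactions."},
        ]},
    ],
    max_tokens=4096,
)
print(response.choices[0].message.content)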

Offline Inference

Temporally caption a video and save the frames for debugging:

cosmos-reason2-inference offline -v --max-model-len 16384 -i prompts/temporal_localization.yaml --videos assets/sample.mp4 --fps 4 -o outputs/temporal_localization

Common parameters:

  • --model nvidia/Cosmos-Reason2-2B: Model name or path.

These tools make deployment simple for batch processing or real-time services.

Post-Training: Customizing Your Cosmos Reason2

Cosmos Reason2 supports post-training with TRL and Cosmos-RL examples.

  • TRL: See examples/notebooks/README.md for supervised fine-tuning.
  • Cosmos-RL: See examples/cosmos_rl/README.md for reinforcement learning on embodied reasoning datasets.

Training datasets include EgoExo4D, PerceptionTest, Language Table, IntPhys, InfLevel, and CLEVRER, with hybrid automatic/sensor collection and human/automated labeling. Evaluation datasets are the same, covering multimodal videos, sensor signals, and structured physical reasoning.

Data format: Videos (mp4) and text.

With these, you can fine-tune the model for specific robotics tasks.
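
To make the video-plus-text format concrete, a single chat-style supervised fine-tuning record might look like the following. This is a hypothetical example that mirrors the inference message layout; the exact schema expected by the TRL and Cosmos-RL examples is documented in their respective READMEs:

# Hypothetical SFT record: one video, a question, and a target reasoning trace plus answer.
sample = {
    "messages": [
        {"role": "user", "content": [
            {"type": "video", "video": "file:///data/clips/pick_and_place.mp4", "fps": 4},
            {"type": "text", "text": "What should the robot arm do next?"},
        ]},
        {"role": "assistant", "content": [
            {"type": "text", "text": (
                "<think>\nThe gripper is above the red block and the bin is to its left...\n</think>\n\n"
                "Lower the gripper, grasp the red block, and place it in the bin."
            )},
        ]},
    ],
}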

Quantization: Optimizing Performance

Use llmcompressor for quantization (see docs/llmcompressor.md). This reduces memory usage while maintaining inference quality, ideal for edge devices like Jetson AGX.
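
As a rough outline of what that flow looks like, here is a sketch based on llmcompressor’s generic FP8 dynamic quantization example. The recipe details are assumptions (in particular, which modules to leave unquantized); docs/llmcompressor.md has the procedure NVIDIA actually recommends:

import transformers
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

model_name = "nvidia/Cosmos-Reason2-2B"
model = transformers.Qwen3VLForConditionalGeneration.from_pretrained(model_name, dtype="auto")
processor = transformers.AutoProcessor.from_pretrained(model_name)

# FP8 dynamic quantization of the Linear layers; the ignore list is a guess,
# so check docs/llmcompressor.md for the modules to exclude on a VLM.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])
oneshot(model=model, recipe=recipe)

save_dir = "Cosmos-Reason2-2B-FP8-Dynamic"
model.save_pretrained(save_dir)
processor.save_pretrained(save_dir)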

Troubleshooting: Solving Common Issues

Running into problems? Check docs/troubleshooting.md. Common issues include:

  • OOM: Reduce max-model-len.
  • CUDA compatibility: Ensure drivers match.
  • Server startup failure: Verify port availability.

For DGX Spark and Jetson, always use CUDA 13.0 and set TRITON_PTXAS_PATH.

Additional Resources: Dive Deeper

  • Troubleshooting: docs/troubleshooting.md
  • Example prompts: prompts/README.md
  • Based on Qwen3-VL architecture: Refer to Qwen3-VL repository, vLLM docs, and Qwen3 documentation.
  • vLLM resources: Online serving, offline inference, multimodal inputs, and LoRA.

License and Contact: Key Considerations for Usage

NVIDIA Cosmos source code is released under the Apache 2.0 license. Models are under the NVIDIA Open Model License, version release date September 23, 2025.

Key terms:

  • Commercially usable; free to create and distribute derivative models; NVIDIA claims no ownership over outputs.
  • Definitions: Derivative models include modifications; legal entity defines control relationships.
  • Use conditions: Comply with AI ethics (NVIDIA Trustworthy AI terms); bypassing safeguards terminates rights; models may be designated special-purpose.
  • License grant: Perpetual, worldwide, non-exclusive, no-charge, revocable license for performing, displaying, reproducing, using, creating derivatives, making, selling, distributing, and importing.
  • IP ownership: NVIDIA owns the model and its derivatives; you own yours; no rights to outputs.
  • Redistribution: Provide a copy of this agreement; include “Licensed by NVIDIA Corporation” and “Built on NVIDIA Cosmos” attributions.
  • Separate components: May include open-source licenses.
  • Disclaimer: Provided AS IS, without warranties.
  • Limitation of liability: No liability for damages.
  • Indemnity: You indemnify NVIDIA against third-party claims.
  • Feedback: NVIDIA may use without compensation.
  • Governing law: U.S. and Delaware law; Santa Clara County courts have jurisdiction.
  • Trade compliance: Adhere to export, import, trade, and sanctions laws.

For a custom license, contact cosmos-license@nvidia.com.

Important: Bypassing safeguards (e.g., technical limitations, safety guardrails) automatically terminates rights.

FAQ: Answering Your Questions

What hardware is Cosmos Reason2 compatible with?

Tested on H100 (CUDA 12.8), GB200 (13.0), DGX Spark (13.0), and Jetson AGX Thor (13.0). Hopper and Blackwell are supported.

How do I handle long videos?

Use FPS=4; max_new_tokens=4096+; long-context support for 256K tokens.

Can the model detect objects?

Yes, the 8B version supports 2D/3D points, bounding boxes, and explanations.

What are the training datasets?

EgoExo4D, PerceptionTest, and others, with hybrid collection and labeling, covering videos and physical reasoning.

How can I customize the model?

Use TRL for fine-tuning or Cosmos-RL for reinforcement learning.

Is the license commercial-friendly?

Yes, but comply with the NVIDIA Open Model License.

How-To: Building a Video Analytics Agent Step by Step

  1. Prepare the Environment: Follow the virtual environment setup and install dependencies.
  2. Load the Model: Use Transformers to load nvidia/Cosmos-Reason2-8B.
  3. Prepare Inputs: Create video_messages with system prompts and user content (video path, FPS=4, text prompt like “Analyze safe turning”).
  4. Process and Generate: Apply chat_template and generate outputs.
  5. Parse Responses: Extract the reasoning and final answer using the <think> and </think> tags.
  6. Deploy: Use vLLM for online serving and integrate into your agent.
  7. Test: Run with sample videos and check spatiotemporal accuracy.

This workflow can extend to root-cause analysis or real-time streams.

Ethical Considerations: Responsible Use of Cosmos Reason2

NVIDIA emphasizes that Trustworthy AI is a shared responsibility. Model usage must align with Trustworthy AI terms.

  • Bias Mitigation: Training video sources include diverse physical embodiments (e.g., humans, cars, robots) and environments (indoor/outdoor), reducing biases through varied datasets.
  • Explainability: The model is a Transformer that outputs text; it pairs a vision encoder and projector with an LLM and supports step-by-step reasoning.
  • Privacy: No personal data can be generated or reverse-engineered from the model; datasets were reviewed before release; usage complies with NVIDIA’s privacy policy.
  • Safety: No known life-critical applications. Restrictions follow the principle of least privilege, with access controls on models and datasets.

Potential risks: Outputs may generate harmful text; users are responsible for inputs/outputs and guardrails.

Quality Benchmarks: How Cosmos Reason2 Performs

On the Physical AI Bench Leaderboard, Cosmos-Reason2-8B excels in spatiotemporal understanding and embodied reasoning. Benchmarks include EgoExo4D datasets, evaluating accuracy via visual question answering.

System requirements: Tested on H100/A100; BF16 precision inference; 32GB+ GPU memory.

Conclusion: Ushering in a New Era of Physical AI

Cosmos Reason2 isn’t just a model; it’s a gateway to intelligent robotics. By understanding the physical world, it empowers developers to build more reliable AI agents. From setup to deployment, this article has provided practical guidance for video curation and robot planning. Start your journey: clone the repo, run the samples, and explore the possibilities. If you have feedback, NVIDIA welcomes it.