Cambrian-S: Teaching AI to Understand Space Like Humans Do – A Deep Dive into Spatial Supersensing
Imagine asking a home robot to “find the coffee mug you saw on the kitchen counter three hours ago.” For humans, this is effortless—we maintain an implicit mental model of our environment, tracking objects and spaces over time without conscious effort. For today’s AI systems, this seemingly simple task remains nearly impossible. Most video AI models excel at describing what’s directly in front of them but struggle to build persistent, structured understandings of 3D space that survive viewpoint changes, occlusions, and long time gaps.
This article answers: Why do current video AI models fail at real-world spatial reasoning, and how does Cambrian-S represent a fundamental shift toward true spatial intelligence?
We recently spent months building and testing Cambrian-S, a new family of video understanding models designed specifically for spatial supersensing—the ability to not just see, but to construct, update, and predict using an implicit 3D world model. What we discovered challenges the prevailing assumption that scaling data and model size alone solves video understanding. The path forward requires a different paradigm entirely.
The Four Stages of True Visual Intelligence
This section answers: What does it mean for an AI to truly “see” and understand space, beyond just describing images?
Most benchmarks treat video understanding as a single capability. In reality, spatial intelligence develops in distinct stages, each building on the previous. Our framework identifies four developmental milestones:
Stage 1: Semantic Perception
This is where most of today’s models live. The system can name objects, describe attributes, and answer direct questions about visible content. Ask “What color is the car?” and you get an answer. This works well for short clips with clear questions but breaks down when the task requires remembering what happened earlier or reasoning about hidden spatial relationships.
Scenario: A security camera system that flags when a “person” appears in frame operates at this level. It recognizes the object but doesn’t understand where the person came from, how they moved through the space, or what their trajectory implies about intent.
Stage 2: Streaming Event Cognition
Here, the system maintains awareness across continuous input streams. Instead of processing isolated clips, it tracks evolving situations and can proactively respond to events. This requires some form of memory—not just storing frames, but integrating information over time into coherent event representations.
Scenario: A smart home assistant that notices you’ve left the refrigerator door open for five minutes and reminds you. It needs to: (1) recognize the door state initially, (2) maintain that knowledge while you walk away, (3) track time, and (4) generate an interrupt when the condition persists.
Stage 3: Implicit 3D Spatial Cognition
This is where things get interesting. The system begins treating video not as flat pixels, but as projections of a hidden 3D world. It develops intuitions about object permanence, relative positions, scale, and how arrangements change with viewpoint. Questions like “If I’m standing by the refrigerator facing the washer, is the stove to my left or right?” become answerable.
Scenario: A delivery robot navigating an apartment building must understand that the hallway it sees from different angles is the same physical space. It needs to track its position, understand room connectivity, and remember object locations independent of its current viewpoint.
Stage 4: Predictive World Modeling
The final stage mirrors human cognition: the system builds unconscious internal models that predict future sensory input. When predictions fail—when something “surprising” happens—it triggers attention, memory encoding, and learning. This predictive mechanism naturally filters irrelevant information and highlights what matters, enabling efficient processing of arbitrarily long streams.
Scenario: When you walk into your living room and immediately notice a new vase on the table, your brain didn’t analyze every object from scratch. Your internal model predicted the scene, and the mismatch drew your attention instantly. Current AI lacks this entirely, analyzing every frame with equal (and expensive) attention.
The VSI-SUPER Benchmark: A Reality Check for Spatial AI
This section answers: How do we actually measure whether an AI system has spatial supersensing capabilities?
Existing video benchmarks mostly test Stage 1 and 2 capabilities. We audited popular benchmarks by feeding models three types of input: full video frames, single frames, and text captions. The results were revealing—many benchmarks show little difference between these conditions, suggesting they test linguistic knowledge more than visual sensing.
To expose this gap, we created VSI-SUPER (Visual-Spatial Intelligence for SUPersensing Evaluation and Research). Unlike traditional benchmarks, VSI-SUPER consists of two deliberately challenging tasks that resist shortcuts:
VSI-SUPER Recall (VSR): The Video Needle-in-Haystack Test
VSR tests long-horizon spatial observation and sequential recall. We took realistic room-tour videos and used generative models to insert unusual objects (like a teddy bear) into four distinct frames at different locations. These edited clips were concatenated with other videos to create streams of arbitrary length—10, 30, 60, 120, even 240 minutes.
The task: “Which of the following correctly represents the order in which the Teddy Bear appeared?”
Why this is hard:
- Models must monitor the entire stream continuously, not just sample frames
- They need to remember both what appeared and where in sequence
- The “needles” are visually integrated into scenes, not obvious overlays
- Videos can exceed any fixed context window, forcing selective retention
VSI-SUPER Count (VSC): Continual Counting Across Scenes
VSC tests the ability to accumulate information across changing viewpoints. We concatenated multi-room tour videos and asked: “How many chairs are there in total across all rooms?”
Why this is hard:
- Same objects appear multiple times from different angles
- The model must avoid double-counting while not missing anything
- The answer evolves as the video progresses (streaming evaluation)
- It requires generalizing the concept of counting, not memorizing training distributions
Our baseline results were sobering. Even Gemini-2.5-Flash, with a million-token context, dropped from 45.7% accuracy on 60-minute VSR to 41.5% at 120 minutes (hitting context limits). On VSC, its counts saturated at small constants—failing to scale with actual object quantities—revealing a lack of genuine counting ability.
Inside Cambrian-S: Architecture and Training Pipeline
This section answers: What exactly is Cambrian-S, and how was it built to tackle spatial reasoning?
Cambrian-S is a family of video Multimodal Large Language Models (MLLMs) with four model sizes (0.5B to 7B parameters). But the architecture choices and training recipe matter more than parameter count.
Architecture: Keeping It Simple and Efficient
We deliberately avoided complex fusion mechanisms. Our design follows a clean three-component structure:
- Vision Encoder: SigLIP2-SO400M (trained with text prediction, contrastive, and masked self-prediction losses)
- Language Model: Instruction-tuned Qwen2.5 (0.5B, 1.5B, 3B, or 7B)
- Vision-Language Connector: A simple two-layer MLP with GELU activation
Why this matters: Many models use sophisticated but computationally expensive fusion strategies. Our simpler connector maintains performance while being more efficient—a crucial tradeoff when processing long video streams.
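To make the connector concrete, here is a minimal sketch of such a two-layer GELU projector. The dimensions are assumptions for illustration (SigLIP2-SO400M features of width 1152 projected into the 3584-dimensional hidden space of a 7B Qwen2.5); this sketches the design, it is not the released implementation.

```python
import torch
import torch.nn as nn

class VisionLanguageConnector(nn.Module):
    """Sketch of a two-layer MLP projector with GELU (dimensions assumed)."""

    def __init__(self, vision_dim: int = 1152, llm_dim: int = 3584):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_features: torch.Tensor) -> torch.Tensor:
        # vision_features: (batch, num_visual_tokens, vision_dim)
        return self.proj(vision_features)

# Example: 64 visual tokens for one frame, projected into the LLM token space
tokens = VisionLanguageConnector()(torch.randn(1, 64, 1152))
print(tokens.shape)  # torch.Size([1, 64, 3584])
```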
The Four-Stage Training Recipe
Stage 1: Vision-Language Alignment
We freeze everything except the connector and train on 2.5M image-text pairs. Images are padded to 384×384. This establishes basic cross-modal understanding.
Stage 2: Image Instruction Tuning
We unfreeze the connector and LLM, training on 7M image instructions using an AnyRes strategy. Images are resized to preserve aspect ratio and split into up to nine 384×384 sub-images. This teaches the model to handle high-resolution inputs flexibly.
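A rough sketch of the AnyRes tiling idea follows; the exact resize policy is our approximation, not the paper's recipe. The image is scaled so it fits within at most nine 384×384 tiles, snapped to tile multiples, and then cropped.

```python
from PIL import Image

def anyres_tiles(img: Image.Image, tile: int = 384, max_tiles: int = 9):
    """Approximate AnyRes: scale to fit within `max_tiles` tiles, then crop.

    Snapping to tile multiples slightly adjusts the aspect ratio; the real
    recipe may handle this differently.
    """
    w, h = img.size
    scale = min(1.0, (max_tiles * tile * tile / (w * h)) ** 0.5)
    new_w = max(tile, int(w * scale) // tile * tile)
    new_h = max(tile, int(h * scale) // tile * tile)
    img = img.resize((new_w, new_h))
    tiles = [img.crop((x, y, x + tile, y + tile))
             for y in range(0, new_h, tile)
             for x in range(0, new_w, tile)]
    return tiles[:max_tiles]

print(len(anyres_tiles(Image.new("RGB", (1920, 1080)))))  # 8 tiles for a 1080p image
```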
Stage 3: General Video Instruction Tuning
We introduce video data—3 million samples from diverse sources like Ego4D, VideoChatGPT, and YouCook2. We uniformly sample 64 frames per video, resize to 384×384, and downsample feature maps to 8×8 (64 tokens per frame). This builds broad video understanding before specialization.
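A minimal sketch of the two preprocessing steps just described: picking 64 uniformly spaced frames, then reducing each frame's feature map to an 8×8 grid (64 tokens per frame). The 27×27 input grid is an assumed SigLIP2 patch layout for 384×384 inputs, and average pooling is our guess at the downsampling operator.

```python
import torch
import torch.nn.functional as F

def uniform_frame_indices(num_video_frames: int, num_samples: int = 64) -> list:
    """Pick `num_samples` evenly spaced frame indices across the video."""
    step = num_video_frames / num_samples
    return [min(int(i * step), num_video_frames - 1) for i in range(num_samples)]

def downsample_frame_features(feat: torch.Tensor, out_grid: int = 8) -> torch.Tensor:
    """Pool a (C, H, W) frame feature map into an (out_grid*out_grid, C) token sequence."""
    pooled = F.adaptive_avg_pool2d(feat, out_grid)  # (C, 8, 8)
    return pooled.flatten(1).transpose(0, 1)        # (64, C)

indices = uniform_frame_indices(num_video_frames=9000)  # e.g. a 5-minute clip at 30 FPS
feat = torch.randn(1152, 27, 27)                         # assumed patch grid for a 384x384 input
tokens = downsample_frame_features(feat)
print(len(indices), tokens.shape)                        # 64 torch.Size([64, 1152])
```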
Stage 4: Spatial Video Instruction Tuning
This is where Cambrian-S differentiates. We fine-tune on our VSI-590K dataset (590K spatial QA pairs) mixed with general video data to prevent catastrophic forgetting. We increase to 128 frames per video and sequence length to 16,384 tokens. This final stage injects explicit spatial reasoning capabilities.
Reflection: We initially assumed general video training would automatically confer spatial reasoning. It didn’t. The performance gap between Stage 3 and Stage 4 was stark—general video understanding doesn’t generalize to precise spatial cognition. Spatial intelligence requires targeted, spatially-rich data and deliberate training.
The Data Engine: Building VSI-590K for Spatial Learning
This section answers: What kind of data actually teaches an AI to understand 3D space?
Data quality beats quantity for spatial reasoning. We curated VSI-590K from three distinct sources, each serving a specific purpose:
Annotated Real Videos: The Gold Standard
We repurposed 3D-annotated indoor datasets (ScanNet, ScanNet++, ARKitScenes, S3DIS, Aria Digital Twin) containing instance-level 3D reconstructions. From these, we extracted:
- Object counts by category
- Precise bounding boxes in 3D space
- Room dimensions and layouts
We then generated diverse QA pairs using templates covering 12 question types: size, direction, count, distance, and appearance order—each in relative and absolute variants, from camera and object perspectives.
Example QA from ScanNet++:
“Standing by the backpack, looking toward the table, how far counterclockwise in degrees must I turn to see the trash bin?”
Answer: 334.09 degrees
This teaches the model to reason about rotational geometry from specific viewpoints—something impossible without accurate 3D annotations.
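To see what such an answer involves, here is a small worked sketch with made-up coordinates (our illustration, not the dataset's generation code): take ground-plane positions from the 3D annotations, compute the bearing from the viewpoint to each object, and measure the counterclockwise difference.

```python
import math

def ccw_turn_degrees(standing_at, facing_toward, target) -> float:
    """Counterclockwise turn (degrees) to face `target` when standing at
    `standing_at` and initially facing `facing_toward`.
    Points are (x, y) ground-plane coordinates taken from 3D annotations."""
    def bearing(src, dst):
        return math.atan2(dst[1] - src[1], dst[0] - src[0])

    delta = bearing(standing_at, target) - bearing(standing_at, facing_toward)
    return math.degrees(delta) % 360.0  # normalize to [0, 360)

# Illustrative layout (not from ScanNet++): backpack at the origin, table
# straight ahead, trash bin slightly clockwise of the facing direction.
print(round(ccw_turn_degrees((0, 0), (2, 0), (2.0, -0.94)), 1))  # ~334.8 degrees
```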
Simulated Data: Scaling Diversity
Real 3D annotations are scarce. We used ProcTHOR to procedurally generate 625 video traversals through diverse synthetic indoor environments with random layouts. From Hypersim, we sampled 5,113 photorealistic images from 461 scenes.
Why simulation works: We can generate unlimited variations in room layout, object placement, and camera trajectories. The model learns spatial invariants—what makes a kitchen a kitchen, how furniture typically arranges—that transfer to real scenes.
Unannotated Web Videos: Capturing Real-World Diversity
We collected ~19,000 room-tour videos from YouTube and robot datasets (Open-X-Embodiment, AgiBot-World). Since these lack 3D labels, we built a pseudo-annotation pipeline:
- Frame sampling: Extract frames at regular intervals, filter blurry ones
- Object detection: Use Grounding-DINO to detect objects in predefined categories
- Segmentation: Apply SAM2 to get instance masks
- 3D lifting: Use VGGT to extract 3D point clouds for each frame
- Refinement: Apply erosion to masks to clean boundary artifacts
Example output from a random YouTube tour:
“How many chairs are present?”
Answer: 6
This process yields noisy but useful 3D structure. Crucially, we process frames individually rather than across entire videos—this prevents error accumulation that plagues full-video reconstruction.
Data effectiveness ranking: Our ablations revealed a clear hierarchy: annotated real videos > simulated data > pseudo-annotated web videos. All three contributed positively, but nothing beats ground truth 3D geometry for teaching precise spatial reasoning.
Why Scale Alone Isn’t Enough: The Limitations We Discovered
This section answers: If we just keep making models bigger and training on more data, will they eventually develop spatial intelligence?
After training Cambrian-S, we achieved impressive results—67.5% on VSI-Bench, beating Gemini-2.5-Pro by 16 points. Our 0.5B model matched Gemini-1.5-Pro. We had proven that targeted data curation works.
Then we tested on VSI-SUPER. The results humbled us:
| Model | VSI-Bench (Short Videos) | VSR 10min | VSR 60min | VSC 10min | VSC 60min |
|---|---|---|---|---|---|
| Cambrian-S-7B | 67.5% | 38.3% | 6.0% | 0.6% | 0.0% |
Performance collapsed as video length grew. Even on 10-minute videos that fit comfortably in context, our model struggled. It wasn’t a context length problem—it was a fundamental limitation of the frame-by-frame paradigm.
What we learned: Current MLLMs treat video as a sequence to be memorized, not a world to be modeled. They lack:
- Selective attention: Humans unconsciously filter 99% of sensory input. Models process everything equally.
- Predictive mechanisms: Humans anticipate what should happen next. Models react to what already happened.
- Event structure: Humans chunk experience into episodes. Models see a flat token sequence.
Reflection: This was our crucial insight. We had optimized the existing paradigm to its limits—better data, better training, better architecture. But the paradigm itself was the bottleneck. You can’t achieve human-like spatial reasoning by scaling a fundamentally reactive system. You need a system that anticipates and organizes experience.
This realization led us to prototype something different: predictive sensing.
Predictive Sensing: Teaching AI to Anticipate the World
This section answers: How can we build AI that doesn’t just see, but predicts and organizes sensory input like humans do?
Inspired by theories of human cognition—particularly Karl Friston’s free-energy principle and the violation-of-expectation paradigm—we prototyped a predictive sensing mechanism. The core idea: build an internal world model that continuously predicts the next sensory state, using prediction errors (“surprises”) to drive learning and memory.
The Latent Frame Prediction Head
We added a lightweight self-supervised module to Cambrian-S: a two-layer MLP that runs parallel to the language head, predicting the latent representation of the next video frame.
Training: During Stage 4, we jointly optimize two objectives:
- The standard next-token prediction loss for QA tasks
- Cosine distance + MSE loss between predicted and actual next-frame features
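A minimal sketch of how these two objectives might be combined, using the 0.1 LFP coefficient reported in the training recipe below (this is an illustration, not the released training code):

```python
import torch
import torch.nn.functional as F

def lfp_loss(pred_next: torch.Tensor, true_next: torch.Tensor) -> torch.Tensor:
    """Cosine distance + MSE between predicted and actual next-frame features."""
    cosine_distance = 1.0 - F.cosine_similarity(pred_next, true_next, dim=-1).mean()
    return cosine_distance + F.mse_loss(pred_next, true_next)

def joint_loss(lm_loss, pred_next, true_next, lfp_coeff: float = 0.1):
    """Standard next-token loss plus the weighted latent frame prediction loss."""
    return lm_loss + lfp_coeff * lfp_loss(pred_next, true_next)

# Illustrative shapes: (batch, frames, feature_dim)
lm = torch.tensor(2.3)
pred, true = torch.randn(2, 16, 1152), torch.randn(2, 16, 1152)
print(joint_loss(lm, pred, true))
```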
The LFP head learns to model spatiotemporal dynamics without explicit supervision. It develops an internal physics model: objects persist, motion is continuous, scenes change gradually.
Inference: For each incoming frame, we compute the cosine distance between the LFP prediction and the actual encoded frame. Large distances = high surprise = something unexpected happened.
This surprise signal becomes a powerful, self-supervised control mechanism.
Case Study 1: Surprise-Driven Memory Management for VSR
Traditional memory systems store everything or use heuristics like frame similarity. Ours uses surprise to decide what to keep.
How it works:
- Streaming encoding: Process frames with sliding window attention (fixed size)
- Surprise assignment: Compute prediction error for each frame
- Selective compression: Frames with surprise below threshold get 2x KV-cache compression before entering long-term memory
- Consolidation: When memory budget fills, we drop the least surprising frames first—those that matched predictions and likely contain redundant information
- Query retrieval: For questions, retrieve top-K most relevant frames by computing attention similarity between query and stored frame features
Results on VSR: Our surprise-driven system maintained 45%+ accuracy across all video lengths (10 to 240 minutes) while keeping GPU memory constant. In contrast, memory usage for standard methods grew linearly with video length.
Why this works: Predictable frames (slow camera pans, static scenes) compress heavily. Unexpected frames (object appearances, scene cuts) retain full resolution. The model automatically allocates resources to what matters.
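The steps above can be sketched as a small memory class. This is our sketch, not the released implementation: the 2x compression here is a crude token-averaging stand-in for KV-cache compression, and retrieval uses plain cosine similarity instead of attention scores.

```python
import torch
import torch.nn.functional as F

class SurpriseDrivenMemory:
    """Sketch of surprise-driven memory: compress predictable frames,
    drop the least surprising ones when over budget, retrieve by similarity."""

    def __init__(self, budget: int = 256, surprise_threshold: float = 0.3):
        self.budget = budget
        self.threshold = surprise_threshold
        self.frames = []  # list of (surprise, token_tensor)

    def surprise(self, predicted: torch.Tensor, actual: torch.Tensor) -> float:
        # Cosine distance between the LFP prediction and the encoded frame.
        return 1.0 - F.cosine_similarity(
            predicted.flatten(), actual.flatten(), dim=0).item()

    def add(self, predicted: torch.Tensor, actual: torch.Tensor) -> None:
        s = self.surprise(predicted, actual)
        tokens = actual
        if s < self.threshold:
            # Compress predictable frames: average adjacent token pairs (2x).
            tokens = tokens.view(tokens.shape[0] // 2, 2, -1).mean(dim=1)
        self.frames.append((s, tokens))
        if len(self.frames) > self.budget:
            # Consolidation: drop the least surprising stored frame.
            drop = min(range(len(self.frames)), key=lambda i: self.frames[i][0])
            self.frames.pop(drop)

    def retrieve(self, query: torch.Tensor, k: int = 8):
        scores = [F.cosine_similarity(query.flatten(), t.mean(dim=0), dim=0).item()
                  for _, t in self.frames]
        top = sorted(range(len(scores)), key=lambda i: -scores[i])[:k]
        return [self.frames[i][1] for i in top]

# Usage with random features standing in for encoder / LFP outputs
memory = SurpriseDrivenMemory(budget=64)
for _ in range(200):
    memory.add(torch.randn(64, 1152), torch.randn(64, 1152))
relevant = memory.retrieve(torch.randn(1152), k=8)
print(len(memory.frames), len(relevant))  # 64 8
```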
Reflection: This felt like cheating. We didn’t teach the model what to remember—we taught it to recognize when something deviates from expectation and trust that deviation as important. The elegance is in the self-supervision: the supervision signal comes from the model’s own anticipation of reality.
Case Study 2: Surprise-Driven Event Segmentation for VSC
Counting across long videos requires knowing when to count, what to count, and how to combine partial counts. Humans chunk space into rooms—count this room, then that room, then sum.
Our approach: Use surprise to detect natural event boundaries (scene changes, room transitions), then count within each segment.
How it works:
- Continuously accumulate frames in a buffer while monitoring surprise
- When surprise spikes (indicating scene change), freeze the buffer and ask: “How many chairs in this segment?”
- Store the segment answer and clear the buffer
- At video end, sum all segment answers
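A minimal sketch of this loop (ours, not the released code): `count_segment` is a hypothetical stand-in for querying the model with “How many chairs in this segment?” on the buffered frames.

```python
import torch
import torch.nn.functional as F

def surprise(predicted: torch.Tensor, actual: torch.Tensor) -> float:
    """Cosine distance between the LFP prediction and the encoded frame."""
    return 1.0 - F.cosine_similarity(predicted.flatten(), actual.flatten(), dim=0).item()

def count_with_event_segmentation(stream, count_segment, threshold: float = 0.3) -> int:
    """Accumulate frames until a surprise spike, count within the segment,
    reset the buffer, and return the sum over all segments."""
    buffer, total = [], 0
    for predicted, actual in stream:
        if buffer and surprise(predicted, actual) > threshold:
            total += count_segment(buffer)  # e.g. ask the MLLM about this segment
            buffer = []
        buffer.append(actual)
    if buffer:
        total += count_segment(buffer)      # flush the final segment
    return total

# Toy usage: random features make every frame look "surprising", so each frame
# becomes its own segment; a dummy counter stands in for the per-segment query.
dummy_stream = [(torch.randn(64, 1152), torch.randn(64, 1152)) for _ in range(50)]
print(count_with_event_segmentation(dummy_stream, count_segment=lambda seg: 2))
```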
Results: On 10-minute videos, our method achieved 40% accuracy—modest but far above the near-zero baselines. More importantly, predictions scaled with ground truth, unlike Gemini’s saturated outputs. On 120-minute streams, accuracy remained stable around 34%.
The “doorway effect”: This mirrors human memory segmentation—passing through a doorway causes forgetting of the previous room’s details. Our surprise threshold acts as an artificial doorway, creating clean episodic boundaries that prevent counting errors from bleeding across scenes.
Reflection: We initially worried that imperfect surprise detection would compound errors. The opposite happened: even approximate segmentation dramatically outperformed no segmentation. The lesson? Structure matters more than perfection. Giving the model a framework for “one thing at a time” unlocks capabilities that monolithic processing suppresses.
Practical Implications: What Can You Build With Spatial Supersensing?
This section answers: What real applications become possible when AI can truly understand space?
The implications extend far beyond academic benchmarks. Spatial supersensing enables:
Persistent Embodied Assistants
A robot that helps elderly people could remember where it last saw medications, track whether doors were left open, and anticipate hazards based on learned patterns. It wouldn’t just react to commands—it would maintain a living model of the home and its occupants’ routines.
Implementation: Deploy Cambrian-S with LFP on a mobile robot. The surprise signal drives attention to anomalies (unusual objects, moved furniture), while the spatial memory supports natural language queries like “Where did I leave my glasses?”
Smart Warehouse Management
Current inventory systems rely on manual scanning or fixed cameras. A spatially-aware system could continuously monitor storage via drone patrols, automatically update inventory counts (VSC-style), and detect misplaced items (VSR-style recall).
Implementation: Use surprise-driven event segmentation to chunk drone video by aisle. Counting runs per aisle prevent double-counting, while memory compression handles hours of patrol footage without storage explosion.
Autonomous Navigation with Memory
Self-driving cars currently re-interpret the world at each frame. A predictive world model would anticipate pedestrian paths, remember temporarily occluded objects, and build maps that persist across trips.
Implementation: LFP predictions anticipate vehicle and pedestrian motion. Surprise triggers when behavior deviates from models (a child running into street), prompting immediate attention and evasive action.
AR/VR Content Creation
Creators could walk through real spaces, and the system would automatically build editable 3D scene graphs—not just photogrammetry meshes, but semantically organized spaces with object relationships.
Implementation: Process egocentric video through Cambrian-S to extract object counts, relative positions, and room connectivity. The resulting scene description feeds directly into game engines or design tools.
Reflection: What This Means for AI Development
This section answers: What did we learn about building intelligent systems, and where do we go from here?
After months of experiments, three truths became clear:
First, specialization builds on generalization. Our best spatial models came from strong general video foundations (Stage 3) followed by spatial specialization (Stage 4). Training directly on spatial data from scratch performed poorly. This mirrors human development—we learn basic visual perception before mastering spatial reasoning.
Second, the right objective function matters as much as the right data. The LFP loss is tiny compared to the language modeling loss (we used a 0.1 coefficient), yet it fundamentally changed the model’s representational structure. The auxiliary task of prediction forced the model to learn physics, continuity, and causality—knowledge that directly transferred to spatial tasks.
Third, benchmarks shape paradigms. Before VSI-SUPER, we could have convinced ourselves that Cambrian-S’s strong VSI-Bench performance meant we had “solved” spatial reasoning. The new benchmark exposed our blind spots. This suggests we need more “unsolvable” benchmarks—tasks that current paradigms can’t crack—to force architectural innovation.
Limitations we acknowledge: Our prototype is just a first step. The LFP head is simple; a more sophisticated world model could yield richer surprise signals. Our evaluations are synthetic; real-world deployment will surface new challenges. And we haven’t fully explored the connection between predictive sensing and action—the next frontier is closing the loop between anticipation and behavior.
But the direction feels right. Watching Cambrian-S’s surprise scores spike precisely at event boundaries, seeing its memory system automatically preserve the unexpected while compressing the mundane—these weren’t engineered behaviors. They emerged from a simple principle: predict the future, attend to the unexpected.
Getting Started: Using Cambrian-S Today
This section answers: How can I experiment with these models and datasets right now?
Model Weights and Specifications
All Cambrian-S models are open-sourced and ready for deployment:
| Model | Base LLM | Vision Encoder | Context Length | Use Case |
|---|---|---|---|---|
| Cambrian-S-7B-LFP | Qwen2.5-7B-Instruct | SigLIP2-SO400M | 16,384 tokens | Full spatial supersensing with predictive sensing |
| Cambrian-S-7B | Qwen2.5-7B-Instruct | SigLIP2-SO400M | 16,384 tokens | Standard spatial MLLM |
| Cambrian-S-3B | Qwen2.5-3B-Instruct | SigLIP2-SO400M | 16,384 tokens | Edge deployment |
| Cambrian-S-1.5B | Qwen2.5-1.5B-Instruct | SigLIP2-SO400M | 16,384 tokens | Mobile applications |
| Cambrian-S-0.5B | Qwen2.5-0.5B-Instruct | SigLIP2-SO400M | 16,384 tokens | Real-time processing |
Access: All models are available at nyu-visionx/Cambrian-S-7B-LFP and its sibling repositories.
Quick Start Example
```python
from transformers import AutoModel, AutoTokenizer
import torch

# Load Cambrian-S-7B-LFP (the custom modeling code is pulled in via trust_remote_code)
model = AutoModel.from_pretrained(
    "nyu-visionx/Cambrian-S-7B-LFP",
    trust_remote_code=True,
).cuda().eval()
tokenizer = AutoTokenizer.from_pretrained("nyu-visionx/Cambrian-S-7B-LFP")

# Prepare video frames (list of PIL images, sampled at 1 FPS for LFP).
# `load_your_video_frames` and `video_path` are placeholders for your own
# frame-extraction helper and video file.
frames = load_your_video_frames(video_path, fps=1)

# For VSR-style recall queries
query = "In what order did the teddy bear appear?"
response = model.generate(
    frames=frames,
    query=query,
    use_predictive_memory=True,   # Enable surprise-driven memory
    surprise_threshold=0.3,
    max_new_tokens=512,
)

# For VSC-style counting
query = "How many chairs total?"
response = model.generate(
    frames=frames,
    query=query,
    use_event_segmentation=True,  # Enable surprise-based counting
    aggregation_method="sum",
)
```
Training Your Own Spatial Reasoning Model
The four-stage training recipe is fully reproducible. Key hyperparameters:
Stage 1 (Alignment):
- Learning rate: 1e-3
- Batch size: 512
- Trainable: Vision-language connector only
- Data: Cambrian-Alignment-2.5M
Stage 2 (Image Tuning):
- Learning rate: 1e-5
- Batch size: 256
- Trainable: Connector + LLM
- Data: Cambrian-7M with AnyRes
- Max sequence: 8,192 tokens
Stage 3 (General Video):
- Frames: 64 uniform samples per video
- Tokens per frame: 64 (8×8 feature map)
- Data: CambrianS-3M (3M video QA pairs)
- Max sequence: 12,288 tokens
Stage 4 (Spatial Tuning):
- Frames: 128 uniform samples
- Data: VSI-590K + 700K general video/image samples
- LFP loss coefficient: 0.1 (critical for predictive sensing)
- Max sequence: 16,384 tokens
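For reference, the recipe above can be collected into a single configuration sketch. The field names are ours and the values are the ones listed above; this is not a launchable training configuration.

```python
TRAINING_RECIPE = {
    "stage1_alignment": {
        "trainable": ["connector"],
        "learning_rate": 1e-3,
        "batch_size": 512,
        "data": "Cambrian-Alignment-2.5M",
    },
    "stage2_image_tuning": {
        "trainable": ["connector", "llm"],
        "learning_rate": 1e-5,
        "batch_size": 256,
        "data": "Cambrian-7M (AnyRes)",
        "max_seq_len": 8192,
    },
    "stage3_general_video": {
        "frames_per_video": 64,
        "tokens_per_frame": 64,
        "data": "CambrianS-3M",
        "max_seq_len": 12288,
    },
    "stage4_spatial_tuning": {
        "frames_per_video": 128,
        "data": "VSI-590K + 700K general video/image samples",
        "lfp_loss_coeff": 0.1,
        "max_seq_len": 16384,
    },
}
```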
Infrastructure: We trained on TPU v4 pods using GSPMD for automatic sharding and FlashAttention via Pallas to handle long sequences. The 7B model at Stage 4 fits on a TPU v4-512 pod.
VSI-590K Dataset Access
The full dataset is available at nyu-visionx/VSI-590K. It includes:
- 590,667 QA pairs
- 44,858 images
- 5,963 videos
- 12 structured question types across configuration, measurement, and spatiotemporal groups
Format: Each sample includes video/image paths, question, answer, question type, and 3D metadata when available.
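For orientation, a single record might look roughly like the following. The field names and file path are our guesses based on the description above, not the dataset's actual schema; the question/answer pair reuses the ScanNet++ example from earlier.

```python
example_sample = {
    "video": "videos/scannetpp/scene_0421.mp4",  # hypothetical path
    "question": (
        "Standing by the backpack, looking toward the table, how far "
        "counterclockwise in degrees must I turn to see the trash bin?"
    ),
    "answer": "334.09",
    "question_type": "relative_direction",
    "metadata_3d": {
        "source": "ScanNet++",
        "room_dimensions_m": [4.2, 3.6, 2.7],
        "object_counts": {"chair": 4, "table": 1, "trash bin": 1},
    },
}
```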
One-Page Summary: Cambrian-S at a Glance
Problem: Current video AI lacks genuine spatial reasoning—it can’t maintain persistent 3D world models, anticipate events, or efficiently process long streams.
Solution: Cambrian-S combines:
- VSI-590K dataset: 590K spatial QA pairs from 3D-annotated real/simulated videos
- Four-stage training: Image alignment → image tuning → general video → spatial specialization
- Predictive sensing: Latent Frame Prediction head uses surprise signals for memory management and event segmentation
Key Results:
- +30% absolute gain on VSI-Bench vs previous SOTA
- Competitive performance on general video benchmarks (Perception Test, EgoSchema)
- VSI-SUPER: First benchmark to test long-horizon spatial recall and continual counting
- Surprise-driven memory maintains performance across 240-minute videos with constant GPU memory
Model Family:
- 0.5B to 7B parameter versions
- All models use SigLIP2-SO400M vision encoder + Qwen2.5 LLM
- Context length: 16,384 tokens
- LFP variant enables predictive sensing
Use Cases: Embodied assistants, persistent navigation, smart inventory, AR/VR scene understanding
Limitations: Synthetic evaluations, simple LFP architecture, no action loop yet
Future: Richer world models, embodied action integration, real-world deployment testing
Get Started: pip install cambrian-s and load from Hugging Face: nyu-visionx/Cambrian-S-7B-LFP
Frequently Asked Questions
Q: How is Cambrian-S different from other video LLMs like LLaVA-Video or Qwen-VL?
A: While those models excel at general video QA, Cambrian-S is specifically engineered for spatial reasoning. The VSI-590K dataset provides 3D-grounded supervision, and the four-stage pipeline explicitly builds spatial cognition on top of general video understanding. Most importantly, only Cambrian-S includes the predictive sensing paradigm—using self-supervised next-frame prediction for memory and segmentation.
Q: Can I use Cambrian-S for regular video tasks like captioning and QA?
A: Yes. Stage 3 training ensures competitive performance on standard benchmarks. You can use Cambrian-S-7B (without LFP) as a drop-in replacement for general video understanding tasks. The LFP variant adds overhead but improves long-form spatial reasoning.
Q: What’s the computational cost of predictive sensing?
A: The LFP head is a tiny two-layer MLP (~0.1% of total parameters). During inference, computing surprise adds ~5-10% latency per frame but enables massive memory savings. For videos over 30 minutes, the memory compression outweighs the compute cost.
Q: Why didn’t you just use longer context windows?
A: We tried. Even with 1M-token contexts, performance on VSI-SUPER plateaued and then degraded. More fundamentally, infinite context doesn’t solve the underlying problem: models need to organize experience, not just store it. Humans handle effectively unbounded sensory streams with fixed neural resources—efficiency comes from prediction, not storage.
Q: How realistic are the VSI-SUPER tasks?
A: While synthetic, they abstract real challenges. VSR mirrors security footage review—finding anomalies in hours of video. VSC reflects inventory management—counting across warehouse zones. The tasks are designed to be simple for humans but hard for current AI, exposing paradigm gaps.
Q: Can I contribute data to VSI-590K?
A: The pipeline is open. You can add 3D-annotated scans (best quality), generate ProcTHOR/Hypersim videos (good diversity), or process web videos through our pseudo-annotation pipeline (easiest). We especially need outdoor and non-residential spaces to expand domain coverage.
Q: What’s the minimal hardware to run Cambrian-S-7B-LFP?
A: Inference requires ~24GB VRAM for 128-frame video processing. For real-time applications, Cambrian-S-3B runs on 12GB GPUs with minimal accuracy loss on short videos. Training requires TPUs or multi-GPU setups; Stage 4 for 7B needs ~512 TPU cores.
Q: How do I know if predictive sensing is working for my use case?
A: Monitor the surprise scores during inference. High variance in surprise indicates the model is detecting meaningful events. If surprise is flat, your video may be too homogeneous or the LFP head underfit. You can visualize surprise spikes overlaid on video to see what the model flags as important.
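A quick way to do that visualization, assuming you have already extracted per-frame surprise scores (the trace and threshold below are made up for illustration):

```python
import matplotlib.pyplot as plt

# Overlay per-frame surprise on a timeline to see what the model flags.
surprise_scores = [0.12, 0.10, 0.11, 0.78, 0.15, 0.13, 0.65, 0.14]  # illustrative
threshold = 0.3

plt.figure(figsize=(8, 2.5))
plt.plot(surprise_scores, marker="o", label="surprise (cosine distance)")
plt.axhline(threshold, linestyle="--", label="event threshold")
plt.xlabel("frame index (1 FPS)")
plt.ylabel("surprise")
plt.legend()
plt.tight_layout()
plt.savefig("surprise_trace.png")
```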
Q: What’s next for Cambrian-S research?
A: Three directions: (1) Richer world models using diffusion or autoregressive video generation; (2) Closing the action loop—connecting predictions to robot control; (3) Real-world validation in embodied agents and AR devices. We’re also exploring multi-modal surprise—synchronizing visual prediction with audio and language.
This article is based entirely on the research findings published in “Cambrian-S: Towards Spatial Supersensing in Video” and accompanying documentation. All model names, performance figures, and technical specifications are reproduced as reported by the authors.
