SpatialTree: How Spatial Abilities Hierarchically Develop in Multimodal LLMs

Have you ever wondered how AI perceives the size of objects, judges distances, or predicts movement when looking at an image? In cognitive science, human spatial ability develops progressively—from basic perception to complex reasoning and real-world interaction. Yet for multimodal large language models (MLLMs), this hierarchical structure has long been poorly understood, with most research focusing on isolated tasks rather than the bigger picture.

Today, we’ll explore SpatialTree—a cognitive science-inspired framework that organizes AI’s spatial abilities into four distinct layers. It also introduces the first capability-centric hierarchical benchmark, allowing us to systematically answer critical questions: How are AI’s spatial abilities structured? How do different layers interact? And what’s the best way to enhance these abilities?

Why Do We Need SpatialTree?

Before diving into SpatialTree itself, let’s address a fundamental question: Why bother organizing AI’s spatial abilities into a hierarchy at all?

Historically, research on AI spatial skills has been task-centric. Researchers might train a model to “judge which object is larger” or “teach a robot to grasp a cup,” but these tasks are fragmented—like scattered puzzle pieces with no clear picture of how they connect. For example, a model that excels at “distance estimation” might struggle with “path planning,” but we had no way to quantify or understand this relationship.

Cognitive science offers a key insight: Human intelligence is a “dynamic structure built through successive stages.” A child first learns to see (perception), then to describe what they see (language mapping), next to imagine outcomes (mental simulation), and finally to act in the physical world (interaction).

SpatialTree applies this logic to AI, organizing spatial abilities into a “tree” that grows from foundational perception to advanced interaction. This structure lets us:

  • Systematically measure AI performance across different skill levels
  • Uncover dependencies between layers
  • Identify targeted training strategies to strengthen the entire hierarchy
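As a rough illustration (not the authors' implementation), the four-layer hierarchy can be held in a simple capability tree, which makes per-layer bookkeeping straightforward. The dictionary below uses ability names from this article; the data structure itself is purely illustrative:

```python
# A minimal sketch of SpatialTree's four-layer hierarchy as a nested mapping.
# Ability names follow the article; the structure is illustrative only.
SPATIAL_TREE = {
    "L1-Perception": ["geometry", "motion", "orientation", "relation", "localization"],
    "L2-MentalMapping": ["spatial_understanding", "spatial_memory"],
    "L3-MentalSimulation": ["causal_reasoning", "sequential_planning"],
    "L4-AgenticCompetence": ["sequential_decision_making"],
}

def layer_of(ability: str) -> str:
    """Return the layer that owns a given ability."""
    for layer, abilities in SPATIAL_TREE.items():
        if ability in abilities:
            return layer
    raise KeyError(ability)

print(layer_of("causal_reasoning"))  # L3-MentalSimulation
```

With scores attached per ability, the same mapping supports the layer-level aggregation and dependency analysis described below.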

SpatialTree’s Four-Layer Spatial Ability Framework

SpatialTree structures AI’s spatial capabilities into four hierarchical layers—like the roots, trunk, branches, and leaves of a tree—each building on the previous one. Let’s break them down:

L1: Perception Layer — AI’s “Eyes”

The L1 layer corresponds to innate, language-free perception—similar to how humans intuitively process spatial information without conscious thought. When we see an apple, we instantly recognize its shape, color, and proximity; L1 enables AI to do the same.

It encompasses five core abilities:

  1. Geometry Perception: Understanding physical form and metric properties

    • Distance: Judging how far objects are from each other or the observer
    • Size: Estimating dimensions, area, or volume (e.g., “Will this box fit in a backpack?”)
    • Shape: Identifying contours and basic geometric forms (circles, squares, triangles)
  2. Motion Perception: Processing dynamic visual signals over time

    • Egocentric motion: Detecting the AI’s own movement direction (e.g., a game character moving forward)
    • Allocentric motion: Perceiving the movement and speed of external objects (e.g., “The car is moving left”)
  3. Orientation Perception: Judging “up/down” and object pose

    • Gravity alignment: Recognizing vertical/horizontal axes (e.g., “Is the cup tilted?”)
    • Object pose: Perceiving how an object is positioned (e.g., “Is the book lying flat or standing upright?”)
  4. Relation Perception: Understanding spatial structure between objects

    • Topological relations: Basic configurations like “inside,” “outside,” or “overlapping”
    • Correspondence: Recognizing the same object across different viewpoints (e.g., “Is this a side profile of the same person in the front-facing photo?”)
  5. Localization Perception: Pinpointing objects in 2D/3D space

    • Detection: Identifying an object’s presence and spatial extent (e.g., “Where is the chair in the room?”)
    • Grounding: Linking visual observations to specific coordinates (e.g., “Which building does the red dot on the map represent?”)

L2: Mental Mapping Layer — AI’s “Language Translator”

If L1 is “seeing,” L2 is “describing”—translating perceptual spatial information into language and forming language-structured spatial memories. Just as we might say “The table is to the left of the bed” and remember that layout, L2 enables AI to bridge vision and language.

It includes two core abilities:

  1. Spatial Understanding: Converting perception to semantics

    • Spatial captioning: Describing scenes in language (e.g., “The living room has a blue sofa with a coffee table in front of it”)
    • Relational semantics: Distinguishing meaningful spatial relationships (e.g., “sitting on” vs. “standing next to”)
    • Motion semantics: Interpreting the purpose of movement (e.g., “He is picking up a cup” rather than “His hand is moving”)
    • Perspective taking: Adopting another’s viewpoint (e.g., “From her position, is the door on the left or right?”)
    • Affordance understanding: Recognizing an object’s functional possibilities (e.g., “This handle can be grasped” or “This chair can be sat on”)
  2. Spatial Memory: Retaining and retrieving spatial information

    • Cognitive mapping: Synthesizing fragmented observations (e.g., video frames) into a unified global representation (e.g., a complete room layout)
    • Memory retrieval: Recalling object positions or action timelines (e.g., “Where was the key seen earlier?”)
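A toy sketch of cognitive mapping and memory retrieval: merging per-frame object sightings into one global layout, then looking an object up later. The real system builds bird's-eye-view maps via 3D reconstruction; this dict-based version only illustrates the "fragments into a unified representation" idea:

```python
from collections import defaultdict

def build_cognitive_map(frames):
    """Merge per-frame (object -> position) sightings into a global map
    by averaging each object's observed positions. Illustrative only."""
    sightings = defaultdict(list)
    for frame in frames:
        for obj, pos in frame.items():
            sightings[obj].append(pos)
    return {
        obj: tuple(sum(axis) / len(positions) for axis in zip(*positions))
        for obj, positions in sightings.items()
    }

frames = [
    {"sofa": (1.0, 2.0)},                      # frame 1 sees the sofa
    {"sofa": (1.2, 2.2), "key": (3.0, 0.5)},   # frame 2 sees sofa and key
]
cog_map = build_cognitive_map(frames)
print(cog_map["key"])  # memory retrieval: "Where was the key seen earlier?"
```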

L3: Mental Simulation Layer — AI’s “Mental Rehearsal”

L3 is “thinking”—mentally simulating spatial changes, reasoning through outcomes, and planning actions. Just as we might imagine “What happens if I place the cup on the table?” before acting, L3 lets AI simulate scenarios without physical interaction.

It comprises two core abilities:

  1. Causal Reasoning: Modeling spatial cause and effect

    • Geometric reasoning: E.g., “Can these two blocks fit together?”
    • Motion prediction: E.g., “Where will the ball roll if pushed?”
    • Relational reasoning: E.g., “If A is left of B and B is left of C, is A left of C?”
  2. Sequential Planning: Designing goal-directed action steps

    • Step-by-step strategy: E.g., “To reach the top shelf, first grab a stool, then stand on it”
    • Path planning: E.g., “Navigate from the bedroom to the living room while avoiding toys on the floor”
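The path-planning skill above can be illustrated with a textbook breadth-first search on a toy grid, where blocked cells stand in for the "toys on the floor". This is a generic sketch of the task, not the paper's method:

```python
from collections import deque

def plan_path(grid, start, goal):
    """BFS shortest path on a 4-connected grid; cells marked 1 are obstacles."""
    rows, cols = len(grid), len(grid[0])
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        r, c = path[-1]
        if (r, c) == goal:
            return path
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols \
                    and grid[nr][nc] == 0 and (nr, nc) not in seen:
                seen.add((nr, nc))
                queue.append(path + [(nr, nc)])
    return None  # no route around the obstacles

room = [
    [0, 0, 0],
    [1, 1, 0],  # a row of "toys" blocking the direct route
    [0, 0, 0],
]
route = plan_path(room, start=(0, 0), goal=(2, 0))
print(route)  # detours around the blocked middle row
```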

L4: Agentic Competence Layer — AI’s “Hands-On Ability”

L4 is “acting”—translating internal plans into tangible interactions with dynamic environments. This layer bridges cognitive planning and real-world execution, such as a robot grasping an object or a game character navigating a maze.

At its core is sequential decision-making, where AI integrates:

  • Current multimodal observations (e.g., video frames)
  • Historical context (e.g., past actions and memories)
  • Internal state (e.g., goals and ongoing plans)

to generate executable actions. Key application scenarios include:

  • Game character navigation (e.g., finding an exit in a 3D environment)
  • Robotic manipulation (e.g., a robotic arm picking up and moving objects)
  • Human hand interaction (e.g., simulating how to twist a bottle cap)

Figure 1: The hierarchical structure of SpatialTree. Rooted in foundational multimodal capabilities (L0), the tree progresses from basic perception (L1) to agentic competence (L4).

(Figure 1 illustrates SpatialTree’s overall structure, showing how each layer builds on the previous one to form a complete spatial ability system.)
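The L4 decision cycle described above (observation + history + internal state → action) can be sketched as an agent loop. The `choose_action` policy below is a hypothetical stand-in for the MLLM, and the 1-D environment is invented for illustration:

```python
def choose_action(observation, history, state):
    """Hypothetical policy: in the real setting this is the MLLM itself,
    conditioning on observation, action history, and internal goals.
    Here we simply step toward a goal coordinate."""
    x, gx = observation["x"], state["goal_x"]
    if x < gx:
        return "move_right"
    if x > gx:
        return "move_left"
    return "stop"

def run_agent(env_step, state, max_steps=10):
    """Sequential decision-making loop: observe, decide, act, repeat."""
    history = []
    observation = env_step(None)  # initial observation, no action yet
    for _ in range(max_steps):
        action = choose_action(observation, history, state)
        if action == "stop":
            break
        history.append(action)
        observation = env_step(action)
    return history

# Toy 1-D environment: the agent starts at x=0 and must reach x=3.
pos = {"x": 0}
def env_step(action):
    if action == "move_right":
        pos["x"] += 1
    elif action == "move_left":
        pos["x"] -= 1
    return {"x": pos["x"]}

actions = run_agent(env_step, state={"goal_x": 3})
print(actions)  # three "move_right" steps, then the agent stops
```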

How to Evaluate These Abilities? — The SpatialTree-Bench Benchmark

To measure AI performance across SpatialTree’s layers, researchers developed SpatialTree-Bench—the first capability-centric hierarchical benchmark. Here’s how it works:

1. Data Sources: Integration + Supplementation

  • Existing dataset integration: Reorganizing fragmented tasks from prior research (e.g., single-image spatial understanding, 3D point cloud processing, video reasoning) into SpatialTree’s hierarchical framework.
  • Missing data supplementation: Using a “Spatial Engine” (combining specialized models for depth estimation, object tracking, and orientation detection) to generate new data for underrepresented abilities—such as L1 orientation estimation and L4 agentic tasks.
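As a hedged sketch of the data-generation idea: take the output of a specialist model (here, mocked 3D positions as a depth-estimation model might provide) and turn it into a QA pair. The object names and QA template are illustrative, not the actual Spatial Engine:

```python
import math

def distance_qa(objects):
    """Turn mocked 3D positions into a distance-estimation QA pair.
    Purely illustrative of the 'specialist model -> QA data' idea."""
    (name_a, pa), (name_b, pb) = objects
    dist = math.dist(pa, pb)  # Euclidean distance between the two points
    question = f"What is the distance between the {name_a} and the {name_b}?"
    answer = f"{dist:.1f} m"
    return question, answer

q, a = distance_qa([("red ball", (0.0, 0.0, 1.0)), ("blue ball", (3.0, 4.0, 1.0))])
print(a)  # 5.0 m
```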

2. Data Processing: Layer-Specific Design

Different layers require tailored data handling:

  • L1 (Perception): Extracting perceptual features (e.g., depth, motion) with specialized models, then generating QA pairs (e.g., “What is the distance between the two balls in the image?”).
  • L2 (Mental Mapping): Converting videos into bird’s-eye-view (BEV) maps via 3D reconstruction, then creating description or memory questions (e.g., “Based on the video, where is the sofa located in the room?”).
  • L3 (Mental Simulation): Enhancing existing reasoning tasks with chain-of-thought (CoT) prompts, encouraging AI to articulate its reasoning process (e.g., “To solve this problem, I first need to analyze the positions of A and B…”).
  • L4 (Agentic Competence): Collecting videos of games, robotics, and human interactions, then converting actions into AI-interpretable commands (e.g., breaking “twist a cap” into “grasp the cap” + “rotate clockwise”) and generating multiple-choice tasks (e.g., “What is the next action to complete the task?”).
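The L4 conversion step can be sketched as decomposing an action into sub-commands and then asking for the next one as a multiple-choice item. The decomposition table and distractor options below are invented for illustration, not the benchmark's real annotations:

```python
# Illustrative action decompositions (not the benchmark's real data).
DECOMPOSITIONS = {
    "twist a cap": ["grasp the cap", "rotate clockwise"],
    "reach the top shelf": ["grab a stool", "stand on it", "reach up"],
}

def next_action_question(task, done_so_far, distractors):
    """Build a multiple-choice 'what is the next action?' item."""
    steps = DECOMPOSITIONS[task]
    correct = steps[len(done_so_far)]
    options = sorted([correct] + distractors)  # fixed option order
    return {
        "question": f"Task: {task}. Completed: {done_so_far}. What is the next action?",
        "options": options,
        "answer": options.index(correct),
    }

item = next_action_question("twist a cap", ["grasp the cap"], ["pull upward"])
print(item["options"][item["answer"]])  # rotate clockwise
```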

3. Evaluation Metrics: Task-Tailored Scoring

Metrics are customized to match each task’s nature:

  • Multiple-choice questions (70.7% of tasks): Measuring accuracy by comparing AI selections to ground truth.
  • Numeric estimation (e.g., distance, angle): Using error metrics like Mean Squared Error (MSE) to quantify precision.
  • Complex reasoning/agentic tasks: Employing “LLM-as-a-Judge” (using a separate LLM to evaluate answer quality) or task success rates (e.g., “Did the robot successfully grasp the object?”).

Figure 2: Examples of tasks across layers. (a) L1 Relation Perception (judging inside/outside); (b) L2 Relation Understanding (describing object relationships); (c) L3 Causal Reasoning (solving complex relational problems).

(Figure 2 shows how the same “relational” concept varies across layers: L1 focuses on basic judgments, L2 on linguistic description, and L3 on logical reasoning.)
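The task-tailored scoring above can be sketched for the two mechanically checkable cases, accuracy and MSE; LLM-as-a-Judge scoring would call a separate model and is omitted here:

```python
def accuracy(predictions, ground_truth):
    """Multiple-choice scoring: fraction of exact matches."""
    assert len(predictions) == len(ground_truth)
    return sum(p == g for p, g in zip(predictions, ground_truth)) / len(ground_truth)

def mse(predictions, ground_truth):
    """Numeric estimation (distance, angle): mean squared error."""
    assert len(predictions) == len(ground_truth)
    return sum((p - g) ** 2 for p, g in zip(predictions, ground_truth)) / len(ground_truth)

print(accuracy(["A", "C", "B"], ["A", "B", "B"]))  # 2 of 3 choices correct
print(mse([1.0, 2.0], [1.0, 4.0]))                 # squared errors 0 and 4, averaged
```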

How Do Leading AI Models Perform? — Key Findings

Researchers tested mainstream MLLMs on SpatialTree-Bench, including closed-source models (e.g., GPT-4o, Gemini 2.5) and open-source models (e.g., Qwen2.5-VL, Kimi-VL). The results revealed critical patterns:

1. Ability Structure: Low-Layer Independence, High-Layer Dependence

  • L1 (Perception) abilities are largely independent: A model might excel at distance estimation but struggle with shape recognition, as these skills rely on distinct visual processing mechanisms.
  • High-layer abilities (L2–L4) are strongly correlated: Strong performance in L2 (Mental Mapping) typically predicts strong performance in L3 (Mental Simulation) and L4 (Agentic Competence). This confirms that higher-level skills build on lower-level foundations—like a house needing a solid base to support upper floors.
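This kind of dependency claim can be checked by correlating per-layer scores across models. A minimal Pearson-correlation sketch, using made-up scores (not the paper's numbers): if L2 and L3 rise and fall together across models, the correlation approaches 1.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two per-model score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Invented L2 and L3 scores for four hypothetical models.
l2_scores = [55, 48, 40, 35]
l3_scores = [52, 47, 38, 30]
print(round(pearson(l2_scores, l3_scores), 2))  # close to 1: strongly correlated
```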

2. Model Performance by Category

Models were grouped into three categories, each with distinct strengths:

| Model Category | Representative Models | Key Traits |
| --- | --- | --- |
| Reasoning-Augmented | Gemini 2.5 Pro, GLM-4.5V | Strong at L3–L4 (reasoning and agentic tasks) but average at L1 (perception) |
| Non-Reasoning | GPT-4o, Gemini 2.5-Flash-Nonthinking | More consistent at L1 (perception) but weaker at complex reasoning |
| Open-Source | Qwen2.5-VL, Kimi-VL | Overall performance lags closed-source models but competitive in specific L1 tasks |

Notably, Gemini 2.5 Pro achieved the highest overall score (50.1), while Qwen3VL-235B led among open-source models (40.0).

How to Enhance AI’s Spatial Abilities? — Training Insights

The research team conducted experiments with Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) to identify effective training strategies:

1. Supervised Fine-Tuning (SFT): Low Layers Are the Foundation

  • Intra-layer training has tradeoffs: Training multiple L1 abilities simultaneously can cause negative transfer (e.g., improving distance estimation might degrade shape recognition).
  • Cross-layer training yields synergy: Prioritizing L1 (perception) training before L2–L4 leads to significant gains in higher layers. This highlights that strong foundational perception is critical for building advanced spatial skills.
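The cross-layer finding suggests a curriculum that trains perception before higher layers. A schematic two-stage schedule is sketched below; the stage names and epoch counts are invented placeholders, not the paper's actual configuration:

```python
# Hypothetical two-stage curriculum reflecting "L1 before L2-L4".
# Stage names and epoch counts are illustrative placeholders.
CURRICULUM = [
    {"stage": "foundation", "layers": ["L1"], "epochs": 2},
    {"stage": "higher-order", "layers": ["L2", "L3", "L4"], "epochs": 1},
]

def training_order(curriculum):
    """Flatten the curriculum into the sequence of layer passes."""
    order = []
    for stage in curriculum:
        order.extend(stage["layers"] * stage["epochs"])
    return order

print(training_order(CURRICULUM))  # every L1 pass comes before any L2-L4 pass
```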

2. Reinforcement Learning (RL): Balancing “Thinking” and “Perceiving”

  • Blindly encouraging “more thinking” is unreliable: While extended reasoning boosts L3–L4 performance, it harms L1 perception (e.g., overthinking distance estimation reduces precision).
  • Solution: Auto-Think Strategy: Suppress unnecessary deliberation for simple perceptual tasks (e.g., “trust intuition” for size judgments) and encourage deep reasoning for complex tasks (e.g., path planning). This balanced approach enables RL to improve performance across all layers consistently.
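The auto-think idea can be sketched as a router that skips deliberation for simple perceptual queries and expands reasoning for planning ones. The layer-based rule and the two answer functions below are a simplification invented for illustration:

```python
def auto_think(task_layer, answer_directly, answer_with_reasoning):
    """Route: trust fast intuition for L1 perception, deliberate for higher
    layers. The layer-based rule is a simplification of auto-think."""
    if task_layer == "L1":
        return answer_directly()       # suppress unnecessary deliberation
    return answer_with_reasoning()     # encourage an explicit chain of thought

def fast():
    return "about 2 m"

def slow():
    return "Step 1: grab a stool. Step 2: stand on it. Step 3: reach up."

print(auto_think("L1", fast, slow))  # direct perceptual answer
print(auto_think("L3", fast, slow))  # deliberate multi-step plan
```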

Frequently Asked Questions (FAQ)

1. How is SpatialTree different from previous spatial ability research?

Previous research was task-centric (e.g., focusing on “distance estimation” or “path planning in isolation”). SpatialTree is capability-centric—it organizes tasks into a hierarchical framework, revealing how abilities interact and depend on each other. Think of it as shifting from examining individual leaves to analyzing the entire tree’s structure.

2. Why are L1 abilities independent while higher layers are correlated?

L1 relies on specialized visual processing (e.g., separate neural pathways for shape, motion, and distance), so skills operate independently. Higher layers (L2–L4) all depend on language and logical reasoning systems—so improvements in one area often benefit others.

3. How can non-researchers use SpatialTree?

  • Evaluate AI models: Test an AI’s spatial skills by progressing through layers—start with simple L1 tasks (e.g., “Which object is larger?”) and move to complex L4 tasks (e.g., “Plan a route to the target”).
  • Optimize training: For developers, prioritize L1 foundational training before advancing to higher layers to maximize overall performance.

4. What’s the future of AI spatial abilities?

With frameworks like SpatialTree, we’ll see more “well-rounded” AI—excelling at both precise perception (e.g., millimeter-level distance estimation) and complex reasoning (e.g., cross-room navigation planning). Strategies like the auto-think approach will make AI more human-like: acting quickly on intuition for simple tasks and deliberating deeply for complex ones.

Conclusion

SpatialTree provides the first systematic framework for understanding AI’s spatial abilities—revealing that they are hierarchical, interdependent, and trainable with targeted strategies. For researchers, it offers a roadmap for future studies; for developers, it identifies clear paths to build more capable models; and for anyone curious about AI, it demystifies how machines “see,” “think,” and “act” in spatial environments.

As AI continues to evolve, SpatialTree will play a pivotal role in advancing spatial intelligence—bringing us closer to machines that interact with the physical world as naturally as humans do. Whether it’s robots navigating homes, AI assistants helping with interior design, or virtual characters moving realistically in games, the insights from SpatialTree will shape the next generation of multimodal AI.