In the field of artificial intelligence, large multimodal reasoning models (LMRMs) have garnered significant attention. These models integrate diverse modalities such as text, images, audio, and video, aiming to achieve comprehensive perception, precise understanding, and deep reasoning in support of complex tasks. This article delves into the evolution of large multimodal reasoning models, their key development stages, datasets and benchmarks, challenges, and future directions.
Evolution of Large Multimodal Reasoning Models
Stage 1: Perception-Driven Reasoning
In the early stages, multimodal reasoning primarily relied on task-specific modules, with reasoning implicitly embedded in stages of representation, alignment, and fusion. For instance, in 2016, the Neural Module Network (NMN) dynamically assembled task-specific modules for visual-textual reasoning. HieCoAtt aligned question semantics with image regions through hierarchical cross-modal attention. These models were predominantly trained using supervised learning.
Subsequently, modular reasoning built on vision-language models emerged. In 2019, ViLBERT aligned visual and textual features via dual-stream Transformers with cross-modal attention. LXMERT enhanced cross-modal reasoning through dual-stream pretraining across diverse tasks. These models were typically trained through pretraining followed by fine-tuning.
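To make the cross-modal attention underlying these dual-stream models concrete, here is a minimal sketch of one direction of a co-attention block (text tokens attending to image regions). The module and tensor names are illustrative, not the actual ViLBERT or LXMERT implementations:

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Text tokens attend to image-region features (one direction of a co-attention block)."""
    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_feats, image_feats):
        # text_feats:  (batch, num_tokens, dim)
        # image_feats: (batch, num_regions, dim)
        attended, _ = self.attn(query=text_feats, key=image_feats, value=image_feats)
        return self.norm(text_feats + attended)  # residual connection + layer norm

# Toy usage with random features standing in for a text encoder and a region detector.
text = torch.randn(2, 16, 768)
regions = torch.randn(2, 36, 768)
fused = CrossModalAttention()(text, regions)
print(fused.shape)  # torch.Size([2, 16, 768])
```

A full dual-stream model runs this attention in both directions and stacks several such blocks, but the single block above captures the core alignment mechanism.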
Stage 2: Language-Centric Short Reasoning
With the advent of large-scale multimodal pretraining, models began to demonstrate preliminary reasoning capabilities. However, such inferences were often superficial, relying mainly on implicit correlations rather than explicit logical processes. To address this limitation, multimodal chain-of-thought (MCoT) emerged as an effective approach. By incorporating intermediate reasoning steps, MCoT improved cross-modal alignment, knowledge integration, and contextual grounding without requiring extensive supervision or significant architectural changes. In this stage, approaches were categorized into three paradigms: prompt-based MCoT, structural reasoning with predefined patterns, and tool-augmented reasoning leveraging lightweight external modules.
For example, the Cantor model decoupled perception and reasoning through feature extraction and CoT-style integration. TextCoT first summarized the visual context and then generated CoT-based responses. Grounding-Prompter performed global parsing, denoising, and partitioning before reasoning. Audio-CoT explored audio reasoning through three chain-of-thought paradigms.
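To illustrate the prompt-based MCoT paradigm, the following sketch shows a two-stage pipeline in the spirit of TextCoT: first summarize the visual context, then answer with explicit step-by-step reasoning. The `query_vlm` helper and the prompt wording are hypothetical placeholders, not any specific model's API:

```python
def query_vlm(image_path: str, prompt: str) -> str:
    """Hypothetical wrapper around a vision-language model; replace with a real client."""
    raise NotImplementedError("plug in an actual VLM call here")

def mcot_answer(image_path: str, question: str) -> str:
    # Stage 1: summarize the visual context relevant to the question.
    summary = query_vlm(
        image_path,
        f"Briefly describe the parts of the image relevant to: {question}",
    )
    # Stage 2: answer with intermediate reasoning grounded in the summary.
    return query_vlm(
        image_path,
        f"Context: {summary}\nQuestion: {question}\n"
        "Let's reason step by step before giving the final answer.",
    )
```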
Stage 3: Language-Centric Long Reasoning
To handle more complex multimodal tasks, research began to focus on developing System 2-style reasoning. Unlike fast, reactive strategies, this form of reasoning is deliberate, compositional, and guided by explicit planning. By extending reasoning chains, grounding them in multimodal inputs, and training with supervised or reinforcement signals, models began to exhibit long-horizon reasoning and adaptive problem decomposition.
Cross-modal reasoning became a key focus in this stage. For instance, IdealGPT used GPT to iteratively decompose and solve visual reasoning tasks. AssistGPT planned, executed, and inspected tasks using external tools such as GPT-4, OCR, and grounding modules. Additionally, MM-O1 and MM-R1 models emerged: MM-O1 models, such as Marco-o1, employed MCTS-guided thinking to expand candidate solutions and select reasoning actions, while MM-R1 models, such as RLHF-V, used preference-optimization algorithms like DPO to improve VQA performance.
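For reference, here is a minimal sketch of the DPO objective used in this line of preference-based training. It assumes the summed log-probabilities of each preferred and rejected response have already been computed under the policy and a frozen reference model; this is a generic formulation, not RLHF-V's exact training code:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta: float = 0.1):
    """Direct Preference Optimization loss over a batch of (chosen, rejected) response pairs.

    Each argument is a 1-D tensor of summed log-probabilities of a response
    under either the trainable policy or the frozen reference model.
    """
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between preferred and dispreferred responses.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```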
Towards Native Multimodal Reasoning Models
Despite the potential of large multimodal reasoning models in long-chain reasoning, their language-centric architectures limit their effectiveness in real-world scenarios. Their reliance on vision and language modalities restricts their ability to process and reason with interleaved diverse data types, and their performance in real-time, iterative interactions with dynamic environments remains underdeveloped.
To address these limitations, native large multimodal reasoning models (N-LMRMs) have been proposed, aiming at broader multimodal integration and more advanced interactive reasoning. N-LMRMs can be grouped into agentic models and omni-modal models. Agentic models, such as R1-Searcher, Search-o1, and DeepResearcher, enhance language models' search capabilities through reinforcement learning (RL) for multi-hop QA and mathematical tasks, while Magma, pretrained on roughly 820K spatially and verbally labeled samples, handles multimodal understanding and spatial reasoning tasks. Omni-modal models, such as Gemini 2.0 & 2.5 and GPT-4o, integrate modalities such as text, images, and audio to achieve unified multimodal understanding and generation.
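As a rough illustration of the agentic pattern (reason, decide whether to call a search tool, incorporate the results, then answer), here is a hedged sketch of a search-augmented answering loop. The `llm_generate` and `web_search` functions are hypothetical stand-ins, and the loop is not the training or inference procedure of any specific model named above:

```python
def llm_generate(prompt: str) -> str:
    """Hypothetical language-model call; replace with a real client."""
    raise NotImplementedError

def web_search(query: str) -> str:
    """Hypothetical retrieval tool returning a text snippet."""
    raise NotImplementedError

def agentic_answer(question: str, max_steps: int = 4) -> str:
    context = ""
    for _ in range(max_steps):
        step = llm_generate(
            f"Question: {question}\nRetrieved so far:\n{context}\n"
            "Reply with 'SEARCH: <query>' if more evidence is needed, "
            "or 'ANSWER: <final answer>' otherwise."
        )
        if step.startswith("SEARCH:"):
            # Accumulate retrieved evidence and keep reasoning.
            context += web_search(step[len("SEARCH:"):].strip()) + "\n"
        else:
            return step.removeprefix("ANSWER:").strip()
    # Fall back to answering with whatever evidence was gathered.
    return llm_generate(f"Question: {question}\nContext:\n{context}\nGive the best final answer.")
```

RL-based approaches such as R1-Searcher optimize when and what to search within loops of this kind, rather than relying on hand-written prompting rules.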
Datasets and Benchmarks
The development of multimodal reasoning models relies on rich datasets and benchmarks for training and evaluation.
Multimodal Understanding
Multimodal understanding encompasses visual-centric and audio-centric understanding. Benchmarks like VQA, GQA, and DocVQA assess model performance in visual question answering tasks. Data sources such as ALIGN, LTIP, and YFCC100M provide large-scale image-text pairs for model training. For audio-centric understanding, benchmarks like AudioBench and VoiceBench evaluate model performance in audio-related tasks, while datasets like LibriSpeech and Common Voice offer abundant audio data.
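To show what evaluating a model on a VQA-style benchmark typically involves, here is a minimal exact-match accuracy loop. The `predict` callable and the example format are assumptions for illustration, not the official protocol of any benchmark listed above:

```python
def exact_match_accuracy(examples, predict):
    """examples: iterable of dicts with 'image', 'question', and 'answer' keys.
    predict: callable (image, question) -> answer string (e.g., a wrapped VLM).
    """
    correct, total = 0, 0
    for ex in examples:
        pred = predict(ex["image"], ex["question"]).strip().lower()
        correct += int(pred == ex["answer"].strip().lower())
        total += 1
    return correct / max(total, 1)
```

Real benchmarks often use softer metrics (e.g., VQA accuracy over multiple reference answers), but the overall loop has the same shape.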
Multimodal Generation
Multimodal generation includes cross-modal generation and joint multimodal generation. GenEval and T2I-CompBench++ evaluate model performance in image generation tasks. Data sources like MS-COCO and Flickr30k provide rich image and text data for model training. For joint multimodal generation, benchmarks such as DPG-Bench and GenAI-Bench assess model capabilities in generating multimodal content, while datasets like Conceptual Captions and RedCaps offer diverse multimodal data.
Multimodal Reasoning
Multimodal reasoning involves general visual reasoning and domain-specific reasoning. NaturalBench and VCR evaluate model performance in visual commonsense reasoning tasks. Data sources like VCR and TDIUC provide complex visual reasoning scenarios for model training. For domain-specific reasoning, benchmarks such as MathVista and VLM-Bench assess model performance in specialized fields like mathematics and vision-language integration, while datasets like Habitat and AI2-THOR offer task-specific environments.
Multimodal Planning
Multimodal planning includes GUI navigation and embodied and simulated environments. Benchmarks like WebArena and Mind2Web evaluate model performance in web navigation tasks. Data sources such as AMEX and Rico provide rich interactive environments for model training. For embodied and simulated environments, benchmarks like MineDojo and V-MAGE assess model capabilities in simulated settings, while datasets like AndroidEnv and GUI-World offer realistic interaction scenarios.
Challenges and Future Directions
Despite significant progress in large multimodal reasoning models, several challenges remain. For example, achieving true visual-centric long reasoning and conducting interactive multimodal reasoning are still underdeveloped areas. Future directions include building multimodal agents capable of proactive environmental interaction, integrating semantics across all modalities, and resolving ambiguities in complex, open-world contexts.
Conclusion
The evolution of large multimodal reasoning models reflects a progression from perception-driven modular reasoning to language-centric short reasoning, and finally to language-centric long reasoning. While impressive achievements have been made, challenges persist. By developing native large multimodal reasoning models that support scalable, agentic, and adaptive reasoning and planning in complex real-world environments, we aim to bridge the gap between isolated task performance and generalized real-world problem-solving, advancing the field of artificial intelligence.
If you found this article helpful, please cite it as follows:
@article{li2025perception,
  title={Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models},
  author={Li, Yunxin and Liu, Zhenyu and Li, Zitao and Zhang, Xuanyu and Xu, Zhenran and Chen, Xinyu and Shi, Haoyuan and Jiang, Shenyuan and Wang, Xintong and Wang, Jifang and others},
  journal={arXiv preprint arXiv:2505.04921},
  year={2025}
}