Breaking the Cognitive Boundaries of Visual Question Answering: How Knowledge and Visual Notes Enhance Multimodal Large Model Reasoning
Introduction: The Cognitive Challenges of Visual Question Answering
In today’s era of information explosion, visual question answering (VQA) systems need to understand image content and answer complex questions the way humans do. However, existing multimodal large language models (MLLMs) often face two core challenges when handling visual questions that require external knowledge:
1.1 Limitations of Traditional Methods
Traditional knowledge-based visual question answering (KB-VQA) methods mainly fall into two categories:
- Explicit retrieval methods: rely on external knowledge bases but introduce noisy information
- Implicit LLM methods: exploit the internal knowledge of large models but suffer from insufficient visual perception

A typical case involves a question about a “home run”:
- The retrieved knowledge contains valuable information such as “players trotting around bases after hitting the ball”
- Yet the model’s direct answer is “Stealing”
- This indicates the model fails to effectively activate its relevant internal knowledge
1.2 Bottlenecks in Visual Perception
In traffic light recognition cases:
- Models misidentify a green light as a “Stop” signal
- This reveals insufficient capture of fine-grained visual features
- Traditional attention mechanisms struggle to accurately locate the key regions
2. NoteMR Framework: Dual Guidance Mechanism
2.1 Core Innovations
The NoteMR framework breaks traditional limitations through two key components:
- Knowledge Notes: integrate explicit and implicit knowledge
- Visual Notes: enhance fine-grained visual perception
2.2 Knowledge Note Generation Mechanism
Detailed Steps:
- Knowledge Retrieval:
  - Use the PreFLMR retriever to obtain the top-k relevant passages
  - The OK-VQA dataset uses the Google Search (GS) corpus; A-OKVQA uses Wikipedia
- Multimodal Guidance (see the code sketch after this list):
  - Feed the retrieved knowledge and the original image into a frozen MLLM
  - Formula: $$N_{kl} = \mathcal{P}_{\text{MLLM}}(c_k, V, P)$$
- Knowledge Filtering & Enhancement:
  - Filter noise out of the explicit knowledge
  - Activate the MLLM’s relevant implicit knowledge
  - Generate knowledge summaries tightly grounded in the image

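To make this step concrete, here is a minimal Python sketch of knowledge-note generation. The `FrozenMLLM` wrapper, its `generate(image, prompt)` method, and the prompt template are illustrative assumptions rather than the paper’s actual interface; the passages stand in for the top-k results returned by the retriever.

```python
# Minimal sketch of N_kl = P_MLLM(c_k, V, P): retrieved passages c_k and the
# image V are handed to a frozen MLLM with a note-taking prompt P.
# The FrozenMLLM class and prompt wording are hypothetical placeholders.
from dataclasses import dataclass
from typing import List


@dataclass
class FrozenMLLM:
    """Hypothetical wrapper around a frozen multimodal LLM backend."""
    def generate(self, image, prompt: str, max_new_tokens: int = 256) -> str:
        raise NotImplementedError("plug in an actual MLLM backend here")


NOTE_PROMPT = (
    "Question about the image: {question}\n"
    "Retrieved passages:\n{passages}\n"
    "Write a short note keeping only the knowledge that is relevant to the "
    "image and the question; discard noisy or unrelated passages."
)


def generate_knowledge_note(mllm: FrozenMLLM, image, question: str,
                            passages: List[str]) -> str:
    """Produce a knowledge note from the retrieved passages and the image."""
    prompt = NOTE_PROMPT.format(
        question=question,
        passages="\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages)),
    )
    return mllm.generate(image=image, prompt=prompt)
```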
2.3 Visual Note Generation Principle
Implementation Mechanism:
- Cross-modal Attention Calculation:
  - Use GradCAM to compute an attention matrix between the image patches and the knowledge notes
  - Formula: $$\text{head}^i = \text{softmax}\left(\frac{N_{kl}^i W_q^i}{\sqrt{D_N^i}}(W_k^i V_p^i)^T\right)(W_v^i V_p^i)$$
- Region Screening:
  - Set a threshold λ = 0.6 to filter out low-relevance regions
  - Generate a binary mask matrix
  - Formula: $$\text{Mask}(i,j) = \begin{cases} 1 & H_{i,j} > \lambda \\ 0 & \text{otherwise} \end{cases}$$
- Visual Feature Extraction (a code sketch of the whole pipeline follows this list):
  - Apply the mask to the original image
  - Preserve only the key visual regions
  - Formula: $$N_{vl} = \sum_{i=1}^{L} \sum_{j=1}^{M} V_{i,j} \cdot \text{Mask}(i,j)$$
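The following PyTorch sketch illustrates this pipeline under simplifying assumptions: the relevance map H is approximated by averaged single-head cross-attention weights rather than the GradCAM-derived map used in the paper, the tensor shapes and function names are illustrative, and the masked patch features stand in for the visual note.

```python
# Simplified visual-note sketch: cross-attention relevance between
# knowledge-note tokens and image patches, thresholding at lambda = 0.6, and
# masking of the patch features. The GradCAM step from the paper is
# approximated by raw attention weights; shapes and names are assumptions.
import torch
import torch.nn.functional as F


def cross_attention_relevance(note_tokens: torch.Tensor,   # (T, D) knowledge-note tokens
                              patch_tokens: torch.Tensor,  # (P, D) image patch tokens
                              w_q: torch.Tensor,           # (D, D) query projection
                              w_k: torch.Tensor) -> torch.Tensor:  # (D, D) key projection
    """Single-head analogue of softmax(N_kl W_q (W_k V_p)^T / sqrt(D))."""
    q = note_tokens @ w_q                                    # (T, D)
    k = patch_tokens @ w_k                                   # (P, D)
    attn = F.softmax(q @ k.T / q.shape[-1] ** 0.5, dim=-1)   # (T, P)
    return attn.mean(dim=0)                                  # (P,) one score per patch


def visual_note(patch_features: torch.Tensor,  # (P, D) patch features V
                relevance: torch.Tensor,       # (P,) relevance map H
                lam: float = 0.6) -> torch.Tensor:
    """Zero out patches whose normalized relevance does not exceed lambda."""
    h = (relevance - relevance.min()) / (relevance.max() - relevance.min() + 1e-8)
    mask = (h > lam).float()                    # binary Mask(i, j)
    return patch_features * mask.unsqueeze(-1)  # masked features, kept regions only
```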
3. Experimental Verification and Performance Breakthroughs
3.1 Datasets and Baseline Methods
| Dataset | Scale | Knowledge Source | Evaluation Metric |
|---|---|---|---|
| OK-VQA | 9K train / 5K test | Google search corpus | VQA standard score |
| A-OKVQA | 17K train / 7K test | Wikipedia | MC/DA task score |
3.2 Main Experimental Results
OK-VQA Dataset Performance:
| Model | Parameter Size | Score (%) |
|---|---|---|
| LLaVA-NeXT-7B | 7B | 58.7 |
| LLaVA-NeXT-8B | 8B | 62.2 |
| NoteMR (Qwen2-VL-7B) | 7B | 64.8 |
| NoteMR (LLaVA-NeXT-7B) | 7B | 68.2 |
| NoteMR (LLaVA-NeXT-8B) | 8B | 70.0 |
A-OKVQA Dataset Performance:
| Model | MC Task (%) | DA Task (%) |
|---|---|---|
| SKP | – | 65.3 |
| NoteMR | 88.1 | 68.7 |
3.3 Ablation Study Analysis
Module contributions were verified by adding components step by step:
| Configuration | Score (%) | Improvement |
|---|---|---|
| Original Model | 62.2 | – |
| + Retrieved Knowledge | 65.3 | +3.1 |
| + Knowledge Notes | 68.8 | +3.5 |
| + Visual Notes | 69.6 | +0.8 |
| + Candidate Answer Optimization | 70.0 | +0.4 |
4. Case Study Analysis
4.1 Knowledge Note Effectiveness Case
Question: What’s the term for running bases after hitting?
Traditional Method Dilemma:
- The retrieved knowledge includes “players trotting around bases” alongside noise such as “put batting over my cushions”
- Using it directly leads to the wrong answer “Cushions”
NoteMR Solution:
- The knowledge notes filter out irrelevant information and retain the core knowledge
- They activate the MLLM’s internal implicit knowledge related to “home run”
- The model generates the correct answer “Foam”
4.2 Visual Note Effectiveness Case
Question: What should drivers do when traffic light shows green?
Traditional Method Error:
- The model mistakenly identifies the green light as “Stop”
NoteMR Improvement:
- Compute attention between the image and the knowledge notes
- Generate visual notes that highlight the traffic light region
- Guide the model to correctly identify the “Go” signal

5. Technical Innovations and Future Prospects
5.1 Core Technical Innovations
- Knowledge Fusion Mechanism:
  - First implementation of explicit-implicit knowledge synergy
  - Uses the MLLM’s generative ability to filter knowledge noise
- Visual Perception Enhancement:
  - Introduces cross-modal attention to locate key regions
  - Effectively alleviates visual hallucination issues
- Multi-stage Optimization (a sketch of this flow follows the list):
  - A candidate-answer re-injection mechanism improves output stability
  - Three-stage reasoning ensures inference quality
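As a rough illustration of how these pieces might fit together, here is a Python sketch of a three-stage inference loop with candidate-answer re-injection. The `mllm.generate(image, prompt)` interface, the prompt wording, and the exact division into stages are assumptions for illustration only, not the paper’s verbatim pipeline.

```python
# Illustrative three-stage NoteMR-style inference with candidate re-injection.
# The mllm object, prompts, and staging are assumptions, not the paper's code.
from typing import List


def notemr_answer(mllm, image, masked_image, question: str,
                  passages: List[str], n_candidates: int = 3) -> str:
    # Stage 1: condense the retrieved passages into a knowledge note (Sec. 2.2).
    note_prompt = (
        f"Question: {question}\nRetrieved passages:\n" + "\n".join(passages) +
        "\nSummarize only the knowledge relevant to the image and question."
    )
    k_note = mllm.generate(image=image, prompt=note_prompt)

    # Stage 2: draft candidate answers conditioned on the knowledge note and the
    # visual note, represented here by a masked copy of the image (Sec. 2.3).
    draft_prompt = (
        f"Question: {question}\nKnowledge note: {k_note}\n"
        f"List {n_candidates} short candidate answers, one per line."
    )
    raw = mllm.generate(image=masked_image, prompt=draft_prompt)
    candidates = [c.strip() for c in raw.splitlines() if c.strip()]

    # Stage 3: re-inject the candidates and request a single final answer,
    # the step reported to improve output stability.
    final_prompt = (
        f"Question: {question}\nKnowledge note: {k_note}\n"
        f"Candidate answers: {', '.join(candidates)}\n"
        "Choose or refine the best answer. Reply with the answer only."
    )
    return mllm.generate(image=image, prompt=final_prompt)
```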
5.2 Application Prospects
This framework provides new technical ideas for:
- Intelligent education systems
- Medical image analysis
- Autonomous driving perception
- Industrial quality inspection systems
Conclusion: A New Paradigm for Cognitive Enhancement
The NoteMR framework breaks traditional MLLM limitations through dual guidance of knowledge and vision. Its innovative approach internalizes external knowledge while enhancing visual perception, offering new solutions for knowledge-based visual tasks.
As multimodal large models continue developing, this “note-guided” concept may become an important paradigm for enhancing model reasoning capabilities, demonstrating application value in broader cognitive intelligence fields.