Breaking the Cognitive Boundaries of Visual Question Answering: How Knowledge and Visual Notes Enhance Multimodal Large Model Reasoning
Introduction: The Cognitive Challenges of Visual Question Answering
In today’s era of information explosion, visual question answering (VQA) systems need to understand image content and answer complex questions the way humans do. However, existing multimodal large language models (MLLMs) often face two core challenges when handling visual questions that require external knowledge:
1.1 Limitations of Traditional Methods
Traditional knowledge-based visual question answering (KB-VQA) methods mainly fall into two categories:
- Explicit retrieval methods: rely on external knowledge bases but introduce noisy information
- Implicit LLM methods: exploit the internal knowledge of large models but suffer from insufficient visual perception

A typical case involves a question about a “home run”:
- The retrieved knowledge contains valuable information such as “players trotting around bases after hitting the ball”
- Yet the model’s direct answer is “Stealing”
- This indicates the model fails to effectively activate its relevant internal knowledge
1.2 Bottlenecks in Visual Perception
In traffic light recognition cases:
- Models misidentify a green light as a “Stop” signal
- This reveals insufficient capture of fine-grained visual features
- Traditional attention mechanisms struggle to accurately locate the key regions
2. NoteMR Framework: Dual Guidance Mechanism
2.1 Core Innovations
The NoteMR framework breaks traditional limitations through two key components:
- Knowledge Notes: integrate explicit and implicit knowledge
- Visual Notes: enhance fine-grained visual perception
2.2 Knowledge Note Generation Mechanism
Detailed Steps:
- Knowledge Retrieval:
  - Use the PreFLMR retriever to obtain the top-k relevant passages
  - The OK-VQA dataset uses the Google Search (GS) corpus; A-OKVQA uses Wikipedia
- Multimodal Guidance (see the code sketch after this list):
  - Feed the retrieved knowledge and the original image into a frozen MLLM
  - Formula: $$N_{kl} = \mathcal{P}_{\text{MLLM}}(c_k, V, P)$$
- Knowledge Filtering & Enhancement:
  - Filter noise out of the explicit knowledge
  - Activate the MLLM’s relevant implicit knowledge
  - Generate knowledge summaries tightly grounded in the image

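To make this step concrete, here is a minimal Python sketch of knowledge-note generation. The `FrozenMLLM` wrapper, its `generate(image, prompt)` method, and the prompt template are illustrative assumptions rather than the paper’s actual interface; the passages stand in for the top-k results returned by the retriever.

```python
# Minimal sketch of N_kl = P_MLLM(c_k, V, P): retrieved passages c_k and the
# image V are handed to a frozen MLLM with a note-taking prompt P.
# The FrozenMLLM class and prompt wording are hypothetical placeholders.
from dataclasses import dataclass
from typing import List


@dataclass
class FrozenMLLM:
    """Hypothetical wrapper around a frozen multimodal LLM backend."""
    def generate(self, image, prompt: str, max_new_tokens: int = 256) -> str:
        raise NotImplementedError("plug in an actual MLLM backend here")


NOTE_PROMPT = (
    "Question about the image: {question}\n"
    "Retrieved passages:\n{passages}\n"
    "Write a short note keeping only the knowledge that is relevant to the "
    "image and the question; discard noisy or unrelated passages."
)


def generate_knowledge_note(mllm: FrozenMLLM, image, question: str,
                            passages: List[str]) -> str:
    """Produce a knowledge note from the retrieved passages and the image."""
    prompt = NOTE_PROMPT.format(
        question=question,
        passages="\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages)),
    )
    return mllm.generate(image=image, prompt=prompt)
```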
2.3 Visual Note Generation Principle
Implementation Mechanism:
- Cross-modal Attention Calculation:
  - Use GradCAM to compute an attention matrix between the image patches and the knowledge notes
  - Formula: $$\text{head}^i = \text{softmax}\left(\frac{N_{kl}^i W_q^i}{\sqrt{D_N^i}}(W_k^i V_p^i)^T\right)(W_v^i V_p^i)$$
- Region Screening:
  - Set a threshold λ = 0.6 to filter out low-relevance regions
  - Generate a binary mask matrix
  - Formula: $$\text{Mask}(i,j) = \begin{cases} 1 & H_{i,j} > \lambda \\ 0 & \text{otherwise} \end{cases}$$
- Visual Feature Extraction (a code sketch of the whole pipeline follows this list):
  - Apply the mask to the original image
  - Preserve only the key visual regions
  - Formula: $$N_{vl} = \sum_{i=1}^{L} \sum_{j=1}^{M} V_{i,j} \cdot \text{Mask}(i,j)$$
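The following PyTorch sketch illustrates this pipeline under simplifying assumptions: the relevance map H is approximated by averaged single-head cross-attention weights rather than the GradCAM-derived map used in the paper, the tensor shapes and function names are illustrative, and the masked patch features stand in for the visual note.

```python
# Simplified visual-note sketch: cross-attention relevance between
# knowledge-note tokens and image patches, thresholding at lambda = 0.6, and
# masking of the patch features. The GradCAM step from the paper is
# approximated by raw attention weights; shapes and names are assumptions.
import torch
import torch.nn.functional as F


def cross_attention_relevance(note_tokens: torch.Tensor,   # (T, D) knowledge-note tokens
                              patch_tokens: torch.Tensor,  # (P, D) image patch tokens
                              w_q: torch.Tensor,           # (D, D) query projection
                              w_k: torch.Tensor) -> torch.Tensor:  # (D, D) key projection
    """Single-head analogue of softmax(N_kl W_q (W_k V_p)^T / sqrt(D))."""
    q = note_tokens @ w_q                                    # (T, D)
    k = patch_tokens @ w_k                                   # (P, D)
    attn = F.softmax(q @ k.T / q.shape[-1] ** 0.5, dim=-1)   # (T, P)
    return attn.mean(dim=0)                                  # (P,) one score per patch


def visual_note(patch_features: torch.Tensor,  # (P, D) patch features V
                relevance: torch.Tensor,       # (P,) relevance map H
                lam: float = 0.6) -> torch.Tensor:
    """Zero out patches whose normalized relevance does not exceed lambda."""
    h = (relevance - relevance.min()) / (relevance.max() - relevance.min() + 1e-8)
    mask = (h > lam).float()                    # binary Mask(i, j)
    return patch_features * mask.unsqueeze(-1)  # masked features, kept regions only
```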
3. Experimental Verification and Performance Breakthroughs
3.1 Datasets and Baseline Methods
| Dataset | Scale | Knowledge Source | Evaluation Metric |
|---|---|---|---|
| OK-VQA | 9K train / 5K test | Google search corpus | VQA standard score |
| A-OKVQA | 17K train / 7K test | Wikipedia | MC/DA task score |
3.2 Main Experimental Results
OK-VQA Dataset Performance:
| Model | Parameter Size | Score (%) |
|---|---|---|
| LLaVA-NeXT-7B | 7B | 58.7 |
| LLaVA-NeXT-8B | 8B | 62.2 |
| NoteMR (Qwen2-VL-7B) | 7B | 64.8 |
| NoteMR (LLaVA-NeXT-7B) | 7B | 68.2 |
| NoteMR (LLaVA-NeXT-8B) | 8B | 70.0 |
A-OKVQA Dataset Performance:
| Model | MC Task (%) | DA Task (%) |
|---|---|---|
| SKP | – | 65.3 |
| NoteMR | 88.1 | 68.7 |
3.3 Ablation Study Analysis
Module contributions were verified by adding components step by step:
| Configuration | Score (%) | Improvement |
|---|---|---|
| Original Model | 62.2 | – |
| + Retrieved Knowledge | 65.3 | +3.1 |
| + Knowledge Notes | 68.8 | +3.5 |
| + Visual Notes | 69.6 | +0.8 |
| + Candidate Answer Optimization | 70.0 | +0.4 |
4. Case Study Analysis
4.1 Knowledge Note Effectiveness Case
Question: What’s the term for running bases after hitting?
Traditional Method Dilemma:
- The retrieved knowledge includes “players trotting around bases” alongside noise such as “put batting over my cushions”
- Using it directly leads to the wrong answer “Cushions”
NoteMR Solution:
- The knowledge notes filter out irrelevant information and retain the core knowledge
- They activate the MLLM’s internal implicit knowledge related to “home run”
- The model generates the correct answer “Foam”
4.2 Visual Note Effectiveness Case
Question: What should drivers do when traffic light shows green?
Traditional Method Error:
- The model mistakenly identifies the green light as “Stop”
NoteMR Improvement:
- Compute attention between the image and the knowledge notes
- Generate visual notes that highlight the traffic light region
- Guide the model to correctly identify the “Go” signal

5. Technical Innovations and Future Prospects
5.1 Core Technical Innovations
- Knowledge Fusion Mechanism:
  - First implementation of explicit-implicit knowledge synergy
  - Uses the MLLM’s generative ability to filter knowledge noise
- Visual Perception Enhancement:
  - Introduces cross-modal attention to locate key regions
  - Effectively alleviates visual hallucination issues
- Multi-stage Optimization (a sketch of this flow follows the list):
  - A candidate-answer re-injection mechanism improves output stability
  - Three-stage reasoning ensures inference quality
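As a rough illustration of how these pieces might fit together, here is a Python sketch of a three-stage inference loop with candidate-answer re-injection. The `mllm.generate(image, prompt)` interface, the prompt wording, and the exact division into stages are assumptions for illustration only, not the paper’s verbatim pipeline.

```python
# Illustrative three-stage NoteMR-style inference with candidate re-injection.
# The mllm object, prompts, and staging are assumptions, not the paper's code.
from typing import List


def notemr_answer(mllm, image, masked_image, question: str,
                  passages: List[str], n_candidates: int = 3) -> str:
    # Stage 1: condense the retrieved passages into a knowledge note (Sec. 2.2).
    note_prompt = (
        f"Question: {question}\nRetrieved passages:\n" + "\n".join(passages) +
        "\nSummarize only the knowledge relevant to the image and question."
    )
    k_note = mllm.generate(image=image, prompt=note_prompt)

    # Stage 2: draft candidate answers conditioned on the knowledge note and the
    # visual note, represented here by a masked copy of the image (Sec. 2.3).
    draft_prompt = (
        f"Question: {question}\nKnowledge note: {k_note}\n"
        f"List {n_candidates} short candidate answers, one per line."
    )
    raw = mllm.generate(image=masked_image, prompt=draft_prompt)
    candidates = [c.strip() for c in raw.splitlines() if c.strip()]

    # Stage 3: re-inject the candidates and request a single final answer,
    # the step reported to improve output stability.
    final_prompt = (
        f"Question: {question}\nKnowledge note: {k_note}\n"
        f"Candidate answers: {', '.join(candidates)}\n"
        "Choose or refine the best answer. Reply with the answer only."
    )
    return mllm.generate(image=image, prompt=final_prompt)
```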
5.2 Application Prospects
This framework provides new technical ideas for:
- Intelligent education systems
- Medical image analysis
- Autonomous driving perception
- Industrial quality inspection systems
Conclusion: A New Paradigm for Cognitive Enhancement
The NoteMR framework breaks traditional MLLM limitations through dual guidance of knowledge and vision. Its innovative approach internalizes external knowledge while enhancing visual perception, offering new solutions for knowledge-based visual tasks.
As multimodal large models continue developing, this “note-guided” concept may become an important paradigm for enhancing model reasoning capabilities, demonstrating application value in broader cognitive intelligence fields.