Breaking the Cognitive Boundaries of Visual Question Answering: How Knowledge and Visual Notes Enhance Multimodal Large Model Reasoning

Introduction: The Cognitive Challenges of Visual Question Answering

In today’s era of information explosion, visual question answering (VQA) systems need to understand image content and answer complex questions the way humans do. However, existing multimodal large language models (MLLMs) face two core challenges when handling visual questions that require external knowledge:

1.1 Limitations of Traditional Methods

Traditional knowledge-based visual question answering (KB-VQA) methods mainly fall into two categories:

  • Explicit retrieval methods: Rely on external knowledge bases but introduce noisy information
  • Implicit LLM methods: Utilize internal knowledge of large models but suffer from insufficient visual perception
[Figure: Comparison of Traditional Methods]

A typical case involves a question about the term “home run”:

  • The retrieved knowledge contains valuable information such as “players trotting around bases after hitting the ball”
  • Yet the model directly answers “Stealing” instead
  • This indicates that the model fails to effectively activate its relevant internal knowledge

1.2 Bottlenecks in Visual Perception

In traffic light recognition cases:

  • The model misidentifies a green light as a “Stop” signal
  • This reveals insufficient capture of fine-grained visual features
  • Traditional attention mechanisms struggle to accurately locate the key regions

2. NoteMR Framework: Dual Guidance Mechanism

2.1 Core Innovations

The NoteMR framework breaks traditional limitations through two key components:

  1. Knowledge Notes: Integrate explicit and implicit knowledge
  2. Visual Notes: Enhance fine-grained visual perception
[Figure: NoteMR Framework Diagram]

2.2 Knowledge Note Generation Mechanism

Detailed Steps:

  1. Knowledge Retrieval:

    • Use the PreFLMR retriever to obtain the top-k relevant passages
    • OK-VQA retrieves from the Google Search (GS) corpus, while A-OKVQA retrieves from Wikipedia
  2. Multimodal Guidance:

    • Input retrieved knowledge and original image to frozen MLLM
    • Formula: $$N_{kl} = \mathcal{P}_{\text{MLLM}}(c_k, V, P)$$
  3. Knowledge Filtering & Enhancement:

    • Filter noise from explicit knowledge
    • Activate MLLM’s relevant implicit knowledge
    • Generate knowledge summaries tightly connected to the image (a code sketch of this process follows the figure below)
[Figure: Knowledge Note Generation Process]
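Below is a minimal sketch of how this knowledge-note step could be wired up. It assumes a generic retriever with a `retrieve()` method and a frozen vision-language model exposing a `generate()` call; these interfaces, the helper names, and the prompt wording are illustrative assumptions, not the authors’ exact implementation.

```python
# Sketch of knowledge-note generation (Section 2.2): retrieve top-k passages,
# then let a frozen MLLM filter the noise and add its implicit knowledge.
# The retriever/mllm interfaces below are assumed, not a real library API.
from dataclasses import dataclass

@dataclass
class KnowledgeNoteGenerator:
    retriever: object  # e.g., a PreFLMR-style multimodal retriever (assumed interface)
    mllm: object       # frozen MLLM, e.g., LLaVA-NeXT or Qwen2-VL (assumed interface)

    def __call__(self, image, question: str, k: int = 5) -> str:
        # 1. Knowledge retrieval: top-k relevant passages c_k for (image, question).
        passages = self.retriever.retrieve(image, question, top_k=k)

        # 2. Multimodal guidance: pass c_k, the image V, and a prompt P to the
        #    frozen MLLM, i.e., N_kl = P_MLLM(c_k, V, P).
        prompt = (
            "Question: " + question + "\n"
            "Retrieved passages:\n" + "\n".join(passages) + "\n"
            "Summarize only the knowledge that is relevant to the image and "
            "question; ignore unrelated or noisy passages."
        )

        # 3. Knowledge filtering & enhancement: the MLLM drops noisy passages
        #    and supplements them with its own implicit knowledge.
        return self.mllm.generate(image=image, prompt=prompt)
```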

2.3 Visual Note Generation Principle

Implementation Mechanism:

  1. Cross-modal Attention Calculation:

    • Use GradCAM to compute attention matrix between image patches and knowledge notes
    • Formula: $$\mathrm{head}^i = \text{softmax}\left(\frac{(N_{kl}^i W_q^i)(W_k^i V_p^i)^{T}}{\sqrt{D_N^i}}\right)(W_v^i V_p^i)$$
  2. Region Screening:

    • Set threshold λ=0.6 to filter low-relevance regions
    • Generate binary mask matrix
    • Formula: $$\text{Mask}(i,j) = \begin{cases} 1 & H_{i,j} > \lambda \\ 0 & \text{otherwise} \end{cases}$$
  3. Visual Feature Extraction:

    • Apply mask to original image
    • Preserve the key visual regions (steps 1–3 are sketched in code after the figure below)
    • Formula: $$N_{vl} = \sum_{i=1}^{L} \sum_{j=1}^{M} V_{i,j} \cdot \text{Mask}(i,j)$$
[Figure: Visual Note Generation Principle]
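The following PyTorch sketch illustrates the three steps above under simplifying assumptions: a plain cross-attention map stands in for the GradCAM-weighted attention, one relevance score per patch is obtained by averaging over note tokens, and the mask is applied to patch features rather than raw pixels. Shapes, projection matrices, and function names are illustrative, not the paper’s exact implementation.

```python
# Sketch of visual-note generation (Section 2.3): cross-modal attention,
# threshold-based region screening, and masked visual feature extraction.
import torch
import torch.nn.functional as F

def visual_note(note_tokens, patch_tokens, W_q, W_k, W_v, lam=0.6):
    """note_tokens: (L, D_N) knowledge-note embeddings N_kl
    patch_tokens: (M, D_V) image-patch embeddings V_p
    Returns the masked patch features N_vl and the binary mask."""
    # 1. Cross-modal attention: softmax((N_kl W_q)(W_k V_p)^T / sqrt(d)).
    q = note_tokens @ W_q                    # (L, d)
    k = patch_tokens @ W_k                   # (M, d)
    v = patch_tokens @ W_v                   # (M, d)
    attn = F.softmax(q @ k.T / q.shape[-1] ** 0.5, dim=-1)  # (L, M)

    # Average over note tokens to get one relevance score per patch
    # (a stand-in for the GradCAM-style heatmap H).
    H = attn.mean(dim=0)                     # (M,)

    # 2. Region screening: binary mask with threshold lambda = 0.6.
    mask = (H > lam).float()                 # (M,)

    # 3. Visual feature extraction: keep only the high-relevance patches.
    N_vl = v * mask.unsqueeze(-1)            # (M, d)
    return N_vl, mask

# Toy usage with random tensors (dimensions are arbitrary).
L_, M_, D = 12, 196, 64
W = torch.randn(D, D)
N_vl, mask = visual_note(torch.randn(L_, D), torch.randn(M_, D), W, W, W)
```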

3. Experimental Verification and Performance Breakthroughs

3.1 Datasets and Baseline Methods

| Dataset | Scale | Knowledge Source | Evaluation Metric |
|---------|-------|------------------|-------------------|
| OK-VQA  | 9K train / 5K test  | Google Search corpus | VQA standard score |
| A-OKVQA | 17K train / 7K test | Wikipedia | MC / DA task score |

3.2 Main Experimental Results

OK-VQA Dataset Performance:

| Model | Parameter Size | Score (%) |
|-------|----------------|-----------|
| LLaVA-NeXT-7B | 7B | 58.7 |
| LLaVA-NeXT-8B | 8B | 62.2 |
| NoteMR (Qwen2-VL-7B)   | 7B | 64.8 |
| NoteMR (LLaVA-NeXT-7B) | 7B | 68.2 |
| NoteMR (LLaVA-NeXT-8B) | 8B | 70.0 |

A-OKVQA Dataset Performance:

| Model | MC Task (%) | DA Task (%) |
|-------|-------------|-------------|
| SKP    | –    | 65.3 |
| NoteMR | 88.1 | 68.7 |

3.3 Ablation Study Analysis

Module contributions are verified by adding components one at a time:

| Configuration | Score (%) | Improvement |
|---------------|-----------|-------------|
| Original Model | 62.2 | – |
| + Retrieved Knowledge | 65.3 | +3.1 |
| + Knowledge Notes | 68.8 | +3.5 |
| + Visual Notes | 69.6 | +0.8 |
| + Candidate Answer Optimization | 70.0 | +0.4 |

4. Case Study Analysis

4.1 Knowledge Note Effectiveness Case

Question: What’s the term for running bases after hitting?

Traditional Method’s Difficulty:

  • Retrieved knowledge includes “players trotting around bases” and noise like “put batting over my cushions”
  • Direct use leads to wrong answer “Cushions”

NoteMR Solution:

  • Knowledge notes filter irrelevant information, retain core knowledge
  • Activate MLLM’s internal “home run” related implicit knowledge
  • Generate correct answer “Foam”
[Figure: Knowledge Note Case]

4.2 Visual Note Effectiveness Case

Question: What should drivers do when traffic light shows green?

Traditional Method Error:

  • Mistakenly identifies green light as “Stop”

NoteMR Improvement:

  1. Compute attention between image and knowledge notes
  2. Generate visual notes highlighting traffic light regions
  3. Guide model to correctly identify “Go” signal
[Figure: Visual Note Case]

5. Technical Innovations and Future Prospects

5.1 Core Technical Innovations

  1. Knowledge Fusion Mechanism:

    • First implementation of explicit-implicit knowledge synergy
    • Use MLLM’s generative ability to filter knowledge noise
  2. Visual Perception Enhancement:

    • Introduce cross-modal attention to locate key regions
    • Effectively alleviate visual hallucination issues
  3. Multi-stage Optimization:

    • Candidate answer re-injection mechanism improves output stability
    • Three-stage reasoning ensures inference quality (a high-level sketch follows this list)
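As a rough illustration of how these pieces fit together, the sketch below chains the knowledge-note, visual-note, and answering stages, with the candidate answer re-injected for a final verification pass. The stage boundaries, helper names (wrapping the sketches above), and prompt wording are assumptions about one possible orchestration, not the authors’ published code.

```python
# High-level sketch of a three-stage NoteMR-style pipeline with
# candidate-answer re-injection; the helper functions correspond to the
# illustrative sketches in earlier sections, and the MLLM interface is assumed.
def notemr_answer(image, question, retriever, mllm):
    # Stage 1: knowledge note from retrieved passages + implicit knowledge.
    k_note = generate_knowledge_note(retriever, mllm, image, question)

    # Stage 2: visual note highlighting the question-relevant image regions.
    v_note = generate_visual_note(mllm, image, k_note)

    # Stage 3: answer with both notes, then re-inject the candidate answer
    # so the model can verify it, stabilizing the final output.
    candidate = mllm.generate(
        image=image, extra_visual=v_note,
        prompt=f"{k_note}\nQuestion: {question}")
    return mllm.generate(
        image=image, extra_visual=v_note,
        prompt=f"{k_note}\nCandidate answer: {candidate}\nQuestion: {question}")
```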

5.2 Application Prospects

This framework provides new technical ideas for:

  • Intelligent education systems
  • Medical image analysis
  • Autonomous driving perception
  • Industrial quality inspection systems

Conclusion: A New Paradigm for Cognitive Enhancement

The NoteMR framework breaks traditional MLLM limitations through dual guidance of knowledge and vision. Its innovative approach internalizes external knowledge while enhancing visual perception, offering new solutions for knowledge-based visual tasks.

As multimodal large models continue developing, this “note-guided” concept may become an important paradigm for enhancing model reasoning capabilities, demonstrating application value in broader cognitive intelligence fields.
