Notes-Guided MLLM Reasoning: Enhancing Visual Question Answering with Knowledge and Visual Notes
This article explores NoteMR, a framework proposed by researchers from South China Normal University at CVPR 2025. Through a dual-note mechanism, it addresses knowledge-noise interference and visual hallucination in knowledge-based visual question answering, achieving up to a 5.31% performance improvement on the OK-VQA and A-OKVQA datasets.
(Image: Unsplash – Illustrating multimodal AI processing visual-textual information)
I. Challenges in Knowledge-Based Visual Question Answering
Knowledge-Based Visual Question Answering (KB-VQA) requires models to integrate image content with external knowledge for reasoning. For example, when shown a baseball game image and asked “What is it called when players run around bases after hitting the ball?”, the correct answer “home run” requires sports knowledge. Current methods face two critical bottlenecks:
Problem 1: Knowledge Noise Disrupts Reasoning
- Explicit knowledge retrieval defects: knowledge retrieved from Google Search/Wikidata contains redundant or incorrect information (e.g., retrieving "stealing bases" instead of "home run")
- Inefficient implicit knowledge utilization: although Multimodal Large Language Models (MLLMs) store the correct knowledge, they retrieve it accurately only about 38% of the time (validated in Figure 1a of the paper)
Problem 2: Visual Hallucination Phenomenon
Caused by visual encoders’ insufficient fine-grained feature perception (e.g., misidentifying traffic light colors):
```
# Typical error case (Paper Figure 1b)
Input:  green-light image + question "What does the signal indicate drivers should do?"
Model output: "Stop"   # correct answer: "Go"
```
II. Architectural Principles of NoteMR Framework
NoteMR introduces a dual-note mechanism: knowledge notes filter out knowledge noise, while visual notes sharpen fine-grained perception. The workflow comprises three phases:
(Framework analogy: Knowledge notes act as study summaries, visual notes function as textbook annotations)
Phase 1: Knowledge Note Generation (Knowledge Purification)
```mermaid
graph LR
    A[Raw Image] --> B(Knowledge Retriever)
    C[Question Text] --> B
    B --> D[Top-k Knowledge Passages]
    D --> E[MLLM Knowledge Filter]
    A --> E
    E --> F[Knowledge Notes]
```
Core Technical Components:
- The PreFLMR retriever fetches the top-5 relevant passages from Google Search/Wikidata
- A frozen-parameter MLLM (e.g., LLaVA-NeXT-8B) takes the image plus the retrieved knowledge as input
- It generates a condensed knowledge note, for example:

"In baseball, when a batter hits the ball over the outfield fence, they can leisurely circle the bases to score. This action is called a home run."
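To make the filtering step concrete, here is a minimal sketch of how the knowledge-note prompt could be assembled. The paper does not publish its exact prompt, so the wording below and the `mllm_generate` callable are illustrative assumptions rather than NoteMR's actual implementation.

```python
# Hedged sketch: condensing top-k retrieved passages into a knowledge note.
# `mllm_generate` is a hypothetical stand-in for any image-conditioned
# text-generation call (e.g., a frozen LLaVA-NeXT-8B served via transformers).
from typing import Callable, List

def build_knowledge_note(question: str,
                         passages: List[str],
                         mllm_generate: Callable[[str], str]) -> str:
    """Ask a frozen MLLM to keep only the question-relevant knowledge."""
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    prompt = (
        "Given the image and the retrieved passages below, write a short note "
        "containing only the knowledge needed to answer the question. "
        "Discard irrelevant or contradictory passages.\n"
        f"Question: {question}\n"
        f"Passages:\n{context}\n"
        "Knowledge note:"
    )
    return mllm_generate(prompt)
```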
Phase 2: Visual Note Generation (Focal Enhancement)
```python
# GradCAM-based visual-focus masking (schematic view of Paper Eqs. 6-9;
# the helper functions are placeholders for the paper's actual operators)
def generate_visual_notes(image, knowledge_note, lam=0.6):
    patches = split_image(image, patch_size=16)    # e.g., a 24x24 grid of 576 patches
    attention_map = cross_modal_attention(knowledge_note, patches)
    mask = threshold_filter(attention_map, lam)    # retain regions with attention above λ
    return image * mask                            # the masked image is the visual note
```
Effect comparison:
- Original image: the full traffic-light scene
- Visual note: focuses on the green-light area (the red zones in the heatmap)
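As a runnable complement to the pseudocode above, the NumPy toy below shows the masking mechanics only: threshold a patch-level attention map, upsample it to pixel resolution, and zero out low-attention regions. The 384×384 shape and the random attention map are assumptions for illustration, not the paper's code.

```python
# Toy illustration of attention-thresholded masking (NumPy only; assumes a
# 384x384 image split into a 24x24 grid of 16x16 patches).
import numpy as np

H = W = 384
P = 16
image = np.random.rand(H, W, 3)          # stand-in for the input image
attn = np.random.rand(H // P, W // P)    # stand-in for cross-modal attention

mask = (attn > 0.6).astype(image.dtype)  # keep patches with attention above λ
mask = np.kron(mask, np.ones((P, P)))    # upsample the patch mask to pixels
visual_note = image * mask[..., None]    # the masked image acts as the visual note
print(visual_note.shape)                 # (384, 384, 3)
```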
Phase 3: Dual-Note Collaborative Reasoning
```
Input = Raw Image + Question + Knowledge Note + Visual Note
        ↓
MLLM generates 3 candidate answers → secondary filtering → final output
```
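A hedged sketch of this phase follows, under the assumption that the secondary filtering can be approximated by a majority vote over sampled candidates (the paper's exact criterion may differ); `mllm_answer` is a hypothetical generation call that accepts multiple images.

```python
# Illustrative composition of the dual-note reasoning input.
# `mllm_answer(images=..., prompt=...)` is a hypothetical stand-in for an
# MLLM call that accepts several images; majority voting approximates the
# paper's candidate-filtering step.
from collections import Counter
from typing import Callable, List

def answer_with_notes(image, visual_note, knowledge_note: str, question: str,
                      mllm_answer: Callable[..., str],
                      n_candidates: int = 3) -> str:
    prompt = (f"Knowledge note: {knowledge_note}\n"
              f"Question: {question}\n"
              "Answer with a short phrase.")
    candidates: List[str] = [
        mllm_answer(images=[image, visual_note], prompt=prompt)
        for _ in range(n_candidates)
    ]
    return Counter(candidates).most_common(1)[0][0]  # keep the most frequent answer
```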
III. Breakthrough Experimental Results
3.1 Performance on OK-VQA Dataset
✔️ Up to 5.31% improvement: with NoteMR, smaller (8B) models outperform 13B baselines
3.2 Performance on A-OKVQA Dataset
3.3 Ablation Study Validation
IV. Representative Case Studies
Case 1: Knowledge Note Error Correction (Paper Figure 4)
- Question: "What material fills the sofa cushions?"
- Raw retrieved knowledge: "…some people add batting to sofa cushions…" (the term "batting" misleads the model toward baseball)
- Knowledge note output: "Polyurethane foam is commonly used as furniture filling for its high elasticity" (filters out the noise and adds domain knowledge)
- Result: the model's answer is corrected from "baseball pads" to "foam"
Case 2: Visual Hallucination Prevention (Paper Figure 5)
- Question: "What does the traffic signal indicate?"
- Without visual notes: misidentified as "Stop" (the green light is ignored)
- With visual notes: the model focuses on the green-signal area
- Corrected output: "Go"
V. Technical Advantages and Value
Core Innovations
- Knowledge Distiller: explicit knowledge → guides the MLLM to activate its implicit knowledge → denoised knowledge notes
- Visual Focus Module: cross-modal attention → extracts high-relevance (λ > 0.6) regions → anti-hallucination visual notes
Engineering Value
- Computational efficiency: deployable on a single A6000 GPU (~43 GB VRAM)
- Adaptability: validated on mainstream MLLMs (LLaVA/Qwen2-VL)
- Open-source progress: code publicly released (specific link not given in the paper)
VI. Application Prospects
Methodology extendable to:
- Medical imaging diagnosis (integrated with medical knowledge bases)
- Industrial quality inspection (equipment manuals + defect localization)
- Autonomous driving (traffic rules + real-time scene understanding)
As the paper concludes: "NoteMR enables models to precisely focus on 'exam key points' through dual notes, achieving reliable reasoning, much like well-prepared students."
Resources:
- Original Paper: Fang et al., CVPR 2025 (Open Access)
- Datasets: OK-VQA (14K samples) / A-OKVQA (25K samples)
- Base Models: LLaVA-NeXT-7B/8B (available on Hugging Face)