Notes-Guided MLLM Reasoning: Enhancing Visual Question Answering with Knowledge and Visual Notes
This article explores NoteMR, a framework proposed by researchers from South China Normal University at CVPR 2025. Through a dual-note mechanism, it addresses knowledge-noise interference and visual hallucination in knowledge-based visual question answering, achieving up to a 5.31% performance improvement on the OK-VQA and A-OKVQA datasets.
(Image: Unsplash – Illustrating multimodal AI processing visual-textual information)
I. Challenges in Knowledge-Based Visual Question Answering
Knowledge-Based Visual Question Answering (KB-VQA) requires models to integrate image content with external knowledge for reasoning. For example, when shown a baseball game image and asked “What is it called when players run around bases after hitting the ball?”, the correct answer “home run” requires sports knowledge. Current methods face two critical bottlenecks:
Problem 1: Knowledge Noise Disrupts Reasoning
- Explicit knowledge retrieval defects: knowledge retrieved from Google Search/Wikidata contains redundant or incorrect information (e.g., retrieving "stealing bases" instead of "home run")
- Inefficient implicit knowledge utilization: although Multimodal Large Language Models (MLLMs) store the correct knowledge, they retrieve it accurately only about 38% of the time (validated in Figure 1a of the paper)
Problem 2: Visual Hallucination Phenomenon
Caused by visual encoders’ insufficient fine-grained feature perception (e.g., misidentifying traffic light colors):
```
# Typical error case (Paper Figure 1b)
Input:  green-light image + question "What does the signal indicate drivers should do?"
Model output: "Stop"   # correct answer: "Go"
```
II. Architectural Principles of NoteMR Framework
NoteMR introduces a dual-note mechanism: knowledge notes filter out knowledge noise, while visual notes sharpen fine-grained perception. The workflow comprises three phases:
(Framework analogy: Knowledge notes act as study summaries, visual notes function as textbook annotations)
Phase 1: Knowledge Note Generation (Knowledge Purification)
```mermaid
graph LR
    A[Raw Image] --> B(Knowledge Retriever)
    C[Question Text] --> B
    B --> D[Top-k Knowledge Passages]
    D --> E[MLLM Knowledge Filter]
    A --> E
    E --> F[Knowledge Notes]
```
Core Technical Components:
- The PreFLMR retriever fetches the top-5 relevant passages from Google Search/Wikidata
- A frozen-parameter MLLM (e.g., LLaVA-NeXT-8B) takes the image plus the retrieved knowledge as input
- It generates a condensed knowledge note, for example:

"In baseball, when a batter hits the ball over the outfield fence, they can leisurely circle the bases to score. This action is called a home run."
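To make the filtering step concrete, here is a minimal sketch of how the knowledge-note prompt could be assembled. The paper does not publish its exact prompt, so the wording below and the `mllm_generate` callable are illustrative assumptions rather than NoteMR's actual implementation.

```python
# Hedged sketch: condensing top-k retrieved passages into a knowledge note.
# `mllm_generate` is a hypothetical stand-in for any image-conditioned
# text-generation call (e.g., a frozen LLaVA-NeXT-8B served via transformers).
from typing import Callable, List

def build_knowledge_note(question: str,
                         passages: List[str],
                         mllm_generate: Callable[[str], str]) -> str:
    """Ask a frozen MLLM to keep only the question-relevant knowledge."""
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    prompt = (
        "Given the image and the retrieved passages below, write a short note "
        "containing only the knowledge needed to answer the question. "
        "Discard irrelevant or contradictory passages.\n"
        f"Question: {question}\n"
        f"Passages:\n{context}\n"
        "Knowledge note:"
    )
    return mllm_generate(prompt)
```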
Phase 2: Visual Note Generation (Focal Enhancement)
```python
# GradCAM-based visual-focus masking (schematic view of Paper Eqs. 6-9;
# the helper functions are placeholders for the paper's actual operators)
def generate_visual_notes(image, knowledge_note, lam=0.6):
    patches = split_image(image, patch_size=16)    # e.g., a 24x24 grid of 576 patches
    attention_map = cross_modal_attention(knowledge_note, patches)
    mask = threshold_filter(attention_map, lam)    # retain regions with attention above λ
    return image * mask                            # the masked image is the visual note
```
Effect comparison:
- Original image: the full traffic-light scene
- Visual note: focuses on the green-light area (the red zones in the heatmap)
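As a runnable complement to the pseudocode above, the NumPy toy below shows the masking mechanics only: threshold a patch-level attention map, upsample it to pixel resolution, and zero out low-attention regions. The 384×384 shape and the random attention map are assumptions for illustration, not the paper's code.

```python
# Toy illustration of attention-thresholded masking (NumPy only; assumes a
# 384x384 image split into a 24x24 grid of 16x16 patches).
import numpy as np

H = W = 384
P = 16
image = np.random.rand(H, W, 3)          # stand-in for the input image
attn = np.random.rand(H // P, W // P)    # stand-in for cross-modal attention

mask = (attn > 0.6).astype(image.dtype)  # keep patches with attention above λ
mask = np.kron(mask, np.ones((P, P)))    # upsample the patch mask to pixels
visual_note = image * mask[..., None]    # the masked image acts as the visual note
print(visual_note.shape)                 # (384, 384, 3)
```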
Phase 3: Dual-Note Collaborative Reasoning
```
Input = Raw Image + Question + Knowledge Note + Visual Note
        ↓
MLLM generates 3 candidate answers → secondary filtering → final output
```
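A hedged sketch of this phase follows, under the assumption that the secondary filtering can be approximated by a majority vote over sampled candidates (the paper's exact criterion may differ); `mllm_answer` is a hypothetical generation call that accepts multiple images.

```python
# Illustrative composition of the dual-note reasoning input.
# `mllm_answer(images=..., prompt=...)` is a hypothetical stand-in for an
# MLLM call that accepts several images; majority voting approximates the
# paper's candidate-filtering step.
from collections import Counter
from typing import Callable, List

def answer_with_notes(image, visual_note, knowledge_note: str, question: str,
                      mllm_answer: Callable[..., str],
                      n_candidates: int = 3) -> str:
    prompt = (f"Knowledge note: {knowledge_note}\n"
              f"Question: {question}\n"
              "Answer with a short phrase.")
    candidates: List[str] = [
        mllm_answer(images=[image, visual_note], prompt=prompt)
        for _ in range(n_candidates)
    ]
    return Counter(candidates).most_common(1)[0][0]  # keep the most frequent answer
```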
III. Breakthrough Experimental Results
3.1 Performance on OK-VQA Dataset
✔️ Up to 5.31% improvement: with NoteMR, smaller (8B) models outperform 13B baselines
3.2 Performance on A-OKVQA Dataset
3.3 Ablation Study Validation
IV. Representative Case Studies
Case 1: Knowledge Note Error Correction (Paper Figure 4)
- Question: "What material fills the sofa cushions?"
- Raw retrieved knowledge: "…some people add batting to sofa cushions…" (the term "batting" misleads the model toward baseball)
- Knowledge note output: "Polyurethane foam is commonly used as furniture filling for its high elasticity" (filters out the noise and adds domain knowledge)
- Result: the model's answer is corrected from "baseball pads" to "foam"
Case 2: Visual Hallucination Prevention (Paper Figure 5)
- Question: "What does the traffic signal indicate?"
- Without visual notes: misidentified as "Stop" (the green light is ignored)
- With visual notes: the model focuses on the green-signal area
- Corrected output: "Go"
V. Technical Advantages and Value
Core Innovations
- Knowledge Distiller: explicit knowledge → guides the MLLM to activate its implicit knowledge → denoised knowledge notes
- Visual Focus Module: cross-modal attention → extracts high-relevance (λ > 0.6) regions → anti-hallucination visual notes
Engineering Value
- Computational efficiency: deployable on a single A6000 GPU (~43 GB VRAM)
- Adaptability: validated on mainstream MLLMs (LLaVA/Qwen2-VL)
- Open-source progress: code publicly released (specific link not given in the paper)
VI. Application Prospects
Methodology extendable to:
- Medical imaging diagnosis (integrated with medical knowledge bases)
- Industrial quality inspection (equipment manuals + defect localization)
- Autonomous driving (traffic rules + real-time scene understanding)
As the paper concludes: "NoteMR enables models to precisely focus on 'exam key points' through dual notes, achieving reliable reasoning, much like well-prepared students."
Resources:
- Original Paper: Fang et al., CVPR 2025 (Open Access)
- Datasets: OK-VQA (14K samples) / A-OKVQA (25K samples)
- Base Models: LLaVA-NeXT-7B/8B (available on Hugging Face)