MemoryVLA: Revolutionizing Robotic Manipulation with Human-Inspired Memory Systems
Core Question
How does MemoryVLA address the limitations of existing Vision-Language-Action (VLA) models in handling long-term dependencies for robotic manipulation?
MemoryVLA introduces a dual-memory architecture inspired by human cognitive systems, enabling robots to handle complex, time-dependent tasks that traditional models struggle with. By integrating perceptual details and high-level semantics into a unified memory framework, it achieves state-of-the-art performance across 150+ tasks in simulation and real-world environments.
1. The Challenge of Temporal Dependencies in Robotics
1.1 Why Existing Models Fail
Modern VLA models like OpenVLA and π₀ rely on single-frame inputs, ignoring historical context. This leads to failures in tasks where:
- Visual ambiguity exists (e.g., button states look identical before/after pressing)
- Multi-step planning is required (e.g., “clean table and count” needs progress tracking)
Key limitation: Without memory, models cannot distinguish similar states or maintain task context over time.
2. MemoryVLA’s Dual-Memory Architecture
2.1 Core Innovation
MemoryVLA combines two memory systems:
- Working memory: Short-term storage of current observations
- Perceptual-Cognitive Memory Bank (PCMB): Long-term storage of historical data
```
# Simplified architecture diagram
[Current RGB Image + Language Instruction]
        ↓
Vision-Language Cognition Module → Working Memory (p, c)
        ↓
PCMB (stores perceptual details + cognitive semantics)
        ↓
Memory-Conditioned Diffusion Action Expert → 7-DoF actions
```
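To make the data flow concrete, here is a minimal sketch of one control step through this pipeline. The module and method names (`memoryvla_step`, `retrieve_and_fuse`, `consolidate_and_store`, `sample`) are illustrative assumptions, not the released API.

```python
# Hypothetical skeleton of one MemoryVLA control step (names are illustrative).
def memoryvla_step(rgb, instruction, cognition_module, pcmb, action_expert):
    # 1. Encode the current frame + instruction into working memory (p, c).
    p_tokens, c_token = cognition_module(rgb, instruction)
    # 2. Query the PCMB for relevant history and fuse it with working memory.
    p_fused, c_fused = pcmb.retrieve_and_fuse(p_tokens, c_token)
    # 3. Write the current working memory back into the bank (with consolidation).
    pcmb.consolidate_and_store(p_tokens, c_token)
    # 4. Condition the diffusion action expert on the fused tokens.
    return action_expert.sample(p_fused, c_fused)   # e.g. a 16-step, 7-DoF action chunk
```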
3. Key Components Explained
3.1 Vision-Language Cognition Module
Core Question: How does MemoryVLA process visual and language inputs?
- Visual encoding: Uses DINOv2 + SigLIP backbones to extract 256-dimensional perceptual tokens
- Language processing: LLaMA-7B generates a compact cognitive token from instructions
- Output: Working memory = {perceptual tokens (p), cognitive token (c)}
Example: For a “pick apple” task, the module encodes the apple’s position (perceptual) and the goal “place in basket” (cognitive).
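A hedged sketch of how such a module could produce the two token types is below. The wrapper class, dimensions, and the “last hidden state as cognitive token” choice are assumptions for illustration, not the authors’ exact code.

```python
import torch
import torch.nn as nn

class VisionLanguageCognition(nn.Module):
    """Illustrative stand-in: the real model uses DINOv2 + SigLIP and LLaMA-7B."""
    def __init__(self, vision_encoder, llm, vision_dim=1536, llm_dim=4096):
        super().__init__()
        self.vision_encoder = vision_encoder        # returns patch-level visual features
        self.llm = llm                              # language backbone returning hidden states
        self.proj = nn.Linear(vision_dim, llm_dim)  # project visual features into LLM space

    def forward(self, rgb, instruction_embeds):
        feats = self.vision_encoder(rgb)                 # (B, N, vision_dim) patch features
        p = self.proj(feats)                             # perceptual tokens p
        seq = torch.cat([p, instruction_embeds], dim=1)  # visual tokens + instruction
        h = self.llm(seq)                                # (B, L, llm_dim) hidden states
        c = h[:, -1:, :]                                 # last hidden state as cognitive token c
        return p, c
```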
3.2 Perceptual-Cognitive Memory Bank (PCMB)
Core Question: How does PCMB store and retrieve historical information?
3.2.1 Memory Storage
Stores two streams:
- Perceptual memory: Fine-grained visual details (e.g., object positions)
- Cognitive memory: High-level semantic summaries (e.g., “door opened”)
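A minimal stand-in for the two storage streams, assuming one entry per timestep and a fixed capacity (the class below is a simplification, not the paper’s implementation):

```python
from collections import deque
import torch

class PCMB:
    """Simplified Perceptual-Cognitive Memory Bank with two parallel streams."""
    def __init__(self, max_len: int = 16):
        self.perceptual = deque(maxlen=max_len)  # fine-grained visual tokens per step
        self.cognitive = deque(maxlen=max_len)   # compact semantic tokens per step

    def store(self, p_tokens: torch.Tensor, c_token: torch.Tensor) -> None:
        # Detach so stored history does not keep the training graph alive.
        self.perceptual.append(p_tokens.detach())
        self.cognitive.append(c_token.detach())
```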
3.2.2 Memory Retrieval
- Uses temporal positional encoding to query relevant history
- Example: In the “push buttons” task, retrieves past button states to infer completion
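Retrieval can be pictured as attention from the current token over stored entries whose keys carry a temporal positional encoding. The function below is a sketch under that assumption; shapes and scaling are illustrative.

```python
import torch
import torch.nn.functional as F

def retrieve(query: torch.Tensor, memory: torch.Tensor, pos_enc: torch.Tensor) -> torch.Tensor:
    """query: (B, 1, D); memory: (B, T, D); pos_enc: (T, D) temporal positional encoding."""
    keys = memory + pos_enc.unsqueeze(0)              # tag each entry with when it happened
    scores = query @ keys.transpose(1, 2) / keys.shape[-1] ** 0.5
    attn = F.softmax(scores, dim=-1)                  # (B, 1, T) weights over history
    return attn @ memory                              # (B, 1, D) retrieved history H
```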
3.2.3 Memory Fusion
- Gate mechanism dynamically weights historical vs. current data: \tilde{x} = g^x \odot H^x + (1-g^x) \odot x
- Balances short-term observations with long-term context
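A direct sketch of the gate in the equation above; computing the gate from the concatenated current and retrieved features is an assumption, only the fusion formula comes from the paper.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, x: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
        # g^x in [0, 1], computed per channel from current feature x and retrieved history H^x.
        g = torch.sigmoid(self.gate(torch.cat([x, h], dim=-1)))
        return g * h + (1.0 - g) * x   # \tilde{x} = g^x ⊙ H^x + (1 - g^x) ⊙ x
```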
3.2.4 Memory Consolidation
- Merges similar adjacent entries to prevent memory bloat
- Uses cosine similarity > 0.85 as the merging threshold
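As a sketch, consolidation can be done greedily over time-ordered entries, averaging a new entry into its predecessor when the two are nearly identical. The merge-by-averaging rule is an assumption; the 0.85 threshold is from the text above.

```python
import torch
import torch.nn.functional as F

def consolidate(entries: list[torch.Tensor], threshold: float = 0.85) -> list[torch.Tensor]:
    """entries: time-ordered (D,) tensors; near-duplicate neighbours are merged."""
    if not entries:
        return []
    merged = [entries[0]]
    for e in entries[1:]:
        if F.cosine_similarity(merged[-1], e, dim=-1) > threshold:
            merged[-1] = 0.5 * (merged[-1] + e)   # average near-duplicate adjacent entries
        else:
            merged.append(e)
    return merged
```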
3.3 Memory-Conditioned Action Expert
Core Question: How does MemoryVLA generate temporally aware actions?
- Diffusion-based policy with 10 denoising steps
- Conditions on:
  - Cognitive tokens (high-level guidance)
  - Perceptual tokens (fine-grained visual details)
- Result: Predicts 16-step action sequences (e.g., “grasp → lift → place”) for complex tasks.
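A hedged sketch of memory-conditioned sampling: start from Gaussian noise and run a fixed number of denoising steps, each conditioned on the fused perceptual and cognitive tokens. The simplified update rule below stands in for the actual diffusion sampler.

```python
import torch

@torch.no_grad()
def sample_actions(denoiser, p_tokens, c_token, steps=10, horizon=16, act_dim=7):
    a = torch.randn(1, horizon, act_dim)                # start the action chunk from pure noise
    for t in reversed(range(steps)):
        timestep = torch.full((1,), t, dtype=torch.long)
        eps = denoiser(a, timestep, p_tokens, c_token)  # noise prediction, conditioned on memory
        a = a - eps / steps                             # simplified denoising update
    return a                                            # (1, 16, 7) action sequence
```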
4. Experimental Results
4.1 Simulation Benchmarks
4.1.1 SimplerEnv-Bridge (Table 1)
Case Study: “Eggplant in Basket” task (100% success)
- Traditional models fail due to the visual similarity between pre- and post-placement states
- MemoryVLA retrieves the stored basket position from the PCMB to guide placement
4.1.2 LIBERO-90 (Table 3)
- 90 long-horizon tasks with a 95.6% success rate
- Outperforms CogACT by 3.5% despite using only third-person RGB (no wrist cameras)
4.2 Real-World Evaluation (Table 4)
Example: “Clean Restaurant Table”
- Requires sorting 5 objects into trash/storage bins
- MemoryVLA tracks placed items via cognitive memory, avoiding repeated placements
5. Ablation Studies
5.1 Memory Type Impact (Table 5)
Reflection: Neither visual details nor language semantics alone suffice—dual memory is critical.
5.2 Optimal Memory Length (Table 5)
- A 16-step memory achieves the best performance (71.9%)
- Longer memories (64 steps) degrade performance due to noise
6. Real-World Applications
6.1 Coffee Machine Operation
Task: Power on → insert capsule → select cup size → brew
MemoryVLA Advantage:
- Stores the capsule position (perceptual) and the selected cup size (cognitive)
- Resolves the visual ambiguity between the “ready” and “brewing” machine states
6.2 Lab Equipment Organization
Task: Sort 10 types of labware into labeled drawers
Key Challenge: Distinguishing similar-looking items (e.g., test tubes vs. pipettes)
Result: 37% higher success than CogACT
7. Conclusion & Future Directions
7.1 Key Contributions
- First VLA model with a human-inspired dual-memory system
- SOTA performance on 150+ tasks
- Robustness to visual ambiguity and long-term dependencies
7.2 Future Work
- Memory reflection: Embed long-term memory into the LLM input space
- Lifelong learning: Distill frequently used experiences into permanent representations
Practical Implementation Guide
Action Checklist
- Input Requirements:
  - 224×224 RGB image + text instruction
  - 50+ expert demonstrations per task
- Training Setup:

  ```python
  # Hyperparameters (from paper)
  batch_size = 256
  learning_rate = 2e-5
  memory_length = 16
  ```

- Inference:

  ```python
  model = MemoryVLA.load_pretrained("7B")
  actions = model.predict(obs, instruction)
  ```
One-Page Summary
(See the one-page summary figure from the original post.)
FAQ
Q1: How does MemoryVLA handle visually identical states (e.g., button press)?
A: Uses cognitive memory to track action completion state.
Q2: What’s the optimal memory length?
A: 16 steps (Table 5).
Q3: Does it require depth sensors?
A: No—uses only RGB input.
Q4: How does it perform under lighting changes?
A: 86.7% success in the SimplerEnv-Fractal variant aggregation tests.
Q5: Can it handle multi-robot coordination?
A: Current focus is single-arm; multi-robot is future work.
Based on ICLR 2025 paper “MemoryVLA: Perceptual-Cognitive Memory in Vision-Language-Action Models”
Images from paper figures; code/model details at project page