MemoryVLA: Revolutionizing Robotic Manipulation with Human-Inspired Memory Systems
Core Question
How does MemoryVLA address the limitations of existing Vision-Language-Action (VLA) models in handling long-term dependencies for robotic manipulation?
MemoryVLA introduces a dual-memory architecture inspired by human cognitive systems, enabling robots to handle complex, time-dependent tasks that traditional models struggle with. By integrating perceptual details and high-level semantics into a unified memory framework, it achieves state-of-the-art performance across 150+ tasks in simulation and real-world environments.
1. The Challenge of Temporal Dependencies in Robotics
1.1 Why Existing Models Fail
Modern VLA models like OpenVLA and π₀ rely on single-frame inputs, ignoring historical context. This leads to failures in tasks where:
- Visual ambiguity exists (e.g., button states look identical before/after pressing)
- Multi-step planning is required (e.g., “clean table and count” needs progress tracking)
Key limitation: Without memory, models cannot distinguish similar states or maintain task context over time.
2. MemoryVLA’s Dual-Memory Architecture
2.1 Core Innovation
MemoryVLA combines two memory systems:
- Working memory: Short-term storage of current observations
- Perceptual-Cognitive Memory Bank (PCMB): Long-term storage of historical data
```
# Simplified architecture diagram
[Current RGB Image + Language Instruction]
        ↓
Vision-Language Cognition Module → Working Memory (p, c)
        ↓
PCMB (stores perceptual details + cognitive semantics)
        ↓
Memory-Conditioned Diffusion Action Expert → 7-DoF actions
```
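To make the data flow concrete, here is a minimal sketch of one control step through this pipeline. The module and method names (`memoryvla_step`, `retrieve_and_fuse`, `consolidate_and_store`, `sample`) are illustrative assumptions, not the released API.

```python
# Hypothetical skeleton of one MemoryVLA control step (names are illustrative).
def memoryvla_step(rgb, instruction, cognition_module, pcmb, action_expert):
    # 1. Encode the current frame + instruction into working memory (p, c).
    p_tokens, c_token = cognition_module(rgb, instruction)
    # 2. Query the PCMB for relevant history and fuse it with working memory.
    p_fused, c_fused = pcmb.retrieve_and_fuse(p_tokens, c_token)
    # 3. Write the current working memory back into the bank (with consolidation).
    pcmb.consolidate_and_store(p_tokens, c_token)
    # 4. Condition the diffusion action expert on the fused tokens.
    return action_expert.sample(p_fused, c_fused)   # e.g. a 16-step, 7-DoF action chunk
```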
3. Key Components Explained
3.1 Vision-Language Cognition Module
Core Question: How does MemoryVLA process visual and language inputs?
- Visual encoding: Uses DINOv2 + SigLIP backbones to extract 256-dimensional perceptual tokens
- Language processing: LLaMA-7B generates a compact cognitive token from instructions
- Output: Working memory = {perceptual tokens (p), cognitive token (c)}
Example: For a “pick apple” task, the module encodes the apple’s position (perceptual) and the goal “place in basket” (cognitive).
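A hedged sketch of how such a module could produce the two token types is below. The wrapper class, dimensions, and the “last hidden state as cognitive token” choice are assumptions for illustration, not the authors’ exact code.

```python
import torch
import torch.nn as nn

class VisionLanguageCognition(nn.Module):
    """Illustrative stand-in: the real model uses DINOv2 + SigLIP and LLaMA-7B."""
    def __init__(self, vision_encoder, llm, vision_dim=1536, llm_dim=4096):
        super().__init__()
        self.vision_encoder = vision_encoder        # returns patch-level visual features
        self.llm = llm                              # language backbone returning hidden states
        self.proj = nn.Linear(vision_dim, llm_dim)  # project visual features into LLM space

    def forward(self, rgb, instruction_embeds):
        feats = self.vision_encoder(rgb)                 # (B, N, vision_dim) patch features
        p = self.proj(feats)                             # perceptual tokens p
        seq = torch.cat([p, instruction_embeds], dim=1)  # visual tokens + instruction
        h = self.llm(seq)                                # (B, L, llm_dim) hidden states
        c = h[:, -1:, :]                                 # last hidden state as cognitive token c
        return p, c
```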
3.2 Perceptual-Cognitive Memory Bank (PCMB)
Core Question: How does PCMB store and retrieve historical information?
3.2.1 Memory Storage
Stores two streams:
- Perceptual memory: Fine-grained visual details (e.g., object positions)
- Cognitive memory: High-level semantic summaries (e.g., “door opened”)
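A minimal stand-in for the two storage streams, assuming one entry per timestep and a fixed capacity (the class below is a simplification, not the paper’s implementation):

```python
from collections import deque
import torch

class PCMB:
    """Simplified Perceptual-Cognitive Memory Bank with two parallel streams."""
    def __init__(self, max_len: int = 16):
        self.perceptual = deque(maxlen=max_len)  # fine-grained visual tokens per step
        self.cognitive = deque(maxlen=max_len)   # compact semantic tokens per step

    def store(self, p_tokens: torch.Tensor, c_token: torch.Tensor) -> None:
        # Detach so stored history does not keep the training graph alive.
        self.perceptual.append(p_tokens.detach())
        self.cognitive.append(c_token.detach())
```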
3.2.2 Memory Retrieval
- Uses temporal positional encoding to query relevant history
- Example: In the “push buttons” task, retrieves past button states to infer completion
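Retrieval can be pictured as attention from the current token over stored entries whose keys carry a temporal positional encoding. The function below is a sketch under that assumption; shapes and scaling are illustrative.

```python
import torch
import torch.nn.functional as F

def retrieve(query: torch.Tensor, memory: torch.Tensor, pos_enc: torch.Tensor) -> torch.Tensor:
    """query: (B, 1, D); memory: (B, T, D); pos_enc: (T, D) temporal positional encoding."""
    keys = memory + pos_enc.unsqueeze(0)              # tag each entry with when it happened
    scores = query @ keys.transpose(1, 2) / keys.shape[-1] ** 0.5
    attn = F.softmax(scores, dim=-1)                  # (B, 1, T) weights over history
    return attn @ memory                              # (B, 1, D) retrieved history H
```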
3.2.3 Memory Fusion
- Gate mechanism dynamically weights historical vs. current data: \tilde{x} = g^x \odot H^x + (1-g^x) \odot x
- Balances short-term observations with long-term context
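A direct sketch of the gate in the equation above; computing the gate from the concatenated current and retrieved features is an assumption, only the fusion formula comes from the paper.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, x: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
        # g^x in [0, 1], computed per channel from current feature x and retrieved history H^x.
        g = torch.sigmoid(self.gate(torch.cat([x, h], dim=-1)))
        return g * h + (1.0 - g) * x   # \tilde{x} = g^x ⊙ H^x + (1 - g^x) ⊙ x
```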
3.2.4 Memory Consolidation
- Merges similar adjacent entries to prevent memory bloat
- Uses cosine similarity > 0.85 as the merging threshold
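As a sketch, consolidation can be done greedily over time-ordered entries, averaging a new entry into its predecessor when the two are nearly identical. The merge-by-averaging rule is an assumption; the 0.85 threshold is from the text above.

```python
import torch
import torch.nn.functional as F

def consolidate(entries: list[torch.Tensor], threshold: float = 0.85) -> list[torch.Tensor]:
    """entries: time-ordered (D,) tensors; near-duplicate neighbours are merged."""
    if not entries:
        return []
    merged = [entries[0]]
    for e in entries[1:]:
        if F.cosine_similarity(merged[-1], e, dim=-1) > threshold:
            merged[-1] = 0.5 * (merged[-1] + e)   # average near-duplicate adjacent entries
        else:
            merged.append(e)
    return merged
```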
3.3 Memory-Conditioned Action Expert
Core Question: How does MemoryVLA generate temporally aware actions?
- Diffusion-based policy with 10 denoising steps
- Conditions on:
  - Cognitive tokens (high-level guidance)
  - Perceptual tokens (fine-grained visual details)
- Result: Predicts 16-step action sequences (e.g., “grasp → lift → place”) for complex tasks.
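A hedged sketch of memory-conditioned sampling: start from Gaussian noise and run a fixed number of denoising steps, each conditioned on the fused perceptual and cognitive tokens. The simplified update rule below stands in for the actual diffusion sampler.

```python
import torch

@torch.no_grad()
def sample_actions(denoiser, p_tokens, c_token, steps=10, horizon=16, act_dim=7):
    a = torch.randn(1, horizon, act_dim)                # start the action chunk from pure noise
    for t in reversed(range(steps)):
        timestep = torch.full((1,), t, dtype=torch.long)
        eps = denoiser(a, timestep, p_tokens, c_token)  # noise prediction, conditioned on memory
        a = a - eps / steps                             # simplified denoising update
    return a                                            # (1, 16, 7) action sequence
```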
4. Experimental Results
4.1 Simulation Benchmarks
4.1.1 SimplerEnv-Bridge (Table 1)
Case Study: “Eggplant in Basket” task (100% success)
- Traditional models fail due to the visual similarity between pre- and post-placement states
- MemoryVLA retrieves the stored basket position from the PCMB to guide placement
4.1.2 LIBERO-90 (Table 3)
- 90 long-horizon tasks with a 95.6% success rate
- Outperforms CogACT by 3.5% despite using only third-person RGB (no wrist cameras)
4.2 Real-World Evaluation (Table 4)
Example: “Clean Restaurant Table”
- Requires sorting 5 objects into trash/storage bins
- MemoryVLA tracks placed items via cognitive memory, avoiding repeated placements
5. Ablation Studies
5.1 Memory Type Impact (Table 5)
Reflection: Neither visual details nor language semantics alone suffice—dual memory is critical.
5.2 Optimal Memory Length (Table 5)
- A 16-step memory achieves the best performance (71.9%)
- Longer memories (64 steps) degrade performance due to noise
6. Real-World Applications
6.1 Coffee Machine Operation
Task: Power on → insert capsule → select cup size → brew
MemoryVLA Advantage:
- Stores the capsule position (perceptual) and the selected cup size (cognitive)
- Resolves the visual ambiguity between the “ready” and “brewing” machine states
6.2 Lab Equipment Organization
Task: Sort 10 types of labware into labeled drawers
Key Challenge: Distinguishing similar-looking items (e.g., test tubes vs. pipettes)
Result: 37% higher success than CogACT
7. Conclusion & Future Directions
7.1 Key Contributions
- First VLA model with a human-inspired dual-memory system
- SOTA performance on 150+ tasks
- Robustness to visual ambiguity and long-term dependencies
7.2 Future Work
- Memory reflection: Embed long-term memory into the LLM input space
- Lifelong learning: Distill frequently used experiences into permanent representations
Practical Implementation Guide
Action Checklist
- Input Requirements:
  - 224×224 RGB image + text instruction
  - 50+ expert demonstrations per task
- Training Setup:

  ```python
  # Hyperparameters (from paper)
  batch_size = 256
  learning_rate = 2e-5
  memory_length = 16
  ```

- Inference:

  ```python
  model = MemoryVLA.load_pretrained("7B")
  actions = model.predict(obs, instruction)
  ```
One-Page Summary
(See the one-page summary figure from the original post.)
FAQ
Q1: How does MemoryVLA handle visually identical states (e.g., button press)?
A: Uses cognitive memory to track action completion state.
Q2: What’s the optimal memory length?
A: 16 steps (Table 5).
Q3: Does it require depth sensors?
A: No—uses only RGB input.
Q4: How does it perform under lighting changes?
A: 86.7% success in the SimplerEnv-Fractal variant aggregation tests.
Q5: Can it handle multi-robot coordination?
A: Current focus is single-arm; multi-robot is future work.
Based on ICLR 2025 paper “MemoryVLA: Perceptual-Cognitive Memory in Vision-Language-Action Models”
Images from paper figures; code/model details at project page