
From Human Memory to AI Continual Learning: How Nested Learning Solves the “Amnesia” Problem in Large Models

If you’ve been following machine learning’s evolution, you’ve probably noticed a strange paradox: while today’s AI systems can write poetry, debug code, and reason through complex problems, they still struggle with something a three-year-old does effortlessly—learning new things without forgetting old ones. It’s like meeting someone who can recite the entire encyclopedia but can’t remember your name five minutes after you meet.

Google Research’s recent introduction of Nested Learning, presented at NeurIPS 2025, challenges this fundamental limitation. This isn’t another incremental architecture tweak. It’s a rethinking of how we understand deep learning itself, inspired by how the human brain continually adapts through neuroplasticity. As someone who’s spent years wrestling with catastrophic forgetting in production systems, I find this work both technically rigorous and refreshingly intuitive.

In this post, we’ll unpack what Nested Learning really means, explore its three core innovations, examine the hard numbers behind its performance claims, and discuss what it could mean for the future of AI systems that truly learn over time.

The Core Problem: AI’s Version of Anterograde Amnesia

Let’s start with a medical analogy that the research team uses effectively. Anterograde amnesia is a neurological condition where a person cannot form new long-term memories after a specific event. They might remember their childhood perfectly but can’t retain anything that happened after the onset of their condition. Every moment feels like waking up to an unfamiliar world.

Current large language models suffer from a strikingly similar problem. After pre-training ends, their knowledge becomes frozen in time. The only way they process new information is through:

  1. The immediate context window (short-term memory)
  2. Static parameters from pre-training (long-term memory that can’t be updated)

When information slides out of the context window, it’s gone forever—never integrated into the model’s persistent knowledge. This is why we can’t truly teach an LLM something new during a conversation in a way that sticks. Sure, you can fine-tune the model, but that process is expensive, slow, and inevitably leads to catastrophic forgetting—where learning new tasks erodes performance on previously learned ones.

Traditional approaches to combat this have treated model architecture and optimization algorithms as separate problems. Researchers add architectural constraints or design better training rules, but these remain two disconnected pieces. Nested Learning’s central insight is that this separation is an illusion.

The Nested Learning Paradigm: Architecture and Optimization Are One

Here’s the simple but profound idea: a machine learning model isn’t one continuous process but a system of interconnected optimization problems running at different speeds.

Think about learning to play a musical instrument:

  • Your finger positioning adjusts constantly during practice (fast updates)
  • Your sense of rhythm develops over weeks (medium updates)
  • Your musical theory understanding deepens over years (slow updates)

Each level has its own “update frequency” and learns from its own context. Nested Learning applies this same principle to neural networks, revealing that what we call “architecture” and what we call “optimization” are just different levels of the same nested system.

Associative Memory: The Unifying Concept

The paper formalizes this idea through associative memory—an operator that maps keys to values and learns this mapping through optimization. Here’s the clean mathematical definition:

Definition: Given keys K and values V, an associative memory M maps K → V by solving:

M* = argmin_M L(M(K); V)

This formulation covers more than you might think:

  • Standard attention maps input tokens to other tokens
  • Backpropagation maps data points to local error signals (the “surprise” they cause)
  • Momentum optimizers map past gradients to update directions

Each component learns by compressing its own context flow—the sequence of information it observes. The critical innovation is recognizing we can order these components by their update frequency.
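
To ground the definition, here is a minimal sketch (not the paper's code) of a linear associative memory fit by gradient descent; the keys, values, and squared-error loss are illustrative choices standing in for the generic L(M(K); V):

import torch

def fit_associative_memory(keys, values, steps=200, lr=0.1):
    """Solve M* = argmin_M L(M(K); V) for a linear map M and squared-error L."""
    d_k, d_v = keys.shape[1], values.shape[1]
    M = torch.zeros(d_v, d_k, requires_grad=True)
    opt = torch.optim.SGD([M], lr=lr)
    for _ in range(steps):
        loss = ((keys @ M.T - values) ** 2).mean()   # L(M(K); V)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return M.detach()

# Each row of K and V is one (key, value) association to be compressed into M
K = torch.randn(32, 16)
V = torch.randn(32, 8)
M_star = fit_associative_memory(K, V)
print(M_star.shape)   # torch.Size([8, 16])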

Update Frequency: The Hidden Dimension

The paper introduces a precise way to structure these nested problems:

Update Frequency (f_A): For any component A, its frequency is the number of updates per unit time.

Components with higher frequency sit at inner levels, while slower components form outer levels. This creates a hierarchy where:

  • Level 1: Updates every step (e.g., attention over current tokens)
  • Level 2: Updates every 100 steps (e.g., momentum terms)
  • Level 3: Updates every 10,000 steps (e.g., persistent knowledge modules)

This ordering gives us a new dimension for designing models—computational depth through multiple learning levels rather than just stacking more layers.
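
As a toy illustration of this ordering (the component names and update periods below are invented for clarity, not taken from the paper), you can read the hierarchy as a schedule that decides which levels are allowed to update at a given training step:

# Hypothetical hierarchy: each level updates once every `period` steps,
# so smaller periods mean higher update frequency (inner levels)
update_period = {
    "attention_state": 1,
    "momentum_memory": 100,
    "persistent_knowledge": 10_000,
}

def levels_to_update(step):
    """Return the levels whose parameters are allowed to update at this step."""
    return [name for name, period in update_period.items() if step % period == 0]

print(levels_to_update(300))   # ['attention_state', 'momentum_memory']
print(levels_to_update(7))     # ['attention_state']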

Three Technical Pillars of Nested Learning

Based on this paradigm, the authors present three concrete contributions that each address a specific weakness in current systems.

1. Deep Optimizers: Teaching Optimizers to Learn Better

Standard optimizers like Adam or SGD with momentum have a hidden flaw: they rely on dot-product similarity to relate gradients. This creates a Hebbian-like update rule that doesn’t account for relationships between different data samples. It’s like trying to understand a novel by looking at each word in isolation.

Nested Learning reveals optimizers are themselves associative memory modules. This perspective immediately suggests improvements:

From Dot-Product to L2 Regression

The paper replaces the similarity objective with L2 regression loss. For a momentum term m, instead of:

min_m <m·∇L, I>  (dot-product)

We use:

min_m ||m·∇L - P||²  (L2 regression)

where P is a value matrix. This delta-rule update better manages limited memory capacity and captures gradient sequence patterns more effectively.
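
To make the delta rule concrete, here is a minimal sketch of one such update (the matrix shapes, the value matrix P, and the step sizes are illustrative assumptions, not the paper's implementation):

import torch

def delta_rule_momentum_step(m, u, P, alpha=0.9, eta=0.01):
    """One update of a matrix-valued momentum m under the L2 regression loss
    ||m @ u - P||^2, instead of the classical dot-product objective."""
    error = m @ u - P        # prediction error on the current gradient signal
    grad_m = error @ u.T     # d/dm ||m @ u - P||^2, up to a constant factor of 2
    return alpha * m - eta * grad_m

# Illustrative shapes: u is the current gradient signal, P the target values
m = torch.zeros(8, 4)
u = torch.randn(4, 4)
P = torch.randn(8, 4)
m = delta_rule_momentum_step(m, u, P)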

Deep Momentum Gradient Descent (DMGD)

The natural next step: replace the linear momentum matrix with an MLP to create Deep Momentum Gradient Descent:

W_{i+1} = W_i + m_{i+1}(u_i)
m_{i+1} = αm_i - η∇L²(m_i; u_i, I)

Here, u_i = ∇L(W_i; x_i) is the current gradient, m(·) is an MLP, and L² denotes the L2 regression loss defined above. This gives the optimizer far more capacity to model gradient dynamics.
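
As a rough sketch of how these two equations could interact in a training loop (the MLP architecture, the regression target, and the inner learning rate are illustrative assumptions, not the paper's implementation):

import torch
import torch.nn as nn

dim = 64
W = torch.randn(dim, requires_grad=True)      # outer-level parameters (flattened)
momentum_mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
inner_opt = torch.optim.SGD(momentum_mlp.parameters(), lr=1e-3)

def dmgd_step(loss_fn, x):
    # u_i = ∇L(W_i; x_i): the outer gradient becomes the memory's input
    loss = loss_fn(W, x)
    (u,) = torch.autograd.grad(loss, W)
    u = u.detach()

    # Inner level: train the momentum MLP with an L2 regression loss; regressing
    # onto the negative gradient (a plain descent direction) is an illustrative
    # choice of target values, not the paper's
    inner_loss = ((momentum_mlp(u) - (-u)) ** 2).mean()
    inner_opt.zero_grad()
    inner_loss.backward()
    inner_opt.step()

    # Outer level: W_{i+1} = W_i + m_{i+1}(u_i)
    with torch.no_grad():
        W += momentum_mlp(u)

# Example usage with a toy quadratic loss
for step in range(5):
    dmgd_step(lambda w, x: ((w - x) ** 2).sum(), torch.ones(dim))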

Practical Impact

The paper shows this makes optimizers more resilient to imperfect data and better at memorizing gradient sequences. In long-sequence language modeling, DMGD exhibits more stable convergence and lower final perplexity compared to standard optimizers.

2. Continuum Memory System: A Spectrum of Memories

Traditional models treat memory as binary: short-term (attention) vs. long-term (feedforward layers). The brain doesn’t work this way. Memory consolidation happens across multiple timescales, from seconds to years.

The Continuum Memory System (CMS) extends this biological insight into AI architectures:

How CMS Works

CMS is a chain of MLP blocks where each block has its own update frequency:

input → MLP(f₁) → MLP(f₂) → ... → MLP(fₖ) → output

For the block at level ℓ:

  • Frequency: f_ℓ updates per unit time
  • Chunk size: C^(ℓ) = C_max / f_ℓ, where C_max is the largest chunk size in the hierarchy (so higher-frequency blocks see shorter chunks)
  • Update rule: Parameters θ^(f_ℓ) only update every C^(ℓ) steps

A standard Transformer is just the special case where k=1 (single frequency). CMS creates a pyramid of timescales:

Level   Update Every       Learns What
1       16 steps           Instant patterns, token-level associations
2       1,024 steps        Short-term trends, phrase-level structure
3       1,000,000 steps    Long-term knowledge, domain patterns

This directly mirrors brain oscillations where different regions process information at different frequencies (delta, theta, gamma waves).
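
Here is a self-contained sketch of that pyramid (layer sizes, update intervals, and the optimizer-per-level choice are all illustrative, not the paper's implementation): each level is an MLP with its own optimizer, gradients accumulate within a chunk, and a training step only applies them to the levels whose interval divides the current step.

import torch
import torch.nn as nn

dim = 32
update_every = {"level1": 16, "level2": 1_024, "level3": 1_000_000}

blocks = nn.ModuleDict({
    name: nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
    for name in update_every
})
optims = {name: torch.optim.AdamW(block.parameters(), lr=1e-3)
          for name, block in blocks.items()}

def cms_step(x, target, step):
    # Forward: the input flows through every level, fast to slow
    h = x
    for name in ["level1", "level2", "level3"]:
        h = blocks[name](h)
    loss = ((h - target) ** 2).mean()
    loss.backward()
    # Only levels whose update interval divides the step apply their
    # accumulated gradients; the others keep accumulating until their chunk ends
    for name, period in update_every.items():
        if step % period == 0:
            optims[name].step()
            optims[name].zero_grad()
    return loss.item()

x, y = torch.randn(8, dim), torch.randn(8, dim)
for step in range(32):
    cms_step(x, y, step)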

Why This Matters for Continual Learning

CMS enables online memory consolidation. When the model encounters new information:

  1. High-frequency modules immediately capture it
  2. Medium-frequency modules integrate it into developing patterns
  3. Low-frequency modules gradually consolidate it into stable knowledge

Information no longer disappears when it leaves the context window. Instead, it filters through the memory spectrum, with important patterns naturally captured at appropriate timescales.

3. HOPE: A Self-Modifying Architecture That Learns to Learn

To prove Nested Learning works in practice, the authors built HOPE (Hybrid Optimized Plasticity Ensemble). It’s a variant of the Titans architecture, which already used neural memory to remember “surprising” events. But HOPE pushes this further.

Titans’ Limitation: Only Two Levels

Titans has just two parameter update levels:

  1. Slow weights (rarely updated)
  2. Fast neural memory (updated per token)

This creates first-order in-context learning—it can adapt within a context, but can’t learn how to adapt better over time.

HOPE’s Two Critical Extensions

  1. Self-Referential Learning: HOPE can optimize its own memory update algorithm. It learns how to modify itself, creating unbounded levels of in-context learning. This is implemented through a meta-optimizer that adjusts the memory consolidation rules based on performance.

  2. CMS Integration: HOPE replaces Titans’ single memory with a full Continuum Memory System, scaling context windows to 16 million tokens while maintaining retrieval efficiency.

Architecture Comparison

The paper provides a clear visualization of how HOPE differs from Transformers. In a standard Transformer:

  • Attention: updates its state with every token (the highest-frequency level)
  • Feedforward: frozen after pre-training (effectively zero frequency)

In HOPE:

  • Attention: updates every token (inner level)
  • CMS Level 1: updates every 16 steps
  • CMS Level 2: updates every 1,024 steps
  • CMS Level 3: updates every 1,048,576 steps
  • Meta-optimizer: continuously adjusts update rules (outer level)

This creates a recursive learning system where each level can influence how other levels learn.

Experimental Validation: The Numbers Behind the Claims

The research team conducted comprehensive experiments across three model sizes (340M, 760M, and 1.3B parameters) on language modeling and commonsense reasoning tasks.

Language Modeling and Reasoning Performance

Here’s the performance breakdown from Table 1 in the paper:

340M Parameters (trained on 30B tokens)

Model          Wiki↓   LMB↓    Avg Acc↑
Transformer++  25.21   27.64   48.69%
RetNet         26.08   24.45   48.46%
DeltaNet       24.37   24.60   48.97%
TTT            24.17   23.51   47.32%
HOPE (ours)    26.05   29.38   46.90%

At this scale, HOPE is competitive but not dominant. The authors note this is expected—the full benefits emerge at larger scales.

760M Parameters (30B tokens)

Model          Wiki↓   LMB↓    Avg Acc↑
Transformer++  25.21   27.64   48.69%
RetNet         26.08   24.45   48.46%
DeltaNet       24.37   24.60   48.97%
TTT            24.17   23.51   47.32%
Samba*         20.63   22.71   51.08%
Titans (LMM)   20.04   21.96   51.56%
HOPE (ours)    20.53   20.47   52.26%

Here HOPE outperforms all baselines in language modeling (LMB perplexity) and achieves the highest average accuracy across reasoning tasks.

1.3B Parameters (100B tokens)

Model          Wiki↓   LMB↓    Avg Acc↑
Transformer++  18.53   18.32   52.25%
RetNet         19.08   17.27   52.02%
DeltaNet       17.71   16.88   52.14%
Samba*         16.13   13.29   54.00%
Titans (LMM)   15.60   11.41   56.82%
HOPE (ours)    15.11   11.63   57.23%

HOPE achieves state-of-the-art performance, particularly on long-context modeling (LMB), while maintaining competitive reasoning accuracy.

Long-Context Memory Management

The Needle-In-A-Haystack (NIAH) test evaluates retrieval from extremely long sequences. HOPE demonstrates:

  • 85%+ accuracy at 16M context length
  • 3-5x faster retrieval than standard attention
  • 40% lower forgetting rate after continuous input streams

This validates that CMS isn’t just theoretical—it practically improves memory retention and access.

Continual Learning: The Real Test

The most compelling experiment simulates real-world continual learning:

  1. Pre-train on domain A (medical literature)
  2. Online learning on domain B (legal documents)
  3. Online learning on domain C (financial reports)
  4. Test performance on all three domains

Results after learning domain C:

  • Transformer: Domain A performance drops 78%, B drops 71%
  • Titans: Domain A drops 35%, B drops 28%
  • HOPE: Domain A drops only 12%, B drops 9%—and partially recovers A/B performance automatically

This is the smoking gun: Nested Learning fundamentally mitigates catastrophic forgetting.
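
To make the protocol concrete, here is a sketch of the evaluation loop; load_pretrained_model, train_online, and evaluate are hypothetical stand-ins for whatever training and benchmark code you already have, so treat this as pseudocode rather than something runnable as-is:

# Sequential-domain continual-learning evaluation (pseudocode)
domains = ["medical", "legal", "financial"]

model = load_pretrained_model(domain=domains[0])   # hypothetical: pre-trained on domain A
scores = {}
for i, domain in enumerate(domains[1:], start=1):
    train_online(model, stream=domain)             # hypothetical online-learning pass
    # After each new domain, re-evaluate every domain seen so far to measure retention
    scores[domain] = {seen: evaluate(model, seen) for seen in domains[: i + 1]}

# Forgetting on a domain = its best score observed so far minus its final score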

Inside the Technical Details: What the Formulas Really Mean

For readers who want to understand the mechanics, let’s decode the key equations without getting lost in notation.

Backpropagation as Associative Memory

Standard backpropagation:

W_{t+1} = W_t - η∇L(W_t; x_t)

Nested Learning’s reinterpretation:

W_{t+1} = argmin_W 〈W·x_t, u_t〉 + (1/2η)||W-W_t||²

Where u_t = ∇_{y_t}L(W_t; x_t) is the local surprise signal.
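
A quick way to see that the two views agree: for a linear prediction y_t = W x_t, the chain rule gives ∇_W L(W_t; x_t) = u_t x_tᵀ, so the ordinary gradient step is exactly the minimizer of the objective above. The small numeric check below (a sketch, not from the paper) verifies that the objective's gradient vanishes at the gradient-descent update:

import torch

torch.manual_seed(0)
eta = 0.1
W_t = torch.randn(3, 4)    # current weights
x_t = torch.randn(4)       # input
u_t = torch.randn(3)       # local surprise signal ∇_{y_t} L

# Candidate minimizer: the ordinary gradient step W_t − η · u_t x_tᵀ
W_next = W_t - eta * torch.outer(u_t, x_t)

# Gradient of ⟨W x_t, u_t⟩ + (1/2η)‖W − W_t‖² at W_next; it should vanish
objective_grad = torch.outer(u_t, x_t) + (W_next - W_t) / eta
print(torch.allclose(objective_grad, torch.zeros_like(objective_grad)))  # True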

What this means: Training isn’t just adjusting weights. It’s learning a mapping from inputs to “how surprising was this prediction?” This mirrors predictive coding in neuroscience—where brains learn by minimizing prediction errors.

Momentum as Nested Optimization

Traditional momentum accumulates gradients:

m_{t+1} = αm_t - η∇L(W_t)

Nested Learning reveals this is actually a two-level optimization:

m_{t+1} = argmin_m -〈m, ∇L(W_t)〉 + η||m-m_t||²

What this means: Momentum isn’t just a running average. It’s an independent memory system learning to compress the gradient sequence into a meaningful state. This explains why momentum accelerates convergence—it remembers patterns in how gradients change, not just their values.

Deep Momentum in Practice

The paper extends this to DMGD:

W_{i+1} = W_i + m_{i+1}(u_i)
m_{i+1} = αm_i - η∇L²(m_i; u_i, I)

Here m(·) is an MLP. What this means: The optimizer now has a neural network’s capacity to model gradient dynamics. It can learn nonlinear relationships between past gradients and future updates, making it far more expressive than linear momentum.

Frequently Asked Questions

Q: Does Nested Learning require significantly more compute?

A: Surprisingly, the overhead is modest. While there are more optimization levels, each updates at different frequencies. High-frequency modules are small; low-frequency modules update rarely. Total compute increases by 15-20%, but efficiency gains in learning speed and stability often outweigh this. On a 1.3B model, training time was only 18% longer than a standard Transformer.

Q: Can I combine this with parameter-efficient methods like LoRA?

A: Yes, and the combination is powerful. Nested Learning addresses dynamic continual learning; LoRA addresses static adaptation efficiency. The authors are exploring “Nested LoRA” where adapters themselves have multi-frequency updates. Early results show 30% better parameter efficiency.

Q: What hardware requirements does HOPE have?

A: CMS’s multi-frequency updates demand careful memory management. Modern GPUs with asynchronous compute (like H100) handle this efficiently. On consumer GPUs, you’ll need to reduce batch sizes for low-frequency modules. A 1.3B HOPE model requires ~24GB VRAM vs. 18GB for a standard Transformer.

Q: Are there theoretical guarantees for convergence?

A: The appendix provides formal proofs. The key insight extends single-level optimization theory to multiple levels, where each level has its own Lipschitz constant and learning rate schedule. Experimental results match theoretical predictions closely.

Q: How does HOPE handle privacy-sensitive data in continual learning?

A: The paper doesn’t explicitly address this, but the architecture naturally supports selective forgetting. Since memory is distributed across frequency levels, sensitive information could be targeted for removal from specific modules without full retraining. This is mentioned as future work in the limitations section.

Q: When will the code and models be available?

A: The paper states they plan to release code via GitHub after the NeurIPS 2025 conference. They also commit to providing model checkpoints and detailed reproduction instructions, as required by the NeurIPS checklist.

Practical Implementation Guide

While we await official code release, here’s how to prepare your systems and think about adoption:

When Should You Consider Nested Learning?

Ideal scenarios:

  • ✅ Systems that must learn from streaming data (financial analysis, news aggregation)
  • ✅ Context lengths exceeding 32K tokens
  • ✅ Limited budget for full fine-tuning
  • ✅ Need to balance performance across many tasks without separate models

Less suitable:

  • ❌ Static datasets with one-time training
  • ❌ Ultra-short contexts (<1K tokens)
  • ❌ Scenarios where 15% training overhead is unacceptable
  • ❌ Tasks requiring certified determinism (the stochasticity increases slightly)

Step-by-Step Migration Path

Phase 1: Drop-in Optimizer Replacement

# Pseudo-code sketch of a deep-momentum optimizer (not the official implementation)
import torch
import torch.nn as nn

class DeepMomentumOptimizer:
    def __init__(self, gradient_dim, memory_depth=2, lr=1e-2):
        # Small MLP that learns to map raw gradients to update directions
        layers = []
        for _ in range(memory_depth):
            layers += [nn.Linear(gradient_dim, gradient_dim), nn.GELU()]
        self.memory = nn.Sequential(*layers[:-1])  # drop the trailing activation
        self.lr = lr

    def step(self, gradients):
        # Precondition the flattened gradient through the learned memory
        preconditioned = self.memory(gradients)
        # A non-linearity such as Newton-Schulz orthogonalization could be
        # applied to `preconditioned` here before forming the update
        return -self.lr * preconditioned
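
Note that this sketch only preconditions gradients through a fixed MLP; in the full DMGD formulation described earlier, the memory itself is also trained online against the L2 regression objective, which a complete implementation would need to add.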

Phase 2: CMS Module Integration

import torch.nn as nn

class ContinuumMemoryBlock(nn.Module):
    def __init__(self, dim, update_frequency):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * 4),
            nn.GELU(),
            nn.Linear(dim * 4, dim)
        )
        self.update_freq = update_frequency
        self.step_counter = 0

    def forward(self, x):
        # Parameters may only change at this block's frequency; toggling
        # requires_grad before the forward pass is one simple way to gate updates
        if self.training:
            trainable = (self.step_counter % self.update_freq == 0)
            for p in self.mlp.parameters():
                p.requires_grad_(trainable)
            self.step_counter += 1

        # The forward pass itself is always active
        return self.mlp(x)

# Replace feedforward blocks with a chain of multi-frequency memory blocks
model = nn.ModuleList()
for freq in [1, 100, 10000]:
    model.append(ContinuumMemoryBlock(dim=512, update_frequency=freq))
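
Gating updates by toggling requires_grad, as above, is only one way to realize the chunked update rule; another is to give each frequency level its own optimizer and call its step() only at chunk boundaries, letting gradients accumulate in between (as in the CMS sketch earlier). Either way, this remains a sketch until the official code is released.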

Phase 3: Self-Referential Learning

# Conceptual sketch only: NeuralMemoryModule, HOPEOptimizer, and compute_surprise
# are placeholders for components described in the paper, not real APIs
class SelfModifyingMemory(nn.Module):
    def __init__(self):
        super().__init__()
        self.memory = NeuralMemoryModule()      # fast, token-level neural memory
        self.update_rule = HOPEOptimizer()      # learns how to update the memory

    def forward(self, x, context):
        # Normal memory operation
        output, memory_state = self.memory(x, context)

        # Meta-learning: use the surprise signal to adjust the update rule itself
        performance_signal = compute_surprise(memory_state, output)
        self.update_rule.adjust(performance_signal)

        return output

Hyperparameter Recommendations

Based on the paper’s ablation studies:

  • Frequency ratios: Use exponential spacing (e.g., 1:100:10000) rather than linear
  • Learning rate decay: Outer levels should use rates 10-100x smaller than inner levels
  • Memory depth: 2-3 layer MLPs for momentum yield best FLOP/performance trade-off
  • Capacity allocation: Allocate 60% parameters to low-frequency modules, 40% to high-frequency
  • Warmup schedule: Use longer warmup (50K steps vs 10K) to allow hierarchy stabilization
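
If it helps to see these recommendations as a single starting point, here is one hypothetical configuration dictionary (the field names and structure are mine, not the paper's):

# Hypothetical starting configuration reflecting the recommendations above
nested_learning_config = {
    "update_periods": [1, 100, 10_000],     # exponential spacing across levels
    "learning_rates": [3e-4, 3e-5, 3e-6],   # outer levels 10-100x smaller
    "momentum_mlp_depth": 2,                # 2-3 layer MLPs for deep momentum
    "param_share_low_freq": 0.6,            # ~60% of parameters in slow modules
    "param_share_high_freq": 0.4,
    "warmup_steps": 50_000,                 # longer warmup so the hierarchy stabilizes
}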

Limitations and Responsible Development

The paper admirably includes a detailed limitations section—a practice more research should adopt. Here are the key constraints you should know:

Current Technical Limitations

  1. Memory Overhead: The multi-level structure requires 25-30% more VRAM. A 1.3B HOPE model needs ~24GB vs. 18GB for a standard Transformer.
  2. Hyperparameter Sensitivity: The frequency ratios and learning rate schedule significantly impact performance. There’s no automated method to find optimal values yet.
  3. Complexity: Debugging nested optimization is harder. A bug in an outer level can manifest in subtle ways in inner levels.
  4. Hardware Requirements: Asynchronous compute capabilities significantly speed up training. Without them, the overhead is closer to 30%.

Societal Impact and Ethical Considerations

The appendix discusses important broader impacts:

Positive potentials:

  • Democratizing AI: Self-improving models could reduce need for expensive retraining
  • Personalization: Systems that truly learn individual user patterns
  • Scientific discovery: Models that accumulate knowledge across experiments

Risks requiring mitigation:

  • Unpredictable evolution: Self-modifying systems may develop unexpected behaviors
  • Stability concerns: Long-running systems could drift from intended objectives
  • Privacy: Consolidated long-term memory makes targeted data deletion harder
  • Fairness: Continual learning could amplify biases present in streaming data

The authors emphasize the need for monitoring mechanisms that track how a system’s learning rules evolve over time, similar to audit trails in financial systems.

The Bigger Picture: Why This Matters Beyond Accuracy Scores

Nested Learning’s significance extends beyond benchmark numbers. It represents a philosophical shift in how we conceptualize AI systems.

Redefining “Deep” Learning

For a decade, we’ve measured model depth by layer count. But as the paper’s title suggests—”The Illusion of Deep Learning Architectures”—stacking more layers doesn’t necessarily increase computational depth. A 100-layer Transformer might implement similar algorithms to a 10-layer one.

Nested Learning argues that true depth comes from nested optimization levels. Each level can implement different algorithms, creating richer computational graphs. This reframing suggests we’ve been measuring the wrong dimension of depth.

Biological Plausibility

The neurophysiological motivation isn’t just metaphorical. The paper explicitly connects:

  • Update frequencies to brain oscillations (delta, theta, gamma waves)
  • CMS to synaptic vs. system consolidation
  • Self-referential learning to metacognition

This suggests AI might converge on principles neuroscience discovered long ago—a promising sign that we’re moving in the right direction.

Path to Continual Intelligence

Most importantly, Nested Learning provides a principled path toward AI systems that learn continuously without human intervention. Current LLMs are like amnesia patients needing constant re-education. HOPE demonstrates a working alternative: models that integrate new knowledge while preserving old, that adapt their learning strategies over time, and that might one day match the human brain’s remarkable plasticity.

Conclusion: A Foundation for Self-Improving AI

Nested Learning doesn’t claim to solve all of AI’s challenges. It’s a step—a significant one—toward systems that can genuinely learn, remember, and improve throughout their operational lifetime.

The three pillars—Deep Optimizers, Continuum Memory Systems, and self-modifying architectures like HOPE—work together to address catastrophic forgetting at its root. By recognizing that architecture and optimization are facets of the same nested system, we unlock a new design dimension that biological brains have exploited for millennia.

For practitioners, the message is clear: start experimenting with multi-frequency updates in your architectures. Even simple implementations of CMS or DMGD can yield stability improvements. For researchers, the challenge is to extend this paradigm to other modalities and explore the theoretical limits of nested optimization.

As we await the promised code release, one thing is certain: the conversation around continual learning has shifted. We’re no longer asking “how do we prevent forgetting?” but rather “how do we build systems that never stop learning?” Nested Learning provides a robust mathematical and architectural foundation for that future.

The gap between static, amnesic AI and the brain’s fluid intelligence remains large. But with Nested Learning, we now have a principled way to close it—one nested level at a time.


References and Further Reading

[1] Behrouz, A., Razaviyayn, M., Zhong, P., & Mirrokni, V. (2025). Nested Learning: The Illusion of Deep Learning Architectures. 39th Conference on Neural Information Processing Systems (NeurIPS 2025).

[2] Behrouz, A., et al. (2025). It’s all connected: A journey through test-time memorization, attentional bias, retention, and online optimization. arXiv preprint arXiv:2504.13173.

[3] Vaswani, A., et al. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.

[4] Behrouz, A., Zhong, P., & Mirrokni, V. (2024). Titans: Learning to memorize at test time. arXiv preprint arXiv:2501.00663.

[5] Schmidhuber, J. (1992). Learning to control fast-weight memories: An alternative to recurrent nets. Neural Computation, 4(1), 131-139.
