MemFlow: How to Stop AI-Generated Long Videos from “Forgetting”? A Deep Dive into a Breakthrough Memory Mechanism
Have you ever used AI to generate a video, only to be frustrated when it seems to forget what happened just seconds before? For example, you ask for “a girl walking in a park, then she sits on a bench to read,” but the girl’s outfit changes abruptly, or she transforms into a different person entirely? This is the notorious “memory loss” problem plaguing current long-form video generation AI—they lack long-term consistency, struggling to maintain narrative coherence.
Today, we will delve into a recent groundbreaking study from researchers at The University of Hong Kong, Kuaishou Technology’s Kling Team, and The Hong Kong University of Science and Technology (Guangzhou): MemFlow. This technology proposes a novel “Flowing Adaptive Memory” mechanism designed to enable AI to generate ultra-long videos like a human director—firmly remembering characters, scenes, and plot logic while maintaining high generation speed. Let’s unpack how this technology works and why it’s a significant leap forward.
The Achilles’ Heel of Long Video Generation: The Memory-Efficiency Paradox
In recent years, Text-to-Video (T2V) models have made remarkable strides in quality. From short clips of a few seconds to the current pursuit of cinematic-length narratives, the boundaries of the technology are constantly expanding. However, generating a coherent video lasting a minute or more remains a formidable challenge.
The core conflict lies in the memory-efficiency paradox:
- To Remember Well: The model needs to continuously review all previously generated frames, understanding the continuity of character poses, scene layouts, and lighting changes. This requires immense computational resources and memory, akin to asking you to remember every detail of an entire book before writing the next chapter—an overwhelming burden.
- To Compute Fast: For real-time generation, models typically predict the next frame based only on the most recent few frames (the “local context window”). This is like writing a long story by looking only at the previous sentence; it’s easy to veer off track, forget earlier setups, and cause characters to “mutate” or scenes to “jump-cut.”
Existing solutions, such as simply using the first frame as fixed memory or employing fixed compression schemes for historical frames, are like giving AI a limited-capacity, unorganized hard drive. When the storyline becomes complex, requiring scene transitions or introducing new elements, AI cannot quickly retrieve the truly relevant memory fragments from this “drive,” leading to narrative chaos.

Figure: The overall framework of MemFlow. It ensures long-video coherence through dynamic memory retrieval and updating during autoregressive generation.
MemFlow’s “Smart Memory”: Narrative Adaptive Memory (NAM)
The core innovation of MemFlow is endowing AI with an “intelligent, flowing” memory system, called Narrative Adaptive Memory (NAM). This system no longer passively stores all history but actively “recalls” based on current needs.
Imagine the video generation process as filming a TV series. Each time a new 5-second clip (an episode) is generated, the director (the AI) needs to do two things:
1. Semantic Retrieval: Finding “Reference Footage” Based on the “Script”
Before filming a new episode, the director has the script for it (the text prompt). They use this script text to search the archive of past footage (the memory bank) to find the semantically most relevant clips.
- Technical Implementation: The model computes attention scores between the word embeddings of the current text prompt and the visual feature embeddings of each frame in the memory bank. A higher score indicates that historical frame is more relevant to the content to be generated.
- Simplified Understanding: Relevance Score = Match(New Script Keywords, Old Frame Description)
Thus, when the script switches from “a girl walking in a park” to “she sits on a bench to read,” the system can accurately retrieve the visual features of “that girl” and “that park” from earlier generations, instead of arbitrarily generating a new girl.
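To make this concrete, here is a minimal PyTorch-style sketch of the retrieval step. The function name, tensor shapes, and the token-averaging heuristic are illustrative assumptions, not the paper’s exact implementation:

```python
import torch

def retrieve_relevant_memory(prompt_emb: torch.Tensor,
                             memory_keys: torch.Tensor,
                             top_m: int = 4) -> torch.Tensor:
    """Score each stored memory frame against the new prompt and keep the top-m.

    prompt_emb:  (num_text_tokens, dim)  embeddings of the current text prompt
    memory_keys: (num_mem_frames, dim)   one key vector per frame in the memory bank
    Returns indices of the most relevant memory frames (hypothetical shapes).
    """
    # Cross-attention-style relevance: each text token scores every memory frame.
    scores = prompt_emb @ memory_keys.T                    # (text_tokens, mem_frames)
    # Average over text tokens to get one relevance value per frame.
    frame_relevance = scores.softmax(dim=-1).mean(dim=0)   # (mem_frames,)
    top_m = min(top_m, memory_keys.shape[0])
    return frame_relevance.topk(top_m).indices
```

In the actual model, these scores fall out of the cross-attention computation it already performs, so retrieval adds little extra cost.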
2. Redundancy Removal & Updating: Efficiently Archiving the “Previous Episode”
After filming an episode, the director doesn’t dump all shots (maybe dozens) from that episode into the archive, which would fill up quickly. Instead, they intelligently select the most representative frame of that episode (e.g., the opening keyframe) as its “summary” for storage.
- Technical Implementation: Leveraging the high temporal redundancy within short video clips, it directly selects the KV cache (an efficient intermediate representation in neural networks) of the first frame from the previously generated chunk as its prototype, storing it in the memory bank.
- Benefit: Dramatically compresses memory capacity while retaining core visual and contextual information.
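As a rough illustration of this update rule, here is a toy memory bank that keeps only the first-frame KV cache of each generated chunk. The class name, shapes, and the eviction cap are assumptions made for the sketch, not the paper’s implementation:

```python
import torch

class NarrativeMemoryBank:
    """Toy memory bank: stores one prototype (first-frame KV cache) per chunk."""

    def __init__(self, max_frames: int = 16):
        self.keys, self.values = [], []
        self.max_frames = max_frames  # assumed cap to keep the bank lightweight

    def update(self, chunk_keys: torch.Tensor, chunk_values: torch.Tensor) -> None:
        # chunk_keys / chunk_values: (frames_per_chunk, tokens, dim) KV cache of
        # the chunk just generated. Because frames within a short chunk are highly
        # redundant, keep only the first frame as the chunk's prototype.
        self.keys.append(chunk_keys[0])
        self.values.append(chunk_values[0])
        if len(self.keys) > self.max_frames:  # simple FIFO eviction (assumption)
            self.keys.pop(0)
            self.values.pop(0)

    def as_tensors(self):
        # Stack prototypes for attention: (num_chunks, tokens, dim) each.
        return torch.stack(self.keys), torch.stack(self.values)
```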
Through these two steps—”Semantic Retrieval” and “Redundancy Removal”—MemFlow’s memory bank remains highly relevant and lightweight, ensuring that when generating each new clip, the AI can invoke the most pertinent historical context to maintain long-term consistency of characters, objects, and scenes.

Figure: In a 60-second generation task with multiple prompt switches, MemFlow maintains character consistency, while other methods suffer from duplicated characters, disappearance, or scene drifting.
The Secret to Speed: Sparse Memory Activation (SMA)
Introducing dynamic memory makes the system smarter, but could it slow down generation? MemFlow’s second key technology, Sparse Memory Activation (SMA), was designed to address exactly this concern.
Imagine a director who, despite finding 10 relevant clips in the archive, might only intently reference the 2-3 most relevant ones when filming a specific shot. SMA does precisely this.
- How it Works: At each computational step for generating the current frame, the model assesses the relevance of all frames in the memory bank to the visual content currently being generated.
- Dynamic Filtering: It only activates (i.e., involves in computation) the top-k most relevant memory frames, ignoring the less relevant ones.
- Simplified Understanding: The current visual query vector is dot-producted with the key vectors of each frame in memory, and the top-k scoring frames are selected.
This process can be represented as:
Activated Memory ≈ Attention_Computation(Current_Query, Top-K(Memory_Keys), Top-K(Memory_Values))
By doing this, the scope of attention computation shrinks from the entire large memory bank to a small, most relevant subset, significantly reducing computational overhead with almost no loss in generation quality.
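A hedged PyTorch sketch of this top-k filtering follows; the frame-scoring heuristic, shapes, and names are assumptions made for illustration, whereas the paper operates on the model’s internal KV caches:

```python
import torch
import torch.nn.functional as F

def sparse_memory_attention(query: torch.Tensor,
                            mem_keys: torch.Tensor,
                            mem_values: torch.Tensor,
                            k: int = 3) -> torch.Tensor:
    """Attend only over the k most relevant memory frames.

    query:      (q_tokens, dim)              queries of the frame being generated
    mem_keys:   (mem_frames, f_tokens, dim)  per-frame key cache
    mem_values: (mem_frames, f_tokens, dim)  per-frame value cache
    """
    # 1) Cheap per-frame relevance score: mean key vector vs. mean query vector.
    frame_scores = mem_keys.mean(dim=1) @ query.mean(dim=0)   # (mem_frames,)
    k = min(k, mem_keys.shape[0])
    top_idx = frame_scores.topk(k).indices

    # 2) Full attention, but only over the activated subset of memory.
    keys = mem_keys[top_idx].flatten(0, 1)                    # (k * f_tokens, dim)
    values = mem_values[top_idx].flatten(0, 1)
    attn = F.softmax(query @ keys.T / keys.shape[-1] ** 0.5, dim=-1)
    return attn @ values                                      # (q_tokens, dim)
```

The attention cost now scales with k rather than with the size of the whole memory bank, which is why the added overhead stays small.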

Figure: Qualitative comparison of different memory mechanisms. “No memory” causes scene jumps; “Fixed memory” only remembers the beginning; “Full NAM memory” works best but is slower; “NAM+SMA” achieves nearly the best quality while improving efficiency.
How Does It Perform? Evidence from Experiments and Data
Researchers comprehensively evaluated MemFlow on several standard benchmarks, and the results are impressive.
Performance on Multi-Prompt Interactive Long Video Generation
This is MemFlow’s primary battlefield. The test simulates a scenario where a user continuously inputs new instructions to guide the video narrative (e.g., “girl walking” -> “she sits down” -> “starts reading”).
Key Data (See Table 1 in the paper):
- Overall Quality Score: 85.02, outperforming all compared mainstream long video generation models (e.g., SkyReels-V2, Self-Forcing, LongLive).
- Aesthetic Score: 61.07, ranking first, indicating its generated frames are more visually pleasing and that it effectively mitigates error accumulation over long sequences.
- Semantic Alignment (CLIP Score): In the latter half of the video (30-60 seconds), MemFlow’s CLIP scores are better than, or at least on par with, the best baseline, proving its narrative coherence and prompt-following ability remain robust over long distances.
- Generation Speed: Achieves 18.7 FPS on a single NVIDIA H100 GPU, only about 7.9% slower than the memory-free baseline, striking an outstanding balance between efficiency and performance.
Performance on Single-Prompt Long Video Generation
Even in conventional long video generation tasks without frequent prompt switches, MemFlow’s advantages are equally evident.
Key Data (See Table 4 in the paper, 30-second generation):
- Overall Quality Score: 84.51, leading other compared models.
- Semantic Score: 78.87, a substantial lead, directly benefiting from its text-retrieval-based memory mechanism, which better understands and maintains the semantic context throughout the entire video.
Ablation Studies: Every Component is Essential
Ablation experiments verify the necessity of each component of NAM and SMA (See Table 3 in the paper):
- No Memory: Fastest speed (23.5 FPS), but lowest subject and background consistency scores; narratives easily break.
- Fixed Memory (e.g., LongLive): Decent speed (20.3 FPS); it can remember elements from the video’s beginning but fails for later introduced or switched elements.
- Full NAM (without SMA): Highest consistency scores, but speed drops to 17.6 FPS.
- NAM + SMA (Full MemFlow): Maintains nearly the best consistency of NAM (98.01 subject consistency) while boosting speed back to 18.7 FPS, fulfilling the design goal.
MemFlow At-a-Glance
For a clearer understanding of MemFlow’s core, we summarize its key information below:
| Component | Name | Core Function | Problem It Solves |
|---|---|---|---|
| Memory Mechanism | Narrative Adaptive Memory (NAM) | 1. Semantic Retrieval: Uses current text prompt to find relevant frames from history. 2. Redundancy Removal: Uses the first frame of the previous chunk as its prototype to update memory. | AI “cannot find” or “misuses” historical references when generating new clips, causing inconsistency. |
| Acceleration Mechanism | Sparse Memory Activation (SMA) | During attention computation, only activates the Top-K memory frames most relevant to the current generation. | Extra computational burden introduced by dynamic memory, impacting generation efficiency. |
| Core Advantage | – | Consistency, Semantic Coherence, High Efficiency | The memory-efficiency paradox in long video generation. |
| Measured Speed | – | 18.7 FPS on a single H100 GPU. | Demonstrates potential for practical application. |
| Compatibility | – | Can be integrated with any streaming video generation model supporting KV cache. | Facilitates migration and application to existing frameworks. |
Potential Applications and Future Outlook
The capabilities demonstrated by MemFlow open doors to a range of exciting applications:
- Interactive Films & Games: Players or viewers could change the plot direction in real time using natural language, with AI generating coherent follow-up video based on instructions.
- Ultra-Long Pre-visualization for Film/TV: Directors and screenwriters could rapidly generate consistent storyboards or animatics lasting several minutes.
- Personalized Long-Form Content Creation: Users could describe a series of events for AI to generate a complete, character-consistent personal story short film.
- Educational & Corporate Training Videos: Automating the generation of long videos demonstrating complex processes or case studies, ensuring continuity of the subject and environment.
The significance of this work lies in its approach to endowing AI with “memory” capability that aligns more closely with narrative logic at a mechanistic level, rather than merely “forcing” performance through increased model parameters or data volume. As such technologies mature, AI video generation will truly evolve from “impressive clips” to “credible stories.”
Frequently Asked Questions (FAQ) About MemFlow
Q1: Is MemFlow a brand-new Text-to-Video model?
A: Not exactly. MemFlow is essentially a memory-augmentation module. It is designed to be integrated into existing streaming video generation frameworks based on autoregressive and diffusion models (e.g., the Wan2.1-T2V model used in the paper). You can think of it as an “intelligent RAM module” added to existing AI video models.
Q2: How does it understand “semantic relevance”? Does it use another large model?
A: No additional model is needed. MemFlow cleverly utilizes the internal cross-attention mechanism of the video generation model itself. Text prompts are internally converted into a set of query vectors, while visual features of historical frames are stored as key-value pairs. Calculating the attention scores between these two is inherently a process of measuring semantic relevance. MemFlow directly leverages this readily available, precise relevance signal.
Q3: Does “Sparse Memory Activation” discard important information?
A: Experimental data in the paper shows that with a reasonable Top-K selection (e.g., activating only the most relevant portion of memory), generation quality is almost unaffected. This is because the filtered-out memory frames have low relevance to the current generation content and may even contain distracting information. Selective focusing actually helps the model absorb useful information from history more clearly, similar to human “selective memory.”
Q4: Can this technology be tried out now?
A: The research team has open-sourced the project code on GitHub (https://github.com/KlingTeam/MemFlow). This means any interested researcher or developer can examine, reproduce, or build upon it. However, turning it into a product ready for direct public use typically requires further engineering refinement and integration.
Q5: Does MemFlow have high hardware requirements?
A: The most efficient results reported in the paper were achieved on an NVIDIA H100 GPU, a top-tier data-center grade GPU. For broader deployment, further performance testing and optimization on consumer-grade graphics cards (like RTX 4090, etc.) would be needed. However, its design goal of incurring only about 8% speed penalty demonstrates a solid efficiency foundation.
Q6: Can it generate videos of any length?
A: In theory, due to its autoregressive and memory-updating mechanism, MemFlow can generate videos continuously. The paper primarily evaluated sequences up to 60 seconds. The generation length is primarily limited by GPU memory (for storing the growing memory bank and intermediate states) and potential error accumulation. However, MemFlow’s dynamic memory management mechanism is precisely designed to delay error accumulation and support longer generation.