DeepSeek MODEL1 Revealed: FlashMLA Code Updates Hint at Next-Gen AI Model—How Will “Infinite Memory” Transform the Way We Use AI?
Summary
DeepSeek updated 114 files in its FlashMLA GitHub repository, with 28 references to a new MODEL1 model developed in parallel with the existing V3.2 series. MODEL1 introduces optimizations in KV cache layout, sparse attention mechanisms, and FP8 decoding, potentially incorporating Engram conditional memory technology for breakthrough long-context processing capabilities, expected to debut in the V4 flagship model launching mid-February.
What Exactly Did DeepSeek Update on GitHub?
In January 2026, around the one-year anniversary of DeepSeek-R1’s release, the DeepSeek team made a notable update to the FlashMLA repository on GitHub. The update spanned 114 files, but what truly excited the technical community was a new identifier appearing repeatedly throughout the code: MODEL1.
FlashMLA is DeepSeek’s core library specifically developed for optimizing attention mechanisms, providing foundational computational support for the DeepSeek-V3 and DeepSeek-V3.2-Exp models. In this update, MODEL1 appeared 28 times as an independent model identifier, clearly distinguished from V32 (the internal codename for DeepSeek-V3.2). This code-level separation indicates that MODEL1 isn’t merely a version iteration but rather an entirely new model architecture.
What Are the Technical Foundations of FlashMLA?
To understand the significance of MODEL1, we first need to grasp FlashMLA’s technical foundations. FlashMLA is DeepSeek’s library of efficient Multi-head Latent Attention (MLA) kernels, and it contains four core attention computation kernels:
Sparse Attention Kernels support two operational modes. During the prefill stage, they employ token-level sparse attention mechanisms; during the decoding stage, they similarly use token-level sparse attention but pair it with FP8-format KV cache to enhance efficiency.
Dense Attention Kernels provide separate support for prefill and decoding stages. These kernels perform exceptionally well on NVIDIA H800 SXM5 GPUs—the dense MLA decoding kernel achieves bandwidth of 3,000 GB/s under memory-bound configurations and reaches 660 TFLOPS of computational power in compute-intensive configurations.
What Technical Details Does MODEL1 Reveal in the Code?
Based on the FlashMLA code changes, MODEL1 introduces several key technical improvements:
KV Cache Layout Optimization represents one of the most significant changes. In the existing FP8 KV cache format, each token’s cache occupies 656 bytes: the first 512 bytes store the quantized NoPE portion (containing 512 float8_e4m3 values), the next 16 bytes store 4 float32 scale factors, and the final 128 bytes preserve the unquantized RoPE portion (64 bfloat16 values). MODEL1 likely introduces further optimizations to this layout.
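To make that byte layout concrete, here is a minimal PyTorch sketch of unpacking one token’s 656-byte cache entry (it requires a recent PyTorch build with float8 support). The helper name and the per-group scale assignment are illustrative assumptions; this is not FlashMLA’s actual API.

```python
import torch

# Byte layout of one token's FP8 KV-cache entry, as described above
# (the names below are illustrative, not FlashMLA's actual API):
#   bytes [  0, 512) : quantized NoPE part, 512 x float8_e4m3
#   bytes [512, 528) : 4 x float32 scale factors for dequantization
#   bytes [528, 656) : unquantized RoPE part, 64 x bfloat16
TOKEN_BYTES = 656

def unpack_token_cache(entry: torch.Tensor):
    """entry: uint8 tensor of shape (656,) holding one token's cache line."""
    assert entry.dtype == torch.uint8 and entry.numel() == TOKEN_BYTES
    nope_q = entry[:512].view(torch.float8_e4m3fn)    # 512 quantized values
    scales = entry[512:528].view(torch.float32)       # 4 scale factors
    rope   = entry[528:656].view(torch.bfloat16)      # 64 unquantized values
    # Dequantize: assume each scale covers one contiguous group of 128 values.
    nope = nope_q.to(torch.float32).view(4, 128) * scales[:, None]
    return nope.view(-1), rope

# Example: unpack a dummy (all-zero) cache line.
dummy = torch.zeros(TOKEN_BYTES, dtype=torch.uint8)
nope, rope = unpack_token_cache(dummy)
print(nope.shape, rope.shape)  # torch.Size([512]) torch.Size([64])
```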
Sparse Processing Mechanisms have also been upgraded. Current sparse attention uses an indices tensor to specify computation ranges. This tensor has a shape of (batch_size, seq_len_q, topk), where each index value follows the formula: page block index × page block size + intra-page offset. MODEL1 appears to introduce more flexible sparse patterns based on this foundation.
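As a purely illustrative sketch of what such an indices tensor looks like, the snippet below assembles one from hypothetical page-block indices and offsets. The variable names and the random selection are assumptions; in practice the top-k choice comes from the model’s sparse-attention logic.

```python
import torch

# Illustrative construction of the sparse-attention indices tensor described above.
# Shape: (batch_size, seq_len_q, topk); each entry encodes
#   index = page_block_index * page_block_size + intra_page_offset
batch_size, seq_len_q, topk = 2, 4, 8
page_block_size = 64

# Suppose a (hypothetical) selector has already chosen, for every query token,
# the top-k KV tokens to attend to, given as (page_block_index, offset) pairs.
page_block_index = torch.randint(0, 100, (batch_size, seq_len_q, topk), dtype=torch.int32)
intra_page_offset = torch.randint(0, page_block_size, (batch_size, seq_len_q, topk), dtype=torch.int32)

# Flatten the paged addresses into the single index value the kernel consumes.
indices = page_block_index * page_block_size + intra_page_offset

print(indices.shape)  # torch.Size([2, 4, 8])
```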
FP8 Decoding Capabilities have been strengthened. The existing token-level sparse MLA decoding kernel achieves 410 TFLOPS on H800 SXM5 and 350 TFLOPS on B200. MODEL1’s optimizations may further enhance these performance metrics.
What Exactly Is the Engram Technology Behind “Infinite Memory”?
The “infinite memory” concept circulating online actually refers to the Engram conditional memory technology proposed in DeepSeek’s latest research paper. The core innovation of this technology lies in decoupling computation from storage, enabling AI models to more efficiently search for and utilize foundational information.
How Does Engram Break Through Traditional Memory Limitations?
Traditional large language models face a fundamental contradiction when processing long contexts: model parameters are fixed, but the volume of information requiring memorization grows linearly with conversation length. When context length exceeds the model’s designed capacity, the model either forgets early information or cannot continue running due to insufficient memory.
Engram employs an innovative architectural design that separates “what to remember” from “how to remember it.” Specifically, through conditional memory mechanisms, it allows models to store vast amounts of information in external memory units rather than cramming everything into model parameters. When specific information is needed, the model generates queries based on the current context and precisely retrieves relevant content from external memory.
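The paper’s exact mechanism is not reproduced in the FlashMLA code, but the idea of conditional, query-driven retrieval from an external store can be sketched as follows. Everything here (the ExternalMemory class, the cosine-similarity lookup, the top-k size) is a simplified assumption for illustration, not DeepSeek’s implementation.

```python
import torch

class ExternalMemory:
    """Toy external memory: stores (key, value) pairs outside the model weights.

    A conceptual sketch of query-driven retrieval, not DeepSeek's Engram design;
    the class name, similarity metric, and top-k are assumptions.
    """
    def __init__(self, dim: int):
        self.keys = torch.empty(0, dim)
        self.values: list[str] = []

    def write(self, key: torch.Tensor, value: str):
        # Store information outside the model's parameters ("what to remember").
        self.keys = torch.cat([self.keys, key.unsqueeze(0)], dim=0)
        self.values.append(value)

    def read(self, query: torch.Tensor, topk: int = 3) -> list[str]:
        # Retrieve only the most relevant entries ("how to remember"):
        # cosine similarity between the current query and all stored keys.
        sims = torch.nn.functional.cosine_similarity(self.keys, query.unsqueeze(0), dim=-1)
        top = sims.topk(min(topk, len(self.values))).indices
        return [self.values[i] for i in top.tolist()]

# Usage: the model writes facts as it reads a long document, then later
# generates a query from the current context and reads back only what it needs.
dim = 16
mem = ExternalMemory(dim)
mem.write(torch.randn(dim), "Chapter 1: the protagonist fears open water.")
mem.write(torch.randn(dim), "Chapter 30: the protagonist moves to a coastal town.")
print(mem.read(torch.randn(dim), topk=1))
```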
This design brings three key advantages: First, storage capacity is no longer limited by GPU memory and can theoretically expand indefinitely; second, retrieval efficiency improves dramatically because the model doesn’t need to traverse all historical information; third, the model can more aggressively scale parameter size since parameter growth no longer directly increases memory pressure.
What Effects Do the Experimental Data Show?
DeepSeek’s research team validated Engram technology in a model with 27 billion parameters. Experimental results showed that models applying Engram improved performance on major industry benchmarks by several percentage points. While percentage-point improvements may seem modest, in the large model domain, this level of enhancement often signifies a qualitative leap, especially on long-context tasks.
Researchers specifically tested the model’s ability to handle extremely long documents. Results demonstrated that models equipped with Engram could still accurately reference earlier details and maintain logical consistency when processing texts of hundreds of thousands of words—something traditional architectures struggle to achieve.
What’s the Relationship Between MODEL1 and the Upcoming V4?
From a timeline perspective, MODEL1 surfaced just as news broke that DeepSeek plans to release its next-generation flagship model V4 in mid-February (around the Lunar New Year). Industry consensus suggests MODEL1 is likely either V4’s internal codename or one of its core technical components.
What Is V4’s Positioning?
According to public information, DeepSeek V4’s primary focus is code generation capabilities. This positioning is very clear—in the current context of rapidly developing AI-assisted programming, a flagship model specialized in code understanding and generation addresses enormous market demand.
Code generation tasks place extremely high demands on models’ long-context capabilities. A complete software project may contain hundreds of files and tens of thousands of lines of code. The model needs to understand the entire project architecture, dependencies between modules, naming conventions, and coding styles. This is precisely where Engram technology shines—it enables the model to “remember” the entire codebase and maintain consistency with existing code when generating new code.
How Do FlashMLA’s Performance Improvements Support V4?
FlashMLA’s new version released on April 22, 2025, delivered 5% to 15% performance improvements, achieving peak performance of 660 TFLOPS under compute-intensive workloads. While this improvement appears moderate, for application scenarios requiring ultra-long context processing, the cumulative effect is quite substantial.
More importantly, the new FlashMLA version introduces optimizations targeting both the SM90 (Hopper) and SM100 (Blackwell) architectures. The sparse decoding kernel achieves 410 TFLOPS on the H800 (SM90) and 350 TFLOPS on the B200 (SM100), while the sparse prefill kernel reaches 640 TFLOPS on the H800 and an impressive 1,450 TFLOPS on the B200. These figures provide a solid computational foundation for V4 to process ultra-large-scale codebases.
How Will Long-Context Capabilities Transform the Way We Use AI?
If MODEL1 truly implements Engram technology in production, the way we interact with AI will fundamentally change. Let’s examine several specific application scenarios.
How Will Writing Novels Be Different?
The biggest pain point when traditional AI assists with writing is “amnesia”: by the fiftieth chapter it may have forgotten the personality traits established for the protagonist in chapter one, leading to inconsistencies. Authors must repeatedly remind the AI about character backgrounds, plot threads, and events that have already occurred, which severely disrupts the creative flow.
AI equipped with long-context capabilities can ingest an entire novel draft at once, even if it contains hundreds of thousands of words. It will remember each character’s personality traits, growth trajectories, and relationship networks; remember the setup and foreshadowing of every plot line; remember the worldbuilding rules and background settings you’ve established. When you ask it to continue writing a chapter, it can naturally reference details from earlier chapters, making plot developments logical and maintaining consistent style throughout.
Furthermore, AI can help with global editorial work. For instance, if you want to change a key decision made by the protagonist, the AI can automatically check which subsequent plot points the change would affect and remind you which chapters need adjustment to stay logically coherent. This kind of whole-manuscript editing is simply impossible for short-context models.
What Level of Financial Report Analysis Becomes Possible?
Financial analysis is another scenario with extremely high demands on long-context capability. A complete public-company annual report typically spans two to three hundred pages, containing financial statements, management discussion, risk disclosures, footnotes, and vast amounts of other information. Traditional AI can only process fragments, forcing analysts to manually split documents, ask multiple questions, and then integrate the conclusions themselves.
AI with long-context capabilities can ingest complete financial reports at once, establishing connections between various financial metrics. It can compare multi-year data trends to discover whether revenue growth stems from new product contributions or price increases; cross-validate data consistency between different statements to identify potential accounting adjustments; combine footnote explanations to understand the impact of special accounting treatments.
When you ask “How is this company’s earnings quality?”, the AI won’t just look at net profit margins; it will delve into the cash flow statement to examine operating cash flow, check trends in accounts receivable and inventory, review the proportion of non-recurring gains and losses, and ultimately provide a comprehensive, evidence-based judgment. All of this analysis builds on a deep understanding of the complete financial report rather than on piecing together fragments.
What Changes Will Occur in Project Management and Code Development?
For software developers, the value of long-context capabilities is even more tangible. Modern software projects routinely contain hundreds of source code files and tens of thousands of lines of code. When you need to add new features or fix bugs, you often must understand interaction logic between multiple modules, data flow paths, and dependency relationships.
Traditional AI assistants can only “see” the few files you currently have open, knowing nothing about the overall project architecture. The suggestions it provides may seem reasonable locally but will conflict with existing designs when placed in the context of the entire project. Developers must spend considerable time explaining project structure to the AI—a process that’s inherently inefficient.
AI equipped with long-context capabilities can “read” the entire codebase, understanding the project’s layered architecture, module divisions, interface definitions, and coding standards. When you ask it to implement a new feature, it will consider which module the feature should reside in, which existing interfaces need calling, and whether it conforms to the project’s design patterns. The code it generates will automatically follow the project’s naming conventions, error handling approaches, and comment styles—just like a senior developer familiar with the project history.
Even more powerfully, AI can perform cross-file refactoring. For example, if you want to extract a common function for use by multiple modules, the AI can identify every location that requires modification, generate a consistent refactoring plan, and even update the related unit tests. This kind of global code manipulation can significantly improve development efficiency and code quality.
How Will Long-Term Conversations and Personalized Learning Work?
Long-context capabilities will also change the timescale of our AI interactions. Current AI conversations are typically limited to single sessions—each time you open a new conversation, you must reintroduce the background. AI with long-term memory can maintain conversation continuity across days, weeks, or even months.
Imagine you’re learning a new programming language, asking AI a few questions each day. AI with long-term memory will remember which concepts you’ve mastered, where your weaknesses lie, and your preferred learning style. On day ten, it won’t repeat explanations of basic knowledge you understood on day two. Instead, it will recommend what to learn next based on your progress and explain new concepts in ways you find easy to understand.
This personalized, continuous tutoring experience differs fundamentally from the traditional “question-and-answer” mode. AI is no longer a stateless tool but a genuine assistant that understands you and can continuously accompany your growth.
What Technical Implementation Challenges Remain?
While Engram technology shows tremendous potential in laboratory environments, applying it to production environments still faces several critical challenges.
How Can Retrieval Accuracy Be Guaranteed?
The core of any external memory mechanism is retrieval: the model needs to find the relevant information in memory based on the current question. The process resembles human recall but is far more demanding. The model must understand the true intent of a question, transform it into an effective query, and locate the few most relevant items in a memory bank that may contain millions of entries.
If retrieval is inaccurate, models might “confuse” information, applying Project A’s coding standards to Project B, or citing data from the wrong year when analyzing financial reports. These errors are more dangerous than complete forgetting because they’re subtle and difficult for users to detect.
The sparse attention mechanisms implemented in FlashMLA partially address this problem. Through the indices tensor, models can specify precisely which tokens require attention, avoiding wasted computation on irrelevant information. However, this is infrastructure-level optimization; the higher-level retrieval strategies built on top of it will still need extensive engineering practice to validate and refine.
How Much Will Inference Costs Increase?
Long-context capabilities don’t come free—they inevitably increase inference costs. Even with optimizations like sparse attention and external memory, processing ultra-long contexts still requires more computational resources.
FlashMLA’s performance data reveals some clues. The sparse decoding kernel’s computational power (410 TFLOPS) is noticeably lower than the dense decoding kernel’s peak performance (660 TFLOPS) because sparse computation introduces additional indexing overhead and memory access pattern complexity. For applications processing hundreds of thousands of tokens, these overheads accumulate substantially.
DeepSeek needs to find a balance between performance and cost. One possible strategy is dynamically adjusting context length based on task type: using short contexts for simple conversations to keep responses fast, and enabling full long-context capabilities only for complex analysis tasks. Another is adopting a tiered memory architecture, keeping hot, frequently accessed information in fast cache while storing colder information in slower but larger-capacity external storage.
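To make the tiered-memory idea concrete, here is a toy Python sketch: a small fast tier (standing in for GPU memory) backed by a larger slow tier, with least-recently-used demotion. The class, capacities, and eviction policy are illustrative assumptions, not anything DeepSeek has announced.

```python
from collections import OrderedDict

class TieredKVStore:
    """Toy two-tier cache: a small 'fast' tier (think GPU HBM) backed by a large
    'slow' tier (think CPU RAM or disk), with LRU demotion from fast to slow.
    Purely illustrative; not DeepSeek's design."""
    def __init__(self, fast_capacity: int):
        self.fast_capacity = fast_capacity
        self.fast: OrderedDict[str, bytes] = OrderedDict()  # hot, frequently accessed
        self.slow: dict[str, bytes] = {}                    # cold, larger capacity

    def put(self, key: str, value: bytes):
        self.fast[key] = value
        self.fast.move_to_end(key)
        if len(self.fast) > self.fast_capacity:
            cold_key, cold_val = self.fast.popitem(last=False)  # evict least recently used
            self.slow[cold_key] = cold_val

    def get(self, key: str) -> bytes:
        if key in self.fast:                 # fast path: already hot
            self.fast.move_to_end(key)
            return self.fast[key]
        value = self.slow.pop(key)           # slow path: promote back to the fast tier
        self.put(key, value)
        return value

store = TieredKVStore(fast_capacity=2)
for i in range(4):
    store.put(f"chunk-{i}", b"kv cache block")
print(list(store.fast), list(store.slow))    # two hot chunks, two demoted to the slow tier
```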
How Can Real-Time Response Speed Be Ensured?
For ordinary users, AI response speed directly impacts user experience. If processing long contexts makes every reply several seconds or even dozens of seconds slower, the technology’s practicality will be greatly diminished.
FlashMLA’s latest optimizations have done considerable work in this area. By precomputing tile-scheduler metadata, the decoding stage avoids redundant scheduling work when reading the KV cache. FP8 quantization compresses the cache to roughly half its original size, reducing memory bandwidth pressure. Together, these optimizations keep single-token generation latency within acceptable ranges.
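For reference, the decoding-side usage pattern published in the FlashMLA README looks roughly like the sketch below: tile-scheduler metadata is computed once per decoding step and then reused across all layers. Tensor preparation is elided, and newer releases add arguments for the sparse and FP8 paths, so treat the exact signature as indicative rather than definitive.

```python
from flash_mla import get_mla_metadata, flash_mla_with_kvcache

# q_i, kvcache_i, block_table, cache_seqlens, dv, s_q, h_q, h_kv, num_layers
# are prepared by the serving stack (shapes and dtypes per the FlashMLA README).

# Compute tile-scheduler metadata once per decoding step ...
tile_scheduler_metadata, num_splits = get_mla_metadata(
    cache_seqlens,        # (batch_size,) current KV length of each sequence
    s_q * h_q // h_kv,    # query tokens times query heads per KV head
    h_kv,                 # number of KV heads
)

# ... then reuse it for every layer, so the schedule is not recomputed per layer.
for i in range(num_layers):
    o_i, lse_i = flash_mla_with_kvcache(
        q_i, kvcache_i, block_table, cache_seqlens, dv,
        tile_scheduler_metadata, num_splits, causal=True,
    )
```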
However, when context length expands from a few thousand tokens to hundreds of thousands, even if single-token latency remains constant, overall time-to-first-token (TTFT) will increase significantly. This requires more aggressive parallelization strategies and cache preloading mechanisms. MODEL1’s optimizations to KV cache layout in the code may be designed to support more efficient preloading and parallel processing.
How Far From Code Exposure to Product Launch?
While GitHub code updates and research papers showcase exciting technical prospects, we must also rationally assess technology maturity and launch timelines.
What Is the Current Technology Maturity Level?
Engram technology currently remains in the research stage. Although it has been validated on a 27-billion-parameter model, this scale is far smaller than DeepSeek-V3’s 671 billion parameters. Scaling technology from small models to ultra-large-scale models often encounters new technical challenges.
FlashMLA’s code updates reflect engineering progress. From supporting dense attention to supporting sparse attention, and from supporting only the SM90 architecture to supporting both SM90 and SM100, each step has been accompanied by extensive performance tuning and stability testing. MODEL1’s emergence indicates the team is actively pushing toward productization, but a final release may still require several months of refinement.
Following DeepSeek’s release rhythm, V4 is planned for a mid-February launch. If MODEL1 is indeed V4’s core technology, we will soon see how these long-context capabilities perform in an actual product. However, MODEL1 might only be a technical preview branch of V4, with true large-scale application arriving in an even later version.
When Can Ordinary Users Access This?
Even if V4 launches on schedule, the rollout strategy for long-context features remains an open question. Given inference costs and server load, DeepSeek will likely open access in phases:
Initially, the feature may be available only to paying API users, with strict rate limits and usage quotas. This allows genuine user feedback to be collected while keeping server pressure under control.
In the medium term, they may introduce tiered services where ordinary users receive basic long-context quotas (such as processing a few long documents daily) while paying users enjoy higher quotas and priority.
In the long term, as the technology matures and hardware costs decline, long-context capabilities may gradually become a standard feature, much like multimodal capabilities are today.
How Does Competitor Progress Influence DeepSeek’s Strategy?
AI competition is shifting from model scale to application capabilities. OpenAI’s GPT-4, Anthropic’s Claude, and Google’s Gemini are all continuously expanding context windows. OpenAI’s GPT-4 Turbo supports 128K token contexts, while Anthropic’s Claude 2.1 reaches 200K tokens.
DeepSeek’s advantage lies in its open-source ecosystem and local deployment capabilities. If V4 truly achieves “infinite memory”-level long-context capabilities, this will be a significant differentiating advantage. How long this advantage persists depends on implementation technical details and competitors’ follow-up speed.
FlashMLA’s open-source release shows that DeepSeek is willing to share foundational technology, which helps it build a developer community and ecosystem. Whether the core algorithms and model weights will be fully open-sourced, however, still awaits subsequent release announcements.
How Should We Prepare for the Long-Context Era?
As AI users and beneficiaries, we can prepare for the approaching long-context era in several ways.
Rethinking Workflows
Long-context AI isn’t just an enhanced version of existing tools—it will change how we organize information and design workflows. For instance, in document management, we may no longer need to meticulously maintain classification directories and tagging systems because AI can precisely locate needed information within massive document collections.
In team collaboration, meeting minutes, project documentation, and code comments can be more natural and detailed because AI can automatically extract key information without requiring manual summarization and compression. This means we can devote more energy to creative work while delegating information management burdens to AI.
Cultivating New Skills
Effective collaboration with long-context AI requires some new skills. The first is learning to provide structured background information: even though the AI can remember everything, well-organized input still makes interaction more efficient.
The second is learning to validate AI outputs. The stronger long-context capabilities become, the more confident the AI may sound even when it is wrong. We need to cultivate critical thinking and sanity-check the information the AI provides, especially in professional domains.
The third is learning to leverage the AI’s long-term memory to design progressive tasks. For instance, when learning something new, you can deepen your understanding over multiple rounds of conversation, letting the AI adjust its explanations to your progress rather than expecting a single conversation to solve everything.
Focusing on Privacy and Data Security
Long-context capabilities mean AI will remember more personal information and work details. While this brings convenience, it also raises privacy concerns. We need to understand service providers’ data processing policies:
Where is memory data stored? Is it on users’ local devices, private clouds, or public clouds? Different storage locations imply different security levels and privacy protection degrees.
How long is memory data retained? Is it valid only during sessions, or permanently saved? Do users have rights to delete or export their memory data?
Will memory data be used for model training? If so, are there opt-out mechanisms? These questions need clarification before using long-context AI services.
For enterprise users, evaluating long-context AI’s impact on data compliance is also necessary. If AI remembers customers’ sensitive information, processing this memory data must comply with data protection regulations like GDPR and CCPA.
Frequently Asked Questions
What’s the relationship between MODEL1 and DeepSeek-V3.2?
MODEL1 and V3.2 (codename V32) are explicitly distinguished as different models in the code. From FlashMLA’s updates, they differ in KV cache layout, sparse attention processing methods, and other aspects. MODEL1 is likely the internal codename or core technical component of next-generation flagship model V4, expected to launch mid-February.
Is Engram technology’s “infinite memory” truly infinite?
“Infinite memory” is a figurative expression rather than literally infinite. By decoupling computation from storage, Engram supports context lengths that are, in theory, far longer than traditional architectures allow, but practical applications are still constrained by hardware resources, inference costs, and response speeds. Current experiments show the technique works on a 27-billion-parameter model; exactly how long a context it can support in practice will only become clear after productization.
What does FlashMLA’s performance improvement mean for ordinary users?
FlashMLA’s performance optimizations directly affect AI response speed and usage costs. A 5% to 15% performance improvement means the same hardware can support more concurrent users, or deliver faster responses at the same cost. For users who need to process long documents, these improvements accumulate into a noticeably better experience.
Will long-context AI be much more expensive than current AI?
Long-context processing will indeed increase computational costs, but specific increases depend on technical implementation. Through optimizations like sparse attention and tiered caching, cost increases can be controlled within reasonable ranges. Service providers may also adopt tiered pricing strategies where basic long-context functionality is included in standard subscriptions while ultra-large-scale context processing requires additional fees.
I want to try long-context features now—what options exist?
DeepSeek’s MODEL1 hasn’t officially launched yet, but some AI models supporting longer contexts are already available. Anthropic’s Claude 2.1 supports 200K token contexts, while OpenAI’s GPT-4 Turbo supports 128K tokens. While these aren’t yet “infinite memory” level, they’re sufficient for most practical applications. You can first familiarize yourself with long-context usage patterns using these tools while preparing for DeepSeek’s new model launch.
Can the open-source community access these technologies?
DeepSeek has maintained a friendly attitude toward the open-source community—FlashMLA itself is open-sourced under the MIT license. While complete model weights’ open-source status remains uncertain, core attention mechanism implementation code is already public, enabling the technical community to conduct research and improvements based on this code. Additionally, multiple domestic and international hardware manufacturers have adapted FlashMLA, including MetaX, Moore Threads, Hygon DCU, Intellifusion, Iluvatar Corex, and AMD Instinct, providing diverse hardware choices for open-source deployment.
