DeepSeek UE8M0 FP8 Optimization: A Critical Breakthrough in the Synergy Between Domestic AI and Semiconductors

In today’s rapidly evolving field of artificial intelligence (AI), the efficiency of model training and the cost of deployment have become core concerns for the industry. Floating-point numbers— the fundamental way computers process decimals— play a direct role in determining an AI system’s precision, speed, and resource consumption. In recent years, low-precision floating-point formats, particularly 8-bit floating-point (FP8), have emerged as a key solution for balancing performance and efficiency. Among these innovations, the UE8M0 FP8 format developed by the Chinese team at DeepSeek stands out for its unique design philosophy and strategic positioning, marking an important milestone in the collaborative development of domestic AI and semiconductor industries.

1. Floating-Point Numbers: The “Precision vs. Efficiency” Balancing Act in AI Computing

To understand the significance of FP8 and UE8M0, it first helps to clarify the role of floating-point numbers in computing. Simply put, floating-point numbers are the “universal language” computers use to represent decimal values. They consist of three core components:

  • Sign bit: A single bit that indicates whether the value is positive (0) or negative (1);
  • Exponent: Determines the “magnitude” of the value— similar to the “power of 10” in scientific notation— and affects the dynamic range (the largest and smallest values that can be represented);
  • Mantissa: Controls the “fineness” of the value— like the “significant digits” in scientific notation— and influences representation precision.

For example, if we use a floating-point number to represent “0.3952”, the sign bit would be 0 (since it’s positive), the exponent would determine whether the number is closer to 0.1 or 1, and the mantissa would define details like whether the value is 0.39 or 0.3952.
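The same decomposition can be seen directly in code. The short Python sketch below is only an illustration using standard 32-bit IEEE 754 floats (Python has no built-in FP8 type): it unpacks 0.3952 into its sign, exponent, and mantissa fields and then reassembles the value from them.

```python
import struct

def fp32_fields(x: float):
    """Split an IEEE 754 single-precision float into sign, exponent, and mantissa bits."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]  # raw 32-bit pattern
    sign = bits >> 31                   # 1 bit: 0 = positive, 1 = negative
    exponent = (bits >> 23) & 0xFF      # 8 bits, stored with a bias of 127
    mantissa = bits & 0x7FFFFF          # 23 bits: the fractional "significant digits"
    return sign, exponent, mantissa

sign, exp, man = fp32_fields(0.3952)
# Reassemble the value: (-1)^sign * 2^(exponent - 127) * (1 + mantissa / 2^23)
value = (-1) ** sign * 2 ** (exp - 127) * (1 + man / 2 ** 23)
print(sign, exp - 127, round(value, 4))   # 0, -2, 0.3952
```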

The “bit width” of a floating-point number directly impacts performance and cost:

  • More bits (e.g., 32-bit FP32, 16-bit FP16) mean a longer mantissa, higher precision, but also greater memory usage, more computing power consumption, and higher bandwidth requirements during data transfer;
  • Fewer bits (e.g., 8-bit FP8) may reduce precision, but they significantly cut memory usage and computing costs— making them ideal for training and deploying large-scale AI models.

In the AI field, model parameters often reach billions or even hundreds of billions. Choosing the right floating-point format is therefore a “make-or-break” decision. The industry’s key challenge— how to perform computations with fewer bits while keeping precision loss within acceptable limits— has made FP8 the focus of intensive research and development.

2. Mainstream FP8 Formats: Evolution of Low-Precision Computing Through NVIDIA’s Technical Path

In the global AI hardware ecosystem, NVIDIA GPUs have long held a dominant position, and their exploration of FP8 has provided valuable insights for the entire industry. Currently, NVIDIA GPUs support two mainstream FP8 formats: E4M3 and E5M2.

  • E4M3: 4 bits allocated to the exponent, 3 bits to the mantissa (total 8 bits);
  • E5M2: 5 bits allocated to the exponent, 2 bits to the mantissa (total 8 bits).

The difference between these two formats lies in their trade-off between “dynamic range” and “precision”:

  • E5M2 uses more bits for the exponent, allowing it to represent larger or smaller values (wider dynamic range);
  • E4M3 dedicates more bits to the mantissa, resulting in relatively higher precision.

To address the limited dynamic range of FP8 (which can lead to value overflow), NVIDIA has developed a series of optimization strategies, such as “per-tensor scaling” and “per-block scaling”. In simple terms, these techniques dynamically adjust the value range based on data distribution to prevent overflow. Additionally, NVIDIA’s Tensor Cores have been updated with dedicated FP8 instruction sets, enabling high-end GPUs like the H100 to fully leverage the computing advantages of FP8.
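The core idea of per-tensor scaling can be sketched in a few lines. The example below is a simplified illustration, not NVIDIA's actual implementation: it computes one scale factor per tensor so that the largest magnitude lands exactly at E4M3's largest finite value (448), which is what keeps values from overflowing before the cast to FP8.

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite value representable in E4M3

def per_tensor_scale(x: np.ndarray):
    """Illustrative per-tensor scaling: map a tensor into E4M3's representable range.

    Production libraries add refinements such as scale histories ("delayed scaling");
    this sketch only shows the core idea."""
    amax = np.abs(x).max()                        # largest magnitude in the tensor
    scale = E4M3_MAX / max(amax, 1e-12)           # factor that maps amax onto E4M3_MAX
    x_scaled = np.clip(x * scale, -E4M3_MAX, E4M3_MAX)  # this is what would be cast to FP8
    return x_scaled, scale                        # keep the scale to undo the mapping later

activations = np.random.randn(4, 8).astype(np.float32) * 1000.0  # values far beyond FP8's range
fp8_ready, scale = per_tensor_scale(activations)
recovered = fp8_ready / scale                     # dequantize by dividing out the same scale
```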

In NVIDIA’s next-generation Blackwell architecture, the company has further introduced “microscaling formats”, including MXFP8 (8-bit), MXFP6 (6-bit), and MXFP4 (4-bit). Research data shows that an 800-million-parameter model using the MXFP8-E4M3 format— combined with optimized value conversion strategies— can achieve training results nearly identical to those of the traditional 16-bit brain floating-point format (BF16). This indicates that MXFP8 is becoming the preferred solution for balancing performance and precision on the Blackwell platform.

These technical advancements demonstrate that low-precision floating-point numbers have moved from “experimental edge cases” to “mainstream choices”. The deep synergy between hardware and software is the key to their successful implementation.

3. DeepSeek UE8M0 FP8: An Alternative Low-Precision Design

Unlike NVIDIA’s technical path, the UE8M0 FP8 format proposed by the Chinese team at DeepSeek for its V3.1 model takes a “minimalist” approach.

The design logic of UE8M0 is straightforward: all 8 bits are allocated to the exponent, with no bits for the mantissa (the leading “U” indicates an unsigned format, so there is no sign bit either). This means UE8M0 completely abandons the precision a mantissa would provide, instead concentrating its entire bit budget on the exponent to achieve the maximum possible dynamic range.

We can understand this difference through a concrete example: when representing the value 0.3952, formats like E4M3 and E5M2 would use their mantissas to approximate this value as closely as possible. In contrast, UE8M0— without a mantissa— can only represent it as the nearest “integer power of two” (e.g., 0.5). While UE8M0 clearly suffers greater precision loss, this “extreme” design offers unique advantages:

  • Simpler hardware implementation: Eliminating the need to handle complex mantissa computations reduces chip design complexity, making it easier to adapt to domestic semiconductor manufacturing processes;
  • Maximized dynamic range: An 8-bit exponent covers a wider value range, reducing calculation errors caused by value overflow;
  • Flexibility for domestic ecosystems: Defining the format at the model level avoids reliance on foreign hardware format standards, paving the way for synergy between domestic AI and chip development.
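A few lines of Python make the 0.3952 example above concrete. This is only an illustration of what an exponent-only format can express, not DeepSeek's actual kernel code: every value is snapped to the nearest power of two (nearest in the base-2 logarithm).

```python
import math

def round_to_power_of_two(x: float) -> float:
    """Snap a positive value to the nearest power of two.

    With no mantissa bits, an exponent-only format such as UE8M0 can only
    express values of the form 2^k; everything else must be rounded."""
    if x <= 0:
        raise ValueError("this illustration covers positive values only")
    k = round(math.log2(x))   # round the base-2 logarithm to the nearest integer exponent
    return 2.0 ** k

print(round_to_power_of_two(0.3952))   # 0.5, as in the example above
print(round_to_power_of_two(100.0))    # 128.0: large values are still covered, just coarsely
```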

This design is not a “compromise” but a strategic choice grounded in the current state of the domestic industry. While domestic chip computing power still lags behind foreign alternatives, format innovation on the software side lowers the barrier for hardware adaptation and accelerates the deployment of domestic AI technologies.

4. Practical Value of FP8 and UE8M0: Advantages and Necessary Trade-Offs

Whether discussing mainstream FP8 formats or UE8M0, the core goal of low-precision design is to balance “efficiency” and “precision” within controllable limits. Their practical value manifests across multiple dimensions:

4.1 Memory and Bandwidth: Significantly Reducing Resource Consumption

AI model training and inference require frequent reading and writing of parameters and intermediate results. Memory usage and data transfer bandwidth are critical bottlenecks in this process. Compared to 16-bit FP16, 8-bit FP8 directly reduces memory usage by 50% while halving data transfer volume. This brings three key benefits:

  • Under the same hardware conditions, larger models can be supported (e.g., expanding from 10 billion parameters to 20 billion);
  • Higher parallelism (processing more data simultaneously) or larger batch sizes, improving training efficiency;
  • For bandwidth-constrained scenarios (e.g., edge device deployment), FP8 drastically reduces transfer pressure.
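A back-of-the-envelope calculation shows where the 50% figure comes from. The sketch below counts weight storage only (optimizer states and activations are ignored) for the 20-billion-parameter model size mentioned above.

```python
params = 20_000_000_000        # a 20-billion-parameter model (illustrative size)

bytes_fp16 = params * 2        # FP16: 2 bytes per parameter
bytes_fp8 = params * 1         # FP8:  1 byte per parameter

print(f"FP16 weights: {bytes_fp16 / 2**30:.1f} GiB")  # ≈ 37.3 GiB
print(f"FP8  weights: {bytes_fp8 / 2**30:.1f} GiB")   # ≈ 18.6 GiB
```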

4.2 Throughput and Energy Efficiency: Improving Computing Efficiency

The width of a data path directly affects a chip’s processing capability. With the same core frequency and memory bandwidth, an 8-bit data path can move and process twice as many values as a 16-bit path. This delivers two immediate advantages:

  • Increased throughput: Completing more computations per unit time shortens model training cycles or improves inference response speeds;
  • Optimized energy efficiency: FP8 consumes less power when performing the same computing tasks, aligning with the “low-carbon” trend in data centers.

For domestic computing environments, improved energy efficiency is particularly significant. Under constraints of both energy costs and hardware computing power, FP8 enables equivalent performance at lower costs.

4.3 Cost and Deployment: Lowering the Barrier to AI Implementation

One of the biggest obstacles to widespread AI adoption is deployment cost. By reducing reliance on high-end hardware, FP8 makes AI technology accessible to more enterprises and scenarios:

  • There is no need to purchase top-tier GPUs; mid-range domestic chips— when paired with FP8 optimization— can meet most requirements;
  • Edge devices (e.g., automobiles, IoT terminals) have limited computing power and storage. FP8 allows them to run more complex models;
  • Reduced hardware investment and operation costs in data centers accelerate the penetration of AI technology into traditional industries (e.g., manufacturing, healthcare).

4.4 Hardware-Software Synergy: Unlocking the Potential of Integration

When models and hardware are co-designed around a specific floating-point format, the combined effect exceeds the sum of individual parts. When DeepSeek launched UE8M0, it explicitly tied the format to “domestic chip optimization”: the model is trained to adapt to low-precision computing logic, while hardware is optimized for FP8 instruction sets. The result is higher efficiency than the combination of “general-purpose hardware + general-purpose models”.

4.5 Necessary Challenge: Precision and Robustness

The price of fewer bits is lost precision, a challenge that is particularly acute for UE8M0’s “mantissa-free” design and that places higher demands on model robustness. Addressing this limitation requires optimization at multiple levels (a brief sketch of the first point follows the list below):

  • Training algorithm compensation: Through Quantization-Aware Training (QAT), models are trained to adapt to low-precision computing, minimizing precision loss;
  • Calibration strategies: During inference, dynamic adjustments to the value range ensure the representation precision of key parameters;
  • Hardware support mechanisms: Chips must provide dedicated low-precision computing units and overflow protection logic to work with software in reducing errors.
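As a sketch of the first point, the PyTorch snippet below shows the standard fake-quantization trick used in quantization-aware training, here applied to a power-of-two grid in the spirit of UE8M0. It is a generic illustration under those assumptions, not DeepSeek's published recipe: the forward pass sees quantized weights, while the backward pass uses a straight-through estimator so gradients keep flowing.

```python
import torch

def fake_quant_pow2(x: torch.Tensor) -> torch.Tensor:
    """Fake quantization to the nearest power of two, with a straight-through estimator."""
    eps = torch.finfo(x.dtype).tiny
    magnitude = x.abs().clamp_min(eps)
    quantized = torch.sign(x) * torch.exp2(torch.round(torch.log2(magnitude)))
    # Forward pass returns `quantized`; backward pass treats the rounding as identity.
    return x + (quantized - x).detach()

# Usage inside a training step: quantize the weights on the fly before the matmul,
# so the model learns to tolerate the coarse, exponent-only representation.
w = torch.randn(16, 16, requires_grad=True)
x = torch.randn(4, 16)
loss = (x @ fake_quant_pow2(w).t()).pow(2).mean()
loss.backward()   # gradients reach `w` thanks to the straight-through estimator
```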

Currently, academia and industry are exploring the “application boundaries” of FP8 in training and inference— identifying which scenarios can tolerate precision loss and which require high precision. This research will provide clearer guidance for the implementation of low-precision technologies.

5. Strategic Logic of UE8M0: Software-Led Promotion of Hardware Ecosystem Synergy

The significance of UE8M0 extends beyond technical innovation; it embodies a strategic mindset of “software-defined hardware”.

In the traditional model, floating-point formats are typically defined by hardware manufacturers (e.g., NVIDIA). After chip design is completed, software and models passively adapt to these formats. This approach results in low hardware-software synergy efficiency and easily ties the industry to a single vendor’s technical path.

DeepSeek has reversed this process: it first adopted the UE8M0 format at the model level, publicly shared its training and scaling strategies, and proactively proposed adaptation requirements to hardware manufacturers and toolchains. This initiative— essentially driving hardware synergy through software innovation— allows AI models to define “technical standards” and encourages the hardware ecosystem to follow.

This “model-first” approach is widely regarded as a milestone in the integration of domestic AI hardware and software. Its advantages include:

  • Accelerated ecosystem integration: Preventing fragmented development among domestic chips by forming unified low-precision adaptation standards around mainstream models;
  • Reduced synergy costs: Aligning the goals of model developers and hardware manufacturers, minimizing redundant development and compatibility issues;
  • Enhanced industrial discourse power: Shifting from “following foreign formats” to “independently defining formats” strengthens the technical leadership of domestic AI.

To date, more than 15 domestic enterprises— including industry leaders like Huawei and China Mobile— have announced plans to adjust their hardware to support DeepSeek models, covering sectors such as telecommunications, automotive, and mobile technology. This cross-industry collaboration is creating a positive feedback loop: “model optimization → hardware adaptation → application implementation → feedback iteration”.

6. FP8 Layout in Domestic Chips: Exploration Paths of Cambricon and Huawei

The promotion of UE8M0 depends on support from domestic chips. Currently, leading enterprises like Cambricon and Huawei have launched in-depth FP8 initiatives, developing distinct technical paths.

6.1 Cambricon: FP8 Support Focused on Inference Optimization

As an early entrant in the domestic AI chip space, Cambricon’s MLU series (including the Siyuan 370, Siyuan 590, and the latest Siyuan 690) explicitly supports FP8 or “Block FP8” (a format that applies the same scaling factor to an entire block of data).

On the software side, Cambricon’s NeuWare software stack provides a complete low-precision toolchain:

  • Quantization tools: Support model conversion from high-precision formats (e.g., FP32) to FP8, with calibration to minimize precision loss;
  • Mixed-precision scheduling: Automatically allocate FP8 and high-precision formats based on the precision sensitivity of operators, balancing efficiency and precision (a generic illustration follows this list);
  • Framework compatibility: Works with mainstream AI frameworks like TensorFlow and PyTorch, lowering the barrier for developers.
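The second point, sensitivity-aware scheduling, can be illustrated generically. The sketch below is not Cambricon's NeuWare API; it simply probes how much each layer's weights change when snapped to a coarse power-of-two grid, so that the most sensitive layers can be kept in a higher-precision format.

```python
import numpy as np

def layer_sensitivity(layers: dict) -> dict:
    """Relative error each layer would suffer under coarse power-of-two quantization."""
    report = {}
    for name, w in layers.items():
        q = np.sign(w) * np.exp2(np.round(np.log2(np.abs(w) + 1e-12)))  # coarse quantized copy
        report[name] = float(np.linalg.norm(w - q) / np.linalg.norm(w))
    return report

# Hypothetical layer names, for illustration only.
layers = {
    "attention.qkv": np.random.randn(64, 64).astype(np.float32),
    "mlp.up_proj": np.random.randn(64, 256).astype(np.float32),
}
# A scheduler would keep the layers with the largest error in FP16 and cast the rest to FP8.
print(sorted(layer_sensitivity(layers).items(), key=lambda kv: kv[1], reverse=True))
```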

On the hardware side, the Cambricon MLU architecture optimizes FP8 performance through three key designs:

  • Dedicated operator engines: Custom computing units for core AI operators like matrix multiplication, boosting FP8 throughput;
  • On-chip cache optimization: Expanding on-chip cache capacity to reduce memory access for FP8 data;
  • Tensor core acceleration: Drawing on the concept of Tensor Cores to optimize the efficiency of FP8 tensor operations.

According to media reports, the Siyuan 690 has achieved significant improvements in low-precision computing power and energy efficiency, with compatibility for DeepSeek models. However, whether it supports extreme formats like UE8M0 still awaits verification through SDK and model-level adaptation.

6.2 Huawei: HiFloat8 Solution for Both Training and Inference

Huawei has developed the unique HiFloat8 (HiF8) solution, which differs from E4M3/E5M2 and UE8M0. HiF8 uses a “tapered precision” design: it dynamically allocates bits between the exponent and mantissa based on value size— smaller values receive more mantissa bits (higher precision), while larger values get more exponent bits (wider dynamic range).

This design excels at balancing the precision of small values (e.g., gradients in model training) and the range of large values (e.g., activation values), making it well-suited for precision-sensitive training scenarios.

Huawei’s Ascend series chips already support quantization and mixed-precision computing on platforms like OptiQuant and Atlas, with HiF8 positioned as a core future direction. Unlike Cambricon’s focus on inference, Huawei emphasizes HiF8’s support for the full training workflow (forward and backward propagation), aiming to build a more versatile FP8 training solution.

While Huawei’s HiF8 and DeepSeek’s UE8M0 differ in design philosophy, they share a common goal: reducing reliance on foreign technologies through independent floating-point format innovation and building a domestic AI computing ecosystem.

7. From Technological Breakthrough to Ecosystem Building: The Broader Landscape of AI-Semiconductor Synergy

Behind DeepSeek’s UE8M0 FP8 optimization lies a strategic transformation in China’s AI industry— shifting from “technical follower” to “ecosystem leader”.

Artificial intelligence has become a core component of national strategy, but its development faces a critical bottleneck: the synergy between “software algorithms” and “hardware computing power”. For a long time, domestic AI enterprises have relied heavily on foreign chips (e.g., NVIDIA GPUs), resulting in an imbalance of “strong algorithms, weak computing power”. Algorithm innovation is limited by hardware performance, while hardware upgrades lack guidance from domestic algorithms.

The promotion of UE8M0 breaks this cycle: by defining formats at the model level, it drives adaptation in domestic chips, creating a closed loop of “algorithms → chips → applications”. The advantages of this collaborative model include:

  • Semiconductor manufacturers: Using DeepSeek models as a benchmark to clarify optimization directions and avoid blind R&D;
  • AI enterprises: Securing a hardware foundation for technical implementation through collaboration with domestic chips, accelerating commercialization;
  • Industry as a whole: Synchronized iteration of software and hardware, potentially outpacing the fragmented “AI companies relying on external chips” model common in foreign markets.

From a broader perspective, this process represents the practice of “independent and controllable AI”— encompassing not just independent algorithms, but also end-to-end independence from floating-point formats and chip instruction sets to ecosystem standards.

For the industry, this shift means:

  • AI competition is no longer about individual algorithms or chips, but about ecosystem strength;
  • Deep integration of domestic AI and semiconductors will reshape the global AI industry landscape;
  • Low-precision computing (e.g., FP8) will become the “universal language” for future AI implementation, making standard-setting authority critical.

Conclusion

DeepSeek’s UE8M0 FP8 optimization may seem like a technical adjustment to a floating-point format, but it is actually a strategic turning point in the synergy between domestic AI and semiconductors. It demonstrates a key insight: in the AI field, technical innovation requires not just algorithmic breakthroughs, but also an ecosystem mindset of “software defining hardware and hardware supporting software”.

With continued investment from enterprises like Cambricon and Huawei, and the participation of more industry partners, the domestic AI ecosystem is moving from “isolated breakthroughs” to “systematic synergy”. While this process will undoubtedly face challenges, it provides a clear path for China to achieve independent and controllable AI— and UE8M0 is a critical step along this path.
