DualPath: Breaking the Storage Bandwidth Bottleneck in Agentic LLM Inference
A New Architecture That Boosts Multi-Turn AI System Performance Through Dual-Path KV-Cache Loading
Introduction: When AI Agents Become Mainstream, Inference Architectures Face New Challenges
Large Language Models (LLMs) are evolving from simple single-turn chatbots into intelligent agent systems capable of autonomous planning, tool invocation, and solving real-world tasks through multi-turn interactions. Whether it’s coding assistants or automated task agents, these applications all rely on multi-turn LLM inference—a long session process where context accumulates over time.
This transformation brings a fundamental technical challenge: Agentic workloads become extremely I/O-intensive. Imagine an AI agent interacting with external environments (like browsers, Python interpreters) over dozens or even hundreds of dialogue rounds. While each round’s tool calls or feedback are short (typically only a few hundred tokens), context length can grow to extreme levels—reaching up to one million tokens.
More critically, since most of the context (usually over 95% of tokens) is reused across rounds, KV-Cache hit rates are extremely high. This means KV-Cache loading efficiency, rather than pure computing power, becomes the dominant factor in system performance.
This article provides an in-depth look at DualPath—a breakthrough inference system architecture that solves storage bandwidth bottleneck problems in modern AI data centers through an innovative dual-path KV-Cache loading mechanism.
The Core Problem: Why Existing PD-Disaggregated Architectures Hit Bottlenecks
What is PD-Disaggregated Architecture?
In modern LLM inference systems, Prefill-Decode (PD) disaggregation has become the de facto standard architecture. This design divides the inference process into two phases:
| Phase | Name | Characteristics | Responsible Engine |
|---|---|---|---|
| First Phase | Prefill | Compute-intensive, batchable | Prefill Engines (PE) |
| Second Phase | Decode | Memory-bound, latency-sensitive | Decode Engines (DE) |
Prefill Phase: Processes the input prompt, computes and generates KV-Cache. This phase requires substantial computation but can be efficiently batched.
Decode Phase: Based on the KV-Cache, autoregressively generates tokens. This phase is extremely sensitive to latency because users need to see generated content in real-time.
The advantage of PD disaggregation lies in reducing interference between the two phases, allowing each to adopt targeted optimization strategies and hardware configurations.
Where is the Bottleneck?
In standard PD-disaggregated architectures, there exists a fundamental efficiency problem:
┌─────────────────────────────────────────┐
│ Bandwidth Utilization in Traditional Architecture │
├─────────────────────────────────────────┤
│ Prefill Engine (PE) │
│ ├─ Compute NIC: 40% utilization │
│ ├─ Storage NIC: 100% utilization ← Bottleneck! │
│ └─ GPU: Partially idle, waiting for data │
├─────────────────────────────────────────┤
│ Decode Engine (DE) │
│ ├─ Compute NIC: Low utilization │
│ ├─ Storage NIC: Nearly idle ← Wasted! │
│ └─ GPU: Active, but storage bandwidth unused │
└─────────────────────────────────────────┘
The Essence of the Problem: Prefill engines must load massive KV-Cache from remote storage, causing their storage network bandwidth (SNIC) to remain constantly saturated; meanwhile, decode engines’ storage NICs remain largely idle. This asymmetric bandwidth utilization severely constrains overall system throughput.
Why not simply provision more bandwidth to prefill engines? In general-purpose clusters, this is not only costly but often impractical. Therefore, a better approach is to leverage the available I/O bandwidth of all engines rather than overloading prefill engines alone.
DualPath’s Core Innovation: Dual-Path KV-Cache Loading
What is Dual-Path Loading?
DualPath’s core insight is: KV-Cache loading does not have to be prefill-centric. Traditional systems always load KV-Cache directly from storage into prefill engines, which cannot utilize the remote storage bandwidth of decode engines.
DualPath introduces a second path—in addition to the traditional storage→prefill path, it allows KV-Cache to be loaded into decode engines first, then transferred to prefill engines via high-performance RDMA networks (compute networks).
┌─────────────────────────────────────────┐
│ Bandwidth Utilization in DualPath Architecture │
├─────────────────────────────────────────┤
│ Prefill Engine (PE) │
│ ├─ Compute NIC: 80% utilization (RDMA transfer) │
│ ├─ Storage NIC: 100% utilization │
│ └─ GPU: Higher utilization, less waiting │
├─────────────────────────────────────────┤
│ Decode Engine (DE) │
│ ├─ Compute NIC: Active (RDMA transfer) │
│ ├─ Storage NIC: 100% utilization ← Fully utilized! │
│ └─ GPU: Normal decoding work │
└─────────────────────────────────────────┘
By dynamically selecting between these two paths, DualPath redistributes network load and alleviates prefill-side bandwidth saturation.
Detailed Data Flows of Both Paths
DualPath allocates a small amount of DRAM as buffers (PE Buffer and DE Buffer) for each engine, supporting two reading paths:
Path One: PE Read Path (Traditional)
1. Storage→PE Buffer: KV-Cache is read from persistent storage into the prefill engine’s buffer
2. PE Buffer→PE HBM: Before attention layer computation, that layer’s KV-Cache is transferred to GPU high-bandwidth memory
3. PE HBM→DE Buffer: After computation completes, the KV-Cache is transferred to the decode engine’s buffer
4. Repeat: The above process repeats n_layer times, overlapping with computation
Path Two: DE Read Path (Innovative)
1. Storage→DE Buffer: KV-Cache is first read into the decode engine’s buffer
2. DE Buffer→PE HBM: During prefill, the corresponding layer’s KV-Cache is read from the DE Buffer via RDMA
3. PE HBM→DE Buffer: After layer computation completes, only the miss-token KV-Cache is transferred back to the DE Buffer to merge with the hit tokens
4. Repeat: The above similarly repeats n_layer times
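To make the bandwidth trade-off between the two paths concrete, here is a rough per-layer traffic accounting. This is an illustrative sketch, not DualPath's actual transfer engine; the function name and the assumption that only cached (hit) tokens are read from storage are mine.

```python
def traffic_per_layer(path, hit_tokens, miss_tokens, bytes_per_token):
    """Per-layer traffic for one request, in bytes.

    Returns (pe_storage_nic, de_storage_nic, compute_nic). Only cached (hit)
    tokens come from storage; miss-token KV is computed on the PE.
    """
    hit = hit_tokens * bytes_per_token
    miss = miss_tokens * bytes_per_token
    if path == "pe":
        # Path one: storage -> PE buffer, then the full layer KV (hit + miss)
        # is shipped PE HBM -> DE buffer over the compute network.
        return hit, 0, hit + miss
    elif path == "de":
        # Path two: storage -> DE buffer, hit tokens go DE -> PE via RDMA,
        # and only the newly computed miss tokens return PE -> DE.
        return 0, hit, hit + miss
    raise ValueError(path)
```

Under this simplified accounting, both paths generate the same compute-network traffic; the choice only shifts which side's storage NIC carries the read, which is exactly the degree of freedom DualPath's scheduler exploits.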
Why Doesn’t This Design Introduce New Bottlenecks?
You might worry: adding a data path could cause compute network congestion or make DRAM bandwidth the new bottleneck.
DualPath’s theoretical analysis shows that under reasonable P/D (Prefill/Decode) ratios, the system can achieve bottleneck-free operation. Key conditions include:
- Compute NIC bandwidth: Through fine-grained traffic management, ensuring KV-Cache transfer doesn’t interfere with latency-sensitive model execution communications
- DRAM pressure: Through a hybrid layout of Layer Blocks and Full Blocks, optimizing memory access patterns
- PCIe bandwidth: Adopting CNIC-centric data transfer, uniformly managing all GPU PCIe traffic
For typical 8-GPU node configurations (g=8, storage bandwidth ratio s=1), the bottleneck-free P/D ratio range is 1/7 to 7/2, covering the vast majority of practical deployment scenarios.
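As a trivial helper for capacity planning, the quoted interval can be encoded directly. Only the endpoints for g=8, s=1 are taken from the text; the closed-form bottleneck condition behind them is not reproduced here.

```python
from fractions import Fraction

def in_bottleneck_free_range(n_prefill, n_decode,
                             lo=Fraction(1, 7), hi=Fraction(7, 2)):
    """Check a P/D engine ratio against the bottleneck-free interval quoted
    for g=8 GPUs per node and storage bandwidth ratio s=1.

    Exact rational arithmetic avoids floating-point edge cases at the
    interval boundaries.
    """
    ratio = Fraction(n_prefill, n_decode)
    return lo <= ratio <= hi
```

For example, a balanced 1P1D deployment sits comfortably inside the interval, while a 4P1D deployment exceeds the upper endpoint of 7/2.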
Technical Implementation: Three Core Components
1. CNIC-Centric Traffic Manager
Modern LLM inference uses various advanced data transfer technologies, such as GPUDirect Storage and CUDA copy engines. However, these technologies share a common problem: they cannot isolate KV-Cache traffic from latency-sensitive collective communications.
Why does this matter? During model execution, collective communications (such as AllToAll in expert parallelism, ReduceScatter/AllGather in tensor/context parallelism) occur in rapid, sub-millisecond bursts. If KV-Cache transfer interferes with these critical communications, inference latency can degrade severely.
DualPath’s Solution: Adopt a CNIC-Centric (Compute NIC-Centric) data transfer approach. All data traffic in or out of a GPU, including local H2D/D2H copies, must go through the GPU’s paired compute NIC using GPUDirect RDMA data paths.
Specific Implementation:
- For InfiniBand networks: Leverage Virtual Lanes (VL) for traffic isolation
  - High-priority VL: Dedicated to model inference communications (reserving 99% of bandwidth)
  - Low-priority VL: Used for KV-Cache transfer (preventing starvation)
- For RoCE networks: Use DSCP markings and Traffic Classes (TC) for similar isolation
Unexpected Benefit: CNIC-assisted H2D/D2H outperforms CUDA copy engine when handling large numbers of small data chunks. A single RDMA write operation takes only about 1 microsecond, compared to 5-7 microseconds for cudaMemcpyAsync.
2. Adaptive Request Scheduler
DualPath needs to decide in real-time: which path should each request use? This involves scheduling at two levels:
Inter-Engine Scheduling
Engines are organized into groups; only each group’s Leader Engine interacts with the central scheduler. Three key metrics are considered when scheduling:
- seq_e: Number of requests assigned to the engine that haven’t completed yet
- tok_e: Total token count over those seq_e requests
- read_q_n(e): Disk reading queue length of the node that engine e belongs to
Prefill Engine Scheduling Strategy:
- Overloaded engines (tok_e > β): No new requests assigned
- Short-queue engines (read_q_n(e) ≤ α and tok_e ≤ β): Priority assignment, preventing storage NIC underutilization
- Long-queue engines (read_q_n(e) > α and tok_e ≤ β): Secondary priority
Decode Engine Scheduling Strategy:
- Phase 1: Cross-group balancing, assigning requests to the group with the minimum total tok_e
- Phase 2: Within-group balancing, considering HBM capacity and token count to avoid fragmentation
KV-Cache Read Task Scheduling: After selecting PE and DE, choose the side with the shorter reading queue to execute storage reads. Future work considers splitting requests to read from both sides in parallel.
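The tiered prefill-engine selection above can be sketched as a small function. Field names (`tok`, `read_q`) stand in for tok_e and read_q_n(e), and the tie-break of picking the least-loaded engine within a tier is my assumption, not stated in the text.

```python
def pick_prefill_engine(engines, alpha, beta):
    """Tiered PE selection sketch.

    Overloaded engines (tok > beta) are excluded outright; short-queue
    engines (read_q <= alpha) are preferred so storage NICs stay busy;
    long-queue engines form the fallback tier.
    """
    eligible = [e for e in engines if e["tok"] <= beta]
    if not eligible:
        return None  # every engine is overloaded; the request waits
    short_q = [e for e in eligible if e["read_q"] <= alpha]
    pool = short_q or eligible
    # Assumed tie-break: least-loaded engine within the chosen tier.
    return min(pool, key=lambda e: e["tok"])
```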
Intra-Engine Scheduling
Only prefill engines require intra-engine scheduling. In data parallelism (DP) configurations, each GPU serves a different set of requests, so workload imbalance can cause GPUs to idle at synchronization points.
Solution: Compute quota-based batch selection
- Use FIFO packing to decide how many requests to include in a forward batch
- Each request is described by a (cached, bsz) pair: the number of tokens with KV-Cache already available, and the number of tokens requiring KV-Cache computation in this forward batch
- Predict attention layer execution time, ensuring it doesn’t exceed a predefined compute quota (e.g., 300ms)
- If adding a request would exceed the quota, perform chunked prefill for that request
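The quota-based packing loop can be sketched as follows, under the simplifying assumption of a linear cost model (each batch token costs a fixed number of microseconds); DualPath actually predicts attention time from the (cached, bsz) pair.

```python
def pack_batch(queue, quota_us, us_per_token=100):
    """FIFO packing under a compute quota (e.g., 300 ms = 300_000 us).

    queue holds (cached, bsz) pairs; returns (cached, tokens_this_pass)
    pairs. A request that would overflow the quota is chunked to fit
    (chunked prefill), and packing stops there to preserve FIFO order.
    """
    batch, used = [], 0
    for cached, bsz in queue:
        cost = bsz * us_per_token
        if used + cost <= quota_us:
            batch.append((cached, bsz))
            used += cost
        else:
            # Chunked prefill: take only as many tokens as the quota allows.
            fit = (quota_us - used) // us_per_token
            if fit > 0:
                batch.append((cached, fit))
            break
    return batch
```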
3. Hierarchical KV-Cache Block Layout
Layer-wise prefill reduces the KV-Cache block size to 1/n_layer of the original, increasing the block count by a factor of n_layer. This poses challenges to transfer and storage performance.
DualPath’s Dual-Block Design:
| Block Type | Shape | Purpose |
|---|---|---|
| Layer Block | [1, tokens, bytes] | Stores a single layer’s KV-Cache; used for layer-wise streaming transfer |
| Full Block | [layer, tokens, bytes] | Stores the complete KV-Cache across all layers; used for storage interactions |
Advantages:
- All interactions with storage use Full Blocks, optimizing storage performance
- Layer-wise streaming transfers use Layer Blocks, reducing memory footprint
- No manual KV-Cache memory layout conversion is needed; simply concatenating Layer Blocks yields Full Blocks
KV-Cache is organized in distributed storage using a Trie structure, where each tree node corresponds to a Full Block.
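The "concatenation yields a Full Block" property can be demonstrated with plain byte buffers; shapes and fill values below are arbitrary illustrations, not the real block sizes.

```python
def make_layer_block(layer_id, tokens, width):
    """A Layer Block [1, tokens, bytes] as a flat byte payload, filled with
    the layer id so slices are easy to identify (illustrative only)."""
    return bytes([layer_id]) * (tokens * width)

n_layers, tokens, width = 4, 16, 8
layer_blocks = [make_layer_block(l, tokens, width) for l in range(n_layers)]

# Concatenating Layer Blocks in layer order yields the Full Block
# [layer, tokens, bytes]; no reshuffling of the memory layout is needed,
# because each layer's data is already contiguous.
full_block = b"".join(layer_blocks)
```

Because the Full Block is just the Layer Blocks laid end to end, storage writes can use Full Blocks while streaming transfers slice out individual layers for free.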
Performance Evaluation: Significant Improvements in Real Scenarios
Experimental Setup
Test Platform:
- GPU cluster: NVIDIA Hopper GPUs, InfiniBand interconnect
- Node configuration: 8 GPUs per node, 8×400Gbps compute NICs + 1×400Gbps storage NIC
- Storage backend: 3FS (no internal DRAM cache, can saturate storage NIC bandwidth)
Evaluation Models:
- DeepSeek-V3.2 660B (MoE, sparse attention)
- DeepSeek 27B (internal experimental model, a scaled-down version of the 660B)
- Qwen2.5-32B (dense model, GQA)
Datasets: Three trajectory datasets collected from production agentic RL training workloads
| Dataset | Max Length | Avg Turns | Avg Append | Avg Generate | Avg Total | Avg Context |
|---|---|---|---|---|---|---|
| 32K | 32K | 60 | 608 | 148 | 28,639 | 17,183 |
| 48K | 48K | 106 | 474 | 172 | 42,607 | 25,120 |
| 64K | 64K | 157 | 429 | 176 | 55,958 | 32,721 |
Comparison Baselines:
- Basic: Unmodified internal inference framework
- SGL(MC): SGLang + Mooncake + 3FS (N/A for some configurations due to unsupported features)
- Oracle: Theoretical upper bound, bypassing all disk reads and transfers
Offline Batch Inference Results
This scenario models the rollout phase of RL training, where all agents start inference simultaneously; the metric is Job Completion Time (JCT).
Key Findings:
- DeepSeek 660B: DualPath achieves up to 1.87× improvement over Basic, with performance close to Oracle, indicating KV-Cache I/O bottlenecks are largely eliminated
- DeepSeek 27B: Up to 1.78× improvement, but 1.09-1.85× slower than Oracle due to limited storage bandwidth in the 1P1D configuration
- Qwen 32B: Consistent trends with the DeepSeek models
- Scale effects: DualPath benefits more from larger batch sizes and longer context lengths
P/D Ratio Impact:
- Across 1P1D, 2P1D, and 1P2D configurations, DualPath achieves an average 1.64× speedup (up to 2.46×)
- Interesting phenomenon: Basic 1P1D ≈ Basic 1P2D, DualPath 1P1D ≈ Basic 2P1D, and DualPath 2P1D ≈ DualPath 1P2D
- This confirms that storage bandwidth, not compute capacity, is the dominant bottleneck in agentic inference
Impact of Append and Generation Length:
- As append length increases, Basic performance gradually approaches DualPath (compute pressure increases, so I/O pressure relatively decreases)
- As generation length increases, the larger prefill gap time reduces KV-Cache loading pressure
- In typical agentic scenarios (short append, short generation), DualPath maintains a 1.82-1.99× advantage
Online Serving Results
In online serving scenarios, agents arrive according to a Poisson process, measuring latency metrics:
- TTFT (Time To First Token): First-token latency
- TTST (Time To Second Token): Second-token latency
- TPOT (Time Per Output Token): Per-output-token time
Service Level Objective (SLO): TTFT ≤ 4 seconds, TPOT ≤ 50 milliseconds
Key Findings:
- Throughput improvement: DualPath’s agents-per-second (APS) capacity improves by 1.67× (27B) to 2.25× (660B) over Basic
- Latency characteristics: TTST is comparable to Basic and TPOT shows no increase, proving DualPath introduces no additional decoding overhead
- TTFT stability: DualPath maintains stable TTFT components across different arrival rates, while Basic’s queuing time grows dramatically due to insufficient storage bandwidth
Ablation Study
Technical components are added incrementally to quantify their individual contributions (64K context, 1024/2048 agents):
| Configuration | JCT Reduction vs. Basic |
|---|---|
| Basic + Layerwise Prefill | 17.21% |
| + Dual-path Loading | Additional 38.19% (cumulative 55.4%) |
| + Scheduling Algorithm | Additional 45.62% (cumulative optimal) |
Load Balancing Effects:
- Storage NIC traffic balance: Improved from 1.53 (round-robin) to 1.18 (Max/Avg ratio, with 1.0 being perfect balance)
- Attention layer execution time balance: Maintains a Max/Avg ratio as low as 1.06 during the first 5% of the task, significantly reducing GPU idle bubbles
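The Max/Avg balance metric quoted above is straightforward to compute; a sketch:

```python
def max_avg_ratio(loads):
    """Max/Avg load-balance metric: 1.0 means perfectly even load across
    the measured units (here, storage NICs or attention execution times)."""
    return max(loads) / (sum(loads) / len(loads))
```

Read this way, the improvement from 1.53 to 1.18 means the busiest storage NIC carries only 18% more traffic than the average, instead of 53% more.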
Large-Scale Scalability
Validated on large-scale clusters of up to 1,152 GPUs:
| Scenario | Configuration | Result |
|---|---|---|
| Offline inference | 2P4D (2K agents) → 48P96D (48K agents) | JCT from 3,167s to 3,201s, near-linear scaling |
| Online serving | 44P88D | 22× throughput improvement (0.4→8.8 APS), stable latency |
Scheduler CPU usage remains below 10 cores, confirming it’s not a bottleneck.
Deep Understanding: Why DualPath Works
Hardware Trends vs. Workload Mismatch
From NVIDIA Ampere to Blackwell architecture, the I/O-compute ratio decreases by 14.4×. This means:
- GPU compute capacity (FLOPS) grows rapidly
- Network bandwidth and HBM capacity lag behind
- In agentic workloads, we hit the memory wall and communication wall earlier
Small HBM capacity limits token batch sizes, preventing full utilization of compute units like Tensor Cores. Low NIC bandwidth limits KV-Cache loading speed, causing GPUs to idle while waiting.
The Importance of Cache-Compute Ratio
Definition: the amount of KV-Cache that must be loaded per unit of computation needed (GB/PFLOP)
| Model | Cache-Compute Ratio (16K-64K context) |
|---|---|
| Qwen2.5-32B (FP16) | 117-267 |
| GPT-OSS-120B | 47-95 |
| Qwen3-235B-A22B | 39-60 |
| DeepSeek-V3.2 660B | 13-36 |
| DeepSeek-V3 660B | 4.8-5.8 |
Even for DeepSeek-V3.2 with MLA optimization, at 32K context with 429-token append length, the cache-compute ratio reaches approximately 22GB/PFLOP, posing significant pressure on storage bandwidth. For models with larger KV-Cache sizes, the situation is even worse.
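A back-of-the-envelope version of this metric can be sketched as follows, assuming the ratio is defined as cached KV bytes to load divided by prefill FLOPs for the appended tokens (my reading of the definition above; any per-token figures you plug in are placeholders, not the paper's measurements).

```python
def cache_compute_ratio(kv_bytes_per_token, context_tokens,
                        append_tokens, flops_per_token):
    """Cache-compute ratio in GB/PFLOP: bytes of cached KV loaded for the
    reused context, per PFLOP spent prefilling the newly appended tokens."""
    cache_gb = kv_bytes_per_token * context_tokens / 1e9
    compute_pflop = flops_per_token * append_tokens / 1e15
    return cache_gb / compute_pflop
```

The structure of the formula explains the article's trend: short appends over long reused contexts drive the ratio up (lots of cache to load, little compute to hide it behind), which is exactly the agentic pattern.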
Constraints of Modern AI Data Center Architecture
Standard NVIDIA DGX SuperPOD design principles:
- Compute network (east-west): High-speed NVLink + 400Gbps compute NICs for GPU-to-GPU communication
- Storage network (north-south): Independent 400Gbps storage NICs for datasets, checkpoints, and KV-Cache access
- Key isolation: Compute and storage networks are physically isolated to prevent mutual interference
DualPath’s innovation is to transfer KV-Cache through the compute network without breaking this isolation, thereby aggregating the storage bandwidth of all nodes.
Practical Deployment Considerations
Applicable Scenarios
DualPath is particularly suitable for:
- Agentic RL training: Rollout phases must process large numbers of multi-step agent trajectories while DRAM is occupied by training state, so KV-Cache must reside on external storage
- Long-context online serving: Multi-turn dialogue applications where context accumulates to extreme lengths
- High KV-Cache hit rate workloads: Short-append, multi-turn interaction patterns
Configuration Recommendations
P/D Ratio Selection:
- Ensure the P/D ratio is within the reasonable range given by the theoretical analysis (e.g., 1/7 to 7/2 for g=8, s=1)
- Determine the optimal ratio through small-scale experiments in the actual deployment
DRAM Allocation:
- DeepSeek models: 80GB per node
- Models with larger KV-Cache, like Qwen: 320GB per node
- Compared to pure DRAM caching solutions (e.g., Mooncake requires 1.5TB/node), this significantly reduces memory costs
Traffic Isolation Configuration (InfiniBand example):
qos_max_vls 4
qos_high_limit 240
qos_vlarb_high 0:192,1:192,2:0,3:192
qos_vlarb_low 0:192,1:192,2:64,3:192
Limitations and Future Work
- Dynamic adaptability: Current workloads are highly dynamic (e.g., in RL training, prefill pressure is significantly higher in the first half than the second half), requiring more flexible parallelism strategies and online P/D ratio adjustment mechanisms
- Scheduling optimization: Under large-scale deployment, there remains room to reduce TTFT tail latency
- Split reading: The current implementation chooses single-side reading; future work could explore splitting requests to read from both sides in parallel
- Working set management: In online scenarios, as arrival rates and JCT increase, KV-Cache working sets may exceed memory, requiring smarter caching strategies
Frequently Asked Questions (FAQ)
Q1: What is the difference between DualPath and distributed memory caching solutions like Mooncake?
A: Mooncake builds a distributed DRAM pool to cache KV-Cache, using affinity-aware scheduling to maximize hit rates. This works well in memory-sufficient scenarios but has two limitations:
- It cannot be used in memory-constrained scenarios (e.g., RL training rollout phases where DRAM is occupied by training states)
- It is cost-prohibitive for ultra-large working sets (online serving), because DRAM is much more expensive than SSD
DualPath targets the storage backend directly, balancing traffic across all storage NICs, greatly reducing DRAM usage without harming performance. The two can be combined, but gains are marginal.
Q2: Why not use GPUDirect Storage to read KV-Cache directly to GPU?
A: GPUDirect Storage can indeed read directly from storage to GPU HBM, but it cannot isolate KV-Cache traffic from latency-sensitive model execution communications. In agentic inference, collective communication latency (e.g., AllToAll) is critical. The CNIC-centric approach, while seemingly taking a detour, is currently the only practical method to ensure KV-Cache loading doesn’t degrade critical model execution performance.
Q3: Does DualPath increase latency in the decode phase?
A: No. Experimental data shows DualPath’s TPOT (Time Per Output Token) is comparable to baseline, and TTST (Time To Second Token) shows no significant difference. This is because:
- Compute network QoS mechanisms ensure KV-Cache transfer uses only spare bandwidth
- The DE Buffer design optimizes data transfer paths
- Decode-phase computation and communication patterns remain undisturbed
Q4: How is the decision made between PE read path and DE read path?
A: The scheduler dynamically decides based on real-time load information:
- It compares storage reading queue lengths between the prefill and decode engine nodes
- It chooses the side with the shorter queue to execute storage reads
- It considers multiple dimensions including GPU load, token count, and HBM capacity
Q5: Is layer-wise prefill mandatory? What is its impact on performance?
A: Layer-wise prefill is an important foundation of DualPath, solving HBM capacity bottlenecks in long-context prefilling. By loading only the current layer’s required KV-Cache per layer, effective batch size (in tokens) increases by approximately layer times. Ablation studies show that adding layer-wise prefill alone reduces JCT by 17.21%.
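The "approximately layer times" claim follows from simple HBM accounting; here is a sketch under the assumption that layer-wise prefill keeps only one layer's KV resident at a time (function name and figures are illustrative).

```python
def max_batch_tokens(hbm_kv_budget, kv_bytes_per_token, n_layers, layerwise):
    """Token capacity of an HBM KV budget, with and without layer-wise
    prefill. Layer-wise mode holds only one layer's KV per token resident,
    so per-token cost drops by a factor of n_layers."""
    per_token = kv_bytes_per_token / n_layers if layerwise else kv_bytes_per_token
    return int(hbm_kv_budget // per_token)
```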
Q6: Which network types does DualPath support?
A: While experiments were conducted on InfiniBand, design principles apply to:
- InfiniBand: Uses Virtual Lanes (VL) for QoS isolation
- RoCE: Uses DSCP and Traffic Class (TC) markings
- Emerging technologies: UnifiedBus, Ultra Ethernet, and other converged QoS mechanisms
Q7: Is DualPath equally effective for small models (e.g., 27B)?
A: Effective, but improvement magnitude is limited by storage bandwidth. In small model 1P1D configurations, due to limited total storage bandwidth, DualPath remains 1.09-1.85× slower than Oracle. Increasing node count or optimizing P/D ratios is recommended to unlock greater potential.
Summary
DualPath solves the core bottleneck in agentic LLM inference by rethinking KV-Cache loading in PD-disaggregated architectures. Its key innovations include:
- Dual-path loading architecture: Breaks the prefill-centric limitation, fully utilizing the idle storage bandwidth of decode engines
- CNIC-centric traffic management: Ensures latency-sensitive communications remain unaffected while opportunistically utilizing spare compute network bandwidth
- Adaptive scheduling algorithms: Dynamically balance compute and network resources for global optimization
Under real agentic workloads, DualPath achieves up to 1.87× throughput improvement for offline inference and average 1.96× throughput improvement for online serving, while maintaining stable latency metrics. This architecture provides a solid foundation for efficient deployment of next-generation long-context, multi-turn interactive AI applications.
Reference Information
This article is based on the following research paper:
- DualPath: Breaking the Storage Bandwidth Bottleneck in Agentic LLM Inference
- Authors: Yongtong Wu, Shaoyuan Chen, Yinmin Zhong, et al. (Peking University, Tsinghua University, DeepSeek-AI)
- Published: arXiv:2602.21548v2 [cs.DC], February 2026
Related Technologies:
- PD-disaggregated architecture: Splitwise, DistServe
- Layer-wise prefill: LayerKV, PrefillOnly
- Distributed KV-Cache: Mooncake, TokenLake
- Efficient attention kernels: FlashMLA, FlashInfer
- Expert parallelism communication: DeepEP

