DualPath: Breaking the Storage Bandwidth Bottleneck in Agentic LLM Inference
A New Architecture That Boosts Multi-Turn AI System Performance Through Dual-Path KV-Cache Loading
Introduction: When AI Agents Become Mainstream, Inference Architectures Face New Challenges
Large Language Models (LLMs) are evolving from simple single-turn chatbots into intelligent agent systems capable of autonomous planning, tool invocation, and solving real-world tasks through multi-turn interactions. Whether it’s coding assistants or automated task agents, these applications all rely on multi-turn LLM inference—a long session process where context accumulates over time.
This transformation brings a fundamental technical challenge: Agentic workloads become extremely I/O-intensive. Imagine an AI agent interacting with external environments (like browsers, Python interpreters) over dozens or even hundreds of dialogue rounds. While each round’s tool calls or feedback are short (typically only a few hundred tokens), context length can grow to extreme levels—reaching up to one million tokens.
More critically, since most of the context (usually over 95% of tokens) is reused across rounds, KV-Cache hit rates are extremely high. This means KV-Cache loading efficiency, rather than pure computing power, becomes the dominant factor in system performance.
This article provides an in-depth look at DualPath—a breakthrough inference system architecture that solves storage bandwidth bottleneck problems in modern AI data centers through an innovative dual-path KV-Cache loading mechanism.
The Core Problem: Why Existing PD-Disaggregated Architectures Hit Bottlenecks
What is PD-Disaggregated Architecture?
In modern LLM inference systems, Prefill-Decode (PD) disaggregation has become the de facto standard architecture. This design divides the inference process into two phases:
| Phase | Name | Characteristics | Responsible Engine |
|---|---|---|---|
| First Phase | Prefill | Compute-intensive, batchable | Prefill Engines (PE) |
| Second Phase | Decode | Memory-bound, latency-sensitive | Decode Engines (DE) |
Prefill Phase: Processes the input prompt, computes and generates KV-Cache. This phase requires substantial computation but can be efficiently batched.
Decode Phase: Based on the KV-Cache, autoregressively generates tokens. This phase is extremely sensitive to latency because users need to see generated content in real-time.
The advantage of PD disaggregation lies in reducing interference between the two phases, allowing each to adopt targeted optimization strategies and hardware configurations.
Where is the Bottleneck?
In standard PD-disaggregated architectures, there exists a fundamental efficiency problem:
┌─────────────────────────────────────────┐
│ Bandwidth Utilization in Traditional Architecture │
├─────────────────────────────────────────┤
│ Prefill Engine (PE) │
│ ├─ Compute NIC: 40% utilization │
│ ├─ Storage NIC: 100% utilization ← Bottleneck! │
│ └─ GPU: Partially idle, waiting for data │
├─────────────────────────────────────────┤
│ Decode Engine (DE) │
│ ├─ Compute NIC: Low utilization │
│ ├─ Storage NIC: Nearly idle ← Wasted! │
│ └─ GPU: Active, but storage bandwidth unused │
└─────────────────────────────────────────┘
The Essence of the Problem: Prefill engines must load massive KV-Cache from remote storage, causing their storage network bandwidth (SNIC) to remain constantly saturated; meanwhile, decode engines’ storage NICs remain largely idle. This asymmetric bandwidth utilization severely constrains overall system throughput.
Why not simply provision more bandwidth to prefill engines? In general-purpose clusters, this is not only costly but often impractical. Therefore, a better approach is to leverage the available I/O bandwidth of all engines rather than overloading prefill engines alone.
DualPath’s Core Innovation: Dual-Path KV-Cache Loading
What is Dual-Path Loading?
DualPath’s core insight is: KV-Cache loading does not have to be prefill-centric. Traditional systems always load KV-Cache directly from storage into prefill engines, which cannot utilize the remote storage bandwidth of decode engines.
DualPath introduces a second path—in addition to the traditional storage→prefill path, it allows KV-Cache to be loaded into decode engines first, then transferred to prefill engines via high-performance RDMA networks (compute networks).
┌─────────────────────────────────────────┐
│ Bandwidth Utilization in DualPath Architecture │
├─────────────────────────────────────────┤
│ Prefill Engine (PE) │
│ ├─ Compute NIC: 80% utilization (RDMA transfer) │
│ ├─ Storage NIC: 100% utilization │
│ └─ GPU: Higher utilization, less waiting │
├─────────────────────────────────────────┤
│ Decode Engine (DE) │
│ ├─ Compute NIC: Active (RDMA transfer) │
│ ├─ Storage NIC: 100% utilization ← Fully utilized! │
│ └─ GPU: Normal decoding work │
└─────────────────────────────────────────┘
By dynamically selecting between these two paths, DualPath redistributes network load and alleviates prefill-side bandwidth saturation.
Detailed Data Flows of Both Paths
DualPath allocates a small amount of DRAM as buffers (PE Buffer and DE Buffer) for each engine, supporting two reading paths:
Path One: PE Read Path (Traditional)
1. Storage→PE Buffer: KV-Cache is read from persistent storage into the prefill engine’s buffer
2. PE Buffer→PE HBM: Before attention layer computation, that layer’s KV-Cache is transferred to GPU high-bandwidth memory
3. PE HBM→DE Buffer: After computation completes, the KV-Cache is transferred to the decode engine’s buffer
4. Repeat: The above process repeats n_layer times, overlapping with computation
Path Two: DE Read Path (Innovative)
1. Storage→DE Buffer: KV-Cache is first read into the decode engine’s buffer
2. DE Buffer→PE HBM: During prefill, the corresponding layer’s KV-Cache is read from the DE Buffer via RDMA
3. PE HBM→DE Buffer: After layer computation completes, only the miss-token KV-Cache is transferred back to the DE Buffer to merge with the hit tokens
4. Repeat: The above similarly repeats n_layer times
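To make the bandwidth trade-off between the two paths concrete, here is a rough per-layer traffic accounting. This is an illustrative sketch, not DualPath's actual transfer engine; the function name and the assumption that only cached (hit) tokens are read from storage are mine.

```python
def traffic_per_layer(path, hit_tokens, miss_tokens, bytes_per_token):
    """Per-layer traffic for one request, in bytes.

    Returns (pe_storage_nic, de_storage_nic, compute_nic). Only cached (hit)
    tokens come from storage; miss-token KV is computed on the PE.
    """
    hit = hit_tokens * bytes_per_token
    miss = miss_tokens * bytes_per_token
    if path == "pe":
        # Path one: storage -> PE buffer, then the full layer KV (hit + miss)
        # is shipped PE HBM -> DE buffer over the compute network.
        return hit, 0, hit + miss
    elif path == "de":
        # Path two: storage -> DE buffer, hit tokens go DE -> PE via RDMA,
        # and only the newly computed miss tokens return PE -> DE.
        return 0, hit, hit + miss
    raise ValueError(path)
```

Under this simplified accounting, both paths generate the same compute-network traffic; the choice only shifts which side's storage NIC carries the read, which is exactly the degree of freedom DualPath's scheduler exploits.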
Why Doesn’t This Design Introduce New Bottlenecks?
You might worry: adding a data path could cause compute network congestion or make DRAM bandwidth the new bottleneck.
DualPath’s theoretical analysis shows that under reasonable P/D (Prefill/Decode) ratios, the system can achieve bottleneck-free operation. Key conditions include:
- Compute NIC bandwidth: Through fine-grained traffic management, ensuring KV-Cache transfer doesn’t interfere with latency-sensitive model execution communications
- DRAM pressure: Through a hybrid layout of Layer Blocks and Full Blocks, optimizing memory access patterns
- PCIe bandwidth: Adopting CNIC-centric data transfer, uniformly managing all GPU PCIe traffic
For typical 8-GPU node configurations (g=8, storage bandwidth ratio s=1), the bottleneck-free P/D ratio range is 1/7 to 7/2, covering the vast majority of practical deployment scenarios.
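As a trivial helper for capacity planning, the quoted interval can be encoded directly. Only the endpoints for g=8, s=1 are taken from the text; the closed-form bottleneck condition behind them is not reproduced here.

```python
from fractions import Fraction

def in_bottleneck_free_range(n_prefill, n_decode,
                             lo=Fraction(1, 7), hi=Fraction(7, 2)):
    """Check a P/D engine ratio against the bottleneck-free interval quoted
    for g=8 GPUs per node and storage bandwidth ratio s=1.

    Exact rational arithmetic avoids floating-point edge cases at the
    interval boundaries.
    """
    ratio = Fraction(n_prefill, n_decode)
    return lo <= ratio <= hi
```

For example, a balanced 1P1D deployment sits comfortably inside the interval, while a 4P1D deployment exceeds the upper endpoint of 7/2.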
Technical Implementation: Three Core Components
1. CNIC-Centric Traffic Manager
Modern LLM inference uses various advanced data transfer technologies, such as GPUDirect Storage and CUDA copy engines. However, these technologies share a common problem: they cannot isolate KV-Cache traffic from latency-sensitive collective communications.
Why does this matter? During model execution, collective communications (such as AllToAll in expert parallelism, ReduceScatter/AllGather in tensor/context parallelism) occur in rapid, sub-millisecond bursts. If KV-Cache transfer interferes with these critical communications, inference latency can degrade severely.
DualPath’s Solution: Adopt a CNIC-Centric (Compute NIC-Centric) data transfer approach. All data traffic in or out of a GPU, including local H2D/D2H copies, must go through the GPU’s paired compute NIC using GPUDirect RDMA data paths.
Specific Implementation:
- For InfiniBand networks: Leverage Virtual Lanes (VL) for traffic isolation
  - High-priority VL: Dedicated to model inference communications (reserving 99% of bandwidth)
  - Low-priority VL: Used for KV-Cache transfer (preventing starvation)
- For RoCE networks: Use DSCP markings and Traffic Classes (TC) for similar isolation
Unexpected Benefit: CNIC-assisted H2D/D2H outperforms CUDA copy engine when handling large numbers of small data chunks. A single RDMA write operation takes only about 1 microsecond, compared to 5-7 microseconds for cudaMemcpyAsync.
2. Adaptive Request Scheduler
DualPath needs to decide in real-time: which path should each request use? This involves scheduling at two levels:
Inter-Engine Scheduling
Engines are organized into groups; only each group’s Leader Engine interacts with the central scheduler. Three key metrics are considered when scheduling:
- seq_e: Number of requests assigned to the engine that haven’t completed yet
- tok_e: Total token count over those seq_e requests
- read_q_n(e): Disk reading queue length of the node that engine e belongs to
Prefill Engine Scheduling Strategy:
- Overloaded engines (tok_e > β): No new requests assigned
- Short-queue engines (read_q_n(e) ≤ α and tok_e ≤ β): Priority assignment, preventing storage NIC underutilization
- Long-queue engines (read_q_n(e) > α and tok_e ≤ β): Secondary priority
Decode Engine Scheduling Strategy:
- Phase 1: Cross-group balancing, assigning requests to the group with the minimum total tok_e
- Phase 2: Within-group balancing, considering HBM capacity and token count to avoid fragmentation
KV-Cache Read Task Scheduling: After selecting PE and DE, choose the side with the shorter reading queue to execute storage reads. Future work considers splitting requests to read from both sides in parallel.
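The tiered prefill-engine selection above can be sketched as a small function. Field names (`tok`, `read_q`) stand in for tok_e and read_q_n(e), and the tie-break of picking the least-loaded engine within a tier is my assumption, not stated in the text.

```python
def pick_prefill_engine(engines, alpha, beta):
    """Tiered PE selection sketch.

    Overloaded engines (tok > beta) are excluded outright; short-queue
    engines (read_q <= alpha) are preferred so storage NICs stay busy;
    long-queue engines form the fallback tier.
    """
    eligible = [e for e in engines if e["tok"] <= beta]
    if not eligible:
        return None  # every engine is overloaded; the request waits
    short_q = [e for e in eligible if e["read_q"] <= alpha]
    pool = short_q or eligible
    # Assumed tie-break: least-loaded engine within the chosen tier.
    return min(pool, key=lambda e: e["tok"])
```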
Intra-Engine Scheduling
Only prefill engines require intra-engine scheduling. In data parallelism (DP) configurations, each GPU serves a different set of requests, so workload imbalance can cause GPUs to idle at synchronization points.
Solution: Compute quota-based batch selection
- Use FIFO packing to decide how many requests to include in a forward batch
- Each request is described by a (cached, bsz) pair: the number of tokens with KV-Cache already available, and the number of tokens requiring KV-Cache computation in this forward batch
- Predict attention layer execution time, ensuring it doesn’t exceed a predefined compute quota (e.g., 300ms)
- If adding a request would exceed the quota, perform chunked prefill for that request
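The quota-based packing loop can be sketched as follows, under the simplifying assumption of a linear cost model (each batch token costs a fixed number of microseconds); DualPath actually predicts attention time from the (cached, bsz) pair.

```python
def pack_batch(queue, quota_us, us_per_token=100):
    """FIFO packing under a compute quota (e.g., 300 ms = 300_000 us).

    queue holds (cached, bsz) pairs; returns (cached, tokens_this_pass)
    pairs. A request that would overflow the quota is chunked to fit
    (chunked prefill), and packing stops there to preserve FIFO order.
    """
    batch, used = [], 0
    for cached, bsz in queue:
        cost = bsz * us_per_token
        if used + cost <= quota_us:
            batch.append((cached, bsz))
            used += cost
        else:
            # Chunked prefill: take only as many tokens as the quota allows.
            fit = (quota_us - used) // us_per_token
            if fit > 0:
                batch.append((cached, fit))
            break
    return batch
```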
3. Hierarchical KV-Cache Block Layout
Layer-wise prefill reduces the KV-Cache block size to 1/n_layer of the original, increasing the block count by a factor of n_layer. This poses challenges to transfer and storage performance.
DualPath’s Dual-Block Design:
| Block Type | Shape | Purpose |
|---|---|---|
| Layer Block | [1, tokens, bytes] | Stores a single layer’s KV-Cache; used for layer-wise streaming transfer |
| Full Block | [layer, tokens, bytes] | Stores the complete KV-Cache across all layers; used for storage interactions |
Advantages:
- All interactions with storage use Full Blocks, optimizing storage performance
- Layer-wise streaming transfers use Layer Blocks, reducing memory footprint
- No manual KV-Cache memory layout conversion is needed; simply concatenating Layer Blocks yields Full Blocks
KV-Cache is organized in distributed storage using a Trie structure, where each tree node corresponds to a Full Block.
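The "concatenation yields a Full Block" property can be demonstrated with plain byte buffers; shapes and fill values below are arbitrary illustrations, not the real block sizes.

```python
def make_layer_block(layer_id, tokens, width):
    """A Layer Block [1, tokens, bytes] as a flat byte payload, filled with
    the layer id so slices are easy to identify (illustrative only)."""
    return bytes([layer_id]) * (tokens * width)

n_layers, tokens, width = 4, 16, 8
layer_blocks = [make_layer_block(l, tokens, width) for l in range(n_layers)]

# Concatenating Layer Blocks in layer order yields the Full Block
# [layer, tokens, bytes]; no reshuffling of the memory layout is needed,
# because each layer's data is already contiguous.
full_block = b"".join(layer_blocks)
```

Because the Full Block is just the Layer Blocks laid end to end, storage writes can use Full Blocks while streaming transfers slice out individual layers for free.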
Performance Evaluation: Significant Improvements in Real Scenarios
Experimental Setup
Test Platform:
- GPU cluster: NVIDIA Hopper GPUs, InfiniBand interconnect
- Node configuration: 8 GPUs per node, 8×400Gbps compute NICs + 1×400Gbps storage NIC
- Storage backend: 3FS (no internal DRAM cache, can saturate storage NIC bandwidth)
Evaluation Models:
- DeepSeek-V3.2 660B (MoE, sparse attention)
- DeepSeek 27B (internal experimental model, a scaled-down version of the 660B)
- Qwen2.5-32B (dense model, GQA)
Datasets: Three trajectory datasets collected from production agentic RL training workloads
| Dataset | Max Length | Avg Turns | Avg Append | Avg Generate | Avg Total | Avg Context |
|---|---|---|---|---|---|---|
| 32K | 32K | 60 | 608 | 148 | 28,639 | 17,183 |
| 48K | 48K | 106 | 474 | 172 | 42,607 | 25,120 |
| 64K | 64K | 157 | 429 | 176 | 55,958 | 32,721 |
Comparison Baselines:
- Basic: Unmodified internal inference framework
- SGL(MC): SGLang + Mooncake + 3FS (N/A for some configurations due to unsupported features)
- Oracle: Theoretical upper bound, bypassing all disk reads and transfers
Offline Batch Inference Results
This scenario models the rollout phase of RL training, where all agents start inference simultaneously; the metric is Job Completion Time (JCT).
Key Findings:
- DeepSeek 660B: DualPath achieves up to 1.87× improvement over Basic, with performance close to Oracle, indicating KV-Cache I/O bottlenecks are largely eliminated
- DeepSeek 27B: Up to 1.78× improvement, but 1.09-1.85× slower than Oracle due to limited storage bandwidth in the 1P1D configuration
- Qwen 32B: Consistent trends with the DeepSeek models
- Scale effects: DualPath benefits more from larger batch sizes and longer context lengths
P/D Ratio Impact:
- Across 1P1D, 2P1D, and 1P2D configurations, DualPath achieves an average 1.64× speedup (up to 2.46×)
- Interesting phenomenon: Basic 1P1D ≈ Basic 1P2D, DualPath 1P1D ≈ Basic 2P1D, and DualPath 2P1D ≈ DualPath 1P2D
- This confirms that storage bandwidth, not compute capacity, is the dominant bottleneck in agentic inference
Impact of Append and Generation Length:
- As append length increases, Basic performance gradually approaches DualPath (compute pressure increases, so I/O pressure relatively decreases)
- As generation length increases, the larger prefill gap time reduces KV-Cache loading pressure
- In typical agentic scenarios (short append, short generation), DualPath maintains a 1.82-1.99× advantage
Online Serving Results
In online serving scenarios, agents arrive according to a Poisson process, measuring latency metrics:
- TTFT (Time To First Token): First-token latency
- TTST (Time To Second Token): Second-token latency
- TPOT (Time Per Output Token): Per-output-token time
Service Level Objective (SLO): TTFT ≤ 4 seconds, TPOT ≤ 50 milliseconds
Key Findings:
- Throughput improvement: DualPath’s agents-per-second (APS) capacity improves by 1.67× (27B) to 2.25× (660B) over Basic
- Latency characteristics: TTST is comparable to Basic and TPOT shows no increase, proving DualPath introduces no additional decoding overhead
- TTFT stability: DualPath maintains stable TTFT components across different arrival rates, while Basic’s queuing time grows dramatically due to insufficient storage bandwidth
Ablation Study
Technical components are added incrementally to quantify their individual contributions (64K context, 1024/2048 agents):
| Configuration | JCT Reduction vs. Basic |
|---|---|
| Basic + Layerwise Prefill | 17.21% |
| + Dual-path Loading | Additional 38.19% (cumulative 55.4%) |
| + Scheduling Algorithm | Additional 45.62% (cumulative optimal) |
Load Balancing Effects:
- Storage NIC traffic balance: Improved from 1.53 (round-robin) to 1.18 (Max/Avg ratio, with 1.0 being perfect balance)
- Attention layer execution time balance: Maintains a Max/Avg ratio as low as 1.06 during the first 5% of the task, significantly reducing GPU idle bubbles
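The Max/Avg balance metric quoted above is straightforward to compute; a sketch:

```python
def max_avg_ratio(loads):
    """Max/Avg load-balance metric: 1.0 means perfectly even load across
    the measured units (here, storage NICs or attention execution times)."""
    return max(loads) / (sum(loads) / len(loads))
```

Read this way, the improvement from 1.53 to 1.18 means the busiest storage NIC carries only 18% more traffic than the average, instead of 53% more.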
Large-Scale Scalability
Validated on large-scale clusters of up to 1,152 GPUs:
| Scenario | Configuration | Result |
|---|---|---|
| Offline inference | 2P4D (2K agents) → 48P96D (48K agents) | JCT from 3,167s to 3,201s, near-linear scaling |
| Online serving | 44P88D | 22× throughput improvement (0.4→8.8 APS), stable latency |
Scheduler CPU usage remains below 10 cores, confirming it’s not a bottleneck.
Deep Understanding: Why DualPath Works
Hardware Trends vs. Workload Mismatch
From NVIDIA Ampere to Blackwell architecture, the I/O-compute ratio decreases by 14.4×. This means:
- GPU compute capacity (FLOPS) grows rapidly
- Network bandwidth and HBM capacity lag behind
- In agentic workloads, we hit the memory wall and communication wall earlier
Small HBM capacity limits token batch sizes, preventing full utilization of compute units like Tensor Cores. Low NIC bandwidth limits KV-Cache loading speed, causing GPUs to idle while waiting.
The Importance of Cache-Compute Ratio
Definition: the amount of KV-Cache that must be loaded per unit of computation needed (GB/PFLOP)
| Model | Cache-Compute Ratio (16K-64K context) |
|---|---|
| Qwen2.5-32B (FP16) | 117-267 |
| GPT-OSS-120B | 47-95 |
| Qwen3-235B-A22B | 39-60 |
| DeepSeek-V3.2 660B | 13-36 |
| DeepSeek-V3 660B | 4.8-5.8 |
Even for DeepSeek-V3.2 with MLA optimization, at 32K context with 429-token append length, the cache-compute ratio reaches approximately 22GB/PFLOP, posing significant pressure on storage bandwidth. For models with larger KV-Cache sizes, the situation is even worse.
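A back-of-the-envelope version of this metric can be sketched as follows, assuming the ratio is defined as cached KV bytes to load divided by prefill FLOPs for the appended tokens (my reading of the definition above; any per-token figures you plug in are placeholders, not the paper's measurements).

```python
def cache_compute_ratio(kv_bytes_per_token, context_tokens,
                        append_tokens, flops_per_token):
    """Cache-compute ratio in GB/PFLOP: bytes of cached KV loaded for the
    reused context, per PFLOP spent prefilling the newly appended tokens."""
    cache_gb = kv_bytes_per_token * context_tokens / 1e9
    compute_pflop = flops_per_token * append_tokens / 1e15
    return cache_gb / compute_pflop
```

The structure of the formula explains the article's trend: short appends over long reused contexts drive the ratio up (lots of cache to load, little compute to hide it behind), which is exactly the agentic pattern.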
Constraints of Modern AI Data Center Architecture
Standard NVIDIA DGX SuperPOD design principles:
- Compute network (east-west): High-speed NVLink + 400Gbps compute NICs for GPU-to-GPU communication
- Storage network (north-south): Independent 400Gbps storage NICs for datasets, checkpoints, and KV-Cache access
- Key isolation: Compute and storage networks are physically isolated to prevent mutual interference
DualPath’s innovation is to transfer KV-Cache through the compute network without breaking this isolation, thereby aggregating the storage bandwidth of all nodes.
Practical Deployment Considerations
Applicable Scenarios
DualPath is particularly suitable for:
- Agentic RL training: Rollout phases must process large numbers of multi-step agent trajectories while DRAM is occupied by training state, so KV-Cache must reside on external storage
- Long-context online serving: Multi-turn dialogue applications where context accumulates to extreme lengths
- High KV-Cache hit rate workloads: Short-append, multi-turn interaction patterns
Configuration Recommendations
P/D Ratio Selection:
- Ensure the P/D ratio is within the reasonable range given by the theoretical analysis (e.g., 1/7 to 7/2 for g=8, s=1)
- Determine the optimal ratio through small-scale experiments in the actual deployment
DRAM Allocation:
- DeepSeek models: 80GB per node
- Models with larger KV-Cache, like Qwen: 320GB per node
- Compared to pure DRAM caching solutions (e.g., Mooncake requires 1.5TB/node), this significantly reduces memory costs
Traffic Isolation Configuration (InfiniBand example):
qos_max_vls 4
qos_high_limit 240
qos_vlarb_high 0:192,1:192,2:0,3:192
qos_vlarb_low 0:192,1:192,2:64,3:192
Limitations and Future Work
- Dynamic adaptability: Current workloads are highly dynamic (e.g., in RL training, prefill pressure is significantly higher in the first half than the second half), requiring more flexible parallelism strategies and online P/D ratio adjustment mechanisms
- Scheduling optimization: Under large-scale deployment, there remains room to reduce TTFT tail latency
- Split reading: The current implementation chooses single-side reading; future work could explore splitting requests to read from both sides in parallel
- Working set management: In online scenarios, as arrival rates and JCT increase, KV-Cache working sets may exceed memory, requiring smarter caching strategies
Frequently Asked Questions (FAQ)
Q1: What is the difference between DualPath and distributed memory caching solutions like Mooncake?
A: Mooncake builds a distributed DRAM pool to cache KV-Cache, using affinity-aware scheduling to maximize hit rates. This works well in memory-sufficient scenarios but has two limitations:
- It cannot be used in memory-constrained scenarios (e.g., RL training rollout phases where DRAM is occupied by training states)
- It is cost-prohibitive for ultra-large working sets (online serving), because DRAM is much more expensive than SSD
DualPath targets the storage backend directly, balancing traffic across all storage NICs, greatly reducing DRAM usage without harming performance. The two can be combined, but gains are marginal.
Q2: Why not use GPUDirect Storage to read KV-Cache directly to GPU?
A: GPUDirect Storage can indeed read directly from storage to GPU HBM, but it cannot isolate KV-Cache traffic from latency-sensitive model execution communications. In agentic inference, collective communication latency (e.g., AllToAll) is critical. The CNIC-centric approach, while seemingly taking a detour, is currently the only practical method to ensure KV-Cache loading doesn’t degrade critical model execution performance.
Q3: Does DualPath increase latency in the decode phase?
A: No. Experimental data shows DualPath’s TPOT (Time Per Output Token) is comparable to baseline, and TTST (Time To Second Token) shows no significant difference. This is because:
- Compute network QoS mechanisms ensure KV-Cache transfer uses only spare bandwidth
- The DE Buffer design optimizes data transfer paths
- Decode-phase computation and communication patterns remain undisturbed
Q4: How is the decision made between PE read path and DE read path?
A: The scheduler dynamically decides based on real-time load information:
- It compares storage reading queue lengths between the prefill and decode engine nodes
- It chooses the side with the shorter queue to execute storage reads
- It considers multiple dimensions including GPU load, token count, and HBM capacity
Q5: Is layer-wise prefill mandatory? What is its impact on performance?
A: Layer-wise prefill is an important foundation of DualPath, solving HBM capacity bottlenecks in long-context prefilling. By loading only the current layer’s required KV-Cache per layer, effective batch size (in tokens) increases by approximately layer times. Ablation studies show that adding layer-wise prefill alone reduces JCT by 17.21%.
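The "approximately layer times" claim follows from simple HBM accounting; here is a sketch under the assumption that layer-wise prefill keeps only one layer's KV resident at a time (function name and figures are illustrative).

```python
def max_batch_tokens(hbm_kv_budget, kv_bytes_per_token, n_layers, layerwise):
    """Token capacity of an HBM KV budget, with and without layer-wise
    prefill. Layer-wise mode holds only one layer's KV per token resident,
    so per-token cost drops by a factor of n_layers."""
    per_token = kv_bytes_per_token / n_layers if layerwise else kv_bytes_per_token
    return int(hbm_kv_budget // per_token)
```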
Q6: Which network types does DualPath support?
A: While experiments were conducted on InfiniBand, design principles apply to:
- InfiniBand: Uses Virtual Lanes (VL) for QoS isolation
- RoCE: Uses DSCP and Traffic Class (TC) markings
- Emerging technologies: UnifiedBus, Ultra Ethernet, and other converged QoS mechanisms
Q7: Is DualPath equally effective for small models (e.g., 27B)?
A: Effective, but improvement magnitude is limited by storage bandwidth. In small model 1P1D configurations, due to limited total storage bandwidth, DualPath remains 1.09-1.85× slower than Oracle. Increasing node count or optimizing P/D ratios is recommended to unlock greater potential.
Summary
DualPath solves the core bottleneck in agentic LLM inference by rethinking KV-Cache loading in PD-disaggregated architectures. Its key innovations include:
- Dual-path loading architecture: Breaks the prefill-centric limitation, fully utilizing the idle storage bandwidth of decode engines
- CNIC-centric traffic management: Ensures latency-sensitive communications remain unaffected while opportunistically utilizing spare compute network bandwidth
- Adaptive scheduling algorithms: Dynamically balance compute and network resources for global optimization
Under real agentic workloads, DualPath achieves up to 1.87× throughput improvement for offline inference and average 1.96× throughput improvement for online serving, while maintaining stable latency metrics. This architecture provides a solid foundation for efficient deployment of next-generation long-context, multi-turn interactive AI applications.
Reference Information
This article is based on the following research paper:
- DualPath: Breaking the Storage Bandwidth Bottleneck in Agentic LLM Inference
- Authors: Yongtong Wu, Shaoyuan Chen, Yinmin Zhong, et al. (Peking University, Tsinghua University, DeepSeek-AI)
- Published: arXiv:2602.21548v2 [cs.DC], February 2026
Related Technologies:
- PD-disaggregated architecture: Splitwise, DistServe
- Layer-wise prefill: LayerKV, PrefillOnly
- Distributed KV-Cache: Mooncake, TokenLake
- Efficient attention kernels: FlashMLA, FlashInfer
- Expert parallelism communication: DeepEP

