Introduction: When LLM Scale Meets Network Bottlenecks

Imagine trying to run a large language model at the scale of DeepSeek V3 (671 billion parameters) or Kimi K2 (1 trillion parameters). Models of this size can no longer be deployed on a single 8-GPU server: in BF16, DeepSeek V3's weights alone occupy roughly 1.3 TB, well beyond the combined HBM of a typical 8×80 GB node, so the model must be sharded across multiple computing nodes. This exposes a surprising reality: the main constraint on performance is no longer computational power (FLOPs), but the efficiency of network communication between GPUs. This is the core challenge facing modern large language model systems. As model sizes explode, traditional collective communication libraries (like NCCL) struggle …