From O(n²) to O(L·√L): How DeepSeek-V3.2-Exp Slashes Long-Context Costs Without Hurting Quality

高效码农

A 5-minute read for engineers who need 128K tokens tonight, not next quarter.

1. The Scene: 2 A.M. and the Context-Length Wall

Li, a Beijing-based ML engineer, just wanted his 671B-parameter model to read a 100k-token spec and answer one obscure question. By token 60k the GPU fans sounded like jet engines; at 90k the server threw an OOM and the latency graph looked like Everest. Sound familiar? Long context is the new memory wall, and the bill is paid in both dollars and sleep. The next morning, DeepSeek dropped an experimental image on Docker Hub: lmsysorg/sglang:dsv32 …
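As a rough sketch of what trying that image looks like, the pull command below uses the tag named in the article; the `launch_server` invocation, model path, and tensor-parallel size are assumptions for illustration and should be checked against the image's own documentation.

```shell
# Pull the experimental SGLang image mentioned in the article.
docker pull lmsysorg/sglang:dsv32

# Hypothetical launch: serve DeepSeek-V3.2-Exp with tensor parallelism.
# Flags (--model-path, --tp, --port) are assumptions -- verify against
# the SGLang docs for this tag before relying on them.
docker run --gpus all --shm-size 32g -p 30000:30000 \
  lmsysorg/sglang:dsv32 \
  python3 -m sglang.launch_server \
    --model-path deepseek-ai/DeepSeek-V3.2-Exp \
    --tp 8 --port 30000
```

Once the server is up, any OpenAI-compatible client pointed at port 30000 can send the 100k-token spec as a single request.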