MoGA: The Sparse Attention Trick That Lets One GPU Generate a 60-second, Multi-shot Video at 24 fps—Without Blowing Up Memory

高效码农

What exactly makes long-video generation with Transformers so expensive, and how does MoGA solve it in practice? Quadratic full attention is the culprit. MoGA replaces it with a learnable token router that sends each token to one of M semantic groups and runs full attention only inside each group, cutting attention FLOPs by roughly 70% while preserving visual quality.

What problem is this article solving? Reader question: "Why can't I just scale Diffusion Transformers to minute-long videos, and what does MoGA change?" Answer: the context length explodes to roughly 580k tokens, at which point full attention costs on the order of 330 PFLOPs and runs out of memory on a single GPU. MoGA introduces …
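To make the routing idea concrete, here is a minimal PyTorch sketch, not the paper's implementation: a learned linear router hard-assigns each token to one of M groups, and full attention runs only among tokens that share a group. The single-head setup, function names, and weight shapes are illustrative assumptions.

```python
# Minimal sketch of grouped attention with a learnable token router.
# Assumptions: single head, hard (argmax) routing, toy random weights.
import torch
import torch.nn.functional as F

def grouped_attention(x, wq, wk, wv, router_w, num_groups):
    """x: (N, d) tokens. The router assigns each token to one of `num_groups`;
    full attention is computed only among tokens in the same group."""
    N, d = x.shape
    group_id = (x @ router_w).argmax(dim=-1)            # (N,) one group per token
    q, k, v = x @ wq, x @ wk, x @ wv
    out = torch.zeros_like(v)
    for g in range(num_groups):
        idx = (group_id == g).nonzero(as_tuple=True)[0]  # tokens routed to group g
        if idx.numel() == 0:
            continue
        qg, kg, vg = q[idx], k[idx], v[idx]
        attn = F.softmax(qg @ kg.T / d ** 0.5, dim=-1)   # full attention inside the group only
        out[idx] = attn @ vg
    return out

# Toy usage: 1,024 tokens routed into M = 8 groups.
torch.manual_seed(0)
d, M = 64, 8
x = torch.randn(1024, d)
wq, wk, wv = (torch.randn(d, d) * d ** -0.5 for _ in range(3))
router_w = torch.randn(d, M) * d ** -0.5
y = grouped_attention(x, wq, wk, wv, router_w, M)
print(y.shape)  # torch.Size([1024, 64])
```

The savings come from the group size: if N tokens split into M groups of roughly N/M tokens each, per-group attention scales as (N/M)^2, so total attention work drops by about a factor of M compared with full N^2 attention.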