NVIDIA Nemotron-3-Nano Architecture: How the 31B MoE Model with Mamba-2 Delivers 1M Context


Nemotron-3-Nano Under the Hood: 31B Parameters, 3B Active, 1M Context, 3× Faster Inference

“TL;DR: NVIDIA’s latest open-weight model keeps 128 experts on standby, wakes up only 6 per token, and mixes Mamba-2 with Grouped-Query Attention to deliver 25T-token pre-training, multi-environment RL, and FP8 inference that outruns models twice its activated size while supporting a 1M-token context.”

What Makes Nemotron-3-Nano Special in One Sentence?

It achieves higher accuracy than Nemotron-2-Nano and competing models while activating less than half the parameters per forward pass, and delivers up to 3.3× higher inference throughput on a single H200 GPU. …
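To make the "128 experts on standby, 6 awake" idea concrete, here is a minimal sketch of top-k mixture-of-experts routing. Only the expert and top-k counts (128 and 6) come from the article; the hidden sizes, the softmax-over-selected-experts gating, and the `TopKMoE` module itself are illustrative assumptions, not NVIDIA's actual implementation.

```python
# Minimal top-k MoE routing sketch. 128 experts / 6 active per token are the
# article's figures; everything else (dims, gating scheme) is hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model=512, d_ff=1024, n_experts=128, top_k=6):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        # Every expert's weights exist ("on standby"), but only top_k experts
        # run per token, so roughly 6/128 of expert parameters are active.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):  # x: (tokens, d_model)
        logits = self.router(x)                         # (tokens, n_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)  # choose 6 experts/token
        weights = F.softmax(weights, dim=-1)            # normalize over the 6
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e in idx[:, slot].unique().tolist():
                mask = idx[:, slot] == e                # tokens routed to e
                out[mask] += weights[mask, slot].unsqueeze(-1) * \
                             self.experts[e](x[mask])
        return out

tokens = torch.randn(4, 512)
print(TopKMoE()(tokens).shape)  # torch.Size([4, 512])
```

This is why "31B parameters, 3B active" is possible: the router's top-k selection means per-token compute scales with the 6 chosen experts, not the full 128.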
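The other half of the TL;DR is Grouped-Query Attention, which the article says is mixed with Mamba-2 layers. Below is a minimal GQA sketch; the head counts and dimensions are illustrative assumptions, not the model's real configuration, and the Mamba-2 side is omitted since its internals aren't described here.

```python
# Minimal grouped-query attention (GQA) sketch: several query heads share
# each K/V head, shrinking the KV cache by h_q / h_kv. Dims are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GQA(nn.Module):
    def __init__(self, d_model=512, n_q_heads=8, n_kv_heads=2):
        super().__init__()
        assert n_q_heads % n_kv_heads == 0
        self.h_q, self.h_kv = n_q_heads, n_kv_heads
        self.d_head = d_model // n_q_heads
        self.q_proj = nn.Linear(d_model, n_q_heads * self.d_head, bias=False)
        # Fewer K/V heads than Q heads: a smaller KV cache is part of what
        # makes very long contexts (e.g. 1M tokens) affordable at inference.
        self.k_proj = nn.Linear(d_model, n_kv_heads * self.d_head, bias=False)
        self.v_proj = nn.Linear(d_model, n_kv_heads * self.d_head, bias=False)
        self.o_proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x):  # x: (batch, seq, d_model)
        b, s, _ = x.shape
        q = self.q_proj(x).view(b, s, self.h_q, self.d_head).transpose(1, 2)
        k = self.k_proj(x).view(b, s, self.h_kv, self.d_head).transpose(1, 2)
        v = self.v_proj(x).view(b, s, self.h_kv, self.d_head).transpose(1, 2)
        # Each K/V head serves a group of h_q // h_kv query heads.
        k = k.repeat_interleave(self.h_q // self.h_kv, dim=1)
        v = v.repeat_interleave(self.h_q // self.h_kv, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(out.transpose(1, 2).reshape(b, s, -1))

x = torch.randn(1, 8, 512)
print(GQA()(x).shape)  # torch.Size([1, 8, 512])
```

In a hybrid stack like the one described, most layers would be Mamba-2 blocks (constant-size state, linear-time scan), with occasional GQA layers like this one providing full attention where it matters.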