Nemotron-3-Nano Under the Hood: 31 B Parameters, 3 B Active, 1 M Context, 3× Faster Inference
TL;DR: NVIDIA’s latest open-weight model keeps 128 experts on standby, wakes up only 6 per token, and mixes Mamba-2 with Group-Query Attention. Built on 25 T tokens of pre-training and multi-environment RL, its FP8 build delivers inference that outruns models twice its activated size while supporting a 1 M token context.
What Makes Nemotron-3-Nano Special in One Sentence?
It achieves higher accuracy than Nemotron-2-Nano and competitive models while activating only about 3 B of its 31.6 B parameters per forward pass and delivering up to 3.3× higher inference throughput on a single H200 GPU.
How Does the MoE + Mamba Hybrid Architecture Save Compute?
Core question: Why can a 31.6 B model feel like 3 B during inference?
- 128 routed experts, only 6 active per token (routing sketched below)
- 2 shared experts always on for common knowledge
- No positional embeddings, no dropout, no linear bias → less memory traffic
- Mamba-2 layers give linear context scaling; GQA slashes the KV-cache
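To make the routing concrete, here is a minimal sketch of a top-k MoE layer with always-on shared experts. It is an illustration only: the `SparseMoELayer` class and the layer sizes (`d_model`, `d_ff`) are assumptions chosen to mirror the numbers above, not the released implementation, and load balancing is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def _ffn(d_model: int, d_ff: int) -> nn.Module:
    # Simple expert FFN; the real expert shape is not specified in this article.
    return nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))

class SparseMoELayer(nn.Module):
    """Top-k routed MoE with shared experts: 128 routed experts, 6 active per
    token, plus 2 shared experts that see every token (sizes are illustrative)."""

    def __init__(self, d_model=2048, d_ff=1024, n_experts=128, top_k=6, n_shared=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)  # no linear bias, as noted above
        self.experts = nn.ModuleList(_ffn(d_model, d_ff) for _ in range(n_experts))
        self.shared = nn.ModuleList(_ffn(d_model, d_ff) for _ in range(n_shared))

    def forward(self, x):                            # x: (tokens, d_model)
        logits = self.router(x)                      # (tokens, n_experts)
        weights, idx = logits.topk(self.top_k, -1)   # keep only the 6 best experts per token
        weights = F.softmax(weights, dim=-1)         # renormalize over the chosen experts
        out = sum(e(x) for e in self.shared)         # shared experts are always on
        for slot in range(self.top_k):
            for expert_id in idx[:, slot].unique().tolist():
                mask = idx[:, slot] == expert_id     # tokens routed to this expert in this slot
                out[mask] += weights[mask, slot, None] * self.experts[expert_id](x[mask])
        return out
```

Only 6 of the 128 expert FFNs run for any given token, which is why the active parameter count stays near 3 B even though the total is 31.6 B.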
Scenario: You deploy a code-completion backend serving 500 concurrent developers. With dense 30 B models you’d need 16 A100s; Nemotron-3-Nano fits the same QPS on 6 H200s because each request touches only 3.2 B weights and KV-cache is 4× smaller. Latency drops from 1.8 s to 0.6 s per 1 k token output.
Author reflection: I used to believe sparse routing meant jittery latency. Seeing the aux-loss-free load balancer keep batch-to-batch expert utilization within 2 % changed my mind—proper engineering beats theoretical worries.
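The article does not spell out the aux-loss-free balancer, so the following is a sketch of one common scheme (per-expert bias nudging) rather than Nemotron's exact recipe: a bias added to the routing logits for expert selection is pushed down for overloaded experts and up for underloaded ones, with no auxiliary loss term.

```python
import torch

def update_selection_bias(expert_token_counts: torch.Tensor,
                          bias: torch.Tensor,
                          step: float = 1e-3) -> torch.Tensor:
    """After each batch, nudge a per-expert bias used for expert *selection only*
    (mixing weights still come from the raw logits). Overloaded experts get a
    lower bias, underloaded ones a higher bias, so utilization flattens out."""
    target = expert_token_counts.float().mean()                     # ideal tokens per expert
    return bias - step * torch.sign(expert_token_counts.float() - target)

# Usage inside the router (pseudo-flow):
#   selection_logits = logits + bias          -> topk picks the experts
#   mixing weights   = softmax over raw logits of the chosen experts
```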
What Exactly Was Fed to the Model for 25 Trillion Tokens?
Core question: Which data mixture lets a sparse model outperform dense competitors?
Phase 1 (0–23.5 T tokens)
- Recent Common Crawl snapshots filtered into five quality buckets
- 428 B code tokens from rendered HTML plus Phi-4 cleaning
- 2.1 T synthetic rewrites of medium- and high-quality pages
- Nine-language back-translation into English for extra high-quality volume
- STEM Reasoning QA (RQA): 4.3 M examples ≈ 31.7 B tokens
Phase 2 (23.5–25 T tokens)
- Wikipedia refresh, math textbooks, scientific coding articles
- Graduate-level competitive code problems cross-bred with physics/chemistry concepts (InfiniByte)
- Prompt-to-code pairs distilled from GitHub missing repos
Scenario: A financial analyst wants 10-year macro indicators summarized. Phase-2’s synthetic textbooks included time-series explanations, so the model produces Stata-ready code plus economic interpretation without extra plugins.
How Is Post-Training Organized—SFT, RLVR, RLHF?
Core question: Why three stages instead of one giant fine-tune?
1. Supervised Fine-Tuning (SFT)
- 18 M samples, 256 k packed sequence length, chat-template wrapped
- 10 % of samples stripped of reasoning tokens → the model learns to toggle reasoning on/off
- 3 % with reasoning truncated → budget control (preprocessing sketched below)
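A rough idea of how the 10 % / 3 % split might be produced at data-prep time, as promised above. This is a sketch, not the released pipeline: the `<reasoning>` tags are borrowed from the tool-calling snippet later in this article, and the word-based budget is a stand-in for a real token budget.

```python
import random
import re

def preprocess_sft_sample(text: str, p_strip: float = 0.10, p_truncate: float = 0.03,
                          budget_words: int = 256) -> str:
    """~10 % of samples lose their reasoning span entirely (teaches answering with
    reasoning off); ~3 % keep only a truncated prefix (teaches a thinking budget);
    the rest pass through unchanged."""
    match = re.search(r"<reasoning>(.*?)</reasoning>", text, flags=re.S)
    if match is None:
        return text
    r = random.random()
    if r < p_strip:
        return text.replace(match.group(0), "")               # reasoning removed entirely
    if r < p_strip + p_truncate:
        truncated = " ".join(match.group(1).split()[:budget_words])
        return text.replace(match.group(1), truncated)        # reasoning cut to the budget
    return text
```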
2. Multi-Environment RL with Verifiable Rewards (RLVR)
Math, code, QA, JSON, long-context, tool-use environments trained simultaneously; curriculum sampling shifts from easy to hard within each batch. Result: GPQA +7 pts, LiveCodeBench +6 pts, no catastrophic forgetting.
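“Verifiable reward” simply means the environment can score an answer mechanically. The checkers below are a toy sketch under that definition (the function name and reference format are mine, and curriculum weighting is omitted); the production environments are far richer.

```python
import json

def verifiable_reward(env: str, response: str, reference: dict) -> float:
    """Return 1.0 only when the answer can be checked mechanically."""
    if env == "math":
        # Compare the final token of the response against the reference answer.
        tokens = response.split()
        return float(bool(tokens) and tokens[-1] == reference["answer"])
    if env == "json":
        # Reward any syntactically valid JSON output.
        try:
            json.loads(response)
            return 1.0
        except json.JSONDecodeError:
            return 0.0
    if env == "code":
        # In production this would run unit tests in a sandbox; here it is a stub.
        return float(reference.get("tests_passed", False))
    return 0.0
```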
3. Reinforcement Learning from Human Feedback (RLHF)
A generative reward model (GenRM) first reasons about pairwise answers, then outputs helpfulness scores. Group-relative length control keeps answers concise; verbosity drops 30 % with no accuracy loss.
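Here is a small sketch of what group-relative length control can look like: normalize the GenRM scores within a group of answers to the same prompt, then subtract a penalty for being longer than the group average. The `alpha` coefficient and the exact normalization are assumptions for illustration.

```python
import numpy as np

def group_relative_rewards(scores, lengths, alpha=0.05):
    """Within a group of sampled answers to one prompt, normalize the helpfulness
    scores, then penalize answers longer than the group average so conciseness
    wins ties between equally good answers."""
    scores = np.asarray(scores, dtype=float)
    lengths = np.asarray(lengths, dtype=float)
    advantage = (scores - scores.mean()) / (scores.std() + 1e-6)
    length_penalty = alpha * (lengths - lengths.mean()) / (lengths.mean() + 1e-6)
    return advantage - length_penalty

# Example: three answers with equal scores; the shortest one gets the highest reward.
print(group_relative_rewards([0.8, 0.8, 0.8], [120, 300, 800]))
```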
Code snippet: Minimal tool-calling prompt
<tools>
<tool name="python_interpreter"/>
</tools>
<reasoning>
Need to simulate 1 M random walks.
</reasoning>
<content>
```python
import numpy as np
paths = np.random.randn(1000000, 100).cumsum(axis=1)
```
</content>
How Good Is the Long-Context Ability Really?
Core question: Can 1 M token length be more than a marketing number?
- Continuous pre-training on a 512 k + 4 k length mixture
- 3× larger document-QA dataset than Nano-2
- An extra 1 % of retrieval-focused synthetic data
Benchmarks:
- RULER-100 @ 1 M: 86.34 % (Qwen3-30B: 77.5 %)
- AA-LCR: 35.85 % (higher is better; no comparable figure given)
Scenario: A legal-tech startup feeds in 300 contracts (950 k tokens) and asks for a cross-document clause-inconsistency report. The model returns a 120-line bullet list citing article numbers; total inference time is 28 s on 8×H100, at roughly one-third the cost of the GPT-4-32k API.
FP8 Post-Training Quantization: What Survives the Bit Squeeze?
Core question: Which parts can be 8-bit without destroying accuracy?
Selective strategy:
- Keep 6 self-attention layers plus the 6 Mamba layers preceding them in BF16
- Quantize the remaining 40 layers' weights, activations, and KV-cache to FP8 (a fake-quant sketch follows below)
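As referenced in the list above, a fake-quant sketch of the selective strategy: simulate a per-tensor FP8 (E4M3) round trip on every linear weight except the layers kept in BF16. Real deployments use dedicated FP8 kernels and also quantize activations and the KV-cache, which this toy skips; the `keep_bf16` name matching is an assumption about how the protected layers would be identified.

```python
import torch

FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in float8_e4m3fn

def fake_quant_fp8(w: torch.Tensor) -> torch.Tensor:
    """Per-tensor scale, cast to FP8, cast back: shows what the weights lose."""
    scale = w.abs().max().clamp(min=1e-12) / FP8_E4M3_MAX
    return (w / scale).to(torch.float8_e4m3fn).to(w.dtype) * scale

def selectively_quantize(model: torch.nn.Module, keep_bf16: tuple[str, ...]) -> None:
    """Round-trip every Linear weight except layers whose name matches keep_bf16
    (i.e. the sensitive attention layers and the Mamba layers preceding them)."""
    for name, module in model.named_modules():
        if isinstance(module, torch.nn.Linear) and not any(k in name for k in keep_bf16):
            module.weight.data = fake_quant_fp8(module.weight.data)
```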
Outcome: median accuracy recovery of 99 %, throughput +250 % on a single H200 (8 k in / 16 k out).
Ablation insight: attention layers are the most sensitive; once they are kept safe in BF16, the rest tolerates 8-bit, especially because MoE experts are naturally sparse and redundant.
Putting It to Work: End-to-End Deployment Recipe
Core question: How do I serve the model this afternoon?
1. Pull the weights:

   ```bash
   git lfs install
   git clone https://huggingface.co/nvidia/Nemotron-3-Nano-30B-A3B-FP8
   ```

2. Install dependencies:

   ```bash
   pip install "transformers>=4.47" accelerate triton vllm
   ```

3. Launch vLLM:

   ```bash
   python -m vllm.entrypoints.api_server \
     --model ./Nemotron-3-Nano-30B-A3B-FP8 \
     --tensor-parallel-size 2 \
     --pipeline-parallel-size 2 \
     --dtype float8_e4m3fn \
     --max-model-len 1048576
   ```

4. Client call:

   ```python
   import requests

   prompt = ("<|im_start|>user\nWrite a CUDA kernel for vector add…<|im_end|>\n"
             "<|im_start|>assistant\n")
   rsp = requests.post(
       "http://localhost:8000/generate",
       json={"prompt": prompt, "max_tokens": 2048, "temperature": 0.3},
   )
   print(rsp.json()["text"])
   ```
Cost footprint: FP8 weights are ≈15 GB and KV-cache ≈1 MB per 1 k tokens (≈0.13 GB at 128 k context), so a single long request lands around 16 GB with runtime overhead and fits in 2×L40-48G for small-scale serving.
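A quick back-of-envelope check of those figures (the 0.5 GB overhead term is my assumption for activations and framework buffers, not a published number):

```python
def serving_memory_gb(weights_gb: float = 15.0,
                      context_tokens: int = 128_000,
                      kv_mb_per_1k_tokens: float = 1.0,
                      overhead_gb: float = 0.5) -> float:
    """FP8 weights + KV-cache for one long request + an assumed runtime overhead."""
    kv_gb = context_tokens / 1000 * kv_mb_per_1k_tokens / 1024
    return weights_gb + kv_gb + overhead_gb

print(f"{serving_memory_gb():.1f} GB")  # ~15.6 GB, i.e. roughly the 16 GB quoted above
```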
Action Checklist / Implementation Steps
- [ ] Decide context length (4 k / 64 k / 256 k / 1 M) and pick batch size accordingly
- [ ] Choose precision: FP8 for throughput, BF16 for finetune-ready checkpoints
- [ ] Allocate GPUs: 1×H200 handles 8 k in / 16 k out at 25 req/s; scale linearly with tensor parallelism
- [ ] Tune sampling: temperature 0.2–0.3 for math/code, 0.6 for chat; enable reasoning for multi-step problems
- [ ] Monitor expert load: use the NeMo-Gym dashboard to keep utilization imbalance under 5 % (a monitoring sketch follows below)
- [ ] Cache calibration: reuse the shipped 1 k SFT samples for any custom FP8 re-calibration
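For the expert-load item above, one minimal way to quantify imbalance from routing counts (the exact formula the dashboard uses is not given here, so treat this as an assumption):

```python
import numpy as np

def expert_imbalance(token_counts: np.ndarray) -> float:
    """Max relative deviation of per-expert token share from a perfectly uniform
    split; keep this under 0.05 to satisfy the <5 % target in the checklist."""
    share = token_counts / token_counts.sum()
    uniform = 1.0 / len(token_counts)
    return float(np.abs(share - uniform).max() / uniform)

# Example: routing counts collected over one serving window for 128 experts.
counts = np.random.multinomial(1_000_000, [1 / 128] * 128)
print(f"imbalance: {expert_imbalance(counts):.2%}")
```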
One-Page Overview
Nemotron-3-Nano is a 31 B parameter MoE model that activates only 3 B per token. It combines Mamba-2 and Group-Query Attention for linear context growth, trains on 25 T tokens with a two-stage curriculum, and uses multi-environment RL to boost math, code, and tool-use benchmarks without forgetting. Selective FP8 quantization keeps 99 % accuracy while increasing throughput 2.5×. The open release includes base and post-trained checkpoints, data recipes, and serving code—ready for 1 M token contexts on a single H200 node.
FAQ
- How much GPU memory is required for a 1 M token context?
  ~80 GB with FP8 weights and 128 k KV-cache compression; 8×H100 recommended for comfortable batching.
- Is the model truly open for commercial use?
  Yes, under the NVIDIA Open Model License; no gated API required.
- Will I lose accuracy if I skip FP8 and stay in BF16?
  No. FP8 is for speed; BF16 is the reference accuracy. You can serve in BF16 if memory allows.
- Can I fine-tune the FP8 checkpoint further?
  Not directly. Load the BF16 checkpoint, perform QAT or LoRA, then re-run selective FP8 calibration.
- Which inference frameworks are supported?
  vLLM and TensorRT-LLM both work; NVIDIA provides example scripts for each.
- Does the model support languages other than English?
  Pre-training included 19 languages and translation pairs; the MMLU-ProX average is 59.5 %, suitable for multilingual RAG, but the primary strength remains English STEM.
- What hyper-parameter stops tool hallucination?
  A 50-step DPO add-on drops hallucination from 8 % to under 1 %; RLVR alone achieves similar results without extra data.
