Nemotron-3-Nano Under the Hood: 31 B Parameters, 3 B Active, 1 M Context, 3× Faster Inference
TL;DR: NVIDIA’s latest open-weight model keeps 128 experts on standby, wakes up only 6 per token, and mixes Mamba-2 with Group-Query Attention. Built on 25 T tokens of pre-training and multi-environment RL, its FP8 build delivers inference that outruns models twice its activated size while supporting a 1 M token context.
What Makes Nemotron-3-Nano Special in One Sentence?
It achieves higher accuracy than Nemotron-2-Nano and competitive models while activating only about 3 B of its 31.6 B parameters per forward pass and delivering up to 3.3× higher inference throughput on a single H200 GPU.
How Does the MoE + Mamba Hybrid Architecture Save Compute?
Core question: Why can a 31.6 B model feel like 3 B during inference?
- 128 routed experts, only 6 active per token (routing sketched below)
- 2 shared experts always on for common knowledge
- No positional embeddings, no dropout, no linear bias → less memory traffic
- Mamba-2 layers give linear context scaling; GQA slashes the KV-cache
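To make the routing concrete, here is a minimal sketch of a top-k MoE layer with always-on shared experts. It is an illustration only: the `SparseMoELayer` class and the layer sizes (`d_model`, `d_ff`) are assumptions chosen to mirror the numbers above, not the released implementation, and load balancing is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def _ffn(d_model: int, d_ff: int) -> nn.Module:
    # Simple expert FFN; the real expert shape is not specified in this article.
    return nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))

class SparseMoELayer(nn.Module):
    """Top-k routed MoE with shared experts: 128 routed experts, 6 active per
    token, plus 2 shared experts that see every token (sizes are illustrative)."""

    def __init__(self, d_model=2048, d_ff=1024, n_experts=128, top_k=6, n_shared=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)  # no linear bias, as noted above
        self.experts = nn.ModuleList(_ffn(d_model, d_ff) for _ in range(n_experts))
        self.shared = nn.ModuleList(_ffn(d_model, d_ff) for _ in range(n_shared))

    def forward(self, x):                            # x: (tokens, d_model)
        logits = self.router(x)                      # (tokens, n_experts)
        weights, idx = logits.topk(self.top_k, -1)   # keep only the 6 best experts per token
        weights = F.softmax(weights, dim=-1)         # renormalize over the chosen experts
        out = sum(e(x) for e in self.shared)         # shared experts are always on
        for slot in range(self.top_k):
            for expert_id in idx[:, slot].unique().tolist():
                mask = idx[:, slot] == expert_id     # tokens routed to this expert in this slot
                out[mask] += weights[mask, slot, None] * self.experts[expert_id](x[mask])
        return out
```

Only 6 of the 128 expert FFNs run for any given token, which is why the active parameter count stays near 3 B even though the total is 31.6 B.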
Scenario: You deploy a code-completion backend serving 500 concurrent developers. With dense 30 B models you’d need 16 A100s; Nemotron-3-Nano fits the same QPS on 6 H200s because each request touches only 3.2 B weights and KV-cache is 4× smaller. Latency drops from 1.8 s to 0.6 s per 1 k token output.
Author reflection: I used to believe sparse routing meant jittery latency. Seeing the aux-loss-free load balancer keep batch-to-batch expert utilization within 2 % changed my mind—proper engineering beats theoretical worries.
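The article does not spell out the aux-loss-free balancer, so the following is a sketch of one common scheme (per-expert bias nudging) rather than Nemotron's exact recipe: a bias added to the routing logits for expert selection is pushed down for overloaded experts and up for underloaded ones, with no auxiliary loss term.

```python
import torch

def update_selection_bias(expert_token_counts: torch.Tensor,
                          bias: torch.Tensor,
                          step: float = 1e-3) -> torch.Tensor:
    """After each batch, nudge a per-expert bias used for expert *selection only*
    (mixing weights still come from the raw logits). Overloaded experts get a
    lower bias, underloaded ones a higher bias, so utilization flattens out."""
    target = expert_token_counts.float().mean()                     # ideal tokens per expert
    return bias - step * torch.sign(expert_token_counts.float() - target)

# Usage inside the router (pseudo-flow):
#   selection_logits = logits + bias          -> topk picks the experts
#   mixing weights   = softmax over raw logits of the chosen experts
```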
What Exactly Was Fed to the Model for 25 Trillion Tokens?
Core question: Which data mixture lets a sparse model outperform dense competitors?
Phase 1 (0–23.5 T tokens)
- Recent Common Crawl snapshots filtered into five quality buckets
- 428 B code tokens from rendered HTML plus Phi-4 cleaning
- 2.1 T synthetic rewrites of medium- and high-quality pages
- Nine-language back-translation into English for extra high-quality volume
- STEM Reasoning QA (RQA): 4.3 M examples ≈ 31.7 B tokens
Phase 2 (23.5–25 T tokens)
- Wikipedia refresh, math textbooks, scientific coding articles
- Graduate-level competitive code problems cross-bred with physics/chemistry concepts (InfiniByte)
- Prompt-to-code pairs distilled from GitHub missing repos
Scenario: A financial analyst wants 10-year macro indicators summarized. Phase-2’s synthetic textbooks included time-series explanations, so the model produces Stata-ready code plus economic interpretation without extra plugins.
How Is Post-Training Organized—SFT, RLVR, RLHF?
Core question: Why three stages instead of one giant fine-tune?
1. Supervised Fine-Tuning (SFT)
- 18 M samples, 256 k packed sequence length, chat-template wrapped
- 10 % of samples stripped of reasoning tokens → the model learns to toggle reasoning on/off
- 3 % with reasoning truncated → budget control (preprocessing sketched below)
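A rough idea of how the 10 % / 3 % split might be produced at data-prep time, as promised above. This is a sketch, not the released pipeline: the `<reasoning>` tags are borrowed from the tool-calling snippet later in this article, and the word-based budget is a stand-in for a real token budget.

```python
import random
import re

def preprocess_sft_sample(text: str, p_strip: float = 0.10, p_truncate: float = 0.03,
                          budget_words: int = 256) -> str:
    """~10 % of samples lose their reasoning span entirely (teaches answering with
    reasoning off); ~3 % keep only a truncated prefix (teaches a thinking budget);
    the rest pass through unchanged."""
    match = re.search(r"<reasoning>(.*?)</reasoning>", text, flags=re.S)
    if match is None:
        return text
    r = random.random()
    if r < p_strip:
        return text.replace(match.group(0), "")               # reasoning removed entirely
    if r < p_strip + p_truncate:
        truncated = " ".join(match.group(1).split()[:budget_words])
        return text.replace(match.group(1), truncated)        # reasoning cut to the budget
    return text
```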
2. Multi-Environment RL with Verifiable Rewards (RLVR)
Math, code, QA, JSON, long-context, tool-use environments trained simultaneously; curriculum sampling shifts from easy to hard within each batch. Result: GPQA +7 pts, LiveCodeBench +6 pts, no catastrophic forgetting.
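“Verifiable reward” simply means the environment can score an answer mechanically. The checkers below are a toy sketch under that definition (the function name and reference format are mine, and curriculum weighting is omitted); the production environments are far richer.

```python
import json

def verifiable_reward(env: str, response: str, reference: dict) -> float:
    """Return 1.0 only when the answer can be checked mechanically."""
    if env == "math":
        # Compare the final token of the response against the reference answer.
        tokens = response.split()
        return float(bool(tokens) and tokens[-1] == reference["answer"])
    if env == "json":
        # Reward any syntactically valid JSON output.
        try:
            json.loads(response)
            return 1.0
        except json.JSONDecodeError:
            return 0.0
    if env == "code":
        # In production this would run unit tests in a sandbox; here it is a stub.
        return float(reference.get("tests_passed", False))
    return 0.0
```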
3. Reinforcement Learning from Human Feedback (RLHF)
A generative reward model (GenRM) first reasons about pairwise answers, then outputs helpfulness scores. Group-relative length control keeps answers concise; verbosity drops 30 % with no accuracy loss.
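Here is a small sketch of what group-relative length control can look like: normalize the GenRM scores within a group of answers to the same prompt, then subtract a penalty for being longer than the group average. The `alpha` coefficient and the exact normalization are assumptions for illustration.

```python
import numpy as np

def group_relative_rewards(scores, lengths, alpha=0.05):
    """Within a group of sampled answers to one prompt, normalize the helpfulness
    scores, then penalize answers longer than the group average so conciseness
    wins ties between equally good answers."""
    scores = np.asarray(scores, dtype=float)
    lengths = np.asarray(lengths, dtype=float)
    advantage = (scores - scores.mean()) / (scores.std() + 1e-6)
    length_penalty = alpha * (lengths - lengths.mean()) / (lengths.mean() + 1e-6)
    return advantage - length_penalty

# Example: three answers with equal scores; the shortest one gets the highest reward.
print(group_relative_rewards([0.8, 0.8, 0.8], [120, 300, 800]))
```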
Code snippet: Minimal tool-calling prompt
<tools>
<tool name="python_interpreter"/>
</tools>
<reasoning>
Need to simulate 1 M random walks.
</reasoning>
<content>
```python
import numpy as np
paths = np.random.randn(1000000, 100).cumsum(axis=1)
```
</content>
How Good Is the Long-Context Ability Really?
Core question: Can 1 M token length be more than a marketing number?
- Continuous pre-training on a 512 k + 4 k length mixture
- 3× larger document-QA dataset than Nano-2
- An extra 1 % of retrieval-focused synthetic data
Benchmarks:
- RULER-100 @ 1 M: 86.34 % (Qwen3-30B: 77.5 %)
- AA-LCR: 35.85 % (higher is better; no comparable figure given)
Scenario: A legal-tech startup feeds in 300 contracts (950 k tokens) and asks for a cross-document clause-inconsistency report. The model returns a 120-line bullet list citing article numbers; total inference time is 28 s on 8×H100, at roughly one-third the cost of the GPT-4-32k API.
FP8 Post-Training Quantization: What Survives the Bit Squeeze?
Core question: Which parts can be 8-bit without destroying accuracy?
Selective strategy:
- Keep 6 self-attention layers plus the 6 Mamba layers preceding them in BF16
- Quantize the remaining 40 layers' weights, activations, and KV-cache to FP8 (a fake-quant sketch follows below)
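As referenced in the list above, a fake-quant sketch of the selective strategy: simulate a per-tensor FP8 (E4M3) round trip on every linear weight except the layers kept in BF16. Real deployments use dedicated FP8 kernels and also quantize activations and the KV-cache, which this toy skips; the `keep_bf16` name matching is an assumption about how the protected layers would be identified.

```python
import torch

FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in float8_e4m3fn

def fake_quant_fp8(w: torch.Tensor) -> torch.Tensor:
    """Per-tensor scale, cast to FP8, cast back: shows what the weights lose."""
    scale = w.abs().max().clamp(min=1e-12) / FP8_E4M3_MAX
    return (w / scale).to(torch.float8_e4m3fn).to(w.dtype) * scale

def selectively_quantize(model: torch.nn.Module, keep_bf16: tuple[str, ...]) -> None:
    """Round-trip every Linear weight except layers whose name matches keep_bf16
    (i.e. the sensitive attention layers and the Mamba layers preceding them)."""
    for name, module in model.named_modules():
        if isinstance(module, torch.nn.Linear) and not any(k in name for k in keep_bf16):
            module.weight.data = fake_quant_fp8(module.weight.data)
```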
Outcome: median accuracy recovery of 99 %, throughput +250 % on a single H200 (8 k in / 16 k out).
Ablation insight: attention layers are the most sensitive; once they are kept safe in BF16, the rest tolerates 8-bit, especially because MoE experts are naturally sparse and redundant.
Putting It to Work: End-to-End Deployment Recipe
Core question: How do I serve the model this afternoon?
1. Pull the weights:

   ```bash
   git lfs install
   git clone https://huggingface.co/nvidia/Nemotron-3-Nano-30B-A3B-FP8
   ```

2. Install dependencies:

   ```bash
   pip install "transformers>=4.47" accelerate triton vllm
   ```

3. Launch vLLM:

   ```bash
   python -m vllm.entrypoints.api_server \
     --model ./Nemotron-3-Nano-30B-A3B-FP8 \
     --tensor-parallel-size 2 \
     --pipeline-parallel-size 2 \
     --dtype float8_e4m3fn \
     --max-model-len 1048576
   ```

4. Client call:

   ```python
   import requests

   prompt = ("<|im_start|>user\nWrite a CUDA kernel for vector add…<|im_end|>\n"
             "<|im_start|>assistant\n")
   rsp = requests.post(
       "http://localhost:8000/generate",
       json={"prompt": prompt, "max_tokens": 2048, "temperature": 0.3},
   )
   print(rsp.json()["text"])
   ```
Cost footprint: FP8 weights are ≈15 GB and KV-cache ≈1 MB per 1 k tokens (≈0.13 GB at 128 k context), so a single long request lands around 16 GB with runtime overhead and fits in 2×L40-48G for small-scale serving.
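A quick back-of-envelope check of those figures (the 0.5 GB overhead term is my assumption for activations and framework buffers, not a published number):

```python
def serving_memory_gb(weights_gb: float = 15.0,
                      context_tokens: int = 128_000,
                      kv_mb_per_1k_tokens: float = 1.0,
                      overhead_gb: float = 0.5) -> float:
    """FP8 weights + KV-cache for one long request + an assumed runtime overhead."""
    kv_gb = context_tokens / 1000 * kv_mb_per_1k_tokens / 1024
    return weights_gb + kv_gb + overhead_gb

print(f"{serving_memory_gb():.1f} GB")  # ~15.6 GB, i.e. roughly the 16 GB quoted above
```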
Action Checklist / Implementation Steps
- [ ] Decide context length (4 k / 64 k / 256 k / 1 M) and pick batch size accordingly
- [ ] Choose precision: FP8 for throughput, BF16 for finetune-ready checkpoints
- [ ] Allocate GPUs: 1×H200 handles 8 k in / 16 k out at 25 req/s; scale linearly with tensor parallelism
- [ ] Tune sampling: temperature 0.2–0.3 for math/code, 0.6 for chat; enable reasoning for multi-step problems
- [ ] Monitor expert load: use the NeMo-Gym dashboard to keep utilization imbalance under 5 % (a monitoring sketch follows below)
- [ ] Cache calibration: reuse the shipped 1 k SFT samples for any custom FP8 re-calibration
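For the expert-load item above, one minimal way to quantify imbalance from routing counts (the exact formula the dashboard uses is not given here, so treat this as an assumption):

```python
import numpy as np

def expert_imbalance(token_counts: np.ndarray) -> float:
    """Max relative deviation of per-expert token share from a perfectly uniform
    split; keep this under 0.05 to satisfy the <5 % target in the checklist."""
    share = token_counts / token_counts.sum()
    uniform = 1.0 / len(token_counts)
    return float(np.abs(share - uniform).max() / uniform)

# Example: routing counts collected over one serving window for 128 experts.
counts = np.random.multinomial(1_000_000, [1 / 128] * 128)
print(f"imbalance: {expert_imbalance(counts):.2%}")
```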
One-Page Overview
Nemotron-3-Nano is a 31 B parameter MoE model that activates only 3 B per token. It combines Mamba-2 and Group-Query Attention for linear context growth, trains on 25 T tokens with a two-stage curriculum, and uses multi-environment RL to boost math, code, and tool-use benchmarks without forgetting. Selective FP8 quantization keeps 99 % accuracy while increasing throughput 2.5×. The open release includes base and post-trained checkpoints, data recipes, and serving code—ready for 1 M token contexts on a single H200 node.
FAQ
- How much GPU memory is required for a 1 M token context?
  ~80 GB with FP8 weights and 128 k KV-cache compression; 8×H100 recommended for comfortable batching.
- Is the model truly open for commercial use?
  Yes, under the NVIDIA Open Model License; no gated API required.
- Will I lose accuracy if I skip FP8 and stay in BF16?
  No. FP8 is for speed; BF16 is the reference accuracy. You can serve in BF16 if memory allows.
- Can I fine-tune the FP8 checkpoint further?
  Not directly. Load the BF16 checkpoint, perform QAT or LoRA, then re-run selective FP8 calibration.
- Which inference frameworks are supported?
  vLLM and TensorRT-LLM both work; NVIDIA provides example scripts for each.
- Does the model support languages other than English?
  Pre-training included 19 languages and translation pairs; the MMLU-ProX average is 59.5 %, suitable for multilingual RAG, but the primary strength remains English STEM.
- What hyper-parameter stops tool hallucination?
  A 50-step DPO add-on drops hallucination from 8 % to under 1 %; RLVR alone achieves similar results without extra data.
