Step3: How a 321-Billion-Parameter Model Runs Cheaper Than a 37-Billion-Parameter One
A Plain-English Guide for Developers, Students, and Curious Minds
Quick Takeaways
| What you get | Number | 
|---|---|
| Cost per 1 M tokens (32 K context) | 0.13 USD (vs. 0.21 for DeepSeek-V3) | 
| Tokens per second on one H800 GPU | 4 039 (vs. 2 324 for DeepSeek-V3) | 
| GPUs to start serving | 32 (vs. 128–320 for similar models) | 
If you only remember three things, remember those.
1. What Exactly Is Step3?
Step3 is a vision-language model with 321 billion total parameters, but only 38 billion are active for each token.
Think of it like a huge library where the librarian only opens the exact shelves you need—so the place is massive, yet you pay only for the books you actually read.
1.1 Model Card at a Glance
| Item | Value | 
|---|---|
| Layers | 61 | 
| Hidden size | 7 168 | 
| Attention type | Multi-Matrix Factorization Attention (MFA) | 
| Query heads | 64 | 
| KV cache size (8-bit) | 2.56 × 10⁸ bytes @ 8 K context | 
| Max context | 65 536 tokens | 
| Total LLM params | 316 B | 
| Activated per token | 38 B | 
| Total VLM params | 321 B | 
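One number worth internalizing: 2.56 × 10⁸ bytes of 8-bit KV cache at 8 K context is roughly 31 KB per token. Assuming the cache grows linearly with context length (the usual case), a quick calculation like the sketch below extrapolates the per-request KV footprint to longer contexts; the helper name and the linear-growth assumption are ours, not from the model card.

```python
# Rough per-request KV-cache footprint, extrapolated linearly from the
# 2.56e8 bytes @ 8K-context figure in the model card above.
KV_BYTES_AT_8K = 2.56e8

def kv_cache_bytes(context_len: int) -> float:
    return KV_BYTES_AT_8K * context_len / 8192

for ctx in (8_192, 32_768, 65_536):
    print(f"{ctx:>6} tokens -> {kv_cache_bytes(ctx) / 1e9:.2f} GB")
# ~0.26 GB, ~1.02 GB, ~2.05 GB per request with an 8-bit cache
```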
2. The Two Secret Ingredients
2.1 MFA—Smarter Attention, Smaller Memory
Traditional attention keeps every Key-Value pair in memory, which explodes as your text gets longer.
MFA factorizes the Query matrix into two smaller matrices:
- Less KV to store (only 90 % of DeepSeek-V3's KV budget).
- Fewer FLOPs per layer (¼ of GQA).
- Same expressive power—effective rank stays at 16 384.
Imagine compressing a 4 K photo into a 1 K thumbnail that still looks crisp.
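To make the factorization idea concrete, here is a minimal NumPy sketch of attention with a low-rank query projection and a single shared KV head. The sizes, the single KV head, and the initialization are illustrative assumptions rather than Step3's actual MFA layout, but they show why the factorization shrinks both the query projection and the cached K/V.

```python
# Toy attention with a low-rank factorized query projection and one shared
# KV head. Illustrative only; not the real Step3/MFA configuration.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_heads, head_dim, rank = 512, 8, 64, 128   # toy sizes

# The full query projection W_q (d_model x n_heads*head_dim) is replaced by
# two small matrices, cutting query parameters and FLOPs.
W_down = rng.standard_normal((d_model, rank)) * 0.02
W_up   = rng.standard_normal((rank, n_heads * head_dim)) * 0.02

# A single shared K/V head: only head_dim-wide keys/values need caching,
# instead of one copy per query head.
W_k = rng.standard_normal((d_model, head_dim)) * 0.02
W_v = rng.standard_normal((d_model, head_dim)) * 0.02

def attend(x):
    """x: (seq_len, d_model) -> (seq_len, n_heads * head_dim)."""
    seq_len = x.shape[0]
    q = (x @ W_down @ W_up).reshape(seq_len, n_heads, head_dim)
    k = x @ W_k   # (seq_len, head_dim): this small tensor is all we cache
    v = x @ W_v
    scores = np.einsum("qhd,kd->hqk", q, k) / np.sqrt(head_dim)
    mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)  # causal
    scores = np.where(mask, -1e9, scores)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)
    out = np.einsum("hqk,kd->qhd", weights, v)
    return out.reshape(seq_len, n_heads * head_dim)

print(attend(rng.standard_normal((16, d_model))).shape)  # (16, 512)
```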
2.2 AFD—Splitting Work Like a Factory Line
Most systems run Attention and Feed-Forward Network (FFN) on the same GPUs.
Step3 disaggregates them:
- Attention GPUs handle the memory-heavy KV cache.
- FFN GPUs handle the compute-heavy experts.
- Data flows over a fast network in a three-stage pipeline, so communication hides behind computation.
Result: each side runs at high hardware utilization instead of waiting for the other.
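The overlap is the whole trick. Below is a toy Python sketch of the hand-off, simplified to two stages with queues and sleeps standing in for real kernels and the real interconnect: while the FFN side is chewing on micro-batch i, the attention side is already producing micro-batch i+1. It illustrates the pipelining idea only, not Step3's actual AFD implementation.

```python
# Toy simulation of attention/FFN disaggregation with a pipelined hand-off.
import queue
import threading
import time

to_ffn = queue.Queue(maxsize=2)        # bounded queue ~ network buffer

def attention_worker(n_microbatches):
    for i in range(n_microbatches):
        time.sleep(0.01)               # pretend: memory-bound attention + KV reads
        to_ffn.put(i)                  # ship activations to the FFN group
    to_ffn.put(None)                   # sentinel: no more work

def ffn_worker():
    while (i := to_ffn.get()) is not None:
        time.sleep(0.01)               # pretend: compute-bound expert matmuls
        print(f"micro-batch {i} done")

t = threading.Thread(target=attention_worker, args=(4,))
t.start()
ffn_worker()                           # both stages make progress concurrently
t.join()
```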
3. Cost Breakdown: Where Do the Savings Come From?
3.1 Memory Traffic (8 K context, per token)
| Model | KV Bytes | Attention FLOPs | FFN FLOPs | 
|---|---|---|---|
| DeepSeek-V3 | 2.88 × 10⁸ | 1.47 × 10¹¹ | 4.84 × 10¹⁰ | 
| Step3 | 2.56 × 10⁸ | 3.27 × 10¹⁰ | 5.33 × 10¹⁰ | 
Compared with DeepSeek-V3, Step3 already moves about 11 % less KV data per token and needs roughly 4.5× fewer attention FLOPs, at the cost of slightly more FFN compute.
3.2 Price Tag (USD per 1 M tokens, 32 K context)
| Model | Hardware Combo | Cost | 
|---|---|---|
| DeepSeek-V3 | H800 + EP | 0.211 | 
| Qwen3-MoE | H800 + AFD | 0.193 | 
| Step3 | H800 + H20 + AFD | 0.129 | 
AFD lets you pick the cheapest hardware that fits each stage, instead of forcing one GPU type to do everything.
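As a sanity check on where such numbers come from, here is a back-of-envelope conversion from per-GPU throughput to cost per million tokens. The hourly GPU price is a made-up assumption for illustration, and the throughput figures are the peak numbers quoted elsewhere in this article, so the result only roughly approximates the table above.

```python
# Back-of-envelope: USD per 1M generated tokens from per-GPU throughput.
# gpu_usd_per_hour is an assumed rental price, NOT a figure from the report.
def cost_per_million_tokens(tokens_per_gpu_per_s: float,
                            gpu_usd_per_hour: float) -> float:
    tokens_per_gpu_per_hour = tokens_per_gpu_per_s * 3600
    return gpu_usd_per_hour / tokens_per_gpu_per_hour * 1_000_000

print(cost_per_million_tokens(4039, 2.0))  # ~0.14 USD for Step3's throughput
print(cost_per_million_tokens(2324, 2.0))  # ~0.24 USD for DeepSeek-V3's
```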
4. Hands-On: Running Step3 Yourself
4.1 Grab the Weights
- Hugging Face: stepfun-ai/step3
- ModelScope: mirror available
- Precision:
  - BF16 (universal)
  - Block-FP8 (smaller, needs Hopper/Ada)
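If you prefer a Python download over the `git lfs` clone shown in 4.4 below, `huggingface_hub`'s `snapshot_download` does the same job; the local directory name here is just an example.

```python
# Download the Step3 weights with huggingface_hub instead of git lfs.
from huggingface_hub import snapshot_download

snapshot_download(repo_id="stepfun-ai/step3", local_dir="step3")
```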
4.2 Supported Engines
- vLLM
- SGLang
Both have ready-made Docker images and OpenAI-compatible REST endpoints.
4.3 Minimal Production Setup (32 K context)
| Role | Hardware | Count | Notes | 
|---|---|---|---|
| Attention | NVIDIA H800 80 GB | 2 nodes × 8 | Holds the KV cache | 
| FFN | NVIDIA H20 80 GB | 2 nodes × 8 | Holds the expert shards | 
| Network | 400 Gbps RoCE | 8 NICs/node | < 16.6 ms per hop | 
Total: 32 GPUs instead of 128–320.
4.4 Quick Start with vLLM
```bash
# 1. Clone the weights
git lfs install
git clone https://huggingface.co/stepfun-ai/step3

# 2. Start the OpenAI-compatible server
# Note: TP=8 x PP=4 assumes the multi-node layout from 4.3;
# on a single 8-GPU node, drop --pipeline-parallel-size.
docker run --gpus all --ipc=host -p 8000:8000 \
  -v $(pwd)/step3:/model \
  vllm/vllm-openai:latest \
  --model /model \
  --served-model-name step3 \
  --tensor-parallel-size 8 \
  --pipeline-parallel-size 4 \
  --max-model-len 32768
```
Client call (identical to OpenAI):
```python
import openai  # openai<1.0 style client

openai.api_base = "http://localhost:8000/v1"
openai.api_key = "EMPTY"  # vLLM ignores the key, but the client requires one

resp = openai.ChatCompletion.create(
    model="step3",
    messages=[{"role": "user", "content": "Explain quantum tunneling simply."}],
)
print(resp.choices[0].message["content"])
```
5. Performance in the Wild
| Metric | DeepSeek-V3 | Step3 (FP8) | Delta | 
|---|---|---|---|
| Peak tokens/GPU/s @ 4 K ctx | 2 324 | 4 039 | +74 % | 
| GPUs for same SLA | 128 | 32 | –75 % | 
| Cost per 1 M tokens | 0.211 USD | 0.129 USD | –39 % | 
Figures measured on identical hardware under 20 tokens/s latency SLA.
6. Common Questions
Q1: My lab only has A800 or 910B GPUs. Will it work?
Yes. Step3’s MFA is memory-bandwidth friendly:
- On A800, attention cost rises by only 0.01 USD per 1 M tokens.
- DeepSeek-V3's MLA on A800 triples the cost.
Q2: Is the model truly open-source?
- Code: Apache 2.0
- Weights: Apache 2.0
- Commercial use: allowed without restrictions
Q3: How much VRAM do I really need?
| Scenario | VRAM per GPU | Notes | 
|---|---|---|
| Attention | 80 GB | Holds 32 K context, batch=64 | 
| FFN | 80 GB | Expert shards | 
| Smallest cluster | 32 × 80 GB | 2A2F topology | 
Q4: What about training?
This article focuses on inference. Training costs follow classic scaling laws; Step3’s activated parameters (38 B) keep training within reach of most research groups.
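For a rough feel of what that means, the classic approximation puts training compute at about 6 × activated parameters × training tokens. The token count in the sketch below is a placeholder assumption, not a figure from the Step3 report.

```python
# Back-of-envelope training compute using the common "6 * N * D" rule.
# N = activated parameters per token; D = training tokens (assumed here).
activated_params = 38e9
training_tokens = 10e12          # placeholder assumption, not from the report

flops = 6 * activated_params * training_tokens
print(f"~{flops:.2e} FLOPs")     # ~2.28e24 FLOPs for this assumed token budget
```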
7. Roadmap from the Authors
- Multi-Token Prediction (MTP): +50 % throughput on long context.
- 4-bit KV cache: another halving of memory traffic.
- 800 Gbps domain networks: unlocking sparser MoE in future releases.
8. When Should You Pick Step3?
| Your Need | Recommendation | 
|---|---|
| Tight budget, long context | Step3 + AFD starts at 32 GPUs | 
| Mixed GPU types | AFD lets H800 + H20 + A800 work together | 
| Commercial deployment | Apache 2.0 license, no strings attached | 
| Quick POC | Ready-made vLLM/SGLang images | 
9. References
- Technical report: arXiv:2507.19427
- Official blog: https://stepfun.ai/research/step3
- Weights & code: https://huggingface.co/collections/stepfun-ai/step3
