Step3: How a 321-Billion-Parameter Model Runs Cheaper Than a 37-Billion One
A Plain-English Guide for Developers, Students, and Curious Minds
Quick Takeaways
What you get | Number |
---|---|
Cost per 1 M tokens (32 K context) | 0.13 USD (vs. 0.21 for DeepSeek-V3) |
Tokens per second on one H800 GPU | 4 039 (vs. 2 324 for DeepSeek-V3) |
GPUs to start serving | 32 (vs. 128–320 for similar models) |
If you only remember three things, remember those.
1. What Exactly Is Step3?
Step3 is a vision-language model with 321 billion total parameters, but only 38 billion are active for each token.
Think of it like a huge library where the librarian only opens the exact shelves you need—so the place is massive, yet you pay only for the books you actually read.
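That "open only the shelves you need" behavior is what a mixture-of-experts (MoE) layer does: a router scores every expert, but only the top few actually run for a given token. Below is a toy NumPy sketch of top-k routing with made-up sizes; Step3's real expert count, router, and expert shapes are not reproduced here.

```python
import numpy as np

# Toy top-k expert routing: only a few experts "open their shelves" per token.
n_experts, top_k, d = 48, 3, 8           # illustrative sizes, not Step3's real config
rng = np.random.default_rng(0)

gate_w = rng.standard_normal((d, n_experts))
experts = rng.standard_normal((n_experts, d, d))   # one tiny FFN matrix per expert

x = rng.standard_normal(d)                # one token's hidden state
scores = x @ gate_w
chosen = np.argsort(scores)[-top_k:]      # indices of the top-k experts
weights = np.exp(scores[chosen]) / np.exp(scores[chosen]).sum()

# Only the chosen experts run; the other experts stay idle for this token.
y = sum(w * (x @ experts[i]) for w, i in zip(weights, chosen))
print(f"active experts: {sorted(chosen.tolist())}, output shape: {y.shape}")
```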
1.1 Model Card at a Glance
Item | Value |
---|---|
Layers | 61 |
Hidden size | 7 168 |
Attention type | Multi-Matrix Factorization Attention (MFA) |
Query heads | 64 |
KV cache size (8-bit) | 2.56 × 10⁸ bytes @ 8 K context |
Max context | 65 536 tokens |
Total LLM params | 316 B |
Activated per token | 38 B |
Total VLM params | 321 B |
2. The Two Secret Ingredients
2.1 MFA—Smarter Attention, Smaller Memory
Traditional attention keeps every Key-Value pair in memory, which explodes as your text gets longer.
MFA factorizes the Query matrix into two smaller matrices:
- Less KV to store (only 90 % of DeepSeek-V3’s KV budget).
- Fewer FLOPs per layer (¼ of GQA).
- Same expressive power: effective rank stays at 16 384.
Imagine compressing a 4 K photo into a 1 K thumbnail that still looks crisp.
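To make the factorization idea concrete, here is a minimal NumPy sketch. It replaces one large query projection with the product of two smaller matrices, keeping a 16 384-wide query output (read here as 64 heads × a 256-dim head, which is how the effective-rank figure is interpreted) while cutting parameters and per-token FLOPs. The inner width `r` and the head dimension are illustrative assumptions, not Step3's published values.

```python
import numpy as np

# Sizes: hidden size and query-head count come from the model card;
# head_dim and r are illustrative assumptions.
d_model  = 7168          # hidden size
n_heads  = 64            # query heads
head_dim = 256           # assumed, so n_heads * head_dim = 16 384
r        = 2048          # assumed inner width of the factorization

d_out = n_heads * head_dim

# Parameter counts: one big projection vs. two smaller factors.
full_params     = d_model * d_out            # ~1.17e8
factored_params = d_model * r + r * d_out    # ~4.8e7, about 2.4x smaller

print(f"full query projection params: {full_params:,}")
print(f"factored (A @ B) params:      {factored_params:,}")

# Same output shape either way: project one token through the two factors.
rng = np.random.default_rng(0)
A = rng.standard_normal((d_model, r), dtype=np.float32)
B = rng.standard_normal((r, d_out), dtype=np.float32)
x = rng.standard_normal((1, d_model), dtype=np.float32)

q = (x @ A) @ B
print(q.shape)   # (1, 16384): 64 heads x 256 dims, matching the 16 384 figure
```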
2.2 AFD—Splitting Work Like a Factory Line
Most systems run Attention and Feed-Forward Network (FFN) on the same GPUs.
Step3 disaggregates them:
- Attention GPUs handle the memory-heavy KV cache.
- FFN GPUs handle the compute-heavy experts.
- Data flows over a fast network in a three-stage pipeline, so communication hides behind computation.
Result: each side runs at high hardware utilization instead of waiting for the other.
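The overlap is easier to see in a toy simulation. The sketch below is plain Python with made-up stage times, not Step3's actual scheduler: two worker threads play the roles of the attention group and the FFN group, and because batch i's FFN runs while batch i+1's attention is already in flight, total wall-clock time approaches the slower stage instead of the sum of both.

```python
import time
from concurrent.futures import ThreadPoolExecutor

ATTN_MS, FFN_MS = 30, 25          # made-up per-microbatch stage times

def attention_stage(batch_id):
    time.sleep(ATTN_MS / 1000)    # stand-in for memory-bound attention + KV reads
    return batch_id

def ffn_stage(batch_id):
    time.sleep(FFN_MS / 1000)     # stand-in for compute-bound expert FFN
    return batch_id

def run_serial(n_batches=8):
    start = time.perf_counter()
    for b in range(n_batches):
        ffn_stage(attention_stage(b))
    return time.perf_counter() - start          # ~ n * (ATTN_MS + FFN_MS)

def run_pipelined(n_batches=8):
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=2) as pool:
        ffn_future = None
        for b in range(n_batches):
            attn_future = pool.submit(attention_stage, b)   # next batch's attention
            if ffn_future is not None:
                ffn_future.result()                         # previous batch's FFN finishes
            ffn_future = pool.submit(ffn_stage, attn_future.result())
        ffn_future.result()
    return time.perf_counter() - start          # ~ n * max(ATTN_MS, FFN_MS) + pipeline fill

print(f"serial:    {run_serial():.3f} s")
print(f"pipelined: {run_pipelined():.3f} s")
```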
3. Cost Breakdown: Where Do the Savings Come From?
3.1 Memory Traffic (8 K context, per token)
Model | KV Bytes | Attention FLOPs | FFN FLOPs |
---|---|---|---|
DeepSeek-V3 | 2.88 × 10⁸ | 1.47 × 10¹¹ | 4.84 × 10¹⁰ |
Step3 | 2.56 × 10⁸ | 3.27 × 10¹⁰ | 5.33 × 10¹⁰ |
Step3 already moves less data and does less math.
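The KV column can be sanity-checked with quick arithmetic. The per-token, per-layer KV width used below (512 bytes at 8-bit precision) is not stated in the model card; it is inferred so that the product matches the published 2.56 × 10⁸ figure.

```python
# Back-of-the-envelope check of Step3's 8-bit KV cache at 8 K context.
layers = 61                           # model card
context_tokens = 8 * 1024             # 8 K context
kv_bytes_per_token_per_layer = 512    # inferred, not an official number

kv_total = layers * context_tokens * kv_bytes_per_token_per_layer
print(f"{kv_total:.3e} bytes")        # ~2.56e8, matching the table
```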
3.2 Price Tag (USD per 1 M tokens, 32 K context)
Model | Hardware Combo | Cost |
---|---|---|
DeepSeek-V3 | H800 + EP | 0.211 |
Qwen3-MoE | H800 + AFD | 0.193 |
Step3 | H800 + H20 + AFD | 0.129 |
AFD lets us pick the cheapest-fit hardware for each stage.
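As a rough consistency check on the price tags, decoding cost is just GPU-time per million tokens multiplied by the GPU price. The 2 USD-per-GPU-hour rental rate below is an assumption for illustration; the published 0.129 USD for the H800 + H20 mix is in the same ballpark.

```python
# Rough decoding cost model: GPU-seconds per million tokens times GPU price.
def cost_per_million_tokens(tokens_per_gpu_per_s: float, usd_per_gpu_hour: float) -> float:
    gpu_seconds = 1_000_000 / tokens_per_gpu_per_s
    return gpu_seconds * usd_per_gpu_hour / 3600

# 4 039 tokens/GPU/s is the published H800 figure; 2 USD/hour is an assumed rental price.
print(f"{cost_per_million_tokens(4_039, 2.0):.3f} USD per 1M tokens")   # ~0.138
```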
4. Hands-On: Running Step3 Yourself
4.1 Grab the Weights
- Hugging Face: stepfun-ai/step3
- ModelScope: mirror available
- Precision:
  - BF16 (universal)
  - Block-FP8 (smaller, needs Hopper/Ada)
4.2 Supported Engines
- vLLM
- SGLang
Both have ready-made Docker images and OpenAI-compatible REST endpoints.
4.3 Minimal Production Setup (32 K context)
Role | Hardware | Count | Holds / Notes |
---|---|---|---|
Attention | NVIDIA H800 80 GB | 2 nodes × 8 | KV cache |
FFN | NVIDIA H20 80 GB | 2 nodes × 8 | Experts |
Network | 400 Gbps RoCE | 8 NICs/node | < 16.6 ms per hop |
Total: 32 GPUs instead of 128–320.
4.4 Quick Start with vLLM
```bash
# 1. Clone weights
git lfs install
git clone https://huggingface.co/stepfun-ai/step3

# 2. Start server
docker run --gpus all -p 8000:8000 \
  -v $(pwd)/step3:/model \
  vllm/vllm:latest \
  python -m vllm.entrypoints.openai.api_server \
    --model /model \
    --tensor-parallel-size 8 \
    --pipeline-parallel-size 4 \
    --max-model-len 32768
```
Client call (identical to OpenAI):
```python
import openai

openai.api_key = "EMPTY"  # placeholder; the client library requires a key to be set
openai.api_base = "http://localhost:8000/v1"

response = openai.ChatCompletion.create(
    model="step3",
    messages=[{"role": "user", "content": "Explain quantum tunneling simply."}],
)
print(response.choices[0].message.content)
```
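The snippet above uses the pre-1.0 openai package. With openai>=1.0 installed, the equivalent call looks like this; the placeholder API key is only there because the client requires one.

```python
from openai import OpenAI

# openai>=1.0 style client pointed at the local vLLM endpoint.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="step3",
    messages=[{"role": "user", "content": "Explain quantum tunneling simply."}],
)
print(resp.choices[0].message.content)
```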
5. Performance in the Wild
Metric | DeepSeek-V3 | Step3 (FP8) | Delta |
---|---|---|---|
Peak tokens/GPU/s @ 4 K ctx | 2 324 | 4 039 | +74 % |
GPUs for same SLA | 128 | 32 | –75 % |
Cost per 1 M tokens | 0.211 USD | 0.129 USD | –39 % |
Figures measured on identical hardware under a 20 tokens/s latency SLA.
6. Common Questions
Q1: My lab only has A800 or 910B GPUs. Will it work?
Yes. Step3’s MFA is memory-bandwidth friendly:
- On A800, attention cost rises by only 0.01 USD per 1 M tokens.
- DeepSeek-V3’s MLA on A800 triples the cost.
Q2: Is the model truly open-source?
- Code: Apache 2.0
- Weights: Apache 2.0
- Commercial use: allowed without restrictions.
Q3: How much VRAM do I really need?
Scenario | VRAM per GPU | Notes |
---|---|---|
Attention | 80 GB | Holds 32 K context, batch=64 |
FFN | 80 GB | Expert shards |
Smallest cluster | 32 × 80 GB | 2A2F topology |
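For intuition on why 80 GB is quoted for the attention role, the same inferred 512 bytes of 8-bit KV per token per layer from Section 3.1 gives the rough footprint below. Whether the batch of 64 sits on one attention GPU or is spread across the group depends on the deployment, so treat this purely as an order-of-magnitude estimate.

```python
# Rough KV-cache footprint at 32 K context (all figures assumed/inferred, see above).
layers = 61
context_tokens = 32 * 1024
kv_bytes_per_token_per_layer = 512   # inferred in Section 3.1, not an official number
batch = 64

per_sequence_gb = layers * context_tokens * kv_bytes_per_token_per_layer / 1e9
print(f"KV per 32 K-token sequence: {per_sequence_gb:.2f} GB")          # ~1.0 GB
print(f"KV for a batch of {batch}:  {per_sequence_gb * batch:.0f} GB")  # ~65 GB
```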
Q4: What about training?
This article focuses on inference. Training costs follow classic scaling laws; Step3’s activated parameters (38 B) keep training within reach of most research groups.
7. Roadmap from the Authors
- Multi-Token Prediction (MTP): +50 % throughput on long context.
- 4-bit KV cache: another halving of memory traffic.
- 800 Gbps domain networks: unlocking sparser MoE in future releases.
8. When Should You Pick Step3?
Your Need | Recommendation |
---|---|
Tight budget, long context | Step3 + AFD starts at 32 GPUs |
Mixed GPU types | AFD lets H800 + H20 + A800 work together |
Commercial deployment | Apache 2.0 license, no strings attached |
Quick POC | Ready-made vLLM/SGLang images |
9. References
- Technical report: arXiv:2507.19427
- Official blog: https://stepfun.ai/research/step3
- Weights & code: https://huggingface.co/collections/stepfun-ai/step3