Step3: How a 321-Billion-Parameter Model Runs Cheaper Than a 37-Billion-Parameter One
A Plain-English Guide for Developers, Students, and Curious Minds
Quick Takeaways
| What you get | Number | 
|---|---|
| Cost per 1 M tokens (32 K context) | 0.13 USD (vs. 0.21 for DeepSeek-V3) | 
| Tokens per second on one H800 GPU | 4 039 (vs. 2 324 for DeepSeek-V3) | 
| GPUs to start serving | 32 (vs. 128–320 for similar models) | 
If you only remember three things, remember those.
1. What Exactly Is Step3?
Step3 is a vision-language model with 321 billion total parameters, but only 38 billion are active for each token.
Think of it like a huge library where the librarian only opens the exact shelves you need—so the place is massive, yet you pay only for the books you actually read.
1.1 Model Card at a Glance
| Item | Value | 
|---|---|
| Layers | 61 | 
| Hidden size | 7 168 | 
| Attention type | Multi-Matrix Factorization Attention (MFA) | 
| Query heads | 64 | 
| KV cache size (8-bit) | 2.56 × 10⁸ bytes @ 8 K context | 
| Max context | 65 536 tokens | 
| Total LLM params | 316 B | 
| Activated per token | 38 B | 
| Total VLM params | 321 B | 
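One number worth internalizing: 2.56 × 10⁸ bytes of 8-bit KV cache at 8 K context is roughly 31 KB per token. Assuming the cache grows linearly with context length (the usual case), a quick calculation like the sketch below extrapolates the per-request KV footprint to longer contexts; the helper name and the linear-growth assumption are ours, not from the model card.

```python
# Rough per-request KV-cache footprint, extrapolated linearly from the
# 2.56e8 bytes @ 8K-context figure in the model card above.
KV_BYTES_AT_8K = 2.56e8

def kv_cache_bytes(context_len: int) -> float:
    return KV_BYTES_AT_8K * context_len / 8192

for ctx in (8_192, 32_768, 65_536):
    print(f"{ctx:>6} tokens -> {kv_cache_bytes(ctx) / 1e9:.2f} GB")
# ~0.26 GB, ~1.02 GB, ~2.05 GB per request with an 8-bit cache
```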
2. The Two Secret Ingredients
2.1 MFA—Smarter Attention, Smaller Memory
Traditional attention keeps every Key-Value pair in memory, which explodes as your text gets longer.
MFA factorizes the Query matrix into two smaller matrices:
- Less KV to store (only 90 % of DeepSeek-V3's KV budget).
- Fewer FLOPs per layer (¼ of GQA).
- Same expressive power—effective rank stays at 16 384.
Imagine compressing a 4 K photo into a 1 K thumbnail that still looks crisp.
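To make the factorization idea concrete, here is a minimal NumPy sketch of attention with a low-rank query projection and a single shared KV head. The sizes, the single KV head, and the initialization are illustrative assumptions rather than Step3's actual MFA layout, but they show why the factorization shrinks both the query projection and the cached K/V.

```python
# Toy attention with a low-rank factorized query projection and one shared
# KV head. Illustrative only; not the real Step3/MFA configuration.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_heads, head_dim, rank = 512, 8, 64, 128   # toy sizes

# The full query projection W_q (d_model x n_heads*head_dim) is replaced by
# two small matrices, cutting query parameters and FLOPs.
W_down = rng.standard_normal((d_model, rank)) * 0.02
W_up   = rng.standard_normal((rank, n_heads * head_dim)) * 0.02

# A single shared K/V head: only head_dim-wide keys/values need caching,
# instead of one copy per query head.
W_k = rng.standard_normal((d_model, head_dim)) * 0.02
W_v = rng.standard_normal((d_model, head_dim)) * 0.02

def attend(x):
    """x: (seq_len, d_model) -> (seq_len, n_heads * head_dim)."""
    seq_len = x.shape[0]
    q = (x @ W_down @ W_up).reshape(seq_len, n_heads, head_dim)
    k = x @ W_k   # (seq_len, head_dim): this small tensor is all we cache
    v = x @ W_v
    scores = np.einsum("qhd,kd->hqk", q, k) / np.sqrt(head_dim)
    mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)  # causal
    scores = np.where(mask, -1e9, scores)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)
    out = np.einsum("hqk,kd->qhd", weights, v)
    return out.reshape(seq_len, n_heads * head_dim)

print(attend(rng.standard_normal((16, d_model))).shape)  # (16, 512)
```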
2.2 AFD—Splitting Work Like a Factory Line
Most systems run Attention and Feed-Forward Network (FFN) on the same GPUs.
Step3 disaggregates them:
- Attention GPUs handle the memory-heavy KV cache.
- FFN GPUs handle the compute-heavy experts.
- Data flows over a fast network in a three-stage pipeline, so communication hides behind computation.
Result: each side runs at high hardware utilization instead of waiting for the other.
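The overlap is the whole trick. Below is a toy Python sketch of the hand-off, simplified to two stages with queues and sleeps standing in for real kernels and the real interconnect: while the FFN side is chewing on micro-batch i, the attention side is already producing micro-batch i+1. It illustrates the pipelining idea only, not Step3's actual AFD implementation.

```python
# Toy simulation of attention/FFN disaggregation with a pipelined hand-off.
import queue
import threading
import time

to_ffn = queue.Queue(maxsize=2)        # bounded queue ~ network buffer

def attention_worker(n_microbatches):
    for i in range(n_microbatches):
        time.sleep(0.01)               # pretend: memory-bound attention + KV reads
        to_ffn.put(i)                  # ship activations to the FFN group
    to_ffn.put(None)                   # sentinel: no more work

def ffn_worker():
    while (i := to_ffn.get()) is not None:
        time.sleep(0.01)               # pretend: compute-bound expert matmuls
        print(f"micro-batch {i} done")

t = threading.Thread(target=attention_worker, args=(4,))
t.start()
ffn_worker()                           # both stages make progress concurrently
t.join()
```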
3. Cost Breakdown: Where Do the Savings Come From?
3.1 Memory Traffic (8 K context, per token)
| Model | KV Bytes | Attention FLOPs | FFN FLOPs | 
|---|---|---|---|
| DeepSeek-V3 | 2.88 × 10⁸ | 1.47 × 10¹¹ | 4.84 × 10¹⁰ | 
| Step3 | 2.56 × 10⁸ | 3.27 × 10¹⁰ | 5.33 × 10¹⁰ | 
Compared with DeepSeek-V3, Step3 already moves about 11 % less KV data per token and needs roughly 4.5× fewer attention FLOPs, at the cost of slightly more FFN compute.
3.2 Price Tag (USD per 1 M tokens, 32 K context)
| Model | Hardware Combo | Cost | 
|---|---|---|
| DeepSeek-V3 | H800 + EP | 0.211 | 
| Qwen3-MoE | H800 + AFD | 0.193 | 
| Step3 | H800 + H20 + AFD | 0.129 | 
AFD lets you pick the cheapest hardware that fits each stage, instead of forcing one GPU type to do everything.
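As a sanity check on where such numbers come from, here is a back-of-envelope conversion from per-GPU throughput to cost per million tokens. The hourly GPU price is a made-up assumption for illustration, and the throughput figures are the peak numbers quoted elsewhere in this article, so the result only roughly approximates the table above.

```python
# Back-of-envelope: USD per 1M generated tokens from per-GPU throughput.
# gpu_usd_per_hour is an assumed rental price, NOT a figure from the report.
def cost_per_million_tokens(tokens_per_gpu_per_s: float,
                            gpu_usd_per_hour: float) -> float:
    tokens_per_gpu_per_hour = tokens_per_gpu_per_s * 3600
    return gpu_usd_per_hour / tokens_per_gpu_per_hour * 1_000_000

print(cost_per_million_tokens(4039, 2.0))  # ~0.14 USD for Step3's throughput
print(cost_per_million_tokens(2324, 2.0))  # ~0.24 USD for DeepSeek-V3's
```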
4. Hands-On: Running Step3 Yourself
4.1 Grab the Weights
- Hugging Face: stepfun-ai/step3
- ModelScope: mirror available
- Precision:
  - BF16 (universal)
  - Block-FP8 (smaller, needs Hopper/Ada)
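If you prefer a Python download over the `git lfs` clone shown in 4.4 below, `huggingface_hub`'s `snapshot_download` does the same job; the local directory name here is just an example.

```python
# Download the Step3 weights with huggingface_hub instead of git lfs.
from huggingface_hub import snapshot_download

snapshot_download(repo_id="stepfun-ai/step3", local_dir="step3")
```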
4.2 Supported Engines
- vLLM
- SGLang
Both have ready-made Docker images and OpenAI-compatible REST endpoints.
4.3 Minimal Production Setup (32 K context)
| Role | Hardware | Count | Notes | 
|---|---|---|---|
| Attention | NVIDIA H800 80 GB | 2 nodes × 8 | Holds the KV cache | 
| FFN | NVIDIA H20 80 GB | 2 nodes × 8 | Holds the expert shards | 
| Network | 400 Gbps RoCE | 8 NICs/node | < 16.6 ms per hop | 
Total: 32 GPUs instead of 128–320.
4.4 Quick Start with vLLM
```bash
# 1. Clone the weights
git lfs install
git clone https://huggingface.co/stepfun-ai/step3

# 2. Start the OpenAI-compatible server
# Note: TP=8 x PP=4 assumes the multi-node layout from 4.3;
# on a single 8-GPU node, drop --pipeline-parallel-size.
docker run --gpus all --ipc=host -p 8000:8000 \
  -v $(pwd)/step3:/model \
  vllm/vllm-openai:latest \
  --model /model \
  --served-model-name step3 \
  --tensor-parallel-size 8 \
  --pipeline-parallel-size 4 \
  --max-model-len 32768
```
Client call (identical to OpenAI):
```python
import openai  # openai<1.0 style client

openai.api_base = "http://localhost:8000/v1"
openai.api_key = "EMPTY"  # vLLM ignores the key, but the client requires one

resp = openai.ChatCompletion.create(
    model="step3",
    messages=[{"role": "user", "content": "Explain quantum tunneling simply."}],
)
print(resp.choices[0].message["content"])
```
5. Performance in the Wild
| Metric | DeepSeek-V3 | Step3 (FP8) | Delta | 
|---|---|---|---|
| Peak tokens/GPU/s @ 4 K ctx | 2 324 | 4 039 | +74 % | 
| GPUs for same SLA | 128 | 32 | –75 % | 
| Cost per 1 M tokens | 0.211 USD | 0.129 USD | –39 % | 
Figures measured on identical hardware under 20 tokens/s latency SLA.
6. Common Questions
Q1: My lab only has A800 or 910B GPUs. Will it work?
Yes. Step3’s MFA is memory-bandwidth friendly:
- On A800, attention cost rises by only 0.01 USD per 1 M tokens.
- DeepSeek-V3's MLA on A800 triples the cost.
Q2: Is the model truly open-source?
- Code: Apache 2.0
- Weights: Apache 2.0
- Commercial use: allowed without restrictions
Q3: How much VRAM do I really need?
| Scenario | VRAM per GPU | Notes | 
|---|---|---|
| Attention | 80 GB | Holds 32 K context, batch=64 | 
| FFN | 80 GB | Expert shards | 
| Smallest cluster | 32 × 80 GB | 2A2F topology | 
Q4: What about training?
This article focuses on inference. Training costs follow classic scaling laws; Step3’s activated parameters (38 B) keep training within reach of most research groups.
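For a rough feel of what that means, the classic approximation puts training compute at about 6 × activated parameters × training tokens. The token count in the sketch below is a placeholder assumption, not a figure from the Step3 report.

```python
# Back-of-envelope training compute using the common "6 * N * D" rule.
# N = activated parameters per token; D = training tokens (assumed here).
activated_params = 38e9
training_tokens = 10e12          # placeholder assumption, not from the report

flops = 6 * activated_params * training_tokens
print(f"~{flops:.2e} FLOPs")     # ~2.28e24 FLOPs for this assumed token budget
```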
7. Roadmap from the Authors
- Multi-Token Prediction (MTP): +50 % throughput on long context.
- 4-bit KV cache: another halving of memory traffic.
- 800 Gbps domain networks: unlocking sparser MoE in future releases.
8. When Should You Pick Step3?
| Your Need | Recommendation | 
|---|---|
| Tight budget, long context | Step3 + AFD starts at 32 GPUs | 
| Mixed GPU types | AFD lets H800 + H20 + A800 work together | 
| Commercial deployment | Apache 2.0 license, no strings attached | 
| Quick POC | Ready-made vLLM/SGLang images | 
9. References
- Technical report: arXiv:2507.19427
- Official blog: https://stepfun.ai/research/step3
- Weights & code: https://huggingface.co/collections/stepfun-ai/step3
