Step3: How a 321-Billion-Parameter Model Runs Cheaper Than a 37-Billion One

A Plain-English Guide for Developers, Students, and Curious Minds


Quick Takeaways

What you get                          Number
Cost per 1 M tokens (32 K context)    0.13 USD (vs. 0.21 for DeepSeek-V3)
Tokens per second on one H800 GPU     4 039 (vs. 2 324 for DeepSeek-V3)
GPUs to start serving                 32 (vs. 128–320 for similar models)

If you only remember three things, remember those.


1. What Exactly Is Step3?

Step3 is a vision-language model with 321 billion total parameters, but only 38 billion are active for each token.
Think of it like a huge library where the librarian only opens the exact shelves you need—so the place is massive, yet you pay only for the books you actually read.
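To see the mechanism behind "only 38 billion active", here is a minimal sketch of sparse expert routing, the standard Mixture-of-Experts idea: a router scores the experts for each token and only the top-k are actually evaluated. The sizes and k below are toy values for illustration, not Step3's real configuration.

import numpy as np

rng = np.random.default_rng(0)

hidden, n_experts, top_k = 64, 8, 2   # toy sizes, not Step3's real config

# One toy "expert" = a small feed-forward weight matrix.
experts = [rng.standard_normal((hidden, hidden)).astype(np.float32)
           for _ in range(n_experts)]
router = rng.standard_normal((hidden, n_experts)).astype(np.float32)

def moe_forward(x: np.ndarray) -> np.ndarray:
    """Route one token to its top-k experts; the rest are never touched."""
    scores = x @ router                      # one score per expert
    chosen = np.argsort(scores)[-top_k:]     # indices of the top-k experts
    weights = np.exp(scores[chosen])
    weights /= weights.sum()                 # softmax over the chosen experts
    # Only top_k of n_experts matrices get multiplied: this is the
    # "38 B active out of 321 B total" effect in miniature.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, chosen))

token = rng.standard_normal(hidden).astype(np.float32)
print(moe_forward(token).shape)   # (64,)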

1.1 Model Card at a Glance

Item                     Value
Layers                   61
Hidden size              7 168
Attention type           Multi-Matrix Factorization Attention (MFA)
Query heads              64
KV cache size (8-bit)    2.56 × 10⁸ bytes @ 8 K context
Max context              65 536 tokens
Total LLM params         316 B
Activated per token      38 B
Total VLM params         321 B

2. The Two Secret Ingredients

2.1 MFA—Smarter Attention, Smaller Memory

Traditional attention keeps every Key-Value pair in memory, which explodes as your text gets longer.
MFA instead factorizes the query projection matrix into two smaller matrices, which buys three things:

  • Less KV to store (only 90 % of DeepSeek-V3’s KV budget).
  • Fewer FLOPs per attention layer (roughly ¼ of a comparable GQA setup).
  • Same expressive power—effective rank stays at 16 384.

Imagine compressing a 4 K photo into a 1 K thumbnail that still looks crisp.
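A minimal sketch of what factorizing a projection matrix saves. The hidden size and head count come from the model card above; the head dimension and the low-rank bottleneck are assumptions for illustration, not Step3's published values.

import numpy as np

rng = np.random.default_rng(0)

hidden = 7168          # model hidden size (from the model card above)
q_dim  = 64 * 256      # 64 query heads; 256-dim heads assumed for illustration
rank   = 2048          # low-rank bottleneck, assumed for illustration

x = rng.standard_normal((1, hidden)).astype(np.float32)

# Dense query projection: hidden * q_dim parameters.
w_dense = rng.standard_normal((hidden, q_dim)).astype(np.float32)

# Factorized projection: hidden * rank + rank * q_dim parameters.
w_down = rng.standard_normal((hidden, rank)).astype(np.float32)
w_up   = rng.standard_normal((rank, q_dim)).astype(np.float32)

q_dense      = x @ w_dense            # one big matmul
q_factorized = (x @ w_down) @ w_up    # two small matmuls, same output shape

print("dense params:     ", w_dense.size)              # 117 440 512
print("factorized params:", w_down.size + w_up.size)   # 48 234 496

Both paths produce a query of the same shape; the factorized path just reaches it through a narrow bottleneck, which is where the parameter and FLOP savings come from.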

2.2 AFD—Splitting Work Like a Factory Line

Most systems run Attention and Feed-Forward Network (FFN) on the same GPUs.
Step3 disaggregates them:

  • Attention GPUs handle memory-heavy KV cache.
  • FFN GPUs handle compute-heavy experts.
  • Data flows over a fast network in a three-stage pipeline, so communication hides behind computation.

Result: each side runs at high hardware utilization instead of waiting for the other.
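To make the disaggregation concrete, here is a toy sketch of the idea: one pool of workers stands in for the bandwidth-bound attention GPUs, another for the compute-bound FFN GPUs, and micro-batches flow between them so both pools stay busy. This is a conceptual illustration only; the real system moves tensors over a fast network, not Python queues.

import queue
import threading

NUM_MICROBATCHES = 6
attn_to_ffn = queue.Queue()

def attention_worker() -> None:
    """Stands in for the attention GPUs: memory-bound work per micro-batch."""
    for mb in range(NUM_MICROBATCHES):
        print(f"[attn] micro-batch {mb}: read KV cache, compute attention")
        attn_to_ffn.put(mb)          # ship hidden states to the FFN pool
    attn_to_ffn.put(None)            # signal end of stream

def ffn_worker() -> None:
    """Stands in for the FFN GPUs: compute-bound expert work per micro-batch."""
    while (mb := attn_to_ffn.get()) is not None:
        print(f"[ffn ] micro-batch {mb}: run MoE experts")

# Both pools run concurrently: while the FFN side crunches micro-batch N,
# the attention side is already producing micro-batch N+1, so the transfer
# latency hides behind computation.
t1 = threading.Thread(target=attention_worker)
t2 = threading.Thread(target=ffn_worker)
t1.start(); t2.start(); t1.join(); t2.join()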


3. Cost Breakdown: Where Do the Savings Come From?

3.1 Memory Traffic and Compute (8 K Context)

Model         KV bytes (full 8 K context)   Attention FLOPs (per token)   FFN FLOPs (per token)
DeepSeek-V3   2.88 × 10⁸                    1.47 × 10¹¹                   4.84 × 10¹⁰
Step3         2.56 × 10⁸                    3.27 × 10¹⁰                   5.33 × 10¹⁰

Step3 already moves less data and does less math.
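You can turn the table above into a rough per-decode-step time estimate with a roofline-style calculation. The GPU figures below (peak dense BF16 throughput and HBM bandwidth in the H800 class) are nominal assumptions for illustration, and the estimate ignores weight loading and overlap; it only relates the table's numbers.

# Rough per-token decode estimate from the table above (8 K context).
PEAK_FLOPS = 9.9e14      # ~990 TFLOPS dense BF16 (H800-class, assumed)
HBM_BW     = 3.35e12     # ~3.35 TB/s HBM bandwidth (assumed)

models = {
    "DeepSeek-V3": {"kv_bytes": 2.88e8, "attn_flops": 1.47e11, "ffn_flops": 4.84e10},
    "Step3":       {"kv_bytes": 2.56e8, "attn_flops": 3.27e10, "ffn_flops": 5.33e10},
}

for name, m in models.items():
    t_mem     = m["kv_bytes"] / HBM_BW                         # time to stream the KV cache
    t_compute = (m["attn_flops"] + m["ffn_flops"]) / PEAK_FLOPS  # time to do the math
    print(f"{name:12s} memory-bound {t_mem*1e6:6.1f} µs | "
          f"compute-bound {t_compute*1e6:6.1f} µs")

The point of the exercise: Step3's memory and compute times land much closer together, so neither resource sits idle waiting for the other.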

3.2 Price Tag (USD per 1 M tokens, 32 K context)

Model         Hardware Combo     Cost
DeepSeek-V3   H800 + EP          0.211
Qwen3-MoE     H800 + AFD         0.193
Step3         H800 + H20 + AFD   0.129

AFD lets us pick the cheapest-fit hardware for each stage.


4. Hands-On: Running Step3 Yourself

4.1 Grab the Weights

  • Hugging Face: stepfun-ai/step3
  • ModelScope: mirror available
  • Precision:

    • BF16 (universal)
    • Block-FP8 (smaller, needs Hopper/Ada)

4.2 Supported Engines

  • vLLM
  • SGLang

Both have ready-made Docker images and OpenAI-compatible REST endpoints.

4.3 Minimal Production Setup (32 K context)

Role        Hardware              Count         Notes
Attention   NVIDIA H800 (80 GB)   2 nodes × 8   Holds the KV cache
FFN         NVIDIA H20 (80 GB)    2 nodes × 8   Holds the experts
Network     400 Gbps RoCE         8 NICs/node   < 16.6 ms per hop

Total: 32 GPUs instead of 128–320.

4.4 Quick Start with vLLM

# 1. Clone weights
git lfs install
git clone https://huggingface.co/stepfun-ai/step3

# 2. Start server
docker run --gpus all -p 8000:8000 \
  -v $(pwd)/step3:/model \
  vllm/vllm:latest \
  python -m vllm.entrypoints.openai.api_server \
  --model /model \
  --tensor-parallel-size 8 \
  --pipeline-parallel-size 4 \
  --max-model-len 32768

Client call (identical to OpenAI):

from openai import OpenAI

# Point the client at the local vLLM server; any non-empty key is accepted.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="step3",
    messages=[{"role": "user", "content": "Explain quantum tunneling simply."}],
)
print(response.choices[0].message.content)
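Streaming works through the same endpoint. A minimal sketch, assuming the standard OpenAI streaming interface (which vLLM's server exposes):

stream = client.chat.completions.create(   # reuses the client created above
    model="step3",
    messages=[{"role": "user", "content": "Explain quantum tunneling simply."}],
    stream=True,                           # tokens arrive as they are generated
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:                              # some chunks carry no text content
        print(delta, end="", flush=True)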

5. Performance in the Wild

Metric                        DeepSeek-V3   Step3 (FP8)   Delta
Peak tokens/GPU/s @ 4 K ctx   2 324         4 039         +74 %
GPUs for same SLA             128           32            –75 %
Cost per 1 M tokens           0.211 USD     0.129 USD     –39 %

Figures were measured on identical hardware under a 20 tokens/s latency SLA.
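The cost column follows directly from throughput and GPU rental price. A quick sanity-check calculation, using a hypothetical 2 USD/hour GPU rate; it will not exactly reproduce the table, because the real rate is not stated here and the cost row uses a longer context than the throughput row.

# Convert tokens/GPU/s into USD per 1 M tokens at an assumed GPU-hour price.
GPU_PRICE_PER_HOUR = 2.00   # USD, hypothetical rental rate

def cost_per_million(tokens_per_gpu_per_sec: float) -> float:
    tokens_per_gpu_per_hour = tokens_per_gpu_per_sec * 3600
    return GPU_PRICE_PER_HOUR / tokens_per_gpu_per_hour * 1_000_000

print(f"DeepSeek-V3: {cost_per_million(2324):.3f} USD / 1 M tokens")
print(f"Step3:       {cost_per_million(4039):.3f} USD / 1 M tokens")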


6. Common Questions

Q1: My lab only has A800 or 910B GPUs. Will it work?

Yes. Step3’s MFA is memory-bandwidth friendly:

  • On A800, the attention cost rises by only 0.01 USD per 1 M tokens.
  • DeepSeek-V3’s MLA on A800 triples the cost.

Q2: Is the model truly open-source?

  • Code: Apache 2.0
  • Weights: Apache 2.0
  • Commercial use: Allowed without restrictions.

Q3: How much VRAM do I really need?

Scenario           VRAM per GPU   Notes
Attention          80 GB          Holds 32 K context, batch = 64
FFN                80 GB          Expert shards
Smallest cluster   32 × 80 GB     2A2F topology (2 attention nodes + 2 FFN nodes)
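The attention-GPU figure can be sanity-checked from the model card: the 8-bit KV cache is 2.56 × 10⁸ bytes for an 8 K context, i.e. about 31 KB per token. A short back-of-envelope check, reading the table's batch = 64 as a per-GPU figure (an assumption):

# KV-cache sizing check from the numbers in this article.
kv_bytes_8k_context = 2.56e8                      # from the model card (8-bit KV)
bytes_per_token     = kv_bytes_8k_context / 8192  # ~31.25 KB per token

context_len = 32 * 1024                           # 32 K context
batch_size  = 64                                  # per-GPU batch, from the table above

kv_total_gb = bytes_per_token * context_len * batch_size / 1e9
print(f"KV cache at 32 K context, batch 64: {kv_total_gb:.1f} GB")  # ~65.5 GB
# Leaves headroom on an 80 GB card for weight shards and activations.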

Q4: What about training?

This article focuses on inference. Training costs follow classic scaling laws; Step3’s activated parameters (38 B) keep training within reach of most research groups.


7. Roadmap from the Authors

  • Multi-Token Prediction (MTP): +50 % throughput on long context.
  • 4-bit KV cache: Another halving of memory traffic.
  • 800 Gbps domain networks: Unlocking sparser MoE in future releases.

8. When Should You Pick Step3?

Your Need                    Recommendation
Tight budget, long context   Step3 + AFD starts at 32 GPUs
Mixed GPU types              AFD lets H800 + H20 + A800 work together
Commercial deployment        Apache 2.0 license, no strings attached
Quick POC                    Ready-made vLLM/SGLang images

9. References

  • Technical report: arXiv:2507.19427
  • Official blog: https://stepfun.ai/research/step3
  • Weights & code: https://huggingface.co/collections/stepfun-ai/step3