Tencent Hunyuan Step-by-Step Deployment Guide: Hands-On Tutorial for the Full 0.5B to 7B Series, Plus Performance Comparison

Efficient Coder (高效码农)

5 months ago

Tencent Hunyuan 0.5B/1.8B/4B/7B Small Models, End to End: From Download to Deployment in One Article

Conversational long read | For readers with a junior-college education or above | Based on the official README, 2025-08-04 version

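First, answers to the 3 questions you care about most

Question	One-sentence answer
"I only have one 4090. Which model can I run?"	The 7B runs too; 24 GB of VRAM is enough. To save more memory, use the 4B or 1.8B.
"Where do I download the models?"	Direct links on both GitHub and Hugging Face; use `git clone` or download from the browser.
"How fast can I get results?"	Taking the 7B as an example, with vLLM on a single GPU in BF16, first-token latency is < 200 ms; 4-bit quantization makes it about 30% faster still.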

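Model family at a glance
How good is the performance? At a glance
Download and file layout
Zero-barrier trial: transformers in two lines of code
Production deployment: TensorRT-LLM / vLLM / sglang, pick one
Fine-tuning in practice: the full LLaMA-Factory workflow
Quantization and compression: FP8 / Int4 tested on real hardware
FAQ (continuously updated)
Further reading and official links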

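1. Model family at a glance

Name	Parameters	Typical VRAM usage (BF16)	Recommended scenarios
Hunyuan-0.5B	0.5 billion	≈ 1.2 GB	Phones, Raspberry Pi, in-vehicle MCUs
Hunyuan-1.8B	1.8 billion	≈ 4.0 GB	Home routers, tablets
Hunyuan-4B	4 billion	≈ 8.4 GB	Laptops with an RTX 3060, edge boxes
Hunyuan-7B	7 billion	≈ 14 GB	Single RTX 4090, servers

All sizes support a 256K context window and include a built-in "fast thinking / slow thinking" dual mode.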
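To turn the table into a quick decision, here is a minimal sketch that picks the largest model whose BF16 footprint fits a given GPU. The footprints are copied from the table above; the `pick_model` helper and its `headroom_gb` allowance for KV cache and activations are assumptions made for illustration, not figures from the official README.

```python
# BF16 footprints in GB, copied from the table above; treat them as rough estimates.
BF16_FOOTPRINT_GB = {
    "Hunyuan-0.5B": 1.2,
    "Hunyuan-1.8B": 4.0,
    "Hunyuan-4B": 8.4,
    "Hunyuan-7B": 14.0,
}


def pick_model(vram_gb, headroom_gb=2.0):
    """Return the largest model whose BF16 footprint fits, or None.

    headroom_gb is a hypothetical allowance for KV cache and activations.
    """
    budget = vram_gb - headroom_gb
    fitting = [(size, name) for name, size in BF16_FOOTPRINT_GB.items() if size <= budget]
    return max(fitting)[1] if fitting else None


for vram in (4, 8, 12, 24):
    print(f"{vram} GB -> {pick_model(vram)}")
```

Long prompts need far more than 2 GB of KV cache, so raise `headroom_gb` if you plan to use anything close to the full context window.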

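2. How good is the performance? At a glance

Average academic benchmark scores (excerpt)

Dimension	0.5B	1.8B	4B	7B
Language understanding (MMLU)	54.0	64.6	74.0	79.8
Math (GSM8K)	55.6	77.3	87.5	88.3
Code (HumanEval+)	39.7	60.7	67.8	67.0

Long-context and agent tests

Scenario	Benchmark	0.5B	7B
128K document QA	longbench-v2	34.7	43.0
Tool-call success rate	BFCL-v3	49.8	70.8

Conclusion: the 7B roughly reaches GPT-3.5 level, the 4B offers the best value for money, and the 1.8B and smaller models suit offline devices.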

3. Download and file layout

GitHub direct links

git clone https://github.com/Tencent-Hunyuan/Hunyuan-7B
# or
git clone https://github.com/Tencent-Hunyuan/Hunyuan-4B
# the other sizes follow the same pattern

Hugging Face direct links

Model	URL
0.5B Instruct	https://huggingface.co/tencent/Hunyuan-0.5B-Instruct
1.8B Instruct	https://huggingface.co/tencent/Hunyuan-1.8B-Instruct
4B Instruct	https://huggingface.co/tencent/Hunyuan-4B-Instruct
7B Instruct	https://huggingface.co/tencent/Hunyuan-7B-Instruct

What does the directory look like?

Hunyuan-7B/
├── config.json          # model configuration
├── pytorch_model.bin    # weights (may be sharded)
├── tokenizer.json
├── tokenizer_config.json
└── README.md
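If you would rather script the download than click through the browser, one option is the `snapshot_download` function from the `huggingface_hub` package. The following is a minimal sketch, assuming that package is installed; `repo_id` and `download` are helper names made up for this example, and the repo naming simply follows the Hugging Face links listed above.

```python
def repo_id(size, variant="Instruct"):
    """Build a Hugging Face repo id such as tencent/Hunyuan-7B-Instruct."""
    return f"tencent/Hunyuan-{size}-{variant}"


def download(size, local_dir=None):
    """Download every file in the repo and return the local directory path."""
    # Imported lazily so repo_id() works even without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id(size), local_dir=local_dir)


print(repo_id("0.5B"))
# Example (needs network access):
# download("0.5B", local_dir="./Hunyuan-0.5B-Instruct")
```

Starting with the 0.5B model is a cheap way to confirm that the network and disk setup work before pulling the larger checkpoints.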

4. Zero-barrier trial: transformers in two lines of code

4.1 Minimal inference

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_path = "tencent/Hunyuan-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

messages = [{"role": "user", "content": "Why is seawater salty?"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

out = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(out[0]))

4.2 Turning off "slow thinking"

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False  # disable chain-of-thought (CoT)
)

Alternatively, prepend /no_think to the prompt.
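When thinking stays on, the reply carries both the reasoning and the final answer, so you may want to separate them after decoding. Here is a minimal sketch, assuming the decoded text uses the same `<think>` and `<answer>` tags as the training sample in section 6.2; `split_reply` is a hypothetical helper name, and the fallback for untagged replies is a design choice of this example.

```python
import re


def split_reply(text):
    """Return (thinking, answer) from a reply that uses <think>/<answer> tags.

    Falls back to the whole text as the answer when no <answer> tag is found.
    """
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    thinking = think.group(1).strip() if think else ""
    final = answer.group(1).strip() if answer else text.strip()
    return thinking, final


sample = (
    "<think>Salt comes from rock weathering.</think>"
    "<answer>Rivers carry dissolved minerals into the sea.</answer>"
)
thinking, final = split_reply(sample)
print("thinking:", thinking)
print("answer:", final)
```

The `re.DOTALL` flag matters because reasoning text usually spans several lines.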

5. Production deployment: TensorRT-LLM / vLLM / sglang, pick one

5.1 TensorRT-LLM option (official image)

Launch with Docker

docker pull hunyuaninfer/hunyuan-a13b:hunyuan-moe-A13B-trtllm
docker run --gpus all --rm -it -p 8000:8000 \
  -v /your/model:/model \
  hunyuaninfer/hunyuan-a13b:hunyuan-moe-A13B-trtllm \
  trtllm-serve /model --host 0.0.0.0 --port 8000 --backend pytorch --tp_size 2

Calling it from local Python

import openai
client = openai.Client(base_url="http://localhost:8000/v1", api_key="EMPTY")
print(client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Write a five-character quatrain"}]
))
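The call above prints the whole response object. If you only want the reply text, or prefer not to install the `openai` package, the same request can be sent with the standard library. This is a sketch under the assumption that the server follows the usual OpenAI-style `/v1/chat/completions` schema; `build_chat_request` and `chat` are hypothetical helper names, and `max_tokens=256` is an arbitrary default.

```python
import json
import urllib.request


def build_chat_request(prompt, model="default", max_tokens=256):
    """Build the JSON body for an OpenAI-style chat completion call."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def chat(base_url, prompt, timeout=60):
    """POST the request and return only the assistant's reply text."""
    body = json.dumps(build_chat_request(prompt)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        data = json.load(response)
    return data["choices"][0]["message"]["content"]


print(json.dumps(build_chat_request("Write a five-character quatrain"), indent=2))
# Example (needs a running server):
# print(chat("http://localhost:8000/v1", "Write a five-character quatrain"))
```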

5.2 vLLM option (multi-GPU parallelism)

Start the service

docker pull hunyuaninfer/hunyuan-a13b:hunyuan-moe-A13B-vllm
docker run --gpus all --net=host \
  -v /your/model:/model \
  hunyuaninfer/hunyuan-a13b:hunyuan-moe-A13B-vllm \
  python -m vllm.entrypoints.openai.api_server \
  --model /model --tensor-parallel-size 4 --trust-remote-code
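Once a server like this is running, you can check the first-token latency on your own hardware rather than rely on a quoted figure. The sketch below times any iterable of text chunks; `measure_stream` is a hypothetical helper, and `fake_stream` is only a stand-in so the example runs without a server.

```python
import time


def measure_stream(chunks, clock=time.perf_counter):
    """Consume an iterable of text chunks; return (ttft_s, total_s, text).

    ttft_s is the delay until the first non-empty chunk arrives, or None
    if the stream produced no text at all.
    """
    start = clock()
    first = None
    parts = []
    for chunk in chunks:
        if not chunk:
            continue  # skip empty or None chunks
        if first is None:
            first = clock() - start
        parts.append(chunk)
    total = clock() - start
    return first, total, "".join(parts)


def fake_stream():
    # Stand-in for a real streaming response.
    time.sleep(0.05)
    yield "Hello"
    time.sleep(0.01)
    yield ", world"


ttft, total, text = measure_stream(fake_stream())
print(f"first chunk after {ttft * 1000:.0f} ms, total {total * 1000:.0f} ms: {text}")
```

With the `openai` client, passing `stream=True` to `chat.completions.create` yields chunks whose text sits in `chunk.choices[0].delta.content`, so a generator that yields that field can be passed in place of `fake_stream()`.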

5.3 sglang option (lightweight and fast)

Start

docker pull hunyuaninfer/hunyuan-a13b:hunyuan-moe-A13B-sglang
docker run --gpus all --shm-size 32g -p 30000:30000 \
  -v /your/model:/model \
  hunyuaninfer/hunyuan-a13b:hunyuan-moe-A13B-sglang \
  python -m sglang.launch_server --model-path /model --tp 4 --host 0.0.0.0 --port 30000
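Whichever of the three servers you choose, loading the weights takes a while, and requests sent too early simply fail. A small readiness loop avoids that. This sketch assumes the server answers `GET /v1/models` once it is up, as OpenAI-compatible servers commonly do; `server_is_up` and `wait_until_ready` are hypothetical names, and the retry counts are arbitrary.

```python
import time
import urllib.error
import urllib.request


def server_is_up(base_url, timeout=2.0):
    """Return True if GET {base_url}/models answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/models", timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(base_url, probe=server_is_up, attempts=30, delay_s=2.0, sleep=time.sleep):
    """Poll the probe until it succeeds; return True on success, False on giving up."""
    for attempt in range(attempts):
        if probe(base_url):
            return True
        if attempt < attempts - 1:
            sleep(delay_s)
    return False


# Example (ports taken from the commands above):
# wait_until_ready("http://localhost:8000/v1")    # TensorRT-LLM
# wait_until_ready("http://localhost:30000/v1")   # sglang
```

The `probe` and `sleep` parameters exist so the loop can be exercised without a real server or real waiting.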

6. Fine-tuning in practice: the full LLaMA-Factory workflow

6.1 Environment setup

pip install git+https://github.com/huggingface/transformers@4970b23
git clone https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e .

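6.2 Data format

[
  {
    "messages": [
      {"role": "system", "content": "You are a legal assistant."},
      {"role": "user", "content": "What is the upper limit on liquidated damages for breach of contract?"},
      {"role": "assistant", "content": "<think>...</think><answer>Under Article 585 of the Civil Code...</answer>"}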
    ]
  }
]
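Malformed records tend to surface only after training has started, so it can pay to check the file first. The following is a minimal sketch of such a check. The rules it enforces (known roles, non-empty content, system message first, assistant message last) are assumptions inferred from the sample above, not a specification from LLaMA-Factory, and `validate_record` is a hypothetical helper name.

```python
ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_record(record):
    """Return a list of problems found in one training record (empty if none)."""
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["'messages' must be a non-empty list"]
    problems = []
    for i, msg in enumerate(messages):
        role = msg.get("role")
        content = msg.get("content")
        if role not in ALLOWED_ROLES:
            problems.append(f"message {i}: unknown role {role!r}")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"message {i}: empty content")
        if role == "system" and i != 0:
            problems.append(f"message {i}: system message must come first")
    if messages[-1].get("role") != "assistant":
        problems.append("last message must come from the assistant")
    return problems


sample = {
    "messages": [
        {"role": "system", "content": "You are a legal assistant."},
        {"role": "user", "content": "What is the upper limit on liquidated damages?"},
        {"role": "assistant", "content": "<think>...</think><answer>...</answer>"},
    ]
}
print(validate_record(sample) or "record looks fine")
```

To check a whole file, load it with `json.load` and run `validate_record` over each element of the top-level list.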

6.3 Start training

export DISABLE_VERSION_CHECK=1
llamafactory-cli train examples/hunyuan/hunyuan_full.yaml

Single GPU: run the command directly.
Multi-node: FORCE_TORCHRUN=1 NNODES=... ; see the official README for details.

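7. Quantization and compression: FP8 / Int4 tested on real hardware

7.1 Why quantize?

FP8: halves VRAM usage relative to BF16, with almost no loss in accuracy
Int4: halves VRAM usage again, suitable for 6 GB GPUs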
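The halving follows directly from the number of bits stored per weight. Here is a short worked calculation for the weights alone, using 1 GB = 10^9 bytes; real checkpoints come out slightly larger, since formats such as GPTQ also store per-group scales, and runtime memory adds KV cache and activations on top. `weight_gb` is a helper name made up for this example.

```python
BITS_PER_WEIGHT = {"bf16": 16, "fp8": 8, "int4": 4}


def weight_gb(params_billion, precision):
    """Weights-only size in GB (10^9 bytes): parameters x bits / 8."""
    return params_billion * BITS_PER_WEIGHT[precision] / 8


for precision in BITS_PER_WEIGHT:
    print(f"7B @ {precision}: {weight_gb(7, precision):.1f} GB")
# 7B @ bf16: 14.0 GB
# 7B @ fp8: 7.0 GB
# 7B @ int4: 3.5 GB
```

At roughly 3.5 GB of weights, the Int4 7B leaves a 6 GB card a couple of gigabytes for everything else, which is why short contexts are the realistic target there.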

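7.2 Download pre-quantized models

Precision	Hugging Face location
FP8	Official release coming soon; you can quantize it yourself in the meantime
Int4 GPTQ	Same location, with a `-GPTQ-4bit` suffix on the name

7.3 Self-quantization example (AngelSlim)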

git clone https://github.com/tencent/AngelSlim
python AngelSlim/cli.py quantize \
  --model /path/Hunyuan-7B-Instruct \
  --quant_type int4-gptq \
  --calib_dataset c4 \
  --output_dir Hunyuan-7B-Instruct-Int4

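8. FAQ

Q1: What if I run out of VRAM?

First try 4-bit loading (--load-in-4bit, where your tooling offers it) or the Int4 weights.
If that is still not enough, switch to the 4B or 1.8B model; both keep the 256K long-context capability.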
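In the transformers library, 4-bit loading is configured through `BitsAndBytesConfig` rather than a command-line flag. The sketch below shows one way to do it, assuming `torch`, `transformers`, and `bitsandbytes` are installed, a CUDA GPU is available, and the Hunyuan checkpoints load cleanly through this path, which this guide has not verified. `four_bit_settings` and `load_in_4bit` are hypothetical helper names, and NF4 with double quantization is just one common choice.

```python
def four_bit_settings():
    """Plain-dict settings for transformers' BitsAndBytesConfig (NF4, double quant)."""
    return {
        "load_in_4bit": True,
        "bnb_4bit_quant_type": "nf4",
        "bnb_4bit_use_double_quant": True,
    }


def load_in_4bit(model_path):
    """Load a causal LM with 4-bit weights and return the model."""
    # Imported lazily: these packages are heavy and need a GPU setup.
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    config = BitsAndBytesConfig(
        bnb_4bit_compute_dtype=torch.bfloat16,
        **four_bit_settings(),
    )
    return AutoModelForCausalLM.from_pretrained(
        model_path,
        quantization_config=config,
        device_map="auto",
        trust_remote_code=True,
    )


print(four_bit_settings())
# Example (needs a GPU and the packages above):
# model = load_in_4bit("tencent/Hunyuan-7B-Instruct")
```

This quantizes on the fly at load time, so it needs no calibration data, whereas the GPTQ weights from section 7 are quantized ahead of time.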

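Q2: How do I turn off the thinking process and get only the answer?

Prepend /no_think to the prompt.
Or set enable_thinking=False in apply_chat_template.

Q3: Does it run on Windows?

Yes. WSL2 + Docker Desktop gives the best experience;
native Windows requires CUDA 12.8 or later plus PyTorch 2.4.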

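Q4: What about commercial licensing?

Listed here as Apache-2.0 across the series, which permits commercial use; confirm against the LICENSE file in each repository and comply with its terms.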

9. Further reading and official links

Resource	Link
Official website	https://hunyuan.tencent.com
Online demo	https://hunyuan.tencent.com/?model=hunyuan-a13b
Hugging Face collection	https://huggingface.co/tencent
Technical report	README_CN.md in the same repository
Email feedback	hunyuan_opensource@tencent.com

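Closing thoughts
The greatest value of the Hunyuan small models is that they bring 256K long-context and tool-calling capabilities, once available only to large companies, down to a single consumer GPU or even a phone. Whether you are an independent developer, a university lab, or a product manager working on in-vehicle or smart-home scenarios, there is a size that fits. Have fun, and once you get it running, come back and share your results!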