# Tencent Hunyuan 0.5B/1.8B/4B/7B Compact Models: A Complete Hands-On Guide

*From download to production deployment: no hype, just facts.*
## Quick answers to the three most-asked questions

| Question | Straight answer |
| --- | --- |
| "I only have one RTX 4090. Which model can I run?" | 7B fits in 24 GB of VRAM; if you need even more headroom, use 4B or 1.8B. |
| "Where do I download the files?" | GitHub mirrors and Hugging Face hubs are both live; `git clone` or browser downloads both work. |
| "How fast is 'fast'?" | 7B on a single card with vLLM BF16 gives < 200 ms time-to-first-token; 4-bit quantization shaves another 30% off (see the sketch just below). |
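To sanity-check that latency claim yourself, a minimal vLLM offline run looks like this. This is a sketch, not the official recipe: the repo id `tencent/Hunyuan-7B-Instruct` is my assumption, so substitute whichever checkpoint or local path you actually downloaded.

```python
from vllm import LLM, SamplingParams

# Assumed repo id; a local path to the downloaded weights also works.
llm = LLM(
    model="tencent/Hunyuan-7B-Instruct",
    dtype="bfloat16",
    trust_remote_code=True,
)

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain time-to-first-token in one sentence."], params)
print(outputs[0].outputs[0].text)
```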
## Table of contents

1. Model family at a glance
2. Benchmarks in plain numbers
3. Download mirrors and file layout
4. Zero-to-hero with `transformers` (two lines of code)
5. Production deployments: TensorRT-LLM vs vLLM vs SGLang
6. Fine-tuning walk-through with LLaMA-Factory
7. Quantization: FP8 & Int4 step-by-step
8. Living FAQ (updated as issues come in)
9. Further reading and official links
## 1. Model family at a glance

| Model | Parameters | VRAM (BF16) | Where it shines |
| --- | --- | --- | --- |
| Hunyuan-0.5B | 0.5B | ~1.2 GB | Phones, Raspberry Pi, in-vehicle MCUs |
| Hunyuan-1.8B | 1.8B | ~4.0 GB | Smart speakers, home routers, tablets |
| Hunyuan-4B | 4.0B | ~8.4 GB | Gaming laptops with an RTX 3060, edge boxes |
| Hunyuan-7B | 7.0B | ~14 GB | A single RTX 4090, small servers |
All four share the same 256 k-token native context window and the same dual-mode inference switch (“fast think” vs “slow think”).
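The mode switch is controlled at the chat-template level. The FAQ in section 8 names the knob (`enable_thinking` in `apply_chat_template`), so here is a minimal sketch of flipping it; the repo id is my assumption, swap in your local clone path if it differs:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "tencent/Hunyuan-1.8B-Instruct"  # assumed repo id; a local path works too

tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo, device_map="auto", trust_remote_code=True)

messages = [{"role": "user", "content": "Summarize the Hunyuan compact family in one line."}]

# Slow think (default): the model emits a <think>...</think> trace before answering.
# Fast think: per the FAQ, pass enable_thinking=False to suppress the trace.
inputs = tok.apply_chat_template(
    messages,
    add_generation_prompt=True,
    enable_thinking=False,  # flip to True for slow-think mode
    return_tensors="pt",
).to(model.device)

out = model.generate(inputs, max_new_tokens=256)
print(tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```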
## 2. Benchmarks in plain numbers

### Academic average (selected tasks)

| Dimension | 0.5B | 1.8B | 4B | 7B |
| --- | --- | --- | --- | --- |
| Language understanding (MMLU) | 54.0 | 64.6 | 74.0 | 79.8 |
| Math (GSM8K) | 55.6 | 77.3 | 87.5 | 88.3 |
| Code (HumanEval+) | 39.7 | 60.7 | 67.8 | 67.0 |
### Long-context & agent tests

| Scenario | Benchmark | 0.5B | 7B |
| --- | --- | --- | --- |
| 128K-document QA | LongBench-v2 | 34.7 | 43.0 |
| Tool-use success | BFCL-v3 | 49.8% | 70.8% |

**Take-away:** 7B roughly matches GPT-3.5-class models; 4B hits the best price/performance point; anything below 2B is for offline or battery-first devices.
## 3. Download mirrors and file layout

### GitHub mirrors

```bash
git clone https://github.com/Tencent-Hunyuan/Hunyuan-7B
# or
git clone https://github.com/Tencent-Hunyuan/Hunyuan-4B
# same pattern for Hunyuan-1.8B and Hunyuan-0.5B
```
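For the Hugging Face route mentioned in the quick answers, `huggingface_hub` can pull a full snapshot. A sketch, assuming the weights live under the `tencent` org (the exact repo name is my guess; browse the org page listed in section 9 to confirm):

```python
from huggingface_hub import snapshot_download

# Assumed repo id; check https://huggingface.co/tencent for the exact name.
local_dir = snapshot_download(repo_id="tencent/Hunyuan-7B-Instruct")
print(f"Weights and tokenizer files downloaded to: {local_dir}")
```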
[{"messages":[{"role":"system","content":"You are a legal assistant."},{"role":"user","content":"What is the cap on contract damages?"},{"role":"assistant","content":"<think>...</think><answer>According to Article 585 of the Civil Code...</answer>"}]}]
## 8. Living FAQ (updated as issues come in)

**"7B runs out of VRAM on my card?"**

- Try `--load-in-4bit` or the official Int4 checkpoint (sketched below).
- If that still fails, drop to 4B; you keep the 256 k context.
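If you want the `--load-in-4bit` behaviour from plain `transformers`, the standard bitsandbytes path looks like this (a sketch; the repo id is assumed, and the `bitsandbytes` package must be installed):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

repo = "tencent/Hunyuan-7B-Instruct"  # assumed repo id

# NF4 4-bit weights with BF16 compute: roughly quarters the weight memory.
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo,
    quantization_config=bnb,
    device_map="auto",
    trust_remote_code=True,
)
```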
**"How do I hide the chain-of-thought output?"**

- Add `/no_think` at the start of the prompt, or
- set `enable_thinking=False` in `apply_chat_template` (see the sketch in section 1).
**"Does Windows work?"**

Yes. WSL2 + Docker Desktop is the smoothest route; native Windows needs CUDA ≥ 12.8 and PyTorch 2.4.
**"Commercial license?"**

Apache-2.0 for every model; commercial use is allowed.
## 9. Further reading and official links

| Resource | URL |
| --- | --- |
| Official site | https://hunyuan.tencent.com |
| Web demo | https://hunyuan.tencent.com/?model=hunyuan-a13b |
| Hugging Face org | https://huggingface.co/tencent |
| Technical README | Same repo as the models |
| Contact | hunyuan_opensource@tencent.com |
## Closing thought

These compact Hunyuan models shrink once data-center-only perks (256 k context, tool calls, state-of-the-art reasoning) down to a single consumer GPU or even a mobile phone. Whether you are an indie dev, a university lab, or a product manager eyeing in-vehicle or smart-home use cases, there is a size that fits. Happy hacking, and see you in the issues tab!