# Tencent Hunyuan 0.5B/1.8B/4B/7B Compact Models: A Complete Hands-On Guide

*From download to production deployment: no hype, just facts.*
## Quick answers to the three most-asked questions

| Question | Straight answer |
| --- | --- |
| "I only have one RTX 4090. Which model can I run?" | 7B fits in 24 GB of VRAM; if you need even more headroom, use 4B or 1.8B. |
| "Where do I download the files?" | GitHub mirrors and Hugging Face hubs are both live; `git clone` or browser downloads both work. |
| "How fast is 'fast'?" | 7B on a single card with vLLM BF16 gives < 200 ms time-to-first-token; 4-bit quantization shaves another 30% off (see the sketch just below). |
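To sanity-check that latency claim yourself, a minimal vLLM offline run looks like this. This is a sketch, not the official recipe: the repo id `tencent/Hunyuan-7B-Instruct` is my assumption, so substitute whichever checkpoint or local path you actually downloaded.

```python
from vllm import LLM, SamplingParams

# Assumed repo id; a local path to the downloaded weights also works.
llm = LLM(
    model="tencent/Hunyuan-7B-Instruct",
    dtype="bfloat16",
    trust_remote_code=True,
)

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain time-to-first-token in one sentence."], params)
print(outputs[0].outputs[0].text)
```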
## Table of contents

1. Model family at a glance
2. Benchmarks in plain numbers
3. Download mirrors and file layout
4. Zero-to-hero with `transformers` (two lines of code)
5. Production deployments: TensorRT-LLM vs vLLM vs SGLang
6. Fine-tuning walk-through with LLaMA-Factory
7. Quantization: FP8 & Int4 step-by-step
8. Living FAQ (updated as issues come in)
9. Further reading and official links
## 1. Model family at a glance

| Model | Parameters | VRAM (BF16) | Where it shines |
| --- | --- | --- | --- |
| Hunyuan-0.5B | 0.5B | ~1.2 GB | Phones, Raspberry Pi, in-vehicle MCUs |
| Hunyuan-1.8B | 1.8B | ~4.0 GB | Smart speakers, home routers, tablets |
| Hunyuan-4B | 4.0B | ~8.4 GB | Gaming laptops with an RTX 3060, edge boxes |
| Hunyuan-7B | 7.0B | ~14 GB | A single RTX 4090, small servers |
All four share the same 256 k-token native context window and the same dual-mode inference switch (“fast think” vs “slow think”).
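The mode switch is controlled at the chat-template level. The FAQ in section 8 names the knob (`enable_thinking` in `apply_chat_template`), so here is a minimal sketch of flipping it; the repo id is my assumption, swap in your local clone path if it differs:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "tencent/Hunyuan-1.8B-Instruct"  # assumed repo id; a local path works too

tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo, device_map="auto", trust_remote_code=True)

messages = [{"role": "user", "content": "Summarize the Hunyuan compact family in one line."}]

# Slow think (default): the model emits a <think>...</think> trace before answering.
# Fast think: per the FAQ, pass enable_thinking=False to suppress the trace.
inputs = tok.apply_chat_template(
    messages,
    add_generation_prompt=True,
    enable_thinking=False,  # flip to True for slow-think mode
    return_tensors="pt",
).to(model.device)

out = model.generate(inputs, max_new_tokens=256)
print(tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```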
## 2. Benchmarks in plain numbers

### Academic average (selected tasks)

| Dimension | 0.5B | 1.8B | 4B | 7B |
| --- | --- | --- | --- | --- |
| Language understanding (MMLU) | 54.0 | 64.6 | 74.0 | 79.8 |
| Math (GSM8K) | 55.6 | 77.3 | 87.5 | 88.3 |
| Code (HumanEval+) | 39.7 | 60.7 | 67.8 | 67.0 |
### Long-context & agent tests

| Scenario | Benchmark | 0.5B | 7B |
| --- | --- | --- | --- |
| 128K-document QA | LongBench-v2 | 34.7 | 43.0 |
| Tool-use success | BFCL-v3 | 49.8% | 70.8% |

**Take-away:** 7B roughly matches GPT-3.5-class models; 4B hits the best price/performance point; anything below 2B is for offline or battery-first devices.
## 3. Download mirrors and file layout

### GitHub mirrors

```bash
git clone https://github.com/Tencent-Hunyuan/Hunyuan-7B
# or
git clone https://github.com/Tencent-Hunyuan/Hunyuan-4B
# same pattern for Hunyuan-1.8B and Hunyuan-0.5B
```
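For the Hugging Face route mentioned in the quick answers, `huggingface_hub` can pull a full snapshot. A sketch, assuming the weights live under the `tencent` org (the exact repo name is my guess; browse the org page listed in section 9 to confirm):

```python
from huggingface_hub import snapshot_download

# Assumed repo id; check https://huggingface.co/tencent for the exact name.
local_dir = snapshot_download(repo_id="tencent/Hunyuan-7B-Instruct")
print(f"Weights and tokenizer files downloaded to: {local_dir}")
```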
[{"messages":[{"role":"system","content":"You are a legal assistant."},{"role":"user","content":"What is the cap on contract damages?"},{"role":"assistant","content":"<think>...</think><answer>According to Article 585 of the Civil Code...</answer>"}]}]
## 8. Living FAQ (updated as issues come in)

**"7B runs out of VRAM on my card?"**

- Try `--load-in-4bit` or the official Int4 checkpoint (sketched below).
- If that still fails, drop to 4B; you keep the 256 k context.
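If you want the `--load-in-4bit` behaviour from plain `transformers`, the standard bitsandbytes path looks like this (a sketch; the repo id is assumed, and the `bitsandbytes` package must be installed):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

repo = "tencent/Hunyuan-7B-Instruct"  # assumed repo id

# NF4 4-bit weights with BF16 compute: roughly quarters the weight memory.
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo,
    quantization_config=bnb,
    device_map="auto",
    trust_remote_code=True,
)
```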
**"How do I hide the chain-of-thought output?"**

- Add `/no_think` at the start of the prompt, or
- set `enable_thinking=False` in `apply_chat_template` (see the sketch in section 1).
**"Does Windows work?"**

Yes. WSL2 + Docker Desktop is the smoothest route; native Windows needs CUDA ≥ 12.8 and PyTorch 2.4.
**"Commercial license?"**

Apache-2.0 for every model; commercial use is allowed.
## 9. Further reading and official links

| Resource | URL |
| --- | --- |
| Official site | https://hunyuan.tencent.com |
| Web demo | https://hunyuan.tencent.com/?model=hunyuan-a13b |
| Hugging Face org | https://huggingface.co/tencent |
| Technical README | Same repo as the models |
| Contact | hunyuan_opensource@tencent.com |
## Closing thought

These compact Hunyuan models shrink once data-center-only perks (256 k context, tool calls, state-of-the-art reasoning) down to a single consumer GPU or even a mobile phone. Whether you are an indie dev, a university lab, or a product manager eyeing in-vehicle or smart-home use cases, there is a size that fits. Happy hacking, and see you in the issues tab!