Tencent Hunyuan 0.5B/1.8B/4B/7B Compact Models: A Complete Hands-On Guide
From download to production deployment—no hype, just facts
Table of contents
- Model family at a glance
- Benchmarks in plain numbers
- Download mirrors and file layout
- Zero-to-hero with transformers (two lines of code)
- Production deployments: TensorRT-LLM vs vLLM vs sglang
- Fine-tuning walk-through with LLaMA-Factory
- Quantization: FP8 & Int4 step-by-step
- Living FAQ (updated as issues come in)
- Further reading and official links
1. Model family at a glance
All four share the same 256K-token native context window and the same dual-mode inference switch (“fast think” vs “slow think”).
2. Benchmarks in plain numbers
Academic average (selected tasks)
Long-context & agent tests
Take-away: 7B roughly matches GPT-3.5-class models; 4B hits the best price/performance point; anything below 2B is for offline or battery-first devices.
3. Download mirrors and file layout
GitHub mirrors
git clone https://github.com/Tencent-Hunyuan/Hunyuan-7B
# or
git clone https://github.com/Tencent-Hunyuan/Hunyuan-4B
# same pattern for 1.8B and 0.5B
Hugging Face hubs
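If you would rather pull the weights straight from the Hub, huggingface_hub can fetch a full snapshot. A minimal sketch, assuming the repo id follows the tencent/Hunyuan-7B-Instruct naming used later in this guide:
from huggingface_hub import snapshot_download

# download the whole repo (config, tokenizer, weights) into a local folder;
# the repo id is assumed to follow the tencent/Hunyuan-*-Instruct pattern
snapshot_download(
    repo_id="tencent/Hunyuan-7B-Instruct",
    local_dir="Hunyuan-7B-Instruct",
)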
What the folder looks like
Hunyuan-7B/
├── config.json            # architecture
├── pytorch_model.bin      # weights (may be sharded)
├── tokenizer.json
├── tokenizer_config.json
└── README.md
4. Zero-to-hero with transformers (two lines of code)
4.1 Minimal inference
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model_path = "tencent/Hunyuan-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
messages = [{"role": "user", "content": "Why is sea water salty?"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(out[0]))
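The decode above echoes the prompt and any special tokens. To print only the model's reply, slice off the input ids first; this continues from the variables defined above:
# decode only the newly generated tokens, skipping special tokens
new_tokens = out[0][inputs["input_ids"].shape[-1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))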
4.2 Turning off “slow thinking”
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False   # disables CoT
)
Alternatively, prepend /no_think to the prompt.
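A quick sketch of that prompt-level switch, reusing the tokenizer from 4.1:
# prepending /no_think to the user turn also suppresses the thinking block
messages = [{"role": "user", "content": "/no_think Why is sea water salty?"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)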
5. Production deployments: TensorRT-LLM vs vLLM vs sglang
5.1 TensorRT-LLM path (official container)
One-liner Docker launch
docker pull hunyuaninfer/hunyuan-a13b:hunyuan-moe-A13B-trtllm
docker run --gpus all --rm -it -p 8000:8000 \
  -v /your/model:/model \
  hunyuaninfer/hunyuan-a13b:hunyuan-moe-A13B-trtllm \
  trtllm-serve /model --host 0.0.0.0 --port 8000 --backend pytorch --tp_size 2
Local Python client
import openai
client = openai.Client(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Write a haiku about code."}]
)
print(resp.choices[0].message.content)
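For interactive use you will usually want streaming, which OpenAI-compatible endpoints generally support. A minimal sketch against the same client, assuming the server accepts stream=True:
# stream tokens as they arrive instead of waiting for the full completion
stream = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Write a haiku about code."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()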
5.2 vLLM path (multi-GPU ready)
Start server
docker pull hunyuaninfer/hunyuan-a13b:hunyuan-moe-A13B-vllm
docker run --gpus all --net=host \
  -v /your/model:/model \
  hunyuaninfer/hunyuan-a13b:hunyuan-moe-A13B-vllm \
  python -m vllm.entrypoints.openai.api_server \
  --model /model --tensor-parallel-size 4 --trust-remote-code
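Once the server is up, the same OpenAI client pattern works. A sketch assuming the default port 8000 and that the served model name defaults to the mounted path /model:
import openai

client = openai.Client(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="/model",  # vLLM defaults the served model name to the --model path
    messages=[{"role": "user", "content": "Summarize what makes Hunyuan-7B interesting in one sentence."}],
)
print(resp.choices[0].message.content)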
5.3 sglang path (lightweight & fast)
Launch
docker pull hunyuaninfer/hunyuan-a13b:hunyuan-moe-A13B-sglang
docker run --gpus all --shm-size 32g -p 30000:30000 \
  -v /your/model:/model \
  hunyuaninfer/hunyuan-a13b:hunyuan-moe-A13B-sglang \
  python -m sglang.launch_server --model-path /model --tp 4 --host 0.0.0.0 --port 30000
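sglang also exposes an OpenAI-compatible /v1 endpoint, so the client side looks the same. A sketch assuming port 30000 from the launch command above:
import openai

client = openai.Client(base_url="http://localhost:30000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="default",  # single-model sglang servers typically do not enforce this name; adjust if yours does
    messages=[{"role": "user", "content": "Give three use cases for a compact on-device LLM."}],
)
print(resp.choices[0].message.content)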
6. Fine-tuning walk-through with LLaMA-Factory
6.1 Install
pip install git+https://github.com/huggingface/transformers@4970b23
git clone https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e .
6.2 Data format (sharegpt style)
[
  {
    "messages": [
      {"role": "system", "content": "You are a legal assistant."},
      {"role": "user", "content": "What is the cap on contract damages?"},
      {"role": "assistant", "content": "<think>...</think><answer>According to Article 585 of the Civil Code...</answer>"}
    ]
  }
]
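Before launching a run it is worth sanity-checking the JSON file. A small sketch; the path data/legal_sft.json is just an example:
import json

with open("data/legal_sft.json", encoding="utf-8") as f:
    data = json.load(f)

for i, sample in enumerate(data):
    roles = [turn["role"] for turn in sample["messages"]]
    # every sample should end with the assistant turn that carries the <think>/<answer> target
    assert roles[-1] == "assistant", f"sample {i} does not end with an assistant turn"
print(f"{len(data)} samples look structurally OK")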
6.3 Kick off training
export DISABLE_VERSION_CHECK=1
llamafactory-cli train examples/hunyuan/hunyuan_full.yaml
- Single GPU: run as-is
- Multi-node: set FORCE_TORCHRUN=1 NNODES=... per the README.
7. Quantization: FP8 & Int4 step-by-step
7.1 Why quantize?
- FP8: ~50% VRAM cut, almost no accuracy loss
- Int4: ~75% VRAM cut, great for 6 GB cards
7.2 Ready-to-use checkpoints
7.3 Roll your own with AngelSlim
git clone https://github.com/tencent/AngelSlim
python AngelSlim/cli.py quantize \
  --model /path/Hunyuan-7B-Instruct \
  --quant_type int4-gptq \
  --calib_dataset c4 \
  --output_dir Hunyuan-7B-Instruct-Int4
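To check that the quantized folder actually loads, point transformers at the output directory. A sketch assuming a GPU and a GPTQ-capable backend (e.g. gptqmodel or auto-gptq) is installed:
from transformers import AutoModelForCausalLM, AutoTokenizer

# load the Int4 (GPTQ) output produced above
model = AutoModelForCausalLM.from_pretrained(
    "Hunyuan-7B-Instruct-Int4",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("Hunyuan-7B-Instruct-Int4", trust_remote_code=True)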
8. Living FAQ
“I only have 12 GB VRAM—can I still run 7B?”
- Try --load-in-4bit or the official Int4 checkpoint (see the sketch below).
- If that fails, drop to 4B; you keep the 256K context.
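A minimal 4-bit load via bitsandbytes as an alternative to the official Int4 checkpoint; this assumes a CUDA GPU and the bitsandbytes package installed:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained(
    "tencent/Hunyuan-7B-Instruct",
    quantization_config=bnb,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("tencent/Hunyuan-7B-Instruct", trust_remote_code=True)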
“How do I hide the chain-of-thought output?”
- Add /no_think at the start of the prompt, or
- Set enable_thinking=False in apply_chat_template.
“Does Windows work?”
- Yes. WSL2 + Docker Desktop is smoothest.
- Native Windows needs CUDA ≥ 12.8 and PyTorch 2.4.
“Commercial license?”
- Apache-2.0 for every model; commercial use is allowed.
9. Further reading and official links
Closing thought
These compact Hunyuan models shrink the once data-center-only perks—256K context, tool calls, state-of-the-art reasoning—down to a single consumer GPU or even a mobile phone. Whether you are an indie dev, a university lab, or a product manager eyeing in-vehicle or smart-home use cases, there is a size that fits. Happy hacking, and see you in the issues tab!

