Tencent Hunyuan 0.5B/1.8B/4B/7B Compact Models: A Complete Hands-On Guide

From download to production deployment—no hype, just facts


Quick answers to the three most-asked questions

Question Straight answer
“I only have one RTX 4090. Which model can I run?” 7 B fits in 24 GB VRAM; if you need even more head-room, use 4 B or 1.8 B.
“Where do I download the files?” GitHub mirrors and Hugging Face hubs are both live; git clone or browser downloads work.
“How fast is ‘fast’?” 7 B on a single card with vLLM BF16 gives < 200 ms time-to-first-token; 4-bit quant shaves another 30 % off.

Table of contents

  1. Model family at a glance
  2. Benchmarks in plain numbers
  3. Download mirrors and file layout
  4. Zero-to-hero with transformers (a dozen lines of code)
  5. Production deployments: TensorRT-LLM vs vLLM vs sglang
  6. Fine-tuning walk-through with LLaMA-Factory
  7. Quantization: FP8 & Int4 step-by-step
  8. Living FAQ (updated as issues come in)
  9. Further reading and official links

1. Model family at a glance

Model Parameters VRAM (BF16) Where it shines
Hunyuan-0.5B 0.5 B ~1.2 GB Phones, Raspberry Pi, in-vehicle MCU
Hunyuan-1.8B 1.8 B ~4.0 GB Smart speakers, home routers, tablets
Hunyuan-4B 4.0 B ~8.4 GB Gaming laptops with 3060, edge boxes
Hunyuan-7B 7.0 B ~14 GB Single RTX 4090, small servers

All four share the same 256 k-token native context window and the same dual-mode inference switch (“fast think” vs “slow think”).


2. Benchmarks in plain numbers

Academic average (selected tasks)

Dimension 0.5B 1.8B 4B 7B
Language understanding (MMLU) 54.0 64.6 74.0 79.8
Math (GSM8K) 55.6 77.3 87.5 88.3
Code (HumanEval+) 39.7 60.7 67.8 67.0

Long-context & agent tests

Scenario Benchmark 0.5B 7B
128 k-document QA LongBench-v2 34.7 43.0
Tool-use success BFCL-v3 49.8 % 70.8 %

Take-away: 7B roughly matches GPT-3.5 class models; 4B hits the best price/performance point; anything below 2 B is for offline or battery-first devices.


3. Download mirrors and file layout

GitHub mirrors

git clone https://github.com/Tencent-Hunyuan/Hunyuan-7B
# or
git clone https://github.com/Tencent-Hunyuan/Hunyuan-4B
# same pattern for 1.8 B and 0.5 B

Hugging Face hubs

Model Hub URL
0.5 B Instruct https://huggingface.co/tencent/Hunyuan-0.5B-Instruct
1.8 B Instruct https://huggingface.co/tencent/Hunyuan-1.8B-Instruct
4 B Instruct https://huggingface.co/tencent/Hunyuan-4B-Instruct
7 B Instruct https://huggingface.co/tencent/Hunyuan-7B-Instruct
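
If you would rather script the download than click through a browser, the huggingface_hub client can mirror a whole repo. A minimal sketch (the local_dir below is just an example path, not an official layout):

from huggingface_hub import snapshot_download

# Pull every file of the 7 B Instruct repo into a local folder.
local_path = snapshot_download(
    repo_id="tencent/Hunyuan-7B-Instruct",
    local_dir="./Hunyuan-7B-Instruct",   # example target directory
)
print("Model files in:", local_path)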

What the folder looks like

Hunyuan-7B/
├── config.json            # architecture
├── pytorch_model.bin      # weights (may be sharded)
├── tokenizer.json
├── tokenizer_config.json
└── README.md

4. Zero-to-hero with transformers (a dozen lines of code)

4.1 Minimal inference

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_path = "tencent/Hunyuan-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

messages = [{"role": "user", "content": "Why is sea water salty?"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

out = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(out[0]))
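
The instruct checkpoints interleave their reasoning with the final answer using <think>…</think> tags (the same convention as the fine-tuning data in section 6.2). If you only want the answer text, a small helper like the sketch below works, assuming that tag convention holds for your checkpoint:

import re

def strip_thinking(text: str) -> str:
    # Drop everything up to and including the first closing </think> tag, if present.
    return re.sub(r"(?s)^.*?</think>", "", text, count=1).strip()

# e.g. strip_thinking(tokenizer.decode(out[0], skip_special_tokens=True))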

4.2 Turning off “slow thinking”

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False   # disables CoT
)

Alternatively, prepend /no_think to the prompt.
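
A quick sketch of the same switch at the prompt level, reusing the tokenizer from 4.1:

messages = [{"role": "user", "content": "/no_think Why is sea water salty?"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)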


5. Production deployments: TensorRT-LLM vs vLLM vs sglang

5.1 TensorRT-LLM path (official container)

One-liner Docker launch

docker pull hunyuaninfer/hunyuan-a13b:hunyuan-moe-A13B-trtllm
docker run --gpus all --rm -it -p 8000:8000 \
  -v /your/model:/model \
  hunyuaninfer/hunyuan-a13b:hunyuan-moe-A13B-trtllm \
  trtllm-serve /model --host 0.0.0.0 --port 8000 --backend pytorch --tp_size 2

Local Python client

import openai

client = openai.Client(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Write a haiku about code."}]
)
print(resp.choices[0].message.content)   # print only the generated text

5.2 vLLM path (multi-GPU ready)

Start server

docker pull hunyuaninfer/hunyuan-a13b:hunyuan-moe-A13B-vllm
docker run --gpus all --net=host \
  -v /your/model:/model \
  hunyuaninfer/hunyuan-a13b:hunyuan-moe-A13B-vllm \
  python -m vllm.entrypoints.openai.api_server \
  --model /model --tensor-parallel-size 4 --trust-remote-code
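
vLLM exposes the same OpenAI-compatible API, so the client pattern from 5.1 carries over. A sketch assuming the default port 8000 and host networking as launched above; note that vLLM names the model after its --model path unless --served-model-name is set:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="/model",   # the --model path doubles as the served model name
    messages=[{"role": "user", "content": "Give me one sentence on what vLLM does."}],
)
print(resp.choices[0].message.content)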

5.3 sglang path (lightweight & fast)

Launch

docker pull hunyuaninfer/hunyuan-a13b:hunyuan-moe-A13B-sglang
docker run --gpus all --shm-size 32g -p 30000:30000 \
  -v /your/model:/model \
  hunyuaninfer/hunyuan-a13b:hunyuan-moe-A13B-sglang \
  python -m sglang.launch_server --model-path /model --tp 4 --host 0.0.0.0 --port 30000
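
sglang also speaks the OpenAI protocol under /v1, so the same client works with only the base URL changed. A minimal check, assuming port 30000 as launched above:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="default",   # sglang's OpenAI-compatible server accepts "default" for the loaded model
    messages=[{"role": "user", "content": "Say hello in five words."}],
)
print(resp.choices[0].message.content)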

6. Fine-tuning walk-through with LLaMA-Factory

6.1 Install

pip install git+https://github.com/huggingface/transformers@4970b23
git clone https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e .

6.2 Data format (sharegpt style)

[
  {
    "messages": [
      {"role": "system", "content": "You are a legal assistant."},
      {"role": "user", "content": "What is the cap on contract damages?"},
      {"role": "assistant", "content": "<think>...</think><answer>According to Article 585 of the Civil Code...</answer>"}
    ]
  }
]
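
LLaMA-Factory does not pick up loose JSON files automatically: custom datasets are registered in data/dataset_info.json and then referenced by name in the training YAML. The sketch below shows that bookkeeping; my_legal_sft is a made-up dataset name, and the field mapping follows the sharegpt/OpenAI-style entry described in LLaMA-Factory's data README, so double-check it against your checkout:

import json

sample = {
    "messages": [
        {"role": "system", "content": "You are a legal assistant."},
        {"role": "user", "content": "What is the cap on contract damages?"},
        {"role": "assistant", "content": "<think>...</think><answer>According to Article 585 of the Civil Code...</answer>"},
    ]
}

# Write the training file into LLaMA-Factory's data/ directory.
with open("data/my_legal_sft.json", "w", encoding="utf-8") as f:
    json.dump([sample], f, ensure_ascii=False, indent=2)

# Register it so the YAML can say `dataset: my_legal_sft`.
with open("data/dataset_info.json", "r+", encoding="utf-8") as f:
    info = json.load(f)
    info["my_legal_sft"] = {
        "file_name": "my_legal_sft.json",
        "formatting": "sharegpt",
        "columns": {"messages": "messages"},
        "tags": {"role_tag": "role", "content_tag": "content",
                 "user_tag": "user", "assistant_tag": "assistant", "system_tag": "system"},
    }
    f.seek(0)
    json.dump(info, f, ensure_ascii=False, indent=2)
    f.truncate()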

6.3 Kick off training

export DISABLE_VERSION_CHECK=1
llamafactory-cli train examples/hunyuan/hunyuan_full.yaml

  • Single GPU: run the command as-is.
  • Multi-node: set FORCE_TORCHRUN=1, NNODES=..., etc. per the LLaMA-Factory README.

7. Quantization: FP8 & Int4 step-by-step

7.1 Why quantize?

  • FP8: ~50 % VRAM cut, almost no accuracy loss
  • Int4: ~75 % VRAM cut, great for 6 GB cards (an on-the-fly option is sketched right after this list)
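
If you want the memory savings right away without producing a new checkpoint, transformers can load the BF16 weights on the fly in 4-bit via bitsandbytes (pip install bitsandbytes). This is a rough sketch rather than the official quantization path, and quality will sit slightly below a calibrated GPTQ export:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quant_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # store weights as 4-bit NF4
    bnb_4bit_compute_dtype=torch.bfloat16,  # keep compute in BF16
)

model_path = "tencent/Hunyuan-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=quant_cfg,
    device_map="auto",
    trust_remote_code=True,
)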

7.2 Ready-to-use checkpoints

Precision Location
FP8 Will be uploaded; self-quantize meanwhile
Int4 GPTQ Same repo, look for -GPTQ-4bit suffix

7.3 Roll your own with AngelSlim

git clone https://github.com/tencent/AngelSlim
python AngelSlim/cli.py quantize \
  --model /path/Hunyuan-7B-Instruct \
  --quant_type int4-gptq \
  --calib_dataset c4 \
  --output_dir Hunyuan-7B-Instruct-Int4
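
Assuming the export produces a transformers-compatible GPTQ folder (and that a GPTQ backend such as the auto-gptq/gptqmodel integration is installed), loading it looks just like loading the original model; a hedged sketch:

from transformers import AutoModelForCausalLM, AutoTokenizer

quant_path = "Hunyuan-7B-Instruct-Int4"   # the --output_dir from the command above
tokenizer = AutoTokenizer.from_pretrained(quant_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(quant_path, device_map="auto", trust_remote_code=True)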

8. Living FAQ

“I only have 12 GB VRAM—can I still run 7B?”
  • Try 4-bit loading (bitsandbytes load_in_4bit, see section 7) or the official Int4 checkpoint.
  • If that fails, drop to 4 B; you keep the 256 k context.
“How do I hide the chain-of-thought output?”
  • Add /no_think at the start of the prompt, or
  • Set enable_thinking=False in apply_chat_template.
“Does Windows work?”
  • Yes. WSL2 + Docker Desktop is smoothest.
  • Native Windows needs CUDA ≥ 12.8 and PyTorch 2.4.
“Commercial license?”
  • Commercial use is allowed, but the models ship under Tencent's own Hunyuan community license rather than plain Apache-2.0; check the LICENSE file in each repo for the exact terms.

9. Further reading and official links

Resource URL
Official site https://hunyuan.tencent.com
Web demo https://hunyuan.tencent.com/?model=hunyuan-a13b
Hugging Face org https://huggingface.co/tencent
Technical README Same repo as the models
Contact hunyuan_opensource@tencent.com

Closing thought
These compact Hunyuan models shrink the once data-center-only perks—256 k context, tool calls, state-of-the-art reasoning—down to a single consumer GPU or even a mobile phone. Whether you are an indie dev, a university lab, or a product manager eyeing in-vehicle or smart-home use cases, there is a size that fits. Happy hacking, and see you in the issues tab!