Hunyuan-MT: A 7-Billion-Parameter Translation Model That Outperforms Giants
“Can a 7-billion-parameter model really beat 200-billion-parameter giants at translation?”
“Is open-source finally good enough for Tibetan, Uyghur, Kazakh, and Mongolian?”
“How long does it take to get it running on my own GPU?”
If you have asked any of these questions, you are in the right place.
This post translates the official Hunyuan-MT technical report and README into plain English. Every figure, command, and benchmark comes straight from the released files—nothing added, nothing removed.
Quick overview
Item | Hunyuan-MT-7B | Hunyuan-MT-Chimera-7B |
---|---|---|
Size | 7 B parameters | 7 B parameters (fusion model) |
Languages | 33, incl. Chinese, English, Japanese, French, German, Korean, Tibetan, Uyghur, Kazakh, Mongolian | same |
Training stages | 5-stage pipeline (general pre-training → MT pre-training → SFT → RL → weak-to-strong RL) | built on top of the base model |
Key achievement | first place in 30 of 31 WMT 2025 language pairs | first open-source translation fusion model |
License | Apache-2.0 weights + code | same |
Links | GitHub, Hugging Face | same |
Table of contents
- [Why another translation model?](#why-another-translation-model)
- [Five training stages in everyday words](#five-training-stages)
- [Data pipeline: from 1.3 T tokens to clean sentence pairs](#data-pipeline)
- [Benchmarks: numbers and what they mean](#benchmarks)
- [Case studies: social-media slang, medical terms, place names](#case-studies)
- [Step-by-step setup](#step-by-step-setup)
- [Fine-tuning your own data with LLaMA-Factory](#fine-tuning)
- [Production deployment: TensorRT-LLM, vLLM, or sglang](#deployment)
- [FAQ](#faq)
1. Why another translation model? {#why-another-translation-model}
Machine translation has two stubborn problems:
- Low-resource languages get poor coverage.
- Proprietary APIs keep the best quality locked away.
Hunyuan-MT tries to solve both. It keeps the parameter count small (7 B) so you can run it on one consumer GPU, yet it outperforms much larger closed models on 33 languages, including Tibetan, Uyghur, Kazakh, and Mongolian.
2. Five training stages in everyday words {#five-training-stages}
Stage 1 – General pre-training
Goal: Teach the model general language understanding.
Data: 1.3 trillion tokens across 112 non-Chinese/English languages plus Chinese and English.
Cleanup:
- A three-dimension quality model scores every document (knowledge value, authenticity, writing style).
- Three tagging systems guarantee balanced topics: 24 industries × 24 themes.

Outcome: Hunyuan-7B-Base.
Stage 2 – Translation-oriented pre-training
Goal: Focus only on translation skills.
Data mix: monolingual (mC4, OSCAR) + parallel (OPUS, ParaCrawl).
Trick 1 – RegMix: run small-scale experiments to find the best mixing ratio instead of guessing.
Trick 2 – Replay buffer: 20 % of general data is re-inserted to avoid forgetting.
Outcome: Hunyuan-7B-Base★ (★ = after MT pre-training).
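To make the replay trick concrete, here is a toy sketch of how a mixed training batch could be drawn. The 20 % replay share comes from the description above; the corpus variables and the batch size are placeholders of mine, not details from the paper.

# Toy sketch of the replay buffer: re-insert 20% general-domain data
# into the translation-oriented pre-training mix. Corpus variables are placeholders.
import random

REPLAY_RATIO = 0.20          # share of general pre-training data kept in the mix

def sample_batch(mt_corpus, general_corpus, batch_size=32):
    n_replay = int(batch_size * REPLAY_RATIO)
    batch = random.sample(general_corpus, n_replay)           # replayed general data
    batch += random.sample(mt_corpus, batch_size - n_replay)  # monolingual + parallel MT data
    random.shuffle(batch)
    return batch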
Stage 3 – Supervised Fine-Tuning (SFT)
Two data passes
Pass | Size | Source | Purpose |
---|---|---|---|
Stage I | 3 million pairs | Flores-200, WMT tests, human-checked minority↔Chinese, synthetic data | teach basic translation |
Stage II | 268 k pairs | deep-cleaned, many-shot filtered, human-verified | polish quality |
Prompt templates
Chinese ⇄ Other
把下面的文本翻译成<target_language>,不要额外解释。
<source_text>
(English gloss: "Translate the following text into <target_language>, no extra explanation.")
Any other pair
Translate the following segment into <target_language>, without additional explanation.
<source_text>
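If you drive the model from code, a small helper can pick the right template for a given pair. This is a sketch of my own (the build_prompt name and the zh_pair flag are not from the official repo), and it assumes that for the Chinese template the target-language name is written in Chinese, e.g. 英语 for English:

# Sketch: choose between the two official prompt templates.
# build_prompt and the zh_pair flag are illustrative, not from the official repo.
def build_prompt(source_text: str, target_lang: str, zh_pair: bool) -> str:
    """Return the Hunyuan-MT prompt for one segment.

    zh_pair: True when Chinese is the source or the target language.
    For the Chinese template, target_lang is normally the Chinese
    name of the language (e.g. "英语" for English).
    """
    if zh_pair:
        return f"把下面的文本翻译成{target_lang},不要额外解释。\n\n{source_text}"
    return (f"Translate the following segment into {target_lang}, "
            f"without additional explanation.\n\n{source_text}")

print(build_prompt("海水为什么是咸的?", "英语", zh_pair=True))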
Stage 4 – Reinforcement Learning (RL)
Algorithm: GRPO (Group Relative Policy Optimization).
Reward components
- XCOMET-XXL – automatic metric close to human judgment
- DeepSeek-V3 score – secondary semantic check
- Terminology overlap – word-alignment-based for domain terms
- Repetition penalty – stops the model from looping
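The paper does not publish the exact reward formula, so the following is only an illustrative sketch of how those four components could be combined; the weights and the scorer stubs are assumptions of mine:

# Illustrative composite reward, not the official implementation.
# xcomet_score, llm_judge_score and term_overlap stand in for the real scorers
# (XCOMET-XXL, DeepSeek-V3, word-alignment-based terminology matching).
def composite_reward(src, hyp, xcomet_score, llm_judge_score, term_overlap,
                     w_xcomet=0.5, w_judge=0.3, w_terms=0.2, rep_penalty=0.5):
    reward = (w_xcomet * xcomet_score(src, hyp)
              + w_judge * llm_judge_score(src, hyp)
              + w_terms * term_overlap(src, hyp))
    # Repetition penalty: punish translations that loop on the same trigram.
    tokens = hyp.split()
    trigrams = list(zip(tokens, tokens[1:], tokens[2:]))
    if trigrams and len(set(trigrams)) / len(trigrams) < 0.5:
        reward -= rep_penalty
    return reward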
Stage 5 – Weak-to-Strong RL (fusion)
Idea: at inference time, collect 6 candidate translations, then ask the same 7 B model to “refine” them into one.
Gain: extra 2–5 % XCOMET score without extra parameters.
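Conceptually, the fusion step is just another chat request: sample six candidates, then ask the model to merge them. The prompt wording below is my own paraphrase, not the official Chimera template (use the template from the Hunyuan-MT-Chimera model card in production):

# Sketch of weak-to-strong fusion: refine 6 candidate translations into one.
# The prompt wording is illustrative; use the official Chimera template in practice.
def build_fusion_prompt(source_text, candidates):
    numbered = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(candidates))
    return (
        "Analyse the following six candidate translations of the source segment "
        "and produce a single refined translation, without additional explanation.\n\n"
        f"Source:\n{source_text}\n\nCandidates:\n{numbered}"
    )

The returned string can be sent through the same chat interface shown in the setup section below.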
3. Data pipeline: from 1.3 T tokens to clean sentence pairs {#data-pipeline}
Step | Tool | What it removes |
---|---|---|
Language ID | fastText | mis-classified documents |
Near-duplicates | minLSH | duplicate pages |
Perplexity filter | KenLM | low-quality or garbled text |
Parallel-corpus quality | CometKiwi | bad alignments |
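As a rough illustration of the language-ID and perplexity filters, the sketch below uses fastText's public lid.176.bin model and a KenLM language model; the model paths and thresholds are placeholders, not the values used by the Hunyuan team:

# Rough sketch of document filtering with fastText language ID and KenLM perplexity.
# Model paths and thresholds are placeholders, not values from the paper.
import fasttext
import kenlm

lid = fasttext.load_model("lid.176.bin")   # fastText language-ID model
lm = kenlm.Model("english.arpa")           # KenLM model for the target language

def keep_document(text, expected_lang="en", max_perplexity=1000.0):
    labels, probs = lid.predict(text.replace("\n", " "))
    if labels[0] != f"__label__{expected_lang}" or probs[0] < 0.9:
        return False                        # mis-classified language
    if lm.perplexity(text) > max_perplexity:
        return False                        # garbled or low-quality text
    return True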
4. Benchmarks: numbers and what they mean {#benchmarks}
Automatic metrics (higher is better)
Test set | Direction | Hunyuan-MT-7B | Qwen3-235B-A22B | Gemini-2.5-Pro |
---|---|---|---|---|
WMT24pp | English → XX | 85.9 | 76.7 | 82.5 |
Mandarin ↔ Minority | Chinese ↔ Tibetan/Uyghur/Kazakh/Mongolian | 60.8 | 44.9 | 58.1 |
Flores-200 | Chinese → XX | 87.6 | 85.1 | 91.5 |
Human evaluation (0–4 scale)
Model | Chinese → English | English → Chinese | Average |
---|---|---|---|
Hunyuan-MT-7B | 3.26 | 3.16 | 3.21 |
Gemini-2.5-Pro | 3.23 | 3.22 | 3.22 |
Google Translate | 2.84 | 2.10 | 2.47 |
5. Case studies {#case-studies}
Scenario | Input | Hunyuan-MT-7B | Google Translate |
---|---|---|---|
Social media slang | 小红薯在国外疯魔了 | REDnote has become incredibly popular abroad | sweet potatoes are popular abroad |
English idiom | You are killing me! | 你真要把我笑死了! | you are going to kill me |
Medical terms | 尿酸性肾结石 | uric acid kidney stones | uricidal kidney stones |
Place name | 654 Huangpu Drive | 黄埔大道654号 | (kept in English) |
6. Step-by-step setup {#step-by-step-setup}
Prerequisites
- Python 3.9+
- CUDA 11.8 or 12.x
- 16 GB GPU memory (8 GB if you use the fp8 or int4 model)
Install
pip install transformers==4.56.0 torch
Minimal working example
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "tencent/Hunyuan-MT-7B"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,   # bf16 keeps memory use around 16 GB
    device_map="auto"             # place the model on the available GPU(s)
)

messages = [
    {"role": "user",
     "content": "Translate into English:\n\n海水为什么是咸的?"}
]

# Build the chat-formatted input ids for a single translation turn
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

# Sampling parameters from the example; do_sample=True makes them take effect
outputs = model.generate(
    inputs,
    max_new_tokens=128,
    do_sample=True,
    temperature=0.7,
    top_p=0.6,
    top_k=20,
    repetition_penalty=1.05
)

# Decode only the newly generated tokens, not the prompt
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
Expected output
Why is seawater salty? Because it contains large amounts of dissolved salts and minerals.
7. Fine-tuning your own data with LLaMA-Factory {#fine-tuning}
1. Install LLaMA-Factory
git clone https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e .
2. Prepare your data
File: data/my_translate.json
[
  {
    "messages": [
      {"role": "user", "content": "Translate into Kazakh:\nOne Belt One Road"},
      {"role": "assistant", "content": "Бір белдеу, бір жол"}
    ]
  }
]
Update data/dataset_info.json
"my_translate": {
"file_name": "my_translate.json",
"formatting": "sharegpt",
"columns": {"messages": "messages"},
"tags": {
"role_tag": "role",
"content_tag": "content",
"user_tag": "user",
"assistant_tag": "assistant"
}
}
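If your sentence pairs currently live in a tab-separated file, a few lines of Python can emit the JSON shown above. The file names, and the Kazakh target language in the prompt, are placeholders; adjust them to your data:

# Sketch: convert "source<TAB>translation" lines into the sharegpt-style JSON above.
# File names and the target language in the prompt are placeholders.
import json

records = []
with open("pairs.tsv", encoding="utf-8") as f:
    for line in f:
        src, tgt = line.rstrip("\n").split("\t")
        records.append({
            "messages": [
                {"role": "user", "content": f"Translate into Kazakh:\n{src}"},
                {"role": "assistant", "content": tgt}
            ]
        })

with open("data/my_translate.json", "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)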
3. Launch training (single GPU)
export DISABLE_VERSION_CHECK=1
llamafactory-cli train examples/hunyuan/hunyuan_full.yaml \
--model_name_or_path tencent/Hunyuan-MT-7B \
--dataset my_translate \
--output_dir ./hunyuan-ft-7b
8. Production deployment {#deployment}
Framework | Pros | One-liner |
---|---|---|
TensorRT-LLM | lowest latency | docker pull hunyuaninfer/hunyuan-7b:hunyuan-7b-trtllm |
vLLM | high throughput | python -m vllm.entrypoints.openai.api_server --model tencent/Hunyuan-MT-7B |
sglang | minimal code | python -m sglang.launch_server --model-path tencent/Hunyuan-MT-7B |
Example: vLLM server
python -m vllm.entrypoints.openai.api_server \
--model tencent/Hunyuan-MT-7B \
--host 0.0.0.0 \
--port 8000 \
--dtype bfloat16
Then query:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "tencent/Hunyuan-MT-7B",
"messages": [
{"role": "user", "content": "Translate into Tibetan:\nArtificial intelligence is changing our lives."}
]
}'
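Because vLLM exposes an OpenAI-compatible endpoint, you can also query it from Python with the openai client. The base_url and the dummy API key below are assumptions for a local, unauthenticated deployment:

# Query the local vLLM server through its OpenAI-compatible endpoint.
# base_url and api_key are placeholders for a local, unauthenticated deployment.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="tencent/Hunyuan-MT-7B",
    messages=[
        {"role": "user",
         "content": "Translate into Tibetan:\nArtificial intelligence is changing our lives."}
    ],
    temperature=0.7,
    top_p=0.6,
)
print(response.choices[0].message.content)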
9. FAQ {#faq}
Q: Is 7 B really enough for commercial use?
A: In blind human tests the gap between Hunyuan-MT-7B and Gemini-2.5-Pro is <0.1 points. For minority languages it is often better.
Q: How much GPU memory do I need?
A: 16 GB for bf16, 8 GB for fp8, 6 GB for int4.
Q: What if I only have a few thousand sentence pairs?
A: Use the LLaMA-Factory recipe above, switching to a LoRA configuration to save memory; even 2–3 k high-quality pairs can noticeably improve the model.
Q: Does Chimera work offline?
A: Yes. You provide six candidate translations and run Chimera locally; no Internet required.
Q: Chain-of-thought for translation?
A: The paper tried it. Without a joint reward on both the chain of thought and the final sentence, the model only produced boilerplate text, so it is disabled by default.
Wrap-up
Hunyuan-MT proves that 7 B parameters are already enough to challenge proprietary giants, provided you invest in clean data and a disciplined five-stage training recipe.
If you need a drop-in open-source translator for 33 languages—including low-resource ones like Tibetan or Kazakh—start with the single-line install above and fine-tune with your own data in hours, not weeks.
Happy translating!