In-Depth Look at TeleChat3: China Telecom’s Open-Source Thinking-Enabled Models Trained Fully on Domestic Hardware
Summary / Meta Description
TeleChat3 is China Telecom’s latest open-source large language model series, fully trained on domestic computing infrastructure. Released in December 2025, the lineup includes the 105B MoE model (TeleChat3-105B-A4.7B-Thinking, ~4.7B active parameters) and the 36B dense model (TeleChat3-36B-Thinking). Both feature explicit “Thinking” mode for step-by-step reasoning, achieving strong results in coding (SWE-Bench Verified 51), agent capabilities (Tau2-Bench 63.6), and multi-dimensional benchmarks.
If you’re evaluating open-source LLMs in early 2026 — especially models that prioritize traceable reasoning, realistic engineering performance, and full-stack domestic sovereignty — the TeleChat3 Thinking series deserves close attention.
What Exactly Is TeleChat3?
TeleChat3 (officially named the Xingchen / Star Semantic Large Model) is a family of large language models developed by China Telecom’s Artificial Intelligence Research Institute (TeleAI). The December 2025 release introduced two flagship variants:
- TeleChat3-105B-A4.7B-Thinking — a sparse Mixture-of-Experts (MoE) architecture with 105 billion total parameters but only ~4.7 billion activated per token
- TeleChat3-36B-Thinking — a dense model using Grouped-Query Attention (GQA)
Both were pretrained entirely on China’s indigenous computing ecosystem, marking one of the most visible demonstrations of large-scale LLM training independent of NVIDIA hardware at this scale.
Model Architecture at a Glance
| Model | Total Params | Architecture | Active Params / token | Attention | Experts | Experts per Token | Shared Experts |
|---|---|---|---|---|---|---|---|
| TeleChat3-105B-A4.7B-Thinking | 105B | MoE | ≈4.7B | MLA | 192 | 4 | 1 |
| TeleChat3-36B-Thinking | 36B | Dense | 36B | GQA | — | — | — |
The MoE design gives the 105B variant memory and inference efficiency similar to much smaller dense models, while the 36B version offers straightforward dense-model behavior that’s often easier to fine-tune.
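To make "only ~4.7B activated per token" concrete, here is a toy top-k routing layer in PyTorch. This is a minimal sketch, not the TeleChat3 implementation: the hidden sizes are deliberately tiny and made up, and only the counts (192 routed experts, 4 selected per token, plus 1 shared expert) mirror the table above.

```python
# Toy sketch of top-k MoE routing — NOT the TeleChat3 implementation.
# Hidden sizes are tiny/illustrative; only the expert counts follow the table above.
import torch
import torch.nn as nn

class ToyMoELayer(nn.Module):
    def __init__(self, d_model=128, d_ff=256, n_experts=192, top_k=4):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        # One always-on shared expert, mirroring the "Shared Experts = 1" column.
        self.shared_expert = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )
        self.top_k = top_k

    def forward(self, x):                                  # x: (n_tokens, d_model)
        scores = self.router(x).softmax(dim=-1)            # (n_tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)     # keep 4 of 192 experts per token
        outputs = []
        for t in range(x.size(0)):                         # naive per-token dispatch
            y = self.shared_expert(x[t])                   # shared expert always runs
            for w, e in zip(weights[t], idx[t]):           # only the selected experts execute
                y = y + w * self.experts[int(e)](x[t])
            outputs.append(y)
        return torch.stack(outputs)

layer = ToyMoELayer()
print(layer(torch.randn(8, 128)).shape)  # torch.Size([8, 128])
```

Every token runs the shared expert plus its four selected experts, so per-token compute scales with the active-parameter count rather than the 105B total.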
Performance Highlights — Real Numbers from Official Evaluations
All scores below were obtained with Thinking mode enabled (the model generates an explicit reasoning chain inside <think>...</think> tags before producing the final answer).
| Benchmark | Category | TeleChat3-105B-A4.7B-Thinking | TeleChat3-36B-Thinking | Comparison Notes |
|---|---|---|---|---|
| MMLU-Pro | Knowledge | 78.5 | 80.89 | Very competitive general-knowledge score |
| GPQA-Diamond | Knowledge | 66 | 70.56 | Hard graduate-level questions |
| Creative Writing v3 | Creative | 82.1 | 84.33 | Strong long-form generation |
| Math-500 | Math | 91 | 95 | Mid-difficulty math problems |
| AIME 2025 | Math | 69.7 | 73.3 | Real competition problems |
| LiveCodeBench (24.08–25.05) | Coding | 66.5 | 69 | Recent programming contest tasks |
| HumanEval-X | Coding | 87.3 | 92.67 | Multi-language code generation |
| SWE-Bench Verified | Coding | 42 | 51 | Real-world GitHub issue resolution |
| Tau2-Bench | Agent | 58 | 63.6 | Complex multi-step tool use & planning |
Key takeaways from the leaderboard-style results:
- The 36B dense model frequently outperforms the much larger MoE variant — a classic demonstration that careful architecture + training can beat raw parameter count.
- SWE-Bench Verified 51 is among the highest scores published by any open-source model in late 2025 / early 2026 for genuine software-engineering tasks.
- Agent performance (Tau2-Bench 63.6) suggests meaningful progress toward practical tool-calling and multi-step workflows.
How to Run TeleChat3 Locally (Quick-start with transformers)
The simplest way to experiment is with Hugging Face transformers (36B example shown):
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the tokenizer and model (bfloat16, sharded automatically across available devices)
tokenizer = AutoTokenizer.from_pretrained(
    "TeleAI/TeleChat3-36B-Thinking",  # or a local path
    trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    "TeleAI/TeleChat3-36B-Thinking",
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16
)

# Build the prompt with the model's chat template
messages = [{"role": "user", "content": "What is the difference between light soy sauce and dark soy sauce?"}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True  # automatically appends <think>\n, enabling Thinking mode
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# Generate; do_sample=True makes the sampling parameters below take effect
generated_ids = model.generate(
    **inputs,
    max_new_tokens=2048,
    do_sample=True,
    temperature=0.6,
    top_p=0.95,
    repetition_penalty=1.05
)

# The decoded output contains the prompt, the <think>...</think> reasoning chain,
# and the answer; keep only the text after the closing </think> tag.
response = tokenizer.decode(generated_ids[0], skip_special_tokens=False)
final_answer = response.split("</think>")[-1].strip()
print(final_answer)
```
Important Inference Tips
- Thinking mode is triggered automatically when add_generation_prompt=True.
- For multi-turn chat: do NOT manually include previous <think>...</think> blocks — the chat template handles history correctly.
- General chat → temperature=0.6, repetition_penalty=1.05
- Math / code / reasoning-heavy tasks → try temperature=1.1–1.2, repetition_penalty=1.0 (both presets are sketched below)
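A small sketch of the two presets as reusable GenerationConfig objects. The top_p value and token budgets are carried over from the quick-start or assumed, not official recommendations; temperature and repetition_penalty follow the tips above.

```python
from transformers import GenerationConfig

chat_preset = GenerationConfig(
    do_sample=True, temperature=0.6, top_p=0.95,
    repetition_penalty=1.05, max_new_tokens=2048,
)

reasoning_preset = GenerationConfig(
    do_sample=True, temperature=1.1,   # 1.1–1.2 suggested for math / code / heavy reasoning
    top_p=0.95, repetition_penalty=1.0,
    max_new_tokens=4096,               # larger budget for long thinking chains (assumption)
)

# generated_ids = model.generate(**inputs, generation_config=reasoning_preset)
```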
Production Deployment Options
- vLLM — high-throughput engine with PagedAttention; official examples provided for TeleChat3 Thinking mode
- SGLang — excellent for structured output, function calling, and agent workflows
Both support OpenAI-compatible API endpoints.
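Once a vLLM or SGLang server is running, any OpenAI-compatible client can talk to it. A minimal sketch with the openai Python package — the base URL, port, and served model name below are assumptions and must match whatever you pass to the server launch command:

```python
from openai import OpenAI

# Local OpenAI-compatible server (vLLM or SGLang); no real API key is needed.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="TeleAI/TeleChat3-36B-Thinking",   # must match the served model name
    messages=[{"role": "user", "content": "Summarize the difference between MoE and dense LLMs."}],
    temperature=0.6,
    top_p=0.95,
    max_tokens=2048,
)
print(resp.choices[0].message.content)
```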
Fine-Tuning — Recommended Path: LLaMA-Factory
The project officially recommends LLaMA-Factory for:
- Full-parameter fine-tuning
- LoRA / QLoRA
- Supervised Fine-Tuning (SFT)
- Direct Preference Optimization (DPO)
- One-click merging & inference
Detailed configuration examples and dataset formats are available in the project’s tutorial folder.
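As a starting point, LLaMA-Factory consumes standard Alpaca-style JSON for SFT. The snippet below is a minimal sketch of preparing such a file; the file name and record contents are placeholders, and the exact dataset registration and training configs should follow the project's tutorial folder.

```python
# Minimal sketch: writing SFT data in the Alpaca-style JSON format that
# LLaMA-Factory accepts. File name and record contents are placeholders.
import json

samples = [
    {
        "instruction": "Explain what a Mixture-of-Experts layer is.",
        "input": "",
        "output": "A Mixture-of-Experts layer routes each token to a small subset of expert networks...",
    },
]

with open("telechat3_sft_demo.json", "w", encoding="utf-8") as f:
    json.dump(samples, f, ensure_ascii=False, indent=2)
```

The resulting file is typically registered in LLaMA-Factory's dataset_info.json and then referenced from a LoRA, QLoRA, full fine-tuning, or DPO config.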
Full Domestic Tech Stack Support
TeleChat3 is one of very few open-source LLM families that publishes concrete training metrics on purely domestic hardware:
- Hardware: Ascend Atlas 800T A2 clusters
- Framework: MindSpore + MindSpore Transformers
Reported single-epoch training throughput (2025 data):
| Model | Samples per second | NPUs used |
|---|---|---|
| 105B-A4.7B | 0.1002 | 4096 |
| 36B | 0.0633 | 2048 |
These figures suggest that large-scale training is practical on this stack, even without access to the latest foreign GPU generations.
Frequently Asked Questions
Do I have to use Thinking mode?
No — but nearly all published benchmark scores were achieved with Thinking enabled. Disabling it usually reduces performance on reasoning-heavy tasks.
Which model should I deploy — 105B MoE or 36B dense?
- Lowest per-token inference cost / fastest decoding → 105B-A4.7B-Thinking (per-token compute comparable to a small ~7B dense model, though all 105B weights still need to fit in memory)
- Maximum accuracy, especially on coding & agents → 36B-Thinking
How good is SWE-Bench 51 really?
In late 2025 / early 2026, reaching 50+ on the Verified subset means the model can solve genuine, non-synthetic GitHub issues with reasonable reliability — a practical software-engineering capability level.
Can I do multi-turn conversations with transformers?
Yes. Just feed the conversation history as normal messages; do not include previous thinking blocks.
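Continuing the quick-start example above (reusing tokenizer, model, and final_answer), a second turn looks like this — note that the stored assistant message contains only the final answer, never the <think> block:

```python
messages = [
    {"role": "user", "content": "What is the difference between light soy sauce and dark soy sauce?"},
    {"role": "assistant", "content": final_answer},   # previous final answer only — no <think> content
    {"role": "user", "content": "Which one should I use for braising?"},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
generated_ids = model.generate(
    **inputs, max_new_tokens=2048, do_sample=True,
    temperature=0.6, top_p=0.95, repetition_penalty=1.05
)
next_answer = tokenizer.decode(generated_ids[0], skip_special_tokens=False).split("</think>")[-1].strip()
print(next_answer)
```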
Why TeleChat3 Matters in 2026
The December 2025 release of TeleChat3-105B-A4.7B-Thinking and TeleChat3-36B-Thinking delivers several rare combinations at once:
- full end-to-end training on domestic compute
- competitive real-world engineering scores (especially SWE-Bench & agent tasks)
- explicit, inspectable reasoning via Thinking mode
- broad inference compatibility (transformers, vLLM, SGLang)
- straightforward fine-tuning via LLaMA-Factory
- native Ascend + MindSpore ecosystem integration
For teams building AI products in geopolitically sensitive environments, or anyone who values traceable reasoning chains in production systems, these models represent one of the strongest fully open domestic alternatives available today.
