In-Depth Look at TeleChat3: China Telecom’s Open-Source Thinking-Enabled Models Trained Fully on Domestic Hardware
Summary / Meta Description
TeleChat3 is China Telecom’s latest open-source large language model series, fully trained on domestic computing infrastructure. Released in December 2025, the lineup includes the 105B MoE model (TeleChat3-105B-A4.7B-Thinking, ~4.7B active parameters) and the 36B dense model (TeleChat3-36B-Thinking). Both feature explicit “Thinking” mode for step-by-step reasoning, achieving strong results in coding (SWE-Bench Verified 51), agent capabilities (Tau2-Bench 63.6), and multi-dimensional benchmarks.
If you’re evaluating open-source LLMs in early 2026 — especially models that prioritize traceable reasoning, realistic engineering performance, and full-stack domestic sovereignty — the TeleChat3 Thinking series deserves close attention.
What Exactly Is TeleChat3?
TeleChat3 (officially named the Xingchen / Star Semantic Large Model) is a family of large language models developed by China Telecom’s Artificial Intelligence Research Institute (TeleAI). The December 2025 release introduced two flagship variants:
- TeleChat3-105B-A4.7B-Thinking — a sparse Mixture-of-Experts (MoE) architecture with 105 billion total parameters but only ~4.7 billion activated per token
- TeleChat3-36B-Thinking — a dense model using Grouped-Query Attention (GQA)
Both were pretrained entirely on China’s indigenous computing ecosystem, marking one of the most visible demonstrations of large-scale LLM training independent of NVIDIA hardware at this scale.
Model Architecture at a Glance
| Model | Total Params | Architecture | Active Params / token | Attention | Experts | Experts per Token | Shared Experts |
|---|---|---|---|---|---|---|---|
| TeleChat3-105B-A4.7B-Thinking | 105B | MoE | ≈4.7B | MLA | 192 | 4 | 1 |
| TeleChat3-36B-Thinking | 36B | Dense | 36B | GQA | — | — | — |
The MoE design gives the 105B variant memory and inference efficiency similar to much smaller dense models, while the 36B version offers straightforward dense-model behavior that’s often easier to fine-tune.
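To make "only ~4.7B activated per token" concrete, here is a toy top-k routing layer in PyTorch. This is a minimal sketch, not the TeleChat3 implementation: the hidden sizes are deliberately tiny and made up, and only the counts (192 routed experts, 4 selected per token, plus 1 shared expert) mirror the table above.

```python
# Toy sketch of top-k MoE routing — NOT the TeleChat3 implementation.
# Hidden sizes are tiny/illustrative; only the expert counts follow the table above.
import torch
import torch.nn as nn

class ToyMoELayer(nn.Module):
    def __init__(self, d_model=128, d_ff=256, n_experts=192, top_k=4):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        # One always-on shared expert, mirroring the "Shared Experts = 1" column.
        self.shared_expert = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )
        self.top_k = top_k

    def forward(self, x):                                  # x: (n_tokens, d_model)
        scores = self.router(x).softmax(dim=-1)            # (n_tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)     # keep 4 of 192 experts per token
        outputs = []
        for t in range(x.size(0)):                         # naive per-token dispatch
            y = self.shared_expert(x[t])                   # shared expert always runs
            for w, e in zip(weights[t], idx[t]):           # only the selected experts execute
                y = y + w * self.experts[int(e)](x[t])
            outputs.append(y)
        return torch.stack(outputs)

layer = ToyMoELayer()
print(layer(torch.randn(8, 128)).shape)  # torch.Size([8, 128])
```

Every token runs the shared expert plus its four selected experts, so per-token compute scales with the active-parameter count rather than the 105B total.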
Performance Highlights — Real Numbers from Official Evaluations
All scores below were obtained with Thinking mode enabled (the model generates an explicit reasoning chain inside <think>...</think> tags before producing the final answer).
| Benchmark | Category | TeleChat3-105B-A4.7B-Thinking | TeleChat3-36B-Thinking | Comparison Notes |
|---|---|---|---|---|
| MMLU-Pro | Knowledge | 78.5 | 80.89 | Very competitive general-knowledge score |
| GPQA-Diamond | Knowledge | 66 | 70.56 | Hard graduate-level questions |
| Creative Writing v3 | Creative | 82.1 | 84.33 | Strong long-form generation |
| Math-500 | Math | 91 | 95 | Mid-difficulty math problems |
| AIME 2025 | Math | 69.7 | 73.3 | Real competition problems |
| LiveCodeBench (24.08–25.05) | Coding | 66.5 | 69 | Recent programming contest tasks |
| HumanEval-X | Coding | 87.3 | 92.67 | Multi-language code generation |
| SWE-Bench Verified | Coding | 42 | 51 | Real-world GitHub issue resolution |
| Tau2-Bench | Agent | 58 | 63.6 | Complex multi-step tool use & planning |
Key takeaways from the leaderboard-style results:
- The 36B dense model frequently outperforms the much larger MoE variant — a classic demonstration that careful architecture + training can beat raw parameter count.
- SWE-Bench Verified 51 is among the highest scores published by any open-source model in late 2025 / early 2026 for genuine software-engineering tasks.
- Agent performance (Tau2-Bench 63.6) suggests meaningful progress toward practical tool-calling and multi-step workflows.
How to Run TeleChat3 Locally (Quick-start with transformers)
The simplest way to experiment is with Hugging Face transformers (36B example shown):
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the tokenizer and model (bfloat16, sharded automatically across available devices)
tokenizer = AutoTokenizer.from_pretrained(
    "TeleAI/TeleChat3-36B-Thinking",  # or a local path
    trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    "TeleAI/TeleChat3-36B-Thinking",
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16
)

# Build the prompt with the model's chat template
messages = [{"role": "user", "content": "What is the difference between light soy sauce and dark soy sauce?"}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True  # automatically appends <think>\n, enabling Thinking mode
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# Generate; do_sample=True makes the sampling parameters below take effect
generated_ids = model.generate(
    **inputs,
    max_new_tokens=2048,
    do_sample=True,
    temperature=0.6,
    top_p=0.95,
    repetition_penalty=1.05
)

# The decoded output contains the prompt, the <think>...</think> reasoning chain,
# and the answer; keep only the text after the closing </think> tag.
response = tokenizer.decode(generated_ids[0], skip_special_tokens=False)
final_answer = response.split("</think>")[-1].strip()
print(final_answer)
```
Important Inference Tips
- Thinking mode is triggered automatically when add_generation_prompt=True.
- For multi-turn chat: do NOT manually include previous <think>...</think> blocks — the chat template handles history correctly.
- General chat → temperature=0.6, repetition_penalty=1.05
- Math / code / reasoning-heavy tasks → try temperature=1.1–1.2, repetition_penalty=1.0 (both presets are sketched below)
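A small sketch of the two presets as reusable GenerationConfig objects. The top_p value and token budgets are carried over from the quick-start or assumed, not official recommendations; temperature and repetition_penalty follow the tips above.

```python
from transformers import GenerationConfig

chat_preset = GenerationConfig(
    do_sample=True, temperature=0.6, top_p=0.95,
    repetition_penalty=1.05, max_new_tokens=2048,
)

reasoning_preset = GenerationConfig(
    do_sample=True, temperature=1.1,   # 1.1–1.2 suggested for math / code / heavy reasoning
    top_p=0.95, repetition_penalty=1.0,
    max_new_tokens=4096,               # larger budget for long thinking chains (assumption)
)

# generated_ids = model.generate(**inputs, generation_config=reasoning_preset)
```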
Production Deployment Options
- vLLM — high-throughput engine with PagedAttention; official examples provided for TeleChat3 Thinking mode
- SGLang — excellent for structured output, function calling, and agent workflows
Both support OpenAI-compatible API endpoints.
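Once a vLLM or SGLang server is running, any OpenAI-compatible client can talk to it. A minimal sketch with the openai Python package — the base URL, port, and served model name below are assumptions and must match whatever you pass to the server launch command:

```python
from openai import OpenAI

# Local OpenAI-compatible server (vLLM or SGLang); no real API key is needed.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="TeleAI/TeleChat3-36B-Thinking",   # must match the served model name
    messages=[{"role": "user", "content": "Summarize the difference between MoE and dense LLMs."}],
    temperature=0.6,
    top_p=0.95,
    max_tokens=2048,
)
print(resp.choices[0].message.content)
```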
Fine-Tuning — Recommended Path: LLaMA-Factory
The project officially recommends LLaMA-Factory for:
- Full-parameter fine-tuning
- LoRA / QLoRA
- Supervised Fine-Tuning (SFT)
- Direct Preference Optimization (DPO)
- One-click merging & inference
Detailed configuration examples and dataset formats are available in the project’s tutorial folder.
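As a starting point, LLaMA-Factory consumes standard Alpaca-style JSON for SFT. The snippet below is a minimal sketch of preparing such a file; the file name and record contents are placeholders, and the exact dataset registration and training configs should follow the project's tutorial folder.

```python
# Minimal sketch: writing SFT data in the Alpaca-style JSON format that
# LLaMA-Factory accepts. File name and record contents are placeholders.
import json

samples = [
    {
        "instruction": "Explain what a Mixture-of-Experts layer is.",
        "input": "",
        "output": "A Mixture-of-Experts layer routes each token to a small subset of expert networks...",
    },
]

with open("telechat3_sft_demo.json", "w", encoding="utf-8") as f:
    json.dump(samples, f, ensure_ascii=False, indent=2)
```

The resulting file is typically registered in LLaMA-Factory's dataset_info.json and then referenced from a LoRA, QLoRA, full fine-tuning, or DPO config.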
Full Domestic Tech Stack Support
TeleChat3 is one of very few open-source LLM families that publishes concrete training metrics on purely domestic hardware:
- Hardware: Ascend Atlas 800T A2 clusters
- Framework: MindSpore + MindSpore Transformers
Reported single-epoch training throughput (2025 data):
| Model | Samples per second | NPUs used |
|---|---|---|
| 105B-A4.7B | 0.1002 | 4096 |
| 36B | 0.0633 | 2048 |
These figures suggest that large-scale training is practical on this stack, even without access to the latest foreign GPU generations.
Frequently Asked Questions
Do I have to use Thinking mode?
No — but nearly all published benchmark scores were achieved with Thinking enabled. Disabling it usually reduces performance on reasoning-heavy tasks.
Which model should I deploy — 105B MoE or 36B dense?
- Lowest per-token inference cost / fastest decoding → 105B-A4.7B-Thinking (per-token compute comparable to a small ~7B dense model, though all 105B weights still need to fit in memory)
- Maximum accuracy, especially on coding & agents → 36B-Thinking
How good is SWE-Bench 51 really?
In late 2025 / early 2026, reaching 50+ on the Verified subset means the model can solve genuine, non-synthetic GitHub issues with reasonable reliability — a practical software-engineering capability level.
Can I do multi-turn conversations with transformers?
Yes. Just feed the conversation history as normal messages; do not include previous thinking blocks.
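Continuing the quick-start example above (reusing tokenizer, model, and final_answer), a second turn looks like this — note that the stored assistant message contains only the final answer, never the <think> block:

```python
messages = [
    {"role": "user", "content": "What is the difference between light soy sauce and dark soy sauce?"},
    {"role": "assistant", "content": final_answer},   # previous final answer only — no <think> content
    {"role": "user", "content": "Which one should I use for braising?"},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
generated_ids = model.generate(
    **inputs, max_new_tokens=2048, do_sample=True,
    temperature=0.6, top_p=0.95, repetition_penalty=1.05
)
next_answer = tokenizer.decode(generated_ids[0], skip_special_tokens=False).split("</think>")[-1].strip()
print(next_answer)
```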
Why TeleChat3 Matters in 2026
The December 2025 release of TeleChat3-105B-A4.7B-Thinking and TeleChat3-36B-Thinking delivers several rare combinations at once:
- full end-to-end training on domestic compute
- competitive real-world engineering scores (especially SWE-Bench & agent tasks)
- explicit, inspectable reasoning via Thinking mode
- broad inference compatibility (transformers, vLLM, SGLang)
- straightforward fine-tuning via LLaMA-Factory
- native Ascend + MindSpore ecosystem integration
For teams building AI products in geopolitically sensitive environments, or anyone who values traceable reasoning chains in production systems, these models represent one of the strongest fully open domestic alternatives available today.
