The Ultimate Guide to Chinese Spelling & Grammar Correction: Champion Models in Action
Do you struggle with confusing “的,” “得,” and “地” in Chinese writing? Or worry about typos in important documents? This guide walks through ChineseErrorCorrector, an award-winning open-source toolkit that has topped NLP competition leaderboards for three consecutive years, complete with practical implementation tutorials.
1. Core Technology Breakdown
1.1 Evolution of Champion Models
This project has won three consecutive championships in authoritative competitions:
- 🏆 2024 CCL Champion (Research Paper)
- 🏆 2023 NLPCC-NaCGEC Champion
- 🏆 2022 FCGEC Champion
1.2 Model Capability Matrix
| Model Name | Correction Type | Best For | Key Features |
|---|---|---|---|
| ChineseErrorCorrector3-4B | Grammar + Spelling | Professional editing | 18-point lead over competitors |
| ChineseErrorCorrector2-7B | Grammar + Spelling | Business documents | Trained on 2M samples |
| ChineseErrorCorrector-7B | Spelling | Basic proofreading | Handles visual/sound-alike errors |
| ChineseErrorCorrector-1.5B | Spelling | Mobile deployment | Lightweight solution |

2. Performance Benchmarks
2.1 Spelling Correction (F1 Scores)
| Model | General Text | Legal Docs | Medical Texts | Official Documents |
|---|---|---|---|---|
| 1.5B | 0.346 | 0.517 | 0.433 | 0.540 |
| 7B | 0.592 | 0.787 | 0.677 | 0.793 |
| 32B | 0.594 | 0.776 | 0.794 | 0.864 |
2.2 Grammar Correction Dominance (NaCGEC Dataset)
```text
# Champion model evaluation:
Precision = 0.743
Recall    = 0.7294
F0.5      = 0.7402  # 10 points above Huawei's solution
```
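For reference, F0.5 weights precision twice as heavily as recall, which suits correction tasks where a wrong "fix" is worse than a missed error. The short sketch below is a plain-Python sanity check (not project code) that reproduces the reported F0.5 from the precision and recall above.

```python
# Generic F-beta score; beta=0.5 emphasizes precision over recall.
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

print(round(f_beta(0.743, 0.7294), 4))  # 0.7402, matching the reported F0.5
```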
3. 4-Step Implementation Guide
3.1 Environment Setup
```bash
conda create -n zh_correct python=3.10
conda activate zh_correct
pip install transformers vllm==0.8.5
```
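A quick sanity check that the environment is usable before downloading any model weights; this is a generic snippet, not part of the project:

```python
# Verify that the core libraries import and report their versions.
import torch
import transformers
import vllm

print("transformers:", transformers.__version__)
print("vllm:", vllm.__version__)
print("CUDA available:", torch.cuda.is_available())  # vLLM expects a GPU
```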
3.2 Single-Sentence Correction (Transformers)
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "twnlp/ChineseErrorCorrector3-4B",
    torch_dtype="auto",  # native precision; add device_map="auto" (needs `accelerate`) for GPU placement
)
tokenizer = AutoTokenizer.from_pretrained("twnlp/ChineseErrorCorrector3-4B")

# Professional correction prompt
prompt = "As a text correction expert, fix grammatical errors in: "
text_input = "对待每一项工作都要一丝不够。"  # Contains an error: "不够" should be "不苟"
messages = [{"role": "user", "content": prompt + text_input}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Execute correction
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)

# Decode only the newly generated tokens, skipping the prompt
response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)
# Output: 对待每一项工作都要一丝不苟。 (corrected)
```
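For repeated use, the steps above can be wrapped in a small helper. This is a convenience sketch of our own (the name `correct_text` is not part of the project API) that reuses the `model` and `tokenizer` already loaded above:

```python
def correct_text(text: str, instruction: str = "As a text correction expert, fix grammatical errors in: ") -> str:
    """Run one sentence through the loaded model/tokenizer and return the corrected text."""
    messages = [{"role": "user", "content": instruction + text}]
    chat = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(chat, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=512)
    return tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

print(correct_text("对待每一项工作都要一丝不够。"))
```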
3.3 Batch Processing (vLLM, Recommended for Production)
```python
from vllm import LLM, SamplingParams

llm = LLM(model="twnlp/ChineseErrorCorrector3-4B")
sampling_params = SamplingParams(temperature=0, max_tokens=512)  # deterministic output for correction

# Same instruction as in the single-sentence example
prompt = "As a text correction expert, fix grammatical errors in: "

# Batch input example (already-correct sentences should come back unchanged)
batch_texts = [
    "这个苹果非常好吃。",           # correct
    "他昨天去图书馆读书。",         # correct
    "会议将在明天下午三点钟举行。"  # correct
]

results = llm.generate([prompt + text for text in batch_texts], sampling_params)
for output in results:
    print(f"Corrected: {output.outputs[0].text}")
3.4 Engineering Deployment
1. Clone the repository:
   ```bash
   git clone https://github.com/TW-NLP/ChineseErrorCorrector
   cd ChineseErrorCorrector
   ```
2. Configure `config.py`:
   ```python
   DEFAULT_CKPT_PATH = "twnlp/ChineseErrorCorrector3-4B"  # Model selection
   USE_VLLM = True  # Enable high-performance inference
   ```
3. Run batch processing:
   ```bash
   python main.py
   ```
4. Training Datasets Explained
4.1 Dataset Composition
| Dataset | Size | Content Type |
|---|---|---|
| ChineseErrorCorrectData | 2M | Comprehensive corpus |
| CSC (Spelling) | 380K | Medical/Legal/Official |
| CGC (Grammar) | 68K | Academic/Business |
| Lang8+HSK | 1.56M | Daily conversations |
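If you want to inspect the corpus locally, the sketch below assumes the merged dataset is published on the Hugging Face Hub under an ID like `twnlp/ChineseErrorCorrectData`; verify the exact name and configuration in the project README before running.

```python
from datasets import load_dataset  # pip install datasets

# Assumed Hub ID; check the project README for the actual dataset name.
ds = load_dataset("twnlp/ChineseErrorCorrectData", split="train")
print(ds)      # dataset size and column names
print(ds[0])   # first record
```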
5. Frequently Asked Questions
5.1 Which model should I choose?
- Professional editing: ChineseErrorCorrector3-4B (see the selection helper sketched after this list)
- Real-time systems: ChineseErrorCorrector-1.5B
- Domain-specific texts: use the custom training tools
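The recommendations above can be captured in a simple lookup table. This helper is our own illustration, not part of the project, and the Hub IDs for the non-flagship checkpoints are inferred from the naming pattern; verify them on the Hub.

```python
# Map common scenarios to the recommended checkpoint (from the capability matrix in Section 1.2).
RECOMMENDED_MODELS = {
    "professional_editing": "twnlp/ChineseErrorCorrector3-4B",
    "business_documents": "twnlp/ChineseErrorCorrector2-7B",
    "basic_proofreading": "twnlp/ChineseErrorCorrector-7B",
    "mobile_or_realtime": "twnlp/ChineseErrorCorrector-1.5B",
}

def pick_model(scenario: str) -> str:
    """Return the recommended Hub ID for a scenario, defaulting to the 4B flagship."""
    return RECOMMENDED_MODELS.get(scenario, "twnlp/ChineseErrorCorrector3-4B")

print(pick_model("mobile_or_realtime"))  # twnlp/ChineseErrorCorrector-1.5B
```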
5.2 What error types are covered?
- Spelling errors: character confusion (e.g., “帐号” → “账号”)
- Grammar errors: sentence-structure issues (e.g., the mixed “原因是…造成” pattern)
- Collocation errors: measure-word misuse (e.g., “一个书”)
- Redundancy: duplicated wording (e.g., “大约半小时左右”) (a quick smoke test covering all four types is sketched below)
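Assuming the `correct_text` helper sketched in Section 3.2 is in scope, a quick smoke test over these four categories might look like this; the sentences are our own illustrations built around the examples above, not the project's test set.

```python
# One illustrative input per error category; outputs are printed for inspection.
test_sentences = {
    "spelling": "请把钱转到我的帐号上。",                 # 帐号 -> 账号
    "grammar": "这起事故的原因是司机疲劳驾驶造成的。",     # mixed 原因是…造成的 pattern
    "collocation": "我昨天买了一个书。",                  # 一个书 -> 一本书
    "redundancy": "会议大约持续半小时左右。",             # 大约…左右 is redundant
}

for category, sentence in test_sentences.items():
    print(f"[{category}] {sentence} -> {correct_text(sentence)}")
```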
5.3 How to improve domain-specific results?
```python
# Legal-document enhancement example: swap in a domain-specific instruction
prompt = "As a legal document proofreader, correct errors in this clause: "
text_input = "甲方应于签约后三十日个工作日内付款。"  # Contains an error: "三十日个工作日" conflates "三十日" and "三十个工作日"
```
6. Technical Deep Dive
6.1 Multi-Stage Training
- Data Augmentation: 14 grammatical error patterns (an illustrative sketch follows this list)
- Iterative Training: multi-round fine-tuning on 2M samples
- Domain Adaptation: optimization for legal/medical/official texts
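To make pattern-based augmentation concrete, here is a toy illustration with two corruption rules of our own; the project's actual 14 error patterns live in its training code and are not reproduced here.

```python
import random

# Toy confusion set of sound-/shape-alike characters (illustrative only).
CONFUSION_PAIRS = {"账": "帐", "苟": "够"}

def inject_spelling_error(sentence: str) -> str:
    """Swap one character for a confusable counterpart, if present."""
    for right, wrong in CONFUSION_PAIRS.items():
        if right in sentence:
            return sentence.replace(right, wrong, 1)
    return sentence

def inject_redundancy(sentence: str) -> str:
    """Duplicate a short two-character span to simulate redundancy errors."""
    if len(sentence) < 4:
        return sentence
    i = random.randrange(0, len(sentence) - 2)
    return sentence[:i + 2] + sentence[i:i + 2] + sentence[i + 2:]

clean = "对待每一项工作都要一丝不苟。"
print(inject_spelling_error(clean))  # 对待每一项工作都要一丝不够。
print(inject_redundancy(clean))      # random, e.g. 对待每一每一项工作都要一丝不苟。
```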
6.2 Champion Model Architecture
```mermaid
graph LR
    A[Input Text] --> B(Error Detection)
    B --> C{Error Type}
    C -->|Spelling| D[Character Similarity Analysis]
    C -->|Grammar| E[Dependency Parsing]
    D & E --> F[Correction Generation]
    F --> G[Corrected Output]
```
7. Academic Reference
```bibtex
@inproceedings{wei2024automated,
  title     = {Automated Detection, Correction and Fluency Assessment of Grammatical Errors in Chinese Composition},
  author    = {Wei, Tian},
  booktitle = {Proceedings of the 23rd Chinese National Conference on Computational Linguistics},
  pages     = {278--284},
  year      = {2024}
}
```