The Ultimate Guide to Chinese Spelling & Grammar Correction: Champion Models in Action
Do you struggle with confusing “的,” “得,” and “地” in Chinese writing? Or worry about typos in important documents? This guide walks through ChineseErrorCorrector, an award-winning open-source toolkit that has topped NLP competition leaderboards for three consecutive years, complete with practical implementation tutorials.
1. Core Technology Breakdown
1.1 Evolution of Champion Models
This project has won three consecutive championships in authoritative competitions:
- 🏆 2024 CCL Champion (Research Paper)
- 🏆 2023 NLPCC-NaCGEC Champion
- 🏆 2022 FCGEC Champion
1.2 Model Capability Matrix
| Model Name | Correction Type | Best For | Key Features |
|---|---|---|---|
| ChineseErrorCorrector3-4B | Grammar + Spelling | Professional editing | 18-point lead over competitors |
| ChineseErrorCorrector2-7B | Grammar + Spelling | Business documents | Trained on 2M samples |
| ChineseErrorCorrector-7B | Spelling | Basic proofreading | Handles visual/sound-alike errors |
| ChineseErrorCorrector-1.5B | Spelling | Mobile deployment | Lightweight solution |

2. Performance Benchmarks
2.1 Spelling Correction (F1 Scores)
| Model | General Text | Legal Docs | Medical Texts | Official Documents |
|---|---|---|---|---|
| 1.5B | 0.346 | 0.517 | 0.433 | 0.540 |
| 7B | 0.592 | 0.787 | 0.677 | 0.793 |
| 32B | 0.594 | 0.776 | 0.794 | 0.864 |
2.2 Grammar Correction Dominance (NaCGEC Dataset)
```text
# Champion model evaluation:
Precision = 0.743
Recall    = 0.7294
F0.5      = 0.7402  # 10 points above Huawei's solution
```
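For reference, F0.5 weights precision twice as heavily as recall, which suits correction tasks where a wrong "fix" is worse than a missed error. The short sketch below is a plain-Python sanity check (not project code) that reproduces the reported F0.5 from the precision and recall above.

```python
# Generic F-beta score; beta=0.5 emphasizes precision over recall.
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

print(round(f_beta(0.743, 0.7294), 4))  # 0.7402, matching the reported F0.5
```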
3. 4-Step Implementation Guide
3.1 Environment Setup
```bash
conda create -n zh_correct python=3.10
conda activate zh_correct
pip install transformers vllm==0.8.5
```
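A quick sanity check that the environment is usable before downloading any model weights; this is a generic snippet, not part of the project:

```python
# Verify that the core libraries import and report their versions.
import torch
import transformers
import vllm

print("transformers:", transformers.__version__)
print("vllm:", vllm.__version__)
print("CUDA available:", torch.cuda.is_available())  # vLLM expects a GPU
```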
3.2 Single-Sentence Correction (Transformers)
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "twnlp/ChineseErrorCorrector3-4B",
    torch_dtype="auto",  # native precision; add device_map="auto" (needs `accelerate`) for GPU placement
)
tokenizer = AutoTokenizer.from_pretrained("twnlp/ChineseErrorCorrector3-4B")

# Professional correction prompt
prompt = "As a text correction expert, fix grammatical errors in: "
text_input = "对待每一项工作都要一丝不够。"  # Contains an error: "不够" should be "不苟"
messages = [{"role": "user", "content": prompt + text_input}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Execute correction
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)

# Decode only the newly generated tokens, skipping the prompt
response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)
# Output: 对待每一项工作都要一丝不苟。 (corrected)
```
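For repeated use, the steps above can be wrapped in a small helper. This is a convenience sketch of our own (the name `correct_text` is not part of the project API) that reuses the `model` and `tokenizer` already loaded above:

```python
def correct_text(text: str, instruction: str = "As a text correction expert, fix grammatical errors in: ") -> str:
    """Run one sentence through the loaded model/tokenizer and return the corrected text."""
    messages = [{"role": "user", "content": instruction + text}]
    chat = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(chat, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=512)
    return tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

print(correct_text("对待每一项工作都要一丝不够。"))
```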
3.3 Batch Processing (vLLM, Recommended for Production)
```python
from vllm import LLM, SamplingParams

llm = LLM(model="twnlp/ChineseErrorCorrector3-4B")
sampling_params = SamplingParams(temperature=0, max_tokens=512)  # deterministic output for correction

# Same instruction as in the single-sentence example
prompt = "As a text correction expert, fix grammatical errors in: "

# Batch input example (already-correct sentences should come back unchanged)
batch_texts = [
    "这个苹果非常好吃。",           # correct
    "他昨天去图书馆读书。",         # correct
    "会议将在明天下午三点钟举行。"  # correct
]

results = llm.generate([prompt + text for text in batch_texts], sampling_params)
for output in results:
    print(f"Corrected: {output.outputs[0].text}")
3.4 Engineering Deployment
1. Clone the repository:
   ```bash
   git clone https://github.com/TW-NLP/ChineseErrorCorrector
   cd ChineseErrorCorrector
   ```
2. Configure `config.py`:
   ```python
   DEFAULT_CKPT_PATH = "twnlp/ChineseErrorCorrector3-4B"  # Model selection
   USE_VLLM = True  # Enable high-performance inference
   ```
3. Run batch processing:
   ```bash
   python main.py
   ```
4. Training Datasets Explained
4.1 Dataset Composition
| Dataset | Size | Content Type |
|---|---|---|
| ChineseErrorCorrectData | 2M | Comprehensive corpus |
| CSC (Spelling) | 380K | Medical/Legal/Official |
| CGC (Grammar) | 68K | Academic/Business |
| Lang8+HSK | 1.56M | Daily conversations |
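If you want to inspect the corpus locally, the sketch below assumes the merged dataset is published on the Hugging Face Hub under an ID like `twnlp/ChineseErrorCorrectData`; verify the exact name and configuration in the project README before running.

```python
from datasets import load_dataset  # pip install datasets

# Assumed Hub ID; check the project README for the actual dataset name.
ds = load_dataset("twnlp/ChineseErrorCorrectData", split="train")
print(ds)      # dataset size and column names
print(ds[0])   # first record
```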
5. Frequently Asked Questions
5.1 Which model should I choose?
- Professional editing: ChineseErrorCorrector3-4B (see the selection helper sketched after this list)
- Real-time systems: ChineseErrorCorrector-1.5B
- Domain-specific texts: use the custom training tools
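The recommendations above can be captured in a simple lookup table. This helper is our own illustration, not part of the project, and the Hub IDs for the non-flagship checkpoints are inferred from the naming pattern; verify them on the Hub.

```python
# Map common scenarios to the recommended checkpoint (from the capability matrix in Section 1.2).
RECOMMENDED_MODELS = {
    "professional_editing": "twnlp/ChineseErrorCorrector3-4B",
    "business_documents": "twnlp/ChineseErrorCorrector2-7B",
    "basic_proofreading": "twnlp/ChineseErrorCorrector-7B",
    "mobile_or_realtime": "twnlp/ChineseErrorCorrector-1.5B",
}

def pick_model(scenario: str) -> str:
    """Return the recommended Hub ID for a scenario, defaulting to the 4B flagship."""
    return RECOMMENDED_MODELS.get(scenario, "twnlp/ChineseErrorCorrector3-4B")

print(pick_model("mobile_or_realtime"))  # twnlp/ChineseErrorCorrector-1.5B
```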
5.2 What error types are covered?
- Spelling errors: character confusion (e.g., “帐号” → “账号”)
- Grammar errors: sentence-structure issues (e.g., the mixed “原因是…造成” pattern)
- Collocation errors: measure-word misuse (e.g., “一个书”)
- Redundancy: duplicated wording (e.g., “大约半小时左右”) (a quick smoke test covering all four types is sketched below)
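Assuming the `correct_text` helper sketched in Section 3.2 is in scope, a quick smoke test over these four categories might look like this; the sentences are our own illustrations built around the examples above, not the project's test set.

```python
# One illustrative input per error category; outputs are printed for inspection.
test_sentences = {
    "spelling": "请把钱转到我的帐号上。",                 # 帐号 -> 账号
    "grammar": "这起事故的原因是司机疲劳驾驶造成的。",     # mixed 原因是…造成的 pattern
    "collocation": "我昨天买了一个书。",                  # 一个书 -> 一本书
    "redundancy": "会议大约持续半小时左右。",             # 大约…左右 is redundant
}

for category, sentence in test_sentences.items():
    print(f"[{category}] {sentence} -> {correct_text(sentence)}")
```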
5.3 How to improve domain-specific results?
```python
# Legal-document enhancement example: swap in a domain-specific instruction
prompt = "As a legal document proofreader, correct errors in this clause: "
text_input = "甲方应于签约后三十日个工作日内付款。"  # Contains an error: "三十日个工作日" conflates "三十日" and "三十个工作日"
```
6. Technical Deep Dive
6.1 Multi-Stage Training
- Data Augmentation: 14 grammatical error patterns (an illustrative sketch follows this list)
- Iterative Training: multi-round fine-tuning on 2M samples
- Domain Adaptation: optimization for legal/medical/official texts
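To make pattern-based augmentation concrete, here is a toy illustration with two corruption rules of our own; the project's actual 14 error patterns live in its training code and are not reproduced here.

```python
import random

# Toy confusion set of sound-/shape-alike characters (illustrative only).
CONFUSION_PAIRS = {"账": "帐", "苟": "够"}

def inject_spelling_error(sentence: str) -> str:
    """Swap one character for a confusable counterpart, if present."""
    for right, wrong in CONFUSION_PAIRS.items():
        if right in sentence:
            return sentence.replace(right, wrong, 1)
    return sentence

def inject_redundancy(sentence: str) -> str:
    """Duplicate a short two-character span to simulate redundancy errors."""
    if len(sentence) < 4:
        return sentence
    i = random.randrange(0, len(sentence) - 2)
    return sentence[:i + 2] + sentence[i:i + 2] + sentence[i + 2:]

clean = "对待每一项工作都要一丝不苟。"
print(inject_spelling_error(clean))  # 对待每一项工作都要一丝不够。
print(inject_redundancy(clean))      # random, e.g. 对待每一每一项工作都要一丝不苟。
```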
6.2 Champion Model Architecture
```mermaid
graph LR
    A[Input Text] --> B(Error Detection)
    B --> C{Error Type}
    C -->|Spelling| D[Character Similarity Analysis]
    C -->|Grammar| E[Dependency Parsing]
    D & E --> F[Correction Generation]
    F --> G[Corrected Output]
```
7. Academic Reference
```bibtex
@inproceedings{wei2024automated,
  title     = {Automated Detection, Correction and Fluency Assessment of Grammatical Errors in Chinese Composition},
  author    = {Wei, Tian},
  booktitle = {Proceedings of the 23rd Chinese National Conference on Computational Linguistics},
  pages     = {278--284},
  year      = {2024}
}
```