The Ultimate Guide to Chinese Spelling & Grammar Correction: Champion Models in Action

Do you mix up “的”, “得”, and “地” in Chinese writing, or worry about typos in important documents? This guide covers the ChineseErrorCorrector models, which won Chinese grammatical error correction competitions three years in a row (2022 to 2024), and walks through practical implementation step by step.

1. Core Technology Breakdown

1.1 Evolution of Champion Models

This project has won three consecutive championships in authoritative competitions:

  • 🏆 2024 CCL Champion (Research Paper)
  • 🏆 2023 NLPCC-NaCGEC Champion
  • 🏆 2022 FCGEC Champion

1.2 Model Capability Matrix

| Model Name | Correction Type | Best For | Key Features |
|---|---|---|---|
| ChineseErrorCorrector3-4B | Grammar+Spelling | Professional editing | 18-point lead over competitors |
| ChineseErrorCorrector2-7B | Grammar+Spelling | Business documents | Trained on 2M samples |
| ChineseErrorCorrector-7B | Spelling | Basic proofreading | Handles visual/sound-alike errors |
| ChineseErrorCorrector-1.5B | Spelling | Mobile deployment | Lightweight solution |

[Figure: Chinese Correction System Architecture]

2. Performance Benchmarks

2.1 Spelling Correction (F1 Scores)

| Model | General Text | Legal Docs | Medical Texts | Official Documents |
|---|---|---|---|---|
| 1.5B | 0.346 | 0.517 | 0.433 | 0.540 |
| 7B | 0.592 | 0.787 | 0.677 | 0.793 |
| 32B | 0.594 | 0.776 | 0.794 | 0.864 |

2.2 Grammar Correction Dominance (NaCGEC Dataset)

# Champion model evaluation:
Precision = 0.743
Recall = 0.7294
F0.5 = 0.7402  # 10 points above Huawei's solution
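
The F0.5 figure above follows from the reported precision and recall through the standard F-beta formula, which with beta = 0.5 weights precision more heavily than recall. The short sketch below reproduces it; the function name `f_beta` is chosen here for illustration and is not taken from the project.

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """Standard F-beta score; beta < 1 favors precision over recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid division by zero when nothing was predicted correctly
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Reported NaCGEC numbers from the block above
score = f_beta(0.743, 0.7294, beta=0.5)
print(round(score, 4))  # 0.7402
```

Grammar correction benchmarks commonly report F0.5 rather than F1 because a wrong "correction" of an already valid sentence is usually considered worse than a missed error.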

3. 4-Step Implementation Guide

3.1 Environment Setup

conda create -n zh_correct python=3.10
conda activate zh_correct
pip install transformers vllm==0.8.5
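
After installation, it can help to confirm which package versions actually ended up in the environment. The following is a minimal sketch using only the standard library; the helper name `installed_version` is an illustrative choice, not part of either package.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Report the two packages installed in the setup step
for name in ("transformers", "vllm"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```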

3.2 Single-Sentence Correction (Transformers)

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "twnlp/ChineseErrorCorrector3-4B"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Professional correction prompt
prompt = "As a text correction expert, fix grammatical errors in: "
text_input = "对待每一项工作都要一丝不够。"  # Contains error: "一丝不够" should be "一丝不苟"

# Wrap the request in the model's chat template
messages = [{"role": "user", "content": prompt + text_input}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Execute correction
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)

# Decode only the newly generated tokens, not the echoed prompt
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
# Expected correction: 对待每一项工作都要一丝不苟。
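
The model returns a full corrected sentence rather than a list of changes. To show users what was changed, one option is to diff the input against the output. The sketch below uses the standard library's `difflib`; the helper `extract_edits` is hypothetical and not part of the project, and character-level diffing is only one possible way to surface edits.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def extract_edits(source: str, corrected: str) -> List[Tuple[str, int, str, str]]:
    """Return (operation, source_index, original_text, replacement_text) tuples."""
    matcher = SequenceMatcher(None, source, corrected, autojunk=False)
    edits = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":  # keep only replace / insert / delete spans
            edits.append((op, i1, source[i1:i2], corrected[j1:j2]))
    return edits


print(extract_edits("对待每一项工作都要一丝不够。", "对待每一项工作都要一丝不苟。"))
# [('replace', 12, '够', '苟')]
```

An empty result means the model left the sentence unchanged, which is a convenient signal for skipping sentences in a review UI.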

3.3 Batch Processing (vLLM – Production Recommended)

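from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_name = "twnlp/ChineseErrorCorrector3-4B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
llm = LLM(model=model_name)
sampling_params = SamplingParams(max_tokens=512)

prompt = "As a text correction expert, fix grammatical errors in: "

# Batch input example
batch_texts = [
    "这个苹果非常好吃。",  # Correct
    "他昨天去图书馆读书。",  # Correct
    "会议将在明天下午三点钟举行。"  # Correct
]

# Apply the same correction prompt and chat template as in 3.2;
# raw sentences alone would be continued rather than corrected
batch_prompts = [
    tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt + t}],
        tokenize=False, add_generation_prompt=True)
    for t in batch_texts
]

results = llm.generate(batch_prompts, sampling_params)
for output in results:
    print(f"Corrected: {output.outputs[0].text}")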
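
Batch inference works on a list of sentences, so longer documents need to be split first. The sketch below is one simple way to do that, assuming sentences end in “。”, “！”, or “？”; the helper `split_sentences` is hypothetical, and the project may segment text differently.

```python
import re
from typing import List

# Either text followed by one or more sentence-ending marks,
# or a trailing run of text with no ending mark at all.
_SENTENCE = re.compile(r"[^。！？]*[。！？]+|[^。！？]+")


def split_sentences(text: str) -> List[str]:
    """Split Chinese text into sentences, keeping the ending punctuation."""
    return [s.strip() for s in _SENTENCE.findall(text) if s.strip()]


document = "这个苹果非常好吃。他昨天去图书馆读书。会议将在明天下午三点钟举行"
batch_texts = split_sentences(document)
print(batch_texts)
```

The resulting list can be passed through the same prompt and batching steps shown above.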

3.4 Engineering Deployment

  1. Clone repository:

    git clone https://github.com/TW-NLP/ChineseErrorCorrector
    cd ChineseErrorCorrector
    
  2. Configure settings:

    # config.py key settings:
    DEFAULT_CKPT_PATH = "twnlp/ChineseErrorCorrector3-4B"  # Model selection
    USE_VLLM = True  # Enable high-performance inference
    
  3. Run batch processing:

    python main.py
    

4. Training Datasets Explained

4.1 Dataset Composition

| Dataset | Size | Content Type |
|---|---|---|
| ChinseseErrorCorrectData | 2M | Comprehensive corpus |
| CSC (Spelling) | 380K | Medical/Legal/Official |
| CGC (Grammar) | 68K | Academic/Business |
| Lang8+HSK | 1.56M | Daily conversations |

5. Frequently Asked Questions

5.1 Which model should I choose?

  • Professional Editing: ChineseErrorCorrector3-4B
  • Real-time Systems: ChineseErrorCorrector-1.5B
  • Domain-Specific Texts: Use custom training tools

5.2 What error types are covered?

  1. Spelling Errors: Character confusion (e.g., “帐号” → “账号”)
  2. Grammar Errors: Sentence structure issues (e.g., “原因是…造成”)
  3. Collocation Errors: Measure word misuse (e.g., “一个书”)
  4. Redundancy: Duplicate words (e.g., “大约半小时左右”)

5.3 How to improve domain-specific results?

# Legal document enhancement example: name the domain in the prompt
prompt = "As a legal document proofreader, correct errors in this clause: "
text_input = "甲方应于签约后三十日个工作日内付款。"  # Contains error ("三十日个工作日")
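
If several domains are in play, the prompt choice can be centralized in one place. The following is a minimal sketch that reuses the two prompts shown in this guide; the `DOMAIN_PROMPTS` table and `build_messages` helper are hypothetical names, and how strongly a domain-specific prompt affects the model's output is not measured here.

```python
from typing import Dict, List

# Prompts taken from sections 3.2 and 5.3; extend with other domains as needed
DOMAIN_PROMPTS: Dict[str, str] = {
    "general": "As a text correction expert, fix grammatical errors in: ",
    "legal": "As a legal document proofreader, correct errors in this clause: ",
}


def build_messages(text: str, domain: str = "general") -> List[Dict[str, str]]:
    """Build a single-turn chat message, falling back to the general prompt."""
    prompt = DOMAIN_PROMPTS.get(domain, DOMAIN_PROMPTS["general"])
    return [{"role": "user", "content": prompt + text}]


messages = build_messages("甲方应于签约后三十日个工作日内付款。", domain="legal")
print(messages[0]["content"])
```

The returned list can be passed to `tokenizer.apply_chat_template` exactly as in section 3.2.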

6. Technical Deep Dive

6.1 Multi-Stage Training

  1. Data Augmentation: 14 grammatical error patterns
  2. Iterative Training: Multi-round fine-tuning with 2M samples
  3. Domain Adaptation: Legal/medical/official text optimization
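
The first stage, data augmentation, generally means injecting synthetic errors into clean sentences to create (noisy, clean) training pairs. The project's 14 error patterns are not described here, so the sketch below shows only one illustrative pattern, swapping the particles “的”, “得”, and “地”, as an assumption about what such a rule could look like. The names `CONFUSION` and `inject_error` are hypothetical.

```python
import random
from typing import Tuple

# Illustrative confusion set only; not one of the project's documented patterns
CONFUSION = {"的": ["得", "地"], "得": ["的", "地"], "地": ["的", "得"]}


def inject_error(sentence: str, rng: random.Random) -> Tuple[str, str]:
    """Return (noisy, clean), with one confusable particle swapped if present."""
    positions = [i for i, ch in enumerate(sentence) if ch in CONFUSION]
    if not positions:
        return sentence, sentence  # nothing to corrupt; pair stays identical
    i = rng.choice(positions)
    wrong = rng.choice(CONFUSION[sentence[i]])
    # Note: this also hits non-particle uses such as "地方"; a real rule
    # would need word segmentation or part-of-speech filtering.
    return sentence[:i] + wrong + sentence[i + 1:], sentence


rng = random.Random(0)  # fixed seed so the generated pairs are reproducible
noisy, clean = inject_error("他跑得很快。", rng)
print(noisy, "->", clean)
```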

6.2 Champion Model Architecture

graph LR
A[Input Text] --> B(Error Detection)
B --> C{Error Type}
C -->|Spelling| D[Character Similarity Analysis]
C -->|Grammar| E[Dependency Parsing]
D & E --> F[Correction Generation]
F --> G[Corrected Output]

7. Academic Reference

@inproceedings{wei2024automated,
  title={Automated Detection, Correction and Fluency Assessment of Grammatical Errors in Chinese Composition},
  author={Wei, Tian},
  booktitle={Proceedings of the 23rd Chinese National Conference on Computational Linguistics},
  pages={278--284},
  year={2024}
}
