A Practical Guide to the Sensitive-Lexicon Chinese Sensitive-Word List

After reading this guide you will know:

what a sensitive-word list is and why it matters

how to plug Sensitive-lexicon into any project in under five minutes

how to stay on the right side of the law and avoid false positives

the fifteen most common questions developers ask, answered in plain language

1 Why a Sensitive-Word List Exists

Every day, millions of messages, comments and posts are published online.
Forums, chat rooms, games and apps need a quick way to spot words that break local rules or platform policies.
A sensitive-word list (sometimes called a “blacklist” or “blocklist”) is the simplest building block: a plain-text file that contains words or phrases you do not want to appear.

Sensitive-lexicon is an open-source example of such a list.
It focuses on Chinese and covers politics, adult content, violence and other common categories.
The list is maintained by the community and released under the MIT License, so you can use it in commercial projects free of charge, provided you keep the copyright and license notice.

2 What Exactly Is Sensitive-Lexicon?

Item	Details
Name	Sensitive-lexicon
Language	Chinese
Format	Plain UTF-8 text, one word per line
Size	Tens of thousands of entries
Update cycle	Continuous, driven by GitHub commits
License	MIT License
Source	https://github.com/Konsheng/Sensitive-lexicon

The repository also includes a Vocabulary/ folder for topic-specific sub-lists and a star-history chart so you can see how the project has grown.
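If you want to combine several topic-specific sub-lists, a rough sketch could look like the following. The helper name `load_vocabulary` is hypothetical, and the code assumes each sub-list is a UTF-8 `.txt` file with one entry per line, so check the actual folder contents before relying on it.

```python
from pathlib import Path

def load_vocabulary(folder: str) -> set[str]:
    """Merge every .txt file directly under `folder` into one set of words."""
    words: set[str] = set()
    for path in sorted(Path(folder).glob("*.txt")):
        # utf-8-sig reads plain UTF-8 and also drops a BOM if a file has one
        with path.open(encoding="utf-8-sig") as f:
            for line in f:
                word = line.strip()
                if word:              # skip blank lines
                    words.add(word)
    return words
```

Calling `load_vocabulary("Vocabulary")` from inside the cloned repository would then return one de-duplicated set that you can feed to whichever matcher you choose.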

3 Quick Start: Three Steps to Filter Text

3.1 Download the List

Open a terminal and run:

git clone https://github.com/Konsheng/Sensitive-lexicon.git
cd Sensitive-lexicon

You will now see:

Sensitive-lexicon/
├── sensitive-lexicon.txt   # main list
├── Vocabulary/             # optional category files
└── README.md

sensitive-lexicon.txt is the file you need.

3.2 Choose a Matching Algorithm

A word list on its own does nothing; you need code that scans incoming text and replaces or flags the matches.
Below are three common approaches.

Algorithm	How It Works	When to Use
DFA (Deterministic Finite Automaton)	Turns the list into a state machine, scans text once	High-traffic services needing millisecond response
Trie (Prefix Tree)	Builds a tree where each node is one character, so words with a common prefix share nodes	Memory-constrained environments
Regular Expression	Compiles the list into one big regex, one line of code	Scripts, prototypes, low-volume tools
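For the regular-expression row, a minimal sketch might look like this. The helper names `build_pattern` and `mask` and the three demo entries are placeholders, not anything taken from the repository. A single alternation over tens of thousands of entries can be slow in Python, which fits the table's advice to keep this approach for low-volume use.

```python
import re

def build_pattern(words: list[str]) -> re.Pattern:
    # Longest entries first so the alternation prefers the longest match;
    # re.escape keeps characters such as "." or "*" literal.
    ordered = sorted(set(w for w in words if w), key=len, reverse=True)
    if not ordered:
        return re.compile(r"(?!)")   # a pattern that never matches
    return re.compile("|".join(re.escape(w) for w in ordered))

def mask(text: str, pattern: re.Pattern) -> str:
    # Replace each match with asterisks of the same length
    return pattern.sub(lambda m: "*" * len(m.group()), text)

demo_words = ["bad", "badword", "a.b"]   # placeholder entries for illustration
pattern = build_pattern(demo_words)
print(mask("a badword and a.b but not axb", pattern))
# prints: a ******* and *** but not axb
```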

Example in Python Using a Trie

# Step 1: Read the list
with open('sensitive-lexicon.txt', encoding='utf-8') as f:
    words = [line.strip() for line in f if line.strip()]

# Step 2: Build the trie
END = None  # end-of-word marker; unlike '#', it can never equal a text character
trie = {}
for w in words:
    node = trie
    for ch in w:
        node = node.setdefault(ch, {})
    node[END] = True

# Step 3: Filter function
def filter_text(text: str) -> str:
    """Replace every listed word in text with asterisks (longest match wins)."""
    i, n = 0, len(text)
    out = []
    while i < n:
        node, j, end = trie, i, i
        # Walk the trie as far as the text allows, remembering the last full word
        while j < n and text[j] in node:
            node = node[text[j]]
            j += 1
            if END in node:
                end = j
        if end > i:                       # hit a sensitive word: text[i:end]
            out.append('*' * (end - i))
            i = end
        else:                             # no word starts here; keep the character
            out.append(text[i])
            i += 1
    return ''.join(out)

# quick test
print(filter_text("这是一个测试句子"))

Save the script as filter.py and run:

python filter.py

If the test sentence contains any word from the list, you will see asterisks instead.
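The trie example restarts matching at every character position. The table's DFA row describes a scanner that never moves backwards in the text. One common way to get that behaviour is the Aho-Corasick construction, sketched below. The source does not say which construction it has in mind, so treat this as one possible reading; `build_automaton` and `mask_single_pass` are hypothetical names.

```python
from collections import deque

def build_automaton(words):
    # goto[s]: char -> next state; fail[s]: fallback state when no edge fits;
    # hit[s]: length of the longest listed word that ends at state s (0 = none)
    goto, fail, hit = [{}], [0], [0]
    for w in words:
        s = 0
        for ch in w:
            if ch not in goto[s]:
                goto.append({})
                fail.append(0)
                hit.append(0)
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        hit[s] = max(hit[s], len(w))
    # Breadth-first pass: depth-1 states fall back to the root (already 0)
    queue = deque(goto[0].values())
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            hit[t] = max(hit[t], hit[fail[t]])  # inherit words that end as a suffix
            queue.append(t)
    return goto, fail, hit

def mask_single_pass(text, automaton):
    goto, fail, hit = automaton
    masked = [False] * len(text)
    s = 0
    for i, ch in enumerate(text):      # the text index only ever moves forward
        while s and ch not in goto[s]:
            s = fail[s]
        s = goto[s].get(ch, 0)
        if hit[s]:                     # a listed word ends at position i
            for k in range(i - hit[s] + 1, i + 1):
                masked[k] = True
    return "".join("*" if m else c for c, m in zip(text, masked))
```

Unlike the simpler trie loop, this version also masks overlapping matches, at the cost of more code and a few extra arrays per state.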

3.3 Verify the Output

Type a few test sentences yourself.
Check that the correct word is replaced.
If nothing happens, open sensitive-lexicon.txt and confirm the word is present and spelled exactly the same way (no extra spaces).
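When a word you expect to be filtered is not, a small diagnostic helper can narrow down the cause. The sketch below is illustrative only: `diagnose` is a hypothetical name, and the Unicode-normalization check is an extra assumption about what can go wrong (full-width versus half-width characters), not something the repository documents.

```python
import unicodedata

def diagnose(word: str, raw_lines: list[str]) -> str:
    """Explain why `word` does or does not match an entry in the raw list lines."""
    entries = [line.rstrip("\n") for line in raw_lines]
    if word in entries:
        return "exact match"
    # Remove a stray byte-order mark and surrounding whitespace, then retry
    cleaned = [e.replace("\ufeff", "").strip() for e in entries]
    if word in cleaned:
        return "present, but the entry has stray whitespace or a BOM"
    # NFKC folds compatibility forms, e.g. full-width letters, to a common form
    target = unicodedata.normalize("NFKC", word)
    if target in {unicodedata.normalize("NFKC", c) for c in cleaned}:
        return "present only after Unicode normalization"
    return "not in the list"
```

Pass it the result of `open(path, encoding="utf-8").readlines()` so that the original whitespace is still visible to the check.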

4 Contributing New Words

Sensitive-lexicon is community-driven.
If you spot a missing word, you can add it by following these steps.

Fork the repository on GitHub.
Add or edit entries in the Vocabulary/ folder or in sensitive-lexicon.txt.
- One word or phrase per line
- No leading or trailing spaces
Document the source in your pull request (news article, chat log, etc.).
Submit the pull request.
Wait for a maintainer to review and merge.
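Before opening a pull request, the formatting rules in the steps above can be checked mechanically. The following is a minimal sketch; `check_entries` is a hypothetical helper, and the duplicate check is an extra assumption, since the steps only mention one entry per line and no surrounding spaces.

```python
def check_entries(lines: list[str]) -> list[str]:
    """Return a list of human-readable problems found in a word-list file."""
    problems = []
    seen = set()
    for number, line in enumerate(lines, start=1):
        entry = line.rstrip("\r\n")
        word = entry.strip()
        if not word:
            continue                      # blank lines are ignored here
        if entry != word:
            problems.append(f"line {number}: leading or trailing whitespace")
        if word in seen:
            problems.append(f"line {number}: duplicate of an earlier entry")
        seen.add(word)
    return problems
```

An empty return value means the file passed these basic checks; it says nothing about whether the words themselves belong in the list.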

Tips to keep the project healthy:

Send small pull requests (a few dozen words) rather than thousands at once.
If you are unsure, open an issue first to discuss the word or category.

5 Legal, Ethical and Practical Considerations

5.1 Follow Local Laws and Platform Policies

Sensitive-word definitions differ by country and by service.
Always have your legal or policy team review the list before going live.

5.2 Context Matters

The same word can be acceptable in a medical article but problematic in a chat room.
Build additional logic (whitelists, user reputation, human review) to reduce false positives.
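As one possible shape for the whitelist idea, the sketch below leaves a blocked word alone when it sits inside a whitelisted phrase. All names and the sample entries are hypothetical, and plain substring search is used only to keep the example short; in practice you would plug in your real matcher.

```python
def find_spans(text: str, phrases: list[str]) -> list[tuple[int, int]]:
    """Return (start, end) index pairs for every occurrence of every phrase."""
    spans = []
    for p in phrases:
        if not p:
            continue
        start = text.find(p)
        while start != -1:
            spans.append((start, start + len(p)))
            start = text.find(p, start + 1)
    return spans

def mask_with_whitelist(text: str, blocked: list[str], allowed: list[str]) -> str:
    protected = find_spans(text, allowed)
    chars = list(text)
    for start, end in find_spans(text, blocked):
        # Skip a hit that lies entirely inside a whitelisted phrase
        if any(a <= start and end <= b for a, b in protected):
            continue
        chars[start:end] = "*" * (end - start)
    return "".join(chars)

print(mask_with_whitelist(
    "breast cancer screening vs breast pics",
    blocked=["breast"],
    allowed=["breast cancer"],
))
# prints: breast cancer screening vs ****** pics
```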

5.3 Log Responsibly

Keep an audit trail of blocked content, but strip user identifiers to protect privacy.
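One way to read this advice is to replace the raw user ID with a keyed hash, so repeat events can still be correlated without storing the identifier itself. This is a sketch under that assumption; `audit_record`, its field names and the secret key are all hypothetical, and if your policy requires full removal you would simply drop the `user` field.

```python
import hashlib
import hmac
import json
import time

def audit_record(user_id: str, matched: list[str], secret: bytes) -> str:
    """Build one JSON log line that contains a pseudonym instead of the user ID."""
    # HMAC-SHA256 with a server-side secret; truncated to keep log lines short
    digest = hmac.new(secret, user_id.encode("utf-8"), hashlib.sha256).hexdigest()
    record = {
        "ts": int(time.time()),      # event time, seconds since the epoch
        "user": digest[:16],         # pseudonym, not the raw identifier
        "matched": matched,          # which list entries triggered the block
    }
    return json.dumps(record, ensure_ascii=False)
```

Keep the secret out of the log store; anyone holding both could test candidate user IDs against the pseudonyms.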

6 Frequently Asked Questions (FAQ)

Question	Short Answer
1. How complete is the list?	Tens of thousands of entries; good coverage but never 100 %. You will still need domain-specific additions.
2. How often is it updated?	Whenever the community submits new words; expect several commits per month.
3. Is it free for commercial use?	Yes, under the MIT License.
4. Can I use it in production right away?	Yes, but run your own QA and legal checks first.
5. How do I distinguish adult terms from medical terms?	The list itself does not tag context; your application logic must decide.
6. Does it collect user data?	No. The repository only contains the word list.
7. Are other languages included?	Currently Chinese only.
8. What encoding should I use?	UTF-8 without BOM.
9. Is there a REST API?	No. You load the file directly or wrap it in your own service.
10. How do I benchmark performance?	Use tools like `wrk` or Apache Bench and measure queries per second and memory usage.
11. What if legitimate words are blocked?	Maintain a whitelist or add a second human-review step.
12. Is there a GUI?	No official graphical interface; you can build one on top of the text file.
13. How do I roll back a bad update?	Use Git: `git checkout <previous-commit>`
14. Is there a Docker image?	No official image; you can write a simple Dockerfile that copies the list into your container.
15. How do I contact the maintainers?	Open an issue or discussion on GitHub.
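For question 10, `wrk` and Apache Bench measure an HTTP endpoint. If you only want a rough in-process number for a filter function, a sketch like the following may be enough; `measure_throughput` is a hypothetical helper and the result depends heavily on your hardware and sample texts.

```python
import time

def measure_throughput(filter_fn, samples: list[str], repeats: int = 100) -> float:
    """Return the number of messages per second that filter_fn processed."""
    start = time.perf_counter()
    for _ in range(repeats):
        for s in samples:
            filter_fn(s)
    elapsed = time.perf_counter() - start
    total = repeats * len(samples)
    return total / elapsed if elapsed > 0 else float("inf")
```

For example, `measure_throughput(filter_text, my_samples)` would time the trie filter from section 3.2 against your own list of sample messages (`my_samples` is a placeholder).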

7 Turning the List into a Micro-service

If you do not want to reload the list every time your program starts, wrap it in a small service:

Load the list into Redis (key “sensitive_words”).
Write a tiny HTTP API in Flask, FastAPI, or Spring Boot.

Expose one endpoint:

POST /filter
Body: {"text": "input sentence"}
Response: {"filtered": "**** sentence"}

Update the Redis set whenever the list changes; no restart required.
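The steps above can be sketched with nothing but Python's standard library. To keep the example self-contained, it swaps Flask/FastAPI for `http.server` and swaps the Redis set for an in-memory `WORDS` set; both substitutions are assumptions made for illustration, and the naive `str.replace` filter is a stand-in for a real matcher.

```python
import json
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

WORDS = {"bad"}  # placeholder entry; the steps above keep this set in Redis

def mask(text: str) -> str:
    # Naive replacement, longest entries first; swap in a trie or regex filter
    for w in sorted(WORDS, key=len, reverse=True):
        text = text.replace(w, "*" * len(w))
    return text

class FilterHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/filter":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        try:
            text = json.loads(self.rfile.read(length))["text"]
        except (ValueError, KeyError, TypeError):
            self.send_error(400, "expected a JSON body with a 'text' field")
            return
        body = json.dumps({"filtered": mask(str(text))},
                          ensure_ascii=False).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # silence the default per-request logging

def make_server(port: int = 8000) -> ThreadingHTTPServer:
    return ThreadingHTTPServer(("127.0.0.1", port), FilterHandler)
```

Running `make_server().serve_forever()` starts the service on port 8000. Because `mask` reads `WORDS` on every call, adding an entry at runtime takes effect without a restart, which mirrors the Redis-based update described above.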

8 Visual Summary

graph TD
    A[Sensitive-Word List] --> B[Download]
    A --> C[Use in Code]
    A --> D[Contribute]
    B --> B1[git clone]
    C --> C1[DFA / Trie / Regex]
    C --> C2[HTTP micro-service]
    D --> D1[Pull Request]
    D --> D2[Issue Discussion]

9 Closing Thoughts

Filtering sensitive words is not a one-time task.
Language, culture and laws evolve, and so must your approach.
Sensitive-lexicon gives you a solid starting point in Chinese; the rest is up to your judgment, your users’ needs and your local regulations.

Clone the repository today, run a quick test, and decide if it fits your project.
If it does, keep an eye on the commit history—new words arrive regularly.
And if you find a gap, the community will appreciate your contribution.
