A Practical Guide to the Sensitive-Lexicon Chinese Sensitive-Word List
After reading this guide you will know:
- what a sensitive-word list is and why it matters
- how to plug Sensitive-lexicon into any project in under five minutes
- how to stay on the right side of the law and avoid false positives
- the fifteen most common questions developers ask, answered in plain language
1 Why a Sensitive-Word List Exists
Every day, millions of messages, comments and posts are published online.
Forums, chat rooms, games and apps need a quick way to spot words that break local rules or platform policies.
A sensitive-word list (sometimes called a “blacklist” or “blocklist”) is the simplest building block: a plain-text file that contains words or phrases you do not want to appear.
Sensitive-lexicon is an open-source example of such a list.
It focuses on Chinese and covers politics, adult content, violence and other common categories.
The list is updated by the community and released under the MIT License, so you can use it in commercial projects for free.
2 What Exactly Is Sensitive-Lexicon?
Item | Details |
---|---|
Name | Sensitive-lexicon |
Language | Chinese |
Format | Plain UTF-8 text, one word per line |
Size | Tens of thousands of entries |
Update cycle | Continuous, driven by GitHub commits |
License | MIT License |
Source | https://github.com/Konsheng/Sensitive-lexicon |
The repository also includes a Vocabulary/ folder for topic-specific sub-lists and a star-history chart so you can see how the project has grown.
3 Quick Start: Three Steps to Filter Text
3.1 Download the List
Open a terminal and run:
git clone https://github.com/Konsheng/Sensitive-lexicon.git
cd Sensitive-lexicon
You will now see:
Sensitive-lexicon/
├── sensitive-lexicon.txt # main list
├── Vocabulary/ # optional category files
└── README.md
sensitive-lexicon.txt is the file you need.
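Loading the main file, and optionally the category files, could look like the minimal sketch below. The helper name `load_words` is hypothetical, and the sketch assumes the files under Vocabulary/ are plain-text files ending in `.txt`; check the repository before relying on that.

```python
from pathlib import Path


def load_words(repo_dir: str, include_vocabulary: bool = False) -> set[str]:
    """Return the de-duplicated entries of the main list (and optionally Vocabulary/)."""
    root = Path(repo_dir)
    files = [root / "sensitive-lexicon.txt"]
    if include_vocabulary:
        # Assumption: category files are plain-text files ending in .txt.
        files += sorted((root / "Vocabulary").glob("*.txt"))

    words: set[str] = set()
    for path in files:
        # "utf-8-sig" reads plain UTF-8 and also drops a leading BOM if one is present.
        with open(path, encoding="utf-8-sig") as f:
            words.update(line.strip() for line in f if line.strip())
    return words
```

Calling `load_words("Sensitive-lexicon")` from the directory where you ran `git clone` returns the main list as a set; pass `include_vocabulary=True` to merge in the category files.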
3.2 Choose a Matching Algorithm
A word list on its own does nothing; you need code that scans incoming text and replaces or flags the matches.
Below are three common approaches.
Algorithm | How It Works | When to Use |
---|---|---|
DFA (Deterministic Finite Automaton) | Turns the list into a state machine, scans text once | High-traffic services needing millisecond response |
Trie (Prefix Tree) | Builds a tree where each node is one character, saves memory | Memory-constrained environments |
Regular Expression | Compiles the list into one big regex, one line of code | Scripts, prototypes, low-volume tools |
Example in Python Using a Trie
```python
# Step 1: Read the list, skipping blank lines
with open('sensitive-lexicon.txt', encoding='utf-8') as f:
    words = [line.strip() for line in f if line.strip()]

# Step 2: Build the trie as nested dicts, one level per character
END = object()  # end-of-word marker; a non-string key cannot collide with a real character
trie = {}
for w in words:
    node = trie
    for ch in w:
        node = node.setdefault(ch, {})
    node[END] = True

# Step 3: Filter function
def filter_text(text: str) -> str:
    i, n = 0, len(text)
    out = []
    while i < n:
        node, j = trie, i
        while j < n and text[j] in node:
            node = node[text[j]]
            if END in node:  # hit a sensitive word (the shortest one starting at i)
                out.append('*' * (j - i + 1))
                i = j + 1
                break
            j += 1
        else:
            # no listed word starts at position i: keep the character
            out.append(text[i])
            i += 1
    return ''.join(out)

# quick test ("This is a test sentence")
print(filter_text("这是一个测试句子"))
```
Save the script as filter.py and run:
python filter.py
If the test sentence contains any word from the list, you will see asterisks instead.
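For comparison, the regular-expression row of the table can be sketched as follows. The entries in `words` are hypothetical English placeholders standing in for the real list, and `filter_text_regex` is a name chosen for this example. With tens of thousands of entries, a single alternation is likely to be much slower than the trie, which is why the table reserves it for scripts and prototypes.

```python
import re

# Hypothetical placeholder entries; in practice, load them from sensitive-lexicon.txt.
words = ["badword", "bad", "forbidden phrase"]

# Longest entries first so the alternation prefers "badword" over its prefix "bad".
# re.escape keeps characters such as "." or "*" in an entry from acting as regex syntax.
pattern = re.compile(
    "|".join(re.escape(w) for w in sorted(words, key=len, reverse=True))
)


def filter_text_regex(text: str) -> str:
    """Replace every match with asterisks of the same length."""
    return pattern.sub(lambda m: "*" * len(m.group()), text)


print(filter_text_regex("a badword and a forbidden phrase"))
```

Note that an empty `words` list would compile to an empty pattern, so guard against that case before using this in a real script.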
3.3 Verify the Output
- Type a few test sentences yourself.
- Check that the correct word is replaced.
- If nothing happens, open sensitive-lexicon.txt and confirm the word is present and spelled exactly the same way (no extra spaces).
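Checking by eye can miss invisible characters, so a small helper may be useful here. The sketch below is an illustration only: `diagnose` is a hypothetical name, and the set of characters it treats as invisible (spaces, tabs, line endings, the full-width space, a byte-order mark and the zero-width space) is an assumption about what typically sneaks into such files.

```python
# Characters that are easy to overlook in an editor (assumed set, extend as needed).
INVISIBLE = " \t\r\n\u3000\ufeff\u200b"


def diagnose(word: str, raw_lines: list[str]) -> str:
    """Say whether `word` is in the list exactly, only after cleanup, or not at all."""
    entries = [line.rstrip("\n") for line in raw_lines]
    if word in entries:
        return "exact match"
    cleaned = [entry.strip(INVISIBLE) for entry in entries]
    if word in cleaned:
        return "present, but with surrounding whitespace or invisible characters"
    return "not in the list"
```

To use it on the real file, read the lines with `open('sensitive-lexicon.txt', encoding='utf-8').readlines()` and pass them in together with the word you expected to be filtered.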
4 Contributing New Words
Sensitive-lexicon is community-driven.
If you spot a missing word, you can add it by following these steps.
- Fork the repository on GitHub.
- Add or edit entries in the Vocabulary/ folder or in sensitive-lexicon.txt.
  - One word or phrase per line
  - No leading or trailing spaces
- Document the source in your pull request (news article, chat log, etc.).
- Submit the pull request.
- Wait for a maintainer to review and merge.
Tips to keep the project healthy:
- Send small pull requests (a few dozen words) rather than thousands at once.
- If you are unsure, open an issue first to discuss the word or category.
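The formatting rules for new entries (one word or phrase per line, no leading or trailing spaces) can be checked locally before you open a pull request. The following is a minimal sketch, not a tool shipped with the repository; `check_entries` is a hypothetical name, and the duplicate check is an extra assumption about what maintainers would want.

```python
def check_entries(lines: list[str]) -> list[str]:
    """Return human-readable problems found in candidate list lines."""
    problems = []
    first_seen = {}  # entry -> line number of its first occurrence
    for number, line in enumerate(lines, start=1):
        entry = line.rstrip("\r\n")
        if not entry.strip():
            continue  # blank lines are ignored here
        if entry != entry.strip():
            problems.append(f"line {number}: leading or trailing whitespace")
        key = entry.strip()
        if key in first_seen:
            problems.append(f"line {number}: duplicate of line {first_seen[key]}")
        else:
            first_seen[key] = number
    return problems
```

An empty result means the file passes these two checks; it says nothing about whether the entries themselves are appropriate, which remains the maintainers' call.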
5 Legal, Ethical and Practical Considerations
5.1 Follow Local Laws and Platform Policies
Sensitive-word definitions differ by country and by service.
Always have your legal or policy team review the list before going live.
5.2 Context Matters
The same word can be acceptable in a medical article but problematic in a chat room.
Build additional logic (whitelists, user reputation, human review) to reduce false positives.
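One way to sketch the whitelist idea: a hit is left alone when it falls inside a phrase you have explicitly allowed. The function name, the English placeholder words and the phrase-level whitelist are all assumptions made for illustration; a real deployment would use its own allowed phrases and probably a faster matcher.

```python
import re


def mask_with_whitelist(text: str, blocked: list[str], allowed: list[str]) -> str:
    """Mask blocked words unless the hit lies inside an allowed phrase."""
    # Character spans covered by any allowed phrase.
    safe_spans = [
        m.span()
        for phrase in allowed
        for m in re.finditer(re.escape(phrase), text)
    ]

    def replace(match: re.Match) -> str:
        start, end = match.span()
        if any(s <= start and end <= e for s, e in safe_spans):
            return match.group()  # inside a whitelisted phrase: keep it
        return "*" * (end - start)

    # Longest entries first so longer blocked words win over their prefixes.
    pattern = re.compile(
        "|".join(re.escape(w) for w in sorted(blocked, key=len, reverse=True))
    )
    return pattern.sub(replace, text)
```

With `blocked=["breast"]` and `allowed=["breast cancer"]`, a sentence about breast cancer screening passes through unchanged, while the bare word elsewhere is still masked.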
5.3 Log Responsibly
Keep an audit trail of blocked content, but strip user identifiers to protect privacy.
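If you still need to tell repeat offenders apart, one variant of this advice is to replace the identifier with a keyed hash rather than drop it entirely. The sketch below assumes that variant; `audit_record`, the record fields and the `LOG_KEY` placeholder are hypothetical, and the key must be stored separately from the logs for the pseudonym to protect anything.

```python
import hashlib
import hmac
import json
import time

# Assumption: a secret kept outside the log store, e.g. loaded from an environment variable.
LOG_KEY = b"replace-with-a-real-secret"


def audit_record(user_id: str, matched_word: str, key: bytes = LOG_KEY) -> str:
    """Build a JSON log line that carries a keyed hash instead of the raw user ID."""
    pseudonym = hmac.new(key, user_id.encode("utf-8"), hashlib.sha256).hexdigest()
    return json.dumps(
        {"ts": int(time.time()), "user": pseudonym[:16], "matched": matched_word},
        ensure_ascii=False,  # keep Chinese entries readable in the log
    )
```

The same user always maps to the same pseudonym under a given key, so you can count repeated violations without storing who the user is.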
6 Frequently Asked Questions (FAQ)
Question | Short Answer |
---|---|
1. How complete is the list? | Tens of thousands of entries; good coverage but never 100 %. You will still need domain-specific additions. |
2. How often is it updated? | Whenever the community submits new words; expect several commits per month. |
3. Is it free for commercial use? | Yes, under the MIT License. |
4. Can I use it in production right away? | Yes, but run your own QA and legal checks first. |
5. How do I distinguish adult terms from medical terms? | The list itself does not tag context; your application logic must decide. |
6. Does it collect user data? | No. The repository only contains the word list. |
7. Are other languages included? | Currently Chinese only. |
8. What encoding should I use? | UTF-8 without BOM. |
9. Is there a REST API? | No. You load the file directly or wrap it in your own service. |
10. How do I benchmark performance? | Use tools like wrk or Apache Bench and measure queries per second and memory usage. |
11. What if legitimate words are blocked? | Maintain a whitelist or add a second human-review step. |
12. Is there a GUI? | No official graphical interface; you can build one on top of the text file. |
13. How do I roll back a bad update? | Use Git: git checkout <previous-commit> |
14. Is there a Docker image? | No official image; you can write a simple Dockerfile that copies the list into your container. |
15. How do I contact the maintainers? | Open an issue or discussion on GitHub. |
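For question 10, wrk and Apache Bench measure an HTTP endpoint from the outside. As a complement, not a replacement, the filter function itself can be timed in-process. The sketch below is one rough way to do that; `benchmark` is a hypothetical helper, and the numbers it produces depend heavily on your machine and sample texts.

```python
import time


def benchmark(filter_fn, samples: list[str], rounds: int = 100) -> float:
    """Return the measured throughput of filter_fn in texts per second."""
    start = time.perf_counter()
    for _ in range(rounds):
        for text in samples:
            filter_fn(text)
    # Guard against a zero reading on very coarse clocks.
    elapsed = max(time.perf_counter() - start, 1e-9)
    return rounds * len(samples) / elapsed
```

Pass in your own filter function and a list of representative messages, for example `benchmark(filter_text, sample_messages)`, and compare the result before and after changing the matching algorithm.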
7 Turning the List into a Micro-service
If you do not want to reload the list every time your program starts, wrap it in a small service:
- Load the list into Redis (key “sensitive_words”).
- Write a tiny HTTP API in Flask, FastAPI, or Spring Boot.
- Expose one endpoint:
  POST /filter
  Body: {"text": "input sentence"}
  Response: {"filtered": "**** sentence"}
- Update the Redis set whenever the list changes; no restart required.
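The shape of such a service can be sketched with the Python standard library alone. To keep the example self-contained, it swaps in `http.server` for Flask or FastAPI and an in-memory set for the Redis key, and it uses naive string replacement instead of a trie; all names and the placeholder entry are hypothetical.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Stand-in for the Redis set "sensitive_words"; hypothetical placeholder entry.
SENSITIVE_WORDS = {"badword"}


def filter_payload(raw: bytes) -> bytes:
    """Turn a JSON request body {"text": ...} into a JSON response {"filtered": ...}."""
    text = json.loads(raw.decode("utf-8"))["text"]
    if not isinstance(text, str):
        raise ValueError("'text' must be a string")
    # Naive replacement, longest entries first; a trie or DFA would replace this in practice.
    for word in sorted(SENSITIVE_WORDS, key=len, reverse=True):
        text = text.replace(word, "*" * len(word))
    return json.dumps({"filtered": text}, ensure_ascii=False).encode("utf-8")


class FilterHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/filter":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        try:
            body = filter_payload(self.rfile.read(length))
        except (ValueError, KeyError, TypeError):
            self.send_error(400, "expected a JSON body with a 'text' field")
            return
        self.send_response(200)
        self.send_header("Content-Type", "application/json; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Start the service on localhost; blocks until interrupted."""
    HTTPServer(("127.0.0.1", port), FilterHandler).serve_forever()
```

Call `serve()` to start listening. Because the handler reads `SENSITIVE_WORDS` on every request, adding or removing entries in that set takes effect without a restart, which mirrors the role Redis plays in the steps above.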
8 Visual Summary
graph TD
A[Sensitive-Word List] --> B[Download]
A --> C[Use in Code]
A --> D[Contribute]
B --> B1[git clone]
C --> C1[DFA / Trie / Regex]
C --> C2[HTTP micro-service]
D --> D1[Pull Request]
D --> D2[Issue Discussion]
9 Closing Thoughts
Filtering sensitive words is not a one-time task.
Language, culture and laws evolve, and so must your approach.
Sensitive-lexicon gives you a solid starting point in Chinese; the rest is up to your judgment, your users’ needs and your local regulations.
Clone the repository today, run a quick test, and decide if it fits your project.
If it does, keep an eye on the commit history—new words arrive regularly.
And if you find a gap, the community will appreciate your contribution.