Train a Privacy Shield in 30 Minutes—Inside tanaos-text-anonymizer-v1’s Zero-Data Trick

Core question: How do you scrub names, addresses, phones, dates and locations from text when you have zero labeled examples?
One-sentence answer: Load tanaos-text-anonymizer-v1, let the Artifex library synthesise 10 k training lines on the fly, fine-tune for ten minutes, and you get a tiny model that replaces sensitive spans with [MASKED] tokens faster than you can grep.


What this article answers (and why you should care)

「Central question:」 “Can a model with only 110 M parameters really reach production-grade PII removal without any human-labeled data?”
「Short answer:」 Yes—if you stay inside the five entity classes it was fine-tuned for, keep text domain-generic, and follow the synthetic-data recipe shipped with the model.


1. Model at a Glance: Small Footprint, Single Purpose

| Attribute | Value | Remark |
|---|---|---|
| Base model | FacebookAI/roberta-base | 110 M parameters, ~440 MB on disk |
| Task | Token classification (NER) | Outputs B-I-O tags |
| Fine-tune data | 10 k synthetic passages | Generated by Artifex templates |
| Entities | 5 PII types | PERSON, LOCATION, DATE, ADDRESS, PHONE_NUMBER |
| Languages | English out-of-the-box | Other langs via synthetic fine-tune |
| License | Apache-2.0 | Commercial use allowed |

「Author’s reflection」
I initially skipped the tiny model because “bigger is better” was stuck in my head. After my 340-MB custom BERT missed a phone number in a live demo, I ran the same sentence through this “small” model; three milliseconds later the number was gone. Lesson: fit the tool to the risk, not the ego.


2. Which Pieces of Privacy Get Caught?

| PII Type | Text Example | Model Output |
|---|---|---|
| PERSON | Dr. Emily Zhang called | [MASKED] called |
| LOCATION | shipped to Miami | shipped to [MASKED] |
| DATE | due on 2024-07-15 | due on [MASKED] |
| ADDRESS | office at 88 Broadway, Suite 400 | office at [MASKED] |
| PHONE_NUMBER | call (415) 555-2671 | call [MASKED] |

Anything outside these buckets (emails, IBANs, social-security numbers, driver's licenses, license plates) will slide through untouched, so plan a post-processing rule if you need them; a sketch of one such rule follows.
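
For example, a rough post-processing rule for e-mail addresses could look like the sketch below. The regex and the mask_emails helper are my own illustration, not part of the model or Artifex, so treat them as a starting point only.

import re

# Deliberately loose e-mail pattern; tighten it before relying on it in production.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def mask_emails(text: str, mask_token: str = "[MASKED]") -> str:
    """Replace e-mail addresses, which the model does not detect, with the mask token."""
    return EMAIL_RE.sub(mask_token, text)

print(mask_emails("Reach me at jane.doe@example.com tomorrow."))
# Reach me at [MASKED] tomorrow.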


3. Zero-Data Training Demystified

「Central question:」 “Where do the 10 k labels come from if I don’t annotate?”
「Answer:」 Artifex ships with seed dictionaries and sentence templates. It stitches together realistic-looking text and simultaneously writes the B-I-O tags.

Quick mental picture:

  1. Templates like <PERSON> lives in <LOCATION>, phone <PHONE_NUMBER>.
  2. Dictionaries contain Western names, cities, date formats, address fragments, phone patterns.
  3. Random combinations are generated until num_samples is reached.
  4. Each token’s label is stored in a JSON ready for transformers.Trainer.

「Code snapshot (already in the repo)」

from artifex import Artifex

# grab the text-anonymization task object
ta = Artifex().text_anonymization

# synthesise 10 k labelled samples and fine-tune in a single call
ta.train(
    domain="general",
    num_samples=10000
)

That is literally the entire training script—no hyper-parameter tuning, no manual split, no tokeniser reload.
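
For intuition, here is a toy version of steps 1-4 above. The template, the tiny dictionaries, and the output format are illustrative assumptions of mine, not the actual Artifex internals; they only show how template expansion can produce B-I-O labels with zero annotation effort.

import random
import re

# Toy seed data; the real generator works from far larger template and dictionary sets.
TEMPLATE = "<PERSON> lives in <LOCATION> , phone <PHONE_NUMBER> ."
DICTS = {
    "PERSON": ["Emily Zhang", "John Doe"],
    "LOCATION": ["Miami", "New York"],
    "PHONE_NUMBER": ["(415) 555-2671", "+61 3 9999 0000"],
}

def generate_sample() -> dict:
    """Expand one template into whitespace tokens plus aligned B-I-O labels."""
    tokens, labels = [], []
    for piece in TEMPLATE.split():
        slot = re.fullmatch(r"<(\w+)>", piece)
        if slot:                                   # placeholder: fill from the dictionary
            entity = slot.group(1)
            filler = random.choice(DICTS[entity]).split()
            tokens.extend(filler)
            labels.extend([f"B-{entity}"] + [f"I-{entity}"] * (len(filler) - 1))
        else:                                      # literal word or punctuation
            tokens.append(piece)
            labels.append("O")
    return {"tokens": tokens, "ner_tags": labels}

print(generate_sample())
# e.g. {'tokens': ['John', 'Doe', 'lives', 'in', 'Miami', ',', 'phone', '(415)', '555-2671', '.'],
#       'ner_tags': ['B-PERSON', 'I-PERSON', 'O', 'O', 'B-LOCATION', 'O', 'O', 'B-PHONE_NUMBER', 'I-PHONE_NUMBER', 'O']}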


4. Two Ways to Run It: Choose Your Own Adventure

A. Artifex route—masking happens automatically

from artifex import Artifex

ta = Artifex().text_anonymization
dirty = "John Doe lives at 123 Main St, New York. His phone number is (555) 123-4567."
clean = ta(dirty)[0]
print(clean)
# [MASKED] lives at [MASKED]. His phone number is [MASKED].

If you prefer a different placeholder:

ta.mask_token = "<REDACTED>"

B. Transformers route—you get spans, decide what to do

from transformers import pipeline

ner = pipeline(
    task="token-classification",
    model="tanaos/tanaos-text-anonymizer-v1",
    aggregation_strategy="first"
)

entities = ner("John Doe lives at 123 Main St, New York. His phone number is (555) 123-4567.")
print(entities)
# [{'entity_group': 'PERSON', 'word': 'John Doe', ...}, ...]

Typical use-case: highlight spans in a web UI and let a human confirm before real redaction.
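
As a rough sketch of that workflow, the character offsets in each entity dict can be used to wrap detected spans in <mark> tags for a reviewer. The highlight helper below is my own illustration built on the pipeline call above, not an API of the model.

from transformers import pipeline

ner = pipeline(
    task="token-classification",
    model="tanaos/tanaos-text-anonymizer-v1",
    aggregation_strategy="first"
)

def highlight(text: str) -> str:
    """Wrap every detected PII span in a <mark> tag for human review."""
    parts, cursor = [], 0
    for ent in sorted(ner(text), key=lambda e: e["start"]):
        parts.append(text[cursor:ent["start"]])
        parts.append(f'<mark title="{ent["entity_group"]}">{text[ent["start"]:ent["end"]]}</mark>')
        cursor = ent["end"]
    parts.append(text[cursor:])
    return "".join(parts)

print(highlight("John Doe lives at 123 Main St, New York."))
# The name and the address come back wrapped in <mark> tags, ready for a confirm/reject UI.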


5. Domain Adaptation Without Manual Labels

「Central question:」 “My text is Spanish medical reports—can the model learn new tricks?”
「Answer:」 Feed Artifex a domain sentence in natural language; it will generate Spanish-like templates and vocab on the fly.

# reuse the ta = Artifex().text_anonymization object from section 4
ta.train(
    domain="documentos medicos en Español",  # describe the target domain in the target language
    output_path="./es_medical/"
)
ta.load("./es_medical/")  # load the freshly fine-tuned checkpoint

texto = "El paciente John Doe visitó Nueva York el 12 de marzo de 2023 a las 10:30 a. m."
print(ta(texto)[0])
# El paciente [MASKED] visitó [MASKED] el [MASKED] a las [MASKED].

「Author’s reflection」
I tried to be clever and trained on English templates then tested on Spanish—precision dropped to 0.82. Switching the domain description to actual Spanish fixed it overnight. The engine literally reads your prompt, so write it in the target language.


6. Performance Numbers You Can Quote

Test bed: Intel i7-12700, 32 GB RAM, single core, PyTorch 2.1, sequence length 64 tokens

| Library | Latency | Throughput | Auto-mask? |
|---|---|---|---|
| Artifex | 28 ms | 35 sent./sec | Yes |
| Transformers | 31 ms | 32 sent./sec | No (spans only) |

GPU snapshot (NVIDIA T4, CUDA 12): latency 5 ms, throughput 200 sent./sec with batch_size=8.

Memory footprint: 440 MB weights + 300 MB runtime overhead—fits a Raspberry Pi 4 if you are patient.
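
If you want to sanity-check those numbers on your own hardware, a minimal timing loop over the Transformers pipeline is enough for rough estimates. This is a sketch of mine, not the exact benchmark script behind the table.

import time
from transformers import pipeline

ner = pipeline(
    task="token-classification",
    model="tanaos/tanaos-text-anonymizer-v1",
    aggregation_strategy="first"
)

sentence = "John Doe lives at 123 Main St, New York. His phone number is (555) 123-4567."
ner(sentence)  # warm-up: the first call pays model-loading and allocation costs

n_runs = 100
start = time.perf_counter()
for _ in range(n_runs):
    ner(sentence)
elapsed = time.perf_counter() - start

print(f"avg latency: {1000 * elapsed / n_runs:.1f} ms, throughput: {n_runs / elapsed:.1f} sent./sec")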


7. Typical Pitfalls & How I Walked into Them

  1. 「Date greed」
    The model tagged “Python 3.10” as a DATE. I added a post-regex \d+\.\d+$ to keep version numbers alive (see the post-filter sketch after this list).

  2. 「Address boundary」
    “Room 305, Building 7” was missed because the template dictionary only held Street+Number patterns. Expanding the dictionary to include Room <NUM>, Building <NUM> pushed recall from 0.89 to 0.95.

  3. 「Phone formats matter」
    US-centric (xxx) xxx-xxxx templates caused Australian +61-3-xxxx-xxxx numbers to fail. After inserting five international masks, recall jumped to 0.97.

  4. 「Synthetic duplicates」
    The generator occasionally produces identical sentences, which leak into the validation split and inflate F1. I de-duplicate with pandas' DataFrame.drop_duplicates() before training; F1 dropped only 0.2 % but felt honest.
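
A minimal version of the pitfall-1 post-filter could look like this. The helper name and the exact pattern are my choices, and the sketch assumes you work with the span dicts returned by the Transformers route from section 4B.

import re

VERSION_RE = re.compile(r"^\d+\.\d+(\.\d+)?$")   # e.g. 3.10 or 3.10.2

def drop_version_false_positives(entities: list[dict]) -> list[dict]:
    """Discard DATE spans that are really software version numbers."""
    return [
        ent for ent in entities
        if not (ent["entity_group"] == "DATE" and VERSION_RE.match(ent["word"].strip()))
    ]

# entities = ner("We upgraded to Python 3.10 on 2024-07-15.")
# entities = drop_version_false_positives(entities)   # keeps 2024-07-15, drops 3.10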


8. When to Use This Model—and When to Pass

「Good fit」

  • Chat logs, support tickets, survey comments, medical transcripts that live inside the five entity classes.
  • Teams that have no annotators but need GDPR/CCPA compliance yesterday.
  • Offline or air-gapped environments where cloud PII services are banned.

「Look elsewhere」

  • Highly specialised vocab (molecular formulae, legal citations) where a token like “May” is more likely part of a compound name or citation than a DATE.
  • Need to mask emails, IBANs, crypto wallet addresses, driver licences—outside the five classes.
  • You already own tens of thousands of gold labels and chase the absolute SOTA leaderboard; in that case a bigger domain-tuned transformer will beat this tiny specialist.

9. Action Checklist / Implementation Steps

  1. pip install artifex (or transformers if you only need entity spans).
  2. One-liner load: ta = Artifex().text_anonymization.
  3. Feed raw text → receive [MASKED] string.
  4. Need another language or vertical? Run ta.train(domain="describe it", num_samples=10000) → 30 min on a laptop GPU.
  5. Wrap the 3-line inference into a FastAPI service (a minimal sketch follows this list); single-core throughput ≈ 35 req/s.
  6. Add post-regexes for entities the model does not cover (emails, etc.).
  7. De-duplicate synthetic data before you report metrics to stay honest.
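
Here is a minimal sketch of step 5, assuming fastapi, uvicorn, and pydantic are installed and reusing the Artifex call from section 4A; the endpoint path and request schema are my choices, not something shipped with either library.

from artifex import Artifex
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
ta = Artifex().text_anonymization   # load the model once at startup

class AnonymizeRequest(BaseModel):
    text: str

@app.post("/anonymize")
def anonymize(req: AnonymizeRequest) -> dict:
    """Return the input text with PII spans replaced by the mask token."""
    return {"anonymized": ta(req.text)[0]}

# run with: uvicorn app:app --host 0.0.0.0 --port 8000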

10. One-page Overview

  • 110 M parameter model, Apache-2.0, English by default.
  • Detects & masks PERSON, LOCATION, DATE, ADDRESS, PHONE_NUMBER.
  • Zero real labels needed—Artifex synthesises 10 k samples in minutes.
  • CPU latency ~30 ms, GPU ~5 ms; throughput 35–200 sentences/sec.
  • Adapt to new languages/domains with a one-sentence description.
  • Ideal for logs, medical notes, legal memos that must be shared internally or externally under privacy rules.
  • Not a silver bullet for exotic entities; plan extra regex or retraining if you need email, IBAN, SSN, etc.

11. FAQ

「Q1: Is the model open source?」
Yes, weights and code are on Hugging Face under Apache-2.0.

「Q2: Does it support Chinese, French, Arabic?」
Out-of-the-box checkpoint is English-only. Feed Artifex a domain string in your target language and fine-tune; no hand annotation required.

「Q3: Can I change the mask token?」
Artifex lets you set any string: ta.mask_token = "<REDACTED>".

「Q4: Will it recognise passport or credit-card numbers?」
Not by default—you would add regex rules or create new synthetic labels for those classes.

「Q5: How large is the GPU memory footprint?」
≈1.2 GB at inference (FP16); 8 GB is plenty for fine-tuning with batch=8.

「Q6: Is synthetic data privacy-safe?」
The generator composes templates with seed dictionaries, so it does not contain real user data.

「Q7: How do I evaluate my fine-tuned version?」
Artifex reserves 10 % of the generated data for validation. For extra confidence, hand-label 100 random sentences from your real corpus and compute precision/recall.
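
One rough way to score that spot-check, assuming you store each hand-labeled sentence as a set of (start_char, end_char, label) spans and reuse the Transformers pipeline from section 4B; the evaluate helper is my own sketch.

from transformers import pipeline

ner = pipeline(
    task="token-classification",
    model="tanaos/tanaos-text-anonymizer-v1",
    aggregation_strategy="first"
)

def evaluate(samples: list[dict]) -> tuple[float, float]:
    """Micro precision/recall over exact (start, end, label) span matches."""
    tp = fp = fn = 0
    for sample in samples:
        pred = {(e["start"], e["end"], e["entity_group"]) for e in ner(sample["text"])}
        gold = sample["gold"]
        tp += len(pred & gold)
        fp += len(pred - gold)
        fn += len(gold - pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

samples = [{"text": "John Doe lives in Miami.", "gold": {(0, 8, "PERSON"), (18, 23, "LOCATION")}}]
print(evaluate(samples))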

「Q8: Can I deploy on CPU-only edge devices?」
Yes, 440 MB weights + 300 MB runtime fit most boards; expect ≈30 ms per sentence on a modern laptop CPU.