The Illusion of Privacy: Why Your PDF Redactions Might Be Leaving Data “Naked”
In an era defined by data transparency and digital accountability, we have a dangerous habit of trusting what we see, or rather, what we can’t see. When a heavy black rectangle covers a name or a Social Security number in a legal document, you assume that information is gone.
At Free Law Project, we’ve spent years collecting millions of PDFs, and we’ve discovered a disturbing reality: many redactions are merely digital theater. Instead of permanently removing sensitive data, users often just draw a black box over the text. To the human eye, it’s hidden; to a computer, the text layer remains perfectly intact underneath, waiting to be highlighted and copied.
To expose this systemic failure, we built x-ray: a Python library designed to tear back the curtain on “worthless” redactions by finding them automatically.
The Anatomy of a “Bad Redaction”
The problem stems from a fundamental misunderstanding of the PDF format. A PDF is not a flat image; it is a complex stack of layers. A “bad redaction” occurs when a user places a vector graphic (the black box) on top of the text layer without actually deleting the underlying characters.
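To make this concrete, here is a minimal sketch using a hand-written, uncompressed page content stream. The stream and the two helper functions are hypothetical illustrations, not part of x-ray; real PDFs usually compress their streams and encode text through fonts, so a regex like this is only a teaching aid. The point is that drawing a filled rectangle adds new drawing instructions without touching the text instruction that came before it.

```python
import re

# A hand-written, uncompressed page content stream (hypothetical example).
# PDF operators used here: BT/ET wrap a text object, Tj shows a string,
# rg sets the fill color, re defines a rectangle, f fills it.
CONTENT_STREAM = """
BT /F1 12 Tf 72 720 Td (SSN: 123-45-6789) Tj ET
0 0 0 rg
70 716 130 14 re
f
"""


def extract_shown_text(stream: str) -> list[str]:
    """Return every literal string passed to the Tj (show text) operator."""
    return re.findall(r"\((.*?)\)\s*Tj", stream)


def count_filled_rectangles(stream: str) -> int:
    """Count rectangles ('re') that are immediately filled ('f')."""
    return len(re.findall(r"\bre\s+f\b", stream))


# The black box is present, and so is the text it was meant to hide.
print(count_filled_rectangles(CONTENT_STREAM))
print(extract_shown_text(CONTENT_STREAM))
```

A proper redaction would have to remove or rewrite the text instruction itself, not just paint over the place where it is displayed.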
After witnessing this security lapse for years, we developed x-ray to determine exactly how common this issue is. Whether it’s an amicus brief or congressional testimony, the tool acts as a digital forensic auditor, identifying exactly where sensitive data is still “leaking” through the black ink.
How x-ray Works: Behind the Digital Lens
x-ray leverages the high-performance PyMuPDF engine to parse the structural complexities of PDF files. The logic follows a rigorous four-step process to ensure accuracy:
- Rectangle Detection: The tool identifies every rectangle element within the PDF.
- Text Overlay Analysis: It scans for letters and words occupying the exact same spatial coordinates as those rectangles.
- Visual Rendering: The tool renders the specific rectangle area as an image.
- Pixel Inspection: It inspects the pixels. If the rectangle is a single solid color (like pure black) but contains text underneath, it is flagged as a bad redaction. If there is a mix of colors or drawings, it suggests a legitimate graphical element.
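The four steps above can be modeled in a few lines of plain Python. This is a simplified sketch, not x-ray's actual code: the `Rect` and `Char` types, the `render` callback, and every function name are assumptions introduced for illustration, and the rectangles and characters are assumed to have been extracted already (in x-ray, that work is done through PyMuPDF).

```python
from typing import Callable, NamedTuple, Sequence


class Rect(NamedTuple):
    x0: float
    y0: float
    x1: float
    y1: float


class Char(NamedTuple):
    text: str
    bbox: Rect


def intersects(a: Rect, b: Rect) -> bool:
    """True when the two boxes overlap with non-zero area."""
    return a.x0 < b.x1 and b.x0 < a.x1 and a.y0 < b.y1 and b.y0 < a.y1


def is_single_color(pixels: Sequence[tuple]) -> bool:
    """True when every rendered pixel has the same color value."""
    return len(set(pixels)) == 1


def find_bad_redactions(
    rects: Sequence[Rect],
    chars: Sequence[Char],
    render: Callable[[Rect], Sequence[tuple]],
) -> list[dict]:
    findings = []
    for rect in rects:  # step 1: rectangles found on the page
        # Step 2: collect the characters that sit under this rectangle.
        hidden = "".join(c.text for c in chars if intersects(rect, c.bbox))
        if not hidden.strip():
            continue  # nothing underneath, so nothing can leak
        # Step 3: render the rectangle's area to pixels (hypothetical callback).
        pixels = render(rect)
        # Step 4: a uniform fill over text is treated as a bad redaction.
        if is_single_color(pixels):
            findings.append({"bbox": tuple(rect), "text": hidden})
    return findings
```

Passing the renderer in as a callback keeps the decision logic separate from the PDF engine, which makes the overlap and color checks easy to exercise with synthetic data.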
Implementation: From Local Files to Cloud URLs
One of x-ray’s core strengths is its accessibility. It is designed to fit into modern developer workflows, supporting everything from local paths to remote URLs.
Rapid Installation
You can integrate x-ray into your environment using uv or pip:
uv add x-ray
# or
pip install x-ray
Command Line Intelligence
For a quick check without installation, uvx allows you to run x-ray directly against online documents:
uvx --from x-ray xray https://example.com/legal-doc.pdf
When a document fails inspection, x-ray returns a clean JSON object detailing the failure:
- Page Number: The specific location of the leak.
- bbox: The precise coordinates (upper left and lower right) of the failed redaction.
- text: The actual hidden text recovered from beneath the box.
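As a rough sketch of how that JSON could be consumed downstream, the snippet below parses a sample report and prints one line per finding. The sample string is illustrative and only assumes the shape described above: page numbers as keys, each mapping to a list of objects with `bbox` and `text`. Because JSON object keys are always strings, the hypothetical `parse_report` helper converts them back to integers.

```python
import json

# Illustrative sample shaped like the fields described above; not real output.
cli_output = (
    '{"1": [{"bbox": [58.55, 72.19, 75.65, 739.39], '
    '"text": "Sensitive Data Here"}]}'
)


def parse_report(raw: str) -> dict[int, list[dict]]:
    """Parse the JSON report and turn string page keys into integers."""
    report = json.loads(raw)
    return {int(page): hits for page, hits in report.items()}


for page, hits in parse_report(cli_output).items():
    for hit in hits:
        x0, y0, x1, y1 = hit["bbox"]
        print(f"page {page}: {hit['text']!r} at ({x0}, {y0})-({x1}, {y1})")
```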
Pythonic Integration
For larger data pipelines, x-ray can be imported as a module. It accepts local file paths as strings, pathlib.Path objects, and even raw bytes for in-memory processing.
import xray
results = xray.inspect("sensitive_document.pdf")
# Output: {1: [{'bbox': (58.55, 72.19, 75.65, 739.39), 'text': 'Sensitive Data Here'}]}
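For a pipeline that needs to audit many files, one possible approach is a small wrapper that walks a directory and keeps only the documents with findings. The `scan_directory` helper below is a hypothetical sketch, not part of x-ray. It takes the inspection function as an argument, assumes that function returns an empty (falsy) result for a clean document, and skips files that raise an error rather than stopping the whole run.

```python
from pathlib import Path
from typing import Callable


def scan_directory(directory, inspect: Callable) -> dict[str, dict]:
    """Run `inspect` on every PDF under `directory`; return only flagged files."""
    flagged = {}
    for pdf_path in sorted(Path(directory).rglob("*.pdf")):
        try:
            results = inspect(pdf_path)
        except Exception as exc:  # malformed PDFs should not abort the batch
            print(f"skipped {pdf_path}: {exc}")
            continue
        if results:  # assumption: a clean document yields an empty result
            flagged[str(pdf_path)] = results
    return flagged
```

With x-ray installed, this could be called as `scan_directory("filings/", xray.inspect)`, where `filings/` is a placeholder directory name.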
The Road Ahead for PDF Security
The PDF format is vast and notoriously difficult to parse perfectly. While x-ray currently excels at identifying solid-color overlaps, bad redactions can take many forms. We are actively looking for community contributions to solve tougher cases and improve the library’s reach.
This project is available under the BSD license, making it safe for incorporation into both open-source and proprietary security audits.
Conclusion: Stop Guessing, Start Auditing
Privacy shouldn’t be a game of “hide and seek” where the seeker always wins. By using x-ray, organizations can move beyond visual confirmation and toward structural verification of their sensitive documents.