pw-redact

Open-source PII redaction engine for financial services AI pipelines.

Built by Protocol Wealth LLC, an SEC-registered investment adviser, to ensure client data never reaches third-party AI models.

We believe transparency in how client data is handled shouldn't be a competitive secret — it should be an industry standard.

Why This Exists

Financial advisors increasingly use AI for meeting transcription, tax analysis, and planning recommendations. Most solutions send raw client data — names, SSNs, addresses — to cloud AI providers. pw-redact sits between your data and the AI model, stripping PII while preserving the financial data models need to work.

The problem:

"John Smith's AGI is $425,000. SSN: 123-45-6789."
                    ↓ sent to AI model ↓
        AI sees real names, real SSNs — compliance violation

With pw-redact:

"John Smith's AGI is $425,000. SSN: 123-45-6789."
                    ↓ pw-redact /v1/redact ↓
"<PERSON_1> AGI is $425,000. SSN: <US_SSN_1>."
                    ↓ sent to AI model ↓
        AI sees placeholders — PII never leaves your infrastructure
                    ↓ pw-redact /v1/rehydrate ↓
        Original names restored for advisor display

How It Works

pw-redact uses a four-layer architecture to catch PII while preserving financial data:

Raw text (transcript, tax notes, mortgage docs)
        |
        v
   [ Security Layer ]
   Input validation, prompt injection detection, rate limiting
        |
        v
   pw-redact /v1/redact
   |-- Layer 1: Deterministic regex (SSN, CC, EIN, loan numbers, crypto keys...)
   |-- Layer 2: Presidio NLP (names, addresses, phone, email via spaCy NER)
   |-- Layer 3: Custom financial recognizers (CUSIP, account refs, policy numbers)
   |-- Layer 4: Allow-list (preserve $amounts, %, tax brackets, dates)
   `-- Returns: sanitized_text + redaction_manifest + security metadata
        |
        v
   Sanitized text -> your AI model (safe to send externally)
        |
        v
   pw-redact /v1/rehydrate
   |-- Takes: AI model output + redaction_manifest
   `-- Returns: output with original values restored

Key design principles:

Stateless — pw-redact stores nothing. No database, no disk writes. Manifests are returned to the caller.
Deterministic first — Regex patterns run before NLP. If regex catches it, NLP doesn't need to.
Financial data survives — Dollar amounts, percentages, tax brackets, basis points, and 60+ financial acronyms pass through intact.
Consistent placeholders — "John Smith" becomes <PERSON_1> everywhere in the document, so the AI model can track entity relationships.
Security built in — Input sanitization, prompt injection detection, output validation, and rate limiting are part of the pipeline, not bolted on.

For deep architecture details, see docs/architecture.md.

Quick Start

Install

# With pip
pip install -e ".[dev]"
python -m spacy download en_core_web_lg

# With uv (faster)
uv venv && uv pip install -e ".[dev]"
uv pip install en_core_web_lg@https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.8.0/en_core_web_lg-3.8.0-py3-none-any.whl

As a Library

from pw_redact.redactor.engine import PWRedactor
from pw_redact.rehydrator.engine import PWRehydrator

redactor = PWRedactor()
rehydrator = PWRehydrator()

# Redact PII
result = redactor.redact(
    "John Smith has SSN 123-45-6789. His AGI is $425,000.",
    context="meeting_transcript",
)

print(result.sanitized_text)
# "<PERSON_1> has SSN <US_SSN_1>. His AGI is $425,000."

# Rehydrate after AI processing
ai_output = "Based on the data, <PERSON_1> should consider a Roth conversion."
restored = rehydrator.rehydrate(ai_output, result.manifest.to_dict())
print(restored)
# "Based on the data, John Smith should consider a Roth conversion."

Different Document Contexts

# Meeting transcript — aggressive name detection
result = redactor.redact(text, context="meeting_transcript")

# Tax return — catches EINs, ITINs
result = redactor.redact(text, context="tax_return")

# Mortgage document — catches NMLS, loan numbers, APN, FHA case numbers
result = redactor.redact(text, context="mortgage")

# Real estate — catches parcel numbers, MLS, title references
result = redactor.redact(text, context="real_estate")

As an API Server

export PW_REDACT_API_KEY=your-secret-key
uvicorn pw_redact.main:app --host 0.0.0.0 --port 8080

# Redact
curl -X POST http://localhost:8080/v1/redact \
  -H "Authorization: Bearer your-secret-key" \
  -H "Content-Type: application/json" \
  -d '{"text": "John Smith SSN 123-45-6789", "context": "general"}'

# Health check (no auth required)
curl http://localhost:8080/v1/health

Docker

docker build -t pw-redact .
docker run -p 8080:8080 -e PW_REDACT_API_KEY=your-key pw-redact

API Reference

POST /v1/redact

Redact PII from text. Returns sanitized text, a manifest for rehydration, and security metadata.

Request:

{
  "text": "John Smith discussed his AGI of $425,000. SSN: 123-45-6789.",
  "context": "meeting_transcript"
}

Response:

{
  "sanitized_text": "<PERSON_1> discussed his AGI of $425,000. SSN: <US_SSN_1>.",
  "manifest": {
    "version": "1.0",
    "redaction_id": "red_a1b2c3d4",
    "placeholders": [
      {"placeholder": "<PERSON_1>", "original": "John Smith", "entity_type": "PERSON", "start": 0, "end": 10},
      {"placeholder": "<US_SSN_1>", "original": "123-45-6789", "entity_type": "US_SSN", "start": 47, "end": 58}
    ],
    "stats": {"entities_found": 2, "entities_by_type": {"PERSON": 1, "US_SSN": 1}, "text_length_original": 60, "text_length_sanitized": 58}
  },
  "security": {
    "input_sanitized": false,
    "sanitization_actions": [],
    "injection_detected": false,
    "injection_score": 0.0,
    "injection_patterns": [],
    "output_valid": true,
    "output_warnings": [],
    "request_id": "req_a1b2c3d4e5f6"
  }
}

Response headers:

X-Request-ID — Unique request identifier for tracing
X-Processing-Time-Ms — Server-side processing time

POST /v1/rehydrate

Restore original values from placeholders using the manifest.

Request:

{
  "text": "Based on <PERSON_1>'s data, a Roth conversion is recommended.",
  "manifest": {"version": "1.0", "redaction_id": "red_a1b2c3d4", "placeholders": [...]}
}

Response:

{
  "rehydrated_text": "Based on John Smith's data, a Roth conversion is recommended."
}

POST /v1/detect

Detect PII locations without redacting. Returns entity positions, types, and confidence scores. Useful for UI highlighting.

Request:

{
  "text": "John Smith's SSN is 123-45-6789.",
  "context": "general"
}

Response:

{
  "entities": [
    {"entity_type": "PERSON", "text": "John Smith", "start": 0, "end": 10, "score": 0.85},
    {"entity_type": "US_SSN", "text": "123-45-6789", "start": 20, "end": 31, "score": 0.90}
  ]
}

GET /v1/health

No authentication required.

{"status": "healthy", "version": "0.1.0", "models_loaded": true}

Supported Entity Types

Layer 1: Regex Detection (30 patterns)

All regex patterns run regardless of document context.

Category	Entity Type	Example
PII	`US_SSN`	123-45-6789
	`CREDIT_CARD`	4111 1111 1111 1111
	`EMAIL`	john@example.com
	`US_PHONE`	(610) 555-1234
	`EIN`	12-3456789
	`DATE_OF_BIRTH`	DOB: 03/15/1980, birthdate: 1980-03-15
	`ACCOUNT_NUMBER`	account #12345678
	`DRIVERS_LICENSE`	DL# A12345678
	`STREET_ADDRESS`	42 Oak Lane, 123 Main Street #201
	`US_ROUTING`	routing: 021000021
Secrets	`JWT`	eyJ... tokens
	`API_KEY`	api_key=..., STRIPE_API_KEY=...
	`PASSWORD`	password=..., passwd:...
	`SECRET_VALUE`	secret=..., SECRET_KEY=..., private_key=...
	`AUTH_TOKEN`	access_token=..., refresh_token=..., session_id=...
	`BEARER_TOKEN`	Bearer ...
	`DB_URL`	postgres://user:pass@host
	`MAGIC_LINK`	reset_link=https://...
Crypto	`CRYPTO_PRIVATE_KEY`	0x + 64 hex chars (ETH private keys)
	`CRYPTO_ADDRESS`	0x + 40 hex chars (ETH addresses)
	`CRYPTO_SEED`	Quoted seed phrases / mnemonics
Mortgage/RE	`NMLS_ID`	NMLS# 456789
	`LOAN_NUMBER`	Loan #MTG-2025-004821
	`MERS_MIN`	MERS# + 18 digits
	`FHA_CASE_NUMBER`	FHA case #123-4567890
	`PARCEL_NUMBER`	APN: 07-14-302-015
	`MLS_NUMBER`	MLS# PM23456789
	`FILE_REFERENCE`	escrow# NCS-123456-LA
System IDs	`CRM_ID`	client_id: 987654
	`PLATFORM_ID`	wallet_id=deadbeef... (hex/UUID)

Layer 2: Presidio NLP Detection

Presidio entities vary by context. spaCy en_core_web_lg provides the NER backbone.

Entity	Detected In
`PERSON`	All contexts
`LOCATION`	All contexts
`PHONE_NUMBER`	All contexts
`EMAIL_ADDRESS`	All contexts
`US_SSN`	All contexts (backup to regex)
`CREDIT_CARD`	All contexts (backup to regex)
`NRP`	meeting_transcript, mortgage
`CUSIP`	financial_notes
`ACCOUNT_REF`	All except general
`POLICY_NUMBER`	meeting_transcript, financial_notes, mortgage, real_estate

Layer 4: Financial Data Preserved

The allow-list ensures these are never redacted. See docs/allow-list-guide.md for details on customizing.

Type	Examples
Dollar amounts	$425,000, $50k, 95,000
Percentages	32%, 6.75%
Tax brackets	32% bracket, 24% rate
Basis points	250 bps, 50 bp
Planning years	2032, 2026
Ages	age 59.5, age 70.5
Form references	Form 1040, Schedule D, IRC S1015
Financial acronyms	AGI, RMD, QCD, 529, W2, 1099
Mortgage/RE terms	LTV, DTI, PMI, TILA, RESPA, HMDA, ALTA, FNMA, VOE

Security

pw-redact includes built-in security hardening for AI pipeline protection. For vulnerability reports, see SECURITY.md.

Input Validation

Max payload: 1MB / 50,000 lines (rejects with 413)
Strips null bytes, ASCII control characters (preserves \n and \t)
Strips invisible Unicode (zero-width spaces, bidi overrides, BOM)
Detects and removes base64-encoded blocks >200 chars (could hide injected instructions)
Strips HTML tags and <script> elements
Strips markdown image references with external URLs (![](http://...))
Normalizes excessive whitespace while preserving paragraph structure

Prompt Injection Detection

25 patterns across 7 categories: instruction override, identity manipulation, prompt extraction, injection keywords, encoded variants (leetspeak, spaced characters), role-play/persona, delimiter manipulation
Returns a confidence score (0.0-1.0) and list of matched patterns
Advisory only — flags suspicious input but does NOT block it. Advisors may legitimately paste client emails containing unusual text. The caller decides policy.
Injection data included in the /v1/redact response security key

Output Validation

Verifies no original PII values leaked into sanitized_text
Validates all placeholders match the <TYPE_N> format
Checks manifest structural integrity

Rate Limiting

In-memory token bucket per API key
Default: 60 requests/minute, 10 requests/second burst
Returns 429 with Retry-After header when exceeded
Configurable via RATE_LIMIT_RPM and RATE_LIMIT_BURST env vars

Request Tracing

X-Request-ID header on every response (UUID)
X-Processing-Time-Ms header for latency monitoring

Use Cases

Meeting transcript sanitization before AI summarization
Tax return data extraction with PII stripped
Mortgage/underwriting documents processed for AI analysis
Real estate/title documents with buyer/seller PII removed
Compliance-safe AI integration for RIAs, broker-dealers, and lenders
Client communication review before sending to AI for drafting
Crypto/DeFi operations with wallet addresses and private keys redacted

Testing

# Run all tests
pytest tests/ -v

# Run specific test file
pytest tests/test_regex_patterns.py -v
pytest tests/test_security.py -v

# Run with timing
pytest tests/ -v --tb=short

Current: 332 tests, 97% coverage, ~4s runtime (session-scoped spaCy model load).

Performance

Metric	Observed	Notes
Test suite	332 tests in ~4s (97% coverage)	spaCy model loaded once per session
Short text (<100 words)	<500ms	Regex + Presidio on warm engine
Long transcript (~500 words)	~1-2s	spaCy NER is the bottleneck
Cold start (Fly.io resume)	~8-10s	spaCy en_core_web_lg model load
Memory	~1.2GB steady-state	en_core_web_lg (~560MB) + Presidio + overhead
Docker image	~1.0GB	Python 3.12 slim + spaCy model

Deployment

pw-redact runs as a stateless container. See docs/deployment.md for full guide.

# Fly.io (quick start)
cp fly.toml.example fly.toml  # edit app name and region
fly apps create your-app-name
fly secrets set PW_REDACT_API_KEY=$(python3 -c "import secrets; print(secrets.token_urlsafe(48))")
fly deploy

Note: Use at least 2GB RAM. spaCy en_core_web_lg loads at ~560MB; 1GB VMs will OOM.

Project Structure

pw-redact/
|-- src/pw_redact/
|   |-- main.py                  # FastAPI app, routes, security middleware
|   |-- config.py                # pydantic-settings env var loading
|   |-- auth.py                  # API key authentication
|   |-- redactor/
|   |   |-- engine.py            # PWRedactor — orchestrates all 4 layers
|   |   |-- regex_patterns.py    # Layer 1: 30 deterministic regex patterns
|   |   |-- presidio_config.py   # Layer 2: Presidio + spaCy NLP setup
|   |   |-- financial_recognizers.py  # Layer 3: CUSIP, account ref, policy
|   |   |-- allow_list.py        # Layer 4: financial data preservation
|   |   `-- manifest.py          # Redaction manifest data structures
|   |-- rehydrator/
|   |   `-- engine.py            # Manifest-based placeholder restoration
|   |-- security/
|   |   |-- input_validator.py   # Sanitize dangerous input
|   |   |-- prompt_injection_detector.py  # Flag injection attempts
|   |   |-- output_validator.py  # Verify no PII leaks in output
|   |   `-- rate_limiter.py      # Token bucket rate limiting
|   `-- models/
|       |-- requests.py          # Pydantic request models
|       `-- responses.py         # Pydantic response models
|-- tests/
|   |-- test_regex_patterns.py   # 100+ regex pattern tests
|   |-- test_redactor.py         # Integration tests + round-trip
|   |-- test_allow_list.py       # Financial data preservation
|   |-- test_financial_recognizers.py
|   |-- test_security.py         # Input validation, injection, rate limiting
|   `-- fixtures/                # SYNTHETIC test data only
|-- docs/
|   |-- architecture.md          # Four-layer pipeline deep dive
|   |-- deployment.md            # Docker, Fly.io, Railway, AWS/GCP guide
|   `-- allow-list-guide.md      # How to customize financial preservation
|-- .github/
|   |-- workflows/ci.yml         # Lint + test on push/PR
|   |-- ISSUE_TEMPLATE/          # Bug report + feature request templates
|   `-- PULL_REQUEST_TEMPLATE.md # PR checklist
|-- SECURITY.md                  # Vulnerability disclosure policy
|-- CONTRIBUTING.md              # Code standards, PR process
|-- CHANGELOG.md                 # Version history
|-- CLAUDE.md                    # AI coding assistant build guide
|-- Dockerfile                   # Production container
|-- fly.toml.example             # Fly.io deployment template
`-- pyproject.toml               # Dependencies, build config, tool settings

Documentation

Document	Purpose
README.md	This file — overview, quick start, API reference
docs/architecture.md	Four-layer pipeline design, merge algorithm, security pipeline
docs/deployment.md	Docker, Fly.io, Railway, AWS/GCP deployment guide
docs/allow-list-guide.md	How to customize financial data preservation
CLAUDE.md	AI coding assistant guide — how this repo was built with Claude Code
SECURITY.md	Vulnerability reporting: security@protocolwealthllc.com
CONTRIBUTING.md	Code standards, adding patterns, PR process
CHANGELOG.md	Version history

Troubleshooting

spaCy model not found:

python -m spacy download en_core_web_lg
# Or with uv:
uv pip install en_core_web_lg@https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.8.0/en_core_web_lg-3.8.0-py3-none-any.whl

OOM on Fly.io / Docker: Use at least 2GB RAM. If constrained, switch to en_core_web_md (~40MB vs ~560MB):

export SPACY_MODEL=en_core_web_md
python -m spacy download en_core_web_md

False positive — financial data being redacted: Check if the value matches an allow-list pattern. If not, add one. See docs/allow-list-guide.md.

False negative — PII not caught: Open a bug report with synthetic example data (never real PII).

"max" detected as person name: Known spaCy NLP limitation with common-word names. Over-redacting is safer than under-redacting. Can be tuned with Presidio score thresholds in a future release.

Contributing

See CONTRIBUTING.md for code standards, test requirements, and PR process.

License

MIT — free for commercial use. See LICENSE.

Built By

Protocol Wealth LLC — SEC-registered investment adviser building transparent AI infrastructure for the advisory industry.

Built by Protocol Wealth LLC

Name		Name	Last commit message	Last commit date
Latest commit History 24 Commits
.github		.github
docs		docs
src/pw_redact		src/pw_redact
tests		tests
.env.example		.env.example
.gitignore		.gitignore
AGENTS.md		AGENTS.md
CHANGELOG.md		CHANGELOG.md
CLAUDE.md		CLAUDE.md
CONTRIBUTING.md		CONTRIBUTING.md
Dockerfile		Dockerfile
LICENSE		LICENSE
README.md		README.md
SECURITY.md		SECURITY.md
fly.toml.example		fly.toml.example
pyproject.toml		pyproject.toml

Folders and files

Latest commit

History

Repository files navigation

pw-redact

Why This Exists

How It Works

Quick Start

Install

As a Library

Different Document Contexts

As an API Server

Docker

API Reference

POST /v1/redact

POST /v1/rehydrate

POST /v1/detect

GET /v1/health

Supported Entity Types

Layer 1: Regex Detection (30 patterns)

Layer 2: Presidio NLP Detection

Layer 4: Financial Data Preserved

Security

Input Validation

Prompt Injection Detection

Output Validation

Rate Limiting

Request Tracing

Use Cases

Testing

Performance

Deployment

Project Structure

Documentation

Troubleshooting

Contributing

License

Built By

About

Topics

Resources

License

Contributing

Security policy

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages