dpdpstack

DPDP-compliant data erasure for Indian apps - handled in your code.

Indian developers keep hitting the same wall: DPDP says erase the user's data on withdrawal, but RBI (KYC, 5 yrs), PMLA, CERT-In (logs, 180 days) and the Companies Act say keep it. So teams hand-delete data across tables and can't prove they did, while enterprise tools "cost more than a month's revenue."

dpdpstack is a small, zero-egress library that handles the hard part:

  • Legal-hold-aware erasure - delete now, or defer under RBI/PMLA/CERT-In holds (with the basis recorded), then erase when the hold lapses.
  • PII anonymization - irreversibly null/hash PII while keeping the ledger row (referential integrity), the way teams actually solve this.
  • Certificate of Erasure - a verifiable, tamper-evident proof you erased (or are lawfully holding) a user's data.
  • Zero-egress - you perform the mutation in your own DB; the library only decides and records. Personal data never leaves your systems.

Not a cookie banner. Not a consultant. A deletion/retention engine for developers.

Documentation · Source · Hosted platform

Install

pip install dpdpstack-python-sdk                # core, no dependencies
pip install "dpdpstack-python-sdk[django]"      # + Django adapter
pip install "dpdpstack-python-sdk[sqlalchemy]"  # + SQLAlchemy adapter (FastAPI/Flask/…)
pip install "dpdpstack-python-sdk[crypto]"      # + signed certs & crypto-shred (PyJWT + cryptography)

Quickstart (framework-agnostic)

from dpdpstack import ErasureEngine, AuditLog, RetentionPolicy, Action, rbi_kyc, issue_certificate

engine = ErasureEngine(AuditLog())

# Normal purpose: hard-delete on withdrawal. Your delete runs in `executor`.
engine.request_erasure(
    subject="user_42",
    policy=RetentionPolicy(purpose="marketing", action=Action.DELETE),
    reason="consent_withdrawn",
    executor=lambda action: my_delete_user(42),
)

# KYC: RBI mandates 5y retention -> erasure is DEFERRED, not refused.
res = engine.request_erasure(subject="user_42", policy=rbi_kyc("kyc"), reason="consent_withdrawn")
print(res.status, res.legal_basis, res.erase_after)   # deferred  RBI KYC...  2031-...

cert = issue_certificate(engine.audit, "user_42", "marketing")  # verifiable proof
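
To make "deferred, not refused" concrete, here is a minimal stdlib sketch of how a legal-hold-aware decision could be computed. It is not dpdpstack's implementation: `HoldPolicy`, `Decision`, `decide`, and the `hold_starts` date the hold is counted from are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass(frozen=True)
class HoldPolicy:  # hypothetical stand-in, not dpdpstack.RetentionPolicy
    purpose: str
    legal_hold_days: int = 0
    legal_basis: Optional[str] = None


@dataclass(frozen=True)
class Decision:
    status: str                    # "erased" or "deferred"
    legal_basis: Optional[str]
    erase_after: Optional[date]


def decide(policy: HoldPolicy, hold_starts: date, today: date) -> Decision:
    """Erase now unless a statutory hold is still running."""
    hold_ends = hold_starts + timedelta(days=policy.legal_hold_days)
    if policy.legal_hold_days > 0 and today < hold_ends:
        # Deferred, not refused: record why, and when erasure becomes due.
        return Decision("deferred", policy.legal_basis, hold_ends)
    return Decision("erased", None, None)


kyc = HoldPolicy("kyc", legal_hold_days=1825, legal_basis="RBI KYC (5 yrs)")
marketing = HoldPolicy("marketing")

print(decide(kyc, date(2026, 1, 1), today=date(2026, 6, 1)).status)        # deferred
print(decide(marketing, date(2026, 1, 1), today=date(2026, 6, 1)).status)  # erased
print(decide(kyc, date(2026, 1, 1), today=date(2031, 6, 1)).status)        # erased
```

The third call shows the hold lapsing: the same request that was deferred becomes an erasure once `today` passes the end of the hold.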

Django (zero-egress, runs against your models)

# settings.py
INSTALLED_APPS += ["dpdpstack.contrib.django"]
# python manage.py migrate

from dpdpstack import RetentionPolicy, Action, null, redact, rbi_kyc
from dpdpstack.contrib.django.service import erase_instance, pii

# Declare a model's PII fields once with @pii - no pii_fields= on every call.
@pii(name=null, email=null, phone=redact(keep_last=4))
class User(models.Model):
    ...

# Hard delete + audit
erase_instance(user, policy=RetentionPolicy(purpose="marketing", action=Action.DELETE),
               subject=user.external_ref)

# Anonymize PII, keep the (regulated) row - uses the @pii declaration above
erase_instance(user, policy=RetentionPolicy(purpose="profile", action=Action.ANONYMIZE),
               subject=user.external_ref)

# KYC withdrawal -> deferred under RBI hold, nothing deleted, basis recorded
erase_instance(user, policy=rbi_kyc("kyc"), subject=user.external_ref)
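
For intuition about what anonymizing a row while keeping it means, the sketch below re-implements three field strategies on a plain dict using only the standard library. The names mirror the ones used above, but these are local illustrations, not the library's code; in particular the `salt` argument to `hashed` and the `anonymize` helper are assumptions.

```python
import hashlib
from typing import Any, Callable, Dict

Strategy = Callable[[Any], Any]


def null(value: Any) -> None:
    """Drop the value entirely."""
    return None


def redact(keep_last: int = 0, mask: str = "*") -> Strategy:
    """Mask everything except the last `keep_last` characters."""
    def apply(value: Any) -> str:
        text = str(value)
        # Values no longer than keep_last are masked completely.
        if keep_last <= 0 or len(text) <= keep_last:
            return mask * len(text)
        return mask * (len(text) - keep_last) + text[-keep_last:]
    return apply


def hashed(salt: bytes) -> Strategy:
    """One-way SHA-256 digest; the salt is an assumption of this sketch."""
    def apply(value: Any) -> str:
        return hashlib.sha256(salt + str(value).encode("utf-8")).hexdigest()
    return apply


def anonymize(row: Dict[str, Any], strategies: Dict[str, Strategy]) -> Dict[str, Any]:
    """Return a copy with PII fields transformed and every other field kept."""
    return {k: strategies[k](v) if k in strategies else v for k, v in row.items()}


row = {"id": 42, "name": "Asha", "phone": "9876543210", "ledger_balance": 500}
clean = anonymize(row, {"name": null, "phone": redact(keep_last=4)})
print(clean)
# {'id': 42, 'name': None, 'phone': '******3210', 'ledger_balance': 500}
```

The primary key and the ledger value survive untouched, which is what preserves referential integrity for the regulated row.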

FastAPI / Flask / any SQLAlchemy app ([sqlalchemy])

The same engine + DB-backed audit chain, against a SQLAlchemy Session. You map the audit entry once (you own the Base); the @pii declaration is shared with Django.

from sqlalchemy.orm import DeclarativeBase
from dpdpstack import RetentionPolicy, Action, null, redact, rbi_kyc
from dpdpstack.contrib.sqlalchemy.models import DpdpAuditEntryMixin
from dpdpstack.contrib.sqlalchemy.service import erase_instance, pii

class Base(DeclarativeBase): ...

class DpdpAuditEntry(Base, DpdpAuditEntryMixin):   # the hash-chained audit store
    __tablename__ = "dpdp_audit_entries"

@pii(name=null, email=null, phone=redact(keep_last=4))
class User(Base):
    __tablename__ = "users"
    ...

# Anonymize PII, keep the (regulated) row; your session, your transaction.
erase_instance(session, user, audit_model=DpdpAuditEntry, subject=user.external_ref,
               policy=RetentionPolicy(purpose="profile", action=Action.ANONYMIZE))

# KYC withdrawal -> deferred under RBI hold, nothing deleted, basis recorded
erase_instance(session, user, audit_model=DpdpAuditEntry, subject=user.external_ref,
               policy=rbi_kyc("kyc"))
session.commit()

Find your PII fields (scan)

You declare PII once with @pii(...) - but which fields are PII? scan finds them for you. It reads field names and types only (never a single row), matches them against an India-first catalog (Aadhaar, PAN, GST, UPI, phone, email, special-category…), and suggests an anonymize strategy for each. Output is advisory - you review it, then paste. Zero-egress and zero-dependency.

Django - scan your models and get pasteable @pii(...) blocks:

python manage.py dpdp_scan --format python        # or: text (default) | json
# or, without a manage.py:
dpdpstack scan --django --settings myproject.settings --app accounts --format python

# accounts.User
@pii(
    name=null,
    email=null,
    phone=redact(keep_last=4),
    aadhaar_number=hashed(),
)
class User(models.Model):
    ...

Re-running tags each field new (PII, not declared), covered (already declared), or drift (declared, but no longer looks like PII) - so it doubles as an ongoing audit.
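
The three tags can be pictured with a small name-matching sketch. The hint list, the token-based matching, and the `tag_fields` helper are assumptions for illustration; the real catalog and matching rules are richer and are not reproduced here.

```python
from typing import Dict, Iterable, Set

# Tiny illustrative hint list, not the library's catalog.
PII_HINTS = ("email", "phone", "aadhaar", "pan", "upi", "name")


def looks_like_pii(field: str) -> bool:
    """Match on whole underscore-separated tokens of the field name."""
    tokens = field.lower().split("_")
    return any(hint in tokens for hint in PII_HINTS)


def tag_fields(fields: Iterable[str], declared: Set[str]) -> Dict[str, str]:
    """Tag each field as new, covered, or drift; other fields are omitted."""
    tags: Dict[str, str] = {}
    for field in fields:
        pii = looks_like_pii(field)
        if pii and field in declared:
            tags[field] = "covered"   # PII and already declared
        elif pii:
            tags[field] = "new"       # PII, not declared yet
        elif field in declared:
            tags[field] = "drift"     # declared, but no longer looks like PII
    return tags


print(tag_fields(
    ["email", "phone", "aadhaar_number", "ledger_balance", "legacy_code"],
    declared={"email", "phone", "legacy_code"},
))
# {'email': 'covered', 'phone': 'covered', 'aadhaar_number': 'new', 'legacy_code': 'drift'}
```

Only names are inspected, never row values, which is the property that keeps a scan of this kind zero-egress.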

Anything else - a sample dict, an API payload, a column list:

from dpdpstack import anonymize_fields
from dpdpstack.detect import scan_mapping, suggest_strategies

suggest_strategies(["email", "phone", "pan", "ledger_balance"])
# {'email': <null>, 'phone': <redact>, 'pan': <hashed>}   # 'ledger_balance' ignored

record = {"email": "a@b.com", "phone": "9876543210", "ledger_balance": 500}
clean = anonymize_fields(record, suggest_strategies(record.keys()))

From the shell:

dpdpstack scan --keys email,phone,pan --format python      # comma-separated names
dpdpstack scan --dict sample.json --format python          # keys of a JSON object ('-' = stdin)

Bring your own catalog by passing a JSON file of the same shape to load_catalog(path=...).

Detect PII in values (and classify a breach)

The scanner above reads field names; detect_values reads values / free text - useful to confirm a column really holds PII, or to fill a breach report's nature field. Aadhaar is checked with the Verhoeff checksum and cards with Luhn, so random 12-/16-digit numbers don't false-positive. Local, zero-dependency.

from dpdpstack import detect_values, classify_breach_nature

detect_values("PAN ABCDE1234F, card 4111 1111 1111 1111")
# [ValueMatch(type='PAN', ...), ValueMatch(type='Payment Card', ...)]

classify_breach_nature("leaked rows: asha@bank.in, Aadhaar 2341 2341 2346, plus medical records")
# ['Email Address', 'Aadhaar Number', 'Health Data']   # for a Rule 7 breach report
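
The two checksums are standard algorithms, so they can be sketched independently of the library. The functions below are a plain stdlib rendering of Luhn and Verhoeff, not dpdpstack's own code; the function names are chosen for this example.

```python
from typing import List

# Verhoeff tables: D is the dihedral-group D5 multiplication table,
# P is the position-dependent permutation table.
D = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [1, 2, 3, 4, 0, 6, 7, 8, 9, 5],
    [2, 3, 4, 0, 1, 7, 8, 9, 5, 6], [3, 4, 0, 1, 2, 8, 9, 5, 6, 7],
    [4, 0, 1, 2, 3, 9, 5, 6, 7, 8], [5, 9, 8, 7, 6, 0, 4, 3, 2, 1],
    [6, 5, 9, 8, 7, 1, 0, 4, 3, 2], [7, 6, 5, 9, 8, 2, 1, 0, 4, 3],
    [8, 7, 6, 5, 9, 3, 2, 1, 0, 4], [9, 8, 7, 6, 5, 4, 3, 2, 1, 0],
]
P = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [1, 5, 7, 6, 2, 8, 3, 0, 9, 4],
    [5, 8, 0, 3, 7, 9, 6, 1, 4, 2], [8, 9, 1, 6, 0, 4, 3, 5, 2, 7],
    [9, 4, 5, 3, 1, 2, 6, 8, 7, 0], [4, 2, 8, 6, 5, 7, 3, 9, 0, 1],
    [2, 7, 9, 3, 8, 0, 6, 4, 1, 5], [7, 0, 4, 6, 9, 1, 3, 2, 5, 8],
]


def _digits(text: str) -> List[int]:
    """Keep digits only, so spaces and dashes in the input are ignored."""
    return [int(ch) for ch in text if ch.isdigit()]


def verhoeff_valid(number: str) -> bool:
    """True if the trailing digit is a correct Verhoeff check digit."""
    digits = _digits(number)
    if not digits:
        return False
    c = 0
    for i, digit in enumerate(reversed(digits)):
        c = D[c][P[i % 8][digit]]
    return c == 0


def luhn_valid(number: str) -> bool:
    """True if the number passes the Luhn mod-10 check."""
    digits = _digits(number)
    if len(digits) < 2:
        return False
    total = 0
    for i, digit in enumerate(reversed(digits)):
        if i % 2 == 1:          # double every second digit from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


print(verhoeff_valid("2341 2341 2346"))     # True
print(verhoeff_valid("2341 2341 2347"))     # False
print(luhn_valid("4111 1111 1111 1111"))    # True
print(luhn_valid("4111 1111 1111 1112"))    # False
```

A checksum only rules out numbers that cannot be valid; a passing value is a candidate, not confirmation that it belongs to a real person or card.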

Lint your retention policies (DPDP)

lint statically checks a RetentionPolicy for compliance smells - a legal hold with no recorded basis, a hold that will hard-delete a regulated row, a basis cited without a hold period, retention far past what's justified - each tied to a DPDP citation. Offline and advisory.

from dpdpstack import RetentionPolicy, Action, lint_policy

lint_policy(RetentionPolicy(purpose="kyc", legal_hold_days=1825, action=Action.DELETE))
# [ERROR E001: ... no legal_basis recorded ...  [DPDP Rules, 2025 - Rule 8],
#  WARNING W001: ... action=delete will hard-delete ... consider action=anonymize ...]
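
As a rough picture of what a static policy check involves, the sketch below reproduces the two findings shown above on a hypothetical `Policy` dataclass. Only the codes E001 and W001 come from the output above; the dataclasses, messages, and `exit_code` helper are assumptions, and the real linter covers more rules and attaches DPDP citations.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Policy:  # hypothetical stand-in, not dpdpstack.RetentionPolicy
    purpose: str
    action: str = "delete"
    legal_hold_days: int = 0
    legal_basis: Optional[str] = None


@dataclass(frozen=True)
class Finding:
    level: str     # "ERROR" or "WARNING"
    code: str
    message: str


def lint(policy: Policy) -> List[Finding]:
    """Return advisory findings; an empty list means the policy is clean."""
    findings: List[Finding] = []
    if policy.legal_hold_days > 0 and not policy.legal_basis:
        findings.append(Finding("ERROR", "E001", "legal hold has no legal_basis recorded"))
    if policy.legal_hold_days > 0 and policy.action == "delete":
        findings.append(Finding("WARNING", "W001", "hold ends in a hard delete; consider anonymize"))
    return findings


def exit_code(findings: List[Finding]) -> int:
    """Non-zero only when at least one ERROR is present, so warnings pass CI."""
    return 1 if any(f.level == "ERROR" for f in findings) else 0


bad = lint(Policy("kyc", action="delete", legal_hold_days=1825))
good = lint(Policy("kyc", action="anonymize", legal_hold_days=1825, legal_basis="RBI KYC"))

print([f.code for f in bad], exit_code(bad))    # ['E001', 'W001'] 1
print([f.code for f in good], exit_code(good))  # [] 0
```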

From the shell (exit code is non-zero if any error is found, so it drops into CI):

dpdpstack lint --presets                                   # the built-in presets are clean
dpdpstack lint --purpose kyc --legal-hold-days 1825 --action delete   # E001 + W001

dpdpstack.rules also exposes DPDP_RULES and STATUTORY_HOLDS (RBI/PMLA/CERT-In/Companies Act) as a citable reference.

score_policies(...) rolls the findings into a graded readiness report - a deterministic 0-100 score, letter grade, and tier across all your policies (great for a dashboard or an onboarding report):

from dpdpstack import score_policies, rbi_kyc, pmla

score_policies([rbi_kyc(), pmla()]).summary
# '100/100 (A+, exemplary) across 2 policies: 2 clean, 0 errors, 0 warnings.'

From the shell:

dpdpstack lint --presets --score      # ... Readiness: 100/100 (A+, exemplary) across 5 policies …

Retention-safe audit + offline verification

The audit log is hash-chained, so any change breaks verify(). But a retention log must be prunable - and a pruned chain no longer starts at sequence 1, which would break verification. Checkpoints fix that: snapshot a run of entries into an immutable, self-chaining Checkpoint, then prune; verification anchors to the checkpoint instead of the genesis.

log = AuditLog(JsonlAuditStore("audit.jsonl"))
# ... record events ...
cp = log.checkpoint(through_sequence=1000)   # immutable snapshot (persist it)
log.prune_through(1000)                       # drop the archived entries

log.verify_report([cp])      # VerifyResult(ok=True, checked=…, anchored_at=1000)
log.verify_report()          # ok=False, first_error_sequence pinpoints any tampering
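
The interplay of chaining, pruning, and checkpoints can be seen in a few lines of stdlib Python. This is a sketch of the general technique under assumed names (`entry_hash`, `append`, `verify`, and a checkpoint reduced to a sequence number plus a hash); the library's entry format and Checkpoint structure are not reproduced here.

```python
import hashlib
import json
from typing import Dict, List, Optional

GENESIS = "0" * 64  # assumed anchor for the very first entry


def entry_hash(prev_hash: str, sequence: int, payload: Dict) -> str:
    """Hash the previous hash together with this entry's content."""
    body = json.dumps({"prev": prev_hash, "seq": sequence, "payload": payload},
                      sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append(chain: List[Dict], payload: Dict) -> None:
    prev = chain[-1]["hash"] if chain else GENESIS
    seq = chain[-1]["seq"] + 1 if chain else 1
    chain.append({"seq": seq, "prev": prev, "payload": payload,
                  "hash": entry_hash(prev, seq, payload)})


def verify(chain: List[Dict], checkpoint: Optional[Dict] = None) -> bool:
    """Walk the chain from genesis, or from a checkpoint after pruning."""
    prev = checkpoint["hash"] if checkpoint else GENESIS
    seq = checkpoint["seq"] if checkpoint else 0
    for entry in chain:
        seq += 1
        if entry["seq"] != seq or entry["prev"] != prev:
            return False
        if entry["hash"] != entry_hash(prev, seq, entry["payload"]):
            return False
        prev = entry["hash"]
    return True


chain: List[Dict] = []
for n in range(5):
    append(chain, {"event": "erasure_requested", "subject": f"user_{n}"})

print(verify(chain))               # True: full chain from genesis

checkpoint = {"seq": chain[2]["seq"], "hash": chain[2]["hash"]}  # snapshot through #3
pruned = chain[3:]                                               # drop entries 1..3

print(verify(pruned))              # False: a pruned chain no longer starts at 1
print(verify(pruned, checkpoint))  # True: anchored at the checkpoint

pruned[0]["payload"]["subject"] = "user_999"
print(verify(pruned, checkpoint))  # False: the edit no longer matches the stored hash
```

The checkpoint itself has to be stored somewhere it cannot be silently rewritten, since after pruning it is the only anchor the remaining entries have.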

An auditor can verify a chain straight from storage - no backend, no API to trust:

dpdpstack verify-chain audit.jsonl --checkpoints cp.jsonl
# OK - verified 2400 entries (anchored at #1000).
#  (exits non-zero and names the broken entry if the chain was tampered with)

Crypto-shred PII in the audit log (optional, [crypto])

The chain normally holds no PII (subject is an opaque ref). When you must record PII inside an entry, seal it: the PII is encrypted into an opaque token that the entry hash covers. Verification runs on the ciphertext, so you can later destroy the key (right-to-erasure) - the payload becomes unreadable while the chain still verifies.

from dpdpstack.sealing import generate_seal_key

key = generate_seal_key()                       # keep secret; deleting it shreds the data
e = log.record("evidence", subject="user_42",
               private={"aadhaar": "2341 2341 2346"}, seal_key=key)
AuditLog.open_sealed(e, key)                     # -> {"aadhaar": "…"}  (with the key)
log.verify()                                     # True - even after the key is destroyed

Key rotation (zero-downtime): pass a list of keys, newest first. New entries seal with the first key; unsealing tries all, so older-key entries still open. The ciphertext is part of the entry hash, so chain entries are never re-encrypted - keep an old key around to read old entries, and retire it once they've been pruned or shredded.

new = generate_seal_key()
log.record("evidence", subject="user_43", private={…}, seal_key=[new, key])  # seals with `new`
AuditLog.open_sealed(e, [new, key])              # still opens the old-key entry

Push evidence to the hosted vault (optional)

Keep everything local, or push your tamper-evident chain to a vault (e.g. getdpdp.net) for an independent, server-timestamped, counter-signed copy. The push carries evidence only - opaque refs, event types, and hashes (plus any sealed ciphertext) - never PII, so it stays zero-egress. It's zero-dependency (stdlib), idempotent at the vault (re-pushing is a no-op), and the fire-and-forget variant never blocks or raises in your request path.

from dpdpstack import EvidenceClient

vault = EvidenceClient("https://getdpdp.net/api/v1", api_key="dpdp_sk_…", source="api")

vault.push(log)                # synchronous: -> {"stored": N, "chain_verified": True, …}
vault.push_background(log)     # fire-and-forget: returns immediately, errors swallowed
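
The fire-and-forget behaviour is a common pattern: run the call on a background thread and swallow whatever it raises. The sketch below shows that pattern in isolation with the standard library; `push_in_background` and `flaky_push` are hypothetical names, and nothing here reflects how `EvidenceClient` is actually implemented.

```python
import threading
from typing import Callable, List


def push_in_background(push: Callable[[], object]) -> threading.Thread:
    """Run `push` on a daemon thread and swallow any exception it raises."""
    def runner() -> None:
        try:
            push()
        except Exception:
            # Deliberately ignored: evidence delivery must not break a request.
            pass

    thread = threading.Thread(target=runner, daemon=True)
    thread.start()
    return thread


def flaky_push() -> None:  # hypothetical stand-in for a failing network call
    raise ConnectionError("vault unreachable")


delivered: List[str] = []
t1 = push_in_background(flaky_push)
t2 = push_in_background(lambda: delivered.append("ok"))

# Joined here only so the demo output is deterministic; a request handler
# would return without waiting.
t1.join()
t2.join()
print(delivered)  # ['ok']
```

The trade-off is that a swallowed error is invisible to the caller, so a periodic synchronous push is a reasonable complement when delivery has to be confirmed.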

Signed certificates (optional, [crypto])

The hash-chained Certificate of Erasure is tamper-evident on its own; add an RS256 signature so anyone can verify it with your public key (and nobody without your private key can forge it):

from dpdpstack import issue_certificate
from dpdpstack.signing import generate_keypair, issue_signed_certificate, verify_certificate

private_pem, public_pem = generate_keypair()      # keep private secret; publish public
cert = issue_certificate(engine.audit, "user_42", "marketing")
token = issue_signed_certificate(cert, private_pem)   # compact JWT
verify_certificate(token, public_pem)                 # -> {"valid": True, ...}

This is the basis for the hosted, counter-signed certificate at getdpdp.net - a regulator/auditor verifies it independently, and the issuer cannot fake it.

CLI (verify a certificate offline)

With the [crypto] extra installed, an auditor can verify a Certificate of Erasure from the shell - no code, just the cert and your public key:

dpdpstack keygen --out-dir ./keys                       # one-time: make a signing keypair
dpdpstack verify cert.jwt --public-key ./keys/cert_public.pem
# VALID - signature verified.
#   subject: user_42 · status: erased (delete) · chain ok: True

(python -m dpdpstack verify ... works too.)

Presets for the common conflicts

rbi_kyc() (5-yr hold, anonymize) · pmla() · cert_in_logs() (180-day log hold) · companies_act() (8-yr books of account) · third_schedule() (DPDP specified period). Or build your own RetentionPolicy(retention_days=…, legal_hold_days=…, legal_basis="…", action=…).

What's in the box

| Module | What |
| --- | --- |
| policies | RetentionPolicy + RBI/PMLA/CERT-In/Companies-Act/Third-Schedule presets |
| anonymize | null / hashed / redact / constant field strategies |
| audit | hash-chained log + checkpoints/pruning + verify_report; store (in-memory, JSONL, Django, SQLAlchemy) |
| erasure | ErasureEngine - legal-hold-aware resolve + your executor |
| certificate | issue_certificate() → verifiable Certificate of Erasure |
| detect | PII discovery - schema (scan_mapping) + values (detect_values, classify_breach_nature) |
| rules | DPDP knowledge pack + lint_policy() / dpdpstack lint |
| vault | EvidenceClient - push the chain to a hosted vault (evidence only, fire-and-forget) |
| sealing (extra) | crypto-shred PII in the chain - seal / unseal / AuditLog.open_sealed |
| signing (extra) | RS256-sign/verify a certificate - pip install dpdpstack-python-sdk[crypto] |
| contrib.django | model-backed audit store + erase_instance() + @pii(...) + dpdp_scan |
| contrib.sqlalchemy (extra) | the same for any SQLAlchemy app (FastAPI/Flask/…) |

CLI: dpdpstack scan · lint · verify-chain · verify · keygen (python -m dpdpstack …).

Status & scope

Alpha (0.6). The core is dependency-free and framework-agnostic; Django and SQLAlchemy adapters ship today. Hosted/managed version (dashboard, cross-system fan-out, certificate vault): getdpdp.net.

dpdpstack is tooling, not legal advice; you remain the Data Fiduciary. MIT licensed.