feat(export): stream large-footprint exports and batch anonymize#162
Open
jaylann wants to merge 2 commits into
Add memory-bounded, byte-identical streaming companions alongside the existing materializing paths so a large-footprint subject no longer OOMs:

- Exporter.iter_subject_records yields ExportRecords table by table off the cursor (yield_per) instead of building the whole ExportBundle tuple; export_subject is unchanged and now delegates through the same lazy core.
- effaced-s3 iter_object_records streams one object body at a time (respecting max_object_bytes); collect_object_records drains it.
- ErasureExecutor anonymize fetches PKs in bounded ordered+offset pages rather than all at once, with per-row surrogates (ADR 0007) intact.

No export or erasure output changes for any input: the streamed records equal the materialized bundle (same set and order) and the anonymized rows are identical; same EXPORT_REQUESTED/EXPORT_COMPLETED trail. Memory and throughput only.

Signed-off-by: Justin Lanfermann <Justin@Lanfermann.dev>
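The relationship between the streaming and materializing export paths can be sketched as follows. This is a minimal illustration, not the project's code: it uses the standard-library sqlite3 module with `fetchmany` as a stand-in for a `yield_per` cursor, and the table names, the `(table, row)` record shape, and the functions `make_demo_db`, `iter_records_sketch`, and `export_sketch` are all hypothetical.

```python
import sqlite3
from typing import Iterator, Tuple

# Hypothetical table names; the real manifest-driven table list is not shown in this PR.
TABLES = ("orders", "tickets")


def make_demo_db() -> sqlite3.Connection:
    """Build an in-memory database with two tables and two subjects."""
    conn = sqlite3.connect(":memory:")
    for table in TABLES:
        conn.execute(
            f"CREATE TABLE {table} (id INTEGER PRIMARY KEY, subject TEXT, body TEXT)"
        )
        conn.executemany(
            f"INSERT INTO {table} (subject, body) VALUES (?, ?)",
            [("s1" if i % 2 else "s2", f"{table}-{i}") for i in range(1, 8)],
        )
    return conn


def iter_records_sketch(
    conn: sqlite3.Connection, subject: str, chunk: int = 2
) -> Iterator[Tuple[str, tuple]]:
    """Yield (table, row) pairs table by table, holding at most `chunk` rows at once."""
    for table in TABLES:
        cur = conn.execute(
            f"SELECT id, body FROM {table} WHERE subject = ? ORDER BY id",
            (subject,),
        )
        while True:
            rows = cur.fetchmany(chunk)  # bounded fetch, analogous to yield_per
            if not rows:
                break
            for row in rows:
                yield (table, row)


def export_sketch(conn: sqlite3.Connection, subject: str) -> tuple:
    """Materializing path: drains the same lazy core, so the output cannot diverge."""
    return tuple(iter_records_sketch(conn, subject))
```

Because the materializing function only drains the generator, equality of the two outputs (same set, same order) holds by construction, which is the property the commit message claims for the real paths.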
jaylann commented Jun 20, 2026
jaylann (Owner, Author) left a comment
BLOCKER (would be REQUEST_CHANGES, but GitHub forbids it on one's own PR): the batched-anonymize paging silently changes what gets erased when an ANONYMIZE column overlaps the primary key (a representable manifest), which makes the "byte-identical for any input / MINOR" claim untrue for that case. See the inline comment on erasure_executor.py. The streaming Exporter and S3 iter_object_records paths look output-equivalent and well-tested. One non-blocking docstring nit on the resolver memory bound.
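A minimal reproduction of the failure mode, under stated assumptions: it uses the standard-library sqlite3 module rather than the project's executor, a hypothetical `users` table with a text primary key, and deterministic surrogates (`zz-0001`, ...) that always sort after the original keys so the skip is repeatable. Real surrogates are random, so which rows are skipped would vary from run to run.

```python
import sqlite3
from itertools import count


def make_users(n: int) -> sqlite3.Connection:
    """Hypothetical table whose primary key is itself an anonymize target."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id TEXT PRIMARY KEY, subject TEXT)")
    conn.executemany(
        "INSERT INTO users VALUES (?, 's1')",
        [(f"user-{i:02d}",) for i in range(1, n + 1)],
    )
    return conn


def anonymize_pk_with_offset(conn: sqlite3.Connection, batch_size: int) -> None:
    """Page by ORDER BY id / OFFSET while rewriting id: the ordering key moves mid-walk."""
    surrogate = count(1)
    done = 0
    while True:
        page = [
            row[0]
            for row in conn.execute(
                "SELECT id FROM users WHERE subject = 's1' "
                "ORDER BY id LIMIT ? OFFSET ?",
                (batch_size, done),
            )
        ]
        if not page:
            break
        for pk in page:
            # Each rewritten key jumps to the end of the ordering.
            conn.execute(
                "UPDATE users SET id = ? WHERE id = ?",
                (f"zz-{next(surrogate):04d}", pk),
            )
        done += len(page)


def surviving_original_keys(conn: sqlite3.Connection) -> list:
    """Original keys that were never rewritten, i.e. PII that survived erasure."""
    return [
        row[0]
        for row in conn.execute(
            "SELECT id FROM users WHERE id LIKE 'user-%' ORDER BY id"
        )
    ]
```

With six rows and a batch size of two, the second page starts two positions into an ordering that has already lost its first two rows, so `user-03` and `user-04` are never visited.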
The batched ANONYMIZE paged matched PKs with select().order_by(pk).offset(done): safe only while the ordering key is stable. But a PK column is a legal ANONYMIZE target (_table_steps emits it), and a String/Uuid PK draws a fresh unique surrogate, mutating the ordering key mid-walk so an OFFSET window skips rows and PII survives erasure (reviewer-caught HIGH on #162).

Page by a keyset cursor (where(pk > last)) when the PK is a single, non-anonymized column; otherwise capture every matching key in one up-front select and rewrite row by row (the prior all-keys-first behaviour, scoped to composite or self-anonymized PKs).

Erased output stays byte-identical to the materializing path; this remains a memory/throughput change (MINOR).

Adds test_anonymizing_a_string_primary_key_skips_no_rows_across_batches; aligns the exporter docstring (external memory bound), PROOFS, and the CLAUDE.md A3 learning.

Signed-off-by: Justin Lanfermann <Justin@Lanfermann.dev>
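The two strategies described in this commit can be sketched as below. This is an illustration only, assuming the standard-library sqlite3 module in place of the project's SQLAlchemy code; the `accounts` table, the surrogate formats, and the function names are hypothetical, and the keyset walk assumes non-empty text keys so that an empty string sorts before every real key.

```python
import sqlite3
from itertools import count


def make_accounts(n: int) -> sqlite3.Connection:
    """Hypothetical table with a single-column text primary key."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts (id TEXT PRIMARY KEY, subject TEXT, email TEXT)"
    )
    conn.executemany(
        "INSERT INTO accounts VALUES (?, 's1', ?)",
        [(f"user-{i:02d}", f"u{i}@example.com") for i in range(1, n + 1)],
    )
    return conn


def anonymize_email_by_keyset(conn: sqlite3.Connection, batch_size: int) -> int:
    """PK is not rewritten, so paging by id > last is stable and memory-bounded."""
    surrogate = count(1)
    last = ""  # assumption: every real key is a non-empty string
    rewritten = 0
    while True:
        page = [
            row[0]
            for row in conn.execute(
                "SELECT id FROM accounts WHERE subject = 's1' AND id > ? "
                "ORDER BY id LIMIT ?",
                (last, batch_size),
            )
        ]
        if not page:
            return rewritten
        for pk in page:
            conn.execute(
                "UPDATE accounts SET email = ? WHERE id = ?",
                (f"anon-{next(surrogate)}@invalid.example", pk),
            )
        rewritten += len(page)
        last = page[-1]  # advance the cursor past the last key seen


def anonymize_pk_capture_first(conn: sqlite3.Connection) -> int:
    """PK is itself rewritten: snapshot all matching keys, then rewrite row by row."""
    keys = [
        row[0]
        for row in conn.execute(
            "SELECT id FROM accounts WHERE subject = 's1' ORDER BY id"
        )
    ]
    for n, pk in enumerate(keys, start=1):
        conn.execute("UPDATE accounts SET id = ? WHERE id = ?", (f"zz-{n:04d}", pk))
    return len(keys)
```

The keyset walk keeps memory bounded by the batch size; the capture-first fallback gives that up, trading memory for correctness in exactly the cases where the ordering key would otherwise move.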
Item A3 — export/erasure memory safety at scale (MINOR, additive).
What
A large-footprint subject no longer materializes its whole export (or all anonymize PKs) in memory.
Erasure/export semantics
No output change for any input. The materialized export bundle and the anonymized rows are byte-identical to before — this is a memory/throughput change only. MINOR. The streaming path is proven equivalent to the materialized path (test_exporter_streaming.py), and the batched anonymize reuses the bleed/idempotency harness.
Checks
just check + just test green locally (1014 passed).