refactor: first-class tabular redaction + Strategy/DefaultStrategy merge + test cleanup#149
Merged
Merged
Conversation
Spreadsheets now live in their own modality. Previously CSV pretended to be text and XLSX pretended to be CSV; this PR makes both natively tabular, with a new TabularHandler capability trait and a ContentHandle::Tabular variant. Codec changes: - New TabularHandler trait in handler/tabular/ mirroring TextHandler's shape: locations(), read(&loc) -> Option<TextData>, redact(Redactions<TabularLocation, TabularRedaction>). - BoxedTabularHandler type-erased wrapper. - ContentHandle::Tabular(BoxedTabularHandler) — fifth variant. Spreadsheets route through decode_tabular() instead of decode_text(). - transform/tabular/apply.rs — apply_tabular_redactions() helper (the byte-level intra-cell replacement we deleted earlier, now driven). - CsvHandler implements TabularHandler only — the prior TextHandler/byte-offset path is gone, taking ~250 lines of locate_cells/find_field_in_line scaffolding with it. Cells are addressed by (row_index, column_index); row 0 = headers if present, row 1+ = data rows. Intra-cell start/end offsets are optional; omitted means redact the whole cell. - XlsxHandler implements TabularHandler-only (stub, real implementation pending). - ContentHandle gains tabular_locations(), read_tabular(), apply_tabular_redactions(). - handler/text/ no longer contains csv_*.rs or xlsx_*.rs — they moved to handler/tabular/. Folder layout now matches modality. Engine changes: - Document gains collect_tabular_locations, read_tabular, apply_tabular_redactions. value_at(&Location) now dispatches Location::Tabular to read_tabular instead of returning None. - RedactionApplicator gains build_tabular_redactions, parallel to the text/image/audio builders. Tabular entities with text strategies now actually apply (mask, replace, remove, hash, pseudonymize, encrypt-placeholder, tokenize-placeholder). Redaction code-review cleanup (folded into the same PR since it touches the same file): - entity_map and mapping_index built once at the top of apply(), shared across all four build phases. Old code rebuilt them per phase and did O(n²) linear scans over redaction_map.entries. - text_output mask repeat uses chars().count() instead of bytes() — a multi-byte grapheme like 'é' now produces one mask char, not two. - text/image/audio strategy → output dispatch made explicit per variant. Non-exhaustive fallthroughs still exist (the strategy enums are #[non_exhaustive]) but each warns/logs instead of silently producing a generic placeholder. - Image audit placeholders now encode the strategy: [IMAGE_BLUR:5], [IMAGE_BLOCK:#ff0000], [IMAGE_PIXELATE:8] instead of a single [IMAGE_REDACTED] for everything. - Audio placeholders similarly: [AUDIO_SILENCE], [AUDIO_REMOVE]. - Modality-mismatch (e.g. Strategy::Image on a text entity) now logs at trace level instead of silently dropping. - text_output / image_output / audio_output extracted to free functions; they no longer hide as ::Self::method calls. Tests: - 4 new end-to-end tabular tests on RedactionApplicator covering full-cell mask, partial-cell replace, remove → empty cell, and modality mismatch. - 6 new TabularHandler tests on CsvHandler covering locations/read/redact across header and data rows. - Codec: 87 → 93 tests pass. - Engine: 80 → 85 tests pass. cargo check / clippy / doc all clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
30 tests removed from codec across 12 files. Each dropped test fell into one of three buckets: - Trivial constructor/getter checks (Located/Span map+into, PDF accessors, encode_returns_raw_bytes, document_type_is_pdf). - Macro stubs round-tripped through the macro's own field accesses (all WAV + MP3 tests). - Duplicates of tests already exercised elsewhere (tabular apply, csv_loader load_*_round_trip, pdf_loader load_preserves_raw_bytes, txt_handler read_substring, json_loader load_detects_four_space_indent). Codec test count: 93 → 63. Workspace tests still green, clippy clean. Production-side duplication between apply_text_redactions and apply_tabular_redactions (and between TextRedaction and TabularRedaction) is deliberately deferred — the test suite no longer pretends it's two separate algorithms, but the prod dedupe is its own change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
69 tests removed across 26 files. Same bar as codec cleanup: - Tests of derives (serde round-trip, Display, Default, Clone, Copy): ~30 tests. The derives are the contract; testing them tests the ecosystem, not us. - Tests of dependencies (oxilangtag wrapper, strum, uuid v7 generator, infer crate, str::from_utf8): ~10 tests. - Trivial getter/constructor read-backs (Location::as_text variants, builder_required_fields, ContentMetadata accessors, etc.): ~15 tests. - Duplicates of nearby tests (pattern_expression_roundtrip_glob duplicates _regex, no_overlap_different_col duplicates _row, language_tag parse_with_region duplicates parse_simple, etc.): ~6 tests. - Tests of std-library behavior (Vec::is_empty, Vec::find, Vec::push): ~8 tests. Also dropped four borderline cases by explicit user decision: - workflow::validate_accepts_valid_configs (tests Default::default() through validator with no rules) - time_span::midpoint (trivial integer arithmetic) - bounding_box::area (one-line width * height) - entity::source::ordering (had a 2ms sleep — flaky) Test counts: - nvisy-core: 25 → 11 - nvisy-ontology: 120 → 65 Workspace tests pass, clippy clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
DataReference was a "lightweight pointer into a content source" with no callers anywhere in the workspace beyond its own re-export. The type, the module, and the doc-comment bullet all go. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
35 tests removed across 14 files. Same bar as previous cleanups: pattern (9 drops): - Registry-has-entries smoke tests (builtins_load, builtins_load_and_are_nonempty, resolver_builtins). - Trivial insert/get registry round-trips (dictionary_registry::registry_insert_and_get, pattern_registry::registry_insert_and_get). - Engine builder smoke tests (default_engine_builds, builder_pattern_filter). - Collection-wrapper smoke (allow_list_from_iterator, deny_list_insert_and_lookup). provider (5 drops): - run_params_accepts_valid_range — one happy-path code path with three values. - OCR provider parse_response tests for aws_textract, azure_docai, google_vision, datalab_surya, paddle_paddlex — all pure serde-derive deserialization checks (testing serde, not our wire format mapping). engine (11 drops): - mask_counts_chars_not_bytes — used ASCII "John", didn't actually exercise the chars-not-bytes claim it documented. - 9 pipeline config tests (serde-derive deserialization of TOML sections + version_parses_as_semver + validate_accepts_valid_config on Default::default()). - registry/store::base_dir — trivial getter test. server (10 drops): - 6 http_error builder/setter tests (default_http_error, error_from_kind, error_with_context/message/resource, error_builder_chaining) — all builder-set-fields-read-back. - std_error_trait — compile-time-only coercion check, no behavior. - http_kind::all_error_kinds_have_responses — registry-has-entries smoke (Vec iteration + non-empty assertion). - sunset::deprecate_builds_headers — HashMap::contains_key on a single insert; standard-library behavior. Plus 8 unused-import cleanups in handler/loader test modules that imported `futures::StreamExt` / `TextHandler` / `TabularHandler` / `Error` / `JsonPattern` for tests that no longer exist. Test counts: - nvisy-pattern: 66 → 57 - nvisy-provider: 48 → 43 (approx — varies by feature flags) - nvisy-engine: 85 → 74 - nvisy-server: 22 → 12 Workspace tests pass, clippy clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Strategy and DefaultStrategy were structurally the same idea —
"what to do per modality" — represented as two different types: an
enum that picks one variant, and a struct that allows any combination.
DefaultStrategy::for_location was the bridge translating one to the
other, and the applicator's per-build_*_redactions methods each
contained a modality-mismatch branch that traced "rule wants text
but entity is image" without doing anything useful.
Collapsing them:
- Strategy is now a struct { text, image, audio } with Option<*Strategy>
per modality, derives Default. The per-modality enums (TextStrategy,
ImageStrategy, AudioStrategy) gain Default impls so unset fields
resolve to a concrete strategy:
- TextStrategy::default() = Replace { placeholder: "" }
→ engine fills "[ENTITY_KIND]"
- ImageStrategy::default() = Block { color: black }
- AudioStrategy::default() = Silence
- DefaultStrategy and ResolvedStrategy are deleted. Strategy::merge
composes rule-level with policy-default-level strategies (rule
values win, defaults fill gaps).
- Strategy gains text_or_default/image_or_default/audio_or_default
accessors that fall back to the modality's Default. The applicator
already knows the entity's modality from entity.location, so it
calls the matching accessor directly — no enum dispatch on a
separate ResolvedStrategy wrapper.
- Strategy::is_reversible_for(&Location) replaces the old
is_reversible(); reversibility is inherently per-modality.
The applicator's log_modality_mismatch is gone. Every modality always
resolves to a concrete strategy now — no mismatch concept exists.
Each build_*_redactions filters by Location variant (it always did)
and uses the matching accessor for the resolved per-modality strategy.
AuditEntryBuilder::for_entity takes &Location to resolve reversibility
at construction time.
While in there: deleted six divider/separator comments across
apply.rs, default.rs, csv_loader.rs.
Workspace tests pass (engine 74 → 72: dropped two modality-mismatch
tests whose premise no longer exists, replaced one with a test of the
new "image-only strategy on text entity uses TextStrategy::default()"
behavior). Clippy clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Six commits across three threads, all atop the location/read/redact codec API from #148.
1. First-class tabular redaction (`211f95f`)
Spreadsheets get their own modality. CSV no longer pretends to be text; XLSX no longer pretends to be CSV.
Folder layout matches modality: `handler/text/` no longer holds spreadsheet handlers — they live in `handler/tabular/`.
Redaction code-review fixes folded in:
2. Strategy / DefaultStrategy merge (`b0a59d3`)
`Strategy` and `DefaultStrategy` were structurally the same idea — "what to do per modality" — represented as two different types: an enum that picks one variant, and a struct that allows any combination. `DefaultStrategy::for_location` was the bridge translating one to the other.
Collapsed into a single `Strategy` struct with `Option<*Strategy>` per modality. Each per-modality enum gains `Default`:
`text_or_default`/`image_or_default`/`audio_or_default` accessors resolve to a concrete strategy. The applicator already filters by `entity.location` variant, so it calls the matching accessor directly — no enum dispatch needed.
`log_modality_mismatch` is gone. Every modality always resolves to a concrete strategy now — the concept of "rule wants text but entity is image" no longer exists. A rule that only specified an image strategy on a text entity now resolves to `TextStrategy::default()` and produces `[ENTITY_KIND]`.
`Strategy::is_reversible_for(&Location)` replaces the old `is_reversible()` — reversibility is inherently per-modality.
3. Test cleanup (4 commits)
134 tests removed across the workspace. Each fell into one of:
Plus `DataReference` deleted (`da3cdb5`) — unused type with no callers anywhere in the workspace.
Per-crate test count deltas:
Commit walkthrough
Test plan
🤖 Generated with Claude Code