Skip to content

refactor: first-class tabular redaction + Strategy/DefaultStrategy merge + test cleanup#149

Merged
martsokha merged 6 commits into
mainfrom
refactor/tabular-redaction
May 19, 2026
Merged

refactor: first-class tabular redaction + Strategy/DefaultStrategy merge + test cleanup#149
martsokha merged 6 commits into
mainfrom
refactor/tabular-redaction

Conversation

@martsokha
Copy link
Copy Markdown
Member

Summary

Six commits across three threads, all atop the location/read/redact codec API from #148.

1. First-class tabular redaction (`211f95f`)

Spreadsheets get their own modality. CSV no longer pretends to be text; XLSX no longer pretends to be CSV.

  • New `TabularHandler` capability trait + `BoxedTabularHandler` + `ContentHandle::Tabular` variant. Mirrors the text/image/audio shape.
  • CSV implements `TabularHandler` only — cells addressed by `(row_index, column_index)`. Row 0 = headers if present; intra-cell `start_offset`/`end_offset` optional (omitted = redact whole cell). The old byte-offset text path with its ~250 lines of `locate_cells` / `find_field_in_line` machinery is gone.
  • XLSX gets a tabular-only stub (real implementation pending).
  • Engine `RedactionApplicator` gains `build_tabular_redactions` parallel to text/image/audio. `Document::value_at(Location::Tabular)` now actually reads the cell.

Folder layout matches modality: `handler/text/` no longer holds spreadsheet handlers — they live in `handler/tabular/`.

Redaction code-review fixes folded in:

  • `entity_map` + `mapping_index` built once at top of `apply()` (was rebuilt per phase with O(n²) scans).
  • `chars().count()` in mask repeat (multi-byte grapheme bug).
  • Per-strategy audit placeholders: `[IMAGE_BLUR:5]` / `[IMAGE_BLOCK:#ff0000]` / `[AUDIO_SILENCE]` instead of a single `[IMAGE_REDACTED]` for everything.
  • Explicit per-variant strategy → output dispatch; `#[non_exhaustive]` fallthroughs warn-log instead of silently producing a generic placeholder.

2. Strategy / DefaultStrategy merge (`b0a59d3`)

`Strategy` and `DefaultStrategy` were structurally the same idea — "what to do per modality" — represented as two different types: an enum that picks one variant, and a struct that allows any combination. `DefaultStrategy::for_location` was the bridge translating one to the other.

Collapsed into a single `Strategy` struct with `Option<*Strategy>` per modality. Each per-modality enum gains `Default`:

  • `TextStrategy::default()` = `Replace { placeholder: "" }` (engine fills `[ENTITY_KIND]`)
  • `ImageStrategy::default()` = `Block { color: black }`
  • `AudioStrategy::default()` = `Silence`

`text_or_default`/`image_or_default`/`audio_or_default` accessors resolve to a concrete strategy. The applicator already filters by `entity.location` variant, so it calls the matching accessor directly — no enum dispatch needed.

`log_modality_mismatch` is gone. Every modality always resolves to a concrete strategy now — the concept of "rule wants text but entity is image" no longer exists. A rule that only specified an image strategy on a text entity now resolves to `TextStrategy::default()` and produces `[ENTITY_KIND]`.

`Strategy::is_reversible_for(&Location)` replaces the old `is_reversible()` — reversibility is inherently per-modality.

3. Test cleanup (4 commits)

134 tests removed across the workspace. Each fell into one of:

  • Tests derives: serde round-trip on simple structs, `Display` from `derive_more` or `strum`, `Default` derives, `Clone`/`Copy` checks.
  • Tests dependencies: `oxilangtag` wrapper, `strum` derives, `uuid` v7 generator, `infer` crate, the `csv` crate's quoting.
  • Trivial getters/constructor read-backs: `Location::as_text` variants, `builder_required_fields`, `ContentMetadata` accessors, http_error builder-set-fields tests, PDF `accessors`/`empty_handler`, etc.
  • Duplicates of nearby tests: `pattern_expression_roundtrip_glob` (duplicates `_regex`), `no_overlap_different_col` (duplicates `_row`), `load_detects_four_space_indent` (duplicates `two_space`).
  • Tests stdlib behavior: `Vec::is_empty` after `new()`, `Vec::find` returning `None`, `HashMap::contains_key` after one `insert`.
  • Registry-has-entries smoke checks the user explicitly flagged as worthless (`builtins_load`, `all_error_kinds_have_responses`, `resolver_builtins`).

Plus `DataReference` deleted (`da3cdb5`) — unused type with no callers anywhere in the workspace.

Per-crate test count deltas:

  • nvisy-codec: 93 → 63
  • nvisy-core: 25 → 11
  • nvisy-ontology: 120 → 65
  • nvisy-pattern: 66 → 57
  • nvisy-engine: 85 → 72
  • nvisy-server: 22 → 12

Commit walkthrough

  • `211f95f` — first-class tabular redaction (codec + engine, ~1244+/-711 across 17 files)
  • `981faf6` — codec test cleanup (-30 tests / -297 lines)
  • `f8ac8e1` — core/ontology test cleanup (-69 tests / -656 lines)
  • `da3cdb5` — drop unused DataReference (-66 lines)
  • `3d4e685` — pattern/provider/engine/server test cleanup (-35 tests / -616 lines)
  • `b0a59d3` — Strategy / DefaultStrategy merge (~297/-297 across 34 files)

Test plan

  • `cargo check --workspace --all-features` clean
  • `cargo clippy --workspace --all-features -- -D warnings` clean
  • `cargo test --workspace --all-features` — all green
  • `cargo doc --workspace --no-deps --all-features` — no new warnings

🤖 Generated with Claude Code

martsokha and others added 6 commits May 19, 2026 03:42
Spreadsheets now live in their own modality. Previously CSV pretended
to be text and XLSX pretended to be CSV; this PR makes both natively
tabular, with a new TabularHandler capability trait and a
ContentHandle::Tabular variant.

Codec changes:

- New TabularHandler trait in handler/tabular/ mirroring TextHandler's
  shape: locations(), read(&loc) -> Option<TextData>,
  redact(Redactions<TabularLocation, TabularRedaction>).
- BoxedTabularHandler type-erased wrapper.
- ContentHandle::Tabular(BoxedTabularHandler) — fifth variant.
  Spreadsheets route through decode_tabular() instead of decode_text().
- transform/tabular/apply.rs — apply_tabular_redactions() helper (the
  byte-level intra-cell replacement we deleted earlier, now driven).
- CsvHandler implements TabularHandler only — the prior
  TextHandler/byte-offset path is gone, taking ~250 lines of
  locate_cells/find_field_in_line scaffolding with it. Cells are
  addressed by (row_index, column_index); row 0 = headers if present,
  row 1+ = data rows. Intra-cell start/end offsets are optional;
  omitted means redact the whole cell.
- XlsxHandler implements TabularHandler-only (stub, real
  implementation pending).
- ContentHandle gains tabular_locations(), read_tabular(),
  apply_tabular_redactions().
- handler/text/ no longer contains csv_*.rs or xlsx_*.rs — they
  moved to handler/tabular/. Folder layout now matches modality.

Engine changes:

- Document gains collect_tabular_locations, read_tabular,
  apply_tabular_redactions. value_at(&Location) now dispatches
  Location::Tabular to read_tabular instead of returning None.
- RedactionApplicator gains build_tabular_redactions, parallel to
  the text/image/audio builders. Tabular entities with text
  strategies now actually apply (mask, replace, remove, hash,
  pseudonymize, encrypt-placeholder, tokenize-placeholder).

Redaction code-review cleanup (folded into the same PR since it
touches the same file):

- entity_map and mapping_index built once at the top of apply(),
  shared across all four build phases. Old code rebuilt them per
  phase and did O(n²) linear scans over redaction_map.entries.
- text_output mask repeat uses chars().count() instead of bytes()
  — a multi-byte grapheme like 'é' now produces one mask char, not
  two.
- text/image/audio strategy → output dispatch made explicit per
  variant. Non-exhaustive fallthroughs still exist (the strategy
  enums are #[non_exhaustive]) but each warns/logs instead of
  silently producing a generic placeholder.
- Image audit placeholders now encode the strategy:
  [IMAGE_BLUR:5], [IMAGE_BLOCK:#ff0000], [IMAGE_PIXELATE:8]
  instead of a single [IMAGE_REDACTED] for everything.
- Audio placeholders similarly: [AUDIO_SILENCE], [AUDIO_REMOVE].
- Modality-mismatch (e.g. Strategy::Image on a text entity) now
  logs at trace level instead of silently dropping.
- text_output / image_output / audio_output extracted to free
  functions; they no longer hide as ::Self::method calls.

Tests:

- 4 new end-to-end tabular tests on RedactionApplicator covering
  full-cell mask, partial-cell replace, remove → empty cell,
  and modality mismatch.
- 6 new TabularHandler tests on CsvHandler covering
  locations/read/redact across header and data rows.
- Codec: 87 → 93 tests pass.
- Engine: 80 → 85 tests pass.

cargo check / clippy / doc all clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
30 tests removed from codec across 12 files. Each dropped test fell
into one of three buckets:

- Trivial constructor/getter checks (Located/Span map+into, PDF
  accessors, encode_returns_raw_bytes, document_type_is_pdf).
- Macro stubs round-tripped through the macro's own field accesses
  (all WAV + MP3 tests).
- Duplicates of tests already exercised elsewhere
  (tabular apply, csv_loader load_*_round_trip, pdf_loader
  load_preserves_raw_bytes, txt_handler read_substring,
  json_loader load_detects_four_space_indent).

Codec test count: 93 → 63. Workspace tests still green, clippy clean.

Production-side duplication between apply_text_redactions and
apply_tabular_redactions (and between TextRedaction and
TabularRedaction) is deliberately deferred — the test suite no
longer pretends it's two separate algorithms, but the prod dedupe
is its own change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
69 tests removed across 26 files. Same bar as codec cleanup:

- Tests of derives (serde round-trip, Display, Default, Clone, Copy):
  ~30 tests. The derives are the contract; testing them tests the
  ecosystem, not us.
- Tests of dependencies (oxilangtag wrapper, strum, uuid v7 generator,
  infer crate, str::from_utf8): ~10 tests.
- Trivial getter/constructor read-backs (Location::as_text variants,
  builder_required_fields, ContentMetadata accessors, etc.): ~15 tests.
- Duplicates of nearby tests (pattern_expression_roundtrip_glob
  duplicates _regex, no_overlap_different_col duplicates _row,
  language_tag parse_with_region duplicates parse_simple, etc.): ~6
  tests.
- Tests of std-library behavior (Vec::is_empty, Vec::find, Vec::push):
  ~8 tests.

Also dropped four borderline cases by explicit user decision:
- workflow::validate_accepts_valid_configs (tests Default::default()
  through validator with no rules)
- time_span::midpoint (trivial integer arithmetic)
- bounding_box::area (one-line width * height)
- entity::source::ordering (had a 2ms sleep — flaky)

Test counts:
- nvisy-core: 25 → 11
- nvisy-ontology: 120 → 65

Workspace tests pass, clippy clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
DataReference was a "lightweight pointer into a content source" with
no callers anywhere in the workspace beyond its own re-export. The
type, the module, and the doc-comment bullet all go.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
35 tests removed across 14 files. Same bar as previous cleanups:

pattern (9 drops):
- Registry-has-entries smoke tests (builtins_load, builtins_load_and_are_nonempty, resolver_builtins).
- Trivial insert/get registry round-trips (dictionary_registry::registry_insert_and_get, pattern_registry::registry_insert_and_get).
- Engine builder smoke tests (default_engine_builds, builder_pattern_filter).
- Collection-wrapper smoke (allow_list_from_iterator, deny_list_insert_and_lookup).

provider (5 drops):
- run_params_accepts_valid_range — one happy-path code path with three values.
- OCR provider parse_response tests for aws_textract, azure_docai,
  google_vision, datalab_surya, paddle_paddlex — all pure
  serde-derive deserialization checks (testing serde, not our wire
  format mapping).

engine (11 drops):
- mask_counts_chars_not_bytes — used ASCII "John", didn't actually
  exercise the chars-not-bytes claim it documented.
- 9 pipeline config tests (serde-derive deserialization of TOML
  sections + version_parses_as_semver + validate_accepts_valid_config
  on Default::default()).
- registry/store::base_dir — trivial getter test.

server (10 drops):
- 6 http_error builder/setter tests (default_http_error,
  error_from_kind, error_with_context/message/resource,
  error_builder_chaining) — all builder-set-fields-read-back.
- std_error_trait — compile-time-only coercion check, no behavior.
- http_kind::all_error_kinds_have_responses — registry-has-entries
  smoke (Vec iteration + non-empty assertion).
- sunset::deprecate_builds_headers — HashMap::contains_key on a
  single insert; standard-library behavior.

Plus 8 unused-import cleanups in handler/loader test modules that
imported `futures::StreamExt` / `TextHandler` / `TabularHandler` /
`Error` / `JsonPattern` for tests that no longer exist.

Test counts:
- nvisy-pattern: 66 → 57
- nvisy-provider: 48 → 43 (approx — varies by feature flags)
- nvisy-engine: 85 → 74
- nvisy-server: 22 → 12

Workspace tests pass, clippy clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Strategy and DefaultStrategy were structurally the same idea —
"what to do per modality" — represented as two different types: an
enum that picks one variant, and a struct that allows any combination.
DefaultStrategy::for_location was the bridge translating one to the
other, and the applicator's per-build_*_redactions methods each
contained a modality-mismatch branch that traced "rule wants text
but entity is image" without doing anything useful.

Collapsing them:

- Strategy is now a struct { text, image, audio } with Option<*Strategy>
  per modality, derives Default. The per-modality enums (TextStrategy,
  ImageStrategy, AudioStrategy) gain Default impls so unset fields
  resolve to a concrete strategy:
    - TextStrategy::default()  = Replace { placeholder: "" }
      → engine fills "[ENTITY_KIND]"
    - ImageStrategy::default() = Block { color: black }
    - AudioStrategy::default() = Silence
- DefaultStrategy and ResolvedStrategy are deleted. Strategy::merge
  composes rule-level with policy-default-level strategies (rule
  values win, defaults fill gaps).
- Strategy gains text_or_default/image_or_default/audio_or_default
  accessors that fall back to the modality's Default. The applicator
  already knows the entity's modality from entity.location, so it
  calls the matching accessor directly — no enum dispatch on a
  separate ResolvedStrategy wrapper.
- Strategy::is_reversible_for(&Location) replaces the old
  is_reversible(); reversibility is inherently per-modality.

The applicator's log_modality_mismatch is gone. Every modality always
resolves to a concrete strategy now — no mismatch concept exists.
Each build_*_redactions filters by Location variant (it always did)
and uses the matching accessor for the resolved per-modality strategy.

AuditEntryBuilder::for_entity takes &Location to resolve reversibility
at construction time.

While in there: deleted six divider/separator comments across
apply.rs, default.rs, csv_loader.rs.

Workspace tests pass (engine 74 → 72: dropped two modality-mismatch
tests whose premise no longer exists, replaced one with a test of the
new "image-only strategy on text entity uses TextStrategy::default()"
behavior). Clippy clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@martsokha martsokha self-assigned this May 19, 2026
@martsokha martsokha added codec nvisy-codec: loaders, transforms, handlers ontology nvisy-ontology: entities, policies, context refactor code restructuring without behavior change engine nvisy-engine: pipeline, registry, operations labels May 19, 2026
@martsokha martsokha merged commit f1e2b1e into main May 19, 2026
5 checks passed
@martsokha martsokha deleted the refactor/tabular-redaction branch May 19, 2026 03:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

codec nvisy-codec: loaders, transforms, handlers engine nvisy-engine: pipeline, registry, operations ontology nvisy-ontology: entities, policies, context refactor code restructuring without behavior change

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant