refactor: first-class tabular redaction + Strategy/DefaultStrategy merge + test cleanup by martsokha · Pull Request #149 · nvisycom/runtime

martsokha · 2026-05-19T02:54:21Z

Summary

Six commits across three threads, all atop the location/read/redact codec API from #148.

1. First-class tabular redaction (`211f95f`)

Spreadsheets get their own modality. CSV no longer pretends to be text; XLSX no longer pretends to be CSV.

New `TabularHandler` capability trait + `BoxedTabularHandler` + `ContentHandle::Tabular` variant. Mirrors the text/image/audio shape.
CSV implements `TabularHandler` only — cells addressed by `(row_index, column_index)`. Row 0 = headers if present; intra-cell `start_offset`/`end_offset` optional (omitted = redact whole cell). The old byte-offset text path with its ~250 lines of `locate_cells` / `find_field_in_line` machinery is gone.
XLSX gets a tabular-only stub (real implementation pending).
Engine `RedactionApplicator` gains `build_tabular_redactions` parallel to text/image/audio. `Document::value_at(Location::Tabular)` now actually reads the cell.

Folder layout matches modality: `handler/text/` no longer holds spreadsheet handlers — they live in `handler/tabular/`.

Redaction code-review fixes folded in:

`entity_map` + `mapping_index` built once at top of `apply()` (was rebuilt per phase with O(n²) scans).
`chars().count()` in mask repeat (multi-byte grapheme bug).
Per-strategy audit placeholders: `[IMAGE_BLUR:5]` / `[IMAGE_BLOCK:#ff0000]` / `[AUDIO_SILENCE]` instead of a single `[IMAGE_REDACTED]` for everything.
Explicit per-variant strategy → output dispatch; `#[non_exhaustive]` fallthroughs warn-log instead of silently producing a generic placeholder.

2. Strategy / DefaultStrategy merge (`b0a59d3`)

`Strategy` and `DefaultStrategy` were structurally the same idea — "what to do per modality" — represented as two different types: an enum that picks one variant, and a struct that allows any combination. `DefaultStrategy::for_location` was the bridge translating one to the other.

Collapsed into a single `Strategy` struct with `Option<*Strategy>` per modality. Each per-modality enum gains `Default`:

`TextStrategy::default()` = `Replace { placeholder: "" }` (engine fills `[ENTITY_KIND]`)
`ImageStrategy::default()` = `Block { color: black }`
`AudioStrategy::default()` = `Silence`

`text_or_default`/`image_or_default`/`audio_or_default` accessors resolve to a concrete strategy. The applicator already filters by `entity.location` variant, so it calls the matching accessor directly — no enum dispatch needed.

`log_modality_mismatch` is gone. Every modality always resolves to a concrete strategy now — the concept of "rule wants text but entity is image" no longer exists. A rule that only specified an image strategy on a text entity now resolves to `TextStrategy::default()` and produces `[ENTITY_KIND]`.

`Strategy::is_reversible_for(&Location)` replaces the old `is_reversible()` — reversibility is inherently per-modality.

3. Test cleanup (4 commits)

134 tests removed across the workspace. Each fell into one of:

Tests derives: serde round-trip on simple structs, `Display` from `derive_more` or `strum`, `Default` derives, `Clone`/`Copy` checks.
Tests dependencies: `oxilangtag` wrapper, `strum` derives, `uuid` v7 generator, `infer` crate, the `csv` crate's quoting.
Trivial getters/constructor read-backs: `Location::as_text` variants, `builder_required_fields`, `ContentMetadata` accessors, http_error builder-set-fields tests, PDF `accessors`/`empty_handler`, etc.
Duplicates of nearby tests: `pattern_expression_roundtrip_glob` (duplicates `_regex`), `no_overlap_different_col` (duplicates `_row`), `load_detects_four_space_indent` (duplicates `two_space`).
Tests stdlib behavior: `Vec::is_empty` after `new()`, `Vec::find` returning `None`, `HashMap::contains_key` after one `insert`.
Registry-has-entries smoke checks the user explicitly flagged as worthless (`builtins_load`, `all_error_kinds_have_responses`, `resolver_builtins`).

Plus `DataReference` deleted (`da3cdb5`) — unused type with no callers anywhere in the workspace.

Per-crate test count deltas:

nvisy-codec: 93 → 63
nvisy-core: 25 → 11
nvisy-ontology: 120 → 65
nvisy-pattern: 66 → 57
nvisy-engine: 85 → 72
nvisy-server: 22 → 12

Commit walkthrough

`211f95f` — first-class tabular redaction (codec + engine, ~1244+/-711 across 17 files)
`981faf6` — codec test cleanup (-30 tests / -297 lines)
`f8ac8e1` — core/ontology test cleanup (-69 tests / -656 lines)
`da3cdb5` — drop unused DataReference (-66 lines)
`3d4e685` — pattern/provider/engine/server test cleanup (-35 tests / -616 lines)
`b0a59d3` — Strategy / DefaultStrategy merge (~297/-297 across 34 files)

Test plan

`cargo check --workspace --all-features` clean
`cargo clippy --workspace --all-features -- -D warnings` clean
`cargo test --workspace --all-features` — all green
`cargo doc --workspace --no-deps --all-features` — no new warnings

🤖 Generated with Claude Code

Spreadsheets now live in their own modality. Previously CSV pretended to be text and XLSX pretended to be CSV; this PR makes both natively tabular, with a new TabularHandler capability trait and a ContentHandle::Tabular variant. Codec changes: - New TabularHandler trait in handler/tabular/ mirroring TextHandler's shape: locations(), read(&loc) -> Option<TextData>, redact(Redactions<TabularLocation, TabularRedaction>). - BoxedTabularHandler type-erased wrapper. - ContentHandle::Tabular(BoxedTabularHandler) — fifth variant. Spreadsheets route through decode_tabular() instead of decode_text(). - transform/tabular/apply.rs — apply_tabular_redactions() helper (the byte-level intra-cell replacement we deleted earlier, now driven). - CsvHandler implements TabularHandler only — the prior TextHandler/byte-offset path is gone, taking ~250 lines of locate_cells/find_field_in_line scaffolding with it. Cells are addressed by (row_index, column_index); row 0 = headers if present, row 1+ = data rows. Intra-cell start/end offsets are optional; omitted means redact the whole cell. - XlsxHandler implements TabularHandler-only (stub, real implementation pending). - ContentHandle gains tabular_locations(), read_tabular(), apply_tabular_redactions(). - handler/text/ no longer contains csv_*.rs or xlsx_*.rs — they moved to handler/tabular/. Folder layout now matches modality. Engine changes: - Document gains collect_tabular_locations, read_tabular, apply_tabular_redactions. value_at(&Location) now dispatches Location::Tabular to read_tabular instead of returning None. - RedactionApplicator gains build_tabular_redactions, parallel to the text/image/audio builders. Tabular entities with text strategies now actually apply (mask, replace, remove, hash, pseudonymize, encrypt-placeholder, tokenize-placeholder). Redaction code-review cleanup (folded into the same PR since it touches the same file): - entity_map and mapping_index built once at the top of apply(), shared across all four build phases. Old code rebuilt them per phase and did O(n²) linear scans over redaction_map.entries. - text_output mask repeat uses chars().count() instead of bytes() — a multi-byte grapheme like 'é' now produces one mask char, not two. - text/image/audio strategy → output dispatch made explicit per variant. Non-exhaustive fallthroughs still exist (the strategy enums are #[non_exhaustive]) but each warns/logs instead of silently producing a generic placeholder. - Image audit placeholders now encode the strategy: [IMAGE_BLUR:5], [IMAGE_BLOCK:#ff0000], [IMAGE_PIXELATE:8] instead of a single [IMAGE_REDACTED] for everything. - Audio placeholders similarly: [AUDIO_SILENCE], [AUDIO_REMOVE]. - Modality-mismatch (e.g. Strategy::Image on a text entity) now logs at trace level instead of silently dropping. - text_output / image_output / audio_output extracted to free functions; they no longer hide as ::Self::method calls. Tests: - 4 new end-to-end tabular tests on RedactionApplicator covering full-cell mask, partial-cell replace, remove → empty cell, and modality mismatch. - 6 new TabularHandler tests on CsvHandler covering locations/read/redact across header and data rows. - Codec: 87 → 93 tests pass. - Engine: 80 → 85 tests pass. cargo check / clippy / doc all clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

30 tests removed from codec across 12 files. Each dropped test fell into one of three buckets: - Trivial constructor/getter checks (Located/Span map+into, PDF accessors, encode_returns_raw_bytes, document_type_is_pdf). - Macro stubs round-tripped through the macro's own field accesses (all WAV + MP3 tests). - Duplicates of tests already exercised elsewhere (tabular apply, csv_loader load_*_round_trip, pdf_loader load_preserves_raw_bytes, txt_handler read_substring, json_loader load_detects_four_space_indent). Codec test count: 93 → 63. Workspace tests still green, clippy clean. Production-side duplication between apply_text_redactions and apply_tabular_redactions (and between TextRedaction and TabularRedaction) is deliberately deferred — the test suite no longer pretends it's two separate algorithms, but the prod dedupe is its own change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

69 tests removed across 26 files. Same bar as codec cleanup: - Tests of derives (serde round-trip, Display, Default, Clone, Copy): ~30 tests. The derives are the contract; testing them tests the ecosystem, not us. - Tests of dependencies (oxilangtag wrapper, strum, uuid v7 generator, infer crate, str::from_utf8): ~10 tests. - Trivial getter/constructor read-backs (Location::as_text variants, builder_required_fields, ContentMetadata accessors, etc.): ~15 tests. - Duplicates of nearby tests (pattern_expression_roundtrip_glob duplicates _regex, no_overlap_different_col duplicates _row, language_tag parse_with_region duplicates parse_simple, etc.): ~6 tests. - Tests of std-library behavior (Vec::is_empty, Vec::find, Vec::push): ~8 tests. Also dropped four borderline cases by explicit user decision: - workflow::validate_accepts_valid_configs (tests Default::default() through validator with no rules) - time_span::midpoint (trivial integer arithmetic) - bounding_box::area (one-line width * height) - entity::source::ordering (had a 2ms sleep — flaky) Test counts: - nvisy-core: 25 → 11 - nvisy-ontology: 120 → 65 Workspace tests pass, clippy clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

DataReference was a "lightweight pointer into a content source" with no callers anywhere in the workspace beyond its own re-export. The type, the module, and the doc-comment bullet all go. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

35 tests removed across 14 files. Same bar as previous cleanups: pattern (9 drops): - Registry-has-entries smoke tests (builtins_load, builtins_load_and_are_nonempty, resolver_builtins). - Trivial insert/get registry round-trips (dictionary_registry::registry_insert_and_get, pattern_registry::registry_insert_and_get). - Engine builder smoke tests (default_engine_builds, builder_pattern_filter). - Collection-wrapper smoke (allow_list_from_iterator, deny_list_insert_and_lookup). provider (5 drops): - run_params_accepts_valid_range — one happy-path code path with three values. - OCR provider parse_response tests for aws_textract, azure_docai, google_vision, datalab_surya, paddle_paddlex — all pure serde-derive deserialization checks (testing serde, not our wire format mapping). engine (11 drops): - mask_counts_chars_not_bytes — used ASCII "John", didn't actually exercise the chars-not-bytes claim it documented. - 9 pipeline config tests (serde-derive deserialization of TOML sections + version_parses_as_semver + validate_accepts_valid_config on Default::default()). - registry/store::base_dir — trivial getter test. server (10 drops): - 6 http_error builder/setter tests (default_http_error, error_from_kind, error_with_context/message/resource, error_builder_chaining) — all builder-set-fields-read-back. - std_error_trait — compile-time-only coercion check, no behavior. - http_kind::all_error_kinds_have_responses — registry-has-entries smoke (Vec iteration + non-empty assertion). - sunset::deprecate_builds_headers — HashMap::contains_key on a single insert; standard-library behavior. Plus 8 unused-import cleanups in handler/loader test modules that imported `futures::StreamExt` / `TextHandler` / `TabularHandler` / `Error` / `JsonPattern` for tests that no longer exist. Test counts: - nvisy-pattern: 66 → 57 - nvisy-provider: 48 → 43 (approx — varies by feature flags) - nvisy-engine: 85 → 74 - nvisy-server: 22 → 12 Workspace tests pass, clippy clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Strategy and DefaultStrategy were structurally the same idea — "what to do per modality" — represented as two different types: an enum that picks one variant, and a struct that allows any combination. DefaultStrategy::for_location was the bridge translating one to the other, and the applicator's per-build_*_redactions methods each contained a modality-mismatch branch that traced "rule wants text but entity is image" without doing anything useful. Collapsing them: - Strategy is now a struct { text, image, audio } with Option<*Strategy> per modality, derives Default. The per-modality enums (TextStrategy, ImageStrategy, AudioStrategy) gain Default impls so unset fields resolve to a concrete strategy: - TextStrategy::default() = Replace { placeholder: "" } → engine fills "[ENTITY_KIND]" - ImageStrategy::default() = Block { color: black } - AudioStrategy::default() = Silence - DefaultStrategy and ResolvedStrategy are deleted. Strategy::merge composes rule-level with policy-default-level strategies (rule values win, defaults fill gaps). - Strategy gains text_or_default/image_or_default/audio_or_default accessors that fall back to the modality's Default. The applicator already knows the entity's modality from entity.location, so it calls the matching accessor directly — no enum dispatch on a separate ResolvedStrategy wrapper. - Strategy::is_reversible_for(&Location) replaces the old is_reversible(); reversibility is inherently per-modality. The applicator's log_modality_mismatch is gone. Every modality always resolves to a concrete strategy now — no mismatch concept exists. Each build_*_redactions filters by Location variant (it always did) and uses the matching accessor for the resolved per-modality strategy. AuditEntryBuilder::for_entity takes &Location to resolve reversibility at construction time. While in there: deleted six divider/separator comments across apply.rs, default.rs, csv_loader.rs. Workspace tests pass (engine 74 → 72: dropped two modality-mismatch tests whose premise no longer exists, replaced one with a test of the new "image-only strategy on text entity uses TextStrategy::default()" behavior). Clippy clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

martsokha and others added 6 commits May 19, 2026 03:42

martsokha self-assigned this May 19, 2026

martsokha added codec nvisy-codec: loaders, transforms, handlers ontology nvisy-ontology: entities, policies, context refactor code restructuring without behavior change engine nvisy-engine: pipeline, registry, operations labels May 19, 2026

martsokha merged commit f1e2b1e into main May 19, 2026
5 checks passed

martsokha deleted the refactor/tabular-redaction branch May 19, 2026 03:04

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refactor: first-class tabular redaction + Strategy/DefaultStrategy merge + test cleanup#149

refactor: first-class tabular redaction + Strategy/DefaultStrategy merge + test cleanup#149
martsokha merged 6 commits into
mainfrom
refactor/tabular-redaction

martsokha commented May 19, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

martsokha commented May 19, 2026

Summary

1. First-class tabular redaction (`211f95f`)

2. Strategy / DefaultStrategy merge (`b0a59d3`)

3. Test cleanup (4 commits)

Commit walkthrough

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant