Audit pass 2: enforce the fact-lock in the loop + stop scrub corrupting Unicode by ssamba1 · Pull Request #7 · ssamba1/untell

ssamba1 · 2026-06-26T03:28:13Z

Second deep audit pass, focused on the loop-critical / less-covered modules that pass 1 (#6) didn't drill into: the quality gate, the headless loop, the watermark scrubber, and env loading. Every fix has a regression test. 155 passed, 1 skipped, 0 failed; ruff clean.

🔴 Critical — the core guarantee was unenforced

"Citations/numbers locked byte-for-byte" is the project's #1 differentiator. But the similarity gate runs on masked text, so a rewrite that dropped or replaced a sentinel (losing the locked fact) still passed the gate — and restore() silently emitted text with the fact gone.

Reproduced: a rewrite turning ⟦HZ0001⟧ ("47%") into "nearly half" scored sim 0.965 ≥ 0.76 and passed.

Fix: untell_text() now rejects any candidate whose sentinel set ≠ the locked set (find_sentinels). The same gate guards the polish pass. SKILL.md gets an explicit per-rewrite sentinel check (the /untell path is Claude, not the headless loop), and preserve.py --restore warns loudly when locked spans are missing. Test: test_loop_rejects_sentinel_dropping_rewrite — a fact-dropping rewriter is attempted and rejected; Smith (2020) and 47% survive.
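
To make the failure mode concrete, here is a minimal sketch of the sentinel-set gate. It is not the project's code: the sentinel pattern (`⟦HZ` plus four digits), the `accept` helper, and the `difflib`-based `similarity` stand-in are assumptions for illustration; only `find_sentinels`, the 0.76 bar, and the "nearly half" example come from the description above.

```python
import re
from difflib import SequenceMatcher

# Assumed sentinel shape: "⟦HZ" + four digits + "⟧". The real pattern lives in preserve.py.
SENTINEL_RE = re.compile(r"⟦HZ\d{4}⟧")


def find_sentinels(text: str) -> set[str]:
    """Return the set of sentinel tokens present in text."""
    return set(SENTINEL_RE.findall(text))


def similarity(a: str, b: str) -> float:
    """Stand-in metric (character-level ratio); the project's metric may differ."""
    return SequenceMatcher(None, a, b).ratio()


def accept(masked: str, candidate: str, mapping: dict[str, str], sim_bar: float = 0.76) -> bool:
    """Hypothetical gate: the sentinel set must be unchanged AND similarity must hold."""
    return find_sentinels(candidate) == set(mapping) and similarity(masked, candidate) >= sim_bar


mapping = {"⟦HZ0001⟧": "47%", "⟦HZ0002⟧": "Smith (2020)"}
masked = "According to ⟦HZ0002⟧, ⟦HZ0001⟧ of users churned in the first month."
good = "⟦HZ0002⟧ reports that ⟦HZ0001⟧ of users churned within the first month."
bad = "According to ⟦HZ0002⟧, nearly half of users churned in the first month."

# The similarity check alone lets the fact-dropping rewrite through; the sentinel check does not.
print(similarity(masked, bad) >= 0.76, accept(masked, bad, mapping))
```

The design point is that the sentinel comparison is a set equality, so a rewrite that invents an extra sentinel is rejected as well as one that drops a sentinel.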

🟠 High — `scrub_hidden` corrupted legitimate Unicode on every input

It runs by default on all input and was destroying real content:

| Input | Before | After |
|---|---|---|
| `E=mc²` | `E=mc2` (NFKC) | `E=mc²` |
| `❤️` | `❤` (VS16 stripped) | `❤️` |
| `👨‍👩‍👧‍👦` | 4 separate emoji (ZWJ stripped) | intact |
| Arabic/Hebrew w/ bidi marks | layout garbled | preserved |

Fix: strip only genuine watermark carriers — zero-width chars, Unicode tag chars, C0/C1 controls, and orphan ZWJ (between non-emoji) — using NFC, not NFKC. Emoji ZWJ sequences, variation selectors, bidi marks, superscripts and ligatures are preserved; a watermark ZWJ between letters is still removed. Test: test_scrub_preserves_legitimate_unicode.
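
A conservative scrubber along these lines might look like the sketch below. It is an illustration, not the `scrub_hidden` implementation: the exact character class, the `_is_emojiish` code-point ranges, and the name `scrub_hidden_sketch` are assumptions, and the emoji test is a rough heuristic rather than a full Unicode emoji property lookup.

```python
import re
import unicodedata

# Assumed set of watermark carriers. ZWJ (U+200D) is handled separately below.
WATERMARK_CHARS = re.compile(
    r"[\u200b\u200c\u2060\ufeff"    # zero-width space, non-joiner, word joiner, BOM
    r"\U000e0000-\U000e007f"        # Unicode tag characters
    r"\x00-\x08\x0b\x0c\x0e-\x1f"   # C0 controls except tab, LF, CR
    r"\x7f-\x9f]"                   # DEL and C1 controls
)
ZWJ = "\u200d"
VS16 = "\ufe0f"


def _is_emojiish(ch: str) -> bool:
    """Rough heuristic: common emoji blocks only (an assumption, not a full emoji table)."""
    cp = ord(ch)
    return 0x1F000 <= cp <= 0x1FAFF or 0x2600 <= cp <= 0x27BF


def scrub_hidden_sketch(text: str) -> str:
    """Strip watermark carriers and orphan ZWJ, then normalize with NFC (not NFKC)."""
    text = WATERMARK_CHARS.sub("", text)
    out: list[str] = []
    for i, ch in enumerate(text):
        if ch == ZWJ:
            # Look past a variation selector to find the character the ZWJ joins to.
            prev = next((c for c in reversed(out) if c != VS16), "")
            nxt = text[i + 1] if i + 1 < len(text) else ""
            if not (prev and nxt and _is_emojiish(prev) and _is_emojiish(nxt)):
                continue  # orphan ZWJ between non-emoji: treat as a watermark
        out.append(ch)
    return unicodedata.normalize("NFC", "".join(out))


family = "\U0001F468\u200d\U0001F469\u200d\U0001F467\u200d\U0001F466"
print(scrub_hidden_sketch(family) == family)
print(scrub_hidden_sketch("wa\u200dter"))
```

NFC only composes canonically equivalent sequences, which is why superscripts and ligatures survive it while NFKC would fold them into plain letters and digits.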

🟡 Medium / Low

- The confirm pass used the raw threshold instead of threshold − margin, so a noisy re-score inside the margin band wasn't demoted; it is now consistent with the pass test.
- Added an honest `rewrites` counter (`iterations` counts loop cycles and is 1 even when the input already passed with 0 rewrites); polish no longer double-scores.
- `_env.py`: handle a UTF-8 BOM (utf-8-sig) and a shell `export ` prefix in the zero-dep fallback parser. Test: test_load_env_handles_bom_and_export.
- `eval/benchmark.py`: use UNTELL_ENABLE_RADAR (was the legacy HUMANIZE_ alias).
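
For the `_env.py` item, a zero-dependency fallback parser that tolerates a BOM and an `export ` prefix could be sketched as follows. The function names `parse_env_text` and `load_env_file`, and the handling of quotes and comments, are hypothetical; only the `utf-8-sig` decoding and the `export` tolerance come from the list above.

```python
from pathlib import Path


def parse_env_text(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blanks and comments (hypothetical helper)."""
    env: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Tolerate shell-style "export KEY=VALUE".
        if line.startswith("export "):
            line = line[len("export "):].lstrip()
        key, _, value = line.partition("=")
        value = value.strip()
        # Drop one pair of matching surrounding quotes, if present.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            value = value[1:-1]
        env[key.strip()] = value
    return env


def load_env_file(path: str) -> dict[str, str]:
    """Read with utf-8-sig so a leading UTF-8 BOM does not become part of the first key."""
    return parse_env_text(Path(path).read_text(encoding="utf-8-sig"))


# A BOM-prefixed payload, decoded the same way load_env_file would decode a file.
sample = b"\xef\xbb\xbfexport FOO=bar\nBAZ='q'\n# comment\n".decode("utf-8-sig")
print(parse_env_text(sample))
```

Without `utf-8-sig`, the first key would be read as `\ufeffexport FOO` or `\ufeffFOO`, so a lookup for `FOO` would miss.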

Verified false positive (not changed)

"All 6 commercial adapters crash the loop on a 401/429/500" — both score_text and verify() already wrap adapter calls in try/except, so a flaky API degrades gracefully. No change made.

🤖 Generated with Claude Code

…ng Unicode

A second deep pass over the loop-critical, less-covered modules (quality gate, headless loop, watermark scrubber, env loader) found real bugs in the core value proposition. Fixes, each with a regression test:

CRITICAL - the "facts are locked byte-for-byte" guarantee was mechanically unenforced. The quality gate compares MASKED text, so a rewrite that dropped or replaced a sentinel (losing a locked citation/number/quote) still cleared the similarity gate and silently lost the fact on restore. untell_text() now rejects any candidate whose sentinel set != the locked set (find_sentinels), so a fact-dropping rewrite can never win; the same gate guards the optional polish pass. SKILL.md gains an explicit per-rewrite sentinel check, and `preserve.py --restore` now warns loudly when locked spans are missing.

HIGH - scrub_hidden corrupted legitimate Unicode on every input. It applied NFKC (E=mc^2 -> E=mc2, the fi ligature -> fi), stripped variation selectors (heart-emoji lost its presentation), bidi marks (garbling Arabic/Hebrew layout), and ZWJ (splitting the family emoji into four). Rewritten to strip only genuine watermark carriers - zero-width chars, Unicode tag chars, C0/C1 controls, and *orphan* ZWJ between non-emoji - while preserving emoji ZWJ sequences, variation selectors, bidi marks, superscripts and ligatures (NFC, not NFKC).

MED / LOW
- run.py: the confirm pass used the raw threshold, not threshold - margin, so a re-score within the margin band was not demoted; now consistent with the pass test.
- run.py: added an honest `rewrites` counter (`iterations` counts loop cycles and is 1 even when the input already passed with 0 rewrites); polish no longer double-scores.
- _env.py: handle a UTF-8 BOM (utf-8-sig) and a shell `export ` prefix in the zero-dep fallback parser.
- eval/benchmark.py: use UNTELL_ENABLE_RADAR (was the legacy HUMANIZE_ alias).

Verified: 155 passed, 1 skipped, 0 failed; ruff clean.

Note: the "commercial adapters crash the loop on an HTTP error" finding was a false positive - both score_text and verify() already wrap adapter calls in try/except.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

This PR hardens the “fact-lock” guarantee in the headless untell loop (sentinel preservation) and fixes scrub_hidden so it no longer corrupts legitimate Unicode, while also improving env loading robustness and aligning benchmark env vars.

Changes:

- Enforce sentinel preservation in the headless loop (and polish) and add a rewrites counter + confirm-pass threshold consistency.
- Make Unicode scrubbing conservative (preserve emoji ZWJ sequences/VS16/bidi marks; NFC not NFKC) with regression tests.
- Improve .env loading (UTF-8 BOM + export prefix) and switch benchmark radar flag to UNTELL_ENABLE_RADAR.
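
The confirm-pass threshold consistency mentioned in the first item can be illustrated with a small sketch. The function names, the `<=` comparison direction, and the numeric values are assumptions chosen for illustration; the PR only states that the confirm pass now uses threshold − margin like the pass test.

```python
def clears_bar(detector_max: float, threshold: float, margin: float) -> bool:
    """Pass test (assumed form): the detector max must sit at or below threshold - margin."""
    return detector_max <= threshold - margin


def confirm_old(rescored_max: float, threshold: float) -> bool:
    """Previous behaviour as described: the confirm pass compared against the raw threshold."""
    return rescored_max <= threshold


def confirm_new(rescored_max: float, threshold: float, margin: float) -> bool:
    """Fixed behaviour: the confirm pass applies the same bar as the pass test."""
    return clears_bar(rescored_max, threshold, margin)


# A noisy re-score of 0.28 lands inside the margin band between 0.25 and 0.30.
print(confirm_old(0.28, 0.30), confirm_new(0.28, 0.30, 0.05))
```

With the old check, a text could pass initially at the stricter bar, drift into the margin band on re-score, and still be confirmed; sharing one predicate removes that inconsistency.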

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 4 comments.

| File | Description |
|---|---|
| untell/SKILL.md | Adds explicit “verify every sentinel” instruction to prevent lock loss in the skill workflow. |
| untell/scripts/run.py | Adds sentinel-set gating in the loop/polish, confirm-threshold fix, and `rewrites` counter. |
| untell/scripts/preserve.py | Adds `find_sentinels()` and warns loudly when restore input is missing locked spans. |
| untell/attacks/unicode_tricks.py | Reworks `scrub_hidden` to avoid Unicode corruption; updates hidden-char counting. |
| untell/_env.py | Loads `.env` with `utf-8-sig` and tolerates `export KEY=VALUE` in fallback parser. |
| tests/test_run.py | Adds regression test that a sentinel-dropping rewriter is rejected by the loop. |
| tests/test_env.py | Adds BOM + `export` parsing regression test. |
| tests/test_attacks_more.py | Adds regression test that scrub preserves legitimate Unicode while removing watermark chars. |
| eval/benchmark.py | Uses `UNTELL_ENABLE_RADAR` env var instead of legacy `HUMANIZE_ENABLE_RADAR`. |

        cand_score = score(candidate)
-        if similarity(masked, candidate) >= sim_bar and cand_score["max"] <= best_score["max"]:
+        # Accept only if the rewrite (a) keeps EVERY sentinel intact — a dropped or altered sentinel
+        # would silently lose a locked citation/number/fact on restore, defeating the whole lock —
+        # (b) holds the meaning-similarity gate, and (c) does not worsen the detector max.
+        if (
+            find_sentinels(candidate) == set(mapping)
+            and similarity(masked, candidate) >= sim_bar
+            and cand_score["max"] <= best_score["max"]
+        ):


            polished = surgical_substitute(best_masked, tier="lite", threshold=threshold)["text"]
-            if score(polished)["max"] <= best_score["max"]:
-                best_masked = polished
-                best_score = score(best_masked)
+            polished_score = score(polished)
+            # Polish must clear the same gates as a rewrite: sentinels intact, meaning preserved,
+            # detector max not worse. (Reuse polished_score; don't re-score and risk detector noise.)
+            if (
+                find_sentinels(polished) == set(mapping)
+                and similarity(masked, polished) >= sim_bar
+                and polished_score["max"] <= best_score["max"]
+            ):


+def find_sentinels(text: str) -> set[str]:
+    """Return the set of sentinel tokens (``⟦HZxxxx⟧``) present in ``text``.
+
+    Used by the loop to mechanically enforce the lock: a rewrite is only accepted if it still
+    contains *exactly* the sentinels it was given — neither dropping one (which would silently
+    lose a locked citation/number/fact on restore) nor inventing one.
+    """
+    return set(_SENTINEL_RE.findall(text))


 def count_hidden(text: str) -> int:
    """How many invisible/homoglyph chars are present — a quick 'is this watermarked?' check."""
-    invisible = len(_INVISIBLE.findall(text))
+    invisible = len(_WATERMARK_CHARS.findall(text))
    homoglyphs = sum(1 for ch in text if ch in _UNHOMOGLYPH)
    return invisible + homoglyphs


Copilot AI review requested due to automatic review settings June 26, 2026 03:28

Copilot started reviewing on behalf of ssamba1 June 26, 2026 03:28

Copilot AI reviewed Jun 26, 2026

ssamba1 merged commit c7802eb into main Jun 26, 2026
13 checks passed

ssamba1 deleted the fix/audit-pass-2-loop-and-scrub branch June 26, 2026 16:08

ssamba1 merged 1 commit into main from fix/audit-pass-2-loop-and-scrub
