Audit pass 2: enforce the fact-lock in the loop + stop scrub corrupting Unicode #7
Merged
…ng Unicode

A second deep pass over the loop-critical, less-covered modules (quality gate, headless loop, watermark scrubber, env loader) found real bugs in the core value proposition. Fixes, each with a regression test:

CRITICAL - the "facts are locked byte-for-byte" guarantee was mechanically unenforced. The quality gate compares MASKED text, so a rewrite that dropped or replaced a sentinel (losing a locked citation/number/quote) still cleared the similarity gate and silently lost the fact on restore. untell_text() now rejects any candidate whose sentinel set != the locked set (find_sentinels), so a fact-dropping rewrite can never win; the same gate guards the optional polish pass. SKILL.md gains an explicit per-rewrite sentinel check, and `preserve.py --restore` now warns loudly when locked spans are missing.

HIGH - scrub_hidden corrupted legitimate Unicode on every input. It applied NFKC (E=mc^2 -> E=mc2, the fi ligature -> fi), stripped variation selectors (heart-emoji lost its presentation), bidi marks (garbling Arabic/Hebrew layout), and ZWJ (splitting the family emoji into four). Rewritten to strip only genuine watermark carriers - zero-width chars, Unicode tag chars, C0/C1 controls, and *orphan* ZWJ between non-emoji - while preserving emoji ZWJ sequences, variation selectors, bidi marks, superscripts and ligatures (NFC, not NFKC).

MED / LOW
- run.py: the confirm pass used the raw threshold, not threshold - margin, so a re-score within the margin band was not demoted; now consistent with the pass test.
- run.py: added an honest `rewrites` counter (`iterations` counts loop cycles and is 1 even when the input already passed with 0 rewrites); polish no longer double-scores.
- _env.py: handle a UTF-8 BOM (utf-8-sig) and a shell `export ` prefix in the zero-dep fallback parser.
- eval/benchmark.py: use UNTELL_ENABLE_RADAR (was the legacy HUMANIZE_ alias).

Verified: 155 passed, 1 skipped, 0 failed; ruff clean.

Note: the "commercial adapters crash the loop on an HTTP error" finding was a false positive - both score_text and verify() already wrap adapter calls in try/except.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Pull request overview
This PR hardens the “fact-lock” guarantee in the headless untell loop (sentinel preservation) and fixes scrub_hidden so it no longer corrupts legitimate Unicode, while also improving env loading robustness and aligning benchmark env vars.
Changes:
- Enforce sentinel preservation in the headless loop (and polish), add a `rewrites` counter, and make the confirm-pass threshold consistent.
- Make Unicode scrubbing conservative (preserve emoji ZWJ sequences/VS16/bidi marks; NFC not NFKC) with regression tests.
- Improve `.env` loading (UTF-8 BOM + `export` prefix) and switch benchmark radar flag to `UNTELL_ENABLE_RADAR`.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| untell/SKILL.md | Adds explicit “verify every sentinel” instruction to prevent lock loss in the skill workflow. |
| untell/scripts/run.py | Adds sentinel-set gating in the loop/polish, confirm-threshold fix, and rewrites counter. |
| untell/scripts/preserve.py | Adds find_sentinels() and warns loudly when restore input is missing locked spans. |
| untell/attacks/unicode_tricks.py | Reworks scrub_hidden to avoid Unicode corruption; updates hidden-char counting. |
| untell/_env.py | Loads .env with utf-8-sig and tolerates export KEY=VALUE in fallback parser. |
| tests/test_run.py | Adds regression test that a sentinel-dropping rewriter is rejected by the loop. |
| tests/test_env.py | Adds BOM + export parsing regression test. |
| tests/test_attacks_more.py | Adds regression test that scrub preserves legitimate Unicode while removing watermark chars. |
| eval/benchmark.py | Uses UNTELL_ENABLE_RADAR env var instead of legacy HUMANIZE_ENABLE_RADAR. |
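
The `_env.py` row describes a fallback parser that tolerates a UTF-8 BOM and a shell `export ` prefix. The following is only a minimal sketch of that idea, not the project's actual code: `parse_env_text` and `load_env_file` are hypothetical names, and the quote-stripping rule is an assumption added for illustration.

```python
from pathlib import Path


def parse_env_text(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping comments and accepting `export KEY=VALUE`."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        if line.startswith("export "):
            line = line[len("export "):].lstrip()  # drop the shell prefix
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not an assignment
        key, value = key.strip(), value.strip()
        # Assumption: strip one pair of matching surrounding quotes.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "'\"":
            value = value[1:-1]
        if key:
            values[key] = value
    return values


def load_env_file(path: str) -> dict[str, str]:
    # "utf-8-sig" reads plain UTF-8 as usual and drops a leading BOM if one exists,
    # so the first key is not silently prefixed with U+FEFF.
    return parse_env_text(Path(path).read_text(encoding="utf-8-sig"))
```

Reading with plain `utf-8` would leave the BOM attached to the first key, which is the failure mode the regression test in `tests/test_env.py` is described as covering.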
Comment on lines 168 to +176

```diff
  cand_score = score(candidate)
- if similarity(masked, candidate) >= sim_bar and cand_score["max"] <= best_score["max"]:
+ # Accept only if the rewrite (a) keeps EVERY sentinel intact — a dropped or altered sentinel
+ # would silently lose a locked citation/number/fact on restore, defeating the whole lock —
+ # (b) holds the meaning-similarity gate, and (c) does not worsen the detector max.
+ if (
+     find_sentinels(candidate) == set(mapping)
+     and similarity(masked, candidate) >= sim_bar
+     and cand_score["max"] <= best_score["max"]
+ ):
```
Comment on lines 197 to +205

```diff
  polished = surgical_substitute(best_masked, tier="lite", threshold=threshold)["text"]
- if score(polished)["max"] <= best_score["max"]:
-     best_masked = polished
-     best_score = score(best_masked)
+ polished_score = score(polished)
+ # Polish must clear the same gates as a rewrite: sentinels intact, meaning preserved,
+ # detector max not worse. (Reuse polished_score; don't re-score and risk detector noise.)
+ if (
+     find_sentinels(polished) == set(mapping)
+     and similarity(masked, polished) >= sim_bar
+     and polished_score["max"] <= best_score["max"]
+ ):
```
Comment on lines +133 to +140

```diff
+ def find_sentinels(text: str) -> set[str]:
+     """Return the set of sentinel tokens (``⟦HZxxxx⟧``) present in ``text``.
+
+     Used by the loop to mechanically enforce the lock: a rewrite is only accepted if it still
+     contains *exactly* the sentinels it was given — neither dropping one (which would silently
+     lose a locked citation/number/fact on restore) nor inventing one.
+     """
+     return set(_SENTINEL_RE.findall(text))
```
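
To make the sentinel gate concrete, here is a small self-contained sketch. It assumes a sentinel looks like `⟦HZ` followed by four digits and `⟧` (inferred from the `⟦HZ0001⟧` example in this PR; the real `_SENTINEL_RE` is not shown here), and `keeps_lock` is a hypothetical helper name, not a function from the repository.

```python
import re

# Assumption: sentinel shape inferred from the ⟦HZ0001⟧ example; the project's regex may differ.
SENTINEL_RE = re.compile(r"⟦HZ\d{4}⟧")


def find_sentinels(text: str) -> set[str]:
    """Return the set of sentinel tokens present in the text."""
    return set(SENTINEL_RE.findall(text))


def keeps_lock(candidate: str, mapping: dict[str, str]) -> bool:
    """True only if the candidate holds exactly the locked sentinels: none dropped, none invented."""
    return find_sentinels(candidate) == set(mapping)


# Hypothetical mapping from sentinel to the locked original span.
mapping = {"⟦HZ0001⟧": "47%", "⟦HZ0002⟧": "Smith (2020)"}

kept = "According to ⟦HZ0002⟧, about ⟦HZ0001⟧ of users agreed."
dropped = "According to ⟦HZ0002⟧, nearly half of users agreed."

print(keeps_lock(kept, mapping))     # sentinels intact
print(keeps_lock(dropped, mapping))  # ⟦HZ0001⟧ was paraphrased away
```

Because the comparison is between sets, it catches dropped and invented sentinels but not a sentinel that appears twice; whether that matters depends on how restore handles repeats.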
Comment on lines 112 to 116

```diff
  def count_hidden(text: str) -> int:
      """How many invisible/homoglyph chars are present — a quick 'is this watermarked?' check."""
-     invisible = len(_INVISIBLE.findall(text))
+     invisible = len(_WATERMARK_CHARS.findall(text))
      homoglyphs = sum(1 for ch in text if ch in _UNHOMOGLYPH)
      return invisible + homoglyphs
```
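
The conservative scrubbing this PR describes (strip watermark carriers, keep emoji ZWJ sequences, use NFC) could look roughly like the sketch below. This is not the repository's `scrub_hidden`: the exact carrier set, the `_emoji_like` code-point heuristic, and the name `scrub_sketch` are all assumptions made for illustration.

```python
import re
import unicodedata

ZWJ = "\u200d"

# Assumed carrier set: zero-width space / non-joiner, word joiner, BOM, the Unicode tag block,
# and C0/C1 controls other than tab, newline and carriage return.
_CARRIERS = re.compile(
    r"[\u200b\u200c\u2060\ufeff\U000e0000-\U000e007f\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\x9f]"
)


def _emoji_like(ch: str) -> bool:
    """Rough heuristic (an assumption): common emoji blocks plus the VS16 selector."""
    cp = ord(ch)
    return 0x1F000 <= cp <= 0x1FAFF or 0x2600 <= cp <= 0x27BF or ch == "\ufe0f"


def scrub_sketch(text: str) -> str:
    text = _CARRIERS.sub("", text)
    kept = []
    for i, ch in enumerate(text):
        if ch == ZWJ:
            prev_ok = i > 0 and _emoji_like(text[i - 1])
            next_ok = i + 1 < len(text) and _emoji_like(text[i + 1])
            if not (prev_ok and next_ok):
                continue  # orphan ZWJ (not joining two emoji): drop it
        kept.append(ch)
    # NFC composes canonically equivalent sequences but leaves compatibility
    # characters such as superscripts and ligatures untouched (unlike NFKC).
    return unicodedata.normalize("NFC", "".join(kept))


family = "\U0001F468\u200d\U0001F469\u200d\U0001F467\u200d\U0001F466"
print(scrub_sketch(family) == family)            # emoji ZWJ sequence kept
print(scrub_sketch("wa\u200bter\u200dmark"))     # zero-width space and orphan ZWJ removed
```

One known limitation of this heuristic: ZWJ also has legitimate uses in some non-emoji scripts, which a purely emoji-based neighbour test would strip.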
Second deep audit pass, focused on the loop-critical / less-covered modules that pass 1 (#6) didn't drill into: the quality gate, the headless loop, the watermark scrubber, and env loading. Every fix has a regression test. 155 passed, 1 skipped, 0 failed; ruff clean.
🔴 Critical: the core guarantee was unenforced

"Citations/numbers locked byte-for-byte" is the project's #1 differentiator. But the similarity gate runs on masked text, so a rewrite that dropped or replaced a sentinel (losing the locked fact) still passed the gate, and `restore()` silently emitted text with the fact gone.

Reproduced: a rewrite turning `⟦HZ0001⟧` ("47%") into "nearly half" scored sim 0.965 ≥ 0.76 and passed.

Fix: `untell_text()` now rejects any candidate whose sentinel set ≠ the locked set (`find_sentinels`). The same gate guards the polish pass. `SKILL.md` gets an explicit per-rewrite sentinel check (the `/untell` path is Claude, not the headless loop), and `preserve.py --restore` warns loudly when locked spans are missing.

Test: `test_loop_rejects_sentinel_dropping_rewrite`. A fact-dropping rewriter is attempted and rejected; `Smith (2020)` and `47%` survive.

🟠 High: `scrub_hidden` corrupted legitimate Unicode on every input

It runs by default on all input and was destroying real content:

| Input | Before | After |
|---|---|---|
| E=mc² | E=mc2 (NFKC) | E=mc² |
| ❤️ | ❤ (VS16 stripped) | ❤️ |
| 👨‍👩‍👧‍👦 | 👨👩👧👦 (ZWJ stripped, family split into four) | 👨‍👩‍👧‍👦 |

Fix: strip only genuine watermark carriers (zero-width chars, Unicode tag chars, C0/C1 controls, and orphan ZWJ between non-emoji), using NFC, not NFKC. Emoji ZWJ sequences, variation selectors, bidi marks, superscripts and ligatures are preserved; a watermark ZWJ between letters is still removed.

Test: `test_scrub_preserves_legitimate_unicode`.

🟡 Medium / Low

- `run.py`: the confirm pass used the raw threshold, not `threshold − margin`, so a noisy re-score inside the margin band wasn't demoted; now consistent with the pass test.
- `run.py`: added an honest `rewrites` counter (`iterations` counts loop cycles, =1 even when the input already passed with 0 rewrites); polish no longer double-scores.
- `_env.py`: handle a UTF-8 BOM (`utf-8-sig`) and a shell `export` prefix in the zero-dep fallback parser. Test: `test_load_env_handles_bom_and_export`.
- `eval/benchmark.py`: use `UNTELL_ENABLE_RADAR` (was the legacy `HUMANIZE_` alias).

Verified false positive (not changed)

"All 6 commercial adapters crash the loop on a 401/429/500": both `score_text` and `verify()` already wrap adapter calls in `try/except`, so a flaky API degrades gracefully. No change made.

🤖 Generated with Claude Code
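
For the confirm-pass fix listed under Medium / Low, the point is that the re-score must clear the same bar as the original pass test. A hedged sketch of that consistency follows; `clears_bar` and `confirm` are hypothetical names, the numbers are made up, and whether the real comparison is `<=` or `<` is an assumption.

```python
def clears_bar(score_max: float, threshold: float, margin: float) -> bool:
    """Pass test: the detector max must sit at or below threshold - margin."""
    return score_max <= threshold - margin


def confirm(first_score: float, rescore: float, threshold: float, margin: float) -> bool:
    """A result stays 'passed' only if the first score and the re-score clear the same bar."""
    return clears_bar(first_score, threshold, margin) and clears_bar(rescore, threshold, margin)


# Hypothetical values: bar is 0.5 - 0.05 = 0.45.
threshold, margin = 0.5, 0.05

# A re-score of 0.48 is under the raw threshold but inside the margin band,
# so a confirm step that compared against 0.5 alone would have kept it.
print(confirm(0.40, 0.48, threshold, margin))
print(confirm(0.40, 0.44, threshold, margin))
```

Using one helper for both checks is a simple way to keep the two comparisons from drifting apart again.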