Audit pass 2: enforce the fact-lock in the loop + stop scrub corrupting Unicode #7

Merged: ssamba1 merged 1 commit into main from fix/audit-pass-2-loop-and-scrub on Jun 26, 2026
@ssamba1 (Owner) commented Jun 26, 2026

Second deep audit pass, focused on the loop-critical / less-covered modules that pass 1 (#6) didn't drill into: the quality gate, the headless loop, the watermark scrubber, and env loading. Every fix has a regression test. 155 passed, 1 skipped, 0 failed; ruff clean.

🔴 Critical — the core guarantee was unenforced

"Citations/numbers locked byte-for-byte" is the project's #1 differentiator. But the similarity gate runs on masked text, so a rewrite that dropped or replaced a sentinel (losing the locked fact) still passed the gate — and restore() silently emitted text with the fact gone.

Reproduced: a rewrite turning ⟦HZ0001⟧ ("47%") into "nearly half" scored sim 0.965 ≥ 0.76 and passed.

Fix: untell_text() now rejects any candidate whose sentinel set ≠ the locked set (find_sentinels). The same gate guards the polish pass. SKILL.md gets an explicit per-rewrite sentinel check (the /untell path is Claude, not the headless loop), and preserve.py --restore warns loudly when locked spans are missing. Test: test_loop_rejects_sentinel_dropping_rewrite — a fact-dropping rewriter is attempted and rejected; Smith (2020) and 47% survive.
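
The failure mode and the fix can be sketched in a few lines. This is an illustration only: `SENTINEL_RE`, `accept`, and the `difflib`-based `similarity` are stand-ins assumed for the example, not the project's actual gate. Only the `⟦HZ0001⟧` token shape and the 0.76 bar come from the description above; the sample sentences are made up.

```python
import re
from difflib import SequenceMatcher

# Assumed sentinel shape, inferred from the ⟦HZ0001⟧ example in this PR.
SENTINEL_RE = re.compile(r"⟦HZ\d{4}⟧")


def find_sentinels(text: str) -> set[str]:
    """Return the set of sentinel tokens present in `text`."""
    return set(SENTINEL_RE.findall(text))


def similarity(a: str, b: str) -> float:
    """Stand-in similarity metric (assumption: the real gate uses something else)."""
    return SequenceMatcher(None, a, b).ratio()


def accept(masked: str, candidate: str, mapping: dict[str, str], sim_bar: float = 0.76) -> bool:
    """Accept a rewrite only if every locked sentinel survives AND similarity holds."""
    return (
        find_sentinels(candidate) == set(mapping)
        and similarity(masked, candidate) >= sim_bar
    )


mapping = {"⟦HZ0001⟧": "47%", "⟦HZ0002⟧": "Smith (2020)"}
masked = "According to ⟦HZ0002⟧, ⟦HZ0001⟧ of respondents preferred the new design."
dropped = "According to ⟦HZ0002⟧, nearly half of respondents preferred the new design."
kept = "According to ⟦HZ0002⟧, ⟦HZ0001⟧ of respondents favoured the new design."

# Similarity alone lets the fact-dropping rewrite through; the sentinel check stops it.
print(similarity(masked, dropped) >= 0.76)  # the old gate would have passed this
print(accept(masked, dropped, mapping))     # rejected: ⟦HZ0001⟧ is gone
print(accept(masked, kept, mapping))        # accepted: both sentinels intact
```

Comparing sets rather than counting tokens also rejects a rewrite that invents a sentinel that was never locked.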

🟠 High — scrub_hidden corrupted legitimate Unicode on every input

It runs by default on all input and was destroying real content:

| Input | Before | After |
|---|---|---|
| E=mc² | E=mc2 (NFKC) | E=mc² |
| ❤️ | (VS16 stripped) | ❤️ |
| 👨‍👩‍👧‍👦 | 4 separate emoji (ZWJ stripped) | intact |
| Arabic/Hebrew w/ bidi marks | layout garbled | preserved |

Fix: strip only genuine watermark carriers — zero-width chars, Unicode tag chars, C0/C1 controls, and orphan ZWJ (between non-emoji) — using NFC, not NFKC. Emoji ZWJ sequences, variation selectors, bidi marks, superscripts and ligatures are preserved; a watermark ZWJ between letters is still removed. Test: test_scrub_preserves_legitimate_unicode.
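
A minimal sketch of the conservative approach, under stated assumptions: `scrub_sketch`, `_CARRIERS`, and `_ORPHAN_ZWJ` are hypothetical names, the carrier set is a simplified subset, and "orphan ZWJ" is approximated as a ZWJ sitting between two word characters. The real `scrub_hidden` may draw these lines differently.

```python
import re
import unicodedata

# Assumed carrier set for this sketch: zero-width space, word joiner, BOM,
# Unicode tag characters, and C0/C1 controls other than tab, newline and CR.
_CARRIERS = re.compile(
    r"[\u200b\u2060\ufeff\U000e0000-\U000e007f"
    r"\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\x9f]"
)

# Heuristic (assumption): a ZWJ between two word characters is not part of an
# emoji sequence, because emoji code points are not matched by \w.
_ORPHAN_ZWJ = re.compile(r"(?<=\w)\u200d(?=\w)")


def scrub_sketch(text: str) -> str:
    """Strip likely watermark carriers, then normalize with NFC (not NFKC)."""
    text = _CARRIERS.sub("", text)
    text = _ORPHAN_ZWJ.sub("", text)
    return unicodedata.normalize("NFC", text)


family = "\U0001F468\u200d\U0001F469\u200d\U0001F467\u200d\U0001F466"
heart = "\u2764\ufe0f"                         # heart + VS16
hebrew = "\u05e9\u05dc\u05d5\u05dd\u200f"      # Hebrew word + right-to-left mark

print(scrub_sketch("E=mc\u00b2") == "E=mc\u00b2")    # superscript survives NFC
print(unicodedata.normalize("NFKC", "E=mc\u00b2"))   # NFKC folds it to E=mc2
print(scrub_sketch(family) == family)                # emoji ZWJ sequence intact
print(scrub_sketch("wa\u200dter"))                   # orphan ZWJ removed
```

NFC only composes canonically equivalent sequences, so it leaves compatibility characters such as superscripts and ligatures alone, which is why it is the safer default here.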

🟡 Medium / Low

  • confirm pass used the raw threshold instead of threshold − margin, so a noisy re-score inside the margin band wasn't demoted — now consistent with the pass test.
  • Added an honest rewrites counter (iterations counts loop cycles, =1 even when the input already passed with 0 rewrites); polish no longer double-scores.
  • _env.py: handle a UTF-8 BOM (utf-8-sig) and a shell export prefix in the zero-dep fallback parser. Test: test_load_env_handles_bom_and_export.
  • eval/benchmark.py: use UNTELL_ENABLE_RADAR (was the legacy HUMANIZE_ alias).
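
The confirm-pass fix in the first bullet is easiest to see with numbers. The sketch below assumes a lower detector score is better and that the pass test is `score <= threshold - margin`; the function names and the values 0.30, 0.05, and 0.28 are invented for illustration and are not taken from run.py.

```python
def passes(score_max: float, threshold: float, margin: float) -> bool:
    """Pass test: the detector max must clear the threshold by at least `margin`."""
    return score_max <= threshold - margin


def confirm_old(rescore_max: float, threshold: float, margin: float) -> bool:
    """Previous behaviour as described: confirm against the raw threshold."""
    return rescore_max <= threshold


def confirm_new(rescore_max: float, threshold: float, margin: float) -> bool:
    """Fixed behaviour: confirm with the same bar as the pass test."""
    return passes(rescore_max, threshold, margin)


threshold, margin = 0.30, 0.05
noisy_rescore = 0.28  # lands inside the margin band (0.25, 0.30]

print(confirm_old(noisy_rescore, threshold, margin))  # confirmed under the old rule
print(confirm_new(noisy_rescore, threshold, margin))  # demoted under the new rule
```

Using one predicate for both checks means a result cannot pass on the first score and then be confirmed by a weaker bar on the re-score.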

Verified false positive (not changed)

"All 6 commercial adapters crash the loop on a 401/429/500" — both score_text and verify() already wrap adapter calls in try/except, so a flaky API degrades gracefully. No change made.
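
For readers unfamiliar with the pattern, graceful degradation around adapter calls might look roughly like the following. This is a hypothetical sketch: `score_all`, the adapter callables, and the result layout are assumptions made for the example and do not reproduce the project's `score_text` or `verify()`.

```python
from typing import Callable, Optional


def score_all(text: str, adapters: dict[str, Callable[[str], float]]) -> dict:
    """Call every adapter; record failures instead of letting them abort the loop."""
    scores: dict[str, float] = {}
    errors: dict[str, str] = {}
    for name, adapter in adapters.items():
        try:
            scores[name] = adapter(text)
        except Exception as exc:  # e.g. an HTTP 401/429/500 raised by a client
            errors[name] = repr(exc)
    worst: Optional[float] = max(scores.values(), default=None)
    return {"scores": scores, "errors": errors, "max": worst}


def steady(text: str) -> float:
    return 0.2


def flaky(text: str) -> float:
    raise RuntimeError("HTTP 429")


result = score_all("some text", {"steady": steady, "flaky": flaky})
print(result["scores"])  # only the adapter that answered
print(result["errors"])  # the failure is recorded, not raised
```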

🤖 Generated with Claude Code

…ng Unicode

A second deep pass over the loop-critical, less-covered modules (quality gate, headless
loop, watermark scrubber, env loader) found real bugs in the core value proposition.
Fixes, each with a regression test:

CRITICAL - the "facts are locked byte-for-byte" guarantee was mechanically unenforced.
The quality gate compares MASKED text, so a rewrite that dropped or replaced a sentinel
(losing a locked citation/number/quote) still cleared the similarity gate and silently
lost the fact on restore. untell_text() now rejects any candidate whose sentinel set
!= the locked set (find_sentinels), so a fact-dropping rewrite can never win; the same
gate guards the optional polish pass. SKILL.md gains an explicit per-rewrite sentinel
check, and `preserve.py --restore` now warns loudly when locked spans are missing.

HIGH - scrub_hidden corrupted legitimate Unicode on every input.
It applied NFKC (E=mc^2 -> E=mc2, the fi ligature -> fi), stripped variation selectors
(heart-emoji lost its presentation), bidi marks (garbling Arabic/Hebrew layout), and
ZWJ (splitting the family emoji into four). Rewritten to strip only genuine watermark
carriers - zero-width chars, Unicode tag chars, C0/C1 controls, and *orphan* ZWJ
between non-emoji - while preserving emoji ZWJ sequences, variation selectors, bidi
marks, superscripts and ligatures (NFC, not NFKC).

MED / LOW
- run.py: the confirm pass used the raw threshold, not threshold - margin, so a re-score
  within the margin band was not demoted; now consistent with the pass test.
- run.py: added an honest `rewrites` counter (`iterations` counts loop cycles and is 1
  even when the input already passed with 0 rewrites); polish no longer double-scores.
- _env.py: handle a UTF-8 BOM (utf-8-sig) and a shell `export ` prefix in the zero-dep
  fallback parser.
- eval/benchmark.py: use UNTELL_ENABLE_RADAR (was the legacy HUMANIZE_ alias).

Verified: 155 passed, 1 skipped, 0 failed; ruff clean.

Note: the "commercial adapters crash the loop on an HTTP error" finding was a false
positive - both score_text and verify() already wrap adapter calls in try/except.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings June 26, 2026 03:28

Copilot AI left a comment

Pull request overview

This PR hardens the “fact-lock” guarantee in the headless untell loop (sentinel preservation) and fixes scrub_hidden so it no longer corrupts legitimate Unicode, while also improving env loading robustness and aligning benchmark env vars.

Changes:

  • Enforce sentinel preservation in the headless loop (and polish) and add a rewrites counter + confirm-pass threshold consistency.
  • Make Unicode scrubbing conservative (preserve emoji ZWJ sequences/VS16/bidi marks; NFC not NFKC) with regression tests.
  • Improve .env loading (UTF-8 BOM + export prefix) and switch benchmark radar flag to UNTELL_ENABLE_RADAR.
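
The BOM and `export` handling in the last item can be illustrated with a small zero-dependency parser. This is a sketch under assumptions: `parse_env` is a hypothetical name, and the quoting and comment rules are simplified guesses rather than the behaviour of the project's `_env.py`.

```python
def parse_env(raw: bytes) -> dict[str, str]:
    """Parse KEY=VALUE lines, tolerating a UTF-8 BOM and a shell `export ` prefix."""
    text = raw.decode("utf-8-sig")  # utf-8-sig drops a leading BOM if present
    env: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        if line.startswith("export "):
            line = line[len("export "):].lstrip()
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not an assignment
        env[key.strip()] = value.strip().strip("'\"")
    return env


sample = b"\xef\xbb\xbfexport API_KEY=abc\nOTHER='x y'\n# a comment\n"
print(parse_env(sample))
```

Decoding with plain `utf-8` would leave U+FEFF glued to the first key, so the first variable in a BOM-prefixed file would never be found by name.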

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 4 comments.

| File | Description |
|---|---|
| untell/SKILL.md | Adds explicit “verify every sentinel” instruction to prevent lock loss in the skill workflow. |
| untell/scripts/run.py | Adds sentinel-set gating in the loop/polish, confirm-threshold fix, and rewrites counter. |
| untell/scripts/preserve.py | Adds find_sentinels() and warns loudly when restore input is missing locked spans. |
| untell/attacks/unicode_tricks.py | Reworks scrub_hidden to avoid Unicode corruption; updates hidden-char counting. |
| untell/_env.py | Loads .env with utf-8-sig and tolerates export KEY=VALUE in fallback parser. |
| tests/test_run.py | Adds regression test that a sentinel-dropping rewriter is rejected by the loop. |
| tests/test_env.py | Adds BOM + export parsing regression test. |
| tests/test_attacks_more.py | Adds regression test that scrub preserves legitimate Unicode while removing watermark chars. |
| eval/benchmark.py | Uses UNTELL_ENABLE_RADAR env var instead of legacy HUMANIZE_ENABLE_RADAR. |

Comment thread untell/scripts/run.py
Comment on lines 168 to +176
 cand_score = score(candidate)
-if similarity(masked, candidate) >= sim_bar and cand_score["max"] <= best_score["max"]:
+# Accept only if the rewrite (a) keeps EVERY sentinel intact — a dropped or altered sentinel
+# would silently lose a locked citation/number/fact on restore, defeating the whole lock —
+# (b) holds the meaning-similarity gate, and (c) does not worsen the detector max.
+if (
+    find_sentinels(candidate) == set(mapping)
+    and similarity(masked, candidate) >= sim_bar
+    and cand_score["max"] <= best_score["max"]
+):
Comment thread untell/scripts/run.py
Comment on lines 197 to +205
 polished = surgical_substitute(best_masked, tier="lite", threshold=threshold)["text"]
-if score(polished)["max"] <= best_score["max"]:
-    best_masked = polished
-    best_score = score(best_masked)
+polished_score = score(polished)
+# Polish must clear the same gates as a rewrite: sentinels intact, meaning preserved,
+# detector max not worse. (Reuse polished_score; don't re-score and risk detector noise.)
+if (
+    find_sentinels(polished) == set(mapping)
+    and similarity(masked, polished) >= sim_bar
+    and polished_score["max"] <= best_score["max"]
+):
Comment thread untell/scripts/preserve.py
Comment on lines +133 to +140
+def find_sentinels(text: str) -> set[str]:
+    """Return the set of sentinel tokens (``⟦HZxxxx⟧``) present in ``text``.
+
+    Used by the loop to mechanically enforce the lock: a rewrite is only accepted if it still
+    contains *exactly* the sentinels it was given — neither dropping one (which would silently
+    lose a locked citation/number/fact on restore) nor inventing one.
+    """
+    return set(_SENTINEL_RE.findall(text))
Comment thread untell/attacks/unicode_tricks.py
Comment on lines 112 to 116
 def count_hidden(text: str) -> int:
     """How many invisible/homoglyph chars are present — a quick 'is this watermarked?' check."""
-    invisible = len(_INVISIBLE.findall(text))
+    invisible = len(_WATERMARK_CHARS.findall(text))
     homoglyphs = sum(1 for ch in text if ch in _UNHOMOGLYPH)
     return invisible + homoglyphs
@ssamba1 merged commit c7802eb into main on Jun 26, 2026
13 checks passed
@ssamba1 deleted the fix/audit-pass-2-loop-and-scrub branch on June 26, 2026 16:08