feat(sync): detect signature drift in rsync and revert/drop#235
feat(sync): detect signature drift in rsync and revert/drop#235mosheabr wants to merge 7 commits into
Conversation
Adds a new "Detect signature drift in rsync" step to sync-skills.yml that runs before the existing compliance check. Detects when the rsync brought in file changes under skills/<dir>/ without a matching skill.oms.sig refresh — landing those would break crypto verification on the next "Verify Signed Skills" sweep. Recovery rules: - Skill already on main: revert the catalog dir to its pre-rsync (signed) state. The signed copy stays live; the unsigned update is held until the source team runs /nvskills-ci on a follow-up PR and the next sync picks up a fresh sig. - Brand-new skill (no prior version on main): drop outright. Both cases write to /tmp/dropped-skills.txt with a "sig drift" reason. The "Track dropped skills" tracker-issue step is extended to partition entries by reason (missing-artifact vs sig-drift) and surface recovery steps for each. Motivation: hand-cherry-picking clean skills out of sync PRs has been a recurring task this week (PRs #226 -> #231, #232 -> #233, #234) because the source teams' content changes outpace their sig refresh. Automating the drop puts the recovery burden back on the source team where it belongs, and stops drift from propagating to Skills.sh / Codex / Hermes / OpenClaw via the catalog. Test cases the reviewer should think about: 1. Skill unchanged in rsync → not touched (no diff). 2. Skill content + sig both changed → no drift (both moved together). 3. Skill content changed, sig unchanged → drift detected. If on main, revert to HEAD. If new, drop outright. 4. New skill missing sig entirely → caught by existing compliance step (not this one). 5. Concurrent drift across multiple skills in one product → all detected; product still marked changed for PR title. Out of scope: full cryptographic verification of (sig, manifest) pairs — that's what the "Verify Signed Skills" workflow (PR #78) is meant to do periodically. This step catches the common authoring mistake (forgot to /nvskills-ci) at sync time, which is what we've been seeing all week. Signed-off-by: Moshe Abramovitch <moshea@nvidia.com>
P1 fixes:
- Untracked files were invisible to `git diff --name-only HEAD`,
which made the brand-new-skill drift branch unreachable and let
existing-skill drift slip through when the change was an added
file (not a modified one). Now unions diff + ls-files --others
to capture the full rsync delta.
- Compliance step truncated /tmp/dropped-skills.txt, wiping drift
records the new step had written. Switched to `touch` so both
steps append to the same file.
P2 fix:
- `git clean -fd` doesn't remove ignored files, so the revert path
could leave rsync-created ignored orphans behind. Use `git clean
-fdx --` to match the "restore pre-rsync state" contract.
P3 fixes:
- `grep -qx "- $product"` parsed the leading "-" as a flag (silent
failure -> always treated as "not found"). Use `grep -Fqx --`.
Fix applied to both the new drift step and the pre-existing
compliance step which had the same bug.
- Tracker-issue partition was sig-drift vs not-sig-drift, which
lumped orphan drops into "Missing artifacts." Now uses explicit
reason prefixes ("missing artifacts:", "orphan:", "sig drift:")
with a 4th "Other" catch-all so nothing is silently mis-bucketed.
- Title generation refactored to build from non-empty buckets so
it stays accurate as categories come and go.
Refactor:
- Tracker body now uses an `emit_section` shell function for the
three (now four) categories, dropping duplicated code.
Signed-off-by: Moshe Abramovitch <moshea@nvidia.com>
P2 fix: - Drift detection now drives the scan from "skill dirs the rsync touched" (git diff + ls-files --others, top-level skill-dir granularity) rather than "skill dirs containing SKILL.md after rsync." The latter missed cases where rsync deleted or renamed SKILL.md without refreshing skill.oms.sig — the existing signed copy on main would have been silently overwritten with an unsigned state. P3 fix: - Sync PR body now partitions /tmp/dropped-skills.txt by the same reason prefixes the tracker issue uses (missing artifacts:, sig drift:, orphan:, plus a catch-all). Drift entries no longer show up as "missing: sig drift: reverted..." nor get mis-labeled in the PR body. Signed-off-by: Moshe Abramovitch <moshea@nvidia.com>
P1 fix:
- Drift scan no longer awk-extracts "skills/$1/$2" from git output —
that collapsed multi-skill products to one entry, masking drift on
sibling skills and misclassifying new skills under existing
products as "existing on main." Now the rsync loop records the
exact destination dir (skills/<catalog_dir>) into
/tmp/rsynced-skill-dirs.txt as it goes, and the drift step reads
that list directly. Nesting-safe (handles both flat and nested
catalog layouts) and no path-parsing.
P2 fix:
- Loop now reads via `while IFS= read -r dir; do … done < file`
instead of `for dir in $touched_dirs`, so paths with spaces or
other word-splitting characters survive intact.
P3 fix:
- Title join used `IFS=', '; "${parts[*]}"`, but bash array
expansion only uses the first IFS char as separator — output was
"1 missing artifacts,2 sig drift" (no space). Now uses
printf "%s, " + trailing-strip, which gives the intended
", "-separated string.
Signed-off-by: Moshe Abramovitch <moshea@nvidia.com>
P3 fix from Codex 4th review: - sig_changed regex `(^|/)skill\.oms\.sig$` matched ANY path ending in `/skill.oms.sig`, including nested cases like `$dir/docs/skill.oms.sig`. Compliance only counts the root sig at `$dir/skill.oms.sig`, so drift detection must too. Otherwise a team adding a stray nested sig file would mask the stale root. - Replaced regex with exact-match `grep -Fx -- "$sig_path"` against the canonical root path. Signed-off-by: Moshe Abramovitch <moshea@nvidia.com>
Adds auto-assignment of the missing-compliance tracker issue to the catalog PIC (default: mosheabr) so any drift change reliably generates a GitHub notification — independent of repo-watch settings. Without this, the rolling issue update was silent unless the maintainer happened to be subscribed. Configurable via a CATALOG_TRACKER_ASSIGNEES secret holding a comma-separated list of GitHub handles, with a hardcoded fallback so the workflow stays self-contained. `--add-assignee` is idempotent and re-asserts assignment on every run, so a manual unassign doesn't permanently mute notifications. Signed-off-by: Moshe Abramovitch <moshea@nvidia.com>
#2 (MED) — gh issue create can skip tracker on bad assignee: The "Track dropped skills" step has continue-on-error: true, so a bad assignee wouldn't kill the whole sync — but it would skip tracker creation entirely, defeating the feature. Both branches (edit existing / create new) now follow the same tolerant pattern: create-without-assignees first, then loop --add-assignee in a trim + per-handle warning. A typo in CATALOG_TRACKER_ASSIGNEES no longer blocks the tracker from existing. #3 (LOW) — Hardcoded personal fallback: Reworked per Codex's catch that GitHub issue assignees are users, not teams — so the earlier "use a team handle" comment was wrong. If CATALOG_TRACKER_ASSIGNEES is unset we now skip assignment entirely and emit a ::warning title=Drift notifications inactive::. No personal fallback; a deactivated handle would silently mute notifications. The warning surfaces in the workflow log every run until the secret is set. #4 (LOW / clarification) — Deleted-sig case: Added an inline comment in the drift loop noting that an rsync that deletes skill.oms.sig alongside content edits registers the deletion in `git diff` (sig_changed matches → drift NOT flagged here), and the now-sigless skill falls through to the compliance step under "missing artifacts: skill.oms.sig". Defense in depth. #1 (HIGH per Sayali) — README version inconsistency on revert: NOT addressed in this commit — verified against .github/scripts/regenerate-readme.sh on main: the README table is now "Product | Description | Skills" (PR #215, commit 1104de0 on 2026-06-01 dropped the Source + Version columns). VERSIONS_FILE is declared but never consumed, and /tmp/sync-versions.txt is vestigial. So the exact "README will lie" risk Sayali raised is stale. Plan to send her the context separately and open a small follow-up PR to drop the dead sync-versions write + VERSIONS_FILE declaration so this trap doesn't catch a future reader. Signed-off-by: Moshe Abramovitch <moshea@nvidia.com>
|
Thanks for the review @sayalinvidia — pushed commit What changed
On #1 (HIGH) — README version inconsistencyI want to flag this carefully because the answer turns out to be different than expected. The README table on So the exact "README will advertise the drifted upstream SHA against now-reverted content" risk you raised is anchored to the pre-#215 table shape, not the current one. I didn't apply the purge logic in this PR for that reason. Two follow-ups I'd suggest in a separate small PR:
Would appreciate a follow-up pass when you have a window — happy to walk through anything that's not obvious. |
Cherry-picked from sync run 26962376888 (PR #234). Lands only the skills whose skill.oms.sig was refreshed alongside content changes, plus the regenerated README. Landing (5 CLEAN dirs): - nemotron-policy-generator (new — from PR #236 source content) - nemotron-customize (Raunak Kalani re-signed today) - nemo-data-designer-plugin (clean refresh) - nemo-evaluator-plugin (Seph's team re-signed) - earth2studio-deterministic-forecast (Nicholas Geneva re-signed) README regenerated — picks up: - "NVIDIA Digital Health Examples" -> "Digital Health" (PR #237) - nemotron-policy-generator row in the Nemotron Skills column Held (17 DRIFT — NOT in this PR — source teams need /nvskills-ci): - 13 x vss-* - nemoclaw-user-reference - nemo-automodel-distributed-training - dali-dynamic-mode - nemo-retriever Follow-up: - Close PR #234 (superseded) - Manual cherry-pick cycle stops once PR #235 (sig-drift detection) lands and the bot auto-reverts drifted skills inside the sync workflow. Signed-off-by: Moshe Abramovitch <moshea@nvidia.com>
Cherry-picked from sync run 26962376888 (PR NVIDIA#234). Lands only the skills whose skill.oms.sig was refreshed alongside content changes, plus the regenerated README. Landing (5 CLEAN dirs): - nemotron-policy-generator (new — from PR NVIDIA#236 source content) - nemotron-customize (Raunak Kalani re-signed today) - nemo-data-designer-plugin (clean refresh) - nemo-evaluator-plugin (Seph's team re-signed) - earth2studio-deterministic-forecast (Nicholas Geneva re-signed) README regenerated — picks up: - "NVIDIA Digital Health Examples" -> "Digital Health" (PR NVIDIA#237) - nemotron-policy-generator row in the Nemotron Skills column Held (17 DRIFT — NOT in this PR — source teams need /nvskills-ci): - 13 x vss-* - nemoclaw-user-reference - nemo-automodel-distributed-training - dali-dynamic-mode - nemo-retriever Follow-up: - Close PR NVIDIA#234 (superseded) - Manual cherry-pick cycle stops once PR NVIDIA#235 (sig-drift detection) lands and the bot auto-reverts drifted skills inside the sync workflow. Signed-off-by: Moshe Abramovitch <moshea@nvidia.com>
Summary
Adds a "Detect signature drift in rsync" step to
sync-skills.ymlthat catches the case where the source-repo team edits skill content without refreshingskill.oms.sig. Recovery is automatic and safe:mainand across downstream channels (Skills.sh, Codex, Claude Code, Hermes, OpenClaw). The unsigned update is held until the source team re-signs.main): drop outright — never land in a drift state.Both write to
/tmp/dropped-skills.txtwith asig drift:reason prefix. The sync PR body and the rollingmissing-compliancetracker issue now partition drops into Missing artifacts / Signature drift / Orphan / Other sections so each failure mode has its own recovery path.Why now
We've hand-cherry-picked three sync PRs this week (#226→#231, #232→#233, #234→pending) because source-repo teams edit skill content without re-running
/nvskills-ci. The unsigned content keeps reaching the bot sync PR; landing it would propagate to Skills.sh + Codex + Hermes + OpenClaw within ~15 minutes and break crypto verification on the next sweep. Automating the drop turns this into a self-service flow: the source team gets pinged, the catalog never goes through a broken-crypto state.How drift is detected
For each catalog dir the rsync actually targeted (recorded in
/tmp/rsynced-skill-dirs.txtduring the rsync loop, so we never path-parse):git diff --name-only HEAD ∪ git ls-files --others --exclude-standard— captures both modifications/deletions and untracked additions.$dir/skill.oms.sigis in that set (exact match viagrep -Fx --, not regex — nested sig files like$dir/docs/skill.oms.sigdon't mask a stale root).git ls-tree HEAD -- "$dir"non-empty) → revert viagit checkout HEAD -- "$dir/"+git clean -fdx -- "$dir/". New skill →rm -rf.Repo-owner visibility
The rolling
missing-compliancetracker issue is now auto-assigned (mosheabrby default; configurable viaCATALOG_TRACKER_ASSIGNEESsecret as a comma-separated list). Every drift change generates a GitHub notification regardless of repo-watch state. Idempotent — re-asserts on each run so manual unassign can't permanently mute notifications.Test cases the design has been validated against
docs/skill.oms.sig) → still drift (root path is what matters).SKILL.mdwithout sig refresh → still caught (driver iterates exact rsync targets, not post-rsyncfind SKILL.md).Bonus fix in this PR
The pre-existing
grep -qx "- $product"at the "mark changed components" line parsed the leading-as a grep option, so the de-dup check silently always fell through. Fixed in both the new drift step and the existing compliance step (grep -Fqx --). Happy to split into a separate fix-bug PR if reviewers prefer.Follow-ups (intentionally out of scope here)
git log -1 --pretty='%ae' -- <changed-files>on the source clone we already have during rsync). Anchored to commit identity, not a static contacts directory — survives team rotations. Tracked as backlog item CI Pipeline to verify skill counts match source repos #3.model_signing verifyis the job of the "Verify Signed Skills" workflow in PR ci: add Verify Signed Skills workflow #78 — the two are complementary (this catches authoring mistakes at sync time; ci: add Verify Signed Skills workflow #78 catches cryptographic mismatches periodically).Review history
4 rounds of Codex review on a private draft branch (
mosheabr/skills-drafts#1) prior to raising upstream. All findings addressed before this PR opened.Stats
.github/workflows/sync-skills.yml