Skip to content

Version-stamp reference.md and warn on engine drift (closes #104)#106

Open
vahid-ahmadi wants to merge 1 commit into
mainfrom
feat/reference-version-stamp
Open

Version-stamp reference.md and warn on engine drift (closes #104)#106
vahid-ahmadi wants to merge 1 commit into
mainfrom
feat/reference-version-stamp

Conversation

@vahid-ahmadi

Copy link
Copy Markdown
Collaborator

Summary

Stamp the engine version into reference.md when it is generated, and warn at load time if the deployed engine has drifted from the version the doc was built against.

Why

reference.md is generated from the installed policyengine_uk_compiled engine and loaded whole into the cached system prompt, but it recorded no engine version. If the deployed engine drifts from the doc, the agent is grounded on stale signatures / parameter schemas and can write code that doesn't match the live engine — producing wrong or inconsistent generated code with no signal that anything is out of sync. This is the app's "stale-index" consistency risk.

What was added

  • build_reference.py — a robust _engine_version() helper (tries importlib.metadata for both hyphen/underscore distribution spellings, falls back to pe.__version__, then "unknown"). render() now emits, right after the intro paragraph:
    • a human-readable line: **Engine version:** \{ver}` (generated at build time).`
    • a parseable marker on its own line: <!-- engine-version: {ver} -->
  • chatbot.py — after REFERENCE_DOC is read, a best-effort drift check parses the stamped version out via regex, resolves the currently installed engine version the same way, and:
    • warns if the stamp is missing,
    • warns if stamp != installed (suggesting a rebuild),
    • logs an info confirmation if they match.
      The entire check is wrapped in try/except and the engine import is lazy, so a missing engine, missing distribution, or any parse failure simply skips the check — it can never break reference loading or request handling.

Closes #104

🤖 Generated with Claude Code

@vercel

vercel Bot commented Jun 8, 2026

Copy link
Copy Markdown

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
policyengine-uk-chat Ready Ready Preview, Comment Jun 15, 2026 2:42pm

Request Review

@github-actions

github-actions Bot commented Jun 8, 2026

Copy link
Copy Markdown

Beta preview is ready.

@anth-volk anth-volk left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed with attention to how this interacts with the deploy pipeline. The implementation is careful — every failure mode degrades to "check silently does nothing" rather than breaking anything. Two findings worth addressing before merge, two smaller ones inline.

1. The warning is structurally unreachable in production. modal_app.py bakes the backend into the image (add_local_dir(..., copy=True)) and runs build_reference.py in that same image build, immediately after pip_install of the engine — so the stamp and the installed version are regenerated together on every deploy and are equal by construction (same for the Dockerfile path it mirrors). The check's real value is local dev (where a stale reference.md genuinely lingers) and future-proofing against build-process changes. That's still worth having, but the PR framing overstates the production risk — and it makes the dev-checkout bug flagged inline the one that matters, since dev is where this check earns its keep.

2. No tests. The drift check is trivially unit-testable: feed a doc with a matching / mismatched / missing stamp, monkeypatch the version resolver, assert on caplog. Given the prompt-contract tests added in #92/#94, two or three focused tests here would match the local standard.

Things I checked that are fine: the regex is correctly anchored to its own marker line; the two stamp lines add negligible tokens to the cached prompt and the HTML comment is harmless to the model; the module-level call runs once at import and early-returns on the empty-doc path, so CI (no reference.md) gets no new noise; and the branch auto-merges cleanly with current main despite being authored against pre-#92 code.

Verdict: sound idea, safe implementation — I'd like the dev-checkout version resolution fixed and tests added before merge; the helper dedup and debug-log are nice-to-haves to bundle in.

Comment thread backend/routes/chatbot.py Outdated
logger.warning("[CHAT] reference.md not found — run scripts/build_reference.py")


def _installed_engine_version():

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This deliberately mirrors _engine_version() in build_reference.py, but two copies of the same resolution logic is the pattern that just got cleaned up for get_capabilities in #94 — they will drift. Suggest a single get_engine_version() in tooling/simulations.py used by both; the script runs standalone but can reach it with a one-line sys.path insert (it already computes the repo-relative path for OUT).

Comment thread backend/routes/chatbot.py Outdated
Comment on lines +136 to +138
import policyengine_uk_compiled as pe

return getattr(pe, "__version__", None)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both halves of this fallback are dead in practice, and that disables the check exactly where drift is most likely:

  • policyengine_uk_compiled has no __version__ attribute (verified against 0.35.0), so this getattr always returns None.
  • In local-checkout dev setups the engine is only importable after tooling.simulations.ensure_compiled_package_importable() inserts the rust checkout into sys.path — a bare import policyengine_uk_compiled here raises ImportError, so _installed_engine_version() returns None and the check silently skips.

Production always resolves via distribution metadata (and always matches by construction — see review body), so local dev is the one environment where this check adds value, and it's the one where this returns None. Suggest calling ensure_compiled_package_importable() before the fallback import, or routing through a shared tooling helper.

Comment thread backend/routes/chatbot.py Outdated
Comment on lines +173 to +175
except Exception:
# Drift check is purely advisory; never let it affect loading.
pass

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Best-effort is right, but fully silent swallowing makes "check skipped due to a bug" indistinguishable from "check never ran". A logger.debug(f"[CHAT] reference drift check failed: {exc}") costs nothing and preserves the never-breaks-loading guarantee.

Comment thread backend/scripts/build_reference.py Outdated
return importlib.metadata.version(dist)
except importlib.metadata.PackageNotFoundError:
continue
return getattr(pe, "__version__", "unknown")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pe.__version__ doesn't exist (verified against 0.35.0), so in any environment without distribution metadata — e.g. the local rust-checkout setup — this stamps unknown. A metadata-resolving runtime then logs the confusing "reference.md was built against engine unknown" warning. Worth either special-casing unknown at the check side (treat as unverifiable, like a missing stamp) or accepting the noisy-but-safe behaviour knowingly.

@vahid-ahmadi vahid-ahmadi left a comment

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: Version-stamp reference.md and warn on engine drift (#106)

The code is clean, defensive, and zero-risk to merge. My concern is that the mechanism doesn't actually catch the risk #104 describes — and given how reference.md is deployed, the drift warning will almost never fire.

✅ Code quality — good

  • _engine_version() / _installed_engine_version() handle both dist spellings + __version__ fallback. The load-time check is fully wrapped in try/except with a lazy import, so it can never break reference loading or request handling. Correct call.
  • Stamping the version into the doc is genuinely useful for observability/debugging ("which engine was this prompt built against") regardless of the drift logic.

🟡 The drift check will essentially never fire in the real deploy model

reference.md is not committed (.gitignore:14) — it's generated in the same image build as the engine:

  • backend/Dockerfile:15RUN python scripts/build_reference.py, after deps are installed.
  • modal_app.py:49python scripts/build_reference.py against the Modal-installed engine.

So the stamped version is read from the same installed engine the server then runs against → stamp == installed on every normal deploy, and _check_reference_engine_drift logs the info "matches" branch every time. The mismatch warning only triggers if the doc and the engine are produced at different times (e.g. a prebuilt base image with a stale doc + a later engine bump, or a manual runtime reinstall) — which the current Dockerfile/Modal flow doesn't do. As wired today it's close to a guaranteed no-op.

🟡 It doesn't check the thing that's actually at risk

#104's concern is the agent being grounded on stale signatures / parameter schemas / reform-recipe field names. This PR compares a version string, not content. Two gaps:

  1. A schema/field change within a version (or a hand-maintained recipe field name that was never right — see the #88 fix where recipes 3–5 had non-existent fields that silently no-op'd) produces no version mismatch, so this check stays silent.
  2. It's log-only — a buried startup logger.warning, no CI gate, no signal to the user or agent. Nothing fails.

The real protection is build-time validation: in build_reference.py, assert each documented reform-recipe / parameter field name exists in Parameters.model_json_schema() and fail the build if not. That catches the actual stale-index failure mode at the moment the doc is generated, version-bump or not.

🟡 Related: the engine isn't pinned

backend/requirements.txt:7 pins policyengine-uk-compiled>=0.20.0 — a floor, not a pin. A fresh image can resolve to any newer engine. That's the drift this PR targets, but because the doc is rebuilt in the same image, it self-heals silently rather than warning. Tightening to ==0.38.0 (or a tight range) is the more direct mitigation.

Merge-readiness

Currently CONFLICTING / DIRTY — needs a rebase. The conflict is in build_reference.py::render(), the exact intro region #88 just modified (the recipes landed right where this inserts the stamp), plus chatbot.py has moved. Straightforward to rebase.

Verdict

Harmless and adds useful version observability, so no objection to landing it after rebase. But it should not be treated as closing #104 — as deployed it rarely fires and never inspects the at-risk content. I'd pair it with (1) build-time schema validation of recipe/param field names that fails the build, and (2) tightening the engine pin. Happy to take the build-time-validation follow-up if useful.

reference.md is generated from the installed policyengine_uk_compiled
engine and loaded whole into the cached system prompt, but recorded no
engine version. If the deployed engine drifts from the doc, the agent is
grounded on stale signatures/schemas with no signal anything is wrong.

- build_reference.py: resolve the engine version robustly and stamp it
  into the header (human-readable line) plus a parseable
  `<!-- engine-version: X.Y.Z -->` marker.
- chatbot.py: at load, parse the stamped version, resolve the installed
  engine version the same way, and log a warning on mismatch or missing
  stamp. Wrapped in try/except so it can never break reference loading
  or request handling.

Closes #104

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@vahid-ahmadi vahid-ahmadi force-pushed the feat/reference-version-stamp branch from 0f4fef7 to 92592fb Compare June 15, 2026 14:41
@vahid-ahmadi

Copy link
Copy Markdown
Collaborator Author

@anth-volk thanks — all four addressed in the rebase (now on current main, 92592fb).

1. Helper dedup → shared resolver. Pulled the version resolution into a single get_engine_version() in tooling/simulations.py; both build_reference.py (build-time stamp) and chatbot.py (load-time check) now call it. No more two-copies-that-drift. The script reaches it with the one-line sys.path insert you suggested (it already computed the repo-relative path).

2. Dev-checkout dead fallback — fixed. This was the important one. get_engine_version() now calls ensure_compiled_package_importable() before resolving, so the rust-checkout is on sys.path and the local-dev path actually works. I also dropped the pe.__version__ fallback entirely — you're right that it doesn't exist (confirmed on 0.38.0), so distribution metadata is the only real source; the helper returns None when it can't resolve, and callers treat that as "unverifiable."

3. unknown stamp — special-cased at the check side. _check_reference_engine_drift now treats a missing or "unknown" stamp as unverifiable (one warning, no comparison), so a metadata-resolving runtime no longer logs the confusing "built against engine unknown" mismatch.

4. Silent swallow → debug log. The catch-all now does logger.debug("[CHAT] reference drift check skipped: {exc}"), so "skipped due to a bug" is distinguishable from "ran clean" without touching the never-breaks-loading guarantee.

Tests addedbackend/tests/test_reference_drift.py, 7 cases on the drift logic (matching / mismatched / missing / unknown stamp, unresolvable installed version, empty-doc no-op, and the resolver returning a string when the engine is installed), monkeypatching the resolver and asserting on caplog per the local standard. All pass in python313; verified end-to-end too: build_reference.py stamps 0.38.0 and load logs "matches installed engine 0.38.0".

On your review-body point: agreed the production warning is unreachable by construction (doc + engine regenerate together) — its value is local dev and future build-process changes, which is exactly where fix #2 makes it land. Separately I'd still flag that this checks a version string, not the recipe/parameter field names that actually drift (cf. #88), and requirements.txt pins >=0.20.0 not ==; both are follow-ups rather than part of this PR.

@anth-volk

Copy link
Copy Markdown
Contributor

@vahid-ahmadi#106 and #109 are circling the same two questions from different angles, and I'd rather we agree a target shape than rebase around each other piecemeal. Can we sync? I think we need an affirmative, agreed pathway on two things before either merges:

1. How we handle irrelevant / unanswerable questions.
Three overlapping mechanisms are in flight: your scope/refusal contract (#102, in-call), my scope router (#109, pre-call — picks a light vs. full background), and the confirm-first idea I raised on #102. They can compose (different layers), but we should decide the canonical layering: what actually declines, what just loads a lighter background, and what the user sees in each case.

2. How we handle systemic metadata.
#106 (stamp-and-warn on engine drift) and #109 (derive the scope descriptor live from the Parameters schema + capabilities()) are two different answers to the same "stale-index" risk. They also edit the same two spots — build_reference.py and the REFERENCE_DOC load in chatbot.py — so they'll conflict on merge regardless. We should pick one philosophy (stamp-and-warn vs. derive-live) and apply it consistently across reference.md and the scope descriptor.

I'm putting #109 back into draft until we've talked, so we settle the architecture before either lands.

anth-volk added a commit that referenced this pull request Jun 16, 2026
A cheap forced-tool pre-pass builds a structured execution plan for the
opening user turn and routes it to one of: irrelevant, out_of_scope,
partial, needs_plan, ready. The model grounds each plan slot (source
flag); the server gates via per-slot criticality. Any gateway error
fails safe to compute. Supersedes the binary scope router (#109) and the
reference.md drift stamp (#106; replaced by the engine-derived scope
descriptor).

- backend/gateway_config.py: criticality table, inferable set, promotions, gate()
- backend/gateway.py: run_gateway (forced emit_plan), verdict, writer/plan helpers
- backend/prompts.py: gateway + lightweight prompts, scope descriptor
- backend/routes/chatbot.py: gateway routing in generate_stream
- backend/scripts/build_reference.py: build_scope_descriptor -> scope_descriptor.md
- evaluation: GatewayCase + _run_gateway + 14 live cases (evals/cases/gateway)
- backend/tests/test_gateway.py: 33 offline tests (gate, parser, fail-safe)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Version-stamp reference.md and warn on engine/reference drift

2 participants