Version-stamp reference.md and warn on engine drift (closes #104)#106
Version-stamp reference.md and warn on engine drift (closes #104)#106vahid-ahmadi wants to merge 1 commit into
Conversation
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
|
Beta preview is ready.
|
anth-volk
left a comment
There was a problem hiding this comment.
Reviewed with attention to how this interacts with the deploy pipeline. The implementation is careful — every failure mode degrades to "check silently does nothing" rather than breaking anything. Two findings worth addressing before merge, two smaller ones inline.
1. The warning is structurally unreachable in production. modal_app.py bakes the backend into the image (add_local_dir(..., copy=True)) and runs build_reference.py in that same image build, immediately after pip_install of the engine — so the stamp and the installed version are regenerated together on every deploy and are equal by construction (same for the Dockerfile path it mirrors). The check's real value is local dev (where a stale reference.md genuinely lingers) and future-proofing against build-process changes. That's still worth having, but the PR framing overstates the production risk — and it makes the dev-checkout bug flagged inline the one that matters, since dev is where this check earns its keep.
2. No tests. The drift check is trivially unit-testable: feed a doc with a matching / mismatched / missing stamp, monkeypatch the version resolver, assert on caplog. Given the prompt-contract tests added in #92/#94, two or three focused tests here would match the local standard.
Things I checked that are fine: the regex is correctly anchored to its own marker line; the two stamp lines add negligible tokens to the cached prompt and the HTML comment is harmless to the model; the module-level call runs once at import and early-returns on the empty-doc path, so CI (no reference.md) gets no new noise; and the branch auto-merges cleanly with current main despite being authored against pre-#92 code.
Verdict: sound idea, safe implementation — I'd like the dev-checkout version resolution fixed and tests added before merge; the helper dedup and debug-log are nice-to-haves to bundle in.
| logger.warning("[CHAT] reference.md not found — run scripts/build_reference.py") | ||
|
|
||
|
|
||
| def _installed_engine_version(): |
There was a problem hiding this comment.
This deliberately mirrors _engine_version() in build_reference.py, but two copies of the same resolution logic is the pattern that just got cleaned up for get_capabilities in #94 — they will drift. Suggest a single get_engine_version() in tooling/simulations.py used by both; the script runs standalone but can reach it with a one-line sys.path insert (it already computes the repo-relative path for OUT).
| import policyengine_uk_compiled as pe | ||
|
|
||
| return getattr(pe, "__version__", None) |
There was a problem hiding this comment.
Both halves of this fallback are dead in practice, and that disables the check exactly where drift is most likely:
policyengine_uk_compiledhas no__version__attribute (verified against 0.35.0), so thisgetattralways returnsNone.- In local-checkout dev setups the engine is only importable after
tooling.simulations.ensure_compiled_package_importable()inserts the rust checkout intosys.path— a bareimport policyengine_uk_compiledhere raisesImportError, so_installed_engine_version()returnsNoneand the check silently skips.
Production always resolves via distribution metadata (and always matches by construction — see review body), so local dev is the one environment where this check adds value, and it's the one where this returns None. Suggest calling ensure_compiled_package_importable() before the fallback import, or routing through a shared tooling helper.
| except Exception: | ||
| # Drift check is purely advisory; never let it affect loading. | ||
| pass |
There was a problem hiding this comment.
Best-effort is right, but fully silent swallowing makes "check skipped due to a bug" indistinguishable from "check never ran". A logger.debug(f"[CHAT] reference drift check failed: {exc}") costs nothing and preserves the never-breaks-loading guarantee.
| return importlib.metadata.version(dist) | ||
| except importlib.metadata.PackageNotFoundError: | ||
| continue | ||
| return getattr(pe, "__version__", "unknown") |
There was a problem hiding this comment.
pe.__version__ doesn't exist (verified against 0.35.0), so in any environment without distribution metadata — e.g. the local rust-checkout setup — this stamps unknown. A metadata-resolving runtime then logs the confusing "reference.md was built against engine unknown" warning. Worth either special-casing unknown at the check side (treat as unverifiable, like a missing stamp) or accepting the noisy-but-safe behaviour knowingly.
vahid-ahmadi
left a comment
There was a problem hiding this comment.
Review: Version-stamp reference.md and warn on engine drift (#106)
The code is clean, defensive, and zero-risk to merge. My concern is that the mechanism doesn't actually catch the risk #104 describes — and given how reference.md is deployed, the drift warning will almost never fire.
✅ Code quality — good
_engine_version()/_installed_engine_version()handle both dist spellings +__version__fallback. The load-time check is fully wrapped intry/exceptwith a lazy import, so it can never break reference loading or request handling. Correct call.- Stamping the version into the doc is genuinely useful for observability/debugging ("which engine was this prompt built against") regardless of the drift logic.
🟡 The drift check will essentially never fire in the real deploy model
reference.md is not committed (.gitignore:14) — it's generated in the same image build as the engine:
backend/Dockerfile:15→RUN python scripts/build_reference.py, after deps are installed.modal_app.py:49→python scripts/build_reference.pyagainst the Modal-installed engine.
So the stamped version is read from the same installed engine the server then runs against → stamp == installed on every normal deploy, and _check_reference_engine_drift logs the info "matches" branch every time. The mismatch warning only triggers if the doc and the engine are produced at different times (e.g. a prebuilt base image with a stale doc + a later engine bump, or a manual runtime reinstall) — which the current Dockerfile/Modal flow doesn't do. As wired today it's close to a guaranteed no-op.
🟡 It doesn't check the thing that's actually at risk
#104's concern is the agent being grounded on stale signatures / parameter schemas / reform-recipe field names. This PR compares a version string, not content. Two gaps:
- A schema/field change within a version (or a hand-maintained recipe field name that was never right — see the #88 fix where recipes 3–5 had non-existent fields that silently no-op'd) produces no version mismatch, so this check stays silent.
- It's log-only — a buried startup
logger.warning, no CI gate, no signal to the user or agent. Nothing fails.
The real protection is build-time validation: in build_reference.py, assert each documented reform-recipe / parameter field name exists in Parameters.model_json_schema() and fail the build if not. That catches the actual stale-index failure mode at the moment the doc is generated, version-bump or not.
🟡 Related: the engine isn't pinned
backend/requirements.txt:7 pins policyengine-uk-compiled>=0.20.0 — a floor, not a pin. A fresh image can resolve to any newer engine. That's the drift this PR targets, but because the doc is rebuilt in the same image, it self-heals silently rather than warning. Tightening to ==0.38.0 (or a tight range) is the more direct mitigation.
Merge-readiness
Currently CONFLICTING / DIRTY — needs a rebase. The conflict is in build_reference.py::render(), the exact intro region #88 just modified (the recipes landed right where this inserts the stamp), plus chatbot.py has moved. Straightforward to rebase.
Verdict
Harmless and adds useful version observability, so no objection to landing it after rebase. But it should not be treated as closing #104 — as deployed it rarely fires and never inspects the at-risk content. I'd pair it with (1) build-time schema validation of recipe/param field names that fails the build, and (2) tightening the engine pin. Happy to take the build-time-validation follow-up if useful.
reference.md is generated from the installed policyengine_uk_compiled engine and loaded whole into the cached system prompt, but recorded no engine version. If the deployed engine drifts from the doc, the agent is grounded on stale signatures/schemas with no signal anything is wrong. - build_reference.py: resolve the engine version robustly and stamp it into the header (human-readable line) plus a parseable `<!-- engine-version: X.Y.Z -->` marker. - chatbot.py: at load, parse the stamped version, resolve the installed engine version the same way, and log a warning on mismatch or missing stamp. Wrapped in try/except so it can never break reference loading or request handling. Closes #104 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
0f4fef7 to
92592fb
Compare
|
@anth-volk thanks — all four addressed in the rebase (now on current main, 1. Helper dedup → shared resolver. Pulled the version resolution into a single 2. Dev-checkout dead fallback — fixed. This was the important one. 3. 4. Silent swallow → debug log. The catch-all now does Tests added — On your review-body point: agreed the production warning is unreachable by construction (doc + engine regenerate together) — its value is local dev and future build-process changes, which is exactly where fix #2 makes it land. Separately I'd still flag that this checks a version string, not the recipe/parameter field names that actually drift (cf. #88), and |
|
@vahid-ahmadi — #106 and #109 are circling the same two questions from different angles, and I'd rather we agree a target shape than rebase around each other piecemeal. Can we sync? I think we need an affirmative, agreed pathway on two things before either merges: 1. How we handle irrelevant / unanswerable questions. 2. How we handle systemic metadata. I'm putting #109 back into draft until we've talked, so we settle the architecture before either lands. |
A cheap forced-tool pre-pass builds a structured execution plan for the opening user turn and routes it to one of: irrelevant, out_of_scope, partial, needs_plan, ready. The model grounds each plan slot (source flag); the server gates via per-slot criticality. Any gateway error fails safe to compute. Supersedes the binary scope router (#109) and the reference.md drift stamp (#106; replaced by the engine-derived scope descriptor). - backend/gateway_config.py: criticality table, inferable set, promotions, gate() - backend/gateway.py: run_gateway (forced emit_plan), verdict, writer/plan helpers - backend/prompts.py: gateway + lightweight prompts, scope descriptor - backend/routes/chatbot.py: gateway routing in generate_stream - backend/scripts/build_reference.py: build_scope_descriptor -> scope_descriptor.md - evaluation: GatewayCase + _run_gateway + 14 live cases (evals/cases/gateway) - backend/tests/test_gateway.py: 33 offline tests (gate, parser, fail-safe) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Summary
Stamp the engine version into
reference.mdwhen it is generated, and warn at load time if the deployed engine has drifted from the version the doc was built against.Why
reference.mdis generated from the installedpolicyengine_uk_compiledengine and loaded whole into the cached system prompt, but it recorded no engine version. If the deployed engine drifts from the doc, the agent is grounded on stale signatures / parameter schemas and can write code that doesn't match the live engine — producing wrong or inconsistent generated code with no signal that anything is out of sync. This is the app's "stale-index" consistency risk.What was added
build_reference.py— a robust_engine_version()helper (triesimportlib.metadatafor both hyphen/underscore distribution spellings, falls back tope.__version__, then"unknown").render()now emits, right after the intro paragraph:**Engine version:** \{ver}` (generated at build time).`<!-- engine-version: {ver} -->chatbot.py— afterREFERENCE_DOCis read, a best-effort drift check parses the stamped version out via regex, resolves the currently installed engine version the same way, and:The entire check is wrapped in
try/exceptand the engine import is lazy, so a missing engine, missing distribution, or any parse failure simply skips the check — it can never break reference loading or request handling.Closes #104
🤖 Generated with Claude Code