Cut wasted tokens (off-topic gate) and cold-start latency (/chat/backends warmup) #95
SakshiKekre wants to merge 4 commits into
Today every message — including "what's the capital of France?" — hits
the full chat loop: system prompt, reference doc, tools, often several
iterations before Claude decides the question isn't on-topic. Each one
burns input + output tokens.
This adds a pre-step that runs the last user message through a single
small classification call (Haiku by default) and short-circuits with a
canned SSE refusal if it's clearly off-topic. Wired in /chat/message
after the billing check, before backend resolution.
Calibration choices (in the classifier's system prompt):
- Reject only when unambiguously not policy (capitals, sports, news,
general advice).
- Let everything ambiguous through. Eval A4 ("how does this reform
affect inflation?") is a deliberate let-through — the main loop's
scope refusal is the right place to handle that, not a pre-filter.
- Any classification error fails open. Wasting a few cents is better
than wrongly rejecting an on-topic question.
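The fail-open rule above can be sketched as a thin wrapper. This is a minimal illustration under stated assumptions, not the PR's code: `gate_allows` and the `classify` callable are hypothetical names, the real gate calls the Anthropic API where this sketch takes a plain function, and treating empty input as "let through" is an assumption about what the empty-input shortcut does.

```python
from typing import Callable


def gate_allows(message: str, classify: Callable[[str], bool]) -> bool:
    """Return True when the message should proceed to the main chat loop.

    `classify` is a hypothetical callable: it returns True for on-topic
    text and may raise on network or parsing errors.
    """
    if not message.strip():
        # Assumed shortcut: skip the classifier call for empty input.
        return True
    try:
        return classify(message)
    except Exception:
        # Fail open: a classifier failure must never block a user message.
        return True
```

Only an explicit `False` from the classifier produces a refusal; every other outcome, including an exception, lets the message through.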
Gate is off by default. Opt-in via:
POLICYENGINE_CHAT_TOPIC_GATE_ENABLED=true
POLICYENGINE_CHAT_TOPIC_GATE_MODEL=claude-haiku-4-5 # optional override
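As a rough sketch of how these two variables might be read at request time: the helper name `topic_gate_settings`, the set of accepted truthy strings, and the default model id are all assumptions for illustration. The source only shows `true` as the enabling value and says Haiku is the default.

```python
import os

# Assumed set of values treated as "on"; the source only shows "true".
TRUTHY = {"1", "true", "yes", "on"}


def topic_gate_settings(env=os.environ):
    """Return (enabled, model) from the environment; off unless opted in."""
    raw = env.get("POLICYENGINE_CHAT_TOPIC_GATE_ENABLED", "")
    enabled = raw.strip().lower() in TRUTHY
    # Assumed default model id, taken from the override example above.
    model = env.get("POLICYENGINE_CHAT_TOPIC_GATE_MODEL", "claude-haiku-4-5")
    return enabled, model
```

Passing the mapping as a parameter keeps the helper easy to exercise in tests without touching the real process environment.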
Tests cover parser behaviour, empty-input shortcut, error-path fail-open,
and a TestClient-level check that the gate produces the expected SSE
shape when on.
Beta preview has been cleaned up because this PR was closed.
Two small fixes that together remove a 30-45s cold-start wait on the backend-selector dropdown in the frontend.
1. modal_app.py: extend _preload_engine to also import policyengine_uk (and policyengine_us if installed). Best-effort: failures are non-fatal. Shifts the heavy OpenFisca import from request time to image build time.
2. model_backends.py: cache available_backends() output. The values don't change within a deploy, so importlib.metadata.version(), which can trigger the package import we're trying to avoid, only runs once per container.
Combined effect: /chat/backends returns in <100ms on a warm container and ~1s on cold, vs 30-45s today.
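The caching half of this change could look roughly like the following. It is a sketch assuming `functools.lru_cache` is an acceptable memoisation mechanism; the return shape (a tuple of name/version pairs) and the candidate list are illustrative and not taken from model_backends.py.

```python
from functools import lru_cache
from importlib import metadata
from typing import Optional, Tuple


def package_version(name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


@lru_cache(maxsize=1)
def available_backends() -> Tuple[Tuple[str, str], ...]:
    """Resolve backend versions once per process; later calls hit the cache."""
    # Illustrative candidate list; the real module may differ.
    candidates = ("policyengine_uk_compiled", "policyengine_uk", "policyengine_us")
    found = []
    for name in candidates:
        version = package_version(name)
        if version is not None:
            found.append((name, version))
    return tuple(found)
```

Returning an immutable tuple means callers cannot accidentally mutate the cached value that later requests will receive.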
The backend warmup brought the cold-start /chat/backends time down from 30-45s to ~12-15s (the Modal container cold start itself). During that window the dropdown rendered nothing, which reads as "broken." It now shows a small spinner and "Loading engines…" until the fetch resolves. This doesn't gate sending a message: UK compiled is the default anyway, so a user who sends before the dropdown settles still gets the right backend.
Bakes POLICYENGINE_CHAT_TOPIC_GATE_ENABLED=true into the secret-seeding on this branch so the CI-rebuilt secret keeps the gate on across redeploys. Without this, the workflow's `modal secret create --force` would wipe a dashboard-set flag on every run.
DO NOT MERGE this workflow change. The gate is supposed to stay opt-in via env in production (per PR #95's design). Production secrets get edited via the prod secret directly, not the workflow file. Once we confirm the gate works as expected on the preview, drop this workflow change before merging the PR.
vahid-ahmadi
left a comment
Review
Both changes are well-motivated and the topic gate's design instincts are right — opt-in via env, fail-open on any error, biased toward letting things through, and good test coverage of the parser/fail-open paths. The available_backends() memoisation and the Modal pre-import are simple, correct latency wins, and the frontend "Loading engines…" state is a clean touch. Two things on the gate are worth fixing before enabling it widely.
1. The gate makes a blocking sync call inside the async handler. _classify_on_topic uses _get_sync_anthropic_client() and calls client.messages.create(...) with no await/thread offload, and it's invoked directly from async def chat_message. That blocks the event loop for the full classification round-trip on every gated request — stalling all other concurrent requests in the worker, not just this one. The rest of this file is careful about exactly this: the main stream uses the async client (async with client.messages.stream), follow-up suggestions do await asyncio.wait_for(client.messages.create(...)), and tool execution uses run_in_executor. The gate should match — use the async client with await, or await asyncio.to_thread(_classify_on_topic, ...). As written, a latency-focused PR adds head-of-line blocking to the request path.
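To illustrate the thread-offload option mentioned in this point, here is a small self-contained sketch. The classifier is a stand-in that sleeps instead of calling an API, and `gated_request` and `count_ticks` are hypothetical names; the tick counter exists only to show that the event loop keeps running while the blocking call is in flight.

```python
import asyncio
import time


def classify_on_topic_blocking(message: str) -> bool:
    """Stand-in for a synchronous classifier call; sleeps instead of using a network."""
    time.sleep(0.2)
    return "policy" in message.lower()


async def count_ticks(stop: asyncio.Event) -> int:
    """Count how many times the event loop gets to run before `stop` is set."""
    ticks = 0
    while not stop.is_set():
        await asyncio.sleep(0.01)
        ticks += 1
    return ticks


async def gated_request(message: str):
    stop = asyncio.Event()
    ticker = asyncio.create_task(count_ticks(stop))
    # Offload the blocking call so other coroutines keep making progress.
    on_topic = await asyncio.to_thread(classify_on_topic_blocking, message)
    stop.set()
    return on_topic, await ticker


if __name__ == "__main__":
    print(asyncio.run(gated_request("UK tax policy question")))
```

If the classifier were called directly inside `gated_request` without the offload, the ticker task would never get a turn before `stop` is set and would report zero ticks, which is the head-of-line blocking described above.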
2. No timeout on the gate call. The follow-up path bounds its Haiku call with asyncio.wait_for(..., SUGGESTION_TIMEOUT_SECS); the gate has none. A slow/hanging classification adds unbounded latency to every message (and, per point 1 above, blocks the loop while it hangs). Wrap it in a short timeout and fail-open on TimeoutError, which is consistent with the stated "errors short-circuit to yes" design.
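A bounded, fail-open gate call might be sketched like this. `GATE_TIMEOUT_SECS` and `classify_with_timeout` are hypothetical names, the timeout value is arbitrary, and `classify` stands for any coroutine function that returns True for on-topic text.

```python
import asyncio

GATE_TIMEOUT_SECS = 2.0  # hypothetical bound; tune to the classifier's latency


async def classify_with_timeout(classify, message, timeout=GATE_TIMEOUT_SECS) -> bool:
    """Await an async classifier, treating slowness or errors as on-topic."""
    try:
        return await asyncio.wait_for(classify(message), timeout)
    except asyncio.TimeoutError:
        # Too slow: let the message through rather than stall the request.
        return True
    except Exception:
        # Any other failure also fails open.
        return True
```

`asyncio.wait_for` cancels the inner coroutine when the timeout expires, so a hung classification does not linger after the request has moved on.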
3. Minor — the startswith("no") parser cuts against the fail-open bias. A reply like "Note:", "Nothing about policy", or "Nonetheless…" starts with "no" and would be parsed as a reject. max_tokens=4 makes this unlikely, but since the PR explicitly says false-negatives (rejecting on-topic) are the worse failure, a word-boundary match (re.match(r"\s*no\b", text)) is safer than a bare prefix check.
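The word-boundary suggestion can be checked with a few lines. This sketch uses the regex from the comment plus `re.IGNORECASE`, which is an added assumption (the original parser may lowercase the text first); `parse_gate_reply` is a hypothetical name.

```python
import re

# Matches only a standalone leading "no", not "Note", "Nothing", "Nonetheless".
_NO_REPLY = re.compile(r"\s*no\b", re.IGNORECASE)


def parse_gate_reply(text: str) -> bool:
    """Return True (on-topic) unless the reply starts with the word "no"."""
    return _NO_REPLY.match(text or "") is None
```

Anything that is not a clear "no", including an empty reply, parses as on-topic, which keeps the parser aligned with the fail-open bias.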
Smaller notes (non-blocking):
- The gate's Haiku call has real provider cost that isn't recorded anywhere (refusal emits usage: 0). Intended, but it means there's no signal for how often/expensively the gate fires until the telemetry you mention lands.
- Refused messages return early before the main loop, so they aren't persisted server-side the way normal answers are. Confirm that's intended for conversation history.
- modal_app.py catches only ImportError; a version-conflict import (e.g. the policyengine_us/policyengine_core pin issue noted in #97) can raise other errors and would fail the image build. For a best-effort warm-up, consider catching Exception.
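For the last note, a best-effort warm-up that survives non-ImportError failures could be sketched as below. `preload_optional` is a hypothetical helper and is not the PR's `_preload_engine`; it only shows the broader `except Exception` pattern.

```python
import importlib
import logging

logger = logging.getLogger(__name__)


def preload_optional(module_names):
    """Try to import each module; return the names that loaded successfully."""
    loaded = []
    for name in module_names:
        try:
            importlib.import_module(name)
        except Exception as exc:
            # Covers ImportError and anything raised while the module executes,
            # such as a version-conflict error, so the build is not aborted.
            logger.warning("preload of %s skipped: %s", name, exc)
        else:
            loaded.append(name)
    return loaded
```

Logging the skipped module keeps the failure visible in build output even though it no longer stops the image build.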
CI enables the gate on the beta/preview deploys only (not prod) — good for the test plan. Did not run the suite locally.
Closing this in favour of #109, which carries Part 1 forward.
Part 1 (the off-topic gate) was the salvageable, valuable half: it's been generalised into a scope router in #109 (a cheap pre-pass that decides which background a turn loads: full
Part 2 (the
Thanks @SakshiKekre, the gate idea is live in #109. Closing to keep the queue clear.
Two small, additive changes to the chat backend. Both are perf/UX fixes on the request critical path; both visible together in the same preview deploy.
1. Opt-in Haiku topic gate to short-circuit off-topic messages
Today the chat answers anything. "What's the capital of France?" returns Paris, then pivots to UK policy, burning the full system prompt + reference doc on input tokens, plus output tokens for the apology. Every off-topic message does this.
Rate limiting (PR #48) caps request volume; iteration capping (PR #87) bounds runaway loops. Neither prevents off-topic acceptance.
This PR adds a pre-step that classifies the last user message with a single Haiku call (~$0.001) and short-circuits with a canned SSE refusal if it's clearly off-topic.
Off by default. Opt-in via:
```
POLICYENGINE_CHAT_TOPIC_GATE_ENABLED=true
POLICYENGINE_CHAT_TOPIC_GATE_MODEL=claude-haiku-4-5 # optional
```
Calibration (boundary cases the classifier prompt is tuned for):
False negatives (rejecting on-topic) are worse than false positives (accepting off-topic). The latter wastes a few cents; the former breaks the product.
2. Speed up /chat/backends from ~30-45s to <1s
Cold-container symptom: the backend-selector dropdown in the frontend was taking 30-45s to render after page load, while the chat input rendered fast. Root cause was `/chat/backends` paying for first-time imports of `policyengine_uk_compiled` + `policyengine_uk` (and `policyengine_us` on PR #54) inside `available_backends() → package_version()`.
Two small fixes:
Combined: `/chat/backends` returns in <100ms on a warm container and ~1s on cold, vs 30-45s today.
Why combined
Both small (~30 lines each), both touch the chat-message critical path, both visible together in the same preview deploy when testing. Topic gate is the bigger feature; the warmup is the perf fix you'd want for any demo where someone watches the page load.
Files
Stacked on PR #51
Base = `feat/model-backend-selector`. Once #51 merges, this auto-rebases to main.
Test plan
Not in scope