Cut wasted tokens (off-topic gate) and cold-start latency (/chat/backends warmup) by SakshiKekre · Pull Request #95 · PolicyEngine/policyengine-uk-chat

SakshiKekre · 2026-06-01T19:01:01Z

Two small, additive changes to the chat backend. Both are perf/UX fixes on the request critical path; both visible together in the same preview deploy.

1. Opt-in Haiku topic gate to short-circuit off-topic messages

Today the chat answers anything. "What's the capital of France?" returns Paris and then pivots to UK policy, burning the full system prompt and reference doc on input tokens, plus output tokens for the apology. Every off-topic message does this.

Rate limiting (PR #48) caps request volume; iteration capping (PR #87) bounds runaway loops. Neither prevents off-topic acceptance.

This PR adds a pre-step that classifies the last user message with a single Haiku call (~$0.001) and short-circuits with a canned SSE refusal if it's clearly off-topic.
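A canned SSE refusal could look roughly like the sketch below. The event types (`text`, `done`), the payload layout, and the `REFUSAL_TEXT` wording are assumptions for illustration, not the shapes used in `backend/routes/chatbot.py`; only the `refused_by_topic_gate` flag on the `done` event is a name this PR mentions.

```python
import json
from typing import Iterator

# Hypothetical wording; the real refusal text lives in the PR's code.
REFUSAL_TEXT = "Sorry, I can only help with UK tax and benefit policy questions."


def sse_event(payload: dict) -> str:
    """Serialise one payload as a Server-Sent Events data frame."""
    return f"data: {json.dumps(payload)}\n\n"


def canned_refusal() -> Iterator[str]:
    """Yield a short refusal stream without entering the main chat loop."""
    # Assumed event shapes: one text frame, then a terminating done frame.
    yield sse_event({"type": "text", "content": REFUSAL_TEXT})
    yield sse_event({"type": "done", "refused_by_topic_gate": True})


frames = list(canned_refusal())
```

Because the refusal is a fixed two-frame stream, the handler can return it before any system prompt, reference doc, or tool schema is sent to the main model.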

Off by default. Opt-in via:
```
POLICYENGINE_CHAT_TOPIC_GATE_ENABLED=true
POLICYENGINE_CHAT_TOPIC_GATE_MODEL=claude-haiku-4-5 # optional
```
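For illustration, the two variables might be read along these lines. The helper names, the set of accepted truthy spellings, and the exact default model string are assumptions; the PR only shows `true` and says Haiku is the default.

```python
import os

# Assumed default; the PR shows this value as an optional override.
DEFAULT_GATE_MODEL = "claude-haiku-4-5"
# Assumed truthy spellings; the PR only documents "true".
_TRUTHY = {"1", "true", "yes", "on"}


def topic_gate_enabled(env=os.environ) -> bool:
    """Gate stays off unless the flag is explicitly set to a truthy value."""
    raw = env.get("POLICYENGINE_CHAT_TOPIC_GATE_ENABLED", "")
    return raw.strip().lower() in _TRUTHY


def topic_gate_model(env=os.environ) -> str:
    """Return the override model if set and non-empty, else the default."""
    raw = env.get("POLICYENGINE_CHAT_TOPIC_GATE_MODEL", "")
    return raw.strip() or DEFAULT_GATE_MODEL
```

Passing the environment as a parameter keeps the helpers testable without mutating `os.environ`.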

Calibration (boundary cases the classifier prompt is tuned for):

| Prompt | Classifier | Why |
|---|---|---|
| "Capital of France?" | reject | unambiguously off-topic |
| "What did the chancellor say yesterday?" | reject | news, not policy |
| "How will the PA reform affect inflation?" | let through | eval A4: main loop should explain microsim vs macro |
| "What's the EITC?" | let through | factual policy lookup |
| ambiguous / malformed reply | let through | fail-open by design |

False negatives (rejecting on-topic) are worse than false positives (accepting off-topic). The latter wastes a few cents; the former breaks the product.
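The fail-open bias can be sketched as a wrapper that returns "off-topic" only on a clear "no" and treats every other outcome as on-topic. This is a minimal sketch under stated assumptions: `is_on_topic` and the injected `classify` callable are hypothetical names, and the exact-match parsing here is stricter than whatever parser the PR ships.

```python
from typing import Callable


def is_on_topic(message: str, classify: Callable[[str], str]) -> bool:
    """Return False only on a clear "no"; every other outcome lets the message through."""
    if not message.strip():
        return True  # nothing to classify; let the main loop handle it
    try:
        reply = classify(message)
        # Only a bare "no" (optionally with trailing punctuation) rejects.
        return reply.strip().lower().rstrip(".!") != "no"
    except Exception:
        # Fail open: provider errors or malformed replies must never block a user.
        return True
```

With this shape, a misbehaving classifier costs at most one extra trip through the main loop; it can never turn an on-topic question into a refusal.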

2. Speed up /chat/backends from ~30-45s to <1s

Cold-container symptom: the backend-selector dropdown in the frontend was taking 30-45s to render after page load, while the chat input rendered fast. Root cause was `/chat/backends` paying for first-time imports of `policyengine_uk_compiled` + `policyengine_uk` (and `policyengine_us` on PR #54) inside `available_backends() → package_version()`.

Two small fixes:

- `modal_app.py`: extend `_preload_engine` to also pre-import the Python backends (best-effort; failures non-fatal). Shifts the heavy OpenFisca import from request time to image build time.
- `backend/model_backends.py`: memoise `available_backends()` output. The values don't change within a deploy, so `importlib.metadata.version()` only runs once per container.

Combined: `/chat/backends` returns in <100ms on a warm container and ~1s on cold, vs 30-45s today.
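The memoisation could be sketched as below. This is not the code in `backend/model_backends.py`: the `BACKEND_PACKAGES` tuple and the return shape are hypothetical, and `functools.lru_cache` is one of several ways to cache a zero-argument function.

```python
from functools import lru_cache
from importlib import metadata

# Hypothetical distribution names; the real list lives in backend/model_backends.py.
BACKEND_PACKAGES = ("policyengine-uk", "policyengine-us")


def package_version(name: str):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


@lru_cache(maxsize=1)
def available_backends() -> tuple:
    """Compute the backend list once per process; later calls hit the cache."""
    found = []
    for name in BACKEND_PACKAGES:
        version = package_version(name)
        if version is not None:
            found.append((name, version))
    # Return an immutable tuple so callers cannot mutate the cached value.
    return tuple(found)
```

The cache lives for the lifetime of the process, which matches the claim that the values don't change within a deploy; `available_backends.cache_clear()` would reset it in tests.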

Why combined

Both small (~30 lines each), both touch the chat-message critical path, both visible together in the same preview deploy when testing. Topic gate is the bigger feature; the warmup is the perf fix you'd want for any demo where someone watches the page load.

Files

- `backend/routes/chatbot.py` (+93): topic gate helpers + early-return wire-up
- `backend/tests/test_topic_gate.py` (+86): classifier parser tests, fail-open, end-to-end gate stub
- `backend/model_backends.py` (+26/-8): memoised `available_backends()`
- `modal_app.py` (+18/-2): Python-backend pre-import

Stacked on PR #51

Base = `feat/model-backend-selector`. Once #51 merges, this auto-rebases to main.

Test plan

- Confirm `/chat/backends` returns quickly on warm preview container
- Open the preview, watch for the backend dropdown to render fast on first load (cold container)
- Flip `POLICYENGINE_CHAT_TOPIC_GATE_ENABLED=true` on the preview's Modal secret; ask the four boundary cases and confirm behaviour matches the calibration table
- Flip the env var back off; confirm baseline behaviour returns

Not in scope

- Tuning the classifier prompt against a held-out eval set (manual examples for now)
- Telemetry for how often the gate fires (the runner from PR #52, "Add eval harness scaffold: spec, scenarios, fixtures dir", could surface `refused_by_topic_gate` from the `done` event; separate work)
- Modal `min_containers=1` keep-warm (overkill for previews)

Today every message, including "what's the capital of France?", hits the full chat loop: system prompt, reference doc, tools, often several iterations before Claude decides the question isn't on-topic. Each one burns input + output tokens.

This adds a pre-step that runs the last user message through a single small classification call (Haiku by default) and short-circuits with a canned SSE refusal if it's clearly off-topic. Wired in /chat/message after the billing check, before backend resolution.

Calibration choices (in the classifier's system prompt):
- Reject only when unambiguously not policy (capitals, sports, news, general advice).
- Let everything ambiguous through. Eval A4 ("how does this reform affect inflation?") is a deliberate let-through: the main loop's scope refusal is the right place to handle that, not a pre-filter.
- Any classification error fails open. Wrongly rejecting an on-topic question is worse than wasting a few cents.

Gate is off by default. Opt-in via:
POLICYENGINE_CHAT_TOPIC_GATE_ENABLED=true
POLICYENGINE_CHAT_TOPIC_GATE_MODEL=claude-haiku-4-5 # optional override

Tests cover parser behaviour, empty-input shortcut, error-path fail-open, and a TestClient-level check that the gate produces the expected SSE shape when on.

vercel · 2026-06-01T19:01:08Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
policyengine-uk-chat	Ready	Preview, Comment	Jun 3, 2026 2:48pm

github-actions · 2026-06-01T19:03:53Z

Beta preview has been cleaned up because this PR was closed.

Two small fixes that together remove a 30-45s cold-start wait on the backend-selector dropdown in the frontend.

1. modal_app.py: extend _preload_engine to also import policyengine_uk (and policyengine_us if installed). Best-effort: failures are non-fatal. Shifts the heavy OpenFisca import from request time to image build time.
2. model_backends.py: cache available_backends() output. The values don't change within a deploy, so importlib.metadata.version(), which can trigger the package import we're trying to avoid, only runs once per container.

Combined effect: /chat/backends returns in <100ms on a warm container and ~1s on cold, vs 30-45s today.

The backend warmup brought the cold-start /chat/backends time down from 30-45s to ~12-15s (the Modal container cold-start itself). The dropdown rendered nothing during that window, which reads as "broken." Now it shows a small spinner + "Loading engines…" until the fetch resolves. This doesn't gate sending a message: UK compiled is the default anyway, so a user who sends before the dropdown settles still gets the right backend.

Bakes POLICYENGINE_CHAT_TOPIC_GATE_ENABLED=true into the secret-seeding on this branch so the CI-rebuilt secret keeps the gate on across redeploys. Without this, the workflow's `modal secret create --force` would wipe a dashboard-set flag on every run. DO NOT MERGE this workflow change. The gate is supposed to stay opt-in via env in production (per PR #95's design). Production secrets get edited via the prod secret directly, not the workflow file. Once we confirm the gate works as expected on the preview, drop this workflow change before merging the PR.

vahid-ahmadi

Review

Both changes are well-motivated and the topic gate's design instincts are right — opt-in via env, fail-open on any error, biased toward letting things through, and good test coverage of the parser/fail-open paths. The available_backends() memoisation and the Modal pre-import are simple, correct latency wins, and the frontend "Loading engines…" state is a clean touch. Two things on the gate are worth fixing before enabling it widely.

1. The gate makes a blocking sync call inside the async handler. _classify_on_topic uses _get_sync_anthropic_client() and calls client.messages.create(...) with no await/thread offload, and it's invoked directly from async def chat_message. That blocks the event loop for the full classification round-trip on every gated request — stalling all other concurrent requests in the worker, not just this one. The rest of this file is careful about exactly this: the main stream uses the async client (async with client.messages.stream), follow-up suggestions do await asyncio.wait_for(client.messages.create(...)), and tool execution uses run_in_executor. The gate should match — use the async client with await, or await asyncio.to_thread(_classify_on_topic, ...). As written, a latency-focused PR adds head-of-line blocking to the request path.
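To make the head-of-line blocking concrete, here is a minimal sketch of the `asyncio.to_thread` option. `classify_sync` is a stand-in that sleeps instead of calling the provider, and `heartbeat` stands in for other requests on the same event loop; both are hypothetical, not code from this PR.

```python
import asyncio
import time


def classify_sync(message: str) -> str:
    """Stand-in for a blocking SDK call; sleeps instead of hitting the network."""
    time.sleep(0.2)
    return "yes"


async def heartbeat(ticks: list) -> None:
    """Represents other requests sharing the event loop."""
    for _ in range(5):
        await asyncio.sleep(0.02)
        ticks.append(time.monotonic())


async def handler_offloaded(message: str):
    """Run the blocking call in a worker thread so the loop stays responsive."""
    ticks = []
    beat = asyncio.create_task(heartbeat(ticks))
    started = time.monotonic()
    verdict = await asyncio.to_thread(classify_sync, message)
    finished = time.monotonic()
    await beat
    # Count heartbeats that ran while the classification was in flight.
    during = sum(1 for t in ticks if started <= t <= finished)
    return verdict, during
```

Calling `classify_sync(message)` directly in the handler instead would keep the heartbeat task from running at all until the sleep finished, which is the stall described above.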

2. No timeout on the gate call. The follow-up path bounds its Haiku call with asyncio.wait_for(..., SUGGESTION_TIMEOUT_SECS); the gate has none. A slow/hanging classification adds unbounded latency to every message (and, per #1, blocks the loop while it hangs). Wrap it in a short timeout and fail-open on TimeoutError — that's consistent with the stated "errors short-circuit to yes" design.
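A bounded, fail-open gate call might look like the following sketch. `GATE_TIMEOUT_SECS` and `classify_async` are hypothetical names, and the timeout value is deliberately tiny so the example runs quickly; a real value would be tuned against observed Haiku latency.

```python
import asyncio

GATE_TIMEOUT_SECS = 0.05  # hypothetical; deliberately tiny for the demo


async def classify_async(message: str, delay: float) -> str:
    """Stand-in for an async classification call with configurable latency."""
    await asyncio.sleep(delay)
    return "no"


async def gated_on_topic(message: str, delay: float) -> bool:
    """Treat a timeout, or any other error, as on-topic."""
    try:
        reply = await asyncio.wait_for(
            classify_async(message, delay), GATE_TIMEOUT_SECS
        )
    except asyncio.TimeoutError:
        return True  # fail open: a slow classifier must not stall the user
    except Exception:
        return True  # fail open on any other classification error
    return reply.strip().lower() != "no"
```

One caveat: `asyncio.wait_for` cancels a coroutine on timeout, but if the call is offloaded with `asyncio.to_thread`, the worker thread keeps running after the await gives up, so the async client is the cleaner pairing with a timeout.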

3. Minor — the startswith("no") parser cuts against the fail-open bias. A reply like "Note:", "Nothing about policy", or "Nonetheless…" starts with "no" and would be parsed as a reject. max_tokens=4 makes this unlikely, but since the PR explicitly says false-negatives (rejecting on-topic) are the worse failure, a word-boundary match (re.match(r"\s*no\b", text)) is safer than a bare prefix check.
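The difference between the two checks can be seen on a few sample replies. This sketch assumes the reply is lowercased before the prefix check and adds `re.IGNORECASE` to the suggested pattern; neither detail is confirmed by the PR's code.

```python
import re

# Word-boundary pattern from the review, made case-insensitive (an assumption).
_NO = re.compile(r"\s*no\b", re.IGNORECASE)


def rejects_prefix(reply: str) -> bool:
    """Bare prefix check, as described in the review."""
    return reply.strip().lower().startswith("no")


def rejects_word(reply: str) -> bool:
    """Word-boundary check: only a standalone "no" counts as a reject."""
    return _NO.match(reply) is not None


samples = ["no", "No.", "Note:", "Nothing about policy", "Nonetheless", "yes"]
# Maps each reply to (prefix verdict, word-boundary verdict).
results = {s: (rejects_prefix(s), rejects_word(s)) for s in samples}
```

The prefix check rejects five of the six samples, while the word-boundary check rejects only "no" and "No.", which is the behaviour the fail-open bias calls for.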

Smaller notes (non-blocking):

- The gate's Haiku call has real provider cost that isn't recorded anywhere (refusal emits usage: 0). Intended, but it means there's no signal for how often/expensively the gate fires until the telemetry you mention lands.
- Refused messages return early before the main loop, so they aren't persisted server-side the way normal answers are. Confirm that's intended for conversation history.
- modal_app.py catches only ImportError; a version-conflict import (e.g. the policyengine_us/policyengine_core pin issue noted in #97) can raise other errors and would fail the image build. For a best-effort warm-up, consider catching Exception.
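A best-effort warm-up that catches `Exception` could be sketched as follows. The `preload` helper and its return value are hypothetical and are not the `_preload_engine` function in `modal_app.py`.

```python
import importlib
import logging

logger = logging.getLogger("preload")


def preload(modules) -> dict:
    """Try to import each module; record failures instead of raising."""
    outcome = {}
    for name in modules:
        try:
            importlib.import_module(name)
            outcome[name] = True
        except Exception as exc:  # broader than ImportError on purpose
            logger.warning("preload of %s skipped: %s", name, exc)
            outcome[name] = False
    return outcome
```

Logging the failure rather than swallowing it silently keeps a broken pin visible in the build output without failing the image build.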

CI enables the gate on the beta/preview deploys only (not prod) — good for the test plan. Did not run the suite locally.

anth-volk · 2026-06-16T12:05:10Z

Closing this in favour of #109, which carries Part 1 forward.

Part 1 (the off-topic gate) was the salvageable, valuable half — it's been generalised into a scope router in #109 (a cheap pre-pass that decides which background a turn loads: full compute vs lean lightweight, no reference doc or tools). #109 credits this PR as the origin.

Part 2 (the /chat/backends cold-start warmup) was coupled to the backend-selector stack (#51), which is now closed — so that half no longer has anything to attach to and won't land as-is.

Thanks @SakshiKekre — the gate idea is live in #109. Closing to keep the queue clear.

vercel Bot deployed to Preview June 1, 2026 19:01

SakshiKekre changed the title from "Add opt-in Haiku topic gate to /chat/message" to "Cut wasted tokens (off-topic gate) and cold-start latency (/chat/backends warmup)" Jun 2, 2026

vercel Bot deployed to Preview June 2, 2026 13:42

vercel Bot deployed to Preview June 2, 2026 14:20

SakshiKekre mentioned this pull request Jun 2, 2026:
- Enable multizone embedding via prefixed asset URLs #96 (Open, 4 tasks)

vercel Bot deployed to Preview June 3, 2026 14:48

vahid-ahmadi reviewed Jun 4, 2026

This was referenced Jun 8, 2026:
- Add an explicit scope/refusal contract to the chat system prompt #101 (Open)
- Add scope/refusal contract to chat system prompt (closes #101) #102 (Open)

This was referenced Jun 15, 2026:
- Add opt-in scope router to /chat/message (load the heavy background only when needed) #109 (Draft)
- Design note: scope-aware routing (load the background only when needed) #110 (Closed)

anth-volk closed this Jun 16, 2026

SakshiKekre wants to merge 4 commits into `feat/model-backend-selector` from `feat/topic-gate`

3 participants
