Cut wasted tokens (off-topic gate) and cold-start latency (/chat/backends warmup) #95

Closed
SakshiKekre wants to merge 4 commits into feat/model-backend-selector from feat/topic-gate

@SakshiKekre SakshiKekre commented Jun 1, 2026

Collaborator

Two small, additive changes to the chat backend. Both are perf/UX fixes on the request critical path; both visible together in the same preview deploy.

1. Opt-in Haiku topic gate to short-circuit off-topic messages

Today the chat answers anything. "What's the capital of France?" returns "Paris" and then pivots to UK policy, spending the full system prompt and reference doc on input tokens, plus output tokens for the apology. Every off-topic message does this.

Rate limiting (PR #48) caps request volume; iteration capping (PR #87) bounds runaway loops. Neither prevents off-topic acceptance.

This PR adds a pre-step that classifies the last user message with a single Haiku call (~$0.001) and short-circuits with a canned SSE refusal if it's clearly off-topic.
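For reviewers unfamiliar with the short-circuit, here is a minimal sketch of what a canned SSE refusal could look like. The event field names (`type`, `text`), the refusal wording, and the helper names are assumptions for illustration only. `refused_by_topic_gate` is borrowed from the telemetry idea under "Not in scope" and is not claimed to be in the real payload.

```python
import json
from typing import Iterator

# Hypothetical wording; the real canned refusal lives in backend/routes/chatbot.py.
REFUSAL_TEXT = "I can only help with tax and benefit policy questions."


def sse_event(payload: dict) -> str:
    """Serialise one payload as a Server-Sent Events data frame."""
    return f"data: {json.dumps(payload)}\n\n"


def refusal_stream() -> Iterator[str]:
    """Yield the whole refusal without ever entering the main chat loop."""
    yield sse_event({"type": "text", "text": REFUSAL_TEXT})
    # Assumed shape of the terminal event; the flag name is illustrative.
    yield sse_event({"type": "done", "refused_by_topic_gate": True})
```

The point is only that the refusal is a fixed two-event stream: no system prompt, reference doc, or tools are loaded to produce it.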

Off by default. Opt-in via:
```
POLICYENGINE_CHAT_TOPIC_GATE_ENABLED=true
POLICYENGINE_CHAT_TOPIC_GATE_MODEL=claude-haiku-4-5 # optional
```
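A rough sketch of how the two variables might be read, assuming the flag is treated as off unless it is explicitly truthy. The function names and the accepted truthy spellings beyond `true` are assumptions, not taken from the diff.

```python
import os
from typing import Mapping, Optional

DEFAULT_GATE_MODEL = "claude-haiku-4-5"  # default named in this PR description


def topic_gate_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Off by default: only an explicit truthy value turns the gate on."""
    env = os.environ if env is None else env
    raw = env.get("POLICYENGINE_CHAT_TOPIC_GATE_ENABLED", "")
    # Accepting "1" and "yes" as well as "true" is an assumption.
    return raw.strip().lower() in {"1", "true", "yes"}


def topic_gate_model(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the override if set and non-empty, else the default model."""
    env = os.environ if env is None else env
    return env.get("POLICYENGINE_CHAT_TOPIC_GATE_MODEL", "").strip() or DEFAULT_GATE_MODEL
```

Passing `env` explicitly keeps the helpers easy to test without mutating the process environment.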

Calibration (boundary cases the classifier prompt is tuned for):

| Prompt | Classifier | Why |
|---|---|---|
| "Capital of France?" | reject | unambiguously off-topic |
| "What did the chancellor say yesterday?" | reject | news, not policy |
| "How will the PA reform affect inflation?" | let through | eval A4: main loop should explain microsim vs macro |
| "What's the EITC?" | let through | factual policy lookup |
| ambiguous / malformed reply | let through | fail-open by design |

False negatives (rejecting on-topic) are worse than false positives (accepting off-topic). The latter wastes a few cents; the former breaks the product.
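The fail-open bias can be sketched as a thin wrapper around whatever does the classifying. `gate_allows` and the `classify` callable are hypothetical names; the real helper in `backend/routes/chatbot.py` may be structured differently.

```python
from typing import Callable


def gate_allows(message: str, classify: Callable[[str], bool]) -> bool:
    """Return True when the message should proceed to the main chat loop.

    Any exception from the classifier (network error, bad response, etc.)
    is treated as "on-topic", so a broken gate can only cost tokens and
    never blocks a legitimate question.
    """
    try:
        return bool(classify(message))
    except Exception:
        return True


def broken_classifier(message: str) -> bool:
    """Stand-in for a classifier call that fails."""
    raise RuntimeError("provider unavailable")
```

With this shape, `gate_allows("What's the EITC?", broken_classifier)` lets the message through, which matches the last row of the calibration table.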

2. Speed up /chat/backends from ~30-45s to <1s

Cold-container symptom: the backend-selector dropdown in the frontend was taking 30-45s to render after page load, while the chat input rendered fast. Root cause was `/chat/backends` paying for first-time imports of `policyengine_uk_compiled` + `policyengine_uk` (and `policyengine_us` on PR #54) inside `available_backends() → package_version()`.

Two small fixes:

  • `modal_app.py`: extend `_preload_engine` to also pre-import the Python backends (best-effort; failures non-fatal). Shifts the heavy OpenFisca import from request time to image build time.
  • `backend/model_backends.py`: memoise `available_backends()` output. The values don't change within a deploy, so `importlib.metadata.version()` only runs once per container.
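As an illustration of the memoisation, a minimal sketch using `functools.lru_cache`. The return shape, the candidate list, and the use of these names as distribution names are assumptions; the actual `available_backends()` in `backend/model_backends.py` may return something richer.

```python
from functools import lru_cache
from importlib import metadata
from typing import Optional, Tuple

# Assumed candidate list, taken from the packages named in this description.
CANDIDATE_PACKAGES = ("policyengine_uk_compiled", "policyengine_uk", "policyengine_us")


def package_version(name: str) -> Optional[str]:
    """Installed version of a distribution, or None if it is not installed."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


@lru_cache(maxsize=1)
def available_backends() -> Tuple[Tuple[str, str], ...]:
    """Compute (name, version) pairs once per process and reuse the result."""
    found = []
    for name in CANDIDATE_PACKAGES:
        version = package_version(name)
        if version is not None:
            found.append((name, version))
    # A tuple is returned so callers cannot mutate the shared cached value.
    return tuple(found)
```

Because installed versions are fixed for the life of a container, the cache never needs invalidating; `available_backends.cache_clear()` is available for tests.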

Combined: `/chat/backends` returns in <100ms on a warm container and ~1s on cold, vs 30-45s today.

Why combined

Both small (~30 lines each), both touch the chat-message critical path, both visible together in the same preview deploy when testing. Topic gate is the bigger feature; the warmup is the perf fix you'd want for any demo where someone watches the page load.

Files

  • `backend/routes/chatbot.py` (+93): topic gate helpers + early-return wire-up
  • `backend/tests/test_topic_gate.py` (+86): classifier parser tests, fail-open, end-to-end gate stub
  • `backend/model_backends.py` (+26/-8): memoised `available_backends()`
  • `modal_app.py` (+18/-2): Python-backend pre-import

Stacked on PR #51

Base = `feat/model-backend-selector`. Once #51 merges, this auto-rebases to main.

Test plan

  • Confirm `/chat/backends` returns quickly on warm preview container
  • Open the preview, watch for the backend dropdown to render fast on first load (cold container)
  • Flip `POLICYENGINE_CHAT_TOPIC_GATE_ENABLED=true` on the preview's Modal secret; ask the four boundary cases and confirm behaviour matches the calibration table
  • Flip the env var back off; confirm baseline behaviour returns

Not in scope

  • Tuning the classifier prompt against a held-out eval set — manual examples for now
  • Telemetry for how often the gate fires. The runner from PR #52 (Add eval harness scaffold: spec, scenarios, fixtures dir) could surface `refused_by_topic_gate` from the `done` event, but that is separate work
  • Modal `min_containers=1` keep-warm — overkill for previews

Today every message — including "what's the capital of France?" — hits
the full chat loop: system prompt, reference doc, tools, often several
iterations before Claude decides the question isn't on-topic. Each one
burns input + output tokens.

This adds a pre-step that runs the last user message through a single
small classification call (Haiku by default) and short-circuits with a
canned SSE refusal if it's clearly off-topic. Wired in /chat/message
after the billing check, before backend resolution.

Calibration choices (in the classifier's system prompt):
- Reject only when unambiguously not policy (capitals, sports, news,
  general advice).
- Let everything ambiguous through. Eval A4 ("how does this reform
  affect inflation?") is a deliberate let-through — the main loop's
  scope refusal is the right place to handle that, not a pre-filter.
- Any classification error fails open. Wrongly rejecting an on-topic
  question is worse than wasting a few cents.

Gate is off by default. Opt-in via:
  POLICYENGINE_CHAT_TOPIC_GATE_ENABLED=true
  POLICYENGINE_CHAT_TOPIC_GATE_MODEL=claude-haiku-4-5  # optional override

Tests cover parser behaviour, empty-input shortcut, error-path fail-open,
and a TestClient-level check that the gate produces the expected SSE
shape when on.

vercel Bot commented Jun 1, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
|---|---|---|---|
| policyengine-uk-chat | Ready | Preview, Comment | Jun 3, 2026 2:48pm |

github-actions Bot commented Jun 1, 2026

Beta preview has been cleaned up because this PR was closed.

Two small fixes that together remove a 30-45s cold-start wait on the
backend-selector dropdown in the frontend.

1. modal_app.py: extend _preload_engine to also import policyengine_uk
   (and policyengine_us if installed). Best-effort — failures are
   non-fatal. Shifts the heavy OpenFisca import from request time to
   image build time.

2. model_backends.py: cache available_backends() output. The values
   don't change within a deploy, so importlib.metadata.version() —
   which can trigger the package import we're trying to avoid — only
   runs once per container.

Combined effect: /chat/backends returns in <100ms on a warm container
and ~1s on cold, vs 30-45s today.
SakshiKekre changed the title from "Add opt-in Haiku topic gate to /chat/message" to "Cut wasted tokens (off-topic gate) and cold-start latency (/chat/backends warmup)" on Jun 2, 2026

The backend warmup brought the cold-start /chat/backends time from
30-45s down to ~12-15s (Modal container cold-start itself). The dropdown
just rendered nothing during that window, which reads as "broken."

Now it shows a small spinner + "Loading engines…" until the fetch
resolves. Doesn't gate sending a message — UK compiled is the default
anyway, so a user who sends before the dropdown settles still gets the
right backend.
Bakes POLICYENGINE_CHAT_TOPIC_GATE_ENABLED=true into the secret-seeding
on this branch so the CI-rebuilt secret keeps the gate on across
redeploys. Without this, the workflow's `modal secret create --force`
would wipe a dashboard-set flag on every run.

DO NOT MERGE this workflow change. The gate is supposed to stay opt-in
via env in production (per PR #95's design). Production secrets get
edited via the prod secret directly, not the workflow file.

Once we confirm the gate works as expected on the preview, drop this
workflow change before merging the PR.

@vahid-ahmadi vahid-ahmadi left a comment

Collaborator

Review

Both changes are well-motivated and the topic gate's design instincts are right — opt-in via env, fail-open on any error, biased toward letting things through, and good test coverage of the parser/fail-open paths. The available_backends() memoisation and the Modal pre-import are simple, correct latency wins, and the frontend "Loading engines…" state is a clean touch. Two things on the gate are worth fixing before enabling it widely.

1. The gate makes a blocking sync call inside the async handler. _classify_on_topic uses _get_sync_anthropic_client() and calls client.messages.create(...) with no await/thread offload, and it's invoked directly from async def chat_message. That blocks the event loop for the full classification round-trip on every gated request — stalling all other concurrent requests in the worker, not just this one. The rest of this file is careful about exactly this: the main stream uses the async client (async with client.messages.stream), follow-up suggestions do await asyncio.wait_for(client.messages.create(...)), and tool execution uses run_in_executor. The gate should match — use the async client with await, or await asyncio.to_thread(_classify_on_topic, ...). As written, a latency-focused PR adds head-of-line blocking to the request path.

2. No timeout on the gate call. The follow-up path bounds its Haiku call with asyncio.wait_for(..., SUGGESTION_TIMEOUT_SECS); the gate has none. A slow/hanging classification adds unbounded latency to every message (and, per #1, blocks the loop while it hangs). Wrap it in a short timeout and fail-open on TimeoutError — that's consistent with the stated "errors short-circuit to yes" design.

3. Minor — the startswith("no") parser cuts against the fail-open bias. A reply like "Note:", "Nothing about policy", or "Nonetheless…" starts with "no" and would be parsed as a reject. max_tokens=4 makes this unlikely, but since the PR explicitly says false-negatives (rejecting on-topic) are the worse failure, a word-boundary match (re.match(r"\s*no\b", text)) is safer than a bare prefix check.
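To make the three points concrete, here is a sketch that combines them: thread offload for the blocking call, a short timeout that fails open, and a word-boundary parser. `classify_sync` is a stand-in for the real provider call, and `GATE_TIMEOUT_SECS`, `gate`, and `parse_verdict` are hypothetical names. The `re.IGNORECASE` flag is my addition to the suggested regex.

```python
import asyncio
import re
import time

GATE_TIMEOUT_SECS = 2.0  # hypothetical bound; pick whatever suits the request path

# Only a standalone leading "no" rejects; "Note:" or "Nonetheless" do not.
_REJECT = re.compile(r"\s*no\b", re.IGNORECASE)


def parse_verdict(reply: str) -> bool:
    """True means on-topic. Anything that is not a clear "no" is let through."""
    return _REJECT.match(reply) is None


def classify_sync(message: str, delay: float = 0.0, reply: str = "yes") -> str:
    """Stand-in for the blocking classifier call; sleeps to mimic latency."""
    time.sleep(delay)
    return reply


async def gate(message: str, *, delay: float = 0.0, reply: str = "yes",
               timeout: float = GATE_TIMEOUT_SECS) -> bool:
    """Run the sync classifier off the event loop, bounded by a timeout."""
    try:
        raw = await asyncio.wait_for(
            asyncio.to_thread(classify_sync, message, delay, reply),
            timeout=timeout,
        )
    except Exception:
        # Covers the timeout as well as classifier errors: fail open.
        return True
    return parse_verdict(raw)
```

One caveat with the thread-offload variant: on timeout the coroutine stops waiting, but the worker thread still runs to completion. Switching the gate to the async client with `await` avoids that and matches the rest of the file.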

Smaller notes (non-blocking):

  • The gate's Haiku call has real provider cost that isn't recorded anywhere (refusal emits usage: 0). Intended, but it means there's no signal for how often/expensively the gate fires until the telemetry you mention lands.
  • Refused messages return early before the main loop, so they aren't persisted server-side the way normal answers are — confirm that's intended for conversation history.
  • modal_app.py catches only ImportError; a version-conflict import (e.g. the policyengine_us/policyengine_core pin issue noted in #97) can raise other errors and would fail the image build. For a best-effort warm-up, consider catching Exception.
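For the last note, a sketch of a best-effort warm-up that tolerates any import-time failure. `preload_backends` is a hypothetical helper, not the actual `_preload_engine` in `modal_app.py`.

```python
import importlib
import logging
from typing import Iterable, List

logger = logging.getLogger(__name__)


def preload_backends(module_names: Iterable[str]) -> List[str]:
    """Import each module if possible and return the names that loaded.

    Catching Exception rather than ImportError means a version-conflict
    error raised while a package initialises is logged and skipped instead
    of failing the image build.
    """
    loaded = []
    for name in module_names:
        try:
            importlib.import_module(name)
        except Exception as exc:
            logger.warning("Skipping warm-up import of %s: %s", name, exc)
        else:
            loaded.append(name)
    return loaded
```

Called as `preload_backends(["policyengine_uk", "policyengine_us"])`, it warms whichever backends are importable and leaves the rest to fail at request time.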

CI enables the gate on the beta/preview deploys only (not prod) — good for the test plan. Did not run the suite locally.

@anth-volk

Contributor

Closing this in favour of #109, which carries Part 1 forward.

Part 1 (the off-topic gate) was the salvageable, valuable half. It has been generalised into a scope router in #109: a cheap pre-pass that decides which background a turn loads, either full compute or a lean lightweight one with no reference doc or tools. #109 credits this PR as the origin.

Part 2 (the /chat/backends cold-start warmup) was coupled to the backend-selector stack (#51), which is now closed — so that half no longer has anything to attach to and won't land as-is.

Thanks @SakshiKekre — the gate idea is live in #109. Closing to keep the queue clear.

@anth-volk anth-volk closed this Jun 16, 2026