Enable FAL endpoint to lift limits #278

Merged
cmpatino merged 4 commits into main from enable-fal-endpoint
Jun 1, 2026

@cmpatino cmpatino commented Jun 1, 2026

Bill premium models through the HF Inference Router

Summary

Routes the two premium models — Claude Opus 4.6 and GPT-5.5 — off Bedrock/native provider APIs and onto the HF Inference Router, billed to each user's own HF account. The same daily allowance as before (free: 1, pro: 20 premium-model sessions/day) stays company-subsidized; once a user is past it, their premium usage is billed to their HF account instead of being blocked.

Why

Premium usage was previously footed entirely by the company (Bedrock / a shared inference token). This keeps a free subsidized allowance while letting users who want more pay directly through HF Inference Providers — no extra API keys or billing integration needed, since a signed-in user's OAuth token is already billable on the router (usage shows up under huggingface/ml-intern-models in their HF billing).

Routing

| Model | New router id |
| --- | --- |
| Claude Opus 4.6 | huggingface/anthropic/claude-opus-4.6:fal-ai |
| GPT-5.5 | huggingface/openai/gpt-5.5:fal-ai |
| Research sub-agent (for Claude sessions) | downgrades to huggingface/anthropic/claude-sonnet-4-6:fal-ai |

The huggingface/ prefix sends these through LiteLLM's OpenAI-compatible path to router.huggingface.co/v1, bypassing the native anthropic/ and openai/ adapters. Free models are unchanged.
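
As an illustration of the routing table, here is a minimal sketch that treats it as a plain lookup. The names `ROUTER_IDS`, `RESEARCH_SONNET_ID`, and `research_model_for` are hypothetical and not taken from the repository; only the model ids come from this PR. Reusing the main model for non-Claude sessions is an assumption made for the sketch.

```python
# Illustrative only: the names are hypothetical, the ids come from the PR table.
ROUTER_IDS = {
    "Claude Opus 4.6": "huggingface/anthropic/claude-opus-4.6:fal-ai",
    "GPT-5.5": "huggingface/openai/gpt-5.5:fal-ai",
}
RESEARCH_SONNET_ID = "huggingface/anthropic/claude-sonnet-4-6:fal-ai"


def research_model_for(main_model_id: str) -> str:
    """Pick the research sub-agent model for a session's main model.

    Claude sessions on the router downgrade to router Sonnet. Any other
    model is assumed (for this sketch only) to be reused unchanged.
    """
    if main_model_id.startswith("huggingface/anthropic/"):
        return RESEARCH_SONNET_ID
    return main_model_id
```

For example, `research_model_for(ROUTER_IDS["Claude Opus 4.6"])` yields the Sonnet router id, while the GPT-5.5 id passes through unchanged under the stated assumption.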

Billing model

  • Within the daily allowance (free = 1, pro = 20): company-subsidized — INFERENCE_TOKEN + X-HF-Bill-To: smolagents.
  • Past the allowance: the session flips premium_user_billed; the call uses the user's own OAuth token (session.hf_token) and drops X-HF-Bill-To, billing the user. No more 429 block for premium models — the daily cap is now a soft cap.
  • The flag is persisted/restored with the session, so an overflow session stays user-billed across idle-reclaim.
  • It's threaded through every session LLM path: main loop, context compaction, the research sub-agent, and effort probes.
  • Defensive: a premium model that is not user-billable would still hard-cap (429); currently unused, since both premium models are user-billable.
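
The gate described in the list above can be sketched as a three-way decision. This is a simplified model, not the code in backend/routes/agent.py: `Billing`, `DAILY_ALLOWANCE`, and `gate_premium_session` are hypothetical names, and the sketch assumes `sessions_used_today` counts premium sessions already started before the one being gated.

```python
from enum import Enum


class Billing(Enum):
    SUBSIDIZED = "subsidized"  # company pays: INFERENCE_TOKEN + X-HF-Bill-To
    USER = "user"              # user's own OAuth token, no X-HF-Bill-To
    BLOCKED = "blocked"        # defensive hard cap, surfaced as HTTP 429


# Daily premium-session allowance per plan, as stated in the PR description.
DAILY_ALLOWANCE = {"free": 1, "pro": 20}


def gate_premium_session(plan: str, sessions_used_today: int, user_billable: bool) -> Billing:
    """Decide who pays for a new premium-model session."""
    if sessions_used_today < DAILY_ALLOWANCE[plan]:
        return Billing.SUBSIDIZED
    if user_billable:
        # Soft cap: past the allowance the session flips to user billing.
        return Billing.USER
    # Only reachable for a premium model that is not user-billable.
    return Billing.BLOCKED
```

With both premium models user-billable, the `BLOCKED` branch is never taken today, which matches the "currently unused" note in the last bullet.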

Frontend

  • New BillingBanner (per-session, dismissible once/day), shown only when the active session is actually being user-billed (reads premium_user_billed from SessionInfo).
  • Removed the now-unreachable ClaudeCapDialog and its CLAUDE_QUOTA_EXHAUSTED 429 plumbing.
  • The model-picker chip shows "Billed" past the allowance (was "Pro only").

Validation

Live-checked on the router with a real OAuth token (HTTP 200 + tool calls, billed to the token owner): anthropic/claude-opus-4.6:fal-ai, openai/gpt-5.5:fal-ai, and anthropic/claude-sonnet-4-6:fal-ai all serve with tool support via :fal-ai. OAuth tokens carry the inference-api scope (already requested) and are billable.

Testing

  • Updated/added unit tests in test_agent_model_gating.py, test_llm_params.py, test_cli_rendering.py (subsidize→overflow billing, bill_to_user routing, the research-model router-Sonnet downgrade, the defensive hard-cap).
  • Full unit suite passes; ruff + frontend tsc/eslint clean.

Notes

cmpatino and others added 3 commits May 29, 2026 15:53
Users hit "Server is at capacity (379/200)" and couldn't create sessions:
idle/forgotten runtimes never left the global pool, the frontend reactivated
every historical session on load, and concurrent creates could overshoot the cap.

Backend:
- Idle-session reaper (2h): releases the sandbox + RAM of inactive sessions and
  evicts them from the live pool, leaving them resumable from Mongo (persisted
  resumable, never "ended"). Teardown uses asyncio.wait so a shutdown cancel
  can't be swallowed; submit() enqueues under the lock and refuses sessions being
  reaped. Reaping is skipped when persistence is disabled and aborts if the
  resumable snapshot (incl. the message bulk_write, via save_snapshot strict
  mode) can't be durably written.
- Global create-slot reservation closes the check-then-create race that let
  concurrent creates exceed MAX_SESSIONS.
- last_active_at drives the reaper, stamped on genuine activity only.
- Restore injects a sandbox-reset note so the agent knows its sandbox was wiped,
  skipped when an approval is pending (would orphan the tool results).

Frontend:
- Only the active or backend-processing session reactivates its runtime on load
  and tab refocus; idle sessions render from cache, so opening the app no longer
  refills the pool. isProcessing is cleared on all terminal and approval paths.

Tests: reaper happy path/spared cases, submit-reap race, turn-finish restamp,
global reservation race, persistence-disabled and write-failure aborts, shutdown
cancellation propagation, restore-note inject/skip.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Route Claude Opus 4.6 and GPT-5.5 off Bedrock/native APIs onto the HF
Inference Router (huggingface/anthropic/claude-opus-4.6:fal-ai and
huggingface/openai/gpt-5.5:fal-ai). The same daily allowance as before
(free=1, pro=20) stays company-subsidized via INFERENCE_TOKEN; once a
user is past it, the session's premium_user_billed flag flips and the
call bills the user's own HF OAuth token (no X-HF-Bill-To) instead of
returning the old 429 cap.

- llm_params: bill_to_user routes premium (anthropic/ + openai/) router
  calls to the user's token; free models stay subsidized.
- premium_user_billed is persisted/restored and threaded through the main
  loop, context compaction, the research sub-agent (now downgraded to
  router Sonnet 4.6), and effort probes.
- Frontend: per-session BillingBanner (dismissible daily); remove the now
  unreachable ClaudeCapDialog + 429 quota-exhausted plumbing; the model
  chip shows "Billed" past the allowance.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adopt PR #277's reviewed reclaim-idle-sessions implementation (the reaper
persists the resumable snapshot outside the manager lock then re-checks under
it; the useAgentChat setProcessingState refactor; test_session_reaper) while
preserving this branch's premium-billing changes (premium_user_billed plumbing
in session_manager, and the ClaudeCapDialog / CLAUDE_QUOTA_EXHAUSTED removal in
useAgentChat).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
github-actions Bot commented Jun 1, 2026

Claude finished @cmpatino's task in 5m 34s


Review: Enable FAL endpoint to lift limits

No blocking issues — LGTM with 3 P2 observations.

I traced the billing flow end-to-end: gate (backend/routes/agent.py:188-231) → flag on Session (agent/core/session.py:128) → routing decision (agent/core/llm_params.py:265-295) → main loop / compaction / research sub-agent / effort probe all read session.premium_user_billed (agent/core/agent_loop.py:1254-1261, agent_loop.py:540-545, agent/context_manager/manager.py:142-150, agent/tools/research_tool.py:268-273, agent/core/effort_probe.py:193-199) → persistence round-trip via save_snapshot / load_session (agent/core/session_persistence.py:178,210,233,258 and backend/session_manager.py:573-575,723). The hard-cap defensive branch (agent.py:217-226) and the bill-to-user branch (agent.py:227-229) are both unit-tested in tests/unit/test_agent_model_gating.py, and _resolve_llm_params has dedicated tests for the user-billed Anthropic/OpenAI paths, free-model fallthrough, and the bill_to_user flag being ignored for non-premium ids (tests/unit/test_llm_params.py:166-233). Free models stay subsidized even when the flag is set, which is the correct behavior.
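
To make the routing decision concrete, here is a hedged sketch of how the token and header choice might look. It is not the body of `_resolve_llm_params`: `is_premium` and `router_credentials` are hypothetical helpers, and the sketch assumes free models use the same shared token and `X-HF-Bill-To` header as subsidized premium calls.

```python
from typing import Dict, Optional, Tuple


def is_premium(model_id: str) -> bool:
    """Premium router ids are the anthropic/ and openai/ ones, per the commit note."""
    return model_id.startswith(("huggingface/anthropic/", "huggingface/openai/"))


def router_credentials(
    model_id: str,
    bill_to_user: bool,
    session_hf_token: Optional[str],
    inference_token: str,
) -> Tuple[Optional[str], Dict[str, str]]:
    """Return (api_key, extra_headers) for one router call."""
    if bill_to_user and is_premium(model_id):
        # User-billed: the user's own OAuth token and no X-HF-Bill-To header.
        return session_hf_token, {}
    # Subsidized (assumed to cover free models too): shared token billed to the org.
    return inference_token, {"X-HF-Bill-To": "smolagents"}
```

Note that `bill_to_user=True` has no effect for a non-premium id in this sketch, mirroring the behavior the tests above are described as pinning.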

The frontend cleanup is consistent: no orphan refs to CLAUDE_QUOTA_EXHAUSTED / ClaudeCapDialog, and the BillingBanner reads premium_user_billed straight from /api/session/{id} (SessionChat.tsx:99-108), which goes through _check_session_access → ensure_session_loaded, so a reaped-and-restored session returns the correct flag from Mongo.


P2 — Banner shows a turn late on first overflow

SessionChat.tsx:92-108 refetches /api/session/{id} when busy toggles false. Result: when a fresh session is created and its first submit overflows, the backend flips premium_user_billed=True before the LLM call, but the banner appears only after turn_complete. The user is silently billed for that first turn. Probably acceptable (the user is already past their daily allowance from a prior session), but worth a quota.premiumRemaining === 0-driven optimistic preview in SessionChat.tsx if you want billing to be visible before the click. Pre-existing pattern, low priority.

P2 — list_sessions Mongo path doesn't include premium_user_billed

backend/session_manager.py:1530-1571 synthesizes a row dict from Mongo for non-live sessions and omits the field, so the SessionInfo response defaults to False until the session is hydrated. Today this only affects the sidebar listing (the chat surface re-fetches /api/session/{id} separately), so it's purely cosmetic — but if any future UI surfaces billing state from the list endpoint, it'll silently report wrong values. Add it alongside auto_approval_* for consistency.

P2 — bill_to_user=True can fall back to a cached login token

llm_params.py:274-278 calls resolve_hf_token(session_hf_token) for the user-billed path. resolve_hf_token defaults to include_cached=True (agent/core/hf_tokens.py:26-37), so if session.hf_token is somehow None (token rotation mid-session, reaped session restored without an OAuth-bearing request, etc.) it would silently fall back to huggingface_hub.get_token() — which in a Space could resolve to the runtime's HF_TOKEN. The combination is unlikely to fire in production (OAuth tokens are long-lived and the gate only sets the flag after authenticated submit), but explicitly passing include_cached=False in the user-billed branch would make the failure mode loud (401) instead of silent-misroute.
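
The failure mode is easier to see in a stripped-down model of the fallback. The function below is not the real `resolve_hf_token` from agent/core/hf_tokens.py; `resolve_token` and its injectable `cached_lookup` parameter are stand-ins invented for this sketch, with the environment variable playing the role of the cached login token.

```python
import os
from typing import Callable, Optional


def resolve_token(
    session_token: Optional[str],
    include_cached: bool = True,
    cached_lookup: Callable[[], Optional[str]] = lambda: os.environ.get("HF_TOKEN"),
) -> Optional[str]:
    """Prefer the session token; optionally fall back to a cached token."""
    if session_token:
        return session_token
    if include_cached:
        # Silent fallback: in a Space this could be the runtime's own token.
        return cached_lookup()
    # Loud failure: the caller gets None and the request fails with a 401.
    return None
```

With `include_cached=False`, a missing session token yields `None` rather than some other identity's token, so a user-billed call cannot be misrouted to the runtime's account.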


Nice touches noticed

  • USER_BILLED_MODEL_IDS kept distinct from PREMIUM_MODEL_IDS with a clear comment explaining the defensive fallback (agent.py:70-75). Future company-billed premium models will hard-cap correctly.
  • _get_research_model correctly downgrades huggingface/anthropic/... to the router Sonnet (research_tool.py:227-228), and the chain is covered by test_cli_rendering.py.
  • The research-effort cap to "high" for max/xhigh prefs (research_tool.py:266-267) prevents the main session's effort from leaking onto an un-probed sub-model — good defensive engineering.
    • Branch: enable-fal-endpoint

github-actions Bot commented Jun 1, 2026

Claude finished @cmpatino's task in 2m 48s


Re-review: Enable FAL endpoint to lift limits

LGTM — all 3 prior P2s addressed, no new P0.

  • Examine commits since prior Claude review (88696c5 Address review comments)
  • Check whether prior P2 findings were addressed
  • Scan for new P0 regressions introduced by the fix-up commit
  • Post focused re-review (P0-only per convergence rule)

Prior findings — fixed

  • P2 (banner shows a turn late): addressed. SessionChat.tsx:120-125 now derives premiumBillingNotice from sessionPremiumBilled || (isActive && onPremiumModel && quota?.premiumRemaining === 0 && !sessionPremiumQuotaCounted), so a fresh premium session for a user already at their daily cap shows the banner before the first submit rather than only after turn_complete. The new premium_quota_counted flag (mapped from the existing claude_counted column for persistence compatibility — agent.py:196-197) is what distinguishes "user has overflowed but this specific session hasn't been gated yet" from "this session was already counted within the allowance." Logic checks out under reaped-and-restored sessions: after a turn completes, sessionPremiumBilled flips true via the re-fetch, so the first OR clause keeps the banner visible without flicker.

  • P2 (list_sessions omits premium_user_billed): addressed at backend/session_manager.py:1543-1546 — both premium_user_billed and premium_quota_counted are now synthesized from the Mongo row for non-live sessions, with a regression test in test_session_manager_persistence.py:729-730,752-753. Pydantic SessionInfo gained the matching field (backend/models.py:109) and the frontend SessionMeta mirrors it through sessionStore.ts:128-129,159-160.

  • P2 (cached-token fallback for bill_to_user): addressed at agent/core/llm_params.py:275: resolve_hf_token(session_hf_token, include_cached=False) now returns None instead of silently falling back to huggingface_hub.get_token() (which on a Space could resolve to HF_TOKEN). The new test_hf_router_user_billed_does_not_fall_back_to_cached_token (tests/unit/test_llm_params.py:197-211) pins this with an explicit monkeypatch of both HF_TOKEN and huggingface_hub.get_token, so a future regression that drops the kwarg would fail loudly.
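
The banner condition quoted in the first item above is a TypeScript expression. As a reading aid, here is the same boolean logic transcribed into Python. The function name `show_billing_banner` and the snake_case parameter names are invented for this sketch, and `premium_remaining=None` stands in for a quota that has not loaded yet.

```python
from typing import Optional


def show_billing_banner(
    session_premium_billed: bool,
    is_active: bool,
    on_premium_model: bool,
    premium_remaining: Optional[int],
    session_premium_quota_counted: bool,
) -> bool:
    """Mirror of the premiumBillingNotice expression from the re-review."""
    # Optimistic preview: the user is out of allowance and this session has
    # not yet been counted within it, so the next submit would be user-billed.
    optimistic = (
        is_active
        and on_premium_model
        and premium_remaining == 0
        and not session_premium_quota_counted
    )
    return bool(session_premium_billed or optimistic)
```

A session that was counted within the allowance keeps the banner hidden even when the remaining quota is zero, which is the distinction `premium_quota_counted` was introduced to make.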

New code in the fix-up commit — no P0

  • Thread-through of quota / refreshQuota as ChatInput props (replacing the internal useUserQuota() call) is clean — ChatInput is only rendered from SessionChat.tsx:173, so there's no other call site missing the new required props. The useUserQuota({ enabled: isActive }) gate (useUserQuota.ts:21,27) correctly prevents background SessionChat instances from each firing their own /api/user/quota fetch.
  • Backend gate at routes/agent.py:207-231 is unchanged in semantics; the only persistence change is that claude_counted was already being snapshotted, and is now also surfaced via the API as premium_quota_counted.
    • Branch: enable-fal-endpoint

@cmpatino cmpatino merged commit c9a83ae into main Jun 1, 2026
3 checks passed
@cmpatino cmpatino deleted the enable-fal-endpoint branch June 1, 2026 10:01