Enable FAL endpoint to lift limits by cmpatino · Pull Request #278 · huggingface/ml-intern

cmpatino · 2026-06-01T09:33:15Z

Bill premium models through the HF Inference Router

Summary

Routes the two premium models — Claude Opus 4.6 and GPT-5.5 — off Bedrock/native provider APIs and onto the HF Inference Router, billed to each user's own HF account. The same daily allowance as before (free: 1, pro: 20 premium-model sessions/day) stays company-subsidized; once a user is past it, their premium usage is billed to their HF account instead of being blocked.

Why

Premium usage was previously paid for entirely by the company (Bedrock / a shared inference token). This change keeps a free subsidized allowance while letting users who want more pay directly through HF Inference Providers. No extra API keys or billing integration are needed, since a signed-in user's OAuth token is already billable on the router (usage shows up under huggingface/ml-intern-models in their HF billing).

Routing

| Model | New router id |
| --- | --- |
| Claude Opus 4.6 | `huggingface/anthropic/claude-opus-4.6:fal-ai` |
| GPT-5.5 | `huggingface/openai/gpt-5.5:fal-ai` |
| Research sub-agent (for Claude sessions) | downgrades to `huggingface/anthropic/claude-sonnet-4-6:fal-ai` |

The `huggingface/` prefix sends these through LiteLLM's OpenAI-compatible path to router.huggingface.co/v1 (bypassing the native `anthropic/` and `openai/` adapters). Free models are unchanged.
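For readers unfamiliar with the id format, here is a minimal sketch of how a router id breaks down into the model name and the `:fal-ai` provider suffix. `split_router_id` is a hypothetical helper written for illustration; it is not a function from this PR or from LiteLLM.

```python
from typing import Optional, Tuple

ROUTER_PREFIX = "huggingface/"  # prefix used by the router ids in the table above


def split_router_id(model_id: str) -> Tuple[str, Optional[str]]:
    """Split a router id into (model, provider).

    Hypothetical helper: it only illustrates the id layout
    'huggingface/<org>/<model>:<provider>'.
    """
    if not model_id.startswith(ROUTER_PREFIX):
        raise ValueError(f"not a router id: {model_id!r}")
    rest = model_id[len(ROUTER_PREFIX):]
    # The provider suffix is optional; partition() leaves sep empty if absent.
    model, sep, provider = rest.partition(":")
    return model, (provider if sep else None)


print(split_router_id("huggingface/anthropic/claude-opus-4.6:fal-ai"))
# ('anthropic/claude-opus-4.6', 'fal-ai')
```

The un-prefixed form (`anthropic/claude-opus-4.6:fal-ai`) matches the ids listed under Validation.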

Billing model

Within the daily allowance (free = 1, pro = 20): company-subsidized — INFERENCE_TOKEN + X-HF-Bill-To: smolagents.
Past the allowance: the session flips premium_user_billed; the call uses the user's own OAuth token (session.hf_token) and drops X-HF-Bill-To, billing the user. No more 429 block for premium models — the daily cap is now a soft cap.
The flag is persisted/restored with the session, so an overflow session stays user-billed across idle-reclaim.
It's threaded through every session LLM path: main loop, context compaction, the research sub-agent, and effort probes.
Defensive: a premium model that is not user-billable would still hard-cap (429); currently unused, since both premium models are user-billable.
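The rules above can be sketched as a small decision function plus the credentials each outcome implies. This is an illustration under assumptions, not the code in `backend/routes/agent.py`: `decide_billing`, `auth_for`, the `Billing` enum, and the returned dict shape are all hypothetical names. Only the allowance values, the `X-HF-Bill-To: smolagents` header, and the 429 fallback come from the description.

```python
from enum import Enum
from typing import Dict

DAILY_ALLOWANCE = {"free": 1, "pro": 20}  # premium-model sessions per day


class Billing(Enum):
    SUBSIDIZED = "subsidized"  # company pays
    USER = "user"              # user's own HF account pays
    BLOCKED = "blocked"        # defensive hard cap (HTTP 429)


def decide_billing(plan: str, sessions_used_today: int, user_billable: bool) -> Billing:
    """Pick who pays for a new premium session (hypothetical helper)."""
    if sessions_used_today < DAILY_ALLOWANCE[plan]:
        return Billing.SUBSIDIZED
    # Past the allowance: soft cap for user-billable models, hard cap otherwise.
    return Billing.USER if user_billable else Billing.BLOCKED


def auth_for(billing: Billing, inference_token: str, user_token: str) -> Dict[str, object]:
    """Return the token and extra headers for one LLM call (assumed shape)."""
    if billing is Billing.SUBSIDIZED:
        return {"api_key": inference_token, "extra_headers": {"X-HF-Bill-To": "smolagents"}}
    if billing is Billing.USER:
        # No X-HF-Bill-To header, so the token owner is billed.
        return {"api_key": user_token, "extra_headers": {}}
    raise PermissionError("premium quota exhausted (429)")
```

Under this sketch, a free user's second premium session of the day resolves to `Billing.USER` instead of being rejected.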

Frontend

New BillingBanner (per-session, dismissible once/day), shown only when the active session is actually being user-billed (reads premium_user_billed from SessionInfo).
Removed the now-unreachable ClaudeCapDialog and its CLAUDE_QUOTA_EXHAUSTED 429 plumbing.
The model-picker chip shows "Billed" past the allowance (was "Pro only").

Validation

Live-checked on the router with a real OAuth token (HTTP 200 + tool calls, billed to the token owner): anthropic/claude-opus-4.6:fal-ai, openai/gpt-5.5:fal-ai, and anthropic/claude-sonnet-4-6:fal-ai all serve with tool support via :fal-ai. OAuth tokens carry the inference-api scope (already requested) and are billable.

Testing

Updated/added unit tests in test_agent_model_gating.py, test_llm_params.py, test_cli_rendering.py (subsidize→overflow billing, bill_to_user routing, the research-model router-Sonnet downgrade, the defensive hard-cap).
Full unit suite passes; ruff + frontend tsc/eslint clean.

Notes

Builds on #277, "Reclaim idle sessions to fix session-capacity overshoot" (reclaim-idle-sessions), now merged to main. Target this PR at main; the diff is just the billing feature.
CLI is unchanged (still uses native `anthropic/` and `openai/` ids with the user's own keys).

Users hit "Server is at capacity (379/200)" and couldn't create sessions: idle/forgotten runtimes never left the global pool, the frontend reactivated every historical session on load, and concurrent creates could overshoot the cap.

Backend:
- Idle-session reaper (2h): releases the sandbox + RAM of inactive sessions and evicts them from the live pool, leaving them resumable from Mongo (persisted resumable, never "ended"). Teardown uses asyncio.wait so a shutdown cancel can't be swallowed; submit() enqueues under the lock and refuses sessions being reaped. Reaping is skipped when persistence is disabled and aborts if the resumable snapshot (incl. the message bulk_write, via save_snapshot strict mode) can't be durably written.
- Global create-slot reservation closes the check-then-create race that let concurrent creates exceed MAX_SESSIONS.
- last_active_at drives the reaper, stamped on genuine activity only.
- Restore injects a sandbox-reset note so the agent knows its sandbox was wiped, skipped when an approval is pending (would orphan the tool results).

Frontend:
- Only the active or backend-processing session reactivates its runtime on load and tab refocus; idle sessions render from cache, so opening the app no longer refills the pool. isProcessing is cleared on all terminal and approval paths.

Tests: reaper happy path/spared cases, submit-reap race, turn-finish restamp, global reservation race, persistence-disabled and write-failure aborts, shutdown cancellation propagation, restore-note inject/skip.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Route Claude Opus 4.6 and GPT-5.5 off Bedrock/native APIs onto the HF Inference Router (huggingface/anthropic/claude-opus-4.6:fal-ai and huggingface/openai/gpt-5.5:fal-ai). The same daily allowance as before (free=1, pro=20) stays company-subsidized via INFERENCE_TOKEN; once a user is past it, the session's premium_user_billed flag flips and the call bills the user's own HF OAuth token (no X-HF-Bill-To) instead of returning the old 429 cap.

- llm_params: bill_to_user routes premium (anthropic/ + openai/) router calls to the user's token; free models stay subsidized.
- premium_user_billed is persisted/restored and threaded through the main loop, context compaction, the research sub-agent (now downgraded to router Sonnet 4.6), and effort probes.
- Frontend: per-session BillingBanner (dismissible daily); remove the now unreachable ClaudeCapDialog + 429 quota-exhausted plumbing; the model chip shows "Billed" past the allowance.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Adopt PR #277's reviewed reclaim-idle-sessions implementation (the reaper persists the resumable snapshot outside the manager lock then re-checks under it; the useAgentChat setProcessingState refactor; test_session_reaper) while preserving this branch's premium-billing changes (premium_user_billed plumbing in session_manager, and the ClaudeCapDialog / CLAUDE_QUOTA_EXHAUSTED removal in useAgentChat).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

github-actions · 2026-06-01T09:33:29Z

Claude finished @cmpatino's task in 5m 34s

Review: Enable FAL endpoint to lift limits

No blocking issues — LGTM with 3 P2 observations.

I traced the billing flow end-to-end: gate (backend/routes/agent.py:188-231) → flag on Session (agent/core/session.py:128) → routing decision (agent/core/llm_params.py:265-295) → main loop / compaction / research sub-agent / effort probe all read session.premium_user_billed (agent/core/agent_loop.py:1254-1261, agent_loop.py:540-545, agent/context_manager/manager.py:142-150, agent/tools/research_tool.py:268-273, agent/core/effort_probe.py:193-199) → persistence round-trip via save_snapshot / load_session (agent/core/session_persistence.py:178,210,233,258 and backend/session_manager.py:573-575,723). The hard-cap defensive branch (agent.py:217-226) and the bill-to-user branch (agent.py:227-229) are both unit-tested in tests/unit/test_agent_model_gating.py, and _resolve_llm_params has dedicated tests for the user-billed Anthropic/OpenAI paths, free-model fallthrough, and the bill_to_user flag being ignored for non-premium ids (tests/unit/test_llm_params.py:166-233). Free models stay subsidized even when the flag is set, which is the correct behavior.

The frontend cleanup is consistent: no orphan refs to CLAUDE_QUOTA_EXHAUSTED / ClaudeCapDialog, and the BillingBanner reads premium_user_billed straight from /api/session/{id} (SessionChat.tsx:99-108), which goes through _check_session_access → ensure_session_loaded, so a reaped-and-restored session returns the correct flag from Mongo.

P2 — Banner shows a turn late on first overflow

SessionChat.tsx:92-108 refetches /api/session/{id} when busy toggles false. Result: when a fresh session is created and its first submit overflows, the backend flips premium_user_billed=True before the LLM call, but the banner appears only after turn_complete. The user is silently billed for that first turn. Probably acceptable (the user is already past their daily allowance from a prior session), but worth a quota.premiumRemaining === 0-driven optimistic preview in SessionChat.tsx if you want billing to be visible before the click. Pre-existing pattern, low priority.

P2 — `list_sessions` Mongo path doesn't include `premium_user_billed`

backend/session_manager.py:1530-1571 synthesizes a row dict from Mongo for non-live sessions and omits the field, so the SessionInfo response defaults to False until the session is hydrated. Today this only affects the sidebar listing (the chat surface re-fetches /api/session/{id} separately), so it's purely cosmetic — but if any future UI surfaces billing state from the list endpoint, it'll silently report wrong values. Add it alongside auto_approval_* for consistency.
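A minimal sketch of the suggested fix, assuming the list endpoint builds one dict per stored session. `row_from_mongo` and every key other than `premium_user_billed` are hypothetical; the real synthesis code in `backend/session_manager.py` is not shown in this thread.

```python
from typing import Any, Dict


def row_from_mongo(doc: Dict[str, Any]) -> Dict[str, Any]:
    """Build a list_sessions row for a non-live session (illustrative only)."""
    return {
        "session_id": doc["session_id"],  # assumed key name
        "is_live": False,                  # assumed key name
        # Carry the persisted flag through instead of letting the
        # response model default it to False.
        "premium_user_billed": bool(doc.get("premium_user_billed", False)),
    }
```

Older documents that predate the field still map to `False`, so the change would be backward compatible under this assumption.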

P2 — `bill_to_user=True` can fall back to a cached login token

llm_params.py:274-278 calls resolve_hf_token(session_hf_token) for the user-billed path. resolve_hf_token defaults to include_cached=True (agent/core/hf_tokens.py:26-37), so if session.hf_token is somehow None (token rotation mid-session, reaped session restored without an OAuth-bearing request, etc.) it would silently fall back to huggingface_hub.get_token() — which in a Space could resolve to the runtime's HF_TOKEN. The combination is unlikely to fire in production (OAuth tokens are long-lived and the gate only sets the flag after authenticated submit), but explicitly passing include_cached=False in the user-billed branch would make the failure mode loud (401) instead of silent-misroute.
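To make the failure mode concrete, here is a stand-in for the token resolution described above. `resolve_token` is a simplified, hypothetical model of `resolve_hf_token`, and `cached_lookup` stands in for `huggingface_hub.get_token()`; the real function body is not shown in this thread.

```python
from typing import Callable, Optional


def resolve_token(
    session_token: Optional[str],
    *,
    include_cached: bool = True,
    cached_lookup: Callable[[], Optional[str]] = lambda: None,
) -> Optional[str]:
    """Prefer the session token; optionally fall back to a cached one."""
    if session_token:
        return session_token
    if include_cached:
        # Silent fallback: on a Space this could be the runtime's own token.
        return cached_lookup()
    # Loud failure: the caller gets None and the request is rejected (401).
    return None


def runtime_token() -> str:
    return "hf_runtime_token"  # placeholder value for the demo


print(resolve_token(None, cached_lookup=runtime_token))
# hf_runtime_token
print(resolve_token(None, include_cached=False, cached_lookup=runtime_token))
# None
```

With `include_cached=False`, a missing session token can no longer be billed to whichever account owns the cached token.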

Nice touches noticed

USER_BILLED_MODEL_IDS kept distinct from PREMIUM_MODEL_IDS with a clear comment explaining the defensive fallback (agent.py:70-75). Future company-billed premium models will hard-cap correctly.
_get_research_model correctly downgrades huggingface/anthropic/... to the router Sonnet (research_tool.py:227-228), and the chain is covered by test_cli_rendering.py.
The research-effort cap to "high" for max/xhigh prefs (research_tool.py:266-267) prevents the main session's effort from leaking onto an un-probed sub-model — good defensive engineering.
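The last two items can be sketched together. Both functions are hypothetical stand-ins for the logic in `research_tool.py`; returning non-Anthropic models and other effort levels unchanged is an assumption made for the example, not something the review states.

```python
from typing import Optional

ROUTER_SONNET = "huggingface/anthropic/claude-sonnet-4-6:fal-ai"


def research_model_for(main_model: str) -> str:
    """Downgrade router Claude sessions to the router Sonnet sub-agent model."""
    if main_model.startswith("huggingface/anthropic/"):
        return ROUTER_SONNET
    return main_model  # assumption: other models pass through unchanged


def research_effort_for(preference: Optional[str]) -> Optional[str]:
    """Cap the sub-agent's effort so 'max'/'xhigh' never reach an un-probed model."""
    if preference in {"max", "xhigh"}:
        return "high"
    return preference  # assumption: lower levels and None pass through
```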
• Branch: enable-fal-endpoint

github-actions · 2026-06-01T09:55:16Z

Claude finished @cmpatino's task in 2m 48s

Re-review: Enable FAL endpoint to lift limits

LGTM — all 3 prior P2s addressed, no new P0.

Examine commits since prior Claude review (88696c5 Address review comments)
Check whether prior P2 findings were addressed
Scan for new P0 regressions introduced by the fix-up commit
Post focused re-review (P0-only per convergence rule)

Prior findings — fixed

P2 (banner shows a turn late): addressed. SessionChat.tsx:120-125 now derives premiumBillingNotice from sessionPremiumBilled || (isActive && onPremiumModel && quota?.premiumRemaining === 0 && !sessionPremiumQuotaCounted), so a fresh premium session for a user already at their daily cap shows the banner before the first submit rather than only after turn_complete. The new premium_quota_counted flag (mapped from the existing claude_counted column for persistence compatibility — agent.py:196-197) is what distinguishes "user has overflowed but this specific session hasn't been gated yet" from "this session was already counted within the allowance." Logic checks out under reaped-and-restored sessions: after a turn completes, sessionPremiumBilled flips true via the re-fetch, so the first OR clause keeps the banner visible without flicker.
P2 (list_sessions omits premium_user_billed): addressed at backend/session_manager.py:1543-1546 — both premium_user_billed and premium_quota_counted are now synthesized from the Mongo row for non-live sessions, with a regression test in test_session_manager_persistence.py:729-730,752-753. Pydantic SessionInfo gained the matching field (backend/models.py:109) and the frontend SessionMeta mirrors it through sessionStore.ts:128-129,159-160.
P2 (cached-token fallback for bill_to_user): addressed at agent/core/llm_params.py:275 — resolve_hf_token(session_hf_token, include_cached=False) now returns None instead of silently falling back to huggingface_hub.get_token() (which on a Space could resolve to HF_TOKEN). The new test_hf_router_user_billed_does_not_fall_back_to_cached_token (tests/unit/test_llm_params.py:197-211) pins this with an explicit monkeypatch of both HF_TOKEN and huggingface_hub.get_token, so a future regression that drops the kwarg would fail loudly.
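For the first item above, the banner condition is easier to check as a plain predicate. The real expression lives in TypeScript in `SessionChat.tsx`; the version below is a Python transliteration with snake_case parameter names chosen for the example, where `premium_remaining=None` models a quota that has not loaded yet.

```python
from typing import Optional


def premium_billing_notice(
    session_premium_billed: bool,
    is_active: bool,
    on_premium_model: bool,
    premium_remaining: Optional[int],
    session_premium_quota_counted: bool,
) -> bool:
    """Return True when the billing banner should be shown."""
    # Optimistic preview: the allowance is spent and this session has not
    # been counted within it, so its first submit will overflow.
    optimistic = (
        is_active
        and on_premium_model
        and premium_remaining == 0
        and not session_premium_quota_counted
    )
    return session_premium_billed or optimistic
```

A session that was already counted within the allowance keeps the banner hidden even when the remaining quota is zero, which is the case `premium_quota_counted` exists to distinguish.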

New code in the fix-up commit — no P0

Thread-through of quota / refreshQuota as ChatInput props (replacing the internal useUserQuota() call) is clean — ChatInput is only rendered from SessionChat.tsx:173, so there's no other call site missing the new required props. The useUserQuota({ enabled: isActive }) gate (useUserQuota.ts:21,27) correctly prevents background SessionChat instances from each firing their own /api/user/quota fetch.
Backend gate at routes/agent.py:207-231 is unchanged in semantics; the only persistence change is that claude_counted was already being snapshotted, and is now also surfaced via the API as premium_quota_counted.
• Branch: enable-fal-endpoint

cmpatino and others added 3 commits May 29, 2026 15:53

Address review comments

88696c5

cmpatino merged commit c9a83ae into main Jun 1, 2026
3 checks passed

cmpatino deleted the enable-fal-endpoint branch June 1, 2026 10:01

cmpatino merged 4 commits into main from enable-fal-endpoint
