Enable FAL endpoint to lift limits #278
Users hit "Server is at capacity (379/200)" and couldn't create sessions: idle/forgotten runtimes never left the global pool, the frontend reactivated every historical session on load, and concurrent creates could overshoot the cap.

Backend:
- Idle-session reaper (2h): releases the sandbox + RAM of inactive sessions and evicts them from the live pool, leaving them resumable from Mongo (persisted resumable, never "ended"). Teardown uses asyncio.wait so a shutdown cancel can't be swallowed; submit() enqueues under the lock and refuses sessions being reaped. Reaping is skipped when persistence is disabled and aborts if the resumable snapshot (incl. the message bulk_write, via save_snapshot strict mode) can't be durably written.
- Global create-slot reservation closes the check-then-create race that let concurrent creates exceed MAX_SESSIONS.
- last_active_at drives the reaper, stamped on genuine activity only.
- Restore injects a sandbox-reset note so the agent knows its sandbox was wiped, skipped when an approval is pending (would orphan the tool results).

Frontend:
- Only the active or backend-processing session reactivates its runtime on load and tab refocus; idle sessions render from cache, so opening the app no longer refills the pool. isProcessing is cleared on all terminal and approval paths.

Tests: reaper happy path/spared cases, submit-reap race, turn-finish restamp, global reservation race, persistence-disabled and write-failure aborts, shutdown cancellation propagation, restore-note inject/skip.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
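The check-then-create race described in this commit message can be illustrated with a small sketch. This is not the PR's code: `SessionPool`, `CapacityError`, and `demo` are hypothetical names, and `asyncio.sleep(0)` stands in for the slow sandbox setup. The idea is that a create reserves a slot in the same critical section that checks the cap, so in-flight creates count against the limit.

```python
import asyncio


class CapacityError(RuntimeError):
    """Raised when no create slot is available (hypothetical name)."""


class SessionPool:
    """Hypothetical pool that counts live sessions plus in-flight creates."""

    def __init__(self, max_sessions: int) -> None:
        self.max_sessions = max_sessions
        self.live = {}       # session_id -> runtime object
        self._reserved = 0   # creates that hold a slot but are not live yet
        self._lock = asyncio.Lock()

    async def create(self, session_id: str) -> None:
        # Check and reserve inside one critical section, so two concurrent
        # creates can never both claim the same free slot.
        async with self._lock:
            if len(self.live) + self._reserved >= self.max_sessions:
                raise CapacityError("Server is at capacity")
            self._reserved += 1
        try:
            await asyncio.sleep(0)  # stand-in for slow sandbox setup
            self.live[session_id] = object()
        finally:
            # Release the reservation on success or failure. There is no
            # await between the insert above and this decrement, so other
            # tasks never see the slot counted twice or not at all.
            self._reserved -= 1


async def demo(n_requests: int, max_sessions: int):
    """Fire n_requests concurrent creates; return (live count, rejected count)."""
    pool = SessionPool(max_sessions)
    results = await asyncio.gather(
        *(pool.create(f"s{i}") for i in range(n_requests)),
        return_exceptions=True,
    )
    rejected = sum(isinstance(r, CapacityError) for r in results)
    return len(pool.live), rejected
```

Without the `_reserved` counter, every concurrent create would pass the `len(self.live)` check before any of them became live, which is the overshoot the commit describes.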
Route Claude Opus 4.6 and GPT-5.5 off Bedrock/native APIs onto the HF Inference Router (huggingface/anthropic/claude-opus-4.6:fal-ai and huggingface/openai/gpt-5.5:fal-ai). The same daily allowance as before (free=1, pro=20) stays company-subsidized via INFERENCE_TOKEN; once a user is past it, the session's premium_user_billed flag flips and the call bills the user's own HF OAuth token (no X-HF-Bill-To) instead of returning the old 429 cap.

- llm_params: bill_to_user routes premium (anthropic/ + openai/) router calls to the user's token; free models stay subsidized.
- premium_user_billed is persisted/restored and threaded through the main loop, context compaction, the research sub-agent (now downgraded to router Sonnet 4.6), and effort probes.
- Frontend: per-session BillingBanner (dismissible daily); remove the now unreachable ClaudeCapDialog + 429 quota-exhausted plumbing; the model chip shows "Billed" past the allowance.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
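A minimal sketch of the credential selection this commit describes, under stated assumptions: `resolve_billing`, its parameters, and the returned dict keys are hypothetical and are not the actual `llm_params` signature. Only the `huggingface/anthropic/` and `huggingface/openai/` prefixes and the `X-HF-Bill-To: smolagents` header come from the PR text.

```python
from typing import Optional

# Router ids for premium models start with these prefixes (from the PR text).
PREMIUM_PREFIXES = ("huggingface/anthropic/", "huggingface/openai/")


def resolve_billing(
    model: str,
    bill_to_user: bool,
    inference_token: str,
    user_token: Optional[str],
) -> dict:
    """Pick the API key and extra headers for one router call (hypothetical helper)."""
    is_premium = model.startswith(PREMIUM_PREFIXES)
    if is_premium and bill_to_user:
        if not user_token:
            raise ValueError("user billing requested but no user token available")
        # User-billed: the user's own OAuth token, and no X-HF-Bill-To header.
        return {"api_key": user_token, "extra_headers": {}}
    # Subsidized: the shared token, billed to the organization.
    return {
        "api_key": inference_token,
        "extra_headers": {"X-HF-Bill-To": "smolagents"},
    }
```

In this sketch a free model stays on the subsidized path even when `bill_to_user` is set, matching "free models stay subsidized".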
Adopt PR #277's reviewed reclaim-idle-sessions implementation (the reaper persists the resumable snapshot outside the manager lock then re-checks under it; the useAgentChat setProcessingState refactor; test_session_reaper) while preserving this branch's premium-billing changes (premium_user_billed plumbing in session_manager, and the ClaudeCapDialog / CLAUDE_QUOTA_EXHAUSTED removal in useAgentChat).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
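The "persist outside the manager lock, then re-check under it" ordering mentioned here might look roughly like the sketch below. It is an illustration only: `SessionManager`, `reap_once`, `_persist`, and the `live` mapping are hypothetical stand-ins, and the snapshot write is simulated with `asyncio.sleep(0)` rather than a real Mongo call.

```python
import asyncio


class SessionManager:
    """Hypothetical manager; `live` maps session_id -> last_active_at."""

    def __init__(self, idle_seconds: float) -> None:
        self.idle_seconds = idle_seconds
        self.live = {}
        self.saved = set()
        self._lock = asyncio.Lock()

    async def _persist(self, session_id: str) -> bool:
        await asyncio.sleep(0)  # stand-in for the durable snapshot write
        self.saved.add(session_id)
        return True

    async def reap_once(self, now: float) -> list:
        # 1. Pick candidates under the lock, remembering the timestamp seen.
        async with self._lock:
            candidates = {
                sid: ts
                for sid, ts in self.live.items()
                if now - ts >= self.idle_seconds
            }
        reaped = []
        for sid, seen_ts in candidates.items():
            # 2. Slow I/O happens outside the lock so creates and submits
            #    are not blocked while the snapshot is written.
            if not await self._persist(sid):
                continue  # abort: never evict a session that was not saved
            # 3. Re-check under the lock: if the session was touched or
            #    removed while persisting, leave it alone.
            async with self._lock:
                if self.live.get(sid) != seen_ts:
                    continue
                del self.live[sid]
                reaped.append(sid)
        return reaped


async def demo():
    manager = SessionManager(idle_seconds=50.0)
    manager.live = {"idle": 0.0, "busy": 100.0}
    reaped = await manager.reap_once(now=100.0)
    return reaped, sorted(manager.live)
```

The re-check in step 3 is what keeps a session that became active during the write from being evicted.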
Claude finished @cmpatino's task in 5m 34s
Review: Enable FAL endpoint to lift limits
No blocking issues: LGTM with 3 P2 observations.
I traced the billing flow end-to-end: gate (
The frontend cleanup is consistent: no orphan refs to
P2: Banner shows a turn late on first overflow
P2 —
Claude finished @cmpatino's task in 2m 48s
Re-review: Enable FAL endpoint to lift limits
LGTM: all 3 prior P2s addressed, no new P0.
Prior findings — fixed
New code in the fix-up commit — no P0
Bill premium models through the HF Inference Router
Summary
Routes the two premium models — Claude Opus 4.6 and GPT-5.5 — off Bedrock/native provider APIs and onto the HF Inference Router, billed to each user's own HF account. The same daily allowance as before (free: 1, pro: 20 premium-model sessions/day) stays company-subsidized; once a user is past it, their premium usage is billed to their HF account instead of being blocked.
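As a rough illustration of the soft cap described in this summary, the gate could reduce to a single comparison. The names `DAILY_ALLOWANCE` and `should_bill_user` are hypothetical, and treating an unknown plan as having no allowance is an assumption of this sketch, not something the PR states.

```python
# Subsidized premium-model sessions per day, per plan (values from the PR text).
DAILY_ALLOWANCE = {"free": 1, "pro": 20}


def should_bill_user(plan: str, premium_sessions_today: int) -> bool:
    """Return True once today's subsidized premium sessions are used up.

    `premium_sessions_today` counts sessions already started today, not
    including the one being created. Unknown plans get no allowance
    (an assumption made for this sketch).
    """
    return premium_sessions_today >= DAILY_ALLOWANCE.get(plan, 0)
```

A free user's first premium session of the day is subsidized (`should_bill_user("free", 0)` is false), and the second is billed to their own account instead of being rejected.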
Why
Premium usage was previously paid for entirely by the company (Bedrock / a shared inference token). This keeps a free subsidized allowance while letting users who want more pay directly through HF Inference Providers. No extra API keys or billing integration are needed, since a signed-in user's OAuth token is already billable on the router (usage shows up under huggingface/ml-intern-models in their HF billing).
Routing
- huggingface/anthropic/claude-opus-4.6:fal-ai
- huggingface/openai/gpt-5.5:fal-ai
- huggingface/anthropic/claude-sonnet-4-6:fal-ai

The huggingface/ prefix sends these through LiteLLM's OpenAI-compatible path to router.huggingface.co/v1 (bypassing the native anthropic/ and openai/ adapters). Free models are unchanged.
Billing model
- Within the daily allowance: INFERENCE_TOKEN + X-HF-Bill-To: smolagents (company-subsidized).
- Past the allowance: the session's premium_user_billed flag flips; the call uses the user's own OAuth token (session.hf_token) and drops X-HF-Bill-To, billing the user. No more 429 block for premium models; the daily cap is now a soft cap.
Frontend
- BillingBanner (per-session, dismissible once/day), shown only when the active session is actually being user-billed (reads premium_user_billed from SessionInfo).
- Removes ClaudeCapDialog and its CLAUDE_QUOTA_EXHAUSTED 429 plumbing.
Validation
Live-checked on the router with a real OAuth token (HTTP 200 + tool calls, billed to the token owner):
- anthropic/claude-opus-4.6:fal-ai, openai/gpt-5.5:fal-ai, and anthropic/claude-sonnet-4-6:fal-ai all serve with tool support via :fal-ai.
- OAuth tokens carry the inference-api scope (already requested) and are billable.
Testing
- test_agent_model_gating.py, test_llm_params.py, test_cli_rendering.py (subsidize→overflow billing, bill_to_user routing, the research-model router-Sonnet downgrade, the defensive hard-cap).
- ruff + frontend tsc/eslint clean.
Notes
- main: target this PR at main; the diff is just the billing feature.
- anthropic/ and openai/ ids with the user's own keys).