feat(pools): pool exhaustion handling + opt-in agent auth auto-reset#49
Open
nnemirovsky wants to merge 14 commits into
Open
feat(pools): pool exhaustion handling + opt-in agent auth auto-reset#49nnemirovsky wants to merge 14 commits into
nnemirovsky wants to merge 14 commits into
Conversation
…e exhausted notice dedup (A1)
…AgentProfile.ExecUser
There was a problem hiding this comment.
Pull request overview
This PR hardens credential-pool failover behavior when all members are rate-limited (429) by (1) deriving cooldown windows from upstream recovery hints, (2) correctly detecting pool exhaustion and emitting edge-triggered exhausted/recovered notices via a server-side recovery monitor, and (3) adding an opt-in per-pool auth_reset_target that triggers an agent auth reset on recovery to un-latch stuck agents.
Changes:
- Derive per-member cooldown from
Retry-After/x-ratelimit-reset*headers (with clamp/fallback) and improve exhaustion classification. - Add server-lived exhaustion state + recovery monitor to edge-trigger notices and avoid retry/notice storms.
- Add
auth_reset_targetto the data model and expose it via poolops + CLI/REST/Telegram; add container-managerResetAuthsupport and docker exec-user plumbing.
Reviewed changes
Copilot reviewed 34 out of 35 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| internal/vault/pool.go | Adds cooldown clamp constants and new resolver helpers (HasHealthyMember, SoonestCooldown) used for exhaustion/recovery monitoring. |
| internal/vault/pool_test.go | Unit tests for the new resolver helper methods and semantics (lazy expiry, manual rotate). |
| internal/proxy/pool_failover.go | Parses upstream recovery hint headers for cooldown, updates exhaustion logic/dedup, and adds recovered notice formatter. |
| internal/proxy/pool_failover_test.go | Tests header parsing/clamping and exhausted-dedup behavior. |
| internal/proxy/server.go | Adds per-server exhaustion state, recovery monitor lifecycle, and recovery-edge callbacks. |
| internal/proxy/pool_recovery_monitor_test.go | New tests for exhaustion edge-gating, recovery edge, unequal cooldown wakeups, pool removal, and shutdown behavior. |
| internal/proxy/addon.go | Adds an onPoolExhausted hook to delegate exhaustion edge-gating to the Server. |
| internal/store/pools.go | Extends pool model/query paths for auth_reset_target and adds SetPoolAuthResetTarget. |
| internal/store/pools_test.go | Adds migration round-trip test for 000008 and CRUD tests for auth_reset_target. |
| internal/store/migrations/000008_pool_auth_reset.up.sql | Adds auth_reset_target column with default empty string. |
| internal/store/migrations/000008_pool_auth_reset.down.sql | Drops auth_reset_target via SQLite table-rebuild while preserving FK-tied member rows. |
| internal/poolops/poolops.go | Adds channel-agnostic create-with-target + set/clear logic and validation parity. |
| internal/poolops/poolops_test.go | Tests create/set/clear behavior and invalid-target rejection. |
| internal/telegram/commands.go | Adds Telegram surface for auth_reset_target (create arg + set-auth-reset) and output display. |
| internal/telegram/commands_test.go | Tests Telegram adapter parity for create/set/clear and validation rejection. |
| internal/telegram/approval_test.go | Updates container manager mock with ResetAuth for compilation. |
| internal/api/server.go | Adds REST surface for auth_reset_target (create + action route) and API mappings. |
| internal/api/server_test.go | Tests REST create-with-target and set/clear endpoint behavior + validation. |
| internal/api/api.gen.go | Regenerated OpenAPI bindings with new fields and route. |
| api/openapi.yaml | Adds auth_reset_target fields and a new /api/pools/{name}/auth-reset-target route/schema. |
| cmd/sluice/pool.go | Adds CLI --auth-reset-target for create and pool set-auth-reset command; surfaces target in list/status. |
| cmd/sluice/pool_test.go | Tests CLI adapter parity for create/set/clear and validation rejection. |
| internal/container/types.go | Adds ResetAuth to ContainerManager and introduces canonical ValidateResetAuthTarget. |
| internal/container/agent_profile.go | Adds ResetAuthCmd and ExecUser() support; wires Hermes reset + runtime UID. |
| internal/container/agent_profile_test.go | Tests profile reset argv and ExecUser() behavior (incl. nil-safe accessor). |
| internal/container/docker.go | Extends docker exec to accept an optional user and implements ResetAuth running as profile exec user. |
| internal/container/docker_test.go | Tests default-user behavior for existing exec call sites and runtime-UID reset exec. |
| internal/container/docker_socket.go | Threads User into Docker exec-create payload (omitempty when empty). |
| internal/container/docker_socket_test.go | Tests exec-create payload includes/omits User correctly. |
| internal/container/apple.go | Implements best-effort ResetAuth for Apple Container backend. |
| internal/container/tart.go | Implements best-effort ResetAuth for Tart backend. |
| cmd/sluice/main.go | Wires recovery-edge notice fanout + opt-in auth-reset execution + audit event emission. |
| cmd/sluice/main_test.go | Tests recovery-edge notice fanout and auth-reset/audit behavior, including error/no-op cases. |
| CLAUDE.md | Updates operator/developer docs for new pool exhaustion semantics and auth reset features. |
| docs/plans/completed/20260522-pool-exhaustion-and-agent-auth-reset.md | Completed implementation plan documenting the approach, tasks, and verification steps. |
Files not reviewed (1)
- internal/api/api.gen.go: Language not supported
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Comment on lines
+241
to
+250
| // HTTP-date (Retry-After absolute form). | ||
| if t, err := http.ParseTime(raw); err == nil { | ||
| if d := t.Sub(now); d > 0 { | ||
| return d, true | ||
| } | ||
| return 0, false | ||
| } | ||
| // Unit-suffixed duration (e.g. OpenAI "1.5s", "60ms"). | ||
| if d, err := time.ParseDuration(raw); err == nil && d > 0 { | ||
| return d, true |
Comment on lines
+1932
to
+1936
| var req SetPoolAuthResetTargetRequest | ||
| if err := json.NewDecoder(limitedBody(w, r)).Decode(&req); err != nil { | ||
| writeError(w, http.StatusBadRequest, "invalid request body", "") | ||
| return | ||
| } |
Comment on lines
+675
to
+693
| "200": | ||
| description: Updated pool | ||
| content: | ||
| application/json: | ||
| schema: | ||
| $ref: "#/components/schemas/Pool" | ||
| "400": | ||
| description: Invalid target | ||
| content: | ||
| application/json: | ||
| schema: | ||
| $ref: "#/components/schemas/ErrorResponse" | ||
| "404": | ||
| description: Pool not found | ||
| content: | ||
| application/json: | ||
| schema: | ||
| $ref: "#/components/schemas/ErrorResponse" | ||
|
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
Live, both members of
openai_poolhit the OpenAI Codex usage-limit 429 (a multi-hour quota window). sluice cooled each member for a flat 60s, then the all-cooling degrade path re-served the soonest-recovering member, which 429'd again. That produced a perpetualopenai_oauthtoopenai_oauth_2failover flap and roughly 2 Telegram notices per minute, forever. The sticky-failover fix (#48) stopped the snap-to-position-0 flap but left this degrade-path flap in place.Second problem: when the whole pool is exhausted the agent can latch "usage limit reached" and stop, needing a manual
hermes auth resetto resume.What this does
Three fixes for the flap plus an opt-in auto-reset.
Retry-After/x-ratelimit-reset*headers (clamped 10s to 6h) instead of a flat 60s, so a quota-exhausted member stays cooled for the real window and is not re-probed every 60s.auth_reset_targetcolumn on the pool. When it is set and the pool recovers, sluice runs the agent auth-reset command (hermes profile, executed as uid 10000:10000 to avoid root-chowning auth.json) so the agent un-latches. An empty target means no reset, so this is off by default.auth_reset_targetis reachable from CLI, REST, and Telegram through the channel-agnosticinternal/poolops, per the channel parity rule.Notes for review
auth_reset_targetunset to get the flap and spam fix without sluice ever touching the agent.000008_pool_auth_resetadds the column. Its down migration snapshots the FK-tied member rows before the table rebuild so the cascade does not drop them.Testing
go test ./...passes across all 14 packages,-raceclean on the touched packages, gofumpt andgo vet ./...clean, andmake generateproduces no diff. New unit tests cover the cooldown header parsing and clamp, exhaustion detection, the recovery monitor edges (unequal cooldowns, pool removed while exhausted, clean shutdown), the auth-reset argv and uid, validation parity across all channels, and migration 000008 up and down on a populated table.Plan:
docs/plans/completed/20260522-pool-exhaustion-and-agent-auth-reset.md