feat(pools): pool exhaustion handling + opt-in agent auth auto-reset by nnemirovsky · Pull Request #49 · nnemirovsky/sluice

nnemirovsky · 2026-05-22T15:56:58Z

Problem

Live, both members of openai_pool hit the OpenAI Codex usage-limit 429 (a multi-hour quota window). sluice cooled each member for a flat 60s, then the all-cooling degrade path re-served the soonest-recovering member, which 429'd again. That produced a perpetual openai_oauth to openai_oauth_2 failover flap and roughly 2 Telegram notices per minute, forever. The sticky-failover fix (#48) stopped the snap-to-position-0 flap but left this degrade-path flap in place.

Second problem: when the whole pool is exhausted the agent can latch "usage limit reached" and stop, needing a manual hermes auth reset to resume.

What this does

Three fixes for the flap plus an opt-in auto-reset.

B1: derive the member cooldown from the upstream Retry-After / x-ratelimit-reset* headers (clamped 10s to 6h) instead of a flat 60s, so a quota-exhausted member stays cooled for the real window and is not re-probed every 60s.
A1: classify the pool as exhausted when no healthy member exists, not only when the failover target equals the failing member. Also collapse the exhausted-notice dedup key so the flap direction cannot produce two keys.
A2: edge-triggered notices. One "pool exhausted" on the healthy-to-exhausted edge and one "pool recovered" on the way back, driven by a time-based recovery monitor on the server (the agent emits no recovering traffic while latched). No periodic spam.
Auto-reset (opt-in, per pool): a new auth_reset_target column on the pool. When it is set and the pool recovers, sluice runs the agent auth-reset command (hermes profile, executed as uid 10000:10000 to avoid root-chowning auth.json) so the agent un-latches. An empty target means no reset, so this is off by default.

auth_reset_target is reachable from CLI, REST, and Telegram through the channel-agnostic internal/poolops, per the channel parity rule.

Notes for review

The auto-reset is opt-in and may not be needed in practice. We saw hermes recover on its own once a member's window reset. Leave auth_reset_target unset to get the flap and spam fix without sluice ever touching the agent.
B1 only fully engages if the upstream 429 carries a recognized header. With no header the cooldown falls back to 60s, but A1 and A2 still prevent the notice storm. Capturing a real Codex usage-limit 429 to confirm which header it sends is a follow-up (see the plan's Post-Completion section).
New migration 000008_pool_auth_reset adds the column. Its down migration snapshots the FK-tied member rows before the table rebuild so the cascade does not drop them.

Testing

go test ./... passes across all 14 packages, -race clean on the touched packages, gofumpt and go vet ./... clean, and make generate produces no diff. New unit tests cover the cooldown header parsing and clamp, exhaustion detection, the recovery monitor edges (unequal cooldowns, pool removed while exhausted, clean shutdown), the auth-reset argv and uid, validation parity across all channels, and migration 000008 up and down on a populated table.

Plan: docs/plans/completed/20260522-pool-exhaustion-and-agent-auth-reset.md

…e exhausted notice dedup (A1)

…r (A2)

…ion 000008)

…AgentProfile.ExecUser

…un-latch

…pool)

…ss channels

…get (unparam)

Copilot

Pull request overview

This PR hardens credential-pool failover behavior when all members are rate-limited (429) by (1) deriving cooldown windows from upstream recovery hints, (2) correctly detecting pool exhaustion and emitting edge-triggered exhausted/recovered notices via a server-side recovery monitor, and (3) adding an opt-in per-pool auth_reset_target that triggers an agent auth reset on recovery to un-latch stuck agents.

Changes:

Derive per-member cooldown from Retry-After / x-ratelimit-reset* headers (with clamp/fallback) and improve exhaustion classification.
Add server-lived exhaustion state + recovery monitor to edge-trigger notices and avoid retry/notice storms.
Add auth_reset_target to the data model and expose it via poolops + CLI/REST/Telegram; add container-manager ResetAuth support and docker exec-user plumbing.

Reviewed changes

Copilot reviewed 34 out of 35 changed files in this pull request and generated 5 comments.

Show a summary per file

File	Description
internal/vault/pool.go	Adds cooldown clamp constants and new resolver helpers (`HasHealthyMember`, `SoonestCooldown`) used for exhaustion/recovery monitoring.
internal/vault/pool_test.go	Unit tests for the new resolver helper methods and semantics (lazy expiry, manual rotate).
internal/proxy/pool_failover.go	Parses upstream recovery hint headers for cooldown, updates exhaustion logic/dedup, and adds recovered notice formatter.
internal/proxy/pool_failover_test.go	Tests header parsing/clamping and exhausted-dedup behavior.
internal/proxy/server.go	Adds per-server exhaustion state, recovery monitor lifecycle, and recovery-edge callbacks.
internal/proxy/pool_recovery_monitor_test.go	New tests for exhaustion edge-gating, recovery edge, unequal cooldown wakeups, pool removal, and shutdown behavior.
internal/proxy/addon.go	Adds an `onPoolExhausted` hook to delegate exhaustion edge-gating to the Server.
internal/store/pools.go	Extends pool model/query paths for `auth_reset_target` and adds `SetPoolAuthResetTarget`.
internal/store/pools_test.go	Adds migration round-trip test for 000008 and CRUD tests for `auth_reset_target`.
internal/store/migrations/000008_pool_auth_reset.up.sql	Adds `auth_reset_target` column with default empty string.
internal/store/migrations/000008_pool_auth_reset.down.sql	Drops `auth_reset_target` via SQLite table-rebuild while preserving FK-tied member rows.
internal/poolops/poolops.go	Adds channel-agnostic create-with-target + set/clear logic and validation parity.
internal/poolops/poolops_test.go	Tests create/set/clear behavior and invalid-target rejection.
internal/telegram/commands.go	Adds Telegram surface for `auth_reset_target` (create arg + `set-auth-reset`) and output display.
internal/telegram/commands_test.go	Tests Telegram adapter parity for create/set/clear and validation rejection.
internal/telegram/approval_test.go	Updates container manager mock with `ResetAuth` for compilation.
internal/api/server.go	Adds REST surface for `auth_reset_target` (create + action route) and API mappings.
internal/api/server_test.go	Tests REST create-with-target and set/clear endpoint behavior + validation.
internal/api/api.gen.go	Regenerated OpenAPI bindings with new fields and route.
api/openapi.yaml	Adds `auth_reset_target` fields and a new `/api/pools/{name}/auth-reset-target` route/schema.
cmd/sluice/pool.go	Adds CLI `--auth-reset-target` for create and `pool set-auth-reset` command; surfaces target in list/status.
cmd/sluice/pool_test.go	Tests CLI adapter parity for create/set/clear and validation rejection.
internal/container/types.go	Adds `ResetAuth` to `ContainerManager` and introduces canonical `ValidateResetAuthTarget`.
internal/container/agent_profile.go	Adds `ResetAuthCmd` and `ExecUser()` support; wires Hermes reset + runtime UID.
internal/container/agent_profile_test.go	Tests profile reset argv and `ExecUser()` behavior (incl. nil-safe accessor).
internal/container/docker.go	Extends docker exec to accept an optional user and implements `ResetAuth` running as profile exec user.
internal/container/docker_test.go	Tests default-user behavior for existing exec call sites and runtime-UID reset exec.
internal/container/docker_socket.go	Threads `User` into Docker exec-create payload (omitempty when empty).
internal/container/docker_socket_test.go	Tests exec-create payload includes/omits `User` correctly.
internal/container/apple.go	Implements best-effort `ResetAuth` for Apple Container backend.
internal/container/tart.go	Implements best-effort `ResetAuth` for Tart backend.
cmd/sluice/main.go	Wires recovery-edge notice fanout + opt-in auth-reset execution + audit event emission.
cmd/sluice/main_test.go	Tests recovery-edge notice fanout and auth-reset/audit behavior, including error/no-op cases.
CLAUDE.md	Updates operator/developer docs for new pool exhaustion semantics and auth reset features.
docs/plans/completed/20260522-pool-exhaustion-and-agent-auth-reset.md	Completed implementation plan documenting the approach, tasks, and verification steps.

Files not reviewed (1)

internal/api/api.gen.go: Language not supported

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

… response

Copilot

Pull request overview

Copilot reviewed 35 out of 36 changed files in this pull request and generated 3 comments.

Files not reviewed (1)

internal/api/api.gen.go: Language not supported

+	// HTTP-date (Retry-After absolute form).
+	if t, err := http.ParseTime(raw); err == nil {
+		if d := t.Sub(now); d > 0 {
+			return d, true
+		}
+		return 0, false
+	}
+	// Unit-suffixed duration (e.g. OpenAI "1.5s", "60ms").
+	if d, err := time.ParseDuration(raw); err == nil && d > 0 {
+		return d, true


+	var req SetPoolAuthResetTargetRequest
+	if err := json.NewDecoder(limitedBody(w, r)).Decode(&req); err != nil {
+		writeError(w, http.StatusBadRequest, "invalid request body", "")
+		return
+	}


+        "200":
+          description: Updated pool
+          content:
+            application/json:
+              schema:
+                $ref: "#/components/schemas/Pool"
+        "400":
+          description: Invalid target
+          content:
+            application/json:
+              schema:
+                $ref: "#/components/schemas/ErrorResponse"
+        "404":
+          description: Pool not found
+          content:
+            application/json:
+              schema:
+                $ref: "#/components/schemas/ErrorResponse"
+


nnemirovsky added 12 commits May 22, 2026 15:38

docs(plans): add pool exhaustion handling + agent auth auto-reset plan

0bff90d

feat(proxy): derive pool cooldown from upstream recovery hints (B1)

96611d2

feat(proxy): detect pool exhaustion via no-healthy-member and collaps…

cd4fd87

…e exhausted notice dedup (A1)

feat(proxy): edge-triggered pool exhaustion notices + recovery monito…

2c05a5b

…r (A2)

feat(store): add per-pool auth_reset_target column + accessor (migrat…

0a67453

…ion 000008)

feat(pools): expose auth_reset_target on CLI, REST, and Telegram

eefe85d

feat(container): thread optional exec user through ExecInContainer + …

5752563

…AgentProfile.ExecUser

feat(container): add ResetAuth + profile ResetAuthCmd for agent auth …

deb7d88

…un-latch

feat(pools): auto-reset agent auth on pool recovery edge (opt-in per …

8f4a12a

…pool)

test: verify acceptance criteria for pool exhaustion + auth reset

bcc876a

docs: document pool exhaustion handling + agent auth auto-reset

889e9e6

fix(pools): enforce auth_reset_target charset parity at set time acro…

be34498

…ss channels

nnemirovsky requested a review from Copilot May 22, 2026 16:13

Copilot started reviewing on behalf of nnemirovsky May 22, 2026 16:14 View session

test(pools): drop always-constant param from assertStoredAuthResetTar…

d26a3f0

…get (unparam)

Copilot AI reviewed May 22, 2026

View reviewed changes

Comment thread internal/api/server.go

Comment thread internal/api/server.go Outdated

Comment thread internal/store/pools.go Outdated

Comment thread internal/store/pools.go

Comment thread internal/poolops/poolops.go Outdated

fix(pools): atomic create-with-target and schema-valid set-auth-reset…

4089202

… response

nnemirovsky requested a review from Copilot May 22, 2026 16:33

Copilot started reviewing on behalf of nnemirovsky May 22, 2026 16:34 View session

Copilot AI reviewed May 22, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(pools): pool exhaustion handling + opt-in agent auth auto-reset#49

feat(pools): pool exhaustion handling + opt-in agent auth auto-reset#49
nnemirovsky wants to merge 14 commits into
mainfrom
pool-exhaustion-auth-reset

nnemirovsky commented May 22, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

nnemirovsky commented May 22, 2026

Problem

What this does

Notes for review

Testing

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants