fix(openai): self-heal stale Codex used% snapshots + lock semantics (#2994)#2

Closed
StarryKira wants to merge 1 commit into main from fix/issue-2994-codex-5h-used-percent-selfheal

Conversation

@StarryKira

Owner

Problem (Wei-Shaw#2994)

Newly imported OpenAI/Codex OAuth accounts showed ~96–99% used in the 5h window even when nearly unused (a few cents of usage). The inflated value also tripped shouldAutoPauseOpenAIAccountByQuota, so the account was excluded from scheduling (in the reporter's words, "subsequent requests can no longer be scheduled to this account").

Root cause was a 100 - usedPercent inversion in Normalize() (commit b65dde63, PR Wei-Shaw#2918). For a fresh account whose x-codex-secondary-used-percent ≈ 1, it stored 100 - 1 = 99 into codex_5h_used_percent. That inversion was already reverted in main (PR Wei-Shaw#2993). The stored value is now the correct "used %".
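
The arithmetic behind the inflated value can be sketched as follows. This is a minimal illustration, not the project's Normalize() code: storeInverted and storeDirect are hypothetical helpers that only contrast the two storage semantics for a fresh account reporting 1% used.

```go
package main

import "fmt"

// storeInverted mimics the reverted bug: it treats the header value as if it
// were "remaining %" and stores 100 - value. Hypothetical helper.
func storeInverted(headerUsedPercent float64) float64 {
	return 100 - headerUsedPercent
}

// storeDirect stores the header value unchanged, which is the direct
// "used %" semantics the PR wants to lock in. Hypothetical helper.
func storeDirect(headerUsedPercent float64) float64 {
	return headerUsedPercent
}

func main() {
	const fresh = 1.0 // x-codex-secondary-used-percent for a nearly unused account
	fmt.Printf("inverted: %.0f%%, direct: %.0f%%\n", storeInverted(fresh), storeDirect(fresh))
	// prints "inverted: 99%, direct: 1%"
}
```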

What this PR adds (hardening, not a re-fix)

  1. Regression test locking in the direct "used %" semantics. The semantics have flipped twice with no value-level guard: first in Wei-Shaw/sub2api#2918 ("fix(usage): fix inverted used%/remaining% in the OpenAI 5h usage window"), then back in Wei-Shaw/sub2api#2993 ("fix(usage): revert OpenAI 5h used_percent inversion (#2918 regression)"). A fresh account (secondary_used_percent=1, 5h window) must store codex_5h_used_percent=1, not 99.

  2. Stale-bounded self-heal in resolveOpenAIQuotaUtilization — the single auto-pause chokepoint feeding the scheduler, WS forwarder, and gateway. An account already poisoned with an inflated used% gets excluded from scheduling, and a paused account never receives traffic to refresh its snapshot, so today it only recovers when the window's reset_at passes (≤5h/≤7d) or an admin opens its usage page (no background refresher exists). Now, when codex_usage_updated_at is older than 2h, the account is no longer auto-paused on that snapshot; it gets one request whose response headers refresh the snapshot via the existing UpdateCodexUsageSnapshotFromHeaders path and self-heal it.
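
A staleness check of the kind described in item 2 might look like the sketch below. It is not the PR's openAICodexSnapshotStaleForPause: the function name, the map[string]any shape of the account extras, and the RFC 3339 string encoding of codex_usage_updated_at are all assumptions made for illustration. Only the 2h bound and the "missing timestamp counts as fresh" rule come from the description.

```go
package main

import (
	"fmt"
	"time"
)

// staleAfter is the 2h bound named in the PR description.
const staleAfter = 2 * time.Hour

// demoNow is a fixed clock so the example is deterministic.
var demoNow = time.Date(2026, time.June, 4, 14, 0, 0, 0, time.UTC)

// snapshotStaleForPause reports whether a usage snapshot is too old to
// justify an auto-pause. Assumption: the timestamp is stored as an RFC 3339
// string under "codex_usage_updated_at".
func snapshotStaleForPause(extra map[string]any, now time.Time) bool {
	raw, ok := extra["codex_usage_updated_at"].(string)
	if !ok || raw == "" {
		return false // missing timestamp: treat as fresh, keep pausing
	}
	updatedAt, err := time.Parse(time.RFC3339, raw)
	if err != nil {
		return false // unparseable: same conservative default (assumption)
	}
	return now.Sub(updatedAt) > staleAfter
}

// extraUpdatedHoursAgo builds a demo snapshot last updated h hours before now.
func extraUpdatedHoursAgo(now time.Time, h int) map[string]any {
	ts := now.Add(-time.Duration(h) * time.Hour)
	return map[string]any{"codex_usage_updated_at": ts.Format(time.RFC3339)}
}

func main() {
	fmt.Println(snapshotStaleForPause(extraUpdatedHoursAgo(demoNow, 3), demoNow)) // true
	fmt.Println(snapshotStaleForPause(extraUpdatedHoursAgo(demoNow, 1), demoNow)) // false
	fmt.Println(snapshotStaleForPause(map[string]any{}, demoNow))                 // false
}
```

Passing the clock in as a parameter, as the diff excerpt's (extra, now) call also does, keeps the check deterministic under test.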

Safety

  • A missing codex_usage_updated_at is treated as fresh (account stays paused) — no timestamp-less snapshot can silently escape auto-pause. Real poisoned snapshots always carry the timestamp.
  • An actively-served, genuinely-exhausted account refreshes codex_usage_updated_at on every response, so it never crosses the 2h bound and stays paused. Guarded by a dedicated test.
  • Trade-off: a legitimately-exhausted and idle account leaks ~1 verification request per 2h — strictly better than the current reset-only recovery, and it cannot regress a recently-updated exhausted account.
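
The three safety properties above can be read as one pause decision. The sketch below is an assumption-laden model, not the gateway's code: the snapshot struct, shouldPause, and the 95% pauseThreshold are hypothetical, and a zero UpdatedAt stands in for a missing codex_usage_updated_at.

```go
package main

import (
	"fmt"
	"time"
)

const (
	pauseThreshold = 95.0          // hypothetical auto-pause threshold, in used %
	staleAfter     = 2 * time.Hour // bound from the PR description
)

// demoNow is a fixed clock so the example is deterministic.
var demoNow = time.Date(2026, time.June, 4, 14, 0, 0, 0, time.UTC)

// snapshot is a simplified stand-in for the stored Codex usage fields.
type snapshot struct {
	UsedPercent float64
	UpdatedAt   time.Time // zero value models a missing timestamp
}

// shouldPause pauses an over-threshold account unless its snapshot is
// older than staleAfter. A missing timestamp never escapes the pause.
func shouldPause(s snapshot, now time.Time) bool {
	if s.UsedPercent < pauseThreshold {
		return false // under quota: nothing to pause
	}
	if s.UpdatedAt.IsZero() {
		return true // no timestamp: treated as fresh, stays paused
	}
	return now.Sub(s.UpdatedAt) <= staleAfter
}

// agedSnapshot builds a snapshot last updated the given number of hours ago.
func agedSnapshot(used float64, hours int) snapshot {
	return snapshot{UsedPercent: used, UpdatedAt: demoNow.Add(-time.Duration(hours) * time.Hour)}
}

func main() {
	fmt.Println(shouldPause(snapshot{UsedPercent: 99}, demoNow)) // true: no timestamp
	fmt.Println(shouldPause(agedSnapshot(99, 1), demoNow))       // true: recently refreshed
	fmt.Println(shouldPause(agedSnapshot(99, 3), demoNow))       // false: stale, allow one probe
}
```

In this model the idle, genuinely exhausted account is the third case: it is let through once, the response refreshes UpdatedAt, and the second case applies again until another 2h of idleness passes.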

No change to Normalize(); no 100-x reintroduced; no new dependency wiring or mocks.

Tests

  • TestBuildCodexUsageExtraUpdates_FreshAccountUsedPercentNotInverted_Issue2994
  • TestOpenAIGatewayService_SelectAccountForModelWithExclusions_StaleUsageSnapshotSkipsPause_Issue2994
  • TestOpenAIGatewayService_SelectAccountForModelWithExclusions_FreshExhaustedSnapshotStillPauses_Issue2994 (guardrail)

go test -tags=unit ./internal/service/... and golangci-lint run ./internal/service/... pass.

Fixes Wei-Shaw#2994

🤖 Generated with Claude Code

…ei-Shaw#2994)

The OpenAI/Codex 5h "used %" inversion that caused fresh accounts to show
~96-99% used (PR Wei-Shaw#2918, commit b65dde6) was already reverted in Wei-Shaw#2993, so the
stored value is now the correct "used %" again. This commit hardens that fix:

1. Regression test locking in direct "used %" semantics. The semantics have
   flip-flopped twice (Wei-Shaw#2918 -> Wei-Shaw#2993) with no value-level guard — a fresh
   account (secondary_used_percent=1, 5h window) must store
   codex_5h_used_percent=1, not 99.

2. Stale-bounded self-heal in resolveOpenAIQuotaUtilization (the single
   auto-pause chokepoint). An account poisoned with an inflated used% gets
   excluded from scheduling, and a paused account never receives traffic to
   refresh its snapshot — so it stayed stuck until the window's reset_at passed
   (up to 5h/7d). When codex_usage_updated_at is older than 2h, the account is
   no longer auto-paused on that snapshot; it gets one request whose response
   headers refresh the snapshot and self-heal it. A missing timestamp is treated
   as fresh (stays paused), and an actively-served exhausted account refreshes
   the timestamp every response so it never crosses the bound — it cannot escape
   auto-pause.

No change to Normalize(); no 100-x reintroduced; no new dependency wiring.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings June 4, 2026 14:16

Copilot AI left a comment

Pull request overview

This PR hardens OpenAI/Codex quota auto-pause behavior by ensuring “used %” semantics remain stable (no inversion regressions) and by allowing accounts with stale usage snapshots to receive a probe request so they can self-heal via refreshed upstream headers, avoiding indefinite exclusion from scheduling.

Changes:

  • Add a stale-snapshot guard (codex_usage_updated_at older than 2h) so auto-pause won’t be enforced on long-idle, potentially poisoned snapshots.
  • Add a regression unit test to lock in direct “used %” storage semantics for Codex 5h/7d windows (issue Wei-Shaw#2994).
  • Add scheduler selection tests covering stale-snapshot skip behavior and a fresh-exhausted guardrail.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| backend/internal/service/openai_gateway_service.go | Adds stale snapshot threshold/logic to bypass quota auto-pause when the stored Codex snapshot is too old to trust. |
| backend/internal/service/openai_gateway_service_codex_snapshot_test.go | Adds regression test ensuring codex_5h_used_percent / codex_7d_used_percent are stored as direct used percentages (not inverted). |
| backend/internal/service/openai_account_scheduler_test.go | Adds selection tests verifying stale snapshots don’t pause accounts, while fresh exhausted snapshots still do. |

Comment on lines +1491 to +1494
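// When the snapshot is too stale (the account has gone a long time without traffic to refresh it), stop pausing on it.
// Once the account is let through, the next response's headers refresh the snapshot and self-heal it, so the account is not skipped forever on a wrong or expired used% (issue #2994).
if openAICodexSnapshotStaleForPause(extra, now) {
	return 0, false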
@StarryKira

Owner Author

Superseded by upstream PR Wei-Shaw#3039.

@StarryKira StarryKira closed this Jun 4, 2026

Development

Successfully merging this pull request may close these issues.

GPT OAuth usage window statistics issue
