
Handle spending-limit errors gracefully with escalating cooldowns #63

Open
slacki-ai wants to merge 2 commits into longtermrisk:v0.9 from slacki-ai:fix/spending-limit-resilience

Conversation

@slacki-ai
Contributor

Summary

  • Spending-limit errors (e.g. "Failed to start GPU" due to RunPod hourly spend cap) now trigger a 5-minute global provisioning pause instead of penalising individual GPU types. Jobs stay pending and retry automatically after the pause.
  • Default cooldown reduced from 7 days to 1 hour for non-spending-limit hardware failures.
  • Escalating cooldowns within the same UTC day: 1h → 6h → 2 days for repeated cooldowns on the same GPU type. Resets daily.
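The escalation schedule above can be sketched as a small lookup keyed by how many cooldowns have already been triggered for that GPU type in the current UTC day. Function and variable names here are illustrative, not the PR's actual identifiers:

```python
from datetime import datetime, timezone
from typing import Optional

# 1 hour -> 6 hours -> 2 days, capped at the last entry (illustrative sketch)
ESCALATION_SCHEDULE = [3600, 6 * 3600, 2 * 24 * 3600]

def cooldown_seconds(cooldowns_today: int) -> int:
    """Cooldown duration for the nth cooldown triggered today (0-indexed)."""
    idx = min(cooldowns_today, len(ESCALATION_SCHEDULE) - 1)
    return ESCALATION_SCHEDULE[idx]

def utc_day_key(now: Optional[datetime] = None) -> str:
    """Calendar-day key; a new UTC day resets the escalation counter."""
    return (now or datetime.now(timezone.utc)).strftime("%Y-%m-%d")
```

A counter stored per `(gpu_type, utc_day_key())` gives the daily reset for free: yesterday's failures simply stop matching today's key.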

How it works

  1. is_spending_limit_error() detects spending-limit patterns in RunPod error messages
  2. On match, record_failure() sets a global pause timestamp instead of counting toward the per-hardware failure threshold
  3. scale_workers() checks is_spending_limit_paused() at the top and returns early (with a log warning), keeping all jobs in pending state
  4. For non-spending-limit failures, the existing threshold logic applies but uses an escalating cooldown schedule tracked per calendar day (UTC)
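The four steps above can be sketched roughly as follows. The function names (`is_spending_limit_error`, `record_failure`, `is_spending_limit_paused`) come from the PR, but the internal structure and error patterns shown here are assumptions:

```python
import re
import time

# Assumed detection patterns; the real list lives in is_spending_limit_error()
SPENDING_LIMIT_PATTERNS = [r"spending limit", r"failed to start gpu"]

class ProvisioningState:
    def __init__(self, pause_seconds: float = 300.0):
        self.pause_seconds = pause_seconds
        self.paused_until = 0.0          # global pause timestamp (step 2)
        self.failures = {}               # per-hardware failure counts (step 4)

    def is_spending_limit_error(self, message: str) -> bool:
        # Step 1: pattern-match the RunPod error message
        return any(re.search(p, message, re.IGNORECASE)
                   for p in SPENDING_LIMIT_PATTERNS)

    def record_failure(self, gpu_type: str, message: str) -> None:
        if self.is_spending_limit_error(message):
            # Step 2: account-wide condition -> global pause,
            # not a strike against this GPU type
            self.paused_until = time.time() + self.pause_seconds
        else:
            # Step 4: count toward the per-hardware threshold
            self.failures[gpu_type] = self.failures.get(gpu_type, 0) + 1

    def is_spending_limit_paused(self) -> bool:
        # Step 3: scale_workers() checks this first and returns early
        return time.time() < self.paused_until
```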

Configuration

All new values are env-configurable:

  • OW_RUNPOD_SPENDING_LIMIT_PAUSE_SECONDS — global pause duration (default: 300s)
  • OW_RUNPOD_HARDWARE_COOLDOWN_SECONDS — base cooldown (default: 3600s, down from 604800s)
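Reading these values presumably looks something like the following (a sketch, assuming plain `os.environ` lookups with the documented defaults):

```python
import os

# Global provisioning pause after a spending-limit error (default 5 minutes)
SPENDING_LIMIT_PAUSE_SECONDS = int(
    os.environ.get("OW_RUNPOD_SPENDING_LIMIT_PAUSE_SECONDS", "300")
)

# Base cooldown for ordinary hardware failures (default 1 hour, was 7 days)
HARDWARE_COOLDOWN_SECONDS = int(
    os.environ.get("OW_RUNPOD_HARDWARE_COOLDOWN_SECONDS", "3600")
)
```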

Test plan

  • 8 unit tests pass (3 existing + 5 new covering spending-limit detection, global pause, escalation, and daily reset)
  • Deploy to staging and verify jobs stay pending when spending limit is hit
  • Verify escalating cooldowns work by triggering repeated hardware failures

🤖 Generated with Claude Code

Spending-limit errors from RunPod (e.g. "Failed to start GPU") now trigger
a 5-minute global provisioning pause instead of penalising individual GPU
types. Non-spending-limit hardware failures use escalating cooldowns within
the same UTC day: 1 hour → 6 hours → 2 days (previously a flat 7-day
cooldown).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@slacki-ai force-pushed the fix/spending-limit-resilience branch from 90b0168 to bb71ab3 on April 27, 2026 at 09:06
…ldown dates

- When a spending-limit error is detected in the inner hardware-candidate
  loop, stop trying other GPU types immediately (they'll all fail for the
  same account-wide reason).
- Log spending-limit hits at WARNING level with a clear message.
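The early-abort behaviour in the first bullet might look like this minimal sketch. `try_candidates`, `provision`, and the string check are all hypothetical stand-ins for the PR's inner hardware-candidate loop:

```python
import logging

log = logging.getLogger(__name__)

def looks_like_spending_limit(message: str) -> bool:
    # Assumed detection; the real patterns live in is_spending_limit_error()
    msg = message.lower()
    return "spending limit" in msg or "failed to start gpu" in msg

def try_candidates(candidates, provision):
    """Try each GPU type in order; provision(gpu_type) returns a pod
    or raises RuntimeError on failure."""
    for gpu_type in candidates:
        try:
            return provision(gpu_type)
        except RuntimeError as err:
            if looks_like_spending_limit(str(err)):
                # Account-wide cap: every other GPU type would fail too
                log.warning("Spending limit hit; aborting provisioning "
                            "for all GPU types")
                break
            log.info("Provisioning %s failed: %s", gpu_type, err)
    return None
```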

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
