
Handle spending-limit errors gracefully with escalating cooldowns #63

Open
slacki-ai wants to merge 2 commits into longtermrisk:v0.9 from slacki-ai:fix/spending-limit-resilience

Conversation

@slacki-ai
Contributor

Summary

  • Spending-limit errors (e.g. "Failed to start GPU" due to RunPod hourly spend cap) now trigger a 5-minute global provisioning pause instead of penalising individual GPU types. Jobs stay pending and retry automatically after the pause.
  • Default cooldown reduced from 7 days to 1 hour for non-spending-limit hardware failures.
  • Escalating cooldowns within the same UTC day: 1h → 6h → 2 days for repeated cooldowns on the same GPU type. Resets daily.
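The escalation schedule above can be sketched as a small lookup keyed by how many cooldowns have already been triggered for that GPU type in the current UTC day. Function and variable names here are illustrative, not the PR's actual identifiers:

```python
from datetime import datetime, timezone
from typing import Optional

# 1 hour -> 6 hours -> 2 days, capped at the last entry (illustrative sketch)
ESCALATION_SCHEDULE = [3600, 6 * 3600, 2 * 24 * 3600]

def cooldown_seconds(cooldowns_today: int) -> int:
    """Cooldown duration for the nth cooldown triggered today (0-indexed)."""
    idx = min(cooldowns_today, len(ESCALATION_SCHEDULE) - 1)
    return ESCALATION_SCHEDULE[idx]

def utc_day_key(now: Optional[datetime] = None) -> str:
    """Calendar-day key; a new UTC day resets the escalation counter."""
    return (now or datetime.now(timezone.utc)).strftime("%Y-%m-%d")
```

A counter stored per `(gpu_type, utc_day_key())` gives the daily reset for free: yesterday's failures simply stop matching today's key.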

How it works

  1. is_spending_limit_error() detects spending-limit patterns in RunPod error messages
  2. On match, record_failure() sets a global pause timestamp instead of counting toward the per-hardware failure threshold
  3. scale_workers() checks is_spending_limit_paused() at the top and returns early (with a log warning), keeping all jobs in pending state
  4. For non-spending-limit failures, the existing threshold logic applies but uses an escalating cooldown schedule tracked per calendar day (UTC)
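The four steps above can be sketched roughly as follows. The function names (`is_spending_limit_error`, `record_failure`, `is_spending_limit_paused`) come from the PR, but the internal structure and error patterns shown here are assumptions:

```python
import re
import time

# Assumed detection patterns; the real list lives in is_spending_limit_error()
SPENDING_LIMIT_PATTERNS = [r"spending limit", r"failed to start gpu"]

class ProvisioningState:
    def __init__(self, pause_seconds: float = 300.0):
        self.pause_seconds = pause_seconds
        self.paused_until = 0.0          # global pause timestamp (step 2)
        self.failures = {}               # per-hardware failure counts (step 4)

    def is_spending_limit_error(self, message: str) -> bool:
        # Step 1: pattern-match the RunPod error message
        return any(re.search(p, message, re.IGNORECASE)
                   for p in SPENDING_LIMIT_PATTERNS)

    def record_failure(self, gpu_type: str, message: str) -> None:
        if self.is_spending_limit_error(message):
            # Step 2: account-wide condition -> global pause,
            # not a strike against this GPU type
            self.paused_until = time.time() + self.pause_seconds
        else:
            # Step 4: count toward the per-hardware threshold
            self.failures[gpu_type] = self.failures.get(gpu_type, 0) + 1

    def is_spending_limit_paused(self) -> bool:
        # Step 3: scale_workers() checks this first and returns early
        return time.time() < self.paused_until
```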

Configuration

All new values are env-configurable:

  • OW_RUNPOD_SPENDING_LIMIT_PAUSE_SECONDS — global pause duration (default: 300s)
  • OW_RUNPOD_HARDWARE_COOLDOWN_SECONDS — base cooldown (default: 3600s, down from 604800s)
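Reading these values presumably looks something like the following (a sketch, assuming plain `os.environ` lookups with the documented defaults):

```python
import os

# Global provisioning pause after a spending-limit error (default 5 minutes)
SPENDING_LIMIT_PAUSE_SECONDS = int(
    os.environ.get("OW_RUNPOD_SPENDING_LIMIT_PAUSE_SECONDS", "300")
)

# Base cooldown for ordinary hardware failures (default 1 hour, was 7 days)
HARDWARE_COOLDOWN_SECONDS = int(
    os.environ.get("OW_RUNPOD_HARDWARE_COOLDOWN_SECONDS", "3600")
)
```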

Test plan

  • 8 unit tests pass (3 existing + 5 new covering spending-limit detection, global pause, escalation, and daily reset)
  • Deploy to staging and verify jobs stay pending when spending limit is hit
  • Verify escalating cooldowns work by triggering repeated hardware failures

🤖 Generated with Claude Code

Spending-limit errors from RunPod (e.g. "Failed to start GPU") now trigger
a 5-minute global provisioning pause instead of penalising individual GPU
types. Non-spending-limit hardware failures use escalating cooldowns within
the same UTC day: 1 hour → 6 hours → 2 days (previously a flat 7-day
cooldown).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@slacki-ai force-pushed the fix/spending-limit-resilience branch from 90b0168 to bb71ab3 on April 27, 2026 at 09:06
…ldown dates

- When a spending-limit error is detected in the inner hardware-candidate
  loop, stop trying other GPU types immediately (they'll all fail for the
  same account-wide reason).
- Log spending-limit hits at WARNING level with a clear message.
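The early-abort behaviour in the first bullet might look like this minimal sketch. `try_candidates`, `provision`, and the string check are all hypothetical stand-ins for the PR's inner hardware-candidate loop:

```python
import logging

log = logging.getLogger(__name__)

def looks_like_spending_limit(message: str) -> bool:
    # Assumed detection; the real patterns live in is_spending_limit_error()
    msg = message.lower()
    return "spending limit" in msg or "failed to start gpu" in msg

def try_candidates(candidates, provision):
    """Try each GPU type in order; provision(gpu_type) returns a pod
    or raises RuntimeError on failure."""
    for gpu_type in candidates:
        try:
            return provision(gpu_type)
        except RuntimeError as err:
            if looks_like_spending_limit(str(err)):
                # Account-wide cap: every other GPU type would fail too
                log.warning("Spending limit hit; aborting provisioning "
                            "for all GPU types")
                break
            log.info("Provisioning %s failed: %s", gpu_type, err)
    return None
```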

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
