Skip to content

[async_worker] sinks: bound output-sink retry + per-sink circuit breaker#65

Open
cmttt wants to merge 2 commits into
mainfrom
ls/output-sink-retry-bound
Open

[async_worker] sinks: bound output-sink retry + per-sink circuit breaker#65
cmttt wants to merge 2 commits into
mainfrom
ls/output-sink-retry-bound

Conversation

@cmttt

@cmttt cmttt commented Jun 16, 2026

Copy link
Copy Markdown
Owner

Summary

AsyncMultiOutputSink._push_one retried each sink max_retries + 1 times with an unbounded 0.5s, 1.0s, 1.5s, … backoff on every exception. Pushes run on the sequential rules-sink stream, so when a downstream is down service-wide, every message walks the full retry+backoff before the stream can read the next action. For the Discord event-effects sink (max_retries=5) that's ~7.5s of held time per message — which backpressures the coordinator and stalls ingestion fleet-wide.

This bounds the retry path three ways and adds a per-sink circuit breaker so a downstream outage sheds instead of stalling.

Why (prod incident, 2026-06-15)

A GKE node-pool auto-upgrade of the smite-dataservice-pool (22:24:59–22:38:55Z) rolling-drained the smite-labels StatefulSet node-by-node → the labels service was intermittently gRPC-UNAVAILABLE. Each ApplyEntityMutation failure was retried by the label client (secondaries=2) and by this multi-sink loop (6 attempts, ~7.5s of sleeps). On the sequential stream that blocked ingestion → coordinator went idle → completions collapsed ~83% (5.9K→1K/s) → ~6M-message backlog for ~13 minutes, until the upgrade finished. A routine, scheduled node upgrade should not take down ingestion.

What changed

AsyncMultiOutputSink._push_one (+ tunable attributes on AsyncBaseOutputSink, conservative defaults):

knob default effect
max_backoff_seconds 1.0 caps a single backoff sleep (was unbounded)
max_total_push_seconds 10.0 hard ceiling on total wall-clock per push, across all attempts + backoff (also bounds the all-timeouts case, where no backoff sleep runs)
circuit_breaker_threshold / circuit_breaker_cooldown_seconds 5 / 5.0 after N consecutive failures, shed pushes for the cooldown instead of every message paying the full retry cost; a success resets the counter; a failed half-open probe re-opens (>= not ==)

Circuit-breaker state lives on the (process-shared) AsyncMultiOutputSink, keyed by id(sink), so a service-wide outage trips all concurrency streams at once. Set circuit_breaker_threshold = 0 to disable.

New metrics: output_sink.circuit_opened (transition to open), output_sink.circuit_open (a push shed while open).

Test plan

osprey_async_worker/.../tests/test_async_sinks.py — 12 pass (8 existing + 4 new):

  • opens after threshold consecutive failures and sheds subsequent pushes
  • half-opens (re-attempts) after the cooldown elapses
  • a single success resets the counter (circuit stays closed under intermittent errors)
  • max_total_push_seconds bounds cumulative retry time regardless of max_retries

ruff check + ruff format clean.

Open questions for review

  1. Default-on vs opt-in. This enables the breaker for all sinks by default (threshold=5). Conservative — only trips on sustained consecutive failure — but it changes default retry behavior. Alternative: default threshold=0 (off) here and have Discord opt the event-effects sink in. Preference?
  2. Shedding semantics. While open, a sink's push is dropped for that window (metric emitted). Same data outcome as today's exhausted retries, but without the stall. Acceptable, or should it degrade differently (e.g. queue)?

Deploy note (Discord)

Re-pin in both discord_smite/requirements.{in,txt} and discord_devops/docker/discord_smite/osprey_coordinator.Dockerfile (ARG OSPREY_COMMIT).

AsyncMultiOutputSink._push_one retried every sink up to max_retries+1 times
with an unbounded 0.5s,1.0s,1.5s,... backoff. Pushes run on the sequential
rules-sink stream, so when a downstream is down service-wide every message
walks the full retry+backoff (~7.5s for the event-effects sink) before the
stream reads the next action — backpressuring the coordinator and stalling
ingestion fleet-wide. Observed in prod 2026-06-15: a labels-service blip
during a GKE node-pool auto-upgrade became a 13-min fleet ingestion stall.

Bound it three ways, all tunable per-sink with conservative defaults:
- max_backoff_seconds caps a single backoff sleep
- max_total_push_seconds caps total wall-clock per push (covers all-timeouts)
- circuit_breaker_threshold/cooldown: after N consecutive failures shed
  pushes for a cooldown instead of every message paying the retry cost;
  a success resets the counter, a failed half-open probe re-opens

New metrics: output_sink.circuit_opened, output_sink.circuit_open.
@cmttt cmttt requested a review from haileyok as a code owner June 16, 2026 01:38
…circuit on success

Two fixes from the PR review of the circuit breaker:
- _record_failure now returns whether the circuit is open; _push_one breaks
  out of the retry loop the instant it opens (the threshold-crossing push and
  every failed half-open probe) instead of walking the rest of the
  retry/backoff budget and holding the sequential stream.
- success clears _circuit_open_until (not just the failure count), so a push
  that succeeds — including one closing a circuit opened by a concurrent
  in-flight push — fully closes it instead of leaving stale shedding.

+2 regression tests.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant