[async_worker] sinks: bound output-sink retry + per-sink circuit breaker#65
Open
cmttt wants to merge 2 commits into
Open
[async_worker] sinks: bound output-sink retry + per-sink circuit breaker#65cmttt wants to merge 2 commits into
cmttt wants to merge 2 commits into
Conversation
AsyncMultiOutputSink._push_one retried every sink up to max_retries+1 times with an unbounded 0.5s,1.0s,1.5s,... backoff. Pushes run on the sequential rules-sink stream, so when a downstream is down service-wide every message walks the full retry+backoff (~7.5s for the event-effects sink) before the stream reads the next action — backpressuring the coordinator and stalling ingestion fleet-wide. Observed in prod 2026-06-15: a labels-service blip during a GKE node-pool auto-upgrade became a 13-min fleet ingestion stall. Bound it three ways, all tunable per-sink with conservative defaults: - max_backoff_seconds caps a single backoff sleep - max_total_push_seconds caps total wall-clock per push (covers all-timeouts) - circuit_breaker_threshold/cooldown: after N consecutive failures shed pushes for a cooldown instead of every message paying the retry cost; a success resets the counter, a failed half-open probe re-opens New metrics: output_sink.circuit_opened, output_sink.circuit_open.
…circuit on success Two fixes from the PR review of the circuit breaker: - _record_failure now returns whether the circuit is open; _push_one breaks out of the retry loop the instant it opens (the threshold-crossing push and every failed half-open probe) instead of walking the rest of the retry/backoff budget and holding the sequential stream. - success clears _circuit_open_until (not just the failure count), so a push that succeeds — including one closing a circuit opened by a concurrent in-flight push — fully closes it instead of leaving stale shedding. +2 regression tests.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
AsyncMultiOutputSink._push_oneretried each sinkmax_retries + 1times with an unbounded0.5s, 1.0s, 1.5s, …backoff on every exception. Pushes run on the sequential rules-sink stream, so when a downstream is down service-wide, every message walks the full retry+backoff before the stream can read the next action. For the Discord event-effects sink (max_retries=5) that's ~7.5s of held time per message — which backpressures the coordinator and stalls ingestion fleet-wide.This bounds the retry path three ways and adds a per-sink circuit breaker so a downstream outage sheds instead of stalling.
Why (prod incident, 2026-06-15)
A GKE node-pool auto-upgrade of the
smite-dataservice-pool(22:24:59–22:38:55Z) rolling-drained thesmite-labelsStatefulSet node-by-node → the labels service was intermittently gRPC-UNAVAILABLE. EachApplyEntityMutationfailure was retried by the label client (secondaries=2) and by this multi-sink loop (6 attempts, ~7.5s of sleeps). On the sequential stream that blocked ingestion → coordinator went idle → completions collapsed ~83% (5.9K→1K/s) → ~6M-message backlog for ~13 minutes, until the upgrade finished. A routine, scheduled node upgrade should not take down ingestion.What changed
AsyncMultiOutputSink._push_one(+ tunable attributes onAsyncBaseOutputSink, conservative defaults):max_backoff_secondsmax_total_push_secondscircuit_breaker_threshold/circuit_breaker_cooldown_seconds>=not==)Circuit-breaker state lives on the (process-shared)
AsyncMultiOutputSink, keyed byid(sink), so a service-wide outage trips all concurrency streams at once. Setcircuit_breaker_threshold = 0to disable.New metrics:
output_sink.circuit_opened(transition to open),output_sink.circuit_open(a push shed while open).Test plan
osprey_async_worker/.../tests/test_async_sinks.py— 12 pass (8 existing + 4 new):thresholdconsecutive failures and sheds subsequent pushesmax_total_push_secondsbounds cumulative retry time regardless ofmax_retriesruff check+ruff formatclean.Open questions for review
threshold=0(off) here and have Discord opt the event-effects sink in. Preference?Deploy note (Discord)
Re-pin in both
discord_smite/requirements.{in,txt}anddiscord_devops/docker/discord_smite/osprey_coordinator.Dockerfile(ARG OSPREY_COMMIT).