[async_worker] sinks: bound output-sink retry + per-sink circuit breaker by cmttt · Pull Request #65 · cmttt/osprey

cmttt · 2026-06-16T01:38:47Z

Summary

AsyncMultiOutputSink._push_one retried each sink max_retries + 1 times with an unbounded 0.5s, 1.0s, 1.5s, … backoff on every exception. Pushes run on the sequential rules-sink stream, so when a downstream is down service-wide, every message walks the full retry+backoff before the stream can read the next action. For the Discord event-effects sink (max_retries=5) that's ~7.5s of held time per message — which backpressures the coordinator and stalls ingestion fleet-wide.

This bounds the retry path three ways and adds a per-sink circuit breaker so a downstream outage sheds instead of stalling.

Why (prod incident, 2026-06-15)

A GKE node-pool auto-upgrade of the smite-dataservice-pool (22:24:59–22:38:55Z) rolling-drained the smite-labels StatefulSet node-by-node → the labels service was intermittently gRPC-UNAVAILABLE. Each ApplyEntityMutation failure was retried by the label client (secondaries=2) and by this multi-sink loop (6 attempts, ~7.5s of sleeps). On the sequential stream that blocked ingestion → coordinator went idle → completions collapsed ~83% (5.9K→1K/s) → ~6M-message backlog for ~13 minutes, until the upgrade finished. A routine, scheduled node upgrade should not take down ingestion.

What changed

AsyncMultiOutputSink._push_one (+ tunable attributes on AsyncBaseOutputSink, conservative defaults):

knob	default	effect
`max_backoff_seconds`	1.0	caps a single backoff sleep (was unbounded)
`max_total_push_seconds`	10.0	hard ceiling on total wall-clock per push, across all attempts + backoff (also bounds the all-timeouts case, where no backoff sleep runs)
`circuit_breaker_threshold` / `circuit_breaker_cooldown_seconds`	5 / 5.0	after N consecutive failures, shed pushes for the cooldown instead of every message paying the full retry cost; a success resets the counter; a failed half-open probe re-opens (`>=` not `==`)

Circuit-breaker state lives on the (process-shared) AsyncMultiOutputSink, keyed by id(sink), so a service-wide outage trips all concurrency streams at once. Set circuit_breaker_threshold = 0 to disable.

New metrics: output_sink.circuit_opened (transition to open), output_sink.circuit_open (a push shed while open).

Test plan

osprey_async_worker/.../tests/test_async_sinks.py — 12 pass (8 existing + 4 new):

opens after threshold consecutive failures and sheds subsequent pushes
half-opens (re-attempts) after the cooldown elapses
a single success resets the counter (circuit stays closed under intermittent errors)
max_total_push_seconds bounds cumulative retry time regardless of max_retries

ruff check + ruff format clean.

Open questions for review

Default-on vs opt-in. This enables the breaker for all sinks by default (threshold=5). Conservative — only trips on sustained consecutive failure — but it changes default retry behavior. Alternative: default threshold=0 (off) here and have Discord opt the event-effects sink in. Preference?
Shedding semantics. While open, a sink's push is dropped for that window (metric emitted). Same data outcome as today's exhausted retries, but without the stall. Acceptable, or should it degrade differently (e.g. queue)?

Deploy note (Discord)

Re-pin in both discord_smite/requirements.{in,txt} and discord_devops/docker/discord_smite/osprey_coordinator.Dockerfile (ARG OSPREY_COMMIT).

AsyncMultiOutputSink._push_one retried every sink up to max_retries+1 times with an unbounded 0.5s,1.0s,1.5s,... backoff. Pushes run on the sequential rules-sink stream, so when a downstream is down service-wide every message walks the full retry+backoff (~7.5s for the event-effects sink) before the stream reads the next action — backpressuring the coordinator and stalling ingestion fleet-wide. Observed in prod 2026-06-15: a labels-service blip during a GKE node-pool auto-upgrade became a 13-min fleet ingestion stall. Bound it three ways, all tunable per-sink with conservative defaults: - max_backoff_seconds caps a single backoff sleep - max_total_push_seconds caps total wall-clock per push (covers all-timeouts) - circuit_breaker_threshold/cooldown: after N consecutive failures shed pushes for a cooldown instead of every message paying the retry cost; a success resets the counter, a failed half-open probe re-opens New metrics: output_sink.circuit_opened, output_sink.circuit_open.

…circuit on success Two fixes from the PR review of the circuit breaker: - _record_failure now returns whether the circuit is open; _push_one breaks out of the retry loop the instant it opens (the threshold-crossing push and every failed half-open probe) instead of walking the rest of the retry/backoff budget and holding the sequential stream. - success clears _circuit_open_until (not just the failure count), so a push that succeeds — including one closing a circuit opened by a concurrent in-flight push — fully closes it instead of leaving stale shedding. +2 regression tests.

cmttt requested a review from haileyok as a code owner June 16, 2026 01:38

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[async_worker] sinks: bound output-sink retry + per-sink circuit breaker#65

[async_worker] sinks: bound output-sink retry + per-sink circuit breaker#65
cmttt wants to merge 2 commits into
mainfrom
ls/output-sink-retry-bound

cmttt commented Jun 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

cmttt commented Jun 16, 2026

Summary

Why (prod incident, 2026-06-15)

What changed

Test plan

Open questions for review

Deploy note (Discord)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant