Worker-connection-gated readiness probe for the coordinator by cmttt · Pull Request #36 · cmttt/osprey

cmttt · 2026-05-08T21:00:32Z

Summary

Adds a custom HealthChecker on the coordinator's sync_action gRPC service that returns SERVING only when at least one async worker is connected via the bidi stream. Without this, new coord pods accept discord_api → coord process_action calls before any worker has dialed in, those calls enqueue onto a priority queue with no consumers, and discord_api times out with DEADLINE_EXCEEDED.

Why

Every rolling deploy of osprey-coordinator-asyncio-prd produces ~2,500–3,000 DEADLINE_EXCEEDED errors on discord_api.rpc_client.osprey_coordinator_asyncio.process_action. We've worked through the cluster-config knobs over the last few PRs (surge=1, 30s startupProbe delay, 30s preStop sleep, +grace period) and total failures are roughly unchanged because they all address the drain side. The actual gap is on the warmup side: pigeon's grpc.health.v1 answers SERVING the moment port 80 binds (~2s after container start), well before the bidi service has been announced in etcd and workers have connected. K8s readiness flips green at the 30s startupProbe boundary, traffic flows, and the first ~5–15 seconds of inbound calls have nowhere to go.

How

pigeon::Server::with_health_checker — new builder method that pushes a HealthChecker (clone-able closure form: Fn() -> bool + Clone + Send + Sync + 'static) onto every gRPC service registered on the Server. No existing call sites are touched.
Connected-worker counter on OspreyCoordinatorServer — Arc<AtomicI64>, owned by the bidi server. Each bidi stream handler increments the counter on entry and decrements it via a WorkerConnectionGuard Drop impl, so disconnects (graceful, panic, cancellation) all release the count without explicit decrement code at every break site.
main.rs swap — pigeon::serve(sync_action, ...) is replaced with an explicit pigeon::Server::new(...).with_health_checker(\|\| count > 0).serve() chain, mirroring everything pigeon::serve does plus the new checker. Bidi service still uses the unmodified pigeon::serve (no readiness gating; workers don't K8s-probe it).

Net diff: ~76 lines, 3 files, 0 unsafe code.

Important caveat — companion change required

Pigeon's grpc.health.v1.Health/Check returns SERVING for any request with an empty service name (intentional: that's the liveness fast-path, health.rs:101-104). K8s' built-in gRPC probe defaults to an empty service name, which means just shipping this PR does nothing on its own — the K8s probe still goes through the fast-path and ignores the new checker.

The accompanying discord/discord cluster config must change the readiness probe to:

readinessProbe: healthProbe(
  grpcProbe(80, "osprey.rpc.osprey_coordinator.sync_action.v1.OspreyCoordinatorSyncActionService"),
  {periodSeconds: 5, failureThreshold: 2},
)

Once both ship, the K8s readiness probe will route through the registered service-name path → HealthServer::check → our checker → connected_workers > 0.

Test plan

cargo check clean (211 pre-existing warnings, no errors / no new warnings from this change)
Local: spin up coordinator + at least one worker, hit grpcurl localhost:80 grpc.health.v1.Health/Check '{"service":"osprey.rpc.osprey_coordinator.sync_action.v1.OspreyCoordinatorSyncActionService"}' before any worker connects → expect NOT_SERVING; after worker connects → expect SERVING
Verify connection_guard decrements correctly under panic and cancellation by checking counter after worker disconnects mid-stream
Once merged + image rebuilt + pinned + companion discord/discord PR shipped: observe a coord rolling deploy and confirm process_action.failure{exception_type:deadline_exceeded} rate during rollout drops materially (target: <100 vs the current ~2,500–3,000 baseline)

Adds an asyncio-native osprey worker as a sibling package to the gevent worker. Includes coordinator input stream, sinks (rules, output), pigeon client + interceptors, etcd-backed sources provider, executor, and engine. Coordinator gains a bidirectional stream and pigeon module. Subsequent tuning to match the gevent baseline: - Move recompile off the event loop; force gc.collect after engine swap - Match gevent's compile CPU profile (and later remove the gc.collect hack) - Empirically widen pause-input-stream jitter (final: 0-600s) - Preserve WatchMux dedup state across watcher iteration

Engine: AST/grammar tweaks, executor and execution-context refinements, UDF helpers, stdlib UDF updates (categories, experiments, json_data, rules, string), external service utils. Worker: udf_register, osprey_engine, sources_provider (incl. skipping engine recompile on no-op etcd sources events), output/rules sinks. Plus repo housekeeping: CODEOWNERS, PR template, dev workflow doc, and a small UI polish in FeatureSelectModal.

- _mux.py / test_watcher_mux.py: preserve dedup state correctly across watcher iteration. - watcherd_impl.py: downgrade etcd watcher reconnect logs from exception to warning (these reconnects are routine, not errors). - action.proto + generated pb2/pb2.pyi: add GuildBatchMLScore action.

add and register count regex matches udf

* Bootstrap async UDFs through first-party plugin hook Move the async-stdlib MXLookup override from a hardcoded list inside bootstrap_async_udfs into a small first-party pluggy plugin module (stdlib_udfs/_async_stdlib_plugin.py). The plugin registers async-native stdlib replacements through the same register_udfs hook used by discord_osprey_async_plugins. _deduplicate_udfs already handles override-by-class-name, so MXLookup (async) automatically shadows MXLookup (sync) — no special path. Adding a new async stdlib override now means: drop the class file in stdlib_udfs/ and append it to _async_stdlib_plugin.register_udfs(). No edits to bootstrap_async_udfs required. * Tests for async UDF bootstrap behavior Lock down that bootstrap_async_udfs: - resolves MXLookup to the async-native class (not sync stdlib) - doesn't leak SyncMXLookup into the registry - preserves non-overridden stdlib UDFs (JsonData, Rule) - registers the first-party async-stdlib plugin via the same hook used by third-party plugins - emits AsyncMXLookup through the register_udfs hook

The K8s readiness probe on osprey-coordinator-asyncio currently returns SERVING the moment pigeon binds port 80 — well before any async worker has dialed in via the bidi stream. discord_api → coord process_action calls land on a freshly-Ready pod, get enqueued onto a priority queue with no consumers, and time out with DEADLINE_EXCEEDED. This change wires a custom HealthChecker on the sync_action service that reports healthy only when at least one worker is connected via bidi: * `OspreyCoordinatorServer` (the bidi server) now owns an `Arc<AtomicI64>` connected-worker counter. * Each bidi stream handler increments on entry and decrements via a `WorkerConnectionGuard` Drop impl, so disconnects (graceful, panic, cancellation) all release the count. * `pigeon::Server::with_health_checker` is added so a closure can be attached to every gRPC service registered on the Server. * `main.rs` swaps `pigeon::serve(sync_action, ...)` for an explicit `Server::new(...).with_health_checker(...).serve()` chain that gates SERVING on `connected_workers > 0`. Note that pigeon's grpc.health.v1 still fast-paths empty service names to SERVING (liveness semantics), so the K8s readinessProbe must specify the sync_action service name to actually invoke this checker. The cluster-side change ships in a companion discord/discord PR.

…r commit

cmttt and others added 9 commits May 7, 2026 22:57

Metrics / DogStatsd improvements

34add0e

add and register count regex matches udf

e5108dc

Merge pull request #34 from cmttt/chenyu/count_regex

fb5c70e

add and register count regex matches udf

Drop bigquery_shadow_sink.py — WIP file accidentally included in prio…

548aa93

…r commit

cmttt force-pushed the main branch from 4c2b18f to cda5a87 Compare June 6, 2026 16:25

cmttt requested a review from haileyok as a code owner June 6, 2026 16:25

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Worker-connection-gated readiness probe for the coordinator#36

Worker-connection-gated readiness probe for the coordinator#36
cmttt wants to merge 9 commits into
mainfrom
ls/coord-readiness-healthchecker

cmttt commented May 8, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

cmttt commented May 8, 2026

Summary

Why

How

Important caveat — companion change required

Test plan

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants