Change Summary
Topic-based pipelines currently have a cluster of shutdown, backpressure, and mixed-delivery semantic problems that make rollout and shutdown behavior unreliable under load.
The main symptoms observed during live validation were:
- topic receiver drain could report drained while the receiver still kept running
- topic exporter could block shutdown when queue_on_full: block was configured
- mixed-topic try_publish could deliver to broadcast subscribers even when balanced delivery rejected the same message
- mixed async publish could hold permits already reserved in some balanced groups while still waiting on another group, which made it convoy-prone
These issues are closely related because they all stem from node-side emulation of topic waiting and admission behavior instead of letting the topic runtime own those semantics.
Problem Details
1. Topic receiver drain behavior was not shutdown-safe
During receiver-first shutdown, the topic receiver could acknowledge ingress drain too early while still polling or otherwise remaining active. Under sustained traffic, this could leave replace rollouts, pipeline shutdown, or group shutdown waiting until their drain deadlines expired.
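The intended ordering can be illustrated with a small state machine. This is only a sketch under assumptions: `ReceiverDrain`, `ReceiverState`, and the in-flight counter are hypothetical names invented for illustration, not the actual receiver types. The point it shows is that drain is only reported once the receiver has stopped taking new deliveries and has nothing left in flight.

```rust
/// Hypothetical lifecycle states for a topic receiver (illustrative only).
#[derive(Debug, Clone, Copy, PartialEq)]
enum ReceiverState {
    Running,
    Draining,
    Drained,
}

/// Hypothetical drain tracker; not the real receiver implementation.
struct ReceiverDrain {
    state: ReceiverState,
    in_flight: usize,
}

impl ReceiverDrain {
    fn new() -> Self {
        Self { state: ReceiverState::Running, in_flight: 0 }
    }

    /// Returns true if the receiver may take another message from the topic.
    /// Once draining has begun, no new deliveries are started.
    fn try_start_delivery(&mut self) -> bool {
        if self.state != ReceiverState::Running {
            return false;
        }
        self.in_flight += 1;
        true
    }

    /// Marks one in-flight delivery as finished.
    fn finish_delivery(&mut self) {
        self.in_flight = self.in_flight.saturating_sub(1);
        self.settle();
    }

    /// Requests drain; the receiver stops polling immediately.
    fn begin_drain(&mut self) {
        if self.state == ReceiverState::Running {
            self.state = ReceiverState::Draining;
        }
        self.settle();
    }

    /// Moves to Drained only when draining and nothing is in flight.
    fn settle(&mut self) {
        if self.state == ReceiverState::Draining && self.in_flight == 0 {
            self.state = ReceiverState::Drained;
        }
    }

    /// Drain may be acknowledged only when this returns true.
    fn is_drained(&self) -> bool {
        self.state == ReceiverState::Drained
    }
}
```

With this shape, a drain request that arrives while one delivery is in flight leaves the receiver in `Draining`, and the acknowledgement becomes possible only after that delivery finishes.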
2. Topic exporter could block coordinated shutdown under backpressure
When queue_on_full: block was used, the topic exporter could remain stuck behind blocked publish behavior and fail to respond promptly to shutdown. That prevented clean downstream drain and contributed to pipeline shutdown timeouts.
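One way to keep a blocked publish responsive to shutdown is to have the waiter wake on either freed capacity or a shutdown request. The sketch below uses only the standard library and a synchronous `Condvar`; `BoundedTopic`, `BlockedPublish`, and their methods are hypothetical names, and the real runtime presumably uses async primitives instead.

```rust
use std::collections::VecDeque;
use std::sync::{Condvar, Mutex};

/// Hypothetical result of a blocking publish (illustrative only).
#[derive(Debug, PartialEq)]
enum BlockedPublish {
    Published,
    ShutdownRequested,
}

struct State {
    queue: VecDeque<u64>,
    capacity: usize,
    shutdown: bool,
}

/// Hypothetical bounded topic that owns the "wait while full" logic.
struct BoundedTopic {
    state: Mutex<State>,
    changed: Condvar,
}

impl BoundedTopic {
    fn new(capacity: usize) -> Self {
        Self {
            state: Mutex::new(State { queue: VecDeque::new(), capacity, shutdown: false }),
            changed: Condvar::new(),
        }
    }

    /// Blocks while the queue is full, but returns as soon as shutdown is requested.
    fn publish_blocking(&self, msg: u64) -> BlockedPublish {
        let mut s = self.state.lock().unwrap();
        loop {
            if s.shutdown {
                return BlockedPublish::ShutdownRequested;
            }
            if s.queue.len() < s.capacity {
                s.queue.push_back(msg);
                return BlockedPublish::Published;
            }
            // Woken by either pop() or request_shutdown().
            s = self.changed.wait(s).unwrap();
        }
    }

    /// Removes one message and wakes any blocked publisher.
    fn pop(&self) -> Option<u64> {
        let mut s = self.state.lock().unwrap();
        let msg = s.queue.pop_front();
        if msg.is_some() {
            self.changed.notify_all();
        }
        msg
    }

    /// Flags shutdown and wakes every blocked publisher.
    fn request_shutdown(&self) {
        self.state.lock().unwrap().shutdown = true;
        self.changed.notify_all();
    }
}
```

Because the wait lives inside the topic, the exporter does not need its own retry loop: a publisher blocked on a full queue returns `ShutdownRequested` once `request_shutdown` is called.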
3. Mixed-topic try_publish semantics were inconsistent
For mixed topics, the non-blocking path could publish to the broadcast ring before balanced-group admission had succeeded. That meant broadcast subscribers could observe a message that the balanced side had already rejected with DroppedOnFull, which broke the all-or-nothing backpressure model and diverged from the async publish behavior.
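The all-or-nothing behavior can be sketched as a two-phase publish: check balanced-group admission first, and touch the broadcast ring only after every group has accepted. The types below (`MixedTopic`, `BalancedGroup`, `PublishOutcome`) are hypothetical stand-ins; only the `DroppedOnFull` name comes from the description above.

```rust
use std::collections::VecDeque;

/// Hypothetical outcome type; `DroppedOnFull` mirrors the name used above.
#[derive(Debug, PartialEq)]
enum PublishOutcome {
    Published,
    DroppedOnFull,
}

/// A bounded queue standing in for one balanced consumer group.
struct BalancedGroup {
    queue: VecDeque<String>,
    capacity: usize,
}

impl BalancedGroup {
    fn new(capacity: usize) -> Self {
        Self { queue: VecDeque::new(), capacity }
    }

    fn has_room(&self) -> bool {
        self.queue.len() < self.capacity
    }
}

/// A mixed topic: balanced groups plus a broadcast ring (modeled as a Vec).
struct MixedTopic {
    groups: Vec<BalancedGroup>,
    broadcast: Vec<String>,
}

impl MixedTopic {
    /// Non-blocking publish with all-or-nothing semantics.
    fn try_publish(&mut self, msg: &str) -> PublishOutcome {
        // Phase 1: admission. If any balanced group is full, reject
        // before anything becomes visible to broadcast subscribers.
        if !self.groups.iter().all(|g| g.has_room()) {
            return PublishOutcome::DroppedOnFull;
        }
        // Phase 2: commit to every balanced group, then broadcast.
        for group in &mut self.groups {
            group.queue.push_back(msg.to_string());
        }
        self.broadcast.push(msg.to_string());
        PublishOutcome::Published
    }
}
```

In this sketch a rejected message never reaches the broadcast side, so broadcast subscribers cannot observe something the balanced side dropped.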
4. Mixed async publish could convoy unrelated balanced groups
The async mixed publish path acquired one balanced-group permit at a time and held earlier permits while waiting on later groups. Under partial saturation, one slow group could unnecessarily reserve capacity in unrelated groups even though no message had been admitted yet.
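A possible alternative is to take permits from all balanced groups in a single step, and to hold none while waiting. The sketch below is synchronous and uses a single lock over all counters for simplicity; `GroupPermits` and its methods are hypothetical, and an async runtime would likely use different primitives to get the same property.

```rust
use std::sync::{Condvar, Mutex};

/// Hypothetical permit table: one free-permit counter per balanced group.
struct GroupPermits {
    free: Mutex<Vec<usize>>,
    changed: Condvar,
}

impl GroupPermits {
    fn new(free: Vec<usize>) -> Self {
        Self { free: Mutex::new(free), changed: Condvar::new() }
    }

    /// Takes one permit from every group, or none at all.
    fn try_acquire_all(&self) -> bool {
        let mut free = self.free.lock().unwrap();
        if free.iter().any(|&n| n == 0) {
            return false;
        }
        for n in free.iter_mut() {
            *n -= 1;
        }
        true
    }

    /// Waits until every group has a free permit, then takes them together.
    /// No permit is held while waiting, so a slow group cannot pin
    /// capacity in unrelated groups.
    fn acquire_all(&self) {
        let mut free = self.free.lock().unwrap();
        while free.iter().any(|&n| n == 0) {
            free = self.changed.wait(free).unwrap();
        }
        for n in free.iter_mut() {
            *n -= 1;
        }
    }

    /// Returns one permit to a group and wakes any waiting publisher.
    fn release(&self, group: usize) {
        let mut free = self.free.lock().unwrap();
        free[group] += 1;
        self.changed.notify_all();
    }

    /// Copies the current free-permit counts, for inspection.
    fn snapshot(&self) -> Vec<usize> {
        self.free.lock().unwrap().clone()
    }
}
```

The trade-off in this sketch is a shared lock across groups in exchange for never reserving capacity for a message that has not been admitted yet.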
Expected Direction
The cleanup should keep waiting and admission semantics inside the topic runtime rather than duplicating them in topic nodes.
That implies:
- topic-owned blocked publish waiting
- delivery-lease handling on the subscribe side
- shutdown-responsive receiver and exporter behavior without node-side retry loops on the hot path
- all-or-nothing semantics for mixed publish across balanced and broadcast delivery
- no partial permit reservation across balanced groups while waiting for another group
Why This Matters
These problems directly affect the correctness of mixed-topic delivery semantics, the predictability of backpressure behavior, the debuggability of rollout and shutdown behavior under load, and the reliability of graceful shutdown in topic-heavy topologies.
Validation Signals
The issues were surfaced by live validation of topic-heavy scenarios, especially multi-tenant topic topologies with sustained traffic, where:
- replace rollout could get stuck in draining-old and fail on drain deadline
- pipeline shutdown could time out while waiting for drain
- group shutdown could time out under coordinated shutdown pressure
Follow-up
A focused PR has been extracted to address this topic-runtime and topic-node behavior independently from the larger live-reconfiguration work.