Topic shutdown under backpressure and mixed-topic publish semantics need cleanup #2630

@lquerel

Change Summary

Topic-based pipelines currently have a cluster of shutdown, backpressure, and mixed-delivery semantic problems that make rollout and shutdown behavior unreliable under load.

The main symptoms observed during live validation were:

  • topic receiver drain could report drained while the receiver still kept running
  • topic exporter could block shutdown when queue_on_full: block was configured
  • mixed-topic try_publish could deliver to broadcast subscribers even when balanced delivery rejected the same message
  • mixed async publish could hold permits already acquired from some balanced groups while still waiting on another group, which made it convoy-prone

These issues are closely related because they all stem from node-side emulation of topic waiting and admission behavior instead of letting the topic runtime own those semantics.

Problem Details

1. Topic receiver drain behavior was not shutdown-safe

During receiver-first shutdown, the topic receiver could acknowledge ingress drain too early while still polling or otherwise remaining active. Under sustained traffic this could leave replace rollouts, pipeline shutdown, or group shutdown waiting until drain deadlines expired.
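
To make the intended ordering concrete, here is a minimal asyncio model of a receiver that reports drained only after its poll loop has actually exited. It is not the project's code: `TopicReceiver`, `stop`, `drained`, and `forwarded` are hypothetical names chosen for illustration, and an `asyncio.Queue` stands in for the topic subscription.

```python
import asyncio


class TopicReceiver:
    """Hypothetical receiver model: 'drained' is set only when run() exits."""

    def __init__(self, queue):
        self.queue = queue              # stand-in for the topic subscription
        self.stop = asyncio.Event()     # set by the shutdown coordinator
        self.drained = asyncio.Event()  # set by the receiver itself, last
        self.forwarded = []             # stand-in for downstream delivery

    async def run(self):
        try:
            while not self.stop.is_set():
                get = asyncio.ensure_future(self.queue.get())
                stop_wait = asyncio.ensure_future(self.stop.wait())
                # Wait for either a message or the stop signal.
                done, pending = await asyncio.wait(
                    {get, stop_wait}, return_when=asyncio.FIRST_COMPLETED
                )
                for task in pending:
                    task.cancel()
                if get in done:
                    self.forwarded.append(get.result())
        finally:
            # Acknowledge drain only once polling has really stopped.
            self.drained.set()
```

In this model a coordinator that waits on `drained` cannot observe it while the receiver is still polling, which is the property the issue says was violated.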

2. Topic exporter could block coordinated shutdown under backpressure

When queue_on_full: block was used, the topic exporter could remain stuck in a blocked publish and fail to respond promptly to shutdown. That prevented clean downstream drain and contributed to pipeline shutdown timeouts.
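
One way to picture a shutdown-responsive blocking publish is to race the blocked enqueue against a shutdown signal. The sketch below is only a model under stated assumptions: `publish_or_shutdown` is a hypothetical helper, a bounded `asyncio.Queue` stands in for a topic configured with queue_on_full: block, and an `asyncio.Event` stands in for the shutdown signal.

```python
import asyncio


async def publish_or_shutdown(queue, msg, shutdown):
    """Hypothetical helper: block on a full queue, but give up on shutdown.

    Returns True if msg was enqueued, False if shutdown won the race.
    """
    put = asyncio.ensure_future(queue.put(msg))
    stop = asyncio.ensure_future(shutdown.wait())
    done, pending = await asyncio.wait(
        {put, stop}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()
    # Let the cancellations settle so no half-finished put is left behind.
    await asyncio.gather(*pending, return_exceptions=True)
    return put in done
```

Because the wait lives in one place, the caller needs no retry loop of its own; it only has to handle the `False` result when shutdown interrupts a blocked publish.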

3. Mixed-topic try_publish semantics were inconsistent

For mixed topics, the non-blocking path could publish to the broadcast ring before balanced-group admission had succeeded. That meant broadcast subscribers could observe a message that the balanced side then rejected with DroppedOnFull, which broke the all-or-nothing backpressure model and diverged from the async publish behavior.
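
The all-or-nothing rule can be illustrated with a small two-phase model: check admission for every balanced group first, and touch the broadcast side only after all of them have accepted. This is a single-threaded sketch, not the runtime's implementation; `MixedTopic`, `PublishOutcome`, and the list-based broadcast ring are hypothetical stand-ins, and only the `DroppedOnFull` name comes from the issue.

```python
from collections import deque
from enum import Enum


class PublishOutcome(Enum):
    PUBLISHED = "Published"
    DROPPED_ON_FULL = "DroppedOnFull"


class MixedTopic:
    """Hypothetical mixed topic: balanced groups plus one broadcast ring."""

    def __init__(self, group_capacities):
        self.capacities = list(group_capacities)
        self.groups = [deque() for _ in self.capacities]
        self.broadcast = []  # stand-in for the broadcast ring

    def try_publish(self, msg):
        # Phase 1: admission. Reject before anything becomes visible.
        for queue, capacity in zip(self.groups, self.capacities):
            if len(queue) >= capacity:
                return PublishOutcome.DROPPED_ON_FULL
        # Phase 2: commit to every balanced group, then to broadcast.
        for queue in self.groups:
            queue.append(msg)
        self.broadcast.append(msg)
        return PublishOutcome.PUBLISHED
```

A concurrent runtime would additionally need the check and the commit to be atomic with respect to other publishers; the model only shows the ordering that keeps broadcast subscribers from seeing a rejected message.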

4. Mixed async publish could convoy unrelated balanced groups

The async mixed publish path acquired one balanced-group permit at a time and held earlier permits while waiting on later groups. Under partial saturation, one slow group could unnecessarily reserve capacity in unrelated groups even though no message had been admitted yet.
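
One possible way to avoid holding permits while waiting is to wait until every balanced group has a free permit and only then take all of them in a single step. The sketch below uses an `asyncio.Condition` for that; it is an illustrative technique, not necessarily what the fix does, and `BalancedPermits`, `acquire_all`, and `release` are hypothetical names.

```python
import asyncio


class BalancedPermits:
    """Hypothetical permit pool: one counter of free permits per group."""

    def __init__(self, capacities):
        self.free = list(capacities)
        self.cond = asyncio.Condition()

    async def acquire_all(self):
        async with self.cond:
            # Nothing is reserved while waiting: the counters are only
            # decremented once every group has capacity at the same time.
            await self.cond.wait_for(lambda: all(n > 0 for n in self.free))
            self.free = [n - 1 for n in self.free]

    async def release(self, group):
        async with self.cond:
            self.free[group] += 1
            self.cond.notify_all()
```

With this shape, a saturated group delays the publisher but does not tie up capacity in the other groups, so unrelated publishers can still use them in the meantime.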

Expected Direction

The cleanup should keep waiting and admission semantics inside the topic runtime rather than duplicating them in topic nodes.

That implies:

  • topic-owned blocked publish waiting
  • delivery-lease handling on the subscribe side
  • shutdown-responsive receiver and exporter behavior without node-side retry loops on the hot path
  • all-or-nothing semantics for mixed publish across balanced and broadcast delivery
  • no partial permit reservation across balanced groups while waiting for another group

Why This Matters

These problems directly affect correctness of mixed-topic delivery semantics, predictability of backpressure behavior, debuggability of rollout and shutdown behavior under load, and graceful shutdown reliability in topic-heavy topologies.

Validation Signals

The issues were surfaced by live validation of topic-heavy scenarios, especially multi-tenant topic topologies with sustained traffic, where:

  • replace rollout could get stuck in draining-old and fail on drain deadline
  • pipeline shutdown could time out while waiting for drain
  • group shutdown could time out under coordinated shutdown pressure

Follow-up

A focused PR has been extracted to address this topic-runtime and topic-node behavior independently from the larger live-reconfiguration work.
