feat: Add support for max_inflight_requests parameter to prevent unbounded memory growth in ensemble models #455
Pull Request Overview
This PR adds support for a max_ensemble_inflight_responses parameter to prevent unbounded memory growth in ensemble models by implementing backpressure control. The feature limits concurrent inflight responses from ensemble steps to downstream consumers.
- Adds backpressure configuration parameter parsing with validation
- Implements producer blocking mechanism when downstream consumers are overloaded
- Tracks inflight response counts per step with proper synchronization
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/ensemble_scheduler/ensemble_scheduler.h | Adds max_inflight_responses_ field to EnsembleInfo struct |
| src/ensemble_scheduler/ensemble_scheduler.cc | Implements backpressure logic with tracking, blocking, and configuration parsing |
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
whoisj left a comment
I have concerns here. This change creates an array of mutexes and condition variables that independently track what I assume are producer/consumer channels.
This seems overly complex to me.
Why not use a simple integer to track the number of active responses against capacity, and a single mutex + cv to handle interactions with those values?
Finally, does this guard against output overflows, where too many requests have completed but downstream models are incapable of consuming those outputs?
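For illustration, the single-counter approach suggested above could look roughly like the following. This is a minimal sketch, not code from the PR: `InflightLimiter`, `Acquire`, `Release`, and `SimulatePeakInflight` are hypothetical names, and the sketch assumes one global limit rather than the per-step tracking the PR implements.

```cpp
#include <algorithm>
#include <atomic>
#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>

// Hypothetical helper (not from the PR): caps in-flight responses using
// one counter, one mutex, and one condition variable.
// Assumes capacity >= 1; a capacity of 0 would block the producer forever.
class InflightLimiter {
 public:
  explicit InflightLimiter(std::size_t capacity) : capacity_(capacity) {}

  // Producer side: blocks while the configured limit is reached.
  void Acquire() {
    std::unique_lock<std::mutex> lock(mu_);
    cv_.wait(lock, [this] { return active_ < capacity_; });
    ++active_;
    peak_ = std::max(peak_, active_);
  }

  // Consumer side: frees one slot and wakes a blocked producer, if any.
  void Release() {
    {
      std::lock_guard<std::mutex> lock(mu_);
      --active_;
    }
    cv_.notify_one();
  }

  // Highest in-flight count observed so far.
  std::size_t Peak() {
    std::lock_guard<std::mutex> lock(mu_);
    return peak_;
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  const std::size_t capacity_;
  std::size_t active_ = 0;
  std::size_t peak_ = 0;
};

// Simulates a fast producer feeding a slow consumer and returns the
// peak number of in-flight responses.
std::size_t SimulatePeakInflight(std::size_t capacity, int num_responses) {
  InflightLimiter limiter(capacity);
  std::atomic<int> produced{0};

  std::thread consumer([&] {
    for (int i = 0; i < num_responses; ++i) {
      // Wait until response i has actually been produced.
      while (produced.load() <= i) std::this_thread::yield();
      // Stand-in for slow downstream work.
      std::this_thread::sleep_for(std::chrono::milliseconds(1));
      limiter.Release();
    }
  });

  // Fast producer: emits as quickly as the limiter allows.
  for (int i = 0; i < num_responses; ++i) {
    limiter.Acquire();
    produced.fetch_add(1);
  }

  consumer.join();
  return limiter.Peak();
}
```

Calling `SimulatePeakInflight(4, 50)` never reports a peak above 4, because the producer blocks in `Acquire` until the consumer calls `Release`. Whether one global counter is enough, or each ensemble step needs its own limit, is the design question this comment raises.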
Need documentation and show the use case.
Title changed from "max_ensemble_inflight_responses parameter to prevent unbounded memory growth in ensemble models" to "max_inflight_responses parameter to prevent unbounded memory growth in ensemble models"
…into spolisetty/tri-26-triton-dali-ensemble-model-memory-issue
Title changed from "max_inflight_responses parameter to prevent unbounded memory growth in ensemble models" to "max_inflight_requests parameter to prevent unbounded memory growth in ensemble models"
This PR adds support for a max_inflight_requests parameter to prevent unbounded memory growth in ensemble models by implementing backpressure control. The feature limits concurrent in-flight responses from ensemble steps to downstream consumers.
Problem
When a fast decoupled producer (e.g., DALI video decoder generating 200 frames instantly) feeds a slow consumer (e.g., image classification taking 200 ms per frame), responses pile up in memory waiting to be processed. This causes unbounded memory growth (25-35 GB observed for a single request).
Solution
The new parameter blocks the producer once the number of responses pending at the downstream consumer reaches the configured limit, which applies backpressure to the producer. Example configuration:
CI: triton-inference-server/server#8458