fix: respect caller's CUDA stream in TRT runtime (Green Context support)#4232
fix: respect caller's CUDA stream in TRT runtime (Green Context support)#4232shoumikhin wants to merge 1 commit into
Conversation
bfa0fea to
a0434e4
Compare
Long-term plan: upstream PyTorch PROpened pytorch/pytorch#182149 to add torch::inductor::AOTIModelPackageLoader loader("model.pt2");
for (auto& [name, ivalue] : loader.get_custom_objs()) {
if (auto e = ivalue.toCustomClass<torch_tensorrt::TRTEngine>()) {
e->set_external_stream(reinterpret_cast<int64_t>(my_green_stream));
}
}
loader.run(inputs);This PR (#4232) ships |
08ab339 to
405c706
Compare
|
Friendly ping @narendasan @cehongwang. This has been open for a week. The upstream PyTorch dependency (pytorch/pytorch#182149) is landing today, which unblocks the full end-to-end path. Could one of you take a first pass when you get a chance? Happy to address feedback or split it up if that helps review. |
|
@shoumikhin I will leave the more detailed feedback to @cehongwang but one thing I wanted to raise was would it make sense for your usecase to just pull stream management out of the execute engine op all together. So the lazy stream from the stream pool would be a bail out in the case the user is not managing this externally. My thought here is basically we operationalize stream management in the fx graph with a set of operators that serves as a guard / interface to prevent computation on the default stream. The concern I have is if you are mixing pytorch and tensorrt operations and using say green contexts would you be able to express everything correctly Something to this effect in the green context case: I can upload a prototype if this makes sense. I think we would still use the AOTI hooks you created, just in these stream ops rather than engine by engine |
…82149) ## Summary Adds a public C++ accessor `get_custom_objs()` on `torch::aot_inductor::ProxyExecutor`, `torch::inductor::AOTIModelContainerRunner`, and `torch::inductor::AOTIModelPackageLoader` (plus a Python binding on the loader) that returns a snapshot of the torchbind custom-class constants embedded in a loaded `.pt2` model. The `IValue` payloads alias the live entries inside the proxy executor's `custom_objs_` map (the existing private storage already populated during `OSSProxyExecutor` construction by `pickle_load_obj`). Downcasting to a `CustomClassHolder` subclass and mutating its state — e.g. binding a stream, attaching a communicator, toggling profiling — affects subsequent `run()` invocations. ## Motivation Backends that store live state on torchbind constants embedded in `.pt2` packages cannot reach those objects after load via any public PyTorch API. The constants live in `OSSProxyExecutor::custom_objs_` (private, `std::unordered_map<std::string, c10::IValue>`) and the loader/runner only expose tensor-typed constant accessors (`get_constant_fqns`, `extract_constants_map`, etc.). Concrete user: **torch-tensorrt** ([pytorch/TensorRT#4232](pytorch/TensorRT#4232)) needs to bind a CUDA Green Context stream onto each `TRTEngine` torchbind embedded in a torch-tensorrt `.pt2` model loaded via `AOTIModelPackageLoader`. The `ExportedProgram` path already works (Python walks `named_modules()` and binds each engine), but the AOTI path has no equivalent because the engines never become attributes of an `nn.Module` — they live only inside the proxy executor's private map. This is a **purely additive** change: the new methods are optional to call, have a no-op default on the base `ProxyExecutor`, and do not change any existing semantics. ## API ```cpp // torch/csrc/inductor/aoti_torch/proxy_executor.h class ProxyExecutor { // ... virtual std::unordered_map<std::string, c10::IValue> get_custom_objs() const { return {}; } }; // torch/csrc/inductor/aoti_torch/oss_proxy_executor.h class OSSProxyExecutor : public ProxyExecutor { // ... std::unordered_map<std::string, c10::IValue> get_custom_objs() const override { return custom_objs_; } }; // torch/csrc/inductor/aoti_runner/model_container_runner.h class TORCH_API AOTIModelContainerRunner { // ... std::unordered_map<std::string, c10::IValue> get_custom_objs() const; }; // torch/csrc/inductor/aoti_package/model_package_loader.h class TORCH_API AOTIModelPackageLoader { // ... std::unordered_map<std::string, c10::IValue> get_custom_objs() const; }; ``` Python: ```python loader = torch._C._aoti.AOTIModelPackageLoader("model.pt2", "model") custom_objs = loader.get_custom_objs() # Dict[str, ScriptObject] ``` ## Usage example (from torch-tensorrt) ```cpp torch::inductor::AOTIModelPackageLoader loader("model.pt2"); for (auto& [name, ivalue] : loader.get_custom_objs()) { if (ivalue.isCustomClass()) { if (auto engine = ivalue.toCustomClass<torch_tensorrt::TRTEngine>()) { engine->set_external_stream(reinterpret_cast<int64_t>(my_green_stream)); } } } auto outputs = loader.run(inputs); // each engine now uses its bound stream ``` ## API design rationale Three options were considered: | Option | Pros | Cons | Picked | |---|---|---|---| | Return `const std::unordered_map<std::string, c10::IValue>&` | Zero copy | Leaks STL container layout; ABI-fragile if `custom_objs_` is renamed/reshaped | ❌ | | Visitor: `for_each_custom_obj(std::function<...>)` | Hides container | Cumbersome, no random access by name | ❌ | | **Return `std::unordered_map<std::string, c10::IValue>` by value** | ABI-safe; copy is cheap (single-digit entries × pointer-sized IValue); IValue copies share intrusive_ptr so callers still affect live state | One copy of the map | ✅ | We deliberately do **not** filter by torchbind type or otherwise interpret the constants — the `c10::IValue` interface is the right abstraction; backend filtering is the caller's job. ## Compatibility - **Source compatibility**: additive only. Existing callers of `ProxyExecutor`, `AOTIModelContainerRunner`, and `AOTIModelPackageLoader` are unaffected. - **Binary compatibility**: adds a new virtual to `ProxyExecutor`, which changes its vtable layout. Downstream subclasses of `ProxyExecutor` that were compiled against the old header will need a rebuild — but the default implementation handles them transparently if recompiled. No method removed or signature changed. - **API stability**: returns by value (no internal-layout exposure). The map type is part of the public API; if we ever change the executor's storage type we'd convert at the boundary. ## Tests `test/inductor/test_aoti_torchbind_constants.py` (new): - Compile a model with a `_TorchScriptTesting._Foo` torchbind attribute via `aoti_compile_and_package`. - Load via `torch._C._aoti.AOTIModelPackageLoader`. - Assert `get_custom_objs()` returns the expected non-empty map. - Negative test: model without torchbind constants returns an empty map. ## Suggested reviewers - @desertfire — AOTI runner / package loader owner - @chenyang78 — proxy executor / OSS proxy executor - @angelayi — `.pt2` package format Pull Request resolved: #182149 Approved by: https://github.com/desertfire
…torch#182149) ## Summary Adds a public C++ accessor `get_custom_objs()` on `torch::aot_inductor::ProxyExecutor`, `torch::inductor::AOTIModelContainerRunner`, and `torch::inductor::AOTIModelPackageLoader` (plus a Python binding on the loader) that returns a snapshot of the torchbind custom-class constants embedded in a loaded `.pt2` model. The `IValue` payloads alias the live entries inside the proxy executor's `custom_objs_` map (the existing private storage already populated during `OSSProxyExecutor` construction by `pickle_load_obj`). Downcasting to a `CustomClassHolder` subclass and mutating its state — e.g. binding a stream, attaching a communicator, toggling profiling — affects subsequent `run()` invocations. ## Motivation Backends that store live state on torchbind constants embedded in `.pt2` packages cannot reach those objects after load via any public PyTorch API. The constants live in `OSSProxyExecutor::custom_objs_` (private, `std::unordered_map<std::string, c10::IValue>`) and the loader/runner only expose tensor-typed constant accessors (`get_constant_fqns`, `extract_constants_map`, etc.). Concrete user: **torch-tensorrt** ([pytorch/TensorRT#4232](pytorch/TensorRT#4232)) needs to bind a CUDA Green Context stream onto each `TRTEngine` torchbind embedded in a torch-tensorrt `.pt2` model loaded via `AOTIModelPackageLoader`. The `ExportedProgram` path already works (Python walks `named_modules()` and binds each engine), but the AOTI path has no equivalent because the engines never become attributes of an `nn.Module` — they live only inside the proxy executor's private map. This is a **purely additive** change: the new methods are optional to call, have a no-op default on the base `ProxyExecutor`, and do not change any existing semantics. ## API ```cpp // torch/csrc/inductor/aoti_torch/proxy_executor.h class ProxyExecutor { // ... virtual std::unordered_map<std::string, c10::IValue> get_custom_objs() const { return {}; } }; // torch/csrc/inductor/aoti_torch/oss_proxy_executor.h class OSSProxyExecutor : public ProxyExecutor { // ... std::unordered_map<std::string, c10::IValue> get_custom_objs() const override { return custom_objs_; } }; // torch/csrc/inductor/aoti_runner/model_container_runner.h class TORCH_API AOTIModelContainerRunner { // ... std::unordered_map<std::string, c10::IValue> get_custom_objs() const; }; // torch/csrc/inductor/aoti_package/model_package_loader.h class TORCH_API AOTIModelPackageLoader { // ... std::unordered_map<std::string, c10::IValue> get_custom_objs() const; }; ``` Python: ```python loader = torch._C._aoti.AOTIModelPackageLoader("model.pt2", "model") custom_objs = loader.get_custom_objs() # Dict[str, ScriptObject] ``` ## Usage example (from torch-tensorrt) ```cpp torch::inductor::AOTIModelPackageLoader loader("model.pt2"); for (auto& [name, ivalue] : loader.get_custom_objs()) { if (ivalue.isCustomClass()) { if (auto engine = ivalue.toCustomClass<torch_tensorrt::TRTEngine>()) { engine->set_external_stream(reinterpret_cast<int64_t>(my_green_stream)); } } } auto outputs = loader.run(inputs); // each engine now uses its bound stream ``` ## API design rationale Three options were considered: | Option | Pros | Cons | Picked | |---|---|---|---| | Return `const std::unordered_map<std::string, c10::IValue>&` | Zero copy | Leaks STL container layout; ABI-fragile if `custom_objs_` is renamed/reshaped | ❌ | | Visitor: `for_each_custom_obj(std::function<...>)` | Hides container | Cumbersome, no random access by name | ❌ | | **Return `std::unordered_map<std::string, c10::IValue>` by value** | ABI-safe; copy is cheap (single-digit entries × pointer-sized IValue); IValue copies share intrusive_ptr so callers still affect live state | One copy of the map | ✅ | We deliberately do **not** filter by torchbind type or otherwise interpret the constants — the `c10::IValue` interface is the right abstraction; backend filtering is the caller's job. ## Compatibility - **Source compatibility**: additive only. Existing callers of `ProxyExecutor`, `AOTIModelContainerRunner`, and `AOTIModelPackageLoader` are unaffected. - **Binary compatibility**: adds a new virtual to `ProxyExecutor`, which changes its vtable layout. Downstream subclasses of `ProxyExecutor` that were compiled against the old header will need a rebuild — but the default implementation handles them transparently if recompiled. No method removed or signature changed. - **API stability**: returns by value (no internal-layout exposure). The map type is part of the public API; if we ever change the executor's storage type we'd convert at the boundary. ## Tests `test/inductor/test_aoti_torchbind_constants.py` (new): - Compile a model with a `_TorchScriptTesting._Foo` torchbind attribute via `aoti_compile_and_package`. - Load via `torch._C._aoti.AOTIModelPackageLoader`. - Assert `get_custom_objs()` returns the expected non-empty map. - Negative test: model without torchbind constants returns an empty map. ## Suggested reviewers - @desertfire — AOTI runner / package loader owner - @chenyang78 — proxy executor / OSS proxy executor - @angelayi — `.pt2` package format Pull Request resolved: pytorch#182149 Approved by: https://github.com/desertfire
405c706 to
7bee902
Compare
|
@narendasan many thanks for the feedback and prototype! I rewrote this e2e on top of latest main. Could you take another look please? What I dropped: the process-wide flag, the per-engine setter, the Python multi-stream facade. No new public API, no .pt2 format change, no torchbind additions, no FX graph changes, no header changes. What is in: stream selection moved to the top of the lambda. If the caller is on the default stream, behavior is unchanged (grab a pool stream). If the caller is on any other stream, the engine runs on that stream directly. When the chosen stream changes between calls, set runtime_states.context_changed = true so any captured CUDA Graph re-records. The pre/post sync events are skipped when the engine stream and the caller's stream are the same. Mirrored in _PythonTorchTensorRTModule.py and _CudaGraphsTorchTensorRTModule.py. This matches the shape of execute_engine.cpp on your stream_gaurds branch. I borrowed the caller_on_default local and the fence-skip pattern from there. On your "pull stream management out entirely" suggestion: with this change that is what happens for any caller who installs a non-default stream. The pool-stream path stays only as the fallback for the default-stream case, which matches your wording: the lazy stream from the pool would be a fallback when the user is not managing this externally. On mixed PyTorch + TensorRT in a green context: this PR does not push stream awareness into the FX graph. For the common case (one model per .pt2, one green context, caller wraps the whole forward in a CUDAStreamGuard), every CUDA op inside that guard inherits the green stream from the calling thread, both PyTorch ops and the TensorRT engine. So the slim change covers the "uniform green context for the whole graph" case without graph-level plumbing. Where I think the two approaches fit together: this PR is the runtime contract (the engine runs on whatever stream is current at enqueueV3 time). stream_gaurds is the layer above for cases where the caller wants intra-graph parallelism across multiple streams (different branches on different green contexts, fan-out, etc.). Single-stream callers do not need the FX rewrite to get green contexts working, and graph-level ops still need the runtime to honor the chosen stream, so the slim runtime fix is a prerequisite for stream_gaurds rather than a competing design. One thought I want to share, since it shaped the design: stream identity is imo a runtime scheduling property, not a model property. A cudaStream_t does not survive serialization, and the same .pt2 may run in different deployment topologies (single-tenant, multi-tenant green contexts, NCCL, etc.). Even with apply_stream_plan taking actual streams at runtime, baking the plan structure into the graph means re-exporting whenever the topology changes shape (different number of streams, different fan-out). For the multi-tenant on-device case where the caller picks the partition per process, host-level stream selection feels like the right granularity. I think stream_gaurds is the right design for the intra-graph multi-stream case but I would be careful about making it the default path for users who just want "engine respects my stream guard." Happy to discuss any of this. Mind taking another pass? |
…torch#182149) ## Summary Adds a public C++ accessor `get_custom_objs()` on `torch::aot_inductor::ProxyExecutor`, `torch::inductor::AOTIModelContainerRunner`, and `torch::inductor::AOTIModelPackageLoader` (plus a Python binding on the loader) that returns a snapshot of the torchbind custom-class constants embedded in a loaded `.pt2` model. The `IValue` payloads alias the live entries inside the proxy executor's `custom_objs_` map (the existing private storage already populated during `OSSProxyExecutor` construction by `pickle_load_obj`). Downcasting to a `CustomClassHolder` subclass and mutating its state — e.g. binding a stream, attaching a communicator, toggling profiling — affects subsequent `run()` invocations. ## Motivation Backends that store live state on torchbind constants embedded in `.pt2` packages cannot reach those objects after load via any public PyTorch API. The constants live in `OSSProxyExecutor::custom_objs_` (private, `std::unordered_map<std::string, c10::IValue>`) and the loader/runner only expose tensor-typed constant accessors (`get_constant_fqns`, `extract_constants_map`, etc.). Concrete user: **torch-tensorrt** ([pytorch/TensorRT#4232](pytorch/TensorRT#4232)) needs to bind a CUDA Green Context stream onto each `TRTEngine` torchbind embedded in a torch-tensorrt `.pt2` model loaded via `AOTIModelPackageLoader`. The `ExportedProgram` path already works (Python walks `named_modules()` and binds each engine), but the AOTI path has no equivalent because the engines never become attributes of an `nn.Module` — they live only inside the proxy executor's private map. This is a **purely additive** change: the new methods are optional to call, have a no-op default on the base `ProxyExecutor`, and do not change any existing semantics. ## API ```cpp // torch/csrc/inductor/aoti_torch/proxy_executor.h class ProxyExecutor { // ... virtual std::unordered_map<std::string, c10::IValue> get_custom_objs() const { return {}; } }; // torch/csrc/inductor/aoti_torch/oss_proxy_executor.h class OSSProxyExecutor : public ProxyExecutor { // ... std::unordered_map<std::string, c10::IValue> get_custom_objs() const override { return custom_objs_; } }; // torch/csrc/inductor/aoti_runner/model_container_runner.h class TORCH_API AOTIModelContainerRunner { // ... std::unordered_map<std::string, c10::IValue> get_custom_objs() const; }; // torch/csrc/inductor/aoti_package/model_package_loader.h class TORCH_API AOTIModelPackageLoader { // ... std::unordered_map<std::string, c10::IValue> get_custom_objs() const; }; ``` Python: ```python loader = torch._C._aoti.AOTIModelPackageLoader("model.pt2", "model") custom_objs = loader.get_custom_objs() # Dict[str, ScriptObject] ``` ## Usage example (from torch-tensorrt) ```cpp torch::inductor::AOTIModelPackageLoader loader("model.pt2"); for (auto& [name, ivalue] : loader.get_custom_objs()) { if (ivalue.isCustomClass()) { if (auto engine = ivalue.toCustomClass<torch_tensorrt::TRTEngine>()) { engine->set_external_stream(reinterpret_cast<int64_t>(my_green_stream)); } } } auto outputs = loader.run(inputs); // each engine now uses its bound stream ``` ## API design rationale Three options were considered: | Option | Pros | Cons | Picked | |---|---|---|---| | Return `const std::unordered_map<std::string, c10::IValue>&` | Zero copy | Leaks STL container layout; ABI-fragile if `custom_objs_` is renamed/reshaped | ❌ | | Visitor: `for_each_custom_obj(std::function<...>)` | Hides container | Cumbersome, no random access by name | ❌ | | **Return `std::unordered_map<std::string, c10::IValue>` by value** | ABI-safe; copy is cheap (single-digit entries × pointer-sized IValue); IValue copies share intrusive_ptr so callers still affect live state | One copy of the map | ✅ | We deliberately do **not** filter by torchbind type or otherwise interpret the constants — the `c10::IValue` interface is the right abstraction; backend filtering is the caller's job. ## Compatibility - **Source compatibility**: additive only. Existing callers of `ProxyExecutor`, `AOTIModelContainerRunner`, and `AOTIModelPackageLoader` are unaffected. - **Binary compatibility**: adds a new virtual to `ProxyExecutor`, which changes its vtable layout. Downstream subclasses of `ProxyExecutor` that were compiled against the old header will need a rebuild — but the default implementation handles them transparently if recompiled. No method removed or signature changed. - **API stability**: returns by value (no internal-layout exposure). The map type is part of the public API; if we ever change the executor's storage type we'd convert at the boundary. ## Tests `test/inductor/test_aoti_torchbind_constants.py` (new): - Compile a model with a `_TorchScriptTesting._Foo` torchbind attribute via `aoti_compile_and_package`. - Load via `torch._C._aoti.AOTIModelPackageLoader`. - Assert `get_custom_objs()` returns the expected non-empty map. - Negative test: model without torchbind constants returns an empty map. ## Suggested reviewers - @desertfire — AOTI runner / package loader owner - @chenyang78 — proxy executor / OSS proxy executor - @angelayi — `.pt2` package format Pull Request resolved: pytorch#182149 Approved by: https://github.com/desertfire
cehongwang
left a comment
There was a problem hiding this comment.
Looks good. Can you rebase this PR to the main branch? You might need to move changes in PythonTorchTensorRTmodule to the new file TRTEngine.py
Rebased on top of main (post pytorch#4222 Python runtime rework). The original PR pytorch#4232 modified _PythonTorchTensorRTModule.py which was removed by pytorch#4222; the same stream-selection logic has been ported into the new _TRTEngine class. Behavior: - If the caller is on the default CUDA stream, keep legacy behavior: run the engine on a dedicated pool stream and synchronise via wait_stream. - If the caller is on a non-default stream (e.g. attached to a CUDA Green Context for SM partitioning), honor it: reuse the caller's stream for the engine and skip the wait_stream pair (saving extra syncs and preserving the caller's scheduling choice end to end). - When CUDA graphs are enabled and the engine stream changes between invocations, trigger graph recapture via runtime_states.context_changed so we don't replay a graph recorded against a stale stream. Implementation: - core/runtime/execute_engine.cpp: same logic in C++ (cudagraphs replay path + standard path). - py/torch_tensorrt/dynamo/runtime/_TRTEngine.py: new _prepare_streams helper called from both _execute_standard and _execute_output_allocator. - py/torch_tensorrt/dynamo/runtime/_CudaGraphsTorchTensorRTModule.py: same logic in the cudagraphs wrapper module.
7bee902 to
6c448b4
Compare
Rebased on top of main (post pytorch#4222 Python runtime rework). The original PR pytorch#4232 modified _PythonTorchTensorRTModule.py which was removed by pytorch#4222; the same stream-selection logic has been ported into the new _TRTEngine class. Behavior: - If the caller is on the default CUDA stream, keep legacy behavior: run the engine on a dedicated pool stream and synchronise via wait_stream. - If the caller is on a non-default stream (e.g. attached to a CUDA Green Context for SM partitioning), honor it: reuse the caller's stream for the engine and skip the wait_stream pair (saving extra syncs and preserving the caller's scheduling choice end to end). - When CUDA graphs are enabled and the engine stream changes between invocations, trigger graph recapture via runtime_states.context_changed so we don't replay a graph recorded against a stale stream. Implementation: - core/runtime/execute_engine.cpp: same logic in C++ (cudagraphs replay path + standard path). - py/torch_tensorrt/dynamo/runtime/_TRTEngine.py: new _prepare_streams helper called from both _execute_standard and _execute_output_allocator. - py/torch_tensorrt/dynamo/runtime/_CudaGraphsTorchTensorRTModule.py: same logic in the cudagraphs wrapper module.
6c448b4 to
7b908d4
Compare
|
Rebased on |

What this fixes
Callers commonly pin work to a specific CUDA stream via
torch.cuda.stream(...)(Python) orc10::cuda::CUDAStreamGuard(...)(C++) for NCCL, AMP overlap, custom scheduling, and most recently CUDA Green Contexts (CUDA 12.4+) for SM partitioning. A green-context-derived stream is the only way to confine a TensorRT engine to a partition.Today the Torch-TensorRT runtime ignores the caller's stream. It reads
getCurrentCUDAStream(), then substitutes one from PyTorch's pool. Pool streams are not bound to any green context, so the partition is silently bypassed. The same pattern lives in both Python runtime modules. Net effect: Torch-TensorRT cannot be used for GPU partitioning at all today.What this PR does
The runtime now picks the engine's stream like this:
Two small pieces of bookkeeping fall out of this change:
The same logic is mirrored in
_PythonTorchTensorRTModule.py(hoisted to the outerforwardso both the standard and output-allocator paths see the resolved stream) and_CudaGraphsTorchTensorRTModule.py.What stays the same for existing users
Anyone who calls
model(x)without setting a stream sees the exact same behavior as before. No new public API. No new arguments. No.pt2file format change. No new fields on engine state.Behavior change for existing non-default-stream callers
Users who already wrap inference in
torch.cuda.stream(my_stream)(orc10::cuda::CUDAStreamGuard(my_stream)) for non-green reasons (custom scheduling, NCCL, AMP) will see the engine run on their stream instead of a separate pool stream. Results are identical; the timing characteristics may differ, since the previous separate-stream + event-fence pattern is replaced by direct execution on the caller's stream. For CUDA Graphs users who alternate between several non-default streams, the captured graph re-records on each stream change.What a Green Context user does
Test plan
.pt2, run inside aCUDAStreamGuardover a Green-Context-derived stream, capture annsystrace, confirm kernels run on the green stream and stay inside the partition's SMs.