Add TensorRT Multi-Device (multi-GPU) inference support#120

Open
pkisfaludi-nv wants to merge 1 commit into triton-inference-server:r26.06 from pkisfaludi-nv:feat/trt-28040-multi-device

@pkisfaludi-nv

Summary

Adds TensorRT Multi-Device (MD) support to the tensorrt backend: run a single TensorRT engine sharded across multiple GPUs via NCCL DistCollective + IExecutionContext::setCommunicator (GA in TensorRT 11), transparently to clients (same gRPC/HTTP API).

This mirrors an internal change; submitting upstream as requested by the backend maintainers.

Usage

Enable per model with a KIND_MODEL instance group + parameters:

instance_group [ { kind: KIND_MODEL count: 1 } ]
parameters [
  { key: "enable_multi_device"           value: { string_value: "true" } },
  { key: "multi_device_gpus"             value: { string_value: "0,1" } },
  { key: "multi_device_per_rank_engines" value: { string_value: "true" } }
]
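
To make the parameter format concrete, here is a minimal Python sketch of how string values like these could be interpreted. The helper `parse_multi_device_params` and its validation rules (at least two distinct, non-negative GPU ids; missing keys treated as "false") are assumptions for illustration only. The backend's actual parsing is in its C++ source and may differ.

```python
def parse_multi_device_params(params):
    """Interpret string-valued model parameters (hypothetical helper).

    `params` maps parameter key -> string_value, as in the config above.
    """
    def as_bool(key):
        # Assumption: anything other than "true" (case-insensitive) is false.
        return params.get(key, "false").strip().lower() == "true"

    enabled = as_bool("enable_multi_device")
    gpus = []
    if enabled:
        raw = params.get("multi_device_gpus", "")
        gpus = [int(tok) for tok in raw.split(",") if tok.strip()]
        if len(gpus) < 2:
            raise ValueError("multi_device_gpus must list at least two GPUs")
        if len(set(gpus)) != len(gpus) or min(gpus) < 0:
            raise ValueError("multi_device_gpus must be distinct, non-negative ids")
    return {
        "enabled": enabled,
        "gpus": gpus,
        "per_rank_engines": enabled and as_bool("multi_device_per_rank_engines"),
    }


config = parse_multi_device_params({
    "enable_multi_device": "true",
    "multi_device_gpus": "0,1",
    "multi_device_per_rank_engines": "true",
})
print(config)
```

In this sketch the position of each id in `multi_device_gpus` is kept, on the assumption that list order corresponds to rank order.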

See docs/multi_device.md.

Implementation

  • ncclCommInitAll, per-rank deserialize + concurrent setCommunicator (calling it sequentially, one rank at a time, deadlocks), adaptive P2P/host input replication, fan-out enqueueV3, rank-0 output.
  • Supports offline-sharded engines and per-rank weight-shard (tensor-parallel) engines (multi_device_per_rank_engines).
  • Built behind -DTRITON_ENABLE_TENSORRT_MULTI_DEVICE=ON (TensorRT >= 11 + NCCL); default off, so non-MD builds/models are unchanged.
  • docs/ includes the engine builders used for testing.
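
The note that sequential setCommunicator deadlocks can be illustrated with a small analogy. The sketch below assumes the call behaves like a rendezvous that returns only once every rank has entered it, which is what the deadlock note suggests. It uses Python's `threading.Barrier` as a stand-in and calls no NCCL or TensorRT API; `attach_all` is a hypothetical name.

```python
import threading


def attach_all(num_ranks, concurrent, timeout=1.0):
    """Return True if every rank completes its simulated collective attach."""
    barrier = threading.Barrier(num_ranks)
    done = [False] * num_ranks

    def attach(rank):
        # Stand-in for a collective call: blocks until all ranks arrive.
        try:
            barrier.wait(timeout=timeout)
            done[rank] = True
        except threading.BrokenBarrierError:
            # Peers never arrived in time. A real collective has no timeout
            # here and would simply hang.
            pass

    if concurrent:
        # One thread per rank, so all ranks can enter the rendezvous together.
        threads = [threading.Thread(target=attach, args=(r,))
                   for r in range(num_ranks)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
    else:
        # One rank at a time: rank 0 waits for peers that have not been
        # called yet, so it can never complete.
        for r in range(num_ranks):
            attach(r)
    return all(done)


print("concurrent ok:", attach_all(2, concurrent=True))
print("sequential ok:", attach_all(2, concurrent=False))
```

The timeout exists only so the demonstration terminates; it is what turns the would-be hang into a reported failure.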

Validation

Validated on 2× and 8× B200 (NVLink): a sharded model across 2 GPUs matches the 1-GPU baseline (rel_max ~3.6e-3); server logs TensorRT Multi-Device ready: N ranks; both GPUs active.
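
The PR does not define rel_max. One common reading, assumed here, is the largest absolute element-wise difference divided by the largest magnitude in the baseline output. The sketch below implements that reading in plain Python; the sample values and the tolerance are made up for illustration and are not the PR's data or acceptance threshold.

```python
def rel_max(candidate, baseline):
    """Max absolute difference, normalized by the largest baseline magnitude."""
    if len(candidate) != len(baseline):
        raise ValueError("outputs must have the same number of elements")
    scale = max(abs(b) for b in baseline)
    if scale == 0.0:
        raise ValueError("baseline is all zeros; relative error is undefined")
    return max(abs(c - b) for c, b in zip(candidate, baseline)) / scale


baseline = [1.0, -2.0, 4.0]      # made-up 1-GPU outputs
sharded = [1.0, -2.004, 4.008]   # made-up 2-GPU outputs
error = rel_max(sharded, baseline)
print(f"rel_max = {error:.2e}")
assert error < 1e-2  # illustrative tolerance only
```

Normalizing by the baseline's peak magnitude rather than element by element avoids blowing up on near-zero outputs, at the cost of hiding large relative errors in small elements.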

Notes

  • Requires TensorRT >= 11.
  • DCO signed-off. Happy to split into smaller commits or adjust naming per maintainer preference.

Run a single TensorRT engine sharded across multiple GPUs via TensorRT
Multi-Device (NCCL DistCollective + IExecutionContext::setCommunicator, GA in
TensorRT 11). Enabled per-model with a KIND_MODEL instance group plus parameters
enable_multi_device, multi_device_gpus, and multi_device_per_rank_engines;
transparent to clients (same gRPC/HTTP API).

Runtime: ncclCommInitAll, per-rank deserialize + setCommunicator (concurrent),
adaptive P2P/host input replication, fan-out enqueueV3, rank-0 output. Supports
offline-sharded engines and per-rank weight-shard (tensor-parallel) engines.

docs/multi_device.md documents configuration + validation; docs/ includes engine
builders used for testing. Validated on 2x and 8x B200 (NVLink).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Peter Kisfaludi <pkisfaludi@nvidia.com>
@pkisfaludi-nv

Author

cc @mc-nv @whoisj — this is the upstream version of the internal MR you reviewed (TRT-28040, TensorRT Multi-Device / multi-GPU support), submitted here at @mc-nv's request. Would appreciate your review when you have a chance. I couldn't add you as formal reviewers from a fork — please assign yourselves (or let me know who should own it). Thanks!

@mc-nv changed the base branch from main to r26.06 on June 10, 2026 04:05
@mc-nv

mc-nv commented Jun 10, 2026

Contributor

@pkisfaludi-nv
I've changed the target branch to r26.06; the default branch doesn't support TensorRT 11.
I tested your changes today and they went through the build process.
#121

cc: @whoisj

2 participants