Add TensorRT Multi-Device (multi-GPU) inference support #120
Open
pkisfaludi-nv wants to merge 1 commit into
Run a single TensorRT engine sharded across multiple GPUs via TensorRT Multi-Device (NCCL DistCollective + IExecutionContext::setCommunicator, GA in TensorRT 11). Enabled per-model with a KIND_MODEL instance group plus parameters enable_multi_device, multi_device_gpus, and multi_device_per_rank_engines; transparent to clients (same gRPC/HTTP API).

Runtime: ncclCommInitAll, per-rank deserialize + setCommunicator (concurrent), adaptive P2P/host input replication, fan-out enqueueV3, rank-0 output. Supports offline-sharded engines and per-rank weight-shard (tensor-parallel) engines.

docs/multi_device.md documents configuration + validation; docs/ includes engine builders used for testing. Validated on 2x and 8x B200 (NVLink).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Peter Kisfaludi <pkisfaludi@nvidia.com>
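The runtime description calls `setCommunicator` concurrently across ranks, and the PR description notes that doing it sequentially deadlocks. A minimal Python sketch of that reasoning follows, under the assumption (not confirmed by this PR) that the per-rank call behaves like a collective and returns only once every rank has entered it. `setup_rank`, `setup_concurrently`, and `setup_sequentially` are hypothetical names, and a `threading.Barrier` stands in for the real handshake; no TensorRT or NCCL API is used.

```python
import threading


def setup_rank(rank, barrier, ready, timeout=None):
    """Hypothetical stand-in for per-rank setup ending in a collective handshake."""
    # Assumption: the real code would deserialize the engine and attach the
    # communicator here; the barrier models "blocks until all ranks arrive".
    barrier.wait(timeout=timeout)
    ready[rank] = True


def setup_concurrently(world_size):
    """One thread per rank: every rank reaches the handshake, so all complete."""
    barrier = threading.Barrier(world_size)
    ready = [False] * world_size
    threads = [
        threading.Thread(target=setup_rank, args=(rank, barrier, ready))
        for rank in range(world_size)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return ready


def setup_sequentially(world_size, timeout=0.2):
    """One rank at a time: rank 0 waits for peers that have not started yet.

    The timeout turns the would-be deadlock into a detectable failure.
    """
    barrier = threading.Barrier(world_size)
    ready = [False] * world_size
    try:
        for rank in range(world_size):
            setup_rank(rank, barrier, ready, timeout=timeout)
    except threading.BrokenBarrierError:
        return False
    return all(ready)
```

With two or more ranks, `setup_concurrently` marks every rank ready, while `setup_sequentially` returns `False` because the first rank never sees its peers arrive.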
Author
cc @mc-nv @whoisj: this is the upstream version of the internal MR you reviewed (TRT-28040, TensorRT Multi-Device / multi-GPU support), submitted here at @mc-nv's request. Would appreciate your review when you have a chance. I couldn't add you as formal reviewers from a fork, so please assign yourselves (or let me know who should own it). Thanks!
Contributor
@pkisfaludi-nv cc: @whoisj
Summary
Adds TensorRT Multi-Device (MD) support to the `tensorrt` backend: run a single TensorRT engine sharded across multiple GPUs via NCCL `DistCollective` + `IExecutionContext::setCommunicator` (GA in TensorRT 11), transparently to clients (same gRPC/HTTP API).

This mirrors an internal change; submitting upstream as requested by the backend maintainers.

Usage
Enable per model with a `KIND_MODEL` instance group + parameters (`enable_multi_device`, `multi_device_gpus`, `multi_device_per_rank_engines`). See `docs/multi_device.md`.

Implementation
- `ncclCommInitAll`, per-rank deserialize + concurrent `setCommunicator` (sequential deadlocks), adaptive P2P/host input replication, fan-out `enqueueV3`, rank-0 output.
- Supports offline-sharded engines and per-rank weight-shard (tensor-parallel) engines (`multi_device_per_rank_engines`).
- Build with `-DTRITON_ENABLE_TENSORRT_MULTI_DEVICE=ON` (TensorRT >= 11 + NCCL); default off, so non-MD builds/models are unchanged.
- `docs/` includes the engine builders used for testing.

Validation
Validated on 2× and 8× B200 (NVLink): a sharded model across 2 GPUs matches the 1-GPU baseline (`rel_max ~3.6e-3`); server logs `TensorRT Multi-Device ready: N ranks`; both GPUs active.

Notes
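The validation figure (`rel_max ~3.6e-3`) compares the sharded output against the 1-GPU baseline, but the PR does not spell out how `rel_max` is computed. One plausible reading, offered here only as an assumption, is the largest absolute element-wise difference divided by the largest baseline magnitude. A minimal Python sketch of that reading (the function and its arguments are hypothetical and work on flat lists of floats):

```python
def rel_max(baseline, candidate):
    """Max absolute difference, scaled by the largest baseline magnitude.

    Assumed definition; the PR's own metric may differ.
    """
    if len(baseline) != len(candidate):
        raise ValueError("outputs must have the same number of elements")
    if not baseline:
        raise ValueError("outputs must not be empty")
    max_diff = max(abs(b - c) for b, c in zip(baseline, candidate))
    scale = max(abs(b) for b in baseline)
    # Fall back to the absolute difference when the baseline is all zeros.
    return max_diff / scale if scale > 0 else max_diff
```

For example, `rel_max([1.0, 2.0, 4.0], [1.0, 2.0, 4.5])` gives `0.125`: the largest difference is 0.5 and the largest baseline magnitude is 4.0. Under this reading, a value near 3.6e-3 means the worst element differs by about 0.36% of the baseline's peak magnitude.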