
Releases: pytorch/TensorRT

Torch-TensorRT v2.12.0

20 May 19:24
9afefd0

Torch-TensorRT 2.12.0 Linux x86-64 and Windows targets

PyTorch 2.12, CUDA 13.0, TensorRT 10.16, Python 3.10~3.13

Torch-TensorRT Wheels are available:

x86-64 Linux and Windows:
CUDA 13.0 + Python 3.10-3.13 is Available via PyPI

aarch64 SBSA Linux and Jetson Thor
CUDA 13.0 + Python 3.10–3.13 + Torch 2.12 + TensorRT 10.16

Jetson Orin

  • no torch_tensorrt 2.9/2.10/2.11/2.12 release for Jetson Orin
  • please continue using torch_tensorrt 2.8 release

Torch-TensorRT-RTX 2.12.0 Linux x86-64 and Windows targets

PyTorch 2.12, CUDA 13.0, TensorRT-RTX 1.4, Python 3.10~3.13

Torch-TensorRT-RTX Wheels are available:

x86-64 Linux and Windows:
CUDA 13.0 + Python 3.10-3.13 is Available via PyPI

CUDA 13.0 + Python 3.10-3.13 is also Available via Pytorch Index

Native Distributed Collectives

In prior versions of Torch-TensorRT, distributed operators were backed by kernels provided by TensorRT-LLM, which the user needed to install manually. With Torch-TensorRT 2.12, many of these operations are natively supported, which means that only Torch-TensorRT needs to be installed in deployment.

The distributed infrastructure is designed to operate on top of torch.distributed. Once a graph is sharded, traced, and compiled, the torch.distributed device mesh can be passed to Torch-TensorRT compiled modules using the following API:

trt_model = torch.compile(model, backend="torch_tensorrt", ...)
_ = trt_model(inp)  # warmup — triggers engine build
dist.barrier()

with torch_tensorrt.distributed.distributed_context(dist.group.WORLD, trt_model) as dmodel:
    output = dmodel(inp)

dist.destroy_process_group()
os._exit(0)

This can also be done at compile time:

with torch_tensorrt.distributed.distributed_context(tp_group):
    trt_model = torch.compile(model, backend="torch_tensorrt", ...)
    output = trt_model(inp)

Note: use_distributed_trace is no longer necessary to compile multi-device models. Torch-TensorRT automatically recognizes distributed collectives and applies the setting for the user.

torchtrtrun

The distributed operations use the NCCL version distributed by PyTorch, which must be added to LD_PRELOAD before importing torch-tensorrt. As a convenience, we provide a tool, torchtrtrun, analogous to torchrun, that configures these libraries correctly and also allows users to launch models distributed across multiple nodes.

For example:

# Node 0 (rank 0):
torchtrtrun --nproc_per_node=1 --nnodes=2 --node_rank=0 \
  --rdzv_endpoint=<node0-ip>:29500 \
  tensor_parallel_llama_multinode.py

# Node 1 (rank 1):
torchtrtrun --nproc_per_node=1 --nnodes=2 --node_rank=1 \
  --rdzv_endpoint=<node0-ip>:29500 \
  tensor_parallel_llama_multinode.py

Serialization and torch.export

Models that are sharded and then exported can be compiled and saved to disk before being loaded on a deployment system. By default, these modules attempt to bind to the default torch distributed device mesh. If multiple valid device meshes are available, the above API can be used to select a specific one to execute the engine.

More information on torch-tensorrt distributed collective support can be found here: https://docs.pytorch.org/TensorRT/tutorials/deployment/distributed_inference.html#multinode-inference

More information on native multi-device collectives can be found here: https://docs.nvidia.com/deeplearning/tensorrt/latest/inference-library/work-with-transformers.html#multi-device-attention-preview-feature

ExecuTorch Support

Torch-TensorRT 2.12 introduces initial ExecuTorch integration for exporting and running TensorRT-accelerated models in ExecuTorch .pte format. Users can now
save TensorRT-compiled ExportedProgram / FX models with:

torch_tensorrt.save(model, "model.pte", output_format="executorch")

This release adds a TensorRT ExecuTorch backend, partitioner, and serialization path that embeds TensorRT engine payloads directly into the .pte using the
same engine metadata format as the Torch-TensorRT runtime. The release package also includes a C++ TensorRT ExecuTorch backend source package and a reference
ExecuTorch runner showing how to load .pte files, initialize the TensorRT delegate, bind inputs/outputs, and execute inference without requiring Python at
runtime.

Highlights:

  • New torch_tensorrt.executorch Python APIs: TensorRTBackend, TensorRTPartitioner, and get_edge_compile_config.
  • New output_format="executorch" save path for generating ExecuTorch .pte models.
  • Support for static-shape and TensorRT profile-based dynamic-shape export examples.
  • New native C++ TensorRT ExecuTorch backend and reference runner included in libtorchtrt.tar.gz.
  • Engines that require TensorRT output allocators, such as data-dependent output shape engines, are not supported yet.
  • In Torch-TensorRT 2.12, the ExecuTorch integration still depends on LibTorch in the native runtime path.
    The next release, Torch-TensorRT 2.13, is planned to move this to a pure ExecuTorch backend implementation without the LibTorch runtime dependency.

Known Limitations

  • ExecuTorch support still depends on the Torch/LibTorch C++ libraries used by Torch-TensorRT; this release does not provide a pure
    ExecuTorch-only TensorRT deployment path.
  • TensorRT engine payloads larger than 2 GiB are not supported when embedded in an ExecuTorch .pte file.
  • Selecting a target device during ExecuTorch export is not currently supported. Exported .pte files default to cuda:0.

Comprehensive Attention Support

This release extends the TRT attention converters to support GQA/MQA and decode-phase attention based on TensorRT IAttentionLayer. Specifically, it covers all SDPA kernel variants, MHA/GQA/MQA attention patterns, causal vs non-causal masking, bool/float/broadcast mask shapes, decode-phase attention (seq_q=1), non-power-of-2 head dims, LLM-realistic configs, and multiple dtypes. This feature is enabled by default. If you want to turn it off, please set decompose_attention=True.
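
As a rough illustration of how the MHA/GQA/MQA patterns named above differ, the sketch below classifies a head configuration and maps each query head to the K/V head it shares. The function names are hypothetical and are not part of Torch-TensorRT; the mapping assumes the common convention that consecutive query heads share one K/V head.

```python
def attention_pattern(num_q_heads: int, num_kv_heads: int) -> str:
    """Classify a head configuration as MHA, GQA, or MQA."""
    if num_q_heads % num_kv_heads != 0:
        raise ValueError("num_q_heads must be a multiple of num_kv_heads")
    if num_kv_heads == num_q_heads:
        return "MHA"  # every query head has its own K/V head
    if num_kv_heads == 1:
        return "MQA"  # all query heads share a single K/V head
    return "GQA"      # groups of query heads share a K/V head


def kv_head_for_query_head(q_head: int, num_q_heads: int, num_kv_heads: int) -> int:
    """Return the K/V head index used by a given query head.

    Assumes consecutive query heads are grouped onto the same K/V head.
    """
    if num_q_heads % num_kv_heads != 0:
        raise ValueError("num_q_heads must be a multiple of num_kv_heads")
    group_size = num_q_heads // num_kv_heads
    return q_head // group_size


if __name__ == "__main__":
    print(attention_pattern(32, 8))
    print([kv_head_for_query_head(h, 8, 2) for h in range(8)])
```

In decode-phase attention (seq_q=1) the same head mapping applies; only the query sequence length changes.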

Known Limitations

  • TensorRT 10.x (will be resolved in TRT 11.0) and TensorRT-RTX-1.4:
    In FP16/BF16, IAttentionLayer produces ~80% element mismatch for large causal
    k/v sequences (seq >= 512, is_causal=True). As a workaround, we use FP32 for
    the scale factor. If you want to use the accurate dtype, please set decompose_attention=True
    or upgrade to TRT 11.0 or later.

Comprehensive Complex Numerics Support

Torch-TensorRT can now compile models containing complex64/complex128 tensors end-to-end. TensorRT itself has no native complex dtype — a lowering pass intercepts complex subgraphs before partitioning and rewrites them into equivalent real arithmetic on a (..., 2) last-dim layout (real/imag interleaved), so the engine only sees standard float ops and callers don't have to change anything.

This unlocks compilation of models that use complex arithmetic for rotary position embeddings — Llama 3 (1D RoPE), and video generation transformers like CogVideoX, Wan, and HunyuanVideo (3D RoPE) — including under dynamic shapes and in distributed (multi-GPU) settings.

What's supported

  • Complex inputs (placeholders) and buffers (get_attr) are rewritten to real-valued equivalents. placeholder(complex64) becomes placeholder(float32) with an appended trailing dim of 2; complex buffers are replaced via torch.stack([t.real, t.imag], dim=-1). Dynamic-shape SymInts are preserved across the rewrite.
  • Complex multiply (aten.mul.Tensor between two complex operands) is decomposed to the standard identity (ac − bd) + (ad + bc)i.
  • view_as_complex / view_as_real are erased — they become identities once the layout is already (..., 2).
  • Shape-manipulation ops are handled with the trailing real/imag dim in mind: reshape / view / _unsafe_view, flatten, unsqueeze, squeeze, permute, transpose, t, cat, stack, select, slice, narrow, roll, flip, split, chunk, expand, repeat. Negative dim indices are auto-shifted by −1 so dim=-1 keeps meaning "the original last complex dim."
  • Math ops over complex inputs: mm / bmm / matmul (including complex×real and real×complex mixed forms), abs, angle, real, imag, sin/cos/exp/log on complex, reciprocal (for scalar / complex), sum.dim_IntList / mean.dim / prod.dim_int, ones_like / full_like (correctly initialise as 1+0i / fill+0i).
  • Engine outputs that stay complex — models that return a complex tensor without going through view_as_real are also detected via a forward scan that collects unbounded complex nodes.
  • Runtime I/O — complex inputs can be passed as-is at call time. The runtime modules automatically apply torch.view_as_real(x).contiguous() to complex inputs before handing them to the engine, and rebuild complex outputs on the way back.
  • truncate_double=True lowers complex128 to float32 (vs the default float64)
  • Refit caching — complex buffers via a last-dim slice-matching stage in _save_weight_mapping, plus a tuple-keyed (sd_key, last_dim, idx) lookup in construct_refit_mapping_from_weight_name_map. Verification picks the real-unpacked branch for the reference module when complex inputs are present.
  • Graceful fallback — complex ops the rewriter doesn't know how to handle now fall back to PyTorch execution rather than failing the compile.
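
The rewrite described in the list above can be illustrated in plain Python. This is a minimal sketch, not the lowering pass itself: it stores each complex value as a (real, imag) pair, mirroring the trailing dimension of size 2, applies the (ac − bd) + (ad + bc)i identity, and shows the −1 shift for negative dims. All helper names are hypothetical.

```python
from typing import List, Tuple

Pair = Tuple[float, float]  # (real, imag), mirroring the trailing dim of size 2


def to_pairs(values: List[complex]) -> List[Pair]:
    """Unpack complex values into real-valued (real, imag) pairs."""
    return [(z.real, z.imag) for z in values]


def from_pairs(pairs: List[Pair]) -> List[complex]:
    """Rebuild complex values from (real, imag) pairs."""
    return [complex(re, im) for re, im in pairs]


def complex_mul(x: List[Pair], y: List[Pair]) -> List[Pair]:
    """Elementwise complex multiply using only real arithmetic."""
    return [(a * c - b * d, a * d + b * c) for (a, b), (c, d) in zip(x, y)]


def shift_negative_dim(dim: int) -> int:
    """Shift negative dims by -1 so they skip the trailing real/imag dim."""
    return dim - 1 if dim < 0 else dim


x = [1 + 2j, 3 - 1j]
y = [0.5 - 1j, 2 + 2j]
result = from_pairs(complex_mul(to_pairs(x), to_pairs(y)))
expected = [a * b for a, b in zip(x, y)]

if __name__ == "__main__":
    print(result, expected)
```

The real-arithmetic result matches Python's built-in complex multiply, which is the property the lowering pass relies on when the engine only sees float ops.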

Known limitations

  • Only view_as_real-anchored subgraphs and forward-scanned comple...

Torch-TensorRT v2.11.0

07 Apr 17:03
0cc00aa

Torch-TensorRT 2.11.0 Linux x86-64 and Windows targets

PyTorch 2.11, CUDA 12.6, 12.8, 12.9, 13.0, TensorRT 10.15, Python 3.10~3.13

Torch-TensorRT Wheels are available:

x86-64 Linux and Windows:
CUDA 13.0 + Python 3.10-3.13 is Available via PyPI

CUDA 12.6/12.8/12.9/13.0 + Python 3.10-3.13 is also Available via Pytorch Index

aarch64 SBSA Linux and Jetson Thor
CUDA 13.0 + Python 3.10–3.13 + Torch 2.11 + TensorRT 10.15

Jetson Orin

  • no torch_tensorrt 2.9/2.10/2.11 release for Jetson Orin
  • please continue using torch_tensorrt 2.8 release

Torch-TensorRT-RTX 2.11.0 Linux x86-64 and Windows targets

PyTorch 2.11, CUDA 12.9, 13.0, TensorRT-RTX 1.3, Python 3.10~3.13

Torch-TensorRT-RTX Wheels are available:

x86-64 Linux and Windows:
CUDA 13.0 + Python 3.10-3.13 is Available via PyPI

CUDA 12.9/13.0 + Python 3.10-3.13 is also Available via Pytorch Index

Note: the tensorrt-rtx 1.3 wheel is not on PyPI yet, so please download the tarball from https://developer.nvidia.com/tensorrt-rtx and install the wheel from the tarball.

IAttention Layer

In this release, TensorRT's native IAttention layer is used by default to handle various attention-related ATen ops, including SDPA, Flash-SDPA, Efficient-SDPA, and cuDNN-SDPA. This integration enables more efficient execution and can improve model performance. To explicitly enable this behavior, set decompose_attention=False in the compile() function. When enabled, the native TensorRT implementation is utilized to achieve optimized attention computation.
However, due to current TensorRT limitations, certain operations such as compute_log_sumexp and Grouped Query Attention (GQA) are not yet supported. If these cases are encountered, an informational prompt will be displayed during compilation. Alternatively, you can set decompose_attention=True to decompose the attention ops into multiple basic ATen ops. Although this approach may not achieve the same level of performance optimization, it offers broader operator coverage and greater compatibility across different model architectures.

Improvements to the Symbolic Shape System

Two key improvements have been made to the symbolic shape system used to track mutations on dynamic dimensions throughout the body of the graph.

  1. A shape prop formula is recorded as metadata for every compiled engine.

Now instead of needing to instantiate the engine in order to do key tasks like serialization, retracing or similar tasks which require fake tensor propagation, we record the shape relation between inputs and outputs for every TRT subgraph at compile time and store it as metadata. The torch.ops.tensorrt.execute_engine meta kernel now just replays this function in the new shape environment.
This should enable more seamless integration with the rest of the torch.compile ecosystem as meta kernels in Torch-TensorRT will now work in the same way as other meta kernels.

  2. For unbounded shape ranges, we now select sane defaults.

Dynamic shapes are lazily inserted into Dynamo graphs. This is most noticeable when using Torch-TensorRT as a backend for torch.compile. Here when the first inference call is made to a boxed function

trt_mod = torch.compile(mod, backend="tensorrt")
trt_mod(*inputs)

The shapes of intermediate tensors are considered static and derived eagerly from the shapes of input tensors.

If input shapes then change

trt_mod(*other_sized_inputs)

TorchDynamo will start marking dimensions where shapes differ as dynamic dimensions. However, it does not assume upper bounds and critically, has no "optimal" or target size as required by TensorRT.

As such, when we see such [FIXED_SIZE, inf) ranges, we set a sane upper bound (max_int / max_dims) and assume the optimal size to be the midpoint of that range. We highly recommend that users explicitly set dynamic shape bounds for both torch.export and torch.compile use cases, but this system can serve as a fallback. https://docs.pytorch.org/TensorRT/user_guide/compilation/dynamic_shapes.html
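
The fallback for a [FIXED_SIZE, inf) range can be sketched as simple arithmetic. The function below is hypothetical and only illustrates the rule stated above; the concrete values of max_int and max_dims used by Torch-TensorRT are not given in these notes, so the defaults here are assumptions.

```python
from typing import NamedTuple


class ShapeRange(NamedTuple):
    min: int
    opt: int
    max: int


def default_range(min_size: int, max_int: int = 2**31 - 1, max_dims: int = 8) -> ShapeRange:
    """Pick (min, opt, max) for a dimension whose upper bound is unknown.

    max_int and max_dims are assumed values for illustration only.
    """
    upper = max_int // max_dims        # "sane" upper bound: max_int / max_dims
    if min_size > upper:
        raise ValueError("minimum size exceeds the default upper bound")
    opt = (min_size + upper) // 2      # optimal size: midpoint of the range
    return ShapeRange(min_size, opt, upper)


if __name__ == "__main__":
    print(default_range(16, max_int=1024, max_dims=4))
```

Because the midpoint of such a wide range is rarely the shape a model actually runs at, explicitly provided bounds generally give TensorRT a more useful optimization target.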

Torch-TensorRT-RTX

TensorRT-RTX is a JIT-focused version of TensorRT that allows users to target many different hardware platforms with one artifact. It allows developers to easily provide performance to their users across the many variations of RTX GPUs. In previous versions of Torch-TensorRT, we provided source code support for using TensorRT-RTX as a backend for Torch-TensorRT, giving users access to the same workflows as standard Torch-TensorRT with a more JIT-oriented optimization approach.

With 2.11 Torch-TensorRT-RTX has graduated to its own package that you can install with pip install torch-tensorrt-rtx. This package uses all the same APIs as Torch-TensorRT, just with a different backend. torch-tensorrt-rtx 2.11 targets TensorRT-RTX 1.3. For 2.11, TensorRT-RTX must be installed via a wheel distributed on developer.nvidia.com: https://developer.nvidia.com/tensorrt-rtx

Known Limitations:

  • bf16 precision is generally supported; however, some models may still show numerical accuracy issues. This will be addressed in future versions of TensorRT-RTX.
  • There is a known accuracy issue when running Grouped Query Attention with TensorRT-RTX, which will be addressed in a future release of TensorRT-RTX.

run_llm int8 quantization

We have added support for performing post-training quantization in int8 precision from the command line using the run_llm tool.
You can apply int8 quantization backed by the TensorRT-Model-Optimizer-Toolkit using --quant_format fp8:

python run_llm.py --model meta-llama/Llama-3.1-8B --quant_format fp8 --prompt "What is parallel programming?" --model_precision FP16 --num_tokens 128

Empty Tensor

We have added support for providing empty torch tensors (tensors with one or more zero-sized dimensions) as input to Torch-TensorRT compiled programs.

Under the hood, we use TensorRT native empty tensor semantics. Empty tensors are marked by a 1B placeholder input to the engine. Both the Python and C++ runtimes support this feature.

What's Changed


Torch-TensorRT v2.10.0

20 Feb 02:13
1a1c2e2

Torch-TensorRT 2.10.0 Linux x86-64 and Windows targets

PyTorch 2.10, CUDA 12.9, 13.0, TensorRT 10.14, Python 3.10~3.13

Torch-TensorRT Wheels are available:

x86-64 Linux and Windows:
CUDA 13.0 + Python 3.10-3.13 is Available via PyPI

CUDA 12.9/13.0 + Python 3.10-3.13 is also Available via Pytorch Index

aarch64 SBSA Linux and Jetson Thor
CUDA 13.0 + Python 3.10–3.13 + Torch 2.10 + TensorRT 10.14

Jetson Orin

  • no torch_tensorrt 2.9/2.10 release for Jetson Orin
  • please continue using torch_tensorrt 2.8 release

Important Changes

Retracing is enabled as the default behavior of saving a compiled graph module with torch_tensorrt.save. Torch-TensorRT re-exports the graph using torch.export.export(strict=False) to save it. This preserves the completeness of the output FX Graph and fills in metadata.

New Features

LLM improvements

The run_llm script now supports compiling models that have previously been quantized using the TensorRT Model Optimizer
Toolkit and uploaded to HuggingFace.

Now we support the following inference scenarios:

Standard high precision model, directly compile and run inference in fp16/bf16 via torch_tensorrt Autocast

python run_llm.py --model Qwen/Qwen2.5-0.5B-Instruct \
--prompt "What is parallel programming?" \
--model_precision FP16 --num_tokens 128 \
--cache static_v2 --enable_pytorch_run

Standard high precision model, use TensorRT Model optimizer to quantize and compile on device and then run inference in fp8/nvfp4 precision

python run_llm.py --model Qwen/Qwen2.5-0.5B-Instruct  \
--prompt "What is parallel programming?"  \
--model_precision FP16 \
--quant_format fp8 --num_tokens 128 \
--cache static_v2 --enable_pytorch_run

Previously quantized model uploaded to Hugging Face, directly compile and run inference in fp8/nvfp4

python run_llm.py --model nvidia/Qwen3-8B-FP8 \
--prompt "What is parallel programming?"  \
--model_precision FP16 \
--quant_format fp8 \
--num_tokens 128 \
--cache static_v2 --enable_pytorch_run

Notes:
--model_precision
Mandatory. Tells the LLM tool the precision of the model.
--quant_format
Optional. Only used for quantized model inference.
for the pre-quantized modelopt checkpoint, this is to tell

Improvements to Engine Caching

Before this release, weight-stripped engines could be refitted only once due to a limitation of TensorRT (<10.14), so we cached weighted engines to make sure the Engine Caching feature worked properly, which used unnecessary disk space. Starting with this release, if TensorRT >= 10.14 is installed, engine caching saves only weight-stripped engines on disk regardless of compilation_settings.strip_engine_weights. When a cached engine is loaded, it is automatically refitted and remains refittable, which means compiled TRT modules can be refitted multiple times with the function refit_module_weights(). For example:

for _ in range(3):
    trt_gm = refit_module_weights(trt_gm, exp_program)

Autocast

Before TensorRT 10.12, TensorRT would implicitly pick kernels for layers that result in the best performance (i.e., weak typing). Weak typing behavior is deprecated in newer TensorRT versions, but it is a good way to maximize performance. Therefore, in this release, we want to provide a solution for users to enable mixed precision behavior like weak typing, which is called Autocast.

Unlike PyTorch Autocast, Torch-TensorRT Autocast is rule-based: it selects nodes to
keep in FP32 precision to maintain model accuracy while the rest of the nodes benefit from reduced precision.
Torch-TensorRT Autocast also lets users specify nodes to exclude from Autocast, since some nodes
may have a larger effect on accuracy. In addition, Torch-TensorRT Autocast can cooperate with PyTorch Autocast,
allowing users to use both in the same model. Torch-TensorRT Autocast
respects the precision of the nodes within a PyTorch Autocast context. Please refer to the Torch-TRT mixed precision doc for more details.

Compilation Resource Management

Compiling large models on limited-resource hardware is challenging. Before this release, to successfully compile the FLUX model (24GB), we needed at least 128GB of host memory, which is >5x of the model size. This huge consumption limited Torch-TensorRT's capability to compile large models with limited resources.

Host Memory Optimization

This release introduces malloc memory trimming, which reduces peak host memory consumption:

export TORCHTRT_ENABLE_BUILDER_MALLOC_TRIM=1
python example.py

With this environment variable set, peak host memory usage can be reduced to 3x of the model size.

If CUDA memory is sufficient, you can disable CPU offloading by setting offload_module_to_cpu=False to further reduce host memory usage to 2x. A more detailed explanation can be found here: https://github.com/pytorch/TensorRT/blob/main/docsrc/contributors/resource_management.rst
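
To put the multipliers above in concrete terms, the short calculation below applies them to the 24GB FLUX example. It is only a back-of-the-envelope sketch: the 3x and 2x figures are the approximate ratios quoted in these notes, and the helper name is hypothetical.

```python
MODEL_SIZE_GB = 24  # FLUX model size quoted in these notes


def peak_host_memory_gb(model_size_gb: float, multiplier: float) -> float:
    """Estimate peak host memory as a multiple of the model size."""
    return model_size_gb * multiplier


# Before this release: at least 128GB was needed, i.e. more than 5x the model size.
baseline_ratio = 128 / MODEL_SIZE_GB

# With TORCHTRT_ENABLE_BUILDER_MALLOC_TRIM=1: roughly 3x.
with_malloc_trim_gb = peak_host_memory_gb(MODEL_SIZE_GB, 3)

# Additionally with offload_module_to_cpu=False: roughly 2x.
without_cpu_offload_gb = peak_host_memory_gb(MODEL_SIZE_GB, 2)

if __name__ == "__main__":
    print(f"baseline ratio: {baseline_ratio:.2f}x")
    print(f"malloc trim: ~{with_malloc_trim_gb} GB")
    print(f"malloc trim + no CPU offload: ~{without_cpu_offload_gb} GB")
```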

Resource Aware Partitioner

A new feature called Resource Aware Partitioner addresses situations where the available host memory is smaller than 3x of the model size. In the compilation settings, set enable_resource_partitioning=True and (optionally) a cpu_memory_budget. The partitioner then automatically shards the graph so that compilation resource consumption fits within very constrained resources (<2x) without sacrificing performance or accuracy.
Example usage can be found here:
https://github.com/pytorch/TensorRT/blob/b7ae84fc020b1f0428b019d39c6284c7d52626e7/examples/dynamo/low_cpu_memory_compilation.py
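
The idea of sharding a graph to fit a memory budget can be sketched with a simple greedy split. This is not the actual partitioning algorithm, which these notes do not describe; it assumes, for illustration, that each node has a known memory cost and that shards must be contiguous. The function name is hypothetical.

```python
from typing import List


def shard_by_budget(node_sizes: List[int], budget: int) -> List[List[int]]:
    """Greedily group consecutive nodes so each shard stays within the budget.

    Returns lists of node indices. A simplified stand-in for budget-driven
    partitioning, not the Torch-TensorRT implementation.
    """
    shards: List[List[int]] = []
    current: List[int] = []
    current_size = 0
    for idx, size in enumerate(node_sizes):
        if size > budget:
            raise ValueError(f"node {idx} alone exceeds the budget")
        # Start a new shard when adding this node would exceed the budget.
        if current and current_size + size > budget:
            shards.append(current)
            current, current_size = [], 0
        current.append(idx)
        current_size += size
    if current:
        shards.append(current)
    return shards


if __name__ == "__main__":
    print(shard_by_budget([4, 4, 4, 2, 6], budget=8))
```

A smaller budget yields more, smaller shards, each of which would be compiled separately with a lower peak memory footprint.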

Debugger

TensorRT API Capture

In this release, we have added the TensorRT API Capture and Replay feature
which streamlines the process of reproducing and debugging issues within your model.
It allows you to record the engine-building phase of your model and later replay the engine-build steps.

Capture:
The capture feature is disabled by default.
You can enable it via the environment variable TORCHTRT_ENABLE_TENSORRT_API_CAPTURE=1:
TORCHTRT_ENABLE_TENSORRT_API_CAPTURE=1 python your_model_test.py
You should see shim.json and shim.bin generated after enabling capture.

Replay:
Use the tensorrt_player tool to replay the captured TRT engine build without the original framework:
tensorrt_player -j /absolute/path/to/shim.json -o /absolute/path/to/output_engine

Limitations:
  • This feature is currently restricted to Linux (x86-64 and aarch64) only.
  • This feature currently captures and records only one TRT engine. If a graph break causes multiple engines to be built, only the first engine is recorded. In the next release, we will support recording multiple engines in the same bin file.

You can see more details in
https://docs.pytorch.org/TensorRT/getting_started/capture_and_replay.html?highlight=capture+replay#

What's Changed


Torch-TensorRT v2.9.0

17 Oct 15:58
8767d9b

PyTorch 2.9, CUDA 13.0, TensorRT 10.13, Python 3.13

Torch-TensorRT 2.9.0 Linux x86-64 and Windows targets PyTorch 2.9, TensorRT 10.13, CUDA 13.0, 12.8, 12.6 and Python 3.10 ~ 3.13

Python

x86-64 Linux and Windows

aarch64 SBSA Linux and Jetson Thor

NOTE: You must explicitly install TensorRT or use system installed TensorRT wheels for aarch64 platforms

uv pip install torch torch-tensorrt tensorrt 

aarch64 Jetson Orin

 - no torch_tensorrt 2.9 release for Jetson Orin, please continue using torch_tensorrt 2.8 release

C++

x86-64 Linux and Windows

  • CUDA 13.0 Tarball / Zip

Deprecations

FX Frontend

The FX frontend was the precursor to the Dynamo frontend, and a number of Dynamo components were shared between the two. Now that the Dynamo frontend is stable and all shared components have been decoupled, we will no longer ship the FX frontend in binary releases starting in H1Y26. The FX frontend will remain in the source tree for the foreseeable future so source builds can reinstall the frontend if necessary.

New Features

LLM and VLM improvements

In this release, we’ve introduced several key enhancements:

  • Sliding Window Attention in SDPA Converter: Added support for sliding window attention, enabling successful compilation of the Gemma3 model (Gemma3-1B).
  • Dynamic Custom Lowering Passes
    Refactored the lowering framework to allow users to dynamically register custom passes based on the configuration of Hugging Face models.
  • Vision-Language Model (VLM) Support
    • Added support for Eagle2 and Qwen2.5-VL models via the new run_vlm.py utility.
    • run_vlm.py enables compilation of both vision and language components of a VLM model. It also supports KV caching for efficient VLM generation.

See the documentation for detailed instructions on running these models.
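
For readers unfamiliar with sliding window attention, the sketch below builds the boolean mask it implies. It assumes the common convention that each query position attends to itself and the preceding window − 1 positions; the exact window semantics used by the SDPA converter are not specified in these notes, and the function name is hypothetical.

```python
from typing import List


def sliding_window_causal_mask(seq_len: int, window: int) -> List[List[bool]]:
    """Return mask[q][k] == True where query position q may attend to key position k.

    Assumes a causal window covering the current token and the previous
    (window - 1) tokens.
    """
    return [
        [0 <= q - k < window for k in range(seq_len)]
        for q in range(seq_len)
    ]


if __name__ == "__main__":
    for row in sliding_window_causal_mask(seq_len=4, window=2):
        print(["x" if allowed else "." for allowed in row])
```

With window equal to seq_len, the mask reduces to the ordinary causal (lower-triangular) mask.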

TensorRT-RTX

TensorRT-RTX is a JIT-first version of TensorRT. Whereas TensorRT performs tactic selection and fusions during a build phase, TensorRT-RTX allows you to distribute builds prior to specializing for specific hardware, so that one GPU-agnostic package can be distributed to all users of your builds. Then, on first use, TensorRT-RTX tunes for the specific hardware your users are running. Torch-TensorRT-RTX is a build of Torch-TensorRT that uses the TensorRT-RTX compiler stack in place of standard TensorRT. All APIs are identical to Torch-TensorRT; however, some features, such as weak typing and compile-time post-training quantization, are not supported.

Improvements

  • Closed a number of performance gaps between Torch-TensorRT and ONNX TensorRT constructed graphs

What's Changed


Torch-TensorRT v2.8.0

09 Aug 06:02
e94d48c

PyTorch 2.8, CUDA 12.8, TensorRT 10.12, Python 3.13

Torch-TensorRT 2.8.0 Standard Linux x86-64 and Windows targets PyTorch 2.8, TensorRT 10.12, CUDA 12.6, 12.8, 12.9 and Python 3.9 ~ 3.13

Platform support

In addition to the standard Windows x86-64 and Linux x86-64 releases, we now provide binary builds for SBSA and Jetson:

Deprecations

New Features

AOT-Inductor Pythonless Deployment

Stability: Beta

Historically TorchScript has been used to run Torch-TensorRT programs outside of a Python interpreter. Both the dynamo/torch.compile frontend and the TorchScript frontends supported this TorchScript deployment workflow.

Old
trt_model = torch_tensorrt.compile(model, ir="dynamo", arg_inputs=[...])
ts_model = torch.jit.trace(trt_model, inputs=[...])
ts_model.save("trt_model.ts")

Now you can achieve a similar result using AOT-Inductor. AOTInductor is a specialized version of TorchInductor, designed to process exported PyTorch models, optimize them, and produce shared libraries as well as other relevant artifacts. These compiled artifacts are specifically crafted for deployment in non-Python environments.

Torch-TensorRT can embed TensorRT engines in AOTInductor libraries to accelerate models further. You are also able to combine Inductor kernels with TensorRT engines via this method. This allows users to deploy their models outside of Python using torch-compile native technologies.

New
with torch.no_grad():
    cg_trt_module = torch_tensorrt.compile(model, **compile_settings)
    torch_tensorrt.save(
        cg_trt_module,
        file_path=os.path.join(os.getcwd(), "model.pt2"),
        output_format="aot_inductor",
        retrace=True,
        arg_inputs=example_inputs,
    )

This model.pt2 file can then be loaded in either Python or C++ using Torch APIs.

import os

import torch
import torch_tensorrt

model = torch._inductor.aoti_load_package(os.path.join(os.getcwd(), "model.pt2"))

#include <iostream>
#include <vector>

#include "torch/torch.h"
#include "torch/csrc/inductor/aoti_package/model_package_loader.h"

int main(int argc, const char* argv[]) {

  std::string trt_aoti_module_path = "model.pt2";
  c10::InferenceMode mode;

  torch::inductor::AOTIModelPackageLoader loader(trt_aoti_module_path);
  std::vector<torch::Tensor> inputs = {torch::randn({8, 10}, at::kCUDA)};
  std::vector<torch::Tensor> outputs = loader.run(inputs);
  std::cout << "Result from the first inference:"<< std::endl;
  std::cout << outputs << std::endl;

  return 0;
}

More information can be found here https://docs.pytorch.org/TensorRT/user_guide/runtime.html as well as a code example here: https://github.com/pytorch/TensorRT/blob/release/2.8/examples/torchtrt_aoti_example/inference.cpp

PTX Plugins

Stability: Stable

In Torch-TensorRT 2.7.0 we introduced auto-generated plugins, which allow users to automatically wrap kernels / PyTorch custom operators into TensorRT plugins to run their models without a graph break. In 2.8.0 we extend this system to support PTX-based plugins, which enable users to serialize and run their TensorRT engines without requiring any PyTorch / Triton / Python in the runtime or access to the original kernel implementation. This approach also has lower overhead than the auto-generated plugin system, which helps achieve maximum performance.

The following example shows how to register a custom operator, generate the necessary plugin, and integrate it into the TensorRT execution graph: https://github.com/pytorch/TensorRT/blob/main/examples/dynamo/aot_plugin.py

Hierarchical Multi-backend Adjacency Partitioner

Stability: Experimental

The Hierarchical Multi-backend Adjacency Partitioner enables sophisticated model partitioning strategies for distributing PyTorch models across multiple backends based on operator support and priority ordering. A prototype partitioner has been added to the package which allows graphs to be split across multiple backends (e.g., TensorRT, PyTorch Inductor) based on operator capabilities. By providing a backend preference order, operators are assigned to the highest-priority backend that supports them.

Please refer to the example for usage.
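
The priority-based assignment can be sketched in a few lines of plain Python. This is a conceptual illustration, not the partitioner's API: the function names, the operator names, the support sets, and the "eager" fallback label are all hypothetical.

```python
from itertools import groupby
from typing import Dict, List, Set, Tuple


def assign_backends(
    ops: List[str],
    support: Dict[str, Set[str]],
    priority: List[str],
    fallback: str = "eager",  # hypothetical label for ops no backend supports
) -> List[Tuple[str, str]]:
    """Assign each op to the highest-priority backend that supports it."""
    assigned = []
    for op in ops:
        backend = next((b for b in priority if op in support.get(b, set())), fallback)
        assigned.append((op, backend))
    return assigned


def group_adjacent(assigned: List[Tuple[str, str]]) -> List[Tuple[str, List[str]]]:
    """Merge runs of adjacent ops that share a backend into one segment."""
    return [
        (backend, [op for op, _ in run])
        for backend, run in groupby(assigned, key=lambda pair: pair[1])
    ]


if __name__ == "__main__":
    ops = ["conv", "relu", "custom_sort", "matmul"]
    support = {
        "tensorrt": {"conv", "relu", "matmul"},
        "inductor": {"conv", "relu", "matmul", "custom_sort"},
    }
    print(group_adjacent(assign_backends(ops, support, ["tensorrt", "inductor"])))
```

Grouping adjacent operators keeps the number of backend transitions low, which is the reason adjacency matters in addition to per-operator support.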

Model Optimizer-Based NVFP4 Quantization (PTQ) Support for Linux

Stability: Stable

Introducing NVFP4 for efficient and accurate low-precision inference on the Blackwell GPU architecture.
Currently, the workflow supports quantizing models from FP16 → NVFP4.

Directly quantizing from FP32 → NVFP4 is not recommended as it may lead to accuracy degradation. Instead, first convert or train the model in FP16, then quantize to NVFP4.

Full example:
https://github.com/pytorch/TensorRT/blob/release/2.8/examples/apps/flux_demo.py

run_llm and KV Caching

Stability: Beta

We’ve introduced a KV caching implementation for Torch-TensorRT using native TensorRT operations, yielding significant improvements in inference performance for autoregressive large language models (LLMs). KV caching is a crucial optimization that reduces latency by reusing attention activations across decoding steps. In our approach, the KV cache is modeled as fixed-size tensor inputs and outputs, with outputs from each decoding step looped back as inputs to update the cache incrementally. This update is performed using TensorRT-supported operations such as slice, concat, and pad. The design allows step-wise cache updates while preserving compatibility with TensorRT’s optimization workflow and engine serialization.
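
The fixed-size cache update described above can be sketched with plain Python lists standing in for tensors. This is a simplified one-dimensional illustration under assumed names, not the Torch-TensorRT implementation: the cache keeps a constant length, and each step is expressed as a slice, a concat, and a pad.

```python
from typing import List


def update_cache(cache: List[float], new_entries: List[float], start: int) -> List[float]:
    """Return a new fixed-size cache with new_entries written at position start.

    Mirrors the slice / concat / pad pattern: the output has the same
    length as the input, so it can be looped back as the next step's input.
    """
    max_len = len(cache)
    end = start + len(new_entries)
    if end > max_len:
        raise ValueError("cache capacity exceeded")
    kept = cache[:start]                  # slice: entries written so far
    merged = kept + new_entries           # concat: append this step's entries
    padding = [0.0] * (max_len - end)     # pad: restore the fixed size
    return merged + padding


if __name__ == "__main__":
    cache = [0.0] * 6
    cache = update_cache(cache, [1.0, 2.0, 3.0], start=0)  # prefill step
    cache = update_cache(cache, [4.0], start=3)            # one decode step
    print(cache)
```

Keeping the cache shape constant across steps is what lets the same engine be reused for every decoding step.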

We’ve also introduced a new utility, run_llm.py, to run inference on popular LLMs with KV caching enabled.

To run a Qwen3 model using KV caching with Torch-TensorRT, use the following command:

python run_llm.py --model Qwen/Qwen3-8B --prompt "What is parallel programming?" --precision FP16 --num_tokens 128 --cache static_v2 --benchmark

Please refer to Compiling LLM models from Huggingface for more details and limitations.

Debugger

We introduced a new debugger to improve usability and the debugging experience for Torch-TensorRT. The debugger centralizes all debugging settings, such as the logging level (from critical to info) and engine profiling. We also introduced FX graph visualization in the debugger, where you can specify the lowering pass before/after which you want to draw the graph. Moreover, the debugger can provide engine profiling and layer information compatible with TREX, an engine visualization tool developed by TensorRT, which better explains the engine structure.

Model Zoo

We have expanded support to include several popular models from the Qwen3 and Llama3 series. In this release, we’ve also addressed various performance and accuracy issues to improve overall stability. For a complete list of supported models, please refer to the Supported Models section.

Bug Fixes

Refit

Refit has been re-enabled for Python 3.13 after being disabled in 2.7.0

  • Reduced memory overhead by offloading model to CPU

Performance improvements

  • Linear converter was reverted to the earlier implementation because it shows perf improvements in fp16 on some models (e.g., BERT)
  • Group Norm converter was simplified to reduce unnecessary TensorRT ILayers
  • The constants in the BatchNorm converter are now folded at compile time, leading to significant performance improvements.
  • SDPA op decomposition is optimized, resulting in performance equal to or better than ONNX-TensorRT for transformer-based diffusion models such as Stable Diffusion 3/WAN2.1/FLUX

What's Changed


Torch-TensorRT v2.6.1

03 Jun 04:49
707f16d


What's Changed

Full Changelog: v2.6.0...v2.6.1

Torch-TensorRT v2.7.0

07 May 23:44
8250319


PyTorch 2.7, CUDA 12.8, TensorRT 10.9, Python 3.13

Torch-TensorRT 2.7.0 targets PyTorch 2.7, TensorRT 10.9, and CUDA 12.8 (builds for CUDA 11.8/12.4 are available via the PyTorch package index - https://download.pytorch.org/whl/cu118 https://download.pytorch.org/whl/cu124). Python versions 3.9 through 3.13 are supported. We no longer provide pre-cxx11-abi builds; all wheels and tarballs use the cxx11 ABI.

Known Issues

  • Engine refitting is disabled in Python 3.13.

Using Self Defined Kernels in TensorRT Engines using Automatic Plugin Generation

Users may develop their own custom kernels using DSLs such as OpenAI Triton. Through the use of PyTorch Custom Ops and Torch-TensorRT Automatic Plugin Generation, these kernels can be called within the TensorRT engine with minimal extra code required.

import torch
import torch_tensorrt
import triton
import triton.language as tl


@triton.jit
def elementwise_scale_mul_kernel(X, Y, Z, a, b, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(0)
    # Compute the range of elements that this thread block will work on
    block_start = pid * BLOCK_SIZE
    # Range of indices this thread will handle
    offsets = block_start + tl.arange(0, BLOCK_SIZE)
    # Load elements from the X and Y tensors
    x_vals = tl.load(X + offsets)
    y_vals = tl.load(Y + offsets)
    # Perform the element-wise multiplication
    z_vals = x_vals * y_vals * a + b
    # Store the result in Z
    tl.store(Z + offsets, z_vals)


@torch.library.custom_op("torchtrt_ex::elementwise_scale_mul", mutates_args=())  # type: ignore[misc]
def elementwise_scale_mul(
    X: torch.Tensor, Y: torch.Tensor, b: float = 0.2, a: int = 2
) -> torch.Tensor:
    # Ensure the tensors are on the GPU
    assert X.is_cuda and Y.is_cuda, "Tensors must be on CUDA device."
    assert X.shape == Y.shape, "Tensors must have the same shape."

    # Create output tensor
    Z = torch.empty_like(X)

    # Define block size
    BLOCK_SIZE = 1024

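    # Grid of programs. This assumes X.numel() is a multiple of BLOCK_SIZE,
    # since the kernel above does not mask the tail of the tensor.
    grid = lambda meta: (X.numel() // meta["BLOCK_SIZE"],)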

    # Launch the kernel with parameters a and b
    elementwise_scale_mul_kernel[grid](X, Y, Z, a, b, BLOCK_SIZE=BLOCK_SIZE)

    return Z

@torch.library.register_fake("torchtrt_ex::elementwise_scale_mul")
def _(x: torch.Tensor, y: torch.Tensor, b: float = 0.2, a: int = 2) -> torch.Tensor:
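    # Return a fresh tensor with the output's shape and dtype rather than aliasing the input
    return torch.empty_like(x)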

torch_tensorrt.dynamo.conversion.plugins.custom_op("torchtrt_ex::elementwise_scale_mul", supports_dynamic_shapes=True, requires_output_allocator=False)

trt_mod_w_kernel = torch_tensorrt.compile(module, ...) 

torch_tensorrt.dynamo.conversion.plugins.custom_op generates a TensorRT plugin using the Quick Deploy Plugin (QDP) system and PyTorch's FakeTensor mode, reusing the information already required to register a Torch custom op for use with TorchDynamo. It also generates the Torch-TensorRT converter that inserts the plugin into the TensorRT engine.

QDP Plugins for Torch Custom Ops and Converters for QDP Plugins can be generated individually using:

torch_tensorrt.dynamo.conversion.plugins.generate_plugin(
    "torchtrt_ex::elementwise_scale_mul"
)
torch_tensorrt.dynamo.conversion.plugins.generate_plugin_converter(
    "torchtrt_ex::elementwise_scale_mul",
    supports_dynamic_shapes=True,
    requires_output_allocator=False,
)

MutableTorchTensorRTModule improvements

MutableTorchTensorRTModule automatically recompiles if the engine becomes invalid. Previously, engines assumed static shapes, so if a user provided a differently sized input, the graph would recompile or be pulled from the engine cache. Now developers can provide shape hints to the MutableTorchTensorRTModule, allowing the module to handle a broader range of inputs without recompiling. For example:

pipe.unet = torch_tensorrt.MutableTorchTensorRTModule(pipe.unet, **settings)
BATCH = torch.export.Dim("BATCH", min=2, max=24)
_HEIGHT = torch.export.Dim("_HEIGHT", min=16, max=32)
_WIDTH = torch.export.Dim("_WIDTH", min=16, max=32)
HEIGHT = 4 * _HEIGHT
WIDTH = 4 * _WIDTH
args_dynamic_shapes = ({0: BATCH, 2: HEIGHT, 3: WIDTH}, {})
kwargs_dynamic_shapes = {
    "encoder_hidden_states": {0: BATCH},
    "added_cond_kwargs": {
        "text_embeds": {0: BATCH},
        "time_ids": {0: BATCH},
    },
    "return_dict": None,
}
pipe.unet.set_expected_dynamic_shape_range(
    args_dynamic_shapes, kwargs_dynamic_shapes
)
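To make the declared ranges concrete, here is a small sketch in plain Python that checks whether an input shape falls inside them. It assumes an (N, C, H, W) layout for the positional input, and `in_expected_range` is a hypothetical helper for illustration, not part of the Torch-TensorRT API.

```python
def in_expected_range(shape):
    """Check an (N, C, H, W) shape against the ranges declared above."""
    batch, _channels, height, width = shape
    # BATCH ranges over [2, 24].
    batch_ok = 2 <= batch <= 24
    # HEIGHT = 4 * _HEIGHT with _HEIGHT in [16, 32], so valid heights are
    # the multiples of 4 between 64 and 128. WIDTH follows the same rule.
    height_ok = height % 4 == 0 and 16 <= height // 4 <= 32
    width_ok = width % 4 == 0 and 16 <= width // 4 <= 32
    return batch_ok and height_ok and width_ok


print(in_expected_range((2, 4, 64, 64)))    # inside the declared range
print(in_expected_range((1, 4, 64, 64)))    # batch below the minimum
```

Expressing HEIGHT and WIDTH as `4 * _HEIGHT` and `4 * _WIDTH` encodes the divisibility constraint in the dimension itself, so only multiples of 4 are treated as valid.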

Data Dependent Shape support

For networks that produce outputs whose shapes depend on the values of the input data rather than only on the input shapes, the output buffer must be allocated at runtime. To support this use case, we have added a new runtime mode, Dynamic Output Allocation Mode, which supports Data Dependent Shape (DDS) operations such as the NonZero op. (#3388)

Note:
  • Dynamic output allocation mode cannot be used in conjunction with CUDA Graphs or the pre-allocated outputs feature.
  • Without dynamic output allocation, the output buffer is allocated using the output shape inferred from the input size.

There are two scenarios in which dynamic output allocation is enabled:

  1. The model has been identified at compile time to require dynamic output allocation for at least one TensorRT subgraph. These models will engage the runtime mode automatically (with logging) and are incompatible with other runtime modes such as CUDA Graphs. Converters can declare that the subgraphs they produce require the output allocator using requires_output_allocator=True, thereby forcing any model which uses the converter to automatically use the output allocator runtime mode, e.g.,
    @dynamo_tensorrt_converter(
        torch.ops.aten.nonzero.default,
        supports_dynamic_shapes=True,
        requires_output_allocator=True,
    )
    def aten_ops_nonzero(
        ctx: ConversionContext,
        target: Target,
        args: Tuple[Argument, ...],
        kwargs: Dict[str, Argument],
        name: str,
    ) -> Union[TRTTensor, Sequence[TRTTensor]]:
        ...
  2. Users may manually enable dynamic output allocation mode via the torch_tensorrt.runtime.enable_output_allocator context manager.
    # Enables Dynamic Output Allocation Mode, then resets the mode to its prior setting
    with torch_tensorrt.runtime.enable_output_allocator(trt_module):
        ...
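As a minimal illustration of why a NonZero-style op needs runtime allocation, the sketch below uses plain Python lists in place of tensors. `nonzero_indices` is a hypothetical helper for this example only.

```python
def nonzero_indices(values):
    """Return the positions of non-zero entries, like a 1-D NonZero op."""
    return [i for i, v in enumerate(values) if v != 0]


a = [0, 3, 0, 5]
b = [1, 2, 3, 4]

# Both inputs have the same shape, but the output length depends on the
# values, so it cannot be inferred from the input shape alone.
out_a = nonzero_indices(a)
out_b = nonzero_indices(b)
print(len(a) == len(b), len(out_a), len(out_b))
```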

Tiling Optimization support

Tiling optimization enables cross-kernel tiled inference. This technique leverages on-chip caching for continuous kernels in addition to kernel-level tiling. It can significantly enhance performance on platforms constrained by memory bandwidth. (#3444)

We currently support four tiling strategies: "none", "fast", "moderate", and "full". A higher level allows TensorRT to spend more time searching for a better tiling strategy. Here is an example of enabling tiling optimization:

    compiled_model = torch_tensorrt.compile(
        model,
        ir="dynamo",
        inputs=inputs,
        tiling_optimization_level="full",
        l2_limit_for_tiling=10,
    )

Model Zoo additions

  • Added support for compiling the FLUX.1-dev 12B model in our model zoo. An example is available here. Quantized variants of FLUX are under development as part of future work.

General Improvements

  • Improved BF16 support in model compilation by fixing bugs and adding new tests to cover both full-graph and graph-break scenarios.
  • Significantly accelerated model compilation time (#3396)

Python 3.13 support

We added support for Python 3.13 (#3455). However, due to the Python object reference issue in PyTorch 2.7, we disabled the refitting related features for Python 3.13 in this release. This issue should be fixed in the next release.

What's Changed


Torch-TensorRT v2.6.0

05 Feb 22:03
44375f2


PyTorch 2.6, CUDA 12.6, TensorRT 10.7, Python 3.12

Torch-TensorRT 2.6.0 targets PyTorch 2.6, TensorRT 10.7, and CUDA 12.6, (builds for CUDA 11.8/12.4 are available via the PyTorch package index - https://download.pytorch.org/whl/cu118 https://download.pytorch.org/whl/cu124). Python versions from 3.9-3.12 are supported. We do not support 3.13 in this release due to TensorRT not supporting that version of Python at this time.

Deprecation notice

The torchscript frontend will be deprecated in v2.6. Specifically, the following usage will no longer be supported and will issue a deprecation warning at runtime if used:

torch_tensorrt.compile(model, ir="torchscript")

Moving forward, we encourage users to transition to one of the supported options:

torch_tensorrt.compile(model)
torch_tensorrt.compile(model, ir="dynamo")
torch.compile(model, backend="tensorrt")

Torchscript will continue to be supported as a deployment format via post-compilation tracing:

dynamo_model = torch_tensorrt.compile(model, ir="dynamo", arg_inputs=[...])
ts_model = torch.jit.trace(dynamo_model, example_inputs=[...])
ts_model(...)

Please refer to the README for more information regarding our deprecation policy.

Cross-OS Compilation

In Torch-TensorRT 2.6 it is now possible to use a Linux host to compile Torch-TensorRT programs for Windows using the torch_tensorrt.cross_compile_for_windows API. These programs use a slightly different serialization format to facilitate this workflow and cannot be run on Linux. Therefore, when calling torch_tensorrt.cross_compile_for_windows, expect the program to be saved directly to disk. Developers should then use torch_tensorrt.load_cross_compiled_exported_program on the Windows target to load the serialized program. Torch-TensorRT programs now include target platform information to verify OS compatibility on deserialization. This in turn has caused an ABI bump for the runtime.

if load:
    # load the saved model in Windows
    if platform.system() != "Windows" or platform.machine() != "AMD64":
        raise ValueError(
            "cross runtime compiled model for windows can only be loaded in Windows system"
        )
    loaded_model = torchtrt.load_cross_compiled_exported_program(save_path).module()
    print(f"model has been successfully loaded from {save_path}")
    # inference
    trt_output = loaded_model(input)
    print(f"inference result: {trt_output}")
else:
    if platform.system() != "Linux" or platform.architecture()[0] != "64bit":
        raise ValueError(
            "cross runtime compiled model for windows can only be compiled in Linux system"
        )
    compile_spec = {
        "debug": True,
        "min_block_size": 1,
    }
    torchtrt.cross_compile_for_windows(
        model, file_path=save_path, inputs=inputs, **compile_spec
    )
    print(
        f"model has been successfully cross compiled and saved in Linux to {save_path}"
    )

Runtime Weight Streaming

Weight Streaming in Torch-TensorRT is a memory optimization technique that helps deploy large models on memory-constrained devices by dynamically loading weights as needed during inference, reducing the overall memory footprint and enabling more efficient use of hardware resources. It is an opt-in feature that needs to be enabled at both build time and runtime.

trt_model = torch_tensorrt.dynamo.compile(
    model,
    inputs=input_tensors,
    enabled_precisions={torch.float32}, # only float32 precision is allowed for strongly typed network
    use_explicit_typing=True,           # create a strongly typed network
    enable_weight_streaming=True,       # enable weight streaming
)

Control the weight streaming budget at runtime using the weight streaming context manager

with torch_tensorrt.runtime.weight_streaming(trt_model) as weight_streaming_ctx:
    # Get the total size of streamable weights in the engine
    streamable_budget = weight_streaming_ctx.total_device_budget
    # Set 50% weight streaming budget
    requested_budget = int(streamable_budget * 0.5)
    weight_streaming_ctx.device_budget = requested_budget
    trt_model(inputs)

Inter-Block CUDAGraphs

We updated CUDAGraphs API to support Inter-Block CUDAGraphs. When a compiled Torch-TensorRT module has graph breaks, previously, only TensorRT blocks could be run with CUDAGraph's optimized kernel launch. With Torch-TensorRT 2.6 the entire graph can be captured and executed in a unified CUDAGraph to minimize kernel launch overhead.

# Previous API
with torch_tensorrt.runtime.enable_cudagraphs():
    torchtrt_model(inputs)
# New API
with torch_tensorrt.runtime.enable_cudagraphs(torchtrt_model) as cudagraphs_model:
    cudagraphs_model(inputs)

Improvements to Engine Caching

First, there are some API changes:

  1. make_refittable was renamed to immutable_weights in preparation for a future release that will default engines to be compiled with the refit feature enabled, allowing for the Torch-TensorRT engine cache to provide maximum benefits.
  2. refit_identical_engine_weights was added to specify whether to refit the engine with identical weights.
  3. strip_engine_weights was added to specify whether to strip the engine weights.
  4. The default disk size for engine caching was expanded to 5GB.

In addition, one of the capabilities of engine caching is to recognize whether two graphs are isomorphic. If a new graph is isomorphic to any previously compiled TensorRT engine, the engine cache will reuse that engine instead of recompiling the graph, thereby avoiding recompilation time. In the previous release, we utilized FxGraphCachePickler.get_hash(new_gm) from PyTorch to calculate hash values which took up a large portion of the total compile time. In this release, we designed a new hash function to get hash values quickly and then determine the isomorphism with ~4x speedup.
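The idea of hashing graph structure while ignoring weight values can be sketched as follows. This is an illustration under simplifying assumptions, not the hash function Torch-TensorRT actually uses: the node dictionaries, their keys, and `structure_hash` are all hypothetical.

```python
import hashlib


def structure_hash(nodes):
    """Hash a graph by op names and connectivity only, ignoring weight values.

    Each node is a dict with "op", "inputs" (indices of producer nodes) and an
    optional "weight" payload that is deliberately left out of the hash.
    """
    digest = hashlib.sha256()
    for node in nodes:
        digest.update(node["op"].encode())
        digest.update(repr(tuple(node["inputs"])).encode())
    return digest.hexdigest()


graph_a = [
    {"op": "placeholder", "inputs": []},
    {"op": "linear", "inputs": [0], "weight": [1.0, 2.0]},
    {"op": "relu", "inputs": [1]},
]
# Same structure as graph_a, different weights.
graph_b = [
    {"op": "placeholder", "inputs": []},
    {"op": "linear", "inputs": [0], "weight": [9.0, 9.0]},
    {"op": "relu", "inputs": [1]},
]
# Different structure: the activation changed.
graph_c = [
    {"op": "placeholder", "inputs": []},
    {"op": "linear", "inputs": [0], "weight": [1.0, 2.0]},
    {"op": "gelu", "inputs": [1]},
]

print(structure_hash(graph_a) == structure_hash(graph_b))
print(structure_hash(graph_a) == structure_hash(graph_c))
```

Two graphs with equal structural hashes are candidates for sharing a cached engine, with the weights supplied afterwards by a refit.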

C++11 ABI Changes

To keep pace with PyTorch, as of release 2.6, we switched docker images from manylinux to manylinux2_28. In Torch/Torch-TensorRT 2.6, PRE_CXX11_ABI is used for CUDA 11.8 and 12.4, while CXX11_ABI is used for CUDA 12.6. For Torch/Torch-TensorRT 2.7, CXX11_ABI will be used for all CUDA versions (11.8, 12.4, and 12.6).

Explicit Typing

We introduce a new compilation setting, use_explicit_typing, to enable mixed precision inference with Torch-TensorRT. When this flag is enabled, TensorRT operates in strong typing mode, ensuring that layer data types are preserved during compilation. For a detailed demonstration of this behavior, refer to the provided tutorial. To learn more about strong typing in TensorRT, refer to the relevant section in the TensorRT Developer Guide.

Model Zoo

Multi-GPU Improvements

There are experimental improvements to multi-gpu workflows, including pulling NCCL operations into TensorRT subgraphs automatically. These should be considered alpha stability. More information can be found here: https://github.com/pytorch/TensorRT/tree/main/examples/distributed_inference

What's Changed


Torch-TensorRT v2.5.0

18 Oct 19:45
f2e1e6c


PyTorch 2.5, CUDA 12.4, TensorRT 10.3, Python 3.12

Torch-TensorRT 2.5.0 targets PyTorch 2.5, TensorRT 10.3 and CUDA 12.4 (builds for CUDA 11.8/12.1 are available via the PyTorch package index - https://download.pytorch.org/whl/cu118 https://download.pytorch.org/whl/cu121).

Deprecation notice

The torchscript frontend will be deprecated in v2.6. Specifically, the following usage will no longer be supported and will issue a deprecation warning at runtime if used:

torch_tensorrt.compile(model, ir="torchscript")

Moving forward, we encourage users to transition to one of the supported options:

torch_tensorrt.compile(model)
torch_tensorrt.compile(model, ir="dynamo")
torch.compile(model, backend="tensorrt")

Torchscript will continue to be supported as a deployment format via post-compilation tracing:

dynamo_model = torch_tensorrt.compile(model, ir="dynamo", arg_inputs=[...])
ts_model = torch.jit.trace(dynamo_model, example_inputs=[...])
ts_model(...)

Please refer to the README for more information regarding our deprecation policy.

Refit (Beta)

v2.5.0 introduces direct model refitting from PyTorch for your compiled Torch-TensorRT programs. Sometimes the weights need to change through the course of inference and in the past full recompilation was necessary to change out the weights of the model, either through automatic recompilation through torch.compile or through manual recompilation with torch_tensorrt.compile. Now using the refit_module_weights API, compiled modules can be refitted by providing a new PyTorch module (with identical structure) containing the new weights. Compiled modules must be compiled with make_refittable to use this feature.

# Create and export the updated model
model2 = models.resnet18(pretrained=True).eval().to("cuda")
exp_program2 = torch.export.export(model2, tuple(inputs))


compiled_trt_ep = torch_trt.load("./compiled.ep")

# This returns a new module with updated weights
new_trt_gm = refit_module_weights(
    compiled_module=compiled_trt_ep,
    new_weight_module=exp_program2,
)

There are some ops that are not compatible with refit, such as ops that utilize ILoop layer. When make_refittable is enabled, these ops will be forced to run in PyTorch. It should also be known that engines that are refit enabled may be slightly less performant than non-refittable engines as TensorRT cannot tune for the specific weights it will see at execution time.

Refit Caching (Experimental)

Refitting on its own can speed up model swap times by 0.5-2x. However, refit can be made faster still with refit caching. At compile time, refit caching stores hints for a direct mapping from PyTorch module members to TRT layer names in the metadata of TorchTensorRTModule. This caching can speed up refit by orders of magnitude. However, it currently has limitations when dealing with layers that have compile-time optimizations. This feature is still experimental, as some ops may not be amenable to refit caching. The cache is still used by default when refitting so that we can collect feedback on edge cases, and we provide an output validator which can be used to ensure that the refit occurred properly. When verify_outputs is True and the cached refit fails validation, the refitter discards the cache and refits from scratch.

new_trt_gm = refit_module_weights(
    compiled_module=compiled_trt_ep,
    new_weight_module=exp_program2,
    arg_inputs=inputs,
    verify_outputs=True, 
)
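The verify-then-fall-back behaviour described above can be sketched in plain Python. This is a conceptual illustration only: `refit`, `build_mapping`, `cached_mapping`, and `verify` are hypothetical names, and dictionaries stand in for module weights and engine layers.

```python
def refit(new_weights, build_mapping, cached_mapping=None, verify=None):
    """Try the cached name mapping first; rebuild the mapping if it fails.

    new_weights:    PyTorch parameter name -> value
    cached_mapping: PyTorch parameter name -> engine layer name (may be stale)
    build_mapping:  callable that recomputes the full mapping (slow path)
    verify:         optional callable that validates the refitted weights
    """
    if cached_mapping is not None:
        candidate = {
            trt_name: new_weights[torch_name]
            for torch_name, trt_name in cached_mapping.items()
            if torch_name in new_weights
        }
        if verify is None or verify(candidate):
            return candidate, "cache"
    # Cache missing or validation failed: discard it and refit from scratch.
    mapping = build_mapping()
    return {trt: new_weights[pt] for pt, trt in mapping.items()}, "scratch"


weights = {"fc.weight": 1, "fc.bias": 2}
full_mapping = {"fc.weight": "L0.kernel", "fc.bias": "L0.bias"}
stale_mapping = {"fc.weight": "L0.kernel"}      # misses the bias


def all_layers_present(engine_weights):
    return len(engine_weights) == 2


fast = refit(weights, lambda: full_mapping, full_mapping, all_layers_present)
slow = refit(weights, lambda: full_mapping, stale_mapping, all_layers_present)
```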

MutableTorchTensorRTModule (Experimental)

torch.compile is incredibly useful for optimizing models that may change over time, since it can automatically recompile the module when something changes. However, the major limitation of torch.compile is that its output cannot be serialized. For users who want similar flexibility plus the ability to serialize and move their work, we have introduced the MutableTorchTensorRTModule. This module wraps a PyTorch module and exposes its members transparently; it also injects listeners on setattr and overrides the forward function to use TensorRT-accelerated subgraphs. This means you can make changes to your module, such as applying adapters, and the MutableTorchTensorRTModule will detect the change and mark the function for refit or recompilation accordingly. As with torch.compile, this is done in a JIT manner, so the first inference after a change will perform the refit or recompile operation.

from diffusers import DiffusionPipeline

with torch.no_grad():
    settings = {
        "use_python_runtime": True,
        "enabled_precisions": {torch.float16},
        "debug": True,
        "make_refittable": True,
    }

    model_id = "runwayml/stable-diffusion-v1-5"
    device = "cuda:0"

    prompt = "house in forest, shuimobysim, wuchangshuo, best quality"
    negative = "(worst quality:2), (low quality:2), (normal quality:2), lowres, normal quality, out of focus, cloudy, (watermark:2),"

    pipe = DiffusionPipeline.from_pretrained(
        model_id, revision="fp16", torch_dtype=torch.float16
    )
    pipe.to(device)

    # The only extra line you need
    pipe.unet = torch_trt.MutableTorchTensorRTModule(pipe.unet, **settings)

    image = pipe(prompt, negative_prompt=negative, num_inference_steps=30).images[0]
    image.save("./without_LoRA_mutable.jpg")

    # Standard Huggingface LoRA loading procedure
    pipe.load_lora_weights(
        "stablediffusionapi/load_lora_embeddings",
        weight_name="moxin.safetensors",
        adapter_name="lora1",
    )
    pipe.set_adapters(["lora1"], adapter_weights=[1])
    pipe.fuse_lora()
    pipe.unload_lora_weights()

    # Refit triggered
    image = pipe(prompt, negative_prompt=negative, num_inference_steps=30).images[0]
    image.save("./with_LoRA_mutable.jpg")
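The listener-and-lazy-refit pattern described above can be sketched with a toy wrapper. This is a conceptual illustration only, not how MutableTorchTensorRTModule is implemented: `MutableWrapper`, `Scale`, and `refit_count` are hypothetical names, and the "refit" is just a counter.

```python
class MutableWrapper:
    """Toy stand-in: tracks attribute writes and refits lazily on the next call."""

    def __init__(self, module):
        # Bypass our own __setattr__ for internal bookkeeping.
        object.__setattr__(self, "_module", module)
        object.__setattr__(self, "_needs_refit", True)  # first call builds
        object.__setattr__(self, "refit_count", 0)

    def __getattr__(self, name):
        # Only reached when normal lookup fails: expose wrapped members.
        return getattr(object.__getattribute__(self, "_module"), name)

    def __setattr__(self, name, value):
        # Forward the write and remember that the compiled state is stale.
        setattr(self._module, name, value)
        object.__setattr__(self, "_needs_refit", True)

    def __call__(self, x):
        if self._needs_refit:
            # A real implementation would refit or recompile here.
            object.__setattr__(self, "refit_count", self.refit_count + 1)
            object.__setattr__(self, "_needs_refit", False)
        return self._module.forward(x)


class Scale:
    def __init__(self, factor):
        self.factor = factor

    def forward(self, x):
        return x * self.factor


wrapped = MutableWrapper(Scale(2))
first = wrapped(5)       # triggers the initial build
second = wrapped(5)      # no change, nothing to redo
wrapped.factor = 3       # the write is detected and marks the wrapper stale
third = wrapped(5)       # the refit happens here, just in time
```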

Engine Caching

In some scenarios, users may compile a module multiple times and each time it takes a long time to build a TensorRT engine in the backend. Engine caching will boost performance by reusing previously compiled TensorRT engines rather than recompiling it every time, thereby avoiding recompilation time. When a cached engine is loaded, it will be refitted with the new module weights.

To make caching more efficient, two graph modules with the same structure are treated as the same (i.e., isomorphic) even if their weights differ. Isomorphic graph modules with the same compilation settings will share cached engines.

We implemented DiskEngineCache so that users can directly use the APIs to control how and where to save/load cached engines on the disk of the local machine. For example,

trt_gm = torch_trt.dynamo.compile(
    exp_program,
    tuple(inputs),
    make_refittable=True,
    cache_built_engines=True,
    reuse_cached_engines=True,
    engine_cache_dir="/tmp/torch_trt_engine_cache",
    engine_cache_size=1 << 30,  # 1GB
)

In addition, considering some users want to save to or load engines from other servers, clusters, or cloud, we also provided a base class BaseEngineCache so that users are able to easily implement their own logic to save and load engines. For example,

class MyEngineCache(BaseEngineCache):
    def __init__(
        self,
        addr: str,
    ) -> None:
        self.addr = addr

    def save(
        self,
        hash: str,
        blob: bytes,
        prefix: str = "blob",
    ):
        # user's customized function to save engines
        write_to(self.addr, name=f"{prefix}_{hash}.bin", content=blob)

    def load(self, hash: str, prefix: str = "blob") -> Optional[bytes]:
        # user's customized function to load engines
        return read_from(self.addr, name=f"{prefix}_{hash}.bin")


trt_gm = torch_trt.dynamo.compile(
    exp_program,
    tuple(inputs),
    make_refittable=True,
    cache_built_engines=True,
    reuse_cached_engines=True,
    custom_engine_cache=MyEngineCache("xxxxx"),
)
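For local experimentation, a cache with the same save/load shape as MyEngineCache can be backed by a dictionary. The sketch below is standalone so that it runs without Torch-TensorRT installed; in real use the class would presumably subclass BaseEngineCache as shown above. `InMemoryEngineCache` is a hypothetical name.

```python
from typing import Dict, Optional


class InMemoryEngineCache:
    """Same save/load signatures as MyEngineCache above, backed by a dict."""

    def __init__(self) -> None:
        self._store: Dict[str, bytes] = {}

    def save(self, hash: str, blob: bytes, prefix: str = "blob") -> None:
        # Key layout mirrors the file naming used in the example above.
        self._store[f"{prefix}_{hash}.bin"] = blob

    def load(self, hash: str, prefix: str = "blob") -> Optional[bytes]:
        # Returns None on a cache miss.
        return self._store.get(f"{prefix}_{hash}.bin")


cache = InMemoryEngineCache()
miss = cache.load("abc123")
cache.save("abc123", b"serialized-engine")
hit = cache.load("abc123")
```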

CUDA Graphs

In v2.5.0, CUDA graph support for in-engine kernel launch optimization has been added through a new runtime mode. This mode can be activated from Python using

import torch_tensorrt 

my_torchtrt_model = torch_tensorrt.compile(...)

with torch_tensorrt.runtime.enable_cudagraphs():
    my_torchtrt_model(inputs)

This mode works by creating CUDAGraphs around individual TensorRT engines, which improves their efficiency. It creates a graph through a capture phase which is tied to the input shape of the engine. When the input shape changes, this graph is invalidated and automatically recaptured.
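The shape-keyed recapture behaviour can be sketched in plain Python. This is a conceptual illustration only and involves no CUDA: `ShapeKeyedGraph` and `capture_count` are hypothetical names, and nested lists stand in for 2-D tensors.

```python
class ShapeKeyedGraph:
    """Toy model of capture and replay keyed on the input shape."""

    def __init__(self, fn):
        self._fn = fn
        self._captured_shape = None
        self.capture_count = 0

    def __call__(self, batch):
        shape = (len(batch), len(batch[0]))
        if shape != self._captured_shape:
            # Shape changed: the old capture is invalid, so capture again.
            self._captured_shape = shape
            self.capture_count += 1
        # Otherwise the existing capture would simply be replayed.
        return self._fn(batch)


def row_sums(batch):
    return [sum(row) for row in batch]


graph = ShapeKeyedGraph(row_sums)
graph([[1, 2, 3], [4, 5, 6]])                # first call captures for 2x3
graph([[7, 8, 9], [1, 1, 1]])                # same shape, no recapture
result = graph([[1, 0, 0]] * 4)              # 4x3 input forces a recapture
```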

Model Optimizer based Int8 Quantization (PTQ) support for Linux

This version introduces official support for INT8 quantization via modelopt (https://github.com/NVIDIA/TensorRT-Model-Optimizer) 17.0 for Linux.
Full examples can be found at https://github.com/pytorch/TensorRT/blob/main/examples/dynamo/vgg16_ptq.py

Running the VGG16 example for INT8 PTQ:

Step 1: Generate a checkpoint file for VGG16:

cd examples/int8/training/vgg16
python main.py --lr 0.01 --batch-size 128 --drop-ratio 0.15 \
--ckpt-dir $(pwd)/vgg16_ckpts --epochs 20 --seed 545

This should produce a checkpoint file at examples/int8/training/vgg16/vgg16_ckpts/ckpt_epoch20.pth

Step 2: Run INT8 PTQ for VGG16:

python examples/dynamo/vgg16_fp8_ptq.py --batch-size 128 \
--ckpt=examples/int8/training/vgg16/vgg16_ckpts/ckpt_epoch20.pth \
--quantize-type=int8

LLM examples

We now offer dynamic shape support for all converters (covering core ATen operations). Dynamic shapes are widely utilized in leading LLM models, where input sequence lengths may vary. With this release, we showcase full graph compilation for Ll...


Torch-TensorRT v2.4.0

29 Jul 21:50
77278fe


C++ runtime support in Windows, Enhanced Dynamic Shape support in Converters, PyTorch 2.4, CUDA 12.4, TensorRT 10.1, Python 3.12

Torch-TensorRT 2.4.0 targets PyTorch 2.4, CUDA 12.4 (builds for CUDA 11.8/12.1 are available via the PyTorch package index - https://download.pytorch.org/whl/cu118 https://download.pytorch.org/whl/cu121) and TensorRT 10.1.
This version introduces official support for the C++ runtime on the Windows platform, though it is limited to the dynamo frontend, supporting both AOT and JIT workflows. Users can now utilize both Python and C++ runtimes on Windows. Additionally, this release expands support to include all Aten Core Operators, except torch.nonzero, and significantly increases dynamic shape support across more converters. Python 3.12 is supported for the first time in this release.

Full Windows Support

In this release we introduce both C++ and Python runtime support in Windows. Users can now directly optimize PyTorch models with TensorRT on Windows, with no code changes. The C++ runtime is the default option, and users can enable the Python runtime by specifying use_python_runtime=True.

import torch
import torch_tensorrt
import torchvision.models as models

model = models.resnet18(pretrained=True).eval().to("cuda")
input = torch.randn((1, 3, 224, 224)).to("cuda")
trt_mod = torch_tensorrt.compile(model, ir="dynamo", inputs=[input])
trt_mod(input)

Enhanced Op support in Converters

Converter support now covers nearly 100% of core ATen. At this point, fallback to PyTorch execution is due either to specific limitations of converters or to some combination of user compiler settings (e.g. torch_executed_ops, dynamic shape). This release also expands the number of operators that support dynamic shape. dryrun will provide specific information on support for your model and settings.

What's Changed
