03 Jun 18:40

f9cc746

v0.8.0 Latest

Latest

TensorRT Edge-LLM 0.8.0 Release 2026-06-02

We are very excited to announce release 0.8.0 of TensorRT Edge-LLM!

TensorRT Edge-LLM 0.8.0 completes the transition to the checkpoint-based export workflow. The tensorrt_edgellm package is now the official front-end for quantization and ONNX export.

This release has been validated on the primary NVIDIA edge platforms, including Jetson Orin and Thor-class devices, and adds support to the just released Jetpack 7.2, which can be downloaded from Nvidia’s official site. As a general sizing guideline, a model is expected to run when the target device has at least the model size plus approximately 3 GB of available memory for runtime overhead. Please see our updated Quick Start Guide, Supported Model List, and model examples for the latest deployment guidance.

Breaking Changes

The legacy ONNX export workflow has been removed. The checkpoint-based workflow is now the official TensorRT Edge-LLM export path.
experimental/llm_loader and experimental/quantization functionality has moved into the tensorrt_edgellm package.
Use the unified CLI commands:
- tensorrt-edgellm-quantize
- tensorrt-edgellm-export
- tensorrt-edgellm-merge-lora
- tensorrt-edgellm-reduce-vocab
- tensorrt-edgellm-preprocess-audio
Older per-component export commands such as tensorrt-edgellm-export-llm, tensorrt-edgellm-export-visual, and python -m llm_loader.export_all_cli should be replaced by tensorrt-edgellm-export.

Key Features

Promoted tensorrt_edgellm as the official checkpoint-based quantization and ONNX export workflow, with unified CLI commands for export, quantization, LoRA, vocabulary reduction, and audio preprocessing
Added Qwen3/Qwen3.5/Qwen3.6 MoE NVFP4 export
Reduced engine build memory for INT4 FFN, INT4 MoE, and LM-head weights
Added OpenAI-compatible server tool-calling support, per-request stop strings, and Qwen3-TTS streaming improvements
Updated dependency pins, including PyTorch 2.12, Transformers 5.9, and ModelOpt 0.44

Other Important Features

Runtime and Performance

Added pluggable decoding support for vanilla, EAGLE, and MTP runtime paths
Added Qwen3-ASR export and quantization workflows, including independent LLM and audio-encoder precision selection, including FP8 Qwen3 ASR Audio Encoder
Added fused gate+up and XQA kernel support for new MoE configurations
Fixed XQA kernel-loader thread-safety and CUDA 13.3 MoE top-k softmax build issues

Export and Quantization

Improved mixed-precision quantization handling for fused QKV and gate/up projections

Server and API

Added experimental Dockerfiles for containerized development
Expanded high-level Python API and server validation for LLM, VLM, and streaming flows

Documentation

Updated documentation for checkpoint export, quantization, LoRA, FP8 KV cache, FP8 embedding, speculative decoding, ASR, TTS, and server usage

NVIDIA Contributors

@fans-nv @nvluxiaoz @nvamberl @ruocheng-nv @poweiw @yuanyao-nv @Jasper-NV @charllll @mahu888 @xinrzhang-star @JCalafato @nvmbreughe @nvxingkaiz @ever-wong @zhijial-nvidia @willg-nv @levichen-nvidia @xiangg-nv @nvyocox @nv-samcheng

Contributors

JCalafato, charllll, and 18 other contributors

Assets 2

20 May 02:31

nvluxiaoz

v0.7.1

3647690

v0.7.1

TensorRT Edge-LLM 0.7.1 Release 2026-05-19

We are very excited to announce release 0.7.1 of TensorRT Edge-LLM!

We welcome community contribution. We added Alpamayo-1-10B support in our software and thanks for @Turoad to raise #67. This MR provides insights for our implementation.
We are excited to launch our pages on Jetson AI Lab. Please check more tutorials and model deployment guidelines there.

Deprecation Notice

The original workflow used to quantize and export ONNX is replaced by the new checkpoint-based workflow to quantize and export ONNX. This workflow was released as "experimental" in prior versions, but is now the official front-end into TensorRT Edge-LLM. The original tensorrt_edgellm is deprecated and will be removed in 0.8.0.

Key Features

Added Qwen3.5 MTP support
Added Alpamayo-1-10B support
Added Qwen3-TTS streaming support
Added FP8 ViT and Qwen3-TTS support for experimental loader
Added migration for customization and usage with the experimental loader
Improved Mamba prefill kernel performance
Rearchitected runtime with composable stacks
Fixed #81 and other bugs

NVIDIA Contributors

Contributors

JCalafato, charllll, and 20 other contributors

Assets 2

0 Join discussion

28 Apr 21:43

nvxingkaiz

v0.7.0

bbbab9a

v0.7.0

TensorRT Edge-LLM 0.7.0 Release 2026-04-28

We are very excited to announce release 0.7.0 of TensorRT Edge-LLM!

We welcome community contribution and have merged #68 and #42. Thanks the contributors @matiaslin and @victoroliv2.
We are excited to launch our pages on Jetson AI Lab. Please check more tutorials and model deployment guidelines there.

Deprecation Notice

The experimental workflow is expected to reach full parity of the original workflow. The original tensorrt_edgellm is deprecated and will be removed in 0.8.0.

Key Features

Introduced Day 0 Support of NVIDIA Nemotron 3 Nano Omni. Please review Jetson AI Lab on how to run TensorRT Edge-LLM for this model.
Introduced an experimental High Level Python API and OpenAI Compatible Server. Please follow our Quick Start Guide, Introduction and Code.
Implemented an agent-friendly experimental new workflow to quantize and convert models to ONNX. The quantized checkpoints are now compatible with major frameworks like TensorRT-LLM, vLLM and sgLang, and the ONNX conversion has 0 GPU consumption saves > 70% total memory consumption.
Expanded pre-quantized models into our model support lists. Please see our updated [supported model lists].(https://nvidia.github.io/TensorRT-Edge-LLM/0.7.0/user_guide/getting_started/supported-models.html)
Added AGENTS.md to assist development and users. Automatic agents to debug accuracy is added.
Added performance dashboard for selected models. The dashboard will keep updating for newer versions and models.

Other Important Features

Model Support

Added Qwen3.5 LLM/VLM support
Added NVIDIA Nemotron Nano 3 30B A3B NVFP4 support
Added Qwen3-ASR quantization workflow

Performance Improvements

Added FP8 embedding support to reduce embedding-table memory
Reduced runtime memory by sharing TensorRT execution context memory
Added multi-batch cutedsl prefill kernels
Improved performance by cutedsl-based Mamba SSD kernels

Runtime Extensions

Unified LLM runtime execution paths for vanilla and EAGLE3 decoding
Added per-slot streaming with the StreamChannel API
Replaced LinearKVCache with per layer HybridCacheManager to support more hybrid models

Workflow Improvements

Improved build workflow with automatic TensorRT detection and cutedsl kernel binary shipment
Upgraded Transformers support to 5.x

NVIDIA Contributors

Contributors

victoroliv2, JCalafato, and 19 other contributors

Assets 2

0 Join discussion

16 Apr 16:13

nvluxiaoz

v0.6.1

2620a97

v0.6.1

TensorRT Edge-LLM 0.6.1 Release 2026-04-15

Added DriveOS 7.2.4 official support
Fixed EAGLE draft model weights loading issue to retrieve acceptance rate

Assets 2

0 Join discussion

19 Mar 04:50

nvluxiaoz

v0.6.0

996623c

v0.6.0

TensorRT Edge-LLM 0.6.0 Release 2026-03-16

We are very excited to announce the 0.6.0 release of TensorRT Edge-LLM!

TensorRT Edge-LLM is featured in GTC 2026! Link to our blog.
Our developer roadmap for H1 2026 is listed in #32.
Welcome to check out our flagship demo "Run Physical AI Models Faster on Jetson" at GTC 2026. This demo uses TensorRT Edge-LLM as one of the backends to showcase ASR/LLM/TTS capability of NVIDIA Jetson AGX Thor.

Breaking Changes

Due to the ViT Attention Plugin and split QKV changes, ONNX models exported by previous versions are not compatible with 0.6.0.

Key Features

Model Support

Added Nemotron-Nano-9B-v2 support via mamba_ssm and causal_conv1d
Added day 0 support for Nemotron-3-Nano-4B
Added Qwen3-30B-A3B-GPTQ-Int4 support via INT4 MoE Plugin
Added Qwen3-ASR and Qwen3-TTS end-to-end support

Performance Improvements

Added cutedsl FMHA kernels to speed up prefill performance on Blackwell
Used ViT Attention Plugin with fmha-v2 and cutedsl kernels to speed up multi image ViT performance
Updated Attention Plugin to split q, k and v to save memory usage

Runtime Maturity

Added LoRA support for Speculative Decoding
Fixed several compiler warnings and document exceptions for functions
Added coverage tests

NVIDIA Contributors

@fans-nv @nvluxiaoz @nvamberl @ruocheng-nv @Jasper-NV @yuanyao-nv @fkaster-nvidia @poweiw @charllll @mahu888 @xinrzhang-star @JCalafato @nvmbreughe

Contributors

JCalafato, charllll, and 11 other contributors

Assets 2

5 Join discussion

20 Feb 00:36

nvluxiaoz

v0.5.0

8fe7fe1

v0.5.0

TensorRT Edge-LLM 0.5.0 Release 2026-02-19

We are very excited to announce the second release of TensorRT Edge-LLM! We welcome community contribution and have merged the first community PR #13 @taoz27) in this release. Our developer roadmap for H1 2026 is listed in #32.

Breaking Changes

Due to the standalone embedding processing module features, ONNX models exported by previous versions are not compatible with 0.5.0.

Key Features

Implemented and used standalone embedding processing module to reduce multi-modal modeling complexity and reduce Eagle inference memory footprint
Added FP8 KV Cache support
Unified TensorRT execution context for prefill and decode to reduce memory footprint
Supported vanilla decoding for speculative decoding runtime
Used collision resistant hashing for CUDA graphs
Updated int4GroupwiseGemmPlugin to TensorRT Plugin-v3 interface
Refactored documentations
Added ViT attention mask and RoPE parameter caching to reduce recomputation for Qwen (Merged #13 by @taoz27)
Added Jetpack 6.2 compatibility

NVIDIA Contributors

@fans-nv @nvluxiaoz @nvamberl @ruocheng-nv @fkaster-nvidia @chemudupatiks @poweiw @charllll @mahu888 @xinrzhang-star @byte-deve @JCalafato @nvmbreughe

Contributors

JCalafato, taoz27, and 12 other contributors

Assets 2

0 Join discussion

05 Jan 22:50

nvluxiaoz

v0.4.0

50a61d0

v0.4.0

TensorRT Edge-LLM 0.4.0 Release 2026-01-06

We are very excited to announce the first release of TensorRT Edge-LLM! TensorRT Edge-LLM is NVIDIA's high-performance C++ inference runtime for Large Language Models (LLMs) and Vision-Language Models (VLMs) on embedded platforms. Please follow our Quick Start Guide for the usage.

Key Components

Flexible Python CLI exporter from HuggingFace checkpoints to ONNX
C++ TensorRT engine builder
C++ tokenizers and multi-modal processors
C++ runtime, including vanilla decoding and EAGLE3 speculative decoding
Optimized CUDA kernels for Multi-Head Attention, Sampling and EAGLE3 utility
Examples to run inference and perform accuracy evaluations and perf benchmarks

Model Support

Llama3.x
Qwen2/2.5/3 (Dense)
Qwen2/2.5/3-VL (Dense)
InternVL3
Phi4-Multimodal

Please check the model support page for more details.

Key Features

Model Export

nvfp4/fp8/int4 quantization
nvfp4/fp8 lm_head quantization
EAGLE3 draft quantization
Vocab reduction

Runtime

Multi-batch EAGLE3 speculative decoding for LLM and VLM
Decoding CUDA Graph
System prompt KVCache reuse
Open-AI style chat template
Dynamic LoRA switching

Assets 2

0 Join discussion

Releases: NVIDIA/TensorRT-Edge-LLM

v0.8.0

TensorRT Edge-LLM 0.8.0 Release 2026-06-02

Breaking Changes

Key Features

Other Important Features

Runtime and Performance

Export and Quantization

Server and API

Documentation

NVIDIA Contributors

Contributors

Uh oh!

v0.7.1

TensorRT Edge-LLM 0.7.1 Release 2026-05-19

Deprecation Notice

Key Features

NVIDIA Contributors

Contributors

Uh oh!

v0.7.0

TensorRT Edge-LLM 0.7.0 Release 2026-04-28

Deprecation Notice

Key Features

Other Important Features

Model Support

Performance Improvements

Runtime Extensions

Workflow Improvements

NVIDIA Contributors

Contributors

Uh oh!

v0.6.1

TensorRT Edge-LLM 0.6.1 Release 2026-04-15

Uh oh!

v0.6.0

TensorRT Edge-LLM 0.6.0 Release 2026-03-16

Breaking Changes

Key Features

Model Support

Performance Improvements

Runtime Maturity

NVIDIA Contributors

Contributors

Uh oh!

v0.5.0

TensorRT Edge-LLM 0.5.0 Release 2026-02-19

Breaking Changes

Key Features

NVIDIA Contributors

Contributors

Uh oh!

v0.4.0

TensorRT Edge-LLM 0.4.0 Release 2026-01-06

Key Components

Model Support

Key Features

Model Export

Runtime

Uh oh!