Releases: NVIDIA/TensorRT-Edge-LLM
v0.8.0
TensorRT Edge-LLM 0.8.0 Release 2026-06-02
We are very excited to announce release 0.8.0 of TensorRT Edge-LLM!
TensorRT Edge-LLM 0.8.0 completes the transition to the checkpoint-based export workflow. The tensorrt_edgellm package is now the official front-end for quantization and ONNX export.
This release has been validated on the primary NVIDIA edge platforms, including Jetson Orin and Thor-class devices, and adds support to the just released Jetpack 7.2, which can be downloaded from Nvidia’s official site. As a general sizing guideline, a model is expected to run when the target device has at least the model size plus approximately 3 GB of available memory for runtime overhead. Please see our updated Quick Start Guide, Supported Model List, and model examples for the latest deployment guidance.
Breaking Changes
- The legacy ONNX export workflow has been removed. The checkpoint-based workflow is now the official TensorRT Edge-LLM export path.
experimental/llm_loaderandexperimental/quantizationfunctionality has moved into thetensorrt_edgellmpackage.- Use the unified CLI commands:
tensorrt-edgellm-quantizetensorrt-edgellm-exporttensorrt-edgellm-merge-loratensorrt-edgellm-reduce-vocabtensorrt-edgellm-preprocess-audio
- Older per-component export commands such as
tensorrt-edgellm-export-llm,tensorrt-edgellm-export-visual, andpython -m llm_loader.export_all_clishould be replaced bytensorrt-edgellm-export.
Key Features
- Promoted
tensorrt_edgellmas the official checkpoint-based quantization and ONNX export workflow, with unified CLI commands for export, quantization, LoRA, vocabulary reduction, and audio preprocessing - Added Qwen3/Qwen3.5/Qwen3.6 MoE NVFP4 export
- Reduced engine build memory for INT4 FFN, INT4 MoE, and LM-head weights
- Added OpenAI-compatible server tool-calling support, per-request stop strings, and Qwen3-TTS streaming improvements
- Updated dependency pins, including PyTorch 2.12, Transformers 5.9, and ModelOpt 0.44
Other Important Features
Runtime and Performance
- Added pluggable decoding support for vanilla, EAGLE, and MTP runtime paths
- Added Qwen3-ASR export and quantization workflows, including independent LLM and audio-encoder precision selection, including FP8 Qwen3 ASR Audio Encoder
- Added fused gate+up and XQA kernel support for new MoE configurations
- Fixed XQA kernel-loader thread-safety and CUDA 13.3 MoE top-k softmax build issues
Export and Quantization
- Improved mixed-precision quantization handling for fused QKV and gate/up projections
Server and API
- Added experimental Dockerfiles for containerized development
- Expanded high-level Python API and server validation for LLM, VLM, and streaming flows
Documentation
- Updated documentation for checkpoint export, quantization, LoRA, FP8 KV cache, FP8 embedding, speculative decoding, ASR, TTS, and server usage
NVIDIA Contributors
@fans-nv @nvluxiaoz @nvamberl @ruocheng-nv @poweiw @yuanyao-nv @Jasper-NV @charllll @mahu888 @xinrzhang-star @JCalafato @nvmbreughe @nvxingkaiz @ever-wong @zhijial-nvidia @willg-nv @levichen-nvidia @xiangg-nv @nvyocox @nv-samcheng
v0.7.1
TensorRT Edge-LLM 0.7.1 Release 2026-05-19
We are very excited to announce release 0.7.1 of TensorRT Edge-LLM!
- We welcome community contribution. We added Alpamayo-1-10B support in our software and thanks for @Turoad to raise #67. This MR provides insights for our implementation.
- We are excited to launch our pages on Jetson AI Lab. Please check more tutorials and model deployment guidelines there.
Deprecation Notice
- The original workflow used to quantize and export ONNX is replaced by the new checkpoint-based workflow to quantize and export ONNX. This workflow was released as "experimental" in prior versions, but is now the official front-end into TensorRT Edge-LLM. The original tensorrt_edgellm is deprecated and will be removed in 0.8.0.
Key Features
- Added Qwen3.5 MTP support
- Added Alpamayo-1-10B support
- Added Qwen3-TTS streaming support
- Added FP8 ViT and Qwen3-TTS support for experimental loader
- Added migration for customization and usage with the experimental loader
- Improved Mamba prefill kernel performance
- Rearchitected runtime with composable stacks
- Fixed #81 and other bugs
NVIDIA Contributors
@fans-nv @nvluxiaoz @nvamberl @ruocheng-nv @poweiw @yuanyao-nv @Jasper-NV @charllll @mahu888 @xinrzhang-star @JCalafato @nvmbreughe @nvxingkaiz @ever-wong @zhijial-nvidia @willg-nv @levichen-nvidia @xiangg-nv @maggie-j-liu @nvyocox @nv-samcheng
v0.7.0
TensorRT Edge-LLM 0.7.0 Release 2026-04-28
We are very excited to announce release 0.7.0 of TensorRT Edge-LLM!
- We welcome community contribution and have merged #68 and #42. Thanks the contributors @matiaslin and @victoroliv2.
- We are excited to launch our pages on Jetson AI Lab. Please check more tutorials and model deployment guidelines there.
Deprecation Notice
- The experimental workflow is expected to reach full parity of the original workflow. The original tensorrt_edgellm is deprecated and will be removed in 0.8.0.
Key Features
- Introduced Day 0 Support of NVIDIA Nemotron 3 Nano Omni. Please review Jetson AI Lab on how to run TensorRT Edge-LLM for this model.
- Introduced an experimental High Level Python API and OpenAI Compatible Server. Please follow our Quick Start Guide, Introduction and Code.
- Implemented an agent-friendly experimental new workflow to quantize and convert models to ONNX. The quantized checkpoints are now compatible with major frameworks like TensorRT-LLM, vLLM and sgLang, and the ONNX conversion has 0 GPU consumption saves > 70% total memory consumption.
- Expanded pre-quantized models into our model support lists. Please see our updated [supported model lists].(https://nvidia.github.io/TensorRT-Edge-LLM/0.7.0/user_guide/getting_started/supported-models.html)
- Added AGENTS.md to assist development and users. Automatic agents to debug accuracy is added.
- Added performance dashboard for selected models. The dashboard will keep updating for newer versions and models.
Other Important Features
Model Support
- Added Qwen3.5 LLM/VLM support
- Added NVIDIA Nemotron Nano 3 30B A3B NVFP4 support
- Added Qwen3-ASR quantization workflow
Performance Improvements
- Added FP8 embedding support to reduce embedding-table memory
- Reduced runtime memory by sharing TensorRT execution context memory
- Added multi-batch cutedsl prefill kernels
- Improved performance by cutedsl-based Mamba SSD kernels
Runtime Extensions
- Unified LLM runtime execution paths for vanilla and EAGLE3 decoding
- Added per-slot streaming with the
StreamChannelAPI - Replaced LinearKVCache with per layer HybridCacheManager to support more hybrid models
Workflow Improvements
- Improved build workflow with automatic TensorRT detection and cutedsl kernel binary shipment
- Upgraded Transformers support to 5.x
NVIDIA Contributors
@fans-nv @nvluxiaoz @nvamberl @ruocheng-nv @poweiw @yuanyao-nv @Jasper-NV @charllll @mahu888 @xinrzhang-star @JCalafato @nvmbreughe @nvxingkaiz @ever-wong @zhijial-nvidia @willg-nv @levichen-nvidia @xiangg-nv @maggie-j-liu
v0.6.1
v0.6.0
TensorRT Edge-LLM 0.6.0 Release 2026-03-16
We are very excited to announce the 0.6.0 release of TensorRT Edge-LLM!
- TensorRT Edge-LLM is featured in GTC 2026! Link to our blog.
- Our developer roadmap for H1 2026 is listed in #32.
- Welcome to check out our flagship demo "Run Physical AI Models Faster on Jetson" at GTC 2026. This demo uses TensorRT Edge-LLM as one of the backends to showcase ASR/LLM/TTS capability of NVIDIA Jetson AGX Thor.
Breaking Changes
- Due to the ViT Attention Plugin and split QKV changes, ONNX models exported by previous versions are not compatible with 0.6.0.
Key Features
Model Support
- Added Nemotron-Nano-9B-v2 support via mamba_ssm and causal_conv1d
- Added day 0 support for Nemotron-3-Nano-4B
- Added Qwen3-30B-A3B-GPTQ-Int4 support via INT4 MoE Plugin
- Added Qwen3-ASR and Qwen3-TTS end-to-end support
Performance Improvements
- Added cutedsl FMHA kernels to speed up prefill performance on Blackwell
- Used ViT Attention Plugin with fmha-v2 and cutedsl kernels to speed up multi image ViT performance
- Updated Attention Plugin to split q, k and v to save memory usage
Runtime Maturity
- Added LoRA support for Speculative Decoding
- Fixed several compiler warnings and document exceptions for functions
- Added coverage tests
NVIDIA Contributors
@fans-nv @nvluxiaoz @nvamberl @ruocheng-nv @Jasper-NV @yuanyao-nv @fkaster-nvidia @poweiw @charllll @mahu888 @xinrzhang-star @JCalafato @nvmbreughe
v0.5.0
TensorRT Edge-LLM 0.5.0 Release 2026-02-19
We are very excited to announce the second release of TensorRT Edge-LLM! We welcome community contribution and have merged the first community PR #13 @taoz27) in this release. Our developer roadmap for H1 2026 is listed in #32.
Breaking Changes
- Due to the standalone embedding processing module features, ONNX models exported by previous versions are not compatible with 0.5.0.
Key Features
- Implemented and used standalone embedding processing module to reduce multi-modal modeling complexity and reduce Eagle inference memory footprint
- Added FP8 KV Cache support
- Unified TensorRT execution context for prefill and decode to reduce memory footprint
- Supported vanilla decoding for speculative decoding runtime
- Used collision resistant hashing for CUDA graphs
- Updated int4GroupwiseGemmPlugin to TensorRT Plugin-v3 interface
- Refactored documentations
- Added ViT attention mask and RoPE parameter caching to reduce recomputation for Qwen (Merged #13 by @taoz27)
- Added Jetpack 6.2 compatibility
NVIDIA Contributors
@fans-nv @nvluxiaoz @nvamberl @ruocheng-nv @fkaster-nvidia @chemudupatiks @poweiw @charllll @mahu888 @xinrzhang-star @byte-deve @JCalafato @nvmbreughe
v0.4.0
TensorRT Edge-LLM 0.4.0 Release 2026-01-06
We are very excited to announce the first release of TensorRT Edge-LLM! TensorRT Edge-LLM is NVIDIA's high-performance C++ inference runtime for Large Language Models (LLMs) and Vision-Language Models (VLMs) on embedded platforms. Please follow our Quick Start Guide for the usage.
Key Components
- Flexible Python CLI exporter from HuggingFace checkpoints to ONNX
- C++ TensorRT engine builder
- C++ tokenizers and multi-modal processors
- C++ runtime, including vanilla decoding and EAGLE3 speculative decoding
- Optimized CUDA kernels for Multi-Head Attention, Sampling and EAGLE3 utility
- Examples to run inference and perform accuracy evaluations and perf benchmarks
Model Support
- Llama3.x
- Qwen2/2.5/3 (Dense)
- Qwen2/2.5/3-VL (Dense)
- InternVL3
- Phi4-Multimodal
Please check the model support page for more details.
Key Features
Model Export
- nvfp4/fp8/int4 quantization
- nvfp4/fp8 lm_head quantization
- EAGLE3 draft quantization
- Vocab reduction
Runtime
- Multi-batch EAGLE3 speculative decoding for LLM and VLM
- Decoding CUDA Graph
- System prompt KVCache reuse
- Open-AI style chat template
- Dynamic LoRA switching