Skip to content

Releases: NVIDIA/TensorRT-Edge-LLM

v0.8.0

03 Jun 18:40
f9cc746

Choose a tag to compare

TensorRT Edge-LLM 0.8.0 Release 2026-06-02

We are very excited to announce release 0.8.0 of TensorRT Edge-LLM!

TensorRT Edge-LLM 0.8.0 completes the transition to the checkpoint-based export workflow. The tensorrt_edgellm package is now the official front-end for quantization and ONNX export.

This release has been validated on the primary NVIDIA edge platforms, including Jetson Orin and Thor-class devices, and adds support to the just released Jetpack 7.2, which can be downloaded from Nvidia’s official site. As a general sizing guideline, a model is expected to run when the target device has at least the model size plus approximately 3 GB of available memory for runtime overhead. Please see our updated Quick Start Guide, Supported Model List, and model examples for the latest deployment guidance.

Breaking Changes

  • The legacy ONNX export workflow has been removed. The checkpoint-based workflow is now the official TensorRT Edge-LLM export path.
  • experimental/llm_loader and experimental/quantization functionality has moved into the tensorrt_edgellm package.
  • Use the unified CLI commands:
    • tensorrt-edgellm-quantize
    • tensorrt-edgellm-export
    • tensorrt-edgellm-merge-lora
    • tensorrt-edgellm-reduce-vocab
    • tensorrt-edgellm-preprocess-audio
  • Older per-component export commands such as tensorrt-edgellm-export-llm, tensorrt-edgellm-export-visual, and python -m llm_loader.export_all_cli should be replaced by tensorrt-edgellm-export.

Key Features

  • Promoted tensorrt_edgellm as the official checkpoint-based quantization and ONNX export workflow, with unified CLI commands for export, quantization, LoRA, vocabulary reduction, and audio preprocessing
  • Added Qwen3/Qwen3.5/Qwen3.6 MoE NVFP4 export
  • Reduced engine build memory for INT4 FFN, INT4 MoE, and LM-head weights
  • Added OpenAI-compatible server tool-calling support, per-request stop strings, and Qwen3-TTS streaming improvements
  • Updated dependency pins, including PyTorch 2.12, Transformers 5.9, and ModelOpt 0.44

Other Important Features

Runtime and Performance

  • Added pluggable decoding support for vanilla, EAGLE, and MTP runtime paths
  • Added Qwen3-ASR export and quantization workflows, including independent LLM and audio-encoder precision selection, including FP8 Qwen3 ASR Audio Encoder
  • Added fused gate+up and XQA kernel support for new MoE configurations
  • Fixed XQA kernel-loader thread-safety and CUDA 13.3 MoE top-k softmax build issues

Export and Quantization

  • Improved mixed-precision quantization handling for fused QKV and gate/up projections

Server and API

  • Added experimental Dockerfiles for containerized development
  • Expanded high-level Python API and server validation for LLM, VLM, and streaming flows

Documentation

  • Updated documentation for checkpoint export, quantization, LoRA, FP8 KV cache, FP8 embedding, speculative decoding, ASR, TTS, and server usage

NVIDIA Contributors

@fans-nv @nvluxiaoz @nvamberl @ruocheng-nv @poweiw @yuanyao-nv @Jasper-NV @charllll @mahu888 @xinrzhang-star @JCalafato @nvmbreughe @nvxingkaiz @ever-wong @zhijial-nvidia @willg-nv @levichen-nvidia @xiangg-nv @nvyocox @nv-samcheng

v0.7.1

20 May 02:31
3647690

Choose a tag to compare

TensorRT Edge-LLM 0.7.1 Release 2026-05-19

We are very excited to announce release 0.7.1 of TensorRT Edge-LLM!

  • We welcome community contribution. We added Alpamayo-1-10B support in our software and thanks for @Turoad to raise #67. This MR provides insights for our implementation.
  • We are excited to launch our pages on Jetson AI Lab. Please check more tutorials and model deployment guidelines there.

Deprecation Notice

  • The original workflow used to quantize and export ONNX is replaced by the new checkpoint-based workflow to quantize and export ONNX. This workflow was released as "experimental" in prior versions, but is now the official front-end into TensorRT Edge-LLM. The original tensorrt_edgellm is deprecated and will be removed in 0.8.0.

Key Features

  • Added Qwen3.5 MTP support
  • Added Alpamayo-1-10B support
  • Added Qwen3-TTS streaming support
  • Added FP8 ViT and Qwen3-TTS support for experimental loader
  • Added migration for customization and usage with the experimental loader
  • Improved Mamba prefill kernel performance
  • Rearchitected runtime with composable stacks
  • Fixed #81 and other bugs

NVIDIA Contributors

@fans-nv @nvluxiaoz @nvamberl @ruocheng-nv @poweiw @yuanyao-nv @Jasper-NV @charllll @mahu888 @xinrzhang-star @JCalafato @nvmbreughe @nvxingkaiz @ever-wong @zhijial-nvidia @willg-nv @levichen-nvidia @xiangg-nv @maggie-j-liu @nvyocox @nv-samcheng

v0.7.0

28 Apr 21:43
bbbab9a

Choose a tag to compare

TensorRT Edge-LLM 0.7.0 Release 2026-04-28

We are very excited to announce release 0.7.0 of TensorRT Edge-LLM!

  • We welcome community contribution and have merged #68 and #42. Thanks the contributors @matiaslin and @victoroliv2.
  • We are excited to launch our pages on Jetson AI Lab. Please check more tutorials and model deployment guidelines there.

Deprecation Notice

Key Features

Other Important Features

Model Support

Performance Improvements

  • Added FP8 embedding support to reduce embedding-table memory
  • Reduced runtime memory by sharing TensorRT execution context memory
  • Added multi-batch cutedsl prefill kernels
  • Improved performance by cutedsl-based Mamba SSD kernels

Runtime Extensions

  • Unified LLM runtime execution paths for vanilla and EAGLE3 decoding
  • Added per-slot streaming with the StreamChannel API
  • Replaced LinearKVCache with per layer HybridCacheManager to support more hybrid models

Workflow Improvements

NVIDIA Contributors

@fans-nv @nvluxiaoz @nvamberl @ruocheng-nv @poweiw @yuanyao-nv @Jasper-NV @charllll @mahu888 @xinrzhang-star @JCalafato @nvmbreughe @nvxingkaiz @ever-wong @zhijial-nvidia @willg-nv @levichen-nvidia @xiangg-nv @maggie-j-liu

v0.6.1

16 Apr 16:13
2620a97

Choose a tag to compare

TensorRT Edge-LLM 0.6.1 Release 2026-04-15

  • Added DriveOS 7.2.4 official support
  • Fixed EAGLE draft model weights loading issue to retrieve acceptance rate

v0.6.0

19 Mar 04:50
996623c

Choose a tag to compare

TensorRT Edge-LLM 0.6.0 Release 2026-03-16

We are very excited to announce the 0.6.0 release of TensorRT Edge-LLM!

  • TensorRT Edge-LLM is featured in GTC 2026! Link to our blog.
  • Our developer roadmap for H1 2026 is listed in #32.
  • Welcome to check out our flagship demo "Run Physical AI Models Faster on Jetson" at GTC 2026. This demo uses TensorRT Edge-LLM as one of the backends to showcase ASR/LLM/TTS capability of NVIDIA Jetson AGX Thor.

Breaking Changes

  • Due to the ViT Attention Plugin and split QKV changes, ONNX models exported by previous versions are not compatible with 0.6.0.

Key Features

Model Support

  • Added Nemotron-Nano-9B-v2 support via mamba_ssm and causal_conv1d
  • Added day 0 support for Nemotron-3-Nano-4B
  • Added Qwen3-30B-A3B-GPTQ-Int4 support via INT4 MoE Plugin
  • Added Qwen3-ASR and Qwen3-TTS end-to-end support

Performance Improvements

  • Added cutedsl FMHA kernels to speed up prefill performance on Blackwell
  • Used ViT Attention Plugin with fmha-v2 and cutedsl kernels to speed up multi image ViT performance
  • Updated Attention Plugin to split q, k and v to save memory usage

Runtime Maturity

  • Added LoRA support for Speculative Decoding
  • Fixed several compiler warnings and document exceptions for functions
  • Added coverage tests

NVIDIA Contributors

@fans-nv @nvluxiaoz @nvamberl @ruocheng-nv @Jasper-NV @yuanyao-nv @fkaster-nvidia @poweiw @charllll @mahu888 @xinrzhang-star @JCalafato @nvmbreughe

v0.5.0

20 Feb 00:36
8fe7fe1

Choose a tag to compare

TensorRT Edge-LLM 0.5.0 Release 2026-02-19

We are very excited to announce the second release of TensorRT Edge-LLM! We welcome community contribution and have merged the first community PR #13 @taoz27) in this release. Our developer roadmap for H1 2026 is listed in #32.

Breaking Changes

  • Due to the standalone embedding processing module features, ONNX models exported by previous versions are not compatible with 0.5.0.

Key Features

  • Implemented and used standalone embedding processing module to reduce multi-modal modeling complexity and reduce Eagle inference memory footprint
  • Added FP8 KV Cache support
  • Unified TensorRT execution context for prefill and decode to reduce memory footprint
  • Supported vanilla decoding for speculative decoding runtime
  • Used collision resistant hashing for CUDA graphs
  • Updated int4GroupwiseGemmPlugin to TensorRT Plugin-v3 interface
  • Refactored documentations
  • Added ViT attention mask and RoPE parameter caching to reduce recomputation for Qwen (Merged #13 by @taoz27)
  • Added Jetpack 6.2 compatibility

NVIDIA Contributors

@fans-nv @nvluxiaoz @nvamberl @ruocheng-nv @fkaster-nvidia @chemudupatiks @poweiw @charllll @mahu888 @xinrzhang-star @byte-deve @JCalafato @nvmbreughe

v0.4.0

05 Jan 22:50
50a61d0

Choose a tag to compare

TensorRT Edge-LLM 0.4.0 Release 2026-01-06

We are very excited to announce the first release of TensorRT Edge-LLM! TensorRT Edge-LLM is NVIDIA's high-performance C++ inference runtime for Large Language Models (LLMs) and Vision-Language Models (VLMs) on embedded platforms. Please follow our Quick Start Guide for the usage.

Key Components

Model Support

  • Llama3.x
  • Qwen2/2.5/3 (Dense)
  • Qwen2/2.5/3-VL (Dense)
  • InternVL3
  • Phi4-Multimodal

Please check the model support page for more details.

Key Features

Model Export

  • nvfp4/fp8/int4 quantization
  • nvfp4/fp8 lm_head quantization
  • EAGLE3 draft quantization
  • Vocab reduction

Runtime

  • Multi-batch EAGLE3 speculative decoding for LLM and VLM
  • Decoding CUDA Graph
  • System prompt KVCache reuse
  • Open-AI style chat template
  • Dynamic LoRA switching