R25.03 shantanu by Shantanu1058 · Pull Request #8820 · triton-inference-server/server

Shantanu1058 · 2026-06-05T12:32:45Z

Thanks for submitting a PR to Triton!
Please go to the Preview tab above this description box and select the appropriate sub-template:

If you already created the PR, please replace this message with one of

and fill it out.

…8130) Co-authored-by: BenjaminBraunDev <benjaminbraun@google.com> Co-authored-by: Kyle McGill <kmcgill@nvidia.com> Co-authored-by: Ziqi Fan <ziqif@nvidia.com> Co-authored-by: Yingge He <yinggeh@nvidia.com> Co-authored-by: Yingge He <157551214+yinggeh@users.noreply.github.com> Co-authored-by: Kris Hung <krish@nvidia.com> Co-authored-by: richardhuo-nv <rihuo@nvidia.com> Co-authored-by: Tanmay Verma <tanmay2592@gmail.com> Co-authored-by: Olga Andreeva <124622579+oandreeva-nv@users.noreply.github.com> Co-authored-by: Indrajit Bhosale <iamindrajitb@gmail.com>

…erver#8134) Add tool-calling parser implementations to the OpenAI frontend; the available parsers are llama3 and mistral. Most of the implementation comes from vLLM. Users can select a parser with the --tool-call-parser argument. Add the --chat-template {chat template file path} argument so users can supply a customized template to better tune the prompt for tool calling. Integrate the guided decoding backend with tool calling to enable named and required tool calling. See the README.md changes for more detail. All changes in python/openai/openai_frontend/engine/utils/tool_call_parsers are from vLLM with some minor compatibility changes.
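To make the parser's role concrete, here is a minimal Python sketch of what a tool-call parser does in general. It is not the vLLM or Triton implementation. It assumes the model emits a tool call as a single JSON object with a `name` key and a `parameters` (or `arguments`) key, which is an assumption about the output format, and `parse_tool_call` is a hypothetical helper name.

```python
import json
from typing import Any, Dict, Optional


def parse_tool_call(model_output: str) -> Optional[Dict[str, Any]]:
    """Return {"name", "arguments"} if the text is a JSON tool call, else None."""
    try:
        payload = json.loads(model_output.strip())
    except json.JSONDecodeError:
        return None  # ordinary text answer, not a tool call
    if not isinstance(payload, dict):
        return None
    name = payload.get("name")
    # Assumption: arguments arrive under "parameters" or "arguments".
    args = payload.get("parameters", payload.get("arguments"))
    if not isinstance(name, str) or not isinstance(args, dict):
        return None
    # OpenAI-style responses carry the arguments as a JSON-encoded string.
    return {"name": name, "arguments": json.dumps(args)}
```

A caller would try the parser on each completion and fall back to returning the raw text as a normal assistant message when it returns `None`.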

…server#7969) Add a shutdown timer to the gRPC endpoint for both infer and streaming infer requests. In-flight requests will be allowed to complete before shutdown, and new requests made after shutdown has started will be rejected.
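The behavior described here (drain in-flight work, reject new work, give up after a timeout) can be sketched in a few lines of Python. This is only an illustration of the pattern, not the gRPC endpoint's actual code; the class `ShutdownGate` and its method names are hypothetical.

```python
import threading


class ShutdownGate:
    """Tracks in-flight requests and rejects new ones once shutdown begins."""

    def __init__(self) -> None:
        self._cond = threading.Condition()
        self._inflight = 0
        self._shutting_down = False

    def try_enter(self) -> bool:
        """Admit a request unless shutdown has started."""
        with self._cond:
            if self._shutting_down:
                return False
            self._inflight += 1
            return True

    def leave(self) -> None:
        """Mark one admitted request as finished."""
        with self._cond:
            self._inflight -= 1
            if self._inflight == 0:
                self._cond.notify_all()

    def shutdown(self, timeout_s: float) -> bool:
        """Stop admitting requests; return True if in-flight work drained in time."""
        with self._cond:
            self._shutting_down = True
            return self._cond.wait_for(
                lambda: self._inflight == 0, timeout=timeout_s
            )
```

Each request handler would call `try_enter()` first and return an "unavailable" style error when it yields `False`, then call `leave()` in a `finally` block once the response is sent.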

This change: Creates a new L0_torch_aoti test suite. Adds complex Torch AOTI model generation to qa/common/gen_qa_models.py. Cleans up existing AOTI model generation in qa/common/gen_qa_models.py. Enables torchvision AOTI model generation in qa/common/gen_qa_model_repository.

nv-tusharma and others added 30 commits March 13, 2025 10:26

[test]: Fix L0_batch_custom - missing CMake variables that could resu…

b534c7c

…lt in defaulting to the wrong main branch (triton-inference-server#8069)

[test]: L0_batch_custom & L0_client_build_variants: Set default value…

45d1fb4

…s for triton repos to pass in to cmake (triton-inference-server#8072)

feat: ORCA Format KV Cache Utilization in Inference Response Header (t…

928fd7e

…riton-inference-server#7839)

test: Add test for ORCA (triton-inference-server#8009)

422527d

Remove extra file from ORCA commit (triton-inference-server#8075)

6afaed7

docs: rename userguide AI Agents section to Features | add spec decod…

03dbf31

…ing to Features (triton-inference-server#8099)

docs: change server README to only load densenet_onnx since tf model …

fd19783

…cannot load after 25.03 (triton-inference-server#8089)

feat: Configurable grpc infer thread count (triton-inference-server#8061

5fd6bc4

)

fix: Fix gRPC cancellation race condition (triton-inference-server#8078)

42811e0

test: Add tests cancelling BLS decoupled request in Python backend (t…

6c6df11

…riton-inference-server#8097)

vLLM backend SBSA build (triton-inference-server#8142)

ef52c84

fix: Fix segfaults in tracing mode after long run (triton-inference-s…

25baa7b

…erver#8144)

ci: Removed obsolete lib: libnvToolsExt.so (triton-inference-server#8146

1e7ba25

)

fix: Update REAMDE.md (triton-inference-server#8148)

27f0410

feat: Add multi-LoRA support to OpenAI frontend (triton-inference-ser…

5eb09ce

…ver#8038)

ci: Remove unsupported PA tests from server repo and CI (triton-infer…

20b8dff

…ence-server#8158)

test: Add config parameter "execution_context_allocation_strategy" to…

901b0e9

… TensorRT backend (triton-inference-server#8150)

TPRD-1425: Excluding Triton Model Analyzer from build (triton-inferen…

7dd5810

…ce-server#8159)

Fixes for 25.04 release L0_grpc_* and L0_http_* tests (triton-inferen…

1f9787c

…ce-server#8152)

test: Input batch size overflow vulnerability (triton-inference-serve…

ff4bd4e

…r#8165)

Adding additional output to build process (triton-inference-server#8175)

d308c25

build: Integrate to use PA and GAP assets if available (triton-infere…

744995f

…nce-server#8155)

build: Update ARG in Dockerfile.sdk (triton-inference-server#8179)

02706fd

Fix: Update handling of shared mem integer values (triton-inference-s…

8d62bd8

…erver#8170) Added an additional check to prevent the value of byte_size and offset used in a request from exceeding the bounds of shared memory.
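The kind of check this commit describes can be pictured with a minimal Python sketch. It is not Triton's code; the function name `region_in_bounds` and its parameters are hypothetical. The point it illustrates is comparing `byte_size` against the space remaining after `offset`, instead of adding the two values, which is the form that stays safe with fixed-width integers in C or C++.

```python
def region_in_bounds(offset: int, byte_size: int, region_size: int) -> bool:
    """Return True if [offset, offset + byte_size) fits in region_size bytes."""
    if offset < 0 or byte_size < 0:
        return False
    if offset > region_size:
        return False
    # Subtracting is safe here because offset <= region_size was just checked;
    # offset + byte_size could wrap around in a fixed-width integer type.
    return byte_size <= region_size - offset
```

A request whose `offset` and `byte_size` fail this test would be rejected before any read or write touches the shared memory region.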

fix: Add HTTP JSON parsing recursion depth limit (triton-inference-se…

f46557b

…rver#8172)

test: Add backend_api_test to test backend APIs (triton-inference-ser…

58ccf51

…ver#8185)

Update default branch post-25.04 (triton-inference-server#8188)

4fc0a5e

Co-authored-by: Dmitry Tokarev <dtokarev@nvidia.com> Co-authored-by: richardhuo-nv <rihuo@nvidia.com>

yinggeh and others added 30 commits April 20, 2026 11:27

test: Safely handle filesystem exception (triton-inference-server#8736)

9144042

fix: Address SonarQube issues - clean up container files (triton-infe…

05c2180

…rence-server#8753)

test: Add validation to reject duplicate output names in HTTP and gRP…

5fd7a93

…C inference requests (triton-inference-server#8741)

fix: Avoid Reusing Closed File Descriptor (triton-inference-server#8733)

0806295

test: Add HTTP test for deep JSON in repository index requests (trito…

4525088

…n-inference-server#8745)

test: Fix various vLLM tests (triton-inference-server#8756)

3ff5959

post: Advance main to 26.05dev (triton-inference-server#8760)

8cce7bb

Co-authored-by: Yingge He <157551214+yinggeh@users.noreply.github.com>

fix: tag tritonfrontend wheel with arch-specific platform tag (triton…

f160200

…-inference-server#8761)

fix: Reject requests if parameters and forward headers contain Triton…

69987b7

… reserved parameter keys (triton-inference-server#8763)

fix: Pre-allocate serialized buffer for gRPC BYTES input (triton-infe…

2f65837

…rence-server#8769)

fix: Prevent memory retention on failed compressed HTTP requests (tri…

669cef0

…ton-inference-server#8764)

fix: Cap chunked HTTP request chunk count at 65536 to prevent memory …

13480cb

…exhaustion (triton-inference-server#8770)

chore(version): Update development version 2.70.0 / 26.06 (triton-inf…

c7a1312

…erence-server#8777)

fix(qa): ommit plugin creation if TensorRT branch is missed. (triton-…

1e69d88

…inference-server#8783)

test: Align QA BF16 with ml_dtypes and generate ONNX BF16 models (tri…

665030a

…ton-inference-server#8782)

test: Add C++ gRPC cancellation tests to L0_request_cancellation (tri…

a706aed

…ton-inference-server#8775)

fix: Verify RE2::FullMatch return value in Sagemaker server (triton-i…

c985be3

…nference-server#8791)

docs: Officially drop Windows-related documentation (triton-inference…

c8e2f02

…-server#8792)

fix: Replace std:atoi with std:stoi (triton-inference-server#8794)

2d64d2d

fix: ignore SIGPIPE to prevent server crash on S3 idle connection tim…

e24fd86

…eout (triton-inference-server#8768)

test: Special handling for ONNX bf16 models in tests (triton-inferenc…

9082a24

…e-server#8800)

feat: Add HTTP request body size limit to OpenAI frontend (triton-inf…

8cb2b77

…erence-server#8787)

test: Verify EXECUTION_ENV_PATH archive entries stay within the mod…

e520f8c

…el directory (triton-inference-server#8793)

post: Update default branch post-26.05 (triton-inference-server#8804)

258a8cb

Co-authored-by: J Wyman <jwyman@nvidia.com> Co-authored-by: Yingge He <157551214+yinggeh@users.noreply.github.com>

Remove Windows server build support (triton-inference-server#8812)

6328853

fix: Remove all_special_tokens_extended for transformers v5 compati…

c84bd83

…bility in OpenAI frontend (triton-inference-server#8817)

build: normalize install tree ownership and permissions (triton-infer…

5dfefde

…ence-server#8803)

Initial commit

c3661e8

Added DB logic and new Dockerfile

797ec68

Shantanu1058 wants to merge 228 commits into triton-inference-server:r25.03 from Shantanu1058:r25.03_shantanu

20 participants
