
R25.03 shantanu#8820

Open
Shantanu1058 wants to merge 228 commits into triton-inference-server:r25.03 from Shantanu1058:r25.03_shantanu


@Shantanu1058


Thanks for submitting a PR to Triton!
Please go to the Preview tab above this description box and select the appropriate sub-template:

If you already created the PR, please replace this message with one of

and fill it out.

nv-tusharma and others added 30 commits March 13, 2025 10:26
…8130)

Co-authored-by: BenjaminBraunDev <benjaminbraun@google.com>
Co-authored-by: Kyle McGill <kmcgill@nvidia.com>
Co-authored-by: Ziqi Fan <ziqif@nvidia.com>
Co-authored-by: Yingge He <yinggeh@nvidia.com>
Co-authored-by: Yingge He <157551214+yinggeh@users.noreply.github.com>
Co-authored-by: Kris Hung <krish@nvidia.com>
Co-authored-by: richardhuo-nv <rihuo@nvidia.com>
Co-authored-by: Tanmay Verma <tanmay2592@gmail.com>
Co-authored-by: Olga Andreeva <124622579+oandreeva-nv@users.noreply.github.com>
Co-authored-by: Indrajit Bhosale <iamindrajitb@gmail.com>
…erver#8134)

Adds tool-calling parser implementations to the OpenAI frontend. The available parsers are llama3 and mistral, and most of the implementation comes from vLLM. Use the --tool-call-parser argument to select a parser.
Adds the --chat-template {chat template file path} argument so that users can supply a custom template to better tune the prompt for tool calling.
Integrates the guided decoding backend with tool calling to enable named tool calling and required tool calling.
See the README.md changes for more details.

All changes in python/openai/openai_frontend/engine/utils/tool_call_parsers are from vLLM, with some minor compatibility changes.
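
For readers unfamiliar with what a tool-call parser does, here is a minimal sketch. It is not the vLLM parser or the code in this PR. It assumes the model emits a single JSON object with a `name` key and a `parameters` (or `arguments`) key, and the function name `parse_json_tool_call` is hypothetical.

```python
import json
from typing import Optional


def parse_json_tool_call(model_output: str) -> Optional[dict]:
    """Turn raw model text into an OpenAI-style tool call, or None.

    Assumption: the whole output is one JSON object such as
    {"name": "...", "parameters": {...}}. Real parsers handle
    streaming, multiple calls, and model-specific markers.
    """
    try:
        payload = json.loads(model_output.strip())
    except json.JSONDecodeError:
        # Plain text: treat it as a normal chat response.
        return None

    if not isinstance(payload, dict) or "name" not in payload:
        return None

    # Accept either key for the arguments (an assumption of this sketch).
    arguments = payload.get("parameters", payload.get("arguments", {}))

    return {
        "type": "function",
        "function": {
            "name": payload["name"],
            # The OpenAI chat API carries arguments as a JSON string.
            "arguments": json.dumps(arguments),
        },
    }
```

A response that is not valid JSON returns `None`, so the caller can fall back to returning the text as ordinary chat content.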
…server#7969)

Adds a shutdown timer to the gRPC endpoint for both infer and streaming infer requests. In-flight requests are allowed to complete before the server shuts down, and new requests made after shutdown has started are rejected.
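
The behavior described above can be sketched roughly as follows. This is an illustration in Python, not the server's gRPC code. The class `ShutdownGate` and its methods are hypothetical names, and the timeout semantics are an assumption.

```python
import threading


class ShutdownGate:
    """Tracks in-flight requests and rejects new ones once shutdown starts."""

    def __init__(self) -> None:
        self._cond = threading.Condition()
        self._inflight = 0
        self._shutting_down = False

    def try_begin(self) -> bool:
        """Register a new request; return False if shutdown has started."""
        with self._cond:
            if self._shutting_down:
                return False
            self._inflight += 1
            return True

    def end(self) -> None:
        """Mark one request as finished and wake a waiting shutdown."""
        with self._cond:
            self._inflight -= 1
            if self._inflight == 0:
                self._cond.notify_all()

    def shutdown(self, timeout_s: float) -> bool:
        """Stop accepting requests, then wait for in-flight ones to drain.

        Returns True if all requests finished before the timer expired.
        """
        with self._cond:
            self._shutting_down = True
            return self._cond.wait_for(
                lambda: self._inflight == 0, timeout=timeout_s
            )
```

A request handler would call `try_begin()` first, return an "unavailable" style error if it gets `False`, and call `end()` in a `finally` block once the request completes.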
…erver#8170)

Added a check to prevent the byte_size and offset values used in a request from exceeding the bounds of the shared memory region.
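
A minimal sketch of this kind of bounds check, written in Python for illustration only. The function name `shm_range_in_bounds` and the `region_size` parameter are assumptions, not names from the server code.

```python
def shm_range_in_bounds(offset: int, byte_size: int, region_size: int) -> bool:
    """Return True if [offset, offset + byte_size) lies inside the region."""
    if offset < 0 or byte_size < 0 or region_size < 0:
        return False
    if offset > region_size:
        return False
    # Compare against the remaining space instead of computing
    # offset + byte_size, which could overflow with fixed-width
    # integers in a C or C++ implementation.
    return byte_size <= region_size - offset
```

A request whose range fails the check would be rejected before any read from or write to the shared memory region.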
Co-authored-by: Dmitry Tokarev <dtokarev@nvidia.com>
Co-authored-by: richardhuo-nv <rihuo@nvidia.com>
yinggeh and others added 30 commits April 20, 2026 11:27
Co-authored-by: Yingge He <157551214+yinggeh@users.noreply.github.com>
This change:

Creates a new L0_torch_aoti test suite.
Adds complex Torch AOTI model generation to qa/common/gen_qa_models.py.
Cleans up existing AOTI model generation in qa/common/gen_qa_models.py.
Enables torchvision AOTI model generation in qa/common/gen_qa_model_repository.
Co-authored-by: J Wyman <jwyman@nvidia.com>
Co-authored-by: Yingge He <157551214+yinggeh@users.noreply.github.com>