R25.03 shantanu#8820
Open
Shantanu1058 wants to merge 228 commits into
Open
Conversation
…lt in defaulting to the wrong main branch (triton-inference-server#8069)
…s for triton repos to pass in to cmake (triton-inference-server#8072)
…cannot load after 25.03 (triton-inference-server#8089)
…8130) Co-authored-by: BenjaminBraunDev <benjaminbraun@google.com> Co-authored-by: Kyle McGill <kmcgill@nvidia.com> Co-authored-by: Ziqi Fan <ziqif@nvidia.com> Co-authored-by: Yingge He <yinggeh@nvidia.com> Co-authored-by: Yingge He <157551214+yinggeh@users.noreply.github.com> Co-authored-by: Kris Hung <krish@nvidia.com> Co-authored-by: richardhuo-nv <rihuo@nvidia.com> Co-authored-by: Tanmay Verma <tanmay2592@gmail.com> Co-authored-by: Olga Andreeva <124622579+oandreeva-nv@users.noreply.github.com> Co-authored-by: Indrajit Bhosale <iamindrajitb@gmail.com>
…erver#8134) Add the tool calling parsers implementation to openai frontend, the available parsers are llama3 and mistral. Most of the implementation is from the vllm. A user could use the --tool-call-parser arguments to specify the tool parser. Add the --chat-template {chat template file path} argument to allow the user use the customized template to better tune the prompt for tool calling. Add the guided decoding backend integration with the tool calling to enable the named tool calling and required tool calling functionalities. Please check more detail in the change of README.md All changes in python/openai/openai_frontend/engine/utils/tool_call_parsers are from the vLLM with some minor compatibility changes.
…server#7969) Add shutdown timer to the gRPC endpoint for both infer and streaming infer requests. Inflight requests will be allowed to complete before and new requests made after shutdown has started will be rejected.
…erver#8170) Added an additional check to prevent the value of byte_size and offset used in a request from exceeding the bounds of shared memory.
Co-authored-by: Dmitry Tokarev <dtokarev@nvidia.com> Co-authored-by: richardhuo-nv <rihuo@nvidia.com>
Co-authored-by: Yingge He <157551214+yinggeh@users.noreply.github.com>
… reserved parameter keys (triton-inference-server#8763)
This change: Creates a new L0_torch_aoti test suit. Adds complex Torch AOTI model generation to qa/common/gen_qa_models.py. Cleans up existion AOTI model generation in qa/common/gen_qa_models.py. Enabled torchvision AOTI model generation in qa/common/gen_qa_model_repository.
Co-authored-by: J Wyman <jwyman@nvidia.com> Co-authored-by: Yingge He <157551214+yinggeh@users.noreply.github.com>
…bility in OpenAI frontend (triton-inference-server#8817)
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Thanks for submitting a PR to Triton!
Please go the the
Previewtab above this description box and select the appropriate sub-template:If you already created the PR, please replace this message with one of
and fill it out.