feat: Add streaming tool-call parse buffer limit to prevent excessive memory usage #8811

Open
pskiran1 wants to merge 15 commits into main from spolisetty/tri-1016-psirt-triton-openai-frontend-auto-toolpparsing-can-oom-kill

pskiran1 commented May 31, 2026

Member

What does the PR do?

The streaming tool-call parser (partial_json_parser.loads()) re-parses the full accumulated output on every chunk, so CPU and memory usage grow excessively for large tool-call arguments. This PR adds a configurable per-request buffer cap, --max-tool-call-parse-bytes, that truncates the stream gracefully when the cap is exceeded.
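
For readers skimming the thread, the idea of a per-request cap can be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: the `ToolCallParseBuffer` class and its `append` method are hypothetical names, and only the length comparison and the 16 KiB default are taken from the diff snippets quoted in the review below.

```python
from dataclasses import dataclass

# Same value as the default quoted in the review (16 KiB).
DEFAULT_MAX_TOOL_CALL_PARSE_BYTES = 16 * 1024


@dataclass
class ToolCallParseBuffer:
    """Hypothetical helper: accumulates streamed text up to a fixed cap."""

    max_bytes: int = DEFAULT_MAX_TOOL_CALL_PARSE_BYTES
    text: str = ""
    truncated: bool = False

    def append(self, delta_text: str) -> bool:
        """Add a chunk; return False once the cap would be exceeded."""
        if self.truncated:
            return False
        # Same shape as the check in the diff. Note that len() counts
        # characters, so this only approximates a byte limit for
        # non-ASCII output.
        if len(self.text) + len(delta_text) > self.max_bytes:
            self.truncated = True
            return False
        self.text += delta_text
        return True


# Tiny cap so the truncation is easy to see.
buf = ToolCallParseBuffer(max_bytes=10)
accepted = [buf.append(chunk) for chunk in ['{"a":', ' "bcd', 'efgh"}']]
print(accepted, repr(buf.text), buf.truncated)
```

Once `append` returns False, a caller would stop feeding the parser and end the stream; how the actual frontend signals truncation to the client is not shown here.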

Checklist

  • PR title reflects the change and is of format <commit_type>: <Title>
  • Changes are described in the pull request.
  • Related issues are referenced.
  • Populated github labels field
  • Added test plan and verified test passes.
  • Verified that the PR passes existing CI.
  • Verified copyright is correct on all changed files.
  • Added succinct git squash message before merging ref.
  • All template sections are filled out.
  • Optional: Additional screenshots for behavior/output changes with before/after.

Commit Type:

Check the conventional commit type box here and add the label to the github PR.

  • build
  • ci
  • docs
  • feat
  • fix
  • perf
  • refactor
  • revert
  • style
  • test

Related PRs:

Where should the reviewer start?

Test plan:

  • CI Pipeline ID: 53226753

Caveats:

Background

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

  • closes GitHub issue: #xxx

pskiran1 added the "PR: feat" (A new feature) label Jun 1, 2026
whoisj previously approved these changes Jun 4, 2026

whoisj left a comment

Contributor

LGTM. I did have a couple of non-blocking questions.

Would be good if we could get @yinggeh to review this as well, but please merge by EoD Friday even if he's not able to get a review completed by then.

self.chat_template = load_chat_template(chat_template)

if self.tool_call_parser is not None:
    print(

Contributor

Interesting, why print("[INFO] ...") instead of logger.info()?

Member Author

Currently, in openai_frontend, the root logger is not configured in main, so logging does not appear to work. I have therefore been using print statements, similar to the approach in fastapi_frontend.py.
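
As background for this exchange, the sketch below shows standard Python `logging` behaviour: with an unconfigured root logger, INFO records are dropped, and a call to `logging.basicConfig` makes them visible. It does not claim to reproduce the frontend's actual setup; the logger name is hypothetical and the in-memory stream is only there to keep the example self-contained.

```python
import io
import logging

# Hypothetical logger name, for illustration only.
logger = logging.getLogger("openai_frontend.example")

# With a default (unconfigured) root logger, the effective level is
# WARNING, so this INFO record is normally discarded.
logger.info("dropped")

# Configure the root logger once, e.g. at program start-up.
# force=True replaces any handlers that were installed earlier.
stream = io.StringIO()
logging.basicConfig(level=logging.INFO, stream=stream, force=True)

# The same call is now emitted through the root handler.
logger.info("visible")

captured = stream.getvalue()
print(captured, end="")
```

Whether a similar one-time configuration is the right fix for the frontend is for the PR author and reviewers to decide; the sketch only illustrates why `logger.info()` can appear to do nothing.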

Contributor

You mean the following lines don't work at all?

except Exception:
    logger.debug(
        "Failed to cancel inference after tool-call parse "
        "truncation (request %s)",
        request_id,
        exc_info=True,
    )

    and len(previous_text) + len(delta_text)
    > self.max_tool_call_parse_bytes
):
    print(

Contributor

better as logger.warning()?

yinggeh commented Jun 5, 2026

Contributor

Docs?

# streaming tool-call parser processes per request.
# Since the parser re-parses the entire buffer with each new chunk,
# this limit helps bound per-request CPU and memory usage.
DEFAULT_MAX_TOOL_CALL_PARSE_BYTES: int = 16 * 1024

Contributor

Why 16 KiB? Can this limit be bigger?
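
One rough way to reason about the size of the limit: if the parser re-parses the whole buffer on every chunk, as the PR description states, the total work grows roughly quadratically with the final buffer size. The sketch below only does that arithmetic. The function name is hypothetical, and the 16-character chunk size is an assumption, not a measured value from the frontend.

```python
def total_chars_reparsed(final_size: int, chunk_size: int) -> int:
    """Characters scanned in total if the full buffer is re-parsed per chunk."""
    total = 0
    buffered = 0
    while buffered < final_size:
        # Each new chunk grows the buffer, then the whole buffer is parsed.
        buffered = min(buffered + chunk_size, final_size)
        total += buffered
    return total


# Compare a few candidate limits under the assumed chunk size.
for limit in (16 * 1024, 64 * 1024, 256 * 1024):
    print(limit, total_chars_reparsed(limit, chunk_size=16))
```

Under these assumptions, quadrupling the limit multiplies the worst-case parsing work per request by about sixteen, which is the trade-off a larger default would have to accept.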

Comment thread python/openai/openai_frontend/engine/utils/tool_call_parsers/utils.py Outdated
Comment thread python/openai/openai_frontend/main.py Outdated
Comment thread python/openai/openai_frontend/main.py Outdated
Comment thread python/openai/tests/test_tool_calling.py Outdated
Comment thread python/openai/tests/test_tool_calling.py Outdated
Comment thread python/openai/tests/test_tool_calling.py
Comment thread python/openai/tests/test_tool_calling.py
Comment thread python/openai/openai_frontend/engine/triton_engine.py Outdated
…tils.py

Co-authored-by: Yingge He <157551214+yinggeh@users.noreply.github.com>

pskiran1 commented Jun 5, 2026

Member Author

Docs?

Sorry, I missed committing the README changes earlier. The documentation has now been updated.
Thank you.

pskiran1 requested a review from yinggeh June 5, 2026 17:23

Labels

PR: feat A new feature

3 participants