fix: ExecuTorch export of TRT engines >2 GB by storing engine as Tensor attribute#4277
Open
shoumikhin wants to merge 1 commit into
Open
fix: ExecuTorch export of TRT engines >2 GB by storing engine as Tensor attribute#4277shoumikhin wants to merge 1 commit into
shoumikhin wants to merge 1 commit into
Conversation
bde257f to
0e4ea40
Compare
…or attribute
`torch_tensorrt.save(..., output_format="executorch", retrace=False)` fails on
models whose serialized TensorRT engine is larger than ~2 GB. The error:
```
SyntaxError: unterminated string literal (detected at line 6) (<eval_with_key>.N, line 6)
```
The threshold isn't TensorRT's: it's CPython's. The Python tokenizer cannot
parse string literals larger than `INT32_MAX` (~2 GiB) — see
`Parser/tokenizer.c`. When FX compiles a graph whose source contains a
giant string literal, `compile()` raises this `SyntaxError`. Smaller
engines (a few hundred MB) export fine; the issue is invisible until you
cross the limit.
```python
import torch
import torch_tensorrt
class Big(torch.nn.Module):
def __init__(self):
super().__init__()
# Big enough that the FP32 TRT engine exceeds ~2 GB.
self.linear = torch.nn.Linear(32768, 32768)
def forward(self, x):
return self.linear(x)
model = Big().cuda().eval()
example = (torch.randn(1, 32768, device="cuda"),)
trt_program = torch_tensorrt.dynamo.compile(
torch.export.export(model, example),
inputs=list(example),
enabled_precisions={torch.float32},
)
torch_tensorrt.save(
trt_program,
"model.pte",
output_format="executorch",
retrace=False,
arg_inputs=list(example),
)
```
`py/torch_tensorrt/_compile.py::_replace_execute_engine_for_executorch`
base64-encodes the engine bytes into a Python `str` and passes that `str`
as a **positional argument** to
`torch.ops.tensorrt.no_op_placeholder_for_execute_engine`:
```python
engine_info[ENGINE_IDX] = base64.b64encode(engine_bytes).decode("utf-8")
...
no_op_node = gm.graph.call_function(no_op, (inputs_arg, *engine_info_strs))
```
When `gm.recompile()` then re-emits the FX graph as Python source, that
literal string lands directly in the source as `'ZnRydA…'`. The bigger
the engine, the bigger the literal. Past ~2 GB the source no longer
tokenizes and `exec(compile(src, …))` raises `SyntaxError`.
The op schema was also declared with `serialized_engine: str`
(`py/torch_tensorrt/dynamo/runtime/meta_ops/register_meta_ops.py`), which
encouraged the inline-literal codegen path.
Treat the engine the same way FX treats any other large binary constant:
store it as a tensor attribute on the graph module and reference it from
the call site via a `get_attr` node. The literal that lands in source is
then `self._trt_engine_0`, not 2 GB of base64.
Three source files change:
1. **`py/torch_tensorrt/dynamo/runtime/meta_ops/register_meta_ops.py`**
Change `serialized_engine: str` → `serialized_engine: torch.Tensor` in
both the `@torch.library.custom_op` and `@torch.library.register_fake`
declarations of `tensorrt::no_op_placeholder_for_execute_engine`.
2. **`py/torch_tensorrt/_compile.py::_replace_execute_engine_for_executorch`**
Instead of base64-encoding the engine bytes into a string positional
arg, wrap them in a `torch.uint8` 1-D tensor, register them as a
persistent buffer on the graph module (`_trt_engine_<i>`), create a
`get_attr` FX node pointing at that buffer, and place the `get_attr`
node at the `ENGINE_IDX` slot of the no_op call. Every other slot
stays `str` — they are small metadata. The now-orphan original engine
attribute is `delattr`'d from the module so it doesn't double-serialize
into `state_dict` alongside the new buffer.
3. **`py/torch_tensorrt/executorch/backend.py`**
When reading the `tensorrt::no_op_placeholder_for_execute_engine` node
inside the ExecuTorch backend, resolve the `get_attr` FX node at
`ENGINE_IDX` back to the underlying `torch.Tensor` and convert it to
0e4ea40 to
9f4e7b0
Compare
Author
|
Thanks @lanluo-nvidia for triaging this! Rebased onto main and switched the base branch. The merge resolved cleanly into the new TR01 blob format on main: the engine now flows in as a |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
torch_tensorrt.save(..., output_format="executorch", retrace=False)fails onmodels whose serialized TensorRT engine is larger than ~2 GB. The error:
The threshold isn't TensorRT's: it's CPython's. The Python tokenizer cannot
parse string literals larger than
INT32_MAX(~2 GiB) — seeParser/tokenizer.c. When FX compiles a graph whose source contains agiant string literal,
compile()raises thisSyntaxError. Smallerengines (a few hundred MB) export fine; the issue is invisible until you
cross the limit.
Reproduction
Root cause
py/torch_tensorrt/_compile.py::_replace_execute_engine_for_executorchbase64-encodes the engine bytes into a Python
strand passes thatstras a positional argument to
torch.ops.tensorrt.no_op_placeholder_for_execute_engine:When
gm.recompile()then re-emits the FX graph as Python source, thatliteral string lands directly in the source as
'ZnRydA…'. The biggerthe engine, the bigger the literal. Past ~2 GB the source no longer
tokenizes and
exec(compile(src, …))raisesSyntaxError.The op schema was also declared with
serialized_engine: str(
py/torch_tensorrt/dynamo/runtime/meta_ops/register_meta_ops.py), whichencouraged the inline-literal codegen path.
Fix
Treat the engine the same way FX treats any other large binary constant:
store it as a tensor attribute on the graph module and reference it from
the call site via a
get_attrnode. The literal that lands in source isthen
self._trt_engine_0, not 2 GB of base64.The op schema also changes from
serialized_engine: strtoserialized_engine: torch.Tensorso the rewriter and the ExecuTorchbackend agree on what flows through the
ENGINE_IDXslot. Every otherslot stays
str— they are small metadata.Why this approach (and tradeoffs considered)
The original
serialized_engine: strschema is not a careless choice —there are real reasons one might prefer all-primitive op signatures over
Tensor args at constant positions. Notable tradeoffs:
torch.library.custom_opwith onlystr/int/floatargs istrivially serializable by any consumer of the schema, and avoids any
ambiguity about whether a Tensor arg represents a runtime input or a
graph-baked constant. Switching one slot to
torch.Tensormeans theExecuTorch backend (and any other consumer) has to special-case the
resolution from
get_attrFX node → underlying buffer.the same engine, str-equality is O(n) but trivial. Tensor-by-content
equality is more involved.
stris host-only bydefinition. A uint8 constant tensor could in principle be picked up by
a memory-planning pass and staged onto an accelerator, which would
waste memory for an inert payload.
The fix keeps the tensor on CPU explicitly (it's created via
torch.frombufferonbytes) and the backend always resolves it backto host
bytesbefore serialization, so the placement risk above iscontained.
The portability advantage of all-primitive schemas is real, but it does
not buy anything for an op whose intended consumer is the in-tree
TensorRTBackend. And it does not survive contact with engines >2 GB,which the inline-literal codegen path fundamentally cannot encode no
matter what we do at the producer side. The Tensor-attribute approach
is the same pattern FX, AOTI, and ExecuTorch already use for every
other large binary constant in a graph; this PR brings the engine
carrier in line with that convention.
Testing
End-to-end verified on a large vision model whose FP32 TRT engine is
~2.2 GB: export to
.ptenow succeeds with noSyntaxError, and theExecuTorch runtime loads and runs the resulting
.pteagainstreference outputs without divergence. The previously-failing repro
(
torch_tensorrt.save(..., output_format="executorch", retrace=False)from above) now completes cleanly. ExecuTorch backend unit tests
updated to use the new Tensor-based schema.
Backwards compatibility
This breaks
.ptefiles produced by the old exporter. The opschema for
tensorrt::no_op_placeholder_for_execute_enginechangedfrom
serialized_engine: strtoserialized_engine: torch.Tensor, sothe new runtime cannot load
.ptefiles that were serialized with theprevious string-based schema.
Action: regenerate any cached
.ptefiles with the new exporterbefore loading them with the new runtime. The TRT engine itself has
not changed format — only how it's carried through the FX graph and
the ExecuTorch delegate boundary.
I did not version-tag the op signature to keep both schemas alive in
parallel: it would add a permanent compatibility branch to the runtime
for a payload format nobody should keep around (the old path could not
ship engines >2 GB at all, so the volume of stale
.ptefiles usingthe old schema should be small).