Dynamic shape on the C++ trace path (elementwise add) #269
Open
YWHyuk wants to merge 13 commits into
Under torch.compile(dynamic=True) the Inductor loop ranges carry sympy symbols (e.g. ks0/s52) instead of concrete ints. The tile-size heuristics did concrete-int arithmetic on those ranges and crashed with sympy "cannot determine truth value" before any MLIR was emitted.
Neutralize the tile-fit heuristics for symbolic dims: they only shave a tile to a known dim to minimize the wasted tail, which is meaningless when the dim is unknown at compile time. Skip them, keep the fixed init tile, and let the tail become a runtime remainder (masked).
- trim_large_tail: skip a dim whose range is symbolic
- get_padding_ratio: report zero padding for a symbolic dim/tile
- is_dim_dividable: raise a clear NotImplementedError for symbolic dims (the recompile-to-divisible path has no symbolic equivalent and would loop forever; index_expr/indirect indexing under dynamic shape is a later step)
- make_choices: drop a symbolic axis from the tile-grow candidates
All guards are isinstance(sympy.Expr)-gated, so the concrete-shape path is unchanged.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01SfwHCV7TaX4s9xkn8i7anG
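The failure mode and the guard described in this commit can be sketched as follows. This is a minimal illustration, not the PR's code: `padding_ratio` is a hypothetical stand-in for get_padding_ratio, whose real signature is not shown in the commit message.

```python
import sympy


def padding_ratio(dim, tile):
    """Fraction of the padded extent wasted on the tail (hypothetical helper).

    Returns 0.0 when either operand is symbolic, mirroring the
    "report zero padding for a symbolic dim/tile" rule.
    """
    # Guard first: branching on a symbolic comparison is what raises
    # "cannot determine truth value".
    if isinstance(dim, sympy.Expr) or isinstance(tile, sympy.Expr):
        return 0.0
    padded = -(-dim // tile) * tile  # round dim up to a multiple of tile
    return (padded - dim) / padded


s0 = sympy.Symbol("s0", integer=True, positive=True)

# Without a guard, using a symbolic relational as a condition fails.
try:
    if s0 > 128:
        pass
    raised = False
except TypeError:
    raised = True

static_ratio = padding_ratio(1000, 512)   # 24 wasted elements out of 1024
symbolic_ratio = padding_ratio(s0, 512)   # guarded, no sympy arithmetic
```

The concrete path never touches sympy, which is why a guard of this shape leaves static kernels unchanged.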
Make the MLIR backend emit valid IR for torch.compile(dynamic=True). A
size symbol (e.g. ks0) now becomes a usable kernel argument and the loop
over the dynamic dim carries the symbol as a runtime bound:
- mlir_argdefs: a size-symbol arg had no buffer_types entry (it is not a
buffer/graph_input/constant), so it KeyError'd. Key it by name (which
is also the host-side SymInt the wrapper passes) and describe it as a
scalar int.
- get_mlir_shape: a symbolic numel becomes a dynamic memref dim ("?")
instead of being stringified into an invalid type.
- LoopLevel: a symbolic upper bound is emitted as an index SSA value
(%<name>_bound); a non-symbol symbolic expr raises NotImplementedError.
- codegen_loops: a prologue at the function top level loads each size arg
(memref<1xi64>) and index_casts it to %<name>_bound, a valid affine
symbol usable as the loop bound.
The emitted IR parses and lowers through the whole standard pipeline
(decompose/vlane -> fine-grained/vcix -> standard lowering) for a dynamic
elementwise add. Static kernels are unchanged (every path gates on
isinstance(.., sympy.Expr)).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01SfwHCV7TaX4s9xkn8i7anG
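As a rough sketch of the type and prologue emission described above (assuming a 1-D f32 buffer), the two helpers below render a memref type and the size-arg prologue as text. The helper names and the intermediate SSA names `%c0` and `%<name>_val` are hypothetical; only `%<name>_bound`, `memref<1xi64>`, and the load-then-index_cast sequence come from the commit message.

```python
import sympy


def memref_type(numel, elem="f32"):
    """Render a 1-D memref type; a symbolic extent becomes a dynamic dim."""
    if isinstance(numel, sympy.Expr) and not numel.is_number:
        return f"memref<?x{elem}>"
    return f"memref<{int(numel)}x{elem}>"


def size_arg_prologue(name):
    """Top-of-function lines that turn a size arg into an index bound."""
    return [
        "%c0 = arith.constant 0 : index",
        f"%{name}_val = memref.load %{name}[%c0] : memref<1xi64>",
        f"%{name}_bound = arith.index_cast %{name}_val : i64 to index",
    ]


ks0 = sympy.Symbol("ks0", integer=True, positive=True)
dynamic_ty = memref_type(ks0)
static_ty = memref_type(1024)
prologue = size_arg_prologue("ks0")
```

Defining the bound at the function top level is what lets it act as an affine symbol for the loop that follows.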
torch.compile(dynamic=True) puts sympy size symbols (e.g. s52) in the
arg_attributes shape/stride fields. define_kernel emitted that list as a
module-scope Python literal in the generated wrapper, so a bare s52 was
undefined at import time and raised NameError before call() ran.
Recursively stringify sympy expressions in the meta before emitting it
('s52'). The real extent already reaches the kernel as a runtime arg (the
wrapper's call() computes s52 from the input tensor shape and passes it),
so the compile-time descriptor only needs to be import-safe and
shape-agnostic. No-op for static kernels (their meta has no sympy).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01SfwHCV7TaX4s9xkn8i7anG
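A minimal sketch of the recursive stringification, assuming the meta is a nest of lists, tuples, and dicts. The function name `stringify_sympy` and the sample meta layout are illustrative, not taken from the PR.

```python
import ast

import sympy


def stringify_sympy(obj):
    """Replace every sympy object in a nested structure with its str()."""
    if isinstance(obj, sympy.Basic):
        return str(obj)
    if isinstance(obj, dict):
        return {k: stringify_sympy(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return type(obj)(stringify_sympy(v) for v in obj)
    return obj  # plain ints, strings, etc. pass through untouched


s52 = sympy.Symbol("s52")
meta = [["arg0", [s52, 4], [4, 1]]]   # hypothetical shape/stride descriptor
safe = stringify_sympy(meta)

# The emitted literal now round-trips without any name being defined.
emitted = repr(safe)
reloaded = ast.literal_eval(emitted)
```

A meta without sympy objects comes back equal to its input, matching the "no-op for static kernels" claim.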
The functional (Spike) validation binary is generated in MLIRCodeCache.load at compile time with the tensor extent baked into the host buffer sizes (mlir_caller_codegen allocates each buffer from arg_size). Under torch.compile(dynamic=True) the extent is a runtime value (memref<?>), so there is no concrete size to instantiate the fixed-shape validation binary -- generate_args_define would size a buffer from the symbol and fail.
Skip the functional-validation block when the kernel MLIR carries a dynamic memref dim (same effect as pytorchsim_functional_mode=off). The kernel is still compiled shape-agnostically and timed via the gem5/TOG + trace path; correctness of a dynamic kernel is validated at its concrete instantiation, not at compile time.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01SfwHCV7TaX4s9xkn8i7anG
gem5 measures per-tile compute cost, which is shape-invariant. Add pin_loops_to_one_tile (cycle_table.py): a general MLIR-bindings rewrite that forces every affine.for which would iterate more than once to run a single tile (upper bound -> the loop step). It handles both a constant multi-iteration bound and a symbolic (runtime-extent) bound, so the cpp TOG cycle sampling can use it for static and dynamic kernels alike.
Wire it into MLIRCodeCache.load for dynamic shape: run the legacy cycle machinery (run_tog -> _custom.mlir -> cycle binary -> gem5) on a one-tile COPY of the post-vcix IR, while the symbolic _postvcix.mlir is kept for the producer .so / cycle_table. The sampling host buffers are sized to one tile (_concretize_attrs_for_sampling), and the legacy ONNX TOG output (generate_tile_graph) is skipped for dynamic (it enumerates tiles statically and is unused when the trace path is the default sim path). dump_metadata now also tolerates a scalar size argument.
Static kernels are unchanged (every new branch gates on a dynamic memref dim). Wiring the static cycle sampling through pin_loops_to_one_tile too is the intended next step but needs the sampling decoupled from run_tog (which also builds the legacy full TOG).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01SfwHCV7TaX4s9xkn8i7anG
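The pinning rule can be illustrated on a toy loop model. The real pass works on affine.for ops through the MLIR Python bindings; the `Loop` dataclass and `pin_to_one_tile` below are hypothetical stand-ins that only mirror the rule stated above, assuming a symbolic bound is represented by its SSA name.

```python
from dataclasses import dataclass, replace
from typing import Union


@dataclass(frozen=True)
class Loop:
    lower: int
    upper: Union[int, str]  # int = constant bound, str = runtime symbol
    step: int


def pin_to_one_tile(loop):
    """Return a copy whose bound covers exactly one step if it could run more."""
    one_tile = loop.lower + loop.step  # equals the step when lower == 0
    if isinstance(loop.upper, str):
        # Symbolic bound: trip count unknown at compile time, so pin it.
        return replace(loop, upper=one_tile)
    trips = max(0, -(-(loop.upper - loop.lower) // loop.step))
    return replace(loop, upper=one_tile) if trips > 1 else loop


symbolic = Loop(0, "%ks0_bound", 512)
constant = Loop(0, 2048, 512)
single = Loop(0, 512, 512)

pinned = [pin_to_one_tile(l) for l in (symbolic, constant, single)]
```

Because the rewrite returns copies, the symbolic loop stays available for the producer, in the same spirit as sampling on a one-tile copy of the IR.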
Make the C++ trace producer .so build for a dynamic (runtime-extent) kernel, so its loop bounds are read at runtime from shape_args.
- build_tog._build gains serialize=False: build_skeleton only needs the builder side effects (loop/compute/DMA nodes), not the serialized TOG string, whose display() formats a constant loop_end -- None for a dynamic loop. The bound stays on the affine.for in the IR.
- lower_to_emitc._rewrite_signature: an original kernel arg still used after build_skeleton's DCE is a size symbol (its memref.load feeds a loop bound; tensors are referenced by name in togsim.dma attrs and DCE to unused). Re-source each such load from shape_args[k] via emitc.subscript (k = the size arg's order), then drop the arg. The producer's loop then reads the runtime extent: for (iv=0; iv<shape_args[k]; ...).
Verified: a dynamic elementwise add builds one trace.so whose recorded trace scales with shape_args (1024 -> 14 insts, 2048 -> 28).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01SfwHCV7TaX4s9xkn8i7anG
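A toy Python model of the producer loop shows why the trace length scales with shape_args. The real producer is generated C++; the tile extent of 512 and the 7 instructions per tile are assumptions inferred from the reported numbers (1024 giving 2 tiles and 14 instructions), not values stated by the PR.

```python
TILE = 512           # assumed loop step, inferred from N=1024 -> 2 tiles
INSTS_PER_TILE = 7   # assumed, inferred from 14 instructions for 2 tiles


def run_producer(shape_args, k=0):
    """Record one group of trace entries per tile; the bound is read at runtime."""
    trace = []
    iv = 0
    # Mirrors: for (iv = 0; iv < shape_args[k]; iv += step)
    while iv < shape_args[k]:
        for slot in range(INSTS_PER_TILE):
            trace.append((iv, slot))
        iv += TILE
    return trace


trace_1024 = run_producer([1024])
trace_2048 = run_producer([2048])
```

The same function body serves both sizes; only the value found at `shape_args[k]` changes.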
…te file
The dynamic trace producer reads its loop bounds from shape_args; feed them at simulation time through the existing per-kernel attribute YAML (the file that already carries address_info), not a bespoke channel.
- write_kernel_attribute_file: a scalar input (a dynamic size arg, e.g. s52) is not a tensor address -- collect such scalars into a shape_args sequence in the YAML, in arg order (== the producer's shape_args[k]).
- run_standalone: pass --attribute <yaml> alongside --trace_so so the trace path receives it, the same file the legacy path passes via the models_list command.
- main.cc: add --attribute; in the trace branch load the YAML and fill shape_args from its shape_args sequence, passed to run_producer (was nullptr,0).
- run_kernel_simulation: skip the Spike functional run for a dynamic kernel (its fixed-shape validation binary is intentionally not built).
Verified end to end: one compiled add runs at 1024 (183 cycles) and 2048 (261 cycles) from the same trace.so, driven by shape_args.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01SfwHCV7TaX4s9xkn8i7anG
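The scalar collection step might look like the sketch below, which builds the attribute mapping as a plain dict before it would be serialized. Only the `address_info` and `shape_args` keys come from the commit message; the function name, the input representation, and the addresses are hypothetical.

```python
def build_kernel_attributes(args):
    """Split kernel inputs into tensor addresses and runtime size scalars.

    args: list of (name, value) in kernel-argument order, where a tensor is
    a dict with an "address" entry and a size arg is a plain int.
    """
    address_info = {}
    shape_args = []
    for name, value in args:
        if isinstance(value, int):
            # A dynamic size arg is not a tensor address; keep arg order so
            # position k here matches the producer's shape_args[k].
            shape_args.append(value)
        else:
            address_info[name] = value["address"]
    return {"address_info": address_info, "shape_args": shape_args}


attrs = build_kernel_attributes([
    ("arg0", {"address": 0x1000}),
    ("arg1", {"address": 0x2000}),
    ("s52", 2048),
    ("out0", {"address": 0x3000}),
])
```

Reusing the existing file means the simulator needs one extra field rather than a second input channel.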
…nary)
Produce correct output VALUES for a dynamic kernel: the Spike validation binary is now shape-agnostic and reads the runtime extent from the size-arg buffer, the same way the trace producer reads shape_args.
- Simulator.dump_args/write_arg: a size symbol arg (MLIR_ARGS_VAR) is a kernel input -- write its runtime value (int64) to a .raw so the kernel can load its loop bound. This is Spike's existing per-arg .raw channel (used for tensors); the size arg was just being skipped.
- mlir_caller_codegen: the validation binary loads each size arg first into N_<sym>, then mallocs the tensor buffers and builds the memref descriptors from N at runtime (not the compile-time extent). argv slots are assigned in arg order (matching dump_args). A numel that is a size SYMBOL becomes N_<sym>; a concrete numel (including a stringified sympy.Integer like '128') stays a literal.
- extension_codecache: build + run the validation binary for dynamic too.
Verified: one compiled add returns correct values at 1024 / 2048 / 1536 and a 1D tail size 1000 from the same binary. Tail/lane padding for >1D shapes is a separate follow-up.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01SfwHCV7TaX4s9xkn8i7anG
One torch.compile(dynamic=True) add, run at 1024 and 2048 from a single compiled trace producer .so, checking the output values (allclose) at each size. Sizes are tile multiples so no tail padding is needed.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01SfwHCV7TaX4s9xkn8i7anG
Small robustness cleanups from the PR review (no behavior change):
- Add MLIRKernelArgs.is_mlir_arg_var and use it where the MLIR_ARGS_VAR mask was open-coded (mlir_caller_codegen._is_var, Simulator.dump_args).
- Detect a dynamic kernel in MLIRCodeCache.load via that flag (any size-symbol arg) instead of sniffing "memref<?" in the IR text.
- Drop a dead shape_args local in run_kernel_simulation: it was left over from an earlier run_spike gate; the runtime extents reach the simulator via the attribute YAML (write_kernel_attribute_file), not from there.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01SfwHCV7TaX4s9xkn8i7anG
The dynamic-shape tile/bound paths each had their own ad hoc guard for a symbolic dimension (isinstance sympy.Expr / and-not-is_number variants). Add one predicate, mlir_common.is_symbolic_dim(x) = a sympy.Expr that is not a compile-time constant, and use it at every site: is_dim_dividable, trim_large_tail, get_padding_ratio, LoopLevel._bound_str, and make_choices.
No behavior change (verified static 128/512 + dynamic add still pass); it just gives one place to get the rule right when adding new dim arithmetic.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01SfwHCV7TaX4s9xkn8i7anG
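One plausible reading of the predicate, based only on the description "a sympy.Expr that is not a compile-time constant" and the mention of the is_number variant; the actual body in mlir_common may differ.

```python
import sympy


def is_symbolic_dim(x):
    """True for a sympy expression whose value is unknown at compile time."""
    return isinstance(x, sympy.Expr) and not x.is_number


s52 = sympy.Symbol("s52", integer=True, positive=True)

checks = {
    "python int": is_symbolic_dim(128),
    "sympy Integer": is_symbolic_dim(sympy.Integer(128)),
    "symbol": is_symbolic_dim(s52),
    "expr with symbol": is_symbolic_dim(4 * s52),
}
```

The sympy.Integer case is the one a bare isinstance check gets wrong: it is a sympy.Expr, yet its value is fully known.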
Full roadmap for extending the C++ trace path to general dynamic shape: the runtime DMA stack already carries runtime dims/strides, so the work is codegen (general symbolic index lowering + runtime togsim.dma descriptors); 7-phase build order, cross-cutting contracts, test matrix, risks. Notes that dynamic floor/mod belongs in axis_split (symbolic-aware), not the legacy convert_index affine path. Planning artifact -- remove before merging the feature.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01SfwHCV7TaX4s9xkn8i7anG
Generalise axis_split boundary detection and divisibility-chain construction to accept symbolic size expressions, as a strict superset of the integer case: concrete-int reshapes produce identical split plans, and a dynamic reshape whose flattened extent E is a product of dims (divisor a genuine factor, e.g. FloorDiv(v, N) / ModularIndexing(v, 1, N) with extent M*N) is now detected.
- _divides/_eq/_gt1/_proper/_quotient/_as_size: boundary arithmetic that reduces exactly to int ops when operands are concrete and otherwise uses sympy (Mod simplifies to 0, cancel gives the quotient) under the symbols' integer/positive assumptions.
- _ordered_chain replaces _is_chain + numeric sort: orders boundaries by the divisibility partial order (b_i precedes b_j iff b_i | b_j) instead of numeric value, so symbolic suffix-product boundaries (N | M*N) chain; returns None on a non-total chain (incompatible radices) exactly as before.
- collect_boundaries / find_split_plan keep symbolic divisors and extents.
- build_split_body sizes sub-vars with _quotient/_as_size (symbolic seg extents).
Detection layer only: the residual-floor/mod folding (_fold_with_ranges) for symbolic divisors and the runtime dynamic-stride DMA needed for end-to-end symbolic reshape are follow-ups.
Verified by tests/test_axis_split_symbolic.py (static cases match legacy, symbolic cases detected, misaligned/non-divisor bail) and confirmed behaviour-neutral on the static view suite.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01SfwHCV7TaX4s9xkn8i7anG
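The divisibility test and chain ordering can be sketched with plain sympy. These are simplified stand-ins for _divides, _quotient, and _ordered_chain (the helper bodies are not shown in the commit message), assuming the size symbols carry integer/positive assumptions.

```python
import sympy


def divides(a, b):
    """True when a | b can be proven: Mod(b, a) simplifies to 0."""
    return sympy.Mod(sympy.sympify(b), sympy.sympify(a)) == 0


def quotient(b, a):
    """b / a for a proven divisor a, via cancel."""
    return sympy.cancel(sympy.sympify(b) / sympy.sympify(a))


def ordered_chain(boundaries):
    """Order by divisibility; return None if the boundaries do not form a chain."""
    items = [sympy.sympify(b) for b in boundaries]
    # In a total chain, an element's rank is how many elements divide it.
    ranked = sorted(items, key=lambda b: sum(divides(a, b) for a in items))
    for lo, hi in zip(ranked, ranked[1:]):
        if not divides(lo, hi):
            return None  # incompatible radices
    return ranked


M = sympy.Symbol("M", integer=True, positive=True)
N = sympy.Symbol("N", integer=True, positive=True)

symbolic_chain = ordered_chain([M * N, N])   # N | M*N, so N comes first
concrete_chain = ordered_chain([12, 4, 2])   # same answer as a numeric sort
no_chain = ordered_chain([N, M])             # neither provably divides the other
```

Ranking by divisor count rather than numeric value is what lets N and M*N be ordered even though neither has a known magnitude.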
Summary
Adds torch.compile(dynamic=True) support on the C++ trace path. One compiled trace producer .so serves any input size: the producer reads its loop bound from shape_args at runtime, while the per-tile cost table (sampled once, since a tile's cost is shape-invariant) keys the timing.
Verified end to end: a single dynamic a + b runs at N=1024 (2 tiles, 183 cycles) and N=2048 (4 tiles, 261 cycles) from the same .so, the trace scaling with shape_args.
What each commit does
- … heuristics for a sympy dim (they only minimize a known dim's tail); all gated on isinstance(.., sympy.Expr), so static is unchanged.
- … scalar-int kernel arg; memref<?xf32>; affine.for ... to %<name>_bound with a top-of-function memref.load+index_cast prologue.
- … arg_attributes so the generated wrapper imports (the extent already arrives as a runtime arg).
- … binary can't be instantiated for a runtime extent.
- … pin_loops_to_one_tile (general for static + dynamic) runs gem5 sampling on a one-tile copy; the symbolic IR is kept for the producer / cost table.
- … build_skeleton skips the loop_end serialization it never needed; lower_to_emitc re-sources each still-used size arg from shape_args[k] via emitc.subscript, so the producer loop reads for (iv=0; iv<shape_args[k]; iv+=step).
- … existing per-kernel attribute YAML (shape_args), run_standalone passes --attribute, and main.cc fills shape_args for run_producer.
- … tests/ops/elementwise/test_dynamic_add.py.
Known limitations / follow-ups
- … skipped for dynamic, so output values are zeros today; the test checks output shape. Functional output for dynamic is a follow-up.
- … pin_loops_to_one_tile once it is decoupled from run_tog; op coverage beyond contiguous 1D add (matmul / …
- … LLVM fork and is not part of this PR.
Base: stacked on feature/togsim-cpp-trace (#267).
🤖 Generated with Claude Code