
[Frontend] MLIR codegen on Python bindings: port C++ passes + affine-only DMA #259

Merged
YWHyuk merged 20 commits into develop from feature/tog-python-binding on Jun 19, 2026

@YWHyuk commented Jun 17, 2026

Collaborator

Summary

Moves the PyTorchSim MLIR codegen onto the MLIR Python bindings, ports the custom
C++ MLIR passes to Python, and enforces an affine-only DMA contract so codegen emits
only per-axis affine indices. As of the final commits on this branch, the only C++
mlir-opt pass left in the compile path is -test-loop-padding; -test-pytorchsim-to-vcix
was also ported to Python in a later commit.

Themes (10 commits):

  • MLIR Python bindings infra — in-process LLVM/Gemmini/vlane lowering passes,
    coverage tooling, LLVM pin bump.
  • DMA lowering — emit togsim.transfer and decompose to <=4D memref.dma_start;
    unify all DMA codegen on togsim.transfer; the >4D peel is an affine.for nest
    with the lane-banked physical SRAM offset (delivered as the last index operand),
    which also fixed #258 ("decompose-transfer peel output (memref.subview + unrolled
    dma_start) incompatible with TOG generation").
  • Floor/mod removal (affine-only contract) — axis-split at the Inductor
    scheduling layer (mixed-radix, default-on, rank guard removed) + graph-copy for
    operands axis-split can't linearize; coverage test test_floormod_axis_split.py.
  • build_tog — port the C++ -test-tile-operation-graph analysis to Python.
  • dma-fine-grained — port the C++ -dma-fine-grained pass to Python (matmul
    MVIN subtile loops + input/weight fusion), run in-process between loop-padding and
    vcix.

Validation

  • build_tog: byte-exact vs the C++ pass on 52 real kernels.
  • dma-fine-grained: structural match vs mlir-opt -dma-fine-grained on rank 2/3/4
    (matmul / bmm / conv) — same vcix dma_start and line counts.
  • End-to-end (Gem5+Spike+TOGSim, allclose): floor/mod suite, gemm/bmm/conv2d, and the
    resnet / transformer / vit / mlp / mobilenet / llama models.

Notes

🤖 Generated with Claude Code

@YWHyuk force-pushed the feature/tog-python-binding branch from 98b4f0f to 689e2c0 on June 17, 2026 07:59
YWHyuk added 6 commits June 17, 2026 17:33
Run the MLIR -> LLVM pipeline in-process via the bindings PassManager.
Add Python out-of-line passes: lower_to_llvm, lower_dma_to_gemmini, lower_vlane_idx.
Auto-resolve the bindings path from TORCHSIM_LLVM_PATH and ship the bindings in the
LLVM artifact. Add op_coverage tooling and the bindings smoke test. Bump the LLVM
pin and rebuild the thirdparty base image.

Emit togsim.transfer for >4D DMA and decompose it to a <=4D customized
memref.dma_start: unit-collapse fast path, unrolled-subview peel for >4 effective
dims. Fix #258 by emitting affine.apply (not arith.addi) for the peeled DRAM offset
so the TOG pass can walk the loop index through it.

Split a loop axis so aligned FloorDiv/ModularIndexing collapse to per-axis affine
indices. Mixed-radix split over a divisibility chain; integer-typed split symbols;
r-prefix innermost reduce dims. Reindex the collapsed LoopBody instead of re-tracing;
fold residual floor/mod via tensor range info. Shared boundary helpers, rank guard,
and an uncovered floor/mod ledger. Enabled by default with the recompile fallback
instrumented.
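
To make the mixed-radix split concrete, here is a minimal sketch in plain Python. It is not the PyTorchSim implementation: `radices_from_chain` and `linear_index` are hypothetical helper names, and the chain `[2, 6, 24]` is an arbitrary example. It assumes Inductor's convention that `ModularIndexing(i, d, m)` means `(i // d) % m`, and shows that once the axis is split into digits, each floor/mod expression equals a single digit, which is a per-axis affine index.

```python
from itertools import product


def radices_from_chain(chain):
    """Turn a divisibility chain such as [2, 6, 24] into radices [2, 3, 4]."""
    radices, prev = [], 1
    for d in chain:
        if d % prev != 0:
            raise ValueError(f"{d} is not a multiple of {prev}: not a divisibility chain")
        radices.append(d // prev)
        prev = d
    return radices


def linear_index(digits, radices):
    """Recombine innermost-first digits into the original loop index."""
    index, weight = 0, 1
    for digit, radix in zip(digits, radices):
        index += digit * weight
        weight *= radix
    return index


radices = radices_from_chain([2, 6, 24])  # innermost-first radices

# Each digit tuple maps to exactly one original index i, and every floor/mod
# expression on i reduces to one digit of the split axis.
seen = []
for i2, i1, i0 in product(range(4), range(3), range(2)):
    i = linear_index((i0, i1, i2), radices)
    seen.append(i)
    assert i // 6 == i2          # FloorDiv(i, 6)
    assert (i // 2) % 3 == i1    # ModularIndexing(i, 2, 3)
    assert i % 2 == i0           # ModularIndexing(i, 1, 2)
```

The split only works when the dividers form a divisibility chain, which is why the sketch raises on anything else; the incompatible-radix case is what graph-copy handles.
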
… floor/mod

Insert a copy to relayout an operand whose floor/mod cannot be removed by axis-split:
incompatible-radix shared-axis access and cross-axis multi-variable arguments.
Enabled by default alongside axis-split.

Port the analysis and IR-mutation halves of the C++ test-tile-operation-graph pass
to Python, wire build_tog into the gem5 path, and drop the C++ pass. Node-id counter
is thread-local for concurrent compilation.
…seek seed

Add tests/ops/view/test_floormod_axis_split.py covering axis-split and graph-copy
patterns. Seed the global RNG in the deepseek base test so config-random weights are
deterministic.
@YWHyuk force-pushed the feature/tog-python-binding branch from a127c37 to 3c56c6b on June 17, 2026 08:34
@YWHyuk changed the title from "[Frontend] Port test-tile-operation-graph to Python (build_tog)" to "[Frontend] Move MLIR codegen to Python bindings (affine-only contract + build_tog)" on Jun 17, 2026
YWHyuk and others added 4 commits June 17, 2026 20:24
…cal SRAM offset

Rewrite the >4D peel to mirror the C++ -dma-fine-grained subtile loop: wrap the
outer dims in an affine.for nest (marked inner_loop so build_tog/TOG registers the
induction var) and emit one <=4D memref.dma_start per iteration. The slice SRAM
offset is the lane-banked physical offset -- split-outer dims rescaled by the lane
coeff (stride/old_size*new_size, the MVIN block_stride / buildSramAffineMap rule) --
delivered as the last SRAM index operand. The previous unrolled subview carried the
offset in the subview, which extract_aligned_pointer_as_index strips in the gemmini
lowering, so every slice aliased the same spad location (pixel_shuffle MISMATCH). The
DRAM offset folds with the original index into one affine.apply so processDramIndices
can walk the loop index (#258).
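
The shape of the >4D peel can be sketched in plain Python. This is a simplified model under stated assumptions, not the decompose-transfer pass: `peel_transfer` and `addresses` are hypothetical names, only the DRAM side is modeled, and the lane-banked SRAM offset rescale is deliberately left out. The outer dims become a loop nest (standing in for the affine.for nest) and each iteration yields one descriptor of rank <= 4 whose offset is an affine function of the loop indices.

```python
from itertools import product

MAX_DMA_RANK = 4  # the <=4D descriptor limit named in the PR


def peel_transfer(shape, strides, base=0):
    """Yield (dram_offset, inner_shape, inner_strides), one per outer iteration."""
    n_outer = max(len(shape) - MAX_DMA_RANK, 0)
    outer_shape, inner_shape = shape[:n_outer], shape[n_outer:]
    outer_strides, inner_strides = strides[:n_outer], strides[n_outer:]
    # Stand-in for the affine.for nest: one iteration per outer index tuple.
    for outer_idx in product(*(range(n) for n in outer_shape)):
        # Constant strides times loop indices: an affine offset.
        offset = base + sum(i * s for i, s in zip(outer_idx, outer_strides))
        yield offset, tuple(inner_shape), tuple(inner_strides)


def addresses(offset, shape, strides):
    """All element addresses touched by one descriptor."""
    return {
        offset + sum(i * s for i, s in zip(idx, strides))
        for idx in product(*(range(n) for n in shape))
    }


shape = (2, 3, 2, 2, 2, 4)        # rank 6, so two outer dims are peeled
strides = (96, 32, 16, 8, 4, 1)   # contiguous row-major strides for that shape
descriptors = list(peel_transfer(shape, strides))

covered = set()
for off, inner_shape, inner_strides in descriptors:
    covered |= addresses(off, inner_shape, inner_strides)
```

Because every slice gets its own offset as an explicit value rather than hiding it in a subview, no two slices can alias the same location, which is the failure mode the commit describes for the SRAM side.
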

Thread vectorlane (systolic-array size) through run_python_passes into the pass for
the rescale's nr_outerloop. Drop the axis-split rank guard now that >4D is peeled
correctly, and register tests/ops/view/test_floormod_axis_split.py in the CI allowlist.

Validated end-to-end (Gem5+Spike+TOGSim): pixel_shuffle (>4D peel) and the full
floor/mod suite pass; elementwise/gemm/conv2d/reduce/softmax/MLP show no regressions.

Route every MVIN/MVOUT -- both the MLIRKernel load/store backend path and the
template path (gemm/conv/bmm/maxpool/cat) -- through emit_transfer, so a single
decompose-transfer pass lowers all DMAs to memref.dma_start. This drops the
get_dma_code emitter, the _dma_needs_transfer instance flag, and
format_dma_op_attributes.

togsim.transfer now also carries subtile_size and async, which decompose
propagates onto the lowered dma_start (subtile filtered to the kept axes when
unit dims collapse). For <=4D tiles decompose emits the descriptor directly on
the original SRAM buffer (no collapse_shape) so the C++ -dma-fine-grained
subtile split, which walks the SRAM operand, sees a direct buffer as before.

Validated end-to-end (Spike + TOGSim) on elementwise, gemm (matmul/addmm), bmm,
conv2d, group_conv, pool, cat, reduce, softmax, layernorm, batchnorm.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Reflect a6b7ebb: the find_split_plan rank guard is gone (>4D index now lowers
through the decompose-transfer affine.for peel, pixel_shuffle end-to-end), and
the decompose-transfer peel <-> TOG incompatibility is resolved. Move it from
Known-issues to Done; drop the >4D rank-guard caveat and the high-rank next-step.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Port mlir/test/lib/Analysis/TestDmaFineGrained.cpp to a Python out-of-line pass
(passes/dma_fine_grained.py): split the matmul MVIN DMAs (input/weight/bias) into
subtile affine.for nests and fuse the input/weight nests, replacing the C++
-dma-fine-grained pass. The MLIR Python bindings expose no IRMapping, so the fused
nest is built directly (each DMA emitted with the fused induction vars) instead of
cloning bodies -- structurally equivalent, not byte-exact SSA text.

Pipeline: the single mlir-opt invocation is split around the Python pass
(loop-padding -> run_fine_grained in place -> pytorchsim-to-vcix) in both the
functional and gem5 paths (extension_codecache); vectorlane (systolic-array size)
is threaded in for the lane-banked SRAM offset rescale.

Validated against mlir-opt -dma-fine-grained on rank 2/3/4 fixtures (matmul / bmm /
conv: same vcix dma_start and line counts) and end-to-end (Gem5+Spike+TOGSim):
gemm/bmm/conv2d plus the resnet/transformer/vit/mlp models pass.

Docs: dma-transfer-lowering.md -- >4D peel is affine.for + lane-banked physical SRAM
offset via the last index operand; dma_fine_grained / build_tog are now Python
passes; the #258 appendix is marked resolved.
@YWHyuk changed the title from "[Frontend] Move MLIR codegen to Python bindings (affine-only contract + build_tog)" to "[Frontend] MLIR codegen on Python bindings: port C++ passes + affine-only DMA" on Jun 17, 2026
YWHyuk added 2 commits June 17, 2026 23:15
…ings)

Port mlir/test/lib/Conversion/PyTorchSimToVCIX/TestPyTorchSimToVCIXConversion.cpp to
a Python out-of-line pass (passes/lower_to_vcix.py): lower linalg.matmul (gemm and
conv2d) and the transcendental math ops (exp/erf/tanh/sin/cos) to VCIX dialect ops
(RISC-V vector custom instructions), replacing the C++ -test-pytorchsim-to-vcix.

The C++ pass is a dialect conversion (applyPartialConversion); the bindings expose no
conversion framework, so each matchAndRewrite is reimplemented as imperative IR
rewriting. The VCIX dialect is not in the Python bindings, so vcix ops are created as
unregistered generic ops -- mlir-opt / mlir-translate (vcix registered) re-parse the
{}-attr generic form fine, and run_standard_lowering already consumes vcix output via
allow_unregistered_dialects, so this matches the existing pipeline.

Pipeline: the vcix mlir-opt invocation is dropped; run_to_vcix runs in-process after
the Python fine-grained pass and before the standard lowering (both functional and
gem5 paths in extension_codecache). mlir-opt now runs only -test-loop-padding.

Validated structurally against mlir-opt -test-pytorchsim-to-vcix (non-constant ops
byte-identical including the dma_wait tag maps, on gemm and conv2d fixtures) and
numerically end-to-end (Gem5+Spike+TOGSim allclose): gemm/bmm/conv2d (incl. large
N/K), softmax, exp/erf/sin/cos, and the resnet18/vit/transformer/mlp models.

dma-fine-grained and pytorchsim-to-vcix are now Python passes (dma_fine_grained,
lower_to_vcix); update the docstring listing -- only test-loop-padding still runs
in mlir-opt.
YWHyuk and others added 6 commits June 18, 2026 12:43
axis-split + graph-copy (on by default) linearize aligned floor/mod at the
scheduling layer, so the index reaching get_dma_info is affine and the
FloorDiv/ModularIndexing tile-divisibility branches there are never entered
(measured: 0 entries across elementwise, gemm, bmm, conv, cat, floor/mod,
reduce, attention). Remove those dead branches and their orphans:

  - the FloorDiv and ModularIndexing tile-forcing + RecompileSignal blocks
  - the implicit-ModularIndexing index rewrite and implicit_local_dims
  - the dead ModularIndexing branch in the dram_stride computation
  - is_modular_indexing, the write-only implicit_dim_size, unused import sys

Kept: the non-floor/mod recompile paths (index-divisibility, indirect access,
non-power-of-2 vec size), RecompileSignal, and the retry loop. The upstream
implicit_dim_ops tile-forcing is left untouched (separate change).

Validated end-to-end (Spike + TOGSim): elementwise, gemm, bmm, conv2d,
group_conv, pool, cat, floor/mod suite, reduce, softmax, layernorm, batchnorm,
gqa -- all pass, 0 recompiles.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…-split)

implicit_dim_ops/extract_dividers/apply_constraints forced the initial tile size
to match a view's floor/mod divider, up front in compute_tile_size. axis-split now
linearizes those views at the scheduling layer, so the forcing is redundant:
disabling it leaves every test allclose-correct and, on the affected kernels,
slightly faster (the forced tile was over-constrained -- batchnorm 1189->1114,
layernorm 4092->3947 cycles; non-floor/mod kernels unchanged).

Remove the machinery and its now-unused imports (ModularIndexing, FloorDiv, Mod,
MemoryDep, StarDep, WeakDep).

Validated end-to-end (Spike + TOGSim): elementwise, gemm, bmm, conv2d, group_conv,
pool, cat, floor/mod suite, reduce, softmax, layernorm, batchnorm, gqa -- all pass,
0 recompiles.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Add the TPU layout-assignment & padding investigation (docs/tpu_layout_padding_report.md)
and the loop-padding design doc. Settled model: padding is two layers --
(A) lane/sublane 8x128 alignment is materialized (footprint + DMA traffic), (B) the
compute-block (MXU tile) boundary tail is masked (compute-utilization only, not traffic).
test-loop-padding's post-codegen heuristic is to be replaced by informed emission at the
scheduling/codegen layer (decide early, materialize late); the two costs must be modeled
by separate functions (do not double-count the compute-block tail as traffic).
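
A minimal sketch of the "two separate cost functions" idea follows. It is an illustration only: `materialized_elems` and `compute_utilization` are hypothetical names, and the 128x128 compute-block size in the example is an assumed value, not one stated in the report. Layer A rounds the tensor up to the 8x128 sublane/lane grid and counts what is actually stored and moved; layer B measures how much of the issued compute blocks does useful work and never feeds back into traffic.

```python
def round_up(value, multiple):
    """Smallest multiple of `multiple` that is >= value (integer arithmetic)."""
    return -(-value // multiple) * multiple


def materialized_elems(rows, cols, sublane=8, lane=128):
    """Layer A: elements stored and transferred after 8x128 alignment."""
    return round_up(rows, sublane) * round_up(cols, lane)


def compute_utilization(rows, cols, tile_rows, tile_cols):
    """Layer B: useful fraction of issued compute-block slots (tail is masked)."""
    issued = round_up(rows, tile_rows) * round_up(cols, tile_cols)
    return (rows * cols) / issued


rows, cols = 100, 200
traffic = materialized_elems(rows, cols)                 # 104 * 256 elements
utilization = compute_utilization(rows, cols, 128, 128)  # assumed 128x128 block
```

Keeping the two functions separate is what prevents the masked compute-block tail (rows 104..127 here) from being counted as DMA traffic.
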
…ases

Three fixes from the max-effort review of this branch:

- get_dma_info: after retiring the floor/mod recompile branches, a residual
  floor/mod (store-side ModularIndexing, reduction-axis floor/mod, incompatible
  radix) that axis-split/graph-copy did not linearize was silently bucketed by
  its base symbol in the dram_stride loop, emitting a wrong DRAM descriptor.
  Raise NotImplementedError instead of mis-striding silently. No test triggers
  it (0 floor/mod reach get_dma_info in the suite) -- it is a safety net.

- decompose_transfer collapse fast path: keep=[g[-1]] picked the last dim of
  each reassociation group, which is a unit dim when trailing unit dims attach
  after the non-unit one (e.g. [..,4,1,1]); strides/subtile were read from the
  wrong axis. Pick the non-unit dim in each group.

- decompose_transfer >4D peel: new_vlane fell back to 0 whenever the vlane split
  axis was not among the inner 4 dims, conflating an axis peeled into the outer loop
  (genuinely unrepresentable -> raise) with a unit lane axis (default 0 is fine).
  Distinguish the two cases: raise for the former, keep 0 for the latter.
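
The collapse fast-path fix in the second bullet can be shown with a small plain-Python sketch. `pick_kept_dims` is a hypothetical helper written for this comment, not the code in decompose_transfer; it assumes each reassociation group contains at most one non-unit dim, which is what makes the unit collapse valid in the first place.

```python
def pick_kept_dims(shape, groups):
    """Per reassociation group, keep the non-unit dim (last dim if all are 1)."""
    kept = []
    for group in groups:
        non_unit = [d for d in group if shape[d] != 1]
        if len(non_unit) > 1:
            raise ValueError(f"group {group} has several non-unit dims; not a unit collapse")
        kept.append(non_unit[0] if non_unit else group[-1])
    return kept


shape = (8, 4, 1, 1)           # trailing unit dims attach after the non-unit 4
groups = [[0], [1, 2, 3]]

buggy = [g[-1] for g in groups]          # old rule: last dim of each group
fixed = pick_kept_dims(shape, groups)    # new rule: the non-unit dim
```

With the old rule the second group keeps dim 3, a unit dim, so the stride and subtile are read from the wrong axis; the new rule keeps dim 1.
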

Validated: elementwise, gemm, conv2d, cat, floor/mod suite (incl. pixel_shuffle
>4D peel), softmax, layernorm, batchnorm -- all pass, no spurious raise.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

More fixes from the max-effort review, after verifying each against the C++
reference and reachability:

- graph_copy _relayout_args: ranges picked the consumer iteration shape by rank
  alone (max key=len), so for two equal-rank operands with different per-dim
  extents the broadcast-from operand's smaller shape could win and the real
  incompatible-radix conflict on the broadcast-to dim was missed (order-dependent:
  a commutative reorder flipped correct relayout into a silent miss). Use per-dim
  max extent over the max-rank operands.

- lower_to_vcix _sew/_legalize_vector_type: mirror the C++ legalizeVectorType --
  F16/BF16 return sew 0 (transcendentals stay unlowered for -convert-math-to-llvm,
  as in the validated path) instead of being lowered to VCIX, and add the missing
  rank != 1 guard.

- lower_to_vcix matmul: port the C++ guards as loud failures -- M/N/K must be a
  multiple of the systolic size when > SS (else the N//SS / K//SS loops drop the
  tail tile), and A vs B must agree on the K subtile (last-writer-wins would pick
  one silently). Latent today (heuristic/autotune only emit SS-multiple tiles).

- Doc-only: graph-copy is default-on (TORCHSIM_GRAPH_COPY=0 to disable); fixed the
  two stale 'no-op unless set' comments.
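
The order dependence in the first bullet is easy to reproduce in plain Python. The sketch below is not the `_relayout_args` code; `ranges_by_rank` and `ranges_per_dim` are hypothetical names that model only the shape selection. It relies on the documented behavior of `max(..., key=len)`, which returns the first maximal item when several tie.

```python
def ranges_by_rank(operand_shapes):
    """Old rule: the first operand of maximal rank wins (order-dependent)."""
    return tuple(max(operand_shapes, key=len))


def ranges_per_dim(operand_shapes):
    """New rule: per-dim max extent over the max-rank operands."""
    max_rank = max(len(s) for s in operand_shapes)
    candidates = [s for s in operand_shapes if len(s) == max_rank]
    return tuple(max(extents) for extents in zip(*candidates))


broadcast_from = (4, 1, 8)   # dim 1 is broadcast
broadcast_to = (4, 6, 8)

old_ab = ranges_by_rank([broadcast_from, broadcast_to])
old_ba = ranges_by_rank([broadcast_to, broadcast_from])
new_ab = ranges_per_dim([broadcast_from, broadcast_to])
new_ba = ranges_per_dim([broadcast_to, broadcast_from])
```

Swapping the operands changes the old result but not the new one, so a commutative reorder can no longer hide the conflict on the broadcast-to dim.
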

Validated: elementwise, gemm, bmm, conv2d, group_conv, cat, floor/mod suite,
softmax, layernorm -- all pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…fix exp chunk

Two fixes to the C++->Python vcix port (lower_to_vcix.py) that SDPA exercises but
the gemm/bmm/conv tests do not:

- _lower_matmul bailed with 'if ATag is None or BTag is None: return False',
  gating on an MVIN dma_start tag for both operands. In SDPA's fused scores.V
  matmul, operand B is the softmax output produced in place by affine.vector_store,
  not DMAed, so BTag stayed None and the matmul was left un-lowered -> wrong
  attention output. Mirror the C++ MatmulOpLowering: an operand is initialized by
  either a dma_start OR a preceding affine.vector_store into its root memref; bail
  only when an operand is truly uninitialized. BTag/BAsync stay None/0 and are only
  read under 'if BAsync:', so the B dma_wait is correctly skipped (as in C++).

- _make_sf_vc_v_iv n>1 transcendental chunking called
  vector.ExtractStridedSliceOp(offsets, sizes, strides, vec) -- wrong arg order,
  missing the result type and vector operand, raising TypeError under these MLIR
  bindings. Pass (result=legal_ty, vector=vec, offsets, sizes, strides). Only
  reached by large transcendentals (n>1), e.g. SDPA softmax exp, so CI's small-tile
  (n==1) tests never hit it.
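
The operand-initialization rule from the first bullet can be modeled in plain Python. This is a toy model, not `_lower_matmul`: the `Op` record, `find_initializer`, and `can_lower_matmul` are hypothetical, and ops are reduced to a kind, a target memref name, and an optional DMA tag. It shows the intended decision: bail only when an operand has no initializer at all, and carry a tag only for DMA-initialized operands.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Op:
    kind: str                  # "dma_start", "vector_store", or anything else
    target: str                # name of the root memref the op writes
    tag: Optional[str] = None  # DMA tag; only set for dma_start


def find_initializer(preceding_ops, memref):
    """Return the last op that initializes `memref`, or None if there is none."""
    for op in reversed(preceding_ops):
        if op.target == memref and op.kind in ("dma_start", "vector_store"):
            return op
    return None


def can_lower_matmul(preceding_ops, a, b):
    """Return (ok, a_tag, b_tag); ok is False only for an uninitialized operand."""
    init_a = find_initializer(preceding_ops, a)
    init_b = find_initializer(preceding_ops, b)
    if init_a is None or init_b is None:
        return False, None, None
    # A vector_store carries no tag, so its tag stays None and the caller
    # skips the corresponding dma_wait.
    return True, init_a.tag, init_b.tag


sdpa_like = [Op("dma_start", "A", tag="tagA"), Op("vector_store", "B")]
missing_b = [Op("dma_start", "A", tag="tagA")]

lowered = can_lower_matmul(sdpa_like, "A", "B")
bailed = can_lower_matmul(missing_b, "A", "B")
```

The earlier check treated "no tag" as "not initialized", which is exactly the case the SDPA scores.V matmul hits for operand B.
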

Validated end-to-end (Spike+TOGSim allclose): SDPA 56 cases pass (was crash/wrong);
matmul/bmm/conv2d show no regressions. Bisected: the C++ vcix pass handles SDPA, the
Python vcix pass did not; exp chunking and fine-grained were ruled out separately.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
YWHyuk and others added 2 commits June 18, 2026 19:25
…bug env vars

axis-split and graph-copy are the floor/mod handling path and were
default-on but still gated behind TORCHSIM_AXIS_SPLIT / TORCHSIM_GRAPH_COPY.
Remove the gates so they run unconditionally, and delete the env vars that
were only introduced for validation/debug during development:

  TORCHSIM_AXIS_SPLIT, TORCHSIM_GRAPH_COPY        - default-on toggles
  TORCHSIM_AXIS_SPLIT_FORCE                        - force-split validation aid
  TORCHSIM_AXIS_LEDGER + axis_split.ledger()       - coverage measurement
  TORCHSIM_DEBUG_AXIS_SPLIT + _dump_axis()         - debug dump
  TORCHSIM_GRAPH_COPY_DEBUG                         - graph-copy debug prints
  TORCHSIM_RECOMPILE_LOG                            - vestigial recompile log

Also drop the now-dead ledger() function, the _dump_axis() helper, and the
unused os import in graph_copy.py. The floor/mod regression test no longer
sets the removed env vars. Behavior is unchanged (the toggles were already on).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

dma_fine_grained and lower_to_vcix already exposed run(module, **opts) like the
registered decompose_transfer/lower_vlane_idx, but were called file-based and
directly from extension_codecache, so the .mlir was parsed+printed twice (once per
pass) between loop-padding and the standard lowering, and the pipeline was
hardcoded+duplicated across the functional and gem5 paths.

Give both passes MARKERS and group the four rewrite passes into PRE_OPT_PASSES /
POST_OPT_PASSES around the one remaining mlir-opt pass (-test-loop-padding). A
single driver run_module_passes(in, out, passes, **opts) parses once, runs each
marker-matched pass on the shared Module in order, prints once (copies through when
no marker matches). run_python_passes is now PRE_OPT via that driver; the
functional/gem5 fine-grained+vcix calls each become one run_module_passes.
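
The driver's control flow can be sketched without the MLIR bindings. Everything below is a stand-in: a list of lines plays the role of the parsed Module, `run_module_passes_text` works on strings instead of files, `make_pass` builds toy passes, and the `toy.*` op names are invented for the example. It also assumes markers are matched once against the input text, which the commit message does not spell out. What it does mirror is the described shape: parse once, run each marker-matched pass in order on the shared module, print once, and copy through when nothing matches.

```python
from types import SimpleNamespace

parse_count = 0  # instrumentation for the example only


def parse(text):
    """Stand-in for parsing .mlir text into a Module."""
    global parse_count
    parse_count += 1
    return text.splitlines()


def render(module):
    """Stand-in for printing the Module back to text."""
    return "\n".join(module)


def make_pass(markers, old, new):
    """Build a toy pass object exposing MARKERS and run(module, **opts)."""
    def run(module, **opts):
        module[:] = [line.replace(old, new) for line in module]
    return SimpleNamespace(MARKERS=markers, run=run)


def run_module_passes_text(text, passes, **opts):
    matched = [p for p in passes if any(m in text for m in p.MARKERS)]
    if not matched:
        return text              # copy through: no parse, no print
    module = parse(text)         # parse once
    for p in matched:            # shared module, declared order
        p.run(module, **opts)
    return render(module)        # print once


passes = [
    make_pass(("toy.transfer",), "toy.transfer", "toy.dma"),
    make_pass(("toy.matmul",), "toy.matmul", "toy.mac"),
]

out = run_module_passes_text("x = toy.transfer\ny = toy.matmul", passes)
untouched = run_module_passes_text("z = toy.other", passes)
```

Two passes ran on the first input but `parse` was called only once, which is the saving over the earlier file-based calls that parsed and printed once per pass.
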

run_fine_grained / run_to_vcix stay re-exported for standalone/CLI use.

Validated (Spike+TOGSim): elementwise, gemm, conv2d, softmax, floor/mod suite,
SDPA -- all pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@YWHyuk merged commit 20f49c8 into develop on Jun 19, 2026
76 checks passed