TOGSim C++ trace-generation pipeline (P0-P3): explicit dataflow producer + barriers by YWHyuk · Pull Request #267 · PSAL-POSTECH/PyTorchSim

YWHyuk · 2026-06-19T04:22:35Z

What

Replaces the timing-path TOG producer (MLIR -> Python dict -> ONNX -> C++ TileGraphParser) with a compiled, shape-parametric trace producer: post-vcix MLIR -> skeleton -> EmitC -> C++ -> .so. TOGSim dlopens the .so, runs it to record an instruction trace, and feeds it into the existing Simulator/Core (timing core unchanged). Driven by a new --trace_so mode; the legacy ONNX-TOG path is kept and marked DEPRECATED, so nothing existing breaks.

Pipeline

post-vcix .mlir
  | build_skeleton.py        loops + memref.dma_start/wait -> togsim.* ; DCE the rest
  | dep_analysis.py          per-op read/write SRAM buffers (SSA) + vcix preload/matmul pairing
  | lower_to_emitc.py        togsim.* -> emitc.call_opaque ; drive upstream convert-*-to-emitc
  v
EmitC --mlir-translate--> C++ --g++ -shared--> trace.so
  | run_producer (dlopen)    EmitCtx callbacks record a TraceRec stream
  | togsim_trace_bridge.cc   TraceRec -> TileGraph (explicit dependency DAG)
  v
existing Simulator / Core    cycles, DRAM traffic

Dependency model (no in-order, no runtime tag-hash, no op heuristics)

Dependencies are derived from two sources available pre-collapse:

SRAM last-writer per buffer (load->compute, the Y_spad accumulator chain), recovered via SSA + a virtual SA_WEIGHTS buffer that folds preload->matmul.
The systolic array modeled as a pipeline (occupancy/latency split) with two explicit, distinctly-named barriers:
- MEMORY_BAR (renamed from BAR): the DMA/tag memory fence; an async load -> compute waits the data's resp-complete.
- COMPUTE_BAR (new): the compute fence; a store waits all systolic-array pipelines to drain.

Both barriers are first-class trace ops (togsim.compute_barrier -> ABI togsim_compute_barrier) visible in the trace dump and the instruction stream.

Status

256^3 GEMM runs end-to-end through the real Simulator via --trace_so.
Cycle comparison vs the legacy build_tog path on the same kernel + gem5 cycle_list: compute work and DRAM traffic match; matmuls pipeline on 2 SAs; the memory fence correctly delays compute until the weight load arrives.
Known open items (documented in docs/design/togsim_cpp_trace.md sec 10): preload-concurrency cap (needs non-zero preload occupancy), parallel output tiles (dispatch granularity), broader op coverage (conv/SDPA/vector).

Testing

tests/test_togsim_skeleton.py, test_togsim_emitc.py, test_togsim_runtime.py (7 tests).
Manual --trace_so GEMM through TOGSim.
Legacy path untouched (comment-only DEPRECATED markers).

Design of record: docs/design/togsim_cpp_trace.md (sec 9-10).

🤖 Generated with Claude Code

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01HAmdM9BrsTvfi8sZnnfNno

… feed Skeleton + EmitC + cost/dep analysis on the frontend; the trace runtime, loader, bridge, and Core feed on the simulator; shared MLIR pass helpers and the pipeline tests. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01HAmdM9BrsTvfi8sZnnfNno

Per-record tag key in the bridge plus per-iteration tag alloc in dma-fine-grained so multi-tile-K and conv loads do not collide; strip the reduction accum marker from the memory_barrier slot. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01HAmdM9BrsTvfi8sZnnfNno

togsim_dispatch with TILE_BEGIN/TILE_END; outline each work-item into togsim_kernel_tile. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01HAmdM9BrsTvfi8sZnnfNno

DMA-capacity throttle and frozen-state guard, per-core VMEM in the configs, and the SA weight-buffer throttle. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01HAmdM9BrsTvfi8sZnnfNno

trace_timeline.py with per-work-item grouping and resource-centric DMA lanes; the trace logs the first DRAM response and the assigned systolic array, and scopes the compute barrier to its dispatch. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01HAmdM9BrsTvfi8sZnnfNno

Default to the trace path; fix uninitialized Instruction fields, the matmul accumulator wedge, fused-subtile dedup, nested/fused epilogue dataflow, and dma_wait fusion; bound concurrent dispatches to the spad, round-robin work-items within a partition, benchmark autotune and run the multi-tenant scheduler through the trace path, and emit trace.so for pooling/reduction. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01HAmdM9BrsTvfi8sZnnfNno

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01HAmdM9BrsTvfi8sZnnfNno

Carry simulator headers through the wrapper for cache-safe replay; drop verbose [P3-trace] logs; fix the key.mlir compile race in load(). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01HAmdM9BrsTvfi8sZnnfNno

… runtime model Replace the trace bridge's accumulated special cases with one dataflow rule and clean up the runtime that consumes it. Dependency rule: per SRAM buffer keep a writers SET; a reader depends on all current writers (occupancy=ISSUE when both are systolic-array ops, else latency=DONE); a writer REPLACEs the set. The only exception is is_mm_accum (a matmul that reads and writes the same buffer = a commutative accumulator): skip its read edge and UNION its write, waiting only the non-matmul init seed and not ordering co-matmuls. This drops the matmul-accumulator chain that deadlocked the SA weight-slot pipeline while keeping the init->matmul edge, and lets a vector epilogue or the store wait every K matmul (fixes the pure-vector store that an empty COMPUTE_BAR let slip). Remove COMPUTE_BAR entirely: a matmul is its own DONE-handle (finish == SA drain), so the store JOINs the matmul writers directly. The whole emit/loader chain is gone -- build_skeleton, lower_to_emitc, togsim.compute_barrier, the runtime symbol, the Opcode/case/_fence_finish, and TraceRec::COMPUTE_BAR -- so a stale producer fails loudly instead of emitting records the bridge would drop. Only MEMORY_BAR remains (an async load's DONE is its data arrival, not issue). Model compute-output spad footprint in the SRAM version/capacity machinery so buffer reuse (WAR) is capacity-modeled, not a hard edge. The output size comes from the DMA records that touch the same buffer (a buf_bytes pre-pass); an in-place buffer (accumulator, relu) is version-transparent so footprint is not double-counted. The occupy gate and version release sit in the MOVIN/MOVOUT/COMP issue points (release before the COMP skip path so a skipped matmul still frees). Runtime: collapse child_inst / _pipeline_children into one event-indexed _deps[ISSUE|DONE] with add_dep(c, on) and fire(e); collapse the weight-slot release queue and the async-load wakeup into one _due_events timed-effect table drained by process_due_events. Both are behavior-preserving (byte-identical). Require the weight-slot model: sa_weight_buffer_depth must be > 0 (errors at init), and the round-robin disable mode is removed. Degenerate traces (a consumer-less preload, an unpinned matmul) hit explicit error+exit guards rather than asserts that vanish under NDEBUG. Mark the legacy ONNX TOG path deprecated: it is superseded by the trace path, so TileGraphParser logs a deprecation warning and the TORCHSIM_LEGACY_TOG=1 opt-in warns at command build. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01HAmdM9BrsTvfi8sZnnfNno

YWHyuk force-pushed the feature/togsim-cpp-trace branch 3 times, most recently from 1151f6a to 7f70bbb Compare June 22, 2026 12:13

YWHyuk mentioned this pull request Jun 23, 2026

Dynamic shape on the C++ trace path (elementwise add) #269

Open

YWHyuk force-pushed the feature/togsim-cpp-trace branch 3 times, most recently from 9b913d4 to 4767e8a Compare June 24, 2026 13:16

YWHyuk and others added 10 commits June 24, 2026 22:35

[Docs] C++ trace pipeline design (runtime-tag pairing, ABI)

fd152bf

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01HAmdM9BrsTvfi8sZnnfNno

[TOGSim] Work-item outlining and ABI v12 dispatch

b189df4

togsim_dispatch with TILE_BEGIN/TILE_END; outline each work-item into togsim_kernel_tile. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01HAmdM9BrsTvfi8sZnnfNno

[TOGSim] Make the trace runtime test self-contained

76a2862

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01HAmdM9BrsTvfi8sZnnfNno

YWHyuk force-pushed the feature/togsim-cpp-trace branch from b7c1ec4 to 9033945 Compare June 24, 2026 13:37

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

TOGSim C++ trace-generation pipeline (P0-P3): explicit dataflow producer + barriers#267

TOGSim C++ trace-generation pipeline (P0-P3): explicit dataflow producer + barriers#267
YWHyuk wants to merge 10 commits into
developfrom
feature/togsim-cpp-trace

YWHyuk commented Jun 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

YWHyuk commented Jun 19, 2026

What

Pipeline

Dependency model (no in-order, no runtime tag-hash, no op heuristics)

Status

Testing

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant