TOGSim C++ trace-generation pipeline (P0-P3): explicit dataflow producer + barriers#267
Open
YWHyuk wants to merge 10 commits into
Open
TOGSim C++ trace-generation pipeline (P0-P3): explicit dataflow producer + barriers#267YWHyuk wants to merge 10 commits into
YWHyuk wants to merge 10 commits into
Conversation
1151f6a to
7f70bbb
Compare
9b913d4 to
4767e8a
Compare
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01HAmdM9BrsTvfi8sZnnfNno
… feed Skeleton + EmitC + cost/dep analysis on the frontend; the trace runtime, loader, bridge, and Core feed on the simulator; shared MLIR pass helpers and the pipeline tests. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01HAmdM9BrsTvfi8sZnnfNno
Per-record tag key in the bridge plus per-iteration tag alloc in dma-fine-grained so multi-tile-K and conv loads do not collide; strip the reduction accum marker from the memory_barrier slot. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01HAmdM9BrsTvfi8sZnnfNno
togsim_dispatch with TILE_BEGIN/TILE_END; outline each work-item into togsim_kernel_tile. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01HAmdM9BrsTvfi8sZnnfNno
DMA-capacity throttle and frozen-state guard, per-core VMEM in the configs, and the SA weight-buffer throttle. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01HAmdM9BrsTvfi8sZnnfNno
trace_timeline.py with per-work-item grouping and resource-centric DMA lanes; the trace logs the first DRAM response and the assigned systolic array, and scopes the compute barrier to its dispatch. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01HAmdM9BrsTvfi8sZnnfNno
Default to the trace path; fix uninitialized Instruction fields, the matmul accumulator wedge, fused-subtile dedup, nested/fused epilogue dataflow, and dma_wait fusion; bound concurrent dispatches to the spad, round-robin work-items within a partition, benchmark autotune and run the multi-tenant scheduler through the trace path, and emit trace.so for pooling/reduction. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01HAmdM9BrsTvfi8sZnnfNno
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01HAmdM9BrsTvfi8sZnnfNno
Carry simulator headers through the wrapper for cache-safe replay; drop verbose [P3-trace] logs; fix the key.mlir compile race in load(). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01HAmdM9BrsTvfi8sZnnfNno
… runtime model Replace the trace bridge's accumulated special cases with one dataflow rule and clean up the runtime that consumes it. Dependency rule: per SRAM buffer keep a writers SET; a reader depends on all current writers (occupancy=ISSUE when both are systolic-array ops, else latency=DONE); a writer REPLACEs the set. The only exception is is_mm_accum (a matmul that reads and writes the same buffer = a commutative accumulator): skip its read edge and UNION its write, waiting only the non-matmul init seed and not ordering co-matmuls. This drops the matmul-accumulator chain that deadlocked the SA weight-slot pipeline while keeping the init->matmul edge, and lets a vector epilogue or the store wait every K matmul (fixes the pure-vector store that an empty COMPUTE_BAR let slip). Remove COMPUTE_BAR entirely: a matmul is its own DONE-handle (finish == SA drain), so the store JOINs the matmul writers directly. The whole emit/loader chain is gone -- build_skeleton, lower_to_emitc, togsim.compute_barrier, the runtime symbol, the Opcode/case/_fence_finish, and TraceRec::COMPUTE_BAR -- so a stale producer fails loudly instead of emitting records the bridge would drop. Only MEMORY_BAR remains (an async load's DONE is its data arrival, not issue). Model compute-output spad footprint in the SRAM version/capacity machinery so buffer reuse (WAR) is capacity-modeled, not a hard edge. The output size comes from the DMA records that touch the same buffer (a buf_bytes pre-pass); an in-place buffer (accumulator, relu) is version-transparent so footprint is not double-counted. The occupy gate and version release sit in the MOVIN/MOVOUT/COMP issue points (release before the COMP skip path so a skipped matmul still frees). Runtime: collapse child_inst / _pipeline_children into one event-indexed _deps[ISSUE|DONE] with add_dep(c, on) and fire(e); collapse the weight-slot release queue and the async-load wakeup into one _due_events timed-effect table drained by process_due_events. Both are behavior-preserving (byte-identical). Require the weight-slot model: sa_weight_buffer_depth must be > 0 (errors at init), and the round-robin disable mode is removed. Degenerate traces (a consumer-less preload, an unpinned matmul) hit explicit error+exit guards rather than asserts that vanish under NDEBUG. Mark the legacy ONNX TOG path deprecated: it is superseded by the trace path, so TileGraphParser logs a deprecation warning and the TORCHSIM_LEGACY_TOG=1 opt-in warns at command build. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01HAmdM9BrsTvfi8sZnnfNno
b7c1ec4 to
9033945
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
Replaces the timing-path TOG producer (
MLIR -> Python dict -> ONNX -> C++ TileGraphParser) with a compiled, shape-parametric trace producer:post-vcix MLIR -> skeleton -> EmitC -> C++ -> .so. TOGSim dlopens the.so, runs it to record an instruction trace, and feeds it into the existing Simulator/Core (timing core unchanged). Driven by a new--trace_somode; the legacy ONNX-TOG path is kept and marked DEPRECATED, so nothing existing breaks.Pipeline
Dependency model (no in-order, no runtime tag-hash, no op heuristics)
Dependencies are derived from two sources available pre-collapse:
SA_WEIGHTSbuffer that folds preload->matmul.MEMORY_BAR(renamed fromBAR): the DMA/tag memory fence; an async load -> compute waits the data's resp-complete.COMPUTE_BAR(new): the compute fence; a store waits all systolic-array pipelines to drain.Both barriers are first-class trace ops (
togsim.compute_barrier-> ABItogsim_compute_barrier) visible in the trace dump and the instruction stream.Status
--trace_so.build_togpath on the same kernel + gem5cycle_list: compute work and DRAM traffic match; matmuls pipeline on 2 SAs; the memory fence correctly delays compute until the weight load arrives.docs/design/togsim_cpp_trace.mdsec 10): preload-concurrency cap (needs non-zero preload occupancy), parallel output tiles (dispatch granularity), broader op coverage (conv/SDPA/vector).Testing
tests/test_togsim_skeleton.py,test_togsim_emitc.py,test_togsim_runtime.py(7 tests).--trace_soGEMM through TOGSim.Design of record:
docs/design/togsim_cpp_trace.md(sec 9-10).🤖 Generated with Claude Code