feat(training): enable double-buffering with per-op opt-out#22

Open
runwangdl wants to merge 7 commits into devel from feat/db

@runwangdl
Owner

Summary

  • Wires --doublebuffer through the tiled training entry points (testMVPTraining.py, testMVPOptimizer.py) via a new TrainingDBTiler.
  • TrainingDBTiler.multiBufferStrategy returns coefficient=1 for any pattern containing an op that doesn't fit the DB pass cleanly, so SB stays the final emitted code for those nodes — the DB pass itself is untouched.
  • DB_OPT_OUT_OPS = {SGD, InPlaceAccumulatorV2, SoftmaxCrossEntropyLoss, SoftmaxCrossEntropyLossGrad}:
    • SGD, InPlaceAccumulatorV2 — output is _alias'd to input (in-place); DB's per-tensor multibuffer hoist would split the alias across two L1 slots and break in-place semantics.
    • SoftmaxCrossEntropyLoss — has 2 outputs (loss + log_prob); DB hoist gets confused on multi-output nodes.
    • SoftmaxCrossEntropyLossGrad: output_grad is consumed by 2 downstream Gemm nodes; DB's per-consumer hoist inflates _users and breaks MemoryAllocation's is_final_input heuristic.
  • Adds _isScalarBuffer so the scalar loss output stays single-buffered (otherwise tripped _hoistMultibufferReferences's shape assertion).
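
The opt-out rule described above can be sketched as a small standalone function. This is a minimal illustration under stated assumptions, not the code in tilingUtils.py: the function names, their arguments, and the representation of a pattern as a list of op-type strings are hypothetical. Only the op names in `DB_OPT_OUT_OPS` and the coefficient values (1 for SB, 2 for DB) come from this PR.

```python
from math import prod
from typing import Iterable, Sequence

# Op types listed in this PR's summary as opted out of double-buffering.
DB_OPT_OUT_OPS = frozenset({
    "SGD",
    "InPlaceAccumulatorV2",
    "SoftmaxCrossEntropyLoss",
    "SoftmaxCrossEntropyLossGrad",
})


def is_scalar_buffer(shape: Sequence[int]) -> bool:
    """Hypothetical stand-in for _isScalarBuffer: at most one element."""
    return prod(shape) <= 1


def multi_buffer_coefficient(pattern_ops: Iterable[str],
                             tensor_shape: Sequence[int]) -> int:
    """Return 1 (single-buffered) or 2 (double-buffered) for one tensor."""
    # Any opted-out op in the pattern keeps the whole pattern on SB.
    if any(op in DB_OPT_OUT_OPS for op in pattern_ops):
        return 1
    # Scalar tensors (e.g. the loss output) stay single-buffered.
    if is_scalar_buffer(tensor_shape):
        return 1
    return 2


print(multi_buffer_coefficient(["Gemm"], (128, 64)))        # DB candidate
print(multi_buffer_coefficient(["Gemm", "SGD"], (128, 64)))  # opted out
```

Keeping the decision in the tiler's strategy, rather than in the DB pass, is what lets the PR leave the DB pass itself untouched.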

Test plan

  • SB regression: testMVPTraining SimpleMLP still generates 2050-line TrainingNetwork.c
  • DB SimpleMLP codegen (2509 lines)
  • DB Autoencoder codegen (15309 lines)
  • DB DSCNN codegen (47542 lines)
  • DB Optimizer codegen on SimpleMLP (all SGD → all SB, 441 lines, expected)
  • pytest test_platforms.py --collect-only -k training_l2_doublebuffer lists the 3 new nodes
  • CI job siracusa-training-tiled-l2-doublebuffer green on this PR
  • GVSoC cycle comparison vs SB baseline (deferred follow-up)

Files changed

| File | Why |
|---|---|
| DeeployTest/testUtils/tilingUtils.py | TrainingDBTiler + _isScalarBuffer + DB_OPT_OUT_OPS |
| DeeployTest/testMVPTraining.py, testMVPOptimizer.py | --doublebuffer flag + tiler selection |
| DeeployTest/test_siracusa_tiled_config.py | L2_DOUBLEBUFFER_TRAINING_MODELS (SimpleMLP, Autoencoder, DSCNN) |
| DeeployTest/test_platforms.py | test_siracusa_tiled_training_l2_doublebuffer |
| .github/workflows/ci-platform-siracusa-tiled.yml | New CI job for the marker `training and l2 and doublebuffer` |

Total: 6 files, +125/-12 lines. The DB pass itself is untouched.

Follow-ups (deferred — not blocking)

  • L3 DB matrix (ResNet8 / MobileNetV1 / CCT) — needs defaultMemLevel=L3 validation.
  • Real fix for the multi-consumer dealloc bug in DB pass — would let SoftmaxCrossEntropyLossGrad come off DB_OPT_OUT_OPS.
  • GVSoC cycle deltas to confirm DB actually wins on the heavy backward Gemms.

🤖 Generated with Claude Code

runwangdl and others added 7 commits May 10, 2026 19:00

Wires --doublebuffer through the tiled training/optimizer entry points
(testMVPTraining.py, testMVPOptimizer.py) by selecting a new
TrainingDBTiler. The DB pass itself is left untouched; instead,
TrainingDBTiler.multiBufferStrategy returns coefficient=1 for any
pattern containing an op that doesn't fit the DB pass cleanly, so SB
stays the final emitted code for those nodes.

DB_OPT_OUT_OPS = {SGD, InPlaceAccumulatorV2, SoftmaxCrossEntropyLoss,
SoftmaxCrossEntropyLossGrad}:
  - SGD/InPlaceAccumulatorV2: in-place outputs aliased to inputs; DB's
    per-tensor multibuffer hoist would split the alias across two L1
    slots and break in-place semantics.
  - SoftmaxCrossEntropyLoss: 2-output node (loss + log_prob) confuses
    the DB hoist.
  - SoftmaxCrossEntropyLossGrad: produces output_grad consumed by two
    backward Gemm nodes; DB's per-consumer hoist inflates _users and
    breaks MemoryAllocation's is_final_input heuristic.

Also adds _isScalarBuffer to DBTiler.multiBufferStrategy so scalar
tensors (e.g. the loss output) are kept single-buffered.

Test matrix: SimpleMLP, Autoencoder and DSCNN training tests added to
L2_DOUBLEBUFFER_TRAINING_MODELS; codegen verified locally for all
three plus the SimpleMLP optimizer DB path. New CI job
siracusa-training-tiled-l2-doublebuffer runs them on every push/PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Pre-commit isort wanted the new
  from test_siracusa_tiled_config import L2_DOUBLEBUFFER_TRAINING_MODELS as ...
on its own line and the bare KERNELS/MODELS imports re-grouped.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

CI on PR #22 caught autoencoder DB producing constant losses across
all 4 optimizer steps (model not learning):
  [MSE] loss=0.010760  (×4)
  computed=0.099 ref=0.649, computed=0.100 ref=1.146, ...
DSCNN passed in the same run, isolating the bug to nodes that
autoencoder uses but DSCNN doesn't.

Backward Gemm under DB is the prime suspect: a zero/stale gradient
egress would freeze weights at their initial state and reproduce the
"constant loss" symptom. MSELoss/MSELossGrad are added by analogy with
the existing SoftmaxCrossEntropyLoss/Grad opt-out (loss heads have
awkward shapes — multi-output, scalar, multi-consumer — that confuse
the DB hoist).

Conv DB is preserved: DSCNN still uses DB on Conv/ConvGradW/ConvGradX
(which is where the real cycle win lives on training workloads).

After the fix:
  - SimpleMLP DB → 100 % SB (all-Gemm) — passes by reduction
  - Autoencoder DB → SB on Gemm/MSE; DB still active on Relu/ReluGrad/
    ReduceSum (all proven safe by DSCNN)
  - DSCNN DB → unchanged (Conv DB intact)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

L3 DB enabling:
  - Add TrainingDBOnlyL3Tiler (DB only on L3↔L2 hop, leaves L2→L1 SB so
    the 2 MB L2 staging budget isn't doubled). Mirrors the inference
    DBOnlyL3Tiler pattern.
  - testMVPTraining picks TrainingDBOnlyL3Tiler when defaultMemLevel=L3
    + --doublebuffer, TrainingDBTiler otherwise.
  - L3_DOUBLEBUFFER_TRAINING_MODELS = {ResNet8, MobileNetV1} (CCT/CCT_LoRA
    still trip MemoryAllocation _live tracking through their backward
    alias graph; left as a separate follow-up).
  - New pytest case test_siracusa_tiled_training_l3_doublebuffer.
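
The tiler selection in the list above could look roughly like the following. This is a sketch under assumptions: `select_tiler_name` is a hypothetical helper, the `--defaultMemLevel` flag spelling and the `SBTiler` fallback name are assumed, and tilers are returned as names rather than classes so the example stays self-contained. The `--doublebuffer` flag and the two training tiler names are taken from this PR.

```python
import argparse


def select_tiler_name(default_mem_level: str, doublebuffer: bool) -> str:
    """Pick a tiler name from the memory level and the --doublebuffer flag."""
    if not doublebuffer:
        return "SBTiler"  # assumed name of the single-buffer tiler
    if default_mem_level == "L3":
        # DB only on the L3<->L2 hop; L2->L1 stays single-buffered.
        return "TrainingDBOnlyL3Tiler"
    return "TrainingDBTiler"


parser = argparse.ArgumentParser()
parser.add_argument("--doublebuffer", action="store_true")
parser.add_argument("--defaultMemLevel", default="L2")  # assumed flag name

# Explicit argv so the sketch runs outside the real test scripts.
args = parser.parse_args(["--doublebuffer", "--defaultMemLevel", "L3"])
print(select_tiler_name(args.defaultMemLevel, args.doublebuffer))
```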

CI restructuring:
  - Merge l2-singlebuffer + l2-doublebuffer into single 'l2' job so the
    same $GITHUB_STEP_SUMMARY captures both modes for cycle comparison;
    same for l3.
  - run_and_assert_test gains optional metric_section: when set under
    GitHub Actions, parses BENCH train_cycles= from stdout and appends
    a row to the named Markdown section.
  - conftest.pytest_terminal_summary scans every "Siracusa L? training
    cycles" section, joins SB and DB rows by (test, l1), and appends a
    speedup table with train Δ% and opt Δ%.
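
A rough sketch of the parse-and-join step described above. The exact `BENCH` line format beyond the `train_cycles=` key, the function names, and the sign convention for Δ% (positive means DB needs fewer cycles than SB) are assumptions; the opt Δ% column is omitted for brevity.

```python
import re
from typing import Dict, Optional, Tuple

# Assumed stdout format: a line containing "BENCH train_cycles=<integer>".
BENCH_RE = re.compile(r"BENCH train_cycles=(\d+)")

Key = Tuple[str, int]  # (test name, L1 size)


def parse_train_cycles(stdout: str) -> Optional[int]:
    """Extract the train cycle count from test stdout, if present."""
    match = BENCH_RE.search(stdout)
    return int(match.group(1)) if match else None


def speedup_table(sb: Dict[Key, int], db: Dict[Key, int]) -> str:
    """Join SB and DB rows by (test, l1) and render a Markdown table."""
    lines = [
        "| test | l1 | SB cycles | DB cycles | train Δ% |",
        "|---|---|---|---|---|",
    ]
    # Only keys measured in both modes can be compared.
    for key in sorted(sb.keys() & db.keys()):
        delta = 100.0 * (sb[key] - db[key]) / sb[key]
        lines.append(
            f"| {key[0]} | {key[1]} | {sb[key]} | {db[key]} | {delta:+.1f} |")
    return "\n".join(lines)


# Synthetic rows, in the spirit of the dry-run mentioned in this commit.
sb_rows = {("Autoencoder", 32000): 1000}
db_rows = {("Autoencoder", 32000): 900}
print(speedup_table(sb_rows, db_rows))
```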

Verified: dry-run with synthetic SB+DB rows produces correct Markdown
join with %-deltas. Local codegen passes for the new L3 DB models.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

L2 DB at L1=128 KB shows essentially zero speedup (+0.2% autoencoder,
+0.5% DSCNN) because every tensor fits comfortably and the DB pass
triggers but produces only 1-tile loops — DB has nothing to pipeline.
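
Why a 1-tile loop leaves nothing to pipeline can be seen in a toy cycle model. The model below is an illustrative assumption, not a measurement or the DB pass's actual schedule: it supposes a single DMA engine, fixed per-tile ingress/compute/egress costs, and ideal overlap of DMA with compute. All numbers are made up.

```python
def sb_cycles(n_tiles: int, t_in: int, t_compute: int, t_out: int) -> int:
    """Single-buffering: every tile runs ingress, compute, egress in series."""
    return n_tiles * (t_in + t_compute + t_out)


def db_cycles(n_tiles: int, t_in: int, t_compute: int, t_out: int) -> int:
    """Double-buffering: DMA for neighbouring tiles overlaps with compute."""
    total = t_in  # the first ingress cannot be hidden
    for i in range(n_tiles):
        dma = 0
        if i + 1 < n_tiles:
            dma += t_in   # prefetch the next tile during this compute
        if i > 0:
            dma += t_out  # write back the previous tile during this compute
        total += max(t_compute, dma)
    return total + t_out  # the last egress cannot be hidden


# With one tile there is no neighbouring transfer to overlap with.
print(sb_cycles(1, 10, 30, 10), db_cycles(1, 10, 30, 10))
# With several tiles the transfers hide behind compute.
print(sb_cycles(4, 10, 30, 10), db_cycles(4, 10, 30, 10))
```

In this model the SB and DB totals coincide for a single tile and diverge only once a pattern is split into two or more tiles.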

Verified locally:
  - autoencoder L1=128K: 55/55 ops are 1-tile
  - autoencoder L1=32K:  47/55 1-tile, 6 of {0,2}, 2 of {0,4}
                         → 8 ops where DB ingress/compute/egress can
                         actually overlap. Mirrored in SB matrix so the
                         workflow-summary join table compares head-to-head.
  - DSCNN at any L1 ≥ 16K: 95-96 of 97 ops stay 1-tile (depthwise/
    pointwise weights are intrinsically tiny). Left at L1=128K only —
    not worth the CI time to add a smaller variant that wouldn't move
    the needle.

The interesting DB win is at default_mem_level=L3 (slow L3↔L2 hop), not
L2. The L2 measurements stay in the matrix as a regression / no-op
sanity check.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Previous "Gemm DB doesn't work in training" attribution was wrong.
The real bug: when a training node mixes scalar and non-scalar tensors
(e.g. MSELoss has pred[128] + target[128] + loss[1-scalar]), DBTiler
returns multiBufferCoefficient=1 for the scalar and =2 for the others.
Neither SB.apply (needs all=1) nor DB.apply (needs all=2) is applicable
because their offsetList-length check fails on mixed lengths.
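
The failure mode can be modelled in a few lines. This is a simplification under assumptions: the real passes check offsetList lengths, which is reduced here to a check on the per-tensor coefficients, and the function names are hypothetical.

```python
from typing import Dict, Optional


def sb_applicable(coeffs: Dict[str, int]) -> bool:
    """SB handles a pattern only if every tensor is single-buffered."""
    return all(c == 1 for c in coeffs.values())


def db_applicable(coeffs: Dict[str, int]) -> bool:
    """DB handles a pattern only if every tensor is double-buffered."""
    return all(c == 2 for c in coeffs.values())


def select_pass(coeffs: Dict[str, int]) -> Optional[str]:
    """Return the pass that would emit DMA code, or None if neither fits."""
    if sb_applicable(coeffs):
        return "SB"
    if db_applicable(coeffs):
        return "DB"
    return None  # no pass claims the pattern, so no DMA code is emitted


# MSELoss under the per-tensor strategy: scalar loss gets 1, the rest get 2.
mse_loss = {"pred": 2, "target": 2, "loss": 1}
print(select_pass(mse_loss))
```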

Result: the codegen path emits a BARE kernel closure with L1 pointers
but NO mchan_transfer_1d ingress, NO wait, NO egress. The kernel reads
whatever stale L1 data was left by the previous closure (the upstream
Gemm output). MSE computes garbage → "constant loss 0.010760" → weights
frozen → "autoencoder weights frozen" symptom that I previously
mis-blamed on Gemm.

Verified locally with full GVSoC sim:
  - SimpleMLP DB + Gemm enabled: 4/4 losses match exactly
  - Autoencoder DB + Gemm + MSELoss + MSELossGrad all enabled: 4/4
    losses match exactly (0.649001, 1.146989, 0.961321, 1.092661 —
    same as SB reference)
  - DSCNN DB + Gemm enabled: 4/4 PASSED
  - All L2 SB regression: 5/5 PASSED

Fix: in TrainingDBTiler.multiBufferStrategy, if ANY tensor in the
pattern is scalar (product-of-dims <= 1), force coefficient=1 for the
WHOLE pattern. SB.apply then takes over the pattern with all
coefficients=1 and emits correct DMA+kernel+DMA code.
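
A minimal sketch of the whole-pattern rule, assuming a pattern is represented as a mapping from tensor name to shape (a hypothetical representation; the real TrainingDBTiler.multiBufferStrategy signature is not shown in this PR):

```python
from math import prod
from typing import Dict, Sequence


def pattern_coefficients(shapes: Dict[str, Sequence[int]]) -> Dict[str, int]:
    """Assign one multi-buffer coefficient to every tensor in a pattern."""
    # A tensor counts as scalar when the product of its dims is <= 1.
    any_scalar = any(prod(shape) <= 1 for shape in shapes.values())
    # One scalar forces the whole pattern to SB, so coefficients never mix.
    coeff = 1 if any_scalar else 2
    return {name: coeff for name in shapes}


# MSELoss from the description: pred[128], target[128], scalar loss.
print(pattern_coefficients({"pred": (128,), "target": (128,), "loss": (1,)}))
# A pattern without scalars remains a DB candidate.
print(pattern_coefficients({"A": (64, 128), "B": (128, 32), "C": (64, 32)}))
```

Because every tensor in a pattern now receives the same coefficient, exactly one of the SB or DB passes can claim the pattern.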

Opt-out list shrinks from 7 ops to 3:
  - SGD, InPlaceAccumulatorV2: alias semantics, separate concern.
  - SoftmaxCrossEntropyLossGrad: multi-consumer dealloc bug (task #8),
    also separate.

(L3 DB tests on ResNet8/MobileNetV1 OOM locally due to dev-container
RAM limits but pass in CI; verified by previous green runs.)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Untiled Siracusa CI runs ~5 min per push on the same hosted runner pool
as the DB training tests; it doesn't exercise anything DB does. Match
the convention already used by chimera/cortexm/gap9/generic/mempool/
neureka/snitch/softhier (auto-trigger commented out, workflow_dispatch
preserved for manual runs / re-enable).

After this, only ci-lint.yml and ci-platform-siracusa-tiled.yml
auto-trigger on push/PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>