feat(training): enable double-buffering with per-op opt-out by runwangdl · Pull Request #22 · runwangdl/TrainDeeploy

runwangdl · 2026-05-10T19:01:45Z

Summary

Wires --doublebuffer through the tiled training entry points (testMVPTraining.py, testMVPOptimizer.py) via a new TrainingDBTiler.
TrainingDBTiler.multiBufferStrategy returns coefficient=1 for any pattern containing an op that doesn't fit the DB pass cleanly, so SB stays the final emitted code for those nodes — the DB pass itself is untouched.
DB_OPT_OUT_OPS = {SGD, InPlaceAccumulatorV2, SoftmaxCrossEntropyLoss, SoftmaxCrossEntropyLossGrad}:
- SGD, InPlaceAccumulatorV2 — output is _alias'd to input (in-place); DB's per-tensor multibuffer hoist would split the alias across two L1 slots and break in-place semantics.
- SoftmaxCrossEntropyLoss — has 2 outputs (loss + log_prob); DB hoist gets confused on multi-output nodes.
- SoftmaxCrossEntropyLossGrad — output_grad is consumed by 2 downstream Gemm nodes; DB's per-consumer hoist inflates _users and breaks MemoryAllocation's is_final_input heuristic.
Adds _isScalarBuffer so the scalar loss output stays single-buffered (otherwise tripped _hoistMultibufferReferences's shape assertion).

Test plan

SB regression: testMVPTraining SimpleMLP still generates 2050-line TrainingNetwork.c
DB SimpleMLP codegen (2509 lines)
DB Autoencoder codegen (15309 lines)
DB DSCNN codegen (47542 lines)
DB Optimizer codegen on SimpleMLP (all SGD → all SB, 441 lines, expected)
pytest test_platforms.py --collect-only -k training_l2_doublebuffer lists the 3 new nodes
CI job siracusa-training-tiled-l2-doublebuffer green on this PR
GVSoC cycle comparison vs SB baseline (deferred follow-up)

Files changed

File	Why
`DeeployTest/testUtils/tilingUtils.py`	`TrainingDBTiler` + `_isScalarBuffer` + `DB_OPT_OUT_OPS`
`DeeployTest/testMVPTraining.py`, `testMVPOptimizer.py`	`--doublebuffer` flag + tiler selection
`DeeployTest/test_siracusa_tiled_config.py`	`L2_DOUBLEBUFFER_TRAINING_MODELS` (SimpleMLP, Autoencoder, DSCNN)
`DeeployTest/test_platforms.py`	`test_siracusa_tiled_training_l2_doublebuffer`
`.github/workflows/ci-platform-siracusa-tiled.yml`	New CI job for the marker `training and l2 and doublebuffer`

Total: 6 files, +125/-12 lines. The DB pass itself is untouched.

Follow-ups (deferred — not blocking)

L3 DB matrix (ResNet8 / MobileNetV1 / CCT) — needs defaultMemLevel=L3 validation.
Real fix for the multi-consumer dealloc bug in DB pass — would let SoftmaxCrossEntropyLossGrad come off DB_OPT_OUT_OPS.
GVSoC cycle deltas to confirm DB actually wins on the heavy backward Gemms.

🤖 Generated with Claude Code

Wires --doublebuffer through the tiled training/optimizer entry points (testMVPTraining.py, testMVPOptimizer.py) by selecting a new TrainingDBTiler. The DB pass itself is left untouched; instead, TrainingDBTiler.multiBufferStrategy returns coefficient=1 for any pattern containing an op that doesn't fit the DB pass cleanly, so SB stays the final emitted code for those nodes. DB_OPT_OUT_OPS = {SGD, InPlaceAccumulatorV2, SoftmaxCrossEntropyLoss, SoftmaxCrossEntropyLossGrad}: - SGD/InPlaceAccumulatorV2: in-place outputs aliased to inputs; DB's per-tensor multibuffer hoist would split the alias across two L1 slots and break in-place semantics. - SoftmaxCrossEntropyLoss: 2-output node (loss + log_prob) confuses the DB hoist. - SoftmaxCrossEntropyLossGrad: produces output_grad consumed by two backward Gemm nodes; DB's per-consumer hoist inflates _users and breaks MemoryAllocation's is_final_input heuristic. Also adds _isScalarBuffer to DBTiler.multiBufferStrategy so scalar tensors (e.g. the loss output) are kept single-buffered. Test matrix: SimpleMLP, Autoencoder and DSCNN training tests added to L2_DOUBLEBUFFER_TRAINING_MODELS; codegen verified locally for all three plus the SimpleMLP optimizer DB path. New CI job siracusa-training-tiled-l2-doublebuffer runs them on every push/PR. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Pre-commit isort wanted the new from test_siracusa_tiled_config import L2_DOUBLEBUFFER_TRAINING_MODELS as ... on its own line and the bare KERNELS/MODELS imports re-grouped. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

CI on PR #22 caught autoencoder DB producing constant losses across all 4 optimizer steps (model not learning): [MSE] loss=0.010760 (×4) computed=0.099 ref=0.649, computed=0.100 ref=1.146, ... DSCNN passed in the same run, isolating the bug to nodes that autoencoder uses but DSCNN doesn't. Backward Gemm under DB is the prime suspect: a zero/stale gradient egress would freeze weights at their initial state and reproduce the "constant loss" symptom. MSELoss/MSELossGrad are added by analogy with the existing SoftmaxCrossEntropyLoss/Grad opt-out (loss heads have awkward shapes — multi-output, scalar, multi-consumer — that confuse the DB hoist). Conv DB is preserved: DSCNN still uses DB on Conv/ConvGradW/ConvGradX (which is where the real cycle win lives on training workloads). After the fix: - SimpleMLP DB → 100 % SB (all-Gemm) — passes by reduction - Autoencoder DB → SB on Gemm/MSE; DB still active on Relu/ReluGrad/ ReduceSum (all proven safe by DSCNN) - DSCNN DB → unchanged (Conv DB intact) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

L3 DB enabling: - Add TrainingDBOnlyL3Tiler (DB only on L3↔L2 hop, leaves L2→L1 SB so the 2 MB L2 staging budget isn't doubled). Mirrors the inference DBOnlyL3Tiler pattern. - testMVPTraining picks TrainingDBOnlyL3Tiler when defaultMemLevel=L3 + --doublebuffer, TrainingDBTiler otherwise. - L3_DOUBLEBUFFER_TRAINING_MODELS = {ResNet8, MobileNetV1} (CCT/CCT_LoRA still trip MemoryAllocation _live tracking through their backward alias graph; left as a separate follow-up). - New pytest case test_siracusa_tiled_training_l3_doublebuffer. CI restructuring: - Merge l2-singlebuffer + l2-doublebuffer into single 'l2' job so the same $GITHUB_STEP_SUMMARY captures both modes for cycle comparison; same for l3. - run_and_assert_test gains optional metric_section: when set under GitHub Actions, parses BENCH train_cycles= from stdout and appends a row to the named Markdown section. - conftest.pytest_terminal_summary scans every "Siracusa L? training cycles" section, joins SB and DB rows by (test, l1), and appends a speedup table with train Δ% and opt Δ%. Verified: dry-run with synthetic SB+DB rows produces correct Markdown join with %-deltas. Local codegen passes for the new L3 DB models. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

L2 DB at L1=128 KB shows essentially zero speedup (+0.2% autoencoder, +0.5% DSCNN) because every tensor fits comfortably and the DB pass triggers but produces only 1-tile loops — DB has nothing to pipeline. Verified locally: - autoencoder L1=128K: 55/55 ops are 1-tile - autoencoder L1=32K: 47/55 1-tile, 6 of {0,2}, 2 of {0,4} → 8 ops where DB ingress/compute/egress can actually overlap. Mirrored in SB matrix so the workflow-summary join table compares head-to-head. - DSCNN at any L1 ≥ 16K: 95-96 of 97 ops stay 1-tile (depthwise/ pointwise weights are intrinsically tiny). Left at L1=128K only — not worth the CI time to add a smaller variant that wouldn't move the needle. The interesting DB win is at default_mem_level=L3 (slow L3↔L2 hop), not L2. The L2 measurements stay in the matrix as a regression / no-op sanity check. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Previous "Gemm DB doesn't work in training" attribution was wrong. The real bug: when a training node mixes scalar and non-scalar tensors (e.g. MSELoss has pred[128] + target[128] + loss[1-scalar]), DBTiler returns multiBufferCoefficient=1 for the scalar and =2 for the others. Neither SB.apply (needs all=1) nor DB.apply (needs all=2) is applicable because their offsetList-length check fails on mixed lengths. Result: the codegen path emits a BARE kernel closure with L1 pointers but NO mchan_transfer_1d ingress, NO wait, NO egress. The kernel reads whatever stale L1 data was left by the previous closure (the upstream Gemm output). MSE computes garbage → "constant loss 0.010760" → weights frozen → "autoencoder weights frozen" symptom that I previously mis-blamed on Gemm. Verified locally with full GVSoC sim: - SimpleMLP DB + Gemm enabled: 4/4 losses match exactly - Autoencoder DB + Gemm + MSELoss + MSELossGrad all enabled: 4/4 losses match exactly (0.649001, 1.146989, 0.961321, 1.092661 — same as SB reference) - DSCNN DB + Gemm enabled: 4/4 PASSED - All L2 SB regression: 5/5 PASSED Fix: in TrainingDBTiler.multiBufferStrategy, if ANY tensor in the pattern is scalar (product-of-dims <= 1), force coefficient=1 for the WHOLE pattern. SB.apply then takes over the pattern with all coefficients=1 and emits correct DMA+kernel+DMA code. Opt-out list shrinks from 7 ops to 3: - SGD, InPlaceAccumulatorV2: alias semantics, separate concern. - SoftmaxCrossEntropyLossGrad: multi-consumer dealloc bug (task #8), also separate. (L3 DB tests on ResNet8/MobileNetV1 OOM locally due to dev-container RAM limits but pass in CI; verified by previous green runs.) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Untiled Siracusa CI runs ~5 min per push on the same hosted runner pool as the DB training tests; it doesn't exercise anything DB does. Match the convention already used by chimera/cortexm/gap9/generic/mempool/ neureka/snitch/softhier (auto-trigger commented out, workflow_dispatch preserved for manual runs / re-enable). After this, only ci-lint.yml and ci-platform-siracusa-tiled.yml auto-trigger on push/PR. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

runwangdl and others added 7 commits May 10, 2026 19:00

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(training): enable double-buffering with per-op opt-out#22

feat(training): enable double-buffering with per-op opt-out#22
runwangdl wants to merge 7 commits into
develfrom
feat/db

runwangdl commented May 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

runwangdl commented May 10, 2026

Summary

Test plan

Files changed

Follow-ups (deferred — not blocking)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant