Major Optimization Patch#69
Open
sudo-owen wants to merge 65 commits into
Open
Conversation
Closes Phase 0.1 of OPT_PLAN: locks per-turn SLOAD/SSTORE budgets for effect-heavy, forced-switch, and multi-mon-switch turn shapes alongside the existing clean-trade scenario. Records the four numbers in a comment block so the shadow layer has a concrete budget to clear. PerTurnTickEffect is a 50-LOC mock with the RoundStart/RoundEnd/AfterDamage bits set, avoiding the StatBoosts+BurnStatus dependency chain for an instrumentation-only test where only the access pattern matters. https://claude.ai/code/session_01AWNvQmDFzSK7sYMDXtyXD5
…nager buffer Adds the OPT_PLAN §4 API surface: TurnSubmission struct, submitTurnMoves (per-turn buffer with full sig verification + first-of-batch sync from engine turnId), and executeBuffered (drains every pending turn in one tx via flag-based dispatch per §6.1, breaks on game-over). Buffer layout matches OPT_PLAN §3 exactly (one uint256 per turn): 8/16/104 bits for p0 move/extra/salt + same for p1. Counters packed into a single slot: (numTurnsExecuted | numTurnsBuffered | lastSubmitTimestamp). executeBuffered lives on the manager rather than the engine so the engine stays ignorant of any specific commit manager. Manager dispatches via ENGINE.getPlayerSwitchForTurnFlagForBattleState between iterations. resetCallContext gains four additional transient clears (tempRNG, koOccurredFlag, tempPreDamage, effectsDirtyBitmap) so batched in-tx execution matches legacy per-tx execution where the EVM auto-clears all transients on tx entry. Production never calls resetCallContext, so the ~400-gas overhead is test-only. https://claude.ai/code/session_01AWNvQmDFzSK7sYMDXtyXD5
…ed flow
20 new tests across four suites:
- BatchEquivalenceTest (4 tests): B in {2, 4, 8} + multi-batch byte-equal
vs legacy executeWithDualSignedMoves. Asserts BattleData, MonStates,
active mons, KO bitmaps, winner all match across two parallel battles.
- BufferSubmissionTest (12 tests): happy path, relayer submission, wrong
committer/revealer signer, missing committer sig (regression for the
unilateral-revealer attack from OPT_PLAN §9), missing revealer sig,
wrong turnId, replay, battle-not-started, empty-buffer execute,
counter accounting, timestamp tracking.
- BatchEdgeTest (5 tests): forced-switch dispatch (flag != 2), single-
side switch, mid-batch game-over (ex advances by actually-executed not
buffered count; remaining entries become dead), legacy->batched and
batched->legacy mode alternation in the same battle.
- BatchGasTest (3 benchmarks): B in {2, 4, 8} comparison of legacy vs
batched total gas. Surfaces the critical finding that the batched flow
is currently MORE expensive than legacy (~33-36% more) -- the per-turn
buffer + counter SSTOREs dominate, and the cross-sub-turn EVM warm-
storage discount alone isn't enough to recoup. Shadow layer (deferred
Phase 1) needed to deliver the gas-savings claim from OPT_PLAN §1.
BatchHelper.sol centralizes _buildTurnSubmission / _submitTurnMoves /
_executeBuffered for reuse across the four suites.
https://claude.ai/code/session_01AWNvQmDFzSK7sYMDXtyXD5
OPT_PLAN.md gains a §11 concrete phase-by-phase todo (Phase 0.1 + 2 done; 0.5 / 1 / 2.5 deferred with reasons) and a §12 decision log capturing the calls made during implementation: scope reduction to defer the shadow layer, executeBuffered on the manager (not engine), buffer keyed by battleKey, single-uint256 packing, resetCallContext extension over parallel function, asymmetric-team game-over test design, and the critical gas finding showing batched is currently more expensive than legacy by ~33-36%. Snapshot deltas reflect the resetCallContext extension: +1-7k gas per existing gas test due to four added TSTOREs per resetCallContext call. All test-only paths -- production code never calls resetCallContext, so real-world gas is unaffected. https://claude.ai/code/session_01AWNvQmDFzSK7sYMDXtyXD5
Implements the full shadow infrastructure described in OPT_PLAN §5.1/§5.2: eight shadow helpers (MonState, KO bitmap, BattleData, effect slots, effect counts, globalKV) with both shadow and storage branches, lazy-load on writes, dirty-bit tracking, and a `_flushShadow` walker that SSTOREs once at end of batch. New `Engine.executeBatchedTurns(bytes32, uint256[])` external entry activates shadow, loops sub-turns with flag-based dispatch per §6.1, flushes, returns (executedCount, winner). Manager's `executeBuffered` now delegates to it. Design decisions: - Helpers take BattleConfig storage cfg explicitly (not via storageKeyForWrite) so external view getters (getEffects, getKOBitmap, etc.) work from staticcall contexts where the transient cache is empty. - Reads check loaded bit (set only on writes), fall back to SLOAD if not loaded — keeps reads view-compatible (no TSTORE in view functions). - BattleData slot 1 has helpers but stayed on storage refs in _executeInternal to avoid a ~13-function refactor; engine continues writing BattleData via storage refs. CRITICAL gas finding (test/BatchGasTest.sol, B=8 clean trade): legacy (per-turn) : 687k -> 849k (+161k regression) batched (submit+execute) : 937k -> 1172k (+235k regression) batched - legacy gap : +249k -> +323k (gap GREW with shadow) The shadow layer: - Adds ~20k/turn overhead to the LEGACY path (memory pattern instead of storage refs; _shadowActive TLOAD check on every helper call paid even when shadow is inactive). - Saves ~24k/sub-turn on the executeBuffered path (SSTORE coalescing for MonState + effect slots + counts + KO bitmap + globalKV). - Per-submission overhead (~85k each) is unchanged — submission infrastructure, not engine-state infrastructure. The shadow can't recover the 8 x 85k = 680k of submission cost. Conclusion: OPT_PLAN §1's gas-savings claim is NOT architecturally achievable with the per-turn-SSTORE submission scheme. To beat dual-signed-per-turn execution, batching needs a different submission design (Merkle-rooted batch claims, sig aggregation, etc.). The shadow layer is correct and stays in place as the substrate for any future submission redesign, but on its own it's a net loss. All 533 existing tests still pass. Gas snapshots regenerated to reflect the legacy-path regression. https://claude.ai/code/session_01AWNvQmDFzSK7sYMDXtyXD5
This reverts commit 3aa1026.
Records vm.stopAndReturnStateDiff for an end-of-game scenario (8-turn damage trade) and tallies access patterns across: - 8 separate-tx legacy executes (cold/warm per-tx) - 8 separate-tx batched submissions - 1 batched executeBuffered (single tx, sub-turns share warm cache) Headline numbers from `test_accessProfile_endOfGame_8turns`: | metric | legacy | batched | delta | |---------------|--------|---------|-------| | Total SLOADs | 1016 | 1041 | +25 | | Cold SLOADs | 144 | 77 | -67 | | Total SSTOREs | 80 | 97 | +17 | | z->nz SSTOREs | 2 | 11 | +9 | | nz->nz SSTOREs| 54 | 62 | +8 | Interpretation: the EVM warm-cache amortization across sub-turns of one tx delivers exactly the cold-SLOAD savings the design intends (-67 cold SLOADs = ~134k gas). The +249k batched gap comes from the per-submission SSTOREs (9 cold z->nz x ~22k + 8 cold nz->nz x ~5k = ~240k), NOT from engine state. Eliminating per-turn buffer SSTORE (e.g. via rolling-hash commitment) is the highest-leverage optimization. https://claude.ai/code/session_01AWNvQmDFzSK7sYMDXtyXD5
Two changes to enable steady-state warm-SSTORE measurement for the batched submission path: 1. Add `Engine.getStorageKey(bytes32) external view` so move managers can resolve a battleKey to the engine's storageKey (the one MappingAllocator recycles across battles via its free-list). 2. Re-key `SignedCommitManager.moveBuffer` and `bufferCounters` from battleKey to storageKey. Battle N+1 reuses the same buffer/counter slots as battle N (warm nonzero->nonzero SSTOREs ~2.9k instead of cold zero->nonzero ~22.1k). No new state in the manager — the engine's existing allocator does the slot recycling. Realistic-game access tally (`test_realisticGameAccessProfile_steadyState`, mirrors `InlineEngineGasTest.test_consecutiveBattleGas`'s 14-turn move sequence, measured on Battle 2 = steady state): | metric | legacy | batched | delta | |---------------|--------|---------|-------| | SSTOREs total | 145 | 174 | +29 | | - z->nz | 4 | 4 | 0 | <- no cold buffer writes | - nz->nz | 92 | 117 | +25 | <- submissions are warm | Cold SLOADs | 280 | 159 | -121 | <- amortization works | Warm SLOADs | 1499 | 1674 | +175 | Storage I/O gas (EIP-2929 prices): legacy: ~1119k batched: ~984k savings: ~135k per 14-turn game (~10k/turn, scales linearly) The cold-state +249k regression I reported earlier was an artifact of fresh battleKey-keyed slots costing 22k z->nz each on every battle. With storageKey-keyed buffer, steady-state batched is strictly cheaper than legacy in storage I/O — the SLOAD amortization across one tx outweighs the per-submission warm SSTORE overhead. 535 existing tests still pass; snapshots regress by ~300-1000 gas each from the added getStorageKey external view. https://claude.ai/code/session_01AWNvQmDFzSK7sYMDXtyXD5
…ttleData Three steady-state submission optimizations stacked on top of the storageKey-keyed buffer: #3 Drop TurnSubmitted event. The moveBuffer SSTORE is itself the on-chain observable; off-chain replay can index storage diffs. Saves a LOG3 per submission. #5 Replace ENGINE.getCommitContext() + ENGINE.getStorageKey() (2 external calls, ~5 SLOADs) with ENGINE.getSubmitContext() in submitTurnMoves. The new getter returns ONLY what async submission needs (p0, p1, turnId, winnerIndex, storageKey) in one call, with 3 SLOADs. Skips startTimestamp/validator/playerSwitchForTurnFlag reads — none are required at submission time in the batched flow. The non-existent-battle case is rejected via the existing winnerIndex != 2 check (default-zero BattleData fails it). #4 Re-pack BattleData so all per-turn-mutable fields live in slot 1. Slot 0 (p1, p0TeamIndex, p1TeamIndex) becomes immutable during play. Slot 1 absorbs turnId (shrunk uint64 -> uint16: 65k turns/battle is plenty) by shrinking lastExecuteTimestamp uint48 -> uint40 (year 36800 cap, still plenty). External getters keep their uint48/uint256 return types via implicit widening, so no ABI break for clients. Realistic-game access tally (steady state, 14-turn game from test_consecutiveBattleGas) - before -> after this commit: submission SLOADs (summed): 98 -> 56 (-42, mostly from skipping the BattleConfig SLOAD that getCommitContext does) batched cold SLOAD savings vs legacy: 121 -> 149 (-28 more cold reads amortized) batched SLOAD delta total: +25 -> +12 (much closer to legacy) batched SSTORE delta total: +29 -> +29 (unchanged - the buffer + counter SSTOREs are still the architectural floor) Net storage I/O savings vs legacy in steady state: ~135k -> ~223k (per 14-turn game), an extra ~88k from these three changes alone. All 536 tests pass (updated test_submitTurnMoves_nonExistentBattle_reverts to expect BattleAlreadyComplete instead of BattleNotYetStarted, since the getSubmitContext path no longer SLOADs startTimestamp; the default-zero winnerIndex on a non-existent battle still catches it). https://claude.ai/code/session_01AWNvQmDFzSK7sYMDXtyXD5
…ers) Lays the substrate for the OPT_PLAN shadow refactor without wiring it up yet. Adds: - `_batchShadowActive` transient flag (engine-internal; gated by `executeBatchedTurns` once that lands) - `_shadowBattleSlot1` (+ loaded/dirty bits) — packed mirror for the one BattleData storage slot that's mutated every sub-turn - `_shadowMonStateLoaded` / `_shadowMonStateDirty` (+ keyed transient region at `_T_MONSTATE_BASE`) — per-mon mirror, up to 16 mons - Read/write helpers (`_readBattleSlot1Packed`, `_writeBattleSlot1Packed`, field-level get/set for winnerIndex/playerSwitchForTurnFlag/turnId/etc., `_readMonStatePacked`, `_writeMonStatePacked`) - Flush routines (`_flushShadowBattleSlot1`, `_flushShadowMonStates`) - `_setLastExecAndIncrementTurnId` — combined end-of-turn helper that the shadow-active path will eventually call None of these are wired up yet — `_executeInternal` and its callees still mutate storage directly via storage refs, so the legacy path is unchanged. The next commit refactors callsites to route through the helpers and adds the `executeBatchedTurns` external entry. All 536 tests still pass. https://claude.ai/code/session_01AWNvQmDFzSK7sYMDXtyXD5
…ial) Step 1 of the OPT_PLAN shadow refactor: route the BattleData slot-1 accesses inside `_executeInternal` and `_checkAndSetWinnerIfGameOver` through the shadow helpers added in the previous checkpoint. Refactored callsites (all functionally identical when shadow inactive since the helpers fall straight through to direct SLOAD/SSTORE): _executeInternal: - start-of-turn winnerIndex check -> _getWinnerIndex - turnId read -> _getTurnId - prev/curr flag swap -> _setPrevPlayerSwitchForTurnFlag + _getPlayerSwitchForTurnFlag - single-player flag branch -> _getPlayerSwitchForTurnFlag once into local - end-of-turn 3-field write -> _setLastExecAndIncrementTurnId - final winnerIndex check -> _getWinnerIndex _checkAndSetWinnerIfGameOver: - winnerIndex read + write -> _getWinnerIndex / _setWinnerIndex _dealDamageInternal: - winnerIndex check -> _getWinnerIndex WIP remainder (still on storage refs, will be converted next): - `battle.activeMonIndex` reads (29 callsites — _unpackActiveMonIndex patterns throughout the engine) - `battle.activeMonIndex = ...` write in `_handleSwitch` - `battle.playerSwitchForTurnFlag = ...` write in `switchActiveMon` - `battle.winnerIndex` reads in `_checkForGameOverOrKO`, `_handleEffects` - `battle.turnId` reads in `_handleMove`, `_handleSwitch` Until those land, `executeBatchedTurns` cannot activate shadow safely (stale reads would break correctness). All 536 tests still pass with this partial refactor since shadow remains dormant. https://claude.ai/code/session_01AWNvQmDFzSK7sYMDXtyXD5
…urns
Finishes the OPT_PLAN tier-1 shadow refactor. All slot-1 reads and
writes inside the engine's execute path now route through the shadow
helpers; when `_batchShadowActive` is set (only inside the new
`executeBatchedTurns` engine entry), reads pull from the transient
mirror and writes defer to it, with a single SSTORE flush at end of
batch.
Engine changes:
- Refactored all 29 `_unpackActiveMonIndex(battle.activeMonIndex, ...)`
callsites + the `_setActiveMonIndex` write in `_handleSwitch` to use
`_getActiveMonIndex` / `_setActiveMonIndexPacked` helpers.
- All remaining `battle.winnerIndex` / `battle.turnId` /
`battle.playerSwitchForTurnFlag` reads route through their respective
helpers (`_getWinnerIndex`, `_getTurnId`, `_getPlayerSwitchForTurnFlag`).
- `switchActiveMon`'s `battle.playerSwitchForTurnFlag` write uses
`_setPlayerSwitchForTurnFlag`.
- New external entry `executeBatchedTurns(bytes32, uint256[])`:
- Auth-gated to `config.moveManager`
- Activates `_batchShadowActive` for the loop
- Unpacks each entry, runs flag-based dispatch (§6.1), calls
`_executeInternal`, resets per-turn transients between sub-turns
- Flushes shadow slot 1 once at end via `_flushShadowBattleSlot1`
- Returns (executed, winner)
Manager change:
- `SignedCommitManager.executeBuffered` collects buffered entries into
a memory array and calls `ENGINE.executeBatchedTurns(battleKey, entries)`
in one external call. No per-iteration `resetCallContext` round-trip.
Realistic 14-turn steady-state game (BatchAccessProfileRealisticTest):
| metric | legacy | batched | delta |
|----------------|--------|---------|-------|
| SLOADs total | 2054 | 1261 | -793 |
| - cold | 280 | 131 | -149 |
| - warm | 1774 | 1130 | -644 |
| SSTOREs total | 119 | 114 | -5 |
| - nz->nz | 83 | 81 | -2 |
Storage I/O cost: legacy ~1089k -> batched ~706k = save ~383k per
14-turn game (~27k/turn, ~35% reduction).
Compared to previous baseline (before shadow refactor): +160k more
savings on top of the existing ~223k advantage. Total batched
storage I/O advantage now ~35% of legacy.
Bonus: legacy SSTOREs dropped 145 -> 119 because
`_setLastExecAndIncrementTurnId`'s explicit RMW coalesces the 3
end-of-turn slot-1 writes into one SSTORE more aggressively than
Solidity's optimizer was previously achieving.
All 536 tests pass.
https://claude.ai/code/session_01AWNvQmDFzSK7sYMDXtyXD5
Adds per-(player, monIndex) transient mirror for MonState so writes during executeBatchedTurns dedup across sub-turns (one SSTORE per dirty slot at batch end instead of one per turn). Mirrors the BattleData slot-1 shadow design: _readMonStatePacked / _writeMonStatePacked / _flushShadowMonStates plus _loadMonState / _storeMonState memory-pattern helpers used by all MonState mutation + read sites (_dealDamageInternal, _updateMonStateInternal, _handleMove stamina deduct, _inlineRegenStaminaForMon, _computePriorityPlayerIndex, _getDamageCalcContextInternal, _readMonStateDelta, getCPUContext, getMonStatesForSide, etc.). Drops the now-unused _getMonState / _deductStamina. Realistic 14-turn steady-state access tally: batched - legacy = -25 SSTOREs / -915 SLOADs (was -5 / -793 pre-MonState-shadow). Legacy gas snapshots regress ~5-8% from the unpack/repack overhead of the memory pattern in single-turn paths — documented in OPT_PLAN §12 as a known trade-off. BatchGasTest now runs a warmup battle (low HP) to completion before the measured battle so the measured battle reuses the freed storageKey (MappingAllocator steady state). The warmup uses the same flow as the measurement (legacy or batched) so manager buffer slots are also warm for the batched path. Asserts storageKey reuse before measuring. BatchAccessProfileRealisticTest adds the same storageKey-reuse precondition asserts for both legacy and batched battle pairs.
Adds test_realisticGameSlotBuckets which re-runs the realistic 14-turn batched flow with state-diff recording and labels each accessed storage slot by its Engine storage region (BD.slotN, BC.slotN, MonState, Effects, GlobalKV, etc.). Surfaces the remaining hot slots after BattleData slot-1 and MonState shadows landed -- top candidates: BC.slot2 (KO bitmap + teamSizes + startTs) with 10 writes/game, and per-mon effect data with ~21 no-op SSTOREs/game.
When executeBatchedTurns reaches a winner, skip _flushShadowMonStates: the next startBattle at this storageKey runs the sentinel-clear loop which overwrites every prior slot regardless, so the un-flushed transient values are recycled either way. Explicitly clears _shadowMonStateLoaded / _shadowMonStateDirty in the skip path so a subsequent executeBatchedTurns in the same tx (multicall, foundry test) doesn't read stale TLOAD bits. BD.slot1 still flushes unconditionally -- getWinner reads it directly post-batch, and that must stay correct. Realistic 14-turn steady-state access delta improves from -25 to -31 SSTOREs vs legacy (6 fewer MonState SSTOREs from the skipped flush). External getMonStateForBattle returns stale values in the gap between batch-end and the next startBattle; accepted trade-off.
BC.slot2 packs moveManager + globalEffectsLength + teamSizes + engineHooksLength + koBitmaps + startTimestamp + hasInlineStaminaRegen + globalKVCount. Of these, only koBitmaps mutates frequently mid-batch (one write per KO; the realistic 14-turn steady-state game shows ~5-6 of the 10 BC.slot2 writes are koBitmaps). Shadow JUST the koBitmaps uint16 into a transient (not the whole slot), so reads of immutable slot-2 fields stay as direct SLOADs in legacy mode -- no TLOAD-check overhead on every moveManager / teamSizes / startTimestamp access. New helpers: _readKoBitmaps (shadow-aware read), _loadShadowKoBitmaps (lazy SLOAD-into-transient on first write), _writeKoBitmaps, _flushShadowKoBitmaps. Refactor _getKOBitmap, _setMonKO, _clearMonKO to route through them. Other field writes during the batch (globalKVCount bump, etc.) keep doing direct SSTORE; the flush at end-of-batch overwrites only the koBitmaps bits in storage so the shadowed value wins. Flush is unconditional -- koBitmaps is part of public API (getKOBitmap, getBattleEndContext, getCPUContext) and the OnBattleEnd hook fires in the same tx for game-ending batches, so storage must be coherent before executeBatchedTurns returns. Realistic 14-turn steady-state delta: batched - legacy = -35 SSTOREs / -936 SLOADs (was -31 / -915 after the game-over flush skip). ~4 fewer SSTOREs from koBit coalescing (~12k gas), ~21 fewer SLOADs from cached-after-first-load reads. Legacy gas snapshots regress ~500 gas per game (~0.1%) from the narrow TLOAD checks. Diagnostic test reads koBitmaps via raw slot 2 load -- the engine's getKOBitmap(battleKey) returns 0 post-game-over due to a pre-existing artifact where _freeStorageKey deletes the battleKey -> storageKey mapping, so getBattle(battleKey) reads an empty config row (already documented at the BattleConfigView builder in getBattle).
Records the two shipped follow-ups (game-over MonState flush skip, narrow koBitmaps shadow) plus the two candidates that were measured and rejected (effect-data no-op guard -- savings were ~2k not ~46k once EIP-2200 was re-read; BC.slot0/1 shadow -- 22k legacy regression vs 14k batched gain). Final realistic-game steady-state delta: batched - legacy = -35 SSTOREs / -936 SLOADs, ~200k gas per 14-turn game.
Counterpart to test_realisticGameAccessProfile_steadyState that uses gasleft() before/after each call instead of vm.startStateDiffRecording. Reports legacy total (sum of 14 per-turn gasleft deltas) vs batched total (sum of 14 submits + 1 executeBuffered). Note in the test docstring that the legacy number is biased in legacy's favor: within one foundry test function, all calls share the EVM warm-set, so legacy turn 2-14 see warm SLOADs instead of the cold SLOADs they would pay as separate production txs. The access-tally test remains the authoritative steady-state comparison.
test_realisticGameSteadyStateGas's within-foundry-tx number understates the legacy cost because legacy turns 2-14 see slots that turn 1 already warmed (EIP-2929 access list is per-tx in production but per-test in foundry). Add a HARNESS BIAS callout to the docstring and a production estimate to the printed output that adds the cold-SLOAD penalty back in for the 260 SLOADs that would re-cold each turn in real deployment. Same callout added to OPT_PLAN §12 with the per-slot proof that the shadow really is coalescing SSTOREs (BD.slot1 14 -> 1, MonStates ~6 -> 0).
Bulk uint104 -> uint96 across engine, manager, commit lib, and tests. EIP-712 DualSignedReveal typehash updated to use uint96 revealerSalt. New moveBuffer entry layout (256 bits tight pack): [p0Move 8 | p0Extra 16 | p0Salt 96 | p1Move 8 | p1Extra 16 | p1Salt 96 | epoch 16] The 16-bit epoch is the low 16 bits of the battleKey OR'd with 1, so it's non-zero and battle-unique (probabilistically). Used by executeBuffered to detect "live for this battle" vs "stale leftover from a prior battle that reused this storageKey but didn't drain its buffer." Replaces the old bufferCounters SSTORE per submit -- removes ~5k gas of write traffic per submitTurnMoves call (~70k per 14-turn game in production). submitTurnMoves: writes only the entry slot (1 SSTORE), no counter. executeBuffered: walks slots from engine.turnId until epoch mismatch. bufferCounters mapping removed. NOTE -- WIP commit before measurement. The 16-bit epoch has a 1/32768 collision risk between two battles ever using the same storageKey; discussing whether to widen to 32 bits (cut salts to 88) before merge. Build green, tests not yet run.
…imination" This reverts commit df02e3a.
…+ submitTurnMoves Removes the committer signature: committer is now identified by msg.sender at call time. Saves 1 ecrecover (~3k gas) + 1 sig in calldata (~1.1k @ 16 gas/byte) per turn on BOTH legacy (executeWithDualSignedMoves) and batched (submitTurnMoves) paths. The unilateral-revealer attack (where a revealer picks any preimage and signs keccak(P*) as the committer's hash) is closed by the msg.sender == committer check: only the actual committer can publish their own move. Trade-off: loses the "anyone can publish with both sigs" relayer capability for the committer side. Each turn, the player who is the committer-this-turn (parity) must submit themselves. Revealer doesn't need to be online at submit time (their sig is included in the committer's submission). Updates so far: - Structs.sol: TurnSubmission drops committerSig field - SignedCommitManager.sol: executeWithDualSignedMoves and submitTurnMoves drop committer sig param + verification, add msg.sender == committer check - BatchHelper.sol: _buildTurnSubmission returns (entry, committerAddr) so callers can vm.prank; _submitTurnMoves wraps the call in vm.prank - BufferSubmissionTest: rewrote relayer-can-submit (now must revert), removed wrong-committer-sig test (no committer sig now), kept revealer-side checks - 5 batched-flow tests: dropped cSig arg, added vm.prank(committer) before call Remaining: SignedCommitManager.t.sol (~18 call sites), InlineEngineGasTest, SignedCommitManagerGasBenchmark, StandardAttackPvPGasTest still need the same mechanical update. Several tests in SignedCommitManager.t.sol verify the committer-sig model specifically (wrongCommitter, replayAcrossBattle via sig recovery, etc.) -- those need rethinking, not just mechanical replacement.
Removes the committer EIP-712 signature from executeWithDualSignedMoves
and submitTurnMoves. The committer is now identified by msg.sender at
call time. The unilateral-revealer attack (revealer picks any preimage
and signs keccak(P*) as the committer's hash) is closed by the
msg.sender == committer check: only the actual committer can publish
their own move with their own preimage.
Trade-off: loses "anyone can publish with both sigs" for the committer
side. Each turn, whichever player is the committer-by-parity must
submit themselves. The revealer side still signs off-chain so they
don't need to be online at submit time.
Per-turn savings (verified on the realistic 14-turn steady-state game):
Legacy (14 separate executeWithDualSignedMoves calls):
before: 1,992,441 gas -> after: 1,940,018 gas (-52,423, ~3.7k/turn)
Batched submit (14 submitTurnMoves):
before: 283,289 gas -> after: 194,977 gas (-88,312, ~6.3k/submit)
Batched total:
before: 2,121,570 gas -> after: 2,033,280 gas (-88,290)
Production estimate (each turn as its own tx; legacy assumes cold start
per tx via cold-SLOAD penalty add-back): batched saves ~426k vs legacy
production (~15.5% per 14-turn game), up from ~390k (14%) before.
Storage I/O is unchanged (-35 SSTOREs / -936 SLOADs vs legacy in the
access tally); savings are pure compute (1 fewer ecrecover, less keccak
for the SignedCommit struct hash) + ~70 bytes less calldata per call
(~1.1k @ 16 gas/byte).
Test changes:
- TurnSubmission.committerSig field removed
- 5 tests in SignedCommitManager.t.sol that verified the old committer-
sig security model deleted or inverted:
- test_revert_replayAttack_differentBattle: deleted (no committer sig
to bind to a specific battleKey; replay defended by msg.sender)
- test_executeWithDualSigned_thirdPartyRelay_succeeds: inverted to
test_revert_executeWithDualSigned_nonCommitterCannotSubmit (relayer
can no longer submit on committer's behalf)
- test_revert_executeWithDualSigned_wrongCommitterSigner: deleted (no
committer sig)
- test_revert_executeWithDualSigned_committerSigForWrongHash: deleted
- test_revert_executeWithDualSigned_unilateralRevealerAttack: kept,
mechanism updated (msg.sender check instead of sig recovery)
- test_revert_replayPrevented_sameBlockAttempt: expected error updated
to PlayerNotAllowed (msg.sender on wrong turn)
- All callers of executeWithDualSignedMoves updated to pass committer
as vm.prank(committer) and drop the cSig arg
- BatchHelper._buildTurnSubmission returns (entry, committerAddr) so
callers can vm.prank; _submitTurnMoves wraps in vm.prank
- BufferSubmissionTest's relayer-can-submit became non-committer-cannot
Adds gasleft() markers inside executeBatchedTurns + _executeInternal that accumulate per-region totals into transient slots (0x200000..0x20000E) and emit a single GasProfile event at the end of each batch carrying the breakdown. Adds ~63k of self-instrumentation overhead per 14-turn game (~3.5% of executeBuffered total) -- relative distribution across regions is the useful signal, not absolute totals. Regions tracked (see _T_GAS_* constants): B1 entry, B2 decode/setup, B3 reset, B4 flush (framework) R1 setup, R2 priority+RNG, R3 RoundStart, R4 prio move, R5 prio after, R6 other move, R7 other after, R8 RoundEnd, R9 turn-end (per-turn) SP single-player branch (switch turns only) Realistic 14-turn steady-state breakdown (after subtracting B1-accumulator carry from battle 1): Effects dispatch (R3+R5+R7+R8): 843k 47% <- biggest target _handleMove (R4+R6): 628k 35% Priority+RNG, turn-end, setup (R1+R2+R9): 222k 12% Single-player branch (SP): 70k 4% Framework (B1+B2+B3+B4): 37k 2% Marked as removable: remove the _accGas calls + constants + GasProfile event before merging to mainline. Kept here so we can re-run profile measurements after each optimization candidate.
Replaces the 3-call (global + priority + other) pattern at RoundStart and RoundEnd with a single _handleEffectsTriple call that runs all three sections in one stack frame, preserving identical semantics: - Global effects gated by SkipIfGameOver - Priority/other per-mon effects gated by SkipIfGameOverOrMonKO - Same inter-section game-over / KO checks - Same playerSwitchForTurnFlag chain Realistic 14-turn batched executeBuffered (per GasProfile event): Before: 1,862,790 gas After: 1,855,677 gas Saved: 7,113 gas (~0.4%) Per-region (battle 2 alone, subtracting battle 1's accumulated counters): R3 RoundStart: -3,390 (158,668 -> 155,278) R8 RoundEnd: -3,633 (395,939 -> 392,306) Below my original ~10-30k estimate. The IR optimizer + via_ir already inlines internal function calls aggressively, so fusing only saves the small amount of redundant stack-frame setup the optimizer couldn't fold. Still above the 5k keep-threshold I set. Legacy executeWithDualSignedMoves gets the same savings since _executeInternal is shared. AfterMove block (player-effects + global-effects pair, runs twice per turn) NOT fused -- different shape (2 calls vs 3) and includes the _inlineStaminaRegen call between them, so a clean fusion is more invasive for less payoff.
…e findings The per-region gasleft() markers + GasProfile event added ~25k overhead per batched executeBuffered call. The trace served its purpose -- now removed to restore clean production gas numbers. If profile is needed again, the structure is preserved in the prior commit's history (89b440a + scroll-back in OPT_PLAN). Final clean realistic 14-turn steady-state (with single-sig + fusion): LEGACY total (single-tx harness, biased): 1,933,178 gas BATCHED submit (14 submits, summed): 194,813 BATCHED execute (1 executeBuf): 1,831,339 BATCHED total: 2,026,152 Production-legacy estimate (14 separate txs): 2,747,178 Batched saves vs production legacy: 427,026 (~15.5%) Access tally: batched - legacy = -35 SSTOREs / -934 SLOADs (unchanged from prior; storage I/O delta was already at the floor before the single-sig / fusion compute optimizations). OPT_PLAN updated with the full trace findings: - 47% (843k) of executeBuffered = effects dispatch - 35% (628k) = _handleMove (real game logic) - 2% (37k) = framework overhead (near floor) - Skipped: per-turn active-mon cache (correctness: mid-turn switches) - Skipped: preload effects into memory (complexity vs ~30k payoff didn't pencil; queued for effect-heavy benchmark) - Kept: _handleEffectsTriple fusion (~7k/game) - Kept: single-sig drop (~88k/game on batched, ~52k on legacy)
…eInternal
Four batch-friendly structural cleanups in one go, totaling ~68k saved
per 14-turn batched game (~3.4% of executeBuffered):
(1) Drop MonMoves event emission per turn (~28k).
Off-chain consumers reconstruct per-turn moves from the manager-side
moveBuffer SSTOREs (batched) or from the executeWith* calldata (legacy).
No event needed in either case. The _emitMonMoves helper + the
MonMoves event definition are removed.
(2) Drop EngineExecute event emission per turn (~10k).
The manager already emits TurnsExecuted(battleKey, startTurnId,
executedCount, winner) once per batched call, which strictly subsumes
per-turn EngineExecute. For legacy, the executeWithDualSignedMoves /
executeWithMoves tx itself is the on-chain event.
(3) Hoist battleKeyForWrite assignment from _executeInternal to entry
points (~1.4k). In batched mode the value is set once before the
loop instead of N times inside _executeInternal -- saves N-1 TSTOREs.
For legacy entry points the cost is unchanged (one set per call).
(4) Hoist numHooks + hasInlineStaminaRegen reads from _executeInternal
to the caller, threading them through as function params (~3-5k).
Both fields live in BC.slot2 and are constant for a battle's
lifetime; reading once per batch instead of once per turn saves
~13 warm SLOADs.
Realistic 14-turn steady-state delta vs prior commit (94cf55d post-fuse):
LEGACY total : 1,933,178 -> 1,867,567 (-65,611, ~3.4%)
BATCHED execute : 1,831,339 -> 1,762,241 (-69,098, ~3.8%)
BATCHED total : 2,026,152 -> 1,957,203 (-68,949, ~3.4%)
Production saves vs L : 427,026 -> 430,364 (+3,338 widening)
Access tally now shows -906 SLOADs (was -934): the difference is the
~28 _getActiveMonIndex reads inside _emitMonMoves that no longer fire.
All 533 tests pass.
Net regression on the realistic 14-turn steady-state profile (~31k storage saved vs ~190k TLOAD-check overhead). Same shape as the BC.slot0/1 shadow rejection above: high read-to-write ratio kills the win. https://claude.ai/code/session_01AWNvQmDFzSK7sYMDXtyXD5
Each call to _getActiveMonIndex(battleKeyForWrite) compiles down to three TLOADs plus a stack frame the IR optimizer couldn't fold across distinct call points (~300-500g per call, not just one TLOAD as initially estimated). Function-frame caching only — never crosses _handleMove, the only thing that can change active mon via switch moves. Effect / ability lifecycle hooks never call switchActiveMon (verified by grep across src/effects/, src/mons/), so the hoist is safe across _runEffects and _handleEffectsTriple branches. Realistic 14-turn steady-state: batched -136k (-7.8%), legacy -121k (-6.5%). All snapshot suites improved 1-10% per scenario.
…straint) HardReset (src/mons/nirvamma/HardReset.sol) is IMoveSet, BasicEffect with an onAfterMove hook that calls engine.switchActiveMon. Effects can change the active mon mid-_runEffects. Previously-cached `_getActiveMonIndex` values carried across distinct effect-iteration call frames could go stale. Audit of the H hoists by safety class: - Pure compute / internal-only: kept (_computePriorityPlayerIndex, _checkForGameOverOrKO, RoundEnd inline stamina regen). - Cached value consumed before any external call: kept (turn-0 ability activations: no IAbility.activateOnSwitch calls switchActiveMon; _addEffect onApply and removeEffect onRemove: both unpacks complete before the hook call). - Function-frame caching matching legacy semantics: kept (_runEffects; matches the original which also cached at function top + documented with a constraint comment). - Cross-branch caching across _runEffects calls: REVERTED (_handleEffectsTriple) — today's RoundStart/RoundEnd effects don't switch, but adding one would silently break this. Defensive depth. Net realistic 14-turn steady-state vs original baseline: - batched: -126,762 gas (-7.2%, was -136,678 before safety reverts) - legacy: -112,502 gas (-6.0%, was -121,240 before) All 533 tests pass including the 4 HardReset tests.
…PLAN The two preceding revert commits restore the engine to the post-caching-sweep state. This entry documents WHY tiered storage was tried and why it didn't pay on the realistic profile, so future readers don't redo the experiment. Key findings from the analysis: - Theoretical storage savings: ~17.7k/game (113 fewer SLOADs, 9 fewer SSTOREs). - Measured savings: ~3.5k. Implied runtime compute overhead: ~14k. - Engine bytecode actually shrank 174 bytes (not bloat — runtime instructions). - The +5 cold SSTOREs were cold-penalty reshuffling from cold SLOADs (offset). - Realistic profile is ~50% StatBoosts (always external) and has ZERO real effect writes during execute (all SSTOREs are no-ops via MappingAllocator slot reuse) so the write-side win is completely unmeasured. Branch lands back at 1,590,098 batched / 1,712,843 legacy (-172k / -154k cumulative vs the pre-H original baseline).
OPT_PLAN §7 / Phase 2.5. CPU manager now supports a buffered flow that mirrors the PvP batched API: - selectMoveWithStateHint(battleKey, aliceMove, aliceExtra, aliceSalt, CPUContext calldata projectedState) — Alice supplies the projected post-prior-turn state in calldata; the CPU consumes it (unverified) to pick its move. Per §7.1 the hint isn't verified — lying just produces suboptimal CPU play against Alice herself, so there's no incentive to cheat. CPU salt = keccak(timestamp, aliceSalt, turnId) per §7.4 (turnId guards against in-block collisions). Emits CPUTurnSalt(battleKey, turnId, timestamp) for off-chain replay. - executeBuffered(battleKey) — drains the buffer in one tx via engine.executeBatchedTurns. Anyone can call (auth lives at the engine gate via msg.sender == config.moveManager, and this contract IS the moveManager for battles started via it). Fires _afterTurn on game-end. Buffer layout matches SignedCommitManager exactly (same 256-bit packed slot, same counter packing) so engine.executeBatchedTurns consumes either buffer interchangeably. Legacy selectMove stays callable on the same contract; first-of-batch sync to engine's turnId makes mid-battle mode alternation seamless. Tests (test/CPUBatchTest.sol, 7 passing): single-submit-execute, multi-batch counter accounting, legacy↔batched mode alternation, empty-buffer revert, non-p0 revert, game-over revert (HP=20 1-hit-KO orchestration), and the lying-hint scenario (engine state stays consistent under deliberately-wrong CPUContext). Uses MockBatchedCPU (deterministic scripted moves) to avoid coupling tests to real CPU heuristics. Byte-equivalence vs legacy selectMove is NOT verified — legacy salt is keccak(battleKey, msg.sender, timestamp) vs batched's keccak(timestamp, aliceSalt, turnId), so engine RNG differs and damage rolls diverge. Behavioural equivalence (battle completes, state consistent, counters correct) is what the suite asserts. Caveat for production CPUs: OkayCPU/FairCPU/BetterCPU make engine state reads inside calculateMove. In batched mode those return STALE state (engine hasn't advanced past prior buffered turns), so CPU quality drops — exactly the trade-off documented in §7.1. Fix is to migrate each CPU to ctx-only logic; not blocking this ship. Adding storage (moveBuffer + bufferCounters) to CPUMoveManager shifts optimizer choices for all subclasses — BetterCPUInlineGasTest regressed ~387g per turn (~0.2% per scenario). Acceptable for the new feature.
Mirrors test/BatchGasTest.sol pattern for CPU mode. Single-tx harness
gas comparison at B ∈ {4, 8, 14, 20, 30} plus a per-turn access tally
(test_cpuBatchAccessTally_B14) recording each call's storage diff
separately so we get production-accurate cold/warm split.
Empirical findings vs MockBatchedCPU (deterministic, no engine reads):
B=4: legacy 537k, batched 656k (in-harness +119k worse, prod +110k)
B=8: legacy 1.04M, batched 1.28M (in-harness +240k, prod +135k)
B=14: legacy 1.80M, batched 2.22M (in-harness +423k, prod +174k)
B=20: legacy 2.55M, batched 3.16M (in-harness +610k, prod +217k)
B=30: legacy 3.81M, batched 4.74M (in-harness +929k, prod +296k)
Per-turn harness regression is a CONSTANT ~30k (independent of B), so
the cold-tx penalty savings (~270k for B=14, scaling slightly with N)
never overtake it. CPU batched is a UX feature (async queueing) but
not a gas optimization at any practical B.
The access-tally test gives the authoritative cold-touch counts that
ground the production estimate (vs the heuristic in BatchGasTest):
Legacy: 280 cold first-touches across 14 selectMove txs (~20/tx)
Batched submits: 112 cold across 14 submit txs (~8/tx)
Batched executeBuffered: 33 cold (1 tx)
Delta: 135 cold first-touches saved in batched = ~270k saved.
This reverts commit 8adaf61.
….5)" This reverts commit 5986f36.
Reframes Phase 2.5 as a separate contract instead of patching the legacy CPUMoveManager. The "CPU" is a phantom opponent address; all decision logic lives off-chain (player runs the transpiled engine locally to pick the CPU's response, then submits both moves). Why this works: no counterparty to cheat. The player can submit any CPU move she wants — misrepresenting the CPU's "ideal" response just produces a worse experience for the player herself. Eliminates the entire on-chain ICPU.calculateMove path that legacy OkayCPU/FairCPU/BetterCPU pay for per turn. What was deleted vs the prior design: - CPUContext calldata hint param (~640 bytes calldata) - ICPU.calculateMove STATICCALL dispatch (~700g/submit + body) - keccak salt derivation (~80g) - per-submit CPUTurnSalt event (~2k) Per-submit cost: ~43k (prior hint design) → ~22k (this). Files: - src/cpu/BatchedCPUMoveManager.sol — abstract with submitTurn + executeBuffered + IMatchmaker. Reuses SignedCommitManager's buffer layout exactly so engine.executeBatchedTurns is shared. - test/mocks/SimpleBatchedCPU.sol — minimal concrete leaf (adds startBattle). - test/BatchedCPUTest.sol — 6 functional tests (submit-execute, multi-batch counter accounting, empty-buffer / non-p0 / post-game-over reverts, buffered-turn readback). - test/BatchedCPUGasTest.sol — gas + cold-access tally vs OkayCPU. Measured at B=14 (test/BatchedCPUGasTest.sol): in-harness: legacy 2,637,557 batched 2,030,352 -23.0% / -607k prod estimate: legacy ~3.49M batched ~2.53M ~-28% / -960k cold delta: 279 -> 92 (-187 cold first-touches in prod) Legacy CPU contracts (CPUMoveManager, CPU, OkayCPU, FairCPU, BetterCPU) and their tests are completely untouched — this is purely additive. Battles choose one model at startBattle time via the moveManager field; mid-battle alternation between the two CPU contracts isn't supported (legacy and batched are separate products). OPT_PLAN §7 rewritten to reflect the new design; §11 Phase 2.5 marked done with measured numbers. All 543 tests pass.
Four separate transient slots (_turnP0MoveEncoded, _turnP1MoveEncoded, _turnP0Salt, _turnP1Salt) become one packed _turnTransient: [0..7] p0 packedMoveIndex (storedMoveIndex | IS_REAL_TURN_BIT) [8..23] p0 extraData [24..127] p0 salt [128..135] p1 packedMoveIndex [136..151] p1 extraData [152..255] p1 salt Per-side IS_REAL_TURN_BIT preserved so _getCurrentTurnMove can still detect "this side's transient is populated" and fall back to storage when not — DefaultCommitManager's execute(battleKey) flow keeps working unchanged. Per call: - executeWithMoves / executeWithSingleMove / executeBatchedTurns per iter: 4 TSTOREs -> 1 TSTORE. - executeBatchedTurns inter-iter reset: 4 -> 1 TSTORE. - setMove mid-execute (Sleep override): now TLOAD + RMW + TSTORE instead of plain TSTORE. +200g per sleep-tick. Rare, net positive. Measured (realistic 14-turn + B=14 CPU batched): PvP batched execute: 1,590,098 -> 1,565,215 (-24,883 / -1.6%) PvP legacy single-tx: 1,712,843 -> 1,687,503 (-25,340 / -1.5%) CPU batched (B=14): 2,030,352 -> 1,997,760 (-32,592 / -1.6%) CPU legacy OkayCPU: 2,637,557 -> 2,608,227 (-29,330 / -1.1%) Snapshot suites improved across the board (Inline_Execute -4.6k, FirstBattle/ThirdBattle -17.3k, SecondBattle -18.6k, StandardAttackPvP -2.1k per turn, BetterCPU -0.5-2k per scenario). All 543 tests pass. Cumulative vs original baseline: batched: 1,762,241 -> 1,565,215 = -197,026 (-11.2%) legacy: 1,867,567 -> 1,687,503 = -180,064 (-9.6%)
… STATICCALL Repacks the BatchedCPUMoveManager's per-storageKey counter slot to also carry the immutable p0 address + an observed gameOverFlag — letting submitTurn auth and game-over checks happen via a single SLOAD instead of a STATICCALL into the engine's getSubmitContext. Layout (256 bits, keyed by storageKey so it reuses across battles via the engine's MappingAllocator pattern): [0..30] numExecuted (uint31) [31] gameOverFlag (1 bit) [32..63] numBuffered (uint32) [64..95] lastSubmitTs (uint32, year 2106 overflow) [96..255] p0 (160-bit address) Plus a separate `storageKeyOf[battleKey]` cache lets us skip getStorageKey on subsequent submits (battleKey -> storageKey is immutable per battle). Hot path (cache hit, all submits after the first): 1 SLOAD storageKeyOf[battleKey] 1 SLOAD bufferState[storageKey] -> p0, gameOver, counters 1 SSTORE moveBuffer[storageKey][nextTurnId] 1 SSTORE bufferState[storageKey] (preserves p0, increments numBuffered) Cold path (first submit per battle): falls back to getSubmitContext to populate both caches, sync counters to engine's turnId, and continue. executeBuffered: - Sets gameOverFlag if winner != 0 so subsequent submits revert fast. - Cold engine.end() timeouts won't propagate here (battle ends without touching this manager's state) but the manager isn't load-bearing for that flow anyway — it just won't accept new submits, which is the correct behavior. Measured at B=14 (test/BatchedCPUGasTest.sol): Submits: 301,998 -> 284,442 (-17,556 / -5.8%) Execute: 1,695,762 -> 1,694,541 (-1,221) Total: 1,997,760 -> 1,978,983 (-18,777 / -0.9%) Cold first-touches: 92 -> 79 (-13 -> ~-26k production cold penalty) In-harness saves vs legacy at B=14 went from 607,205 to 629,244 (+22k). All 543 tests pass.
Adds `Engine.executeWithDualSignedMovesDirect(...)` — an opt-in entry
point for battles started with `moveManager = address(0)` that lets the
caller skip the SignedCommitManager STATICCALL + the redundant
`getCommitAuthForDualSigned` STATICCALL by doing the EIP-712 sig
verification and auth inline in the engine.
Battles with `moveManager` set continue to go through the manager
unchanged. The engine has its own EIP-712 domain ("Engine","1") so sigs
signed for the manager DON'T verify against the engine's direct path —
prevents cross-contamination if someone tries to relay a manager-bound
sig.
Caveat (documented in the function): stall-timeout via Engine.end()
requires either a `validator` set on the battle or hitting
MAX_BATTLE_DURATION. The inline timeout path calls into a commit
manager (which doesn't exist here). Battles using the direct path
should set a validator if they need stall-timeout semantics.
Measured at B=14 (test/EngineDualSignedDirectTest.sol):
via manager: 1,741,827 gas
via engine direct: 1,696,946 gas
saved: 44,881 gas (~3.2k per turn / ~2.6%)
Snapshot regression: ~300g per scenario across EngineGasTest /
StandardAttackPvPGasTest / etc. — bytecode bloat from EIP712 base +
new function. Acceptable for the opt-in win on the new path.
All 549 tests pass (was 543 + 6 new in EngineDualSignedDirectTest).
Two small cleanups to BatchedCPUMoveManager that are correctness-preserving simplifications, with small production savings hidden by the test harness's single-tx warmth bias: 1. submitTurn cache-miss path: skip the `bufferState[storageKey]` SLOAD. Cache miss implies first submit of this battle, so we unconditionally reset `packed` to (ctxTurnId, ctxP0). Any prior battle's leftover state at this storageKey (gameOver flag, old numExecuted from MappingAllocator reuse) is overwritten — the new battle owns the slot. Saves ~2k cold SLOAD per first-submit per battle in production. 2. executeBuffered _afterBattle: use cached p0 from `packed` instead of an extra STATICCALL into `engine.getPlayersForBattle`. Saves ~3k per game-end transition. In the test harness, the warmup populates the bufferState slot before the measured battle starts (MappingAllocator reuses storageKey), so the cache-miss SLOAD is already warm in the test. The savings only show up in production where each submit is its own tx. All 549 tests pass. No measurable in-harness gas change (within noise ~79g at B=14).
Adds three additive APIs so external IMoveSet/IAbility contracts can collapse
the canonical "loop getEffects to dedup then addEffect" and "check globalKV
flag then setGlobalKV" patterns into single calls, and read both sides' stats
+ state + effects in one staticcall:
- addEffectIfNotPresent(target, mon, effect, data) -> bool
Coalesces the 17-site ability idempotency-guard pattern. Storage-side
scan against live + tombstoned slots; only calls _addEffectInternal when
not already present.
- getAndInitGlobalKV(key, valueIfZero) -> previousValue
Read + conditional init in one call. Useful for once-per-battle flag
patterns (~5 sites). Eagerly initializes if previous was zero, so
callers that need conditional set-after-work should keep the split form.
- getMoveContext(battleKey, atkPlayer, atkMon, defPlayer, defMon)
Returns MonStats + MonState + EffectInstance[] for both sides in one
view. Replaces the 4-7 individual getMonStatsForBattle /
getMonStateForBattle / getEffects callbacks the worst-offender custom
moves do today. Sentinel deltas are sanitized to 0 to match
getMonStateForBattle semantics; tombstoned slots are filtered out.
No existing call sites are migrated; all 557 existing tests pass. Snapshots
show a small (~+1000g per execute) dispatch-table regression from the three
new selectors, which migration of custom moves to these APIs will recoup
several times over.
Eight new tests in EngineMoveAPITest cover correctness and write-context
gating for all three APIs.
Removes getBattleValidator and getMoveManager — both declared in IEngine and implemented in Engine, but never called from any src/ contract (no matchmaker, commit manager, validator, hook, mon, or test references). Pure dead dispatch-table weight. Verified by direct grep across src/, test/, script/: only the IEngine declarations and Engine implementations themselves reference these names. Saves ~250-650g per external call to the engine on every code path, ~3-4k per battle setup. Partial offset against the +1000g/execute regression from the three new coalesced move-facing APIs in 6276d2c. Also corrects the CommitContext doc comment — the lightweight context's real benefit over BattleContext is fewer encoded return fields (cheaper ABI encode/decode), not fewer SLOADs (both getters load the same storage slots). Confirmed by measuring a CommitContext → BattleContext swap that regressed execute paths by +5k each. All 557 tests pass.
Removes getMonStateForStorageKey and getPrevPlayerSwitchForTurnFlagForBattleState.
Neither was called from any production contract — only from tests:
- EngineTest.sol used getMonStateForStorageKey(battleKey, …) 4× with battleKey
as the storageKey arg. That's the same data getMonStateForBattle returns
(which internally does _resolveStorageKey), so the test swap is a no-op.
- BatchEquivalenceTest._assertBattlesEqual read
getPrevPlayerSwitchForTurnFlagForBattleState on two battles. Replaced with
a getBattle(key) destructure pulling prevPlayerSwitchForTurnFlag off
BattleData. Tests don't care about per-call gas cost.
Recovers another ~500-1200g per execute path on top of the two zero-caller
removals in 0952574. Combined with that commit, the 4 removals recoup
roughly 60-65% of the +APIs dispatch regression from 6276d2c.
All 557 tests pass.
…Present Replaces the canonical "iterate getEffects to dedup, then addEffect" idiom across 13 sites in 12 mon contracts with a single addEffectIfNotPresent call. Each migration drops ~7 lines of boilerplate + one STATICCALL for getEffects + the in-move loop iteration. Sites migrated (all 12 are clean dedup-on-address pattern): aurox: IronWall, UpOnly ekineki: SneakAttack (special: guards entire body, uses returned bool) embursa: Tinderclaws (activateOnSwitch only; _removeBurnIfPresent kept) gorillax: Angery inutia: ChainExpansion, Interweaving malalien: ActusReus nirvamma: Adaptor pengym: PostWorkout sofabbi: CarrotHarvest xmon: Dreamcatcher, Somniphobia IronWall and SneakAttack use the `if (!addEffectIfNotPresent(...)) return;` form because the effect-presence check guards the entire function body (initial heal / damage calc skipped on subsequent calls in the same lifecycle). NOT migrated (different semantics): ghouliath/RiseFromTheGrave — uses globalKV flag, not getEffects nirvamma/HardReset — data-bit conditional dedup xmon/NightTerrors — find-or-update, not add-only embursa/Tinderclaws site 2 (_removeBurnIfPresent) — remove pattern aurox/GildedRecovery — remove pattern iblivion/Baselight — _findEffect tuple-returning helper All 557 tests pass.
When auditing the 9 globalKV consumer sites for adoption, only RiseFromTheGrave matches the eager-init flag semantics this API was designed for. The other 8 are read-modify-write counters (HoneyBribe, SnackBreak, ModalBolt, Chronoffense turn-anchor), pure reads for branching (PostWorkout, Tinderclaws), explicit set/clear pairs (HeatBeaconLib), or conditional set-after-work patterns (SaviorComplex). None of those can fold into "if previous was zero, set valueIfZero" without losing functionality. One adoption candidate doesn't justify maintaining the API surface — keep the engine lean, leave RiseFromTheGrave on the explicit get + set pair. Removes the IEngine declaration, Engine implementation, the op=2 branch in MockNewAPIMove, and the three EngineMoveAPITest cases. Snapshot impact is within compiler noise (+/- ~90g, no consistent direction) since the function had no production callers to begin with. 554/554 tests pass.
Collapses the 6 individual getMonStatsForBattle / getMonStateForBattle callbacks SneakAttack uses to build its DamageCalcContext into a single getMoveContext call (attacker stats + state, defender stats + state, all covered). Measured savings on the existing EkinekiTest scenarios: test_sneakAttackHitsNonActiveMon: 5,019,144 → 5,006,241 (-12,903) test_sneakAttackOncePerSwitchIn: 5,129,646 → 5,116,773 (-12,873) test_sneakAttackResetsOnSwitchIn: 5,315,998 → 5,300,216 (-15,782) ~13-16k saved per test, dominated by per-move-invocation cost. Validates that getMoveContext earns its dispatch-table weight on high-read sites. Defender mon is a non-active team slot indexed from extraData — works because getMoveContext accepts arbitrary mon indices, not just active. All 554 tests pass.
…-use callers
When auditing migration candidates beyond SneakAttack, every other multi-read
site that fits the (attacker, defender) pair model uses only 2–4 of the
returned fields. Measured impact on HoneyBribe (3 fields used) was +97k
regression per test — the fat context's hidden costs dominate when most data
goes unread:
- 4 stat/state structs fully fetched (4 SLOADs + struct construction each)
- Both effect arrays iterated and memory-allocated even when caller
consults neither
- ~1.1kb of ABI encoding/decoding for the returned struct
SneakAttack used ~10 of the available fields and saved ~13k per call — but
it's the only such site in the codebase. Keeping a one-consumer API plus
paying its dispatch cost on every other engine call doesn't pencil out.
Reverts SneakAttack to its original individual-getter form (keeping the
addEffectIfNotPresent change from d1edf81). Drops getMoveContext + the
_sanitizeMonState helper from Engine.sol, the IEngine declaration, the
MoveContext struct from Structs.sol, and the three EngineMoveAPITest cases.
Net engine surface vs pre-APIs baseline: -4 functions (1 added:
addEffectIfNotPresent; 5 removed: getBattleValidator, getMoveManager,
getMonStateForStorageKey, getPrevPlayerSwitchForTurnFlagForBattleState,
getAndInitGlobalKV).
Snapshot impact vs pre-removal: -638g to -1478g per execute path,
-6457g to -7532g per battle setup, -682g per BetterCPU turn. Versus the
original baseline (commit bdc0505, pre-APIs), all hot paths are now
NEGATIVE gas — the engine is leaner than before this exercise started.
All 551 tests pass.
Covers the full 53-commit arc since main:
- Net IEngine surface delta (+4 / -4, focused-not-bloated)
- Storage layout changes (BattleData slot 0/1 split, MoveDecision pack)
- New managers: BatchedCPUMoveManager + SignedCommitManager batched flow
- Engine internals (transient shadow, slot-1 coalescing, event drop, etc.)
- 12 mon migrations to addEffectIfNotPresent
- Gas-impact table (hot paths net negative vs baseline)
- "Tried and reverted" lessons (getMoveContext, getAndInitGlobalKV,
tiered EffectInstance storage, salt size reduction, etc.)
- Migration guide for downstream consumers
- src/cpu/CPU.sol: unused IMoveSet import
- src/cpu/BetterCPU.sol: unused Type import
- src/cpu/HeuristicCPUBase.sol: rename aggressive-branch bestIdx to
aggBestIdx to silence the shadow warning vs the later same-name local
- script/EngineAndPeriphery.s.sol: unused DefaultValidator/SimplePM
imports + commented-out SimplePM deploy lines
- test/{EngineDualSignedDirectTest,BatchedCPUTest,BatchAccessProfileRealisticTest}.sol:
unused imports flagged by the lint pass
All from forge build warnings; no behavior change.
…, tighten docs
Unused locals (build warnings):
- 4× \`BattleData storage battle = battleData[battleKey]\` reads that the
function body never references (_addEffectInternal, _removeEffectAtSlot,
_dispatchStandardAttack, executeWithDualSignedMovesDirect)
Unused params dropped (with all callsites updated):
- _checkForGameOverOrKO: \`BattleData storage battle\` (6 callers)
- _computePriorityPlayerIndex: \`BattleData storage battle\` (2 callers)
DRY: addEffectIfNotPresent collapses its 3-branch dispatch using
\`_loadEffectsCount\` + the p0/p1 mapping-ternary pattern already used by
\`_getEffectsForTarget\`. -12 LOC, same behavior.
Doc tightening (verbose → terse, kept the WHY):
- executeWithDualSignedMovesDirect natspec (12 → 6 lines)
- getSubmitContext natspec (9 → 3 lines)
- IEngine.addEffectIfNotPresent natspec (4 → 2 lines)
- IEngine.getStorageKey natspec (4 → 3 lines)
- BatchedCPUMoveManager: class @notice/@dev (14 → 7 lines), bufferState layout doc (11 → 8 lines), storageKeyOf doc (4 → 1 line), submitTurn /executeBuffered inline blow-by-blow comments removed where code is self-evident (kept the non-obvious WHY around first-submit reset semantics and the manager-as-moveManager note in executeBuffered). - SignedCommitManager: class @notice/@dev (26 → 8 lines). Dropped the re-walk of the 3-tx vs 1-tx protocol — that's in OPT_PLAN/CHANGELOG. Kept the security WHY (both sigs required vs unilateral revealer). Snapshot refresh: small per-execute deltas (~50-150g) from compiler re-optimization across the engine surface changes in this cleanup pass.
Replaces 66kb of in-progress design speculation with a 171-line summary of
what shipped, what was tried and rejected, what was deferred, and the
measured savings. Cross-references CHANGELOG.md for per-commit detail.
Sections:
- What shipped (core mechanism, shadow layer, layout repacks, single-tx
engine wins, new APIs, surface trims)
- What was skipped (deferred phases + 10 rejected experiments with
measurement-backed reasons)
- Measured savings (CPU batched, PvP legacy, realistic-game steady-state,
final engine-surface deltas)
- 6 lessons worth carrying into v2 work
Adds two helpers that skip the 9-field memory unpack \`_loadMonState\`
performs:
- _readMonStateDelta now operates directly on the packed slot from
_readMonStatePacked (shift+mask per field, sentinel→0 conversion
preserved). Used by the public getMonStateForBattle, so external
move callers benefit too.
- _isMonKnockedOut: dedicated single-bit check for the hot KO-guard
pattern. Saves ~220g/call vs _loadMonState(...).isKnockedOut.
Migrated 10 single-field _loadMonState callsites:
- 8× .isKnockedOut → _isMonKnockedOut
- 2× .speedDelta in _computePriorityPlayerIndex → _readMonStateDelta(...,Speed)
(also removes the now-redundant CLEARED_MON_STATE_SENTINEL check at
the call site since the helper sanitizes internally)
Multi-field _loadMonState callers (10+ remaining) keep the existing
load-modify-store pattern — they amortize the unpack across multiple
fields.
Snapshot deltas vs main: recovers 25-30% of the legacy-path regression
introduced by the MonState shadow refactor. EngineGasTest B1_Execute went
from +69k (+7.6% vs main) to +50k (+5.4%); FirstBattle from +290k (+9.9%)
to +218k (+7.5%); BetterCPU Turn1 from +33k (+13.7%) to +20k (+8.2%).
Residual ~5-10% gap vs main is infrastructure cost the legacy path can't
escape without a full execute-internal split: TLOAD-check tax in
_readMonStatePacked (~100g × ~250 calls/game), BD slot-1 helper-routing,
and the dispatch-table cost of the new external entrypoints. Phase B2
(per-field MonState write helpers for the write-heavy frames) is the
next lever if more recovery is needed.
All 551 tests pass.
…nches
Phase B2 (per OPT_PLAN follow-up). Refactors _updateMonStateInternal to
skip the 9-field memory unpack/repack for the 8 numeric/bool branches it
already special-cases:
- Hp/Stamina/Speed/Atk/Def/SpAtk/SpDef → _addToMonStateDeltaField:
direct packed RMW with sentinel-as-zero semantics. Saves ~410g per
call vs the load-modify-store pattern through _loadMonState.
- ShouldSkipTurn → _setShouldSkipTurn: single-bit flip on the packed
slot.
IsKnockedOut keeps the full-unpack path — its KO-bitmap + winner-check +
OnUpdateMonState side effects are intertwined with the in-memory struct
mutation, so factoring it out for a few hundred gas isn't worth the
restructuring.
Snapshot deltas vs pre-B2 (B1 baseline 8816414):
EngineGasTest FirstBattle 3,138k → 3,128k (-10k)
EngineGasTest SecondBattle 3,193k → 3,180k (-13k)
EngineGasTest ThirdBattle 2,510k → 2,500k (-10k)
InlineEngine B1_Execute 937k → 935k (-2k)
InlineEngine FirstBattle 2,771k → 2,761k (-10k)
Per-execute hot paths in EngineGasTest/BetterCPUInlineGasTest unchanged
because the mock attacks they exercise don't call engine.updateMonState.
Savings appear on battle-lifecycle paths (KO handling, stamina regen)
and on the Inline harness's Battle setups.
Total recovery vs main (B1+B2 combined): ~25-30% of the legacy-path
regression. Residual ~5-9% gap vs main is the architectural cost of the
shadow infrastructure on the legacy code path (TLOAD-check tax in
_readMonStatePacked, BD slot-1 helper-routing, new dispatch entries) —
recoverable only via Option A (parallel execute internals).
All 551 tests pass.
…eric branches" This reverts commit a1cd1e6.
Each shadow-routed helper (_loadMonState, _readMonStatePacked, _getActiveMonIndex, _isMonKnockedOut, _getWinnerIndex, _getTurnId, _loadKoBitmaps, ...) used to TLOAD _batchShadowActive at every call to decide whether to consult the shadow stack. _executeInternal now reads _batchShadowActive once at the top and threads bool isBatched through: _handleMove, _handleEffects, _handleEffectsTriple, _runEffects, _runSingleEffect, _handleSwitch, _addEffectInternal, _activateAbility, _inlineAbilityActivation, _removeEffectAtSlot, _computePriorityPlayerIndex, _dealDamageInternal, _updateMonStateInternal, _dispatchStandardAttackInternal, _inlineStandardAttack, _inlineStaminaRegen, _inlineRegenStaminaForMon, _getDamageCalcContextInternal, _checkForGameOverOrKO, _checkAndSetWinnerIfGameOver External entries (addEffect, addEffectIfNotPresent, removeEffect, dispatchStandardAttack, validatePlayerMoveForBattle, getCPUContext, getValidationContext, getBattleState, getMonStatesForSide, computePriorityPlayerIndex, getDamageCalcContext) read _batchShadowActive once and pass the bool. Public no-bool overloads remain for backward compatibility with mons/effects calling into IEngine. Gas (vs branch tip before this commit): EngineGasTest: -11k to -88k per scenario (-2% to -3%) StandardAttackPvPGasTest: -8k to -8.5k per turn (-6% to -10%) BetterCPUInlineGasTest: -2k to -8k per turn (-3% to -10%) EngineOptimizationTest: -13k / -22k vs origin/main: legacy regression now sits at +3-5% (was +5-15%); StandardAttackPvP Turn0_Lead now beats main outright. All 546 tests pass.
The threading refactor passes bool isBatched explicitly through every shadow-routed helper. A regression where any helper forgets to forward it (or defaults to false) inside a batched flow would silently break mid-batch state visibility — sub-turn N+1 would read stale storage instead of the shadow value sub-turn N wrote. MockStateProbeMove records the opponent's MonState field via getMonStateForBattle into globalKV. Two tests use it: test_batchedShadow_probeObservesMidBatchDamage — damage P1 in sub-turn 1, probe P1's HpDelta in sub-turn 2, assert probe sees the negative delta (not 0). Between sub-turns the shadow is the only carrier; a mis-threaded read would observe stale storage. test_batchedShadow_probeObservesAccumulatedDamage — two damaging turns back-to-back, then probe in turn 3; asserts the cumulative delta is visible (each turn's shadow write must compose).
After threading bool isBatched through every internal caller, the no-bool
wrappers (e.g. _getActiveMonIndex(battleKey)) were only used by:
- _getPlayerSwitchForTurnFlag(battleKey) at executeBatchedTurns L522
(always batched) and executeWithSingleMove L595 (never batched)
- _readMonStateDelta(...) at getMonStateForBattle (external view)
- _getPrevPlayerSwitchForTurnFlag (dead)
Inline the explicit bool argument at the three live callsites:
- L522 passes true (we just set _batchShadowActive = true above)
- L595 passes false (executeWithSingleMove is a fresh external entry —
_batchShadowActive can only become true inside executeBatchedTurns,
and it's reset to false on that function's exit)
- getMonStateForBattle reads _batchShadowActive once and threads it
Delete 21 wrappers (~108 LOC) including the dead prev-flag helpers.
Net gas: -107g on the executeWithSingleMove path (one TLOAD saved).
Everything else unchanged because the wrappers were unused in the
measured paths. Real win is the surface-area reduction.
All 548 tests pass.
_handleMove read BattleData slot 1 three separate times: turnId at the
top, attacker's activeMonIndex right after, defender's activeMonIndex
just before the move dispatch. All three are the same packed slot;
each was a separate SLOAD because the helpers couldn't be deduplicated
across function call boundaries by the optimizer.
Read the slot once into `slot1Packed`, extract `turnIdCached` and
`cachedPackedActiveMon` via bit shifts, and reuse both for the attacker
and defender mon reads in both the INLINE and EXTERNAL paths.
Safety: the cache is only consumed pre-external-call. Mutation barriers
inside _handleMove are:
- moveSet.move(...) / _inlineStandardAttack → PreDamage effects: these
can call engine.switchActiveMon (PistolSquat / HitAndDip / RoundTrip
do this) — but the cached values were already consumed by then.
- _handleSwitch (in the SWITCH branch): writes slot1, called after the
cache has been used.
- moveSet.stamina() and validator.validateSpecificMoveSelection() are
read-only by interface contract — treated as safe.
Gas per turn:
EngineGasTest: -1.5k to -2.8k per Execute scenario (-0.3% to -0.5%)
StandardAttackPvPGasTest: -1,154g per turn (-1.4%)
BetterCPUInlineGasTest: -906g per BothAttack turn (-0.4%)
EngineOptimizationTest: -1.8k / -2.5k
Multi-battle EngineGasTest scenarios: -10k to -12k
All 548 tests pass.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Closes Phase 0.1 of OPT_PLAN: locks per-turn SLOAD/SSTORE budgets for
effect-heavy, forced-switch, and multi-mon-switch turn shapes alongside
the existing clean-trade scenario. Records the four numbers in a comment
block so the shadow layer has a concrete budget to clear.
PerTurnTickEffect is a 50-LOC mock with the RoundStart/RoundEnd/AfterDamage
bits set, avoiding the StatBoosts+BurnStatus dependency chain for an
instrumentation-only test where only the access pattern matters.
https://claude.ai/code/session_01AWNvQmDFzSK7sYMDXtyXD5