Skip to content

Major Optimization Patch#69

Open
sudo-owen wants to merge 65 commits into
mainfrom
claude/decouple-engine-move-tracking-ZMINV
Open

Major Optimization Patch#69
sudo-owen wants to merge 65 commits into
mainfrom
claude/decouple-engine-move-tracking-ZMINV

Conversation

@sudo-owen
Copy link
Copy Markdown
Collaborator

Closes Phase 0.1 of OPT_PLAN: locks per-turn SLOAD/SSTORE budgets for
effect-heavy, forced-switch, and multi-mon-switch turn shapes alongside
the existing clean-trade scenario. Records the four numbers in a comment
block so the shadow layer has a concrete budget to clear.

PerTurnTickEffect is a 50-LOC mock with the RoundStart/RoundEnd/AfterDamage
bits set, avoiding the StatBoosts+BurnStatus dependency chain for an
instrumentation-only test where only the access pattern matters.

https://claude.ai/code/session_01AWNvQmDFzSK7sYMDXtyXD5

claude added 30 commits May 21, 2026 20:21
Closes Phase 0.1 of OPT_PLAN: locks per-turn SLOAD/SSTORE budgets for
effect-heavy, forced-switch, and multi-mon-switch turn shapes alongside
the existing clean-trade scenario. Records the four numbers in a comment
block so the shadow layer has a concrete budget to clear.

PerTurnTickEffect is a 50-LOC mock with the RoundStart/RoundEnd/AfterDamage
bits set, avoiding the StatBoosts+BurnStatus dependency chain for an
instrumentation-only test where only the access pattern matters.

https://claude.ai/code/session_01AWNvQmDFzSK7sYMDXtyXD5
…nager buffer

Adds the OPT_PLAN §4 API surface: TurnSubmission struct, submitTurnMoves
(per-turn buffer with full sig verification + first-of-batch sync from
engine turnId), and executeBuffered (drains every pending turn in one tx
via flag-based dispatch per §6.1, breaks on game-over).

Buffer layout matches OPT_PLAN §3 exactly (one uint256 per turn): 8/16/104
bits for p0 move/extra/salt + same for p1. Counters packed into a single
slot: (numTurnsExecuted | numTurnsBuffered | lastSubmitTimestamp).

executeBuffered lives on the manager rather than the engine so the engine
stays ignorant of any specific commit manager. Manager dispatches via
ENGINE.getPlayerSwitchForTurnFlagForBattleState between iterations.

resetCallContext gains four additional transient clears (tempRNG,
koOccurredFlag, tempPreDamage, effectsDirtyBitmap) so batched in-tx
execution matches legacy per-tx execution where the EVM auto-clears all
transients on tx entry. Production never calls resetCallContext, so the
~400-gas overhead is test-only.

https://claude.ai/code/session_01AWNvQmDFzSK7sYMDXtyXD5
…ed flow

20 new tests across four suites:

- BatchEquivalenceTest (4 tests): B in {2, 4, 8} + multi-batch byte-equal
  vs legacy executeWithDualSignedMoves. Asserts BattleData, MonStates,
  active mons, KO bitmaps, winner all match across two parallel battles.

- BufferSubmissionTest (12 tests): happy path, relayer submission, wrong
  committer/revealer signer, missing committer sig (regression for the
  unilateral-revealer attack from OPT_PLAN §9), missing revealer sig,
  wrong turnId, replay, battle-not-started, empty-buffer execute,
  counter accounting, timestamp tracking.

- BatchEdgeTest (5 tests): forced-switch dispatch (flag != 2), single-
  side switch, mid-batch game-over (ex advances by actually-executed not
  buffered count; remaining entries become dead), legacy->batched and
  batched->legacy mode alternation in the same battle.

- BatchGasTest (3 benchmarks): B in {2, 4, 8} comparison of legacy vs
  batched total gas. Surfaces the critical finding that the batched flow
  is currently MORE expensive than legacy (~33-36% more) -- the per-turn
  buffer + counter SSTOREs dominate, and the cross-sub-turn EVM warm-
  storage discount alone isn't enough to recoup. Shadow layer (deferred
  Phase 1) needed to deliver the gas-savings claim from OPT_PLAN §1.

BatchHelper.sol centralizes _buildTurnSubmission / _submitTurnMoves /
_executeBuffered for reuse across the four suites.

https://claude.ai/code/session_01AWNvQmDFzSK7sYMDXtyXD5
OPT_PLAN.md gains a §11 concrete phase-by-phase todo (Phase 0.1 + 2 done;
0.5 / 1 / 2.5 deferred with reasons) and a §12 decision log capturing
the calls made during implementation: scope reduction to defer the
shadow layer, executeBuffered on the manager (not engine), buffer keyed
by battleKey, single-uint256 packing, resetCallContext extension over
parallel function, asymmetric-team game-over test design, and the
critical gas finding showing batched is currently more expensive than
legacy by ~33-36%.

Snapshot deltas reflect the resetCallContext extension: +1-7k gas per
existing gas test due to four added TSTOREs per resetCallContext call.
All test-only paths -- production code never calls resetCallContext, so
real-world gas is unaffected.

https://claude.ai/code/session_01AWNvQmDFzSK7sYMDXtyXD5
Implements the full shadow infrastructure described in OPT_PLAN §5.1/§5.2:
eight shadow helpers (MonState, KO bitmap, BattleData, effect slots,
effect counts, globalKV) with both shadow and storage branches, lazy-load
on writes, dirty-bit tracking, and a `_flushShadow` walker that SSTOREs
once at end of batch. New `Engine.executeBatchedTurns(bytes32, uint256[])`
external entry activates shadow, loops sub-turns with flag-based dispatch
per §6.1, flushes, returns (executedCount, winner). Manager's
`executeBuffered` now delegates to it.

Design decisions:
- Helpers take BattleConfig storage cfg explicitly (not via
  storageKeyForWrite) so external view getters (getEffects, getKOBitmap,
  etc.) work from staticcall contexts where the transient cache is empty.
- Reads check loaded bit (set only on writes), fall back to SLOAD if
  not loaded — keeps reads view-compatible (no TSTORE in view functions).
- BattleData slot 1 has helpers but stayed on storage refs in
  _executeInternal to avoid a ~13-function refactor; engine continues
  writing BattleData via storage refs.

CRITICAL gas finding (test/BatchGasTest.sol, B=8 clean trade):

  legacy (per-turn)        : 687k -> 849k  (+161k regression)
  batched (submit+execute) : 937k -> 1172k (+235k regression)
  batched - legacy gap     : +249k -> +323k (gap GREW with shadow)

The shadow layer:
- Adds ~20k/turn overhead to the LEGACY path (memory pattern instead
  of storage refs; _shadowActive TLOAD check on every helper call paid
  even when shadow is inactive).
- Saves ~24k/sub-turn on the executeBuffered path (SSTORE coalescing
  for MonState + effect slots + counts + KO bitmap + globalKV).
- Per-submission overhead (~85k each) is unchanged — submission
  infrastructure, not engine-state infrastructure. The shadow can't
  recover the 8 x 85k = 680k of submission cost.

Conclusion: OPT_PLAN §1's gas-savings claim is NOT architecturally
achievable with the per-turn-SSTORE submission scheme. To beat
dual-signed-per-turn execution, batching needs a different submission
design (Merkle-rooted batch claims, sig aggregation, etc.). The
shadow layer is correct and stays in place as the substrate for any
future submission redesign, but on its own it's a net loss.

All 533 existing tests still pass. Gas snapshots regenerated to
reflect the legacy-path regression.

https://claude.ai/code/session_01AWNvQmDFzSK7sYMDXtyXD5
Records vm.stopAndReturnStateDiff for an end-of-game scenario (8-turn
damage trade) and tallies access patterns across:
- 8 separate-tx legacy executes (cold/warm per-tx)
- 8 separate-tx batched submissions
- 1 batched executeBuffered (single tx, sub-turns share warm cache)

Headline numbers from `test_accessProfile_endOfGame_8turns`:

  | metric        | legacy | batched | delta |
  |---------------|--------|---------|-------|
  | Total SLOADs  | 1016   | 1041    | +25   |
  | Cold SLOADs   | 144    | 77      | -67   |
  | Total SSTOREs | 80     | 97      | +17   |
  | z->nz SSTOREs | 2      | 11      | +9    |
  | nz->nz SSTOREs| 54     | 62      | +8    |

Interpretation: the EVM warm-cache amortization across sub-turns of one
tx delivers exactly the cold-SLOAD savings the design intends (-67 cold
SLOADs = ~134k gas). The +249k batched gap comes from the per-submission
SSTOREs (9 cold z->nz x ~22k + 8 cold nz->nz x ~5k = ~240k), NOT from
engine state. Eliminating per-turn buffer SSTORE (e.g. via rolling-hash
commitment) is the highest-leverage optimization.

https://claude.ai/code/session_01AWNvQmDFzSK7sYMDXtyXD5
Two changes to enable steady-state warm-SSTORE measurement for the
batched submission path:

1. Add `Engine.getStorageKey(bytes32) external view` so move managers
   can resolve a battleKey to the engine's storageKey (the one
   MappingAllocator recycles across battles via its free-list).

2. Re-key `SignedCommitManager.moveBuffer` and `bufferCounters` from
   battleKey to storageKey. Battle N+1 reuses the same buffer/counter
   slots as battle N (warm nonzero->nonzero SSTOREs ~2.9k instead of
   cold zero->nonzero ~22.1k). No new state in the manager — the
   engine's existing allocator does the slot recycling.

Realistic-game access tally (`test_realisticGameAccessProfile_steadyState`,
mirrors `InlineEngineGasTest.test_consecutiveBattleGas`'s 14-turn move
sequence, measured on Battle 2 = steady state):

  | metric        | legacy | batched | delta |
  |---------------|--------|---------|-------|
  | SSTOREs total | 145    | 174     | +29   |
  | - z->nz       | 4      | 4       |  0    |  <- no cold buffer writes
  | - nz->nz      | 92     | 117     | +25   |  <- submissions are warm
  | Cold SLOADs   | 280    | 159     | -121  |  <- amortization works
  | Warm SLOADs   | 1499   | 1674    | +175  |

Storage I/O gas (EIP-2929 prices):
  legacy:  ~1119k
  batched: ~984k
  savings: ~135k per 14-turn game (~10k/turn, scales linearly)

The cold-state +249k regression I reported earlier was an artifact of
fresh battleKey-keyed slots costing 22k z->nz each on every battle.
With storageKey-keyed buffer, steady-state batched is strictly cheaper
than legacy in storage I/O — the SLOAD amortization across one tx
outweighs the per-submission warm SSTORE overhead.

535 existing tests still pass; snapshots regress by ~300-1000 gas
each from the added getStorageKey external view.

https://claude.ai/code/session_01AWNvQmDFzSK7sYMDXtyXD5
…ttleData

Three steady-state submission optimizations stacked on top of the
storageKey-keyed buffer:

#3 Drop TurnSubmitted event. The moveBuffer SSTORE is itself the
   on-chain observable; off-chain replay can index storage diffs.
   Saves a LOG3 per submission.

#5 Replace ENGINE.getCommitContext() + ENGINE.getStorageKey() (2
   external calls, ~5 SLOADs) with ENGINE.getSubmitContext() in
   submitTurnMoves. The new getter returns ONLY what async submission
   needs (p0, p1, turnId, winnerIndex, storageKey) in one call, with
   3 SLOADs. Skips startTimestamp/validator/playerSwitchForTurnFlag
   reads — none are required at submission time in the batched flow.
   The non-existent-battle case is rejected via the existing
   winnerIndex != 2 check (default-zero BattleData fails it).

#4 Re-pack BattleData so all per-turn-mutable fields live in slot 1.
   Slot 0 (p1, p0TeamIndex, p1TeamIndex) becomes immutable during play.
   Slot 1 absorbs turnId (shrunk uint64 -> uint16: 65k turns/battle is
   plenty) by shrinking lastExecuteTimestamp uint48 -> uint40 (year
   36800 cap, still plenty). External getters keep their uint48/uint256
   return types via implicit widening, so no ABI break for clients.

Realistic-game access tally (steady state, 14-turn game from
test_consecutiveBattleGas) - before -> after this commit:

  submission SLOADs (summed): 98 -> 56  (-42, mostly from skipping the
                                         BattleConfig SLOAD that
                                         getCommitContext does)
  batched cold SLOAD savings vs legacy: 121 -> 149  (-28 more cold
                                                     reads amortized)
  batched SLOAD delta total: +25  -> +12  (much closer to legacy)
  batched SSTORE delta total: +29 -> +29  (unchanged - the buffer +
                                           counter SSTOREs are still
                                           the architectural floor)

Net storage I/O savings vs legacy in steady state: ~135k -> ~223k
(per 14-turn game), an extra ~88k from these three changes alone.

All 536 tests pass (updated test_submitTurnMoves_nonExistentBattle_reverts
to expect BattleAlreadyComplete instead of BattleNotYetStarted, since
the getSubmitContext path no longer SLOADs startTimestamp; the
default-zero winnerIndex on a non-existent battle still catches it).

https://claude.ai/code/session_01AWNvQmDFzSK7sYMDXtyXD5
…ers)

Lays the substrate for the OPT_PLAN shadow refactor without wiring it up
yet. Adds:

- `_batchShadowActive` transient flag (engine-internal; gated by
  `executeBatchedTurns` once that lands)
- `_shadowBattleSlot1` (+ loaded/dirty bits) — packed mirror for the
  one BattleData storage slot that's mutated every sub-turn
- `_shadowMonStateLoaded` / `_shadowMonStateDirty` (+ keyed transient
  region at `_T_MONSTATE_BASE`) — per-mon mirror, up to 16 mons
- Read/write helpers (`_readBattleSlot1Packed`, `_writeBattleSlot1Packed`,
  field-level get/set for winnerIndex/playerSwitchForTurnFlag/turnId/etc.,
  `_readMonStatePacked`, `_writeMonStatePacked`)
- Flush routines (`_flushShadowBattleSlot1`, `_flushShadowMonStates`)
- `_setLastExecAndIncrementTurnId` — combined end-of-turn helper that
  the shadow-active path will eventually call

None of these are wired up yet — `_executeInternal` and its callees
still mutate storage directly via storage refs, so the legacy path is
unchanged. The next commit refactors callsites to route through the
helpers and adds the `executeBatchedTurns` external entry.

All 536 tests still pass.

https://claude.ai/code/session_01AWNvQmDFzSK7sYMDXtyXD5
…ial)

Step 1 of the OPT_PLAN shadow refactor: route the BattleData slot-1
accesses inside `_executeInternal` and `_checkAndSetWinnerIfGameOver`
through the shadow helpers added in the previous checkpoint.

Refactored callsites (all functionally identical when shadow inactive
since the helpers fall straight through to direct SLOAD/SSTORE):

  _executeInternal:
   - start-of-turn winnerIndex check    -> _getWinnerIndex
   - turnId read                         -> _getTurnId
   - prev/curr flag swap                 -> _setPrevPlayerSwitchForTurnFlag + _getPlayerSwitchForTurnFlag
   - single-player flag branch           -> _getPlayerSwitchForTurnFlag once into local
   - end-of-turn 3-field write           -> _setLastExecAndIncrementTurnId
   - final winnerIndex check             -> _getWinnerIndex

  _checkAndSetWinnerIfGameOver:
   - winnerIndex read + write            -> _getWinnerIndex / _setWinnerIndex

  _dealDamageInternal:
   - winnerIndex check                   -> _getWinnerIndex

WIP remainder (still on storage refs, will be converted next):
- `battle.activeMonIndex` reads (29 callsites — _unpackActiveMonIndex
  patterns throughout the engine)
- `battle.activeMonIndex = ...` write in `_handleSwitch`
- `battle.playerSwitchForTurnFlag = ...` write in `switchActiveMon`
- `battle.winnerIndex` reads in `_checkForGameOverOrKO`, `_handleEffects`
- `battle.turnId` reads in `_handleMove`, `_handleSwitch`

Until those land, `executeBatchedTurns` cannot activate shadow safely
(stale reads would break correctness). All 536 tests still pass with
this partial refactor since shadow remains dormant.

https://claude.ai/code/session_01AWNvQmDFzSK7sYMDXtyXD5
…urns

Finishes the OPT_PLAN tier-1 shadow refactor. All slot-1 reads and
writes inside the engine's execute path now route through the shadow
helpers; when `_batchShadowActive` is set (only inside the new
`executeBatchedTurns` engine entry), reads pull from the transient
mirror and writes defer to it, with a single SSTORE flush at end of
batch.

Engine changes:
- Refactored all 29 `_unpackActiveMonIndex(battle.activeMonIndex, ...)`
  callsites + the `_setActiveMonIndex` write in `_handleSwitch` to use
  `_getActiveMonIndex` / `_setActiveMonIndexPacked` helpers.
- All remaining `battle.winnerIndex` / `battle.turnId` /
  `battle.playerSwitchForTurnFlag` reads route through their respective
  helpers (`_getWinnerIndex`, `_getTurnId`, `_getPlayerSwitchForTurnFlag`).
- `switchActiveMon`'s `battle.playerSwitchForTurnFlag` write uses
  `_setPlayerSwitchForTurnFlag`.
- New external entry `executeBatchedTurns(bytes32, uint256[])`:
  - Auth-gated to `config.moveManager`
  - Activates `_batchShadowActive` for the loop
  - Unpacks each entry, runs flag-based dispatch (§6.1), calls
    `_executeInternal`, resets per-turn transients between sub-turns
  - Flushes shadow slot 1 once at end via `_flushShadowBattleSlot1`
  - Returns (executed, winner)

Manager change:
- `SignedCommitManager.executeBuffered` collects buffered entries into
  a memory array and calls `ENGINE.executeBatchedTurns(battleKey, entries)`
  in one external call. No per-iteration `resetCallContext` round-trip.

Realistic 14-turn steady-state game (BatchAccessProfileRealisticTest):

  | metric         | legacy | batched | delta |
  |----------------|--------|---------|-------|
  | SLOADs total   | 2054   | 1261    | -793  |
  | - cold         | 280    | 131     | -149  |
  | - warm         | 1774   | 1130    | -644  |
  | SSTOREs total  | 119    | 114     | -5    |
  | - nz->nz       | 83     | 81      | -2    |

Storage I/O cost: legacy ~1089k -> batched ~706k = save ~383k per
14-turn game (~27k/turn, ~35% reduction).

Compared to previous baseline (before shadow refactor): +160k more
savings on top of the existing ~223k advantage. Total batched
storage I/O advantage now ~35% of legacy.

Bonus: legacy SSTOREs dropped 145 -> 119 because
`_setLastExecAndIncrementTurnId`'s explicit RMW coalesces the 3
end-of-turn slot-1 writes into one SSTORE more aggressively than
Solidity's optimizer was previously achieving.

All 536 tests pass.

https://claude.ai/code/session_01AWNvQmDFzSK7sYMDXtyXD5
Adds per-(player, monIndex) transient mirror for MonState so writes during
executeBatchedTurns dedup across sub-turns (one SSTORE per dirty slot at
batch end instead of one per turn). Mirrors the BattleData slot-1 shadow
design: _readMonStatePacked / _writeMonStatePacked / _flushShadowMonStates
plus _loadMonState / _storeMonState memory-pattern helpers used by all
MonState mutation + read sites (_dealDamageInternal, _updateMonStateInternal,
_handleMove stamina deduct, _inlineRegenStaminaForMon, _computePriorityPlayerIndex,
_getDamageCalcContextInternal, _readMonStateDelta, getCPUContext,
getMonStatesForSide, etc.). Drops the now-unused _getMonState / _deductStamina.

Realistic 14-turn steady-state access tally: batched - legacy = -25 SSTOREs
/ -915 SLOADs (was -5 / -793 pre-MonState-shadow). Legacy gas snapshots
regress ~5-8% from the unpack/repack overhead of the memory pattern in
single-turn paths — documented in OPT_PLAN §12 as a known trade-off.

BatchGasTest now runs a warmup battle (low HP) to completion before the
measured battle so the measured battle reuses the freed storageKey
(MappingAllocator steady state). The warmup uses the same flow as the
measurement (legacy or batched) so manager buffer slots are also warm
for the batched path. Asserts storageKey reuse before measuring.

BatchAccessProfileRealisticTest adds the same storageKey-reuse precondition
asserts for both legacy and batched battle pairs.
Adds test_realisticGameSlotBuckets which re-runs the realistic 14-turn
batched flow with state-diff recording and labels each accessed storage
slot by its Engine storage region (BD.slotN, BC.slotN, MonState, Effects,
GlobalKV, etc.). Surfaces the remaining hot slots after BattleData slot-1
and MonState shadows landed -- top candidates: BC.slot2 (KO bitmap +
teamSizes + startTs) with 10 writes/game, and per-mon effect data with
~21 no-op SSTOREs/game.
When executeBatchedTurns reaches a winner, skip _flushShadowMonStates: the
next startBattle at this storageKey runs the sentinel-clear loop which
overwrites every prior slot regardless, so the un-flushed transient values
are recycled either way. Explicitly clears _shadowMonStateLoaded /
_shadowMonStateDirty in the skip path so a subsequent executeBatchedTurns
in the same tx (multicall, foundry test) doesn't read stale TLOAD bits.

BD.slot1 still flushes unconditionally -- getWinner reads it directly
post-batch, and that must stay correct.

Realistic 14-turn steady-state access delta improves from -25 to -31
SSTOREs vs legacy (6 fewer MonState SSTOREs from the skipped flush).
External getMonStateForBattle returns stale values in the gap between
batch-end and the next startBattle; accepted trade-off.
BC.slot2 packs moveManager + globalEffectsLength + teamSizes +
engineHooksLength + koBitmaps + startTimestamp + hasInlineStaminaRegen
+ globalKVCount. Of these, only koBitmaps mutates frequently mid-batch
(one write per KO; the realistic 14-turn steady-state game shows ~5-6
of the 10 BC.slot2 writes are koBitmaps).

Shadow JUST the koBitmaps uint16 into a transient (not the whole slot),
so reads of immutable slot-2 fields stay as direct SLOADs in legacy
mode -- no TLOAD-check overhead on every moveManager / teamSizes /
startTimestamp access. New helpers: _readKoBitmaps (shadow-aware read),
_loadShadowKoBitmaps (lazy SLOAD-into-transient on first write),
_writeKoBitmaps, _flushShadowKoBitmaps. Refactor _getKOBitmap,
_setMonKO, _clearMonKO to route through them.

Other field writes during the batch (globalKVCount bump, etc.) keep
doing direct SSTORE; the flush at end-of-batch overwrites only the
koBitmaps bits in storage so the shadowed value wins.

Flush is unconditional -- koBitmaps is part of public API (getKOBitmap,
getBattleEndContext, getCPUContext) and the OnBattleEnd hook fires in
the same tx for game-ending batches, so storage must be coherent
before executeBatchedTurns returns.

Realistic 14-turn steady-state delta: batched - legacy = -35 SSTOREs
/ -936 SLOADs (was -31 / -915 after the game-over flush skip).
~4 fewer SSTOREs from koBit coalescing (~12k gas), ~21 fewer SLOADs
from cached-after-first-load reads. Legacy gas snapshots regress
~500 gas per game (~0.1%) from the narrow TLOAD checks.

Diagnostic test reads koBitmaps via raw slot 2 load -- the engine's
getKOBitmap(battleKey) returns 0 post-game-over due to a pre-existing
artifact where _freeStorageKey deletes the battleKey -> storageKey
mapping, so getBattle(battleKey) reads an empty config row (already
documented at the BattleConfigView builder in getBattle).
Records the two shipped follow-ups (game-over MonState flush skip,
narrow koBitmaps shadow) plus the two candidates that were measured
and rejected (effect-data no-op guard -- savings were ~2k not ~46k
once EIP-2200 was re-read; BC.slot0/1 shadow -- 22k legacy regression
vs 14k batched gain). Final realistic-game steady-state delta:
batched - legacy = -35 SSTOREs / -936 SLOADs, ~200k gas per 14-turn
game.
Counterpart to test_realisticGameAccessProfile_steadyState that uses
gasleft() before/after each call instead of vm.startStateDiffRecording.
Reports legacy total (sum of 14 per-turn gasleft deltas) vs batched
total (sum of 14 submits + 1 executeBuffered).

Note in the test docstring that the legacy number is biased in legacy's
favor: within one foundry test function, all calls share the EVM
warm-set, so legacy turn 2-14 see warm SLOADs instead of the cold
SLOADs they would pay as separate production txs. The access-tally
test remains the authoritative steady-state comparison.
test_realisticGameSteadyStateGas's within-foundry-tx number understates
the legacy cost because legacy turns 2-14 see slots that turn 1 already
warmed (EIP-2929 access list is per-tx in production but per-test in
foundry). Add a HARNESS BIAS callout to the docstring and a production
estimate to the printed output that adds the cold-SLOAD penalty back in
for the 260 SLOADs that would re-cold each turn in real deployment.
Same callout added to OPT_PLAN §12 with the per-slot proof that the
shadow really is coalescing SSTOREs (BD.slot1 14 -> 1, MonStates ~6 -> 0).
Bulk uint104 -> uint96 across engine, manager, commit lib, and tests.
EIP-712 DualSignedReveal typehash updated to use uint96 revealerSalt.

New moveBuffer entry layout (256 bits tight pack):
  [p0Move 8 | p0Extra 16 | p0Salt 96 | p1Move 8 | p1Extra 16 | p1Salt 96 | epoch 16]

The 16-bit epoch is the low 16 bits of the battleKey OR'd with 1, so it's
non-zero and battle-unique (probabilistically). Used by executeBuffered
to detect "live for this battle" vs "stale leftover from a prior battle
that reused this storageKey but didn't drain its buffer." Replaces the
old bufferCounters SSTORE per submit -- removes ~5k gas of write traffic
per submitTurnMoves call (~70k per 14-turn game in production).

submitTurnMoves: writes only the entry slot (1 SSTORE), no counter.
executeBuffered: walks slots from engine.turnId until epoch mismatch.
bufferCounters mapping removed.

NOTE -- WIP commit before measurement. The 16-bit epoch has a 1/32768
collision risk between two battles ever using the same storageKey;
discussing whether to widen to 32 bits (cut salts to 88) before merge.

Build green, tests not yet run.
…+ submitTurnMoves

Removes the committer signature: committer is now identified by msg.sender at
call time. Saves 1 ecrecover (~3k gas) + 1 sig in calldata (~1.1k @ 16 gas/byte)
per turn on BOTH legacy (executeWithDualSignedMoves) and batched (submitTurnMoves)
paths. The unilateral-revealer attack (where a revealer picks any preimage and
signs keccak(P*) as the committer's hash) is closed by the msg.sender == committer
check: only the actual committer can publish their own move.

Trade-off: loses the "anyone can publish with both sigs" relayer capability for
the committer side. Each turn, the player who is the committer-this-turn (parity)
must submit themselves. Revealer doesn't need to be online at submit time
(their sig is included in the committer's submission).

Updates so far:
- Structs.sol: TurnSubmission drops committerSig field
- SignedCommitManager.sol: executeWithDualSignedMoves and submitTurnMoves drop
  committer sig param + verification, add msg.sender == committer check
- BatchHelper.sol: _buildTurnSubmission returns (entry, committerAddr) so callers
  can vm.prank; _submitTurnMoves wraps the call in vm.prank
- BufferSubmissionTest: rewrote relayer-can-submit (now must revert), removed
  wrong-committer-sig test (no committer sig now), kept revealer-side checks
- 5 batched-flow tests: dropped cSig arg, added vm.prank(committer) before call

Remaining: SignedCommitManager.t.sol (~18 call sites), InlineEngineGasTest,
SignedCommitManagerGasBenchmark, StandardAttackPvPGasTest still need the same
mechanical update. Several tests in SignedCommitManager.t.sol verify the
committer-sig model specifically (wrongCommitter, replayAcrossBattle via sig
recovery, etc.) -- those need rethinking, not just mechanical replacement.
Removes the committer EIP-712 signature from executeWithDualSignedMoves
and submitTurnMoves. The committer is now identified by msg.sender at
call time. The unilateral-revealer attack (revealer picks any preimage
and signs keccak(P*) as the committer's hash) is closed by the
msg.sender == committer check: only the actual committer can publish
their own move with their own preimage.

Trade-off: loses "anyone can publish with both sigs" for the committer
side. Each turn, whichever player is the committer-by-parity must
submit themselves. The revealer side still signs off-chain so they
don't need to be online at submit time.

Per-turn savings (verified on the realistic 14-turn steady-state game):
  Legacy (14 separate executeWithDualSignedMoves calls):
    before: 1,992,441 gas  ->  after: 1,940,018 gas  (-52,423, ~3.7k/turn)
  Batched submit (14 submitTurnMoves):
    before: 283,289 gas  ->  after: 194,977 gas      (-88,312, ~6.3k/submit)
  Batched total:
    before: 2,121,570 gas  ->  after: 2,033,280 gas  (-88,290)

Production estimate (each turn as its own tx; legacy assumes cold start
per tx via cold-SLOAD penalty add-back): batched saves ~426k vs legacy
production (~15.5% per 14-turn game), up from ~390k (14%) before.

Storage I/O is unchanged (-35 SSTOREs / -936 SLOADs vs legacy in the
access tally); savings are pure compute (1 fewer ecrecover, less keccak
for the SignedCommit struct hash) + ~70 bytes less calldata per call
(~1.1k @ 16 gas/byte).

Test changes:
- TurnSubmission.committerSig field removed
- 5 tests in SignedCommitManager.t.sol that verified the old committer-
  sig security model deleted or inverted:
  - test_revert_replayAttack_differentBattle: deleted (no committer sig
    to bind to a specific battleKey; replay defended by msg.sender)
  - test_executeWithDualSigned_thirdPartyRelay_succeeds: inverted to
    test_revert_executeWithDualSigned_nonCommitterCannotSubmit (relayer
    can no longer submit on committer's behalf)
  - test_revert_executeWithDualSigned_wrongCommitterSigner: deleted (no
    committer sig)
  - test_revert_executeWithDualSigned_committerSigForWrongHash: deleted
  - test_revert_executeWithDualSigned_unilateralRevealerAttack: kept,
    mechanism updated (msg.sender check instead of sig recovery)
  - test_revert_replayPrevented_sameBlockAttempt: expected error updated
    to PlayerNotAllowed (msg.sender on wrong turn)
- All callers of executeWithDualSignedMoves updated to pass committer
  as vm.prank(committer) and drop the cSig arg
- BatchHelper._buildTurnSubmission returns (entry, committerAddr) so
  callers can vm.prank; _submitTurnMoves wraps in vm.prank
- BufferSubmissionTest's relayer-can-submit became non-committer-cannot
Adds gasleft() markers inside executeBatchedTurns + _executeInternal that
accumulate per-region totals into transient slots (0x200000..0x20000E) and
emit a single GasProfile event at the end of each batch carrying the
breakdown. Adds ~63k of self-instrumentation overhead per 14-turn game
(~3.5% of executeBuffered total) -- relative distribution across regions
is the useful signal, not absolute totals.

Regions tracked (see _T_GAS_* constants):
  B1 entry, B2 decode/setup, B3 reset, B4 flush  (framework)
  R1 setup, R2 priority+RNG, R3 RoundStart, R4 prio move, R5 prio after,
  R6 other move, R7 other after, R8 RoundEnd, R9 turn-end                (per-turn)
  SP single-player branch                                                (switch turns only)

Realistic 14-turn steady-state breakdown (after subtracting B1-accumulator
carry from battle 1):
  Effects dispatch (R3+R5+R7+R8):           843k  47%  <- biggest target
  _handleMove (R4+R6):                      628k  35%
  Priority+RNG, turn-end, setup (R1+R2+R9): 222k  12%
  Single-player branch (SP):                 70k   4%
  Framework (B1+B2+B3+B4):                   37k   2%

Marked as removable: remove the _accGas calls + constants + GasProfile
event before merging to mainline. Kept here so we can re-run profile
measurements after each optimization candidate.
Replaces the 3-call (global + priority + other) pattern at RoundStart and
RoundEnd with a single _handleEffectsTriple call that runs all three
sections in one stack frame, preserving identical semantics:
  - Global effects gated by SkipIfGameOver
  - Priority/other per-mon effects gated by SkipIfGameOverOrMonKO
  - Same inter-section game-over / KO checks
  - Same playerSwitchForTurnFlag chain

Realistic 14-turn batched executeBuffered (per GasProfile event):
  Before: 1,862,790 gas
  After:  1,855,677 gas
  Saved:  7,113 gas (~0.4%)

Per-region (battle 2 alone, subtracting battle 1's accumulated counters):
  R3 RoundStart: -3,390  (158,668 -> 155,278)
  R8 RoundEnd:   -3,633  (395,939 -> 392,306)

Below my original ~10-30k estimate. The IR optimizer + via_ir already
inlines internal function calls aggressively, so fusing only saves the
small amount of redundant stack-frame setup the optimizer couldn't fold.
Still above the 5k keep-threshold I set. Legacy executeWithDualSignedMoves
gets the same savings since _executeInternal is shared.

AfterMove block (player-effects + global-effects pair, runs twice per turn)
NOT fused -- different shape (2 calls vs 3) and includes the
_inlineStaminaRegen call between them, so a clean fusion is more
invasive for less payoff.
…e findings

The per-region gasleft() markers + GasProfile event added ~25k overhead
per batched executeBuffered call. The trace served its purpose -- now
removed to restore clean production gas numbers. If profile is needed
again, the structure is preserved in the prior commit's history
(89b440a + scroll-back in OPT_PLAN).

Final clean realistic 14-turn steady-state (with single-sig + fusion):
  LEGACY total (single-tx harness, biased):     1,933,178 gas
  BATCHED submit (14 submits, summed):            194,813
  BATCHED execute (1 executeBuf):               1,831,339
  BATCHED total:                                2,026,152
  Production-legacy estimate (14 separate txs): 2,747,178
  Batched saves vs production legacy:             427,026 (~15.5%)

Access tally: batched - legacy = -35 SSTOREs / -934 SLOADs (unchanged
from prior; storage I/O delta was already at the floor before the
single-sig / fusion compute optimizations).

OPT_PLAN updated with the full trace findings:
  - 47% (843k) of executeBuffered = effects dispatch
  - 35% (628k) = _handleMove (real game logic)
  -  2% (37k) = framework overhead (near floor)
  - Skipped: per-turn active-mon cache (correctness: mid-turn switches)
  - Skipped: preload effects into memory (complexity vs ~30k payoff
    didn't pencil; queued for effect-heavy benchmark)
  - Kept: _handleEffectsTriple fusion (~7k/game)
  - Kept: single-sig drop (~88k/game on batched, ~52k on legacy)
…eInternal

Four batch-friendly structural cleanups in one go, totaling ~68k saved
per 14-turn batched game (~3.4% of executeBuffered):

(1) Drop MonMoves event emission per turn (~28k).
    Off-chain consumers reconstruct per-turn moves from the manager-side
    moveBuffer SSTOREs (batched) or from the executeWith* calldata (legacy).
    No event needed in either case. The _emitMonMoves helper + the
    MonMoves event definition are removed.

(2) Drop EngineExecute event emission per turn (~10k).
    The manager already emits TurnsExecuted(battleKey, startTurnId,
    executedCount, winner) once per batched call, which strictly subsumes
    per-turn EngineExecute. For legacy, the executeWithDualSignedMoves /
    executeWithMoves tx itself is the on-chain event.

(3) Hoist battleKeyForWrite assignment from _executeInternal to entry
    points (~1.4k). In batched mode the value is set once before the
    loop instead of N times inside _executeInternal -- saves N-1 TSTOREs.
    For legacy entry points the cost is unchanged (one set per call).

(4) Hoist numHooks + hasInlineStaminaRegen reads from _executeInternal
    to the caller, threading them through as function params (~3-5k).
    Both fields live in BC.slot2 and are constant for a battle's
    lifetime; reading once per batch instead of once per turn saves
    ~13 warm SLOADs.

Realistic 14-turn steady-state delta vs prior commit (94cf55d post-fuse):
  LEGACY total          :   1,933,178 -> 1,867,567  (-65,611, ~3.4%)
  BATCHED execute       :   1,831,339 -> 1,762,241  (-69,098, ~3.8%)
  BATCHED total         :   2,026,152 -> 1,957,203  (-68,949, ~3.4%)
  Production saves vs L :     427,026 -> 430,364    (+3,338 widening)

Access tally now shows -906 SLOADs (was -934): the difference is the
~28 _getActiveMonIndex reads inside _emitMonMoves that no longer fire.

All 533 tests pass.
Net regression on the realistic 14-turn steady-state profile (~31k
storage saved vs ~190k TLOAD-check overhead). Same shape as the
BC.slot0/1 shadow rejection above: high read-to-write ratio kills
the win.

https://claude.ai/code/session_01AWNvQmDFzSK7sYMDXtyXD5
Each call to _getActiveMonIndex(battleKeyForWrite) compiles down to three TLOADs
plus a stack frame the IR optimizer couldn't fold across distinct call points
(~300-500g per call, not just one TLOAD as initially estimated). Function-frame
caching only — never crosses _handleMove, the only thing that can change
active mon via switch moves. Effect / ability lifecycle hooks never call
switchActiveMon (verified by grep across src/effects/, src/mons/), so the
hoist is safe across _runEffects and _handleEffectsTriple branches.

Realistic 14-turn steady-state: batched -136k (-7.8%), legacy -121k (-6.5%).
All snapshot suites improved 1-10% per scenario.
…straint)

HardReset (src/mons/nirvamma/HardReset.sol) is IMoveSet, BasicEffect with an
onAfterMove hook that calls engine.switchActiveMon. Effects can change the
active mon mid-_runEffects. Previously-cached `_getActiveMonIndex` values
carried across distinct effect-iteration call frames could go stale.

Audit of the H hoists by safety class:
 - Pure compute / internal-only: kept (_computePriorityPlayerIndex,
   _checkForGameOverOrKO, RoundEnd inline stamina regen).
 - Cached value consumed before any external call: kept (turn-0 ability
   activations: no IAbility.activateOnSwitch calls switchActiveMon;
   _addEffect onApply and removeEffect onRemove: both unpacks complete
   before the hook call).
 - Function-frame caching matching legacy semantics: kept (_runEffects;
   matches the original which also cached at function top + documented
   with a constraint comment).
 - Cross-branch caching across _runEffects calls: REVERTED
   (_handleEffectsTriple) — today's RoundStart/RoundEnd effects don't
   switch, but adding one would silently break this. Defensive depth.

Net realistic 14-turn steady-state vs original baseline:
 - batched: -126,762 gas (-7.2%, was -136,678 before safety reverts)
 - legacy:  -112,502 gas (-6.0%, was -121,240 before)

All 533 tests pass including the 4 HardReset tests.
claude added 22 commits May 23, 2026 18:04
…PLAN

The two preceding revert commits restore the engine to the post-caching-sweep
state. This entry documents WHY tiered storage was tried and why it didn't
pay on the realistic profile, so future readers don't redo the experiment.

Key findings from the analysis:
 - Theoretical storage savings: ~17.7k/game (113 fewer SLOADs, 9 fewer SSTOREs).
 - Measured savings: ~3.5k. Implied runtime compute overhead: ~14k.
 - Engine bytecode actually shrank 174 bytes (not bloat — runtime instructions).
 - The +5 cold SSTOREs were cold-penalty reshuffling from cold SLOADs (offset).
 - Realistic profile is ~50% StatBoosts (always external) and has ZERO real
   effect writes during execute (all SSTOREs are no-ops via MappingAllocator
   slot reuse) so the write-side win is completely unmeasured.

Branch lands back at 1,590,098 batched / 1,712,843 legacy (-172k / -154k
cumulative vs the pre-H original baseline).
OPT_PLAN §7 / Phase 2.5. CPU manager now supports a buffered flow that
mirrors the PvP batched API:

 - selectMoveWithStateHint(battleKey, aliceMove, aliceExtra, aliceSalt,
   CPUContext calldata projectedState) — Alice supplies the projected
   post-prior-turn state in calldata; the CPU consumes it (unverified)
   to pick its move. Per §7.1 the hint isn't verified — lying just
   produces suboptimal CPU play against Alice herself, so there's no
   incentive to cheat. CPU salt = keccak(timestamp, aliceSalt, turnId)
   per §7.4 (turnId guards against in-block collisions). Emits
   CPUTurnSalt(battleKey, turnId, timestamp) for off-chain replay.
 - executeBuffered(battleKey) — drains the buffer in one tx via
   engine.executeBatchedTurns. Anyone can call (auth lives at the engine
   gate via msg.sender == config.moveManager, and this contract IS the
   moveManager for battles started via it). Fires _afterTurn on game-end.

Buffer layout matches SignedCommitManager exactly (same 256-bit packed
slot, same counter packing) so engine.executeBatchedTurns consumes
either buffer interchangeably. Legacy selectMove stays callable on the
same contract; first-of-batch sync to engine's turnId makes mid-battle
mode alternation seamless.

Tests (test/CPUBatchTest.sol, 7 passing): single-submit-execute,
multi-batch counter accounting, legacy↔batched mode alternation,
empty-buffer revert, non-p0 revert, game-over revert (HP=20 1-hit-KO
orchestration), and the lying-hint scenario (engine state stays
consistent under deliberately-wrong CPUContext). Uses MockBatchedCPU
(deterministic scripted moves) to avoid coupling tests to real CPU
heuristics.

Byte-equivalence vs legacy selectMove is NOT verified — legacy salt is
keccak(battleKey, msg.sender, timestamp) vs batched's
keccak(timestamp, aliceSalt, turnId), so engine RNG differs and damage
rolls diverge. Behavioural equivalence (battle completes, state
consistent, counters correct) is what the suite asserts.

Caveat for production CPUs: OkayCPU/FairCPU/BetterCPU make engine
state reads inside calculateMove. In batched mode those return STALE
state (engine hasn't advanced past prior buffered turns), so CPU
quality drops — exactly the trade-off documented in §7.1. Fix is to
migrate each CPU to ctx-only logic; not blocking this ship.

Adding storage (moveBuffer + bufferCounters) to CPUMoveManager shifts
optimizer choices for all subclasses — BetterCPUInlineGasTest
regressed ~387g per turn (~0.2% per scenario). Acceptable for the
new feature.
Mirrors test/BatchGasTest.sol pattern for CPU mode. Single-tx harness
gas comparison at B ∈ {4, 8, 14, 20, 30} plus a per-turn access tally
(test_cpuBatchAccessTally_B14) recording each call's storage diff
separately so we get production-accurate cold/warm split.

Empirical findings vs MockBatchedCPU (deterministic, no engine reads):
  B=4:  legacy 537k, batched 656k  (in-harness +119k worse, prod +110k)
  B=8:  legacy 1.04M, batched 1.28M (in-harness +240k, prod +135k)
  B=14: legacy 1.80M, batched 2.22M (in-harness +423k, prod +174k)
  B=20: legacy 2.55M, batched 3.16M (in-harness +610k, prod +217k)
  B=30: legacy 3.81M, batched 4.74M (in-harness +929k, prod +296k)

Per-turn harness regression is a CONSTANT ~30k (independent of B), so
the cold-tx penalty savings (~270k for B=14, scaling slightly with N)
never overtake it. CPU batched is a UX feature (async queueing) but
not a gas optimization at any practical B.

The access-tally test gives the authoritative cold-touch counts that
ground the production estimate (vs the heuristic in BatchGasTest):
  Legacy: 280 cold first-touches across 14 selectMove txs (~20/tx)
  Batched submits: 112 cold across 14 submit txs (~8/tx)
  Batched executeBuffered: 33 cold (1 tx)
  Delta: 135 cold first-touches saved in batched = ~270k saved.
Reframes Phase 2.5 as a separate contract instead of patching the legacy
CPUMoveManager. The "CPU" is a phantom opponent address; all decision
logic lives off-chain (player runs the transpiled engine locally to pick
the CPU's response, then submits both moves).

Why this works: no counterparty to cheat. The player can submit any CPU
move she wants — misrepresenting the CPU's "ideal" response just produces
a worse experience for the player herself. Eliminates the entire on-chain
ICPU.calculateMove path that legacy OkayCPU/FairCPU/BetterCPU pay for
per turn.

What was deleted vs the prior design:
 - CPUContext calldata hint param (~640 bytes calldata)
 - ICPU.calculateMove STATICCALL dispatch (~700g/submit + body)
 - keccak salt derivation (~80g)
 - per-submit CPUTurnSalt event (~2k)

Per-submit cost: ~43k (prior hint design) → ~22k (this).

Files:
 - src/cpu/BatchedCPUMoveManager.sol — abstract with submitTurn +
   executeBuffered + IMatchmaker. Reuses SignedCommitManager's buffer
   layout exactly so engine.executeBatchedTurns is shared.
 - test/mocks/SimpleBatchedCPU.sol — minimal concrete leaf (adds
   startBattle).
 - test/BatchedCPUTest.sol — 6 functional tests (submit-execute,
   multi-batch counter accounting, empty-buffer / non-p0 / post-game-over
   reverts, buffered-turn readback).
 - test/BatchedCPUGasTest.sol — gas + cold-access tally vs OkayCPU.

Measured at B=14 (test/BatchedCPUGasTest.sol):
   in-harness:    legacy 2,637,557  batched 2,030,352  -23.0% / -607k
   prod estimate: legacy ~3.49M     batched ~2.53M     ~-28% / -960k
   cold delta:    279 -> 92         (-187 cold first-touches in prod)

Legacy CPU contracts (CPUMoveManager, CPU, OkayCPU, FairCPU, BetterCPU)
and their tests are completely untouched — this is purely additive.
Battles choose one model at startBattle time via the moveManager field;
mid-battle alternation between the two CPU contracts isn't supported
(legacy and batched are separate products).

OPT_PLAN §7 rewritten to reflect the new design; §11 Phase 2.5 marked
done with measured numbers.

All 543 tests pass.
Four separate transient slots (_turnP0MoveEncoded, _turnP1MoveEncoded,
_turnP0Salt, _turnP1Salt) become one packed _turnTransient:

  [0..7]     p0 packedMoveIndex (storedMoveIndex | IS_REAL_TURN_BIT)
  [8..23]    p0 extraData
  [24..127]  p0 salt
  [128..135] p1 packedMoveIndex
  [136..151] p1 extraData
  [152..255] p1 salt

Per-side IS_REAL_TURN_BIT preserved so _getCurrentTurnMove can still
detect "this side's transient is populated" and fall back to storage
when not — DefaultCommitManager's execute(battleKey) flow keeps
working unchanged.

Per call:
 - executeWithMoves / executeWithSingleMove / executeBatchedTurns per
   iter: 4 TSTOREs -> 1 TSTORE.
 - executeBatchedTurns inter-iter reset: 4 -> 1 TSTORE.
 - setMove mid-execute (Sleep override): now TLOAD + RMW + TSTORE
   instead of plain TSTORE. +200g per sleep-tick. Rare, net positive.

Measured (realistic 14-turn + B=14 CPU batched):
 PvP batched execute: 1,590,098 -> 1,565,215  (-24,883 / -1.6%)
 PvP legacy single-tx: 1,712,843 -> 1,687,503 (-25,340 / -1.5%)
 CPU batched (B=14):  2,030,352 -> 1,997,760  (-32,592 / -1.6%)
 CPU legacy OkayCPU:  2,637,557 -> 2,608,227  (-29,330 / -1.1%)

Snapshot suites improved across the board (Inline_Execute -4.6k,
FirstBattle/ThirdBattle -17.3k, SecondBattle -18.6k,
StandardAttackPvP -2.1k per turn, BetterCPU -0.5-2k per scenario).
All 543 tests pass.

Cumulative vs original baseline:
 batched: 1,762,241 -> 1,565,215 = -197,026 (-11.2%)
 legacy:  1,867,567 -> 1,687,503 = -180,064 (-9.6%)
… STATICCALL

Repacks the BatchedCPUMoveManager's per-storageKey counter slot to also carry
the immutable p0 address + an observed gameOverFlag — letting submitTurn auth
and game-over checks happen via a single SLOAD instead of a STATICCALL into
the engine's getSubmitContext.

Layout (256 bits, keyed by storageKey so it reuses across battles via the
engine's MappingAllocator pattern):
  [0..30]   numExecuted   (uint31)
  [31]      gameOverFlag  (1 bit)
  [32..63]  numBuffered   (uint32)
  [64..95]  lastSubmitTs  (uint32, year 2106 overflow)
  [96..255] p0            (160-bit address)

Plus a separate `storageKeyOf[battleKey]` cache lets us skip getStorageKey
on subsequent submits (battleKey -> storageKey is immutable per battle).

Hot path (cache hit, all submits after the first):
  1 SLOAD storageKeyOf[battleKey]
  1 SLOAD bufferState[storageKey]  -> p0, gameOver, counters
  1 SSTORE moveBuffer[storageKey][nextTurnId]
  1 SSTORE bufferState[storageKey] (preserves p0, increments numBuffered)

Cold path (first submit per battle): falls back to getSubmitContext to
populate both caches, sync counters to engine's turnId, and continue.

executeBuffered:
 - Sets gameOverFlag if winner != 0 so subsequent submits revert fast.
 - Cold engine.end() timeouts won't propagate here (battle ends without
   touching this manager's state) but the manager isn't load-bearing for
   that flow anyway — it just won't accept new submits, which is the
   correct behavior.

Measured at B=14 (test/BatchedCPUGasTest.sol):
  Submits:   301,998 -> 284,442  (-17,556 / -5.8%)
  Execute: 1,695,762 -> 1,694,541 (-1,221)
  Total:   1,997,760 -> 1,978,983 (-18,777 / -0.9%)
  Cold first-touches: 92 -> 79   (-13 -> ~-26k production cold penalty)

In-harness saves vs legacy at B=14 went from 607,205 to 629,244 (+22k).

All 543 tests pass.
Adds `Engine.executeWithDualSignedMovesDirect(...)` — an opt-in entry
point for battles started with `moveManager = address(0)` that lets the
caller skip the SignedCommitManager STATICCALL + the redundant
`getCommitAuthForDualSigned` STATICCALL by doing the EIP-712 sig
verification and auth inline in the engine.

Battles with `moveManager` set continue to go through the manager
unchanged. The engine has its own EIP-712 domain ("Engine","1") so sigs
signed for the manager DON'T verify against the engine's direct path —
prevents cross-contamination if someone tries to relay a manager-bound
sig.

Caveat (documented in the function): stall-timeout via Engine.end()
requires either a `validator` set on the battle or hitting
MAX_BATTLE_DURATION. The inline timeout path calls into a commit
manager (which doesn't exist here). Battles using the direct path
should set a validator if they need stall-timeout semantics.

Measured at B=14 (test/EngineDualSignedDirectTest.sol):
  via manager:        1,741,827 gas
  via engine direct:  1,696,946 gas
  saved:                 44,881 gas (~3.2k per turn / ~2.6%)

Snapshot regression: ~300g per scenario across EngineGasTest /
StandardAttackPvPGasTest / etc. — bytecode bloat from EIP712 base +
new function. Acceptable for the opt-in win on the new path.

All 549 tests pass (was 543 + 6 new in EngineDualSignedDirectTest).
Two small cleanups to BatchedCPUMoveManager that are correctness-preserving
simplifications, with small production savings hidden by the test harness's
single-tx warmth bias:

1. submitTurn cache-miss path: skip the `bufferState[storageKey]` SLOAD.
   Cache miss implies first submit of this battle, so we unconditionally
   reset `packed` to (ctxTurnId, ctxP0). Any prior battle's leftover state
   at this storageKey (gameOver flag, old numExecuted from MappingAllocator
   reuse) is overwritten — the new battle owns the slot. Saves ~2k cold
   SLOAD per first-submit per battle in production.

2. executeBuffered _afterBattle: use cached p0 from `packed` instead of an
   extra STATICCALL into `engine.getPlayersForBattle`. Saves ~3k per
   game-end transition.

In the test harness, the warmup populates the bufferState slot before the
measured battle starts (MappingAllocator reuses storageKey), so the
cache-miss SLOAD is already warm in the test. The savings only show up
in production where each submit is its own tx.

All 549 tests pass. No measurable in-harness gas change (within noise
~79g at B=14).
Adds three additive APIs so external IMoveSet/IAbility contracts can collapse
the canonical "loop getEffects to dedup then addEffect" and "check globalKV
flag then setGlobalKV" patterns into single calls, and read both sides' stats
+ state + effects in one staticcall:

  - addEffectIfNotPresent(target, mon, effect, data) -> bool
    Coalesces the 17-site ability idempotency-guard pattern. Storage-side
    scan against live + tombstoned slots; only calls _addEffectInternal when
    not already present.

  - getAndInitGlobalKV(key, valueIfZero) -> previousValue
    Read + conditional init in one call. Useful for once-per-battle flag
    patterns (~5 sites). Eagerly initializes if previous was zero, so
    callers that need conditional set-after-work should keep the split form.

  - getMoveContext(battleKey, atkPlayer, atkMon, defPlayer, defMon)
    Returns MonStats + MonState + EffectInstance[] for both sides in one
    view. Replaces the 4-7 individual getMonStatsForBattle /
    getMonStateForBattle / getEffects callbacks the worst-offender custom
    moves do today. Sentinel deltas are sanitized to 0 to match
    getMonStateForBattle semantics; tombstoned slots are filtered out.

No existing call sites are migrated; all 557 existing tests pass. Snapshots
show a small (~+1000g per execute) dispatch-table regression from the three
new selectors, which migration of custom moves to these APIs will recoup
several times over.

Eight new tests in EngineMoveAPITest cover correctness and write-context
gating for all three APIs.
Removes getBattleValidator and getMoveManager — both declared in IEngine
and implemented in Engine, but never called from any src/ contract (no
matchmaker, commit manager, validator, hook, mon, or test references).
Pure dead dispatch-table weight.

Verified by direct grep across src/, test/, script/: only the IEngine
declarations and Engine implementations themselves reference these names.

Saves ~250-650g per external call to the engine on every code path, ~3-4k
per battle setup. Partial offset against the +1000g/execute regression from
the three new coalesced move-facing APIs in 6276d2c.

Also corrects the CommitContext doc comment — the lightweight context's
real benefit over BattleContext is fewer encoded return fields (cheaper
ABI encode/decode), not fewer SLOADs (both getters load the same storage
slots). Confirmed by measuring a CommitContext → BattleContext swap that
regressed execute paths by +5k each.

All 557 tests pass.
Removes getMonStateForStorageKey and getPrevPlayerSwitchForTurnFlagForBattleState.
Neither was called from any production contract — only from tests:

  - EngineTest.sol used getMonStateForStorageKey(battleKey, …) 4× with battleKey
    as the storageKey arg. That's the same data getMonStateForBattle returns
    (which internally does _resolveStorageKey), so the test swap is a no-op.
  - BatchEquivalenceTest._assertBattlesEqual read
    getPrevPlayerSwitchForTurnFlagForBattleState on two battles. Replaced with
    a getBattle(key) destructure pulling prevPlayerSwitchForTurnFlag off
    BattleData. Tests don't care about per-call gas cost.

Recovers another ~500-1200g per execute path on top of the two zero-caller
removals in 0952574. Combined with that commit, the 4 removals recoup
roughly 60-65% of the +APIs dispatch regression from 6276d2c.

All 557 tests pass.
…Present

Replaces the canonical "iterate getEffects to dedup, then addEffect" idiom
across 13 sites in 12 mon contracts with a single addEffectIfNotPresent call.
Each migration drops ~7 lines of boilerplate + one STATICCALL for getEffects
+ the in-move loop iteration.

Sites migrated (all 12 are clean dedup-on-address pattern):
  aurox: IronWall, UpOnly
  ekineki: SneakAttack (special: guards entire body, uses returned bool)
  embursa: Tinderclaws (activateOnSwitch only; _removeBurnIfPresent kept)
  gorillax: Angery
  inutia: ChainExpansion, Interweaving
  malalien: ActusReus
  nirvamma: Adaptor
  pengym: PostWorkout
  sofabbi: CarrotHarvest
  xmon: Dreamcatcher, Somniphobia

IronWall and SneakAttack use the `if (!addEffectIfNotPresent(...)) return;`
form because the effect-presence check guards the entire function body
(initial heal / damage calc skipped on subsequent calls in the same lifecycle).

NOT migrated (different semantics):
  ghouliath/RiseFromTheGrave — uses globalKV flag, not getEffects
  nirvamma/HardReset — data-bit conditional dedup
  xmon/NightTerrors — find-or-update, not add-only
  embursa/Tinderclaws site 2 (_removeBurnIfPresent) — remove pattern
  aurox/GildedRecovery — remove pattern
  iblivion/Baselight — _findEffect tuple-returning helper

All 557 tests pass.
When auditing the 9 globalKV consumer sites for adoption, only RiseFromTheGrave
matches the eager-init flag semantics this API was designed for. The other 8
are read-modify-write counters (HoneyBribe, SnackBreak, ModalBolt, Chronoffense
turn-anchor), pure reads for branching (PostWorkout, Tinderclaws), explicit
set/clear pairs (HeatBeaconLib), or conditional set-after-work patterns
(SaviorComplex). None of those can fold into "if previous was zero, set
valueIfZero" without losing functionality.

One adoption candidate doesn't justify maintaining the API surface — keep the
engine lean, leave RiseFromTheGrave on the explicit get + set pair.

Removes the IEngine declaration, Engine implementation, the op=2 branch in
MockNewAPIMove, and the three EngineMoveAPITest cases. Snapshot impact is
within compiler noise (+/- ~90g, no consistent direction) since the function
had no production callers to begin with.

554/554 tests pass.
Collapses the 6 individual getMonStatsForBattle / getMonStateForBattle
callbacks SneakAttack uses to build its DamageCalcContext into a single
getMoveContext call (attacker stats + state, defender stats + state, all
covered).

Measured savings on the existing EkinekiTest scenarios:
  test_sneakAttackHitsNonActiveMon:  5,019,144 → 5,006,241  (-12,903)
  test_sneakAttackOncePerSwitchIn:   5,129,646 → 5,116,773  (-12,873)
  test_sneakAttackResetsOnSwitchIn:  5,315,998 → 5,300,216  (-15,782)

~13-16k saved per test, dominated by per-move-invocation cost. Validates
that getMoveContext earns its dispatch-table weight on high-read sites.

Defender mon is a non-active team slot indexed from extraData — works
because getMoveContext accepts arbitrary mon indices, not just active.

All 554 tests pass.
…-use callers

When auditing migration candidates beyond SneakAttack, every other multi-read
site that fits the (attacker, defender) pair model uses only 2–4 of the
returned fields. Measured impact on HoneyBribe (3 fields used) was +97k
regression per test — the fat context's hidden costs dominate when most data
goes unread:

  - 4 stat/state structs fully fetched (4 SLOADs + struct construction each)
  - Both effect arrays iterated and memory-allocated even when caller
    consults neither
  - ~1.1kb of ABI encoding/decoding for the returned struct

SneakAttack used ~10 of the available fields and saved ~13k per call — but
it's the only such site in the codebase. Keeping a one-consumer API plus
paying its dispatch cost on every other engine call doesn't pencil out.

Reverts SneakAttack to its original individual-getter form (keeping the
addEffectIfNotPresent change from d1edf81). Drops getMoveContext + the
_sanitizeMonState helper from Engine.sol, the IEngine declaration, the
MoveContext struct from Structs.sol, and the three EngineMoveAPITest cases.

Net engine surface vs pre-APIs baseline: -4 functions (1 added:
addEffectIfNotPresent; 5 removed: getBattleValidator, getMoveManager,
getMonStateForStorageKey, getPrevPlayerSwitchForTurnFlagForBattleState,
getAndInitGlobalKV).

Snapshot impact vs pre-removal: -638g to -1478g per execute path,
-6457g to -7532g per battle setup, -682g per BetterCPU turn. Versus the
original baseline (commit bdc0505, pre-APIs), all hot paths are now
NEGATIVE gas — the engine is leaner than before this exercise started.

All 551 tests pass.
Covers the full 53-commit arc since main:
  - Net IEngine surface delta (+4 / -4, focused-not-bloated)
  - Storage layout changes (BattleData slot 0/1 split, MoveDecision pack)
  - New managers: BatchedCPUMoveManager + SignedCommitManager batched flow
  - Engine internals (transient shadow, slot-1 coalescing, event drop, etc.)
  - 12 mon migrations to addEffectIfNotPresent
  - Gas-impact table (hot paths net negative vs baseline)
  - "Tried and reverted" lessons (getMoveContext, getAndInitGlobalKV,
    tiered EffectInstance storage, salt size reduction, etc.)
  - Migration guide for downstream consumers
  - src/cpu/CPU.sol: unused IMoveSet import
  - src/cpu/BetterCPU.sol: unused Type import
  - src/cpu/HeuristicCPUBase.sol: rename aggressive-branch bestIdx to
    aggBestIdx to silence the shadow warning vs the later same-name local
  - script/EngineAndPeriphery.s.sol: unused DefaultValidator/SimplePM
    imports + commented-out SimplePM deploy lines
  - test/{EngineDualSignedDirectTest,BatchedCPUTest,BatchAccessProfileRealisticTest}.sol:
    unused imports flagged by the lint pass

All from forge build warnings; no behavior change.
…, tighten docs

Unused locals (build warnings):
  - 4× \`BattleData storage battle = battleData[battleKey]\` reads that the
    function body never references (_addEffectInternal, _removeEffectAtSlot,
    _dispatchStandardAttack, executeWithDualSignedMovesDirect)

Unused params dropped (with all callsites updated):
  - _checkForGameOverOrKO: \`BattleData storage battle\` (6 callers)
  - _computePriorityPlayerIndex: \`BattleData storage battle\` (2 callers)

DRY: addEffectIfNotPresent collapses its 3-branch dispatch using
\`_loadEffectsCount\` + the p0/p1 mapping-ternary pattern already used by
\`_getEffectsForTarget\`. -12 LOC, same behavior.

Doc tightening (verbose → terse, kept the WHY):
  - executeWithDualSignedMovesDirect natspec (12 → 6 lines)
  - getSubmitContext natspec (9 → 3 lines)
  - IEngine.addEffectIfNotPresent natspec (4 → 2 lines)
  - IEngine.getStorageKey natspec (4 → 3 lines)
  - BatchedCPUMoveManager: class @notice/@dev (14 → 7 lines), bufferState
    layout doc (11 → 8 lines), storageKeyOf doc (4 → 1 line), submitTurn
    /executeBuffered inline blow-by-blow comments removed where code is
    self-evident (kept the non-obvious WHY around first-submit reset
    semantics and the manager-as-moveManager note in executeBuffered).
  - SignedCommitManager: class @notice/@dev (26 → 8 lines). Dropped the
    re-walk of the 3-tx vs 1-tx protocol — that's in OPT_PLAN/CHANGELOG.
    Kept the security WHY (both sigs required vs unilateral revealer).

Snapshot refresh: small per-execute deltas (~50-150g) from compiler
re-optimization across the engine surface changes in this cleanup pass.
Replaces 66kb of in-progress design speculation with a 171-line summary of
what shipped, what was tried and rejected, what was deferred, and the
measured savings. Cross-references CHANGELOG.md for per-commit detail.

Sections:
  - What shipped (core mechanism, shadow layer, layout repacks, single-tx
    engine wins, new APIs, surface trims)
  - What was skipped (deferred phases + 10 rejected experiments with
    measurement-backed reasons)
  - Measured savings (CPU batched, PvP legacy, realistic-game steady-state,
    final engine-surface deltas)
  - 6 lessons worth carrying into v2 work
@sudo-owen sudo-owen changed the title add 3 storage-access profile scenarios to BatchInstrumentationTest Major Optimization Patch May 24, 2026
claude added 7 commits May 24, 2026 23:45
Adds two helpers that skip the 9-field memory unpack \`_loadMonState\`
performs:

  - _readMonStateDelta now operates directly on the packed slot from
    _readMonStatePacked (shift+mask per field, sentinel→0 conversion
    preserved). Used by the public getMonStateForBattle, so external
    move callers benefit too.
  - _isMonKnockedOut: dedicated single-bit check for the hot KO-guard
    pattern. Saves ~220g/call vs _loadMonState(...).isKnockedOut.

Migrated 10 single-field _loadMonState callsites:
  - 8× .isKnockedOut → _isMonKnockedOut
  - 2× .speedDelta in _computePriorityPlayerIndex → _readMonStateDelta(...,Speed)
    (also removes the now-redundant CLEARED_MON_STATE_SENTINEL check at
    the call site since the helper sanitizes internally)

Multi-field _loadMonState callers (10+ remaining) keep the existing
load-modify-store pattern — they amortize the unpack across multiple
fields.

Snapshot deltas vs main: recovers 25-30% of the legacy-path regression
introduced by the MonState shadow refactor. EngineGasTest B1_Execute went
from +69k (+7.6% vs main) to +50k (+5.4%); FirstBattle from +290k (+9.9%)
to +218k (+7.5%); BetterCPU Turn1 from +33k (+13.7%) to +20k (+8.2%).

Residual ~5-10% gap vs main is infrastructure cost the legacy path can't
escape without a full execute-internal split: TLOAD-check tax in
_readMonStatePacked (~100g × ~250 calls/game), BD slot-1 helper-routing,
and the dispatch-table cost of the new external entrypoints. Phase B2
(per-field MonState write helpers for the write-heavy frames) is the
next lever if more recovery is needed.

All 551 tests pass.
…nches

Phase B2 (per OPT_PLAN follow-up). Refactors _updateMonStateInternal to
skip the 9-field memory unpack/repack for the 8 numeric/bool branches it
already special-cases:

  - Hp/Stamina/Speed/Atk/Def/SpAtk/SpDef → _addToMonStateDeltaField:
    direct packed RMW with sentinel-as-zero semantics. Saves ~410g per
    call vs the load-modify-store pattern through _loadMonState.
  - ShouldSkipTurn → _setShouldSkipTurn: single-bit flip on the packed
    slot.

IsKnockedOut keeps the full-unpack path — its KO-bitmap + winner-check +
OnUpdateMonState side effects are intertwined with the in-memory struct
mutation, so factoring it out for a few hundred gas isn't worth the
restructuring.

Snapshot deltas vs pre-B2 (B1 baseline 8816414):
  EngineGasTest FirstBattle  3,138k → 3,128k  (-10k)
  EngineGasTest SecondBattle 3,193k → 3,180k  (-13k)
  EngineGasTest ThirdBattle  2,510k → 2,500k  (-10k)
  InlineEngine B1_Execute      937k →   935k  (-2k)
  InlineEngine FirstBattle   2,771k → 2,761k  (-10k)
Per-execute hot paths in EngineGasTest/BetterCPUInlineGasTest unchanged
because the mock attacks they exercise don't call engine.updateMonState.
Savings appear on battle-lifecycle paths (KO handling, stamina regen)
and on the Inline harness's Battle setups.

Total recovery vs main (B1+B2 combined): ~25-30% of the legacy-path
regression. Residual ~5-9% gap vs main is the architectural cost of the
shadow infrastructure on the legacy code path (TLOAD-check tax in
_readMonStatePacked, BD slot-1 helper-routing, new dispatch entries) —
recoverable only via Option A (parallel execute internals).

All 551 tests pass.
Each shadow-routed helper (_loadMonState, _readMonStatePacked,
_getActiveMonIndex, _isMonKnockedOut, _getWinnerIndex, _getTurnId,
_loadKoBitmaps, ...) used to TLOAD _batchShadowActive at every call to
decide whether to consult the shadow stack. _executeInternal now reads
_batchShadowActive once at the top and threads bool isBatched through:

  _handleMove, _handleEffects, _handleEffectsTriple, _runEffects,
  _runSingleEffect, _handleSwitch, _addEffectInternal, _activateAbility,
  _inlineAbilityActivation, _removeEffectAtSlot, _computePriorityPlayerIndex,
  _dealDamageInternal, _updateMonStateInternal, _dispatchStandardAttackInternal,
  _inlineStandardAttack, _inlineStaminaRegen, _inlineRegenStaminaForMon,
  _getDamageCalcContextInternal, _checkForGameOverOrKO,
  _checkAndSetWinnerIfGameOver

External entries (addEffect, addEffectIfNotPresent, removeEffect,
dispatchStandardAttack, validatePlayerMoveForBattle, getCPUContext,
getValidationContext, getBattleState, getMonStatesForSide,
computePriorityPlayerIndex, getDamageCalcContext) read _batchShadowActive
once and pass the bool. Public no-bool overloads remain for backward
compatibility with mons/effects calling into IEngine.

Gas (vs branch tip before this commit):
  EngineGasTest: -11k to -88k per scenario (-2% to -3%)
  StandardAttackPvPGasTest: -8k to -8.5k per turn (-6% to -10%)
  BetterCPUInlineGasTest: -2k to -8k per turn (-3% to -10%)
  EngineOptimizationTest: -13k / -22k

vs origin/main: legacy regression now sits at +3-5% (was +5-15%);
StandardAttackPvP Turn0_Lead now beats main outright.

All 546 tests pass.
The threading refactor passes bool isBatched explicitly through every
shadow-routed helper. A regression where any helper forgets to forward it
(or defaults to false) inside a batched flow would silently break mid-batch
state visibility — sub-turn N+1 would read stale storage instead of the
shadow value sub-turn N wrote.

MockStateProbeMove records the opponent's MonState field via
getMonStateForBattle into globalKV. Two tests use it:

  test_batchedShadow_probeObservesMidBatchDamage — damage P1 in sub-turn 1,
  probe P1's HpDelta in sub-turn 2, assert probe sees the negative delta
  (not 0). Between sub-turns the shadow is the only carrier; a mis-threaded
  read would observe stale storage.

  test_batchedShadow_probeObservesAccumulatedDamage — two damaging turns
  back-to-back, then probe in turn 3; asserts the cumulative delta is
  visible (each turn's shadow write must compose).
After threading bool isBatched through every internal caller, the no-bool
wrappers (e.g. _getActiveMonIndex(battleKey)) were only used by:
  - _getPlayerSwitchForTurnFlag(battleKey) at executeBatchedTurns L522
    (always batched) and executeWithSingleMove L595 (never batched)
  - _readMonStateDelta(...) at getMonStateForBattle (external view)
  - _getPrevPlayerSwitchForTurnFlag (dead)

Inline the explicit bool argument at the three live callsites:
  - L522 passes true (we just set _batchShadowActive = true above)
  - L595 passes false (executeWithSingleMove is a fresh external entry —
    _batchShadowActive can only become true inside executeBatchedTurns,
    and it's reset to false on that function's exit)
  - getMonStateForBattle reads _batchShadowActive once and threads it

Delete 21 wrappers (~108 LOC) including the dead prev-flag helpers.

Net gas: -107g on the executeWithSingleMove path (one TLOAD saved).
Everything else unchanged because the wrappers were unused in the
measured paths. Real win is the surface-area reduction.

All 548 tests pass.
_handleMove read BattleData slot 1 three separate times: turnId at the
top, attacker's activeMonIndex right after, defender's activeMonIndex
just before the move dispatch. All three are the same packed slot;
each was a separate SLOAD because the helpers couldn't be deduplicated
across function call boundaries by the optimizer.

Read the slot once into `slot1Packed`, extract `turnIdCached` and
`cachedPackedActiveMon` via bit shifts, and reuse both for the attacker
and defender mon reads in both the INLINE and EXTERNAL paths.

Safety: the cache is only consumed pre-external-call. Mutation barriers
inside _handleMove are:
  - moveSet.move(...) / _inlineStandardAttack → PreDamage effects: these
    can call engine.switchActiveMon (PistolSquat / HitAndDip / RoundTrip
    do this) — but the cached values were already consumed by then.
  - _handleSwitch (in the SWITCH branch): writes slot1, called after the
    cache has been used.
  - moveSet.stamina() and validator.validateSpecificMoveSelection() are
    read-only by interface contract — treated as safe.

Gas per turn:
  EngineGasTest: -1.5k to -2.8k per Execute scenario (-0.3% to -0.5%)
  StandardAttackPvPGasTest: -1,154g per turn (-1.4%)
  BetterCPUInlineGasTest: -906g per BothAttack turn (-0.4%)
  EngineOptimizationTest: -1.8k / -2.5k
  Multi-battle EngineGasTest scenarios: -10k to -12k

All 548 tests pass.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants