fix: re-tick immediately when on_tick overruns its interval boundary#416
fix: re-tick immediately when on_tick overruns its interval boundary#416MegaRedHand wants to merge 4 commits into
Conversation
🤖 Kimi Code ReviewThe PR correctly fixes issue #413 (silent skipping of overrun duties) and implements proper fast-forward logic for offline catch-up. The state machine logic is sound: ticks are now tagged with their intended interval time, and late ticks fire immediately to process missed duties rather than skipping them. Minor issues & suggestions:
Consensus correctness verification:
Testing: Verdict: Approved with minor nits. The logic is correct and secure for consensus operations. Automated review by Kimi (Moonshot AI) · kimi-k2.5 · custom prompt |
🤖 Codex Code Review
The late-tick attribution itself looks directionally correct, and the added unit tests cover the helper math well. I did not find a direct security/XMSS/SSZ issue in this diff. I could not run Automated review by OpenAI Codex · gpt-5.4 · custom prompt |
🤖 Claude Code ReviewHere is the review: PR 416: fix: run skipped tick intervals instead of silently dropping themOverviewThe PR addresses a real correctness bug: block building (~1.3 s) overran the 800 ms interval, causing the proposer to never attest on any slot it proposed. The fix carries the intended interval count in the The root-cause diagnosis is correct, the solution is well-motivated, and the code is clean. One medium issue worth addressing before merge. Medium: Loop in
|
Greptile SummaryThis PR fixes a silent duty-skipping bug (#413) where an overlong interval-0 duty (e.g. inline block building at ~1.3s) caused the next tick to be scheduled past the interval-1 boundary, silently dropping attestation duties. Each
Confidence Score: 4/5The change is safe to merge. The catch-up loop is correctly bounded (clamp_tick_time at entry + natural break when a future interval hasn't yet been reached), the tick-tagging approach eliminates the silent duty-skip, and the unit tests cover the exact failure scenario from issue #413. The core scheduling logic is sound and the test suite is thorough. The only non-trivial observation is that the last_tick_instant metric will report misleading durations during consecutive catch-up iterations, which could obscure overrun events in monitoring. This does not affect correctness of duty execution. crates/blockchain/src/lib.rs — the handle_tick catch-up loop and its interaction with the last_tick_instant metric are the only areas worth a second look.
|
| Filename | Overview |
|---|---|
| crates/blockchain/src/lib.rs | Core fix: handle_tick now carries an interval-count in Tick, runs a catch-up loop for overrun intervals, and uses clamp_tick_time / ms_until_tick helpers. Logic is correct and well-tested; one subtle metric side-effect noted (see comment). |
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[Tick message arrives with time=N] --> B[clamp_tick_time: fast-forward if more than one slot behind]
B --> C[on_tick with timestamp from tick_time]
C --> D{ms_until_tick next boundary}
D -->|delay > 0| E[send_after delay, Tick time=next]
E --> F[Done]
D -->|delay == 0 boundary already passed| G[warn: tick overran interval]
G --> H[increment tick_time]
H --> C
Comments Outside Diff (1)
-
crates/blockchain/src/lib.rs, line 196-199 (link)last_tick_instantmetric skewed during catch-up loop iterationson_tickcallsobserve_tick_interval_duration(prev_instant.elapsed())and then resetslast_tick_instanton every invocation. When the catch-up loop runs multiple iterations back-to-back (delay == 0), each successive call measures only the previous iteration's processing time rather than the ~800 ms wall-clock interval. Monitoring dashboards will show near-zero or processing-time-only values for catch-up ticks, making the metric misleading precisely when an overrun is occurring. Consider skipping the observation (or tagging it as a catch-up sample) for in-place iterations to keep the metric semantically consistent.Prompt To Fix With AI
This is a comment left during a code review. Path: crates/blockchain/src/lib.rs Line: 196-199 Comment: **`last_tick_instant` metric skewed during catch-up loop iterations** `on_tick` calls `observe_tick_interval_duration(prev_instant.elapsed())` and then resets `last_tick_instant` on every invocation. When the catch-up loop runs multiple iterations back-to-back (delay == 0), each successive call measures only the previous iteration's processing time rather than the ~800 ms wall-clock interval. Monitoring dashboards will show near-zero or processing-time-only values for catch-up ticks, making the metric misleading precisely when an overrun is occurring. Consider skipping the observation (or tagging it as a catch-up sample) for in-place iterations to keep the metric semantically consistent. How can I resolve this? If you propose a fix, please make it concise.
Prompt To Fix All With AI
Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 1
crates/blockchain/src/lib.rs:196-199
**`last_tick_instant` metric skewed during catch-up loop iterations**
`on_tick` calls `observe_tick_interval_duration(prev_instant.elapsed())` and then resets `last_tick_instant` on every invocation. When the catch-up loop runs multiple iterations back-to-back (delay == 0), each successive call measures only the previous iteration's processing time rather than the ~800 ms wall-clock interval. Monitoring dashboards will show near-zero or processing-time-only values for catch-up ticks, making the metric misleading precisely when an overrun is occurring. Consider skipping the observation (or tagging it as a catch-up sample) for in-place iterations to keep the metric semantically consistent.
Reviews (1): Last reviewed commit: "fix: run skipped tick intervals instead ..." | Re-trigger Greptile
handle_tick scheduled the next tick after on_tick returned, with a delay computed from the timestamp captured at entry. A duty overrunning the 800ms interval (e.g. inline block-building proofs taking ~1.3s) pushed the next tick past the following boundary, so that interval's duty never ran: a proposing node never attested on its own slot, which alone prevents 1- and 2-node devnets from ever justifying. Drive tick processing from store.time() (the last processed interval) instead of the wall clock at handling time: each tick loops over every interval boundary that has passed but was not processed yet, running missed intervals in place (with a warning) so their duties still execute. When more than a slot behind (e.g. a long stall), jump straight to the wall clock, matching store::on_tick's existing fast-forward. Closes #413
b20c5e7 to
15750d7
Compare
…them" This reverts commit 15750d7.
handle_tick scheduled the next tick from the timestamp captured at entry, so a duty overrunning the 800ms interval (e.g. inline block-building proofs ~1.3s) pushed the next tick past the following boundary and that interval's duty never ran: a proposing node never attested on its own slot, which alone prevents 1- and 2-node devnets from ever justifying. Re-read the clock after on_tick and compare elapsed time against the interval slack remaining at entry. If a boundary was crossed, tick again immediately so the missed interval's duty still runs; otherwise schedule for the next boundary from the post-tick timestamp. The comparison keys off the boundary actually being crossed rather than a fixed 800ms threshold, so it also handles scheduling jitter and exact-boundary entry. Closes #413
Closes #413
Problem
handle_tickscheduled the next tick afteron_tickreturned, with a delay computed from the timestamp captured at entry. If a duty overran the 800ms interval (block building proves inline, ~1.3s observed), the next tick fired past the following boundary and that interval's duty was silently skipped.Concrete symptom: a proposing node never attests on any slot it proposes (interval-0 overrun skips interval 1). This alone makes 1- and 2-node devnets unable to justify, and loses ~1/N of attestations per slot on N-node devnets.
Fix
Minimal change, contained to
handle_tick:on_tickand compare elapsed time against the interval slack remaining at entry (remaining_at_entry).elapsed >= remaining_at_entry), tick again immediately so the missed interval's duty still runs.The comparison keys off the boundary actually being crossed rather than a fixed 800ms threshold, so it also handles scheduling jitter (a late-firing tick whose
on_tickcrosses the boundary in <800ms) and exact-boundary entry.Scope / tradeoff
Testing
Existing blockchain unit tests,
cargo fmt, andclippy -D warningspass.