Skip to content

feat(node): skip checkpoint sync when recent state data is in the db#403

Merged
MegaRedHand merged 5 commits into
lambdaclass:mainfrom
conache:skip-checkpoint-sync-when-state-in-db
Jun 2, 2026
Merged

feat(node): skip checkpoint sync when recent state data is in the db#403
MegaRedHand merged 5 commits into
lambdaclass:mainfrom
conache:skip-checkpoint-sync-when-state-in-db

Conversation

@conache
Copy link
Copy Markdown
Contributor

@conache conache commented Jun 1, 2026

When --checkpoint-sync-url is provided, the node currently always downloads a fresh finalized state from the peer, even if a recent state is already on disk. Skip the network round-trip when the persisted state is fresh enough to resume from.

This PR introduces a state data freshness threshold - MAX_RESUMABLE_DB_STATE_AGE = 450 slots (~30 min at 4s/slot) - picked conservatively considering the per-block backfill cost. I couldn't identify in the specs a formal weak-subjectivity period or a method of calculating it, so this is a judgement call; happy to take any suggestions on the value or a better approach for it.

What Changed

  • crates/storage/src/store.rs — added Store::from_db_state(backend, expected_genesis_time), a no-write constructor that wraps an already-initialized backend. Returns None if the backend is empty or its persisted genesis_time doesn't match. Added the MAX_RESUMABLE_DB_STATE_AGE = 450 constant (~30 min at 4s/slot).
  • crates/storage/src/lib.rs — re-exported MAX_RESUMABLE_DB_STATE_AGE.
  • bin/ethlambda/src/main.rs — in the checkpoint-sync branch of fetch_initial_state, try Store::from_db_state first. If the persisted finalized slot is within MAX_RESUMABLE_DB_STATE_AGE of wall-clock, return the resumed store and skip checkpoint sync. Otherwise warn and fall through to the existing sync path.

Correctness / Behavior Guarantees

  • New short-circuit fires only when: checkpoint URL provided AND DB populated AND persisted genesis_time matches AND current_slot - latest_finalized.slot <= MAX_RESUMABLE_DB_STATE_AGE. Everything else remains unchanged.
  • 30 min threshold is conservative for the current BlocksByRoot-only backfill cost; can be increased once BlocksByRange long-range sync (feat(p2p): use BlocksByRange for long-range sync #351) is added.

Tests Added / Run

Three unit tests in crates/storage/src/store.rs covering the from_db_state contract:

  • from_db_state_returns_none_on_empty_backend
  • from_db_state_returns_some_on_matching_genesis_time
  • from_db_state_returns_none_on_genesis_time_mismatch

Related Issues / PRs

✅ Verification Checklist

  • Ran make fmt — clean
  • Ran make lint (clippy with -D warnings) — clean
  • Ran cargo test --workspace --release — all passing
  • Local devnet test

@conache conache changed the title Skip checkpoint sync when state in db feat(node): skip checkpoint sync when recent state data is in the db Jun 1, 2026
@conache conache marked this pull request as ready for review June 1, 2026 13:55
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Jun 1, 2026

Greptile Summary

Adds a startup fast-path that skips the checkpoint-sync network round-trip when a persisted store is already within MAX_RESUMABLE_DB_STATE_AGE (450 slots, ~30 min) of the current wall-clock slot. The genesis-time cross-check and conservative threshold make the short-circuit safe.

  • Store::from_db_state is a new no-write constructor that wraps an existing backend; it validates genesis_time via KEY_CONFIG and returns None for empty or mismatched backends.
  • fetch_initial_state in main.rs probes this constructor before hitting the network, computing the slot gap via the same saturating_sub / MILLISECONDS_PER_SLOT arithmetic used elsewhere in the codebase.
  • Three unit tests cover the empty, matching, and mismatched genesis-time cases for from_db_state.

Confidence Score: 4/5

Safe to merge; the fast-path only fires under a narrow set of conditions and falls through to the existing checkpoint-sync path in all other cases.

The core logic — slot-gap arithmetic, genesis-time validation, and the backend Arc clone/drop sequence — is correct and consistent with existing codebase patterns. The one noteworthy gap is that from_db_state validates only KEY_CONFIG before returning a Store, leaving a narrow window where a partially-initialized DB (e.g. after a schema migration or an older-format DB) could cause a panic in the caller rather than a clean fallback to checkpoint sync.

crates/storage/src/store.rs — specifically the from_db_state constructor's partial metadata validation.

Important Files Changed

Filename Overview
crates/storage/src/store.rs Adds from_db_state constructor and MAX_RESUMABLE_DB_STATE_AGE constant; logic is sound but the constructor only validates KEY_CONFIG, so a DB with missing KEY_LATEST_FINALIZED would panic in the caller rather than returning None.
bin/ethlambda/src/main.rs Checkpoint-sync fast-path added before network fetch; slot-gap arithmetic using saturating_sub and MILLISECONDS_PER_SLOT is consistent with the existing codebase pattern.
crates/storage/src/lib.rs Re-exports MAX_RESUMABLE_DB_STATE_AGE from store; trivial one-line change, no issues.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[fetch_initial_state called] --> B{checkpoint_url provided?}
    B -- No --> C[Build genesis Store\nfrom_anchor_state]
    B -- Yes --> D[Store::from_db_state\nbackend.clone, genesis_time]
    D -- None: empty or\ngenesis_time mismatch --> G
    D -- Some store --> E[Compute slot gap\ncurrent_slot - finalized_slot]
    E --> F{gap <= 450 slots?}
    F -- Yes --> H[Return existing\nDB store - skip sync]
    F -- No --> I[warn: state stale\ndrop store]
    I --> G[checkpoint_sync:\nfetch_finalized_anchor]
    G --> J[get_forkchoice_store\nwrites new state to backend]
    J --> K[Return new Store]
Loading
Prompt To Fix All With AI
Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 1
crates/storage/src/store.rs:527-558
**`from_db_state` only validates `KEY_CONFIG` before returning a `Store`**

The function checks that `KEY_CONFIG` is present and that `genesis_time` matches, then constructs and returns a `Store`. Immediately in the caller (`main.rs`), `store.latest_finalized().slot` is invoked, which calls `get_metadata(KEY_LATEST_FINALIZED)`. That helper unconditionally panics with `"metadata key exists"` if the key is absent. If the DB ever has `KEY_CONFIG` but is missing `KEY_LATEST_FINALIZED` (e.g. a schema change between node versions, a partially-migrated DB, or a crash *between* two separate write batches in future refactors), the node will panic at startup instead of falling through cleanly to checkpoint sync. Consider probing `KEY_LATEST_FINALIZED` inside `from_db_state` and returning `None` when it is absent, consistent with the function's existing "empty or inconsistent → `None`" contract.

Reviews (1): Last reviewed commit: "Add tests" | Re-trigger Greptile

Comment on lines +527 to +558
/// Build a Store from the state already persisted in the storage backend.
///
/// Returns `None` if the backend is empty or its persisted `genesis_time`
/// doesn't match `expected_genesis_time`.
pub fn from_db_state(
backend: Arc<dyn StorageBackend>,
expected_genesis_time: u64,
) -> Option<Self> {
let persisted_config = {
let view = backend.begin_read().expect("read view");
let bytes = view.get(Table::Metadata, KEY_CONFIG).expect("get config")?;
ChainConfig::from_ssz_bytes(&bytes).expect("valid config")
};
if persisted_config.genesis_time != expected_genesis_time {
warn!(
db_genesis_time = persisted_config.genesis_time,
expected_genesis_time,
"Persisted DB has a different genesis_time; treating as empty"
);
return None;
}
info!("Loaded store from persisted DB state");
Some(Self {
backend,
new_payloads: Arc::new(Mutex::new(PayloadBuffer::new(NEW_PAYLOAD_CAP))),
known_payloads: Arc::new(Mutex::new(PayloadBuffer::new(AGGREGATED_PAYLOAD_CAP))),
gossip_signatures: Arc::new(Mutex::new(GossipSignatureBuffer::new(
GOSSIP_SIGNATURE_CAP,
))),
})
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 from_db_state only validates KEY_CONFIG before returning a Store

The function checks that KEY_CONFIG is present and that genesis_time matches, then constructs and returns a Store. Immediately in the caller (main.rs), store.latest_finalized().slot is invoked, which calls get_metadata(KEY_LATEST_FINALIZED). That helper unconditionally panics with "metadata key exists" if the key is absent. If the DB ever has KEY_CONFIG but is missing KEY_LATEST_FINALIZED (e.g. a schema change between node versions, a partially-migrated DB, or a crash between two separate write batches in future refactors), the node will panic at startup instead of falling through cleanly to checkpoint sync. Consider probing KEY_LATEST_FINALIZED inside from_db_state and returning None when it is absent, consistent with the function's existing "empty or inconsistent → None" contract.

Prompt To Fix With AI
This is a comment left during a code review.
Path: crates/storage/src/store.rs
Line: 527-558

Comment:
**`from_db_state` only validates `KEY_CONFIG` before returning a `Store`**

The function checks that `KEY_CONFIG` is present and that `genesis_time` matches, then constructs and returns a `Store`. Immediately in the caller (`main.rs`), `store.latest_finalized().slot` is invoked, which calls `get_metadata(KEY_LATEST_FINALIZED)`. That helper unconditionally panics with `"metadata key exists"` if the key is absent. If the DB ever has `KEY_CONFIG` but is missing `KEY_LATEST_FINALIZED` (e.g. a schema change between node versions, a partially-migrated DB, or a crash *between* two separate write batches in future refactors), the node will panic at startup instead of falling through cleanly to checkpoint sync. Consider probing `KEY_LATEST_FINALIZED` inside `from_db_state` and returning `None` when it is absent, consistent with the function's existing "empty or inconsistent → `None`" contract.

How can I resolve this? If you propose a fix, please make it concise.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added probe in 4bc539b

@MegaRedHand MegaRedHand merged commit 4653568 into lambdaclass:main Jun 2, 2026
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Avoid checkpoint-sync when we have recent state in the DB

3 participants