feat(lumera): add LEP-6 chain client extensions#286
Open
The new LEP-6 chain-driven self-healing runtime (dispatcher, healer, verifier, finalizer, transport handler, SQLite dedup, cascade staging/publish, peer client) is well-structured, thoroughly tested, and cleanly integrated. The previous concurrency concern has been addressed with a
Introduces pkg/storagechallenge/deterministic/lep6.go, the off-chain
computation library shared by the storage_challenge runtime, recheck
service, and self-healing dispatcher. Every function is pure (no I/O,
no clock, no goroutines) so independent reporters challenging the same
(target, ticket) pair produce byte-identical StorageProofResult fields.
Functions land in two categories:
CHAIN-MIRRORED (must match lumera/x/audit/v1/keeper/audit_peer_assignment.go
byte-for-byte; the chain re-runs them to validate MsgSubmitEpochReport):
- SelectLEP6Targets — 1/3 deterministic target subset
(SHA-256(seed||0x00||account||0x00||"challenge_target"),
targetCount = ceil(N/divisor) clamped to [1, N])
- PairChallengerToTarget / AssignChallengerTargets — challenger->target
pairing (label "pair"), with no-self and lex tie-break
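The target-selection rule above can be sketched as follows. This is only an illustration built from the formula quoted in this message, not the code in lep6.go: the helper names are hypothetical, and ranking by ascending hash with a lexicographic tie-break on the account is an assumption (the chain file is the authority on ordering).

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"fmt"
	"sort"
)

// assignmentHash computes SHA-256(seed || 0x00 || account || 0x00 || label).
func assignmentHash(seed []byte, account, label string) [32]byte {
	h := sha256.New()
	h.Write(seed)
	h.Write([]byte{0x00})
	h.Write([]byte(account))
	h.Write([]byte{0x00})
	h.Write([]byte(label))
	var out [32]byte
	copy(out[:], h.Sum(nil))
	return out
}

// targetCount returns ceil(n/divisor) clamped to [1, n], or 0 when n is 0.
// Treating divisor < 1 as 1 is an assumption made for this sketch.
func targetCount(n, divisor int) int {
	if n <= 0 {
		return 0
	}
	if divisor < 1 {
		divisor = 1
	}
	c := (n + divisor - 1) / divisor
	if c < 1 {
		c = 1
	}
	if c > n {
		c = n
	}
	return c
}

// selectTargets ranks accounts by hash (ascending: ASSUMPTION) with a
// lexicographic tie-break on the account, then keeps the first targetCount.
func selectTargets(seed []byte, active []string, divisor int) []string {
	type ranked struct {
		account string
		hash    [32]byte
	}
	rs := make([]ranked, 0, len(active))
	for _, a := range active {
		rs = append(rs, ranked{a, assignmentHash(seed, a, "challenge_target")})
	}
	sort.Slice(rs, func(i, j int) bool {
		if c := bytes.Compare(rs[i].hash[:], rs[j].hash[:]); c != 0 {
			return c < 0
		}
		return rs[i].account < rs[j].account
	})
	out := make([]string, targetCount(len(rs), divisor))
	for i := range out {
		out[i] = rs[i].account
	}
	return out
}

func main() {
	active := []string{"sn-a", "sn-b", "sn-c", "sn-d", "sn-e", "sn-f"}
	targets := selectTargets([]byte("01234567890123456789012345678901"), active, 3)
	fmt.Println(len(targets), "targets selected:", targets)
}
```

Because the function has no I/O and no clock, two reporters with the same seed and active set get the same subset regardless of input order.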
SUPERNODE-CANONICAL (chain stores outputs as opaque strings; this file
defines the canonical encoding all reporters must use to stay in lockstep):
- ClassifyTicketBucket — RECENT/OLD bucket classification using
Action.BlockHeight (Action.UpdatedHeight does not exist; see
docs/plans/LEP6_SUPERNODE_IMPLEMENTATION_PLAN.md "Resolved Decision 3")
- SelectTicketForBucket — deterministic per-(target,bucket) ticket pick
with excluded-set support for active heal ops
- SelectArtifactClass — LEP-6 §10 weighted roll (20% INDEX / 80% SYMBOL)
with deterministic fallback when a class has no artifacts
- SelectArtifactOrdinal — uniform ordinal mod artifactCount
- ComputeMultiRangeOffsets — k=4 range offsets in [0, size-rangeLen)
- ComputeCompoundChallengeHash — BLAKE3 over concat of slices in offset
order (lukechampine.com/blake3 to match the chain's library)
- DerivationInputHash — canonical hex of derivation inputs
- TranscriptHash — full canonical transcript identifier with sorted
observer ids; struct-input form prevents field-order mistakes
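A minimal sketch of the weighted class roll and the ordinal pick described in the list above. The hash composition (which inputs are fed in, and taking the first 8 bytes as a big-endian integer) is an assumption made for illustration; only the 20/80 split, the fallback when a class is empty, the `mod artifactCount` rule, and the two domain-separator labels come from this message.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"errors"
	"fmt"
)

// roll hashes the parts (each followed by 0x00) plus a domain-separator
// label and returns the first 8 bytes as a big-endian uint64.
// ASSUMPTION: the real input layout may differ.
func roll(label string, parts ...string) uint64 {
	h := sha256.New()
	for _, p := range parts {
		h.Write([]byte(p))
		h.Write([]byte{0x00})
	}
	h.Write([]byte(label))
	sum := h.Sum(nil)
	return binary.BigEndian.Uint64(sum[:8])
}

// selectArtifactClass rolls 20% INDEX / 80% SYMBOL and falls back to the
// other class when the rolled class has no artifacts.
func selectArtifactClass(seed, target, ticket string, indexCount, symbolCount uint64) (string, error) {
	if indexCount == 0 && symbolCount == 0 {
		return "", errors.New("no artifacts in either class")
	}
	class := "SYMBOL"
	if roll("artifact_class", seed, target, ticket)%100 < 20 {
		class = "INDEX"
	}
	if class == "INDEX" && indexCount == 0 {
		return "SYMBOL", nil
	}
	if class == "SYMBOL" && symbolCount == 0 {
		return "INDEX", nil
	}
	return class, nil
}

// selectArtifactOrdinal picks a uniform ordinal in [0, count).
func selectArtifactOrdinal(seed, target, ticket, class string, count uint64) (uint64, error) {
	if count == 0 {
		return 0, errors.New("artifact count must be positive")
	}
	return roll("artifact_ordinal", seed, target, ticket, class) % count, nil
}

func main() {
	index := 0
	const draws = 5000
	for i := 0; i < draws; i++ {
		c, _ := selectArtifactClass("seed", "sn-a", fmt.Sprintf("ticket-%d", i), 5, 50)
		if c == "INDEX" {
			index++
		}
	}
	fmt.Printf("INDEX share over %d draws: %.1f%%\n", draws, 100*float64(index)/draws)
}
```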
Domain separators ("challenge_target", "pair", "ticket_rank",
"artifact_class", "artifact_ordinal", "range_offset",
"derivation_input", "transcript") and enum string forms ("INDEX"/
"SYMBOL", "RECENT"/"OLD"/"PROBATION"/"RECHECK") are package
constants; freezing them prevents accidental drift between callers and
tests. Any change is a protocol-level break that requires versioning.
Tests:
- TestStorageTruthAssignmentHash_KnownVector locks the byte-level SHA-256
composition against an independent computation, guaranteeing the
chain-mirrored helper has not drifted.
- TestSelectLEP6Targets_OneThirdCoverage_AssignmentMatchesChain uses the
chain's own audit_peer_assignment_test.go fixture
(seed="01234567890123456789012345678901", active={sn-a..sn-f},
divisor=3) — output {sn-f, sn-e}.
- TestAssignChallengerTargets_KnownAssignment locks the full pairing
{sn-a -> sn-f, sn-b -> sn-e}.
- TestSelectArtifactClass_WeightedDistribution validates ~20% INDEX
over 5000 draws (±2% tolerance).
- Determinism, sensitivity, error-path, and out-of-bounds tests for
every primitive.
Verified: `go test ./pkg/storagechallenge/deterministic/...` passes; the
existing deterministic_test.go pre-LEP-6 tests continue to pass.
…#288)

Implements the PR3 compound storage challenge runtime on top of the latest LEP-6 PR1 foundation branch after PR2 was merged into it.

Highlights:
- Add compound proof request/response fields and regenerated supernode proto bindings.
- Add recipient-side GetCompoundProof handler with signed responses and range validation.
- Add challenger-side LEP6Dispatcher for assigned-target dispatch across RECENT/OLD buckets.
- Add result buffer implementing host_reporter ProofResultProvider with deterministic chain-cap throttling.
- Add deterministic cascade metadata resolution helpers for artifact count, key, and exact artifact size.
- Add production ChainTicketProvider backed by final Lumera x/action ListActionsBySuperNode query.
- Wire startup to use ChainTicketProvider and cascade metadata/action size resolution instead of NoTicketProvider.
- Classify target RPC timeout/no-response as TIMEOUT_OR_NO_RESPONSE and malformed transcripts as INVALID_TRANSCRIPT.
- Extend action module bindings/mocks with ListActionsBySuperNode.
- Preserve PR1 provider concurrency hardening and PR2 deterministic roocode fixes after rebase.

Lumera dependency/source:
- github.com/LumeraProtocol/lumera v1.12.0
- chain source: lumera/master 451f8a8e7ff30b3370cba59fab8e6228473a348b

Validation:
- git diff --check origin/supernode/LEP-6-chain-client-extensions..HEAD: pass
- go test ./pkg/storagechallenge/... ./supernode/storage_challenge ./supernode/transport/grpc/storage_challenge ./supernode/host_reporter ./pkg/lumera/modules/action ./pkg/lumera/modules/audit ./pkg/lumera/modules/audit_msg -count=1 -v: pass
- go vet ./pkg/storagechallenge/... ./supernode/storage_challenge ./supernode/transport/grpc/storage_challenge ./supernode/host_reporter ./pkg/lumera/modules/action ./pkg/lumera/modules/audit ./pkg/lumera/modules/audit_msg: pass
- go test ./... -count=1: pass

Plan: docs/plans/LEP6_SUPERNODE_IMPLEMENTATION_PLAN_v3_MASTER.md PR3
…289)

Replaces the gonode-era peer-watchlist self-healing with a chain-mediated LEP-6 §18-§22 (Workstream C) implementation. Healer reconstructs locally and STAGES (no KAD publish), verifiers fetch reconstructed bytes from the assigned healer over a streaming gRPC RPC (§19 healer-served path) and hash-compare against op.ResultHash, then publish to KAD only after chain VERIFIED quorum.

Three-phase flow

Phase 1: RECONSTRUCT (no publish)
  cascade.RecoveryReseed(PersistArtifacts=false, StagingDir) → download remaining symbols → RaptorQ-decode → verify file hash against Action.DataHash → re-encode → stage symbols+idFiles+layout+reconstructed.bin to ~/.supernode/heal-staging/<op_id>/. Submit MsgClaimHealComplete{HealManifestHash}; chain transitions SCHEDULED → HEALER_REPORTED, sets op.ResultHash = HealManifestHash.

Phase 2: VERIFY (§19 healer-served path)
  Verifier opens supernode.SelfHealingService/ServeReconstructedArtefacts on the assigned healer (op.HealerSupernodeAccount), streams the reconstructed bytes, computes BLAKE3 base64 (=Action.DataHash recipe via cascadekit.ComputeBlake3DataHashB64), compares against op.ResultHash (NOT Action.DataHash; chain enforces at lumera/x/audit/v1/keeper/msg_storage_truth.go:291), and submits MsgSubmitHealVerification{verified, hash}. Chain quorum n/2+1.

Phase 3: PUBLISH (only on VERIFIED)
  Finalizer polls heal_claims_submitted (Opt 2b per-op poll, folded into single tick loop alongside healer + verifier dispatch), reads op.Status, calls cascade.PublishStagedArtefacts on VERIFIED (same storeArtefacts path as register/upload), deletes staging on FAILED/EXPIRED. Chain may reschedule a different healer on EXPIRED.

Crash-recovery / restart-safety

Submit-then-persist ordering: SQLite dedup row is written ONLY after chain has accepted the tx. A failed submit (mempool, signing, chain reject) leaves no row and staging is removed, so the next tick can retry cleanly.
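The submit-then-persist ordering described above can be sketched as follows. The interfaces, the in-memory fakes, and the function name are all hypothetical stand-ins for the chain client and the SQLite dedup table; the sketch shows only the ordering, not the real healer.

```go
package main

import (
	"errors"
	"fmt"
)

// chainSubmitter and dedupStore are hypothetical interfaces for this sketch.
type chainSubmitter interface {
	SubmitClaim(opID uint64, manifestHash string) error
}

type dedupStore interface {
	HasClaim(opID uint64) bool
	RecordClaim(opID uint64, manifestHash string) error
}

// claimHeal submits to the chain first and writes the dedup row only after
// the chain accepted the tx. A failed submit removes staging and leaves no
// row, so the next tick can retry from scratch.
func claimHeal(chain chainSubmitter, store dedupStore, removeStaging func(uint64), opID uint64, manifestHash string) error {
	if store.HasClaim(opID) {
		return nil // already claimed; the finalizer owns this op now
	}
	if err := chain.SubmitClaim(opID, manifestHash); err != nil {
		removeStaging(opID)
		return fmt.Errorf("submit claim for op %d: %w", opID, err)
	}
	if err := store.RecordClaim(opID, manifestHash); err != nil {
		return fmt.Errorf("persist dedup row for op %d: %w", opID, err)
	}
	return nil
}

// In-memory fakes used to exercise the ordering.
type fakeChain struct{ fail bool }

func (c fakeChain) SubmitClaim(uint64, string) error {
	if c.fail {
		return errors.New("chain rejected tx")
	}
	return nil
}

type memStore map[uint64]string

func (s memStore) HasClaim(id uint64) bool { _, ok := s[id]; return ok }
func (s memStore) RecordClaim(id uint64, h string) error {
	s[id] = h
	return nil
}

func main() {
	store := memStore{}
	removed := false
	err := claimHeal(fakeChain{fail: true}, store, func(uint64) { removed = true }, 7, "manifest-hash")
	fmt.Println("failed submit:", err != nil, "row written:", store.HasClaim(7), "staging removed:", removed)

	err = claimHeal(fakeChain{}, store, func(uint64) {}, 7, "manifest-hash")
	fmt.Println("retry error:", err, "row written:", store.HasClaim(7))
}
```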
If chain accepted a prior submit but the supernode crashed before persisting, the next tick's resubmit fails with "does not accept healer completion claim" and reconcileExistingClaim re-fetches the heal-op, confirms chain ResultHash equals our manifest, and persists the dedup row so finalizer takes over.

Negative-attestation hash: chain rejects empty VerificationHash even on verified=false (msg_storage_truth.go:271-273). Verifier synthesizes a deterministic non-empty placeholder (sha256("lep6:negative-attestation:"+reason) base64) on fetch_failed and hash_compute_failed paths. Chain only validates VerificationHash content for positive votes (msg_storage_truth.go:288-294), so any non-empty value is well-formed for negatives.

Components added

supernode/self_healing/
  service.go
    Single tick loop; mode gate (UNSPECIFIED skips); healer dispatch; verifier dispatch; finalizer poll; sync.Map in-flight + buffered semaphores (reconstructs=2, verifications=4, publishes=2).
  healer.go
    Phase 1: submit-then-persist ordering; reconcileExistingClaim handles post-crash recovery when chain accepted a prior submit.
  verifier.go
    Phase 2: fetch from assigned healer, retry with exponential backoff (3 attempts), submit verified=false with non-empty placeholder hash on persistent fetch failure; positive-path hash compares against op.ResultHash; reconciles chain-side "verification already submitted" idempotency.
  finalizer.go
    Phase 3: VERIFIED → publish + cleanup; FAILED/EXPIRED → cleanup only; transient states no-op.
  peer_client.go
    secureVerifierFetcher dials via the same secure-rpc / lumeraid stack the legacy storage_challenge loop uses.

supernode/transport/grpc/self_healing/handler.go
    Streaming ServeReconstructedArtefacts RPC. DefaultCallerIdentityResolver pulls verifier identity from the secure-rpc (Lumera ALTS) handshake via pkg/reachability.GrpcRemoteIdentityAndAddr; production wiring uses this so req.VerifierAccount is never trusted alone.
    Authorizes caller ∈ op.VerifierSupernodeAccounts AND identity == op.HealerSupernodeAccount; refuses with FailedPrecondition when not the assigned healer and PermissionDenied for unassigned callers. 1 MiB chunks.

proto/supernode/self_healing.proto
    SelfHealingService { ServeReconstructedArtefacts streams chunks }. Makefile gen-supernode wires it; gen/supernode/self_healing*.pb.go regenerated.

supernode/cascade/reseed.go
    Split RecoveryReseed: PersistArtifacts=true (legacy/republish) vs PersistArtifacts=false (LEP-6 stage-only). Adds stageArtefacts + PublishStagedArtefacts. Stages reconstructed file bytes and a JSON manifest the §19 transport reads.

supernode/cascade/staged.go
    ReadStagedHealOp helper used by the transport handler.

supernode/cascade/interfaces.go
    CascadeTask interface gains RecoveryReseed + PublishStagedArtefacts so self_healing depends only on the factory abstraction.

pkg/storage/queries/self_healing_lep6.go
    Tables heal_claims_submitted (PK heal_op_id) and heal_verifications_submitted (PK (heal_op_id, verifier_account)) for restart dedup. Typed sentinel errors ErrLEP6ClaimAlreadyRecorded / ErrLEP6VerificationAlreadyRecorded. Migrations wired in OpenHistoryDB.

pkg/storage/queries/local.go
    LocalStoreInterface embeds LEP6HealQueries.

supernode/config/config.go
    SelfHealingConfig YAML block (enabled, poll_interval_ms, max_concurrent_*, staging_dir, verifier_fetch_timeout_ms, verifier_fetch_attempts). Default disabled until activation.

supernode/cmd/start.go
    Constructs selfHealingService.Service + selfHealingRPC.Server (with DefaultCallerIdentityResolver) when SelfHealingConfig.Enabled, registers SelfHealingService_ServiceDesc on the gRPC server, appends the runner to the lifecycle services list. Reuses cService (cascade factory) and historyStore.

Tests (16 mandatory; all PASS)

supernode/self_healing/service_test.go
  1. TestVerifier_ReadsOpResultHashForComparison (R-bug pin)
  2. TestVerifier_HashMismatchProducesVerifiedFalse
  2b. TestVerifier_FetchFailureSubmitsNonEmptyHash (BLOCKER pin)
  3. TestVerifier_FetchesFromAssignedHealerOnly (§19 gate)
  6. TestHealer_FailedSubmitDoesNotPersistDedupRow (ordering)
  6b. TestHealer_ReconcilesExistingChainClaimAfterCrash (recovery)
  7. TestHealer_RaptorQReconstructionFailureSkipsClaim (Scenario C1)
  8. TestFinalizer_VerifiedTriggersPublishToKAD (Scenario A)
  9. TestFinalizer_FailedSkipsPublish_DeletesStaging (Scenario B)
  10. TestFinalizer_ExpiredSkipsPublish_DeletesStaging (Scenario C2)
  11. TestService_NoRoleSkipsOp
  12. TestService_UnspecifiedModeSkipsEntirely (mode gate)
  13. TestService_FinalStateOpsIgnored
  14. TestDedup_RestartDoesNotResubmit (3-layer dedup)

supernode/transport/grpc/self_healing/handler_test.go
  4. TestServeReconstructedArtefacts_AuthorizesOnlyAssignedVerifiers
  5. TestServeReconstructedArtefacts_RejectsUnassignedCaller (also covers non-assigned-healer FailedPrecondition refusal)

pkg/storage/queries/self_healing_lep6_test.go
  TestLEP6_HealClaim_RoundTripAndDedup
  TestLEP6_HealVerification_PerVerifierDedup

Validation

  go test ./supernode/self_healing/...                  PASS (2.66s)
  go test ./supernode/transport/grpc/self_healing/...   PASS (0.09s)
  go test ./supernode/cascade/...                       PASS (0.09s)
  go test ./pkg/storage/queries/...                     PASS (0.20s)
  go test ./pkg/storagechallenge/... ./supernode/storage_challenge \
    ./supernode/host_reporter ./pkg/lumera/modules/audit \
    ./pkg/lumera/modules/audit_msg                      PASS
  go vet (touched + all transitively reachable pkgs)    PASS
  go build (targeted)                                   PASS

(full repo go build fails only on pre-existing github.com/kolesa-team/go-webp libwebp-dev system-header issue; unrelated to this change.)

Resolved decisions applied

✓ Branch base: PR-3 tip f79f88f, NOT self-healing-improvements (single chain-driven service per Bilal direction; legacy 3-way Request/Verify/Commit RPC discarded).
✓ Verifier compares against op.ResultHash (chain msg_storage_truth.go:291). Pinned by TestVerifier_ReadsOpResultHashForComparison.
✓ Hash recipe = cascadekit.ComputeBlake3DataHashB64 (=Action.DataHash recipe). Same recipe healer + verifier + chain enforce.
✓ KAD publish AFTER chain VERIFIED (§19 healer-served-path gate); staging directory is the only authority before quorum.
✓ Finalizer mechanism: Opt 2b (per-op GetHealOp poll, folded into single tick loop); no Tendermint WS, no monotonic-growth poll.
✓ Concurrency default: semaphore=2 reconstructs (RaptorQ RAM-aware), 4 verifications, 2 publishes.
✓ Mode gate: UNSPECIFIED skips dispatcher entirely (Service.tick early-return; verified by TestService_UnspecifiedModeSkipsEntirely).
✓ Three-layer dedup: sync.Map + bounded semaphores + SQLite (heal_claims_submitted + heal_verifications_submitted).
✓ Submit-then-persist ordering with reconcile path for crash recovery.
✓ Non-empty placeholder VerificationHash on negative attestations (chain rejects empty regardless of verified bool).
✓ Caller authentication via secure-rpc / Lumera ALTS handshake at transport layer; req.VerifierAccount never trusted alone in production.

Plan: docs/plans/LEP6_PR4_EXECUTION_PLAN.md
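For illustration, the negative-attestation placeholder and the n/2+1 quorum quoted in this message can be written in a few lines. The function names are hypothetical, and standard (padded) base64 is an assumption, since the message only says "base64".

```go
package main

import (
	"crypto/sha256"
	"encoding/base64"
	"fmt"
)

// negativeAttestationHash returns base64(sha256("lep6:negative-attestation:" + reason)).
// It is deterministic and never empty, which is all the chain requires for
// verified=false votes according to this message.
// ASSUMPTION: standard padded base64 encoding.
func negativeAttestationHash(reason string) string {
	sum := sha256.Sum256([]byte("lep6:negative-attestation:" + reason))
	return base64.StdEncoding.EncodeToString(sum[:])
}

// quorum returns the number of VERIFIED votes needed out of n verifiers (n/2+1).
func quorum(n int) int {
	return n/2 + 1
}

func main() {
	for _, reason := range []string{"fetch_failed", "hash_compute_failed"} {
		fmt.Printf("%-20s %s\n", reason, negativeAttestationHash(reason))
	}
	fmt.Println("quorum for 5 verifiers:", quorum(5))
}
```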
Implements the PR-5 Supernode side of LEP-6 storage-truth recheck evidence on top of the PR-4 heal-op dispatch branch.

Public surfaces added:
- supernode/recheck: Candidate, RecheckResult, Finder, Attestor, Service, ReporterSource, SupernodeReporterSource, eligibility and outcome mapping helpers.
- pkg/storage/queries: RecheckQueries plus SQLite-backed HasRecheckSubmission and RecordRecheckSubmission.
- pkg/lumera/modules/audit: GetEpochReportsByReporter query wrapper for network-wide candidate discovery.
- supernode/storage_challenge: LEP6Dispatcher.Recheck to execute RECHECK-bucket proofs without adding results to epoch reports.

Spec/chain alignment decisions:
- Candidate discovery is network-wide: the service lists registered supernodes and scans EpochReportsByReporter over the configured lookback window, rather than only scanning this node's own report.
- Recheck candidate eligibility mirrors chain storage transcript records: only HASH_MISMATCH, TIMEOUT_OR_NO_RESPONSE, OBSERVER_QUORUM_FAIL, and INVALID_TRANSCRIPT originals are eligible.
- The service rejects self-target candidates and self-reported challenged results because chain SubmitStorageRecheckEvidence rejects creator == challenged_supernode_account and creator == challenged result reporter.
- Recheck execution maps local PASS to PASS and confirmed hash mismatch to RECHECK_CONFIRMED_FAIL; timeout/quorum/invalid transcript classes remain explicit and are not collapsed.
- Recheck execution reuses the PR-3 compound dispatcher in RECHECK bucket mode with an isolated temporary buffer so recheck results are submitted only through MsgSubmitStorageRecheckEvidence and are never included in host epoch reports.
- Local dedup is submit-then-persist keyed by epoch_id + ticket_id (creator/self is implicit locally); tx hard-fail does not persist, while chain replay/already-submitted errors persist local dedup for idempotence.
- Startup/config wiring is additive under storage_challenge.lep6.recheck and remains disabled unless explicitly enabled.

Tests added/updated:
- Eligibility matrix for all eligible and rejected result classes.
- Outcome mapping for PASS, RECHECK_CONFIRMED_FAIL, timeout, quorum, and invalid transcript.
- Finder lookback/order/limit/local-dedup behavior.
- Network-wide reporter discovery regression so peer-reported failures are discovered and not self-report-only.
- Self-target and self-reported candidate rejection pinned against chain validation.
- Service mode gate and submit path.
- Attestor submit-then-persist, tx hard-fail retry safety, idempotent already-submitted handling, and required-field rejection.
- SQLite recheck submission idempotence/dedup preservation.
- Dispatcher RECHECK execution path integration through focused package tests.

Validation:
- PATH=/home/openclaw/.local/go/bin:$PATH go test ./supernode/recheck ./pkg/storage/queries ./supernode/storage_challenge ./supernode/cmd ./pkg/lumera/modules/audit => PASS
- PATH=/home/openclaw/.local/go/bin:$PATH go test ./supernode/host_reporter ./supernode/self_healing ./supernode/transport/grpc/self_healing ./supernode/recheck ./pkg/storage/queries ./supernode/storage_challenge ./supernode/cmd ./pkg/lumera/modules/audit => PASS
- PATH=/home/openclaw/.local/go/bin:$PATH go vet ./supernode/recheck ./pkg/storage/queries ./supernode/storage_challenge ./supernode/cmd ./pkg/lumera/modules/audit ./supernode/host_reporter ./supernode/self_healing ./supernode/transport/grpc/self_healing => PASS
- git diff --check => PASS
- PATH=/home/openclaw/.local/go/bin:$PATH go test ./... => expected local environment failure only in pkg/storage/files due to missing go-webp system headers webp/decode.h and webp/encode.h; other visible packages pass.

Parent: supernode/LEP-6-heal-op-dispatch @ 043fba4.
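The eligibility rules listed in the alignment decisions above reduce to a small predicate. The sketch below uses a hypothetical `candidate` struct and plain strings for the result classes; the real Candidate type in supernode/recheck and the chain enum types will differ.

```go
package main

import "fmt"

// candidate is a hypothetical, simplified stand-in for a recheck candidate.
type candidate struct {
	TargetAccount   string // challenged supernode
	ReporterAccount string // reporter of the original result
	ResultClass     string // class of the original result
}

// eligibleClasses lists the original result classes that may be rechecked,
// as named in this message.
var eligibleClasses = map[string]bool{
	"HASH_MISMATCH":          true,
	"TIMEOUT_OR_NO_RESPONSE": true,
	"OBSERVER_QUORUM_FAIL":   true,
	"INVALID_TRANSCRIPT":     true,
}

// eligibleForRecheck rejects ineligible classes, self-target candidates, and
// results this node reported itself, since the chain rejects both self cases.
func eligibleForRecheck(self string, c candidate) bool {
	if !eligibleClasses[c.ResultClass] {
		return false
	}
	if c.TargetAccount == self || c.ReporterAccount == self {
		return false
	}
	return true
}

func main() {
	self := "sn-a"
	cases := []candidate{
		{"sn-b", "sn-c", "HASH_MISMATCH"},
		{"sn-a", "sn-c", "HASH_MISMATCH"},
		{"sn-b", "sn-a", "INVALID_TRANSCRIPT"},
		{"sn-b", "sn-c", "PASS"},
	}
	for _, c := range cases {
		fmt.Printf("%+v eligible=%v\n", c, eligibleForRecheck(self, c))
	}
}
```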
Resolves all 33 items from mateeullahmalik's CHANGES_REQUESTED review on PR #286. Per Matee's lens (silent data-loss, chain N/R/D math fragility, operator-impact / fee-burn, DoS / bulk-exfil on authenticated handlers, English-substring chain-error matching), every finding is closed without spec divergence and with regression tests.

Highlights by failure class:

1. Typed chain-error sentinels (H9 umbrella; foundation for C3/C4/C5/H1/L3)
   - new pkg/lumera/chainerrors with predicates (errors.Is + gRPC code + substring fallback) and transient short-circuit; replaces every strings.Contains(err.Error(), …) under self_healing/recheck/storage_challenge.

2. Storage layer (C2, M8, M12, L3, L4)
   - recheck dedup migrated to (heal_op_id, target_supernode) PK with typed ErrAlreadyExists, INSERT … ON CONFLICT DO NOTHING.
   - PRAGMA-guarded ALTER TABLE migrations; tx-helper sequence cap and fairness; finder per-reporter error isolation.

3. Self-healing safety overhaul (C3, C4, C5, H1, H2, H7, M2, M5, M7, L2)
   - reconcile-not-purge for transient errors; pre-submit deadline-epoch check; paginated GetHealOpsByStatus; bounded streaming caps; per-op deadline goroutines; finalizer runs regardless of mode-gate; reseed via os.Rename / io.Copy; canonical negative-attestation reason taxonomy.

4. Storage-challenge dispatch + buffer (C2, H3, H4, H5, H6, M4, M9, M10, M11, L5)
   - chain-anchored partial rows for pre-derivation early returns (ctx.Err passthrough); sign errors drop the row + metric, never lie; arrival-order + (target,bucket) fairness buffer (no lex shaping); no class swap when rolled class empty (NO_ELIGIBLE_TICKET only); bounded LRU index-size cache; SQLite-persisted lastSubmittedEpoch; at-least-one-class-non-zero ticket gate; per-call recheck buffer.

5. Operator config / shutdown / probing (C1, M1, M3, M6, L6)
   - LEP-6 toggles default-FALSE on missing config; startup advisory WARN names each disabled service; structural validator rejects recheck=true with parents disabled; staging-dir resolved via GetFullPath; historyStore.CloseHistoryDB moved after services drain; probeTCP taxonomy distinguishes ECONNREFUSED (CLOSED) from DNS / EHOSTUNREACH / ctx.Err / Timeout (UNKNOWN); fixtures aligned with new gating chain.

6. Transport handlers (C6, H8, L1)
   - GetCompoundProof: per-call MaxCompoundRanges=16, per-range cap 4*LEP6CompoundRangeLenBytes, MaxCompoundAggregateBytes=16 KiB; rejected before any artifact bytes are read.
   - ServeReconstructedArtefacts: gated on op.Status == HealOpStatus_HEAL_OP_STATUS_HEALER_REPORTED.
   - NewServer rejects nil resolveCaller; NewServerForTest is the documented test-only escape hatch.

Spec-fidelity: no scoring constants changed; no chain-side semantics altered. Chain-anchored validator rules cited at chain path:line for every consensus-affecting branch (PK shape, partial-row class, dispatch class fallback, deadline sentinel).

Validation: go build, go vet, focused per-wave package tests, and full go test $(go list ./... | grep -v /tests) -count=1 sweep: zero regressions across 50+ packages.
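A sketch of how the GetCompoundProof request caps from item 6 could be enforced before any artifact bytes are read. The value used for the range length is a placeholder (this message does not state LEP6CompoundRangeLenBytes), and the `byteRange` type with half-open `[Start, End)` semantics is an assumption for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

const (
	// rangeLenBytes is a PLACEHOLDER for LEP6CompoundRangeLenBytes; the real
	// value is not given in this message.
	rangeLenBytes     = 256
	maxCompoundRanges = 16
	maxRangeBytes     = 4 * rangeLenBytes
	maxAggregateBytes = 16 * 1024
)

// byteRange is a hypothetical half-open range [Start, End).
type byteRange struct {
	Start, End uint64
}

// validateRanges checks the caps on the request alone, so an oversized or
// out-of-bounds request is rejected before the artifact is touched.
func validateRanges(ranges []byteRange, artifactSize uint64) error {
	if len(ranges) == 0 {
		return errors.New("no ranges requested")
	}
	if len(ranges) > maxCompoundRanges {
		return fmt.Errorf("too many ranges: %d > %d", len(ranges), maxCompoundRanges)
	}
	var total uint64
	for i, r := range ranges {
		if r.End <= r.Start {
			return fmt.Errorf("range %d is empty or inverted", i)
		}
		if r.End > artifactSize {
			return fmt.Errorf("range %d ends past artifact size %d", i, artifactSize)
		}
		n := r.End - r.Start
		if n > maxRangeBytes {
			return fmt.Errorf("range %d too long: %d > %d bytes", i, n, maxRangeBytes)
		}
		// With the placeholder length above, 16 maximal ranges exactly reach
		// this cap; the check still guards against other constant choices.
		total += n
		if total > maxAggregateBytes {
			return fmt.Errorf("aggregate too large: %d > %d bytes", total, maxAggregateBytes)
		}
	}
	return nil
}

func main() {
	ok := []byteRange{{0, 256}, {1024, 1280}, {4096, 4352}, {8192, 8448}}
	fmt.Println("valid request error:", validateRanges(ok, 1<<20))
	fmt.Println("oversized range error:", validateRanges([]byteRange{{0, 1 << 16}}, 1<<20))
}
```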