metrics.md: add lean_aggregator_skipped_total + update aggregated_signatures_building_time status #36
Open
ch4r10t33r wants to merge 1 commit into
…natures_building_time status
`lean_aggregator_skipped_total` (Validator Metrics) gives cross-client
visibility into missed aggregations. Today no standard skip counter
exists: zeam exposes a client-namespaced `zeam_aggregate_skip_total`,
and no other client has anything equivalent. Deriving "missed
aggregation" from coverage gauges is best-effort and silently misses
failures in which every aggregation is dropped.
Proposed labels:
- not_aggregator: slot in which the node had no aggregation duty (bookkeeping; lets you separate "no duty" from "had duty but skipped")
- not_synced: wall-lag or sync-status gate prevented aggregation
- missing_state: pre-state for the att_data target was unavailable
- spawn_failed: aggregation worker queue was full / spawn error
- other: catch-all so clients can adopt incrementally
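As a rough illustration of how a client might wire up the proposed labels, here is a minimal sketch in plain Python. The `SkipReason` and `SkipCounter` names are hypothetical and not taken from any client; a real implementation would use the client's own metrics library. Only the metric name and the five label values come from this proposal.

```python
from collections import Counter
from enum import Enum


class SkipReason(str, Enum):
    """Label values proposed for lean_aggregator_skipped_total."""

    NOT_AGGREGATOR = "not_aggregator"
    NOT_SYNCED = "not_synced"
    MISSING_STATE = "missing_state"
    SPAWN_FAILED = "spawn_failed"
    OTHER = "other"


class SkipCounter:
    """Hypothetical in-process counter keyed by skip reason."""

    METRIC = "lean_aggregator_skipped_total"

    def __init__(self) -> None:
        # Pre-register every reason at 0 so each series exists before its first skip.
        self._counts = Counter({reason: 0 for reason in SkipReason})

    def record(self, reason: str) -> None:
        # Unknown reasons fall back to "other", mirroring the catch-all label.
        try:
            key = SkipReason(reason)
        except ValueError:
            key = SkipReason.OTHER
        self._counts[key] += 1

    def render(self) -> str:
        # Prometheus text exposition format: one sample line per reason.
        lines = [f"# TYPE {self.METRIC} counter"]
        for reason in SkipReason:
            count = self._counts[reason]
            lines.append(f'{self.METRIC}{{reason="{reason.value}"}} {count}')
        return "\n".join(lines)
```

Mapping unrecognised reasons to `other` is one way a client could adopt the metric incrementally: it can emit the counter before it has classified every skip path.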
`sum by (reason) (rate(lean_aggregator_skipped_total[5m]))` then tells
you both the duty distribution and the genuine-miss rate.
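To make the query's two readings concrete, the sketch below computes a simplified per-reason rate from two scrapes of the counter. It is an approximation only: PromQL's `rate()` also extrapolates to the window boundaries, which is omitted here. The function names are hypothetical, and treating every reason except `not_aggregator` as a genuine miss is an assumption drawn from the "no duty" versus "had duty but skipped" split described above.

```python
from typing import Dict

# Assumption: only "not_aggregator" means the node had no duty;
# every other reason counts as "had duty but skipped".
NO_DUTY_REASON = "not_aggregator"


def per_reason_rate(
    earlier: Dict[str, float], later: Dict[str, float], window_seconds: float
) -> Dict[str, float]:
    """Per-second increase of each reason's counter between two scrapes."""
    rates = {}
    for reason, value in later.items():
        delta = value - earlier.get(reason, 0.0)
        if delta < 0:
            # The counter went backwards, so assume a process restart reset it.
            delta = value
        rates[reason] = delta / window_seconds
    return rates


def genuine_miss_rate(rates: Dict[str, float]) -> float:
    """Sum of the rates for every reason other than the no-duty bookkeeping label."""
    return sum(rate for reason, rate in rates.items() if reason != NO_DUTY_REASON)


if __name__ == "__main__":
    # Illustrative values only, not measurements from any devnet.
    earlier = {"not_aggregator": 100, "not_synced": 2, "missing_state": 0}
    later = {"not_aggregator": 160, "not_synced": 5, "missing_state": 3}
    rates = per_reason_rate(earlier, later, window_seconds=300)
    print(rates)
    print(genuine_miss_rate(rates))
```

The per-reason dictionary corresponds to the duty distribution, and the filtered sum corresponds to the genuine-miss rate.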
Zeam status set to 📝 (in-progress): currently has a semantically-
equivalent counter under a `zeam_*` prefix that will be renamed once
adopted upstream.
Also updates `lean_pq_sig_aggregated_signatures_building_time_seconds`:
- Grandine: □ → ✅ (verified exposed on the live devnet, ~600 obs)
- Zeam: 📝 → ✅ (implemented in zeam PR #941, exposed in
devnet image sha256:bb801c18..., ~500 obs per
aggregator)
Summary
Adds a new cross-client metric, `lean_aggregator_skipped_total{reason=...}`, so operators can answer "how many slots did aggregation actually run for, and how many were skipped and why?" with a single counter rather than deriving the answer from coverage gauges or grepping logs.

Also updates the existing `lean_pq_sig_aggregated_signatures_building_time_seconds` row to reflect what's actually exposed on the live devnet (Grandine and Zeam).

Motivation
Today no client exposes a standard skip counter. Operators surveying the live devnet found:

- `zeam_aggregate_skip_total{reason=not_aggregator|not_synced|missing_state|spawn_failed}` (client-namespaced)

That leaves "did this aggregator drop a slot?" as a derived question. The two indirect proxies, per-slot subnet coverage (`lean_attestation_aggregate_coverage_subnets`) and the `lean_pq_sig_aggregated_signatures_total` rate, both miss the "had-duty-and-silently-dropped" case, and the coverage gauge isn't even exposed by every client.

A first-class counter for skips ends the guesswork and makes "missed aggregations" comparable across the fleet.
Proposed label values

- `not_aggregator`
- `not_synced` (… `behind_peers` and aggregation is gated)
- `missing_state`
- `spawn_failed`
- `other`

Sum across labels = total aggregation cycles seen. `sum by (reason) (rate(lean_aggregator_skipped_total[5m]))` then gives both the duty distribution and the genuine-miss rate.

Status table

- `zeam_aggregate_skip_total`; rename to `lean_aggregator_skipped_total` … upstream-adoption

Drive-by updates
`lean_pq_sig_aggregated_signatures_building_time_seconds`:

- … `sha256:bb801c18…`, ~500 observations per aggregator (p50≈0.38s)

Test plan

- `metrics.md` rendering locally