metrics.md: add lean_aggregator_skipped_total + update aggregated_signatures_building_time status #36

Open
ch4r10t33r wants to merge 1 commit into leanEthereum:main from ch4r10t33r:add-aggregator-skipped-metric

@ch4r10t33r

Summary

Adds a new cross-client metric lean_aggregator_skipped_total{reason=...} so operators can answer "how many slots did aggregation actually run for, and how many were skipped and why?" with a single counter rather than deriving the answer from coverage gauges or grepping logs.

Also updates the existing lean_pq_sig_aggregated_signatures_building_time_seconds row to reflect what's actually exposed on the live devnet (Grandine and Zeam).

Motivation

Today no client exposes a standard skip counter. Operators surveying the live devnet found:

  • zeam exposes zeam_aggregate_skip_total{reason=not_aggregator|not_synced|missing_state|spawn_failed} (client-namespaced)
  • ream / grandine / ethlambda / lantern expose nothing equivalent

That leaves "did this aggregator drop a slot?" as a derived question. The two indirect proxies — per-slot subnet coverage (lean_attestation_aggregate_coverage_subnets) and lean_pq_sig_aggregated_signatures_total rate — both miss the "had-duty-and-silently-dropped" case, and the coverage gauge isn't even exposed by every client.

A first-class counter for skips ends the guesswork and makes "missed aggregations" comparable across the fleet.

Proposed label values

| reason | When it fires |
|---|---|
| not_aggregator | This slot's aggregation duty wasn't ours. Bookkeeping: lets you separate "no duty" from "had duty but skipped" |
| not_synced | Wall-lag or sync-status gate prevented aggregation (e.g. node is in behind_peers and aggregation is gated) |
| missing_state | Pre-state for the att_data target couldn't be resolved when the aggregator ran |
| spawn_failed | Aggregation worker queue was full / spawn error |
| other | Catch-all so clients can adopt incrementally without enumerating every internal failure mode |
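
To make the label semantics concrete, here is a minimal Python sketch of how a client might pick a reason value for a slot. Everything in it is illustrative: the function names, the boolean inputs, and in particular the order of the checks are assumptions, not part of the proposal, and the plain `Counter` only stands in for a real metrics library.

```python
from collections import Counter
from typing import Optional

# Stand-in for lean_aggregator_skipped_total{reason=...}. A real client would
# register a labelled counter with its metrics library instead.
skipped_total: Counter = Counter()


def skip_reason(is_aggregator: bool, is_synced: bool,
                has_pre_state: bool, spawn_ok: bool) -> Optional[str]:
    """Return the reason label for a skipped slot, or None if aggregation runs.

    The check order is an assumption made for this sketch. "other" is not
    returned here; it is reserved for failure modes a client does not model.
    """
    if not is_aggregator:
        return "not_aggregator"
    if not is_synced:
        return "not_synced"
    if not has_pre_state:
        return "missing_state"
    if not spawn_ok:
        return "spawn_failed"
    return None


def on_slot(is_aggregator: bool, is_synced: bool,
            has_pre_state: bool, spawn_ok: bool) -> bool:
    """Record a skip if there is one; return True when aggregation would run."""
    reason = skip_reason(is_aggregator, is_synced, has_pre_state, spawn_ok)
    if reason is not None:
        skipped_total[reason] += 1
        return False
    return True
```

Checking duty first keeps the not_aggregator bucket pure bookkeeping, so every other label counts only slots where the node actually had duty.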

Sum across labels = total aggregation cycles seen. sum by (reason) (rate(lean_aggregator_skipped_total[5m])) then gives both the duty distribution and the genuine-miss rate.
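
As a rough illustration of what the query yields, the Python sketch below computes a per-reason rate from two scrapes of the counter and then separates the genuine-miss rate (every reason except not_aggregator) from the no-duty bookkeeping. The function names and sample values are hypothetical, and the arithmetic is a simplification: unlike PromQL's `rate()`, it does not handle counter resets or extrapolate to the window edges.

```python
from typing import Dict


def per_reason_rate(earlier: Dict[str, float], later: Dict[str, float],
                    window_seconds: float) -> Dict[str, float]:
    """Per-second increase of each reason label between two scrapes."""
    return {
        reason: (value - earlier.get(reason, 0.0)) / window_seconds
        for reason, value in later.items()
    }


def genuine_miss_rate(rates: Dict[str, float]) -> float:
    """Sum of all skip rates except the no-duty bookkeeping label."""
    return sum(rate for reason, rate in rates.items()
               if reason != "not_aggregator")


# Hypothetical counter values scraped 300 seconds (5 minutes) apart.
earlier = {"not_aggregator": 100.0, "not_synced": 2.0}
later = {"not_aggregator": 400.0, "not_synced": 5.0, "missing_state": 3.0}

rates = per_reason_rate(earlier, later, window_seconds=300.0)
misses = genuine_miss_rate(rates)
```

With these made-up samples, not_aggregator dominates the distribution while the two genuine-miss labels contribute a small combined rate, which is the split the single counter is meant to expose.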

Status table

| Client | Status | Notes |
|---|---|---|
| Zeam | 📝 | Has an equivalent counter under zeam_aggregate_skip_total; to be renamed to lean_aggregator_skipped_total once adopted upstream |
| Others | | Not yet implemented |

Drive-by updates

lean_pq_sig_aggregated_signatures_building_time_seconds:

  • Grandine: □ → ✅ — verified exposed on the live devnet (~600 observations on grandine_0 over a ~16 min run, p50≈1.19s)
  • Zeam: 📝 → ✅ — implemented in zeam #941, exposed on devnet image sha256:bb801c18…, ~500 observations per aggregator (p50≈0.38s)

Test plan

  • Reviewed metrics.md rendering locally
  • Reviewers confirm naming + label set is acceptable
  • Reviewers from each client team confirm/correct the status table

…natures_building_time status

`lean_aggregator_skipped_total` (Validator Metrics) gives cross-client
visibility into missed aggregations. Today no standard skip counter
exists: zeam exposes a client-namespaced `zeam_aggregate_skip_total`,
and no other client has anything equivalent. Deriving "missed
aggregation" from coverage gauges is best-effort and silently misses
failures in which every aggregation is dropped.

Proposed labels:
  not_aggregator  — slot in which the node had no aggregation duty
                    (bookkeeping; lets you separate "no duty" from
                     "had duty but skipped")
  not_synced      — wall-lag or sync-status gate prevented aggregation
  missing_state   — pre-state for the att_data target was unavailable
  spawn_failed    — aggregation worker queue was full / spawn error
  other           — catch-all so clients can adopt incrementally

`sum by (reason) (rate(lean_aggregator_skipped_total[5m]))` then tells
you both the duty distribution and the genuine-miss rate.

Zeam status set to 📝 (in progress): it currently has a semantically
equivalent counter under a `zeam_*` prefix that will be renamed once
adopted upstream.

Also updates `lean_pq_sig_aggregated_signatures_building_time_seconds`:
  - Grandine: □ → ✅ (verified exposed on the live devnet, ~600 obs)
  - Zeam:     📝 → ✅ (implemented in zeam PR #941, exposed in
                       devnet image sha256:bb801c18..., ~500 obs per
                       aggregator)