fix(publisher): filter ACK quorum candidates by <contextGraphsServed>#556
fix(publisher): filter ACK quorum candidates by <contextGraphsServed>#556branarakic wants to merge 4 commits into
<contextGraphsServed>#556Conversation
When the ACK collector picks core-node candidates for a publish quorum, it currently uses every connected core regardless of which context graphs that core has advertised hosting. Cores that don't host the target CG can only respond by throwing inside their StorageACK handler (typically `No data found in SWM graph ...`), which the publisher sees as a libp2p stream reset and has to time out / retry against. In practice this surfaces as opaque quorum failures (see GitHub issue #541 — `repnet-v2-official` is `accessPolicy=public, replicationPolicy=full`, so every core must host it; one beacon failed to enroll the CG after a chain redeploy, the publisher kept selecting it as an ACK candidate, and the publish failed with `MinSignaturesRequirementNotMet`). Add an optional `getCorePeersHostingContextGraph(cgIdStr)` dependency on `ACKCollectorDeps`. When provided, the collector queries it before ACK collection: - Filtered set ≥ `requiredACKs` → use it; log how many cores were excluded by the filter (with their last-8 ids). - Filtered set < `requiredACKs` → log a single WARN naming the CG and spelling out which connected cores do vs. don't advertise hosting it, then **fall back to the full connected-core set**. This deliberately keeps publishes live during discovery races / stale-registry windows but makes hosting-coverage bugs visible in the log instead of as opaque ACK timeouts. Implementation lives in `packages/publisher/src/hosting-resolver.ts` and runs a single SPARQL `SELECT` against the agent registry graph, delimiter-aware so `repnet` never falsely matches `repnet-edge-smoke`. Both call sites — `packages/agent/src/dkg-agent.ts:createV10ACKProvider` and the daemon's `ackTransportFactory` in `packages/cli/src/daemon/lifecycle.ts` — wire the dep through; the indirection in `packages/cli/src/publisher-runner.ts:ACKTransportFactory` threads the optional method through so existing embedders keep working unchanged. Tests in `packages/publisher/test/v10-ack-edge-cases.test.ts` cover: filtered-ok, partial→fallback+WARN, empty→fallback+WARN, throw→ fallback+WARN, async filter, no-dep parity with today, and a small `resolvePeersHostingContextGraph` suite incl. the literal #541 hosting layout (peer-A and peer-B advertise the CG, peer-C doesn't, only A and B come back). Co-authored-by: Cursor <cursoragent@cursor.com>
| PREFIX skill: <https://dkg.origintrail.io/skill#> | ||
| SELECT DISTINCT ?peerId WHERE { | ||
| GRAPH ?g { | ||
| ?agent dkg:peerId ?peerId ; |
There was a problem hiding this comment.
🔴 Bug: This query never constrains the result set to core-node profiles (dkg:nodeRole "core" / dkg:CoreNode). During early startup getConnectedCorePeers() intentionally falls back to all connected peers until protocol discovery completes, so a few edge nodes that advertise the CG can satisfy this filter and the collector will send /storage-ack requests only to peers that do not implement that protocol. Filter to core nodes here, or intersect against the discovered core-peer set before these matches are treated as ACK candidates.
| const matched = allConnected.filter(p => hostingSet.has(p)); | ||
| const excluded = allConnected.filter(p => !hostingSet.has(p)); | ||
|
|
||
| if (matched.length >= REQUIRED_ACKS) { |
There was a problem hiding this comment.
🔴 Bug: When matched.length >= REQUIRED_ACKS, the hosting-filtered set becomes the only candidate list. That turns the registry into a hard gate: one stale/false-positive advertisement in matched can drop the publish to 2/3 ACKs even though excluded connected peers could have satisfied quorum, which is a behavioral regression from the previous 'try every connected core' path. Safer behavior is to prioritize matched first, then continue with the remaining connected peers if quorum is still missing.
- hosting-resolver: constrain SPARQL to dkg:CoreNode agents so an edge node that publishes contextGraphsServed (e.g. for join-time discovery) can no longer leak into the ACK candidate set; without this the early-startup fallback in getConnectedCorePeers() to "all connected peers" could let the collector dial a peer that just stream-resets /dkg/10.0.0/storage-ack. - ack-collector: replace the hard-gate "matched OR fallback" choice with two-wave dialling. The priority wave (advertised cores) is dialled first; the fallback wave (the rest of the connected pool) is dialled only if the priority wave fails to reach quorum. This removes the regression where a single stale advertisement on a matched peer could fail a publish the rest of the pool could have satisfied, while still avoiding wasted dials on cores that don't host the CG in the happy path. - tests: cover edge-node filtering in the resolver, fallback-on- stale-advertisement and "no fallback dialled when priority alone satisfies quorum" in the collector. Update fixture to emit rdf:type dkg:CoreNode/EdgeNode to match the new query. Co-authored-by: Cursor <cursoragent@cursor.com>
| if (this.deps.getCorePeersHostingContextGraph) { | ||
| let hostingPeers: string[] = []; | ||
| try { | ||
| hostingPeers = await Promise.resolve( |
There was a problem hiding this comment.
🔴 Bug: The hosting-filter lookup is awaited before the ACK timeout starts. If the local registry query hangs or gets slow, publish can block indefinitely without ever dialing peers. Please bound this lookup with a short fail-open timeout, or start the overall timeout before awaiting it.
| PREFIX dkg: <https://dkg.network/ontology#> | ||
| PREFIX skill: <https://dkg.origintrail.io/skill#> | ||
| SELECT DISTINCT ?peerId WHERE { | ||
| GRAPH ?g { |
There was a problem hiding this comment.
🟡 Issue: GRAPH ?g makes ACK routing trust matching triples from any named graph in the store, not just the agent-registry graph where profiles are published. A copied or stale profile elsewhere can skew the priority wave toward the wrong peers. Scope this query to the authoritative registry data graph instead of all graphs.
- ack-collector: bound the optional `getCorePeersHostingContextGraph` lookup with `HOSTING_FILTER_TIMEOUT_MS` (1.5s). The lookup runs against the local triple store BEFORE the `ACK_TIMEOUT_MS` budget even begins, so a slow / hung registry query could previously block a publish indefinitely. On timeout we treat the result as "no hosting signal" and fall back to the legacy single-wave path. - hosting-resolver: pin the SPARQL to the authoritative agent- registry named graph (`did:dkg:context-graph:agents`, mirroring `AGENT_REGISTRY_GRAPH` in `packages/agent/src/profile.ts`). `GRAPH ?g` previously let ACK routing trust profile triples landing in any named graph (e.g. a stale snapshot held elsewhere in the store), which could skew the priority wave toward the wrong peers. Re-export the constant as `AGENT_REGISTRY_NAMED_GRAPH` for downstream callers / tests. - tests: add coverage for both — a lookup-hangs-forever case (must fall back to single-wave dialling well below the ACK timeout) and a stray-named-graph case (a copied profile in a non-registry graph must NOT be picked up). Co-authored-by: Cursor <cursoragent@cursor.com>
| let hostingPeers: string[] = []; | ||
| try { | ||
| const lookupPromise = Promise.resolve( | ||
| this.deps.getCorePeersHostingContextGraph(contextGraphIdStr), |
There was a problem hiding this comment.
🟡 Issue: The hosting lookup is keyed by contextGraphIdStr, but the ACK handler loads data from swmGraphId ?? contextGraphIdStr. In the remap flow, peers advertise the source SWM graph, not the target on-chain ACK domain, so this will prioritize the wrong peers or miss the hosting signal entirely. Use params.swmGraphId ?? contextGraphIdStr for the hosting-filter query.
| .map(p => p.toString()) | ||
| .filter(id => id !== this.peerId) | ||
| .filter(id => advertisedSet.has(id)); | ||
| } catch { |
There was a problem hiding this comment.
🟡 Issue: Catching and returning [] here hides real resolver/store failures from ACKCollector, so production regressions become indistinguishable from a legitimate 'no hosting signal' fallback. The collector already handles failures safely and logs the cause; either let the error propagate to it, or log the exception before returning [] here (same issue in packages/cli/src/daemon/lifecycle.ts).
There was a problem hiding this comment.
Codex review produced 2 comment(s) but all targeted lines outside the diff and were dropped. Check the workflow logs for details.
Summary
Fixes the publish failure shape described in #541 by stopping the ACK collector from dialling cores that don't advertise hosting the target context graph.
When the ACK collector picks candidates for a quorum, it currently uses every connected core regardless of which context graphs that core has advertised hosting. Cores that don't host the target CG can only respond by throwing inside their StorageACK handler (typically
No data found in SWM graph …), which the publisher sees as a libp2p stream reset — and the publish times out / fails on-chain withMinSignaturesRequirementNotMet.What's in this PR
getCorePeersHostingContextGraph(cgIdStr)dependency onACKCollectorDeps. When provided, the collector queries it before ACK collection:requiredACKs→ uses it; logs how many cores were excluded with their last-8 ids.requiredACKs→ logs a single WARN naming the CG and spelling out which connected cores do vs. don't advertise hosting it, then falls back to the full connected-core set. This deliberately keeps publishes live during discovery races / stale-registry windows, but makes hosting-coverage bugs visible in the log instead of presenting as opaque ACK timeouts.packages/publisher/src/hosting-resolver.tsruns a single SPARQLSELECTagainst the agent-registry graph and is delimiter-aware so e.g.…/repnetnever falsely matches…/repnet-edge-smoke.packages/agent/src/dkg-agent.ts:createV10ACKProviderand the daemon'sackTransportFactoryinpackages/cli/src/daemon/lifecycle.ts— wire the dep through.packages/cli/src/publisher-runner.ts:ACKTransportFactorythreads the optional method through so existing embedders keep working unchanged.The decision to warn-and-fall-back rather than hard-fail keeps the same publish path live for legitimate cases where the hosting registry is incomplete (e.g. discovery just hasn't caught up yet), while making the actual #541 root cause — beacon-3 not advertising
repnet-v2-officialdespite the CG beingreplicationPolicy=full— pop out of the warning logs immediately. A separate issue covers fixing beacon-3's chain-event poller / CG-discovery loop so it reconciles against the on-chain CG list rather than relying purely on live events.Test plan
pnpm --filter @origintrail-official/dkg-publisher build— greenpnpm --filter @origintrail-official/dkg-agent build— greenpnpm --filter "./packages/cli" build— greenpnpm --filter @origintrail-official/dkg-publisher test— 921 passed / 1 skipped (62 files), all greenp2p-resilience.test.ts,dkg-agent-diagnostics.test.ts) — greenpackages/publisher/test/v10-ack-edge-cases.test.tscover the new collector behaviour: filtered-ok, partial→fallback+WARN, empty→fallback+WARN, throw→fallback+WARN, async filter, and no-dep parity with todayresolvePeersHostingContextGraphdirectly: exact-UAL match, prefix-isolation guard, the literal RepNet CG15 publish cannot collect identity 1 ACK while public graph can #541 hosting layout (cores A and B advertise the CG, C doesn't, only A and B come back), empty-UAL, and SPARQL-escaped pathological UALBackward compatibility
The new dependency is optional: any embedder that didn't pass
getCorePeersHostingContextGraphkeeps the exact pre-existing collector behaviour. ExistingACKCollectorDepsconsumers compile and run unchanged.Related
Made with Cursor