fix(publisher): filter ACK quorum candidates by `<contextGraphsServed>` by branarakic · Pull Request #556 · OriginTrail/dkg

branarakic · 2026-05-17T17:04:49Z

Summary

Fixes the publish failure shape described in #541 by stopping the ACK collector from dialling cores that don't advertise hosting the target context graph.

When the ACK collector picks candidates for a quorum, it currently uses every connected core regardless of which context graphs that core has advertised hosting. Cores that don't host the target CG can only respond by throwing inside their StorageACK handler (typically No data found in SWM graph …), which the publisher sees as a libp2p stream reset — and the publish times out / fails on-chain with MinSignaturesRequirementNotMet.

What's in this PR

New optional getCorePeersHostingContextGraph(cgIdStr) dependency on ACKCollectorDeps. When provided, the collector queries it before ACK collection:
- Filtered set ≥ requiredACKs → uses it; logs how many cores were excluded with their last-8 ids.
- Filtered set < requiredACKs → logs a single WARN naming the CG and spelling out which connected cores do vs. don't advertise hosting it, then falls back to the full connected-core set. This deliberately keeps publishes live during discovery races / stale-registry windows, but makes hosting-coverage bugs visible in the log instead of presenting as opaque ACK timeouts.
Implementation: packages/publisher/src/hosting-resolver.ts runs a single SPARQL SELECT against the agent-registry graph and is delimiter-aware so e.g. …/repnet never falsely matches …/repnet-edge-smoke.
Both call sites — packages/agent/src/dkg-agent.ts:createV10ACKProvider and the daemon's ackTransportFactory in packages/cli/src/daemon/lifecycle.ts — wire the dep through. packages/cli/src/publisher-runner.ts:ACKTransportFactory threads the optional method through so existing embedders keep working unchanged.

The decision to warn-and-fall-back rather than hard-fail keeps the same publish path live for legitimate cases where the hosting registry is incomplete (e.g. discovery just hasn't caught up yet), while making the actual #541 root cause — beacon-3 not advertising repnet-v2-official despite the CG being replicationPolicy=full — pop out of the warning logs immediately. A separate issue covers fixing beacon-3's chain-event poller / CG-discovery loop so it reconciles against the on-chain CG list rather than relying purely on live events.

Test plan

pnpm --filter @origintrail-official/dkg-publisher build — green
pnpm --filter @origintrail-official/dkg-agent build — green
pnpm --filter "./packages/cli" build — green
pnpm --filter @origintrail-official/dkg-publisher test — 921 passed / 1 skipped (62 files), all green
Spot-checked agent suites that exercise the V10 ACK provider (p2p-resilience.test.ts, dkg-agent-diagnostics.test.ts) — green
Six new tests in packages/publisher/test/v10-ack-edge-cases.test.ts cover the new collector behaviour: filtered-ok, partial→fallback+WARN, empty→fallback+WARN, throw→fallback+WARN, async filter, and no-dep parity with today
Five new tests cover resolvePeersHostingContextGraph directly: exact-UAL match, prefix-isolation guard, the literal RepNet CG15 publish cannot collect identity 1 ACK while public graph can #541 hosting layout (cores A and B advertise the CG, C doesn't, only A and B come back), empty-UAL, and SPARQL-escaped pathological UAL

Backward compatibility

The new dependency is optional: any embedder that didn't pass getCorePeersHostingContextGraph keeps the exact pre-existing collector behaviour. Existing ACKCollectorDeps consumers compile and run unchanged.

When the ACK collector picks core-node candidates for a publish quorum, it currently uses every connected core regardless of which context graphs that core has advertised hosting. Cores that don't host the target CG can only respond by throwing inside their StorageACK handler (typically `No data found in SWM graph ...`), which the publisher sees as a libp2p stream reset and has to time out / retry against. In practice this surfaces as opaque quorum failures (see GitHub issue #541 — `repnet-v2-official` is `accessPolicy=public, replicationPolicy=full`, so every core must host it; one beacon failed to enroll the CG after a chain redeploy, the publisher kept selecting it as an ACK candidate, and the publish failed with `MinSignaturesRequirementNotMet`). Add an optional `getCorePeersHostingContextGraph(cgIdStr)` dependency on `ACKCollectorDeps`. When provided, the collector queries it before ACK collection: - Filtered set ≥ `requiredACKs` → use it; log how many cores were excluded by the filter (with their last-8 ids). - Filtered set < `requiredACKs` → log a single WARN naming the CG and spelling out which connected cores do vs. don't advertise hosting it, then **fall back to the full connected-core set**. This deliberately keeps publishes live during discovery races / stale-registry windows but makes hosting-coverage bugs visible in the log instead of as opaque ACK timeouts. Implementation lives in `packages/publisher/src/hosting-resolver.ts` and runs a single SPARQL `SELECT` against the agent registry graph, delimiter-aware so `repnet` never falsely matches `repnet-edge-smoke`. Both call sites — `packages/agent/src/dkg-agent.ts:createV10ACKProvider` and the daemon's `ackTransportFactory` in `packages/cli/src/daemon/lifecycle.ts` — wire the dep through; the indirection in `packages/cli/src/publisher-runner.ts:ACKTransportFactory` threads the optional method through so existing embedders keep working unchanged. Tests in `packages/publisher/test/v10-ack-edge-cases.test.ts` cover: filtered-ok, partial→fallback+WARN, empty→fallback+WARN, throw→ fallback+WARN, async filter, no-dep parity with today, and a small `resolvePeersHostingContextGraph` suite incl. the literal #541 hosting layout (peer-A and peer-B advertise the CG, peer-C doesn't, only A and B come back). Co-authored-by: Cursor <cursoragent@cursor.com>

github-actions · 2026-05-17T17:07:22Z

+PREFIX skill: <https://dkg.origintrail.io/skill#>
+SELECT DISTINCT ?peerId WHERE {
+  GRAPH ?g {
+    ?agent dkg:peerId ?peerId ;


🔴 Bug: This query never constrains the result set to core-node profiles (dkg:nodeRole "core" / dkg:CoreNode). During early startup getConnectedCorePeers() intentionally falls back to all connected peers until protocol discovery completes, so a few edge nodes that advertise the CG can satisfy this filter and the collector will send /storage-ack requests only to peers that do not implement that protocol. Filter to core nodes here, or intersect against the discovered core-peer set before these matches are treated as ACK candidates.

github-actions · 2026-05-17T17:07:22Z

+      const matched = allConnected.filter(p => hostingSet.has(p));
+      const excluded = allConnected.filter(p => !hostingSet.has(p));
+
+      if (matched.length >= REQUIRED_ACKS) {


🔴 Bug: When matched.length >= REQUIRED_ACKS, the hosting-filtered set becomes the only candidate list. That turns the registry into a hard gate: one stale/false-positive advertisement in matched can drop the publish to 2/3 ACKs even though excluded connected peers could have satisfied quorum, which is a behavioral regression from the previous 'try every connected core' path. Safer behavior is to prioritize matched first, then continue with the remaining connected peers if quorum is still missing.

- hosting-resolver: constrain SPARQL to dkg:CoreNode agents so an edge node that publishes contextGraphsServed (e.g. for join-time discovery) can no longer leak into the ACK candidate set; without this the early-startup fallback in getConnectedCorePeers() to "all connected peers" could let the collector dial a peer that just stream-resets /dkg/10.0.0/storage-ack. - ack-collector: replace the hard-gate "matched OR fallback" choice with two-wave dialling. The priority wave (advertised cores) is dialled first; the fallback wave (the rest of the connected pool) is dialled only if the priority wave fails to reach quorum. This removes the regression where a single stale advertisement on a matched peer could fail a publish the rest of the pool could have satisfied, while still avoiding wasted dials on cores that don't host the CG in the happy path. - tests: cover edge-node filtering in the resolver, fallback-on- stale-advertisement and "no fallback dialled when priority alone satisfies quorum" in the collector. Update fixture to emit rdf:type dkg:CoreNode/EdgeNode to match the new query. Co-authored-by: Cursor <cursoragent@cursor.com>

github-actions · 2026-05-17T17:44:25Z

+    if (this.deps.getCorePeersHostingContextGraph) {
+      let hostingPeers: string[] = [];
+      try {
+        hostingPeers = await Promise.resolve(


🔴 Bug: The hosting-filter lookup is awaited before the ACK timeout starts. If the local registry query hangs or gets slow, publish can block indefinitely without ever dialing peers. Please bound this lookup with a short fail-open timeout, or start the overall timeout before awaiting it.

github-actions · 2026-05-17T17:44:25Z

+PREFIX dkg:   <https://dkg.network/ontology#>
+PREFIX skill: <https://dkg.origintrail.io/skill#>
+SELECT DISTINCT ?peerId WHERE {
+  GRAPH ?g {


🟡 Issue: GRAPH ?g makes ACK routing trust matching triples from any named graph in the store, not just the agent-registry graph where profiles are published. A copied or stale profile elsewhere can skew the priority wave toward the wrong peers. Scope this query to the authoritative registry data graph instead of all graphs.

- ack-collector: bound the optional `getCorePeersHostingContextGraph` lookup with `HOSTING_FILTER_TIMEOUT_MS` (1.5s). The lookup runs against the local triple store BEFORE the `ACK_TIMEOUT_MS` budget even begins, so a slow / hung registry query could previously block a publish indefinitely. On timeout we treat the result as "no hosting signal" and fall back to the legacy single-wave path. - hosting-resolver: pin the SPARQL to the authoritative agent- registry named graph (`did:dkg:context-graph:agents`, mirroring `AGENT_REGISTRY_GRAPH` in `packages/agent/src/profile.ts`). `GRAPH ?g` previously let ACK routing trust profile triples landing in any named graph (e.g. a stale snapshot held elsewhere in the store), which could skew the priority wave toward the wrong peers. Re-export the constant as `AGENT_REGISTRY_NAMED_GRAPH` for downstream callers / tests. - tests: add coverage for both — a lookup-hangs-forever case (must fall back to single-wave dialling well below the ACK timeout) and a stray-named-graph case (a copied profile in a non-registry graph must NOT be picked up). Co-authored-by: Cursor <cursoragent@cursor.com>

github-actions · 2026-05-17T18:30:54Z

+      let hostingPeers: string[] = [];
+      try {
+        const lookupPromise = Promise.resolve(
+          this.deps.getCorePeersHostingContextGraph(contextGraphIdStr),


🟡 Issue: The hosting lookup is keyed by contextGraphIdStr, but the ACK handler loads data from swmGraphId ?? contextGraphIdStr. In the remap flow, peers advertise the source SWM graph, not the target on-chain ACK domain, so this will prioritize the wrong peers or miss the hosting signal entirely. Use params.swmGraphId ?? contextGraphIdStr for the hosting-filter query.

github-actions · 2026-05-17T18:30:54Z

+            .map(p => p.toString())
+            .filter(id => id !== this.peerId)
+            .filter(id => advertisedSet.has(id));
+        } catch {


🟡 Issue: Catching and returning [] here hides real resolver/store failures from ACKCollector, so production regressions become indistinguishable from a legitimate 'no hosting signal' fallback. The collector already handles failures safely and logs the cause; either let the error propagate to it, or log the exception before returning [] here (same issue in packages/cli/src/daemon/lifecycle.ts).

github-actions

Codex review produced 2 comment(s) but all targeted lines outside the diff and were dropped. Check the workflow logs for details.

github-actions Bot reviewed May 17, 2026

View reviewed changes

branarakic mentioned this pull request May 17, 2026

feat(publisher): structured StorageACK declines instead of stream resets #559

Open

8 tasks

github-actions Bot reviewed May 17, 2026

View reviewed changes

fix(publisher): resolve hosting by source swm graph

c127553

github-actions Bot reviewed May 17, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(publisher): filter ACK quorum candidates by `<contextGraphsServed>`#556

fix(publisher): filter ACK quorum candidates by `<contextGraphsServed>`#556
branarakic wants to merge 4 commits into
mainfrom
fix/publisher-ack-quorum-filter-by-hosting

branarakic commented May 17, 2026

Uh oh!

github-actions Bot May 17, 2026

Uh oh!

github-actions Bot May 17, 2026

Uh oh!

github-actions Bot May 17, 2026

Uh oh!

github-actions Bot May 17, 2026

Uh oh!

github-actions Bot May 17, 2026

Uh oh!

github-actions Bot May 17, 2026

Uh oh!

github-actions Bot left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

branarakic commented May 17, 2026

Summary

What's in this PR

Test plan

Backward compatibility

Related

Uh oh!

github-actions Bot May 17, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot May 17, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot May 17, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot May 17, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot May 17, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot May 17, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant