Design: connector resilience — skip-if-degraded, gap tracking & user-triggered backfill#181
Design: connector resilience — skip-if-degraded, gap tracking & user-triggered backfill#181davidesner wants to merge 3 commits into
Conversation
…ser-triggered backfill Design for detecting connector degradation before a scheduled run (via the token-free `claude mcp list` health check), a configurable on_degraded policy (skip/warn/run per slot type), one-gap-per-outage tracking, and a user-triggered subagent-based backfill surfaced in the briefing and macOS app. Complementary to #121 (independent; PR links it).
8bc1db8 to
a6cc66c
Compare
Preflight is point-in-time and fails open, so it can miss (a) inconclusive probes and (b) mid-session token expiry. Add a reactive post-session reconciliation that reads the run's actual connector telemetry and records a gap retroactively when a critical connector was dark — guaranteeing no silent gap. Reconciliation reads JSONL telemetry and is gated on #121; the preflight is not.
ReviewThe problem framing is excellent and the layering (prevent → record → recover) is sound. But verifying the spec against the actual runner templates and engine surfaced a number of confirmed defects — several of the doc's "existing mechanism" claims are wrong, and a few core mechanisms are unimplementable as written. All findings are in Findings
Also confirmed, below the severity cap
On the question in the PR descriptionThe skip/warn/run shape is right — strict for briefing/consolidation, relaxed for research. But as specified, 🤖 Generated with Claude Code |
…, probe wiring, gap lifecycle Per review on #181, most severe first: 1. Drop the required_threshold quorum axis. Degraded = any critical connector down; tolerance is tuned per slot type via required_in_types. A majority quorum let the motivating outage (Slack+Linear = 2 of 10) run blind. 2. Skip signalling uses the budget-check pattern: guarded call, exit 3 = policy skip → orderly exit 0, any other rc = error → fail open. A bare non-zero under set -euo pipefail read as a crash and killed post-session steps. 3. Preflight/reconcile go into all three runner templates; hoist MODE in run-dreaming.sh.tmpl above the pre-session gates. 4. Warn transport rides the .scout-cache pending-file seam (env export from a child process cannot reach the session); degraded-run KB writes get a mechanical provenance tag; reconciliation records gaps for warn too. 5. Add explicit harness_server_name / preflight_command fields to connectors.yaml; existing registries cannot be matched to mcp-list output and wizard tool-name probes are unusable headless. 6. Reconciliation reuses the connector_health_report classifier (Pattern #48 / #54 suppression) instead of a naive zero-success rule; drop the sourceless --session flag. 7. Gap lifecycle gains a reopen+merge transition; backfill command is derived at render time, not stored (stale by construction). 8. Anonymize the mcp-list example (CLAUDE.md rule). Also: deeplink wording replaced with the app's existing CLI launcher; inconclusive-streak alerting after 3 consecutive failed probes; honest Phase-1 scoping of fail-open; explicit gitignore rule for gaps.jsonl; provenance marker cited as proposed (not existing); backfill defaults to one agent per connector with chunking only on volume and synthesis-owned relevance filtering; probe timing corrected (~10s full list) with a required timeout; banner example consistent with skipped_runs.
|
All findings verified against the repo and addressed in 79f7d8f. Per finding:
Below-cap items all taken as well: deeplink wording replaced with the existing On the description question: adopted your recommendation — fixed |
Motivation — the user story
That experience exposes three gaps in how Scout handles connectors today:
be up. Absence of signal looks identical to absence of data — the user never
learns a connector's token expired.
KB that later dreaming/briefing runs build on. A missed run is recoverable;
a run that looks complete but was blind is worse than a miss.
what was missed — the value is simply gone.
The existing
connector_health_report.pyis retrospective (rolls up JSONLlogs after the fact) and, per #121, has been structurally dark for weeks. It
can't prevent a blind run.
What this PR contains
Design doc only —
docs/superpowers/specs/2026-07-01-connector-resilience-design.md.No code yet; this is to open the discussion.
Design in one paragraph
Add a pre-session connector preflight to the runner (sibling to the existing
budget-check.shgate) that usesclaude mcp list— a token-free, ~2.4sbuilt-in health check — to detect degradation before the main session
launches. A configurable
on_degradedpolicy (skip/warn/run, per slottype) decides whether to run: strict for briefing/consolidation, relaxed for
dreaming/research. Because the preflight is point-in-time and fails open, a
post-session reconciliation acts as a safety net — reading the run's actual
connector telemetry and recording a gap retroactively if a critical connector
was dark. Skipped/blind runs record one gap per contiguous outage in
.scout-state/gaps.jsonl(window = last-healthy-run → recovery). Gaps surface inthe briefing output and the macOS app with a one-click backfill that runs a
/scout-backfillskill — which fans out subagents per connector × time chunkto reconstruct the entire missed period (however long) without exhausting
context.
Notable design decisions
prevention and fails open (a broken probe must not block all runs). A
post-session reconciliation is the safety net: it reads the run's actual
connector telemetry and records a gap retroactively if a critical connector was
dark — catching both the fail-open case and mid-session token expiry. Together
they mean no silent gap; the preflight alone does not. (Reconciliation reads
JSONL telemetry and is thus gated on Runner templates never export SCOUT_MODE — connector-log / session-tool-log hooks short-circuit; connector telemetry dark since 2026-06-07 #121; the preflight is not.)
when to spend.
single record and a single backfill action; the skill scales agent count to the
window.
isolated behind the preflight script and the stored
backfill_commandstring.No abstraction built now.
Relationship to #121
Two different relationships depending on the layer: the preflight probes live
connector state and works even while #121 is unfixed; the post-session
reconciliation reads the JSONL telemetry and is therefore gated on #121
(degrades safely — records nothing until the telemetry works). #121 still needs
its own ~2-line fix (runner never exports
SCOUT_MODE) and will be linked fromthe implementing PR, ideally landing before/with Phase 2.
Phasing (shippable increments)
is fixed) — makes gaps visible via both the proactive and reactive paths.
/scout-backfillskill — the recovery action.Status
Design for discussion — not yet planned/scheduled. Reviewers: does the
on_degradedposture (skip briefing/consolidation, warn dreaming, run research)match how you'd want it to behave?
🤖 Generated with Claude Code