Skip to content

Design: connector resilience — skip-if-degraded, gap tracking & user-triggered backfill#181

Open
davidesner wants to merge 3 commits into
mainfrom
spec/connector-resilience
Open

Design: connector resilience — skip-if-degraded, gap tracking & user-triggered backfill#181
davidesner wants to merge 3 commits into
mainfrom
spec/connector-resilience

Conversation

@davidesner

@davidesner davidesner commented Jul 1, 2026

Copy link
Copy Markdown
Collaborator

Motivation — the user story

As a Scout user returning from 6 weeks of leave, I wanted a summary of
everything I'd missed so I could catch up. The scheduled pipelines had been
running the whole time — but Linear and Slack were down, so the briefings
they produced had little value. Worse, I had no signal that the connectors
were even down: a briefing saying "nothing notable in Slack" is
indistinguishable from "Slack was unreachable." By the time I realized, the
window was gone. I want a clear signal that I have missing data, and a way to
backfill the entire missed period on demand.

That experience exposes three gaps in how Scout handles connectors today:

  1. Silent degradation. A run executes against whatever connectors happen to
    be up. Absence of signal looks identical to absence of data — the user never
    learns a connector's token expired.
  2. KB poisoning. A blind consolidation writes partial-view entries into the
    KB that later dreaming/briefing runs build on. A missed run is recoverable;
    a run that looks complete but was blind is worse than a miss.
  3. No recovery path. After an extended outage there's no way to reconstruct
    what was missed — the value is simply gone.

The existing connector_health_report.py is retrospective (rolls up JSONL
logs after the fact) and, per #121, has been structurally dark for weeks. It
can't prevent a blind run.

What this PR contains

Design doc onlydocs/superpowers/specs/2026-07-01-connector-resilience-design.md.
No code yet; this is to open the discussion.

Design in one paragraph

Add a pre-session connector preflight to the runner (sibling to the existing
budget-check.sh gate) that uses claude mcp list — a token-free, ~2.4s
built-in health check
— to detect degradation before the main session
launches. A configurable on_degraded policy (skip/warn/run, per slot
type) decides whether to run: strict for briefing/consolidation, relaxed for
dreaming/research. Because the preflight is point-in-time and fails open, a
post-session reconciliation acts as a safety net — reading the run's actual
connector telemetry and recording a gap retroactively if a critical connector
was dark. Skipped/blind runs record one gap per contiguous outage in
.scout-state/gaps.jsonl (window = last-healthy-run → recovery). Gaps surface in
the briefing output and the macOS app with a one-click backfill that runs a
/scout-backfill skill — which fans out subagents per connector × time chunk
to reconstruct the entire missed period (however long) without exhausting
context.

Notable design decisions

  • Two detection layers, not one. The pre-session preflight is best-effort
    prevention and fails open (a broken probe must not block all runs). A
    post-session reconciliation is the safety net: it reads the run's actual
    connector telemetry and records a gap retroactively if a critical connector was
    dark — catching both the fail-open case and mid-session token expiry. Together
    they mean no silent gap; the preflight alone does not. (Reconciliation reads
    JSONL telemetry and is thus gated on Runner templates never export SCOUT_MODE — connector-log / session-tool-log hooks short-circuit; connector telemetry dark since 2026-06-07 #121; the preflight is not.)
  • Backfill is always user-triggered — it's token-expensive; the user decides
    when to spend.
  • One gap = one outage = one button = N agents internally. A 6-week hole is a
    single record and a single backfill action; the skill scales agent count to the
    window.
  • Multi-harness (Codex, Gemini CLI, …) is a marked future direction only
    isolated behind the preflight script and the stored backfill_command string.
    No abstraction built now.

Relationship to #121

Two different relationships depending on the layer: the preflight probes live
connector state and works even while #121 is unfixed; the post-session
reconciliation
reads the JSONL telemetry and is therefore gated on #121
(degrades safely — records nothing until the telemetry works). #121 still needs
its own ~2-line fix (runner never exports SCOUT_MODE) and will be linked from
the implementing PR, ideally landing before/with Phase 2.

Phasing (shippable increments)

  1. Preflight + skip/warn/run policy — stops blind runs. Independent of Runner templates never export SCOUT_MODE — connector-log / session-tool-log hooks short-circuit; connector telemetry dark since 2026-06-07 #121.
  2. Gap tracking + CLI + briefing banner + post-session reconciliation (once Runner templates never export SCOUT_MODE — connector-log / session-tool-log hooks short-circuit; connector telemetry dark since 2026-06-07 #121
    is fixed) — makes gaps visible via both the proactive and reactive paths.
  3. /scout-backfill skill — the recovery action.
  4. macOS app gap surface + Backfill button.

Status

Design for discussion — not yet planned/scheduled. Reviewers: does the
on_degraded posture (skip briefing/consolidation, warn dreaming, run research)
match how you'd want it to behave?

🤖 Generated with Claude Code

…ser-triggered backfill

Design for detecting connector degradation before a scheduled run (via the
token-free `claude mcp list` health check), a configurable on_degraded policy
(skip/warn/run per slot type), one-gap-per-outage tracking, and a
user-triggered subagent-based backfill surfaced in the briefing and macOS app.

Complementary to #121 (independent; PR links it).
@davidesner davidesner force-pushed the spec/connector-resilience branch from 8bc1db8 to a6cc66c Compare July 1, 2026 21:28
Preflight is point-in-time and fails open, so it can miss (a) inconclusive
probes and (b) mid-session token expiry. Add a reactive post-session
reconciliation that reads the run's actual connector telemetry and records a
gap retroactively when a critical connector was dark — guaranteeing no silent
gap. Reconciliation reads JSONL telemetry and is gated on #121; the preflight
is not.
@jordanrburger jordanrburger requested a review from a team July 1, 2026 21:53
@jordanrburger

Copy link
Copy Markdown
Collaborator

Review

The problem framing is excellent and the layering (prevent → record → recover) is sound. But verifying the spec against the actual runner templates and engine surfaced a number of confirmed defects — several of the doc's "existing mechanism" claims are wrong, and a few core mechanisms are unimplementable as written. All findings are in 2026-07-01-connector-resilience-design.md, most severe first.

Findings

  1. L194 — The recommended consolidation: {skip, majority} posture fails the design's own motivating scenario, and "degraded" has two irreconciled definitions. Slack+Linear down leaves 8 of the 10 critical connectors up (verified against connectors.yaml — exactly 10 list consolidation in required_in_types), which passes the majority quorum, so the 6-week blind consolidation from §Problem runs anyway. Separately, Layer 1 step 4 hardcodes degraded = "one or more not connected" while Layer 2 says required_threshold defines degraded; no rule says which wins, nor whether step 7's gap-close/last_healthy_run side effects fire when the quorum passes with connectors still down (which would close a gap mid-outage).

  2. L121 — "Exits non-zero so the runner halts" collides with the real runner: a skip becomes indistinguishable from a crash, and a preflight crash blocks runs. run-scout.sh.tmpl runs set -euo pipefail; budget-check's actual precedent is an if ! guard converting failure to an explicit exit 0, and other steps are guarded with || true. A bare non-zero exit aborts the script before post-session-backfill and the write-session-cost fallback row ever run; _spawn_runner (schedule_tick.py) launches detached with DEVNULL stdio and never reads the exit code. And any accidental preflight failure (unhandled exception, missing scoutctl → exit 127) reads as a deliberate skip — contradicting the doc's own fail-open requirement that "a broken probe must not silently block all runs". The skip needs the budget-check pattern (guarded call → explicit exit 0) plus a distinguished exit code for real errors.

  3. L99 — The preflight/reconcile are specified only for run-scout.sh.tmpl, but dreaming and research slots dispatch via their own templates. schedule.yaml maps dreaming slots to run-dreaming.sh and research to run-research.sh (bootstrap.py:146-148); run-scout.sh.tmpl only serves briefing/consolidation. Following the doc's "Where" ships the Layer-2 dreaming/research policies as dead config: no preflight, no last_healthy_run tracking, and no gap-close edge ever executes for those slot types. (run-dreaming.sh.tmpl also defines MODE after the pre-session gate area, so ordering needs attention there.)

  4. L123 — Warn mode doesn't work end-to-end: the SCOUT_DEGRADED_CONNECTORS transport is impossible as written, and the KB-poisoning guard is prompt-hope. scoutctl is a child process — an export inside it dies with it, and no relay protocol is specified; the runner's PROMPT is a fixed heredoc that reads no env vars, and nothing in the repo reads SCOUT_DEGRADED_CONNECTORS (cited issue Runner templates never export SCOUT_MODE — connector-log / session-tool-log hooks short-circuit; connector telemetry dark since 2026-06-07 #121 is this exact env-var-to-session seam failing silently). Even if transported, the only defense against the doc's own problem v0.3.0 — pre-session hooks, work + meta-review, action-items GUI #2 is the session "refraining" from negative signals — unenforced, untested (the only warn test asserts no gap), and per L164 no gap is recorded when it fails, contradicting the "no silent gap" claim at L170. The proven seam already exists: .scout-cache/connector-alerts-pending.md written by connector_health_report for the session to consume — the degradation banner should ride that channel, and degraded-run KB writes deserve a mechanical provenance tag (the doc already invents one for backfill).

  5. L113 — The probe wiring is unimplementable: neither registry can be matched against claude mcp list output, and the existing probes check the wrong surface. connectors.yaml keys are mcp:claude_ai_Slack-style with human display_names ("Slack") — no field yields the Status:-line name ("claude.ai Slack"), and underscore↔dot/space inversion is ambiguous (claude_ai_Github_MCP). connector-probes.yaml stores in-session MCP tool names built for the /scout-setup wizard — a headless scoutctl cannot invoke them, so "reuses the existing ProbeKind enum" buys only the mcp-vs-bash bit. And linear's probe targets the plugin-scoped connector (mcp__plugin_linear_linear__list_teams) while scheduled runs use claude.ai Linear. The design needs an explicit harness-server-name field per connector.

  6. L156 — Layer 1b reconciliation is unimplementable and unsafe as specified: --session <id> has no source, and the naive zero-success rule re-derives false-positive classes the codebase already paid to suppress. The session id exists only in the hook payloads; the runner never captures it, claude-with-retry.sh re-runs claude -p from scratch (multiple ids per runner invocation), and the day-keyed connector-calls-*.jsonl mixes four consolidation runs per weekday, so mode+date can't disambiguate either. Meanwhile connector_health_report.py already classifies per-session dark connectors with Pattern launchd plist re-install uses bootstrap without prior bootout — old job stays loaded #48 never-wired suppression and the IdMap.load: exists()+open() TOCTOU and unhandled json.JSONDecodeError #54 cross-mode liveness window — the doc's bare "zero successful calls" rule would recurringly open bogus, backfill-prompting gaps for never-wired or legitimately-unused critical connectors. Spec should direct reuse of the existing classifier, not just its input files.

  7. L126 — The gap lifecycle violates its own one-gap-per-outage invariant, and the stored backfill_command is stale by construction. Step 7 closes gaps on a healthy probe, pre-session. In the mid-session-failure blind spot the doc itself enumerates (L142), reconciliation then finds a recovered record ("extend" is defined only for open gaps, and no reopen/merge transition exists) and a last_healthy_run stamped at the blind run itself — so it can only open a second record with a wrong from, violating Acceptance README: installation instructions are incorrect #3. Also the example record has "to": null while its backfill_command already embeds --to 2026-06-26, and no passage says the string is regenerated on extend/close — surfaces would fire a wrong-window backfill. Simpler: derive the command at render time from slot_type/from/to (a renderer is exactly as good a harness seam as a stored string).

  8. L70 — Public-repo identifier leak in the pasted claude mcp list output ("Keboola Jokes"). CLAUDE.md: "No real identifiers… Vendors/products: a generic noun, not the brand." This string appears in zero tracked files today — merging would be the first occurrence in the repo — and L350 explicitly derives future parser fixtures from this block, pulling it into the anonymization rule's literal scope. Replace with a generic stand-in (e.g. claude.ai Acme Tools: … - ✘ Failed to connect) before merge.

Also confirmed, below the severity cap

  • L263 — the claude:// deeplink doesn't exist as described: the scheme belongs to Claude Desktop and only prefills a chat (flakily — the sibling app's own code works around it); the app's real CLI launcher is AppleScript/Terminal (ClaudeLauncher.swift). Acceptance docs: spec, plan 1, and followups tracker #5 is satisfiable via the existing launcher; the deeplink wording should go.
  • L79/L332 — fail-open + glyph-parsing means a routine CLI format change silently disables the entire protection (nothing alerts on repeated inconclusive probes), and the Error-handling claim that fail-open "does not mean the gap goes unrecorded" is false in Phase 1 / while Runner templates never export SCOUT_MODE — connector-log / session-tool-log hooks short-circuit; connector telemetry dark since 2026-06-07 #121 is unfixed — the safety net doesn't exist yet.
  • L216.scout-state files are not gitignored; the convention is the opposite (id-map.json is explicitly git-committed by post-session-backfill). gaps.jsonl needs an explicit new ignore rule.
  • L288 — there is no "phases-backport provenance convention" to mirror yet: feat(phases): scoutctl phases backport — add --date flag and write '<!-- backported {date} -->' provenance marker #170 is open and phase_backport.py writes no marker; the marker exists only in the 2026-06-16 spec.
  • L280 — the connector×week fan-out assigns the "still open/relevant at --to" filter to chunk agents that structurally can't evaluate it, specifies no dedup for boundary-spanning items and no concurrency cap; most connectors can paginate a 6-week window in one per-connector agent, and the per-connector recipes should parameterize the existing phases/connectors/*.md scan playbooks rather than restate them.
  • L74 — measured on a real config: claude mcp list health-checks all 44 configured servers in ~10s, not the ~2.4s benchmarked for a single mcp get. Worth a real benchmark and a timeout in the spec.
  • L259 — the banner example says "3 consolidation runs were skipped May 15 – Jun 26" but the record for that window says skipped_runs: 41.

On the question in the PR description

The skip/warn/run shape is right — strict for briefing/consolidation, relaxed for research. But as specified, majority on consolidation defeats the exact outage that motivated the design (finding 1), and warn-for-dreaming rests on an unenforceable prompt instruction with no gap recorded when it fails (finding 4). I'd recommend fixed all semantics (degraded = any critical connector down), with tolerance tuned per slot type by which connectors are marked critical in required_in_types — the per-connector mechanism the design already builds on — rather than a second quorum axis. One verified positive worth keeping prominent: the per-slot-type gap grain survived adversarial review — under skip semantics no run executes during a gap, so slot-type × window is the correct unit of missing data.


🤖 Generated with Claude Code

…, probe wiring, gap lifecycle

Per review on #181, most severe first:

1. Drop the required_threshold quorum axis. Degraded = any critical
   connector down; tolerance is tuned per slot type via required_in_types.
   A majority quorum let the motivating outage (Slack+Linear = 2 of 10)
   run blind.
2. Skip signalling uses the budget-check pattern: guarded call, exit 3 =
   policy skip → orderly exit 0, any other rc = error → fail open. A bare
   non-zero under set -euo pipefail read as a crash and killed post-session
   steps.
3. Preflight/reconcile go into all three runner templates; hoist MODE in
   run-dreaming.sh.tmpl above the pre-session gates.
4. Warn transport rides the .scout-cache pending-file seam (env export from
   a child process cannot reach the session); degraded-run KB writes get a
   mechanical provenance tag; reconciliation records gaps for warn too.
5. Add explicit harness_server_name / preflight_command fields to
   connectors.yaml; existing registries cannot be matched to mcp-list
   output and wizard tool-name probes are unusable headless.
6. Reconciliation reuses the connector_health_report classifier (Pattern
   #48 / #54 suppression) instead of a naive zero-success rule; drop the
   sourceless --session flag.
7. Gap lifecycle gains a reopen+merge transition; backfill command is
   derived at render time, not stored (stale by construction).
8. Anonymize the mcp-list example (CLAUDE.md rule).

Also: deeplink wording replaced with the app's existing CLI launcher;
inconclusive-streak alerting after 3 consecutive failed probes; honest
Phase-1 scoping of fail-open; explicit gitignore rule for gaps.jsonl;
provenance marker cited as proposed (not existing); backfill defaults to
one agent per connector with chunking only on volume and synthesis-owned
relevance filtering; probe timing corrected (~10s full list) with a
required timeout; banner example consistent with skipped_runs.
@davidesner

davidesner commented Jul 3, 2026

Copy link
Copy Markdown
Collaborator Author

All findings verified against the repo and addressed in 79f7d8f. Per finding:

  1. Quorum axis removed. Degraded = any critical connector down, one definition everywhere. Tolerance is tuned per slot type via required_in_types (the overlay), as you suggested. The healthy-edge side effects (gap close, last_healthy_run stamp) now explicitly fire only when all critical connectors are connected.
  2. Budget-check pattern adopted. Guarded if ! call; exit 3 = policy skip → orderly exit 0; any other rc = preflight error → logged and failed open. Spec includes the runner snippet.
  3. All three templates, with the run-dreaming.sh.tmpl MODE-hoisting called out.
  4. Warn rides the .scout-cache pending-file seam (connector-degradation-pending.md, same channel as connector-alerts-pending.md); env-var relay dropped. Degraded-run KB writes get a mechanical provenance tag. Reconciliation now records gaps for warn too — only run opts out — which resolves the L164/L170 contradiction.
  5. Explicit probe fields added to connectors.yaml: harness_server_name (verbatim match against mcp list) and preflight_command for bash probes. The ProbeKind reuse claim is gone; connectors with neither field simply aren't probed.
  6. Reconciliation delegates to compute_critical_alertslaunchd plist re-install uses bootstrap without prior bootout — old job stays loaded #48/IdMap.load: exists()+open() TOCTOU and unhandled json.JSONDecodeError #54 suppression comes for free; the --session <id> flag is dropped (current session = the classifier's most-recent-session, and the spec documents why no reliable id exists runner-side).
  7. Reopen+merge transition added (state machine in the spec); reconciliation re-stamps last_healthy_run on reopen. backfill_command is no longer stored — rendered at read time from slot_type/from/to, and the renderer is now the named harness seam.
  8. Example anonymized.

Below-cap items all taken as well: deeplink wording replaced with the existing ClaudeLauncher mechanism; inconclusive-streak alert after 3 consecutive failed probes plus honest Phase-1 scoping of fail-open ("no silent gap" only holds once Layer 1b ships); explicit gitignore rule for gaps.jsonl with the committed-id-map.json convention noted; provenance marker cited as proposed in the #170 spec, not existing; backfill restructured to one agent per connector by default (chunk on volume, concurrency cap, synthesis owns dedup + still-open filtering, recipes parameterize phases/connectors/*.md); probe timing corrected (~10s full list on a 44-server config) with a required timeout; banner example now matches skipped_runs.

On the description question: adopted your recommendation — fixed all semantics with per-slot-type tolerance via connector criticality, no second axis. Acceptance #2 now pins the motivating scenario: there is no configuration under which Slack+Linear-down consolidation runs blind.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants