Skip to content

fix(churn): speak warpline-mcp's newline JSON-RPC + honest paging/keying disclosure#77

Merged
tachyon-beep merged 2 commits into
mainfrom
feat/warpline-churn-consumer
Jun 29, 2026
Merged

fix(churn): speak warpline-mcp's newline JSON-RPC + honest paging/keying disclosure#77
tachyon-beep merged 2 commits into
mainfrom
feat/warpline-churn-consumer

Conversation

@tachyon-beep

Copy link
Copy Markdown
Collaborator

Problem

The Warpline churn consumer was a live NO-GO: entity_high_churn_list / entity_recent_change_list hung whenever integrations.warpline.enabled: true. Two transport bugs, both confirmed against warpline's source and reproduced live on /home/john/lacuna:

  1. Framing mismatch. The client drove warpline-mcp over Loomweave's Content-Length plugin framing (ADR-002), but warpline-mcp reads newline-delimited JSON-RPC (for line in sys.stdin, one response line per request line). Warpline never parsed the request → the read blocked forever.
  2. Wrong launcher (found only in live validation). The default command was warpline mcp, which is not a valid warpline subcommand (usage error → broken pipe). The MCP stdio server ships as the standalone warpline-mcp binary.

Plus: the call omitted the required repo, sent an unsupported actor, and had no timeout.

The fix

Transport (loomweave-federation/src/warpline.rs)

  • Newline-delimited JSON-RPC; the whole handshake+call runs on a worker thread bounded by recv_timeout + kill-on-timeout, so a hung warpline degrades to the honest warpline-unreachable response instead of hanging. stderr → Stdio::null so a large traceback can't block the child.
  • Default launcher resolves to warpline-mcp (not warpline mcp).
  • Send required repo: <project_root>; drop the unsupported actor; parse result.structuredContent, falling back to result.content[0].text.
  • Config: keep WarplineConfig.actor (reserved, for deny_unknown_fields back-compat with configs that set it) and add an honored timeout_seconds (default 10).

Honesty floor — two distinct zeros must not be conflated with a genuine never-observed 0:

  • Send limit = refs.len() so warpline's default 100-item page doesn't silently cap the join at the top-100-by-churn.
  • churn_truncated + overflow_partial(): warpline keeps a 200-item in-band lead and spills the rest, so an over-cap scope's truncated tail is disclosed, not shipped as fabricated zeros.
  • churn_unresolved + unresolved_ref_count(): warpline returns locator: null for an unresolved SEI ref, so a keying-miss 0 ("real churn UNKNOWN") is disclosed rather than read as "this code never changes."

Disabled / unreachable / empty degrade paths (warpline-disabled, warpline-unreachable, churn_source: "warpline", signal) are preserved.

Validation

Live on /home/john/lacuna (release + uv-installed binary):

  • enabled → real nonzero ranked churn with churn_truncated{counted:200,total:580}
  • recent_change → real recency, no hang
  • scoped → churn_unresolved{count:54}
  • disabled → honest-empty

564 federation+mcp tests pass (+5 new: real-subprocess transport, timeout-not-hang, default-command guard, both disclosures); fmt + workspace clippy -D warnings + cargo doc -D warnings clean.

Follow-ups (filed as observations, not in this PR)

  • Deep-pagination via warpline's overflow dump for >200-candidate scopes.
  • The loomweave↔warpline locator-dialect + NULL-sei keying gap that produces the unresolved refs (federation-contract fix).
  • The same Content-Length-vs-newline framing latent in filigree.rs.

🤖 Generated with Claude Code

tachyon-beep and others added 2 commits June 27, 2026 12:49
…urn/recently_changed

The entity_high_churn_list / entity_recent_change_list MCP surfaces were
dead-by-design: loomweave does not populate git_churn_count in v1.0 and, by the
loomweave<->warpline seam HARD RULE, retains no cross-run history. This wires the
read-time join to Warpline's FROZEN warpline_entity_churn_count_get
(warpline.entity_churn_count.v1, 2026-06-13 interface lock SS1A / GV-LW-2).

Read-only / enrich-only / dependency-sink:
- New WarplineLookup trait (single method, the churn read) + WarplineMcpClient
  (MCP-stdio subprocess, mirrors the Filigree consumer pattern), injected as an
  Option<Arc<dyn ..>> defaulting to None. Lives in loomweave-federation.
- Candidate universe = the entity catalogue (new entities_for_churn_candidates
  query), NOT the empty git_churn_count scan; warpline holds the counts.
  Scope-filter BEFORE the warpline call, then rank the scoped set by the
  returned counts. SEI-keyed refs with locator fallback (never drops a
  candidate). One bounded call, joined at read time, retained nowhere.
- Honest-degrade (lock SS1C): warpline disabled/unreachable -> honest-empty with
  a warpline-named missing-signal note + churn_source provenance; never
  empty-as-clean, never a hard error breaking the core flow.

GV-LW-2 is an executable test: the full frozen envelope fixture parsed through
the real parse path, driven via an injected fake (no live MCP call from the hub
context). Both honest-degrade paths (disabled, unreachable) are tested.

Blast radius: READS the frozen warpline_entity_churn_count_get contract only;
mints no new sibling obligation (the producer already ships). Live cross-member
validation against a real Warpline, deep-pagination beyond warpline's limit, and
the >CHURN_SCAN_CAP single-frame request bound are tracked follow-ups.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ing disclosure

The Warpline churn consumer was a live NO-GO: entity_high_churn_list /
entity_recent_change_list hung when integrations.warpline.enabled=true. Two
transport bugs, both confirmed against warpline's source and live on lacuna:

1. Framing mismatch. The client drove warpline over Loomweave's Content-Length
   plugin framing, but warpline-mcp reads NEWLINE-delimited JSON-RPC
   (`for line in sys.stdin`, one response line per request line) — so warpline
   never parsed the request and the read blocked forever.
2. Wrong launcher (found only live). The default command was `warpline mcp`,
   which is not a valid warpline subcommand (usage error -> broken pipe). The
   MCP stdio server ships as the standalone `warpline-mcp` binary.

Transport rewrite (loomweave-federation/src/warpline.rs):
- Newline-delimited JSON-RPC; the whole handshake+call runs on a worker thread
  bounded by recv_timeout + kill-on-timeout, so a hung warpline degrades to the
  honest `warpline-unreachable` response instead of hanging (req #5). stderr is
  discarded (Stdio::null) so a large traceback can't block the child.
- Default launcher resolves to `warpline-mcp` (not `warpline mcp`).
- Send required `repo: <project_root>` (req #2); drop the unsupported `actor`
  param (req #3) — it is not in warpline's frozen churn schema. Parse
  result.structuredContent, falling back to content[0].text (req #4).
- Config: keep WarplineConfig.actor (reserved, for deny_unknown_fields
  back-compat with configs that set it) and add an honored timeout_seconds
  (default 10), retiring the stale "no timeout knob" doc.

Honesty floor (two zeros must not be conflated with a never-observed 0):
- Send limit = refs.len() so warpline's default 100-item page does not silently
  cap the join at the top-100-by-churn.
- churn_truncated + ChurnCountResponse::overflow_partial(): warpline keeps a
  200-item in-band lead and spills the rest (apply_overflow), so an over-cap
  scope's truncated tail is DISCLOSED, not shipped as fabricated zeros.
- churn_unresolved + unresolved_ref_count(): warpline returns locator:null for
  an unresolved SEI ref (a resolve hit always sets the NOT-NULL locator), so a
  keying-miss 0 ("real churn UNKNOWN") is disclosed rather than read as "this
  code never changes". Closes the disclosure asymmetry with churn_truncated.

Disabled / unreachable / empty degrade paths (warpline-disabled,
warpline-unreachable, churn_source:"warpline", signal) are preserved (req #6).

Validated live on /home/john/lacuna (release + uv-installed binary): enabled ->
real nonzero ranked churn with churn_truncated{counted:200,total:580};
recent_change -> real recency, no hang; scoped -> churn_unresolved{count:54};
disabled -> honest-empty. 564 federation+mcp tests pass; fmt + workspace clippy
+ cargo doc all clean.

Follow-ups (filed as filigree observations, not addressed here): deep-pagination
via warpline's overflow dump for >200-candidate scopes; the loomweave<->warpline
locator-dialect + NULL-sei keying gap that produces the unresolved refs; and the
same Content-Length-vs-newline framing latent in filigree.rs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 30549a3af1

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +869 to +870
let (candidates, scan_truncated) =
entities_for_churn_candidates(conn, CHURN_SCAN_CAP)?;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Apply scope before capping churn candidates

When an index has more than CHURN_SCAN_CAP entities, this fetches the first 50k rows ordered by id and only then applies the requested scope. For a scoped query targeting an entity or path whose rows sort after that cap, in_scope becomes empty, no refs are sent to Warpline, and the tool can report no recorded changes even though the exact scoped entity has churn data. The candidate query needs to be constrained by the resolved scope before applying the cap, or exact/id scopes should be materialized directly.

Useful? React with 👍 / 👎.

WarplineEntityRef::for_entity(&e.id, sei.as_deref())
})
.collect();
let response = match client.entity_churn_counts(&refs, mode.window().as_ref()) {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Release the reader before calling Warpline

This call runs inside readers.with_reader, so the SQLite reader connection is held while the code spawns and waits on the Warpline subprocess (up to the configured timeout, 10s by default). With Warpline enabled but slow or hung, concurrent churn requests can occupy the reader pool and block unrelated MCP reads even though the database is not needed during the external round trip. Collect the refs, drop the reader before the subprocess call, then reacquire a reader to render the page.

Useful? React with 👍 / 👎.

tachyon-beep added a commit that referenced this pull request Jun 28, 2026
…t-Length) (#78)

filigree.rs's MCP stdio client (the observation_create / observation_dismiss
path, used by propose_guidance and guidance promotion) framed requests with
Loomweave's Content-Length plugin framing, but filigree-mcp uses the official MCP
Python SDK (`mcp.server.stdio.stdio_server`), whose stdio transport is
NEWLINE-delimited JSON-RPC. Same bug class as the Warpline churn consumer (#77).

Verified empirically against the installed filigree-mcp: a newline-delimited
initialize gets a clean result; a Content-Length-framed one makes filigree-mcp
emit an "Internal Server Error" notification, after which loomweave's
Content-Length reader cannot parse filigree's newline responses and the call
hangs. The HTTP read path (issues_for / entity-associations) was unaffected —
only the stdio observation seam was broken.

Fix (mirrors the warpline transport fix):
- write_mcp_json / read_mcp_json now use newline framing: one compact JSON line +
  \n; responses read line-by-line, skipping non-matching ids (the init result and
  the notification's id:null error), EOF-before-match surfaced as an error.
- Extracted run_mcp_tool_over_command(program, args, root, timeout, tool, args):
  the handshake+call runs on a worker thread bounded by recv_timeout + kill, so a
  hung filigree-mcp degrades instead of blocking forever (FILIGREE_MCP_TIMEOUT,
  10s). stderr -> Stdio::null so a large traceback can't block the child. The
  resolved command is a parameter, so the transport is unit-testable with an
  injected fake newline server (no env mutation — set_var is unsafe under
  edition 2024 + unsafe_code=deny).
- Last-resort launcher fallback `("filigree", ["mcp"])` -> `filigree-mcp` (the
  real binary); `filigree mcp` is not a valid subcommand. The happy path still
  resolves `python -m filigree.mcp_server` via `filigree mcp-status`.

TDD: newline-framing helper round-trip + EOF-error, fallback-command guard, and
a real-subprocess happy-path + timeout-not-hang test driving a fake newline
server. 131 federation tests pass; fmt + clippy (federation/mcp/cli, -D warnings)
+ cargo doc clean. Live-probed against the real filigree-mcp on /home/john/lacuna
(newline initialize+tools/call round-trips; Content-Length errors).

Residual (pre-existing, NOT bounded here): resolve_filigree_mcp_command runs
`filigree mcp-status --json` via a plain blocking .output() before the
timeout-bounded section, so a hung mcp-status is an unbounded wait outside the
new deadline. Short-lived in practice; bounding it is a follow-up.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
tachyon-beep added a commit that referenced this pull request Jun 28, 2026
@tachyon-beep tachyon-beep merged commit 1d2b4fa into main Jun 29, 2026
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant