Skip to content

test(prototype): AGE-104 PTY vs print MCP Task proof harness#90

Draft
nestharus wants to merge 1 commit into
mainfrom
prototype-tests-age-104-pty-mcp-gap
Draft

test(prototype): AGE-104 PTY vs print MCP Task proof harness#90
nestharus wants to merge 1 commit into
mainfrom
prototype-tests-age-104-pty-mcp-gap

Conversation

@nestharus

Copy link
Copy Markdown
Owner

Verdict

CONFIRM (it works). The AGE-104 P2 evidence shows that the Claude Code interactive PTY mode can dispatch mcp__<server>__Task-class tools when the launch shape uses --allowedTools mcp__<server>__<name>,... or omits the tool filter; only --tools mcp__<server>__<name>,... suppresses MCP tools/call dispatch in true PTY mode while print mode succeeds under all three flag shapes. The dossier answer.md headline, the proof-test audit re-run of p2-proof-tests.sh (M3-C1 expected-failure, M3-C2 success, M3-C3 success; audit verdict LOW), and the 4x3 truth-table sweep all back this conclusion. AGE-89's orchestrator-mode MCP-Task gate is unblocked subject to the AGE-101 / AGE-102 / AGE-103 amendments and the AGE-113 / AGE-114 hardening tickets.

Reviewer focus

Review follows ~/ai/conventions/prototype-review.md. Reviewers should comment on test design, outcome alignment, pending-marker traceability, and whether the published harness supports the dossier CONFIRM verdict. Specifically:

  • Does the M3-C1 / M3-C2 / M3-C3 focused proof (and the full 4x3 sweep) correctly reproduce the PTY --tools suppression and the --allowedTools / no-filter PTY positive controls?
  • Do the prototype-pending: markers cite real Linear keys/URLs traceable to the spawned implementation tickets listed below?
  • Do the test scope and assertions line up with the dossier verdict and the per-ticket carry-forward in dossier/test-publication-manifest.md?

This is not review of disposable prototype source quality. Do not request stylistic, structural, or hygiene changes to harness internals that do not change the behavior contract.

Test manifest

Branch: prototype-tests-age-104-pty-mcp-gap (HEAD b3347b7).

Published harness lives under prototype-tests/age-104-pty-mcp-gap/ on the prototype-test branch. The manifest paths reference the planning-tree evidence sources; the on-branch publication paths are equivalent contents.

Published path on prototype-test branch Manifest entry Spawned ticket mapping
prototype-tests/age-104-pty-mcp-gap/p2-proof-tests.sh (M3-C1 negative control: --tools mcp__... fails in PTY) dossier/evidence/p2-proof-tests.sh ST-01 → AGE-101
prototype-tests/age-104-pty-mcp-gap/p2-proof-tests.sh (M3-C2 positive control: --allowedTools mcp__... works in PTY) dossier/evidence/p2-proof-tests.sh ST-01 → AGE-101, ST-04 → AGE-103
prototype-tests/age-104-pty-mcp-gap/p2-proof-tests.sh (M3-C3 positive control: no filter works in PTY) dossier/evidence/p2-proof-tests.sh ST-01 → AGE-101, ST-06 → AGE-113
prototype-tests/age-104-pty-mcp-gap/p2-proof-tests.sh (all M3/M4 PTY cells: PTY-with-hooks ≡ PTY-without-hooks) dossier/evidence/p2-proof-tests.sh ST-05 → AGE-112
prototype-tests/age-104-pty-mcp-gap/p2-truth-table-harness/run-all.sh (full 4×3 sweep) dossier/evidence/p2-truth-table-harness/run-all.sh ST-02 → AGE-90, ST-03 → AGE-102, ST-07 → AGE-114, ST-08 → AGE-115
prototype-tests/age-104-pty-mcp-gap/p2-truth-table-harness/server.py (MCP server backing the sweep) dossier/evidence/p2-truth-table-harness/server.py inherited alongside run-all.sh (supports ST-02 / ST-03 / ST-07 / ST-08 above)

Expected control pattern reproduced by prototype-tests/age-104-pty-mcp-gap/p2-proof-tests.sh:

  • M3-C1 (PTY w/ hooks + --tools mcp__...) — tools/call never reaches the MCP server; expected failure today.
  • M3-C2 (PTY w/ hooks + --allowedTools mcp__...) — tools/call reaches the server and the tool result returns to Claude.
  • M3-C3 (PTY w/ hooks + no filter) — tools/call reaches the server and the tool result returns to Claude.
  • Script exits 0 only when all three assertions match the expected outcomes.

prototype-tests/age-104-pty-mcp-gap/MARKERS.md carries the full per-cell ticket mapping.

Pending markers

The prototype-pending marker convention is defined in ~/ai/conventions/prototype-pending-tests.md. The reason format the convention requires is:

prototype-pending: implementation pending in <ticket-key-or-url>; remove marker and make this test pass

These markers are traceable implementation handoff debt, not generic skip/xfail permission. Each cited ticket has a real Linear key/URL; placeholder ticket text is not present. Implementation must remove the marker token (the prototype-pending: colon-prefixed sentinel) from the listed test files or supersede the test with a traceable strictly stronger equivalent recorded in dossier/test-publication-manifest.md, the spawned ticket payload, or the Phase 6 Step 6b output index. The marker is not permission to discard the original assertions.

The harness is bash + Python (not Pytest / Playwright / cargo), so the convention's preferred-runner mapping falls through to the comment-block sentinel header at the top of each published file. grep -RIn "^# prototype-pending:" prototype-tests/age-104-pty-mcp-gap/ enumerates every traceable marker on this branch.

Dossier link

AGE-104 prototype dossier: planning/prototype-age-104-pty-mcp-gap/dossier/.

  • Answer: planning/prototype-age-104-pty-mcp-gap/dossier/answer.md
  • Risk profile: planning/prototype-age-104-pty-mcp-gap/dossier/risk-profile.md
  • Evidence directory: planning/prototype-age-104-pty-mcp-gap/dossier/evidence/ (including p2-truth-table.md, p2-proof-tests.sh, p2-stabilize-output.md, p2-truth-table-harness/, and per-vector v{1..4}-verdict.md)
  • Proof-test audit: planning/prototype-age-104-pty-mcp-gap/dossier/proof-test-audit.md (verdict LOW; re-run of p2-proof-tests.sh shows M3-C1 expected-failure, M3-C2 success, M3-C3 success)
  • Spawned tickets dossier: planning/prototype-age-104-pty-mcp-gap/dossier/spawned-tickets.md
  • Test publication manifest: planning/prototype-age-104-pty-mcp-gap/dossier/test-publication-manifest.md

Spawned implementation tickets

Mapping per dossier/spawned-tickets.md (ST-01 through ST-08) and dossier/test-publication-manifest.md per-ticket carry-forward table:

  • ST-01 — AGE-101: wire proxy PTY driver for Claude Code one-shot execution. Inherits M3-C1 negative, M3-C2 positive, M3-C3 positive. Acceptance: Remove the prototype-pending: markers in the listed test files, make these tests pass, and preserve the original assertions unless the manifest, spawned ticket payload, or Phase 6 Step 6b output index records a strictly stronger equivalent supersession.

  • ST-02 — AGE-90: agent-runner flip Claude family providers from headless to proxy mode. Inherits the full 4×3 truth-table sweep (M3/M4 C1 failure and C2/C3 success). Acceptance: Remove the prototype-pending: markers in the listed test files, make these tests pass, and preserve the original assertions unless the manifest, spawned ticket payload, or Phase 6 Step 6b output index records a strictly stronger equivalent supersession.

  • ST-03 — AGE-102: add driver-owned MCP replacement tools for Claude Code. Inherits M3-C2 positive, M3-C3 positive, and the supporting MCP server Task implementation in p2-truth-table-harness/server.py. Acceptance: Remove the prototype-pending: markers in the listed test files, make these tests pass, and preserve the original assertions unless the manifest, spawned ticket payload, or Phase 6 Step 6b output index records a strictly stronger equivalent supersession.

  • ST-04 — AGE-103: extend providers.toml with invocation-mode schema. Inherits M3-C1 negative and M3-C2 positive. Acceptance: Remove the prototype-pending: markers in the listed test files, make these tests pass, and preserve the original assertions unless the manifest, spawned ticket payload, or Phase 6 Step 6b output index records a strictly stronger equivalent supersession.

  • ST-05 — AGE-112: decide orchestrator-mode completion scope until PTY MCP Task is fixed. Inherits all M3/M4 PTY cells. Acceptance: Remove the prototype-pending: markers in the listed test files, make these tests pass, and preserve the original assertions unless the manifest, spawned ticket payload, or Phase 6 Step 6b output index records a strictly stronger equivalent supersession.

  • ST-06 — AGE-113: harden Claude proxy launch-shape regression coverage. Inherits M3-C1 negative, M3-C2 positive, M3-C3 positive; AGE-104 proof harness serves as the regression spec. Acceptance: Remove the prototype-pending: markers in the listed test files, make these tests pass, and preserve the original assertions unless the manifest, spawned ticket payload, or Phase 6 Step 6b output index records a strictly stronger equivalent supersession.

  • ST-07 — AGE-114: document the AGE-104 Claude launch-shape constraint. Inherits the expected control pattern from p2-proof-tests.sh and the 2.1.141–2.1.143 version range from V4 evidence. Acceptance: Remove the prototype-pending: markers in the listed test files, make these tests pass, and preserve the original assertions unless the manifest, spawned ticket payload, or Phase 6 Step 6b output index records a strictly stronger equivalent supersession.

  • ST-08 — AGE-115: file or explicitly decline upstream Claude Code bug report. Inherits the minimal repro from p2-proof-tests.sh plus the full truth-table expectation. Acceptance: Remove the prototype-pending: markers in the listed test files, make these tests pass, and preserve the original assertions unless the manifest, spawned ticket payload, or Phase 6 Step 6b output index records a strictly stronger equivalent supersession.

Anti-scope

  • This is not a production implementation PR. It is a prototype-test contract PR per ~/ai/conventions/prototype-pending-tests.md and ~/ai/conventions/prototype-review.md.
  • Reviewers should not request prototype-source quality changes (style, structural refactors, harness internals, file layout) that do not change the behavior contract.
  • CodeRabbit is not run by default on this PR.
  • Production PR-review gates (CI green-gate, full lint/format, code-coverage thresholds, architecture review) are not applied by default. The published files live under prototype-tests/age-104-pty-mcp-gap/ and are explicitly outside the production test surface; spawned implementation tickets bring the contract into production via their own PRs.
  • Approval of this PR is approval of the test contract carried forward to the listed spawned tickets, not approval of any production behavior change.

…roof harness

Proof tests for AGE-104 (sub-prototype of AGE-89 prototype-age-89-clarify) showing
that Claude Code 2.1.141-2.1.143 interactive PTY mode suppresses MCP tools/call
dispatch when launched with --tools mcp__<server>__<name>,..., while print mode
accepts the same flag. Two confirmed PTY workarounds: --allowedTools mcp__...
or omit the tool filter.

Files:
- p2-proof-tests.sh: focused proof script asserting M3-C1 fails, M3-C2 + M3-C3
  succeed in interactive PTY w/ hooks.
- p2-truth-table-harness/: full 4x3 sweep (4 modes x 3 flag shapes).
- MARKERS.md: prototype-pending marker mapping per spawned implementation ticket
  (AGE-101, AGE-90, AGE-102, AGE-103, AGE-112, AGE-113, AGE-114, AGE-115).
- README.md: reviewer guidance per ~/ai/conventions/prototype-review.md.

This is a prototype-test contract PR, not a production implementation PR. Review
follows prototype-review.md: test design, outcome alignment, pending-marker
traceability, dossier verdict support. Production PR-review gates do not apply.

prototype-pending markers cite real Linear keys. Spawned implementation tickets
must remove markers and make these tests pass against production code, preserving
original assertions unless a strictly stronger equivalent supersession is
recorded in the manifest, ticket payload, or Phase 6 Step 6b output index.

Refs: AGE-104 dossier at planning/prototype-age-104-pty-mcp-gap/dossier/
@linear

linear Bot commented May 15, 2026

Copy link
Copy Markdown

AGE-112

AGE-104

nestharus added a commit that referenced this pull request May 16, 2026
New `docs/architecture/claude-proxy-mcp-launch-shape.md` captures the
AGE-104 finding that interactive PTY Claude proxy MCP replacement
launches must use `--allowedTools mcp__<server>__<tool>,...` or omit
the tool filter; `--tools mcp__<server>__<tool>,...` suppresses MCP
`tools/call` dispatch in PTY mode (Claude Code 2.1.141-2.1.143). Print
mode is unaffected. Cites the canonical PR #90 reproducer harness
(`prototype-tests/age-104-pty-mcp-gap/p2-proof-tests.sh`), the M3-C1/
M3-C2/M3-C3 control pattern, the `/home/nes/.local/share/claude/
versions/2.1.143` binary-path caveat, the bounded `--allowedTools`
camelCase vs `--allowed-tools` kebab-case spelling ambiguity (PTY MCP
equivalence is not in evidence; the current Rust executor at
`crates/oulipoly-runtime/src/executor/cli.rs::apply_provider_policy`
emits kebab-case), the upstream-issue absence (AGE-115 tracks the
file-or-decline decision), and a version-bump runbook re-verification
recipe.

Additive cross-links from `docs/architecture/provider-accounts-redesign
.md` (sections "Example: Claude CLI Integration" and "Version-Aware
Integration"), `README.md` (sections "Interactive REPL", "providers
.toml", "Adding a Model"), and `AGENTS.md` (section "Model Command
Syntax") route readers to the runbook before they edit Claude proxy
launch flags.

DECISIONS.md records the AGE-114 Phase 2.5 dispositions: D1
inherited-estimate cold-start (proceed without baseline; AGE-104
prototype is the prototype-first satisfaction); D2 problem-map gate
skipped per `skip_problem_map_gate=true`; D3 Phase 2.5 verdict HIGH
accepted with exhaustive mode for the runbook + provider-accounts-
redesign surfaces.

Per ACR-126 prototype-pending carry-forward, the inherited PR #90
proof tests are recorded as strictly stronger equivalent supersession
via canonical-reproducer citation in the runbook (Proof Command +
Version-Bump Runbook sections); not imported into this branch because
they live on the prototype-test branch and a docs-only diff has no
production code to validate against.

Refs: AGE-114, AGE-89 (parent), AGE-104 (prototype), AGE-101, AGE-102,
AGE-103, AGE-113, AGE-115.
nestharus added a commit that referenced this pull request May 16, 2026
* docs(arch): add Claude proxy MCP launch-shape runbook (AGE-114)

New `docs/architecture/claude-proxy-mcp-launch-shape.md` captures the
AGE-104 finding that interactive PTY Claude proxy MCP replacement
launches must use `--allowedTools mcp__<server>__<tool>,...` or omit
the tool filter; `--tools mcp__<server>__<tool>,...` suppresses MCP
`tools/call` dispatch in PTY mode (Claude Code 2.1.141-2.1.143). Print
mode is unaffected. Cites the canonical PR #90 reproducer harness
(`prototype-tests/age-104-pty-mcp-gap/p2-proof-tests.sh`), the M3-C1/
M3-C2/M3-C3 control pattern, the `/home/nes/.local/share/claude/
versions/2.1.143` binary-path caveat, the bounded `--allowedTools`
camelCase vs `--allowed-tools` kebab-case spelling ambiguity (PTY MCP
equivalence is not in evidence; the current Rust executor at
`crates/oulipoly-runtime/src/executor/cli.rs::apply_provider_policy`
emits kebab-case), the upstream-issue absence (AGE-115 tracks the
file-or-decline decision), and a version-bump runbook re-verification
recipe.

Additive cross-links from `docs/architecture/provider-accounts-redesign
.md` (sections "Example: Claude CLI Integration" and "Version-Aware
Integration"), `README.md` (sections "Interactive REPL", "providers
.toml", "Adding a Model"), and `AGENTS.md` (section "Model Command
Syntax") route readers to the runbook before they edit Claude proxy
launch flags.

DECISIONS.md records the AGE-114 Phase 2.5 dispositions: D1
inherited-estimate cold-start (proceed without baseline; AGE-104
prototype is the prototype-first satisfaction); D2 problem-map gate
skipped per `skip_problem_map_gate=true`; D3 Phase 2.5 verdict HIGH
accepted with exhaustive mode for the runbook + provider-accounts-
redesign surfaces.

Per ACR-126 prototype-pending carry-forward, the inherited PR #90
proof tests are recorded as strictly stronger equivalent supersession
via canonical-reproducer citation in the runbook (Proof Command +
Version-Bump Runbook sections); not imported into this branch because
they live on the prototype-test branch and a docs-only diff has no
production code to validate against.

Refs: AGE-114, AGE-89 (parent), AGE-104 (prototype), AGE-101, AGE-102,
AGE-103, AGE-113, AGE-115.

* docs(decisions): record AGE-114 D4/D5 Phase 6 Tier-1 retry lineage

D4 — Step 6c R1 missing `consumed:` log evidence; autonomous Tier-1
rewind-and-retry per orchestrator spec § Step 6c violation rule.

D5 — Step 6c model substitution to claude-opus for consumed-evidence
reliability after R2/R3 (gpt-high → codex2) and R4 (claude-opus →
claude4 with inline placement) collapsed the consumed-echo. R5's
final-block prompt structure succeeded; all 97 consumed: rows landed
in the captured log.

* docs(decisions): record AGE-114 D6 Phase 8 test-audit MEDIUM disposition

D6 — Phase 8 test-audit returned MEDIUM solely because acceptance-
checklist row AC-016 searches for `no filter` literally but the
runbook uses `no tool filter`. The product-docs content is correct
(M3-C3 documented as succeeding with no tool filter; AC-046 separately
verifies the no-filter allowance against `## Rule`). Accepted as
recipe-weakness residual with no content gap, per pr-review.md
fix-pass disposition rules. ACR-156/162/163 LOW-only rule applies to
code-quality gates only; PR-review gates have their own disposition
policy.
nestharus added a commit that referenced this pull request May 16, 2026
)

New `docs/external-issues/claude-code-pty-mcp-task-availability.md` records
the file-or-decline decision for the AGE-104 finding that Claude Code
2.1.141-2.1.143 fails to dispatch MCP `tools/call` in interactive PTY mode
under `--tools mcp__<server>__<name>,...` while print mode succeeds and PTY
mode succeeds under `--allowedTools` or no tool filter.

Disposition: file local only (draft, do not submit). The document is the
durable internal record and is shaped to be self-contained for an external
Anthropic maintainer audience. A future authorized step may submit it via
`gh api repos/anthropics/claude-code/issues` behind an explicit root
authorization gate; that submission is out of scope for this WU per the
proposal's anti-scope.

Includes the 4x3 truth-table summary from AGE-104 P2 evidence (M3-C1 fails
under `--tools`; M3-C2 / M3-C3 succeed), points to the canonical PR #90
proof harness, names the local workaround flag shapes, and frames the
suggested fix surface as the main-PTY tool-dispatch / tool-registry path
per the AGE-104 V4 verdict.

Cross-links to `docs/architecture/claude-proxy-mcp-launch-shape.md` (AGE-114
runbook) as the maintainer-facing operational reference; the AGE-115 ticket
URL; AGE-104 dossier; and sister tickets AGE-101/102/103/113/114.

DECISIONS.md additions record the WU-local pipeline decisions: D1 (Phase
2.5 inherited-estimate cold-start resolved inline as procedural), D2
(problem-map human gate skipped via dispatch override), D3 (Phase 5
base-update from stale main to origin/main 4d0d168 to bring AGE-114's
runbook into the WU's base).

Inherited prototype-pending markers on `prototype-tests/age-104-pty-mcp-gap/`
are NOT removed: the decision document supersedes the marker-removal
requirement for AGE-115's slice via canonical-reproducer citation per
`~/ai/conventions/prototype-pending-tests.md`. Marker removal belongs to
the future WU that ships a Claude Code upstream patch or local runtime
workaround making M3-C1 pass.

Refs: AGE-115, AGE-104 (predecessor prototype), AGE-114 (sibling WU runbook)

Closes AGE-115
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant