test(prototype): AGE-104 PTY vs print MCP Task proof harness by nestharus · Pull Request #90 · nestharus/agent-runner

nestharus · 2026-05-15T23:50:13Z

Verdict

CONFIRM (it works). The AGE-104 P2 evidence shows that Claude Code's interactive PTY mode can dispatch mcp__<server>__Task-class tools when the launch shape uses --allowedTools mcp__<server>__<name>,... or omits the tool filter. Only --tools mcp__<server>__<name>,... suppresses MCP tools/call dispatch in true PTY mode, while print mode succeeds under all three flag shapes. The dossier answer.md headline, the proof-test audit re-run of p2-proof-tests.sh (M3-C1 expected-failure, M3-C2 success, M3-C3 success; audit verdict LOW), and the 4×3 truth-table sweep all back this conclusion. AGE-89's orchestrator-mode MCP-Task gate is unblocked, subject to the AGE-101 / AGE-102 / AGE-103 amendments and the AGE-113 / AGE-114 hardening tickets.
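The launch-shape rule in the verdict can be expressed as a small pre-launch check. The following is a minimal sketch, not part of the published harness: the function name, the `claude` binary name in the usage lines, and the `mcp__srv__Task` tool name are illustrative assumptions, and only the space-separated `--tools <value>` spelling is handled.

```python
from typing import List, Optional

MCP_PREFIX = "mcp__"


def find_suppressing_filter(argv: List[str], interactive_pty: bool) -> Optional[str]:
    """Return the `--tools` value that would suppress MCP dispatch, or None.

    Encodes only what the AGE-104 evidence states: `--tools mcp__...` fails in
    interactive PTY mode, while `--allowedTools mcp__...` and no filter succeed.
    """
    if not interactive_pty:
        # Print mode succeeded under all three flag shapes.
        return None
    for i, arg in enumerate(argv):
        # Exact match, so `--allowedTools` is never mistaken for `--tools`.
        if arg == "--tools" and i + 1 < len(argv):
            value = argv[i + 1]
            names = [name.strip() for name in value.split(",")]
            if any(name.startswith(MCP_PREFIX) for name in names):
                return value
    return None


# Hypothetical argv shapes for the three flag shapes of the sweep.
print(find_suppressing_filter(["claude", "--tools", "mcp__srv__Task"], True))
print(find_suppressing_filter(["claude", "--allowedTools", "mcp__srv__Task"], True))
print(find_suppressing_filter(["claude"], True))
```

A check of this kind only mirrors the observed behavior; the proof harness remains the source of truth.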

Reviewer focus

Review follows ~/ai/conventions/prototype-review.md. Reviewers should comment on test design, outcome alignment, pending-marker traceability, and whether the published harness supports the dossier CONFIRM verdict. Specifically:

- Does the M3-C1 / M3-C2 / M3-C3 focused proof (and the full 4×3 sweep) correctly reproduce the PTY --tools suppression and the --allowedTools / no-filter PTY positive controls?
- Do the prototype-pending: markers cite real Linear keys/URLs traceable to the spawned implementation tickets listed below?
- Do the test scope and assertions line up with the dossier verdict and the per-ticket carry-forward in dossier/test-publication-manifest.md?

This is not review of disposable prototype source quality. Do not request stylistic, structural, or hygiene changes to harness internals that do not change the behavior contract.

Test manifest

Branch: prototype-tests-age-104-pty-mcp-gap (HEAD b3347b7).

The published harness lives under prototype-tests/age-104-pty-mcp-gap/ on the prototype-test branch. The manifest entries reference the planning-tree evidence sources; the files published on the branch have equivalent contents.

| Published path on prototype-test branch | Manifest entry | Spawned ticket mapping |
| --- | --- | --- |
| `prototype-tests/age-104-pty-mcp-gap/p2-proof-tests.sh` (M3-C1 negative control: `--tools mcp__...` fails in PTY) | `dossier/evidence/p2-proof-tests.sh` | ST-01 → AGE-101 |
| `prototype-tests/age-104-pty-mcp-gap/p2-proof-tests.sh` (M3-C2 positive control: `--allowedTools mcp__...` works in PTY) | `dossier/evidence/p2-proof-tests.sh` | ST-01 → AGE-101, ST-04 → AGE-103 |
| `prototype-tests/age-104-pty-mcp-gap/p2-proof-tests.sh` (M3-C3 positive control: no filter works in PTY) | `dossier/evidence/p2-proof-tests.sh` | ST-01 → AGE-101, ST-06 → AGE-113 |
| `prototype-tests/age-104-pty-mcp-gap/p2-proof-tests.sh` (all M3/M4 PTY cells: PTY-with-hooks ≡ PTY-without-hooks) | `dossier/evidence/p2-proof-tests.sh` | ST-05 → AGE-112 |
| `prototype-tests/age-104-pty-mcp-gap/p2-truth-table-harness/run-all.sh` (full 4×3 sweep) | `dossier/evidence/p2-truth-table-harness/run-all.sh` | ST-02 → AGE-90, ST-03 → AGE-102, ST-07 → AGE-114, ST-08 → AGE-115 |
| `prototype-tests/age-104-pty-mcp-gap/p2-truth-table-harness/server.py` (MCP server backing the sweep) | `dossier/evidence/p2-truth-table-harness/server.py` | inherited alongside `run-all.sh` (supports ST-02 / ST-03 / ST-07 / ST-08 above) |

Expected control pattern reproduced by prototype-tests/age-104-pty-mcp-gap/p2-proof-tests.sh:

- M3-C1 (PTY w/ hooks + --tools mcp__...): tools/call never reaches the MCP server; expected failure today.
- M3-C2 (PTY w/ hooks + --allowedTools mcp__...): tools/call reaches the server and the tool result returns to Claude.
- M3-C3 (PTY w/ hooks + no filter): tools/call reaches the server and the tool result returns to Claude.

The script exits 0 only when all three assertions match the expected outcomes.
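The pass/fail rule for the focused proof can be sketched as follows. This is an illustration of the stated contract, not the contents of p2-proof-tests.sh: the `observed` dictionary, the function name, and the non-zero exit value of 1 are assumptions (the source states only that the script exits 0 when all three assertions match).

```python
import sys
from typing import Dict

# Expected outcomes from the control pattern: True means tools/call reached
# the MCP server and the tool result returned to Claude.
EXPECTED: Dict[str, bool] = {"M3-C1": False, "M3-C2": True, "M3-C3": True}


def exit_code(observed: Dict[str, bool]) -> int:
    """Return 0 only when every cell matches its expected outcome."""
    # A missing cell yields None, which never matches, so it counts as a mismatch.
    mismatches = [cell for cell, want in EXPECTED.items() if observed.get(cell) is not want]
    for cell in mismatches:
        print(
            f"MISMATCH {cell}: expected {EXPECTED[cell]}, got {observed.get(cell)}",
            file=sys.stderr,
        )
    return 0 if not mismatches else 1  # 1 is an assumed failure code
```

Because M3-C1 is an expected failure, a run in which M3-C1 starts dispatching is reported as a mismatch under this rule, which is what makes the negative control meaningful.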

prototype-tests/age-104-pty-mcp-gap/MARKERS.md carries the full per-cell ticket mapping.

Pending markers

The prototype-pending marker convention is defined in ~/ai/conventions/prototype-pending-tests.md. The reason format the convention requires is:

prototype-pending: implementation pending in <ticket-key-or-url>; remove marker and make this test pass

These markers are traceable implementation handoff debt, not generic skip/xfail permission. Each cited ticket has a real Linear key/URL; placeholder ticket text is not present. Implementation must remove the marker token (the prototype-pending: colon-prefixed sentinel) from the listed test files or supersede the test with a traceable strictly stronger equivalent recorded in dossier/test-publication-manifest.md, the spawned ticket payload, or the Phase 6 Step 6b output index. The marker is not permission to discard the original assertions.

The harness is bash + Python (not Pytest / Playwright / cargo), so the convention's preferred-runner mapping falls through to the comment-block sentinel header at the top of each published file. grep -RIn "^# prototype-pending:" prototype-tests/age-104-pty-mcp-gap/ enumerates every traceable marker on this branch.
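For reviewers who want to check the reason format as well as enumerate markers, a Python equivalent of the grep command might look like the sketch below. It assumes one ticket key or URL per marker line and that the sentinel line carries the reason text in exactly the required form; the function name and the empty-string convention for malformed markers are illustrative choices, not part of the convention.

```python
import re
from pathlib import Path
from typing import Iterator, Tuple

SENTINEL = "# prototype-pending:"
# Reason format required by the convention, with the ticket captured.
MARKER = re.compile(
    r"^# prototype-pending: implementation pending in (?P<ticket>\S+); "
    r"remove marker and make this test pass\s*$"
)


def iter_markers(root: Path) -> Iterator[Tuple[Path, int, str]]:
    """Yield (path, line number, ticket) for each sentinel line under root.

    The ticket is an empty string when the line starts with the sentinel
    but does not follow the required reason format.
    """
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        try:
            lines = path.read_text(encoding="utf-8").splitlines()
        except UnicodeDecodeError:
            continue  # skip non-text files, similar in spirit to grep -I
        for lineno, line in enumerate(lines, start=1):
            if line.startswith(SENTINEL):
                match = MARKER.match(line)
                yield path, lineno, (match.group("ticket") if match else "")
```

Calling `list(iter_markers(Path("prototype-tests/age-104-pty-mcp-gap")))` from the repository root would list each marker with its cited ticket; any entry with an empty ticket needs a closer look.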

Dossier link

AGE-104 prototype dossier: planning/prototype-age-104-pty-mcp-gap/dossier/.

- Answer: planning/prototype-age-104-pty-mcp-gap/dossier/answer.md
- Risk profile: planning/prototype-age-104-pty-mcp-gap/dossier/risk-profile.md
- Evidence directory: planning/prototype-age-104-pty-mcp-gap/dossier/evidence/ (including p2-truth-table.md, p2-proof-tests.sh, p2-stabilize-output.md, p2-truth-table-harness/, and per-vector v{1..4}-verdict.md)
- Proof-test audit: planning/prototype-age-104-pty-mcp-gap/dossier/proof-test-audit.md (verdict LOW; re-run of p2-proof-tests.sh shows M3-C1 expected-failure, M3-C2 success, M3-C3 success)
- Spawned tickets dossier: planning/prototype-age-104-pty-mcp-gap/dossier/spawned-tickets.md
- Test publication manifest: planning/prototype-age-104-pty-mcp-gap/dossier/test-publication-manifest.md

Spawned implementation tickets

Mapping per dossier/spawned-tickets.md (ST-01 through ST-08) and dossier/test-publication-manifest.md per-ticket carry-forward table:

ST-01 — AGE-101: wire proxy PTY driver for Claude Code one-shot execution. Inherits M3-C1 negative, M3-C2 positive, M3-C3 positive. Acceptance: Remove the prototype-pending: markers in the listed test files, make these tests pass, and preserve the original assertions unless the manifest, spawned ticket payload, or Phase 6 Step 6b output index records a strictly stronger equivalent supersession.
ST-02 — AGE-90: agent-runner flip Claude family providers from headless to proxy mode. Inherits the full 4×3 truth-table sweep (M3/M4 C1 failure and C2/C3 success). Acceptance: Remove the prototype-pending: markers in the listed test files, make these tests pass, and preserve the original assertions unless the manifest, spawned ticket payload, or Phase 6 Step 6b output index records a strictly stronger equivalent supersession.
ST-03 — AGE-102: add driver-owned MCP replacement tools for Claude Code. Inherits M3-C2 positive, M3-C3 positive, and the supporting MCP server Task implementation in p2-truth-table-harness/server.py. Acceptance: Remove the prototype-pending: markers in the listed test files, make these tests pass, and preserve the original assertions unless the manifest, spawned ticket payload, or Phase 6 Step 6b output index records a strictly stronger equivalent supersession.
ST-04 — AGE-103: extend providers.toml with invocation-mode schema. Inherits M3-C1 negative and M3-C2 positive. Acceptance: Remove the prototype-pending: markers in the listed test files, make these tests pass, and preserve the original assertions unless the manifest, spawned ticket payload, or Phase 6 Step 6b output index records a strictly stronger equivalent supersession.
ST-05 — AGE-112: decide orchestrator-mode completion scope until PTY MCP Task is fixed. Inherits all M3/M4 PTY cells. Acceptance: Remove the prototype-pending: markers in the listed test files, make these tests pass, and preserve the original assertions unless the manifest, spawned ticket payload, or Phase 6 Step 6b output index records a strictly stronger equivalent supersession.
ST-06 — AGE-113: harden Claude proxy launch-shape regression coverage. Inherits M3-C1 negative, M3-C2 positive, M3-C3 positive; AGE-104 proof harness serves as the regression spec. Acceptance: Remove the prototype-pending: markers in the listed test files, make these tests pass, and preserve the original assertions unless the manifest, spawned ticket payload, or Phase 6 Step 6b output index records a strictly stronger equivalent supersession.
ST-07 — AGE-114: document the AGE-104 Claude launch-shape constraint. Inherits the expected control pattern from p2-proof-tests.sh and the 2.1.141–2.1.143 version range from V4 evidence. Acceptance: Remove the prototype-pending: markers in the listed test files, make these tests pass, and preserve the original assertions unless the manifest, spawned ticket payload, or Phase 6 Step 6b output index records a strictly stronger equivalent supersession.
ST-08 — AGE-115: file or explicitly decline upstream Claude Code bug report. Inherits the minimal repro from p2-proof-tests.sh plus the full truth-table expectation. Acceptance: Remove the prototype-pending: markers in the listed test files, make these tests pass, and preserve the original assertions unless the manifest, spawned ticket payload, or Phase 6 Step 6b output index records a strictly stronger equivalent supersession.
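To see which tickets must keep a given focused control passing, the inheritance list above can be inverted. The sketch below covers only the tickets whose entries name M3-C1, M3-C2, or M3-C3 explicitly; AGE-90, AGE-112, AGE-114, and AGE-115 inherit sweeps or patterns rather than individual controls and are left out. The dictionary is a hand transcription of the list, not data read from the manifest.

```python
from collections import defaultdict
from typing import Dict, List

# Focused controls inherited per ticket, transcribed from the ST-01..ST-08 list.
INHERITS: Dict[str, List[str]] = {
    "AGE-101": ["M3-C1", "M3-C2", "M3-C3"],
    "AGE-102": ["M3-C2", "M3-C3"],
    "AGE-103": ["M3-C1", "M3-C2"],
    "AGE-113": ["M3-C1", "M3-C2", "M3-C3"],
}


def tickets_by_control(inherits: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Invert ticket -> controls into control -> sorted list of tickets."""
    out: Dict[str, List[str]] = defaultdict(list)
    for ticket, controls in inherits.items():
        for control in controls:
            out[control].append(ticket)
    return {control: sorted(tickets) for control, tickets in sorted(out.items())}


for control, tickets in tickets_by_control(INHERITS).items():
    print(control, tickets)
# M3-C1 ['AGE-101', 'AGE-103', 'AGE-113']
# M3-C2 ['AGE-101', 'AGE-102', 'AGE-103', 'AGE-113']
# M3-C3 ['AGE-101', 'AGE-102', 'AGE-113']
```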

Anti-scope

- This is not a production implementation PR. It is a prototype-test contract PR per ~/ai/conventions/prototype-pending-tests.md and ~/ai/conventions/prototype-review.md.
- Reviewers should not request prototype-source quality changes (style, structural refactors, harness internals, file layout) that do not change the behavior contract.
- CodeRabbit is not run by default on this PR.
- Production PR-review gates (CI green-gate, full lint/format, code-coverage thresholds, architecture review) are not applied by default. The published files live under prototype-tests/age-104-pty-mcp-gap/ and are explicitly outside the production test surface; spawned implementation tickets bring the contract into production via their own PRs.
- Approval of this PR is approval of the test contract carried forward to the listed spawned tickets, not approval of any production behavior change.

…roof harness

Proof tests for AGE-104 (sub-prototype of AGE-89 prototype-age-89-clarify) showing that Claude Code 2.1.141-2.1.143 interactive PTY mode suppresses MCP tools/call dispatch when launched with --tools mcp__<server>__<name>,..., while print mode accepts the same flag. Two confirmed PTY workarounds: --allowedTools mcp__... or omit the tool filter.

Files:
- p2-proof-tests.sh: focused proof script asserting M3-C1 fails, M3-C2 + M3-C3 succeed in interactive PTY w/ hooks.
- p2-truth-table-harness/: full 4x3 sweep (4 modes x 3 flag shapes).
- MARKERS.md: prototype-pending marker mapping per spawned implementation ticket (AGE-101, AGE-90, AGE-102, AGE-103, AGE-112, AGE-113, AGE-114, AGE-115).
- README.md: reviewer guidance per ~/ai/conventions/prototype-review.md.

This is a prototype-test contract PR, not a production implementation PR. Review follows prototype-review.md: test design, outcome alignment, pending-marker traceability, dossier verdict support. Production PR-review gates do not apply.

prototype-pending markers cite real Linear keys. Spawned implementation tickets must remove markers and make these tests pass against production code, preserving original assertions unless a strictly stronger equivalent supersession is recorded in the manifest, ticket payload, or Phase 6 Step 6b output index.

Refs: AGE-104 dossier at planning/prototype-age-104-pty-mcp-gap/dossier/

linear · 2026-05-15T23:50:17Z

AGE-112

AGE-104


* docs(arch): add Claude proxy MCP launch-shape runbook (AGE-114)

New `docs/architecture/claude-proxy-mcp-launch-shape.md` captures the AGE-104 finding that interactive PTY Claude proxy MCP replacement launches must use `--allowedTools mcp__<server>__<tool>,...` or omit the tool filter; `--tools mcp__<server>__<tool>,...` suppresses MCP `tools/call` dispatch in PTY mode (Claude Code 2.1.141-2.1.143). Print mode is unaffected.

Cites the canonical PR #90 reproducer harness (`prototype-tests/age-104-pty-mcp-gap/p2-proof-tests.sh`), the M3-C1/M3-C2/M3-C3 control pattern, the `/home/nes/.local/share/claude/versions/2.1.143` binary-path caveat, the bounded `--allowedTools` camelCase vs `--allowed-tools` kebab-case spelling ambiguity (PTY MCP equivalence is not in evidence; the current Rust executor at `crates/oulipoly-runtime/src/executor/cli.rs::apply_provider_policy` emits kebab-case), the upstream-issue absence (AGE-115 tracks the file-or-decline decision), and a version-bump runbook re-verification recipe.

Additive cross-links from `docs/architecture/provider-accounts-redesign.md` (sections "Example: Claude CLI Integration" and "Version-Aware Integration"), `README.md` (sections "Interactive REPL", "providers.toml", "Adding a Model"), and `AGENTS.md` (section "Model Command Syntax") route readers to the runbook before they edit Claude proxy launch flags.

DECISIONS.md records the AGE-114 Phase 2.5 dispositions: D1 inherited-estimate cold-start (proceed without baseline; AGE-104 prototype is the prototype-first satisfaction); D2 problem-map gate skipped per `skip_problem_map_gate=true`; D3 Phase 2.5 verdict HIGH accepted with exhaustive mode for the runbook + provider-accounts-redesign surfaces.

Per ACR-126 prototype-pending carry-forward, the inherited PR #90 proof tests are recorded as strictly stronger equivalent supersession via canonical-reproducer citation in the runbook (Proof Command + Version-Bump Runbook sections); not imported into this branch because they live on the prototype-test branch and a docs-only diff has no production code to validate against.

Refs: AGE-114, AGE-89 (parent), AGE-104 (prototype), AGE-101, AGE-102, AGE-103, AGE-113, AGE-115.

* docs(decisions): record AGE-114 D4/D5 Phase 6 Tier-1 retry lineage

D4: Step 6c R1 missing `consumed:` log evidence; autonomous Tier-1 rewind-and-retry per orchestrator spec § Step 6c violation rule.

D5: Step 6c model substitution to claude-opus for consumed-evidence reliability after R2/R3 (gpt-high → codex2) and R4 (claude-opus → claude4 with inline placement) collapsed the consumed-echo. R5's final-block prompt structure succeeded; all 97 consumed: rows landed in the captured log.

* docs(decisions): record AGE-114 D6 Phase 8 test-audit MEDIUM disposition

D6: Phase 8 test-audit returned MEDIUM solely because acceptance-checklist row AC-016 searches for `no filter` literally but the runbook uses `no tool filter`. The product-docs content is correct (M3-C3 documented as succeeding with no tool filter; AC-046 separately verifies the no-filter allowance against `## Rule`). Accepted as recipe-weakness residual with no content gap, per pr-review.md fix-pass disposition rules. ACR-156/162/163 LOW-only rule applies to code-quality gates only; PR-review gates have their own disposition policy.

)

New `docs/external-issues/claude-code-pty-mcp-task-availability.md` records the file-or-decline decision for the AGE-104 finding that Claude Code 2.1.141-2.1.143 fails to dispatch MCP `tools/call` in interactive PTY mode under `--tools mcp__<server>__<name>,...` while print mode succeeds and PTY mode succeeds under `--allowedTools` or no tool filter.

Disposition: file local only (draft, do not submit). The document is the durable internal record and is shaped to be self-contained for an external Anthropic maintainer audience. A future authorized step may submit it via `gh api repos/anthropics/claude-code/issues` behind an explicit root authorization gate; that submission is out of scope for this WU per the proposal's anti-scope.

Includes the 4x3 truth-table summary from AGE-104 P2 evidence (M3-C1 fails under `--tools`; M3-C2 / M3-C3 succeed), points to the canonical PR #90 proof harness, names the local workaround flag shapes, and frames the suggested fix surface as the main-PTY tool-dispatch / tool-registry path per the AGE-104 V4 verdict.

Cross-links to `docs/architecture/claude-proxy-mcp-launch-shape.md` (AGE-114 runbook) as the maintainer-facing operational reference; the AGE-115 ticket URL; AGE-104 dossier; and sister tickets AGE-101/102/103/113/114.

DECISIONS.md additions record the WU-local pipeline decisions: D1 (Phase 2.5 inherited-estimate cold-start resolved inline as procedural), D2 (problem-map human gate skipped via dispatch override), D3 (Phase 5 base-update from stale main to origin/main 4d0d168 to bring AGE-114's runbook into the WU's base).

Inherited prototype-pending markers on `prototype-tests/age-104-pty-mcp-gap/` are NOT removed: the decision document supersedes the marker-removal requirement for AGE-115's slice via canonical-reproducer citation per `~/ai/conventions/prototype-pending-tests.md`. Marker removal belongs to the future WU that ships a Claude Code upstream patch or local runtime workaround making M3-C1 pass.

Refs: AGE-115, AGE-104 (predecessor prototype), AGE-114 (sibling WU runbook)

Closes AGE-115

nestharus wants to merge 1 commit into main from prototype-tests-age-104-pty-mcp-gap
