test(prototype): AGE-104 PTY vs print MCP Task proof harness #90
Draft
nestharus wants to merge 1 commit into
…roof harness

Proof tests for AGE-104 (sub-prototype of AGE-89 prototype-age-89-clarify) showing that Claude Code 2.1.141-2.1.143 interactive PTY mode suppresses MCP tools/call dispatch when launched with --tools mcp__<server>__<name>,..., while print mode accepts the same flag. Two confirmed PTY workarounds: --allowedTools mcp__... or omit the tool filter.

Files:
- p2-proof-tests.sh: focused proof script asserting M3-C1 fails, M3-C2 + M3-C3 succeed in interactive PTY with hooks.
- p2-truth-table-harness/: full 4x3 sweep (4 modes x 3 flag shapes).
- MARKERS.md: prototype-pending marker mapping per spawned implementation ticket (AGE-101, AGE-90, AGE-102, AGE-103, AGE-112, AGE-113, AGE-114, AGE-115).
- README.md: reviewer guidance per ~/ai/conventions/prototype-review.md.

This is a prototype-test contract PR, not a production implementation PR. Review follows prototype-review.md: test design, outcome alignment, pending-marker traceability, dossier verdict support. Production PR-review gates do not apply.

prototype-pending markers cite real Linear keys. Spawned implementation tickets must remove markers and make these tests pass against production code, preserving original assertions unless a strictly stronger equivalent supersession is recorded in the manifest, ticket payload, or Phase 6 Step 6b output index.

Refs: AGE-104 dossier at planning/prototype-age-104-pty-mcp-gap/dossier/
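The 4x3 sweep described in this commit message can be written down as a small expectation table. The following is a minimal sketch, not the harness itself: the C1/C2/C3 flag shapes and the M3/M4 PTY rows come from the PR text, while treating M1 and M2 as the print-mode rows is an assumption made only for illustration, and `expected_dispatch` and `truth_table` are hypothetical names.

```python
from itertools import product

# Mode labels. M3/M4 as PTY rows follows the PR text; M1/M2 as the
# print-mode rows is an ASSUMPTION for illustration only.
MODES = {
    "M1": "print",
    "M2": "print",
    "M3": "pty",
    "M4": "pty",
}

# The three flag shapes named in the PR (placeholders kept verbatim).
FLAG_SHAPES = {
    "C1": ["--tools", "mcp__<server>__<name>"],
    "C2": ["--allowedTools", "mcp__<server>__<name>"],
    "C3": [],  # no tool filter
}


def expected_dispatch(mode, shape):
    """Return True when MCP tools/call is expected to reach the server.

    Per the reported finding, only PTY mode combined with the --tools
    shape (C1) suppresses dispatch; every other cell succeeds.
    """
    return not (MODES[mode] == "pty" and shape == "C1")


def truth_table():
    """Build the full mode x flag-shape expectation table."""
    return {
        (mode, shape): expected_dispatch(mode, shape)
        for mode, shape in product(MODES, FLAG_SHAPES)
    }


if __name__ == "__main__":
    for (mode, shape), ok in truth_table().items():
        print(f"{mode}-{shape}: {'dispatch' if ok else 'suppressed'}")
```

Keeping the expectation as data makes the contract easy to compare against a real sweep: a harness run would need to observe the same twelve outcomes for the finding to hold on a given Claude Code version.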
nestharus added a commit that referenced this pull request May 16, 2026

* docs(arch): add Claude proxy MCP launch-shape runbook (AGE-114)

  New `docs/architecture/claude-proxy-mcp-launch-shape.md` captures the AGE-104 finding that interactive PTY Claude proxy MCP replacement launches must use `--allowedTools mcp__<server>__<tool>,...` or omit the tool filter; `--tools mcp__<server>__<tool>,...` suppresses MCP `tools/call` dispatch in PTY mode (Claude Code 2.1.141-2.1.143). Print mode is unaffected.

  Cites the canonical PR #90 reproducer harness (`prototype-tests/age-104-pty-mcp-gap/p2-proof-tests.sh`), the M3-C1/M3-C2/M3-C3 control pattern, the `/home/nes/.local/share/claude/versions/2.1.143` binary-path caveat, the bounded `--allowedTools` camelCase vs `--allowed-tools` kebab-case spelling ambiguity (PTY MCP equivalence is not in evidence; the current Rust executor at `crates/oulipoly-runtime/src/executor/cli.rs::apply_provider_policy` emits kebab-case), the upstream-issue absence (AGE-115 tracks the file-or-decline decision), and a version-bump runbook re-verification recipe.

  Additive cross-links from `docs/architecture/provider-accounts-redesign.md` (sections "Example: Claude CLI Integration" and "Version-Aware Integration"), `README.md` (sections "Interactive REPL", "providers.toml", "Adding a Model"), and `AGENTS.md` (section "Model Command Syntax") route readers to the runbook before they edit Claude proxy launch flags.

  DECISIONS.md records the AGE-114 Phase 2.5 dispositions: D1 inherited-estimate cold-start (proceed without baseline; AGE-104 prototype is the prototype-first satisfaction); D2 problem-map gate skipped per `skip_problem_map_gate=true`; D3 Phase 2.5 verdict HIGH accepted with exhaustive mode for the runbook + provider-accounts-redesign surfaces.

  Per ACR-126 prototype-pending carry-forward, the inherited PR #90 proof tests are recorded as strictly stronger equivalent supersession via canonical-reproducer citation in the runbook (Proof Command + Version-Bump Runbook sections); not imported into this branch because they live on the prototype-test branch and a docs-only diff has no production code to validate against.

  Refs: AGE-114, AGE-89 (parent), AGE-104 (prototype), AGE-101, AGE-102, AGE-103, AGE-113, AGE-115.

* docs(decisions): record AGE-114 D4/D5 Phase 6 Tier-1 retry lineage

  D4: Step 6c R1 missing `consumed:` log evidence; autonomous Tier-1 rewind-and-retry per orchestrator spec § Step 6c violation rule.

  D5: Step 6c model substitution to claude-opus for consumed-evidence reliability after R2/R3 (gpt-high → codex2) and R4 (claude-opus → claude4 with inline placement) collapsed the consumed-echo. R5's final-block prompt structure succeeded; all 97 consumed: rows landed in the captured log.

* docs(decisions): record AGE-114 D6 Phase 8 test-audit MEDIUM disposition

  D6: Phase 8 test-audit returned MEDIUM solely because acceptance-checklist row AC-016 searches for `no filter` literally but the runbook uses `no tool filter`. The product-docs content is correct (M3-C3 documented as succeeding with no tool filter; AC-046 separately verifies the no-filter allowance against `## Rule`). Accepted as recipe-weakness residual with no content gap, per pr-review.md fix-pass disposition rules. ACR-156/162/163 LOW-only rule applies to code-quality gates only; PR-review gates have their own disposition policy.
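The D6 recipe weakness is easy to reproduce in isolation. The sketch below assumes the AC-016 check amounts to a literal substring search (the actual recipe is not shown in this PR); `runbook_line`, `literal_match`, and `tolerant_match` are hypothetical names, and the sample sentence is illustrative rather than quoted from the runbook.

```python
import re

# Illustrative sentence only; not quoted from the runbook.
runbook_line = "M3-C3 succeeds in PTY mode with no tool filter."


def literal_match(text):
    """Literal search for 'no filter', as D6 describes for AC-016."""
    return "no filter" in text


def tolerant_match(text):
    """Accept both 'no filter' and 'no tool filter' wording."""
    return re.search(r"\bno (?:tool )?filter\b", text) is not None


if __name__ == "__main__":
    print("literal:", literal_match(runbook_line))
    print("tolerant:", tolerant_match(runbook_line))
```

The literal search misses the sentence even though the content it is looking for is present, which matches the disposition recorded above: a weakness in the check, not a gap in the documentation.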
nestharus added a commit that referenced this pull request May 16, 2026

)

New `docs/external-issues/claude-code-pty-mcp-task-availability.md` records the file-or-decline decision for the AGE-104 finding that Claude Code 2.1.141-2.1.143 fails to dispatch MCP `tools/call` in interactive PTY mode under `--tools mcp__<server>__<name>,...` while print mode succeeds and PTY mode succeeds under `--allowedTools` or no tool filter.

Disposition: file local only (draft, do not submit). The document is the durable internal record and is shaped to be self-contained for an external Anthropic maintainer audience. A future authorized step may submit it via `gh api repos/anthropics/claude-code/issues` behind an explicit root authorization gate; that submission is out of scope for this WU per the proposal's anti-scope.

Includes the 4x3 truth-table summary from AGE-104 P2 evidence (M3-C1 fails under `--tools`; M3-C2 / M3-C3 succeed), points to the canonical PR #90 proof harness, names the local workaround flag shapes, and frames the suggested fix surface as the main-PTY tool-dispatch / tool-registry path per the AGE-104 V4 verdict.

Cross-links to `docs/architecture/claude-proxy-mcp-launch-shape.md` (AGE-114 runbook) as the maintainer-facing operational reference; the AGE-115 ticket URL; AGE-104 dossier; and sister tickets AGE-101/102/103/113/114.

DECISIONS.md additions record the WU-local pipeline decisions: D1 (Phase 2.5 inherited-estimate cold-start resolved inline as procedural), D2 (problem-map human gate skipped via dispatch override), D3 (Phase 5 base-update from stale main to origin/main 4d0d168 to bring AGE-114's runbook into the WU's base).

Inherited prototype-pending markers on `prototype-tests/age-104-pty-mcp-gap/` are NOT removed: the decision document supersedes the marker-removal requirement for AGE-115's slice via canonical-reproducer citation per `~/ai/conventions/prototype-pending-tests.md`. Marker removal belongs to the future WU that ships a Claude Code upstream patch or local runtime workaround making M3-C1 pass.

Refs: AGE-115, AGE-104 (predecessor prototype), AGE-114 (sibling WU runbook)

Closes AGE-115
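Because the finding is bounded to Claude Code 2.1.141-2.1.143, any tooling that relies on it would need to know whether an installed version falls inside that evidence range. A minimal sketch under that assumption follows; `parse_version`, `EVIDENCE_RANGE`, and `launch_shape_status` are hypothetical names, and the sketch deliberately says nothing about behavior outside the range, since the PR provides no evidence there.

```python
def parse_version(text):
    """Turn a dotted version such as '2.1.143' into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


# Version range cited in the PR as covered by the AGE-104 evidence.
EVIDENCE_RANGE = (parse_version("2.1.141"), parse_version("2.1.143"))


def launch_shape_status(version):
    """Classify a Claude Code version against the evidence range.

    'covered'    -> the --tools PTY suppression finding is in evidence.
    'unverified' -> no evidence either way; re-run the proof harness.
    """
    low, high = EVIDENCE_RANGE
    if low <= parse_version(version) <= high:
        return "covered"
    return "unverified"


if __name__ == "__main__":
    for candidate in ("2.1.140", "2.1.142", "2.1.144"):
        print(candidate, launch_shape_status(candidate))
```

Returning "unverified" rather than "unaffected" for out-of-range versions keeps the check honest: a version bump is a prompt to re-run `p2-proof-tests.sh`, not a claim that the behavior changed.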
Verdict

CONFIRM (it works). The AGE-104 P2 evidence shows that the Claude Code interactive PTY mode can dispatch `mcp__<server>__Task`-class tools when the launch shape uses `--allowedTools mcp__<server>__<name>,...` or omits the tool filter; only `--tools mcp__<server>__<name>,...` suppresses MCP `tools/call` dispatch in true PTY mode while print mode succeeds under all three flag shapes. The dossier `answer.md` headline, the proof-test audit re-run of `p2-proof-tests.sh` (M3-C1 expected-failure, M3-C2 success, M3-C3 success; audit verdict `LOW`), and the 4x3 truth-table sweep all back this conclusion. AGE-89's orchestrator-mode MCP-Task gate is unblocked subject to the AGE-101 / AGE-102 / AGE-103 amendments and the AGE-113 / AGE-114 hardening tickets.

Reviewer focus
Review follows `~/ai/conventions/prototype-review.md`. Reviewers should comment on test design, outcome alignment, pending-marker traceability, and whether the published harness supports the dossier `CONFIRM` verdict. Specifically:

- … `--tools` suppression and the `--allowedTools` / no-filter PTY positive controls?
- … `prototype-pending:` markers cite real Linear keys/URLs traceable to the spawned implementation tickets listed below?
- … `dossier/test-publication-manifest.md`?

This is not a review of disposable prototype source quality. Do not request stylistic, structural, or hygiene changes to harness internals that do not change the behavior contract.
Test manifest

Branch: `prototype-tests-age-104-pty-mcp-gap` (HEAD `b3347b7`).

Published harness lives under `prototype-tests/age-104-pty-mcp-gap/` on the prototype-test branch. The manifest paths reference the planning-tree evidence sources; the on-branch publication paths are equivalent contents.

- `prototype-tests/age-104-pty-mcp-gap/p2-proof-tests.sh` (M3-C1 negative control: `--tools mcp__...` fails in PTY); manifest source `dossier/evidence/p2-proof-tests.sh`
- `prototype-tests/age-104-pty-mcp-gap/p2-proof-tests.sh` (M3-C2 positive control: `--allowedTools mcp__...` works in PTY); manifest source `dossier/evidence/p2-proof-tests.sh`
- `prototype-tests/age-104-pty-mcp-gap/p2-proof-tests.sh` (M3-C3 positive control: no filter works in PTY); manifest source `dossier/evidence/p2-proof-tests.sh`
- `prototype-tests/age-104-pty-mcp-gap/p2-proof-tests.sh` (all M3/M4 PTY cells: PTY-with-hooks ≡ PTY-without-hooks); manifest source `dossier/evidence/p2-proof-tests.sh`
- `prototype-tests/age-104-pty-mcp-gap/p2-truth-table-harness/run-all.sh` (full 4×3 sweep); manifest source `dossier/evidence/p2-truth-table-harness/run-all.sh`
- `prototype-tests/age-104-pty-mcp-gap/p2-truth-table-harness/server.py` (MCP server backing the sweep); manifest source `dossier/evidence/p2-truth-table-harness/server.py`; `run-all.sh` (supports ST-02 / ST-03 / ST-07 / ST-08 above)

Expected control pattern reproduced by `prototype-tests/age-104-pty-mcp-gap/p2-proof-tests.sh`:

- M3-C1 (`--tools mcp__...`): `tools/call` never reaches the MCP server; expected failure today.
- M3-C2 (`--allowedTools mcp__...`): `tools/call` reaches the server and the tool result returns to Claude.
- M3-C3 (no tool filter): `tools/call` reaches the server and the tool result returns to Claude.

`prototype-tests/age-104-pty-mcp-gap/MARKERS.md` carries the full per-cell ticket mapping.

Pending markers
The prototype-pending marker convention is defined in `~/ai/conventions/prototype-pending-tests.md`. The reason format the convention requires is:

These markers are traceable implementation handoff debt, not generic skip/xfail permission. Each cited ticket has a real Linear key/URL; placeholder ticket text is not present. Implementation must remove the marker token (the `prototype-pending:` colon-prefixed sentinel) from the listed test files or supersede the test with a traceable strictly stronger equivalent recorded in `dossier/test-publication-manifest.md`, the spawned ticket payload, or the Phase 6 Step 6b output index. The marker is not permission to discard the original assertions.

The harness is bash + Python (not Pytest / Playwright / cargo), so the convention's preferred-runner mapping falls through to the comment-block sentinel header at the top of each published file.
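A minimal Python sketch of scanning for the comment-block sentinel described above, assuming each marker sits at the start of a line as `# prototype-pending:`. The function name `find_markers` is hypothetical and is not part of the published harness.

```python
import re
from pathlib import Path

# Sentinel at the start of a line, as used by the comment-block header.
MARKER = re.compile(r"^# prototype-pending:")


def find_markers(root):
    """Return (path, line_number, line) for every marker under root."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        try:
            lines = path.read_text(encoding="utf-8").splitlines()
        except UnicodeDecodeError:
            continue  # skip files that are not valid UTF-8 text
        for number, line in enumerate(lines, start=1):
            if MARKER.match(line):
                hits.append((str(path), number, line))
    return hits


if __name__ == "__main__":
    for path, number, line in find_markers("prototype-tests/age-104-pty-mcp-gap"):
        print(f"{path}:{number}:{line}")
```

An implementation ticket could use a scan like this as a completion check: once its markers are removed, the listed test files should no longer appear in the results.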
`grep -RIn "^# prototype-pending:" prototype-tests/age-104-pty-mcp-gap/` enumerates every traceable marker on this branch.

Dossier link
AGE-104 prototype dossier: `planning/prototype-age-104-pty-mcp-gap/dossier/`.

- `planning/prototype-age-104-pty-mcp-gap/dossier/answer.md`
- `planning/prototype-age-104-pty-mcp-gap/dossier/risk-profile.md`
- `planning/prototype-age-104-pty-mcp-gap/dossier/evidence/` (including `p2-truth-table.md`, `p2-proof-tests.sh`, `p2-stabilize-output.md`, `p2-truth-table-harness/`, and per-vector `v{1..4}-verdict.md`)
- `planning/prototype-age-104-pty-mcp-gap/dossier/proof-test-audit.md` (verdict `LOW`; re-run of `p2-proof-tests.sh` shows M3-C1 expected-failure, M3-C2 success, M3-C3 success)
- `planning/prototype-age-104-pty-mcp-gap/dossier/spawned-tickets.md`
- `planning/prototype-age-104-pty-mcp-gap/dossier/test-publication-manifest.md`

Spawned implementation tickets
Mapping per `dossier/spawned-tickets.md` (ST-01 through ST-08) and `dossier/test-publication-manifest.md` per-ticket carry-forward table:

- ST-01 (AGE-101): wire proxy PTY driver for Claude Code one-shot execution. Inherits M3-C1 negative, M3-C2 positive, M3-C3 positive. Acceptance: Remove the `prototype-pending:` markers in the listed test files, make these tests pass, and preserve the original assertions unless the manifest, spawned ticket payload, or Phase 6 Step 6b output index records a strictly stronger equivalent supersession.
- ST-02 (AGE-90): agent-runner flip Claude family providers from headless to proxy mode. Inherits the full 4×3 truth-table sweep (M3/M4 C1 failure and C2/C3 success). Acceptance: Remove the `prototype-pending:` markers in the listed test files, make these tests pass, and preserve the original assertions unless the manifest, spawned ticket payload, or Phase 6 Step 6b output index records a strictly stronger equivalent supersession.
- ST-03 (AGE-102): add driver-owned MCP replacement tools for Claude Code. Inherits M3-C2 positive, M3-C3 positive, and the supporting MCP server `Task` implementation in `p2-truth-table-harness/server.py`. Acceptance: Remove the `prototype-pending:` markers in the listed test files, make these tests pass, and preserve the original assertions unless the manifest, spawned ticket payload, or Phase 6 Step 6b output index records a strictly stronger equivalent supersession.
- ST-04 (AGE-103): extend providers.toml with invocation-mode schema. Inherits M3-C1 negative and M3-C2 positive. Acceptance: Remove the `prototype-pending:` markers in the listed test files, make these tests pass, and preserve the original assertions unless the manifest, spawned ticket payload, or Phase 6 Step 6b output index records a strictly stronger equivalent supersession.
- ST-05 (AGE-112): decide orchestrator-mode completion scope until PTY MCP Task is fixed. Inherits all M3/M4 PTY cells. Acceptance: Remove the `prototype-pending:` markers in the listed test files, make these tests pass, and preserve the original assertions unless the manifest, spawned ticket payload, or Phase 6 Step 6b output index records a strictly stronger equivalent supersession.
- ST-06 (AGE-113): harden Claude proxy launch-shape regression coverage. Inherits M3-C1 negative, M3-C2 positive, M3-C3 positive; AGE-104 proof harness serves as the regression spec. Acceptance: Remove the `prototype-pending:` markers in the listed test files, make these tests pass, and preserve the original assertions unless the manifest, spawned ticket payload, or Phase 6 Step 6b output index records a strictly stronger equivalent supersession.
- ST-07 (AGE-114): document the AGE-104 Claude launch-shape constraint. Inherits the expected control pattern from `p2-proof-tests.sh` and the 2.1.141-2.1.143 version range from V4 evidence. Acceptance: Remove the `prototype-pending:` markers in the listed test files, make these tests pass, and preserve the original assertions unless the manifest, spawned ticket payload, or Phase 6 Step 6b output index records a strictly stronger equivalent supersession.
- ST-08 (AGE-115): file or explicitly decline upstream Claude Code bug report. Inherits the minimal repro from `p2-proof-tests.sh` plus the full truth-table expectation. Acceptance: Remove the `prototype-pending:` markers in the listed test files, make these tests pass, and preserve the original assertions unless the manifest, spawned ticket payload, or Phase 6 Step 6b output index records a strictly stronger equivalent supersession.

Anti-scope
- … `~/ai/conventions/prototype-pending-tests.md` and `~/ai/conventions/prototype-review.md`.
- … `prototype-tests/age-104-pty-mcp-gap/` and are explicitly outside the production test surface; spawned implementation tickets bring the contract into production via their own PRs.