feat(cli): pin explicit model+thinking, add live API smoke for harness drift#71
Open
rawwerks wants to merge 5 commits into
Layer 0 + Layer 1 of the harness robustness plan: stop inheriting SDK defaults, and catch upstream drift before users hit it.

* claude-sdk + codex-sdk now pass explicit model and (Claude only) thinking via tools/cli/src/harnesses/defaults.ts, the single audit point. Override per run via ANTHROPIC_MODEL / OPENAI_MODEL / CODEX_MODEL env vars.
* Pin @anthropic-ai/claude-agent-sdk to exact 0.2.126. The previous ^0.2.90 resolved to 0.2.90, which hardcoded thinking.type: "enabled"; Opus 4.6+ / Sonnet 4.6+ now reject that with HTTP 400.
* Add tools/cli/tests/live/smoke.live.test.ts and a `test:live` script using a dedicated vitest.live.config.ts. Tests auto-skip per harness when API keys are absent, so contributors without keys aren't blocked.
* Add .github/workflows/cli-live-smoke.yml, which runs the live smoke on every PR touching tools/cli/** plus workflow_dispatch. Uses org ANTHROPIC_API_KEY and OPENAI_API_KEY secrets. Complements the existing cli-real-harness-smoke.yml (binary-spawn end-to-end) with a faster harness-unit-level check.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
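For reviewers who have not opened defaults.ts, the "single audit point" idea can be sketched roughly as below. This is an illustration only, not the file's actual contents: the function names, the fallback model values, the `{ type: "adaptive" }` option shape, and the precedence of CODEX_MODEL over OPENAI_MODEL are all assumptions.

```typescript
// Hypothetical sketch of a defaults module. Names, fallbacks, and env
// precedence are assumptions for illustration, not the PR's real code.
type Env = Record<string, string | undefined>;

interface ClaudeDefaults {
  model: string;
  thinking: { type: "adaptive" };
}

interface CodexDefaults {
  model: string;
}

// Assumed fallback: the PR only mentions claude-sonnet-4-6 as a supported name.
const CLAUDE_FALLBACK_MODEL = "claude-sonnet-4-6";
// Placeholder string, not a real model name.
const CODEX_FALLBACK_MODEL = "example-codex-model";

// Return the first value that is defined and not blank.
function firstNonEmpty(...values: Array<string | undefined>): string | undefined {
  return values.find((v) => v !== undefined && v.trim() !== "");
}

function resolveClaudeDefaults(env: Env): ClaudeDefaults {
  return {
    model: firstNonEmpty(env.ANTHROPIC_MODEL) ?? CLAUDE_FALLBACK_MODEL,
    // Always explicit, so a change in the SDK's own default cannot leak through.
    thinking: { type: "adaptive" },
  };
}

function resolveCodexDefaults(env: Env): CodexDefaults {
  // Assumed precedence: CODEX_MODEL wins over OPENAI_MODEL.
  return {
    model: firstNonEmpty(env.CODEX_MODEL, env.OPENAI_MODEL) ?? CODEX_FALLBACK_MODEL,
  };
}
```

In a harness this would presumably be called with `process.env`; taking the environment as a parameter keeps the resolution logic testable without touching global state.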
workflow_dispatch is blocked for new workflow files until they exist on the default branch. Add the feature branch to the push trigger so the live smoke runs without opening a PR. This entry must be reverted before merge.
0.2.126 introduced platform-specific native binary packaging via 8
optional-deps, but the linux-x64 and linux-x64-musl variants share
identical {os, cpu} constraints with no libc discriminator. npm
install on glibc Linux runners is non-deterministic, and the SDK's
runtime binary lookup can fail (CI run 25388337686 confirmed this).
0.2.107 is the last 0.2.x release with a bundled cli.js — no native
binary lookup, no platform-package mess. It already supports adaptive
thinking and modern model names like claude-sonnet-4-6, so it satisfies
the original Layer 0 requirement. Giving up ~3 weeks of SDK improvements
is worth it to keep CI deterministic.
Layer 1 live smoke caught this drift exactly as designed.
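The ambiguity described in this commit can be illustrated with a deliberately simplified model of platform matching. This is not npm's actual resolution algorithm, and the package names are hypothetical; it only shows why two variants with identical {os, cpu} constraints cannot be told apart on a glibc host unless a libc discriminator is present.

```typescript
// Simplified model of optional-dependency platform matching.
// Not npm's real resolver; package names are made up for illustration.
type Libc = "glibc" | "musl";

interface PlatformPackage {
  name: string;
  os: string;
  cpu: string;
  libc?: Libc; // absent means "no libc constraint declared"
}

interface Host {
  os: string;
  cpu: string;
  libc: Libc;
}

// A package is eligible when every constraint it declares matches the host.
function isEligible(pkg: PlatformPackage, host: Host): boolean {
  return (
    pkg.os === host.os &&
    pkg.cpu === host.cpu &&
    (pkg.libc === undefined || pkg.libc === host.libc)
  );
}

function eligibleNames(pkgs: PlatformPackage[], host: Host): string[] {
  return pkgs.filter((p) => isEligible(p, host)).map((p) => p.name);
}

const glibcRunner: Host = { os: "linux", cpu: "x64", libc: "glibc" };

// Both variants declare only {os, cpu}: both are eligible, so the choice is ambiguous.
const withoutLibc: PlatformPackage[] = [
  { name: "example-sdk-linux-x64", os: "linux", cpu: "x64" },
  { name: "example-sdk-linux-x64-musl", os: "linux", cpu: "x64" },
];

// With a libc discriminator, exactly one variant is eligible.
const withLibc: PlatformPackage[] = [
  { name: "example-sdk-linux-x64", os: "linux", cpu: "x64", libc: "glibc" },
  { name: "example-sdk-linux-x64-musl", os: "linux", cpu: "x64", libc: "musl" },
];

const ambiguous = eligibleNames(withoutLibc, glibcRunner);
const unambiguous = eligibleNames(withLibc, glibcRunner);
```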
Workflow has now landed on this branch and was proven (run 25388993268 green for both harnesses). Revert the feature-branch entry so the trigger surface goes back to its intended scope: PRs touching tools/cli/**, push to main, and workflow_dispatch.
claude-agent-sdk@0.2.107 declares ^0.81.0 on @anthropic-ai/sdk, which resolves to versions in the GHSA-p7fg-763f-g4gf range (insecure default file permissions in Local Filesystem Memory Tool, moderate severity).

Adding an npm override to bump the transitive dep to ^0.92.0 (the patched line). The override is invisible to claude-agent-sdk's own caret constraint but ensures the lockfile resolves to a non-vulnerable version. 0.92.0 was published 2026-04-30, passes the 72h cooldown.

Verified: npm audit --omit=dev clean, 159/159 unit tests pass, live smoke against Anthropic still green.

Layer 1's CI sibling (cli-release-check.yml's audit job) caught this.
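One way to see why an override is needed at all: under caret semantics for 0.x versions, ^0.81.0 admits only 0.81.x, so 0.92.0 can never satisfy the SDK's declared range. The sketch below is a minimal reimplementation of the caret rule for plain `major.minor.patch` strings, written for illustration; it is not the node-semver library and ignores prerelease and build tags.

```typescript
// Minimal caret-range check for plain "x.y.z" versions.
// Illustrative only: no prerelease/build-tag handling.
type Version = [number, number, number];

function parseVersion(v: string): Version {
  const m = /^(\d+)\.(\d+)\.(\d+)$/.exec(v);
  if (!m) throw new Error(`unsupported version: ${v}`);
  return [Number(m[1]), Number(m[2]), Number(m[3])];
}

function compareVersions(a: Version, b: Version): number {
  for (let i = 0; i < 3; i++) {
    if (a[i] !== b[i]) return a[i] - b[i];
  }
  return 0;
}

// Caret allows changes that do not modify the left-most non-zero component:
//   ^1.2.3 -> <2.0.0,  ^0.2.3 -> <0.3.0,  ^0.0.3 -> <0.0.4
function caretUpperBound([major, minor, patch]: Version): Version {
  if (major > 0) return [major + 1, 0, 0];
  if (minor > 0) return [0, minor + 1, 0];
  return [0, 0, patch + 1];
}

function satisfiesCaret(version: string, base: string): boolean {
  const v = parseVersion(version);
  const lower = parseVersion(base);
  return compareVersions(v, lower) >= 0 && compareVersions(v, caretUpperBound(lower)) < 0;
}

// 0.92.0 is outside ^0.81.0, hence the override.
const patchedInRange = satisfiesCaret("0.92.0", "0.81.0");
// 0.2.126 is inside ^0.2.90, so only the lockfile kept the old 0.2.90.
const newerSdkInRange = satisfiesCaret("0.2.126", "0.2.90");
```

The same rule explains the earlier observation that ^0.2.90 could stay resolved to 0.2.90: the range permits newer 0.2.x releases, but it does not force them.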
Layer 0 + Layer 1 of CLI harness robustness plan.
Why: prose run --harness claude-sdk was returning HTTP 400 (thinking.type.enabled not supported, use adaptive). Mocked tests passed because they never touched the wire format.
CI proof: run 25388993268 — both harnesses green (claude-sdk 4.05s, codex-sdk 6.71s).
Test plan: typecheck/build/test all pass (159/159); npm run test:live local green; CI live smoke green.
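The per-harness auto-skip used by `test:live` can be sketched as a small key-presence gate. This is an assumed shape, not the contents of smoke.live.test.ts: the helper names are hypothetical, and the mapping of claude-sdk to ANTHROPIC_API_KEY and codex-sdk to OPENAI_API_KEY is inferred from the secrets the workflow uses.

```typescript
// Hypothetical key-presence gate for live tests. Helper names and the
// harness-to-key mapping are assumptions for illustration.
type Env = Record<string, string | undefined>;
type Harness = "claude-sdk" | "codex-sdk";

const REQUIRED_KEY: Record<Harness, string> = {
  "claude-sdk": "ANTHROPIC_API_KEY",
  "codex-sdk": "OPENAI_API_KEY",
};

// True when the harness's API key is set and not blank.
function hasKey(harness: Harness, env: Env): boolean {
  const value = env[REQUIRED_KEY[harness]];
  return value !== undefined && value.trim() !== "";
}

// Human-readable reason for a skip, or null when the test should run.
function skipReason(harness: Harness, env: Env): string | null {
  return hasKey(harness, env)
    ? null
    : `${harness}: ${REQUIRED_KEY[harness]} not set, skipping live smoke`;
}
```

In a Vitest suite, a gate like this would typically feed `describe.skipIf(!hasKey("claude-sdk", process.env))(...)`, so each harness is skipped independently rather than failing the whole run.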
Follow-ups: tighten cli-real-harness-smoke.yml's prompt (too trivial to engage thinking); Layer 2 no-op detection; Layer 5 Renovate.
Generated with Claude Code.