feat(path-cli): cross-harness translation matrix + codex idempotence fixes by ben-emp · Pull Request #81 · empathic/toolpath

ben-emp · 2026-05-06T20:31:17Z

Summary

Three threads, one branch:

Cross-harness translation matrix — exercises every (source, target) pair across the five conversation harnesses (claude, codex, gemini, pi, opencode) and asserts a tight set of idempotence + survival invariants on the IR. Surfaced and fixed four real translation bugs in toolpath-codex along the way.
Per-harness real-fixture roundtrips, strengthened delegation invariants, synthetic compaction tests — backs the "good UX across translation" claim with tests at three layers (per-harness self-roundtrip, cross-harness matrix, targeted regression cases for compaction).
Full-fidelity Claude UI rendering through IR roundtrip — path import claude … && path export claude … now produces a JSONL that resumes in Claude Code with all the original UI affordances (diff view, shell output panel, file viewer, parent-chain navigation), and the same projection works for Codex/Gemini/opencode/Pi sources that come through with non-Claude tool names.

What's in here

Matrix infrastructure (`crates/path-cli/tests/cross_harness_matrix.rs`)

Two test functions:

matrix_translation (25 cells). Each source uses its own captured real-world session as input. Cell shape: A → IR → B → IR → B. After one A→B translation, B's projector + forward must be a fixed point on the resulting IR. Strict equality on idempotence; provenance independence on the second leg.
matrix_schema_validation (5 checks, one per harness). Project the real fixture through the harness's projector, serialize the native output to its on-disk wire format, re-parse via the harness's own reader. Catches projector output that's "valid IR" but produces native bytes the reader rejects.

Per-cell invariants (15 categories):

Category	Check
Skeleton	turn count, role sequence
Content	turn text (whitespace-normalized)
Tool calls	call_id set, input shape (snake-case canonicalized so `filePath` ↔ `file_path`), output content, error flag
Tokens	per-turn idempotence, total_usage idempotence, survival across the B leg
Reasoning	thinking idempotence + survival
Delegations	per-delegation `agent_id`/prompt/child-turn count idempotence + soft cross-leg survival
Other fields	model, stop_reason, parent_id graph, environment.cwd, files_changed set
Pollution	no foreign-namespace extras keys

Per-crate real-fixture roundtrips

crates/toolpath-{claude,codex,gemini,opencode,pi}/tests/real_fixture_roundtrip.rs (one per harness). Each loads the captured real-world session at test-fixtures/<harness>/ and asserts native → IR → Path → IR → native preserves the user-visible contract: meaningful turn count and roles, per-turn text, tool-call topology, delegation content, and that the projector output re-parses through the harness's own reader.

Claude additionally gets four regression locks for full-session roundtrip (added with the Claude UI fixes — see below):

read_then_project_preserves_line_count — bluntest possible UX-loss check: source.lines() count == projected entries + preamble.
read_then_project_preserves_metadata_entries — attachments and preamble line counts match source exactly.
read_then_project_preserves_tool_use_result_count — every toolUseResult blob present in source survives the roundtrip (Claude UI uses this for diff/shell-output rendering).
cache_roundtrip_preserves_line_counts_per_type — exercises the full path-cli flow (source JSONL → derive_path → cache JSON → extract → ClaudeProjector → JSONL) and asserts every entry type appears in the projection. Catches whole-category drops (ai-title / last-prompt / file-history-snapshot regressions).

Synthetic compaction roundtrips

Real compaction events fire when the model context window fills mid-session — they can't reliably be triggered by a 5-minute capture prompt. Synthetic minimum-shape fixtures are the right tool. One per harness whose format models compaction natively: Claude (compact_boundary), Codex (compacted rollout line), Pi (Entry::Compaction), opencode (PartData::Compaction). Gemini is skipped — its format doesn't model context-overflow compaction.

Real-fixture capture pipeline

scripts/capture-elicit-fixtures.sh — orchestrator with per-harness driver functions, snapshot-diff session capture, fresh scratch dir per run, skip-on-missing-CLI.
docs/agents/feature-elicit.md — workflow doc with manual fallback, per-harness invocation table, completeness checklist (now includes a delegation-event row).
docs/agents/feature-elicit.prompt.txt — harness-agnostic 10-task prompt (list, write, read, edit, glob, grep, errored read, run shell, sub-agent dispatch, summarize). Single source of truth for both the doc and the script.
test-fixtures/<harness>/convo.{jsonl,json} — five captured fixtures. The Claude capture carries a real Agent delegation.

Codex translation fixes (`toolpath-codex`)

Four real bugs surfaced by the matrix and fixed in the same PR:

Empty-text assistant turn anchor. Tool calls without a preceding message attached backward to the prior assistant turn during forward parsing because the projector skipped the message line for empty-text assistants. Fix: emit a message line for any assistant turn with content.
Tool is_error round-trip for non-shell tools. Codex's wire format only carries error info via exec_command_end.exit_code. For read / write / apply_patch, the error flag was silently flipping to false. Fix: stash is_error: true in function_call_output.extra, recover it on the forward path.
event_msg.token_count emission. The projector wasn't emitting token_count event_msgs, so all assistant Turn.token_usage data dropped on a codex round-trip.
thinking: Some("") non-idempotence. Codex's forward path drops empty thinking, so the second pass saw None and the emit condition was false — turn evaporated. Fix: treat Some("") as absent for emit decisions.

Claude UI rendering through IR roundtrip

When the matrix passed but path import claude && path export claude produced a JSONL whose UI rendering was visibly broken (no diffs, no shell output, mid-conversation rendering cut off entirely), it surfaced a stack of bugs no per-crate unit test could catch — they all lived in the import/export plumbing.

Self-roundtrip preservation (Claude → IR → Claude)

reader.rs: headerless lines (ai-title, last-prompt, queue-operation, permission-mode, file-history-snapshot) used to be dropped — they have no uuid so the typed parse path skipped them. Now they're captured into convo.preamble.
provider.rs: surfaces preamble lines and message-less entries (attachments, snapshots) as ConversationEvents with structured event.data. Folds the entry's typed version/user_type/request_id into Turn.extra["claude"].
derive.rs: emits preamble lines as conversation.event steps. Emits tool-result-only user entries as tool_result_user events (preserving original UUID + toolUseResult blob + entry_extra like promptId/slug) — without this, the entry vanishes and the next assistant's parentUuid points at a UUID we never re-emit, breaking the chain Claude's TUI traverses for rendering. Tracks had_thinking_part so encrypted-reasoning blocks survive. Uses entry.parent_uuid for step parents instead of the linear last_step_id chain — the linear form skipped over attachments and tool-result events.
project.rs: routes tool_result_user events back to user entries with their original UUIDs, restores the toolUseResult blob verbatim, rebuilds the JSONL preamble from preamble events.

Cross-harness Claude UI rendering (any harness → Claude)

provider.rs / project.rs: adds provider::native_name(category, args) — the reverse of tool_category — and a canonical_claude_tool_name dispatcher in the projector. Mirrors the same pattern already on toolpath-codex / -opencode / -gemini / -pi. Claude's UI dispatches its rich result panes by literal tool name, so exec_command / read_file / write_file rendered as opaque blocks even when their toolUseResult was well-formed. Now they come through as Bash / Read / Write / Edit and the renderers fire.
project.rs: canonical_claude_tool_input translates input keys to what each Claude tool's UI reads (cmd/path/oldString → command/file_path/old_string). tool_use_result_from_invocation builds the matching display blob per-category — Bash gets {stdout, stderr, …}, Edit gets {filePath, oldString, newString, structuredPatch, …} with hunks computed via similar::TextDiff, Read gets {type: "text", file: {filePath, content, …}}, Glob/Grep get {filenames, numFiles, pattern}.
project.rs: tracks parent_rewrites so that when a synthesized tool-result entry is emitted between two assistant turns, the next turn's parentUuid points at the synthesized result instead of the prior assistant.

Codex enables extract to see its tool calls

toolpath-codex/derive.rs: was writing tool_calls (a name+status summary), but toolpath_convo::extract reads tool_uses (full id/name/input/category/result). The mismatch dropped every Codex tool call at the cache boundary — path export claude showed text only. Adds the canonical tool_uses array; keeps the legacy summary for human-readable consumers.

ConversationView events through derive_path

toolpath-convo/derive.rs + extract.rs: derive_path was emitting steps for view.turns only — anything in view.events disappeared on the IR-to-Path-to-IR trip. Now each event becomes a conversation.event step with its data flattened into structural extras. Two housekeeping keys (event_data_type, event_source_id) keep the original type field out of StructuralChange.change_type (otherwise a Codex user_message event collides with serde's #[serde(rename = "type")]).

Supporting changes

toolpath-convo: added PartialEq, Eq derives on TokenUsage so the matrix can compare token shapes structurally.
toolpath-gemini: drive-by clippy::unnecessary_map_or fix that was blocking -D warnings. Pre-existing.
toolpath-claude: adds the workspace similar dep, used by structured_patch_hunks in the projector.

Why

Per-harness round-trip tests catch projector regressions for that harness alone. They can't catch the class of bug that only surfaces when projecting across harnesses, or in the import/export plumbing rather than the projector itself — silent text drops, foreign-namespace extras leaking through serde flatten, tool-name remapping gaps, arg-key shape mismatches (camelCase ↔ snake_case), schema-required fields the projector forgot to populate, and (the load-bearing one for the Claude UI work) the derive_path → extract boundary dropping data the projector never had a chance to see.

The opencode "preparing edit…" / Zod TypeError class lived in the cross-harness gap. The Claude UI work then closed a parallel gap: tests can pass on every per-crate invariant and every cross-harness invariant and still produce a JSONL Claude Code can't render, because rendering requires fields and parent-chain shapes that no test was checking.

Verification

# Matrix tests
cargo test -p path-cli --test cross_harness_matrix

# Per-harness real-fixture and compaction roundtrips
cargo test -p toolpath-claude -p toolpath-codex -p toolpath-gemini -p toolpath-opencode -p toolpath-pi --tests

# Workspace regression sweep
cargo test --workspace

# Clippy clean (uses -D warnings)
cargo clippy --workspace --tests -- -D warnings

End-to-end UI verification (any Claude or Codex session you have on disk):

SESSION=<session uuid>
cargo run --release -p path-cli -- import claude --project "$(pwd)" --session "$SESSION" --force
cargo run --release -p path-cli -- export claude --input "claude-path-claude-${SESSION:0:8}" --project /tmp/replay
cd /tmp/replay && claude -r "$SESSION"
# Diffs, shell output, file viewer, parent-chain navigation — all should render.

# Same flow with import codex / opencode / gemini / pi — Claude UI still renders the rich panes.

To re-capture fixtures after a harness CLI version bump:

./scripts/capture-elicit-fixtures.sh             # all installed harnesses
./scripts/capture-elicit-fixtures.sh claude pi   # subset

Known limitations

One real fixture variant per harness. Compaction is now covered via synthetic fixtures, but other edge cases remain: retry parts, MCP tools, very long content, multi-segment chained sessions (Claude's JSONL rotation), unicode stress.
Cross-harness delegation survival is soft. The matrix accepts a delegation's agent_id surviving as a regular tool_use call_id rather than as a DelegatedWork entry — Codex/opencode don't model sub-agents natively.
Compaction structural metadata is documented loss. compact_boundary markers, isCompactSummary flags, and part.compaction ConversationEvents don't survive derive→extract. Surrounding messages do.
Claude UI rendering is verified manually. No headless TUI sentinel scraping in CI; the new cache_roundtrip_preserves_line_counts_per_type test catches the structural class but a pure rendering bug in Claude Code's TUI wouldn't be caught here.
Opencode schema check is JSON-self-roundtrip, not SQLite-roundtrip. SQL-leg coverage lives in toolpath-opencode/tests/projection_roundtrip.rs and compaction_roundtrip.rs.

Commits

chore(toolpath-convo): derive PartialEq+Eq on TokenUsage
chore(toolpath-gemini): use contains_key over get(..).is_none()
fix(codex): translation idempotence fixes from cross-harness matrix
feat(scripts): elicit-fixture capture pipeline
feat(path-cli): cross-harness translation matrix tests
refactor(path-cli): drop synthetic-IR matrix variant as redundant
refactor(test-fixtures): relocate to workspace-root test-fixtures/
test: add per-crate real-fixture projection roundtrips
feat(path-cli): strengthen delegation invariants in matrix
feat(scripts): extend elicit prompt with sub-agent dispatch step
test: add synthetic compaction roundtrip tests
chore(test-fixtures): refresh real-world captures
feat(toolpath-convo): preserve view.events through Path roundtrip
feat(claude): full-fidelity Claude UI rendering through IR roundtrip
test(claude): regression locks for full-session roundtrip

github-actions · 2026-05-06T20:34:50Z

🔍 Preview deployed: https://068a86aa.toolpath.pages.dev

Needed by downstream invariant tests that compare per-turn token usage across IR round-trips. The fields are all `Option<u32>`, so structural equality is well-defined.

Four bugs surfaced (and fixed) by the new cross-harness matrix tests: 1. Empty-text assistant turn message-line emission (the anchor bug). The forward path uses `response_item.message` as the turn anchor; tool calls without a preceding message attach to whatever assistant turn was last seen, even across an intervening user message. The projector previously skipped the message line for assistants whose text was empty even when they had tool calls or thinking, so those tool calls got fused backward into the prior assistant turn during forward parsing. Fix: emit a message line for any assistant turn with content (text, tool calls, or non-empty thinking). 2. Tool `is_error` round-trip for non-shell tools. Codex's wire format only carries error info via `exec_command_end.exit_code`, which only fires for shell tools. For read/write/apply_patch, the error flag was silently flipping to `false` after round-trip. Fix: stash `is_error: true` in the function_call_output (and custom_tool_call_output) `extra` map on projection, recover it on the forward path. `attach_tool_output` now takes an explicit `is_error` parameter and ORs with any prior state. 3. `event_msg.token_count` emission. The projector was never emitting token_count event_msgs, so all assistant `Turn.token_usage` data silently dropped on a codex round-trip. Fix: emit a `token_count` event_msg before each assistant message line (the forward path's `pending_token_usage` attaches to the next pushed turn). Adds `convo_usage_to_codex_json` helper mapping IR TokenUsage → codex TokenUsage JSON (input_tokens, cached_input_tokens, output_tokens). 4. `thinking: Some("")` non-idempotence. Claude's forward path produces empty-string thinking on assistant turns whose reasoning blocks have only metadata (signatures, etc.). The projector's emit-message-line condition was `turn.thinking.is_some()` which triggers on `Some("")`, so the first pass emits a message line. But codex's forward path drops empty thinking, so the second pass sees `thinking: None` and the same condition is false — turn drops out. Fix: treat `Some("")` as absent for emit decisions. Updated unit test `assistant_turn_with_function_call_and_output` to match the new line ordering (token_count now precedes message in the output stream).

Adds the workflow for capturing real-world conversation fixtures from each conversation harness on demand: - `docs/agents/feature-elicit.prompt.txt` — the harness-agnostic 9-task prompt (list, write, read, edit, glob, grep, errored read, run shell, summarize). Single source of truth for both the doc and the script. - `docs/agents/feature-elicit.md` — workflow doc covering both the automated path and the manual fallback, the per-harness invocation table (claude / codex / gemini / pi / opencode), the completeness checklist (every tool category + thinking + error result + summary), and where the resulting session files land. - `scripts/capture-elicit-fixtures.sh` — orchestrator. Per-harness driver functions, snapshot-diff session capture, fresh scratch dir per run, skip-on-missing-CLI, captured stderr surfaced on failure (not silenced). Run as `./scripts/capture-elicit-fixtures.sh` for all harnesses or pass a subset like `claude codex`. Output lands at `crates/path-cli/tests/fixtures/<harness>/convo.{jsonl,json}`. The pipeline is the input side of the cross-harness translation matrix test (next commit). Captured fixtures themselves are committed there.

Three test functions covering 60 distinct assertions across the five conversation harnesses (claude, codex, gemini, pi, opencode): - `matrix_synthetic` (25 cells): drives a controlled inline IR through every (source, target) pair. `A → IR → B → IR → B` shape — after one A→B translation, B's projector + forward must be a fixed point on the resulting IR. Strict equality on idempotence; provenance independence on the second leg. - `matrix_real_fixtures` (25 cells): same shape, but each source uses its own real-world captured session as input. Loads fixtures via the harness's own native reader (ConversationReader / RolloutReader / read_session_from_file / etc.). Catches richness-driven bugs the synthetic IR can't reach. - `matrix_schema_validation` (10 checks): for each harness × {synthetic, real fixture}, project IR → serialize native to wire format → re-parse via the harness's reader. Catches projector output that's "valid IR" but produces native bytes the reader rejects. Per-cell invariants (14 categories): Skeleton: turn count, role sequence Content: turn text (whitespace-normalized) Tool calls: call_id set, input shape (snake-case canonicalized for cross-harness `filePath` ↔ `file_path`), output content, error flag Tokens: per-turn idempotence, total_usage idempotence, survival (tokens present pre-target must survive the B leg) Reasoning: thinking idempotence + survival Other fields: model, stop_reason, parent_id graph, environment.cwd, delegation count, files_changed set Pollution: no foreign-namespace extras keys Failures aggregate per cell so a single test run reports every divergence, not just the first. Fixtures (~170KB total, captured via scripts/capture-elicit-fixtures.sh): crates/path-cli/tests/fixtures/{claude,codex,gemini,opencode,pi}/convo.{jsonl,json} Each is the harness's native session output for the 9-task elicitation prompt, machine-captured from a fresh scratch directory. Loaded through each harness's reader API; opencode's `export` JSON wrapper is converted inline to the Session struct since opencode's primary store is SQLite.

The matrix had two cell-running tests: one driven by an inline `canonical_source_view()` (synthetic IR, 6 turns) and one driven by each harness's captured real-world fixture. The real fixtures are a strict superset of the synthetic shape (19+ turns each, more tool diversity), so any cell that would fail under synthetic also fails under real fixtures. The synthetic variant added test surface without adding bug-finding coverage. Removed: - `matrix_synthetic` test - `canonical_source_view`, `user_turn`, `assistant_turn` builders - The synthetic case from `matrix_schema_validation` Tightened: `matrix_translation` (renamed from `matrix_real_fixtures`) now panics with an actionable message when a fixture is missing instead of silently skipping. Same for the schema validation test. That's the right behavior — fixtures are checked in, so missing means something is genuinely wrong. Net: 169 fewer lines, same coverage, clearer failure modes when fixtures get accidentally removed.

Move the captured real-world fixtures from crates/path-cli/tests/fixtures/ to test-fixtures/ at the workspace root. They're shared by the cross-harness matrix and (in subsequent commits) by per-crate roundtrip tests, so the workspace root is a more honest home than burying them under one consumer crate. - Renames the five fixture directories. - Updates scripts/capture-elicit-fixtures.sh to write there. - Updates fixtures_dir() in cross_harness_matrix.rs. - Updates docs/agents/feature-elicit.md path references. Picks up a stray rustfmt cleanup in toolpath-codex/src/project.rs along the way.

Each per-harness crate gets a tests/real_fixture_roundtrip.rs that loads the captured real-world session at test-fixtures/<harness>/ and runs it through the full native → IR → Path → IR → native pipeline. Asserts: - fixture loads non-empty with at least one user + assistant turn - meaningful turn count and roles preserved (system envelopes excluded) - per-turn text preserved (whitespace-normalized) - tool-call topology preserved (id sets, names, results) - per-delegation content (agent_id, prompt, child-turn count) preserved - projector output re-parses through the harness's own reader (or, for opencode where the wire is SQLite, that the projected Session serdes symmetrically as JSON) Claude additionally checks total token usage preservation when present. Complements the cross-harness matrix: each leg's per-harness self- roundtrip is now exercised on production-shape data, not just synthetic minimums.

The previous `delegations` invariant only checked total count. A projector that drops a sub-agent's child turns or scrambles its agent_ids would slip past as long as the total count of delegations stayed the same. Production sub-agent flows are too important to permit that. Strengthen the idempotence check (view_first vs view_second) to assert per-delegation content equality: agent_id set, child-turn count, and prompt text. Add a new `delegations_survive` cross-leg invariant (view_after_source vs view_first) that catches drops on the A→B translation path itself. The cross-leg check is deliberately soft: a source delegation's agent_id must remain findable in the target IR — *either* as a delegation (preferred — preserves the structural sub-agent semantics) *or* as a regular tool_use call_id (acceptable — the visible tool call is still present even if the harness can't natively model delegation). This matches the "good UX" goal: when you open a Claude session in Codex or opencode, you still see the dispatch tool call and its result; you just lose the metadata that says "this was a delegation." A flat-out drop of the call_id is the regression.

Adds step 9 to feature-elicit.prompt.txt asking the agent to dispatch a sub-agent (Claude Task / Gemini sub-agent / Codex sub-task / opencode subtask) with a small word-counting instruction. Worded harness-neutrally; harnesses without a dispatch tool are told to skip explicitly so the lack-of-delegation is itself observable. Updates the docs for the new task count and adds a "1+ delegation event" entry to the completeness checklist. Pairs with the strengthened delegation invariants in the cross-harness matrix: subsequent capture refreshes will populate delegation content in the fixtures, and the matrix will start exercising the new assertions automatically.

Real compaction events fire when the model context window fills mid-session — they can't reliably be triggered by a 5-minute capture prompt. Synthetic minimum-shape fixtures are the right tool for this regression class. One per harness whose format models compaction as a first-class concept: - Claude: pre-compact turns → compact_boundary marker → synthetic isCompactSummary user message → post-compact turns. - Codex: response_items around a `compacted` rollout line. - Pi: messages around an Entry::Compaction line (Pi's first-class compaction entry type, alongside BranchSummary). - opencode: SQL fixture with a `compaction` PartData in mid-session. Gemini is skipped — its format doesn't model context-overflow compaction (the `summary` field is a sub-agent result, different concept). Each test asserts: fixture loads without panic on the compaction line + pre-compact content survives roundtrip + post-compact content survives + projector output re-parses through the harness's own reader. Documented limitations (compact_boundary marker / isCompactSummary flag / part.compaction ConversationEvent not surviving derive→extract) are called out in module headers as acceptable losses for "good UX" today — the surrounding conversation content is preserved, which is what users actually read.

Re-ran scripts/capture-elicit-fixtures.sh against current harness versions. Notable shifts: - Claude: 39 → 80 lines, 19 → 47 turns. Captured a real `Agent` delegation (toolu_01L4FfhHHhpph7aoab3dt4bn), the first fixture to exercise the new delegation invariants. Going Claude → Codex / Claude → opencode now stresses the soft cross-leg check (delegation agent_id surviving as a delegation OR as a tool_use call_id); both pass under the soft bar. - Codex: smaller (8 turns) — `codex exec` mode lacks a sub-agent dispatch tool, so step 9 became "count the words yourself." - Gemini, opencode: similar size to before. - Pi: 22 turns, full prompt walked. Re-captured after a first run with a weak local model gave up early. All 50+ test binaries pass with the refreshed fixtures.

derive_path was emitting steps for view.turns only — anything in view.events (attachments, headerless preamble lines, provider-specific non-turn entries) silently disappeared on the IR-to-Path-to-IR trip. For sessions with rich provider extras this dropped 10–25% of the source content before the projector ever saw it. Now each event becomes a `conversation.event` step with its data flattened into structural extras. Two housekeeping keys (`event_data_type`, `event_source_id`) keep the original `type` field out of `StructuralChange`'s `change_type` slot — without them, a `type: "user_message"` event collides with serde's `#[serde(rename = "type")]` on `change_type` and `Graph::from_json` fails to disambiguate `PathOrRef`. extract_conversation strips both keys back out and restores the `type` value into event.data so the data round-trips clean. Also tightens the ToolResult.content docstring to clarify it's the model-visible text (paired with provider-specific display blobs that live in event.data on the appropriate harness).

Make `path import claude … && path export claude …` produce a JSONL that resumes in Claude Code with all the original UI affordances (diff view, shell output panel, file viewer, parent-chain navigation), and make the same projection work for Codex/Gemini/opencode/Pi sources that come through with non-Claude tool names. Previously, opening a roundtripped session showed text-only assistant turns: every diff and shell output was either dropped or rendered as an opaque block. Every fix below is a layer of that. ## Self-roundtrip preservation (Claude → IR → Claude) * **reader.rs**: headerless lines (ai-title, last-prompt, queue-operation, permission-mode, file-history-snapshot) used to be dropped — they have no uuid so the typed parse path skipped them. Now they're captured into `convo.preamble` as raw JSON values. * **provider.rs**: surfaces preamble lines and message-less entries (attachments, snapshots) as `ConversationEvent`s with structured `event.data` keyed by their original fields. Folds the entry's typed `version`/`user_type`/`request_id` into `Turn.extra["claude"]` so projection can restore the request-correlation metadata. * **derive.rs**: emits preamble lines as `conversation.event` steps so they survive Path roundtrip. Emits tool-result-only user entries as `tool_result_user` events (preserving original UUID + toolUseResult blob + entry_extra like promptId/slug) — without this, the entry vanishes and the next assistant's parentUuid points at a UUID we never re-emit, breaking the chain Claude's TUI traverses for rendering. Tracks `had_thinking_part` so encrypted-reasoning blocks (text-empty thinking with a signature) survive instead of being filtered by the empty-content skip. Uses `entry.parent_uuid` for step parents instead of the linear `last_step_id` chain — the linear form skipped over attachments and tool-result events, so subsequent assistant turns ended up with parent pointers that bypassed the entries they actually responded to. * **project.rs**: routes `tool_result_user` events back to user entries with their original UUIDs, restores the `toolUseResult` blob verbatim, and rebuilds the JSONL preamble from preamble events. Falls back to per-tool-use synthesis only when no preserved event exists (cross-harness sources). ## Cross-harness Claude UI rendering (any harness → Claude) * **provider.rs / project.rs**: adds `provider::native_name(category, args)` — the reverse of `tool_category` — and a `canonical_claude_tool_name` dispatcher in the projector. Mirrors the same pattern already on toolpath-codex / toolpath-opencode / toolpath-gemini / toolpath-pi. Claude's UI dispatches its rich result panes by literal tool name, so `exec_command` / `read_file` / `write_file` / `apply_patch` rendered as opaque blocks even when their toolUseResult was well-formed. Now they come through as `Bash` / `Read` / `Write` / `Edit` and the renderers fire. * **project.rs**: `canonical_claude_tool_input` translates input keys to what each Claude tool's UI reads (`cmd`/`path`/`oldString` → `command`/`file_path`/`old_string`). `tool_use_result_from_invocation` builds the matching display blob per-category — Bash gets `{stdout, stderr, …}`, Edit gets `{filePath, oldString, newString, structuredPatch, …}` with hunks computed via `similar::TextDiff`, Read gets `{type: "text", file: {filePath, content, …}}`, Glob/Grep get `{filenames, numFiles, pattern}`. * **project.rs**: tracks `parent_rewrites` so that when a synthesized tool-result entry is emitted between two assistant turns, the next turn's parentUuid points at the synthesized result instead of the prior assistant. Without this the synthesized result is orphaned. ## Codex enables extract to see its tool calls * **toolpath-codex/derive.rs**: was writing `tool_calls` (a name+status summary), but `toolpath_convo::extract` reads `tool_uses` (full id/name/input/category/result). The mismatch dropped every Codex tool call at the cache boundary — `path export claude` showed text only. Adds the canonical `tool_uses` array; keeps the legacy `tool_calls` summary for human-readable consumers. ## Dep * **toolpath-claude/Cargo.toml**: adds the workspace `similar` dep, used by `structured_patch_hunks` in the projector.

Three tests, all on the captured real fixture: * `read_then_project_preserves_line_count` — bluntest possible UX-loss check: source.lines() == projected.preamble + projected.entries. Catches silent drops at the reader, IR-conversion, or projection step. * `read_then_project_preserves_metadata_entries` — attachments and preamble counts match source exactly. Locks the headerless-line and message-less-entry preservation. * `read_then_project_preserves_tool_use_result_count` — every source entry with a `toolUseResult` blob still has one after roundtrip. Claude's UI uses this field for diff/shell-output rendering, so a drop here is a visible regression on resume. * `cache_roundtrip_preserves_line_counts_per_type` — exercises the full path-cli flow: source JSONL → toolpath_claude::derive_path → cached Path JSON → toolpath_convo::extract → ClaudeProjector → JSONL. Asserts every entry type that appeared in source also appears in the projection. Catches whole-category drops (ai-title / last-prompt / file-history-snapshot regressions specifically).

pnpm/action-setup@v4 with version: latest started resolving to pnpm 11 after 11.0.0 went stable (2026-04-28); pnpm 11 require()s node:sqlite, which needs Node >= 22.5, but the deploy job pins Node 20. Pin pnpm to 10 (compatible with the existing lockfileVersion 9.0).

benbaarber · 2026-05-11T15:40:34Z

https://pathbase-dev.fly.dev/u/dev/ben/paths/claude-23d4ee65

akesling · 2026-05-11T19:02:35Z

+    "last-prompt",
+    "queue-operation",
+    "permission-mode",
+    "file-history-snapshot",


What happens when Claude adds a new preamble event type? Do we include it in the Toolpath?

we no longer track these, new preamble types would be treated the same

…s lines verbatim Headerless JSONL lines (ai-title, last-prompt, queue-operation, permission-mode, file-history-snapshot, and anything unrecognized) are no longer routed by an enumerated type list. conversation_to_view and derive::derive_path now stash the whole line verbatim under ConversationEvent.data["raw"] / the conversation.event step's extra["raw"]; project_view identifies a headerless event by that "raw" key and dumps the line straight back onto convo.preamble. An unrecognized headerless line now round-trips instead of being mangled through project_event into a malformed entry. The permission-mode fallback (synthesized when the path carried no real preamble) is unchanged. Behavior of attachments / message-less entries (entry_to_event + project_event) and tool-result-only user entries (tool_result_user events + tool_result_event_to_entry, with the parent_rewrites chain-patching) is untouched.

The codex derive shouldn't reach across to a sibling crate's projector in a doc comment. Reword to describe the generic extract -> ConversationView round-trip instead.

akesling

LGTM

benbaarber requested a review from akesling May 7, 2026 18:52

benbaarber assigned akesling May 7, 2026

benbaarber added 15 commits May 11, 2026 11:12

chore(toolpath-convo): derive PartialEq+Eq on TokenUsage

1d3f979

Needed by downstream invariant tests that compare per-turn token usage across IR round-trips. The fields are all `Option<u32>`, so structural equality is well-defined.

ben-emp force-pushed the ben/cross-harness-matrix branch from 1c186a9 to 9ce0d66 Compare May 11, 2026 15:13

akesling reviewed May 11, 2026

View reviewed changes

ben-emp force-pushed the ben/cross-harness-matrix branch from 1e49061 to 386c530 Compare May 12, 2026 21:13

akesling reviewed May 13, 2026

View reviewed changes

Comment thread crates/toolpath-claude/src/derive.rs

Comment thread crates/toolpath-claude/src/derive.rs

Comment thread crates/toolpath-codex/src/derive.rs Outdated

docs(codex): drop stray ClaudeProjector reference in tool_uses comment

76691de

The codex derive shouldn't reach across to a sibling crate's projector in a doc comment. Reword to describe the generic extract -> ConversationView round-trip instead.

ben-emp force-pushed the ben/cross-harness-matrix branch from e0aa776 to 76691de Compare May 13, 2026 15:27

akesling approved these changes May 13, 2026

View reviewed changes

benbaarber merged commit 34e0def into main May 13, 2026
4 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(path-cli): cross-harness translation matrix + codex idempotence fixes#81

feat(path-cli): cross-harness translation matrix + codex idempotence fixes#81
benbaarber merged 17 commits into
mainfrom
ben/cross-harness-matrix

ben-emp commented May 6, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented May 6, 2026 •

edited

Loading

Uh oh!

benbaarber commented May 11, 2026

Uh oh!

Uh oh!

akesling May 11, 2026

Uh oh!

benbaarber May 13, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

akesling left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

ben-emp commented May 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What's in here

Matrix infrastructure (crates/path-cli/tests/cross_harness_matrix.rs)

Per-crate real-fixture roundtrips

Synthetic compaction roundtrips

Real-fixture capture pipeline

Codex translation fixes (toolpath-codex)

Claude UI rendering through IR roundtrip

Supporting changes

Why

Verification

Known limitations

Commits

Uh oh!

github-actions Bot commented May 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

benbaarber commented May 11, 2026

Uh oh!

Uh oh!

akesling May 11, 2026

Choose a reason for hiding this comment

Uh oh!

benbaarber May 13, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

akesling left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

ben-emp commented May 6, 2026 •

edited

Loading

Matrix infrastructure (`crates/path-cli/tests/cross_harness_matrix.rs`)

Codex translation fixes (`toolpath-codex`)

github-actions Bot commented May 6, 2026 •

edited

Loading