feat(path-cli): cross-harness translation matrix + codex idempotence fixes #81
Merged
🔍 Preview deployed: https://068a86aa.toolpath.pages.dev
Derive `PartialEq` and `Eq` on `TokenUsage`. Needed by downstream invariant tests that compare per-turn token usage across IR round-trips. The fields are all `Option<u32>`, so structural equality is well-defined.
Four bugs surfaced (and fixed) by the new cross-harness matrix tests:
1. Empty-text assistant turn message-line emission (the anchor bug).
The forward path uses `response_item.message` as the turn anchor;
tool calls without a preceding message attach to whatever assistant
turn was last seen, even across an intervening user message. The
projector previously skipped the message line for assistants whose
text was empty even when they had tool calls or thinking, so those
tool calls got fused backward into the prior assistant turn during
forward parsing. Fix: emit a message line for any assistant turn
with content (text, tool calls, or non-empty thinking).
2. Tool `is_error` round-trip for non-shell tools. Codex's wire format
only carries error info via `exec_command_end.exit_code`, which
only fires for shell tools. For read/write/apply_patch, the error
flag was silently flipping to `false` after round-trip. Fix: stash
`is_error: true` in the function_call_output (and
custom_tool_call_output) `extra` map on projection, recover it on
the forward path. `attach_tool_output` now takes an explicit
`is_error` parameter and ORs with any prior state.
3. `event_msg.token_count` emission. The projector was never emitting
token_count event_msgs, so all assistant `Turn.token_usage` data
silently dropped on a codex round-trip. Fix: emit a `token_count`
event_msg before each assistant message line (the forward path's
`pending_token_usage` attaches to the next pushed turn). Adds
`convo_usage_to_codex_json` helper mapping IR TokenUsage → codex
TokenUsage JSON (input_tokens, cached_input_tokens, output_tokens).
4. `thinking: Some("")` non-idempotence. Claude's forward path
produces empty-string thinking on assistant turns whose reasoning
blocks have only metadata (signatures, etc.). The projector's
emit-message-line condition was `turn.thinking.is_some()` which
triggers on `Some("")`, so the first pass emits a message line.
But codex's forward path drops empty thinking, so the second pass
sees `thinking: None` and the same condition is false — turn drops
out. Fix: treat `Some("")` as absent for emit decisions.
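
The emit decision from items 1 and 4 can be sketched as a single predicate. This is a minimal illustration, not the projector's code: `AssistantTurn` and `should_emit_message_line` are hypothetical names, and the real IR turn carries more fields.

```rust
/// Hypothetical, trimmed-down stand-in for an assistant turn in the IR.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct AssistantTurn {
    pub text: String,
    pub tool_calls: Vec<String>,
    pub thinking: Option<String>,
}

/// `Some("")` counts as absent, so the answer is the same before and
/// after a forward pass that drops empty thinking.
fn has_thinking(turn: &AssistantTurn) -> bool {
    turn.thinking.as_deref().is_some_and(|t| !t.is_empty())
}

/// Emit a message line for any assistant turn with content:
/// text, tool calls, or non-empty thinking.
pub fn should_emit_message_line(turn: &AssistantTurn) -> bool {
    !turn.text.is_empty() || !turn.tool_calls.is_empty() || has_thinking(turn)
}
```

Because `Some("")` and `None` give the same answer, the first and second projection passes agree on whether the turn gets an anchor line.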
Updated unit test `assistant_turn_with_function_call_and_output` to
match the new line ordering (token_count now precedes message in the
output stream).
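
The `is_error` fix from item 2 can be sketched as a stash/recover pair plus an OR on attach. This is a sketch under assumptions: the real `extra` map holds JSON values rather than strings, and the `attach_tool_output` signature shown here is illustrative, not the crate's.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for a tool invocation's result state in the IR.
#[derive(Debug, Default, PartialEq)]
pub struct ToolResult {
    pub content: String,
    pub is_error: bool,
}

/// Projection side: stash the flag in the output line's `extra` map,
/// but only when it is set, so clean results stay unchanged on the wire.
pub fn stash_is_error(extra: &mut BTreeMap<String, String>, is_error: bool) {
    if is_error {
        extra.insert("is_error".to_string(), "true".to_string());
    }
}

/// Forward side: recover the stashed flag; absent means "no error".
pub fn recover_is_error(extra: &BTreeMap<String, String>) -> bool {
    extra.get("is_error").map(String::as_str) == Some("true")
}

/// OR the incoming flag with any prior state, so an error already
/// recorded for the call is never cleared by a later output line.
pub fn attach_tool_output(result: &mut ToolResult, content: &str, is_error: bool) {
    result.content = content.to_string();
    result.is_error = result.is_error || is_error;
}
```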
Adds the workflow for capturing real-world conversation fixtures from
each conversation harness on demand:
- `docs/agents/feature-elicit.prompt.txt` — the harness-agnostic 9-task
prompt (list, write, read, edit, glob, grep, errored read, run shell,
summarize). Single source of truth for both the doc and the script.
- `docs/agents/feature-elicit.md` — workflow doc covering both the
automated path and the manual fallback, the per-harness invocation
table (claude / codex / gemini / pi / opencode), the completeness
checklist (every tool category + thinking + error result + summary),
and where the resulting session files land.
- `scripts/capture-elicit-fixtures.sh` — orchestrator. Per-harness
driver functions, snapshot-diff session capture, fresh scratch dir
per run, skip-on-missing-CLI, captured stderr surfaced on failure
(not silenced). Run as `./scripts/capture-elicit-fixtures.sh` for
all harnesses or pass a subset like `claude codex`. Output lands at
`crates/path-cli/tests/fixtures/<harness>/convo.{jsonl,json}`.
The pipeline is the input side of the cross-harness translation matrix
test (next commit). Captured fixtures themselves are committed there.
Three test functions covering 60 distinct assertions across the five
conversation harnesses (claude, codex, gemini, pi, opencode):
- `matrix_synthetic` (25 cells): drives a controlled inline IR through
every (source, target) pair. `A → IR → B → IR → B` shape — after one
A→B translation, B's projector + forward must be a fixed point on
the resulting IR. Strict equality on idempotence; provenance
independence on the second leg.
- `matrix_real_fixtures` (25 cells): same shape, but each source uses
its own real-world captured session as input. Loads fixtures via the
harness's own native reader (ConversationReader / RolloutReader /
read_session_from_file / etc.). Catches richness-driven bugs the
synthetic IR can't reach.
- `matrix_schema_validation` (10 checks): for each harness × {synthetic,
real fixture}, project IR → serialize native to wire format → re-parse
via the harness's reader. Catches projector output that's "valid IR"
but produces native bytes the reader rejects.
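
The fixed-point property behind each cell can be sketched with a toy harness. Everything here is illustrative: the `Harness` trait, `LineHarness`, and `run_cell` are hypothetical names, and the toy IR is just a list of turn texts.

```rust
/// Toy IR: the ordered turn texts only.
pub type Ir = Vec<String>;

/// Hypothetical harness interface: `project` writes the native format,
/// `forward` reads it back into the IR.
pub trait Harness {
    fn project(&self, ir: &Ir) -> String;
    fn forward(&self, native: &str) -> Ir;
}

/// Toy lossy target: one turn per line, cannot represent empty turns.
pub struct LineHarness;

impl Harness for LineHarness {
    fn project(&self, ir: &Ir) -> String {
        ir.iter()
            .filter(|t| !t.is_empty())
            .map(|t| t.as_str())
            .collect::<Vec<_>>()
            .join("\n")
    }

    fn forward(&self, native: &str) -> Ir {
        native
            .lines()
            .filter(|l| !l.is_empty())
            .map(str::to_string)
            .collect()
    }
}

/// Runs the two B legs of a cell and returns (view_first, view_second).
/// The first leg may lose data; the second must not change anything.
pub fn run_cell(target: &dyn Harness, source_ir: &Ir) -> (Ir, Ir) {
    let view_first = target.forward(&target.project(source_ir));
    let view_second = target.forward(&target.project(&view_first));
    (view_first, view_second)
}
```

The idempotence check is then strict equality between the two returned views; losses on the first leg are judged separately by the survival invariants.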
Per-cell invariants (14 categories):
Skeleton: turn count, role sequence
Content: turn text (whitespace-normalized)
Tool calls: call_id set, input shape (snake-case canonicalized for
cross-harness `filePath` ↔ `file_path`), output content, error flag
Tokens: per-turn idempotence, total_usage idempotence, survival
(tokens present pre-target must survive the B leg)
Reasoning: thinking idempotence + survival
Other fields: model, stop_reason, parent_id graph, environment.cwd,
delegation count, files_changed set
Pollution: no foreign-namespace extras keys
Failures aggregate per cell so a single test run reports every
divergence, not just the first.
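
Aggregating failures per cell might look like the sketch below, which covers only the skeleton and content invariants. `Turn` and `check_cell` are hypothetical names, not the test file's own.

```rust
/// Hypothetical, minimal view of one turn for comparison purposes.
#[derive(Debug, Clone, PartialEq)]
pub struct Turn {
    pub role: String,
    pub text: String,
}

/// Collapse runs of whitespace so formatting-only differences pass.
fn normalize_ws(s: &str) -> String {
    s.split_whitespace().collect::<Vec<_>>().join(" ")
}

/// Returns every divergence found instead of stopping at the first one.
pub fn check_cell(first: &[Turn], second: &[Turn]) -> Vec<String> {
    let mut failures = Vec::new();
    if first.len() != second.len() {
        failures.push(format!("turn count: {} vs {}", first.len(), second.len()));
    }
    for (i, (a, b)) in first.iter().zip(second).enumerate() {
        if a.role != b.role {
            failures.push(format!("turn {i}: role {:?} vs {:?}", a.role, b.role));
        }
        if normalize_ws(&a.text) != normalize_ws(&b.text) {
            failures.push(format!("turn {i}: text differs"));
        }
    }
    failures
}
```

A test would assert that the returned list is empty and print the whole list on failure.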
Fixtures (~170KB total, captured via scripts/capture-elicit-fixtures.sh):
crates/path-cli/tests/fixtures/{claude,codex,gemini,opencode,pi}/convo.{jsonl,json}
Each is the harness's native session output for the 9-task elicitation
prompt, machine-captured from a fresh scratch directory. Loaded through
each harness's reader API; opencode's `export` JSON wrapper is converted
inline to the Session struct since opencode's primary store is SQLite.
The matrix had two cell-running tests: one driven by an inline `canonical_source_view()` (synthetic IR, 6 turns) and one driven by each harness's captured real-world fixture. The real fixtures are a strict superset of the synthetic shape (19+ turns each, more tool diversity), so any cell that would fail under synthetic also fails under real fixtures. The synthetic variant added test surface without adding bug-finding coverage.
Removed:
- `matrix_synthetic` test
- `canonical_source_view`, `user_turn`, `assistant_turn` builders
- The synthetic case from `matrix_schema_validation`
Tightened: `matrix_translation` (renamed from `matrix_real_fixtures`) now panics with an actionable message when a fixture is missing instead of silently skipping. Same for the schema validation test. That's the right behavior: fixtures are checked in, so missing means something is genuinely wrong.
Net: 169 fewer lines, same coverage, clearer failure modes when fixtures get accidentally removed.
Move the captured real-world fixtures from crates/path-cli/tests/fixtures/ to test-fixtures/ at the workspace root. They're shared by the cross-harness matrix and (in subsequent commits) by per-crate roundtrip tests, so the workspace root is a more honest home than burying them under one consumer crate.
- Renames the five fixture directories.
- Updates scripts/capture-elicit-fixtures.sh to write there.
- Updates fixtures_dir() in cross_harness_matrix.rs.
- Updates docs/agents/feature-elicit.md path references.
Picks up a stray rustfmt cleanup in toolpath-codex/src/project.rs along the way.
Each per-harness crate gets a tests/real_fixture_roundtrip.rs that loads the captured real-world session at test-fixtures/<harness>/ and runs it through the full native → IR → Path → IR → native pipeline. Asserts:
- fixture loads non-empty with at least one user + assistant turn
- meaningful turn count and roles preserved (system envelopes excluded)
- per-turn text preserved (whitespace-normalized)
- tool-call topology preserved (id sets, names, results)
- per-delegation content (agent_id, prompt, child-turn count) preserved
- projector output re-parses through the harness's own reader (or, for opencode where the wire is SQLite, that the projected Session serdes symmetrically as JSON)
Claude additionally checks total token usage preservation when present.
Complements the cross-harness matrix: each leg's per-harness self-roundtrip is now exercised on production-shape data, not just synthetic minimums.
The previous `delegations` invariant only checked total count. A projector that drops a sub-agent's child turns or scrambles its agent_ids would slip past as long as the total count of delegations stayed the same. Production sub-agent flows are too important to permit that.
Strengthen the idempotence check (view_first vs view_second) to assert per-delegation content equality: agent_id set, child-turn count, and prompt text. Add a new `delegations_survive` cross-leg invariant (view_after_source vs view_first) that catches drops on the A→B translation path itself.
The cross-leg check is deliberately soft: a source delegation's agent_id must remain findable in the target IR, *either* as a delegation (preferred: preserves the structural sub-agent semantics) *or* as a regular tool_use call_id (acceptable: the visible tool call is still present even if the harness can't natively model delegation). This matches the "good UX" goal: when you open a Claude session in Codex or opencode, you still see the dispatch tool call and its result; you just lose the metadata that says "this was a delegation." A flat-out drop of the call_id is the regression.
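
The soft cross-leg rule can be sketched as a set lookup. This is an illustration only: `View` and `dropped_delegations` are hypothetical names, and the real IR view holds structured delegations and tool calls rather than bare id lists.

```rust
use std::collections::HashSet;

/// Hypothetical, flattened view of the ids a conversation exposes.
#[derive(Debug, Default)]
pub struct View {
    pub delegation_agent_ids: Vec<String>,
    pub tool_call_ids: Vec<String>,
}

/// Returns the source delegation ids that are findable in the target
/// neither as a delegation nor as a plain tool_use call_id.
pub fn dropped_delegations(source: &View, target: &View) -> Vec<String> {
    let findable: HashSet<&str> = target
        .delegation_agent_ids
        .iter()
        .chain(target.tool_call_ids.iter())
        .map(String::as_str)
        .collect();
    source
        .delegation_agent_ids
        .iter()
        .filter(|id| !findable.contains(id.as_str()))
        .cloned()
        .collect()
}
```

An empty result means the soft bar is met; any id returned is the flat-out drop the invariant treats as a regression.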
Adds step 9 to feature-elicit.prompt.txt asking the agent to dispatch a sub-agent (Claude Task / Gemini sub-agent / Codex sub-task / opencode subtask) with a small word-counting instruction. Worded harness-neutrally; harnesses without a dispatch tool are told to skip explicitly so the lack of delegation is itself observable.
Updates the docs for the new task count and adds a "1+ delegation event" entry to the completeness checklist.
Pairs with the strengthened delegation invariants in the cross-harness matrix: subsequent capture refreshes will populate delegation content in the fixtures, and the matrix will start exercising the new assertions automatically.
Real compaction events fire when the model context window fills mid-session; they can't reliably be triggered by a 5-minute capture prompt. Synthetic minimum-shape fixtures are the right tool for this regression class. One per harness whose format models compaction as a first-class concept:
- Claude: pre-compact turns → compact_boundary marker → synthetic isCompactSummary user message → post-compact turns.
- Codex: response_items around a `compacted` rollout line.
- Pi: messages around an Entry::Compaction line (Pi's first-class compaction entry type, alongside BranchSummary).
- opencode: SQL fixture with a `compaction` PartData in mid-session.
Gemini is skipped: its format doesn't model context-overflow compaction (the `summary` field is a sub-agent result, a different concept).
Each test asserts: fixture loads without panic on the compaction line + pre-compact content survives roundtrip + post-compact content survives + projector output re-parses through the harness's own reader.
Documented limitations (compact_boundary marker / isCompactSummary flag / part.compaction ConversationEvent not surviving derive→extract) are called out in module headers as acceptable losses for "good UX" today: the surrounding conversation content is preserved, which is what users actually read.
Re-ran scripts/capture-elicit-fixtures.sh against current harness versions. Notable shifts:
- Claude: 39 → 80 lines, 19 → 47 turns. Captured a real `Agent` delegation (toolu_01L4FfhHHhpph7aoab3dt4bn), the first fixture to exercise the new delegation invariants. Going Claude → Codex / Claude → opencode now stresses the soft cross-leg check (delegation agent_id surviving as a delegation OR as a tool_use call_id); both pass under the soft bar.
- Codex: smaller (8 turns). `codex exec` mode lacks a sub-agent dispatch tool, so step 9 became "count the words yourself."
- Gemini, opencode: similar size to before.
- Pi: 22 turns, full prompt walked. Re-captured after a first run with a weak local model gave up early.
All 50+ test binaries pass with the refreshed fixtures.
derive_path was emitting steps for view.turns only. Anything in view.events (attachments, headerless preamble lines, provider-specific non-turn entries) silently disappeared on the IR-to-Path-to-IR trip. For sessions with rich provider extras this dropped 10–25% of the source content before the projector ever saw it.
Now each event becomes a `conversation.event` step with its data flattened into structural extras. Two housekeeping keys (`event_data_type`, `event_source_id`) keep the original `type` field out of `StructuralChange`'s `change_type` slot. Without them, a `type: "user_message"` event collides with serde's `#[serde(rename = "type")]` on `change_type` and `Graph::from_json` fails to disambiguate `PathOrRef`. extract_conversation strips both keys back out and restores the `type` value into event.data so the data round-trips clean.
Also tightens the ToolResult.content docstring to clarify it's the model-visible text (paired with provider-specific display blobs that live in event.data on the appropriate harness).
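
The key shuffle around `type` can be sketched with plain string maps. This is a simplified illustration: the real extras hold JSON values, `flatten_event` and `restore_event` are hypothetical names, and what `event_source_id` carries is assumed here to be the event's id.

```rust
use std::collections::BTreeMap;

/// Simplified extras map; the real one holds JSON values.
pub type Extras = BTreeMap<String, String>;

const EVENT_DATA_TYPE: &str = "event_data_type";
const EVENT_SOURCE_ID: &str = "event_source_id";

/// derive side: move `type` out of the way so it cannot collide with
/// the step's own `type` field, and record the (assumed) source id.
pub fn flatten_event(source_id: &str, data: &Extras) -> Extras {
    let mut extras = data.clone();
    if let Some(ty) = extras.remove("type") {
        extras.insert(EVENT_DATA_TYPE.to_string(), ty);
    }
    extras.insert(EVENT_SOURCE_ID.to_string(), source_id.to_string());
    extras
}

/// extract side: strip both housekeeping keys and put `type` back.
pub fn restore_event(extras: &Extras) -> (Option<String>, Extras) {
    let mut data = extras.clone();
    let source_id = data.remove(EVENT_SOURCE_ID);
    if let Some(ty) = data.remove(EVENT_DATA_TYPE) {
        data.insert("type".to_string(), ty);
    }
    (source_id, data)
}
```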
Make `path import claude … && path export claude …` produce a JSONL
that resumes in Claude Code with all the original UI affordances
(diff view, shell output panel, file viewer, parent-chain navigation),
and make the same projection work for Codex/Gemini/opencode/Pi sources
that come through with non-Claude tool names.
Previously, opening a roundtripped session showed text-only assistant
turns: every diff and shell output was either dropped or rendered as
an opaque block. Every fix below is a layer of that.
## Self-roundtrip preservation (Claude → IR → Claude)
* **reader.rs**: headerless lines (ai-title, last-prompt,
queue-operation, permission-mode, file-history-snapshot) used to be
dropped — they have no uuid so the typed parse path skipped them.
Now they're captured into `convo.preamble` as raw JSON values.
* **provider.rs**: surfaces preamble lines and message-less entries
(attachments, snapshots) as `ConversationEvent`s with structured
`event.data` keyed by their original fields. Folds the entry's
typed `version`/`user_type`/`request_id` into `Turn.extra["claude"]`
so projection can restore the request-correlation metadata.
* **derive.rs**: emits preamble lines as `conversation.event` steps so
they survive Path roundtrip. Emits tool-result-only user entries as
`tool_result_user` events (preserving original UUID + toolUseResult
blob + entry_extra like promptId/slug) — without this, the entry
vanishes and the next assistant's parentUuid points at a UUID we
never re-emit, breaking the chain Claude's TUI traverses for
rendering. Tracks `had_thinking_part` so encrypted-reasoning blocks
(text-empty thinking with a signature) survive instead of being
filtered by the empty-content skip. Uses `entry.parent_uuid` for
step parents instead of the linear `last_step_id` chain — the linear
form skipped over attachments and tool-result events, so subsequent
assistant turns ended up with parent pointers that bypassed the
entries they actually responded to.
* **project.rs**: routes `tool_result_user` events back to user entries
with their original UUIDs, restores the `toolUseResult` blob
verbatim, and rebuilds the JSONL preamble from preamble events.
Falls back to per-tool-use synthesis only when no preserved event
exists (cross-harness sources).
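
The parent-chain breakage described for `derive.rs` can be checked mechanically. The sketch below is a hypothetical helper, not part of the crate: `Entry` is a two-field stand-in for a JSONL entry, and `dangling_parents` lists every parentUuid that points at a UUID the output never emits.

```rust
use std::collections::HashSet;

/// Hypothetical two-field stand-in for a projected JSONL entry.
pub struct Entry {
    pub uuid: String,
    pub parent_uuid: Option<String>,
}

/// Returns every parent pointer that references a uuid missing from
/// the emitted entries, i.e. a break in the chain the TUI traverses.
pub fn dangling_parents(entries: &[Entry]) -> Vec<&str> {
    let emitted: HashSet<&str> = entries.iter().map(|e| e.uuid.as_str()).collect();
    entries
        .iter()
        .filter_map(|e| e.parent_uuid.as_deref())
        .filter(|p| !emitted.contains(p))
        .collect()
}
```

If a tool-result-only user entry is dropped, the following assistant entry shows up in this list.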
## Cross-harness Claude UI rendering (any harness → Claude)
* **provider.rs / project.rs**: adds `provider::native_name(category, args)`
— the reverse of `tool_category` — and a `canonical_claude_tool_name`
dispatcher in the projector. Mirrors the same pattern already on
toolpath-codex / toolpath-opencode / toolpath-gemini / toolpath-pi.
Claude's UI dispatches its rich result panes by literal tool name,
so `exec_command` / `read_file` / `write_file` / `apply_patch`
rendered as opaque blocks even when their toolUseResult was
well-formed. Now they come through as `Bash` / `Read` / `Write` /
`Edit` and the renderers fire.
* **project.rs**: `canonical_claude_tool_input` translates input keys
to what each Claude tool's UI reads (`cmd`/`path`/`oldString` →
`command`/`file_path`/`old_string`). `tool_use_result_from_invocation`
builds the matching display blob per-category — Bash gets
`{stdout, stderr, …}`, Edit gets `{filePath, oldString, newString,
structuredPatch, …}` with hunks computed via `similar::TextDiff`,
Read gets `{type: "text", file: {filePath, content, …}}`, Glob/Grep
get `{filenames, numFiles, pattern}`.
* **project.rs**: tracks `parent_rewrites` so that when a synthesized
tool-result entry is emitted between two assistant turns, the next
turn's parentUuid points at the synthesized result instead of the
prior assistant. Without this the synthesized result is orphaned.
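
The name and key translation can be sketched as two lookup tables. This is a simplification under assumptions: the real dispatcher goes through tool categories via `provider::native_name(category, args)`, whereas this sketch maps foreign names directly, pairing the names and keys in the order the text lists them. The function bodies are illustrative, not the projector's.

```rust
/// Map a foreign tool name to the literal name Claude's UI dispatches on.
/// Unknown names pass through unchanged.
pub fn canonical_claude_tool_name(name: &str) -> &str {
    match name {
        "exec_command" => "Bash",
        "read_file" => "Read",
        "write_file" => "Write",
        "apply_patch" => "Edit",
        other => other,
    }
}

/// Map a foreign input key to the key Claude's tool UI reads.
pub fn canonical_claude_input_key(key: &str) -> &str {
    match key {
        "cmd" => "command",
        "path" => "file_path",
        "oldString" => "old_string",
        other => other,
    }
}

/// Rewrite every key of a flat input; values are left untouched.
pub fn canonical_claude_tool_input(input: &[(String, String)]) -> Vec<(String, String)> {
    input
        .iter()
        .map(|(k, v)| (canonical_claude_input_key(k).to_string(), v.clone()))
        .collect()
}
```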
## Codex enables extract to see its tool calls
* **toolpath-codex/derive.rs**: was writing `tool_calls` (a name+status
summary), but `toolpath_convo::extract` reads `tool_uses` (full
id/name/input/category/result). The mismatch dropped every Codex
tool call at the cache boundary — `path export claude` showed text
only. Adds the canonical `tool_uses` array; keeps the legacy
`tool_calls` summary for human-readable consumers.
## Dep
* **toolpath-claude/Cargo.toml**: adds the workspace `similar` dep,
used by `structured_patch_hunks` in the projector.
Four tests, all on the captured real fixture:
* `read_then_project_preserves_line_count`: bluntest possible UX-loss check: source.lines() == projected.preamble + projected.entries. Catches silent drops at the reader, IR-conversion, or projection step.
* `read_then_project_preserves_metadata_entries`: attachments and preamble counts match source exactly. Locks the headerless-line and message-less-entry preservation.
* `read_then_project_preserves_tool_use_result_count`: every source entry with a `toolUseResult` blob still has one after roundtrip. Claude's UI uses this field for diff/shell-output rendering, so a drop here is a visible regression on resume.
* `cache_roundtrip_preserves_line_counts_per_type`: exercises the full path-cli flow: source JSONL → toolpath_claude::derive_path → cached Path JSON → toolpath_convo::extract → ClaudeProjector → JSONL. Asserts every entry type that appeared in source also appears in the projection. Catches whole-category drops (ai-title / last-prompt / file-history-snapshot regressions specifically).
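
The whole-category check can be sketched over pre-extracted `type` labels. The helpers below are hypothetical and skip JSON parsing entirely; they only show the comparison the last test performs.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Count how many lines carry each `type` label.
pub fn count_by_type<'a>(types: &[&'a str]) -> BTreeMap<&'a str, usize> {
    let mut counts = BTreeMap::new();
    for ty in types {
        *counts.entry(*ty).or_insert(0) += 1;
    }
    counts
}

/// Entry types present in the source but absent from the projection,
/// in sorted order. A non-empty result is a whole-category drop.
pub fn missing_types<'a>(source: &[&'a str], projected: &[&str]) -> Vec<&'a str> {
    let present: BTreeSet<&str> = projected.iter().copied().collect();
    count_by_type(source)
        .into_keys()
        .filter(|ty| !present.contains(*ty))
        .collect()
}
```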
pnpm/action-setup@v4 with version: latest started resolving to pnpm 11 after 11.0.0 went stable (2026-04-28); pnpm 11 require()s node:sqlite, which needs Node >= 22.5, but the deploy job pins Node 20. Pin pnpm to 10 (compatible with the existing lockfileVersion 9.0).
1c186a9 to 9ce0d66
Collaborator
akesling
reviewed
May 11, 2026
| "last-prompt", | ||
| "queue-operation", | ||
| "permission-mode", | ||
| "file-history-snapshot", |
Contributor
What happens when Claude adds a new preamble event type? Do we include it in the Toolpath?
Collaborator
we no longer track these, new preamble types would be treated the same
…s lines verbatim Headerless JSONL lines (ai-title, last-prompt, queue-operation, permission-mode, file-history-snapshot, and anything unrecognized) are no longer routed by an enumerated type list. conversation_to_view and derive::derive_path now stash the whole line verbatim under ConversationEvent.data["raw"] / the conversation.event step's extra["raw"]; project_view identifies a headerless event by that "raw" key and dumps the line straight back onto convo.preamble. An unrecognized headerless line now round-trips instead of being mangled through project_event into a malformed entry. The permission-mode fallback (synthesized when the path carried no real preamble) is unchanged. Behavior of attachments / message-less entries (entry_to_event + project_event) and tool-result-only user entries (tool_result_user events + tool_result_event_to_entry, with the parent_rewrites chain-patching) is untouched.
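
The verbatim stash can be sketched as a pair of helpers keyed on `"raw"`. This is an illustration with string-valued maps; the real `ConversationEvent.data` holds JSON, and both function names are hypothetical.

```rust
use std::collections::BTreeMap;

/// Simplified event data; the real map holds JSON values.
pub type EventData = BTreeMap<String, String>;

/// Reader side: keep the whole headerless line, whatever its type.
pub fn headerless_to_event(line: &str) -> EventData {
    let mut data = EventData::new();
    data.insert("raw".to_string(), line.to_string());
    data
}

/// Projector side: an event is headerless exactly when it has a `raw`
/// key, and the stored line goes back onto the preamble unchanged.
pub fn event_to_preamble_line(data: &EventData) -> Option<&str> {
    data.get("raw").map(String::as_str)
}
```

Since nothing inspects the line's `type`, a preamble type added in a later Claude release round-trips the same way as the known ones.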
1e49061 to 386c530
akesling
reviewed
May 13, 2026
The codex derive shouldn't reach across to a sibling crate's projector in a doc comment. Reword to describe the generic extract -> ConversationView round-trip instead.
e0aa776 to 76691de
## Summary
Three threads, one branch:
- … `toolpath-codex` along the way.
- `path import claude … && path export claude …` now produces a JSONL that resumes in Claude Code with all the original UI affordances (diff view, shell output panel, file viewer, parent-chain navigation), and the same projection works for Codex/Gemini/opencode/Pi sources that come through with non-Claude tool names.
## What's in here
### Matrix infrastructure (`crates/path-cli/tests/cross_harness_matrix.rs`)
Two test functions:
- `matrix_translation` (25 cells). Each source uses its own captured real-world session as input. Cell shape: `A → IR → B → IR → B`. After one A→B translation, B's projector + forward must be a fixed point on the resulting IR. Strict equality on idempotence; provenance independence on the second leg.
- `matrix_schema_validation` (5 checks, one per harness). Project the real fixture through the harness's projector, serialize the native output to its on-disk wire format, re-parse via the harness's own reader. Catches projector output that's "valid IR" but produces native bytes the reader rejects.
Per-cell invariants (15 categories):
- … `filePath` ↔ `file_path`), output content, error flag
- … `agent_id` / `prompt` / child-turn count idempotence + soft cross-leg survival
### Per-crate real-fixture roundtrips
`crates/toolpath-{claude,codex,gemini,opencode,pi}/tests/real_fixture_roundtrip.rs` (one per harness). Each loads the captured real-world session at `test-fixtures/<harness>/` and asserts native → IR → Path → IR → native preserves the user-visible contract: meaningful turn count and roles, per-turn text, tool-call topology, delegation content, and that the projector output re-parses through the harness's own reader.
Claude additionally gets four regression locks for full-session roundtrip (added with the Claude UI fixes, see below):
- `read_then_project_preserves_line_count`: bluntest possible UX-loss check: source.lines() count == projected entries + preamble.
- `read_then_project_preserves_metadata_entries`: attachments and preamble line counts match source exactly.
- `read_then_project_preserves_tool_use_result_count`: every `toolUseResult` blob present in source survives the roundtrip (Claude UI uses this for diff/shell-output rendering).
- `cache_roundtrip_preserves_line_counts_per_type`: exercises the full path-cli flow (source JSONL → derive_path → cache JSON → extract → ClaudeProjector → JSONL) and asserts every entry type appears in the projection. Catches whole-category drops (ai-title / last-prompt / file-history-snapshot regressions).
### Synthetic compaction roundtrips
Real compaction events fire when the model context window fills mid-session; they can't reliably be triggered by a 5-minute capture prompt. Synthetic minimum-shape fixtures are the right tool. One per harness whose format models compaction natively: Claude (`compact_boundary`), Codex (`compacted` rollout line), Pi (`Entry::Compaction`), opencode (`PartData::Compaction`). Gemini is skipped: its format doesn't model context-overflow compaction.
### Real-fixture capture pipeline
- `scripts/capture-elicit-fixtures.sh`: orchestrator with per-harness driver functions, snapshot-diff session capture, fresh scratch dir per run, skip-on-missing-CLI.
- `docs/agents/feature-elicit.md`: workflow doc with manual fallback, per-harness invocation table, completeness checklist (now includes a delegation-event row).
- `docs/agents/feature-elicit.prompt.txt`: harness-agnostic 10-task prompt (list, write, read, edit, glob, grep, errored read, run shell, sub-agent dispatch, summarize). Single source of truth for both the doc and the script.
- `test-fixtures/<harness>/convo.{jsonl,json}`: five captured fixtures. The Claude capture carries a real `Agent` delegation.
### Codex translation fixes (`toolpath-codex`)
Four real bugs surfaced by the matrix and fixed in the same PR:
1. Empty-text assistant turn message-line emission (the anchor bug).
2. Tool `is_error` round-trip for non-shell tools. Codex's wire format only carries error info via `exec_command_end.exit_code`. For `read`/`write`/`apply_patch`, the error flag was silently flipping to `false`. Fix: stash `is_error: true` in `function_call_output.extra`, recover it on the forward path.
3. `event_msg.token_count` emission. The projector wasn't emitting `token_count` event_msgs, so all assistant `Turn.token_usage` data dropped on a codex round-trip.
4. `thinking: Some("")` non-idempotence. Codex's forward path drops empty thinking, so the second pass saw `None` and the emit condition was false, so the turn evaporated. Fix: treat `Some("")` as absent for emit decisions.
### Claude UI rendering through IR roundtrip
When the matrix passed but `path import claude && path export claude` produced a JSONL whose UI rendering was visibly broken (no diffs, no shell output, mid-conversation rendering cut off entirely), it surfaced a stack of bugs no per-crate unit test could catch: they all lived in the import/export plumbing.
#### Self-roundtrip preservation (Claude → IR → Claude)
- `reader.rs`: headerless lines (ai-title, last-prompt, queue-operation, permission-mode, file-history-snapshot) used to be dropped; they have no uuid so the typed parse path skipped them. Now they're captured into `convo.preamble`.
- `provider.rs`: surfaces preamble lines and message-less entries (attachments, snapshots) as `ConversationEvent`s with structured `event.data`. Folds the entry's typed `version`/`user_type`/`request_id` into `Turn.extra["claude"]`.
- `derive.rs`: emits preamble lines as `conversation.event` steps. Emits tool-result-only user entries as `tool_result_user` events (preserving original UUID + toolUseResult blob + entry_extra like promptId/slug). Without this, the entry vanishes and the next assistant's parentUuid points at a UUID we never re-emit, breaking the chain Claude's TUI traverses for rendering. Tracks `had_thinking_part` so encrypted-reasoning blocks survive. Uses `entry.parent_uuid` for step parents instead of the linear `last_step_id` chain; the linear form skipped over attachments and tool-result events.
- `project.rs`: routes `tool_result_user` events back to user entries with their original UUIDs, restores the `toolUseResult` blob verbatim, rebuilds the JSONL preamble from preamble events.
#### Cross-harness Claude UI rendering (any harness → Claude)
- `provider.rs` / `project.rs`: adds `provider::native_name(category, args)` (the reverse of `tool_category`) and a `canonical_claude_tool_name` dispatcher in the projector. Mirrors the same pattern already on toolpath-codex / -opencode / -gemini / -pi. Claude's UI dispatches its rich result panes by literal tool name, so `exec_command` / `read_file` / `write_file` rendered as opaque blocks even when their toolUseResult was well-formed. Now they come through as `Bash` / `Read` / `Write` / `Edit` and the renderers fire.
- `project.rs`: `canonical_claude_tool_input` translates input keys to what each Claude tool's UI reads (`cmd`/`path`/`oldString` → `command`/`file_path`/`old_string`). `tool_use_result_from_invocation` builds the matching display blob per-category: Bash gets `{stdout, stderr, …}`, Edit gets `{filePath, oldString, newString, structuredPatch, …}` with hunks computed via `similar::TextDiff`, Read gets `{type: "text", file: {filePath, content, …}}`, Glob/Grep get `{filenames, numFiles, pattern}`.
- `project.rs`: tracks `parent_rewrites` so that when a synthesized tool-result entry is emitted between two assistant turns, the next turn's parentUuid points at the synthesized result instead of the prior assistant.
#### Codex enables extract to see its tool calls
- `toolpath-codex/derive.rs`: was writing `tool_calls` (a name+status summary), but `toolpath_convo::extract` reads `tool_uses` (full id/name/input/category/result). The mismatch dropped every Codex tool call at the cache boundary, so `path export claude` showed text only. Adds the canonical `tool_uses` array; keeps the legacy summary for human-readable consumers.
### ConversationView events through derive_path
- `toolpath-convo/derive.rs` + `extract.rs`: derive_path was emitting steps for view.turns only, so anything in view.events disappeared on the IR-to-Path-to-IR trip. Now each event becomes a `conversation.event` step with its data flattened into structural extras. Two housekeeping keys (`event_data_type`, `event_source_id`) keep the original `type` field out of `StructuralChange.change_type` (otherwise a Codex `user_message` event collides with serde's `#[serde(rename = "type")]`).
### Supporting changes
- `toolpath-convo`: added `PartialEq, Eq` derives on `TokenUsage` so the matrix can compare token shapes structurally.
- `toolpath-gemini`: drive-by `clippy::unnecessary_map_or` fix that was blocking `-D warnings`. Pre-existing.
- `toolpath-claude`: adds the workspace `similar` dep, used by `structured_patch_hunks` in the projector.
## Why
Per-harness round-trip tests catch projector regressions for that harness alone. They can't catch the class of bug that only surfaces when projecting across harnesses, or in the import/export plumbing rather than the projector itself: silent text drops, foreign-namespace extras leaking through serde flatten, tool-name remapping gaps, arg-key shape mismatches (camelCase ↔ snake_case), schema-required fields the projector forgot to populate, and (the load-bearing one for the Claude UI work) the `derive_path` → `extract` boundary dropping data the projector never had a chance to see.
The opencode "preparing edit…" / Zod TypeError class lived in the cross-harness gap. The Claude UI work then closed a parallel gap: tests can pass on every per-crate invariant and every cross-harness invariant and still produce a JSONL Claude Code can't render, because rendering requires fields and parent-chain shapes that no test was checking.
## Verification
End-to-end UI verification (any Claude or Codex session you have on disk):
To re-capture fixtures after a harness CLI version bump:
## Known limitations
- … `agent_id` surviving as a regular `tool_use` call_id rather than as a `DelegatedWork` entry: Codex/opencode don't model sub-agents natively.
- … `compact_boundary` markers, `isCompactSummary` flags, and `part.compaction` ConversationEvents don't survive derive→extract. Surrounding messages do.
- … `cache_roundtrip_preserves_line_counts_per_type` test catches the structural class but a pure rendering bug in Claude Code's TUI wouldn't be caught here.
- … `toolpath-opencode/tests/projection_roundtrip.rs` and `compaction_roundtrip.rs`.
## Commits
- chore(toolpath-convo): derive PartialEq+Eq on TokenUsage
- chore(toolpath-gemini): use contains_key over get(..).is_none()
- fix(codex): translation idempotence fixes from cross-harness matrix
- feat(scripts): elicit-fixture capture pipeline
- feat(path-cli): cross-harness translation matrix tests
- refactor(path-cli): drop synthetic-IR matrix variant as redundant
- refactor(test-fixtures): relocate to workspace-root test-fixtures/
- test: add per-crate real-fixture projection roundtrips
- feat(path-cli): strengthen delegation invariants in matrix
- feat(scripts): extend elicit prompt with sub-agent dispatch step
- test: add synthetic compaction roundtrip tests
- chore(test-fixtures): refresh real-world captures
- feat(toolpath-convo): preserve view.events through Path roundtrip
- feat(claude): full-fidelity Claude UI rendering through IR roundtrip
- test(claude): regression locks for full-session roundtrip