
feat(path-cli): cross-harness translation matrix + codex idempotence fixes #81

Merged
benbaarber merged 17 commits into main from ben/cross-harness-matrix
May 13, 2026

@ben-emp
Collaborator

@ben-emp ben-emp commented May 6, 2026

Summary

Three threads, one branch:

  1. Cross-harness translation matrix — exercises every (source, target) pair across the five conversation harnesses (claude, codex, gemini, pi, opencode) and asserts a tight set of idempotence + survival invariants on the IR. Surfaced and fixed four real translation bugs in toolpath-codex along the way.
  2. Per-harness real-fixture roundtrips, strengthened delegation invariants, synthetic compaction tests — backs the "good UX across translation" claim with tests at three layers (per-harness self-roundtrip, cross-harness matrix, targeted regression cases for compaction).
  3. Full-fidelity Claude UI rendering through IR roundtrip: path import claude … && path export claude … now produces a JSONL that resumes in Claude Code with all the original UI affordances (diff view, shell output panel, file viewer, parent-chain navigation), and the same projection works for Codex/Gemini/opencode/Pi sources that come through with non-Claude tool names.

What's in here

Matrix infrastructure (crates/path-cli/tests/cross_harness_matrix.rs)

Two test functions:

  • matrix_translation (25 cells). Each source uses its own captured real-world session as input. Cell shape: A → IR → B → IR → B. After one A→B translation, B's projector + forward must be a fixed point on the resulting IR. Strict equality on idempotence; provenance independence on the second leg.
  • matrix_schema_validation (5 checks, one per harness). Project the real fixture through the harness's projector, serialize the native output to its on-disk wire format, re-parse via the harness's own reader. Catches projector output that's "valid IR" but produces native bytes the reader rejects.
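The cell shape can be illustrated with a toy harness. This is only a sketch: `Ir`, `Turn`, `Harness`, `LineHarness`, and `two_passes` are hypothetical stand-ins, not the real toolpath types, and the toy target is deliberately lossy (it cannot represent empty-text turns) to show why the fixed point is asserted after the first A→B translation rather than against the source.

```rust
// Hypothetical stand-ins for the real IR and harness types.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Turn {
    role: String,
    text: String,
}

#[derive(Debug, Clone, PartialEq, Eq)]
struct Ir {
    turns: Vec<Turn>,
}

/// A harness projects IR into its native format and reads it back (forward).
trait Harness {
    type Native;
    fn project(&self, ir: &Ir) -> Self::Native;
    fn forward(&self, native: &Self::Native) -> Ir;
}

/// Toy target: one "role: text" line per turn. Empty-text turns are lost
/// on projection, so the first pass may legitimately differ from the source.
struct LineHarness;

impl Harness for LineHarness {
    type Native = Vec<String>;

    fn project(&self, ir: &Ir) -> Vec<String> {
        ir.turns
            .iter()
            .filter(|t| !t.text.is_empty())
            .map(|t| format!("{}: {}", t.role, t.text))
            .collect()
    }

    fn forward(&self, native: &Vec<String>) -> Ir {
        let turns = native
            .iter()
            .filter_map(|line| {
                let (role, text) = line.split_once(": ")?;
                Some(Turn { role: role.to_string(), text: text.to_string() })
            })
            .collect();
        Ir { turns }
    }
}

/// Run the B leg twice and return (IR after first pass, IR after second pass).
/// The idempotence invariant is that the two are strictly equal.
fn two_passes<H: Harness>(target: &H, source_ir: &Ir) -> (Ir, Ir) {
    let first = target.forward(&target.project(source_ir));
    let second = target.forward(&target.project(&first));
    (first, second)
}
```

A cell would then assert `first == second` for idempotence, and compare `first` against the source IR only for the survival-style invariants.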

Per-cell invariants (15 categories):

| Category | Check |
| --- | --- |
| Skeleton | turn count, role sequence |
| Content | turn text (whitespace-normalized) |
| Tool calls | call_id set, input shape (snake-case canonicalized so filePath ↔ file_path), output content, error flag |
| Tokens | per-turn idempotence, total_usage idempotence, survival across the B leg |
| Reasoning | thinking idempotence + survival |
| Delegations | per-delegation agent_id/prompt/child-turn count idempotence + soft cross-leg survival |
| Other fields | model, stop_reason, parent_id graph, environment.cwd, files_changed set |
| Pollution | no foreign-namespace extras keys |
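As an illustration of the input-shape canonicalization in the Tool calls row, here is a minimal sketch of comparing key sets after snake-casing. The helper names `to_snake_case` and `canonical_key_set` are hypothetical, and the conversion is deliberately naive (it would split an all-caps run such as `URL` letter by letter); the matrix's actual canonicalizer may differ.

```rust
use std::collections::BTreeSet;

/// Naive camelCase -> snake_case: insert '_' before each ASCII uppercase
/// letter (except at position 0) and lowercase it. Already-snake keys
/// pass through unchanged.
fn to_snake_case(key: &str) -> String {
    let mut out = String::with_capacity(key.len() + 4);
    for (i, ch) in key.chars().enumerate() {
        if ch.is_ascii_uppercase() {
            if i > 0 {
                out.push('_');
            }
            out.push(ch.to_ascii_lowercase());
        } else {
            out.push(ch);
        }
    }
    out
}

/// Canonical key set used to compare tool-input shapes across harnesses.
fn canonical_key_set<'a>(keys: impl IntoIterator<Item = &'a str>) -> BTreeSet<String> {
    keys.into_iter().map(to_snake_case).collect()
}
```

With this, a Claude-style input keyed by `file_path` and an opencode-style input keyed by `filePath` compare equal on shape.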

Per-crate real-fixture roundtrips

crates/toolpath-{claude,codex,gemini,opencode,pi}/tests/real_fixture_roundtrip.rs (one per harness). Each loads the captured real-world session at test-fixtures/<harness>/ and asserts native → IR → Path → IR → native preserves the user-visible contract: meaningful turn count and roles, per-turn text, tool-call topology, delegation content, and that the projector output re-parses through the harness's own reader.

Claude additionally gets four regression locks for full-session roundtrip (added with the Claude UI fixes — see below):

  • read_then_project_preserves_line_count — bluntest possible UX-loss check: source.lines() count == projected entries + preamble.
  • read_then_project_preserves_metadata_entries — attachments and preamble line counts match source exactly.
  • read_then_project_preserves_tool_use_result_count — every toolUseResult blob present in source survives the roundtrip (Claude UI uses this for diff/shell-output rendering).
  • cache_roundtrip_preserves_line_counts_per_type — exercises the full path-cli flow (source JSONL → derive_path → cache JSON → extract → ClaudeProjector → JSONL) and asserts every entry type appears in the projection. Catches whole-category drops (ai-title / last-prompt / file-history-snapshot regressions).
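The per-type check in the last lock could look roughly like the sketch below. `count_by_type` and `dropped_types` are hypothetical names, and the sketch assumes the `type` field of each JSONL entry has already been extracted; JSON parsing is left out.

```rust
use std::collections::BTreeMap;

/// Count entries per type. `types` yields the already-extracted `type`
/// field of each JSONL entry.
fn count_by_type<'a>(types: impl IntoIterator<Item = &'a str>) -> BTreeMap<&'a str, usize> {
    let mut counts = BTreeMap::new();
    for t in types {
        *counts.entry(t).or_insert(0) += 1;
    }
    counts
}

/// Entry types that appear in the source but not at all in the projection.
fn dropped_types<'a>(
    source: &BTreeMap<&'a str, usize>,
    projected: &BTreeMap<&'a str, usize>,
) -> Vec<&'a str> {
    source
        .keys()
        .copied()
        .filter(|t| !projected.contains_key(t))
        .collect()
}
```

A test built on this would assert that `dropped_types` is empty, and could print both count maps on failure to show which category went missing.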

Synthetic compaction roundtrips

Real compaction events fire when the model context window fills mid-session — they can't reliably be triggered by a 5-minute capture prompt. Synthetic minimum-shape fixtures are the right tool. One per harness whose format models compaction natively: Claude (compact_boundary), Codex (compacted rollout line), Pi (Entry::Compaction), opencode (PartData::Compaction). Gemini is skipped — its format doesn't model context-overflow compaction.

Real-fixture capture pipeline

  • scripts/capture-elicit-fixtures.sh — orchestrator with per-harness driver functions, snapshot-diff session capture, fresh scratch dir per run, skip-on-missing-CLI.
  • docs/agents/feature-elicit.md — workflow doc with manual fallback, per-harness invocation table, completeness checklist (now includes a delegation-event row).
  • docs/agents/feature-elicit.prompt.txt — harness-agnostic 10-task prompt (list, write, read, edit, glob, grep, errored read, run shell, sub-agent dispatch, summarize). Single source of truth for both the doc and the script.
  • test-fixtures/<harness>/convo.{jsonl,json} — five captured fixtures. The Claude capture carries a real Agent delegation.

Codex translation fixes (toolpath-codex)

Four real bugs surfaced by the matrix and fixed in the same PR:

  1. Empty-text assistant turn anchor. Tool calls without a preceding message attached backward to the prior assistant turn during forward parsing because the projector skipped the message line for empty-text assistants. Fix: emit a message line for any assistant turn with content.
  2. Tool is_error round-trip for non-shell tools. Codex's wire format only carries error info via exec_command_end.exit_code. For read / write / apply_patch, the error flag was silently flipping to false. Fix: stash is_error: true in function_call_output.extra, recover it on the forward path.
  3. event_msg.token_count emission. The projector wasn't emitting token_count event_msgs, so all assistant Turn.token_usage data dropped on a codex round-trip.
  4. thinking: Some("") non-idempotence. Codex's forward path drops empty thinking, so the second pass saw None and the emit condition was false — turn evaporated. Fix: treat Some("") as absent for emit decisions.
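Fixes 1 and 4 both come down to the projector's emit condition. A minimal sketch of that condition, using a hypothetical `AssistantTurn` struct and helper names rather than the real toolpath-codex types:

```rust
/// Hypothetical, simplified view of an assistant turn.
struct AssistantTurn {
    text: String,
    tool_calls: Vec<String>,
    thinking: Option<String>,
}

/// `Some("")` counts as absent, matching what the forward path would
/// produce on the next pass (it drops empty thinking).
fn has_thinking(turn: &AssistantTurn) -> bool {
    matches!(turn.thinking.as_deref(), Some(t) if !t.is_empty())
}

/// Emit a message line for any assistant turn with content:
/// text, tool calls, or non-empty thinking.
fn should_emit_message_line(turn: &AssistantTurn) -> bool {
    !turn.text.is_empty() || !turn.tool_calls.is_empty() || has_thinking(turn)
}
```

The key property is that the decision is the same for `Some("")` and `None`, so the first and second passes agree.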

Claude UI rendering through IR roundtrip

When the matrix passed but path import claude && path export claude produced a JSONL whose UI rendering was visibly broken (no diffs, no shell output, mid-conversation rendering cut off entirely), it surfaced a stack of bugs no per-crate unit test could catch — they all lived in the import/export plumbing.

Self-roundtrip preservation (Claude → IR → Claude)

  • reader.rs: headerless lines (ai-title, last-prompt, queue-operation, permission-mode, file-history-snapshot) used to be dropped — they have no uuid so the typed parse path skipped them. Now they're captured into convo.preamble.
  • provider.rs: surfaces preamble lines and message-less entries (attachments, snapshots) as ConversationEvents with structured event.data. Folds the entry's typed version/user_type/request_id into Turn.extra["claude"].
  • derive.rs: emits preamble lines as conversation.event steps. Emits tool-result-only user entries as tool_result_user events (preserving original UUID + toolUseResult blob + entry_extra like promptId/slug) — without this, the entry vanishes and the next assistant's parentUuid points at a UUID we never re-emit, breaking the chain Claude's TUI traverses for rendering. Tracks had_thinking_part so encrypted-reasoning blocks survive. Uses entry.parent_uuid for step parents instead of the linear last_step_id chain — the linear form skipped over attachments and tool-result events.
  • project.rs: routes tool_result_user events back to user entries with their original UUIDs, restores the toolUseResult blob verbatim, rebuilds the JSONL preamble from preamble events.

Cross-harness Claude UI rendering (any harness → Claude)

  • provider.rs / project.rs: adds provider::native_name(category, args) — the reverse of tool_category — and a canonical_claude_tool_name dispatcher in the projector. Mirrors the same pattern already on toolpath-codex / -opencode / -gemini / -pi. Claude's UI dispatches its rich result panes by literal tool name, so exec_command / read_file / write_file rendered as opaque blocks even when their toolUseResult was well-formed. Now they come through as Bash / Read / Write / Edit and the renderers fire.
  • project.rs: canonical_claude_tool_input translates input keys to what each Claude tool's UI reads (cmd/path/oldString → command/file_path/old_string). tool_use_result_from_invocation builds the matching display blob per category: Bash gets {stdout, stderr, …}, Edit gets {filePath, oldString, newString, structuredPatch, …} with hunks computed via similar::TextDiff, Read gets {type: "text", file: {filePath, content, …}}, Glob/Grep get {filenames, numFiles, pattern}.
  • project.rs: tracks parent_rewrites so that when a synthesized tool-result entry is emitted between two assistant turns, the next turn's parentUuid points at the synthesized result instead of the prior assistant.
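To make the name and key remapping concrete, here is a simplified sketch. The `ToolCategory` enum and its variant names are assumptions for illustration, the function signatures are not the projector's real ones (the description says `native_name` also takes `args`), and the key mapping is shown globally although the real translation is per tool.

```rust
/// Hypothetical tool categories; the real enum lives in the toolpath crates.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ToolCategory {
    Shell,
    FileRead,
    FileWrite,
    FileEdit,
}

/// Map a category to the literal tool name Claude's UI dispatches on.
fn canonical_claude_tool_name(category: ToolCategory) -> &'static str {
    match category {
        ToolCategory::Shell => "Bash",
        ToolCategory::FileRead => "Read",
        ToolCategory::FileWrite => "Write",
        ToolCategory::FileEdit => "Edit",
    }
}

/// Translate a foreign input key to the key Claude's UI reads.
/// Unknown keys pass through unchanged.
fn canonical_claude_input_key(key: &str) -> &str {
    match key {
        "cmd" => "command",
        "path" => "file_path",
        "oldString" => "old_string",
        other => other,
    }
}
```

Under these assumptions, a Codex `exec_command` call categorized as a shell tool with a `cmd` argument would be projected as `Bash` with a `command` argument.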

Codex enables extract to see its tool calls

  • toolpath-codex/derive.rs: was writing tool_calls (a name+status summary), but toolpath_convo::extract reads tool_uses (full id/name/input/category/result). The mismatch dropped every Codex tool call at the cache boundary — path export claude showed text only. Adds the canonical tool_uses array; keeps the legacy summary for human-readable consumers.

ConversationView events through derive_path

  • toolpath-convo/derive.rs + extract.rs: derive_path was emitting steps for view.turns only — anything in view.events disappeared on the IR-to-Path-to-IR trip. Now each event becomes a conversation.event step with its data flattened into structural extras. Two housekeeping keys (event_data_type, event_source_id) keep the original type field out of StructuralChange.change_type (otherwise a Codex user_message event collides with serde's #[serde(rename = "type")]).
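The housekeeping-key trick can be sketched with plain string maps. This is an illustration only: real event data is JSON, `flatten_event` and `restore_event` are hypothetical names, and the exact meaning of `event_source_id` is assumed here to be the originating event's identifier.

```rust
use std::collections::BTreeMap;

type Extras = BTreeMap<String, String>;

const STASHED_TYPE: &str = "event_data_type";
const SOURCE_ID: &str = "event_source_id";

/// Flatten event data into step extras, moving `type` out of the way so it
/// cannot collide with a field that is itself serialized as "type".
fn flatten_event(source_id: &str, data: &Extras) -> Extras {
    let mut extras = Extras::new();
    for (key, value) in data {
        let key = if key.as_str() == "type" { STASHED_TYPE } else { key.as_str() };
        extras.insert(key.to_string(), value.clone());
    }
    extras.insert(SOURCE_ID.to_string(), source_id.to_string());
    extras
}

/// Inverse: strip both housekeeping keys and restore the original `type`.
fn restore_event(extras: &Extras) -> (Option<String>, Extras) {
    let mut data = extras.clone();
    let source_id = data.remove(SOURCE_ID);
    if let Some(original_type) = data.remove(STASHED_TYPE) {
        data.insert("type".to_string(), original_type);
    }
    (source_id, data)
}
```

The round-trip property to test is that `restore_event(flatten_event(id, data))` returns the same id and data.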

Supporting changes

  • toolpath-convo: added PartialEq, Eq derives on TokenUsage so the matrix can compare token shapes structurally.
  • toolpath-gemini: drive-by clippy::unnecessary_map_or fix that was blocking -D warnings. Pre-existing.
  • toolpath-claude: adds the workspace similar dep, used by structured_patch_hunks in the projector.

Why

Per-harness round-trip tests catch projector regressions for that harness alone. They can't catch the class of bug that only surfaces when projecting across harnesses, or in the import/export plumbing rather than the projector itself: silent text drops, foreign-namespace extras leaking through serde flatten, tool-name remapping gaps, arg-key shape mismatches (camelCase ↔ snake_case), schema-required fields the projector forgot to populate, and (the load-bearing one for the Claude UI work) the derive_path → extract boundary dropping data the projector never had a chance to see.

The opencode "preparing edit…" / Zod TypeError class lived in the cross-harness gap. The Claude UI work then closed a parallel gap: tests can pass on every per-crate invariant and every cross-harness invariant and still produce a JSONL Claude Code can't render, because rendering requires fields and parent-chain shapes that no test was checking.

Verification

# Matrix tests
cargo test -p path-cli --test cross_harness_matrix

# Per-harness real-fixture and compaction roundtrips
cargo test -p toolpath-claude -p toolpath-codex -p toolpath-gemini -p toolpath-opencode -p toolpath-pi --tests

# Workspace regression sweep
cargo test --workspace

# Clippy clean (uses -D warnings)
cargo clippy --workspace --tests -- -D warnings

End-to-end UI verification (any Claude or Codex session you have on disk):

SESSION=<session uuid>
cargo run --release -p path-cli -- import claude --project "$(pwd)" --session "$SESSION" --force
cargo run --release -p path-cli -- export claude --input "claude-path-claude-${SESSION:0:8}" --project /tmp/replay
cd /tmp/replay && claude -r "$SESSION"
# Diffs, shell output, file viewer, parent-chain navigation — all should render.

# Same flow with import codex / opencode / gemini / pi — Claude UI still renders the rich panes.

To re-capture fixtures after a harness CLI version bump:

./scripts/capture-elicit-fixtures.sh             # all installed harnesses
./scripts/capture-elicit-fixtures.sh claude pi   # subset

Known limitations

  • One real fixture variant per harness. Compaction is now covered via synthetic fixtures, but other edge cases remain: retry parts, MCP tools, very long content, multi-segment chained sessions (Claude's JSONL rotation), unicode stress.
  • Cross-harness delegation survival is soft. The matrix accepts a delegation's agent_id surviving as a regular tool_use call_id rather than as a DelegatedWork entry — Codex/opencode don't model sub-agents natively.
  • Compaction structural metadata is documented loss. compact_boundary markers, isCompactSummary flags, and part.compaction ConversationEvents don't survive derive→extract. Surrounding messages do.
  • Claude UI rendering is verified manually. No headless TUI sentinel scraping in CI; the new cache_roundtrip_preserves_line_counts_per_type test catches the structural class but a pure rendering bug in Claude Code's TUI wouldn't be caught here.
  • Opencode schema check is JSON-self-roundtrip, not SQLite-roundtrip. SQL-leg coverage lives in toolpath-opencode/tests/projection_roundtrip.rs and compaction_roundtrip.rs.
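The soft delegation-survival rule from the limitations above can be expressed as a three-way classification. A minimal sketch, with a hypothetical `Survival` enum and function names, and with the target IR reduced to two sets of IDs:

```rust
use std::collections::BTreeSet;

#[derive(Debug, PartialEq, Eq)]
enum Survival {
    /// Preferred: still modeled as a delegation in the target IR.
    AsDelegation,
    /// Acceptable: visible as a regular tool_use call_id.
    AsToolCall,
    /// Regression: the id is gone entirely.
    Dropped,
}

/// Classify how a source delegation's agent_id shows up in the target IR.
fn delegation_survival(
    agent_id: &str,
    target_delegations: &BTreeSet<String>,
    target_call_ids: &BTreeSet<String>,
) -> Survival {
    if target_delegations.contains(agent_id) {
        Survival::AsDelegation
    } else if target_call_ids.contains(agent_id) {
        Survival::AsToolCall
    } else {
        Survival::Dropped
    }
}

/// The soft bar: anything except a flat-out drop passes.
fn soft_check_passes(survival: &Survival) -> bool {
    *survival != Survival::Dropped
}
```

Tightening the invariant later would only require rejecting `AsToolCall` for targets that do model sub-agents natively.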

Commits

  • chore(toolpath-convo): derive PartialEq+Eq on TokenUsage
  • chore(toolpath-gemini): use contains_key over get(..).is_none()
  • fix(codex): translation idempotence fixes from cross-harness matrix
  • feat(scripts): elicit-fixture capture pipeline
  • feat(path-cli): cross-harness translation matrix tests
  • refactor(path-cli): drop synthetic-IR matrix variant as redundant
  • refactor(test-fixtures): relocate to workspace-root test-fixtures/
  • test: add per-crate real-fixture projection roundtrips
  • feat(path-cli): strengthen delegation invariants in matrix
  • feat(scripts): extend elicit prompt with sub-agent dispatch step
  • test: add synthetic compaction roundtrip tests
  • chore(test-fixtures): refresh real-world captures
  • feat(toolpath-convo): preserve view.events through Path roundtrip
  • feat(claude): full-fidelity Claude UI rendering through IR roundtrip
  • test(claude): regression locks for full-session roundtrip

@github-actions

github-actions Bot commented May 6, 2026

🔍 Preview deployed: https://068a86aa.toolpath.pages.dev

benbaarber added 15 commits May 11, 2026 11:12
chore(toolpath-convo): derive PartialEq+Eq on TokenUsage

Needed by downstream invariant tests that compare per-turn token usage
across IR round-trips. The fields are all `Option<u32>`, so structural
equality is well-defined.

fix(codex): translation idempotence fixes from cross-harness matrix

Four bugs surfaced (and fixed) by the new cross-harness matrix tests:

1. Empty-text assistant turn message-line emission (the anchor bug).
   The forward path uses `response_item.message` as the turn anchor;
   tool calls without a preceding message attach to whatever assistant
   turn was last seen, even across an intervening user message. The
   projector previously skipped the message line for assistants whose
   text was empty even when they had tool calls or thinking, so those
   tool calls got fused backward into the prior assistant turn during
   forward parsing. Fix: emit a message line for any assistant turn
   with content (text, tool calls, or non-empty thinking).

2. Tool `is_error` round-trip for non-shell tools. Codex's wire format
   only carries error info via `exec_command_end.exit_code`, which
   only fires for shell tools. For read/write/apply_patch, the error
   flag was silently flipping to `false` after round-trip. Fix: stash
   `is_error: true` in the function_call_output (and
   custom_tool_call_output) `extra` map on projection, recover it on
   the forward path. `attach_tool_output` now takes an explicit
   `is_error` parameter and ORs with any prior state.

3. `event_msg.token_count` emission. The projector was never emitting
   token_count event_msgs, so all assistant `Turn.token_usage` data
   silently dropped on a codex round-trip. Fix: emit a `token_count`
   event_msg before each assistant message line (the forward path's
   `pending_token_usage` attaches to the next pushed turn). Adds
   `convo_usage_to_codex_json` helper mapping IR TokenUsage → codex
   TokenUsage JSON (input_tokens, cached_input_tokens, output_tokens).

4. `thinking: Some("")` non-idempotence. Claude's forward path
   produces empty-string thinking on assistant turns whose reasoning
   blocks have only metadata (signatures, etc.). The projector's
   emit-message-line condition was `turn.thinking.is_some()` which
   triggers on `Some("")`, so the first pass emits a message line.
   But codex's forward path drops empty thinking, so the second pass
   sees `thinking: None` and the same condition is false — turn drops
   out. Fix: treat `Some("")` as absent for emit decisions.
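The `is_error` stash-and-recover path from fix 2 might look like the following sketch. The `ToolResult` struct, the boolean-valued `extra` map, and the helper names other than `attach_tool_output` are simplifying assumptions, not the crate's real types.

```rust
use std::collections::BTreeMap;

/// Hypothetical, simplified tool result.
#[derive(Debug, Default)]
struct ToolResult {
    content: String,
    is_error: bool,
}

/// Projection side: stash the flag in the output line's extras only when
/// it is set, so successful calls produce no extra key.
fn project_output_extra(result: &ToolResult) -> BTreeMap<String, bool> {
    let mut extra = BTreeMap::new();
    if result.is_error {
        extra.insert("is_error".to_string(), true);
    }
    extra
}

/// Forward side: read the stashed flag back, defaulting to false.
fn recover_is_error(extra: &BTreeMap<String, bool>) -> bool {
    extra.get("is_error").copied().unwrap_or(false)
}

/// Attach output to a result slot. The explicit flag is ORed with any prior
/// state so an error seen earlier (for example via an exit code) is not
/// cleared by a later attach.
fn attach_tool_output(slot: &mut ToolResult, content: &str, is_error: bool) {
    slot.content = content.to_string();
    slot.is_error = slot.is_error || is_error;
}
```

ORing rather than overwriting keeps the shell path (exit code) and the stashed-flag path from clobbering each other.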

Updated unit test `assistant_turn_with_function_call_and_output` to
match the new line ordering (token_count now precedes message in the
output stream).

feat(scripts): elicit-fixture capture pipeline

Adds the workflow for capturing real-world conversation fixtures from
each conversation harness on demand:

- `docs/agents/feature-elicit.prompt.txt` — the harness-agnostic 9-task
  prompt (list, write, read, edit, glob, grep, errored read, run shell,
  summarize). Single source of truth for both the doc and the script.

- `docs/agents/feature-elicit.md` — workflow doc covering both the
  automated path and the manual fallback, the per-harness invocation
  table (claude / codex / gemini / pi / opencode), the completeness
  checklist (every tool category + thinking + error result + summary),
  and where the resulting session files land.

- `scripts/capture-elicit-fixtures.sh` — orchestrator. Per-harness
  driver functions, snapshot-diff session capture, fresh scratch dir
  per run, skip-on-missing-CLI, captured stderr surfaced on failure
  (not silenced). Run as `./scripts/capture-elicit-fixtures.sh` for
  all harnesses or pass a subset like `claude codex`. Output lands at
  `crates/path-cli/tests/fixtures/<harness>/convo.{jsonl,json}`.

The pipeline is the input side of the cross-harness translation matrix
test (next commit). Captured fixtures themselves are committed there.

feat(path-cli): cross-harness translation matrix tests

Three test functions covering 60 distinct assertions across the five
conversation harnesses (claude, codex, gemini, pi, opencode):

- `matrix_synthetic` (25 cells): drives a controlled inline IR through
  every (source, target) pair. `A → IR → B → IR → B` shape — after one
  A→B translation, B's projector + forward must be a fixed point on
  the resulting IR. Strict equality on idempotence; provenance
  independence on the second leg.

- `matrix_real_fixtures` (25 cells): same shape, but each source uses
  its own real-world captured session as input. Loads fixtures via the
  harness's own native reader (ConversationReader / RolloutReader /
  read_session_from_file / etc.). Catches richness-driven bugs the
  synthetic IR can't reach.

- `matrix_schema_validation` (10 checks): for each harness × {synthetic,
  real fixture}, project IR → serialize native to wire format → re-parse
  via the harness's reader. Catches projector output that's "valid IR"
  but produces native bytes the reader rejects.

Per-cell invariants (14 categories):

  Skeleton: turn count, role sequence
  Content: turn text (whitespace-normalized)
  Tool calls: call_id set, input shape (snake-case canonicalized for
    cross-harness `filePath` ↔ `file_path`), output content, error flag
  Tokens: per-turn idempotence, total_usage idempotence, survival
    (tokens present pre-target must survive the B leg)
  Reasoning: thinking idempotence + survival
  Other fields: model, stop_reason, parent_id graph, environment.cwd,
    delegation count, files_changed set
  Pollution: no foreign-namespace extras keys

Failures aggregate per cell so a single test run reports every
divergence, not just the first.
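A sketch of the aggregate-then-report pattern, assuming hypothetical `Check`, `compare`, and `summarize_cell` helpers (the matrix's real helpers are not shown in this PR description):

```rust
use std::fmt::Debug;

/// One invariant outcome: a label plus a description of the divergence, if any.
struct Check {
    name: &'static str,
    failure: Option<String>,
}

/// Compare two values and record the divergence instead of panicking.
fn compare<T: PartialEq + Debug>(name: &'static str, first: &T, second: &T) -> Check {
    let failure = (first != second).then(|| format!("{:?} != {:?}", first, second));
    Check { name, failure }
}

/// Collect every failed check for one (source, target) cell into a single
/// report, so one run lists all divergences rather than only the first.
fn summarize_cell(source: &str, target: &str, checks: &[Check]) -> Result<(), String> {
    let failures: Vec<String> = checks
        .iter()
        .filter_map(|c| {
            c.failure
                .as_ref()
                .map(|f| format!("[{} -> {}] {}: {}", source, target, c.name, f))
        })
        .collect();
    if failures.is_empty() {
        Ok(())
    } else {
        Err(failures.join("\n"))
    }
}
```

The test body would run all 25 cells, gather the `Err` strings, and panic once at the end with the combined report.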

Fixtures (~170KB total, captured via scripts/capture-elicit-fixtures.sh):

  crates/path-cli/tests/fixtures/{claude,codex,gemini,opencode,pi}/convo.{jsonl,json}

Each is the harness's native session output for the 9-task elicitation
prompt, machine-captured from a fresh scratch directory. Loaded through
each harness's reader API; opencode's `export` JSON wrapper is converted
inline to the Session struct since opencode's primary store is SQLite.

refactor(path-cli): drop synthetic-IR matrix variant as redundant

The matrix had two cell-running tests: one driven by an inline
`canonical_source_view()` (synthetic IR, 6 turns) and one driven by
each harness's captured real-world fixture. The real fixtures are a
strict superset of the synthetic shape (19+ turns each, more tool
diversity), so any cell that would fail under synthetic also fails
under real fixtures. The synthetic variant added test surface without
adding bug-finding coverage.

Removed:
- `matrix_synthetic` test
- `canonical_source_view`, `user_turn`, `assistant_turn` builders
- The synthetic case from `matrix_schema_validation`

Tightened: `matrix_translation` (renamed from `matrix_real_fixtures`)
now panics with an actionable message when a fixture is missing
instead of silently skipping. Same for the schema validation test.
That's the right behavior — fixtures are checked in, so missing means
something is genuinely wrong.
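A fail-loudly fixture loader in this spirit could be as small as the sketch below. `load_fixture` is a hypothetical name and the message wording is illustrative; only the script path is taken from this PR.

```rust
use std::fs;
use std::path::Path;

/// Read a checked-in fixture, panicking with an actionable message instead
/// of letting the caller skip the test when the file is missing.
fn load_fixture(path: &Path) -> String {
    fs::read_to_string(path).unwrap_or_else(|err| {
        panic!(
            "fixture {} could not be read ({}); fixtures are checked in, so \
             restore it from git or re-capture with scripts/capture-elicit-fixtures.sh",
            path.display(),
            err
        )
    })
}
```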

Net: 169 fewer lines, same coverage, clearer failure modes when
fixtures get accidentally removed.

refactor(test-fixtures): relocate to workspace-root test-fixtures/

Move the captured real-world fixtures from
crates/path-cli/tests/fixtures/ to test-fixtures/ at the workspace
root. They're shared by the cross-harness matrix and (in subsequent
commits) by per-crate roundtrip tests, so the workspace root is a
more honest home than burying them under one consumer crate.

- Renames the five fixture directories.
- Updates scripts/capture-elicit-fixtures.sh to write there.
- Updates fixtures_dir() in cross_harness_matrix.rs.
- Updates docs/agents/feature-elicit.md path references.

Picks up a stray rustfmt cleanup in toolpath-codex/src/project.rs
along the way.

test: add per-crate real-fixture projection roundtrips

Each per-harness crate gets a tests/real_fixture_roundtrip.rs that
loads the captured real-world session at test-fixtures/<harness>/
and runs it through the full native → IR → Path → IR → native
pipeline. Asserts:

- fixture loads non-empty with at least one user + assistant turn
- meaningful turn count and roles preserved (system envelopes excluded)
- per-turn text preserved (whitespace-normalized)
- tool-call topology preserved (id sets, names, results)
- per-delegation content (agent_id, prompt, child-turn count) preserved
- projector output re-parses through the harness's own reader (or,
  for opencode where the wire is SQLite, that the projected Session
  serdes symmetrically as JSON)

Claude additionally checks total token usage preservation when present.

Complements the cross-harness matrix: each leg's per-harness self-
roundtrip is now exercised on production-shape data, not just
synthetic minimums.

feat(path-cli): strengthen delegation invariants in matrix

The previous `delegations` invariant only checked total count. A
projector that drops a sub-agent's child turns or scrambles its
agent_ids would slip past as long as the total count of delegations
stayed the same. Production sub-agent flows are too important to
permit that.

Strengthen the idempotence check (view_first vs view_second) to assert
per-delegation content equality: agent_id set, child-turn count, and
prompt text. Add a new `delegations_survive` cross-leg invariant
(view_after_source vs view_first) that catches drops on the A→B
translation path itself.

The cross-leg check is deliberately soft: a source delegation's
agent_id must remain findable in the target IR — *either* as a
delegation (preferred — preserves the structural sub-agent semantics)
*or* as a regular tool_use call_id (acceptable — the visible tool
call is still present even if the harness can't natively model
delegation). This matches the "good UX" goal: when you open a Claude
session in Codex or opencode, you still see the dispatch tool call
and its result; you just lose the metadata that says "this was a
delegation." A flat-out drop of the call_id is the regression.

feat(scripts): extend elicit prompt with sub-agent dispatch step

Adds step 9 to feature-elicit.prompt.txt asking the agent to dispatch
a sub-agent (Claude Task / Gemini sub-agent / Codex sub-task /
opencode subtask) with a small word-counting instruction. Worded
harness-neutrally; harnesses without a dispatch tool are told to skip
explicitly so the lack-of-delegation is itself observable.

Updates the docs for the new task count and adds a "1+ delegation
event" entry to the completeness checklist.

Pairs with the strengthened delegation invariants in the cross-harness
matrix: subsequent capture refreshes will populate delegation content
in the fixtures, and the matrix will start exercising the new
assertions automatically.

test: add synthetic compaction roundtrip tests

Real compaction events fire when the model context window fills
mid-session — they can't reliably be triggered by a 5-minute capture
prompt. Synthetic minimum-shape fixtures are the right tool for
this regression class. One per harness whose format models compaction
as a first-class concept:

- Claude: pre-compact turns → compact_boundary marker → synthetic
  isCompactSummary user message → post-compact turns.
- Codex: response_items around a `compacted` rollout line.
- Pi: messages around an Entry::Compaction line (Pi's first-class
  compaction entry type, alongside BranchSummary).
- opencode: SQL fixture with a `compaction` PartData in mid-session.

Gemini is skipped — its format doesn't model context-overflow
compaction (the `summary` field is a sub-agent result, different
concept).

Each test asserts: fixture loads without panic on the compaction
line + pre-compact content survives roundtrip + post-compact content
survives + projector output re-parses through the harness's own
reader. Documented limitations (compact_boundary marker / isCompactSummary
flag / part.compaction ConversationEvent not surviving derive→extract)
are called out in module headers as acceptable losses for "good UX"
today — the surrounding conversation content is preserved, which is
what users actually read.

chore(test-fixtures): refresh real-world captures

Re-ran scripts/capture-elicit-fixtures.sh against current harness
versions. Notable shifts:

- Claude: 39 → 80 lines, 19 → 47 turns. Captured a real `Agent`
  delegation (toolu_01L4FfhHHhpph7aoab3dt4bn), the first fixture to
  exercise the new delegation invariants. Going Claude → Codex /
  Claude → opencode now stresses the soft cross-leg check (delegation
  agent_id surviving as a delegation OR as a tool_use call_id); both
  pass under the soft bar.
- Codex: smaller (8 turns) — `codex exec` mode lacks a sub-agent
  dispatch tool, so step 9 became "count the words yourself."
- Gemini, opencode: similar size to before.
- Pi: 22 turns, full prompt walked. Re-captured after a first run
  with a weak local model gave up early.

All 50+ test binaries pass with the refreshed fixtures.

feat(toolpath-convo): preserve view.events through Path roundtrip

derive_path was emitting steps for view.turns only; anything in
view.events (attachments, headerless preamble lines, provider-specific
non-turn entries) silently disappeared on the IR-to-Path-to-IR trip.
For sessions with rich provider extras this dropped 10–25% of the
source content before the projector ever saw it.

Now each event becomes a `conversation.event` step with its data
flattened into structural extras. Two housekeeping keys
(`event_data_type`, `event_source_id`) keep the original `type` field
out of `StructuralChange`'s `change_type` slot — without them, a
`type: "user_message"` event collides with serde's
`#[serde(rename = "type")]` on `change_type` and `Graph::from_json`
fails to disambiguate `PathOrRef`. extract_conversation strips both
keys back out and restores the `type` value into event.data so the
data round-trips clean.

Also tightens the ToolResult.content docstring to clarify it's the
model-visible text (paired with provider-specific display blobs that
live in event.data on the appropriate harness).

feat(claude): full-fidelity Claude UI rendering through IR roundtrip

Make `path import claude … && path export claude …` produce a JSONL
that resumes in Claude Code with all the original UI affordances
(diff view, shell output panel, file viewer, parent-chain navigation),
and make the same projection work for Codex/Gemini/opencode/Pi sources
that come through with non-Claude tool names.

Previously, opening a roundtripped session showed text-only assistant
turns: every diff and shell output was either dropped or rendered as
an opaque block. Every fix below is a layer of that.

## Self-roundtrip preservation (Claude → IR → Claude)

* **reader.rs**: headerless lines (ai-title, last-prompt,
  queue-operation, permission-mode, file-history-snapshot) used to be
  dropped — they have no uuid so the typed parse path skipped them.
  Now they're captured into `convo.preamble` as raw JSON values.

* **provider.rs**: surfaces preamble lines and message-less entries
  (attachments, snapshots) as `ConversationEvent`s with structured
  `event.data` keyed by their original fields. Folds the entry's
  typed `version`/`user_type`/`request_id` into `Turn.extra["claude"]`
  so projection can restore the request-correlation metadata.

* **derive.rs**: emits preamble lines as `conversation.event` steps so
  they survive Path roundtrip. Emits tool-result-only user entries as
  `tool_result_user` events (preserving original UUID + toolUseResult
  blob + entry_extra like promptId/slug) — without this, the entry
  vanishes and the next assistant's parentUuid points at a UUID we
  never re-emit, breaking the chain Claude's TUI traverses for
  rendering. Tracks `had_thinking_part` so encrypted-reasoning blocks
  (text-empty thinking with a signature) survive instead of being
  filtered by the empty-content skip. Uses `entry.parent_uuid` for
  step parents instead of the linear `last_step_id` chain — the linear
  form skipped over attachments and tool-result events, so subsequent
  assistant turns ended up with parent pointers that bypassed the
  entries they actually responded to.

* **project.rs**: routes `tool_result_user` events back to user entries
  with their original UUIDs, restores the `toolUseResult` blob
  verbatim, and rebuilds the JSONL preamble from preamble events.
  Falls back to per-tool-use synthesis only when no preserved event
  exists (cross-harness sources).

## Cross-harness Claude UI rendering (any harness → Claude)

* **provider.rs / project.rs**: adds `provider::native_name(category, args)`
  — the reverse of `tool_category` — and a `canonical_claude_tool_name`
  dispatcher in the projector. Mirrors the same pattern already on
  toolpath-codex / toolpath-opencode / toolpath-gemini / toolpath-pi.
  Claude's UI dispatches its rich result panes by literal tool name,
  so `exec_command` / `read_file` / `write_file` / `apply_patch`
  rendered as opaque blocks even when their toolUseResult was
  well-formed. Now they come through as `Bash` / `Read` / `Write` /
  `Edit` and the renderers fire.

* **project.rs**: `canonical_claude_tool_input` translates input keys
  to what each Claude tool's UI reads (`cmd`/`path`/`oldString` →
  `command`/`file_path`/`old_string`). `tool_use_result_from_invocation`
  builds the matching display blob per-category — Bash gets
  `{stdout, stderr, …}`, Edit gets `{filePath, oldString, newString,
  structuredPatch, …}` with hunks computed via `similar::TextDiff`,
  Read gets `{type: "text", file: {filePath, content, …}}`, Glob/Grep
  get `{filenames, numFiles, pattern}`.

* **project.rs**: tracks `parent_rewrites` so that when a synthesized
  tool-result entry is emitted between two assistant turns, the next
  turn's parentUuid points at the synthesized result instead of the
  prior assistant. Without this the synthesized result is orphaned.
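The `parent_rewrites` bookkeeping can be sketched as a small sequential projector. `Entry`, `Projector`, and the method names are hypothetical, and the sketch handles only one synthesized result per assistant turn.

```rust
use std::collections::HashMap;

/// Hypothetical projected JSONL entry, reduced to its chain fields.
struct Entry {
    uuid: String,
    parent_uuid: Option<String>,
}

#[derive(Default)]
struct Projector {
    entries: Vec<Entry>,
    /// assistant uuid -> uuid of the synthesized result emitted after it
    parent_rewrites: HashMap<String, String>,
}

impl Projector {
    /// Emit a regular turn, redirecting its parent if a synthesized
    /// tool-result entry was inserted after that parent.
    fn emit_turn(&mut self, uuid: &str, parent: Option<&str>) {
        let parent_uuid = parent.map(|p| {
            self.parent_rewrites
                .get(p)
                .cloned()
                .unwrap_or_else(|| p.to_string())
        });
        self.entries.push(Entry { uuid: uuid.to_string(), parent_uuid });
    }

    /// Emit a synthesized tool-result entry directly under its assistant
    /// turn and record the rewrite for whatever comes next.
    fn emit_synthesized_result(&mut self, assistant_uuid: &str, result_uuid: &str) {
        self.entries.push(Entry {
            uuid: result_uuid.to_string(),
            parent_uuid: Some(assistant_uuid.to_string()),
        });
        self.parent_rewrites
            .insert(assistant_uuid.to_string(), result_uuid.to_string());
    }
}
```

The result is a linear chain (assistant, synthesized result, next assistant) instead of two siblings hanging off the first assistant.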

## Codex enables extract to see its tool calls

* **toolpath-codex/derive.rs**: was writing `tool_calls` (a name+status
  summary), but `toolpath_convo::extract` reads `tool_uses` (full
  id/name/input/category/result). The mismatch dropped every Codex
  tool call at the cache boundary — `path export claude` showed text
  only. Adds the canonical `tool_uses` array; keeps the legacy
  `tool_calls` summary for human-readable consumers.
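
The key mismatch can be illustrated with a small sketch. All types here are hypothetical stand-ins: the field lists follow the commit message (name+status for the summary, id/name/input/category/result for the full record), but the struct layouts, the status strings, and the function names are assumptions, not the crates' real API.

```rust
/// Legacy human-readable summary (hypothetical layout).
#[derive(Debug, Clone, PartialEq)]
struct ToolCallSummary {
    name: String,
    status: String,
}

/// Full tool-use record (hypothetical layout; inputs kept as a plain string).
#[derive(Debug, Clone, PartialEq)]
struct ToolUse {
    id: String,
    name: String,
    input: String,
    category: String,
    result: Option<String>,
}

/// Per-step metadata carrying both arrays.
#[derive(Debug, Default)]
struct StepMeta {
    tool_calls: Vec<ToolCallSummary>,
    tool_uses: Vec<ToolUse>,
}

/// Stand-in for the reader side: it looks only at `tool_uses`, so a
/// writer that fills just `tool_calls` yields no tool calls at all.
fn extracted_tool_names(meta: &StepMeta) -> Vec<&str> {
    meta.tool_uses.iter().map(|t| t.name.as_str()).collect()
}

/// Stand-in for the writer side after the fix: emits the full records
/// and derives the legacy summary from them.
fn derive_meta(uses: Vec<ToolUse>) -> StepMeta {
    let tool_calls = uses
        .iter()
        .map(|t| {
            // Status strings are invented for this sketch.
            let status = if t.result.is_some() { "completed" } else { "pending" };
            ToolCallSummary { name: t.name.clone(), status: status.to_string() }
        })
        .collect();
    StepMeta { tool_calls, tool_uses: uses }
}
```

A `StepMeta` with only `tool_calls` populated gives an empty list from `extracted_tool_names`, which corresponds to the text-only export described above.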

## Dep

* **toolpath-claude/Cargo.toml**: adds the workspace `similar` dep,
  used by `structured_patch_hunks` in the projector.
Four tests, all on the captured real fixture:

* `read_then_project_preserves_line_count` — bluntest possible
  UX-loss check: source.lines() == projected.preamble + projected.entries.
  Catches silent drops at the reader, IR-conversion, or projection step.

* `read_then_project_preserves_metadata_entries` — attachments and
  preamble counts match source exactly. Locks the headerless-line and
  message-less-entry preservation.

* `read_then_project_preserves_tool_use_result_count` — every source
  entry with a `toolUseResult` blob still has one after roundtrip.
  Claude's UI uses this field for diff/shell-output rendering, so a
  drop here is a visible regression on resume.

* `cache_roundtrip_preserves_line_counts_per_type` — exercises the
  full path-cli flow: source JSONL → toolpath_claude::derive_path →
  cached Path JSON → toolpath_convo::extract → ClaudeProjector → JSONL.
  Asserts every entry type that appeared in source also appears in the
  projection. Catches whole-category drops (ai-title / last-prompt /
  file-history-snapshot regressions specifically).
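
The per-type check in the last test can be sketched as follows, assuming each JSONL line has already been reduced to its entry-type label. The helper names `count_by_type` and `dropped_types` are hypothetical and do not come from the test file.

```rust
use std::collections::BTreeMap;

/// Counts how many entries carry each type label.
fn count_by_type<'a>(types: &[&'a str]) -> BTreeMap<&'a str, usize> {
    let mut counts = BTreeMap::new();
    for t in types {
        *counts.entry(*t).or_insert(0) += 1;
    }
    counts
}

/// Returns every type that appears in the source but is entirely
/// missing from the projection (a whole-category drop).
fn dropped_types<'a>(source: &[&'a str], projected: &[&str]) -> Vec<&'a str> {
    let projected = count_by_type(projected);
    count_by_type(source)
        .into_keys()
        .filter(|t| !projected.contains_key(*t))
        .collect()
}
```

A roundtrip test in this style would assert that `dropped_types` returns an empty list; comparing the full count maps instead would be a stricter variant.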
pnpm/action-setup@v4 with version: latest started resolving to pnpm 11
after 11.0.0 went stable (2026-04-28); pnpm 11 require()s node:sqlite,
which needs Node >= 22.5, but the deploy job pins Node 20. Pin pnpm to
10 (compatible with the existing lockfileVersion 9.0).
@ben-emp force-pushed the ben/cross-harness-matrix branch from 1c186a9 to 9ce0d66 on May 11, 2026 15:13
@benbaarber
Collaborator

Comment thread crates/toolpath-claude/src/project.rs Outdated
Comment thread crates/toolpath-claude/src/project.rs Outdated
"last-prompt",
"queue-operation",
"permission-mode",
"file-history-snapshot",
Contributor

What happens when Claude adds a new preamble event type? Do we include it in the Toolpath?

Collaborator

We no longer track these; new preamble types would be treated the same.

Comment thread crates/toolpath-convo/src/derive.rs
…s lines verbatim

Headerless JSONL lines (ai-title, last-prompt, queue-operation,
permission-mode, file-history-snapshot, and anything unrecognized) are no
longer routed by an enumerated type list. conversation_to_view and
derive::derive_path now stash the whole line verbatim under
ConversationEvent.data["raw"] / the conversation.event step's extra["raw"];
project_view identifies a headerless event by that "raw" key and dumps the
line straight back onto convo.preamble. An unrecognized headerless line now
round-trips instead of being mangled through project_event into a malformed
entry. The permission-mode fallback (synthesized when the path carried no
real preamble) is unchanged.
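
The verbatim passthrough described above can be sketched as follows. This is a simplified model, assuming event data is a string map; the `Event` struct and the functions `stash_headerless` and `project_preamble` are hypothetical stand-ins, not the real `ConversationEvent` or `project_view`.

```rust
use std::collections::HashMap;

/// Simplified event: real events carry richer data; this sketch keeps
/// only a string map so the "raw" key convention is visible.
struct Event {
    data: HashMap<String, String>,
}

/// Read side: a headerless line is stored whole under "raw",
/// with no inspection of its type.
fn stash_headerless(line: &str) -> Event {
    let mut data = HashMap::new();
    data.insert("raw".to_string(), line.to_string());
    Event { data }
}

/// Projection side: any event that has a "raw" key is treated as
/// headerless and its line is copied straight back to the preamble.
fn project_preamble(events: &[Event]) -> Vec<String> {
    events
        .iter()
        .filter_map(|e| e.data.get("raw").cloned())
        .collect()
}
```

Because nothing in this path matches on the line's type, a line with an unrecognized type comes back byte-for-byte, which is the property the commit message relies on.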

Behavior of attachments / message-less entries (entry_to_event +
project_event) and tool-result-only user entries (tool_result_user events +
tool_result_event_to_entry, with the parent_rewrites chain-patching) is
untouched.
@ben-emp force-pushed the ben/cross-harness-matrix branch from 1e49061 to 386c530 on May 12, 2026 21:13
Comment thread crates/toolpath-claude/src/derive.rs
Comment thread crates/toolpath-claude/src/derive.rs
Comment thread crates/toolpath-codex/src/derive.rs Outdated
The codex derive shouldn't reach across to a sibling crate's projector in
a doc comment. Reword to describe the generic extract -> ConversationView
round-trip instead.
@ben-emp force-pushed the ben/cross-harness-matrix branch from e0aa776 to 76691de on May 13, 2026 15:27
Contributor

@akesling left a comment

LGTM

@benbaarber merged commit 34e0def into main on May 13, 2026
4 checks passed
3 participants