Skip to content

Add tracing spans to file compaction phases#28

Closed
emilk wants to merge 4 commits into
release-7.0.0from
emilk/trace-compaction-tasks
Closed

Add tracing spans to file compaction phases#28
emilk wants to merge 4 commits into
release-7.0.0from
emilk/trace-compaction-tasks

Conversation

@emilk

@emilk emilk commented Jun 25, 2026

Copy link
Copy Markdown
Member

What

compact_files (compact_files_with_planner) ran as a single opaque span in our telemetry. A multi-hour compaction pass showed up as one number with no way to see where the time went.

This breaks it into the three phases that actually do work, each with its own span:

Span Fields
plan_compaction num_tasks
rewrite_files (one per task, run concurrently) input_fragments, num_rows, can_binary_copy, new_fragments
commit_compaction num_tasks

Why

We hit a maintenance pass where a single dataset's compaction took ~80 minutes and there was zero visibility below the top-level call. With per-task spans, a slow pass can be attributed to a specific task and row count instead of guessed at, and the plan/rewrite/commit split shows which phase dominates.

This is purely additive instrumentation — no behavior change. Based on release-7.0.0 (the branch the Rerun dataplatform patches against).

🤖 Generated with Claude Code

emilk and others added 2 commits June 24, 2026 17:23
`compact_files` was a single opaque span in our telemetry — a multi-hour
compaction showed up as one number with no visibility into where the time
went. Break it into the three phases that actually do work:

- `plan_compaction` — wraps the planner, records `num_tasks`
- `rewrite_files` — one span per compaction task (these run concurrently),
  records `input_fragments`, `num_rows`, `can_binary_copy`, `new_fragments`
- `commit_compaction` — wraps index remapping + commit, records `num_tasks`

This turns the black box into a per-task duration distribution, so a slow
pass can be attributed to a specific task / row count rather than guessed at.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
`latest_manifest` (used by `checkout_latest`) resolves the newest version
before reading the manifest. For object-store commit handling that lists the
version directory, so its cost grows with the number of un-cleaned versions —
exactly the state that also breaks tagged-version cleanup. It was previously
invisible. Wrap it in an `info_span!("resolve_latest_location")` so a slow
version resolution shows up on its own in a trace.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@emilk emilk marked this pull request as ready for review June 25, 2026 00:50
A slow `rewrite_files` task only told us its size (rows / fragments), not
which phase of the task ate the time. Wrap each phase in its own span:

- `migrate_fragments`      — recomputes missing fragment stats (can do IO)
- `prepare_reader`         — builds the scan plan for the non-binary path
- `write_fragments`        — the streaming scan+write (reads are lazy and
  fused into the write, so they can't be timed apart here)
- `rewrite_files_binary_copy` — the fast metadata/binary copy path
- `rechunk_stable_row_ids` — stable-row-id post-processing

Now a slow task can be attributed to a phase (e.g. a slow scan+write vs a
slow stat migration) rather than just its row count.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@rerun-io rerun-io deleted a comment from github-actions Bot Jun 25, 2026
The CI workflows only triggered on branches matching 'main' or
'release/**'. The Rerun fork's release branches use a hyphen
(release-7.0.0), which 'release/**' does not match, so clippy/fmt/tests
never ran on PRs targeting them. Add a 'release-*' pattern to the push
and pull_request triggers.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@rerun-io rerun-io deleted a comment from github-actions Bot Jun 26, 2026
@amunra

amunra commented Jun 26, 2026

Copy link
Copy Markdown

I don't think this change is needed since we will be calling the lower-level compaction APIs as exposed in Lance anyhow. We'll be able to put the metrics within our own software.
IMO better do it there so we have less divergence of this code vs upstream.

@emilk emilk closed this Jun 26, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants