Add tracing spans to file compaction phases#28
Closed
emilk wants to merge 4 commits into
Closed
Conversation
`compact_files` was a single opaque span in our telemetry — a multi-hour compaction showed up as one number with no visibility into where the time went. Break it into the three phases that actually do work: - `plan_compaction` — wraps the planner, records `num_tasks` - `rewrite_files` — one span per compaction task (these run concurrently), records `input_fragments`, `num_rows`, `can_binary_copy`, `new_fragments` - `commit_compaction` — wraps index remapping + commit, records `num_tasks` This turns the black box into a per-task duration distribution, so a slow pass can be attributed to a specific task / row count rather than guessed at. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
`latest_manifest` (used by `checkout_latest`) resolves the newest version
before reading the manifest. For object-store commit handling that lists the
version directory, so its cost grows with the number of un-cleaned versions —
exactly the state that also breaks tagged-version cleanup. It was previously
invisible. Wrap it in an `info_span!("resolve_latest_location")` so a slow
version resolution shows up on its own in a trace.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A slow `rewrite_files` task only told us its size (rows / fragments), not which phase of the task ate the time. Wrap each phase in its own span: - `migrate_fragments` — recomputes missing fragment stats (can do IO) - `prepare_reader` — builds the scan plan for the non-binary path - `write_fragments` — the streaming scan+write (reads are lazy and fused into the write, so they can't be timed apart here) - `rewrite_files_binary_copy` — the fast metadata/binary copy path - `rechunk_stable_row_ids` — stable-row-id post-processing Now a slow task can be attributed to a phase (e.g. a slow scan+write vs a slow stat migration) rather than just its row count. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The CI workflows only triggered on branches matching 'main' or 'release/**'. The Rerun fork's release branches use a hyphen (release-7.0.0), which 'release/**' does not match, so clippy/fmt/tests never ran on PRs targeting them. Add a 'release-*' pattern to the push and pull_request triggers. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
I don't think this change is needed since we will be calling the lower-level compaction APIs as exposed in Lance anyhow. We'll be able to put the metrics within our own software. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
compact_files(compact_files_with_planner) ran as a single opaque span in our telemetry. A multi-hour compaction pass showed up as one number with no way to see where the time went.This breaks it into the three phases that actually do work, each with its own span:
plan_compactionnum_tasksrewrite_files(one per task, run concurrently)input_fragments,num_rows,can_binary_copy,new_fragmentscommit_compactionnum_tasksWhy
We hit a maintenance pass where a single dataset's compaction took ~80 minutes and there was zero visibility below the top-level call. With per-task spans, a slow pass can be attributed to a specific task and row count instead of guessed at, and the plan/rewrite/commit split shows which phase dominates.
This is purely additive instrumentation — no behavior change. Based on
release-7.0.0(the branch the Rerun dataplatform patches against).🤖 Generated with Claude Code