Skip to content

feat: add referenced_files_size and physical_files_size table functions#323

Open
JingsongLi wants to merge 10 commits into
apache:mainfrom
JingsongLi:files_size
Open

feat: add referenced_files_size and physical_files_size table functions#323
JingsongLi wants to merge 10 commits into
apache:mainfrom
JingsongLi:files_size

Conversation

@JingsongLi
Copy link
Copy Markdown
Contributor

Purpose

Introduce two DataFusion table-valued functions for storage analysis:

  • referenced_files_size('table'): Aggregates manifest/data/index file counts and sizes across all snapshots from main branch, tags, and other branches. Output rows: total, branch:main, branch:. Uses a shared manifest cache to avoid redundant reads and processes snapshots concurrently.

  • physical_files_size('table'): Scans the table directory recursively and reports actual physical file sizes categorized by file type (manifest, data, index). Concurrently lists subdirectories for high throughput on object stores.

Both functions gracefully handle NotFound errors from concurrently deleted files during cleanup.

Brief change log

Tests

API and Format

Documentation

JingsongLi and others added 3 commits May 18, 2026 11:29
Introduce two DataFusion table-valued functions for storage analysis:

- `referenced_files_size('table')`: Aggregates manifest/data/index file
  counts and sizes across all snapshots from main branch, tags, and other
  branches. Output rows: total, branch:main, branch:<name>. Uses a shared
  manifest cache to avoid redundant reads and processes snapshots concurrently.

- `physical_files_size('table')`: Scans the table directory recursively and
  reports actual physical file sizes categorized by file type (manifest,
  data, index). Concurrently lists subdirectories for high throughput on
  object stores.

Both functions gracefully handle NotFound errors from concurrently deleted
files during cleanup.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
source: source.to_string(),
..Default::default()
};
for s in per_snapshot.into_iter().flatten() {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A newer Paimon snapshot carries previous base/delta manifests into its new base manifest list, so the same still-referenced manifest/data/index file can appear in many snapshot summaries. Since this function is documented as comparable with physical size for orphan-file analysis, summing snapshots can make referenced size grow with snapshot count and even exceed physical size. This should collect a de-duplicated referenced file set by file identity/path first, then aggregate sizes from that set.

let all_manifest_metas: Vec<&ManifestFileMeta> =
manifest_lists.iter().flat_map(|ml| ml.iter()).collect();

summary.manifest_file_count = all_manifest_metas.len() as i64;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code only counts ManifestFileMeta entries inside the manifest lists, but the files named by snapshot.base_manifest_list(), delta_manifest_list(), and changelog_manifest_list() are themselves snapshot-referenced metadata files. The snapshot.index_manifest() file is also read below but not counted as a referenced manifest file. Since physical_files_size classifies both manifest-list-* and index-manifest-* as manifest files, the two functions cannot be safely compared until these directly referenced metadata files are included and de-duplicated.


// Store results in cache
let mut cache = manifest_cache.lock().unwrap();
for (path, entries) in uncached_paths.into_iter().zip(results) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Manifest entries carry FileKind::Add/Delete, and after overwrite/compaction the same data file can be present as an ADD from a historical base manifest and as a DELETE in the current delta manifest. Java's orphan cleaner uses a used-file-name set, while scan semantics first merge ADD/DELETE entries before reading live files; this code does neither and simply adds entry count and file size. Please define whether this function reports cleanup-protected referenced files or live readable files, then aggregate from a de-duplicated file set or from merged ADD/DELETE entries instead of raw entries.

JingsongLi and others added 2 commits May 18, 2026 16:06
…ifest

Address review feedback:
- Use deduplicated file sets (HashMap<file_name, file_size>) instead of
  summing per-snapshot counts, so referenced size never exceeds physical size
- Include manifest list files and index manifest files in the manifest count,
  consistent with physical_files_size classification
- Include both ADD and DELETE manifest entries since both reference physical
  files protected from cleanup until the snapshot expires
- Cache stores per-manifest data file entries for deduplication across snapshots

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown

@leaves12138 leaves12138 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding these storage-inspection table functions. I reviewed the latest head (d6c3ca0) and found one correctness issue that should be fixed before merging.

Blocking: referenced_files_size does not include tags under non-main branches.

collect_referenced_files_summary merges tags only for the main table path (collect_tag_files(file_io, &sm, &tm, ...)), but for each branch it only calls collect_scope_files on the branch SnapshotManager. A branch can contain tag files under branch/branch-<name>/tag/, and those tag snapshots still protect their referenced manifest/data/index files from cleanup. Java OrphanFilesClean.safelyGetAllSnapshots(branch) handles this by switching to each branch and collecting snapshots, tagged snapshots, and changelogs for that branch.

I reproduced this locally with a temporary memory-table test that creates branch b1 with only branch/branch-b1/tag/tag-v1 referencing a manifest list/manifest, and no branch snapshot file. collect_referenced_files_summary returned branch:b1 with all zero counts, so branch-tag-only referenced files were missed.

Please merge branch tag files into each branch scope, e.g. by using TagManager::with_branch(branch_name) together with sm.with_branch(branch_name), and add a regression test for a branch whose referenced files are only reachable through a branch tag.

Validation I ran locally:

  • cargo fmt --all -- --check: passed.
  • cargo clippy -p paimon -p paimon-datafusion --all-targets -- -D warnings: passed.
  • cargo test -p paimon referenced_files -- --nocapture: passed.
  • A targeted temporary branch-tag repro failed as described above.
  • Full workspace clippy is blocked in my local environment by Python 3.6 vs pyo3 abi3-py310; cargo test -p paimon-datafusion also has unrelated fixture failures for missing partitioned_log_table tables.

JingsongLi and others added 3 commits May 18, 2026 16:53
Use TagManager::with_branch() to also collect tag snapshots under each
branch directory (branch/branch-<name>/tag/). Without this, files
referenced only through a branch tag were missed.

Added regression test: test_branch_tag_referenced_files.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Main branch snapshots and tags now run concurrently via tokio::try_join
- All branches processed concurrently via try_join_all
- Within each branch, snapshots and branch-tags run concurrently
- Tags within a scope processed concurrently via try_join_all

No serial loops remain in the collection pipeline.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
for (path, entries) in uncached_paths.into_iter().zip(results) {
let file_entries: Vec<(String, i64)> = entries
.iter()
.map(|e| (e.file().file_name.clone(), e.file().file_size))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ADD/DELETE semantics look clearer now, but this still only adds file_name to the referenced data-file set.

Should we also include e.file().extra_files here? Java orphan cleanup treats extra files as used files too, and otherwise file-index / blob / lookup-related files can appear as orphan delta when comparing with physical_files_size.

DataFileMeta.extra_files (file-index, lookup files, etc.) are also
physical files protected from cleanup. Include them in the referenced
file set so they are not misidentified as orphans when compared with
physical_files_size.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
let mut file_entries: Vec<(String, i64)> = Vec::new();
for e in &entries {
file_entries.push((e.file().file_name.clone(), e.file().file_size));
for extra in &e.file().extra_files {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

extra_files are included now, but counting them as 0 bytes still undercounts data_file_size.

These files are protected physical files too, and referenced_files_size is meant to be compared with physical_files_size for orphan analysis. Can we stat the aligned extra-file paths here and use their real sizes instead?

.unwrap();

// Should have: total, branch:main, branch:b1
assert_eq!(result.len(), 3);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test does not really prove that branch-tag files are collected.

branch:b1 is listed from the branch directory even without merging branch tags, and the tag's manifest lists do not exist, so the counts stay zero. A stronger regression test would create readable manifest-list / manifest files reachable only through the branch tag and assert a non-zero count or size for branch:b1.

let branch_tm = tm.with_branch(branch_name);
async move {
let (mut branch_files, branch_tag_files) = tokio::try_join!(
collect_scope_files(file_io, &branch_sm, manifest_cache_ref),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Branch snapshots are collected now, but their manifest files are still resolved through the branch SnapshotManager.

branch_sm points to table/branch/branch-<name>, so manifest_path() looks under branch/branch-<name>/manifest. Paimon branches only keep branch-local snapshot/tag/schema files there; the manifest/data/index files remain under the table root. Because missing manifest lists are treated as empty, this can silently report zero referenced files for branch snapshots or branch tags.

We should read branch snapshot/tag files from the branch path, but resolve manifest paths from the table root.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants