Increase the throughput of the `validate_duplicate_files` by tomighita · Pull Request #2296 · apache/iceberg-rust

tomighita · 2026-03-30T11:49:36Z

Which issue does this PR close?

Closes Improve throughput of validate_duplicate_files #2295

What changes are included in this PR?

Increase the throughput of validate_duplicate_files by starting all manifest load requests up front and polling them concurrently, rather than fetching each file sequentially.
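
The difference between the two strategies can be sketched with the standard library alone. This is only an illustration of "start everything, then wait": it swaps in scoped OS threads for the async tasks the PR uses, and `load_sequential`, `load_concurrent`, and the `load` callback are hypothetical names rather than anything from iceberg-rust.

```rust
use std::thread;

/// Sequential baseline: each load starts only after the previous one finishes.
fn load_sequential(
    paths: &[&str],
    load: impl Fn(&str) -> Result<String, String>,
) -> Result<Vec<String>, String> {
    paths.iter().copied().map(|p| load(p)).collect()
}

/// Concurrent variant: start every load first, then wait for all of them.
/// Scoped threads stand in for async tasks here (an assumption of this sketch).
fn load_concurrent(
    paths: &[&str],
    load: impl Fn(&str) -> Result<String, String> + Sync,
) -> Result<Vec<String>, String> {
    thread::scope(|s| {
        let load = &load;
        // Collecting into a Vec forces all threads to be spawned before any join.
        let handles: Vec<_> = paths
            .iter()
            .copied()
            .map(|p| s.spawn(move || load(p)))
            .collect();
        // Joining in input order keeps results in input order; the first error wins.
        handles
            .into_iter()
            .map(|h| h.join().unwrap_or_else(|_| Err("load panicked".to_string())))
            .collect()
    })
}
```

Both functions return the same values for the same inputs; only the waiting pattern differs, which is why the description argues that the existing tests still apply.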

Are these changes tested?

No extra tests are needed: the functionality should be equivalent, and the existing tests should already cover this behaviour.

CTTY · 2026-03-30T22:15:26Z

+                .entries()
+                .iter()
+                .map(|entry| entry.load_manifest(file_io))
+                .collect();


Should we use buffer_unordered here? This is an IO operation, and too many concurrent requests may overwhelm the storage backend.

.try_buffer_unordered(32) should keep most object stores happy.
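
To make the idea of capping in-flight requests concrete without depending on the `futures` crate, here is a minimal sketch that uses a fixed pool of OS threads pulling from a shared queue. It is a stand-in for `try_buffer_unordered`, not an equivalent of it, and `load_all_bounded` and its `load` callback are hypothetical.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Runs `load` over `paths` with at most `limit` loads in flight at once.
/// The order of the returned values is not guaranteed to match the input order.
fn load_all_bounded<F>(paths: Vec<String>, limit: usize, load: F) -> Result<Vec<String>, String>
where
    F: Fn(&str) -> Result<String, String> + Send + Sync + 'static,
{
    let queue = Arc::new(Mutex::new(paths.into_iter()));
    let load = Arc::new(load);

    // One worker per unit of allowed concurrency (at least one).
    let workers: Vec<_> = (0..limit.max(1))
        .map(|_| {
            let queue = Arc::clone(&queue);
            let load = Arc::clone(&load);
            thread::spawn(move || -> Result<Vec<String>, String> {
                let mut out = Vec::new();
                loop {
                    // Hold the lock only long enough to take the next path.
                    let next = queue.lock().unwrap().next();
                    match next {
                        // A failed load stops this worker; the others drain the queue.
                        Some(path) => out.push((*load)(path.as_str())?),
                        None => return Ok(out),
                    }
                }
            })
        })
        .collect();

    let mut all = Vec::new();
    for worker in workers {
        all.extend(worker.join().map_err(|_| "worker panicked".to_string())??);
    }
    Ok(all)
}
```

With `limit` set to 32, no more than 32 loads run at any moment, which is the property the review comment asks for.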

tomighita · 2026-04-01T08:35:16Z

Thanks for your suggestion, @CTTY! I have also incorporated @liurenjie1024's feedback from Slack, but I am a bit concerned about allocating threads without being explicit. For instance, in other places we set the thread count explicitly [ref].

Any thoughts?

github-actions · 2026-05-02T00:36:27Z

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

tomighita · 2026-05-04T10:50:00Z

Can anyone take a look over this, so we can get this merged?

emkornfield · 2026-05-04T16:39:32Z

+                    let file_io = self.table.file_io().clone();
+                    spawn(async move { entry.load_manifest(&file_io).await })
+                })
+                .buffer_unordered(32)


Should 32 be a shared constant someplace?

+1 on making this a constant

Good point. Moved. If you don't love the name, feel free to suggest a new one 😅

emkornfield

One minor question: I don't know whether the constant should be more centralized (e.g. do we want to limit all file operations to the same parallelism?), but that could be refactored later.

emkornfield · 2026-05-05T08:33:15Z

 use crate::{Error, ErrorKind, TableRequirement, TableUpdate};

 const META_ROOT_PATH: &str = "metadata";
+const NUM_THREADS_VALIDATE_DUPLICATE_FILES: usize = 32;


Maybe add a doc comment explaining why it is 32.
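
A doc comment along the following lines would record the rationale. The wording is an illustrative guess based on the earlier review discussion (32 was proposed so that object stores are not overwhelmed); it is not the text that was actually committed.

```rust
/// Upper bound on the number of manifest loads kept in flight at once while
/// `validate_duplicate_files` runs.
///
/// Hypothetical wording: 32 follows the review suggestion that this level of
/// concurrency should not overwhelm typical object stores. It is a heuristic,
/// not a measured optimum.
const NUM_THREADS_VALIDATE_DUPLICATE_FILES: usize = 32;
```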

tomighita · 2026-05-05T11:46:37Z

Ty for the review! 🥳

One minor question: I don't know whether the constant should be more centralized (e.g. do we want to limit all file operations to the same parallelism?), but that could be refactored later.

@emkornfield I like the idea of centralising it, but I am afraid this does not translate well to other operations here. I would be in favour of refactoring this later and reusing it where needed, if needed.

blackmwk · 2026-05-19T10:39:10Z

+            futures::stream::iter(entries)
+                .map(|entry| {
+                    let file_io = self.table.file_io().clone();
+                    spawn(async move { entry.load_manifest(&file_io).await })


We should use the newly added Runtime API rather than a raw tokio call.

blackmwk · 2026-05-19T10:41:45Z

+                    let file_io = self.table.file_io().clone();
+                    spawn(async move { entry.load_manifest(&file_io).await })
+                })
+                .buffer_unordered(NUM_THREADS_VALIDATE_DUPLICATE_FILES)


We will not need this if we use the Runtime API.

blackmwk · 2026-05-19T10:45:35Z

+                    spawn(async move { entry.load_manifest(&file_io).await })
+                })
+                .buffer_unordered(NUM_THREADS_VALIDATE_DUPLICATE_FILES)
+                .try_for_each(|manifest| {


This is not idiomatic Rust. You could simply do the following:

let referenced_files = streams.flat_map(_.entries().map(_.file_path).filter(...).collect();

CTTY · 2026-05-20T22:00:16Z

+                                })
+                                .collect::<Vec<_>>(),
+                        )
+                        .then(|handle| async move { handle.await? })


I think this will resolve each handle in stream order. Since we don't really care about the order of results here, try_join_all would be better. It would also be more readable.

For the filtering logic we can still use a for-loop, since it's not really the bottleneck here.

+1, this actually limited the concurrency.

blackmwk · 2026-05-21T02:53:11Z

+                                        .io()
+                                        .spawn(async move { entry.load_manifest(&file_io).await })
+                                })
+                                .collect::<Vec<_>>(),


Do we really need to collect to vec here?

blackmwk · 2026-05-21T03:00:06Z

+                                })
+                                .collect::<Vec<_>>(),
+                        )
+                        .then(|handle| async move { handle.await? })


+1, this actually limited the concurrency.

blackmwk · 2026-05-21T03:03:15Z

+                                .collect::<Vec<_>>(),
+                        )
+                        .then(|handle| async move { handle.await? })
+                        .try_fold(Vec::new(), |mut acc, manifest| {


This is unnecessarily complicated. The whole thing could be simplified as:

stream.flat_map_unordered().filter().collect()

Replaces the stream + then + try_fold chain with try_join_all and a plain for-loop. then resolved handles in stream order which limited concurrency; try_join_all lets all spawned manifest loads complete in parallel. The filtering loop runs in microseconds and isn't a hot path, so the combinator chain wasn't earning its complexity. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Each future now returns the filtered file paths from its own manifest; try_join_all yields Vec<Vec<String>> which is flattened. Avoids materializing all Manifests before filtering. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
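
The "filter per manifest, then flatten" shape described in this commit message can be sketched without any async machinery. The `Manifest` struct below is a hypothetical stand-in that holds only file paths, and `referenced_in` and `flatten_referenced` are illustrative names, not functions from iceberg-rust.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for a loaded manifest: only the data file paths it references.
struct Manifest {
    file_paths: Vec<String>,
}

/// What each per-manifest task would return: the paths in this manifest
/// that collide with the files being added.
fn referenced_in(manifest: &Manifest, new_files: &HashSet<&str>) -> Vec<String> {
    manifest
        .file_paths
        .iter()
        .filter(|p| new_files.contains(p.as_str()))
        .cloned()
        .collect()
}

/// Flattens the per-manifest results (a Vec<Vec<String>>) into one list.
fn flatten_referenced(per_manifest: Vec<Vec<String>>) -> Vec<String> {
    per_manifest.into_iter().flatten().collect()
}
```

Because each task keeps only the matching paths, the full manifests never need to be held in memory together.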

tomighita · 2026-06-02T13:56:15Z

Sorry folks, I was a bit caught up with some other stuff and could not address this. I have changed it to collect into a FuturesUnordered, but I am not sure how to rewrite the try_fold bits as @blackmwk suggested... 😓

CTTY

I was thinking of something more like the following, to make it slightly more readable:

let manifests: Vec<Manifest> = manifest_list
    .consume_entries()
    .into_iter()
    .map(|entry| {
        let file_io = file_io.clone();
        runtime
            .io()
            .spawn(..)
    })
    .collect::<FuturesUnordered<_>>()
    .await?;

let referenced_files: Vec<String> = manifests
    .iter()
    .flat_map(|m| m.entries())
    .filter(....)
    .collect();

But the existing code looks good to me!

tomighita and others added 2 commits March 30, 2026 14:31

Implement futures unordered when checking

1d10f6b

Merge branch 'main' into tomighita/increase-duplicate-check-throughput

96777d5

tomighita marked this pull request as ready for review March 30, 2026 12:22

CTTY reviewed Mar 30, 2026

tomighita added 2 commits March 31, 2026 13:52

Change to task per fetch

65da6db

Move to buffered and threads

b9d544f

tomighita requested a review from CTTY April 1, 2026 10:28

github-actions Bot added the stale label May 2, 2026

Merge branch 'main' into tomighita/increase-duplicate-check-throughput

9319652

emkornfield reviewed May 4, 2026

github-actions Bot removed the stale label May 5, 2026

tomighita and others added 2 commits May 5, 2026 10:09

Move num threads to constant

5a91720

Merge branch 'main' into tomighita/increase-duplicate-check-throughput

3c2da2c

tomighita requested a review from emkornfield May 5, 2026 07:15

emkornfield approved these changes May 5, 2026

emkornfield reviewed May 5, 2026

Add docstring to explain the motivation of the variable and its value

84f65d2

tomighita and others added 3 commits May 12, 2026 10:36

Merge branch 'main' into tomighita/increase-duplicate-check-throughput

ec66e87

Trigger Build

da2e742

Merge branch 'main' into tomighita/increase-duplicate-check-throughput

c0fc03f

blackmwk reviewed May 19, 2026

tomighita and others added 2 commits May 19, 2026 14:53

Merge branch 'main' into tomighita/increase-duplicate-check-throughput

dd72bb9

Use the new runtime api and spawn all tasks at once

bb2119e

tomighita requested a review from blackmwk May 19, 2026 13:31

tomighita added 2 commits May 20, 2026 09:33

Merge branch 'main' into tomighita/increase-duplicate-check-throughput

4582374

fix clippy

b202048

CTTY reviewed May 20, 2026

blackmwk reviewed May 21, 2026

tomighita and others added 4 commits May 22, 2026 13:58

Increase throughput with try_concat

69e425c

Merge branch 'main' into tomighita/increase-duplicate-check-throughput

f994974

tomighita requested review from CTTY and blackmwk June 2, 2026 13:56

CTTY approved these changes Jun 3, 2026

4 participants
