bench(reader): add ArrowReader benchmark harness #2558
Open
viirya wants to merge 2 commits into
There were no in-repo benchmarks for the ArrowReader, so every performance claim in the perf epic (apache#2172) had to be validated against external workloads. This adds a criterion harness that writes Parquet files to a local temp dir and reads them back through the normal FileIO path, measuring the per-FileScanTask overhead that dominates scans of tables with many small files.

Benchmark groups:
- many_small_files: scans 16/64/256 small files, reporting files/sec so per-file overhead is directly visible.
- concurrency: a fixed corpus read at concurrency 1/4/16, exercising both the single-concurrency fast path and the buffered/flattened multi-task path.
- migrated_table: files without embedded field IDs read via name mapping, isolating the migrated-table schema-resolution cost.
- same_file_splits: one multi-row-group file read as 1/8/32 byte-range tasks, surfacing the redundant per-split metadata fetch that metadata caching (item apache#5) targets.
- with_predicate: scans carrying a bound predicate with row-group filtering and row selection enabled, exercising the per-task row-filter setup.

This gives a reproducible baseline for evaluating reader optimizations such as operator caching (apache#2177) and metadata reuse.
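To make the files/sec idea concrete, here is a minimal std-only sketch of the measurement the many_small_files group is described as reporting. It is not the PR's harness: it uses plain byte files instead of Parquet, a single timed pass instead of criterion's statistical sampling, and the helper names `write_small_files` and `scan_files` are hypothetical.

```rust
use std::fs;
use std::io::Write;
use std::path::{Path, PathBuf};
use std::time::Instant;

/// Write `count` small files of `bytes_per_file` bytes into `dir`.
/// (Hypothetical helper; the real harness writes Parquet files.)
fn write_small_files(dir: &Path, count: usize, bytes_per_file: usize) -> std::io::Result<Vec<PathBuf>> {
    fs::create_dir_all(dir)?;
    let payload = vec![0xABu8; bytes_per_file];
    let mut paths = Vec::with_capacity(count);
    for i in 0..count {
        let path = dir.join(format!("part-{i:05}.bin"));
        fs::File::create(&path)?.write_all(&payload)?;
        paths.push(path);
    }
    Ok(paths)
}

/// Read every file once and return (total bytes read, files per second).
fn scan_files(paths: &[PathBuf]) -> std::io::Result<(u64, f64)> {
    let start = Instant::now();
    let mut total = 0u64;
    for path in paths {
        total += fs::read(path)?.len() as u64;
    }
    // Guard against a zero elapsed time on very fast runs.
    let secs = start.elapsed().as_secs_f64().max(f64::MIN_POSITIVE);
    Ok((total, paths.len() as f64 / secs))
}

fn main() -> std::io::Result<()> {
    // Same file counts as the many_small_files group: 16/64/256.
    for count in [16usize, 64, 256] {
        let dir = std::env::temp_dir()
            .join(format!("small_files_sketch_{}_{count}", std::process::id()));
        let paths = write_small_files(&dir, count, 4096)?;
        let (bytes, files_per_sec) = scan_files(&paths)?;
        println!("{count:>4} files: {bytes} bytes, {files_per_sec:.0} files/sec");
        fs::remove_dir_all(&dir)?;
    }
    Ok(())
}
```

Reporting throughput in files rather than bytes keeps the per-file cost visible: if per-file overhead dominates, files/sec stays roughly flat as the file count grows.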
Wall-clock time on the local FS cannot show the benefit of round-trip-saving optimizations, but it also undersells the same-file-split problem, whose cost is redundant *bytes fetched* rather than latency. Report ScanMetrics::bytes_read as the same_file_splits throughput basis (a deterministic count, not a sample), and print the read amplification per split count. Reading one ~1.2MB file as 32 byte-range splits fetches ~15.7x the file size, because each split re-fetches the Parquet metadata independently. This is the machine-independent evidence for the cost that same-file metadata caching (proposed in apache#2172, attempted in apache#2100) targets; a caching implementation would drive the amplification back toward 1x.
Which issue does this PR close?
What changes are included in this PR?
Adds a criterion-based benchmark harness for the ArrowReader at crates/iceberg/benches/arrow_reader.rs. Until now there were no in-repo reader benchmarks, so every performance claim in the perf epic (#2172) had to be validated against external workloads. This harness writes Parquet files to a local temp dir and reads them back through the normal FileIO path, measuring the per-FileScanTask overhead that dominates scans of tables with many small files. Because it runs on the local FS, it isolates CPU / per-task work rather than network latency.

Benchmark groups, chosen to map onto the epic's code paths:

create_parquet_record_batch_stream_builder() call for migrated tables #2176 path).

This adds criterion (with the async_tokio feature) as a workspace dev-dependency and a [[bench]] entry to the iceberg crate.

Measuring bytes_read, not just time

Wall-clock time on the local FS cannot show the benefit of optimizations that save round-trips (e.g. the metadata size hint #2173 or range coalescing #2181); those need a latency-bearing backend, which is left as follow-up. But the same-file-split problem is different: its cost is redundant bytes fetched, not latency, so it is measurable even locally.

The same_file_splits group therefore reports ScanMetrics::bytes_read as its throughput basis (a deterministic count, not a sampled measurement) and prints the read amplification per split count. On a ~1.2 MiB file, reading it as 32 byte-range splits fetches ~15.7x the file size, because each split re-fetches the Parquet metadata independently. This is machine-independent evidence for the cost that same-file metadata caching targets. #2100 was closed partly because its author could not demonstrate a benefit on their workload (a ~1:1 task-to-file table, where same-file caching has nothing to hit); with this benchmark a caching implementation would show the amplification dropping back toward 1x.
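The read-amplification figure is bytes read divided by file size. The sketch below works through that arithmetic under a deliberately simplified model: the data region is divided into contiguous byte ranges, and each task reads its own range plus one full copy of the metadata. The function names, the model, and the sizes in `main` are all hypothetical, and the sketch is not expected to reproduce the ~15.7x measured by the harness, which depends on how the real reader fetches metadata.

```rust
/// Split `len` bytes into `n` contiguous (start, length) ranges.
/// Any remainder is spread over the first ranges, one byte each.
fn byte_range_splits(len: u64, n: u64) -> Vec<(u64, u64)> {
    assert!(n > 0, "need at least one split");
    let base = len / n;
    let rem = len % n;
    let mut start = 0u64;
    (0..n)
        .map(|i| {
            let this_len = base + if i < rem { 1 } else { 0 };
            let range = (start, this_len);
            start += this_len;
            range
        })
        .collect()
}

/// Simplified model (an assumption, not the reader's actual I/O pattern):
/// every task reads its own data range plus one full copy of the metadata.
fn modeled_bytes_read(splits: &[(u64, u64)], metadata_len: u64) -> u64 {
    splits.iter().map(|&(_, len)| len + metadata_len).sum()
}

/// Read amplification: bytes fetched relative to the size of the file.
fn read_amplification(bytes_read: u64, file_len: u64) -> f64 {
    bytes_read as f64 / file_len as f64
}

fn main() {
    // Hypothetical sizes chosen for easy arithmetic, not taken from the PR.
    let data_len: u64 = 950_000;
    let metadata_len: u64 = 50_000;
    let file_len = data_len + metadata_len;

    // Same split counts as the same_file_splits group: 1/8/32.
    for n in [1u64, 8, 32] {
        let splits = byte_range_splits(data_len, n);
        let bytes = modeled_bytes_read(&splits, metadata_len);
        println!(
            "{n:>2} splits: {bytes} bytes read, {:.2}x amplification",
            read_amplification(bytes, file_len)
        );
    }
}
```

In this model the amplification grows linearly with the split count, and a cache that fetched the metadata once per file would hold it at 1x regardless of how many splits are scheduled.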
Run with:
Are these changes tested?
This is a benchmark, not a code change. The harness compiles and runs (cargo bench -p iceberg --bench arrow_reader), and reuses the same Parquet-writing and reader-driving patterns as the existing reader unit tests. cargo fmt and cargo clippy -p iceberg --benches are clean.