bench(reader): add ArrowReader benchmark harness #2558

Open: viirya wants to merge 2 commits into apache:main from viirya:bench/arrow-reader-perf

@viirya
Member

viirya commented May 31, 2026

Which issue does this PR close?

What changes are included in this PR?

Adds a criterion-based benchmark harness for the ArrowReader at crates/iceberg/benches/arrow_reader.rs. Until now there were no in-repo reader benchmarks, so every performance claim in the perf epic (#2172) had to be validated against external workloads. This harness writes Parquet files to a local temp dir and reads them back through the normal FileIO path, measuring the per-FileScanTask overhead that dominates scans of tables with many small files. Because it runs on the local FS, it isolates CPU / per-task work rather than network latency.
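
The write-then-read-back pattern described above can be sketched with the standard library alone. This is not the harness's code: it uses plain byte files instead of Parquet, `std::fs` instead of FileIO, and a single `Instant` measurement instead of criterion sampling. The names `scan_small_files` and `ScanStats` are hypothetical, chosen only for illustration.

```rust
use std::fs;
use std::io::{self, Read, Write};
use std::path::PathBuf;
use std::time::{Duration, Instant, SystemTime, UNIX_EPOCH};

/// Result of one timed pass over a corpus of small files (hypothetical type).
pub struct ScanStats {
    pub files: usize,
    pub bytes_read: u64,
    pub elapsed: Duration,
}

/// Writes `file_count` files of `file_size` bytes into a fresh temp directory,
/// reads each one back in full, and removes the directory afterwards.
/// If an I/O error occurs midway, the directory is left behind.
pub fn scan_small_files(file_count: usize, file_size: usize) -> io::Result<ScanStats> {
    // Unique directory name so concurrent runs do not collide.
    let nanos = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|d| d.as_nanos())
        .unwrap_or(0);
    let dir: PathBuf =
        std::env::temp_dir().join(format!("small-files-{}-{}", std::process::id(), nanos));
    fs::create_dir_all(&dir)?;

    // Setup phase (not timed): create the corpus.
    let payload = vec![0xABu8; file_size];
    let mut paths = Vec::with_capacity(file_count);
    for i in 0..file_count {
        let path = dir.join(format!("part-{i:05}.bin"));
        fs::File::create(&path)?.write_all(&payload)?;
        paths.push(path);
    }

    // Measured phase: open and fully read every file, one "task" per file.
    let start = Instant::now();
    let mut bytes_read = 0u64;
    let mut buf = Vec::with_capacity(file_size);
    for path in &paths {
        buf.clear();
        bytes_read += fs::File::open(path)?.read_to_end(&mut buf)? as u64;
    }
    let elapsed = start.elapsed();

    fs::remove_dir_all(&dir)?;
    Ok(ScanStats { files: file_count, bytes_read, elapsed })
}
```

Dividing `files` by `elapsed.as_secs_f64()` gives a files/sec figure. Keeping corpus creation outside the timed region is what lets the per-file open-and-read cost show up on its own.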

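Benchmark groups, chosen to map onto the epic's code paths:

- many_small_files: scans 16/64/256 small files and reports files/sec, so per-file overhead is directly visible.
- concurrency: reads a fixed corpus at concurrency 1/4/16, exercising both the single-concurrency fast path and the buffered/flattened multi-task path.
- migrated_table: reads files without embedded field IDs via name mapping, isolating the migrated-table schema-resolution cost.
- same_file_splits: reads one multi-row-group file as 1/8/32 byte-range tasks, surfacing the redundant per-split metadata fetch that metadata caching targets.
- with_predicate: runs scans carrying a bound predicate with row-group filtering and row selection enabled, exercising the per-task row-filter setup.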

This adds criterion (with the async_tokio feature) as a workspace dev-dependency and a [[bench]] entry to the iceberg crate.
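
For readers unfamiliar with criterion wiring, the manifest changes described above typically look something like the fragment below. This is a sketch, not the PR's diff: the criterion version number is an assumption, and the exact layout of the workspace manifest may differ. The `harness = false` line is what criterion requires so that its own `main` replaces the default libtest bench harness.

```toml
# Workspace root Cargo.toml (the version number is an assumption)
[workspace.dependencies]
criterion = { version = "0.5", features = ["async_tokio"] }

# crates/iceberg/Cargo.toml
[dev-dependencies]
criterion = { workspace = true }

[[bench]]
name = "arrow_reader"   # matches benches/arrow_reader.rs
harness = false         # let criterion provide the bench entry point
```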

Measuring bytes_read, not just time

Wall-clock time on the local FS cannot show the benefit of optimizations that save round-trips (e.g. the metadata size hint #2173 or range coalescing #2181); those need a latency-bearing backend, which is left as follow-up. The same-file-split problem is different: its cost is redundant bytes fetched, not latency, so it is measurable even locally.
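
The distinction can be made concrete with a deliberately simple cost model, time = round_trips × latency + bytes / bandwidth. The function below is a hypothetical illustration, not part of the harness, and any latency or bandwidth figures passed to it are made-up inputs rather than measurements.

```rust
/// Hypothetical cost model: each request pays a fixed latency, and the
/// payload is transferred at a constant bandwidth.
pub fn estimated_secs(
    round_trips: u64,
    bytes: u64,
    latency_secs: f64,
    bytes_per_sec: f64,
) -> f64 {
    round_trips as f64 * latency_secs + bytes as f64 / bytes_per_sec
}

/// Time saved by removing one round trip while fetching the same bytes.
pub fn saving_from_one_fewer_round_trip(
    round_trips: u64,
    bytes: u64,
    latency_secs: f64,
    bytes_per_sec: f64,
) -> f64 {
    estimated_secs(round_trips, bytes, latency_secs, bytes_per_sec)
        - estimated_secs(round_trips - 1, bytes, latency_secs, bytes_per_sec)
}
```

Under this model the saving from one fewer round trip equals the latency term alone, so it vanishes when latency is near zero, as on a local FS. Redundant bytes, by contrast, change the second term (and the byte count itself) regardless of latency.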

The same_file_splits group therefore reports ScanMetrics::bytes_read as its throughput basis (a deterministic count, not a sampled measurement) and prints the read amplification per split count. On a ~1.2 MB file:

same_file_splits/1:  bytes_read = 1,694,258  (1.43x file size)
same_file_splits/8:  bytes_read = 5,529,051  (4.66x file size)
same_file_splits/32: bytes_read = 18,674,303 (15.72x file size)

Reading one file as 32 byte-range splits fetches ~15.7x the file size, because each split re-fetches the Parquet metadata independently. This is machine-independent evidence for the cost that same-file metadata caching targets. #2100 was closed partly because its author could not demonstrate a benefit on their workload: a table with roughly one task per file, where same-file caching has nothing to hit. With this benchmark, a caching implementation would show the amplification dropping back toward 1x.
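
As a rough cross-check of the "each split re-fetches" explanation, the three reported figures can be tested for linear growth in the split count. The sketch below uses only the numbers quoted above. The function names are hypothetical, and the linear model (a fixed part plus a constant cost per split) is an assumption for illustration, not something the harness computes.

```rust
/// (split count, ScanMetrics::bytes_read) pairs as reported in the PR description.
pub const REPORTED: [(u64, u64); 3] =
    [(1, 1_694_258), (8, 5_529_051), (32, 18_674_303)];

/// Read amplification: bytes fetched divided by the size of the file.
pub fn amplification(bytes_read: u64, file_size: u64) -> f64 {
    bytes_read as f64 / file_size as f64
}

/// Extra bytes fetched per additional split between two measurements,
/// each given as (split count, bytes_read). Expects `b` to have more
/// splits and more bytes than `a`.
pub fn bytes_per_extra_split(a: (u64, u64), b: (u64, u64)) -> f64 {
    (b.1 - a.1) as f64 / (b.0 - a.0) as f64
}

/// Bytes left over once the assumed per-split cost is subtracted,
/// i.e. the intercept of the linear model.
pub fn fixed_bytes(point: (u64, u64), per_split: f64) -> f64 {
    point.1 as f64 - point.0 as f64 * per_split
}
```

Applied to the reported pairs, the 1-to-8 and 8-to-32 intervals both give roughly 548 KB per extra split, which is consistent with a constant per-split re-fetch, and the fixed part comes out near 1.15 MB. A fit to three points only shows that the growth is close to linear; it says nothing about which bytes each split is fetching.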

Run with:

cargo bench -p iceberg --bench arrow_reader

Are these changes tested?

This is a benchmark, not a code change. The harness compiles and runs (cargo bench -p iceberg --bench arrow_reader), and reuses the same Parquet-writing and reader-driving patterns as the existing reader unit tests. cargo fmt and cargo clippy -p iceberg --benches are clean.

There were no in-repo benchmarks for the ArrowReader, so every performance
claim in the perf epic (apache#2172) had to be validated against external
workloads. This adds a criterion harness that writes Parquet files to a
local temp dir and reads them back through the normal FileIO path, measuring
the per-FileScanTask overhead that dominates scans of tables with many small
files.

Benchmark groups:
- many_small_files: scans 16/64/256 small files, reporting files/sec so
  per-file overhead is directly visible.
- concurrency: a fixed corpus read at concurrency 1/4/16, exercising both
  the single-concurrency fast path and the buffered/flattened multi-task
  path.
- migrated_table: files without embedded field IDs read via name mapping,
  isolating the migrated-table schema-resolution cost.
- same_file_splits: one multi-row-group file read as 1/8/32 byte-range
  tasks, surfacing the redundant per-split metadata fetch that metadata
  caching (item apache#5) targets.
- with_predicate: scans carrying a bound predicate with row-group filtering
  and row selection enabled, exercising the per-task row-filter setup.

This gives a reproducible baseline for evaluating reader optimizations such
as operator caching (apache#2177) and metadata reuse.
viirya force-pushed the bench/arrow-reader-perf branch from e1e4060 to d2ac37e on May 31, 2026 17:38
Wall-clock time on the local FS cannot show the benefit of round-trip-saving
optimizations, but it also undersells the same-file-split problem, whose cost
is redundant *bytes fetched* rather than latency. Report ScanMetrics::bytes_read
as the same_file_splits throughput basis (a deterministic count, not a sample),
and print the read amplification per split count.

Reading one ~1.2MB file as 32 byte-range splits fetches ~15.7x the file size,
because each split re-fetches the Parquet metadata independently. This is the
machine-independent evidence for the cost that same-file metadata caching
(proposed in apache#2172, attempted in apache#2100) targets; a caching implementation
would drive the amplification back toward 1x.
mbutrovich self-requested a review on June 1, 2026 11:57
@mbutrovich
Collaborator

Thanks for doing this @viirya! This would have made my life working on #2172 much easier. I will take a proper pass through it today.

Development

Successfully merging this pull request may close these issues.

Add an in-repo benchmark harness for ArrowReader

2 participants