
bench: cycle distinct prefilled blocks to remove per-append fill#159

Merged
nclack merged 6 commits into acquire-project:main from nclack:bench-150-prefill-blocks
Jun 18, 2026


@nclack nclack commented Jun 18, 2026

Member

Closes #150.

The streaming benchmarks spent about half their time generating their own input, so the reported throughput measured the harness as much as the library, and the generator thread competed with the library for memory bandwidth.

Now the benchmarks fill one block once and reuse it for every append, so the measured loop is the writer rather than the generator. The per-append generator (pump_data) stays for the correctness tests, which rely on varied data per append.
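The difference between the two strategies can be sketched as follows. This is a minimal single-threaded illustration, not the benchmark's actual code: `run_bench`, its parameters, and the `append` callback are hypothetical names, and the `pump_data` here is only a stand-in that borrows the name of the real generator. The real harness ran the generator on its own thread, which this sketch does not model.

```python
import time


def pump_data(buf: bytearray, seed: int) -> None:
    """Hypothetical stand-in for the per-append generator: rewrites the whole block."""
    # Cheap deterministic fill that varies with the seed.
    pattern = bytes((seed + i) & 0xFF for i in range(256))
    n = len(buf)
    buf[:] = (pattern * (n // 256 + 1))[:n]


def run_bench(append, block_bytes: int, n_appends: int, prefill: bool) -> dict:
    """Call append(buf) n_appends times and account for data-generation time separately."""
    buf = bytearray(block_bytes)
    datagen_s = 0.0
    fills = 0
    t0 = time.perf_counter()
    for i in range(n_appends):
        # Prefill mode generates data on the first iteration only;
        # per-append mode regenerates the block before every append.
        if i == 0 or not prefill:
            g0 = time.perf_counter()
            pump_data(buf, i)
            datagen_s += time.perf_counter() - g0
            fills += 1
        append(buf)
    wall_s = time.perf_counter() - t0
    pct = 100.0 * datagen_s / wall_s if wall_s > 0 else 0.0
    return {"wall_s": wall_s, "datagen_s": datagen_s, "datagen_pct": pct, "fills": fills}


if __name__ == "__main__":
    written = []

    def sink(block):  # hypothetical writer: records the size and discards the data
        written.append(len(block))

    for prefill in (False, True):
        r = run_bench(sink, block_bytes=1 << 20, n_appends=50, prefill=prefill)
        print(f"prefill={prefill}: fills={r['fills']}, "
              f"data gen {r['datagen_s']:.3f} s ({r['datagen_pct']:.1f}% of wall)")
```

With `prefill=True` the generator runs once, so nearly all of the measured wall time belongs to `append`; with `prefill=False` the generator cost is paid on every iteration and shows up in the data-gen percentage.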

Also fixes the single-stream path ignoring --dtype.


codecov Bot commented Jun 18, 2026


Codecov Report

✅ All modified and coverable lines are covered by tests.
see 7 files with indirect coverage changes



nclack commented Jun 18, 2026

Member Author

Benchmark evidence (L40, RelWithDebInfo, zstd default, --frames 400)

Built this branch and ran two benches in each mode. default = new cycle-blocks; --busy-producer = the old per-append generation; --prefill = single reused block.

bench_stream_medfmt_single (single stream)

| mode | throughput | data-gen time | compressed (ratio) |
|---|---|---|---|
| default (cycle) | 4.38 GiB/s | 0.35 s (1.3% of wall) | 17.17 GiB (0.147) |
| --busy-producer | 4.38 GiB/s | 10.01 s (37.4%) | 16.49 GiB (0.141) |
| --prefill | 4.45 GiB/s | 0.05 s (0.2%) | 17.15 GiB (0.146) |

bench_stream_256cube_two_streams (interleaved, two streams — host-bandwidth-bound)

| mode | throughput (combined) | data-gen time | compressed |
|---|---|---|---|
| default (cycle) | 7.40 GiB/s | 0.35 s (3.4%) | 7.41 GiB |
| --busy-producer | 4.18 GiB/s | 6.23 s (34.7%) | 8.80 GiB |
| --prefill | 8.67 GiB/s | 0.05 s (0.5%) | 7.41 GiB |

What this shows

  • Data generation was a large, hidden cost: the old per-append fill spends ~35–37% of wall on the producer thread. The default removes it (≈0–3%), and the new Data gen time (% of wall) line makes that cost explicit instead of folded into throughput.
  • Where the run is host-bandwidth-bound (two streams), that fill competes with the library: not regenerating every append raises combined throughput from 4.18 → 7.40 GiB/s (+77%). Where the run is compute/library-bound (single stream, 4.38 GiB/s), the fill overlaps the pipeline so wall throughput is unchanged — but the 37%-of-wall fill is now reported rather than invisible.
  • Compression: the default tracks --prefill and is a few percent off --busy-producer. zstd compresses each chunk independently (no cross-append dedup), so cycling distinct blocks mainly guards against the pathological identical-block case rather than shifting these aggregate ratios.
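As a cross-check on the two-stream figures, the arithmetic can be reproduced from the table values alone. The helper names below are hypothetical, and the implied wall times are only approximate because the percentages in the table are rounded.

```python
def implied_wall_s(datagen_s: float, pct_of_wall: float) -> float:
    """Wall time implied by a data-gen time and its share of wall."""
    return datagen_s / (pct_of_wall / 100.0)


def pct_gain(old: float, new: float) -> float:
    """Relative change from old to new, in percent."""
    return 100.0 * (new / old - 1.0)


# Two-stream bench, values copied from the table above.
gain = pct_gain(4.18, 7.40)              # --busy-producer -> default (cycle)
busy_wall = implied_wall_s(6.23, 34.7)   # about 18 s
cycle_wall = implied_wall_s(0.35, 3.4)   # about 10 s

print(f"throughput gain: +{gain:.0f}%")
print(f"implied wall: busy {busy_wall:.1f} s, cycle {cycle_wall:.1f} s, "
      f"ratio {busy_wall / cycle_wall:.2f}x")
```

The implied wall-time ratio (roughly 1.7x) agrees with the +77% throughput gain, which is what one would expect if both runs pushed the same amount of data.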

Representative run (2 of the streaming benches). Build: conda CUDA toolkit, sm_89, the deps prefix on CMAKE_PREFIX_PATH; run with --frames 400.


nclack commented Jun 18, 2026

Member Author

Simplified to a single mode: the benchmarks now just prefill once and reuse the block. The earlier cycle-blocks default and the --busy-producer/--prefill flags are gone. Measurements showed cycle-blocks gave the same throughput and compression as prefill, so the extra machinery wasn't buying anything: a single 32M-element block already spans the data pattern, so its chunks already have representative per-chunk entropy, and zstd doesn't dedup across appends. --busy-producer was only a host-contention stress test, not the default.

In the benchmark numbers posted above, the relevant row is now --prefill (that is the default behavior); the cycle-blocks and --busy-producer modes no longer exist.
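The point about compression can be illustrated with a small sketch. It uses Python's `zlib` purely as a stand-in for zstd (an assumption for illustration; the library's actual chunking and codec settings are not shown in this thread), and the helper names are hypothetical. When every append is compressed independently, reusing the same block cannot improve the ratio; only a compressor that carries history across appends would benefit from the repetition.

```python
import random
import zlib


def make_block(n_bytes: int, seed: int = 0) -> bytes:
    """Deterministic pseudo-random block, so a single copy is hard to compress."""
    rng = random.Random(seed)
    return bytes(rng.getrandbits(8) for _ in range(n_bytes))


def independent_total(block: bytes, n_appends: int) -> int:
    """Compress each append on its own: no history is shared between appends."""
    return sum(len(zlib.compress(block)) for _ in range(n_appends))


def streamed_total(block: bytes, n_appends: int) -> int:
    """One compressor for all appends: earlier appends stay in the match window."""
    comp = zlib.compressobj()
    total = 0
    for _ in range(n_appends):
        total += len(comp.compress(block))
    return total + len(comp.flush())


if __name__ == "__main__":
    block = make_block(4096)  # small enough to fit in zlib's 32 KiB window
    print("independent:", independent_total(block, 8), "bytes")
    print("streamed:   ", streamed_total(block, 8), "bytes")
```

In the independent case the total is exactly the single-block size times the number of appends, so a reused block yields the same ratio as one block compressed once. The streamed case shrinks sharply because repeats are found in the history window, which is the pathological effect that does not arise when chunks are compressed independently.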

@nclack nclack merged commit 46fa2e8 into acquire-project:main Jun 18, 2026
8 checks passed
@nclack nclack deleted the bench-150-prefill-blocks branch June 18, 2026 18:17

Development

Successfully merging this pull request may close these issues.

Streaming benches spend ~half their time generating input; throughput conflates harness and library
