bench: cycle distinct prefilled blocks to remove per-append fill by nclack · Pull Request #159 · acquire-project/chucky

nclack · 2026-06-18T02:25:16Z

Closes #150.

The streaming benchmarks spent about half their time generating their own input, so the reported throughput measured the harness as much as the library, and the generator thread competed with the library for memory bandwidth.

Now the benchmarks fill one block once and reuse it for every append, so the measured loop is the writer rather than the generator. The per-append generator (pump_data) stays for the correctness tests, which rely on varied data per append.
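
The difference between the two loop shapes can be sketched as below. This is a minimal Python illustration under stated assumptions, not the project's benchmark code: `fill_block`, `run_per_append`, `run_prefilled`, `report`, and the `append` callback are hypothetical names, and the sketch is single-threaded, so it shows only the generation cost, not the memory-bandwidth contention from a separate producer thread.

```python
import time

def fill_block(n, seed):
    """Hypothetical generator: fill n bytes with a seed-dependent pattern."""
    return bytes((seed + i) & 0xFF for i in range(n))

def run_per_append(append, n_appends, block_size):
    """Old shape: regenerate the input before every append."""
    gen_s = 0.0
    t_start = time.perf_counter()
    for i in range(n_appends):
        t0 = time.perf_counter()
        block = fill_block(block_size, seed=i)
        gen_s += time.perf_counter() - t0
        append(block)
    wall_s = time.perf_counter() - t_start
    return gen_s, wall_s

def run_prefilled(append, n_appends, block_size):
    """New shape: fill one block once, then reuse it for every append."""
    t_start = time.perf_counter()
    t0 = time.perf_counter()
    block = fill_block(block_size, seed=0)
    gen_s = time.perf_counter() - t0
    for _ in range(n_appends):
        append(block)
    wall_s = time.perf_counter() - t_start
    return gen_s, wall_s

def report(name, gen_s, wall_s):
    """Print generation time as an absolute value and as a share of wall time."""
    print(f"{name}: data gen {gen_s:.4f} s ({100.0 * gen_s / wall_s:.1f}% of wall)")

if __name__ == "__main__":
    sink = []  # stand-in for the writer under test
    report("per-append", *run_per_append(sink.append, 50, 1 << 16))
    report("prefilled", *run_prefilled(sink.append, 50, 1 << 16))
```

Reporting generation time separately from wall time keeps the harness cost visible even when it overlaps with the work being measured.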

Also fixes the single-stream path ignoring --dtype.

codecov · 2026-06-18T02:30:16Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
see 7 files with indirect coverage changes


nclack · 2026-06-18T14:57:14Z

Benchmark evidence (L40, RelWithDebInfo, zstd default, `--frames 400`)

Built this branch and ran two benches in each mode. default = new cycle-blocks; --busy-producer = the old per-append generation; --prefill = single reused block.

bench_stream_medfmt_single (single stream)

| mode | throughput | data-gen time | compressed (ratio) |
| --- | --- | --- | --- |
| default (cycle) | 4.38 GiB/s | 0.35 s (1.3% of wall) | 17.17 GiB (0.147) |
| `--busy-producer` | 4.38 GiB/s | 10.01 s (37.4%) | 16.49 GiB (0.141) |
| `--prefill` | 4.45 GiB/s | 0.05 s (0.2%) | 17.15 GiB (0.146) |

bench_stream_256cube_two_streams (interleaved, two streams — host-bandwidth-bound)

| mode | throughput (combined) | data-gen time | compressed |
| --- | --- | --- | --- |
| default (cycle) | 7.40 GiB/s | 0.35 s (3.4%) | 7.41 GiB |
| `--busy-producer` | 4.18 GiB/s | 6.23 s (34.7%) | 8.80 GiB |
| `--prefill` | 8.67 GiB/s | 0.05 s (0.5%) | 7.41 GiB |

What this shows

- Data generation was a large, hidden cost: the old per-append fill spends ~35–37% of wall time on the producer thread. The default removes it (≈0–3%), and the new Data gen time (% of wall) line makes that cost explicit instead of folding it into throughput.
- Where the run is host-bandwidth-bound (two streams), that fill competes with the library: not regenerating every append raises combined throughput from 4.18 to 7.40 GiB/s (+77%). Where the run is compute/library-bound (single stream, 4.38 GiB/s), the fill overlaps the pipeline, so wall throughput is unchanged, but the 37%-of-wall fill is now reported rather than invisible.
- Compression: the default tracks --prefill and is a few percent off --busy-producer. zstd compresses each chunk independently (no cross-append dedup), so cycling distinct blocks mainly guards against the pathological identical-block case rather than shifting these aggregate ratios.
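
The headline figures follow from the table values. A small Python check of the arithmetic, with inputs copied from the tables above and variable names chosen only for illustration:

```python
# Two-stream combined throughput in GiB/s, taken from the table above.
busy_producer = 4.18
cycle_default = 7.40
speedup_pct = 100.0 * (cycle_default / busy_producer - 1.0)
print(f"throughput gain: +{speedup_pct:.0f}%")  # +77%

# Single-stream --busy-producer: 10.01 s of data generation at 37.4% of wall.
gen_s, gen_frac = 10.01, 0.374
wall_s = gen_s / gen_frac
print(f"implied wall time: {wall_s:.1f} s")  # 26.8 s
```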

Representative run (2 of the streaming benches). Build: conda CUDA toolkit, sm_89, the deps prefix on CMAKE_PREFIX_PATH; run with --frames 400.

nclack · 2026-06-18T16:41:55Z

Simplified to a single mode: the benchmarks now just prefill once and reuse the block. The earlier cycle-blocks default and the --busy-producer/--prefill flags are gone. Measurements showed cycle-blocks gave the same throughput and compression as prefill, so the extra machinery wasn't buying anything: a single 32M-element block already spans the data pattern, so its chunks already have representative per-chunk entropy, and zstd doesn't dedup across appends. --busy-producer was only a host-contention stress test, not the default.
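
The point about per-append compression can be illustrated with a small sketch. It uses Python's standard-library `zlib` as a stand-in for zstd, so the sizes are not comparable to the numbers in this thread, and `compress_appends` is a hypothetical helper. When each append is compressed on its own, no compressor state carries over, so a reused block is paid for in full every time, just as a distinct block of similar entropy would be.

```python
import random
import zlib

def compress_appends(blocks):
    """Compress each append independently and return the total compressed size."""
    return sum(len(zlib.compress(b)) for b in blocks)

rng = random.Random(0)
# Random bytes: an essentially incompressible test block.
block = bytes(rng.getrandbits(8) for _ in range(2048))
n_appends = 4

# Independent compression: the reused block costs the same on every append.
independent = compress_appends([block] * n_appends)

# One stream over the concatenation: later copies can reference earlier ones.
joint = len(zlib.compress(block * n_appends))

print(independent, joint)
```

Only the second case benefits from repetition, which is why reusing one block does not inflate the compression ratio when appends are compressed independently.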

In the benchmark numbers posted above, the relevant row is now --prefill (that is the default behavior); the cycle-blocks and busy-producer modes no longer exist.

bench: cycle distinct pump blocks acquire-project#150

9ea87c0

nclack mentioned this pull request Jun 18, 2026

bench: cycle distinct prefilled blocks to remove per-append fill nclack/chucky#2

Closed

Nathan Clack added 4 commits June 18, 2026 03:49

bench: stride cycle blocks across pattern acquire-project#150

e17cd47

bench: cap pre-gen block memory acquire-project#150

411bff9

simplify: share pump block pre-gen

7bf7ea2

comments: trim pump block helpers

72f475c

bench: prefill-only data generation

c01c508

nclack merged commit 46fa2e8 into acquire-project:main Jun 18, 2026
8 checks passed

nclack deleted the bench-150-prefill-blocks branch June 18, 2026 18:17

nclack merged 6 commits into acquire-project:main from nclack:bench-150-prefill-blocks
