bench: cycle distinct prefilled blocks to remove per-append fill #159
Codecov Report ✅ All modified and coverable lines are covered by tests.
Benchmark evidence (L40, RelWithDebInfo, zstd default,

| mode | throughput | data-gen time | compressed (ratio) |
|---|---|---|---|
| default (cycle) | 4.38 GiB/s | 0.35 s (1.3% of wall) | 17.17 GiB (0.147) |
| `--busy-producer` | 4.38 GiB/s | 10.01 s (37.4%) | 16.49 GiB (0.141) |
| `--prefill` | 4.45 GiB/s | 0.05 s (0.2%) | 17.15 GiB (0.146) |

bench_stream_256cube_two_streams (interleaved, two streams — host-bandwidth-bound)

| mode | throughput (combined) | data-gen time | compressed |
|---|---|---|---|
| default (cycle) | 7.40 GiB/s | 0.35 s (3.4%) | 7.41 GiB |
| `--busy-producer` | 4.18 GiB/s | 6.23 s (34.7%) | 8.80 GiB |
| `--prefill` | 8.67 GiB/s | 0.05 s (0.5%) | 7.41 GiB |

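
The columns in the tables above can be cross-checked against each other. The sketch below is plain arithmetic on the posted numbers, not part of the benchmark. `wall_seconds` is a hypothetical helper, and reading the ratio column as compressed size divided by raw size is an assumption.

```python
def wall_seconds(gen_seconds: float, gen_percent_of_wall: float) -> float:
    """Wall time implied by a data-gen time and its share of wall."""
    return gen_seconds / (gen_percent_of_wall / 100.0)


# Single stream: both modes should imply roughly the same wall time,
# because their reported throughput is the same (4.38 GiB/s).
single_busy_wall = wall_seconds(10.01, 37.4)    # about 26.8 s
single_default_wall = wall_seconds(0.35, 1.3)   # about 26.9 s

# Assumption: ratio = compressed / raw, so raw = compressed / ratio.
single_raw_gib = 17.17 / 0.147                  # about 116.8 GiB

# Two streams: relative gain of the default over --busy-producer.
two_stream_gain_pct = (7.40 / 4.18 - 1.0) * 100.0   # about 77%

print(f"single-stream wall: {single_busy_wall:.1f} s vs {single_default_wall:.1f} s")
print(f"single-stream raw data: {single_raw_gib:.1f} GiB")
print(f"two-stream gain: +{two_stream_gain_pct:.0f}%")
```

Under that reading, about 116.8 GiB over about 26.9 s is close to the reported 4.38 GiB/s, so the throughput, data-gen and compressed columns agree to within rounding.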
What this shows
- Data generation was a large, hidden cost: the old per-append fill spends ~35–37% of wall on the producer thread. The default removes it (≈0–3%), and the new `Data gen time (% of wall)` line makes that cost explicit instead of folded into throughput.
- Where the run is host-bandwidth-bound (two streams), that fill competes with the library: not regenerating every append raises combined throughput from 4.18 → 7.40 GiB/s (+77%). Where the run is compute/library-bound (single stream, 4.38 GiB/s), the fill overlaps the pipeline so wall throughput is unchanged, but the 37%-of-wall fill is now reported rather than invisible.
- Compression: the default tracks `--prefill` and is a few percent off `--busy-producer`. zstd compresses each chunk independently (no cross-append dedup), so cycling distinct blocks mainly guards against the pathological identical-block case rather than shifting these aggregate ratios.
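
The point about independent chunk compression can be illustrated with a small sketch. It uses Python's standard `zlib` as a stand-in for zstd, and a 4 KiB block repeated 8 times as made-up sizes, so it shows the principle only and not the library's actual chunking or ratios.

```python
import random
import zlib

rng = random.Random(0)
# One 4 KiB block of pseudo-random bytes, reused for every append.
block = bytes(rng.getrandbits(8) for _ in range(4096))
appends = 8
raw_bytes = appends * len(block)

# Each append is compressed on its own, so a repeated block cannot
# reference its earlier copies: the aggregate ratio equals the
# ratio of a single block.
per_chunk_bytes = sum(len(zlib.compress(block)) for _ in range(appends))
per_chunk_ratio = per_chunk_bytes / raw_bytes
single_block_ratio = len(zlib.compress(block)) / len(block)

# One compressor over the concatenated stream can match later copies
# against earlier ones (cross-append dedup), so the ratio drops sharply.
whole_stream_ratio = len(zlib.compress(block * appends)) / raw_bytes

print(f"per-chunk: {per_chunk_ratio:.3f}, single block: {single_block_ratio:.3f}, "
      f"whole stream: {whole_stream_ratio:.3f}")
```

With per-chunk compression, reusing one block does not inflate the aggregate ratio, which is consistent with the default and `--prefill` reporting nearly the same compressed size.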
Representative run (2 of the streaming benches). Build: conda CUDA toolkit, sm_89, the deps prefix on CMAKE_PREFIX_PATH; run with --frames 400.
Simplified to a single mode: the benchmarks now just prefill once and reuse the block. The earlier cycle-blocks default and In the benchmark numbers posted above, the relevant column is now |
Closes #150.
The streaming benchmarks spent about half their time generating their own input, so the reported throughput measured the harness as much as the library, and the generator thread competed with the library for memory bandwidth.
Now the benchmarks fill one block once and reuse it for every append, so the measured loop is the writer rather than the generator. The per-append generator (`pump_data`) stays for the correctness tests, which rely on varied data per append.

Also fixes the single-stream path ignoring `--dtype`.
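
A minimal sketch of the two harness shapes described above, with data-generation time accounted separately from wall time. This is not the benchmark's code: `fill_block`, `run_bench` and the `append` callback are hypothetical names, and counting the one-time prefill inside wall time is an assumption.

```python
import time


def fill_block(buf: bytearray, seed: int) -> None:
    """Hypothetical generator: overwrite buf with a seed-dependent pattern."""
    buf[:] = bytes((i * 31 + seed) & 0xFF for i in range(len(buf)))


def run_bench(append, frames: int, block_size: int, prefill: bool):
    """Append `frames` blocks and return (wall_s, gen_s, fills).

    prefill=True  fills the block once and reuses it for every append.
    prefill=False regenerates the block before every append.
    """
    buf = bytearray(block_size)
    gen_s = 0.0
    fills = 0
    t0 = time.perf_counter()
    for frame in range(frames):
        if frame == 0 or not prefill:
            g0 = time.perf_counter()
            fill_block(buf, frame)
            gen_s += time.perf_counter() - g0
            fills += 1
        append(bytes(buf))  # stand-in for the writer under test
    wall_s = time.perf_counter() - t0
    return wall_s, gen_s, fills


sink = []
wall_s, gen_s, fills = run_bench(sink.append, frames=8, block_size=1024, prefill=True)
print(f"fills={fills}, data gen time: {gen_s:.6f} s ({100 * gen_s / wall_s:.1f}% of wall)")
```

Reporting `gen_s` next to `wall_s` is what keeps generator cost visible instead of folded into throughput. Passing `prefill=False` gives the per-append shape that the correctness tests still rely on.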