gpu: snapshot per-batch LOD timing across batch boundary by nclack · Pull Request #3 · nclack/chucky

nclack · 2026-06-18T02:17:09Z

record_flush_metrics reads the per-LOD timing events lod_shared->timing[fc] for the batch being drained, but the producer re-records those events for the next batch on the same fc. Drains are lazy and run on the delivery worker, so the read can race the re-record and pick up a different batch's generation. CUDA event APIs are thread-safe and accumulate_metric_cu discards bad readings, so the only effect is occasional skew in reported per-LOD timing — never wrong output bytes.

A naive copy of the event handle into the handoff (as aggregate timing does) is insufficient: timing[fc].t_end is dual-purpose — it is also the GPU_EDGE_LOD_DONE ordering edge — and is re-recorded during the next batch's fill, before any drain ordering.
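
The aliasing problem can be seen in a small host-only model. This is a sketch in plain C, not the project's code: `fake_event`, `fake_record`, `fake_read`, and `drain_after_rerecord` are hypothetical names, and the event is reduced to the batch number that last recorded it instead of a GPU timestamp.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a CUDA event: it remembers only which batch
 * recorded it last. A real event carries a GPU timestamp instead. */
struct fake_event { unsigned long recorded_by_batch; };

/* A handle is just a pointer to the underlying event object, so copying
 * the handle does not copy the recorded state. */
typedef struct fake_event *fake_event_handle;

static void fake_record(fake_event_handle h, unsigned long batch) {
    assert(h != NULL);
    h->recorded_by_batch = batch;
}

static unsigned long fake_read(fake_event_handle h) {
    assert(h != NULL);
    return h->recorded_by_batch;
}

/* Copy the handle at handoff time, let the producer re-record the same
 * event for the next batch, then drain lazily through the copy.
 * Returns the batch that the drain actually observes. */
static unsigned long drain_after_rerecord(unsigned long batch) {
    struct fake_event t_end = {0};
    fake_record(&t_end, batch);          /* batch N finishes its LOD work   */
    fake_event_handle copied = &t_end;   /* "snapshot" placed in the handoff */
    fake_record(&t_end, batch + 1);      /* next fill re-records t_end       */
    return fake_read(copied);            /* the lazy drain runs only now     */
}
```

In this model `drain_after_rerecord(7)` observes batch 8 rather than 7, because the copied handle still refers to the one event that the next fill overwrote.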

Fix:

- Give the per-LOD timing 3 generations (LOD_TIMING_SLOTS); worst case is 3 simultaneously-live batches (draining + pending + filling). Each batch owns one generation for its whole lifetime, threaded through the schedule slot and handoff so the drain reads the generation it filled. Reuse (batch N+3 reuses N) is safe because N is joined before N+3's fill begins.
- Decouple the ordering edge from the timing event: a stable per-fc lod_done[2] event now backs GPU_EDGE_LOD_DONE, seeded exactly where t_end was, so the compress-stream wait fires at the identical pipeline position.
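
The generation ownership rule can be modelled on the host as follows. This is a minimal sketch under stated assumptions, not the PR's implementation: only `LOD_TIMING_SLOTS` comes from the description; the struct and function names are hypothetical, the round-robin `batch % LOD_TIMING_SLOTS` assignment is assumed from "batch N+3 reuses N", and each generation is reduced to the id of the batch that owns it rather than real CUDA events.

```c
#include <assert.h>

#define LOD_TIMING_SLOTS 3 /* name from the PR; everything else is hypothetical */

/* One timing generation, reduced to the id of the batch that owns it.
 * -1 means the slot is free (its previous owner has been joined). */
struct timing_gen { long owner; };

struct lod_timing_ring { struct timing_gen gen[LOD_TIMING_SLOTS]; };

static void ring_init(struct lod_timing_ring *r) {
    for (int i = 0; i < LOD_TIMING_SLOTS; ++i) r->gen[i].owner = -1;
}

/* Assumed round-robin assignment: batch N and batch N+3 share a slot. */
static int gen_for_batch(long batch) {
    assert(batch >= 0);
    return (int)(batch % LOD_TIMING_SLOTS);
}

/* Fill begins: claim the batch's generation. Returns the generation index
 * to carry through the schedule slot and handoff, or -1 if the previous
 * owner has not been joined yet (which the real pipeline rules out). */
static int begin_fill(struct lod_timing_ring *r, long batch) {
    int g = gen_for_batch(batch);
    if (r->gen[g].owner != -1) return -1;
    r->gen[g].owner = batch;
    return g;
}

/* Drain: read through the generation index carried in the handoff, so the
 * drain sees the batch that filled it even if later batches are in flight. */
static long drain_read(const struct lod_timing_ring *r, int g) {
    return r->gen[g].owner;
}

/* Join: the batch is fully drained, so its generation may be reused. */
static void join_batch(struct lod_timing_ring *r, int g) {
    r->gen[g].owner = -1;
}
```

With batches 0, 1, and 2 live at once (draining, pending, filling), each holds a distinct generation. In this model batch 3 can claim generation 0 only after batch 0 has been joined, which mirrors the reuse argument above.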

Metrics-quality only; no data-correctness impact. No new unit test (a timing-skew race isn't reliably testable); existing multiscale/LOD ctests confirm metrics still populate.

nclack · 2026-06-18T02:25:43Z

Superseded by acquire-project#160 (PR retargeted to upstream).

gpu: snapshot per-batch LOD timing acquire-project#154

1ae2b62

nclack closed this Jun 18, 2026

nclack wants to merge 1 commit into main from fix-154-lod-timing-snapshot