gpu: snapshot per-batch LOD timing across batch boundary #3
Closed
nclack wants to merge 1 commit into
Superseded by acquire-project#160 (PR retargeted to upstream).
Closes acquire-project#154.
`record_flush_metrics` reads the per-LOD timing events `lod_shared->timing[fc]` for the batch being drained, but the producer re-records those events for the next batch on the same `fc`. Drains are lazy and run on the delivery worker, so the read can race the re-record and pick up a different batch's generation. CUDA event APIs are thread-safe and `accumulate_metric_cu` discards bad readings, so the only effect is occasional skew in reported per-LOD timing, never wrong output bytes.

A naive copy of the event handle into the handoff (as aggregate timing does) is insufficient: `timing[fc].t_end` is dual-purpose (it is also the `GPU_EDGE_LOD_DONE` ordering edge) and is re-recorded during the next batch's fill, before any drain ordering.

Fix:

`LOD_TIMING_SLOTS`); worst case is 3 simultaneously-live batches (draining + pending + filling). Each batch owns one generation for its whole lifetime, threaded through the schedule slot and handoff so the drain reads the generation it filled. Reuse (batch N+3 reuses N) is safe because N is joined before N+3's fill begins.

`lod_done[2]` event now backs `GPU_EDGE_LOD_DONE`, seeded exactly where `t_end` was, so the compress-stream wait fires at the identical pipeline position.

Metrics-quality only; no data-correctness impact. No new unit test (a timing-skew race isn't reliably testable); existing multiscale/LOD ctests confirm metrics still populate.
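
For reviewers who want to reason about the generation ring without a GPU, the following is a minimal host-only model of the idea, not the code in this PR. It replaces CUDA events with a plain "last recorded by batch N" stamp. `LOD_TIMING_SLOTS` is the name used above, and its value of 3 is inferred from the stated worst case; every other name (`timing_ring`, `handoff`, `fill_batch`, `drain_is_stale`, `count_stale_drains`, `MAX_SLOTS`) is hypothetical and exists only for illustration. The `lag` parameter models lazy drains: batch N is drained only after batch N+lag has been filled.

```c
#include <assert.h>
#include <stdint.h>

#define LOD_TIMING_SLOTS 3 /* name from the PR; value inferred from its worst case */
#define MAX_SLOTS 8        /* hypothetical upper bound for this model only */

/* Stand-in for the per-generation timing events: instead of CUDA events,
 * each slot remembers which batch recorded into it last. */
struct timing_ring {
  int slots;
  int64_t recorded_by[MAX_SLOTS];
};

/* Hypothetical handoff: carries the generation index the batch filled. */
struct handoff {
  int64_t batch;
  int gen;
};

static void ring_init(struct timing_ring* r, int slots) {
  assert(slots >= 1 && slots <= MAX_SLOTS);
  r->slots = slots;
  for (int i = 0; i < MAX_SLOTS; ++i) r->recorded_by[i] = -1;
}

/* Producer side: the batch picks its generation once and records into it. */
static struct handoff fill_batch(struct timing_ring* r, int64_t batch) {
  struct handoff h = { batch, (int)(batch % r->slots) };
  r->recorded_by[h.gen] = batch;
  return h;
}

/* Drain side: the reading is stale if another batch re-recorded the slot. */
static int drain_is_stale(const struct timing_ring* r, struct handoff h) {
  return r->recorded_by[h.gen] != h.batch;
}

/* Fill nbatches in order; batch N is drained only after batch N+lag has
 * been filled. Returns how many drains observed another batch's stamp. */
static int count_stale_drains(int slots, int lag, int nbatches) {
  struct timing_ring r;
  struct handoff pending[MAX_SLOTS];
  int stale = 0;
  assert(lag >= 0 && lag < MAX_SLOTS);
  ring_init(&r, slots);
  for (int64_t b = 0; b < nbatches; ++b) {
    pending[b % MAX_SLOTS] = fill_batch(&r, b);
    if (b >= lag) stale += drain_is_stale(&r, pending[(b - lag) % MAX_SLOTS]);
  }
  return stale;
}
```

In this model, `count_stale_drains(1, 2, 10)` reports 8 stale drains (a single shared slot, as before the fix), `count_stale_drains(LOD_TIMING_SLOTS, 2, 10)` reports 0, and `count_stale_drains(LOD_TIMING_SLOTS, 3, 10)` reports 7. The last case illustrates why reuse depends on batch N being joined before N+3's fill begins.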