Periodic checkpoint output and lock-free hot-loop diagnostics #352
Open
krystophny wants to merge 1 commit into
Tracing wrote its result files only at the very end, so a run killed before completion (a SLURM timeout, say) lost everything and gave no progress signal. The symplectic integrators also printed and dumped to fort.660x on every Newton non-convergence and r<0 event, deep in the inner loop, which floods stdout and serializes threads.

- diag_counters: per-thread, cache-line-padded event counters. The integrators bump a counter instead of printing; totals are a reduction taken outside the hot path, with no atomic or critical on the count.
- progress_monitor: once per finished particle a thread reads the clock (lock-free); every checkpoint_interval seconds one thread flushes the partial results and a status line under a rarely taken critical. A killed run keeps its last flushed output.
- simple_main: write_output splits into a reusable write_results, so the periodic checkpoint and the final write share one path; the confined-fraction normalization moves there unchanged.
- checkpoint_interval namelist key (seconds, default 10).

Final output is unchanged: confined fractions still divide by ntestpart and the per-particle files are byte-identical. Verified that the monitor off versus on produces identical class_parts.dat on a deterministic volume run.
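For readers who want a concrete picture of the counter scheme described above, here is a minimal sketch. It is not the code in this PR: the module name `diag_counters_sketch`, the event constants, the `PAD` width, and the 64-byte cache-line size are all assumptions made for illustration; only the name `count_event` comes from the PR description.

```fortran
module diag_counters_sketch
    ! Hypothetical illustration; names and layout are assumptions, not the PR's code.
    use, intrinsic :: iso_fortran_env, only: int64
    use omp_lib, only: omp_get_max_threads, omp_get_thread_num
    implicit none
    private
    public :: EV_NEWTON_MAXIT, EV_R_NEGATIVE
    public :: init_counters, count_event, total_events

    integer, parameter :: EV_NEWTON_MAXIT = 1, EV_R_NEGATIVE = 2
    ! 8 x int64 = 64 bytes per thread column (assumes 64-byte cache lines;
    ! the allocation itself is not guaranteed to be line-aligned).
    integer, parameter :: PAD = 8
    integer(int64), allocatable :: counts(:, :)

contains

    subroutine init_counters()
        if (allocated(counts)) deallocate(counts)
        allocate(counts(PAD, 0:omp_get_max_threads() - 1))
        counts = 0_int64
    end subroutine init_counters

    ! Hot path: each thread touches only its own column, so the
    ! increment needs neither an atomic nor a critical section.
    subroutine count_event(event)
        integer, intent(in) :: event
        integer :: tid
        tid = omp_get_thread_num()
        counts(event, tid) = counts(event, tid) + 1_int64
    end subroutine count_event

    ! Cold path: reduce over the thread columns. Call this outside
    ! the parallel region (or accept a slightly stale snapshot).
    function total_events(event) result(total)
        integer, intent(in) :: event
        integer(int64) :: total
        total = sum(counts(event, :))
    end function total_events

end module diag_counters_sketch

program demo_counters
    use diag_counters_sketch
    implicit none
    integer :: i

    call init_counters()
    !$omp parallel do
    do i = 1, 1000
        ! Stand-in for an r<0 event inside the integrator loop.
        if (mod(i, 10) == 0) call count_event(EV_R_NEGATIVE)
    end do
    !$omp end parallel do
    print '(a,i0)', 'r_negative=', total_events(EV_R_NEGATIVE)
end program demo_counters
```

Built with `gfortran -fopenmp`, the demo counts 100 events regardless of the thread count, because every multiple of 10 in 1..1000 bumps exactly one thread's column.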
Motivation
The tracer writes its result files (`times_lost.dat`, `confined_fraction.dat`, `class_parts.dat`) only in the final `write_output`. A run killed before completion, such as a batch-scheduler timeout on a long slowing-down run, therefore loses everything and offers no progress signal while it runs.

Separately, the symplectic integrators `print` and dump to `fort.660x` on every Newton non-convergence and every `r<0` event. These sit in the inner integration loop, so a difficult case floods stdout and serializes the OpenMP threads on the implicit I/O lock.

Change
- `diag_counters`: per-thread, cache-line-padded event counters. The integrators call `count_event(...)` instead of printing; totals are a reduction over the thread columns, taken outside the hot path. No atomic and no critical on the increment.
- `progress_monitor`: once per finished particle a thread reads the wall clock, which is lock-free. Every `checkpoint_interval` seconds one thread enters a rarely taken critical, writes the partial results and a one-line status (percent done, ETA, aggregated event counts), and advances the deadline. A killed run keeps its last flushed output. The module holds a dump callback, so it depends on no tracing code.
- `simple_main`: `write_output` splits into a reusable `write_results` shared by the periodic checkpoint and the final write. The confined-fraction normalization moves there unchanged.
- `checkpoint_interval` (seconds, default 10; set <= 0 to disable).

Verification
Build: `cmake --build build` (Release, gfortran 15), library and `simple.x` link clean.

Periodic output and recovery. A deterministic volume fast-classify with `checkpoint_interval=1`: `[progress] 10.00% 200/2000 t=19.1s eta=171.8s newton1_maxit=10 r_negative=10` (aggregated counts, no per-iteration spam); `confined_fraction.dat`, `times_lost.dat`, `progress.dat` present mid-run; `kill -9`, the files remain; `times_lost.dat` holds the finished particles with `-1` sentinels for the rest.

Final output unchanged. Same deterministic volume run with the monitor off (`checkpoint_interval=0`) and on (`=10`, run completes under one interval) produces a byte-identical `class_parts.dat` (`diff` clean). Confined fractions still divide by `ntestpart`; the per-particle files differ from the old code only by `open(newunit=...)` in place of a hard-coded unit.

The full `golden_record` suite runs in CI on this PR.
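
As a rough illustration of the deadline check that `progress_monitor` is described as performing, the sketch below shows one way to combine a cheap clock read per finished particle with a rarely entered critical section and a dump callback. Everything here is an assumption for illustration, not the PR's implementation: the module name `progress_monitor_sketch`, the routines `monitor_init` and `particle_done`, the `dump_iface` interface, and the use of an atomic read for the deadline.

```fortran
module progress_monitor_sketch
    ! Hypothetical illustration; all names here are assumptions.
    use, intrinsic :: iso_fortran_env, only: int64
    implicit none
    private
    public :: dump_iface, monitor_init, particle_done

    abstract interface
        subroutine dump_iface()
        end subroutine dump_iface
    end interface

    integer, parameter :: dp = kind(1.0d0)
    ! The monitor only knows a callback, so it needs no tracing modules.
    procedure(dump_iface), pointer :: dump => null()
    integer(int64) :: rate = 1, interval_ticks = 0, deadline = 0
    logical :: enabled = .false.

contains

    subroutine monitor_init(interval_seconds, dump_callback)
        real(dp), intent(in) :: interval_seconds
        procedure(dump_iface) :: dump_callback
        integer(int64) :: now

        dump => dump_callback
        enabled = interval_seconds > 0.0_dp   ! <= 0 disables the monitor
        if (.not. enabled) return
        call system_clock(count=now, count_rate=rate)
        interval_ticks = int(interval_seconds * real(rate, dp), int64)
        deadline = now + interval_ticks
    end subroutine monitor_init

    ! Call once per finished particle, from any thread.
    subroutine particle_done()
        integer(int64) :: now, seen

        if (.not. enabled) return
        call system_clock(count=now)
        !$omp atomic read
        seen = deadline
        if (now < seen) return                ! common case: no lock taken

        !$omp critical (checkpoint)
        ! Re-check inside the lock: another thread may have just flushed.
        if (now >= deadline) then
            call dump()
            !$omp atomic write
            deadline = now + interval_ticks
        end if
        !$omp end critical (checkpoint)
    end subroutine particle_done

end module progress_monitor_sketch

module demo_results
    implicit none
    integer :: n_done = 0
contains
    ! Stand-in for writing partial results and a status line.
    subroutine write_partial()
        integer :: snapshot
        !$omp atomic read
        snapshot = n_done
        print '(a,i0)', '[progress] particles finished: ', snapshot
    end subroutine write_partial
end module demo_results

program demo_monitor
    use progress_monitor_sketch
    use demo_results
    implicit none
    integer :: i

    call monitor_init(1.0d0, write_partial)
    !$omp parallel do schedule(dynamic)
    do i = 1, 2000
        ! ... trace particle i here (elided) ...
        !$omp atomic update
        n_done = n_done + 1
        call particle_done()
    end do
    !$omp end parallel do
end program demo_monitor
```

Because the tracing work is elided, this toy loop normally finishes before the first one-second deadline and prints nothing; a checkpoint fires only once the loop has run longer than the interval. The second check inside the critical section is what keeps two threads that both saw an expired deadline from flushing twice.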