Periodic checkpoint output and lock-free hot-loop diagnostics #352

Open: krystophny wants to merge 1 commit into main from feature/progress-checkpoint

Motivation

The tracer writes its result files (times_lost.dat, confined_fraction.dat, class_parts.dat) only in the final write_output. A run killed before completion, such as a batch-scheduler timeout on a long slowing-down run, therefore loses everything, and a run in progress gives no progress signal.

Separately, the symplectic integrators print and dump to fort.660x on every Newton non-convergence and every r<0 event. These sit in the inner integration loop, so a difficult case floods stdout and serializes the OpenMP threads on the implicit I/O lock.

Change

  • diag_counters: per-thread, cache-line-padded event counters. The integrators call count_event(...) instead of printing; totals are a reduction over the thread columns, taken outside the hot path. No atomic and no critical on the increment.
  • progress_monitor: once per finished particle a thread reads the wall clock, which is lock-free. Every checkpoint_interval seconds one thread enters a rarely taken critical, writes the partial results and a one-line status (percent done, ETA, aggregated event counts), and advances the deadline. A killed run keeps its last flushed output. The module holds a dump callback, so it depends on no tracing code.
  • simple_main: write_output splits into a reusable write_results shared by the periodic checkpoint and the final write. The confined-fraction normalization moves there unchanged.
  • New namelist key checkpoint_interval (seconds, default 10; set <= 0 to disable).
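
The per-thread counters described above could look roughly like the following. This is a minimal sketch, not the diag_counters module from this PR: the module name, the event constants, the `SLOTS` padding of 64 bytes (an assumed cache-line size) and the `init_counters`/`total_events` helpers are all hypothetical. Only the idea of one padded column per thread, with a plain increment and a later sum, comes from the description.

```fortran
module diag_counters_sketch
    ! Illustrative sketch only: names, sizes and layout are assumptions,
    ! not the code in this PR.
    use, intrinsic :: iso_fortran_env, only: int64
    !$ use omp_lib, only: omp_get_thread_num, omp_get_max_threads
    implicit none
    private
    public :: init_counters, count_event, total_events
    public :: EV_NEWTON_MAXIT, EV_R_NEGATIVE

    ! Hypothetical event kinds; each is a row index into counts.
    integer, parameter :: EV_NEWTON_MAXIT = 1, EV_R_NEGATIVE = 2
    ! 8 x 8-byte integers = 64 bytes per thread column (assumed cache-line
    ! size), so the rows used by neighbouring threads sit 64 bytes apart.
    integer, parameter :: SLOTS = 8

    integer(int64), allocatable :: counts(:, :)   ! (SLOTS, nthreads)

contains

    subroutine init_counters()
        ! Call once, outside any parallel region. Assumes no team is
        ! larger than omp_get_max_threads() at the time of this call.
        integer :: nthreads
        nthreads = 1
        !$ nthreads = omp_get_max_threads()
        if (allocated(counts)) deallocate (counts)
        allocate (counts(SLOTS, nthreads))
        counts = 0_int64
    end subroutine init_counters

    subroutine count_event(event)
        ! Hot path: a plain increment of the calling thread's own column,
        ! with no atomic and no critical section.
        integer, intent(in) :: event
        integer :: tid
        tid = 1
        !$ tid = omp_get_thread_num() + 1
        counts(event, tid) = counts(event, tid) + 1_int64
    end subroutine count_event

    function total_events(event) result(total)
        ! Cold path: sum one event row over all thread columns.
        integer, intent(in) :: event
        integer(int64) :: total
        total = sum(counts(event, :))
    end function total_events

end module diag_counters_sketch
```

In this sketch an integrator would call `count_event(EV_R_NEGATIVE)` where it previously printed. The `!$` sentinel lines are compiled only when OpenMP is enabled, so the module also builds serially with a single column. `total_events` is exact once the parallel region has ended.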

Verification

Build: cmake --build build (Release, gfortran 15); the library and simple.x link cleanly.

Periodic output and recovery. A deterministic volume fast-classify with checkpoint_interval=1:

  • status line every ~1 s: [progress] 10.00% 200/2000 t=19.1s eta=171.8s newton1_maxit=10 r_negative=10 (aggregated counts, no per-iteration spam);
  • confined_fraction.dat, times_lost.dat, progress.dat present mid-run;
  • after kill -9, the files remain; times_lost.dat holds the finished particles with -1 sentinels for the rest.
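
For reference, a run like the one above might enable one-second checkpoints with an input fragment along these lines. Only the key checkpoint_interval comes from this PR; the namelist group name `&config` is an assumption about the input file layout.

```fortran
&config
  checkpoint_interval = 1   ! seconds between checkpoints; <= 0 disables them
/
```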

Final output unchanged. The same deterministic volume run with the monitor off (checkpoint_interval=0) and on (checkpoint_interval=10, so the run completes within one interval) produces a byte-identical class_parts.dat (diff clean). Confined fractions still divide by ntestpart; the code that writes the per-particle files differs from the old code only in using open(newunit=...) in place of a hard-coded unit.
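
The monitor stays cheap when it is off or between checkpoints because of the deadline check described under Change, which might be sketched as follows. This is a hedged illustration, not the progress_monitor module itself: the module and procedure names, the `dump_iface` callback interface, the use of `system_clock` as the wall clock, and the atomic read of the deadline are all assumptions. The status line here omits the ETA and event counts.

```fortran
module progress_sketch
    ! Illustrative sketch only: names and structure are assumptions,
    ! not the progress_monitor module in this PR.
    use, intrinsic :: iso_fortran_env, only: int64, real64
    implicit none
    private
    public :: init_progress, particle_done, dump_iface

    abstract interface
        subroutine dump_iface()
        end subroutine dump_iface
    end interface

    ! Callback that writes the partial results; the module itself
    ! knows nothing about tracing.
    procedure(dump_iface), pointer :: dump => null()
    real(real64) :: interval = 0.0_real64   ! seconds; <= 0 disables checkpoints
    real(real64) :: deadline = 0.0_real64   ! wall-clock time of the next checkpoint

contains

    function wall_seconds() result(t)
        real(real64) :: t
        integer(int64) :: ticks, rate
        call system_clock(ticks, rate)
        t = real(ticks, real64)/real(rate, real64)
    end function wall_seconds

    subroutine init_progress(interval_s, dump_proc)
        ! Call once before the parallel tracing loop.
        real(real64), intent(in) :: interval_s
        procedure(dump_iface) :: dump_proc
        interval = interval_s
        dump => dump_proc
        deadline = wall_seconds() + interval
    end subroutine init_progress

    subroutine particle_done(n_done, n_total)
        ! Call once per finished particle, from any thread.
        integer, intent(in) :: n_done, n_total
        real(real64) :: now, due
        if (interval <= 0.0_real64) return   ! monitor disabled
        now = wall_seconds()
        !$omp atomic read
        due = deadline
        if (now < due) return                ! common case: no lock taken
        !$omp critical (progress_sketch_flush)
        if (now >= deadline) then            ! re-check: another thread may have flushed
            if (associated(dump)) call dump()
            write (*, '(a,f6.2,a,i0,a,i0)') '[progress] ', &
                100.0_real64*real(n_done, real64)/real(n_total, real64), &
                '% ', n_done, '/', n_total
            !$omp atomic write
            deadline = wall_seconds() + interval
        end if
        !$omp end critical (progress_sketch_flush)
    end subroutine particle_done

end module progress_sketch
```

The re-check inside the critical section means that when several threads pass the deadline at once, only the first one flushes; the others see the advanced deadline and leave.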

The full golden_record suite runs in CI on this PR.

krystophny force-pushed the feature/progress-checkpoint branch from 231334c to 71e8ce8 on May 26, 2026, 06:20
Tracing wrote its result files only at the very end, so a run killed
before completion (a SLURM timeout, say) lost everything and gave no
progress signal. The symplectic integrators also printed and dumped to
fort.660x on every Newton non-convergence and r<0 event, deep in the
inner loop, which floods stdout and serializes threads.

- diag_counters: per-thread, cache-line-padded event counters. The
  integrators bump a counter instead of printing; totals are a reduction
  taken outside the hot path, with no atomic or critical on the count.
- progress_monitor: once per finished particle a thread reads the clock
  (lock-free); every checkpoint_interval seconds one thread flushes the
  partial results and a status line under a rarely taken critical. A
  killed run keeps its last flushed output.
- simple_main: write_output splits into a reusable write_results, so the
  periodic checkpoint and the final write share one path; the confined-
  fraction normalization moves there unchanged.
- checkpoint_interval namelist key (seconds, default 10).

Final output is unchanged: confined fractions still divide by ntestpart
and the per-particle files are byte-identical. Verified that the monitor
off versus on produces identical class_parts.dat on a deterministic
volume run.