megabatch: env-gated NaN capture wrapper around newton_schulz_func #78
Open
JohnLangford wants to merge 1 commit into
Issue #76 reports intermittent NaN params with NorMuon + gram-newton-schulz + quack-kernels 0.4.1 on 6x Blackwell RTX PRO 6000 + DDP. The reporter and maintainers cannot consistently reproduce, so the next step is to capture the actual Newton-Schulz input that triggers the failure for offline replay.

Add an env-gated wrapper around newton_schulz_func inside megabatch_orthogonalize_async:

    DION_NAN_CAPTURE=1                         # enable
    DION_NAN_CAPTURE_DIR=./dion_nan_captures   # output directory
    DION_NAN_CAPTURE_RAISE=1                   # raise after dump (default)

When enabled, each rank that observes a non-finite NS output writes a rank+shape+pid+timestamp-keyed .pt file containing the (cloned-up-front) input, the offending output, the rank, and epsilon, then by default raises RuntimeError. Multiple ranks dumping in parallel do not collide on a shared filesystem because the filename includes rank + pid. The check is local per rank: there is no extra collective.

When the env var is unset the wrapper short-circuits after a single os.environ lookup per megabatch ortho call (effectively free). Cloning the input only happens on the enabled path.

Tests: CPU-only single-rank tests for the five behaviors: disabled-by-default no-op, dump-on-NaN, raise-by-default, no-dump-when-finite, and rank-keyed filenames. All five pass without GPU/NCCL.
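For readers who want a concrete picture of the behavior described above, here is a minimal CPU-only sketch of such a wrapper. It is not the code in this PR. The function name `ns_with_nan_capture`, the `ns_func(X, epsilon=...)` call signature, the `"1"`-means-enabled parsing of the environment variables, the use of `./dion_nan_captures` as the fallback directory, the `4x8`-style shape encoding, and the dump dictionary keys are all assumptions made for illustration.

```python
import os
import time

import torch


def ns_with_nan_capture(ns_func, X, epsilon, rank=0):
    """Hypothetical stand-in for the env-gated capture wrapper."""
    # Disabled path: one environment lookup, no clone, no finiteness check.
    if os.environ.get("DION_NAN_CAPTURE") != "1":
        return ns_func(X, epsilon=epsilon)

    # Clone before the call so the input survives an in-place kernel.
    x_saved = X.detach().clone()
    out = ns_func(X, epsilon=epsilon)

    # Local check on this rank only; no collective is issued.
    if bool(torch.isfinite(out).all()):
        return out

    out_dir = os.environ.get("DION_NAN_CAPTURE_DIR", "./dion_nan_captures")
    os.makedirs(out_dir, exist_ok=True)
    shape = "x".join(str(d) for d in x_saved.shape)  # assumed encoding
    ts_ms = int(time.time() * 1000)
    name = (
        f"dion_nan_capture_rank{rank}_shape{shape}"
        f"_pid{os.getpid()}_{ts_ms}.pt"
    )
    path = os.path.join(out_dir, name)
    torch.save(
        {
            "input": x_saved.cpu(),
            "output": out.detach().cpu(),
            "rank": rank,
            "shape": tuple(x_saved.shape),
            "epsilon": epsilon,
        },
        path,
    )

    # Raising is the default; assumed here that any value other than "1"
    # turns it off.
    if os.environ.get("DION_NAN_CAPTURE_RAISE", "1") == "1":
        raise RuntimeError(
            f"non-finite Newton-Schulz output on rank {rank}; dumped to {path}"
        )
    return out
```

In this sketch the clone happens before the call rather than after, which is what keeps the saved input trustworthy if the kernel overwrites `X`. The price is one extra copy per call, paid only on the enabled path.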
Summary
Re: #76. The reporter and maintainers cannot consistently reproduce the
NorMuon + gram-newton-schulz + quack-kernels 0.4.1 NaN bug, so the next
debugging step is to capture the actual NS input that triggers the failure
on the affected hardware (6x Blackwell RTX PRO 6000 + DDP).
This PR wraps `newton_schulz_func` inside `megabatch_orthogonalize_async` with an env-gated capture wrapper. When enabled, each rank that observes a non-finite NS output writes its own .pt dump (input + output + rank + shape + epsilon) and by default raises `RuntimeError`. The check is local per rank, with no extra collective, and it short-circuits at one `os.environ` lookup when disabled.

Usage

    DION_NAN_CAPTURE=1                         # enable
    DION_NAN_CAPTURE_DIR=./dion_nan_captures   # output directory
    DION_NAN_CAPTURE_RAISE=1                   # raise after dump (default)

Per-rank filenames are `dion_nan_capture_rank{rank}_shape{...}_pid{...}_{ts_ms}.pt` so concurrent writes from multiple ranks on a shared filesystem do not collide.
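To illustrate why the filename pattern avoids collisions, the following sketch builds a name from the same four components. The helper `capture_filename` is hypothetical, and the `4x8`-style shape encoding is an assumption, since the description leaves the shape format unspecified.

```python
import os
import time


def capture_filename(rank, shape, pid=None, ts_ms=None):
    """Build a dump filename keyed by rank, shape, pid and a ms timestamp."""
    pid = os.getpid() if pid is None else pid
    ts_ms = int(time.time() * 1000) if ts_ms is None else ts_ms
    shape_str = "x".join(str(d) for d in shape)  # assumed shape encoding
    return f"dion_nan_capture_rank{rank}_shape{shape_str}_pid{pid}_{ts_ms}.pt"


# Two ranks with the same shape, pid and timestamp still get distinct names
# because the rank is part of the key.
print(capture_filename(0, (4, 8), pid=100, ts_ms=1700000000000))
print(capture_filename(1, (4, 8), pid=100, ts_ms=1700000000000))
```

Including the rank matters on multi-node jobs, where processes on different hosts may share a pid. The pid and timestamp keep repeated dumps from the same rank apart.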
Notes
- Disabled path: one `os.environ` lookup per call.
- Input is `.detach().clone()`-d up front so we still have the offending tensor even if a kernel mutates X in-place.
- [...] `os`, `time`).

Test plan

- `test_capture_disabled_by_default`: no env, no dumps even when NS returns NaN
- `test_capture_dumps_on_non_finite_output`: dump contains correct input/output/shape/rank/epsilon
- `test_capture_raises_by_default_when_enabled`: `RuntimeError` after dump
- `test_capture_no_dump_when_finite`: finite output -> no dump even with env on
- `test_capture_filename_includes_rank`: non-zero rank reflected in filename

All five pass single-rank without GPU/NCCL.