Add GPU reproduction scripts, eval outputs, and validation sets#9
Open
Longchentong wants to merge 7 commits into
Conditional-generation reproduction (ELF-B 105M+35M):
- run_eval.sh: per-seed/per-GPU eval runner (bf16, batch override for HAMi 80GB cap)
- resilient_download.sh / prefetch_assets.py: fetch HF checkpoints+datasets through flaky proxy
- outputs/: 3-seed metrics+generations. De-En BLEU 25.74+-0.43 (paper 26.4);
XSum ROUGE-1/2/L 36.00/12.39/27.99 (paper 36.0/12.2/27.8)
- ckpts/{wmt14_de-en,xsum}_validation_t5: T5-embedded validation sets used by eval
- .gitignore: excludes 798MB checkpoints (see GitHub Release / HuggingFace)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Operator-level routing only: the Flow Matching objective, add_noise, net_out_to_v_x and train_step loss math are untouched, and model.forward still returns (output, decoder_logits).
- geometry_router.py: GeometryRouter (parameter-free) computes per-example e_H via deterministic four-point delta_rel and e_S via constant-curvature cos-Gram eigvalue fit on diameter-normalized pairwise distances of a deterministic token subsample (prefix/mode control tokens excluded via geometry_mask). Scores run fp32 under no_grad and torch._dynamo.disable.
- layers.py: GeometryRoutedAttention mixes Euclidean / hyperbolic (first-order Busemann-PV proxy or Poincare distance) / spherical (cosine or negative-angular) attention logits by soft gates; values stay in R^d so branch outputs mix directly. Original Attention untouched.
- model.py: per-layer routers via geometry_router_layers spec; t and geometry_mask plumbed through blocks incl. gradient checkpointing.
- config.py: geometry_router_* fields, all defaults OFF; example YAML train_owt_ELF-B_geom_router.yml.
- Disabled model verified BIT-IDENTICAL to the pre-change code (same state_dict keys, max |out diff| = 0). Enabled model keeps the same state_dict (router has no params) so pretrained ckpts stay loadable.
- tests/test_geometry_router.py: 15 tests (unittest, no new deps).
- GPU smoke: torch.compile + bf16 autocast + grad ckpt + real train_step (Muon, self-cond CFG, mixed CE/L2 branching) all finite for 3 steps.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
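For readers unfamiliar with the four-point score mentioned above, here is a minimal pure-Python sketch of a relative Gromov delta on a small point set. It is not the code in geometry_router.py: the function name `four_point_delta_rel`, the exhaustive loop over all quadruples, and the `2 * delta / diameter` normalization (a common convention for relative hyperbolicity) are assumptions made for illustration, and the PR's subsampling and fp32 tensor handling are omitted.

```python
import math
from itertools import combinations


def four_point_delta_rel(points):
    """Relative four-point (Gromov) delta of a small Euclidean point set.

    Illustrative only: enumerates every quadruple, so it is O(n^4).
    Returns a value near 0 for tree-like sets and larger values otherwise.
    """
    dist = [[math.dist(p, q) for q in points] for p in points]
    diameter = max(max(row) for row in dist)
    if diameter == 0.0:
        return 0.0  # all points coincide; treat as perfectly tree-like

    delta = 0.0
    for i, j, k, l in combinations(range(len(points)), 4):
        # The three ways to pair up four points.
        sums = sorted([
            dist[i][j] + dist[k][l],
            dist[i][k] + dist[j][l],
            dist[i][l] + dist[j][k],
        ])
        # Four-point condition: half the gap between the two largest sums.
        delta = max(delta, (sums[2] - sums[1]) / 2.0)

    # Assumed normalization: scale-invariant ratio to the diameter.
    return 2.0 * delta / diameter


line = [(0.0,), (1.0,), (2.0,), (3.0,)]          # points on a line
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
print(four_point_delta_rel(line), four_point_delta_rel(square))
```

Collinear points satisfy the four-point condition exactly, so the score is 0 there, while the unit square gives 2 - sqrt(2), which is the kind of contrast a hyperbolicity evidence score is meant to expose.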
Review follow-ups:
- validate_geometry_router_config(): geometry_router_on_attention=false, geometry_router_on_mlp=true and geometry_router_mode!='soft' now raise NotImplementedError instead of being silently ignored. Called from both train.py and geometry_model_kwargs() (covers eval.py too); no-op when the router is disabled.
- geometry_router_denoiser_only (default false): resolves the 'denoiser attention operator' ambiguity explicitly.
  false = route every forward of the shared backbone (previous behavior, documented).
  true = strict denoiser semantics: per-row decoder_step_active tensors force decoder-mode rows to the pure Euclidean gate, and all-decoder forwards (decoder_step_active=True, i.e. the decode path at generation) bypass routing entirely via the original SDPA path. Forwards without decoder_step_active (ODE/SDE v-prediction, self-cond) stay routed.
- 8 new tests (23 total): reserved-option rejection, decode-path bit-equality vs the disabled model, mixed-row routing, denoiser forwards staying routed.
Disabled model re-verified bit-identical to pre-change code; torch.compile + bf16 smoke for mixed-row and decode paths.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
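The reserved-option rejection described above could look roughly like the following sketch. The three checked field names and the function name come from the commit message; the `RouterConfig` dataclass, the `geometry_router_enabled` flag, the default values, and the error messages are hypothetical stand-ins, since the real config.py is not shown here.

```python
from dataclasses import dataclass


@dataclass
class RouterConfig:
    # Hypothetical container; the real config lives in config.py.
    geometry_router_enabled: bool = False      # assumed name for the on/off switch
    geometry_router_on_attention: bool = True  # assumed default
    geometry_router_on_mlp: bool = False
    geometry_router_mode: str = "soft"


def validate_geometry_router_config(cfg):
    """Reject reserved options loudly instead of silently ignoring them."""
    if not cfg.geometry_router_enabled:
        return  # no-op when the router is disabled
    if not cfg.geometry_router_on_attention:
        raise NotImplementedError("geometry_router_on_attention=false is reserved")
    if cfg.geometry_router_on_mlp:
        raise NotImplementedError("geometry_router_on_mlp=true is reserved")
    if cfg.geometry_router_mode != "soft":
        raise NotImplementedError(
            f"geometry_router_mode={cfg.geometry_router_mode!r} is reserved; use 'soft'"
        )


# A disabled router never raises, whatever the other fields say.
validate_geometry_router_config(RouterConfig(geometry_router_mode="hard"))
```

Failing fast at config-validation time means a mistyped or not-yet-supported option stops the run before any GPU time is spent.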
…onfigs
- t5_encoder: the manual pairwise-mask stack walk was written against the
newer T5Block API (past_key_values/cache_position) and crashed on
transformers 4.44.2. Stock path is now preferred (4.44 expands 3D masks
correctly; verified bit-identical to the manual walk) with the manual walk
as a signature-adaptive fallback for newer transformers.
- train_de-en_ELF-B_scratch_{base,geom}.yml: identical 10-epoch from-scratch
De-En pilot arms (seed 0, batch 512, muon, grad ckpt); geom arm enables the
router on all layers with denoiser_only + learnable per-layer biases +
metrics logging.
- download_train_data.sh: resilient fetch of wmt14_de-en_train_t5 (2.0 GB).
- run_scratch_arm.sh: per-arm DDP launcher (GPU set, master port, log).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
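A signature-adaptive fallback of the kind described for t5_encoder can be sketched with `inspect.signature`: pass only the keyword arguments the callee actually declares. This is a generic illustration, not the PR's code; `call_with_supported_kwargs`, `old_block`, and `new_block` are hypothetical names, and the toy functions merely mimic an API that gained a `cache_position` argument.

```python
import inspect


def call_with_supported_kwargs(fn, *args, **kwargs):
    """Call fn, dropping keyword arguments its signature does not accept."""
    params = inspect.signature(fn).parameters
    accepts_any = any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
    )
    if accepts_any:
        return fn(*args, **kwargs)  # fn takes **kwargs, so nothing to filter
    supported = {k: v for k, v in kwargs.items() if k in params}
    return fn(*args, **supported)


# Toy stand-ins for an older and a newer block API (hypothetical).
def old_block(hidden, attention_mask=None):
    return ("old", hidden, attention_mask)


def new_block(hidden, attention_mask=None, cache_position=None):
    return ("new", hidden, attention_mask, cache_position)


print(call_with_supported_kwargs(old_block, 1, attention_mask="m", cache_position=0))
print(call_with_supported_kwargs(new_block, 1, attention_mask="m", cache_position=0))
```

Inspecting the signature avoids hard-coding version checks, at the cost of silently dropping arguments, so it suits optional parameters better than required ones.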
LR schedule is constant, so a 5-epoch cap is identical to the 10-epoch run truncated at epoch 5. watch_and_stop.sh halts each arm right after its epoch-5 checkpoint + in-training eval are flushed (waits for the 'Epoch 6/' log line) so GPUs free up instead of burning compute to epoch 10.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
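The stop condition used by watch_and_stop.sh (wait for the 'Epoch 6/' log line) can be expressed as a small predicate. The sketch below is a Python illustration of that check only; the function name `epoch_started` and the sample log lines are hypothetical, and the actual script is a shell script whose polling and process handling are not reproduced here.

```python
def epoch_started(log_lines, epoch):
    """Return True once the log shows that the given epoch has begun.

    Seeing 'Epoch 6/' implies the epoch-5 checkpoint and in-training eval
    were already written, which is the signal the commit message describes.
    """
    marker = f"Epoch {epoch}/"
    return any(marker in line for line in log_lines)


# Hypothetical log excerpts, for illustration only.
print(epoch_started(["Epoch 5/10 step 100"], 6))
print(epoch_started(["Epoch 5/10 step 100", "Epoch 6/10 step 0"], 6))
```

Including the trailing slash in the marker keeps 'Epoch 6/' from matching lines such as 'Epoch 16/'.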
…gates learn ~pure Euclidean)
epoch-5 matched (1000-sample, seed42): base BLEU 21.85 vs geom 22.05 (dBLEU +0.21, ROUGE within +-0.23); within single-seed noise, effectively tied. Learned gates drive gate_e -> ~1.0000 in deep layers: the model trains the router to disable the hyperbolic/sphere branches. v1 operator-level geometry routing gives no benefit here and costs ~3.4x train compute. res.csv rows 17-20 hold base/geom epoch-2 and epoch-5.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…h spiral
Forces the hyperbolic/sphere branches to train early so the learned-gate collapse to Euclidean can't be blamed on under-trained branches.
- geometry_router_gate_warmup_steps (config, default 0=off): for the first N steps, blend learned gates toward uniform [1/3,1/3,1/3] (linear 1->0).
- GeometryRouter.gate_warmup_alpha: requires_grad=False Parameter (NOT a buffer: torch.compile constant-folds buffers; verified). Whole router forward runs under torch._dynamo.disable so the alpha (and trainable bias) stay live; Muon/Adam skip it (requires_grad filter); eval resets it to 0.
- train.py drives the per-step schedule via set_gate_warmup_alpha(); metrics record the LEARNED gates (pre-blend), so we can see what the model chooses.
- Verified under torch.compile: alpha 1->0 changes output by 0.71 (was masked before by the zero-init final_layer), bias still receives gradient.
- New 15-epoch config: 4-GPU, fast router (sample_size=16, sphere_k=0.5,2.0), 5-epoch gate warmup. Tests updated for the extra (non-trainable) param.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
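The warmup blend described above is simple enough to sketch in plain Python. This is an illustration under assumptions, not the PR's tensor code: `gate_warmup_alpha` here is a free function (in the PR it is a Parameter on GeometryRouter), `blend_gates` is a hypothetical helper name, and the exact step at which alpha reaches 0 is assumed to be step N.

```python
def gate_warmup_alpha(step, warmup_steps):
    """Linear schedule from 1 at step 0 down to 0 at warmup_steps (0 = off)."""
    if warmup_steps <= 0:
        return 0.0
    return max(0.0, 1.0 - step / warmup_steps)


def blend_gates(learned, alpha):
    """Blend learned gates toward the uniform distribution.

    alpha=1 gives uniform gates (every branch gets equal weight);
    alpha=0 returns the learned gates unchanged.
    """
    uniform = 1.0 / len(learned)
    return [(1.0 - alpha) * g + alpha * uniform for g in learned]


learned = [0.98, 0.01, 0.01]  # illustrative gates: Euclidean, hyperbolic, spherical
for step in (0, 50, 100):
    alpha = gate_warmup_alpha(step, warmup_steps=100)
    print(step, alpha, blend_gates(learned, alpha))
```

Because both the learned gates and the uniform vector sum to 1, the blend stays a valid convex combination at every step, and logging the pre-blend gates shows what the model would choose without the warmup.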