Add GPU reproduction scripts, eval outputs, and validation sets by Longchentong · Pull Request #9 · lillian039/ELF

Longchentong · 2026-06-12T14:17:43Z

Conditional-generation reproduction (ELF-B 105M+35M):

- run_eval.sh: per-seed/per-GPU eval runner (bf16, batch override for HAMi 80GB cap)
- resilient_download.sh / prefetch_assets.py: fetch HF checkpoints + datasets through a flaky proxy
- outputs/: 3-seed metrics + generations. De-En BLEU 25.74 ± 0.43 (paper 26.4); XSum ROUGE-1/2/L 36.00/12.39/27.99 (paper 36.0/12.2/27.8)
- ckpts/{wmt14_de-en,xsum}_validation_t5: T5-embedded validation sets used by eval
- .gitignore: excludes 798MB checkpoints (see GitHub Release / HuggingFace)
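
For readers checking the "mean ± spread" figures in outputs/, here is a minimal sketch of a 3-seed aggregation. It assumes the spread is the sample standard deviation, which the PR does not state, and the scores below are placeholders, not the values behind the reported BLEU. `summarize_seeds` is a hypothetical helper name.

```python
import statistics


def summarize_seeds(scores):
    """Return (mean, sample standard deviation) of per-seed metric values."""
    if len(scores) < 2:
        raise ValueError("need at least two seeds to estimate a spread")
    return statistics.mean(scores), statistics.stdev(scores)


# Placeholder per-seed scores; NOT the numbers produced by run_eval.sh.
placeholder_bleu = [10.0, 11.0, 12.0]
mean, spread = summarize_seeds(placeholder_bleu)
print(f"BLEU {mean:.2f} +- {spread:.2f}")
```

If the reported spread was computed as a population standard deviation instead, `statistics.pstdev` would be the matching call.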


Operator-level routing only: the Flow Matching objective, add_noise, net_out_to_v_x and train_step loss math are untouched, and model.forward still returns (output, decoder_logits).

- geometry_router.py: GeometryRouter (parameter-free) computes per-example e_H via deterministic four-point delta_rel and e_S via constant-curvature cos-Gram eigenvalue fit on diameter-normalized pairwise distances of a deterministic token subsample (prefix/mode control tokens excluded via geometry_mask). Scores run fp32 under no_grad and torch._dynamo.disable.
- layers.py: GeometryRoutedAttention mixes Euclidean / hyperbolic (first-order Busemann-PV proxy or Poincare distance) / spherical (cosine or negative-angular) attention logits by soft gates; values stay in R^d so branch outputs mix directly. Original Attention untouched.
- model.py: per-layer routers via geometry_router_layers spec; t and geometry_mask plumbed through blocks incl. gradient checkpointing.
- config.py: geometry_router_* fields, all defaults OFF; example YAML train_owt_ELF-B_geom_router.yml.
- Disabled model verified BIT-IDENTICAL to the pre-change code (same state_dict keys, max |out diff| = 0). Enabled model keeps the same state_dict (router has no params) so pretrained ckpts stay loadable.
- tests/test_geometry_router.py: 15 tests (unittest, no new deps).
- GPU smoke: torch.compile + bf16 autocast + grad ckpt + real train_step (Muon, self-cond CFG, mixed CE/L2 branching) all finite for 3 steps.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
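
To make the "four-point delta_rel" score concrete, here is a minimal pure-Python sketch of the standard Gromov four-point hyperbolicity estimate on a small point set. This is not the code in geometry_router.py: the function name, the Euclidean distance, the exhaustive enumeration of quadruples, and the `2 * delta / diameter` normalization are all assumptions made for illustration.

```python
import itertools
import math


def four_point_delta_rel(points):
    """Relative four-point hyperbolicity of a small point set.

    For each quadruple, the three pairwise-distance sums are sorted and
    delta is half the gap between the two largest. The result is the
    worst-case delta scaled by the diameter (assumed normalization).
    """
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    diameter = max(max(row) for row in dist)
    if diameter == 0.0:
        return 0.0  # all points coincide; treat as perfectly tree-like

    delta = 0.0
    for a, b, c, d in itertools.combinations(range(n), 4):
        sums = sorted(
            [
                dist[a][b] + dist[c][d],
                dist[a][c] + dist[b][d],
                dist[a][d] + dist[b][c],
            ]
        )
        delta = max(delta, (sums[2] - sums[1]) / 2.0)
    return 2.0 * delta / diameter


# Points on a line form a tree metric, so the score is zero.
line_score = four_point_delta_rel([(0.0,), (1.0,), (3.0,), (7.0,)])
# Corners of a unit square are far from tree-like.
square_score = four_point_delta_rel([(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)])
print(line_score, square_score)
```

Exhaustive enumeration is O(n^4), which is presumably why the router works on a token subsample rather than the full sequence.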

Review follow-ups:

- validate_geometry_router_config(): geometry_router_on_attention=false, geometry_router_on_mlp=true and geometry_router_mode!='soft' now raise NotImplementedError instead of being silently ignored. Called from both train.py and geometry_model_kwargs() (covers eval.py too); no-op when the router is disabled.
- geometry_router_denoiser_only (default false): resolves the 'denoiser attention operator' ambiguity explicitly. false = route every forward of the shared backbone (previous behavior, documented). true = strict denoiser semantics: per-row decoder_step_active tensors force decoder-mode rows to the pure Euclidean gate, and all-decoder forwards (decoder_step_active=True, i.e. the decode path at generation) bypass routing entirely via the original SDPA path. Forwards without decoder_step_active (ODE/SDE v-prediction, self-cond) stay routed.
- 8 new tests (23 total): reserved-option rejection, decode-path bit-equality vs the disabled model, mixed-row routing, denoiser forwards staying routed. Disabled model re-verified bit-identical to pre-change code; torch.compile + bf16 smoke for mixed-row and decode paths.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
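
The reserved-option check described above can be sketched roughly as follows. This is an illustration, not the PR's implementation: the `RouterConfig` dataclass, its default values, and the `geometry_router_enabled` flag are hypothetical stand-ins, and only the three rejected settings come from the commit message.

```python
from dataclasses import dataclass


@dataclass
class RouterConfig:
    """Hypothetical stand-in for the geometry_router_* config fields."""

    geometry_router_enabled: bool = False  # assumed name for the on/off switch
    geometry_router_on_attention: bool = True  # assumed default
    geometry_router_on_mlp: bool = False
    geometry_router_mode: str = "soft"


def validate_geometry_router_config(cfg):
    """Reject reserved options instead of silently ignoring them."""
    if not cfg.geometry_router_enabled:
        return  # no-op when the router is disabled
    if not cfg.geometry_router_on_attention:
        raise NotImplementedError("geometry_router_on_attention=false is reserved")
    if cfg.geometry_router_on_mlp:
        raise NotImplementedError("geometry_router_on_mlp=true is reserved")
    if cfg.geometry_router_mode != "soft":
        raise NotImplementedError(
            f"geometry_router_mode={cfg.geometry_router_mode!r} is reserved"
        )


# A disabled router passes regardless of the reserved fields.
validate_geometry_router_config(RouterConfig(geometry_router_on_mlp=True))
```

Failing loudly here means a typo in a YAML config surfaces at startup rather than as a silently unrouted training run.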

…onfigs

- t5_encoder: the manual pairwise-mask stack walk was written against the newer T5Block API (past_key_values/cache_position) and crashed on transformers 4.44.2. Stock path is now preferred (4.44 expands 3D masks correctly; verified bit-identical to the manual walk) with the manual walk as a signature-adaptive fallback for newer transformers.
- train_de-en_ELF-B_scratch_{base,geom}.yml: identical 10-epoch from-scratch De-En pilot arms (seed 0, batch 512, muon, grad ckpt); geom arm enables the router on all layers with denoiser_only + learnable per-layer biases + metrics logging.
- download_train_data.sh: resilient fetch of wmt14_de-en_train_t5 (2.0 GB).
- run_scratch_arm.sh: per-arm DDP launcher (GPU set, master port, log).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
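
One common way to build a "signature-adaptive" call is to inspect the callee and pass only the keyword arguments it accepts. The sketch below shows that general pattern with stand-in functions; it is not the t5_encoder code from this PR, and `call_with_supported_kwargs`, `old_block` and `new_block` are hypothetical names.

```python
import inspect


def call_with_supported_kwargs(fn, *args, **kwargs):
    """Call fn, dropping keyword arguments its signature does not accept."""
    params = inspect.signature(fn).parameters
    takes_var_kwargs = any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
    )
    if takes_var_kwargs:
        return fn(*args, **kwargs)  # fn accepts anything; pass it all through
    accepted = {name: value for name, value in kwargs.items() if name in params}
    return fn(*args, **accepted)


# Stand-ins for an older and a newer block API (hypothetical signatures).
def old_block(hidden, attention_mask=None):
    return ("old", hidden, attention_mask)


def new_block(hidden, attention_mask=None, cache_position=None):
    return ("new", hidden, attention_mask, cache_position)


old_result = call_with_supported_kwargs(
    old_block, 1, attention_mask=2, cache_position=3
)
new_result = call_with_supported_kwargs(
    new_block, 1, attention_mask=2, cache_position=3
)
print(old_result, new_result)
```

The trade-off is that a dropped argument is silent, so this pattern suits optional arguments such as cache bookkeeping rather than ones that change the result.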

LR schedule is constant, so a 5-epoch cap is identical to the 10-epoch run truncated at epoch 5. watch_and_stop.sh halts each arm right after its epoch-5 checkpoint + in-training eval are flushed (waits for the 'Epoch 6/' log line) so GPUs free up instead of burning compute to epoch 10.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
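
The wait-for-a-log-line idea behind watch_and_stop.sh can be sketched in Python as below. The real script is shell and is not reproduced here; `wait_for_marker`, its polling interval and its timeout are hypothetical choices for illustration.

```python
import time


def wait_for_marker(log_path, marker, poll_seconds=1.0, timeout_seconds=None):
    """Poll a log file until a line contains marker.

    Returns True once the marker is seen, or False if timeout_seconds
    elapses first. A missing log file is treated as "not yet written".
    """
    start = time.monotonic()
    while True:
        try:
            with open(log_path, "r", errors="replace") as handle:
                if any(marker in line for line in handle):
                    return True
        except FileNotFoundError:
            pass  # the training job may not have created the log yet
        if timeout_seconds is not None:
            if time.monotonic() - start >= timeout_seconds:
                return False
        time.sleep(poll_seconds)
```

A caller would invoke something like `wait_for_marker("train.log", "Epoch 6/")` and stop the training process once it returns True; the log file name here is a placeholder.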

…gates learn ~pure Euclidean)

epoch-5 matched (1000-sample, seed42): base BLEU 21.85 vs geom 22.05 (dBLEU +0.21, ROUGE within ±0.23), within single-seed noise, effectively tied. Learned gates drive gate_e -> ~1.0000 in deep layers: the model trains the router to disable the hyperbolic/sphere branches. v1 operator-level geometry routing gives no benefit here and costs ~3.4x train compute. res.csv rows 17-20 hold base/geom epoch-2 and epoch-5.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…h spiral

Forces the hyperbolic/sphere branches to train early so the learned-gate collapse to Euclidean can't be blamed on under-trained branches.

- geometry_router_gate_warmup_steps (config, default 0=off): for the first N steps, blend learned gates toward uniform [1/3,1/3,1/3] (linear 1->0).
- GeometryRouter.gate_warmup_alpha: requires_grad=False Parameter (NOT a buffer: torch.compile constant-folds buffers; verified). Whole router forward runs under torch._dynamo.disable so the alpha (and trainable bias) stay live; Muon/Adam skip it (requires_grad filter); eval resets it to 0.
- train.py drives the per-step schedule via set_gate_warmup_alpha(); metrics record the LEARNED gates (pre-blend), so we can see what the model chooses.
- Verified under torch.compile: alpha 1->0 changes output by 0.71 (was masked before by the zero-init final_layer), bias still receives gradient.
- New 15-epoch config: 4-GPU, fast router (sample_size=16, sphere_k=0.5,2.0), 5-epoch gate warmup. Tests updated for the extra (non-trainable) param.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
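
The gate warmup described above amounts to a convex blend between the uniform gate and the learned gate. Here is a minimal sketch in plain Python, assuming alpha falls linearly from 1 at step 0 to 0 at step N; the exact endpoint convention and both function names are assumptions, not taken from the PR's code.

```python
def gate_warmup_alpha(step, warmup_steps):
    """Linear 1 -> 0 schedule; warmup_steps <= 0 means warmup is off."""
    if warmup_steps <= 0:
        return 0.0
    return max(0.0, 1.0 - step / warmup_steps)


def blend_gates(learned, alpha):
    """Blend learned gates toward the uniform distribution by alpha."""
    uniform = 1.0 / len(learned)
    return [alpha * uniform + (1.0 - alpha) * gate for gate in learned]


# Hypothetical learned gates (Euclidean, hyperbolic, spherical).
learned = [0.9, 0.05, 0.05]
for step in (0, 50, 100):
    alpha = gate_warmup_alpha(step, warmup_steps=100)
    print(step, alpha, blend_gates(learned, alpha))
```

Because both inputs sum to 1, the blended gates also sum to 1 for any alpha in [0, 1], so the mixture stays a valid soft gate throughout warmup.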

Longchentong and others added 7 commits June 12, 2026 14:12

Longchentong wants to merge 7 commits into lillian039:pytorch_elf from Longchentong:pytorch_elf
