Add GPU reproduction scripts, eval outputs, and validation sets #9

Open
Longchentong wants to merge 7 commits into lillian039:pytorch_elf from Longchentong:pytorch_elf

Conditional-generation reproduction (ELF-B 105M+35M):

  • run_eval.sh: per-seed/per-GPU eval runner (bf16, batch override for HAMi 80GB cap)
  • resilient_download.sh / prefetch_assets.py: fetch HF checkpoints+datasets through flaky proxy
  • outputs/: 3-seed metrics+generations. De-En BLEU 25.74 ± 0.43 (paper 26.4); XSum ROUGE-1/2/L 36.00/12.39/27.99 (paper 36.0/12.2/27.8)
  • ckpts/{wmt14_de-en,xsum}_validation_t5: T5-embedded validation sets used by eval
  • .gitignore: excludes 798MB checkpoints (see GitHub Release / HuggingFace)

Longchentong and others added 7 commits June 12, 2026 14:12
Conditional-generation reproduction (ELF-B 105M+35M):
- run_eval.sh: per-seed/per-GPU eval runner (bf16, batch override for HAMi 80GB cap)
- resilient_download.sh / prefetch_assets.py: fetch HF checkpoints+datasets through flaky proxy
- outputs/: 3-seed metrics+generations. De-En BLEU 25.74 ± 0.43 (paper 26.4);
  XSum ROUGE-1/2/L 36.00/12.39/27.99 (paper 36.0/12.2/27.8)
- ckpts/{wmt14_de-en,xsum}_validation_t5: T5-embedded validation sets used by eval
- .gitignore: excludes 798MB checkpoints (see GitHub Release / HuggingFace)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Operator-level routing only — the Flow Matching objective, add_noise,
net_out_to_v_x and train_step loss math are untouched, and model.forward
still returns (output, decoder_logits).

- geometry_router.py: GeometryRouter (parameter-free) computes per-example
  e_H via deterministic four-point delta_rel and e_S via constant-curvature
  cos-Gram eigenvalue fit on diameter-normalized pairwise distances of a
  deterministic token subsample (prefix/mode control tokens excluded via
  geometry_mask). Scores run in fp32 under no_grad and torch._dynamo.disable.
- layers.py: GeometryRoutedAttention mixes Euclidean / hyperbolic
  (first-order Busemann-PV proxy or Poincare distance) / spherical
  (cosine or negative-angular) attention logits by soft gates; values stay
  in R^d so branch outputs mix directly. Original Attention untouched.
- model.py: per-layer routers via geometry_router_layers spec; t and
  geometry_mask plumbed through blocks incl. gradient checkpointing.
- config.py: geometry_router_* fields, all defaults OFF; example YAML
  train_owt_ELF-B_geom_router.yml.
- Disabled model verified BIT-IDENTICAL to the pre-change code (same
  state_dict keys, max |out diff| = 0). Enabled model keeps the same
  state_dict (router has no params) so pretrained ckpts stay loadable.
- tests/test_geometry_router.py: 15 tests (unittest, no new deps).
- GPU smoke: torch.compile + bf16 autocast + grad ckpt + real train_step
  (Muon, self-cond CFG, mixed CE/L2 branching) all finite for 3 steps.
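
For readers unfamiliar with the four-point condition behind e_H, here is a minimal pure-Python sketch. It is not the code in geometry_router.py: the function name, the Euclidean input distances, and the normalization (maximum four-point delta divided by the diameter) are assumptions made for illustration, and the router's token subsampling and masking are left out.

```python
import itertools
import math


def four_point_delta_rel(points):
    """Relative four-point (Gromov) hyperbolicity of a small point set.

    Assumption: delta_rel = (max four-point delta) / diameter. The exact
    normalization used by the router is not shown in this PR.
    """
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    diameter = max(max(row) for row in dist)
    if n < 4 or diameter == 0.0:
        return 0.0

    delta = 0.0
    for x, y, z, w in itertools.combinations(range(n), 4):
        # The three ways to split four points into two pairs.
        sums = sorted(
            (
                dist[x][y] + dist[z][w],
                dist[x][z] + dist[y][w],
                dist[x][w] + dist[y][z],
            )
        )
        # Four-point condition: the two largest sums differ by at most 2*delta.
        delta = max(delta, (sums[2] - sums[1]) / 2.0)
    return delta / diameter


collinear = [(0.0,), (1.0,), (2.0,), (3.0,)]  # tree-like metric
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
print(four_point_delta_rel(collinear))
print(four_point_delta_rel(square))
```

Points on a line behave like a tree and score 0, while the unit square scores 1 - 1/sqrt(2); a lower value means the subsample looks more hyperbolic (tree-like). Enumerating every quadruple costs O(n^4), which is presumably one reason the router works on a small deterministic subsample.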

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Review follow-ups:
- validate_geometry_router_config(): geometry_router_on_attention=false,
  geometry_router_on_mlp=true and geometry_router_mode!='soft' now raise
  NotImplementedError instead of being silently ignored. Called from both
  train.py and geometry_model_kwargs() (covers eval.py too); no-op when the
  router is disabled.
- geometry_router_denoiser_only (default false): resolves the 'denoiser
  attention operator' ambiguity explicitly. false = route every forward of
  the shared backbone (previous behavior, documented). true = strict
  denoiser semantics: per-row decoder_step_active tensors force decoder-mode
  rows to the pure Euclidean gate, and all-decoder forwards
  (decoder_step_active=True, i.e. the decode path at generation) bypass
  routing entirely via the original SDPA path. Forwards without
  decoder_step_active (ODE/SDE v-prediction, self-cond) stay routed.
- 8 new tests (23 total): reserved-option rejection, decode-path
  bit-equality vs the disabled model, mixed-row routing, denoiser forwards
  staying routed. Disabled model re-verified bit-identical to pre-change
  code; torch.compile + bf16 smoke for mixed-row and decode paths.
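
The reserved-option check described in the first bullet can be sketched roughly as follows. The RouterConfig dataclass, the geometry_router_enabled flag name, and the default values are hypothetical stand-ins; only validate_geometry_router_config and the three option names come from this commit message.

```python
from dataclasses import dataclass


@dataclass
class RouterConfig:
    # Hypothetical container for illustration; defaults here are assumptions.
    geometry_router_enabled: bool = False  # assumed name for the on/off switch
    geometry_router_on_attention: bool = True
    geometry_router_on_mlp: bool = False
    geometry_router_mode: str = "soft"


def validate_geometry_router_config(cfg):
    """Reject reserved options instead of silently ignoring them."""
    if not cfg.geometry_router_enabled:
        return  # no-op when the router is disabled
    if not cfg.geometry_router_on_attention:
        raise NotImplementedError("geometry_router_on_attention=false is reserved")
    if cfg.geometry_router_on_mlp:
        raise NotImplementedError("geometry_router_on_mlp=true is reserved")
    if cfg.geometry_router_mode != "soft":
        raise NotImplementedError("only geometry_router_mode='soft' is implemented")
```

Calling the check from every entry point that builds the model (training and eval alike) means a reserved setting fails at startup rather than producing a run that quietly ignored it.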

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…onfigs

- t5_encoder: the manual pairwise-mask stack walk was written against the
  newer T5Block API (past_key_values/cache_position) and crashed on
  transformers 4.44.2. Stock path is now preferred (4.44 expands 3D masks
  correctly; verified bit-identical to the manual walk) with the manual walk
  as a signature-adaptive fallback for newer transformers.
- train_de-en_ELF-B_scratch_{base,geom}.yml: identical 10-epoch from-scratch
  De-En pilot arms (seed 0, batch 512, muon, grad ckpt); geom arm enables the
  router on all layers with denoiser_only + learnable per-layer biases +
  metrics logging.
- download_train_data.sh: resilient fetch of wmt14_de-en_train_t5 (2.0 GB).
- run_scratch_arm.sh: per-arm DDP launcher (GPU set, master port, log).
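
One common way to make a call "signature-adaptive", as the t5_encoder bullet puts it, is to inspect the callee and pass only the keyword arguments it accepts. The sketch below shows that general pattern with the standard library; the helper name and the two stand-in functions are hypothetical and are not the PR's code or the transformers API.

```python
import inspect


def call_with_supported_kwargs(fn, *args, **kwargs):
    """Call fn, dropping keyword arguments that its signature does not accept."""
    params = inspect.signature(fn).parameters
    accepts_var_kw = any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
    )
    if not accepts_var_kw:
        kwargs = {k: v for k, v in kwargs.items() if k in params}
    return fn(*args, **kwargs)


# Stand-ins for an older and a newer block signature (illustrative only).
def old_block(hidden, attention_mask=None):
    return ("old", hidden, attention_mask)


def new_block(hidden, attention_mask=None, cache_position=None):
    return ("new", hidden, cache_position)


print(call_with_supported_kwargs(old_block, "h", attention_mask="m", cache_position=0))
print(call_with_supported_kwargs(new_block, "h", attention_mask="m", cache_position=0))
```

For a module object the same idea would be applied to its forward method, under the assumption that the keyword names are the only thing that differs between library versions.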

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
LR schedule is constant, so a 5-epoch cap is identical to the 10-epoch
run truncated at epoch 5. watch_and_stop.sh halts each arm right after its
epoch-5 checkpoint + in-training eval are flushed (waits for the 'Epoch 6/'
log line) so GPUs free up instead of burning compute to epoch 10.
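
The waiting logic can be pictured with the Python sketch below. watch_and_stop.sh itself is a shell script whose contents are not shown here, so the function name, the polling interval, and the timeout parameter are assumptions.

```python
import time
from pathlib import Path


def wait_for_marker(log_path, marker, poll_seconds=1.0, timeout=None):
    """Poll a log file until `marker` appears; return True, or False on timeout."""
    path = Path(log_path)
    deadline = None if timeout is None else time.monotonic() + timeout
    while True:
        if path.exists() and marker in path.read_text(errors="replace"):
            return True
        if deadline is not None and time.monotonic() >= deadline:
            return False
        time.sleep(poll_seconds)
```

Once `wait_for_marker(log, "Epoch 6/")` returns True, a caller would stop that arm's launcher process, for example with `os.kill(pid, signal.SIGTERM)`. Keying on the first line of the next epoch, rather than the last line of epoch 5, is what ensures the epoch-5 checkpoint and eval output are already on disk.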

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…gates learn ~pure Euclidean)

epoch-5 matched (1000-sample, seed 42): base BLEU 21.85 vs geom 22.05 (dBLEU +0.21,
ROUGE within ±0.23), i.e. within single-seed noise and effectively tied. Learned gates
drive gate_e -> ~1.0000 in deep layers: the model trains the router to disable the
hyperbolic/spherical branches. v1 operator-level geometry routing gives no benefit here
and costs ~3.4x train compute. res.csv rows 17-20 hold the base/geom epoch-2 and epoch-5 results.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…h spiral

Forces the hyperbolic/spherical branches to train early so the learned-gate
collapse to Euclidean can't be blamed on under-trained branches.

- geometry_router_gate_warmup_steps (config, default 0=off): for the first N
  steps, blend learned gates toward uniform [1/3,1/3,1/3] (linear 1->0).
- GeometryRouter.gate_warmup_alpha: requires_grad=False Parameter (NOT a
  buffer — torch.compile constant-folds buffers; verified). Whole router
  forward runs under torch._dynamo.disable so the alpha (and trainable bias)
  stay live; Muon/Adam skip it (requires_grad filter); eval resets it to 0.
- train.py drives the per-step schedule via set_gate_warmup_alpha(); metrics
  record the LEARNED gates (pre-blend), so we can see what the model chooses.
- Verified under torch.compile: alpha 1->0 changes output by 0.71 (was masked
  before by the zero-init final_layer), bias still receives gradient.
- New 15-epoch config: 4-GPU, fast router (sample_size=16, sphere_k=0.5,2.0),
  5-epoch gate warmup. Tests updated for the extra (non-trainable) param.
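
As a worked illustration of the warmup blend, the sketch below computes a linear 1->0 alpha and mixes it with a uniform gate. Both function names are hypothetical, the gate values are made up for the example, and the real implementation operates on tensors inside GeometryRouter rather than Python lists.

```python
def warmup_alpha_at(step, warmup_steps):
    """Linear 1 -> 0 blend weight over the first `warmup_steps` steps (0 = off)."""
    if warmup_steps <= 0 or step >= warmup_steps:
        return 0.0
    return 1.0 - step / warmup_steps


def blend_gates(learned, alpha):
    """Mix learned gates with the uniform gate; alpha=1 is fully uniform."""
    uniform = 1.0 / len(learned)
    return [alpha * uniform + (1.0 - alpha) * g for g in learned]


# Illustrative (gate_e, gate_h, gate_s) values, not measurements from this PR.
learned = [0.98, 0.01, 0.01]
for step in (0, 50, 100):
    alpha = warmup_alpha_at(step, warmup_steps=100)
    print(step, alpha, blend_gates(learned, alpha))
```

At step 0 every branch receives weight 1/3 regardless of what the router has learned, and by the end of warmup the blended gates equal the learned ones. Because both inputs sum to 1, the blend does too, so no renormalization is needed.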

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>