[gemma4] Add Frontier single-allocation rollout launcher #12

Draft
Bryan-Nsoh wants to merge 1 commit into rwr-loe-data from gemma4-frontier-rollouts


Bryan-Nsoh commented Jun 26, 2026

What changed

  • Add the gemma4-26b-a4b-it Frontier preset for the staged Gemma 4 26B A4B IT model and the verified vLLM ROCm nightly container.
  • Add scripts/submit_gemma4_frontier_fat.sh, the entry point Pete should run for Gemma rollouts.
  • Add scripts/merge_trajectory_lanes.py so one job can merge the node-local outputs into ${RUN_ROOT}/trajectories_merged.
  • Update generate_trajectories.slurm so each lane stages the model to Frontier NVMe, launches vLLM through the OLCF step wrapper when present, and keeps Gemma-specific completions settings explicit.
  • Document the exact Frontier command in the README.
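
To illustrate the lane-merge step, here is a minimal sketch of how per-lane outputs could be combined into one directory. It is not the code in scripts/merge_trajectory_lanes.py. The function name `merge_lanes`, the per-lane file name `trajectories.jsonl`, the one-JSON-object-per-line layout, and the added `lane` field are all assumptions made for the example.

```python
import json
from pathlib import Path
from typing import List


def merge_lanes(lane_dirs: List[Path], merged_dir: Path,
                filename: str = "trajectories.jsonl") -> int:
    """Concatenate per-lane JSONL files into merged_dir / filename.

    Hypothetical layout: each lane directory holds one JSONL file of
    JSON objects. Returns the number of records written.
    """
    merged_dir.mkdir(parents=True, exist_ok=True)
    count = 0
    with (merged_dir / filename).open("w", encoding="utf-8") as out:
        # Sort lanes so the merged order is deterministic across runs.
        for lane_dir in sorted(lane_dirs):
            lane_file = lane_dir / filename
            if not lane_file.exists():
                continue  # a lane that produced nothing is skipped
            with lane_file.open(encoding="utf-8") as src:
                for line in src:
                    line = line.strip()
                    if not line:
                        continue
                    record = json.loads(line)
                    # Tag each record with its source lane (assumed field).
                    record["lane"] = lane_dir.name
                    out.write(json.dumps(record) + "\n")
                    count += 1
    return count
```

Sorting the lane directories keeps the merged file reproducible regardless of which node finishes first.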

How to run

From the repository checkout on Frontier:

TASKS_PATH=/lustre/orion/syb111/proj-shared/Personal/krusepi/projects/llms/mentor-rl/data/module_corpus_full_brain_mixed/tasks.train.jsonl \
TASK_COUNT=64 \
NODES=4 \
TASK_CONCURRENCY=12 \
scripts/submit_gemma4_frontier_fat.sh

Defaults assume:

  • model: ${SCRATCH}/hf_downloads/google/gemma-4-26B-A4B-it
  • container: ${SCRATCH}/containers/vllm_openai_rocm_nightly-aa2b56ffb0c1d8b29d6468282a49d9b4a9f0dd1c.sif

Override MODEL_PATH or SIF if those are staged somewhere else.

Launch shape

The script submits one Frontier allocation. Each node runs one vLLM server with tensor parallel size 8 across the node's 8 MI250X GCDs. Each node is an independent rollout lane, and the selected tasks are balanced across the lanes. The model loads once per node, from node-local NVMe when available.
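
As a rough illustration of what "balanced" can mean here, the sketch below assigns tasks to lanes round-robin so that lane sizes differ by at most one. The PR does not state which balancing scheme the launcher uses, so round-robin and the name `split_into_lanes` are assumptions.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_into_lanes(tasks: Sequence[T], num_lanes: int) -> List[List[T]]:
    """Round-robin tasks across lanes so lane sizes differ by at most one."""
    if num_lanes < 1:
        raise ValueError("num_lanes must be at least 1")
    lanes: List[List[T]] = [[] for _ in range(num_lanes)]
    for index, task in enumerate(tasks):
        lanes[index % num_lanes].append(task)
    return lanes


if __name__ == "__main__":
    # Shape taken from the example command: TASK_COUNT=64, NODES=4.
    lanes = split_into_lanes(list(range(64)), 4)
    print([len(lane) for lane in lanes])  # [16, 16, 16, 16]
```

With TASK_COUNT=64 and NODES=4, every lane receives 16 tasks under this scheme.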

Validation

Local checks:

bash -n generate_trajectories.slurm scripts/submit_gemma4_frontier_fat.sh
python3 -m py_compile scripts/merge_trajectory_lanes.py
uv run --python 3.12 --with pytest --with networkx --with requests --with pandas --with scipy \
  python -m pytest tests/test_gemma4_frontier_fat_launcher.py tests/test_generate_trajectories.py -q

Result: 68 passed, 1 skipped.

Frontier debug validation:

  • Job 4905071, run root /lustre/orion/syb114/scratch/bryannsoh/mentor_rl_speed_calibration/runs/debug_gemma4_fat_smoke_20260626_151848: COMPLETED, 3 nodes, 3 lanes, 3 merged trajectories.
  • Server logs from all three lanes show /mnt/bb/.../gemma-4-26B-A4B-it, dtype=bfloat16, tensor_parallel_size=8, reasoning_parser=gemma4, and moe_backend=triton.
  • Calibration run 4904709 produced 9 trajectories across 3 lanes and merged to 9 branch pools, 9 trajectory turns, 9 finding records, 39 raw preference pairs, and 11 filtered preference pairs.
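
The server-log check above can be automated along these lines. This is a sketch only: it assumes the settings appear in the log as literal `key=value` tokens, as they are quoted in this description, and the names `EXPECTED_SETTINGS` and `missing_settings` are hypothetical.

```python
import re
from typing import Dict, List

# Settings quoted in the validation notes; the key=value form is assumed.
EXPECTED_SETTINGS: Dict[str, str] = {
    "dtype": "bfloat16",
    "tensor_parallel_size": "8",
    "reasoning_parser": "gemma4",
    "moe_backend": "triton",
}


def missing_settings(log_text: str,
                     expected: Dict[str, str] = EXPECTED_SETTINGS) -> List[str]:
    """Return the expected key=value tokens that do not appear in log_text."""
    missing = []
    for key, value in expected.items():
        # Word boundaries stop tensor_parallel_size=8 from matching "=80"
        # and stop dtype from matching inside a longer key name.
        pattern = rf"\b{re.escape(key)}={re.escape(value)}\b"
        if re.search(pattern, log_text) is None:
            missing.append(f"{key}={value}")
    return missing
```

An empty return value for each lane's log means all four settings were found.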

Bryan-Nsoh force-pushed the gemma4-frontier-rollouts branch from 9837a37 to dd114c8 on June 26, 2026 15:38
Bryan-Nsoh changed the title from "[gemma4] Add Frontier rollout launcher" to "[gemma4] Add Frontier single-allocation rollout launcher" on Jun 26, 2026