[gemma4] Add Frontier single-allocation rollout launcher by Bryan-Nsoh · Pull Request #12 · Jacobson-CompSysBio/mentor-rl

Bryan-Nsoh · 2026-06-26T12:19:25Z

What changed

Add the gemma4-26b-a4b-it Frontier preset for the staged Gemma 4 26B A4B IT model and the verified vLLM ROCm nightly container.
Add scripts/submit_gemma4_frontier_fat.sh, the launcher Pete should run for Gemma rollouts.
Add scripts/merge_trajectory_lanes.py so one job can merge the node-local outputs into ${RUN_ROOT}/trajectories_merged.
Update generate_trajectories.slurm so each lane stages the model to Frontier NVMe, launches vLLM through the OLCF step wrapper when present, and keeps Gemma-specific completions settings explicit.
Document the exact Frontier command in the README.
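
The lane merge described above can be pictured with a minimal sketch. This is not the code in scripts/merge_trajectory_lanes.py: the function name, the per-lane file name trajectories.jsonl, and the added "lane" field are all hypothetical, and it assumes each lane writes one JSON object per line.

```python
import json
from pathlib import Path


def merge_lane_jsonl(lane_dirs, merged_dir, filename="trajectories.jsonl"):
    """Concatenate one JSONL file per lane into merged_dir/filename.

    Returns the number of records written. Lanes that did not produce
    the file are skipped. The file name and layout are assumptions.
    """
    merged_dir = Path(merged_dir)
    merged_dir.mkdir(parents=True, exist_ok=True)
    count = 0
    with open(merged_dir / filename, "w", encoding="utf-8") as out:
        # Sort lane directories so the merged order is deterministic.
        for lane_index, lane_dir in enumerate(sorted(Path(p) for p in lane_dirs)):
            src = lane_dir / filename
            if not src.exists():
                continue
            with open(src, encoding="utf-8") as f:
                for line in f:
                    line = line.strip()
                    if not line:
                        continue
                    record = json.loads(line)  # fails loudly on a malformed line
                    record["lane"] = lane_index  # hypothetical provenance field
                    out.write(json.dumps(record) + "\n")
                    count += 1
    return count
```

Tagging each record with its lane index is one way to keep provenance after the merge; the real script may track it differently or not at all.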

How to run

From the repository checkout on Frontier:

TASKS_PATH=/lustre/orion/syb111/proj-shared/Personal/krusepi/projects/llms/mentor-rl/data/module_corpus_full_brain_mixed/tasks.train.jsonl \
TASK_COUNT=64 \
NODES=4 \
TASK_CONCURRENCY=12 \
scripts/submit_gemma4_frontier_fat.sh

Defaults assume:

model: ${SCRATCH}/hf_downloads/google/gemma-4-26B-A4B-it
container: ${SCRATCH}/containers/vllm_openai_rocm_nightly-aa2b56ffb0c1d8b29d6468282a49d9b4a9f0dd1c.sif

Override MODEL_PATH or SIF if those are staged somewhere else.
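
The default-and-override behavior can be sketched as follows. The launcher itself is a shell script, so this Python version only mirrors the logic described above; resolve_paths is a hypothetical helper, and treating an empty variable as unset is an assumption.

```python
import os


def resolve_paths(env=None):
    """Return (model_path, sif_path), preferring explicit overrides.

    MODEL_PATH and SIF win when set and non-empty; otherwise the
    defaults under SCRATCH from this PR description are used.
    """
    env = os.environ if env is None else env
    scratch = env.get("SCRATCH", "")
    model = env.get("MODEL_PATH") or (
        f"{scratch}/hf_downloads/google/gemma-4-26B-A4B-it"
    )
    sif = env.get("SIF") or (
        f"{scratch}/containers/"
        "vllm_openai_rocm_nightly-aa2b56ffb0c1d8b29d6468282a49d9b4a9f0dd1c.sif"
    )
    return model, sif
```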

Launch shape

The script submits one Frontier allocation. Each node runs one vLLM server with tensor parallel size 8 across the node's 8 MI250X GCDs. Each node is an independent rollout lane, and the selected tasks are balanced across the lanes. The model loads once per node, from node-local NVMe when available.
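
As an illustration of how selected tasks might be balanced across lanes, here is a round-robin sketch. The PR does not state the actual balancing rule, so round-robin is an assumption and assign_lanes is a hypothetical name.

```python
def assign_lanes(task_ids, num_lanes):
    """Distribute tasks round-robin so lane sizes differ by at most one."""
    if num_lanes < 1:
        raise ValueError("num_lanes must be at least 1")
    lanes = [[] for _ in range(num_lanes)]
    for index, task_id in enumerate(task_ids):
        lanes[index % num_lanes].append(task_id)
    return lanes


# With the example settings from "How to run" (TASK_COUNT=64, NODES=4),
# each lane would receive 16 tasks under this scheme.
lane_sizes = [len(lane) for lane in assign_lanes(range(64), 4)]
```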

Validation

Local checks:

bash -n generate_trajectories.slurm scripts/submit_gemma4_frontier_fat.sh
python3 -m py_compile scripts/merge_trajectory_lanes.py
uv run --python 3.12 --with pytest --with networkx --with requests --with pandas --with scipy \
  python -m pytest tests/test_gemma4_frontier_fat_launcher.py tests/test_generate_trajectories.py -q

Result: 68 passed, 1 skipped.

Frontier debug validation:

Job 4905071, run root /lustre/orion/syb114/scratch/bryannsoh/mentor_rl_speed_calibration/runs/debug_gemma4_fat_smoke_20260626_151848: COMPLETED, 3 nodes, 3 lanes, 3 merged trajectories.
Server logs from all three lanes show /mnt/bb/.../gemma-4-26B-A4B-it, dtype=bfloat16, tensor_parallel_size=8, reasoning_parser=gemma4, and moe_backend=triton.
Calibration run 4904709 produced 9 trajectories across 3 lanes and merged to 9 branch pools, 9 trajectory turns, 9 finding records, 39 raw preference pairs, and 11 filtered preference pairs.
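
The server-log check above could be automated along these lines. This is a sketch under the assumption that the settings appear in the logs as key=value tokens, optionally quoted; the real vLLM log format may differ, and missing_settings is a hypothetical helper.

```python
import re

# Settings the debug run is expected to show in every lane's server log.
EXPECTED = {
    "dtype": "bfloat16",
    "tensor_parallel_size": "8",
    "reasoning_parser": "gemma4",
    "moe_backend": "triton",
}


def missing_settings(log_text, expected=EXPECTED):
    """Return the expected keys whose key=value token is absent from log_text."""
    missing = []
    for key, value in expected.items():
        # \b keeps "dtype" from matching inside a longer key name;
        # the lookahead keeps "8" from matching "80".
        pattern = (
            r"\b" + re.escape(key) + r"=['\"]?" + re.escape(value)
            + r"['\"]?(?![\w.])"
        )
        if not re.search(pattern, log_text):
            missing.append(key)
    return missing
```

An empty return value for each lane's log would correspond to the result reported for job 4905071.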

Add Gemma 4 Frontier single-allocation launcher

dd114c8

Bryan-Nsoh force-pushed the gemma4-frontier-rollouts branch from 9837a37 to dd114c8 on June 26, 2026 15:38

Bryan-Nsoh changed the title ~~[gemma4] Add Frontier rollout launcher~~ [gemma4] Add Frontier single-allocation rollout launcher Jun 26, 2026

Bryan-Nsoh wants to merge 1 commit into rwr-loe-data from gemma4-frontier-rollouts
