MENTOR-RL is a research repo for building CORUM-grounded biology tasks, generating tool-using trajectories over a deterministic runtime, and training models with supervised fine-tuning and GRPO-style reinforcement learning.
The current codebase is oriented around Frontier/Slurm workflows and local data paths rather than a packaged pip install.
- scripts/: primary entry points for corpus building, trajectory generation, inference, and training
- runtime/: shared deterministic runtime/state/validation/scoring types used by the generator
- cpp_runtime/: lower-level runtime components in C++
- data/: canonical CORUM corpora, generated trajectories, caches, and related artifacts
- tests/: unit and integration tests
- generate_trajectories.slurm, train_sft.slurm, train_grpo.slurm: cluster launchers
scripts/build_corum_corpus.py turns the raw CORUM export into canonical task
JSONL files used by the rest of the pipeline.

python scripts/build_corum_corpus.py --help

Typical outputs live under data/corum_corpus/.
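For a quick look at a canonical task file, a small JSONL reader is usually enough. The sketch below is not part of the repo; it assumes only that each non-empty line is one JSON object. The `task_type` field used in the usage note is a hypothetical example, so check the actual schema before relying on it.

```python
import json
from collections import Counter
from pathlib import Path


def load_tasks(path):
    """Read one JSON object per non-empty line of a JSONL file."""
    tasks = []
    with Path(path).open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            try:
                tasks.append(json.loads(line))
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path}:{line_number}: invalid JSON ({exc})") from exc
    return tasks


def count_by_field(tasks, field):
    """Count rows per value of `field`; rows without the field are grouped under '<missing>'."""
    return Counter(str(task.get(field, "<missing>")) for task in tasks)
```

For example, `count_by_field(load_tasks(path), "task_type")` summarizes how rows are distributed over a field, provided that field exists in the file at `path`.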
scripts/generate_trajectories.py generates shared-prefix trajectories from the
canonical tasks. It supports deterministic heuristic generation for testing and
model-backed generation through an OpenAI-compatible vLLM endpoint.

python scripts/generate_trajectories.py --help

For Frontier runs, use the Slurm launcher:

sbatch generate_trajectories.slurm

Generated artifacts are written under data/corum_trajectories/.
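Before a model-backed run, it can help to confirm which model the endpoint is serving. OpenAI-compatible servers, including vLLM's, list served models at `GET /v1/models`. The sketch below is an illustration rather than the launcher's actual discovery code: the helper names are hypothetical, and taking the first listed id is an assumption that may not hold if the server also exposes adapters.

```python
import json
import urllib.request


def parse_model_ids(payload):
    """Extract model ids from an OpenAI-style /v1/models response body."""
    return [entry["id"] for entry in payload.get("data", []) if "id" in entry]


def discover_served_model_id(base_url, timeout=10.0):
    """Query <base_url>/v1/models and return the first served model id.

    Assumption: the first entry is the base model. Adjust if the server
    lists several models.
    """
    url = base_url.rstrip("/") + "/v1/models"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        payload = json.load(response)
    model_ids = parse_model_ids(payload)
    if not model_ids:
        raise RuntimeError(f"no models listed at {url}")
    return model_ids[0]
```

Calling `discover_served_model_id("http://localhost:8000")` would query a local server; the host and port here are placeholders, not repo defaults.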
Before using model-backed artifacts for DPO, run the verification gate:

python scripts/select_verification_tasks.py \
  --seed 42 \
  --pilot-size 128 \
  --pilot-out data/corum_corpus/tasks.verification_pilot.jsonl \
  --smoke-task-ids-out data/corum_corpus/tasks.verification_smoke_ids.txt

SMOKE_TASK_IDS="$(paste -sd, data/corum_corpus/tasks.verification_smoke_ids.txt)" \
  SMOKE_TEST_ONLY=1 \
  sbatch generate_trajectories.slurm

python scripts/audit_trajectory_run.py \
  --run-dir data/corum_trajectories/<run> \
  --dpo-pair-gate \
  --max-selected-no-tool-rate 0.60 \
  --max-positive-selected-no-tool-rate 0.50 \
  --min-recovery-expansion-pair-rate 0.35 \
  --min-tool-supported-pair-rate 0.30 \
  --max-mechanism-label-only-pair-rate 0.20 \
  --max-step0-pair-rate 0.70

python scripts/dpo_pair_loader_smoke.py --pairs-path data/corum_trajectories/<run>/preference_pairs.jsonl

python scripts/render_preference_pairs_review.py \
  --pairs-path data/corum_trajectories/<run>/preference_pairs.jsonl \
  --sample-size 30 \
  --out data/corum_trajectories/<run>/preference_pair_review_sample.md

python scripts/audit_recovery_recoverability.py \
  --tasks-path data/corum_corpus/tasks.verification_pilot.jsonl \
  --top-ks 50,100,250,500,1000 \
  --out-jsonl data/corum_corpus/recovery_recoverability.rwr.jsonl

The Slurm launcher writes run_freeze.json into each output directory and pins
the generator to the discovered served model id. Set TASKS_PATH to the pilot
JSONL and run once with TASK_CONCURRENCY=1, then repeat with the intended
production concurrency before scaling. The verification selector stratifies
pilot rows by task type, evidence mode, and difficulty, then uses a deterministic
seeded order so small pilots do not concentrate on the earliest CORUM complexes.
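The stratified, seeded selection described here can be sketched as follows. This is an illustration of the idea, not the code in scripts/select_verification_tasks.py: the field names in `STRATUM_FIELDS` and the round-robin draw across strata are assumptions.

```python
import random
from collections import defaultdict

# Hypothetical field names; the real task schema may use different keys.
STRATUM_FIELDS = ("task_type", "evidence_mode", "difficulty")


def select_pilot(tasks, pilot_size, seed):
    """Pick up to `pilot_size` tasks, spread evenly over strata, reproducibly."""
    rng = random.Random(seed)  # local RNG so the global random state is untouched
    strata = defaultdict(list)
    for task in tasks:
        key = tuple(str(task.get(field)) for field in STRATUM_FIELDS)
        strata[key].append(task)

    # Visit strata in sorted key order and shuffle each one with the seeded RNG,
    # so the result depends only on the input order and the seed.
    shuffled = []
    for key in sorted(strata):
        rows = strata[key]
        rng.shuffle(rows)
        shuffled.append(rows)

    # Round-robin: take one row per stratum per pass until the pilot is full
    # or every stratum is exhausted.
    selected = []
    depth = 0
    while len(selected) < pilot_size:
        added = False
        for rows in shuffled:
            if depth < len(rows):
                selected.append(rows[depth])
                added = True
                if len(selected) == pilot_size:
                    break
        if not added:
            break
        depth += 1
    return selected
```

Shuffling within each stratum is what keeps a small pilot from always drawing the earliest rows of the corpus, while the fixed seed keeps repeated selections identical.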
The launcher defaults to verbalized actor sampling, task-aware selection,
quality-balanced pair mining, and tool-coverage retries for recovery/refinement
tasks; override ACTOR_SAMPLING_STRATEGY, SELECTION_POLICY,
PAIR_MINING_STRATEGY, and TOOL_COVERAGE_RETRY_COUNT only for ablations.
Recovery RWR branches default to RECOVERY_RWR_TOP_K=500 so verifier prompts
include a broader non-seed candidate list. For DPO-only runs, high branch-pool
tie rates are treated as diagnostic rather than blocking, as long as balanced
preference pairs have valid margins; the audit's --dpo-pair-gate keeps
pair/schema/backend checks strict while downgrading tie-rate findings to
warnings.
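The gate logic can be pictured as a set of ceiling and floor checks, with tie-rate findings reported as warnings instead of failures. The sketch below is not the code in scripts/audit_trajectory_run.py. The metric names are hypothetical, derived from the flag names in the audit command above; the threshold values are the ones that command passes; and whether the real comparisons are inclusive is an assumption.

```python
# Threshold values mirror the audit flags shown above; metric keys are hypothetical.
MAX_THRESHOLDS = {
    "selected_no_tool_rate": 0.60,
    "positive_selected_no_tool_rate": 0.50,
    "mechanism_label_only_pair_rate": 0.20,
    "step0_pair_rate": 0.70,
}
MIN_THRESHOLDS = {
    "recovery_expansion_pair_rate": 0.35,
    "tool_supported_pair_rate": 0.30,
}


def evaluate_gate(metrics, warn_only_max=None):
    """Return (failures, warnings) for a dict of measured rates.

    `warn_only_max` maps metric names to ceilings that only produce warnings,
    which is how a tie-rate check could be downgraded (caller supplies the limit).
    """
    failures, warnings = [], []
    for name, ceiling in MAX_THRESHOLDS.items():
        value = metrics.get(name)
        if value is None:
            failures.append(f"{name}: missing")
        elif value > ceiling:  # assumption: values equal to the ceiling pass
            failures.append(f"{name}: {value:.2f} > {ceiling:.2f}")
    for name, floor in MIN_THRESHOLDS.items():
        value = metrics.get(name)
        if value is None:
            failures.append(f"{name}: missing")
        elif value < floor:  # assumption: values equal to the floor pass
            failures.append(f"{name}: {value:.2f} < {floor:.2f}")
    for name, ceiling in (warn_only_max or {}).items():
        value = metrics.get(name)
        if value is not None and value > ceiling:
            warnings.append(f"{name}: {value:.2f} > {ceiling:.2f} (warning only)")
    return failures, warnings
```

A run passes this sketch of the gate when `failures` is empty, regardless of how many warnings were collected.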
scripts/train_sft.py trains a supervised model over a local JSON dataset using
TRL's SFTTrainer.

python scripts/train_sft.py --help

Cluster runs typically go through:

sbatch train_sft.slurm

scripts/train_grpo.py contains the GRPO training path and patches TRL's
VLLMClient to work with a standard vLLM server.

python scripts/train_grpo.py --help

Cluster runs typically go through:

sbatch train_grpo.slurm

Run the test suite with:

pytest

For quick script-level validation, the main Python entry points also expose
--help.
- Many defaults assume the current ORNL/Frontier filesystem layout and large local model caches.
- The trajectory pipeline writes run artifacts and progress files into data/.
- Slurm launchers are the most complete examples of the intended production workflow.