diff --git a/.claude/commands/add-model.md b/.claude/commands/add-model.md index cb91fd7..68f9221 100644 --- a/.claude/commands/add-model.md +++ b/.claude/commands/add-model.md @@ -45,12 +45,12 @@ Add a new model to this repository following all repo conventions. 9. **Do not commit** large checkpoints, datasets, `.env` files, or tokens — document how users obtain them instead. 10. **Git workflow** — always branch, never commit directly to `main`: - 1. `git fetch origin && git checkout -b aaji/ origin/main` + 1. `git fetch origin && git checkout -b / origin/main` 2. Create all files, set scripts `chmod +x`. 3. `git add /models//` and commit. - 4. `git push -u origin aaji/` + 4. `git push -u origin /` 5. Open a PR with `gh pr create` targeting `main`. - 6. After the PR is merged: `git checkout main && git pull origin main && git branch -D aaji/`. + 6. After the PR is merged: `git checkout main && git pull origin main && git branch -D /`. ## Model to add diff --git a/.claude/commands/add-recipe.md b/.claude/commands/add-recipe.md index 925b792..81b32ec 100644 --- a/.claude/commands/add-recipe.md +++ b/.claude/commands/add-recipe.md @@ -37,12 +37,12 @@ Create or improve a recipe (inference, fine-tune, eval, etc.) for a model alread 6. **Do not commit** secrets, Hub tokens, large binaries, or datasets — point users to documented download steps. 7. **Git workflow** — always branch, never commit directly to `main`: - 1. `git fetch origin && git checkout -b aaji/ origin/main` + 1. `git fetch origin && git checkout -b / origin/main` 2. Create all files, set scripts `chmod +x`. 3. `git add /models//` and commit. - 4. `git push -u origin aaji/` + 4. `git push -u origin /` 5. Open a PR with `gh pr create` targeting `main`. - 6. After the PR is merged: `git checkout main && git pull origin main && git branch -D aaji/`. + 6. After the PR is merged: `git checkout main && git pull origin main && git branch -D /`. 
## Domain-specific checks diff --git a/.claude/commands/init-cluster.md b/.claude/commands/init-cluster.md new file mode 100644 index 0000000..a6b5a9f --- /dev/null +++ b/.claude/commands/init-cluster.md @@ -0,0 +1,164 @@ +# Initialize local cluster configuration + +Detect the current cluster environment and create a `.cluster-config.yaml` file with site-specific settings. This file is gitignored and never committed. + +## Step 0 — Check if config already exists + +Check both locations in order: +1. `.cluster-config.yaml` (repo root) +2. `~/.config/ai4science-studio/cluster.yaml` + +If either exists, read it, show the user the current settings, and ask: +- **Update** — re-run discovery and update the file +- **Keep** — leave it as-is and exit + +If neither exists, proceed to Step 1. + +## Step 1 — Auto-discover cluster environment + +Run ALL of the following discovery commands silently (do not show raw output). Parse and store results for the questionnaire in Step 2. + +```bash +# 1. Home directory (may not be /home/...) +echo "$HOME" + +# 2. GPU architecture and count +rocminfo 2>/dev/null | grep -oP 'gfx\d+' | sort -u +rocm-smi --showproductname 2>/dev/null | head -20 +ls /dev/dri/renderD* 2>/dev/null | wc -l + +# 3. VRAM per GPU (rocm-smi reports bytes; convert to GB) +rocm-smi --showmeminfo vram 2>/dev/null | grep "Total" | head -1 + +# 4. ROCm version +cat /opt/rocm/.info/version 2>/dev/null || rocminfo 2>/dev/null | grep -oP 'HSA Runtime Version:\s*\K.*' + +# 5. SLURM: available GPU partitions +sinfo -h -o "%P %l %G %D" 2>/dev/null | grep -i gpu + +# 6. SLURM: user's accounts and associations +sacctmgr show associations where user=$USER format=account%30,partition%30,qos%30 -n 2>/dev/null + +# 7. Container runtimes available +which apptainer 2>/dev/null && apptainer --version +which docker 2>/dev/null && docker --version 2>/dev/null +docker info 2>/dev/null | grep -i "amd\|runtime" + +# 8. 
Common scratch / project directories +for d in /scratch/$USER /scratch /lustre /gpfs /tmp/$USER; do + [[ -d "$d" ]] && echo "scratch: $d" +done + +# 9. Internet connectivity from this node +curl -s --connect-timeout 5 -o /dev/null -w "%{http_code}" https://huggingface.co 2>/dev/null + +# 10. Proxy settings +echo "HTTP_PROXY=${HTTP_PROXY:-}" +echo "HTTPS_PROXY=${HTTPS_PROXY:-}" + +# 11. Existing SIF files +find "$HOME" /scratch /projects /opt -maxdepth 4 -name "*.sif" 2>/dev/null | head -20 +``` + +## Step 2 — Present findings and ask user to confirm or choose + +Show the user a summary table of discovered values. For each category, present what was found and ask them to confirm or override. + +### Questions (use AskUserQuestion for multi-choice items) + +**Q1. GPU architecture** (auto-filled if detected) +Show detected arch (e.g. "gfx942 — MI300X"). Ask only if detection failed: +- MI300X (gfx942) +- MI250X (gfx90a) +- MI350X (gfx950) +- MI210 (gfx90a) +- MI100 (gfx908) + +**Q2. GPUs per node** (auto-filled from renderD count, default 8) +Show detected count. Ask to confirm or override. + +**Q3. VRAM per GPU** (auto-filled if detected) +Show detected value. Ask to confirm or override. + +**Q4. SLURM partition** (choose from discovered partitions) +If multiple GPU partitions found, present them as multiple-choice. +If only one, pre-fill and ask to confirm. +If SLURM not available (e.g. Docker-only workstation), set to empty. + +**Q5. SLURM account** (choose from discovered accounts) +If multiple accounts found, present them as multiple-choice. +If only one, pre-fill and ask to confirm. +If SLURM not available, set to empty. + +**Q6. 
Container runtime** +Based on what's available: +- **Apptainer** (if found — recommended for HPC) +- **Docker** (if found) +- **Both** (if both found — ask which is preferred) + +If Docker is available, also check for AMD Container Toolkit: +- If `docker info` shows AMD runtime → set `gpu_method: amd-toolkit` +- Otherwise → set `gpu_method: device-passthrough` + +**Q7. Scratch / working directory** +Show discovered writable directories. Ask user to pick or provide their preferred scratch path for large files (SIF, overlays, datasets). + +**Q8. Internet access** +Show result of connectivity test. Ask to confirm. +If no internet, ask for proxy settings. + +**Q9. Config file location** +- `.cluster-config.yaml` in this repo (recommended — per-checkout) +- `~/.config/ai4science-studio/cluster.yaml` (user-level, shared across clones) + +## Step 3 — Write the config file + +Write the YAML config to the chosen location using the confirmed values. Use this structure: + +```yaml +# AI4Science Studio — Local Cluster Configuration +# Auto-generated by /init-cluster on +# Edit manually or re-run /init-cluster to update. + +slurm: + partition: "" + account: "" + qos: "" + +paths: + home: "<$HOME value>" + scratch: "" + projects: "" + sif_cache: "" + +containers: + runtime: "" + gpu_method: "" + +gpu: + architecture: "" + count_per_node: + vram_gb: + +network: + internet_access: + proxy: "" + +discovered_sifs: + # SIF files found during init (informational — not used by scripts) + # - /path/to/some.sif +``` + +After writing, confirm to the user: +- File location +- Summary of all settings +- Remind them this file is gitignored and will never be committed +- Tell them to re-run `/init-cluster` if they move to a different cluster + +## Step 4 — Show next steps + +Suggest what the user can do next: +- Run a model: `/run-stormcast`, `/run-orbit2`, etc. 
+- The run commands will read cluster config automatically for SLURM partition/account defaults + +$ARGUMENTS diff --git a/.claude/commands/run-archesweather.md b/.claude/commands/run-archesweather.md index 0be6962..e7ed611 100644 --- a/.claude/commands/run-archesweather.md +++ b/.claude/commands/run-archesweather.md @@ -2,6 +2,10 @@ Guide the user through running ArchesWeather end-to-end on an AMD cluster. +## Step 0 — Cluster config check + +Check if `.cluster-config.yaml` (repo root) or `~/.config/ai4science-studio/cluster.yaml` exists. If neither exists, run the `/init-cluster` flow first. If a config exists, read it and pre-fill container runtime and SLURM partition/account from saved values. + ## Step 1 — Questionnaire (ask ALL questions before doing anything) Ask the user the following questions. Do not assume any defaults. Wait for answers to all questions before proceeding. @@ -52,7 +56,7 @@ Run these when the user chose **Auto-discover** for any question. Present the re ```bash find "$HOME" /scratch /projects /opt -maxdepth 4 -name "*.sif" 2>/dev/null | head -20 ``` -Use `$HOME` (not `/home`) so the search works when the home directory is under a non-standard prefix (e.g. `/shared/prerelease/home/…`). +Use `$HOME` (not `/home`) so the search works when the home directory is under a non-standard prefix. Filter results for SIF names containing `geoarches` or `rocm`. Verify with `apptainer inspect ` if multiple candidates. **SLURM partition and account (Q6):** diff --git a/.claude/commands/run-aurora.md b/.claude/commands/run-aurora.md index 0716dd5..e86d94f 100644 --- a/.claude/commands/run-aurora.md +++ b/.claude/commands/run-aurora.md @@ -2,6 +2,10 @@ Guide the user through running Aurora (0.1° resolution) end-to-end on an AMD cluster. +## Step 0 — Cluster config check + +Check if `.cluster-config.yaml` (repo root) or `~/.config/ai4science-studio/cluster.yaml` exists. If neither exists, run the `/init-cluster` flow first. 
If a config exists, read it and pre-fill container runtime and SLURM partition/account from saved values. + ## Step 1 — Questionnaire (ask ALL questions before doing anything) Ask the user the following questions. Do not assume any defaults. Wait for answers to all questions before proceeding. diff --git a/.claude/commands/run-gencast.md b/.claude/commands/run-gencast.md index 6e8ba93..aa9aedc 100644 --- a/.claude/commands/run-gencast.md +++ b/.claude/commands/run-gencast.md @@ -2,6 +2,10 @@ Guide the user through running GenCast end-to-end on an AMD cluster. +## Step 0 — Cluster config check + +Check if `.cluster-config.yaml` (repo root) or `~/.config/ai4science-studio/cluster.yaml` exists. If neither exists, run the `/init-cluster` flow first. If a config exists, read it and pre-fill container runtime and SLURM partition/account from saved values. + ## Step 1 — Questionnaire (ask ALL questions before doing anything) Ask the user the following questions. Do not assume any defaults. Wait for answers to all questions before proceeding. diff --git a/.claude/commands/run-gpmolformer.md b/.claude/commands/run-gpmolformer.md index 1133a66..ac8bcef 100644 --- a/.claude/commands/run-gpmolformer.md +++ b/.claude/commands/run-gpmolformer.md @@ -5,6 +5,10 @@ Guide the user through running GP-MoLFormer on an AMD cluster via SLURM. +## Step 0 — Cluster config check + +Check if `.cluster-config.yaml` (repo root) or `~/.config/ai4science-studio/cluster.yaml` exists. If neither exists, run the `/init-cluster` flow first. If a config exists, read it and pre-fill container runtime and SLURM partition/account from saved values. + ## Step 1 — Questionnaire (ask ALL questions before doing anything) Ask the user the following questions. Do not assume any defaults. Wait for answers to all questions before proceeding. @@ -54,7 +58,7 @@ Run these when the user chose **Auto-discover** for any question. 
Present the re ```bash find "$HOME" /scratch /projects /opt -maxdepth 4 -name "*.sif" 2>/dev/null | head -20 ``` -Use `$HOME` (not `/home`) so the search works when the home directory is under a non-standard prefix (e.g. `/shared/prerelease/home/…`). +Use `$HOME` (not `/home`) so the search works when the home directory is under a non-standard prefix. Filter results for SIF names containing `rocm` or `pytorch`. Verify with `apptainer inspect ` if multiple candidates. **SLURM partition and account (Q7):** diff --git a/.claude/commands/run-hydragnn.md b/.claude/commands/run-hydragnn.md index f091cd5..298eb75 100644 --- a/.claude/commands/run-hydragnn.md +++ b/.claude/commands/run-hydragnn.md @@ -2,6 +2,10 @@ Guide the user through running HydraGNN predictive GFM inference on AMD GPUs. +## Step 0 — Cluster config check + +Check if `.cluster-config.yaml` (repo root) or `~/.config/ai4science-studio/cluster.yaml` exists. If neither exists, run the `/init-cluster` flow first. If a config exists, read it and pre-fill container runtime and SLURM partition/account from saved values. + ## Step 1 — Questionnaire **Q0. Task** diff --git a/.claude/commands/run-matey.md b/.claude/commands/run-matey.md index aff638d..981b342 100644 --- a/.claude/commands/run-matey.md +++ b/.claude/commands/run-matey.md @@ -2,6 +2,10 @@ Guide the user through training or inference with MATEY on AMD GPUs. +## Step 0 — Cluster config check + +Check if `.cluster-config.yaml` (repo root) or `~/.config/ai4science-studio/cluster.yaml` exists. If neither exists, run the `/init-cluster` flow first. If a config exists, read it and pre-fill container runtime and SLURM partition/account from saved values. + ## Step 1 — Questionnaire **Q0. 
Task** diff --git a/.claude/commands/run-mattergen.md b/.claude/commands/run-mattergen.md index c5c3184..2357c38 100644 --- a/.claude/commands/run-mattergen.md +++ b/.claude/commands/run-mattergen.md @@ -2,6 +2,10 @@ Guide the user through running MatterGen end-to-end on an AMD cluster via SLURM or Docker. +## Step 0 — Cluster config check + +Check if `.cluster-config.yaml` (repo root) or `~/.config/ai4science-studio/cluster.yaml` exists. If neither exists, run the `/init-cluster` flow first. If a config exists, read it and pre-fill container runtime and SLURM partition/account from saved values. + ## Step 1 — Questionnaire (ask ALL questions before doing anything) Ask the user the following questions. Do not assume any defaults. Wait for answers before proceeding. @@ -44,7 +48,7 @@ Run these when the user chose **Auto-discover** for any question. Present the re ```bash find "$HOME" /scratch /projects /opt -maxdepth 4 -name "*.sif" 2>/dev/null | head -20 ``` -Use `$HOME` (not `/home`) so the search works when the home directory is under a non-standard prefix (e.g. `/shared/prerelease/home/…`). +Use `$HOME` (not `/home`) so the search works when the home directory is under a non-standard prefix. Filter results for SIF names containing `rocm` or `pytorch`. Verify with `apptainer inspect ` if multiple candidates. **SLURM partition and account (Q5):** diff --git a/.claude/commands/run-neuralgcm.md b/.claude/commands/run-neuralgcm.md index 550b563..c6cf0ec 100644 --- a/.claude/commands/run-neuralgcm.md +++ b/.claude/commands/run-neuralgcm.md @@ -2,6 +2,10 @@ Guide the user through running NeuralGCM end-to-end on an AMD cluster via Docker or SLURM. +## Step 0 — Cluster config check + +Check if `.cluster-config.yaml` (repo root) or `~/.config/ai4science-studio/cluster.yaml` exists. If neither exists, run the `/init-cluster` flow first. If a config exists, read it and pre-fill container runtime and SLURM partition/account from saved values. 
+ ## Step 1 — Questionnaire (ask ALL questions before doing anything) Ask the user the following questions. Do not assume any defaults. Wait for answers to all questions before proceeding. diff --git a/.claude/commands/run-orbit2.md b/.claude/commands/run-orbit2.md index 9f5be42..bb5ddf6 100644 --- a/.claude/commands/run-orbit2.md +++ b/.claude/commands/run-orbit2.md @@ -2,6 +2,10 @@ Guide the user through running ORBIT-2 end-to-end on an AMD cluster via SLURM. +## Step 0 — Cluster config check + +Check if `.cluster-config.yaml` (repo root) or `~/.config/ai4science-studio/cluster.yaml` exists. If neither exists, run the `/init-cluster` flow first. If a config exists, read it and pre-fill container runtime and SLURM partition/account from saved values. + ## Step 1 — Questionnaire (ask ALL questions before doing anything) Ask the user the following questions. Do not assume any defaults. Wait for answers to all questions before proceeding. @@ -61,7 +65,7 @@ Run these when the user chose **Auto-discover** for any question. Present the re ```bash find "$HOME" /scratch /projects /opt -maxdepth 4 -type d -name "ORBIT-2" 2>/dev/null | head -20 ``` -Use `$HOME` (not `/home`) so the search works when the home directory is under a non-standard prefix (e.g. `/shared/prerelease/home/…`). +Use `$HOME` (not `/home`) so the search works when the home directory is under a non-standard prefix. Look for a directory containing `src/` and `examples/` subdirectories to confirm it is a valid ORBIT-2 clone. **SIF files (Q2):** diff --git a/.claude/commands/run-panguweather.md b/.claude/commands/run-panguweather.md index ec415ec..5e3e9d9 100644 --- a/.claude/commands/run-panguweather.md +++ b/.claude/commands/run-panguweather.md @@ -2,6 +2,10 @@ Guide the user through running PanguWeather end-to-end on an AMD cluster. +## Step 0 — Cluster config check + +Check if `.cluster-config.yaml` (repo root) or `~/.config/ai4science-studio/cluster.yaml` exists. 
If neither exists, run the `/init-cluster` flow first. If a config exists, read it and pre-fill container runtime and SLURM partition/account from saved values. + ## Step 1 — Questionnaire (ask ALL questions before doing anything) Ask the user the following questions. Do not assume any defaults. Wait for answers to all questions before proceeding. diff --git a/.claude/commands/run-reinvent4.md b/.claude/commands/run-reinvent4.md index f69fbe4..0698c30 100644 --- a/.claude/commands/run-reinvent4.md +++ b/.claude/commands/run-reinvent4.md @@ -4,6 +4,10 @@ Guide the user through running REINVENT4 transfer learning for molecular design > **Research / engineering use only.** Not for clinical or diagnostic use. +## Step 0 — Cluster config check + +Check if `.cluster-config.yaml` (repo root) or `~/.config/ai4science-studio/cluster.yaml` exists. If neither exists, run the `/init-cluster` flow first. If a config exists, read it and pre-fill container runtime and SLURM partition/account from saved values. + ## Step 1 — Questionnaire **Q0. TOML config** diff --git a/.claude/commands/run-semlaflow.md b/.claude/commands/run-semlaflow.md index a6618ef..e6401c5 100644 --- a/.claude/commands/run-semlaflow.md +++ b/.claude/commands/run-semlaflow.md @@ -4,6 +4,10 @@ Guide the user through generating 3D molecular structures with SemlaFlow on AMD > **Research / engineering use only.** Not for clinical or diagnostic use. +## Step 0 — Cluster config check + +Check if `.cluster-config.yaml` (repo root) or `~/.config/ai4science-studio/cluster.yaml` exists. If neither exists, run the `/init-cluster` flow first. If a config exists, read it and pre-fill container runtime and SLURM partition/account from saved values. + ## Step 1 — Questionnaire **Q0. 
Checkpoint path** diff --git a/.claude/commands/run-stormcast.md b/.claude/commands/run-stormcast.md index efa075f..f47f1c8 100644 --- a/.claude/commands/run-stormcast.md +++ b/.claude/commands/run-stormcast.md @@ -2,6 +2,12 @@ Guide the user through running StormCast end-to-end on an AMD cluster via SLURM. +## Step 0 — Cluster config check + +Before starting, check if `.cluster-config.yaml` (repo root) or `~/.config/ai4science-studio/cluster.yaml` exists. If neither exists, run the `/init-cluster` flow first to auto-discover the cluster environment. + +If a config exists, read it and pre-fill Q0 (runtime), Q6 (partition/account) from the saved values. Still present them to the user for confirmation but show the saved defaults. + ## Step 1 — Questionnaire (ask ALL questions before doing anything) Ask the user the following questions. Do not assume any defaults. Wait for answers to all questions before proceeding. @@ -50,7 +56,7 @@ Run these when the user chose **Auto-discover** for any question. Present the re ```bash find "$HOME" /scratch /projects /opt -maxdepth 4 -name "*.sif" 2>/dev/null | head -20 ``` -Use `$HOME` (not `/home`) so the search works when the home directory is under a non-standard prefix (e.g. `/shared/prerelease/home/…`). +Use `$HOME` (not `/home`) so the search works when the home directory is under a non-standard prefix. Filter results for SIF names containing `rocm` or `pytorch`. Verify with `apptainer inspect ` if multiple candidates. **Overlay images (Q2):** diff --git a/.claude/commands/run-swinunetr.md b/.claude/commands/run-swinunetr.md index de75e93..622b737 100644 --- a/.claude/commands/run-swinunetr.md +++ b/.claude/commands/run-swinunetr.md @@ -4,6 +4,10 @@ Guide the user through training or inference with SwinUNETR on AMD GPUs. > **Research / engineering use only.** Not for clinical or diagnostic use. 
+## Step 0 — Cluster config check + +Check if `.cluster-config.yaml` (repo root) or `~/.config/ai4science-studio/cluster.yaml` exists. If neither exists, run the `/init-cluster` flow first. If a config exists, read it and pre-fill container runtime and SLURM partition/account from saved values. + ## Step 1 — Questionnaire **Q0. Task** diff --git a/.claude/commands/run-walrus.md b/.claude/commands/run-walrus.md index b5f5f43..bdec86b 100644 --- a/.claude/commands/run-walrus.md +++ b/.claude/commands/run-walrus.md @@ -2,6 +2,10 @@ Guide the user through running Walrus autoregressive rollout on AMD GPUs. +## Step 0 — Cluster config check + +Check if `.cluster-config.yaml` (repo root) or `~/.config/ai4science-studio/cluster.yaml` exists. If neither exists, run the `/init-cluster` flow first. If a config exists, read it and pre-fill container runtime and SLURM partition/account from saved values. + ## Step 1 — Questionnaire **Q0. Input data** diff --git a/.cluster-config.example.yaml b/.cluster-config.example.yaml new file mode 100644 index 0000000..978c2c6 --- /dev/null +++ b/.cluster-config.example.yaml @@ -0,0 +1,38 @@ +# AI4Science Studio — Local Cluster Configuration +# +# Copy this file to one of these locations (checked in order): +# 1. .cluster-config.yaml (this repo — gitignored, per-checkout) +# 2. ~/.config/ai4science-studio/cluster.yaml (user-level, shared across clones) +# +# The agent (Claude Code / Cursor) reads this file to avoid repeatedly asking +# for cluster-specific details. It is auto-created on first run if missing. +# Update it whenever your cluster environment changes. + +# SLURM settings +slurm: + partition: "" # e.g. "gpu", "mi300x", "batch" + account: "" # e.g. "myproject", "CSC000" + qos: "" # optional QoS if your site requires it + +# Filesystem paths +paths: + home: "" # leave blank to use $HOME; set if non-standard + scratch: "" # e.g. "/scratch/$USER", "/lustre/scratch/$USER" + projects: "" # e.g. 
"/projects/myteam" + sif_cache: "" # where .sif files are stored (e.g. "$HOME/containers") + +# Container runtime +containers: + runtime: "" # "apptainer" or "docker" + gpu_method: "" # "amd-toolkit" (--runtime=amd) or "device-passthrough" (/dev/kfd) + +# ROCm / GPU details +gpu: + architecture: "" # e.g. "gfx942" (MI300X), "gfx90a" (MI250X), "gfx950" (MI350X) + count_per_node: 8 # GPUs per node + vram_gb: 192 # per-GPU VRAM in GB + +# Network +network: + internet_access: true # can compute nodes reach the internet? + proxy: "" # HTTP proxy if required (e.g. "http://proxy:8080") diff --git a/.cursor/skills/ai4science-earth-science/SKILL.md b/.cursor/skills/ai4science-earth-science/SKILL.md index f4f1afc..f713f60 100755 --- a/.cursor/skills/ai4science-earth-science/SKILL.md +++ b/.cursor/skills/ai4science-earth-science/SKILL.md @@ -50,7 +50,7 @@ The `earth_science/` domain covers **climate**, **weather**, and broader **Earth ) ``` - **Distributed launch:** `srun --mpi=pmix apptainer exec ... --env HOSTNAME="$MASTER_ADDR"` — see studio SKILL.md for full pattern. Confirmed working for 1-GPU and 8-GPU. -- **Data blocker:** ORBIT-2 real training data lives on Frontier Lustre (`/lustre/orion/lrn036/...`) — not publicly accessible. Synthetic data covers the smoke-test path. +- **Data blocker:** ORBIT-2 real training data lives on Frontier Lustre (`/lustre/orion//...`) — not publicly accessible. Synthetic data covers the smoke-test path. ### ArchesWeather (geoarches) diff --git a/.cursor/skills/ai4science-run-models/SKILL.md b/.cursor/skills/ai4science-run-models/SKILL.md index 3553719..1ea6f58 100644 --- a/.cursor/skills/ai4science-run-models/SKILL.md +++ b/.cursor/skills/ai4science-run-models/SKILL.md @@ -7,6 +7,21 @@ description: Step-by-step instructions for how Cursor should run any model in AI Use this skill when a user asks to run, execute, or launch any model in the repository. 
+## Step 0: Check for cluster config + +Before anything else, check if a cluster config exists: + +```bash +# Check both locations; braces stop "user-level" from also printing when the repo-local file exists +{ test -f .cluster-config.yaml && echo "repo-local"; } || \ +{ test -f ~/.config/ai4science-studio/cluster.yaml && echo "user-level"; } || \ +echo "missing" +``` + +If **missing**, tell the user: "No cluster configuration found. Running `/init-cluster` to detect your cluster environment." Then run the init-cluster flow (auto-discover GPU, SLURM, containers, paths) before proceeding. + +If a config **exists**, read it and use its values as defaults for SLURM partition, account, container runtime, GPU arch, and scratch paths throughout the run. Still confirm with the user if any value is empty or looks stale. + ## Step 1: Identify the model 1. Read `models.yaml` at the repo root to find the model by name, slug, or HF id. @@ -35,7 +50,6 @@ When the user chooses auto-discover, run the appropriate commands and present re ```bash # SIF files — use $HOME so we find files even when home is not under /home -# (e.g. /shared/prerelease/home/…) find "$HOME" /scratch /projects /opt -maxdepth 4 -name "*.sif" 2>/dev/null | head -20 # Overlay images diff --git a/.cursor/skills/ai4science-studio/SKILL.md b/.cursor/skills/ai4science-studio/SKILL.md index 0d0ae7b..ece48a5 100755 --- a/.cursor/skills/ai4science-studio/SKILL.md +++ b/.cursor/skills/ai4science-studio/SKILL.md @@ -220,9 +220,20 @@ Always verify at end: `assert 'rocm' in torch.__version__` - ORBIT-2 (pytorch-lightning + xformers + wandb + mpi4py + ...): 7 GB overlay, ~2 GB content - StormCast (earth2studio + cartopy): 4 GB overlay, ~1.7 GB content +## Local cluster config + +Site-specific settings (SLURM partition, account, scratch paths, GPU arch) are stored in an untracked file so they never leak into the public repo. The agent checks two locations (in order): + +1. `.cluster-config.yaml` (repo root — gitignored, per-checkout) +2. 
`~/.config/ai4science-studio/cluster.yaml` (user-level, shared across clones) + +**On first run or if neither file exists, run `/init-cluster`.** This command auto-discovers the cluster environment (GPU arch, SLURM partitions/accounts, container runtimes, scratch paths, internet access) and asks the user to confirm via multiple-choice questions. The result is written to one of the above locations. + +When cluster-specific info is discovered during a session (new partition, different scratch path, etc.), update the config file so future runs don't re-ask. + ## Auto-discovery: use `$HOME`, never hardcode `/home` -When searching for SIF files, overlays, or repo clones, always use `$HOME` as the search root — not `/home`. Many HPC clusters mount home directories under non-standard prefixes (e.g. `/shared/prerelease/home/amd/user`), so `find /home ...` misses everything. The same applies to overlay and upstream-repo searches. +When searching for SIF files, overlays, or repo clones, always use `$HOME` as the search root — not `/home`. Many HPC clusters mount home directories under non-standard prefixes, so `find /home ...` misses everything. The same applies to overlay and upstream-repo searches. 
```bash find "$HOME" /scratch /projects /opt -maxdepth 4 -name "*.sif" 2>/dev/null | head -20 diff --git a/.gitignore b/.gitignore index 84b9df2..79e9c27 100755 --- a/.gitignore +++ b/.gitignore @@ -50,6 +50,11 @@ secrets/ **/examples/outputs/ **/examples/generated.csv +# Local cluster config (site-specific, auto-created by agent) +.cluster-config.yaml +.claude/settings.local.json +nul + # IDE .idea/ .vscode/ diff --git a/CLAUDE.md b/CLAUDE.md index aec81a1..2e47a6a 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -79,6 +79,7 @@ Slash commands for Claude Code live in `.claude/commands/`: | `/check-model` | Audit a model folder for completeness and convention compliance | | `/list-models` | Discover and filter models by domain, task, or license | | `/audit-models` | Readiness audit for all models (scripts, preflight, recipes) | +| `/init-cluster` | Auto-detect cluster environment (GPU, SLURM, containers) and create local config | | `/run-stormcast` | Run StormCast inference on an AMD cluster | | `/run-orbit2` | Run ORBIT-2 inference on an AMD cluster | | `/run-archesweather` | Run ArchesWeather inference or training on an AMD cluster | diff --git a/earth_science/models/ORBIT-2/examples/sbatch_infer_docker.sh b/earth_science/models/ORBIT-2/examples/sbatch_infer_docker.sh index 8e4bdc2..383cb64 100755 --- a/earth_science/models/ORBIT-2/examples/sbatch_infer_docker.sh +++ b/earth_science/models/ORBIT-2/examples/sbatch_infer_docker.sh @@ -36,7 +36,7 @@ ##SBATCH -A YOUR_PROJECT_HERE #SBATCH -J orbit2-docker -#SBATCH --partition=amd-arad +#SBATCH --partition=YOUR_PARTITION_HERE #SBATCH --nodes=1 #SBATCH --gres=gpu:8 #SBATCH --ntasks=1 diff --git a/earth_science/models/StormCast/examples/sbatch_inference_docker.sh b/earth_science/models/StormCast/examples/sbatch_inference_docker.sh index e23afa8..3d81091 100755 --- a/earth_science/models/StormCast/examples/sbatch_inference_docker.sh +++ b/earth_science/models/StormCast/examples/sbatch_inference_docker.sh @@ -25,7 +25,7 @@ # See: 
../recipes/inference/README.md #SBATCH --job-name=stormcast-docker -#SBATCH --partition=amd-arad +#SBATCH --partition=YOUR_PARTITION_HERE ##SBATCH --account=YOUR_ACCOUNT_HERE #SBATCH --nodes=1 #SBATCH --gres=gpu:1 diff --git a/healthcare/models/GP-MoLFormer/examples/sbatch_inference_docker.sh b/healthcare/models/GP-MoLFormer/examples/sbatch_inference_docker.sh index 206b440..d1555e5 100755 --- a/healthcare/models/GP-MoLFormer/examples/sbatch_inference_docker.sh +++ b/healthcare/models/GP-MoLFormer/examples/sbatch_inference_docker.sh @@ -31,7 +31,7 @@ # For older hardware (MI100/gfx908): override GPMOL_IMAGE with a rocm6.x image. #SBATCH --job-name=gpmolformer-docker -#SBATCH --partition=amd-arad +#SBATCH --partition=YOUR_PARTITION_HERE ##SBATCH --account=YOUR_ACCOUNT_HERE #SBATCH --nodes=1 #SBATCH --gres=gpu:1 diff --git a/healthcare/models/SwinUNETR/model.yaml b/healthcare/models/SwinUNETR/model.yaml index 856c708..1ea6d13 100644 --- a/healthcare/models/SwinUNETR/model.yaml +++ b/healthcare/models/SwinUNETR/model.yaml @@ -1,6 +1,6 @@ name: SwinUNETR hf_id: N/A -license: See upstream (MONAI / silogen/ai-samples) +license: Apache-2.0 task: 3D medical image segmentation (lung tumor, NSCLC-Radiomics) domain: healthcare upstream_code: https://github.com/silogen/ai-samples diff --git a/models.yaml b/models.yaml index 50a8a3e..e7bd210 100644 --- a/models.yaml +++ b/models.yaml @@ -91,7 +91,7 @@ models: - slug: SwinUNETR domain: healthcare hf_id: N/A - license: See upstream (MONAI / silogen/ai-samples) + license: Apache-2.0 task: 3D medical image segmentation tasks_available: [train, inference] path: healthcare/models/SwinUNETR
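
Note on the config lookup: every "Step 0" section in this change describes the same rule: check `.cluster-config.yaml` in the repo root first, then `~/.config/ai4science-studio/cluster.yaml`, and fall back to `/init-cluster` if neither exists. A minimal sketch of that lookup as a shell function might look like the following. It is illustrative only and not part of this diff; the function name `find_cluster_config` and its two optional arguments are hypothetical.

```shell
# Hypothetical helper (not in the repo): print the path of the first cluster
# config that exists, checking the repo-local file before the user-level one.
# Illustrative arguments: $1 = repo root (default: .), $2 = home dir (default: $HOME).
find_cluster_config() {
  repo_root="${1:-.}"
  home_dir="${2:-$HOME}"
  for candidate in \
    "$repo_root/.cluster-config.yaml" \
    "$home_dir/.config/ai4science-studio/cluster.yaml"; do
    if [ -f "$candidate" ]; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  # Neither file exists; the caller would run the /init-cluster flow.
  return 1
}
```

A caller could then branch on the exit status, for example `if cfg="$(find_cluster_config)"; then echo "using $cfg"; else echo "missing"; fi`. Returning the path rather than a label keeps the "first match wins" order in one place.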