Merged
6 changes: 3 additions & 3 deletions .claude/commands/add-model.md
Original file line number Diff line number Diff line change
@@ -45,12 +45,12 @@ Add a new model to this repository following all repo conventions.
9. **Do not commit** large checkpoints, datasets, `.env` files, or tokens — document how users obtain them instead.

10. **Git workflow** — always branch, never commit directly to `main`:
1. `git fetch origin && git checkout -b aaji/<model-slug> origin/main`
1. `git fetch origin && git checkout -b <your-username>/<model-slug> origin/main`
2. Create all files, set scripts `chmod +x`.
3. `git add <domain>/models/<slug>/` and commit.
4. `git push -u origin aaji/<model-slug>`
4. `git push -u origin <your-username>/<model-slug>`
5. Open a PR with `gh pr create` targeting `main`.
6. After the PR is merged: `git checkout main && git pull origin main && git branch -D aaji/<model-slug>`.
6. After the PR is merged: `git checkout main && git pull origin main && git branch -D <your-username>/<model-slug>`.

## Model to add

6 changes: 3 additions & 3 deletions .claude/commands/add-recipe.md
@@ -37,12 +37,12 @@ Create or improve a recipe (inference, fine-tune, eval, etc.) for a model alread
6. **Do not commit** secrets, Hub tokens, large binaries, or datasets — point users to documented download steps.

7. **Git workflow** — always branch, never commit directly to `main`:
1. `git fetch origin && git checkout -b aaji/<model-slug> origin/main`
1. `git fetch origin && git checkout -b <your-username>/<model-slug> origin/main`
2. Create all files, set scripts `chmod +x`.
3. `git add <domain>/models/<slug>/` and commit.
4. `git push -u origin aaji/<model-slug>`
4. `git push -u origin <your-username>/<model-slug>`
5. Open a PR with `gh pr create` targeting `main`.
6. After the PR is merged: `git checkout main && git pull origin main && git branch -D aaji/<model-slug>`.
6. After the PR is merged: `git checkout main && git pull origin main && git branch -D <your-username>/<model-slug>`.

## Domain-specific checks

164 changes: 164 additions & 0 deletions .claude/commands/init-cluster.md
@@ -0,0 +1,164 @@
# Initialize local cluster configuration

Detect the current cluster environment and create a `.cluster-config.yaml` file with site-specific settings. This file is gitignored and never committed.

## Step 0 — Check if config already exists

Check both locations in order:
1. `.cluster-config.yaml` (repo root)
2. `~/.config/ai4science-studio/cluster.yaml`

If either exists, read it, show the user the current settings, and ask:
- **Update** — re-run discovery and update the file
- **Keep** — leave it as-is and exit

If neither exists, proceed to Step 1.
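
The lookup order above can be sketched as a small shell helper. This is a minimal illustration, not part of the command file: the function name `find_cluster_config` and its optional repo-root argument are hypothetical.

```bash
# Print the first existing cluster config, checking the repo root before the
# user-level location. Returns non-zero when neither file exists.
# NOTE: find_cluster_config is a hypothetical helper name.
find_cluster_config() {
  local repo_root="${1:-.}"
  local candidate
  for candidate in \
    "$repo_root/.cluster-config.yaml" \
    "$HOME/.config/ai4science-studio/cluster.yaml"; do
    if [[ -f "$candidate" ]]; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  return 1  # neither file exists: proceed to Step 1
}
```

A caller might use `if cfg="$(find_cluster_config)"; then ...` to branch between the Update/Keep prompt and discovery.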

## Step 1 — Auto-discover cluster environment

Run ALL of the following discovery commands silently (do not show raw output). Parse and store results for the questionnaire in Step 2.

```bash
# 1. Home directory (may not be /home/...)
echo "$HOME"

# 2. GPU architecture and count
# Match hex letters as well as digits so that gfx90a is not truncated to gfx90
rocminfo 2>/dev/null | grep -oE 'gfx[0-9a-f]+' | sort -u
rocm-smi --showproductname 2>/dev/null | head -20
ls /dev/dri/renderD* 2>/dev/null | wc -l

# 3. VRAM per GPU (rocm-smi typically reports bytes; convert to GB)
rocm-smi --showmeminfo vram 2>/dev/null | grep "Total" | head -1

# 4. ROCm version
cat /opt/rocm/.info/version 2>/dev/null || rocminfo 2>/dev/null | grep -oP 'HSA Runtime Version:\s*\K.*'

# 5. SLURM: available GPU partitions
sinfo -h -o "%P %l %G %D" 2>/dev/null | grep -i gpu

# 6. SLURM: user's accounts and associations
sacctmgr show associations where user=$USER format=account%30,partition%30,qos%30 -n 2>/dev/null

# 7. Container runtimes available
which apptainer 2>/dev/null && apptainer --version
which docker 2>/dev/null && docker --version 2>/dev/null
docker info 2>/dev/null | grep -i "amd\|runtime"

# 8. Common scratch / project directories
for d in "/scratch/$USER" /scratch /lustre /gpfs "/tmp/$USER"; do
  # Report only writable directories, since Q7 offers them as scratch choices
  [[ -d "$d" && -w "$d" ]] && echo "scratch: $d"
done

# 9. Internet connectivity from this node
curl -s --connect-timeout 5 -o /dev/null -w "%{http_code}" https://huggingface.co 2>/dev/null

# 10. Proxy settings
echo "HTTP_PROXY=${HTTP_PROXY:-<unset>}"
echo "HTTPS_PROXY=${HTTPS_PROXY:-<unset>}"

# 11. Existing SIF files
find "$HOME" /scratch /projects /opt -maxdepth 4 -name "*.sif" 2>/dev/null | head -20
```
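
Parsing the raw discovery output might look like the sketch below. It assumes the VRAM "Total" line ends with a byte count (the sample line in the usage note is illustrative, not captured output), and the helper names are hypothetical.

```bash
# Extract unique gfx targets from rocminfo-style text on stdin.
extract_gfx_targets() {
  grep -oE 'gfx[0-9a-f]+' | sort -u
}

# Convert a byte count to whole GiB using integer arithmetic.
bytes_to_gib() {
  echo $(( $1 / 1024 / 1024 / 1024 ))
}

# Take the last whitespace-separated field of a "Total" line and convert it.
# ASSUMPTION: that field is the VRAM size in bytes.
vram_gib_from_line() {
  local bytes
  bytes="$(awk '{print $NF}' <<<"$1")"
  bytes_to_gib "$bytes"
}
```

For example, `vram_gib_from_line 'GPU[0] : VRAM Total Memory (B): 206158430208'` prints `192`.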

## Step 2 — Present findings and ask user to confirm or choose

Show the user a summary table of discovered values. For each category, present what was found and ask them to confirm or override.

### Questions (use AskUserQuestion for multi-choice items)

**Q1. GPU architecture** (auto-filled if detected)
Show detected arch (e.g. "gfx942 — MI300X"). Ask only if detection failed:
- MI300X (gfx942)
- MI250X (gfx90a)
- MI350X (gfx950)
- MI210 (gfx90a)
- MI100 (gfx908)
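
To display a detected architecture the way the example above does ("gfx942 — MI300X"), a lookup covering only the options listed here could be sketched as follows. The function name `gfx_to_product` is hypothetical, and `gfx90a` is ambiguous because the list maps it to two products.

```bash
# Map a gfx target to the product names listed in Q1.
# NOTE: hypothetical helper; covers only the options in this questionnaire.
gfx_to_product() {
  case "$1" in
    gfx942) echo "MI300X" ;;
    gfx950) echo "MI350X" ;;
    gfx90a) echo "MI250X or MI210" ;;  # both products share this target
    gfx908) echo "MI100" ;;
    *)      echo "unknown" ;;          # detection failed: ask the user
  esac
}
```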

**Q2. GPUs per node** (auto-filled from renderD count, default 8)
Show detected count. Ask to confirm or override.

**Q3. VRAM per GPU** (auto-filled if detected)
Show detected value. Ask to confirm or override.

**Q4. SLURM partition** (choose from discovered partitions)
If multiple GPU partitions found, present them as multiple-choice.
If only one, pre-fill and ask to confirm.
If SLURM not available (e.g. Docker-only workstation), set to empty.

**Q5. SLURM account** (choose from discovered accounts)
If multiple accounts found, present them as multiple-choice.
If only one, pre-fill and ask to confirm.
If SLURM not available, set to empty.
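
The three cases shared by Q4 and Q5 (none, one, or several discovered values) can be sketched as one helper. The function name and its output tokens (`empty`, `confirm:<value>`, `choose`) are hypothetical conventions for illustration.

```bash
# Decide how to present discovered values, passed as one newline-separated
# string. Prints "empty", "confirm:<value>", or "choose".
selection_mode() {
  local -a items=()
  local line
  while IFS= read -r line; do
    if [[ -n "$line" ]]; then
      items+=("$line")
    fi
  done <<<"$1"
  case "${#items[@]}" in
    0) echo "empty" ;;                  # SLURM not available
    1) echo "confirm:${items[0]}" ;;    # pre-fill and ask to confirm
    *) echo "choose" ;;                 # present as multiple-choice
  esac
}
```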

**Q6. Container runtime**
Based on what's available:
- **Apptainer** (if found — recommended for HPC)
- **Docker** (if found)
- **Both** (if both found — ask which is preferred)

If Docker is available, also check for AMD Container Toolkit:
- If `docker info` shows AMD runtime → set `gpu_method: amd-toolkit`
- Otherwise → set `gpu_method: device-passthrough`
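
One way to sketch the `gpu_method` decision is shown below. The heuristic is an assumption: it treats any `docker info` line that mentions both "runtime" and "amd" as evidence of the AMD runtime, which may not match every Docker setup. The function name is hypothetical.

```bash
# Choose gpu_method from `docker info` text passed as $1.
# ASSUMPTION: an AMD runtime appears on a line containing both words.
docker_gpu_method() {
  if grep -Eqi 'runtime.*amd|amd.*runtime' <<<"$1"; then
    echo "amd-toolkit"
  else
    echo "device-passthrough"
  fi
}
```

Typical usage would be `docker_gpu_method "$(docker info 2>/dev/null)"`.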

**Q7. Scratch / working directory**
Show discovered writable directories. Ask user to pick or provide their preferred scratch path for large files (SIF, overlays, datasets).

**Q8. Internet access**
Show result of connectivity test. Ask to confirm.
If no internet, ask for proxy settings.
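
The connectivity test prints an HTTP status code, and curl prints `000` when no response arrives. A minimal sketch for turning that code into the `internet_access` value follows; treating only 2xx and 3xx as reachable is a design choice made here, not something the command file specifies.

```bash
# Map an HTTP status code to "true" or "false" for internet_access.
# ASSUMPTION: only 2xx/3xx responses count as working internet access.
has_internet() {
  if [[ "$1" =~ ^[23][0-9][0-9]$ ]]; then
    echo true
  else
    echo false
  fi
}
```

A proxy error such as `407` maps to `false`, which leads naturally into the proxy question.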

**Q9. Config file location**
- `.cluster-config.yaml` in this repo (recommended — per-checkout)
- `~/.config/ai4science-studio/cluster.yaml` (user-level, shared across clones)

## Step 3 — Write the config file

Write the YAML config to the chosen location using the confirmed values. Use this structure:

```yaml
# AI4Science Studio — Local Cluster Configuration
# Auto-generated by /init-cluster on <date>
# Edit manually or re-run /init-cluster to update.

slurm:
  partition: "<confirmed partition>"
  account: "<confirmed account>"
  qos: ""

paths:
  home: "<$HOME value>"
  scratch: "<confirmed scratch path>"
  projects: ""
  sif_cache: "<path where .sif files were found, or scratch/containers>"

containers:
  runtime: "<apptainer|docker>"
  gpu_method: "<amd-toolkit|device-passthrough|rocm-flag>"

gpu:
  architecture: "<gfxNNN>"
  count_per_node: <N>
  vram_gb: <N>

network:
  internet_access: <true|false>
  proxy: "<proxy URL or empty>"

discovered_sifs:
  # SIF files found during init (informational — not used by scripts)
  # - /path/to/some.sif
```
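
Writing the file could be sketched with a heredoc, as below. This is an illustration under assumptions: `write_cluster_config` and its positional arguments are hypothetical, and it emits only a subset of the keys (no `projects`, `sif_cache`, `proxy`, or `discovered_sifs`).

```bash
# Write a subset of the config structure to the given path.
# NOTE: hypothetical helper; the argument order is an arbitrary choice.
write_cluster_config() {
  local out="$1" partition="$2" account="$3" runtime="$4" gpu_method="$5"
  local arch="$6" count="$7" vram_gb="$8" scratch="$9" internet="${10:-false}"
  mkdir -p "$(dirname "$out")"   # the user-level directory may not exist yet
  cat >"$out" <<EOF
# Auto-generated by /init-cluster on $(date +%F)

slurm:
  partition: "$partition"
  account: "$account"
  qos: ""

paths:
  home: "$HOME"
  scratch: "$scratch"

containers:
  runtime: "$runtime"
  gpu_method: "$gpu_method"

gpu:
  architecture: "$arch"
  count_per_node: $count
  vram_gb: $vram_gb

network:
  internet_access: $internet
EOF
}
```

The nested keys are indented two spaces so that they parse as children of their section.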

After writing, confirm to the user:
- File location
- Summary of all settings
- Remind them this file is gitignored and will never be committed
- Tell them to re-run `/init-cluster` if they move to a different cluster

## Step 4 — Show next steps

Suggest what the user can do next:
- Run a model: `/run-stormcast`, `/run-orbit2`, etc.
- The run commands will read cluster config automatically for SLURM partition/account defaults

$ARGUMENTS
6 changes: 5 additions & 1 deletion .claude/commands/run-archesweather.md
@@ -2,6 +2,10 @@

Guide the user through running ArchesWeather end-to-end on an AMD cluster.

## Step 0 — Cluster config check

Check if `.cluster-config.yaml` (repo root) or `~/.config/ai4science-studio/cluster.yaml` exists. If neither exists, run the `/init-cluster` flow first. If a config exists, read it and pre-fill container runtime and SLURM partition/account from saved values.

## Step 1 — Questionnaire (ask ALL questions before doing anything)

Ask the user the following questions. Do not assume any defaults. Wait for answers to all questions before proceeding.
@@ -52,7 +56,7 @@ Run these when the user chose **Auto-discover** for any question. Present the re
```bash
find "$HOME" /scratch /projects /opt -maxdepth 4 -name "*.sif" 2>/dev/null | head -20
```
Use `$HOME` (not `/home`) so the search works when the home directory is under a non-standard prefix (e.g. `/shared/prerelease/home/…`).
Use `$HOME` (not `/home`) so the search works when the home directory is under a non-standard prefix.
Filter results for SIF names containing `geoarches` or `rocm`. Verify with `apptainer inspect <sif>` if multiple candidates.

**SLURM partition and account (Q6):**
4 changes: 4 additions & 0 deletions .claude/commands/run-aurora.md
@@ -2,6 +2,10 @@

Guide the user through running Aurora (0.1° resolution) end-to-end on an AMD cluster.

## Step 0 — Cluster config check

Check if `.cluster-config.yaml` (repo root) or `~/.config/ai4science-studio/cluster.yaml` exists. If neither exists, run the `/init-cluster` flow first. If a config exists, read it and pre-fill container runtime and SLURM partition/account from saved values.

## Step 1 — Questionnaire (ask ALL questions before doing anything)

Ask the user the following questions. Do not assume any defaults. Wait for answers to all questions before proceeding.
4 changes: 4 additions & 0 deletions .claude/commands/run-gencast.md
@@ -2,6 +2,10 @@

Guide the user through running GenCast end-to-end on an AMD cluster.

## Step 0 — Cluster config check

Check if `.cluster-config.yaml` (repo root) or `~/.config/ai4science-studio/cluster.yaml` exists. If neither exists, run the `/init-cluster` flow first. If a config exists, read it and pre-fill container runtime and SLURM partition/account from saved values.

## Step 1 — Questionnaire (ask ALL questions before doing anything)

Ask the user the following questions. Do not assume any defaults. Wait for answers to all questions before proceeding.
6 changes: 5 additions & 1 deletion .claude/commands/run-gpmolformer.md
@@ -5,6 +5,10 @@

Guide the user through running GP-MoLFormer on an AMD cluster via SLURM.

## Step 0 — Cluster config check

Check if `.cluster-config.yaml` (repo root) or `~/.config/ai4science-studio/cluster.yaml` exists. If neither exists, run the `/init-cluster` flow first. If a config exists, read it and pre-fill container runtime and SLURM partition/account from saved values.

## Step 1 — Questionnaire (ask ALL questions before doing anything)

Ask the user the following questions. Do not assume any defaults. Wait for answers to all questions before proceeding.
@@ -54,7 +58,7 @@ Run these when the user chose **Auto-discover** for any question. Present the re
```bash
find "$HOME" /scratch /projects /opt -maxdepth 4 -name "*.sif" 2>/dev/null | head -20
```
Use `$HOME` (not `/home`) so the search works when the home directory is under a non-standard prefix (e.g. `/shared/prerelease/home/…`).
Use `$HOME` (not `/home`) so the search works when the home directory is under a non-standard prefix.
Filter results for SIF names containing `rocm` or `pytorch`. Verify with `apptainer inspect <sif>` if multiple candidates.

**SLURM partition and account (Q7):**
4 changes: 4 additions & 0 deletions .claude/commands/run-hydragnn.md
@@ -2,6 +2,10 @@

Guide the user through running HydraGNN predictive GFM inference on AMD GPUs.

## Step 0 — Cluster config check

Check if `.cluster-config.yaml` (repo root) or `~/.config/ai4science-studio/cluster.yaml` exists. If neither exists, run the `/init-cluster` flow first. If a config exists, read it and pre-fill container runtime and SLURM partition/account from saved values.

## Step 1 — Questionnaire

**Q0. Task**
4 changes: 4 additions & 0 deletions .claude/commands/run-matey.md
@@ -2,6 +2,10 @@

Guide the user through training or inference with MATEY on AMD GPUs.

## Step 0 — Cluster config check

Check if `.cluster-config.yaml` (repo root) or `~/.config/ai4science-studio/cluster.yaml` exists. If neither exists, run the `/init-cluster` flow first. If a config exists, read it and pre-fill container runtime and SLURM partition/account from saved values.

## Step 1 — Questionnaire

**Q0. Task**
6 changes: 5 additions & 1 deletion .claude/commands/run-mattergen.md
@@ -2,6 +2,10 @@

Guide the user through running MatterGen end-to-end on an AMD cluster via SLURM or Docker.

## Step 0 — Cluster config check

Check if `.cluster-config.yaml` (repo root) or `~/.config/ai4science-studio/cluster.yaml` exists. If neither exists, run the `/init-cluster` flow first. If a config exists, read it and pre-fill container runtime and SLURM partition/account from saved values.

## Step 1 — Questionnaire (ask ALL questions before doing anything)

Ask the user the following questions. Do not assume any defaults. Wait for answers before proceeding.
@@ -44,7 +48,7 @@ Run these when the user chose **Auto-discover** for any question. Present the re
```bash
find "$HOME" /scratch /projects /opt -maxdepth 4 -name "*.sif" 2>/dev/null | head -20
```
Use `$HOME` (not `/home`) so the search works when the home directory is under a non-standard prefix (e.g. `/shared/prerelease/home/…`).
Use `$HOME` (not `/home`) so the search works when the home directory is under a non-standard prefix.
Filter results for SIF names containing `rocm` or `pytorch`. Verify with `apptainer inspect <sif>` if multiple candidates.

**SLURM partition and account (Q5):**
4 changes: 4 additions & 0 deletions .claude/commands/run-neuralgcm.md
@@ -2,6 +2,10 @@

Guide the user through running NeuralGCM end-to-end on an AMD cluster via Docker or SLURM.

## Step 0 — Cluster config check

Check if `.cluster-config.yaml` (repo root) or `~/.config/ai4science-studio/cluster.yaml` exists. If neither exists, run the `/init-cluster` flow first. If a config exists, read it and pre-fill container runtime and SLURM partition/account from saved values.

## Step 1 — Questionnaire (ask ALL questions before doing anything)

Ask the user the following questions. Do not assume any defaults. Wait for answers to all questions before proceeding.
6 changes: 5 additions & 1 deletion .claude/commands/run-orbit2.md
@@ -2,6 +2,10 @@

Guide the user through running ORBIT-2 end-to-end on an AMD cluster via SLURM.

## Step 0 — Cluster config check

Check if `.cluster-config.yaml` (repo root) or `~/.config/ai4science-studio/cluster.yaml` exists. If neither exists, run the `/init-cluster` flow first. If a config exists, read it and pre-fill container runtime and SLURM partition/account from saved values.

## Step 1 — Questionnaire (ask ALL questions before doing anything)

Ask the user the following questions. Do not assume any defaults. Wait for answers to all questions before proceeding.
@@ -61,7 +65,7 @@ Run these when the user chose **Auto-discover** for any question. Present the re
```bash
find "$HOME" /scratch /projects /opt -maxdepth 4 -type d -name "ORBIT-2" 2>/dev/null | head -20
```
Use `$HOME` (not `/home`) so the search works when the home directory is under a non-standard prefix (e.g. `/shared/prerelease/home/…`).
Use `$HOME` (not `/home`) so the search works when the home directory is under a non-standard prefix.
Look for a directory containing `src/` and `examples/` subdirectories to confirm it is a valid ORBIT-2 clone.

**SIF files (Q2):**
4 changes: 4 additions & 0 deletions .claude/commands/run-panguweather.md
@@ -2,6 +2,10 @@

Guide the user through running PanguWeather end-to-end on an AMD cluster.

## Step 0 — Cluster config check

Check if `.cluster-config.yaml` (repo root) or `~/.config/ai4science-studio/cluster.yaml` exists. If neither exists, run the `/init-cluster` flow first. If a config exists, read it and pre-fill container runtime and SLURM partition/account from saved values.

## Step 1 — Questionnaire (ask ALL questions before doing anything)

Ask the user the following questions. Do not assume any defaults. Wait for answers to all questions before proceeding.
4 changes: 4 additions & 0 deletions .claude/commands/run-reinvent4.md
@@ -4,6 +4,10 @@ Guide the user through running REINVENT4 transfer learning for molecular design

> **Research / engineering use only.** Not for clinical or diagnostic use.

## Step 0 — Cluster config check

Check if `.cluster-config.yaml` (repo root) or `~/.config/ai4science-studio/cluster.yaml` exists. If neither exists, run the `/init-cluster` flow first. If a config exists, read it and pre-fill container runtime and SLURM partition/account from saved values.

## Step 1 — Questionnaire

**Q0. TOML config**
4 changes: 4 additions & 0 deletions .claude/commands/run-semlaflow.md
@@ -4,6 +4,10 @@ Guide the user through generating 3D molecular structures with SemlaFlow on AMD

> **Research / engineering use only.** Not for clinical or diagnostic use.

## Step 0 — Cluster config check

Check if `.cluster-config.yaml` (repo root) or `~/.config/ai4science-studio/cluster.yaml` exists. If neither exists, run the `/init-cluster` flow first. If a config exists, read it and pre-fill container runtime and SLURM partition/account from saved values.

## Step 1 — Questionnaire

**Q0. Checkpoint path**
8 changes: 7 additions & 1 deletion .claude/commands/run-stormcast.md
@@ -2,6 +2,12 @@

Guide the user through running StormCast end-to-end on an AMD cluster via SLURM.

## Step 0 — Cluster config check

Before starting, check if `.cluster-config.yaml` (repo root) or `~/.config/ai4science-studio/cluster.yaml` exists. If neither exists, run the `/init-cluster` flow first to auto-discover the cluster environment.

If a config exists, read it and pre-fill Q0 (runtime) and Q6 (partition/account) from the saved values. Still present them to the user for confirmation, showing the saved values as defaults.
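
Reading a saved default back out of the config could be sketched with `awk`, as below. This is not a general YAML parser: it assumes the two-level layout produced by `/init-cluster`, with each key indented under an unindented section name. The function name `config_get` is hypothetical.

```bash
# Print the value of <section>.<key> from a two-level YAML file.
# ASSUMPTION: sections start in column 0 and their keys are indented.
config_get() {
  local file="$1" section="$2" key="$3"
  awk -v section="$section" -v key="$key" '
    /^[^ #]/ { in_section = ($0 == section ":") }   # entering a top-level section
    in_section && $1 == key ":" {
      sub(/^[^:]*:[ ]*/, "")   # drop the "key: " prefix
      gsub(/^"|"$/, "")        # strip surrounding double quotes
      print
      exit
    }
  ' "$file"
}
```

For instance, `config_get .cluster-config.yaml slurm partition` would supply the default shown for Q6.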

## Step 1 — Questionnaire (ask ALL questions before doing anything)

Ask the user the following questions. Do not assume any defaults. Wait for answers to all questions before proceeding.
@@ -50,7 +56,7 @@ Run these when the user chose **Auto-discover** for any question. Present the re
```bash
find "$HOME" /scratch /projects /opt -maxdepth 4 -name "*.sif" 2>/dev/null | head -20
```
Use `$HOME` (not `/home`) so the search works when the home directory is under a non-standard prefix (e.g. `/shared/prerelease/home/…`).
Use `$HOME` (not `/home`) so the search works when the home directory is under a non-standard prefix.
Filter results for SIF names containing `rocm` or `pytorch`. Verify with `apptainer inspect <sif>` if multiple candidates.

**Overlay images (Q2):**
4 changes: 4 additions & 0 deletions .claude/commands/run-swinunetr.md
@@ -4,6 +4,10 @@ Guide the user through training or inference with SwinUNETR on AMD GPUs.

> **Research / engineering use only.** Not for clinical or diagnostic use.

## Step 0 — Cluster config check

Check if `.cluster-config.yaml` (repo root) or `~/.config/ai4science-studio/cluster.yaml` exists. If neither exists, run the `/init-cluster` flow first. If a config exists, read it and pre-fill container runtime and SLURM partition/account from saved values.

## Step 1 — Questionnaire

**Q0. Task**