Aicomnet dev slurm multi by mkuznet1 · Pull Request #142 · ROCm/madengine

mkuznet1 · 2026-06-03T15:58:16Z

Summary

Closes the multi-node coordination gap in the SLURM slurm_multi path. The
upstream refactor made madengine multi-node observable (per-rank log staging,
login-node collect_results, partial manifest restore) but did not yet
coordinate which node owns the local image, when workers may enter
docker run, whose CSV the aggregator should trust, and which manifest fields a
worker needs to reconstruct the builder's environment.

This PR ports that coordination onto slurm_multi as four staged commits
(preparation → build → running → postprocessing) plus one follow-up fix.
Single-node behavior is unchanged — every new path is gated on
MAD_DOCKER_BUILDS, NNODES > 1, skip_perf_collection, manifest presence, or
the failure flow.

What's new

Multi-node coordination

Shared local-image tar cache (MAD_DOCKER_BUILDS): rank 0 docker saves
the locally-built image into a shared dir (atomic sibling-.tmp + os.replace);
workers docker load it instead of pulling. Local images are not in any
registry, so upstream's registry-only pull path cannot reach them on workers.
Rebuild-from-manifest fallback on a worker where the local image is missing
and no tar is available (_build_local_image_from_manifest, with shlex.quote
escaping for every path/image/build-arg component since Console.sh runs with
shell=True).
Deterministic TCP image-ready barrier (_tcp_image_ready_barrier): rank 0
listens on a port derived from MASTER_PORT+SLURM_JOB_ID; workers send
READY <token> <rank> and wait for GO, so no one enters docker run against
a half-loaded image. No shared FS required (avoids NFS/Weka metadata lag).
Hardening: MAD_BARRIER_TOKEN secret, bind to the routable MASTER_ADDR IP
(skips loopback so an /etc/hosts mapping can't strand the listener on lo),
line-based partial-read-safe recv, worker-rank normalization, strict
rank/nnodes parsing, ACKed-socket cleanup + peer logging on master timeout.
Best-non-empty multi-node CSV picker in collect_results
(_select_best_multiple_results_csv): gathers every per-node candidate and
ranks by count of non-empty performance rows (tie-break on total rows), so
the aggregator uses the worker that actually holds the throughput numbers
instead of a present-but-empty master CSV. Header/row keys are stripped so a
leading-space CSV header still matches.
Manifest runtime-context restore on workers now covers docker_mounts,
docker_build_arg, docker_gpus, gpu_vendor, guest_os — runtime-detected
values keep priority.

Bug fixes on the restore / verdict paths

docker_env_vars restore no longer overwrites os.environ-sourced values
(e.g. MAD_SECRETS_HFTOKEN) with unexpanded ${VAR} literals from an old
manifest. Fixes HF 401/403 on Primus / gated models on rerun.
Empty-perf verdict deferred to the login node (skip_perf_collection):
Primus emits throughput only on the last global rank, which often lives on a
different node than the designated collector (MAD_COLLECT_METRICS=true). The
per-node in-job check used to mark the whole job FAILED (exit 3) on the
collector's empty local CSV; it now defers to the authoritative login-node
aggregation. Real failures are still caught first by the error-pattern scan,
and a genuinely-empty multi-node run still gets a FAILURE row from the login
step.

Robustness exposed by scale-out

amd-smi JSON tolerance: parse via JSONDecoder.raw_decode so trailing
non-JSON (deprecation banners, double-emitted blocks under concurrent SLURM
tasks) no longer aborts run-phase init.
Side-effect-free lightweight CLI: madengine --version/--help skip
MODEL_DIR setup, avoiding global-state pollution of later runs in the same
shell.
Duplicate-mount protection: get_mount_arg de-duplicates against
-v/--volume targets already present in additional_docker_run_options so
Docker doesn't reject "Duplicate mount point".
Container-failure diagnostics: on a docker exec RuntimeError, a bounded,
non-fatal snapshot (process table, listening sockets via ss/netstat/lsof,
tails of /run_logs and the model-dir logs) is captured inside the failed
container and lands in the run log next to the failure; the original error is
re-raised unchanged.

Validation

Per-stage 2-node validation on a real SLURM cluster: shared-tar save/load,
TCP barrier release with the image removed on the worker, and throughput parity
vs the single-image baseline.
End-to-end 2-node Primus Megatron-LM Llama-3.1-8B (2 × 8 GPU, IB): worker
rank correctly reports ~10.1k tok/s/GPU, ~522 TFLOPS/GPU, and the run is now
marked SUCCESS (previously a false FAILURE on the collector's empty CSV).
pytest tests/unit/test_slurm_multi.py — 20 passed. Integration tests cover the
shared-tar paths (primary-saves, existing-tar-is-loaded, worker-waits-for-primary).

Test plan

2-node Primus Megatron-LM 8B with image absent on the worker (exercises
build-on-primary → save → worker load → TCP barrier → SUCCESS verdict).
Rerun on a gated/Primus model to confirm MAD_SECRETS_HFTOKEN is not
clobbered by manifest ${VAR} placeholders (no 401/403).
Single-node SLURM run to confirm unchanged behavior (gated paths inactive).
Confirm collect_results selects the worker CSV and writes the aggregated
perf record.

…estore, secrets precedence, lightweight-CLI isolation Stage A of porting the aicomnet multi-node local-image work onto the slurm_multi pipeline (develop). Preparation-phase changes only: - core/context: parse amd-smi JSON via JSONDecoder.raw_decode so trailing non-JSON text (deprecation banners, double-emitted blocks under concurrent slurm tasks) no longer aborts run-phase init. - orchestration/run_orchestrator: restore host-level runtime context (docker_mounts, docker_build_arg, docker_gpus, gpu_vendor, guest_os) from the manifest on execution, keeping runtime-detected values as priority; docker_env_vars from the manifest no longer overwrite runtime values already populated by Context (e.g. MAD_SECRETS_* read from os.environ), so unexpanded "${VAR}" manifest placeholders cannot clobber the resolved secret. - core/constants: --version/--help lightweight CLI invocations skip MODEL_DIR setup and keep output clean, avoiding global-state pollution of subsequent runs in the same shell.

…tribution Stage B of porting the aicomnet multi-node local-image work onto the slurm_multi pipeline. Adds local-image preparation to the per-node run path (run_models_from_manifest, which slurm_multi invokes on every node in docker target mode): - _ensure_local_image_available: primary node (rank 0) ensures the local image is present (building it from the manifest dockerfile if missing), and when MAD_DOCKER_BUILDS points at a shared dir, docker-saves it into a shared tar; worker nodes load that tar instead of pulling. Local images are not in any registry, so upstreams registry-only pull path cannot reach them on workers. - Atomic tar write (sibling .tmp + os.replace) so peers never load a half-written tar. - _build_local_image_from_manifest / _get_build_args / _build_or_pull_local_image with shlex.quote shell-escaping for all path/image/build-arg components (Console.sh runs with shell=True). - Strict NODE_RANK/RANK and NNODES/WORLD_SIZE parsing (malformed values raise instead of silently defaulting to 0/1 and diverging). - _sync_after_local_image_ready: Stage B uses a shared-filesystem readiness wait (workers without the image poll for the primary tar); Stage C replaces this with a deterministic TCP barrier. Replaces the previous verify-then-pull block, which could only pull registry images and left local multi-node images unreachable on worker nodes.

…e-duplication Stage C of porting the aicomnet multi-node local-image work onto the slurm_multi pipeline. Replaces the Stage B shared-FS readiness poll with a robust TCP rendezvous and adds docker mount de-duplication. - _tcp_image_ready_barrier / _recv_line: NODE_RANK=0 listens on a port range derived from MASTER_PORT+SLURM_JOB_ID; workers send "READY <token> <rank>" and wait for "GO". No shared-FS visibility required (avoids NFS/Weka metadata lag). Hardening: MAD_BARRIER_TOKEN secret (defaults JOB<id>); bind to MASTER_ADDR resolved IP, skipping loopback (127/8, ::1) so an /etc/hosts loopback mapping cannot leave the listener on lo while workers connect over the routable IP; line-based recv (partial-read safe); worker-rank normalization so NODE_RANK="01" matches the int ACK; strict rank / nnodes parsing; on master timeout the ACKed worker sockets are closed (EOF, not half-open); accepted-peer list logged for deadlock diagnosis. - _sync_after_local_image_ready now drives the TCP barrier instead of polling the shared filesystem. - get_mount_arg de-duplicates against -v/--volume targets already present in additional_docker_run_options (via _extract_additional_mount_targets) so docker does not reject "Duplicate mount point". SLURM->container env pass-through and --version/--help MODEL_DIR isolation were already covered (develop slurm_env_vars list; Stage A constants).

… + on-failure container diagnostics Stage D of porting the aicomnet multi-node local-image work onto the slurm_multi pipeline. Covers result aggregation and failure triage across nodes: - slurm.py collect_results: instead of taking the first node_<rank> directory that happens to hold the multiple_results CSV, gather every per-node candidate (plus the job-dir copy) and pass them to _select_best_multiple_results_csv, which ranks by the count of non-empty "performance" rows and breaks ties on total row count. In multi-node runs the master CSV is frequently present but empty while a worker holds the real throughput numbers, so first-match could aggregate empty data; this defers empty-perf handling to the richest file. Header/row keys are stripped so a leading-space CSV header still matches. - container_runner.py run path: wrap the model-script exec in a try/except that, on RuntimeError, parses the failed container id from the error and snapshots bounded diagnostics inside it (process table, listening ports, and tails of /run_logs and the model dir logs) via Console.sh so they land in the run log next to the failure. All diagnostic calls are time-bounded and non-fatal; the original error is re-raised unchanged.

…LURM in-job runs In multi-node SLURM runs the per-node in-job status check marked the job FAILED when the designated collector node (MAD_COLLECT_METRICS=true) had an empty multiple_results CSV, even though Primus emits throughput only on the last global rank, which frequently lives on a different node than the collector. With skip_perf_collection set, the login-node collect_results already selects the richest per-node CSV (_select_best_multiple_results_csv) and writes the authoritative perf/status record, so an empty local perf is not authoritative and must not hard-fail the node (exit 3 -> whole job fails). Add a skip_perf_collection guard to the status determination (and the except fallback). Real failures remain caught first by the error-pattern scan; genuinely-empty multi-node runs still get a FAILURE row from the login-node aggregation.

Copilot

Pull request overview

This PR adds missing multi-node coordination and robustness improvements across the SLURM slurm_multi execution path, ensuring workers can reliably obtain locally-built images, synchronize before docker run, and aggregate performance results correctly across nodes.

Changes:

Restore additional runtime/build context from manifests (mounts/build args/env vars) while preserving runtime-detected values.
Add multi-node support utilities in the container runner (shared image tar cache, TCP barrier, duplicate mount avoidance, and failure diagnostics).
Improve SLURM result aggregation by selecting the “best” per-node multiple_results CSV and harden GPU discovery JSON parsing.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

Show a summary per file

File	Description
src/madengine/orchestration/run_orchestrator.py	Adjusts local-image manifest creation and improves manifest context restore behavior.
src/madengine/execution/container_runner.py	Adds multi-node local-image coordination helpers, mount de-dup, diagnostics, and verdict behavior for deferred perf collection.
src/madengine/deployment/slurm.py	Picks the richest per-node `multiple_results` CSV during login-node aggregation.
src/madengine/core/context.py	Makes amd-smi JSON parsing tolerant of trailing non-JSON output.
src/madengine/core/constants.py	Avoids side effects (MODEL_DIR setup / verbose config prints) for lightweight CLI invocations like `--help`/`--version`.

Comments suppressed due to low confidence (1)

src/madengine/core/context.py:765

When ROCm < 6.4.1 and KFD topology is not accessible, kfd_renderDs can remain None (the code explicitly tolerates KFD permission errors). The subsequent len(kfd_renderDs) will then raise TypeError, losing the clearer error and making the fallback path brittle. Add an explicit None guard before using kfd_renderDs in the ROCm<6.4.1 mapping branch.

                if len(kfd_unique_ids) != len(kfd_renderDs):
                    raise RuntimeError(
                        f"Mismatch between unique_ids count ({len(kfd_unique_ids)}) "
                        f"and renderDs count ({len(kfd_renderDs)})"
                    )

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

        # Validate that the image exists locally or can be pulled
        try:
-            self.console.sh(f"docker image inspect {shlex.quote(image_name)} > /dev/null 2>&1")
+            self.console.sh(f"docker image inspect {image_name} > /dev/null 2>&1")
            self.rich_console.print(f"[green]✓ Image {image_name} found locally[/green]")
        except (subprocess.CalledProcessError, RuntimeError) as e:
            self.rich_console.print(f"[yellow]⚠️  Image {image_name} not found locally, attempting to pull...[/yellow]")
            try:
-                self.console.sh(f"docker pull {shlex.quote(image_name)}")
+                self.console.sh(f"docker pull {image_name}")
                self.rich_console.print(f"[green]✓ Successfully pulled {image_name}[/green]")


…opilot review) Resolves Copilot review (High): in _create_manifest_from_local_image, image_name was interpolated into "docker image inspect" / "docker pull" commands passed to Console.sh (shell=True) without quoting -- a command-injection risk that also breaks valid image names containing shell-special characters. Escape it once via shlex.quote and use the quoted form in both shell calls (display strings keep the raw name). Matches the shlex.quote escaping already used on the Stage B/C container_runner shell paths.

mkuznet1 added 5 commits June 3, 2026 15:29

Copilot AI review requested due to automatic review settings June 3, 2026 15:58

mkuznet1 requested review from Cemberk, Rohan138, coketaste, gargrahul and leconcio as code owners June 3, 2026 15:58

Copilot started reviewing on behalf of mkuznet1 June 3, 2026 15:58 View session

Copilot AI reviewed Jun 3, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Aicomnet dev slurm multi#142

Aicomnet dev slurm multi#142
mkuznet1 wants to merge 6 commits into
ROCm:developfrom
mkuznet1:aicomnet_dev_slurm_multi

mkuznet1 commented Jun 3, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

mkuznet1 commented Jun 3, 2026

Summary

What's new

Multi-node coordination

Bug fixes on the restore / verdict paths

Robustness exposed by scale-out

Validation

Test plan

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants