Skip to content

Aicomnet dev slurm multi#142

Open
mkuznet1 wants to merge 6 commits into
ROCm:developfrom
mkuznet1:aicomnet_dev_slurm_multi
Open

Aicomnet dev slurm multi#142
mkuznet1 wants to merge 6 commits into
ROCm:developfrom
mkuznet1:aicomnet_dev_slurm_multi

Conversation

@mkuznet1
Copy link
Copy Markdown

@mkuznet1 mkuznet1 commented Jun 3, 2026

Summary

Closes the multi-node coordination gap in the SLURM slurm_multi path. The
upstream refactor made madengine multi-node observable (per-rank log staging,
login-node collect_results, partial manifest restore) but did not yet
coordinate which node owns the local image, when workers may enter
docker run, whose CSV the aggregator should trust, and which manifest fields a
worker needs to reconstruct the builder's environment.

This PR ports that coordination onto slurm_multi as four staged commits
(preparation → build → running → postprocessing) plus one follow-up fix.
Single-node behavior is unchanged — every new path is gated on
MAD_DOCKER_BUILDS, NNODES > 1, skip_perf_collection, manifest presence, or
the failure flow.

What's new

Multi-node coordination

  • Shared local-image tar cache (MAD_DOCKER_BUILDS): rank 0 docker saves
    the locally-built image into a shared dir (atomic sibling-.tmp + os.replace);
    workers docker load it instead of pulling. Local images are not in any
    registry, so upstream's registry-only pull path cannot reach them on workers.
  • Rebuild-from-manifest fallback on a worker where the local image is missing
    and no tar is available (_build_local_image_from_manifest, with shlex.quote
    escaping for every path/image/build-arg component since Console.sh runs with
    shell=True).
  • Deterministic TCP image-ready barrier (_tcp_image_ready_barrier): rank 0
    listens on a port derived from MASTER_PORT+SLURM_JOB_ID; workers send
    READY <token> <rank> and wait for GO, so no one enters docker run against
    a half-loaded image. No shared FS required (avoids NFS/Weka metadata lag).
    Hardening: MAD_BARRIER_TOKEN secret, bind to the routable MASTER_ADDR IP
    (skips loopback so an /etc/hosts mapping can't strand the listener on lo),
    line-based partial-read-safe recv, worker-rank normalization, strict
    rank/nnodes parsing, ACKed-socket cleanup + peer logging on master timeout.
  • Best-non-empty multi-node CSV picker in collect_results
    (_select_best_multiple_results_csv): gathers every per-node candidate and
    ranks by count of non-empty performance rows (tie-break on total rows), so
    the aggregator uses the worker that actually holds the throughput numbers
    instead of a present-but-empty master CSV. Header/row keys are stripped so a
    leading-space CSV header still matches.
  • Manifest runtime-context restore on workers now covers docker_mounts,
    docker_build_arg, docker_gpus, gpu_vendor, guest_os — runtime-detected
    values keep priority.

Bug fixes on the restore / verdict paths

  • docker_env_vars restore no longer overwrites os.environ-sourced values
    (e.g. MAD_SECRETS_HFTOKEN) with unexpanded ${VAR} literals from an old
    manifest. Fixes HF 401/403 on Primus / gated models on rerun.
  • Empty-perf verdict deferred to the login node (skip_perf_collection):
    Primus emits throughput only on the last global rank, which often lives on a
    different node than the designated collector (MAD_COLLECT_METRICS=true). The
    per-node in-job check used to mark the whole job FAILED (exit 3) on the
    collector's empty local CSV; it now defers to the authoritative login-node
    aggregation. Real failures are still caught first by the error-pattern scan,
    and a genuinely-empty multi-node run still gets a FAILURE row from the login
    step.

Robustness exposed by scale-out

  • amd-smi JSON tolerance: parse via JSONDecoder.raw_decode so trailing
    non-JSON (deprecation banners, double-emitted blocks under concurrent SLURM
    tasks) no longer aborts run-phase init.
  • Side-effect-free lightweight CLI: madengine --version/--help skip
    MODEL_DIR setup, avoiding global-state pollution of later runs in the same
    shell.
  • Duplicate-mount protection: get_mount_arg de-duplicates against
    -v/--volume targets already present in additional_docker_run_options so
    Docker doesn't reject "Duplicate mount point".
  • Container-failure diagnostics: on a docker exec RuntimeError, a bounded,
    non-fatal snapshot (process table, listening sockets via ss/netstat/lsof,
    tails of /run_logs and the model-dir logs) is captured inside the failed
    container and lands in the run log next to the failure; the original error is
    re-raised unchanged.

Validation

  • Per-stage 2-node validation on a real SLURM cluster: shared-tar save/load,
    TCP barrier release with the image removed on the worker, and throughput parity
    vs the single-image baseline.
  • End-to-end 2-node Primus Megatron-LM Llama-3.1-8B (2 × 8 GPU, IB): worker
    rank correctly reports ~10.1k tok/s/GPU, ~522 TFLOPS/GPU, and the run is now
    marked SUCCESS (previously a false FAILURE on the collector's empty CSV).
  • pytest tests/unit/test_slurm_multi.py — 20 passed. Integration tests cover the
    shared-tar paths (primary-saves, existing-tar-is-loaded, worker-waits-for-primary).

Test plan

  • 2-node Primus Megatron-LM 8B with image absent on the worker (exercises
    build-on-primary → save → worker load → TCP barrier → SUCCESS verdict).
  • Rerun on a gated/Primus model to confirm MAD_SECRETS_HFTOKEN is not
    clobbered by manifest ${VAR} placeholders (no 401/403).
  • Single-node SLURM run to confirm unchanged behavior (gated paths inactive).
  • Confirm collect_results selects the worker CSV and writes the aggregated
    perf record.

mkuznet1 added 5 commits June 3, 2026 15:29
…estore, secrets precedence, lightweight-CLI isolation

Stage A of porting the aicomnet multi-node local-image work onto the
slurm_multi pipeline (develop). Preparation-phase changes only:

- core/context: parse amd-smi JSON via JSONDecoder.raw_decode so trailing
  non-JSON text (deprecation banners, double-emitted blocks under
  concurrent slurm tasks) no longer aborts run-phase init.
- orchestration/run_orchestrator: restore host-level runtime context
  (docker_mounts, docker_build_arg, docker_gpus, gpu_vendor, guest_os)
  from the manifest on execution, keeping runtime-detected values as
  priority; docker_env_vars from the manifest no longer overwrite runtime
  values already populated by Context (e.g. MAD_SECRETS_* read from
  os.environ), so unexpanded "${VAR}" manifest placeholders cannot clobber
  the resolved secret.
- core/constants: --version/--help lightweight CLI invocations skip
  MODEL_DIR setup and keep output clean, avoiding global-state pollution
  of subsequent runs in the same shell.
…tribution

Stage B of porting the aicomnet multi-node local-image work onto the
slurm_multi pipeline. Adds local-image preparation to the per-node run path
(run_models_from_manifest, which slurm_multi invokes on every node in docker
target mode):

- _ensure_local_image_available: primary node (rank 0) ensures the local image
  is present (building it from the manifest dockerfile if missing), and when
  MAD_DOCKER_BUILDS points at a shared dir, docker-saves it into a shared tar;
  worker nodes load that tar instead of pulling. Local images are not in any
  registry, so upstreams registry-only pull path cannot reach them on workers.
- Atomic tar write (sibling .tmp + os.replace) so peers never load a
  half-written tar.
- _build_local_image_from_manifest / _get_build_args / _build_or_pull_local_image
  with shlex.quote shell-escaping for all path/image/build-arg components
  (Console.sh runs with shell=True).
- Strict NODE_RANK/RANK and NNODES/WORLD_SIZE parsing (malformed values raise
  instead of silently defaulting to 0/1 and diverging).
- _sync_after_local_image_ready: Stage B uses a shared-filesystem readiness wait
  (workers without the image poll for the primary tar); Stage C replaces this
  with a deterministic TCP barrier.

Replaces the previous verify-then-pull block, which could only pull registry
images and left local multi-node images unreachable on worker nodes.
…e-duplication

Stage C of porting the aicomnet multi-node local-image work onto the slurm_multi
pipeline. Replaces the Stage B shared-FS readiness poll with a robust TCP
rendezvous and adds docker mount de-duplication.

- _tcp_image_ready_barrier / _recv_line: NODE_RANK=0 listens on a port range
  derived from MASTER_PORT+SLURM_JOB_ID; workers send "READY <token> <rank>" and
  wait for "GO". No shared-FS visibility required (avoids NFS/Weka metadata lag).
  Hardening: MAD_BARRIER_TOKEN secret (defaults JOB<id>); bind to MASTER_ADDR
  resolved IP, skipping loopback (127/8, ::1) so an /etc/hosts loopback mapping
  cannot leave the listener on lo while workers connect over the routable IP;
  line-based recv (partial-read safe); worker-rank normalization so NODE_RANK="01"
  matches the int ACK; strict rank / nnodes parsing; on master timeout the ACKed
  worker sockets are closed (EOF, not half-open); accepted-peer list logged for
  deadlock diagnosis.
- _sync_after_local_image_ready now drives the TCP barrier instead of polling the
  shared filesystem.
- get_mount_arg de-duplicates against -v/--volume targets already present in
  additional_docker_run_options (via _extract_additional_mount_targets) so docker
  does not reject "Duplicate mount point".

SLURM->container env pass-through and --version/--help MODEL_DIR isolation were
already covered (develop slurm_env_vars list; Stage A constants).
… + on-failure container diagnostics

Stage D of porting the aicomnet multi-node local-image work onto the slurm_multi
pipeline. Covers result aggregation and failure triage across nodes:

- slurm.py collect_results: instead of taking the first node_<rank> directory
  that happens to hold the multiple_results CSV, gather every per-node candidate
  (plus the job-dir copy) and pass them to _select_best_multiple_results_csv,
  which ranks by the count of non-empty "performance" rows and breaks ties on
  total row count. In multi-node runs the master CSV is frequently present but
  empty while a worker holds the real throughput numbers, so first-match could
  aggregate empty data; this defers empty-perf handling to the richest file.
  Header/row keys are stripped so a leading-space CSV header still matches.
- container_runner.py run path: wrap the model-script exec in a try/except that,
  on RuntimeError, parses the failed container id from the error and snapshots
  bounded diagnostics inside it (process table, listening ports, and tails of
  /run_logs and the model dir logs) via Console.sh so they land in the run log
  next to the failure. All diagnostic calls are time-bounded and non-fatal; the
  original error is re-raised unchanged.
…LURM in-job runs

In multi-node SLURM runs the per-node in-job status check marked the job FAILED
when the designated collector node (MAD_COLLECT_METRICS=true) had an empty
multiple_results CSV, even though Primus emits throughput only on the last global
rank, which frequently lives on a different node than the collector. With
skip_perf_collection set, the login-node collect_results already selects the
richest per-node CSV (_select_best_multiple_results_csv) and writes the
authoritative perf/status record, so an empty local perf is not authoritative and
must not hard-fail the node (exit 3 -> whole job fails). Add a skip_perf_collection
guard to the status determination (and the except fallback). Real failures remain
caught first by the error-pattern scan; genuinely-empty multi-node runs still get a
FAILURE row from the login-node aggregation.
Copilot AI review requested due to automatic review settings June 3, 2026 15:58
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds missing multi-node coordination and robustness improvements across the SLURM slurm_multi execution path, ensuring workers can reliably obtain locally-built images, synchronize before docker run, and aggregate performance results correctly across nodes.

Changes:

  • Restore additional runtime/build context from manifests (mounts/build args/env vars) while preserving runtime-detected values.
  • Add multi-node support utilities in the container runner (shared image tar cache, TCP barrier, duplicate mount avoidance, and failure diagnostics).
  • Improve SLURM result aggregation by selecting the “best” per-node multiple_results CSV and harden GPU discovery JSON parsing.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
src/madengine/orchestration/run_orchestrator.py Adjusts local-image manifest creation and improves manifest context restore behavior.
src/madengine/execution/container_runner.py Adds multi-node local-image coordination helpers, mount de-dup, diagnostics, and verdict behavior for deferred perf collection.
src/madengine/deployment/slurm.py Picks the richest per-node multiple_results CSV during login-node aggregation.
src/madengine/core/context.py Makes amd-smi JSON parsing tolerant of trailing non-JSON output.
src/madengine/core/constants.py Avoids side effects (MODEL_DIR setup / verbose config prints) for lightweight CLI invocations like --help/--version.
Comments suppressed due to low confidence (1)

src/madengine/core/context.py:765

  • When ROCm < 6.4.1 and KFD topology is not accessible, kfd_renderDs can remain None (the code explicitly tolerates KFD permission errors). The subsequent len(kfd_renderDs) will then raise TypeError, losing the clearer error and making the fallback path brittle. Add an explicit None guard before using kfd_renderDs in the ROCm<6.4.1 mapping branch.
                if len(kfd_unique_ids) != len(kfd_renderDs):
                    raise RuntimeError(
                        f"Mismatch between unique_ids count ({len(kfd_unique_ids)}) "
                        f"and renderDs count ({len(kfd_renderDs)})"
                    )

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 392 to 400
# Validate that the image exists locally or can be pulled
try:
self.console.sh(f"docker image inspect {shlex.quote(image_name)} > /dev/null 2>&1")
self.console.sh(f"docker image inspect {image_name} > /dev/null 2>&1")
self.rich_console.print(f"[green]✓ Image {image_name} found locally[/green]")
except (subprocess.CalledProcessError, RuntimeError) as e:
self.rich_console.print(f"[yellow]⚠️ Image {image_name} not found locally, attempting to pull...[/yellow]")
try:
self.console.sh(f"docker pull {shlex.quote(image_name)}")
self.console.sh(f"docker pull {image_name}")
self.rich_console.print(f"[green]✓ Successfully pulled {image_name}[/green]")
…opilot review)

Resolves Copilot review (High): in _create_manifest_from_local_image, image_name
was interpolated into "docker image inspect" / "docker pull" commands passed to
Console.sh (shell=True) without quoting -- a command-injection risk that also
breaks valid image names containing shell-special characters. Escape it once via
shlex.quote and use the quoted form in both shell calls (display strings keep the
raw name). Matches the shlex.quote escaping already used on the Stage B/C
container_runner shell paths.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants