Aicomnet dev slurm multi#142
Open
mkuznet1 wants to merge 6 commits into
Open
Conversation
…estore, secrets precedence, lightweight-CLI isolation
Stage A of porting the aicomnet multi-node local-image work onto the
slurm_multi pipeline (develop). Preparation-phase changes only:
- core/context: parse amd-smi JSON via JSONDecoder.raw_decode so trailing
non-JSON text (deprecation banners, double-emitted blocks under
concurrent slurm tasks) no longer aborts run-phase init.
- orchestration/run_orchestrator: restore host-level runtime context
(docker_mounts, docker_build_arg, docker_gpus, gpu_vendor, guest_os)
from the manifest on execution, keeping runtime-detected values as
priority; docker_env_vars from the manifest no longer overwrite runtime
values already populated by Context (e.g. MAD_SECRETS_* read from
os.environ), so unexpanded "${VAR}" manifest placeholders cannot clobber
the resolved secret.
- core/constants: --version/--help lightweight CLI invocations skip
MODEL_DIR setup and keep output clean, avoiding global-state pollution
of subsequent runs in the same shell.
…tribution Stage B of porting the aicomnet multi-node local-image work onto the slurm_multi pipeline. Adds local-image preparation to the per-node run path (run_models_from_manifest, which slurm_multi invokes on every node in docker target mode): - _ensure_local_image_available: primary node (rank 0) ensures the local image is present (building it from the manifest dockerfile if missing), and when MAD_DOCKER_BUILDS points at a shared dir, docker-saves it into a shared tar; worker nodes load that tar instead of pulling. Local images are not in any registry, so upstreams registry-only pull path cannot reach them on workers. - Atomic tar write (sibling .tmp + os.replace) so peers never load a half-written tar. - _build_local_image_from_manifest / _get_build_args / _build_or_pull_local_image with shlex.quote shell-escaping for all path/image/build-arg components (Console.sh runs with shell=True). - Strict NODE_RANK/RANK and NNODES/WORLD_SIZE parsing (malformed values raise instead of silently defaulting to 0/1 and diverging). - _sync_after_local_image_ready: Stage B uses a shared-filesystem readiness wait (workers without the image poll for the primary tar); Stage C replaces this with a deterministic TCP barrier. Replaces the previous verify-then-pull block, which could only pull registry images and left local multi-node images unreachable on worker nodes.
…e-duplication Stage C of porting the aicomnet multi-node local-image work onto the slurm_multi pipeline. Replaces the Stage B shared-FS readiness poll with a robust TCP rendezvous and adds docker mount de-duplication. - _tcp_image_ready_barrier / _recv_line: NODE_RANK=0 listens on a port range derived from MASTER_PORT+SLURM_JOB_ID; workers send "READY <token> <rank>" and wait for "GO". No shared-FS visibility required (avoids NFS/Weka metadata lag). Hardening: MAD_BARRIER_TOKEN secret (defaults JOB<id>); bind to MASTER_ADDR resolved IP, skipping loopback (127/8, ::1) so an /etc/hosts loopback mapping cannot leave the listener on lo while workers connect over the routable IP; line-based recv (partial-read safe); worker-rank normalization so NODE_RANK="01" matches the int ACK; strict rank / nnodes parsing; on master timeout the ACKed worker sockets are closed (EOF, not half-open); accepted-peer list logged for deadlock diagnosis. - _sync_after_local_image_ready now drives the TCP barrier instead of polling the shared filesystem. - get_mount_arg de-duplicates against -v/--volume targets already present in additional_docker_run_options (via _extract_additional_mount_targets) so docker does not reject "Duplicate mount point". SLURM->container env pass-through and --version/--help MODEL_DIR isolation were already covered (develop slurm_env_vars list; Stage A constants).
… + on-failure container diagnostics Stage D of porting the aicomnet multi-node local-image work onto the slurm_multi pipeline. Covers result aggregation and failure triage across nodes: - slurm.py collect_results: instead of taking the first node_<rank> directory that happens to hold the multiple_results CSV, gather every per-node candidate (plus the job-dir copy) and pass them to _select_best_multiple_results_csv, which ranks by the count of non-empty "performance" rows and breaks ties on total row count. In multi-node runs the master CSV is frequently present but empty while a worker holds the real throughput numbers, so first-match could aggregate empty data; this defers empty-perf handling to the richest file. Header/row keys are stripped so a leading-space CSV header still matches. - container_runner.py run path: wrap the model-script exec in a try/except that, on RuntimeError, parses the failed container id from the error and snapshots bounded diagnostics inside it (process table, listening ports, and tails of /run_logs and the model dir logs) via Console.sh so they land in the run log next to the failure. All diagnostic calls are time-bounded and non-fatal; the original error is re-raised unchanged.
…LURM in-job runs In multi-node SLURM runs the per-node in-job status check marked the job FAILED when the designated collector node (MAD_COLLECT_METRICS=true) had an empty multiple_results CSV, even though Primus emits throughput only on the last global rank, which frequently lives on a different node than the collector. With skip_perf_collection set, the login-node collect_results already selects the richest per-node CSV (_select_best_multiple_results_csv) and writes the authoritative perf/status record, so an empty local perf is not authoritative and must not hard-fail the node (exit 3 -> whole job fails). Add a skip_perf_collection guard to the status determination (and the except fallback). Real failures remain caught first by the error-pattern scan; genuinely-empty multi-node runs still get a FAILURE row from the login-node aggregation.
There was a problem hiding this comment.
Pull request overview
This PR adds missing multi-node coordination and robustness improvements across the SLURM slurm_multi execution path, ensuring workers can reliably obtain locally-built images, synchronize before docker run, and aggregate performance results correctly across nodes.
Changes:
- Restore additional runtime/build context from manifests (mounts/build args/env vars) while preserving runtime-detected values.
- Add multi-node support utilities in the container runner (shared image tar cache, TCP barrier, duplicate mount avoidance, and failure diagnostics).
- Improve SLURM result aggregation by selecting the “best” per-node
multiple_resultsCSV and harden GPU discovery JSON parsing.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| src/madengine/orchestration/run_orchestrator.py | Adjusts local-image manifest creation and improves manifest context restore behavior. |
| src/madengine/execution/container_runner.py | Adds multi-node local-image coordination helpers, mount de-dup, diagnostics, and verdict behavior for deferred perf collection. |
| src/madengine/deployment/slurm.py | Picks the richest per-node multiple_results CSV during login-node aggregation. |
| src/madengine/core/context.py | Makes amd-smi JSON parsing tolerant of trailing non-JSON output. |
| src/madengine/core/constants.py | Avoids side effects (MODEL_DIR setup / verbose config prints) for lightweight CLI invocations like --help/--version. |
Comments suppressed due to low confidence (1)
src/madengine/core/context.py:765
- When ROCm < 6.4.1 and KFD topology is not accessible,
kfd_renderDscan remainNone(the code explicitly tolerates KFD permission errors). The subsequentlen(kfd_renderDs)will then raiseTypeError, losing the clearer error and making the fallback path brittle. Add an explicitNoneguard before usingkfd_renderDsin the ROCm<6.4.1 mapping branch.
if len(kfd_unique_ids) != len(kfd_renderDs):
raise RuntimeError(
f"Mismatch between unique_ids count ({len(kfd_unique_ids)}) "
f"and renderDs count ({len(kfd_renderDs)})"
)
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Comment on lines
392
to
400
| # Validate that the image exists locally or can be pulled | ||
| try: | ||
| self.console.sh(f"docker image inspect {shlex.quote(image_name)} > /dev/null 2>&1") | ||
| self.console.sh(f"docker image inspect {image_name} > /dev/null 2>&1") | ||
| self.rich_console.print(f"[green]✓ Image {image_name} found locally[/green]") | ||
| except (subprocess.CalledProcessError, RuntimeError) as e: | ||
| self.rich_console.print(f"[yellow]⚠️ Image {image_name} not found locally, attempting to pull...[/yellow]") | ||
| try: | ||
| self.console.sh(f"docker pull {shlex.quote(image_name)}") | ||
| self.console.sh(f"docker pull {image_name}") | ||
| self.rich_console.print(f"[green]✓ Successfully pulled {image_name}[/green]") |
…opilot review) Resolves Copilot review (High): in _create_manifest_from_local_image, image_name was interpolated into "docker image inspect" / "docker pull" commands passed to Console.sh (shell=True) without quoting -- a command-injection risk that also breaks valid image names containing shell-special characters. Escape it once via shlex.quote and use the quoted form in both shell calls (display strings keep the raw name). Matches the shlex.quote escaping already used on the Stage B/C container_runner shell paths.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Closes the multi-node coordination gap in the SLURM
slurm_multipath. Theupstream refactor made madengine multi-node observable (per-rank log staging,
login-node
collect_results, partial manifest restore) but did not yetcoordinate which node owns the local image, when workers may enter
docker run, whose CSV the aggregator should trust, and which manifest fields aworker needs to reconstruct the builder's environment.
This PR ports that coordination onto
slurm_multias four staged commits(preparation → build → running → postprocessing) plus one follow-up fix.
Single-node behavior is unchanged — every new path is gated on
MAD_DOCKER_BUILDS,NNODES > 1,skip_perf_collection, manifest presence, orthe failure flow.
What's new
Multi-node coordination
MAD_DOCKER_BUILDS): rank 0docker savesthe locally-built image into a shared dir (atomic sibling-
.tmp+os.replace);workers
docker loadit instead of pulling. Local images are not in anyregistry, so upstream's registry-only pull path cannot reach them on workers.
and no tar is available (
_build_local_image_from_manifest, withshlex.quoteescaping for every path/image/build-arg component since
Console.shruns withshell=True)._tcp_image_ready_barrier): rank 0listens on a port derived from
MASTER_PORT+SLURM_JOB_ID; workers sendREADY <token> <rank>and wait forGO, so no one entersdocker runagainsta half-loaded image. No shared FS required (avoids NFS/Weka metadata lag).
Hardening:
MAD_BARRIER_TOKENsecret, bind to the routableMASTER_ADDRIP(skips loopback so an
/etc/hostsmapping can't strand the listener onlo),line-based partial-read-safe recv, worker-rank normalization, strict
rank/nnodes parsing, ACKed-socket cleanup + peer logging on master timeout.
collect_results(
_select_best_multiple_results_csv): gathers every per-node candidate andranks by count of non-empty
performancerows (tie-break on total rows), sothe aggregator uses the worker that actually holds the throughput numbers
instead of a present-but-empty master CSV. Header/row keys are stripped so a
leading-space CSV header still matches.
docker_mounts,docker_build_arg,docker_gpus,gpu_vendor,guest_os— runtime-detectedvalues keep priority.
Bug fixes on the restore / verdict paths
docker_env_varsrestore no longer overwritesos.environ-sourced values(e.g.
MAD_SECRETS_HFTOKEN) with unexpanded${VAR}literals from an oldmanifest. Fixes HF 401/403 on Primus / gated models on rerun.
skip_perf_collection):Primus emits throughput only on the last global rank, which often lives on a
different node than the designated collector (
MAD_COLLECT_METRICS=true). Theper-node in-job check used to mark the whole job FAILED (exit 3) on the
collector's empty local CSV; it now defers to the authoritative login-node
aggregation. Real failures are still caught first by the error-pattern scan,
and a genuinely-empty multi-node run still gets a FAILURE row from the login
step.
Robustness exposed by scale-out
JSONDecoder.raw_decodeso trailingnon-JSON (deprecation banners, double-emitted blocks under concurrent SLURM
tasks) no longer aborts run-phase init.
madengine --version/--helpskipMODEL_DIRsetup, avoiding global-state pollution of later runs in the sameshell.
get_mount_argde-duplicates against-v/--volumetargets already present inadditional_docker_run_optionssoDocker doesn't reject "Duplicate mount point".
docker execRuntimeError, a bounded,non-fatal snapshot (process table, listening sockets via
ss/netstat/lsof,tails of
/run_logsand the model-dir logs) is captured inside the failedcontainer and lands in the run log next to the failure; the original error is
re-raised unchanged.
Validation
save/load,TCP barrier release with the image removed on the worker, and throughput parity
vs the single-image baseline.
rank correctly reports ~10.1k tok/s/GPU, ~522 TFLOPS/GPU, and the run is now
marked SUCCESS (previously a false FAILURE on the collector's empty CSV).
pytest tests/unit/test_slurm_multi.py— 20 passed. Integration tests cover theshared-tar paths (primary-saves, existing-tar-is-loaded, worker-waits-for-primary).
Test plan
build-on-primary → save → worker load → TCP barrier → SUCCESS verdict).
MAD_SECRETS_HFTOKENis notclobbered by manifest
${VAR}placeholders (no 401/403).collect_resultsselects the worker CSV and writes the aggregatedperf record.