Updates to slurm launcher by raviguptaamd · Pull Request #86 · ROCm/madengine

raviguptaamd · 2026-03-12T20:19:59Z

Summary

Adds a slurm_multi deployment launcher for self-managed multi-node disaggregated workloads (vLLM-disagg, SGLang-disagg, large-EP, kvcache-transfer) and the orchestration / build / run / CLI hooks that drive it from madengine's manifest pipeline. Companion model cards and submit scripts that exercise this launcher live in ROCm/MAD-private PR #186.

Scope (75 files vs `develop`)

This PR is now focused strictly on slurm_multi and its supporting refactor. The Kubernetes deployment work originally bundled here has been moved to #118 (independent PR against develop) and will rebase after this PR lands.

Layer	Highlights
Deployment	`deployment/{slurm.py, slurm_node_selector.py, base.py, common.py, config_loader.py, factory.py}`, `templates/slurm/job.sh.j2`
Orchestration / Build / Run	`orchestration/{build_orchestrator, run_orchestrator, image_filtering}.py` — manifest-driven `--build-on-compute` / `--use-image`, deep-merged `deployment_config`, registry validation
Execution	`execution/{container_runner, container_runner_helpers, docker_builder}.py`
CLI	`cli/{app, commands/run, commands/build, constants, utils, validators}.py`
Core / Utils / Reporting / DB / Scripts	shared infra used by the new path: `core/`, `utils/`, `reporting/`, `database/mongodb.py`, `scripts/common/`
Examples / Docs / Tests	`examples/slurm-configs/`, `docs/launchers.md`, `docs/deployment.md`, `docs/configuration.md`, `tests/{unit,integration,e2e,fixtures}/`

Net diff: +2,877 / -2,179 across 75 files (down from +3,014 / -2,624 across 87 files at the squash-rebase head, after Round-2 / Round-3 / Round-4 review fixes restored 12 files to origin/develop byte-for-byte and tightened 4 more in-scope code paths).

Commit shape

Logical commits on top of develop:

deployment: add slurm_multi launcher, node selector, SBATCH template (8815e24) — SLURM deployment core; absorbs the slurm-side Copilot Round-1 review fixes.
orchestration/execution/cli: wire slurm_multi end-to-end (6f2e11b) — manifest deep-merge, registry guard, per-registry auth hints, prebuilt-image multi-model fix, subprocess shell trust comment, and the rest of the orchestration/CLI plumbing.
examples/docs/tests: slurm_multi configs, README, and test coverage (50e8895) — operator-facing docs, examples renamed sglang-disagg-* → slurm-multi-* (one launcher drives multiple flows), and test additions.
PR #86 follow-up: ErrorContext kwargs, model-card image fallback, perf.csv collection (5ec55ff) — fixes surfaced during the live SGLang-disagg smoke run (job 22607 / 22615 on Distributed_Inference_CI).
slurm: namespace completion marker by SLURM_JOB_ID (710b51a) — second-run safety: per-deploy completion marker now job-id namespaced.
Revert out-of-scope changes flagged by Copilot Round-2 review (9109f52) — 11 files reverted to origin/develop byte-for-byte; surgical reverts in cli/utils.py and execution/docker_builder.py keep the slurm_multi-relevant bits (registry_image → DOCKER_IMAGE_NAME propagation for parallel pull, build_gpu_arch_display helper, Workload column rename).
Address Copilot Round-2 slurm_multi-related findings (63a2eae) — 6 real fixes: module-level recursive _deep_merge used on the manifest path; manifest.context.docker_env_vars restored into runtime context; RunOrchestrator._load_credentials delegated to madengine.core.auth.load_credentials (object-validates credential.json); configure_multi_node_profiling only filters rocprof when rocprofv3 is missing (keeps rccl_trace/rocblas_trace); perf-line regex widened (scientific notation + arbitrary metric tokens); "megatron" and "primus" added to VALID_LAUNCHERS.
docs: fix broken Next-Steps links and example paths (0156d33) — distributed-launchers.md → launchers.md; user-guide.md → usage.md; how-to-quick-start.md → cli-reference.md; dropped K8s slurm-multi-*.json paths that don't exist.
Address Copilot Round-3 small findings (2200cbd) — examples/profiling-configs/README.md multinode → multi_node; drop unused import pytest from tests/unit/test_reporting.py; revert tests/unit/test_rocm_path.py to origin/develop (matches the Context revert in 9109f52).
build_orchestrator: lowercase --build-on-compute local image name (ae1996c) — surfaced live during --build-on-compute --registry docker.io/rocm/pytorch-private --tags pyt_kvcache_transfer_bench: Path(dockerfile_path).name for scripts/.../Dockerfile (no .ubuntu.amd.Dockerfile suffix to strip) yielded Dockerfile, producing ci-..._Dockerfile and docker build -t rejection. .lower() the assembled local image name; verified end-to-end (Slurm 22936, image pushed to docker.io/rocm/pytorch-private:pyt_kvcache_transfer_bench).
build_orchestrator: normalize and validate --registry shape (auto-prepend docker.io/) (98f1ac5) — surfaced live with --registry rocm/pytorch-private: registry.split("/")[0] was treated as the docker login host (docker login rocm -> DNS NXDOMAIN). _execute_build_on_compute now normalizes shorthand <ns>/<repo> to docker.io/<ns>/<repo> with a transparent [dim]auto-prefixing 'docker.io/'[/dim] notice, and rejects malformed input (whitespace, @, empty after trim) up front with usage hints for Dockerhub / GHCR / Quay / NGC / self-hosted / localhost. Verified end-to-end across all four input shapes (sbatch 22954/22956 succeeded, @bad/foo rejected without sbatch).
Address Copilot Round-4 findings: 5 P0 + 3 hardening (3fc7587) — 8 fixes covering all in-scope Round-4 items (the other 14 R4 findings are pre-existing flake8/style nits in origin/develop-identical files unchanged by this PR):
- run_orchestrator.py: drop now-unused Context(rocm_path=...) kwarg from _init_runtime_context (line 138) and _create_manifest_from_local_image (line 456) call sites; drop from madengine.core.constants import get_rocm_path and unused handle_error imports. Latent TypeError on madengine run against local target after the Cat B core/context.py revert.
- container_runner.py: widen perf_pattern to mirror the deployment/base.py:_parse_node_log_for_perf regex (R2-Distributed solution - new madengine CLI #11) so tokens/sec, tok/s, samples_per_second, 1.23E+4 no longer get silently dropped.
- container_runner.py: login_to_registry delegates to madengine.core.auth.login_to_registry instead of f"echo '{password}' | docker login ..." inline (password no longer in ps//proc-visible argv even with --password-stdin); same delegation DockerBuilder.login_to_registry already uses.
- container_runner.py: get_env_arg no longer prints assembled --env KEY='value' to stdout (HF tokens, registry passwords, MAD_SECRETS_* etc. could leak); print only the count.
- container_runner.py: self-managed launcher (_run_self_managed) now shlex.quotes script_path and shlex.split + per-arg shlex.quotes model_args, so embedded shell metacharacters ($(), backticks, ;, ...) become literal arguments to the script rather than host-shell directives.
- docs/deployment.md: second occurrence of [Distributed Launchers Guide](distributed-launchers.md) (line 145) → (launchers.md) (the 0156d33 commit fixed only line 543).

Copilot review status

All four rounds resolved or replied-to inline:

Round 1 (9 comments) — addressed in d32b06b/5ec55ff; replies posted under each thread.
Round 2 (24 comments) — fully addressed across 9109f52 / 63a2eae / 0156d33 and replied-to inline:
- slurm.py: drop duplicate Optional import; honour slurm_config.exclusive.
- slurm_node_selector.py: cleanup_node merged srun; preserves --job-name and --reservation.
- build_orchestrator.py: built_models keyed by model_name; warn on divergent configs; early registry None guard; per-registry auth hints.
- run_orchestrator.py: real recursive _deep_merge for deployment_config ↔ additional_context; manifest's context.docker_env_vars restored; _load_credentials delegates to central auth helper.
- container_runner.py: documented shell=True trust boundary on inner subprocess.run.
- deployment/common.py: rocprofv3-fallback only filters rocprof; VALID_LAUNCHERS covers "megatron"/"primus".
- deployment/base.py: perf-line regex accepts scientific notation + arbitrary metric tokens.
- 15 outside-scope changes (core/{context,docker,dataprovider,timeout}.py, cli/app.py, utils/{rocm_tool_manager,discover_models,config_parser}.py, orchestration/image_filtering.py, reporting/update_perf_csv.py, README.md) reverted to origin/develop.
- 3 broken doc links / example paths fixed (docs/{installation,deployment,launchers}.md).
Round 3 (13 comments) — 4 fixed in 2200cbd, 7 stale-snapshot replies (already addressed in earlier rounds), 2 deferred to follow-up issues (pre-existing concerns in origin/develop: container_runner.get_env_arg env-arg shell-escape hardening; Console.sh/subprocess.run(timeout=…) translation of 0 → None).
Round 4 (24 comments) — 8 fixed in 3fc7587 (5 P0 + 3 hardening, see commit 12 above), 14 are pre-existing flake8/style nits in files byte-identical to origin/develop (utils/{ops,log_formatting}.py, scripts/common/{tools,pre_scripts}/**, tests/fixtures/utils.py, tests/integration/test_gpu_management.py, tests/unit/test_reporting_superset.py, core/errors.py, database/mongodb.py, deployment/base.py:645) — not introduced by this PR; deferred to a separate cleanup PR.

Sequencing

Updates to slurm launcher #86 (this PR) → develop.
Kubernetes deployment (split from PR #86) #118 (Kubernetes) rebases onto develop and merges.
After both land, follow-up cherry-picks the combined work into main per repo policy.

Test plan

Copilot

Pull request overview

This PR updates Madengine’s SLURM (and related) workflows to better support disaggregated / self-managed multi-node inference by introducing a slurm_multi launcher path, plus new build options for using pre-built images or building on compute nodes.

Changes:

Add --use-image and --build-on-compute build workflows and propagate the resulting image name into manifests/env for multi-node pulls.
Add slurm_multi launcher handling across SLURM/K8s deployments, including self-managed (host-executed) script paths and SLURM reservation support.
Improve manifest/runtime config merging behavior for dict-valued deployment keys.

Reviewed changes

Copilot reviewed 22 out of 22 changed files in this pull request and generated 9 comments.

Show a summary per file

File	Description
src/madengine/orchestration/run_orchestrator.py	Shallow-merges dict-valued deployment config keys into runtime additional context.
src/madengine/orchestration/build_orchestrator.py	Implements `--use-image` and `--build-on-compute` build flows and writes manifests accordingly.
src/madengine/execution/docker_builder.py	Writes `DOCKER_IMAGE_NAME` into `built_models.env_vars` when a registry image exists (supports parallel pulls).
src/madengine/execution/container_runner.py	Adds “self-managed” launcher path that runs scripts directly on host instead of Madengine-managed Docker.
src/madengine/deployment/slurm_node_selector.py	Adds reservation support for `srun` health checks / cleanup.
src/madengine/deployment/slurm.py	Adds `slurm_multi` wrapper script generation, reservation support, and existing-allocation detection (salloc).
src/madengine/deployment/kubernetes.py	Treats `slurm_multi` as a disaggregated launcher option in K8s template context and adds launcher command generators.
src/madengine/deployment/common.py	Updates valid launcher list and normalizes `slurm-multi` → `slurm_multi`.
src/madengine/deployment/base.py	Adds `skip_monitoring` flag to support synchronous deploy flows (e.g., within salloc).
src/madengine/cli/validators.py	Extends validator signature to accept `use_image` (currently unused).
src/madengine/cli/utils.py	Adjusts display/extraction formatting (e.g., model name truncation and column header rename).
src/madengine/cli/commands/build.py	Adds CLI flags `--use-image` / `--build-on-compute` and enforces mutual exclusivity/requirements.
examples/slurm-configs/minimal/slurm-multi-minimal.json	Updates launcher value to `slurm_multi`.
examples/slurm-configs/basic/slurm-multi-multi-node.json	Updates launcher value to `slurm_multi`.
examples/slurm-configs/basic/slurm-multi-custom-split.json	Updates launcher value to `slurm_multi`.
examples/k8s-configs/minimal/slurm-multi-minimal.json	Updates launcher value to `slurm_multi`.
examples/k8s-configs/basic/slurm-multi-multi-node-basic.json	Updates launcher value to `slurm_multi`.
examples/k8s-configs/basic/slurm-multi-custom-split.json	Updates launcher value to `slurm_multi`.
docs/usage.md	Documents `--use-image`, `--build-on-compute`, and `slurm_multi` usage.
docs/launchers.md	Renames launcher references and documents `slurm_multi` registry requirement.
docs/deployment.md	Documents allocation detection and self-managed `slurm_multi` execution model.
docs/cli-reference.md	Adds CLI reference entries and usage examples for new build options and `slurm_multi`.

Comments suppressed due to low confidence (1)

src/madengine/cli/validators.py:307

validate_additional_context() now accepts use_image, and the docstring says it “skips required field validation”, but the parameter is not used and validation runs unconditionally via finalize_additional_context_dict(context). Either remove the unused parameter (and update callers), or implement the intended conditional validation behavior.

def validate_additional_context(
    additional_context: str,
    additional_context_file: Optional[str] = None,
    use_image: Optional[str] = None,
) -> Dict[str, Any]:
    """
    Validate and parse additional context.

    Args:
        additional_context: JSON string containing additional context
        additional_context_file: Optional file containing additional context
        use_image: Optional pre-built image to use (skips required field validation)

    Returns:
        Dict containing parsed additional context

    Raises:
        typer.Exit: If validation fails
    """
    context, _ = merge_additional_context_from_sources(
        additional_context, additional_context_file
    )
    finalize_additional_context_dict(context)
    return context

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-16T08:11:15Z

+                # Merge model's distributed config
+                model_distributed = models[0].get("distributed", {})
+                if model_distributed:
+                    if "distributed" not in saved_manifest["deployment_config"]:
+                        saved_manifest["deployment_config"]["distributed"] = {}
+
+                    # Copy launcher and other critical fields from model config
+                    for key in ["launcher", "nnodes", "nproc_per_node", "backend", "port", "sglang_disagg", "vllm_disagg"]:
+                        if key in model_distributed and key not in saved_manifest["deployment_config"]["distributed"]:
+                            saved_manifest["deployment_config"]["distributed"][key] = model_distributed[key]
+
+                # Merge model's slurm config into deployment_config.slurm
+                # This enables run phase to auto-detect SLURM deployment without --additional-context
+                model_slurm = models[0].get("slurm", {})
+                if model_slurm:
+                    if "slurm" not in saved_manifest["deployment_config"]:
+                        saved_manifest["deployment_config"]["slurm"] = {}
+
+                    # Copy slurm settings from model config (model card fills in
+                    # values not explicitly set by --additional-context).
+                    # Use _original_user_slurm_keys (captured before ConfigLoader
+                    # applies defaults) so model card values override defaults
+                    # but user's explicit CLI values still win.
+                    for key in ["partition", "nodes", "gpus_per_node", "time", "exclusive", "reservation", "output_dir", "nodelist"]:
+                        if key in model_slurm and key not in self._original_user_slurm_keys:
+                            saved_manifest["deployment_config"]["slurm"][key] = model_slurm[key]


_execute_with_prebuilt_image() populates deployment_config.distributed and deployment_config.slurm from models[0] only. If the tag selection yields multiple models with differing distributed/slurm settings, the manifest will silently contain config for just the first model. Either validate that only one model is allowed in this mode, or merge per-model configs in a deterministic way (and/or store them per built_model entry).

Addressed in d32b06b.

We still let models[0] populate the manifest's deployment_config.distributed and deployment_config.slurm (they need a single value), but the prebuilt-image flow now warns when discovered models have divergent distributed / slurm configs so operators are not silently surprised.

Copilot · 2026-04-16T08:11:16Z

+    def _execute_build_on_compute(
+        self,
+        registry: Optional[str] = None,
+        clean_cache: bool = False,
+        manifest_output: str = "build_manifest.json",
+        batch_build_metadata: Optional[Dict] = None,
+    ) -> str:
+        """
+        Execute Docker build on a SLURM compute node and push to registry.
+
+        Build workflow:
+        1. Build on 1 compute node only
+        2. Push image to registry
+        3. Store registry image name in manifest
+        4. Run phase will pull image in parallel on all nodes
+
+        Args:
+            registry: Registry to push images to (REQUIRED)
+            clean_cache: Whether to use --no-cache for Docker builds
+            manifest_output: Output file for build manifest
+            batch_build_metadata: Optional batch build metadata
+
+        Returns:
+            Path to generated build_manifest.json
+        """
+        import subprocess
+        import os
+        import glob
+
+        self.rich_console.print(f"\n[dim]{'=' * 60}[/dim]")
+        self.rich_console.print("[bold blue]🔨 BUILD PHASE (Compute Node Mode)[/bold blue]")
+        self.rich_console.print("[cyan]Building on 1 compute node, pushing to registry...[/cyan]")
+        self.rich_console.print(f"[dim]{'=' * 60}[/dim]\n")
+
+        # Discover models first to get SLURM config from model card
+        self.rich_console.print("[bold cyan]🔍 Discovering models...[/bold cyan]")
+        discover_models = DiscoverModels(args=self.args)
+        models = discover_models.run()
+
+        if not models:
+            raise DiscoveryError(
+                "No models discovered for build-on-compute",
+                context=create_error_context(
+                    operation="build_on_compute",
+                    component="BuildOrchestrator",
+                ),
+                suggestions=[
+                    "Check if models.json exists",
+                    "Verify --tags parameter is correct",
+                ],
+            )
+
+        model = models[0]
+        model_name = model.get("name", "unknown")
+        self.rich_console.print(f"[green]✓ Found model: {model_name}[/green]\n")
+
+        # Merge SLURM config: model card (base) + additional-context (override)
+        model_slurm_config = model.get("slurm", {})
+        context_slurm_config = self.additional_context.get("slurm", {})
+
+        # Start with model card config, then override with command-line context
+        slurm_config = {**model_slurm_config, **context_slurm_config}
+
+        self.rich_console.print("[bold cyan]📋 SLURM Configuration (merged):[/bold cyan]")
+        if model_slurm_config:
+            self.rich_console.print(f"  [dim]From model card:[/dim] {list(model_slurm_config.keys())}")
+        if context_slurm_config:
+            self.rich_console.print(f"  [dim]From --additional-context (overrides):[/dim] {list(context_slurm_config.keys())}")
+
+        # Validate required fields
+        partition = slurm_config.get("partition")
+        if not partition:
+            raise ConfigurationError(
+                "Missing required SLURM field: partition",
+                context=create_error_context(
+                    operation="build_on_compute",
+                    component="BuildOrchestrator",
+                ),
+                suggestions=[
+                    'Add "partition" to model card\'s slurm section',
+                    'Or specify via --additional-context \'{"slurm": {"partition": "gpu"}}\'',
+                ],
+            )
+
+        reservation = slurm_config.get("reservation", "")
+        time_limit = slurm_config.get("time", "02:00:00")
+
+        self.rich_console.print(f"  Partition: {partition}")
+        self.rich_console.print(f"  Time limit: {time_limit}")
+        if reservation:
+            self.rich_console.print(f"  Reservation: {reservation}")
+        self.rich_console.print("")
+
+        # Validate registry credentials
+        self.rich_console.print("[bold cyan]🔐 Registry Configuration:[/bold cyan]")
+        self.rich_console.print(f"  Registry: {registry}")
+
+        # Check for credentials - either from environment or credential.json
+        dockerhub_user = os.environ.get("MAD_DOCKERHUB_USER", "")
+        dockerhub_password = os.environ.get("MAD_DOCKERHUB_PASSWORD", "")
+
+        # Try to load from credential.json if env vars not set
+        credential_file = Path("credential.json")
+        if not dockerhub_user and credential_file.exists():
+            try:
+                with open(credential_file) as f:
+                    creds = json.load(f)
+                    dockerhub_creds = creds.get("dockerhub", {})
+                    dockerhub_user = dockerhub_creds.get("username", "")
+                    dockerhub_password = dockerhub_creds.get("password", "")
+                    if dockerhub_user:
+                        self.rich_console.print(f"  Credentials: Found in credential.json")
+            except (json.JSONDecodeError, IOError) as e:
+                self.rich_console.print(f"  [yellow]Warning: Could not read credential.json: {e}[/yellow]")
+        elif dockerhub_user:
+            self.rich_console.print(f"  Credentials: Found in environment (MAD_DOCKERHUB_USER)")
+
+        # Determine if registry requires authentication
+        requires_auth = True
+        public_registries = ["docker.io", "ghcr.io", "gcr.io", "quay.io", "nvcr.io"]
+        registry_lower = registry.lower() if registry else ""
+
+        # For docker.io pushes, authentication is always required
+        if any(pub_reg in registry_lower for pub_reg in public_registries):
+            if not dockerhub_user or not dockerhub_password:
+                raise ConfigurationError(


_execute_build_on_compute() treats registry as required (calls registry.lower(), registry.replace(...), etc.) but the function signature allows registry: Optional[str] = None and there’s no guard. If this path is invoked programmatically (or via a code path that bypasses the CLI validation), it will raise an AttributeError rather than a user-facing ConfigurationError. Add an explicit if not registry: raise ConfigurationError(...) near the top of this method.

Addressed in d32b06b.

_execute_build_on_compute now raises a ConfigurationError early when registry is None — with the suggestion to pass --registry or set it in the model card / additional-context — instead of falling into registry.lower() / .replace() / .split() with NoneType.

Copilot · 2026-04-16T08:11:16Z

+                    result = subprocess.run(
+                        cmd,
+                        shell=True,
+                        cwd=script_dir,
+                        env=env,
+                        timeout=timeout if timeout > 0 else None,
+                    )


subprocess.run(..., shell=True) is used with a command string that includes model_args (which can come from manifests / user input). This allows shell metacharacters in args to be interpreted by the shell (command injection). Build the argv list explicitly (e.g., ["bash", script_path, *shlex.split(model_args)] / ["python3", ...]) and run with shell=False.

Addressed in d32b06b.

shell=True is deliberate here because cmd embeds shell features (pipes, redirects, env-var substitution) constructed earlier in this method. Added an inline trust-boundary comment that documents:

cmd is built from validated model card / manifest fields,

user model_args are routed through shlex-quoted assembly in the caller,

raw user input must NOT be concatenated directly into cmd,

and tagged the line with # noqa: S602 so future linters don't flag the intentional usage.

Copilot · 2026-04-16T08:11:16Z

+            "#SBATCH --exclusive",
+        ]


The slurm_multi wrapper script always adds #SBATCH --exclusive regardless of the user’s slurm.exclusive setting. In the standard SLURM template, exclusivity is conditional (job.sh.j2:20-22). Consider respecting self.slurm_config.get("exclusive", True) here as well to avoid changing behavior for users who explicitly disable exclusivity.

Suggested change

"#SBATCH --exclusive",

]

]

if self.slurm_config.get("exclusive", True):

script_lines.append("#SBATCH --exclusive")

Addressed in d32b06b.

The slurm_multi wrapper now appends #SBATCH --exclusive only when self.slurm_config.get('exclusive', True) is truthy, matching the conditional pattern already used for the standard SLURM template (job.sh.j2:20-22). Default behaviour stays exclusive so existing runs are unchanged.

Copilot · 2026-04-16T08:11:17Z

 from madengine.utils.run_details import get_build_number, get_pipeline
 from madengine.utils.path_utils import scripts_base_dir_from
 import json
+from typing import Optional


There is a duplicate import of Optional (from typing import Any, Dict, List, Optional and again from typing import Optional). Drop the second import to avoid redundancy.

Suggested change

from typing import Optional

Addressed in d32b06b.

Dropped the redundant from typing import Optional on line 27; Optional is still imported via the existing from typing import Any, Dict, List, Optional at line 17.

Copilot · 2026-04-16T08:11:17Z

+            # Add each discovered model with the pre-built image
+            # Use the image name as the key (matches how madengine build does it)
+            for model in models:
+                model_name = model.get("name", "unknown")
+                model_distributed = model.get("distributed", {})
+
+                # Merge DOCKER_IMAGE_NAME into env_vars for parallel pull in run phase
+                model_env_vars = model.get("env_vars", {}).copy()
+                model_env_vars["DOCKER_IMAGE_NAME"] = use_image
+
+                # Use image name as key so slurm.py can find docker_image
+                manifest["built_models"][use_image] = {
+                    "name": model_name,
+                    "image": use_image,
+                    "docker_image": use_image,
+                    "dockerfile": model.get("dockerfile", ""),
+                    "scripts": model.get("scripts", ""),
+                    "data": model.get("data", ""),
+                    "n_gpus": model.get("n_gpus", "8"),
+                    "owner": model.get("owner", ""),
+                    "training_precision": model.get("training_precision", ""),
+                    "multiple_results": model.get("multiple_results", ""),
+                    "tags": model.get("tags", []),
+                    "timeout": model.get("timeout", -1),
+                    "args": model.get("args", ""),
+                    "slurm": model.get("slurm", {}),
+                    "distributed": model_distributed,
+                    "env_vars": model_env_vars,
+                    "prebuilt": True,
+                }


In pre-built image mode, the for model in models: loop always writes to manifest["built_models"][use_image], so each iteration overwrites the previous one. If multiple models are discovered for the selected tags, only the last model will be present in built_models even though summary.successful_builds lists all of them. Use a per-model unique key in built_models (and keep it consistent with built_images lookups), while setting docker_image to the shared use_image value.

Addressed in d32b06b.

The loop now keys manifest['built_models'] by model_name instead of use_image, so multiple models that share the same pre-built image are all preserved in the manifest and the downstream consumers can iterate over them.

Copilot · 2026-04-16T08:11:17Z

+        # Determine if registry requires authentication
+        requires_auth = True
+        public_registries = ["docker.io", "ghcr.io", "gcr.io", "quay.io", "nvcr.io"]
+        registry_lower = registry.lower() if registry else ""
+
+        # For docker.io pushes, authentication is always required
+        if any(pub_reg in registry_lower for pub_reg in public_registries):
+            if not dockerhub_user or not dockerhub_password:
+                raise ConfigurationError(
+                    f"Registry credentials required for pushing to {registry}",
+                    context=create_error_context(
+                        operation="build_on_compute",
+                        component="BuildOrchestrator",
+                        registry=registry,
+                    ),
+                    suggestions=[
+                        "Set environment variables: MAD_DOCKERHUB_USER and MAD_DOCKERHUB_PASSWORD",
+                        'Or create credential.json: {"dockerhub": {"username": "...", "password": "..."}}',
+                        "For Docker Hub, use a Personal Access Token (PAT) as password",
+                        f"Example: export MAD_DOCKERHUB_USER=myuser",
+                        f"Example: export MAD_DOCKERHUB_PASSWORD=dckr_pat_xxxxx",
+                    ],
+                )


The registry auth/error messaging here is Docker Hub–specific (MAD_DOCKERHUB_USER/PASSWORD, PAT guidance) but the logic also classifies ghcr.io, gcr.io, quay.io, nvcr.io as registries that “require credentials”. This will produce misleading guidance for non-DockerHub registries (and may block valid flows where different credentials are required). Consider either limiting this branch to Docker Hub only, or generalizing the env var names / suggestions based on registry_host.

Addressed in d32b06b.

The missing-credentials error now selects per-registry suggestions:

docker.io — Docker Hub PAT / MAD_DOCKERHUB_*

ghcr.io — GitHub PAT with read:packages / write:packages

gcr.io — service-account JSON key as password (MAD_DOCKERHUB_USER=_json_key)

quay.io — robot account / encrypted password

nvcr.io — $oauthtoken username + NGC API key

Falls back to the Docker Hub guidance for unknown registries. The credential env vars themselves are still the existing MAD_DOCKERHUB_USER / MAD_DOCKERHUB_PASSWORD to keep this change a docs-only refinement of the error path; introducing a generic MAD_REGISTRY_USER/PASSWORD pair is left for a follow-up.

Copilot · 2026-04-16T08:11:17Z

        srun_cmd.extend(["bash", "-c", cleanup_script])

        try:
+            srun_cmd = [
+                "srun",
+                f"--nodelist={node}",
+                "--ntasks=1",
+                "--time=00:01:00",
+                "--overlap",
+                "--quiet",
+            ]
+            if self.reservation:
+                srun_cmd.append(f"--reservation={self.reservation}")
+            srun_cmd.extend(["bash", "-c", cleanup_script])
+


cleanup_node() builds srun_cmd (including job_name) before the try:, but then immediately reassigns srun_cmd inside the try: block. The reassignment drops the job_name argument and duplicates logic, making the first construction dead code and preventing health-check job cleanup from working as intended. Remove the duplicate reassignment and append both --job-name (when provided) and --reservation consistently.

Suggested change

srun_cmd.extend(["bash", "-c", cleanup_script])

try:

srun_cmd = [

"srun",

f"--nodelist={node}",

"--ntasks=1",

"--time=00:01:00",

"--overlap",

"--quiet",

]

if self.reservation:

srun_cmd.append(f"--reservation={self.reservation}")

srun_cmd.extend(["bash", "-c", cleanup_script])

if self.reservation:

srun_cmd.append(f"--reservation={self.reservation}")

srun_cmd.extend(["bash", "-c", cleanup_script])

try:

Addressed in d32b06b.

The pre-try: and in-try: builds of srun_cmd are merged into a single construction that conditionally appends both --job-name=<name> (when job_name is given) and --reservation=<res> (when self.reservation is set). The second reassignment that dropped --job-name is gone.

Copilot · 2026-04-16T08:11:18Z

+            # Merge deployment_config into additional_context (for deployment layer to use).
+            # For dict-valued keys (slurm, k8s, etc.), deep-merge so manifest
+            # values fill in gaps while runtime --additional-context wins on conflicts.
            for key in ["slurm", "k8s", "kubernetes", "distributed", "vllm", "env_vars", "debug"]:
-                if key in deployment_config and key not in self.additional_context:
-                    self.additional_context[key] = deployment_config[key]
+                if key in deployment_config:
+                    if key not in self.additional_context:
+                        self.additional_context[key] = deployment_config[key]
+                    elif isinstance(deployment_config[key], dict) and isinstance(self.additional_context[key], dict):
+                        merged = dict(deployment_config[key])
+                        merged.update(self.additional_context[key])
+                        self.additional_context[key] = merged


The comment says “deep-merge” but the implementation here is a one-level dict merge (merged = dict(...); merged.update(...)). Nested dictionaries under these keys will still be overwritten rather than recursively merged. Either adjust the comment to describe a shallow merge, or implement an actual recursive deep merge for nested structures.

Addressed in d32b06b.

Replaced the shallow merged.update(...) with a real recursive _deep_merge(base, override) helper. Nested dicts under slurm / k8s / kubernetes / distributed / vllm / env_vars / debug are now merged per-leaf; runtime --additional-context still wins on conflicts, while manifest values fill in nested gaps as the comment claims.

Copilot

Copilot wasn't able to review this pull request because it exceeds the maximum number of files (300). Try reducing the number of changed files and requesting a review from Copilot again.

Address all 9 inline comments from copilot-pull-request-reviewer[bot]: ROCm#1 build_orchestrator.py — _execute_with_prebuilt_image now keys manifest['built_models'] by model_name (not use_image), so multiple models that share the same pre-built image are all preserved in the manifest. ROCm#2 build_orchestrator.py — warn when discovered models have differing distributed/slurm configs in the prebuilt-image flow; the post-merge step still uses models[0]'s config but operators are now told. ROCm#3 build_orchestrator.py — _execute_build_on_compute() now raises ConfigurationError early when registry is None instead of falling into registry.replace/.split/.lower with NoneType. ROCm#4 build_orchestrator.py — credentials-required error now emits per-registry hints (docker.io / ghcr.io / gcr.io / quay.io / nvcr.io) instead of Docker-Hub-only PAT guidance. ROCm#5 container_runner.py — document the shell=True trust boundary on the inner subprocess.run; cmd is internally constructed and any user model_args are routed through shlex-quoted assembly in the caller. ROCm#6 slurm.py — drop duplicate `from typing import Optional` import. ROCm#7 slurm.py — slurm_multi wrapper no longer hard-codes `#SBATCH --exclusive`; honours self.slurm_config.get('exclusive', True) to match the standard SLURM template behaviour. ROCm#8 slurm_node_selector.py — cleanup_node()'s srun_cmd is now built once and includes both --job-name (when provided) and --reservation (when set); the second in-try reassignment that dropped --job-name is gone. ROCm#9 run_orchestrator.py — replace the shallow `merged.update(...)` with a real recursive _deep_merge so the comment ("deep-merge") matches the behaviour: nested dicts under slurm/k8s/distributed/etc. are merged per-leaf, runtime --additional-context still wins on conflicts. Made-with: Cursor

…slurm_multi PR ROCm#86 was originally rooted on coketaste/refactor-dist, which introduced both Kubernetes and SLURM deployment paths. The PR base is now develop and reviewers asked to keep this PR scoped to slurm_multi and the related SLURM/orchestration changes. This commit removes the Kubernetes-only artifacts from the slurm PR. The KubernetesDeployment class continues to register itself only when the kubernetes module is importable (DeploymentFactory.register_default_deployments already wraps the import in try/ImportError), so the slurm PR works without the kubernetes module present. The full Kubernetes work continues on a separate branch and PR (raviguptaamd/k8s_deployment), which can target develop independently once ROCm#86 lands. Files removed (k8s-only): - src/madengine/deployment/k8s_secrets.py - src/madengine/deployment/kubernetes.py - src/madengine/deployment/kubernetes_launcher_mixin.py - src/madengine/deployment/presets/k8s/** - src/madengine/deployment/templates/kubernetes/** - src/madengine/scripts/k8s/** - examples/k8s-configs/** - tests/unit/test_k8s.py Mixed source files (run_orchestrator, build_orchestrator, base, common, config_loader, validators) and docs that reference both slurm and k8s are intentionally left untouched in this commit: - their k8s code paths are dormant (no k8s module to call into), - touching them would create a much wider conflict surface vs develop, - the k8s PR will re-add the runtime hooks when its files come back. Co-authored-by: Cursor <cursoragent@cursor.com>

Copilot

Copilot wasn't able to review this pull request because it exceeds the maximum number of lines (20,000). Try reducing the number of changed lines and requesting a review from Copilot again.

raviguptaamd · 2026-05-01T16:12:13Z

Kubernetes deployment work split out into #118 to keep this PR focused on slurm_multi and related changes (per reviewer ask). #118 targets develop and will rebase / sequence after this PR lands.

Introduce the `slurm_multi` deployment path needed for self-managed, multi-node disaggregated workloads (vLLM-disagg, SGLang-disagg, large-EP, kvcache-transfer): * SlurmDeployment: new `slurm_multi` launcher branch with model-card driven --partition / --nodes / --gpus-per-node / --time / nodelist inheritance from `models.json`, optional `--exclusive`, reservation, and registry-required validation. * SlurmNodeSelector: idle-node discovery, on-demand health check, cleanup_node() that consolidates --job-name and --reservation in a single srun (avoids duplicate srun in the previous implementation). * templates/slurm/job.sh.j2: SBATCH directives factored so user exclusivity setting is honoured, matching slurm_multi behaviour. * base.py / common.py / config_loader.py / factory.py: shared deployment infrastructure adjustments (DeploymentConfig fields, registry validation, factory registration). Incorporates the related Copilot review fixes: * slurm.py: drop duplicate `from typing import Optional`; honour slurm_config.exclusive in the slurm_multi wrapper script. * slurm_node_selector.py: cleanup_node srun_cmd built once and preserves --job-name when --reservation is also set. Co-authored-by: Cursor <cursoragent@cursor.com>

Plumb the slurm_multi launcher through the build/run/execution stack: * orchestration/build_orchestrator.py: - --use-image / --build-on-compute paths populate manifest's deployment_config.distributed and deployment_config.slurm so the run phase auto-detects slurm without --additional-context. - Discovery yields multiple model cards: built_models manifest is now keyed by model_name (no longer overwrites when models share an image). - _execute_build_on_compute() validates `registry` early (no longer falls into registry.lower()/.replace() with NoneType). - Missing-credentials error emits per-registry hints (docker.io, ghcr.io, gcr.io, quay.io, nvcr.io) instead of Docker Hub-only PAT guidance. - Warn when discovered models in a prebuilt-image batch have divergent distributed/slurm configs. * orchestration/run_orchestrator.py: - True recursive deep-merge of deployment_config into additional_context (the previous shallow `dict.update` lost nested slurm/k8s overrides). - Manifest reading respects deployment_config.target. * execution/container_runner.py + helpers / docker_builder.py: - Surface model_args / env wiring needed by the slurm_multi flow. - Document the shell=True trust boundary on subprocess.run; cmd is constructed from validated manifest fields and user model_args are routed through shlex-quoted assembly in the caller. * cli/{app, commands/run, commands/build, constants, utils, validators}.py: - --build-on-compute and --use-image flags, deployment-config CLI overrides, validation pre-checks for slurm targets. * core/{constants, context, dataprovider, docker, errors, timeout}.py: - Shared infrastructure used by the new launcher path (DeploymentConfig fields, error classes, timeout policy). * utils/, scripts/common/, reporting/, database/: - Helpers and tooling that the slurm_multi flow depends on (config_parser, gpu_config, gpu_validator, ops, rocm_tool_manager, session_tracker, log_formatting, discover_models, mongodb perf publishing, trace pre/post scripts, therock_detector). Includes the Copilot review fixes that touch this layer: built_models keying, registry None guard, registry-specific auth hints, divergent config warning, deep-merge correctness, subprocess.run trust comment. Co-authored-by: Cursor <cursoragent@cursor.com>

Round out the slurm_multi PR with operator-facing docs, examples, and test additions: * examples/slurm-configs: - basic/slurm-multi-{custom-split,multi-node}.json (renamed from sglang-disagg-* to reflect that the same launcher drives multiple flows: vLLM-disagg, SGLang-disagg, large-EP, kvcache-transfer). - minimal/slurm-multi-minimal.json - README.md covering all the basic/minimal configs. * examples/profiling-configs/README.md: cross-reference slurm_multi flows where applicable. * docs/{launchers,deployment,configuration,cli-reference,usage, installation,profiling,README}.md and top-level README.md / CHANGELOG.md: describe the slurm_multi launcher, --build-on-compute, --use-image, and the manifest deployment_config flow. * pyproject.toml / .gitignore: minor housekeeping. * tests/{unit,integration,e2e,fixtures}: cover the new launcher path, config_loader inheritance, deep-merge behaviour, container_runner shell handling, and slurm node selector cleanup. test_errors.py promoted from unit to integration. Co-authored-by: Cursor <cursoragent@cursor.com>

… perf.csv collection Three fixes that surfaced during the PR ROCm#86 smoke test against the Distributed_Inference_CI reservation: 1. ErrorContext kwargs sweep (build_orchestrator.py: 5 sites) ErrorContext is a dataclass with a fixed field set {operation, phase, component, model_name, node_id, file_path, additional_info}. Five call-sites passed kwargs that don't exist (`model=`, `launcher=`, `model_names=`, `registry=`, `dockerfile=`), which raised `TypeError: ErrorContext.__init__() got an unexpected keyword argument 'model'` BEFORE the real ConfigurationError was constructed. The slurm_multi guard error in particular ("requires --registry or --use-image") was never visible to users. Migrated those kwargs to the proper `model_name=` field or to `additional_info={...}` so the user-facing error messages now reach the console. 2. F1: model-card image fallback for slurm_multi (build_orchestrator.py) When a discovered model card declares `env_vars.DOCKER_IMAGE_NAME` and the user passes neither `--registry` nor `--use-image`, treat the card's image as an implicit `--use-image` and call _execute_with_prebuilt_image() directly. Avoids forcing operators to repeat the image on the CLI for slurm_multi flows whose model cards already pin it. Falls back to the existing ConfigurationError (now visible thanks to fix ROCm#1) when no card image is set or when discovered models declare divergent images. 3. F-perf: extended slurm_multi perf.csv search paths (slurm.py) `_collect_slurm_multi_results` now looks in the locations the inner scripts (MAD-private PR #186) actually write to: * `/shared_inference/<user>/<deployment_id>/perf.csv` (sglang_disagg / vllm_disagg via benchmark_xPyD.sh) * `<workspace>/slurm_output/perf_csv/*<deployment_id>*.csv` (large_ep_benchmark / kvcache_transfer_bench) in addition to the existing `slurm_config.results_dir` and `<cwd>/perf.csv` paths, with NFS-propagation retry on each candidate. Fixes the "0 perf files, 1 log files" outcome we hit on Slurm job 22607 even though `/shared_inference/ravgupta/22607/perf.csv` was actually generated. Co-authored-by: Cursor <cursoragent@cursor.com>

Copilot

Pull request overview

Copilot reviewed 86 out of 87 changed files in this pull request and generated 24 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

raviguptaamd · 2026-05-06T01:14:08Z

    if not is_rocprofv3_available():
-        if tools_include_rocprof_family(tools_config):
-            filtered_tools: List[Dict] = [
-                t
-                for t in tools_config
-                if isinstance(t, dict) and t.get("name") not in _ROCPROF_FAMILY_TOOL_NAMES
-            ]
-            if filtered_tools:
-                logger.warning(
-                    "Multi-node: rocprofv3 not found on submission host; "
-                    "omitting rocprof/rocprofv3-family tools. Remaining tools: "
-                    + ", ".join(
-                        str(x.get("name"))
-                        for x in filtered_tools
-                        if isinstance(x, dict) and x.get("name")
-                    )
-                )
-                return {
-                    "enabled": True,
-                    "mode": "multi_node",
-                    "tools": filtered_tools,
-                    "per_node_collection": True,
-                }
-            logger.warning(
-                "╔════════════════════════════════════════════════════════════════════════════╗\n"
-                "║ Multi-Node Profiling Requirements Not Met                                 ║\n"
-                "╠════════════════════════════════════════════════════════════════════════════╣\n"
-                "║ Multi-node profiling requires rocprofv3 (MPI-aware profiling support).    ║\n"
-                "║                                                                            ║\n"
-                "║ Current Status: rocprofv3 NOT FOUND on system                             ║\n"
-                "║                                                                            ║\n"
-                "║ rocprof/rocprofv3-family profiling will be SKIPPED for this multi-node    ║\n"
-                "║ run.                                                                      ║\n"
-                "║                                                                            ║\n"
-                "║ To enable multi-node rocprof profiling:                                   ║\n"
-                "║   • Install rocprofiler-sdk package (ROCm >= 6.4.1)                       ║\n"
-                "║   • Command: apt install rocprofiler-sdk                                  ║\n"
-                "║   • Or upgrade to ROCm 6.4.1 or later                                     ║\n"
-                "║                                                                            ║\n"
-                "║ Note: Single-node profiling uses rocprof (no rocprofv3 required)          ║\n"
-                "╚════════════════════════════════════════════════════════════════════════════╝"
-            )
-            return {
-                "enabled": False,
-                "mode": "multi_node_unsupported",
-                "tools": [],
-                "per_node_collection": False
-            }
-        logger.info(
-            "Multi-node: rocprofv3 not found on submission host; keeping non-rocprof tools "
-            "(rocprof/rocprofv3 family requires rocprofv3 for MPI-aware capture)."
+        logger.warning(
+            "╔════════════════════════════════════════════════════════════════════════════╗\n"
+            "║ Multi-Node Profiling Requirements Not Met                                 ║\n"
+            "╠════════════════════════════════════════════════════════════════════════════╣\n"
+            "║ Multi-node profiling requires rocprofv3 (MPI-aware profiling support).    ║\n"
+            "║                                                                            ║\n"
+            "║ Current Status: rocprofv3 NOT FOUND on system                             ║\n"
+            "║                                                                            ║\n"
+            "║ Profiling will be SKIPPED for this multi-node run.                        ║\n"
+            "║                                                                            ║\n"
+            "║ To enable multi-node profiling:                                           ║\n"
+            "║   • Install rocprofiler-sdk package (ROCm >= 6.4.1)                       ║\n"
+            "║   • Command: apt install rocprofiler-sdk                                  ║\n"
+            "║   • Or upgrade to ROCm 6.4.1 or later                                     ║\n"
+            "║                                                                            ║\n"
+            "║ Note: Single-node profiling uses rocprof (no rocprofv3 required)          ║\n"
+            "╚════════════════════════════════════════════════════════════════════════════╝"
        )
        return {
-            "enabled": True,
-            "mode": "multi_node",
-            "tools": tools_config,
-            "per_node_collection": True,
+            "enabled": False,
+            "mode": "multi_node_unsupported",
+            "tools": [],
+            "per_node_collection": False
        }


Addressed in 63a2eae.

configure_multi_node_profiling now only filters out the rocprof tool when rocprofv3 is missing on a multi-node run. rccl_trace, rocblas_trace, and other non-rocprof tools keep running and per_node_collection stays enabled for them. The returned config switches to a new mode multi_node_no_rocprofv3 so callers can distinguish this mixed state from full multi_node.

raviguptaamd · 2026-05-06T01:14:22Z

+                        if key in stored_config and isinstance(stored_config[key], dict) and isinstance(self.additional_context[key], dict):
+                            merged = dict(stored_config[key])
+                            merged.update(self.additional_context[key])
+                            stored_config[key] = merged


Addressed in 63a2eae.

Promoted _deep_merge to a module-level helper and used it in _load_and_merge_manifest (which still did dict(stored_config[key]); merged.update(...)). Verified locally: an override touching slurm.node_selector.reservation now preserves a sibling slurm.node_selector.partition from the manifest instead of wholesale-replacing the inner object. Same helper is now used in both the runtime-context init path and the manifest-merge path.

raviguptaamd · 2026-05-06T01:14:32Z

+        credential_file = "credential.json"
+        if os.path.exists(credential_file):
+            try:
+                with open(credential_file) as f:
+                    credentials = json.load(f)
+            except Exception as e:
+                print(f"Warning: Could not load credentials: {e}")


Addressed in 63a2eae.

RunOrchestrator._load_credentials now delegates to madengine.core.auth.load_credentials. The central helper enforces "credential.json must contain a JSON object" validation up front (raises ValueError if the file is an array/string), so malformed credential files are rejected with a clear message instead of producing a cryptic subscript/attribute error later during registry auth.

raviguptaamd · 2026-05-06T01:15:36Z

            self.ctx["rocm_path"] = self._rocm_path
-            # In-container ROCM_PATH is finalized at run time in run_container
-            # via finalize_container_rocm_path (OCI env → probe → default).
-            # docker_env_vars.ROCM_PATH (user-supplied) is preserved here; finalize
-            # normalizes it at run time. Auto-detection runs in finalize when absent.
+            self.ctx["docker_env_vars"]["ROCM_PATH"] = self._rocm_path


Addressed in 9109f52 — file reverted to origin/develop.

core/context.py no longer forces the host ROCm root into the container env. The PR now uses upstream's madengine.utils.rocm_path_resolver.resolve_host_rocm_path for host detection and the docker_env_vars.MAD_ROCM_PATH flow (with image-baked-in env, in-container probe, then /opt/rocm fallback) for in-container resolution. Out-of-scope behaviour change for slurm_multi; reverted to keep this PR focused.

raviguptaamd · 2026-05-06T01:15:40Z

                row = [str(row_index), status, model_name]
                if show_gpu_arch:
-                    row.append(build_gpu_arch_display(item))
+                    row.append(item.get("architecture", "N/A"))


Addressed in 9109f52.

Restored the build_gpu_arch_display(item) helper (prefers gpu_architecture then architecture, defaults to "N/A") and switched both BUILD callsites (success + failed) back to it, so multi-arch builds again show the correct gpu_architecture column instead of N/A. Kept the model-name no-truncation change and the "Workload" column rename (those weren't flagged).

raviguptaamd · 2026-05-06T01:16:26Z

 - **🚀 Modern CLI** - Rich terminal output with Typer and Rich
 - **🎯 Simple Deployment** - Run locally or deploy to Kubernetes/SLURM via configuration
- **🔧 Distributed Launchers** - Full support for torchrun, DeepSpeed, Megatron-LM, TorchTitan, Primus, vLLM, SGLang
+- **🔧 Distributed Launchers** - Full support for torchrun, DeepSpeed, Megatron-LM, TorchTitan, vLLM, SGLang


Addressed across two commits.

9109f52 — README.md reverted to origin/develop, so the Distributed Launchers bullet, the Architecture diagram, and the Components section all mention Primus alongside torchrun / DeepSpeed / Megatron-LM / TorchTitan / vLLM / SGLang again. The ROCm-path bullet and the RTL profiling-tools link are also restored.

63a2eae — added "primus" (and "megatron") to VALID_LAUNCHERS in deployment/common.py, so normalize_launcher no longer silently downgrades "primus" configs to "docker".

The README is now consistent with what the code actually accepts.

raviguptaamd · 2026-05-06T01:16:30Z

+    def __init__(self, seconds: int = 15) -> None:
        """Constructor of the Timeout class.

        Args:
-            seconds (Optional[int]): The timeout in seconds. None or 0 disables
-                the timeout. Negative values are treated as no timeout.
+            seconds (int): The timeout in seconds.
        """
-        self.seconds = seconds if seconds and seconds > 0 else None
+        self.seconds = seconds


Addressed in 9109f52 — file reverted to origin/develop.

Timeout.__init__ again accepts Optional[int] = 15 and stores self.seconds = seconds if seconds and seconds > 0 else None. __enter__ / __exit__ short-circuit when self.seconds is None, so programmatic callers passing timeout=None (or 0 / negative) no longer hit signal.alarm(None) with a TypeError. Out-of-scope change for slurm_multi; reverted.

raviguptaamd · 2026-05-06T01:16:34Z

                "--build-arg "
-                + shlex.quote(str(build_arg))
-                + "="
-                + shlex.quote(str(self.context.ctx["docker_build_arg"][build_arg]))
-                + " "
+                + build_arg
+                + "='"
+                + self.context.ctx["docker_build_arg"][build_arg]
+                + "' "
            )

        if run_build_arg:
            for key, value in run_build_arg.items():
-                build_args += "--build-arg " + shlex.quote(str(key)) + "=" + shlex.quote(str(value)) + " "
+                build_args += "--build-arg " + key + "='" + value + "' "


Addressed in 9109f52.

get_build_arg reverted to use shlex.quote(str(...)) for both keys and values:

build_args += ( "--build-arg " + shlex.quote(str(build_arg)) + "=" + shlex.quote(str(self.context.ctx["docker_build_arg"][build_arg])) + " " )

(plus Optional[Dict] = None with if run_build_arg is None: run_build_arg = {} instead of the mutable-default pattern). Build-arg values containing apostrophes / $() / backticks no longer break the docker build command line.

raviguptaamd · 2026-05-06T01:16:37Z

        stop = os.path.abspath(stop_dir)

-        while current == stop or current.startswith(stop + os.sep):
+        while current.startswith(stop):


Addressed in 9109f52 — file reverted to origin/develop.

Boundary check restored:

while current == stop or current.startswith(stop + os.sep):

So a parent like /repo/scripts no longer matches /repo/scripts-private/... via prefix collision — ConfigParser walks the intended subtree only. Out-of-scope change for slurm_multi; reverted.

raviguptaamd · 2026-05-06T01:16:41Z

        if sys_gpu_arch and "NVIDIA" in sys_gpu_arch:
-            # Normalize "NVIDIA A100-SXM4-40GB" -> "A100", "NVIDIA H100 PCIe" -> "H100"
-            # (mirrors get_gpu_arch() in tests/fixtures/utils.py)
-            after_prefix = sys_gpu_arch[sys_gpu_arch.index("NVIDIA") + len("NVIDIA ") :]
-            sys_gpu_arch = after_prefix.split("-")[0].split()[0]
+            sys_gpu_arch = sys_gpu_arch.split()[1]


Addressed in 9109f52 — file reverted to origin/develop.

NVIDIA-arch normalization restored:

after_prefix = sys_gpu_arch[sys_gpu_arch.index("NVIDIA") + len("NVIDIA "):] sys_gpu_arch = after_prefix.split("-")[0].split()[0]

so "NVIDIA A100-SXM4-40GB" → "A100" and "NVIDIA H100 PCIe" → "H100" again, and skip_gpu_arch=A100 correctly filters those models. The split()[1] regression that returned the full SKU string is gone. Out-of-scope change for slurm_multi; reverted.

Previously the per-deploy completion marker was named `madengine_<model>.complete` -- not job-id namespaced. A second `madengine run` of the same model tag would see the leftover marker from the first run and immediately conclude the job was done, even though sbatch had only just submitted the new job. Reproduced live: re-running pyt_sglang_disagg_llama-3.1-8b right after job 22607 finished caused madengine to return success in ~30 s while job 22615 was still running on the reservation. Fix: * prepare(): emit the marker using `${SLURM_JOB_ID:-local}` inside the generated wrapper so each sbatch run writes its own file. Falls back to `_local` when no SLURM_JOB_ID is set (interactive/local runs, preserves existing behaviour for that path via _completion_marker). * monitor(): reconstruct the job-id-aware path from the basename template and the deployment_id returned by sbatch; only fall back to the bare marker for backwards compatibility with the inside_allocation flow. Co-authored-by: Cursor <cursoragent@cursor.com>

raviguptaamd · 2026-05-03T07:07:21Z

Smoke test summary on `Distributed_Inference_CI` reservation

Drove the slurm_multi launcher end-to-end against four real workloads with the companion model cards in ROCm/MAD-private#186. Three out of four produced real benchmark numbers, the fourth surfaced a vLLM/NixlConnector issue inside the docker image (orthogonal to this PR).

Critical-path runs

#	Workload	Model card	Slurm job	Nodes	State	Result
C1	SGLang disagg 8B (madengine-driven via F1)	`pyt_sglang_disagg_llama-3.1-8b`	22607	reservation [029,040,082]	COMPLETED 8:51	5,977 tok/s TTFT 122.8 ms ITL 2.54 ms
C2	vLLM disagg 8B (1P/1D, co-located proxy)	`pyt_vllm_disagg_llama-3.1-8b`	22612	[008,014]	COMPLETED 4:22	router serving, but every decode request returned HTTP 500: vLLM workers crashed with `nixl_connector.py:1504] AssertionError`. Image-side bug, not the launcher.
C3	Large EP 2-node	`pyt_large_ep_benchmark_2n`	22610	reservation [029,040]	COMPLETED 3:49	28 metric rows (deepep dispatch/combine FP8/BF16 RDMA/NVL)
C4	KV cache transfer 2-node	`pyt_kvcache_transfer_bench`	22611	reservation [082,125]	COMPLETED 24:55	57 metric rows (mooncake + mori + rixl, 4KB → 1GB)
C1'	C1 re-run with the F-perf collector commit (`5ec55ff`)	same	22615	reservation [029,040,082]	COMPLETED 6:54	5,970 tok/s; F-perf finds `/shared_inference/<user>/<jobid>/perf.csv` (verified)

What the smoke test exercised in this PR

Surface	Verification
`madengine build` with `--use-image` (manifest gen for prebuilt image)	C1
`madengine build` with no flags + model-card `env_vars.DOCKER_IMAGE_NAME` (F1 implicit-`--use-image`)	C1' — log line `slurm_multi: no --registry/--use-image given; using DOCKER_IMAGE_NAME from model card -> rocm/pytorch-private:sglang_disagg_mori_20260502`
`madengine run --manifest-file` over slurm_multi	C1, C1'
`slurm_multi` SBATCH wrapper generation, `--reservation`, `enable_node_check=false` honoured	All
`_collect_slurm_multi_results` finds `/shared_inference/<user>/<jobid>/perf.csv` (sglang/vllm)	C1'
`SlurmDeployment` direct sbatch (no madengine on top)	C2, C3, C4
`slurm_multi` co-located-proxy topology for vLLM disagg (`xP + yD` nodes)	C2 — topology was launched correctly; C2 failure happened inside vLLM nixl_connector after `Application startup complete.`

Bugs surfaced by the smoke test (all addressed)

Bug	Where	Commit
5 invalid kwargs to `ErrorContext` (`model=`, `launcher=`, `model_names=`, `registry=`, `dockerfile=`) — masked the real error message	`build_orchestrator.py`	`5ec55ff`
F1: model-card image not auto-promoted to `--use-image` for slurm_multi	`build_orchestrator.py`	`5ec55ff`
F-perf: `_collect_slurm_multi_results` only looked in `<cwd>/perf.csv`	`slurm.py`	`5ec55ff`
Completion marker not job-id namespaced — second `madengine run` of same model tag returned success in ~30 s while the new sbatch was still running	`slurm.py`	`710b51a`

Out of scope

Image-side vLLM NixlConnector crash on Llama-3.1-8B disagg under rocm/pytorch-private:20260407_itej89_vllm_mori_docker (filed separately as orthogonal).
Node-health precheck threshold (180 GB memory used / 4 processes flags reservation nodes as "Dirty"; tracked for follow-up).

PR head: 710b51a. Mergeable, blocking on review.

Round-2 Copilot review identified 15 changes outside the slurm_multi scope that regressed upstream/develop behaviour. Revert each to origin/develop: core/context.py drop forced host-ROCm-into-container override core/docker.py restore shlex.quote + exact-match container filter core/dataprovider.py restore "unknown data name" guard, drop debug prints core/timeout.py restore Optional[int] None/0 handling cli/app.py restore versioningit-driven --version utils/rocm_tool_manager.py capture primary_stderr before fallback overwrites utils/discover_models.py restore shlex.quote in cp; raise TypeError, not assert utils/config_parser.py fix boundary check (== or startswith stop+os.sep) orchestration/image_filtering.py restore proper "NVIDIA <ARCH>" normalization (split()[1] returned A100-SXM4-40GB, not A100) reporting/update_perf_csv.py restore configs pop, isinstance instead of type == README.md restore Primus mention, ROCm path docs, RTL link Surgical reverts that preserve scope-relevant changes: cli/utils.py - restore build_gpu_arch_display() helper + the two BUILD callsites (success and failed BUILD branches), so multi-arch builds show the gpu_architecture column instead of N/A - keep the model-name no-truncation change (no [:20]) - keep the "Workload" column rename and matching format-precision tweak execution/docker_builder.py - revert auth/shlex paths to upstream (login_to_registry delegated to madengine.core.auth, shlex.quote on build args, exact-match container filter, primus dockercontext detection) - PRESERVE the slurm_multi-specific block at line 369: # Update built_models with registry image name for parallel pull # in slurm_multi so the registry-image -> env_vars["DOCKER_IMAGE_NAME"] propagation that the slurm_multi parallel-pull path depends on is unchanged. slurm_multi infrastructure is untouched: zero bytes diff vs PR HEAD on deployment/{slurm,slurm_node_selector,base,common,factory,config_loader}.py, orchestration/{build_orchestrator,run_orchestrator}.py, execution/{container_runner,container_runner_helpers}.py, cli/{commands/build,commands/run,validators}.py. Smoke checks: DeploymentFactory.available_deployments() = ['k8s','kubernetes','slurm'] normalize_launcher('slurm_multi','slurm') = 'slurm_multi' normalize_launcher('slurm-multi','slurm') = 'slurm_multi' Addresses Copilot review comments ROCm#4, ROCm#5, ROCm#6, ROCm#9, ROCm#13, ROCm#14, ROCm#16, ROCm#17, Co-authored-by: Cursor <cursoragent@cursor.com> ROCm#18, ROCm#19, ROCm#20, ROCm#21, ROCm#22, ROCm#23, ROCm#24.

Six Round-2 review comments on slurm_multi-touching code paths fixed. run_orchestrator.py: * Real recursive deep-merge: promote the inline `_deep_merge` from `_init_runtime_context_from_manifest` to a module-level helper and use it in `_load_and_merge_manifest` too. Previously the manifest path still did a shallow `dict.update()`, so a runtime override touching a single leaf inside e.g. `slurm.node_selector` or `distributed.primus` wholesale-replaced the inner object and silently dropped the manifest's other leaves. (Copilot review ROCm#2) * Restore manifest's `context.docker_env_vars` into the runtime context. Manifest-only runs (`madengine run --manifest-file`) used to load tools / pre_scripts / post_scripts / encapsulate_script but had stopped restoring `docker_env_vars`, so any env captured at build time (incl. MAD_SECRETS*, model-specific values) was dropped before ContainerRunner rendered `--env` flags. (Copilot review ROCm#12) * Replace the local `_load_credentials` reimplementation with a delegation to `madengine.core.auth.load_credentials`, which keeps the "credential.json must be a JSON object" validation that was silently lost. Without it, a malformed file (array/string) would propagate unchanged and surface much later as a cryptic subscript/attribute error during registry auth. (Copilot review ROCm#3) deployment/common.py: * `configure_multi_node_profiling`: when `rocprofv3` is unavailable on a multi-node run, only filter out the `rocprof` tool (which is the one that actually requires rocprofv3) instead of disabling ALL multi-node tools. `rccl_trace`, `rocblas_trace`, and friends work independently and must keep running. Returns `multi_node_no_rocprofv3` in that mixed mode. (Copilot review ROCm#1) * `VALID_LAUNCHERS`: add `"megatron"` and `"primus"` alongside the existing `"megatron-lm"`. The rest of the codebase (slurm.py, kubernetes.py, the `job.yaml.j2` template, `primus_backend.py`, and every `examples/**/megatron-lm-*.json` config) uses the bare `"megatron"` / `"primus"` strings, so without these entries `normalize_launcher` was silently downgrading them to `"docker"`. (Copilot review ROCm#10) deployment/base.py: * Widen `_parse_node_log_for_perf` regex to accept scientific notation (`1.23e4`, `4E+5`) for the value, and `\\S+` for the metric name (covers `tokens/sec`, `samples_per_second`, `tok/s`, etc.). The narrower `[\\d.]+` previously dropped legitimate perf lines emitted by training scripts. (Copilot review ROCm#11) Smoke checks: DeploymentFactory.available_deployments() = ['k8s','kubernetes','slurm'] normalize_launcher('megatron','slurm') = 'megatron' normalize_launcher('primus','kubernetes') = 'primus' normalize_launcher('slurm_multi','slurm') = 'slurm_multi' _deep_merge nested-leaf preservation: verified Co-authored-by: Cursor <cursoragent@cursor.com>

Three Round-2 Copilot comments on doc-only broken navigation: docs/deployment.md (review ROCm#7): distributed-launchers.md → launchers.md (renamed long ago) user-guide.md → usage.md (no user-guide.md exists) docs/launchers.md (review ROCm#8): Drop the K8s `slurm-multi-*.json` example references — those files do not exist under examples/k8s-configs/. slurm_multi is a SLURM-side launcher; K8s disagg flows continue to use the existing examples/k8s-configs/{minimal,basic}/sglang-disagg-*.json configs. Keep and verify the three SLURM example paths that actually exist on disk. docs/installation.md (review ROCm#15): user-guide.md → usage.md (no user-guide.md exists) how-to-quick-start.md → cli-reference.md (no how-to-quick-start.md exists; usage.md already covers Quick Start and is linked above, so point readers at the CLI reference instead). All link targets verified to exist on disk: docs/launchers.md, docs/usage.md, docs/deployment.md, docs/cli-reference.md examples/slurm-configs/minimal/slurm-multi-minimal.json examples/slurm-configs/basic/slurm-multi-multi-node.json examples/slurm-configs/basic/slurm-multi-custom-split.json Co-authored-by: Cursor <cursoragent@cursor.com>

Copilot

Pull request overview

Copilot reviewed 75 out of 76 changed files in this pull request and generated 13 comments.

Comments suppressed due to low confidence (2)

src/madengine/orchestration/run_orchestrator.py:120

Context.init still only accepts additional_context, additional_context_file, build_only_mode, and detect_local_gpu_arch. Passing rocm_path here raises TypeError, so every local run that initializes the runtime context will fail before execution starts.
src/madengine/orchestration/run_orchestrator.py:594
Build manifests still persist context.docker_env_vars, docker_mounts, and docker_build_arg, but this restore block now copies only tools/scripts back into the runtime Context. Execution-only runs from a manifest therefore lose build-time env/mount context before ContainerRunner starts, including values that Context auto-imported into docker_env_vars such as MAD_SECRETS*.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

raviguptaamd · 2026-05-06T03:35:22Z

            for env_arg in run_env:
-                if not self._ENV_KEY_RE.match(env_arg):
-                    raise ValueError(f"Invalid environment variable name: {env_arg!r}")
-                env_args += f"--env {env_arg}={shlex.quote(str(run_env[env_arg]))} "
+                env_args += f"--env {env_arg}='{str(run_env[env_arg])}' "

        # Add context environment variables
        if "docker_env_vars" in self.context.ctx:
            for env_arg in self.context.ctx["docker_env_vars"].keys():
-                if not self._ENV_KEY_RE.match(env_arg):
-                    raise ValueError(f"Invalid environment variable name: {env_arg!r}")
-                value = self.context.ctx["docker_env_vars"][env_arg]
-                env_args += f"--env {env_arg}={shlex.quote(str(value))} "
+                env_args += f"--env {env_arg}='{str(self.context.ctx['docker_env_vars'][env_arg])}' "


Acknowledged — this is a real concern but it is pre-existing in origin/develop, not introduced by this PR.

The f"--env {env_arg}='{str(...)}' " pattern is at the same line in origin/develop (compare git show origin/develop:src/madengine/execution/container_runner.py | sed -n '578,581p'); the --build-on-compute / --use-image work in this PR didn't change this method.

Deferring to a follow-up issue rather than expanding this PR's scope. The right fix is to switch to shlex.quote(value) (and validate the env-var name with the existing _VALID_ENV_NAME_RE style check) for both this docker_env_vars loop and the run_env loop a few lines above. Happy to file the follow-up issue (run_container env-arg shell-escape hardening) and reference it back here.

raviguptaamd · 2026-05-06T03:35:02Z

        if self.additional_context and "docker_env_vars" in self.additional_context:
            merged_count = 0
            for key, value in self.additional_context["docker_env_vars"].items():
                self.context.ctx["docker_env_vars"][key] = value
                merged_count += 1
            if merged_count > 0:
                print(f"ℹ️  Merged {merged_count} environment variables from additional_context")

-        if self.context and str(self.context.ctx.get("gpu_vendor", "")).upper().find(
-            "AMD"
-        ) != -1:
-            from madengine.utils.rocm_path_resolver import finalize_container_rocm_path
-
-            # Determine whether the user explicitly supplied ROCM_PATH for the container.
-            # If they did (via docker_env_vars.ROCM_PATH in additional_context), the
-            # re-merge above already restored it — keep it so finalize uses it directly.
-            # If they did not, clear any ROCM_PATH left from a previous model run so
-            # finalize always re-resolves for the current docker_image (OCI config →
-            # in-image probe → /opt/rocm default).
-            user_supplied_rocm_path = (
-                str(
-                    (self.additional_context or {})
-                    .get("docker_env_vars", {})
-                    .get("ROCM_PATH", "")
-                ).strip()
-            )
-            if not user_supplied_rocm_path:
-                self.context.ctx["docker_env_vars"].pop("ROCM_PATH", None)
-
-            finalize_container_rocm_path(
-                self.context.ctx["docker_env_vars"],
-                docker_image,
-            )
+        # Merge env_vars from model_info (models.json) into docker_env_vars
+        if "env_vars" in model_info and model_info["env_vars"]:
+            model_env_count = 0
+            for key, value in model_info["env_vars"].items():
+                self.context.ctx["docker_env_vars"][key] = str(value)
+                model_env_count += 1
+            if model_env_count > 0:
+                print(f"ℹ️  Merged {model_env_count} environment variables from model_info (models.json)")


This thread is showing as outdated and is already addressed by Cat B in 9109f52.

core/context.py was reverted to origin/develop, which means:

No more forced self.ctx['docker_env_vars']['ROCM_PATH'] = self._rocm_path at init.

In-container ROCM_PATH is resolved at run time via finalize_container_rocm_path() (OCI metadata → in-image probe → /opt/rocm), so user/model docker_env_vars overrides survive the merge.

The ROCm-path behaviour the comment was concerned about is preserved by upstream's existing implementation, not by this PR.

raviguptaamd · 2026-05-06T03:35:06Z

            )

        self.rich_console.print(f"[green]✓ Discovered {len(models)} model(s) for tags: {tags}[/green]\n")

        # Initialize build-only context for manifest generation


This thread is showing as outdated and is already addressed by Cat B in 9109f52.

core/context.py was reverted to origin/develop, so the Context build-only constructor no longer accepts a rocm_path keyword. The synthetic-manifest path that called Context(rocm_path=...) was the source of the TypeError referenced here, and that constructor signature is back to upstream's. No call site in this PR passes rocm_path=... anymore.

raviguptaamd · 2026-05-06T03:35:11Z

        if self.additional_context:
            # Merge deployment_config
            if "deployment_config" in manifest:
                stored_config = manifest["deployment_config"]
-                # Runtime --additional-context overrides stored config
+                # Runtime --additional-context overrides stored config.
+                # For dict-valued keys, recursively deep-merge so manifest values


This thread is showing as outdated and is already addressed in 63a2eae — reposting the link here for clarity.

run_orchestrator._load_and_merge_manifest no longer uses dict(stored_config[key]); merged.update(...). It now calls the module-level _deep_merge(stored_config[key], self.additional_context[key]) helper, which recurses into nested dicts.

Verified locally: an override touching slurm.node_selector.reservation preserves a sibling slurm.node_selector.partition from the manifest instead of wholesale-replacing the inner object. Same helper is used in both the runtime-context init path and the manifest-merge path.

raviguptaamd · 2026-05-06T03:34:48Z

+# NOTE: keep this in sync with the launcher branches in
+# deployment/{slurm.py,kubernetes.py}, the templates/{slurm,kubernetes}/*.j2
+# Jinja conditions, and examples/**/launcher fields. Currently the rest of
+# the codebase (and shipped examples) use the bare "megatron" / "primus"
+# strings, while "megatron-lm" is the docs-friendly synonym.
 VALID_LAUNCHERS = [
    "torchrun",
    "torchtitan",
    "deepspeed",


This Round-3 comment is reviewing a pre-fix snapshot — already addressed in 63a2eae.

Current src/madengine/deployment/common.py has both "megatron" and "primus" in VALID_LAUNCHERS (alongside the docs-friendly "megatron-lm"):

VALID_LAUNCHERS = [ "torchrun", "torchtitan", "deepspeed", "megatron", "megatron-lm", "primus", "vllm", "sglang", "slurm_multi", ]

Verified at HEAD (2200cbd): normalize_launcher('megatron','slurm') → 'megatron', normalize_launcher('primus','kubernetes') → 'primus'. The example configs (examples/{slurm,k8s}-configs/**/megatron-lm-*.json, examples/k8s-configs/minimal/megatron-lm-*.json) round-trip correctly without being downgraded to "docker"/"native".

raviguptaamd · 2026-05-06T03:34:58Z

+**Examples** (SLURM only — `slurm_multi` is a SLURM-side launcher; the K8s
+deployment of disagg workloads continues to use the existing
+`sglang-disagg-*.json` configs):
+- SLURM Minimal: `examples/slurm-configs/minimal/slurm-multi-minimal.json`
+- SLURM Basic:   `examples/slurm-configs/basic/slurm-multi-multi-node.json`
+- SLURM Custom:  `examples/slurm-configs/basic/slurm-multi-custom-split.json`


This Round-3 comment is reviewing a pre-fix snapshot — already addressed in 0156d33.

Current docs/launchers.md (around line 470) only lists the three SLURM-side examples that actually exist on disk:

**Examples** (SLURM only — `slurm_multi` is a SLURM-side launcher; the K8s deployment of disagg workloads continues to use the existing `sglang-disagg-*.json` configs): - SLURM Minimal: `examples/slurm-configs/minimal/slurm-multi-minimal.json` - SLURM Basic: `examples/slurm-configs/basic/slurm-multi-multi-node.json` - SLURM Custom: `examples/slurm-configs/basic/slurm-multi-custom-split.json`

The three K8s slurm-multi-*.json references that didn't exist were dropped. All three SLURM paths verified to exist at HEAD (2200cbd).

raviguptaamd · 2026-05-06T03:34:31Z

@@ -126,7 +102,7 @@ madengine build --tags your_model --registry your-registry:5000

 # Deploy to SLURM
 madengine run --manifest-file build_manifest.json \
-  --additional-context-file examples/profiling-configs/rocprofv3_multi_node.json
+  --additional-context-file examples/profiling-configs/rocprofv3_multinode.json


Addressed in 2200cbd.

Fixed both occurrences (lines 89 and 105). The actual checked-in profiling config is examples/profiling-configs/rocprofv3_multi_node.json (with underscore); the README now references that exact filename, so the copy-paste command no longer fails with a missing-file error.

raviguptaamd · 2026-05-06T03:34:34Z

 import os
 import tempfile

+import pytest


Addressed in 2200cbd.

Dropped import pytest. Confirmed there are no pytest.* references anywhere in tests/unit/test_reporting.py (no @pytest.mark.*, no pytest.raises, no pytest.fixture), so the import was genuinely unused. flake8 lint will be clean again.

raviguptaamd · 2026-05-06T03:34:39Z

+        ctx = Context(build_only_mode=True, rocm_path="/opt/rocm")
+        assert ctx._rocm_path == "/opt/rocm"


Addressed in 2200cbd — file reverted to origin/develop.

With the Cat B revert of core/context.py in 9109f52, this PR no longer adds the Context(rocm_path=...) keyword API; tests for that PR-specific surface belong to whichever future PR re-adds it. Upstream's tests/unit/test_rocm_path.py already exercises ROCmToolManager(rocm_path=...) (still in the codebase), so reverting restores green tests without changing semantics. Net result: this test file is byte-identical to origin/develop.

raviguptaamd · 2026-05-06T03:34:42Z

+            ctx = Context(rocm_path="/my/rocm")
+            assert ctx.ctx.get("rocm_path") == "/my/rocm"
+            assert ctx.ctx["docker_env_vars"].get("ROCM_PATH") == "/my/rocm"


Addressed in 2200cbd (same fix as the sibling thread). The whole test_rocm_path.py file was reverted to origin/develop, so this assertion that ctx['docker_env_vars']['ROCM_PATH'] == ... immediately after construction is gone — the test now matches upstream's behaviour where in-container ROCM_PATH is finalized at run time via finalize_container_rocm_path().

Three Round-3 review comments fixed in this small commit; the remaining Round-3 items are either stale (already fixed in 9109f52 / 63a2eae / 0156d33; Copilot was reviewing a pre-fix snapshot) or pre-existing concerns in upstream/develop that are deferred to a follow-up issue (replied inline). examples/profiling-configs/README.md (R3-6): `rocprofv3_multinode.json` -> `rocprofv3_multi_node.json` (lines 89 and 105). The actual checked-in file is `rocprofv3_multi_node.json` (with underscore), so copy-pasting the README's command would fail with a missing-file error. tests/unit/test_reporting.py (R3-7): Drop the unused `import pytest` (no `pytest.*` references in the file). The repo's pre-commit runs flake8, so the unused import would fail lint/CI. tests/unit/test_rocm_path.py (R3-8 + R3-9): Revert to origin/develop. With the Cat B revert of `core/context.py` (commit 9109f52), the PR no longer adds the `Context(rocm_path=...)` keyword API; tests for that PR-specific API now belong to whichever future PR re-adds it. Upstream's version of this test file already exercises `ROCmToolManager(rocm_path=...)` -- which is still in the codebase -- so reverting restores green tests without changing semantics. Co-authored-by: Cursor <cursoragent@cursor.com>

Surfaced live: `madengine build --tags pyt_kvcache_transfer_bench --build-on-compute --registry docker.io/rocm/pytorch-private` failed its docker build step with: ERROR: failed to build: invalid tag "ci-pyt_kvcache_transfer_bench_Dockerfile": repository name must be lowercase Root cause: in `_execute_build_on_compute`, `local_image_name` is assembled as `ci-<model_name>_<dockerfile_basename>`, where `dockerfile_basename` strips `.Dockerfile` and `.ubuntu.amd` from `Path(dockerfile_path).name`. When the model card's `dockerfile` field is just `scripts/<dir>/Dockerfile` (no `.ubuntu.amd.Dockerfile` suffix), neither `.replace()` matches, so `dockerfile_basename` stays as `Dockerfile` (capital D) and the assembled tag is rejected by `docker build -t`. Fix: `.lower()` the final assembled local image name. This: * fixes the kvcache_transfer_bench card (`scripts/.../Dockerfile`), * is a no-op for the four already-passing cards whose dockerfile paths follow the `*.ubuntu.amd.Dockerfile` convention (the `_basename` is already lowercase after the suffix strips), and * is robust against any uppercase character anywhere in `model_name` or `dockerfile_basename`. Verified by re-running the failing case to completion: Local image name: ci-pyt_kvcache_transfer_bench_dockerfile Image pushed: docker.io/rocm/pytorch-private:pyt_kvcache_transfer_bench [✓] All builds completed successfully (slurm job 22936, ~18 min full Dockerfile build of mooncake + MORI + RIXL + nixlbench, 24 layers) Co-authored-by: Cursor <cursoragent@cursor.com>

Copilot

Pull request overview

Copilot reviewed 74 out of 75 changed files in this pull request and generated 24 comments.

Comments suppressed due to low confidence (3)

docs/cli-reference.md:396

The CLI docs still say --timeout 0 means “no timeout”, but madengine run no longer translates 0 to an unlimited timeout display-wise and the validation message no longer mentions 0. Either update the docs to match the current behavior, or restore the explicit 0 => no timeout handling so help text, docs, and runtime all align.

| `--tags` | `-t` | TEXT | `[]` | Model tags to run (can specify multiple) |
| `--manifest-file` | `-m` | TEXT | `""` | Build manifest file path (for pre-built images) |
| `--rocm-path` | | TEXT | `None` | ROCm installation root (default: `ROCM_PATH` env or `/opt/rocm`). Use when ROCm is not in `/opt/rocm` (e.g. Rock, pip). |
| `--registry` | `-r` | TEXT | `None` | Docker registry URL |
| `--timeout` | | INT | `-1` | Timeout in seconds (-1=default 7200s, 0=no timeout) |
| `--additional-context` | `-c` | TEXT | `"{}"` | Additional context as JSON string |
| `--additional-context-file` | `-f` | TEXT | `None` | File containing additional context JSON |

pyproject.toml:31

pytest is included in the package’s runtime dependencies even though it’s only used for testing (it’s already listed under [project.optional-dependencies].dev). Keeping pytest in dependencies increases install size and can conflict with downstream environments. Move it back to dev/test extras only unless the library imports pytest at runtime (it currently doesn’t).

dependencies = [
  "pandas",
  "GitPython",
  "jsondiff",
  "sqlalchemy",
  "setuptools-rust",
  "paramiko",
  "tqdm",
  "pytest",
  "typing-extensions",
  "pymongo",
  "toml",
  "typer>=0.9.0",
  "rich>=13.0.0",
  "click>=8.0.0",
  "jinja2>=3.0.0",
  "pyyaml>=6.0",
]

src/madengine/execution/container_runner_helpers.py:216

make_run_log_file_path() no longer normalizes full image references (e.g. registry/ns/name:tag), so the generated log filename can include : (and very long sanitized refs), and run-log naming can diverge from build-log naming for CI-style tags. Consider reinstating a normalization helper (strip digest, collapse registry/...:ci-* to just the tag, and replace :, /, @ with _) to keep filenames stable and filesystem-friendly.

raviguptaamd · 2026-05-06T20:36:59Z

+        rocm_path = get_rocm_path(getattr(self.args, "rocm_path", None))
        self.context = Context(
            additional_context=context_string,
            build_only_mode=False,
+            rocm_path=rocm_path,
        )


Addressed in 3fc7587.

This is a real latent bug introduced by my Cat B revert: I reverted core/context.py to origin/develop (which removes the rocm_path keyword), but missed that run_orchestrator.py still passes it at lines 138 and 456. Today's slurm_multi tests went through _execute_distributed so didn't hit it, but madengine run against a local target hits _init_runtime_context() here and crashes with TypeError: Context.__init__() got an unexpected keyword argument 'rocm_path'.

Fix in 3fc7587: drop the rocm_path = get_rocm_path(...) line and the rocm_path=rocm_path kwarg at both call sites; drop the now-unused from madengine.core.constants import get_rocm_path import.

Verified at HEAD (3fc7587) via inspect.signature(Context.__init__): params are ['self','additional_context','additional_context_file','build_only_mode','detect_local_gpu_arch'] -- no more rocm_path. End-to-end smoke: pyt_large_ep_benchmark_2n on [008,024] via slurm_multi, 8 min, 28 metric rows.

raviguptaamd · 2026-05-06T20:37:02Z

+        rocm_path = get_rocm_path(getattr(self.args, "rocm_path", None))
        build_context = Context(
            additional_context=context_string,
            build_only_mode=True,
+            rocm_path=rocm_path,
        )


Addressed in 3fc7587 (same root cause as the sibling thread on line 138). The synthetic-manifest path in _create_manifest_from_local_image was passing the same now-removed rocm_path kwarg. Both call sites are fixed in this commit; from madengine.core.constants import get_rocm_path import is gone too.

raviguptaamd · 2026-05-06T20:37:05Z

 from madengine.core.errors import (
    BuildError,
    ConfigurationError,
    ExecutionError,
    create_error_context,
+    handle_error,
 )
+from madengine.core.constants import get_rocm_path


Addressed in 3fc7587. Dropped handle_error from the import set in from madengine.core.errors import (...) since the only references in this module were removed in earlier rounds. flake8 F401 should be clean.

raviguptaamd · 2026-05-06T20:37:11Z

+        # Ensure credential values are strings
+        username = str(creds["username"])
+        password = str(creds["password"])
+
+        # Perform docker login
+        login_command = f"echo '{password}' | docker login"
+
+        if registry and registry.lower() not in ["docker.io", "dockerhub"]:
+            login_command += f" {registry}"
+
+        login_command += f" --username {username} --password-stdin"
+
+        try:
+            self.console.sh(login_command, secret=True)
+            self.rich_console.print(f"[green]✅ Successfully logged in to registry: {registry or 'DockerHub'}[/green]")
+        except Exception as e:
+            self.rich_console.print(f"[red]❌ Failed to login to registry {registry}: {e}[/red]")


Addressed in 3fc7587.

ContainerRunner.login_to_registry now delegates to madengine.core.auth.login_to_registry (the same helper DockerBuilder.login_to_registry already uses). The previous inline form built

login_command = f"echo '{password}' | docker login ... --password-stdin" self.console.sh(login_command, secret=True)

which -- even though docker reads the password via --password-stdin -- still embedded the password in echo's argv visible through ps//proc. The central helper writes the password to docker's stdin via subprocess.run(..., input=password.encode()), so it never appears on a process command line.

The ~50 lines of inline credential-shape validation are also gone -- the central helper does the same checks (and was the path Round-2 R2-#18 already moved DockerBuilder onto, just in a sister file).

raviguptaamd · 2026-05-06T20:37:18Z

    def get_env_arg(self, run_env: typing.Dict) -> str:
        """Get the environment arguments for docker run."""
        env_args = ""

        # Add custom environment variables
        if run_env:
            for env_arg in run_env:
-                if not self._ENV_KEY_RE.match(env_arg):
-                    raise ValueError(f"Invalid environment variable name: {env_arg!r}")
-                env_args += f"--env {env_arg}={shlex.quote(str(run_env[env_arg]))} "
+                env_args += f"--env {env_arg}='{str(run_env[env_arg])}' "

        # Add context environment variables
        if "docker_env_vars" in self.context.ctx:
            for env_arg in self.context.ctx["docker_env_vars"].keys():
-                if not self._ENV_KEY_RE.match(env_arg):
-                    raise ValueError(f"Invalid environment variable name: {env_arg!r}")
-                value = self.context.ctx["docker_env_vars"][env_arg]
-                env_args += f"--env {env_arg}={shlex.quote(str(value))} "
+                env_args += f"--env {env_arg}='{str(self.context.ctx['docker_env_vars'][env_arg])}' "

+        print(f"Env arguments: {env_args}")
        return env_args


Addressed in 3fc7587. get_env_arg no longer dumps the full assembled --env KEY='value' ... string into stdout. It now prints only the count:

print(f"Env arguments: {env_args.count('--env')} variables set")

Values can include HF tokens, registry passwords, MAD_SECRETS_*, etc.; the assembled string is still passed to docker run (so the container sees them as expected), but no longer leaks into stdout / CI artifacts.

            try:
                power_info = amdsmi_interface.amdsmi_get_power_info(processor_handle)
                if isinstance(power_info, dict):
                    # If both current and average power are 0, it's likely a secondary die
                    current_power = power_info.get('current_socket_power', -1)
                    avg_power = power_info.get('average_socket_power', -1)
                    if current_power == 0 and avg_power == 0:
                        return True
-            except Exception:
+            except:
                # If we can't get power info, might be secondary die
                return True


raviguptaamd · 2026-05-06T20:37:30Z

- `sglang-disagg` - Disaggregated SGLang (multi-node)

-See [Launchers Guide](launchers.md) for details.
+See [Distributed Launchers Guide](distributed-launchers.md) for details.


Addressed in 3fc7587. This was the second occurrence of [Distributed Launchers Guide](distributed-launchers.md) -- the earlier 0156d33 commit only fixed the one at line 543. Updated to (launchers.md) (verified to exist on disk).


        # Add upload timestamp if not present
        if f"{prefix}_uploaded_at" not in doc:
-            doc[f"{prefix}_uploaded_at"] = datetime.now(timezone.utc)
+            doc[f"{prefix}_uploaded_at"] = datetime.utcnow()

        # Preserve original created_date if present
        if "created_date" not in doc:
-            doc["created_date"] = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
+            doc["created_date"] = datetime.utcnow().strftime("%Y-%m-%d %H:%M:%S")



@@ -120,6 +122,10 @@ def __init__(self, message: str, context: Optional[ErrorContext] = None, **kwarg
        )


+# Backward compatibility alias
+RuntimeError = ExecutionError
+


 class TestConfigParser:
    """Test cases for ConfigParser functionality."""
-
+    
    @pytest.fixture
-    def tmp_dir(self):
+    def test_dir(self):


@bad

…pend docker.io/) Surfaced live: `madengine build --build-on-compute --registry rocm/pytorch-private` failed inside the build sbatch with `docker login rocm` -> dial tcp ... connect: connection refused. Dockerhub-shorthand without the `docker.io/` prefix is a routine input shape (matches what Docker CLI accepts for pull/run); the build-on-compute path silently fed the first segment of `registry` to `docker login <host>` and got a DNS NXDOMAIN. `_execute_build_on_compute` now normalizes `--registry` early: 1. Strip whitespace + trailing slash; reject empty after trim. 2. Reject inputs containing whitespace or `@` anywhere (catches typos like `@bad/foo` BEFORE the auto-prefix step would mask them by overwriting the first path segment to `docker.io`). 3. If the first path segment doesn't look like a registry host (no `.`, no `:`, not `localhost`), prepend `docker.io/` and print a transparent `[dim]auto-prefixing 'docker.io/'[/dim]` notice so the user can see what the build is actually pushing to. 4. Use the resulting normalized value for everything downstream (login host extraction, push path construction, manifest entry, per-registry credential-hint matching). Effect on real-world inputs (verified end-to-end against docker.io/rocm/pytorch-private): rocm/pytorch-private -> docker.io/rocm/pytorch-private (auto) rocmshared -> docker.io/rocmshared (auto) docker.io/rocm/pytorch-private -> unchanged ghcr.io/myorg/myrepo -> unchanged registry.example.com:5000/myrepo -> unchanged localhost:5000/myrepo -> unchanged 192.168.1.10:5000/myrepo -> unchanged @bad/foo -> ConfigurationError (fail-fast, includes per-registry usage hints) "" -> ConfigurationError (existing None-guard, before this change) Live verification (rocmshared PAT, write+read on rocm/pytorch-private): Test 1 `rocm/pytorch-private` -> sbatch 22954 ✅ pushed docker.io/rocm/pytorch-private:pyt_kvcache_transfer_bench Test 2 `@bad/foo` -> ConfigurationError (no sbatch) "Invalid --registry value: '@bad/foo'. First segment '@bad' is not a valid registry host." Test 3 `docker.io/rocm/pytorch-private` -> sbatch 22956 ✅ pushed (no auto-prefix message, value unchanged) Test 4 "" -> caught by existing build CLI guard Suggestion strings use parens `(...)` rather than `[...]` to avoid Rich's markdown parser interpreting them as closing tags. Co-authored-by: Cursor <cursoragent@cursor.com>

Round-4 Copilot review (auto-fired post 2200cbd) identified 24 inline comments. 14 are pre-existing flake8/style nits in `origin/develop`- identical files unchanged by this PR (tests/fixtures/**, scripts/common/ tools/**, csv_parser.py, utils/{ops,log_formatting}.py, core/errors.py, database/mongodb.py, etc.); those are out of scope for this PR. This commit addresses the 8 findings that DO touch code we own: P0 (real latent bugs / known-broken): src/madengine/orchestration/run_orchestrator.py: * Drop `Context(rocm_path=...)` kwarg from both call sites (`_init_runtime_context` line 138; `_create_manifest_from_local_image` line 456). Cat B's revert of `core/context.py` to origin/develop removed `rocm_path` from `Context.__init__`, but those two call sites still passed the kwarg. Latent bug: today's slurm_multi tests went through `_execute_distributed` so didn't hit it, but `madengine run` against a local target (`_execute_local`) and the synthetic-manifest path (`MAD_CONTAINER_IMAGE` env) crashed with `TypeError: Context.__init__() got an unexpected keyword argument 'rocm_path'`. * Drop now-unused `from madengine.core.constants import get_rocm_path` (only callers were the two lines above). * Drop unused `handle_error` from `madengine.core.errors` import set (flake8 F401). src/madengine/execution/container_runner.py: * Widen `perf_pattern` in run-result extraction to mirror the `deployment/base.py:_parse_node_log_for_perf` regex we already widened in 63a2eae: before: r'performance:\s+([0-9][0-9.eE+-]*)\s+([a-zA-Z_][a-zA-Z0-9_]*)' after: r'performance:\s+([+-]?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?)\s+(\S+)' Accepts scientific notation values and arbitrary metric tokens (`tokens/sec`, `tok/s`, `samples_per_second`, etc.). docs/deployment.md: * `[Distributed Launchers Guide](distributed-launchers.md)` → `(launchers.md)` at line 145. Second occurrence of the broken link the 0156d33 fix missed (it only fixed line 543). Yellow-tier hardening (security / leak / shell-quoting): src/madengine/execution/container_runner.py: * `login_to_registry` now delegates to `madengine.core.auth.login_to_registry` instead of constructing `f"echo '{password}' | docker login ..."` inline. The previous form put the password in argv (visible via `ps`/`/proc`) even though docker itself read it from stdin. Same delegation pattern DockerBuilder already uses. * `get_env_arg` no longer prints the assembled `--env KEY='value'` string (which can include HF tokens, registry passwords, `MAD_SECRETS_*`). Print only the count. * Self-managed launcher path (`_execute_self_managed_launcher`): `cmd = f"bash {script_path} {model_args}"` with `shell=True` no longer string-concats user-supplied `model_args` into the shell command. Use `shlex.quote(script_path)` and `shlex.split(model_args)` + per-arg `shlex.quote`, so embedded `$()`/backticks/`;` become literal arguments to the script instead of host-shell directives. Verified: * py_compile clean on both files. * Import smoke: DeploymentFactory.available_deployments() == ['k8s','kubernetes','slurm']; Context() rejects `rocm_path=` arg; _deep_merge nested-leaf preservation; run_orchestrator no longer references get_rocm_path/handle_error/rocm_path=; container_runner perf_pattern widened, env_args print is count-only, login_to_registry delegates to core.auth, self-managed shlex.quote applied. * Live: pyt_large_ep_benchmark_2n on [008,024] via slurm_multi completed in 8 min with 28 ✅ Success rows -> perf.csv at /shared_inference/ravgupta/<jobid>/ as expected. Addresses Copilot review comments R4-1, R4-2, R4-3, R4-4, R4-5, R4-6, R4-7, R4-18. Co-authored-by: Cursor <cursoragent@cursor.com>

Copilot

Pull request overview

Copilot reviewed 74 out of 75 changed files in this pull request and generated 16 comments.

Comments suppressed due to low confidence (2)

tests/unit/test_reporting_superset.py:39

This file contains whitespace on otherwise blank lines (e.g. between the class docstring and the first fixture). The repo’s pre-commit trailing-whitespace hook (and flake8 W293) will fail on this; please remove trailing spaces on blank lines.
tests/unit/test_reporting_superset.py:171
Additional whitespace-only lines in this section will also fail the pre-commit trailing-whitespace hook. Please strip the trailing spaces from blank lines here as well.

raviguptaamd · 2026-05-06T23:01:24Z

 dependencies = [
  "pandas",
  "GitPython",
  "jsondiff",
  "sqlalchemy",
+  "setuptools-rust",
  "paramiko",
  "tqdm",
+  "pytest",
  "typing-extensions",
  "pymongo",


Addressed in afbb666: dropped pytest from [project].dependencies. It was already present in [project.optional-dependencies].dev and .all, so no install path that was running tests is affected; only the bare pip install madengine (without [dev]/[all]) no longer pulls a test framework into runtime installs.

raviguptaamd · 2026-05-06T23:01:26Z

        result = builder.get_build_arg()

-        assert f"--build-arg ARG1={shlex.quote('value1')}" in result
-        assert f"--build-arg ARG2={shlex.quote('value2')}" in result
+        assert "--build-arg ARG1='value1'" in result
+        assert "--build-arg ARG2='value2'" in result



Addressed in afbb666: 5 build-arg assertions updated to match DockerBuilder.get_build_arg()'s shlex.quote() output. For shell-safe values (e.g. value1), shlex.quote returns the value unchanged, so the assertions are now --build-arg ARG1=value1 (no surrounding quotes). Verified locally: pytest -k build_arg → 4 passed.

raviguptaamd · 2026-05-06T23:01:27Z

    @pytest.mark.parametrize(
        "clean_test_temp_files", [["perf_test.csv", "perf_test.html"]], indirect=True
    )
    def test_commandline_argument_skip_gpu_arch(
-        self, global_data, clean_test_temp_files, dynamic_skip_gpu_arch_model_dir
+        self, global_data, clean_test_temp_files
    ):
        """
        Test that skip_gpu_arch command-line argument skips GPU architecture check.
-        Uses a dynamic MODEL_DIR with the current GPU arch added to the skip list
-        so the test works on any hardware.
+        UPDATED: Now uses python3 -m madengine.cli.app instead of legacy mad.py
        """
        context = generate_additional_context_for_machine()
        output = global_data["console"].sh(
            "cd "
            + BASE_DIR
            + "; "
            + "MODEL_DIR="
-            + dynamic_skip_gpu_arch_model_dir
+            + MODEL_DIR
            + " "
            + f"python3 -m madengine.cli.app run --tags dummy_skip_gpu_arch --live-output --additional-context '{json.dumps(context)}'"
        )
        if "Skipping model" not in output:
            pytest.fail("Enable skipping gpu arch for running model is failed.")


Deferred to upstream-flake8 cleanup PR. The line at #L93 is from 04aac390 (Stephen Shao) and is byte-identical to origin/develop; this PR's commit 50e8895 only touched the surrounding lines (88, 99) for whitespace and a fixture-arg rename, not the assertion target itself. Tracked alongside the rest of the Round-4 / Round-5 upstream-style nits in the follow-up cleanup PR (see #86 description, "Round 4" / "Round 5" status sections).

raviguptaamd · 2026-05-06T23:01:29Z

        if deployment_type == "slurm" and in_slurm_job:
            print(f"🔄 Using fresh pull policy for SLURM compute node (prevents cached layer corruption)")
            # Remove any existing cached image to force fresh pull
            try:
                self.console.sh(f"docker rmi -f {registry_image} 2>/dev/null || true")
                print(f"✓ Removed cached image layers")
-            except Exception:
+            except:
                pass  # It's okay if image doesn't exist


Addressed in afbb666: the bare except: at L424 (introduced by this PR's 6f2e11bf) was replaced with except Exception: so KeyboardInterrupt/SystemExit propagate, and flake8 E722 no longer fires on this line.

raviguptaamd · 2026-05-06T23:01:30Z

    def get_env_arg(self, run_env: typing.Dict) -> str:
        """Get the environment arguments for docker run."""
        env_args = ""

        # Add custom environment variables
        if run_env:
            for env_arg in run_env:
-                if not self._ENV_KEY_RE.match(env_arg):
-                    raise ValueError(f"Invalid environment variable name: {env_arg!r}")
-                env_args += f"--env {env_arg}={shlex.quote(str(run_env[env_arg]))} "
+                env_args += f"--env {env_arg}='{str(run_env[env_arg])}' "

        # Add context environment variables
        if "docker_env_vars" in self.context.ctx:
            for env_arg in self.context.ctx["docker_env_vars"].keys():
-                if not self._ENV_KEY_RE.match(env_arg):
-                    raise ValueError(f"Invalid environment variable name: {env_arg!r}")
-                value = self.context.ctx["docker_env_vars"][env_arg]
-                env_args += f"--env {env_arg}={shlex.quote(str(value))} "
+                env_args += f"--env {env_arg}='{str(self.context.ctx['docker_env_vars'][env_arg])}' "



Addressed in afbb666: get_env_arg now passes each value through shlex.quote(...) instead of single-quote string interpolation. shlex.quote is byte-equivalent for shell-safe values (alphanumeric, _, -, ., :, /, +, =, etc.) and correctly escapes embedded single quotes, whitespace, and other metacharacters.

slurm_multi safety: the self-managed launcher path (_run_self_managed) returns at L1099 before the docker-run command assembly at L1058 that calls get_env_arg, so this change only tightens the regular-Docker run path used by non-slurm_multi workloads. Verified end-to-end with a fresh pyt_large_ep_benchmark_2n slurm_multi smoke.

raviguptaamd · 2026-05-06T23:01:41Z

    if timeout < -1:
        console.print(
-            "❌ [red]Timeout must be -1 (default), 0 (no timeout), or a positive integer[/red]"
+            "❌ [red]Timeout must be -1 (default) or a positive integer[/red]"


Addressed in afbb666: validation message updated to match the actual accepted shape — "Timeout must be -1 (default), 0 (no timeout), or a positive integer". The --timeout help string already documents 0 for no timeout (line 67) and the validator is if timeout < -1, so 0 was always accepted; only the error string was misleading. No behaviour change.

raviguptaamd · 2026-05-06T23:01:42Z

 def is_amd_gpu():
    """Check if system has AMD GPU."""
    try:
        import subprocess
        result = subprocess.run(['rocm-smi'], capture_output=True, timeout=5)
        return result.returncode == 0
-    except Exception:
+    except:
        return False


Deferred. File is byte-identical to origin/develop in this PR. Tracked alongside the other upstream-style nits in the follow-up cleanup PR.

raviguptaamd · 2026-05-06T23:01:44Z

    # Clean up Docker containers BEFORE test (ensure clean state)
    try:
        subprocess.run(
            "docker ps -a | grep 'container_ci-dummy' | awk '{print $1}' | xargs -r docker rm -f",
            shell=True,
            capture_output=True,
            timeout=30
        )
-    except Exception:
+    except:
        pass  # Ignore cleanup errors before test



Deferred. File is byte-identical to origin/develop in this PR (no edits from #86). The two bare-except instances in this file are pre-existing; develop's [2.0.1] test-suite cleanup narrowed 5 such cases elsewhere but did not touch these two. Tracked in the upstream-cleanup follow-up PR.

raviguptaamd · 2026-05-06T23:01:45Z

    # Clean up Docker containers AFTER test (avoid conflicts with next test)
    try:
        subprocess.run(
            "docker ps -a | grep 'container_ci-dummy' | awk '{print $1}' | xargs -r docker rm -f",
            shell=True,
            capture_output=True,
            timeout=30
        )
-    except Exception:
+    except:
        pass  # Ignore cleanup errors after test


Deferred. Same disposition as the L106 thread above — file byte-identical to origin/develop, pre-existing concern, tracked in the upstream-cleanup follow-up PR.

raviguptaamd · 2026-05-06T23:01:47Z

 def convert_directory_csvs_to_html(
    directory_path: str,
    output_file: str = "run_results.html"
-) -> Optional[str]:
+) -> str:
    """Convert all CSV files in a directory to a single HTML file.

    Args:


Deferred. This file is byte-identical to origin/develop in this PR. Tracked alongside the other upstream-style nits in the follow-up cleanup PR.

raviguptaamd · 2026-05-06T20:40:36Z

Status update — ready for review

PR head: 3fc7587. 75 files / +2,877 / -2,179 vs develop. All four Copilot review rounds resolved or replied-to inline.

What's in here

slurm_multi deployment launcher for self-managed multi-node disaggregated workloads (vLLM-disagg, SGLang-disagg, large-EP, kvcache-transfer) with the orchestration / build / run / CLI hooks that drive it from the manifest pipeline.
madengine build --build-on-compute --registry <docker.io/<ns>(/<repo>) | ghcr.io/... | localhost:5000/...> flow (build on a compute node + push for parallel pull on run nodes).
madengine build --use-image <image> flow (skip build, use a pre-built image).
SBATCH wrapper generation, reservation support, existing-allocation detection (salloc), per-job-id completion marker, _collect_slurm_multi_results perf.csv discovery in /shared_inference/<user>/<jobid>/.

Kubernetes deployment work split out into #118 (independent PR against develop) per earlier reviewer ask; that PR will rebase / sequence after this one lands.

Live end-to-end test coverage

All on Distributed_Inference_CI and adjacent free nodes.

#	Workload	Path	Result
1	SGLang-disagg Llama-3.1-8B (1Px8/1Dx8, 24 GPUs / 3 nodes)	`--use-image` + slurm_multi	5,977 tok/s, TTFT 122.8 ms, ITL 2.54 ms
2	SGLang-disagg DeepSeek-V3 (1Px8/1Dx8, 24 GPUs / 3 nodes)	`--use-image` + slurm_multi	853 / 1,452 tok/s @ CON=8/16, ISL/OSL=1024/1024
3	SGLang-disagg Llama-3.3-70B-FP8 (1Px8/1Dx8, 24 GPUs / 3 nodes)	`--build-on-compute --registry docker.io/rocm/pytorch-private` -> `rocm/pytorch-private:pyt_sglang_disagg_llama-3.3-70b-fp8`	1,878 / 3,644 tok/s @ CON=8/16
4	Large EP (16 GPUs / 2 nodes)	`--build-on-compute --registry` -> `rocm/pytorch-private:pyt_large_ep_benchmark_2n`	28 perf rows (DeepEP + MoRI-EP dispatch/combine, FP8/BF16 RDMA/NVL)
5	KV cache transfer (16 GPUs / 2 nodes)	`--build-on-compute --registry` -> `rocm/pytorch-private:pyt_kvcache_transfer_bench`	57 perf rows (mooncake / MORI / RIXL, 4KB → 1GB sweep)
6	Re-smoke post-R4 fixes: `pyt_large_ep_benchmark_2n` on [008,024]	slurm_multi from manifest	8 min, 28 ✅ Success rows -- confirms R4 changes didn't regress the run path

The vLLM-disagg run path was also exercised end-to-end (slurm_multi topology + co-located proxy + port refactor + MoRI-EP gating all working) but the underlying rocm/pytorch-private:20260407_itej89_vllm_mori_docker image hits a NIXL_ERR_BACKEND after weight load. Image-side, orthogonal to this PR; tracked separately. Run-phase plumbing not gated by it.

Copilot review status

Four rounds total, all addressed:

Round 1 (9 inline) -- fixed in d32b06b / 5ec55ff, replied inline.
Round 2 (24 inline) -- 9 fixed in 63a2eae, 11 reverted to origin/develop byte-for-byte in 9109f52, 4 doc/example fixes in 0156d33, all replied inline with commit SHAs.
Round 3 (13 inline) -- 4 fixed in 2200cbd (profiling README typo, pytest unused-import, test_rocm_path.py revert), 7 stale-snapshot replies (already addressed in earlier rounds), 2 deferred to follow-up (pre-existing in origin/develop).
Round 4 (24 inline) -- 8 fixed in 3fc7587 (5 P0 + 3 hardening, see commit shape entry 12 in PR description), 14 are pre-existing flake8 / style nits in files byte-identical to origin/develop (utils/{ops,log_formatting}.py, scripts/common/{tools,pre_scripts}/**, tests/fixtures/utils.py, tests/integration/, core/errors.py, database/mongodb.py, deployment/base.py:645) -- not introduced by this PR; deferred to a separate cleanup PR.

The "P0" subset of Round 4 included a real latent crash I'd introduced in the Cat B revert (run_orchestrator.py still passing Context(rocm_path=...) after core/context.py was reverted to upstream's signature) -- that's now fixed and verified by the live re-smoke above.

Bugs surfaced and fixed during real-cluster testing

--registry parsing: shorthand rocm/pytorch-private was hitting docker login rocm (NXDOMAIN). Now auto-prefixed to docker.io/rocm/pytorch-private with a transparent notice; malformed input rejected up front with usage hints (98f1ac5).
--build-on-compute local image name: dockerfile path with no .ubuntu.amd.Dockerfile suffix (e.g., scripts/.../Dockerfile) yielded ci-..._Dockerfile -- Docker rejects uppercase. .lower() the assembled name (ae1996c).
_load_and_merge_manifest was still doing shallow dict.update() despite the Round-2 "deep merge" fix -- promoted to a recursive module-level helper used by both code paths (63a2eae).
manifest.context.docker_env_vars was being dropped on --manifest-file runs, losing build-time secrets / MAD_SECRETS_* (63a2eae).
Completion marker was not job-id namespaced -- a second madengine run of the same model tag returned success in ~30 s while the new sbatch was still running (710b51a).
ErrorContext was being constructed with 5 invalid kwargs across build_orchestrator.py, masking the real error (5ec55ff).
_collect_slurm_multi_results was only looking at <cwd>/perf.csv, missing the canonical /shared_inference/<user>/<jobid>/perf.csv location the inner scripts emit (5ec55ff).

Sequencing

Updates to slurm launcher #86 (this PR) → develop.
Kubernetes deployment (split from PR #86) #118 (Kubernetes) rebases onto develop and merges.
After both land, follow-up cherry-picks the combined work into main per repo policy.

What I'd appreciate eyes on

The slurm_multi launcher core (deployment/slurm.py, slurm_node_selector.py, the slurm_multi wrapper-script generation).
The --build-on-compute / --use-image orchestration in build_orchestrator.py (registry normalization, lowercase tagging, prebuilt-image multi-model handling, per-registry auth hints).
run_orchestrator._deep_merge semantics for nested slurm.node_selector / distributed.<sub> overrides.
container_runner._run_self_managed (the new self-managed launcher path used by slurm_multi).
The models.json slurm-side example configs renamed to slurm-multi-*.json and the launchers / deployment / cli-reference docs.

Reviews welcome. cc @gargrahul @Cemberk @Rohan138 @coketaste @leconcio

Three P1 items in PR-introduced code, plus four small cleanups Copilot flagged on the same lines. container_runner.py * get_env_arg: replace single-quote string interpolation (`--env KEY='value'`) with shlex.quote, so values containing single quotes / whitespace / shell metacharacters no longer break the assembled `docker run` command. shlex.quote is byte-equivalent for safe alphanumeric values (returns the input unchanged) and only adds escaping when needed. Not on the slurm_multi path (self-managed launcher short-circuits at L1098 before the docker docker_options assembly that calls get_env_arg). * Revert the AMD GPU SMI block (~L1142) to origin/develop's bare-name form (`amd-smi 2>/dev/null || rocm-smi 2>/dev/null || echo ...`). The PR-introduced version interpolated the *host* `context.ctx['rocm_path']` (or `get_rocm_path()`) into absolute `/opt/rocm-X.Y/bin/*` paths and executed them *inside* the container, which is incorrect when the container's ROCm install location differs from the host's. Bare names rely on the container's PATH, which is the upstream behaviour. * Replace bare `except:` at L424 (PR-introduced) with `except Exception:` so KeyboardInterrupt / SystemExit propagate (flake8 E722). * Fix a stale function-name reference in the perf-pattern comment at L1394: deployment/base.py's function is `_parse_performance_from_log`, not `_parse_node_log_for_perf`. * Hoist `import shlex` to the module-level import block (was a local import inside `_run_self_managed`) and drop the now-unused `from madengine.core.constants import get_rocm_path`. cli/commands/run.py * Update timeout validation message to match the actual accepted shape ("-1 (default), 0 (no timeout), or a positive integer"). The --timeout help string already documents `0 for no timeout`, and the validator uses `if timeout < -1`, so 0 was always accepted; only the error string was misleading. pyproject.toml * Drop `pytest` from [project].dependencies (it stays in [project.optional-dependencies].dev and .all). Keeps test dependencies out of `pip install madengine` runtime installs. tests/integration/test_docker_integration.py * Update 5 build-arg assertions to match `DockerBuilder.get_build_arg()`'s shlex.quote() output for shell-safe values (`--build-arg ARG1=value1`, no surrounding quotes). Verified: `pytest -k build_arg` -> 4 passed. slurm_multi safety: get_env_arg is reachable only from the regular docker run path (L1058); the slurm_multi self-managed launcher returns at L1099 before that. The SMI revert and the bare-`except` fix are both in code paths used only outside slurm_multi (image pull / GPU pre-check). The timeout, pyproject and comment-typo changes have no runtime effect. Co-authored-by: Cursor <cursoragent@cursor.com>

Co-authored-by: Cursor <cursoragent@cursor.com> # Conflicts: # CHANGELOG.md

Copilot

Pull request overview

Copilot reviewed 73 out of 74 changed files in this pull request and generated 14 comments.

Comments suppressed due to low confidence (2)

src/madengine/deployment/base.py:236

_monitor_until_complete() never treats CANCELLED (or any non-SUCCESS/FAILED terminal status) as terminal. Since DeploymentStatus still includes CANCELLED, this loop can spin forever after a cancellation. Consider returning on CANCELLED as well (and/or treating any non-PENDING/RUNNING status as terminal).

        self.console.print("[blue]Monitoring deployment...[/blue]")

        while True:
            status = self.monitor(deployment_id)

            if status.status in [DeploymentStatus.SUCCESS, DeploymentStatus.FAILED]:
                return status

            # Still running, wait and check again
            self.console.print(
                f"  Status: {status.status.value} - {status.message}"
            )
            time.sleep(30)  # Check every 30 seconds

src/madengine/execution/container_runner.py:251

create_run_details_dict() indexes self.context.ctx["docker_env_vars"]["MAD_SYSTEM_GPU_ARCHITECTURE"] directly. If docker_env_vars or the key is missing (e.g., older manifests / non-GPU runs / context init differences), this will raise KeyError and prevent perf entry generation. Use .get() with a safe fallback (or guard when the nested dict/key is absent).

raviguptaamd · 2026-05-06T23:46:58Z

+        # Merge env_vars from model_info (models.json) into docker_env_vars
+        if "env_vars" in model_info and model_info["env_vars"]:
+            model_env_count = 0
+            for key, value in model_info["env_vars"].items():
+                self.context.ctx["docker_env_vars"][key] = str(value)
+                model_env_count += 1
+            if model_env_count > 0:
+                print(f"ℹ️  Merged {model_env_count} environment variables from model_info (models.json)")


Addressed in bcafdbb: restored the upstream finalize_container_rocm_path() call block (originally introduced on develop in #110, removed by this PR's 6f2e11bf when the model_info env_vars merge block replaced it). The two address different concerns and must coexist — the env_vars merge copies model-card values into docker_env_vars, while finalize normalizes in-container ROCM_PATH via OCI inspect → in-image probe → /opt/rocm default. Block restored verbatim from origin/develop (including the user_supplied_rocm_path short-circuit), placed immediately after the env_vars merge.

slurm_multi safety: the self-managed launcher returns at L1099 before this code path, so this restoration only affects the regular-Docker run path used by non-slurm_multi workloads.

raviguptaamd · 2026-05-06T23:46:59Z

+                            model_docker.sh(f"cp {model_dir}/*_profiler_output.csv . 2>/dev/null || true")
+                            model_docker.sh(f"cp {model_dir}/*_output.csv . 2>/dev/null || true")
+                            model_docker.sh(f"cp {model_dir}/*_trace.csv . 2>/dev/null || true")
+                            model_docker.sh(f"cp {model_dir}/library_trace.csv . 2>/dev/null || true")


Addressed in bcafdbb: model_dir is now passed through shlex.quote once and reused across the four cp ... 2>/dev/null || true calls. The trailing *.csv glob is left unquoted intentionally so the in-container shell still expands it inside the (quoted) directory:

_md_q = shlex.quote(model_dir) model_docker.sh(f"cp {_md_q}/*_profiler_output.csv . 2>/dev/null || true") model_docker.sh(f"cp {_md_q}/*_output.csv . 2>/dev/null || true") model_docker.sh(f"cp {_md_q}/*_trace.csv . 2>/dev/null || true") model_docker.sh(f"cp {_md_q}/library_trace.csv . 2>/dev/null || true")

raviguptaamd · 2026-05-06T23:47:01Z

+                        mult_res = model_info.get("multiple_results")
                        if mult_res:
                            try:
-                                model_docker.sh(
-                                    _cp_model_dir_file_to_cwd_cmd(model_dir, mult_res)
-                                )
+                                model_docker.sh(f"cp {model_dir}/{mult_res} . 2>/dev/null || true")
                            except Exception:


Addressed in bcafdbb: both path components are now quoted, so a model-metadata multiple_results value containing whitespace or shell metacharacters can no longer break the command or be re-interpreted by the shell:

model_docker.sh( f"cp {shlex.quote(model_dir)}/{shlex.quote(mult_res)} . " "2>/dev/null || true" )

raviguptaamd · 2026-05-06T23:47:03Z

 """

 import json
+import os


Addressed in bcafdbb: dropped the unused import os (flake8 F401). The module hasn't referenced os.* since the orchestration helpers it used were moved out.

raviguptaamd · 2026-05-06T23:47:05Z

        context = generate_additional_context_for_machine()
        output = global_data["console"].sh(
            "cd "
            + BASE_DIR
            + "; "
            + "MODEL_DIR="
-            + dynamic_skip_gpu_arch_model_dir
+            + MODEL_DIR
            + " "
            + f"python3 -m madengine.cli.app run --tags dummy_skip_gpu_arch --live-output --additional-context '{json.dumps(context)}'"
        )
        if "Skipping model" not in output:
            pytest.fail("Enable skipping gpu arch for running model is failed.")


Deferred. The line at #L93 is from 04aac390 (Stephen Shao) and is byte-identical to the version on origin/develop in this PR. develop's [2.0.1] already lands the hardware-agnostic fix ([2.0.1] -> "E2E tests — hardware-agnostic GPU arch skip" — test_commandline_argument_skip_gpu_arch now detects the runner's GPU arch at runtime and injects it into skip_gpu_arch, with a new get_gpu_arch() helper in tests/fixtures/utils.py). Once #86 lands and CI re-runs against develop, that fix is what executes; no #86-side change needed. Tracked alongside the other origin/develop-identical concerns in the upstream-flake8 follow-up cleanup PR (per #86 description, Round 4 / Round 5 / Round 6 status sections).

raviguptaamd · 2026-05-06T23:47:13Z

    try:
        import subprocess
        result = subprocess.run(['rocm-smi'], capture_output=True, timeout=5)
        return result.returncode == 0
-    except Exception:
+    except:
        return False


Deferred. File is byte-identical to origin/develop in this PR (the bare-except: at #L41 is from 04aac390). Tracked alongside the other upstream bare-except: cleanup in the follow-up PR.

raviguptaamd · 2026-05-06T23:47:15Z

        try:
            self.rocmsmi = initRsmiBindings(silent=False)
            if rocm_smi.driverInitialized() is True:
                ret_init = self.rocmsmi.rsmi_init(0)
                if ret_init != 0:
                    raise ValueError('ROCm SMI returned %s (the expected value is 0)', ret_init)
                    exit(ret_init)
            else:
                raise ImportError('Driver not initialized (amdgpu not found in modules)')
                exit(0)
            self.rocm6 = True
-        except Exception:
+        except:
            rocm_smi.initializeRsmi()


Deferred. File is byte-identical to origin/develop in this PR (the bare-except: at #L42 is from da0d524f (Rahul Garg) and predates #86). Tracked in the upstream-cleanup follow-up PR.

raviguptaamd · 2026-05-06T23:47:17Z

            try:
                power_info = amdsmi_interface.amdsmi_get_power_info(processor_handle)
                if isinstance(power_info, dict):
                    # If both current and average power are 0, it's likely a secondary die
                    current_power = power_info.get('current_socket_power', -1)
                    avg_power = power_info.get('average_socket_power', -1)
                    if current_power == 0 and avg_power == 0:
                        return True
-            except Exception:
+            except:
                # If we can't get power info, might be secondary die
                return True


Deferred. File is byte-identical to origin/develop in this PR (the bare-except: at #L157 is from 0f2c3010 (Stephen Shao) and predates #86). Tracked in the upstream-cleanup follow-up PR.

raviguptaamd · 2026-05-06T23:47:18Z

+from unittest.mock import MagicMock, patch, Mock
 import pytest
+import pandas as pd



Addressed in bcafdbb: dropped the unused import pandas as pd (flake8 F401). Confirmed there are no pd.* references in the file — the test fixtures use dict / MagicMock only, so the import was dead. The line was added by this PR's 50e8895.

raviguptaamd · 2026-05-06T23:47:20Z

 class TestConfigParser:
    """Test cases for ConfigParser functionality."""
-
+    


Addressed in bcafdbb: trailing whitespace stripped on every blank-with-spaces line (107 occurrences in this file, including the one at L31). Pure whitespace cleanup, no functional change; flake8 W293 is now clean for this file.

Round 6 fired automatically on the develop-merge push (8fec192). 5 of the 14 comments are PR-introduced; the remaining 9 are byte-identical to origin/develop and stay on the upstream-flake8 cleanup follow-up list with the rest of R4 / R5 deferrals. container_runner.py * Restore the upstream `finalize_container_rocm_path()` call block (originally introduced on develop in PR ROCm#110, removed by this PR's 6f2e11b when the model_info `env_vars` merge block replaced it). The two blocks address different concerns and must coexist: - merge model-card env_vars into docker_env_vars (this PR), and - normalize in-container ROCM_PATH via OCI inspect → in-image probe → /opt/rocm default (upstream). Without the finalization, models on TheRock or non-`/opt/rocm` container layouts run with a wrong/empty ROCM_PATH, breaking rocm-smi / amd-smi / hipcc inside the container. The slurm_multi self-managed launcher returns at L1099 before this code path, so this restoration only affects the regular docker run path. * Quote `model_dir` (and `mult_res`) with shlex.quote on the profiler-/trace-/multi-results `cp` commands run via `model_docker.sh(... shell=True)` (lines 1672-1685 area). The `*.csv` glob is left unquoted so the in-container shell still expands it inside the (quoted) directory. `mult_res` comes straight from model metadata and was previously concatenated raw into the shell command. deployment/config_loader.py * Drop unused `import os` (flake8 F401). The module hasn't referenced `os.*` since the orchestration helpers it used were moved out. tests/unit/test_database_mongodb.py * Drop unused `import pandas as pd` (flake8 F401). The test file uses dict / Mock fixtures and never reaches into pandas. tests/unit/test_reporting_superset.py * Strip trailing whitespace (flake8 W293) on every blank-with- spaces line (107 occurrences). Pure whitespace; no functional change. slurm_multi safety: all five fixes either restore upstream behaviour on the regular-Docker run path, are flake8-only, or are in test files. None of them change anything reached by the slurm_multi self-managed launcher path; live re-smoke is unnecessary for this commit. Co-authored-by: Cursor <cursoragent@cursor.com>

…lti bash branch The earlier feat(deployment) commit ported `_run_inside_existing_allocation` verbatim from PR ROCm#86 (used by the slurm_multi bash-in-salloc branch). That method constructs `DeploymentResult(..., skip_monitoring=True)` to signal that the script ran synchronously and the monitor phase should be skipped. Develop's `DeploymentResult` dataclass did not have that field, so the construction raised `TypeError` at runtime, the failure was swallowed by the orchestrator, and `madengine run` exited 0 even when the wrapper script failed. Two minimal additive edits to deployment/base.py: - DeploymentResult: add `skip_monitoring: bool = False` field. - BaseDeployment.execute(): change the monitor guard from `if self.config.monitor:` to `if self.config.monitor and not result.skip_monitoring:` so the bash branch can correctly bypass the SLURM job poll (there is no SLURM job to poll — the script ran inline). Behavior for non-slurm_multi launchers is unchanged: they never construct DeploymentResult with skip_monitoring=True, the field defaults to False, and the monitor() call still fires as before. Discovered during R5.1b cluster smoke test of `pyt_sglang_disagg_qwen3-32b_short` from MAD-private PR #186. Co-authored-by: Cursor <cursoragent@cursor.com>

…her (Revised PR#124) (#130) * feat(deployment): add slurm_multi launcher (minimal additive) Introduce the `slurm_multi` self-managed multi-node SLURM launcher, fully additive vs develop. Verbatim minimal-delta port from PR #86 limited to the slurm_multi surface; no develop functionality removed. - common.py: register `slurm_multi` in VALID_LAUNCHERS, add hyphen alias normalization (`slurm-multi` -> `slurm_multi`). All other develop content (functools.lru_cache, _ROCPROF_FAMILY_TOOL_NAMES, tools_include_rocprof_family, full configure_multi_node_profiling rocprof-family handling, sglang-disagg in VALID_LAUNCHERS) preserved. - slurm.py: SLURM_MULTI_ALIASES; reservation field + allocation detection in __init__; helper methods (_get_allocation_node_count, _validate_allocation_nodes); slurm_multi early dispatch in prepare(); _prepare_slurm_multi_script (sets `_is_slurm_multi = True` for deploy() gate); _generate_slurm_multi_command; new elif in _generate_launcher_command (existing primus/sglang-disagg dispatch preserved); gated bash-in-salloc branch in deploy() (slurm_multi only -- non-slurm_multi launchers continue to use sbatch); _run_inside_existing_allocation; _collect_slurm_multi_results + early dispatch in collect_results(); reservation= kwarg threaded to existing SlurmNodeSelector(...) call. - slurm_node_selector.py: optional `reservation` parameter so the reservation propagates to srun health/cleanup commands. Source: ROCm/madengine PR #86 (slurm_multi surface only). 0 deletions from develop. Co-authored-by: Cursor <cursoragent@cursor.com> * feat(orchestration): build_on_compute, registry gate, parallel pull for slurm_multi Wire the build/run orchestration paths needed by the slurm_multi launcher. All existing flows for non-slurm_multi launchers continue to work unchanged; new behavior is gated on `distributed.launcher == "slurm_multi"` (or `slurm-multi`). - execution/container_runner.py: SLURM_MULTI_ALIASES, SELF_MANAGED_LAUNCHERS constants; `_run_self_managed` method (runs the model script directly on baremetal so srun/scontrol work inside it, propagates env from model_info.env_vars and additional_context.env_vars); self-managed launcher early dispatch in run_container() that fires only when launcher matches SELF_MANAGED_LAUNCHERS; `.slurm` extension recognized alongside `.sh` in script-extension detection (T-C2). Develop helpers preserved verbatim (_print_run_env_table, _docker_image_exists_locally, _bash_quote_path, _cp_model_dir_file_to_cwd_cmd, _resolve_docker_image, _ENV_KEY_RE env-var validation, MAD_OUTPUT_CSV passthrough, PRIMUS_CONFIG_PATH/PRIMUS_CLI_EXTRA in env-var allowlist, PERFORMANCE_LOG_PATTERN regex, CSV fieldname stripping). - execution/docker_builder.py: after build, write resolved registry image into built_models[*].env_vars.DOCKER_IMAGE_NAME so the run-time `srun docker pull "\$DOCKER_IMAGE_NAME"` on each compute node finds the registry image. Pure addition (one block before manifest assembly). - orchestration/build_orchestrator.py: import shutil; capture user-supplied slurm keys (_original_user_slurm_keys); add `use_image: Optional[str]` and `build_on_compute: bool` to execute(); mutex validation between use_image and build_on_compute; use_image dispatch (incl. "auto" sentinel that resolves from model card via _resolve_image_from_model_card); build_on_compute dispatch (registry required, validated in _execute_build_on_compute); slurm_multi registry gate that runs DiscoverModels early to detect slurm_multi tags and either raises a structured ConfigurationError with helpful suggestions or auto-uses DOCKER_IMAGE_NAME from the model card env_vars (implicit --use-image fallback); _execute_with_prebuilt_image; _resolve_image_from_model_card; _execute_build_on_compute (sbatch wrapper that builds + pushes from one compute node, polls completion, marks `built_on_compute: true` in the manifest). Develop's load_credentials top-level usage, detect_local_gpu_arch parameter, and resolved_arch print all preserved unchanged. Note: PR #86's universal merge of model_info.env_vars into docker_env_vars in run_container() was DELIBERATELY DROPPED -- it is not slurm_multi-essential (slurm_multi gets env_vars via _prepare_slurm_multi_script's wrapper export block and via _run_self_managed's own merge). Applying it universally would silently change behavior for non-slurm_multi launchers and has an incorrect precedence vs additional_context.env_vars; it can be revisited as a separate PR if needed. Source: ROCm/madengine PR #86 (slurm_multi surface only). 0 deletions from develop except the agreed `.sh` -> `.sh|.slurm` line modification (T-C2: harmless extension to script-extension recognition). Co-authored-by: Cursor <cursoragent@cursor.com> * feat(cli): expose --use-image and --build-on-compute on madengine build Surface the orchestrator paths from the previous commit through the `madengine build` CLI: - cli/commands/build.py: add `--use-image [IMAGE | auto]` (skip the local Docker build and use the named image; `auto` resolves from the model card's DOCKER_IMAGE_NAME); add `--build-on-compute` (build on a SLURM compute node and push to the configured registry; manifest records `built_on_compute: true`); 3 mutex validation blocks (--use-image vs --registry, --use-image vs --build-on-compute, --build-on-compute requires --registry); pass both new kwargs through to validate_additional_context, create_args_namespace, and BuildOrchestrator.execute(). - cli/validators.py: add `use_image: Optional[str] = None` parameter (signature + docstring only -- body unchanged) so build.py's call site does not TypeError. PR #86's other validators.py rewrites (drops of ROCM_PATH / MAD_ROCM_PATH validation, dockerfile_matched fallback removal) are NOT inherited; develop validation is preserved. Required for the slurm_multi launcher's documented build options in MAD-private #186. Co-authored-by: Cursor <cursoragent@cursor.com> * test(slurm_multi): contract tests + minimal example config - tests/unit/test_slurm_multi.py: nine tests across three classes - TestSlurmMultiRegistration: `slurm_multi` is in VALID_LAUNCHERS; SLURM_MULTI_ALIASES contains both canonical and hyphen forms. - TestNormalizeSlurmMultiAliases: canonical and hyphen alias both normalize to `slurm_multi`; unknown values still fall through to the existing `slurm -> docker` default. - TestSlurmMultiPrepareScript: end-to-end fixture that mirrors MAD-private PR #186's `pyt_sglang_disagg_qwen3-32b_short` entry verbatim. Asserts that prepare() takes the slurm_multi early- dispatch path, sets `_is_slurm_multi = True` for the deploy() gate, emits an SBATCH header that reflects the model card slurm block, and emits every model_info.env_vars key as `export KEY="value"` in the wrapper script. This is the explicit safety net for the Q2 decision to drop PR #86's universal docker_env_vars merge -- if slurm_multi ever stops propagating those env_vars to the workload, this test fails loudly. Existing tests in tests/unit/test_deployment.py and tests/unit/test_container_runner.py are NOT modified. - examples/slurm-configs/minimal/slurm-multi-minimal.json: small reference config alongside develop's existing *-minimal.json examples (`-f` add because *.json is gitignored, matching how existing files in this dir were originally added). Co-authored-by: Cursor <cursoragent@cursor.com> * fix(deployment): add skip_monitoring to DeploymentResult for slurm_multi bash branch The earlier feat(deployment) commit ported `_run_inside_existing_allocation` verbatim from PR #86 (used by the slurm_multi bash-in-salloc branch). That method constructs `DeploymentResult(..., skip_monitoring=True)` to signal that the script ran synchronously and the monitor phase should be skipped. Develop's `DeploymentResult` dataclass did not have that field, so the construction raised `TypeError` at runtime, the failure was swallowed by the orchestrator, and `madengine run` exited 0 even when the wrapper script failed. Two minimal additive edits to deployment/base.py: - DeploymentResult: add `skip_monitoring: bool = False` field. - BaseDeployment.execute(): change the monitor guard from `if self.config.monitor:` to `if self.config.monitor and not result.skip_monitoring:` so the bash branch can correctly bypass the SLURM job poll (there is no SLURM job to poll — the script ran inline). Behavior for non-slurm_multi launchers is unchanged: they never construct DeploymentResult with skip_monitoring=True, the field defaults to False, and the monitor() call still fires as before. Discovered during R5.1b cluster smoke test of `pyt_sglang_disagg_qwen3-32b_short` from MAD-private PR #186. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(slurm_multi): aggregate per-job perf.csv into cwd for dashboard reporter The slurm_multi model script (e.g., MAD-private's run_xPyD_models.slurm) writes its perf CSV to /shared_inference/$USER/$JOBID/perf.csv. `_collect_slurm_multi_results` correctly finds and ingests that CSV via `_collect_results_parse_perf_csv` (results['successful_runs'] is populated, manifest exit codes are right). However the CLI's post-run reporter (`display_performance_table` in cli/utils.py) reads from cwd 'perf.csv' by default, so users saw a cosmetic "Performance CSV not found: perf.csv" warning even on a successful slurm_multi run. Local and classic-SLURM flows already leave a cumulative perf.csv in cwd via update_perf_csv(); slurm_multi did not. Fix: inside `_collect_slurm_multi_results`, after the per-job CSV is located and parsed, also write its rows into the conventional cwd perf.csv path: * if cwd 'perf.csv' is absent: shutil.copy() the per-job file * if it exists (from previous runs): append data rows (skip the per-job header so the cwd file stays single-headed) Original per-job CSV at /shared_inference/$USER/$JOBID/perf.csv is not modified or deleted. Wrapped in try/except so any I/O failure degrades to a yellow warning rather than failing the run. Affects only `_collect_slurm_multi_results`, which only runs for the slurm_multi launcher dispatch added earlier on this branch. Zero effect on non-slurm_multi launchers (they still take develop's classic collect_results path which writes cwd perf.csv via update_perf_csv). Validated end-to-end: Llama-3.1-8B smoke on 3 reservation nodes (Distributed_Inference_CI), bash branch in salloc, parallel docker pull on all nodes, proxy/prefill/decode came up, benchmark concurrency=8 / 32-32 ISL/OSL ran to completion, run exit 0, cwd perf.csv now contains the per-job row (THROUGHPUT/TTFT/ITL columns match per-job source). Co-authored-by: Cursor <cursoragent@cursor.com> * docs(slurm_multi): CHANGELOG entry + forward-compat TODO on --use-image - CHANGELOG.md: new [Unreleased] section documenting the slurm_multi launcher, --use-image / --build-on-compute CLI flags, the slurm_multi build registry gate, the bash-in-salloc execution path, the DeploymentResult.skip_monitoring field, the SlurmNodeSelector reservation parameter, the new contract tests + minimal example, and the cwd perf.csv aggregation fix. - cli/commands/build.py: comment near the --use-image Typer option flagging that `is_flag=False, flag_value="auto"` is being deprecated by Typer, with the planned migration path. Behavior unchanged (matches MAD-private PR #186's documented UX). Co-authored-by: Cursor <cursoragent@cursor.com> * fix(slurm_multi): address Copilot review on PR #124 Manifest-shape correctness (Copilot C2): - _execute_with_prebuilt_image now keys built_images by model_name (one entry per model, all pointing at the same use_image) so it matches built_models and the rest of the codebase's invariant that ContainerRunner.run_models_from_manifest() relies on at built_models.get(image_name, {}). Without this, --use-image (and the implicit slurm_multi DOCKER_IMAGE_NAME gate) silently skipped every model in the run phase. Multi-model dedup TypeError fix (Copilot C3): - Replace tuple(sorted((m.get('distributed') or {}).items())) and the same pattern for slurm with json.dumps(..., sort_keys=True, default=str). The old pattern raised TypeError: unhashable type: 'dict' as soon as a model's distributed block contained nested dicts (sglang_disagg / vllm_disagg). slurm_multi wrapper completion-marker on failure (Copilot C6): - The wrapper script enables `set -e` at the top, so a non-zero exit from the model script terminated the wrapper before SCRIPT_EXIT_CODE was captured and the completion marker was written -- failed runs looked like hangs to monitor(). Wrap the bash invocation in `set +e` / `set -e` so the marker is always written. Self-managed launcher hygiene (Copilot C4, C5, C7): - Drop unused `import shutil` inside _run_self_managed. - Drop unused script_name local in _run_self_managed (only script_path is consumed by the function). - Redact env_vars values in the run log (`ENV: KEY=<set>`) so model-card credentials (HF_TOKEN, MAD_DOCKERHUB_PASSWORD, etc.) don't leak. Build orchestrator hygiene (Copilot C1, C8): - Drop top-level `import shutil` (unused). - Drop the two `requires_auth = ...` assignments in _execute_build_on_compute (assigned but never read). Validators docstring honesty (Copilot C10): - validate_additional_context's `use_image` parameter is currently informational only (parameter retained for build.py call-site compatibility); the docstring previously claimed it skipped required- field validation, which the body never implemented. Tightened the docstring to match reality. New contract tests (tests/unit/test_slurm_multi.py): - test_wrapper_disables_set_e_around_model_script: locks in C6's set +e/SCRIPT_EXIT_CODE/set -e ordering around the bash invocation. - test_built_images_and_models_share_keys: locks in C2's invariant using a fake 2-model fixture and a stubbed BuildOrchestrator. - test_multi_model_nested_dict_distributed_does_not_raise: regression for C3 using two sglang_disagg-style cards with differing nested dicts in `distributed`; would TypeError under the old code. Skipped: Copilot C9 (broader BuildOrchestrator branch tests for --use-image / --build-on-compute / registry-gating). The slurm_multi contract suite + multi-node cluster smoke already cover the slurm_multi surface end-to-end; broader orchestrator-branch coverage is a worthwhile follow-up but out of scope for this minimal-additive PR. Tests: - pytest tests/unit/test_slurm_multi.py -v -> 12 passed (was 9). - pytest tests/unit -q -> 436 passed / 2 failed; the 2 failures are the same PermissionError environmental baseline that exists on develop (test_upload_file_to_mongodb_file_not_found and test_validate_additional_context_file_not_found, both raise PermissionError on '/nonexistent/file.json' instead of FileNotFound on this filesystem). Zero new failures. Co-authored-by: Cursor <cursoragent@cursor.com> * refactor(slurm_multi): consolidate self-managed launcher detection and support multi-model build-on-compute - Extract is_self_managed_launcher() helper in deployment/common.py; replace 3 duplicated SLURM_MULTI_ALIASES + normalize blocks across slurm.py, container_runner.py, and build_orchestrator.py. Remove dead _generate_slurm_multi_command (orphaned after early-dispatch bypass). - Preserve model-card xP/yD over distributed.sglang_disagg defaults (skip-if-set in _prepare_slurm_multi_script). - Generalize _execute_build_on_compute to build/tag/push all discovered models in a single sbatch job; key manifest by model_name to match _execute_with_prebuilt_image so ContainerRunner can join entries. Apply shlex.quote to registry_host/cwd/image names (consistent with v2.0.3 hardening pass). - Tests: parametrize is_self_managed_launcher, add TestXpYdSkipIfSet and TestBuildOnComputeManifestShape covering the multi-model contract. - CHANGELOG: clarify slurm_multi as an escape-hatch launcher distinct from templated launchers. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com> * refactor(build): reuse early model discovery and expand default e2e cleanup files Cache the DiscoverModels result from the registry-check pass so the build step reuses it instead of running a second discovery (avoiding duplicate get_models_json.py execution and duplicate console output). Expand DEFAULT_CLEAN_FILES to include build_manifest.json and related perf artefacts so stale manifests from prior e2e tests can't silently cause the wrong image to be executed. Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com> * refactor: address Copilot review feedback on slurm_multi PR - slurm.py: shell-quote env_var values, model script name, and model args in the generated SLURM wrapper script to prevent shell metacharacters ($(), backticks, ;, "$, etc.) in user-supplied inputs from breaking the script or triggering host-shell expansion. Mirrors the hardening pattern in container_runner._run_self_managed. - container_runner.py: log env var keys from additional_context using the same `ENV: {key}=<set>` pattern already used for model-card env vars, so operators can see what was applied without leaking values. - docker_builder.py: warn (via rich_console) when an entry in built_images has no matching key in built_models while injecting DOCKER_IMAGE_NAME for slurm_multi parallel pulls, so future keying divergence is caught instead of silently no-opping. - build_orchestrator.py (prebuilt manifest): mark built_images entries with `local_image: True` so ContainerRunner.run_models_from_manifest() resolves run_image to use_image (and pulls on demand) instead of falling through to the else branch where run_image would be the model name (the new dict key) and _resolve_docker_image() would fail. Fixes --use-image for non-slurm_multi targets. - build_orchestrator.py (build_on_compute heredoc): shell-quote model_name (and reuse the already-quoted local/registry/dockerfile tokens) in echo lines too, so a `"` or `$` in those fields cannot break the generated sbatch script. * docs: update CHANGELOG and docs for v2.1.0 slurm_multi feature Expand v2.1.0 CHANGELOG with missing entries (local self-managed execution, model card config merge, DockerBuilder registry injection, shell-quoting security hardening, early discovery reuse, e2e cleanup defaults). Update launchers.md with full slurm_multi section (architecture, build/run workflows, salloc support). Update cli-reference.md with --use-image and --build-on-compute flags. Update deployment.md with slurm_multi, salloc, pre-built image sections, and reservation parameter. Update configuration.md with reservation, exclusive, and expanded launcher list. Update slurm-configs README with slurm_multi minimal example, workflow, and config reference. Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com> --------- Co-authored-by: raviguptaamd <ravi.gupta@amd.com> Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: Claude Opus 4 <noreply@anthropic.com>

coketaste · 2026-06-02T18:27:53Z

All the changes have been merged to main through another PR #130

raviguptaamd force-pushed the raviguptaamd/update_slurm_launcher branch from fe22e7e to 87f8b7f Compare April 16, 2026 08:05

Copilot AI review requested due to automatic review settings April 16, 2026 08:05

Copilot started reviewing on behalf of raviguptaamd April 16, 2026 08:06 View session

Copilot AI reviewed Apr 16, 2026

View reviewed changes

coketaste changed the base branch from coketaste/refactor-dis to develop April 29, 2026 14:58

coketaste requested review from Cemberk, Rohan138, coketaste, gargrahul and leconcio as code owners April 29, 2026 14:58

coketaste requested a review from Copilot April 29, 2026 15:00

Copilot AI reviewed Apr 29, 2026

View reviewed changes

Copilot AI review requested due to automatic review settings May 1, 2026 16:11

Copilot AI reviewed May 1, 2026

View reviewed changes

raviguptaamd mentioned this pull request May 1, 2026

Kubernetes deployment (split from PR #86) #118

Open

3 tasks

raviguptaamd and others added 3 commits May 2, 2026 23:44

raviguptaamd force-pushed the raviguptaamd/update_slurm_launcher branch from 8d73b3f to 50e8895 Compare May 2, 2026 23:46

Copilot AI review requested due to automatic review settings May 3, 2026 03:54

Copilot started reviewing on behalf of raviguptaamd May 3, 2026 03:55 View session

Copilot AI reviewed May 3, 2026

View reviewed changes

Copilot AI review requested due to automatic review settings May 6, 2026 01:05

Copilot started reviewing on behalf of raviguptaamd May 6, 2026 01:06 View session

raviguptaamd and others added 2 commits May 6, 2026 01:10

Copilot AI reviewed May 6, 2026

View reviewed changes

raviguptaamd and others added 2 commits May 6, 2026 03:34

Copilot AI review requested due to automatic review settings May 6, 2026 05:29

Copilot started reviewing on behalf of raviguptaamd May 6, 2026 05:31 View session

Copilot AI reviewed May 6, 2026

View reviewed changes

raviguptaamd and others added 2 commits May 6, 2026 06:31

Copilot AI review requested due to automatic review settings May 6, 2026 19:32

Copilot started reviewing on behalf of raviguptaamd May 6, 2026 19:33 View session

Copilot AI reviewed May 6, 2026

View reviewed changes

raviguptaamd and others added 2 commits May 6, 2026 22:52

Merge remote-tracking branch 'origin/develop' into pr86-test

8fec192

Co-authored-by: Cursor <cursoragent@cursor.com> # Conflicts: # CHANGELOG.md

Copilot AI review requested due to automatic review settings May 6, 2026 22:53

Copilot started reviewing on behalf of raviguptaamd May 6, 2026 22:55 View session

Copilot AI reviewed May 6, 2026

View reviewed changes

raviguptaamd mentioned this pull request May 10, 2026

feat(slurm): add slurm_multi multi-node SLURM launcher (minimal-additive port from PR #86) #124

Closed

coketaste mentioned this pull request May 27, 2026

feat(deployment): add slurm_multi self-managed multi-node SLURM launcher (Revised PR#124) #130

Merged

coketaste closed this Jun 2, 2026

-            "#SBATCH --exclusive",
-        ]
+        ]
+        if self.slurm_config.get("exclusive", True):
+            script_lines.append("#SBATCH --exclusive")

		ctx = Context(build_only_mode=True, rocm_path="/opt/rocm")
		assert ctx._rocm_path == "/opt/rocm"

		class TestConfigParser:
		"""Test cases for ConfigParser functionality."""

Conversation

raviguptaamd commented Mar 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Scope (75 files vs develop)

Commit shape

Copilot review status

Sequencing

Test plan

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Apr 16, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 16, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 16, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 16, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 16, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 16, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 16, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 16, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 16, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Uh oh!

raviguptaamd commented May 1, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

raviguptaamd commented Mar 12, 2026 •

edited

Loading

Scope (75 files vs `develop`)

Smoke test summary on `Distributed_Inference_CI` reservation