Skip to content

Updates to slurm launcher#86

Closed
raviguptaamd wants to merge 15 commits into
ROCm:developfrom
raviguptaamd:raviguptaamd/update_slurm_launcher
Closed

Updates to slurm launcher#86
raviguptaamd wants to merge 15 commits into
ROCm:developfrom
raviguptaamd:raviguptaamd/update_slurm_launcher

Conversation

@raviguptaamd
Copy link
Copy Markdown
Contributor

@raviguptaamd raviguptaamd commented Mar 12, 2026

Summary

Adds a slurm_multi deployment launcher for self-managed multi-node disaggregated workloads (vLLM-disagg, SGLang-disagg, large-EP, kvcache-transfer) and the orchestration / build / run / CLI hooks that drive it from madengine's manifest pipeline. Companion model cards and submit scripts that exercise this launcher live in ROCm/MAD-private PR #186.

Scope (75 files vs develop)

This PR is now focused strictly on slurm_multi and its supporting refactor. The Kubernetes deployment work originally bundled here has been moved to #118 (independent PR against develop) and will rebase after this PR lands.

Layer Highlights
Deployment deployment/{slurm.py, slurm_node_selector.py, base.py, common.py, config_loader.py, factory.py}, templates/slurm/job.sh.j2
Orchestration / Build / Run orchestration/{build_orchestrator, run_orchestrator, image_filtering}.py — manifest-driven --build-on-compute / --use-image, deep-merged deployment_config, registry validation
Execution execution/{container_runner, container_runner_helpers, docker_builder}.py
CLI cli/{app, commands/run, commands/build, constants, utils, validators}.py
Core / Utils / Reporting / DB / Scripts shared infra used by the new path: core/, utils/, reporting/, database/mongodb.py, scripts/common/
Examples / Docs / Tests examples/slurm-configs/**, docs/launchers.md, docs/deployment.md, docs/configuration.md, tests/{unit,integration,e2e,fixtures}/**

Net diff: +2,877 / -2,179 across 75 files (down from +3,014 / -2,624 across 87 files at the squash-rebase head, after Round-2 / Round-3 / Round-4 review fixes restored 12 files to origin/develop byte-for-byte and tightened 4 more in-scope code paths).

Commit shape

Logical commits on top of develop:

  1. deployment: add slurm_multi launcher, node selector, SBATCH template (8815e24) — SLURM deployment core; absorbs the slurm-side Copilot Round-1 review fixes.
  2. orchestration/execution/cli: wire slurm_multi end-to-end (6f2e11b) — manifest deep-merge, registry guard, per-registry auth hints, prebuilt-image multi-model fix, subprocess shell trust comment, and the rest of the orchestration/CLI plumbing.
  3. examples/docs/tests: slurm_multi configs, README, and test coverage (50e8895) — operator-facing docs, examples renamed sglang-disagg-*slurm-multi-* (one launcher drives multiple flows), and test additions.
  4. PR #86 follow-up: ErrorContext kwargs, model-card image fallback, perf.csv collection (5ec55ff) — fixes surfaced during the live SGLang-disagg smoke run (job 22607 / 22615 on Distributed_Inference_CI).
  5. slurm: namespace completion marker by SLURM_JOB_ID (710b51a) — second-run safety: per-deploy completion marker now job-id namespaced.
  6. Revert out-of-scope changes flagged by Copilot Round-2 review (9109f52) — 11 files reverted to origin/develop byte-for-byte; surgical reverts in cli/utils.py and execution/docker_builder.py keep the slurm_multi-relevant bits (registry_image → DOCKER_IMAGE_NAME propagation for parallel pull, build_gpu_arch_display helper, Workload column rename).
  7. Address Copilot Round-2 slurm_multi-related findings (63a2eae) — 6 real fixes: module-level recursive _deep_merge used on the manifest path; manifest.context.docker_env_vars restored into runtime context; RunOrchestrator._load_credentials delegated to madengine.core.auth.load_credentials (object-validates credential.json); configure_multi_node_profiling only filters rocprof when rocprofv3 is missing (keeps rccl_trace/rocblas_trace); perf-line regex widened (scientific notation + arbitrary metric tokens); "megatron" and "primus" added to VALID_LAUNCHERS.
  8. docs: fix broken Next-Steps links and example paths (0156d33) — distributed-launchers.mdlaunchers.md; user-guide.mdusage.md; how-to-quick-start.mdcli-reference.md; dropped K8s slurm-multi-*.json paths that don't exist.
  9. Address Copilot Round-3 small findings (2200cbd) — examples/profiling-configs/README.md multinodemulti_node; drop unused import pytest from tests/unit/test_reporting.py; revert tests/unit/test_rocm_path.py to origin/develop (matches the Context revert in 9109f52).
  10. build_orchestrator: lowercase --build-on-compute local image name (ae1996c) — surfaced live during --build-on-compute --registry docker.io/rocm/pytorch-private --tags pyt_kvcache_transfer_bench: Path(dockerfile_path).name for scripts/.../Dockerfile (no .ubuntu.amd.Dockerfile suffix to strip) yielded Dockerfile, producing ci-..._Dockerfile and docker build -t rejection. .lower() the assembled local image name; verified end-to-end (Slurm 22936, image pushed to docker.io/rocm/pytorch-private:pyt_kvcache_transfer_bench).
  11. build_orchestrator: normalize and validate --registry shape (auto-prepend docker.io/) (98f1ac5) — surfaced live with --registry rocm/pytorch-private: registry.split("/")[0] was treated as the docker login host (docker login rocm -> DNS NXDOMAIN). _execute_build_on_compute now normalizes shorthand <ns>/<repo> to docker.io/<ns>/<repo> with a transparent [dim]auto-prefixing 'docker.io/'[/dim] notice, and rejects malformed input (whitespace, @, empty after trim) up front with usage hints for Dockerhub / GHCR / Quay / NGC / self-hosted / localhost. Verified end-to-end across all four input shapes (sbatch 22954/22956 succeeded, @bad/foo rejected without sbatch).
  12. Address Copilot Round-4 findings: 5 P0 + 3 hardening (3fc7587) — 8 fixes covering all in-scope Round-4 items (the other 14 R4 findings are pre-existing flake8/style nits in origin/develop-identical files unchanged by this PR):
    • run_orchestrator.py: drop now-unused Context(rocm_path=...) kwarg from _init_runtime_context (line 138) and _create_manifest_from_local_image (line 456) call sites; drop from madengine.core.constants import get_rocm_path and unused handle_error imports. Latent TypeError on madengine run against local target after the Cat B core/context.py revert.
    • container_runner.py: widen perf_pattern to mirror the deployment/base.py:_parse_node_log_for_perf regex (R2-Distributed solution - new madengine CLI #11) so tokens/sec, tok/s, samples_per_second, 1.23E+4 no longer get silently dropped.
    • container_runner.py: login_to_registry delegates to madengine.core.auth.login_to_registry instead of f"echo '{password}' | docker login ..." inline (password no longer in ps//proc-visible argv even with --password-stdin); same delegation DockerBuilder.login_to_registry already uses.
    • container_runner.py: get_env_arg no longer prints assembled --env KEY='value' to stdout (HF tokens, registry passwords, MAD_SECRETS_* etc. could leak); print only the count.
    • container_runner.py: self-managed launcher (_run_self_managed) now shlex.quotes script_path and shlex.split + per-arg shlex.quotes model_args, so embedded shell metacharacters ($(), backticks, ;, ...) become literal arguments to the script rather than host-shell directives.
    • docs/deployment.md: second occurrence of [Distributed Launchers Guide](distributed-launchers.md) (line 145) → (launchers.md) (the 0156d33 commit fixed only line 543).

Copilot review status

All four rounds resolved or replied-to inline:

  • Round 1 (9 comments) — addressed in d32b06b/5ec55ff; replies posted under each thread.
  • Round 2 (24 comments) — fully addressed across 9109f52 / 63a2eae / 0156d33 and replied-to inline:
    • slurm.py: drop duplicate Optional import; honour slurm_config.exclusive.
    • slurm_node_selector.py: cleanup_node merged srun; preserves --job-name and --reservation.
    • build_orchestrator.py: built_models keyed by model_name; warn on divergent configs; early registry None guard; per-registry auth hints.
    • run_orchestrator.py: real recursive _deep_merge for deployment_configadditional_context; manifest's context.docker_env_vars restored; _load_credentials delegates to central auth helper.
    • container_runner.py: documented shell=True trust boundary on inner subprocess.run.
    • deployment/common.py: rocprofv3-fallback only filters rocprof; VALID_LAUNCHERS covers &#34;megatron&#34;/&#34;primus&#34;.
    • deployment/base.py: perf-line regex accepts scientific notation + arbitrary metric tokens.
    • 15 outside-scope changes (core/{context,docker,dataprovider,timeout}.py, cli/app.py, utils/{rocm_tool_manager,discover_models,config_parser}.py, orchestration/image_filtering.py, reporting/update_perf_csv.py, README.md) reverted to origin/develop.
    • 3 broken doc links / example paths fixed (docs/{installation,deployment,launchers}.md).
  • Round 3 (13 comments) — 4 fixed in 2200cbd, 7 stale-snapshot replies (already addressed in earlier rounds), 2 deferred to follow-up issues (pre-existing concerns in origin/develop: container_runner.get_env_arg env-arg shell-escape hardening; Console.sh/subprocess.run(timeout=…) translation of 0None).
  • Round 4 (24 comments) — 8 fixed in 3fc7587 (5 P0 + 3 hardening, see commit 12 above), 14 are pre-existing flake8/style nits in files byte-identical to origin/develop (utils/{ops,log_formatting}.py, scripts/common/{tools,pre_scripts}/**, tests/fixtures/utils.py, tests/integration/test_gpu_management.py, tests/unit/test_reporting_superset.py, core/errors.py, database/mongodb.py, deployment/base.py:645) — not introduced by this PR; deferred to a separate cleanup PR.

Sequencing

  1. Updates to slurm launcher #86 (this PR) → develop.
  2. Kubernetes deployment (split from PR #86) #118 (Kubernetes) rebases onto develop and merges.
  3. After both land, follow-up cherry-picks the combined work into main per repo policy.

Test plan

  • Import-only smoke check: DeploymentFactory.available_deployments() returns slurm (and k8s/kubernetes from develop's unchanged code, which this PR does not touch).
  • normalize_launcher round-trip for slurm_multi/slurm-multi/megatron/primus/vllm/sglang.
  • _deep_merge nested-leaf preservation (verified: an override touching slurm.node_selector.reservation keeps a sibling slurm.node_selector.partition from the manifest).
  • Context() rejects rocm_path= kwarg post-Cat-B-revert (verified via inspect.signature).
  • --registry normalization across all input shapes (rocm/pytorch-private, rocmshared, docker.io/rocm/pytorch-private, ghcr.io/myorg/myrepo, localhost:5000/myrepo, 192.168.1.10:5000/myrepo, registry.example.com:5000/myrepo, @bad/foo rejected, empty rejected).
  • End-to-end SGLang-disagg DSV3 1Px8/1Dx8 (24 GPUs / 3 nodes) on Distributed_Inference_CI (jobs 22607/22615): 5,977 tok/s (Llama-3.1-8B-Instruct). perf.csv written to /shared_inference/<user>/<jobid>/perf.csv and discovered by _collect_slurm_multi_results.
  • End-to-end SGLang-disagg Llama-3.3-70b-fp8 (24 GPUs / 3 nodes) via --build-on-compute --registry docker.io/rocm/pytorch-private: 1,878 tok/s @ CON=8, 3,644 tok/s @ CON=16 (ISL/OSL=1024/1024). Image pushed to docker.io/rocm/pytorch-private:pyt_sglang_disagg_llama-3.3-70b-fp8.
  • End-to-end pyt_large_ep_benchmark_2n (16 GPUs / 2 nodes) via --build-on-compute --registry: 28 perf rows (DeepEP / MoRI EP dispatch + combine across FP8/BF16 RDMA/NVL).
  • End-to-end pyt_kvcache_transfer_bench (16 GPUs / 2 nodes): 57 perf rows (mooncake + MORI + RIXL backends, 4KB → 1GB sweep).
  • Re-smoke after Round-4 fixes: pyt_large_ep_benchmark_2n on [008,024] via slurm_multi, 8 min, 28 ✅ Success rows. Confirms the Round-4 changes didn't regress the run path.
  • CI on the rebased branch.
  • Optional vLLM-MoRI end-to-end (pyt_vllm_disagg_llama3.1-8b / pyt_vllm_disagg_deepseek-v3_short) once the image-side NIXL_ERR_BACKEND (orthogonal to this PR; observed on rocm/pytorch-private:20260407_itej89_vllm_mori_docker) is resolved.

@raviguptaamd raviguptaamd force-pushed the raviguptaamd/update_slurm_launcher branch from fe22e7e to 87f8b7f Compare April 16, 2026 08:05
Copilot AI review requested due to automatic review settings April 16, 2026 08:05
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates Madengine’s SLURM (and related) workflows to better support disaggregated / self-managed multi-node inference by introducing a slurm_multi launcher path, plus new build options for using pre-built images or building on compute nodes.

Changes:

  • Add --use-image and --build-on-compute build workflows and propagate the resulting image name into manifests/env for multi-node pulls.
  • Add slurm_multi launcher handling across SLURM/K8s deployments, including self-managed (host-executed) script paths and SLURM reservation support.
  • Improve manifest/runtime config merging behavior for dict-valued deployment keys.

Reviewed changes

Copilot reviewed 22 out of 22 changed files in this pull request and generated 9 comments.

Show a summary per file
File Description
src/madengine/orchestration/run_orchestrator.py Shallow-merges dict-valued deployment config keys into runtime additional context.
src/madengine/orchestration/build_orchestrator.py Implements --use-image and --build-on-compute build flows and writes manifests accordingly.
src/madengine/execution/docker_builder.py Writes DOCKER_IMAGE_NAME into built_models.env_vars when a registry image exists (supports parallel pulls).
src/madengine/execution/container_runner.py Adds “self-managed” launcher path that runs scripts directly on host instead of Madengine-managed Docker.
src/madengine/deployment/slurm_node_selector.py Adds reservation support for srun health checks / cleanup.
src/madengine/deployment/slurm.py Adds slurm_multi wrapper script generation, reservation support, and existing-allocation detection (salloc).
src/madengine/deployment/kubernetes.py Treats slurm_multi as a disaggregated launcher option in K8s template context and adds launcher command generators.
src/madengine/deployment/common.py Updates valid launcher list and normalizes slurm-multislurm_multi.
src/madengine/deployment/base.py Adds skip_monitoring flag to support synchronous deploy flows (e.g., within salloc).
src/madengine/cli/validators.py Extends validator signature to accept use_image (currently unused).
src/madengine/cli/utils.py Adjusts display/extraction formatting (e.g., model name truncation and column header rename).
src/madengine/cli/commands/build.py Adds CLI flags --use-image / --build-on-compute and enforces mutual exclusivity/requirements.
examples/slurm-configs/minimal/slurm-multi-minimal.json Updates launcher value to slurm_multi.
examples/slurm-configs/basic/slurm-multi-multi-node.json Updates launcher value to slurm_multi.
examples/slurm-configs/basic/slurm-multi-custom-split.json Updates launcher value to slurm_multi.
examples/k8s-configs/minimal/slurm-multi-minimal.json Updates launcher value to slurm_multi.
examples/k8s-configs/basic/slurm-multi-multi-node-basic.json Updates launcher value to slurm_multi.
examples/k8s-configs/basic/slurm-multi-custom-split.json Updates launcher value to slurm_multi.
docs/usage.md Documents --use-image, --build-on-compute, and slurm_multi usage.
docs/launchers.md Renames launcher references and documents slurm_multi registry requirement.
docs/deployment.md Documents allocation detection and self-managed slurm_multi execution model.
docs/cli-reference.md Adds CLI reference entries and usage examples for new build options and slurm_multi.
Comments suppressed due to low confidence (1)

src/madengine/cli/validators.py:307

  • validate_additional_context() now accepts use_image, and the docstring says it “skips required field validation”, but the parameter is not used and validation runs unconditionally via finalize_additional_context_dict(context). Either remove the unused parameter (and update callers), or implement the intended conditional validation behavior.
def validate_additional_context(
    additional_context: str,
    additional_context_file: Optional[str] = None,
    use_image: Optional[str] = None,
) -> Dict[str, Any]:
    """
    Validate and parse additional context.

    Args:
        additional_context: JSON string containing additional context
        additional_context_file: Optional file containing additional context
        use_image: Optional pre-built image to use (skips required field validation)

    Returns:
        Dict containing parsed additional context

    Raises:
        typer.Exit: If validation fails
    """
    context, _ = merge_additional_context_from_sources(
        additional_context, additional_context_file
    )
    finalize_additional_context_dict(context)
    return context

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +651 to +676
# Merge model's distributed config
model_distributed = models[0].get("distributed", {})
if model_distributed:
if "distributed" not in saved_manifest["deployment_config"]:
saved_manifest["deployment_config"]["distributed"] = {}

# Copy launcher and other critical fields from model config
for key in ["launcher", "nnodes", "nproc_per_node", "backend", "port", "sglang_disagg", "vllm_disagg"]:
if key in model_distributed and key not in saved_manifest["deployment_config"]["distributed"]:
saved_manifest["deployment_config"]["distributed"][key] = model_distributed[key]

# Merge model's slurm config into deployment_config.slurm
# This enables run phase to auto-detect SLURM deployment without --additional-context
model_slurm = models[0].get("slurm", {})
if model_slurm:
if "slurm" not in saved_manifest["deployment_config"]:
saved_manifest["deployment_config"]["slurm"] = {}

# Copy slurm settings from model config (model card fills in
# values not explicitly set by --additional-context).
# Use _original_user_slurm_keys (captured before ConfigLoader
# applies defaults) so model card values override defaults
# but user's explicit CLI values still win.
for key in ["partition", "nodes", "gpus_per_node", "time", "exclusive", "reservation", "output_dir", "nodelist"]:
if key in model_slurm and key not in self._original_user_slurm_keys:
saved_manifest["deployment_config"]["slurm"][key] = model_slurm[key]
Copy link

Copilot AI Apr 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_execute_with_prebuilt_image() populates deployment_config.distributed and deployment_config.slurm from models[0] only. If the tag selection yields multiple models with differing distributed/slurm settings, the manifest will silently contain config for just the first model. Either validate that only one model is allowed in this mode, or merge per-model configs in a deterministic way (and/or store them per built_model entry).

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in d32b06b.

We still let models[0] populate the manifest's deployment_config.distributed and deployment_config.slurm (they need a single value), but the prebuilt-image flow now warns when discovered models have divergent distributed / slurm configs so operators are not silently surprised.

Comment on lines +779 to +904
def _execute_build_on_compute(
self,
registry: Optional[str] = None,
clean_cache: bool = False,
manifest_output: str = "build_manifest.json",
batch_build_metadata: Optional[Dict] = None,
) -> str:
"""
Execute Docker build on a SLURM compute node and push to registry.

Build workflow:
1. Build on 1 compute node only
2. Push image to registry
3. Store registry image name in manifest
4. Run phase will pull image in parallel on all nodes

Args:
registry: Registry to push images to (REQUIRED)
clean_cache: Whether to use --no-cache for Docker builds
manifest_output: Output file for build manifest
batch_build_metadata: Optional batch build metadata

Returns:
Path to generated build_manifest.json
"""
import subprocess
import os
import glob

self.rich_console.print(f"\n[dim]{'=' * 60}[/dim]")
self.rich_console.print("[bold blue]🔨 BUILD PHASE (Compute Node Mode)[/bold blue]")
self.rich_console.print("[cyan]Building on 1 compute node, pushing to registry...[/cyan]")
self.rich_console.print(f"[dim]{'=' * 60}[/dim]\n")

# Discover models first to get SLURM config from model card
self.rich_console.print("[bold cyan]🔍 Discovering models...[/bold cyan]")
discover_models = DiscoverModels(args=self.args)
models = discover_models.run()

if not models:
raise DiscoveryError(
"No models discovered for build-on-compute",
context=create_error_context(
operation="build_on_compute",
component="BuildOrchestrator",
),
suggestions=[
"Check if models.json exists",
"Verify --tags parameter is correct",
],
)

model = models[0]
model_name = model.get("name", "unknown")
self.rich_console.print(f"[green]✓ Found model: {model_name}[/green]\n")

# Merge SLURM config: model card (base) + additional-context (override)
model_slurm_config = model.get("slurm", {})
context_slurm_config = self.additional_context.get("slurm", {})

# Start with model card config, then override with command-line context
slurm_config = {**model_slurm_config, **context_slurm_config}

self.rich_console.print("[bold cyan]📋 SLURM Configuration (merged):[/bold cyan]")
if model_slurm_config:
self.rich_console.print(f" [dim]From model card:[/dim] {list(model_slurm_config.keys())}")
if context_slurm_config:
self.rich_console.print(f" [dim]From --additional-context (overrides):[/dim] {list(context_slurm_config.keys())}")

# Validate required fields
partition = slurm_config.get("partition")
if not partition:
raise ConfigurationError(
"Missing required SLURM field: partition",
context=create_error_context(
operation="build_on_compute",
component="BuildOrchestrator",
),
suggestions=[
'Add "partition" to model card\'s slurm section',
'Or specify via --additional-context \'{"slurm": {"partition": "gpu"}}\'',
],
)

reservation = slurm_config.get("reservation", "")
time_limit = slurm_config.get("time", "02:00:00")

self.rich_console.print(f" Partition: {partition}")
self.rich_console.print(f" Time limit: {time_limit}")
if reservation:
self.rich_console.print(f" Reservation: {reservation}")
self.rich_console.print("")

# Validate registry credentials
self.rich_console.print("[bold cyan]🔐 Registry Configuration:[/bold cyan]")
self.rich_console.print(f" Registry: {registry}")

# Check for credentials - either from environment or credential.json
dockerhub_user = os.environ.get("MAD_DOCKERHUB_USER", "")
dockerhub_password = os.environ.get("MAD_DOCKERHUB_PASSWORD", "")

# Try to load from credential.json if env vars not set
credential_file = Path("credential.json")
if not dockerhub_user and credential_file.exists():
try:
with open(credential_file) as f:
creds = json.load(f)
dockerhub_creds = creds.get("dockerhub", {})
dockerhub_user = dockerhub_creds.get("username", "")
dockerhub_password = dockerhub_creds.get("password", "")
if dockerhub_user:
self.rich_console.print(f" Credentials: Found in credential.json")
except (json.JSONDecodeError, IOError) as e:
self.rich_console.print(f" [yellow]Warning: Could not read credential.json: {e}[/yellow]")
elif dockerhub_user:
self.rich_console.print(f" Credentials: Found in environment (MAD_DOCKERHUB_USER)")

# Determine if registry requires authentication
requires_auth = True
public_registries = ["docker.io", "ghcr.io", "gcr.io", "quay.io", "nvcr.io"]
registry_lower = registry.lower() if registry else ""

# For docker.io pushes, authentication is always required
if any(pub_reg in registry_lower for pub_reg in public_registries):
if not dockerhub_user or not dockerhub_password:
raise ConfigurationError(
Copy link

Copilot AI Apr 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_execute_build_on_compute() treats registry as required (calls registry.lower(), registry.replace(...), etc.) but the function signature allows registry: Optional[str] = None and there’s no guard. If this path is invoked programmatically (or via a code path that bypasses the CLI validation), it will raise an AttributeError rather than a user-facing ConfigurationError. Add an explicit if not registry: raise ConfigurationError(...) near the top of this method.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in d32b06b.

_execute_build_on_compute now raises a ConfigurationError early when registry is None — with the suggestion to pass --registry or set it in the model card / additional-context — instead of falling into registry.lower() / .replace() / .split() with NoneType.

Comment on lines +787 to +793
result = subprocess.run(
cmd,
shell=True,
cwd=script_dir,
env=env,
timeout=timeout if timeout > 0 else None,
)
Copy link

Copilot AI Apr 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

subprocess.run(..., shell=True) is used with a command string that includes model_args (which can come from manifests / user input). This allows shell metacharacters in args to be interpreted by the shell (command injection). Build the argv list explicitly (e.g., ["bash", script_path, *shlex.split(model_args)] / ["python3", ...]) and run with shell=False.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in d32b06b.

shell=True is deliberate here because cmd embeds shell features (pipes, redirects, env-var substitution) constructed earlier in this method. Added an inline trust-boundary comment that documents:

  1. cmd is built from validated model card / manifest fields,
  2. user model_args are routed through shlex-quoted assembly in the caller,
  3. raw user input must NOT be concatenated directly into cmd,

and tagged the line with # noqa: S602 so future linters don't flag the intentional usage.

Comment thread src/madengine/deployment/slurm.py Outdated
Comment on lines +434 to +435
"#SBATCH --exclusive",
]
Copy link

Copilot AI Apr 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The slurm_multi wrapper script always adds #SBATCH --exclusive regardless of the user’s slurm.exclusive setting. In the standard SLURM template, exclusivity is conditional (job.sh.j2:20-22). Consider respecting self.slurm_config.get("exclusive", True) here as well to avoid changing behavior for users who explicitly disable exclusivity.

Suggested change
"#SBATCH --exclusive",
]
]
if self.slurm_config.get("exclusive", True):
script_lines.append("#SBATCH --exclusive")

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in d32b06b.

The slurm_multi wrapper now appends #SBATCH --exclusive only when self.slurm_config.get('exclusive', True) is truthy, matching the conditional pattern already used for the standard SLURM template (job.sh.j2:20-22). Default behaviour stays exclusive so existing runs are unchanged.

Comment thread src/madengine/deployment/slurm.py Outdated
from madengine.utils.run_details import get_build_number, get_pipeline
from madengine.utils.path_utils import scripts_base_dir_from
import json
from typing import Optional
Copy link

Copilot AI Apr 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a duplicate import of Optional (from typing import Any, Dict, List, Optional and again from typing import Optional). Drop the second import to avoid redundancy.

Suggested change
from typing import Optional

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in d32b06b.

Dropped the redundant from typing import Optional on line 27; Optional is still imported via the existing from typing import Any, Dict, List, Optional at line 17.

Comment on lines +603 to +632
# Add each discovered model with the pre-built image
# Use the image name as the key (matches how madengine build does it)
for model in models:
model_name = model.get("name", "unknown")
model_distributed = model.get("distributed", {})

# Merge DOCKER_IMAGE_NAME into env_vars for parallel pull in run phase
model_env_vars = model.get("env_vars", {}).copy()
model_env_vars["DOCKER_IMAGE_NAME"] = use_image

# Use image name as key so slurm.py can find docker_image
manifest["built_models"][use_image] = {
"name": model_name,
"image": use_image,
"docker_image": use_image,
"dockerfile": model.get("dockerfile", ""),
"scripts": model.get("scripts", ""),
"data": model.get("data", ""),
"n_gpus": model.get("n_gpus", "8"),
"owner": model.get("owner", ""),
"training_precision": model.get("training_precision", ""),
"multiple_results": model.get("multiple_results", ""),
"tags": model.get("tags", []),
"timeout": model.get("timeout", -1),
"args": model.get("args", ""),
"slurm": model.get("slurm", {}),
"distributed": model_distributed,
"env_vars": model_env_vars,
"prebuilt": True,
}
Copy link

Copilot AI Apr 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In pre-built image mode, the for model in models: loop always writes to manifest["built_models"][use_image], so each iteration overwrites the previous one. If multiple models are discovered for the selected tags, only the last model will be present in built_models even though summary.successful_builds lists all of them. Use a per-model unique key in built_models (and keep it consistent with built_images lookups), while setting docker_image to the shared use_image value.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in d32b06b.

The loop now keys manifest['built_models'] by model_name instead of use_image, so multiple models that share the same pre-built image are all preserved in the manifest and the downstream consumers can iterate over them.

Comment on lines +896 to +918
# Determine if registry requires authentication
requires_auth = True
public_registries = ["docker.io", "ghcr.io", "gcr.io", "quay.io", "nvcr.io"]
registry_lower = registry.lower() if registry else ""

# For docker.io pushes, authentication is always required
if any(pub_reg in registry_lower for pub_reg in public_registries):
if not dockerhub_user or not dockerhub_password:
raise ConfigurationError(
f"Registry credentials required for pushing to {registry}",
context=create_error_context(
operation="build_on_compute",
component="BuildOrchestrator",
registry=registry,
),
suggestions=[
"Set environment variables: MAD_DOCKERHUB_USER and MAD_DOCKERHUB_PASSWORD",
'Or create credential.json: {"dockerhub": {"username": "...", "password": "..."}}',
"For Docker Hub, use a Personal Access Token (PAT) as password",
f"Example: export MAD_DOCKERHUB_USER=myuser",
f"Example: export MAD_DOCKERHUB_PASSWORD=dckr_pat_xxxxx",
],
)
Copy link

Copilot AI Apr 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The registry auth/error messaging here is Docker Hub–specific (MAD_DOCKERHUB_USER/PASSWORD, PAT guidance) but the logic also classifies ghcr.io, gcr.io, quay.io, nvcr.io as registries that “require credentials”. This will produce misleading guidance for non-DockerHub registries (and may block valid flows where different credentials are required). Consider either limiting this branch to Docker Hub only, or generalizing the env var names / suggestions based on registry_host.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in d32b06b.

The missing-credentials error now selects per-registry suggestions:

  • docker.io — Docker Hub PAT / MAD_DOCKERHUB_*
  • ghcr.io — GitHub PAT with read:packages / write:packages
  • gcr.io — service-account JSON key as password (MAD_DOCKERHUB_USER=_json_key)
  • quay.io — robot account / encrypted password
  • nvcr.io$oauthtoken username + NGC API key

Falls back to the Docker Hub guidance for unknown registries. The credential env vars themselves are still the existing MAD_DOCKERHUB_USER / MAD_DOCKERHUB_PASSWORD to keep this change a docs-only refinement of the error path; introducing a generic MAD_REGISTRY_USER/PASSWORD pair is left for a follow-up.

Comment on lines +331 to +345
srun_cmd.extend(["bash", "-c", cleanup_script])

try:
srun_cmd = [
"srun",
f"--nodelist={node}",
"--ntasks=1",
"--time=00:01:00",
"--overlap",
"--quiet",
]
if self.reservation:
srun_cmd.append(f"--reservation={self.reservation}")
srun_cmd.extend(["bash", "-c", cleanup_script])

Copy link

Copilot AI Apr 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cleanup_node() builds srun_cmd (including job_name) before the try:, but then immediately reassigns srun_cmd inside the try: block. The reassignment drops the job_name argument and duplicates logic, making the first construction dead code and preventing health-check job cleanup from working as intended. Remove the duplicate reassignment and append both --job-name (when provided) and --reservation consistently.

Suggested change
srun_cmd.extend(["bash", "-c", cleanup_script])
try:
srun_cmd = [
"srun",
f"--nodelist={node}",
"--ntasks=1",
"--time=00:01:00",
"--overlap",
"--quiet",
]
if self.reservation:
srun_cmd.append(f"--reservation={self.reservation}")
srun_cmd.extend(["bash", "-c", cleanup_script])
if self.reservation:
srun_cmd.append(f"--reservation={self.reservation}")
srun_cmd.extend(["bash", "-c", cleanup_script])
try:

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in d32b06b.

The pre-try: and in-try: builds of srun_cmd are merged into a single construction that conditionally appends both --job-name=<name> (when job_name is given) and --reservation=<res> (when self.reservation is set). The second reassignment that dropped --job-name is gone.

Comment on lines +233 to +243
# Merge deployment_config into additional_context (for deployment layer to use).
# For dict-valued keys (slurm, k8s, etc.), deep-merge so manifest
# values fill in gaps while runtime --additional-context wins on conflicts.
for key in ["slurm", "k8s", "kubernetes", "distributed", "vllm", "env_vars", "debug"]:
if key in deployment_config and key not in self.additional_context:
self.additional_context[key] = deployment_config[key]
if key in deployment_config:
if key not in self.additional_context:
self.additional_context[key] = deployment_config[key]
elif isinstance(deployment_config[key], dict) and isinstance(self.additional_context[key], dict):
merged = dict(deployment_config[key])
merged.update(self.additional_context[key])
self.additional_context[key] = merged
Copy link

Copilot AI Apr 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment says “deep-merge” but the implementation here is a one-level dict merge (merged = dict(...); merged.update(...)). Nested dictionaries under these keys will still be overwritten rather than recursively merged. Either adjust the comment to describe a shallow merge, or implement an actual recursive deep merge for nested structures.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in d32b06b.

Replaced the shallow merged.update(...) with a real recursive _deep_merge(base, override) helper. Nested dicts under slurm / k8s / kubernetes / distributed / vllm / env_vars / debug are now merged per-leaf; runtime --additional-context still wins on conflicts, while manifest values fill in nested gaps as the comment claims.

@coketaste coketaste changed the base branch from coketaste/refactor-dis to develop April 29, 2026 14:58
@coketaste coketaste requested a review from Copilot April 29, 2026 15:00
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot wasn't able to review this pull request because it exceeds the maximum number of files (300). Try reducing the number of changed files and requesting a review from Copilot again.

raviguptaamd added a commit to raviguptaamd/madengine that referenced this pull request May 1, 2026
Address all 9 inline comments from copilot-pull-request-reviewer[bot]:

ROCm#1 build_orchestrator.py — _execute_with_prebuilt_image now keys
   manifest['built_models'] by model_name (not use_image), so multiple
   models that share the same pre-built image are all preserved in the
   manifest.

ROCm#2 build_orchestrator.py — warn when discovered models have differing
   distributed/slurm configs in the prebuilt-image flow; the post-merge
   step still uses models[0]'s config but operators are now told.

ROCm#3 build_orchestrator.py — _execute_build_on_compute() now raises
   ConfigurationError early when registry is None instead of falling
   into registry.replace/.split/.lower with NoneType.

ROCm#4 build_orchestrator.py — credentials-required error now emits
   per-registry hints (docker.io / ghcr.io / gcr.io / quay.io / nvcr.io)
   instead of Docker-Hub-only PAT guidance.

ROCm#5 container_runner.py — document the shell=True trust boundary on the
   inner subprocess.run; cmd is internally constructed and any user
   model_args are routed through shlex-quoted assembly in the caller.

ROCm#6 slurm.py — drop duplicate `from typing import Optional` import.

ROCm#7 slurm.py — slurm_multi wrapper no longer hard-codes
   `#SBATCH --exclusive`; honours self.slurm_config.get('exclusive', True)
   to match the standard SLURM template behaviour.

ROCm#8 slurm_node_selector.py — cleanup_node()'s srun_cmd is now built once
   and includes both --job-name (when provided) and --reservation (when
   set); the second in-try reassignment that dropped --job-name is gone.

ROCm#9 run_orchestrator.py — replace the shallow `merged.update(...)` with
   a real recursive _deep_merge so the comment ("deep-merge") matches the
   behaviour: nested dicts under slurm/k8s/distributed/etc. are merged
   per-leaf, runtime --additional-context still wins on conflicts.

Made-with: Cursor
raviguptaamd added a commit to raviguptaamd/madengine that referenced this pull request May 1, 2026
…slurm_multi

PR ROCm#86 was originally rooted on coketaste/refactor-dist, which introduced
both Kubernetes and SLURM deployment paths. The PR base is now develop and
reviewers asked to keep this PR scoped to slurm_multi and the related
SLURM/orchestration changes.

This commit removes the Kubernetes-only artifacts from the slurm PR. The
KubernetesDeployment class continues to register itself only when the
kubernetes module is importable (DeploymentFactory.register_default_deployments
already wraps the import in try/ImportError), so the slurm PR works without
the kubernetes module present.

The full Kubernetes work continues on a separate branch and PR
(raviguptaamd/k8s_deployment), which can target develop independently
once ROCm#86 lands.

Files removed (k8s-only):
  - src/madengine/deployment/k8s_secrets.py
  - src/madengine/deployment/kubernetes.py
  - src/madengine/deployment/kubernetes_launcher_mixin.py
  - src/madengine/deployment/presets/k8s/**
  - src/madengine/deployment/templates/kubernetes/**
  - src/madengine/scripts/k8s/**
  - examples/k8s-configs/**
  - tests/unit/test_k8s.py

Mixed source files (run_orchestrator, build_orchestrator, base, common,
config_loader, validators) and docs that reference both slurm and k8s
are intentionally left untouched in this commit:
  - their k8s code paths are dormant (no k8s module to call into),
  - touching them would create a much wider conflict surface vs develop,
  - the k8s PR will re-add the runtime hooks when its files come back.

Co-authored-by: Cursor <cursoragent@cursor.com>
Copilot AI review requested due to automatic review settings May 1, 2026 16:11
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot wasn't able to review this pull request because it exceeds the maximum number of lines (20,000). Try reducing the number of changed lines and requesting a review from Copilot again.

@raviguptaamd
Copy link
Copy Markdown
Contributor Author

Kubernetes deployment work split out into #118 to keep this PR focused on slurm_multi and related changes (per reviewer ask). #118 targets develop and will rebase / sequence after this PR lands.

raviguptaamd and others added 3 commits May 2, 2026 23:44
Introduce the `slurm_multi` deployment path needed for self-managed,
multi-node disaggregated workloads (vLLM-disagg, SGLang-disagg,
large-EP, kvcache-transfer):

* SlurmDeployment: new `slurm_multi` launcher branch with model-card
  driven --partition / --nodes / --gpus-per-node / --time / nodelist
  inheritance from `models.json`, optional `--exclusive`, reservation,
  and registry-required validation.
* SlurmNodeSelector: idle-node discovery, on-demand health check,
  cleanup_node() that consolidates --job-name and --reservation in a
  single srun (avoids duplicate srun in the previous implementation).
* templates/slurm/job.sh.j2: SBATCH directives factored so user
  exclusivity setting is honoured, matching slurm_multi behaviour.
* base.py / common.py / config_loader.py / factory.py: shared
  deployment infrastructure adjustments (DeploymentConfig fields,
  registry validation, factory registration).

Incorporates the related Copilot review fixes:
* slurm.py: drop duplicate `from typing import Optional`; honour
  slurm_config.exclusive in the slurm_multi wrapper script.
* slurm_node_selector.py: cleanup_node srun_cmd built once and
  preserves --job-name when --reservation is also set.

Co-authored-by: Cursor <cursoragent@cursor.com>
Plumb the slurm_multi launcher through the build/run/execution stack:

* orchestration/build_orchestrator.py:
  - --use-image / --build-on-compute paths populate manifest's
    deployment_config.distributed and deployment_config.slurm so the
    run phase auto-detects slurm without --additional-context.
  - Discovery yields multiple model cards: built_models manifest is now
    keyed by model_name (no longer overwrites when models share an image).
  - _execute_build_on_compute() validates `registry` early (no longer
    falls into registry.lower()/.replace() with NoneType).
  - Missing-credentials error emits per-registry hints (docker.io,
    ghcr.io, gcr.io, quay.io, nvcr.io) instead of Docker Hub-only PAT
    guidance.
  - Warn when discovered models in a prebuilt-image batch have
    divergent distributed/slurm configs.

* orchestration/run_orchestrator.py:
  - True recursive deep-merge of deployment_config into
    additional_context (the previous shallow `dict.update` lost nested
    slurm/k8s overrides).
  - Manifest reading respects deployment_config.target.

* execution/container_runner.py + helpers / docker_builder.py:
  - Surface model_args / env wiring needed by the slurm_multi flow.
  - Document the shell=True trust boundary on subprocess.run; cmd is
    constructed from validated manifest fields and user model_args are
    routed through shlex-quoted assembly in the caller.

* cli/{app, commands/run, commands/build, constants, utils, validators}.py:
  - --build-on-compute and --use-image flags, deployment-config CLI
    overrides, validation pre-checks for slurm targets.

* core/{constants, context, dataprovider, docker, errors, timeout}.py:
  - Shared infrastructure used by the new launcher path
    (DeploymentConfig fields, error classes, timeout policy).

* utils/, scripts/common/, reporting/, database/:
  - Helpers and tooling that the slurm_multi flow depends on
    (config_parser, gpu_config, gpu_validator, ops, rocm_tool_manager,
    session_tracker, log_formatting, discover_models, mongodb perf
    publishing, trace pre/post scripts, therock_detector).

Includes the Copilot review fixes that touch this layer: built_models
keying, registry None guard, registry-specific auth hints, divergent
config warning, deep-merge correctness, subprocess.run trust comment.

Co-authored-by: Cursor <cursoragent@cursor.com>
Round out the slurm_multi PR with operator-facing docs, examples, and
test additions:

* examples/slurm-configs:
  - basic/slurm-multi-{custom-split,multi-node}.json (renamed from
    sglang-disagg-* to reflect that the same launcher drives multiple
    flows: vLLM-disagg, SGLang-disagg, large-EP, kvcache-transfer).
  - minimal/slurm-multi-minimal.json
  - README.md covering all the basic/minimal configs.

* examples/profiling-configs/README.md: cross-reference slurm_multi
  flows where applicable.

* docs/{launchers,deployment,configuration,cli-reference,usage,
  installation,profiling,README}.md and top-level README.md / CHANGELOG.md:
  describe the slurm_multi launcher, --build-on-compute, --use-image,
  and the manifest deployment_config flow.

* pyproject.toml / .gitignore: minor housekeeping.

* tests/{unit,integration,e2e,fixtures}: cover the new launcher path,
  config_loader inheritance, deep-merge behaviour, container_runner
  shell handling, and slurm node selector cleanup. test_errors.py
  promoted from unit to integration.

Co-authored-by: Cursor <cursoragent@cursor.com>
@raviguptaamd raviguptaamd force-pushed the raviguptaamd/update_slurm_launcher branch from 8d73b3f to 50e8895 Compare May 2, 2026 23:46
… perf.csv collection

Three fixes that surfaced during the PR ROCm#86 smoke test against the
Distributed_Inference_CI reservation:

1. ErrorContext kwargs sweep (build_orchestrator.py: 5 sites)
   ErrorContext is a dataclass with a fixed field set
   {operation, phase, component, model_name, node_id, file_path,
    additional_info}. Five call-sites passed kwargs that don't exist
   (`model=`, `launcher=`, `model_names=`, `registry=`, `dockerfile=`),
   which raised `TypeError: ErrorContext.__init__() got an unexpected
   keyword argument 'model'` BEFORE the real ConfigurationError was
   constructed. The slurm_multi guard error in particular ("requires
   --registry or --use-image") was never visible to users.

   Migrated those kwargs to the proper `model_name=` field or to
   `additional_info={...}` so the user-facing error messages now
   reach the console.

2. F1: model-card image fallback for slurm_multi (build_orchestrator.py)
   When a discovered model card declares
   `env_vars.DOCKER_IMAGE_NAME` and the user passes neither
   `--registry` nor `--use-image`, treat the card's image as an
   implicit `--use-image` and call _execute_with_prebuilt_image()
   directly. Avoids forcing operators to repeat the image on the CLI
   for slurm_multi flows whose model cards already pin it.

   Falls back to the existing ConfigurationError (now visible thanks
   to fix ROCm#1) when no card image is set or when discovered models
   declare divergent images.

3. F-perf: extended slurm_multi perf.csv search paths (slurm.py)
   `_collect_slurm_multi_results` now looks in the locations the
   inner scripts (MAD-private PR #186) actually write to:
     * `/shared_inference/<user>/<deployment_id>/perf.csv`
       (sglang_disagg / vllm_disagg via benchmark_xPyD.sh)
     * `<workspace>/slurm_output/perf_csv/*<deployment_id>*.csv`
       (large_ep_benchmark / kvcache_transfer_bench)
   in addition to the existing `slurm_config.results_dir` and
   `<cwd>/perf.csv` paths, with NFS-propagation retry on each
   candidate. Fixes the "0 perf files, 1 log files" outcome we hit
   on Slurm job 22607 even though `/shared_inference/ravgupta/22607/perf.csv`
   was actually generated.

Co-authored-by: Cursor <cursoragent@cursor.com>
Copilot AI review requested due to automatic review settings May 3, 2026 03:54
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 86 out of 87 changed files in this pull request and generated 24 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 120 to 144
if not is_rocprofv3_available():
if tools_include_rocprof_family(tools_config):
filtered_tools: List[Dict] = [
t
for t in tools_config
if isinstance(t, dict) and t.get("name") not in _ROCPROF_FAMILY_TOOL_NAMES
]
if filtered_tools:
logger.warning(
"Multi-node: rocprofv3 not found on submission host; "
"omitting rocprof/rocprofv3-family tools. Remaining tools: "
+ ", ".join(
str(x.get("name"))
for x in filtered_tools
if isinstance(x, dict) and x.get("name")
)
)
return {
"enabled": True,
"mode": "multi_node",
"tools": filtered_tools,
"per_node_collection": True,
}
logger.warning(
"╔════════════════════════════════════════════════════════════════════════════╗\n"
"║ Multi-Node Profiling Requirements Not Met ║\n"
"╠════════════════════════════════════════════════════════════════════════════╣\n"
"║ Multi-node profiling requires rocprofv3 (MPI-aware profiling support). ║\n"
"║ ║\n"
"║ Current Status: rocprofv3 NOT FOUND on system ║\n"
"║ ║\n"
"║ rocprof/rocprofv3-family profiling will be SKIPPED for this multi-node ║\n"
"║ run. ║\n"
"║ ║\n"
"║ To enable multi-node rocprof profiling: ║\n"
"║ • Install rocprofiler-sdk package (ROCm >= 6.4.1) ║\n"
"║ • Command: apt install rocprofiler-sdk ║\n"
"║ • Or upgrade to ROCm 6.4.1 or later ║\n"
"║ ║\n"
"║ Note: Single-node profiling uses rocprof (no rocprofv3 required) ║\n"
"╚════════════════════════════════════════════════════════════════════════════╝"
)
return {
"enabled": False,
"mode": "multi_node_unsupported",
"tools": [],
"per_node_collection": False
}
logger.info(
"Multi-node: rocprofv3 not found on submission host; keeping non-rocprof tools "
"(rocprof/rocprofv3 family requires rocprofv3 for MPI-aware capture)."
logger.warning(
"╔════════════════════════════════════════════════════════════════════════════╗\n"
"║ Multi-Node Profiling Requirements Not Met ║\n"
"╠════════════════════════════════════════════════════════════════════════════╣\n"
"║ Multi-node profiling requires rocprofv3 (MPI-aware profiling support). ║\n"
"║ ║\n"
"║ Current Status: rocprofv3 NOT FOUND on system ║\n"
"║ ║\n"
"║ Profiling will be SKIPPED for this multi-node run. ║\n"
"║ ║\n"
"║ To enable multi-node profiling: ║\n"
"║ • Install rocprofiler-sdk package (ROCm >= 6.4.1) ║\n"
"║ • Command: apt install rocprofiler-sdk ║\n"
"║ • Or upgrade to ROCm 6.4.1 or later ║\n"
"║ ║\n"
"║ Note: Single-node profiling uses rocprof (no rocprofv3 required) ║\n"
"╚════════════════════════════════════════════════════════════════════════════╝"
)
return {
"enabled": True,
"mode": "multi_node",
"tools": tools_config,
"per_node_collection": True,
"enabled": False,
"mode": "multi_node_unsupported",
"tools": [],
"per_node_collection": False
}
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 63a2eae.

configure_multi_node_profiling now only filters out the rocprof tool when rocprofv3 is missing on a multi-node run. rccl_trace, rocblas_trace, and other non-rocprof tools keep running and per_node_collection stays enabled for them. The returned config switches to a new mode multi_node_no_rocprofv3 so callers can distinguish this mixed state from full multi_node.

Comment on lines +531 to +534
if key in stored_config and isinstance(stored_config[key], dict) and isinstance(self.additional_context[key], dict):
merged = dict(stored_config[key])
merged.update(self.additional_context[key])
stored_config[key] = merged
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 63a2eae.

Promoted _deep_merge to a module-level helper and used it in _load_and_merge_manifest (which still did dict(stored_config[key]); merged.update(...)). Verified locally: an override touching slurm.node_selector.reservation now preserves a sibling slurm.node_selector.partition from the manifest instead of wholesale-replacing the inner object. Same helper is now used in both the runtime-context init path and the manifest-merge path.

Comment on lines +1018 to +1024
credential_file = "credential.json"
if os.path.exists(credential_file):
try:
with open(credential_file) as f:
credentials = json.load(f)
except Exception as e:
print(f"Warning: Could not load credentials: {e}")
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 63a2eae.

RunOrchestrator._load_credentials now delegates to madengine.core.auth.load_credentials. The central helper enforces "credential.json must contain a JSON object" validation up front (raises ValueError if the file is an array/string), so malformed credential files are rejected with a clear message instead of producing a cryptic subscript/attribute error later during registry auth.

Comment thread src/madengine/core/context.py Outdated
Comment on lines +257 to +258
self.ctx["rocm_path"] = self._rocm_path
# In-container ROCM_PATH is finalized at run time in run_container
# via finalize_container_rocm_path (OCI env → probe → default).
# docker_env_vars.ROCM_PATH (user-supplied) is preserved here; finalize
# normalizes it at run time. Auto-detection runs in finalize when absent.
self.ctx["docker_env_vars"]["ROCM_PATH"] = self._rocm_path
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 9109f52 — file reverted to origin/develop.

core/context.py no longer forces the host ROCm root into the container env. The PR now uses upstream's madengine.utils.rocm_path_resolver.resolve_host_rocm_path for host detection and the docker_env_vars.MAD_ROCM_PATH flow (with image-baked-in env, in-container probe, then /opt/rocm fallback) for in-container resolution. Out-of-scope behaviour change for slurm_multi; reverted to keep this PR focused.

Comment thread src/madengine/cli/utils.py Outdated
row = [str(row_index), status, model_name]
if show_gpu_arch:
row.append(build_gpu_arch_display(item))
row.append(item.get("architecture", "N/A"))
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 9109f52.

Restored the build_gpu_arch_display(item) helper (prefers gpu_architecture then architecture, defaults to "N/A") and switched both BUILD callsites (success + failed) back to it, so multi-arch builds again show the correct gpu_architecture column instead of N/A. Kept the model-name no-truncation change and the "Workload" column rename (those weren't flagged).

Comment thread README.md Outdated
- **🚀 Modern CLI** - Rich terminal output with Typer and Rich
- **🎯 Simple Deployment** - Run locally or deploy to Kubernetes/SLURM via configuration
- **🔧 Distributed Launchers** - Full support for torchrun, DeepSpeed, Megatron-LM, TorchTitan, Primus, vLLM, SGLang
- **🔧 Distributed Launchers** - Full support for torchrun, DeepSpeed, Megatron-LM, TorchTitan, vLLM, SGLang
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed across two commits.

  • 9109f52README.md reverted to origin/develop, so the Distributed Launchers bullet, the Architecture diagram, and the Components section all mention Primus alongside torchrun / DeepSpeed / Megatron-LM / TorchTitan / vLLM / SGLang again. The ROCm-path bullet and the RTL profiling-tools link are also restored.
  • 63a2eae — added "primus" (and "megatron") to VALID_LAUNCHERS in deployment/common.py, so normalize_launcher no longer silently downgrades "primus" configs to "docker".

The README is now consistent with what the code actually accepts.

Comment thread src/madengine/core/timeout.py Outdated
Comment on lines +20 to +26
def __init__(self, seconds: int = 15) -> None:
"""Constructor of the Timeout class.

Args:
seconds (Optional[int]): The timeout in seconds. None or 0 disables
the timeout. Negative values are treated as no timeout.
seconds (int): The timeout in seconds.
"""
self.seconds = seconds if seconds and seconds > 0 else None
self.seconds = seconds
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 9109f52 — file reverted to origin/develop.

Timeout.__init__ again accepts Optional[int] = 15 and stores self.seconds = seconds if seconds and seconds > 0 else None. __enter__ / __exit__ short-circuit when self.seconds is None, so programmatic callers passing timeout=None (or 0 / negative) no longer hit signal.alarm(None) with a TypeError. Out-of-scope change for slurm_multi; reverted.

Comment on lines +77 to +86
"--build-arg "
+ shlex.quote(str(build_arg))
+ "="
+ shlex.quote(str(self.context.ctx["docker_build_arg"][build_arg]))
+ " "
+ build_arg
+ "='"
+ self.context.ctx["docker_build_arg"][build_arg]
+ "' "
)

if run_build_arg:
for key, value in run_build_arg.items():
build_args += "--build-arg " + shlex.quote(str(key)) + "=" + shlex.quote(str(value)) + " "
build_args += "--build-arg " + key + "='" + value + "' "
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 9109f52.

get_build_arg reverted to use shlex.quote(str(...)) for both keys and values:

build_args += (
    "--build-arg "
    + shlex.quote(str(build_arg))
    + "="
    + shlex.quote(str(self.context.ctx["docker_build_arg"][build_arg]))
    + " "
)

(plus Optional[Dict] = None with if run_build_arg is None: run_build_arg = {} instead of the mutable-default pattern). Build-arg values containing apostrophes / $() / backticks no longer break the docker build command line.

Comment thread src/madengine/utils/config_parser.py Outdated
stop = os.path.abspath(stop_dir)

while current == stop or current.startswith(stop + os.sep):
while current.startswith(stop):
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 9109f52 — file reverted to origin/develop.

Boundary check restored:

while current == stop or current.startswith(stop + os.sep):

So a parent like /repo/scripts no longer matches /repo/scripts-private/... via prefix collision — ConfigParser walks the intended subtree only. Out-of-scope change for slurm_multi; reverted.

Comment on lines +96 to +97
if sys_gpu_arch and "NVIDIA" in sys_gpu_arch:
# Normalize "NVIDIA A100-SXM4-40GB" -> "A100", "NVIDIA H100 PCIe" -> "H100"
# (mirrors get_gpu_arch() in tests/fixtures/utils.py)
after_prefix = sys_gpu_arch[sys_gpu_arch.index("NVIDIA") + len("NVIDIA ") :]
sys_gpu_arch = after_prefix.split("-")[0].split()[0]
sys_gpu_arch = sys_gpu_arch.split()[1]
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 9109f52 — file reverted to origin/develop.

NVIDIA-arch normalization restored:

after_prefix = sys_gpu_arch[sys_gpu_arch.index("NVIDIA") + len("NVIDIA "):]
sys_gpu_arch = after_prefix.split("-")[0].split()[0]

so "NVIDIA A100-SXM4-40GB""A100" and "NVIDIA H100 PCIe""H100" again, and skip_gpu_arch=A100 correctly filters those models. The split()[1] regression that returned the full SKU string is gone. Out-of-scope change for slurm_multi; reverted.

Previously the per-deploy completion marker was named
`madengine_<model>.complete` -- not job-id namespaced. A second
`madengine run` of the same model tag would see the leftover marker
from the first run and immediately conclude the job was done, even
though sbatch had only just submitted the new job.

Reproduced live: re-running pyt_sglang_disagg_llama-3.1-8b right after
job 22607 finished caused madengine to return success in ~30 s while
job 22615 was still running on the reservation.

Fix:

* prepare(): emit the marker using `${SLURM_JOB_ID:-local}` inside the
  generated wrapper so each sbatch run writes its own file. Falls back
  to `_local` when no SLURM_JOB_ID is set (interactive/local runs,
  preserves existing behaviour for that path via _completion_marker).

* monitor(): reconstruct the job-id-aware path from the basename
  template and the deployment_id returned by sbatch; only fall back
  to the bare marker for backwards compatibility with the
  inside_allocation flow.

Co-authored-by: Cursor <cursoragent@cursor.com>
@raviguptaamd
Copy link
Copy Markdown
Contributor Author

Smoke test summary on Distributed_Inference_CI reservation

Drove the slurm_multi launcher end-to-end against four real workloads with the companion model cards in ROCm/MAD-private#186. Three out of four produced real benchmark numbers, the fourth surfaced a vLLM/NixlConnector issue inside the docker image (orthogonal to this PR).

Critical-path runs

# Workload Model card Slurm job Nodes State Result
C1 SGLang disagg 8B (madengine-driven via F1) pyt_sglang_disagg_llama-3.1-8b 22607 reservation [029,040,082] COMPLETED 8:51 5,977 tok/s TTFT 122.8 ms ITL 2.54 ms
C2 vLLM disagg 8B (1P/1D, co-located proxy) pyt_vllm_disagg_llama-3.1-8b 22612 [008,014] COMPLETED 4:22 router serving, but every decode request returned HTTP 500: vLLM workers crashed with nixl_connector.py:1504] AssertionError. Image-side bug, not the launcher.
C3 Large EP 2-node pyt_large_ep_benchmark_2n 22610 reservation [029,040] COMPLETED 3:49 28 metric rows (deepep dispatch/combine FP8/BF16 RDMA/NVL)
C4 KV cache transfer 2-node pyt_kvcache_transfer_bench 22611 reservation [082,125] COMPLETED 24:55 57 metric rows (mooncake + mori + rixl, 4KB → 1GB)
C1' C1 re-run with the F-perf collector commit (5ec55ff) same 22615 reservation [029,040,082] COMPLETED 6:54 5,970 tok/s; F-perf finds /shared_inference/<user>/<jobid>/perf.csv (verified)

What the smoke test exercised in this PR

Surface Verification
madengine build with --use-image (manifest gen for prebuilt image) C1
madengine build with no flags + model-card env_vars.DOCKER_IMAGE_NAME (F1 implicit---use-image) C1' — log line slurm_multi: no --registry/--use-image given; using DOCKER_IMAGE_NAME from model card -> rocm/pytorch-private:sglang_disagg_mori_20260502
madengine run --manifest-file over slurm_multi C1, C1'
slurm_multi SBATCH wrapper generation, --reservation, enable_node_check=false honoured All
_collect_slurm_multi_results finds /shared_inference/<user>/<jobid>/perf.csv (sglang/vllm) C1'
SlurmDeployment direct sbatch (no madengine on top) C2, C3, C4
slurm_multi co-located-proxy topology for vLLM disagg (xP + yD nodes) C2 — topology was launched correctly; C2 failure happened inside vLLM nixl_connector after Application startup complete.

Bugs surfaced by the smoke test (all addressed)

Bug Where Commit
5 invalid kwargs to ErrorContext (model=, launcher=, model_names=, registry=, dockerfile=) — masked the real error message build_orchestrator.py 5ec55ff
F1: model-card image not auto-promoted to --use-image for slurm_multi build_orchestrator.py 5ec55ff
F-perf: _collect_slurm_multi_results only looked in <cwd>/perf.csv slurm.py 5ec55ff
Completion marker not job-id namespaced — second madengine run of same model tag returned success in ~30 s while the new sbatch was still running slurm.py 710b51a

Out of scope

  • Image-side vLLM NixlConnector crash on Llama-3.1-8B disagg under rocm/pytorch-private:20260407_itej89_vllm_mori_docker (filed separately as orthogonal).
  • Node-health precheck threshold (180 GB memory used / 4 processes flags reservation nodes as "Dirty"; tracked for follow-up).

PR head: 710b51a. Mergeable, blocking on review.

Round-2 Copilot review identified 15 changes outside the slurm_multi
scope that regressed upstream/develop behaviour. Revert each to
origin/develop:

  core/context.py            drop forced host-ROCm-into-container override
  core/docker.py             restore shlex.quote + exact-match container filter
  core/dataprovider.py       restore "unknown data name" guard, drop debug prints
  core/timeout.py            restore Optional[int] None/0 handling
  cli/app.py                 restore versioningit-driven --version
  utils/rocm_tool_manager.py capture primary_stderr before fallback overwrites
  utils/discover_models.py   restore shlex.quote in cp; raise TypeError, not assert
  utils/config_parser.py     fix boundary check (== or startswith stop+os.sep)
  orchestration/image_filtering.py
                             restore proper "NVIDIA <ARCH>" normalization
                             (split()[1] returned A100-SXM4-40GB, not A100)
  reporting/update_perf_csv.py
                             restore configs pop, isinstance instead of type ==
  README.md                  restore Primus mention, ROCm path docs, RTL link

Surgical reverts that preserve scope-relevant changes:

  cli/utils.py
    - restore build_gpu_arch_display() helper + the two BUILD callsites
      (success and failed BUILD branches), so multi-arch builds show the
      gpu_architecture column instead of N/A
    - keep the model-name no-truncation change (no [:20])
    - keep the "Workload" column rename and matching format-precision tweak

  execution/docker_builder.py
    - revert auth/shlex paths to upstream (login_to_registry delegated to
      madengine.core.auth, shlex.quote on build args, exact-match container
      filter, primus dockercontext detection)
    - PRESERVE the slurm_multi-specific block at line 369:
        # Update built_models with registry image name for parallel pull
        # in slurm_multi
      so the registry-image -> env_vars["DOCKER_IMAGE_NAME"] propagation
      that the slurm_multi parallel-pull path depends on is unchanged.

slurm_multi infrastructure is untouched: zero bytes diff vs PR HEAD on
deployment/{slurm,slurm_node_selector,base,common,factory,config_loader}.py,
orchestration/{build_orchestrator,run_orchestrator}.py,
execution/{container_runner,container_runner_helpers}.py,
cli/{commands/build,commands/run,validators}.py.

Smoke checks:
  DeploymentFactory.available_deployments() = ['k8s','kubernetes','slurm']
  normalize_launcher('slurm_multi','slurm')  = 'slurm_multi'
  normalize_launcher('slurm-multi','slurm')  = 'slurm_multi'

Addresses Copilot review comments ROCm#4, ROCm#5, ROCm#6, ROCm#9, ROCm#13, ROCm#14, ROCm#16, ROCm#17,

Co-authored-by: Cursor <cursoragent@cursor.com>
ROCm#18, ROCm#19, ROCm#20, ROCm#21, ROCm#22, ROCm#23, ROCm#24.
Copilot AI review requested due to automatic review settings May 6, 2026 01:05
raviguptaamd and others added 2 commits May 6, 2026 01:10
Six Round-2 review comments on slurm_multi-touching code paths fixed.

run_orchestrator.py:
  * Real recursive deep-merge: promote the inline `_deep_merge` from
    `_init_runtime_context_from_manifest` to a module-level helper and
    use it in `_load_and_merge_manifest` too. Previously the manifest
    path still did a shallow `dict.update()`, so a runtime override
    touching a single leaf inside e.g. `slurm.node_selector` or
    `distributed.primus` wholesale-replaced the inner object and
    silently dropped the manifest's other leaves.
    (Copilot review ROCm#2)

  * Restore manifest's `context.docker_env_vars` into the runtime
    context. Manifest-only runs (`madengine run --manifest-file`)
    used to load tools / pre_scripts / post_scripts / encapsulate_script
    but had stopped restoring `docker_env_vars`, so any env captured
    at build time (incl. MAD_SECRETS*, model-specific values) was
    dropped before ContainerRunner rendered `--env` flags.
    (Copilot review ROCm#12)

  * Replace the local `_load_credentials` reimplementation with a
    delegation to `madengine.core.auth.load_credentials`, which keeps
    the "credential.json must be a JSON object" validation that was
    silently lost. Without it, a malformed file (array/string) would
    propagate unchanged and surface much later as a cryptic
    subscript/attribute error during registry auth.
    (Copilot review ROCm#3)

deployment/common.py:
  * `configure_multi_node_profiling`: when `rocprofv3` is unavailable
    on a multi-node run, only filter out the `rocprof` tool (which is
    the one that actually requires rocprofv3) instead of disabling
    ALL multi-node tools. `rccl_trace`, `rocblas_trace`, and friends
    work independently and must keep running. Returns
    `multi_node_no_rocprofv3` in that mixed mode.
    (Copilot review ROCm#1)

  * `VALID_LAUNCHERS`: add `"megatron"` and `"primus"` alongside the
    existing `"megatron-lm"`. The rest of the codebase (slurm.py,
    kubernetes.py, the `job.yaml.j2` template, `primus_backend.py`,
    and every `examples/**/megatron-lm-*.json` config) uses the bare
    `"megatron"` / `"primus"` strings, so without these entries
    `normalize_launcher` was silently downgrading them to `"docker"`.
    (Copilot review ROCm#10)

deployment/base.py:
  * Widen `_parse_node_log_for_perf` regex to accept scientific
    notation (`1.23e4`, `4E+5`) for the value, and `\\S+` for the
    metric name (covers `tokens/sec`, `samples_per_second`, `tok/s`,
    etc.). The narrower `[\\d.]+` previously dropped legitimate
    perf lines emitted by training scripts.
    (Copilot review ROCm#11)

Smoke checks:
  DeploymentFactory.available_deployments() = ['k8s','kubernetes','slurm']
  normalize_launcher('megatron','slurm')        = 'megatron'
  normalize_launcher('primus','kubernetes')     = 'primus'
  normalize_launcher('slurm_multi','slurm')     = 'slurm_multi'
  _deep_merge nested-leaf preservation: verified

Co-authored-by: Cursor <cursoragent@cursor.com>
Three Round-2 Copilot comments on doc-only broken navigation:

  docs/deployment.md  (review ROCm#7):
    distributed-launchers.md → launchers.md (renamed long ago)
    user-guide.md            → usage.md     (no user-guide.md exists)

  docs/launchers.md   (review ROCm#8):
    Drop the K8s `slurm-multi-*.json` example references — those
    files do not exist under examples/k8s-configs/. slurm_multi is
    a SLURM-side launcher; K8s disagg flows continue to use the
    existing examples/k8s-configs/{minimal,basic}/sglang-disagg-*.json
    configs. Keep and verify the three SLURM example paths that
    actually exist on disk.

  docs/installation.md (review ROCm#15):
    user-guide.md          → usage.md       (no user-guide.md exists)
    how-to-quick-start.md  → cli-reference.md (no how-to-quick-start.md
                              exists; usage.md already covers Quick Start
                              and is linked above, so point readers at
                              the CLI reference instead).

All link targets verified to exist on disk:
  docs/launchers.md, docs/usage.md, docs/deployment.md, docs/cli-reference.md
  examples/slurm-configs/minimal/slurm-multi-minimal.json
  examples/slurm-configs/basic/slurm-multi-multi-node.json
  examples/slurm-configs/basic/slurm-multi-custom-split.json

Co-authored-by: Cursor <cursoragent@cursor.com>
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 75 out of 76 changed files in this pull request and generated 13 comments.

Comments suppressed due to low confidence (2)

src/madengine/orchestration/run_orchestrator.py:120

  • Context.init still only accepts additional_context, additional_context_file, build_only_mode, and detect_local_gpu_arch. Passing rocm_path here raises TypeError, so every local run that initializes the runtime context will fail before execution starts.
    src/madengine/orchestration/run_orchestrator.py:594
  • Build manifests still persist context.docker_env_vars, docker_mounts, and docker_build_arg, but this restore block now copies only tools/scripts back into the runtime Context. Execution-only runs from a manifest therefore lose build-time env/mount context before ContainerRunner starts, including values that Context auto-imported into docker_env_vars such as MAD_SECRETS*.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +574 to +580
for env_arg in run_env:
if not self._ENV_KEY_RE.match(env_arg):
raise ValueError(f"Invalid environment variable name: {env_arg!r}")
env_args += f"--env {env_arg}={shlex.quote(str(run_env[env_arg]))} "
env_args += f"--env {env_arg}='{str(run_env[env_arg])}' "

# Add context environment variables
if "docker_env_vars" in self.context.ctx:
for env_arg in self.context.ctx["docker_env_vars"].keys():
if not self._ENV_KEY_RE.match(env_arg):
raise ValueError(f"Invalid environment variable name: {env_arg!r}")
value = self.context.ctx["docker_env_vars"][env_arg]
env_args += f"--env {env_arg}={shlex.quote(str(value))} "
env_args += f"--env {env_arg}='{str(self.context.ctx['docker_env_vars'][env_arg])}' "
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Acknowledged — this is a real concern but it is pre-existing in origin/develop, not introduced by this PR.

The f"--env {env_arg}='{str(...)}' " pattern is at the same line in origin/develop (compare git show origin/develop:src/madengine/execution/container_runner.py | sed -n '578,581p'); the --build-on-compute / --use-image work in this PR didn't change this method.

Deferring to a follow-up issue rather than expanding this PR's scope. The right fix is to switch to shlex.quote(value) (and validate the env-var name with the existing _VALID_ENV_NAME_RE style check) for both this docker_env_vars loop and the run_env loop a few lines above. Happy to file the follow-up issue (run_container env-arg shell-escape hardening) and reference it back here.

Comment on lines 1027 to +1042
if self.additional_context and "docker_env_vars" in self.additional_context:
merged_count = 0
for key, value in self.additional_context["docker_env_vars"].items():
self.context.ctx["docker_env_vars"][key] = value
merged_count += 1
if merged_count > 0:
print(f"ℹ️ Merged {merged_count} environment variables from additional_context")

if self.context and str(self.context.ctx.get("gpu_vendor", "")).upper().find(
"AMD"
) != -1:
from madengine.utils.rocm_path_resolver import finalize_container_rocm_path

# Determine whether the user explicitly supplied ROCM_PATH for the container.
# If they did (via docker_env_vars.ROCM_PATH in additional_context), the
# re-merge above already restored it — keep it so finalize uses it directly.
# If they did not, clear any ROCM_PATH left from a previous model run so
# finalize always re-resolves for the current docker_image (OCI config →
# in-image probe → /opt/rocm default).
user_supplied_rocm_path = (
str(
(self.additional_context or {})
.get("docker_env_vars", {})
.get("ROCM_PATH", "")
).strip()
)
if not user_supplied_rocm_path:
self.context.ctx["docker_env_vars"].pop("ROCM_PATH", None)

finalize_container_rocm_path(
self.context.ctx["docker_env_vars"],
docker_image,
)
# Merge env_vars from model_info (models.json) into docker_env_vars
if "env_vars" in model_info and model_info["env_vars"]:
model_env_count = 0
for key, value in model_info["env_vars"].items():
self.context.ctx["docker_env_vars"][key] = str(value)
model_env_count += 1
if model_env_count > 0:
print(f"ℹ️ Merged {model_env_count} environment variables from model_info (models.json)")
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This thread is showing as outdated and is already addressed by Cat B in 9109f52.

core/context.py was reverted to origin/develop, which means:

  • No more forced self.ctx['docker_env_vars']['ROCM_PATH'] = self._rocm_path at init.
  • In-container ROCM_PATH is resolved at run time via finalize_container_rocm_path() (OCI metadata → in-image probe → /opt/rocm), so user/model docker_env_vars overrides survive the merge.

The ROCm-path behaviour the comment was concerned about is preserved by upstream's existing implementation, not by this PR.

Comment on lines 444 to 448
)

self.rich_console.print(f"[green]✓ Discovered {len(models)} model(s) for tags: {tags}[/green]\n")

# Initialize build-only context for manifest generation
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This thread is showing as outdated and is already addressed by Cat B in 9109f52.

core/context.py was reverted to origin/develop, so the Context build-only constructor no longer accepts a rocm_path keyword. The synthetic-manifest path that called Context(rocm_path=...) was the source of the TypeError referenced here, and that constructor signature is back to upstream's. No call site in this PR passes rocm_path=... anymore.

Comment on lines 529 to +534
if self.additional_context:
# Merge deployment_config
if "deployment_config" in manifest:
stored_config = manifest["deployment_config"]
# Runtime --additional-context overrides stored config
# Runtime --additional-context overrides stored config.
# For dict-valued keys, recursively deep-merge so manifest values
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This thread is showing as outdated and is already addressed in 63a2eae — reposting the link here for clarity.

run_orchestrator._load_and_merge_manifest no longer uses dict(stored_config[key]); merged.update(...). It now calls the module-level _deep_merge(stored_config[key], self.additional_context[key]) helper, which recurses into nested dicts.

Verified locally: an override touching slurm.node_selector.reservation preserves a sibling slurm.node_selector.partition from the manifest instead of wholesale-replacing the inner object. Same helper is used in both the runtime-context init path and the manifest-merge path.

Comment on lines +15 to 23
# NOTE: keep this in sync with the launcher branches in
# deployment/{slurm.py,kubernetes.py}, the templates/{slurm,kubernetes}/*.j2
# Jinja conditions, and examples/**/launcher fields. Currently the rest of
# the codebase (and shipped examples) use the bare "megatron" / "primus"
# strings, while "megatron-lm" is the docs-friendly synonym.
VALID_LAUNCHERS = [
"torchrun",
"torchtitan",
"deepspeed",
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This Round-3 comment is reviewing a pre-fix snapshot — already addressed in 63a2eae.

Current src/madengine/deployment/common.py has both "megatron" and "primus" in VALID_LAUNCHERS (alongside the docs-friendly "megatron-lm"):

VALID_LAUNCHERS = [
    "torchrun", "torchtitan", "deepspeed",
    "megatron", "megatron-lm", "primus",
    "vllm", "sglang", "slurm_multi",
]

Verified at HEAD (2200cbd): normalize_launcher('megatron','slurm')'megatron', normalize_launcher('primus','kubernetes')'primus'. The example configs (examples/{slurm,k8s}-configs/**/megatron-lm-*.json, examples/k8s-configs/minimal/megatron-lm-*.json) round-trip correctly without being downgraded to "docker"/"native".

Comment thread docs/launchers.md
Comment on lines +471 to +476
**Examples** (SLURM only — `slurm_multi` is a SLURM-side launcher; the K8s
deployment of disagg workloads continues to use the existing
`sglang-disagg-*.json` configs):
- SLURM Minimal: `examples/slurm-configs/minimal/slurm-multi-minimal.json`
- SLURM Basic: `examples/slurm-configs/basic/slurm-multi-multi-node.json`
- SLURM Custom: `examples/slurm-configs/basic/slurm-multi-custom-split.json`
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This Round-3 comment is reviewing a pre-fix snapshot — already addressed in 0156d33.

Current docs/launchers.md (around line 470) only lists the three SLURM-side examples that actually exist on disk:

**Examples** (SLURM only — `slurm_multi` is a SLURM-side launcher; the K8s
deployment of disagg workloads continues to use the existing
`sglang-disagg-*.json` configs):
- SLURM Minimal: `examples/slurm-configs/minimal/slurm-multi-minimal.json`
- SLURM Basic:   `examples/slurm-configs/basic/slurm-multi-multi-node.json`
- SLURM Custom:  `examples/slurm-configs/basic/slurm-multi-custom-split.json`

The three K8s slurm-multi-*.json references that didn't exist were dropped. All three SLURM paths verified to exist at HEAD (2200cbd).

Comment thread examples/profiling-configs/README.md Outdated
Comment on lines +89 to +105
@@ -126,7 +102,7 @@ madengine build --tags your_model --registry your-registry:5000

# Deploy to SLURM
madengine run --manifest-file build_manifest.json \
--additional-context-file examples/profiling-configs/rocprofv3_multi_node.json
--additional-context-file examples/profiling-configs/rocprofv3_multinode.json
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 2200cbd.

Fixed both occurrences (lines 89 and 105). The actual checked-in profiling config is examples/profiling-configs/rocprofv3_multi_node.json (with underscore); the README now references that exact filename, so the copy-paste command no longer fails with a missing-file error.

Comment thread tests/unit/test_reporting.py Outdated
import os
import tempfile

import pytest
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 2200cbd.

Dropped import pytest. Confirmed there are no pytest.* references anywhere in tests/unit/test_reporting.py (no @pytest.mark.*, no pytest.raises, no pytest.fixture), so the import was genuinely unused. flake8 lint will be clean again.

Comment thread tests/unit/test_rocm_path.py Outdated
Comment on lines +50 to +51
ctx = Context(build_only_mode=True, rocm_path="/opt/rocm")
assert ctx._rocm_path == "/opt/rocm"
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 2200cbd — file reverted to origin/develop.

With the Cat B revert of core/context.py in 9109f52, this PR no longer adds the Context(rocm_path=...) keyword API; tests for that PR-specific surface belong to whichever future PR re-adds it. Upstream's tests/unit/test_rocm_path.py already exercises ROCmToolManager(rocm_path=...) (still in the codebase), so reverting restores green tests without changing semantics. Net result: this test file is byte-identical to origin/develop.

Comment thread tests/unit/test_rocm_path.py Outdated
Comment on lines +65 to +67
ctx = Context(rocm_path="/my/rocm")
assert ctx.ctx.get("rocm_path") == "/my/rocm"
assert ctx.ctx["docker_env_vars"].get("ROCM_PATH") == "/my/rocm"
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 2200cbd (same fix as the sibling thread). The whole test_rocm_path.py file was reverted to origin/develop, so this assertion that ctx['docker_env_vars']['ROCM_PATH'] == ... immediately after construction is gone — the test now matches upstream's behaviour where in-container ROCM_PATH is finalized at run time via finalize_container_rocm_path().

raviguptaamd and others added 2 commits May 6, 2026 03:34
Three Round-3 review comments fixed in this small commit; the
remaining Round-3 items are either stale (already fixed in 9109f52 /
63a2eae / 0156d33; Copilot was reviewing a pre-fix snapshot) or
pre-existing concerns in upstream/develop that are deferred to a
follow-up issue (replied inline).

  examples/profiling-configs/README.md (R3-6):
    `rocprofv3_multinode.json` -> `rocprofv3_multi_node.json` (lines 89
    and 105). The actual checked-in file is `rocprofv3_multi_node.json`
    (with underscore), so copy-pasting the README's command would fail
    with a missing-file error.

  tests/unit/test_reporting.py (R3-7):
    Drop the unused `import pytest` (no `pytest.*` references in the
    file). The repo's pre-commit runs flake8, so the unused import
    would fail lint/CI.

  tests/unit/test_rocm_path.py (R3-8 + R3-9):
    Revert to origin/develop. With the Cat B revert of
    `core/context.py` (commit 9109f52), the PR no longer adds the
    `Context(rocm_path=...)` keyword API; tests for that PR-specific
    API now belong to whichever future PR re-adds it. Upstream's
    version of this test file already exercises
    `ROCmToolManager(rocm_path=...)` -- which is still in the codebase
    -- so reverting restores green tests without changing semantics.

Co-authored-by: Cursor <cursoragent@cursor.com>
Surfaced live: `madengine build --tags pyt_kvcache_transfer_bench
--build-on-compute --registry docker.io/rocm/pytorch-private` failed
its docker build step with:

    ERROR: failed to build: invalid tag
    "ci-pyt_kvcache_transfer_bench_Dockerfile":
    repository name must be lowercase

Root cause: in `_execute_build_on_compute`, `local_image_name` is
assembled as `ci-<model_name>_<dockerfile_basename>`, where
`dockerfile_basename` strips `.Dockerfile` and `.ubuntu.amd` from
`Path(dockerfile_path).name`. When the model card's `dockerfile`
field is just `scripts/<dir>/Dockerfile` (no `.ubuntu.amd.Dockerfile`
suffix), neither `.replace()` matches, so `dockerfile_basename`
stays as `Dockerfile` (capital D) and the assembled tag is rejected
by `docker build -t`.

Fix: `.lower()` the final assembled local image name. This:
  * fixes the kvcache_transfer_bench card (`scripts/.../Dockerfile`),
  * is a no-op for the four already-passing cards whose dockerfile
    paths follow the `*.ubuntu.amd.Dockerfile` convention (the
    `_basename` is already lowercase after the suffix strips), and
  * is robust against any uppercase character anywhere in
    `model_name` or `dockerfile_basename`.

Verified by re-running the failing case to completion:

    Local image name: ci-pyt_kvcache_transfer_bench_dockerfile
    Image pushed: docker.io/rocm/pytorch-private:pyt_kvcache_transfer_bench
    [✓] All builds completed successfully

(slurm job 22936, ~18 min full Dockerfile build of mooncake + MORI +
RIXL + nixlbench, 24 layers)

Co-authored-by: Cursor <cursoragent@cursor.com>
Copilot AI review requested due to automatic review settings May 6, 2026 05:29
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 74 out of 75 changed files in this pull request and generated 24 comments.

Comments suppressed due to low confidence (3)

docs/cli-reference.md:396

  • The CLI docs still say --timeout 0 means “no timeout”, but madengine run no longer translates 0 to an unlimited timeout display-wise and the validation message no longer mentions 0. Either update the docs to match the current behavior, or restore the explicit 0 => no timeout handling so help text, docs, and runtime all align.
| `--tags` | `-t` | TEXT | `[]` | Model tags to run (can specify multiple) |
| `--manifest-file` | `-m` | TEXT | `""` | Build manifest file path (for pre-built images) |
| `--rocm-path` | | TEXT | `None` | ROCm installation root (default: `ROCM_PATH` env or `/opt/rocm`). Use when ROCm is not in `/opt/rocm` (e.g. Rock, pip). |
| `--registry` | `-r` | TEXT | `None` | Docker registry URL |
| `--timeout` | | INT | `-1` | Timeout in seconds (-1=default 7200s, 0=no timeout) |
| `--additional-context` | `-c` | TEXT | `"{}"` | Additional context as JSON string |
| `--additional-context-file` | `-f` | TEXT | `None` | File containing additional context JSON |

pyproject.toml:31

  • pytest is included in the package’s runtime dependencies even though it’s only used for testing (it’s already listed under [project.optional-dependencies].dev). Keeping pytest in dependencies increases install size and can conflict with downstream environments. Move it back to dev/test extras only unless the library imports pytest at runtime (it currently doesn’t).
dependencies = [
  "pandas",
  "GitPython",
  "jsondiff",
  "sqlalchemy",
  "setuptools-rust",
  "paramiko",
  "tqdm",
  "pytest",
  "typing-extensions",
  "pymongo",
  "toml",
  "typer>=0.9.0",
  "rich>=13.0.0",
  "click>=8.0.0",
  "jinja2>=3.0.0",
  "pyyaml>=6.0",
]

src/madengine/execution/container_runner_helpers.py:216

  • make_run_log_file_path() no longer normalizes full image references (e.g. registry/ns/name:tag), so the generated log filename can include : (and very long sanitized refs), and run-log naming can diverge from build-log naming for CI-style tags. Consider reinstating a normalization helper (strip digest, collapse registry/...:ci-* to just the tag, and replace :, /, @ with _) to keep filenames stable and filesystem-friendly.

Comment on lines 133 to 138
rocm_path = get_rocm_path(getattr(self.args, "rocm_path", None))
self.context = Context(
additional_context=context_string,
build_only_mode=False,
rocm_path=rocm_path,
)
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 3fc7587.

This is a real latent bug introduced by my Cat B revert: I reverted core/context.py to origin/develop (which removes the rocm_path keyword), but missed that run_orchestrator.py still passes it at lines 138 and 456. Today's slurm_multi tests went through _execute_distributed so didn't hit it, but madengine run against a local target hits _init_runtime_context() here and crashes with TypeError: Context.__init__() got an unexpected keyword argument 'rocm_path'.

Fix in 3fc7587: drop the rocm_path = get_rocm_path(...) line and the rocm_path=rocm_path kwarg at both call sites; drop the now-unused from madengine.core.constants import get_rocm_path import.

Verified at HEAD (3fc7587) via inspect.signature(Context.__init__): params are ['self','additional_context','additional_context_file','build_only_mode','detect_local_gpu_arch'] -- no more rocm_path. End-to-end smoke: pyt_large_ep_benchmark_2n on [008,024] via slurm_multi, 8 min, 28 metric rows.

Comment on lines 451 to 456
rocm_path = get_rocm_path(getattr(self.args, "rocm_path", None))
build_context = Context(
additional_context=context_string,
build_only_mode=True,
rocm_path=rocm_path,
)
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 3fc7587 (same root cause as the sibling thread on line 138). The synthetic-manifest path in _create_manifest_from_local_image was passing the same now-removed rocm_path kwarg. Both call sites are fixed in this commit; from madengine.core.constants import get_rocm_path import is gone too.

Comment on lines +26 to +33
from madengine.core.errors import (
BuildError,
ConfigurationError,
ExecutionError,
create_error_context,
handle_error,
)
from madengine.core.constants import get_rocm_path
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 3fc7587. Dropped handle_error from the import set in from madengine.core.errors import (...) since the only references in this module were removed in earlier rounds. flake8 F401 should be clean.

Comment on lines +415 to +431
# Ensure credential values are strings
username = str(creds["username"])
password = str(creds["password"])

# Perform docker login
login_command = f"echo '{password}' | docker login"

if registry and registry.lower() not in ["docker.io", "dockerhub"]:
login_command += f" {registry}"

login_command += f" --username {username} --password-stdin"

try:
self.console.sh(login_command, secret=True)
self.rich_console.print(f"[green]✅ Successfully logged in to registry: {registry or 'DockerHub'}[/green]")
except Exception as e:
self.rich_console.print(f"[red]❌ Failed to login to registry {registry}: {e}[/red]")
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 3fc7587.

ContainerRunner.login_to_registry now delegates to madengine.core.auth.login_to_registry (the same helper DockerBuilder.login_to_registry already uses). The previous inline form built

login_command = f"echo '{password}' | docker login ... --password-stdin"
self.console.sh(login_command, secret=True)

which -- even though docker reads the password via --password-stdin -- still embedded the password in echo's argv visible through ps//proc. The central helper writes the password to docker's stdin via subprocess.run(..., input=password.encode()), so it never appears on a process command line.

The ~50 lines of inline credential-shape validation are also gone -- the central helper does the same checks (and was the path Round-2 R2-#18 already moved DockerBuilder onto, just in a sister file).

Comment on lines 568 to 583
def get_env_arg(self, run_env: typing.Dict) -> str:
"""Get the environment arguments for docker run."""
env_args = ""

# Add custom environment variables
if run_env:
for env_arg in run_env:
if not self._ENV_KEY_RE.match(env_arg):
raise ValueError(f"Invalid environment variable name: {env_arg!r}")
env_args += f"--env {env_arg}={shlex.quote(str(run_env[env_arg]))} "
env_args += f"--env {env_arg}='{str(run_env[env_arg])}' "

# Add context environment variables
if "docker_env_vars" in self.context.ctx:
for env_arg in self.context.ctx["docker_env_vars"].keys():
if not self._ENV_KEY_RE.match(env_arg):
raise ValueError(f"Invalid environment variable name: {env_arg!r}")
value = self.context.ctx["docker_env_vars"][env_arg]
env_args += f"--env {env_arg}={shlex.quote(str(value))} "
env_args += f"--env {env_arg}='{str(self.context.ctx['docker_env_vars'][env_arg])}' "

print(f"Env arguments: {env_args}")
return env_args
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 3fc7587. get_env_arg no longer dumps the full assembled --env KEY='value' ... string into stdout. It now prints only the count:

print(f"Env arguments: {env_args.count('--env')} variables set")

Values can include HF tokens, registry passwords, MAD_SECRETS_*, etc.; the assembled string is still passed to docker run (so the container sees them as expected), but no longer leaks into stdout / CI artifacts.

Comment on lines 147 to 157
try:
power_info = amdsmi_interface.amdsmi_get_power_info(processor_handle)
if isinstance(power_info, dict):
# If both current and average power are 0, it's likely a secondary die
current_power = power_info.get('current_socket_power', -1)
avg_power = power_info.get('average_socket_power', -1)
if current_power == 0 and avg_power == 0:
return True
except Exception:
except:
# If we can't get power info, might be secondary die
return True
Comment thread docs/deployment.md Outdated
- `sglang-disagg` - Disaggregated SGLang (multi-node)

See [Launchers Guide](launchers.md) for details.
See [Distributed Launchers Guide](distributed-launchers.md) for details.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 3fc7587. This was the second occurrence of [Distributed Launchers Guide](distributed-launchers.md) -- the earlier 0156d33 commit only fixed the one at line 543. Updated to (launchers.md) (verified to exist on disk).

Comment on lines 303 to 311

# Add upload timestamp if not present
if f"{prefix}_uploaded_at" not in doc:
doc[f"{prefix}_uploaded_at"] = datetime.now(timezone.utc)
doc[f"{prefix}_uploaded_at"] = datetime.utcnow()

# Preserve original created_date if present
if "created_date" not in doc:
doc["created_date"] = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
doc["created_date"] = datetime.utcnow().strftime("%Y-%m-%d %H:%M:%S")

Comment on lines 86 to +127
@@ -120,6 +122,10 @@ def __init__(self, message: str, context: Optional[ErrorContext] = None, **kwarg
)


# Backward compatibility alias
RuntimeError = ExecutionError

Comment on lines 29 to +33
class TestConfigParser:
"""Test cases for ConfigParser functionality."""

@pytest.fixture
def tmp_dir(self):
def test_dir(self):
raviguptaamd and others added 2 commits May 6, 2026 06:31
…pend docker.io/)

Surfaced live: `madengine build --build-on-compute --registry rocm/pytorch-private`
failed inside the build sbatch with `docker login rocm` -> dial tcp ...
connect: connection refused. Dockerhub-shorthand without the `docker.io/`
prefix is a routine input shape (matches what Docker CLI accepts for
pull/run); the build-on-compute path silently fed the first segment of
`registry` to `docker login <host>` and got a DNS NXDOMAIN.

`_execute_build_on_compute` now normalizes `--registry` early:

  1. Strip whitespace + trailing slash; reject empty after trim.
  2. Reject inputs containing whitespace or `@` anywhere (catches typos
     like `@bad/foo` BEFORE the auto-prefix step would mask them by
     overwriting the first path segment to `docker.io`).
  3. If the first path segment doesn't look like a registry host
     (no `.`, no `:`, not `localhost`), prepend `docker.io/` and print
     a transparent `[dim]auto-prefixing 'docker.io/'[/dim]` notice so
     the user can see what the build is actually pushing to.
  4. Use the resulting normalized value for everything downstream
     (login host extraction, push path construction, manifest entry,
     per-registry credential-hint matching).

Effect on real-world inputs (verified end-to-end against
docker.io/rocm/pytorch-private):

  rocm/pytorch-private             -> docker.io/rocm/pytorch-private  (auto)
  rocmshared                       -> docker.io/rocmshared            (auto)
  docker.io/rocm/pytorch-private   -> unchanged
  ghcr.io/myorg/myrepo             -> unchanged
  registry.example.com:5000/myrepo -> unchanged
  localhost:5000/myrepo            -> unchanged
  192.168.1.10:5000/myrepo         -> unchanged
  @bad/foo                         -> ConfigurationError (fail-fast,
                                       includes per-registry usage hints)
  ""                               -> ConfigurationError (existing
                                       None-guard, before this change)

Live verification (rocmshared PAT, write+read on rocm/pytorch-private):

  Test 1 `rocm/pytorch-private`             -> sbatch 22954 ✅ pushed
                                                docker.io/rocm/pytorch-private:pyt_kvcache_transfer_bench
  Test 2 `@bad/foo`                         -> ConfigurationError (no sbatch)
                                                "Invalid --registry value: '@bad/foo'.
                                                 First segment '@bad' is not a valid
                                                 registry host."
  Test 3 `docker.io/rocm/pytorch-private`   -> sbatch 22956 ✅ pushed (no auto-prefix
                                                message, value unchanged)
  Test 4 ""                                 -> caught by existing build CLI guard

Suggestion strings use parens `(...)` rather than `[...]` to avoid
Rich's markdown parser interpreting them as closing tags.

Co-authored-by: Cursor <cursoragent@cursor.com>
Round-4 Copilot review (auto-fired post 2200cbd) identified 24 inline
comments. 14 are pre-existing flake8/style nits in `origin/develop`-
identical files unchanged by this PR (tests/fixtures/**, scripts/common/
tools/**, csv_parser.py, utils/{ops,log_formatting}.py, core/errors.py,
database/mongodb.py, etc.); those are out of scope for this PR.

This commit addresses the 8 findings that DO touch code we own:

P0 (real latent bugs / known-broken):

  src/madengine/orchestration/run_orchestrator.py:
    * Drop `Context(rocm_path=...)` kwarg from both call sites
      (`_init_runtime_context` line 138; `_create_manifest_from_local_image`
      line 456). Cat B's revert of `core/context.py` to origin/develop
      removed `rocm_path` from `Context.__init__`, but those two call
      sites still passed the kwarg. Latent bug: today's slurm_multi
      tests went through `_execute_distributed` so didn't hit it,
      but `madengine run` against a local target (`_execute_local`)
      and the synthetic-manifest path (`MAD_CONTAINER_IMAGE` env)
      crashed with `TypeError: Context.__init__() got an unexpected
      keyword argument 'rocm_path'`.
    * Drop now-unused `from madengine.core.constants import get_rocm_path`
      (only callers were the two lines above).
    * Drop unused `handle_error` from `madengine.core.errors` import set
      (flake8 F401).

  src/madengine/execution/container_runner.py:
    * Widen `perf_pattern` in run-result extraction to mirror the
      `deployment/base.py:_parse_node_log_for_perf` regex we already
      widened in 63a2eae:
        before: r'performance:\s+([0-9][0-9.eE+-]*)\s+([a-zA-Z_][a-zA-Z0-9_]*)'
        after:  r'performance:\s+([+-]?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?)\s+(\S+)'
      Accepts scientific notation values and arbitrary metric tokens
      (`tokens/sec`, `tok/s`, `samples_per_second`, etc.).

  docs/deployment.md:
    * `[Distributed Launchers Guide](distributed-launchers.md)` →
      `(launchers.md)` at line 145. Second occurrence of the broken
      link the 0156d33 fix missed (it only fixed line 543).

Yellow-tier hardening (security / leak / shell-quoting):

  src/madengine/execution/container_runner.py:
    * `login_to_registry` now delegates to
      `madengine.core.auth.login_to_registry` instead of constructing
      `f"echo '{password}' | docker login ..."` inline. The previous
      form put the password in argv (visible via `ps`/`/proc`) even
      though docker itself read it from stdin. Same delegation pattern
      DockerBuilder already uses.
    * `get_env_arg` no longer prints the assembled `--env KEY='value'`
      string (which can include HF tokens, registry passwords,
      `MAD_SECRETS_*`). Print only the count.
    * Self-managed launcher path (`_execute_self_managed_launcher`):
      `cmd = f"bash {script_path} {model_args}"` with `shell=True`
      no longer string-concats user-supplied `model_args` into the
      shell command. Use `shlex.quote(script_path)` and
      `shlex.split(model_args)` + per-arg `shlex.quote`, so embedded
      `$()`/backticks/`;` become literal arguments to the script
      instead of host-shell directives.

Verified:
  * py_compile clean on both files.
  * Import smoke: DeploymentFactory.available_deployments() ==
    ['k8s','kubernetes','slurm']; Context() rejects `rocm_path=` arg;
    _deep_merge nested-leaf preservation; run_orchestrator no longer
    references get_rocm_path/handle_error/rocm_path=; container_runner
    perf_pattern widened, env_args print is count-only,
    login_to_registry delegates to core.auth, self-managed shlex.quote
    applied.
  * Live: pyt_large_ep_benchmark_2n on [008,024] via slurm_multi
    completed in 8 min with 28 ✅ Success rows -> perf.csv at
    /shared_inference/ravgupta/<jobid>/ as expected.

Addresses Copilot review comments R4-1, R4-2, R4-3, R4-4, R4-5,
R4-6, R4-7, R4-18.

Co-authored-by: Cursor <cursoragent@cursor.com>
Copilot AI review requested due to automatic review settings May 6, 2026 19:32
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 74 out of 75 changed files in this pull request and generated 16 comments.

Comments suppressed due to low confidence (2)

tests/unit/test_reporting_superset.py:39

  • This file contains whitespace on otherwise blank lines (e.g. between the class docstring and the first fixture). The repo’s pre-commit trailing-whitespace hook (and flake8 W293) will fail on this; please remove trailing spaces on blank lines.
    tests/unit/test_reporting_superset.py:171
  • Additional whitespace-only lines in this section will also fail the pre-commit trailing-whitespace hook. Please strip the trailing spaces from blank lines here as well.

Comment thread pyproject.toml
Comment on lines 14 to 24
dependencies = [
"pandas",
"GitPython",
"jsondiff",
"sqlalchemy",
"setuptools-rust",
"paramiko",
"tqdm",
"pytest",
"typing-extensions",
"pymongo",
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in afbb666: dropped pytest from [project].dependencies. It was already present in [project.optional-dependencies].dev and .all, so no install path that was running tests is affected; only the bare pip install madengine (without [dev]/[all]) no longer pulls a test framework into runtime installs.

Comment on lines 152 to 156
result = builder.get_build_arg()

assert f"--build-arg ARG1={shlex.quote('value1')}" in result
assert f"--build-arg ARG2={shlex.quote('value2')}" in result
assert "--build-arg ARG1='value1'" in result
assert "--build-arg ARG2='value2'" in result

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in afbb666: 5 build-arg assertions updated to match DockerBuilder.get_build_arg()'s shlex.quote() output. For shell-safe values (e.g. value1), shlex.quote returns the value unchanged, so the assertions are now --build-arg ARG1=value1 (no surrounding quotes). Verified locally: pytest -k build_arg4 passed.

Comment on lines 72 to 93
@pytest.mark.parametrize(
"clean_test_temp_files", [["perf_test.csv", "perf_test.html"]], indirect=True
)
def test_commandline_argument_skip_gpu_arch(
self, global_data, clean_test_temp_files, dynamic_skip_gpu_arch_model_dir
self, global_data, clean_test_temp_files
):
"""
Test that skip_gpu_arch command-line argument skips GPU architecture check.
Uses a dynamic MODEL_DIR with the current GPU arch added to the skip list
so the test works on any hardware.
UPDATED: Now uses python3 -m madengine.cli.app instead of legacy mad.py
"""
context = generate_additional_context_for_machine()
output = global_data["console"].sh(
"cd "
+ BASE_DIR
+ "; "
+ "MODEL_DIR="
+ dynamic_skip_gpu_arch_model_dir
+ MODEL_DIR
+ " "
+ f"python3 -m madengine.cli.app run --tags dummy_skip_gpu_arch --live-output --additional-context '{json.dumps(context)}'"
)
if "Skipping model" not in output:
pytest.fail("Enable skipping gpu arch for running model is failed.")
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deferred to upstream-flake8 cleanup PR. The line at #L93 is from 04aac390 (Stephen Shao) and is byte-identical to origin/develop; this PR's commit 50e8895 only touched the surrounding lines (88, 99) for whitespace and a fixture-arg rename, not the assertion target itself. Tracked alongside the rest of the Round-4 / Round-5 upstream-style nits in the follow-up cleanup PR (see #86 description, "Round 4" / "Round 5" status sections).

Comment on lines 418 to 425
if deployment_type == "slurm" and in_slurm_job:
print(f"🔄 Using fresh pull policy for SLURM compute node (prevents cached layer corruption)")
# Remove any existing cached image to force fresh pull
try:
self.console.sh(f"docker rmi -f {registry_image} 2>/dev/null || true")
print(f"✓ Removed cached image layers")
except Exception:
except:
pass # It's okay if image doesn't exist
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in afbb666: the bare except: at L424 (introduced by this PR's 6f2e11bf) was replaced with except Exception: so KeyboardInterrupt/SystemExit propagate, and flake8 E722 no longer fires on this line.

Comment on lines 521 to 534
def get_env_arg(self, run_env: typing.Dict) -> str:
"""Get the environment arguments for docker run."""
env_args = ""

# Add custom environment variables
if run_env:
for env_arg in run_env:
if not self._ENV_KEY_RE.match(env_arg):
raise ValueError(f"Invalid environment variable name: {env_arg!r}")
env_args += f"--env {env_arg}={shlex.quote(str(run_env[env_arg]))} "
env_args += f"--env {env_arg}='{str(run_env[env_arg])}' "

# Add context environment variables
if "docker_env_vars" in self.context.ctx:
for env_arg in self.context.ctx["docker_env_vars"].keys():
if not self._ENV_KEY_RE.match(env_arg):
raise ValueError(f"Invalid environment variable name: {env_arg!r}")
value = self.context.ctx["docker_env_vars"][env_arg]
env_args += f"--env {env_arg}={shlex.quote(str(value))} "
env_args += f"--env {env_arg}='{str(self.context.ctx['docker_env_vars'][env_arg])}' "

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in afbb666: get_env_arg now passes each value through shlex.quote(...) instead of single-quote string interpolation. shlex.quote is byte-equivalent for shell-safe values (alphanumeric, _, -, ., :, /, +, =, etc.) and correctly escapes embedded single quotes, whitespace, and other metacharacters.

slurm_multi safety: the self-managed launcher path (_run_self_managed) returns at L1099 before the docker-run command assembly at L1058 that calls get_env_arg, so this change only tightens the regular-Docker run path used by non-slurm_multi workloads. Verified end-to-end with a fresh pyt_large_ep_benchmark_2n slurm_multi smoke.

Comment thread src/madengine/cli/commands/run.py Outdated
if timeout < -1:
console.print(
"❌ [red]Timeout must be -1 (default), 0 (no timeout), or a positive integer[/red]"
"❌ [red]Timeout must be -1 (default) or a positive integer[/red]"
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in afbb666: validation message updated to match the actual accepted shape — "Timeout must be -1 (default), 0 (no timeout), or a positive integer". The --timeout help string already documents 0 for no timeout (line 67) and the validator is if timeout < -1, so 0 was always accepted; only the error string was misleading. No behaviour change.

Comment on lines 34 to 41
def is_amd_gpu():
"""Check if system has AMD GPU."""
try:
import subprocess
result = subprocess.run(['rocm-smi'], capture_output=True, timeout=5)
return result.returncode == 0
except Exception:
except:
return False
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deferred. File is byte-identical to origin/develop in this PR. Tracked alongside the other upstream-style nits in the follow-up cleanup PR.

Comment thread tests/fixtures/utils.py
Comment on lines 96 to 106
# Clean up Docker containers BEFORE test (ensure clean state)
try:
subprocess.run(
"docker ps -a | grep 'container_ci-dummy' | awk '{print $1}' | xargs -r docker rm -f",
shell=True,
capture_output=True,
timeout=30
)
except Exception:
except:
pass # Ignore cleanup errors before test

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deferred. File is byte-identical to origin/develop in this PR (no edits from #86). The two bare-except instances in this file are pre-existing; develop's [2.0.1] test-suite cleanup narrowed 5 such cases elsewhere but did not touch these two. Tracked in the upstream-cleanup follow-up PR.

Comment thread tests/fixtures/utils.py
Comment on lines 118 to 127
# Clean up Docker containers AFTER test (avoid conflicts with next test)
try:
subprocess.run(
"docker ps -a | grep 'container_ci-dummy' | awk '{print $1}' | xargs -r docker rm -f",
shell=True,
capture_output=True,
timeout=30
)
except Exception:
except:
pass # Ignore cleanup errors after test
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deferred. Same disposition as the L106 thread above — file byte-identical to origin/develop, pre-existing concern, tracked in the upstream-cleanup follow-up PR.

Comment on lines 60 to 66
def convert_directory_csvs_to_html(
directory_path: str,
output_file: str = "run_results.html"
) -> Optional[str]:
) -> str:
"""Convert all CSV files in a directory to a single HTML file.

Args:
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deferred. This file is byte-identical to origin/develop in this PR. Tracked alongside the other upstream-style nits in the follow-up cleanup PR.

@raviguptaamd
Copy link
Copy Markdown
Contributor Author

Status update — ready for review

PR head: 3fc7587. 75 files / +2,877 / -2,179 vs develop. All four Copilot review rounds resolved or replied-to inline.

What's in here

  • slurm_multi deployment launcher for self-managed multi-node disaggregated workloads (vLLM-disagg, SGLang-disagg, large-EP, kvcache-transfer) with the orchestration / build / run / CLI hooks that drive it from the manifest pipeline.
  • madengine build --build-on-compute --registry <docker.io/<ns>(/<repo>) | ghcr.io/... | localhost:5000/...> flow (build on a compute node + push for parallel pull on run nodes).
  • madengine build --use-image <image> flow (skip build, use a pre-built image).
  • SBATCH wrapper generation, reservation support, existing-allocation detection (salloc), per-job-id completion marker, _collect_slurm_multi_results perf.csv discovery in /shared_inference/<user>/<jobid>/.

Kubernetes deployment work split out into #118 (independent PR against develop) per earlier reviewer ask; that PR will rebase / sequence after this one lands.

Live end-to-end test coverage

All on Distributed_Inference_CI and adjacent free nodes.

# Workload Path Result
1 SGLang-disagg Llama-3.1-8B (1Px8/1Dx8, 24 GPUs / 3 nodes) --use-image + slurm_multi 5,977 tok/s, TTFT 122.8 ms, ITL 2.54 ms
2 SGLang-disagg DeepSeek-V3 (1Px8/1Dx8, 24 GPUs / 3 nodes) --use-image + slurm_multi 853 / 1,452 tok/s @ CON=8/16, ISL/OSL=1024/1024
3 SGLang-disagg Llama-3.3-70B-FP8 (1Px8/1Dx8, 24 GPUs / 3 nodes) --build-on-compute --registry docker.io/rocm/pytorch-private -> rocm/pytorch-private:pyt_sglang_disagg_llama-3.3-70b-fp8 1,878 / 3,644 tok/s @ CON=8/16
4 Large EP (16 GPUs / 2 nodes) --build-on-compute --registry -> rocm/pytorch-private:pyt_large_ep_benchmark_2n 28 perf rows (DeepEP + MoRI-EP dispatch/combine, FP8/BF16 RDMA/NVL)
5 KV cache transfer (16 GPUs / 2 nodes) --build-on-compute --registry -> rocm/pytorch-private:pyt_kvcache_transfer_bench 57 perf rows (mooncake / MORI / RIXL, 4KB → 1GB sweep)
6 Re-smoke post-R4 fixes: pyt_large_ep_benchmark_2n on [008,024] slurm_multi from manifest 8 min, 28 ✅ Success rows -- confirms R4 changes didn't regress the run path

The vLLM-disagg run path was also exercised end-to-end (slurm_multi topology + co-located proxy + port refactor + MoRI-EP gating all working) but the underlying rocm/pytorch-private:20260407_itej89_vllm_mori_docker image hits a NIXL_ERR_BACKEND after weight load. Image-side, orthogonal to this PR; tracked separately. Run-phase plumbing not gated by it.

Copilot review status

Four rounds total, all addressed:

  • Round 1 (9 inline) -- fixed in d32b06b / 5ec55ff, replied inline.
  • Round 2 (24 inline) -- 9 fixed in 63a2eae, 11 reverted to origin/develop byte-for-byte in 9109f52, 4 doc/example fixes in 0156d33, all replied inline with commit SHAs.
  • Round 3 (13 inline) -- 4 fixed in 2200cbd (profiling README typo, pytest unused-import, test_rocm_path.py revert), 7 stale-snapshot replies (already addressed in earlier rounds), 2 deferred to follow-up (pre-existing in origin/develop).
  • Round 4 (24 inline) -- 8 fixed in 3fc7587 (5 P0 + 3 hardening, see commit shape entry 12 in PR description), 14 are pre-existing flake8 / style nits in files byte-identical to origin/develop (utils/{ops,log_formatting}.py, scripts/common/{tools,pre_scripts}/**, tests/fixtures/utils.py, tests/integration/, core/errors.py, database/mongodb.py, deployment/base.py:645) -- not introduced by this PR; deferred to a separate cleanup PR.

The "P0" subset of Round 4 included a real latent crash I'd introduced in the Cat B revert (run_orchestrator.py still passing Context(rocm_path=...) after core/context.py was reverted to upstream's signature) -- that's now fixed and verified by the live re-smoke above.

Bugs surfaced and fixed during real-cluster testing

  • --registry parsing: shorthand rocm/pytorch-private was hitting docker login rocm (NXDOMAIN). Now auto-prefixed to docker.io/rocm/pytorch-private with a transparent notice; malformed input rejected up front with usage hints (98f1ac5).
  • --build-on-compute local image name: dockerfile path with no .ubuntu.amd.Dockerfile suffix (e.g., scripts/.../Dockerfile) yielded ci-..._Dockerfile -- Docker rejects uppercase. .lower() the assembled name (ae1996c).
  • _load_and_merge_manifest was still doing shallow dict.update() despite the Round-2 "deep merge" fix -- promoted to a recursive module-level helper used by both code paths (63a2eae).
  • manifest.context.docker_env_vars was being dropped on --manifest-file runs, losing build-time secrets / MAD_SECRETS_* (63a2eae).
  • Completion marker was not job-id namespaced -- a second madengine run of the same model tag returned success in ~30 s while the new sbatch was still running (710b51a).
  • ErrorContext was being constructed with 5 invalid kwargs across build_orchestrator.py, masking the real error (5ec55ff).
  • _collect_slurm_multi_results was only looking at <cwd>/perf.csv, missing the canonical /shared_inference/<user>/<jobid>/perf.csv location the inner scripts emit (5ec55ff).

Sequencing

  1. Updates to slurm launcher #86 (this PR) → develop.
  2. Kubernetes deployment (split from PR #86) #118 (Kubernetes) rebases onto develop and merges.
  3. After both land, follow-up cherry-picks the combined work into main per repo policy.

What I'd appreciate eyes on

  • The slurm_multi launcher core (deployment/slurm.py, slurm_node_selector.py, the slurm_multi wrapper-script generation).
  • The --build-on-compute / --use-image orchestration in build_orchestrator.py (registry normalization, lowercase tagging, prebuilt-image multi-model handling, per-registry auth hints).
  • run_orchestrator._deep_merge semantics for nested slurm.node_selector / distributed.<sub> overrides.
  • container_runner._run_self_managed (the new self-managed launcher path used by slurm_multi).
  • The models.json slurm-side example configs renamed to slurm-multi-*.json and the launchers / deployment / cli-reference docs.

Reviews welcome. cc @gargrahul @Cemberk @Rohan138 @coketaste @leconcio

raviguptaamd and others added 2 commits May 6, 2026 22:52
Three P1 items in PR-introduced code, plus four small cleanups Copilot
flagged on the same lines.

container_runner.py
  * get_env_arg: replace single-quote string interpolation
    (`--env KEY='value'`) with shlex.quote, so values containing single
    quotes / whitespace / shell metacharacters no longer break the
    assembled `docker run` command. shlex.quote is byte-equivalent
    for safe alphanumeric values (returns the input unchanged) and
    only adds escaping when needed. Not on the slurm_multi path
    (self-managed launcher short-circuits at L1098 before the docker
    docker_options assembly that calls get_env_arg).
  * Revert the AMD GPU SMI block (~L1142) to origin/develop's bare-name
    form (`amd-smi 2>/dev/null || rocm-smi 2>/dev/null || echo ...`).
    The PR-introduced version interpolated the *host* `context.ctx['rocm_path']`
    (or `get_rocm_path()`) into absolute `/opt/rocm-X.Y/bin/*` paths and
    executed them *inside* the container, which is incorrect when the
    container's ROCm install location differs from the host's. Bare
    names rely on the container's PATH, which is the upstream behaviour.
  * Replace bare `except:` at L424 (PR-introduced) with `except Exception:`
    so KeyboardInterrupt / SystemExit propagate (flake8 E722).
  * Fix a stale function-name reference in the perf-pattern comment at
    L1394: deployment/base.py's function is `_parse_performance_from_log`,
    not `_parse_node_log_for_perf`.
  * Hoist `import shlex` to the module-level import block (was a local
    import inside `_run_self_managed`) and drop the now-unused
    `from madengine.core.constants import get_rocm_path`.

cli/commands/run.py
  * Update timeout validation message to match the actual accepted
    shape ("-1 (default), 0 (no timeout), or a positive integer").
    The --timeout help string already documents `0 for no timeout`,
    and the validator uses `if timeout < -1`, so 0 was always
    accepted; only the error string was misleading.

pyproject.toml
  * Drop `pytest` from [project].dependencies (it stays in
    [project.optional-dependencies].dev and .all). Keeps test
    dependencies out of `pip install madengine` runtime installs.

tests/integration/test_docker_integration.py
  * Update 5 build-arg assertions to match `DockerBuilder.get_build_arg()`'s
    shlex.quote() output for shell-safe values
    (`--build-arg ARG1=value1`, no surrounding quotes).
    Verified: `pytest -k build_arg` -> 4 passed.

slurm_multi safety: get_env_arg is reachable only from the regular
docker run path (L1058); the slurm_multi self-managed launcher
returns at L1099 before that. The SMI revert and the bare-`except`
fix are both in code paths used only outside slurm_multi (image
pull / GPU pre-check). The timeout, pyproject and comment-typo
changes have no runtime effect.

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>

# Conflicts:
#	CHANGELOG.md
Copilot AI review requested due to automatic review settings May 6, 2026 22:53
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 73 out of 74 changed files in this pull request and generated 14 comments.

Comments suppressed due to low confidence (2)

src/madengine/deployment/base.py:236

  • _monitor_until_complete() never treats CANCELLED (or any non-SUCCESS/FAILED terminal status) as terminal. Since DeploymentStatus still includes CANCELLED, this loop can spin forever after a cancellation. Consider returning on CANCELLED as well (and/or treating any non-PENDING/RUNNING status as terminal).
        self.console.print("[blue]Monitoring deployment...[/blue]")

        while True:
            status = self.monitor(deployment_id)

            if status.status in [DeploymentStatus.SUCCESS, DeploymentStatus.FAILED]:
                return status

            # Still running, wait and check again
            self.console.print(
                f"  Status: {status.status.value} - {status.message}"
            )
            time.sleep(30)  # Check every 30 seconds

src/madengine/execution/container_runner.py:251

  • create_run_details_dict() indexes self.context.ctx["docker_env_vars"]["MAD_SYSTEM_GPU_ARCHITECTURE"] directly. If docker_env_vars or the key is missing (e.g., older manifests / non-GPU runs / context init differences), this will raise KeyError and prevent perf entry generation. Use .get() with a safe fallback (or guard when the nested dict/key is absent).

Comment on lines +1012 to +1019
# Merge env_vars from model_info (models.json) into docker_env_vars
if "env_vars" in model_info and model_info["env_vars"]:
model_env_count = 0
for key, value in model_info["env_vars"].items():
self.context.ctx["docker_env_vars"][key] = str(value)
model_env_count += 1
if model_env_count > 0:
print(f"ℹ️ Merged {model_env_count} environment variables from model_info (models.json)")
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in bcafdbb: restored the upstream finalize_container_rocm_path() call block (originally introduced on develop in #110, removed by this PR's 6f2e11bf when the model_info env_vars merge block replaced it). The two address different concerns and must coexist — the env_vars merge copies model-card values into docker_env_vars, while finalize normalizes in-container ROCM_PATH via OCI inspect → in-image probe → /opt/rocm default. Block restored verbatim from origin/develop (including the user_supplied_rocm_path short-circuit), placed immediately after the env_vars merge.

slurm_multi safety: the self-managed launcher returns at L1099 before this code path, so this restoration only affects the regular-Docker run path used by non-slurm_multi workloads.

Comment on lines +1672 to +1675
model_docker.sh(f"cp {model_dir}/*_profiler_output.csv . 2>/dev/null || true")
model_docker.sh(f"cp {model_dir}/*_output.csv . 2>/dev/null || true")
model_docker.sh(f"cp {model_dir}/*_trace.csv . 2>/dev/null || true")
model_docker.sh(f"cp {model_dir}/library_trace.csv . 2>/dev/null || true")
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in bcafdbb: model_dir is now passed through shlex.quote once and reused across the four cp ... 2>/dev/null || true calls. The trailing *.csv glob is left unquoted intentionally so the in-container shell still expands it inside the (quoted) directory:

_md_q = shlex.quote(model_dir)
model_docker.sh(f"cp {_md_q}/*_profiler_output.csv . 2>/dev/null || true")
model_docker.sh(f"cp {_md_q}/*_output.csv . 2>/dev/null || true")
model_docker.sh(f"cp {_md_q}/*_trace.csv . 2>/dev/null || true")
model_docker.sh(f"cp {_md_q}/library_trace.csv . 2>/dev/null || true")

Comment on lines +1682 to 1686
mult_res = model_info.get("multiple_results")
if mult_res:
try:
model_docker.sh(
_cp_model_dir_file_to_cwd_cmd(model_dir, mult_res)
)
model_docker.sh(f"cp {model_dir}/{mult_res} . 2>/dev/null || true")
except Exception:
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in bcafdbb: both path components are now quoted, so a model-metadata multiple_results value containing whitespace or shell metacharacters can no longer break the command or be re-interpreted by the shell:

model_docker.sh(
    f"cp {shlex.quote(model_dir)}/{shlex.quote(mult_res)} . "
    "2>/dev/null || true"
)

"""

import json
import os
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in bcafdbb: dropped the unused import os (flake8 F401). The module hasn't referenced os.* since the orchestration helpers it used were moved out.

Comment on lines 82 to 93
context = generate_additional_context_for_machine()
output = global_data["console"].sh(
"cd "
+ BASE_DIR
+ "; "
+ "MODEL_DIR="
+ dynamic_skip_gpu_arch_model_dir
+ MODEL_DIR
+ " "
+ f"python3 -m madengine.cli.app run --tags dummy_skip_gpu_arch --live-output --additional-context '{json.dumps(context)}'"
)
if "Skipping model" not in output:
pytest.fail("Enable skipping gpu arch for running model is failed.")
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deferred. The line at #L93 is from 04aac390 (Stephen Shao) and is byte-identical to the version on origin/develop in this PR. develop's [2.0.1] already lands the hardware-agnostic fix ([2.0.1] -> "E2E tests — hardware-agnostic GPU arch skip"test_commandline_argument_skip_gpu_arch now detects the runner's GPU arch at runtime and injects it into skip_gpu_arch, with a new get_gpu_arch() helper in tests/fixtures/utils.py). Once #86 lands and CI re-runs against develop, that fix is what executes; no #86-side change needed. Tracked alongside the other origin/develop-identical concerns in the upstream-flake8 follow-up cleanup PR (per #86 description, Round 4 / Round 5 / Round 6 status sections).

Comment on lines 36 to 41
try:
import subprocess
result = subprocess.run(['rocm-smi'], capture_output=True, timeout=5)
return result.returncode == 0
except Exception:
except:
return False
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deferred. File is byte-identical to origin/develop in this PR (the bare-except: at #L41 is from 04aac390). Tracked alongside the other upstream bare-except: cleanup in the follow-up PR.

Comment on lines 30 to 42
try:
self.rocmsmi = initRsmiBindings(silent=False)
if rocm_smi.driverInitialized() is True:
ret_init = self.rocmsmi.rsmi_init(0)
if ret_init != 0:
raise ValueError('ROCm SMI returned %s (the expected value is 0)', ret_init)
exit(ret_init)
else:
raise ImportError('Driver not initialized (amdgpu not found in modules)')
exit(0)
self.rocm6 = True
except Exception:
except:
rocm_smi.initializeRsmi()
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deferred. File is byte-identical to origin/develop in this PR (the bare-except: at #L42 is from da0d524f (Rahul Garg) and predates #86). Tracked in the upstream-cleanup follow-up PR.

Comment on lines 147 to 157
try:
power_info = amdsmi_interface.amdsmi_get_power_info(processor_handle)
if isinstance(power_info, dict):
# If both current and average power are 0, it's likely a secondary die
current_power = power_info.get('current_socket_power', -1)
avg_power = power_info.get('average_socket_power', -1)
if current_power == 0 and avg_power == 0:
return True
except Exception:
except:
# If we can't get power info, might be secondary die
return True
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deferred. File is byte-identical to origin/develop in this PR (the bare-except: at #L157 is from 0f2c3010 (Stephen Shao) and predates #86). Tracked in the upstream-cleanup follow-up PR.

Comment on lines +14 to 17
from unittest.mock import MagicMock, patch, Mock
import pytest
import pandas as pd

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in bcafdbb: dropped the unused import pandas as pd (flake8 F401). Confirmed there are no pd.* references in the file — the test fixtures use dict / MagicMock only, so the import was dead. The line was added by this PR's 50e8895.

Comment thread tests/unit/test_reporting_superset.py Outdated
class TestConfigParser:
"""Test cases for ConfigParser functionality."""

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in bcafdbb: trailing whitespace stripped on every blank-with-spaces line (107 occurrences in this file, including the one at L31). Pure whitespace cleanup, no functional change; flake8 W293 is now clean for this file.

Round 6 fired automatically on the develop-merge push (8fec192). 5 of
the 14 comments are PR-introduced; the remaining 9 are byte-identical
to origin/develop and stay on the upstream-flake8 cleanup follow-up
list with the rest of R4 / R5 deferrals.

container_runner.py
  * Restore the upstream `finalize_container_rocm_path()` call
    block (originally introduced on develop in PR ROCm#110, removed by
    this PR's 6f2e11b when the model_info `env_vars` merge block
    replaced it). The two blocks address different concerns and
    must coexist:
      - merge model-card env_vars into docker_env_vars (this PR), and
      - normalize in-container ROCM_PATH via OCI inspect → in-image
        probe → /opt/rocm default (upstream).
    Without the finalization, models on TheRock or non-`/opt/rocm`
    container layouts run with a wrong/empty ROCM_PATH, breaking
    rocm-smi / amd-smi / hipcc inside the container.
    The slurm_multi self-managed launcher returns at L1099 before
    this code path, so this restoration only affects the regular
    docker run path.

  * Quote `model_dir` (and `mult_res`) with shlex.quote on the
    profiler-/trace-/multi-results `cp` commands run via
    `model_docker.sh(... shell=True)` (lines 1672-1685 area). The
    `*.csv` glob is left unquoted so the in-container shell still
    expands it inside the (quoted) directory. `mult_res` comes
    straight from model metadata and was previously concatenated
    raw into the shell command.

deployment/config_loader.py
  * Drop unused `import os` (flake8 F401). The module hasn't
    referenced `os.*` since the orchestration helpers it used were
    moved out.

tests/unit/test_database_mongodb.py
  * Drop unused `import pandas as pd` (flake8 F401). The test
    file uses dict / Mock fixtures and never reaches into pandas.

tests/unit/test_reporting_superset.py
  * Strip trailing whitespace (flake8 W293) on every blank-with-
    spaces line (107 occurrences). Pure whitespace; no functional
    change.

slurm_multi safety: all five fixes either restore upstream
behaviour on the regular-Docker run path, are flake8-only, or are
in test files. None of them change anything reached by the
slurm_multi self-managed launcher path; live re-smoke is
unnecessary for this commit.

Co-authored-by: Cursor <cursoragent@cursor.com>
raviguptaamd added a commit to raviguptaamd/madengine that referenced this pull request May 10, 2026
…lti bash branch

The earlier feat(deployment) commit ported `_run_inside_existing_allocation`
verbatim from PR ROCm#86 (used by the slurm_multi bash-in-salloc branch).
That method constructs `DeploymentResult(..., skip_monitoring=True)` to
signal that the script ran synchronously and the monitor phase should
be skipped. Develop's `DeploymentResult` dataclass did not have that
field, so the construction raised `TypeError` at runtime, the failure
was swallowed by the orchestrator, and `madengine run` exited 0 even
when the wrapper script failed.

Two minimal additive edits to deployment/base.py:

- DeploymentResult: add `skip_monitoring: bool = False` field.
- BaseDeployment.execute(): change the monitor guard from
  `if self.config.monitor:` to
  `if self.config.monitor and not result.skip_monitoring:`
  so the bash branch can correctly bypass the SLURM job poll
  (there is no SLURM job to poll — the script ran inline).

Behavior for non-slurm_multi launchers is unchanged: they never
construct DeploymentResult with skip_monitoring=True, the field
defaults to False, and the monitor() call still fires as before.

Discovered during R5.1b cluster smoke test of
`pyt_sglang_disagg_qwen3-32b_short` from MAD-private PR #186.

Co-authored-by: Cursor <cursoragent@cursor.com>
coketaste added a commit that referenced this pull request May 29, 2026
…her (Revised PR#124) (#130)

* feat(deployment): add slurm_multi launcher (minimal additive)

Introduce the `slurm_multi` self-managed multi-node SLURM launcher,
fully additive vs develop. Verbatim minimal-delta port from PR #86
limited to the slurm_multi surface; no develop functionality removed.

- common.py: register `slurm_multi` in VALID_LAUNCHERS, add hyphen
  alias normalization (`slurm-multi` -> `slurm_multi`). All other
  develop content (functools.lru_cache, _ROCPROF_FAMILY_TOOL_NAMES,
  tools_include_rocprof_family, full configure_multi_node_profiling
  rocprof-family handling, sglang-disagg in VALID_LAUNCHERS) preserved.
- slurm.py: SLURM_MULTI_ALIASES; reservation field + allocation
  detection in __init__; helper methods (_get_allocation_node_count,
  _validate_allocation_nodes); slurm_multi early dispatch in prepare();
  _prepare_slurm_multi_script (sets `_is_slurm_multi = True` for
  deploy() gate); _generate_slurm_multi_command; new elif in
  _generate_launcher_command (existing primus/sglang-disagg dispatch
  preserved); gated bash-in-salloc branch in deploy() (slurm_multi
  only -- non-slurm_multi launchers continue to use sbatch);
  _run_inside_existing_allocation; _collect_slurm_multi_results +
  early dispatch in collect_results(); reservation= kwarg threaded
  to existing SlurmNodeSelector(...) call.
- slurm_node_selector.py: optional `reservation` parameter so the
  reservation propagates to srun health/cleanup commands.

Source: ROCm/madengine PR #86 (slurm_multi surface only). 0 deletions
from develop.

Co-authored-by: Cursor <cursoragent@cursor.com>

* feat(orchestration): build_on_compute, registry gate, parallel pull for slurm_multi

Wire the build/run orchestration paths needed by the slurm_multi
launcher. All existing flows for non-slurm_multi launchers continue
to work unchanged; new behavior is gated on
`distributed.launcher == "slurm_multi"` (or `slurm-multi`).

- execution/container_runner.py: SLURM_MULTI_ALIASES,
  SELF_MANAGED_LAUNCHERS constants; `_run_self_managed` method (runs
  the model script directly on baremetal so srun/scontrol work
  inside it, propagates env from model_info.env_vars and
  additional_context.env_vars); self-managed launcher early dispatch
  in run_container() that fires only when launcher matches
  SELF_MANAGED_LAUNCHERS; `.slurm` extension recognized alongside
  `.sh` in script-extension detection (T-C2). Develop helpers
  preserved verbatim (_print_run_env_table, _docker_image_exists_locally,
  _bash_quote_path, _cp_model_dir_file_to_cwd_cmd, _resolve_docker_image,
  _ENV_KEY_RE env-var validation, MAD_OUTPUT_CSV passthrough,
  PRIMUS_CONFIG_PATH/PRIMUS_CLI_EXTRA in env-var allowlist,
  PERFORMANCE_LOG_PATTERN regex, CSV fieldname stripping).
- execution/docker_builder.py: after build, write resolved registry
  image into built_models[*].env_vars.DOCKER_IMAGE_NAME so the
  run-time `srun docker pull "\$DOCKER_IMAGE_NAME"` on each compute
  node finds the registry image. Pure addition (one block before
  manifest assembly).
- orchestration/build_orchestrator.py: import shutil; capture
  user-supplied slurm keys (_original_user_slurm_keys); add
  `use_image: Optional[str]` and `build_on_compute: bool` to
  execute(); mutex validation between use_image and build_on_compute;
  use_image dispatch (incl. "auto" sentinel that resolves from model
  card via _resolve_image_from_model_card); build_on_compute
  dispatch (registry required, validated in _execute_build_on_compute);
  slurm_multi registry gate that runs DiscoverModels early to detect
  slurm_multi tags and either raises a structured ConfigurationError
  with helpful suggestions or auto-uses DOCKER_IMAGE_NAME from the
  model card env_vars (implicit --use-image fallback);
  _execute_with_prebuilt_image; _resolve_image_from_model_card;
  _execute_build_on_compute (sbatch wrapper that builds + pushes from
  one compute node, polls completion, marks `built_on_compute: true`
  in the manifest). Develop's load_credentials top-level usage,
  detect_local_gpu_arch parameter, and resolved_arch print all
  preserved unchanged.

Note: PR #86's universal merge of model_info.env_vars into
docker_env_vars in run_container() was DELIBERATELY DROPPED -- it
is not slurm_multi-essential (slurm_multi gets env_vars via
_prepare_slurm_multi_script's wrapper export block and via
_run_self_managed's own merge). Applying it universally would
silently change behavior for non-slurm_multi launchers and has an
incorrect precedence vs additional_context.env_vars; it can be
revisited as a separate PR if needed.

Source: ROCm/madengine PR #86 (slurm_multi surface only). 0 deletions
from develop except the agreed `.sh` -> `.sh|.slurm` line modification
(T-C2: harmless extension to script-extension recognition).

Co-authored-by: Cursor <cursoragent@cursor.com>

* feat(cli): expose --use-image and --build-on-compute on madengine build

Surface the orchestrator paths from the previous commit through the
`madengine build` CLI:

- cli/commands/build.py: add `--use-image [IMAGE | auto]` (skip the
  local Docker build and use the named image; `auto` resolves from
  the model card's DOCKER_IMAGE_NAME); add `--build-on-compute`
  (build on a SLURM compute node and push to the configured
  registry; manifest records `built_on_compute: true`); 3 mutex
  validation blocks (--use-image vs --registry, --use-image vs
  --build-on-compute, --build-on-compute requires --registry); pass
  both new kwargs through to validate_additional_context,
  create_args_namespace, and BuildOrchestrator.execute().
- cli/validators.py: add `use_image: Optional[str] = None`
  parameter (signature + docstring only -- body unchanged) so
  build.py's call site does not TypeError. PR #86's other
  validators.py rewrites (drops of ROCM_PATH / MAD_ROCM_PATH
  validation, dockerfile_matched fallback removal) are NOT
  inherited; develop validation is preserved.

Required for the slurm_multi launcher's documented build options
in MAD-private #186.

Co-authored-by: Cursor <cursoragent@cursor.com>

* test(slurm_multi): contract tests + minimal example config

- tests/unit/test_slurm_multi.py: nine tests across three classes
  - TestSlurmMultiRegistration: `slurm_multi` is in VALID_LAUNCHERS;
    SLURM_MULTI_ALIASES contains both canonical and hyphen forms.
  - TestNormalizeSlurmMultiAliases: canonical and hyphen alias both
    normalize to `slurm_multi`; unknown values still fall through to
    the existing `slurm -> docker` default.
  - TestSlurmMultiPrepareScript: end-to-end fixture that mirrors
    MAD-private PR #186's `pyt_sglang_disagg_qwen3-32b_short` entry
    verbatim. Asserts that prepare() takes the slurm_multi early-
    dispatch path, sets `_is_slurm_multi = True` for the deploy()
    gate, emits an SBATCH header that reflects the model card slurm
    block, and emits every model_info.env_vars key as
    `export KEY="value"` in the wrapper script. This is the explicit
    safety net for the Q2 decision to drop PR #86's universal
    docker_env_vars merge -- if slurm_multi ever stops propagating
    those env_vars to the workload, this test fails loudly.
  Existing tests in tests/unit/test_deployment.py and
  tests/unit/test_container_runner.py are NOT modified.
- examples/slurm-configs/minimal/slurm-multi-minimal.json:
  small reference config alongside develop's existing
  *-minimal.json examples (`-f` add because *.json is gitignored,
  matching how existing files in this dir were originally added).

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(deployment): add skip_monitoring to DeploymentResult for slurm_multi bash branch

The earlier feat(deployment) commit ported `_run_inside_existing_allocation`
verbatim from PR #86 (used by the slurm_multi bash-in-salloc branch).
That method constructs `DeploymentResult(..., skip_monitoring=True)` to
signal that the script ran synchronously and the monitor phase should
be skipped. Develop's `DeploymentResult` dataclass did not have that
field, so the construction raised `TypeError` at runtime, the failure
was swallowed by the orchestrator, and `madengine run` exited 0 even
when the wrapper script failed.

Two minimal additive edits to deployment/base.py:

- DeploymentResult: add `skip_monitoring: bool = False` field.
- BaseDeployment.execute(): change the monitor guard from
  `if self.config.monitor:` to
  `if self.config.monitor and not result.skip_monitoring:`
  so the bash branch can correctly bypass the SLURM job poll
  (there is no SLURM job to poll — the script ran inline).

Behavior for non-slurm_multi launchers is unchanged: they never
construct DeploymentResult with skip_monitoring=True, the field
defaults to False, and the monitor() call still fires as before.

Discovered during R5.1b cluster smoke test of
`pyt_sglang_disagg_qwen3-32b_short` from MAD-private PR #186.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(slurm_multi): aggregate per-job perf.csv into cwd for dashboard reporter

The slurm_multi model script (e.g., MAD-private's run_xPyD_models.slurm)
writes its perf CSV to /shared_inference/$USER/$JOBID/perf.csv.
`_collect_slurm_multi_results` correctly finds and ingests that CSV via
`_collect_results_parse_perf_csv` (results['successful_runs'] is
populated, manifest exit codes are right). However the CLI's
post-run reporter (`display_performance_table` in cli/utils.py) reads
from cwd 'perf.csv' by default, so users saw a cosmetic
"Performance CSV not found: perf.csv" warning even on a successful
slurm_multi run. Local and classic-SLURM flows already leave a
cumulative perf.csv in cwd via update_perf_csv(); slurm_multi did not.

Fix: inside `_collect_slurm_multi_results`, after the per-job CSV is
located and parsed, also write its rows into the conventional cwd
perf.csv path:
  * if cwd 'perf.csv' is absent: shutil.copy() the per-job file
  * if it exists (from previous runs): append data rows (skip the
    per-job header so the cwd file stays single-headed)

Original per-job CSV at /shared_inference/$USER/$JOBID/perf.csv is
not modified or deleted. Wrapped in try/except so any I/O failure
degrades to a yellow warning rather than failing the run.

Affects only `_collect_slurm_multi_results`, which only runs for the
slurm_multi launcher dispatch added earlier on this branch. Zero
effect on non-slurm_multi launchers (they still take develop's
classic collect_results path which writes cwd perf.csv via
update_perf_csv).

Validated end-to-end: Llama-3.1-8B smoke on 3 reservation nodes
(Distributed_Inference_CI), bash branch in salloc, parallel docker
pull on all nodes, proxy/prefill/decode came up, benchmark
concurrency=8 / 32-32 ISL/OSL ran to completion, run exit 0, cwd
perf.csv now contains the per-job row (THROUGHPUT/TTFT/ITL columns
match per-job source).

Co-authored-by: Cursor <cursoragent@cursor.com>

* docs(slurm_multi): CHANGELOG entry + forward-compat TODO on --use-image

- CHANGELOG.md: new [Unreleased] section documenting the slurm_multi
  launcher, --use-image / --build-on-compute CLI flags, the slurm_multi
  build registry gate, the bash-in-salloc execution path, the
  DeploymentResult.skip_monitoring field, the SlurmNodeSelector
  reservation parameter, the new contract tests + minimal example,
  and the cwd perf.csv aggregation fix.
- cli/commands/build.py: comment near the --use-image Typer option
  flagging that `is_flag=False, flag_value="auto"` is being deprecated
  by Typer, with the planned migration path. Behavior unchanged
  (matches MAD-private PR #186's documented UX).

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(slurm_multi): address Copilot review on PR #124

Manifest-shape correctness (Copilot C2):
- _execute_with_prebuilt_image now keys built_images by model_name (one
  entry per model, all pointing at the same use_image) so it matches
  built_models and the rest of the codebase's invariant that
  ContainerRunner.run_models_from_manifest() relies on at
  built_models.get(image_name, {}). Without this, --use-image (and the
  implicit slurm_multi DOCKER_IMAGE_NAME gate) silently skipped every
  model in the run phase.

Multi-model dedup TypeError fix (Copilot C3):
- Replace tuple(sorted((m.get('distributed') or {}).items())) and the
  same pattern for slurm with json.dumps(..., sort_keys=True, default=str).
  The old pattern raised TypeError: unhashable type: 'dict' as soon as a
  model's distributed block contained nested dicts (sglang_disagg /
  vllm_disagg).

slurm_multi wrapper completion-marker on failure (Copilot C6):
- The wrapper script enables `set -e` at the top, so a non-zero exit
  from the model script terminated the wrapper before SCRIPT_EXIT_CODE
  was captured and the completion marker was written -- failed runs
  looked like hangs to monitor(). Wrap the bash invocation in
  `set +e` / `set -e` so the marker is always written.

Self-managed launcher hygiene (Copilot C4, C5, C7):
- Drop unused `import shutil` inside _run_self_managed.
- Drop unused script_name local in _run_self_managed (only script_path
  is consumed by the function).
- Redact env_vars values in the run log (`ENV: KEY=<set>`) so model-card
  credentials (HF_TOKEN, MAD_DOCKERHUB_PASSWORD, etc.) don't leak.

Build orchestrator hygiene (Copilot C1, C8):
- Drop top-level `import shutil` (unused).
- Drop the two `requires_auth = ...` assignments in
  _execute_build_on_compute (assigned but never read).

Validators docstring honesty (Copilot C10):
- validate_additional_context's `use_image` parameter is currently
  informational only (parameter retained for build.py call-site
  compatibility); the docstring previously claimed it skipped required-
  field validation, which the body never implemented. Tightened the
  docstring to match reality.

New contract tests (tests/unit/test_slurm_multi.py):
- test_wrapper_disables_set_e_around_model_script: locks in C6's
  set +e/SCRIPT_EXIT_CODE/set -e ordering around the bash invocation.
- test_built_images_and_models_share_keys: locks in C2's invariant
  using a fake 2-model fixture and a stubbed BuildOrchestrator.
- test_multi_model_nested_dict_distributed_does_not_raise: regression
  for C3 using two sglang_disagg-style cards with differing nested
  dicts in `distributed`; would TypeError under the old code.

Skipped: Copilot C9 (broader BuildOrchestrator branch tests for
--use-image / --build-on-compute / registry-gating). The slurm_multi
contract suite + multi-node cluster smoke already cover the slurm_multi
surface end-to-end; broader orchestrator-branch coverage is a worthwhile
follow-up but out of scope for this minimal-additive PR.

Tests:
- pytest tests/unit/test_slurm_multi.py -v  -> 12 passed (was 9).
- pytest tests/unit -q  -> 436 passed / 2 failed; the 2 failures are
  the same PermissionError environmental baseline that exists on
  develop (test_upload_file_to_mongodb_file_not_found and
  test_validate_additional_context_file_not_found, both raise
  PermissionError on '/nonexistent/file.json' instead of FileNotFound
  on this filesystem). Zero new failures.

Co-authored-by: Cursor <cursoragent@cursor.com>

* refactor(slurm_multi): consolidate self-managed launcher detection and support multi-model build-on-compute

- Extract is_self_managed_launcher() helper in deployment/common.py; replace
  3 duplicated SLURM_MULTI_ALIASES + normalize blocks across slurm.py,
  container_runner.py, and build_orchestrator.py. Remove dead
  _generate_slurm_multi_command (orphaned after early-dispatch bypass).
- Preserve model-card xP/yD over distributed.sglang_disagg defaults
  (skip-if-set in _prepare_slurm_multi_script).
- Generalize _execute_build_on_compute to build/tag/push all discovered
  models in a single sbatch job; key manifest by model_name to match
  _execute_with_prebuilt_image so ContainerRunner can join entries.
  Apply shlex.quote to registry_host/cwd/image names (consistent with
  v2.0.3 hardening pass).
- Tests: parametrize is_self_managed_launcher, add TestXpYdSkipIfSet and
  TestBuildOnComputeManifestShape covering the multi-model contract.
- CHANGELOG: clarify slurm_multi as an escape-hatch launcher distinct
  from templated launchers.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

* refactor(build): reuse early model discovery and expand default e2e cleanup files

Cache the DiscoverModels result from the registry-check pass so the
build step reuses it instead of running a second discovery (avoiding
duplicate get_models_json.py execution and duplicate console output).

Expand DEFAULT_CLEAN_FILES to include build_manifest.json and related
perf artefacts so stale manifests from prior e2e tests can't silently
cause the wrong image to be executed.

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>

* refactor: address Copilot review feedback on slurm_multi PR

- slurm.py: shell-quote env_var values, model script name, and model args
  in the generated SLURM wrapper script to prevent shell metacharacters
  ($(), backticks, ;, "$, etc.) in user-supplied inputs from breaking
  the script or triggering host-shell expansion. Mirrors the hardening
  pattern in container_runner._run_self_managed.
- container_runner.py: log env var keys from additional_context using
  the same `ENV: {key}=<set>` pattern already used for model-card env
  vars, so operators can see what was applied without leaking values.
- docker_builder.py: warn (via rich_console) when an entry in
  built_images has no matching key in built_models while injecting
  DOCKER_IMAGE_NAME for slurm_multi parallel pulls, so future keying
  divergence is caught instead of silently no-opping.
- build_orchestrator.py (prebuilt manifest): mark built_images entries
  with `local_image: True` so ContainerRunner.run_models_from_manifest()
  resolves run_image to use_image (and pulls on demand) instead of
  falling through to the else branch where run_image would be the model
  name (the new dict key) and _resolve_docker_image() would fail. Fixes
  --use-image for non-slurm_multi targets.
- build_orchestrator.py (build_on_compute heredoc): shell-quote
  model_name (and reuse the already-quoted local/registry/dockerfile
  tokens) in echo lines too, so a `"` or `$` in those fields cannot
  break the generated sbatch script.

* docs: update CHANGELOG and docs for v2.1.0 slurm_multi feature

Expand v2.1.0 CHANGELOG with missing entries (local self-managed
execution, model card config merge, DockerBuilder registry injection,
shell-quoting security hardening, early discovery reuse, e2e cleanup
defaults).

Update launchers.md with full slurm_multi section (architecture,
build/run workflows, salloc support). Update cli-reference.md with
--use-image and --build-on-compute flags. Update deployment.md with
slurm_multi, salloc, pre-built image sections, and reservation
parameter. Update configuration.md with reservation, exclusive, and
expanded launcher list. Update slurm-configs README with slurm_multi
minimal example, workflow, and config reference.

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: raviguptaamd <ravi.gupta@amd.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Claude Opus 4 <noreply@anthropic.com>
@coketaste
Copy link
Copy Markdown
Collaborator

All the changes have been merged to main through another PR #130

@coketaste coketaste closed this Jun 2, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants