feat: NVLink-only topology template + microbenchmark calibration (H20 8-GPU)#277

Open
tianhao909 wants to merge 1 commit into aliyun:master from tianhao909:pr/plan07-nvlink-topology-template

@tianhao909
Collaborator

PR-β: NVLink-only 8-GPU topology template + config preset

Target repository: aliyun/SimAI


Summary

Adds two gen_Topo_Template.py CLI flags (--no-asw, --no-psw) and a new SimAI_nvlink_only.conf preset, enabling reproducible single-node 8-GPU NVLink-only microbenchmarks on H20/H100 systems without changing any RDMA-path defaults. The default SimAI.conf is left untouched so existing RDMA users are not affected.

Key Changes

  1. astra-sim-alibabacloud/inputs/topo/gen_Topo_Template.py (+53 lines)
    • New CLI flags --no-asw (skip aggregation-switch tier) and --no-psw (skip pod-switch tier)
    • When both are passed, emits a single-tier NVLink-only topology with the header 9 8 1 0 8 H20 (meaning nodes=9 gpu_per_server=8 nv_switch_num=1 (reserved)=0 edges=8 gpu_type=H20) and 8 GPU↔NVSwitch edges (star topology: every GPU connects to the single NVSwitch; there are no GPU↔GPU direct edges)
  2. astra-sim-alibabacloud/inputs/config/SimAI_nvlink_only.conf (new, ~66 lines)
    • Byte-identical to SimAI.conf except ENABLE_QCN 0 (QCN is RDMA-specific; NVLink-only runs should not enable it)
    • Default SimAI.conf is NOT modified (deliberate choice — see Reviewer FAQ)
  3. .gitignore (+4 lines): housekeeping entries for common transient artefacts (*.log.tmp, generated topo dumps)

Total diff: ~123 lines across 3 files, 1 commit.
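
The header arithmetic for the single-tier case in item 1 can be sketched as follows. This is a minimal illustration, not the code in this PR: the function name `nvlink_only_topology`, the node numbering, and the `link_attrs` placeholder for the per-link fields are assumptions. Only the six header fields and the star shape come from the description above.

```python
def nvlink_only_topology(gpu=8, gpu_per_server=8, nv_switch_per_server=1,
                         gpu_type="H20", link_attrs="BW LAT ERR"):
    """Return (header, edges) for a star topology: every GPU links to its NVSwitch.

    link_attrs is a hypothetical placeholder for the per-link fields
    (e.g. bandwidth, latency, error rate); real values are not shown here.
    """
    if gpu % gpu_per_server != 0:
        raise ValueError("gpu must be divisible by gpu_per_server")
    servers = gpu // gpu_per_server
    nv_switch_num = servers * nv_switch_per_server
    nodes = gpu + nv_switch_num
    links = gpu * nv_switch_per_server
    # Field order: nodes gpu_per_server nv_switch_num (reserved) edges gpu_type
    header = f"{nodes} {gpu_per_server} {nv_switch_num} 0 {links} {gpu_type}"

    edges = []
    for g in range(gpu):
        server = g // gpu_per_server
        for s in range(nv_switch_per_server):
            # Assumed numbering: GPUs take ids 0..gpu-1, NVSwitches follow.
            switch_id = gpu + server * nv_switch_per_server + s
            edges.append(f"{g} {switch_id} {link_attrs}")
    return header, edges


header, edges = nvlink_only_topology()
print(header)      # 9 8 1 0 8 H20
print(len(edges))  # 8
```

With the defaults, all eight edges point at node 8 (the single NVSwitch) and no GPU-to-GPU edge is produced.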

Testing

Fingerprint (inline):

gcc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
Linux kernel 5.10.134-16.3.al8.x86_64 x86_64 GNU/Linux
Python 3.13.11

Topology generation smoke (runs in < 1 second, no build needed):

# After checking out this PR's branch:
python3 astra-sim-alibabacloud/inputs/topo/gen_Topo_Template.py \
    --no-asw --no-psw -g 8 -gt H20 -o /tmp/topo_smoke.txt

head -1 /tmp/topo_smoke.txt                      # → 9 8 1 0 8 H20
awk 'NR>2 && NF>=5 {c++} END {print c}' /tmp/topo_smoke.txt  # → 8  (GPU↔NVSwitch edges, star topology)
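
The same two checks can be written in Python, for instance as the seed of a unit test. The sketch below assumes only what the awk filter implies: the header is line 1, edge records start after line 2, and each has at least five whitespace-separated fields. The function name `summarize_topology` and the sample input are hypothetical.

```python
def summarize_topology(text):
    """Return (header_line, edge_count) using the same rule as the awk one-liner."""
    lines = text.splitlines()
    header = lines[0].strip() if lines else ""
    # awk's NR is 1-based, so NR > 2 means "skip the first two lines".
    edge_count = sum(1 for line in lines[2:] if len(line.split()) >= 5)
    return header, edge_count


# Illustrative input only: the link fields (bw, lat, err) are placeholders.
sample = "\n".join(
    ["9 8 1 0 8 H20", "8"] + [f"{g} 8 bw lat err" for g in range(8)]
)
header, edge_count = summarize_topology(sample)
print(header)      # 9 8 1 0 8 H20
print(edge_count)  # 8
```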

End-to-end microbenchmark reproduction on H20 (single node, 8× GPU, NVLink only):

  • Case set: AllReduce / AllGather / ReduceScatter / AllToAll at 16 MiB and 64 MiB (2 of 10 cases use an AllReduce∘AllToAll split-proxy — see Known Limitations c4)
  • In-distribution PASS rate: 10/10
  • Per-case error vs ground truth: avg_err = 12.22%, worst_err = 26.60%
  • Holdout (12 cases that did NOT participate in calibration): 10/12 PASS, avg 15.94%, worst 44.20% — see c2 below
  • SimAI binary under test: sha256 7751d90879921a8854c25da21516759679038e79ca89d2402f1a252f778884c3
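
For readers reproducing the summary figures, the aggregation is presumably a mean and a maximum over per-case relative errors. The exact error definition and the PASS threshold are not stated in this PR, so the formula below is an assumption, and the input values are placeholders rather than measurements.

```python
def error_summary(cases):
    """cases: iterable of (simulated, measured) pairs in the same unit.

    Returns (avg_err, worst_err) as percentages, assuming
    err = |simulated - measured| / measured.
    """
    errs = [abs(sim - ref) / ref * 100.0 for sim, ref in cases]
    if not errs:
        raise ValueError("no cases given")
    return sum(errs) / len(errs), max(errs)


# Placeholder values for illustration; not results from this PR.
avg_err, worst_err = error_summary([(110.0, 100.0), (80.0, 100.0), (100.0, 100.0)])
print(f"avg_err = {avg_err:.2f}%, worst_err = {worst_err:.2f}%")
```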

The precision numbers above depend on operator-wise AS_SEND_LAT calibration applied as environment variables at run time, not as source-tree values. This PR introduces no such constants into the codebase — see c1 and c3.

Known Limitations

c1 — HIL verdict = WARN-PASS. The human-in-the-loop acceptance review classified this set as PASS but with caveats documented here; it is not an unconditional acceptance.

c2 — Holdout worst_err = 44.20% (Caveat tier). The 12-case holdout set (sizes/operators not used in calibration) reveals that size-axis generalization is bounded; the in-distribution 12.22% figure cannot be extrapolated without the same calibration policy.

c3 — AS_SEND_LAT=80 for AllToAll is out-of-scan. The calibration scan covered [0, 30]; the AS_SEND_LAT=80 value used for AllToAll is a manual extrapolation chosen to match ground truth. It is external (env var), not in source.
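
A minimal sketch of how a per-operator value can be injected at run time while flagging anything outside the scan. `AS_SEND_LAT`, the [0, 30] range, and the AllToAll value 80 come from c3; the helper name `run_env` is hypothetical, and how the simulator binary is launched is deliberately left out.

```python
import os

SCAN_RANGE = (0, 30)  # calibration scan bounds stated in c3


def run_env(as_send_lat, base_env=None):
    """Return (env, in_scan): a copy of the environment with AS_SEND_LAT set,
    plus whether the value lies inside the calibrated scan range."""
    env = dict(os.environ if base_env is None else base_env)
    env["AS_SEND_LAT"] = str(as_send_lat)
    lo, hi = SCAN_RANGE
    return env, lo <= as_send_lat <= hi


env, in_scan = run_env(80)  # the AllToAll value from c3
print(env["AS_SEND_LAT"], "in-scan" if in_scan else "out-of-scan (extrapolated)")
```

The returned dictionary can be passed as `env=` to `subprocess.run`, which keeps the calibration outside the source tree.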

c4 — 2/10 cases use a split-proxy. AllReduceAllToAll as a native fused operator crashes (exit 139) on aliyun/SimAI:master (even after the companion PR-α's UB fix, which has a different root cause). The reported results for those 2 cases use an AllReduce ∘ AllToAll split. This PR does not claim to fix the fused-op crash.
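
By shell convention, an exit status above 128 means the process was terminated by signal number (status - 128), so 139 corresponds to signal 11, which is SIGSEGV on Linux. A small helper (hypothetical name) makes that mapping explicit:

```python
import signal


def signal_from_exit_status(status):
    """Map a shell-style exit status (128 + N) to a signal name, else None."""
    if status <= 128:
        return None
    try:
        return signal.Signals(status - 128).name
    except ValueError:
        return None


print(signal_from_exit_status(139))  # SIGSEGV on Linux
```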

c5 — Path evidence is not packet-level. Verification is at the topology header + NVLink edge count + NCCL/MockNccl-layer log level, not at ns-3 packet trace level.

Scope Disclaimer

English: This result does NOT serve as a proof of SimAI's generalization accuracy on H20/NVLink across all scenarios. It only reproduces a specific microbenchmark set under a specific calibration. Generalization to other models, sizes, topologies, or fused ops is not established.

中文:本结果不构成对 SimAI 在 H20/NVLink 全场景泛化准确性的证明。它仅复现特定 microbenchmark 集合在特定 calibration 下的精度。对其他模型、规模、拓扑、融合算子的泛化能力均未建立。

Notes

  • Base: aliyun/SimAI:master (currently f5efb5a)
  • Head: tianhao909/SimAI:pr/plan07-nvlink-topology-template
  • 1 commit: feat: NVLink-only 8-GPU topology template + config preset
  • Independent of PR-α and PR-γ — can be reviewed / merged in any order
  • No rdma-hw.cc, no build-system change, no submodule bump in this PR

Checklist

  • Conventional commit message (feat: …)
  • Default SimAI.conf NOT touched
  • All five caveats (c1–c5) documented verbatim in Known Limitations
  • Scope Disclaimer inline, both English and Chinese, as a > blockquote
  • Testing section uses only inline values (no local filesystem paths)
  • CLI flag additions are backward compatible (defaults reproduce original behavior)
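
The backward-compatibility item follows naturally if both flags are `store_true` options, since those default to False. The sketch below uses the flag spellings from the smoke command (`--no-asw`, `--no-psw`, `-g`, `-gt`, `-o`); the `dest` names, defaults, and the helper `is_nvlink_only` are assumptions, and the real parser in gen_Topo_Template.py may be structured differently.

```python
import argparse


def build_parser():
    p = argparse.ArgumentParser(description="Topology generator flags (sketch)")
    # Short flags as used in the smoke command; dest names and defaults are assumed.
    p.add_argument("-g", dest="gpu", type=int, default=None)
    p.add_argument("-gt", dest="gpu_type", default=None)
    p.add_argument("-o", dest="output", default=None)
    # store_true flags default to False, so omitting them keeps the old tiers.
    p.add_argument("--no-asw", action="store_true", help="skip aggregation-switch tier")
    p.add_argument("--no-psw", action="store_true", help="skip pod-switch tier")
    return p


def is_nvlink_only(args):
    """Single-tier NVLink-only output applies only when both tiers are dropped."""
    return args.no_asw and args.no_psw


parser = build_parser()
print(is_nvlink_only(parser.parse_args([])))                        # False
print(is_nvlink_only(parser.parse_args(["--no-asw", "--no-psw"])))  # True
print(is_nvlink_only(parser.parse_args(["--no-asw"])))              # False
```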

Reviewer FAQ

| Question | Answer |
| --- | --- |
| Why not just change the SimAI.conf default to ENABLE_QCN 0? | The current default is required for multi-node RoCE runs; flipping it would silently regress RDMA users. A dedicated preset keeps the two use cases separate with no collateral impact. |
| Is --no-asw --no-psw redundant? Why not a single --nvlink-only? | The two flags are kept orthogonal to leave room for future "only drop ASW, keep PSW" scenarios. Happy to add --nvlink-only as an alias if preferred. |
| Does the 12.22% number generalise? | No. See Scope Disclaimer and c2. The holdout worst_err is 44.20%, indicating limited size-axis extrapolation. |
| Are the header magic numbers 9 8 1 0 8 H20 documented? | Yes: nodes=9 (8 GPU + 1 NVSwitch), gpu_per_server=8, nv_switch_num=1, (reserved)=0, edges=8, gpu_type=H20. A follow-up can add a schema block to the generator's --help output if reviewers prefer. |
| Why no unit test? | Happy to add one in a follow-up commit (tests/topo/test_nvlink_only.py asserting the header string plus link count == 8, i.e. the star-topology edge count). Held back from this PR to keep the diff minimal and topic-focused. |
| Is the AllReduceAllToAll fused op fixed anywhere in this series? | No. c4 flags it explicitly; PR-α's UB fix targets a different code path. |
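
The claim that the preset matches SimAI.conf except for the ENABLE_QCN line can be checked mechanically. The sketch below compares two config texts line by line; the function name `differing_lines` is hypothetical, and the sample contents are stand-ins (`EXAMPLE_KEY` is not a real SimAI option, and the base value shown for ENABLE_QCN is assumed for illustration).

```python
def differing_lines(base_text, preset_text):
    """Return [(line_no, base_line, preset_line)] for lines that differ.

    Assumes both files have the same number of lines, which holds if the
    preset only changes a value in place.
    """
    base, preset = base_text.splitlines(), preset_text.splitlines()
    if len(base) != len(preset):
        raise ValueError("line counts differ; files are not an in-place edit")
    return [(i, a, b) for i, (a, b) in enumerate(zip(base, preset), start=1) if a != b]


# Stand-in contents for illustration only.
base = "ENABLE_QCN 1\nEXAMPLE_KEY 42\n"
preset = "ENABLE_QCN 0\nEXAMPLE_KEY 42\n"
print(differing_lines(base, preset))  # [(1, 'ENABLE_QCN 1', 'ENABLE_QCN 0')]
```

In a real check the two strings would be read from SimAI.conf and SimAI_nvlink_only.conf, and the result should contain exactly one entry.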

Copilot AI review requested due to automatic review settings May 15, 2026 02:02

Copilot AI left a comment

Pull request overview

Adds an NVLink-only topology generation path and an accompanying SimAI config preset to support reproducible single-node (8-GPU) microbenchmarks without changing existing RDMA defaults.

Changes:

  • Added --no-asw / --no-psw flags to generate an NVSwitch-only (NVLink-only) topology file from gen_Topo_Template.py.
  • Added SimAI_nvlink_only.conf preset (matches SimAI.conf except ENABLE_QCN 0) for NVLink-only runs.
  • Extended .gitignore with additional transient-artifact patterns.

Reviewed changes

Copilot reviewed 2 out of 3 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| astra-sim-alibabacloud/inputs/topo/gen_Topo_Template.py | Adds NVLink-only topology generator + CLI flags and output path support. |
| astra-sim-alibabacloud/inputs/config/SimAI_nvlink_only.conf | New preset config for NVLink-only runs (QCN disabled). |
| .gitignore | Adds ignore patterns for additional generated/transient files. |

Comment on lines +9 to +15
def NVLink_Only(parameters):
    if parameters['gpu'] % parameters['gpu_per_server'] != 0:
        raise ValueError("NVLink-only topology requires gpu to be divisible by gpu_per_server")
    servers = int(parameters['gpu'] / parameters['gpu_per_server'])
    nv_switch_num = servers * parameters['nv_switch_per_server']
    nodes = parameters['gpu'] + nv_switch_num
    links = parameters['gpu'] * parameters['nv_switch_per_server']