feat: NVLink-only topology template + microbenchmark calibration (H20 8-GPU) by tianhao909 · Pull Request #277 · aliyun/SimAI

tianhao909 · 2026-05-15T02:02:30Z

PR-β: NVLink-only 8-GPU topology template + config preset

Target repository: aliyun/SimAI

Summary

Adds two gen_Topo_Template.py CLI flags (--no-asw, --no-psw) and a new SimAI_nvlink_only.conf preset, enabling reproducible single-node 8-GPU NVLink-only microbenchmarks on H20/H100 systems without changing any RDMA-path defaults. The default SimAI.conf is left untouched so existing RDMA users are not affected.

Key Changes

astra-sim-alibabacloud/inputs/topo/gen_Topo_Template.py — +53 lines
- New CLI flags --no-asw (skip aggregation-switch tier) and --no-psw (skip pod-switch tier)
- When both are passed, emits a single-tier NVLink-only topology with the header 9 8 1 0 8 H20 (meaning nodes=9 gpu_per_server=8 nv_switch_num=1 (reserved)=0 edges=8 gpu_type=H20) and 8 GPU↔NVSwitch edges (star topology: every GPU connects to the single NVSwitch; there are no GPU↔GPU direct edges)
astra-sim-alibabacloud/inputs/config/SimAI_nvlink_only.conf — NEW, ~66 lines
- Byte-identical to SimAI.conf except ENABLE_QCN 0 (QCN is RDMA-specific; NVLink-only runs should not enable it)
- Default SimAI.conf is NOT modified (deliberate choice — see Reviewer FAQ)
.gitignore — +4 lines
- Housekeeping entries for common transient artefacts (*.log.tmp, generated topo dumps)

Total diff: ~123 lines across 3 files, 1 commit.
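As a rough illustration of the star topology described above, the following sketch reproduces the header arithmetic for the single-node case. It is not the code in gen_Topo_Template.py: the function name `build_nvlink_only_topology`, the `(gpu, switch)` tuple representation, and the assumption that NVSwitch node IDs follow the GPU IDs are all hypothetical, and the per-edge fields of the real file (bandwidth, latency, and so on) are left out.

```python
def build_nvlink_only_topology(gpu=8, gpu_per_server=8,
                               nv_switch_per_server=1, gpu_type="H20"):
    """Return (header, edges) for a star topology: every GPU links to each
    NVSwitch of its own server, and there are no GPU-to-GPU edges."""
    if gpu % gpu_per_server != 0:
        raise ValueError("gpu must be divisible by gpu_per_server")
    servers = gpu // gpu_per_server
    nv_switch_num = servers * nv_switch_per_server
    nodes = gpu + nv_switch_num
    edges = []
    for g in range(gpu):
        server = g // gpu_per_server
        for s in range(nv_switch_per_server):
            # Assumption: NVSwitch node IDs are numbered after all GPU IDs.
            switch_id = gpu + server * nv_switch_per_server + s
            edges.append((g, switch_id))
    # Header fields: nodes, gpu_per_server, nv_switch_num, reserved, edges, gpu_type
    header = f"{nodes} {gpu_per_server} {nv_switch_num} 0 {len(edges)} {gpu_type}"
    return header, edges


header, edges = build_nvlink_only_topology()
print(header)      # 9 8 1 0 8 H20
print(len(edges))  # 8
```

With the defaults (8 GPUs, one NVSwitch), every edge ends at node 8, which is what makes the topology a star.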

Testing

Fingerprint (inline):

gcc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
Linux kernel 5.10.134-16.3.al8.x86_64 x86_64 GNU/Linux
Python 3.13.11

Topology generation smoke (runs in < 1 second, no build needed):

# After checking out this PR's branch:
python3 astra-sim-alibabacloud/inputs/topo/gen_Topo_Template.py \
    --no-asw --no-psw -g 8 -gt H20 -o /tmp/topo_smoke.txt

head -1 /tmp/topo_smoke.txt                      # → 9 8 1 0 8 H20
awk 'NR>2 && NF>=5 {c++} END {print c}' /tmp/topo_smoke.txt  # → 8  (GPU↔NVSwitch edges, star topology)
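The same two checks can be sketched in Python, which may be convenient as a starting point for a unit test. The helper name `check_nvlink_only_topology` is hypothetical, the header field names follow the schema given in Key Changes, and the sample text is synthetic: its second line and the `BW LAT ERR` edge fields are placeholders, not the generator's real output.

```python
def check_nvlink_only_topology(text):
    """Parse the header line and count edge lines like the awk one-liner."""
    lines = text.splitlines()
    names = ["nodes", "gpu_per_server", "nv_switch_num",
             "reserved", "edges", "gpu_type"]
    fields = lines[0].split()
    header = {n: (v if n == "gpu_type" else int(v))
              for n, v in zip(names, fields)}
    # Mirrors awk 'NR>2 && NF>=5': skip the first two lines, then count
    # lines that have at least five whitespace-separated fields.
    edge_lines = sum(1 for line in lines[2:] if len(line.split()) >= 5)
    return header, edge_lines


# Synthetic sample; line 2 and the BW/LAT/ERR fields are placeholders.
sample = "\n".join(
    ["9 8 1 0 8 H20", "8"] + [f"{g} 8 BW LAT ERR" for g in range(8)]
)
parsed_header, edge_count = check_nvlink_only_topology(sample)
print(parsed_header["nodes"], parsed_header["gpu_type"])  # 9 H20
print(edge_count)                                         # 8
```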

End-to-end microbenchmark reproduction on H20 (single node, 8× GPU, NVLink only):

Case set: AllReduce / AllGather / ReduceScatter / AllToAll at 16 MiB and 64 MiB (2 of 10 cases use an AllReduce∘AllToAll split-proxy — see Known Limitations c4)
In-distribution PASS rate: 10/10
Per-case error vs ground truth: avg_err = 12.22%, worst_err = 26.60%
Holdout (12 cases that did NOT participate in calibration): 10/12 PASS, avg 15.94%, worst 44.20% — see c2 below
SimAI binary under test: sha256 7751d90879921a8854c25da21516759679038e79ca89d2402f1a252f778884c3
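For readers unfamiliar with the avg_err / worst_err figures, here is a minimal sketch of how such a summary could be computed, assuming per-case error means relative error against ground truth. The PR does not state the exact formula or the PASS threshold, so `summarize_errors`, the `threshold` parameter, and every number in the example are hypothetical and are not the PR's measurements.

```python
def summarize_errors(simulated, ground_truth, threshold):
    """Summarize per-case relative error (in percent) against ground truth."""
    errs = [100.0 * abs(s - g) / g for s, g in zip(simulated, ground_truth)]
    return {
        "avg_err": sum(errs) / len(errs),
        "worst_err": max(errs),
        "passed": sum(e <= threshold for e in errs),
        "total": len(errs),
    }


# Made-up values purely for illustration; the threshold is also an assumption.
summary = summarize_errors(
    simulated=[110.0, 90.0, 100.0],
    ground_truth=[100.0, 100.0, 80.0],
    threshold=30.0,
)
print(summary)
# {'avg_err': 15.0, 'worst_err': 25.0, 'passed': 3, 'total': 3}
```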

The precision numbers above depend on operator-wise AS_SEND_LAT calibration applied as environment variables at run time, not as source-tree values. This PR introduces no such constants into the codebase — see c1 and c3.
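One way to apply such a calibration without touching the source tree is to build a per-operator environment and hand it to the process that launches the simulator. The sketch below is an assumption about how that could look, not the harness used for this PR: `calibrated_env` is a hypothetical helper, and the table holds only the AllToAll value stated in c3, since the other per-operator values are not listed here.

```python
import os


def calibrated_env(operator, send_lat_by_operator, base_env=None):
    """Return a copy of the environment with AS_SEND_LAT set for one operator."""
    env = dict(os.environ if base_env is None else base_env)
    # Environment variable values must be strings.
    env["AS_SEND_LAT"] = str(send_lat_by_operator[operator])
    return env


# Only the AllToAll value is given in this PR (c3); others must be supplied.
send_lat = {"AllToAll": 80}
env = calibrated_env("AllToAll", send_lat, base_env={})
print(env)  # {'AS_SEND_LAT': '80'}
```

The resulting dictionary can be passed as `env=` to `subprocess.run` together with whatever command line launches the SimAI binary, so the calibration stays external to the codebase.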

Known Limitations

c1 — HIL verdict = WARN-PASS. The human-in-the-loop acceptance review classified this set as PASS but with caveats documented here; it is not an unconditional acceptance.

c2 — Holdout worst_err = 44.20% (Caveat tier). The 12-case holdout set (sizes/operators not used in calibration) reveals that size-axis generalization is bounded; the in-distribution 12.22% figure cannot be extrapolated without the same calibration policy.

c3 — AS_SEND_LAT=80 for AllToAll is out-of-scan. The calibration scan covered [0, 30]; the AS_SEND_LAT=80 value used for AllToAll is a manual extrapolation chosen to match ground truth. It is external (env var), not in source.

c4 — 2/10 cases use a split-proxy. AllReduceAllToAll as a native fused operator crashes (exit 139, i.e. SIGSEGV) on aliyun/SimAI:master, even after the companion PR-α's UB fix, which addresses a different root cause. The reported results for those 2 cases use an AllReduce ∘ AllToAll split. This PR does not claim to fix the fused-op crash.

c5 — Path evidence is not packet-level. Verification is at the topology header + NVLink edge count + NCCL/MockNccl-layer log level, not at ns-3 packet trace level.

Scope Disclaimer

English: This result does NOT serve as a proof of SimAI's generalization accuracy on H20/NVLink across all scenarios. It only reproduces a specific microbenchmark set under a specific calibration. Generalization to other models, sizes, topologies, or fused ops is not established.

中文：本结果不构成对 SimAI 在 H20/NVLink 全场景泛化准确性的证明。它仅复现特定 microbenchmark 集合在特定 calibration 下的精度。对其他模型、规模、拓扑、融合算子的泛化能力均未建立。

Notes

Base: aliyun/SimAI:master (currently f5efb5a)
Head: tianhao909/SimAI:pr/plan07-nvlink-topology-template
1 commit: feat: NVLink-only 8-GPU topology template + config preset
Independent of PR-α and PR-γ — can be reviewed / merged in any order
No rdma-hw.cc, no build-system change, no submodule bump in this PR

Checklist

Conventional commit message (feat: …)
Default SimAI.conf NOT touched
All five caveats (c1–c5) documented verbatim in Known Limitations
Scope Disclaimer inline, both English and Chinese, as a > blockquote
Testing section uses only inline values (no local filesystem paths)
CLI flag additions are backward compatible (defaults reproduce original behavior)

Reviewer FAQ

| Question | Answer |
| --- | --- |
| Why not just change `SimAI.conf` default `ENABLE_QCN=1` → `0`? | The current default is required for multi-node RoCE runs; flipping it would silently regress RDMA users. A dedicated preset keeps the two use cases separate with zero collateral. |
| Is `--no-asw --no-psw` redundant? Why not a single `--nvlink-only`? | The two flags are kept orthogonal to leave room for future "only drop ASW, keep PSW" scenarios. Happy to add `--nvlink-only` as an alias if preferred. |
| Does the 12.22% number generalise? | No — see Scope Disclaimer and c2. The holdout worst_err is 44.20%, indicating limited size-axis extrapolation. |
| Are the header magic numbers `9 8 1 0 8 H20` documented? | Yes: `nodes=9 (8 GPU + 1 NVSwitch), gpu_per_server=8, nv_switch_num=1, (reserved)=0, edges=8, gpu_type=H20`. A follow-up can add a schema block to the generator's `--help` output if reviewers prefer. |
| Why no unit test? | Happy to add one in a follow-up commit (`tests/topo/test_nvlink_only.py` asserting the header string plus link count `== 8`, i.e. the star-topology edge count). Held back from this PR to keep the diff minimal and topic-focused. |
| Is AllReduceAllToAll fused-op fixed anywhere in this series? | No. c4 flags it explicitly; PR-α's UB fix targets a different code path. |

Copilot

Pull request overview

Adds an NVLink-only topology generation path and an accompanying SimAI config preset to support reproducible single-node (8-GPU) microbenchmarks without changing existing RDMA defaults.

Changes:

Added --no-asw / --no-psw flags to generate an NVSwitch-only (NVLink-only) topology file from gen_Topo_Template.py.
Added SimAI_nvlink_only.conf preset (matches SimAI.conf except ENABLE_QCN 0) for NVLink-only runs.
Extended .gitignore with additional transient-artifact patterns.
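The claim that the preset matches SimAI.conf except for ENABLE_QCN is easy to check mechanically. The sketch below assumes the config uses whitespace-separated `KEY VALUE` lines; `differing_keys` is a hypothetical helper and `EXAMPLE_KEY` is a stand-in entry, not a real SimAI setting.

```python
def differing_keys(base_conf, preset_conf):
    """Return the sorted keys whose values differ between two configs."""
    def parse(text):
        entries = {}
        for line in text.splitlines():
            parts = line.split(None, 1)  # key, then the rest of the line
            if parts:
                entries[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
        return entries

    a, b = parse(base_conf), parse(preset_conf)
    return sorted(k for k in a.keys() | b.keys() if a.get(k) != b.get(k))


# Tiny stand-in configs; EXAMPLE_KEY is hypothetical.
base = "ENABLE_QCN 1\nEXAMPLE_KEY 42\n"
preset = "ENABLE_QCN 0\nEXAMPLE_KEY 42\n"
changed = differing_keys(base, preset)
print(changed)  # ['ENABLE_QCN']
```

Running the same comparison on the contents of the two real files should report only `ENABLE_QCN` if the preset is as described.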

Reviewed changes

Copilot reviewed 2 out of 3 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| astra-sim-alibabacloud/inputs/topo/gen_Topo_Template.py | Adds NVLink-only topology generator + CLI flags and output path support. |
| astra-sim-alibabacloud/inputs/config/SimAI_nvlink_only.conf | New preset config for NVLink-only runs (QCN disabled). |
| .gitignore | Adds ignore patterns for additional generated/transient files. |


+def NVLink_Only(parameters):
+    if parameters['gpu'] % parameters['gpu_per_server'] != 0:
+        raise ValueError("NVLink-only topology requires gpu to be divisible by gpu_per_server")
+    servers = int(parameters['gpu'] / parameters['gpu_per_server'])
+    nv_switch_num = servers * parameters['nv_switch_per_server']
+    nodes = parameters['gpu'] + nv_switch_num
+    links = parameters['gpu'] * parameters['nv_switch_per_server']



feat: NVLink-only 8-GPU topology template + config preset (see PR body)

2af182a

Copilot AI review requested due to automatic review settings May 15, 2026 02:02
