validate: macvlan/ipoib RDMA fallback + sriov webhook gate + GB300 Station preset #106
Greptile Summary
This PR ships four targeted fixes following real-world macvlan-rdma-shared bring-up on a GB300 DGX Station cluster. The changes are well-scoped and the caveats are documented inline.
Confidence Score: 5/5
The PR is safe to merge. All four changes are well-scoped and field-tested, and the prior reviewer concerns about GNU-only ls -v and the zero-test false-pass have both been addressed. The RDMA device discovery fallbacks are correctly implemented with POSIX-compatible sort, appropriate guards, and documented caveats. The zero-test warning fix closes the silent false-pass gap. The profile changes are complete: all three remaining profiles with sriovNetworkOperator.enabled: false were confirmed to require no changes. The new GB300-DGX-Station preset is structurally consistent with the existing GB300-NVL preset. No files require special attention. rdma.go carries the most logic, but the shell-generation code is clearly documented and the fallback ordering is correct.
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[DiscoverRDMADevices\nfor rail, iface] --> B[Primary probe:\nls /sys/class/net/iface/device/infiniband/]
B -->|non-empty| E[dev = result]
B -->|empty| C[Fallback A: iface exists?\ncount entries in /sys/class/infiniband/]
C -->|count == 1| D[dev = single mlx5 entry]
C -->|count != 1 AND railIndex >= 0| F[Fallback B: sort -t_ -k2,2n\nsed -n idx+1 p]
C -->|count != 1 AND railIndex < 0| G[dev = empty]
F --> H{dev non-empty?}
D --> H
E --> H
G --> I[rail skipped\nno echo]
H -->|yes| J[echo rail=dev\noutput parsed by Go]
H -->|no| I
J --> K[RunMatrix: Plan testPods]
K -->|plan.Skip != nil| L[Matrix skipped - soft exit 0]
K -->|0 tests + Skip nil| M[Warning: 0 tests despite N pods]
K -->|tests > 0| N[Run rping + ib_write_bw stages]
Reviews (2): Last reviewed commit: "validate: macvlan/ipoib RDMA fallback, s..."
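The fallback order in the flowchart can be sketched in Go as follows. This is a minimal illustration, not the code in rdma.go (which, per the review, generates shell that runs inside the pod). `pickRDMADevice`, `numericSuffix`, and their parameters are hypothetical names, and returning empty when the interface does not exist is an assumption.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// numericSuffix returns the integer after the last underscore in an RDMA
// device name (mlx5_10 -> 10), or -1 when no numeric suffix is present.
func numericSuffix(name string) int {
	i := strings.LastIndex(name, "_")
	if i < 0 {
		return -1
	}
	n, err := strconv.Atoi(name[i+1:])
	if err != nil {
		return -1
	}
	return n
}

// pickRDMADevice mirrors the flowchart's fallback order (hypothetical helper).
//
//	primary:   result of listing /sys/class/net/<iface>/device/infiniband/
//	ifaceOK:   whether the interface exists in the pod netns
//	all:       entries of /sys/class/infiniband/
//	railIndex: index parsed from the rail key, or -1 when absent
func pickRDMADevice(primary string, ifaceOK bool, all []string, railIndex int) string {
	if primary != "" {
		return primary // primary probe succeeded
	}
	if !ifaceOK {
		return "" // assumption: no fallback for a missing interface
	}
	if len(all) == 1 {
		return all[0] // Fallback A: single RDMA device on the node
	}
	if railIndex < 0 || railIndex >= len(all) {
		return "" // rail is skipped
	}
	// Fallback B: sort by numeric suffix (like sort -t_ -k2,2n), take the Nth.
	sorted := append([]string(nil), all...)
	sort.SliceStable(sorted, func(i, j int) bool {
		return numericSuffix(sorted[i]) < numericSuffix(sorted[j])
	})
	return sorted[railIndex]
}

func main() {
	devs := []string{"mlx5_10", "mlx5_2", "mlx5_0"}
	fmt.Println(pickRDMADevice("", true, devs, 1)) // prints "mlx5_2"
}
```

Sorting on the numeric suffix rather than the raw string matters once a node has ten or more devices, since a plain lexical sort would place mlx5_10 before mlx5_2.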
…ion preset
* connectivity: pod-side fallback for RDMA device discovery on non-PCI-direct
attachments. The existing probe reads
/sys/class/net/<iface>/device/infiniband/, which works for SR-IOV VFs and
host-device but returns empty for macvlan / IPoIB slaves (their netdev has
no PCI-direct symlink inside the pod netns). Two fallbacks added:
1. Single-mlx5 case: when exactly one entry exists in
/sys/class/infiniband/, use it. Covers single-NIC nodes.
2. Multi-rail case: extract the rail index from the rail key
(regex `(?:^|-)rail-(\d+)(?:-|$)`) and pick the Nth entry from
/sys/class/infiniband/ sorted naturally. Works because every l8k
profile enumerates rails contiguously from 0 in PCI order matching
the kernel's mlx5 numbering. Limitation: assumes 1 rail = 1 mlx5
on this node; the orchestrator-side master→mlx5 mapping (read from
rendered MacvlanNetwork CR) is the proper fix for multi-NIC
multi-rail rdmaShared and is deferred.
* connectivity: explicit warning when Plan() returns zero tests despite
≥2 schedulable pods. Previously the validate output printed
"Plan: 0 same-rail rping + 0 ..." and exited success — every rail had
unresolved RDMA devices and the planner silently dropped every pair.
The new warning names macvlan/IPoIB as the common cause and points to
the in-pod sysfs check the user can run to confirm.
* profiles/{sriov-ethernet-rdma,sriov-ib-rdma,spectrum-x,spectrum-x-ra2.1}:
set `sriov-network-operator.sriovOperatorConfig.enableInjector: false`
and `enableOperatorWebhook: false`. Both helpers depend on TLS secrets
the subchart's post-install Job creates, which doesn't always re-run on
cross-profile upgrades (helm hooks don't fire on "re-enable subchart")
and leaves the operator stuck reconciling against a `<nil>` secretName.
l8k renders pod resource requests explicitly (the injector's value-add
is unused) and pre-validates NodePolicies via the crstate registry
(the webhook's value-add is redundant), so turning both off costs
nothing functional and removes the cross-profile fragility.
* presets: add GB300-DGX-Station-NVIDIA-GB300 preset, derived from a real
Galaxy GB300 -Prime topology-collector report. Single-PF (master only,
matching l8k discover --collapse-nic-rails=true), connectedGPU GPU0
with gpuProximity NODE (no PIX exists; NODE-fallback applies because
the NIC's numaNode=0 overlaps GPU0's NUMA Affinity 0-8). Header
comment documents the platform's Gen5 x8 PCIe ceiling so anyone
matching against this preset later understands the ~206 Gb/s
ib_write_bw expectation.
Signed-off-by: Alexander Maslennikov <amaslennikov@nvidia.com>
9afd84f to e369087
Addressing both Greptile P2s. Force-pushed.
Build + connectivity tests green.
Summary
Follow-up to #105. Four small fixes surfaced during real-world bring-up of macvlan-rdma-shared + cross-profile flips on a GB300 DGX Station cluster.
* Pod-side fallback for RDMA device discovery. The existing probe reads `/sys/class/net/<iface>/device/infiniband/`, which works for SR-IOV VFs (real PCI mlx5 in the pod netns) but returns empty for macvlan slaves (kernel netdevs layered on the master). The planner silently dropped every pair, and validate showed `Plan: 0 same-rail …` and exited success. Two new fallbacks:
  1. Single-mlx5 case: when exactly one entry exists in `/sys/class/infiniband/`, use it.
  2. Multi-rail case: extracts the rail index from the rail key (`rail-N` pattern) and indexes naturally-sorted `/sys/class/infiniband/`. Works because every l8k profile numbers rails contiguously from 0 in PCI order matching the kernel's mlx5 enumeration. Limitation: 1 rail = 1 mlx5 on this node; multi-NIC multi-rail rdmaShared still needs the orchestrator-side master→mlx5 mapping (deferred; it would read the rendered MacvlanNetwork CR's `master` field on the host side).
* Explicit zero-test warning. When `Plan()` returns 0 tests but ≥2 pods are schedulable (i.e. every rail had an unresolvable RDMA device on at least one endpoint), surface an explicit warning. The previous behaviour printed `Matrix complete: 0/0 passed` and exited success, a false pass on a real coverage gap.
* Disable `enableInjector` and `enableOperatorWebhook` on every profile that enables the SR-IOV subchart. Both helpers depend on TLS secrets that the subchart's post-install Job creates. That Job doesn't always re-run on cross-profile upgrades (`--overwrite-existing` from macvlan-rdma-shared → sriov-ethernet-rdma), leaving the operator stuck reconciling against a `<nil>` secretName for the `network-resources-injector` DS. Neither helper is functionally needed: l8k renders resource requests explicitly (no injector consumer) and pre-validates NodePolicies via the crstate registry (no webhook consumer). Cost of disabling: nothing.
* New preset `GB300-DGX-Station-NVIDIA-GB300`. Built from a real Galaxy GB300 -Prime topology-collector report. Single PF (collapsed master), `connectedGPU: GPU0`, `gpuProximity: NODE` (no PIX exists between NIC and GPU on the GB300 SoC; NODE-fallback applies since the NIC's numaNode=0 overlaps GPU0's NUMA Affinity 0-8). Header comment documents the platform's PCIe Gen5 x8 ceiling so the observed ~206 Gb/s ib_write_bw expectation is up-front in the preset definition.

Test plan
* `go build ./...` clean
* `go test ./pkg/networkoperatorplugin/connectivity/... ./pkg/presets/...` green (the two `TestGetPresetsDir_*` failures are the known pre-existing env failures from `make install` having populated `/usr/local/share/l8k/presets/`, unrelated to this PR)
* `l8k validate` on a 2-node macvlan-rdma-shared cluster with a single dual-port CX-8 per node now emits `Plan: 4 same-rail rping + 2 cross-rail rping; 4 same-rail ib_write_bw + 2 cross-rail ib_write_bw` and runs the tests. (Subsequent ib_write_bw failures in one direction are a separate RoCE-over-macvlan / PFC issue, not a planner problem.)
* `l8k deploy --overwrite-existing` flipping macvlan-rdma-shared → sriov-ethernet-rdma now lands `SriovOperatorConfig/default` without the `network-resources-injector` DS rejection loop.
* `l8k discover` against a GB300 DGX Station and confirm the new preset matches the live `cluster-config.yaml` PFs (1 east-west PF at 0001:03:00.0 with PSID `nvd0000000110`).
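The three planner outcomes described in the summary (soft skip, zero-test warning, normal run) can be sketched as a small classifier. This is an illustration under stated assumptions, not the PR's implementation: `Outcome`, `classifyPlan`, and the warning wording are hypothetical, and the real `Plan()` result type is not shown in this PR description.

```go
package main

import "fmt"

// Outcome is a hypothetical label for what the matrix runner does next.
type Outcome int

const (
	OutcomeSkipped         Outcome = iota // plan carries a skip reason: soft exit 0
	OutcomeZeroTestWarning                // no skip reason, yet nothing to run
	OutcomeRun                            // at least one test was planned
)

// classifyPlan maps a plan summary to an outcome and an optional message.
// skipped stands in for "plan.Skip != nil"; tests is the number of planned
// tests; pods is the number of schedulable test pods.
func classifyPlan(skipped bool, tests, pods int) (Outcome, string) {
	switch {
	case skipped:
		return OutcomeSkipped, "matrix skipped"
	case tests == 0:
		// Before the fix this case fell through to a "0/0 passed" success.
		msg := fmt.Sprintf("0 tests planned despite %d schedulable pods; "+
			"RDMA devices may be unresolved (common with macvlan/IPoIB)", pods)
		return OutcomeZeroTestWarning, msg
	default:
		return OutcomeRun, ""
	}
}

func main() {
	outcome, msg := classifyPlan(false, 0, 2)
	fmt.Println(outcome == OutcomeZeroTestWarning, msg)
}
```

Keeping the zero-test case distinct from the skip case is the point of the fix: a skip is an intentional soft exit, while an empty plan with schedulable pods signals a coverage gap.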