validate: macvlan/ipoib RDMA fallback + sriov webhook gate + GB300 Station preset #106

Merged
almaslennikov merged 1 commit into main from validate-macvlan-rdma-and-sriov-webhooks
Jun 30, 2026

Conversation

@almaslennikov

Collaborator

Summary

Follow-up to #105. Four small fixes surfaced during real-world bring-up of macvlan-rdma-shared + cross-profile flips on a GB300 DGX Station cluster.

  • connectivity: macvlan/IPoIB RDMA device fallback. The in-pod probe used to read /sys/class/net/<iface>/device/infiniband/, which works for SR-IOV VFs (real PCI mlx5 in the pod netns) but returns empty for macvlan slaves (kernel netdevs layered on the master). The planner silently dropped every pair, validate showed Plan: 0 same-rail … and exited success. Two new fallbacks:
    1. Single-mlx5 case (single-NIC node, single rail) — picks the only entry exposed by the rdmaSharedDevicePlugin.
    2. Multi-rail case — extracts the rail index from the rail key (rail-N pattern) and indexes naturally-sorted /sys/class/infiniband/. Works because every l8k profile numbers rails contiguously from 0 in PCI order matching the kernel's mlx5 enumeration. Limitation: 1 rail = 1 mlx5 on this node; multi-NIC multi-rail rdmaShared still needs the orchestrator-side master→mlx5 mapping (deferred — would read the rendered MacvlanNetwork CR's master field on the host side).
  • connectivity: zero-tests-despite-pods warning. When Plan() returns 0 tests but ≥2 pods are schedulable (i.e. every rail had an unresolvable RDMA device on at least one endpoint), surface an explicit warning. The previous behaviour printed Matrix complete: 0/0 passed and exited success — false-pass on a real coverage gap.
  • profiles: disable enableInjector and enableOperatorWebhook on every profile that enables the SR-IOV subchart. Both helpers depend on TLS secrets the subchart's post-install Job creates, which doesn't always re-run on cross-profile upgrades (--overwrite-existing from macvlan-rdma-shared → sriov-ethernet-rdma), leaving the operator stuck reconciling against a <nil> secretName for the network-resources-injector DS. Neither helper is functionally needed: l8k renders resource requests explicitly (no injector consumer) and pre-validates NodePolicies via the crstate registry (no webhook consumer). Cost of disabling: nothing.
  • presets: add GB300-DGX-Station-NVIDIA-GB300. Built from a real Galaxy GB300 -Prime topology-collector report. Single PF (collapsed master), connectedGPU: GPU0, gpuProximity: NODE (no PIX exists between NIC and GPU on the GB300 SoC; NODE-fallback applies since the NIC's numaNode=0 overlaps GPU0's NUMA Affinity 0-8). Header comment documents the platform's PCIe Gen5 x8 ceiling so the observed ~206 Gb/s ib_write_bw expectation is up-front in the preset definition.
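
The probe order in the first bullet can be sketched in POSIX shell. This is an illustration under stated assumptions, not the code in rdma.go: the function names, the `SYSFS_ROOT` override (added only so the logic can run against a fake tree), and the `sed -E` approximation of the rail-key pattern are all hypothetical. The interface-exists guard on the single-device fallback follows the reviewer's description of Fallback A.

```shell
# rail_index KEY: print N when KEY contains a "rail-N" segment, else nothing.
# POSIX ERE approximation of the rail-key pattern; if a key held several
# "rail-N" segments this picks the last one. Assumes N is not zero-padded.
rail_index() {
    printf '%s\n' "$1" | sed -nE 's/^(.*-)?rail-([0-9]+)(-.*)?$/\2/p'
}

# resolve_rdma_dev IFACE RAIL_KEY: print the RDMA device name, or nothing.
# SYSFS_ROOT is a hypothetical prefix, empty on a real system.
resolve_rdma_dev() {
    iface=$1
    rail_key=$2
    root=${SYSFS_ROOT:-}

    # Primary probe: PCI-direct attachments (SR-IOV VF, host-device).
    dev=$(ls -1 "$root/sys/class/net/$iface/device/infiniband/" 2>/dev/null | head -n 1)
    if [ -n "$dev" ]; then
        printf '%s\n' "$dev"
        return 0
    fi

    # Both fallbacks use the RDMA devices visible in the pod, sorted
    # numerically on the suffix after "_".
    devs=$(ls -1 "$root/sys/class/infiniband/" 2>/dev/null | sort -t_ -k2,2n)
    count=$(printf '%s\n' "$devs" | grep -c . || true)

    # Fallback 1: the interface exists and exactly one device is exposed.
    if [ "$count" -eq 1 ] && [ -e "$root/sys/class/net/$iface" ]; then
        printf '%s\n' "$devs"
        return 0
    fi

    # Fallback 2: index the sorted list by the 0-based rail number.
    idx=$(rail_index "$rail_key")
    if [ -n "$idx" ]; then
        printf '%s\n' "$devs" | sed -n "$((idx + 1))p"
    fi
}
```

An empty result corresponds to the "rail skipped" outcome: the caller emits no `rail=dev` line for that rail. For example, `resolve_rdma_dev net1 rail-1` on a node exposing `mlx5_0` and `mlx5_1` to a macvlan interface would print `mlx5_1`.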

Test plan

  • go build ./... clean
  • go test ./pkg/networkoperatorplugin/connectivity/... ./pkg/presets/... green (the two TestGetPresetsDir_* failures are the known pre-existing env failures from make install having populated /usr/local/share/l8k/presets/ — unrelated to this PR)
  • Field-tested: l8k validate on a 2-node macvlan-rdma-shared cluster with a single dual-port CX-8 per node now emits Plan: 4 same-rail rping + 2 cross-rail rping; 4 same-rail ib_write_bw + 2 cross-rail ib_write_bw and runs the tests. (Subsequent ib_write_bw failures in one direction are a separate RoCE-over-macvlan / PFC issue, not a planner problem.)
  • Field-tested: l8k deploy --overwrite-existing flipping macvlan-rdma-shared → sriov-ethernet-rdma now lands SriovOperatorConfig/default without the network-resources-injector DS rejection loop.
  • Optional: run l8k discover against a GB300 DGX Station and confirm the new preset matches the live cluster-config.yaml PFs (1 east-west PF at 0001:03:00.0 with PSID nvd0000000110).

@greptile-apps

greptile-apps Bot commented Jun 29, 2026


Greptile Summary

This PR ships four targeted fixes following real-world macvlan-rdma-shared bring-up on a GB300 DGX Station cluster. The changes are well-scoped and the caveats are documented inline.

  • RDMA device fallback in DiscoverRDMADevices: adds two shell-side fallbacks for macvlan/IPoIB interfaces that lack the device/infiniband/ PCI symlink, using sort -t_ -k2,2n (POSIX) rather than ls -v (GNU-only) to avoid silent misranking at rail ≥ 10. A defensive zero-test warning is added to RunMatrix so a coverage gap is no longer silently swallowed as success.
  • Profile hardening: enableInjector: false and enableOperatorWebhook: false are propagated to all four SR-IOV-enabled profiles (sriov-ethernet-rdma, sriov-ib-rdma, spectrum-x, spectrum-x-ra2.1) to prevent a stuck Reconcile loop caused by a missing TLS secret on cross-profile upgrades.
  • New preset: GB300-DGX-Station-NVIDIA-GB300 topology file is added, consistent in structure and deviceID format with the existing GB300-NVL preset.

Confidence Score: 5/5

The PR is safe to merge. All four changes are well-scoped, field-tested, and the prior reviewer concerns about GNU-only ls -v and the zero-test false-pass have both been addressed.

The RDMA device discovery fallbacks are correctly implemented with POSIX-compatible sort, appropriate guards, and documented caveats. The zero-test warning fix closes the silent false-pass gap. The profile changes are complete — confirmed that all three remaining profiles with sriovNetworkOperator.enabled: false require no changes. The new GB300-DGX-Station preset is structurally consistent with the existing GB300-NVL preset.

No files require special attention. rdma.go carries the most logic but the shell-generation code is clearly documented and the fallback ordering is correct.

Important Files Changed

| Filename | Overview |
| --- | --- |
| pkg/networkoperatorplugin/connectivity/rdma.go | Adds two shell-side fallbacks for non-PCI-direct RDMA device discovery. POSIX sort replaces GNU-only ls -v. Fallback B lacks the interface-existence guard present in Fallback A, but this doesn't cause false positives since rping/ib_write_bw use IP connectivity which would fail independently. |
| pkg/networkoperatorplugin/connectivity/connectivity.go | Adds a zero-test warning that fires when Plan() returns 0 tests despite ≥2 schedulable pods, fixing the silent false-pass that previously printed "Matrix complete: 0/0 passed". Comment correctly covers both root causes (no RDMA device resolution, no shared rail keys). |
| pkg/presets/data/GB300-DGX-Station-NVIDIA-GB300/topology.yaml | New preset for the GB300 DGX Station. Structure, deviceID format ("1023" for ConnectX-8), and field set are consistent with the existing GB300-NVL preset. Single collapsed PF is correct for --collapse-nic-rails default. Classification rationale is well-documented in comments. |
| profiles/sriov-ethernet-rdma/00-values.yaml | Adds enableInjector: false and enableOperatorWebhook: false with a detailed rationale comment. The three other profiles with sriovNetworkOperator.enabled: false do not need this change. |
| profiles/sriov-ib-rdma/00-values.yaml | Adds enableInjector: false and enableOperatorWebhook: false with a cross-reference to the canonical rationale in sriov-ethernet-rdma. |
| profiles/spectrum-x/00-values.yaml | Adds enableInjector: false and enableOperatorWebhook: false, consistent with the other SR-IOV profiles. |
| profiles/spectrum-x-ra2.1/00-values.yaml | Adds enableInjector: false and enableOperatorWebhook: false, consistent with the other SR-IOV profiles. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[DiscoverRDMADevices\nfor rail, iface] --> B[Primary probe:\nls /sys/class/net/iface/device/infiniband/]
    B -->|non-empty| E[dev = result]
    B -->|empty| C[Fallback A: iface exists?\ncount entries in /sys/class/infiniband/]
    C -->|count == 1| D[dev = single mlx5 entry]
    C -->|count != 1 AND railIndex >= 0| F[Fallback B: sort -t_ -k2,2n\nsed -n idx+1 p]
    C -->|count != 1 AND railIndex < 0| G[dev = empty]
    F --> H{dev non-empty?}
    D --> H
    E --> H
    G --> I[rail skipped\nno echo]
    H -->|yes| J[echo rail=dev\noutput parsed by Go]
    H -->|no| I
    J --> K[RunMatrix: Plan testPods]
    K -->|plan.Skip != nil| L[Matrix skipped - soft exit 0]
    K -->|0 tests + Skip nil| M[Warning: 0 tests despite N pods]
    K -->|tests > 0| N[Run rping + ib_write_bw stages]

Reviews (2): Last reviewed commit: "validate: macvlan/ipoib RDMA fallback, s..."

Comment thread pkg/networkoperatorplugin/connectivity/rdma.go
Comment thread pkg/networkoperatorplugin/connectivity/connectivity.go
…ion preset

* connectivity: pod-side fallback for RDMA device discovery on non-PCI-direct
  attachments. The existing probe reads
  /sys/class/net/<iface>/device/infiniband/, which works for SR-IOV VFs and
  host-device but returns empty for macvlan / IPoIB slaves (their netdev has
  no PCI-direct symlink inside the pod netns). Two fallbacks added:
    1. Single-mlx5 case: when exactly one entry exists in
       /sys/class/infiniband/, use it. Covers single-NIC nodes.
    2. Multi-rail case: extract the rail index from the rail key
       (regex `(?:^|-)rail-(\d+)(?:-|$)`) and pick the Nth entry from
       /sys/class/infiniband/ sorted naturally. Works because every l8k
       profile enumerates rails contiguously from 0 in PCI order matching
       the kernel's mlx5 numbering. Limitation: assumes 1 rail = 1 mlx5
       on this node; the orchestrator-side master→mlx5 mapping (read from
       rendered MacvlanNetwork CR) is the proper fix for multi-NIC
       multi-rail rdmaShared and is deferred.

* connectivity: explicit warning when Plan() returns zero tests despite
  ≥2 schedulable pods. Previously the validate output printed
  "Plan: 0 same-rail rping + 0 ..." and exited success — every rail had
  unresolved RDMA devices and the planner silently dropped every pair.
  The new warning names macvlan/IPoIB as the common cause and points to
  the in-pod sysfs check the user can run to confirm.

* profiles/{sriov-ethernet-rdma,sriov-ib-rdma,spectrum-x,spectrum-x-ra2.1}:
  set `sriov-network-operator.sriovOperatorConfig.enableInjector: false`
  and `enableOperatorWebhook: false`. Both helpers depend on TLS secrets
  the subchart's post-install Job creates, which doesn't always re-run on
  cross-profile upgrades (helm hooks don't fire on "re-enable subchart")
  and leaves the operator stuck reconciling against a `<nil>` secretName.
  l8k renders pod resource requests explicitly (the injector's value-add
  is unused) and pre-validates NodePolicies via the crstate registry
  (the webhook's value-add is redundant), so turning both off costs
  nothing functional and removes the cross-profile fragility.

* presets: add GB300-DGX-Station-NVIDIA-GB300 preset, derived from a real
  Galaxy GB300 -Prime topology-collector report. Single-PF (master only,
  matching l8k discover --collapse-nic-rails=true), connectedGPU GPU0
  with gpuProximity NODE (no PIX exists; NODE-fallback applies because
  the NIC's numaNode=0 overlaps GPU0's NUMA Affinity 0-8). Header
  comment documents the platform's Gen5 x8 PCIe ceiling so anyone
  matching against this preset later understands the ~206 Gb/s
  ib_write_bw expectation.

Signed-off-by: Alexander Maslennikov <amaslennikov@nvidia.com>
@almaslennikov almaslennikov force-pushed the validate-macvlan-rdma-and-sriov-webhooks branch from 9afd84f to e369087 Compare June 29, 2026 10:29
@almaslennikov

Collaborator Author

Addressing both Greptile P2s. Force-pushed e369087 (was 9afd84f).

rdma.go:133, ls -1v busybox portability: replaced ls -1v with ls -1 | sort -t_ -k2,2n. That is POSIX sort with field separator _ and a numeric sort on field 2; it works on both GNU and busybox, and correctly orders mlx5_0..mlx5_17 even when the count crosses 10. The comment now documents the busybox concern so a future image swap doesn't silently regress.
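
To make the ordering concern concrete, here is a small self-contained comparison. The helper names `nth_rdma_dev` and `nth_lexical` are hypothetical and exist only for this illustration. The lexical variant stands in for whatever a listing without working version sort would produce.

```shell
# nth_rdma_dev IDX: read device names on stdin and print the IDX-th one
# (0-based) after a numeric sort on the field following "_".
nth_rdma_dev() {
    sort -t_ -k2,2n | sed -n "$(($1 + 1))p"
}

# nth_lexical IDX: same selection, but on plain byte-wise lexical order.
nth_lexical() {
    LC_ALL=C sort | sed -n "$(($1 + 1))p"
}

# Generate mlx5_0 .. mlx5_17, the range mentioned in the reply above.
names=$(i=0; while [ "$i" -le 17 ]; do echo "mlx5_$i"; i=$((i + 1)); done)

printf '%s\n' "$names" | nth_rdma_dev 2   # prints mlx5_2
printf '%s\n' "$names" | nth_lexical 2    # prints mlx5_10
```

With lexical order, `mlx5_10` through `mlx5_17` sort ahead of `mlx5_2`, so every rail index from 2 upward would resolve to the wrong device once a node has more than ten entries.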

connectivity.go:227 — zero-test warning text: widened to name both causes ("either every rail had at least one endpoint with no resolvable RDMA device, or no rail key was shared across pods"). The second case can't occur with any current l8k-rendered example DS, but the comment + warning text now mention it explicitly so the diagnostic doesn't misattribute on a future multi-group render.
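
For the first cause, a manual in-pod check along these lines can confirm what the probe sees. This is a hedged sketch, not the command the warning prints: the function name and the `SYSFS_ROOT` override are hypothetical, and only the two sysfs paths come from this PR.

```shell
# rdma_sysfs_report IFACE: show what the primary probe and the fallbacks
# would find. SYSFS_ROOT is a hypothetical prefix, empty on a real system.
rdma_sysfs_report() {
    root=${SYSFS_ROOT:-}

    # PCI-direct view: non-empty for SR-IOV VFs, empty for macvlan/IPoIB.
    direct=$(ls -1 "$root/sys/class/net/$1/device/infiniband/" 2>/dev/null | tr '\n' ' ')
    direct=${direct% }

    # Devices visible in the pod, in the numeric order the fallback indexes.
    shared=$(ls -1 "$root/sys/class/infiniband/" 2>/dev/null | sort -t_ -k2,2n | tr '\n' ' ')
    shared=${shared% }

    printf 'iface %s pci-direct: %s\n' "$1" "${direct:-<none>}"
    printf 'node-wide devices: %s\n' "${shared:-<none>}"
}
```

Running `rdma_sysfs_report net1` inside a test pod (with `net1` as a placeholder interface name) and seeing `<none>` on both lines would point at an unresolvable RDMA device on that endpoint rather than at a rail-key mismatch.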

Build + connectivity tests green.

@almaslennikov almaslennikov merged commit b9ea239 into main Jun 30, 2026
3 checks passed
@almaslennikov almaslennikov deleted the validate-macvlan-rdma-and-sriov-webhooks branch June 30, 2026 10:39