Skip to content

fix: NCO-coexistence, ns-override, FW-storage gate, GB300 preset#105

Merged
almaslennikov merged 1 commit into
mainfrom
preinstalled-discovery
Jun 26, 2026
Merged

fix: NCO-coexistence, ns-override, FW-storage gate, GB300 preset#105
almaslennikov merged 1 commit into
mainfrom
preinstalled-discovery

Conversation

@almaslennikov

Copy link
Copy Markdown
Collaborator

Summary

Group of fixes that surfaced while bringing l8k up on a GB300-NVL setup with a pre-existing NIC Configuration Operator install, plus the corresponding GB300 preset.

  • discover: cluster-wide NicDevice list. The daemon's controller queries CRs cluster-wide by node name, so when an existing NICOP install owns the CRs in another namespace, the daemon writes status into those CRs in their original namespace — restricting our List to the bootstrap namespace caused discovery to time out. Drops the client.InNamespace filter from DiscoverClusterConfig + waitNicDevicesDiscovered.
  • discover: classify via NicDevice.Status.DPU / .SuperNIC. The operator already reads INTERNAL_CPU_MODEL via mlxconfig (EMBEDDED_CPU(1) ⇒ DPU) — use it. Replaces the chip-ID heuristic, which mis-flagged BF3 DPU SKUs whose part numbers weren't in ns-product-ids as east-west SuperNICs. ns-product-ids stays as a positive-only secondary signal.
  • deploy / validate: --network-operator-namespace honored standalone. The flag was only wired on generate/root; the standalone subcommands went through loadUserConfig(options.Options{}) and ignored the override. Threaded through loadUserConfig (mirroring ApplyOptionsToConfig's same branch) + registered on both subcommands.
  • profiles: gate nicFirmwareStorage behind a new NicConfigurationOperator.UpdateFW. Spectrum-X RA2.2 was rendering the PVC block unconditionally; the others didn't have it at all. Added the gated block to the five non-Spectrum-X NCPs and widened the parent nicConfigurationOperator gate to (DeployNicInterfaceNameTemplate OR UpdateFW). Default UpdateFW=false → no PVC/StorageClass dependency for the common case.
  • profiles: suppress empty NicNodePolicy on SR-IOV profiles when OFED is disabled. sriov-ethernet-rdma / sriov-ib-rdma have only ofedDriver in the NNP body (the SR-IOV device plugin policy is a separate CRD); disabling OFED left a no-op NNP carrying just metadata + nodeSelector. Outer gate now (versionGE 26.4 AND DOCADriver.Enable); the renderer's empty-output guard drops the file.
  • discover: always-on unloads, drop OFED-module probe. UnloadStorageModules and UnloadThirdPartyRDMAModules now default to true at construction; the in-process probe (the holder-graph BFS) is removed because it was OOM-prone (SIGKILL/137 on large nodes) and the auto-enable side effect is redundant with default-on. SkipPreflightChecks defaulted to explicit false.
  • presets: GB300-NVL-NVIDIA-GB300. Built from a real GB300 topology-collector report. NODE-fallback classification (no PIX between any GPU and any east-west NIC on this chassis). 4 east-west CX-8 PFs (one per NIC, master function — matches l8k discover --collapse-nic-rails default) + 2 north-south BF3 ports.
  • presets symlink. presetspkg/presets/data at repo root so the dev install path picks the in-repo data up without manual copies.

Test plan

  • go build ./... clean
  • go test ./pkg/networkoperatorplugin/... ./pkg/cmd/... ./pkg/config/... green
  • go test ./pkg/presets/... — only the known pre-existing environmental failures (TestGetPresetsDir_NotFound / TestGetPresetsDir_SkipsFiles, both fail when /usr/local/share/l8k/presets/ exists from a prior make install — unrelated)
  • Manual: l8k discover against a cluster with a pre-existing NICOP install in another namespace returns the existing NicDevice CRs instead of timing out
  • Manual: l8k deploy --network-operator-namespace foo installs Helm into foo
  • Manual: l8k generate with nicConfigurationOperator.updateFW: false (default) → no nicFirmwareStorage in the rendered NCP, no PVC creation at deploy time
  • Manual: l8k generate for sriov-ethernet-rdma with docaDriver.enable: false → no NNP file in the rendered manifests
  • Manual: l8k discover on a GB300-NVL produces a cluster-config.yaml matching the new preset; preset-validation surfaces zero deviations

@greptile-apps

greptile-apps Bot commented Jun 26, 2026

Copy link
Copy Markdown

Greptile Summary

This PR bundles several targeted fixes for GB300-NVL bring-up with a co-existing NIC Configuration Operator (NICOP) install, plus the corresponding GB300 preset. The changes are well-scoped, well-commented, and the PR description provides explicit migration notes for the known limitation around pre-existing docaDriver: config blocks.

  • Discover fixes: NicDevice list is now cluster-wide (fixes coexistence timeout), DPU/SuperNIC classification uses Status.DPU/Status.SuperNIC from the operator instead of the BF3 chip-ID heuristic, and the OOM-prone kernel module probe is removed with UnloadStorageModules/UnloadThirdPartyRDMAModules defaulting to true.
  • CLI fix: --network-operator-namespace is now registered on and correctly threaded into both l8k deploy and l8k validate via loadUserConfig, mirroring the existing ApplyOptionsToConfig path.
  • Profile fixes: nicFirmwareStorage is gated behind a new UpdateFW boolean (default false), and the sriov-* NNP files now suppress themselves when DOCADriver.Enable is false to avoid no-op CRs.
  • New preset: GB300-NVL-NVIDIA-GB300/topology.yaml with 4 east-west CX-8 PFs and 2 north-south BF3 ports, plus a repo-root presets symlink for dev-install ergonomics.

Confidence Score: 5/5

Safe to merge; all changes address concrete, real-world failures observed on GB300-NVL hardware with a co-existing NICOP install, and the logic is consistent across the affected code paths.

Each fix is narrowly scoped and directly traceable to an observed failure mode. The cluster-wide NicDevice listing correctly handles the coexistence case. The namespace override threading is applied only after the config is confirmed non-nil, avoiding any panic path. The UpdateFW gate and NNP suppression are additive conditions that default to the safe/no-op state. No regressions are introduced on the existing code paths.

pkg/networkoperatorplugin/networkoperator.go — the acknowledged zero-value migration issue for pre-existing docaDriver blocks is worth tracking in a follow-up PR (pointer-bool schema change).

Important Files Changed

Filename Overview
pkg/networkoperatorplugin/discovery/discover.go Removes namespace filter from NicDevice listing (cluster-wide fix), replaces chip-ID heuristic with Status.DPU/Status.SuperNIC fields, and removes the OOM-prone kernel module probe. All changes are sound.
pkg/networkoperatorplugin/networkoperator.go Adds default-on unload flags and explicit SkipPreflightChecks=false to all three DOCADriverConfig initialization sites. The acknowledged limitation — pre-existing docaDriver: blocks land with Go zero-value false for the unload fields — is documented in both the code and the PR description but deferred to a future schema change.
pkg/cmd/userconfig.go Correctly threads NetworkOperatorNamespace override into loadUserConfig, applied after ApplyNetworkOperatorRelease with a nil guard; cfg is always non-nil at that point due to early-return guards above it.
pkg/cmd/deploy.go Registers --network-operator-namespace flag and passes it to loadUserConfig; cfg.NetworkOperator (with overridden namespace) is then assigned to deployOpts. Correct fix.
pkg/cmd/validate.go Registers --network-operator-namespace flag and passes it to loadUserConfig; mirrors deploy.go fix cleanly.
profiles/sriov-ethernet-rdma/11-nicnodepolicy.yaml Adds .DOCADriver.Enable to the outer gate; suppresses the no-op NNP when OFED is disabled. Inner {{- if .DOCADriver.Enable }} is now redundant but harmless.
profiles/sriov-ib-rdma/11-nicnodepolicy.yaml Same fix as sriov-ethernet-rdma — DOCADriver.Enable gate added to suppress no-op NNP. Clean.
profiles/spectrum-x/10-nicclusterpolicy.yaml Adds inner UpdateFW gate around nicFirmwareStorage block, which was previously unconditional. The outer nicConfigurationOperator section remains unconditional by design for spectrum-x. Correct fix.
pkg/presets/data/GB300-NVL-NVIDIA-GB300/topology.yaml New GB300-NVL preset with 4 east-west CX-8 PFs and 2 north-south BF3 DPU ports. NODE-fallback classification is explained in comments. North-south entries correctly omit rail field.
pkg/networkoperatorplugin/discovery/library.go Mirrors the default-on unload flag initialization from networkoperator.go into applyRelease; consistent with the other initialization sites.

Reviews (2): Last reviewed commit: "fix: NCO-coexistence, ns-override, FW-st..." | Re-trigger Greptile

Comment thread pkg/networkoperatorplugin/networkoperator.go
* discover: list NicDevice CRs cluster-wide so an existing NIC
  Configuration Operator install in another namespace doesn't make
  discovery time out. The daemon's controller queries CRs cluster-wide
  by node name, so when a pre-existing install owns the CRs, the
  daemon writes status into them in their original namespace rather
  than minting fresh ones under the bootstrap namespace; the in-process
  bootstrap-namespace List filter has to follow.

* discover: classify traffic via NicDevice.Status.DPU / Status.SuperNIC
  (operator-authoritative, INTERNAL_CPU_MODEL via mlxconfig) instead
  of the chip-ID heuristic, which mis-flagged BF3 DPU SKUs whose part
  numbers aren't in ns-product-ids as east-west SuperNICs. The
  ns-product-ids table stays as a positive-only secondary signal.

* deploy / validate: honor --network-operator-namespace in the
  standalone subcommands. Previously the flag was only registered on
  generate/root, so a fresh `l8k deploy --network-operator-namespace
  foo` had no effect — the standalone path went through loadUserConfig
  with empty options, never reaching ApplyOptionsToConfig.

* profiles: gate `nicFirmwareStorage` behind a new
  NicConfigurationOperator.UpdateFW field. Adds the gated block to the
  five non-Spectrum-X NCPs and widens the parent nicConfigurationOperator
  gate to (DeployNicInterfaceNameTemplate OR UpdateFW), so the storage
  PVC stays reachable when the user disables the name template but
  enables firmware update. Default UpdateFW=false → no PVC dependency
  for the common case.

* profiles: suppress empty NicNodePolicy when DOCADriver.Enable is
  false on the sriov-ethernet-rdma / sriov-ib-rdma profiles. Their
  only NNP body section is ofedDriver, so disabling OFED leaves an
  NNP carrying only metadata + nodeSelector — a no-op CR. Tightened
  outer gate to (versionGE 26.4 AND DOCADriver.Enable); the existing
  empty-render guard in templates.go drops the file.

* discover: default UnloadStorageModules + UnloadThirdPartyRDMAModules
  to true; drop the OFED-module probe block. The probe was OOM-prone
  (SIGKILL/137 on large nodes) and the auto-enable side effect is now
  redundant with default-on. SkipPreflightChecks defaulted to explicit
  false (matching the existing intent: surface hardware-incompat early).

* presets: add GB300-NVL-NVIDIA-GB300 preset, derived from a real
  GB300 topology-collector report. NODE-fallback classification (no
  PIX between any GPU and east-west NIC), 4 east-west CX-8 PFs + 2
  north-south BF3 ports.

* presets: add a repo-root `presets` symlink to pkg/presets/data so
  the dev install path (`l8k preset list` lookup chain) picks the
  in-repo data up without a manual copy.

Signed-off-by: Alexander Maslennikov <amaslennikov@nvidia.com>
@almaslennikov almaslennikov force-pushed the preinstalled-discovery branch from 9465b8e to 12a99c0 Compare June 26, 2026 10:30
@almaslennikov

Copy link
Copy Markdown
Collaborator Author

Re: Greptile P2 on networkoperator.go (existing docaDriver: block silently keeps zero-value unload fields) — confirmed, this is a real but narrow regression for users whose previous probe was OOM-killed.

Picked Greptile's softer remedy (document the limitation + migration path) rather than the schema change (*bool) which is a separate concern. Amended the commit to add a code comment at the construction site explaining:

  • defaults apply only when cluster-config.yaml has no docaDriver: block at all,
  • Go bool can't distinguish "absent" from "explicit false" without switching the field type to a pointer (deferred to a separate PR),
  • the shipped default-config.yaml already carries both unload fields at true, so a fresh l8k discover (without --user-config) regenerates with the new defaults,
  • existing users in the affected state can either re-discover from scratch or add unloadStorageModules: true / unloadThirdPartyRDMAModules: true to their cluster-config.yaml by hand.

Force-pushed 12a99c0 to preinstalled-discovery (was 9465b8e).

@almaslennikov almaslennikov merged commit a5ccceb into main Jun 26, 2026
3 checks passed
@almaslennikov almaslennikov deleted the preinstalled-discovery branch June 30, 2026 10:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant