Skip to content

Configure maintenance operation concurrency#108

Merged
almaslennikov merged 1 commit into
mainfrom
dev/maintenance-requestor-mode
Jun 30, 2026
Merged

Configure maintenance operation concurrency#108
almaslennikov merged 1 commit into
mainfrom
dev/maintenance-requestor-mode

Conversation

@almaslennikov

Copy link
Copy Markdown
Collaborator

Summary

  • add one top-level maintenance config section with validated defaults of four concurrent nodes
  • render MaintenanceOperatorConfig globally, legacy SR-IOV pool limits before 26.1, and legacy OFED upgrade concurrency
  • enable OFED and SR-IOV Maintenance Operator requestor mode for Network Operator 26.1 and newer
  • document accepted values, zero behavior, release gates, and Helm upgrade requirements

Validation

  • GOCACHE=/private/tmp/l8k-go-cache go test ./... -count=1 -skip "TestGetPresetsDir_(NotFound|SkipsFiles)"
  • GOCACHE=/private/tmp/l8k-go-cache make build
  • rendered representative 25.10, 26.1, and 26.4 profiles, including percentage and explicit-zero overrides
  • helm template passed with exact Network Operator v25.10.0 and v26.1.1 charts and the local 26.4 chart

The two skipped preset tests depend on /usr/local/share/l8k/presets being absent; they fail identically on unmodified main in this development environment.

@greptile-apps

greptile-apps Bot commented Jun 30, 2026

Copy link
Copy Markdown

Greptile Summary

This PR introduces a top-level maintenance config section that centralises node-disruption concurrency across DOCA/OFED upgrades and SR-IOV configuration, replacing the previous hardcoded value of 1. It adds a new IntOrPercent type that mirrors the Kubernetes intstr.IntOrString field, normalises and validates the config on every load path, and renders the values into MaintenanceOperatorConfig Helm values, NicClusterPolicy/NicNodePolicy manifests, and a new release-gated SriovNetworkPoolConfig.

  • Adds pkg/config/maintenance.go with MaintenanceConfig, IntOrPercent serialisation, NormalizeMaintenance, and field-level validation; backed by thorough unit tests.
  • Adds per-profile 00-values.yaml template changes that gate Maintenance Operator requestor mode (useRequestor, useDrainControllerRequestor, externalDrainer) on selectedRelease >= 26.1, and two new 35-sriovnetworkpoolconfig.yaml templates that only render for pre-26.1 releases.
  • Updates 10-nicclusterpolicy.yaml and 11-nicnodepolicy.yaml across all five OFED-enabled profiles to use the configurable maxParallelUpgrades instead of the previous hardcoded 1.

Confidence Score: 5/5

Safe to merge. All changed template paths are covered by integration tests against real Helm charts and across the four relevant release versions.

The normalization and validation logic is straightforward and idempotent, the new IntOrPercent type has a thorough YAML round-trip test, and the release-gating logic for requestor mode is exercised across every profile and release combination in the matrix test. No incorrect data flow or broken contract was found in the changed paths.

No files require special attention.

Important Files Changed

Filename Overview
pkg/config/maintenance.go New file: IntOrPercent type, MaintenanceConfig struct, NormalizeMaintenance, and validateIntOrPercent. Logic is correct; all validation paths are reachable from tests. validateMaintenance dereferences pointer fields without nil-guards, which is safe because it is only ever called from NormalizeMaintenance after defaults are fully populated.
pkg/config/maintenance_test.go New test file covering defaults, explicit-zero preservation, YAML round-trip, template rendering, and all validation error paths. Coverage is thorough.
pkg/networkoperatorplugin/maintenance_render_test.go New integration test file exercising requestor-mode profile matrix across all 7 profiles and 4 releases, plus explicit-zero preservation, maxParallelUpgrades rendering, and legacy SriovNetworkPoolConfig release gating.
pkg/config/config.go Added Maintenance field to LaunchKitConfig and NormalizeMaintenance calls in DefaultLaunchKitConfig, LoadFullConfig, and ValidateClusterConfig. The ValidateClusterConfig call is defensive normalization for programmatically-constructed configs; it is idempotent and harmless.
profiles/sriov-ethernet-rdma/35-sriovnetworkpoolconfig.yaml New template: renders SriovNetworkPoolConfig only for pre-26.1 releases via versionLT gate. When the condition is false the template renders empty, which the profile generation system correctly ignores.
profiles/host-device-rdma/00-values.yaml Adds release-gated requestor mode block and maintenance-operator-chart operatorConfig section. maxParallelOperations and maxUnavailable are correctly quoted strings (Helm truthiness); maxNodeMaintenanceTimeSeconds is also quoted (see prior thread).
profiles/host-device-rdma/11-nicnodepolicy.yaml Replaces hardcoded maxParallelUpgrades: 1 with the configurable template value. Rendered unquoted (correct for a K8s manifest integer field).
pkg/networkoperatorplugin/templates.go Adds a single NormalizeMaintenance call at the top of GenerateProfileDeploymentFiles, ensuring the Maintenance pointer is always populated before any template accesses it.
docs/maintenance.rst New documentation page accurately describing all four fields, value restrictions, zero-value semantics, release-specific behavior, and upgrade instructions. Content aligns with the implementation.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[LoadFullConfig / DefaultLaunchKitConfig] --> B[NormalizeMaintenance\nfill defaults + validate]
    C[ValidateClusterConfig] --> B
    D[GenerateProfileDeploymentFiles] --> B
    B --> E{selectedRelease >= 26.1?}
    E -- yes --> F[maintenanceOperator.enabled: true\noperator.maintenanceOperator.useRequestor\nuseDrainControllerRequestor\nsriov-network-operator.operator.externalDrainer]
    E -- no --> G[legacy path]
    F --> H[maintenance-operator-chart.operatorConfig\nmaxParallelOperations / maxUnavailable\nmaxNodeMaintenanceTimeSeconds]
    G --> H
    G --> I[35-sriovnetworkpoolconfig.yaml\nSriovNetworkPoolConfig.spec.maxUnavailable]
    G --> J[10-nicclusterpolicy / 11-nicnodepolicy\nmaxParallelUpgrades OFED]
    F --> J2[10-nicclusterpolicy / 11-nicnodepolicy\nmaxParallelUpgrades rendered but no effect]
Loading
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
flowchart TD
    A[LoadFullConfig / DefaultLaunchKitConfig] --> B[NormalizeMaintenance\nfill defaults + validate]
    C[ValidateClusterConfig] --> B
    D[GenerateProfileDeploymentFiles] --> B
    B --> E{selectedRelease >= 26.1?}
    E -- yes --> F[maintenanceOperator.enabled: true\noperator.maintenanceOperator.useRequestor\nuseDrainControllerRequestor\nsriov-network-operator.operator.externalDrainer]
    E -- no --> G[legacy path]
    F --> H[maintenance-operator-chart.operatorConfig\nmaxParallelOperations / maxUnavailable\nmaxNodeMaintenanceTimeSeconds]
    G --> H
    G --> I[35-sriovnetworkpoolconfig.yaml\nSriovNetworkPoolConfig.spec.maxUnavailable]
    G --> J[10-nicclusterpolicy / 11-nicnodepolicy\nmaxParallelUpgrades OFED]
    F --> J2[10-nicclusterpolicy / 11-nicnodepolicy\nmaxParallelUpgrades rendered but no effect]
Loading

Reviews (2): Last reviewed commit: "Configure maintenance operation concurre..." | Re-trigger Greptile

deploy: true
maxParallelOperations: "{{ .Maintenance.MaxParallelOperations.String }}"
maxUnavailable: "{{ .Maintenance.MaxUnavailable.String }}"
maxNodeMaintenanceTimeSeconds: "{{ .Maintenance.MaxNodeMaintenanceTimeSeconds }}"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 maxNodeMaintenanceTimeSeconds is always a plain integer, yet it is wrapped in double-quotes like the IntOrPercent fields above it, making it a YAML string in the rendered values. The other two fields legitimately need string quoting because they can be percentage values (e.g. "25%"), but this field cannot. If the Maintenance Operator chart has a JSON schema that validates this key as an integer, Helm will reject the string value. Dropping the quotes produces the integer scalar the chart expects, and the template auto-dereferences the *int32 pointer correctly. The same change applies to all seven profile 00-values.yaml files (ipoib-rdma-shared, macvlan-rdma-shared, sriov-ethernet-rdma, sriov-ib-rdma, spectrum-x, spectrum-x-ra2.1).

Suggested change
maxNodeMaintenanceTimeSeconds: "{{ .Maintenance.MaxNodeMaintenanceTimeSeconds }}"
maxNodeMaintenanceTimeSeconds: {{ .Maintenance.MaxNodeMaintenanceTimeSeconds }}

Comment thread pkg/config/config.go
Expose Maintenance Operator, SR-IOV drain, and legacy OFED concurrency through one launch-kit config section. Enable requestor mode for OFED and SR-IOV flows starting with Network Operator 26.1.

Signed-off-by: Alexander Maslennikov <amaslennikov@nvidia.com>
@almaslennikov almaslennikov force-pushed the dev/maintenance-requestor-mode branch from 1bd02c8 to f485f03 Compare June 30, 2026 11:16
@almaslennikov

Copy link
Copy Markdown
Collaborator Author

Addressed both review findings in amended commit f485f03:

  1. Quoted maxNodeMaintenanceTimeSeconds: kept the scalar quoted intentionally and documented that choice in all seven values templates. The upstream subchart wraps the field in a truthiness guard, so numeric 0 would be omitted; a quoted "0" preserves the supported immediate-cleanup override. The exact 25.10.0 and 26.1.1 charts accept this representation and rendered the CR correctly in smoke tests.
  2. Repeated normalization in validation: kept the call because ValidateClusterConfig is a public path for programmatically constructed configs that bypass LoadFullConfig, including non-nil values that still need validation. Added a comment making that invariant explicit.

Verification remains green: go test ./pkg/config ./pkg/networkoperatorplugin -count=1 and make build.

@almaslennikov almaslennikov merged commit 9e06c30 into main Jun 30, 2026
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant