Important Note: NVIDIA AI Enterprise customers can get support from NVIDIA Enterprise support. Please open a case here.
Describe the bug
When transitioning a node's configuration from vm-passthrough to vm-vgpu using the NVIDIA GPU Operator, the expected pods (nvidia-sandbox-device-plugin-daemonset and nvidia-sandbox-validator) fail to deploy automatically. Because these pods do not start, the vGPU device is prevented from attaching to the nodes, and allocatable resources do not show correctly.
To Reproduce
Prerequisites:
- Cluster is using NVIDIA GPU operator
Steps to reproduce:
- Start with the node in vm-passthrough mode (or the respective vGPU config).
- Switch the node to vm-vgpu mode.
- Label the node with nvidia.com/vgpu.config=<vGPU config, e.g. A2-2Q>.
- Wait until the nvidia-vgpu-manager-daemonset is in a ready state.
- Observe: the nvidia-sandbox-device-plugin-daemonset and nvidia-sandbox-validator pods are missing/fail to deploy.
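For reference, the reproduce steps above can be sketched with `oc`. The node name `worker-0` is a placeholder; substitute your own node:

```shell
# Placeholder node name -- replace with an actual GPU node.
NODE=worker-0

# Switch the node's workload type from passthrough to vGPU.
oc label node "$NODE" nvidia.com/gpu.workload.config=vm-vgpu --overwrite

# Request a vGPU profile (A2-2Q is the profile used in this report).
oc label node "$NODE" nvidia.com/vgpu.config=A2-2Q --overwrite

# Watch the operator pods; the vgpu-manager DaemonSet name carries a
# driver-version suffix (see the pod listing below).
oc get pods -n nvidia-gpu-operator -w
```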
Steps to workaround/force deployment:
- Method A (Toggling deploy labels):
- Apply the following labels to the nodes to scale down the pods:
nvidia.com/gpu.deploy.sandbox-validator: "false"
nvidia.com/gpu.deploy.vgpu-device-manager: "false"
- Set both labels back to "true".
- Wait until the nvidia-sandbox-validator and nvidia-vgpu-device-manager DaemonSets are back in a ready state.
- The allocatable resources on the nodes now show correctly.
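Method A can be expressed as `oc` commands; `worker-0` is a placeholder node name:

```shell
# Placeholder node name -- replace with the affected node.
NODE=worker-0

# Pause the affected components by setting their deploy labels to "false"...
oc label node "$NODE" \
  nvidia.com/gpu.deploy.sandbox-validator=false \
  nvidia.com/gpu.deploy.vgpu-device-manager=false --overwrite

# ...then flip them back to "true" to force a redeploy.
oc label node "$NODE" \
  nvidia.com/gpu.deploy.sandbox-validator=true \
  nvidia.com/gpu.deploy.vgpu-device-manager=true --overwrite
```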
- Method B (Toggling vGPU config label):
- Remove the nvidia.com/vgpu.config label from the node.
- Re-apply the nvidia.com/vgpu.config=A2-2Q label.
- The pods deploy, and the vGPU attaches correctly.
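Method B as `oc` commands, again with `worker-0` as a placeholder:

```shell
# Placeholder node name -- replace with the affected node.
NODE=worker-0

# Remove the vGPU config label (a trailing "-" deletes a label)...
oc label node "$NODE" nvidia.com/vgpu.config-

# ...then re-apply it to retrigger the vgpu-device-manager.
oc label node "$NODE" nvidia.com/vgpu.config=A2-2Q
```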
Expected behavior
The transition from vm-passthrough to vm-vgpu should automatically trigger the deployment of the required sandbox validator and device plugin pods. The vGPU device should attach, and allocatable resources should display correctly without requiring manual intervention or label toggling.
Environment (please provide the following information):
- GPU Operator Version: v25.10.0
- Kernel Version: [e.g. 6.8.0-generic]
- Container Runtime Version: cri-o://1.35.1-4.rhaos4.22.gitf998e4a
- Kubernetes Distro and Version: OpenShift CNV 4.21
Information to attach (optional if deemed irrelevant)
% oc get pods -n nvidia-gpu-operator
NAME READY STATUS RESTARTS AGE
gpu-operator-dd6856b8d-z649r 1/1 Running 0 12h
nvidia-vgpu-device-manager-29jnh 1/1 Running 0 14h
nvidia-vgpu-device-manager-h8vnr 1/1 Running 0 24m
nvidia-vgpu-device-manager-n2prn 1/1 Running 0 24m
nvidia-vgpu-manager-daemonset-9.6.20251112-0-7fnqg 2/2 Running 0 14h
nvidia-vgpu-manager-daemonset-9.6.20251112-0-dxmpm 2/2 Running 0 25m
nvidia-vgpu-manager-daemonset-9.6.20251112-0-t9hv6 2/2 Running 1 (24m ago) 25m
- [ ] vgpu-device-manager pod logs
% oc logs nvidia-vgpu-device-manager-n2prn -n nvidia-gpu-operator
time="2026-01-20T17:03:46Z" level=info msg="Applying vGPU device configuration..."
.
.
time="2026-01-20T17:03:46Z" level=debug msg=" Updating vGPU config: map[A2-2Q:8]"
time="2026-01-20T17:03:46Z" level=fatal msg="error setting VGPU config: no parent devices found for GPU at index '0'"
time="2026-01-20T17:03:46Z" level=error msg="Failed to apply vGPU config: unable to apply config 'default': exit status 1"
time="2026-01-20T17:03:46Z" level=info msg="Setting node label: nvidia.com/vgpu.config.state=failed"
time="2026-01-20T17:03:46Z" level=info msg="Waiting for change to 'nvidia.com/vgpu.config' label"
Relevant node labels after the failed transition:
nvidia.com/gpu.deploy.sandbox-device-plugin: paused-for-vgpu-change
nvidia.com/gpu.deploy.sandbox-validator: paused-for-vgpu-change
nvidia.com/gpu.deploy.vfio-manager: "true"
nvidia.com/gpu.present: "true"
nvidia.com/gpu.workload.config: vm-passthrough
nvidia.com/vgpu.config: A2-2Q
nvidia.com/vgpu.config.state: failed
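To confirm the failed state on a node, the labels and allocatable resources can be inspected as below; `worker-0` is a placeholder node name, and the exact vGPU resource name depends on the configured profile:

```shell
# Placeholder node name -- replace with the affected node.
NODE=worker-0

# List the operator-managed nvidia.com labels on the node
# (gpu.workload.config, vgpu.config.state, gpu.deploy.*).
oc get node "$NODE" --show-labels | tr ',' '\n' | grep nvidia.com

# Check whether the vGPU device is advertised as allocatable;
# the resource name varies with the profile.
oc get node "$NODE" -o jsonpath='{.status.allocatable}'
```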