Important Note: NVIDIA AI Enterprise customers can get support from NVIDIA Enterprise support. Please open a case here.
Describe the bug
When transitioning a node's configuration from vm-passthrough to vm-vgpu using the NVIDIA GPU Operator, the expected pods (nvidia-sandbox-device-plugin-daemonset and nvidia-sandbox-validator) fail to deploy automatically. Because these pods do not start, the vGPU device is prevented from attaching to the nodes, and allocatable resources do not show correctly.
To Reproduce
Prerequisites:
- Cluster is using NVIDIA GPU operator
Steps to reproduce:
- Start with the node in vm-passthrough mode (or the respective vGPU config).
- Switch the node to vm-vgpu mode.
- Label the node with nvidia.com/vgpu.config=<vGPU config, e.g. A2-2Q>.
- Wait until the nvidia-vgpu-manager-daemonset is in a ready state.
- Observe: the nvidia-sandbox-device-plugin-daemonset and nvidia-sandbox-validator pods are missing/fail to deploy.
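For reference, the reproduce steps above can be sketched with `oc`. The node name `worker-0` is a placeholder; substitute your own node:

```shell
# Placeholder node name -- replace with an actual GPU node.
NODE=worker-0

# Switch the node's workload type from passthrough to vGPU.
oc label node "$NODE" nvidia.com/gpu.workload.config=vm-vgpu --overwrite

# Request a vGPU profile (A2-2Q is the profile used in this report).
oc label node "$NODE" nvidia.com/vgpu.config=A2-2Q --overwrite

# Watch the operator pods; the vgpu-manager DaemonSet name carries a
# driver-version suffix (see the pod listing below).
oc get pods -n nvidia-gpu-operator -w
```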
Steps to workaround/force deployment:
- Method A (Toggling deploy labels):
- Apply the following labels to the nodes to scale down the pods:
nvidia.com/gpu.deploy.sandbox-validator: "false"
nvidia.com/gpu.deploy.vgpu-device-manager: "false"
- Set both labels back to "true".
- Wait until the nvidia-sandbox-validator and nvidia-vgpu-device-manager DaemonSets are back in a ready state.
- The allocatable resources on the nodes now show correctly.
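Method A can be expressed as `oc` commands; `worker-0` is a placeholder node name:

```shell
# Placeholder node name -- replace with the affected node.
NODE=worker-0

# Pause the affected components by setting their deploy labels to "false"...
oc label node "$NODE" \
  nvidia.com/gpu.deploy.sandbox-validator=false \
  nvidia.com/gpu.deploy.vgpu-device-manager=false --overwrite

# ...then flip them back to "true" to force a redeploy.
oc label node "$NODE" \
  nvidia.com/gpu.deploy.sandbox-validator=true \
  nvidia.com/gpu.deploy.vgpu-device-manager=true --overwrite
```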
- Method B (Toggling vGPU config label):
- Remove the nvidia.com/vgpu.config label from the node.
- Re-apply the nvidia.com/vgpu.config=A2-2Q label.
- The pods deploy, and the vGPU attaches correctly.
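Method B as `oc` commands, again with `worker-0` as a placeholder:

```shell
# Placeholder node name -- replace with the affected node.
NODE=worker-0

# Remove the vGPU config label (a trailing "-" deletes a label)...
oc label node "$NODE" nvidia.com/vgpu.config-

# ...then re-apply it to retrigger the vgpu-device-manager.
oc label node "$NODE" nvidia.com/vgpu.config=A2-2Q
```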
Expected behavior
The transition from vm-passthrough to vm-vgpu should automatically trigger the deployment of the required sandbox validator and device plugin pods. The vGPU device should attach, and allocatable resources should display correctly without requiring manual intervention or label toggling.
Environment (please provide the following information):
- GPU Operator Version: v25.10.0
- Kernel Version: [e.g. 6.8.0-generic]
- Container Runtime Version: cri-o://1.35.1-4.rhaos4.22.gitf998e4a
- Kubernetes Distro and Version: OpenShift CNV 4.21
Information to attach (optional if deemed irrelevant)
% oc get pods -n nvidia-gpu-operator
NAME READY STATUS RESTARTS AGE
gpu-operator-dd6856b8d-z649r 1/1 Running 0 12h
nvidia-vgpu-device-manager-29jnh 1/1 Running 0 14h
nvidia-vgpu-device-manager-h8vnr 1/1 Running 0 24m
nvidia-vgpu-device-manager-n2prn 1/1 Running 0 24m
nvidia-vgpu-manager-daemonset-9.6.20251112-0-7fnqg 2/2 Running 0 14h
nvidia-vgpu-manager-daemonset-9.6.20251112-0-dxmpm 2/2 Running 0 25m
nvidia-vgpu-manager-daemonset-9.6.20251112-0-t9hv6 2/2 Running 1 (24m ago) 25m
- [ ] vgpu-device-manager pod logs
% oc logs nvidia-vgpu-device-manager-n2prn -n nvidia-gpu-operator
time="2026-01-20T17:03:46Z" level=info msg="Applying vGPU device configuration..."
.
.
time="2026-01-20T17:03:46Z" level=debug msg=" Updating vGPU config: map[A2-2Q:8]"
time="2026-01-20T17:03:46Z" level=fatal msg="error setting VGPU config: no parent devices found for GPU at index '0'"
time="2026-01-20T17:03:46Z" level=error msg="Failed to apply vGPU config: unable to apply config 'default': exit status 1"
time="2026-01-20T17:03:46Z" level=info msg="Setting node label: nvidia.com/vgpu.config.state=failed"
time="2026-01-20T17:03:46Z" level=info msg="Waiting for change to 'nvidia.com/vgpu.config' label"
Relevant node labels after the failed transition:
nvidia.com/gpu.deploy.sandbox-device-plugin: paused-for-vgpu-change
nvidia.com/gpu.deploy.sandbox-validator: paused-for-vgpu-change
nvidia.com/gpu.deploy.vfio-manager: "true"
nvidia.com/gpu.present: "true"
nvidia.com/gpu.workload.config: vm-passthrough
nvidia.com/vgpu.config: A2-2Q
nvidia.com/vgpu.config.state: failed
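To confirm the failed state on a node, the labels and allocatable resources can be inspected as below; `worker-0` is a placeholder node name, and the exact vGPU resource name depends on the configured profile:

```shell
# Placeholder node name -- replace with the affected node.
NODE=worker-0

# List the operator-managed nvidia.com labels on the node
# (gpu.workload.config, vgpu.config.state, gpu.deploy.*).
oc get node "$NODE" --show-labels | tr ',' '\n' | grep nvidia.com

# Check whether the vGPU device is advertised as allocatable;
# the resource name varies with the profile.
oc get node "$NODE" -o jsonpath='{.status.allocatable}'
```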