Shiritai/k8s-dra-workshop

Kubernetes x NVIDIA DRA Workshop

A hands-on workshop for Kubernetes engineers to explore Dynamic Resource Allocation (DRA) with NVIDIA GPUs on Kind clusters.

Covers the full stack: exclusive GPU scheduling, MPS sharing, MIG hardware isolation, MIG x MPS hybrid architecture, admin access, observability, and resilience testing.

Traditional Chinese README (繁體中文版)

Prerequisites

Tool Purpose Verify
NVIDIA Driver 550+ GPU driver nvidia-smi
Docker Container runtime docker ps
Kind v0.24+ Local K8s cluster kind version
Helm 3 DRA driver install helm version
nvidia-ctk CDI config nvidia-ctk cdi list
./scripts/phase1/run-module0-check-env.sh   # Automated check

Workshop Modules

Phase 1: DRA Basics & MPS Sharing

Module Topic Script Docs
M0 Prerequisites Check run-module0-check-env.sh Link
M1 Kind Cluster Setup run-module1-setup-kind.sh Link
M2 DRA Driver Install run-module2-install-driver.sh Link
M3 First GPU Pod run-module3-verify-workload.sh Link
M4 MPS Basics (DRA-managed) run-module4-mps-basics.sh Link
M5 MPS Resource Limits run-module5-mps-advanced.sh Link
M6 vLLM on MPS run-module6-vllm-verify.sh Link
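
For orientation, M3's first GPU pod uses the standard DRA pattern: a ResourceClaimTemplate that requests one device from the driver's DeviceClass, plus a Pod that references the claim. The sketch below is a minimal approximation, not the module's manifest; the gpu.nvidia.com class name and the CUDA image tag are assumptions to check against the M3 docs.

# Minimal DRA GPU pod sketch (illustrative names)
cat <<'EOF' | kubectl apply -f -
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: single-gpu
spec:
  spec:
    devices:
      requests:
      - name: gpu
        deviceClassName: gpu.nvidia.com        # DeviceClass installed by the DRA driver in M2
---
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  resourceClaims:
  - name: gpu
    resourceClaimTemplateName: single-gpu      # each pod gets its own claim from the template
  containers:
  - name: cuda
    image: nvidia/cuda:12.4.1-base-ubuntu22.04
    command: ["nvidia-smi", "-L"]
    resources:
      claims:
      - name: gpu                              # bind this container to the allocated GPU
EOF

kubectl describe resourceclaim then shows which physical GPU was allocated to the pod.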

Phase 2: Production Readiness

Module Topic Script Docs
M7 Consumable Capacity (Alpha) run-module7-consumable-capacity.sh Link
M8 Admin Access (Beta) run-module8-admin-access.sh Link
M8 Observability (DCGM via adminAccess) run-module8-observability.sh Link
M9 Resilience (Chaos) run-module9-resilience.sh Link
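
M8 requests admin access on the claim itself rather than through privileged pod settings. A minimal sketch, assuming a hypothetical gpu-admin namespace; on releases where adminAccess is beta, the namespace usually has to be labeled for admin access first (the module docs cover the exact prerequisite for your cluster version).

kubectl create namespace gpu-admin
kubectl label namespace gpu-admin resource.k8s.io/admin-access=true   # required where DRAAdminAccess is beta

cat <<'EOF' | kubectl apply -f -
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaim
metadata:
  name: admin-gpu
  namespace: gpu-admin
spec:
  devices:
    requests:
    - name: gpu
      deviceClassName: gpu.nvidia.com
      adminAccess: true        # observe the GPU even if another claim already holds it
EOF

A debugging or DCGM pod then references admin-gpu through resourceClaims exactly like the Phase 1 sketch.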

Phase 3: MIG Hardware Isolation (A100/H100 only)

Module Topic Script Docs
M10.1 MIG Profile Selection module10-1/run.sh Link
M10.2 CEL Capacity Matching module10-2/run.sh Link
M10.3 OOM Isolation Proof module10-3/run.sh Link
M10.4 MIG x MPS Hybrid module10-4/run.sh Link
M10.5 Silicon-to-Pod Traceability module10-5/run.sh Link
M10.6 Dynamic MIG Reconfiguration module10-6/run.sh Link
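
M10.1 and M10.2 pick MIG slices declaratively: a CEL selector filters the devices the driver publishes in its ResourceSlices. The sketch below assumes the mig.nvidia.com DeviceClass and a profile attribute as published by the driver; confirm the real names with kubectl get resourceslices -o yaml before relying on them.

cat <<'EOF' | kubectl apply -f -
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: mig-small
spec:
  spec:
    devices:
      requests:
      - name: mig
        deviceClassName: mig.nvidia.com        # assumed MIG DeviceClass name
        selectors:
        - cel:
            # match a specific MIG profile, or filter on published capacity instead, e.g.
            #   device.capacity['gpu.nvidia.com'].memory.compareTo(quantity('5Gi')) >= 0
            expression: device.attributes['gpu.nvidia.com'].profile == '1g.5gb'
EOF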

Quick Start

# Setup (run once)
./scripts/phase1/run-module0-check-env.sh
./scripts/phase1/run-module1-setup-kind.sh
./scripts/phase1/run-module2-install-driver.sh

# Run any module independently (M3-M9)
./scripts/phase1/run-module3-verify-workload.sh
./scripts/phase2/run-module8-admin-access.sh    # order doesn't matter

# Run everything
./run_all.sh

# MIG mode (A100/H100 only)
sudo ./scripts/common/mig-reconfig.sh mig       # enable MIG
./scripts/phase3/module10-1/run.sh               # run MIG modules
sudo ./scripts/common/mig-reconfig.sh gpu        # restore full GPU mode

Module Independence

After initial setup (M0 → M1 → M2), every module (M3-M9) is self-contained and can be run in any order. Each module sources a shared ensure-ready.sh helper (sketched just after this list) that:

  • Verifies driver pods are running
  • Enables the MPSSupport feature gate if needed
  • Cleans stale DeviceClasses, pods, claims, and MPS daemons from prior modules
  • Waits for ResourceSlice availability
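
A simplified approximation of those checks, assuming the driver runs in an nvidia-dra-driver-gpu namespace (the real helper under scripts/common/ is authoritative):

# 1. Driver pods are up (namespace name is an assumption)
kubectl -n nvidia-dra-driver-gpu get pods

# 2. Drop leftovers from earlier modules in the current namespace
kubectl delete pods,resourceclaims,resourceclaimtemplates --all --ignore-not-found

# 3. Block until the driver is advertising devices again
until kubectl get resourceslices 2>/dev/null | grep -q gpu.nvidia.com; do sleep 2; done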

Module Dependency Graph

graph TD
    M0[M0: Check Env] --> M1[M1: Kind Cluster]
    M1 --> M2[M2: DRA Driver]
    M2 --> M3[M3: First Pod]
    M2 --> M4[M4: MPS Basics]
    M2 --> M5[M5: MPS Limits]
    M2 --> M6[M6: vLLM]
    M2 --> M7[M7: Capacity]
    M2 --> M8[M8: Admin + DCGM]
    M2 --> M9[M9: Resilience]
    M2 --> MIG{MIG enabled?}
    MIG --> M10.1[M10.1: MIG Selection]
    MIG --> M10.2[M10.2: CEL Matching]
    MIG --> M10.3[M10.3: OOM Isolation]
    MIG --> M10.4[M10.4: MIG x MPS]
    MIG --> M10.5[M10.5: Traceability]
    MIG --> M10.6[M10.6: Reconfig]

    classDef infra fill:#f9f,stroke:#333,stroke-width:2px
    classDef workload fill:#bbf,stroke:#333,stroke-width:2px
    classDef mig fill:#fdb,stroke:#333,stroke-width:2px
    class M0,M1,M2 infra
    class M3,M4,M5,M6,M7,M8,M9 workload
    class M10.1,M10.2,M10.3,M10.4,M10.5,M10.6 mig

Test Environment

Component Config
GPU NVIDIA A100-PCIE-40GB (MIG), RTX 5090 (MPS/vLLM)
OS Ubuntu 22.04 LTS, Kernel 5.15+
Driver NVIDIA Driver 550+
K8s Kind v0.24+ (K8s v1.32–1.34) with DRA feature gates
DRA Driver nvcr.io/nvidia/k8s-dra-driver-gpu:v25.8.1

Key Technical Highlights

  • DRA-managed MPS: GPU sharing without hostIPC — the driver creates per-claim MPS daemons automatically (see the sketch after this list)
  • Kind compatibility: the COPY_DRIVER_LIBS_FROM_ROOT mechanism solves NVML library discovery on Kind clusters that lack the NVIDIA container runtime
  • Hardware isolation: MIG partitioning provides independent memory controllers and SMs per slice
  • MIG x MPS hybrid: Hardware isolation between slices + software sharing within each slice
  • Resilience by design: CDI decoupling means driver/controller restarts don't kill running workloads
  • Admin Access: adminAccess: true bypasses GPU exclusivity for debugging fully-allocated nodes
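
For the first highlight, the sharing policy lives entirely in the claim: the driver reads an opaque config block and starts a per-claim MPS daemon itself, so the pod needs no hostIPC and no manual nvidia-cuda-mps-control. A minimal sketch; the parameters apiVersion/kind and field names follow the driver's example manifests and can differ between driver releases, so treat the M4/M5 manifests as authoritative.

cat <<'EOF' | kubectl apply -f -
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: shared-gpu-mps
spec:
  spec:
    devices:
      requests:
      - name: gpu
        deviceClassName: gpu.nvidia.com
      config:
      - requests: ["gpu"]
        opaque:
          driver: gpu.nvidia.com
          parameters:                              # driver-specific schema, assumed from its examples
            apiVersion: resource.nvidia.com/v1beta1
            kind: GpuConfig
            sharing:
              strategy: MPS
              mpsConfig:
                defaultActiveThreadPercentage: 50  # cap each client at half the SMs
EOF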

Troubleshooting

If any module fails unexpectedly, run reset-env first:

./scripts/common/reset-env.sh       # clean all resources + refresh DRA plugin (keeps cluster)

This resolves most issues (stale claims, stale plugin sockets, leftover DeviceClasses). See the full Troubleshooting Guide for details.
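
To see what reset-env is cleaning up, the DRA objects can also be listed directly (the driver namespace and the log label selector below are assumptions):

kubectl get resourceslices                     # devices the driver is currently advertising
kubectl get resourceclaims -A                  # claims, including any stuck in allocation
kubectl get deviceclasses                      # classes left behind by earlier modules
kubectl -n nvidia-dra-driver-gpu logs -l app.kubernetes.io/name=nvidia-dra-driver-gpu --tail=50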

Cleanup

./scripts/common/run-teardown.sh    # destroy Kind cluster