deployment with updated rust crate by fuddin-bit · Pull Request #9 · wseaton/gpu-pruner

fuddin-bit · 2026-06-15T17:30:23Z

Summary

Deploy gpu-pruner to user namespaces with CoreWeave Prometheus integration and configurable idle threshold (< 1%)
Add Slack notifications and LoadBalancer interaction endpoint for alert acknowledgments
Expand Grafana dashboard and Helm charts with DCGM panels aligned to query.promql.j2 idle detection logic
Remove embedded web dashboard in favor of Grafana; add deployment docs and session notes

Key changes

gpu-pruner

Slack webhook notifications and interaction server (--slack-interaction-port, --slack-channel)
Idle detection uses DCGM_FI_PROF_GR_ENGINE_ACTIVE with DCGM_FI_DEV_GPU_UTIL fallback and < 0.01 threshold
User-namespace deploy script and LoadBalancer service for Slack callbacks
Removed built-in HTML dashboard

Grafana / monitoring

gpu-dashboard.json: Engine idle/active (30m) and Idle GPU Workloads table match pruner PromQL
Helm values, dashboard ConfigMap, and deployment guides
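
The idle rule in the key changes above (prefer DCGM_FI_PROF_GR_ENGINE_ACTIVE, fall back to DCGM_FI_DEV_GPU_UTIL, compare against 0.01) can be sketched in Rust roughly as follows. This is an illustration rather than the PR's code: the actual decision is made in PromQL (query.promql.j2), and the `GpuSample` type, the `is_idle` function, and the choice to divide the percentage-valued utilization metric by 100 before comparing are all assumptions made for the example.

```rust
/// Threshold from the PR description: a GPU counts as idle below 1% activity.
const IDLE_THRESHOLD: f64 = 0.01;

/// Hypothetical container for one GPU's most recent samples. Either metric may
/// be missing, for example when profiling metrics are not exported.
#[derive(Debug, Clone, Copy)]
struct GpuSample {
    /// DCGM_FI_PROF_GR_ENGINE_ACTIVE, a ratio between 0.0 and 1.0.
    gr_engine_active: Option<f64>,
    /// DCGM_FI_DEV_GPU_UTIL, a percentage between 0.0 and 100.0.
    gpu_util_percent: Option<f64>,
}

/// Returns Some(true) for an idle GPU, Some(false) for an active one, and None
/// when neither metric is available, so missing data is never treated as idle.
fn is_idle(sample: GpuSample) -> Option<bool> {
    let activity = sample
        .gr_engine_active
        // Fallback: normalize the percentage to a ratio (an assumption of this sketch).
        .or_else(|| sample.gpu_util_percent.map(|pct| pct / 100.0))?;
    Some(activity < IDLE_THRESHOLD)
}
```

Returning `Option<bool>` instead of `bool` keeps "no data" distinct from "idle", which matters when the result drives a scale-down.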

Test plan

kubectl get pods -n fuddin-dev — gpu-pruner pod healthy
Logs show Query succeeded against Prometheus: kubectl logs -n fuddin-dev deployment/gpu-pruner --context coreweave-waldorf --tail=50
Slack #test-pruner receives idle GPU alerts
curl to LoadBalancer /slack/interactions returns OK for URL verification
Grafana panels show idle/active counts consistent with pruner query results
RBAC ClusterRoleBindings applied for scale-down

- Add Axum-based web server with real-time dashboard UI
- Add REST API endpoint at /api/status for programmatic access
- Add Kubernetes Service and OpenShift Route manifests
- Update deployment to expose dashboard on port 8080
- Add comprehensive documentation (DASHBOARD.md, DEPLOYMENT_GUIDE.md)

Dashboard features:
- Real-time monitoring with 10-second auto-refresh
- Display total pods checked, idle workloads, and wasted resources
- Detailed table of idle workloads (namespace, name, type)
- Modern responsive UI with gradient design and badges
- CORS-enabled API for external integrations

Kubernetes resources:
- Service: gpu-pruner-dashboard (ClusterIP on port 8080)
- Route: External HTTPS access via OpenShift Route
- Updated deployment with dashboard port configuration

Implements requirements from intern project:
- Current running workload display
- Idle GPU workloads list
- Resource consumption metrics
- Web UI for easy monitoring

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Implements native Prometheus metrics exposition to complement existing OTEL support and enable standard Prometheus-based monitoring of gpu-pruner operations.

## Changes

### Application Metrics
- Add prometheus crate (v0.13) and lazy_static dependencies
- Create new metrics module (src/metrics.rs) with:
  - Counters: query_successes, query_failures, query_candidates, query_shutdown_events, scale_successes, scale_failures
  - Gauges: idle_gpus, pods_checked_total
- Initialize metrics registry at startup
- Instrument main.rs at 6 locations (query/scale operations)
- Update dashboard state sync to also update Prometheus gauges

### HTTP Endpoints
- Add GET /metrics route to dashboard router (Axum)
- Returns Prometheus text format via prometheus::TextEncoder
- Reuses existing dashboard server on port 8080

### Kubernetes Resources
- Create ServiceMonitor for Prometheus Operator scraping
- Update service.yaml port name from "dashboard" to "http"
- Add servicemonitor.yaml to kustomization.yaml
- Add prometheus-scrape-config.yaml with example configs for DCGM exporter and kube-state-metrics

## Verification

Tested locally:
- cargo build --release: Success
- /metrics endpoint: Returns valid Prometheus text format
- /api/status endpoint: Still functional
- Metrics increment correctly (query_failures_total = 1 on bad URL)

## Metrics Exposed
- gpu_pruner_query_successes_total (counter)
- gpu_pruner_query_failures_total (counter)
- gpu_pruner_query_candidates_total (counter)
- gpu_pruner_query_shutdown_events_total (counter)
- gpu_pruner_scale_successes_total (counter)
- gpu_pruner_scale_failures_total (counter)
- gpu_pruner_idle_gpus (gauge)
- gpu_pruner_pods_checked_total (gauge)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
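
For readers unfamiliar with what the /metrics endpoint returns, the sketch below renders two of the metrics named above in the Prometheus text exposition format. It is a dependency-free stand-in, not the commit's implementation: the commit uses the prometheus crate and prometheus::TextEncoder, whereas the statics and the helper functions `record_query_failure`, `set_idle_gpus`, `render_metrics`, and `write_metric` here are hypothetical names chosen for illustration.

```rust
use std::fmt::Write;
use std::sync::atomic::{AtomicU64, Ordering};

// Process-wide state for one counter and one gauge from the list above.
static QUERY_FAILURES: AtomicU64 = AtomicU64::new(0);
static IDLE_GPUS: AtomicU64 = AtomicU64::new(0);

/// Counters only ever increase.
fn record_query_failure() {
    QUERY_FAILURES.fetch_add(1, Ordering::Relaxed);
}

/// Gauges are overwritten with the latest observed value.
fn set_idle_gpus(count: u64) {
    IDLE_GPUS.store(count, Ordering::Relaxed);
}

/// Appends a `# TYPE` line followed by a single unlabeled sample line.
fn write_metric(out: &mut String, name: &str, kind: &str, value: u64) {
    // Writing into a String cannot fail, so the results are ignored.
    let _ = writeln!(out, "# TYPE {name} {kind}");
    let _ = writeln!(out, "{name} {value}");
}

/// Builds the body a GET /metrics handler could return.
fn render_metrics() -> String {
    let mut out = String::new();
    write_metric(
        &mut out,
        "gpu_pruner_query_failures_total",
        "counter",
        QUERY_FAILURES.load(Ordering::Relaxed),
    );
    write_metric(
        &mut out,
        "gpu_pruner_idle_gpus",
        "gauge",
        IDLE_GPUS.load(Ordering::Relaxed),
    );
    out
}
```

After one call to `record_query_failure()`, the rendered text contains the line `gpu_pruner_query_failures_total 1`, which matches the local verification step described in the commit (one failure recorded on a bad URL).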

Extend gpu-dashboard.json with temperature, power, VRAM %, memory-copy, XID, and optional profiling metrics; sync Helm ConfigMap and document PromQL in DASHBOARD.md and GRAFANA_DEPLOYMENT.md. Co-authored-by: Cursor <cursoragent@cursor.com>

Deployment

- Updated ClusterRoleBindings to reference fuddin-dev namespace
- Added namespace.yaml to kustomization resources
- All manifests now use fuddin-dev instead of gpu-pruner-system

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

fuddin-bit and others added 30 commits June 1, 2026 14:16

docs: Add Prometheus deployment guide

c13096e

Provides step-by-step instructions for deploying the new metrics functionality including image updates, service/servicemonitor creation, and verification steps. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

added PROMETHEUS UID and added UID

bbbc6a7

Add Grafana Helm deployment and ingress access docs.

b175ece

Include values for OpenShift and vanilla Kubernetes, dashboard import script, CoreWeave ingress guide, and README Helm quick start. Co-authored-by: Cursor <cursoragent@cursor.com>

Add graphics engine active panel to GPU Health dashboard row.

e04a235

Visualize DCGM_FI_PROF_GR_ENGINE_ACTIVE per node and document PromQL in DASHBOARD.md and GRAFANA_DEPLOYMENT.md. Co-authored-by: Cursor <cursoragent@cursor.com>

Update GPU dashboard to include 30-minute metrics for idle and active…

bacde09

… states. Adjusted legend formats and added refIds for clarity in the Grafana configuration. Ensured consistency across dashboard panels for better monitoring of GPU workloads.

added slack notifies and removed UI

d91b987

0.1 threshold

de61d6d

add slack acknowledgement and in-cluster

3629dfd

Merge pull request #1 from fuddin-bit/deployment

347ddc9

Deployment

feat: Slack acknowledgment interactions and LoadBalancer deploy

dbc8a16

added my namespace with new prom url and args

19bbd9b

removed generated docs and created deployment guide

321a67b

removed docs for review; will write needed docs later

4ed993e

removed bash scripts

80408b1

removed generated docs

17aead2

Update deployment to use :main image tag and fix metrics endpoint

74c0637

Merge deployment branch with metrics endpoint and latest fixes

6424730

fix: Use structured logging for Slack webhook errors

fae9223

fix: Update rust-version to 1.75 (valid version)

ca5d8bf

added fmt

997607b

Change Rust edition from 2024 to 2021

40f6264

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Revert to Rust edition 2024

3b23e86

Code uses let chains which require edition 2024. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

removed if collapse

6dfd286

Merge branch 'main' into main

0d9bdaf

forgot to add rootobject error

c104ab8

formatting

f78a264

fuddin-bit added 10 commits June 16, 2026 09:54

fixed user parsing slack payload

551be3d

5 min grace period and fixed stateful test timeout

e6e9ac5

cargo fmt and commented out e2e due to rbac issues

e9e50cf

fmt and clippy issues fixed

1ce3545

removed ack test

774e0a1

new image

f0a506c

updated readme

c22e79f

excluded llm-d-nightly and bench-guide

132fbcf

updated test case

d1cc1cc

updated image to main

cf83acd

fuddin-bit wants to merge 40 commits into wseaton:main from fuddin-bit:main