deployment with updated rust crate #9

Open
fuddin-bit wants to merge 40 commits into wseaton:main from fuddin-bit:main
@fuddin-bit

Summary

  • Deploy gpu-pruner to user namespaces with CoreWeave Prometheus integration and configurable idle threshold (< 1%)
  • Add Slack notifications and LoadBalancer interaction endpoint for alert acknowledgments
  • Expand Grafana dashboard and Helm charts with DCGM panels aligned to query.promql.j2 idle detection logic
  • Remove embedded web dashboard in favor of Grafana; add deployment docs and session notes
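
The Slack notification in the summary can be pictured as a small JSON body posted to an incoming webhook. The block below is a minimal sketch, not the PR's implementation: it uses only the standard library, the function names `json_escape` and `idle_alert_payload` are hypothetical, and the message wording is invented for illustration. The only external fact it relies on is that Slack incoming webhooks accept a JSON object with a `text` field.

```rust
/// Escape a string so it can be embedded in a JSON string literal.
fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            // Remaining control characters use the \uXXXX form.
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Build the webhook body for one idle workload.
/// `kind` is the workload type, e.g. "Deployment" (hypothetical usage).
fn idle_alert_payload(namespace: &str, name: &str, kind: &str) -> String {
    let text = format!("Idle GPU workload: {} {}/{}", kind, namespace, name);
    format!("{{\"text\":\"{}\"}}", json_escape(&text))
}
```

Calling `idle_alert_payload("fuddin-dev", "trainer", "Deployment")` yields a single-field JSON object; sending it over HTTP is left out because the PR does not say which client it uses.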

Key changes

gpu-pruner

  • Slack webhook notifications and interaction server (--slack-interaction-port, --slack-channel)
  • Idle detection uses DCGM_FI_PROF_GR_ENGINE_ACTIVE with DCGM_FI_DEV_GPU_UTIL fallback and < 0.01 threshold
  • User-namespace deploy script and LoadBalancer service for Slack callbacks
  • Removed built-in HTML dashboard

Grafana / monitoring

  • gpu-dashboard.json: Engine idle/active (30m) and Idle GPU Workloads table match pruner PromQL
  • Helm values, dashboard ConfigMap, and deployment guides
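
The idle rule listed under gpu-pruner (DCGM_FI_PROF_GR_ENGINE_ACTIVE, falling back to DCGM_FI_DEV_GPU_UTIL, idle below 0.01) can be sketched as a pure function. This is an illustration under stated assumptions, not the pruner's code: the real check lives in query.promql.j2, the `GpuSample` type and `is_idle` function are hypothetical, and dividing DCGM_FI_DEV_GPU_UTIL by 100 to put it on the same 0 to 1 scale as the engine-active ratio is an assumption about how the fallback is normalized.

```rust
/// Idle threshold from the PR description: a ratio below 0.01 (under 1%).
const IDLE_THRESHOLD: f64 = 0.01;

/// One GPU reading. Either metric may be missing from a given cluster.
/// The type and field names are hypothetical.
#[derive(Debug, Clone, Copy)]
struct GpuSample {
    /// DCGM_FI_PROF_GR_ENGINE_ACTIVE, a ratio in 0.0..=1.0.
    engine_active: Option<f64>,
    /// DCGM_FI_DEV_GPU_UTIL, a percentage in 0.0..=100.0.
    gpu_util_percent: Option<f64>,
}

/// Returns Some(true) when idle, Some(false) when active,
/// and None when neither metric is available.
fn is_idle(sample: GpuSample, threshold: f64) -> Option<bool> {
    let ratio = sample
        .engine_active
        // Assumed normalization: convert the percentage to a ratio.
        .or_else(|| sample.gpu_util_percent.map(|p| p / 100.0))?;
    Some(ratio < threshold)
}
```

Returning `None` when both metrics are absent keeps "no data" distinct from "idle", so a caller can avoid scaling down a workload just because its metrics are missing.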

Test plan

  • kubectl get pods -n fuddin-dev — gpu-pruner pod healthy
  • Logs show Query succeeded against Prometheus: kubectl logs -n fuddin-dev deployment/gpu-pruner --context coreweave-waldorf --tail=50
  • Slack #test-pruner receives idle GPU alerts
  • curl to LoadBalancer /slack/interactions returns OK for URL verification
  • Grafana panels show idle/active counts consistent with pruner query results
  • RBAC ClusterRoleBindings applied for scale-down

fuddin-bit and others added 30 commits June 1, 2026 14:16
- Add Axum-based web server with real-time dashboard UI
- Add REST API endpoint at /api/status for programmatic access
- Add Kubernetes Service and OpenShift Route manifests
- Update deployment to expose dashboard on port 8080
- Add comprehensive documentation (DASHBOARD.md, DEPLOYMENT_GUIDE.md)

Dashboard features:
- Real-time monitoring with 10-second auto-refresh
- Display total pods checked, idle workloads, and wasted resources
- Detailed table of idle workloads (namespace, name, type)
- Modern responsive UI with gradient design and badges
- CORS-enabled API for external integrations

Kubernetes resources:
- Service: gpu-pruner-dashboard (ClusterIP on port 8080)
- Route: External HTTPS access via OpenShift Route
- Updated deployment with dashboard port configuration

Implements requirements from intern project:
- Current running workload display
- Idle GPU workloads list
- Resource consumption metrics
- Web UI for easy monitoring

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Implements native Prometheus metrics exposition to complement existing
OTEL support and enable standard Prometheus-based monitoring of
gpu-pruner operations.

## Changes

### Application Metrics
- Add prometheus crate (v0.13) and lazy_static dependencies
- Create new metrics module (src/metrics.rs) with:
  - Counters: query_successes, query_failures, query_candidates,
    query_shutdown_events, scale_successes, scale_failures
  - Gauges: idle_gpus, pods_checked_total
- Initialize metrics registry at startup
- Instrument main.rs at 6 locations (query/scale operations)
- Update dashboard state sync to also update Prometheus gauges

### HTTP Endpoints
- Add GET /metrics route to dashboard router (Axum)
- Returns Prometheus text format via prometheus::TextEncoder
- Reuses existing dashboard server on port 8080

### Kubernetes Resources
- Create ServiceMonitor for Prometheus Operator scraping
- Update service.yaml port name from "dashboard" to "http"
- Add servicemonitor.yaml to kustomization.yaml
- Add prometheus-scrape-config.yaml with example configs for
  DCGM exporter and kube-state-metrics

## Verification

Tested locally:
- cargo build --release: Success
- /metrics endpoint: Returns valid Prometheus text format
- /api/status endpoint: Still functional
- Metrics increment correctly (query_failures_total = 1 on bad URL)

## Metrics Exposed

- gpu_pruner_query_successes_total (counter)
- gpu_pruner_query_failures_total (counter)
- gpu_pruner_query_candidates_total (counter)
- gpu_pruner_query_shutdown_events_total (counter)
- gpu_pruner_scale_successes_total (counter)
- gpu_pruner_scale_failures_total (counter)
- gpu_pruner_idle_gpus (gauge)
- gpu_pruner_pods_checked_total (gauge)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
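
For readers unfamiliar with what GET /metrics returns, the sketch below renders a few of the metrics listed above in the Prometheus text exposition format. It is a hand-rolled, standard-library stand-in meant only to show the output shape; the commit itself uses the prometheus crate's TextEncoder, and the `Kind`, `Sample`, and `render` names here are hypothetical.

```rust
/// Metric type as it appears on the `# TYPE` line.
enum Kind {
    Counter,
    Gauge,
}

/// One unlabeled metric value. Hypothetical helper type for this sketch.
struct Sample {
    name: &'static str,
    kind: Kind,
    value: f64,
}

/// Render samples as Prometheus text format: a `# TYPE` line
/// followed by `name value`, one pair per metric.
fn render(samples: &[Sample]) -> String {
    let mut out = String::new();
    for s in samples {
        let kind = match s.kind {
            Kind::Counter => "counter",
            Kind::Gauge => "gauge",
        };
        out.push_str(&format!("# TYPE {} {}\n", s.name, kind));
        out.push_str(&format!("{} {}\n", s.name, s.value));
    }
    out
}
```

Rendering `gpu_pruner_query_failures_total` as a counter with value 1.0 produces the pair of lines a scraper would see after the bad-URL check described under Verification.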
Provides step-by-step instructions for deploying the new metrics
functionality including image updates, service/servicemonitor creation,
and verification steps.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Extend gpu-dashboard.json with temperature, power, VRAM %, memory-copy,
XID, and optional profiling metrics; sync Helm ConfigMap and document
PromQL in DASHBOARD.md and GRAFANA_DEPLOYMENT.md.

Co-authored-by: Cursor <cursoragent@cursor.com>
Include values for OpenShift and vanilla Kubernetes, dashboard import
script, CoreWeave ingress guide, and README Helm quick start.

Co-authored-by: Cursor <cursoragent@cursor.com>
Visualize DCGM_FI_PROF_GR_ENGINE_ACTIVE per node and document PromQL
in DASHBOARD.md and GRAFANA_DEPLOYMENT.md.

Co-authored-by: Cursor <cursoragent@cursor.com>
… states. Adjusted legend formats and added refIds for clarity in the Grafana configuration. Ensured consistency across dashboard panels for better monitoring of GPU workloads.
- Updated ClusterRoleBindings to reference fuddin-dev namespace
- Added namespace.yaml to kustomization resources
- All manifests now use fuddin-dev instead of gpu-pruner-system

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
The code uses let chains, which require the Rust 2024 edition.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>