deployment with updated rust crate #9
Open
fuddin-bit wants to merge 40 commits into
- Add Axum-based web server with real-time dashboard UI
- Add REST API endpoint at /api/status for programmatic access
- Add Kubernetes Service and OpenShift Route manifests
- Update deployment to expose dashboard on port 8080
- Add comprehensive documentation (DASHBOARD.md, DEPLOYMENT_GUIDE.md)

Dashboard features:
- Real-time monitoring with 10-second auto-refresh
- Display total pods checked, idle workloads, and wasted resources
- Detailed table of idle workloads (namespace, name, type)
- Modern responsive UI with gradient design and badges
- CORS-enabled API for external integrations

Kubernetes resources:
- Service: gpu-pruner-dashboard (ClusterIP on port 8080)
- Route: External HTTPS access via OpenShift Route
- Updated deployment with dashboard port configuration

Implements requirements from intern project:
- Current running workload display
- Idle GPU workloads list
- Resource consumption metrics
- Web UI for easy monitoring

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Implements native Prometheus metrics exposition to complement existing
OTEL support and enable standard Prometheus-based monitoring of
gpu-pruner operations.
## Changes
### Application Metrics
- Add prometheus crate (v0.13) and lazy_static dependencies
- Create new metrics module (src/metrics.rs) with:
  - Counters: query_successes, query_failures, query_candidates, query_shutdown_events, scale_successes, scale_failures
  - Gauges: idle_gpus, pods_checked_total
- Initialize metrics registry at startup
- Instrument main.rs at 6 locations (query/scale operations)
- Update dashboard state sync to also update Prometheus gauges
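The instrumentation pattern described above (counters bumped at query sites, gauges overwritten during dashboard state sync) can be sketched with the standard library alone. This is not the contents of src/metrics.rs, which uses the prometheus crate; the `Metrics` struct and the `record_query` and `sync_gauges` methods are hypothetical names chosen for illustration, and only a subset of the listed metrics is modelled.

```rust
use std::sync::atomic::{AtomicI64, AtomicU64, Ordering};

/// Hypothetical stand-in for a few of the counters and gauges listed above.
#[derive(Default)]
pub struct Metrics {
    // Counters: only ever increase.
    pub query_successes: AtomicU64,
    pub query_failures: AtomicU64,
    pub query_candidates: AtomicU64,
    // Gauges: overwritten with the latest observed value.
    pub idle_gpus: AtomicI64,
    pub pods_checked_total: AtomicI64,
}

impl Metrics {
    /// Record the outcome of one query.
    /// `Ok(n)` means the query ran and returned `n` candidates.
    pub fn record_query(&self, outcome: Result<u64, String>) {
        match outcome {
            Ok(candidates) => {
                self.query_successes.fetch_add(1, Ordering::Relaxed);
                self.query_candidates.fetch_add(candidates, Ordering::Relaxed);
            }
            Err(_) => {
                self.query_failures.fetch_add(1, Ordering::Relaxed);
            }
        }
    }

    /// Overwrite the gauges, as a dashboard state sync might do.
    pub fn sync_gauges(&self, idle_gpus: i64, pods_checked: i64) {
        self.idle_gpus.store(idle_gpus, Ordering::Relaxed);
        self.pods_checked_total.store(pods_checked, Ordering::Relaxed);
    }
}
```

Calling `record_query(Err(..))` once after a failed query leaves `query_failures` at 1, which mirrors the "bad URL" check mentioned under Verification.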
### HTTP Endpoints
- Add GET /metrics route to dashboard router (Axum)
- Returns Prometheus text format via prometheus::TextEncoder
- Reuses existing dashboard server on port 8080
### Kubernetes Resources
- Create ServiceMonitor for Prometheus Operator scraping
- Update service.yaml port name from "dashboard" to "http"
- Add servicemonitor.yaml to kustomization.yaml
- Add prometheus-scrape-config.yaml with example configs for
DCGM exporter and kube-state-metrics
## Verification
Tested locally:
- cargo build --release: Success
- /metrics endpoint: Returns valid Prometheus text format
- /api/status endpoint: Still functional
- Metrics increment correctly (query_failures_total = 1 on bad URL)
## Metrics Exposed
- gpu_pruner_query_successes_total (counter)
- gpu_pruner_query_failures_total (counter)
- gpu_pruner_query_candidates_total (counter)
- gpu_pruner_query_shutdown_events_total (counter)
- gpu_pruner_scale_successes_total (counter)
- gpu_pruner_scale_failures_total (counter)
- gpu_pruner_idle_gpus (gauge)
- gpu_pruner_pods_checked_total (gauge)
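For readers unfamiliar with what /metrics returns, here is a minimal sketch of the Prometheus text exposition format for metrics like those above. The PR itself relies on prometheus::TextEncoder; the `Sample` type, the `render` function, and the help strings below are hypothetical and exist only to show the `# HELP` / `# TYPE` / value line layout.

```rust
use std::fmt::Write;

/// Metric kinds used in the list above.
#[derive(Clone, Copy)]
pub enum Kind {
    Counter,
    Gauge,
}

/// One unlabelled metric sample (hypothetical helper type).
pub struct Sample<'a> {
    pub name: &'a str,
    pub help: &'a str,
    pub kind: Kind,
    pub value: f64,
}

/// Render samples as Prometheus text exposition lines.
pub fn render(samples: &[Sample<'_>]) -> String {
    let mut out = String::new();
    for s in samples {
        let kind = match s.kind {
            Kind::Counter => "counter",
            Kind::Gauge => "gauge",
        };
        // Writing to a String cannot fail, so the results are ignored.
        let _ = writeln!(out, "# HELP {} {}", s.name, s.help);
        let _ = writeln!(out, "# TYPE {} {}", s.name, kind);
        let _ = writeln!(out, "{} {}", s.name, s.value);
    }
    out
}
```

A single counter sample named gpu_pruner_query_failures_total with value 1 renders as three lines: the HELP comment, `# TYPE gpu_pruner_query_failures_total counter`, and `gpu_pruner_query_failures_total 1`.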
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Provides step-by-step instructions for deploying the new metrics functionality including image updates, service/servicemonitor creation, and verification steps. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Extend gpu-dashboard.json with temperature, power, VRAM %, memory-copy, XID, and optional profiling metrics; sync Helm ConfigMap and document PromQL in DASHBOARD.md and GRAFANA_DEPLOYMENT.md. Co-authored-by: Cursor <cursoragent@cursor.com>
Include values for OpenShift and vanilla Kubernetes, dashboard import script, CoreWeave ingress guide, and README Helm quick start. Co-authored-by: Cursor <cursoragent@cursor.com>
Visualize DCGM_FI_PROF_GR_ENGINE_ACTIVE per node and document PromQL in DASHBOARD.md and GRAFANA_DEPLOYMENT.md. Co-authored-by: Cursor <cursoragent@cursor.com>
… states. Adjusted legend formats and added refIds for clarity in the Grafana configuration. Ensured consistency across dashboard panels for better monitoring of GPU workloads.
Deployment
- Updated ClusterRoleBindings to reference fuddin-dev namespace
- Added namespace.yaml to kustomization resources
- All manifests now use fuddin-dev instead of gpu-pruner-system

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Code uses let chains which require edition 2024. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Summary
- query.promql.j2 idle detection logic

Key changes

gpu-pruner
- … (--slack-interaction-port, --slack-channel)
- DCGM_FI_PROF_GR_ENGINE_ACTIVE with DCGM_FI_DEV_GPU_UTIL fallback and < 0.01 threshold

Grafana / monitoring
- gpu-dashboard.json: Engine idle/active (30m) and Idle GPU Workloads table match pruner PromQL

Test plan
- kubectl get pods -n fuddin-dev: gpu-pruner pod healthy
- Query succeeded against Prometheus: kubectl logs -n fuddin-dev deployment/gpu-pruner --context coreweave-waldorf --tail=50
- #test-pruner receives idle GPU alerts
- curl to LoadBalancer /slack/interactions returns OK for URL verification
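The idle-detection rule in the summary (prefer DCGM_FI_PROF_GR_ENGINE_ACTIVE, fall back to DCGM_FI_DEV_GPU_UTIL, treat values below 0.01 as idle) is implemented in PromQL in the PR, but the decision logic can be sketched in Rust. The function names are hypothetical. One assumption is not confirmed by the PR text: that the fallback, which DCGM reports as a percentage, is divided by 100 before it is compared with the threshold.

```rust
/// Threshold from the PR summary: activity below this counts as idle.
pub const IDLE_THRESHOLD: f64 = 0.01;

/// Pick the activity ratio for one GPU.
///
/// `engine_active` models DCGM_FI_PROF_GR_ENGINE_ACTIVE (a 0..1 ratio).
/// `gpu_util_percent` models DCGM_FI_DEV_GPU_UTIL (0..100 percent).
/// ASSUMPTION: the fallback is normalised to a ratio; the real query may differ.
pub fn activity_ratio(engine_active: Option<f64>, gpu_util_percent: Option<f64>) -> Option<f64> {
    engine_active.or_else(|| gpu_util_percent.map(|p| p / 100.0))
}

/// `Some(true)` if idle, `Some(false)` if active, `None` if no metric exists.
pub fn is_idle(engine_active: Option<f64>, gpu_util_percent: Option<f64>) -> Option<bool> {
    activity_ratio(engine_active, gpu_util_percent).map(|r| r < IDLE_THRESHOLD)
}
```

Returning `None` when neither metric is present keeps "no data" distinct from "idle", so a missing exporter would not by itself mark a workload for scale-down in this sketch.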