Merged
345 changes: 345 additions & 0 deletions COREWEAVE_INGRESS_GUIDE.md
@@ -0,0 +1,345 @@
# CoreWeave Kubernetes (CKS) Ingress Guide

This guide explains how to expose services for external access in CoreWeave Kubernetes Service (CKS).

## Overview

CoreWeave uses a **LoadBalancer + DNS annotation** pattern rather than traditional Ingress controllers. The cluster has **Istio** installed, but standard users don't have permission to create Gateway API resources or VirtualServices.

## Available Methods

### Method 1: LoadBalancer Service with DNS (Recommended for CKS)

CoreWeave provides an **External Hostname Controller** that automatically creates DNS records for LoadBalancer services.

#### How It Works

1. Create a LoadBalancer service
2. Add the `service.beta.kubernetes.io/external-hostname` annotation
3. CoreWeave assigns a public IP and creates a DNS record in the `.coreweave.app` domain
4. DNS status is reflected in the `.status.conditions` field of the Service

#### Example: Expose Grafana with LoadBalancer

**IMPORTANT**: You must add the `service.beta.kubernetes.io/coreweave-load-balancer-type: public` annotation to get a **public IP**. Without this annotation, CoreWeave assigns an internal VIP only.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: gpu-grafana
  namespace: fuddin-dev
  annotations:
    service.beta.kubernetes.io/coreweave-load-balancer-type: "public"  # REQUIRED for public IP
    service.beta.kubernetes.io/external-hostname: "gpu-grafana"
    # This creates: gpu-grafana-<hash>.coreweave.app
spec:
  type: LoadBalancer
  selector:
    app.kubernetes.io/name: grafana
    app.kubernetes.io/instance: gpu-grafana
  ports:
    - name: http
      port: 80
      targetPort: 3000
      protocol: TCP
```

Apply and check the assigned hostname:

```bash
kubectl apply -f grafana-loadbalancer.yaml

# Wait for external IP assignment
kubectl get svc gpu-grafana -n fuddin-dev -w

# Check the assigned DNS name in status
kubectl get svc gpu-grafana -n fuddin-dev -o jsonpath='{.status.conditions[?(@.type=="ExternalRecords")].message}'
```

The service will be accessible at: `http://gpu-grafana-<hash>.coreweave.app`
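
Rather than watching the Service by hand, the wait can be scripted. The following is a minimal sketch, not part of the original guide: the function name `wait_for_lb_ip` and its default timeout are hypothetical. It only reports that *an* address was assigned; per the note above, the address may be an internal VIP if the public load-balancer annotation is missing.

```bash
# Poll a Service until its load balancer reports an IP, or give up.
# Usage: wait_for_lb_ip <namespace> <service> [timeout_seconds]
# (hypothetical helper; prints the IP on success, returns 1 on timeout)
wait_for_lb_ip() {
  local ns="$1" svc="$2" timeout="${3:-120}" elapsed=0 ip
  while [ "$elapsed" -lt "$timeout" ]; do
    ip="$(kubectl get svc "$svc" -n "$ns" \
      -o jsonpath='{.status.loadBalancer.ingress[0].ip}' 2>/dev/null)" || true
    if [ -n "$ip" ]; then
      echo "$ip"
      return 0
    fi
    sleep 5
    elapsed=$((elapsed + 5))
  done
  echo "timed out waiting for an external IP on $ns/$svc" >&2
  return 1
}
```

For example, `wait_for_lb_ip fuddin-dev gpu-grafana 180` would poll for up to three minutes.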

#### Wildcard DNS

For wildcard DNS records (e.g., for multiple subdomains):

```yaml
metadata:
  annotations:
    service.beta.kubernetes.io/external-hostname: "*"
    # Creates: *.abc123-mycluster.coreweave.app
```

### Method 2: Port-Forward (Development/Testing)

For temporary access without exposing services publicly:

```bash
# Forward local port 3000 to Grafana service
kubectl port-forward -n fuddin-dev svc/gpu-grafana 3000:80

# Access at http://localhost:3000
```

**Pros**:
- No cluster configuration needed
- Works immediately
- No public exposure

**Cons**:
- Only accessible from your machine
- Connection breaks when command terminates
- Not suitable for production
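
Because the forward dies with the command, scripts that need temporary access usually run it in the background and stop it when done. A minimal sketch, assuming a hypothetical helper named `start_port_forward`:

```bash
# Start a port-forward in the background and print its PID so the caller
# can stop it later.
# Usage: start_port_forward <namespace> <service> <local_port> <service_port>
# (hypothetical helper; output is discarded so command substitution returns at once)
start_port_forward() {
  local ns="$1" svc="$2" local_port="$3" svc_port="$4"
  kubectl port-forward -n "$ns" "svc/$svc" "${local_port}:${svc_port}" >/dev/null 2>&1 &
  echo "$!"
}
```

A caller might run `pid="$(start_port_forward fuddin-dev gpu-grafana 3000 80)"` followed by `trap 'kill "$pid"' EXIT`, so the forward is torn down when the script exits. The sketch does not wait for the local port to become ready; a short retry loop around the first request may be needed.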

### Method 3: Istio VirtualService (Requires Permissions)

CoreWeave has **Istio** installed, but standard users don't have permissions to create VirtualServices or Gateways. This method requires cluster admin assistance.

If you have permissions, you would create:

```yaml
apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: grafana-vs
  namespace: fuddin-dev
spec:
  hosts:
    - "grafana.example.com"
  gateways:
    - istio-system/public-gateway  # Shared cluster gateway
  http:
    - match:
        - uri:
            prefix: /
      route:
        - destination:
            host: gpu-grafana.fuddin-dev.svc.cluster.local
            port:
              number: 80
```

**Note**: This requires a shared Gateway to exist and permissions to create VirtualServices.

## Comparison of Methods

| Method | Access | Setup Complexity | Cost | Use Case |
|--------|--------|------------------|------|----------|
| **LoadBalancer + DNS** | Public internet | Low | Charges for public IP | Production, public dashboards |
| **Port-Forward** | Local only | Very low | Free | Development, debugging |
| **Istio VirtualService** | Shared gateway | Medium | Shared cost | Multi-service routing, advanced traffic control |

## Recommended Approach for Grafana

### Option A: LoadBalancer (Public Access)

Best for a production Grafana instance that multiple team members need to access.

```bash
# Update Grafana service to LoadBalancer
kubectl patch svc gpu-grafana -n fuddin-dev -p '{"spec":{"type":"LoadBalancer"}}'

# Add REQUIRED annotation for public IP
kubectl annotate svc gpu-grafana -n fuddin-dev \
service.beta.kubernetes.io/coreweave-load-balancer-type="public"

# Add DNS annotation
kubectl annotate svc gpu-grafana -n fuddin-dev \
service.beta.kubernetes.io/external-hostname="gpu-grafana"

# Wait for external IP
kubectl get svc gpu-grafana -n fuddin-dev -w

# Get the public IP
kubectl get svc gpu-grafana -n fuddin-dev -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
```
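
Once an address is assigned, it is worth confirming that Grafana actually answers on it. The sketch below is an illustration rather than part of the original guide: the function name `grafana_is_healthy` is hypothetical, and it relies on Grafana's `/api/health` endpoint.

```bash
# Return 0 if Grafana's health endpoint answers with a success status.
# Usage: grafana_is_healthy <base_url>
# (hypothetical helper; --fail makes curl exit non-zero on HTTP errors)
grafana_is_healthy() {
  local base_url="$1"
  curl --silent --fail --max-time 10 "${base_url}/api/health" >/dev/null
}
```

For instance, `grafana_is_healthy "http://<public-ip>" && echo reachable`, substituting the IP returned by the last command above.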

### Option B: Port-Forward (Personal Access)

Best for personal dashboards or development:

```bash
# Add to your shell profile as a reusable shortcut
alias grafana-forward='kubectl port-forward -n fuddin-dev svc/gpu-grafana 3000:80'

# Run whenever you need access
grafana-forward
```

## Current Grafana Setup

Your Grafana is currently deployed with:

- **Service Type**: ClusterIP (internal only)
- **Namespace**: `fuddin-dev`
- **Port**: 80 (service) → 3000 (pod)
- **Access Method**: Port-forward only

### Convert to LoadBalancer

```bash
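# Method 1: kubectl patch
kubectl patch svc gpu-grafana -n fuddin-dev -p '{"spec":{"type":"LoadBalancer"}}'
# The load-balancer-type annotation is required for a public IP (see Method 1 above)
kubectl annotate svc gpu-grafana -n fuddin-dev \
  service.beta.kubernetes.io/coreweave-load-balancer-type="public" \
  service.beta.kubernetes.io/external-hostname="gpu-grafana-fuddin"

# Method 2: Helm upgrade
helm upgrade gpu-grafana grafana/grafana \
  --reuse-values \
  --set service.type=LoadBalancer \
  --set service.annotations."service\.beta\.kubernetes\.io/coreweave-load-balancer-type"="public" \
  --set service.annotations."service\.beta\.kubernetes\.io/external-hostname"="gpu-grafana-fuddin" \
  -n fuddin-dev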
```

## Cluster Architecture

CoreWeave Kubernetes (CKS) uses:

- **Istio** for service mesh (installed at cluster level)
- **Gateway API** (available but restricted permissions)
- **External Hostname Controller** for automatic DNS provisioning
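- **LoadBalancer** services, which get a public IP only when annotated with `service.beta.kubernetes.io/coreweave-load-balancer-type: public` (otherwise an internal VIP)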

### Installed Components

```bash
# Istio control plane
kubectl get svc -n istio-system istiod
# NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S)
# istiod ClusterIP 10.16.0.170 <none> 15010/TCP,15012/TCP,443/TCP,15014/TCP

# Gateway API and Istio resources available
kubectl api-resources | grep -E 'gateway|istio'
# httproutes
# gateways.gateway.networking.k8s.io
# virtualservices (Istio)
```

### Permissions

Standard users in CKS can:
- ✅ Create/modify Services in their namespace
- ✅ Use LoadBalancer service type
- ✅ Add DNS annotations
- ❌ Create Gateway resources
- ❌ Create HTTPRoute resources
- ❌ Create VirtualService resources (Istio)
- ❌ List cluster-wide resources
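
These permissions can be confirmed for your own account with `kubectl auth can-i`. The following is a minimal sketch; the function name `check_ingress_permissions` is hypothetical, and the resource list simply mirrors the items above.

```bash
# Print whether the current user may create each ingress-related resource
# in the given namespace.
# Usage: check_ingress_permissions <namespace>
# (hypothetical helper; `kubectl auth can-i` prints "yes" or "no")
check_ingress_permissions() {
  local ns="$1" resource answer
  for resource in services \
      gateways.gateway.networking.k8s.io \
      httproutes.gateway.networking.k8s.io \
      virtualservices.networking.istio.io; do
    # can-i exits non-zero on "no", so do not let that abort the loop
    answer="$(kubectl auth can-i create "$resource" -n "$ns" 2>/dev/null)" || true
    printf '%-45s %s\n' "$resource" "${answer:-unknown}"
  done
}
```

Running `check_ingress_permissions fuddin-dev` prints one line per resource, which makes it easy to see whether Method 3 is available before asking a cluster admin.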

## Troubleshooting

### LoadBalancer stuck in "Pending"

```bash
kubectl describe svc gpu-grafana -n fuddin-dev

# Check events for errors
kubectl get events -n fuddin-dev --sort-by='.lastTimestamp' | grep gpu-grafana
```

Common causes:
- Quota limits on public IPs
- Invalid annotation format
- Namespace resource limits
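
A quick way to rule out the annotation as the cause is to read it back from the Service. This is a sketch under the assumption that the annotation key is exactly the one used earlier in this guide; the function names are hypothetical.

```bash
# Print the value of the CoreWeave load-balancer-type annotation (empty if unset).
# Usage: lb_type_annotation <namespace> <service>
lb_type_annotation() {
  local ns="$1" svc="$2"
  # Dots inside the annotation key must be escaped in kubectl's jsonpath
  kubectl get svc "$svc" -n "$ns" \
    -o jsonpath='{.metadata.annotations.service\.beta\.kubernetes\.io/coreweave-load-balancer-type}'
}

# Return 0 only if the Service is annotated for a public load balancer.
# Usage: is_public_lb <namespace> <service>
is_public_lb() {
  [ "$(lb_type_annotation "$1" "$2")" = "public" ]
}
```

For example, `is_public_lb fuddin-dev gpu-grafana || echo "missing or non-public load-balancer-type annotation"`.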

### DNS not resolving

```bash
# Check service status
kubectl get svc gpu-grafana -n fuddin-dev -o yaml

# Look for ExternalRecords condition
kubectl get svc gpu-grafana -n fuddin-dev -o jsonpath='{.status.conditions[?(@.type=="ExternalRecords")]}'
```

The DNS record creation may take 1-2 minutes after the external IP is assigned.
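
Since propagation takes a little while, a script can poll until the name resolves instead of failing on the first lookup. A minimal sketch, assuming a Linux host where `getent` is available (on other systems, substitute `nslookup` or `dig`); the function name `wait_for_dns` is hypothetical.

```bash
# Poll until a hostname resolves, or give up after a timeout.
# Usage: wait_for_dns <hostname> [timeout_seconds]
# (hypothetical helper; uses the system resolver via getent)
wait_for_dns() {
  local host="$1" timeout="${2:-180}" elapsed=0
  while [ "$elapsed" -lt "$timeout" ]; do
    if getent hosts "$host" >/dev/null 2>&1; then
      return 0
    fi
    sleep 10
    elapsed=$((elapsed + 10))
  done
  return 1
}
```

The hostname to pass is the one reported in the `ExternalRecords` condition shown above.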

### Port-forward connection refused

```bash
# Check if pod is running
kubectl get pods -n fuddin-dev -l app.kubernetes.io/name=grafana

# Check pod logs
kubectl logs -n fuddin-dev -l app.kubernetes.io/name=grafana --tail=50

# Test service internally
kubectl run -it --rm debug -n fuddin-dev --image=curlimages/curl --restart=Never -- \
  curl http://gpu-grafana.fuddin-dev.svc.cluster.local
```

## Cost Considerations

- **Public IPs**: CoreWeave charges for LoadBalancer public IPs
- **Bandwidth**: Egress traffic may have costs
- **Port-Forward**: No additional cost (uses cluster credentials)

For cost-effective access:
1. Use port-forward for personal/development access
2. Use LoadBalancer only for production services that need public access
3. Share one LoadBalancer across multiple services using path-based routing (requires Istio VirtualService with permissions)
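
When public access is no longer needed, the Service can be switched back to ClusterIP so that it stops holding a public IP. The sketch below is an assumption-laden illustration: the function name `unexpose_service` is hypothetical, and it presumes that reverting the type and removing both annotations is enough for CoreWeave to release the address and DNS record, which the original guide does not confirm.

```bash
# Revert a Service to ClusterIP and remove the CoreWeave exposure annotations.
# Usage: unexpose_service <namespace> <service>
# (hypothetical helper; a trailing "-" on a key tells kubectl to remove it)
unexpose_service() {
  local ns="$1" svc="$2"
  kubectl patch svc "$svc" -n "$ns" -p '{"spec":{"type":"ClusterIP"}}' &&
  kubectl annotate svc "$svc" -n "$ns" \
    service.beta.kubernetes.io/coreweave-load-balancer-type- \
    service.beta.kubernetes.io/external-hostname-
}
```

For example, `unexpose_service fuddin-dev gpu-grafana` returns Grafana to the port-forward-only setup.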

## Security Best Practices

### For LoadBalancer Services

1. **Enable authentication** in Grafana (already configured with admin password)
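2. **Use HTTPS**: Add a TLS certificate and terminate TLS in Grafana or in a proxy in front of it (the examples in this guide serve plain HTTP on port 80)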
3. **Restrict source IPs**: Use `loadBalancerSourceRanges`
4. **Monitor access logs**: Enable Grafana audit logging
5. **Use NetworkPolicies**: Restrict pod-to-pod communication

```yaml
spec:
  type: LoadBalancer
  loadBalancerSourceRanges:
    - "1.2.3.4/32"  # Your office IP
    - "5.6.7.0/24"  # Your VPN range
```
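
For a Service that already exists, the same restriction can be applied as a patch instead of editing the manifest. This is a minimal sketch; the function name `source_ranges_patch` is hypothetical, and whether the CoreWeave load balancer enforces `loadBalancerSourceRanges` is an assumption to verify on your cluster.

```bash
# Print a JSON merge patch that sets loadBalancerSourceRanges to the given CIDRs.
# Usage: source_ranges_patch <cidr> [<cidr> ...]
# (hypothetical helper; does no CIDR validation of its own)
source_ranges_patch() {
  local json="" cidr
  for cidr in "$@"; do
    # Append a comma only when the list already has an entry
    json="${json}${json:+,}\"${cidr}\""
  done
  printf '{"spec":{"loadBalancerSourceRanges":[%s]}}\n' "$json"
}
```

The output can be passed straight to kubectl, for example `kubectl patch svc gpu-grafana -n fuddin-dev --type merge -p "$(source_ranges_patch 1.2.3.4/32 5.6.7.0/24)"`.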

### For Port-Forward

- ✅ Automatically secured by Kubernetes RBAC
- ✅ Requires valid cluster credentials
- ✅ No public exposure
- ⚠️ Ensure your local machine is secured

## Next Steps

1. **Decide on access method**:
   - Public access → Use LoadBalancer with DNS
   - Personal access → Use port-forward

2. **If using LoadBalancer**:
   ```bash
   kubectl patch svc gpu-grafana -n fuddin-dev -p '{"spec":{"type":"LoadBalancer"}}'
   kubectl annotate svc gpu-grafana -n fuddin-dev \
     service.beta.kubernetes.io/coreweave-load-balancer-type="public" \
     service.beta.kubernetes.io/external-hostname="gpu-grafana-fuddin"
   ```

3. **Monitor the service**:
   ```bash
   kubectl get svc gpu-grafana -n fuddin-dev -w
   ```

4. **Access Grafana**:
   - LoadBalancer: Wait for DNS record, then access via `http://<assigned-dns>.coreweave.app`
   - Port-forward: `kubectl port-forward -n fuddin-dev svc/gpu-grafana 3000:80`

## References

- [Create a Public DNS Name | CoreWeave](https://docs.coreweave.com/docs/products/networking/how-to/expose-service-dns)
- [Introduction to CoreWeave Kubernetes Service | CoreWeave](https://docs.coreweave.com/docs/products/cks)
- [Kubernetes Ingress Documentation](https://kubernetes.io/docs/concepts/services-networking/ingress/)
- [Exposing Applications for External Access | Kube by Example](https://kubebyexample.com/learning-paths/application-development-kubernetes/lesson-3-networking-kubernetes/exposing-0)

## Summary

**CoreWeave uses LoadBalancer services with DNS annotations, not traditional Ingress controllers.**

For your Grafana deployment:
- **Quick access**: `kubectl port-forward -n fuddin-dev svc/gpu-grafana 3000:80`
- **Public access**: Convert service to LoadBalancer with DNS annotation
- **Advanced routing**: Request VirtualService permissions from cluster admin

The simplest production-ready approach is to use LoadBalancer with the `service.beta.kubernetes.io/external-hostname` annotation.
1 change: 1 addition & 0 deletions Cargo.lock

22 changes: 22 additions & 0 deletions DASHBOARD.md
@@ -96,6 +96,7 @@ A Grafana dashboard is included in `gpu-dashboard.json` for more detailed GPU mo
- **Idle GPU Workloads**: GPUs with zero compute activity for 30+ minutes
- **Idle GPU Time by Deployment**: Deployments producing the most allocated GPU idle time (see [Prometheus Queries](#prometheus-queries) below)
- **GPU Allocation Leaderboard**: Total GPU requests per namespace
- **GPU Health & DCGM**: Temperature, power, VRAM %, memory-copy util, XID errors, and optional DCGM profiling metrics

### Importing the Grafana Dashboard

@@ -139,6 +140,27 @@ The overview row uses **two independent partitions** of the same total. Each pai

Equivalently: **Engine active** = Total − Engine idle, and **VRAM free** = Total − VRAM allocated, when the same DCGM time series are counted.

### GPU Health & DCGM

Panels in the **GPU Health & DCGM** row use additional dcgm-exporter metrics. Profiling panels show no data unless your exporter exposes `DCGM_FI_PROF_*` metrics (same requirement as `DCGM_FI_PROF_GR_ENGINE_ACTIVE`).

| Panel | PromQL |
|-------|--------|
| Peak GPU temperature | `max(DCGM_FI_DEV_GPU_TEMP)` |
| Peak power (W) | `max(DCGM_FI_DEV_POWER_USAGE)` |
| XID errors (total) | `sum(DCGM_FI_DEV_XID_ERRORS)` |
| GPU temperature by node | `avg by (Hostname) (DCGM_FI_DEV_GPU_TEMP)` |
| Power draw by node | `sum by (Hostname) (DCGM_FI_DEV_POWER_USAGE)` |
| VRAM utilization % | `100 * avg by (Hostname, gpu) (DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_TOTAL)` |
| Memory copy utilization | `avg by (Hostname) (DCGM_FI_DEV_MEM_COPY_UTIL)` |
| Graphics/compute engine active by node | `avg by (Hostname) (DCGM_FI_PROF_GR_ENGINE_ACTIVE)` |
| XID errors (1h increase) | `sum by (Hostname, gpu) (increase(DCGM_FI_DEV_XID_ERRORS[1h]))` |
| SM active by node | `avg by (Hostname) (DCGM_FI_PROF_SM_ACTIVE)` |
| Tensor pipe active by node | `avg by (Hostname) (DCGM_FI_PROF_PIPE_TENSOR_ACTIVE)` |
| DRAM active by node | `avg by (Hostname) (DCGM_FI_PROF_DRAM_ACTIVE)` |

Note: gpu-pruner idle detection uses [`query.promql.j2`](gpu-pruner/src/query.promql.j2) at runtime; Grafana idle panels use related but simpler PromQL for visualization.
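
Any expression from the table can be checked outside Grafana against the Prometheus HTTP API, which helps when a panel shows no data. A minimal sketch, not part of the dashboard itself: the function name `prom_query` is hypothetical, and the Prometheus base URL is an assumption you must supply for your environment.

```bash
# Run an instant PromQL query against a Prometheus server and print the JSON reply.
# Usage: prom_query <prometheus_base_url> <promql_expression>
# (hypothetical helper; --get with --data-urlencode sends the query URL-encoded)
prom_query() {
  local prom_url="$1" expr="$2"
  curl --silent --fail --get "${prom_url}/api/v1/query" \
    --data-urlencode "query=${expr}"
}
```

For example, `prom_query "$PROM_URL" 'max(DCGM_FI_DEV_GPU_TEMP)'`, where `PROM_URL` is a placeholder for your Prometheus address. An empty `result` array for a `DCGM_FI_PROF_*` expression suggests the exporter is not exposing profiling metrics.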

### Idle GPU Time by Deployment Query

This query identifies which Kubernetes Deployments are producing the most allocated GPU idle time while GPU utilization is at 0%.