Read-only multi-cluster SRE agent in your terminal. Ask plain-language
questions about Kubernetes / JVM / Python / GPU workloads across every
cluster you have credentials for, and get answers stitched together
from kubectl, Prometheus, Loki, jcmd, py-spy, nvidia-smi, perf, eBPF,
and friends, without typing any of them.
cloudy never mutates infrastructure. Every call is GET / LIST /
WATCH, enforced by three independent layers plus two hardening guards.

    > /setup    discover clusters & backends
    ? /help     keyboard shortcuts
    > or just ask a question

You type:
Why did checkout-service p99 spike around 2am yesterday?
cloudy plans the investigation, runs the relevant read-only probes (metrics, logs, traces, profiles), and explains what it found. The agent picks tools from a typed registry (Kubernetes, Prometheus, Loki / ES, Tempo / Jaeger, pprof, async-profiler, py-spy, NVIDIA SMI, perf, eBPF) based on the question, not on a fixed script.
One-liner (macOS, Linux; amd64 + arm64):

    curl -fsSL https://raw.githubusercontent.com/rlaope/cloudy/master/install.sh | sh

Drops the latest GitHub release into ~/.local/bin/cloudy, sets the
executable bit, and prints a PATH-setup hint if needed. Once the
installer finishes, the binary is reachable as plain cloudy from
any directory (no ./ prefix: it lives on $PATH, not in your
working directory). Re-run the same one-liner anytime to upgrade,
or use cloudy update from inside the TUI; the installer always
pulls whatever GitHub marks as latest.
Override the install location with CLOUDY_INSTALL_DIR:

    curl -fsSL https://raw.githubusercontent.com/rlaope/cloudy/master/install.sh \
      | CLOUDY_INSTALL_DIR=/usr/local/bin sh

Build from source (Windows, contributors, anything off the release matrix):

    git clone https://github.com/rlaope/cloudy.git
    cd cloudy
    make build                      # produces ./cloudy in the repo root
    ./cloudy --version              # quick smoke test from the build dir
    sudo mv cloudy /usr/local/bin/  # or move it onto your PATH any other way
    cloudy --version                # now reachable as a bare command

Either install path leaves the binary reachable as plain cloudy
from any directory once it is on your PATH.

    cloudy

The TUI opens. Two commands get you to the first question:
- /setup: scans your kubeconfig contexts, auto-discovers Prometheus / Loki / Elasticsearch / Tempo / Jaeger / Postgres / MySQL / Redis / pprof / V8 inspector endpoints, lets you pick which to enable inline, then writes ~/.cloudy/config.yaml plus a profile.yaml snapshot of the scan. No restart.
- /login: picks an LLM provider (Anthropic / OpenAI / Google / Moonshot) with arrow keys and saves the API key to ~/.cloudy/secrets (mode 0600). The chosen model is active immediately; /model <id> swaps mid-session.
Then ask:
> Why does the payments-api pod keep getting OOMKilled?
Headless / CI usage:

    cloudy ask "Why is the checkout service slow right now?"  # one-shot
    cloudy setup --auto                                       # non-interactive setup
    cloudy setup --auto --dry-run                             # scan contexts and synthesize config without writing files
    cloudy profile use payments-sre                           # activate a permission profile
    cloudy profile cluster                                    # show RBAC for current context

Three independent enforcement layers plus boot-time and runtime hardening. Defense in depth, not a single chokepoint.
- HTTP RoundTripper rejects every method other than GET / HEAD / OPTIONS before the request reaches the network. The K8s client honours this too: rest.Config.WrapTransport is set to the same wrapper, so apiserver calls share the HTTP whitelist end-to-end.
- Bundled ClusterRole (manifests/rbac/) only grants get / list / watch (plus the two narrow bastion verbs below) at the RBAC layer, so the cluster itself refuses anything else even if a guard in cloudy were bypassed.
- Bastion reachability verbs (services/proxy: get, pods/portforward: create) are the minimum required to reach HTTP and TCP backends through the apiserver and do not widen the mutation surface.
On top of those layers cloudy adds two hardening guards:
- The tools.Registry mutator-name assertion panics at boot if any registered tool name looks like a write (create_*, delete_*, patch_*, ...). Mutating tools (exec, delete, patch, write-mode port-forward) are never registered, so the LLM never sees them in its tool catalogue and cannot ask for them.
- A risk-rated approval gate sits in front of tools that are read-only but expensive enough to perturb the system they're observing: STW JVM pauses, attached eBPF probes, long profiling windows. The TUI surfaces a y/N banner; headless entry points refuse them with a clear message. See docs/SAFETY.md.
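
As a rough illustration of the two guards above, the following Go sketch shows a registry that panics on write-looking names and a gate that refuses high-risk tools when no approver is present. All identifiers (`Tool`, `Risk`, `MustRegister`, `Authorize`, `ErrNeedsApproval`) are hypothetical, and the prefix list only contains the three patterns named in the text; the real list and types live in cloudy's source and in docs/SAFETY.md.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Risk is a hypothetical two-level rating; cloudy's real levels may differ.
type Risk int

const (
	RiskLow Risk = iota
	RiskHigh
)

// Tool is a simplified stand-in for a registered probe.
type Tool struct {
	Name string
	Risk Risk
}

// mutatorPrefixes holds only the prefixes named in the README; the real
// assertion checks more patterns.
var mutatorPrefixes = []string{"create_", "delete_", "patch_"}

// looksLikeMutator reports whether a tool name starts with a write-style prefix.
func looksLikeMutator(name string) bool {
	for _, p := range mutatorPrefixes {
		if strings.HasPrefix(name, p) {
			return true
		}
	}
	return false
}

// Registry holds tools by name.
type Registry struct{ tools map[string]Tool }

// MustRegister panics on a write-looking name, so a bad tool fails the process
// at boot instead of ever reaching the LLM's tool catalogue.
func (r *Registry) MustRegister(t Tool) {
	if looksLikeMutator(t.Name) {
		panic(fmt.Sprintf("refusing to register mutator-looking tool %q", t.Name))
	}
	if r.tools == nil {
		r.tools = map[string]Tool{}
	}
	r.tools[t.Name] = t
}

// ErrNeedsApproval is returned when a high-risk tool runs without consent.
var ErrNeedsApproval = errors.New("high-risk tool requires interactive approval")

// Authorize lets low-risk tools through; high-risk tools need an approver.
// A nil approver models a headless entry point, which always refuses.
func Authorize(t Tool, approve func(Tool) bool) error {
	if t.Risk < RiskHigh {
		return nil
	}
	if approve == nil || !approve(t) {
		return ErrNeedsApproval
	}
	return nil
}

func main() {
	var reg Registry
	reg.MustRegister(Tool{Name: "list_pods", Risk: RiskLow})
	reg.MustRegister(Tool{Name: "biolatency", Risk: RiskHigh})

	fmt.Println("list_pods headless:", Authorize(reg.tools["list_pods"], nil))
	fmt.Println("biolatency headless:", Authorize(reg.tools["biolatency"], nil))
}
```

In an interactive session the `approve` callback would be where the y/N banner is shown; passing `nil` is one simple way to model a headless caller.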
| Domain | What it talks to |
|---|---|
| Kubernetes | apiserver (get / list / watch only) |
| Docker | daemon API (container list / inspect / stats / logs) |
| Metrics | Prometheus, Thanos, VictoriaMetrics |
| Logs | Loki, Elasticsearch, OpenSearch |
| Traces | Tempo, Jaeger |
| Change | k8s rollouts / images / scale, Docker containers, Argo CD sync |
| Correlation | cross-signal change-to-symptom evidence timeline |
| JVM | jcmd, async-profiler (heap / cpu / alloc) |
| Python | py-spy (sampling / dump-stacks) |
| Ruby | rbspy (sampling), registered as perf.rbspy_dump |
| GPU | NVIDIA SMI, DCGM |
| Kernel | perf, eBPF (read-only probes only) |
| Databases | Postgres / MySQL / Redis (read-only query subset) |
HTTP backends are reached via the K8s apiserver's services/proxy,
TCP backends via in-process SPDY port-forward. A single
kubectl-reachable cluster is enough: no VPN, no per-service
ingress.
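
To make the services/proxy route concrete, here is a minimal Go sketch that builds the apiserver path Kubernetes uses to proxy HTTP to a Service. The path layout is the standard Kubernetes one; the function name `serviceProxyPath` and the namespace, service, and port values are made up for illustration and say nothing about how cloudy assembles its requests internally.

```go
package main

import (
	"fmt"
	"strings"
)

// serviceProxyPath builds the apiserver path that proxies an HTTP request to a
// Service. scheme may be empty (the apiserver then uses plain http); port is a
// port name or number as declared on the Service.
func serviceProxyPath(namespace, scheme, service, port, backendPath string) string {
	target := service + ":" + port
	if scheme != "" {
		target = scheme + ":" + target
	}
	return fmt.Sprintf("/api/v1/namespaces/%s/services/%s/proxy/%s",
		namespace, target, strings.TrimPrefix(backendPath, "/"))
}

func main() {
	// Hypothetical Prometheus service in a hypothetical "monitoring" namespace.
	fmt.Println(serviceProxyPath("monitoring", "http", "prometheus-server", "9090", "/api/v1/query"))
}
```

A path like this can be tried by hand with `kubectl get --raw <path>`, which issues a GET through the same apiserver proxy.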
Every probe the agent can call is a typed tool with a JSON schema.
Tools self-register at boot, and several groups are conditional on the
active profile, configured backends, host binaries, and reachable local
services. Type /tools in the TUI or run cloudy tools --json to see
the code-derived registry for your environment, including skipped groups
and reasons.
The built-in skill reference guard tracks the tool names that skills may
invoke. The runtime registry can be smaller or larger depending on configured
backends: for example, perf.rbspy_dump is always present, while Go pprof,
V8 inspector, Linux perf, cloud, Docker, queue, log, trace, and database
tools register only when their prerequisites exist.
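
The registration behaviour described above (typed tools, conditional groups, recorded skip reasons) could look roughly like the following Go sketch. Every type and function here (`ToolSpec`, `Group`, `Registry`, `hostBinary`) is hypothetical, and the JSON schema attached to `linux_perf_record` is invented for the example; only the tool names and the "gated on host perf binary" condition come from this README.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os/exec"
)

// ToolSpec is a hypothetical typed tool: a name plus a JSON schema for its arguments.
type ToolSpec struct {
	Name   string          `json:"name"`
	Schema json.RawMessage `json:"schema"`
}

// Group bundles tools behind one prerequisite check. Available returns "" when
// the group can register, or a human-readable reason when it must be skipped.
type Group struct {
	Prefix    string // tool-name prefix, e.g. "perf"
	Label     string // key used when reporting a skipped group
	Tools     []ToolSpec
	Available func() string
}

// Registry records what registered and why anything was skipped.
type Registry struct {
	Tools   map[string]ToolSpec `json:"tools"`
	Skipped map[string]string   `json:"skipped"`
}

func NewRegistry() *Registry {
	return &Registry{Tools: map[string]ToolSpec{}, Skipped: map[string]string{}}
}

// Register adds each available group's tools under "<prefix>.<name>" and
// records a reason for every group whose prerequisite is missing.
func (r *Registry) Register(groups ...Group) {
	for _, g := range groups {
		if reason := g.Available(); reason != "" {
			r.Skipped[g.Label] = reason
			continue
		}
		for _, t := range g.Tools {
			t.Name = g.Prefix + "." + t.Name
			r.Tools[t.Name] = t
		}
	}
}

// always is the prerequisite check for unconditional groups.
func always() string { return "" }

// hostBinary returns a prerequisite check for an executable on $PATH.
func hostBinary(name string) func() string {
	return func() string {
		if _, err := exec.LookPath(name); err != nil {
			return "host binary " + name + " not found on PATH"
		}
		return ""
	}
}

func main() {
	// Invented schema, for illustration only.
	schema := json.RawMessage(`{"type":"object","properties":{"pid":{"type":"integer"}},"required":["pid"]}`)
	reg := NewRegistry()
	reg.Register(
		Group{Prefix: "perf", Label: "rbspy", Tools: []ToolSpec{{Name: "rbspy_dump", Schema: schema}}, Available: always},
		Group{Prefix: "perf", Label: "linux perf", Tools: []ToolSpec{{Name: "linux_perf_record", Schema: schema}}, Available: hostBinary("perf")},
	)
	out, err := json.MarshalIndent(reg, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

Keeping the skip reason next to the group is what lets an inventory command explain why a tool is missing instead of silently omitting it.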
| Group (count) | Tools |
|---|---|
| k8s (20) | list_pods, list_nodes, list_namespaces, describe_pod, events, logs, top_pods, top_nodes, list_deployments, list_statefulsets, list_daemonsets, list_jobs, list_cronjobs, list_services, list_ingresses, list_hpa, list_pdbs, list_networkpolicies, list_crds, list_cr (CRD-generic dynamic-client reader; unlocks Argo Rollouts, KEDA, cert-manager, Gateway API, Sloth SLOs, ServiceMonitor, etc. in one tool) |
| prom (6) | query, query_range, label_values, series, anomaly, error_budget |
| log (8) | loki_query_range, loki_labels, loki_label_values, loki_series, es_search, es_indices, es_cluster_health, container (Docker container logs; registers when docker_hosts is configured) |
| trace (7) | tempo_get_trace, tempo_search, service_graph (Tempo metrics-generator service-graph edges), route_red (Tempo metrics-generator per-route RED), jaeger_services, jaeger_operations, jaeger_search_traces |
| alert (3) | list_active, list_silences (Alertmanager v2), list_rules (Prometheus rules API) |
| gitops (3) | argo_list_apps, argo_app_status, argo_app_history (Argo CD v1 API) |
| cloud (25) | CloudWatch metrics: aws_cw_list_metrics, aws_cw_get_metric_statistics. CloudWatch Logs: aws_logs_describe_groups, aws_logs_filter_events, aws_logs_insights_query (Logs Insights, start+poll). X-Ray traces: aws_xray_trace_summaries, aws_xray_batch_get_traces, aws_xray_service_graph. AWS inventory: aws_rds_describe_instances, aws_lambda_list_functions, aws_eks_list_clusters. AWS queues: aws_sqs_queue_depth. Azure Monitor: azure_monitor_metric_definitions, azure_monitor_metrics. Azure Log Analytics: azure_log_analytics_query (KQL). Azure App Insights: azure_appinsights_query (KQL traces/requests/dependencies). Azure inventory: azure_sql_server_list, azure_functionapp_list, azure_aks_list. GCP Cloud Logging: gcp_logging_read (gcloud logging read). GCP inventory: gcp_sql_instances_list, gcp_run_services_list, gcp_container_clusters_list. FinOps/cost: aws_ce_cost_and_usage (Cost Explorer), azure_consumption_usage (consumption usage detail). Read-only via the operator's aws/az/gcloud CLIs; no stored secrets. Registers when a cloud_aws: / cloud_gcp: / cloud_azure: block is configured. GCP metric/trace/cost read is deferred (see docs/RFC-CLOUD-OBSERVABILITY.md) |
| change (1) | recent (orchestrator-agnostic deploy / image / scale / rollout timeline across Kubernetes and Docker, plus cloud control-plane audit events: AWS CloudTrail, GCP Cloud Audit Logs, Azure Activity Log; registers when k8s, docker_hosts, or a cloud provider is available) |
| metric (1) | container_stats (read-only Docker container CPU / mem / net / block-IO; k8s metrics live in prom + k8s.top_*; needs docker_hosts) |
| correlate (1) | workload (cross-signal evidence timeline: change history + metric / log / trace symptoms, with a candidate-cause that aligns the earliest symptom to the change before it; folds in Argo CD sync, cloud control-plane audit changes, and AWS X-Ray cloud-trace symptoms) |
| db (18) | Postgres: pg_version, pg_stat_activity, pg_stat_database, pg_stat_replication, pg_locks, pg_top_table_size. MySQL: mysql_version, mysql_processlist, mysql_global_status, mysql_global_variables, mysql_engine_innodb_status, mysql_top_table_size. Redis: redis_info, redis_dbsize, redis_scan, redis_inspect_key, redis_slowlog, redis_client_list |
| perf (9) | rbspy_dump (Ruby, always-on), go_pprof_cpu, go_pprof_goroutine, go_pprof_heap, go_pprof_allocs, go_pprof_threadcreate (Go pprof; gated on pprof endpoints), v8_inspector_targets, v8_inspector_cpu_profile (Node.js V8; gated on node_inspectors), linux_perf_record (gated on host perf binary) |
| jvm (4) | jstat_gc, jcmd_gc, jcmd_thread_dump, async_profile |
| py (2) | spy_dump, spy_top_snapshot |
| gpu (2) | nvidia_smi, dcgm_metrics |
| ebpf (5) | biolatency, tcptop, tcprtt, execsnoop, bpftrace_oneliner (all RiskHigh; gated by the approval banner) |
| queue (2) | rabbitmq_queues, kafka_consumer_lag |
| oncall (2) | list_incidents, who_is_oncall |
| memory (1) | record (local-only durable fact recording, not a cluster mutation) |
| synthetic (1) | http_check |
No ruby group. rbspy is registered as perf.rbspy_dump. If you are looking for Ruby profiling in /tools, search perf.
Skills are curated multi-step playbooks the agent picks when a
question matches their triggers. They live in
internal/core/skills/builtin/,
embedded into the binary via //go:embed; you can override or add by
dropping a .md file into ~/.cloudy/skills/; user files win on name
conflicts.
Run cloudy skills --json for the code-derived skill inventory.
| Skill | When it fires |
|---|---|
| triage-orchestrator | "Just got paged, where do I start?" Breadth-first scan, ranks a hypothesis, hands off to the deep skill. |
| cluster-recon | "What's running in my cluster right now?" topology dump. |
| incident-context | "What's burning right now?" Cross-references firing alerts with recent Argo CD syncs and pod restarts. |
| deploy-regression | "Did the last deploy break it?" Aligns Argo sync timestamps with error/latency onset and names the revision to roll back to. |
| k8s-incident | First-pass triage for CrashLoopBackOff / Pending / OOMKilled / Eviction. |
| crashloop-deep-dive | Beyond exit codes: previous-container logs, probe audit, init-container ordering, traces. |
| oom-killed-triage | Container-limit vs. node-level OOM, sawtooth-vs-plateau working-set pattern, JVM heap flag check. |
| capacity-scheduling | Why pods stay Pending: capacity vs. taints/affinity vs. stuck autoscaler / HPA-maxed / PDB block. |
| network-connectivity | Why workload A can't reach B: walks DNS → Service/endpoints → NetworkPolicy → Ingress / mesh sidecar. |
| slo-burn | SLO error-budget burn: multi-window multi-burn-rate, time-to-exhaustion, page-now vs. ticket. |
| log-spike-correlation | Joins a Loki / ES error spike to Prom anomalies and pod events. |
| trace-error-pivot | Walk a p99 / error-rate regression down to the slow span in Tempo or Jaeger and back to the pod. |
| db-latency-hunt | PostgreSQL / MySQL / Redis read-only forensics for slow upstream DB calls. |
| prom-explorer | Interactive PromQL composition without prior knowledge of the metric schema. |
| go-runtime | Go runtime: goroutine leaks, GC pacing (GOGC), scheduler latency, pprof CPU hot paths. |
| node-runtime | Node.js / V8: event-loop lag, scavenge vs. mark-sweep GC, TurboFan deopt, V8 Inspector CPU profile. |
| jvm-gc | GC pause / heap-exhaustion / old-gen growth diagnosis. |
| jvm-thread | Deadlock, blocked threads, pool exhaustion. |
| py-perf | GIL contention, async-loop stalls, CPU bottlenecks. |
| ruby-runtime | Ruby / Rails: GVL contention, generational GC pressure, YJIT, rbspy stack sampling. |
| dotnet-runtime | .NET / CLR: gen0/1/2 + LOH GC, Server-vs-Workstation mode, ThreadPool starvation, tiered JIT. |
| native-perf | C / C++ / Rust: Linux perf hot paths, cache misses, branch mispredict, lock contention, missed codegen. |
| gpu-saturation | GPU OOM, low utilization, thermal throttling. |
| ai-inference | LLM/ML serving: TTFT / inter-token latency, throughput, KV-cache & batch saturation, GPU util (vLLM / Triton / TGI / TorchServe). |
Bring your own key. Picked at /login, swappable mid-session with
/model <id>.
| Provider | Env var | Model prefix |
|---|---|---|
| Anthropic | ANTHROPIC_API_KEY | claude-* |
| OpenAI | OPENAI_API_KEY | gpt-*, o1-* |
| Google Gemini | GOOGLE_API_KEY | gemini-* |
| Moonshot / Kimi | MOONSHOT_API_KEY | kimi-* |
| OpenAI-compatible | OPENAI_BASE_URL | any |
OpenAI-compatible covers Ollama, vLLM, LM Studio, OpenRouter, and any
in-network gateway that speaks the same wire format. LLM adapters
honor HTTP_PROXY / HTTPS_PROXY for corporate egress.
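
One way to read the provider table is as a prefix lookup with an OpenAI-compatible fallback. The Go sketch below shows that reading; it is an assumption, not cloudy's documented routing. In particular, the precedence (known prefixes first, then OPENAI_BASE_URL for anything else) and the names `prefixRoutes` and `resolveProvider` are illustrative, and the model ids in `main` are placeholders.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// prefixRoutes maps model-id prefixes to provider names, following the table above.
var prefixRoutes = []struct{ prefix, provider string }{
	{"claude-", "Anthropic"},
	{"gpt-", "OpenAI"},
	{"o1-", "OpenAI"},
	{"gemini-", "Google Gemini"},
	{"kimi-", "Moonshot / Kimi"},
}

// resolveProvider picks a provider for a model id. Known prefixes win; any
// other id falls back to an OpenAI-compatible endpoint when OPENAI_BASE_URL is
// set. This precedence is an assumption made for the sketch.
func resolveProvider(model string, getenv func(string) string) (string, error) {
	for _, r := range prefixRoutes {
		if strings.HasPrefix(model, r.prefix) {
			return r.provider, nil
		}
	}
	if getenv("OPENAI_BASE_URL") != "" {
		return "OpenAI-compatible", nil
	}
	return "", fmt.Errorf("no provider for model %q (set OPENAI_BASE_URL for OpenAI-compatible endpoints)", model)
}

func main() {
	// Placeholder model ids; real ids depend on the provider.
	for _, model := range []string{"claude-example", "gemini-example", "local-llm"} {
		provider, err := resolveProvider(model, os.Getenv)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s -> %s\n", model, provider)
	}
}
```

Passing `getenv` as a parameter keeps the lookup testable without touching the process environment.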
cloudy resolves its state directory in this order: $CLOUDY_HOME →
$XDG_CONFIG_HOME/cloudy → $HOME/.cloudy. Layout:
| Path | What |
|---|---|
| config.yaml | Clusters, backends, model, safety limits. Generated by /setup; hand-editing supported. |
| profile.yaml | Snapshot of the last /setup scan (discovered endpoints + selection state). |
| secrets | Dotenv-format API keys (mode 0600). Written by /login. |
| profiles/<name>.yaml | Permission profile bundles: tool/namespace allow-deny rules and field masking (passwords, tokens). |
| active_profile | Pointer to the currently selected permission profile (managed by cloudy profile use). |
See docs/PERMISSION_PROFILES.md for the permission-profile schema.
- docs/SAFETY.md: read-only guards, risk-rated approval gate, threat model
- docs/AUTO_DISCOVERY.md: what /setup probes, where, and how findings map to config
- docs/BASTION.md: deploying cloudy on a shared bastion (per-user state, systemd, proxy)
- docs/PERMISSION_PROFILES.md: profile schema, masking rules, per-session limits
- CHANGELOG.md: release notes
Pre-1.0. Build from source. Public API and config schema may shift between minor versions; pin a tag if that matters for you.
MIT.
