Merged
21 commits
6c2e107
refactor(findings): rename dashboard-sprawl class/category to broken-…
cicdteam May 22, 2026
ed524eb
feat(findings): add Finding.Dashboard field
cicdteam May 22, 2026
a9bd3f8
feat(ignore): add Dashboard pattern + --ignore-dashboard flag
cicdteam May 22, 2026
a2680ec
feat(grafana): add Dashboard.PanelTargets and Client.BaseURL
cicdteam May 22, 2026
0b403a8
feat(analyzers): add dashboardhygiene skeleton
cicdteam May 22, 2026
e5023fa
feat(analyzers): dashboardhygiene happy-path detection
cicdteam May 22, 2026
aef8e15
refactor(dashboardhygiene): deterministic sort + named const + param …
cicdteam May 22, 2026
25b0934
test(dashboardhygiene): pin grouping by (dashboard, missing-metric)
cicdteam May 22, 2026
eb26f17
feat(dashboardhygiene): recording-rule outputs count as existing
cicdteam May 22, 2026
86022d7
refactor(dashboardhygiene): enrich VM-flavor comments + strengthen RR…
cicdteam May 22, 2026
84733a9
fix(promqlx): filter sentinel substrings, not just exact matches
cicdteam May 22, 2026
26cc1a8
test(dashboardhygiene): pin silent skips for template-variable exprs …
cicdteam May 22, 2026
f343081
test(dashboardhygiene): pin error surface (per-dashboard, search, VM)
cicdteam May 22, 2026
168e332
feat(dashboardhygiene): fix-snippet builder
cicdteam May 22, 2026
9fa2879
refactor(dashboardhygiene): DRY docs URL + idiomatic range loop + min…
cicdteam May 22, 2026
ac9701f
feat(cli): add 'remetric dashboards broken' subcommand
cicdteam May 22, 2026
2f83eab
refactor(cli): trust analyzer sort order + move brokenPanelCopy to em…
cicdteam May 22, 2026
6d16642
feat(cli): wire dashboardhygiene into scan + report runners
cicdteam May 22, 2026
8419fd2
docs(findings): replace dashboard-sprawl placeholder with broken-pane…
cicdteam May 23, 2026
247b843
test(e2e): broken-panel scenario against demo stack
cicdteam May 23, 2026
dc6b7c7
docs: surface dashboardhygiene + --ignore-dashboard in README + help …
cicdteam May 23, 2026
23 changes: 10 additions & 13 deletions README.md
@@ -6,9 +6,10 @@ Re-metric your stack - find waste in Prometheus, Grafana & Loki.
Prometheus server and it prints a ranked, actionable list of cardinality
problems with suggested `metric_relabel_configs` fixes.

-> Status: **alpha** - cardinality, label-pattern, unused-metric, and
-> alert-hygiene analyzers are wired up. JSON output, Grafana integration,
-> unified `remetric scan`, and HTML/Markdown reports shipped.
+> Status: **alpha** - cardinality, label-pattern, unused-metric,
+> alert-hygiene, and dashboard-hygiene (broken-panel) analyzers are
+> wired up. JSON output, Grafana integration, unified `remetric scan`,
+> and HTML/Markdown reports shipped.

![remetric demo](demo/remetric.gif)

@@ -248,6 +249,7 @@ flag is ignored by `report` - use `--format` instead.
| `remetric metrics unused` | Ingested ∖ used metrics (needs Grafana for dashboard coverage)|
| `remetric alerts unused` | Alerts that never fired in the lookback window |
| `remetric alerts always-firing` | Alerts firing >=95% of the lookback window |
+| `remetric dashboards broken` | Flag dashboards whose panels reference missing metrics |
| `remetric report` | Run every analyzer, render terminal/json/html/markdown |
| `remetric scan` | Run every available analyzer, emit a unified Report |

@@ -276,12 +278,6 @@ Global flags (subset; see `--help` for the full list):
Full reference at [remetric.dev](https://remetric.dev/) - one page per finding
class with detection rules, fix snippets, and false-positive notes.

-## What's still missing in v0.1
-
-- No dashboard sprawl analyzer.
-
-This lands in a subsequent release.
-
## CI integration

Pair any analyzer command with `--fail-on=critical` to fail the build when a
@@ -308,13 +304,14 @@ Suppress findings that are known noise or out of scope. Patterns are
**anchored full-match** regexes: `foo_.*` matches `foo_bar` but not
`xfoo_bar`. Empty / whitespace-only patterns are silently ignored.

-Three target fields, each with its own flag (repeatable):
+Four target fields, each with its own flag (repeatable):

| Flag | Drops findings whose ... |
|------|--------------------------|
-| `--ignore-metric REGEX` | metric name matches |
-| `--ignore-label REGEX` | evidence label matches |
-| `--ignore-alert REGEX` | alert name matches |
+| `--ignore-metric REGEX` | metric name matches |
+| `--ignore-label REGEX` | evidence label matches |
+| `--ignore-alert REGEX` | alert name matches |
+| `--ignore-dashboard REGEX` | dashboard title matches |

```bash
# Repeatable flag
97 changes: 97 additions & 0 deletions docs/findings/broken-panel.md
@@ -0,0 +1,97 @@
# Broken panel

A Grafana dashboard panel queries a Prometheus metric that does not exist.

**Class:** `broken-panel`
**Severity:** Medium
**Category:** `dashboard_hygiene`
**Detected by:** `remetric dashboards broken`, `remetric scan`

## What it means

The panel's PromQL target references a metric name that is not in
Prometheus head series and is not declared as the output of any
recording rule. The panel will render as empty (no data) or as a
flat line, and the user will see a silent gap rather than the
expected signal.

## Why it matters

Empty dashboard tiles are dangerous because they can look the same as
a healthy system. An on-call engineer scanning a dashboard
for problems can miss a real issue because the relevant tile shows
"no data" instead of the metric they expect. A scan that surfaces
every (dashboard, missing-metric) pair lets you either restore the
metric or remove the dead query.

## How remetric detects it

1. Build the set of "known metrics":
   - Every metric name returned by `/api/v1/label/__name__/values`.
   - Every recording-rule output name returned by `/api/v1/rules`
     (a freshly-added recording rule counts as known even before
     it emits its first sample).
2. For each Grafana dashboard, parse every panel's Prometheus
   target with the PromQL parser and extract referenced metric
   names.
3. Any extracted name not in the known set is "missing". Findings
   are grouped by `(dashboard, missing-metric)` so 50 panels in one
   dashboard referencing the same removed metric collapse into a
   single finding listing the affected panels.
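
The steps above can be sketched in Go roughly as follows. This is a minimal illustration and not remetric's actual implementation: the `PanelRef` and `GroupKey` types and the `KnownMetrics` and `GroupMissing` functions are hypothetical names, and step 2 (PromQL parsing) is assumed to have already produced `(dashboard, panel, metric)` references. Only the response shapes of the two Prometheus endpoints are taken from the Prometheus HTTP API.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// labelValuesResp mirrors the /api/v1/label/__name__/values payload.
type labelValuesResp struct {
	Data []string `json:"data"`
}

// rulesResp mirrors the parts of the /api/v1/rules payload used here.
type rulesResp struct {
	Data struct {
		Groups []struct {
			Rules []struct {
				Type string `json:"type"` // "recording" or "alerting"
				Name string `json:"name"`
			} `json:"rules"`
		} `json:"groups"`
	} `json:"data"`
}

// PanelRef is a hypothetical, pre-extracted (dashboard, panel, metric) reference.
type PanelRef struct {
	Dashboard, Panel, Metric string
}

// GroupKey identifies one finding: a (dashboard, missing-metric) pair.
type GroupKey struct {
	Dashboard, Metric string
}

// KnownMetrics unions ingested metric names with recording-rule output names.
func KnownMetrics(labelValuesJSON, rulesJSON []byte) (map[string]bool, error) {
	var lv labelValuesResp
	if err := json.Unmarshal(labelValuesJSON, &lv); err != nil {
		return nil, fmt.Errorf("decode label values: %w", err)
	}
	var rr rulesResp
	if err := json.Unmarshal(rulesJSON, &rr); err != nil {
		return nil, fmt.Errorf("decode rules: %w", err)
	}
	known := make(map[string]bool, len(lv.Data))
	for _, name := range lv.Data {
		known[name] = true
	}
	for _, g := range rr.Data.Groups {
		for _, r := range g.Rules {
			if r.Type == "recording" { // alerting rules do not emit a metric by that name
				known[r.Name] = true
			}
		}
	}
	return known, nil
}

// GroupMissing collapses references to unknown metrics into one entry per
// (dashboard, metric) pair, listing affected panel titles in sorted order.
func GroupMissing(refs []PanelRef, known map[string]bool) map[GroupKey][]string {
	out := make(map[GroupKey][]string)
	for _, ref := range refs {
		if known[ref.Metric] {
			continue
		}
		k := GroupKey{ref.Dashboard, ref.Metric}
		out[k] = append(out[k], ref.Panel)
	}
	for k := range out {
		sort.Strings(out[k])
	}
	return out
}

func main() {
	labels := []byte(`{"status":"success","data":["up","node_cpu_seconds_total"]}`)
	rules := []byte(`{"status":"success","data":{"groups":[{"name":"g","rules":[
		{"type":"recording","name":"job:up:sum"},
		{"type":"alerting","name":"InstanceDown"}]}]}}`)
	known, err := KnownMetrics(labels, rules)
	if err != nil {
		panic(err)
	}
	refs := []PanelRef{
		{"Frontend SLOs", "Disk I/O - last 5m", "node_disk_io_now"},
		{"Frontend SLOs", "Disk I/O - last 1h", "node_disk_io_now"},
		{"Frontend SLOs", "Uptime", "job:up:sum"},
	}
	for k, panels := range GroupMissing(refs, known) {
		fmt.Printf("%s / %s: %d panel(s) %v\n", k.Dashboard, k.Metric, len(panels), panels)
	}
}
```

In this sketch the two "Disk I/O" panels collapse into a single `(Frontend SLOs, node_disk_io_now)` group, while the "Uptime" panel is not reported because `job:up:sum` is a recording-rule output.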

## Known false positives

- **Intermittent metrics.** A metric that is only present during
scheduled cron-job runs may appear missing during a scan that
runs between executions. Suppress with `--ignore-metric <regex>`.
- **Freshly-rotated retention.** If a metric is in long-term storage
but no longer in head series, it counts as missing. Most users
do not hit this because Prometheus's local retention defaults to
15 days.
- **VictoriaMetrics without `--vmalert`.** Recording rules live in
the `vmalert` process, not in `vmselect`. When `--vmalert` is not
provided, recording-rule outputs are not visible to remetric, so
any panel querying a recording-rule output is reported as broken.
The analyzer surfaces this case as a warning rather than failing.

## How to fix

Pick one of:

1. **Restore the metric.** Re-enable the scrape job, fix the
exporter, or add back the recording rule that emits the
metric.
2. **Remove the dead query.** Edit the dashboard in Grafana and
either delete the panel or rewrite its query to use a metric
that still exists.
3. **Suppress.** If the dashboard is known-stale and you cannot
delete it yet, use `--ignore-dashboard "Legacy.*"` (anchored
regex against the dashboard title).
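
As an illustration of what "anchored regex" means for suppression, here is a minimal Go sketch. It assumes the pattern is wrapped so that it must match the whole dashboard title, and that empty or whitespace-only patterns are skipped (as the README states for ignore patterns). The helper names `compileAnchored` and `ignored` are hypothetical, and remetric's real matching code may differ.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// compileAnchored wraps a user pattern so it must match the whole input.
// It returns (nil, nil) for empty or whitespace-only patterns, which are skipped.
func compileAnchored(pattern string) (*regexp.Regexp, error) {
	if strings.TrimSpace(pattern) == "" {
		return nil, nil
	}
	return regexp.Compile(`^(?:` + pattern + `)$`)
}

// ignored reports whether title fully matches any of the patterns.
func ignored(title string, patterns []string) (bool, error) {
	for _, p := range patterns {
		re, err := compileAnchored(p)
		if err != nil {
			return false, fmt.Errorf("bad pattern %q: %w", p, err)
		}
		if re != nil && re.MatchString(title) {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	patterns := []string{"Legacy.*", "  "}
	for _, title := range []string{"Legacy Billing", "Old Legacy Billing", "Frontend SLOs"} {
		skip, err := ignored(title, patterns)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%-22q ignored=%v\n", title, skip)
	}
}
```

Under these assumptions `Legacy.*` suppresses "Legacy Billing" but not "Old Legacy Billing", because the match must start at the first character of the title.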

## Sample JSON

```json
{
  "id": "broken-panel:abc123:node_disk_io_now",
  "severity": "medium",
  "category": "dashboard_hygiene",
  "class": "broken-panel",
  "title": "dashboard \"Frontend SLOs\" references missing metric \"node_disk_io_now\"",
  "metric": "node_disk_io_now",
  "dashboard": "Frontend SLOs",
  "evidence": {
    "description": "2 panel(s) in dashboard \"Frontend SLOs\" query \"node_disk_io_now\" which is not present in head series or recording-rule outputs",
    "sample_values": ["Disk I/O - last 5m", "Disk I/O - last 1h"]
  },
  "fix": {
    "type": "edit_dashboard",
    "config": "Edit dashboard \"Frontend SLOs\" (https://grafana.example.com/d/abc123/frontend-slos)\nand either:\n 1. Restore metric \"node_disk_io_now\" (...)\n 2. Remove/replace the broken queries in panel(s):\n - Disk I/O - last 5m\n - Disk I/O - last 1h\nReference: https://remetric.dev/findings/broken-panel"
  },
  "impact": {
    "series_reduction": 0,
    "percentage": 0,
    "estimation_method": "broken_panel"
  },
  "documentation_url": "https://remetric.dev/findings/broken-panel"
}
```
22 changes: 0 additions & 22 deletions docs/findings/dashboard-sprawl.md

This file was deleted.

2 changes: 1 addition & 1 deletion docs/findings/index.md
@@ -11,4 +11,4 @@ that triggered it. Each class has its own reference page below.
| [Label pattern overly granular](label-pattern-overly-granular.md) | Low / Medium | label_patterns |
| [Never-firing alert](never-firing-alert.md) | Medium | alert_hygiene |
| [Always-firing alert](always-firing-alert.md) | Critical | alert_hygiene |
-| [Dashboard sprawl](dashboard-sprawl.md) | n/a (coming soon) | dashboard_sprawl |
+| [Broken panel](broken-panel.md) | Medium | dashboard_hygiene |
4 changes: 2 additions & 2 deletions docs/findings/unused-metric.md
@@ -89,5 +89,5 @@ Suppress via `--ignore-metric <regex>`. See the

- [Hot label](hot-label.md) - if the metric is heavily used but bloated by one
label, fix the label rather than dropping the metric.
-- [Dashboard sprawl](dashboard-sprawl.md) - if the metric is referenced only by
-dashboards nobody looks at, its "used" status may itself be a lie.
+- [Broken panel](broken-panel.md) - the inverse case: a dashboard panel
+references a metric that no longer exists in Prometheus.
54 changes: 54 additions & 0 deletions e2e/dashboards_e2e_test.go
@@ -0,0 +1,54 @@
// SPDX-License-Identifier: Apache-2.0
// Copyright 2026 Andrei Taranik

//go:build e2e

package e2e

import (
	"encoding/json"
	"strings"
	"testing"
)

// TestE2E_DashboardsBroken_JSON verifies the dashboards broken subcommand
// surfaces the broken panel provisioned by e2e/grafana/dashboards/broken-panel-demo.json,
// which references a metric Prometheus does not scrape.
func TestE2E_DashboardsBroken_JSON(t *testing.T) {
	out, err := runCmd(t, binPath(t),
		"dashboards", "broken",
		"--prometheus", "http://localhost:9090",
		"--grafana", "http://localhost:3000",
		"--output", "json",
	)
	if err != nil {
		t.Fatalf("dashboards broken failed: %v\noutput:\n%s", err, out)
	}

	var payload struct {
		Findings []struct {
			Class     string `json:"class"`
			Metric    string `json:"metric"`
			Dashboard string `json:"dashboard"`
		} `json:"findings"`
	}
	if err := json.Unmarshal([]byte(out), &payload); err != nil {
		t.Fatalf("json.Unmarshal: %v\noutput:\n%s", err, out)
	}
	if len(payload.Findings) == 0 {
		t.Fatalf("expected at least 1 finding; got 0\noutput:\n%s", out)
	}
	found := false
	for _, f := range payload.Findings {
		if f.Class == "broken-panel" && f.Metric == "nonexistent_broken_panel_demo_xyz_total" {
			if !strings.Contains(f.Dashboard, "Broken Panel Demo") {
				t.Errorf("Dashboard field = %q, want %q", f.Dashboard, "Broken Panel Demo")
			}
			found = true
			break
		}
	}
	if !found {
		t.Errorf("expected broken-panel finding for nonexistent_broken_panel_demo_xyz_total; got %+v", payload.Findings)
	}
}
17 changes: 17 additions & 0 deletions e2e/grafana/dashboards/broken-panel-demo.json
@@ -0,0 +1,17 @@
{
  "uid": "broken-panel-demo",
  "title": "Broken Panel Demo",
  "schemaVersion": 36,
  "version": 1,
  "panels": [
    {
      "id": 1,
      "type": "graph",
      "title": "Bogus Metric",
      "datasource": {"type": "prometheus", "uid": "Prometheus"},
      "targets": [
        {"expr": "nonexistent_broken_panel_demo_xyz_total", "refId": "A"}
      ]
    }
  ]
}