feat: opt-in LLM root-cause surface (BYO-key, default OFF, findings-only) by rlaope · Pull Request #259 · rlaope/Argus

rlaope · 2026-05-29T01:34:52Z

Summary

Adds an opt-in LLM root-cause surface that narrates the likely root cause from the existing deterministic doctor/GC findings. It is default OFF, bring-your-own-key, and findings-only — it never fabricates metrics, and the deterministic template output (explain/gcwhy/doctor) remains the default and the fallback.

Safety posture (the whole point)

Default OFF. Disabled unless BOTH argus.llm.enabled=true (or ARGUS_LLM_ENABLED=true) and ARGUS_LLM_API_KEY are set.
Zero network call when disabled. The gate lives in LlmRootCause; when not activatable, no provider is even constructed. A unit test asserts the provider factory never runs.
BYO-key. The key is read only from the environment and is never persisted by Argus.
Findings-only, no hallucinated metrics. The prompt payload is serialized purely from the structured Finding objects, and the system instruction explicitly forbids the model from inventing any number. A unit test feeds known findings and asserts every number in the prompt is a subset of the numbers in the findings.
Deterministic output always wins. The deterministic findings are always printed first; the LLM block is clearly labelled advisory and appended alongside. If the feature is off or the call fails, a skip note prints and the deterministic findings stand on their own.
No LLM SDK dependency. The real provider uses the JDK HttpClient against a provider-neutral OpenAI-compatible chat endpoint (ARGUS_LLM_ENDPOINT / ARGUS_LLM_MODEL override the defaults).

Surfaces

argus doctor --rca — appends an advisory root cause after the health report.
argus explain --llm --bundle=<snapshot.tar.gz> — offline path: reads findings out of a snapshot bundle, prints them deterministically, then appends the advisory.

Acceptance criteria & how verified

Findings → prompt serializer — FindingsPrompt.serialize/buildPrompt; deterministic output remains the default/fallback. ✅
Pluggable provider behind a gate — LlmProvider interface, HttpLlmProvider (JDK HttpClient, no SDK), LlmConfig gate. Unit test disabled_neverInvokesProvider asserts the provider is never constructed/invoked when disabled (also enabledWithoutKey_neverInvokesProvider). ✅
Findings-only guarantee — prompt_containsOnlyNumbersFromFindings: a stub captures the prompt and asserts its numbers are a subset of the findings' numbers; system instruction contains "do not invent". ✅
Advisory labelling + deterministic always shown — LlmAdvisoryRenderer labels the block advisory and prints after the deterministic findings. ✅
Offline bundle path — BundleFindings.fromBundle (JDK-only tar.gz reader). bundlePath_feedsFindingsToProvider exercises bundle → findings → prompt with a stub. ✅

Extra: providerFailure_fallsBackToFindings confirms a provider error is caught and the findings still flow through.

Tests & build (fresh)

./gradlew :argus-cli:test --tests 'io.argus.cli.llm.LlmRootCauseTest' → BUILD SUCCESSFUL, tests="6" skipped="0" failures="0" errors="0".
./gradlew :argus-cli:test (full CLI suite) → BUILD SUCCESSFUL.
./gradlew :argus-cli:fatJar → BUILD SUCCESSFUL (argus-cli-1.5.0-all.jar).
Smoke test with the fatJar (no key set) confirmed default-OFF: deterministic findings print, advisory is skipped with the disabled note, no network.

i18n parity

8 new llm.rca.* keys added to all four locales (printf %s):

messages_en.properties: 667 keys
messages_ko.properties: 667 keys
messages_ja.properties: 667 keys
messages_zh.properties: 667 keys

🤖 Generated with Claude Code

…nly) Add an opt-in LLM root-cause layer that narrates the likely root cause from deterministic doctor/GC findings. The deterministic template output stays the default and the fallback; the LLM advisory is only ever appended alongside it. - FindingsPrompt: serializes structured Findings into a compact, schema'd payload. Findings-only: no metric is synthesised, and the system instruction forbids the model from inventing numbers. - LlmProvider interface + HttpLlmProvider (JDK HttpClient, provider-neutral OpenAI-compatible endpoint; no LLM SDK dependency). Tests inject a stub. - LlmConfig gate: default OFF. Requires argus.llm.enabled=true (or ARGUS_LLM_ENABLED) AND ARGUS_LLM_API_KEY. Disabled => no provider, zero network call. - LlmRootCause orchestrates the gate, fallback, and advisory labelling; the findings always flow through the Result. - BundleFindings reads findings back out of an offline snapshot tar.gz. - doctor --rca and explain --llm --bundle=<snapshot.tar.gz> wire it in; the deterministic findings are always printed first, advisory labelled clearly. - i18n: 8 new llm.rca.* keys across en/ko/ja/zh (667 keys each, in parity). - Tests: disabled => provider never invoked; prompt numbers are a subset of the findings; offline bundle path with a stub; provider failure falls back. Signed-off-by: rlaope <piyrw9754@gmail.com>

parse() treated any stripped line starting with '[' as a new finding header before checking the detail/recommendation branches. A detail/recommendation line that legitimately starts with '[' (e.g. a GC-log tag like [gc,heap] or a quoted JVM flag bracket) was mistaken for a header; parseHeader() then returned a finding with INFO severity or, when the line had no ']', null, which closed out the in-progress finding and silently dropped its remaining lines. The offline LLM RCA bundle could yield fewer findings than the snapshot contained, with no error. Header detection is now strict: a line starts a new finding only when parseHeader() succeeds AND the bracketed token is a recognized Severity (CRITICAL/WARNING/INFO). Any other '['-prefixed line falls through and is attached to the current finding as detail or recommendation, so no lines are dropped. Happy-path parsing of well-formed bundles is unchanged. Signed-off-by: rlaope <piyrw9754@gmail.com>

Signed-off-by: rlaope <piyrw9754@gmail.com>

* fix: harden new 2026-roadmap features for large-scale fleet use Large-scale usability QA over the W1-W8 features (#252-#259) surfaced six correctness/scale defects under fleet-sized load. Each is fixed at the mechanism, not patched around: - OtlpJsonBuilder ignored argus.metrics.legacyNames and dual-emitted every argus_* duplicate over OTLP unconditionally, while the Prometheus path gates them. legacyNames=false halved Prometheus cardinality but never the OTLP stream. Mirror the Prometheus gating exactly (vt-active, heap used/committed, all three CPU ratios, metaspace used/committed/classes); argus-unique series with no semconv equivalent still always ship. - PrometheusMetricsCollector kept the semconv HELP/TYPE dedup set as a shared instance field, racing under concurrent scrapes (Prometheus + agent) and corrupting the scrape. Make it per-scrape method-local. - PinningAnalyzer used a volatile count++ (non-atomic, lost updates) and evicted via removeIf(count <= 1), which wiped the whole map when pinning was spread across many distinct singleton sites. Use AtomicLong and evict by retaining the top-N by count with a tie-break that guarantees bounded growth. - StructuredConcurrencyView.renderNode recursed without bound — a deeply nested StructuredTaskScope tree (or an adversarial dump) overflowed the stack. Cap the render depth and elide beyond it. - AnomalyDetector evicted its recent-anomaly ring via ArrayList.remove(0) (O(ring) per fire); switch to ArrayDeque for O(1) head eviction. - ProfileTraceContext compiled a regex per trace-context normalize() call on the encode hot path; hoist to a static Pattern. Adds OtlpJsonBuilderMetricsTest cases for the legacyNames gating (on/off) and a new PinningAnalyzerTest covering the eviction policy. All touched-module tests green. Signed-off-by: rlaope <piyrw9754@gmail.com> * fix: address code-review findings on the large-scale hardening - PinningAnalyzer.evictLowCountEntries: guard against a negative array index. The map can shrink below MAX_HOTSPOTS between the size() trigger and the counts[] snapshot (concurrent clear()/overlapping eviction), which would make counts[counts.length - MAX_HOTSPOTS] index negative and throw. Re-check the snapshot length before indexing. - StructuredConcurrencyView: the depth-cap elision indented with repeat(MAX_RENDER_DEPTH) (1024 blank chars per elided node); use a modest fixed indent instead. Signed-off-by: rlaope <piyrw9754@gmail.com> --------- Signed-off-by: rlaope <piyrw9754@gmail.com>

rlaope added 3 commits May 29, 2026 10:34

merge: resolve locale parity with master (rightsize + llm.rca keys)

098887d

Signed-off-by: rlaope <piyrw9754@gmail.com>

rlaope merged commit e05d2ea into master May 29, 2026
1 of 7 checks passed

rlaope deleted the feat/w8-llm-rca branch May 29, 2026 03:12

rlaope mentioned this pull request May 29, 2026

fix: harden 2026-roadmap features for large-scale fleet use #262

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat: opt-in LLM root-cause surface (BYO-key, default OFF, findings-only)#259

feat: opt-in LLM root-cause surface (BYO-key, default OFF, findings-only)#259
rlaope merged 3 commits into
masterfrom
feat/w8-llm-rca

rlaope commented May 29, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

rlaope commented May 29, 2026

Summary

Safety posture (the whole point)

Surfaces

Acceptance criteria & how verified

Tests & build (fresh)

i18n parity

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant