Skip to content

feat: opt-in LLM root-cause surface (BYO-key, default OFF, findings-only)#259

Merged
rlaope merged 3 commits into
masterfrom
feat/w8-llm-rca
May 29, 2026
Merged

feat: opt-in LLM root-cause surface (BYO-key, default OFF, findings-only)#259
rlaope merged 3 commits into
masterfrom
feat/w8-llm-rca

Conversation

@rlaope
Copy link
Copy Markdown
Owner

@rlaope rlaope commented May 29, 2026

Summary

Adds an opt-in LLM root-cause surface that narrates the likely root cause from the existing deterministic doctor/GC findings. It is default OFF, bring-your-own-key, and findings-only — it never fabricates metrics, and the deterministic template output (explain/gcwhy/doctor) remains the default and the fallback.

Safety posture (the whole point)

  • Default OFF. Disabled unless BOTH argus.llm.enabled=true (or ARGUS_LLM_ENABLED=true) and ARGUS_LLM_API_KEY are set.
  • Zero network call when disabled. The gate lives in LlmRootCause; when not activatable, no provider is even constructed. A unit test asserts the provider factory never runs.
  • BYO-key. The key is read only from the environment and is never persisted by Argus.
  • Findings-only, no hallucinated metrics. The prompt payload is serialized purely from the structured Finding objects, and the system instruction explicitly forbids the model from inventing any number. A unit test feeds known findings and asserts every number in the prompt is a subset of the numbers in the findings.
  • Deterministic output always wins. The deterministic findings are always printed first; the LLM block is clearly labelled advisory and appended alongside. If the feature is off or the call fails, a skip note prints and the deterministic findings stand on their own.
  • No LLM SDK dependency. The real provider uses the JDK HttpClient against a provider-neutral OpenAI-compatible chat endpoint (ARGUS_LLM_ENDPOINT / ARGUS_LLM_MODEL override the defaults).

Surfaces

  • argus doctor --rca — appends an advisory root cause after the health report.
  • argus explain --llm --bundle=<snapshot.tar.gz> — offline path: reads findings out of a snapshot bundle, prints them deterministically, then appends the advisory.

Acceptance criteria & how verified

  1. Findings → prompt serializerFindingsPrompt.serialize/buildPrompt; deterministic output remains the default/fallback. ✅
  2. Pluggable provider behind a gateLlmProvider interface, HttpLlmProvider (JDK HttpClient, no SDK), LlmConfig gate. Unit test disabled_neverInvokesProvider asserts the provider is never constructed/invoked when disabled (also enabledWithoutKey_neverInvokesProvider). ✅
  3. Findings-only guaranteeprompt_containsOnlyNumbersFromFindings: a stub captures the prompt and asserts its numbers are a subset of the findings' numbers; system instruction contains "do not invent". ✅
  4. Advisory labelling + deterministic always shownLlmAdvisoryRenderer labels the block advisory and prints after the deterministic findings. ✅
  5. Offline bundle pathBundleFindings.fromBundle (JDK-only tar.gz reader). bundlePath_feedsFindingsToProvider exercises bundle → findings → prompt with a stub. ✅

Extra: providerFailure_fallsBackToFindings confirms a provider error is caught and the findings still flow through.

Tests & build (fresh)

  • ./gradlew :argus-cli:test --tests 'io.argus.cli.llm.LlmRootCauseTest' → BUILD SUCCESSFUL, tests="6" skipped="0" failures="0" errors="0".
  • ./gradlew :argus-cli:test (full CLI suite) → BUILD SUCCESSFUL.
  • ./gradlew :argus-cli:fatJar → BUILD SUCCESSFUL (argus-cli-1.5.0-all.jar).
  • Smoke test with the fatJar (no key set) confirmed default-OFF: deterministic findings print, advisory is skipped with the disabled note, no network.

i18n parity

8 new llm.rca.* keys added to all four locales (printf %s):

  • messages_en.properties: 667 keys
  • messages_ko.properties: 667 keys
  • messages_ja.properties: 667 keys
  • messages_zh.properties: 667 keys

🤖 Generated with Claude Code

rlaope added 3 commits May 29, 2026 10:34
…nly)

Add an opt-in LLM root-cause layer that narrates the likely root cause from
deterministic doctor/GC findings. The deterministic template output stays the
default and the fallback; the LLM advisory is only ever appended alongside it.

- FindingsPrompt: serializes structured Findings into a compact, schema'd
  payload. Findings-only: no metric is synthesised, and the system instruction
  forbids the model from inventing numbers.
- LlmProvider interface + HttpLlmProvider (JDK HttpClient, provider-neutral
  OpenAI-compatible endpoint; no LLM SDK dependency). Tests inject a stub.
- LlmConfig gate: default OFF. Requires argus.llm.enabled=true (or
  ARGUS_LLM_ENABLED) AND ARGUS_LLM_API_KEY. Disabled => no provider, zero
  network call.
- LlmRootCause orchestrates the gate, fallback, and advisory labelling; the
  findings always flow through the Result.
- BundleFindings reads findings back out of an offline snapshot tar.gz.
- doctor --rca and explain --llm --bundle=<snapshot.tar.gz> wire it in; the
  deterministic findings are always printed first, advisory labelled clearly.
- i18n: 8 new llm.rca.* keys across en/ko/ja/zh (667 keys each, in parity).
- Tests: disabled => provider never invoked; prompt numbers are a subset of
  the findings; offline bundle path with a stub; provider failure falls back.

Signed-off-by: rlaope <piyrw9754@gmail.com>
parse() treated any stripped line starting with '[' as a new finding
header before checking the detail/recommendation branches. A
detail/recommendation line that legitimately starts with '[' (e.g. a
GC-log tag like [gc,heap] or a quoted JVM flag bracket) was mistaken
for a header; parseHeader() then returned a finding with INFO severity
or, when the line had no ']', null, which closed out the in-progress
finding and silently dropped its remaining lines. The offline LLM RCA
bundle could yield fewer findings than the snapshot contained, with no
error.

Header detection is now strict: a line starts a new finding only when
parseHeader() succeeds AND the bracketed token is a recognized Severity
(CRITICAL/WARNING/INFO). Any other '['-prefixed line falls through and
is attached to the current finding as detail or recommendation, so no
lines are dropped. Happy-path parsing of well-formed bundles is
unchanged.

Signed-off-by: rlaope <piyrw9754@gmail.com>
Signed-off-by: rlaope <piyrw9754@gmail.com>
@rlaope rlaope merged commit e05d2ea into master May 29, 2026
1 of 7 checks passed
@rlaope rlaope deleted the feat/w8-llm-rca branch May 29, 2026 03:12
rlaope added a commit that referenced this pull request May 29, 2026
* fix: harden new 2026-roadmap features for large-scale fleet use

Large-scale usability QA over the W1-W8 features (#252-#259) surfaced six
correctness/scale defects under fleet-sized load. Each is fixed at the
mechanism, not patched around:

- OtlpJsonBuilder ignored argus.metrics.legacyNames and dual-emitted every
  argus_* duplicate over OTLP unconditionally, while the Prometheus path gates
  them. legacyNames=false halved Prometheus cardinality but never the OTLP
  stream. Mirror the Prometheus gating exactly (vt-active, heap used/committed,
  all three CPU ratios, metaspace used/committed/classes); argus-unique series
  with no semconv equivalent still always ship.
- PrometheusMetricsCollector kept the semconv HELP/TYPE dedup set as a shared
  instance field, racing under concurrent scrapes (Prometheus + agent) and
  corrupting the scrape. Make it per-scrape method-local.
- PinningAnalyzer used a volatile count++ (non-atomic, lost updates) and evicted
  via removeIf(count <= 1), which wiped the whole map when pinning was spread
  across many distinct singleton sites. Use AtomicLong and evict by retaining
  the top-N by count with a tie-break that guarantees bounded growth.
- StructuredConcurrencyView.renderNode recursed without bound — a deeply nested
  StructuredTaskScope tree (or an adversarial dump) overflowed the stack. Cap
  the render depth and elide beyond it.
- AnomalyDetector evicted its recent-anomaly ring via ArrayList.remove(0)
  (O(ring) per fire); switch to ArrayDeque for O(1) head eviction.
- ProfileTraceContext compiled a regex per trace-context normalize() call on the
  encode hot path; hoist to a static Pattern.

Adds OtlpJsonBuilderMetricsTest cases for the legacyNames gating (on/off) and a
new PinningAnalyzerTest covering the eviction policy. All touched-module tests
green.

Signed-off-by: rlaope <piyrw9754@gmail.com>

* fix: address code-review findings on the large-scale hardening

- PinningAnalyzer.evictLowCountEntries: guard against a negative array index.
  The map can shrink below MAX_HOTSPOTS between the size() trigger and the
  counts[] snapshot (concurrent clear()/overlapping eviction), which would make
  counts[counts.length - MAX_HOTSPOTS] index negative and throw. Re-check the
  snapshot length before indexing.
- StructuredConcurrencyView: the depth-cap elision indented with
  repeat(MAX_RENDER_DEPTH) (1024 blank chars per elided node); use a modest
  fixed indent instead.

Signed-off-by: rlaope <piyrw9754@gmail.com>

---------

Signed-off-by: rlaope <piyrw9754@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant