feat: opt-in LLM root-cause surface (BYO-key, default OFF, findings-only)#259
Merged
Conversation
…nly) Add an opt-in LLM root-cause layer that narrates the likely root cause from deterministic doctor/GC findings. The deterministic template output stays the default and the fallback; the LLM advisory is only ever appended alongside it. - FindingsPrompt: serializes structured Findings into a compact, schema'd payload. Findings-only: no metric is synthesised, and the system instruction forbids the model from inventing numbers. - LlmProvider interface + HttpLlmProvider (JDK HttpClient, provider-neutral OpenAI-compatible endpoint; no LLM SDK dependency). Tests inject a stub. - LlmConfig gate: default OFF. Requires argus.llm.enabled=true (or ARGUS_LLM_ENABLED) AND ARGUS_LLM_API_KEY. Disabled => no provider, zero network call. - LlmRootCause orchestrates the gate, fallback, and advisory labelling; the findings always flow through the Result. - BundleFindings reads findings back out of an offline snapshot tar.gz. - doctor --rca and explain --llm --bundle=<snapshot.tar.gz> wire it in; the deterministic findings are always printed first, advisory labelled clearly. - i18n: 8 new llm.rca.* keys across en/ko/ja/zh (667 keys each, in parity). - Tests: disabled => provider never invoked; prompt numbers are a subset of the findings; offline bundle path with a stub; provider failure falls back. Signed-off-by: rlaope <piyrw9754@gmail.com>
parse() treated any stripped line starting with '[' as a new finding header before checking the detail/recommendation branches. A detail/recommendation line that legitimately starts with '[' (e.g. a GC-log tag like [gc,heap] or a quoted JVM flag bracket) was mistaken for a header; parseHeader() then returned a finding with INFO severity or, when the line had no ']', null, which closed out the in-progress finding and silently dropped its remaining lines. The offline LLM RCA bundle could yield fewer findings than the snapshot contained, with no error. Header detection is now strict: a line starts a new finding only when parseHeader() succeeds AND the bracketed token is a recognized Severity (CRITICAL/WARNING/INFO). Any other '['-prefixed line falls through and is attached to the current finding as detail or recommendation, so no lines are dropped. Happy-path parsing of well-formed bundles is unchanged. Signed-off-by: rlaope <piyrw9754@gmail.com>
Signed-off-by: rlaope <piyrw9754@gmail.com>
rlaope
added a commit
that referenced
this pull request
May 29, 2026
* fix: harden new 2026-roadmap features for large-scale fleet use Large-scale usability QA over the W1-W8 features (#252-#259) surfaced six correctness/scale defects under fleet-sized load. Each is fixed at the mechanism, not patched around: - OtlpJsonBuilder ignored argus.metrics.legacyNames and dual-emitted every argus_* duplicate over OTLP unconditionally, while the Prometheus path gates them. legacyNames=false halved Prometheus cardinality but never the OTLP stream. Mirror the Prometheus gating exactly (vt-active, heap used/committed, all three CPU ratios, metaspace used/committed/classes); argus-unique series with no semconv equivalent still always ship. - PrometheusMetricsCollector kept the semconv HELP/TYPE dedup set as a shared instance field, racing under concurrent scrapes (Prometheus + agent) and corrupting the scrape. Make it per-scrape method-local. - PinningAnalyzer used a volatile count++ (non-atomic, lost updates) and evicted via removeIf(count <= 1), which wiped the whole map when pinning was spread across many distinct singleton sites. Use AtomicLong and evict by retaining the top-N by count with a tie-break that guarantees bounded growth. - StructuredConcurrencyView.renderNode recursed without bound — a deeply nested StructuredTaskScope tree (or an adversarial dump) overflowed the stack. Cap the render depth and elide beyond it. - AnomalyDetector evicted its recent-anomaly ring via ArrayList.remove(0) (O(ring) per fire); switch to ArrayDeque for O(1) head eviction. - ProfileTraceContext compiled a regex per trace-context normalize() call on the encode hot path; hoist to a static Pattern. Adds OtlpJsonBuilderMetricsTest cases for the legacyNames gating (on/off) and a new PinningAnalyzerTest covering the eviction policy. All touched-module tests green. Signed-off-by: rlaope <piyrw9754@gmail.com> * fix: address code-review findings on the large-scale hardening - PinningAnalyzer.evictLowCountEntries: guard against a negative array index. The map can shrink below MAX_HOTSPOTS between the size() trigger and the counts[] snapshot (concurrent clear()/overlapping eviction), which would make counts[counts.length - MAX_HOTSPOTS] index negative and throw. Re-check the snapshot length before indexing. - StructuredConcurrencyView: the depth-cap elision indented with repeat(MAX_RENDER_DEPTH) (1024 blank chars per elided node); use a modest fixed indent instead. Signed-off-by: rlaope <piyrw9754@gmail.com> --------- Signed-off-by: rlaope <piyrw9754@gmail.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Adds an opt-in LLM root-cause surface that narrates the likely root cause from the existing deterministic
doctor/GC findings. It is default OFF, bring-your-own-key, and findings-only — it never fabricates metrics, and the deterministic template output (explain/gcwhy/doctor) remains the default and the fallback.Safety posture (the whole point)
argus.llm.enabled=true(orARGUS_LLM_ENABLED=true) andARGUS_LLM_API_KEYare set.LlmRootCause; when not activatable, no provider is even constructed. A unit test asserts the provider factory never runs.Findingobjects, and the system instruction explicitly forbids the model from inventing any number. A unit test feeds known findings and asserts every number in the prompt is a subset of the numbers in the findings.HttpClientagainst a provider-neutral OpenAI-compatible chat endpoint (ARGUS_LLM_ENDPOINT/ARGUS_LLM_MODELoverride the defaults).Surfaces
argus doctor --rca— appends an advisory root cause after the health report.argus explain --llm --bundle=<snapshot.tar.gz>— offline path: reads findings out of asnapshotbundle, prints them deterministically, then appends the advisory.Acceptance criteria & how verified
FindingsPrompt.serialize/buildPrompt; deterministic output remains the default/fallback. ✅LlmProviderinterface,HttpLlmProvider(JDK HttpClient, no SDK),LlmConfiggate. Unit testdisabled_neverInvokesProviderasserts the provider is never constructed/invoked when disabled (alsoenabledWithoutKey_neverInvokesProvider). ✅prompt_containsOnlyNumbersFromFindings: a stub captures the prompt and asserts its numbers are a subset of the findings' numbers; system instruction contains "do not invent". ✅LlmAdvisoryRendererlabels the block advisory and prints after the deterministic findings. ✅BundleFindings.fromBundle(JDK-only tar.gz reader).bundlePath_feedsFindingsToProviderexercises bundle → findings → prompt with a stub. ✅Extra:
providerFailure_fallsBackToFindingsconfirms a provider error is caught and the findings still flow through.Tests & build (fresh)
./gradlew :argus-cli:test --tests 'io.argus.cli.llm.LlmRootCauseTest'→ BUILD SUCCESSFUL,tests="6" skipped="0" failures="0" errors="0"../gradlew :argus-cli:test(full CLI suite) → BUILD SUCCESSFUL../gradlew :argus-cli:fatJar→ BUILD SUCCESSFUL (argus-cli-1.5.0-all.jar).i18n parity
8 new
llm.rca.*keys added to all four locales (printf%s):messages_en.properties: 667 keysmessages_ko.properties: 667 keysmessages_ja.properties: 667 keysmessages_zh.properties: 667 keys🤖 Generated with Claude Code