fix: harden 2026-roadmap features for large-scale fleet use#262
Merged
Conversation
Large-scale usability QA over the W1-W8 features (#252-#259) surfaced six correctness/scale defects under fleet-sized load. Each is fixed at the mechanism, not patched around: - OtlpJsonBuilder ignored argus.metrics.legacyNames and dual-emitted every argus_* duplicate over OTLP unconditionally, while the Prometheus path gates them. legacyNames=false halved Prometheus cardinality but never the OTLP stream. Mirror the Prometheus gating exactly (vt-active, heap used/committed, all three CPU ratios, metaspace used/committed/classes); argus-unique series with no semconv equivalent still always ship. - PrometheusMetricsCollector kept the semconv HELP/TYPE dedup set as a shared instance field, racing under concurrent scrapes (Prometheus + agent) and corrupting the scrape. Make it per-scrape method-local. - PinningAnalyzer used a volatile count++ (non-atomic, lost updates) and evicted via removeIf(count <= 1), which wiped the whole map when pinning was spread across many distinct singleton sites. Use AtomicLong and evict by retaining the top-N by count with a tie-break that guarantees bounded growth. - StructuredConcurrencyView.renderNode recursed without bound — a deeply nested StructuredTaskScope tree (or an adversarial dump) overflowed the stack. Cap the render depth and elide beyond it. - AnomalyDetector evicted its recent-anomaly ring via ArrayList.remove(0) (O(ring) per fire); switch to ArrayDeque for O(1) head eviction. - ProfileTraceContext compiled a regex per trace-context normalize() call on the encode hot path; hoist to a static Pattern. Adds OtlpJsonBuilderMetricsTest cases for the legacyNames gating (on/off) and a new PinningAnalyzerTest covering the eviction policy. All touched-module tests green. Signed-off-by: rlaope <piyrw9754@gmail.com>
- PinningAnalyzer.evictLowCountEntries: guard against a negative array index. The map can shrink below MAX_HOTSPOTS between the size() trigger and the counts[] snapshot (concurrent clear()/overlapping eviction), which would make counts[counts.length - MAX_HOTSPOTS] index negative and throw. Re-check the snapshot length before indexing. - StructuredConcurrencyView: the depth-cap elision indented with repeat(MAX_RENDER_DEPTH) (1024 blank chars per elided node); use a modest fixed indent instead. Signed-off-by: rlaope <piyrw9754@gmail.com>
rlaope
added a commit
that referenced
this pull request
May 29, 2026
…metheus parity (#263) The code review on #262 surfaced that several argus-unique series (no OTel semconv equivalent) shipped on the Prometheus scrape but were absent from the OTLP export — an operator switching from scrape to OTLP silently lost leak detection and GC-rate diagnostics. Add the scalar argus-unique series to OtlpJsonBuilder, in parity with PrometheusMetricsCollector and with matching values: - argus_gc_pause_time_seconds_avg, argus_gc_overhead_warning - argus_gc_allocation_rate_kbps, argus_gc_promotion_rate_kbps - argus_gc_leak_suspected, argus_gc_leak_confidence - argus_heap_usage_ratio - argus_metaspace_reserved_bytes - argus_carrier_threads_virtual_handled_total These have no semconv equivalent, so they are always emitted (not gated by argus.metrics.legacyNames). A test asserts they ship even when legacyNames=false. Intentionally still Prometheus-only (documented, not a regression): the legacy argus_gc_pause_seconds_* histogram (the OTLP path uses the semconv jvm.gc.duration histogram instead), and the label-dimensioned breakdown series (argus_gc_{pause,events}_breakdown_*, argus_allocation_class_bytes, argus_profiling_method_samples, argus_contention_hotspot_events) which need per-series OTLP attribute support — a separate, larger change. Signed-off-by: rlaope <piyrw9754@gmail.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
Large-scale / real-world usability QA over the newly-merged 2026-roadmap features (W1–W8, #252–#259). Four parallel scenario reviews hypothesized fleet-sized situations (hundreds–thousands of pods, huge VT dumps, slow/down backends, concurrent scrapes) and checked the code against them. This PR fixes the six confirmed correctness/scale defects with mechanism-level fixes; the remaining findings are triaged below with rationale.
Fixed (this PR)
OtlpJsonBuilderignoredargus.metrics.legacyNamesand dual-emitted everyargus_*duplicate over OTLP unconditionally —legacyNames=falsehalved Prometheus cardinality but never the OTLP stream, silently doubling every downstream series at fleet scale. Fix: mirror the Prometheus gating exactly (vt-active, heap used/committed, all three CPU ratios, metaspace used/committed/classes). Argus-unique series with no semconv equivalent still always ship.PrometheusMetricsCollector.emittedSemconvHeaderswas a shared instance field mutated during a scrape — two concurrent scrapers (Prometheus + Grafana agent) race and corrupt the HELP/TYPE headers. Fix: per-scrape method-local set.PinningAnalyzerusedvolatile count++(non-atomic, lost updates under concurrent pinning) and evicted withremoveIf(count <= 1)— which wiped the entire hotspot map whenever pinning was spread across many distinct sites each seen once. Fix:AtomicLong+ retain top-N by count with a tie-break that guarantees bounded growth (verified by the all-singletons scale case).StructuredConcurrencyView.renderNoderecursed with no depth bound — a deeply nestedStructuredTaskScopetree (or an adversarial dump with a long parent chain) overflows the stack. Fix: cap render depth (512) and elide beyond it.AnomalyDetectorevicted its recent-anomaly ring viaArrayList.remove(0)— O(ring) per fire, on the synchronized path that blocks every JFR drain, with a user-configurable ring size. Fix:ArrayDequefor O(1) head eviction.ProfileTraceContext.normalize()compiled a regex on every call on the OTLP encode hot path. Fix: staticPattern.New tests:
OtlpJsonBuilderMetricsTestlegacyNames gating (on→dual-emit, off→noargus_*duplicates but semconv + argus-unique remain);PinningAnalyzerTesteviction policy (singletons not wiped; heavy hitter survives).Deferred — triaged, not fixed here (rationale)
These are real but need a product/design decision or sit behind a default-off flag; fixing them silently here would over-reach:
PyroscopePusher→PyroscopeTransport.send()blocks the caller for up to the 3s+10s timeout, and has no backoff/circuit-breaker (thundering-herd on a recovering Pyroscope). Timeouts are set, and the pusher is not yet wired into a capture loop (documented follow-up), so there's no live stall today. Needs an async dispatch + bounded queue + backoff design before wire-up.argus rightsizeuses JVM uptime as the observation window (W3) —RightsizeCommandpassess.uptimeMs()to the short-window refusal gate. Heap HWM is a lifetime peak so recommendations stay peak-anchored, but an idle long-running replica can still pass the gate. Changing "observation window" semantics for a single-snapshot attach is a product call.Authorization-header key handling are clean. A prompt cap + error scrubbing are worth adding but are a separate hardening pass.BundleFindings(int)cast on tar entry size (W8) — offline bundle parsing; a >2 GiB single entry is implausible. Cheap guard, low priority.Verification
./gradlew :argus-server:test :argus-cli:test :argus-aggregator:test→ BUILD SUCCESSFUL (incl. new regression tests). Commit-s, noCo-Authored-By.🤖 Generated with Claude Code