Skip to content

fix: harden 2026-roadmap features for large-scale fleet use#262

Merged
rlaope merged 2 commits into
masterfrom
fix/large-scale-hardening
May 29, 2026
Merged

fix: harden 2026-roadmap features for large-scale fleet use#262
rlaope merged 2 commits into
masterfrom
fix/large-scale-hardening

Conversation

@rlaope
Copy link
Copy Markdown
Owner

@rlaope rlaope commented May 29, 2026

What

Large-scale / real-world usability QA over the newly-merged 2026-roadmap features (W1–W8, #252#259). Four parallel scenario reviews hypothesized fleet-sized situations (hundreds–thousands of pods, huge VT dumps, slow/down backends, concurrent scrapes) and checked the code against them. This PR fixes the six confirmed correctness/scale defects with mechanism-level fixes; the remaining findings are triaged below with rationale.

Fixed (this PR)

# Sev Area Defect → Fix
1 CRITICAL W1 OTLP OtlpJsonBuilder ignored argus.metrics.legacyNames and dual-emitted every argus_* duplicate over OTLP unconditionally — legacyNames=false halved Prometheus cardinality but never the OTLP stream, silently doubling every downstream series at fleet scale. Fix: mirror the Prometheus gating exactly (vt-active, heap used/committed, all three CPU ratios, metaspace used/committed/classes). Argus-unique series with no semconv equivalent still always ship.
2 HIGH W1 Prom PrometheusMetricsCollector.emittedSemconvHeaders was a shared instance field mutated during a scrape — two concurrent scrapers (Prometheus + Grafana agent) race and corrupt the HELP/TYPE headers. Fix: per-scrape method-local set.
3 HIGH W7 Pinning PinningAnalyzer used volatile count++ (non-atomic, lost updates under concurrent pinning) and evicted with removeIf(count <= 1) — which wiped the entire hotspot map whenever pinning was spread across many distinct sites each seen once. Fix: AtomicLong + retain top-N by count with a tie-break that guarantees bounded growth (verified by the all-singletons scale case).
4 CRITICAL W7 SC view StructuredConcurrencyView.renderNode recursed with no depth bound — a deeply nested StructuredTaskScope tree (or an adversarial dump with a long parent chain) overflows the stack. Fix: cap render depth (512) and elide beyond it.
5 MEDIUM W6 Anomaly AnomalyDetector evicted its recent-anomaly ring via ArrayList.remove(0) — O(ring) per fire, on the synchronized path that blocks every JFR drain, with a user-configurable ring size. Fix: ArrayDeque for O(1) head eviction.
6 LOW W5 Profiles ProfileTraceContext.normalize() compiled a regex on every call on the OTLP encode hot path. Fix: static Pattern.

New tests: OtlpJsonBuilderMetricsTest legacyNames gating (on→dual-emit, off→no argus_* duplicates but semconv + argus-unique remain); PinningAnalyzerTest eviction policy (singletons not wiped; heavy hitter survives).

Deferred — triaged, not fixed here (rationale)

These are real but need a product/design decision or sit behind a default-off flag; fixing them silently here would over-reach:

  • Pyroscope push is synchronous (W2)PyroscopePusherPyroscopeTransport.send() blocks the caller for up to the 3s+10s timeout, and has no backoff/circuit-breaker (thundering-herd on a recovering Pyroscope). Timeouts are set, and the pusher is not yet wired into a capture loop (documented follow-up), so there's no live stall today. Needs an async dispatch + bounded queue + backoff design before wire-up.
  • argus rightsize uses JVM uptime as the observation window (W3)RightsizeCommand passes s.uptimeMs() to the short-window refusal gate. Heap HWM is a lifetime peak so recommendations stay peak-anchored, but an idle long-running replica can still pass the gate. Changing "observation window" semantics for a single-snapshot attach is a product call.
  • LLM RCA: no prompt-size cap; raw error message could echo a URL-embedded key (W8) — opt-in, default-OFF, BYO-key. Default-OFF enforcement and Authorization-header key handling are clean. A prompt cap + error scrubbing are worth adding but are a separate hardening pass.
  • OTLP Profiles encoder buffers whole payload (W5) — fine behind the experimental default-off flag; would need a streaming writer if promoted to GA.
  • BundleFindings (int) cast on tar entry size (W8) — offline bundle parsing; a >2 GiB single entry is implausible. Cheap guard, low priority.
  • Fleet rightsize aggregate savings unweighted by pod count (W3) — a reporting-semantics choice to document, not a crash.

Verification

./gradlew :argus-server:test :argus-cli:test :argus-aggregator:testBUILD SUCCESSFUL (incl. new regression tests). Commit -s, no Co-Authored-By.

🤖 Generated with Claude Code

rlaope added 2 commits May 29, 2026 14:23
Large-scale usability QA over the W1-W8 features (#252-#259) surfaced six
correctness/scale defects under fleet-sized load. Each is fixed at the
mechanism, not patched around:

- OtlpJsonBuilder ignored argus.metrics.legacyNames and dual-emitted every
  argus_* duplicate over OTLP unconditionally, while the Prometheus path gates
  them. legacyNames=false halved Prometheus cardinality but never the OTLP
  stream. Mirror the Prometheus gating exactly (vt-active, heap used/committed,
  all three CPU ratios, metaspace used/committed/classes); argus-unique series
  with no semconv equivalent still always ship.
- PrometheusMetricsCollector kept the semconv HELP/TYPE dedup set as a shared
  instance field, racing under concurrent scrapes (Prometheus + agent) and
  corrupting the scrape. Make it per-scrape method-local.
- PinningAnalyzer used a volatile count++ (non-atomic, lost updates) and evicted
  via removeIf(count <= 1), which wiped the whole map when pinning was spread
  across many distinct singleton sites. Use AtomicLong and evict by retaining
  the top-N by count with a tie-break that guarantees bounded growth.
- StructuredConcurrencyView.renderNode recursed without bound — a deeply nested
  StructuredTaskScope tree (or an adversarial dump) overflowed the stack. Cap
  the render depth and elide beyond it.
- AnomalyDetector evicted its recent-anomaly ring via ArrayList.remove(0)
  (O(ring) per fire); switch to ArrayDeque for O(1) head eviction.
- ProfileTraceContext compiled a regex per trace-context normalize() call on the
  encode hot path; hoist to a static Pattern.

Adds OtlpJsonBuilderMetricsTest cases for the legacyNames gating (on/off) and a
new PinningAnalyzerTest covering the eviction policy. All touched-module tests
green.

Signed-off-by: rlaope <piyrw9754@gmail.com>
- PinningAnalyzer.evictLowCountEntries: guard against a negative array index.
  The map can shrink below MAX_HOTSPOTS between the size() trigger and the
  counts[] snapshot (concurrent clear()/overlapping eviction), which would make
  counts[counts.length - MAX_HOTSPOTS] index negative and throw. Re-check the
  snapshot length before indexing.
- StructuredConcurrencyView: the depth-cap elision indented with
  repeat(MAX_RENDER_DEPTH) (1024 blank chars per elided node); use a modest
  fixed indent instead.

Signed-off-by: rlaope <piyrw9754@gmail.com>
@rlaope rlaope merged commit 4f82ae4 into master May 29, 2026
7 checks passed
@rlaope rlaope deleted the fix/large-scale-hardening branch May 29, 2026 05:30
rlaope added a commit that referenced this pull request May 29, 2026
…metheus parity (#263)

The code review on #262 surfaced that several argus-unique series (no OTel
semconv equivalent) shipped on the Prometheus scrape but were absent from the
OTLP export — an operator switching from scrape to OTLP silently lost leak
detection and GC-rate diagnostics.

Add the scalar argus-unique series to OtlpJsonBuilder, in parity with
PrometheusMetricsCollector and with matching values:
- argus_gc_pause_time_seconds_avg, argus_gc_overhead_warning
- argus_gc_allocation_rate_kbps, argus_gc_promotion_rate_kbps
- argus_gc_leak_suspected, argus_gc_leak_confidence
- argus_heap_usage_ratio
- argus_metaspace_reserved_bytes
- argus_carrier_threads_virtual_handled_total

These have no semconv equivalent, so they are always emitted (not gated by
argus.metrics.legacyNames). A test asserts they ship even when legacyNames=false.

Intentionally still Prometheus-only (documented, not a regression): the legacy
argus_gc_pause_seconds_* histogram (the OTLP path uses the semconv
jvm.gc.duration histogram instead), and the label-dimensioned breakdown series
(argus_gc_{pause,events}_breakdown_*, argus_allocation_class_bytes,
argus_profiling_method_samples, argus_contention_hotspot_events) which need
per-series OTLP attribute support — a separate, larger change.

Signed-off-by: rlaope <piyrw9754@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant