feat: OTel JVM semconv compliance (dual-emit jvm.* alongside argus_*)#253

Merged: rlaope merged 2 commits into master from feat/w1-otel-semconv on May 29, 2026
@rlaope rlaope commented May 29, 2026

Summary

Workstream W1 — OTel JVM semantic-convention compliance.

Argus now speaks the OpenTelemetry JVM runtime semantic conventions. A new single
source-of-truth mapping layer (SemconvMetrics) maps Argus GC/heap/thread/class/cpu
metrics to the standard OTel names, and both the Prometheus collector and the OTLP
exporter draw their names, units, and types from that one table.

Standard metrics covered: jvm.gc.duration, jvm.memory.used, jvm.memory.committed,
jvm.memory.used_after_last_gc, jvm.thread.count, jvm.class.count, jvm.cpu.time,
jvm.cpu.recent_utilization — with UCUM units (s, By, 1, {thread}, {class})
and the standard attributes (jvm.gc.name, jvm.gc.action, jvm.memory.pool.name).
In Prometheus exposition the dots become underscores (jvm_gc_duration_seconds, etc.).
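
The dot-to-underscore translation can be illustrated with a minimal sketch. The class and method names below are hypothetical and are not taken from SemconvMetrics; only jvm_gc_duration_seconds is confirmed by this PR, and the handling of the other units is an assumption based on common Prometheus naming practice.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not the actual SemconvMetrics implementation.
class SemconvNameSketch {

    /** Maps a UCUM unit to a Prometheus name suffix (assumed convention). */
    static String unitSuffix(String unit) {
        switch (unit) {
            case "s":
                return "_seconds";
            case "By":
                return "_bytes";
            default:
                // "1" and annotation units such as {thread} get no suffix in
                // this sketch; the real table may treat them differently.
                return "";
        }
    }

    /** Replaces dots with underscores and appends the unit suffix. */
    static String toPrometheusName(String otelName, String unit) {
        return otelName.replace('.', '_') + unitSuffix(unit);
    }

    public static void main(String[] args) {
        Map<String, String> unitsByName = new LinkedHashMap<>();
        unitsByName.put("jvm.gc.duration", "s");
        unitsByName.put("jvm.memory.used", "By");
        unitsByName.put("jvm.thread.count", "{thread}");
        for (Map.Entry<String, String> e : unitsByName.entrySet()) {
            System.out.println(e.getKey() + " -> " + toPrometheusName(e.getKey(), e.getValue()));
        }
    }
}
```

Keeping the unit next to the OTel name is what lets one table drive both exporters: OTLP can send the name and unit as-is, while the Prometheus path derives its series name from the same row.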

Compatibility

  • The legacy argus_* series are still emitted by default, gated behind a new flag
    argus.metrics.legacyNames (default true). Existing dashboards keep working;
    set argus.metrics.legacyNames=false to emit only the standard jvm_* names.
  • Argus-unique metrics keep the argus.* namespace and are emitted regardless of
    the flag — they are differentiators, not "legacy" duplicates. Examples:
    argus_gc_leak_confidence, argus_gc_leak_suspected, argus_gc_overhead_ratio,
    GC allocation/promotion rates, carrier-thread skew, argus_metaspace_reserved_bytes,
    virtual-thread start/end/pinning counters, profiling samples.
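
A rough sketch of the gating described above, under two assumptions that this PR does not confirm: that argus.metrics.legacyNames is read as a JVM system property, and that the legacy heap series is called argus_heap_used_bytes (a made-up name for illustration). The standard name jvm_memory_used_bytes is likewise inferred from the naming rule rather than quoted from the source.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the dual-emit rule, not Argus code.
class LegacyFlagSketch {

    /** Returns the series names that would be exposed for a given flag value. */
    static List<String> seriesNames(boolean legacyNames) {
        List<String> names = new ArrayList<>();
        // Standard semconv series: always emitted.
        names.add("jvm_memory_used_bytes");
        if (legacyNames) {
            // Legacy duplicate of the standard series (illustrative name).
            names.add("argus_heap_used_bytes");
        }
        // Argus-unique metric: emitted regardless of the flag.
        names.add("argus_gc_leak_confidence");
        return names;
    }

    public static void main(String[] args) {
        // Assumption: the flag is a system property that defaults to true.
        boolean legacy = Boolean.parseBoolean(
                System.getProperty("argus.metrics.legacyNames", "true"));
        System.out.println(seriesNames(legacy));
    }
}
```

Run with `-Dargus.metrics.legacyNames=false` to see the legacy duplicate drop out while the Argus-unique series stays.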

OTLP resource attributes

OtlpMetricsExporter (via OtlpJsonBuilder) now stamps the full OTel resource set:
service.name, service.namespace (K8s namespace via Downward API), service.instance.id
(pod name / hostname), telemetry.sdk.name, telemetry.sdk.language=java,
telemetry.sdk.version. The OTLP payload also emits the semconv metric names alongside
the legacy names.
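
For orientation, the resource block in OTLP/JSON is a list of key/value pairs with typed values. The sketch below builds that shape by hand; it is not OtlpJsonBuilder.appendResource, and the environment variable names and attribute values are placeholders chosen for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical sketch of the OTLP/JSON resource shape, not Argus code.
class ResourceJsonSketch {

    /** Escapes backslashes and double quotes only; enough for this sketch. */
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    /** Renders string attributes as an OTLP/JSON resource object. */
    static String resourceJson(Map<String, String> attrs) {
        StringJoiner items = new StringJoiner(",", "{\"attributes\":[", "]}");
        for (Map.Entry<String, String> e : attrs.entrySet()) {
            items.add("{\"key\":\"" + escape(e.getKey())
                    + "\",\"value\":{\"stringValue\":\"" + escape(e.getValue()) + "\"}}");
        }
        return items.toString();
    }

    /** Reads an environment variable, falling back when it is unset or empty. */
    static String envOr(String name, String fallback) {
        String v = System.getenv(name);
        return (v == null || v.isEmpty()) ? fallback : v;
    }

    public static void main(String[] args) {
        Map<String, String> attrs = new LinkedHashMap<>();
        attrs.put("service.name", "demo-app");                              // placeholder
        attrs.put("service.namespace", envOr("POD_NAMESPACE", "default")); // assumed env var
        attrs.put("service.instance.id", envOr("HOSTNAME", "localhost"));  // assumed env var
        attrs.put("telemetry.sdk.name", "argus");                           // placeholder
        attrs.put("telemetry.sdk.language", "java");
        attrs.put("telemetry.sdk.version", "0.0.0");                        // placeholder
        System.out.println(resourceJson(attrs));
    }
}
```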

Acceptance criteria and verification

  1. Semconv mapping layer (single source of truth): SemconvMetrics holds the table
    (OTel name, Prometheus name, unit, type, description, attributes). Verified by
    SemconvMetricsComplianceTest.mapping_table_units_are_correct and
    mapping_table_declares_required_attributes.
  2. Dual-emit + legacy flag: Prometheus emits jvm_* alongside argus_* by default;
    argus.metrics.legacyNames=false drops the argus_* duplicates. Verified by
    emits_standard_jvm_series_alongside_legacy_by_default and
    disabling_legacy_names_drops_argus_duplicates_but_keeps_jvm_and_unique.
  3. OTLP resource attributes + semconv names: see OtlpJsonBuilder.appendResource
    and the semconv emissions in each metric section.
  4. /prometheus stays valid: every series has HELP/TYPE and there are no duplicate
    series. Verified by every_series_has_help_and_type_and_no_duplicate_series.
  5. Mapping table covers every exported standard metric, units correct: verified by
    emitted_jvm_names_match_the_mapping_table and mapping_table_units_are_correct.
  6. Argus-unique metrics not renamed, emitted regardless of the flag: verified by
    argus_unique_metrics_emitted_regardless_of_legacy_flag.
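
Criterion 4 can be pictured as a small check over the exposition text. The following is a simplified sketch, not the actual every_series_has_help_and_type_and_no_duplicate_series test: it assumes samples carry no timestamps and that label values contain no spaces.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical validator for Prometheus text exposition, not Argus code.
class ExpositionCheckSketch {

    /** Maps histogram sample names (_bucket/_sum/_count) back to their family. */
    static String familyOf(String sampleName, Set<String> typed) {
        for (String suffix : new String[] {"_bucket", "_sum", "_count"}) {
            if (sampleName.endsWith(suffix)) {
                String base = sampleName.substring(0, sampleName.length() - suffix.length());
                if (typed.contains(base)) {
                    return base;
                }
            }
        }
        return sampleName;
    }

    /** Returns a list of problems; an empty list means the text passed. */
    static List<String> validate(String exposition) {
        Set<String> help = new HashSet<>();
        Set<String> type = new HashSet<>();
        List<String> samples = new ArrayList<>();
        for (String raw : exposition.split("\n")) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue;
            }
            String[] parts = line.split(" ", 4);
            if (line.startsWith("# HELP ") && parts.length >= 3) {
                help.add(parts[2]);
            } else if (line.startsWith("# TYPE ") && parts.length >= 3) {
                type.add(parts[2]);
            } else if (!line.startsWith("#")) {
                samples.add(line);
            }
        }
        List<String> problems = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        for (String line : samples) {
            int sp = line.lastIndexOf(' ');
            if (sp < 0) {
                problems.add("malformed sample: " + line);
                continue;
            }
            String series = line.substring(0, sp);      // name plus label set
            int brace = series.indexOf('{');
            String name = brace < 0 ? series : series.substring(0, brace);
            String family = familyOf(name, type);
            if (!help.contains(family)) {
                problems.add("missing HELP: " + family);
            }
            if (!type.contains(family)) {
                problems.add("missing TYPE: " + family);
            }
            if (!seen.add(series)) {
                problems.add("duplicate series: " + series);
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        String text = "# HELP jvm_thread_count Number of executing threads.\n"
                + "# TYPE jvm_thread_count gauge\n"
                + "jvm_thread_count 42\n";
        System.out.println(validate(text));
    }
}
```

Dual-emit makes the duplicate check worth having: once one code path emits both jvm_* and argus_* names, an accidental double registration of the same name and label set would otherwise only show up as a scrape error.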

Deviations

  • jvm.cpu.time (a monotonic process-CPU-seconds counter) has no JFR-derived source in
    Argus today — only the recent-utilization ratio is available. It is kept in the mapping
    table as the documented contract but is not exported yet. The mapping table is the
    superset/contract, so the "every exported metric is in the table" assertion still holds.
  • The jvm.gc.duration histogram is exported as the existing aggregate; it is not yet
    split per jvm.gc.name/jvm.gc.action (Argus tracks only an aggregate pause histogram).
    The per-collector/per-cause breakdown remains available as the Argus-unique
    argus_gc_pause_breakdown_seconds_total series.

Test / build notes

./gradlew :argus-server:test --tests "io.argus.server.metrics.*" and
:argus-core:test both pass. Two unrelated failures that already exist on master are
out of scope for W1; both reproduce with this branch's changes stashed:
CorrelationAnalyzerTracePauseTest (uses hardcoded 2026-05-28 timestamps that the 5-minute
retention prunes) and a Gradle implicit-dependency validation error on :argus-server:jar
vs :argus-diagnostics:jar.

🤖 Generated with Claude Code

rlaope added 2 commits May 29, 2026 09:55
Add a single source-of-truth mapping layer (SemconvMetrics) that maps Argus
GC/heap/thread/class/cpu metrics to standard OpenTelemetry JVM semantic-
convention names (jvm.gc.duration, jvm.memory.used, jvm.memory.committed,
jvm.memory.used_after_last_gc, jvm.thread.count, jvm.class.count, jvm.cpu.time,
jvm.cpu.recent_utilization) with UCUM units and the standard attributes
(jvm.gc.name, jvm.gc.action, jvm.memory.pool.name). In Prometheus exposition
dots become underscores (jvm_gc_duration_seconds, ...).

PrometheusMetricsCollector now emits the standard jvm_* series alongside the
existing argus_* series. The legacy argus_* duplicates are gated behind a new
flag argus.metrics.legacyNames (default true) so existing dashboards keep
working. Argus-unique metrics with no semconv equivalent (leak confidence,
carrier-thread skew, reserved metaspace, profiling samples, ...) stay in the
argus.* namespace and are emitted regardless of the flag.

OtlpMetricsExporter now stamps the full OTel resource attributes (service.name,
service.namespace, service.instance.id, telemetry.sdk.name,
telemetry.sdk.language=java, telemetry.sdk.version) and emits the semconv
metric names alongside the legacy names.

Signed-off-by: rlaope <piyrw9754@gmail.com>
…cs contract

The OTLP push path had diverged from the Prometheus path on the OTel JVM
semantic-convention series:

- jvm.memory.used / jvm.memory.committed / jvm.memory.used_after_last_gc
  carried no attributes. Emit jvm.memory.pool.name="heap" on each data point
  so OTLP memory series match the Prometheus {jvm_memory_pool_name="heap"}
  identity.
- The Metaspace memory pool was missing on OTLP. Emit jvm.memory.used /
  jvm.memory.committed with pool="Metaspace", gated by the same
  metaspace-enabled + non-null-analyzer guard the Prometheus path uses.
- jvm.gc.duration was absent on OTLP. Emit it as an OTLP Histogram data point
  from the same aggregate pause histogram the Prometheus path uses
  (explicitBounds + per-bucket counts + sum + count); skipped when there is
  no pause data.
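
One detail worth spelling out for the histogram item: an OTLP Histogram data point carries per-bucket (non-cumulative) counts, one more than the number of explicitBounds, whereas Prometheus `le` buckets are cumulative. The sketch below shows that conversion. It assumes the aggregate pause histogram is held as cumulative counts, which this PR does not state, and the class name is hypothetical.

```java
import java.util.Arrays;

// Hypothetical conversion helper, not the Argus exporter.
class OtlpHistogramSketch {

    /**
     * Converts cumulative bucket counts (Prometheus style, last entry is the
     * +Inf bucket) into per-bucket counts as used by OTLP bucketCounts.
     */
    static long[] toBucketCounts(long[] cumulative) {
        long[] counts = new long[cumulative.length];
        long previous = 0;
        for (int i = 0; i < cumulative.length; i++) {
            if (cumulative[i] < previous) {
                throw new IllegalArgumentException("cumulative counts must not decrease");
            }
            counts[i] = cumulative[i] - previous;
            previous = cumulative[i];
        }
        return counts;
    }

    public static void main(String[] args) {
        double[] explicitBounds = {0.01, 0.1, 1.0};  // seconds, illustrative
        long[] cumulative = {5, 8, 9, 10};           // le=0.01, 0.1, 1.0, +Inf
        long[] bucketCounts = toBucketCounts(cumulative);
        // bucketCounts has explicitBounds.length + 1 entries and sums to the total count.
        System.out.println(Arrays.toString(explicitBounds) + " " + Arrays.toString(bucketCounts));
    }
}
```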

SemconvMetrics.GC_DURATION previously declared jvm.gc.name/jvm.gc.action as
required, but Argus keeps only one aggregate pause histogram and cannot split
it per collector/cause without fabricating bucket data. Removed those
attributes from the table so the declared contract matches what is emitted;
the per-collector/per-cause breakdown remains the Argus-unique
argus_gc_pause_breakdown_seconds_total / argus_gc_events_breakdown_total series.

Tests assert OTLP memory data points carry the pool attribute, OTLP emits the
Metaspace pool and the jvm.gc.duration histogram, and the SemconvMetrics table
declares no attribute that is never emitted.

Signed-off-by: rlaope <piyrw9754@gmail.com>
@rlaope rlaope merged commit 77d1532 into master May 29, 2026
3 of 9 checks passed
@rlaope rlaope deleted the feat/w1-otel-semconv branch May 29, 2026 03:12