feat: JVM right-sizing for FinOps (argus rightsize + fleet roll-up) by rlaope · Pull Request #255 · rlaope/Argus

rlaope · 2026-05-29T01:14:00Z

What & why — the FinOps angle

argus rightsize <pid> turns observed JVM behavior into concrete capacity numbers: it recommends -Xmx/-Xms, a container memory request and limit, and a container CPU request, derived from the heap high-water mark, the post-GC live-set floor, allocation/promotion rate, and the metaspace + direct-buffer footprint.

Host-level eBPF/RSS tooling can see how much memory a container uses, but it cannot reason about the live set — how much of the heap survives a collection, which is the only defensible floor for -Xmx. RSS conflates heap, off-heap, page cache and GC scratch; it can't tell you "this 4 GB pod only ever keeps 600 MB live, so -Xmx can come down." Argus reads that directly from the JVM, so the recommendation is anchored to the live set and the safety factor, not to a noisy RSS number.

Example output

╭─ JVM Right-Sizing ── floor:83MiB ── safety:1.5x ─────────────────────────────╮
│   Recommended -Xmx:        -Xmx125m                                          │
│   Recommended -Xms:        -Xms125m                                          │
│   Container memory request: 146 MiB                                          │
│   Container memory limit:  183 MiB                                           │
│   Container CPU request:   0.50 cores                                        │
├──────────────────────────────────────────────────────────────────────────────┤
│   Inputs (recommendation is not a black box)                                 │
│     Heap high-water mark:   101 MiB                                          │
│     Post-GC live-set floor: 83 MiB                                           │
│     Metaspace footprint:    0 MiB                                            │
│     Live threads:           20                                               │
│     Allocation rate (est.): 8 MiB/s                                          │
│     Safety factor:          1.5x above the live-set floor                    │
│   -Xmx is never recommended below the observed post-GC live-set floor        │
╰──────────────────────────────────────────────────────────────────────────────╯
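The numbers in the box can be reproduced with a few lines of arithmetic. The sketch below is illustrative only and is not the actual RightsizeRecommender code. The class and method names are hypothetical. The 1.25 limit multiplier is taken from the follow-up commit description. The 21 MiB native overhead and the "request = -Xmx + native" relation are inferred from the example output (146 MiB request minus 125 MiB -Xmx), not from the source.

```java
// Hypothetical sketch: reproduces the arithmetic visible in the example output.
final class RightsizeArithmeticSketch {

    /** -Xmx in MiB: the live-set floor scaled by the safety factor, never below the floor. */
    static long recommendXmxMiB(long liveSetFloorMiB, double safetyFactor) {
        long scaled = Math.round(liveSetFloorMiB * safetyFactor);
        return Math.max(scaled, liveSetFloorMiB); // hard floor guarantee
    }

    /** Container memory request: heap plus native overhead (assumed relation). */
    static long memoryRequestMiB(long xmxMiB, long nativeMiB) {
        return xmxMiB + nativeMiB;
    }

    /** Container memory limit: round((xmx + native) * 1.25), per the commit description. */
    static long memoryLimitMiB(long xmxMiB, long nativeMiB) {
        return Math.round((xmxMiB + nativeMiB) * 1.25);
    }

    public static void main(String[] args) {
        long floorMiB = 83;   // post-GC live-set floor from the example
        long nativeMiB = 21;  // assumption: back-derived from 146 - 125
        long xmx = recommendXmxMiB(floorMiB, 1.5);
        System.out.println("-Xmx" + xmx + "m");
        System.out.println("request: " + memoryRequestMiB(xmx, nativeMiB) + " MiB");
        System.out.println("limit:   " + memoryLimitMiB(xmx, nativeMiB) + " MiB");
    }
}
```

With a floor of 83 MiB and a 1.5x safety factor this yields 125, 146 and 183 MiB, matching the box above.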

With a too-tight current limit it flags the risk RSS tools never surface:

[OOMKILL RISK] container limit (130 MiB) leaves only 4 MiB above -Xmx (125 MiB)
— no room for metaspace/threads/code-cache/direct buffers; raise the limit

--format=json emits a structured object (recommended bytes, OOMKill flag, safety factor, and the full echoed inputs).

Acceptance criteria — how each was verified

Recommends -Xmx/-Xms + container request/limit + CPU request, showing inputs and the safety factor — verified live against a running JVM (rich + JSON output above). The "Inputs" block and safety:1.5x are always rendered.
Hard floor + OOMKill-risk flag — -Xmx (125 MiB) sits above the live-set floor (83 MiB); unit test xmxNeverDropsBelowFloorEvenWhenSafetyFactorWouldRound asserts the floor guarantee. The OOMKill flag fired live with --limit=130m and is covered by oomKillFlagFiresWhenCurrentLimitApproximatesXmx.
Short-window refusal — verified live (the JVM under 60s uptime returned a refusal with reason); covered by shortWindowIsRefusedWithReason and tooFewGcCyclesIsRefused.
--format=json + GET /fleet/rightsize — JSON object verified live; the aggregator endpoint returns a per-deployment current-vs-recommended table with an aggregate savings estimate, reusing the ring-buffer samples already held. Covered by FleetRightsizeComputerTest (over-provisioned shows savings, hot deployment shows none, no-data reports nulls, JSON shape).
Pure-function core, unit-testable without a live JVM — RightsizeRecommender has zero JVM-attach dependency; RightsizeRecommenderTest drives the band/floor/OOMKill/refusal logic directly.
Honesty rails — every number shows its inputs + safety factor; the footer states "-Xmx is never recommended below the observed post-GC live-set floor."
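The fleet roll-up behaviour named in the FleetRightsizeComputerTest cases (over-provisioned shows savings, hot deployment shows none, no-data reports nulls) could be sketched as below. This is an assumption about the shape of the computation, not the aggregator's actual code: the class name, method names, and the choice to clamp negative savings to zero are all hypothetical.

```java
// Hypothetical sketch of a per-deployment comparison and an aggregate savings estimate.
final class FleetSavingsSketch {

    /** Recommended -Xmx as a percentage of current -Xmx; null when data is missing. */
    static Integer recommendedPercentOfCurrent(Long currentXmx, Long recommendedXmx) {
        if (currentXmx == null || recommendedXmx == null || currentXmx <= 0) {
            return null; // no data: report null rather than a made-up number
        }
        return (int) Math.round(recommendedXmx * 100.0 / currentXmx);
    }

    /** Sum of (current - recommended) over deployments, counting only positive differences. */
    static long aggregateSavings(Long[] current, Long[] recommended) {
        long total = 0;
        for (int i = 0; i < current.length; i++) {
            if (current[i] == null || recommended[i] == null) {
                continue; // deployments without samples contribute nothing
            }
            total += Math.max(0L, current[i] - recommended[i]); // hot deployments add zero
        }
        return total;
    }

    public static void main(String[] args) {
        Long[] current = {4096L, 512L, null};     // MiB: over-provisioned, hot, no data
        Long[] recommended = {900L, 640L, null};
        for (int i = 0; i < current.length; i++) {
            System.out.println(recommendedPercentOfCurrent(current[i], recommended[i]));
        }
        System.out.println("savings: " + aggregateSavings(current, recommended) + " MiB");
    }
}
```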

Tests / build

./gradlew :argus-cli:test :argus-aggregator:test — BUILD SUCCESSFUL. New: RightsizeRecommenderTest (9 tests), FleetRightsizeComputerTest (4 tests), all green; full suites pass.
./gradlew :argus-cli:fatJar — BUILD SUCCESSFUL, argus-cli-1.5.0-all.jar (~2.56 MB).

i18n parity

21 new rightsize.* / cmd.rightsize.desc keys were added to all four locales, using printf-style %s placeholders. Per-locale key count is 680 (en/ko/ja/zh all equal; was 659 + 21).

Notes

The CLI module targets Java 11 source (options.release = 11), so the recommender value objects are plain final classes with accessors rather than records. The aggregator (Java 21) uses records. No build.gradle.kts was modified.
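For readers unfamiliar with the trade-off, a Java 11 value object written as a plain final class looks roughly like the sketch below. The class name HeapAdvice and its fields are hypothetical and are not taken from the PR. On Java 16 or later the same shape could be declared as `record HeapAdvice(long xmxBytes, long xmsBytes) {}`.

```java
import java.util.Objects;

// Hypothetical Java 11 compatible value object: immutable fields plus accessors.
final class HeapAdvice {
    private final long xmxBytes;
    private final long xmsBytes;

    HeapAdvice(long xmxBytes, long xmsBytes) {
        this.xmxBytes = xmxBytes;
        this.xmsBytes = xmsBytes;
    }

    // Record-style accessor names, so a later move to records keeps call sites intact.
    long xmxBytes() { return xmxBytes; }
    long xmsBytes() { return xmsBytes; }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof HeapAdvice)) return false;
        HeapAdvice other = (HeapAdvice) o;
        return xmxBytes == other.xmxBytes && xmsBytes == other.xmsBytes;
    }

    @Override
    public int hashCode() {
        return Objects.hash(xmxBytes, xmsBytes);
    }

    @Override
    public String toString() {
        return "HeapAdvice[xmxBytes=" + xmxBytes + ", xmsBytes=" + xmsBytes + "]";
    }

    public static void main(String[] args) {
        System.out.println(new HeapAdvice(125L << 20, 125L << 20));
    }
}
```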

🤖 Generated with Claude Code

Add an argus rightsize <pid> command that recommends -Xmx/-Xms, container memory request/limit, and a CPU request derived from the observed heap high-water mark, the post-GC live-set floor, allocation/promotion rate, and the metaspace + direct-buffer footprint.

- Pure-function core (RightsizeRecommender) with no JVM-attach dependency, so the headroom/recommendation math is unit-testable without a live JVM.
- Hard floor: recommended -Xmx is never below the observed post-GC live-set floor. An OOMKill-risk flag fires when the container limit leaves no native headroom above -Xmx (metaspace/threads/code-cache/direct buffers).
- Refuses with a clear reason when the observation window is too short to be defensible (< 60s or < 2 GC cycles).
- --format=json emits a structured recommendation object; the output always shows its inputs and the safety factor (never a black box).
- Aggregator GET /fleet/rightsize returns a per-deployment current-vs-recommended memory table (in % of current -Xmx) with an aggregate savings estimate, reusing the fleet ring-buffer samples already held.
- i18n: 21 new keys added to all four locale files (printf %s), keeping key counts in parity at 680 each.

Signed-off-by: rlaope <piyrw9754@gmail.com>
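The short-window refusal described in this commit (under 60s of uptime or fewer than 2 GC cycles) could look roughly like the following. This is a minimal sketch under stated assumptions: the class name, method name, and reason strings are hypothetical, and only the two thresholds come from the commit message.

```java
import java.util.Optional;

// Hypothetical sketch of the "refuse when the window is too short" rule.
final class ObservationWindowSketch {
    static final long MIN_UPTIME_SECONDS = 60; // from the commit message
    static final int MIN_GC_CYCLES = 2;        // from the commit message

    /** Returns a refusal reason, or an empty Optional when the window is defensible. */
    static Optional<String> refusalReason(long uptimeSeconds, int gcCycles) {
        if (uptimeSeconds < MIN_UPTIME_SECONDS) {
            return Optional.of("observation window too short: " + uptimeSeconds
                    + "s of uptime, need at least " + MIN_UPTIME_SECONDS + "s");
        }
        if (gcCycles < MIN_GC_CYCLES) {
            return Optional.of("too few GC cycles observed: " + gcCycles
                    + ", need at least " + MIN_GC_CYCLES);
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(refusalReason(30, 5).orElse("ok"));
        System.out.println(refusalReason(600, 1).orElse("ok"));
        System.out.println(refusalReason(600, 5).orElse("ok"));
    }
}
```

Returning the reason as a value, rather than throwing, keeps the rule a pure function that can be tested without a live JVM, in the spirit of the pure-function core described above.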

The OOMKill-risk check evaluated headroom against Argus's own recommended container limit when no --limit was supplied. That limit is round((xmx + native) * 1.25) by construction, so headroomAboveXmx was always far above the 10% threshold and the flag could never fire on the default path.

Make the verdict three-state (AT_RISK / OK / UNKNOWN):

- observed limit present (--limit or known cgroup limit): evaluate headroom against that real limit, as before.
- no observed limit (== 0): report UNKNOWN ("n/a"), never a false all-clear, and tell the operator to pass --limit to get a real answer.

The circular self-comparison against the recommended limit is removed entirely. Also round the headroom threshold instead of truncating (long)(xmx * 0.10).

JSON gains an oomKillRiskState field; the human-readable output renders all three states. New i18n keys rightsize.oom.ok / rightsize.oom.na added to all four locales (682 keys each).

Signed-off-by: rlaope <piyrw9754@gmail.com>
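A minimal sketch of the three-state verdict described in this commit follows. It is not the PR's actual code: the class and method names are hypothetical, and the strict `<` comparison against the threshold is an assumption. The state names, the "0 means no observed limit" convention, and the rounded 10% threshold come from the commit message.

```java
// Hypothetical sketch of the AT_RISK / OK / UNKNOWN verdict.
final class OomKillRiskSketch {
    enum State { AT_RISK, OK, UNKNOWN }

    /**
     * @param observedLimitBytes container limit from --limit or the cgroup; 0 when unknown
     * @param xmxBytes           recommended -Xmx in bytes
     */
    static State evaluate(long observedLimitBytes, long xmxBytes) {
        if (observedLimitBytes == 0) {
            return State.UNKNOWN; // never a false all-clear without a real limit
        }
        long headroomAboveXmx = observedLimitBytes - xmxBytes;
        long threshold = Math.round(xmxBytes * 0.10); // rounded, not truncated
        return headroomAboveXmx < threshold ? State.AT_RISK : State.OK;
    }

    public static void main(String[] args) {
        long mib = 1L << 20;
        System.out.println(evaluate(130 * mib, 125 * mib)); // limit from the PR's --limit=130m run
        System.out.println(evaluate(183 * mib, 125 * mib)); // the recommended limit from the example
        System.out.println(evaluate(0, 125 * mib));         // no observed limit
    }
}
```

With --limit=130m and -Xmx at 125 MiB the headroom is 5 MiB against a threshold of about 12.5 MiB, so the sketch reports AT_RISK, consistent with the warning shown in the PR description.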

rlaope added 2 commits May 29, 2026 10:13

rlaope merged commit 083b567 into master May 29, 2026
3 of 9 checks passed

rlaope deleted the feat/w3-rightsize branch May 29, 2026 03:12

rlaope merged 2 commits into master from feat/w3-rightsize
