feat: JVM right-sizing for FinOps (argus rightsize + fleet roll-up) #255

Merged: rlaope merged 2 commits into master from feat/w3-rightsize on May 29, 2026
rlaope (Owner) commented May 29, 2026

What & why: the FinOps angle

argus rightsize <pid> turns observed JVM behavior into concrete capacity numbers: it recommends -Xmx/-Xms, a container memory request and limit, and a container CPU request, derived from the heap high-water mark, the post-GC live-set floor, allocation/promotion rate, and the metaspace + direct-buffer footprint.

Host-level eBPF/RSS tooling can see how much memory a container uses, but it cannot reason about the live set: how much of the heap survives a collection, which is the only defensible floor for -Xmx. RSS conflates heap, off-heap, page cache and GC scratch; it can't tell you "this 4 GB pod only ever keeps 600 MB live, so -Xmx can come down." Argus reads that directly from the JVM, so the recommendation is anchored to the live set and the safety factor, not to a noisy RSS number.

Example output

╭─ JVM Right-Sizing ── floor:83MiB ── safety:1.5x ─────────────────────────────╮
│   Recommended -Xmx:        -Xmx125m                                          │
│   Recommended -Xms:        -Xms125m                                          │
│   Container memory request: 146 MiB                                          │
│   Container memory limit:  183 MiB                                           │
│   Container CPU request:   0.50 cores                                        │
├──────────────────────────────────────────────────────────────────────────────┤
│   Inputs (recommendation is not a black box)                                 │
│     Heap high-water mark:   101 MiB                                          │
│     Post-GC live-set floor: 83 MiB                                           │
│     Metaspace footprint:    0 MiB                                            │
│     Live threads:           20                                               │
│     Allocation rate (est.): 8 MiB/s                                          │
│     Safety factor:          1.5x above the live-set floor                    │
│   -Xmx is never recommended below the observed post-GC live-set floor        │
╰──────────────────────────────────────────────────────────────────────────────╯
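
The figures in the example output are consistent with the arithmetic sketched below. This is an illustration only, not the actual RightsizeRecommender source: the class and method names are hypothetical, the sketch works in whole MiB rather than bytes, the 21 MiB native footprint is back-derived from the example (146 - 125), and treating the memory request as -Xmx plus native footprint is an inference from those numbers. The 1.25 limit multiplier comes from the second commit message on this PR.

```java
// Hypothetical sketch of the sizing arithmetic; names and units are assumptions.
final class RightsizeSketch {
    static final double SAFETY_FACTOR = 1.5;     // "safety:1.5x" in the example output
    static final double LIMIT_MULTIPLIER = 1.25; // commit note: round((xmx + native) * 1.25)

    /** Recommended -Xmx in MiB: floor times safety factor, clamped so it never drops below the floor. */
    static long recommendXmxMiB(long liveSetFloorMiB) {
        long scaled = Math.round(liveSetFloorMiB * SAFETY_FACTOR);
        return Math.max(scaled, liveSetFloorMiB);
    }

    /** Assumed: the memory request covers the heap plus the native (off-heap) footprint. */
    static long memoryRequestMiB(long xmxMiB, long nativeMiB) {
        return xmxMiB + nativeMiB;
    }

    /** Memory limit: request plus 25% headroom, rounded to the nearest MiB. */
    static long memoryLimitMiB(long xmxMiB, long nativeMiB) {
        return Math.round((xmxMiB + nativeMiB) * LIMIT_MULTIPLIER);
    }

    public static void main(String[] args) {
        long floorMiB = 83;   // post-GC live-set floor from the example
        long nativeMiB = 21;  // assumption, back-derived from the example output
        long xmx = recommendXmxMiB(floorMiB);
        System.out.println("-Xmx" + xmx + "m");                                    // -Xmx125m
        System.out.println("request " + memoryRequestMiB(xmx, nativeMiB) + " MiB"); // request 146 MiB
        System.out.println("limit " + memoryLimitMiB(xmx, nativeMiB) + " MiB");     // limit 183 MiB
    }
}
```

The `Math.max` clamp mirrors the footer's guarantee: whatever the safety factor and rounding do, the result cannot fall below the observed live-set floor.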

With a too-tight current limit it flags the risk RSS tools never surface:

[OOMKILL RISK] container limit (130 MiB) leaves only 4 MiB above -Xmx (125 MiB)
  - no room for metaspace/threads/code-cache/direct buffers; raise the limit

--format=json emits a structured object (recommended bytes, OOMKill flag, safety factor, and the full echoed inputs).

Acceptance criteria: how each was verified

  1. Recommends -Xmx/-Xms + container request/limit + CPU request, showing inputs and the safety factor: verified live against a running JVM (rich + JSON output above). The "Inputs" block and safety:1.5x are always rendered.
  2. Hard floor + OOMKill-risk flag: -Xmx (125 MiB) sits above the live-set floor (83 MiB); unit test xmxNeverDropsBelowFloorEvenWhenSafetyFactorWouldRound asserts the floor guarantee. The OOMKill flag fired live with --limit=130m and is covered by oomKillFlagFiresWhenCurrentLimitApproximatesXmx.
  3. Short-window refusal: verified live (a JVM with under 60s of uptime returned a refusal with a reason); covered by shortWindowIsRefusedWithReason and tooFewGcCyclesIsRefused.
  4. --format=json + GET /fleet/rightsize: JSON object verified live; the aggregator endpoint returns a per-deployment current-vs-recommended table with an aggregate savings estimate, reusing the ring-buffer samples already held. Covered by FleetRightsizeComputerTest (over-provisioned shows savings, hot deployment shows none, no-data reports nulls, JSON shape).
  5. Pure-function core, unit-testable without a live JVM: RightsizeRecommender has zero JVM-attach dependency; RightsizeRecommenderTest drives the band/floor/OOMKill/refusal logic directly.
  6. Honesty rails: every number shows its inputs + safety factor; the footer states "-Xmx is never recommended below the observed post-GC live-set floor."
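
The short-window refusal in criterion 3 can be pictured as a guard that runs before any recommendation is computed. The sketch below is hypothetical: the class name, method name, and message wording are illustrative and not taken from the Argus source. Only the two thresholds (60 seconds, 2 GC cycles) come from the commit message on this PR.

```java
import java.util.Optional;

// Hypothetical refusal gate; identifiers and messages are assumptions for illustration.
final class RefusalGateSketch {
    static final long MIN_UPTIME_SECONDS = 60; // commit message: refuse when window < 60s
    static final int MIN_GC_CYCLES = 2;        // commit message: refuse when < 2 GC cycles

    /** Returns a human-readable refusal reason, or empty when the window is long enough. */
    static Optional<String> refusalReason(long uptimeSeconds, int gcCycles) {
        if (uptimeSeconds < MIN_UPTIME_SECONDS) {
            return Optional.of("observation window too short: " + uptimeSeconds
                    + "s < " + MIN_UPTIME_SECONDS + "s");
        }
        if (gcCycles < MIN_GC_CYCLES) {
            return Optional.of("too few GC cycles: " + gcCycles + " < " + MIN_GC_CYCLES);
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // A freshly started JVM is refused with a reason instead of getting a guess.
        System.out.println(refusalReason(45, 5).orElse("ok"));
        System.out.println(refusalReason(300, 1).orElse("ok"));
        System.out.println(refusalReason(300, 4).orElse("ok"));
    }
}
```

Returning the reason as a value, rather than throwing, keeps the gate a pure function, which fits the unit-testable core described in criterion 5.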

Tests / build

  • ./gradlew :argus-cli:test :argus-aggregator:test: BUILD SUCCESSFUL. New: RightsizeRecommenderTest (9 tests), FleetRightsizeComputerTest (4 tests), all green; full suites pass.
  • ./gradlew :argus-cli:fatJar: BUILD SUCCESSFUL, argus-cli-1.5.0-all.jar (~2.56 MB).

i18n parity

21 new rightsize.* / cmd.rightsize.desc keys added to all four locales with printf %s. Per-locale key count is 680 (en/ko/ja/zh all equal; was 659 + 21).

Notes

  • The CLI module targets Java 11 source (options.release = 11), so the recommender value objects are plain final classes with accessors rather than records. The aggregator (Java 21) uses records. No build.gradle.kts was modified.

🤖 Generated with Claude Code

rlaope added 2 commits May 29, 2026 10:13
Add an argus rightsize <pid> command that recommends -Xmx/-Xms, container
memory request/limit, and a CPU request derived from the observed heap
high-water mark, the post-GC live-set floor, allocation/promotion rate, and
the metaspace + direct-buffer footprint.

- Pure-function core (RightsizeRecommender) with no JVM-attach dependency, so
  the headroom/recommendation math is unit-testable without a live JVM.
- Hard floor: recommended -Xmx is never below the observed post-GC live-set
  floor. An OOMKill-risk flag fires when the container limit leaves no native
  headroom above -Xmx (metaspace/threads/code-cache/direct buffers).
- Refuses with a clear reason when the observation window is too short to be
  defensible (< 60s or < 2 GC cycles).
- --format=json emits a structured recommendation object; the output always
  shows its inputs and the safety factor (never a black box).
- Aggregator GET /fleet/rightsize returns a per-deployment current-vs-
  recommended memory table (in % of current -Xmx) with an aggregate savings
  estimate, reusing the fleet ring-buffer samples already held.
- i18n: 21 new keys added to all four locale files (printf %s), keeping key
  counts in parity at 680 each.

Signed-off-by: rlaope <piyrw9754@gmail.com>
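
The fleet roll-up described in this commit can be sketched as a per-deployment comparison plus a sum. The code below is a hypothetical illustration, not FleetRightsizeComputer itself: the class, field, and method names are invented, and it reports absolute MiB where the commit describes a table in % of current -Xmx. It only mirrors the three behaviors the tests are said to cover (over-provisioned shows savings, hot deployment shows none, no-data reports nulls).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical fleet roll-up; all identifiers are assumptions for illustration.
final class FleetRollupSketch {
    static final class Row {
        final String deployment;
        final Long currentMiB;      // null when no samples are held for this deployment
        final Long recommendedMiB;  // null when no recommendation could be computed

        Row(String deployment, Long currentMiB, Long recommendedMiB) {
            this.deployment = deployment;
            this.currentMiB = currentMiB;
            this.recommendedMiB = recommendedMiB;
        }

        /** Savings in MiB; zero when the recommendation is not lower; null when data is missing. */
        Long savingsMiB() {
            if (currentMiB == null || recommendedMiB == null) {
                return null;
            }
            return Math.max(0L, currentMiB - recommendedMiB);
        }
    }

    /** Sums savings across deployments, skipping rows that have no data. */
    static long aggregateSavingsMiB(List<Row> rows) {
        long total = 0;
        for (Row row : rows) {
            Long savings = row.savingsMiB();
            if (savings != null) {
                total += savings;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        List<Row> rows = new ArrayList<>();
        rows.add(new Row("over-provisioned", 4096L, 900L)); // saves 3196 MiB
        rows.add(new Row("hot", 512L, 600L));               // recommendation is higher: no savings
        rows.add(new Row("no-data", null, null));           // reported as null, not as zero
        System.out.println(aggregateSavingsMiB(rows));      // 3196
    }
}
```

Keeping "no data" as null rather than zero avoids presenting an unobserved deployment as one with nothing to save.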
The OOMKill-risk check evaluated headroom against Argus's own recommended
container limit when no --limit was supplied. That limit is
round((xmx + native) * 1.25) by construction, so headroomAboveXmx was always
far above the 10% threshold and the flag could never fire on the default path.

Make the verdict three-state (AT_RISK / OK / UNKNOWN):
- observed limit present (--limit or known cgroup limit): evaluate headroom
  against that real limit, as before.
- no observed limit (== 0): report UNKNOWN ("n/a"), never a false all-clear,
  and tell the operator to pass --limit to get a real answer. The circular
  self-comparison against the recommended limit is removed entirely.

Also round the headroom threshold instead of truncating (long)(xmx * 0.10).

JSON gains an oomKillRiskState field; the human-readable output renders all
three states. New i18n keys rightsize.oom.ok / rightsize.oom.na added to all
four locales (682 keys each).

Signed-off-by: rlaope <piyrw9754@gmail.com>
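
A minimal sketch of the three-state verdict this commit describes, under stated assumptions: the class and method names are hypothetical, the comparison uses a strict less-than (the commit does not say whether the boundary is inclusive), and any non-positive limit is treated as "not observed" where the commit only mentions == 0. The 10% threshold and the switch from truncation to rounding are taken from the commit message.

```java
// Hypothetical three-state OOMKill verdict; identifiers are assumptions for illustration.
final class OomKillVerdictSketch {
    enum State { AT_RISK, OK, UNKNOWN }

    static final double HEADROOM_FRACTION = 0.10; // threshold: 10% of -Xmx

    /**
     * Evaluates native headroom against an observed container limit.
     * A limit of 0 means "none observed" and yields UNKNOWN, never a false all-clear.
     */
    static State evaluate(long xmxBytes, long observedLimitBytes) {
        if (observedLimitBytes <= 0) {
            return State.UNKNOWN;
        }
        // Rounded rather than truncated, as the commit notes.
        long threshold = Math.round(xmxBytes * HEADROOM_FRACTION);
        long headroomAboveXmx = observedLimitBytes - xmxBytes;
        return headroomAboveXmx < threshold ? State.AT_RISK : State.OK;
    }

    public static void main(String[] args) {
        final long mib = 1024L * 1024L;
        System.out.println(evaluate(125 * mib, 130 * mib)); // AT_RISK
        System.out.println(evaluate(125 * mib, 183 * mib)); // OK
        System.out.println(evaluate(125 * mib, 0));         // UNKNOWN
    }
}
```

Because the function only ever compares against a limit supplied from outside, the circular comparison against Argus's own recommended limit cannot recur in this shape.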
@rlaope rlaope merged commit 083b567 into master May 29, 2026
3 of 9 checks passed
@rlaope rlaope deleted the feat/w3-rightsize branch May 29, 2026 03:12