feat: JVM right-sizing for FinOps (argus rightsize + fleet roll-up) #255
Merged
Add an argus rightsize <pid> command that recommends -Xmx/-Xms, container memory request/limit, and a CPU request derived from the observed heap high-water mark, the post-GC live-set floor, allocation/promotion rate, and the metaspace + direct-buffer footprint.

- Pure-function core (RightsizeRecommender) with no JVM-attach dependency, so the headroom/recommendation math is unit-testable without a live JVM.
- Hard floor: recommended -Xmx is never below the observed post-GC live-set floor. An OOMKill-risk flag fires when the container limit leaves no native headroom above -Xmx (metaspace/threads/code-cache/direct buffers).
- Refuses with a clear reason when the observation window is too short to be defensible (< 60s or < 2 GC cycles).
- --format=json emits a structured recommendation object; the output always shows its inputs and the safety factor (never a black box).
- Aggregator GET /fleet/rightsize returns a per-deployment current-vs-recommended memory table (in % of current -Xmx) with an aggregate savings estimate, reusing the fleet ring-buffer samples already held.
- i18n: 21 new keys added to all four locale files (printf %s), keeping key counts in parity at 680 each.

Signed-off-by: rlaope <piyrw9754@gmail.com>
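The hard-floor and refusal rules in this commit message can be sketched as two pure functions. This is a minimal illustration, not the PR's RightsizeRecommender: the class name, method names, and the assumption that the candidate -Xmx is the live-set floor multiplied by the safety factor are all hypothetical. Only the 60s / 2-GC-cycle thresholds and the "never below the floor" guarantee come from the text above.

```java
import java.util.Optional;

/** Hypothetical sketch of the floor and refusal rules; names are illustrative only. */
final class RightsizeSketch {
    // Thresholds quoted in the commit message.
    static final long MIN_WINDOW_SECONDS = 60;
    static final int MIN_GC_CYCLES = 2;

    private RightsizeSketch() {}

    /** Returns a refusal reason when the observation window is not defensible, else empty. */
    static Optional<String> refusalReason(long windowSeconds, int gcCycles) {
        if (windowSeconds < MIN_WINDOW_SECONDS) {
            return Optional.of("observation window shorter than " + MIN_WINDOW_SECONDS + "s");
        }
        if (gcCycles < MIN_GC_CYCLES) {
            return Optional.of("fewer than " + MIN_GC_CYCLES + " GC cycles observed");
        }
        return Optional.empty();
    }

    /**
     * Candidate -Xmx in bytes. ASSUMPTION: the candidate is floor * safetyFactor.
     * The clamp guarantees the result never drops below the post-GC live-set floor,
     * even if the safety factor or rounding would push it lower.
     */
    static long recommendXmx(long liveSetFloorBytes, double safetyFactor) {
        long scaled = Math.round(liveSetFloorBytes * safetyFactor);
        return Math.max(scaled, liveSetFloorBytes);
    }
}
```

Because neither method touches a live JVM, both can be exercised from a plain unit test, which is the property the commit message attributes to the pure-function core.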
The OOMKill-risk check evaluated headroom against Argus's own recommended
container limit when no --limit was supplied. That limit is
round((xmx + native) * 1.25) by construction, so headroomAboveXmx was always
far above the 10% threshold and the flag could never fire on the default path.
Make the verdict three-state (AT_RISK / OK / UNKNOWN):
- observed limit present (--limit or known cgroup limit): evaluate headroom
against that real limit, as before.
- no observed limit (== 0): report UNKNOWN ("n/a"), never a false all-clear,
and tell the operator to pass --limit to get a real answer. The circular
self-comparison against the recommended limit is removed entirely.
Also round the headroom threshold instead of truncating (long)(xmx * 0.10).
JSON gains an oomKillRiskState field; the human-readable output renders all
three states. New i18n keys rightsize.oom.ok / rightsize.oom.na added to all
four locales (682 keys each).
Signed-off-by: rlaope <piyrw9754@gmail.com>
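A minimal sketch of the three-state verdict described in this commit, under stated assumptions: the class and method names are hypothetical, and the strict `<` comparison against the threshold is a guess at the actual boundary behavior. The 10% threshold, the rounding, the `== 0` sentinel for "no observed limit", and the `round((xmx + native) * 1.25)` recommended-limit formula are taken from the message above.

```java
/** Hypothetical sketch of the AT_RISK / OK / UNKNOWN verdict; not the PR's actual code. */
final class OomKillRiskSketch {
    enum State { AT_RISK, OK, UNKNOWN }

    private OomKillRiskSketch() {}

    /**
     * @param xmxBytes           recommended -Xmx in bytes
     * @param observedLimitBytes real container limit (--limit or cgroup), 0 when unknown
     */
    static State evaluate(long xmxBytes, long observedLimitBytes) {
        if (observedLimitBytes == 0) {
            // No real limit to compare against: report UNKNOWN, never a false all-clear.
            return State.UNKNOWN;
        }
        // Rounded rather than truncated with (long)(xmx * 0.10).
        long threshold = Math.round(xmxBytes * 0.10);
        long headroomAboveXmx = observedLimitBytes - xmxBytes;
        // ASSUMPTION: headroom strictly below the threshold counts as at risk.
        return headroomAboveXmx < threshold ? State.AT_RISK : State.OK;
    }

    /** The recommended limit as described in the commit message. */
    static long recommendedLimit(long xmxBytes, long nativeBytes) {
        return Math.round((xmxBytes + nativeBytes) * 1.25);
    }
}
```

Feeding `recommendedLimit(...)` back into `evaluate(...)` shows why the old default path could never flag a risk: the headroom is at least 25% of -Xmx, well above the 10% threshold, so the comparison only becomes meaningful against an observed limit.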
What & why: the FinOps angle
argus rightsize <pid> turns observed JVM behavior into concrete capacity numbers: it recommends -Xmx/-Xms, a container memory request and limit, and a container CPU request, derived from the heap high-water mark, the post-GC live-set floor, allocation/promotion rate, and the metaspace + direct-buffer footprint.
Host-level eBPF/RSS tooling can see how much memory a container uses, but it cannot reason about the live set: how much of the heap survives a collection, which is the only defensible floor for -Xmx. RSS conflates heap, off-heap, page cache and GC scratch; it can't tell you "this 4 GB pod only ever keeps 600 MB live, so -Xmx can come down." Argus reads that directly from the JVM, so the recommendation is anchored to the live set and the safety factor, not to a noisy RSS number.
Example output
With a too-tight current limit it flags the risk RSS tools never surface:
--format=json emits a structured object (recommended bytes, OOMKill flag, safety factor, and the full echoed inputs).
Acceptance criteria: how each was verified
- safety:1.5x are always rendered.
- -Xmx (125 MiB) sits above the live-set floor (83 MiB); unit test xmxNeverDropsBelowFloorEvenWhenSafetyFactorWouldRound asserts the floor guarantee. The OOMKill flag fired live with --limit=130m and is covered by oomKillFlagFiresWhenCurrentLimitApproximatesXmx.
- shortWindowIsRefusedWithReason and tooFewGcCyclesIsRefused.
- --format=json + GET /fleet/rightsize: JSON object verified live; the aggregator endpoint returns a per-deployment current-vs-recommended table with an aggregate savings estimate, reusing the ring-buffer samples already held. Covered by FleetRightsizeComputerTest (over-provisioned shows savings, hot deployment shows none, no-data reports nulls, JSON shape).
- RightsizeRecommender has zero JVM-attach dependency; RightsizeRecommenderTest drives the band/floor/OOMKill/refusal logic directly.
Tests / build
- ./gradlew :argus-cli:test :argus-aggregator:test - BUILD SUCCESSFUL. New: RightsizeRecommenderTest (9 tests), FleetRightsizeComputerTest (4 tests), all green; full suites pass.
- ./gradlew :argus-cli:fatJar - BUILD SUCCESSFUL, argus-cli-1.5.0-all.jar (~2.56 MB).
i18n parity
21 new rightsize.* / cmd.rightsize.desc keys added to all four locales with printf %s. Per-locale key count is 680 (en/ko/ja/zh all equal; was 659 + 21).
Notes
options.release = 11), so the recommender value objects are plain final classes with accessors rather than records. The aggregator (Java 21) uses records. No build.gradle.kts was modified.
🤖 Generated with Claude Code