in_process_exporter_metrics: Expire metrics for dead processes#11760
in_process_exporter_metrics: Expire metrics for dead processes#11760piwai wants to merge 2 commits intofluent:masterfrom
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (3)
✅ Files skipped from review due to trivial changes (1)
🚧 Files skipped from review as they are similar to previous changes (2)
📝 WalkthroughWalkthroughAdds a cmetrics API to expire stale dynamic label-set metrics by timestamp and invokes it from the process exporter plugin after each /proc scan to remove metrics for processes/threads not observed during that collection pass. ChangesCmetrics: expiration API
Process exporter: expire after scan
Sequence Diagram(s)sequenceDiagram
participant Scanner as Process Scanner (plugin)
participant Cmetrics as Cmetrics Library
participant Metrics as Dynamic Metric Entries
rect rgba(200,200,255,0.5)
Scanner->>Scanner: scan /proc, record scan timestamp (ts)
end
rect rgba(200,255,200,0.5)
Scanner->>Cmetrics: call cmt_expire(ctx->cmt, ts)
end
rect rgba(255,200,200,0.5)
Cmetrics->>Metrics: iterate metric collections (counters,gauges,...)
Cmetrics->>Metrics: remove entries with timestamp < ts
Cmetrics-->>Scanner: return purge count
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Review rate limit: 7/8 reviews remaining, refill in 7 minutes and 30 seconds.Comment |
|
Here's the request valgrind log |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 99f12369f0
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@plugins/in_process_exporter_metrics/pe_process.c`:
- Around line 946-967: The purge_stale_metrics function currently checks only
PID/TID via get_metric_label_value against active_ids, which allows stale series
to survive on ID reuse; update purge_stale_metrics to build a collision-free key
for each metric (for example by concatenating the full label tuple or PID/TID
plus process start time) and use that key when querying flb_hash_table_get
instead of the single id_val so the exact emitted identity is checked before
calling cmt_map_metric_destroy; apply the same change to the equivalent logic
referenced around the other purge block (the similar code at the later section)
so both places use the full-label or start-time-based key.
- Line 524: flb_hash_table_add calls that populate
active_tids/active_pids/active_fds must not be ignored — check each
flb_hash_table_add return value and treat a failure as a fatal error for this
collection pass: on add failure, log a descriptive error (including the
tid/pid/fd value), set a flag like index_complete = false (or return an error
code) and skip the end-of-pass purge; alternatively return early from the
collection function so purge code does not run. Update every insertion site that
uses flb_hash_table_add (references: active_tids, active_pids, active_fds) to
perform this check and ensure the purge path is gated by index_complete or
skipped when an insertion failed.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 4c226035-2a3c-4f70-b7f7-44e2c70a3119
📒 Files selected for processing (1)
plugins/in_process_exporter_metrics/pe_process.c
|
@piwai thanks for this contribution. Pls check the reviews provided by the Ai agent |
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
plugins/in_process_exporter_metrics/pe_process.c (1)
526-533:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift
name:idis still not a collision-free liveness key.The purge now matches on
name + pid/tid, but several maps distinguish series by more than that (ppidfor process series,threadnamefor thread series). If an ID is reused for another process/thread with the same name before the next scrape, the lookup still hits and the oldppid/threadnameseries survives. That means stale series can still accumulate under reuse-heavy workloads.Please purge against the exact emitted identity for each map, or another collision-free identity such as start time.
Also applies to: 955-983, 1099-1106, 1267-1286
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@plugins/in_process_exporter_metrics/pe_process.c` around lines 526 - 533, The current liveness key built in the active_tids purge (created via flb_sds_create(name) + ":" + tid_str and added with flb_hash_table_add(active_tids, active_key, ...)) is not collision-free because it omits other identity fields (e.g., ppid for process maps, threadname for thread maps); update the key construction wherever you see this pattern (the active_key + flb_hash_table_add calls in this file) to include the exact emitted identity used when the series is created—either append ppid or threadname (or the process start time) to the key so the purge matches the full emitted identity, and apply the same fix to the other similar blocks that build active_key before calling flb_hash_table_add.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@plugins/in_process_exporter_metrics/pe_process.c`:
- Around line 1244-1247: When process_thread_update(ctx, ts, pid_str, name,
active_tids, &active_index_complete) returns -1 you must mark the active-index
gate invalid so the purge doesn't remove live thread metrics; update the caller
to set active_index_complete = false (or otherwise invalidate the purge gate)
immediately in the error branch where ret == -1. Apply the same change to the
other failing path referenced (the similar block around the 1265-1289 range) so
both failure paths for process_thread_update() invalidate active_index_complete
when thread enumeration fails.
---
Duplicate comments:
In `@plugins/in_process_exporter_metrics/pe_process.c`:
- Around line 526-533: The current liveness key built in the active_tids purge
(created via flb_sds_create(name) + ":" + tid_str and added with
flb_hash_table_add(active_tids, active_key, ...)) is not collision-free because
it omits other identity fields (e.g., ppid for process maps, threadname for
thread maps); update the key construction wherever you see this pattern (the
active_key + flb_hash_table_add calls in this file) to include the exact emitted
identity used when the series is created—either append ppid or threadname (or
the process start time) to the key so the purge matches the full emitted
identity, and apply the same fix to the other similar blocks that build
active_key before calling flb_hash_table_add.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 6fb43019-e2d2-4ca2-b95c-1fa47e7631c9
📒 Files selected for processing (1)
plugins/in_process_exporter_metrics/pe_process.c
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@plugins/in_process_exporter_metrics/pe_process.c`:
- Around line 526-533: The SDS key construction code using flb_sds_create and
flb_sds_cat_safe must check for allocation/appending failure before calling
flb_sds_len or flb_hash_table_add; update the block that builds active_key to
validate that flb_sds_create returned non-NULL and that each flb_sds_cat_safe
succeeded, and if any step fails set *active_index_complete = FLB_FALSE, avoid
calling flb_sds_len or flb_hash_table_add with a NULL key, free/destroy any
partially created SDS via flb_sds_destroy if non-NULL, and return/continue as
appropriate; apply the same defensive checks to the other SDS key construction
sites (the analogous blocks around lines 976-983 and 1099-1106) to prevent NULL
dereference in flb_sds_len or hash operations.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: fbc9fa0c-f17d-49f5-8780-04c410d91e63
📒 Files selected for processing (1)
plugins/in_process_exporter_metrics/pe_process.c
There was a problem hiding this comment.
This is really interesting and prominent idea!
But we don't want to touch the internal structure of cmetrics from Fluent Bit's plugins as much as possible. So, could you try to implement this kind of feature inside of cmetrics if possible?
We're also seeking an opportunity to purge stale metrics in some ways but currently not succeeded yet.
|
@cosmo0920 thanks a lot for the review! |
Yup, implementing cmt_expire() or similar function to provide expiring metrics.
We're also suffering for out of date issue of stale metrics in out_prometheus_remote_write or out_prometheus_expoter. |
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
🧹 Nitpick comments (1)
lib/cmetrics/src/cmetrics.c (1)
185-249: ⚡ Quick winConsider extracting the repeated inner-loop body into a static helper.
The purge-by-timestamp inner loop is copy-pasted verbatim six times; the only difference is which
struct cmt_*type is unpacked from the outer list. Extracting it removes ~60 lines, eliminates any future drift if the expiry predicate changes (e.g., adding a callback or a minimum count guard), and makescmt_expireitself much easier to read.♻️ Proposed refactor
Add a
statichelper abovecmt_expire:static int expire_map_metrics(struct cmt_map *map, uint64_t min_timestamp) { int count = 0; struct cfl_list *mhead; struct cfl_list *mtmp; struct cmt_metric *metric; cfl_list_foreach_safe(mhead, mtmp, &map->metrics) { metric = cfl_list_entry(mhead, struct cmt_metric, _head); if (cmt_metric_get_timestamp(metric) < min_timestamp) { cmt_map_metric_destroy(metric); count++; } } return count; }Then simplify
cmt_expire:int cmt_expire(struct cmt *cmt, uint64_t min_timestamp) { int count = 0; struct cfl_list *head; struct cfl_list *tmp; struct cmt_counter *c; struct cmt_gauge *g; struct cmt_summary *s; struct cmt_histogram *h; struct cmt_exp_histogram *eh; struct cmt_untyped *u; - struct cfl_list *mhead; - struct cfl_list *mtmp; - struct cmt_metric *metric; - cfl_list_foreach_safe(head, tmp, &cmt->counters) { - c = cfl_list_entry(head, struct cmt_counter, _head); - cfl_list_foreach_safe(mhead, mtmp, &c->map->metrics) { - metric = cfl_list_entry(mhead, struct cmt_metric, _head); - if (cmt_metric_get_timestamp(metric) < min_timestamp) { - cmt_map_metric_destroy(metric); - count++; - } - } - } - /* ... repeated 5 more times ... */ + cfl_list_foreach_safe(head, tmp, &cmt->counters) { + c = cfl_list_entry(head, struct cmt_counter, _head); + count += expire_map_metrics(c->map, min_timestamp); + } + cfl_list_foreach_safe(head, tmp, &cmt->gauges) { + g = cfl_list_entry(head, struct cmt_gauge, _head); + count += expire_map_metrics(g->map, min_timestamp); + } + cfl_list_foreach_safe(head, tmp, &cmt->summaries) { + s = cfl_list_entry(head, struct cmt_summary, _head); + count += expire_map_metrics(s->map, min_timestamp); + } + cfl_list_foreach_safe(head, tmp, &cmt->histograms) { + h = cfl_list_entry(head, struct cmt_histogram, _head); + count += expire_map_metrics(h->map, min_timestamp); + } + cfl_list_foreach_safe(head, tmp, &cmt->exp_histograms) { + eh = cfl_list_entry(head, struct cmt_exp_histogram, _head); + count += expire_map_metrics(eh->map, min_timestamp); + } + cfl_list_foreach_safe(head, tmp, &cmt->untypeds) { + u = cfl_list_entry(head, struct cmt_untyped, _head); + count += expire_map_metrics(u->map, min_timestamp); + } return count; }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@lib/cmetrics/src/cmetrics.c` around lines 185 - 249, The cmt_expire function repeats the same inner loop for each metric map; add a static helper (e.g., expire_map_metrics) that takes struct cmt_map *map and uint64_t min_timestamp, iterates map->metrics using cfl_list_foreach_safe, checks cmt_metric_get_timestamp(metric) < min_timestamp, calls cmt_map_metric_destroy(metric) and counts removals, returning the count; then replace each duplicated inner loop in cmt_expire (for counters, gauges, summaries, histograms, exp_histograms, untypeds) with a call to expire_map_metrics(map, min_timestamp) and accumulate its return value into count.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@lib/cmetrics/src/cmetrics.c`:
- Around line 185-249: The cmt_expire function repeats the same inner loop for
each metric map; add a static helper (e.g., expire_map_metrics) that takes
struct cmt_map *map and uint64_t min_timestamp, iterates map->metrics using
cfl_list_foreach_safe, checks cmt_metric_get_timestamp(metric) < min_timestamp,
calls cmt_map_metric_destroy(metric) and counts removals, returning the count;
then replace each duplicated inner loop in cmt_expire (for counters, gauges,
summaries, histograms, exp_histograms, untypeds) with a call to
expire_map_metrics(map, min_timestamp) and accumulate its return value into
count.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: de6893b7-7d00-4c61-a1e6-b57abf67c6ff
📒 Files selected for processing (3)
lib/cmetrics/include/cmetrics/cmetrics.hlib/cmetrics/src/cmetrics.cplugins/in_process_exporter_metrics/pe_process.c
✅ Files skipped from review due to trivial changes (1)
- plugins/in_process_exporter_metrics/pe_process.c
Signed-off-by: Pierre-Yves Rofes <3604235+piwai@users.noreply.github.com>
Signed-off-by: Pierre-Yves Rofes <3604235+piwai@users.noreply.github.com>
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
|
@edsiper I reworked following @cosmo0920 suggestion, it's indeed much simpler and generic. I'm just using the collection timestamp as threshold to expire outdated metrics, so it can be added to other plugins as well. |
|
Unfortunately, cmetrics is maintained here: https://github.com/fluent/cmetrics So, we need to address cmetrics issues there. |
|
@cosmo0920 ah ok, well when browsing the cmetrics repository I just noticed fluent/cmetrics#246 which does almost exactly the same thing than I did, actually even more complete than mine with unit tests and all. Not sure how to proceed here, should we ping the author to try to update the branch and merge his PR first, then I can update this one with only the call to cmt_expire() in node_exporter plugin? |
We can take over @pwhelan's work in cmetrics repository. In his work, it's also included unit testing in C so we also needed to add unit testing for expiring cmetics' metrics. |
|
I rebased my PR and tests have passed. ping me if you need anything else to get it merged. |
This add a generic expiration mechanism in cmetrics with cmt_expire(), based on a timestamp. This avoids publishing prometheus metrics for process which are not running anymore on the host.
Uses timestamp based comparison, and removes any metrics with timestamp older than the last collection.
Fixes #9547
Enter
[N/A]in the box, if an item is not applicable to your change.Testing
Before we can approve your change; please submit the following in a comment:
If this is a change to packaging of containers or native binaries then please confirm it works for all targets.
ok-package-testlabel to test for all targets (requires maintainer to do).Documentation
Backporting
Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.
Summary by CodeRabbit
Bug Fixes
Improvements