filter_log_to_metrics: retry timer metric append #11774
Signed-off-by: Eduardo Silva <eduardo@chronosphere.io>
📝 Walkthrough

The PR fixes a metric-appending reliability issue in the log-to-metrics filter by tracking append success and only clearing the pending-data flag when metrics are successfully appended. A new integration test verifies the fix prevents metric loss in timer mode.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ec67e0c2b1
```yaml
- name: dummy
  tag: nginx.b2b.access
  rate: 1
  samples: 5
```
Remove finite sample cap from timer-mode counter scenario
This scenario can fail nondeterministically because the dummy input stops after `samples: 5`, but the test waits for a later scrape where `value > first_value`. If Fluent Bit startup or the first successful scrape happens after all 5 records have already been ingested, `first_value` is already the terminal counter value and the second wait condition can never become true, causing a timeout. Using an unbounded stream (or a much larger sample count) avoids this flaky failure mode.
Actionable comments posted: 3
🧹 Nitpick comments (2)
tests/integration/scenarios/filter_log_to_metrics/tests/test_filter_log_to_metrics_001.py (2)
60-72: 💤 Low value. Redundant assertion at line 72 and unexplained `time.sleep(3)` at line 73.

`wait_for_condition` at line 60 already guarantees `second_value > first_value` before returning; the `assert` on line 72 adds no new information. The `time.sleep(3)` call after the assertion has no documented purpose and only slows the test suite.

💡 Suggested clean-up

```diff
-        assert second_value > first_value
-        time.sleep(3)
     finally:
         service.stop()
```

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/integration/scenarios/filter_log_to_metrics/tests/test_filter_log_to_metrics_001.py` around lines 60 - 72, The test contains a redundant assert and an unnecessary sleep: remove the assert that compares second_value > first_value (the call to service.service.wait_for_condition already guarantees that) and delete the trailing time.sleep(3) which serves no documented purpose; locate the block using service.service.wait_for_condition(... description="increasing log_to_metrics counter scrape"), the variables first_value and second_value, and the helper _counter_value/scrape_metrics calls to make these removals.
39-45: The prometheus_exporter output is safe from timestamps because `add_timestamp` defaults to false and the test config doesn't enable it, but the parser could be made more resilient.

`rsplit(" ", 1)[1]` correctly extracts counter values when lines are `metric{labels} 5`, and it works in this test because the prometheus_exporter plugin defaults to not including timestamps: the `add_timestamp` option is false by default and is not overridden in the test YAML. However, if that option were ever enabled (either in the plugin defaults or in the test configuration), the parser would silently extract the timestamp instead of the metric value, causing test failures. Consider making the parser more resilient by validating the extracted value or handling the optional third field explicitly.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/integration/scenarios/filter_log_to_metrics/tests/test_filter_log_to_metrics_001.py` around lines 39 - 45, The _counter_value parser can mistakenly pick a timestamp when prometheus_exporter emits three fields (value plus timestamp); update _counter_value to robustly parse the metric line for COUNTER_NAME and COUNTER_LABELS by splitting the line into whitespace-separated fields, validating that the second-to-last field is the numeric metric value (or trying to parse the last two fields and choosing the one that parses as a float and is in a plausible range), and return that parsed float; reference the _counter_value function and COUNTER_NAME/COUNTER_LABELS when making the change so the logic explicitly handles the optional timestamp case instead of blindly using rsplit(" ", 1)[1].
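As a minimal sketch of the timestamp-tolerant parsing this comment asks for (the real `_counter_value` helper is not shown in full in this thread, so its signature and surrounding names are assumptions):

```python
def _counter_value(metrics_text: str, name: str, labels: str) -> float:
    """Extract a counter value from Prometheus text format, tolerating
    the optional trailing timestamp: name{labels} value [timestamp_ms]."""
    prefix = f"{name}{{{labels}}}"
    for line in metrics_text.splitlines():
        if line.startswith(prefix):
            fields = line[len(prefix):].split()
            # The sample value is always the first field after the label
            # block; with add_timestamp enabled, a millisecond timestamp
            # follows it as a second field.
            if fields:
                return float(fields[0])
    raise ValueError(f"counter {prefix} not found in scrape output")
```

Splitting on whitespace and taking the first field after the label block stays correct whether or not the exporter appends a timestamp, unlike `rsplit(" ", 1)[1]`, which always grabs the last field.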
📒 Files selected for processing (3)

- plugins/filter_log_to_metrics/log_to_metrics.c
- tests/integration/scenarios/filter_log_to_metrics/config/counter_timer_prometheus.yaml
- tests/integration/scenarios/filter_log_to_metrics/tests/test_filter_log_to_metrics_001.py
```yaml
- name: dummy
  tag: nginx.b2b.access
  rate: 1
  samples: 5
  dummy: '{"request_method":"GET","status":"200","host":"example.com","endpoint":"/","hostname":"host-a","upstream_response_time":"0.125"}'
```
`samples: 5` risks a timing-dependent test failure.

With only 5 events at `rate: 1` and `flush_interval_sec: 1`, the counter stops increasing after ~5 s. If startup latency pushes the first successful scrape past the point where all 5 events have been flushed (and `ctx->new_data` is already `FLB_FALSE`), the counter will be frozen at 5 and `second_value > first_value` can never be satisfied within the 15-second timeout.
Consider removing the sample cap (or setting it to a large value such as samples: 100) so the counter keeps climbing throughout the test window.
💡 Suggested fix

```diff
-      samples: 5
+      samples: 0
```

(`samples: 0` means unlimited in the dummy plugin.)
📝 Committable suggestion
```yaml
- name: dummy
  tag: nginx.b2b.access
  rate: 1
  samples: 0
  dummy: '{"request_method":"GET","status":"200","host":"example.com","endpoint":"/","hostname":"host-a","upstream_response_time":"0.125"}'
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@tests/integration/scenarios/filter_log_to_metrics/config/counter_timer_prometheus.yaml`
around lines 10 - 14, The test's dummy generator is capped with "samples: 5",
which can exhaust events before the second scrape and cause timing-dependent
failures; update the dummy block (the entry with name: dummy and tag:
nginx.b2b.access) to remove the sample cap or set it to a large/unlimited value
(e.g., set samples to 0 or samples: 100) so events keep being produced
throughout the test window and the counter can increase between scrapes.
```yaml
metric_name: nginx_request_status_code_total
metric_description: Counts nginx request status codes
metric_namespace: nginx
metric_subsystem: request
```
`metric_name` already embeds the namespace/subsystem prefix, producing a redundant name.

With `metric_namespace: nginx`, `metric_subsystem: request`, and `metric_name: nginx_request_status_code_total`, CMT assembles the full Prometheus name as `nginx_request_nginx_request_status_code_total`: the `nginx_request` segment appears twice. The `metric_name` should be just the leaf component.
💡 Suggested fix

```diff
-      metric_name: nginx_request_status_code_total
+      metric_name: status_code_total
```

(Update `COUNTER_NAME` in test_filter_log_to_metrics_001.py accordingly: `"nginx_request_status_code_total"`.)
📝 Committable suggestion
```yaml
metric_name: status_code_total
metric_description: Counts nginx request status codes
metric_namespace: nginx
metric_subsystem: request
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@tests/integration/scenarios/filter_log_to_metrics/config/counter_timer_prometheus.yaml`
around lines 23 - 26, The metric_name currently duplicates the
namespace/subsystem because it contains the full prefix; change metric_name to
the leaf component only (e.g., remove the leading "nginx_request" so metric_name
becomes "status_code_total" or similar desired leaf) so CMT composes the full
Prometheus name from metric_namespace (nginx) + metric_subsystem (request) +
metric_name; also update the corresponding test constant COUNTER_NAME in
test_filter_log_to_metrics_001.py to match the resulting assembled name (e.g.,
"nginx_request_status_code_total") so the test asserts the correct final metric
name.
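To make the composition rule concrete, here is a small illustrative sketch of the namespace/subsystem/name concatenation described above (the helper is hypothetical, not CMT's actual API):

```python
def assemble_metric_name(namespace: str, subsystem: str, name: str) -> str:
    # Compose the full Prometheus name as namespace_subsystem_name,
    # skipping any empty component.
    return "_".join(part for part in (namespace, subsystem, name) if part)

# Current config: the nginx_request prefix is duplicated.
assert (assemble_metric_name("nginx", "request", "nginx_request_status_code_total")
        == "nginx_request_nginx_request_status_code_total")

# Suggested leaf-only metric_name yields the intended final name.
assert (assemble_metric_name("nginx", "request", "status_code_total")
        == "nginx_request_status_code_total")
```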
```python
def test_log_to_metrics_counter_timer_emits_repeated_metric_chunks():
    service = Service("counter_timer_prometheus.yaml")
    service.start()
```
`service.start()` is outside the try/finally block, so `stop()` is skipped on startup failure.

If `service.start()` raises, the `finally: service.stop()` never runs, potentially leaving child processes alive.
💡 Suggested fix

```diff
     service = Service("counter_timer_prometheus.yaml")
-    service.start()
-
     try:
+        service.start()
+
         first_value = service.service.wait_for_condition(
```

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@tests/integration/scenarios/filter_log_to_metrics/tests/test_filter_log_to_metrics_001.py`
around lines 48 - 51, The test
test_log_to_metrics_counter_timer_emits_repeated_metric_chunks currently calls
Service("counter_timer_prometheus.yaml").start() outside a try/finally so
Service.stop() may be skipped if start() raises; wrap service.start() and the
rest of the test body in a try/finally (or add a try immediately after
instantiating service) and ensure service.stop() is called in the finally block
to guarantee cleanup of the Service instance even on startup failure.
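A minimal sketch of the requested structure, assuming the harness API referenced in this thread (`Service.start()`, `Service.stop()`, and `service.service.wait_for_condition(...)`); the waiting logic itself is elided:

```python
def test_log_to_metrics_counter_timer_emits_repeated_metric_chunks():
    service = Service("counter_timer_prometheus.yaml")
    try:
        # Starting inside the try block guarantees the finally clause runs
        # even when startup itself raises, so no child process is leaked.
        service.start()
        first_value = service.service.wait_for_condition(
            ...,  # predicate elided; see the actual test
            description="first log_to_metrics counter scrape",
        )
        # ... wait for a later scrape with a strictly larger counter ...
    finally:
        service.stop()
```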
It can potentially fix #11772 ...

Fixes timer-mode emission so pending metric data is not marked as consumed unless the internal metrics append succeeds.

Previously, `cb_send_metric_chunk()` cleared `ctx->new_data` unconditionally after each timer callback. If `flb_input_metrics_append()` failed, the pending counter state could be dropped from the timer path, causing subsequent emissions to stall until new data reset the flag. The fix only clears `ctx->new_data` after a successful append.

Also adds an integration test that runs `log_to_metrics` in counter timer mode and verifies Prometheus-exported counter values continue increasing.

Fluent Bit is licensed under Apache 2.0; by submitting this pull request I understand that this code will be released under the terms of that license.
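The actual change lives in C, in plugins/filter_log_to_metrics/log_to_metrics.c; as an illustrative Python sketch of the control flow described above (the `append_metrics` helper is a hypothetical stand-in for `flb_input_metrics_append()`):

```python
def cb_send_metric_chunk(ctx):
    # Timer callback: emit pending metrics, retrying on failure.
    if not ctx.new_data:
        return  # nothing pending since the last successful emission

    appended = append_metrics(ctx)  # stand-in for flb_input_metrics_append()

    # Before the fix, ctx.new_data was cleared unconditionally here, so a
    # failed append dropped the pending state and later timer ticks had
    # nothing left to retry.
    if appended:
        ctx.new_data = False  # mark consumed only after a successful append
```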