filter_log_to_metrics: retry timer metric append#11774

Open
edsiper wants to merge 2 commits into master from issue-11772

Conversation


@edsiper edsiper commented May 4, 2026

This can potentially fix #11772.

Fixes timer-mode emission so that pending metric data is not marked as consumed unless the internal metrics append succeeds.

Previously, cb_send_metric_chunk() cleared ctx->new_data unconditionally after each timer callback. If flb_input_metrics_append() failed, the pending counter state could be dropped from the timer path, causing subsequent emissions to stall until new data reset the flag. The fix only clears ctx->new_data after a successful append.
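The control flow of the fix can be sketched as follows. This is a Python sketch of the pattern only; the actual change is in C in plugins/filter_log_to_metrics/log_to_metrics.c, and the names here merely stand in for flb_input_metrics_append() and ctx->new_data:

```python
# Illustrative sketch of the fix's control flow; names are invented stand-ins
# for the C symbols in plugins/filter_log_to_metrics/log_to_metrics.c.

def send_metric_chunk(ctx, append_metrics):
    """Timer callback: try to flush pending metrics.

    `append_metrics` stands in for flb_input_metrics_append(); it returns
    True on success and False on failure.
    """
    if not ctx["new_data"]:
        return False              # nothing pending, nothing to do
    appended = append_metrics(ctx)
    if appended:
        ctx["new_data"] = False   # consume the flag only after a successful append
    # on failure the flag stays set, so the next timer tick retries
    return appended
```

Before the fix the flag was cleared unconditionally, so one failed append silently dropped the pending state; with this shape a failure simply leaves the retry armed for the next tick.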

Also adds an integration test that runs log_to_metrics in counter timer mode and verifies Prometheus-exported counter values continue increasing.


Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.

Summary by CodeRabbit

  • Bug Fixes
    • Enhanced metric handling to prevent loss of pending metrics by ensuring data is only cleared after successful append operations, improving reliability in timer mode scenarios.

edsiper added 2 commits May 4, 2026 13:59
Signed-off-by: Eduardo Silva <eduardo@chronosphere.io>
Signed-off-by: Eduardo Silva <eduardo@chronosphere.io>
coderabbitai Bot commented May 4, 2026

📝 Walkthrough

Walkthrough

The PR fixes a metric-appending reliability issue in the log-to-metrics filter by tracking append success and only clearing the pending-data flag when metrics are successfully appended. A new integration test verifies the fix prevents metric loss in timer mode.

Changes

Metric Append Reliability Fix

  • Core Logic Fix (plugins/filter_log_to_metrics/log_to_metrics.c):
    In cb_send_metric_chunk, tracks append success via an appended flag and clears ctx->new_data only if flb_input_metrics_append() succeeds. In cb_log_to_metrics_filter timer mode, sets ctx->new_data only upon successful append, preventing premature loss of pending metrics.
  • Test Configuration (tests/integration/scenarios/filter_log_to_metrics/config/counter_timer_prometheus.yaml):
    New integration test config that runs Fluent Bit with a dummy input, applies log_to_metrics to convert nginx logs to counter metrics, and exports via the prometheus exporter on a dynamic port.
  • Test Implementation (tests/integration/scenarios/filter_log_to_metrics/tests/test_filter_log_to_metrics_001.py):
    New test that starts the service, polls the Prometheus metrics endpoint, and verifies the counter value increases across repeated scrapes, confirming metric chunks are not lost.

Estimated Code Review Effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly Related PRs

  • fluent/fluent-bit#11420: Also modifies cb_send_metric_chunk in the same file; coordinates with NULL-context guard and append-success handling logic.

Suggested Labels

backport to v4.2.x

Suggested Reviewers

  • cosmo0920
  • koleini
  • fujimotos

Poem

🐰 Metrics hop through filters clean,
No data lost between the seams,
Timer mode now keeps its stream,
Success flags guard the append dream!
✨ Test it once, test it twice,
Prometheus counters now run nice.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
  • Title check: Passed. The title directly and accurately describes the main fix: addressing a retry/failure scenario in timer-based metric append operations for the log_to_metrics filter.
  • Docstring Coverage: Passed. No functions found in the changed files to evaluate docstring coverage; skipping the check.
  • Linked Issues check: Passed. Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check: Passed. Check skipped because no linked issues were found for this pull request.
  • Description Check: Passed. Check skipped; CodeRabbit's high-level summary is enabled.






@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ec67e0c2b1


    - name: dummy
      tag: nginx.b2b.access
      rate: 1
      samples: 5


P2: Remove finite sample cap from timer-mode counter scenario

This scenario can fail nondeterministically because the dummy input stops after samples: 5, but the test waits for a later scrape where value > first_value. If Fluent Bit startup or the first successful scrape happens after all 5 records were already ingested, first_value is already the terminal counter value and the second wait condition can never become true, causing a timeout. Using an unbounded stream (or a much larger sample count) avoids this flaky failure mode.



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (2)
tests/integration/scenarios/filter_log_to_metrics/tests/test_filter_log_to_metrics_001.py (2)

60-72: 💤 Low value

Redundant assertion at line 72 and unexplained time.sleep(3) at line 73.

wait_for_condition at line 60 already guarantees second_value > first_value before returning; the assert on line 72 adds no new information. The time.sleep(3) call after the assertion has no documented purpose and only slows the test suite.

💡 Suggested clean-up
-        assert second_value > first_value
-        time.sleep(3)
     finally:
         service.stop()
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@tests/integration/scenarios/filter_log_to_metrics/tests/test_filter_log_to_metrics_001.py`
around lines 60 - 72, The test contains a redundant assert and an unnecessary
sleep: remove the assert that compares second_value > first_value (the call to
service.service.wait_for_condition already guarantees that) and delete the
trailing time.sleep(3) which serves no documented purpose; locate the block
using service.service.wait_for_condition(... description="increasing
log_to_metrics counter scrape"), the variables first_value and second_value, and
the helper _counter_value/scrape_metrics calls to make these removals.

39-45: The prometheus_exporter output is safe from timestamps because add_timestamp defaults to false and the test config doesn't enable it, but the parser could be made more resilient.

rsplit(" ", 1)[1] correctly extracts counter values when lines are metric{labels} 5, and will work fine in this test since the prometheus_exporter plugin defaults to not including timestamps. The configuration option add_timestamp is set to false by default in the plugin and is not overridden in the test YAML. However, if this option were ever enabled (either in the plugin defaults or in the test configuration), the parser would silently extract the timestamp instead of the metric value, causing test failures. Consider making the parser more resilient by validating the extracted value or handling the optional third field explicitly.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@tests/integration/scenarios/filter_log_to_metrics/tests/test_filter_log_to_metrics_001.py`
around lines 39 - 45, The _counter_value parser can mistakenly pick a timestamp
when prometheus_exporter emits three fields (value plus timestamp); update
_counter_value to robustly parse the metric line for COUNTER_NAME and
COUNTER_LABELS by splitting the line into whitespace-separated fields,
validating that the second-to-last field is the numeric metric value (or trying
to parse the last two fields and choosing the one that parses as a float and is
in a plausible range), and return that parsed float; reference the
_counter_value function and COUNTER_NAME/COUNTER_LABELS when making the change
so the logic explicitly handles the optional timestamp case instead of blindly
using rsplit(" ", 1)[1].
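A more defensive version of the value parser could look like the sketch below. It is illustrative only: the function name, and the assumption that label values contain no spaces, are simplifications of the test's actual _counter_value helper.

```python
from typing import Optional


def counter_value(metrics_text: str, name: str, labels: str) -> Optional[float]:
    """Return the sample value of `name{labels}` from a Prometheus text exposition.

    Handles both `metric{labels} 5` and the optional-timestamp form
    `metric{labels} 5 1714800000000` instead of blindly taking the last field.
    Assumes label values contain no spaces, as in this test's config.
    """
    prefix = f"{name}{{{labels}}}"
    for line in metrics_text.splitlines():
        if not line.startswith(prefix):
            continue
        fields = line.split()
        # fields[0] is the metric name plus labels; fields[1] is the sample
        # value; fields[2], if present, is the optional timestamp.
        if len(fields) >= 2:
            return float(fields[1])
    return None
```

With this shape, enabling add_timestamp in the exporter no longer silently changes which field the test reads.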

📥 Commits

Reviewing files that changed from the base of the PR and between 86ec64a and ec67e0c.

📒 Files selected for processing (3)
  • plugins/filter_log_to_metrics/log_to_metrics.c
  • tests/integration/scenarios/filter_log_to_metrics/config/counter_timer_prometheus.yaml
  • tests/integration/scenarios/filter_log_to_metrics/tests/test_filter_log_to_metrics_001.py

Comment on lines +10 to +14

    - name: dummy
      tag: nginx.b2b.access
      rate: 1
      samples: 5
      dummy: '{"request_method":"GET","status":"200","host":"example.com","endpoint":"/","hostname":"host-a","upstream_response_time":"0.125"}'


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

samples: 5 risks a timing-dependent test failure.

With only 5 events at rate: 1 and flush_interval_sec: 1, the counter stops increasing after ~5 s. If startup latency pushes the first successful scrape past the point where all 5 events have been flushed (and ctx->new_data is already FLB_FALSE), the counter will be frozen at 5 and second_value > first_value can never be satisfied within the 15-second timeout.

Consider removing the sample cap (or setting it to a large value such as samples: 100) so the counter keeps climbing throughout the test window.

💡 Suggested fix
-      samples: 5
+      samples: 0

(samples: 0 means unlimited in the dummy plugin.)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@tests/integration/scenarios/filter_log_to_metrics/config/counter_timer_prometheus.yaml`
around lines 10 - 14, The test's dummy generator is capped with "samples: 5",
which can exhaust events before the second scrape and cause timing-dependent
failures; update the dummy block (the entry with name: dummy and tag:
nginx.b2b.access) to remove the sample cap or set it to a large/unlimited value
(e.g., set samples to 0 or samples: 100) so events keep being produced
throughout the test window and the counter can increase between scrapes.

Comment on lines +23 to +26
      metric_name: nginx_request_status_code_total
      metric_description: Counts nginx request status codes
      metric_namespace: nginx
      metric_subsystem: request


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

metric_name already embeds the namespace/subsystem prefix, producing a redundant name.

With metric_namespace: nginx, metric_subsystem: request, and metric_name: nginx_request_status_code_total, CMT assembles the full Prometheus name as nginx_request_nginx_request_status_code_total—the nginx_request segment appears twice. The metric_name should be just the leaf component.

💡 Suggested fix
-      metric_name: nginx_request_status_code_total
+      metric_name: status_code_total

(Update COUNTER_NAME in test_filter_log_to_metrics_001.py accordingly: "nginx_request_status_code_total".)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@tests/integration/scenarios/filter_log_to_metrics/config/counter_timer_prometheus.yaml`
around lines 23 - 26, The metric_name currently duplicates the
namespace/subsystem because it contains the full prefix; change metric_name to
the leaf component only (e.g., remove the leading "nginx_request" so metric_name
becomes "status_code_total" or similar desired leaf) so CMT composes the full
Prometheus name from metric_namespace (nginx) + metric_subsystem (request) +
metric_name; also update the corresponding test constant COUNTER_NAME in
test_filter_log_to_metrics_001.py to match the resulting assembled name (e.g.,
"nginx_request_status_code_total") so the test asserts the correct final metric
name.
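The duplication follows directly from how Prometheus-style names are composed from a namespace, subsystem, and name. A small illustrative sketch (not the actual cmetrics code) makes the effect visible:

```python
def full_metric_name(namespace: str, subsystem: str, name: str) -> str:
    """Join the non-empty parts with underscores, Prometheus-style.

    Illustrative only; cmetrics composes names in C, but the joining
    convention is the same.
    """
    return "_".join(part for part in (namespace, subsystem, name) if part)

# With the config as written, the prefix appears twice:
#   full_metric_name("nginx", "request", "nginx_request_status_code_total")
#   -> "nginx_request_nginx_request_status_code_total"
# With the leaf-only name, the assembled name is the intended one:
#   full_metric_name("nginx", "request", "status_code_total")
#   -> "nginx_request_status_code_total"
```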

Comment on lines +48 to +51
def test_log_to_metrics_counter_timer_emits_repeated_metric_chunks():
    service = Service("counter_timer_prometheus.yaml")
    service.start()



⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

service.start() is outside the try/finally block — stop() is skipped on startup failure.

If service.start() raises, the finally: service.stop() never runs, potentially leaving child processes alive.

💡 Suggested fix
     service = Service("counter_timer_prometheus.yaml")
-    service.start()
-
     try:
+        service.start()
+
         first_value = service.service.wait_for_condition(
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@tests/integration/scenarios/filter_log_to_metrics/tests/test_filter_log_to_metrics_001.py`
around lines 48 - 51, The test
test_log_to_metrics_counter_timer_emits_repeated_metric_chunks currently calls
Service("counter_timer_prometheus.yaml").start() outside a try/finally so
Service.stop() may be skipped if start() raises; wrap service.start() and the
rest of the test body in a try/finally (or add a try immediately after
instantiating service) and ensure service.stop() is called in the finally block
to guarantee cleanup of the Service instance even on startup failure.
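The recommended lifecycle can be shown with a minimal, self-contained sketch. The stub Service below only mimics the start/stop interface of the real test helper; the point is that start() sits inside the try so the finally-block stop() runs even when startup fails:

```python
# Sketch of the safe lifecycle pattern; this Service class is a stub that
# only mimics the interface of the integration-test helper.

class Service:
    def __init__(self, config: str):
        self.config = config
        self.running = False

    def start(self):
        self.running = True

    def stop(self):
        self.running = False


def run_scenario(config: str) -> Service:
    service = Service(config)
    try:
        service.start()
        # ... scrape metrics and assert on counter values here ...
    finally:
        service.stop()  # guaranteed cleanup, even if start() raised
    return service
```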



Development

Successfully merging this pull request may close these issues.

log_to_metrics: Counter mode stops emitting/stalls after several hours while Histogram continues to work (v4 and 5)

1 participant