filter_log_to_metrics: retry timer metric append #11774
Signed-off-by: Eduardo Silva <eduardo@chronosphere.io>
📝 Walkthrough

The PR fixes a metric-appending reliability issue in the log-to-metrics filter by tracking append success and only clearing the pending-data flag when metrics are successfully appended. A new integration test verifies the fix prevents metric loss in timer mode.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ec67e0c2b1
```yaml
- name: dummy
  tag: nginx.b2b.access
  rate: 1
  samples: 5
```
Remove finite sample cap from timer-mode counter scenario
This scenario can fail nondeterministically because the dummy input stops after `samples: 5`, but the test waits for a later scrape where `value > first_value`. If Fluent Bit startup or the first successful scrape happens after all 5 records have already been ingested, `first_value` is already the terminal counter value and the second wait condition can never become true, causing a timeout. Using an unbounded stream (or a much larger sample count) avoids this flaky failure mode.
Actionable comments posted: 3
🧹 Nitpick comments (2)
tests/integration/scenarios/filter_log_to_metrics/tests/test_filter_log_to_metrics_001.py (2)
60-72: 💤 Low value. Redundant assertion at line 72 and unexplained `time.sleep(3)` at line 73.

`wait_for_condition` at line 60 already guarantees `second_value > first_value` before returning; the `assert` on line 72 adds no new information. The `time.sleep(3)` call after the assertion has no documented purpose and only slows the test suite.

💡 Suggested clean-up

```diff
-        assert second_value > first_value
-        time.sleep(3)
     finally:
         service.stop()
```

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/integration/scenarios/filter_log_to_metrics/tests/test_filter_log_to_metrics_001.py` around lines 60 - 72, The test contains a redundant assert and an unnecessary sleep: remove the assert that compares second_value > first_value (the call to service.service.wait_for_condition already guarantees that) and delete the trailing time.sleep(3) which serves no documented purpose; locate the block using service.service.wait_for_condition(... description="increasing log_to_metrics counter scrape"), the variables first_value and second_value, and the helper _counter_value/scrape_metrics calls to make these removals.
39-45: The prometheus_exporter output is safe from timestamps because `add_timestamp` defaults to false and the test config doesn't enable it, but the parser could be made more resilient.

`rsplit(" ", 1)[1]` correctly extracts counter values when lines are `metric{labels} 5`, and it works in this test because the prometheus_exporter plugin defaults to not including timestamps: the `add_timestamp` option is false by default and is not overridden in the test YAML. However, if that option were ever enabled (either in the plugin defaults or in the test configuration), the parser would silently extract the timestamp instead of the metric value, causing test failures. Consider making the parser more resilient by validating the extracted value or handling the optional third field explicitly.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/integration/scenarios/filter_log_to_metrics/tests/test_filter_log_to_metrics_001.py` around lines 39 - 45, The _counter_value parser can mistakenly pick a timestamp when prometheus_exporter emits three fields (value plus timestamp); update _counter_value to robustly parse the metric line for COUNTER_NAME and COUNTER_LABELS by splitting the line into whitespace-separated fields, validating that the second-to-last field is the numeric metric value (or trying to parse the last two fields and choosing the one that parses as a float and is in a plausible range), and return that parsed float; reference the _counter_value function and COUNTER_NAME/COUNTER_LABELS when making the change so the logic explicitly handles the optional timestamp case instead of blindly using rsplit(" ", 1)[1].
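As a minimal sketch of the timestamp-tolerant parsing this comment asks for (the real `_counter_value` helper is not shown in full in this thread, so its signature and surrounding names are assumptions):

```python
def _counter_value(metrics_text: str, name: str, labels: str) -> float:
    """Extract a counter value from Prometheus text format, tolerating
    the optional trailing timestamp: name{labels} value [timestamp_ms]."""
    prefix = f"{name}{{{labels}}}"
    for line in metrics_text.splitlines():
        if line.startswith(prefix):
            fields = line[len(prefix):].split()
            # The sample value is always the first field after the label
            # block; with add_timestamp enabled, a millisecond timestamp
            # follows it as a second field.
            if fields:
                return float(fields[0])
    raise ValueError(f"counter {prefix} not found in scrape output")
```

Splitting on whitespace and taking the first field after the label block stays correct whether or not the exporter appends a timestamp, unlike `rsplit(" ", 1)[1]`, which always grabs the last field.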
📒 Files selected for processing (3)

- plugins/filter_log_to_metrics/log_to_metrics.c
- tests/integration/scenarios/filter_log_to_metrics/config/counter_timer_prometheus.yaml
- tests/integration/scenarios/filter_log_to_metrics/tests/test_filter_log_to_metrics_001.py
```yaml
- name: dummy
  tag: nginx.b2b.access
  rate: 1
  samples: 5
  dummy: '{"request_method":"GET","status":"200","host":"example.com","endpoint":"/","hostname":"host-a","upstream_response_time":"0.125"}'
```
`samples: 5` risks a timing-dependent test failure.

With only 5 events at `rate: 1` and `flush_interval_sec: 1`, the counter stops increasing after ~5 s. If startup latency pushes the first successful scrape past the point where all 5 events have been flushed (and `ctx->new_data` is already `FLB_FALSE`), the counter will be frozen at 5 and `second_value > first_value` can never be satisfied within the 15-second timeout.
Consider removing the sample cap (or setting it to a large value such as samples: 100) so the counter keeps climbing throughout the test window.
💡 Suggested fix

```diff
-      samples: 5
+      samples: 0
```

(`samples: 0` means unlimited in the dummy plugin.)
📝 Committable suggestion
```yaml
- name: dummy
  tag: nginx.b2b.access
  rate: 1
  samples: 0
  dummy: '{"request_method":"GET","status":"200","host":"example.com","endpoint":"/","hostname":"host-a","upstream_response_time":"0.125"}'
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@tests/integration/scenarios/filter_log_to_metrics/config/counter_timer_prometheus.yaml`
around lines 10 - 14, The test's dummy generator is capped with "samples: 5",
which can exhaust events before the second scrape and cause timing-dependent
failures; update the dummy block (the entry with name: dummy and tag:
nginx.b2b.access) to remove the sample cap or set it to a large/unlimited value
(e.g., set samples to 0 or samples: 100) so events keep being produced
throughout the test window and the counter can increase between scrapes.
```yaml
metric_name: nginx_request_status_code_total
metric_description: Counts nginx request status codes
metric_namespace: nginx
metric_subsystem: request
```
`metric_name` already embeds the namespace/subsystem prefix, producing a redundant name.

With `metric_namespace: nginx`, `metric_subsystem: request`, and `metric_name: nginx_request_status_code_total`, CMT assembles the full Prometheus name as `nginx_request_nginx_request_status_code_total`: the `nginx_request` segment appears twice. The `metric_name` should be just the leaf component.
💡 Suggested fix

```diff
-      metric_name: nginx_request_status_code_total
+      metric_name: status_code_total
```

(Update `COUNTER_NAME` in test_filter_log_to_metrics_001.py accordingly: `"nginx_request_status_code_total"`.)
📝 Committable suggestion
```yaml
metric_name: status_code_total
metric_description: Counts nginx request status codes
metric_namespace: nginx
metric_subsystem: request
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@tests/integration/scenarios/filter_log_to_metrics/config/counter_timer_prometheus.yaml`
around lines 23 - 26, The metric_name currently duplicates the
namespace/subsystem because it contains the full prefix; change metric_name to
the leaf component only (e.g., remove the leading "nginx_request" so metric_name
becomes "status_code_total" or similar desired leaf) so CMT composes the full
Prometheus name from metric_namespace (nginx) + metric_subsystem (request) +
metric_name; also update the corresponding test constant COUNTER_NAME in
test_filter_log_to_metrics_001.py to match the resulting assembled name (e.g.,
"nginx_request_status_code_total") so the test asserts the correct final metric
name.
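To make the composition rule concrete, here is a small illustrative sketch of the namespace/subsystem/name concatenation described above (the helper is hypothetical, not CMT's actual API):

```python
def assemble_metric_name(namespace: str, subsystem: str, name: str) -> str:
    # Compose the full Prometheus name as namespace_subsystem_name,
    # skipping any empty component.
    return "_".join(part for part in (namespace, subsystem, name) if part)

# Current config: the nginx_request prefix is duplicated.
assert (assemble_metric_name("nginx", "request", "nginx_request_status_code_total")
        == "nginx_request_nginx_request_status_code_total")

# Suggested leaf-only metric_name yields the intended final name.
assert (assemble_metric_name("nginx", "request", "status_code_total")
        == "nginx_request_status_code_total")
```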
```python
def test_log_to_metrics_counter_timer_emits_repeated_metric_chunks():
    service = Service("counter_timer_prometheus.yaml")
    service.start()
```
`service.start()` is outside the try/finally block, so `stop()` is skipped on startup failure.

If `service.start()` raises, the `finally: service.stop()` never runs, potentially leaving child processes alive.
💡 Suggested fix

```diff
     service = Service("counter_timer_prometheus.yaml")
-    service.start()
-
     try:
+        service.start()
+
         first_value = service.service.wait_for_condition(
```

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@tests/integration/scenarios/filter_log_to_metrics/tests/test_filter_log_to_metrics_001.py`
around lines 48 - 51, The test
test_log_to_metrics_counter_timer_emits_repeated_metric_chunks currently calls
Service("counter_timer_prometheus.yaml").start() outside a try/finally so
Service.stop() may be skipped if start() raises; wrap service.start() and the
rest of the test body in a try/finally (or add a try immediately after
instantiating service) and ensure service.stop() is called in the finally block
to guarantee cleanup of the Service instance even on startup failure.
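A minimal sketch of the requested structure, assuming the harness API referenced in this thread (`Service.start()`, `Service.stop()`, and `service.service.wait_for_condition(...)`); the waiting logic itself is elided:

```python
def test_log_to_metrics_counter_timer_emits_repeated_metric_chunks():
    service = Service("counter_timer_prometheus.yaml")
    try:
        # Starting inside the try block guarantees the finally clause runs
        # even when startup itself raises, so no child process is leaked.
        service.start()
        first_value = service.service.wait_for_condition(
            ...,  # predicate elided; see the actual test
            description="first log_to_metrics counter scrape",
        )
        # ... wait for a later scrape with a strictly larger counter ...
    finally:
        service.stop()
```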
It can potentially fix #11772 ...

Fixes timer-mode emission so pending metric data is not marked as consumed unless the internal metrics append succeeds.

Previously, `cb_send_metric_chunk()` cleared `ctx->new_data` unconditionally after each timer callback. If `flb_input_metrics_append()` failed, the pending counter state could be dropped from the timer path, causing subsequent emissions to stall until new data reset the flag. The fix only clears `ctx->new_data` after a successful append.

Also adds an integration test that runs `log_to_metrics` in counter timer mode and verifies Prometheus-exported counter values continue increasing.

Fluent Bit is licensed under Apache 2.0; by submitting this pull request I understand that this code will be released under the terms of that license.
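The actual change lives in C, in plugins/filter_log_to_metrics/log_to_metrics.c; as an illustrative Python sketch of the control flow described above (the `append_metrics` helper is a hypothetical stand-in for `flb_input_metrics_append()`):

```python
def cb_send_metric_chunk(ctx):
    # Timer callback: emit pending metrics, retrying on failure.
    if not ctx.new_data:
        return  # nothing pending since the last successful emission

    appended = append_metrics(ctx)  # stand-in for flb_input_metrics_append()

    # Before the fix, ctx.new_data was cleared unconditionally here, so a
    # failed append dropped the pending state and later timer ticks had
    # nothing left to retry.
    if appended:
        ctx.new_data = False  # mark consumed only after a successful append
```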