engine: ignore duplicate STOP to prevent shutdown spin#11747
engine: ignore duplicate STOP to prevent shutdown spin#11747jinyongchoi wants to merge 2 commits intofluent:masterfrom
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughAdds a guard in the engine STOP handling to skip re-entrance when shutdown is already in progress, and adds a POSIX-only runtime test that calls shutdown twice under a watchdog to ensure the engine does not busy-loop. Changes
Sequence Diagram(s)sequenceDiagram
participant Caller as Caller (SIGTERM / plugin)
participant Engine as Engine (flb_engine_manager)
participant Config as Config (config->is_shutting_down)
participant EventLoop as EventLoop/Timerfd
Caller->>Engine: send FLB_ENGINE_STOP
Engine->>Config: read is_shutting_down
alt not shutting down
Engine->>EventLoop: register shutdown timer / set shutting flag
Engine->>Engine: perform flush and shutdown sequence
else already shutting down
Engine-->>Caller: return 0 (no-op)
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Pull request overview
This PR makes Fluent Bit’s engine shutdown path idempotent by ignoring duplicate FLB_ENGINE_STOP messages once shutdown has already started, preventing an epoll/timerfd busy-loop that can pin CPU and block termination (Fixes #11744).
Changes:
- Ignore duplicate STOP events in
flb_engine_manager()whenconfig->is_shutting_downis already set. - Add a runtime regression test that triggers back-to-back STOP requests and asserts
flb_stop()returns within the grace window. - Register the new runtime test in the runtime test CMake target list.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
src/flb_engine.c |
Adds an idempotency guard for duplicate STOP messages to prevent shutdown spin. |
tests/runtime/core_shutdown_spin.c |
Introduces a regression test for duplicate STOP shutdown behavior. |
tests/runtime/CMakeLists.txt |
Registers the new runtime test for CTest execution. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/runtime/core_shutdown_spin.c`:
- Around line 22-41: The test core_shutdown_spin.c uses POSIX-only APIs (alarm,
sigaction, SIGALRM, STDERR_FILENO, _exit, and the timeout_abort handler) and is
being registered unconditionally; update tests/runtime/CMakeLists.txt to guard
the test registration for core_shutdown_spin.c with the existing Windows check
pattern (wrap the add_test/add_executable lines for core_shutdown_spin.c inside
an if(NOT FLB_SYSTEM_WINDOWS) ... endif block) so the test (and its use of
timeout_abort, SIGALRM, alarm, STDERR_FILENO, _exit) is only added on
non-Windows platforms.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 76542311-7e98-455b-af44-bc625fc54bc6
📒 Files selected for processing (3)
src/flb_engine.ctests/runtime/CMakeLists.txttests/runtime/core_shutdown_spin.c
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 839473d39e
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
When FLB_ENGINE_STOP arrives more than once in quick succession (e.g. an input plugin's internal flb_engine_exit() followed by an external SIGTERM), the second invocation re-enters the STOP handler block in flb_engine_start() and resets config->event_shutdown->status to MK_EVENT_NONE while the shutdown timerfd is still registered in the kernel's epoll set. The event loop dispatcher then drops the timer event because of the 'status != MK_EVENT_NONE' guard in flb_event_load_bucket_queue(), but the level-triggered timerfd keeps reporting EPOLLIN. The pipeline thread busy-loops in epoll_wait() at 100% CPU, grace_count never advances, and the process fails to terminate. Swallow duplicate STOP messages at the flb_engine_manager() boundary once shutdown is already in progress (config->is_shutting_down is set by flb_engine_stop_ingestion() during the first STOP). The first STOP arms the shutdown timer and drives the grace flow; any further STOPs would only corrupt existing event state without benefit. Periodic work during shutdown (flushing, task draining, grace counter) is already handled by the 1s tick in the FLB_ENGINE_SHUTDOWN branch, so swallowing the duplicate is safe. Signed-off-by: jinyong.choi <inimax801@gmail.com>
839473d to
cbd652c
Compare
Add core_shutdown_spin.c covering the duplicate-FLB_ENGINE_STOP busy-spin bug fixed in the previous commit. The test builds a minimal lib-input -> null-output pipeline, invokes flb_engine_exit() twice in quick succession, and asserts that flb_stop() returns within the grace period. A SIGALRM watchdog (SHUTDOWN_WATCHDOG_SEC=10) bounds the wait: if the guard regresses, pthread_join on the spinning worker never returns, the handler aborts the process with a visible FAIL message and exit code 1. This avoids relying on CTest's per-test timeout (1500s default) and surfaces the regression quickly regardless of how the binary is invoked. Signed-off-by: jinyong.choi <inimax801@gmail.com>
cbd652c to
00a1c2b
Compare
|
@codex review |
|
Codex Review: Didn't find any major issues. Already looking forward to the next diff. ℹ️ About Codex in GitHubCodex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback". |
|
Ready for review, thanks! |
Fix a shutdown spin where duplicate
FLB_ENGINE_STOPmessages (e.g. an input plugin's internalflb_engine_exit()followed by an external SIGTERM, or any path that writes STOP twice toch_manager) cause the pipeline thread to busy-loop at 100% CPU and the process fails to terminate.The second STOP re-enters the handler block in
flb_engine_start()and resetsconfig->event_shutdown->statustoMK_EVENT_NONEwhile the shutdown timerfd is still registered in epoll. The dispatcher then drops the timer event via thestatus != MK_EVENT_NONEguard inflb_event_load_bucket_queue(), but the level-triggered timerfd keeps reporting EPOLLIN — busy-loop,grace_countnever advances,flb_engine_shutdown()unreachable.Fix: swallow duplicate STOP at the
flb_engine_manager()boundary whenconfig->is_shutting_downis already set (first STOP sets it viaflb_engine_stop_ingestion()). Periodic shutdown work (flush, task drain, grace counter) is already driven by the 1s tick in theFLB_ENGINE_SHUTDOWNbranch, so swallowing the duplicate is safe.I considered putting the guard at each
flb_engine_exit()call site or insideflb_engine_exit()itself, but chose to place it inflb_engine_manager()for the following reasons. I'd be happy to move it if maintainers prefer a different layer.flb_engine_exit()has 12+ call sites (in_tail, in_exec, in_stdin, out_exit, filter_expect, flb_lib, winsvc, ...) that look like independent shutdown triggers, so a caller-side check looked prone to races:config->is_shutting_downis set by the engine loop rather than byflb_engine_exit(), so two producers could both observeFALSEand each send a STOP.if (shutdown_fd <= 0)), so extending the same idempotency to theevent_shutdownreset seemed consistent with that existing pattern.Fixes #11744
Enter
[N/A]in the box, if an item is not applicable to your change.Testing
Before we can approve your change; please submit the following in a comment:
Shutdown timing race — not reproducible via configuration, so a regression test is included instead:
tests/runtime/core_shutdown_spin.creproduces the duplicate-STOP race via two back-to-backflb_engine_exit()calls and assertsflb_stop()returns within the grace period.If this is a change to packaging of containers or native binaries then please confirm it works for all targets.
ok-package-testlabel to test for all targets (requires maintainer to do).Documentation
Backporting
Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.
Summary by CodeRabbit
Bug Fixes
Tests