fix(envd): fix process cleanup in cgroup test to prevent OOM and data race#2588
fix(envd): fix process cleanup in cgroup test to prevent OOM and data race#2588
Conversation
Three problems in the cgroup test: 1. Orphaned children: killing only bash left tail running as an orphan eating unbounded memory. Fix: Setpgid + kill(-pid) to kill the entire process group. `exec tail` replaces bash with tail so there's a single process — no orphan possible. 2. Double cmd.Wait race: both waitForProcess (goroutine) and t.Cleanup called cmd.Wait, racing on Cmd internals. Fix: t.Cleanup only kills (no Wait); waitForProcess drains its goroutine with <-done after killing on timeout. 3. Goroutine leak on timeout: waitForProcess returned DeadlineExceeded without waiting for the cmd.Wait goroutine. Fix: kill the process group then <-done before returning.
…oup test Use a fixed 512 MiB allocation instead of unbounded tail /dev/zero. If the process escapes cleanup it holds at most 512 MiB then sleeps, rather than growing until the kernel OOM-kills it.
ⓘ You've reached your Qodo monthly free-tier limit. Reviews pause until next month — upgrade your plan to continue now, or link your paid account if you already have one. |
PR SummaryMedium Risk Overview Reviewed by Cursor Bugbot for commit 99aaf8c. Bugbot is set up for automated code reviews on this repo. Configure here. |
❌ 10 Tests Failed:
View the full list of 12 ❄️ flaky test(s)
To view more test analytics, go to the Test Analytics Dashboard |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ad104fcdc8
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
There was a problem hiding this comment.
Code Review
The pull request updates cgroup2 tests to use a memory-intensive perl process and implements process group termination for cleanup. A correctness issue was identified in the killProcessGroup function where signaling a reaped PID could affect unrelated processes; a check for cmd.ProcessState was recommended to ensure the process has not yet been reaped before sending the signal.
99aaf8c to
ad104fc
Compare
There was a problem hiding this comment.
LGTM — test-only fix that scopes process cleanup correctly and bounds the allocator to prevent host OOM.
Extended reasoning...
Overview
This PR modifies a single test file (packages/envd/internal/services/cgroups/cgroup2_test.go) to fix three issues in the cgroup round-trip test: (1) orphaned tail processes consuming unbounded memory after the bash parent was killed, (2) a data race from both waitForProcess and t.Cleanup calling cmd.Wait(), and (3) replacing the unbounded tail /dev/zero allocator with a perl one-liner that allocates a fixed 512 MiB. The fix introduces Setpgid + syscall.Kill(-pid, SIGKILL) to terminate the entire process group, makes waitForProcess drain its goroutine after killing on timeout, and uses exec so bash replaces itself with the perl child.
Security risks
None. This is test-only code that runs locally as part of the envd test suite. No production code paths, auth, crypto, or external interfaces are touched.
Level of scrutiny
Low. This is a test-only change in a single file, the changes are mechanical and well-justified by the PR description, and the fix follows standard Go patterns for process-group cleanup. The data race fix and OOM mitigation are sensible and the scope is tightly contained.
Other factors
The bug hunting system found no issues. The two prior reviewer comments are noise (Qodo hit its free-tier limit; Cursor posted a placeholder summary). The change is self-contained, well-commented in the PR description, and reduces flakiness risk in CI without altering test semantics.
The cgroup round-trip test spawns
tail /dev/zerounder a memory-limited cgroup to verify the OOM kill. Three problems caused host OOM and flaky failures when running with-countor-race:Killing only bash left
tailrunning as an orphan that ate unbounded memory. The fix starts the child in its own process group (Setpgid) and kills the entire group on timeout and int.Cleanup. The command now usesexecso bash replaces itself with the child process.Both
waitForProcessandt.Cleanupcalledcmd.Wait(), causing a data race. Nowt.Cleanuponly kills, andwaitForProcessowns the wait — it drains the goroutine after killing on timeout so there's no leak or race.The second commit replaces
tail /dev/zerowith a perl one-liner that allocates a fixed 512 MiB and sleeps. If the process escapes cleanup it holds bounded memory rather than growing until the kernel intervenes.