feat(sandbox): pre-pause guest reclaim via envd#2551
feat(sandbox): pre-pause guest reclaim via envd#2551ValentaTomas wants to merge 15 commits intomainfrom
Conversation
Run sync, drop_caches, compact_memory, and fstrim -av on the live VM through envd's Process service immediately before pause to shrink the memfile/rootfs diff snapshot. Composed as a single bash chain with ';' separators so each step is best-effort, the orchestrator owns the deadline via Connect-Timeout-Ms, and all failures are non-fatal. Gated by reclaim-on-pause-timeout-ms (LD int flag, ms; default 0 = disabled). resume-build gains a matching --reclaim-timeout-ms override for local exercise.
PR SummaryMedium Risk Overview Potential issues: Reviewed by Cursor Bugbot for commit c3f843b. Bugbot is set up for automated code reviews on this repo. Configure here. |
Wraps each reclaim step (sync, drop_caches, compact_memory, fstrim) in its own `timeout -s KILL`. A stuck step (most realistically a slow sync on a large dirty backlog) cannot starve the rest, so compact_memory — the diff-critical step — always runs as long as its cap is > 0. Per-step ceilings are runtime-configurable via four new IntFlags: - reclaim-sync-timeout-ms (default 500) - reclaim-drop-caches-timeout-ms (default 200) - reclaim-compact-memory-timeout-ms (default 1000) - reclaim-fstrim-timeout-ms (default 500) Setting any per-step cap to 0 skips that step. The outer reclaim-on-pause-timeout-ms remains the master enable + Connect-Timeout-Ms cap. Pausing FC is unaffected by an in-flight guest sync that we time out: FC only drains in-flight virtio I/O before completing the pause; any unflushed dirty pages stay in the memfile snapshot and converge on resume. Per-step timeouts trade reclaim payoff, never correctness.
…uffix and join Three fixes triggered by Cursor Bugbot review of the previous commit and a follow-up question on the master flag: 1. Drop reclaim-on-pause-timeout-ms. Per-step caps (defaulting to 0) already encode "disabled by default": when every cap is 0 the script is empty and bestEffortReclaim short-circuits without calling envd. The outer Connect-Timeout-Ms is now derived from the sum of per-step caps + 500ms slack. 2. `timeout` accepts s/m/h/d (or fractional seconds), not `ms`. Format each cap as `%.3f` seconds (e.g. 500ms → 0.500). Without this, every step would silently fail with "invalid time interval". 3. Join parts with `; ` (not a single space) and append one trailing `true`. With space-joining, bash parsed `; true timeout ...` as `true` swallowing subsequent steps as args, so only `sync` ever ran. 4. Add `--foreground` to `timeout`. Without it, the SIGKILL doesn't reliably reach a stuck child when run from a non-interactive bash invoked by envd's Process service (verified empirically with `sh -c "sleep 5"` running its full 5s despite a 0.5s timeout). resume-build CLI: replace --reclaim-timeout-ms with a `--reclaim` bool that flips the per-step caps to sane local-test defaults (500/200/1000/500).
The previous refactor commit referenced Sandbox.StartEnvdProcess from reclaim.go and resume-build/main.go but the helper file itself was never tracked, breaking the orchestrator build (typecheck) on CI.
Adds `vm.compaction_proactiveness=0` to the base template's `/etc/sysctl.conf` so kcompactd no longer runs background page migrations in the guest. With 2 MiB host-side hugepage backing of guest RAM, every migration dirties a destination hugepage from the host UFFD's perspective and lands in the next memfile diff — with no snapshot-aligned benefit. The pre-pause `compact_memory` write (#2551) does the work deterministically right before we capture state. Existing templates inherit the change on rebuild.
Redirect each step's stdout+stderr to /dev/null inside the guest so envd doesn't ship per-command output back to the orchestrator. Capture per-step exit codes into `rc` and `exit $rc` instead of the trailing `true`, so the script's overall status reflects whether any step failed. On the orchestrator side, inspect the End event's exit code and log a warning only when envd errors or the script reports non-zero.
❌ 6 Tests Failed:
View the full list of 7 ❄️ flaky test(s)
To view more test analytics, go to the Test Analytics Dashboard |
Drop the StartEnvdProcess helper (avoiding the -l login-shell regression in resume-build's runCommandInSandbox) and inline the envd Process.Start construction in reclaim.go. Drop `timeout --foreground`: by default `timeout` puts the wrapped sh in its own pgroup so SIGKILL reaches all descendants — `--foreground` would target only the direct child.
There was a problem hiding this comment.
An organization admin can view or raise the cap at claude.ai/admin-settings/claude-code. The cap resets at the start of the next billing period.
Once the cap resets or is raised, push a new commit or reopen this pull request to trigger a review.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 40a1e24405
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Code Review by Qodo
1.
|
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Autofix Details
Bugbot Autofix prepared a fix for the issue found in the latest run.
- ✅ Fixed: Reclaim requires Bash
- Changed the outer shell from /bin/bash to /bin/sh to support minimal images like Alpine that only provide BusyBox with /bin/sh.
Or push these changes by commenting:
@cursor push dc39246efe
Preview (dc39246efe)
diff --git a/packages/orchestrator/pkg/sandbox/reclaim.go b/packages/orchestrator/pkg/sandbox/reclaim.go
--- a/packages/orchestrator/pkg/sandbox/reclaim.go
+++ b/packages/orchestrator/pkg/sandbox/reclaim.go
@@ -83,7 +83,7 @@
pc := processconnect.NewProcessClient(&http.Client{Transport: sandboxHttpClient.Transport}, addr)
req := connect.NewRequest(&process.StartRequest{
- Process: &process.ProcessConfig{Cmd: "/bin/bash", Args: []string{"-c", script}},
+ Process: &process.ProcessConfig{Cmd: "/bin/sh", Args: []string{"-c", script}},
})
req.Header().Set("Connect-Timeout-Ms", strconv.FormatInt(int64(timeout/time.Millisecond), 10))
if s.Config.Envd.AccessToken != nil {You can send follow-ups to the cloud agent here.
Prevents gofmt from re-aligning HostStatsSamplingInterval's trailing comment column when the longer ReclaimCompactMemoryTimeoutMs name was inserted next to it, keeping the diff scoped to the new feature.
…-build Caller passes the bash args (e.g. ["-l","-c",cmd] or ["-c",script]) so login-shell is a per-call decision rather than baked into the helper. Removes the duplicated envd Process.Start boilerplate from reclaim.go and runCommandInSandbox.
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 2 total unresolved issues (including 1 from previous review).
Autofix Details
Bugbot Autofix prepared a fix for the issue found in the latest run.
- ✅ Fixed: Reclaim requires Bash
- Changed /bin/bash to /bin/sh in StartEnvdBash to support minimal guest images that only have sh, since reclaim commands are already shell-compatible.
Or push these changes by commenting:
@cursor push 91ed118737
Preview (91ed118737)
diff --git a/packages/orchestrator/pkg/sandbox/envd_process.go b/packages/orchestrator/pkg/sandbox/envd_process.go
--- a/packages/orchestrator/pkg/sandbox/envd_process.go
+++ b/packages/orchestrator/pkg/sandbox/envd_process.go
@@ -16,7 +16,7 @@
)
// StartEnvdBash opens a streaming Process.Start call against this
-// sandbox's envd, running `/bin/bash` with the given args as `user`.
+// sandbox's envd, running `/bin/sh` with the given args as `user`.
// Caller chooses login-shell vs. plain (e.g. []{"-l","-c",cmd} vs.
// []{"-c",script}). When timeout > 0 it sets `Connect-Timeout-Ms` so
// envd kills the process at the deadline. Auth/user headers are wired
@@ -31,7 +31,7 @@
pc := processconnect.NewProcessClient(&http.Client{Transport: sandboxHttpClient.Transport}, addr)
req := connect.NewRequest(&process.StartRequest{
- Process: &process.ProcessConfig{Cmd: "/bin/bash", Args: bashArgs},
+ Process: &process.ProcessConfig{Cmd: "/bin/sh", Args: bashArgs},
})
if timeout > 0 {
req.Header().Set("Connect-Timeout-Ms", strconv.FormatInt(timeout.Milliseconds(), 10))You can send follow-ups to the cloud agent here.
- Add featureflags.DurationFlag (string-backed, parsed via time.ParseDuration) so LaunchDarkly values like "500ms" are self-describing instead of opaque ints. - Convert reclaim per-step timeouts (sync, drop_caches, compact_memory, fstrim) from IntFlag to DurationFlag. - Rename StartEnvdBash -> StartEnvdShell and switch to /bin/sh so reclaim and resume-build commands work on minimal guests without bash. - Add NewOfflineClient and use it in resume-build so per-run flag overrides (e.g. from -reclaim) take effect even when LAUNCH_DARKLY_API_KEY is set.
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Bugbot Autofix prepared a fix for the issue found in the latest run.
- ✅ Fixed: Sub-millisecond durations format as zero disabling timeout
- Added check to round up any positive duration below 1ms to 0.001 seconds before formatting, preventing GNU timeout from treating it as 0 (no timeout).
Or push these changes by commenting:
@cursor push 080e581151
Preview (080e581151)
diff --git a/packages/orchestrator/pkg/sandbox/reclaim.go b/packages/orchestrator/pkg/sandbox/reclaim.go
--- a/packages/orchestrator/pkg/sandbox/reclaim.go
+++ b/packages/orchestrator/pkg/sandbox/reclaim.go
@@ -46,7 +46,12 @@
// `timeout` accepts fractional seconds (s/m/h/d), not `ms`. Output
// is dropped; non-zero status is captured into `rc` so the final
// exit code surfaces failures without short-circuiting later steps.
- parts = append(parts, fmt.Sprintf("timeout -s KILL %.3f sh -c %q >/dev/null 2>&1 || rc=$?", d.Seconds(), st.cmd))
+ // Ensure sub-millisecond durations round up to 0.001 so timeout != 0.
+ sec := d.Seconds()
+ if sec < 0.001 && sec > 0 {
+ sec = 0.001
+ }
+ parts = append(parts, fmt.Sprintf("timeout -s KILL %.3f sh -c %q >/dev/null 2>&1 || rc=$?", sec, st.cmd))
sum += d
}
if len(parts) == 0 {You can send follow-ups to the cloud agent here.
Reviewed by Cursor Bugbot for commit 801a37f. Configure here.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 801a37f79e
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Replace per-step DurationFlags with a single JSON flag (reclaim-config) that holds all per-step caps and is evaluated against sandbox/team/template LD contexts, so reclaim can be enabled per cohort in LaunchDarkly with one rule. Default is null = all disabled. Sub-ms guard preserved. Drop the now-unused DurationFlag type.


Adds an opt-in pre-pause step that runs
sync,drop_caches,compact_memory, andfstrim -avon the live VM via envd's Process service to shrink the memfile/rootfs diff. Each step is wrapped intimeout -s KILL, so a stuck step (most realistically a slowsyncon a large dirty backlog) cannot starve the rest —compact_memoryalways runs as long as its own cap is> 0.Pausing FC is unaffected by an in-flight guest
syncwe time out: FC only drains in-flight virtio I/O before completing the pause; any unflushed dirty pages stay in the memfile snapshot and converge on resume. Per-step timeouts trade reclaim payoff, never correctness.Disabled by default — every per-step cap defaults to 0, so the chain is empty until an operator opts in step by step. The orchestrator skips the envd call entirely when the chain is empty. The outer
Connect-Timeout-Msis derived from the sum of per-step caps plus a small slack.LD flags (all int, ms;
0skips that step):reclaim-sync-timeout-msreclaim-drop-caches-timeout-msreclaim-compact-memory-timeout-msreclaim-fstrim-timeout-msPairs cleanly with #2553 (disable proactive compaction in the guest base image), but is independent of it and of FPH (#2552). Split out from #2550.