Skip to content

feat(sandbox): pre-pause guest reclaim via envd#2551

Open
ValentaTomas wants to merge 15 commits intomainfrom
feat/sandbox-pause-reclaim
Open

feat(sandbox): pre-pause guest reclaim via envd#2551
ValentaTomas wants to merge 15 commits intomainfrom
feat/sandbox-pause-reclaim

Conversation

@ValentaTomas
Copy link
Copy Markdown
Member

@ValentaTomas ValentaTomas commented May 4, 2026

Adds an opt-in pre-pause step that runs sync, drop_caches, compact_memory, and fstrim -av on the live VM via envd's Process service to shrink the memfile/rootfs diff. Each step is wrapped in timeout -s KILL, so a stuck step (most realistically a slow sync on a large dirty backlog) cannot starve the rest — compact_memory always runs as long as its own cap is > 0.

Pausing FC is unaffected by an in-flight guest sync we time out: FC only drains in-flight virtio I/O before completing the pause; any unflushed dirty pages stay in the memfile snapshot and converge on resume. Per-step timeouts trade reclaim payoff, never correctness.

Disabled by default — every per-step cap defaults to 0, so the chain is empty until an operator opts in step by step. The orchestrator skips the envd call entirely when the chain is empty. The outer Connect-Timeout-Ms is derived from the sum of per-step caps plus a small slack.

LD flags (all int, ms; 0 skips that step):

  • reclaim-sync-timeout-ms
  • reclaim-drop-caches-timeout-ms
  • reclaim-compact-memory-timeout-ms
  • reclaim-fstrim-timeout-ms

Pairs cleanly with #2553 (disable proactive compaction in the guest base image), but is independent of it and of FPH (#2552). Split out from #2550.

Run sync, drop_caches, compact_memory, and fstrim -av on the live VM
through envd's Process service immediately before pause to shrink the
memfile/rootfs diff snapshot. Composed as a single bash chain with
';' separators so each step is best-effort, the orchestrator owns the
deadline via Connect-Timeout-Ms, and all failures are non-fatal.

Gated by reclaim-on-pause-timeout-ms (LD int flag, ms; default 0 =
disabled). resume-build gains a matching --reclaim-timeout-ms override
for local exercise.
@cursor
Copy link
Copy Markdown

cursor Bot commented May 4, 2026

PR Summary

Medium Risk
Touches the sandbox pause/snapshot path and adds envd-driven shell execution before pausing; while best-effort and disabled by default, timeouts/streaming behavior regressions could impact snapshot latency or reliability when enabled.

Overview
Adds an opt-in pre-pause “reclaim” step that runs guest sync, drop_caches, compact_memory, and fstrim via envd before taking a snapshot, driven by a new reclaim-config JSON flag and executed with per-step timeout caps.

Potential issues: StartEnvdShell uses an http.Client without a client-side timeout (and resume-build passes timeout=0), so command execution can now hang indefinitely unless the caller always provides context deadlines; and resume-build now always uses an offline LaunchDarkly client, which can unexpectedly ignore real LD targeting when an API key is present.

Reviewed by Cursor Bugbot for commit c3f843b. Bugbot is set up for automated code reviews on this repo. Configure here.

Comment thread packages/orchestrator/pkg/sandbox/reclaim.go Outdated
Wraps each reclaim step (sync, drop_caches, compact_memory, fstrim) in
its own `timeout -s KILL`. A stuck step (most realistically a slow sync
on a large dirty backlog) cannot starve the rest, so compact_memory —
the diff-critical step — always runs as long as its cap is > 0.

Per-step ceilings are runtime-configurable via four new IntFlags:
- reclaim-sync-timeout-ms (default 500)
- reclaim-drop-caches-timeout-ms (default 200)
- reclaim-compact-memory-timeout-ms (default 1000)
- reclaim-fstrim-timeout-ms (default 500)

Setting any per-step cap to 0 skips that step. The outer
reclaim-on-pause-timeout-ms remains the master enable + Connect-Timeout-Ms
cap.

Pausing FC is unaffected by an in-flight guest sync that we time out:
FC only drains in-flight virtio I/O before completing the pause; any
unflushed dirty pages stay in the memfile snapshot and converge on
resume. Per-step timeouts trade reclaim payoff, never correctness.
Comment thread packages/orchestrator/pkg/sandbox/reclaim.go Outdated
Comment thread packages/orchestrator/pkg/sandbox/reclaim.go Outdated
…uffix and join

Three fixes triggered by Cursor Bugbot review of the previous commit and
a follow-up question on the master flag:

1. Drop reclaim-on-pause-timeout-ms. Per-step caps (defaulting to 0)
   already encode "disabled by default": when every cap is 0 the script
   is empty and bestEffortReclaim short-circuits without calling envd.
   The outer Connect-Timeout-Ms is now derived from the sum of per-step
   caps + 500ms slack.

2. `timeout` accepts s/m/h/d (or fractional seconds), not `ms`. Format
   each cap as `%.3f` seconds (e.g. 500ms → 0.500). Without this, every
   step would silently fail with "invalid time interval".

3. Join parts with `; ` (not a single space) and append one trailing
   `true`. With space-joining, bash parsed `; true timeout ...` as
   `true` swallowing subsequent steps as args, so only `sync` ever ran.

4. Add `--foreground` to `timeout`. Without it, the SIGKILL doesn't
   reliably reach a stuck child when run from a non-interactive bash
   invoked by envd's Process service (verified empirically with
   `sh -c "sleep 5"` running its full 5s despite a 0.5s timeout).

resume-build CLI: replace --reclaim-timeout-ms with a `--reclaim` bool
that flips the per-step caps to sane local-test defaults (500/200/1000/500).
The previous refactor commit referenced Sandbox.StartEnvdProcess from
reclaim.go and resume-build/main.go but the helper file itself was
never tracked, breaking the orchestrator build (typecheck) on CI.
Comment thread packages/orchestrator/pkg/sandbox/envd_process.go Outdated
ValentaTomas added a commit that referenced this pull request May 4, 2026
Adds `vm.compaction_proactiveness=0` to the base template's
`/etc/sysctl.conf` so kcompactd no longer runs background page
migrations in the guest.

With 2 MiB host-side hugepage backing of guest RAM, every migration
dirties a destination hugepage from the host UFFD's perspective and
lands in the next memfile diff — with no snapshot-aligned benefit. The
pre-pause `compact_memory` write (#2551) does the work deterministically
right before we capture state.

Existing templates inherit the change on rebuild.
Redirect each step's stdout+stderr to /dev/null inside the guest so envd
doesn't ship per-command output back to the orchestrator. Capture
per-step exit codes into `rc` and `exit $rc` instead of the trailing
`true`, so the script's overall status reflects whether any step failed.
On the orchestrator side, inspect the End event's exit code and log a
warning only when envd errors or the script reports non-zero.
@codecov
Copy link
Copy Markdown

codecov Bot commented May 4, 2026

❌ 6 Tests Failed:

Tests completed Failed Passed Skipped
2586 6 2580 7
View the full list of 7 ❄️ flaky test(s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/sandboxes::TestUpdateNetworkConfig

Flake rate in main: 72.96% (Passed 43 times, Failed 116 times)

Stack Traces | 36.2s run time
=== RUN   TestUpdateNetworkConfig
=== PAUSE TestUpdateNetworkConfig
=== CONT  TestUpdateNetworkConfig
--- FAIL: TestUpdateNetworkConfig (36.17s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/sandboxes::TestUpdateNetworkConfig/pause_resume_preserves_allow_internet_access_false

Flake rate in main: 73.20% (Passed 41 times, Failed 112 times)

Stack Traces | 1.48s run time
=== RUN   TestUpdateNetworkConfig/pause_resume_preserves_allow_internet_access_false
Executing command curl in sandbox i8tw1tvu5ryb55n4u9uei
    sandbox_network_update_test.go:372: Command [curl] output: event:{start:{pid:1352}}
    sandbox_network_update_test.go:372: Command [curl] output: event:{end:{exit_code:35 exited:true status:"exit status 35" error:"exit status 35"}}
Executing command curl in sandbox i8tw1tvu5ryb55n4u9uei
    sandbox_network_update_test.go:372: Command [curl] output: event:{start:{pid:1353}}
    sandbox_network_update_test.go:372: Command [curl] output: event:{end:{exit_code:35 exited:true status:"exit status 35" error:"exit status 35"}}
    sandbox_network_update_test.go:391: Command [curl] output: event:{start:{pid:1354}}
    sandbox_network_update_test.go:391: Command [curl] output: event:{data:{stdout:"HTTP/2 302 \r\nx-content-type-options: nosniff\r\nlocation: https://dns.google/\r\ndate: Thu, 07 May 2026 07:18:24 GMT\r\ncontent-type: text/html; charset=UTF-8\r\nserver: HTTP server (unknown)\r\ncontent-length: 216\r\nx-xss-protection: 0\r\nx-frame-options: SAMEORIGIN\r\nalt-svc: h3=\":443\"; ma=2592000,h3-29=\":443\"; ma=2592000\r\n\r\n"}}
    sandbox_network_update_test.go:391: Command [curl] output: event:{end:{exited:true status:"exit status 0"}}
    sandbox_network_update_test.go:391: Command [curl] completed successfully in sandbox i8tw1tvu5ryb55n4u9uei
    sandbox_network_update_test.go:391: 
        	Error Trace:	.../api/sandboxes/sandbox_network_out_test.go:74
        	            				.../api/sandboxes/sandbox_network_update_test.go:60
        	            				.../api/sandboxes/sandbox_network_update_test.go:391
        	Error:      	An error is expected but got nil.
        	Test:       	TestUpdateNetworkConfig/pause_resume_preserves_allow_internet_access_false
        	Messages:   	https://8.8.8.8 should be blocked
--- FAIL: TestUpdateNetworkConfig/pause_resume_preserves_allow_internet_access_false (1.48s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestBindLocalhost

Flake rate in main: 54.20% (Passed 60 times, Failed 71 times)

Stack Traces | 0s run time
=== RUN   TestBindLocalhost
=== PAUSE TestBindLocalhost
=== CONT  TestBindLocalhost
--- FAIL: TestBindLocalhost (0.00s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestBindLocalhost/bind_::1

Flake rate in main: 59.78% (Passed 37 times, Failed 55 times)

Stack Traces | 7.05s run time
=== RUN   TestBindLocalhost/bind_::1
=== PAUSE TestBindLocalhost/bind_::1
=== CONT  TestBindLocalhost/bind_::1
Executing command python in sandbox iy7qt8jt5f98okvdhkys0
    localhost_bind_test.go:69: Command [python] output: event:{start:{pid:1264}}
Executing command python in sandbox itxhh4aaqea8c2tk19w01
    localhost_bind_test.go:90: 
        	Error Trace:	.../tests/envd/localhost_bind_test.go:90
        	Error:      	Not equal: 
        	            	expected: 200
        	            	actual  : 502
        	Test:       	TestBindLocalhost/bind_::1
        	Messages:   	Unexpected status code 502 for bind address ::1
--- FAIL: TestBindLocalhost/bind_::1 (7.05s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestBindLocalhost/bind_localhost

Flake rate in main: 60.22% (Passed 37 times, Failed 56 times)

Stack Traces | 7.56s run time
=== RUN   TestBindLocalhost/bind_localhost
=== PAUSE TestBindLocalhost/bind_localhost
=== CONT  TestBindLocalhost/bind_localhost
    localhost_bind_test.go:69: Command [python] output: event:{start:{pid:1252}}
Executing command python in sandbox ixzvcpbnxibjck31cd6r9
    localhost_bind_test.go:90: 
        	Error Trace:	.../tests/envd/localhost_bind_test.go:90
        	Error:      	Not equal: 
        	            	expected: 200
        	            	actual  : 502
        	Test:       	TestBindLocalhost/bind_localhost
        	Messages:   	Unexpected status code 502 for bind address localhost
Executing command python in sandbox imy2zrbfm03fb3cby9phc
--- FAIL: TestBindLocalhost/bind_localhost (7.56s)
github.com/e2b-dev/infra/tests/integration/internal/tests/orchestrator::TestSandboxMemoryIntegrity

Flake rate in main: 60.50% (Passed 47 times, Failed 72 times)

Stack Traces | 73.2s run time
=== RUN   TestSandboxMemoryIntegrity
=== PAUSE TestSandboxMemoryIntegrity
=== CONT  TestSandboxMemoryIntegrity
    sandbox_memory_integrity_test.go:26: Build completed successfully
--- FAIL: TestSandboxMemoryIntegrity (73.20s)
github.com/e2b-dev/infra/tests/integration/internal/tests/orchestrator::TestSandboxMemoryIntegrity/tmpfs_hash

Flake rate in main: 64.08% (Passed 37 times, Failed 66 times)

Stack Traces | 29.3s run time
=== RUN   TestSandboxMemoryIntegrity/tmpfs_hash
=== PAUSE TestSandboxMemoryIntegrity/tmpfs_hash
=== CONT  TestSandboxMemoryIntegrity/tmpfs_hash
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{start:{pid:1252}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stdout:"Total memory: 985 MB\nUsed memory before tmpfs mount: 183 MB\nFree memory before tmpfs mount: 801 MB\nMemory to use in integrity test (80% of free, min 64MB): 640 MB\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"640+0 records in\n640+0 records out\n671088640 bytes (671 MB, 640 MiB) copied, 3.35568 s, 200 MB/s\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"\tCommand being timed: \"dd if=/dev/urandom of=/mnt/testfile bs=1M count=640\"\n\tUser time (seconds): 0.00\n\tSystem time (seconds): 3.33\n\tPercent of CPU this job got: 99%\n\tElapsed (wall clock) time (h:mm:ss or m:ss): 0:03.36\n\tAverage shared text size (kbytes): 0\n\tAverage unshared data size (kbytes): 0\n\tAverage stack size (kbytes): 0\n\tAverage total size (kbytes): 0\n\tMaximum resident set size (kbytes): 2652\n\tAverage resident set size (kbytes): 0\n\tMajor (requiring I/O) page faults: 3\n\tMinor (reclaiming a frame) page faults: 344\n\tVoluntary context switches: 4\n\tInvoluntary context switches: 36\n\tSwaps: 0\n\tFile system inputs: 176\n\tFile system outputs: 0\n\tSocket messages sent: 0\n\tSocket messages received: 0\n\tSignals delivered: 0\n\tPage size (bytes): 4096\n\tExit status: 0\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stdout:"Used memory after tmpfs mount and file fill: 830 MB\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{end:{exited:true  status:"exit status 0"}}
    sandbox_memory_integrity_test.go:70: Command [bash] completed successfully in sandbox i3thkmg7c8zzcvh7q1rcr
Executing command bash in sandbox i3thkmg7c8zzcvh7q1rcr (user: root)
    sandbox_memory_integrity_test.go:74: Command [bash] output: event:{start:{pid:1269}}
    sandbox_memory_integrity_test.go:74: Command [bash] output: event:{data:{stdout:"f285766c8fd9030e7bb1a355e29e044b0c0c0ee9cd1e1c95b6b6682492d9ff35\n"}}
    sandbox_memory_integrity_test.go:74: Command [bash] output: event:{end:{exited:true  status:"exit status 0"}}
    sandbox_memory_integrity_test.go:74: Command [bash] completed successfully in sandbox i3thkmg7c8zzcvh7q1rcr
Executing command bash in sandbox i3thkmg7c8zzcvh7q1rcr (user: root)
    sandbox_memory_integrity_test.go:99: Command [bash] output: event:{start:{pid:1272}}
    sandbox_memory_integrity_test.go:100: 
        	Error Trace:	.../tests/orchestrator/sandbox_memory_integrity_test.go:100
        	Error:      	Received unexpected error:
        	            	failed to execute command bash in sandbox i3thkmg7c8zzcvh7q1rcr: invalid_argument: protocol error: incomplete envelope: unexpected EOF
        	Test:       	TestSandboxMemoryIntegrity/tmpfs_hash
--- FAIL: TestSandboxMemoryIntegrity/tmpfs_hash (29.32s)

To view more test analytics, go to the Test Analytics Dashboard
📋 Got 3 mins? Take this short survey to help us improve Test Analytics.

Drop the StartEnvdProcess helper (avoiding the -l login-shell regression
in resume-build's runCommandInSandbox) and inline the envd Process.Start
construction in reclaim.go. Drop `timeout --foreground`: by default
`timeout` puts the wrapped sh in its own pgroup so SIGKILL reaches all
descendants — `--foreground` would target only the direct child.
@cla-bot cla-bot Bot added the cla-signed label May 6, 2026
@linear-code
Copy link
Copy Markdown

linear-code Bot commented May 6, 2026

@ValentaTomas ValentaTomas marked this pull request as ready for review May 6, 2026 20:34
Copy link
Copy Markdown

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Code review skipped — your organization has reached its monthly code review spending cap.

An organization admin can view or raise the cap at claude.ai/admin-settings/claude-code. The cap resets at the start of the next billing period.

Once the cap resets or is raised, push a new commit or reopen this pull request to trigger a review.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 40a1e24405

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread packages/orchestrator/cmd/resume-build/main.go Outdated
@qodo-code-review
Copy link
Copy Markdown

qodo-code-review Bot commented May 6, 2026

Code Review by Qodo

🐞 Bugs (0) 📘 Rule violations (0) 📎 Requirement gaps (0)

Grey Divider


Action required

1. -reclaim flag silently no-ops in production ✓ Resolved 🐞
Description
When LAUNCH_DARKLY_API_KEY is set, featureflags.NewClientWithLogLevel connects to the real
LaunchDarkly backend; the four featureflags.NewIntFlag(...) calls inside the if *reclaim block
only update the package-level launchDarklyOfflineStore (the test fixture), which the live client
never reads. The reclaim chain stays disabled despite the flag being passed.
Code

packages/orchestrator/cmd/resume-build/main.go[R80-85]

+	if *reclaim {
+		featureflags.NewIntFlag("reclaim-sync-timeout-ms", 500)
+		featureflags.NewIntFlag("reclaim-drop-caches-timeout-ms", 200)
+		featureflags.NewIntFlag("reclaim-compact-memory-timeout-ms", 1000)
+		featureflags.NewIntFlag("reclaim-fstrim-timeout-ms", 500)
+	}
Evidence
NewIntFlag writes exclusively to launchDarklyOfflineStore (flags.go:141-145). NewClientWithLogLevel
bypasses that store and creates a real LD client whenever launchDarklyApiKey != ""
(client.go:71-85). The resume-build binary creates its client at line 1019 before the reclaim block
runs, but even if order were reversed the offline store is never consulted by the live client.

packages/shared/pkg/featureflags/flags.go[141-145]
packages/shared/pkg/featureflags/client.go[71-85]
packages/orchestrator/cmd/resume-build/main.go[1019-1019]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The `-reclaim` flag calls `featureflags.NewIntFlag(...)` at runtime to set per-step timeout values, but `NewIntFlag` only mutates the package-level `launchDarklyOfflineStore` (the ldtestdata fixture). When the binary runs with `LAUNCH_DARKLY_API_KEY` set, the `featureflags.Client` is backed by the real LaunchDarkly SDK and never reads from that store, so the overrides are silently ignored and the reclaim chain never executes.

## Issue Context
The four flags (`reclaim-sync-timeout-ms`, `reclaim-drop-caches-timeout-ms`, `reclaim-compact-memory-timeout-ms`, `reclaim-fstrim-timeout-ms`) already have their global `IntFlag` variables declared in `flags.go` with default 0. The `resume-build` binary is a developer/operator CLI tool, not a long-running service, so a simple mechanism is preferred.

## Fix Focus Areas
- packages/orchestrator/cmd/resume-build/main.go[80-85] — replace the `NewIntFlag` calls with a mechanism that actually reaches the `featureflags.Client` used by the sandbox factory, e.g. expose a `NewClientWithDatasource` path that pre-seeds the offline store before the client is created, or accept the timeout values as plain integer flags and pass them directly to `buildReclaimScript` without going through LaunchDarkly at all.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


Grey Divider

Qodo Logo

Comment thread packages/orchestrator/cmd/resume-build/main.go
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Autofix Details

Bugbot Autofix prepared a fix for the issue found in the latest run.

  • ✅ Fixed: Reclaim requires Bash
    • Changed the outer shell from /bin/bash to /bin/sh to support minimal images like Alpine that only provide BusyBox with /bin/sh.

Create PR

Or push these changes by commenting:

@cursor push dc39246efe
Preview (dc39246efe)
diff --git a/packages/orchestrator/pkg/sandbox/reclaim.go b/packages/orchestrator/pkg/sandbox/reclaim.go
--- a/packages/orchestrator/pkg/sandbox/reclaim.go
+++ b/packages/orchestrator/pkg/sandbox/reclaim.go
@@ -83,7 +83,7 @@
 	pc := processconnect.NewProcessClient(&http.Client{Transport: sandboxHttpClient.Transport}, addr)
 
 	req := connect.NewRequest(&process.StartRequest{
-		Process: &process.ProcessConfig{Cmd: "/bin/bash", Args: []string{"-c", script}},
+		Process: &process.ProcessConfig{Cmd: "/bin/sh", Args: []string{"-c", script}},
 	})
 	req.Header().Set("Connect-Timeout-Ms", strconv.FormatInt(int64(timeout/time.Millisecond), 10))
 	if s.Config.Envd.AccessToken != nil {

You can send follow-ups to the cloud agent here.

Comment thread packages/orchestrator/pkg/sandbox/reclaim.go Outdated
Prevents gofmt from re-aligning HostStatsSamplingInterval's trailing
comment column when the longer ReclaimCompactMemoryTimeoutMs name was
inserted next to it, keeping the diff scoped to the new feature.
Comment thread packages/orchestrator/pkg/sandbox/reclaim.go Outdated
…-build

Caller passes the bash args (e.g. ["-l","-c",cmd] or ["-c",script]) so
login-shell is a per-call decision rather than baked into the helper.
Removes the duplicated envd Process.Start boilerplate from reclaim.go
and runCommandInSandbox.
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).

Autofix Details

Bugbot Autofix prepared a fix for the issue found in the latest run.

  • ✅ Fixed: Reclaim requires Bash
    • Changed /bin/bash to /bin/sh in StartEnvdBash to support minimal guest images that only have sh, since reclaim commands are already shell-compatible.

Create PR

Or push these changes by commenting:

@cursor push 91ed118737
Preview (91ed118737)
diff --git a/packages/orchestrator/pkg/sandbox/envd_process.go b/packages/orchestrator/pkg/sandbox/envd_process.go
--- a/packages/orchestrator/pkg/sandbox/envd_process.go
+++ b/packages/orchestrator/pkg/sandbox/envd_process.go
@@ -16,7 +16,7 @@
 )
 
 // StartEnvdBash opens a streaming Process.Start call against this
-// sandbox's envd, running `/bin/bash` with the given args as `user`.
+// sandbox's envd, running `/bin/sh` with the given args as `user`.
 // Caller chooses login-shell vs. plain (e.g. []{"-l","-c",cmd} vs.
 // []{"-c",script}). When timeout > 0 it sets `Connect-Timeout-Ms` so
 // envd kills the process at the deadline. Auth/user headers are wired
@@ -31,7 +31,7 @@
 	pc := processconnect.NewProcessClient(&http.Client{Transport: sandboxHttpClient.Transport}, addr)
 
 	req := connect.NewRequest(&process.StartRequest{
-		Process: &process.ProcessConfig{Cmd: "/bin/bash", Args: bashArgs},
+		Process: &process.ProcessConfig{Cmd: "/bin/sh", Args: bashArgs},
 	})
 	if timeout > 0 {
 		req.Header().Set("Connect-Timeout-Ms", strconv.FormatInt(timeout.Milliseconds(), 10))

You can send follow-ups to the cloud agent here.

Comment thread packages/orchestrator/pkg/sandbox/envd_process.go Outdated
- Add featureflags.DurationFlag (string-backed, parsed via time.ParseDuration)
  so LaunchDarkly values like "500ms" are self-describing instead of opaque ints.
- Convert reclaim per-step timeouts (sync, drop_caches, compact_memory, fstrim)
  from IntFlag to DurationFlag.
- Rename StartEnvdBash -> StartEnvdShell and switch to /bin/sh so reclaim and
  resume-build commands work on minimal guests without bash.
- Add NewOfflineClient and use it in resume-build so per-run flag overrides
  (e.g. from -reclaim) take effect even when LAUNCH_DARKLY_API_KEY is set.
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

Bugbot Autofix prepared a fix for the issue found in the latest run.

  • ✅ Fixed: Sub-millisecond durations format as zero disabling timeout
    • Added check to round up any positive duration below 1ms to 0.001 seconds before formatting, preventing GNU timeout from treating it as 0 (no timeout).

Create PR

Or push these changes by commenting:

@cursor push 080e581151
Preview (080e581151)
diff --git a/packages/orchestrator/pkg/sandbox/reclaim.go b/packages/orchestrator/pkg/sandbox/reclaim.go
--- a/packages/orchestrator/pkg/sandbox/reclaim.go
+++ b/packages/orchestrator/pkg/sandbox/reclaim.go
@@ -46,7 +46,12 @@
 		// `timeout` accepts fractional seconds (s/m/h/d), not `ms`. Output
 		// is dropped; non-zero status is captured into `rc` so the final
 		// exit code surfaces failures without short-circuiting later steps.
-		parts = append(parts, fmt.Sprintf("timeout -s KILL %.3f sh -c %q >/dev/null 2>&1 || rc=$?", d.Seconds(), st.cmd))
+		// Ensure sub-millisecond durations round up to 0.001 so timeout != 0.
+		sec := d.Seconds()
+		if sec < 0.001 && sec > 0 {
+			sec = 0.001
+		}
+		parts = append(parts, fmt.Sprintf("timeout -s KILL %.3f sh -c %q >/dev/null 2>&1 || rc=$?", sec, st.cmd))
 		sum += d
 	}
 	if len(parts) == 0 {

You can send follow-ups to the cloud agent here.

Reviewed by Cursor Bugbot for commit 801a37f. Configure here.

Comment thread packages/orchestrator/pkg/sandbox/reclaim.go Outdated
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 801a37f79e

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread packages/orchestrator/pkg/sandbox/envd_process.go Outdated
Replace per-step DurationFlags with a single JSON flag (reclaim-config)
that holds all per-step caps and is evaluated against
sandbox/team/template LD contexts, so reclaim can be enabled per cohort
in LaunchDarkly with one rule. Default is null = all disabled. Sub-ms
guard preserved.

Drop the now-unused DurationFlag type.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants